BKTreebank: Building a Vietnamese Dependency Treebank
A dependency treebank is an important resource for any language. In this paper, we present our work on building BKTreebank, a dependency treebank for Vietnamese. We discuss key points in the design of the POS tagset, the dependency relations, and the annotation guidelines. We describe experiments on POS tagging and dependency parsing on the treebank. Experimental results show that the treebank is a useful resource for Vietnamese language processing.
Keywords: treebank, dependency parsing, POS tagging, word segmentation, Vietnamese, less-resourced language
School of Information and Communication Technology,
Hanoi University of Science and Technology,
1 Dai Co Viet, Bach Khoa, Hai Ba Trung, Hanoi, Vietnam
1. Introduction
Dependency treebanks are important for data-driven dependency parsing. However, building a dependency treebank is complicated and expensive.
Dependency treebanks are available for English and several other languages. For Vietnamese, a dependency treebank named VnDT [Nguyen et al., 2014] was developed by automatic conversion from VietTreebank (VTB) [Nguyen et al., 2009, Nguyen et al., 2015]. State-of-the-art performance on VnDT is 73.5 LAS, which is insufficient for downstream applications [Nguyen et al., 2016a].
In this work, we present the building of a dependency treebank for Vietnamese. Our treebank was annotated manually, and its annotation guidelines differ substantially from those of VTB. Our contributions are twofold:
A manual dependency treebank for Vietnamese.
Experiments on POS tagging and dependency parsing based on the treebank.
The paper is organized as follows. Section 2 briefly introduces related work on building treebanks for Vietnamese and dependency treebanks for other languages. Section 3 highlights important points of the annotation guidelines. Section 4 briefly describes the annotation process. Section 5 is dedicated to evaluation and discussion of automatic POS tagging and dependency parsing results. Section 6 concludes the paper.
2. Related Work
2.1. Treebanks for Vietnamese
VTB was the pioneering treebank for Vietnamese. It was developed from 2006 to 2010. It contains manual annotations on about 40K sentences for word segmentation, 10K sentences for POS tagging, and 10K sentences for bracketing.
VnDT contains dependency annotations that were automatically converted from the bracketing annotations in VTB. State-of-the-art performance on VnDT is 80.7% UAS and 73.5% LAS [Nguyen et al., 2016a].
Recently, a new treebank for Vietnamese has been developed [Nguyen et al., 2016b, Nguyen et al., 2017]. It consists of 40K sentences annotated with word segmentation, POS tagging, and bracketing. While generally agreeing with VTB on word segmentation and bracketing, the authors propose a POS tagset and POS tagging guidelines that focus more on word-class transformation, particularly between verbs and other word classes. This issue is important because Vietnamese is an analytic language. Unfortunately, their treebank is not yet publicly available to the research community.
2.2. Dependency treebanks for other languages
One of the most notable dependency treebanks for English was developed by the Stanford NLP group [De Marneffe and Manning, 2008]. The Stanford treebank is automatically converted from Penn Treebank phrase structures [Marneffe et al., 2006]. The current Universal Dependencies scheme inherits from the Penn POS tagset and the Stanford typed dependency representation [Marneffe et al., 2014].
3. Annotation Guidelines
3.1. POS tagging guidelines
As Vietnamese is an analytic language, we omit the tags related to plurality, tense, and superlatives from the Penn tagset.
CL is used for noun classifiers. In Vietnamese, a countable noun can be accompanied by a classifier to indicate quantity or simply for emphasis. For example, ‘tấm’ is a classifier in “Anh ta giành được hai tấm huy chương vàng” (He won two gold medals); ‘chiếc’ is a classifier in “Chiếc xe này khá đắt” (This car is quite expensive). In [Nguyen et al., 2016b], the authors also dedicate two tags, Nc and Ncs, to noun classifiers. Similar phenomena can be found in other languages such as Korean [Kim and Yang, 2006].
PFN is used for prefix nominalizers. Many nominal expressions in Vietnamese are formed by a leading nominalizer and a verb or an adjective (see Table 1 for examples). In [Nguyen et al., 2016b], there are also POS tags for word-class transformation, including VA (Verb-Adjective), VN (Verb-Noun), and NA (Noun-Adjective), but it is not clear from the paper how these tags are designed.
NML is used for phrasal nominalizers. In Vietnamese, a special word such as ‘việc’ is used as a clausal adverbial marker for a clausal component. For instance, in “Việc xử lý chất thải công nghiệp cần được làm ngay” (The processing of industrial waste needs to be done immediately), ‘việc’ is the marker for the clausal subject.
VA is used for adjectival verbs. In Vietnamese, when the predicate is an adjective, there is no copula verb (to be). The adjective is hence tagged as an adjectival verb. In the sentence “Tình hình tương đối khả quan” (The situation is quite positive), ‘khả quan’ is the predicate and is tagged as VA.
AV stands for verbal adjective. When a verb modifies a noun, it is tagged as a verbal adjective (e.g. biển/NN quảng_cáo/AV (advertising board)).
TO is used to tag ‘để’, which has a meaning similar to ‘to’ in English.
|niềm||vui||niềm vui (happiness)|
|sự||hi sinh||sự hi sinh (sacrifice)|
|niềm||tin||niềm tin (belief)|
3.2. Dependency parsing guidelines
case:pfn is used for the nominalizing modifier relation between a headword that is a nominalizer and a verb or an adjective (see examples in Table 1).
mark:relcl is used for the phrasal adverbial modifier relation between a headword that is the predicate of the clause and a marker such as ‘việc’.
In addition, we highlight the guidelines for dependencies specific to Vietnamese:
aux is also used for the relationship between a verb and a tense auxiliary (e.g. thực hiện/VB - aux - đang/MD in “đang thực hiện” (be executing)).
det is also used for the relationship between a noun and its plural marker. Here, we tag a plural marker as a determiner (e.g. trường hợp/NN - det - những/DT in “những trường hợp” (cases)).
|nsubjpass||Passive nominal subject|
|csubjpass||Passive clausal subject|
|xcomp||Open clausal complement|
|advcl||Adverbial clause modifier|
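The two Vietnamese-specific examples above (aux and det) can be written down as labeled head-dependent arcs. The following is only an illustrative sketch: the `Arc` type and the `dependents_of` helper are hypothetical names introduced here, and the underscore-joined word forms follow the segmentation convention described in Section 4 rather than any released file format.

```python
from typing import List, NamedTuple, Tuple


class Arc(NamedTuple):
    """One labeled dependency arc (hypothetical container, for illustration)."""
    head: str       # head token, written as word/POS
    relation: str   # dependency label
    dependent: str  # dependent token, written as word/POS


# The two examples given in the guidelines above.
arcs = [
    Arc(head="thực_hiện/VB", relation="aux", dependent="đang/MD"),
    Arc(head="trường_hợp/NN", relation="det", dependent="những/DT"),
]


def dependents_of(head: str, arcs: List[Arc]) -> List[Tuple[str, str]]:
    """Return (relation, dependent) pairs attached to the given head."""
    return [(a.relation, a.dependent) for a in arcs if a.head == head]


print(dependents_of("trường_hợp/NN", arcs))
```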
4. Annotation Process
We chose newswire articles from Dantri (http://dantri.vn), a general-domain online news agency, as the unannotated corpus.
Texts are first segmented by UETSegmenter [Nguyen and Le, 2016]. Sentences longer than 50 words are removed. Four annotators produce manual POS tagging and dependency parsing using the annotation tool BRAT [Stenetorp et al., 2012].
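The length filter described above can be sketched as follows. This is a minimal illustration, assuming that the segmenter output is a whitespace-separated string in which the syllables of a multi-syllable word are joined by underscores; whether punctuation tokens count toward the 50-word limit is not stated in the text, so here every whitespace-separated token is counted. The function names are hypothetical.

```python
MAX_WORDS = 50  # threshold stated in the text


def keep_sentence(segmented: str, max_words: int = MAX_WORDS) -> bool:
    """Return True if the segmented sentence has at most max_words tokens.

    Assumption: tokens are separated by whitespace, so an underscore-joined
    word such as "quảng_cáo" counts as a single word.
    """
    return len(segmented.split()) <= max_words


def filter_sentences(sentences):
    """Drop sentences longer than the threshold, preserving order."""
    return [s for s in sentences if keep_sentence(s)]
```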
After removing sentences with invalid parses, our treebank contains 6909 sentences manually annotated with POS tags and dependencies, at an average speed of 7 min/sentence. In the next phase, we will use the dataset to train a parser in order to apply a bootstrapping strategy. To maintain annotation consistency in the bootstrapping method, a first annotator will correct the outputs of the parser, and a second annotator will review the outputs of the first annotator. They will discuss confusing cases and make final decisions with a third annotator.
Figure 1 illustrates an annotation example using BRAT. Segmented texts are loaded into BRAT. Syllables of the same word are connected by ‘_’. A POS tag is assigned to each token. Dependencies are annotated by creating directed relations between tagged words.
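To make the token conventions concrete, the sketch below reads the inline word/TAG notation used in this paper (e.g. biển/NN quảng_cáo/AV) and recovers the syllables of each word from the ‘_’ separator. It is an illustration only: this inline notation is the one used in the running text, not BRAT's own storage format, and `parse_tagged` is a hypothetical helper.

```python
def parse_tagged(sentence: str):
    """Parse a string of word/TAG items into a list of token dictionaries.

    Assumptions: items are whitespace-separated, the tag follows the last
    "/" of each item, and syllables of one word are joined by "_".
    """
    tokens = []
    for item in sentence.split():
        word, sep, tag = item.rpartition("/")
        if not sep or not word or not tag:
            raise ValueError(f"not a word/TAG item: {item!r}")
        tokens.append({"word": word, "syllables": word.split("_"), "tag": tag})
    return tokens


for token in parse_tagged("biển/NN quảng_cáo/AV"):
    print(token["word"], token["syllables"], token["tag"])
```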
5. Annotation Evaluations
5.1. Inter-annotator agreement
Our annotation process started with a training phase. We decided to annotate POS tags and dependencies in parallel because the two tasks are complementary. After the annotation guidelines had been explained to them, the annotators were first asked to annotate the same small dataset separately. After finishing, they were asked to discuss the differences and to make final decisions under the supervision of a fourth person. If the annotators could not agree, they discussed the case with the supervisor and made the final decision together.
One of the main tasks was first to develop an automatic parser to support bootstrapped annotation in the next phase. In the first round, each annotator was asked to annotate separate documents. They discussed confusing cases with the fourth (most expert) annotator and with the other annotators. In the second round, the fourth annotator checked all the annotations and corrected any errors. Every week, the annotators jointly reviewed and discussed randomly selected annotated documents.
At the end of this phase, the three annotators were again asked to annotate the same small dataset separately in order to measure inter-annotator agreement (IAA). The averaged kappa is 94.5, 85.2, and 80.4 for POS tagging, unlabeled dependency parsing, and labeled dependency parsing, respectively. Note that this is the agreement between annotators without revision by the fourth (most expert) annotator.
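For readers unfamiliar with the measure, an averaged kappa over three annotators can be computed as the mean of pairwise Cohen's kappa values. The sketch below is written under that assumption; the text does not say which kappa variant or averaging scheme was used. For POS tagging the per-token label is the tag; for dependencies one could plausibly use the head index (unlabeled) or the (head, relation) pair (labeled), which is likewise an assumption.

```python
from collections import Counter
from itertools import combinations


def cohen_kappa(a, b):
    """Cohen's kappa between two equally long label sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    n = len(a)
    # Observed agreement: fraction of positions with identical labels.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each annotator's label distribution.
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


def average_pairwise_kappa(annotations):
    """Mean Cohen's kappa over all annotator pairs (assumed averaging scheme)."""
    pairs = list(combinations(annotations, 2))
    if not pairs:
        raise ValueError("need at least two annotators")
    return sum(cohen_kappa(a, b) for a, b in pairs) / len(pairs)
```

A value reported as 94.5 would correspond to 0.945 on the scale returned by these functions.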
5.2. Initial experiments on POS tagging and dependency parsing
We also used the dataset to train POS taggers and dependency parsers on a training set of 5639 sentences. Their performance on a test set of 1270 sentences is reported in Table 5 and Table 6. In all experiments, we used the default parameter values provided by the tools.
For a vanilla POS tagging model, we used the CRFSuite (http://www.chokkan.org/software/crfsuite/) implementation of first-order Conditional Random Fields with a straightforward feature set, as described in Table 4. Our lexicon was built by merging the lexicon of VietTreebank [Nguyen et al., 2006] with frequent tags in our corpus, under careful revision that took into account important differences in the tagging guidelines. Only (word, tag) pairs that occurred more than three times in the corpus were considered, and they were reviewed before being added to the lexicon. In the next phase, we are going to enrich the lexicon as more annotations become available.
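The frequency filter on (word, tag) pairs can be sketched as below. This is a hedged illustration: it only produces candidate entries that occur more than three times, and it does not model the manual review or the merge with the VietTreebank lexicon. The input format (a list of sentences, each a list of (word, tag) tuples) and the function name are assumptions.

```python
from collections import Counter, defaultdict

MIN_COUNT = 4  # "more than three times" in the text


def candidate_lexicon(tagged_sentences, min_count=MIN_COUNT):
    """Collect the tags of (word, tag) pairs seen at least min_count times.

    tagged_sentences: iterable of sentences, each a list of (word, tag) tuples.
    Returns a dict mapping each word to the set of its frequent tags.
    """
    counts = Counter(pair for sentence in tagged_sentences for pair in sentence)
    lexicon = defaultdict(set)
    for (word, tag), count in counts.items():
        if count >= min_count:
            lexicon[word].add(tag)
    return dict(lexicon)
```

Candidates returned by such a function would still need the manual review described above before being added to the lexicon.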
|w[-2], w[-1], w[0], w[1], w[2]|
|Overall accuracy: 90.7|
For dependency parsing, we used the transition-based MaltParser [Nivre et al., 2007] with the default algorithm and feature set. (We also tried MaltOptimizer, but the improvement was not statistically significant, so we do not report the results here.)
As shown in Table 5, POS tagging performance on nouns is close to the average. Verbs are more difficult to tag because they are ambiguous not only with nouns and adjectives but also with verbal adjectives (modifiers). Automatic tagging of verbal adjective modifiers is very challenging because such modifiers are not inflectional, and in some cases it requires syntactic-level knowledge. We observed that they are usually mistagged as predicate verbs. Adjectival verbs are also difficult because of the zero-copula phenomenon.
Dependency parsing performance is promising, as shown in Table 6. Accuracy at the phrase level is good, with the exception of nominal modifiers, perhaps because of the confusing usage of directional and temporal adverbial nouns and prepositions in Vietnamese. On the other hand, parsing at the clause level is poor. There is plenty of room for improvement on such long-distance dependencies.
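As a reminder of how the attachment scores used in this paper are defined, the sketch below computes UAS (correct head) and LAS (correct head and label) as token-level percentages. It is a minimal illustration, not the evaluation script used for Table 6; in particular, whether punctuation tokens are excluded is not stated, so every token is scored here. The function name and the input format are assumptions.

```python
def attachment_scores(gold, predicted):
    """Return (UAS, LAS) in percent for one or more concatenated sentences.

    gold, predicted: equally long lists of (head_index, label) tuples,
    one tuple per token, with head_index 0 denoting the root.
    """
    if len(gold) != len(predicted) or not gold:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(gold)
    # UAS counts tokens whose predicted head is correct.
    uas_hits = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    # LAS additionally requires the dependency label to be correct.
    las_hits = sum(g == p for g, p in zip(gold, predicted))
    return 100.0 * uas_hits / n, 100.0 * las_hits / n


gold = [(2, "det"), (0, "root"), (2, "advcl")]
pred = [(2, "det"), (0, "root"), (2, "xcomp")]
uas, las = attachment_scores(gold, pred)
print(f"UAS={uas:.1f} LAS={las:.1f}")
```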
6. Conclusion
In this paper, we presented the building of a dependency treebank for Vietnamese. Our work builds on previous work on treebanks for Vietnamese and on dependency treebanks for other languages. Although the current size of the corpus is limited, the initial experimental results on POS tagging and dependency parsing are promising.
In the future, we are going to expand BKTreebank with a bootstrapping approach using automatic parsers trained on the dataset. We are going to investigate several approaches to POS tagging and dependency parsing for Vietnamese, including joint learning. We plan to publish the treebank for research purposes in the near future.
7. Acknowledgements
This project has been partially funded by VCCorp via a collaboration with the Data Science Laboratory, School of Information and Communication Technology, Hanoi University of Science and Technology. We would like to thank Vu Xuan Luong for enthusiastic discussions on VietTreebank.
8. Bibliographical References
- Berovic et al., 2012 Berovic, D., Agic, Z., and Tadic, M. (2012). Croatian dependency treebank: Recent development and initial experiments. In Proceedings of the Eight International Conference on Language Resources and Evaluation (LREC’12), Istanbul, Turkey, may. European Language Resources Association (ELRA).
- Candito et al., 2010 Candito, M., Crabbe, B., and Denis, P. (2010). Statistical french dependency parsing: Treebank conversion and first results. In Nicoletta Calzolari (Conference Chair), et al., editors, Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC’10), Valletta, Malta, may. European Language Resources Association (ELRA).
- Choi et al., 2012 Choi, D., Park, J., and Choi, K.-S. (2012). Korean treebank transformation for parser training. In Proceedings of the ACL 2012 Joint Workshop on Statistical Parsing and Semantic Processing of Morphologically Rich Languages, pages 78–88, Jeju, Republic of Korea, July 12. Association for Computational Linguistics.
- De Marneffe and Manning, 2008 De Marneffe, M.-C. and Manning, C. D. (2008). The stanford typed dependencies representation. In Coling 2008: Proceedings of the Workshop on Cross-Framework and Cross-Domain Parser Evaluation, CrossParser ’08, pages 1–8, Stroudsburg, PA, USA. Association for Computational Linguistics.
- Kim and Yang, 2006 Kim, J.-B. and Yang, J., (2006). Processing Korean Numeral Classifier Constructions in a Typed Feature Structure Grammar, pages 103–110. Springer Berlin Heidelberg, Berlin, Heidelberg.
- Marneffe et al., 2006 Marneffe, M., Maccartney, B., and Manning, C. (2006). Generating typed dependency parses from phrase structure parses. In Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC-2006), Genoa, Italy, May. European Language Resources Association (ELRA). ACL Anthology Identifier: L06-1260.
- Marneffe et al., 2014 Marneffe, M.-C. D., Dozat, T., Silveira, N., Haverinen, K., Ginter, F., Nivre, J., and Manning, C. D. (2014). Universal stanford dependencies: a cross-linguistic typology. In Nicoletta Calzolari (Conference Chair), et al., editors, Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14), Reykjavik, Iceland, may. European Language Resources Association (ELRA).
- Nguyen and Le, 2016 Nguyen, T. P. and Le, A. C. (2016). A hybrid approach to vietnamese word segmentation. In 2016 IEEE RIVF International Conference on Computing Communication Technologies, Research, Innovation, and Vision for the Future (RIVF), pages 114–119, Nov.
- Nguyen et al., 2006 Nguyen, T. M. H., Romary, L., Rossignol, M., and Vu, X. L. (2006). A lexicon for vietnamese language processing. Language Resources and Evaluation, 40(3/4):291–309.
- Nguyen et al., 2009 Nguyen, P. T., Vu, X. L., Nguyen, T. M. H., Nguyen, V. H., and Le, H. P. (2009). Building a large syntactically-annotated corpus of vietnamese. In Proceedings of the Third Linguistic Annotation Workshop, pages 182–185, Suntec, Singapore, August. Association for Computational Linguistics.
- Nguyen et al., 2014 Nguyen, D. Q., Nguyen, D. Q., Pham, S. B., Nguyen, P.-T., and Le Nguyen, M., (2014). From Treebank Conversion to Automatic Dependency Parsing for Vietnamese, pages 196–207. Springer International Publishing, Cham.
- Nguyen et al., 2015 Nguyen, P.-T., Le, A.-C., Ho, T.-B., and Nguyen, V.-H. (2015). Vietnamese treebank construction and entropy-based error detection. Lang. Resour. Eval., 49(3):487–519, September.
- Nguyen et al., 2016a Nguyen, D. Q., Dras, M., and Johnson, M. (2016a). An empirical study for vietnamese dependency parsing. In Proceedings of the Australasian Language Technology Association Workshop 2016, pages 143–149, Melbourne, Australia, December.
- Nguyen et al., 2016b Nguyen, Q., Miyao, Y., Le, H., and Nguyen, N. (2016b). Challenges and solutions for consistent annotation of vietnamese treebank. In Nicoletta Calzolari (Conference Chair), et al., editors, Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), Paris, France, may. European Language Resources Association (ELRA).
- Nguyen et al., 2017 Nguyen, Q. T., Miyao, Y., Le, H. T. T., and Nguyen, N. T. H. (2017). Ensuring annotation consistency and accuracy for vietnamese treebank. Language Resources and Evaluation, Jul.
- Nivre et al., 2007 Nivre, J., Hall, J., Nilsson, J., Chanev, A., Eryigit, G., Kubler, S., Marinov, S., and Marsi, E. (2007). Maltparser: A language-independent system for data-driven dependency parsing. Natural Language Engineering, 13(2):95–135.
- Santorini, 1990 Santorini, B. (1990). Part-Of-Speech tagging guidelines for the Penn Treebank project (3rd revision, 2nd printing). Technical report, Department of Linguistics, University of Pennsylvania, Philadelphia, PA, USA.
- Solberg et al., 2014 Solberg, P. E., Skjarholt, A., Ovrelid, L., Hagen, K., and Johannessen, J. B. (2014). The norwegian dependency treebank. In Nicoletta Calzolari (Conference Chair), et al., editors, Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14), Reykjavik, Iceland, may. European Language Resources Association (ELRA).
- Stenetorp et al., 2012 Stenetorp, P., Pyysalo, S., Topić, G., Ohta, T., Ananiadou, S., and Tsujii, J. (2012). brat: a web-based tool for nlp-assisted text annotation. In Proceedings of the Demonstrations at the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 102–107, Avignon, France, April. Association for Computational Linguistics.