VnCoreNLP: A Vietnamese Natural Language Processing Toolkit

Thanh Vu, Dat Quoc Nguyen, Dai Quoc Nguyen, Mark Dras and Mark Johnson
Newcastle University, United Kingdom
thanh.vu@newcastle.ac.uk
The University of Melbourne, Australia
dqnguyen@unimelb.edu.au
Deakin University, Australia
dai.nguyen@deakin.edu.au
Macquarie University, Australia
{mark.dras, mark.johnson}@mq.edu.au
Abstract

We present an easy-to-use and fast toolkit, namely VnCoreNLP—a Java NLP annotation pipeline for Vietnamese. Our VnCoreNLP supports key natural language processing (NLP) tasks including word segmentation, part-of-speech (POS) tagging, named entity recognition (NER) and dependency parsing, and obtains state-of-the-art (SOTA) results for these tasks. We release VnCoreNLP to provide rich linguistic annotations to facilitate research work on Vietnamese NLP. Our VnCoreNLP is open-source under GPL v3, and available at: https://github.com/vncorenlp/VnCoreNLP.

1 Introduction

Research on Vietnamese NLP has been actively explored in the last decade, boosted by the successes of the 4-year KC01.01/2006-2010 national project on Vietnamese language and speech processing (VLSP). Over the last 5 years, benchmark datasets for key Vietnamese NLP tasks have become publicly available: datasets for word segmentation and POS tagging were released for the first VLSP evaluation campaign in 2013, a high-quality dependency treebank was published in 2014 (Nguyen et al., 2014), and a NER dataset was published for the VLSP 2016 evaluation campaign. There is therefore a need for an NLP pipeline covering these key tasks, both to assist users and to support researchers and tool developers working on downstream tasks.

Nguyen et al. (2010) and Le et al. (2013) built Vietnamese NLP pipelines by wrapping existing word segmenters and POS taggers, including JVnSegmenter (Nguyen et al., 2006), vnTokenizer (Le et al., 2008), JVnTagger (Nguyen et al., 2010) and vnTagger (Le-Hong et al., 2010). However, these word segmenters and POS taggers are no longer considered SOTA models for Vietnamese (Nguyen and Le, 2016; Nguyen et al., 2016b). Pham et al. (2017) built the NNVLP toolkit for Vietnamese sequence labeling tasks by applying a BiLSTM-CNN-CRF model (Ma and Hovy, 2016). However, Pham et al. (2017) did not compare against SOTA traditional feature-based models. In addition, NNVLP is slow, with a processing speed of about 300 words per second, which is not practical for real-world applications such as processing large-scale data.

Figure 1: In the pipeline architecture of VnCoreNLP, annotations are performed on an Annotation object.

In this paper, we present a Java NLP toolkit for Vietnamese, namely VnCoreNLP, which aims to facilitate Vietnamese NLP research by providing rich linguistic annotations through key NLP components of word segmentation, POS tagging, NER and dependency parsing. Figure 1 describes the overall system architecture. The following items highlight typical characteristics of VnCoreNLP:

  • Easy-to-use – All VnCoreNLP components are wrapped into a single .jar file, so users do not have to install external dependencies. Users can run processing pipelines from either the command-line or the Java API.

  • Fast – VnCoreNLP is fast, so it can be used for dealing with large-scale data. Also it benefits users suffering from limited computation resources (e.g. users from Vietnam).

  • Accurate – VnCoreNLP components obtain higher scores than all previously published results on the same benchmark datasets.

2 Basic usages

Our design goal is to make VnCoreNLP simple to set up and run from either the command line or the Java API. Linguistic annotation of a given file can be performed with a simple command, as in Figure 2.

$ java -Xmx2g -jar VnCoreNLP.jar -fin input.txt -fout output.txt

Figure 2: Minimal command to run VnCoreNLP.

Suppose that the file input.txt in Figure 2 contains the sentence “Ông Nguyễn Khắc Chúc đang làm việc tại Đại học Quốc gia Hà Nội.” (word-by-word gloss: Mr/Ông Nguyen Khac Chuc is/đang working/làm_việc at/tại Vietnam National/quốc_gia University/đại_học Hanoi/Hà_Nội). Table 1 shows the output for this sentence in plain text form.

1 Ông Nc O 4 sub
2 Nguyễn_Khắc_Chúc Np B-PER 1 nmod
3 đang R O 4 adv
4 làm_việc V O 0 root
5 tại E O 4 loc
6 Đại_học N B-ORG 5 pob
7 Quốc_gia N I-ORG 6 nmod
8 Hà_Nội Np I-ORG 6 nmod
9 . CH O 4 punct
Table 1: The output in file output.txt for the sentence “Ông Nguyễn Khắc Chúc đang làm việc tại Đại học Quốc gia Hà Nội.” from file input.txt in Figure 2. The output is in a 6-column format representing word index, word form, POS tag, NER label, head index of the current word, and dependency relation type.
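
The 6-column format described in the caption of Table 1 is straightforward to read back into a program. The following is a minimal sketch, assuming the columns are separated by whitespace and that word forms contain no spaces (syllables are joined by "_"). The class OutputLineParser, its Token type and all field names are hypothetical names chosen for illustration; they are not part of the VnCoreNLP API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative reader for the 6-column output format (not part of VnCoreNLP). */
final class OutputLineParser {

    /** One annotated word; the field names are assumptions made for this sketch. */
    static final class Token {
        final int index;        // 1-based position of the word in its sentence
        final String form;      // word form, syllables joined by "_"
        final String posTag;    // POS tag
        final String nerLabel;  // NER label such as O, B-PER or I-ORG
        final int head;         // index of the head word, 0 for the root
        final String depLabel;  // dependency relation type

        Token(int index, String form, String posTag,
              String nerLabel, int head, String depLabel) {
            this.index = index;
            this.form = form;
            this.posTag = posTag;
            this.nerLabel = nerLabel;
            this.head = head;
            this.depLabel = depLabel;
        }
    }

    /** Splits one output line on whitespace and requires exactly six columns. */
    static Token parseLine(String line) {
        String[] cols = line.trim().split("\\s+");
        if (cols.length != 6) {
            throw new IllegalArgumentException("Expected 6 columns: " + line);
        }
        return new Token(Integer.parseInt(cols[0]), cols[1], cols[2],
                         cols[3], Integer.parseInt(cols[4]), cols[5]);
    }

    /** Parses the lines of one sentence, skipping blank lines. */
    static List<Token> parseSentence(List<String> lines) {
        List<Token> tokens = new ArrayList<>();
        for (String line : lines) {
            if (!line.trim().isEmpty()) {
                tokens.add(parseLine(line));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // The first three rows of Table 1.
        List<Token> tokens = parseSentence(Arrays.asList(
            "1 Ông Nc O 4 sub",
            "2 Nguyễn_Khắc_Chúc Np B-PER 1 nmod",
            "3 đang R O 4 adv"));
        for (Token t : tokens) {
            System.out.println(t.form + " -> head " + t.head + " (" + t.depLabel + ")");
        }
    }
}
```

Keeping the head as an integer index rather than a reference to another Token keeps the sketch close to the file format; a caller can resolve heads with tokens.get(head - 1) when head is not 0.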

Similarly, the same output can be obtained just as easily through the Java API, as shown in Listing 1.

VnCoreNLP pipeline = new VnCoreNLP();
Annotation annotation = new Annotation("Ông Nguyễn Khắc Chúc đang làm việc tại Đại học Quốc gia Hà Nội.");
pipeline.annotate(annotation);
String annotatedStr = annotation.toString();
Listing 1: Minimal code for an analysis pipeline.

In addition, Listing 2 provides a more realistic and complete code example, presenting key components of the toolkit. Here an annotation pipeline can be used for any text rather than just a single sentence, e.g. for a paragraph or an entire news story.

import vn.pipeline.*;
import java.io.*;
public class VnCoreNLPExample {
 public static void main(String[] args) throws IOException {
  // "wseg", "pos", "ner", and "parse" refer to word segmentation, POS tagging, NER and dependency parsing, respectively.
  String[] annotators = {"wseg", "pos", "ner", "parse"};
  VnCoreNLP pipeline = new VnCoreNLP(annotators);
  // Mr Nguyen Khac Chuc is working at Vietnam National University, Hanoi. Mrs Lan, Mr Chuc’s wife, is also working at this university.
  String str = "Ông Nguyễn Khắc Chúc đang làm việc tại Đại học Quốc gia Hà Nội. Bà Lan, vợ ông Chúc, cũng làm việc tại đây.";
  Annotation annotation = new Annotation(str);
  pipeline.annotate(annotation);
  PrintStream outputPrinter = new PrintStream("output.txt");
  pipeline.printToFile(annotation, outputPrinter);
  // Users can get a single sentence to analyze individually
  Sentence firstSentence = annotation.getSentences().get(0);
 }
}
Listing 2: A simple and complete code example.

3 Components

This section briefly describes each component of VnCoreNLP. Note that our goal is not to develop a new approach or model for each component task. Instead, we focus on incorporating existing models into a single pipeline. In particular, except for a new model we develop for the language-dependent component of word segmentation, we apply to Vietnamese traditional feature-based models which obtain SOTA results for English POS tagging, NER and dependency parsing. This choice is based on a well-established belief in the literature that for a less-resourced language such as Vietnamese, traditional feature-based models are faster and more accurate than neural network-based models.

  • wseg – Unlike English where white space is a strong indicator of word boundaries, when written in Vietnamese white space is also used to separate syllables that constitute words. So word segmentation is referred to as the key first step in Vietnamese NLP. We have proposed a novel transformation rule-based learning model for Vietnamese word segmentation, which obtains the highest segmentation accuracy and speed to date. See details in Nguyen et al. (2018).

  • pos – To label words with their POS tag, we apply MarMoT, which is a generic CRF framework and a SOTA POS and morphological tagger (Mueller et al., 2013). (See http://cistern.cis.lmu.de/marmot/)

  • ner – To recognize named entities, we apply a dynamic feature induction model that automatically optimizes feature combinations (Choi, 2016). (See https://emorynlp.github.io/nlp4j/components/named-entity-recognition.html)

  • parse – To perform dependency parsing, we apply the greedy version of a transition-based model with selectional branching (Choi et al., 2015). (See https://emorynlp.github.io/nlp4j/components/dependency-parsing.html)

4 Evaluation

We detail experimental results of the word segmentation (wseg) and POS tagging (pos) components of VnCoreNLP in Nguyen et al. (2018) and Nguyen et al. (2017b), respectively. In particular, our word segmentation component obtains the highest results to date in terms of both segmentation F1 score, at 97.90%, and speed, at 62k words per second. (All speeds reported in this paper are computed on a personal computer with an Intel Core i7 2.2 GHz CPU.) Our POS tagging component also obtains the highest accuracy to date, at 95.88%, with a fast tagging speed of 25k words per second, and outperforms BiLSTM-CRF-based models. The following subsections present evaluations for the NER (ner) and dependency parsing (parse) components.

4.1 Named entity recognition

In this section, we compare SOTA feature-based and neural network-based models, which, to the best of our knowledge, has not been done in any prior work on Vietnamese NER.

Dataset:

The NER shared task at the 2016 VLSP workshop provides a set of 16,861 manually annotated sentences for training and development, and a set of 2,831 manually annotated sentences for test, with four NER labels: PER, LOC, ORG and MISC. In both datasets, words are also supplied with gold POS tags. In addition, each word representing a full personal name is separated into the syllables that constitute the word. This scheme results in an unrealistic scenario: (i) gold POS tags are not available in a real-world application, and (ii) in the standard representation in Vietnamese word segmentation (Nguyen et al., 2009), a word segmenter outputs a full name as a single word. So for a real-world scenario, we merge the contiguous syllables constituting a full name to form a word, and then we replace the gold POS tags with automatic tags predicted by our POS tagging component. (Based on the gold label PER, contiguous syllables such as “Nguyễn/B-PER”, “Khắc/I-PER” and “Chúc/I-PER” are merged to form the word “Nguyễn_Khắc_Chúc/B-PER”.) From the set of 16,861 sentences, we sample 2,000 sentences for development and use the remaining 14,861 sentences for training. (Note that on the original VLSP 2016 NER data, using the same experimental setup as in Pham et al. (2017), our NER component obtains an F1 score of 93.2%, which is higher than NNVLP’s and all other previously published results.)
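
The merging step for personal names can be sketched as follows. This is an illustration under stated assumptions rather than the preprocessing code used for the experiments: tokens are assumed to be given as "form/label" strings in the notation of the example above, and the class and method names (PersonNameMerger, merge) are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative merge of PER syllables into full-name words (not the authors' code). */
final class PersonNameMerger {

    /**
     * Joins each B-PER syllable with the I-PER syllables that directly follow it,
     * using "_" as the separator, and labels the merged word B-PER.
     * All other tokens are copied unchanged.
     */
    static List<String> merge(List<String> tagged) {
        List<String> out = new ArrayList<>();
        StringBuilder name = null; // the PER word being built, or null if none is open
        for (String item : tagged) {
            int slash = item.lastIndexOf('/');
            if (slash < 0) {
                throw new IllegalArgumentException("Expected form/label: " + item);
            }
            String form = item.substring(0, slash);
            String label = item.substring(slash + 1);
            if (label.equals("I-PER") && name != null) {
                name.append('_').append(form); // extend the open name
                continue;
            }
            if (name != null) {                // any other label closes the open name
                out.add(name + "/B-PER");
                name = null;
            }
            if (label.equals("B-PER")) {
                name = new StringBuilder(form); // open a new name
            } else {
                out.add(item);
            }
        }
        if (name != null) {
            out.add(name + "/B-PER");          // name at the end of the sentence
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> merged = merge(Arrays.asList(
            "Ông/O", "Nguyễn/B-PER", "Khắc/I-PER", "Chúc/I-PER", "đang/O"));
        System.out.println(merged);
    }
}
```

An I-PER label with no preceding B-PER is left untouched here; how such inconsistent labels should be handled is not specified in the text, so this is a design choice of the sketch.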

Models:

We make an empirical comparison between VnCoreNLP’s NER component and the following neural network-based models:

  • BiLSTM-CRF (Huang et al., 2015) is a sequence labeling model which extends the BiLSTM model with a CRF layer.

  • BiLSTM-CRF + CNN-char, i.e. BiLSTM-CNN-CRF, is an extension of BiLSTM-CRF, using CNN to derive character-based representations (Ma and Hovy, 2016).

  • BiLSTM-CRF + LSTM-char is an extension of BiLSTM-CRF, using BiLSTM to derive the character-based representations (Lample et al., 2016).

  • BiLSTM-CRF+POS is another extension to BiLSTM-CRF, incorporating embeddings of automatically predicted POS tags (Reimers and Gurevych, 2017).

For all BiLSTM-CRF-based models, we use the performance-optimized implementation from Reimers and Gurevych (2017), available at https://github.com/UKPLab/emnlp2017-bilstm-cnn-crf. We then follow Nguyen et al. (2017b, Section 3.4) to perform hyper-parameter tuning, employing pre-trained word vectors from Vu (2016).

Model                F1      Speed
VnCoreNLP            88.14   19k
BiLSTM-CRF           86.48   2.8k
  + CNN-char         88.28   1.8k
  + LSTM-char        87.71   1.3k
BiLSTM-CRF+POS       86.12   _
  + CNN-char         88.06   _
  + LSTM-char        87.43   _
Table 2: F1 scores (in %) on the test set w.r.t. gold word-segmentation. “Speed” denotes the processing speed in words per second (for VnCoreNLP, the time taken by automatic POS tagging is also included).

Main results:

Table 2 presents the F1 score and speed of each model on the test set, where VnCoreNLP obtains the second highest score at 88.14% with a fast speed of 19k words per second. In particular, VnCoreNLP is only about 0.1% absolute lower than the most accurate model, BiLSTM-CRF + CNN-char, but it is more than 10 times faster than BiLSTM-CRF + CNN-char.
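
For readers unfamiliar with how an F1 score is obtained for NER, the sketch below computes an entity-level F1 from BIO label sequences. It assumes exact-match scoring of entity spans, which is the common convention for this kind of shared task; the text above does not spell out the scoring procedure, so this is an assumption and not necessarily the evaluation script used for Table 2. The class and method names are hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

/** Illustrative exact-match entity F1 over BIO labels (an assumed scoring scheme). */
final class EntityF1 {

    /** Collects entity spans as "start:end:type" strings, with end exclusive. */
    static Set<String> spans(String[] labels) {
        Set<String> out = new HashSet<>();
        int start = -1;
        String type = null; // type of the currently open entity, or null
        for (int i = 0; i <= labels.length; i++) {
            // A virtual "O" after the last token closes any open entity.
            String label = i < labels.length ? labels[i] : "O";
            boolean continues = type != null && label.equals("I-" + type);
            if (!continues) {
                if (type != null) {
                    out.add(start + ":" + i + ":" + type);
                    type = null;
                }
                if (label.startsWith("B-")) {
                    start = i;
                    type = label.substring(2);
                }
                // Simplification: an I- label that continues nothing is ignored.
            }
        }
        return out;
    }

    /** Returns F1 in percent; an entity counts only if span and type both match. */
    static double f1(String[] gold, String[] predicted) {
        Set<String> goldSpans = spans(gold);
        Set<String> predSpans = spans(predicted);
        Set<String> correct = new HashSet<>(goldSpans);
        correct.retainAll(predSpans);
        if (correct.isEmpty()) {
            return 0.0;
        }
        double precision = (double) correct.size() / predSpans.size();
        double recall = (double) correct.size() / goldSpans.size();
        return 100.0 * 2 * precision * recall / (precision + recall);
    }

    public static void main(String[] args) {
        // Gold labels of the sentence in Table 1 and a made-up prediction
        // that splits the ORG entity into an ORG and a LOC.
        String[] gold = {"O", "B-PER", "O", "O", "O", "B-ORG", "I-ORG", "I-ORG", "O"};
        String[] pred = {"O", "B-PER", "O", "O", "O", "B-ORG", "I-ORG", "B-LOC", "O"};
        System.out.println(f1(gold, pred));
    }
}
```

In the made-up prediction, one of three predicted entities matches one of two gold entities, giving a precision of 1/3 and a recall of 1/2; a partially correct span earns no credit under exact matching.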

It is surprising that for an isolating language such as Vietnamese, where words are not inflected, using character-based representations helps produce improvements of more than 1% over the BiLSTM-CRF model. We find that the improvements to BiLSTM-CRF are mostly accounted for by the PER label. The reason turns out to be simple: about 50% of named entities are labeled with the tag PER, so character-based representations are in fact able to capture common family and middle names in ‘unknown’ full-name words in the test set. In addition, we also find that BiLSTM-CRF-based models do not benefit from additional predicted POS tags. This is probably because BiLSTM can take word order into account, while, without word inflection, all grammatical information in Vietnamese is conveyed through its fixed word order; thus explicit predicted POS tags carrying noisy grammatical information are not helpful.

4.2 Dependency parsing

Experimental setup:

We use the Vietnamese dependency treebank VnDT (Nguyen et al., 2014), consisting of 10,200 sentences, in our experiments. Following Nguyen et al. (2016a), we use the last 1,020 sentences of VnDT for test while the remaining sentences are used for training. Evaluation metrics are the labeled attachment score (LAS) and unlabeled attachment score (UAS).
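
The two metrics can be made concrete with a short sketch. UAS is the percentage of tokens whose predicted head is correct, and LAS additionally requires the dependency relation to be correct. The code below counts every token, including punctuation, in line with the caption of Table 3; the class name AttachmentScores and its method are hypothetical and not taken from any evaluation tool.

```java
/** Illustrative UAS/LAS computation over all tokens (punctuation included). */
final class AttachmentScores {

    /**
     * Returns {UAS, LAS} in percent.
     * A token counts for UAS if its predicted head equals the gold head,
     * and for LAS if the dependency label matches as well.
     */
    static double[] score(int[] goldHeads, String[] goldLabels,
                          int[] predHeads, String[] predLabels) {
        int n = goldHeads.length;
        if (n == 0 || goldLabels.length != n
                || predHeads.length != n || predLabels.length != n) {
            throw new IllegalArgumentException("Inputs must be non-empty and of equal length");
        }
        int headCorrect = 0;
        int headAndLabelCorrect = 0;
        for (int i = 0; i < n; i++) {
            if (goldHeads[i] == predHeads[i]) {
                headCorrect++;
                if (goldLabels[i].equals(predLabels[i])) {
                    headAndLabelCorrect++;
                }
            }
        }
        return new double[] {100.0 * headCorrect / n, 100.0 * headAndLabelCorrect / n};
    }

    public static void main(String[] args) {
        // Gold analysis of the nine-token sentence in Table 1.
        int[] goldHeads = {4, 1, 4, 0, 4, 5, 6, 6, 4};
        String[] goldLabels = {"sub", "nmod", "adv", "root", "loc", "pob", "nmod", "nmod", "punct"};
        // Made-up prediction: token 5 has a wrong label, token 8 a wrong head.
        int[] predHeads = {4, 1, 4, 0, 4, 5, 6, 5, 4};
        String[] predLabels = {"sub", "nmod", "adv", "root", "vmod", "pob", "nmod", "nmod", "punct"};
        double[] scores = score(goldHeads, goldLabels, predHeads, predLabels);
        System.out.printf("UAS = %.2f, LAS = %.2f%n", scores[0], scores[1]);
    }
}
```

In the made-up prediction, 8 of 9 heads are correct and 7 of 9 tokens have both head and label correct, so LAS can never exceed UAS.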

Main results:

Table 3 compares the dependency parsing results of VnCoreNLP with results reported in prior work, using the same experimental setup. The first six rows present the scores with gold POS tags. The next two rows show scores of VnCoreNLP with automatic POS tags (we replace the gold POS tags with the automatic POS tags predicted by our POS tagging component in both the training and test sets), while the last row presents scores of the joint POS tagging and dependency parsing model jPTDP (Nguyen et al., 2017a). Table 3 shows that, compared to previously published results, VnCoreNLP produces the highest LAS score. Note that previous results are reported without using additional information from automatically predicted NER labels. Even in this setting, the LAS score of VnCoreNLP without automatic NER features (i.e. VnCoreNLP–NER in Table 3) is still higher than previous ones. Notably, we also obtain a fast parsing speed.

Model              LAS     UAS     Speed
Gold POS
VnCoreNLP          73.39   79.02   _
VnCoreNLP–NER      73.21   78.91   _
BIST-bmstparser    73.17   79.39   _
BIST-barchybrid    72.53   79.33   _
MSTParser          70.29   76.47   _
MaltParser         69.10   74.91   _
Auto POS
VnCoreNLP          70.23   76.93   8k
VnCoreNLP–NER      70.10   76.85   9k
jPTDP              69.49   77.68   700
Table 3: LAS and UAS scores (in %) computed on all tokens (i.e. including punctuation) on the test set w.r.t. gold word-segmentation. “Speed” is defined as in Table 2. The subscript “–NER” denotes the model without using automatically predicted NER labels as features. The results of the MSTParser (McDonald et al., 2005), MaltParser (Nivre et al., 2007), and BiLSTM-based parsing models BIST-bmstparser and BIST-barchybrid (Kiperwasser and Goldberg, 2016) are reported in Nguyen et al. (2016a). The result of the jPTDP model (Nguyen et al., 2017a) for Vietnamese is mentioned in Nguyen et al. (2017b) and detailed at https://drive.google.com/drive/folders/0B5eBgc8jrKtpUmhhSmtFLWdrTzQ.

5 Conclusion

In this paper, we have presented the VnCoreNLP toolkit, a simple, fast and accurate NLP processing pipeline providing the core Vietnamese NLP steps: word segmentation, POS tagging, NER and dependency parsing. The current version of VnCoreNLP has been trained without any linguistic optimization, i.e. we only employ existing pre-defined features in the traditional feature-based models for POS tagging, NER and dependency parsing. Future work will therefore focus on incorporating Vietnamese linguistic features into these feature-based models.

References

  • Choi (2016) Jinho D. Choi. 2016. Dynamic Feature Induction: The Last Gist to the State-of-the-Art. In Proceedings of NAACL-HLT. pages 271–281.
  • Choi et al. (2015) Jinho D. Choi, Joel Tetreault, and Amanda Stent. 2015. It Depends: Dependency Parser Comparison Using A Web-based Evaluation Tool. In Proceedings of ACL-IJCNLP. pages 387–396.
  • Huang et al. (2015) Zhiheng Huang, Wei Xu, and Kai Yu. 2015. Bidirectional LSTM-CRF models for sequence tagging. arXiv preprint arXiv:1508.01991.
  • Kiperwasser and Goldberg (2016) Eliyahu Kiperwasser and Yoav Goldberg. 2016. Simple and Accurate Dependency Parsing Using Bidirectional LSTM Feature Representations. Transactions of the Association for Computational Linguistics 4:313–327.
  • Lample et al. (2016) Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. Neural Architectures for Named Entity Recognition. In Proceedings of NAACL-HLT. pages 260–270.
  • Le et al. (2008) Hong Phuong Le, Thi Minh Huyen Nguyen, Azim Roussanaly, and Tuong Vinh Ho. 2008. A hybrid approach to word segmentation of Vietnamese texts. In Proceedings of LATA. pages 240–249.
  • Le et al. (2013) Ngoc Minh Le, Bich Ngoc Do, Vi Duong Nguyen, and Thi Dam Nguyen. 2013. VNLP: An Open Source Framework for Vietnamese Natural Language Processing. In Proceedings of SoICT. pages 88–93.
  • Le-Hong et al. (2010) Phuong Le-Hong, Azim Roussanaly, Thi Minh Huyen Nguyen, and Mathias Rossignol. 2010. An empirical study of maximum entropy approach for part-of-speech tagging of Vietnamese texts. In Proceedings of TALN.
  • Ma and Hovy (2016) Xuezhe Ma and Eduard Hovy. 2016. End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF. In Proceedings of ACL (Volume 1: Long Papers). pages 1064–1074.
  • McDonald et al. (2005) Ryan McDonald, Koby Crammer, and Fernando Pereira. 2005. Online Large-margin Training of Dependency Parsers. In Proceedings of ACL. pages 91–98.
  • Mueller et al. (2013) Thomas Mueller, Helmut Schmid, and Hinrich Schütze. 2013. Efficient Higher-Order CRFs for Morphological Tagging. In Proceedings of EMNLP. pages 322–332.
  • Nguyen et al. (2006) Cam-Tu Nguyen, Trung-Kien Nguyen, Xuan-Hieu Phan, Le-Minh Nguyen, and Quang-Thuy Ha. 2006. Vietnamese Word Segmentation with CRFs and SVMs: An Investigation. In Proceedings of PACLIC. pages 215–222.
  • Nguyen et al. (2010) Cam-Tu Nguyen, Xuan-Hieu Phan, and Thu-Trang Nguyen. 2010. JVnTextPro: A Java-based Vietnamese Text Processing Tool. http://jvntextpro.sourceforge.net/.
  • Nguyen et al. (2016a) Dat Quoc Nguyen, Mark Dras, and Mark Johnson. 2016a. An empirical study for Vietnamese dependency parsing. In Proceedings of ALTA. pages 143–149.
  • Nguyen et al. (2017a) Dat Quoc Nguyen, Mark Dras, and Mark Johnson. 2017a. A Novel Neural Network Model for Joint POS Tagging and Graph-based Dependency Parsing. In Proceedings of the CoNLL 2017 Shared Task. pages 134–142.
  • Nguyen et al. (2014) Dat Quoc Nguyen, Dai Quoc Nguyen, Son Bao Pham, Phuong-Thai Nguyen, and Minh Le Nguyen. 2014. From Treebank Conversion to Automatic Dependency Parsing for Vietnamese. In Proceedings of NLDB. pages 196–207.
  • Nguyen et al. (2018) Dat Quoc Nguyen, Dai Quoc Nguyen, Thanh Vu, Mark Dras, and Mark Johnson. 2018. A Fast and Accurate Vietnamese Word Segmenter. In Proceedings of LREC. To appear.
  • Nguyen et al. (2017b) Dat Quoc Nguyen, Thanh Vu, Dai Quoc Nguyen, Mark Dras, and Mark Johnson. 2017b. From Word Segmentation to POS Tagging for Vietnamese. In Proceedings of ALTA. pages 108–113.
  • Nguyen et al. (2009) Phuong Thai Nguyen, Xuan Luong Vu, Thi Minh Huyen Nguyen, Van Hiep Nguyen, and Hong Phuong Le. 2009. Building a Large Syntactically-Annotated Corpus of Vietnamese. In Proceedings of LAW. pages 182–185.
  • Nguyen and Le (2016) Tuan-Phong Nguyen and Anh-Cuong Le. 2016. A Hybrid Approach to Vietnamese Word Segmentation. In Proceedings of RIVF. pages 114–119.
  • Nguyen et al. (2016b) Tuan Phong Nguyen, Quoc Tuan Truong, Xuan Nam Nguyen, and Anh Cuong Le. 2016b. An Experimental Investigation of Part-Of-Speech Taggers for Vietnamese. VNU Journal of Science: Computer Science and Communication Engineering 32(3):11–25.
  • Nivre et al. (2007) Joakim Nivre, Johan Hall, Jens Nilsson, Atanas Chanev, Gülsen Eryigit, Sandra Kübler, Svetoslav Marinov, and Erwin Marsi. 2007. MaltParser: A language-independent system for data-driven dependency parsing. Natural Language Engineering 13(2):95–135.
  • Pham et al. (2017) Thai-Hoang Pham, Xuan-Khoai Pham, Tuan-Anh Nguyen, and Phuong Le-Hong. 2017. NNVLP: A Neural Network-Based Vietnamese Language Processing Toolkit. In Proceedings of the IJCNLP 2017 System Demonstrations. pages 37–40.
  • Reimers and Gurevych (2017) Nils Reimers and Iryna Gurevych. 2017. Reporting Score Distributions Makes a Difference: Performance Study of LSTM-networks for Sequence Tagging. In Proceedings of EMNLP. pages 338–348.
  • Vu (2016) Xuan-Son Vu. 2016. Pre-trained word2vec model for vietnamese. https://github.com/sonvx/word2vecVN.