PKUSEG: A Toolkit for Multi-Domain
Chinese Word Segmentation
Chinese word segmentation (CWS) is a fundamental step of Chinese natural language processing. In this paper, we build a new toolkit, named PKUSEG, for multi-domain word segmentation. Unlike existing single-model toolkits, PKUSEG targets multi-domain word segmentation and provides separate models for different domains, such as web, medicine, and tourism. The toolkit also supports POS tagging and model training to adapt to various application scenarios. Experiments show that PKUSEG achieves high performance on multiple domains. The toolkit is freely and publicly available for research and industrial use at https://github.com/lancopku/pkuseg-python/.
Chinese word segmentation is a fundamental task of Chinese language processing. Since words are the basic semantic units of Chinese, the quality of segmentation directly influences the performance of downstream tasks. In recent years, Chinese word segmentation has developed considerably. The best-performing systems are mostly based on conditional random fields (CRF) Lafferty (2001); Sun et al. (2012). However, despite their promising results, these approaches rely heavily on feature engineering. To tackle this problem, many studies Chen et al. (2015); Cai and Zhao (2016); Liu et al. (2016); Xu and Sun (2016) explore neural networks to automatically learn better representations.
Recently, several public segmentation toolkits have emerged, such as jieba and HanLP. For efficiency, they are built on traditional segmentation models, like the perceptron Zhang and Clark (2007) or CRF, rather than time-consuming neural networks. These toolkits provide only a single coarse-grained segmentation model, mostly trained on news-domain data. In real-world applications, however, the domain of the text varies, and text from different domains follows different domain-specific segmentation rules. This makes segmentation harder and lowers the performance of existing toolkits on text outside their training domain.
To address this challenge, we propose a multi-domain segmentation toolkit, PKUSEG, based on the work of Sun et al. (2012). We adopt CRF, a fast and high-precision model, as the implementation. PKUSEG includes multiple pre-trained domain-specific segmentation models. Since some domains are low-resource, we use pre-training to improve segmentation quality. We first pre-train a coarse-grained model on a mixed corpus containing millions of instances from the news and web domains. Then, we fine-tune the coarse-grained model on domain-specific data to obtain fine-grained models. In addition to the provided segmentation models, PKUSEG allows users to train a new model on their own domain data. Furthermore, PKUSEG supports POS tagging to adapt to various scenarios. Experimental results show that PKUSEG achieves high performance on multi-domain datasets.
In summary, PKUSEG has the following characteristics:
- Good out-of-the-box performance. The default word segmentation model provided by PKUSEG is trained on a large-scale, curated, multi-domain dataset and shows stable, high performance across various domains.
- Domain-specific pre-trained models. PKUSEG also comes with multiple pre-trained models that are fine-tuned on texts from different domains. These models further improve domain-specific performance and are suited to analyzing in-domain texts.
- Easy transfer learning. For advanced users, PKUSEG supports transfer learning based on the default multi-domain model. Users can fine-tune the model on their own segmented texts.
- POS tagging. PKUSEG also provides POS tagging interfaces for further lexical analysis.
This section describes the implementation of the toolkit in detail.
2.1 Conditional Random Field
Although neural networks can perform better, we do not use them as the implementation because training them is time-consuming. Instead, we use CRF, a well-performing and fast-training model, as a trade-off between time cost and accuracy. We optimize the weights of the CRF by maximizing the log-likelihood of the tags of the reference sequence. The log-likelihood can be computed by a recursive algorithm in linear time. At inference time, we adopt the Viterbi algorithm Forney (1973), which finds the highest-scoring tag sequence by dynamic programming.
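The dynamic-programming search can be sketched as follows. This is a minimal illustration, not PKUSEG's implementation: it assumes a linear-chain model whose score decomposes into per-position tag scores (`emissions`) and tag-to-tag scores (`transitions`), and all names are hypothetical.

```python
from typing import List, Sequence


def viterbi(emissions: Sequence[Sequence[float]],
            transitions: Sequence[Sequence[float]]) -> List[int]:
    """Return the highest-scoring sequence of tag indices.

    emissions[t][j]   : score of tag j at position t (assumed given)
    transitions[i][j] : score of moving from tag i to tag j (assumed given)
    """
    n_steps, n_tags = len(emissions), len(transitions)
    if n_steps == 0:
        return []
    # best[t][j]: best score of any path that ends in tag j at position t
    best = [list(emissions[0])]
    # back[t - 1][j]: best previous tag for tag j at position t
    back = []
    for t in range(1, n_steps):
        scores, pointers = [], []
        for j in range(n_tags):
            prev = max(range(n_tags),
                       key=lambda i: best[-1][i] + transitions[i][j])
            scores.append(best[-1][prev] + transitions[prev][j] + emissions[t][j])
            pointers.append(prev)
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final tag to recover the path
    path = [max(range(n_tags), key=lambda j: best[-1][j])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]
```

The cost is linear in the sentence length and quadratic in the number of tags, which is small for segmentation tag sets.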
2.2 ADF Algorithm
For a CRF with many high-dimensional features, the number of parameters is very large, so training is expensive. To address this problem, we train with adaptive online gradient descent based on feature frequency information (ADF) Sun et al. (2012). Unlike stochastic gradient descent (SGD), which uses a single learning rate for all parameters, ADF turns the learning rate into a vector with the same dimension as the parameters. The learning rate of each parameter is adjusted automatically according to the frequency of the corresponding feature. The idea is that a feature with higher frequency is trained more adequately, so its learning rate can decay faster.
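The idea of per-feature learning rates driven by feature frequency can be illustrated with a small sketch. It is a simplified illustration and not a reproduction of the update rule in Sun et al. (2012): the decay schedule, the class name, and the parameters `alpha`, `beta`, and `window` are assumptions chosen for this example.

```python
from collections import defaultdict
from typing import Dict


class FrequencyAdaptiveSGD:
    """Sparse online optimizer that keeps one learning rate per feature."""

    def __init__(self, init_rate: float = 0.1, alpha: float = 0.995,
                 beta: float = 0.6, window: int = 10) -> None:
        self.alpha = alpha    # decay factor in the limit of a very rare feature
        self.beta = beta      # decay factor for a feature seen in every sample
        self.window = window  # number of samples between learning-rate updates
        self.weights: Dict[str, float] = defaultdict(float)
        self.rates: Dict[str, float] = defaultdict(lambda: init_rate)
        self.counts: Dict[str, int] = defaultdict(int)
        self.seen = 0

    def step(self, grad: Dict[str, float]) -> None:
        """Apply one gradient-ascent update; grad maps active features to gradients."""
        for feat, g in grad.items():
            self.counts[feat] += 1
            self.weights[feat] += self.rates[feat] * g
        self.seen += 1
        if self.seen % self.window == 0:
            self._decay_rates()

    def _decay_rates(self) -> None:
        """Shrink each rate by a factor between alpha (rare) and beta (frequent)."""
        for feat, count in self.counts.items():
            freq = count / self.window
            self.rates[feat] *= self.alpha - freq * (self.alpha - self.beta)
        self.counts.clear()
```

Features that fire in most samples see their learning rate shrink quickly, while rare features keep a rate close to the initial one.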
To handle low-resource domains, we adopt pre-training in PKUSEG, following the work of Xu and Sun (2017). We mix news and web data as pre-training data. The news data comes from the PKU dataset provided by the Second International Chinese Word Segmentation Bakeoff. The web data comes from the Weibo dataset provided by the NLPCC-ICCPOL 2016 Shared Task Qiu et al. (2016). A hybrid dataset, CTB, is also included in pre-training. During fine-tuning, models are initialized with the pre-trained model and then trained on domain-specific data. So far, PKUSEG supports fine-grained models for the news, medicine, tourism, and web domains. Since the covered domains are limited, we also provide a pre-trained model for general use.
2.4 A Large-Scale Vocabulary
One major difficulty of multi-domain segmentation is the sparsity of domain-specific words: it is hard to cover all of them in the training set. Therefore, to increase the coverage of PKUSEG, we automatically build a large-scale domain vocabulary. The words are crawled from the Sogou website (https://pinyin.sogou.com/dict/) and extracted from the training data of PKU, MSRA, Weibo, and CTB. In total, we extract almost 850K words. The distribution of words is shown in Table 1.
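How such a vocabulary might be assembled from segmented corpora and external word lists can be sketched as follows. The sketch assumes that training files store one sentence per line with words separated by whitespace; the function name and the sample data are hypothetical.

```python
from typing import Iterable, Set


def build_vocabulary(segmented_lines: Iterable[str],
                     external_words: Iterable[str],
                     min_length: int = 1) -> Set[str]:
    """Merge words from segmented training text with an external word list."""
    vocab: Set[str] = set()
    for line in segmented_lines:
        # Assumed format: words separated by whitespace
        vocab.update(line.split())
    # External entries (for example, crawled dictionary lines) may carry whitespace
    vocab.update(word.strip() for word in external_words)
    # Drop empty entries and words shorter than the requested length
    return {word for word in vocab if len(word) >= min_length}


corpus = ["我 爱 北京 天安门", "北京 欢迎 你"]
crawled = ["维多利亚港", "  支付宝 ", ""]
vocab = build_vocabulary(corpus, crawled)
```

Setting `min_length=2` keeps only multi-character words, which is one way to limit a vocabulary to entries that actually constrain segmentation.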
PKUSEG combines high precision with user-friendly interfaces. It is developed on top of standard Python 3 libraries and supports common platforms such as Windows, Linux, and macOS.
PKUSEG offers two user-friendly installation methods. Users can install it from PyPI, and the corresponding models are downloaded at the same time. A typical command is:
Users can also install PKUSEG from GitHub. After downloading the project code from GitHub, users can run the following command to install PKUSEG:
Note that the project downloaded from GitHub does not include pre-trained models; users need to download them separately from GitHub or train a new model.
The segmentation interfaces are described in detail below.
If users know the domain of the text to be segmented, they can use the corresponding domain-specific model. Example code for specifying the model is shown in Figure 1. If a model is provided by the toolkit, users can call it directly by its domain name, e.g., “medicine”, “tourism”, “web”, and “news”. The model is loaded automatically based on the parameter “model_name”. If the model is user-trained, “model_name” refers to the model path.
Although PKUSEG is designed for the situation where users know the domain of the text to be segmented, we also provide a coarse-grained model for cases where the user cannot identify the target domain. The coarse-grained model is used in the default mode. Figure 2 shows example code using the default mode.
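A minimal sketch of choosing between a domain-specific model and the default mode is given below. It assumes the `pkuseg.pkuseg(model_name=...)` constructor and its `cut` method as described in the project's README; the wrapper function `segment` and its `factory` argument are hypothetical and exist only so that the example can run without the package installed.

```python
from typing import Callable, List, Optional


def segment(text: str, domain: Optional[str] = None,
            factory: Optional[Callable] = None) -> List[str]:
    """Segment text with a domain model, or with the default model if domain is None."""
    if factory is None:
        import pkuseg  # third-party package; imported lazily
        factory = pkuseg.pkuseg
    # Without model_name, the segmenter falls back to its default (coarse-grained) model
    seg = factory(model_name=domain) if domain else factory()
    return seg.cut(text)
```

With the package installed, `segment(text, domain="medicine")` would load the medicine model, while `segment(text)` would use the default mode.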
To better recognize new words, users can add a dictionary that covers words missing from the dictionary of PKUSEG. The dictionary file must contain one word per line and be encoded in UTF-8. Figure 3 shows the usage of a user-defined dictionary.
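A dictionary file in the format described above can be produced with a few lines of code. The helper name and file name below are hypothetical, and the commented-out line assumes that the path is passed through the `user_dict` argument of `pkuseg.pkuseg`, as described in the project's README.

```python
import os
import tempfile
from typing import Iterable


def write_user_dict(words: Iterable[str], path: str) -> int:
    """Write unique, non-empty words to path, one word per line, in UTF-8."""
    # dict.fromkeys removes duplicates while keeping the original order
    unique = list(dict.fromkeys(w.strip() for w in words if w.strip()))
    with open(path, "w", encoding="utf-8") as handle:
        for word in unique:
            handle.write(word + "\n")
    return len(unique)


dict_path = os.path.join(tempfile.mkdtemp(), "my_dict.txt")
n_words = write_user_dict(["医联", "维多利亚港", "医联", " "], dict_path)
# seg = pkuseg.pkuseg(user_dict=dict_path)  # requires the pkuseg package
```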
PKUSEG also allows users to train a new model from scratch on their own training data. Figure 4 shows example code for training a new model.
Segmentation with POS Tagging
In addition to segmentation, PKUSEG can also assign POS tags to the words in a sentence. The usage of the POS tagging interfaces is shown in Figure 5.
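The sketch below shows how tagged output might be consumed. It assumes that `cut` returns (word, tag) pairs when the segmenter is created with `postag=True`, as in the project's README; the helper names are hypothetical, and the sample pairs are illustrative data in the assumed shape, not actual PKUSEG output.

```python
from typing import Iterable, List, Tuple

TaggedWord = Tuple[str, str]


def format_tagged(pairs: Iterable[TaggedWord]) -> str:
    """Render (word, tag) pairs in the common word/tag notation."""
    return " ".join(f"{word}/{tag}" for word, tag in pairs)


def words_with_tag(pairs: Iterable[TaggedWord], tag: str) -> List[str]:
    """Collect the words that carry a given POS tag."""
    return [word for word, word_tag in pairs if word_tag == tag]


# With the package installed, the pairs would come from:
#   seg = pkuseg.pkuseg(postag=True); pairs = seg.cut(text)
sample = [("我", "r"), ("爱", "v"), ("北京", "ns"), ("天安门", "ns")]
rendered = format_tagged(sample)
```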
This section evaluates the performance of PKUSEG.
MSRA & PKU.
MSRA and PKU are news-domain datasets provided by the Second International Chinese Word Segmentation Bakeoff.
Chinese Treebank (CTB) is a hybrid-domain dataset (https://catalog.ldc.upenn.edu/LDC2013T21). It consists of approximately 1.5 million words from Chinese newswire, government documents, magazine articles, various broadcast news and broadcast conversation programs, web newsgroups, and weblogs.
The Weibo dataset comes from the NLPCC-ICCPOL 2016 Shared Task. Unlike the widely used newswire datasets, it consists of many informal micro-texts.
Medicine & News & Tourism.
The corpus was originally constructed by Likun Qiu (2015) by annotating multi-domain texts.
4.2 Out-of-domain Results
To show the effect of domain knowledge on segmentation performance, we train a model on the CTB8 dataset and report its performance on different datasets. We choose CTB8 as an example because it is a hybrid dataset. The results are shown in Table 2. Performance drops markedly on out-of-domain datasets. Each domain has its own segmentation standard, so a single model is not suitable for data from various domains. This result demonstrates the need for fine-grained segmentation toolkits.
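Segmentation quality in CWS is commonly measured with word-level F1. Assuming that metric, a minimal sketch of scoring a single sentence is given below; the function names are hypothetical, and this is not the evaluation script used for the reported results.

```python
from typing import List, Set, Tuple


def to_spans(words: List[str]) -> Set[Tuple[int, int]]:
    """Convert a word sequence into a set of (start, end) character spans."""
    spans, start = set(), 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans


def segmentation_f1(gold: List[str], pred: List[str]) -> float:
    """Word-level F1: a predicted word counts only if both boundaries match."""
    gold_spans, pred_spans = to_spans(gold), to_spans(pred)
    correct = len(gold_spans & pred_spans)
    if correct == 0:
        return 0.0
    precision = correct / len(pred_spans)
    recall = correct / len(gold_spans)
    return 2 * precision * recall / (precision + recall)


gold = ["我", "爱", "北京", "天安门"]
pred = ["我", "爱", "北京", "天安", "门"]
score = segmentation_f1(gold, pred)  # 3 of 5 predicted and 3 of 4 gold words match
```

Comparing spans rather than word strings ensures that a word is counted as correct only when it appears at the same position in both segmentations.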
4.3 Pre-training Results
We combine existing large-scale datasets, including PKU (news), Weibo (web), and CTB8 (hybrid), and use them as pre-training data to obtain a coarse-grained model. The coarse-grained model is then fine-tuned into domain-specific models. Table 3 shows the effect of pre-training. Pre-training yields a much better average score, especially on lower-resource datasets (e.g., tourism).
[Table: results without pre-training (w/o Pre-train) and with pre-training (w. Pre-train)]
4.4 Default Performance
Since many users tend to evaluate a toolkit with its default model, we also report experimental results in the default mode, using the default model and vocabulary of PKUSEG. The results are shown in Table 4. As we can see, the default model performs worse than the domain-specific models. We therefore recommend that users choose a domain-specific model rather than the default one whenever they can identify the domain of the text.
To illustrate the practical application of PKUSEG, we also show segmentation examples randomly crawled from articles covering the domains of medicine, travel, web text, and news. The segmentation results are shown in Table 5. PKUSEG achieves high accuracy on words that require professional domain knowledge.
|医联 平台 ： 包括 挂号 预约 查看 院内 信息 化验单 等 ， 目前 出现 与 微信 、 支付宝 结合的 趋势 。|
|Medical Association platform includes registration appointment, in-hospital information management, etc. There is a trend of integration with WeChat and Alipay.|
|在 这里 可以 俯瞰 维多利亚港 的 香港岛 ， 九龙 半岛 两岸 ， 美景 无敌 。|
|It overlooks Victoria Harbour and the two sides of the Kowloon Peninsula. The view is so beautiful.|
|【 这是 我 的 世界 ， 你 还 未 见 过 】 欢迎 来 参加 我 的 演唱会 听点 音乐|
|This is my world that you have not seen before. Welcome to participate in my concert to listen to music.|
|他 不 忘 讽刺 加州 ： “ 加州 已 在 失控 的 高铁 项目 上 浪费 了 数十亿美元 ， 完全 没有 完成 的 希望 。|
|He did not forget to satirize California, “California has been wasting billions of dollars on the uncontrolled high-speed rail projects, which is of no hope being completed at all”.|
|乌克兰 政府 正式 通过 最新 《 宪法 修正案 》 ， 正式 确定 乌克兰 将 加入 北约 作为 重要 国家 方针 ， 该 法 强调 ， ” 这项 法律 将 于 发布 次日 起 生效 ” 。|
|The Ukrainian government officially adopted the latest Constitutional Amendment, confirming that Ukraine will regard joining the NATO as an important national policy. The law emphasizes that it will take effect from the next day.|
5 Conclusion and Future Work
In this paper, we propose PKUSEG, a new toolkit for multi-domain Chinese word segmentation. PKUSEG provides simple and user-friendly interfaces. Experiments on widely used datasets demonstrate that PKUSEG achieves high accuracy. So far, PKUSEG supports domains such as medicine, tourism, web, and news. In the future, we plan to release more domain-specific models and further improve the efficiency of PKUSEG.
- Cai and Zhao (2016) Deng Cai and Hai Zhao. 2016. Neural word segmentation learning for Chinese. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, August 7-12, 2016, Berlin, Germany, Volume 1: Long Papers.
- Chen et al. (2015) Xinchi Chen, Xipeng Qiu, Chenxi Zhu, Pengfei Liu, and Xuanjing Huang. 2015. Long short-term memory neural networks for Chinese word segmentation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, EMNLP 2015, Lisbon, Portugal, September 17-21, 2015, pages 1197–1206.
- Forney (1973) G. D. Forney. 1973. The Viterbi algorithm. Proc. of the IEEE, 61:268–278.
- Lafferty (2001) John Lafferty. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. pages 282–289. Morgan Kaufmann.
- Likun Qiu (2015) Likun Qiu, Linlin Shi, and Houfeng Wang. 2015. Construction of multi-domain Chinese dependency treebanks and analysis of influencing factors on dependency parsing. Journal of Chinese Information Processing, 29(5):69.
- Liu et al. (2016) Yijia Liu, Wanxiang Che, Jiang Guo, Bing Qin, and Ting Liu. 2016. Exploring segment representations for neural segmentation models. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, IJCAI 2016, New York, NY, USA, 9-15 July 2016, pages 2880–2886.
- Qiu et al. (2016) Xipeng Qiu, Peng Qian, and Zhan Shi. 2016. Overview of the NLPCC-ICCPOL 2016 shared task: Chinese word segmentation for micro-blog texts. In NLPCC/ICCPOL, volume 10102 of Lecture Notes in Computer Science, pages 901–906. Springer.
- Sun et al. (2012) Xu Sun, Houfeng Wang, and Wenjie Li. 2012. Fast online training with frequency-adaptive learning rates for Chinese word segmentation and new word detection. In The 50th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference, July 8-14, 2012, Jeju Island, Korea - Volume 1: Long Papers, pages 253–262. The Association for Computer Linguistics.
- Xu and Sun (2016) Jingjing Xu and Xu Sun. 2016. Dependency-based gated recursive neural network for Chinese word segmentation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), volume 2, pages 567–572.
- Xu and Sun (2017) Jingjing Xu and Xu Sun. 2017. Transfer learning for low-resource Chinese word segmentation with a novel neural network. CoRR, abs/1702.04488.
- Zhang and Clark (2007) Yue Zhang and Stephen Clark. 2007. Chinese segmentation with a word-based perceptron algorithm. In ACL. The Association for Computational Linguistics.