Ancient-Modern Chinese Translation with a Large Training Dataset
Ancient Chinese carries the wisdom and spiritual culture of the Chinese nation. Automatic translation from ancient Chinese to modern Chinese helps to inherit and carry forward the quintessence of the ancients. In this paper, we propose an Ancient-Modern Chinese clause alignment approach and apply it to create a large-scale Ancient-Modern Chinese parallel corpus which contains 1.24M bilingual pairs. To the best of our knowledge, this is the first large high-quality Ancient-Modern Chinese dataset. (We will release the dataset and source code upon acceptance.) Furthermore, we train SMT and various NMT-based models on this dataset and provide a strong baseline for this task.
Dayiheng Liu, Jiancheng Lv, Kexin Yang, Qian Qu
Data Intelligence Laboratory, Sichuan University, Chengdu, China
Ancient Chinese is the written language of ancient China. It is a treasure of Chinese culture which brings together the wisdom and ideas of the Chinese nation and chronicles the ancient cultural heritage of China. Learning ancient Chinese not only helps people to understand and inherit the wisdom of the ancients, but also encourages them to absorb and develop Chinese culture. (In this paper, Ancient Chinese roughly refers to the cultural and historical notion of literary Chinese, called WenYanWen in Chinese.)
However, it is difficult for modern readers to read ancient Chinese. Firstly, compared with modern Chinese, ancient Chinese is concise, and its word order differs greatly. Secondly, most modern Chinese words are disyllabic, while most ancient Chinese words are monosyllabic. Thirdly, polysemy is common in ancient Chinese. In addition, manual translation is costly. Therefore, it is meaningful and useful to study automatic translation from ancient Chinese to modern Chinese. Through Ancient-Modern Chinese translation, the wisdom, talent and accumulated experience of our predecessors can be passed on to more people.
Neural machine translation (NMT) (Sutskever et al., 2014; Bahdanau et al., 2014; Wu et al., 2016) has achieved remarkable performance on many bilingual translation tasks. It is an end-to-end learning approach to machine translation, with the potential to show great advantages over statistical machine translation (SMT) systems. However, the NMT approach has not been widely applied to the Ancient-Modern Chinese translation task. One of the main reasons is the limited availability of high-quality parallel data.
The most popular method of acquiring translation examples is bilingual text alignment (Kaji et al., 1992). Such methods fall into two types: lexical-based and statistical-based. Lexical-based approaches (Wang and Ren, 2005; Kit et al., 2004) focus on lexical information, using a bilingual dictionary or lexical features, while statistical-based approaches (Brown et al., 1991; Gale and Church, 1993) rely on statistical information, such as the sentence length ratio between the two languages and the alignment mode probability.
However, these methods are designed for other language pairs. The Ancient-Modern Chinese pair has characteristics that are quite different from those of other language pairs. For example, Ancient and Modern Chinese are both written in Chinese characters, but ancient Chinese is highly concise and its syntactic structure differs from that of modern Chinese. The traditional methods do not take these characteristics into account. In this paper, we propose an effective Ancient-Modern Chinese text alignment method at the clause level, based on the characteristics of these two languages. Clause alignment is more fine-grained than sentence alignment: in our experiments, a sentence is split into clauses at each comma, semicolon, period or exclamation mark. The proposed method combines lexical-based and statistical-based information and achieves a 94.2 F1-score on our manually annotated test set. Recently, Zhang et al. (2018) proposed a simple longest common subsequence based approach for Ancient-Modern Chinese sentence alignment. Our experiments show that our proposed alignment approach performs much better than their method.
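The clause splitting rule described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `split_clauses` is hypothetical, and the delimiter set assumes both the full-width and ASCII forms of the four punctuation marks. Other marks, such as the question mark, are deliberately not treated as clause boundaries here because the text does not list them.

```python
import re
from typing import List

# Assumed delimiter set: ASCII and full-width forms of the comma, semicolon,
# period and exclamation mark.
CLAUSE_DELIMITERS = ",;.!，；。！"


def split_clauses(text: str) -> List[str]:
    """Split text into clauses, keeping each delimiter attached to its clause."""
    delims = re.escape(CLAUSE_DELIMITERS)
    # One clause = a run of non-delimiter characters plus an optional delimiter.
    pattern = f"[^{delims}]+[{delims}]?"
    clauses = (match.group().strip() for match in re.finditer(pattern, text))
    return [clause for clause in clauses if clause]
```

Keeping the delimiter attached to the clause means that concatenating the clauses restores the original text, apart from any whitespace removed by `strip()`.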
We apply the proposed method to create a large parallel translation corpus which contains 1.24M bilingual sentence pairs. To the best of our knowledge, this is the first large high-quality Ancient-Modern Chinese dataset. By comparison, the dataset in (Zhang et al., 2018) contains only 57391 sentence pairs, the dataset in (Lin and Wang, 2007) involves only 205 Ancient-Modern Chinese paragraph pairs, and the dataset in (Liu and Wang, 2012) involves only one history book. Furthermore, we test SMT models and various NMT-based models on the created dataset and provide a strong baseline for this task.
2 Creating Large Training Dataset
Building the Ancient-Modern Chinese translation dataset takes four steps: (i) crawling and cleaning the parallel corpus; (ii) paragraph alignment; (iii) clause alignment based on the aligned paragraphs; (iv) data augmentation by merging aligned adjacent clauses. The most critical step is the third.
2.2 Clause Alignment
In the clause alignment step, we combine statistical-based and lexical-based information to score each possible clause alignment between Ancient and Modern Chinese strings. Dynamic programming is then employed to find the overall optimal alignment, paragraph by paragraph. Based on the characteristics of the ancient and modern Chinese languages, we consider the following factors when measuring the alignment score of a bilingual clause pair:
Lexical Matching. The lexical matching score measures the matching coverage of the ancient clause. It contains two parts: exact matching and dictionary matching. An ancient Chinese character usually corresponds to one or more modern Chinese words. In the first part, we apply Chinese word segmentation to the modern Chinese clause and then match the ancient characters against the modern words from left to right. When an ancient character appears in a modern word, we define the character to exactly match the word, and words that have been matched are deleted from the original clauses before further matching. However, some ancient characters do not appear in their corresponding words. An ancient Chinese dictionary is employed to address this issue. We preprocess the ancient Chinese dictionary and remove the stop words. In the dictionary matching part, we retrieve the dictionary definition of each unmatched ancient character and use it to match the remaining modern Chinese words. To reduce the impact of matches on very common words, we weight the matching words by Inverse Document Frequency (IDF).
The lexical matching score has two terms. The first term is the co-occurrence matching score. It is defined over the length of the ancient clause, the ancient characters in that clause, and an indicator function that reports whether a character can match a word in the modern clause. The second term is the dictionary matching score. It is defined over the remaining unmatched strings of the ancient and modern clauses, the characters in the dictionary definition of each unmatched ancient character, and the normalized IDF of each such definition character.
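The two-part matching procedure can be illustrated with a short sketch. It is a hedged approximation and not the paper's exact formula: the function name `lexical_score`, the capping of each dictionary contribution at 1, and the normalization by the ancient clause length are all assumptions made for illustration. The modern clause is expected to be segmented already (for example with a word segmenter such as Jieba), so the sketch needs only the standard library.

```python
from typing import Dict, List


def lexical_score(ancient: str,
                  modern_words: List[str],
                  dictionary: Dict[str, str],
                  idf: Dict[str, float]) -> float:
    """Coverage-style score: exact matches plus IDF-weighted dictionary matches."""
    if not ancient:
        return 0.0
    remaining = list(modern_words)  # modern words not matched yet
    unmatched = []                  # ancient characters left for the dictionary pass
    exact = 0
    # Part 1: exact matching, left to right; a matched word is consumed.
    for ch in ancient:
        hit = next((i for i, w in enumerate(remaining) if ch in w), None)
        if hit is None:
            unmatched.append(ch)
        else:
            exact += 1
            del remaining[hit]
    # Part 2: dictionary matching against the words that are still unmatched.
    rest = "".join(remaining)
    dict_score = 0.0
    for ch in unmatched:
        definition = dictionary.get(ch, "")
        weight = sum(idf.get(d, 0.0) for d in set(definition) if d in rest)
        dict_score += min(1.0, weight)  # assumed cap: one character counts at most once
    return (exact + dict_score) / len(ancient)
```

For instance, with the ancient clause "日出" and the segmented modern clause `["太阳", "出来"]`, the character "出" matches exactly, while "日" can only be matched through a dictionary definition containing "太阳".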
Statistical Information. Similar to (Gale and Church, 1993) and (Wang and Ren, 2005), the statistical information contains the alignment mode and length information. There are many alignment modes between ancient and modern Chinese. If one ancient Chinese clause aligns with two adjacent modern Chinese clauses, we call this the 1-2 alignment mode. In this paper, we only consider the 1-0, 0-1, 1-1, 1-2, 2-1 and 2-2 alignment modes, based on their share of the validation set. We estimate the probability Pr_{n-m} of each alignment mode n-m on the validation set. To utilize length information, we investigate the length correlation between the two languages. Following the assumption of (Gale and Church, 1993) that each character in one language gives rise to a random number of characters in the other language, and that those random variables are independent and identically distributed with a normal distribution, we estimate the mean and standard deviation from the paragraph-aligned parallel corpus. Given a clause pair, the statistical information score is computed from the probability of its alignment mode and from the normal distribution probability density function evaluated on its length information.
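A score of this kind might be sketched as below. The exact form used in the paper is not recoverable here, so this sketch makes two explicit assumptions: the length information is the modern-to-ancient character length ratio, and it is combined with the alignment mode probability by multiplication. The names `normal_pdf` and `statistical_score`, and the parameter values in the usage example, are hypothetical.

```python
import math
from typing import Dict, Tuple


def normal_pdf(x: float, mean: float, std: float) -> float:
    """Probability density of a normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def statistical_score(ancient_len: int,
                      modern_len: int,
                      mode: Tuple[int, int],
                      mode_prob: Dict[Tuple[int, int], float],
                      mean: float,
                      std: float) -> float:
    """Assumed form: mode probability times the density of the length ratio."""
    prior = mode_prob.get(mode, 0.0)
    if ancient_len == 0 or modern_len == 0:
        # 1-0 and 0-1 modes have no length ratio; fall back to the prior alone.
        return prior
    return prior * normal_pdf(modern_len / ancient_len, mean, std)
```

As a usage example, `statistical_score(4, 6, (1, 1), {(1, 1): 0.5}, mean=1.5, std=1.0)` evaluates the density at its mean and scales it by the mode probability. In practice the mean, standard deviation and mode probabilities would be estimated from data rather than set by hand.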
Edit Distance. Edit distance quantifies the dissimilarity between two strings by counting the minimum number of operations (insertion, deletion, and substitution) required to transform one string into the other. The edit distance score of a clause pair is defined from this quantity.
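For reference, the edit distance itself can be computed with the standard Levenshtein dynamic program, shown below. The conversion into a similarity score in `edit_distance_score` (one minus the distance divided by the longer length) is an assumed normalization chosen for illustration, and may differ from the definition used in the paper.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance with unit-cost insertion, deletion and substitution."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def edit_distance_score(a: str, b: str) -> float:
    """Assumed normalization: 1.0 for identical strings, 0.0 for fully different."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest
```

Because Ancient and Modern Chinese share the same character set, a character-level distance of this kind is meaningful without any transliteration step.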
Dynamic Programming. The overall alignment score for each possible clause alignment combines the lexical matching, statistical information and edit distance scores, using two pre-defined interpolation factors. We use dynamic programming to find the overall optimal alignment paragraph by paragraph. Each entry of the dynamic programming table holds the total alignment score of aligning the ancient Chinese clauses, from the first up to a given position, with the modern Chinese clauses, from the first up to a given position. The recurrence selects, for each entry, the best-scoring option over the alignment modes n-m. For a mode that consumes two ancient clauses, the current ancient Chinese clause is taken together with its preceding ancient Chinese clause, and likewise on the modern side.
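The dynamic program over the six alignment modes can be sketched as follows. This is an illustrative sketch under stated assumptions: the `score` callback is a hypothetical stand-in for the combined alignment score and must return a finite number, higher totals are assumed to be better, and merged clauses are formed by plain string concatenation.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Mode = Tuple[int, int]
MODES: List[Mode] = [(1, 0), (0, 1), (1, 1), (1, 2), (2, 1), (2, 2)]


def align(ancient: Sequence[str],
          modern: Sequence[str],
          score: Callable[[str, str, Mode], float]) -> List[Tuple[str, str]]:
    """Return the best-scoring alignment of one paragraph as (ancient, modern) pairs."""
    n, m = len(ancient), len(modern)
    neg_inf = float("-inf")
    best = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    back: List[List[Optional[Mode]]] = [[None] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            for da, dm in MODES:
                pi, pj = i - da, j - dm
                if pi < 0 or pj < 0 or best[pi][pj] == neg_inf:
                    continue
                # Score the clauses consumed by this mode as one merged pair.
                s, t = "".join(ancient[pi:i]), "".join(modern[pj:j])
                candidate = best[pi][pj] + score(s, t, (da, dm))
                if candidate > best[i][j]:
                    best[i][j] = candidate
                    back[i][j] = (da, dm)
    # Follow the back-pointers from the last cell to recover the alignment.
    pairs = []
    i, j = n, m
    while i > 0 or j > 0:
        da, dm = back[i][j]
        pairs.append(("".join(ancient[i - da:i]), "".join(modern[j - dm:j])))
        i, j = i - da, j - dm
    return pairs[::-1]
```

Since the 1-0 and 0-1 modes are always available, every cell of the table is reachable, so the back-trace always terminates at the origin.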
2.3 Ancient-Modern Chinese Dataset
Data Collection. To build the large Ancient-Modern Chinese dataset, we collected 1.7K bilingual ancient-modern Chinese articles from the internet. More specifically, a large part of the ancient Chinese data comes from historical records of several dynasties (about 1000 BC to 200 BC) and from articles written by celebrities of that era. These authors used plain and accurate words to describe what happened at that time, which ensures the generality of the translated materials. Paragraph alignment was completed manually. After data cleaning and manual paragraph alignment, we obtained 35K aligned bilingual paragraphs.
We applied our clause alignment algorithm to the 35K aligned bilingual paragraphs and obtained 517K aligned bilingual clauses. Furthermore, we augmented the data as follows: given an aligned clause pair, we merge it with its adjacent clause pairs to form a new training pair. After data augmentation, we filtered out sentences longer than 50. Our experiments showed that this augmentation technique greatly improves the performance of the NMT model. Finally, we split the dataset into three sets: training (Train), development (Dev) and testing (Test). The statistics of the three sets are shown in Table 1. Note that the sentences in different sets come from different articles. We show some examples of the data in the Appendix.
|Set||Pairs||Source Tokens||Target Tokens|
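The augmentation step, merging an aligned clause pair with its adjacent pairs and then filtering by length, could look roughly like the sketch below. Several details are assumptions rather than facts from the paper: the `max_span` parameter (how many consecutive pairs may be merged) is hypothetical, length is counted in characters, and the limit of 50 is applied to both sides.

```python
from typing import List, Tuple

Pair = Tuple[str, str]


def augment(pairs: List[Pair], max_span: int = 2, max_len: int = 50) -> List[Pair]:
    """Emit every run of up to max_span consecutive aligned pairs as one training pair."""
    augmented = []
    for start in range(len(pairs)):
        for span in range(1, max_span + 1):
            window = pairs[start:start + span]
            if len(window) < span:
                break  # ran past the end of the paragraph
            source = "".join(s for s, _ in window)
            target = "".join(t for _, t in window)
            # Drop merged pairs that exceed the length limit on either side.
            if len(source) <= max_len and len(target) <= max_len:
                augmented.append((source, target))
    return augmented
```

The function is meant to be applied per paragraph, so that merged pairs never cross a paragraph boundary.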
3.1 Clause Alignment Results
To evaluate our clause alignment algorithm, we manually aligned bilingual clauses from 37 bilingual ancient-modern Chinese articles, obtaining 4K aligned bilingual clauses as the test set and 2K clauses as the validation set.
We evaluated the clause alignment algorithm with various settings on the test set, and compared our method with the longest common subsequence (LCS) based approach proposed by (Zhang et al., 2018). We estimated the mean and standard deviation of the length model on all aligned paragraphs. The probability Pr_{n-m} of each alignment mode n-m was estimated on the validation set. Grid search on the validation set was used to select the two interpolation hyper-parameters. Jieba (a Python-based Chinese word segmentation module, https://github.com/fxsjy/jieba) is employed for modern Chinese word segmentation.
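A grid search over two interpolation factors can be written in a few lines. The sketch below is generic: `evaluate` is a hypothetical callback that would run the aligner on the validation set with the given pair of factors and return a score to maximize, such as F1. The candidate grids in the usage example are made-up values.

```python
import itertools
from typing import Callable, Iterable, Tuple


def grid_search(first_grid: Iterable[float],
                second_grid: Iterable[float],
                evaluate: Callable[[float, float], float]) -> Tuple[Tuple[float, float], float]:
    """Try every combination of the two factors and keep the best-scoring one."""
    best_params, best_score = None, float("-inf")
    for params in itertools.product(first_grid, second_grid):
        current = evaluate(*params)
        if current > best_score:
            best_params, best_score = params, current
    return best_params, best_score
```

For example, `grid_search([0.1, 0.3, 0.5], [0.5, 0.7, 0.9], evaluate)` calls `evaluate` nine times and returns the best pair together with its validation score.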
We used F1-score and precision as the evaluation metrics. The results are shown in Table 2, where the abbreviation w/o means removing a particular part from the setting. We find that the lexical matching score is the most important of the three factors, and that the statistical information score is more important than the edit distance score. Moreover, the dictionary term in the lexical matching score greatly improves performance. From these results, the best setting is the one that involves all three factors, and we used it for clause alignment. Furthermore, the proposed method performs much better than LCS (Zhang et al., 2018).
|w/o lexical score||84.3||86.5|
|w/o statistical score||92.8||93.9|
|w/o edit distance||93.9||94.4|
|LCS (Zhang et al., 2018)||91.3||92.2|
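Precision and F1 for an alignment can be computed by comparing predicted and gold clause pairs as sets, as in the sketch below. Treating a predicted pair as correct only when it matches a gold pair exactly is an assumption, since the matching criterion is not spelled out in the text. The function name `alignment_prf` is hypothetical.

```python
from typing import Iterable, Tuple

Pair = Tuple[str, str]


def alignment_prf(predicted: Iterable[Pair],
                  gold: Iterable[Pair]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) under exact-match comparison of pairs."""
    predicted_set, gold_set = set(predicted), set(gold)
    correct = len(predicted_set & gold_set)
    precision = correct / len(predicted_set) if predicted_set else 0.0
    recall = correct / len(gold_set) if gold_set else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

With three predicted pairs of which two appear among four gold pairs, this gives a precision of 2/3, a recall of 1/2 and an F1 of 4/7.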
3.2 Translation Results
We train SMT and various NMT-based models on the dataset.
NMT. The basic NMT model is based on (Bahdanau et al., 2014). Furthermore, we tested the basic NMT model with several techniques, such as target language reversal, residual connection (He et al., 2016) and pre-trained word2vec (Mnih and Kavukcuoglu, 2013).
Transformer. We also trained the Transformer model (Vaswani et al., 2017) which is a strong baseline.
The hyper-parameters and generated samples of the above models are shown in the Appendix. To verify the effectiveness of our data augmentation method, we tested the NMT and SMT models on both the unaugmented dataset (including 0.46M training pairs) and the augmented dataset.
For evaluation, we used the average of 1- to 4-gram BLEU scores (Papineni et al., 2002), computed by multi-bleu.perl in Moses, as the metric. The results are reported in Table 3. For NMT, target language reversal, residual connections, and word2vec each further improve the performance of the basic NMT model. Moreover, the results show that training the NMT model on augmented data greatly improves performance. SMT performs better than the NMT models when both are trained on the unaugmented dataset. However, when trained on the augmented dataset, the NMT model outperforms the SMT model, which indicates that a large amount of data is necessary for the NMT model. In addition, training the Transformer is very fast, but its performance is slightly worse than that of the SMT and full NMT models.
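One possible reading of "average of 1- to 4-gram BLEU" is sketched below: compute corpus-level BLEU-n for n = 1 to 4, each as the brevity penalty times the geometric mean of the modified precisions up to order n, and then average the four scores. This is an interpretation, not a reimplementation of multi-bleu.perl. It assumes a single reference per sentence and pre-tokenized input, and the function names are hypothetical.

```python
import math
from collections import Counter
from typing import List, Sequence


def ngram_counts(tokens: Sequence[str], n: int) -> Counter:
    """Count the n-grams of a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hyps: List[Sequence[str]], refs: List[Sequence[str]], max_n: int) -> float:
    """Single-reference corpus BLEU (0 to 100) using n-gram orders 1..max_n."""
    matches, totals = [0] * max_n, [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hyps, refs):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            hyp_counts, ref_counts = ngram_counts(hyp, n), ngram_counts(ref, n)
            # Clipped counts: an n-gram is credited at most as often as in the reference.
            matches[n - 1] += sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
            totals[n - 1] += sum(hyp_counts.values())
    if min(matches) == 0:
        return 0.0  # some order has no match, so the geometric mean is zero
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity = 1.0 if hyp_len >= ref_len else math.exp(1.0 - ref_len / hyp_len)
    return 100.0 * brevity * math.exp(log_precision)


def average_bleu(hyps: List[Sequence[str]], refs: List[Sequence[str]]) -> float:
    """Mean of BLEU-1, BLEU-2, BLEU-3 and BLEU-4."""
    return sum(corpus_bleu(hyps, refs, n) for n in range(1, 5)) / 4.0
```

For Chinese output, the token sequences could be characters or segmented words; which of the two was used for the reported scores is not stated in the text.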
We propose an effective Ancient-Modern Chinese clause alignment method which achieves a 94.2 F1-score on the test set. Based on it, we build a large-scale parallel corpus which contains 1.24M bilingual sentence pairs. To the best of our knowledge, this is the first large high-quality Ancient-Modern Chinese dataset. In addition, we provide a strong NMT baseline for this task which achieves a 25.42 BLEU score (4-gram).
5.1 NMT Configurations
The basic NMT model is based on (Bahdanau et al., 2014). Both the encoder and decoder use a 2-layer RNN with 1024 LSTM cells, and the encoder is a bi-directional RNN. The batch size, the threshold of element-wise gradient clipping and the initial learning rate of the Adam optimizer (Kingma and Ba, 2014) were set to 128, 5.0 and 0.001, respectively. When training the model on the augmented dataset, we used a 4-layer RNN. Several techniques were investigated to train the model, including layer normalization (Ba et al., 2016), RNN dropout (Gal and Ghahramani, 2016), and learning rate decay (Wu et al., 2016). The hyper-parameters were chosen empirically and adjusted on the validation set. For word embedding pre-training, we collected an external ancient corpus which contains 134M tokens.
5.2 Transformer Configurations
|Inner Hidden Size||1024|
|Word Embedding Size||512|
- Ba et al. (2016) Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. 2016. Layer normalization. arXiv preprint arXiv:1607.06450.
- Bahdanau et al. (2014) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
- Brown et al. (1991) Peter F Brown, Jennifer C Lai, and Robert L Mercer. 1991. Aligning sentences in parallel corpora. In Proceedings of the 29th annual meeting on Association for Computational Linguistics, pages 169–176. Association for Computational Linguistics.
- Gal and Ghahramani (2016) Yarin Gal and Zoubin Ghahramani. 2016. A theoretically grounded application of dropout in recurrent neural networks. In Advances in neural information processing systems, pages 1019–1027.
- Gale and Church (1993) William A Gale and Kenneth W Church. 1993. A program for aligning sentences in bilingual corpora. Computational linguistics, 19(1):75–102.
- He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778.
- Heafield (2011) Kenneth Heafield. 2011. KenLM: Faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 187–197. Association for Computational Linguistics.
- Kaji et al. (1992) Hiroyuki Kaji, Yuuko Kida, and Yasutsugu Morimoto. 1992. Learning translation templates from bilingual text. In Proceedings of the 14th conference on Computational linguistics-Volume 2, pages 672–678. Association for Computational Linguistics.
- Kingma and Ba (2014) Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- Kit et al. (2004) Chunyu Kit, Jonathan J Webster, King-Kui Sin, Haihua Pan, and Heng Li. 2004. Clause alignment for Hong Kong legal texts: A lexical-based approach. International Journal of Corpus Linguistics, 9(1):29–51.
- Koehn et al. (2007) Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, et al. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th annual meeting of the ACL on interactive poster and demonstration sessions, pages 177–180. Association for Computational Linguistics.
- Lin and Wang (2007) Zhun Lin and Xiaojie Wang. 2007. Chinese ancient-modern sentence alignment. In International Conference on Computational Science, pages 1178–1185. Springer.
- Liu and Wang (2012) Ying Liu and Nan Wang. 2012. Sentence alignment for ancient and modern Chinese parallel corpus. In Emerging Research in Artificial Intelligence and Computational Intelligence, pages 408–415. Springer.
- Mnih and Kavukcuoglu (2013) Andriy Mnih and Koray Kavukcuoglu. 2013. Learning word embeddings efficiently with noise-contrastive estimation. In Advances in neural information processing systems, pages 2265–2273.
- Papineni et al. (2002) Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics, pages 311–318. Association for Computational Linguistics.
- Sutskever et al. (2014) Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pages 3104–3112.
- Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 6000–6010.
- Wang and Ren (2005) Xiaojie Wang and Fuji Ren. 2005. Chinese-Japanese clause alignment. In International Conference on Intelligent Text Processing and Computational Linguistics, pages 400–412. Springer.
- Wu et al. (2016) Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.
- Zhang et al. (2018) Zhiyuan Zhang, Wei Li, and Xu Sun. 2018. Automatic transferring between ancient Chinese and contemporary Chinese. arXiv preprint arXiv:1803.01557.