Apply Chinese Radicals Into Neural Machine Translation: Deeper Than Character Level
In neural machine translation (NMT), researchers face the challenge of translating unseen, or out-of-vocabulary (OOV), words. To address this, some researchers have proposed splitting words of western languages such as English and German into sub-words or compounds. In this paper, we try to address the OOV issue and improve NMT adequacy for a harder language, Chinese, whose characters are even more sophisticated in composition. We integrate Chinese radicals into the NMT model with different settings to address the unseen-word challenge in Chinese-to-English translation. This can also be considered a semantic component of the MT system, since Chinese radicals usually carry the essential meaning of the characters they are part of. Meaningful radicals and new characters can be integrated into NMT systems with our models. We use an attention-based NMT system as a strong baseline. Experiments on the standard Chinese-to-English NIST translation shared task data from 2006 and 2008 show that our models outperform the baseline on a wide range of state-of-the-art evaluation metrics, including LEPOR, BEER, and CharacTER, in addition to the traditional BLEU and NIST scores, especially in translation adequacy.
We also report some interesting findings from our various experimental settings about the performance of words and characters in Chinese NMT, which differs from that in other languages. For instance, fully character-level NMT may perform very well, or even reach the state of the art, in some other languages, as researchers have recently demonstrated; in the Chinese NMT model, however, word boundary knowledge is important for model learning.
* Parallel authors, ranked in decreasing alphabetical order.
Keywords: Machine Translation, Chinese-English Translation, Chinese Radicals, Neural Networks, Translation Evaluation
School of Computing, Dublin City University, Dublin, Ireland. Email: firstname.lastname@example.org
1 Introduction
We first briefly introduce the development of machine translation and then turn to the existing issues that we try to address. Machine Translation (MT) has a long history, dating from the 1950s, as one topic of artificial intelligence (AI) or intelligent machines. It began with rule-based MT (RBMT) systems, which apply human-defined syntactic and semantic rules of the source and target languages, and moved on to example-based MT (EBMT), statistical MT (SMT), hybrid MT (e.g. the combination of RBMT and SMT), and, in recent years, neural MT (NMT) models [48, 11, 33, 2]. MT gained much more attention from researchers after the IBM mathematical models were proposed in the 1990s. Representative SMT work includes word alignment models, the introduction of Minimum Error Rate Training (MERT), phrase-based SMT, hierarchical structure models, and the development of large parallel corpora.
Meanwhile, many research groups developed open-source tools to advance MT technology, such as Moses, featuring statistical phrase-based MT; Joshua, featuring parsing-based translation; Phrasal, incorporating arbitrary model features; CDEC, favoring finite-state and context-free translation models; and NiuTrans, featuring syntax-based models. Some information technology (IT) companies also built their own systems, such as the machine translators by Google (translate.google.com), Baidu (translate.baidu.com), Yandex (translate.yandex.com), and Microsoft Bing (www.bing.com/translator).
Building on work on word-to-vector embeddings, NMT was introduced in [28, 14, 2] by utilizing both deep learning (DL) and word representation (WR) approaches. Earlier NMT structures did not work out, which may be due to the limited computational power of machines and the limited amount of available corpora at the time, though neural networks were also explored later as sub-components in SMT, e.g. as language models to smooth or re-rank the system output candidates [53, 4]. One boost for NMT research was the launch of the 1st NMT Workshop by Google (sites.google.com/site/acl17nmt/home), in addition to the traditional WMT workshops (www.statmt.org/wmt17/).
NMT models treat the MT task as an encoder-decoder workflow, which is very different from the conventional SMT structure [15, 30]. The encoder operates on the source-language side, encoding sentences into vector representations, while the decoder operates on the target-language side, generating words from the target-side vectors. Recurrent neural network (RNN) models are usually used for both the encoder and the decoder, though some researchers employ convolutional neural networks (CNN), as in [14, 28]. The hidden layers in the neural networks are designed to learn and transfer the information.
NMT models had some drawbacks, e.g. the lack of alignment information between the source and target sides, and lower transparency. To address these, the attention mechanism was first introduced to the decoder so that, when translating, the model selectively attends to parts of the source sentence instead of always using the whole sentence. This idea is similar to the alignment functions in SMT and to what human translators usually do when they undertake a translation task. Earlier, attention mechanisms were applied in neural networks for image processing tasks [36, 18]. Recently, attention-based models have appeared in most NMT projects, such as the investigation of global attention-based architectures and target information for pure-text NMT, and the exploration of multi-modal NMT. To generalize the attention mechanism on the source-language side, coverage models were introduced into NMT to balance the weights of different parts of the sentence [59, 44].
Another drawback of NMT is that NMT systems usually produce more fluent output, but the adequacy is sometimes lower than that of conventional SMT; e.g. some meaning from the source sentence is lost in the translation when the sentence is long [58, 59, 34, 47, 14]. Besides the unclear learning procedure of neural networks, one reason for this phenomenon could be the unseen-word problem. Under this assumption, we try to address the unseen-word, or out-of-vocabulary (OOV), issue and improve adequacy by integrating Chinese radicals into NMT.
To illustrate Chinese radicals, consider two examples of how they appear in characters. Figure 1 shows three Chinese characters (forest, tree, bridge) that contain the same radical (wood), and this radical can also be used independently as a character. Historically, bridges in China were usually built of wood, so these three characters carry a similar meaning: they all refer to something related to wood.
Figure 2 shows three Chinese characters (grass, medicine, tea) that contain the same radical (grass); this radical, however, cannot be used independently as a character. It meant grass in the original development of the Chinese language. Historically, Chinese medicine was usually developed from natural things such as grass, and Chinese tea usually comes from leaves, which are related to grass.
To the best of the authors' knowledge, there is no published work yet on radical-level NMT for Chinese. Section 2 presents related work, Section 3 our model design, Section 4 the experiments, and Section 5 the conclusion and future work.
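The idea that characters sharing a radical are semantically related can be sketched with a small lookup table. The specific characters below are our own illustrative choices for the glosses in the text (they are an assumption, not taken from the figures), and the table is hand-written rather than produced by any decomposition toolkit.

```python
from collections import defaultdict

# Assumed, hand-written table: character -> shared semantic radical.
# Glosses follow the examples in the text; the characters are illustrative.
SEMANTIC_RADICAL = {
    "林": "木",  # forest   -> wood radical
    "树": "木",  # tree     -> wood radical
    "桥": "木",  # bridge   -> wood radical
    "草": "艹",  # grass    -> grass radical
    "药": "艹",  # medicine -> grass radical
    "茶": "艹",  # tea      -> grass radical
}

def group_by_radical(table):
    """Group characters by the radical they share."""
    groups = defaultdict(list)
    for char, radical in table.items():
        groups[radical].append(char)
    return dict(groups)

groups = group_by_radical(SEMANTIC_RADICAL)
for radical, chars in groups.items():
    print(radical, chars)
```

A character unseen in training may still share a radical with seen characters, which is the intuition behind using radicals against the OOV problem.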
2 Related Work
MT models have been developed using progressively smaller units, i.e. from phrase level to word level, sub-word level and character level [31, 54, 16]. For Chinese, however, the sub-character or radical level is also an interesting topic, since Chinese radicals carry essential meanings of the characters they are part of. Some radicals split from characters can stand as independent characters, while other radicals cannot, though they also have meanings. It is therefore interesting to see how these radicals, or their combination with traditional words and characters, perform in NMT systems.
Some MT researchers have incorporated word composition knowledge into their systems, especially for western languages. For instance, a machine translation model has been developed for English-German and English-Finnish that synthesizes compound words. This kind of knowledge is similar to splitting a Chinese character into new characters.
3 Model Design
This section introduces the baseline attention-based NMT model and our model.
3.1 Attention-based NMT
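Typically, as mentioned before, neural machine translation (NMT) builds on an encoder-decoder framework [2, 57] based on recurrent neural networks (RNN). In this paper, we adopt the RNNSearch architecture. In this NMT system, the encoder applies a bidirectional RNN to encode a source sentence $x = (x_1, x_2, \ldots, x_{T_x})$ and generates the hidden vectors $h = (h_1, h_2, \ldots, h_{T_x})$ over the source sentence, where $T_x$ is the length of the source sentence. Formally, $h_j = [\overrightarrow{h}_j; \overleftarrow{h}_j]$ is the concatenation of the forward RNN hidden state $\overrightarrow{h}_j$ and the backward RNN hidden state $\overleftarrow{h}_j$, and $\overrightarrow{h}_j$ can be computed as follows:
$$\overrightarrow{h}_j = f(\overrightarrow{h}_{j-1}, x_j)$$
where the function $f$ is defined as a Gated Recurrent Unit (GRU).
The decoder is also an RNN that predicts the next word $y_t$ given the context vector $c_t$, the hidden state of the decoder $s_t$ and the previously predicted word $y_{t-1}$, which is computed by:
$$p(y_t \mid y_{<t}, x) = \mathrm{softmax}(g(y_{t-1}, s_t, c_t))$$
where $g$ is a non-linear function and $s_t$ is the state of the decoder RNN at time step $t$, which is calculated by:
$$s_t = f(s_{t-1}, y_{t-1}, c_t)$$
where $c_t$ is the context vector representing the source sentence.
Usually, $c_t$ is obtained by the attention model and calculated as follows:
$$c_t = \sum_{j=1}^{T_x} \alpha_{tj} h_j, \qquad \alpha_{tj} = \frac{\exp(e_{tj})}{\sum_{k=1}^{T_x} \exp(e_{tk})}, \qquad e_{tj} = v_a^{\top} \tanh(W_a s_{t-1} + U_a h_j)$$
We also follow the attention-based NMT implementation of the dl4mt tutorial (github.com/nyu-dl/dl4mt-tutorial/tree/master/session2), which enhances the attention model by feeding the previous word $y_{t-1}$ to it; therefore $e_{tj}$ is calculated by:
$$e_{tj} = v_a^{\top} \tanh(W_a \tilde{s}_{t-1} + U_a h_j)$$
where $\tilde{s}_{t-1} = f_1(s_{t-1}, y_{t-1})$, and $f_1$ is a GRU function. The hidden state of the decoder is then updated by a second function $f_2$ as follows:
$$s_t = f_2(\tilde{s}_{t-1}, c_t)$$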
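As a rough illustration of the attention step described in this section, the sketch below computes attention weights and a context vector in plain Python. It assumes the standard additive scoring form for RNNSearch-style models, score_j = v_a · tanh(W_a s + U_a h_j); the parameter names `W_a`, `U_a`, `v_a` and the toy values are ours, and this is not the dl4mt code.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

def additive_attention(s_prev, H, W_a, U_a, v_a):
    """Return (context, alpha) for decoder state s_prev and encoder states H.

    H holds one hidden vector per source position.
    """
    Ws = matvec(W_a, s_prev)                      # shared across positions
    scores = []
    for h_j in H:
        Uh = matvec(U_a, h_j)
        hidden = [math.tanh(a + b) for a, b in zip(Ws, Uh)]
        scores.append(sum(v * t for v, t in zip(v_a, hidden)))
    alpha = softmax(scores)                       # weights sum to 1
    dim = len(H[0])
    context = [sum(alpha[j] * H[j][d] for j in range(len(H)))
               for d in range(dim)]               # weighted sum of encoder states
    return context, alpha

# Toy example: three source positions, 2-dimensional states.
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
s_prev = [0.5, -0.5]
W_a = [[1.0, 0.0], [0.0, 1.0]]
U_a = [[1.0, 0.0], [0.0, 1.0]]
v_a = [1.0, 1.0]
context, alpha = additive_attention(s_prev, H, W_a, U_a, v_a)
print(alpha, context)
```

In the dl4mt-style variant described above, `s_prev` would be the intermediate state obtained from the previous decoder state and the previous word, rather than the previous decoder state itself.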
In this paper, we use the attention-based NMT with the changes from the dl4mt tutorial (github.com/nyu-dl/dl4mt-tutorial) as our baseline and call it RNNSearch*, to distinguish it from the original RNNSearch.
3.2 Our model
Traditional NMT models usually use word-level or character-level information as the input of the encoder, which ignores some knowledge of the source sentence, especially for Chinese. Chinese words are usually composed of multiple characters, and characters can be further split into radicals. Chinese character construction is very complicated, varying from upper-lower, left-right, and inside-outside structures to combinations of them. In this paper, we use radicals, characters and words as multiple inputs to NMT and expect that the NMT model can learn more useful features from the integration of these different input levels.
Figure 3 illustrates our proposed model. The input embedding $x_j$ consists of three parts: the word embedding $w_j$, the character embedding $z_j$ (we use 'z' rather than 'c' for characters, because 'c' already denotes the context vector) and the radical embedding $r_j$, as follows:
$$x_j = [w_j; z_j; r_j]$$
where ';' is the concatenation operation.
A word $w_j$ can be split into characters $(z_{j1}, z_{j2}, \ldots, z_{jm})$ and further split into radicals $(r_{j1}, r_{j2}, \ldots, r_{jn})$. In our model, we use simple addition to obtain the character representation and the radical representation of the word, i.e. $z_j$ and $r_j$ are computed as follows:
$$z_j = \sum_{k=1}^{m} z_{jk}, \qquad r_j = \sum_{k=1}^{n} r_{jk}$$
Each word can be decomposed into a different number of characters and radicals, and the addition operation yields a fixed-length representation. In principle, our model can handle different levels of input and their combinations. For Chinese character decomposition, i.e. generating the radicals, we use the HanziJS open-source toolkit (github.com/nieldlr/Hanzi). For the target vocabulary, we use a size of 30,000.
4 Experiments
In this section, we introduce our experimental settings and the evaluation of the designed models.
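The composition described above (sum the character embeddings, sum the radical embeddings, then concatenate both with the word embedding) can be sketched as follows. This is a minimal illustration under our own assumptions: the function names are hypothetical, and the toy dimension of 2 stands in for the 620 used in the experiments.

```python
def vec_sum(vectors, dim):
    """Element-wise sum of a list of dim-sized vectors."""
    total = [0.0] * dim
    for v in vectors:
        for i, x in enumerate(v):
            total[i] += x
    return total

def word_input(word_vec, char_vecs, radical_vecs, dim):
    """Build the encoder input [w; z; r] for one word.

    char_vecs and radical_vecs may have any length; summation makes the
    result fixed-size (3 * dim) regardless of how the word decomposes.
    """
    z = vec_sum(char_vecs, dim)      # character representation of the word
    r = vec_sum(radical_vecs, dim)   # radical representation of the word
    return list(word_vec) + z + r    # concatenation

# Toy word with two characters and three radicals, dim = 2.
x = word_input(
    [1.0, 2.0],
    [[0.5, 0.5], [1.0, -1.0]],
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    dim=2,
)
print(x)
```

Summation loses the order of characters and radicals within a word, which is a simplification the addition-based design accepts in exchange for a fixed-length input.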
4.1 Experiments Setting
We used 1.25 million parallel Chinese-English sentences for training, which contain 80.9 million Chinese words and 86.4 million English words. The data is mainly from the Linguistic Data Consortium (LDC, www.ldc.upenn.edu) parallel corpora, such as LDC2002E18, LDC2003E07, LDC2003E14, LDC2004T07, LDC2004T08, and LDC2005T06.
We tune the models on NIST06 as development data using the BLEU metric, and use the NIST08 Chinese-English parallel corpus, with four references, as test data. (NIST, the National Institute of Standards and Technology, organized yearly MT evaluation shared tasks and released data for researchers to compare their models.)
For the baseline model RNNSearch*, in order to train the model effectively, we limit the maximum sentence length on both the source and target sides to 50. We also limit both the source and target vocabularies to the most frequent 30k words and replace rare words with a special token “UNK” in Chinese and English. The vocabularies cover approximately 97.7% and 99.3% of the two corpora, respectively. Both the encoder and the decoder of RNNSearch* have 1000 hidden units. The encoder of RNNSearch* is a bidirectional RNN consisting of a forward RNN (1000 hidden units) and a backward RNN. The word embedding dimension is set to 620. We apply the dropout strategy on the output layer. We use mini-batch stochastic gradient descent with Adadelta to train the model. The parameters ρ and ε of Adadelta are set to 0.95 and . Once the RNNSearch* model is trained, we adopt beam search to find possible translations with high probabilities. We set the beam width of RNNSearch* to 10. The model parameters are selected according to the maximum BLEU score on the development set.
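The preprocessing steps in this paragraph (length filtering, vocabulary truncation, "UNK" replacement, and vocabulary coverage) can be sketched as below. This is an illustrative sketch, not the authors' pipeline: the function names are hypothetical, and the toy corpus uses a vocabulary size of 2 in place of 30k.

```python
from collections import Counter

def keep_pair(src, tgt, max_len=50):
    """Keep a sentence pair only if both sides have at most max_len tokens."""
    return len(src) <= max_len and len(tgt) <= max_len

def build_vocab(sentences, max_size):
    """Return the max_size most frequent tokens and their token coverage."""
    counts = Counter(tok for sent in sentences for tok in sent)
    vocab = {tok for tok, _ in counts.most_common(max_size)}
    covered = sum(c for tok, c in counts.items() if tok in vocab)
    coverage = covered / sum(counts.values())
    return vocab, coverage

def replace_rare(sentence, vocab, unk="UNK"):
    """Replace every out-of-vocabulary token with the UNK symbol."""
    return [tok if tok in vocab else unk for tok in sentence]

corpus = [["a", "b", "a"], ["a", "c", "b"], ["d", "a"]]
vocab, coverage = build_vocab(corpus, max_size=2)
print(sorted(vocab), coverage)
print(replace_rare(["a", "d", "b"], vocab))
```

Coverage here is measured over running tokens, which is how a 30k vocabulary can cover roughly 98 to 99 percent of a corpus while still leaving many distinct word types as "UNK".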
For our proposed model, all the experimental settings are the same as for RNNSearch*, except for the word-embedding dimension and the vocabulary sizes. In our model, we set the word, character and radical embeddings to the same dimension, 620. The vocabulary sizes for words, characters and radicals are set to 30k, 2.5k and 1k, respectively.
To integrate character radicals into the NMT system, we designed several settings, as demonstrated in the table, that combine word (W), character (C) and radical (R) inputs: setting one uses W+C+R, setting two uses W+C, setting three uses W+R, and setting four uses C+R. Both the baseline and our settings use the attention-based NMT structure.
4.2 Evaluations
In this section, we introduce the evaluation metrics used for the designed models.
First, many studies have pointed out the insufficiency of the BLEU metric: higher or lower BLEU scores do not necessarily reflect improvements or declines in model quality; BLEU scores are not interpretable for many translation professionals; and BLEU does not correlate with human judgments better than later-developed metrics in some language pairs [10, callison2007meta, lavie2013automated].
In light of such analyses, we validate our work in a deeper and broader evaluation setting. In addition to BLEU and NIST, we use a range of state-of-the-art MT evaluation metrics developed in recent years for a more comprehensive evaluation, including hLEPOR, CharacTER, and BEER.
hLEPOR is a tunable translation evaluation metric that yields higher correlation with human judgments by adding an n-gram position difference penalty factor to the traditional F-measure. CharacTER is a character-level edit distance rate metric. BEER uses permutation trees and character n-grams, integrating many features such as paraphrase and syntax. These metrics have shown top performance in recent years' WMT (www.statmt.org/wmt17/metrics-task.html) shared tasks [42, 41, 21, 6].
Both CharacTER and BEER achieved joint top performance in correlation with human judgment on Chinese-to-English MT evaluation in the WMT-17 shared tasks. The LEPOR metric series has been evaluated by MT researchers as one of the most distinguished metric families, not clearly outperformed by others, as stated in a metrics comparison work on standard WMT data.
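To make the notion of a character-level edit distance rate concrete, here is a deliberately simplified sketch: plain Levenshtein distance over characters, divided by the hypothesis length. Both the normalization choice and the omission of any shift operation are assumptions of this sketch; it is not the reference CharacTER implementation.

```python
def levenshtein(a, b):
    """Minimum number of character insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def char_edit_rate(hyp, ref):
    """Character edits per hypothesis character; lower is better."""
    if not hyp:
        return 0.0 if not ref else 1.0
    return levenshtein(hyp, ref) / len(hyp)

print(char_edit_rate("the cat sat", "the cat sits"))
```

Because the score counts edits, a lower value indicates a hypothesis closer to the reference, which is why edit-distance metrics are read in the opposite direction from BLEU.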
4.2.1 Evaluation on Development Set
On the development set NIST06, we obtained the following evaluation scores.
The cumulative n-gram scores of the BLEU and NIST metrics are shown in the tables, with bold marking the winner in each n-gram column. Researchers usually report 4-gram BLEU and 5-gram NIST scores, so we follow this convention here.
From the scores, we can see that model setting one, i.e. W+C+R, outperformed the baseline model in all uni-gram to 4-gram BLEU scores and uni-gram to 5-gram NIST scores. Furthermore, by adding characters and/or radicals to the words, model settings two and three also outperformed the baseline model. However, setting four, which used only character and radical information, scored lower than the word-level baseline in both BLEU and NIST. This suggests that, for Chinese NMT, word segmentation knowledge provides important guidance for translation model learning.
For the uni-gram BLEU score, our model one scores 2.1 points higher than the baseline model, which means that combining W+C+R yields translations of higher adequacy, though the fluency score (4-gram) does not differ much. This is exactly the aspect of neural models that we want to improve, and that many researchers have criticized.
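The contrast between uni-gram and higher-order n-gram scores can be illustrated with the clipped (modified) n-gram precision that BLEU is built on. The sketch below is sentence-level and omits the brevity penalty and the geometric mean over orders, so it is not a full BLEU implementation; the function names are ours.

```python
from collections import Counter

def ngrams(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(hyp, refs, n):
    """Clipped n-gram precision of hyp against one or more references."""
    hyp_counts = ngrams(hyp, n)
    if not hyp_counts:
        return 0.0
    max_ref = Counter()
    for ref in refs:
        for gram, c in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], c)  # best count over references
    clipped = sum(min(c, max_ref[gram]) for gram, c in hyp_counts.items())
    return clipped / sum(hyp_counts.values())

hyp = "the cat sat on mat".split()
refs = ["the cat sat on the mat".split()]
p1 = modified_precision(hyp, refs, 1)  # every word is found in the reference
p2 = modified_precision(hyp, refs, 2)  # "on mat" is not a reference bigram
print(p1, p2)
```

In this toy case all the right words are present (uni-gram precision 1.0) while word order is imperfect (bi-gram precision 0.75), mirroring the reading of low-order scores as adequacy and high-order scores as fluency.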
The evaluation scores on the broader state-of-the-art metrics are shown in the following table. Since CharacTER is an edit-distance-based metric, a lower score means a better translation.
Table: Metrics on Single Reference
From the broader evaluation metrics, we can see that the best score on every metric is achieved by one of our designed models rather than by the baseline system. Our model setting one, i.e. the W+C+R model, won on both BEER and CharacTER, while our model two, i.e. W+C, won on hLEPOR; setting four continued to perform the worst, which is consistent with the BLEU and NIST results. Interestingly, the CharacTER scores of settings two and three are both worse than the baseline, which means that adding character or radical information separately yields output that needs more editing effort; however, when we add both character and radical information to the model, i.e. setting one, the editing effort becomes less than for the baseline.
4.2.2 Evaluation on Test Sets
The evaluation results on the NIST08 Chinese-to-English test data are presented in this section.
First, we show the BLEU and NIST scores, computed with four reference translations and the case-insensitive setting. The tables show the cumulative n-gram scores of BLEU and NIST, with bold marking the winner in each n-gram column.
The results show that our model setting one won on both BLEU and NIST for every n-gram evaluation scheme. Model setting three, i.e. the W+R model, beat the baseline on the uni-gram and bi-gram BLEU scores and obtained NIST scores very close to the baseline. Furthermore, model setting four, i.e. C+R, again ranked the worst, which supports the view that word segmentation information and word boundaries are indeed helpful to Chinese translation models and cannot be omitted.
It is worth mentioning that the detailed BLEU scores show that our model one yields a higher uni-gram BLEU score (by 1.58), similar to the results on the development data, and a slightly higher 4-gram score (by 0.25). This means that our translation is similar in fluency to the state-of-the-art baseline, but our model yields much more adequate translations, since uni-gram BLEU reflects adequacy rather than fluency. This verifies the value of our model for the original problem we set out to address.
The evaluation results on the advanced metrics of recent years are shown below. These scores are also computed with the four-reference scheme: we take the average score of each metric over the four references as the final evaluation score. Bold marks the winner, as before.
Table: Metrics Evaluated on 4-references
From the broader evaluations, we can see that our model setting one won on both the LEPOR and BEER metrics. Though the baseline model won on the CharacTER metric, the margin between the baseline score (.9846) and that of our model three, i.e. W+R, (.9882) is quite small, around 0.0036. Again, setting four with C+R performed the worst, which confirms our previous findings.
5 Conclusion and Future Work
We presented the performance of multiple model settings that integrate Chinese characters and radicals into state-of-the-art attention-based neural machine translation systems, which can help other researchers gain general insight into how radicals work.
Our results show that the fully character+radical setting is not sufficient or suitable for Chinese translation, which differs from work on western languages. They also show that word segmentation and word boundaries are helpful knowledge for Chinese translation systems.
Even though our model settings won on both the traditional BLEU and NIST metrics, the advanced metrics developed in recent years showed some differences and interesting phenomena, especially the character-level translation error rate metric CharacTER. This can encourage MT researchers to use state-of-the-art metrics to gain useful insight into their models.
Although the combination of words, characters and radicals mostly yielded the best scores, the broad evaluations also showed that the W+R setting, i.e. using both word and radical information, is generally better than the W+C setting, i.e. words plus characters without radicals, which verifies the value of our work in bringing radicals into Chinese NMT. Our model one yielded much more adequate translation output (by uni-gram BLEU score) than the baseline system, which also shows that this work is important for exploring how to improve the adequacy of neural models.
In future work, we will continue to optimize our models and use more test data to verify their performance. In this work, we aimed to explore the effectiveness of Chinese radicals, so we did not use BPE to split the English side; however, to advance the state of the art in Chinese-English translation, in a future extension we will apply splitting on both the Chinese and English sides. We will also investigate the use of Chinese radicals in MT evaluation, since they carry the meaning of the language.
The author Han thanks Ahmed Abdelkader for the kind help, and Niel de la Rouviere for the HanziJS toolkit. This work was supported by Soochow University of China and Dublin City University of Ireland.
[1] Aharoni, R., Goldberg, Y.: Towards string-to-tree neural machine translation. CoRR abs/1704.04743 (2017), http://arxiv.org/abs/1704.04743
[2] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. CoRR abs/1409.0473 (2014), http://arxiv.org/abs/1409.0473
[3] Bastings, J., Titov, I., Aziz, W., Marcheggiani, D., Sima’an, K.: Graph convolutional encoders for syntax-aware neural machine translation. arXiv preprint https://arxiv.org/abs/1704.04675 (2017)
[4] Bengio, Y., Ducharme, R., Vincent, P., Janvin, C.: A neural probabilistic language model. J. Mach. Learn. Res. 3, 1137–1155 (Mar 2003), http://dl.acm.org/citation.cfm?id=944919.944966
[5] Bojar, O., Graham, Y., Kamran, A.: Results of the WMT17 metrics shared task. In: Proceedings of the Second Conference on Machine Translation, Volume 2: Shared Tasks Papers. Association for Computational Linguistics, Copenhagen, Denmark (September 2017)
[6] Bojar, O., Graham, Y., Kamran, A., Stanojević, M.: Results of the WMT16 metrics shared task. In: Proceedings of the First Conference on Machine Translation. pp. 199–231. Association for Computational Linguistics, Berlin, Germany (August 2016), http://www.aclweb.org/anthology/W/W16/W16-2302
[7] Brown, P.F., Pietra, V.J.D., Pietra, S.A.D., Mercer, R.L.: The mathematics of statistical machine translation: Parameter estimation. Comput. Linguist. 19(2), 263–311 (Jun 1993), http://dl.acm.org/citation.cfm?id=972470.972474
[8] Caglayan, O., Barrault, L., Bougares, F.: Multimodal attention for neural machine translation. CoRR abs/1609.03976 (2016), http://arxiv.org/abs/1609.03976
[9] Calixto, I., Liu, Q., Campbell, N.: Multilingual multi-modal embeddings for natural language processing. CoRR abs/1702.01101 (2017), http://arxiv.org/abs/1702.01101
[10] Callison-Burch, C., Osborne, M., Koehn, P.: Re-evaluating the role of BLEU in machine translation research. In: Proceedings of EACL. vol. 2006, pp. 249–256 (2006)
[11] Carl, M., Way, A.: Recent advances in example-based machine translation (2003)
[12] Cer, D., Galley, M., Jurafsky, D., Manning, C.D.: Phrasal: A toolkit for statistical machine translation with facilities for extraction and incorporation of arbitrary model features. In: Proceedings of the NAACL HLT 2010 Demonstration Session. pp. 9–12. HLT-DEMO ’10, Association for Computational Linguistics, Stroudsburg, PA, USA (2010), http://dl.acm.org/citation.cfm?id=1855450.1855453
[13] Chiang, D.: A hierarchical phrase-based model for statistical machine translation. In: Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics. pp. 263–270. ACL ’05, Association for Computational Linguistics, Stroudsburg, PA, USA (2005). https://doi.org/10.3115/1219840.1219873
[14] Cho, K., van Merrienboer, B., Bahdanau, D., Bengio, Y.: On the properties of neural machine translation: Encoder-decoder approaches. CoRR abs/1409.1259 (2014), http://arxiv.org/abs/1409.1259
[15] Cho, K., van Merrienboer, B., Gulcehre, C., Bougares, F., Schwenk, H., Bengio, Y.: Learning phrase representations using RNN encoder-decoder for statistical machine translation. In: Conference on Empirical Methods in Natural Language Processing (EMNLP 2014) (2014)
[16] Chung, J., Cho, K., Bengio, Y.: A character-level decoder without explicit segmentation for neural machine translation. In: ACL (2016)
[17] Chung, J., Gulcehre, C., Cho, K., Bengio, Y.: Empirical evaluation of gated recurrent neural networks on sequence modeling. Presented in NIPS 2014 Deep Learning and Representation Learning Workshop (2014)
[18] Denil, M., Bazzani, L., Larochelle, H., de Freitas, N.: Learning where to attend with deep architectures for image tracking. CoRR abs/1109.3737 (2011), http://arxiv.org/abs/1109.3737
[19] Dyer, C., Weese, J., Setiawan, H., Lopez, A., Ture, F., Eidelman, V., Ganitkevitch, J., Blunsom, P., Resnik, P.: cdec: A decoder, alignment, and learning framework for finite-state and context-free translation models. In: Proceedings of the ACL 2010 System Demonstrations. pp. 7–12. ACLDemos ’10, Association for Computational Linguistics, Stroudsburg, PA, USA (2010), http://dl.acm.org/citation.cfm?id=1858933.1858935
[20] Elliott, D., Kádár, Á.: Imagination improves multimodal translation. CoRR abs/1705.04350 (2017), http://arxiv.org/abs/1705.04350
[21] Graham, Y., Mathur, N., Baldwin, T.: Accurate evaluation of segment-level machine translation metrics. In: Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics Human Language Technologies. Denver, Colorado (2015)
[22] Han, A.L.F., Wong, D.F., Chao, L.S., He, L., Lu, Y., Xing, J., Zeng, X.: Language-independent model for machine translation evaluation with reinforced factors. In: Machine Translation Summit XIV. pp. 215–222. International Association for Machine Translation (2013)
[23] Hinton, G.E., Srivastava, N., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.R.: Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580 (2012)
[24] Huang, P.Y., Liu, F., Shiang, S.R., Oh, J., Dyer, C.: Attention-based multimodal neural machine translation (2016). https://doi.org/10.18653/v1/W16-2360, http://aclanthology.coli.uni-saarland.de/pdf/W/W16/W16-2360.pdf
[25] Huang, P.Y., Liu, F., Shiang, S.R., Oh, J., Dyer, C.: Attention-based multimodal neural machine translation. In: WMT (2016)
[26] Jean, S., Cho, K., Memisevic, R., Bengio, Y.: On using very large target vocabulary for neural machine translation. In: ACL 2015 (2014)
[27] Johnson, M., Schuster, M., Le, Q.V., Krikun, M., Wu, Y., Chen, Z., Thorat, N., Viégas, F.B., Wattenberg, M., Corrado, G., Hughes, M., Dean, J.: Google’s multilingual neural machine translation system: Enabling zero-shot translation. CoRR abs/1611.04558 (2016), http://arxiv.org/abs/1611.04558
[28] Kalchbrenner, N., Blunsom, P.: Recurrent continuous translation models. Association for Computational Linguistics, Seattle (October 2013)
[29] Koehn, P.: Europarl: A parallel corpus for statistical machine translation. In: MT summit. vol. 5, pp. 79–86 (2005)
[30] Koehn, P.: Statistical machine translation (2010)
[31] Koehn, P.: Statistical Machine Translation. Cambridge University Press (2010)
[32] Koehn, P., Hoang, H., Birch, A., Callison-Burch, C., Federico, M., Bertoldi, N., Cowan, B., Shen, W., Moran, C., Zens, R., Dyer, C., Bojar, O., Constantin, A., Herbst, E.: Moses: Open source toolkit for statistical machine translation. In: Proceedings of ACL (2007)
[33] Koehn, P., Knight, K.: Statistical machine translation (Nov 24 2009), US Patent 7,624,005
[34] Koehn, P., Knowles, R.: Six challenges for neural machine translation. arXiv preprint arXiv:1706.03872 (2017)
[35] Koehn, P., Och, F.J., Marcu, D.: Statistical phrase-based translation. In: Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1. pp. 48–54. NAACL ’03, Association for Computational Linguistics, Stroudsburg, PA, USA (2003). https://doi.org/10.3115/1073445.1073462
[36] Larochelle, H., Hinton, G.E.: Learning to combine foveal glimpses with a third-order Boltzmann machine. In: Lafferty, J.D., Williams, C.K.I., Shawe-Taylor, J., Zemel, R.S., Culotta, A. (eds.) Advances in Neural Information Processing Systems 23, pp. 1243–1251. Curran Associates, Inc. (2010), http://papers.nips.cc/paper/4089-learning-to-combine-foveal-glimpses-with-a-third-order-boltzmann-machine.pdf
[37] Li, J., Xiong, D., Tu, Z., Zhu, M., Zhang, M., Zhou, G.: Modeling Source Syntax for Neural Machine Translation. ArXiv e-prints (May 2017)
[38] Li, Z., Callison-Burch, C., Dyer, C., Ganitkevitch, J., Khudanpur, S., Schwartz, L., Thornton, W.N.G., Weese, J., Zaidan, O.F.: Demonstration of Joshua: An open source toolkit for parsing-based machine translation. In: Proceedings of the ACL-IJCNLP 2009 Software Demonstrations. pp. 25–28. ACLDemos ’09, Association for Computational Linguistics, Stroudsburg, PA, USA (2009), http://dl.acm.org/citation.cfm?id=1667872.1667879
[39] Liu, F., Lu, H., Lo, C., Neubig, G.: Learning character-level compositionality with visual features. CoRR abs/1704.04859 (2017), http://arxiv.org/abs/1704.04859
[40] Luong, M., Pham, H., Manning, C.D.: Effective approaches to attention-based neural machine translation. CoRR abs/1508.04025 (2015), http://arxiv.org/abs/1508.04025
[41] Macháček, M., Bojar, O.: Results of the WMT14 metrics shared task. In: Proceedings of the Ninth Workshop on Statistical Machine Translation. pp. 293–301. Association for Computational Linguistics, Baltimore, Maryland, USA (June 2014), http://www.aclweb.org/anthology/W/W14/W14-3336
[42] Macháček, M., Bojar, O.: Results of the WMT13 metrics shared task. In: Proceedings of the Eighth Workshop on Statistical Machine Translation. pp. 45–51. Association for Computational Linguistics, Sofia, Bulgaria (August 2013), http://www.aclweb.org/anthology/W13-2202
[43] Matthews, A., Schlinger, E., Lavie, A., Dyer, C.: Synthesizing compound words for machine translation. In: ACL (1) (2016)
[44] Mi, H., Sankaran, B., Wang, Z., Ittycheriah, A.: A coverage embedding model for neural machine translation. CoRR abs/1605.03148 (2016), http://arxiv.org/abs/1605.03148
[45] Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. CoRR abs/1301.3781 (2013), http://arxiv.org/abs/1301.3781
[46] Neco, R.P., Forcada, M.L.: Asynchronous translations with recurrent neural nets. In: Neural Networks, 1997., International Conference on. vol. 4, pp. 2535–2540 (Jun 1997). https://doi.org/10.1109/ICNN.1997.614693
[47] Neubig, G.: Neural machine translation and sequence-to-sequence models: A tutorial. arXiv preprint arXiv:1703.01619 (2017)
[48] Nirenburg, S.: Knowledge-based machine translation. Machine Translation 4(1), 5–24 (1989), http://www.jstor.org/stable/40008396
[49] Och, F.J.: Minimum error rate training for statistical machine translation. In: Proceedings of ACL (2003)
[50] Och, F.J., Ney, H.: Improved statistical alignment models. In: Proceedings of the 38th Annual Meeting on Association for Computational Linguistics. pp. 440–447. Association for Computational Linguistics (2000)
[51] Papineni, K., Roukos, S., Ward, T., Zhu, W.J.: BLEU: a method for automatic evaluation of machine translation. pp. 311–318 (2002)
[52] Peter, J.T., Nix, A., Ney, H.: Generating alignments using target foresight in attention-based neural machine translation. The Prague Bulletin of Mathematical Linguistics 108(1), 27–36 (2017)
[53] Schwenk, H., Déchelotte, D., Gauvain, J.L.: Continuous space language models for statistical machine translation. In: Proceedings of the COLING/ACL on Main Conference Poster Sessions. pp. 723–730. COLING-ACL ’06, Association for Computational Linguistics, Stroudsburg, PA, USA (2006), http://dl.acm.org/citation.cfm?id=1273073.1273166
[54] Sennrich, R., Haddow, B., Birch, A.: Neural machine translation of rare words with subword units. CoRR abs/1508.07909 (2015), http://arxiv.org/abs/1508.07909
[55] Shi, X., Zhai, J., Yang, X., Xie, Z., Liu, C.: Radical embedding: Delving deeper to Chinese radicals. In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers). pp. 594–598. Association for Computational Linguistics (2015). https://doi.org/10.3115/v1/P15-2098, http://aclanthology.coli.uni-saarland.de/pdf/P/P15/P15-2098.pdf
[56] Stanojević, M., Sima’an, K.: BEER: Better evaluation as ranking. In: Proceedings of the Ninth Workshop on Statistical Machine Translation. pp. 414–419. Association for Computational Linguistics, Baltimore, Maryland, USA (June 2014), http://www.aclweb.org/anthology/W/W14/W14-3354
[57] Sutskever, I., Vinyals, O., Le, Q.V.: Sequence to sequence learning with neural networks. In: Advances in neural information processing systems. pp. 3104–3112 (2014)
[58] Tu, Z., Liu, Y., Lu, Z., Liu, X., Li, H.: Context gates for neural machine translation. CoRR abs/1608.06043 (2016), http://arxiv.org/abs/1608.06043
[59] Tu, Z., Lu, Z., Liu, Y., Liu, X., Li, H.: Coverage-based neural machine translation. CoRR abs/1601.04811 (2016), http://arxiv.org/abs/1601.04811
[60] Wang, W., Peter, J.T., Rosendahl, H., Ney, H.: CharacTER: Translation edit rate on character level. In: WMT. pp. 505–510 (2016)
[61] Weaver, W.: Translation. Machine Translation of Languages: Fourteen Essays (1955)
[62] Xiao, T., Zhu, J., Zhang, H., Li, Q.: NiuTrans: An open source toolkit for phrase-based and syntax-based machine translation. In: Proceedings of the ACL 2012 System Demonstrations. pp. 19–24. ACL ’12, Association for Computational Linguistics, Stroudsburg, PA, USA (2012), http://dl.acm.org/citation.cfm?id=2390470.2390474
[63] Zeiler, M.D.: Adadelta: an adaptive learning rate method. arXiv preprint arXiv:1212.5701 (2012)