Massively Parallel Cross-Lingual Learning in Low-Resource Target Language Translation

# Massively Parallel Cross-Lingual Learning in Low-Resource Target Language Translation

Zhong Zhou
Carnegie Mellon University
zhongzhou@cmu.edu
Matthias Sperber
Karlsruhe Institute of Technology
matthias.sperber@kit.edu
Alex Waibel
Carnegie Mellon University
Karlsruhe Institute of Technology
alex@waibel.com
###### Abstract

We work on translation from rich-resource languages to low-resource languages. The main challenges we identify are the lack of low-resource language data, the lack of effective methods for cross-lingual transfer, and the variable-binding problem that is common in neural systems. We build a translation system that addresses these challenges, using eight European language families as our test ground. Firstly, we add source and target family labels and study intra-family and inter-family influences for effective cross-lingual transfer. We achieve an improvement of +8.4 BLEU over a single-family multi-source multi-target NMT baseline. We find that training on the two neighboring families closest to the low-resource language is often enough. Secondly, we conduct an ablation study and find that reasonably good results can be achieved even with considerably less target data. Thirdly, we address the variable-binding problem by building an order-preserving named entity translation model. In a preliminary qualitative evaluation, 60.6% of our translations are rated accurate, that is, akin to human translations.

## 1 Introduction

We work on translation from a rich-resource language to a low-resource language. There is usually little low-resource language data, and even less parallel data, available (Duong et al., 2016; Anastasopoulos et al., 2017). Despite the challenges of little data and few human experts, such translation has many useful applications. These include translating water, sanitation and hygiene (WaSH) guidelines to protect Indian tribal children against waterborne diseases, introducing earthquake preparedness techniques to Indonesian tribal groups living near volcanoes, and delivering information to the disabled or the elderly in low-resource language communities (Reddy et al., 2017; Barrett, 2005; Anastasiou and Schäler, 2010; Perry and Bird, 2017). These are all examples of translating a closed text, known in advance, into a low-resource language.

There are three main challenges. Firstly, most previous work studies individual languages rather than language families collectively. Cross-lingual influences and similarities are very helpful when there is little data in the low-resource language (Shoemark et al., 2016; Sapir, 1921; Odlin, 1989; Cenoz, 2001; Toral and Way, 2018; De Raad et al., 1997; Hermans, 2003; Specia et al., 2016). Secondly, most multilingual Neural Machine Translation (NMT) work assumes the same amount of training data for all languages. In the low-resource case, it is important to exploit scarce or partial data in the low-resource language to produce high-quality translation. The third issue is the variable-binding problem that is common in neural systems, where “John calls Mary” is treated the same way as “Mary calls John” (Fodor and Pylyshyn, 1988; Graves et al., 2014). It is more challenging when both “Mary” and “John” are rare words. Solving the binding problem is crucial because mistakes in named entities change the meaning of the translation. It is especially challenging in the low-resource case because many words are rare.

Our contribution in addressing these issues is three-fold, extending multi-source multi-target attentional NMT. Firstly, in order to examine intra-family and inter-family influences, we add source and target language family labels in training. Training on multiple families yields an improvement of up to +8.4 BLEU; more specifically, we find that training on the two neighboring families closest to the low-resource language gives reasonably good BLEU scores. Secondly, we conduct an ablation study to explore how generalization changes with different amounts of data, and find that we need only a small amount of low-resource language data to produce reasonably good BLEU scores. We use the full data except in the ablation study. Finally, to address the variable-binding problem, we build a parallel lexicon table across twenty-three European languages and devise a novel order-preserving named entity translation method. Our method works for the translation of any text with a fixed set of named entities known in advance. Our goal is to minimize manual labor, but not to fully automate, so that the correct translation of named entities and their ordering is ensured.

In this paper, we begin with the introduction and related work in Sections 1 and 2. In Section 4, we introduce our methods for addressing three issues that are important for translation into a low-resource language, presented as extensions to the baseline described in Section 3. Finally, we present our results in Section 5 and conclude in Section 6.

## 2 Related Work

### 2.1 Multilingual Attentional NMT

Attentional NMT trains an end-to-end system directly and has flourished recently (Wu et al., 2016; Sennrich et al., 2016; Ling et al., 2015). Machine polyglotism, training machines to be proficient in many languages, is a new paradigm of multilingual NMT (Johnson et al., 2017; Ha et al., 2016; Firat et al., 2016; Zoph and Knight, 2016; Dong et al., 2015; Gillick et al., 2016; Al-Rfou et al., 2013; Tsvetkov et al., 2016). Many multilingual NMT systems involve multiple encoders and decoders (Ha et al., 2016), and it is hard to combine attention across a quadratic number of language pairs without a quadratic number of attention mechanisms (Firat et al., 2016). In multi-source scenarios, multiple encoders share a combined attention mechanism (Zoph and Knight, 2016). In multi-target scenarios, every decoder handles its own attention with parameter sharing (Dong et al., 2015). Attention combination schemes include simple combination and hierarchical combination (Libovický and Helcl, 2017; Firat et al., 2016).

The state of the art in multilingual NMT is to add source and target language labels when training a universal model with a single attention scheme (Johnson et al., 2017; Ha et al., 2016). Byte-Pair Encoding (BPE) is used at the preprocessing stage (Ha et al., 2016). This method is elegant in its simplicity and advances low-resource language translation as well as zero-shot translation using a pivot-based translation scheme (Johnson et al., 2017; Firat et al., 2016). However, in these works the training data increases quadratically with the number of languages (Johnson et al., 2017; Ha et al., 2016; Firat et al., 2016; Zoph and Knight, 2016; Dong et al., 2015; Gillick et al., 2016), which makes training on massively parallel corpora difficult.

## 3 Baseline Translation System

Our baseline is multi-source multi-target attentional NMT within one language family, trained by adding source and target language labels with a single unified attentional scheme (Johnson et al., 2017; Ha et al., 2016). The source and target vocabularies are not shared.

## 4 Proposed Extensions

As our proposed extensions, we present methods for three issues relevant to translation into a low-resource language.

### 4.1 Language Families and Cross-lingual Learning

Cross-lingual influences and similarities are important in linguistics (Shoemark et al., 2016; Sapir, 1921; Odlin, 1989; Cenoz, 2001; Toral and Way, 2018; De Raad et al., 1997; Hermans, 2003; Specia et al., 2016). The English word “beleaguer” originates from the Dutch word “belegeren”; “fidget” originates from the Nordic word “fikja”. English and Dutch belong to the same family, and their proximity affects each other (Harding and Sokal, 1988; Ross et al., 2006). Furthermore, languages that do not belong to the same family affect each other too (Sapir, 1921; Ammon, 2001; Toral and Way, 2018). “Somatic” stems from the Greek word “soma”; “caret” comes from Latin. We would like to tap into these cross-lingual similarities and effects in our research.

In training, we add source and target family labels in addition to the source and target language labels, in order to improve the convergence rate and increase translation performance. In Section 5, we examine intra-family and inter-family effects more closely.
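The labeling step can be sketched as follows. This is a minimal illustration only: the label token format (`__src_...__`, `__tgt_fam_...__`), the ordering of the labels, and the function name `add_labels` are assumptions made for this example and are not specified in the text.

```python
def add_labels(tokens, src_lang, tgt_lang, src_family, tgt_family):
    """Prepend family and language labels to a tokenized source sentence.

    The label spelling and order are hypothetical; the text only states that
    source/target family labels are added alongside source/target language labels.
    """
    labels = [
        f"__src_fam_{src_family}__",
        f"__tgt_fam_{tgt_family}__",
        f"__src_{src_lang}__",
        f"__tgt_{tgt_lang}__",
    ]
    return labels + list(tokens)


if __name__ == "__main__":
    # English -> Swedish, both Germanic.
    print(add_labels(["In", "the", "beginning"], "en", "sv", "germanic", "germanic"))
```

Keeping the labels as ordinary vocabulary tokens means a single shared model can be trained on all language pairs without architectural changes.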

### 4.2 Ablation Study on Target Training data

In order to achieve high information transfer from the rich-resource languages to the low-resource target language, we would like to find out how much target training data is needed to produce reasonably good performance. We vary the amount of low-resource training data to examine how a reasonably good BLEU score can be achieved with limited low-resource data. In the era of Statistical Machine Translation (SMT), researchers worked on data sampling and sorting measures (Eck et al., 2005; Axelrod et al., 2011).

In order to rigorously determine how much low-resource target language data is needed for reasonably good results, we run a range of control experiments. We draw samples from the low-resource language data randomly with replacement and duplicate them where necessary, so that all experiments carry the same number of training sentences. We keep the amount of training data in the rich-resource languages the same and vary only the amount of training data in the low-resource language. Our data selection process differs from prior research in that only the low-resource training data is reduced, simulating the real-world scenario of having little data in the low-resource language. By comparing results from the control experiments, we determine how much low-resource data is needed.
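One plausible reading of this sampling procedure is sketched below: draw a random subset of the low-resource sentence pairs, then duplicate its members until the training set is back to its original size. The function name `subsample_and_pad`, the cycling strategy used for duplication, and the fixed seed are assumptions for illustration; the exact procedure used in the experiments may differ.

```python
import random


def subsample_and_pad(sentence_pairs, fraction, seed=0):
    """Keep a random `fraction` of unique pairs, then duplicate to the original size."""
    rng = random.Random(seed)
    total = len(sentence_pairs)
    n_unique = max(1, round(total * fraction))
    subset = rng.sample(sentence_pairs, n_unique)
    # Cycle through the subset so every kept pair is used about equally often.
    padded = [subset[i % n_unique] for i in range(total)]
    rng.shuffle(padded)
    return padded


if __name__ == "__main__":
    # Placeholder data, not taken from the corpus.
    pairs = [(f"src {i}", f"tgt {i}") for i in range(10)]
    reduced = subsample_and_pad(pairs, fraction=0.2)
    print(len(reduced), len(set(reduced)))
```

Holding the total number of sentences fixed means the number of parameter updates per epoch stays the same across experiments, so differences in BLEU can be attributed to the number of unique sentences.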

### 4.3 Order-preserving Lexiconized NMT

The variable-binding problem is an inherent issue in connectionist architectures (Fodor and Pylyshyn, 1988; Graves et al., 2014). “John calls Mary” is not equivalent to “Mary calls John”, but neural networks cannot distinguish the two easily (Fodor and Pylyshyn, 1988; Graves et al., 2014). The failure of traditional NMT to distinguish the subject and the object of a sentence is detrimental. For example, in the narration “John told his son Ryan to help David, the brother of Mary”, it is a serious mistake if we reverse John and Ryan’s father-son relationships or confuse Ryan’s and David’s relationships with Mary.

In our research on translation, we focus mainly on text with a fixed set of named entities known in advance. We assume that experts first translate a given list of named entities into the low-resource language before any text is translated. Under this assumption, we propose an order-preserving named entity translation mechanism. Our solution is to first create a parallel lexicon table for all twenty-three European languages, using a seed English lexicon table and aligning it with the other languages using fast_align (Dyer et al., 2013). Instead of using \$UNKs to replace the named entities, we use \$NEs to distinguish them from the other unknowns. We also tag the named entities in a sentence sequentially as \$NE1, \$NE2, …, to preserve their ordering. For every sentence pair in the multilingual training data, we build a target named entity decoding dictionary from all target entries in our lexicon table that match the named entities appearing in the source sentence. During the evaluation stage, we replace all the numbered \$NEs using the target named entity decoding dictionary to produce our final translation. This method greatly improves translation accuracy and preserves the order.
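The tagging and decoding steps can be sketched as follows. This is a simplified illustration under several assumptions: named entities are single tokens, the lexicon is a plain source-to-target dictionary, every occurrence receives its own numbered tag, and the function names are hypothetical. The target-side name forms and the "model output" in the usage example are made up for the demonstration.

```python
def tag_entities(tokens, lexicon):
    """Replace known named entities with $NE1, $NE2, ... in order of appearance.

    Returns the tagged tokens and a decoding dictionary mapping each tag
    to the target-language form of the entity it replaced.
    """
    tagged, decoding = [], {}
    for token in tokens:
        if token in lexicon:
            tag = f"$NE{len(decoding) + 1}"
            decoding[tag] = lexicon[token]
            tagged.append(tag)
        else:
            tagged.append(token)
    return tagged, decoding


def restore_entities(tokens, decoding):
    """Substitute numbered tags in a model output with target-language entities."""
    return [decoding.get(token, token) for token in tokens]


if __name__ == "__main__":
    lexicon = {"John": "Johannes", "Mary": "Maria"}  # hypothetical entries
    tagged, decoding = tag_entities("John calls Mary".split(), lexicon)
    print(tagged)  # the NMT system sees tags instead of rare names
    # Stand-in for a model output that keeps the tags in place:
    print(restore_entities(["$NE1", "ringer", "$NE2"], decoding))
```

Because the tags are numbered, swapping two names in the source changes which tag lands in which position, so “John calls Mary” and “Mary calls John” decode to different outputs.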

As a result of our contribution, the experts only need to translate a few lexicons and a small amount of low-resource text before passing the task to our system to obtain good results.

## 5 Experiments and Results

We train our proposed model on twenty-three European languages across eight families on a parallel Bible corpus. For our purpose, we treat Swedish as our hypothetical low-resource target language, English as our rich-resource language in the single-source single-target case and all other Germanic languages as our rich-resource languages in the multi-source multi-target case.

Firstly, we present our data and training parameters. Secondly, we add family tags in different configurations of language families, showing intra-family and inter-family effects. Thirdly, we conduct an ablation study and plot the generalization curves by varying the amount of training data in Swedish, and we show that training on one fifth of the data gives reasonably good BLEU scores. Lastly, we devise an order-preserving lexicon translation method by building a parallel lexicon table across twenty-three European languages and tagging named entities in order.

### 5.1 Data and Training Parameters

We clean and align the Bible in the twenty-three European languages listed in Table 1. We randomly split the data into training, validation and test sets in a 0.75 : 0.15 : 0.10 ratio. Our training set contains only 23K verses, but is massively parallel. In our control experiments, we also use the WMT’14 French-English dataset for comparison with our results.
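A random split in these proportions might look like the sketch below. It assumes that the split is made over verse identifiers, so that a given verse falls into the same set in every language; that detail, the function name `split_verses`, and the seed are assumptions for illustration rather than statements about the actual preprocessing.

```python
import random


def split_verses(verse_ids, ratios=(0.75, 0.15, 0.10), seed=0):
    """Shuffle verse identifiers and cut them into train/validation/test sets."""
    ids = list(verse_ids)
    random.Random(seed).shuffle(ids)
    n_train = int(len(ids) * ratios[0])
    n_valid = int(len(ids) * ratios[1])
    train = ids[:n_train]
    valid = ids[n_train:n_train + n_valid]
    test = ids[n_train + n_valid:]  # the remainder goes to the test set
    return train, valid, test


if __name__ == "__main__":
    train, valid, test = split_verses(range(1000))
    print(len(train), len(valid), len(test))
```

Splitting by verse rather than by sentence pair keeps translations of a test verse out of the training data for every language pair.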

In all our LSTM-based multilingual experiments, we use a minibatch size of 64, a dropout rate of 0.3, 4 RNN layers of size 1000, a word vector size of 600, and a learning rate of 0.8. For single-source single-target translation, we use 2 RNN layers of size 500, a word vector size of 500, and a learning rate of 1.0. All learning rates decay by a factor of 0.7 if the validation score does not improve or once training is past epoch 9. We build our code on OpenNMT (Klein et al., 2017). For the ablation study, we train on BLEU scores directly until the Generalization Loss (GL) exceeds a threshold $\alpha$ (Prechelt, 1998). GL at epoch $t$ is defined as $GL(t) = 100\left(1 - \frac{E_{val}^{t}}{E_{opt}^{t}}\right)$, modified by us to suit our objective using BLEU scores (Prechelt, 1998). $E_{val}^{t}$ is the validation score at epoch $t$ and $E_{opt}^{t}$ is the optimal score up to epoch $t$. We evaluate our models using both BLEU scores and qualitative evaluation.
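The stopping rule and learning-rate schedule can be sketched as follows. The GL formula used here, 100 · (1 − current validation BLEU / best validation BLEU so far), is our reading of Prechelt's criterion adapted to a higher-is-better score, and should be treated as an assumption; the function names and the threshold value in the usage example are hypothetical.

```python
def generalization_loss(val_scores):
    """GL (in percent) at the latest epoch for a higher-is-better score such as BLEU.

    val_scores holds the validation score of every epoch so far;
    the best score must be positive.
    """
    best = max(val_scores)
    return 100.0 * (1.0 - val_scores[-1] / best)


def should_stop(val_scores, alpha):
    """Stop once the generalization loss exceeds the threshold alpha."""
    return generalization_loss(val_scores) > alpha


def next_learning_rate(lr, epoch, improved, decay=0.7, decay_after_epoch=9):
    """Decay when validation did not improve or training is past the given epoch."""
    if not improved or epoch > decay_after_epoch:
        return lr * decay
    return lr


if __name__ == "__main__":
    history = [10.0, 20.0, 18.0]  # made-up validation BLEU scores
    print(generalization_loss(history), should_stop(history, alpha=5.0))
    print(next_learning_rate(0.8, epoch=3, improved=False))
```

GL is zero whenever the latest epoch is the best so far, so the rule only triggers after the validation score has dropped by more than the chosen percentage from its peak.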

### 5.2 Family labels and Intra-family & Inter-family Effects

We first investigate intra-family and inter-family influences and the effects of adding family labels. We use full training data in this subsection. Adding family labels not only improves convergence rate, but also increases BLEU scores.

Languages have varying closeness to each other: Single-source single-target translations of different languages in Germanic family to Swedish show huge differences in BLEU scores as shown in Table 3. These differences are well aligned with the multi-source multi-target results. Norwegian-Swedish and Danish-Swedish translations have much higher BLEU scores than the rest. This hints that Norwegian and Danish are closer to Swedish than the rest in the neural representation.

Multi-source multi-target translation improves greatly from single-source single-target translation: English-Swedish single-source single-target translation gives a low BLEU score of 6.9 as shown in Table 3, which is understandable as our dataset is very small. BLEU score for English-Swedish translation improves greatly to 34.0 in multi-source multi-target NMT training using Germanic family as shown in Table 2. In this paper, we treat Germanic multi-source multi-target NMT as our baseline model. Complete tables of multi-source and multi-target experiments are in the appendix. We present only relevant columns important for cross-lingual learning and low-resource target language translation.

Adding languages from other families into training improves translation quality within each family greatly: English-Swedish translation’s BLEU score improves significantly from 34.0 to 40.3 in the training of Germanic and Slavic families, and 40.2 in the training of Germanic and Romance families as shown in Table 3. After we add all three families in training, BLEU score for English-Swedish translation increases further to 41.8 in Table 3. Finally, after we add all eight families, BLEU score for English-Swedish translation increases to 42.1 in Table 3.

A plateau is observed after adding more than one neighboring family: A plateau is visible when we plot Table 3 in Figure 1. The increase in BLEU scores after adding two families is much milder than that from the first addition of a neighboring family. This hints that training on an unlimited number of languages may not be necessary.

Adding family labels not only improves convergence rate, but also increases BLEU scores: BLEU scores for most language pairs improve with the addition of family labels. The largest jump in BLEU score is observed in the experiment involving all eight language families. This is easy to understand: the more families we have, the more useful it is to distinguish them.

Training on the two neighboring families nearest to the low-resource language gives better results than training on languages that are further apart: Our observation of the plateau suggests that training on the two neighboring families nearest to the low-resource language works well, as shown in Table 3. To probe deeper, we compare adding languages by family with another method, adding languages by random samples that span all eight families, defined as follows.

###### Definition 5.1 (Language Spanning).

A set of languages spans a set of families when it contains at least one language from each family.
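Definition 5.1 translates directly into a set-inclusion check. In this sketch, the `family_of` mapping and the function name `spans` are hypothetical; only a few well-known family assignments are included as examples.

```python
def spans(languages, families, family_of):
    """Return True if `languages` contains at least one language from each family."""
    covered = {family_of[language] for language in languages}
    return set(families) <= covered


if __name__ == "__main__":
    family_of = {
        "English": "Germanic",
        "Swedish": "Germanic",
        "French": "Romance",
        "Polish": "Slavic",
    }
    families = ["Germanic", "Romance", "Slavic"]
    print(spans(["English", "French", "Polish"], families, family_of))
    print(spans(["English", "Swedish", "French"], families, family_of))
```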

In Figure 3, we conduct a few experiments on French-English translation and compare performance across different ways of adding training data. Let us introduce a few terms before discussing the results. We use family addition to describe adding training data one nearby language family at a time. We use sparse addition to describe adding training data through language sets that span all eight language families. In sparse addition, languages are further apart, as each may represent a different family. We find that family addition gives better generalization than sparse addition. This strengthens our earlier result that training on the two families closest to our low-resource language is a reliable way to reach good generalization.

Generalization is not merely an effect of increasing amount of data: In Figure 3, we compare all methods of adding languages against a WMT’14 curve, obtained by using an equivalent amount of WMT’14 French-English data in each experiment. The WMT’14 curve serves as our benchmark for observing the effect of simply increasing data. We observe that adding other languages improves the BLEU score much more sharply than the benchmark does, showing that our generalization is not merely an effect of increasing data. We also observe that, although adding WMT’14 data initially increases the BLEU score, the curve reaches a plateau at a very early point, after which adding more WMT’14 data does not improve performance.

### 5.3 Ablation Study on Target Training Data

We use full training data from all rich-resource languages, and we vary the amount of training data in Swedish, our low-resource language, spanning from one tenth to full length uniformly. We duplicate the subset to ensure all training sets, though having a different number of unique sentences, have the same number of total sentences.

A power-law relationship is observed between performance and the amount of training data in the low-resource language: Figure 5 shows how BLEU scores vary with the logarithm of the number of unique sentences in the low-resource training data. The relationship follows a linear pattern for single-source single-target translation from English to Swedish, as shown in Figure 4. We also observe a linear pattern for the multi-source multi-target case in Figure 5, though it is more uneven. The linear pattern of BLEU scores against the logarithm of the data size shows the power-law relationship between translation performance and the amount of low-resource training data. Similar power-law relationships are also found in contemporary literature (Hestness et al., 2017).
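The linear pattern described above, BLEU against the logarithm of the number of unique sentences, can be checked with an ordinary least-squares fit. The sketch below uses only the standard library; the function name `fit_log_linear`, the choice of base-10 logarithm, and the data points are synthetic placeholders and are not results from our experiments.

```python
import math


def fit_log_linear(num_sentences, bleu_scores):
    """Least-squares fit of bleu = slope * log10(num_sentences) + intercept."""
    xs = [math.log10(n) for n in num_sentences]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(bleu_scores) / len(bleu_scores)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, bleu_scores))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


if __name__ == "__main__":
    # Synthetic points lying exactly on a line in log space.
    sizes = [1000, 10000, 100000]
    scores = [10.0, 20.0, 30.0]
    slope, intercept = fit_log_linear(sizes, scores)
    print(f"BLEU gain per tenfold increase in data: {slope:.2f}")
```

The slope can be read as the BLEU gain for each tenfold increase in unique low-resource sentences, which gives a rough way to estimate how much more data a target BLEU score would require.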

We achieve reasonably good BLEU scores using one fifth of random samples: For the multi-source multi-target case, we find that using one fifth of the low-resource training data gives reasonably good BLEU scores, as shown in Figure 5. This is helpful when we have little low-resource data. For translation into a low-resource language, the experts only need to translate a small amount of seed data before passing it to our system. (Note that using nine tenths of the random samples yields higher performance than using the full data, but this may not generalize to other datasets.)

### 5.4 Order-preserving Lexiconized NMT

We devise a mechanism to build a parallel lexicon table across twenty-three European languages using very little data and zero manual work. A few lexicon examples are shown in Table 6. The final parallel lexicon table has 2916 named entities. When translating into a low-resource language, we assume that the experts first translate these lexicon entries and then translate a random sample of approximately one fifth of the sentences before we train our NMT system. If necessary, the experts evaluate and correct translations before releasing the final translations to the low-resource language community. We aim to reduce human effort in post-editing and increase machine accuracy. After labeling named entities in each sentence pair in order, we train and obtain good translation results.

We observe 60.6% accuracy in human evaluation where our translations are parallel to human translations: In Table 8, we show some examples of machine-translated text alongside the expected correct translations for comparison. Not only are the named entities correctly mapped, but the ordering of the subject and the object is also preserved. On a subset of our test set, we conduct human evaluation of 320 English-Swedish results, rating the translations in three categories: accurate (parallel to human translation), almost accurate (needing minor corrections) and inaccurate. More precisely, each sentence is evaluated using three criteria: correct set of named entities, correct positioning of named entities, and accurate meaning of the overall translation. If a sentence achieves all three, it is termed accurate; if a named entity is missing or its position is wrong, it is termed almost accurate (needing minor correction); if the meaning of the sentence is entirely wrong, it is termed inaccurate. Our results are 60.6% accurate, 33.8% needing minor corrections, and 5.6% inaccurate. Though human evaluation carries bias and the sample is small, it does give us perspective on the performance of our model.

Order-preservation performs well especially when the named entities are rare words: In Table 8, we see that normal NMT without the order-preserving lexiconized treatment performs well when named entities are common words at the head of the distribution. However, at the tail of the distribution, normal NMT without order-preservation fails to predict the correct set of named entities and their ordering, as shown in Table 8. The last column shows the number of occurrences of each named entity. In the last example, there are many named entities that occur in the data only once, which means that they never appear in training and appear only in the test set. The normal NMT predicts the wrong named entities in the wrong order. Our lexiconized order-preserving NMT, on the contrary, performs well at both the head and the tail of the distribution, and predicts the right set of named entities in the right order. In the last example in Table 8, our lexiconized order-preserving NMT predicts all the named entities correctly and in the right order.

Prediction with longer sentences and many named entities is handled well: In Table 8, we see that normal NMT without the order-preserving lexiconized treatment performs well with short sentences and few named entities per sentence. But as the number of named entities per sentence increases, especially when the named entities are rare unknowns as discussed before, normal NMT cannot predict the right set of named entities in the correct order (Table 8). Our lexiconized order-preserving NMT, on the contrary, gives very high accuracy when there are many named entities in the sentence and maintains their correct ordering.

Trimming the lexicon list while keeping the tail helps to increase BLEU scores: Unlike most previous lexiconized NMT work, where BLEU scores never increase (Wang et al., 2017), our BLEU scores show minor improvements. The BLEU score for German-Swedish translation increases from 35.8 to 36.6 in Table 7. In an attempt to increase our BLEU scores even further, we conduct two more experiments. In one setting, we keep only the tail of the lexicon table, that is, the entries that occur in the Bible once. In another setting, we keep only a manual selection of lexicon entries. Note that this is the only place where manual work is involved, and it is not essential. There are minor improvements in BLEU scores in both cases.
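The first trimming setting, keeping only lexicon entries that occur once in the corpus, can be sketched with a frequency count. The function name `keep_tail`, the token-level counting, and the toy data are assumptions for illustration; the toy corpus and target forms are not taken from the actual lexicon table.

```python
from collections import Counter


def keep_tail(lexicon, corpus_tokens, max_count=1):
    """Keep lexicon entries whose source form occurs at most `max_count` times."""
    counts = Counter(corpus_tokens)
    return {
        source: target
        for source, target in lexicon.items()
        if 0 < counts[source] <= max_count
    }


if __name__ == "__main__":
    lexicon = {"Jesus": "Jesus", "Abiram": "Abiram", "Hiel": "Hiel"}  # toy entries
    corpus = "Jesus spoke and Jesus went ; Hiel built ; Abiram".split()
    print(keep_tail(lexicon, corpus))
```

Frequent names at the head of the distribution are then left to the NMT system, while the rare names, where it is most error-prone, go through the lexicon.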

33.8% of the translations require minor corrections: The sentence length for these translations that require minor corrections is often longer. We notice that some have repetitions that do not affect meaning, but need to be trimmed. Some have the under-prediction problem where certain named entities in the source sentence never appear; in this case, missing named entities need to be added. Some have minor issues with plurality and tense that need to be corrected. We show a few examples of the translations that need minor corrections in Table 12 in the appendices.

## 6 Conclusion and Future Directions

We present our order-preserving translation system for cross-lingual learning in European languages. We examine three issues that are important to translation into a low-resource language: the lack of low-resource data, effective cross-lingual transfer, and the variable-binding problem. Firstly, we add source and target family labels in training and examine intra-family and inter-family effects. We find that training on multiple families, and more specifically on the two neighboring families nearest to the low-resource language, gives reasonably good BLEU scores. Secondly, we devise a rigorous ablation study and show that we need only a small portion of the low-resource target data to produce reasonably good BLEU scores. Thirdly, to address the variable-binding problem, we build a parallel lexicon table across twenty-three European languages and design a novel order-preserving named entity translation method that tags the named entities in each sentence in order. We achieve reasonably good quantitative improvements and qualitative results in a preliminary study.

Our order-preserving named entity translation method works well with a fixed pool of named entities in any static document known in advance. However, researchers may need to translate dynamic documents into a low-resource language in real time. Therefore, we are actively researching how to discover named entities in dynamic documents quickly and accurately.

Our work is helpful for translation into a low-resource language, where human translators only need to translate a few lexicon entries and a partial set of data before passing the task to our system. Human translators may be involved in post-editing. Our future goal is to minimize the human correction effort and to deliver high-quality translation in a timely manner.

We would also like to work on real world low-resource tribal languages where there is no or little training data. Real world low-resource languages will bring us interesting research questions we will strive to solve.

## Acknowledgments

We would like to thank Prof. Eduard Hovy for his insights and helpful suggestions.

## References

• Al-Rfou et al. (2013) Rami Al-Rfou, Bryan Perozzi, and Steven Skiena. 2013. Polyglot: Distributed word representations for multilingual NLP. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning. Association for Computational Linguistics, Sofia, Bulgaria, pages 183–192.
• Ammon (2001) Ulrich Ammon. 2001. The dominance of English as a language of science: Effects on other languages and language communities, volume 84. Walter de Gruyter.
• Anastasiou and Schäler (2010) Dimitra Anastasiou and Reinhard Schäler. 2010. Translating vital information: Localisation, internationalisation, and globalisation. Syn-thèses Journal 3:11–25.
• Anastasopoulos et al. (2017) Antonios Anastasopoulos, Sameer Bansal, David Chiang, Sharon Goldwater, and Adam Lopez. 2017. Spoken term discovery for language documentation using translations. In Proceedings of the Workshop on Speech-Centric Natural Language Processing. pages 53–58.
• Arthur et al. (2016) Philip Arthur, Graham Neubig, and Satoshi Nakamura. 2016. Incorporating discrete translation lexicons into neural machine translation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing.
• Axelrod et al. (2011) Amittai Axelrod, Xiaodong He, and Jianfeng Gao. 2011. Domain adaptation via pseudo in-domain data selection. In Proceedings of the conference on empirical methods in natural language processing. Association for Computational Linguistics, pages 355–362.
• Barrett (2005) Julia Barrett. 2005. Support and information needs of older and disabled older people in the UK. Applied Ergonomics 36(2):177–183.
• Cenoz (2001) Jasone Cenoz. 2001. The effect of linguistic distance, l2 status and age on cross-linguistic influence in third language acquisition. Cross-linguistic influence in third language acquisition: Psycholinguistic perspectives 111(45):8–20.
• Chung et al. (2016) Junyoung Chung, Kyunghyun Cho, and Yoshua Bengio. 2016. A character-level decoder without explicit segmentation for neural machine translation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). volume 1, pages 1693–1703.
• De Raad et al. (1997) Boele De Raad, Marco Perugini, and Zsófia Szirmák. 1997. In pursuit of a cross-lingual reference structure of personality traits: Comparisons among five languages. European Journal of Personality 11(3):167–185.
• Dong et al. (2015) Daxiang Dong, Hua Wu, Wei He, Dianhai Yu, and Haifeng Wang. 2015. Multi-task learning for multiple language translation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). volume 1, pages 1723–1732.
• Duong et al. (2016) Long Duong, Antonios Anastasopoulos, David Chiang, Steven Bird, and Trevor Cohn. 2016. An attentional model for speech translation without transcription. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. pages 949–959.
• Duong et al. (2017) Long Duong, Hiroshi Kanayama, Tengfei Ma, Steven Bird, and Trevor Cohn. 2017. Multilingual training of crosslingual word embeddings. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers. volume 1, pages 894–904.
• Dyer et al. (2013) Chris Dyer, Victor Chahuneau, and Noah A Smith. 2013. A simple, fast, and effective reparameterization of IBM Model 2. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. pages 644–648.
• Eck et al. (2005) Matthias Eck, Stephan Vogel, and Alex Waibel. 2005. Low cost portability for statistical machine translation based on n-gram frequency and tf-idf. In International Workshop on Spoken Language Translation (IWSLT) 2005.
• Firat et al. (2016) Orhan Firat, Kyunghyun Cho, and Yoshua Bengio. 2016. Multi-way, multilingual neural machine translation with a shared attention mechanism. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. pages 866–875.
• Fodor and Pylyshyn (1988) Jerry A Fodor and Zenon W Pylyshyn. 1988. Connectionism and cognitive architecture: A critical analysis. Cognition 28(1):3–71.
• Gillick et al. (2016) Dan Gillick, Cliff Brunk, Oriol Vinyals, and Amarnag Subramanya. 2016. Multilingual language processing from bytes. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. pages 1296–1306.
• Graves et al. (2014) Alex Graves, Greg Wayne, and Ivo Danihelka. 2014. Neural Turing machines. arXiv preprint arXiv:1410.5401 .
• Ha et al. (2016) Thanh-Le Ha, Jan Niehues, and Alexander Waibel. 2016. Toward multilingual neural machine translation with universal encoder and decoder. arXiv preprint arXiv:1611.04798 .
• Harding and Sokal (1988) Rosalind M Harding and Robert R Sokal. 1988. Classification of the European language families by genetic distance. Proceedings of the National Academy of Sciences 85(23):9370–9372.
• Hermans (2003) Theo Hermans. 2003. Cross-cultural translation studies as thick translation. Bulletin of the School of Oriental and African Studies 66(3):380–389.
• Hestness et al. (2017) Joel Hestness, Sharan Narang, Newsha Ardalani, Gregory Diamos, Heewoo Jun, Hassan Kianinejad, Md Patwary, Mostofa Ali, Yang Yang, and Yanqi Zhou. 2017. Deep learning scaling is predictable, empirically. arXiv preprint arXiv:1712.00409 .
• Johnson et al. (2017) Melvin Johnson, Mike Schuster, Quoc V Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda Viégas, Martin Wattenberg, Greg Corrado, et al. 2017. Google’s multilingual neural machine translation system: Enabling zero-shot translation. Transactions of the Association for Computational Linguistics 5:339–351.
• Klein et al. (2017) Guillaume Klein, Yoon Kim, Yuntian Deng, Jean Senellart, and Alexander Rush. 2017. OpenNMT: Open-source toolkit for neural machine translation. Proceedings of ACL 2017, System Demonstrations pages 67–72.
• Libovickỳ and Helcl (2017) Jindřich Libovickỳ and Jindřich Helcl. 2017. Attention strategies for multi-source sequence-to-sequence learning. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). volume 2, pages 196–202.
• Ling et al. (2015) Wang Ling, Isabel Trancoso, Chris Dyer, and Alan W Black. 2015. Character-based neural machine translation. arXiv preprint arXiv:1511.04586 .
• Nguyen and Chiang (2017) Toan Q Nguyen and David Chiang. 2017. Improving lexical choice in neural machine translation. arXiv preprint arXiv:1710.01329 .
• Odlin (1989) Terence Odlin. 1989. Language transfer: Cross-linguistic influence in language learning. Cambridge University Press.
• Perry and Bird (2017) Robyn Perry and Steven Bird. 2017. Treasure language storytelling: Cross-cultural language recognition and wellbeing. 5th International Conference on Language Documentation and Conservation (ICLDC) .
• Prechelt (1998) Lutz Prechelt. 1998. Early stopping-but when? Neural Networks: Tricks of the trade pages 553–553.
• Reddy et al. (2017) B Reddy, Yadlapalli S Kusuma, Chandrakant S Pandav, Anil Kumar Goswami, Anand Krishnan, et al. 2017. Water and sanitation hygiene practices for under-five children among households of Sugali tribe of Chittoor district, Andhra Pradesh, India. Journal of Environmental and Public Health 2017.
• Ross et al. (2006) Malcolm Ross et al. 2006. Language families and linguistic diversity. In Encyclopedia of Language and Linguistics (2nd ed), Elsevier.
• Sapir (1921) Edward Sapir. 1921. How languages influence each other. Language: an Introduction to the Study of Speech .
• Sennrich et al. (2016) Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). volume 1, pages 1715–1725.
• Shoemark et al. (2016) Philippa Shoemark, Sharon Goldwater, James Kirby, and Rik Sarkar. 2016. Towards robust cross-linguistic comparisons of phonological networks. In Proceedings of the 14th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology. pages 110–120.
• Specia et al. (2016) Lucia Specia, Stella Frank, Khalil Sima’an, and Desmond Elliott. 2016. A shared task on multimodal machine translation and crosslingual image description. In Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers. volume 2, pages 543–553.
• Tiedemann (2012) Jörg Tiedemann. 2012. Character-based pivot translation for under-resourced languages and domains. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics. Association for Computational Linguistics, pages 141–151.
• Toral and Way (2018) Antonio Toral and Andy Way. 2018. What level of quality can neural machine translation attain on literary text? arXiv preprint arXiv:1801.04962 .
• Tsvetkov et al. (2016) Yulia Tsvetkov, Sunayana Sitaram, Manaal Faruqui, Guillaume Lample, Patrick Littell, David Mortensen, Alan W Black, Lori Levin, and Chris Dyer. 2016. Polyglot neural language models: A case study in cross-lingual phonetic representation learning. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. pages 1357–1366.
• Wang et al. (2017) Yuguang Wang, Shanbo Cheng, Liyang Jiang, Jiajun Yang, Wei Chen, Muze Li, Lin Shi, Yanfeng Wang, and Hongtao Yang. 2017. Sogou neural machine translation systems for WMT17. In Proceedings of the Second Conference on Machine Translation. pages 410–415.
• Wu et al. (2016) Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144 .
• Zoph and Knight (2016) Barret Zoph and Kevin Knight. 2016. Multi-source neural translation. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. pages 30–34.

## Appendix A Examples of Full Tables

All important results have been presented in the main paper. We show a few examples of full tables on the next page.
