# Effective Strategies in Zero-Shot Neural Machine Translation

## Abstract

In this paper, we propose two strategies which can be applied to a multilingual neural machine translation system in order to better tackle zero-shot scenarios despite not having any parallel corpus. The experiments show that they are effective in terms of both performance and computing resources, especially for the multilingual translation of unbalanced data in a real zero-resourced condition, where they alleviate the language bias problem.

Institute for Anthropomatics and Robotics
KIT - Karlsruhe Institute of Technology, Germany
firstname.lastname@kit.edu

## 1 Introduction

The recently proposed neural machine translation [1] has shown the best performance in recent machine translation campaigns for several language pairs. Applied to multilingual settings, neural machine translation (NMT) systems have been shown to benefit from additional information embedded in a common semantic space across languages. However, in the extreme cases where no parallel data is available to train such a system, NMT systems often suffer from a bad training situation and are incapable of performing adequate translation.

In this work, we point out the underlying problem of current multilingual NMT systems when dealing with zero-resource scenarios. We then propose two simple strategies to reduce the adverse impact of this problem. The strategies require little modification to the standard NMT framework, yet they are still able to achieve better performance on zero-shot translation tasks with much less training time.

### 1.1 Neural Machine Translation

In this section, we briefly describe the framework of neural machine translation as a sequence-to-sequence modeling problem, following the method proposed in [1].

Given a source sentence $x = (x_1, \dots, x_I)$ and the corresponding target sentence $y = (y_1, \dots, y_J)$, NMT aims to directly model the translation probability of the target sequence:

$$P(y \mid x) = \prod_{j=1}^{J} P(y_j \mid y_{<j}, x)$$

[1] proposed an encoder-attention-decoder framework to calculate this probability.

A bidirectional recurrent encoder reads each word $x_i$ of the source sentence and produces a representation of the sentence in fixed-length vectors, each concatenated from the forward and backward directions:

$$h_i = [\overrightarrow{h}_i; \overleftarrow{h}_i]$$
$$\overrightarrow{h}_i = d(\overrightarrow{h}_{i-1}, E_s \cdot x_i) \quad (1)$$
$$\overleftarrow{h}_i = d(\overleftarrow{h}_{i+1}, E_s \cdot x_i) \quad (2)$$

where $E_s$ is the source word embedding matrix, shared across the source words $x_i$, and $d$ is the recurrent unit computing the current hidden state of the encoder based on the previous hidden state. $h_i$ is then called an annotation vector, which encodes the source sentence up to time $i$ from both the forward and backward directions.

Then an attention mechanism is set up in order to choose which annotation vectors should contribute to the prediction of the next target word. A relevance score $\mathrm{rel}(z_{j-1}, h_i)$ between the previous decoder state $z_{j-1}$ and the annotation vectors is used to calculate the context vector $c_j$:

$$\alpha_{ij} = \frac{\exp(\mathrm{rel}(z_{j-1}, h_i))}{\sum_{i'} \exp(\mathrm{rel}(z_{j-1}, h_{i'}))}, \qquad c_j = \sum_i \alpha_{ij} h_i$$
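As a concrete illustration, the attention step can be sketched in a few lines of NumPy. Using a plain dot product as the relevance score $\mathrm{rel}(\cdot,\cdot)$ is our assumption here, since the scoring function is left abstract above:

```python
import numpy as np

def attention(z_prev, H):
    """Compute attention weights alpha_ij and the context vector c_j.

    z_prev: previous decoder state z_{j-1}, shape (d,)
    H:      annotation vectors h_1..h_I stacked row-wise, shape (I, d)
    A plain dot product stands in for the relevance score rel(., .).
    """
    scores = H @ z_prev                    # rel(z_{j-1}, h_i) for every i
    scores -= scores.max()                 # stabilize the softmax
    alpha = np.exp(scores) / np.exp(scores).sum()
    c = alpha @ H                          # c_j = sum_i alpha_ij * h_i
    return alpha, c
```

Because the weights sum to one, $c_j$ is a convex combination of the annotation vectors, focused on the source positions most relevant to the next prediction.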

At the other end, a decoder recursively generates one target word at a time:

$$P(y_j \mid y_{<j}, x) = \mathrm{softmax}(z_j)$$

where:

$$z_j = g(z_{j-1}, t_{j-1}, c_j), \qquad t_{j-1} = E_t \cdot y_{j-1} \quad (3)$$

The recurrent unit $g$ in the decoder is similar to its counterpart $d$ in the encoder, except that besides the previous hidden state $z_{j-1}$ and the target embedding $t_{j-1}$, it also takes the context vector $c_j$ from the attention layer as input to calculate the current hidden state $z_j$. The predicted word at time $j$ can then be sampled from a softmax distribution over the hidden state. A beam search is typically utilized to generate the output sequence, i.e. the translated sentence.
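To make the generation step concrete, here is a toy beam search over an abstract next-word distribution. The callback interface (`step_logprobs`) is an illustrative simplification: a real decoder would score each prefix using the recurrent state and attention described above.

```python
import numpy as np

def beam_search(step_logprobs, beam_size=3, max_len=4, eos=0):
    """Toy beam search. `step_logprobs(prefix)` returns a log-probability
    vector over the vocabulary for the next word given the prefix."""
    beams = [((), 0.0)]                       # (prefix, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            if prefix and prefix[-1] == eos:  # keep finished hypotheses
                candidates.append((prefix, score))
                continue
            logp = step_logprobs(prefix)
            for w in np.argsort(logp)[-beam_size:]:
                candidates.append((prefix + (int(w),), score + logp[w]))
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:beam_size]
    return beams[0][0]                        # best-scoring hypothesis
```

With `beam_size=1` this degenerates to greedy decoding; the systems evaluated later in this paper use a beam of size 15.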

### 1.2 Multilingual NMT

State-of-the-art NMT systems have demonstrated that machine translation in many languages can achieve high-quality results given large-scale data and sufficient computational power [4, 5]. On the other hand, how to prepare such enormous corpora for low-resourced languages and specific domains remains a big problem. Especially in the zero-resourced condition, where we do not possess any bilingual corpus, building a data-driven translation system requires special techniques that enable some form of transfer learning. A simple but effective approach called pivot-based machine translation has been developed, whose idea is to indirectly learn the translation between the source and target languages through a bridge language. However, this pivot approach is not scalable, since two different translation systems must be built for each language pair in order to perform the bridge translation.

Recent work has started exploring potential solutions for performing machine translation for multiple language pairs using a single NMT system. One of the most notable differences of NMT compared to the conventional statistical approach is that the source words can be represented in a continuous space in which semantic regularities are induced automatically. Applied to multilingual settings, NMT systems have been shown to benefit from additional information embedded in a common semantic space across languages, and thus they are able to conduct some level of transfer learning.

In this section, we review the related work on constructing a multilingual NMT system that translates from several source languages to several target languages. Then we consider a potential application of such a multilingual system to zero-shot scenarios, to demonstrate the capability of those systems in extremely low-resourced conditions.

We can essentially divide the work applying the current NMT framework to multilingual scenarios into two directions. The first direction follows the idea that multilingual training of an NMT system can be seen as a special form of multi-task learning, where each encoder is responsible for learning an individual modality’s representation and each decoder’s mission is to predict the labels of a particular task. In such a multilingual system, each task or modality corresponds to a language. In [6], the authors utilize a multiple encoder-decoder architecture for multi-task learning, including many-to-many translation, parsing and image captioning. Since these tasks involve different modalities, e.g. translation and image captioning, their system could not leverage the attention mechanism, although it has proven its effectiveness in each of these tasks. [7] proposed another approach which enables attention-based NMT for multilingual translation. Similar to [6], they use one encoder per source language and one decoder per target language for many-to-many translation tasks. Instead of a quadratic number of independent attention layers, however, their NMT system contains only a single, huge attention layer. To achieve this, an aggregation layer is needed between the attention layer and the encoders as well as the decoders, and the architecture must be changed to accommodate such a complicated shared attention mechanism. In [8], a more general framework in this direction is applied to build a multilingual translation system for 20 language pairs.

The work along the second direction also considers multilingual translation as multi-task learning, although here the tasks are the same (i.e. translation) with the same modality (i.e. textual data). The only difference is that the encoder and decoder work on different vocabularies from different languages. When we group those vocabularies into one large vocabulary, we can use a single encoder-decoder NMT system to perform many-to-many translation, since each word is viewed as a distinct entry in the large vocabulary regardless of its language. By implementing such a mechanism in the preprocessing step, this approach requires little or no modification to the standard NMT architecture. In our previous work [2], we performed a two-step preprocessing:

1. Language Coding: Add the language codes to every word in source and target sentences.

2. Target Forcing: Add a special token at the beginning of every source sentence indicating the language the system should translate the source sentence into.1

Concurrently, [3] proposed a similar but simpler approach: they carried out only the second step of [2]. They expected that there would be only a few cases where two words in different languages (with different meanings) have the same surface form; thus, they did not conduct the first step. An interesting side effect of not adding language codes, as [3] suggested, is that their system can accomplish code-switching multilingual translation, i.e. it can translate a sentence containing words in different languages. The main drawback of these approaches is that the sizes of the vocabularies and the corpus grow proportionally to the number of languages involved. Hence, a huge amount of time and memory is necessary to train such a multilingual system. Table 1 gives a simple example illustrating those preprocessing steps.
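The two preprocessing steps can be sketched as simple string transformations. The exact token formats used here (`de_word`, `<2de>`) are illustrative choices, not necessarily the markers used in [2] or [3]:

```python
def language_code(sentence, lang):
    """Step 1 (Language Coding): prefix every word with its language code."""
    return " ".join(f"{lang}_{w}" for w in sentence.split())

def target_force(sentence, tgt_lang):
    """Step 2 (Target Forcing): prepend a token naming the desired target
    language (footnote 1 notes the real system repeats this token)."""
    return f"<2{tgt_lang}> {sentence}"
```

For a German source sentence to be translated into English, the combined preprocessing would be `target_force(language_code("hallo welt", "de"), "en")`.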

## 2 Multilingual-based Zero-Shot Translation

In this section, we follow the second direction of [2] and [3], hereafter called mix-language approaches. First, we reimplemented their preprocessing steps with some modifications and applied their multilingual systems to the new zero-shot translation challenge at IWSLT 2017. Then we propose two strategies, target dictionary filtering and language as a word feature, in an attempt to tackle the drawbacks of their approach. The results in Section 3.3 show that our strategies are highly effective in terms of both performance and training resources.

### 2.1 Target Dictionary Filtering

In [2], we discussed observations of the language bias problem in our multilingual system: if the very first word is translated into the wrong language, the following words are more likely to be picked in that wrong language as well. The problem is more severe when the mixed target vocabulary is unbalanced due to the language imbalance of the training corpora (of which the zero-shot setting is a typical example). We reported that 9.7% of the sentences were translated into the wrong language by our basic zero-shot German→French system.

One solution to this problem is to improve the balance of the corpus by adding target→target corpora to the multilingual system, as suggested in [2]. However, the beam search still considers candidates belonging to parts of the target vocabulary that should not be considered at all. In this work, we propose a simple yet effective technique to eliminate this adverse effect: when translating into a specific language, we filter out from the target vocabulary all entries in languages other than the desired one. This significantly reduces translation time in huge multilingual systems or for large texts, since many search paths containing the unwanted candidates are removed. More importantly, it guarantees that the translated words and sentences are in the correct language. The effect of this strategy on the decoding process is illustrated in Figure 1.
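In implementation terms, the filter amounts to masking out foreign-language entries of the output distribution before the search expands its hypotheses. A minimal sketch, assuming each vocabulary entry carries a language tag:

```python
import numpy as np

def filter_logits(logits, vocab_langs, target_lang):
    """Set the score of every vocabulary entry whose language differs from
    the desired one to -inf, so beam search can never select it."""
    mask = np.array([lang != target_lang for lang in vocab_langs])
    masked = logits.copy()
    masked[mask] = -np.inf
    return masked
```

Equivalently, the foreign rows of the output softmax matrix can be dropped before decoding, which also yields the speedup mentioned above since the softmax is computed over a smaller vocabulary.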

### 2.2 Language as a Word Feature

As briefly mentioned in Section 1.2, the main disadvantage of the mix-language approaches is the efficiency of the training process. In those systems, the source and target vocabularies usually have a huge number of entries, proportional to the number of languages whose corpora are mixed. This leads to immense numbers of parameters lying between the embeddings and the hidden states of the encoder and the decoder. Even more problematic is the size of the output softmax, where most calculations take place.

There exists work on integrating linguistic information into NMT systems in order to help predict the output words [9, 10, 11]. In those works, information about a word (e.g. its lemma or its part-of-speech tag) is integrated as word features. This is done simply by learning feature embeddings instead of word embeddings. In other words, the system considers the word itself as a special feature alongside its other features.

More specifically, in Formulas (1), (2) and (3), the embedding matrices are the concatenation of all the features’ embeddings:

$$E \cdot x = \big[\, E_f \cdot x_f \,\big]_{f \in F}$$

where $[\,\cdot\,]_{f \in F}$ denotes the vector concatenation operation over the embeddings of the individual features in a finite, arbitrary set $F$ of word features. The target features of each target word are jointly predicted along with the word. Figure 2 depicts this modified architecture.

Inspired by their work, we attempt to encode the language information directly in the architecture instead of attaching language tokens in the preprocessing step. In our model, instead of linguistic information at the word level, the source word features are the language of the word under consideration and the desired language of the target sentence. The only target feature is the language of the word produced by the system. For example, when we would like to translate the sentence “put yourselves in my position” into German, the features of each source word would be the word itself, e.g. “yourselves”, plus two additional features “en” and “de”. Similarly, the features of each target word are the word and “de”. This scheme of using language information is similar to [2], but differs in the way the language information is integrated into the NMT framework. In [2], this information is implicitly injected into the system; in this work, it is explicitly provided along with the corresponding words. Furthermore, when used together in the embedding layers, the features can share useful information and constraints, which helps in choosing both the correct words and the correct language to translate into. During decoding, the beam search is conducted only over the target word space and not over the target features. When the search is complete, the corresponding features are selected along the search path. In our case, we do not need the output of the target language features except for evaluating language identification.
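A minimal sketch of the feature-embedding lookup described above, assuming per-feature embedding tables indexed by integer ids (the table names and dimensions are illustrative, not the actual configuration):

```python
import numpy as np

def feature_embedding(feature_ids, tables):
    """Concatenate the embeddings of every word feature, mirroring
    E . x = [E_f . x_f]_{f in F}. In our scheme the features of a source
    word are the word itself, its language, and the requested target
    language, e.g. ("yourselves", "en", "de")."""
    return np.concatenate([tables[f][idx] for f, idx in feature_ids.items()])
```

Because the language tables have only a handful of rows, almost all embedding parameters go to the single, plain word vocabulary rather than to per-language copies of every word.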

## 3 Evaluation

In this section, we describe a thorough evaluation of the related methods in comparison with the direct approach as well as the pivot-based approach.

### 3.1 Experimental Settings

We participated in this year’s IWSLT zero-shot tasks for German↔Dutch and German↔Romanian. The pivot language used in our experiments is English, and the parallel corpora are German-English and English-Dutch, or German-English and English-Romanian, respectively. The data are extracted from WIT3’s2 TED corpus [12]. The validation and test sets are dev2010 and tst2010, which are provided by the IWSLT17 organizers.

We use the Lua version of the OpenNMT3 framework [13] to conduct all experiments in this paper. Subword segmentation is performed using Byte-Pair Encoding [14] with 40,000 merge operations. All sentence pairs in the training and validation data which exceed 50 words in length are removed, and the rest are shuffled within every minibatch. We use 1024-cell LSTM layers [15] and 1024-dimensional embeddings with a dropout of 0.3 at every recurrent layer. The systems are trained using Adam [16]. In decoding, we use a beam search of size 15.

### 3.2 Baseline Systems

Let us consider the scenario that we would like to translate from a source language to a target language via a pivot language. In order to evaluate the effectiveness of our proposed strategies, we reimplemented the following baseline systems:

• Direct: A system trained on the direct source-target parallel corpus, which would not be available in a real zero-shot scenario. It is included for comparison purposes only.

• Pivot: A system which uses English as the pivot language. The output of the first source→pivot translation system is piped into the second system, trained to translate from pivot to target.

• Google Model 1: To build this system, we add a target token to every source sentence in the source→pivot parallel corpus, add another target token to every pivot sentence in the pivot→target parallel corpus, merge those two parallel corpora into one big corpus, and use a standard NMT architecture to train and decode.

• Google Model 2: Same as Google Model 1, but additionally applied to the two other directions, pivot→source and target→pivot. The result is a parallel corpus twice as large as the corpus in Google Model 1.

• Zero: This is an extended version of our work in [2]. We apply Language Coding and Target Forcing on six parallel corpora: source↔pivot, pivot→pivot, pivot↔target and target→target, and merge them at the end to form one big parallel corpus.

• Zero Back-Trans: This is not a real zero-shot system: we back-translated the English part of the pivot-target parallel corpus using a target→pivot NMT system, which yields a source-target parallel corpus of back-translation quality. After obtaining that direct corpus, we apply the same steps as in the Zero setting to all the corpora we have.

### 3.3 Results

First, we applied the baseline systems from [2] and [3] to the IWSLT zero-shot tasks. From Table 2 we can see that, in general, translating German↔Romanian is more difficult than German↔Dutch, which is reasonable given that German and Dutch are considered similar languages. The direct approach, which uses a parallel German-target corpus, and the pivot approach have similar performance in terms of BLEU score [17]. Interestingly, Zero Back-Trans performed better than the direct approach on German↔Romanian. We speculate that back-translation might introduce some translation noise which makes the German↔Romanian translation more robust.

Compared to the Zero model (5), the two Google models (3) and (4) from [3] achieved quite low scores. This illustrates the language bias problem, since these models used less and more unbalanced data than the Zero system (5). However, the real zero-shot systems (2, 3, 4, 5), except the pivot one (2), performed worse than those using direct parallel corpora, (1) and (7), since the zero-shot systems have never seen the direct data and hence have little or no guidance to learn the translation.

When we applied the proposed strategies, it was interesting to see their effects on different types of systems. Since Google Model 1 and Google Model 2 do not carry language identities on words, we cannot apply our strategies to those systems. In contrast, it is straightforward to adapt target dictionary filtering and language as a word feature to the systems described in [2].

Table 3 shows the performance of our strategies compared to the methods of [2] and [3]. When we applied the strategies on top of the Zero Back-Trans system, it seems that the data it was trained on is sufficient to avoid the language bias problem; thus, our strategies did not have a significant effect on the performance of this system (4a vs. 4 and 4b vs. 4). But in the real zero-shot configuration (3), both strategies improved the systems by notable margins. On tst2010, Target Dictionary Filtering (3a) brought an improvement of 1.07 BLEU on German↔Dutch. On the same test set, Language as a Word Feature achieved gains of 2.20 BLEU compared to Zero (3b vs. 3). On the German↔Romanian zero-shot task, the improvements from our strategies were not as large as on German↔Dutch, but they still helped, especially on dev2010.

Table 5 shows two examples where Target Dictionary Filtering clearly improves the quality and readability of the translation over the plain Zero system.

Considering the effectiveness of our Language as a Word Feature strategy from a computational perspective, shown in Table 4, we observed very positive results. We compared the Zero configuration and our Language as a Word Feature system in terms of training time, the sizes of the source and target vocabularies4, and the total number of model parameters on both zero-shot translation tasks. The models were trained on the same GPU (an Nvidia Titan Xp) for 8 epochs so that they can be fairly compared (seeing the same dataset the same number of times). Each type of model has the same configuration across the two zero-shot tasks, except for the parts related to the vocabularies5.

By encoding the language information as word features, the number of vocabulary entries is reduced to almost half that of the original method. This leads to a similar reduction in the number of parameters, which allows us to use bigger minibatches and perform faster updates, resulting in substantially decreased training time (from 7.3 hours to 1.5 hours per epoch for German↔Dutch, and from 6.0 hours to 1.3 hours per epoch for German↔Romanian). The strategy requires minimal modifications to the standard NMT framework, yet it still achieves better performance with much less training time.
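The claimed halving can be checked with back-of-the-envelope arithmetic. The vocabulary sizes below are hypothetical placeholders (the actual counts are in Table 4); only the ratio matters:

```python
def embedding_params(vocab_size, dim):
    """Parameter count of one embedding (or output softmax) matrix."""
    return vocab_size * dim

dim = 1024
# A mixed, language-coded vocabulary duplicates entries per language,
# while the feature scheme keeps one plain vocabulary plus a tiny
# language-embedding table (sizes here are made up for illustration).
mixed_vocab = embedding_params(80_000, dim)
feature_vocab = embedding_params(40_000, dim) + embedding_params(4, 8)
```

Since both the input embeddings and the output softmax scale with the vocabulary size, shrinking the vocabulary cuts the dominant parameter blocks roughly in half, which is what drives the observed training-time reduction.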

## 4 Conclusion and Future Work

In this paper, we presented our experiments on zero-shot translation tasks using a multilingual neural machine translation framework. We proposed two strategies which substantially improved the multilingual systems in terms of both performance and training resources.

In future work, we would like to look more closely at the outputs of the systems in order to better analyze the effects of our strategies. We also plan to extend our strategies to full multilingual systems, with more languages and different data conditions.

## 5 Acknowledgements

The project leading to this application has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement n° 645452. The research by Thanh-Le Ha was supported by the Ministry of Science, Research and the Arts of Baden-Württemberg.

### Footnotes

1. In fact, we add the target language token both at the beginning and at the end of every source sentence, twice in each place, to make the forcing effect stronger. Furthermore, every target sentence starts with a pseudo-word playing the role of a start token for the specific target language. This pseudo-word is later removed, along with the sub-word tags, in the post-processing steps.
2. https://wit3.fbk.eu/
3. http://opennmt.net/
4. In all cases, these sizes are similar.
5. While the total number of parameters for German↔Romanian is bigger than that for German↔Dutch, the training time of the German↔Romanian systems is lower because its training corpus is smaller.

### References

1. D. Bahdanau, K. Cho, and Y. Bengio, “Neural Machine Translation by Jointly Learning to Align and Translate,” CoRR, vol. abs/1409.0473, 2014. [Online]. Available: http://arxiv.org/abs/1409.0473
2. T.-L. Ha, J. Niehues, and A. Waibel, “Toward Multilingual Neural Machine Translation with Universal Encoder and Decoder,” CoRR, vol. abs/1611.04798, 2016. [Online]. Available: http://arxiv.org/abs/1611.04798
3. M. Johnson, M. Schuster, Q. Le, M. Krikun, Y. Wu, Z. Chen, N. Thorat, F. Viégas, M. Wattenberg, G. Corrado, M. Hughes, and J. Dean, “Google’s Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation,” Transactions of the Association for Computational Linguistics, vol. 5, pp. 339–351, 2017. [Online]. Available: https://www.transacl.org/ojs/index.php/tacl/article/view/1081
4. O. Bojar, R. Chatterjee, C. Federmann, B. Haddow, P. Koehn, J. Leveling, C. Monz, P. Pecina, M. Post, H. Saint-Amand, et al., “Findings of the 2016 Conference on Machine Translation (WMT16),” in Proceedings of the First Conference on Machine Translation (WMT16).   Berlin, Germany: Association for Computational Linguistics, 2016, pp. 12–58.
5. M. Cettolo, J. Niehues, S. Stüker, L. Bentivogli, R. Cattoni, and M. Federico, “The IWSLT 2016 Evaluation Campaign,” in Proceedings of the 13th International Workshop on Spoken Language Translation (IWSLT 2016), Seattle, WA, USA, 2016.
6. M.-T. Luong, Q. V. Le, I. Sutskever, O. Vinyals, and L. Kaiser, “Multi-task sequence to sequence learning,” in International Conference on Learning Representations (ICLR), San Juan, Puerto Rico, May 2016.
7. O. Firat, K. Cho, and Y. Bengio, “Multi-Way, Multilingual Neural Machine Translation with a Shared Attention Mechanism,” CoRR, vol. abs/1601.01073, 2016. [Online]. Available: http://arxiv.org/abs/1601.01073
8. N.-Q. Pham, M. Sperber, E. Salesky, T.-L. Ha, J. Niehues, and A. Waibel, “KIT’s Multilingual Neural Machine Translation systems for IWSLT 2017 (to appear),” in Proceedings of the 14th International Workshop on Spoken Language Translation (IWSLT 2017), Tokyo, Japan, 2017.
9. R. Sennrich and B. Haddow, “Linguistic input features improve neural machine translation,” CoRR, vol. abs/1606.02892, 2016. [Online]. Available: http://arxiv.org/abs/1606.02892
10. C. D. V. Hoang, R. Haffari, and T. Cohn, “Improving neural translation models with linguistic factors,” in Proceedings of the Australasian Language Technology Association Workshop 2016, 2016, pp. 7–14.
11. J. Niehues, T.-L. Ha, E. Cho, and A. Waibel, “Using factored word representation in neural network language models.” in Proceedings of the First Conference on Machine Translation (WMT16).   Berlin, Germany: Association for Computational Linguistics, 2016.
12. M. Cettolo, C. Girardi, and M. Federico, “Wit: Web inventory of transcribed and translated talks,” in Proceedings of the 16 Conference of the European Association for Machine Translation (EAMT), Trento, Italy, May 2012, pp. 261–268.
13. G. Klein, Y. Kim, Y. Deng, J. Senellart, and A. M. Rush, “OpenNMT: Open-Source Toolkit for Neural Machine Translation,” ArXiv e-prints, 2017.
14. R. Sennrich, B. Haddow, and A. Birch, “Neural Machine Translation of Rare Words with Subword Units,” in Association for Computational Linguistics (ACL 2016), Berlin, Germany, August 2016.
15. S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Comput., vol. 9, no. 8, pp. 1735–1780, Nov. 1997. [Online]. Available: http://dx.doi.org/10.1162/neco.1997.9.8.1735
16. D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
17. K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, “Bleu: a method for automatic evaluation of machine translation,” in Proceedings of the 40th Annual Meeting on Association for Computational Linguistics (ACL 2002).   Association for Computational Linguistics, 2002, pp. 311–318.