Zero-Shot Dual Machine Translation

Lierni Sestorain, ETH Zürich, liernis@student.ethz.ch
Massimiliano Ciaramita, Google, massi@google.com
Christian Buck, Google, cbuck@google.com
Thomas Hofmann, ETH Zürich, thomas.hofmann@inf.ethz.ch
Abstract

Neural Machine Translation (NMT) systems rely on large amounts of parallel data, which is a major challenge for low-resource languages. Building on recent work on unsupervised and semi-supervised methods, we present an approach that combines zero-shot and dual learning. The latter relies on reinforcement learning to exploit the duality of the machine translation task, and requires only monolingual data for the target language pair. Experiments show that a zero-shot dual system, trained on English-French and English-Spanish, outperforms a standard NMT system by large margins in zero-shot translation on Spanish-French (both directions). The zero-shot dual method approaches, within 2.2 BLEU points, the performance of a comparable supervised setting. Our method also obtains improvements when a small amount of parallel data for the zero-shot language pair is available. Adding Russian, to extend our experiments to jointly modeling six zero-shot translation directions, all directions improve, by between roughly 4 and 15 BLEU points, again reaching performance near that of the supervised setting.

1 Introduction

The availability of large high-quality parallel corpora is key to building robust machine translation systems. Obtaining such training data is often expensive or outright infeasible, thereby constraining the availability of translation systems, especially for low-resource language pairs. Among the work trying to alleviate this data-bottleneck problem, Johnson et al. [11] present a system that performs zero-shot translation. Their approach is based on a multi-task model trained to translate between multiple language pairs with a single encoder-decoder architecture. Multi-tasking is enabled simply by a special symbol, introduced at the beginning of the source sentence, that identifies the desired target language. Their idea enables an NMT model [18] to translate between language pairs never encountered during training, as long as both source and target languages were part of the training data.

Since any machine translation task has a canonical inverse problem, that of translating in the opposite direction, He et al. [8] propose a dual learning method for NMT. They report improved translation quality of two dual, weakly trained translation models using monolingual data from both languages. For that, a monolingual sentence is translated twice, first into the target language and then back into the source language. A language model is used to measure the fluency of the intermediate translation. In addition, a reconstruction error is computed via back-translation. Fluency and reconstruction scores are combined in a reward function which is optimized with reinforcement learning.

We propose an approach that builds upon the multilingual NMT architecture [11] to improve zero-shot translation quality by applying the reinforcement learning ideas of dual learning [8]. Our model improves over the zero-shot system while being simpler and more scalable than the original dual formulation, as it learns a single model for all involved translation directions. Moreover, for the original dual method to work, some amount of parallel data is needed in order to train an initial weak translation model. If no parallel data is available for that language pair, the original dual approach is not feasible, as a random translation model is ineffective in guiding exploration for reinforcement learning. In our formulation, however, no parallel data is required at all to make the algorithm work, as the zero-shot mechanism kick-starts the dual learning.

Experiments show that our approach outperforms the zero-shot translation of the NMT model by more than 25 BLEU points in both directions, using only monolingual corpora. Furthermore, our system yields gains of 0.67-2.53 BLEU points when the base NMT model benefits from a small amount of parallel data for the zero-shot translation pair. We also investigate a setting where six new language pairs are learned, by adding Russian to the set of languages. These experiments produce similar improvements, and also shed additional light on the zero-shot learning process.

2 Related work

We build on the standard encoder-attention-decoder architecture for neural machine translation [15, 5, 3], and specifically on the Google NMT implementation [18]. This architecture has been extended by Johnson et al. [11] for unsupervised or zero-shot translation. Here, one single multi-task system is trained on a subset of language pairs, but can then be used to translate between all pairs, including unseen ones. The success of this approach depends on the ability to learn interlingual semantic representations, for which some evidence is provided in [11]. In a zero-shot evaluation from Portuguese to Spanish, Johnson et al. [11] report a gap of 6.75 BLEU points relative to the supervised NMT model.

He et al. [8] propose a learning mechanism that bootstraps an initial low-quality translation model. This so-called dual learning approach leverages criteria to measure the fluency of the translation and the reconstruction error of back-translating into the source language. As shown in [8], this can be effective, provided that the initial translation model provides a sufficiently good starting point. In the reported experiments, they bootstrap an NMT model [3] trained on a fraction of the available parallel sentences. Dual learning improves the BLEU score on WMT'14 French-to-English, obtaining accuracy comparable to a model trained on 10x more parallel data.

Xia et al. [19] use dual learning in a supervised setting and improve WMT'14 English-to-French translation over the baseline of [3, 10]. In the supervised setting, Bahdanau et al. [4] show that an actor-critic approach can be more effective than vanilla policy gradient [17].

Recent work has also explored fully unsupervised machine translation. Lample et al. [12] propose a model that takes sentences from monolingual corpora of both languages and maps them into the same latent space. Starting from an unsupervised word-by-word translation, they iteratively train an encoder-decoder to reconstruct and translate from a noisy version of the input. Latent distributions from the two domains are aligned using adversarial discriminators. At test time, encoder and decoder work as a standard NMT system. On WMT'14 English-to-French translation, Lample et al. [12] improve over a word-by-word translation baseline that uses an inferred bilingual dictionary [6].

Artetxe et al. [2] present a method related to [12]. The model is an encoder-decoder system with attention, trained to perform autoencoding and on-the-fly back-translation. The system handles both translation directions at the same time. Moreover, it shares the encoder across languages, in order to generate language-independent representations, by exploiting pre-trained cross-lingual embeddings. This approach performs comparably on WMT'14 English-to-French, improving over [12]; results are also reported for WMT'14 French-to-English translation.

Lample et al. [13] combine the previous two approaches. They first learn NMT language models, as denoising autoencoders, over the source and target languages. Leveraging these, two translation models are initialized, one in each direction. Iteratively, source and target sentences are translated using the current translation models, and new translation models are trained on the generated pseudo-parallel sentences. The encoder and decoder parameters are shared across the two languages, to share the latent representations and for regularization, respectively. Their results on WMT'14 English-to-French translation outperform those of [12, 2], but BLEU scores are still distinctly lower than those of the corresponding supervised model (exact scores are not reported for the supervised comparison, but see Figure 2 in [13]).

3 Zero-shot dual machine translation

Our method for unsupervised machine translation works as follows: we start with a multi-language NMT model as used in zero-shot translation [11]. We then continue training this model using only monolingual data, in a dual learning framework similar to that proposed by He et al. [8].

At least three languages are involved in our setting; later we will also experiment with four. Let them be X, Y and Z, let Z-X and Z-Y be the language pairs with available parallel data, and let X-Y be the target language pair without any. We thus assume access to sufficiently large parallel corpora for Z-X and Z-Y, monolingual corpora for X and Y to train the language models, as well as a small X-Y corpus for evaluating the low-resource pair.

3.1 Zero-shot dual training

A single multilingual NMT model with parameters θ is trained on both the Z-X and Z-Y parallel corpora, in both directions. We stress that this model does not train on any X-Y sentence pairs. The multilingual NMT model is capable of generalizing to unseen pairs, that is, it can translate from X to Y and vice versa. Yet, the quality of the zero-shot Portuguese-to-Spanish (pt→es) translations, in the best setting (Model 2 in [11]), is approximately 7 BLEU points lower than that of the corresponding supervised model.
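To make the multi-tasking mechanism concrete, the following minimal Python sketch (not the paper's code; the tag token names are illustrative) shows how a target-language tag can be prepended to a tokenized source sentence, which is all that is needed to request a zero-shot direction from the single multilingual model.

```python
# Minimal sketch of the language-tag mechanism of Johnson et al. [11].
# The tag tokens (<2en>, <2es>, <2fr>) are illustrative names, not the paper's exact symbols.

TARGET_TAGS = {"en": "<2en>", "es": "<2es>", "fr": "<2fr>"}

def tag_source(tokens, target_lang):
    """Prepend the target-language tag so a single encoder-decoder knows
    which language to translate into."""
    return [TARGET_TAGS[target_lang]] + tokens

# Zero-shot request: the model only saw en-es and en-fr pairs in training,
# but the same tagging lets us ask for es->fr at inference time.
print(tag_source(["qué", "hora", "es", "?"], "fr"))
# ['<2fr>', 'qué', 'hora', 'es', '?']
```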

Using the monolingual data for X and Y, we train two disjoint language models with maximum likelihood. We implement the language models as multi-layered LSTMs (following https://www.tensorflow.org/tutorials/recurrent) that process one word at a time. For a given sentence x, the language model for language X outputs a probability LM_X(x), quantifying the degree of fluency of the sentence in that language. Similar to [8], we define a REINFORCE-like learning algorithm [17], where the translation model is the policy that we want to learn. The procedure on one sample is, schematically:

  1. Given a sentence x in language X, sample a corresponding translation ŷ from the multilingual NMT model P_θ(· | x).

  2. Compute the fluency reward r_1 = log LM_Y(ŷ).

  3. Compute the reconstruction reward r_2 = log P_θ(x | ŷ) by using the same model for back-translation.

  4. Compute the total reward for ŷ as r = α r_1 + (1 − α) r_2, where α is a hyper-parameter.

  5. Update θ by scaling the gradient of the cross-entropy loss by the reward r.

The process can be started with a sentence from either language X or Y and works symmetrically.
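The sketch below traces these five steps for one monolingual sentence in plain Python. The NMT model and the language model are replaced by toy stand-ins (assumptions for illustration only, not the paper's code) so that the control flow and the reward combination r = α·r_1 + (1 − α)·r_2 can be read and run end to end.

```python
import random

random.seed(0)
ALPHA = 0.005  # reward mixing weight, the value used in [8] and in our experiments

# Toy stand-ins for the real components (the multilingual NMT model P_theta and
# the fixed monolingual language models); they exist only so one dual-learning
# step can be executed end to end.

def sample_translation(src_tokens, tgt_tag):
    """Step 1: sample y ~ P_theta(. | <tgt_tag> x). Toy stub: shuffles the input."""
    y = list(src_tokens)
    random.shuffle(y)
    return y

def translation_logprob(tokens, conditioning_tokens, tgt_tag):
    """log P_theta(tokens | <tgt_tag> conditioning_tokens). Toy stub: length-based score."""
    return -0.5 * len(tokens) - 0.1 * len(conditioning_tokens)

def lm_logprob(tokens):
    """log LM(tokens) under the fixed language model of the intermediate language. Toy stub."""
    return -1.0 * len(tokens)

def dual_step(x_tokens, tgt_tag, src_tag):
    """One zero-shot dual learning step on a monolingual sentence x."""
    y_hat = sample_translation(x_tokens, tgt_tag)       # step 1: forward sample
    r1 = lm_logprob(y_hat)                              # step 2: fluency reward
    r2 = translation_logprob(x_tokens, y_hat, src_tag)  # step 3: reconstruction reward
    r = ALPHA * r1 + (1.0 - ALPHA) * r2                 # step 4: total reward
    # Step 5 (not executed here): scale the cross-entropy gradients of
    # log P_theta(y_hat | x) and log P_theta(x | y_hat) by r (minus a batch
    # baseline) and apply the optimizer update to theta.
    return y_hat, r

print(dual_step(["la", "casa", "es", "azul"], "<2fr>", "<2es>"))
```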

3.2 Details of the gradient computation

The reward for generating the sampled target ŷ from source x is defined as:

r(x, ŷ) = α log LM_Y(ŷ) + (1 − α) log P_θ(x | ŷ)        (1)

We want to optimize the expected reward under the translation model P_θ (the policy), and thus compute the gradient of the expected reward with respect to θ:

∇_θ E_{ŷ∼P_θ(·|x)}[ r(x, ŷ) ] = E_ŷ[ r(x, ŷ) ∇_θ log P_θ(ŷ | x) ]        (2)
                              + E_ŷ[ (1 − α) ∇_θ log P_θ(x | ŷ) ]        (3)

Here we have used the product rule and the identity ∇_θ P_θ = P_θ ∇_θ log P_θ. Notice that the reward depends (differentiably) on θ, because of the reconstruction term.

Since the gradient is a conditional expectation, one can sample ŷ ∼ P_θ(· | x) to obtain an unbiased estimate of the expected gradient [16]:

∇_θ E_ŷ[ r(x, ŷ) ] ≈ r(x, ŷ) ∇_θ log P_θ(ŷ | x) + (1 − α) ∇_θ log P_θ(x | ŷ),  with ŷ ∼ P_θ(· | x)        (4)

Any sampling approach can be used to estimate the expected gradient. He et al. [8] use beam search instead of sampling to avoid large variance and unreasonable results. We generate from the policy by randomly sampling the next word from the predicted word distribution. However, we found it helpful to make the distribution more deterministic by using a low softmax temperature. To reduce variance further, we compute a baseline across the batch: we average the rewards over the whole batch and subtract this average from each reward. We optimize the likelihood of the translation system on the monolingual data of both the source and target languages, while scaling the gradients (equivalently, the loss or the negative likelihood) by the total rewards.
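As an illustration of these two variance-reduction choices, the sketch below (plain Python; the temperature value 0.5 is a placeholder, since the exact value is not restated here) samples the next word from a temperature-sharpened softmax and subtracts a batch-mean baseline from the rewards.

```python
import math
import random

random.seed(1)

def sample_with_temperature(logits, temperature=0.5):
    """Sample an index from softmax(logits / T). A temperature T < 1 sharpens the
    distribution, making sampling more deterministic; 0.5 is a placeholder value."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(l - m) for l in scaled]
    z = sum(exps)
    probs = [e / z for e in exps]
    u, acc = random.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if u <= acc:
            return i
    return len(probs) - 1

def subtract_batch_baseline(rewards):
    """Subtract the batch-mean reward from each reward before scaling gradients."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]

print(sample_with_temperature([2.0, 0.5, -1.0]))        # index of the sampled next word
print(subtract_batch_baseline([-3.1, -2.4, -4.0, -2.9]))
```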

3.3 System implementation

Figure 1: The zero-shot dual learning procedure (bottom-up, left-to-right). On the forward translation step, the encoder receives the source sentence x as input, prefixed with the target-language tag, and the decoder randomly samples a translation ŷ, which is passed to the language model to generate the fluency reward. At the same time, ŷ is encoded by the same model, now prefixed with the source-language tag, and the decoder is used to compute the probability of the original sentence x conditioned on ŷ. Both encoder and decoder calls share the same parameters.

The system is implemented as an extension of the TensorFlow [1] NMT tutorial code (https://www.tensorflow.org/tutorials/seq2seq) and will be released as open source. Figure 1 visualizes the training process. The model is optimized after two calls: one for the forward translation and another for the backward translation. During the forward translation, the encoder takes a sentence x, prefixed with the tag of the target language, and the decoder samples a translation output ŷ. At each step the decoder randomly picks the next word from the current distribution; this word is fed as input at the next decoding step.

Then, ŷ is fed to the language model RNN trained on the target language. The language model is kept fixed during the whole process. Observe that the special begin-of-sentence symbol is fed first, so that we can also obtain the probability of the first word. Here, we use the output logits to calculate the probability of the sampled sentence. The first reward is computed as:

r_1 = log LM_Y(ŷ)        (5)

In addition, the backward translation call is executed: ŷ, with the source-language tag prepended, is encoded into a sequence of hidden state vectors by the encoder. These outputs are passed to the decoder, which is fed the original source words at each step. The output logits are used to calculate the probability of the source sentence x being reconstructed from ŷ. This value is used as the second reward:

r_2 = log P_θ(x | ŷ)        (6)
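Both r_1 and r_2 reduce to the same computation: the sum of per-token log-softmax probabilities of an observed token sequence, read off the output logits. The sketch below illustrates this with plain Python lists standing in for the actual TensorFlow tensors; the helper name is ours.

```python
import math

def sentence_logprob(step_logits, token_ids):
    """Sum of per-step log-softmax probabilities of the observed tokens.
    With the decoder fed the original source words, this yields r_2 = log P_theta(x | y_hat);
    with the language model fed y_hat, the same computation yields r_1 = log LM_Y(y_hat)."""
    total = 0.0
    for logits, tok in zip(step_logits, token_ids):
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))  # log-sum-exp normalizer
        total += logits[tok] - log_z
    return total

# Two decoding steps over a toy vocabulary of four symbols; the observed tokens have ids 2 and 0.
print(sentence_logprob([[0.1, 0.3, 2.0, -1.0], [1.5, 0.2, 0.0, -0.3]], [2, 0]))
```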

4 Experiments

4.1 Dataset

A key factor in the choice of dataset was that we require parallel data for the supervised language pairs (Z-X, Z-Y) in order to perform zero-shot translation (X-Y). This prevented us from working with the widely used WMT datasets. We chose the United Nations Parallel Corpus (UN corpus) [20] instead. The corpus is composed of official records and other parliamentary documents of the United Nations in the six official UN languages: Arabic (ar), Chinese (zh), English (en), French (fr), Russian (ru) and Spanish (es). The corpus contains pairwise aligned documents as well as a fully aligned sub-corpus for the six languages, and thus allows the control needed for our experiments without having to resort to human ratings. Moreover, the corpus provides official development and test sets composed of the documents released in 2015. Both sets comprise 4,000 sentences, aligned across all six languages. This allows experiments to be evaluated, and replicated, in all directions.

4.2 Vocabulary

As in [11], we segment the data using an algorithm similar to Byte-Pair Encoding (BPE) [14] in order to deal efficiently with unknown words while keeping a limited vocabulary. This approach offers a good balance between the flexibility of character-delimited models and the efficiency of word-delimited models. The corpus is initially represented with a character vocabulary, which is then iteratively updated by replacing the most frequent symbol pair in the corpus with a new symbol, adding one new element to the vocabulary at each step. A toy illustration of this merge loop is given below.
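The following Python sketch is a toy version of the BPE-style merge loop (for illustration only, not the segmentation code we actually used): count adjacent symbol pairs, merge the most frequent one, and grow the vocabulary by one unit per iteration.

```python
from collections import Counter

def most_frequent_pair(corpus):
    """Count adjacent symbol pairs over a corpus of space-separated symbol sequences."""
    pairs = Counter()
    for word, freq in corpus.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(corpus, pair):
    """Replace every adjacent occurrence of the pair with a new merged symbol."""
    a, b = pair
    merged_corpus = {}
    for word, freq in corpus.items():
        symbols, out, i = word.split(), [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == (a, b):
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged_corpus[" ".join(out)] = freq
    return merged_corpus

# Toy corpus: words as space-separated characters, with frequencies.
corpus = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}
for _ in range(3):  # the real vocabulary uses 32K merge operations; 3 suffice here
    pair = most_frequent_pair(corpus)
    corpus = merge_pair(corpus, pair)
    print(pair, corpus)
```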

We build the vocabulary on the datasets sampled for the translation system. We apply 32K merge operations, which yields a vocabulary of roughly 33K sub-word units shared across the three languages. Every model that we train (the language models and the translation systems) relies on this vocabulary, unless explicitly stated otherwise. Three language tags are appended to the vocabulary for the translation systems, to enable the zero-shot mechanism (the language models do not use them).

4.3 Language models

We picked 4 million random sentences per language from the UN corpus training data to train the language models. For example, the 4 million Spanish sentences are drawn from the union of the Spanish sides of the English-Spanish and Spanish-French corpora; the French sentences are drawn symmetrically from the English-French and Spanish-French files. We removed sentences longer than 100 words.

The language model is a 2-layer, 512-unit LSTM, with 65% dropout probability applied to all non-recurrent connections. It is trained using Stochastic Gradient Descent (SGD) to minimize the negative log probability of the target words. We trained the model for six epochs, halving the learning rate in every epoch starting from the third. The norm of the gradients is clipped at 10. We used a batch size of 64, and the RNN is unrolled for 35 time steps. Vocabulary items are embedded into 512 dimensions. Each language model was trained on a single Tesla-K80 for approximately 60 hours.
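For reference, a rough tf.keras sketch of a language model with the stated shape is given below. It is an assumption-laden re-implementation rather than the authors' code: the layer choices and exact dropout placement are our guesses, and the initial learning rate (1.0 below) is a placeholder, since the exact value is not restated here.

```python
import tensorflow as tf

VOCAB_SIZE = 33000  # approximate shared sub-word vocabulary (Section 4.2)
EMBED_DIM = 512
UNITS = 512
SEQ_LEN = 35        # unroll length

def build_lm():
    """2-layer, 512-unit LSTM language model with dropout on non-recurrent connections."""
    inputs = tf.keras.Input(shape=(SEQ_LEN,), dtype="int32")
    x = tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM)(inputs)
    x = tf.keras.layers.Dropout(0.65)(x)
    x = tf.keras.layers.LSTM(UNITS, return_sequences=True)(x)
    x = tf.keras.layers.Dropout(0.65)(x)
    x = tf.keras.layers.LSTM(UNITS, return_sequences=True)(x)
    x = tf.keras.layers.Dropout(0.65)(x)
    logits = tf.keras.layers.Dense(VOCAB_SIZE)(x)  # next-word logits at each step
    model = tf.keras.Model(inputs, logits)
    model.compile(
        # placeholder initial learning rate; gradient norm clipped at 10 as stated above
        optimizer=tf.keras.optimizers.SGD(learning_rate=1.0, clipnorm=10.0),
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    )
    return model

lm = build_lm()
lm.summary()
```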

Our Spanish and French language models obtain wordpiece-level test perplexities. Since the perplexities we report are at the sub-word unit level, their quality is not easy to quantify. For that reason, we also use the KenLM implementation [9] with the same samples and vocabulary and compare test perplexities for the Spanish and French language models. Even though the difference is significant, we do not focus on improving the language models, as they are sufficient to obtain a coarse fluency metric.

4.4 Translation systems

For the translation systems, we randomly sampled 1M parallel sentences for each language pair; the sentences are disjoint across language pairs. We removed sentences longer than 100 tokens. The development and test sets are used as provided in both cases.

For our base model, we used an NMT model with LSTM cells of 1024 hidden units and 8 layers, as described in [18]. We re-used their choice of hyper-parameters (https://github.com/tensorflow/nmt/blob/master/nmt/standard_hparams/wmt16_gnmt_8_layer.json), except for the learning rate, which is halved at regular intervals during training. The baseline zero-shot NMT was trained on a single Tesla-P100 for approximately one week.

Training with reinforcement learning is performed on 1M monolingual sentences, 500k each for Spanish and French. This data does not include aligned pairs across languages and is disjoint from the data used to train the base NMT model. The model configuration is the same as for the NMT base model, except for the batch size, which has to be halved for memory reasons, given that we now need to load two language models. We set α = 0.005, as in [8], and did not attempt to optimize hyper-parameters to fine-tune the model. Training takes approximately 3 days on a single Tesla-P100 (all experiments were carried out on the Google Cloud Platform).

5 Results

             Phrase-based   NMT-0        NMT-S        NMT-F        Dual-0       Dual-S
Aligned      en-fr (11M)    en-fr (1M)   en-fr (1M)   en-fr (1M)   en-fr (1M)   en-fr (1M)
data         en-es (11M)    en-es (1M)   en-es (1M)   en-es (1M)   en-es (1M)   en-es (1M)
             es-fr (11M)    --           es-fr (10k)  es-fr (1M)   --           es-fr (10k)
Monolingual  --             --           --           --           es (0.5M)    es (0.5M)
data         --             --           --           --           fr (0.5M)    fr (0.5M)
en→es        61.26          49.00        43.33        44.06        37.05        38.74
es→en        59.89          49.67        40.17        18.24        32.84        32.03
en→fr        50.09          37.88        33.71        34.75        29.58        30.89
fr→en        52.22          42.12        34.17        13.58        27.95        26.00
es→fr        52.44          10.02        33.10        37.67        35.54        35.63
fr→es        49.79           6.25        38.33        40.85        38.83        39.00
Table 1: BLEU scores on the UN corpus test set. Each line reports the BLEU scores of the corresponding translation direction. The first column refers to the phrase-based model of Ziemski et al. [20]. All NMT and Dual models are trained on 1M en-es and 1M en-fr aligned sentences, used in both directions. The NMT-S (small) model is trained additionally on 10k es-fr aligned sentences, while NMT-F (full) is trained additionally on 1M es-fr sentences. The Dual-0 model does not use any es-fr aligned data, while Dual-S is trained starting from NMT-S.

5.1 Supervised performance

We report BLEU scores on the official test sets, as computed by the multi-bleu-detok.perl script from the public Moses implementation (http://www.statmt.org/moses/). The first column of Table 1 reports the BLEU scores from [20] for a phrase-based translation system. Notice that these models are trained on the complete fully-aligned UN data, that is, 11M sentence pairs for each language pair. The second column reports the scores of our multilingual NMT baseline model (NMT-0). This is trained on 1M en-es and 1M en-fr parallel sentences. It is not trained on any es-fr data.

The first four lines of BLEU scores (en-es, en-fr) show that, in the supervised setting, the phrase-based model outperforms NMT-0 by more than 10 points. The amount of training data obviously affects model accuracy, but we speculate that the lower performance, with respect to the phrase-based systems, is also due to model capacity. All NMT models learn to translate in six different directions at the same time (four in the case of NMT-0), so the number of parameters per language/direction decreases as the number of languages increases. To check this hypothesis, we performed the following experiment: we trained a model with the same parameters for only one translation direction, en→es, with a vocabulary built only on the data of this language pair. In this case, we obtain a noticeably higher BLEU score than with the multilingual NMT-0 model. It seems plausible that the NMT system could be tuned to match the reference UN phrase-based system, but this is beyond the scope of this work.

5.2 Unsupervised performance

The last two lines of Table 1 concern the low-resource pair (es-fr), which is the main focus of this work. NMT-0 performs worse than we expected between Spanish and French: 10.02 for es→fr and 6.25 for fr→es. In particular, we notice that NMT-0 often translates, partially, to English or back into the source language. The zero-shot dual system (Dual-0) starts training from the parameters of the NMT-0 model. We observe a marked improvement for the zero-shot translation directions, outperforming the baseline by 25.52 and 32.58 BLEU points for es→fr and fr→es respectively. By inspection, the translation quality seems quite reasonable. Example translations of different models are discussed in Appendix A. We also observe "catastrophic forgetting" [7]: BLEU scores in the supervised translation directions degrade, as no parallel data for these pairs is seen during the incremental training. This happens for all incrementally trained models.

The most important comparison is with respect to the fully supervised setting. For this purpose, we resume training of the NMT-0 model with the full set of 1M es-fr parallel sentences. The resulting model is labeled NMT-F in Table 1. With respect to forgetting, we notice that, by never decoding into English, NMT-F degrades more in this language. Interestingly, the dual model's forgetting is more uniformly distributed across all original language pairs. However, NMT-F improves es→fr from 10.02 to 37.67 and fr→es from 6.25 to 40.85 BLEU points.

The results of the zero-shot dual model (Dual-0) are thus roughly two BLEU points away in both directions, without seeing any parallel sentence for the es-fr pair. To the best of our knowledge, this is the closest unsupervised machine translation has come to the performance of the supervised setting. Although not directly comparable, this setup is similar to the Portuguese-to-Spanish evaluation in [11], where the best zero-shot model (Model 2) obtains a BLEU score of 24.75 against 31.50 for the fully supervised NMT.

Johnson et al. [11] showed that zero-shot translation can benefit from small amounts of parallel data. We simulate this case with the NMT-S model, which is additionally trained on 10k es-fr aligned sentences. The improvement is considerable, as NMT-S is only 2.52 (fr→es) and 4.57 (es→fr) BLEU points away from the "fully supervised" NMT-F. Nevertheless, NMT-S is still outperformed by the Dual-0 model. The improvements carry over, to a smaller degree, to the zero-shot dual model trained from NMT-S on the 10k es-fr sentences (Dual-S), which brings performance closer to the fully supervised setting in both directions.

5.2.1 Extension to more languages

             Phrase-based          NMT-0                 NMT-F                 Dual-0
Aligned      en-fr, en-es, en-ru   en-fr, en-es, en-ru   en-fr, en-es, en-ru   en-fr, en-es, en-ru
data         es-fr, es-ru, fr-ru   --                    es-fr, es-ru, fr-ru   --
             (11M each)            (1M each)             (1M each)             (1M each)
Monolingual  --                    --                    --                    es, fr, ru (0.5M each)
en→es        61.26                 47.51                 44.96                 44.30
es→en        59.89                 48.56                 42.75                 45.55
en→fr        50.09                 36.70                 34.27                 34.34
fr→en        52.22                 40.75                 34.99                 37.75
en→ru        43.25                 30.45                 29.97                 29.47
ru→en        52.59                 39.35                 37.09                 37.96
es→fr        52.44                 25.85                 36.65                 34.51
fr→es        49.79                 22.68                 40.19                 37.71
es→ru        39.69                  9.36                 26.55                 24.55
ru→es        49.61                 26.26                 35.98                 33.23
fr→ru        36.48                  9.35                 24.50                 22.76
ru→fr        43.37                 22.43                 29.47                 26.49
Table 2: BLEU scores for the experiment with four languages.

We also experimented with four languages, by adding Russian to the set. We follow the same procedure to sample the training data and build a shared vocabulary covering the four languages. We train the NMT baseline on 1M parallel sentences from each pair: en-es, en-fr and en-ru. Thus, any translation direction that does not involve English is an unsupervised translation direction. We train language models for all languages except English. The purpose of this experiment is two-fold. On the one hand, we want to analyze how our approach works for languages with different alphabets. On the other hand, we want to see how it performs when there are multiple language pairs to be improved.

Table 2 summarizes the results. Regarding the supervised translation directions, NMT-0 maintains the relative performance with respect to the phrase-based results, that is, it is still more than 10 BLEU points lower, as in the previous experiment. However, the zero-shot directions that do not have Russian as the target language obtain good performance compared to the experiment with three languages. We do not yet have a good explanation for this phenomenon. Interestingly, the base NMT model also performs much better with Russian as the source language, while translating into Russian performs poorly.

Similar to the previous experiment, we resume the baseline NMT's training with 1M parallel sentences from each of the pairs es-fr, es-ru and fr-ru, so as to have an estimate of the headroom with respect to the supervised setting at comparable capacity and training schedule (NMT-F in Table 2). The last column in Table 2 shows the results of the model after zero-shot dual learning, trained on 1.5M sentences of monolingual data, 500k each for Spanish, French and Russian. We can see that performance is greatly improved in this case as well. Moreover, it is consistent with what we saw in the previous experiment, i.e. the BLEU score of every zero-shot translation direction is roughly 2-3 BLEU points below the supervised score of the base model.

6 Conclusion

We propose an approach to zero-shot machine translation between language pairs for which no aligned data is available. We build upon a multilingual NMT system [11] by applying reinforcement learning, using only monolingual data for the zero-shot translation pairs, inspired by dual learning [8].

Experiments show that this approach comes close to the performance of the corresponding supervised setting for unsupervised language pairs. Zero-shot dual learning outperforms the multilingual NMT baseline model even when a small parallel corpus for the zero-shot language pair is available. Finally, we have shown that our model can easily scale up to improve the zero-shot translation of multiple language pairs, yielding improvements of up to roughly 15 BLEU points over the base NMT model, even when the languages belong to completely different families.

These results show that this is a promising framework for machine translation for low-resource languages, given that aligned data already exists for numerous other language pairs. The framework seems particularly promising in combination with techniques like bridging, given the abundance of data to and from English and other widely used languages. For future work, we would like to better understand the relation between model capacity, the number of languages, and language families, in order to optimize information sharing. Exploiting cross-language generalization more explicitly, e.g. in combination with recent unsupervised methods for embedding optimization [12, 2], also seems promising. Finally, we would like to extend the reinforcement learning approach to include other, e.g. linguistically motivated, reward components.

References

  • Abadi et al. [2016] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al. Tensorflow: A system for large-scale machine learning. In OSDI, volume 16, pages 265–283, 2016.
  • Artetxe et al. [2018] M. Artetxe, G. Labaka, E. Agirre, and K. Cho. Unsupervised neural machine translation. In ICLR, 2018.
  • Bahdanau et al. [2015] D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. In ICLR, 2015.
  • Bahdanau et al. [2017] D. Bahdanau, P. Brakel, K. Xu, A. Goyal, R. Lowe, J. Pineau, A. Courville, and Y. Bengio. An actor-critic algorithm for sequence prediction. In ICLR, 2017.
  • Cho et al. [2014] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In EMNLP, 2014.
  • Conneau et al. [2018] A. Conneau, G. Lample, M. Ranzato, L. Denoyer, and H. Jégou. Word translation without parallel data. In ICLR, 2018.
  • French [1999] R. M. French. Catastrophic forgetting in connectionist networks. Trends in cognitive sciences, 3(4):128–135, 1999.
  • He et al. [2016] D. He, Y. Xia, T. Qin, L. Wang, N. Yu, T. Liu, and W.-Y. Ma. Dual learning for machine translation. In Advances in Neural Information Processing Systems, pages 820–828, 2016.
  • Heafield [2011] K. Heafield. KenLM: Faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 187–197. Association for Computational Linguistics, 2011.
  • Jean et al. [2014] S. Jean, K. Cho, R. Memisevic, and Y. Bengio. On using very large target vocabulary for neural machine translation. In ACL-IJCNLP, 2014.
  • Johnson et al. [2016] M. Johnson, M. Schuster, Q. V. Le, M. Krikun, Y. Wu, Z. Chen, N. Thorat, F. Viégas, M. Wattenberg, G. Corrado, et al. Google’s multilingual neural machine translation system: enabling zero-shot translation. arXiv preprint arXiv:1611.04558, 2016.
  • Lample et al. [2018a] G. Lample, L. Denoyer, and M. Ranzato. Unsupervised machine translation using monolingual corpora only. In ICLR, 2018a.
  • Lample et al. [2018b] G. Lample, M. Ott, A. Conneau, L. Denoyer, and M. Ranzato. Phrase-based & neural unsupervised machine translation. arXiv preprint arXiv:1804.07755, 2018b.
  • Sennrich et al. [2016] R. Sennrich, B. Haddow, and A. Birch. Neural machine translation of rare words with subword units. In ACL, 2016.
  • Sutskever et al. [2014] I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pages 3104–3112, 2014.
  • Sutton et al. [2000] R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in neural information processing systems, pages 1057–1063, 2000.
  • Williams [1992] R. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Mach. Learn., 8(3-4), 1992.
  • Wu et al. [2016] Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, et al. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016.
  • Xia et al. [2017] Y. Xia, T. Qin, W. Chen, J. Bian, N. Yu, and T.-Y. Liu. Dual supervised learning. In ICML, 2017.
  • Ziemski et al. [2016] M. Ziemski, M. Junczys-Dowmunt, and B. Pouliquen. The United Nations Parallel Corpus v1.0. In LREC, 2016.

Appendix A Qualitative Analysis

A.1 Experiment with Three Languages

Here we analyze the quality of the test-set translations before and after the zero-shot dual learning algorithm. Some examples are shown in Table 3. Examples are separated by a double horizontal line. For each, we provide the source, the translation from the multilingual NMT baseline (trained on en-es and en-fr data), the translation after zero-shot dual learning, and the corresponding reference.

Analyzing the output translations for both directions of the zero-shot pair at a high level, we noticed that the multilingual NMT baseline tends to translate into English approximately half of the time. Apart from these, many sentences are rewritten in the source language. Still, a number of sentences are correctly translated into the target language already by the multilingual NMT baseline model. We observed that these sentences are in general short or have a specific structure, for example, meeting dates.

Analyzing the translations obtained after the reinforcement learning algorithm instead, we can see that all outputs are in the target language; only a few sentences contain a word that is not correctly translated. This means that our approach is capable of training the model to translate into the required language, even without any parallel data for the zero-shot language pair.

(1) Source L’EIIL a mis en ligne des vidéos dans lesquelles on peut voir des
personnes subir toute une série de châtiments abominables : (…)
NMT-0 L’EIIL a mis en place des videos en indiquant que les personnes
séroviennent d’un ensemble d’abominables , (…)
Dual-0 La EIIL puso en marcha videos en los que se puede ver a
personas que padecen toda una serie de castigos abominables: (…)
Reference El propio EIIL ha publicado vídeos en los que se muestran
imágenes de personas sometidas a una serie de castigos atroces, (…)
(2) Source El propio EIIL ha publicado vídeos en los que se muestran imágenes
de personas sometidas a una serie de castigos atroces, (…)
NMT-0 L’EIIL itself a publié des videos en indiquant des images de personnes
faisant l’objet d’une série de atroces châtiments , (…)
Dual-0 L’ensemble EIIL a publié des vidéos en indiquant des images de personnes
soumises à une série de châtiments atroces, (…)
Reference L’EIIL a mis en ligne des vidéos dans lesquelles on peut voir des
personnes subir toute une série de châtiments abominables : (…)
(3) Source L’UNICEF a offert une aide en espèces d’urgence à des dizaines
de milliers de familles dans les camps de déplacés (…).
NMT-0 UNICEF provided emergency assistance to tens of thousands of
families in camps of internally displaced persons (…) .
Dual-0 El UNICEF ha ofrecido asistencia en efectivo de emergencia a decenas
de miles de familias en los campamentos de desplazados internos (…) .
Reference El UNICEF pagó ayudas monetarias de urgencia para asistir a
decenas de miles de familias desplazadas en los campamentos (…).
(4) Source Tenue de réunions périodiques par téléconférence pour coordonner
la planification conjointe des événements majeurs;
NMT-0 Continuación de reuniones periódicas de videoconferencia para
coordinar la celebración de seminarios de planificación de los
principales acontecimientos;
Dual-0 Celebración de reuniones periódicas por teleconferencia para coordinar
la planificación conjunta de los acontecimientos principales;
Reference Reuniones periódicas por teleconferencia para coordinar
la planificación conjunta de los eventos principales;
(5) Source Elles ont en outre aveuglément frappé des zones résidentielles,
y compris des camps de réfugiés, (…).
NMT-0 En addition, a menudo se abusan de camas de zonas,
y compris les camps de réfugiés , (…).
Dual-0 También han armado ataques a las zonas residenciales,
incluidas campamentos de refugiados, (…).
Reference También han atacado de manera indiscriminada zonas
residenciales, incluidos los campamentos de refugiados, (…).
(6) Source Que se celebrará el jueves 2 de abril de 2015 a las 10.15 horas
NMT-0 Que se tiendra le mardi 2 avril 2015 , à 10 h 15
Dual-0 Qui se tiendra le jeudi 12 mai 2015, à 15 heures
Reference Qui se tiendra le jeudi 2 avril 2015, à 10 h 15
(7) Source La supresión de la tasa de cambio mínima frente al euro llevó
aparejado un nuevo movimiento negativo de las tasas de interés (…).
NMT-0 The abolition of the minimum exchange rate compared to the euro
resulted in a new negative movement of interest rates (…).
Dual-0 La suppression de la croissance minimale face au euro a conduit à un
nouveau mouvement négatif des taux d’intérêt (…).
Reference La suppression du taux plancher s’est accompagnée d’une baisse
du taux d’intérêt déjà négatif (…).
Table 3: Qualitative analysis of the translations for the experiment with 3 languages. Examples are separated by a double horizontal line. For each, the first row shows the source sentence, the second is the outcome of the multilingual NMT baseline NMT-0 (trained on en-es, en-fr data), the third line shows the translation of the model after the zero-shot dual learning Dual-0 and the last row shows the reference.

The first two examples in Table 3 show the outcome of the two translation directions for the same sentence pair. In the first case, NMT-0 rewrites the sentence in French, while the zero-shot dual model fully recovers the sentence in Spanish. In the second, the base NMT model already translates the sentence fairly well into French, even though it contains some English words ("itself", "video"). Although the zero-shot Dual-0 avoids the English words, it does not improve much further.

Examples 3 and 7 show how zero-shot dual learning produces a translation in the target language even when the base NMT-0's translation is in English, while Example 5 shows the same result when the NMT-0 output is a mixture of languages.

The fourth example shows the improvement that Dual-0 obtains over the baseline NMT's translation, which is already in the target language, by removing terms that do not belong to the reference.

Finally, Example 6 provides a sample of the sentences that are, in general, correctly translated by the base NMT-0. In this case, the day of the meeting is incorrect in the baseline's translation. Dual-0 fixes that, along with the first word, but introduces mistakes in the date and the time.

A.2 Experiment with Four Languages

The BLEU scores of the experiment involving four languages showed that the zero-shot translation of the baseline NMT is reasonable for all directions that do not have Russian as the target.

For es→ru and fr→ru, we noticed that, similarly to what happened with the zero-shot translations in the previous experiment, almost half of the sentences are translated into English. This could explain the similar BLEU scores that these two directions and the zero-shot translations from the previous experiment obtain. However, the rest, a little more than half of the test set, is correctly decoded into Russian. Unlike in the previous experiment, no sentence gets rewritten in the source language.

Consistent with the behavior of the zero-shot translations in the previous experiment, the translations that are correct are in general short sentences. However, most of the Russian translations do not coincide with the corresponding references. This explains the low BLEU score, even though half of the dataset is correctly decoded into Russian.

In contrast, for the translation directions with Russian as the source language, most of the sentences are correctly decoded into the corresponding target language. A small number of sentences are translated into English. Interestingly, Russian-to-Spanish outputs contain some French sentences, while Russian-to-French outputs barely contain any Spanish sentence, although they do include some Spanish words.

Finally, the Spanish-French translations (both directions) are in general correctly translated into the target language. A small number of sentences are translated into English, and an even smaller number gets rewritten in the source language.

The output of the zero-shot dual model behaves as in the previous experiment: the target language dominates the translation outputs. English disappears completely from the outputs, except for some proper names that confuse the model.

(1) Source Servicios de Gestión Estratégica
NMT-0 Strategic Management Services
Dual-0 Services de gestion stratégique
Reference Services de gestion stratégique
(2) Source 1. Aprobación del orden del día.
NMT-0 1. Adoption du jour de l’ordre du jour.
Dual-0 1. Adoption du jour.
Reference 1. Adoption de l’ordre du jour.
(3) Source Dans sa résolution adoptée en 2014 intitulée «Renforcement de l’efficacité et
l’amélioration de l’efficience des garanties de l’Agence », la Conférence (…).
NMT-0 Dans sa résolution adoptée en 2014 intitulée «Renforcement de l’efficacité et
l’amélioration de l’efficience des garanties de l’Agence », la Conférence (…).
Dual-0 En su resolución aprobada en 2014 titulado "Fortalecimiento de la eficiencia y
mejora de la eficiencia de las salvaguardias de la Autoridad", la Conferencia (…)
Reference En su resolución de 2014 relativa al fortalecimiento de la eficacia y aumento
de la eficiencia de las salvaguardias del Organismo, la Conferencia (…).
(4) Source La Oficina se propone fomentar y mantener una cultura institucional de
ética y rendición de cuentas, con el fin de aumentar tanto la credibilidad
como la eficacia de las Naciones Unidas.
NMT-0 31. Канцелярия планирует поощрять and maintain a corporate culture and
accountability culture, with a view to increasing the credibility and
effectiveness of the United Nations.
Dual-0 Управление предлагается поощрять и поддерживать институциональную
культуру этики и подотчетности, с тем чтобы повысить доверие и
эффективность Организации Объединенных Наций.
Reference Бюро призвано формировать и поддерживать корпоративную культуру
строгого соблюдения этических норм и подотчетности с целью укрепления
авторитета и эффективности Организации Объединенных Наций.
(5) Source 3. Форум рекомендует Продовольственной и сельскохозяйственной
организации Объединенных Наций (ФАО) в координации с коренными
народами организовывать учебные курсы и другие мероприятия (…).
NMT-0 1. Le Forum Permanent Forum on Indigenous Issues congratuce the
International Fund for Agricultural Development (IFAD) for its efforts in
the area of rural development to address the problems of food and hunger (…).
Dual-0 3. El Foro recomienda a la Organización de las Naciones Unidas para la
Agricultura y la Alimentación (FAO) en coordinación con los pueblos
indígenas organizando cursos de capacitación y otras actividades (…).
Reference El Foro recomienda que la Organización de las Naciones Unidas para la
Alimentación y la Agricultura (FAO), en coordinación con los pueblos
indígenas, organice cursos de formación y otras actividades (…).
(6) Source Les initiatives de la Chambre de commerce internationale ont contribué
indirectement à l’atteinte des OMD.
NMT-0 Инициативы of the International Chamber of Commerce have contributed
indirectly to the detriment of the MDGs.
Dual-0 Инициативы Палаты международной торговли внесли существенный
вклад в ущерб ЦРДТ.
Reference Инициативы Международной торговой палаты косвенно внесли
вклад в достижение целей в области развития, сформулированных
в Декларации тысячелетия.
(7) Source Les activités de l’organisation sont principalement axées sur la
sensibilisation aux violations des droits de l’homme dans le monde.
NMT-0 Деятельность Организации, главным образом, focuses on
raising awareness of human rights violations worldwide.
Dual-0 Деятельность организации главным образом уделяет повышению
информированности о нарушениях прав человека во всем мире.
Reference Деятельность Института главным образом направлена на
повышение осведомленности о нарушениях прав человека в
разных странах мира.
(8) Source Мы ожидаем в этой связи результаты глобального исследования
в отношении детей, лишен ных свободы;
NMT-0 Nous attendons en conséquence les résultats de l’étude mondiale
concernant les enfants victimes de liberté ;
Dual-0 Nous attendons en conséquence les résultats de l’étude mondiale
concernant les enfants privés de liberté ;
Reference Nous attendons à cet égard avec intérêt les résultats de l’enquête
mondiale sur les enfants privés de liberté;
Table 4: Qualitative analysis of the translations for the experiment with 4 languages. Examples are separated by a double horizontal line. For each, the first row shows the source sentence, the second is the outcome of the multilingual NMT baseline NMT-0 (trained on en-es, en-fr and en-ru data), the third line shows the translation of the model after the zero-shot dual learning (Dual-0), and the last row shows the reference.

Table 4 gives a few examples that allow us to evaluate all zero-shot translation directions. The first two examples show two scenarios for the es→fr translation: the first is translated into English by the multilingual NMT baseline model, while the second is already translated into French.

The third example shows how the zero-shot translation from NMT-0 essentially copies the source sentence in the portion shown, whereas the Dual-0 model produces a good translation.

The directions with Russian as the target (see Examples 4, 6 and 7) show a mixture of English and Russian in the baseline NMT model. Still, our zero-shot dual model is capable of learning fairly good Russian translations for the given source sentences.

Interestingly, the translation directions with Russian as the source language (see Examples 5 and 8) are never rewritten in Russian by NMT-0; they are mostly translated into either English or the target language.

This analysis explains the difference in BLEU score between the multilingual NMT baseline and the zero-shot dual system. Furthermore, it shows the effect that our approach has on vanilla zero-shot translation, even when multiple zero-shot translation directions are involved, demonstrating the potential of our approach.
