Towards Interlingua Neural Machine Translation
A common intermediate language representation, or interlingua, is the holy grail of machine translation. Thanks to the new neural machine translation approach, there seem to be good prospects of reaching this goal. In this paper, we propose a new architecture based on introducing an interlingua loss as an additional training objective. By adding and enforcing this interlingua loss, we are able to train multiple encoders and decoders for each language, sharing a common intermediate representation.
Preliminary translation results on the WMT Turkish/English and WMT 2019 Kazakh/English tasks show improvements over the baseline system. Additionally, since the final objective of our architecture is having compatible encoder/decoders based on a common representation, we visualize and evaluate the learned intermediate representations. What is most relevant from our study is that our architecture shows the benefits of the dreamed interlingua since it is capable of: (1) reducing the number of production systems, with respect to the number of languages, from quadratic to linear (2) incrementally adding a new language in the system without retraining languages previously there and (3) allowing for translations from the new language to all the others present in the system.
Carlos Escolano, Marta R. Costa-jussà, José A. R. Fonollosa
TALP Research Center, Universitat Politècnica de Catalunya, Barcelona
Machine translation -in a highly multilingual environment- poses several challenges, as the number of possible combinations of translation directions grows quadratically. Among those challenges are the acquisition and curation of parallel data and the allocation of hardware resources for training and inference purposes. This situation becomes even worse given that the translation quality depends strongly on the amount of available training data when willing to offer translation for a language pair where there is little or no parallel data available.
Neural Machine Translation (NMT) Cho et al. (2014); Sutskever, Vinyals, and Le (2014) has arisen as a completely new paradigm for MT outperforming previous statistical approaches Koehn, Och, and Marcu (2003) in most of the tasks. One clear exception is low-resourced tasks Koehn and Knowles (2017), where statistical MT still can outperform or be competitive with NMT Artetxe, Labaka, and Agirre (2018); Lample et al. (2018).
Among others, one clear advantage of NMT is that it opens new challenges in MT like multimodal MT Elliott et al. (2017). NMT is progressing fast and carries high expectations, among them finding a common intermediate representation that allows training a single encoder and decoder for each language, reducing the number of translation systems from a quadratic dependency on the number of languages to a linear one. As we will show in section 2, different approaches have used the idea of a common intermediate language in NMT. However, recent research on this topic has mainly evaluated whether the encoder-decoder NMT architecture with recurrent neural networks (RNNs), with or without attention mechanisms, is able to reach a universal language while training multiple languages Johnson et al. (2016); Schwenk and Douze (2017). These approaches can be further extended to unsupervised MT, where the system learns to translate between languages without parallel data just by enforcing the generation and representation of the tokens to be similar Artetxe et al. (2017); Lample, Denoyer, and Ranzato (2017). All these architectures share parameters between languages and/or require all languages to be trained at the same time. This forces the system to be retrained in order to add new languages that share the same representation.
In contrast, in this paper we specifically pursue training a common intermediate representation for the sole benefit of interlingua-based translation. The proposed approach therefore differs from previous cases in which the intermediate representation is an end in itself (e.g., Conneau et al. (2017, 2018)). Our proposed architecture combines variational autoencoders and encoder-decoders based on self-attention mechanisms Vaswani et al. (2017). In addition, in the optimisation process we add a loss term, which is the correlation between intermediate representations from different languages. In this way, we force the system to learn the intermediate representation while training multiple translation systems. One of the challenges at this point is to find a suitable distance function for the intermediate representation. To address this, we propose and evaluate different distance measures. Results on the WMT benchmark (English, Turkish and Kazakh; http://www.statmt.org/wmt19/) show that our architecture produces competitive translations and improves over the baseline system while offering an easy and inexpensive way of extending to new languages. Our architecture is capable of scaling to new languages without requiring re-training all languages in the system.
The rest of the paper is organised as follows. Section 2 reports the related work on the topic. Section 3 underlines the contributions of this study. Section 4 explains the necessary background to make the manuscript self-contained. Section 5 details the architecture proposed in this study, both the joint training and how to scale to new languages. Section 6 presents the measure used to evaluate the intermediate representation. Section 7 describes the data and implementation used in the experiments. Then, section 8 reports the translation results and section 9 provides several plots visualizing the intermediate representation. Finally, section 10 presents the most relevant conclusions of this study.
2 Related Work
Classical interlingua approaches AlAnsary (2011) aim at finding a universal or common language representation that involves a conceptual understanding of all the languages of the world. This has been the case of Esperanto Harlow (2013), the Universal Networking Language Kumar and Goel (2016) and many others. In contrast, in this work we focus on training a common language representation with deep learning techniques. The objective is to train an intermediate representation that allows using independent encoders and decoders for each language. In this scenario, the number of translation systems in a highly multilingual environment is reduced from quadratic to linear and, in addition, translation becomes available for language pairs that have not been explicitly trained. Unlike the classical approach, there is no requirement of semantics for this intermediate representation. The works most closely related to ours, in objective or methodology, are the following.
Johnson et al. (2016) feed a single encoder and decoder with multiple input and output languages. With this approach, authors show that zero-shot learning is possible. Authors show by means of visualizing similar sentences in different languages that there is some hint that these appear somehow close in the common representation.
These approaches vary from many encoders to one decoder (many-to-one) Zoph and Knight (2016), one encoder to many decoders (one-to-many) Dong et al. (2015a) and, finally, one encoder to one decoder (one-to-one), which we focus on because they are closest to our approach. Firat et al. (2016) propose to extend the classical recurrent NMT bilingual architecture Bahdanau, Cho, and Bengio (2015) to the multilingual case by designing a single encoder and decoder for each language with a shared attention-based mechanism. Schwenk et al. (2017) and Espana-Bonet et al. (2017) evaluate how a recurrent NMT architecture without attention is able to generate a common representation between languages. The authors use the inner product or cosine distance to evaluate the distance between sentence representations. Recently, Lu et al. (2018) train single encoders and decoders for each language, generating interlingua embeddings which are agnostic to the input and output languages.
Other related architectures
Although unsupervised MT Lample, Denoyer, and Ranzato (2017); Artetxe et al. (2017) does not directly pursue a common intermediate representation, it is related to our approach. Artetxe et al. (2017) and Lample et al. (2017) propose a translation system that is able to translate while being trained only on monolingual corpora. The architecture is basically a shared encoder with pre-trained embeddings and two decoders (one of them includes an autoencoder). On the other hand, our work is also related to recent works on sentence representations Conneau et al. (2017, 2018); Eriguchi et al. (2018). However, the main difference is that these works aim at extending representations to other natural language processing tasks, while we aim at finding the most suitable representation to make interlingua machine translation feasible. While interesting for further research, the evaluation and adaptation of this intermediate representation to multiple tasks is out of the scope of this study.
This paper proposes a proof of concept of a new multilingual NMT approach. The current approach is based on joint training without parameter sharing by enforcing a compatible representation between the jointly trained languages and using multitask learning Dong et al. (2015b). This approach is shown to offer a scalable strategy to new languages without retraining any of the previous languages in the system and enabling zero-shot translation. We show BLEU results over the baseline systems.
Another contribution is that we are proposing novel measures to evaluate the intermediate representation: first, in training time, when using the correlation to compare two intermediate representations, and second, in inference, when using BLEU to compare decoding outputs of two intermediate representations of the same input sentences coming from two different languages.
In this section, we report two techniques that are used for the development of the proposed architecture in this paper: variational autoencoders Rumelhart, Hinton, and Williams (1985) and decomposed vector quantization van den Oord, Vinyals et al. (2017).
4.1 Variational Autoencoders
Autoencoders consist of a generative model that is able to generate its own input. This is useful to train an intermediate representation, which can be later employed as a feature for another task or even as a dimensionality reduction technique. This is the case of traditional autoencoders that learn to produce an intermediate representation for an existing example. Variational autoencoders Rumelhart, Hinton, and Williams (1985); Kingma and Welling (2013); Zhang et al. (2016) present a different approach in which the objective is to learn the parameters of a probability distribution that characterizes the intermediate representation. This allows sampling new synthetic instances from the distribution and generating them using the decoder part of the network.
4.2 Decomposed Vector Quantization
One of the strategies to create variational autoencoders is vector quantization van den Oord, Vinyals et al. (2017). This consists of the addition of a table of dimension $K \times D$, where $K$ is the number of possible representations and $D$ the dimension or set of dimensions of each of the representations. The closest vector to the output of the encoder of the network is fed to the decoder as a discrete latent representation that is employed to reconstruct the input.
As noted in Kaiser et al. (2018), this approach may produce a vector quantisation in which only a small part of the vectors is employed. To solve this, decomposed vector quantisation uses a set of $n$ tables in which each table is used to represent a portion of the representation; the portions are later concatenated and fed to the decoder. This approach presents the advantage that by using $n$ tables and optimizing the same number of parameters, $K^n$ possible vectors of dimension $D$ can be generated.
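As an illustration of the lookup described above, the following is a minimal sketch in plain Python, not the implementation used in the experiments. The function names (`nearest`, `decomposed_quantize`), the toy tables and the use of squared Euclidean distance are assumptions made for this example. Each table quantizes one slice of the encoder output, and the selected entries are concatenated.

```python
def nearest(table, vec):
    """Index of the table entry closest to vec (squared Euclidean distance)."""
    return min(
        range(len(table)),
        key=lambda k: sum((c - v) ** 2 for c, v in zip(table[k], vec)),
    )


def decomposed_quantize(tables, vec):
    """Quantize each slice of vec with its own table and concatenate the results."""
    slice_size = len(vec) // len(tables)
    quantized = []
    for i, table in enumerate(tables):
        piece = vec[i * slice_size:(i + 1) * slice_size]
        quantized.extend(table[nearest(table, piece)])
    return quantized


# Hypothetical toy setup: 2 tables, each with 2 entries of dimension 2.
tables = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]
encoder_output = [0.9, 0.8, 0.2, 0.7]
quantized = decomposed_quantize(tables, encoder_output)
print(quantized)
```

In this toy setup the two tables hold 8 parameters in total and can produce 4 distinct vectors of dimension 4, whereas a single table with the same 8 parameters (2 entries of dimension 4) could only produce 2.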
5 Model Architecture
In this section, we report the details of our proposed architecture. We describe the joint training and how we are scaling to new languages.
Before explaining our proposed model, we introduce the notation that will be assumed hereinafter. Languages will be referred to with capital letters $X$, $Y$, $Z$, while sentences will be referred to in lower case $x$, $y$, $z$, given that $x \in X$, $y \in Y$ and $z \in Z$.
We consider as an encoder ($e_x$, $e_y$, $e_z$) the layers of the network that, given an input sentence, produce a sentence representation in a space. Analogously, a decoder ($d_x$, $d_y$, $d_z$) is the layers of the network that, given the sentence representation of the source sentence, are able to produce the tokens of the target sentence. Encoders and decoders will always be considered as independent modules that can be arranged and combined individually, as no parameter is shared between them. Each language and module has its own weights, independent from all the others present in the system.
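Because encoders and decoders are independent modules, the number of components to train grows linearly with the number of languages, while the number of directed language pairs grows quadratically. The short sketch below makes this arithmetic explicit; the function names are hypothetical and only serve this illustration.

```python
def pairwise_systems(n_languages):
    """Directed translation systems if every language pair is trained separately."""
    return n_languages * (n_languages - 1)


def modular_components(n_languages):
    """Components with one independent encoder and one decoder per language."""
    return 2 * n_languages


for n in (2, 3, 10, 50):
    print(n, pairwise_systems(n), modular_components(n))
```

For ten languages, for example, this gives 90 directed pairwise systems against 20 independent modules.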
5.2 Joint Training
Given two languages $X$ and $Y$, our objective is to train independent encoders and decoders for each language, $e_x$, $d_x$ and $e_y$, $d_y$, that produce compatible sentence representations. For instance, given a sentence $x$ in language $X$, we can obtain a representation $h(x)$ from the encoder $e_x$ that can be used to either generate a sentence reconstruction using decoder $d_x$ or a translation using decoder $d_y$. With this objective in mind, we propose a training schedule that combines two tasks (auto-encoding and translation) and the two translation directions simultaneously by optimizing the following loss:

$$L = L_{XX} + L_{YY} + L_{XY} + L_{YX} + d \qquad (1)$$

where $L_{XX}$ and $L_{YY}$ correspond to the reconstruction losses of languages $X$ and $Y$ (defined as the cross-entropy of the generated tokens and the source sentence for each language); $L_{XY}$ and $L_{YX}$ correspond to the translation terms of the loss, measuring token generation of each decoder given a representation generated by the other language's encoder (using the cross-entropy between the generated tokens and the translation reference); and $d$ corresponds to the distance metric between the representations computed by the encoders. This last term forces the representations to be similar without sharing parameters while providing a measure of similarity between the generated spaces. We have tested different distance metrics such as L1, L2 or the addition of a discriminator (that tried to predict from which language the representation was generated). For all these alternatives, we experienced a space collapse in which all sentences tend to be located in the same spatial region. This closeness between the sentences of the same language makes them non-informative for decoding. As a consequence, the decoder performs as a language model, producing an output based only on the information provided by the previously decoded tokens. To prevent this collapse, we propose a less restrictive measure based on correlation distance Chandar (2015), computed as in equations 2 and 3. The rationale behind this loss is maximizing the correlation between the representations produced by each language while not enforcing the distance over the individual values of the representations.

$$d = 1 - c(h(X), h(Y)) \qquad (2)$$

$$c(h(X), h(Y)) = \frac{\sum_{i=1}^{n} \left(h(x_i) - \overline{h(X)}\right)\left(h(y_i) - \overline{h(Y)}\right)}{\sqrt{\sum_{i=1}^{n} \left(h(x_i) - \overline{h(X)}\right)^2 \sum_{i=1}^{n} \left(h(y_i) - \overline{h(Y)}\right)^2}} \qquad (3)$$

where $X$ and $Y$ correspond to the data sources we are trying to represent; $h(x_i)$ and $h(y_i)$ correspond to the intermediate representations learned by the network for a given observation; and $\overline{h(X)}$ and $\overline{h(Y)}$ are, for a given batch, the intermediate representation means of $X$ and $Y$, respectively.
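To make the behaviour of the correlation distance more concrete, the following is a minimal sketch in plain Python, not the training code of the paper. It assumes that each observation is a list of floats, that the batch mean is taken per dimension, and that the sums run over both the batch and the dimensions; these choices, like the function names, are assumptions for illustration only.

```python
import math


def batch_mean(batch):
    """Per-dimension mean of a batch of equally sized vectors."""
    return [sum(column) / len(batch) for column in zip(*batch)]


def correlation(hx, hy):
    """Correlation between two batches of representations."""
    mean_x, mean_y = batch_mean(hx), batch_mean(hy)
    num = sum_xx = sum_yy = 0.0
    for vec_x, vec_y in zip(hx, hy):
        for a, ma, b, mb in zip(vec_x, mean_x, vec_y, mean_y):
            num += (a - ma) * (b - mb)
            sum_xx += (a - ma) ** 2
            sum_yy += (b - mb) ** 2
    return num / math.sqrt(sum_xx * sum_yy)


def correlation_distance(hx, hy):
    """Distance term: one minus the correlation of the two batches."""
    return 1.0 - correlation(hx, hy)


# Hypothetical batch of three 2-dimensional representations.
hx = [[1.0, 2.0], [2.0, 0.0], [3.0, 4.0]]
# A shifted and rescaled copy of hx, standing in for the other language.
hy = [[10.0 + 5.0 * a, -3.0 + 5.0 * b] for a, b in hx]

print(correlation_distance(hx, hy))
```

In this example the second batch lies in a different region of the space and at a different scale, yet its distance to the first batch is numerically zero. This illustrates why the measure is less restrictive than L1 or L2: it rewards representations that vary together without forcing their individual values to coincide.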
Figure 1 shows the different tasks and directions that the system has been trained to perform. Each decoder is able to process the representation produced by each encoder to either translate or reconstruct the source language sentence.
5.3 Scaling to new languages
Given the jointly trained model between languages $X$ and $Y$, the following step is to add new languages in order to use our architecture as a multilingual system. Since parameters are not shared between the independent encoders and decoders, our architecture makes it possible to add new languages without the need to retrain the current languages in the system. Let's say we want to add language $Z$. To do so, we require parallel data between $Z$ and any language in the system. So, assuming that we have trained $X$ and $Y$, we need to have either $Z$-$X$ or $Z$-$Y$ parallel data. For illustration, let's assume that we have $Z$-$X$ parallel data. Then, we can set up a new bilingual system with language $Z$ as source and language $X$ as target. To ensure that the representation produced by this new pair is compatible with the previously jointly trained system, we use the previous $X$ decoder ($d_x$) as the decoder of the new system and we freeze it. During training, we optimize the cross-entropy between the generated tokens and the language $X$ reference data, but only update the layers belonging to the language $Z$ encoder ($e_z$). In doing so, we train $e_z$ not only to produce good quality translations but also to produce representations similar to those of the already trained languages.
Our training schedule enforces the generation of a compatible representation, which means that the output of the newly trained encoder can be used as input to the $Y$ decoder from the jointly trained system to produce zero-shot $Z$-to-$Y$ translations. See Figure 2 for illustration. The fact that the system enables zero-shot translation shows that the representations produced by our training schedule contain useful information and that this can be preserved and shared with new languages just by forcing the new modules to train with the previous ones, without any modification of the architecture.
A current limitation is the need to use the same vocabulary for the shared language ($X$) in both training steps. The use of subwords Sennrich, Haddow, and Birch (2015) mitigates the impact of this constraint.
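The idea of training a new encoder against a frozen decoder can be sketched with a deliberately tiny toy model. The code below is not the Transformer setup of the paper: it assumes a one-parameter linear "encoder" and "decoder" and a squared-error loss in place of cross-entropy, and all names and values are hypothetical. What it does show is the training schedule itself: gradients flow through the frozen decoder, but only the new encoder parameter is updated.

```python
def train_new_encoder(pairs, frozen_decoder_w, lr=0.01, steps=200):
    """Fit a new encoder weight while the decoder weight stays frozen.

    pairs: (source, target) numbers standing in for parallel sentences.
    frozen_decoder_w: weight of the previously trained decoder (never updated).
    """
    encoder_w = 0.0  # trainable parameter of the new language encoder
    for _ in range(steps):
        grad = 0.0
        for source, target in pairs:
            representation = encoder_w * source             # new encoder
            prediction = frozen_decoder_w * representation  # frozen decoder
            # Gradient of the squared error with respect to encoder_w only.
            grad += 2.0 * (prediction - target) * frozen_decoder_w * source
        encoder_w -= lr * grad / len(pairs)
    return encoder_w


# Hypothetical data: the frozen decoder doubles its input and the targets are
# six times the sources, so the encoder should learn a weight close to 3.
toy_pairs = [(1.0, 6.0), (2.0, 12.0)]
learned = train_new_encoder(toy_pairs, frozen_decoder_w=2.0)
print(learned)
```

Since the decoder cannot change, the only way for the new encoder to reduce the loss is to produce representations that the existing decoder already knows how to read, which is the mechanism the section relies on.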
6 Evaluation of the Intermediate Representation
Our main objective is creating an intermediate representation that can be understood by all the different modules trained in the system, where the modules are the encoders and decoders of all languages involved in training. Similar representations may not lead to compatible encoder/decoders. Also, different trainings can produce representations with different mean distances in the representations that can generate similar translation outputs.
In order to overcome those difficulties, we propose a new measure for the task. Given a parallel set of sentences in the languages in which the system has been trained, we can generate the encodings $h(x)$ and $h(y)$. Both encodings, coming from a parallel test set, have the same number of vectors, each of the same dimensionality.
Our proposed measure consists of running inference with one of the decoders in the system ($d_x$ or $d_y$) using $h(x)$ and $h(y)$ as input. This generates two different outputs: an autoencoding output and a machine translation output. As we have parallel references for both languages, we can measure the BLEU Papineni et al. (2002) of each of the results against the reference to measure how the models perform.
Additionally, we can calculate a new BLEU score comparing the autoencoding and machine translation outputs with each other. In the ideal case, encoders from two different languages have to produce the same representation for the same sentences. Therefore, the difference between the BLEU score obtained by the autoencoding output and by the translation output shows how different the $h(x)$ and $h(y)$ representations are in terms of how well the decoder is able to generate accurate results from them. Our measure consists in evaluating the BLEU score using the autoencoding output as a reference and the machine translation output as a hypothesis. Figure 3 shows the full pipeline of this procedure.
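The comparison between autoencoding and translation outputs can be sketched as follows. The section does not specify which BLEU implementation is used, so the code below assumes a simplified corpus-level BLEU (single reference, up to 4-grams, no smoothing) written only for illustration; `corpus_bleu`, `a_t_bleu` and the example sentences are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Simplified corpus-level BLEU (0-100), one reference per hypothesis."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
            matches[n - 1] += sum((hyp_counts & ref_counts).values())
            totals[n - 1] += sum(hyp_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # no smoothing in this sketch
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity = 1.0 if hyp_len > ref_len else math.exp(1.0 - ref_len / hyp_len)
    return 100.0 * brevity * math.exp(log_precision)


def a_t_bleu(autoencoding_outputs, translation_outputs):
    """BLEU with the autoencoding output as reference and the MT output as hypothesis."""
    return corpus_bleu(hypotheses=translation_outputs, references=autoencoding_outputs)


# Hypothetical outputs of the same decoder fed by two different encoders.
autoencoded = [["the", "cat", "sat", "on", "the", "mat"]]
translated = [["the", "cat", "sat", "on", "a", "mat"]]
print(a_t_bleu(autoencoded, translated))
```

If both encoders produced identical representations, the decoder would generate identical outputs and this score would reach 100; lower values indicate that the decoder reacts differently to the two representations.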
7 Experimental framework
In this section, we provide details about the data and implementation for the experiments.
For experiments, we use the Turkish-English parallel data from setimes2 Tiedemann (2009), which is used in WMT 2017 (http://www.statmt.org/wmt17/), and the Kazakh-English parallel data from the news domain, which is used in WMT 2019 (http://www.statmt.org/wmt19/). The training set for Turkish-English is around 200,000 parallel sentences and for Kazakh-English is around 100,000 parallel sentences. As development and test set we used newsdev2016 and newstest2016, respectively, for Turkish-English, and newsdev2019 was split into development and test set for the Kazakh-English experiments. Additionally, we extracted the Kazakh-Turkish test set from the OPUS database Tiedemann (2012) to evaluate the zero-shot translation.
Given that this task is low-resourced (around 200k), we propose to use a task of similar size for extending the system. This is not required by the system, but it allows a better comparison between the languages already in the system and the additional ones. Therefore, we add a subset of the WMT Spanish/English task of 200k training sentences. As development and test set we used newsdev2016 and newstest2016, respectively, for Turkish-English and newstest2012 and newstest2013 for Spanish-English. Additionally, we extracted the Kazakh-Turkish test set from the OPUS database Tiedemann (2012) to evaluate the zero-shot Spanish-Turkish translation.
Preprocessing consisted of a pipeline of punctuation normalization, tokenization, filtering out sentences longer than 80 words, and true-casing. These steps were performed using the scripts available in the Moses toolkit Koehn et al. (2007). In the experiments using subwords, preprocessed data is segmented using Byte Pair Encoding (BPE) Sennrich, Haddow, and Birch (2016).
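The length-filtering step of this pipeline can be illustrated with a short sketch. The experiments rely on the Moses scripts, so the function below is only a stand-in: it assumes whitespace-separated tokens and drops a sentence pair when either side exceeds the limit, and the name `filter_long_pairs` is hypothetical.

```python
def filter_long_pairs(pairs, max_words=80):
    """Keep sentence pairs in which neither side has more than max_words tokens."""
    return [
        (source, target)
        for source, target in pairs
        if len(source.split()) <= max_words and len(target.split()) <= max_words
    ]


# Hypothetical corpus: the second pair has an 81-token source side.
corpus = [
    ("bu bir deneme", "this is a test"),
    (" ".join(["kelime"] * 81), "a very long sentence"),
]
kept = filter_long_pairs(corpus)
print(len(kept))
```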
We used the Transformer implementation provided by Fairseq (release v0.6.0, available at https://github.com/pytorch/fairseq). Parameters varied depending on the experiments. For experiments in section 8.1 we used 6 blocks of multihead attention of 8 heads each, embedding/hidden dimensionality of 128, a fixed learning rate of 0.001 and a vocabulary size of 12,000 words. For experiments in the rest of the sections, we used 6 attention blocks with 4 heads, embedding/hidden dimensionality of 512, a fixed learning rate of 0.001 and a vocabulary size of 16,000 BPE tokens. For all cases, we used Adam Kingma and Ba (2014) as the optimizer. The joint training was performed on two Nvidia Titan X GPUs with 12 GB of RAM, while the addition of languages used one Titan X GPU. As stopping criterion, systems were trained until no improvement was seen on the validation set.
In this section, we report our results and discussion. Results are presented in terms of BLEU Papineni et al. (2002), which is the standard automatic measure in MT. Subsection 8.1 shows the impact of using decomposed vector quantization and different distance measures (maximum distance and correlation). Subsection 8.2 reports results on using subwords and enlarging the model. Subsection 8.3 shows the intermediate representation quality in terms of our proposed measure, and subsection 8.4 shows the results of adding new languages to the system.
8.1 Translation quality
Table 1 shows the BLEU results in each translation direction, from English to Turkish (EN-TR) and from Turkish to English (TR-EN). Results of different configurations of the proposed architecture (JoinTrain) are compared to the baseline transformers (both non-variational and variational, dvq) with the same hyperparameters as our architecture.
Variational vs non-variational
The performance of the baseline transformer (non-variational) is almost competitive with the best system results from WMT 2017 García-Martínez et al. (2017). Note that we are comparing to the case of using only parallel data, without adding back-translated monolingual data (which were 10.9 for EN-TR and 14.2 for TR-EN). When using the decomposed vector quantization, the performance gets worse in both directions and the loss is higher in EN-TR. When contrasting the impact of the decomposed vector quantization in our proposed architecture, we see that the performance of the non-variational architecture is also higher than that of the decomposed vector quantization (dvq) with any type of distance. However, the loss, when using the correlation distance, is higher in the TR-EN direction than in the opposite one.
Correlation vs Maximum distance loss
With regard to the distance loss, the correlation distance clearly provides better translation results, by approximately 1.5 BLEU in both directions when using the non-variational architecture. The improvement of the correlation distance compared to the maximum distance is even higher when using the variational architecture.
| System | EN-TR | TR-EN |
|---|---|---|
| JoinTrain + corr | 8.11 | 12.00 |
| JoinTrain + max | 6.19 | 10.38 |
| JoinTrain + dvq + corr | 7.45 | 7.56 |
| JoinTrain + dvq + max | 2.40 | 5.24 |
Our best proposed architecture, which is the non-variational autoencoder with correlation distance (JoinTrain + corr), shows similar performance compared to the baseline system (Transformer).
8.2 Subwords and Enlarging the model
Previous experiments were performed using words. At this point, we employ subword-nmt Sennrich, Haddow, and Birch (2016), which is a standard way of segmenting words into subwords. Our configuration follows the standard set-up with 16,000 operations and a vocabulary shared between both languages. Table 2 shows the performance of the architecture using BPE as tokenization only on the best systems from the previous subsection 8.1. Here, our architecture still does not gain over the baseline system.
However, when enlarging the model by using a word embedding of 512 dimensions plus BPE tokenization, we achieve gains of +0.5 BLEU from English to Turkish and +0.7 BLEU from Turkish to English over the corresponding baseline. At this point, we found it interesting to train our architecture without autoencoders, and we see that it improves over the baseline system but not over our complete original proposed joint training. Therefore, training with autoencoders helps to improve translation performance.
8.3 Intermediate representation evaluation
We have also studied the difference in the performance of decoders when presented with the intermediate representations of both encoders. This evaluation is performed in order to analyse if we can use independent encoder/decoders in the context of MT. The model used for these results is the JoinTrain+corr+BPE+large, which is the best performing model from Table 2.
Table 3 shows that the quality of the output of the decoder is considerably better when the input comes from the encoder of the same language (autoencoder) than from another (MT). We also included the BLEU score between the autoencoder and translation outputs (A-T), which is the measure that we are proposing to evaluate the quality of our intermediate representation. Low BLEU scores in A-T indicate that we are still far from being able to decode from the common intermediate representation.
8.4 Adding new languages and Zero-shot Translation
At this point, we use the best configuration from the previous subsection 8.2, which uses BPE with the largest model. We add Kazakh as a new language to this system as proposed in section 5.3. Table 4 shows that Kazakh-English improves by 0.6 BLEU points over the baseline. The frozen English decoder, previously trained using the Turkish-English parallel data, may be responsible for the increase in performance.
Finally, another relevant aspect of the proposed architecture is that it enables zero-shot translation. To evaluate it, we compare the performance of Kazakh-Turkish zero-shot translation with that of a cascaded pivot system. Such a system consists of translating from Kazakh to English and from English to Turkish with the standard Transformer. Results show that the zero-shot translation provides slightly lower quality than the pivot baseline system. As test data, we employ 2,500 Turkish-Kazakh parallel sentences extracted from the OpenSubtitles dataset Lison and Tiedemann (2016).
Our training architecture is based on training modules to produce compatible representations. In this section, we analyze this similarity at the last attention block of the encoders, where we are enforcing the similarity. In order to show the representations graphically, we trained a UMAP McInnes et al. (2018) model combining the representations of the languages. We show 100 sentences extracted from the test set. These sentences have been selected to have a similar length to minimize the amount of padding required. Figure 4 (left) shows the sentence representations created by their encoders. The separated clusters show that languages are not yet represented in the same space. Related work Arivazhagan et al. (2018) shows similar results for the case of a multilingual system with shared encoder and decoder. While the system is able to produce compatible representations, clear clusters can be observed for each language in the system. Plausible explanations for this difference may be the distance measure that we are using and/or the alignment of the source sentences.
Some distance measures cause the representations to collapse into a small region of the space, making them non-informative for the decoder. Our distance measure, the correlation distance, enforces the representations to correlate but does not constrain the scale of the values in the contextual vectors. This measure enforces the sentence distribution within the same language to be similar between all languages. However, since we are not constraining the scale, each language can be represented in a different region of the space.
The other important factor in the measure of the encoding representation is the sentence alignment. In this work, we are basing our model on the Transformer architecture. This architecture does not compute a single vector representation of the source sentence but instead, it produces fixed-size contextual representations of the sentence tokens. Further experiments would be required to measure if the system is able to reorder the source tokens to produce language-independent sentence representations.
As further work, we should test new distance measures in order to better constrain the scale of the produced representations and measure how the alignment of different languages is characterized by the final encoder representations.
We proposed a novel translation architecture which aims at a common intermediate representation. Although there are already some machine translation systems where the implicit emergence of an internal interlingua representation is suggested, the main proposed difference is forcing the Neural Machine Translation system to learn an intermediate multilingual representation. This is achieved by combining the maximum likelihood loss, normally used in Neural Machine Translation, with an extra loss term that computes a measure of the distance between intermediate representations in different languages. By achieving an interlingua representation, this proposal decouples encoders and decoders, with the interlingua acting as an interface. This enables every possible combination of encoder and decoder, effectively turning the quadratic needs, in terms of training data and resources, into linear ones. Furthermore, such a decoupling also allows training encoders/decoders to/from a new language that only has parallel data with one of the already supported languages, enabling translation to/from any of the other supported languages.
We show how a bilingual NMT system can be extended to a multilingual NMT system by incremental training. We explore both self-attentive variational and non-variational autoencoders to generate the intermediate representation, without success for the variational autoencoders.
We have analyzed how the model performs for different languages. Our model outperforms current bilingual systems and we show first steps towards achieving competitive translations with a flexible architecture that enables scaling to new languages (achieving multilingual and zero-shot translation) without retraining languages in the system. Our approach supersedes pivoting approaches and can also be complemented by unsupervised Neural Machine Translation approaches. One of the next steps will be precisely to exploit monolingual data in our architecture further avoiding dependency on the availability of parallel data.
Acknowledgements. This work is supported in part by the Google Faculty Research Award 2018, Spanish Ministerio de Economía y Competitividad, the European Regional Development Fund and the Agencia Estatal de Investigación, through the postdoctoral senior grant Ramón y Cajal, the contract TEC2015-69266-P (MINECO/FEDER,EU) and the contract PCIN-2017-079 (AEI/MINECO).