Enabling Multi-Source Neural Machine Translation By Concatenating Source Sentences In Multiple Languages

Abstract

In this paper, we propose a novel and elegant solution to “Multi-Source Neural Machine Translation” (MSNMT) which relies solely on preprocessing an N-way multilingual corpus, without modifying the Neural Machine Translation (NMT) architecture or training procedure. We simply concatenate the source sentences to form a single long multi-source input sentence while keeping the target side sentence as it is, and train an NMT system on this preprocessed corpus. We evaluate our method in resource poor as well as resource rich settings and show its effectiveness (up to 4 BLEU points using 2 source languages and up to 6 BLEU points using 5 source languages) by comparing against existing methods for MSNMT. We also provide some insights on how the NMT system leverages multilingual information in such a scenario by visualizing attention.

1 Introduction

Neural machine translation (NMT) [?] enables training an end-to-end system without the need to deal with word alignments, translation rules and complicated decoding algorithms, which are a characteristic of phrase based statistical machine translation (PBSMT) systems. However, it is reported that NMT works better than PBSMT only when there is an abundance of parallel corpora. In a low resource scenario, vanilla NMT is either worse than or comparable to PBSMT [?].
Multilingual NMT has been shown to be quite effective in a variety of settings: Transfer Learning [?], where a model trained on a resource rich language pair is used to initialize the parameters of a model for a resource poor pair; Multilingual Multi-Way NMT [?], where multiple language pairs are learned simultaneously with separate encoders and decoders for each source and target language; and Zero Shot NMT [?], where a single NMT system is trained for multiple language pairs that share all model parameters, thereby allowing the languages to interact and help each other improve the overall translation quality.
Multi-Source Machine Translation is an approach that allows one to leverage N-way (N-lingual) corpora to improve translation quality in resource poor as well as resource rich scenarios. N-way (or N-lingual) corpora are those in which translations of the same sentence exist in N different languages1. A realistic scenario is when N equals 3, since there exist many domain specific as well as general domain trilingual corpora. For example, in Spain international news companies write news articles in English as well as Spanish, and thus it is possible to utilize the same sentence written in two different languages to translate into a third language like Italian by exploiting a large English-Spanish-Italian trilingual corpus. However, there do exist N-way corpora (ordered from largest to smallest by number of lines) like the United Nations [?], Europarl [?], TED Talks [?], ILCI [?] and Bible [?] corpora, in which the same sentences are translated into more than 5 languages.
Two major approaches for Multi-Source NMT (MSNMT) have been explored, namely the multi-encoder approach [?] and multi-source ensembling [?]. The multi-encoder approach extends the vanilla NMT architecture to have one encoder per source language, leading to larger models, even though it is known that NMT can accommodate multiple languages [?] without resorting to a larger parameter space. Moreover, since the encoders for the source languages are separate, it is difficult to explore how the source languages contribute towards the improvement in translation quality. On the other hand, the ensembling approach is simpler since it involves training multiple bilingual NMT models, each with a different source language but the same target language. This method also eliminates the need for N-way corpora, which allows one to exploit bilingual corpora that are larger in size. Multi-source NMT ensembling works in essentially the same way as single-source NMT ensembling, except that each of the systems in the ensemble takes source sentences in a different language. In the case of a multilingual multi-way NMT model, multi-source ensembling is a form of self-ensembling, but such a model contains too many parameters and is difficult to train. Multi-source ensembling using separate models for each source language involves the overhead of learning an ensemble function, and hence this method is not truly end-to-end.
To overcome the limitations of both the approaches we propose a new simplified end-to-end method that avoids the need to modify the NMT architecture as well as the need to learn an ensemble function. We simply propose to concatenate the source sentences leading to a parallel corpus where the source side is a long multilingual sentence and the target side is a single sentence which is the translation of the aforementioned multilingual sentence. This corpus is then fed to any NMT training pipeline whose output is a multi-source NMT model. The main contributions of this paper are as follows:

  • A novel preprocessing step that allows for MSNMT without any change to the NMT architecture2.

  • An exhaustive study of how our approach works in a resource poor as well as a resource rich setting.

  • An empirical comparison of our approach against two existing methods [?] for MSNMT.

  • An analysis of how NMT gives more importance to certain linguistically closer languages while doing multi-source translation by visualizing attention vectors.

2 Related Work

One of the first studies on multi-source MT [?] examined how word based SMT systems would benefit from multiple source languages. Although effective, it suffered from a number of limitations that classic word and phrase based SMT systems have, including the inability to perform end-to-end training. In the context of NMT, the work on multi-encoder multi-source NMT [?] is the first end-to-end approach of its kind, focusing on utilizing French and German as source languages to translate to English. However, their method led to models with substantially larger parameter spaces and they did not study the effect of using more than 2 source languages. Multi-source ensembling using a multilingual multi-way NMT model [?] is an end-to-end approach but requires training a very large and complex NMT model. The work on multi-source ensembling with separately trained single source models [?] is comparatively simpler in the sense that one does not need to train additional NMT models, but the approach is not truly end-to-end since an ensemble function needs to be learned to effectively leverage multiple source languages. In all three cases one ends up with either one large model or many small models.

3 Overview of Our Method

Figure 1: Our proposed Multi-Source NMT Approach.

Refer to Figure 1 for an overview of our method which is as follows:

  • For each target sentence, concatenate the corresponding source sentences, leading to a parallel corpus where the source sentence is a very long sentence that conveys the same meaning in multiple languages (see the preprocessing sketch after this list). An example line in such a corpus would be: source: “Hello Bonjour Namaskar Kamusta Hallo” and target: “konnichiwa”. The 5 source languages here are English, French, Marathi, Filipino and Luxembourgish, whereas the target language is Japanese. In this example each source sentence is a single word conveying “Hello” in the respective language. We romanize the Marathi and Japanese words for readability.

  • Apply word segmentation to the source and target sentences, Byte Pair Encoding (BPE)3 [?] in our case, to overcome data sparsity and eliminate the unknown word rate.

  • Use the training corpus to learn an NMT model using any off the shelf NMT toolkit.
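To make the preprocessing step concrete, the following is a minimal sketch (in Python) of the concatenation described above, assuming the N-way corpus is stored as line-aligned plain-text files, one per language; the file names and language order are illustrative, and BPE segmentation and NMT training are applied afterwards exactly as in a standard pipeline.

```python
# Minimal sketch of the proposed preprocessing step: build the multi-source
# training corpus by concatenating the line-aligned source sentences of an
# N-way corpus. File names and the language order are illustrative.

source_files = ["train.bn", "train.en", "train.mr", "train.ta", "train.te"]
target_file = "train.hi"

def concatenate_sources(source_files, target_file,
                        out_src="train.multi.src", out_tgt="train.multi.tgt"):
    src_handles = [open(f, encoding="utf-8") for f in source_files]
    with open(target_file, encoding="utf-8") as tgt, \
         open(out_src, "w", encoding="utf-8") as fs, \
         open(out_tgt, "w", encoding="utf-8") as ft:
        # Iterate over the N source sentences and the target sentence in parallel.
        for *src_lines, tgt_line in zip(*src_handles, tgt):
            # The multi-source input is simply the source sentences joined by a
            # space; the target side is kept as it is.
            fs.write(" ".join(l.strip() for l in src_lines) + "\n")
            ft.write(tgt_line.strip() + "\n")
    for h in src_handles:
        h.close()

if __name__ == "__main__":
    concatenate_sources(source_files, target_file)
    # BPE segmentation (e.g. with subword-nmt) and standard NMT training are
    # then applied to train.multi.src / train.multi.tgt as usual.
```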

4 Experimental Settings

All of our experiments were performed using an encoder-decoder NMT system with attention, for both the baselines and the multi-source experiments. In order to handle an open vocabulary and reduce data sparsity we use the Byte Pair Encoding (BPE) based word segmentation approach [?]. However, we slightly modify the original code: instead of specifying the number of merge operations manually, we specify a desired vocabulary size and the BPE learning process automatically stops after it has learned enough merge rules to reach the prespecified vocabulary size. We prefer this approach since it allows us to learn a minimal model and it resembles the way Google’s NMT system [?] works with the Word Piece Model (WPM) [?]. We evaluate our models using the standard BLEU [?] metric4 on the translations of the test set. Baseline models are simply ones trained from scratch by initializing the model parameters with random values.
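The vocabulary-size stopping criterion described above can be illustrated with a minimal, self-contained BPE learner. This is a sketch of the idea rather than our actual modification of the subword-nmt code, and the corpus representation (a word-to-frequency dictionary) is an assumption.

```python
import collections

def learn_bpe_until_vocab_size(corpus_words, target_vocab_size):
    """Learn BPE merges, stopping once the symbol vocabulary reaches
    target_vocab_size instead of after a fixed number of merge operations.
    corpus_words: dict mapping word -> frequency."""
    # Represent each word as a tuple of characters plus an end-of-word marker.
    vocab = {tuple(w) + ("</w>",): f for w, f in corpus_words.items()}
    symbols = {s for word in vocab for s in word}
    merges = []
    while len(symbols) < target_vocab_size:
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = collections.Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged = best[0] + best[1]
        symbols.add(merged)
        # Apply the new merge to every word in the working vocabulary.
        new_vocab = {}
        for word, freq in vocab.items():
            out, i = [], 0
            while i < len(word):
                if i < len(word) - 1 and (word[i], word[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_vocab[tuple(out)] = freq
        vocab = new_vocab
    return merges

# Example: stop once 8000 distinct symbols (the desired vocabulary size) exist.
# merges = learn_bpe_until_vocab_size(word_frequencies, 8000)
```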

4.1 Languages and Corpora Settings

Table 1: Statistics for the N-lingual corpora extracted for the languages French (Fr), German (De), Arabic (Ar), Czech (Cs) and English (En)
Corpus type | Languages | train | dev2010 | tst2010/tst2013
3 lingual | Fr, De, En | 191381 | 880 | 1060/886
4 lingual | Fr, De, Ar, En | 84301 | 880 | 1059/708
5 lingual | Fr, De, Ar, Cs, En | 45684 | 461 | 1016/643

All of our experiments were performed using the publicly available ILCI5 [?], United Nations6 [?] and IWSLT7 [?] corpora.
The ILCI corpus is a 6-way multilingual corpus spanning the languages Hindi, English, Tamil, Telugu, Marathi and Bengali, which was provided as part of the task. The target language is Hindi and thus there are 5 source languages. The training, development and test sets contain 45600, 1000 and 2400 6-lingual sentences respectively8. Hindi, Bengali and Marathi are Indo-Aryan languages, Telugu and Tamil are Dravidian languages and English is a European language. In this group English is the farthest from Hindi, grammatically speaking, whereas Marathi is the closest to it. Morphologically speaking, Bengali is closer to Hindi than Marathi (which has agglutinative suffixes), but Marathi and Hindi share the same script and they also share more cognates than the other languages do. It is natural to expect that translating from Bengali and Marathi to Hindi should give Hindi sentences of higher quality than those obtained by translating from the other languages, and thus using these two languages as source languages in multi-source approaches should lead to significant improvements in translation quality. We verify this hypothesis by exhaustively trying all language combinations.
The IWSLT corpus is a collection of 4 bilingual corpora spanning 5 languages where the target language is English: French-English (234992 lines), German-English (209772 lines), Czech-English (122382 lines) and Arabic-English (239818 lines). Linguistically speaking French and German are the closest to English followed by Czech and Arabic. In order to obtain N-lingual sentences we only keep the sentence pairs from each corpus such that the English sentence is present in all the corpora. From the given training data we extract trilingual (French, German and English), 4-lingual (French, German, Arabic and English) and 5-lingual corpora. Similarly we extract 3, 4 and 5 lingual development and test sets. The IWSLT corpus (downloaded from the link given above) comes with a development set called dev2010 and test sets named tst2010 to tst2013 (one for each year from 2010 to 2013). Unfortunately only the tst2010 and tst2013 test sets are N-lingual. Refer to Table 1 which contains the number of lines of training, development and test sentences we extracted.
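As an illustration, the extraction of N-lingual training data by intersecting the bilingual corpora on their shared English side can be sketched as follows; the file names and the three-language example are illustrative assumptions.

```python
# Minimal sketch of how an N-lingual subset can be extracted from several
# bilingual corpora that share the same target (English) side, as described
# above. File names are illustrative.

def read_bitext(src_path, en_path):
    """Return a dict mapping English sentence -> foreign source sentence."""
    with open(src_path, encoding="utf-8") as fs, open(en_path, encoding="utf-8") as fe:
        return {en.strip(): src.strip() for src, en in zip(fs, fe)}

corpora = {
    "fr": read_bitext("train.fr-en.fr", "train.fr-en.en"),
    "de": read_bitext("train.de-en.de", "train.de-en.en"),
    "ar": read_bitext("train.ar-en.ar", "train.ar-en.en"),
}

# Keep only the English sentences that occur in every bilingual corpus.
common_en = set.intersection(*(set(d) for d in corpora.values()))

with open("train.multi.src", "w", encoding="utf-8") as fs, \
     open("train.multi.en", "w", encoding="utf-8") as ft:
    for en in sorted(common_en):
        # Concatenate the aligned foreign sentences to form the multi-source input.
        fs.write(" ".join(corpora[lang][en] for lang in ("fr", "de", "ar")) + "\n")
        ft.write(en + "\n")
```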
The UN corpus spans 6 languages (French, Spanish, Arabic, Chinese, Russian and English) and we used the 6-lingual version for our multi-source experiments. Although there are 11 million 6-lingual sentences, we use only 2 million for training, since our purpose was not to train the best possible system but to show that using additional source languages is useful. The development and test sets provided contain 4000 lines each and are also available as 6-lingual sentences. We chose English as the target language and focused on Spanish, French, Arabic and Russian as source languages. Due to a lack of computational facilities we only worked with the following source language combinations: French+Spanish, French+Russian, French+Arabic and Russian+Arabic.

4.2 NMT Systems and Model Settings

For training various NMT systems, we used the open source KyotoNMT system9 [?]. KyotoNMT implements an attention based encoder-decoder [?] with slight modifications to the training procedure. We modify the NMT implementation in KyotoNMT to enable multi encoder multi source NMT. Since the NMT model architecture used in [?] is different from the one in KyotoNMT, the multi encoder implementation is not identical (but is equivalent) to the one in the original work. For the rest of the paper, “baseline” systems indicate single source NMT models trained on bilingual corpora. We train and evaluate the following models:

  • One source to one target.

  • N source to one target using our proposed multi source approach.

  • N source to one target using the multi encoder multi source approach [?].

  • N source to one target using the multi source ensembling approach that late averages [?] N one source to one target models10 (a minimal sketch of late averaging is given after this list).
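For clarity, the following sketch shows what late averaging amounts to during decoding: at every step, the predictive distributions of the N single source models are averaged before the next target token is chosen (greedy decoding is shown for brevity). The `next_token_probs` interface is an illustrative assumption, not the actual API of KyotoNMT or of the original ensembling work.

```python
import numpy as np

def late_average_greedy_decode(models, source_sentences, eos_id, max_len=100):
    """Greedy multi-source decoding by late averaging.
    `models` is a list of N single-source NMT models, where model i consumes
    `source_sentences[i]`; each model is assumed to expose
    `next_token_probs(source, target_prefix)` returning a probability
    distribution over the shared target vocabulary (illustrative interface)."""
    target = []
    for _ in range(max_len):
        # Average the N predictive distributions for the next target token.
        probs = np.mean(
            [m.next_token_probs(src, target)
             for m, src in zip(models, source_sentences)],
            axis=0)
        token = int(np.argmax(probs))
        if token == eos_id:
            break
        target.append(token)
    return target
```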

The model and training details are as follows:

  • BPE vocabulary size: 8k (see footnote 11), with separate models for source and target, for the ILCI and IWSLT corpora settings, and 16k for the UN corpus setting. When training the BPE model for the source languages we learn a single shared BPE model so that the total source side vocabulary size is 8k (or 16k as applicable). For languages that use the same script this allows for cognate sharing, thereby reducing the overall vocabulary size.

  • Embeddings: 620 nodes

  • RNN for encoders and decoders: LSTM with 1 layer, 1000 nodes output. Each encoder is a bidirectional RNN.

  • In the case of multiple encoders, one for each language, each encoder has its own separate vocabulary.

  • Attention: 500 nodes hidden layer. In case of the multi encoder approach there is a separate attention mechanism per encoder.

  • Batch size: 64 for single source, 16 for 2 sources and 8 for 3 sources and above for IWSLT and ILCI corpora settings. 32 for single source and 16 for 2 sources for the UN corpus setting.

  • Training steps: 10k (see footnote 12) for 1 source, 15k for 2 sources and 40k for 5 sources when using the IWSLT and ILCI corpora. 200k for 1 source and 400k for 2 sources for the UN corpus setting, to ensure that in both cases the models get saturated with respect to their learning capacity.

  • Optimization algorithm: Adam with an initial learning rate of 0.01.

  • Choosing the best model: evaluate each saved model on the development set and select the one with the best BLEU [?] after reversing the BPE segmentation on the output of the NMT model (a minimal sketch of this selection loop is given after this list).
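As an illustration of the model selection step above, the following sketch reverses subword-nmt style BPE segmentation (which joins subwords marked with "@@ ") and scores each saved model on the development set. Here sacrebleu is used as a convenient stand-in for the multi-bleu.pl script mentioned in the footnotes, and the `translate_dev` helper and checkpoint list are illustrative assumptions.

```python
import sacrebleu  # stand-in for the multi-bleu.pl script used in the paper

def undo_bpe(line):
    # Reverse subword-nmt style BPE segmentation ("@@ " joins subwords).
    return line.replace("@@ ", "").replace("@@", "")

def select_best_checkpoint(checkpoints, translate_dev, dev_references):
    """`checkpoints` is a list of saved models; `translate_dev(ckpt)` is an
    illustrative helper returning BPE-segmented development set translations."""
    best_ckpt, best_bleu = None, -1.0
    for ckpt in checkpoints:
        hyps = [undo_bpe(h) for h in translate_dev(ckpt)]
        bleu = sacrebleu.corpus_bleu(hyps, [dev_references]).score
        if bleu > best_bleu:
            best_ckpt, best_bleu = ckpt, bleu
    return best_ckpt, best_bleu
```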

We train and evaluate the following NMT models using the ILCI corpus:

  • One source to one target: 5 models (Baselines)

  • Two source to one target: 10 models (5 source languages, choose 2 at a time)

  • Five source to one target: 1 model

For evaluation we translate the test set sentences using a beam search decoder with a beam of size 16 for all corpora settings13. In the IWSLT corpus setting we did not try various combinations of source languages as we did in the ILCI corpus setting. We train and evaluate the following NMT models for each N-lingual corpus:

  • One source to one target: N-1 models (Baselines; 2 for the trilingual corpus, 3 for the 4-lingual corpus and 4 for the 5-lingual corpus)

  • N-1 source to one target: 3 models (1 for trilingual, 1 for 4-lingual and 1 for 5-lingual)

Similarly, for the UN corpus setting we only tried the following 2 source combinations: French+Spanish, French+Arabic, French+Russian, Russian+Arabic. The target language is English.

Table 2: BLEU scores for the two source to one target setting for all language combinations and for the five source to one target setting using the ILCI corpus. The languages are Bengali (Bn), English (En), Marathi (Mr), Tamil (Ta), Telugu (Te) and Hindi (Hi); the number in parentheses next to each source language is its single source baseline BLEU. Each cell in the upper right triangle lists the BLEU scores using a. our proposed approach, b. the multi source ensembling approach and c. the multi encoder multi source approach (written a/b/c). The highest score in each cell is marked with an asterisk (*).

Source Language 1 | En (11.08) | Mr (24.60) | Ta (10.37) | Te (16.55)
Bn (19.14) | 20.70*/19.45/19.10 | 29.02/30.10*/27.33 | 19.85/20.79*/18.26 | 22.73/24.83*/22.14
En (11.08) | - | 25.56*/23.06/26.01 | 14.03/15.05*/13.30 | 18.91/19.68*/17.53
Mr (24.60) | - | - | 25.64*/24.70/23.79 | 27.62/28.00*/26.83
Ta (10.37) | - | - | - | 18.14/19.11*/17.34
All five source languages (our proposed approach): 31.3

Table 3: BLEU scores for the baseline and N source settings using the IWSLT corpus. The languages are French (Fr), German (De), Arabic (Ar), Czech (Cs) and English (En). Each multi-source cell lists the BLEU scores using a. our proposed approach, b. the multi source ensembling approach and c. the multi encoder multi source approach (written a/b/c). The highest score in each cell is marked with an asterisk (*).

Corpus type | Language pair | tst2010 | tst2013 | Number of sources | tst2010 | tst2013
3 lingual (191381 lines) | Fr-En | 19.72 | 22.05 | 2 sources | 22.56*/18.64/22.03 | 24.02*/18.45/23.92
 | De-En | 16.19 | 16.13 | | |
4 lingual (84301 lines) | Fr-En | 9.02 | 7.78 | 3 sources | 11.70/12.86*/10.30 | 9.16/9.48*/7.30
 | De-En | 7.58 | 5.45 | | |
 | Ar-En | 6.53 | 5.25 | | |
5 lingual (45684 lines) | Fr-En | 6.69 | 6.36 | 4 sources | 8.34/9.23*/7.79 | 6.67*/6.49/5.92
 | De-En | 5.76 | 3.86 | | |
 | Ar-En | 4.53 | 2.92 | | |
 | Cs-En | 4.56 | 3.40 | | |

Table 4: BLEU scores for the baseline and 2 source settings using the UN corpus. The languages are Spanish (Es), French (Fr), Russian (Ru), Arabic (Ar) and English (En). Each combination cell lists the BLEU scores using a. our proposed approach, b. the multi source ensembling approach and c. the multi encoder multi source approach (written a/b/c). The highest score in each cell is marked with an asterisk (*). Scores marked with a plus (+) are statistically significant (p < 0.001) compared to those obtained using either of the source languages independently.

Language Pair | BLEU | Language Combination | BLEU (a/b/c)
Es-En | 49.20 | Es+Fr-En | 49.93*+/46.65/47.39
Fr-En | 40.52 | Fr+Ru-En | 43.99*+/40.63/42.12+
Ar-En | 40.58 | Fr+Ar-En | 43.85+/41.13+/44.06*+
Ru-En | 38.94 | Ar+Ru-En | 41.66+/43.12+/43.69*+

Table 5: BLEU scores for the translations from individual source languages to Hindi using the ILCI corpus (these single source baselines also appear in parentheses in Table 2).

Language Pair | BLEU
Bengali-Hindi | 19.14
English-Hindi | 11.08
Marathi-Hindi | 24.60
Tamil-Hindi | 10.37
Telugu-Hindi | 16.55

For the results of the ILCI corpus setting, refer to Table 5 for the BLEU scores of the individual language pairs. Table 2 contains the BLEU scores for all combinations of source languages, two at a time: each cell in the upper right triangle gives the BLEU scores for the two source to one target setting where the source languages are specified in the leftmost and topmost cells. The last row of Table 2 reports the BLEU score for the setting which uses all 5 source languages.
For the results of the IWSLT corpus setting, refer to Table 3. Finally, refer to Table 4 for the UN corpus setting.

4.3 Analysis

From Tables 2, 3 and 4 it is clear that our simple source sentence concatenation approach is able to leverage multiple languages, leading to significant improvements over the BLEU scores obtained using any of the individual source languages. The ensembling and multi encoder approaches also lead to improvements in BLEU. It should be noted that in a resource poor scenario ensembling outperforms all other approaches, but in a resource rich scenario our method as well as the multi encoder approach are much better. However, one important aspect of our approach is that the model size of the multi-source systems is the same as that of the single source systems, since the vocabulary sizes are exactly the same. The multi encoder systems involve more parameters, whereas the ensembling approach does not allow the source languages to truly interact with each other.
In the case of the ILCI corpus setting, the BLEU scores of the baseline systems indicate that the closeness of the source language to Hindi (the target) influences the translation quality, since translating from Marathi gives the highest BLEU score, followed by Bengali and Telugu. Marathi and Bengali are the closest to Hindi (linguistically speaking) compared to the other languages, and thus when used together they help obtain an improvement of 4.39 BLEU points compared to when Marathi is used as the only source language (24.63). It is also surprising that Marathi and Telugu work together to give an improvement of 2.99 BLEU points over Marathi alone, because Telugu, being a Dravidian language, is quite different from Marathi and Hindi. We currently do not have a clear explanation for why this happens, but it suggests that it is not a mere coincidence that Bengali and Telugu used together as source languages also give an improvement of 3.52 BLEU points compared to when Bengali is the only source language.
In general it is clear that no matter which source languages are combined, there are gains over using the individual source languages alone. However, combining any of Marathi, Bengali and Telugu with either English or Tamil leads to smaller gains. This seems to indicate that although multiple source languages do help, it is better to use source languages that are linguistically closer to each other (as evidenced by how well Marathi and Bengali work when used together). Finally, the last row of Table 2 shows that using additional languages leads to further gains, reaching a BLEU score of 31.3, which is 6.5 points above the score obtained when Marathi is the only source language and 2.11 points above the score obtained when Marathi and Bengali are used as the source languages. Having five source languages is uncommon, and although it shows that increasing the number of source languages has a positive impact, there are diminishing returns14.
Similar gains in BLEU are observed in the case of the IWSLT corpus setting. Halving the size of the training corpus (from trilingual to 4-lingual) leads to baseline BLEU scores being reduced by half (19.72 to 9.62 for French-English tst2010 test set) but using an additional source leads to a gain of roughly 2 BLEU points. Although the gains are not as high as seen in the ILCI corpus setting it must be noted that the test set for the ILCI corpus is easier in the sense that it contains many short sentences compared to the IWSLT test sets. Our method does not show any gains in BLEU for the tst2013 test set in the 4-lingual setting, an anomaly which we plan to investigate in the future.
Finally, in the large training corpus setting, where we used approximately 2 million training sentences, we also obtained statistically significant (p <0.001) improvements in BLEU. In the case of the single source systems we observed that the BLEU score for Spanish-English was around 9 BLEU points higher than for French-English, which is consistent with the observations in the original work on the construction of the UN corpus [?]. Furthermore, using French and Spanish together leads to a small (0.7) improvement in BLEU over Spanish-English that is statistically significant (p <0.001), which is to be expected since the BLEU for Spanish-English is already much higher than the BLEU for French-English. Since the BLEU scores for French, Arabic and Russian to English are closer to each other, the BLEU scores for French+Arabic, French+Russian and Arabic+Russian to English are around 3 BLEU points higher than those of their respective single source counterparts. In general it can be seen that our method is language and domain independent. However, in a resource rich scenario, if the translation quality (in terms of BLEU) of one of the single source systems is much higher (by about 9 BLEU points) than that of the others, then the gains obtained by combining the corresponding source with other sources are not as high.

4.4 Studying multi-source attention

In order to understand whether or not our multi-source NMT approach prefers certain languages over others, we extracted a subset of 50 random sentences from the test set and obtained visualizations of the attention vectors. Refer to Figure 2 for an example. The words of the target sentence in Hindi are arranged from top to bottom along the rows, whereas the words of the multi-source sentence are arranged from right to left across the columns. Note that the source languages are in the following order: Bengali, English, Marathi, Tamil, Telugu. The most interesting observation is that the attention mechanism focuses on each language but with varying degrees of focus. Bengali, Marathi and Telugu are the three languages that receive most of the attention, whereas English and Tamil barely receive any. This clearly reflects how, when either Bengali or Telugu was combined with Marathi, there were significant gains in BLEU. Building on this observation, we believe that the gains we obtained by using all 5 source languages were mostly due to Bengali, Telugu and Marathi, whereas the NMT system learns to practically ignore Tamil and English. However, there does not seem to be any detrimental effect of using English and Tamil.
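Attention heatmaps such as the ones in Figures 2 and 3 can be produced from the attention weights collected during decoding with a short plotting routine along the lines of the following sketch; the matrix orientation (target words along the rows, concatenated multi-source tokens along the columns) follows the description above, and the function is illustrative rather than part of our toolkit.

```python
import matplotlib.pyplot as plt
import numpy as np

def plot_attention(attention, source_tokens, target_tokens, out_path="attention.png"):
    """attention: (target_len x source_len) matrix of attention weights
    collected while decoding one concatenated multi-source sentence."""
    fig, ax = plt.subplots(figsize=(max(4, 0.4 * len(source_tokens)),
                                    max(3, 0.4 * len(target_tokens))))
    ax.imshow(np.asarray(attention), cmap="Greys", aspect="auto")
    ax.set_xticks(range(len(source_tokens)))
    ax.set_xticklabels(source_tokens, rotation=90)
    ax.set_yticks(range(len(target_tokens)))
    ax.set_yticklabels(target_tokens)
    ax.set_xlabel("multi-source input tokens (all source languages)")
    ax.set_ylabel("generated target tokens")
    fig.tight_layout()
    fig.savefig(out_path)
```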
From Figure 3 it can be seen that this observation also holds in the UN corpus setting for French+Spanish to English, where the attention mechanism gives a higher weight to Spanish words compared to French words. It is also interesting to note that the attention can potentially be used to extract a multilingual dictionary, simply by learning an N-source NMT system and then generating a dictionary by extracting, for each target word generated, the source words that receive the highest attention.
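A minimal sketch of this dictionary extraction idea is given below: for every generated target word we record the source token that receives the highest attention weight and keep, over the whole corpus, the most frequent such token per target word. The data structures and the `decoded_corpus` iterable are illustrative assumptions.

```python
import numpy as np
from collections import Counter, defaultdict

def extract_dictionary(attention, source_tokens, target_tokens, counts):
    """Accumulate (target word -> source word) co-occurrence counts from one
    sentence. `attention` is assumed to be a (target_len x source_len) matrix
    of attention weights produced while translating the concatenated
    multi-source input."""
    for t, row in enumerate(attention):
        s = int(np.argmax(row))          # most attended source token
        counts[target_tokens[t]][source_tokens[s]] += 1
    return counts

# Usage over a decoded corpus: keep, for every target word, its most frequent
# highest-attention source word as the dictionary entry.
counts = defaultdict(Counter)
# for attention, src_toks, tgt_toks in decoded_corpus:   # illustrative iterable
#     extract_dictionary(attention, src_toks, tgt_toks, counts)
dictionary = {tgt: src_counter.most_common(1)[0][0]
              for tgt, src_counter in counts.items() if src_counter}
```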

Figure 2: Attention Visualization for ILCI corpus setting with all 5 source languages
Figure 3: Attention Visualization for UN corpus setting with French and Spanish as source languages

5 Conclusion

In this paper, we have proposed and evaluated a simple approach for “Multi-Source Neural Machine Translation” that does not modify the NMT system architecture, in resource poor as well as resource rich settings using the ILCI, IWSLT and UN corpora. We have compared our approach with two other previously proposed approaches and shown that it is highly effective, domain and language independent, and yields significant gains. We furthermore observed, by visualizing attention, that NMT focuses on some languages while practically ignoring others, indicating that language relatedness is one of the aspects that should be considered in a multilingual MT scenario. In the future we plan to conduct a full scale investigation of the language relatedness phenomenon by considering even more languages in resource rich as well as resource poor scenarios.



Footnotes

  1. Sometimes an N-lingual corpus is available as N-1 bilingual corpora (leading to N-1 sources), each having a fixed target language (typically English) and sharing a number of target language sentences.
  2. One additional benefit of our approach is that any NMT architecture can be used, be it attention based or hierarchical NMT.
  3. The BPE model is learned only on the training set.
  4. This is computed by the multi-bleu.pl script, which can be downloaded from the public implementation of Moses [?].
  5. This was used for the Indian Languages MT task in ICON 2014 (http://ltrc.iiit.ac.in/icon/2014) and ICON 2015 (http://ltrc.iiit.ac.in/icon2015/).
  6. https://conferences.unite.un.org/uncorpus
  7. https://wit3.fbk.eu/mt.php?release=2016-01
  8. In the task there are 3 domains: health, tourism and general. However, we focus on the general domain, in which half the corpus comes from the health domain and the other half comes from the tourism domain.
  9. https://github.com/fabiencro/knmt
  10. In the original work a single multilingual multiway NMT model was trained and ensembled but we train separate NMT models for each source language.
  11. We also tried vocabularies of size 16k and 32k but they take longer to train and overfit badly in a low resource setting.
  12. We observed that the models start overfitting around 7k-8k iterations.
  13. We performed evaluation using beam sizes 4, 8, 12 and 16 but found that the differences in BLEU between beam sizes 12 and 16 are small and gains in BLEU for beam sizes beyond 16 are insignificant.
  14. As future work it will be worthwhile to investigate the diminishing returns obtained per additional language.