QCRI Machine Translation Systems for IWSLT 16
This paper describes QCRI’s machine translation systems for the IWSLT 2016 evaluation campaign. We participated in the Arabic→English and English→Arabic tracks. We built both phrase-based and neural machine translation models, in an effort to probe whether the newly emerged NMT framework surpasses traditional phrase-based systems for the Arabic-English language pair. We trained a very strong phrase-based system including a big language model, the Operation Sequence Model, the Neural Network Joint Model and class-based models, along with different domain adaptation techniques such as MML filtering, mixture modeling and fine-tuning of the NNJM model. However, a Neural MT system, trained by stacking data from different genres through fine-tuning and applying an ensemble over 8 models, beat our very strong phrase-based system by a significant margin of 2 BLEU points in the Arabic→English direction. We did not obtain similar gains in the other direction but were still able to outperform the phrase-based system. We also applied system combination on phrase-based and NMT outputs.
We describe QCRI’s phrase-based and Neural MT systems. We participated in the Arabic-to-English and English-to-Arabic MT tracks. Our translation engines have historically been based on phrase-based systems trained using the Moses toolkit, but during the course of this evaluation we made a transition towards the newly emerged Neural MT framework, using Nematus, a toolkit used by the top-performing team during the recent WMT campaign.
An interesting challenge associated with the IWSLT campaign is the problem of domain adaptation. The in-domain data based on TED talks is available in very small quantity compared to the out-domain UN corpus, which has previously been found to be harmful when simply concatenated to the training data. In this year’s IWSLT, two additional data resources, OPUS subtitles and the QED corpus, were introduced. The latter was also used as an official test-set. Therefore, apart from exploring phrase-based versus Neural MT, we geared ourselves towards adapting our system to TED and QED talks in this multi-domain scenario. With these goals in mind we re-explored both model weighting and data filtering techniques in these new data settings. Below we itemize the most successful attributes of our phrase-based system:
We applied MML-based data selection to the UN and OPUS subtitle data, with the goal of filtering out harmful data.
We tried the fine-tuning method of training the NNJM model on the out-domain data and then fine-tuning it with the in-domain TED data.
We trained big language models using all the English monolingual data available from the WMT campaign, and the GigaWord corpus for Arabic.
We trained our Neural MT system using the Nematus toolkit. We used a bidirectional RNN for the encoder, 1024 LSTM units, and a word embedding size of 500. Below we itemize what worked when training the Neural MT system:
We trained our baseline model on all of the UN corpus, then continued training with the OPUS subtitles corpus, and finally fine-tuned with the in-domain data.
We fine-tuned all of our models without freezing any layers in the network, since we had a sufficient amount of data to train on.
We used dropout when fine-tuning with in-domain data, since it is relatively small compared to the UN and OPUS subtitle data.
Our final systems are ensembles of the last eight models, where each model was fine-tuned with the in-domain data.
Finally, we applied system combination over the outputs of the best Neural MT and phrase-based systems using MEMT. Our efforts were mainly focused on the AR→EN TED task. In the end we simply replicated our best system for the EN→AR direction and the QED task. For our best Neural MT system, we were unable to use an ensemble in the EN→AR direction, since we could not train several comparable models to combine.
2 Data Settings and Pre/Post Processing
We trained our systems using the data made available through the IWSLT 2016 campaign. This contained two in-domain data sets, TED talks and the QED corpus, and two out-domain data sets, the UN corpus and the OPUS data. The statistics are shown in Table 1. We trained the language models on the target side of the parallel corpus, together with all the available English data from the recent WMT campaign for English, and the GigaWord and OPUS monolingual corpora for Arabic.
We segmented Arabic data using both MADAMIRA and Farasa. We found that MADAMIRA performed 0.1 BLEU points better than Farasa (see Table 2) and decided to use it for the competition. We tokenized the English side using the standard Moses tokenizer. For English→Arabic, outputs were detokenized using the MADA detokenizer. Before scoring, we normalized both the output and the reference translations using the QCRI normalizer.
We trained a phrase-based Moses system with the following previously published settings: a maximum sentence length of 80, Fast-Aligner for word-alignments, an interpolated Kneser-Ney smoothed 5-gram language model, a lexicalized reordering model, a 5-gram operation sequence model, and a 14-gram NNJM model with its previously described baseline settings. We used the default distortion limit, 100-best translation options, a phrase-length of 5, the monotone-over-punctuation heuristic, and cube-pruning with a limit of 1000 during tuning and 5000 during test. We used k-best batch MIRA for tuning. We used cased BLEU to measure progress.
From our experience in previous competitions, we were wary of the fact that simply adding the UN data is harmful for the Arabic MT system. We therefore selected data through MML filtering. We selected 2.5%, 3.75%, 5%, 10% and 30% of the UN data and trained the MT pipeline by concatenating the selected data with the in-domain data. To get results quickly, we did not include the OPUS data (40 million sentences) or the NNJM in these experiments. Table 3 shows the results. We found 3.75% (685k sentences) to be the optimal threshold. As an alternative to data selection, we tried training in- and out-domain phrase-tables separately and using the out-domain phrase-table only as a back-off. The second-to-last row of Table 3 shows the results. While this gave an improvement on top of the baseline system, it was slightly behind MML filtering.
We then tried to find the optimal cut-off on the OPUS data, and selected 20 million sentences (half of the OPUS data). Our best systems used 3.75% of the UN data and half of the OPUS data. Adding the selected OPUS data gave an average improvement of +1.2 BLEU points.
We trained a bigger language model by using all the available English data from the recent WMT campaign.
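The MML selection step described above can be sketched as follows. This is a minimal illustration of cross-entropy-difference scoring on both sides of the bitext, not the pipeline actually used: it assumes toy add-one-smoothed unigram language models in place of real n-gram models, and the function names (`train_unigram`, `cross_entropy`, `mml_select`) are hypothetical.

```python
import math
from collections import Counter


def train_unigram(sentences):
    """Return a unigram log-probability function with add-one smoothing."""
    counts = Counter(tok for s in sentences for tok in s.split())
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra slot for unseen tokens

    def logprob(token):
        return math.log((counts[token] + 1) / (total + vocab))

    return logprob


def cross_entropy(logprob, sentence):
    """Per-token cross-entropy (in nats) of a sentence under a model."""
    tokens = sentence.split()
    if not tokens:
        return 0.0
    return -sum(logprob(t) for t in tokens) / len(tokens)


def mml_select(in_domain, out_domain, fraction):
    """Keep the `fraction` of out-domain pairs that look most in-domain.

    Both corpora are lists of (source, target) sentence pairs. A pair's
    score is the in-domain minus out-domain cross-entropy, summed over
    the source and target sides; lower scores are more in-domain.
    """
    models = []
    for side in (0, 1):
        in_lm = train_unigram(p[side] for p in in_domain)
        out_lm = train_unigram(p[side] for p in out_domain)
        models.append((in_lm, out_lm))

    def score(pair):
        return sum(
            cross_entropy(in_lm, pair[side]) - cross_entropy(out_lm, pair[side])
            for side, (in_lm, out_lm) in enumerate(models)
        )

    keep = max(1, round(len(out_domain) * fraction))
    return sorted(out_domain, key=score)[:keep]
```

Calling `mml_select(ted_pairs, un_pairs, 0.0375)` would correspond to the 3.75% threshold reported above, where `ted_pairs` and `un_pairs` are placeholder names for the in-domain and UN bitexts.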
3.4 Interpolation of Operation Sequence Models
The OSM has been a regular feature of the phrase-based pipeline in competition-grade systems. It is a joint sequence translation model which integrates reordering. It was recently found that an OSM trained on a plain concatenation of data is sub-optimal and can be improved by training OSM models on each domain individually and interpolating them by minimizing perplexity on the in-domain tune-set. Table 5 shows that using the interpolated OSM instead of the one trained on plain concatenation gives an average improvement of +0.6 BLEU points.
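One common way to find interpolation weights that minimize tune-set perplexity is expectation-maximization over a linear mixture. The sketch below illustrates that general technique and is not necessarily the exact procedure used for the OSM models here; `probs[k][i]` is an assumed input holding the probability that domain model `k` assigns to the `i`-th event of the in-domain tune-set.

```python
import math


def em_interpolation_weights(probs, iterations=50):
    """Estimate linear mixture weights that maximize tune-set likelihood.

    Maximizing likelihood is equivalent to minimizing perplexity.
    All probabilities are assumed to be strictly positive.
    """
    num_models = len(probs)
    num_events = len(probs[0])
    weights = [1.0 / num_models] * num_models
    for _ in range(iterations):
        expected = [0.0] * num_models
        for i in range(num_events):
            mix = sum(weights[k] * probs[k][i] for k in range(num_models))
            for k in range(num_models):
                # E-step: posterior responsibility of model k for event i
                expected[k] += weights[k] * probs[k][i] / mix
        # M-step: normalized expected counts become the new weights
        weights = [e / num_events for e in expected]
    return weights


def perplexity(probs, weights):
    """Perplexity of the interpolated model on the tune-set events."""
    num_events = len(probs[0])
    log_sum = 0.0
    for i in range(num_events):
        log_sum += math.log(sum(w * p[i] for w, p in zip(weights, probs)))
    return math.exp(-log_sum / num_events)
```

Because each EM iteration cannot decrease the tune-set likelihood, the resulting weights never give a higher perplexity than the uniform starting point.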
We also explored the award-winning Neural Network Joint Model (NNJM) in our pipeline and tried to adapt it towards the in-domain data. We trained an NNJM model on the UN and OPUS data for 25 epochs and then fine-tuned it by running for 25 more epochs on the in-domain data. Because the data is huge, the entire training took 1.5 months of wall-clock time. Table 6 shows the results. The NNJM model gave a significant improvement (+0.6) on top of the baseline which does not include it. We found the fine-tuning method to give slight gains (+0.2) when the baseline model was trained on the OPUS data. In contrast, fine-tuning did not help when the model was trained on UN.
We explored the use of automatic word clusters in phrase-based models. We used 50 classes, obtained by running mkcls. The cluster ids were included in the phrase-table. We additionally trained an in-domain language model using word-classes and interpolated OSM on word-classes. However, we only saw very small improvements using word classes.
3.7 Handling Unknown Words
We tried to handle OOV words using drop-oov and through transliteration. The former worked slightly better and was used in the best system. In principle the two methods are complementary, because they address different OOVs, but there is no good way to automatically decide which word to drop and which one to transliterate.
Table 8 shows incremental progress on the Arabic→English language pair. Our best system included MML-selected UN and OPUS corpora, a big language model, an interpolated OSM and fine-tuned NNJM models. We used the drop-oov option to handle unknown words.
We did not do detailed experiments for the English→Arabic direction because of computational limitations, but simply replicated what worked for the Arabic→English direction. Table 9 shows progress on this language pair. The baseline system (ID) was trained on the TED data and the target side of all the permissible parallel data. In the second row, we added all the parallel data except for the UN. In the third row we additionally added the UN data that we selected in the Arabic→English direction. The additional parallel data gives an average improvement of +1.4 BLEU points. We then added an NNJM model trained on in-domain TED data on top of this system to improve it by +0.8. Adding GigaWord and monolingual OPUS data (another 20M sentences beyond the target side of the parallel data) gave an improvement of +0.3. Finally, we replaced the baseline NNJM with the one trained on OPUS data and fine-tuned with the in-domain data to get our best system.
We built the QED systems by simply replicating the TED setup, with the QED corpus as the in-domain data instead of the TED data. We used the same UN data that we selected for our Arabic→English system, so our phrase-tables remain the same. The main changes arise when training the adapted OSM and NNJM models. For NNJM we simply fine-tuned with the QED corpus instead of the TED corpus. For the interpolated OSM, we concatenated the TED and QED corpora and built an OSM on the result, which was then interpolated with the OSM models trained on the selected UN and OPUS data. We used the IWSLT tuning set to obtain the interpolation weights. This way the OSM sub-model created from the TED+QED corpus gets the best weights. We also retrained the language model in a similar fashion. We used the tuning weights obtained from our best TED systems and replaced the TED-adapted OSM, NNJM and language models with their QED-adapted variants.
4 Neural Machine Translation
We used a similar pre/post-processing pipeline for Neural MT as for our phrase-based systems (Section 2), and additionally applied BPE before training. Our BPE models were trained separately on the Arabic and English datasets instead of jointly, since the character sets of the two languages differ. We limited the number of merge operations to 59,500, as suggested in prior work. We experimented with BPE models trained on the TED data, and on the concatenation of the TED and out-domain data. We did not see any considerable difference in performance between these models. We therefore used the BPE model trained on the TED data for the experiments reported in this paper.
We used the default parameters in Nematus to train our systems: a batch size of 80, source and target vocabularies of 50K entries each, 1024 LSTM units, and an embedding layer size of 500. Baseline systems were trained using only the TED corpus.
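To make the notion of BPE merge operations concrete, the following is a minimal sketch of how merges can be learned from word frequencies and then applied to a new word. It is an illustrative re-implementation of the general algorithm, not the tool used in our experiments; the `</w>` end-of-word marker, the tie-breaking rule and the function names are assumptions made for this example.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` with its concatenation."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` merge operations from a {word: count} dict."""
    # Each word starts as a sequence of characters plus an end-of-word marker.
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for left, right in zip(symbols, symbols[1:]):
                pairs[(left, right)] += freq
        if not pairs:
            break
        # Most frequent pair wins; ties are broken deterministically.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        vocab = {merge_pair(symbols, best): freq for symbols, freq in vocab.items()}
    return merges


def apply_bpe(word, merges):
    """Segment a word by replaying the learned merges in order."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)
```

In this sketch, training separate models per language simply means calling `learn_bpe` once on Arabic word counts and once on English word counts, each with its own merge budget.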
4.3 Fine Tuning on Concatenation versus OD
The best phrase-based systems are usually trained by concatenating in- and out-domain data. Deep learning systems, on the other hand, are typically trained on the out-domain data first and then fine-tuned with in-domain data. We experimented with both strategies. In the interest of time we selected 30% of the UN data using MML filtering (Table 3). We trained two systems, one on the concatenation of the in-domain data with the selected (30%) UN data and the other on the selected data alone. After running them for 3 epochs, we fine-tuned both models with the in-domain TED data. Table 10 shows that fine-tuning a system trained on out-domain data only outperforms fine-tuning a system trained on the concatenation.
4.4 Fine-tuning Variants and Dropouts
The default version of Nematus applies fine-tuning by freezing the weights of the embedding layer. The intuition behind freezing a layer is to prevent the weights in that layer from changing with additional data. This is sometimes useful when certain layers can be learned better from out-domain data. One such layer in our case is the word embedding layer. We tried a variation in which we do not freeze any layer. This latter variant was found to outperform the default setting (see Table 11).
Dropout is known to be useful in NN training when the training data is small. We experimented with dropout but did not find any significant difference. We therefore decided to use it only when fine-tuning with the in-domain data (TED/QED), since both of the other datasets (UN and OPUS) are large and posed little risk of overfitting.
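The difference between the two fine-tuning variants can be illustrated with a toy gradient step. This is a schematic sketch only: parameters are plain Python lists keyed by hypothetical layer names, and it does not reflect how Nematus implements freezing internally.

```python
def sgd_step(params, grads, learning_rate, frozen=()):
    """Return updated parameters, leaving layers named in `frozen` untouched.

    `params` and `grads` map layer names to lists of floats.
    """
    updated = {}
    for name, values in params.items():
        if name in frozen:
            # Frozen layer: keep the weights learned on out-domain data.
            updated[name] = list(values)
        else:
            updated[name] = [
                v - learning_rate * g for v, g in zip(values, grads[name])
            ]
    return updated
```

Passing `frozen=("embedding",)` mimics the default behaviour described above, while leaving `frozen` empty corresponds to the variant in which every layer is updated during fine-tuning.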
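For readers unfamiliar with dropout, a minimal sketch of the idea follows. It assumes the common "inverted" formulation, in which surviving activations are rescaled during training so that nothing needs to change at test time; the function name and signature are hypothetical, and the toolkit's own implementation may differ.

```python
import random


def dropout(values, rate, training, rng=None):
    """Randomly zero activations with probability `rate` during training.

    Kept activations are scaled by 1 / (1 - rate) so that their expected
    value is unchanged. `rate` must be in the range [0, 1).
    """
    if not training or rate == 0.0:
        return list(values)
    rng = rng or random.Random()
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in values]
```

Applying dropout only in the fine-tuning stage then amounts to calling this with a non-zero `rate` while training on TED or QED, and with `rate=0.0` while training on the UN and OPUS data.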
Since we found data selection useful in the phrase-based system, we also trained our neural systems using 5%, 30% and 100% of the UN data. In these experiments, we concatenated the 5% and 30% selections of the UN data with the in-domain data. To evaluate the most promising models, we trained all of the models until learning plateaued, and then fine-tuned them with the in-domain data. After fine-tuning with TED, the model trained on 30% of the UN data gave the best average score (first three rows of Table 12).
In our subsequent experiments we tried to verify whether this finding holds when we add the OPUS data. We therefore trained two systems, continuing training on OPUS from either the model trained on the 30% selected UN data or the one trained on the full UN data. Here the results flipped: the model that used all of the UN data performed better (compare the last two rows of Table 12). Therefore, we decided to focus our efforts on the model trained on the entire UN data for all of the following experiments.
|5% + FT(TED)||25.8||29.4||29.3||25.0||27.4|
|30% + FT(TED)||28.4||32.7||32.9||27.8||30.4|
|Full + FT(TED)||28.1||32.3||31.6||27.0||29.8|
|30% + FT(OPUS)||26.1||30.6||32.5||27.1||29.1|
|Full + FT(OPUS)||28.2||31.7||34.3||29.2||30.8|
Ensembling models has been shown to give a consistent boost in performance in past best-performing systems. We therefore experimented with several variations. We obtained the best-performing combination by fine-tuning the last eight models of the UN+OPUS system and then ensembling these eight fine-tuned models. Performance improvements from the ensemble are shown in Table 13. The second row shows the system obtained when we fine-tune our best system in Table 12 with the in-domain TED data. The last row shows the ensemble.
|Full + FT(OPUS)||28.2||31.7||34.3||29.2||30.8|
Our final system was trained by first using all of the UN data. We then continued training on the OPUS data. Once learning had plateaued on the OPUS data, we took the last eight models, which were very similar in performance, and fine-tuned each of them using TED data. We then combined these eight fine-tuned models in an ensemble as our final system. The progress is shown in Table 14. We used the same strategy for the QED systems by fine-tuning the last eight OPUS models with QED data and combining these in an ensemble.
We used insights gained from our Arabic-to-English experiments to train our English→Arabic systems. Our final model for both TED and QED was first trained on all of the UN data, followed by the OPUS data, and finally fine-tuned with the in-domain data. The progress is shown in Table 15.
We combined hypotheses produced by our best phrase-based and Neural MT systems. For this purpose we used the Multi-Engine MT system, or MEMT. The results are shown in Table 16. We did not gain any substantial improvements using system combination. Small improvements were obtained in the Arabic→English direction, barring test-2012. In the English→Arabic direction, in contrast, a significant improvement was obtained only on test-2013. Table 17 shows results on the official test-sets.
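The ensembling step described above can be pictured as combining the next-word distributions of the individual models at every decoding step. The sketch below assumes a simple arithmetic mean of probabilities and greedy selection; the actual decoder uses beam search, and the function names here are hypothetical.

```python
def ensemble_step(distributions):
    """Average the next-word distributions predicted by several models.

    `distributions` is a list with one probability vector per model,
    all defined over the same target vocabulary.
    """
    num_models = len(distributions)
    vocab_size = len(distributions[0])
    return [
        sum(dist[i] for dist in distributions) / num_models
        for i in range(vocab_size)
    ]


def greedy_pick(distributions):
    """Return the vocabulary index with the highest averaged probability."""
    averaged = ensemble_step(distributions)
    return max(range(len(averaged)), key=averaged.__getitem__)
```

With eight fine-tuned models, `distributions` would contain eight vectors per step; a word preferred by most models can win even if one model ranks a different word first.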
We trained a very strong phrase-based system with state-of-the-art features such as OSM, NNJM and a big LM. The system improved greatly by applying domain adaptation. To this end we applied MML-based filtering, interpolated OSM and fine-tuning of NNJM models. Overall, our phrase-based system achieved a gain of 4 BLEU points on top of the baseline system. We also applied data selection for training our NMT systems. However, the NMT systems trained on selected data quickly overfit and did not perform well. Our experiments showed that the NMT system trained on the full UN data performed best, and the final NMT system made use of all the available out-of-domain data. The training was performed incrementally: starting with the UN data for 50k iterations, then fine-tuning on OPUS for 25k more iterations, and finally fine-tuning the resulting model using TED talks for a few iterations. We simply replicated our settings to train the QED systems. Finally, we applied system combination of the two systems using MEMT.
While it is computationally expensive, we found training a Neural MT system much simpler than training a competitive phrase-based system, where a lot of sub-components need to be optimized independently to reach the best configuration. In contrast, an NMT system requires far less manual supervision. Secondly, once a neural system is trained, the effort can easily be reused to adapt the system towards another domain: in this case we simply fine-tuned our UN+OPUS system with the QED corpus. By comparison, almost all the sub-components of a phrase-based system had to be retrained to adapt the system towards the QED corpus.
- Because we were running experiments in parallel, we were not aware at this point that fine-tuning on out-domain data is the better strategy.
- P. Koehn, H. Hoang, A. Birch, C. Callison-Burch, M. Federico, N. Bertoldi, B. Cowan, W. Shen, C. Moran, R. Zens, C. Dyer, O. Bojar, A. Constantin, and E. Herbst, “Moses: Open source toolkit for statistical machine translation,” in Proceedings of the Association for Computational Linguistics (ACL’07), Prague, Czech Republic, 2007.
- D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” in ICLR, 2015. [Online]. Available: http://arxiv.org/pdf/1409.0473v6.pdf
- R. Sennrich, B. Haddow, and A. Birch, “Edinburgh neural machine translation systems for wmt 16,” in Proceedings of the First Conference on Machine Translation. Berlin, Germany: Association for Computational Linguistics, August 2016, pp. 371–376. [Online]. Available: http://www.aclweb.org/anthology/W16-2323
- M. Ziemski, M. Junczys-Dowmunt, and B. Pouliquen, “The united nations parallel corpus v1.0,” in Proceedings of the Tenth International Conference on Language Resources and Evaluation LREC 2016, Portorož, Slovenia, May 23-28, 2016, 2016.
- H. Sajjad, F. Guzmán, P. Nakov, A. Abdelali, K. Murray, F. A. Obaidli, and S. Vogel, “QCRI at IWSLT 2013: Experiments in Arabic-English and English-Arabic spoken language translation,” in Proceedings of the 10th International Workshop on Spoken Language Technology (IWSLT-13), December 2013.
- P. Lison and J. Tiedemann, “Opensubtitles2016: Extracting large parallel corpora from movie and tv subtitles,” in Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016). European Language Resources Association (ELRA), May 2016.
- A. Abdelali, F. Guzman, H. Sajjad, and S. Vogel, “The AMARA corpus: Building parallel language resources for the educational domain,” in Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14), Reykjavik, Iceland, May 2014.
- A. Axelrod, X. He, and J. Gao, “Domain adaptation via pseudo in-domain data selection,” in Proceedings of the Conference on Empirical Methods in Natural Language Processing, ser. EMNLP ’11, Edinburgh, United Kingdom, 2011.
- N. Durrani, H. Schmid, and A. Fraser, “A joint sequence translation model with integrated reordering,” in Proceedings of the Association for Computational Linguistics: Human Language Technologies (ACL-HLT’11), Portland, OR, USA, 2011.
- N. Durrani, H. Sajjad, S. Joty, A. Abdelali, and S. Vogel, “Using joint models for domain adaptation in statistical machine translation,” in Proceedings of the Fifteenth Machine Translation Summit (MT Summit XV). Florida, USA: AMTA, November 2015.
- N. Durrani, P. Koehn, H. Schmid, and A. Fraser, “Investigating the usefulness of generalized word representations in smt,” in Proceedings of the 25th Annual Conference on Computational Linguistics, ser. COLING’14, Dublin, Ireland, 2014, pp. 421–432.
- M.-T. Luong and C. D. Manning, “Stanford neural machine translation systems for spoken language domain,” in International Workshop on Spoken Language Translation, Da Nang, Vietnam, 2015.
- K. Heafield and A. Lavie, “CMU system combination in WMT 2011,” in Proceedings of the EMNLP 2011 Sixth Workshop on Statistical Machine Translation, Edinburgh, Scotland, United Kingdom, July 2011, pp. 145–151. [Online]. Available: http://kheafield.com/professional/avenue/wmt_2011.pdf
- F. Guzmán, H. Sajjad, S. Vogel, and A. Abdelali, “The AMARA corpus: Building resources for translating the web’s educational content,” in Proceedings of the 10th International Workshop on Spoken Language Technology (IWSLT-13), December 2013.
- O. Bojar, R. Chatterjee, C. Federmann, Y. Graham, B. Haddow, M. Huck, A. Jimeno Yepes, P. Koehn, V. Logacheva, C. Monz, M. Negri, A. Neveol, M. Neves, M. Popel, M. Post, R. Rubino, C. Scarton, L. Specia, M. Turchi, K. Verspoor, and M. Zampieri, “Findings of the 2016 conference on machine translation,” in Proceedings of the First Conference on Machine Translation. Berlin, Germany: Association for Computational Linguistics, August 2016, pp. 131–198. [Online]. Available: http://www.aclweb.org/anthology/W/W16/W16-2301
- A. Pasha, M. Al-Badrashiny, M. Diab, A. El Kholy, R. Eskander, N. Habash, M. Pooleery, O. Rambow, and R. M. Roth, “MADAMIRA: A fast, comprehensive tool for morphological analysis and disambiguation of Arabic,” in Proceedings of the Language Resources and Evaluation Conference, ser. LREC ’14, Reykjavik, Iceland, 2014, pp. 1094–1101.
- A. Abdelali, K. Darwish, N. Durrani, and H. Mubarak, “Farasa: A fast and furious segmenter for arabic,” in Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations. San Diego, California: Association for Computational Linguistics, June 2016, pp. 11–16. [Online]. Available: http://www.aclweb.org/anthology/N16-3003
- A. Birch, M. Huck, N. Durrani, N. Bogoychev, and P. Koehn, “Edinburgh SLT and MT system description for the IWSLT 2014 evaluation,” in Proceedings of the 11th International Workshop on Spoken Language Translation, ser. IWSLT ’14, Lake Tahoe, CA, USA, 2014.
- C. Dyer, V. Chahuneau, and N. A. Smith, “A simple, fast, and effective reparameterization of ibm model 2,” in Proceedings of NAACL’13, 2013.
- K. Heafield, “KenLM: Faster and Smaller Language Model Queries,” in Proceedings of the Sixth Workshop on Statistical Machine Translation, Edinburgh, Scotland, United Kingdom, July 2011, pp. 187–197. [Online]. Available: http://kheafield.com/professional/avenue/kenlm.pdf
- M. Galley and C. D. Manning, “A Simple and Effective Hierarchical Phrase Reordering Model,” in Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, Honolulu, Hawaii, October 2008, pp. 848–856. [Online]. Available: http://www.aclweb.org/anthology/D08-1089
- N. Durrani, H. Schmid, A. Fraser, P. Koehn, and H. Schütze, “The Operation Sequence Model – Combining N-Gram-based and Phrase-based Statistical Machine Translation,” Computational Linguistics, vol. 41, no. 2, pp. 157–186, 2015.
- J. Devlin, R. Zbib, Z. Huang, T. Lamar, R. Schwartz, and J. Makhoul, “Fast and robust neural network joint models for statistical machine translation,” in Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2014.
- S. Joty, H. Sajjad, N. Durrani, K. Al-Mannai, A. Abdelali, and S. Vogel, “How to Avoid Unwanted Pregnancies: Domain Adaptation using Neural Network Models,” in Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, September 2015.
- C. Cherry and G. Foster, “Batch tuning strategies for statistical machine translation,” in Proceedings of the 2012 Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, ser. NAACL-HLT ’12, Montréal, Canada, 2012.
- K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, “BLEU: A Method for Automatic Evaluation of Machine Translation,” in Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ser. ACL ’02, Morristown, NJ, USA, 2002, pp. 311–318.
- N. Durrani, A. Fraser, H. Schmid, H. Hoang, and P. Koehn, “Can markov models over minimal translation units help phrase-based smt?” in Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Sofia, Bulgaria: Association for Computational Linguistics, August 2013, pp. 399–405. [Online]. Available: http://www.aclweb.org/anthology/P13-2071
- H. Sajjad, A. Fraser, and H. Schmid, “A statistical model for unsupervised and semi-supervised transliteration mining,” in Proceedings of the Association for Computational Linguistics (ACL’12), Jeju, Korea, 2012.
- N. Durrani, H. Sajjad, H. Hoang, and P. Koehn, “Integrating an unsupervised transliteration model into statistical machine translation,” in Proceedings of the 15th Conference of the European Chapter of the ACL (EACL 2014), Gothenburg, Sweden, April 2014.
- R. Sennrich, B. Haddow, and A. Birch, “Neural machine translation of rare words with subword units,” in Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Berlin, Germany: Association for Computational Linguistics, August 2016, pp. 1715–1725. [Online]. Available: http://www.aclweb.org/anthology/P16-1162