Exploiting Out-of-Domain Data Sources for Dialectal Arabic Statistical Machine Translation
Statistical machine translation for dialectal Arabic is characterized by a lack of data since data acquisition involves the transcription and translation of spoken language. In this study we develop techniques for extracting parallel data for one particular dialect of Arabic (Iraqi Arabic) from out-of-domain corpora in different dialects of Arabic or in Modern Standard Arabic. We compare two different data selection strategies (cross-entropy based and submodular selection) and demonstrate that a very small but highly targeted amount of found data can improve the performance of a baseline machine translation system. We furthermore report on preliminary experiments on using automatically translated speech data as additional training data.
Katrin Kirchhoff Department of Electrical Engineering University of Washington Seattle, WA, USA firstname.lastname@example.org Bing Zhao (work was done while the author was at SRI International) LinkedIn email@example.com
Wen Wang SRI International Menlo Park, CA, USA firstname.lastname@example.org
In the Arabic-speaking world, dialectal Arabic (DA) is used side-by-side with the standard form of the language, Modern Standard Arabic (MSA). Whereas the latter is used for written and formal oral communication (lectures, speeches), DA is used for everyday, casual communication. DA is almost never written; exceptions are transcriptions of spoken language, e.g., in novels, movie scripts, or in online blogs or forums. DA and MSA exhibit strong differences at the lexical, phonological, morphological, and syntactic levels; furthermore, the dialects themselves form a similarity continuum that ranges from closely related to mutually unintelligible. An overview of the main characteristics of DA can be found in [Habash, 2010].
Most natural language processing (NLP) tools that have been developed for Arabic have been targeted towards MSA, for which large amounts of written data exist. NLP for DA suffers from a sparsity of tools as well as data. Work on DA annotation tools includes the development of morphological analyzers for Arabic dialects [Habash et al., 2005, Habash et al., 2012, Habash et al., 2013], treebanks [Maamouri et al., 2006] and parsers [Chiang et al., 2006], unsupervised [Duh and Kirchhoff, 2005] or supervised [Al-Sabbagh and Girju, 2012] training of POS taggers for DA, and lexicon acquisition [Duh and Kirchhoff, 2006]. However, most of these have been targeted to the Egyptian or Levantine dialects and do not easily generalize to other dialects. There are a small number of speech and parallel text corpora for Egyptian, Levantine, and Iraqi DA, primarily available from the Linguistic Data Consortium (LDC) and the European Language Resources Association (ELRA). In general, however, spoken language needs to be recorded and transcribed to produce text data, which constitutes a bottleneck for the rapid acquisition of new data.
The lack of training data for DA in statistical machine translation (SMT) has only been addressed in a few previous studies; the standard approach has been to simply collect more training data by transcribing and translating DA speech. [Zbib et al., 2012] compare utilizing large amounts of MSA data for training and creating a small corpus of DA training data. They conclude that simply adding large amounts of mismatched (MSA) training data does not help, whereas even a small amount of dialectal data is very useful. Salloum and Habash [Salloum and Habash, 2011, Salloum and Habash, 2013] propose to transform DA to MSA by means of a combination of statistical processing and hand-coded transformation rules, and to then apply MT systems for MSA-to-English. Their work was on Egyptian Arabic, and porting this approach to a different dialect involves a fair amount of manual effort and dialect expertise. In [Aminian et al., 2014] the specific problem of out-of-vocabulary words in MT for DA is addressed by replacing DA words with their MSA equivalents.
In this paper we attempt to enrich available training data for Iraqi Arabic by automatically identifying IA-English parallel data in out-of-domain corpora of MSA and other dialects of Arabic. This procedure is based on the assumption that at least some dialects will exhibit similarities with IA. Corpora formally described as MSA may also contain dialectal data at the subsentential level due to code-switching (mixed use of MSA and DA), which is common among Arabic speakers. In principle, automatic dialect identification methods [Alorifi, 2008, Sadat et al., 2014, Zaidan and Callison-Burch, 2014] might be used for this purpose; however, these methods are themselves error-prone and have not been developed for all dialects of Arabic. Our approach is to directly select data that is matched to features (n-grams) extracted from a sample corpus of the dialect of interest. In addition to finding dialect-matched data, the selected data is also likely to be matched with respect to topic and style. Two different data selection methods are investigated, the widely-used cross entropy method of [Moore and Lewis, 2010], and a more recent submodular data selection method [Wei et al., 2013]. We demonstrate that the performance of SMT systems for IA can be improved by selecting a very small amount of highly targeted out-of-domain data. In addition, we conduct a preliminary investigation of the possibility of using automatically translated speech data as SMT training data.
The paper is structured as follows: we first report on previous work on data selection for SMT (Section 2). We then describe the submodular technique used in this paper in detail (Section 3). The data is described in Section 4; experiments and results are presented in Section 5. We provide conclusions in Section 6.
2 Data Selection: Previous Work
A currently widely-used data selection method in SMT (which we also use as a baseline in Section 5) uses the cross-entropy between two language models [Moore and Lewis, 2010], one trained on the test set of interest, and another trained on generic or out-of-domain training data. We call this the cross-entropy method. This method trains a test-set specific (or in-domain) language model, $LM_{in}$, and a generic (out-of- or mixed-domain) language model, $LM_{out}$. Each sentence $s$ in the training data is scored by both language models and is assigned the log ratio of the language model probabilities as a score:
$$\mathrm{score}(s) = \frac{1}{\ell(s)} \log \frac{P(s \mid LM_{in})}{P(s \mid LM_{out})} \tag{1}$$
where $\ell(s)$ is the length of sentence $s$. Sentences are then ranked in descending order based on their scores and the top sentences are chosen. Various extensions to this method have been proposed. In [Axelrod et al., 2011] the monolingual selection method is extended to bilingual corpora. In [Duh et al., 2013], neural language models are used instead of backoff language models. Finally, [Mediani et al., 2014] propose a different method for drawing the out-of-domain sample and the use of word-association models to improve the data for training the out-of-domain language model.
The cross-entropy approach ranks each sentence individually, without reference to other sentences. Thus, no sentence interactions can be modelled, such as redundancy at the sentential or sub-sentential level. Moreover, the method does not have a theoretical performance guarantee.
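The ranking step of the cross-entropy method can be illustrated with a minimal sketch. It assumes add-one smoothed unigram models as stand-ins for the backoff n-gram language models used in practice; the function names (`train_unigram`, `logprob`, `cross_entropy_select`) and the toy sentences are hypothetical and are not taken from the systems described in this paper.

```python
import math
from collections import Counter

def train_unigram(sentences):
    """Count-based unigram model (an assumed stand-in for a real n-gram LM)."""
    counts = Counter(w for s in sentences for w in s.split())
    return counts, sum(counts.values())

def logprob(sentence, model, vocab_size):
    """Add-one smoothed log-probability of a whitespace-tokenized sentence."""
    counts, total = model
    return sum(math.log((counts[w] + 1) / (total + vocab_size))
               for w in sentence.split())

def cross_entropy_select(pool, in_domain, out_domain, k):
    """Rank pool sentences by the length-normalized log ratio and keep the top k."""
    vocab = {w for s in pool + in_domain + out_domain for w in s.split()}
    v = len(vocab) + 1  # one extra slot reserved for unseen words
    lm_in = train_unigram(in_domain)
    lm_out = train_unigram(out_domain)

    def score(s):
        n = max(len(s.split()), 1)  # guard against empty sentences
        return (logprob(s, lm_in, v) - logprob(s, lm_out, v)) / n

    # Each sentence is scored on its own; redundancy among the
    # selected sentences is not taken into account.
    return sorted(pool, key=score, reverse=True)[:k]

in_domain = ["where is the checkpoint", "open the gate"]
out_domain = ["the market rose today", "stocks fell sharply today"]
pool = ["the market fell", "where is the gate", "stocks rose"]
print(cross_entropy_select(pool, in_domain, out_domain, 1))
```

Because the score depends only on the sentence itself, two near-identical sentences receive near-identical scores and can both be selected, which is the limitation noted above.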
3 Submodular Data Selection
Submodular functions [Edmonds, 1970, Fujishige, 2005] were first developed in mathematics, operations research and economics; more recently, they have been used for a variety of optimization problems in machine learning as well. For example, they have been applied to the problems of clustering [Narasimhan and Bilmes, 2007], observation selection [Krause et al., 2008], sensor placement [Krause and Guestrin, 2011], or image segmentation [Jegelka and Bilmes, 2011]. Within natural language processing (NLP) submodular functions have been used for extractive text summarization [Lin and Bilmes, 2012].
To explain submodular functions, we introduce the following notation: assume a finite set of data elements $V$, the ground set. A valuation function $f: 2^V \rightarrow \mathbb{R}_+$ is then defined that returns a non-negative real value for any subset $X \subseteq V$. The function $f$ is called submodular if it satisfies the property of diminishing returns: for all $X \subseteq Y \subseteq V$ and $v \in V \setminus Y$, the following is true:
$$f(X \cup \{v\}) - f(X) \geq f(Y \cup \{v\}) - f(Y) \tag{2}$$
This means that the incremental value (or gain) of element $v$ decreases when the context in which $v$ is considered grows from $X$ to $Y$. The “gain” is defined as $f(v \mid X) = f(X \cup \{v\}) - f(X)$. Thus, $f$ is submodular if $f(v \mid X) \geq f(v \mid Y)$. Submodularity is a natural model for data selection in SMT and other NLP tasks. The ground set $V$ is the set of training data elements, and elements are selected from this set according to a submodular valuation function for any given subset of $V$. The value of this function diminishes for items that are (partially) redundant with other items in the already-selected subset, which is precisely the submodularity property. The specific function we utilize for the purpose of MT data selection is as follows:
$$f(X) = \sum_{u \in U} w_u \, \phi_u(m_u(X)) \tag{3}$$
Here, $U$ is a set of features (such as words, n-grams, etc.), $X$ is a subset of $V$, $w_u$ is a non-negative weight, $\phi_u$ is a non-negative, non-decreasing concave function, and $m_u(X)$ is a score indicating how relevant $u$ is in sample $X$. Thanks to the concave function, the contribution of each feature in the context of an existing subset $X$ diminishes as $X$ grows.
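The diminishing-returns behaviour of such a feature-based function can be checked numerically. The sketch below makes several simplifying assumptions that are not stated in the paper: all feature weights are set to 1, the concave function is the square root, and the relevance score of a feature is its raw count in the subset. The helper names `ngrams`, `f`, and `gain` are hypothetical.

```python
import math
from collections import Counter

def ngrams(tokens, max_n):
    """All n-grams of length 1..max_n as tuples."""
    return [tuple(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]

def f(subset, features, max_n=2):
    """Feature-based value: sum over features of sqrt(count of feature in subset).

    Assumptions: unit weights, sqrt as the concave function, raw counts as scores.
    """
    mass = Counter()
    for sentence in subset:
        for g in ngrams(sentence.split(), max_n):
            if g in features:
                mass[g] += 1
    return sum(math.sqrt(m) for m in mass.values())

def gain(v, subset, features):
    """Incremental value of adding sentence v to subset."""
    return f(subset + [v], features) - f(subset, features)

# Features come from a (toy) in-domain sample.
features = set(ngrams("open the gate please".split(), 2))
v = "open the gate now"
X = []                   # smaller context
Y = ["open the gate"]    # larger context that already covers most of v

print(gain(v, X, features))  # gain in the empty context
print(gain(v, Y, features))  # smaller gain once similar data is present
```

The second gain is smaller because every matching n-gram of `v` is already covered by the sentence in `Y`, so the square root flattens its additional contribution.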
In our work the feature set $U$ consists of all n-grams up to a pre-specified length drawn from a representative in-domain data set. The feature relevance scores $m_u(x)$ are the tf-idf weighted counts of the features (n-grams). The tf-idf (term frequency, inverse document frequency) values are computed by treating each sentence as a “document”. That is, the weighting term is
$$m_u(x) = \mathrm{tf}_u(x) \cdot \log \frac{|V|}{d(u)} \tag{4}$$
where $\mathrm{tf}_u(x)$ is the count of $u$ in $x$ (term frequency), and $d(u)$ is the number of sentences out of $|V|$ that $u$ occurs in.
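A minimal sketch of sentence-level tf-idf scoring follows. It assumes the common form in which the term frequency of an n-gram in a sentence is multiplied by the logarithm of the total number of sentences divided by the number of sentences containing that n-gram; the exact variant used in the experiments may differ, and the names `ngrams` and `tfidf_scores` are hypothetical.

```python
import math
from collections import Counter

def ngrams(tokens, max_n):
    """All n-grams of length 1..max_n as tuples."""
    return [tuple(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]

def tfidf_scores(sentences, max_n=2):
    """Return one dict per sentence mapping n-gram -> tf * log(N / df).

    Each sentence is treated as a "document" (assumed tf-idf variant).
    """
    per_sentence = [Counter(ngrams(s.split(), max_n)) for s in sentences]
    # Document frequency: iterating a Counter yields each distinct n-gram once.
    df = Counter(g for counts in per_sentence for g in counts)
    n = len(sentences)
    return [{g: tf * math.log(n / df[g]) for g, tf in counts.items()}
            for counts in per_sentence]

sentences = ["open the gate", "close the gate", "open the door now"]
scores = tfidf_scores(sentences)
print(scores[0][("the",)])   # an n-gram found in every sentence gets weight 0
print(scores[2][("door",)])  # an n-gram found in one sentence gets the largest idf
```

N-grams that occur in every sentence receive a score of zero, so frequent function words contribute nothing to the selection objective under this weighting.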
The above function can be optimized efficiently even for large data sets. Formally, we have the following optimization problem:
$$\max_{X \subseteq V,\; c(X) \leq K} f(X) \tag{5}$$
where $c(X)$ is the cost of the selected subset and $K$ is a known budget; in the present context, the budget can be, e.g., the number of words or parallel sentences to select. Solving this problem exactly is NP-complete [Feige, 1998], and expressing it as an ILP procedure renders it impractical for large data sizes. When $f$ is submodular and the cost is just size ($c(X) = |X|$), then the simple greedy algorithm (detailed in Algorithm 1) will have a worst-case guarantee of $f(\hat{X}) \geq (1 - 1/e)\, f(X_{opt}) \approx 0.63\, f(X_{opt})$, where $X_{opt}$ is the optimal and $\hat{X}$ is the greedy solution [Nemhauser et al., 1978].
This constant factor guarantee stays the same as $|V|$ grows; thus, it scales well to large data sets. The application of this procedure to the selection of training data for large-scale SMT tasks was described in [Kirchhoff and Bilmes, 2014]. Here, we apply it in the same way to the selection of out-of-domain data for a small-scale task.
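The greedy procedure can be sketched as follows. This is an illustrative version under stated assumptions, not the implementation used in the experiments: the budget is a number of sentences (the cardinality case covered by the guarantee), feature weights are 1, the concave function is the square root, and the per-sentence feature scores are given as dictionaries. The name `greedy_select` and the toy scores are hypothetical.

```python
import math
from collections import Counter

def greedy_select(feature_scores, budget):
    """Greedily pick up to `budget` items maximizing sum_u sqrt(m_u(X)).

    feature_scores[i] maps each feature of item i to a non-negative score.
    Returns the indices of the selected items in selection order.
    """
    selected = []
    mass = Counter()  # accumulated score m_u(X) of the current selection
    remaining = set(range(len(feature_scores)))
    while remaining and len(selected) < budget:
        best, best_gain = None, 0.0
        for i in sorted(remaining):  # sorted for deterministic tie-breaking
            g = sum(math.sqrt(mass[u] + m) - math.sqrt(mass[u])
                    for u, m in feature_scores[i].items())
            if g > best_gain:
                best, best_gain = i, g
        if best is None:  # no remaining item adds any value
            break
        selected.append(best)
        remaining.remove(best)
        for u, m in feature_scores[best].items():
            mass[u] += m
    return selected

# Items 0 and 1 are redundant with each other; item 2 covers a new feature.
scores = [{"a": 1.0, "b": 1.0}, {"a": 1.0, "b": 1.0}, {"c": 1.0}]
print(greedy_select(scores, budget=2))
```

After item 0 is chosen, the duplicate item 1 offers a smaller gain than item 2 even though item 2 matches fewer features, so the selection favours coverage over repetition. A word-count budget, as used in Section 5.2, would require a cost-aware variant of this loop.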
4 Data
The in-domain data available for the present study is the Transtac corpus of Iraqi Arabic; the sizes of the training, tuning and development test sets are shown in Table 1.
The out-of-domain data sources used for the selection experiments are listed in Table 2. We utilize 22 LDC corpora that include MSA and other dialects of Arabic, notably Egyptian and Levantine. For example, training corpora developed for the GALE, TIDES, and BOLT projects were included, as were the Levantine Arabic Treebank, an Egyptian Arabic word alignment corpus, and a corpus of dialectal Arabic web data (75% Levantine, 25% Egyptian) that was translated through crowdsourcing (thus, translations are noisy). Note that even though a corpus may be officially listed as MSA, it may contain segments of DA, especially when broadcast conversations (e.g., talkshows) are included.
Table 2: Out-of-domain data sources used for the selection experiments (BN: broadcast news, BC: broadcast conversation, NW: newswire, WB: web text, DF: discussion forum).
| LDC catalog ID | Corpus | Genre | Dialect | Size |
| LDC2005E83 | GALE-Y1Q1 | BN, BC, WB | MSA | 170k |
| LDC2006E39 | Tides MT05 Eval | NW | MSA | 135k |
| LDC2006E44 | Tides MT04 Eval | NW | MSA | 170k |
| LDC2007E101 | GALE-P3R1 | BC, NW, BN | MSA | 530k |
| LDC2007E87 | GALE-P2R3 | BC, BN, NW, WB | MSA | 188k |
| LDC2009E15 | GALE-P4R1v2 | WB, BC, BN, NW | MSA | 305k |
| LDC2009E16 | GALE-P4R2 | BC, BN, NW, WB | MSA | 273k |
| LDC2009E95 | GALE-P4R3v1.2 | BC, BN, NW, WB | MSA | 147k |
| LDC2010E38 | GALE-P3 Treebank | BC, NW | MSA | 349k |
| LDC2010T17 | NIST-OpenMT-2006 | NW, BC, BN, WB | MSA | 141k |
| LDC2012E19 | BOLT-P1-R2 MT Training Data | DF | Egyptian | 126k |
| LDC2012E51 | BOLT-P1 ARZ word alignments | DF | Egyptian | 55k |
| LDC2012T09 | Web translations | Various | Egyptian, Levantine | 1.613M |
5 Experiments and Results
We use two different MT systems for translation from IA to English, an in-house system based on Moses and the SRI MT system developed for the DARPA BOLT (Broad Operational Language Translation) spoken dialog translation project (see [Ayan et al., 2013, Kirchhoff et al., 2015] for more details). The former is a flat phrase-based statistical MT system with a hierarchical lexicalized reordering model and a 6-gram language model trained on the target side of the Transtac training data. For preprocessing we use a statistical morphological segmenter developed in the BOLT project. The second system is similar in nature but has a hierarchical phrase-based translation model and utilizes sparse features (see [Zhao et al., 2014] for more information).
5.1 Initial evaluation of selection techniques
In an initial set of experiments we attempted to gauge the performance of the cross-entropy vs. the submodular selection technique by subselecting the Transtac training data. We chose 10-40% of the Transtac training set; the feature set was the set of all n-grams up to length 7 of the tune and dev sets. We investigated both translation directions, IA→English and English→IA. Table 3 shows the BLEU scores.
Table 3: BLEU scores for subselection of the Transtac training data.
| IA→EN | EN→IA |
Compared to using 100% of the training data, the same or even better performance can be obtained by using a subset of the data when the submodular subselection technique is used, even at small percentages of the training data. The cross-entropy method falls short of this performance, presumably due to the failure of this method to control for redundancy in the selected set.
5.2 Selection of out-of-domain training data
In order to integrate additional out-of-domain training data, we set a budget constraint of 100k words on the source side. The LDC corpora were preprocessed and morphologically segmented in the same manner as the Transtac data. The greedy algorithm was used in combination with Equation 3 to select parallel sentences from the corpora listed in Table 2 such that the resulting corpus contains at most 100k words on the source side. The selected data was then added to both the MT and LM training data. Table 4 shows the BLEU scores and position-independent word error rate (PER) for the in-house MT system that was used for development purposes. Note that the baseline results differ from those in Table 3 because the baseline MT system changed between experiments and was trained on different data set definitions and tokenization schemes. We again compared the cross-entropy against the submodular selection method. Improvements in the system are small; however, the submodular technique again shows slightly better results.
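Table 4: BLEU (%) and PER (%) of the in-house MT system with out-of-domain data selected by the cross-entropy and submodular methods.
| BLEU (%) | PER (%) |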
We subsequently used the data selected with the submodular method in the second MT system, viz. the evaluation system developed for a bilingual dialog system, and tested the system on additional in-domain data sets. BLEU scores (shown in Table 5) show slight improvements of up to 0.5 absolute. Note that the selected data set was very small, containing only 7k sentences. Larger sets (up to 20k) were tried but were not found to be useful.
Table 5: BLEU scores (%) of the second MT system on the in-domain test sets.
| + 7k data | 17.5 | 35.5 | 32.7 | 33.5 |
We analyzed the selected data as to its origin and found that the top three data sources were broadcast conversations from various GALE corpora (47%), the dialectal web corpus (35.7%), and the BOLT MT training data (9.9%).
5.3 Using translated speech data
In addition to the various parallel text corpora listed in Table 2 we also had access to an Iraqi Arabic Conversational Telephone Speech (CTS) corpus (LDC2006T16). This corpus includes speech transcriptions but no translations. Although the data matches the dialect of interest, it is not necessarily matched in topic or style. To obtain parallel data we translated the transcriptions of this corpus with our baseline IA→EN translation system. Those segments that were translated contiguously (i.e., without intervening out-of-vocabulary words) were extracted and added to the data from the corpora in Table 2. Data selection was then re-run. We found that in this experiment 80% of the selected data came from the CTS corpus; however, the translation performance did not improve (see Table 6). The likely reason is that translations were too noisy to be used as parallel data and introduced more confusability and irrelevant variation rather than contributing useful translations. The use of automatically translated speech data might be improved by selecting only the most confident translations according to the translation model scores.
Table 6: BLEU (%) and PER (%) when automatically translated CTS data is added to the selection pool.
| BLEU (%) | PER (%) |
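The extraction of contiguously translated segments in Section 5.3 can be pictured with a small sketch. The paper does not specify how segments were delimited, so this version simply assumes that a transcription is split at every token missing from the translation system's source vocabulary and that very short runs are discarded; the function name `contiguous_segments`, the `min_len` threshold, and the toy tokens are hypothetical.

```python
def contiguous_segments(tokens, vocab, min_len=2):
    """Return maximal runs of in-vocabulary tokens with at least min_len tokens.

    Out-of-vocabulary tokens act as segment boundaries and are dropped.
    """
    segments, current = [], []
    for tok in tokens:
        if tok in vocab:
            current.append(tok)
        else:
            if len(current) >= min_len:
                segments.append(current)
            current = []
    if len(current) >= min_len:  # flush the final run
        segments.append(current)
    return segments

vocab = {"a", "b", "c", "d", "e", "f"}
tokens = "a b X c d e Y f".split()
print(contiguous_segments(tokens, vocab))
```

Each extracted segment would then be paired with its machine translation and added to the selection pool; filtering the segments further by translation model score, as suggested in Section 5.3, would be a separate step.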
6 Conclusions
We have described data selection procedures for identifying Iraqi Arabic data resources in unrelated dialectal and/or MSA corpora. We have demonstrated that judiciously selected data can improve MT performance even when the overall amount is very small. Furthermore, we have compared two different data selection techniques, the widely-used cross-entropy selection method, and a more recently developed method that relies on submodular function optimization. The latter performed slightly better than the former. Finally, we have conducted initial experiments on utilizing automatically translated conversational speech as additional training data. Whereas the data was strongly matched to the in-domain data on the source side, the translations were too noisy to yield any further improvement in machine translation performance.
This study was funded by the Defense Advanced Research Projects Agency (DARPA) under contract HR0011-12-C-0016 - subcontract 19-000234 and by the Intelligence Advanced Research Projects Activity (IARPA) under agreement number FA8650-12-2-7263. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of Intelligence Advanced Research Projects Activity (IARPA) or the U.S. Government.
- [Al-Sabbagh and Girju, 2012] Al-Sabbagh, R. and R. Girju. 2012. A supervised POS tagger for written Arabic social networking corpora. In Proceedings of KONVENS.
- [Alorifi, 2008] Alorifi, F.S. 2008. Automatic Identification of Arabic Dialects Using Hidden Markov Models. Ph.D. thesis, Swanson School of Engineering, University of Pittsburgh.
- [Aminian et al., 2014] Aminian, M., M. Ghoneim, and M. Diab. 2014. Handling OOV words in dialectal Arabic to English machine translation. In EMNLP Workshop on Language Technology for Closely Related Languages and Language Variants (LT4CloseLang), pages 99–108.
- [Axelrod et al., 2011] Axelrod, A., X. He, and J. Gao. 2011. Domain adaptation via pseudo in-domain data selection. In Proceedings of EMNLP, pages 355–362, Edinburgh, Scotland.
- [Ayan et al., 2013] Ayan, N.F., et al. 2013. Can you give me another word for hyperbaric? - improving speech translation using targeted clarification questions. In Proceedings of ICASSP.
- [Chiang et al., 2006] Chiang, D., M. Diab, N. Habash, O. Rambow, and S. Shareef. 2006. Parsing Arabic dialects. In Proceedings of EACL.
- [Duh and Kirchhoff, 2005] Duh, K. and K. Kirchhoff. 2005. POS tagging of dialectal Arabic: a minimally supervised approach. In Proceedings of the ACL Workshop on Computational Approaches to Semitic Languages, pages 55–62.
- [Duh and Kirchhoff, 2006] Duh, K. and K. Kirchhoff. 2006. Lexicon acquisition for dialectal Arabic using transductive learning. In Proceedings of EMNLP.
- [Duh et al., 2013] Duh, K., G. Neubig, K. Sudoh, and H. Tsukada. 2013. Adaptation data selection using neural language models: Experiments in machine translation. In Proceedings of ACL.
- [Edmonds, 1970] Edmonds, J., 1970. Combinatorial Structures and their Applications, chapter Submodular functions, matroids and certain polyhedra, pages 69–87. Gordon and Breach.
- [Feige, 1998] Feige, U. 1998. A threshold of ln n for approximating set cover. Journal of the ACM (JACM), 45(4):634–652.
- [Fujishige, 2005] Fujishige, S. 2005. Submodular functions and optimization. Annals of Discrete Mathematics, volume 58. Elsevier Science.
- [Habash et al., 2005] Habash, N., O. Rambow, and G. Kiraz. 2005. Morphological analysis and generation for Arabic dialects. In Proceedings of the ACL Workshop on Computational Approaches to Semitic Languages.
- [Habash et al., 2012] Habash, N., R. Eskander, and A. Hawwari. 2012. A morphological analyzer for Egyptian Arabic. In Proceedings of the Twelfth Meeting of the Special Interest Group on Computational Morphology and Phonology (SIGMORPHON2012), pages 1–9.
- [Habash et al., 2013] Habash, N., R. Roth, O. Rambow, R. Eskander, and N. Tomeh. 2013. Morphological analysis and disambiguation for dialectal Arabic. In Proceedings of NAACL, pages 426–432.
- [Habash, 2010] Habash, N. 2010. Introduction to Arabic natural language processing. Synthesis Lectures on Human Language Technologies, 3:1–187.
- [Jegelka and Bilmes, 2011] Jegelka, S. and J.A. Bilmes. 2011. Submodularity beyond submodular energies: coupling edges in graph cuts. In Computer Vision and Pattern Recognition (CVPR), Colorado Springs, CO, June.
- [Kirchhoff and Bilmes, 2014] Kirchhoff, K. and J. Bilmes. 2014. Submodularity for data selection in statistical machine translation. In Proceedings of EMNLP, pages 131–141.
- [Kirchhoff et al., 2015] Kirchhoff, K., Y.C. Tam, C. Richey, and W. Wang. 2015. Morphological modeling for machine translation of English-Iraqi Arabic spoken dialogs. In Proceedings of NAACL.
- [Krause and Guestrin, 2011] Krause, A. and C. Guestrin. 2011. Submodularity and its applications in optimized information gathering. ACM Transactions on Intelligent Systems and Technology, 2(4).
- [Krause et al., 2008] Krause, A., H.B. McMahan, C. Guestrin, and A. Gupta. 2008. Robust submodular observation selection. Journal of Machine Learning Research, 9:2761–2801.
- [Lin and Bilmes, 2012] Lin, H. and J. Bilmes. 2012. Learning mixtures of submodular shells with application to document summarization. In Uncertainty in Artificial Intelligence (UAI), Catalina Island, USA, July. AUAI.
- [Maamouri et al., 2006] Maamouri, M., A. Bies, T. Buckwalter, M. Diab, N. Habash, O. Rambow, and D. Tabessi. 2006. Developing and using a pilot dialectal Arabic treebank. In Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC).
- [Mediani et al., 2014] Mediani, M., J. Winebarger, and A. Waibel. 2014. Improving in-domain data selection for small in-domain sets. In Proceedings of IWSLT.
- [Moore and Lewis, 2010] Moore, R. and W. Lewis. 2010. Intelligent selection of language model training data. In Proceedings of EMNLP.
- [Narasimhan and Bilmes, 2007] Narasimhan, M. and J. Bilmes. 2007. Local search for balanced submodular clustering. In Proceedings of IJCAI.
- [Nemhauser et al., 1978] Nemhauser, G.L., L.A. Wolsey, and M.L. Fisher. 1978. An analysis of approximations for maximizing submodular set functions I. Mathematical Programming, 14:265–294.
- [Sadat et al., 2014] Sadat, F., F. Kazemi, and A. Farzindar. 2014. Automatic identification of Arabic language varieties and dialects in social media. In Proceedings of the Second Workshop on Natural Language Processing for Social Media (SocialNLP), pages 22–27.
- [Salloum and Habash, 2011] Salloum, W. and N. Habash. 2011. Dialectal to standard Arabic paraphrasing to improve Arabic-English statistical machine translation. In Proceedings of the First Workshop on Algorithms and Resources for Modelling of Dialects and Language Varieties, pages 10–21.
- [Salloum and Habash, 2013] Salloum, W. and N. Habash. 2013. Dialectal Arabic to English machine translation: Pivoting through modern standard Arabic. In Proceedings of NAACL.
- [Wei et al., 2013] Wei, K., Y. Liu, K. Kirchhoff, and J. Bilmes. 2013. Using document summarization techniques for speech data subset selection. In Proceedings of NAACL.
- [Zaidan and Callison-Burch, 2014] Zaidan, O. and C. Callison-Burch. 2014. Arabic dialect identification. Computational Linguistics, 40:171–202.
- [Zbib et al., 2012] Zbib, R., E. Malchiodi, J. Devlin, D. Stallard, S. Matsoukas, R. Schwartz, J. Makhoul, O.F. Zaidan, and C. Callison-Burch. 2012. Machine translation of Arabic dialects. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 49–59.
- [Zhao et al., 2014] Zhao, B., Y.-C. Tam, and J. Zheng. 2014. An autoencoder with bilingual sparse features for improved statistical machine translation. In Proceedings of ICASSP.