Aspects of Terminological and Named Entity Knowledge within Rule-Based Machine Translation Models for Under-Resourced Neural Machine Translation Scenarios


Abstract

Rule-based machine translation is a machine translation paradigm where linguistic knowledge is encoded by an expert in the form of rules that translate text from the source to the target language. While this approach grants extensive control over the output of the system, the cost of formalising the needed linguistic knowledge is much higher than that of training a corpus-based system, where a machine learning approach is used to automatically learn to translate from examples. In this paper, we describe different approaches to leverage the information contained in rule-based machine translation systems to improve a corpus-based one, namely, a neural machine translation model, with a focus on a low-resource scenario. Three different kinds of information were used: morphological information, named entities and terminology. In addition to evaluating the general performance of the system, we systematically analysed the performance of the proposed approaches when dealing with the targeted phenomena. Our results suggest that the proposed models have limited ability to learn from external information, and most approaches do not significantly alter the results of the automatic evaluation, but our preliminary qualitative evaluation shows that in certain cases the hypotheses generated by our system exhibit favourable behaviour, such as preserving the use of the passive voice.


1 Introduction

In rule-based machine translation (RBMT), a linguist formalises linguistic knowledge into lexicons and grammar rules, which are used by the system to analyse sentences in the source language and translate them. While this approach does not require any parallel corpora for training and grants control over the translations created by the system, the process of encoding linguistic knowledge requires a great amount of expert time. Notable examples of RBMT systems are the original, rule-based Systran [toma1977systran], Lucy LT [alonso2003comprendium] and the Apertium platform [forcada2011apertium].

Instead, corpus-based machine translation (MT) systems learn to translate from examples, usually in the form of sentence-level aligned corpora. On the one hand, this approach is generally computationally more expensive and offers limited control over the generated translations. Furthermore, it is not feasible for language pairs that have limited to no available parallel resources. On the other hand, if parallel resources are available, it boasts a much higher coverage of the targeted language pair. Examples of corpus-based MT paradigms are phrase-based statistical machine translation (PBSMT) [koehn2003statistical] and neural machine translation (NMT) [DBLP:journals/corr/BahdanauCB14].

In this work, we focused on leveraging RBMT knowledge for improving the performance of NMT systems in an under-resourced scenario. Namely, we used the information provided by Lucy LT, an RBMT system where the linguistic knowledge is formalised by human linguists as computational grammars and monolingual and bilingual lexicons. Grammars are collections of transformations to annotated trees. Monolingual lexicons are collections of lexical entries, where each lexical entry is a set of feature-value pairs containing morphological, syntactic and semantic information. Bilingual lexicon entries include source-target lexical correspondences and, optionally, contextual conditions and actions. The Lucy LT system divides the translation process into three sequential phases: analysis, transfer, and generation. During the analysis phase, the source sentence is morphologically analysed using a lexicon that identifies each surface form and all its plausible morphological readings. Next, the Lucy LT chart parser together with an analysis grammar consisting of augmented syntactic rules extracts the underlying syntax tree structure and annotates it. The transfer and generation grammars are then applied in succession on that tree, which undergoes multiple annotations and transformations that add information about the equivalences in the target language and adapt the source language structures to the appropriate ones in the target language. Finally, the terminal nodes of the generation tree are assembled into the translated sentence. We focused on the analysis phase, with special interest in two of the features used: the morphological category (CAT) and the inflexion class (CL) or classes of the lexical entries.

Additionally, we focused on two language phenomena that are easily addressable when using RBMT but present a challenge when using corpus-based MT: named entities and terminological expressions.

A named entity (NE) is a word or sequence of words that unequivocally refers to a real-world object, such as a proper noun, toponym, number or date. In the context of MT, NEs present different challenges. For example, if an English sentence starts with the word Smith, we do not know a priori whether we are dealing with the name of a profession, which would have to be translated, or a proper noun, which may have to be left untranslated or transliterated to a different script. A second issue may arise when using subword units: while a word-level model may accidentally preserve an out-of-vocabulary NE, a subword-level model will generate a (most likely nonsensical) translation for it. NEs are one of the main classes of out-of-vocabulary words, and often cause translation problems that seriously affect the meaning of the sentence [li2018neural].

Similarly, a terminological expression can consist of a single word or a sequence of words that may have a different meaning depending on the context or domain in which they appear. Hence, the translation of a term might differ depending on the context or domain. Moreover, different contexts and domains may impose additional restrictions on the language used, such as different modes or the use of active or passive voice, and the presence of particular terminology may suggest that a translation is not acceptable even if the meaning of the source sentence is preserved. Accurate terminology translation is crucial to produce adequate translations [arcan2017leveraging].

In this work we extend and further analyse the morphological information injection technique that we proposed in a previous work [torregrosa_mtsummit2019], and we propose an approach to NEs and terminology that does not rely on any particular technology and can be applied to any MT approach, using any kind of resource to detect and translate the NEs and terminological expressions. To test our proposed approach, we focused on the English-Spanish (both generic and medical domain), English-Basque, English-Irish and English-Simplified Chinese language pairs in an under-resourced scenario, using corpora with around one million parallel entries per language pair and domain. Additional test sets that contain numerous examples of terms, NEs and rich morphology were also selected and used to further explore the performance of the proposed approaches. Results suggest that, while not statistically significantly different from the baseline in several scenarios, the proposed approaches exhibit appropriate behaviours, such as preserving the passive voice characteristic of some domains.

2 Related Work

In this section, we present the existing work on incorporating linguistic, terminological and NE information into NMT systems.

2.1 Use of Linguistic Knowledge

Several approaches have been proposed to incorporate linguistic knowledge into MT models in order to improve translation quality. One approach is to include the knowledge as features or extra tokens for the model. For example, morphological features, part of speech (POS) tags and syntactic dependency labels [sennrich-haddow:2016:WMT] were shown to improve translation quality when translating between English and German and from English to Romanian. A different approach used interleaved CCG supertags within the target word sequence [nadejde2017], comparing favourably to multi-task learning when translating from German and Romanian to English. Information can also be added to the target side by replacing it with a linearised and lexicalised constituency tree [aharoni2017], which showed improved word reordering when translating from German, Czech and Russian to English in both automatic and small-scale human evaluation.

A second approach is to modify the architecture of the recurrent neural network to capture linguistic knowledge. The encoder of the NMT ensemble was replaced with a graph convolutional network, which places no rigid constraints on the structure of the sentence [bastings-etal-2017-graph], showing improvements when using syntactic dependency trees for the source language when translating from English to German and Czech. An alternative approach modified the encoder to process tree-based syntactic representations of the source language, and the attention mechanism to be able to attend to both sentences and phrases [eriguchi2016], which improved results for English to Japanese translation.

A different approach is to use multi-task learning to improve translation quality by adding information from similar tasks, such as POS tagging. For example, two decoders were used to predict lemmas and factors (POS, gender, number, tense, person) independently [GarcaMartnez2016FactoredNM] when translating from English to French, which led to increased vocabulary coverage. Another approach jointly generated the translation of the sentence, tagged the POS of the source sentence, and recognised NEs in the source language [niehues-cho-2017-exploiting]. Different architectures that shared encoders, attention mechanisms and even decoders were used, showing improvements in all individual tasks when translating from German to English.

Finally, different subword unit strategies have been tested. Generating compositional representations of the input words by using an auxiliary recurrent neural network [ataman2018compositional] showed improved results compared to systems using byte-pair encoding when translating from morphologically rich languages (Arabic, Czech, German, Italian and Turkish) to English. Another alternative used morpheme-based segmentation [banerjee-bhattacharyya-2018-meaningless], which compared favourably to byte-pair encoding when translating English to Hindi, English to Bengali and Bengali to Hindi; what is more, a combination of both strategies showed even better results. Other representations, such as linguistically motivated or frequency-based word segmentation methods [etchegoyhen2018], were also explored when using NMT, RBMT and PBSMT.

It has also been investigated whether the encoder of NMT models learns syntactic information from the source sentence [shi-etal-2016-string] when performing three different tasks: translating from English to French and from English to German, generating a linearised constituency tree from English, and auto-encoding from English to permuted English. The authors found that different types of syntactic information are captured in different layers.

2.2 Terminology and Named Entities

Several strategies have been tested for dealing with NE translation. For example, identifying NEs before translating and replacing the tokens with special tags or translating the NE using an external translation model [yan2018impact]; this model showed performance improvements over the baseline model when translating sentences with person names from Simplified Chinese to English. A different approach used alignment information to align source and target language NEs before translating [li2018neural]; as using information from both sides can help improve NE tagging, the model showed improvements over the baseline when translating from Simplified Chinese to English. Addressing multi-word NEs by using additional features to indicate where each NE starts and ends was also investigated [ugawa2018neural], which showed improvements when translating from English to Japanese, Romanian and Bulgarian.

Similarly, terminology translation has been approached in different ways. The use of a cache-based model within PBSMT capable of combining a static phrase table and language model with smaller, dynamically loaded extensions [arcan2014enhancing, arcan2017leveraging] compared favourably both to the baseline model and to an XML-based markup mode that enables enforcing the translation of some tokens in the sentence (i.e. enforcing a particular translation for a term) when translating between English and Italian and between English and German. A mechanism named guided NMT decoding [chatterjee2017guiding], similar in concept to the XML-based markup for PBSMT, was also tested, comparing favourably to baseline models, both in English to German translation and in automatic post-editing. This model was only able to guide the decoder, but not to enforce the restrictions; hence, a multi-stack approach using finite-state acceptors to enforce the constraints was proposed [hasler2018neural], showing improved results when translating from English to German and to Simplified Chinese in scenarios using gold tokens and phrases (present in the reference but not produced by the baseline system) or dictionaries; this kind of information would typically be present in translation memories and glossaries provided by a customer. Finally, an approach that encodes the information contained in knowledge graphs, i.e. terminological expressions and NEs, as embeddings that are then concatenated to the word embeddings was tested [moussallem2019augmenting], showing improved results for English to German translation. Additionally, the performance of SMT and NMT has been explored when translating terminology without context, both using baseline and domain-adapted models [arcan2019translating], showing that BPE-based NMT models benefit the most from domain adaptation.

2.3 Data Selection

Finally, data selection has been used to improve the performance of the trained models, reduce the computational cost of training, or both [Rousseau13, chen2016bilingual]. However, to the best of our knowledge, applying data selection to build targeted test sets that frequently exhibit the studied phenomena, in order to gain deeper insight into the performance of the model, has not been previously explored in the literature.

3 Methodology

In this section, we describe the methodology to leverage rule-based machine translation (RBMT) information in neural machine translation (NMT).

3.1 Information Acquisition From RBMT

    ("snake" NST ALO "snake"
             CL (P-S S-01) KN CNT ON CO
             SX (N) TYN (ANI))
    ("snake" VST ALO "snak"
             ARGS (((ADV DIR)))
             CL (G-ING I-E P-ED PA-ED PR-ES1)
             ON CO PLC (NF))

Figure 1: The word snake as a noun (NST) and a verb (VST) in Lucy LT dictionaries. Each entry is composed of a canonical form, the category (POS), and a list of key-value features, such as the inflexion class (CL), the vocalic onset (ON), etc.
    [S [$ $]
       [CLS [NP [NO [PRN I]]]
            [PRED [VB [VST own]]]
            [NP [DETP [DET the]] [NO [NST house]]]
            [PP [PREPP [PREP down]]
                [NP [DETP [DET the]] [NO [NST street]]]]]
       [$ $]]

Figure 2: Example of the parse tree for the English sentence I own the house down the street.

Lucy LT monolingual lexicons are language-pair independent (i.e. the same English knowledge is used for all translation pairs including English as a source or target language) and mainly encode morphological and contextual information. Each entry has a word or multi-word expression (MWE) along with several features, such as the part of speech (POS) and morphological features. The bilingual lexicons mainly encode word-to-word or MWE-to-MWE translations and describe which target language word should replace each source language word. Still, the direct usage of the lexicon entries as a source of information presented a challenge, as there is no means to disambiguate ambiguous surface words. For example, in English, most nouns will also be classified as verbs, as they share the same surface form; e.g. the word snake can be both a noun and a verb (Figure 1). To address this problem, we compare two different approaches: using ambiguity classes that describe all the possible analyses for a given surface word, and using external information (in the form of a monolingual POS tagger) to disambiguate ambiguous POS classes. For the former approach, we used a unique tag for the concatenation of all possible category (CAT) and class (CL) values. In the previous example, snake is both a noun (NST) and a verb (VST) (Figure 1), so the value for the CAT feature would be NST_VST. For the latter, we used the Stanford POS tagger [toutanova2003feature], which uses the Penn Treebank [Marcus:1994:PTA:1075812.1075835] tag set for English and the AnCora [civit2004building] tag set for Spanish. The IXA pipeline POS tagger [AGERRI14.775] with the Universal Dependencies POS tag set [11234/1-2895] was used for Basque. All POS tag sets were mapped to the tag set used by Lucy LT. If the tagger-provided POS tag was equivalent to one or more Lucy LT tags, the non-matching Lucy LT tags were removed; otherwise, we kept the full set of tags. E.g. if the POS tagger emits noun as the most likely tag, then only NST and the concatenation of all the inflexion classes for the corresponding entry would be used as additional information. As a comparison, we also evaluated NMT models trained with Stanford or IXA POS tags as additional information.
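The ambiguity-class construction and its POS-based disambiguation can be sketched as follows. The toy lexicon and the Penn-Treebank-to-Lucy-LT mapping are our own illustrations (the actual resources are much larger), but the entry features (CAT, CL) and the NST_VST tag follow the example above.

```python
# Sketch of ambiguity-class tagging: concatenate all possible Lucy LT
# categories of a surface form, or keep only those compatible with an
# external POS tag. Lexicon and tag mapping below are toy examples.

def ambiguity_class(entries):
    """Concatenate all CAT values of a surface form into one tag, e.g. NST_VST."""
    cats = sorted({e["CAT"] for e in entries})
    return "_".join(cats)

def disambiguate(entries, external_pos, mapping):
    """Keep only the Lucy LT categories compatible with an external POS tag;
    if none match, fall back to the full ambiguity class."""
    allowed = mapping.get(external_pos, set())
    kept = [e for e in entries if e["CAT"] in allowed]
    return ambiguity_class(kept) if kept else ambiguity_class(entries)

# Toy lexicon: "snake" is both a noun (NST) and a verb (VST), as in Figure 1.
lexicon = {
    "snake": [
        {"CAT": "NST", "CL": ["P-S", "S-01"]},
        {"CAT": "VST", "CL": ["G-ING", "I-E", "P-ED", "PA-ED", "PR-ES1"]},
    ]
}

# Hypothetical mapping from Penn Treebank tags to Lucy LT categories.
ptb_to_lucy = {"NN": {"NST"}, "VB": {"VST"}}

print(ambiguity_class(lexicon["snake"]))                  # NST_VST
print(disambiguate(lexicon["snake"], "NN", ptb_to_lucy))  # NST
```

When the external tagger proposes a tag with no Lucy LT equivalent (e.g. an adjective reading), the sketch falls back to the full ambiguity class, mirroring the "otherwise, we kept the full set of tags" rule above.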

3.2 Leveraging Syntactic Tree Information

In addition to the direct use of the linguistic knowledge from the lexicon entries, the grammars (along with the monolingual and bilingual lexicons) were indirectly used by exploring the results of each internal intermediate stage of the translation process, which Lucy LT expresses as annotated trees. For example, the sentence I own the house down the street is parsed as shown in Figure 2 and encoded as a linearised version of that parse tree. We use this representation as the source text when training the NMT models, as sequence-to-sequence deep neural network models do not generally accept hierarchical information. We also used an additional feature: the linguistic phrase the word belongs to. This information is present in the grandparent of each node; e.g. in Figure 2 the noun house appears in a noun phrase (NP).
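The grandparent-based phrase feature can be illustrated with a small sketch. The nested-tuple encoding of the parse tree is our own simplification of Lucy LT's annotated trees, and the fragment below covers only part of the sentence in Figure 2.

```python
# For each word, record its POS tag and the label of the grandparent of
# its pre-terminal node, i.e. the linguistic phrase the word belongs to.

def terminal_phrases(tree, parent=None, grandparent=None):
    label, rest = tree
    if isinstance(rest, str):            # pre-terminal node: (POS, word)
        return [(rest, label, grandparent)]
    pairs = []
    for child in rest:
        pairs.extend(terminal_phrases(child, label, parent))
    return pairs

# Fragment of Figure 2: the noun phrase "the house".
np_tree = ("NP", [
    ("DETP", [("DET", "the")]),
    ("NO", [("NST", "house")]),
])

print(terminal_phrases(np_tree))
# [('the', 'DET', 'NP'), ('house', 'NST', 'NP')]
```

As in the example above, the noun house (pre-terminal NST, under NO) receives the phrase feature NP.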

3.3 Named Entities and Terminology

One of the main features of RBMT is that the linguist who is encoding the knowledge usually has full control over the output, letting the user define entries with more complex contexts that ensure that a certain possible translation case is covered. Conversely, corpus-based MT does not offer this feature: while the sentences used to train the system will have an impact on the words used when translating, it is not readily possible to enforce lexical selections. We devised two different strategies to address this situation.

The first strategy involved tagging each token with a feature that contains information about the kind of NE the token belongs to, if any. Two different tag sets were used: a binary tag indicating whether the token is part of a NE, and a collection of tags with the actual category of the NE, according to the CoreNLP classes. The second strategy involved replacing NEs with a special token. As in the previous approach, we replaced each NE either with a generic token (similar to the binary tag) or with a special token representing the category of the NE. We used the same approach for terms, but as we only target the medical domain, there is no difference between using binary tags and the actual classes.

For example, for tagging medical (MED) terminology, consider the sentence:

    He should discuss it with his cardiologist.

In the first approach, every word is tagged with its domain as a feature:

    He|GEN should|GEN discuss|GEN it|GEN with|GEN his|GEN cardiologist|MED .|GEN

In the second approach, NEs or terms are replaced with the kind of NE or the domain of the term:

    He should discuss it with his MED.

When dealing with NEs instead, cardiologist would be tagged as NE when using binary tags and as TITLE when using CoreNLP classes.

Two different approaches were taken when detecting which tokens are part of a NE or term. During training, we detected NEs in the source side using CoreNLP, aligned source and target using eflomal [Ostling2016efmaral], and replaced the source tokens and the corresponding aligned target tokens with the source tag. In the case of terminology, we used the information contained in Lucy LT bilingual lexicons. When translating, we detected NEs and terms in the source language using CoreNLP or Lucy LT information and replaced or tagged them with the corresponding labels.
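The training-time replacement step can be sketched as follows. Here the term span and the word alignments are given directly, standing in for the output of CoreNLP/Lucy LT and eflomal respectively; the sentences are the medical example above.

```python
# Replace source tokens inside detected NE/term spans, and the target
# tokens aligned to them, with the same placeholder on both sides.

def replace_entities(src, tgt, spans, alignments):
    """spans: list of (start, end, TAG) over src; alignments: (src, tgt) index pairs."""
    src, tgt = list(src), list(tgt)
    for start, end, tag in spans:
        covered = {j for i, j in alignments if start <= i < end}
        for i in range(start, end):
            src[i] = tag
        for j in covered:
            tgt[j] = tag

    def collapse(tokens):
        # Collapse runs of identical placeholders (multi-word NEs) into one token.
        out = []
        for t in tokens:
            if not (out and t == out[-1] and t.isupper()):
                out.append(t)
        return out

    return collapse(src), collapse(tgt)

src = "he should discuss it with his cardiologist".split()
tgt = "debería discutirlo con su cardiólogo".split()
spans = [(6, 7, "MED")]                                  # "cardiologist"
alignments = [(0, 0), (1, 0), (2, 1), (4, 2), (5, 3), (6, 4)]
print(replace_entities(src, tgt, spans, alignments))
```

The same detection and replacement (or tagging) is applied to the source side alone at translation time, so training and inference see consistent placeholders.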

We used the same pre-processing both when training the NMT models and when translating. After translating, each tag generated in the hypothesis sentence is aligned to the most likely tag in the source sentence using the soft-alignment produced by the attention mechanism, and replaced with the actual translation of the NE or terminological expression.
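The post-processing step can be sketched as follows. The attention matrix and the translation lexicon are toy values for illustration; in the actual pipeline the soft-alignments come from the NMT attention mechanism and the translations from the Lucy LT lexicons or Google Translate.

```python
# Align each placeholder in the hypothesis to the most-attended source
# placeholder, then substitute the external translation of that entity.

def restore_placeholders(hyp, attention, slot_entities, translations, tag="MED"):
    """attention[i][j]: weight of source token j when emitting hyp token i;
    slot_entities: source position -> original entity surface form."""
    slots = list(slot_entities)
    out = []
    for i, word in enumerate(hyp):
        if word == tag and slots:
            j = max(slots, key=lambda j: attention[i][j])
            out.append(translations[slot_entities[j]])
        else:
            out.append(word)
    return out

hyp = "debería discutirlo con su MED".split()
slot_entities = {6: "cardiologist"}             # position of MED in the source
translations = {"cardiologist": "cardiólogo"}   # e.g. from the Lucy LT lexicon
attention = [[0.0] * 7 for _ in hyp]
attention[4][6] = 0.9                           # hyp "MED" attends to source slot 6
print(restore_placeholders(hyp, attention, slot_entities, translations))
# ['debería', 'discutirlo', 'con', 'su', 'cardiólogo']
```

Selecting the most-attended source slot is what lets sentences with several placeholders of the same kind be restored to the right entities.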

To obtain the actual translation for NEs and terms, we used the Lucy LT lexicons, selecting the entry corresponding to the targeted domain in the case of terminology translation. As a comparison, we used Google Translate to generate translations for the NEs and terms; while unfair due to the lack of context and the lack of an option to select a specific domain for the translation, it can be used as a baseline for our method. Additionally, the OpenNMT feature that lets the user include a phrase table to replace unknown tokens was used; as it can only handle one-token phrases, the dictionary extracted from Lucy LT was aligned using eflomal and the most likely alignment for each word was used as a phrase.

In some cases, a sentence may have two or more NEs or terms of the same kind. When using the replacement strategy, it is possible that the model will not learn how to properly align the tags with the correct source words. For this reason, we have also tested a system where sentences that have NEs or terms have been duplicated and kept intact, i.e. as if the NEs or terms were generic words instead. This approach slightly increases the size of the training set, but does not add new information to the training approach.

Finally, in the case of terminology translation, we compared the performance of our approach against back-translation, a commonly taken approach in scenarios where domain adaptation is needed [sennrich2015improving].

On the one hand, this approach is completely independent of the MT implementation and of the resource used to detect and/or obtain translations for NEs and terminological expressions, hence being applicable in many scenarios. On the other hand, some MT implementations (namely, most corpus-based ones) will not guarantee that the generated hypothesis contains all the tags present in the source sentence, which can lead to lower translation adequacy.

3.4 Focused Evaluation

In this work, we targeted several phenomena that appear when translating a sentence, namely morphology, NEs and terminology. While the quantitative or qualitative evaluation of the models using the same test sets is necessary, it might not properly capture the improvements to the targeted phenomena. For this reason, we proposed an additional evaluation focused on the targeted phenomena, using specially selected corpora. In the case of morphology, we selected a corpus that contains the Spanish verbs tener (to have), poder (can/may) and decir (to say) in different surface forms. In the case of NEs and terminology, we selected sentences that contain NEs or terminology according to CoreNLP or Lucy LT, respectively. None of these sentences appear in any training or development set.
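The selection of such a focused test set can be sketched as a simple filter. The small set of surface forms below stands in for the full paradigms of tener, poder and decir, and the exclusion set stands in for the training and development data.

```python
# Keep parallel sentences whose Spanish side contains a targeted verb form
# and that do not occur in the training/development data.

TARGET_FORMS = {"tengo", "tiene", "tuvo", "puede", "pudo", "dice", "dijo"}

def select_focused(pairs, training_targets):
    """pairs: (source, target) sentence pairs; training_targets: target sides
    already used for training or development, which must be excluded."""
    selected = []
    for src, tgt in pairs:
        tokens = set(tgt.lower().split())
        if tokens & TARGET_FORMS and tgt not in training_targets:
            selected.append((src, tgt))
    return selected

pairs = [("he has a car", "él tiene un coche"),
         ("the sky is blue", "el cielo es azul")]
print(select_focused(pairs, set()))
# [('he has a car', 'él tiene un coche')]
```

The same filter shape applies to the NE and terminology test sets, with detection by CoreNLP or Lucy LT taking the place of the surface-form lookup.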

4 Experimental Setting

In this section, we describe the resources we used to train and evaluate the systems, along with the NMT framework used.

4.1 Training and Evaluation Datasets

Table 1: Statistics on the used train (t), validation (v) and evaluation (e) datasets for English-Spanish (generic), English-Spanish (EMEA), English-Basque, English-Irish and English-Simplified Chinese. English-Spanish (EMEA) is a subset of the whole EMEA corpus.
Table 2: Statistics on the used train (t), validation (v) and evaluation (e) English-Spanish datasets for the models focused on terminology translation. EMEA contains the statistics for the subset of the EMEA corpus that was back-translated (BT), and each BT row contains the statistics for the corresponding round of back-translation with word-based or BPE-based models. The same validation set was used for all BT models. Baseline+ duplicates those lines with detected medical terms.
Table 3: Statistics on the named entities, terminology and morphology focused evaluation (e) English-Spanish datasets.

In this work, besides studying the impact of leveraging RBMT knowledge into NMT systems, we further focused on NMT for under-resourced scenarios. On the one hand, we consider languages, such as Basque or Irish, which do not have the significant amount of parallel data necessary to train a neural model. On the other hand, an under-resourced scenario can also be a specific domain, e.g. medical, where a significant amount of data exists but does not cover the targeted domain. Table 1 shows the statistics of the used datasets.

For English-Basque and English-Irish, we used the available corpora stored on the OPUS webpage. We used the OpenSubtitles2018 [LISON16.947], Gnome and KDE4 datasets [Tiedemann:2012]. Additionally, the English-Irish parallel corpus was augmented with second level education textbooks (Cuimhne na dTéacsleabhar) in the domains of economics and geography [ARCAN16.9].

In addition to that, we also focused on well resourced languages (Spanish and Simplified Chinese), but limited the training datasets to around one million aligned sentences. To ensure a broad lexical and domain coverage of our NMT system, we merged the existing English-Spanish parallel corpora from the OPUS web page into one parallel data set and randomly extracted the sentences. In addition to the previous corpora, we added Europarl [Koehn:2005], DGT [DBLP:journals/lre/SteinbergerEPCSPG14], MultiUN corpus [MultiUN], EMEA and OpenOffice [Tiedemann2009]. Sentences extracted from the rest of the corpus were used for the targeted evaluation. To evaluate the targeted under-resourced scenario within medical domain and terminology translation, we exclusively used the EMEA corpus.

For Simplified Chinese, we used a parallel corpus provided by the industry partner, which was collected from bilingual English-Simplified Chinese news portals. The corpora were tokenised using the OpenNMT toolkit and lowercased, with the exception of Simplified Chinese, which was tokenised using Jieba.

Some experiments used or generated additional data; namely, those evaluating the different strategies on specific corpus exhibiting the studied feature. Statistics for those corpora are described in Table 2 and Table 3.

4.2 NMT Framework

We used OpenNMT [2017opennmt], a generic deep learning framework mainly specialised in sequence-to-sequence models covering a variety of tasks such as machine translation, summarisation, speech processing and question answering, as our NMT framework. Due to computational complexity, the vocabulary of the NMT models had to be limited. To overcome this limitation, we used byte pair encoding (BPE) to generate subword units [journals/corr/SennrichHB15]. BPE is a form of data compression that iteratively replaces the most frequent pair of bytes in a sequence with a single, unused byte. We also added the different morphological and syntactic information as word features.
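The merge-learning step of BPE can be sketched as follows. This follows the standard algorithm of [journals/corr/SennrichHB15] applied to character symbols, not OpenNMT's actual implementation; the tiny vocabulary is illustrative.

```python
# Learn BPE merges: repeatedly merge the most frequent adjacent symbol pair.
from collections import Counter

def learn_bpe(vocab, num_merges):
    """vocab: word (tuple of symbols) -> frequency. Returns (vocab, merges)."""
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)   # most frequent adjacent pair
        merges.append(best)
        merged = {}
        for word, freq in vocab.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            merged[tuple(out)] = freq
        vocab = merged
    return vocab, merges

vocab, merges = learn_bpe({("l", "o", "w", "</w>"): 5,
                           ("l", "o", "w", "e", "r", "</w>"): 2}, 2)
print(merges)  # [('l', 'o'), ('lo', 'w')]
```

After the learned merges are applied, frequent words collapse into single units while rare words remain split into smaller subwords, keeping the vocabulary bounded.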

We used the following default neural network training parameters: two hidden layers, hidden LSTM (long short term memory) units per layer, input feeding enabled, epochs, batch size of , dropout probability, dynamic learning rate decay, dimension embeddings, unlimited different values for the word features and between and dimension embeddings for word features.6 For word models, we used a maximum vocabulary size of words. For subword models, we used or subwords, a maximum vocabulary size of and a maximum of unique BPE merge operations.

4.3 Evaluation Metrics

In order to evaluate the performance of the different systems, we used BLEU [papineni2002bleu], an automatic evaluation metric that boasts high correlation with human judgements; translation error rate (TER) [snover2006study], a metric that represents the cost of editing the output of the MT systems to match the reference; and chrF3 [popovic:2015:WMT], a character n-gram metric which shows very good correlation with human judgements on the WMT2015 shared metric task [stanojevic-EtAl:2015:WMT], especially when translating from English into morphologically rich(er) languages.

Additionally, we used bootstrap resampling [koehn2004statistical] with a sample size of and iterations, and reported statistical significance with . In addition, we compared the performance of our NMT systems with the NMT-based Google Translate, and with the translations produced by the Lucy LT RBMT system; for the latter, only English-Spanish and English-Basque models are available.
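The paired bootstrap resampling test of [koehn2004statistical] can be sketched as follows. The per-sentence scores, iteration count and seed below are illustrative placeholders, not the settings used in our experiments.

```python
# Paired bootstrap resampling: resample the test set with replacement and
# count how often system A outscores system B on the resampled set.
import random

def paired_bootstrap(scores_a, scores_b, iterations=1000, seed=0):
    """scores_*: per-sentence scores of two systems on the same test set.
    Returns the fraction of resamples in which system A beats system B."""
    rng = random.Random(seed)
    n, wins = len(scores_a), 0
    for _ in range(iterations):
        idx = [rng.randrange(n) for _ in range(n)]
        if sum(scores_a[i] for i in idx) > sum(scores_b[i] for i in idx):
            wins += 1
    return wins / iterations
```

If system A wins in at least 95% of the resamples, its improvement over system B is reported as statistically significant at the 5% level.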

5 Results

In this section, we describe the quantitative and qualitative evaluation of the different models: the NMT baseline (Baseline), baseline enhanced with ambiguous CAT and CL (CAT-CL), baseline with disambiguated CAT and CL (CAT-CL D), baseline with external POS tags (POS), baseline with indirect CAT, CL and syntactic information (CAT-CL L), the hierarchical model (Tree), Lucy LT (RBMT) and Google Translate (Google). Systems that are not shared between evaluations are described in the corresponding subsection.

5.1 Quantitative Results

In this section, we describe the quantitative evaluation of the different models.

General Evaluation

Table 4: Results (BLEU, chrF and TER) for the evaluation for English-Spanish and English-Basque, covering the Word and BPE variants of Baseline, CAT-CL, CAT-CL D, CAT-CL L, Tree and POS, plus RBMT and Google. Marked models are significantly better than the NMT BPE-based baseline. All BPE models are statistically significantly better than their word-based counterparts. All models are statistically significantly better than RBMT, and all models for English-Basque and Basque-English are statistically significantly better than Google Translate.
Table 5: Results (BLEU, chrF and TER) for the evaluation for English-Irish and English-Simplified Chinese, covering the Word and BPE variants of Baseline, CAT-CL, CAT-CL D and POS, plus Google. All BPE models for English-Chinese, Chinese-English and Irish-English are statistically significantly better than their word-based counterparts. No RBMT models are available for Irish and Simplified Chinese in Lucy LT, and all models for English-Irish and Irish-English are statistically significantly better than Google Translate.

The quantitative results of the evaluation are presented in Table 4 and Table 5. All the models tested significantly outperformed the RBMT system Lucy LT on both BLEU and TER. Even when trained with only around a million sentences, the NMT baseline models for English-Basque and English-Irish performed better than Google Translate on generic-domain corpora, and were not statistically significantly different for English-Simplified Chinese. Conversely, Google Translate was significantly better than the NMT baselines only for the English-Spanish generic domain, excluding English-Spanish TER. While some of the feature-enriched models obtained slightly better BLEU and TER scores than the baseline, no model obtained scores that are statistically significantly different from the baseline subword model. The tree model scored consistently lower than the rest, suggesting that the system could not cope with this complex representation given the amount of data available.
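The significance claims in this section are typically established with paired bootstrap resampling [Koehn, 2004]. The following minimal sketch (not the authors' evaluation code) illustrates the procedure on per-sentence metric scores; this is an approximation, since corpus-level metrics such as BLEU should be recomputed on each resampled test set rather than summed:

```python
import random

def paired_bootstrap(scores_a, scores_b, n_samples=1000, seed=0):
    """Paired bootstrap resampling: estimate how often system A beats
    system B when the test set is resampled with replacement.

    scores_a / scores_b: per-sentence metric scores for the same test
    sentences under the two systems (higher = better)."""
    assert len(scores_a) == len(scores_b)
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(n_samples):
        idx = [rng.randrange(n) for _ in range(n)]
        if sum(scores_a[i] for i in idx) > sum(scores_b[i] for i in idx):
            wins += 1
    # A is typically declared significantly better if this ratio >= 0.95
    return wins / n_samples
```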

Evaluation Focused on Morphological Information

[Numeric scores were lost in this copy of the paper. Rows: word- and BPE-level Baseline, CAT-CL, CAT-CL D, CAT-CL L, Tree and POS models, plus RBMT and Google; columns: BLEU, chrF and TER for English→Spanish and Spanish→English.]
Table 6: Results for the generic models when tested with the morphology-focused test set. Models marked with are significantly better than the NMT Baseline model. All BPE models are statistically significantly better than their word-based counterparts, and all models are statistically significantly better than RBMT.

The results presented in Table 6 show that the added morphological information has a greater impact when using the BPE model, especially when translating from English to Spanish, that is, from a language with less morphological information to one with more. Also, in this scenario the tree-based model vastly outperforms the baseline when using subword units; the structural information, along with the morphological information, helps the system make better decisions when translating this corpus, which has a high density of verbs.
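The morphological information in the CAT-CL and POS models enters the network as word-level input features attached to each source token. A minimal sketch of how such annotations can be serialized, for both word- and BPE-level inputs (the function names and the '|' separator are illustrative only; OpenNMT uses its own feature separator character):

```python
def attach_features(tokens, features, sep="|"):
    """Serialize per-word annotations (e.g. Lucy LT CAT/CL tags or
    CoreNLP POS tags) as word-level input features, one per token."""
    assert len(tokens) == len(features)
    return " ".join(f"{t}{sep}{f}" for t, f in zip(tokens, features))

def propagate_to_subwords(subwords, word_features, sep="|", joiner="@@"):
    """For BPE-level models, copy each word's feature onto every one of
    its subword units ('@@' marks a non-final BPE piece)."""
    out, w = [], 0
    for piece in subwords:
        out.append(f"{piece}{sep}{word_features[w]}")
        if not piece.endswith(joiner):  # last piece of the current word
            w += 1
    return " ".join(out)
```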

Evaluation Focused on Named Entities

[Numeric scores were lost in this copy of the paper. Rows: word- and BPE-level Baseline, F, PL and PG models in Binary and NE variants, plus RBMT and Google; columns: BLEU, chrF and TER for English→Spanish and Spanish→English.]
Table 7: Results for the NE focused models when tested with the generic test set. F refers to the models trained with the NE tag as a word feature, and PL and PG to the models trained to replace NEs with the corresponding token and translating the contents with Lucy LT and Google Translate respectively. Binary models only classify words as NE or not NE, while NE models classify each NE with the corresponding class according to CoreNLP. All BPE models are statistically significantly better than their word-based counterparts, and all models are statistically significantly better than RBMT.
[Numeric scores were lost in this copy of the paper. Rows: word- and BPE-level Baseline, F, PL and PG models in Binary and NE variants, plus RBMT and Google; columns: BLEU, chrF and TER for English→Spanish and Spanish→English.]
Table 8: Results for the NE focused models when tested with the NE focused test set. F refers to the models trained with the NE tag as a word feature, and PL and PG to the models trained to replace NEs with the corresponding token and translating the contents with Lucy LT and Google Translate respectively.
[Counts were lost in this copy of the paper. Rows: Baseline, F Binary, F NE, P_ Binary and P_ NE; columns: English→Spanish and Spanish→English for the generic and specific test sets.]
Table 9: Number of <unk> tokens generated by each approach. Both PG and PL have the same number of <unk> tokens.

Table 7 and Table 8 show the results of the evaluation of the NE-focused models when tested with the generic and specific datasets respectively. Models using the NE feature are not significantly different from the baseline, while the models using the replacement strategy lower the performance of the system. Still, the models using protected sequences reduce the number of <unk> tokens in the hypotheses for the specific evaluation corpus, as shown in Table 9. Producing fewer <unk> tokens may help improve the adequacy of the sentences.
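The replacement strategy behind the PL and PG models can be sketched as a protect/restore wrapper around the NMT decoder. This is an illustrative reconstruction with hypothetical helper names, not the authors' code; the indexed placeholders here are a convenience for unambiguous restoration, whereas the example in Table 14 shows plain class tokens such as PERSON:

```python
def protect_entities(sentence, entities, translate_entity):
    """Replace each detected NE with a class placeholder before decoding;
    the entity itself is translated externally (e.g. by Lucy LT for PL or
    Google Translate for PG). Placeholders are wrapped in the white
    parentheses mentioned in footnote 2 to avoid clashing with text.

    entities: list of (surface_string, ne_class) pairs.
    translate_entity: external translation callable (hypothetical)."""
    mapping = []
    for i, (surface, ne_class) in enumerate(entities):
        placeholder = f"⦅{ne_class}{i}⦆"
        sentence = sentence.replace(surface, placeholder, 1)
        mapping.append((placeholder, translate_entity(surface)))
    return sentence, mapping

def restore_entities(hypothesis, mapping):
    """Substitute the externally translated entities back into the NMT output."""
    for placeholder, translation in mapping:
        hypothesis = hypothesis.replace(placeholder, translation)
    return hypothesis
```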

Evaluation Focused on Terminology Injection

[Numeric scores were lost in this copy of the paper. Rows: word- and BPE-level Baseline, EMEA, PT (word level only), and F, PL and PG models with MED and MED+ corpora, plus RBMT and Google; columns: BLEU, chrF and TER for English→Spanish and Spanish→English.]
Table 10: Results for the terminology focused models when tested with the generic test set. Models marked with are significantly better than the NMT Baseline model. All BPE models are statistically significantly better than their word-based counterparts, and all models are statistically significantly better than RBMT.
[Numeric scores were lost in this copy of the paper. Rows: word- and BPE-level Baseline, EMEA, back-translation iterations (BT), PT (word level only), and F, PL and PG models with MED and MED+ corpora, plus RBMT and Google; columns: BLEU, chrF and TER for English→Spanish and Spanish→English.]
Table 11: Results for the terminology focused models when tested with the terminology focused test set. PT uses the OpenNMT phrase table feature with the dictionary extracted from Lucy LT, BT refers to the model trained on the back-translated corpus on the n-th iteration, F refers to the models trained with the MED tag as a word feature, and PL and PG to the models trained replacing terms with the MED token and translating the contents with Lucy LT and Google Translate respectively. MED+ models duplicate those lines that have terms, leaving them untouched, while processing the others. Models marked with are significantly better than the NMT Baseline model, and all BPE models are statistically significantly better than RBMT.

Results of the automatic evaluation for terminology injection can be seen in Table 10 (with the generic test set) and Table 11 (with the specific EMEA test set). EMEA was trained with the corpus labelled English-Spanish (EMEA) in Table 1, and should be treated as an upper bound. During our experiments, we observed that the back-translated models outperform all other alternatives. While the BPE-level model improved over the first three back-translation rounds, the word-level model only improved on the first round. Still, back-translation injects not only terminology into the models but also other linguistic information, as full sentences are fed to the system.
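The BT models can be pictured as an iterative back-translation loop. The sketch below is an assumed workflow with hypothetical train/translate interfaces, not the authors' actual pipeline:

```python
def iterative_back_translation(mono_target, parallel, train, rounds=3):
    """Iterative back-translation sketch: each round, a target->source
    model back-translates in-domain monolingual target text (e.g.
    medical Spanish), and the synthetic pairs are added to the parallel
    data used to train the next source->target model.

    train(pairs) -> model; model.translate(sentences) -> sentences."""
    data = list(parallel)
    forward = train(data)
    for _ in range(rounds):
        # target->source model, trained on the current parallel data
        backward = train([(tgt, src) for src, tgt in data])
        # back-translate the monolingual target-side sentences
        synthetic = [(backward.translate([t])[0], t) for t in mono_target]
        data = list(parallel) + synthetic  # replace previous synthetic pairs
        forward = train(data)              # the model for this iteration
    return forward
```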

All the approaches that add information to the word-level models performed slightly better than the baseline model, but the opposite happened for the BPE models. Only part of the sentences in the train set contained medical terms, which limited the effect of this approach. The model using the phrase-table replacement substitutes the <unk> tokens with information contained in the provided phrase table, hence obtaining a very small but significant improvement over the baseline on the generic test set, and a major one on the specific test set.
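The phrase-table mechanism works as a post-processing step: each generated <unk> is replaced using the attention alignments, in the spirit of OpenNMT's -replace_unk and -phrase_table options. An illustrative sketch of the idea (not OpenNMT's implementation):

```python
def replace_unk(hyp_tokens, src_tokens, attention, phrase_table):
    """Substitute each <unk> in the hypothesis with the phrase-table
    translation of the source word it attends to most, falling back to
    copying the source word itself.

    attention: target-by-source matrix of attention weights."""
    out = []
    for t, tok in enumerate(hyp_tokens):
        if tok == "<unk>":
            # source position with the highest attention weight
            s = max(range(len(src_tokens)), key=lambda j: attention[t][j])
            src_word = src_tokens[s]
            out.append(phrase_table.get(src_word, src_word))
        else:
            out.append(tok)
    return out
```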

5.2 Qualitative Evaluation

In this section, we describe the qualitative evaluation of the different models.

General Evaluation

Table 12 analyses a sentence translated by all the different models from Spanish to English. The analysis showed that, even though the RBMT hypothesis contains some grammatical mistakes, it still conveys the correct message. Nevertheless, it was the only hypothesis with a BLEU of 0, as it shared no four-gram with the reference, and it was the hypothesis with the highest TER. The baseline model hypothesis was tied for the best TER score and had the second-best BLEU score, but it failed to convey the proper message, as it lacked a translation for easing of price increases.

Source Pese a que los incrementos de los precios fueron menores en el segundo semestre de 2008 , los precios siguen siendo muy elevados . BLEU TER
Reference Despite an easing of price increases in the second half of 2008, prices remain at very high levels.
Baseline Despite the increases in prices in the second half of 2008, prices remain very high.
CAT-CL Although price increases were minor in the second half of 2008, prices remain very high.
CAT-CL D Although increases in prices were lower in the second half of 2008, prices remain high.
POS Despite the fact that price increases were lower in the second half of 2008, prices remain very high.
CAT-CL L Although price increases were lower in the second half of 2008, prices remain very high.
Tree Although prices of prices were lower in the second half of 2008 prices remain very high.
RBMT Even though the increases of the prices were smaller in the second semester of 2008, the prices keep being sky-high.
Google Although the price increases were lower in the second half of 2008, prices are still very high.
Table 12: Qualitative analysis of a sentence translated by all models for Spanish to English translation. Fragments in bold face are translation mistakes, and fragments in italics are translation alternatives that, while being penalised by TER and BLEU, can be considered correct.

Evaluation Focused on Morphological Information

Source He could have been sent home tomorrow if only you had kept quiet. chrF TER
Reference Hubiera podido salir mañana si no hubiera abierto la boca.
Baseline Podría haber sido enviado a casa mañana si sólo se hubiera mantenido callado.
BPE Podría haber sido enviado a casa mañana si hubieras mantenido silencio.
CAT-CL Podría haber sido enviado a casa mañana si sólo hubieras mantenido silencio.
CAT-CL D Podría haber sido enviado a casa mañana si sólo te quedaba callado.
POS Él podría haber sido enviado a casa mañana si sólo se hubiera callado.
CAT-CL L Podría haber sido enviado a casa mañana si sólo hubieras mantenido callado.
Tree Podría haber sido enviado a casa mañana si sólo hubieras mantenido silencio.
RBMT Se le podría haber enviado a casa mañana si solamente usted hubiera estado callado.
Google Podría haber sido enviado a casa mañana si solo hubieras guardado silencio.
Table 13: Qualitative analysis of morphology. Most BLEU scores were 0; instead, chrF was used.

Table 13 shows an example of a sentence translated with all the different models. Even when translating a fairly complex sentence that admits many different options, most of the models are able to produce tenses that keep the sense of the sentence. The reference translation is fairly idiomatic, thus no model matches it perfectly.
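chrF is used in these qualitative tables because most sentence-level BLEU scores were 0. A simplified, self-contained sketch of the metric (a character n-gram F-score [Popović, 2015]; the reference implementation differs in details such as whitespace handling options and word n-gram variants):

```python
from collections import Counter

def chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Minimal chrF sketch: character n-gram precision and recall,
    averaged over n = 1..max_n and combined into an F-beta score
    (beta=2 weights recall higher). Whitespace is removed, as in the
    common chrF configuration."""
    hyp = hypothesis.replace(" ", "")
    ref = reference.replace(" ", "")
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp_ngrams = Counter(hyp[i:i + n] for i in range(len(hyp) - n + 1))
        ref_ngrams = Counter(ref[i:i + n] for i in range(len(ref) - n + 1))
        overlap = sum((hyp_ngrams & ref_ngrams).values())
        if hyp_ngrams:
            precisions.append(overlap / sum(hyp_ngrams.values()))
        if ref_ngrams:
            recalls.append(overlap / sum(ref_ngrams.values()))
    p = sum(precisions) / len(precisions) if precisions else 0.0
    r = sum(recalls) / len(recalls) if recalls else 0.0
    if p + r == 0:
        return 0.0
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)
```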

Evaluation Focused on Named Entities

Source Langdon deduce que Sauniere fue miembro del priorato de Sion, una sociedad secreta asociada a la orden del temple. chrF TER
Reference Langdon deduces from this that Sauniere was a member of the priory of Sion, a secret society associated with the knights templar.
Baseline <unk> follows that <unk> was a member of the priory of Zion, a secret society associated with the order of the temple.
BPE Langdon deduces that Sapuniere was a member of the priory, a secret society associated with the order of the temple.
F Langdon argues that Sauniere was a member of the priory of Zion, a secret society associated with the order of the temple.
P_ PERSON states that PERSON was a member of the ORGANIZATION of MISC, a secret society associated with the ORGANIZATION
PL Langdon states that Sauniere was a member of the priorate of Zion, a secret society associated with the order of the temper.
PG Langdon states that Sauniere was a member of the priory of Zion, a secret society associated with the order of the temple.
RBMT Langdon deduces that Sauniere was a member of the priorate of Zion, a secret society associated to the order to the temper.
Google Langdon deduces that Sauniere was a member of the priory of Sion, a secret society associated with the order of the temple.
Table 14: Qualitative analysis of named entities. F refers to the BPE-level model with NE classes marked with features, P_ refers to the BPE-level model with NE replaced with a unique token for each NE class, then translated using Lucy LT (PL) or Google Translate (PG). Finally, RBMT refers to Lucy LT translation and Google to Google Translate. Most BLEU scores were 0; instead, chrF was used.

The example sentence in Table 14 shows that using the proposed approach can lead to improved translations. Langdon and Sauniere are both unknown words in the Baseline and BPE cases, leading to <unk> tokens in the output of the former and the incorrect translation Sapuniere in the latter. No model was able to properly translate orden del temple to knights templar; in some cases the word temple is left untranslated (as in PG or Google), and in the others it is improperly translated to temper.

Evaluation Focused on Terminology Injection

The example sentence in Table 15 shows that using this approach to focus on terminology can lead to better translations. The Spanish word vaso can be translated as glass or cup (e.g. a glass of water) or as vessel or vein (e.g. a blood vessel), depending on the context of the input sentence. Both baseline models use the first sense when translating, but when the identified term vaso sanguíneo is replaced with the tag MED, the correct sense is used. As a side effect, the produced hypotheses keep the passive voice characteristic of medical text. Still, no model is able to properly translate atravesar; all models use the incorrect sense of going through or crossing, instead of the proper term, entering.

We also analysed how Lucy LT and Google Translate render certain medical terminology in Table 16. The translations of medical terms produced by Google Translate appear to have a higher overlap with the reference corpus, hence leading to higher evaluation scores.

Source Debe tenerse precaución para no atravesar un vaso sanguíneo. chrF TER
Reference Care should be taken to ensure that a blood vessel has not been entered.
Baseline You must be careful not to go through a glass of blood.
BPE You must be careful to not go through a blood glass.
F You must be careful not to get through a blood cup.
P_ Caution must be taken not to cross a MED.
PL Caution must be taken not to cross a blood vessel.
PG Caution must be taken not to cross a blood vessel.
RBMT Precaution must be had not to go across a blood vessel.
Google Care must be taken not to cross a blood vessel.
Table 15: Qualitative analysis of terminology injection. F refers to the BPE-level model with terms marked with features, P_ refers to the BPE-level model with terms replaced with the MED token, then translated using Lucy LT (PL) or Google Translate (PG). Finally, RBMT refers to Lucy LT translation and Google to Google Translate. Most BLEU scores were 0; instead, chrF was used.
Spanish RBMT Google
dosis shot dose
medicamento medication medicine
frasco vial bottle
análisis test analysis
presión arterial arterial tension blood pressure
miocardio coronary myocardial
ictericia icterus jaundice
Table 16: Terminology selection for each MT system. The number indicates how many times the word appears in the Spanish side of the test set, and how many times the proposed translation appeared in the corresponding reference.

6 Conclusions and Future Work

In this work, we explored the use of rule-based machine translation (RBMT) knowledge to improve the performance of neural machine translation (NMT) models in an under-resourced scenario, showing that the models had limited ability to learn from the external information.

We also tested different approaches to inject named entities (NEs) and terminological expressions contained in the RBMT model into NMT. The approaches treat the NMT model as a black box: they require no knowledge of, or modifications to, the inner workings of the system, and are therefore applicable to any model, implementation and architecture. Only the approaches injecting terminology into word-based models improved on the baseline, albeit not statistically significantly. In some scenarios, certain approaches led to translations that, while not scoring significantly differently under automatic evaluation, appear to be closer to the style of the targeted text; namely, in the case of terminology translation, some strategies managed to retain the passive voice of the corpus.

One of the paths of our future work will further focus on the extraction of RBMT knowledge and the inclusion of transfer rules to improve the performance of the NMT model. The model that was trained following the structure with the parse tree was not able to properly deal with the information, and generally performed worse than the rest; integrating this information differently might produce better results.

A second path is exploring approaches that modify the architecture of the neural network, for example, using multiple encoders to take both the source sentence and the output of the RBMT system; this approach has been used to improve the performance of NMT [N16-1004]. As previously mentioned, corpus-based MT gives the user limited control over the output, especially when dealing with homographs and terminology, whereas RBMT gives total control. Combining the source sentence with RBMT output that contains the user-selected translations might lead to improvements in domain-specific or low-resource scenarios.

Finally, we also plan to leverage the information contained in other freely available RBMT systems, such as Apertium, which contains features similar to the ones used in this work.

Acknowledgments

This publication has emanated from research supported in part by a research grant from Science Foundation Ireland (SFI) under Grant Number SFI/12/RC/2289, co-funded by the European Regional Development Fund, and the Enterprise Ireland (EI) Innovation Partnership Programme under grant agreement No IP20180729, NURS - Neural Machine Translation for Under-Resourced Scenarios.

References

Appendix I: Models

  • Baseline: no extra features or protected sequences

  • CAT-CL: category (CAT) and class (CL) ambiguity classes as features

  • CAT-CL D: category (CAT) and class (CL) disambiguated with Lucy LT as features

  • CAT-CL L: category (CAT) and class (CL) disambiguated with CoreNLP as features

  • Tree: category (CAT) and class (CL) from Lucy LT as features, extra tokens for tree structure

  • POS: CoreNLP POS tags as features

  • RBMT: Lucy LT translation

  • Google: Google Translate translation

  • BT: model trained on the back-translated corpus from the n-th iteration

  • MED: Corpus with medical (MED) domain terms tagged

  • MED+: Corpus with medical (MED) domain terms tagged, sentences with MED domain get duplicated and tagged as generic (GEN)

  • F: Terms or named entities tagged as word features

  • PL: Terms or named entities replaced by tag, content translated with Lucy LT

  • PG: Terms or named entities replaced by tag, content translated with Google Translate

Footnotes

  1. That is, translating from English that is likely to have low adequacy, usually MT hypotheses, to post-edited, more adequate English.
  2. To avoid collisions with parentheses in the text, we used the left (⦅, U+2985) and right (⦆, U+2986) white parentheses.
  3. opus.nlpl.eu
  4. opensubtitles.org
  5. github.com/fxsjy/jieba
  6. The size of the embedding for word features depends on the number of unique values for the feature.
  7. translate.google.com/ retrieved between March and August 2019.