Transductive Data-Selection Algorithms for Fine-Tuning Neural Machine Translation
Machine Translation models are trained to translate a variety of documents from one language into another. However, models trained specifically for the particular characteristics of the documents tend to perform better. Fine-tuning is a technique for adapting an NMT model to some domain. In this work, we use this technique to adapt the model to a given test set. In particular, we use transductive data selection algorithms, which take advantage of the information in the test set to retrieve sentences from a larger parallel set.
In cases where the model is available at translation time (when the test set is provided), it can be adapted with a small subset of data, thereby achieving better performance than a generic model or a domain-adapted model.
Machine Translation (MT) models aim to generate a text in the target language which corresponds to the translation of a text in the source language, the test set. These models are trained with a set of parallel sentences so they can learn how to generalize and infer a translation when a new document is seen.
In the field of MT, Neural Machine Translation (NMT) models tend to achieve the best performances when large amounts of parallel sentences are used. However, relevant data is more useful than having more data. Previous studies (silva2018extracting) showed that models trained with in-domain sentences perform better than general-domain models.
However, training models for domains that are distant from general domains, such as scientific documents, is not always a simple task as parallel sentences are not always available. In addition, identifying the domain adds complexity if the domain of the document to be translated is too specific. The alternative explored in this work is to build models adapted to a given test set.
In order to build task-specific models, data selection algorithms play an important role as they retrieve sentences from the training data. Data selection methods can be classified (eetemadi2015survey) according to the criteria considered to select sentences (e.g. select sentences of a particular domain, good quality sentences, etc.). In this work, we use the transductive (Vapnik1998) data selection methods which use the document to be translated to select sentences that are the most relevant for translating such text.
In some cases, the organizations in charge of translating a document are also the owners of the translation model and training data. Therefore, knowing the test set is an advantage that can help to adapt the generic MT model towards the test set (utiyama2009two; liu2012locally).
The approaches presented here consist of building a single NMT model and delaying part of the training process so that the model can be adapted once the test set is available. Although this increases the time involved in translating a document, it also has some benefits.
First, using a single model makes it unnecessary to store multiple task-adapted models. Moreover, identifying the domain of the document (and so, the most appropriate model) before the translation is also avoided. In addition, due to the fine-grained adaptation, other characteristics that may not have been foreseen (e.g. formal or informal register, technical or literal vocabulary, the gender of the speaker, etc.) are also considered.
This paper presents the performance of three transductive data selection algorithms (TA), applied to NMT models, showing how these models can be improved by adapting them with a small set of data. The TAs are executed using the test set as seed, but there are other approaches such as using an approximated target-side (poncelas2018data; poncelas2018adapt).
The remainder of this paper is structured as follows. In Section 2, we state the research questions that we want to investigate. Section 3 contains some insights from other works that are related to this one and Section 4 describes the data selection methods used in the experiments. Section 5 describes the experimental setup and in Section 6 we build the models used as baselines in later experiments. The results of the main experiments are explained in Section 7 and finally, in Section 8, we conclude and indicate further research that can be carried out in the future.
2 Research Questions
In this work, we use a general-domain data set to build an NMT model. This model is then adapted, by fine-tuning, to two different test sets in two domains: news and health. The data used to adapt the model is retrieved by the algorithms described in Section 4. These methods retrieve sentences from: (i) the general-domain data; (ii) different in-domain data sets; and (iii) a concatenation of both the general-domain and in-domain sets. Therefore, the research questions we propose to explore are the following three:
Can a model fine-tuned with a subset of data outperform the model trained with general domain data?
The work of poncelas2018feature showed that performing fine-tuning on a subset of data (used to build the model) yields small improvements (and not statistically significant at level p=0.01). A limitation in their experiments is that, as BPE is not applied, the vocabulary of the adapted model remains the same as the general model. As in these experiments we are processing the data using BPE, the limitation of the vocabulary should disappear (as sub-words are considered rather than complete words). We are interested in exploring whether performing fine-tuning with a subset of the data (in which BPE was applied) can improve the base model.
Can a model fine-tuned with a subset of in-domain data outperform the model fine-tuned with the complete data set?
The general use of fine-tuning (luong2015stanford; freitag2016fast) consists of using an in-domain data set to adapt a model. However, we want to investigate whether applying data selection to a smaller in-domain set can also lead to improvements.
Can a model fine-tuned with a mixture of general-domain and in-domain data outperform the previously mentioned models?
By considering both datasets (general and in-domain data), the number of candidate sentences is increased. This also poses a challenge to the transductive algorithm as most of the candidate sentences are not in-domain. We are interested in exploring whether these algorithms can successfully retrieve sentences that lead to improvements.
3 Related Work
There are several adaptation techniques for NMT. chu2018survey structure them into two main groups, data centric (techniques which involve augmenting or modifying the training data) and model centric (techniques which involve modifying the architecture or the procedure with which the model is trained). In this paper, we use a combination of both as we use data selection methods (data centric) and fine-tuning (model centric).
The technique of fine-tuning (luong2015stanford; freitag2016fast) consists of training an NMT model with a general domain data set until convergence, and then using an in-domain set for the last epochs.
The work of van2017dynamic showed that training an NMT model using less (but more in-domain) data each epoch achieves improvements over a model trained with all data. Their experiments include weighting the sentences using Cross Entropy Difference (axelrod2011domain), and then, each epoch the top- sentences are used as training data where
A proposal that uses the test set to adapt the model is the work of li2018one. In particular, they fine-tune a pre-built NMT model for each sentence in the test set. They use three methods to retrieve the sentences that are the most similar to a sentence of the test set: (i) Levenshtein distance (levenshtein1966binary); (ii) cosine similarity of the average of the word embeddings (mikolov2013distributed); and (iii) the cosine similarity between hidden states of the encoder in NMT. The main difference with our work is that they adapt the model sentence-wise (one model for each sentence) whereas the adaptations presented here are document-wise (one model for each test set). Although sentence-wise adaptation is more fine-grained, it also has several disadvantages: (i) the computational cost is higher as there are several iterations (as many as there are sentences in the test set) of selecting data and fine-tuning; (ii) the usage of the data is less efficient as the same sentence can be extracted multiple times (in different iterations); and (iii) using a different model for each sentence carries the risk of producing translations that are not consistent throughout the entire document.
4 Transductive Data Selection Algorithms
In this work, we investigate data selection methods that exploit the information of the test set to retrieve sentences. These methods select a subset of sentences from the parallel set used as training data. In particular, they select sentences based on overlaps of n-grams between the test set and the source side of the parallel data. In this work, we explore the following three techniques:
TF-IDF Distance Method:
Distance methods measure how close two sentences are by using metrics such as the Levenshtein distance (which computes the minimum number of insertions, deletions or substitutions of characters that are necessary to transform one sentence into the other) to score their similarity. hildebrand2005adaptation propose the TF-IDF distance, i.e. using the cosine between TF-IDF (tfidf1973) vectors as the distance metric. In their work, for each sentence of the test set the top $n$ most similar sentences from the training data are selected. Although they are aware that the resulting set contains duplicated sentences, in their experiments the models containing duplicated sentences achieve slightly better results.
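TF-IDF measures the importance of the terms in a set of documents. Each document $D_i$ can be represented as a vector of terms $(w_{i1}, w_{i2}, \dots, w_{im})$, where $m$ is the size of the vocabulary. Each $w_{ik}$ is calculated as in (1):
$$w_{ik} = tf_{ik} \cdot \log(idf_k) \tag{1}$$
where $tf_{ik}$ is the term frequency (TF) of the k-th term in $D_i$, i.e. the number of occurrences, and $idf_k$ is the inverse document frequency (IDF) of the k-th term, as in (2):
$$idf_k = \frac{\#\text{documents}}{\#\text{documents containing the } k\text{-th term}} \tag{2}$$
The similarity between two sentences $s_1$ and $s_2$ is computed as the inverse of the cosine distance, i.e. the cosine similarity, of their TF-IDF vectors, $w_1$ and $w_2$, as in Equation (3):
$$sim(s_1, s_2) = \frac{w_1 \cdot w_2}{\lVert w_1 \rVert \, \lVert w_2 \rVert} \tag{3}$$
In the TFIDF transductive method, each sentence $s$ in the candidate data is scored according to its highest similarity with a sentence from the test set $S_{test}$, computed as in Equation (4):
$$score(s, S_{test}) = \max_{s_{test} \in S_{test}} sim(s, s_{test}) \tag{4}$$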
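The TFIDF scoring described above can be illustrated with a minimal Python sketch. It is not the implementation used in the experiments: whitespace tokenization, computing the document frequencies over the candidate and test sentences together, and the function names `tfidf_vectors`, `cosine` and `tfidf_select` are all assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(sentences):
    """Return one sparse TF-IDF vector (dict: term -> weight) per sentence."""
    tokenized = [s.split() for s in sentences]
    n_docs = len(tokenized)
    # Document frequency: number of sentences containing each term.
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({t: tf[t] * math.log(n_docs / doc_freq[t]) for t in tf})
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors (0.0 if either is all zeros)."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def tfidf_select(candidates, test_set, k):
    """Return the k candidates whose best match in the test set is highest."""
    # Assumption: IDF statistics are computed over candidates and test set jointly.
    vectors = tfidf_vectors(candidates + test_set)
    cand_vecs = vectors[:len(candidates)]
    test_vecs = vectors[len(candidates):]
    scored = [(max(cosine(c, t) for t in test_vecs), i)
              for i, c in enumerate(cand_vecs)]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # ties: original order
    return [candidates[i] for _, i in scored[:k]]


candidates = [
    "the patient received the drug",
    "the parliament voted today",
    "a drug for the patient",
]
selected = tfidf_select(candidates, ["the patient takes a drug"], k=2)
```

Note that every candidate is scored independently of the others, so near-duplicate sentences can all be selected; this is the property that distinguishes TFIDF from the two pool-aware methods that follow.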
Infrequent n-gram Recovery (INR):
parcheta2018data propose extracting those sentences containing n-grams from the test set that are considered infrequent (gasco2012does), so that frequent words such as stop words are ignored.
A sentence $s$ is scored according to the number of infrequent n-grams shared with the set of sentences of the test set $S_{test}$. It is computed as in Equation (5):
$$score(s, S_{test}) = \sum_{ngr \in \{s \cap S_{test}\}} \max(0, t - C_S(ngr)) \tag{5}$$
where $C_S(ngr)$ is the count of the n-gram $ngr$ in the selected set of sentences $S$ (those that have been selected already), and $t$ is the threshold, i.e. the number of occurrences up to which an n-gram is considered infrequent. If the number of occurrences of $ngr$ reaches the threshold $t$, then $ngr$ is considered a frequent n-gram (the component $\max(0, t - C_S(ngr))$ is 0) and it does not contribute to the score of the sentence. When a sentence is added to the selected pool, the count of each of its n-grams is updated (gasco2012does).
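The selection loop described above can be sketched in Python as follows. This is a simplified, quadratic-time illustration under stated assumptions rather than the implementation used in the experiments: the names `ngrams` and `inr_select` are hypothetical, tokenization is by whitespace, n-grams of every order up to `max_n` are used, and selection stops once no remaining candidate contains an n-gram that is still infrequent.

```python
from collections import Counter


def ngrams(tokens, max_n):
    """Set of all n-grams (as tuples) of order 1..max_n in a token list."""
    return {tuple(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)}


def inr_select(candidates, test_set, threshold, max_n=3):
    """Greedily select candidates that cover still-infrequent test n-grams."""
    test_ngrams = set()
    for sent in test_set:
        test_ngrams |= ngrams(sent.split(), max_n)
    # For each candidate, keep only the n-grams it shares with the test set.
    shared = [ngrams(c.split(), max_n) & test_ngrams for c in candidates]

    selected_count = Counter()  # occurrences of each n-gram in the selected set
    remaining = set(range(len(candidates)))
    selected = []

    def score(i):
        # An n-gram contributes only while its count is below the threshold.
        return sum(max(0, threshold - selected_count[g]) for g in shared[i])

    while remaining:
        best = max(sorted(remaining), key=score)  # ties: lowest index wins
        if score(best) == 0:
            break  # every shared n-gram has become frequent
        selected.append(candidates[best])
        remaining.remove(best)
        selected_count.update(shared[best])
    return selected


candidates = [
    "the patient received the drug",
    "the parliament voted today",
    "the patient received a drug",
]
test_set = ["the patient received"]
low = inr_select(candidates, test_set, threshold=1, max_n=2)
high = inr_select(candidates, test_set, threshold=2, max_n=2)
```

The example shows why the threshold bounds the amount of data retrieved: with `threshold=1` a single sentence already makes all shared n-grams frequent, whereas `threshold=2` allows a second sentence to be selected.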
Feature Decay Algorithms (FDA):
Feature Decay Algorithms (biccici2011instance) select data by trying to maximize the variability of n-grams in the selected data, decreasing the value of the n-grams as they are added to a selected pool $L$, which eventually becomes the selected data.
In order to do that, the n-grams in the test set $S_{test}$ are extracted and assigned an initial value. Each sentence $s$ in the set of candidate sentences has an importance score (i.e. the normalized sum of the scores of its n-grams) that determines whether it is selected.
Then, iteratively, the sentence with the highest score in the candidate data is selected and added to the selected pool $L$. In addition, the values of the n-grams of the selected sentence are decreased to ensure a variability of n-grams. The values are decreased according to the decay function in Equation (6):
$$decay(ngr) = init(ngr) \, \frac{d^{C_L(ngr)}}{(1 + C_L(ngr))^{c}} \tag{6}$$
where $init(ngr)$ is the initial value of the n-gram and $C_L(ngr)$ is the count of the n-gram $ngr$ in $L$. $d$ and $c$ are parameters of FDA. By default they have a value of $d = 0.5$ and $c = 0$, respectively.
The function in Equation (6) indicates the score of the feature at a particular iteration, so it is dependent on the set of selected sentences $L$.
The sentence is scored as a normalized (by length of the sentence) sum of the scores of the features. Considering the default values in Equation (6), the resulting score function is as in Equation (7):
$$score(s, L) = \frac{\sum_{ngr \in F_s} init(ngr) \cdot 0.5^{C_L(ngr)}}{\text{length of } s} \tag{7}$$
where $F_s$ is the set of n-grams in sentence $s$ (only n-grams that also occur in the test set have a value).
Once the selected pool contains the desired amount of sentences, the sentences are retrieved as selected data.
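As an illustration of the procedure above, the following Python sketch implements a naive version of the FDA loop. It rests on several assumptions that are not taken from the experiments: the function names `ngrams` and `fda_select` are hypothetical, every test-set n-gram starts with the same initial value `init=1.0`, tokenization is by whitespace, and candidates are rescored from scratch at each iteration (practical implementations typically use a priority queue instead).

```python
from collections import Counter


def ngrams(tokens, max_n):
    """Set of all n-grams (as tuples) of order 1..max_n in a token list."""
    return {tuple(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)}


def fda_select(candidates, test_set, size, max_n=3, d=0.5, c=0.0, init=1.0):
    """Select `size` sentences, decaying n-gram values as the pool grows."""
    test_feats = set()
    for sent in test_set:
        test_feats |= ngrams(sent.split(), max_n)

    feats, lengths = [], []
    for cand in candidates:
        toks = cand.split()
        feats.append(ngrams(toks, max_n) & test_feats)  # features with a value
        lengths.append(max(len(toks), 1))               # avoid division by zero

    pool_count = Counter()  # count of each n-gram in the selected pool

    def decay(g):
        n = pool_count[g]
        return init * d ** n / (1 + n) ** c

    def score(i):
        # Length-normalized sum of the current feature values.
        return sum(decay(g) for g in feats[i]) / lengths[i]

    remaining = list(range(len(candidates)))
    selected = []
    while remaining and len(selected) < size:
        best = max(remaining, key=score)  # ties: earliest candidate wins
        selected.append(candidates[best])
        remaining.remove(best)
        pool_count.update(feats[best])
    return selected


candidates = ["the patient", "the patient", "received"]
picked = fda_select(candidates, ["the patient received"], size=2, max_n=1)
```

In this toy example the duplicate of the first sentence is skipped in favour of "received": once "the" and "patient" are in the pool their value is halved, which is how the decay promotes variability of n-grams in the selected data.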
5 Experimental Setup
The data sets used in the experiments are based on the ones used in the work of (biccici2013feature):
We build a German-to-English NMT model using the data provided in WMT 2015 (bojar-EtAl:2015:WMT) (4.5M sentence pairs). We consider this data set as the general-domain training data to build the non-adapted NMT model (BASE). As development data, we use 5K randomly sampled sentences from development sets of previous years.
The BASE model is adapted to two domains: news and health. Therefore we also use two test sets and two in-domain training sets (for research questions 2 and 3 explained in Section 2):
News Domain: We use the test set provided in the WMT 2015 News Translation Task, and the in-domain rapid2016 data set (https://tilde.com/) (1.3M sentence pairs) provided in the WMT 2017 News Translation Task (bojar2017findings).
Health Domain: German-to-English parallel text from the European Medicines Agency (EMEA, http://opus.nlpl.eu/EMEA.php) (Tiedemann:RANLP5) (361K sentence pairs). As the health-domain test set we use the Cochrane dataset (http://www.himl.eu/test-sets) provided in the WMT 2017 biomedical translation shared task (yepes2017findings).
Note that the general-domain set contains sentences from a corpus such as Europarl (koehn2005europarl) which causes the domain to be closer to the news domain.
All data sets are tokenized and truecased, and Byte Pair Encoding (BPE) (sennrich2016neural) is applied with 89500 merge operations (the number of operations used in the work of sennrich2016neural). The models have been built using OpenNMT-py (opennmt). We keep the default settings of OpenNMT-py: 2-layer LSTM with 500 hidden units and a vocabulary size of 50000 words for each language.
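To make the notion of a merge operation concrete, the sketch below learns BPE merges from a toy word-frequency table. It is a simplified illustration of the general BPE idea, not the tool or configuration used in the experiments; the function name `learn_bpe`, the `</w>` end-of-word marker and the tie-breaking rule are assumptions made so that the example is deterministic.

```python
from collections import Counter


def learn_bpe(word_freqs, num_merges):
    """Learn up to num_merges BPE merges from a {word: frequency} table."""
    # Each word starts as a sequence of characters plus an end-of-word marker.
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs in the corpus.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Most frequent pair; ties are broken lexicographically (assumption).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        # Replace every occurrence of the best pair with the merged symbol.
        new_vocab = {}
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            key = tuple(out)
            new_vocab[key] = new_vocab.get(key, 0) + freq
        vocab = new_vocab
    return merges


toy_counts = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges = learn_bpe(toy_counts, num_merges=3)
```

After three merges the frequent suffix "est" has become a single symbol, while rarer words remain split into smaller units. This is why a word that is infrequent in the general-domain data can still be represented with a fixed sub-word vocabulary.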
The models built in the experiments are evaluated on the test sets using several evaluation metrics: BLEU (papineni2002bleu), TER (snover2006study) and METEOR (banerjee2005meteor). The scores assigned by these metrics give an estimation of the quality of the translation (compared to a human-translated reference). Higher scores of BLEU and METEOR indicate better translation quality. TER is an error metric, therefore lower scores indicate better performance.
In each table, scores that are better than the baseline are shown in bold. Furthermore, scores that constitute a statistically significant improvement at level p=0.01 over the baseline are marked with an asterisk. This was computed with multeval (clark2011better) using Bootstrap Resampling (koehn04).
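The idea behind a paired bootstrap significance test can be sketched as follows. This is not multeval's implementation: it assumes, for simplicity, a hypothetical per-sentence quality score for each system (corpus-level metrics such as BLEU would instead be recomputed on every resample), and the function name `paired_bootstrap` is chosen for illustration only.

```python
import random


def paired_bootstrap(baseline_scores, system_scores, n_samples=1000, seed=0):
    """Estimate how often the system fails to beat the baseline on resamples.

    Both lists hold one (hypothetical) sentence-level score per test sentence,
    in the same order. The returned fraction plays the role of a p-value.
    """
    if len(baseline_scores) != len(system_scores):
        raise ValueError("score lists must be aligned sentence by sentence")
    rng = random.Random(seed)
    n = len(baseline_scores)
    failures = 0
    for _ in range(n_samples):
        # Resample the test set with replacement, using the same sentence
        # indices for both systems (this is what makes the test paired).
        idx = [rng.randrange(n) for _ in range(n)]
        base_total = sum(baseline_scores[i] for i in idx)
        sys_total = sum(system_scores[i] for i in idx)
        if sys_total <= base_total:
            failures += 1
    return failures / n_samples


baseline = [0.1, 0.2, 0.3]
always_better = [0.2, 0.3, 0.4]
p_better = paired_bootstrap(baseline, always_better)
p_same = paired_bootstrap(baseline, baseline)
```

Under this sketch, an improvement would be reported as significant at level p=0.01 when the returned fraction is below 0.01.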
6 Baseline Results
6.1 Baseline Results with General-domain Data
Table 1 presents the results on the news test set for the 12th epoch of the base model (BASE12) and the 13th epoch (BASE13). Similarly, Table 2 presents the results on the test set in the health domain. These results help to confirm that the models trained for 12 epochs are close to convergence: in Table 1 the increment in performance from the 12th to the 13th epoch is just 0.0018 BLEU points and in Table 2 the performance is worse in the 13th epoch.
6.2 Baseline Results With In-domain Data
Table 3 (news test set), columns: BASE12 | BASE12 + rapid2016
Table 4 (health test set), columns: BASE12 | BASE12 + EMEA
Following the work of luong2015stanford; freitag2016fast, we adapt the base system (BASE12) by performing the 13th epoch on a different, smaller, in-domain data set. We create two new models, one adapted to the news domain (BASE12 + rapid2016) and another one to the health domain (BASE12 + EMEA).
We see, in Table 4, how using in-domain data for fine-tuning can increase the performance by more than 2 BLEU points. However, the data set chosen for fine-tuning is important: in Table 3 we see that the performance of the model becomes worse after fine-tuning with the rapid2016 dataset. This also indicates that the addition of new data is not necessarily good.
7 Main Experiments
In order to answer the questions in Section 2, we perform three sets of experiments: we fine-tune the BASE12 model with a subset of the general-domain data (Section 7.1), with a subset of in-domain data (Section 7.2), and with a subset of data retrieved from both general-domain and in-domain data (Section 7.3).
We use the default configuration of the data selection methods. We use $d = 0.5$, $c = 0$ and 3-grams as features in FDA (Equation (6)).
In the INR method we also use 3-grams as the n-grams $ngr$ in Equation (5). In order to find a value of the threshold $t$ for the experiments, we execute several runs of INR using different values of $t$, multiplying it by two in each execution (we try 10, 20, 40, 80, …). In the experiments we use the highest value of $t$ that fulfills one of the following criteria: (i) the execution time is under 48 hours or (ii) the number of sentences retrieved is at least 500K. Accordingly, the value of $t$ in the news domain is 80 (230K sentences retrieved) and in the health domain 640 (275K sentences retrieved).
7.1 Results of Models Trained in a Subset of General-Domain Data
Table 5 (news test set), columns: BASE13 | BASE12 + TFIDF | BASE12 + INR | BASE12 + FDA
Table 6 (health test set), columns: BASE13 | BASE12 + TFIDF | BASE12 + INR | BASE12 + FDA
In order to investigate the first question mentioned in Section 2, we select a subset of sentences of the general-domain data (the data set used to build BASE12). We extract subsets of three different sizes: 100K, 200K, and 500K lines. The only exception is the INR method which, with the established configuration, retrieves at most 230K sentences and 275K sentences using the news and health test sets, respectively. The BASE12 model is fine-tuned for a 13th epoch using the subset of data extracted.
In Table 5 and Table 6 we show the performance of the base model in the first column (BASE13 column) and then the model in which the last epoch is fine-tuned using data selected by one of the three data selection algorithms. As we can see, fine-tuning the model with the selected data leads to improvements for most of the experiments (numbers in bold).
The vocabulary considered in the fine-tuning is the same as that used for building the BASE12 model. However, as BPE has been applied, this restriction is less strict. For example, in the sentence of the news test set "das Bildungsministerium teilte mit, etwa ein Dutzend Familien sei noch nicht zurückgekehrt." (according to the reference, "the Education Ministry said about a dozen families still had not returned.") the word "Bildungsministerium" ("Education Ministry") would have been left out (even if there are several occurrences in the selected data) if BPE had not been applied, because it is infrequent in the general-domain set. As in these experiments we use BPE, the adapted models achieve improvements in terms of fluency.
The non-adapted BASE13 model translates the above-mentioned sentence as "the Ministry of Education said, for example, that a dozen families did not return." In this sentence, the phrase "for example" has been added. The model adapted using TFIDF (100K lines) generates a similar sentence (i.e. "the Ministry of Education said, for example, that a dozen families had not returned."), but this problem is corrected by the models adapted using INR and FDA (100K lines), as both of them generate the same translation: "the Ministry of Education said, about a dozen families have not returned." Here the phrase "for example" added by the BASE13 model is removed.
7.2 Results of Models Trained with a Subset of In-Domain Data
Table 7 (news test set), columns: BASE12 + rapid2016 | BASE12 + TFIDF rapid2016 | BASE12 + INR rapid2016 | BASE12 + FDA rapid2016
Table 8 (health test set), columns: BASE12 + EMEA | BASE12 + TFIDF EMEA | BASE12 + INR EMEA | BASE12 + FDA EMEA
In order to answer the second research question stated in Section 2, we also execute the same transductive algorithms (using the same configuration) in the in-domain set (i.e. rapid2016 and EMEA). We retrieve the same amount of sentences: 100K, 200K and 500K lines for news domain; and 100K and 200K for the health domain (as EMEA only has 361K sentences).
In Table 7 we show in the first column, BASE12+rapid2016, the performance of the model fine-tuned with the complete in-domain rapid2016 set (also presented in Table 3). The other columns contain the evaluation scores after fine-tuning the BASE12 model with subsets of rapid2016. Similarly, Table 8 indicates the performance of the model fine-tuned with the EMEA dataset and different subsets of it (evaluated with the health test set). Note also that the number of sentences retrieved by INR (using the same configuration as in the previous section) is less than 200K lines, so those experiments are not executed.
Using a subset of in-domain data can improve the performance as, again, most of the scores in Table 7 and Table 8 are marked in bold. We see that the impact on the models evaluated in the news domain (Table 7) is higher, as all experiments achieve statistically significant improvements at level p=0.01 for at least one evaluation metric. Despite that, none of the models improves on the BASE13 model (column BASE13 in Table 1).
7.3 Results of Models Trained with a Mixture of General-Domain and In-Domain Data
As we have seen in previous sections, applying fine-tuning with subsets of data can perform better than using the complete dataset. In this section, we aim to explore the performance of models fine-tuned on data retrieved from a mixture of the two datasets used in previous sections: data used for building the BASE12 model, and in-domain data (rapid2016 or EMEA datasets). These experiments are particularly interesting in the case of news test because using an external dataset led to worse results.
In Table 9 we present the percentage of lines from the general domain dataset present in the selected data. We observe that in the news domain (the first subtable in Table 9) the percentages are higher than in the health domain (the second subtable). This indicates how these transductive methods are capable of identifying better sentences. As shown in Table 3, the sentences from the base dataset are more useful for the news test as using the rapid2016 set for tuning the model leads to worse results.
If we perform a (column-wise) comparison of the three methods, we can observe that the INR and FDA methods retrieve a similar amount of sentences from the base set. By contrast, the TFIDF method seems to retrieve a smaller amount of sentences from the general domain data (the percentages in column TFIDF of Table 9 are much lower than the other columns).
Table 10 (news test set), columns: BASE13 | BASE12 + rapid2016 | BASE12 + TFIDF | BASE12 + INR | BASE12 + FDA
Table 11 (health test set), columns: BASE13 | BASE12 + EMEA | BASE12 + TFIDF | BASE12 + INR | BASE12 + FDA
In Table 10 and Table 11 we show two baselines: (i) column BASE13 shows the model built performing 13 epochs; and (ii) column BASE12+rapid2016 and BASE12+EMEA present the results observed in Table 3 and Table 4, respectively. In those tables we indicate in bold those scores that are better than both baselines.
The models adapted to the news test (Table 10) using INR and FDA tend to perform better than both the BASE13 and the BASE12+rapid2016 models. This is especially true for smaller datasets (the adaptation with 100K lines achieves statistically significant improvements at p=0.01) but becomes closer to BASE13 when more sentences are retrieved (500K lines subtable). For the TFIDF method, despite the fact that it achieves better results than the BASE12+rapid2016 model, most of the scores are worse than the BASE13 model. As mentioned earlier, TFIDF tends to retrieve more sentences from the rapid2016 set (Table 9), and as we saw before using more sentences from this set leads to worse performing models.
In the health domain (Table 11), by contrast, TFIDF performs slightly better (the only experiment that achieves statistically significant improvements at p=0.01 for the three evaluation metrics).
8 Conclusion and Future Work
In this work, we have shown how general domain models can be adapted to a test set by fine-tuning not only to a particular domain but also to a special subset of sentences (retrieved from in-domain or out-of-domain data) that are closer to a test set and achieve better results.
We have seen that fine-tuning a model using a subset of data can achieve better performance than the model trained with the full training set. This is also applicable when using an additional set of in-domain sentences. Nonetheless, the best results are observed when augmenting the candidate sentences (i.e. combining general and in-domain sentences) as presented in Section 7.3.
FDA offers a good balance between performance and speed. INR achieves results similar to FDA, but the execution time depends on the configuration (i.e. the value of the threshold $t$) and may exceed several hours (FDA requires less than one hour for the same execution). The configuration also restricts the number of sentences retrieved. In the experiments performed, we retrieved no more than 200K sentences to evaluate INR, whereas for the other TAs we could retrieve 500K parallel lines. Moreover, in this work we have used the same values of $t$ for all the experiments, which have been determined following the most restrictive assumption of not knowing the in-domain data. In the future, we want to evaluate the models fine-tuned with data retrieved from INR using different values of $t$.
Although the TFIDF technique achieves comparable results, we find it to be the weakest of the TAs explored. The main difference with the other two is that it is not context-dependent (i.e. it does not consider the selected pool to retrieve new sentences) and, in addition, each sentence is considered independently. As a consequence, for a larger test set such as news, the improvements tend to be smaller or not statistically significant at p=0.01 (e.g. Tables 5 and 10).
The experiments carried out in this paper can be further expanded using different language pairs, different domains and different selected-data sizes. Moreover, other configurations of the data selection algorithms could be investigated, for example using n-grams of higher order, executing INR with different values of $t$ in Equation (5), or FDA with different values of $d$ and $c$ in Equation (6) (poncelasextending; poncelas2017applying).
The techniques explored here can also be used in combination with other approaches aiming to adapt models towards a particular domain. The models presented in Section 7.3 can be further expanded by adding a tag in the source sentences indicating the domain explicitly (chu2017empirical; poncelas2019adapting), using a target-side seed or using synthetic sentences (chinea2017adapting; poncelas2019adaptation).
This research has been supported by the ADAPT Centre for Digital Content Technology which is funded under the SFI Research Centres Programme (Grant 13/RC/2106) and is co-funded under the European Regional Development Fund.
This work has also received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No 713567.