Lost in Machine Translation: A Method to Reduce Meaning Loss

Reuben Cohn-Gordon
Stanford
Noah D. Goodman
Stanford
Abstract

A desideratum of high-quality translation systems is that they preserve meaning, in the sense that two sentences with different meanings should not translate to one and the same sentence in another language. However, state-of-the-art systems often fail in this regard, particularly in cases where the source and target languages partition the “meaning space” in different ways. For instance, “I cut my finger.” and “I cut my finger off.” describe different states of the world but are translated to French (by both Fairseq and Google Translate) as “Je me suis coupé le doigt.”, which is ambiguous as to whether the finger is detached. More generally, translation systems are typically many-to-one (non-injective) functions from source to target language, which in many cases results in important distinctions in meaning being lost in translation. Building on Bayesian models of informative utterance production, we present a method to define a less ambiguous translation system in terms of an underlying pre-trained neural sequence-to-sequence model. This method increases injectivity, resulting in greater preservation of meaning as measured by improvement in cycle-consistency, without impeding translation quality (measured by BLEU score).

1 Many-to-One Translations

Languages differ in what meaning distinctions they must mark explicitly. As such, translations risk mapping from a form in one language to a more ambiguous form in another. For example, the definite (1) and indefinite (2) both translate (under Fairseq and Google Translate) to (3) in French, which is ambiguous in definiteness.

The animals run fast. (1)
Animals run fast. (2)
Les animaux courent vite (3)
Figure 1: A state-of-the-art neural translation model loses a meaning distinction which the informative model preserves.

Survey

To evaluate the nature of this problem, we explored a corpus of 500 pairs of distinct English sentences which map to a single German one (German being the evaluation language in Section 2.3). The corpus was generated by selecting short sentences from the Brown corpus (Kučera and Francis, 1967), translating them to German, and taking the best two candidate translations back into English, if these two themselves translate to a single German sentence. We identify a number of common causes for the many-to-one maps. Two frequent types of verbal distinction lost when translating to German are tense (54 pairs, e.g. “…others {were, have been} introduced.”) and modality (16 pairs, e.g. “…prospects for this year {could, might} be better.”), where German “können” can express both epistemic and ability modality, distinguished in English with “might” and “could” respectively. Owing to English’s large vocabulary, lexical differences in verb (31 pairs, e.g. “arise” vs. “emerge”), noun (56 pairs, e.g. “mystery” vs. “secret”), adjective (47 pairs, e.g. “unaffected” vs. “untouched”) or deictic/pronoun (32 pairs, usually “this” vs. “that”) are also common. A large number of the pairs differ instead either orthographically, or in other ways that do not correspond to a clear semantic distinction (e.g. “She had {taken, made} a decision.”).

Our approach

While languages differ in what distinctions they are required to express, all are usually capable of expressing any given distinction when desired. As such, meaning loss of the kind discussed above is, in theory, avoidable. To this end, we propose a method to reduce meaning loss by applying the Rational Speech Acts (RSA) model of an informative speaker to translation. RSA has been used to model natural language pragmatics (Goodman and Frank, 2016), and recent work has shown its applicability to image captioning (Andreas and Klein, 2016; Vedantam et al., 2017; Mao et al., 2016), another sequence-generation NLP task. Here we use RSA to define a translator which reduces many-to-one mappings, and consequently meaning loss, in terms of a pretrained neural translation model. We introduce a strategy for performing inference efficiently with this model in the setting of translation, and show gains in cycle-consistency as a result. (Formally, say that a pair of functions $f: A \to B$, $g: B \to A$ is cycle-consistent if $g \circ f = \mathrm{id}$, the identity function. If $f$ is not one-to-one, then $(f, g)$ is not cycle-consistent. Note however that when $A$ and $B$ are infinite, the converse does not hold: even if $f$ and $g$ are both one-to-one, $(f, g)$ need not be cycle-consistent, since one-to-one maps between infinite sets are not necessarily bijective.) Moreover, we obtain improvements in translation quality (BLEU score), suggesting that the goal of meaning preservation can itself yield improved translations.
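The link between many-to-one translation and cycle-consistency can be illustrated with a small Python sketch. The lookup tables below are hypothetical stand-ins for translation systems (only the sentence pair itself comes from the abstract); the point is that once two sources share a translation, no back-translation table can recover both.

```python
# Toy stand-in for a translation system: a lookup table (hypothetical).
# Both English sentences map to the same French sentence.
forward = {
    "I cut my finger.": "Je me suis coupé le doigt.",
    "I cut my finger off.": "Je me suis coupé le doigt.",
}


def is_cycle_consistent(forward, backward):
    """True if backward(forward(x)) == x for every source sentence x."""
    return all(backward.get(forward[x]) == x for x in forward)


# A back-translation table can send the shared French sentence to only one
# English sentence, so we try each possible choice in turn.
candidate_backwards = [{"Je me suis coupé le doigt.": src} for src in forward]
results = [is_cycle_consistent(forward, b) for b in candidate_backwards]
```

Every entry of `results` is `False` here, whereas a one-to-one `forward` table paired with its inverse passes the same check.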

English: He is wearing glasses. | He wears glasses.
$S_0^{\text{SNT}}$: Er trägt eine Brille. | Er trägt eine Brille.
$S_1^{\text{SNT-IP}}$: Er trägt jetzt eine Brille. | Er hat eine Brille.
Figure 2: Similar to Figure 1, $S_0^{\text{SNT}}$ collapses two English sentences into a single German one, whereas $S_1^{\text{SNT-IP}}$ distinguishes the two in German.

2 Meaning Preservation as Informativity

In the RSA framework, the informative speaker model $S_1$ is given a state $w \in W$, and chooses an utterance $u \in U$ to convey $w$ to $S_1$’s own model of a listener. For translation, the state space $W$ is a set of source language sentences (sequences of words in the language), while $U$ is a set of target language sentences. $S_1$’s informative behavior discourages many-to-one maps that a non-informative translation model $S_0$ might allow.

Model

BiLSTMs with attention (Bahdanau et al., 2014), and more recently CNNs (Gehring et al., 2016) and entirely attention-based models (Vaswani et al., 2017), constitute the state-of-the-art architectures in neural machine translation. All of these systems, once trained end-to-end on aligned data, can be viewed as a conditional distribution $S_0^{\text{WD}}(wd \mid w, c)$, for a word $wd$ in the target language, a source language sentence $w$, and a partial sentence $c$ in the target language (the superscripts WD and SNT distinguish word-level and sentence-level speaker models respectively). $S_0^{\text{WD}}$ yields a distribution $S_0^{\text{SNT}}$ over full sentences (Python list indexing conventions are used, and “+” means concatenation of a list with an element or a list):

$$S_0^{\text{SNT}}(u \mid w, c) = \prod_{t} S_0^{\text{WD}}(u[t] \mid w, c + u[:t]) \tag{4}$$

$S_0^{\text{SNT}}$ returns a distribution over continuations of $c$ into full target language sentences. In what follows, we omit $c$ when it is empty, so that $S_0^{\text{SNT}}(u \mid w)$ is the probability of sentence $u$ given $w$. To obtain a sentence from $S_0^{\text{SNT}}$ given a source language sentence $w$, one can greedily choose the highest probability word from $S_0^{\text{WD}}$ at each timestep, or explore a beam of possible candidates. We implement $S_0^{\text{WD}}$ (in terms of which all our other models are defined) using Fairseq’s publicly available (https://github.com/pytorch/fairseq) pretrained Transformer models for English-German and English-French, and for German-English train a CNN using Fairseq.
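To make the relation between the word-level and sentence-level models concrete, here is a minimal Python sketch. It assumes a hypothetical word-level model given as a lookup table (`toy_s0_wd`, with invented probabilities and an invented `</s>` end token) in place of a pretrained network; the sentence-level log-probability is the sum of word-level log-probabilities, and greedy unrolling picks the most probable word at each step.

```python
import math

EOS = "</s>"  # hypothetical end-of-sentence token


def toy_s0_wd(src, prefix):
    """Hypothetical word-level model: P(next word | source, target prefix)."""
    table = {
        (): {"les": 0.7, "ces": 0.3},
        ("les",): {"animaux": 1.0},
        ("ces",): {"animaux": 1.0},
        ("les", "animaux"): {EOS: 1.0},
        ("ces", "animaux"): {EOS: 1.0},
    }
    return table[tuple(prefix)]


def s0_snt_logprob(word_model, src, sentence, prefix=()):
    """Log-probability of `sentence` as a continuation of `prefix`."""
    prefix = list(prefix)
    total = 0.0
    for t, wd in enumerate(sentence):
        dist = word_model(src, prefix + list(sentence[:t]))
        total += math.log(dist[wd])
    return total


def greedy_unroll(word_model, src, prefix=(), max_len=50):
    """Extend `prefix` with the most probable word at each step until EOS."""
    out = list(prefix)
    while len(out) < max_len:
        dist = word_model(src, out)
        wd = max(dist, key=dist.get)
        out.append(wd)
        if wd == EOS:
            break
    return out


best = greedy_unroll(toy_s0_wd, "animals run fast")
best_prob = math.exp(s0_snt_logprob(toy_s0_wd, "animals run fast", best))
```

Beam search would replace the single `max` with a set of the highest-scoring prefixes kept at each step.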

2.1 Explicit Distractors

We first describe a model for the simple case where a source language sentence needs to be distinguished from a presupplied distractor (as in the pairs shown in Figures 1 and 2). We use this model as a stepping stone to one which requires an input sentence in the source language only, and no distractors. We begin by defining a listener $L_1^{\text{SNT}}$, which receives a target language sentence $u$ and infers which sentence $w \in W$ (where $W$ is a presupplied set such as the pair (1) and (2)) would have resulted in the pretrained neural model $S_0^{\text{SNT}}$ producing $u$:

$$L_1^{\text{SNT}}(w \mid u) = \frac{S_0^{\text{SNT}}(u \mid w)}{\sum_{w' \in W} S_0^{\text{SNT}}(u \mid w')} \tag{5}$$

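This allows the informative speaker model $S_1^{\text{SNT-GP}}$ to be defined in terms of $L_1^{\text{SNT}}$, where $U$ is the set of all possible target language sentences ($\alpha$ is a hyperparameter of $S_1^{\text{SNT-GP}}$; as it increases, the model cares more about being informative and less about producing a reasonable translation):

$$S_1^{\text{SNT-GP}}(u \mid w) = \frac{S_0^{\text{SNT}}(u \mid w)\, L_1^{\text{SNT}}(w \mid u)^{\alpha}}{\sum_{u' \in U} S_0^{\text{SNT}}(u' \mid w)\, L_1^{\text{SNT}}(w \mid u')^{\alpha}} \tag{6}$$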

The key property of this model is that, for $W = \{A, B\}$, when translating $A$, $S_1^{\text{SNT-GP}}$ prefers translations of $A$ that are unlikely to be good translations of $B$. So for pairs like (1) and (2), $S_1^{\text{SNT-GP}}$ is compelled to produce a translation for the former that reflects its difference from the latter, and vice versa.
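The listener in (5) and the speaker in (6) can be sketched in a few lines of Python if the infinite set of target sentences is replaced by a small finite candidate set. That truncation, the function names, and all probabilities below are assumptions made for illustration; they are not outputs of any real model.

```python
def l1_snt(s0, u, sources):
    """Listener (5): posterior over source sentences given target sentence u."""
    z = sum(s0[w][u] for w in sources)
    return {w: s0[w][u] / z for w in sources}


def s1_snt(s0, w, sources, alpha):
    """Speaker (6): S0 reweighted by the listener's posterior on w, to the power alpha."""
    unnorm = {
        u: p * l1_snt(s0, u, sources)[w] ** alpha
        for u, p in s0[w].items()
    }
    z = sum(unnorm.values())
    return {u: v / z for u, v in unnorm.items()}


# Hypothetical S0 probabilities over a two-sentence candidate set.
s0 = {
    "The animals run fast.": {
        "Les animaux courent vite.": 0.6,
        "Ces animaux courent vite.": 0.4,
    },
    "Animals run fast.": {
        "Les animaux courent vite.": 0.9,
        "Ces animaux courent vite.": 0.1,
    },
}
sources = list(s0)
target = "The animals run fast."

s0_choice = max(s0[target], key=s0[target].get)
s1_dist = s1_snt(s0, target, sources, alpha=2.0)
s1_choice = max(s1_dist, key=s1_dist.get)
```

With these numbers the baseline prefers "Les animaux courent vite." for both sources, while the informative speaker switches to "Ces animaux courent vite." for the definite sentence, because the listener assigns it a higher posterior there (0.8 versus 0.4).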

Inference

Since $U$ is an infinite set, exactly computing the most probable utterance under $S_1^{\text{SNT-GP}}(\cdot \mid w)$ is intractable, because of the normalizing term. Andreas and Klein (2016) and Mao et al. (2016) perform approximate inference by sampling the subset of $U$ produced by a beam search from $S_0^{\text{SNT}}$. Vedantam et al. (2017) and Cohn-Gordon et al. (2018) employ a different method, using an incremental model $S_1^{\text{SNT-IP}}$ as an approximation of $S_1^{\text{SNT-GP}}$ on which inference can be tractably performed.

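$S_1^{\text{SNT-IP}}$ considers informativity not over the whole set of utterances, but instead at each decision of the next word in the target language sentence. For this reason, the incremental method avoids the problem of lack of beam diversity encountered when subsampling from $S_0^{\text{SNT}}$, which becomes especially bad when producing long sequences, as is often the case in translation. $S_1^{\text{SNT-IP}}$ is defined as the product of informative decisions, specified by $S_1^{\text{WD}}$ and in turn $L_1^{\text{WD}}$, which are defined analogously to (6) and (5).

$$L_1^{\text{WD}}(w \mid wd, c) \propto S_0^{\text{WD}}(wd \mid w, c) \tag{7}$$
$$S_1^{\text{WD}}(wd \mid w, c) \propto S_0^{\text{WD}}(wd \mid w, c)\, L_1^{\text{WD}}(w \mid wd, c)^{\alpha} \tag{8}$$
$$S_1^{\text{SNT-IP}}(u \mid w, c) = \prod_{t} S_1^{\text{WD}}(u[t] \mid w, c + u[:t]) \tag{9}$$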
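The incremental scheme can be sketched as follows: at each step the word-level distribution is reweighted by how strongly each candidate word points back to the intended source among the distractors, and the result is unrolled greedily. This is a minimal sketch under stated assumptions: the word-level model `s0_wd` is a hypothetical lookup table with invented probabilities, and the listener uses a uniform prior over sources.

```python
EOS = "</s>"  # hypothetical end-of-sentence token

# Hypothetical first-word distributions for two source sentences.
FIRST_WORD = {
    "the animals run": {"les": 0.6, "ces": 0.4},
    "animals run": {"les": 0.9, "ces": 0.1},
}


def s0_wd(src, prefix):
    """Hypothetical word-level S0: P(next word | source, target prefix)."""
    if len(prefix) == 0:
        return FIRST_WORD[src]
    # After the first word, both sources continue identically in this toy.
    return {"animaux": 1.0} if len(prefix) == 1 else {EOS: 1.0}


def s1_wd(src, prefix, sources, alpha):
    """Word-level informative speaker: S0 times the word-level listener^alpha."""
    unnorm = {}
    for wd, p in s0_wd(src, prefix).items():
        z = sum(s0_wd(w, prefix).get(wd, 0.0) for w in sources)
        listener = p / z  # posterior on src given wd, uniform prior
        unnorm[wd] = p * listener ** alpha
    total = sum(unnorm.values())
    return {wd: v / total for wd, v in unnorm.items()}


def s1_snt_ip_greedy(src, sources, alpha, max_len=20):
    """Greedy unrolling of the incremental informative speaker."""
    out = []
    while len(out) < max_len and (not out or out[-1] != EOS):
        dist = s1_wd(src, out, sources, alpha)
        out.append(max(dist, key=dist.get))
    return out


sources = list(FIRST_WORD)
definite = s1_snt_ip_greedy("the animals run", sources, alpha=2.0)
indefinite = s1_snt_ip_greedy("animals run", sources, alpha=2.0)
```

Because the reweighting happens word by word, only the next-word distribution has to be normalized at each step, which is what makes inference tractable.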

Examples

$S_1^{\text{SNT-IP}}$ is able to avoid many-to-one mappings by choosing more informative translations. For instance, its translation of (1) is “Ces animaux courent vite” (These animals run fast.). See Figures 1 and 2 for other examples of many-to-one mappings under $S_0^{\text{SNT}}$ avoided by $S_1^{\text{SNT-IP}}$.

2.2 Avoiding Explicit Distractors

While $S_1^{\text{SNT-IP}}$ can disambiguate between pairs of sentences, it has two shortcomings. First, it requires one (or more) distractors to be provided, so translation is no longer fully automatic. Second, because the distractor set $W$ consists of only a pair (or finite set) of sentences, $S_1^{\text{SNT-IP}}$ only cares about being informative up to the goal of distinguishing between these sentences. Intuitively, total meaning preservation is achieved by a translation which distinguishes the source sentence from every other sentence in the source language which differs in meaning.

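Both of these problems can be addressed by introducing a new “cyclic” model $S_1^{\text{SNT-CGP}}$ which reasons not about $L_1^{\text{SNT}}$ but about a pretrained translation model from target language to source language, $L_0^{\text{SNT}}$.

$$S_1^{\text{SNT-CGP}}(u \mid w) \propto S_0^{\text{SNT}}(u \mid w)\, L_0^{\text{SNT}}(w \mid u)^{\alpha} \tag{10}$$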

$S_1^{\text{SNT-CGP}}$ is like $S_1^{\text{SNT-GP}}$, but its goal is to produce a translation which allows a listener model (now $L_0^{\text{SNT}}$) to infer the original sentence, not among a small set of presupplied possibilities, but among all source language sentences. As such, an optimal translation $u$ of $w$ under $S_1^{\text{SNT-CGP}}$ has high probability of being generated by $S_0^{\text{SNT}}$ and high probability of being back-translated to $w$ by $L_0^{\text{SNT}}$. (Unlike back-translation to augment data during training (Sennrich et al., 2015), our model uses pretrained translators.)
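The trade-off expressed by the cyclic model can be illustrated by scoring a finite list of candidate translations with a forward probability and a back-translation probability. The sketch below assumes such a candidate list is already available; the function name and the probabilities are hypothetical, and the German sentences are those of Figure 2.

```python
import math


def rerank_cyclic(candidates, alpha):
    """Return the candidate u maximising log S0(u|w) + alpha * log L0(w|u).

    candidates: dict mapping each candidate translation u to a pair
    (forward probability of u given w, back-translation probability of w given u).
    """
    def score(u):
        p_forward, p_back = candidates[u]
        return math.log(p_forward) + alpha * math.log(p_back)

    return max(candidates, key=score)


# Hypothetical probabilities for the source "He is wearing glasses."
candidates = {
    "Er trägt eine Brille.": (0.6, 0.3),
    "Er trägt jetzt eine Brille.": (0.3, 0.9),
}

baseline_choice = rerank_cyclic(candidates, alpha=0.0)
informative_choice = rerank_cyclic(candidates, alpha=1.0)
```

Setting `alpha` to zero recovers the baseline ranking; larger values give more weight to recoverability of the source sentence.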

Incremental Model

Exact inference is again intractable, though as with $S_1^{\text{SNT-GP}}$, it is possible to approximate by subsampling from $S_0^{\text{SNT}}$. This is very close to the approach taken by Li et al. (2016), who find that reranking a set of outputs by probability of recovering the input “dramatically decreases the rate of dull and generic responses” in a question-answering task. However, because the subsample is small relative to $U$, they use this method in conjunction with a diversity-increasing decoding algorithm.

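As in the case with explicit distractors, we instead opt for an incremental model, now $S_1^{\text{SNT-CIP}}$, which approximates $S_1^{\text{SNT-CGP}}$. The definition of $S_1^{\text{SNT-CIP}}$ (12) is more complex than the incremental model with explicit distractors ($S_1^{\text{SNT-IP}}$) since $L_0^{\text{SNT}}$ must receive complete sentences, rather than partial ones like $L_1^{\text{WD}}$. As such, we need to marginalize over continuations of partial sentences in the target language:

$$S_1^{\text{WD-C}}(wd \mid w, c) \propto S_0^{\text{WD}}(wd \mid w, c) \left( \sum_{u} L_0^{\text{SNT}}(w \mid c + wd + u)\, S_0^{\text{SNT}}(u \mid w, c + wd) \right)^{\alpha} \tag{11}$$
$$S_1^{\text{SNT-CIP}}(u \mid w, c) = \prod_{t} S_1^{\text{WD-C}}(u[t] \mid w, c + u[:t]) \tag{12}$$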

Since the sum over continuations of $c$ in (11) is intractable to compute exactly, we approximate it by a single continuation, obtained by greedily unrolling $S_0^{\text{SNT}}$. The following pseudocode resembles the Python code implementing $S_1^{\text{WD-C}}$ (note the use of Python indexing conventions, and NumPy broadcasting). In practice, we fix WIDTH=2:

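import numpy as np

WIDTH = 2  # number of top candidate words rescored at each step

def s1_wd_c_fwd(s, c=(), *, alpha):
  # S0_WD, S0_SNT, L0_SNT: pretrained model wrappers, assumed in scope.
  # support is assumed sorted by descending probability under S0_WD.
  support, logprobs = S0_WD.fwd(src=s, c=list(c))
  scores = []
  for wd in support[:WIDTH]:
    # greedily unroll S0_SNT to complete the partial sentence c + [wd]
    ext = S0_SNT.fwd(src=s, c=list(c) + [wd])
    # log-probability of back-translating the completion into s
    scores.append(L0_SNT.logprob(tgt=s, src=ext))
  # combine in log space over the WIDTH rescored candidates only
  unnorm = np.asarray(logprobs[:WIDTH]) + alpha * np.asarray(scores)
  return support[int(np.argmax(unnorm))]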
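Read against (11), the pseudocode can be understood as replacing the sum over continuations with its value at a single greedy continuation and working in log space. The following is a sketch of that reading rather than a formula from the original text: $\hat{u}$ is notation introduced here for the greedy continuation, and $\alpha$ is the informativity hyperparameter. The first term corresponds to `logprobs` and the second to `scores`.

```latex
\log S_1^{\text{WD-C}}(wd \mid w, c) \;\approx\;
  \log S_0^{\text{WD}}(wd \mid w, c)
  + \alpha \, \log L_0^{\text{SNT}}(w \mid c + wd + \hat{u})
  + \text{const},
\qquad
\hat{u} = \text{greedy continuation of } c + wd \text{ under } S_0^{\text{SNT}}(\cdot \mid w, c + wd)
```

The constant absorbs the normalizer, which does not affect the argmax taken over candidate words.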

2.3 Evaluating the Informative Translator

Our objective is to improve meaning preservation without detracting from translation quality in other regards (e.g. grammaticality). We conduct our evaluations on English to German translation.

We use cycle-consistency as a measure of meaning preservation, since the ability to recover the original sentence requires meaning distinctions not to be collapsed. In evaluating cycle-consistency, it is important to use a different target-to-source translation mechanism from the one used to define $S_1^{\text{SNT-CIP}}$. Otherwise, the system has access to the model which evaluates it and may improve cycle-consistency without producing meaningful target language sentences. For this reason, we translate German sentences (produced by $S_0^{\text{SNT}}$ or $S_1^{\text{SNT-CIP}}$) back to English with Google Translate. To measure cycle-consistency, we use the BLEU metric (implemented with sacreBLEU (Post, 2018)), with the original sentence as the reference.
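The evaluation loop described above can be sketched as follows. This is an illustration under stated assumptions, not the evaluation code: `forward` and `backward` are hypothetical callables standing in for the translator under test and the independent back-translator, the lookup tables are invented, and a simple exact-match rate stands in for BLEU so that the example has no external dependencies.

```python
def cycle_consistency(sources, forward, backward, metric):
    """Score round-trip translations against the original sentences.

    forward / backward: callables mapping a sentence to its translation.
    metric: callable (hypotheses, references) -> float.
    """
    round_trips = [backward(forward(s)) for s in sources]
    return metric(round_trips, list(sources))


def exact_match(hypotheses, references):
    """Placeholder metric: percentage of sentences recovered exactly."""
    hits = sum(h == r for h, r in zip(hypotheses, references))
    return 100.0 * hits / len(references)


sources = ["He is wearing glasses.", "He wears glasses."]

# Hypothetical translators as lookup tables (sentences from Figure 2).
baseline = {s: "Er trägt eine Brille." for s in sources}
informative = {
    "He is wearing glasses.": "Er trägt jetzt eine Brille.",
    "He wears glasses.": "Er hat eine Brille.",
}
back = {
    "Er trägt eine Brille.": "He wears glasses.",
    "Er trägt jetzt eine Brille.": "He is wearing glasses.",
    "Er hat eine Brille.": "He wears glasses.",
}

baseline_score = cycle_consistency(sources, baseline.get, back.get, exact_match)
informative_score = cycle_consistency(sources, informative.get, back.get, exact_match)
```

To use BLEU instead, the `metric` argument could wrap sacreBLEU's `corpus_bleu`, passing the round-trip sentences as hypotheses and the originals as the single reference stream.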

To further ensure that translation quality is not compromised by $S_1^{\text{SNT-CIP}}$, we evaluate BLEU scores of the German sentences it produces.

We perform both evaluations (cycle-consistency and translation) on 750 sentences of the 2018 English-German WMT News test set (http://www.statmt.org/wmt18/translation-task.html); our implementation of $S_1^{\text{SNT-CIP}}$ was not efficient, and we could not evaluate on more sentences for reasons of time. We use greedy unrolling in all models (using beam search is a goal for future work). For $\alpha$ (which represents the trade-off between informativity and translation quality) we use a value obtained by tuning on validation data.

Results

As shown in Table 1, $S_1^{\text{SNT-CIP}}$ improves over $S_0^{\text{SNT}}$ not only in cycle-consistency, but in translation quality as well. This suggests that the goal of preserving information, in the sense defined by $S_1^{\text{SNT-CGP}}$ and approximated by $S_1^{\text{SNT-CIP}}$, is important for translation quality.

Model                      Cycle    Translate
$S_0^{\text{SNT}}$         43.35    37.42
$S_1^{\text{SNT-CIP}}$     47.34    38.29
Table 1: BLEU score on cycle-consistency (Cycle) and translation (Translate) for WMT, across baseline and informative models. Greedy unrolling is used for both models.

3 Conclusions

We identify a shortcoming of state-of-the-art translation systems and show that a version of the RSA framework’s informative speaker $S_1$, adapted to the domain of translation, alleviates this problem in a way which improves not only cycle-consistency but translation quality as well. The success of $S_1^{\text{SNT-CIP}}$ on two fairly similar languages suggests the possibility of larger improvements when translating between languages in which larger-scale differences exist in what information is obligatorily represented, such as evidentiality or formality marking.

References

  • Andreas and Klein (2016) Jacob Andreas and Dan Klein. 2016. Reasoning about pragmatics with neural listeners and speakers. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1173–1182. Association for Computational Linguistics.
  • Bahdanau et al. (2014) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
  • Cohn-Gordon et al. (2018) Reuben Cohn-Gordon, Noah Goodman, and Christopher Potts. 2018. Pragmatically informative image captioning with character-level inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 439–443. Association for Computational Linguistics.
  • Gehring et al. (2016) Jonas Gehring, Michael Auli, David Grangier, and Yann N Dauphin. 2016. A convolutional encoder model for neural machine translation. arXiv preprint arXiv:1611.02344.
  • Goodman and Frank (2016) Noah D Goodman and Michael C Frank. 2016. Pragmatic language interpretation as probabilistic inference. Trends in Cognitive Sciences, 20(11):818–829.
  • Kučera and Francis (1967) Henry Kučera and Winthrop Nelson Francis. 1967. Computational analysis of present-day American English. Dartmouth Publishing Group.
  • Li et al. (2016) Jiwei Li, Will Monroe, and Dan Jurafsky. 2016. A simple, fast diverse decoding algorithm for neural generation. arXiv preprint arXiv:1611.08562.
  • Mao et al. (2016) Junhua Mao, Jonathan Huang, Alexander Toshev, Oana Camburu, Alan L Yuille, and Kevin Murphy. 2016. Generation and comprehension of unambiguous object descriptions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 11–20.
  • Post (2018) Matt Post. 2018. A call for clarity in reporting BLEU scores. arXiv preprint arXiv:1804.08771.
  • Sennrich et al. (2015) Rico Sennrich, Barry Haddow, and Alexandra Birch. 2015. Improving neural machine translation models with monolingual data. arXiv preprint arXiv:1511.06709.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.
  • Vedantam et al. (2017) Ramakrishna Vedantam, Samy Bengio, Kevin Murphy, Devi Parikh, and Gal Chechik. 2017. Context-aware captions from context-agnostic supervision. In Computer Vision and Pattern Recognition (CVPR), volume 3.