Toward Making the Most of Context in Neural Machine Translation
Document-level machine translation manages to outperform sentence-level models by a small margin, but has failed to be widely adopted. We argue that previous research did not make clear use of the global context, and propose a new document-level NMT framework that deliberately models the local context of each sentence with awareness of the global context of the document in both source and target languages. We specifically design the model to deal with documents containing any number of sentences, including single sentences. This unified approach allows our model to be trained elegantly on standard datasets without needing separate training on sentence-level and document-level data. Experimental results demonstrate that our model outperforms Transformer baselines and previous document-level NMT models by substantial margins of up to 2.1 BLEU. We also provide analyses which show the benefit of context far beyond the neighboring two or three sentences that previous studies have typically incorporated.
Recent studies suggest that neural machine translation (NMT) [14, 1, 18] has achieved human parity, especially on resource-rich language pairs. However, standard NMT systems are designed for sentence-level translation, and thus cannot model dependencies among sentences or translate entire documents. To address this challenge, various document-level NMT models, i.e., context-aware models, have been proposed to leverage context beyond a single sentence [20, 10, 23, 22], achieving substantial improvements over their context-agnostic counterparts.
Figure 1 briefly illustrates typical context-aware models, where the source and/or target document context is regarded as an additional input stream parallel to the current sentence and incorporated into each layer of the encoder and/or decoder [23, 16]. More specifically, the representation of each word in the current sentence is a deep hybrid of both global document context and local sentence context in every layer. We notice that these hybrid encoding approaches have two main weaknesses:
Models are context-aware, but do not fully exploit the context. The deep hybrid makes the model more sensitive to noise in the context, especially when the context is enlarged. This could explain why previous studies find that enlarging the context degrades performance. Therefore, these approaches do not take full advantage of the entire document context.
Models translate documents, but cannot translate single sentences. Because the deep hybrid requires global document context as additional input, these models are no longer compatible with sentence-level translation based solely on the local sentence context. As a result, these approaches usually translate single-sentence documents poorly, as no document-level context is available.
In this paper, we mitigate these two weaknesses by designing a general-purpose NMT architecture which can fully exploit the context in documents with an arbitrary number of sentences. To avoid the deep hybrid, our architecture balances local context and global context more deliberately. More specifically, it encodes the local context of each source sentence independently, instead of mixing it with global context from the beginning, so it remains robust when the global context is large and noisy. Furthermore, it translates in a sentence-by-sentence manner with access to the partially generated document translation as the target global context, which allows the local context to govern the translation process for single-sentence documents.
We highlight our contributions in three aspects:
We propose a new NMT framework that is able to deal with documents containing any number of sentences, including single-sentence documents, making training and deployment simpler and more flexible.
We conduct experiments on four document-level translation benchmark datasets, which show that the proposed unified approach outperforms Transformer baselines and previous state-of-the-art document-level NMT models both for sentence-level and document-level translation.
Based on thorough analyses, we demonstrate that the document context really matters; and the more context provided, the better our model translates. This finding is in contrast to the prevailing consensus that a wider context deteriorates translation quality.
2 Related Work
Context beyond the current sentence is crucial for machine translation. Bawden et al. (2018), Läubli et al. (2018), Müller et al. (2018), and Voita et al. (2018) show that without access to the document-level context, NMT is likely to fail to maintain lexical, tense, deixis and ellipsis consistencies, resolve anaphoric pronouns, and handle other discourse phenomena; they also propose corresponding test sets for evaluating discourse phenomena in NMT.
Most current document-level NMT models can be classified into two main categories: context-aware models and post-processing models. Post-processing models introduce an additional module that learns to refine translations produced by context-agnostic NMT systems into more discourse-coherent ones [21, 19]. While this kind of approach is easy to deploy, the two-stage generation process may result in error accumulation.
In this paper, we focus mainly on context-aware models, since post-processing approaches can be incorporated into and facilitate any NMT architecture. Tiedemann and Scherrer (2017) and Junczys-Dowmunt (2019) use the concatenation of multiple sentences (usually a small number of preceding sentences) as NMT's input/output. Going beyond simple concatenation, Jean et al. (2017) introduce a separate context encoder for a few previous source sentences. Wang et al. (2017) include a hierarchical RNN to summarize source-side context. Other approaches use a dynamic cache memory to store representations of previously translated content [17, 5, 6, 8]. Miculicich et al. (2018), Zhang et al. (2018), Yang et al. (2019), Maruf et al. (2019) and Tan et al. (2019) extend context-aware models to the Transformer architecture with additional context-related modules.
While claiming that modeling the whole document is unnecessary, these models only take into account a few surrounding sentences [8, 10, 23, 22], or even only monolingual context [23, 22, 16], which is not necessarily sufficient for translating a document. In contrast, our model can consider an entire, arbitrarily long document and simultaneously exploit context in both the source and target languages. Furthermore, most of these document-level models cannot be applied to sentence-level translation, lacking simplicity and flexibility in practice. They rely on variants of components specifically designed for document context (e.g., encoder/decoder-to-context attention embedded in all layers [23, 10, 16]) and are thus limited to the scenario where document context must be available as an additional input stream. Thanks to its general-purpose design, the proposed model handles translation regardless of the number of sentences in the input text.
Sentence-level NMT Standard NMT models usually address sentence-level translation (SentNmt), adopting an encoder-decoder framework. A SentNmt model aims to approximate the conditional distribution p(y | x) over a target sentence y given a source sentence x. The training criterion is to maximize the conditional log-likelihood on abundant parallel bilingual data of i.i.d. observations.
Document-level NMT Given a document-level parallel dataset D = {⟨X, Y⟩}, where X = (x_1, …, x_M) is a source document containing M sentences and Y = (y_1, …, y_M) is the sentence-aligned target document, the training criterion for a document-level NMT model (DocNmt) is to maximize the conditional log-likelihood over the pairs of document translations, sentence by sentence:

L(θ) = Σ_{⟨X, Y⟩ ∈ D} Σ_{i=1}^{M} log p(y_i | y_{<i}, x_i, x_{−i}; θ),

where y_{<i} denotes the translated sentences prior to y_i, while x_{−i} denotes the source sentences other than the current i-th source sentence x_i.
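The conditioning structure of this objective can be sketched in a few lines of Python. Here `cond_logprob` is a hypothetical stand-in for the model's scorer (a uniform distribution over a 4-word vocabulary), not the actual NMT model:

```python
import math

def cond_logprob(tgt_sent, tgt_history, src_sent, src_context):
    # Placeholder scorer: uniform probability over a 4-word vocabulary.
    return len(tgt_sent) * math.log(1.0 / 4.0)

def doc_log_likelihood(src_doc, tgt_doc):
    """Sum sentence-level conditional log-probs over a document pair:
    each target sentence is conditioned on the translated history y_<i,
    the current source sentence x_i, and the remaining source x_{-i}."""
    total = 0.0
    for i, (x_i, y_i) in enumerate(zip(src_doc, tgt_doc)):
        history = tgt_doc[:i]                    # y_<i
        context = src_doc[:i] + src_doc[i + 1:]  # x_{-i}
        total += cond_logprob(y_i, history, x_i, context)
    return total
```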
By the definition of local and global contexts, general translation can be seen as a hierarchical natural language understanding and generation problem based on local and global contexts. Accordingly, we propose a general-purpose architecture to better exploit context in machine translation. Figure 2 illustrates the idea of our proposed architecture:
Given a source document, the encoder builds local context for each individual sentence (local encoding), then retrieves global context from the entire source document to capture inter-sentential dependencies (global encoding), and forms hybrid contextual representations (context fusion). For single-sentence documents, the global encoding is dynamically disabled and the local context flows directly to the decoder to dominate translation. (Section 4.1)
Once the local and global understanding of the source document is constructed, the decoder generates the target document sentence by sentence, based on the source representations of the current sentence as well as the target global context from the previously translated history and the local context from the partial translation so far. (Section 4.2)
This general-purpose modeling allows the proposed model to fully utilize bilingual, entire-document context, and to go beyond the restricted scenario in which models must take document context as an additional input stream and fail to translate single sentences. These two advantages meet our expectation of a unified and general NMT framework.
Lexical and Positional Encoding
The source input is transformed into representations by lexical and positional encoding. We use the word position embedding of the Transformer to represent the ordering of words. Note that we reset the word position for each sentence, i.e., the j-th word of every sentence shares the same word position embedding PE[j]. Besides, we introduce a segment embedding SE[i] to represent the i-th sentence. The representation of the j-th word x_{i,j} of the i-th sentence is therefore given by E[x_{i,j}] = WE[x_{i,j}] + PE[j] + SE[i], where WE[·] denotes the word embedding.
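A minimal sketch of this encoding scheme follows. The embedding tables (`word_emb`, `pos_emb`, `seg_emb`) are toy random stand-ins for the model's learned parameters; only the indexing scheme (positions reset per sentence, one segment vector per sentence) reflects the description above:

```python
import random

random.seed(0)
DIM = 4
word_emb = {w: [random.random() for _ in range(DIM)] for w in ["a", "b", "c"]}
pos_emb = [[random.random() for _ in range(DIM)] for _ in range(10)]  # PE[j]
seg_emb = [[random.random() for _ in range(DIM)] for _ in range(10)]  # SE[i]

def encode_document(doc):
    """doc: list of sentences (lists of tokens) -> list of per-token vectors,
    summing word, (per-sentence-reset) position, and segment embeddings."""
    reps = []
    for i, sent in enumerate(doc):
        for j, w in enumerate(sent):  # j restarts at 0 for each sentence
            vec = [we + pos_emb[j][d] + seg_emb[i][d]
                   for d, we in enumerate(word_emb[w])]
            reps.append(vec)
    return reps
```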
Local Context Encoding
We construct the local context for each sentence with a stack of standard Transformer layers. Take the i-th source sentence x_i as an example. The local encoder leverages N stacked identical layers to map the sentence into its encoded representations:

H_i^(l) = SelfAttn(Q = H_i^(l−1), K = H_i^(l−1), V = H_i^(l−1)),

where SelfAttn(·) denotes self-attention, with Q, K, V indicating queries, keys, and values, respectively, and the attention is performed in a multi-headed fashion. We let the input representations E[x_i] be the 0-th layer representations H_i^(0), and take the top, N-th layer of the local encoder as the local context of each sentence, i.e., H_i^local = H_i^(N).
Global Context Encoding
We add an additional layer on top of the local context encoding layers, which retrieves global context from the entire document by a segment-level relative attention, and outputs final representations based on the hybrid local and global context via a gated context fusion mechanism.
Segment-level Relative Attention Given the local representations H^local of each sentence, we propose to extend relative attention from the token level to the segment level to model the inter-sentential global context:

H^global = Seg-Attn(Q = H^local, K = H^local, V = H^local),

where Seg-Attn(·) denotes the proposed segment-level relative attention. Taking the representation h_{i,j} of the j-th word of the i-th sentence as a query, its contextual representation is computed over all words (e.g., the k-th word of the t-th sentence, x_{t,k}) in the document, with respect to the sentence (segment) they belong to:

c_{i,j} = Σ_t Σ_k α_{(i,j),(t,k)} (W^V h_{t,k}),

where α_{(i,j),(t,k)} is the attention weight of h_{i,j} to h_{t,k}, obtained by a softmax over the attention logits. The corresponding attention logit is computed with respect to the relative sentence distance by:

e_{(i,j),(t,k)} = (W^Q h_{i,j})^T (W^K h_{t,k} + a_{i−t}),   (1)

where a_{i−t} is a parameter vector corresponding to the relative distance i − t between the i-th and t-th sentences, providing inter-sentential clues. W^Q, W^K, and W^V are linear projection matrices for the queries, keys and values, respectively.
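A numeric sketch of the per-query computation may help. The projection matrices are omitted (vectors are used directly), the values in `rel_bias` are toy stand-ins for the learned relative-distance parameters, and clipping distances to a maximum is our assumption, borrowed from token-level relative attention:

```python
import math

MAX_DIST = 3
rel_bias = {d: 0.1 * d for d in range(-MAX_DIST, MAX_DIST + 1)}  # toy a_{i-t}

def seg_rel_attention(query, q_sent, keys, values, key_sents):
    """Attend from one query token (in sentence q_sent) over all key tokens,
    adding a bias indexed by the relative *sentence* distance."""
    logits = []
    for k_vec, t in zip(keys, key_sents):
        dot = sum(a * b for a, b in zip(query, k_vec))
        dist = max(-MAX_DIST, min(MAX_DIST, q_sent - t))
        logits.append(dot + rel_bias[dist])
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]  # stable softmax
    z = sum(exps)
    weights = [e / z for e in exps]
    out = [sum(w * v[d] for w, v in zip(weights, values))
           for d in range(len(values[0]))]
    return out, weights
```

With identical keys, the weights are determined purely by the sentence-distance bias, which makes the mechanism easy to inspect.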
Gated Context Fusion After the global context is retrieved, we adopt a gating mechanism to obtain the final encoder representations by fusing local and global context:

g = σ(W_g [H^local; H^global]),
H = g ⊙ H^local + (1 − g) ⊙ H^global,

where W_g is a learnable linear transformation, [·;·] denotes the concatenation operation, σ is the sigmoid activation which keeps the value of the fusion gate g between 0 and 1, and ⊙ indicates element-wise multiplication.
The goal of the decoder is to generate the translation sentence by sentence, considering the previously generated sentences as target global context. A natural idea is to store the hidden states of previous target translations and allow the self-attention of the decoder to access these hidden states as an extended history context.
To this end, we leverage and extend Transformer-XL as our decoder. Transformer-XL is a Transformer variant designed to cache and reuse the hidden states computed for the previous segment as an extended context, so that long-term dependency information occurring many words back can propagate through the recurrent connections between segments, which exactly meets our requirement of generating document-long text. We cast each sentence as a "segment" in translation tasks and equip the Transformer-XL based decoder with cross-attention to retrieve time-dependent source context for the current sentence. Formally, given two consecutive sentences y_{i−1} and y_i, the l-th layer of our decoder first employs self-attention over the extended history context:

S̃_i^(l−1) = [SG(S_{i−1}^(l−1)); S_i^(l−1)],
A_i^(l) = Rel-SelfAttn(Q = S_i^(l−1), K = S̃_i^(l−1), V = S̃_i^(l−1)),

where SG(·) stands for stop-gradient, and Rel-SelfAttn(·) is a variant of self-attention with word-level relative position encoding; for more specific details, please refer to Dai et al. (2019). After that, the cross-attention module fetching the source context from the encoder representations H_i is computed as:

S_i^(l) = CrossAttn(Q = A_i^(l), K = H_i, V = H_i).

Based on the final representations S_i of the last decoder layer, the output probability of the current target sentence y_i is computed by a softmax over the output projection of S_i.
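The caching behavior can be sketched as follows. The attention itself is replaced by a placeholder mean pooling for brevity, so this only illustrates how detached (stop-gradient) states from previous sentences extend the context of the current one, not the actual Transformer-XL computation:

```python
class CachedDecoderLayer:
    def __init__(self):
        self.cache = []  # hidden states of previous sentences (detached)

    def forward(self, hidden_states):
        """hidden_states: list of vectors for the current sentence."""
        # Extended context = cached previous-sentence states + current states.
        extended = self.cache + hidden_states
        dim = len(hidden_states[0])
        # Placeholder "attention": every position averages the extended context.
        mean = [sum(h[d] for h in extended) / len(extended) for d in range(dim)]
        out = [[(h[d] + mean[d]) / 2 for d in range(dim)] for h in hidden_states]
        # Cache current states for the next sentence; in training these would
        # be detached from the graph (the stop-gradient above).
        self.cache = self.cache + [list(h) for h in hidden_states]
        return out
```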
We experiment on four widely used document-level parallel datasets in two language pairs for machine translation:
TED (Zh-En/En-De). The Chinese-English and English-German TED datasets are from IWSLT 2015 and 2017 evaluation campaigns respectively. We mainly explore and develop our approach on TED Zh-En, where we take dev2010 as development set and tst2010-2013 as testset. For TED En-De, we use tst2016-2017 as our testset and the rest as development set.
News (En-De). We take News Commentary v11 as our training set. The WMT newstest2015 and newstest2016 are used for development and testsets respectively.
Europarl (En-De). The corpus is extracted from Europarl v7 according to the method described in Maruf et al. (2019).
We applied byte pair encoding [12, BPE] to segment all sentences, with 32K merge operations. We split each document into chunks of 20 sentences to reduce memory consumption when training our proposed models.
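The document-splitting step is a simple chunking operation; a sketch:

```python
def split_document(sentences, max_sents=20):
    """Split a document (list of sentences) into chunks of at most
    max_sents sentences, preserving order."""
    return [sentences[i:i + max_sents]
            for i in range(0, len(sentences), max_sents)]
```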
We used the Transformer architecture as our sentence-level, context-agnostic baseline and developed our proposed model on top of it. For models on TED Zh-En, we used a configuration smaller than transformer_base in model dimension, feed-forward dimension, and number of layers. For models on the remaining datasets, we set the model/feed-forward dimensions to 512/2048.
We used the Adam optimizer and the same learning rate schedule strategy as Vaswani et al. (2017), with 8,000 warmup steps.
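The schedule of Vaswani et al. (2017) is the inverse-square-root schedule with linear warmup; a sketch with the 8,000 warmup steps used here (the model dimension below is illustrative, not the paper's exact configuration):

```python
def noam_lr(step, d_model=512, warmup=8000):
    """Learning rate at a given step: linear warmup to step == warmup,
    then inverse-square-root decay, scaled by d_model ** -0.5."""
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)
```

The rate peaks exactly at the warmup step and decays afterwards.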
The training batch consisted of approximately 2048 source tokens and 2048 target tokens. Label smoothing of value 0.1  was used for training.
For inference, we used beam search with a beam width of 5 and a length penalty of 0.6. The evaluation metric is BLEU.
5.1 Main Results
Document-level Translation We list the results of our experiments in Table 1, comparing our model with four context-aware NMT models: Document-aware Transformer [23, DocT], Hierarchical Attention NMT [10, HAN], Selective Attention NMT [9, SAN] and Query-guided Capsule Network [22, QCN].
As shown in Table 1, by leveraging document context, our proposed model obtains gains of 2.1, 2.0, 2.5, and 1.0 BLEU over sentence-level Transformer baselines on the TED Zh-En, TED En-De, News and Europarl datasets, respectively. Among them, our model achieves new state-of-the-art results on TED Zh-En and Europarl, showing the superiority of exploiting the whole document context. Though our model is not the best on the TED En-De and News tasks, it is still comparable with QCN and HAN, and achieves the best average performance on the English-German benchmarks, by at least 0.47 BLEU over the best previous model. We suggest this is probably because we did not apply the two-stage training scheme used by Miculicich et al. (2018) or the regularizations introduced by Yang et al. (2019). In addition, while some training speed is sacrificed, the parameter increment and decoding speed remain manageable.
Sentence-level Translation We compare performance on single-sentence translation in Table 2, which demonstrates the good compatibility of our proposed model with both document and sentence translation, whereas the performance of the other approaches greatly lags behind the sentence-level baseline. The reason is that the previous approaches require document context as a separate input stream, while our proposed model does not. This difference makes both document- and sentence-level translation feasible within one unified framework. Therefore, our proposed model can be directly used in general translation tasks with input text of any number of sentences, which is more deployment-friendly.
5.2 Analysis and Discussion
Does Bilingual Context Really Matter? Yes.
Table 3: Ablation study (BLEU; document-level BLEU in parentheses).

| Model | BLEU (d-BLEU) |
| SentNmt | 11.4 (21.0) |
| DocNmt (documents as input/output) | n/a (17.0) |
| Modeling source context (Doc2Sent) | |
| + reset word positions for each sentence | 10.0 |
| + segment embedding | 10.5 |
| + segment-level relative attention | 12.2 |
| + context fusion gate | 12.4 |
| Modeling target context | |
| Transformer-XL decoder [Sent2Doc] | 12.4 |
To investigate how important the bilingual context is and the contribution of each component, we summarize the ablation study in Table 3. First of all, using the entire document directly as input and output cannot even generate a document translation with the same number of sentences as the source document, and is much worse than the sentence-level baseline and our model in terms of document-level BLEU. For source context modeling, simply casting the whole source document as one input sequence (Doc2Sent) does not work. Meanwhile, resetting word positions and introducing segment embeddings for each sentence alleviate this problem, which verifies one of our motivations: the model should focus more on local sentences. Moreover, the gains from the segment-level relative attention and the gated context fusion mechanism demonstrate that retrieving and integrating source global context are useful for document translation. As for the target context, employing a Transformer-XL decoder to exploit the target global history also leads to better document translation. This contrasts somewhat with previous claims that using target context leads to error propagation. In the end, by jointly modeling both source and target context, our final model obtains the best performance.
Effect of Quantity of Context: the More, the Better. We also experiment with how the quantity of context affects our model in document translation. As shown in Figure 3, providing only one adjacent sentence as context already helps document translation, and the more context is given, the better the translation quality, although there does seem to be an upper limit of around 20 sentences. Successfully incorporating context of this size is something related work has not achieved [23, 10, 22]. We attribute this advantage to our hierarchical model design, in which the well-formed, uncorrupted local context guides the model to gain more than it loses from the increasingly noisy global context.
Effect of Transfer Learning: Data Hunger Remains a Problem for Document-level Translation. Given the limited availability of document-level parallel data, exploiting sentence-level parallel corpora or monolingual document-level corpora has drawn increasing attention. We investigate transfer learning (TL) approaches on TED Zh-En. We pretrain our model on the WMT18 Zh-En sentence-level parallel corpus of 7M sentence pairs, where every single sentence is regarded as a document. We then finetune the pretrained model on the TED Zh-En document-level parallel data (source & target TL). We also compare to a variant in which only the encoder is initialized (source TL). As shown in Table 4, transfer learning can alleviate the need for document-level data in the source and target languages to some extent. However, the scarcity of document-level parallel data still prevents document-level NMT from scaling up.
What Does the Model Learn about Context? A Case Study. Furthermore, we are interested in what the proposed model learns about context. In Figure 4, we visualize the sentence-to-sentence attention weights of a source document based on the segment-level relative attention. Formally, the weight of the i-th sentence attending to the t-th sentence is computed by averaging, over the tokens of the i-th sentence, the attention weights α derived from the logits in Eq. (1). As shown in Figure 4, we find very interesting patterns (which are also prevalent in other cases): 1) the first two sentences (blue frame), which contain the main topic and idea of a document, seem to be very useful context for all sentences; 2) the adjacent previous and subsequent sentences (red and purple diagonals, respectively) draw dense attention, which indicates the importance of the surrounding context; 3) although surrounding context is crucial, the subsequent sentence significantly outweighs the previous one; this may imply that the lack of target future information, but availability of past information, in the decoder forces the encoder to retrieve more knowledge about the next sentence than the previous one; 4) the model seems not to care much about the current sentence, probably because the local context can flow through the context fusion gate, so the segment-level relative attention focuses on fetching useful global context; 5) the 6-th sentence also attracts attention from all the others (brown frame), and may play a special role in the inspected document.
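One way to compute such sentence-to-sentence weights from token-level attention, assuming the averaging described above (our reading of the visualization procedure, with hypothetical names):

```python
def sent_to_sent_attention(token_attn, sent_of_token):
    """token_attn[q][k]: attention weight of query token q to key token k.
    sent_of_token[t]: sentence index of token t.
    Returns A[i][j]: average attention mass that tokens of sentence i
    place on sentence j (rows sum to 1 if token rows sum to 1)."""
    n_sents = max(sent_of_token) + 1
    counts = [0] * n_sents
    A = [[0.0] * n_sents for _ in range(n_sents)]
    for q, row in enumerate(token_attn):
        i = sent_of_token[q]
        counts[i] += 1
        for k, w in enumerate(row):
            A[i][sent_of_token[k]] += w
    for i in range(n_sents):
        A[i] = [v / counts[i] for v in A[i]]
    return A
```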
Analysis on Discourse Phenomena. We also examine whether the proposed model actually learns to utilize document context to resolve discourse inconsistencies that context-agnostic models cannot handle. We use the contrastive test sets for the evaluation of discourse phenomena in English-Russian by Voita et al. (2018). There are four test sets in the suite, covering deixis, lexical consistency, ellipsis (inflection), and ellipsis (verb phrase). Each test set contains groups of contrastive examples consisting of a positive translation with the correct discourse phenomenon and negative translations with incorrect phenomena. The goal is to determine whether a model is more likely to generate the correct translation than the incorrect variants. We summarize the results in Table 5. Our model is better at resolving discourse inconsistencies than the context-agnostic baseline. Voita et al. (2018) use a context-agnostic baseline, trained on larger data, to generate first-pass drafts, and then perform post-processing; this is not directly comparable, but could easily be combined with our model to achieve better results.
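The contrastive evaluation protocol reduces to a simple scoring comparison; a sketch, where `score` is a hypothetical stand-in for the model's (log-)probability of a translation:

```python
def contrastive_accuracy(examples, score):
    """examples: list of (positive, [negatives]) translation groups.
    An example counts as correct if the positive translation scores
    strictly higher than every negative variant."""
    correct = 0
    for positive, negatives in examples:
        if all(score(positive) > score(neg) for neg in negatives):
            correct += 1
    return correct / len(examples)
```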
In this paper, we propose a unified local and global NMT framework, which can successfully exploit context regardless of how many sentence(s) are in the input. Extensive experimentation and analysis show that our model has indeed learned to leverage a larger context. In future work we will investigate the feasibility of extending our approach to other document-level NLP tasks, e.g., summarization.
Shujian Huang is the corresponding author. This work was supported by the National Science Foundation of China (No. U1836221, 61772261, 61672277). Zaixiang Zheng was also supported by China Scholarship Council (No. 201906190162). Alexandra Birch was supported by the European Union’s Horizon 2020 research and innovation programme under grant agreements No 825299 (GoURMET) and also by the UK EPSRC fellowship grant EP/S001271/1 (MTStretch).
- Equal contribution.
- This work was done when Zaixiang was visiting at the University of Edinburgh.
- The last two corpora are from Maruf et al. (2019).
- (2015) Neural machine translation by jointly learning to align and translate. In ICLR.
- (2019) Transformer-XL: attentive language models beyond a fixed-length context. In ACL.
- (2018) Achieving human parity on automatic Chinese to English news translation. arXiv preprint arXiv:1803.05567.
- (2014) Adam: a method for stochastic optimization. In ICLR.
- (2018) Modeling coherence for neural machine translation with dynamic and topic caches. In COLING.
- (2018) Fusing recency into neural machine translation with an inter-sentence gate model. In COLING.
- (2019) Pretrained language models for document-level neural machine translation. arXiv preprint.
- (2018) Document context neural machine translation with memory networks. In ACL.
- (2019) Selective attention for context-aware neural machine translation. In NAACL-HLT.
- (2018) Document-level neural machine translation with hierarchical attention networks. In EMNLP.
- (2002) BLEU: a method for automatic evaluation of machine translation. In ACL.
- (2016) Neural machine translation of rare words with subword units. In ACL.
- (2018) Self-attention with relative position representations. In NAACL-HLT.
- (2014) Sequence to sequence learning with neural networks. In NIPS.
- (2016) Rethinking the inception architecture for computer vision. In CVPR.
- (2019) Hierarchical modeling of global context for document-level neural machine translation. In EMNLP-IJCNLP.
- (2018) Learning to remember translation history with a continuous cache. TACL.
- (2017) Attention is all you need. In NIPS.
- (2019) Context-aware monolingual repair for neural machine translation. In EMNLP-IJCNLP.
- (2017) Exploiting cross-sentence context for neural machine translation. In EMNLP.
- (2019) Modeling coherence for discourse neural machine translation. In AAAI.
- (2019) Enhancing context modeling with a query-guided capsule network for document-level translation. In EMNLP-IJCNLP.
- (2018) Improving the transformer translation model with document-level context. In EMNLP.