Global-Context Neural Machine Translation through Target-Side Attentive Residual Connections
Neural sequence-to-sequence models achieve remarkable performance not only in machine translation (MT) but also in other language processing tasks. One of the reasons for their effectiveness is the ability of their decoder to capture contextual information through its recurrent layer. However, its sequential modeling over-emphasizes the local context, i.e. the previously translated word in the case of MT. As a result, the model ignores important information from the global context of the translation. In this paper, we address this limitation by introducing attentive residual connections from the previously translated words to the output of the decoder, which enable the learning of longer-range dependencies between words. The proposed model can emphasize any of the previously translated words, as opposed to only the last one, gaining access to the global context of the translated text. The model outperforms strong neural MT baselines on three language pairs, as well as a neural language modeling baseline. The analysis of the attention learned by the decoder confirms that it emphasizes a wide context, and reveals resemblance to syntactic-like structures.
Neural machine translation has recently become the state-of-the-art approach to machine translation. Due to its quality and simplicity, in contrast to the previous phrase-based statistical machine translation approaches, neural machine translation (NMT) has been adopted widely in both academia and industry. The effectiveness of attention-based NMT can be largely attributed to the ability of its decoder to capture contextual information through its recurrent layer. However, its sequential modeling over-emphasizes the local context, mainly the previously translated word, and consequently gives insufficient importance to the global context of a translated text. Ignoring the global context may lead to sub-optimal lexical choices and imperfect fluency of the output text.
To address this limitation, we propose a novel approach, represented in Figure 2, which is based on attentive residual connections from the previously translated words to the output of the decoder, within an attention-based NMT architecture. The ability of the model to access the global context at any time improves the flow of information and enables the learning of longer-range dependencies between words than in previous models. Moreover, these benefits require only a small computational overhead with respect to previous models. We demonstrate the generality of the proposed approach by applying it, in addition to NMT, to language modeling. For both tasks, our approach outperforms strong baselines, in terms of BLEU and perplexity scores respectively.
Our contributions are the following:
We propose and compare several options for using attentive residual connections within a standard decoder for attention-based NMT, which enable access to the global context of a translated text.
We demonstrate consistent improvements over strong SMT and NMT baselines for three language pairs (English-to-Chinese, Spanish-to-English, and English-to-German), as well as for language modeling.
We perform an ablation study and analyze the learned attention function in terms of its behavior and structure, providing additional insights on its actual contributions.
Enhancing sequential models in order to better capture the structure of sentences has been explored in a variety of studies. Many of them investigated recursive long short-term memory (LSTM) architectures to combine words into phrases and higher-level arrangements for generating sentence structures. More specifically, tree-structured LSTMs have been proposed to model syntactic properties of a sentence. Other architectures were designed to model sentences as graphs, or used a self-attention memory to capture word dependencies or relations in a sentence. In contrast to the latter study, which modified the LSTM units using memory networks, the self-attention mechanism proposed here uses residual connections and does not interfere with the recurrent layer, thus preserving its efficiency. Several studies have also incorporated explicit structural information into NMT. In the tree-to-sequence approach, the encoder is modified to use information from the syntactic tree of the source sentence. In the string-to-tree model, linearized trees are used to incorporate structural information of the source and target sentences in a sequential model. In the sequence-to-dependency method, a dependency tree is used on the decoder side. All these approaches rely on external parsers to obtain structural information. In the present work, we let the network learn meaningful connections by itself. The dependencies that appear to be learned are not strictly syntactic like those obtained from a parser, but they are important for optimizing the objective function of the translation model.
To account for longer-range dependencies, our model relies on residual connections, also called shortcut connections, which have been shown to improve the learning process of deep neural networks by addressing the vanishing gradient problem. These connections create a direct path from previous layers, which helps the transmission of information. Recently, several architectures using residual connections with LSTMs have been proposed, applying the shortcut connections between LSTM layers in the spatial domain. In contrast, here we propose to use residual connections in the temporal domain.
3 Background: Neural Machine Translation
Neural machine translation aims to compute the conditional distribution of emitting a sentence in a target language given a sentence in a source language, noted $p(y \mid x; \theta)$, where $\theta$ is the set of parameters of the neural model, and $x$ and $y$ are respectively the representations of the source and target sentences as sequences of words. The parameters are learned by training a sequence-to-sequence neural model on a corpus of parallel sentences. In particular, the learning objective is typically to maximize the following conditional log-likelihood over the $N$ training pairs:

$$\max_{\theta} \sum_{n=1}^{N} \log p\big(y^{(n)} \mid x^{(n)}; \theta\big)$$
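As a minimal numerical illustration of this objective, the corpus-level log-likelihood can be computed from per-token model probabilities. This is a sketch in NumPy; the input format (one array of per-token probabilities per sentence pair) is our own assumption for illustration.

```python
import numpy as np

def corpus_log_likelihood(token_probs):
    """Sum of log p(y_t | y_<t, x) over all target tokens of a corpus.

    token_probs: list with one 1-D array per sentence pair, holding the
    probability the model assigns to each reference target token
    (a hypothetical input format, for illustration only).
    """
    return sum(float(np.log(p).sum()) for p in token_probs)

# Two toy "sentence pairs" with per-token model probabilities.
probs = [np.array([0.5, 0.25]), np.array([0.1])]
ll = corpus_log_likelihood(probs)
```

Maximizing this quantity over $\theta$ (in practice, minimizing its negative by gradient descent) is what training a sequence-to-sequence model amounts to.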
The attention-based NMT model has become the de-facto baseline for machine translation. The model is based on RNNs, typically using gated recurrent units (GRUs) or LSTMs. The architecture comprises three main components: an encoder, a decoder, and an attention mechanism.
The goal of the encoder is to build meaningful representations of the source sentences. It consists of a bidirectional RNN which includes contextual information from past and future words into the vector representation $h_i$ of a particular word $x_i$, formally defined as follows:

$$h_i = \big[\overrightarrow{h}_i; \overleftarrow{h}_i\big], \qquad \overrightarrow{h}_i = f\big(\overrightarrow{h}_{i-1}, x_i\big), \qquad \overleftarrow{h}_i = f\big(\overleftarrow{h}_{i+1}, x_i\big)$$

Here, $\overrightarrow{h}_i$ and $\overleftarrow{h}_i$ are the hidden states of the forward and backward passes of the bidirectional RNN respectively, and $f$ is a non-linear activation function.
The decoder (Figure 1) is in essence a recurrent language model. At each time step, it predicts a target word $y_j$ conditioned on the previous words and the information from the encoder, using the following posterior probability:

$$p(y_j \mid y_1, \dots, y_{j-1}, x) = g(s_j, y_{j-1}, c_j) \qquad (1)$$

where $g$ is a non-linear multilayer function that outputs the probability of $y_j$.
The hidden state $s_j$ of the decoder is defined as:

$$s_j = f(s_{j-1}, y_{j-1}, c_j) \qquad (2)$$

It depends on a context vector $c_j$ that is learned by the attention mechanism.
The attention mechanism allows the decoder to select which parts of the source sentence are most useful for predicting the next output word. This goal is achieved by considering a weighted sum over all hidden states of the encoder as follows:

$$c_j = \sum_{i=1}^{T_x} \alpha_{ji} h_i$$

where $\alpha_{ji}$ is a weight calculated using a normalized exponential (softmax) function, also known as the alignment function, which computes how good the match is between the input at position $i$ and the output at position $j$:

$$\alpha_{ji} = \frac{\exp(e_{ji})}{\sum_{k=1}^{T_x} \exp(e_{jk})}, \qquad e_{ji} = a(s_{j-1}, h_i)$$
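The weighted-sum computation described above can be sketched as follows. This is a minimal NumPy implementation of additive (Bahdanau-style) attention; the parameter names `W_a`, `U_a`, `v_a` and the tensor shapes are our own conventions, not taken from the paper.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_context(s_prev, H, W_a, U_a, v_a):
    """Additive (Bahdanau-style) attention over encoder states: a sketch.

    s_prev: previous decoder state, shape (n,)
    H:      encoder hidden states, one row per source position, shape (T_x, 2n)
    W_a, U_a, v_a: attention parameters (names are our own convention)
    Returns the context vector c_j and the alignment weights alpha_j.
    """
    scores = np.tanh(s_prev @ W_a.T + H @ U_a.T) @ v_a  # one score per source word
    alpha = softmax(scores)                             # normalized alignment weights
    c = alpha @ H                                       # weighted sum of encoder states
    return c, alpha

rng = np.random.default_rng(0)
n, p, T_x = 4, 3, 5
c, alpha = attention_context(
    rng.standard_normal(n), rng.standard_normal((T_x, 2 * n)),
    rng.standard_normal((p, n)), rng.standard_normal((p, 2 * n)),
    rng.standard_normal(p))
```

The softmax guarantees that the alignment weights are positive and sum to one, so the context vector is always a convex combination of the encoder states.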
4 Decoder with Attentive Residual Connections
The state-of-the-art decoder of the attention-based NMT model uses a residual connection from the previously predicted word to the output classifier in order to enhance translation performance. As can be seen in Equation 1, the probability of predicting a particular word is calculated by a function which takes as input the hidden state of the recurrent layer $s_j$, the representation of the previously predicted word $y_{j-1}$, and the context vector $c_j$. In theory, $s_j$ should be enough for predicting the next word, given that it depends on the other two local-context components according to Equation 2. However, this model over-emphasizes the local context (through $s_j$ and $y_{j-1}$), and essentially the last predicted word plays the most important role in generating the next word. How can we allow the model to exploit a broader context for translation?
To answer this question, we propose to change the decoder formula by including residual connections not only from the previous time step $j-1$, but from all previous time steps, from $1$ to $j-1$. This reinforces the memory of the recurrent layer towards what has been translated so far, through weighted connections applied to all previously predicted words. At each time step, the new model decides which of the previously predicted words should be emphasized to predict the next one. In order to deal with the dynamic length of this new input, we use a target-side summary vector $d_j$, which can be interpreted as the representation of the decoded sentence up to time $j$ in a word embedding space. We therefore modify Equation 1, replacing $y_{j-1}$ with $d_j$:

$$p(y_j \mid y_1, \dots, y_{j-1}, x) = g(s_j, d_j, c_j)$$
The replacement of $y_{j-1}$ with $d_j$ means that the number of parameters added to the model depends only on the calculation of $d_j$. Figure 2 illustrates the modification made to the decoder. We define two methods for summarizing the context into $d_j$, which are described in the following sections.
4.1 Mean Residual Connections
One simple way to aggregate information from multiple word embeddings is by averaging them. The average of the embedding vectors represents a shared space of the previously decoded words, which can be understood as the sentence representation up to time $j$. We hypothesize that this representation is more informative than using only the embedding of the previous word. Formally, this representation is computed as follows:

$$d_j = \frac{1}{j-1} \sum_{k=1}^{j-1} w_k$$

where $w_k$ denotes the embedding of the previously decoded word $y_k$.
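The averaging described above can be sketched as follows. The batch version and the O(1) incremental update are our own convenience helpers, shown here only to make the computation concrete.

```python
import numpy as np

def mean_summary(prev_embeddings):
    """Summary vector as the average of previously decoded word embeddings.

    prev_embeddings: array of shape (j-1, m), rows are embeddings w_1..w_{j-1}.
    """
    return prev_embeddings.mean(axis=0)

def update_mean(d_prev, w_new, count):
    """O(1) incremental update of the running mean after emitting the
    count-th word (a convenience helper of our own, not from the paper)."""
    return d_prev + (w_new - d_prev) / count

E = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])  # three decoded words
d_batch = mean_summary(E)
d_inc = update_mean(update_mean(E[0], E[1], 2), E[2], 3)
```

The incremental form is what makes this summary essentially free at decoding time: the decoder never needs to revisit all previous embeddings.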
4.2 Attentive Residual Connections
Averaging is a simple and cheap way to aggregate information from multiple words, but it may not be sufficient for all dependencies. Instead, we propose a dynamic way to aggregate information in each sentence, such that different words have different importance according to their relation to the prediction of the next word. We propose to use a shared self-attention mechanism to obtain a summary representation of the translation, i.e. a weighted average representation of the words translated from time $1$ to $j-1$. This mechanism aims to model, in part, important non-sequential dependencies among words, and serves as a complementary memory to the recurrent layer.
The weights of the attention model are computed by a scoring function that predicts how important each previous word ($w_1$ to $w_{j-1}$) is for the current prediction $y_j$. Figure 2 illustrates the attentive residual connections in the proposed decoder at the moment of predicting $y_j$, in contrast with the baseline model presented in Figure 1. We experiment with four different scoring functions:
$$e^{\text{simple}}_{jk} = v^\top \tanh(W w_k)$$
$$e^{\text{sum}}_{jk} = v^\top \tanh(W w_k + U c_j)$$
$$e^{\text{ew}}_{jk} = v^\top \tanh(W w_k \odot U c_j)$$
$$e^{\text{dot}}_{jk} = (W w_k)^\top U c_j$$

where $W$, $U$, and $v$ are parameters of the network; $m$ and $n$ are the dimensions of the word embedding and the context vector respectively.
Firstly, we explore a simple scoring function that is calculated based only on the previous hidden states of the decoder, as used in prior work. Secondly, we study the same scoring function (noted sum) as used by the attention mechanism over the encoder. Thirdly, we consider an element-wise product, and lastly a dot product function. In contrast to the first attention function, which is based only on the previous words, the other three functions make use of the context vector $c_j$.
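A self-attentive summary of this kind can be sketched as follows. This NumPy sketch covers the embedding-only ("simple") scoring and the context-conditioned ("sum") variant; all parameter names (`W_b`, `U_b`, `v_b`) and shapes are our own assumptions, not the paper's notation.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attentive_summary(prev_embeddings, c_j, W_b, U_b, v_b, use_context=False):
    """Self-attentive summary of previously decoded words: a sketch.

    With use_context=False this mimics embedding-only ("simple") scoring;
    with use_context=True the score is also conditioned on the encoder
    context vector c_j ("sum" variant). Parameter names are our own
    assumptions, not the paper's notation.
    """
    pre = prev_embeddings @ W_b.T          # (j-1, k): projected embeddings
    if use_context:
        pre = pre + U_b @ c_j              # add projected context to every row
    beta = softmax(np.tanh(pre) @ v_b)     # (j-1,): attention weights
    return beta @ prev_embeddings          # weighted average of embeddings
```

Because the weights are a softmax, the summary is a convex combination of the previous embeddings: if all previous words had the same embedding, the summary would be exactly that embedding.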
5 Experimental Data and Setup
To evaluate the proposed MT models in different conditions, we select three language pairs with increasing amounts of training data: English-Chinese (0.5M sentences), Spanish-English (2.1M), and English-German (4.5M).
For English-to-Chinese, we use a subset of the UN parallel corpus.
We report translation quality using BLEU scores over tokenized and truecased texts, using the scripts available in the Moses toolkit.
5.2 Setup and Parameters
We use the implementation of the attention-based NMT baseline provided by its authors.
6 Analysis of Results
The BLEU scores of the different NMT models for the three language pairs are shown in Table 1. Along with the NMT baseline, we include an SMT baseline based on Moses, trained with the same training/tuning/test data as the NMT systems. The results show that all the proposed models outperform the attention-based NMT baseline, and the best scores are obtained by the NMT model with attentive residual connections. Despite their simplicity, the mean residual connections already improve the translation, without increasing the number of parameters.
| System | En→Zh | Es→En | En→De |
|---|---|---|---|
| SMT Winning WMT 14 | – | – | 20.7 |
| NMT + Mean residual connections | 23.6 | 25.7 | 22.9 |
| NMT + Attentive residual connections | 24.0 | 26.3 | 23.2 |
6.1 Role of the attention function.
We now examine the four scoring functions that can be used for attention over residual connections presented above, considering only English-to-Chinese (the smallest dataset) due to limited computing time.
Table ? shows the BLEU scores for the test set: the best method is the simple matching function, which depends only on the word embeddings, in contrast to the other three methods, which additionally depend on the context vector. When using both word embeddings and local context for computing the attention, the performance is better than the baseline system but not better than the simple attention; this is likely because learning such a function is harder than learning one based on the global context only.
The improved performance comes at the cost of only a slight increase in the number of parameters compared to the baseline system, growing only with the dimensionality of the target word embeddings; these additional parameters correspond to the ones introduced in Equation 7 above.
6.2 Performance according to human evaluation.
Manual evaluation on samples of 50 sentences for each language pair helped to ascertain the conclusions obtained from the BLEU scores, and to provide a qualitative understanding of the improvements brought by our model. For each language, we employed one evaluator who was a native speaker of the target language and had good knowledge of the source language. The evaluator ranked three translations of the same source sentence (one from each of our models: baseline, mean residual connections, and attentive residual connections) according to translation quality. The three translations were presented in a random order, so that the system that had generated them could not be identified. To aggregate the judgments, we proceeded in pairs, and counted the number of times each system was ranked higher than, equal to, or lower than another competing system. The results shown in Table 2 indicate that the attentive model outperforms the one with mean residual connections, and both outperform the baseline, for all three language pairs. The rankings are thus identical to those obtained using BLEU in Table 1.
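The pairwise aggregation of rankings described above can be sketched as follows. The system names and the input format are hypothetical, chosen only to make the counting procedure concrete.

```python
from itertools import combinations

def pairwise_tally(rankings, systems=("baseline", "mean", "attentive")):
    """Count, for each pair of systems, how often the first was ranked
    higher than, equal to, or lower than the second (rank 1 = best).

    rankings: list of dicts mapping system name -> rank for one sentence.
    System names and the input format are hypothetical, for illustration.
    """
    tally = {}
    for a, b in combinations(systems, 2):
        higher = equal = lower = 0
        for r in rankings:
            if r[a] < r[b]:
                higher += 1
            elif r[a] == r[b]:
                equal += 1
            else:
                lower += 1
        tally[(a, b)] = (higher, equal, lower)
    return tally

# Two toy judgments: in the first, attentive > mean > baseline; in the
# second, baseline and mean tie ahead of attentive.
rankings = [
    {"baseline": 3, "mean": 2, "attentive": 1},
    {"baseline": 1, "mean": 1, "attentive": 2},
]
tally = pairwise_tally(rankings)
```

Allowing ties (the "equal" bucket) is what lets the counts for a pair sum to the total number of judged sentences.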
| Comparison | En→Zh (>, =, <) | Es→En (>, =, <) | En→De (>, =, <) |
|---|---|---|---|
| Mean vs. Baseline | 26 / 56 / 18 | 20 / 64 / 16 | 28 / 58 / 24 |
| Attentive vs. Baseline | 28 / 60 / 12 | 28 / 56 / 16 | 32 / 54 / 14 |
| Attentive vs. Mean | 24 / 62 / 14 | 28 / 58 / 14 | 32 / 56 / 12 |
6.3 Performance on language modeling.
The overall aim of our proposal is to model word dependencies which are meaningful for translation, and to extend the memory of the decoder. We hypothesize that the component actually affected by our change in the NMT architecture is the language modeling component. In order to verify this hypothesis, we incorporate the proposed residual connections into a neural model tailored for a language modeling (LM) task. This is formulated as the probability of a given text, expressed as a sequence of words noted $w_1, \dots, w_T$:

$$p(w_1, \dots, w_T) = \prod_{t=1}^{T} p(w_t \mid w_1, \dots, w_{t-1})$$
The neural LM uses an RNN to approximate this probability, using the following equations:

$$h_t = f(h_{t-1}, w_{t-1}), \qquad p(w_t \mid w_1, \dots, w_{t-1}) = g(h_t, w_{t-1})$$

where $f$ and $g$ are non-linear functions, and $h_t$ is the hidden state of the recurrent layer. Similarly to the formulation used in the decoder, we add the summary vector $d_t$ to the LM by modifying Equation 6 as follows:

$$p(w_t \mid w_1, \dots, w_{t-1}) = g(h_t, d_t)$$
For our experiment, we use a PyTorch LM implementation based on prior work.
We perform a qualitative analysis of the attentive residual connections method. In particular, we analyze the distribution of the weights assigned by the attention mechanism, and the kind of connections that this approach learns. We also present sample translations to contrast the baseline system with the proposed approach.
7.1 Distribution of Attention.
Figure ? shows the distribution of the attention to previous words obtained with attentive residual connections, on Spanish-to-English NMT (the other two pairs exhibit similar distributions). The weights on the test corpus are summarized using two normalizations. First, we normalize the weights by the number of preceding words, because at each time step the number of preceding words increases, so the weights are inherently more dispersed for words which are at the end of the sentence. Second, we normalize the weights by the sentence length given its variability throughout the test set.
Figure 4 shows the distribution of the relative positions which received maximal attention. In other words, at each word prediction we selected the preceding word with maximal weight and counted its relative position, represented on the horizontal axis. Although the maximal attention was most frequently directed to the closest preceding words, in many cases distant words (more than 30 positions away) were still given maximal attention. Figure 5 shows the mean weight assigned at each word position, now considering the weights assigned to all preceding words (not only the maximal one as in the previous case). We observe a uniform distribution (with noise for sentences longer than our training samples of 50 words), showing that the attention is indeed distributed throughout a sentence.
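The two analyses above (rescaling raw weights for comparability across positions, and extracting the relative position of maximal attention) can be sketched as follows. This is our own minimal interpretation in NumPy; the input format, one weight vector per decoding step, is an assumption.

```python
import numpy as np

def rescale_weights(attn_rows):
    """Multiply each step's weight vector by its length, so that a uniform
    distribution maps to all-ones regardless of how many preceding words
    the step attends to (our reading of the first normalization)."""
    return [np.asarray(r, dtype=float) * len(r) for r in attn_rows]

def max_attention_offsets(attn_rows):
    """Distance back to the preceding word with maximal weight
    (1 = the immediately previous word)."""
    return [len(r) - int(np.argmax(r)) for r in attn_rows]

# attn_rows[j] holds the weights over the preceding words at one step.
rows = [np.array([1.0]), np.array([0.25, 0.25, 0.5, 0.0])]
rescaled = rescale_weights(rows)
offsets = max_attention_offsets(rows)
```

After rescaling, a value above 1 marks a word receiving more than its uniform share of attention, which makes late-sentence steps comparable with early ones.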
7.2 Structures Learned by the Attentive Model.
Figure 3 shows matrices with target-side attention weights for four samples translated from Spanish to English with our attentive model. The sentence is on the vertical axis, while the horizontal axis represents the previous words. The color shades at each position ("cell") represent the attention weights. A first attempt to determine classes of words which receive more attention did not lead to conclusive results, except possibly prepositions or conjunctions. In a second attempt, we observed the formation of sub-phrases. By analyzing the changes of attention as a binary tree, it appears from the examined examples that the decoder learns certain syntactic patterns. In Figure 3, starting from the first word, we mark with brackets the formation of a new sub-phrase each time the focus of attention changes to a new word. In these examples, we can see that the sub-phrases coincide with noun phrases or verb phrases.
Table 3 shows example translations from the baseline and the attentive residual connections model. The first part (upper three examples) includes cases where the proposed model reached a higher BLEU score than the baseline system; here, the structure of the sentence, or at least the word order, is improved. The second part (lower three examples) contains cases where the baseline had a higher BLEU score than our model. In the first example, the structure of the sentence is different but the information is correct; in the second, the lexical choices differ from the reference but are still correct; while in the third, the translation is simply incorrect.
(S: source, R: reference, B: baseline, A: attentive residual connections)

S: Estudiantes y profesores se están tomando a la ligera la fecha.
R: Students and teachers are taking the date lightly.
B: Students and teachers are being taken lightly to the date.
A: Students and teachers are taking the date lightly.

S: No porque compartiera su ideología, sino porque para él los Derechos Humanos son indivisibles.
R: Not because he shared their world view, but because for him, human rights are indivisible.
B: Not because I share his ideology, but because he is indivisible by human rights.
A: Not because he shared his ideology, but because for him human rights are indivisible.

S: ¿El PSG puede convertirse en un gran club europeo a corto plazo?
R: Can PSG become a top European club in the short term?
B: Can the PG can become a large short-term European club?
A: Can the PSG be a great European club in the short term?

S: El gobierno intenta que no se construyan tantas casas pequeñas.
R: The Government is trying not to build so many small houses.
B: The government is trying not to build so many small houses.
A: The government is trying to ensure that so many small houses are not built.

S: Otras personas pueden tener niños.
R: Other people can have children.
B: Other people can have children.
A: Others may have children.

S: Dar a luz en el espacio
R: Give birth in space
B: Give birth in space
A: In the light of the space
We presented a novel approach to enrich the sequential model of state-of-the-art decoders for neural machine translation. We proposed to extend the context by introducing residual connections to the previous word predictions. To manage the variable lengths of previous predictions, we proposed two methods for context summarization: mean residual connections and attentive residual connections. The first one is an average of the word embedding vectors, which can be interpreted as a representation of the sentence up to the time of the new word prediction. The second one uses an attention mechanism over the word embedding vectors to emphasize relevant information from the previous context. We evaluated our proposal over three datasets, for English-to-Chinese, Spanish-to-English, and English-to-German NMT. In each case, we improved the BLEU score compared to the NMT baseline. A manual evaluation over a small set of sentences for each language pair confirmed the improvement. We hypothesized that most of the improvement comes from the language model component, and showed indeed that our model has a lower perplexity than the baseline on a LM task. Finally, a qualitative analysis showed that the attentive model distributes weights throughout an entire sentence, and learns structures resembling syntactic ones.
To encourage further research in global-context NMT, our code will be made available upon publication.
Appendix A: Detailed Architecture
This appendix describes in detail the implementation of the Global-Context NMT model, which builds upon a standard attention-based NMT implementation.
The input of the model is a source sentence denoted as a sequence of 1-of-K coded vectors, where each element of the sequence corresponds to a word:

$$x = (x_1, \dots, x_{T_x}), \quad x_i \in \mathbb{R}^{K_x}$$

and the output is a target sentence denoted as well as a sequence of 1-of-K coded vectors:

$$y = (y_1, \dots, y_{T_y}), \quad y_i \in \mathbb{R}^{K_y}$$

where $K_x$ and $K_y$ are the sizes of the source and target vocabularies, and $T_x$ and $T_y$ are the lengths of the source and target sentences respectively. We omit the bias vectors for simplicity.
Each word of the source sentence is embedded in an $m$-dimensional vector space using the embedding matrix $E \in \mathbb{R}^{m \times K_x}$. The hidden states are $n$-dimensional vectors modeled by a bi-directional GRU. The forward states are computed as:

$$\overrightarrow{h}_i = (1 - \overrightarrow{z}_i) \odot \overrightarrow{h}_{i-1} + \overrightarrow{z}_i \odot \tilde{\overrightarrow{h}}_i$$
$$\tilde{\overrightarrow{h}}_i = \tanh\big(\overrightarrow{W} E x_i + \overrightarrow{U} \big[\overrightarrow{r}_i \odot \overrightarrow{h}_{i-1}\big]\big)$$
$$\overrightarrow{z}_i = \sigma\big(\overrightarrow{W}_z E x_i + \overrightarrow{U}_z \overrightarrow{h}_{i-1}\big)$$
$$\overrightarrow{r}_i = \sigma\big(\overrightarrow{W}_r E x_i + \overrightarrow{U}_r \overrightarrow{h}_{i-1}\big)$$

Here, $\overrightarrow{W}, \overrightarrow{W}_z, \overrightarrow{W}_r \in \mathbb{R}^{n \times m}$ and $\overrightarrow{U}, \overrightarrow{U}_z, \overrightarrow{U}_r \in \mathbb{R}^{n \times n}$ are weight matrices. The backward states are computed in a similar manner. The embedding matrix is shared between both passes, and the final hidden states are formed by their concatenation:

$$h_i = \big[\overrightarrow{h}_i; \overleftarrow{h}_i\big]$$
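As a minimal numerical sketch, one recurrence step of such a GRU can be written as follows. Bias terms are omitted, as in the text, and the generic matrix names are our own placeholders.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(h_prev, x_emb, W, U, W_z, U_z, W_r, U_r):
    """One GRU recurrence over an embedded input word (bias terms omitted;
    the generic matrix names are our own). Returns the new hidden state."""
    z = sigmoid(W_z @ x_emb + U_z @ h_prev)          # update gate
    r = sigmoid(W_r @ x_emb + U_r @ h_prev)          # reset gate
    h_tilde = np.tanh(W @ x_emb + U @ (r * h_prev))  # candidate state
    return (1.0 - z) * h_prev + z * h_tilde          # gated interpolation
```

The forward and backward encoder passes each run this step over the sentence in opposite directions before concatenating their states.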
The context vector at time $j$ is calculated by:

$$c_j = \sum_{i=1}^{T_x} \alpha_{ji} h_i, \qquad \alpha_{ji} = \frac{\exp(e_{ji})}{\sum_{k=1}^{T_x} \exp(e_{jk})}, \qquad e_{ji} = v_a^\top \tanh\big(W_a s_{j-1} + U_a h_i\big)$$

Here, $v_a \in \mathbb{R}^{n'}$, $W_a \in \mathbb{R}^{n' \times n}$, and $U_a \in \mathbb{R}^{n' \times 2n}$ are weight matrices.
The inputs of the decoder are the previous word $y_{j-1}$ and the context vector $c_j$; the objective is to predict $y_j$. The hidden states of the decoder are initialized with the mean of the context vectors:

$$s_0 = \tanh\Big(W_{init} \, \frac{1}{T_x} \sum_{i=1}^{T_x} h_i\Big)$$

where $W_{init}$ is a weight matrix and $T_x$ is the length of the source sentence. The following hidden states are calculated with a GRU conditioned on the context vector at time $j$ as follows:

$$s_j = (1 - z_j) \odot s'_j + z_j \odot \tilde{s}_j$$
$$\tilde{s}_j = \tanh\big(W E' y_{j-1} + U \big[r_j \odot s'_j\big] + C c_j\big)$$
$$z_j = \sigma\big(W_z E' y_{j-1} + U_z s'_j + C_z c_j\big)$$
$$r_j = \sigma\big(W_r E' y_{j-1} + U_r s'_j + C_r c_j\big)$$

Here, $E'$ is the embedding matrix for the target language, and $W, W_z, W_r$, $U, U_z, U_r$, and $C, C_z, C_r$ are weight matrices. The intermediate vector $s'_j$ is calculated from a simple GRU:

$$s'_j = \text{GRU}\big(E' y_{j-1}, s_{j-1}\big)$$
In the attention-based NMT model, the probability of a target word is given by:

$$p(y_j \mid s_j, y_{j-1}, c_j) \propto \exp\big(y_j^\top W_o t_j\big), \qquad t_j = \tanh\big(U_o s_j + V_o E' y_{j-1} + C_o c_j\big)$$

whereas in our global-context NMT model, the probability of a target word is given by:

$$p(y_j \mid s_j, y_1, \dots, y_{j-1}, c_j) \propto \exp\big(y_j^\top W_o t_j\big), \qquad t_j = \tanh\big(U_o s_j + V_o d_j + C_o c_j\big)$$

Here, $W_o$, $U_o$, $V_o$, and $C_o$ are weight matrices. The summary vector $d_j$ can be calculated in different manners, based on the previous words $y_1$ to $y_{j-1}$. First, as a simple average:

$$d_j = \frac{1}{j-1} \sum_{k=1}^{j-1} E' y_k$$
Second, by using an attention mechanism:

$$d_j = \sum_{k=1}^{j-1} \beta_{jk} E' y_k, \qquad \beta_{jk} = \frac{\exp(m_{jk})}{\sum_{l=1}^{j-1} \exp(m_{jl})}$$
The matching function $m_{jk}$ can also be calculated in different manners:

$$m^{\text{simple}}_{jk} = v_b^\top \tanh\big(W_b E' y_k\big)$$
$$m^{\text{sum}}_{jk} = v_b^\top \tanh\big(W_b E' y_k + U_b c_j\big)$$
$$m^{\text{ew}}_{jk} = v_b^\top \tanh\big(W_b E' y_k \odot U_b c_j\big)$$
$$m^{\text{dot}}_{jk} = \big(W_b E' y_k\big)^\top U_b c_j$$

where $v_b$, $W_b$, and $U_b$ are weight matrices.
Aharoni, R., and Goldberg, Y. Towards string-to-tree neural machine translation.
Bahdanau, D.; Cho, K.; and Bengio, Y. Neural machine translation by jointly learning to align and translate.
Bojar, O., et al. Findings of the 2013 Workshop on Statistical Machine Translation.
Bojar, O., et al. Findings of the 2016 Conference on Machine Translation.
Buck, C.; Heafield, K.; and van Ooyen, B. N-gram counts and language models from the common crawl.
Chen, Q.; Zhu, X.; Ling, Z.; Wei, S.; and Jiang, H. Enhancing and combining sequential and tree LSTM for natural language inference.
Cheng, J.; Dong, L.; and Lapata, M. Long short-term memory-networks for machine reading.
Cho, K.; van Merrienboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; and Bengio, Y. Learning phrase representations using RNN encoder–decoder for statistical machine translation.
Eriguchi, A.; Hashimoto, K.; and Tsuruoka, Y. Tree-to-sequence attentional neural machine translation.
He, K.; Zhang, X.; Ren, S.; and Sun, J. Deep residual learning for image recognition.
Hochreiter, S., and Schmidhuber, J. Long short-term memory.
Kalchbrenner, N., and Blunsom, P. Recurrent continuous translation models.
Kim, J.; El-Khamy, M.; and Lee, J. Residual LSTM: Design of a deep recurrent architecture for distant speech recognition.
Koehn, P., et al. Moses: Open source toolkit for statistical machine translation.
Koehn, P. Statistical Machine Translation.
Le, P., and Zuidema, W. Compositional distributional semantics with long short term memory.
Luong, T.; Pham, H.; and Manning, C. D. Effective approaches to attention-based neural machine translation.
Marcus, M. P.; Marcinkiewicz, M. A.; and Santorini, B. Building a large annotated corpus of English: The Penn Treebank.
Papineni, K.; Roukos, S.; Ward, T.; and Zhu, W.-J. Bleu: a method for automatic evaluation of machine translation.
Pappas, N., and Popescu-Belis, A. Explicit document modeling through weighted multiple-instance learning.
Press, O., and Wolf, L. Using the output embedding to improve language models.
Rafalovitch, A., et al. United Nations General Assembly resolutions: A six-language parallel corpus.
Sennrich, R.; Haddow, B.; and Birch, A. Neural machine translation of rare words with subword units.
Sutskever, I.; Vinyals, O.; and Le, Q. V. Sequence to sequence learning with neural networks.
Tai, K. S.; Socher, R.; and Manning, C. D. Improved semantic representations from tree-structured long short-term memory networks.
Theano Development Team. Theano: A Python framework for fast computation of mathematical expressions.
Wu, S.; Zhang, D.; Yang, N.; Li, M.; and Zhou, M. Sequence-to-dependency neural machine translation.
Zeiler, M. D. ADADELTA: an adaptive learning rate method.
Zhang, Y.; Chen, G.; Yu, D.; Yao, K.; Khudanpur, S.; and Glass, J. Highway long short-term memory RNNs for distant speech recognition.
Zhu, X.; Sobhani, P.; and Guo, H. DAG-structured long short-term memory for semantic compositionality.
Zhu, X.; Sobhani, P.; and Guo, H. Long short-term memory over recursive structures.