A GRU-Gated Attention Model for Neural Machine Translation
Neural machine translation (NMT) heavily relies on an attention network to produce a context vector for each target word prediction. In practice, we find that context vectors for different target words are quite similar to one another and are therefore insufficient for discriminating among target word predictions. The reason for this might be that context vectors produced by the vanilla attention network are just a weighted sum of source representations that are invariant to decoder states. In this paper, we propose a novel GRU-gated attention model (GAtt) for NMT which enhances the degree of discrimination of context vectors by enabling source representations to be sensitive to the partial translation generated by the decoder. GAtt uses a gated recurrent unit (GRU) to combine two types of information, treating a source annotation vector originally produced by the bidirectional encoder as the history state and the corresponding previous decoder state as the input to the GRU. The GRU-combined information forms a new source annotation vector. In this way, we obtain translation-sensitive source representations which are then fed into the attention network to generate discriminative context vectors. We further propose a variant that regards a source annotation vector as the current input and the previous decoder state as the history. Experiments on NIST Chinese-English translation tasks show that both GAtt-based models achieve significant improvements over the vanilla attention-based NMT. Further analyses of attention weights and context vectors demonstrate the effectiveness of GAtt in improving the discrimination power of representations and in handling the challenging issue of over-translation.
Neural machine translation (NMT), as a large, single and end-to-end trainable neural network, has attracted wide attention in recent years [DBLP:journals/corr/SutskeverVL14, DBLP:journals/corr/BahdanauCB14, DBLP:journals/corr/ShenCHHWSL15, jean-EtAl:2015:ACL-IJCNLP, luong-EtAl:2015:ACL-IJCNLP, DBLP:journals/corr/WangLTLXZ16]. Currently, most NMT systems use an encoder to read a source sentence into a vector and a decoder to map the vector into the corresponding target sentence. What makes NMT outperform conventional statistical machine translation (SMT) is the attention mechanism [DBLP:journals/corr/BahdanauCB14], an information bridge between the encoder and the decoder that produces context vectors by dynamically detecting relevant source words for predicting the next target word.
Intuitively, different target words should be aligned to different source words, so that the generated context vectors differ significantly from one another across decoding steps. In other words, these context vectors should be discriminative enough for target word prediction; otherwise the same target words might be generated repeatedly (a well-known issue of NMT: over-translation, see Section 5.6). However, this is often not true in practice, even when the “attended” source words are quite relevant. We observe (see Section 5.5) that the context vectors are very similar to each other, and that the variance in each dimension of these vectors across different decoding steps is very small. This indicates that the vanilla attention mechanism is inadequate for distinguishing different translation predictions. The reason, we conjecture, lies in the architecture of the attention mechanism, which simply calculates a linearly weighted sum of source representations that are invariant across decoding steps. Such invariance in the source representations may lead to the undesirably small variance of context vectors.
In order to handle this issue, we propose in this paper a novel GRU-gated attention model (GAtt) for NMT. The key idea is that we can increase the variance of the context vectors by refining the source representations according to the partial translation generated by the decoder. The refined source representations are composed from the original source representations and the previous decoder state at each decoding step. We show the overall framework of our model and highlight the difference between GAtt and the vanilla attention in Figure 1. GAtt significantly extends the vanilla attention by inserting a gating layer between the encoder and the vanilla attention network. Specifically, we model this gating layer with a GRU unit [journals/corr/ChungGCB14], which takes the original source representations as its history and the corresponding previous decoder state as its current input. In this way, GAtt produces translation-sensitive source representations, which improves the variance of the context vectors and therefore their discrimination ability in target word prediction.
As a GRU is able to control the information flow between the history and the current input through its reset and update gates, we further propose a variant of GAtt that, instead, regards the previous decoder state as the history and the original source representations as the current inputs. Both models are simple yet efficient in training and decoding.
We evaluate GAtt on Chinese-English translation tasks. Experimental results show that both GAtt-based models significantly outperform the vanilla attention-based NMT. We further analyze the generated attention weights and context vectors, showing that the attention weights are more accurate and the context vectors are more discriminative for target word prediction.
2 Related Work
Our work contributes to the development of the attention mechanism in NMT. Originally, NMT did not have an attention mechanism and mainly relied on the encoder to summarize all source-side semantic details into a fixed-length vector [DBLP:journals/corr/SutskeverVL14, cho-EtAl:2014:EMNLP2014]. Bahdanau et al. \shortciteDBLP:journals/corr/BahdanauCB14 found that a fixed-length vector is not adequate to represent a source sentence and proposed the popular attention mechanism, enabling the model to automatically search for the parts of a source sentence that are relevant to the next target word. Since then, the attention mechanism has attracted extensive interest. Luong et al. \shortciteluong-pham-manning:2015:EMNLP explore several effective approaches to the attention network, introducing the local and global attention models. Tu et al. \shortciteDBLP:journals/corr/TuLLLL16 introduce a coverage vector to keep track of the attention history so that the attention network can pay more attention to untranslated source words. Mi et al. \shortcitemi-wang-ittycheriah:2016:EMNLP2016 leverage well-trained word alignments to directly supervise the attention weights in NMT. Yang et al. \shortcite2016arXiv160705108Y add a recurrence along the context vectors to help adjust future attention. Cohn et al. \shortciteDBLP:journals/corr/CohnHVYDH16 incorporate several structural biases, such as position bias, the Markov condition and fertilities, into the attention-based neural translation model. However, all these models mainly focus on how to make the attention weights more accurate. As mentioned above, even with well-designed attention models, context vectors may lack the discriminative power needed for target word prediction.
Another closely related work is the interactive attention model [meng-EtAl:2016:COLING] which treats source representations as a memory and models the interaction between the decoder and this memory during translation via reading and writing operations. To some extent, our model can also be regarded as a memory network, which only includes the reading operation. However, our reading operation differs significantly from that in the interactive attention, where we employ the GRU unit for composition while they merely use the content-based addressing. Compared with the interactive attention, our GAtt, without the writing operation, is more efficient in both training and decoding.
The gate mechanism in our GAtt is built on the GRU unit. A GRU usually acts as a recurrent unit that leverages a reset gate and an update gate to control how much information flows from the history state and the current input respectively [journals/corr/ChungGCB14]. It is an extension of the vanilla recurrent neural network (RNN) unit with the advantage of alleviating the vanishing and exploding gradient problems during training [Bengio-trnn94], and also a simplification of the LSTM model [Hochreiter:1997:LSM:1246443.1246450] with the advantage of efficient computation. The idea of using a GRU as a gate mechanism, to the best of our knowledge, has never been investigated before.
Additionally, our model is also related to the tree-structured LSTM [tai-socher-manning:2015:ACL-IJCNLP], where the LSTM is adapted to compose a variable number of child nodes and the current input node in a dependency tree into the current hidden state. GAtt differs significantly from the tree-structured LSTM in that the latter employs a sum operation to deal with the variable-sized representations, while our model leverages the attention mechanism.
3 Background

In this section, we briefly review the vanilla attention-based NMT [DBLP:journals/corr/BahdanauCB14]. Unlike conventional SMT, NMT directly maps a source sentence $\mathbf{x} = (x_1, x_2, \ldots, x_n)$ to its target translation using an encoder-decoder framework. The encoder reads the source sentence and encodes the representation $h_j$ of each word $x_j$ by summarizing the information of neighboring words. As shown by the blue color in Figure 1, this is achieved by a bidirectional RNN, specifically the bidirectional GRU model.
The decoder is a conditional language model which generates the target sentence word by word using the following conditional probability (see the yellow lines in Figure 1 (a)):

$$p(y_i \mid \mathbf{y}_{<i}, \mathbf{x}) = g(y_{i-1}, s_i, c_i) \qquad (1)$$

where $\mathbf{y}_{<i} = (y_1, \ldots, y_{i-1})$ is the partial translation, $y_{i-1}$ is the embedding of the previously generated target word, $s_i$ is the $i$-th target-side decoder state and $g(\cdot)$ is a highly non-linear function. Please refer to [DBLP:journals/corr/BahdanauCB14] for more details. What concerns us in this paper is $c_i$, the translation-sensitive context vector produced by the attention mechanism.
Attention Mechanism acts as a bridge between the encoder and the decoder, making them tightly coupled. The attention network aims at recognizing which source words are relevant to the next target word and giving high attention weights to these words when computing the context vector $c_i$. This is based on the encoded source representations $H = (h_1, \ldots, h_n)$ and the previous decoder state $s_{i-1}$ (see the purple color in Figure 1 (a)). Formally,

$$c_i = \mathrm{Att}(H, s_{i-1}) \qquad (2)$$

denotes the whole process. It first computes an attention weight $\alpha_{ij}$ to measure the degree of relevance of a source word $x_j$ for predicting the target word $y_i$ via a feed-forward neural network:

$$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{n} \exp(e_{ik})} \qquad (3)$$

The relevance score $e_{ij}$ is estimated via an alignment model as in [DBLP:journals/corr/BahdanauCB14]: $e_{ij} = v_a^{\top} \tanh(W_a s_{i-1} + U_a h_j)$. Intuitively, the higher the attention weight $\alpha_{ij}$ is, the more important word $x_j$ is for the next word prediction. Therefore, $\mathrm{Att}(\cdot)$ generates $c_i$ by directly weighting the source representations with their corresponding attention weights:

$$c_i = \sum_{j=1}^{n} \alpha_{ij} h_j \qquad (4)$$
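The attention computation just described — relevance scores, a softmax over source positions, and a weighted sum of the annotations — can be sketched in a few lines of NumPy. The weight names `W_a`, `U_a`, `v_a` and the function name are ours, not from the paper:

```python
import numpy as np

def vanilla_attention(H, s_prev, W_a, U_a, v_a):
    """One step of additive (Bahdanau-style) attention.

    H      : (n, d)  source annotations h_1..h_n (decoding-invariant)
    s_prev : (d,)    previous decoder state s_{i-1}
    Returns the attention weights alpha_i and the context vector c_i.
    """
    # relevance scores e_ij = v_a^T tanh(W_a s_{i-1} + U_a h_j)
    e = np.tanh(s_prev @ W_a.T + H @ U_a.T) @ v_a   # (n,)
    # softmax over source positions -> attention weights alpha_ij
    alpha = np.exp(e - e.max())
    alpha /= alpha.sum()
    # context vector: weighted sum of the fixed source annotations
    c = alpha @ H                                    # (d,)
    return alpha, c
```

Note that `H` never changes across decoding steps here; only the weights `alpha` do — which is exactly the invariance the next section sets out to break.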
Although this vanilla attention model is very successful, we find that, in practice, the resulting context vectors are very similar to one another. In other words, these context vectors are not discriminative enough. This is undesirable because it makes the decoder (Eq. (1)) hesitate in deciding which target word should be predicted. We attempt to solve this problem in the next section.
4 GRU-Gated Attention for NMT
The problem mentioned above reveals some shortcomings of the vanilla attention mechanism. Let us revisit the generation of $c_i$ in Eq. (4). As different target words may be aligned to different source words, the attention weights of source words vary across decoding steps. However, no matter how the attention weights vary, the source representations remain the same, i.e. they are decoding-invariant. This invariance limits the discrimination power of the generated context vectors.
Accordingly, we attempt to break this invariance by refining the source representations before they are fed into the vanilla attention network at each decoding step. To this end, we propose the GRU-gated attention (GAtt), which, similar to the vanilla attention, can be formulated as follows:

$$c_i = \mathrm{Att}(\widetilde{H}_i, s_{i-1}) \qquad (5)$$

where $\widetilde{H}_i = (\widetilde{h}_{i1}, \ldots, \widetilde{h}_{in})$ denotes the refined source representations at decoding step $i$.
The gray color in Figure 1 (b) highlights the major difference between GAtt and the vanilla attention. Specifically, GAtt consists of two layers: a gating layer and an attention layer.
Gating Layer. This layer aims at refining the source representations according to the previous decoder state so as to compute translation-relevant source representations. Formally,

$$\widetilde{h}_{ij} = \mathrm{GRU}(h_j, s_{i-1}) \qquad (6)$$
The composition function in Eq. (6) should be capable of dealing with the complex interactions between the source sentence and the partial translation, and of freely controlling the semantic match and information flow between them. Instead of using a conventional gating mechanism [journals/corr/ChungGCB14], we directly choose the whole GRU unit to perform this task. For a source representation $h_j$, the GRU treats it as the history representation and refines it using the current input, i.e. the previous decoder state $s_{i-1}$:

$$z_{ij} = \sigma(W_z s_{i-1} + U_z h_j)$$
$$r_{ij} = \sigma(W_r s_{i-1} + U_r h_j)$$
$$\bar{h}_{ij} = \tanh(W s_{i-1} + U (r_{ij} \odot h_j)) \qquad (7)$$
$$\widetilde{h}_{ij} = (1 - z_{ij}) \odot h_j + z_{ij} \odot \bar{h}_{ij}$$

where $\sigma(\cdot)$ is the sigmoid function and $\odot$ denotes element-wise multiplication. Intuitively, the reset gate $r_{ij}$ and update gate $z_{ij}$ measure the degree of semantic match between the source sentence and the partial translation. The former determines how much of the original source information is used to combine with the partial translation, while the latter defines how much of the original source information is kept around. As a result, $\widetilde{h}_{ij}$ becomes translation-sensitive rather than decoding-invariant, which is desired to strengthen the discrimination power of $c_i$.
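The gating layer just described can be sketched in NumPy: each source annotation `h_j` serves as the GRU history and the previous decoder state `s_prev` as the input. Parameter names are ours and bias terms are omitted for brevity:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_gate(h, s_prev, P):
    """Refine one source annotation h_j with the previous decoder state,
    treating h_j as the GRU history and s_{i-1} as the current input.
    P is a dict of weight matrices (names are ours, not the paper's)."""
    z = sigmoid(P['Wz'] @ s_prev + P['Uz'] @ h)           # update gate
    r = sigmoid(P['Wr'] @ s_prev + P['Ur'] @ h)           # reset gate
    h_bar = np.tanh(P['W'] @ s_prev + P['U'] @ (r * h))   # candidate state
    # interpolate between the original annotation and the candidate
    return (1.0 - z) * h + z * h_bar

def gatt_refine(H, s_prev, P):
    """Apply the gate to every source position -> translation-sensitive
    representations H~_i for the current decoding step."""
    return np.stack([gru_gate(h, s_prev, P) for h in H])
```

Because `s_prev` changes at every decoding step, the refined matrix returned by `gatt_refine` changes with it, unlike the fixed `H` of the vanilla attention.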
Attention Layer. This layer is the same as the vanilla attention mechanism:

$$c_i = \mathrm{Att}(\widetilde{H}_i, s_{i-1}) \qquad (8)$$

The $\mathrm{Att}(\cdot)$ in Eq. (8) denotes the same procedure as in Eq. (2). However, instead of attending to the original source representations $H$, this layer relies on the gate-refined source representations $\widetilde{H}_i$. Notice that $\widetilde{H}_i$ is adaptive during decoding, as indicated by the subscript $i$. Ideally, we expect $\widetilde{H}_i$ to be decoding-specific enough that $c_i$ varies significantly across different target words.
Notice that the GRU in Eq. (6) is not a multi-step RNN. It is simply a composition function, or a one-step RNN, and is therefore computationally efficient. To train our model, we employ the standard training objective, i.e. maximizing the log-likelihood of the training data, and optimize the model parameters using a standard stochastic gradient algorithm.
Model Variant. We refer to the above model as GAtt, which regards the source representations as the history and the previous decoder state as the current input. Which information is treated as input and which as history should not matter much, especially for the GRU unit, since the GRU is able to control the information flow freely. We can therefore also use the previous decoder state as the history and the source representations as the current input. We refer to this model as GAtt-Inv. Formally,

$$\widetilde{h}_{ij} = \mathrm{GRU}(s_{i-1}, h_j) \qquad (9)$$
The major difference lies in the order of the inputs to the GRU, since the inputs to a GRU are directional. We verify both model variants in the following experiments.
| Source | 他 说 , 难民 重新 融入 社会 是 临时 政府 的 工作 重点 之一 。 |
| Reference | he said the refugees ’ re-integration into society is one of the top priorities on the interim government ’s agenda . |
| RNNSearch | he said that refugee is one of the key government tasks . |
| GAtt | he said that the re - integration of the refugees was one of the key tasks of the interim government . |
| GAtt-Inv | he said that the refugees ’ integration into society is one of the key tasks of the interim government . |
5 Experiments

We evaluated the effectiveness of our model on Chinese-English translation tasks. Our training data consists of 1.25M sentence pairs, with 27.9M Chinese words and 34.5M English words respectively111This data is a combination of LDC2002E18, LDC2003E07, LDC2003E14, the Hansards portion of LDC2004T07, LDC2004T08 and LDC2005T06.. We chose the NIST 2005 dataset as the development set for model selection, and the NIST 2002, 2003, 2004, 2006 and 2008 datasets as our test sets. There are 878, 919, 1788, 1082 and 1664 sentences in the NIST 2002, 2003, 2004, 2005 and 2006 datasets respectively. We evaluated translation quality using the case-insensitive BLEU-4 metric [PapineniEtAl2002]222https://github.com/moses-smt/mosesdecoder/blob/master/scripts/generic/multi-bleu.perl and the TER metric [Snover06astudy]333http://www.cs.umd.edu/~snover/tercom/. We performed paired bootstrap sampling [koehn04] for statistical significance testing using the script in Moses444https://github.com/moses-smt/mosesdecoder/blob/master/scripts/
We compared our proposed model against the following two state-of-the-art SMT and NMT systems:
Moses [Koehn:2007:MOS:1557769.1557821]: an open source state-of-the-art phrase-based SMT system.
RNNSearch [DBLP:journals/corr/BahdanauCB14]: a state-of-the-art attention-based NMT system using the vanilla attention mechanism. We further feed the information of the previous target word $y_{i-1}$ to the attention network, and implemented the decoder with two GRU layers, following the suggestions in dl4mt555https://github.com/nyu-dl/dl4mt-tutorial/tree/master/session3.
For Moses, we trained a 4-gram language model on the target portion of training data using the SRILM666http://www.speech.sri.com/projects/srilm/download.html toolkit with modified Kneser-Ney smoothing. The word alignments were obtained with GIZA++ [Och:2003:SCV:778822.778824] on the training corpora in both directions, using the “grow-diag-final-and” strategy [Koehn:2003:SPT:1073445.1073462]. All other parameters were kept as the default settings.
For RNNSearch, we limited the vocabulary of both source and target languages to the most frequent 30K words, covering approximately 97.7% and 99.3% of the two corpora respectively. Words that do not appear in the vocabulary were mapped to a special token “UNK”. We trained our model on sentences of up to 50 words in the training data. The word embedding and hidden layer dimensions follow the settings in [DBLP:journals/corr/BahdanauCB14]. We initialized all parameters randomly according to a normal distribution, except the square matrices, which were initialized with random orthogonal matrices. We used the Adadelta algorithm [DBLP:journals/corr/abs-1212-5701] for optimization, with mini-batch training and gradient norm clipping. The model parameters were selected according to the maximum BLEU score on the development set. Additionally, during decoding, we used the beam-search algorithm with a beam size of 10.
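One common way to realize the random orthogonal initialization mentioned above is via the QR decomposition of a Gaussian matrix. A sketch (we do not claim this is the exact routine used here):

```python
import numpy as np

def orthogonal_init(dim, seed=0):
    """Return a random orthogonal dim x dim matrix: draw a Gaussian
    matrix and keep the Q factor of its QR decomposition."""
    rng = np.random.default_rng(seed)
    q, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
    return q
```

Orthogonal matrices preserve the norm of the vectors they multiply, which helps keep recurrent activations well-scaled early in training.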
For GAtt, we randomly initialized its parameters in the same way as for RNNSearch. All other settings are the same as for RNNSearch. All NMT systems were trained on a GeForce GTX 1080 using the computational framework Theano. In one hour, the RNNSearch system processes about 2769 batches while GAtt processes 1549 batches.
5.3 Translation Results
The results are summarized in Table 1. Both GAtt and GAtt-Inv outperform both Moses and RNNSearch. Specifically, GAtt yields 35.70 BLEU and 56.06 TER scores on average, with improvements of 4.59 BLEU and 1.61 TER points over Moses, and 1.66 BLEU and 2.12 TER points over RNNSearch; GAtt-Inv achieves 35.70 BLEU and 55.99 TER scores on average, with gains of 4.59 BLEU and 1.68 TER points over Moses, and 1.66 BLEU and 2.19 TER points over RNNSearch. All improvements are statistically significant.
GAtt-Inv obtains very slightly better performance than GAtt in terms of average TER. However, these improvements are neither significant nor consistent. In other words, GAtt is as effective as GAtt-Inv. This is reasonable, since the difference between GAtt and GAtt-Inv lies in the order of the inputs to the GRU, and the GRU is able to control the information flow from each input through its reset and update gates.
5.4 Effects of Model Ensemble
We further test whether an ensemble of our models and RNNSearch can yield better performance than any single system. We ensemble different systems by simply averaging their predicted target word probabilities at each decoding step, as suggested in [luong-EtAl:2015:ACL-IJCNLP]. We show the results in Table 2. Not surprisingly, all the ensemble systems achieve significant improvements over the best single system, and the ensemble of “RNNSearch+GAtt+GAtt-Inv” produces the best results, 38.64 BLEU and 54.15 TER scores on average. This demonstrates that these neural models are complementary and beneficial to each other.
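The per-step probability averaging used for ensembling can be sketched as follows (function name ours):

```python
import numpy as np

def ensemble_step(prob_dists):
    """Average the next-word distributions of several models at one
    decoding step. prob_dists is a list of (V,) arrays over a shared
    target vocabulary, each summing to 1; the mean is again a valid
    distribution, which the beam search then consumes as usual."""
    return np.mean(prob_dists, axis=0)
```

Since each input sums to 1, the average also sums to 1, so no renormalization is needed.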
5.5 Translation Analysis
In order to gain a deeper understanding of how the proposed models work, we dug into the translated sentences of the different neural systems. Table 3 shows an example. All the neural models generate very fluent translations. However, RNNSearch only translates the rough meaning of the source sentence, ignoring the important sub-phrases “ 重新 融入 社会 ” and “ 临时 ”. These missing translations resonate with the findings of Tu et al. \shortciteDBLP:journals/corr/TuLLLL16. In contrast, GAtt and GAtt-Inv are able to capture these two sub-phrases, generating the key translations “integration” and “interim”.
| Tu et al. \shortciteDBLP:journals/corr/TuLLLL16 | 64.25 | 50.50 |
To find the underlying reason, we investigated the generated attention weights. Rather than using the generated target sentences, we fed the same reference translations into RNNSearch and GAtt to make a fair comparison777We do not analyze GAtt-Inv because it behaves very similarly to GAtt.. Figure 2 visualizes the attention weights. Both RNNSearch and GAtt produce very intuitive attentions, e.g. “refugees” is aligned to “ 难民 ” and “government” is aligned to “ 政府 ”. However, compared with those of RNNSearch, the attentions learned by GAtt are more focused and accurate. In other words, the refined source representations in GAtt help the attention mechanism concentrate its weights on translation-related words.
To verify this point, we evaluated the quality of the word alignments induced from the different neural systems in terms of the alignment error rate (AER) [Och:2003:SCV:778822.778824] and its soft version (SAER), following Tu et al. [DBLP:journals/corr/TuLLLL16].888Notice that we used the same dataset and evaluation script as Tu et al. [DBLP:journals/corr/TuLLLL16]. We refer the readers to [DBLP:journals/corr/TuLLLL16] for more details. Table 4 displays the evaluation results. We find that both GAtt and GAtt-Inv significantly outperform RNNSearch in terms of both AER and SAER. Specifically, GAtt obtains a gain of 7.91 SAER and 7.3 AER points over RNNSearch. As we obtain word alignments by connecting each target word to the source word with the highest attention weight, the consistent improvements of our model over RNNSearch in AER indicate that our model indeed learns more accurate attentions.
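The induced hard alignments come from linking each target word to the source position with the maximal attention weight; a sketch (function name ours):

```python
import numpy as np

def hard_alignments(A):
    """A: (T_tgt, T_src) matrix of attention weights, one row per target
    word. Link each target position to the source position it attends to
    most, as done when scoring AER here."""
    return {i: int(A[i].argmax()) for i in range(A.shape[0])}
```

These one-best links are then compared against gold alignments to compute AER, while SAER scores the soft attention matrix directly.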
Another very important question is whether GAtt enhances the discrimination of the context vectors. We answer this question by visualizing these vectors, as shown in Figure 3. We observe that the heatmap of RNNSearch is very smooth, varying only slightly across decoding steps (the horizontal axis). This means that the context vectors are very similar to one another and thus lack discrimination. In contrast, there are obvious variations in GAtt. Statistically, the mean variance of the context vectors across dimensions is 0.0057 in RNNSearch, while it is 0.0365 in GAtt, about 6 times larger. Additionally, across decoding steps, the mean variance is 0.0088 in RNNSearch, while it is 0.0465 in GAtt. All of this strongly suggests that our model makes the context vectors more discriminative across different target words.
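The two variance statistics can be computed directly from the stacked context vectors of a sentence; a sketch (names ours):

```python
import numpy as np

def context_variance(C):
    """C: (T, d) matrix stacking the context vectors c_1..c_T of one
    sentence. Returns the mean variance along each of the two axes:
    per-dimension variance over decoding steps, and per-step variance
    over dimensions."""
    over_steps = C.var(axis=0)  # how much each dimension moves over time
    over_dims = C.var(axis=1)   # spread within each context vector
    return over_steps.mean(), over_dims.mean()
```

A model whose context vectors barely change from step to step will show a near-zero first statistic, which is the symptom reported for the vanilla attention.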
5.6 Over-Translation Evaluation
Over-translation, i.e. repeatedly predicting the same target words [DBLP:journals/corr/TuLLLL16], is a challenging problem for NMT. We conjecture that the over-translation issue is partially due to the small differences among the context vectors learned by the vanilla attention mechanism. As the proposed GAtt improves the discrimination power of context vectors, we hypothesize that our model deals better with the over-translation issue than the vanilla attention network. To test this hypothesis, we introduce a metric called the N-Gram Repetition Rate (N-GRR) that calculates the proportion of repeated n-grams in a sentence:

$$\text{N-GRR} = \frac{1}{SK} \sum_{s=1}^{S} \sum_{k=1}^{K} \frac{N_{s,k} - R_{s,k}}{N_{s,k}}$$

where $N_{s,k}$ denotes the total number of n-grams in the $k$-th translation of the $s$-th sentence in the test corpus, and $R_{s,k}$ the number of n-grams after duplicates are removed. In our test sets, there are $S$ sentences, with $K{=}4$ reference translations per sentence and $K{=}1$ translation per sentence for the NMT systems. By comparing the N-GRR scores of machine-generated translations against those of the reference translations, we can roughly estimate how serious the over-translation problem is.
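For a single token sequence, the per-translation repetition rate inside the sum can be computed as follows (a sketch; corpus-level N-GRR averages this quantity over all sentences and translations):

```python
def repetition_rate(tokens, n):
    """Fraction of duplicated n-grams in one token sequence:
    (N - R) / N, where N is the total n-gram count and R the count
    after removing duplicates."""
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not grams:
        return 0.0  # sequence shorter than n: nothing to repeat
    return (len(grams) - len(set(grams))) / len(grams)
```

For example, the sequence "a b a b" contains three bigrams, one of which is a duplicate, giving a rate of 1/3.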
We show the N-GRR results in Table 5. Compared with the reference translations (Reference), RNNSearch yields significantly higher scores, indicating that RNNSearch generates redundant repeated n-grams in its translations, and therefore that the over-translation problem in RNNSearch is serious. In contrast, both GAtt and GAtt-Inv achieve considerable improvements over RNNSearch in terms of N-GRR. In particular, we find that GAtt-Inv performs better than GAtt on all n-grams, in accordance with the translation results in Table 1. These N-GRR results strongly suggest that the proposed models are able to alleviate the over-translation issue, and that generating more discriminative context vectors makes NMT suffer less from it.
6 Conclusion

In this paper, we have presented a novel GRU-gated attention model (GAtt) for NMT. Instead of using decoding-invariant source representations, GAtt produces source representations that vary across decoding steps according to the partial translation, so as to improve the discrimination of context vectors for translation. This is achieved by a gating layer that regards the source representations as the history and the previous decoder state as the input to a gated recurrent unit. Experiments on Chinese-English translation tasks demonstrate the effectiveness of our model. In-depth analysis further reveals that our model significantly reduces repeated, redundant translations (over-translations).
In the future, we would like to apply our model to other sequence learning tasks, as it can easily be adapted to any sequence-to-sequence task (e.g. document summarization, neural conversation models, speech recognition, etc.). Additionally, besides the GRU unit, we will explore different architectures for the gate mechanism, such as convolutional neural networks or the LSTM unit, since the gate mechanism plays a very important role in our model. Finally, we are interested in adapting our GAtt model as a tree-structured unit to compose the different nodes in a dependency tree.