# Fine-Grained Attention Mechanism for Neural Machine Translation

Heeyoul Choi
Handong Global University
heeyoul@gmail.com

Kyunghyun Cho
New York University
kyunghyun.cho@nyu.edu

Yoshua Bengio
University of Montreal
CIFAR Senior Fellow
yoshua.bengio@umontreal.ca

###### Abstract

Neural machine translation (NMT) has become a new paradigm in machine translation, and the attention mechanism has become the dominant approach, setting state-of-the-art results on many language pairs. While there are variants of the attention mechanism, all of them use only temporal attention, where one scalar value is assigned to one context vector corresponding to a source word. In this paper, we propose a fine-grained (or 2D) attention mechanism where each dimension of a context vector receives a separate attention score. In experiments on En-De and En-Fi translation, the fine-grained attention method improves the translation quality in terms of BLEU score. In addition, our alignment analysis reveals how the fine-grained attention mechanism exploits the internal structure of context vectors.

## 1 Introduction

Neural machine translation (NMT), which is an end-to-end approach to machine translation Kalchbrenner and Blunsom (2013); Sutskever et al. (2014); Bahdanau et al. (2015), has become widely adopted in machine translation research, as evidenced by its success in the recent WMT’16 translation task Sennrich et al. (2016); Chung et al. (2016). The attention-based approach, proposed by Bahdanau et al. (2015), has become the dominant approach, resulting in state-of-the-art translation quality on, for instance, En-Fr Jean et al. (2015a), En-De Jean et al. (2015b); Sennrich et al. (2016), En-Zh Shen et al. (2016), En-Ru chung2016character and En-Cz chung2016character; Luong and Manning (2016). These recent successes are largely due to better handling of a large target vocabulary Jean et al. (2015a); Sennrich2015; chung2016character; Luong and Manning (2016), incorporating a target-side monolingual corpus sennrich2015improving; Gulcehre et al. (2015) and advancing the attention mechanism luong2015effective; Cohn et al. (2016); Tu et al. (2016).

We notice that all the variants of the attention mechanism, including the original one by Bahdanau et al. (2015), are temporal in that they assign a scalar attention score to each context vector, which corresponds to a source symbol. In other words, all the dimensions of a context vector are treated equally. This is true not only for machine translation, but also for other tasks on which the attention-based approach was evaluated. For instance, the attention-based neural caption generation by Xu2015 assigns a scalar attention score to each context vector, which corresponds to a spatial location in an input image, treating all the dimensions of the context vector equally. See Cho et al. (2015) for more such examples.

On the other hand, Choi et al. (2017) showed that word embedding vectors carry more than one notion of similarity, by analyzing the local chart of the manifold on which word embedding vectors reside. Also, through contextualization of word embeddings, each dimension of the word embedding vectors could play a different role according to the context, which, in turn, led to better translation quality in terms of BLEU scores.

Inspired by the contextualization of word embeddings, in this paper, we propose to extend the attention mechanism so that each dimension of a context vector receives a separate attention score. This enables finer-grained attention, meaning that the attention mechanism may choose to focus on one of many possible interpretations of a single word encoded in the high-dimensional context vector Choi et al. (2017); Van der Maaten and Hinton (2012). This is done by letting the attention mechanism output as many scores as there are dimensions in a context vector, in contrast to the existing variants of the attention mechanism, which return a single scalar per context vector.

We evaluate and compare the proposed fine-grained attention mechanism on the tasks of En-De and En-Fi translation. The experiments reveal that the fine-grained attention mechanism improves the translation quality up to +1.4 BLEU. Our qualitative analysis found that the fine-grained attention mechanism indeed exploits the internal structure of each context vector.

## 2 Background: Attention-based Neural Machine Translation

The attention-based neural machine translation (NMT) model from Bahdanau et al. (2015) computes a conditional distribution over translations given a source sentence $X=(w^x_1, w^x_2, \ldots, w^x_T)$:

$$p(Y=(w^y_1, w^y_2, \ldots, w^y_{T'}) \mid X). \tag{1}$$

This is done by a neural network that consists of an encoder, a decoder and the attention mechanism.

The encoder is often implemented as a bidirectional recurrent neural network (RNN) that reads the source sentence word-by-word. Before being read by the encoder, each source word is projected onto a continuous vector space:

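$$\boldsymbol{x}_t = \boldsymbol{E}^x[\cdot, w^x_t], \tag{2}$$

where $\boldsymbol{E}^x[\cdot, w^x_t]$ is the $w^x_t$-th column vector of $\boldsymbol{E}^x \in \mathbb{R}^{E \times |V|}$, a source word embedding matrix, and $E$ and $|V|$ are the word embedding dimension and the vocabulary size, respectively.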
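As a minimal sketch of Eq. (2), the lookup amounts to selecting one column of the embedding matrix. The toy matrix, its size and the helper name `embed` are assumptions made purely for illustration and are not taken from the paper.

```python
# Toy source embedding matrix with embedding dimension 3 and vocabulary size 4.
# E_x[i][j] is row i (embedding dimension) and column j (word index).
E_x = [
    [0.1, 0.2, 0.3, 0.4],
    [0.5, 0.6, 0.7, 0.8],
    [0.9, 1.0, 1.1, 1.2],
]


def embed(E, word_index):
    """Return the column of E selected by word_index."""
    return [row[word_index] for row in E]


# A hypothetical source sentence given as word indices.
source_sentence = [2, 0, 3]
embedded = [embed(E_x, w) for w in source_sentence]
```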

The resulting sequence of word embedding vectors is then read by the bidirectional encoder recurrent network, which consists of forward and reverse recurrent networks. The forward recurrent network reads the sequence in left-to-right order while the reverse network reads it right-to-left:

$$\overrightarrow{\boldsymbol{h}}_t = \overrightarrow{\phi}(\overrightarrow{\boldsymbol{h}}_{t-1}, \boldsymbol{x}_t), \qquad \overleftarrow{\boldsymbol{h}}_t = \overleftarrow{\phi}(\overleftarrow{\boldsymbol{h}}_{t+1}, \boldsymbol{x}_t),$$

where the initial hidden states $\overrightarrow{\boldsymbol{h}}_0$ and $\overleftarrow{\boldsymbol{h}}_{T+1}$ are initialized as all-zero vectors or trained as parameters. The hidden states from the forward and reverse recurrent networks are concatenated at each time step $t$ to form an annotation vector $\boldsymbol{h}_t$:

$$\boldsymbol{h}_t = \left[\overrightarrow{\boldsymbol{h}}_t; \overleftarrow{\boldsymbol{h}}_t\right].$$

This concatenation results in a context $C$ that is a tuple of annotation vectors:

$$C = \{\boldsymbol{h}_1, \boldsymbol{h}_2, \ldots, \boldsymbol{h}_T\}.$$

The recurrent activation functions $\overrightarrow{\phi}$ and $\overleftarrow{\phi}$ are in most cases either long short-term memory units (LSTM, Hochreiter and Schmidhuber (1997)) or gated recurrent units (GRU, Cho et al. (2014)).
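The forward and reverse passes and their concatenation can be sketched as follows. This is only an illustration under simplifying assumptions: the recurrent activation is replaced by a toy element-wise tanh cell (`rnn_step`, a hypothetical stand-in for an LSTM or GRU), the hidden size equals the input size, and both initial states are all-zero.

```python
import math


def rnn_step(h_prev, x, w_h=0.5, w_x=1.0):
    """Toy recurrent activation: element-wise tanh(w_h * h_prev + w_x * x).
    A stand-in for an LSTM or GRU cell, chosen only to keep the sketch short."""
    return [math.tanh(w_h * h + w_x * xi) for h, xi in zip(h_prev, x)]


def bidirectional_encode(xs):
    """Return one annotation vector per source position: [forward; reverse]."""
    dim = len(xs[0])
    # Forward pass reads left-to-right from an all-zero initial state.
    fwd, h = [], [0.0] * dim
    for x in xs:
        h = rnn_step(h, x)
        fwd.append(h)
    # Reverse pass reads right-to-left from its own all-zero initial state.
    bwd, h = [None] * len(xs), [0.0] * dim
    for t in range(len(xs) - 1, -1, -1):
        h = rnn_step(h, xs[t])
        bwd[t] = h
    # Concatenate the two hidden states at each time step.
    return [f + b for f, b in zip(fwd, bwd)]


xs = [[0.1, -0.2], [0.3, 0.0], [-0.5, 0.4]]
C = bidirectional_encode(xs)
```

Each annotation vector in `C` has twice the hidden size, since it holds the forward state followed by the reverse state for the same source position.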

The decoder consists of a recurrent network and the attention mechanism. The recurrent network is a unidirectional language model to compute the conditional distribution over the next target word given all the previous target words and the source sentence:

$$p(w^y_{t'} \mid w^y_{<t'}, X).$$

By multiplying this conditional probability for all the words in the target, we recover the distribution over the full target translation in Eq. (1).
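A small worked example of this factorization, using made-up per-step probabilities (the numbers are hypothetical and serve only to show the arithmetic):

```python
import math

# Hypothetical conditional probabilities of each target word given the
# previous target words and the source sentence, for a 3-word translation.
step_probs = [0.5, 0.8, 0.25]

# The product of the conditionals is the probability of the full translation.
sentence_prob = math.prod(step_probs)

# In practice the sum of log-probabilities is used to avoid numerical underflow.
sentence_log_prob = sum(math.log(p) for p in step_probs)
```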

The decoder recurrent network maintains an internal hidden state $\boldsymbol{z}_{t'}$. At each time step $t'$, it first uses the attention mechanism to select, or weight, the annotation vectors in the context tuple $C$. The attention mechanism, which is a feedforward neural network, takes as input both the previous decoder hidden state and one of the annotation vectors, and returns a relevance score $e_{t',t}$:

$$e_{t',t} = f_{\text{Att}}(\boldsymbol{z}_{t'-1}, \boldsymbol{h}_t), \tag{3}$$

which is referred to as the score function luong2015effective; chung2016character. The function $f_{\text{Att}}$ can be implemented by a fully connected neural network with a single hidden layer, where $\tanh(\cdot)$ can be applied as the activation function. These relevance scores are normalized to be positive and sum to 1:

$$\alpha_{t',t} = \frac{\exp(e_{t',t})}{\sum_{k=1}^{T} \exp(e_{t',k})}. \tag{4}$$

We use the normalized scores to compute the weighted sum of the annotation vectors

$$\boldsymbol{c}_{t'} = \sum_{t=1}^{T} \alpha_{t',t} \boldsymbol{h}_t, \tag{5}$$

which will be used by the decoder recurrent network to update its own hidden state by

$$\boldsymbol{z}_{t'} = \phi_z(\boldsymbol{z}_{t'-1}, \boldsymbol{y}_{t'-1}, \boldsymbol{c}_{t'}).$$

Similarly to the encoder, $\phi_z$ is implemented as either an LSTM or GRU. $\boldsymbol{y}_{t'-1}$ is a target-side word embedding vector obtained by

$$\boldsymbol{y}_{t'-1} = \boldsymbol{E}^y[\cdot, w^y_{t'-1}],$$

similarly to Eq. (2).
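The normalization in Eq. (4) and the weighted sum in Eq. (5) can be sketched as below. The relevance scores are hypothetical inputs here; in the model they would be produced by the score network of Eq. (3), which this sketch does not reproduce. The function names are likewise illustrative.

```python
import math


def softmax(scores):
    """Normalize scores so that they are positive and sum to 1 (Eq. 4)."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(scores, annotations):
    """Weighted sum of annotation vectors, one scalar weight per vector (Eq. 5)."""
    alphas = softmax(scores)
    dim = len(annotations[0])
    context = [sum(a * h[d] for a, h in zip(alphas, annotations))
               for d in range(dim)]
    return alphas, context


# Hypothetical relevance scores for T = 3 source positions.
scores = [2.0, 0.0, 0.0]
annotations = [[1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
alphas, context = attend(scores, annotations)
```

Note that every dimension of an annotation vector is scaled by the same weight, which is the property the fine-grained mechanism relaxes.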

The probability of each word in the target vocabulary is computed by

$$p(w^y_{t'} = i \mid w^y_{<t'}, X)$$

where is the -th row vector of and is the bias.

The NMT model is usually trained to maximize the log-probability of the correct translation given a source sentence using a large training parallel corpus. This is done by stochastic gradient descent, where the gradient of the log-likelihood is efficiently computed by the backpropagation algorithm.

### 2.1 Variants of Attention Mechanism

Since the original attention mechanism was proposed as in Eq. (3) Bahdanau et al. (2015), there have been several variants luong2015effective.

luong2015effective presented a few variants of the attention mechanism on the sequence-to-sequence model Sutskever et al. (2014). Although their work cannot be directly compared to the attention model in Bahdanau et al. (2015), they introduced a few variants of the score function of the attention model: content-based and location-based score functions. Their score functions still assign a single value to the context vector as in Eq. (3).

Another variant is to add the target word embedding as input for the score function Jean et al. (2015a); chung2016character as follows:

$$e_{t',t} = f_{\text{AttY}}(\boldsymbol{z}_{t'-1}, \boldsymbol{h}_t, \boldsymbol{y}_{t'-1}), \tag{6}$$

and the score is normalized as before, which leads to $\alpha_{t',t}$. $f_{\text{AttY}}$ can be a fully connected neural network as in Eq. (3), with a different input size. This method provides the score function with additional information from the previous word. During training, teacher-forced true target words can be used, while at test time the previously generated word is used. In this variant, still a single score value is given to the context vector $\boldsymbol{h}_t$.

## 3 Fine-Grained Attention Mechanism

All the existing variants of the attention mechanism assign a single scalar score to each context vector $\boldsymbol{h}_t$. We notice, however, that it is not necessary to assign a single score to the whole context vector at a time, and that it may be beneficial to assign a score to each dimension of the context vector, as each dimension represents a different perspective on the captured internal structure. In Choi et al. (2017), it was shown that each dimension of a word embedding could have a different meaning and that the context could enrich the meaning of each dimension in different ways. The insight in this paper is similar to Choi et al. (2017), except for two points: (1) we focus on the encoded representation rather than the word embedding, and (2) we use two-dimensional attention rather than the context of the given sentence.

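We therefore propose to extend the score function in Eq. (3) to return a set of scores corresponding to the dimensions of the context vector $\boldsymbol{h}_t$. That is,

$$e^d_{t',t} = f^d_{\text{AttY2D}}(\boldsymbol{z}_{t'-1}, \boldsymbol{h}_t, \boldsymbol{y}_{t'-1}), \tag{7}$$

where $e^d_{t',t}$ is the score assigned to the $d$-th dimension of the $t$-th context vector $\boldsymbol{h}_t$ at time $t'$. Here, $f_{\text{AttY2D}}$ is a fully connected neural network whose number of output nodes is $\dim(\boldsymbol{h}_t)$. These dimension-specific scores are further normalized dimension-wise such that

$$\alpha^d_{t',t} = \frac{\exp(e^d_{t',t})}{\sum_{k=1}^{T} \exp(e^d_{t',k})}. \tag{8}$$

The context vectors are then combined by

$$\boldsymbol{c}_{t'} = \sum_{t=1}^{T} \boldsymbol{\alpha}_{t',t} \odot \boldsymbol{h}_t, \tag{9}$$

where $\boldsymbol{\alpha}_{t',t}$ is $\left[\alpha^1_{t',t}, \ldots, \alpha^{\dim(\boldsymbol{h}_t)}_{t',t}\right]^\top$, and $\odot$ is an element-wise multiplication.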

We contrast the conventional attention mechanism against the proposed fine-grained attention mechanism in Fig. 1.
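Eqs. (8) and (9) can be sketched as follows: one softmax over source positions per dimension, followed by an element-wise weighted sum. The per-dimension scores are hypothetical inputs; the score network of Eq. (7) is not reproduced, and the function names are illustrative assumptions.

```python
import math


def softmax(scores):
    """Normalize scores so that they are positive and sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fine_grained_attend(scores, annotations):
    """scores[t][d] is the score for dimension d of annotation vector t.
    Returns the weights alphas[t][d] (Eq. 8) and the context vector (Eq. 9)."""
    T, dim = len(annotations), len(annotations[0])
    alphas = [[0.0] * dim for _ in range(T)]
    for d in range(dim):
        # Normalize over source positions separately for each dimension.
        column = softmax([scores[t][d] for t in range(T)])
        for t in range(T):
            alphas[t][d] = column[t]
    # Element-wise product of weights and annotations, summed over positions.
    context = [sum(alphas[t][d] * annotations[t][d] for t in range(T))
               for d in range(dim)]
    return alphas, context


# Hypothetical scores: dimension 0 prefers source position 0,
# dimension 1 prefers source position 1.
scores = [[5.0, 0.0], [0.0, 5.0], [0.0, 0.0]]
annotations = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
alphas, context = fine_grained_attend(scores, annotations)
```

In this toy input the two dimensions attend to different source positions at the same decoding step, which a single scalar weight per annotation vector cannot express.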

## 4 Experimental Settings

We evaluate the proposed fine-grained attention mechanism on two translation tasks: (1) En-De and (2) En-Fi. For each language pair, we use all the parallel corpora available from WMT’15 for training, which results in 4.5M and 2M sentence pairs for En-De and En-Fi, respectively. In the case of En-De, we preprocessed the parallel corpora following Jean et al. (2015a) and ended up with 100M words on the English side. For En-Fi, we did not use any preprocessing routine other than simple tokenization.

Instead of space-separated tokens, we use 30k subwords extracted by byte pair encoding (BPE), as suggested in Sennrich2015. When computing the translation quality using BLEU, we un-BPE the resulting translations, but leave them tokenized.
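The un-BPE step can be sketched as below, assuming the common convention in which a subword that continues into the next token ends with an `@@` marker (tokens such as ‘re-@@’ in the alignment analysis suggest this convention, but the exact tooling is an assumption). The helper name `un_bpe` is hypothetical.

```python
def un_bpe(tokens, marker="@@"):
    """Merge subword tokens into words: a token ending with the marker
    is joined to the token that follows it."""
    words, current = [], ""
    for tok in tokens:
        if tok.endswith(marker):
            current += tok[: -len(marker)]
        else:
            words.append(current + tok)
            current = ""
    if current:  # dangling continuation at the end of the sequence
        words.append(current)
    return words


subwords = ["the", "re-@@", "election", "strat@@", "egy"]
restored = un_bpe(subwords)
```

The output is still a list of tokens, which matches leaving the translations tokenized for BLEU computation.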

### 4.2 Decoding and Evaluation

Once a model is trained, we use a simple forward beam search with the width set to 12 to find a translation that approximately maximizes $p(Y \mid X)$ from Eq. (1). The decoded translation is then un-BPE’d and evaluated against a reference sentence by BLEU (in practice, BLEU is computed over a set of sentences). We use newstest2013 and newstest2015 as the validation and test sets for En-De, and newsdev2015 and newstest2015 for En-Fi.
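A generic beam search can be sketched as follows. This is not the authors' decoder: `toy_model`, the end-of-sentence token `</s>` and the maximum length are assumptions, and the toy distribution is constructed so that a width-1 (greedy) search and a wider beam disagree.

```python
import math


def beam_search(next_log_probs, beam_width, max_len, eos="</s>"):
    """Keep the beam_width best partial translations at each step.
    next_log_probs(prefix) returns {token: log-probability} for the next token."""
    beams = [((), 0.0)]  # (prefix, accumulated log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for tok, lp in next_log_probs(prefix).items():
                candidates.append((prefix + (tok,), score + lp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:beam_width]:
            # Hypotheses that emit the end-of-sentence token leave the beam.
            (finished if prefix[-1] == eos else beams).append((prefix, score))
        if not beams:
            break
    return max(finished + beams, key=lambda c: c[1])


def toy_model(prefix):
    """Hypothetical next-token distribution over a tiny vocabulary."""
    if not prefix:
        return {"a": math.log(0.6), "b": math.log(0.4)}
    if len(prefix) == 1:
        if prefix[0] == "a":
            return {"x": math.log(0.5), "y": math.log(0.5)}
        return {"z": math.log(0.9), "x": math.log(0.1)}
    return {"</s>": 0.0}


best, best_score = beam_search(toy_model, beam_width=12, max_len=5)
greedy, greedy_score = beam_search(toy_model, beam_width=1, max_len=5)
```

Here the greedy search commits to the locally most probable first token, while the wider beam recovers a translation with a higher overall probability.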

### 4.3 Models

We use the attention-based neural translation model from Bahdanau et al. (2015) as a baseline, except for replacing the gated recurrent unit (GRU) with the long short-term memory unit (LSTM). The vocabulary size is 30K for both source and target languages, the dimension of word embedding is 620 for both languages, the number of hidden nodes for both encoder and decoder is 1K, and the dimension of hidden nodes for the alignment model is 2K.

Based on the above model configuration, we test a variant of this baseline model, in which we feed the previously decoded symbol directly to the attention score function from Eq. (3) (AttY). These models are compared against the model with the proposed fine-grained attention mechanism (AttY2D).

We further test adding a recently proposed technique, which treats each dimension of the word embedding differently based on the context. This is similar to our fine-grained attention in the sense that each dimension of the representation is treated in a different way. We evaluate the contextualization (Context) proposed by Choi et al. (2017). The contextualization enriches the word embedding vector by incorporating the context information:

$$\boldsymbol{c}^x = \frac{1}{T} \sum_{t=1}^{T} \text{NN}_{\boldsymbol{\theta}}(\boldsymbol{x}_t),$$

where $\text{NN}_{\boldsymbol{\theta}}$ is a feedforward neural network parametrized by $\boldsymbol{\theta}$. We closely follow Choi et al. (2017).
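The averaging in this formula can be sketched as below. The network itself is not specified here, so `toy_nn` (an element-wise ReLU) is a hypothetical stand-in, and how the resulting vector is then combined with the word embeddings follows Choi et al. (2017) and is not shown.

```python
def context_vector(embeddings, nn):
    """Average of nn(x_t) over all source positions t = 1..T."""
    T = len(embeddings)
    transformed = [nn(x) for x in embeddings]
    dim = len(transformed[0])
    return [sum(v[d] for v in transformed) / T for d in range(dim)]


def toy_nn(x):
    """Hypothetical stand-in for the feedforward network: element-wise ReLU."""
    return [max(0.0, xi) for xi in x]


embeddings = [[1.0, -2.0], [3.0, 4.0]]
c_x = context_vector(embeddings, toy_nn)
```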

All the models were trained using Adam Kingma and Ba (2014) until the BLEU score on the validation set stopped improving. For computing the validation score during training, we use greedy search instead of beam search in order to minimize the computational overhead; that is, the beam width is set to 1. As in Bahdanau et al. (2015), we trained our models with sentences of length up to 50 words.

## 5 Experiments

### 5.1 Quantitative Analysis

We present the translation quality of all the models on both En-De and En-Fi in Table 1. We observe gains of up to +1.4 BLEU when the proposed fine-grained attention mechanism is used instead of the conventional attention mechanism (Baseline vs. Baseline+AttY vs. Baseline+AttY2D) on both language pairs. These results confirm the importance of treating each dimension of the context vector separately.

With the contextualization (+Context or +C in the table), we observe the same pattern of improvements from the proposed method. Although the contextualization alone improves BLEU by up to +1.8 compared to the baseline, the fine-grained attention boosts the BLEU score by an additional +1.4.

The improvements in accuracy require additional time as well as a larger model size. The model size increases by 3.5% in relative terms from +AttY to +AttY2D, and by 3.4% from +C+AttY to +C+AttY2D. The translation times are summarized in Table 2, which shows that the proposed model needs extra time (from 4.5% to 14% in relative terms).

### 5.2 Alignment Analysis

Unlike the conventional attention mechanism, the proposed fine-grained one returns a 3-D tensor representing the relationship between the triplet of a source symbol (indexed by $t$), a target symbol (indexed by $t'$) and a dimension $d$ of the corresponding context vector. This makes it challenging to visualize the result of the fine-grained attention mechanism, especially because the dimensionality of the context vector is often large (in our case, 2000).

Instead, we first visualize the alignment averaged over the dimensions of a context vector:

$$A_{t,t'} = \frac{1}{\dim(\boldsymbol{c}_t)} \sum_{d=1}^{\dim(\boldsymbol{c}_t)} \alpha^d_{t',t}.$$

This computes the strength of alignment between source and target symbols, and should be comparable to the alignment matrix from the conventional attention mechanism.

In Fig. 2, we visualize the alignment found by (left) the original model from Bahdanau et al. (2015), (middle) the modification in which the previously decoded target symbol is fed directly to the conventional attention mechanism (AttY), and (right) the averaged alignment from the proposed fine-grained attention mechanism. There is a clear similarity among these three alternatives, but we observe a clearer, more focused alignment in the case of the proposed fine-grained attention model.

Second, we visualize the alignment averaged over the target:

$$A_{t,d} = \frac{1}{|Y|} \sum_{t'=1}^{|Y|} \alpha^d_{t',t}.$$

This matrix is expected to reveal the dimensions of a context vector per source symbol that are relevant for translating it without necessarily specifying the aligned target symbol(s).
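Both averages can be computed from the attention tensor as sketched below. The nested-list layout `alpha[t'][t][d]`, the function names and the toy values are assumptions chosen for illustration; they are not the authors' analysis code.

```python
def average_over_dimensions(alpha):
    """alpha[tp][t][d] is the weight for target position tp, source position t
    and dimension d. Returns A[t][tp], the mean over dimensions d."""
    n_target, n_source = len(alpha), len(alpha[0])
    dim = len(alpha[0][0])
    return [[sum(alpha[tp][t]) / dim for tp in range(n_target)]
            for t in range(n_source)]


def average_over_target(alpha):
    """Returns A[t][d], the mean of the weights over target positions."""
    n_target, n_source = len(alpha), len(alpha[0])
    dim = len(alpha[0][0])
    return [[sum(alpha[tp][t][d] for tp in range(n_target)) / n_target
             for d in range(dim)]
            for t in range(n_source)]


# Toy tensor with 2 target positions, 2 source positions and 2 dimensions.
alpha = [
    [[1.0, 0.0], [0.0, 1.0]],  # first target position
    [[0.5, 0.5], [0.5, 0.5]],  # second target position
]
A_source_target = average_over_dimensions(alpha)
A_source_dim = average_over_target(alpha)
```

In this toy tensor the dimension-averaged alignment is uniform, while the target-averaged view still shows that each source position is attended to through a different dimension.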

In Fig. 3, we can see a very sparse representation, where each source word receives a different pattern of attention across dimensions.

We can further inspect the alignment tensor by visualizing the $d$-th slice of the tensor. Fig. 4 shows 6 example dimensions, where different dimensions focus on different aspects of translation. Some dimensions represent syntactic information, while others capture semantic information. Also, syntactic information is handled in different dimensions according to the word type, such as article (‘a’ and ‘the’), preposition (‘to’ and ‘of’), noun (‘strategy’, ‘election’ and ‘Obama’), and adjective (‘Republican’ and ‘re-@@’). As for semantic information, Fig. 4(f) shows a strong pattern of attention on the words ‘Republican’, ‘strategy’, ‘election’ and ‘Obama’, which seem to relate to ‘politics’. Although we present only one example attention matrix, we observed the same patterns in other examples.

## 6 Conclusions

In this paper, we proposed a fine-grained (or 2D) attention mechanism for neural machine translation. The experiments on En-De and En-Fi show that the proposed attention method improves the translation quality by up to +1.4 BLEU. When the method was combined with a previous technique, contextualization, which is based on a similar idea, the performance was further improved. Our alignment analysis of the fine-grained attention method revealed that the different dimensions of the context play different roles in neural machine translation.

An interesting direction for future work is to test the fine-grained attention with other NMT models, such as character-level models or multi-layered encoder/decoder models Ling et al. (2015); chung2016character. Also, the fine-grained attention mechanism can be applied to different tasks such as speech recognition.

## Acknowledgments

The authors would like to thank the developers of Theano Bastien et al. (2012). This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (2017R1D1A1B03033341). Also, we acknowledge the support of the following agencies for research funding and computing support: NSERC, Calcul Québec, Compute Canada, the Canada Research Chairs, CIFAR and Samsung. KC thanks Facebook and Google (Google Faculty Award 2016) for their support.

## References

• Bahdanau et al. (2015) Bahdanau, D., K. Cho, and Y. Bengio, 2015: Neural Machine Translation by Jointly Learning to Align and Translate. In Proc. Int’l Conf. on Learning Representations (ICLR).
• Bastien et al. (2012) Bastien, F., P. Lamblin, R. Pascanu, J. Bergstra, I. Goodfellow, A. Bergeron, N. Bouchard, D. Warde-farley, and Y. Bengio, 2012: Theano : new features and speed improvements. In NIPS 2012 deep learning workshop.
• Cho et al. (2015) Cho, K., A. Courville, and Y. Bengio, 2015: Describing multimedia content using attention-based encoder–decoder networks. IEEE Transactions on Multimedia, 17(11), 1875–1886.
• Cho et al. (2014) Cho, K., B. van Merrienboer, D. Bahdanau, and Y. Bengio, 2014: On the properties of neural machine translation: Encoder-decoder approaches. In SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, pp. 103–111.
• Choi et al. (2017) Choi, H., K. Cho, and Y. Bengio, 2017: Context-dependent word representation for neural machine translation. Computer Speech and Language, 45, 149–160.
• Chung et al. (2016) Chung, J., K. Cho, and Y. Bengio, 2016: Nyu-mila neural machine translation systems for wmt’16. In The First Conference on Statistical Machine Translation (WMT).
• Cohn et al. (2016) Cohn, T., C. D. V. Hoang, E. Vymolova, K. Yao, C. Dyer, and G. Haffari, 2016: Incorporating structural alignment biases into an attentional neural translation model. In NAACL-HLT, pp. 876–885.
• Gulcehre et al. (2015) Gulcehre, C., O. Firat, K. Xu, K. Cho, L. Barrault, H.-C. Lin, F. Bougares, H. Schwenk, and Y. Bengio, 2015: On using monolingual corpora in neural machine translation. arXiv preprint arXiv:1503.03535.
• Hochreiter and Schmidhuber (1997) Hochreiter, S. and J. Schmidhuber, 1997: Long short-term memory. Neural computation, 9(8), 1735–80.
• Jean et al. (2015a) Jean, S., K. Cho, R. Memisevic, and Y. Bengio, 2015a: On Using Very Large Target Vocabulary for Neural Machine Translation. In 53rd Annual Meeting of the Association for Computational Linguistics.
• Jean et al. (2015b) Jean, S., O. Firat, K. Cho, R. Memisevic, and Y. Bengio, 2015b: Montreal neural machine translation systems for wmt15. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pp. 134–140.
• Kalchbrenner and Blunsom (2013) Kalchbrenner, N. and P. Blunsom, 2013: Recurrent continuous translation models. EMNLP, 3(39), 413.
• Kingma and Ba (2014) Kingma, D. P. and J. L. Ba, 2014: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
• Ling et al. (2015) Ling, W., I. Trancoso, C. Dyer, and A. W. Black, 2015: Character-based neural machine translation. arXiv preprint arXiv:1511.04586.
• Luong and Manning (2016) Luong, M.-T. and C. D. Manning, 2016: Achieving open vocabulary neural machine translation with hybrid word-character models. In 54th Annual Meeting of the Association for Computational Linguistics, pp. 1054–1063.
• Sennrich et al. (2016) Sennrich, R., B. Haddow, and A. Birch, 2016: Edinburgh neural machine translation systems for wmt 16. In The First Conference on Statistical Machine Translation (WMT).
• Shen et al. (2016) Shen, S., Y. Cheng, Z. He, W. He, H. Wu, M. Sun, and Y. Liu, 2016: Minimum Risk Training for Neural Machine Translation.
• Sutskever et al. (2014) Sutskever, I., O. Vinyals, and Q. V. Le, 2014: Sequence to Sequence Learning with Neural Networks. In Advances in Neural Information Processing Systems (NIPS).
• Tu et al. (2016) Tu, Z., Z. Lu, Y. Liu, X. Liu, and H. Li, 2016: Modeling coverage for neural machine translation. In 54th Annual Meeting of the Association for Computational Linguistics, pp. 76–85.
• Van der Maaten and Hinton (2012) Van der Maaten, L. and G. Hinton, 2012: Visualizing non-metric similarities in multiple maps. Machine learning, 87(1), 33–55.