Sparse and Constrained Attention for Neural Machine Translation
Abstract
In NMT, words are sometimes dropped from the source or generated repeatedly in the translation. We explore novel strategies to address the coverage problem that change only the attention transformation. Our approach allocates fertilities to source words, used to bound the attention each word can receive. We experiment with various sparse and constrained attention transformations and propose a new one, constrained sparsemax, shown to be differentiable and sparse. Empirical evaluation is provided on three language pairs.
1 Introduction
Neural machine translation (NMT) has emerged in the last few years as a very successful paradigm (Sutskever et al., 2014; Bahdanau et al., 2014; Gehring et al., 2017; Vaswani et al., 2017). While NMT is generally more fluent than previous statistical systems, adequacy is still a major concern (Koehn and Knowles, 2017): common mistakes include dropping source words and repeating words in the generated translation.
Previous work has attempted to mitigate this problem in various ways. Wu et al. (2016) incorporate coverage and length penalties during beam search, a simple yet limited solution, since it only affects the scores of translation hypotheses that are already in the beam. Other approaches involve architectural changes: providing coverage vectors to track the attention history (Mi et al., 2016; Tu et al., 2016), using gating architectures and adaptive attention to control the amount of source context provided (Tu et al., 2017a; Li and Zhu, 2017), or adding a reconstruction loss (Tu et al., 2017b). Feng et al. (2016) also use the notion of fertility implicitly in their proposed model. Their “fertility conditioned decoder” uses a coverage vector and an “extract gate” which are incorporated in the decoding recurrent unit, increasing the number of parameters.
In this paper, we propose a different solution that does not change the overall architecture, but only the attention transformation. Namely, we replace the traditional softmax by other recently proposed transformations that either promote attention sparsity (Martins and Astudillo, 2016) or upper bound the amount of attention a word can receive (Martins and Kreutzer, 2017). The bounds are determined by the fertility values of the source words. While these transformations have given encouraging results in various NLP problems, they have never been applied to NMT, to the best of our knowledge. Furthermore, we combine these two ideas and propose a novel attention transformation, constrained sparsemax, which produces both sparse and bounded attention weights, yielding a compact and interpretable set of alignments. While being in between soft and hard alignments (Figure 2), the constrained sparsemax transformation is end-to-end differentiable, hence amenable to training with gradient backpropagation.
To sum up, our contributions are as follows (our software code is available at the OpenNMT fork www.github.com/Unbabel/OpenNMT-py/tree/dev and the running scripts at www.github.com/Unbabel/sparse_constrained_attention):

We formulate constrained sparsemax and derive efficient linear- and sublinear-time algorithms for running forward and backward propagation. This transformation has two levels of sparsity: over time steps, and over the attended words at each step.

We provide a detailed empirical comparison of various attention transformations, including softmax (Bahdanau et al., 2014), sparsemax (Martins and Astudillo, 2016), constrained softmax (Martins and Kreutzer, 2017), and our newly proposed constrained sparsemax. We provide error analysis including two new metrics targeted at detecting coverage problems.
2 Preliminaries
Our underlying model architecture is a standard attentional encoder-decoder (Bahdanau et al., 2014). Let $x := x_1, \ldots, x_J$ and $y := y_1, \ldots, y_T$ denote the source and target sentences, respectively. We use a Bi-LSTM encoder to represent the source words as a matrix $H := [h_1, \ldots, h_J] \in \mathbb{R}^{2D \times J}$. The conditional probability of the target sentence is given as
$$p(y \mid x) := \prod_{t=1}^{T} p(y_t \mid y_{1:(t-1)}, x), \tag{1}$$
where $p(y_t \mid y_{1:(t-1)}, x)$ is computed by a softmax output layer that receives a decoder state $s_t$ as input. This state is updated by an autoregressive LSTM, $s_t = \mathrm{RNN}(\mathrm{embed}(y_{t-1}), s_{t-1}, c_t)$, where $c_t$ is an input context vector. This vector is computed as $c_t := H \alpha_t$, where $\alpha_t$ is a probability distribution that represents the attention over the source words, commonly obtained as
$$\alpha_t = \mathrm{softmax}(z_t), \tag{2}$$
where $z_t \in \mathbb{R}^J$ is a vector of scores. We follow Luong et al. (2015) and define $z_{t,j} := s_{t-1}^\top W h_j$ as a bilinear transformation of encoder and decoder states, where $W$ is a model parameter. (This is the default implementation in the OpenNMT package. In preliminary experiments, feedforward attention (Bahdanau et al., 2014) did not show improvements.)
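The attention step described above (bilinear scores, softmax, context vector) can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the helper names `softmax`, `bilinear_scores` and `context_vector` are hypothetical, encoder states are stored as a list of vectors (one per source word) rather than as matrix columns, and the toy numbers are made up.

```python
import math

def softmax(z):
    """Numerically stable softmax over a list of scores."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def bilinear_scores(s_prev, W, H):
    """Scores z_j = s_prev^T W h_j for every encoder state h_j in H."""
    # Project the decoder state once (q = W^T s_prev), so that z_j = q . h_j.
    q = [sum(s_prev[i] * W[i][k] for i in range(len(s_prev)))
         for k in range(len(W[0]))]
    return [sum(qk * hk for qk, hk in zip(q, h)) for h in H]

def context_vector(H, alpha):
    """Context c = sum_j alpha_j * h_j."""
    dim = len(H[0])
    return [sum(a * h[k] for a, h in zip(alpha, H)) for k in range(dim)]

# Toy example: two source words, states of size 2, identity bilinear map.
H = [[1.0, 0.0], [0.0, 1.0]]
W = [[1.0, 0.0], [0.0, 1.0]]
s_prev = [1.0, 0.0]

z = bilinear_scores(s_prev, W, H)
alpha = softmax(z)
c = context_vector(H, alpha)
```

Because softmax is strictly positive, both source words receive some attention here even though the second one has a lower score.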
3 Sparse and Constrained Attention
In this work, we consider alternatives to Eq. 2. Since the softmax is strictly positive, it forces all words in the source to receive some probability mass in the resulting attention distribution, which can be wasteful. Moreover, it may happen that the decoder attends repeatedly to the same source words across time steps, causing repetitions in the generated translation, as Tu et al. (2016) observed.
With this in mind, we replace Eq. 2 by $\alpha_t = \rho(z_t, u_t)$, where $\rho$ is a transformation that may depend both on the scores $z_t \in \mathbb{R}^J$ and on upper bounds $u_t \in \mathbb{R}^J$ that limit the amount of attention that each word can receive. We consider three alternatives to softmax, described next.
Sparsemax.
The sparsemax transformation (Martins and Astudillo, 2016) is defined as:
$$\mathrm{sparsemax}(z) := \arg\min_{\alpha \in \Delta^J} \|\alpha - z\|^2, \tag{3}$$
where $\Delta^J := \{\alpha \in \mathbb{R}^J \mid \alpha \ge 0, \sum_j \alpha_j = 1\}$. In words, it is the Euclidean projection of the scores $z$ onto the probability simplex. These projections tend to hit the boundary of the simplex, yielding a sparse probability distribution. This allows the decoder to attend only to a few words in the source, assigning zero probability mass to all other words. Martins and Astudillo (2016) have shown that the sparsemax can be evaluated in $O(J)$ time (same asymptotic cost as softmax) and gradient backpropagation takes sublinear time (faster than softmax), by exploiting the sparsity of the solution.
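To make the projection concrete, here is a minimal sketch of sparsemax in plain Python. It uses the simple sort-based procedure, which costs O(J log J), rather than the linear-time selection variant mentioned above; the function name and the toy scores are illustrative.

```python
def sparsemax(z):
    """Euclidean projection of the score vector z onto the probability simplex.

    Sort-based sketch: find the largest k such that the k largest scores stay
    positive after subtracting a common threshold tau, then clip at zero.
    """
    z_sorted = sorted(z, reverse=True)
    cumsum = 0.0
    tau = 0.0
    for k, zk in enumerate(z_sorted, start=1):
        cumsum += zk
        if 1.0 + k * zk > cumsum:
            # The k-th largest score is still in the support.
            tau = (cumsum - 1.0) / k
        else:
            break
    return [max(zj - tau, 0.0) for zj in z]

# The lowest score is pushed to exactly zero.
probs = sparsemax([1.0, 0.5, -1.0])
```

Unlike softmax, the output contains exact zeros, so the decoder attends to only a subset of the source words.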
Constrained softmax.
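The constrained softmax transformation was recently proposed by Martins and Kreutzer (2017) in the context of easy-first sequence tagging, being defined as follows:
$$\mathrm{csoftmax}(z; u) := \arg\min_{\alpha \in \Delta^J} \mathrm{KL}(\alpha \,\|\, \mathrm{softmax}(z)) \quad \text{s.t.} \quad \alpha \le u, \tag{4}$$
where $\Delta^J$ is the probability simplex, $u$ is a vector of upper bounds, and $\mathrm{KL}(\cdot \,\|\, \cdot)$ is the Kullback-Leibler divergence. In other words, it returns the distribution closest to $\mathrm{softmax}(z)$ whose attention probabilities are bounded by $u$. Martins and Kreutzer (2017) have shown that this transformation can be evaluated in $O(J \log J)$ time and its gradients backpropagated in $O(J)$ time.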
To use this transformation in the attention mechanism, we make use of the idea of fertility (Brown et al., 1993). Namely, let $\beta_{t-1} := \sum_{\tau=1}^{t-1} \alpha_\tau$ denote the cumulative attention that each source word has received up to time step $t$, and let $f := (f_j)_{j=1}^{J}$ be a vector containing fertility upper bounds for each source word. The attention at step $t$ is computed as
$$\alpha_t = \mathrm{csoftmax}(z_t, f - \beta_{t-1}). \tag{5}$$
Intuitively, each source word $j$ gets a credit of $f_j$ units of attention, which are consumed along the decoding process. If all the credit is exhausted, it receives zero attention from then on. Unlike the sparsemax transformation, which places sparse attention over the source words, the constrained softmax leads to sparsity over time steps.
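The credit mechanism can be illustrated with a small sketch. The code below is not the sort-based algorithm of Martins and Kreutzer (2017); it is a simpler iterative clipping scheme that reaches the same kind of solution (softmax weights, rescaled, with saturated entries pinned to their bound) and may take up to J passes. The function names, the toy scores and the fertilities are assumptions made for illustration, and the bounds are assumed to sum to at least 1.

```python
import math

def csoftmax(z, u):
    """Softmax-like weights subject to alpha_j <= u_j (assumes sum(u) >= 1)."""
    m = max(z)
    e = [math.exp(zj - m) for zj in z]
    alpha = [0.0] * len(z)
    free = set(range(len(z)))   # indices not yet pinned to their bound
    mass = 1.0                  # probability mass left for the free indices
    while free:
        norm = sum(e[j] for j in free)
        p = {j: mass * e[j] / norm for j in free}
        violators = [j for j in free if p[j] > u[j]]
        if not violators:
            for j in free:
                alpha[j] = p[j]
            break
        for j in violators:
            # Saturate the word at its bound and redistribute the rest.
            alpha[j] = u[j]
            mass -= u[j]
            free.remove(j)
    return alpha

def attend_with_fertility(score_rows, fertility):
    """Run csoftmax over time steps with bounds u_t = f - beta_{t-1}."""
    beta = [0.0] * len(fertility)   # cumulative attention per source word
    steps = []
    for z in score_rows:
        u = [max(f - b, 0.0) for f, b in zip(fertility, beta)]
        alpha = csoftmax(z, u)
        beta = [b + a for b, a in zip(beta, alpha)]
        steps.append(alpha)
    return steps

# The first word has the highest score at both steps, but only one unit of credit.
steps = attend_with_fertility([[2.0, 0.0, 0.0], [2.0, 0.0, 0.0]], [1.0, 1.0, 1.0])
```

In this toy run the first word takes most of the attention at step one, and at step two it can only receive what is left of its credit, so the remaining mass is pushed to the other words.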
Constrained sparsemax.
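In this work, we propose a novel transformation which shares the two properties above: it provides both sparse and bounded probabilities. It is defined as:
$$\mathrm{csparsemax}(z; u) := \arg\min_{\alpha \in \Delta^J} \|\alpha - z\|^2 \quad \text{s.t.} \quad \alpha \le u. \tag{6}$$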
The following result, whose detailed proof we include as supplementary material (Appendix A), is key for enabling the use of the constrained sparsemax transformation in neural networks.
Proposition 1
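Let $\alpha^\star = \mathrm{csparsemax}(z; u)$ be the solution of Eq. 6, and define the sets $A = \{j \in [J] \mid 0 < \alpha^\star_j < u_j\}$, $A_L = \{j \in [J] \mid \alpha^\star_j = 0\}$, and $A_R = \{j \in [J] \mid \alpha^\star_j = u_j\}$. Then:

Forward propagation. The solution takes the form $\alpha^\star_j = \max\{0, \min\{u_j, z_j - \tau\}\}$, where $\tau$ is a normalization constant, and it can be computed in $O(J)$ time with the algorithm of Pardalos and Kovoor (1990) (see Appendix A).

Gradient backpropagation. Backpropagation takes sublinear time $O(|A| + |A_R|)$. Let $L(\theta)$ be a loss function, $d\alpha = \nabla_\alpha L(\theta)$ be the output gradient, and $dz = \nabla_z L(\theta)$ and $du = \nabla_u L(\theta)$ be the input gradients. Then, we have:
$$dz_j = \mathbb{1}(j \in A)\,(d\alpha_j - m), \tag{7}$$
$$du_j = \mathbb{1}(j \in A_R)\,(d\alpha_j - m), \tag{8}$$
where $m = \frac{1}{|A|} \sum_{j \in A} d\alpha_j$.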
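A minimal sketch of both passes follows. It relies on the closed form derived in Appendix A, where each weight is the score shifted by a threshold tau and clipped to the interval [0, u_j]. For simplicity the threshold is found by bisection rather than by the linear-time algorithm of Pardalos and Kovoor (1990); the function names, the tolerance `tol` and the toy numbers are assumptions, and the bounds are assumed to sum to at least 1.

```python
def csparsemax(z, u, iters=100):
    """Return (alpha, tau) with alpha_j = max(0, min(u_j, z_j - tau)), sum(alpha) = 1."""
    def total(tau):
        return sum(max(0.0, min(uj, zj - tau)) for zj, uj in zip(z, u))

    # total() is non-increasing in tau, with total(lo) >= 1 and total(hi) = 0.
    lo, hi = min(z) - 1.0, max(z)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if total(mid) > 1.0:
            lo = mid
        else:
            hi = mid
    tau = 0.5 * (lo + hi)
    alpha = [max(0.0, min(uj, zj - tau)) for zj, uj in zip(z, u)]
    return alpha, tau

def csparsemax_backward(z, u, tau, dalpha, tol=1e-9):
    """Input gradients (dz, du) from the output gradient dalpha."""
    shifted = [zj - tau for zj in z]
    # A: strictly between the bounds; A_R: saturated at the upper bound.
    A = [j for j, v in enumerate(shifted) if tol < v < u[j] - tol]
    A_R = [j for j, v in enumerate(shifted) if v >= u[j] - tol]
    m = sum(dalpha[j] for j in A) / len(A) if A else 0.0
    dz = [0.0] * len(z)
    du = [0.0] * len(z)
    for j in A:
        dz[j] = dalpha[j] - m
    for j in A_R:
        du[j] = dalpha[j] - m
    return dz, du

z = [1.0, 0.9, 0.8, -1.0]
u = [0.2, 1.0, 1.0, 1.0]
alpha, tau = csparsemax(z, u)
dz, du = csparsemax_backward(z, u, tau, [1.0, 2.0, 4.0, 8.0])
```

In this example the first word is capped at its bound of 0.2, the last word gets exactly zero, and only the two middle words carry a gradient with respect to their scores.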
4 Fertility Bounds
We experiment with three ways of setting the fertility of the source words: constant, guided, and predicted. With constant, we set the fertilities of all source words to a fixed integer value $f$. With guided, we train a word aligner based on IBM Model 2 (we used fast_align (Dyer et al., 2013) in our experiments) and, for each word in the vocabulary, we set the fertilities to the maximal observed value in the training data (or 1 if no alignment was observed). With the predicted strategy, we train a separate fertility predictor model using a bi-LSTM tagger (a similar strategy was recently used by Gu et al. (2018) as a component of their non-autoregressive NMT model). At training time, we provide as supervision the fertility estimated by fast_align. Since our model works with fertility upper bounds and the word aligner may miss some word pairs, we found it beneficial to add a constant to this number (1 in our experiments). At test time, we use the expected fertilities according to our model.
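The guided strategy can be sketched as follows. The code assumes that word alignments are already available as lists of (source index, target index) pairs per sentence; it does not parse the output of any particular aligner, and the function names and toy corpus are hypothetical.

```python
from collections import Counter

def guided_fertilities(corpus):
    """Maximal observed fertility per source word type.

    corpus: iterable of (source_tokens, alignment), where alignment is a list
    of (source_index, target_index) pairs for that sentence.
    """
    max_fert = {}
    for source_tokens, alignment in corpus:
        # Number of target words aligned to each source position.
        counts = Counter(src for src, _ in alignment)
        for idx, word in enumerate(source_tokens):
            max_fert[word] = max(max_fert.get(word, 0), counts.get(idx, 0))
    return max_fert

def lookup_fertility(max_fert, word):
    """Fertility bound for a word, falling back to 1 if no alignment was observed."""
    observed = max_fert.get(word, 0)
    return observed if observed > 0 else 1

toy_corpus = [
    (["das", "haus"], [(0, 0), (1, 1), (1, 2)]),
    (["das"], []),
]
table = guided_fertilities(toy_corpus)
```

Here "haus" is aligned to two target words in one sentence, so its bound is 2, while a word that was never aligned falls back to 1.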
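Table 1: BLEU, METEOR, REP and DROP scores on De-En, Ja-En and Ro-En.

| Model | De-En BLEU | De-En METEOR | De-En REP | De-En DROP | Ja-En BLEU | Ja-En METEOR | Ja-En REP | Ja-En DROP | Ro-En BLEU | Ro-En METEOR | Ro-En REP | Ro-En DROP |
| softmax | 29.51 | 31.43 | 3.37 | 5.89 | 20.36 | 23.83 | 13.48 | 23.30 | 29.67 | 32.05 | 2.45 | 5.59 |
| softmax + CovPenalty | 29.69 | 31.53 | 3.47 | 5.74 | 20.70 | 24.12 | 14.12 | 22.79 | 29.81 | 32.15 | 2.48 | 5.49 |
| softmax + CovVector | 29.63 | 31.54 | 2.93 | 5.65 | 21.53 | 24.50 | 11.07 | 22.18 | 30.08 | 32.22 | 2.42 | 5.47 |
| sparsemax | 29.73 | 31.54 | 3.18 | 5.90 | 21.28 | 24.25 | 13.09 | 22.40 | 29.97 | 32.12 | 2.19 | 5.60 |
| sparsemax + CovPenalty | 29.83 | 31.60 | 3.24 | 5.79 | 21.64 | 24.49 | 13.36 | 21.91 | 30.07 | 32.20 | 2.20 | 5.47 |
| sparsemax + CovVector | 29.22 | 31.18 | 3.13 | 6.15 | 21.35 | 24.74 | 10.11 | 21.25 | 29.30 | 31.84 | 2.18 | 5.87 |
| csoftmax () | 29.39 | 31.33 | 3.29 | 5.86 | 20.71 | 24.00 | 12.38 | 22.73 | 29.39 | 31.83 | 2.37 | 5.64 |
| csparsemax () | 29.85 | 31.76 | 2.67 | 5.23 | 21.31 | 24.51 | 11.40 | 21.59 | 29.77 | 32.10 | 1.98 | 5.44 |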
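Table 2: BLEU and METEOR scores on De-En for different fertility strategies.

| Fertility strategy | BLEU | METEOR |
| constant, | 29.66 | 31.60 |
| constant, | 29.64 | 31.56 |
| guided, | 29.56 | 31.45 |
| predicted, | 29.78 | 31.60 |
| predicted, | 29.85 | 31.76 |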
Sink token.
We append an additional sink token to the end of the source sentence, to which we assign unbounded fertility ($f_{J+1} = \infty$). The token is akin to the null alignment in IBM models. The reason we add this token is the following: without the sink token, the length of the generated target sentence can never exceed $\sum_j f_j$ words if we use constrained softmax/sparsemax. At training time this may be problematic, since the target length $T$ is fixed and the problems in Eqs. 4–6 can become infeasible. By adding the sink token we guarantee $\sum_j f_j \ge T$, eliminating the problem.
Exhaustion strategies.
To avoid missing source words, we implemented a simple strategy to encourage more attention to words with larger credit: we redefine the pre-attention word scores as $z'_t = z_t + c\,u_t$, where $u_t$ is the vector of remaining credits (the upper bounds at step $t$) and $c$ is a constant held fixed in our experiments. This increases the score of words which have not yet exhausted their fertility (we may regard it as a “soft” lower bound in Eqs. 4–6).
5 Experiments
We evaluated our attention transformations on three language pairs. We focused on small datasets, since this regime is the most affected by coverage mistakes, raises more adequacy issues, and demands more structural biases, which makes it a good test bed for our methods. We use the IWSLT 2014 corpus for De-En, the KFTT corpus for Ja-En (Neubig, 2011), and the WMT 2016 dataset for Ro-En. The training sets have 153,326, 329,882, and 560,767 parallel sentences, respectively. We tokenized the data using the Moses scripts and preprocessed it with subword units (Sennrich et al., 2016) with a joint vocabulary and 32k merge operations. Our implementation was done on a fork of the OpenNMT-py toolkit (Klein et al., 2017) with the default parameters (a 2-layer LSTM, embedding and hidden size of 500, dropout 0.3, and the SGD optimizer for 13 epochs). We used a validation set to tune hyperparameters introduced by our model. Even though our attention implementations are CPU-based using NumPy (unlike the rest of the computation, which is done on the GPU), we did not observe any noticeable slowdown using multiple devices.
As baselines, we use softmax attention, as well as two recently proposed coverage models:

CovPenalty (Wu et al., 2016, §7). At test time, the hypotheses in the beam are rescored with a global score that includes a length and a coverage penalty. (Since our sparse attention can become $0$ for some words, we extended the original coverage penalty by adding another parameter $\epsilon$: $\mathrm{cp}(x; y) := \beta \sum_{j=1}^{J} \log \max\{\epsilon, \min\{1, \sum_{t=1}^{|y|} \alpha_{t,j}\}\}$. We tuned the length and coverage penalty weights $\alpha$ and $\beta$ with grid search, as in Wu et al. (2016).)

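CovVector (Tu et al., 2016). At training and test time, coverage vectors $\beta$ and additional parameters $v$ are used to condition the next attention step. We adapted this to our bilinear attention by defining $z_{t,j} = s_{t-1}^\top (W h_j + v\,\beta_{t-1,j})$, where $\beta_{t-1,j}$ is the cumulative attention received by source word $j$ before step $t$.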
We also experimented combining the strategies above with the sparsemax transformation.
As evaluation metrics, we report tokenized BLEU, METEOR (Denkowski and Lavie, 2014), as well as two new metrics that we describe next to account for over- and under-translation (both evaluation metrics are included in our software package at www.github.com/Unbabel/sparse_constrained_attention).
REP-score:
a new metric to count repetitions. Formally, given an $n$-gram $s$, let $t(s)$ and $r(s)$ be its frequency in the model translation and reference. We first compute a sentence-level score $\sigma(t, r)$.
The REP-score is then given by summing $\sigma(t, r)$ over sentences, normalizing by the number of words in the reference corpus, and multiplying by 100.
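A repetition score of this kind can be sketched as below. The sentence-level formula used here is an assumption: it adds (i) the excess count, relative to the reference, of every n-gram that occurs at least twice in the translation and (ii) a separately weighted excess count of immediately repeated words. The names `n`, `lambda1` and `lambda2` and the values passed in the example are illustrative and are not claimed to be the settings used in the experiments.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Frequency of every n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def sentence_rep(hyp, ref, n, lambda1, lambda2):
    """Assumed sentence-level repetition score (see the lead-in)."""
    t, r = ngram_counts(hyp, n), ngram_counts(ref, n)
    # n-grams repeated in the hypothesis more often than in the reference.
    repeated = sum(max(0, c - r[s]) for s, c in t.items() if c >= 2)
    # Immediately repeated words ("w w") in excess of the reference.
    t2, r2 = ngram_counts(hyp, 2), ngram_counts(ref, 2)
    doubled = sum(max(0, c - r2[s]) for s, c in t2.items() if s[0] == s[1])
    return lambda1 * repeated + lambda2 * doubled

def rep_score(hyps, refs, n, lambda1, lambda2):
    """Sum over sentences, normalize by reference length, multiply by 100."""
    total = sum(sentence_rep(h, r, n, lambda1, lambda2) for h, r in zip(hyps, refs))
    ref_words = sum(len(r) for r in refs)
    return 100.0 * total / ref_words

hyp = "so , you know , you know .".split()
ref = "like that , you know .".split()
score = rep_score([hyp], [ref], n=2, lambda1=1.0, lambda2=2.0)
```

In the example, the bigrams ", you" and "you know" each occur twice in the translation but once in the reference, so the sentence contributes two excess repetitions.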
DROP-score:
a new metric that accounts for possibly dropped words. To compute it, we first compute two sets of word alignments: from source to reference translation, and from source to the predicted translation. In our experiments, the alignments were obtained with fast_align (Dyer et al., 2013), trained on the training partition of the data. Then, the DROP-score computes the percentage of source words that aligned with some word from the reference translation, but not with any word from the predicted translation.
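The computation can be sketched as follows, assuming the two sets of alignments are already available as (source index, target index) pairs per sentence. The function name and the input layout are hypothetical, and the choice of denominator (all source words in the corpus) is one reading of "percentage of source words".

```python
def drop_score(examples):
    """Percentage of source words aligned to the reference but not to the prediction.

    examples: iterable of (num_source_words, ref_alignment, hyp_alignment),
    where each alignment is a list of (source_index, target_index) pairs.
    """
    dropped = 0
    total = 0
    for num_source_words, ref_alignment, hyp_alignment in examples:
        ref_covered = {src for src, _ in ref_alignment}
        hyp_covered = {src for src, _ in hyp_alignment}
        # Source words the reference translates but the prediction misses.
        dropped += len(ref_covered - hyp_covered)
        total += num_source_words
    return 100.0 * dropped / total if total else 0.0

# Four source words; word 1 is aligned in the reference only.
example = [(4, [(0, 0), (1, 1), (2, 2)], [(0, 0), (2, 1)])]
score = drop_score(example)
```

Source words that are unaligned in both the reference and the prediction do not count as dropped, which keeps function words with no counterpart from inflating the score.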
Table 1 shows the results. We can see that on average, the sparse models (csparsemax as well as sparsemax combined with coverage models) have higher scores on both BLEU and METEOR. Generally, they also obtain better REP and DROP scores than csoftmax and softmax, which suggests that sparse attention alleviates the problem of coverage to some extent.
To compare different fertility strategies, we ran experiments on De-En for the csparsemax transformation (Table 2). We see that the predicted strategy outperforms the others both in terms of BLEU and METEOR, albeit slightly.
Figure 2 shows examples of sentences for which csparsemax fixed repetitions, along with the corresponding attention maps. We see that in the case of repetitions, the decoder attends repeatedly to the same portion of the source sentence (the expression “letzten hundert” in the first sentence and “regierung” in the second sentence). Not only did csparsemax avoid repetitions, but it also yielded a sparse set of alignments, as expected. Appendix B provides more examples of translations from all models in discussion.
6 Conclusions
We proposed a new approach to address the coverage problem in NMT, by replacing the softmax attentional transformation by sparse and constrained alternatives: sparsemax, constrained softmax, and the newly proposed constrained sparsemax. For the latter, we derived efficient forward and backward propagation algorithms. By incorporating a model for fertility prediction, our attention transformations led to sparse alignments, avoiding repeated words in the translation.
Acknowledgments
We thank the Unbabel AI Research team for numerous discussions, and the three anonymous reviewers for their insightful comments. This work was supported by the European Research Council (ERC StG DeepSPIN 758969) and by the Fundação para a Ciência e Tecnologia through contracts UID/EEA/50008/2013, PTDC/EEISII/7092/2014 (LearnBig), and CMUPERI/TIC/0046/2014 (GoLocal).
References
 Almeida and Martins (2013) Miguel B. Almeida and André F. T. Martins. 2013. Fast and Robust Compressive Summarization with Dual Decomposition and Multi-Task Learning. In Proc. of the Annual Meeting of the Association for Computational Linguistics.
 Bahdanau et al. (2014) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
 Blum et al. (1973) Manuel Blum, Robert W Floyd, Vaughan Pratt, Ronald L Rivest, and Robert E Tarjan. 1973. Time bounds for selection. Journal of Computer and System Sciences 7(4):448–461.
 Brown et al. (1993) Peter F Brown, Vincent J Della Pietra, Stephen A Della Pietra, and Robert L Mercer. 1993. The mathematics of statistical machine translation: Parameter estimation. Computational linguistics 19(2):263–311.
 Denkowski and Lavie (2014) Michael Denkowski and Alon Lavie. 2014. Meteor universal: Language specific translation evaluation for any target language. In Proceedings of the EACL 2014 Workshop on Statistical Machine Translation.
 Dyer et al. (2013) Chris Dyer, Victor Chahuneau, and Noah A Smith. 2013. A simple, fast, and effective reparameterization of IBM Model 2. Association for Computational Linguistics.
 Feng et al. (2016) Shi Feng, Shujie Liu, Nan Yang, Mu Li, Ming Zhou, and Kenny Q. Zhu. 2016. Improving attention modeling with implicit distortion and fertility for machine translation. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers. The COLING 2016 Organizing Committee, Osaka, Japan, pages 3082–3092.
 Gehring et al. (2017) Jonas Gehring, Michael Auli, David Grangier, and Yann Dauphin. 2017. A convolutional encoder model for neural machine translation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). volume 1, pages 123–135.
 Gu et al. (2018) Jiatao Gu, James Bradbury, Caiming Xiong, Victor OK Li, and Richard Socher. 2018. Non-autoregressive neural machine translation. In Proc. of International Conference on Learning Representations.
 Klein et al. (2017) Guillaume Klein, Yoon Kim, Yuntian Deng, Jean Senellart, and Alexander M. Rush. 2017. OpenNMT: Open-source toolkit for neural machine translation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Vancouver, Canada, July 30 - August 4, System Demonstrations. pages 67–72.
 Koehn and Knowles (2017) Philipp Koehn and Rebecca Knowles. 2017. Six challenges for neural machine translation. In Proceedings of the First Workshop on Neural Machine Translation. Association for Computational Linguistics, Vancouver, pages 28–39.
 Li and Zhu (2017) Junhui Li and Muhua Zhu. 2017. Learning when to attend for neural machine translation. arXiv preprint arXiv:1705.11160.
 Luong et al. (2015) Thang Luong, Hieu Pham, and Christopher D Manning. 2015. Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. pages 1412–1421.
 Martins and Astudillo (2016) Andre Martins and Ramon Astudillo. 2016. From softmax to sparsemax: A sparse model of attention and multilabel classification. In International Conference on Machine Learning. pages 1614–1623.
 Martins and Kreutzer (2017) André FT Martins and Julia Kreutzer. 2017. Learning what’s easy: Fully differentiable neural easy-first taggers. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. pages 349–362.
 Mi et al. (2016) Haitao Mi, Baskaran Sankaran, Zhiguo Wang, and Abe Ittycheriah. 2016. Coverage embedding models for neural machine translation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016, Austin, Texas, USA, November 1-4, 2016. pages 955–960.
 Neubig (2011) Graham Neubig. 2011. The Kyoto free translation task. http://www.phontron.com/kftt.
 Pardalos and Kovoor (1990) Panos M. Pardalos and Naina Kovoor. 1990. An algorithm for a singly constrained class of quadratic programs subject to upper and lower bounds. Mathematical Programming 46(1):321–328.
 Sennrich et al. (2016) Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, August 7-12, 2016, Berlin, Germany, Volume 1: Long Papers.
 Sutskever et al. (2014) Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In Advances in neural information processing systems. pages 3104–3112.
 Tu et al. (2017a) Zhaopeng Tu, Yang Liu, Zhengdong Lu, Xiaohua Liu, and Hang Li. 2017a. Context gates for neural machine translation.
 Tu et al. (2017b) Zhaopeng Tu, Yang Liu, Lifeng Shang, Xiaohua Liu, and Hang Li. 2017b. Neural machine translation with reconstruction. In AAAI. pages 3097–3103.
 Tu et al. (2016) Zhaopeng Tu, Zhengdong Lu, Yang Liu, Xiaohua Liu, and Hang Li. 2016. Modeling coverage for neural machine translation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics.
 Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems. pages 6000–6010.
 Wu et al. (2016) Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.
Appendix A Proof of Proposition 1
We provide here a detailed proof of Proposition 1.
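A.1 Forward Propagation
The optimization problem can be written as
$$\mathrm{csparsemax}(z, u) = \arg\min_{\alpha} \tfrac{1}{2}\|\alpha\|^2 - z^\top \alpha \quad \text{s.t.} \quad \mathbf{1}^\top \alpha = 1,\; 0 \le \alpha \le u.$$
The Lagrangian function is:
$$\mathcal{L}(\alpha, \tau, \mu, \nu) = \tfrac{1}{2}\|\alpha\|^2 - z^\top \alpha + \tau(\mathbf{1}^\top \alpha - 1) - \mu^\top \alpha + \nu^\top (\alpha - u). \tag{9}$$
To obtain the solution, we invoke the Karush-Kuhn-Tucker conditions. From the stationarity condition, we have $0 = \alpha - z + \tau \mathbf{1} - \mu + \nu$, which due to the primal feasibility condition implies that the solution is of the form:
$$\alpha = z - \tau \mathbf{1} + \mu - \nu. \tag{10}$$
From the complementary slackness condition, we have that $0 < \alpha_j < u_j$ implies that $\mu_j = \nu_j = 0$ and therefore $\alpha_j = z_j - \tau$. On the other hand, $\mu_j > 0$ implies $\alpha_j = 0$, and $\nu_j > 0$ implies $\alpha_j = u_j$. Hence the solution can be written as $\alpha_j = \max\{0, \min\{u_j, z_j - \tau\}\}$, where $\tau$ is determined such that the distribution normalizes:
$$\tau = \frac{\sum_{j \in A} z_j + \sum_{j \in A_R} u_j - 1}{|A|}, \tag{11}$$
with $A = \{j \in [J] \mid 0 < \alpha_j < u_j\}$ and $A_R = \{j \in [J] \mid \alpha_j = u_j\}$. Note that $\tau$ depends itself on the set $A$, a function of the solution. In §A.3, we describe an algorithm that searches the value of $\tau$ efficiently.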
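A.2 Gradient Backpropagation
We now turn to the problem of backpropagating the gradients through the constrained sparsemax transformation. For that, we need to compute its Jacobian matrix, i.e., the derivatives $\partial \alpha_i / \partial z_j$ and $\partial \alpha_i / \partial u_j$ for $i, j \in [J]$. Let us first express $\alpha$ as
$$\alpha_i = \begin{cases} 0, & i \in A_L, \\ z_i - \tau, & i \in A, \\ u_i, & i \in A_R, \end{cases} \tag{12}$$
with $\tau$ as in Eq. 11. Note that we have $\partial \tau / \partial z_j = \mathbb{1}(j \in A)/|A|$ and $\partial \tau / \partial u_j = \mathbb{1}(j \in A_R)/|A|$. Thus, we have the following:
$$\frac{\partial \alpha_i}{\partial z_j} = \begin{cases} 1 - 1/|A|, & \text{if } j \in A \text{ and } i = j, \\ -1/|A|, & \text{if } i, j \in A \text{ and } i \ne j, \\ 0, & \text{otherwise,} \end{cases} \tag{13}$$
and
$$\frac{\partial \alpha_i}{\partial u_j} = \begin{cases} 1, & \text{if } j \in A_R \text{ and } i = j, \\ -1/|A|, & \text{if } j \in A_R \text{ and } i \in A, \\ 0, & \text{otherwise.} \end{cases} \tag{14}$$
Finally, we obtain:
$$dz_j = \sum_i \frac{\partial \alpha_i}{\partial z_j}\, d\alpha_i = \mathbb{1}(j \in A) \left( d\alpha_j - \frac{\sum_{i \in A} d\alpha_i}{|A|} \right) = \mathbb{1}(j \in A)\,(d\alpha_j - m), \tag{15}$$
and
$$du_j = \sum_i \frac{\partial \alpha_i}{\partial u_j}\, d\alpha_i = \mathbb{1}(j \in A_R) \left( d\alpha_j - \frac{\sum_{i \in A} d\alpha_i}{|A|} \right) = \mathbb{1}(j \in A_R)\,(d\alpha_j - m), \tag{16}$$
where $m = \frac{1}{|A|} \sum_{j \in A} d\alpha_j$.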
A.3 Linear-Time Evaluation
Finally, we present an algorithm to solve the problem in Eq. 6 in linear time.
Pardalos and Kovoor (1990) describe an algorithm, reproduced here as Algorithm 1, for solving a class of singly-constrained convex quadratic problems, which can be written in the following form (where each $d_j > 0$):
$$\min_{\alpha} \sum_{j=1}^{J} \left( c_j \alpha_j + \tfrac{1}{2} d_j \alpha_j^2 \right) \quad \text{s.t.} \quad \sum_{j=1}^{J} \alpha_j = 1,\; 0 \le \alpha_j \le u_j. \tag{17}$$
The solution of the problem in Eq. 17 is of the form $\alpha_j = \max\{0, \min\{u_j, -(c_j + \lambda)/d_j\}\}$, where $\lambda$ is a constant. The algorithm searches the value of this constant (which is similar to $\tau$ in our problem), which lies in a particular interval of split-points (line 3), iteratively shrinking this interval. The algorithm requires computing medians as a subroutine, which can be done in linear time (Blum et al., 1973). The overall complexity is $O(J)$ (Pardalos and Kovoor, 1990). The same algorithm has been used in NLP by Almeida and Martins (2013) for a budgeted summarization problem.
Appendix B Examples of Translations
We show some examples of translations obtained for the German-English language pair with different systems. Blue highlights the parts of the reference that are correct and red highlights the corresponding problematic parts of translations, including repetitions, dropped words or mistranslations.
| input | überlassen sie das ruhig uns . |
| reference | leave that up to us . |
| softmax | give us a silence . |
| sparsemax | leave that to us . |
| csoftmax | let’s leave that . |
| csparsemax | leave it to us . |

| input | so ungefähr , sie wissen schon . |
| reference | like that , you know . |
| softmax | so , you know , you know . |
| sparsemax | so , you know , you know . |
| csoftmax | so , you know , you know . |
| csparsemax | like that , you know . |

| input | und wir benutzen dieses wort mit solcher verachtung . |
| reference | and we say that word with such contempt . |
| softmax | and we use this word with such contempt contempt . |
| sparsemax | and we use this word with such contempt . |
| csoftmax | and we use this word with like this . |
| csparsemax | and we use this word with such contempt . |

| input | wir sehen das dazu , dass phosphor wirklich kritisch ist . |
| reference | we can see that phosphorus is really critical . |
| softmax | we see that that phosphorus is really critical . |
| sparsemax | we see that that phosphorus really is critical . |
| csoftmax | we see that that phosphorus is really critical . |
| csparsemax | we see that phosphorus is really critical . |

| input | also müssen sie auch nicht auf klassische musik verzichten , weil sie kein instrument spielen . |
| reference | so you don’t need to abstain from listening to classical music because you don’t play an instrument . |
| softmax | so you don’t have to rely on classical music because you don’t have an instrument . |
| sparsemax | so they don’t have to kill classical music because they don’t play an instrument . |
| csoftmax | so they don’t have to rely on classical music , because they don’t play an instrument . |
| csparsemax | so you don’t have to get rid of classical music , because you don’t play an instrument . |

| input | je mehr ich aber darüber nachdachte , desto mehr kam ich zu der ansicht , das der fisch etwas weiß . |
| reference | the more i thought about it , however , the more i came to the view that this fish knows something . |
| softmax | the more i thought about it , the more i got to the point of the fish . |
| sparsemax | the more i thought about it , the more i got to the point of view of the fish . |
| csoftmax | but the more i thought about it , the more i came to mind , the fish . |
| csparsemax | the more i thought about it , the more i came to the point that the fish knows . |

| input | all diese menschen lehren uns , dass es noch andere existenzmöglichkeiten , andere denkweisen , andere wege zur orientierung auf der erde gibt . |
| reference | all of these peoples teach us that there are other ways of being , other ways of thinking , other ways of orienting yourself in the earth . |
| softmax | all of these people teach us that there are others , other ways , other ways of guidance to the earth . |
| sparsemax | all these people are teaching us that there are other options , other ways , different ways of guidance on earth . |
| csoftmax | all of these people teach us that there’s other ways of doing other ways of thinking , other ways of guidance on earth . |
| csparsemax | all these people teach us that there are other actors , other ways of thinking , other ways of guidance on earth . |

| input | in der reichen welt , in der oberen milliarde , könnten wir wohl abstriche machen und weniger nutzen , aber im durchschnitt wird diese zahl jedes jahr steigen und sich somit insgesamt mehr als verdoppeln , die zahl der dienste die pro person bereitgestellt werden . |
| reference | in the rich world , perhaps the top one billion , we probably could cut back and use less , but every year , this number , on average , is going to go up , and so , over all , that will more than double the services delivered per person . |
| softmax | in the rich world , in the upper billion , we might be able to do and use less use , but on average , that number is going to increase every year and so on , which is the number of services that are being put in . |
| sparsemax | in the rich world , in the upper billion , we may be able to do and use less use , but in average , that number is going to rise every year , and so much more than double , the number of services that are being put together . |
| csoftmax | in the rich world , in the upper billion , we might be able to take off and use less , but in average , this number is going to increase every year and so on , and that’s the number of people who are being put together per person . |
| csparsemax | in the rich world , in the upper billion , we may be able to turn off and use less , but in average , that number will rise every year and so far more than double , the number of services that are being put into a person . |