Variational Attention for Sequence-to-Sequence Models
Abstract
The variational encoder-decoder (VED) encodes source information as a set of random variables using a neural network, which in turn is decoded into target data using another neural network. In natural language processing, sequence-to-sequence (Seq2Seq) models typically serve as encoder-decoder networks. When combined with a traditional (deterministic) attention mechanism, the variational latent space may be bypassed by the attention model, making the generated sentences less diversified. In our paper, we propose a variational attention mechanism for VED, where the attention vector is modeled as normally distributed random variables. Experiments show that variational attention increases diversity while retaining high quality. We also show that the model is not sensitive to hyperparameters.
Hareesh Bahuleyan*, Lili Mou*, Olga Vechtomova, Pascal Poupart
University of Waterloo, ON, Canada
{hpallika, ovechtomova, ppoupart}@uwaterloo.ca, doublepower.mou@gmail.com
*The first two authors contributed equally.
1 Introduction
The variational autoencoder (VAE), proposed by Kingma and Welling (2013), encodes data to latent (random) variables, and then decodes the latent variables to reconstruct data. Theoretically, it optimizes a variational lower bound of the loglikelihood of data. Compared with traditional variational methods such as meanfield approximation (Wainwright et al., 2008), VAE leverages modern neural networks and hence is a more powerful density estimator. Compared with traditional autoencoders (Hinton and Salakhutdinov, 2006), which are deterministic, VAE populates hidden representations to a region (instead of a single point), making it possible to generate diversified data from the vector space (Bowman et al., 2016) or even control the generated samples (Hu et al., 2017).
In natural language processing (NLP), recurrent neural networks (RNNs) are typically used as both encoder and decoder, known as sequence-to-sequence (Seq2Seq) models. Although variational Seq2Seq models are much trickier to train in comparison to the image domain, Bowman et al. (2016) succeed in training a Seq2Seq VAE and generating sentences from a continuous latent space. Such an architecture can further be extended to a variational encoder-decoder (VED) to transform one sequence into another utilizing the "variational" property (Serban et al., 2017; Zhou and Neubig, 2017).
When applying attention mechanisms (Bahdanau et al., 2014) to variational Seq2Seq models, however, we find that the generated sentences show little variety. The attention mechanism summarizes source information as an attention vector by weighted sum, where the weights are a learned probabilistic distribution; the attention vector is then fed to the decoder. Evidence shows that attention significantly improves Seq2Seq performance in translation (Bahdanau et al., 2014), summarization (Rush et al., 2015), etc. In variational Seq2Seq, the attention mechanism unfortunately may serve as a "bypassing" mechanism. In other words, the variational latent space does not need to learn much, as long as the attention mechanism itself is powerful enough to capture source information.
In this paper, we propose a variational attention mechanism to address this problem. We model the attention vector as random variables by imposing a probabilistic distribution. We follow traditional VAE and model the prior of the attention vector to follow a Gaussian distribution. However, our prior has a mean being the average of source information, as opposed to a vector of all zeros. This is more suited in our scenario because attention is a weighted sum of source information. Experiments show that the proposed models with variational attention have a higher diversity than variational Seq2Seq models with deterministic attention, while retaining the quality of generated sentences.
2 Background and Motivation
In this section, we introduce variational autoencoders and attention mechanisms. We also present a pilot experiment motivating our variational attention model.
2.1 Variational Autoencoder (VAE)
VAE encodes data x (e.g., a sentence) as hidden random variables z, based on which VAE reconstructs x. Consider a generative model, parameterized by θ, as

p_θ(x, z) = p_θ(z) p_θ(x | z)    (1)

Given a dataset X = {x^(1), …, x^(N)}, the likelihood of a data point x^(n) is

log p_θ(x^(n)) = log ∫ p_θ(z) p_θ(x^(n) | z) dz    (2)

VAE models both the posterior q_φ(z | x) and the reconstruction p_θ(x | z) with neural networks, parametrized by φ and θ, respectively. Figure 1a shows the graphical model of this process. The training objective is to maximize a lower bound of the likelihood (2), which is equivalent to minimizing the objective:

J(θ, φ) = −E_{z ∼ q_φ(z|x)}[log p_θ(x | z)] + KL(q_φ(z | x) ∥ p(z))    (3)

The first term, called the reconstruction loss, is the (expected) negative log-likelihood of data, similar to traditional deterministic autoencoders. The expectation is obtained by Monte Carlo sampling. The second term is the KL divergence between z's posterior and prior distributions. Typically the prior is set to the standard normal N(0, I).
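The two terms of the VAE objective can be made concrete in a few lines: the Gaussian-to-standard-normal KL has a closed form, and the reconstruction expectation is estimated by sampling through the reparameterization trick. A minimal numpy sketch (illustrative values; a diagonal-Gaussian posterior is assumed):

```python
import numpy as np

rng = np.random.default_rng(0)

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) ),
    summed over dimensions -- the second term of Eq. (3)."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)

def reparameterize(mu, log_var):
    """Draw z = mu + sigma * eps with eps ~ N(0, I), so the Monte Carlo
    estimate of the reconstruction term stays differentiable in (mu, log_var)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

mu = np.array([0.5, -0.3])        # posterior mean from the recognition network
log_var = np.array([0.0, 0.2])    # posterior log-variance
kl = kl_to_standard_normal(mu, log_var)
z = reparameterize(mu, log_var)   # sample fed to the decoder
```

The KL term is zero exactly when the posterior equals the prior N(0, I), which is the collapse discussed in Section 2.4.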
2.2 Variational EncoderDecoder (VED)
In some applications, we would like to transform source information into target information, e.g., machine translation, dialogue systems, and text summarization. In these tasks, "auto"-encoding is not sufficient, and an encoding-decoding framework is needed. Different efforts have been made to extend VAE to variational encoder-decoder (VED) frameworks, which transform an input x into an output y. One possible extension is to condition all probabilistic distributions further on y (Zhang et al., 2016; Cao and Clark, 2017; Serban et al., 2017). This, however, introduces a discrepancy between training and prediction, since y is not available during prediction.
2.3 Attention Mechanism
In NLP, recurrent neural networks are typically used as the encoder and decoder, as they are suitable for modeling a sequence of words (i.e., sentences). Figure 2a shows a basic Seq2Seq model in the VAE/VED scenario (Bowman et al., 2016). The encoder takes an input x and outputs μ_z and σ_z as the parameters of z's posterior normal distribution. The decoder then generates the output based on a sample z drawn from this posterior.
Attention mechanisms are proposed to dynamically align the target and source sequences during generation. At each time step j in the decoder, the attention mechanism computes a probabilistic distribution over source positions by

α_ji = exp(ẽ_ji) / Σ_{i'} exp(ẽ_ji')    (4)

where ẽ_ji is a pre-normalized score, computed by ẽ_ji = h_j^(tar)ᵀ W h_i^(src) in our model. Here, h_j^(tar) and h_i^(src) are the hidden representations of the jth step in the target and the ith step in the source, and W is a learnable weight matrix.

Then the source information is summed with weights α_ji to obtain the attention vector

a_j = Σ_i α_ji h_i^(src)    (5)

which is fed to the decoder RNN at the jth step. Figure 2c shows the variational Seq2Seq model with such traditional attention.
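Eqs. (4)-(5) amount to a bilinear score, a softmax, and a weighted sum. A small numpy sketch of this deterministic attention step (shapes and values are illustrative):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())   # subtract max for numerical stability
    return e / e.sum()

def attention(h_tar_j, H_src, W):
    """Deterministic attention (Eqs. 4-5): bilinear scores
    e_ji = h_j^T W h_i are softmax-normalized into alpha_ji,
    then a weighted sum of source states gives the attention vector."""
    scores = H_src @ W.T @ h_tar_j     # (n_src,) pre-normalized scores
    alpha = softmax(scores)            # probabilistic distribution over source steps
    return alpha @ H_src, alpha        # convex combination of source states

rng = np.random.default_rng(1)
H_src = rng.standard_normal((5, 4))    # 5 source steps, 4-dim hidden states
h_tar = rng.standard_normal(4)         # current target hidden state
W = rng.standard_normal((4, 4))
a_j, alpha = attention(h_tar, H_src, W)
```

Because the weights form a probability distribution, a_j always lies in the convex hull of the source states, the geometric fact exploited in Sections 3.2 and 3.5.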
Table 1: Sample sentences generated from the latent space. Input: "the men are playing musical instruments"

(a) VAE w/o hidden state init. (Avg entropy: 2.52)
  the men are playing musical instruments
  the men are playing video games
  the musicians are playing musical instruments
  the women are playing musical instruments

(b) VAE w/ hidden state init. (Avg entropy: 2.01)
  the men are playing musical instruments
  the men are playing musical instruments
  the men are playing musical instruments
  the man is playing musical instruments
2.4 “Bypassing” Phenomenon
In this part, we explain the "bypassing" phenomenon that arises in VAE or VED if the network is not designed properly; this motivates our variational attention described in Section 3.
We observe that, if the decoder has a direct, deterministic access to the encoder, the latent variables might not capture much information so that the VAE or VED does not play a role in the process. We call this a bypassing phenomenon.
Theoretically, if the reconstruction network is aware of the source x by itself through a deterministic path, then the posterior q_φ(z | x) might be learned as q_φ(z | x) ≈ p(z) without hurting the reconstruction loss, while the KL term in Eq. (3) is driven to zero. This degrades a variational Seq2Seq model to a deterministic one.
The phenomenon is best shown with a bypassing connection between the encoder and decoder for hidden state initialization. Some previous studies set the decoder's initial state to be the encoder's final state (Cao and Clark, 2017), shown in Figure 2c. We conducted a pilot study with a Seq2Seq VAE on a subset (80k samples) of the massive dataset provided by Bowman et al. (2015). We show generated sentences and entropy in Table 1. We see that the variational Seq2Seq can only generate very similar sentences with such a bypassing connection (Table 1b), as opposed to generating diversified samples from the latent space alone (Table 1a). Quantitatively, the average entropy decreases by 0.5 over 1k unseen samples, a significant difference since entropy is a logarithmic metric. This analysis provides a design principle for neural architectures in VAE or VED.
Since attention largely improves model performance for deterministic Seq2Seq models, it is tempting to include attention in variational Seq2Seq as well. However, our pilot experiment raises the doubt of whether a traditional attention mechanism, which is deterministic, may bypass the latent space in VED, as illustrated by the graphical model in Figure 1c. Also, evidence in Zheng et al. (2017) shows the attention mechanism is so powerful that removing other connections between the encoder and decoder has little effect on BLEU scores in machine translation. In other words, a VED with deterministic attention might learn the reconstruction mostly from attention, whereas the posterior of the latent space merely fits its prior in order to minimize the KL term.
To alleviate this problem, we propose a variational attention mechanism for variational Seq2Seq models, as is described in detail in the next section.
3 The Proposed Variational Attention
Let us consider the decoding process of an RNN. At a time step j, it adjusts its hidden state s_j with an input of a word embedding x_j (typically the ground truth during training and the prediction from the previous step during testing). This is given by s_j = RNN(s_{j−1}, x_j). In our experiments, we use long short-term memory units (Hochreiter and Schmidhuber, 1997) as the RNN's transition. Enhanced with attention, the RNN is computed by s_j = RNN(s_{j−1}, [x_j; a_j]), where a_j is the attention vector. The predicted word is given by a softmax layer over W_out s_j (where W_out is a weight matrix). As discussed earlier, a_j is computed by Eq. (5) in a deterministic fashion in traditional attention.
To build a variational attention, we treat both the traditional latent space z and the attention vector a_j as random variables. The recognition and reconstruction graphical models are shown in Figure 1d.
3.1 Lower Bound
Since the likelihood of the nth data point decomposes over time steps, we consider the lower bound at the jth step. The variational lower bound, i.e., Eq. (2), becomes

L̃(θ, φ) = E_{q_φ(z, a_j | x, y)}[log p_θ(y_j | z, a_j)] − KL(q_φ(z, a_j | x, y) ∥ p(z, a_j))    (6)
       = E_{q_φ(z|x) q_φ(a_j|x)}[log p_θ(y_j | z, a_j)] − KL(q_φ(z | x) ∥ p(z)) − KL(q_φ(a_j | x) ∥ p(a_j))    (7)

The second step is due to the independence in both the recognition and reconstruction phases. The posterior factorizes as q_φ(z, a_j | x, y) = q_φ(z | x) q_φ(a_j | x) because z and a_j are conditionally independent given x (dashed lines in Figure 1d), whereas the prior factorizes as p(z, a_j) = p(z) p(a_j) because z and a_j are marginally independent (solid lines in Figure 1d). In this way, the sampling procedure can be done separately and the KL loss can also be computed independently.
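The factorization in Eq. (7) can be verified numerically: for independent Gaussians, the KL of the joint equals the sum of the marginal KLs. A small numpy check with illustrative one-dimensional z and a_j (the closed-form diagonal-Gaussian KL is standard, not specific to this paper):

```python
import numpy as np

def kl_diag_gauss(mu_q, var_q, mu_p, var_p):
    """Closed-form KL( N(mu_q, diag(var_q)) || N(mu_p, diag(var_p)) )."""
    return 0.5 * np.sum(np.log(var_p / var_q)
                        + (var_q + (mu_q - mu_p)**2) / var_p - 1.0)

# Independent q(z) and q(a_j) with standard-normal priors.
mu_z, var_z = np.array([0.2]), np.array([0.5])
mu_a, var_a = np.array([-0.1]), np.array([1.5])

# KL of the joint (z, a_j), treated as one diagonal Gaussian...
kl_joint = kl_diag_gauss(np.concatenate([mu_z, mu_a]),
                         np.concatenate([var_z, var_a]),
                         np.zeros(2), np.ones(2))
# ...equals the sum of the two marginal KLs, as Eq. (7) requires.
kl_split = (kl_diag_gauss(mu_z, var_z, np.zeros(1), np.ones(1))
            + kl_diag_gauss(mu_a, var_a, np.zeros(1), np.ones(1)))
```

This is why the two KL terms in Eq. (7) can be computed and annealed independently.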
3.2 Prior
We have two plausible prior distributions for a_j:

(1) Following conventional VAE, the simplest choice is the standard normal prior p(a_j) = N(0, I).

(2) We observe that the attention vector has to lie inside the convex hull of the hidden representations of the source sequence, i.e., a_j ∈ conv{h_i^(src)}. We thus impose a normal prior whose mean is the average of h_i^(src), i.e., p(a_j) = N(h̄^(src), I), where h̄^(src) = (1/|x|) Σ_i h_i^(src).
3.3 Posterior
We model the posterior of a_j as a normal distribution N(μ_a, diag(σ_a²)), where the parameters μ_a and σ_a are obtained by a recognition neural network. Following conventional VAE, we first compute the deterministic attention vector as in Eq. (5) and then transform it by further layers, shown in Figure 2d. As the mean can take any real value, its transformation is done with a feed-forward neural network with tanh activation, followed by a linear layer. Likewise, a similar transformation is carried out to obtain σ_a, followed by an additional activation function to ensure that the values are positive.
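The recognition network above can be sketched in a few lines. This is a hypothetical minimal version: the weight matrices are random placeholders, and softplus stands in as one possible positivity-enforcing activation (the paper does not fix the choice here):

```python
import numpy as np

rng = np.random.default_rng(2)

def softplus(x):
    return np.log1p(np.exp(x))   # smooth, always positive

def attention_posterior(a_det, W_mu1, W_mu2, W_sig1, W_sig2):
    """Sketch of Section 3.3: start from the deterministic attention
    vector a_det (Eq. 5); a tanh layer plus a linear layer gives the
    mean, and an analogous branch with a positivity-enforcing
    activation gives the standard deviation."""
    mu = W_mu2 @ np.tanh(W_mu1 @ a_det)                  # real-valued mean
    sigma = softplus(W_sig2 @ np.tanh(W_sig1 @ a_det))   # positive std
    return mu, sigma

d = 4
a_det = rng.standard_normal(d)                            # from Eq. (5)
Ws = [rng.standard_normal((d, d)) * 0.1 for _ in range(4)]
mu_a, sigma_a = attention_posterior(a_det, *Ws)
# Reparameterized sample of the attention vector:
a_j = mu_a + sigma_a * rng.standard_normal(d)
```

At test time one can either take a_j = μ_a (MAP-style) or draw samples as above for diverse generation.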
3.4 Training Objective
The overall training objective of Seq2Seq with both variational latent space and variational attention is to minimize

J(θ, φ) = Σ_j −E_{q_φ}[log p_θ(y_j | z, a_j)] + λ_KL ( KL(q_φ(z | x) ∥ p(z)) + γ_a Σ_j KL(q_φ(a_j | x) ∥ p(a_j)) )    (8)

Here, the hyperparameter λ_KL balances the reconstruction loss and the KL losses; γ_a further balances the attention's KL loss and z's KL loss. Since VAE and VED are tricky to train with Seq2Seq models (e.g., requiring KL annealing), we tie the change of both KL terms and only anneal λ_KL. (Training details will be presented in Section 4.2.)
Notice that if a_j has the prior N(h̄^(src), I), the derivative of the KL term also propagates back to the encoder through h̄^(src). This can be computed straightforwardly or by auto-differentiation tools, e.g., TensorFlow.
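The gradient flow through the prior mean can be seen directly from the closed-form KL: for a posterior N(μ, diag(var)) and prior N(h̄, I), the derivative with respect to h̄ is simply h̄ − μ. A numpy sketch with a finite-difference check (illustrative values):

```python
import numpy as np

def kl_to_mean_prior(mu, var, h_bar):
    """KL( N(mu, diag(var)) || N(h_bar, I) ). Because the prior mean h_bar
    is the average of source hidden states, this term depends on the encoder."""
    return 0.5 * np.sum(var + (mu - h_bar)**2 - 1.0 - np.log(var))

def grad_wrt_h_bar(mu, var, h_bar):
    """d KL / d h_bar = (h_bar - mu): the KL term back-propagates into the
    encoder states through their average."""
    return h_bar - mu

mu = np.array([0.3, -0.2])     # posterior mean of a_j
var = np.array([0.8, 1.1])     # posterior variance of a_j
h_bar = np.array([0.1, 0.1])   # average of source hidden states

g = grad_wrt_h_bar(mu, var, h_bar)

# Finite-difference check of the analytic gradient.
eps = 1e-6
num = np.array([(kl_to_mean_prior(mu, var, h_bar + eps * e)
                 - kl_to_mean_prior(mu, var, h_bar - eps * e)) / (2 * eps)
                for e in np.eye(2)])
```

In an auto-differentiation framework this bookkeeping is of course handled automatically, as the text notes.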
3.5 Geometric Interpretation
We present a geometric interpretation of both deterministic and variational attention mechanisms in Figure 3.
Suppose the hidden representations h_i^(src) lie in a k-dimensional space (represented as the 3-d space in Figure 3). In the deterministic mechanism, the attention vector is a convex combination of h_i^(src), as the weights α_ji in Eq. (5) form a probabilistic distribution. The attention vector a_j is thus a point in the convex hull conv{h_i^(src)}, shown in Figure 3a.
For variational attention in Figures 3b and 3c, the mean μ_a is still in the convex hull, but the sample drawn from the posterior populates the entire space (although mostly around the mean, shown as a ball). The difference between the two priors is that the standard normal N(0, I) pulls the posterior toward the origin, whereas N(h̄^(src), I) pulls the posterior toward the mean of h_i^(src). These forces are shown as red arrows.
Finally, we would like to present a (potential) alternative for modeling variational attention. Instead of treating a_j as random variables, we might treat the weights α_j as random variables. Since α_j parameterizes a categorical distribution, its conjugate prior is a Dirichlet distribution. In this case, the resulting attention vector populates the entire convex hull (Figure 3d). However, this relies on the reparameterization trick to propagate the reconstruction error's gradient back to the recognition neural network (Kingma and Welling, 2013). In other words, the latent variables should be sampled from a fixed (parameter-free) distribution and then transformed into a sample of the desired distribution. This is non-trivial for Dirichlet distributions, and further research is needed to address this problem.
4 Experiments
Table 2: Performance of various models ("–": diversity metrics not applicable under MAP inference).

Model                        | Inference | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | Entropy | Dist-1 | Dist-2
-----------------------------|-----------|--------|--------|--------|--------|---------|--------|-------
Previous work (Du et al., 2017) | MAP    | 43.09  | 25.96  | 17.50  | 12.28  | –       | –      | –
DED (w/o Attn)               | MAP       | 39.46  | 28.49  | 20.74  |  8.10  | –       | –      | –
DED+DAttn                    | MAP       | 42.34  | 30.86  | 22.74  | 11.60  | –       | –      | –
VED+DAttn                    | MAP       | 42.50  | 31.13  | 23.09  | 12.38  | –       | –      | –
VED+DAttn                    | Sampling  | 42.48  | 31.10  | 23.08  | 12.30  | 2.37    | 0.18   | 0.26
VED+DAttn (2-stage training) | MAP       | 42.17  | 30.96  | 22.95  | 11.98  | –       | –      | –
VED+DAttn (2-stage training) | Sampling  | 41.98  | 30.82  | 22.81  | 11.78  | 2.41    | 0.19   | 0.27
VED+VAttn-0                  | MAP       | 41.77  | 30.54  | 22.53  | 11.37  | –       | –      | –
VED+VAttn-0                  | Sampling  | 41.73  | 30.51  | 22.49  | 11.27  | 2.44    | 0.20   | 0.28
VED+VAttn-h̄                 | MAP       | 42.10  | 30.71  | 22.70  | 11.55  | –       | –      | –
VED+VAttn-h̄                 | Sampling  | 42.03  | 30.62  | 22.66  | 11.50  | 2.44    | 0.20   | 0.29
4.1 Task, Dataset, and Metrics
We evaluated our approach on a question generation task, which aims to generate questions based on a sentence in a paragraph. We followed Du et al. (2017) and used the Stanford Question Answering Dataset (SQuAD) (Rajpurkar et al., 2016), except that we had a different split of 1k samples each for validation and testing, from 86k sentence-question pairs in total. As reported by Du et al. (2017), the attention mechanism is especially critical in this task in order to generate relevant questions. Also, generated questions do need some variety (e.g., in the creation of reading comprehension datasets), as opposed to machine translation, which is typically deterministic.
We followed Du et al. (2017) and used BLEU-1 to BLEU-4 scores (Papineni et al., 2002) to evaluate the quality (in the sense of accuracy) of generated sentences. Besides, we adopted entropy and distinct metrics to measure diversity. The entropy is computed as −Σ_w p(w) log p(w), where p(w) is the unigram probability in the generated sentences. Distinct metrics, used in previous work to measure diversity (Li et al., 2016), compute the percentage of distinct unigrams or bigrams (denoted as Dist-1 and Dist-2, respectively).
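The diversity metrics above are straightforward to compute. A minimal sketch of unigram entropy and Dist-n on whitespace-tokenized sentences (real evaluation pipelines would tokenize more carefully):

```python
from collections import Counter
import math

def unigram_entropy(sentences):
    """Entropy -sum_w p(w) log p(w) over the unigram distribution of the
    generated sentences (natural log here; the base is a free choice)."""
    counts = Counter(w for s in sentences for w in s.split())
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())

def distinct_n(sentences, n):
    """Dist-n: fraction of distinct n-grams among all generated n-grams."""
    ngrams = [tuple(ws[i:i + n]) for s in sentences
              for ws in [s.split()] for i in range(len(ws) - n + 1)]
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0

gen = ["the men are playing musical instruments",
       "the women are playing video games"]
h = unigram_entropy(gen)                        # higher = more diverse
d1, d2 = distinct_n(gen, 1), distinct_n(gen, 2)
```

Repeating the same sentence leaves the unigram distribution, and hence the entropy, unchanged, which is why entropy captures lexical diversity rather than mere output length.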
4.2 Training Details
We used LSTM-RNNs with 100 hidden units for both the encoder and decoder; the latent vector was also 100-dimensional. We adopted 300-dimensional pretrained word embeddings from Mikolov et al. (2013). For both the source and target sides, the vocabulary was limited to the 40k most frequent tokens. We used the Adam optimizer (Kingma and Ba, 2014) to train all models, with an initial learning rate of 0.005, a decay of 0.95, and other default hyperparameters. The batch size was set to 100.
As shown in Bowman et al. (2016), Seq2Seq VAE is hard to train because of issues associated with the KL term vanishing to zero. Following Bowman et al. (2016), we adopted KL cost annealing and word dropout during training. The coefficient of the KL term was gradually increased using a logistic annealing schedule, allowing the model to learn to reconstruct the input accurately during the early stages of training. A fixed word dropout rate was used.
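The logistic annealing schedule mentioned above can be sketched as follows; the steepness k and midpoint x0 are illustrative values, not the paper's actual settings:

```python
import math

def kl_weight(step, k=0.0025, x0=2500):
    """Logistic annealing schedule for the KL coefficient: close to 0
    early in training (so the model first learns to reconstruct),
    rising smoothly toward 1 as training progresses."""
    return 1.0 / (1.0 + math.exp(-k * (step - x0)))

w_early, w_mid, w_late = kl_weight(0), kl_weight(2500), kl_weight(10000)
```

Multiplying both KL terms of Eq. (8) by this weight ties their annealing together, matching the strategy of only annealing the shared coefficient.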
4.3 Performance
Table 2 presents the performance of the various models. We first implemented a traditional Seq2Seq with attention (DED+DAttn) and generally replicated the results of Du et al. (2017), showing that our implementation is fair. We also tried a Seq2Seq model without attention (DED), whose performance degrades by 3.5 BLEU-4 points, a large margin, showing that the task is well suited to testing attention mechanisms.
In the variational encoder-decoder (VED) framework, we report results obtained by both maximum a posteriori (MAP) inference and sampling. In the sampling setting, we drew 10 samples of z and a_j from the posterior for each data point, and report average BLEU scores. We see that VED with deterministic attention (VED+DAttn) yields the best performance in terms of BLEU scores. However, it is not satisfactory if we take diversity into account. Although it is more diverse than variational Seq2Seq without attention (shown in Section 2.4), it is less diversified than our proposed variational attention by a fair margin, since entropy is a logarithmic measure. Variational attention models also generate more distinct unigrams and bigrams, as indicated by the distinct metrics.
We also tried a heuristic of 2-stage training to correct the lack of diversity. In this baseline, we first trained VED without attention for 20 epochs, since the variational latent space is more difficult to train, and then added the attention mechanism to the model. This yields an entropy value in between those of variational attention and deterministic attention applied from the beginning.
The proposed variational attention achieves the best diversity while maintaining high quality. The priors N(0, I) and N(h̄^(src), I) yield similar diversity; the resulting models are denoted as VED+VAttn-0 and VED+VAttn-h̄, respectively. Regarding quality, N(h̄^(src), I) is better than N(0, I) in terms of all BLEU scores, being a more reasonable prior. Its corresponding model, VED+VAttn-h̄, shows less than 1 BLEU point of degradation compared with VED+DAttn, and is comparable to the deterministic Seq2Seq with attention (DED+DAttn), showing that variational attention with a proper prior does not hurt performance much, despite its diversity.
Strength of Attention's KL Loss. We tuned the strength of the attention's KL loss, i.e., γ_a in Eq. (8), and plot the BLEU-4 and entropy metrics in Figure 4. In this experiment, we used the VED+VAttn-h̄ variant. As shown, the strength of the attention's KL loss does not have a large effect on either BLEU or entropy. The result is expected because, in the limit of γ_a to infinity, the model is mathematically equivalent (regardless of computational issues) to mean pooling of the source's hidden states with noise drawn from a standard normal distribution. The experiment shows that the proposed variational attention is not sensitive to this hyperparameter.
5 Conclusion
In this paper, we proposed a variational attention mechanism for variational encoder-decoder (VED) frameworks. We observe that, in VED, if the decoder has direct access to the encoder, the connection may bypass the variational space. Traditional attention mechanisms might serve as such a bypassing connection, making the output less diverse. Our variational attention imposes a probabilistic distribution on the attention vector. We also proposed different priors for the attention vector, among which a normal distribution centered at the mean of the source representations is appropriate in our scenario. The proposed model was evaluated on a question generation task, showing that variational attention yields more diversified samples while retaining high quality. We also showed that the model is not sensitive to the strength of the attention's KL term.
Acknowledgments
We thank Hao Zhou for helpful discussion. The Titan Xp GPU used for this research was donated by the NVIDIA Corporation to Olga Vechtomova.
References
 Bahdanau et al. (2014) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 .
 Bowman et al. (2015) Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. pages 632–642. https://doi.org/10.18653/v1/D151075.
 Bowman et al. (2016) Samuel R. Bowman, Luke Vilnis, Oriol Vinyals, Andrew Dai, Rafal Jozefowicz, and Samy Bengio. 2016. Generating sentences from a continuous space. In Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning. pages 10–21. https://doi.org/10.18653/v1/K161002.
 Cao and Clark (2017) Kris Cao and Stephen Clark. 2017. Latent variable dialogue models and their diversity. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers. pages 182–187. https://doi.org/10.18653/v1/E172029.
 Du et al. (2017) Xinya Du, Junru Shao, and Claire Cardie. 2017. Learning to ask: Neural question generation for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pages 1342–1352. https://doi.org/10.18653/v1/P171123.
 Hinton and Salakhutdinov (2006) Geoffrey E Hinton and Ruslan R Salakhutdinov. 2006. Reducing the dimensionality of data with neural networks. Science 313(5786):504–507.
 Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation 9(8):1735–1780.
 Hu et al. (2017) Zhiting Hu, Zichao Yang, Xiaodan Liang, Ruslan Salakhutdinov, and Eric P. Xing. 2017. Toward controlled generation of text. In Proceedings of the 34th International Conference on Machine Learning. volume 70, pages 1587–1596.
 Kingma and Ba (2014) Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 .
 Kingma and Welling (2013) Diederik P Kingma and Max Welling. 2013. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114 .
 Li et al. (2016) Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. 2016. A diversity-promoting objective function for neural conversation models. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. pages 110–119. https://doi.org/10.18653/v1/N161014.
 Mikolov et al. (2013) Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems. pages 3111–3119.
 Papineni et al. (2002) Kishore Papineni, Salim Roukos, Todd Ward, and WeiJing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics. Association for Computational Linguistics, pages 311–318.
 Rajpurkar et al. (2016) Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. Squad: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. pages 2383–2392. https://doi.org/10.18653/v1/D161264.
 Rush et al. (2015) Alexander M. Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sentence summarization. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. pages 379–389. https://doi.org/10.18653/v1/D151044.
 Serban et al. (2017) Iulian Vlad Serban, Alessandro Sordoni, Ryan Lowe, Laurent Charlin, Joelle Pineau, Aaron C Courville, and Yoshua Bengio. 2017. A hierarchical latent variable encoder-decoder model for generating dialogues. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence. pages 3295–3301.
 Wainwright et al. (2008) Martin J Wainwright, Michael I Jordan, et al. 2008. Graphical models, exponential families, and variational inference. Foundations and Trends® in Machine Learning pages 1–305.
 Zhang et al. (2016) Biao Zhang, Deyi Xiong, jinsong su, Qun Liu, Rongrong Ji, Hong Duan, and Min Zhang. 2016. Variational neural discourse relation recognizer. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, pages 382–391. https://doi.org/10.18653/v1/D161037.
 Zheng et al. (2017) Zaixiang Zheng, Hao Zhou, Shujian Huang, Lili Mou, Xinyu Dai, Jiajun Chen, and Zhaopeng Tu. 2017. Modeling past and future for neural machine translation. arXiv preprint arXiv:1711.09502 (to appear in TACL) .
 Zhou and Neubig (2017) Chunting Zhou and Graham Neubig. 2017. Morphological inflection generation with multi-space variational encoder-decoders. In Proceedings of the CoNLL SIGMORPHON 2017 Shared Task: Universal Morphological Reinflection. pages 58–65.