Amortized Context Vector Inference for Sequence-to-Sequence Networks

Sotirios Chatzis, Aristotelis Charalampous, Kyriacos Tolias, Sotiris A. Vassou
Cyprus University of Technology

Neural attention (NA) is an effective mechanism for inferring complex structural data dependencies that span long temporal horizons. As a consequence, it has become a key component of sequence-to-sequence models that yield state-of-the-art performance in tasks as hard as abstractive document summarization (ADS), machine translation (MT), and video captioning (VC). NA mechanisms perform inference of context vectors; these constitute weighted sums of deterministic input sequence encodings, adaptively sourced over long temporal horizons. However, recent work in the field of amortized variational inference (AVI) has shown that it is often useful to treat the representations generated by deep networks as latent random variables. This allows the models to better explore the space of possible representations. Based on this motivation, in this work we introduce a novel perspective on a popular NA mechanism, namely soft-attention (SA). Our approach treats the context vectors generated by SA models as latent variables, the posteriors of which are inferred by employing AVI. Both the means and the covariance matrices of the inferred posteriors are parameterized via deep network mechanisms similar to those employed in the context of standard SA. To illustrate our method, we implement it in the context of popular sequence-to-sequence model variants with SA. We conduct an extensive experimental evaluation using challenging ADS, VC, and MT benchmarks, and show how our approach compares to the baselines.

1 Introduction

Sequence-to-sequence (seq2seq) or encoder-decoder models 1 constitute a novel solution to inferring relations between sequences of different lengths. They are broadly used for addressing tasks including machine translation (MT) 2, 3, abstractive document summarization (ADS), descriptive caption generation (DCG) 4, and question answering (QA) 5, to name just a few. Seq2seq models comprise two distinct RNN models: an encoder RNN, and a decoder RNN. Their main principle of operation is based on the idea of learning to infer an intermediate context vector representation, $c$, which is “shared” between the two RNN modules of the model, i.e., the encoder and the decoder. Specifically, the encoder converts the source sequence to a context vector (e.g., the final state of the encoder RNN), while the decoder is presented with the inferred context vector to produce the target sequence.

Despite these merits, though, baseline seq2seq models cannot learn temporal dynamics over long horizons. This is due to the fact that a single context vector is capable of encoding rather limited temporal information. This major limitation has been addressed via the development of neural attention (NA) mechanisms 2. NA has been a major breakthrough in Deep Learning, as it enables the decoder modules of seq2seq models to adaptively focus on temporally-varying subsets of the source sequence. This capacity, in turn, enables flexibly capturing long temporal dynamics in a computationally efficient manner.

Among the large collection of recently devised NA variants, the vast majority build upon the concept of Soft Attention (SA) 4. Under this rationale, at each sequence generation (decoding) step, NA-obtained context vectors essentially constitute deterministic representations of the dynamics between the source sequence and the decodings obtained thus far. However, recent work in the field of amortized variational inference (AVI) 6, 7 has shown that it is often useful to treat representations generated by deep networks as latent random variables. Indeed, it is now well-understood that, under such an inferential setup, the trained deep learning models become more effective in exploring the space of possible representations, instead of getting trapped in representations of poor quality. Then, model training reduces to inferring posterior distributions over the introduced latent variables. This can be performed by resorting to variational inference 8, where the sought variational posteriors are parameterized via appropriate deep networks.

Motivated by these research advances, in this paper we consider a novel formulation of SA. Specifically, we propose an NA mechanism formulation where the generated context vectors are considered random latent variables, over which AVI is performed. We dub our approach amortized context vector inference (ACVI). To exhibit the efficacy of ACVI, we implement it into: (i) Pointer-Generator Networks 9, which constitute a state-of-the-art approach for addressing ADS tasks; (ii) baseline seq2seq models with additive SA, applied to the task of VC; and (iii) baseline seq2seq models with multiplicative SA, applied to MT.

The remainder of this paper is organized as follows: In Section 2, we briefly present the seq2seq model variants in the context of which we implement our method and exhibit its efficacy. In Section 3, we introduce the proposed approach, and elaborate on its training and inference algorithms. In Section 4, we perform an extensive experimental evaluation of our approach using benchmark ADS, MT, and VC datasets. Finally, in the concluding Section, we summarize the contribution of this work.

2 Methodological Background

2.1 Abstractive Document Summarization

Abstractive document summarization consists in not only copying from an original document, but also learning to generate new sentences or novel words during the summarization process. The introduction of seq2seq models has rendered ADS both feasible and effective 10, 11. Dealing with out-of-vocabulary (OOV) words was one of the main difficulties that early ADS models were confronted with. Word and/or phrase repetition was a second issue. The pointer-generator model presented in 9 constitutes one of the most comprehensive efforts towards ameliorating these issues.

In a nutshell, this model comprises one bidirectional LSTM 12 (BiLSTM) encoder, and a unidirectional LSTM decoder, which incorporates an SA mechanism 2. The word embedding of each token, $w_i$, in the source sequence (document) is presented to the encoder BiLSTM; this obtains a representation (encoding) $h_i = [\overrightarrow{h}_i; \overleftarrow{h}_i]$, where $\overrightarrow{h}_i$ is the corresponding forward LSTM state, and $\overleftarrow{h}_i$ is the corresponding backward LSTM state. Then, at each generation step, $t$, the decoder LSTM gets as input the (word embedding of the) previous token in the target sequence. During training, this is the previous word in the available reference summary; during inference, this is the previous generated word. On this basis, it updates its internal state, $s_t$, which is then presented to the postulated SA network. Specifically, the attention distribution, $a_t$, is given by:

$$e_t^i = v^\top \tanh(W_h h_i + W_s s_t + b_{attn}) \tag{1}$$

$$a_t = \mathrm{softmax}(e_t) \tag{2}$$

where the $W_\cdot$ are trainable weight matrices, $b_{attn}$ is a trainable bias vector, and $v$ is a trainable parameter vector of the same size as $b_{attn}$. Then, the model updates the maintained context vector, $c_t$, by taking a weighted average of all the source token encodings; in that average, the used weights are the inferred attention probabilities. We obtain:

$$c_t = \sum_i a_t^i h_i \tag{3}$$

Eventually, the predictive distribution over the next generated word yields:

$$P_t^{vocab} = \mathrm{softmax}\big(V'(V[s_t; c_t] + b) + b'\big) \tag{4}$$

where $V$ and $V'$ are trainable weight matrices, while $b$ and $b'$ are trainable bias vectors.
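The additive SA computation described above (scores, softmax normalization, and the weighted average that forms the context vector) can be sketched in a few lines of plain Python. This is only an illustrative sketch under stated assumptions: the function names, the tiny dimensions, and the toy parameter values are hypothetical and are not taken from the authors' TensorFlow implementation.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(W, x):
    """Multiply a matrix W (given as a list of rows) by a vector x."""
    return [sum(w * xj for w, xj in zip(row, x)) for row in W]

def additive_attention(encodings, state, W_h, W_s, b, v):
    """Return (attention probabilities, context vector) for one decoding step."""
    Ws_s = matvec(W_s, state)
    scores = []
    for h in encodings:
        Wh_h = matvec(W_h, h)
        hidden = [math.tanh(p + q + r) for p, q, r in zip(Wh_h, Ws_s, b)]
        scores.append(sum(vk * hk for vk, hk in zip(v, hidden)))
    probs = softmax(scores)
    dim = len(encodings[0])
    # Context vector: attention-weighted average of the source encodings.
    context = [sum(a * h[d] for a, h in zip(probs, encodings)) for d in range(dim)]
    return probs, context

# Toy example (hypothetical values): three 2-dimensional encodings.
encodings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
state = [0.5, -0.5]
identity = [[1.0, 0.0], [0.0, 1.0]]
probs, context = additive_attention(
    encodings, state, identity, identity, [0.0, 0.0], [1.0, 1.0])
```

Because the attention probabilities sum to one, the resulting context vector always lies in the convex hull of the encodings.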

In parallel, the network also computes an additional probability, $p_t^{gen}$, which expresses whether the next output should be generated by sampling from the predictive distribution, $P_t^{vocab}$, or the model should simply copy one of the already available source sequence tokens. This mechanism allows the model to cope with OOV words; it is defined via a simple sigmoid layer of the form:

$$p_t^{gen} = \sigma(w_c^\top c_t + w_s^\top s_t + w_x^\top x_t + b_{ptr}) \tag{5}$$

where $x_t$ is the decoder input, while the $w_\cdot$ and $b_{ptr}$ are trainable parameters. The probability of copying the $i$th source sequence token is considered equal to the corresponding attention probability, $a_t^i$. Eventually, the obtained probability that the next output word will be $w$ (found either in the vocabulary or among the source sequence tokens) yields:

$$P_t(w) = p_t^{gen} P_t^{vocab}(w) + (1 - p_t^{gen}) \sum_{i: w_i = w} a_t^i \tag{6}$$
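To make the copy mechanism concrete, the sketch below mixes a vocabulary distribution with the attention (copy) distribution, weighting the former by the generation probability and the latter by its complement. The function name, the dictionary-based representation, and all numbers are hypothetical choices made for illustration only.

```python
def final_distribution(p_gen, p_vocab, attention, source_tokens):
    """Mix the generation and copy distributions into one distribution over words."""
    # Generation part: vocabulary probabilities scaled by p_gen.
    final = {word: p_gen * p for word, p in p_vocab.items()}
    # Copy part: every source position adds its attention mass to its token,
    # so repeated tokens accumulate mass and OOV tokens become reachable.
    for a, token in zip(attention, source_tokens):
        final[token] = final.get(token, 0.0) + (1.0 - p_gen) * a
    return final

# Toy example: "zyzzyva" is out-of-vocabulary but appears in the source.
p_vocab = {"the": 0.6, "cat": 0.4}
attention = [0.5, 0.3, 0.2]
source_tokens = ["cat", "zyzzyva", "cat"]
dist = final_distribution(0.8, p_vocab, attention, source_tokens)
```

In this toy case the OOV token receives nonzero probability purely through the copy term, which is the behaviour the text attributes to the pointer mechanism.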


Finally, a coverage mechanism may also be employed 13, as a means of penalizing words that have already received attention in the past, to prevent repetition. Coverage acts as a global attention aggregator that is added to the attention layer (1). Specifically, the coverage vector, $k_t$, is defined as:

$$k_t = \sum_{\tau=0}^{t-1} a_\tau \tag{7}$$

Using the so-obtained coverage vector, expression (1) is modified as follows:

$$e_t^i = v^\top \tanh(W_h h_i + W_s s_t + w_k k_t^i + b_{attn}) \tag{8}$$

where $w_k$ is a trainable parameter vector of size similar to $v$.

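Model training is performed via categorical cross-entropy minimization. A coverage term is also added to the loss function, to facilitate repetition prevention; it takes the form:

$$\mathcal{L}_t = -\log P_t(w_t^*) + \lambda \sum_i \min(a_t^i, k_t^i)$$

Here, $w_t^*$ is the target word at step $t$, and $\lambda$ controls the influence of the coverage term; its value is kept fixed in the remainder of this work.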
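The interplay between the coverage vector and the coverage penalty can be illustrated as follows. The sketch assumes the penalty has the form used in 9, namely the sum over source positions of the minimum between the current attention probability and the accumulated coverage; the function name and the attention values are hypothetical.

```python
def coverage_step(coverage, attention):
    """Return (coverage penalty for this step, updated coverage vector)."""
    # Penalty (assumed form from 9): overlap between the current attention
    # and the attention mass already spent on each source position.
    penalty = sum(min(a, k) for a, k in zip(attention, coverage))
    # Coverage accumulates the attention distributions of all previous steps.
    updated = [k + a for k, a in zip(coverage, attention)]
    return penalty, updated

# Toy run: the second step re-attends to position 0 and is penalized for it.
attention_steps = [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1], [0.1, 0.1, 0.8]]
coverage = [0.0, 0.0, 0.0]
penalties = []
for a_t in attention_steps:
    penalty, coverage = coverage_step(coverage, a_t)
    penalties.append(penalty)
```

The first step is never penalized, since the coverage vector starts at zero; steps that revisit already-attended positions incur a larger penalty than steps that move on to new ones.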

2.2 Video Captioning

Seq2seq models with attention have been successfully applied to several datasets of multimodal nature. Video captioning constitutes a popular such application. In this work, we consider a simple seq2seq model with additive SA that comprises a BiLSTM encoder, an LSTM decoder, and an output distribution of the form (4). The used encoder is presented with visual features obtained from a pretrained convolutional neural network (CNN). Using a pretrained CNN as our employed visual feature extractor ensures that all the evaluated attention models are presented with identical feature descriptors of the available raw data. Hence, it facilitates fairness in the comparative evaluation of our proposed attention mechanism. We elaborate on the specific model configuration in Section 4.2.

2.3 Machine Translation

Machine translation constitutes one of the first sequential data modeling applications where seq2seq models were shown to obtain state-of-the-art performance. In this work, we perform MT by means of a baseline seq2seq model comprising a BiLSTM encoder, an LSTM decoder, a predictive distribution over the next generated word which is given by (4), and a multiplicative SA mechanism. The latter is described by 3:

$$e_t^i = h_i^\top W s_t \tag{9}$$

in conjunction with Eq. (2); therein, $W$ is a trainable weight matrix. Our consideration here of multiplicative SA both serves the purpose of implementing and evaluating our approach under diverse SA variants, and is congruent with the best reported results in the related literature.
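For comparison with the additive variant, multiplicative SA scores each encoding against the decoder state through a bilinear form, and then normalizes the scores with a softmax. The sketch below assumes the "general" bilinear score of 3; the function names and toy values are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def multiplicative_scores(encodings, state, W):
    """Bilinear score h^T W s for every encoding h and decoder state s."""
    Ws = [sum(w * sj for w, sj in zip(row, state)) for row in W]
    return [sum(hk * wk for hk, wk in zip(h, Ws)) for h in encodings]

# Toy example: with W set to the identity, the score is a plain dot product.
encodings = [[1.0, 0.0], [0.0, 1.0]]
state = [2.0, 0.0]
identity = [[1.0, 0.0], [0.0, 1.0]]
scores = multiplicative_scores(encodings, state, identity)
attention = softmax(scores)
```

Unlike the additive form, no hidden tanh layer is involved, so the score costs a single matrix-vector product per decoding step plus one dot product per encoding.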

3 Proposed Approach

We begin by introducing the core assumption that the computed context vectors, $c_t$, constitute latent random variables. Further, we assume that, at each time point, $t$, the corresponding context vector, $c_t$, is drawn from a distribution associated with one of the available source sequence encodings, $h_i$. Let us introduce the set of binary latent indicator variables, $z_t^i$, $i = 1, \dots, N$, with $z_t^i = 1$ denoting that the context vector $c_t$ is drawn from the $i$th density, that is the density associated with the $i$th encoding, $h_i$, and $z_t^i = 0$ otherwise. Then, we postulate the following hierarchical model:

$$c_t \mid z_t^i = 1; \mathcal{D} \sim p\big(c_t; \theta(h_i)\big) \tag{11}$$

$$z_t; \mathcal{D} \sim \mathrm{Cat}\big(\pi_t(a_t)\big) \tag{12}$$

where $\mathcal{D}$ comprises the set of source and target training sequences, $\theta$ denotes the parameters set of the context vector conditional density, and $\pi_t^i$ denotes the probability of drawing from the $i$th conditional at time $t$. Notably, we assume that the assignment probabilities $\pi_t^i$ are functions of the attention probabilities, $a_t$. This is reasonable, since higher affinity of the decoder state, $s_t$, with the $i$th encoding, $h_i$, at time $t$, should result in higher probability that the context vector be drawn from the corresponding conditional density. Similarly, we consider that the parameters set $\theta$ is a function parameterized by the encodings vector it is associated with, $h_i$.

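Having defined the hierarchical model (11)-(12), it is important that we examine the resulting expression of the posterior density $p(c_t; \mathcal{D})$. By marginalizing over (11) and (12), we obtain:

$$p(c_t; \mathcal{D}) = \sum_i \pi_t^i(a_t) \, p\big(c_t; \theta(h_i)\big) \tag{13}$$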


In other words, we obtain a finite mixture model posterior over the context vectors, with mixture conditional densities associated with the available source sequence encodings, and mixture weights associated with the corresponding attention vectors. In addition, it is also interesting to compare this expression to the definition of context vectors under the conventional SA scheme. From (3), we observe that conventional SA is merely a special case of our proposed model, obtained by introducing two assumptions: (i) that the postulated mixture component assignment probabilities are identity functions of the associated attention probabilities, i.e.

$$\pi_t^i(a_t) = a_t^i \tag{14}$$

and (ii) that the conditional densities of the context vectors have all their mass concentrated on $h_i$, that is they collapse onto the single point, $h_i$:

$$p\big(c_t; \theta(h_i)\big) = \delta(c_t - h_i) \tag{15}$$


Indeed, by combining (13) - (15), we obtain:


whence we obtain (3) with probability 1.

Thus, our approach replaces the simplistic conditional density expression (15) with a more appropriate family, $p(c_t; \theta(h_i))$, as in (13). Such a latent variable consideration may result in significant advantages for the postulated seq2seq model. Specifically, our trained model becomes more agile in searching for effective context representations, as opposed to getting trapped in poor local solutions. Indeed, recent developments in deep learning research have provided strong evidence of these advantages in a multitude of deep network configurations and addressed tasks, e.g. 6, 7, 14.

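In the following, we examine conditional densities of Gaussian form. Adopting the inferential rationale of AVI, we consider that these conditional Gaussians are parameterized via the postulated BiLSTM encoder. Specifically, we assume:

$$p\big(c_t; \theta(h_i)\big) = \mathcal{N}\big(c_t \mid \mu(h_i), \mathrm{diag}(\sigma^2(h_i))\big) \tag{17}$$

where the means, $\mu(h_i)$, and the log-covariance diagonals, $\log \sigma^2(h_i)$, are parameterized via the encodings, with the aid of a trainable Multi-Layer Perceptron comprising one hidden ReLU layer; the encodings, $h_i$, are obtained from a BiLSTM encoder, similar to conventional models. Hence:

$$p(c_t; \mathcal{D}) = \sum_i \pi_t^i(a_t) \, \mathcal{N}\big(c_t \mid \mu(h_i), \mathrm{diag}(\sigma^2(h_i))\big) \tag{19}$$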


Thus, we have arrived at a full statistical treatment of the context vectors, $c_t$. The corresponding density is defined as a Gaussian mixture model; its means and log-covariance diagonals are parameterized via the encoder BiLSTM module of the seq2seq model. This concludes the formulation of the proposed ACVI mechanism.

Relation to Recent Work. From the above exposition, it becomes apparent that our approach generalizes the concept of neural attention by introducing stochasticity in the computation of context vectors. As we have already discussed, the ultimate goal of this construction is to allow for better exploring the space of possible representations, by leveraging Bayesian inference arguments. We emphasize that this is in stark contrast to recent efforts toward generalizing neural attention by deriving more complex attention distributions. Specifically, 15 have recently introduced structured attention, whereby the attention probabilities are considered to be interdependent over consecutive time-steps; for instance, they postulate the first-order Markov dynamics assumption:


Thus, 15 compute more involved attention distributions, while our approach provides a method for better exploring the representations space of context vectors. Note also that Eq. (20) gives rise to the need of executing much more computationally complex algorithms to perform attention distribution inference, e.g. the forward-backward algorithm 16. In contrast, our method imposes computational costs comparable to conventional SA.

Training Algorithm. To perform training of a seq2seq model equipped with the ACVI mechanism, we resort to maximization of the resulting evidence lower-bound (ELBO) expression. To this end, we need first to introduce some prior assumption over the context latent variables, $c_t$. To serve the purpose of simplicity, and also offer a valid way to effect model regularization, we consider:

$$p(c_t) = \mathcal{N}(c_t \mid 0, I) \tag{21}$$

On the grounds of these assumptions, it is easy to show that the resulting ELBO expression becomes:

$$\mathcal{L} = \sum_t \Big\{ \mathbb{E}_{p(c_t; \mathcal{D})}\big[-J_t\big] - \mathrm{KL}\big[\, p(c_t; \mathcal{D}) \,\|\, p(c_t) \,\big] \Big\} \tag{22}$$

where $J_t$ is the error function at decoding time $t$ of the seq2seq model, while the posterior expectation $\mathbb{E}_{p(c_t; \mathcal{D})}[\cdot]$, as well as the KL divergence term in (22), are approximated by drawing MC samples from the posteriors. To ensure that the resulting MC estimators will be of low variance, we adopt the reparameterization trick. To this end, we rely on the posterior expressions (17) and (14); based on these, we express the drawn MC samples as follows:

$$\tilde{c}_t = \sum_i \tilde{z}_t^i \, \tilde{c}_t^i \tag{23}$$


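In this expression, the $\tilde{c}_t^i$ are samples from the conditional Gaussians (17), which employ the standard reparameterization trick rationale, as applied to Gaussian variables. In other words, we have:

$$\tilde{c}_t^i = \mu(h_i) + \sigma(h_i) \odot \epsilon_t^i \tag{24}$$

where $\epsilon_t^i \sim \mathcal{N}(0, I)$.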
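Since the KL divergence between a Gaussian mixture posterior and a Gaussian prior has no closed form, it is estimated from samples. The following sketch shows one plain Monte Carlo estimator of that term for a diagonal Gaussian mixture and a standard normal prior. It is an illustration only: the function names are hypothetical, strictly positive mixture weights are assumed, and components are selected by ordinary discrete sampling, which is adequate for evaluating the estimator but is not differentiable with respect to the mixture weights.

```python
import math
import random

def log_normal_diag(x, mean, std):
    """Log-density of a diagonal Gaussian evaluated at x."""
    return sum(-0.5 * math.log(2.0 * math.pi) - math.log(s)
               - 0.5 * ((xi - m) / s) ** 2
               for xi, m, s in zip(x, mean, std))

def log_mixture(x, weights, means, stds):
    """Log-density of a Gaussian mixture (log-sum-exp for stability)."""
    terms = [math.log(w) + log_normal_diag(x, m, s)
             for w, m, s in zip(weights, means, stds)]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))

def sample_mixture(rng, weights, means, stds):
    """Pick a component, then draw mu + sigma * eps with eps ~ N(0, 1)."""
    i = rng.choices(range(len(weights)), weights=weights)[0]
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(means[i], stds[i])]

def mc_kl_to_standard_normal(rng, weights, means, stds, num_samples):
    """Average of log q(c) - log p(c) over samples c ~ q, with p = N(0, I)."""
    dim = len(means[0])
    zero, one = [0.0] * dim, [1.0] * dim
    total = 0.0
    for _ in range(num_samples):
        c = sample_mixture(rng, weights, means, stds)
        total += log_mixture(c, weights, means, stds) - log_normal_diag(c, zero, one)
    return total / num_samples

rng = random.Random(0)
# A single unit-variance component centred at 1 has analytic KL equal to 0.5.
kl_shifted = mc_kl_to_standard_normal(rng, [1.0], [[1.0]], [[1.0]], 20000)
```

A useful sanity check is that the estimate vanishes when the posterior coincides with the prior and approaches the analytic value for a single shifted Gaussian.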

On the other hand, the $\tilde{z}_t^i$ are samples from the Categorical distribution (14). To allow for performing backpropagation through these samples, while ensuring that the obtained gradients will be of low variance, we may draw $\tilde{z}_t$ by making use of the Gumbel-Softmax relaxation 17. However, we have empirically found that the modeling performance of our approach remains essentially the same if we replace the expression (23) with a (rough, yet eventually effective) approximation: We use a simple weighted average of the samples $\tilde{c}_t^i$, with the weights being the attention probabilities, $a_t^i$:

$$\hat{c}_t = \sum_i a_t^i \, \tilde{c}_t^i \tag{25}$$


The advantage of this approximation consists in the fact that it alleviates the computational costs of employing the Gumbel-Softmax relaxation, which dominates the costs of sampling from the mixture posterior (19). Hence, in the following we adopt (25), and report results under this approximation.
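The approximation adopted above amounts to drawing one reparameterized Gaussian sample per source encoding and averaging these samples with the attention probabilities as weights. A minimal sketch follows; the function name, the explicit per-component means and standard deviations, and the toy values are hypothetical, and in the actual model these quantities would be produced by the encoder and the associated networks.

```python
import random

def acvi_context(attention, means, stds, rng):
    """Attention-weighted average of reparameterized Gaussian samples."""
    dim = len(means[0])
    context = [0.0] * dim
    for a, mu, sigma in zip(attention, means, stds):
        for d in range(dim):
            eps = rng.gauss(0.0, 1.0)            # eps ~ N(0, 1)
            sample = mu[d] + sigma[d] * eps      # reparameterized draw
            context[d] += a * sample
    return context

rng = random.Random(0)
attention = [0.25, 0.75]
means = [[0.0, 4.0], [4.0, 0.0]]

# With zero standard deviations the samples collapse onto the means, and the
# result equals the deterministic weighted average used by conventional SA.
deterministic = acvi_context(attention, means, [[0.0, 0.0], [0.0, 0.0]], rng)

# With nonzero standard deviations the context vector becomes stochastic.
stochastic = acvi_context(attention, means, [[0.5, 0.5], [0.5, 0.5]], rng)
```

Because the noise enters only through the additive term, gradients with respect to the means, the standard deviations, and the attention weights can all flow through the sample without any discrete sampling step.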

Having obtained a reparameterization of the model ELBO that guarantees low variance estimators, we proceed to its maximization by resorting to a modern, off-the-shelf, stochastic gradient optimizer. Specifically, we adopt simple stochastic gradient descent (SGD) for the MT tasks, and Adam with its default settings 18 for the rest.

Inference Algorithm. To perform target decoding by means of a seq2seq model that employs the ACVI mechanism, we resort to Beam search 19. In our experiments, Beam width is set to five in the case of the ADS and VC tasks (Sections 4.1 and 4.2), and to ten in the case of the MT tasks (Section 4.3).
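For readers unfamiliar with beam search, the sketch below shows the generic procedure on a toy next-token model: at every step, each retained hypothesis is extended by every candidate token, and only the highest-scoring extensions (as many as the beam width) are kept. The toy model, the end-of-sequence symbol, and all names are hypothetical; the decoders used in the experiments would supply the next-token log-probabilities instead.

```python
import math

def beam_search(step_log_probs, beam_width, max_len, eos):
    """Return the best (token tuple, log-probability) found by beam search."""
    beams = [((), 0.0)]
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for token, log_p in step_log_probs(prefix).items():
                candidates.append((prefix + (token,), score + log_p))
        # Keep only the beam_width most probable extensions.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:beam_width]:
            if prefix[-1] == eos:
                finished.append((prefix, score))
            else:
                beams.append((prefix, score))
        if not beams:
            break
    finished.extend(beams)  # hypotheses cut off by max_len
    return max(finished, key=lambda c: c[1])

def toy_model(prefix):
    """Hypothetical next-token distribution, returned as log-probabilities."""
    table = {
        (): {"a": 0.6, "b": 0.4},
        ("a",): {"x": 0.5, "y": 0.5},
        ("b",): {"z": 1.0},
    }
    probs = table.get(prefix, {"</s>": 1.0})
    return {token: math.log(p) for token, p in probs.items()}

greedy_seq, greedy_score = beam_search(toy_model, 1, 5, "</s>")
beam_seq, beam_score = beam_search(toy_model, 2, 5, "</s>")
```

In this toy model, a beam of width one commits to the locally best first token and ends with a less probable sequence, whereas a beam of width two recovers the globally most probable one.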

4 Experimental Evaluation

We have developed our source code in Python, using the TensorFlow library 20. We ran our experiments on a server with an NVIDIA Tesla K40 GPU.

4.1 Abstractive Document Summarization

Our experiments are based on the non-anonymized CNN/Daily Mail dataset, similar to the experiments of 9. To obtain some comparative results, we use pointer-generator networks as our evaluation platform 9; therein, we employ our ACVI mechanism, the standard SA mechanism considered in 9, as well as structured attention using the first-order Markov assumption in Eq. (20) 15. This allows for directly assessing the benefits of treating the inferred context vectors as latent variables that are amenable to AVI.

The observations presented to the encoder modules constitute 128-dimensional word embeddings of the original 50K-dimensional one-hot vectors of the source tokens. Similarly, the observations presented to the decoder modules are 128-dimensional word embeddings pertaining to the summary tokens (reference tokens during training; generated tokens during inference). Both these embeddings are trained, as part of the overall training procedure of the evaluated models. Following the suggestions in 9, we evaluate all approaches with LSTMs that comprise 256-dimensional states. We use ROUGE (ROUGE-1, ROUGE-2 and ROUGE-L) 21 and METEOR 22 as our performance metrics. METEOR is evaluated both in exact match mode (rewarding only exact matches between words), as well as full mode (which additionally rewards matching stems, synonyms and paraphrases).

We direct the reader to Appendix A (Tables 6-9) for further details on the adopted experimental setup, as well as for some indicative examples of the generated summaries. Our quantitative evaluation is provided in Table 1. To allow for deeper insights, we show therein how performance of the evaluated models varies before and after we introduce the coverage mechanism. For completeness, we also cite the performance of alternative state-of-the-art approaches on the same data. These include ADS and Extractive Summarization models. The latter simply rely on copying from the source document, as opposed to learning to generate anew; this may produce lower quality summaries, but ones that are substantially less prone to grammatical or syntactic errors. As we observe, ACVI with coverage attains the best ROUGE scores among all the alternatives (e.g., a ROUGE-1 of 42.71, against 40.12 for structured attention with coverage and 39.53 for SA with coverage); only the extractive lead-3 baseline of 9 retains higher METEOR scores. Finally, it is interesting to examine whether ACVI increases the propensity of a trained model towards generating novel words, that is words that are not found in the source document, as well as the capacity to adopt OOV words. The related results are provided in Table 2. We observe that ACVI increases the rate of generated novel words more than sevenfold compared to the alternatives (0.38 versus 0.05). In a similar vein, ACVI appears to help the model better cope with OOV words (an adoption rate of 1.25, versus 1.16 for SA and 1.18 for structured attention).
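As an illustration of the notion of novel words used above, the following sketch computes the fraction of summary tokens that do not occur anywhere in the source document. The exact normalization behind the rates reported in Table 2 is not specified in the text, so this per-summary fraction, the function name, and the toy sentences are assumptions made purely for illustration.

```python
def novel_word_rate(source_tokens, summary_tokens):
    """Fraction of summary tokens that never appear in the source document."""
    if not summary_tokens:
        return 0.0
    source_vocab = set(source_tokens)
    novel = sum(1 for token in summary_tokens if token not in source_vocab)
    return novel / len(summary_tokens)

source = "the cat sat on the mat".split()
summary = "a cat rested".split()
rate = novel_word_rate(source, summary)
```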

| Model | ROUGE-1 | ROUGE-2 | ROUGE-L | METEOR (exact match) | METEOR (+ stem/syn/para) |
|---|---|---|---|---|---|
| Abstractive: | | | | | |
| abstractive model* 23 | 35.46 | 13.30 | 32.65 | - | - |
| seq2seq with SA (150K vocabulary) | 30.49 | 11.17 | 28.08 | 11.65 | 12.86 |
| seq2seq with SA (50K vocabulary) | 31.33 | 11.81 | 28.83 | 12.03 | 13.20 |
| pointer-generator (SA) | 36.44 | 15.66 | 33.42 | 15.35 | 16.65 |
| pointer-generator + coverage (SA) | 39.53 | 17.28 | 36.38 | 17.32 | 18.72 |
| pointer-generator + structured attention | 36.72 | 15.84 | 33.64 | 15.43 | 16.87 |
| pointer-generator + structured attention + coverage | 40.12 | 17.61 | 36.74 | 17.38 | 18.93 |
| pointer-generator + ACVI | 39.96 | 17.41 | 36.40 | 16.68 | 17.84 |
| pointer-generator + ACVI + coverage | 42.71 | 19.24 | 39.05 | 18.47 | 20.09 |
| Extractive: | | | | | |
| lead-3 baseline 9 | 40.34 | 17.70 | 36.57 | 20.48 | 22.21 |
| lead-3 baseline* 24 | 39.2 | 15.7 | 35.5 | - | - |
| extractive model* 24 | 39.6 | 16.2 | 35.3 | - | - |

Table 1: Abstractive Document Summarization: ROUGE and METEOR scores on the test set. Models and baselines in the top section are abstractive, while those in the bottom section are extractive. Those marked with * were trained and evaluated on the anonymized CNN/Daily Mail dataset.

4.2 Video Captioning

Our evaluation of the proposed approach in the context of a VC application is based on the Youtube2Text video corpus 25. To reduce the entailed memory requirements, we process only the first 240 frames of each video. To obtain some initial video frame descriptors, we employ a pretrained GoogLeNet CNN 26 (implementation provided in Caffe 27). Specifically, we use the features extracted at the pool5/7x7_s1 layer of this pretrained model. We select 24 equally-spaced frames out of the first 240 from each video, and feed them into the prescribed CNN to obtain a 1024-dimensional frame-wise feature vector. These are the visual inputs eventually presented to the trained models. All employed LSTMs comprise 1000-dimensional states. These are mapped to 100-dimensional features via the weight matrices in Eq. (1). The decoders are presented with 256-dimensional word embeddings, obtained in a fashion similar to our ADS experiments.
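The frame subsampling step can be sketched as follows. The text does not specify how the equally-spaced frames are positioned, so the choice below (taking the first frame of each of 24 consecutive blocks of 10 frames) is only one plausible reading, and the function name is hypothetical.

```python
def equally_spaced_indices(num_available, num_selected):
    """Indices of num_selected equally-spaced items among num_available."""
    step = num_available / num_selected
    return [int(i * step) for i in range(num_selected)]

# 24 frames out of the first 240 frames of a video.
frame_ids = equally_spaced_indices(240, 24)
```

Each selected frame would then be passed through the pretrained CNN to obtain its feature vector; videos shorter than 240 frames would need separate handling, which this sketch does not cover.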

More details on the adopted experimental setup are given in Appendix B. We obtain some comparative results by evaluating seq2seq models configured as described in Section 2.2; we use ACVI, structured attention in the form (20), or the conventional SA mechanism. Our quantitative evaluation is performed on the grounds of the ROUGE-L and CIDEr 28 scores, on both the validation set and the test set. The obtained results are depicted in Table 3; they show that our method outperforms the alternatives on both metrics and both sets, with the largest gain observed in validation CIDEr (0.6039, versus 0.5071 for structured attention). Finally, we provide some indicative examples of the generated results in Appendix B (Figs. 1-8). These illustrate the capacity of our approach to detect salient visual semantics, as well as subtle correlations between related lingual terms (e.g., hamster → small animal, a car drives → several people drive, meat → pork).

| | SA | Structured Attention | ACVI |
|---|---|---|---|
| Rate of Novel Words | 0.05 | 0.05 | 0.38 |
| Rate of OOV Words Adoption | 1.16 | 1.18 | 1.25 |

Table 2: Abstractive Document Summarization: Novel words generation rate and OOV words adoption rate obtained by using pointer-generator networks.

| Method | ROUGE: Valid. Set | ROUGE: Test Set | CIDEr: Valid. Set | CIDEr: Test Set |
|---|---|---|---|---|
| SA | 0.5628 | 0.5701 | 0.4575 | 0.421 |
| Structured Attention | 0.5804 | 0.5712 | 0.5071 | 0.4283 |
| ACVI | 0.5968 | 0.5766 | 0.6039 | 0.4375 |

Table 3: Video Captioning: Performance of the considered alternatives.

4.3 Machine Translation

Our experiments make use of publicly available corpora, namely WMT’16 English-to-Romanian (EnRo) and Romanian-to-English (RoEn), as well as IWSLT’15 English-to-Vietnamese (EnVi) and Vietnamese-to-English (ViEn). We benchmark the evaluated models against word-based vocabularies, and present our results in terms of the BLEU score 29.

Following the related literature, we utilize byte pair encoding (BPE) 30 in the case of the (En, Ro) pair. This allows for seamlessly handling rare words, by breaking a given vocabulary into a fixed-size vocabulary of variable-length character sequences (subwords). Subword vocabularies are shared among the languages of a source/destination pair. This way, we promote frequent subword units, thus improving the coverage of the available dictionary words.
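To illustrate how BPE arrives at subword units, the sketch below learns merge operations from a toy word-frequency table, following the general procedure of 30: words start as character sequences, and the most frequent adjacent symbol pair is repeatedly merged into a new symbol. The function names, the end-of-word marker, and the toy corpus are hypothetical; the sharing of vocabularies across the two languages of a pair is not covered here.

```python
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of the given pair by one merged symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def learn_bpe(word_freqs, num_merges):
    """Return the learned merges and the segmented vocabulary."""
    # Each word becomes a tuple of characters plus an end-of-word marker.
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        vocab = merge_pair(best, vocab)
    return merges, vocab

merges, vocab = learn_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, 3)
```

After three merges on this toy corpus, the frequent suffix shared by "newest" and "widest" has become a single subword unit, while rarer character sequences remain split.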

We obtain some comparative performance results by evaluating seq2seq models with multiplicative attention [Eq. (9)]; we consider ACVI, conventional SA, as well as structured attention as the evaluated competitors. The trained architecture is homogeneous across all our comparisons. Specifically, both the encoders and the decoders of the evaluated models are presented with 256-dimensional trainable word embeddings. We utilize 2-layer BiLSTM encoders, and 2-layer LSTM decoders; all comprise 256-dimensional hidden states on each layer, similar to the summarization application.

Further details on our experimental setup are provided in Appendix C. Our results in Table 4 show that ACVI outperforms both competitors on all four language pairs, on both the dev and the test sets. This is in line with our observations in Section 4.1. It is also worth mentioning that (the open-source implementation of) structured attention yields the worst performance in the MT task. In fact, to obtain reasonable results, we had to considerably extend structured attention model training. (For the (En, Ro) pair, we had to extend training of structured attention models to 12 epochs, in contrast to the 4 epochs that ACVI and baseline models required.) Despite that, the method’s inability to correctly capture temporal associations between different (source and target) European languages is conspicuous. This further supports our choice of treating the obtained context vectors as inferred latent random variables in the MT task as well.


| Method | EnVi dev | EnVi test | ViEn dev | ViEn test | EnRo dev | EnRo test | RoEn dev | RoEn test |
|---|---|---|---|---|---|---|---|---|
| Baseline | 23.21 | 25.18 | 20.89 | 23.28 | 12.87 | 14.40 | 15.87 | 15.78 |
| Structured Attention (12 epochs) | 16.81 | 17.00 | 17.19 | 18.08 | 7.04 | 7.08 | 11.02 | 11.68 |
| ACVI | 24.08 | 26.16 | 21.26 | 24.47 | 14.15 | 15.78 | 18.07 | 17.78 |

Table 4: Machine Translation: BLEU scores per source-target language pair.

5 Conclusions

In this work, we cast the problem of context vector computation for seq2seq-type models employing SA into amortized variational inference. We made this possible by considering that the sought context vectors are latent variables following a Gaussian mixture posterior parameterized by the source sequence encodings. In the same vein, we used the inferred attention probabilities associated with each encoding as the mixture component weights. We exhibited the merits of our approach on seq2seq architectures addressing ADS, VC, and MT tasks; we used benchmark datasets in all cases.

Finally, we underline that our proposed approach induces only negligible computational overheads compared to conventional SA. Specifically, the only extra trainable parameters that our approach postulates are those of the MLPs employed in Eq. (17); these are of extremely limited size compared to the overall model size, and correspond to merely a few extra feedforward computations at inference time. Besides, our sampling strategy does not induce significant computational costs, since we adopt the approximation in (25). Hence, our approach offers significant modeling performance benefits without imposing significant computational overheads.


  • 1 I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning with neural networks,” in Proc. NIPS, 2014.
  • 2 D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” in Proc. ICLR, 2015.
  • 3 M.-T. Luong, H. Pham, and C. D. Manning, “Effective approaches to attention-based neural machine translation,” in Proc. EMNLP, 2015.
  • 4 K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, R. Zemel, and Y. Bengio, “Show, attend and tell: Neural image caption generation with visual attention,” in Proc. ICML, 2015.
  • 5 S. Sukhbaatar, J. Weston et al., “End-to-end memory networks,” in Proc. NIPS, 2015.
  • 6 D. Jimenez Rezende and S. Mohamed, “Variational inference with normalizing flows,” in Proc. ICML, 2015.
  • 7 D. P. Kingma and M. Welling, “Auto-encoding variational Bayes,” in Proc. ICLR, 2014.
  • 8 H. Attias, “A variational Bayesian framework for graphical models,” in Advances in Neural Information Processing Systems, 2000, pp. 209–215.
  • 9 A. See, P. J. Liu, and C. D. Manning, “Get to the point: Summarization with pointer-generator networks,” in Proc. ACL, 2017.
  • 10 A. M. Rush, S. Chopra, and J. Weston, “A neural attention model for abstractive sentence summarization,” in Proc. EMNLP, 2015.
  • 11 W. Zeng, W. Luo, S. Fidler, and R. Urtasun, “Efficient summarization with read-again and copy mechanism,” Proc. ICLR, 2017.
  • 12 S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.
  • 13 Z. Tu, Z. Lu, Y. Liu, X. Liu, and H. Li, “Modeling coverage for neural machine translation,” in Proc. ACL, 2016.
  • 14 C. K. Sønderby, T. Raiko, L. Maaløe, S. K. Sønderby, and O. Winther, “Ladder variational autoencoders,” in Proc. NIPS, 2016.
  • 15 Y. Kim, C. Denton, L. Hoang, and A. M. Rush, “Structured attention networks,” in Proc. ICLR, 2017.
  • 16 L. Rabiner, “A tutorial on hidden Markov models and selected applications in speech recognition,” Proceedings of the IEEE, vol. 77, no. 2, pp. 257–286, 1989.
  • 17 E. Jang, S. Gu, and B. Poole, “Categorical reparameterization with Gumbel-Softmax,” in Proc. ICLR, 2017.
  • 18 D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in Proc. ICLR, 2015.
  • 19 S. Russell and P. Norvig, Artificial Intelligence: A Modern Approach. Prentice Hall, 2003.
  • 20 M. Abadi et al., “TensorFlow: Large-scale machine learning on heterogeneous systems,” 2015, software available from [Online]. Available:
  • 21 C.-Y. Lin, “Rouge: A package for automatic evaluation of summaries,” Text Summarization Branches Out, 2004.
  • 22 M. Denkowski and A. Lavie, “METEOR universal: Language specific translation evaluation for any target language,” in Proc. ACL Workshop on Statistical Machine Translation, 2014.
  • 23 R. Nallapati, B. Zhou, C. N. dos Santos, C. Gulcehre, and B. Xiang, “Abstractive text summarization using sequence-to-sequence RNNs and beyond,” in Proc. CoNLL, 2016.
  • 24 R. Nallapati, F. Zhai, and B. Zhou, “Summarunner: A recurrent neural network based sequence model for extractive summarization of documents.” in Proc. AAAI, 2017, pp. 3075–3081.
  • 25 L. Yao, A. Torabi, K. Cho, N. Ballas, C. Pal, H. Larochelle, and A. Courville, “Describing videos by exploiting temporal structure,” in Proc. ICCV, 2015.
  • 26 C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in Proc. CVPR, 2015.
  • 27 Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, “Caffe: Convolutional architecture for fast feature embedding,” arXiv:1408.5093, 2014.
  • 28 R. Vedantam, C. L. Zitnick, and D. Parikh, “CIDEr: Consensus-based image description evaluation,” in Proc. CVPR, 2015.
  • 29 K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, “BLEU: a method for automatic evaluation of machine translation,” in Proc. ACL, 2002.
  • 30 R. Sennrich, B. Haddow, and A. Birch, “Neural machine translation of rare words with subword units,” arXiv preprint arXiv:1508.07909, 2015.
  • 31 W. Zaremba, I. Sutskever, and O. Vinyals, “Recurrent neural network regularization,” in Proc. ICLR, 2015.
  • 32 M. Cettolo, J. Niehues, S. Stüker, L. Bentivogli, R. Cattoni, and M. Federico, “The IWSLT 2015 evaluation campaign,” in IWSLT 2015, International Workshop on Spoken Language Translation, 2015.
  • 33 M. Luong, E. Brevdo, and R. Zhao, “Neural machine translation (seq2seq) tutorial,” 2017.

Appendix A

We elaborate here on the experimental setup of Section 4.1. The dataset used comprises 287,226 training pairs of documents and reference summaries, 13,368 validation pairs, and 11,490 test pairs. In this dataset, the average article length is 781 tokens; the average summary length is 3.75 sentences, with the average summary being 56 tokens long. In all our experiments, we restrict the vocabulary to the 50K most common words in the considered dataset, similar to 9. Note that this is significantly smaller than what is typical in the literature 23.
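
The vocabulary restriction described above can be illustrated with a minimal sketch. It is not the preprocessing code used in our experiments; the function names `build_vocab` and `encode`, the `[UNK]` placeholder token, and the toy corpus are assumptions introduced only for illustration.

```python
from collections import Counter


def build_vocab(tokenized_docs, max_size=50_000, unk_token="[UNK]"):
    """Map the `max_size` most frequent tokens to integer ids.

    Id 0 is reserved for a hypothetical out-of-vocabulary placeholder.
    """
    counts = Counter(tok for doc in tokenized_docs for tok in doc)
    vocab = {unk_token: 0}
    for tok, _ in counts.most_common(max_size):
        vocab[tok] = len(vocab)
    return vocab


def encode(tokens, vocab, unk_token="[UNK]"):
    """Replace every token outside the vocabulary with the placeholder id."""
    return [vocab.get(tok, vocab[unk_token]) for tok in tokens]


# Toy corpus; with max_size=2 only "the" and "sat" are kept.
docs = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "bird"]]
vocab = build_vocab(docs, max_size=2)
ids = encode(["the", "sat", "bird"], vocab)
```

In this sketch, calling `build_vocab` with `max_size=50_000` would correspond to the 50K-word setting mentioned above.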

To allow for faster training convergence, we split training into five phases. In each phase, we employ a different maximum number of encoding steps for the evaluated models (i.e., the size of the inferred attention vectors), as well as a different maximum number of decoding steps. We provide the related details in Table 5. During these phases, we train the employed models with the coverage mechanism disabled; that is, we set . We enable this mechanism only after these five training phases conclude. Specifically, we perform a final 3K iterations of model training, during which we train the weights along with the rest of the model parameters. We do not use any form of regularization, as suggested in 9.

In Tables 6-9, we provide some indicative examples of summaries produced by a pointer-generator network with coverage, employing the ACVI mechanism. We also show the source document, as well as the available reference summary used for quantitative performance evaluation. In all cases, we annotate OOV words in italics, we highlight novel words in purple, we show contextual understanding in bold, while article fragments also included in the generated summary are highlighted in green.

| Phase | Iterations | Max encoding steps | Max decoding steps |
|---|---|---|---|
| 1 | 0–71k | 10 | 10 |
| 2 | 71k–116k | 50 | 50 |
| 3 | 116k–184k | 100 | 50 |
| 4 | 184k–223k | 200 | 50 |
| 5 | 223k–250k | 400 | 100 |

Table 5: Abstractive Document Summarization: Training phases.
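
The phase schedule of Table 5 can be expressed as a simple lookup. The following is a minimal sketch, not our training code: the name `limits_for_iteration` is hypothetical, phase boundaries are assumed to be half-open (an iteration equal to a boundary belongs to the later phase), and the limits of the last phase are assumed to persist during the final coverage-training iterations.

```python
# (start_iteration, end_iteration, max_encoding_steps, max_decoding_steps)
PHASES = [
    (0, 71_000, 10, 10),
    (71_000, 116_000, 50, 50),
    (116_000, 184_000, 100, 50),
    (184_000, 223_000, 200, 50),
    (223_000, 250_000, 400, 100),
]


def limits_for_iteration(iteration):
    """Return (max_encoding_steps, max_decoding_steps) for an iteration."""
    if iteration < 0:
        raise ValueError("iteration must be non-negative")
    for start, end, max_enc, max_dec in PHASES:
        if start <= iteration < end:
            return max_enc, max_dec
    # Assumption: past the last phase, keep the final phase limits.
    return PHASES[-1][2], PHASES[-1][3]


max_enc, max_dec = limits_for_iteration(200_000)
```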
lagos , nigeria -lrb- cnn -rrb- a day after winning nigeria ’s presidency , muhammadu buhari told cnn ’s christiane amanpour that he plans to aggressively fight corruption that has long plagued nigeria and go after the root of the nation ’s unrest . buhari said he ’ll “ rapidly give attention ” to curbing violence in the northeast part of nigeria , where the terrorist group boko haram operates . by cooperating with neighboring nations chad , cameroon and niger , he said his administration is confident it will be able to thwart criminals and others contributing to nigeria ’s instability . for the first time in nigeria ’s history , the opposition defeated the ruling party in democratic elections . buhari defeated incumbent goodluck jonathan by about 2 million votes , according to nigeria ’s independent national electoral commission . the win comes after a long history of military rule , coups and botched attempts at democracy in africa ’s most populous nation . in an exclusive live interview from abuja , buhari told amanpour he was not concerned about reconciling the nation after a divisive campaign . he said now that he has been elected he will turn his focus to boko haram and “ plug holes ” in the “ corruption infrastructure ” in the country . “ a new day and a new nigeria are upon us , ” buhari said after his win tuesday . “ the victory is yours , and the glory is that of our nation . ” earlier , jonathan phoned buhari to concede defeat . the outgoing president also offered a written statement to his nation . “ i thank all nigerians once again for the great opportunity i was given to lead this country , and assure you that i will continue to do my best at the helm of national affairs until the end of my tenure , ” jonathan said . “ i promised the country free and fair elections . (…)
Reference Summary
muhammadu buhari tells cnn ’s christiane amanpour that he will fight corruption in nigeria . nigeria is the most populous country in africa and is grappling with violent boko haram extremists . nigeria is also africa ’s biggest economy , but up to 70 % of nigerians live on less than a dollar a day .
Generated Summary
muhammadu buhari talks to cnn ’s christiane amanpour about the nation ’s unrest . for the first time in nigeria , opposition defeated incumbent goodluck jonathan by about 2 million votes. buhari : ” the victory is yours , and the glory is that of our nation ”
Table 6: Example 223.
lrb- cnn -rrb- eyewitness video showing white north charleston police officer michael slager shooting to death an unarmed black man has exposed discrepancies in the reports of the first officers on the scene . slager has been fired and charged with murder in the death of 50-year-old walter scott . a bystander ’s cell phone video , which began after an alleged struggle on the ground between slager and scott , shows the five-year police veteran shooting at scott eight times as scott runs away . scott was hit five times . if words were exchanged between the men , they ’re are not audible on the tape . it ’s unclear what happened before scott ran , or why he ran . the officer initially said that he used a taser on scott , who , slager said , tried to take the weapon . before slager opens fire , the video shows a dark object falling behind scott and hitting the ground . it ’s unclear whether that is the taser . (…)
Reference Summary
more questions than answers emerge in controversial s. c. police shooting . officer michael slager , charged with murder , was fired from the north charleston police department .
Generated Summary
video shows white north charleston police officer michael slager shooting to death . slager has been charged with murder in the death of 50-year-old walter scott . the video shows a dark object falling behind scott and hitting the ground .
Table 7: Example 89.
andy murray came close to giving himself some extra preparation time for his wedding next week before ensuring that he still has unfinished tennis business to attend to . the world no 4 is into the semi-finals of the miami open , but not before getting a scare from 21 year-old austrian dominic thiem , who pushed him to 4-4 in the second set before going down 3-6 6-4 , 6-1 in an hour and three quarters . murray was awaiting the winner from the last eight match between tomas berdych and argentina ’s juan monaco . prior to this tournament thiem lost in the second round of a challenger event to soon-to-be new brit aljaz bedene . andy murray pumps his first after defeating dominic thiem to reach the miami open semi finals . muray throws his sweatband into the crowd after completing a 3-6 , 6-4 , 6-1 victory in florida . murray shakes hands with thiem who he described as a ‘ strong guy ’ after the game . (…)
Reference Summary
british no 1 defeated dominic thiem in miami open quarter finals . andy murray celebrated his 500th career win in the previous round . third seed will play the winner of tomas berdych and juan monaco in the semi finals of the atp masters 1000 event in key biscayne
Generated Summary
the world no 4 is into the semi-finals of the miami open . murray is still ahead of his career through the season . andy murray was awaiting the winner from the last eight match . murray throws his sweatband into the crowd after a 6-4 6-1 victory in florida .
Table 8: Example 1305.
steve clarke afforded himself a few smiles on the touchline and who could blame him ? this has been a strange old season for reading , who are one win away from an fa cup semi-final against arsenal but have spent too long being too close to a championship relegation battle . at least this win will go some way to easing that load . they made it hard for themselves , but they had an in-form player in jamie mackie who was able to get the job done . he put reading in front in the first half and then scored a brilliant winner just moments after chris o’grady had levelled with a penalty – one of the only legitimate chances brighton had all night , even if clarke was angry about the decision . reading frontman jamie mackie fires the royals ahead against brighton in tuesday ’s championship fixture . mackie -lrb- centre -rrb- is congratulated by nathaniel chalobah and garath mccleary after netting reading ’s opener . reading -lrb- 4-1-3-2 -rrb- : federici ; gunter , hector , cooper , chalobah ; akpan ; mcleary , williams -lrb- keown 92 -rrb- , robson-kanu -lrb- pogrebnyak 76 -rrb- ; blackman , mackie -lrb- norwood 79 -rrb- . subs not used : cox , yakubu , andersen , taylor . scorer : mackie , 24 , 56 . booked : mcleary , pogrebnyak . brighton -lrb- 4-3-3 -rrb- : stockdale ; halford , greer , dunk , bennett ; ince -lrb- best 75 -rrb- , kayal , forster-caskey ; ledesma -lrb- bruno 86 -rrb- , o’grady , lualua . subs not used : ankergren , calderon , hughes , holla , teixeira . scorer : o’grady -lrb- pen -rrb- , 53 . booked : ince , dunk , bennett , greer . ref : andy haines . attendance : 14,748 . ratings by riath al-samarrai . (…)
Reference Summary
reading are now 13 points above the championship drop zone . frontman jamie mackie scored twice to earn royals all three points . chris o’grady scored for chris hughton ’s brighton from the penalty spot . niall keown - son of sportsmail columnist martin - made reading debut .
Generated Summary
jamie mackie opened the scoring against brighton in tuesday ’s championship fixture . chris o’grady and garath mccleary both scored . jamie mackie and garath mccleary were both involved in the game .
Table 9: Example 1710.

Appendix B

The considered Video Captioning task utilizes a dataset that comprises 1,970 video clips, each associated with multiple natural language descriptions. This results in a total of approximately 80,000 video / description pairs; the used vocabulary comprises approximately 16,000 unique words. The constituent topics cover a wide range of domains, including sports, animals and music. We split the available dataset into a training set comprising the first 1,200 video clips, a validation set composed of 100 clips, and a test set comprising the last 600 clips in the dataset. We preprocess the available descriptions using only the wordpunct tokenizer from the NLTK toolbox. We perform Dropout regularization of the employed LSTMs, as suggested in 31; we use a dropout rate of 0.5.
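
The split and tokenization described above can be sketched as follows. This is an illustration under stated assumptions rather than our preprocessing code: the validation clips are assumed to be the 100 clips that directly follow the training clips, and the regular expression is a standard-library stand-in assumed to behave like the NLTK wordpunct tokenizer (runs of word characters, or runs of punctuation). The names `split_clips` and `wordpunct_like_tokenize` are hypothetical.

```python
import re

# Stand-in pattern: alphanumeric runs, or runs of non-space punctuation.
_WORDPUNCT = re.compile(r"\w+|[^\w\s]+")


def wordpunct_like_tokenize(text):
    """Split text into word tokens and punctuation tokens."""
    return _WORDPUNCT.findall(text)


def split_clips(clip_ids, n_train=1200, n_val=100, n_test=600):
    """First n_train clips for training, next n_val for validation,
    last n_test for testing (validation placement is an assumption)."""
    train = clip_ids[:n_train]
    val = clip_ids[n_train:n_train + n_val]
    test = clip_ids[-n_test:]
    return train, val, test


clips = list(range(1970))  # placeholder ids for the 1,970 clips
train, val, test = split_clips(clips)
tokens = wordpunct_like_tokenize("a man is firing a gun at targets.")
```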

Moving on, we provide some characteristic examples of generated video descriptions. In the captions of the figures that follow, we annotate minor deviations with blue color, and use red color to indicate major mistakes which imply wrong perception of the scene.

Figure 1: ACVI: a man is firing a gun
Structured Attention: a man is firing a gun
SA: a man is firing a gun
Reference Description: a man is firing a gun at targets
Figure 2: ACVI: a woman is cutting a piece of pork
Structured Attention: a woman is cutting pork
SA: a woman is putting butter on a bed
Reference Description: someone is cutting a piece of meat
Figure 3: ACVI: a small animal is eating
Structured Attention: a small woman is eating
SA: a small woman is talking
Reference Description: a hamster is eating
Figure 4: ACVI: the lady poured the something into a bowl
Structured Attention: a woman poured an egg into a bowl
SA: a woman is cracking an egg
Reference Description: someone is pouring something into a bowl
Figure 5: ACVI: a woman is riding a horse
Structured Attention: a woman is riding a horse
SA: a woman is riding a horse
Reference Description: a woman is riding a horse
Figure 6: ACVI: several people are driving down a street
Structured Attention: several people are driving down the avenue
SA: a boy trying to jump
Reference Description: a car is driving down the road
Figure 7: ACVI: a man is playing the guitar
Structured Attention: a high man is playing the guitar
SA: a high man is dancing
Reference Description: a boy is playing the guitar
Figure 8: ACVI: the man is riding a bicycle
Structured Attention: the man is riding a motorcycle
SA: a man rides a motorcycle
Reference Description: a girl is riding a bicycle

Appendix C

Let us first provide some details on the datasets used in the context of our MT experiments. The WMT’16 task comprises data obtained by combining the Europarl v7, News Commentary v10 and Common Crawl corpora. For the (En, Ro) pair, this amounts to 400K parallel sentences. The shared vocabulary (obtained from BPE) totals 31.7K tokens. We use newsdev2016 as our development set (1.9K sentences), and newstest2016 as our test set (1.9K sentences) for the (En, Ro) pair.

On the other hand, the IWSLT’15 task provides a dataset with 133K training sentence pairs from translated TED talks, made available by the IWSLT 2015 Evaluation Campaign 32. Following the same preprocessing steps as in 3, we use TED tst2012 (1.5K sentences) as our validation set for hyperparameter tuning, and TED tst2013 (1.3K sentences) as our test set. The Vietnamese and English vocabulary sizes are 7.7K and 17.2K, respectively.

Our models are trained for 12K steps; this amounts to 4 epochs for the (En, Ro) pair, and 12 epochs for the (En, Vi) pair. As mentioned in Section 4.3, the only exception applies to structured attention models for the (En, Ro) pair; in that case, structured attention training required 12 epochs. We also perform dropout regularization of the trained models, with a dropout rate equal to 0.2. We use the default settings for the remaining hyperparameters, as found in the code provided by the authors in 15 and the code in 33, respectively. For clarification, the default settings used by the latter for the (En, Vi) pair also apply to the (En, Ro) pair.
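
The relation between training steps and epochs can be checked with a short calculation. The sketch below assumes a batch size of 128 sentence pairs; this value is not stated in this appendix and is used only as a hypothetical setting under which the reported epoch counts are approximately reproduced.

```python
def epochs_from_steps(steps, batch_size, num_pairs):
    """Number of passes over the training set after `steps` updates."""
    return steps * batch_size / num_pairs


STEPS = 12_000
BATCH_SIZE = 128  # assumption, not taken from the text

epochs_en_ro = epochs_from_steps(STEPS, BATCH_SIZE, 400_000)  # (En, Ro)
epochs_en_vi = epochs_from_steps(STEPS, BATCH_SIZE, 133_000)  # (En, Vi)
```

Under this assumption, the two values are 3.84 and roughly 11.5, which round to the 4 and 12 epochs mentioned above.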

Finally, we provide some characteristic examples of generated translations for all examined models. In the Tables that follow, we annotate minor and major deviations from the reference translation with blue and red, respectively. Synonyms are highlighted with green. We also indicate missing tokens (e.g., verbs, articles, adjectives) by adding the [missing] identifier mid-sentence.

Source sentence
Hầu hết ý tưởng của chúng tôi đều điên khùng , nhưng vài ý tưởng vô cùng tuyệt vời , và chúng tôi tạo ra đột phá .
Reference Translation
Most of our ideas were crazy , but a few were brilliant , and we broke through .
Generated Translation - Baseline
Most of our ideas were crazy , but some incredible ideas were awesome , and we created the breakthrough .
Generated Translation - Structured Attention
Most of our ideas are crazy , but some [missing: verb] really wonderful ideas , and we created a sudden .
Generated Translation - ACVI
Most of our ideas were crazy , but some [missing: verb] wonderful ideas , and we made a breakthrough .
Table 10: Vi→En, tst2012 - Example 84.
Source sentence
"điều đầu tiên bà muốn con hứa là con phải luôn yêu thương mẹ con "
Reference Translation
She said , " The first thing I want you to promise me is that you ’ll always love your mom . "
Generated Translation - Baseline
" The first thing she wants to revenge is she always loves her mother . "
Generated Translation - Structured Attention
" The first thing she wants to do is always love her . "
Generated Translation - ACVI
" The first thing she wanted you to promise you would have to do is to love your mother . "
Table 11: Vi→En, tst2012 - Example 165.
Source sentence
Hôm nay chúng ta sẽ dành một ít thời gian để nói về việc làm thế nào các video trở nên được ưa thích trong một thời gian ngắn và tiếp theo là lí do tại sao điều này lại đáng nói đến .
Reference Translation
So we ’re going to talk a little bit today about how videos go viral and then why that even matters .
Generated Translation - Baseline
We ’re going to spend a little bit of time to talk about how video games should be popular in a short time and next is why this is remarkable .
Generated Translation - Structured Attention
Today we ’re going to spend some time talking about how the videos are [missing: adjective] .
Generated Translation - ACVI
We ’re going to spend a little bit of time talking about how video videos become popular in a short time and then why this is remarkable .
Table 12: Vi→En, tst2012 - Example 856.
Source sentence
Dirceu este cel mai vechi membru al Partidului Muncitorilor aflat la guvernare luat în custodie pentru legăturile cu această schemă.
Reference Translation
Dirceu is the most senior member of the ruling Workers ’ Party to be taken into custody in connection with the scheme .
Generated Translation - Baseline
That is the most old Member of the People ’s Party of Maiers to government in custody for ties with this scheme .
Generated Translation - Structured Attention
It is the oldest member of the Mandi of the Massi in the government in the government .
Generated Translation - ACVI
Dirse is the oldest member of the People ’s Party on government in custody for the links with this scheme .
Table 13: Ro→En, newsdev2016 - Example 5.
Source sentence
Reprezentanții grupurilor de interese au vorbit la unison despre speranța lor în abilitatea lui Turnbull de a satisface interesul public, de a ajunge la un acord politic și de a face lucrurile bine.
Reference Translation
With one voice the lobbyists talked about a hoped-for ability in Turnbull to make the public argument , to cut the political deal and get tough things done .
Generated Translation - Baseline
The representatives of interest groups have spoken about their hope in the capacity of tourism to meet public interest , to reach a political agreement and to do things well .
Generated Translation - Structured Attention
The representatives of the interest groups have spoken in mind about their hope to meet the public interest , to achieve a political and good thing .
Generated Translation - ACVI
Representatives of interest groups have spoken about their hope in Mr Turnchl ’s ability to satisfy the public interest , to reach a political agreement and to do things well .
Table 14: Ro→En, newsdev2016 - Example 182.
Source sentence
Speculațiile în privința unei fuziuni, care foarte probabil ar stârni îngrijorări în privința încălcării normelor concurenței pe piețe precum Statele Unite și China, au circulat vreme de mai mulți ani.
Reference Translation
Speculation about a merger , likely to raise antitrust concerns in markets such as the United States and China , has swirled for years .
Generated Translation - Baseline
Speculations in respect of an area , which [missing: verb] very likely to raise concerns about the violation of competition rules on markets such as the United States and China , have been circulated for several years .
Generated Translation - Structured Attention
The signatures in relation to a plant , who would probably raise concerns about violations of competition on the markets like the United States and China , [missing: verb] in a number of years .
Generated Translation - ACVI
Speculations on a merger , which would probably raise concerns about the violation of competition rules in the markets such as the United States and China , have moved for several years .
Table 15: Ro→En, newsdev2016 - Example 799.
Table 4: Translation results on the (En, Vi) and (En, Ro) pairs with beam search.