Approximate Distribution Matching for Sequence-to-Sequence Learning
Abstract
Sequence-to-sequence models were introduced to tackle many real-life problems like machine translation, summarization, image captioning, etc. The standard optimization algorithms are mainly based on example-to-example matching like maximum likelihood estimation, which is known to suffer from the data sparsity problem. Here we present an alternate view that explains sequence-to-sequence learning as a distribution matching problem, where each source or target example is viewed as representing a local latent distribution in the source or target domain. We then interpret sequence-to-sequence learning as learning a transductive model to transform the source local latent distributions to match their corresponding target distributions. In our framework, we approximate both the source and target latent distributions with recurrent neural networks (augmenters). During training, the parallel augmenters learn to better approximate the local latent distributions, while the sequence prediction model learns to minimize the KL-divergence between the transformed source distributions and the approximated target distributions. This algorithm can alleviate the data sparsity issue in sequence learning by locally augmenting more unseen data pairs and increasing the model's robustness. Experiments conducted on machine translation and image captioning consistently demonstrate the superiority of our proposed algorithm over competing algorithms.
Wenhu Chen, Guanlin Li, Shujie Liu, Zhirui Zhang, Mu Li, Ming Zhou
University of California, Santa Barbara; Harbin Institute of Technology; Microsoft Research Asia; University of Science and Technology of China
wenhuchen@cs.ucsb.edu, epsilonlee.green@gmail.com, {shujliu, vzhirzh, muli, mingzhou}@microsoft.com
Copyright © 2019, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
Introduction
Deep learning has achieved great success in recent years, especially in sequence-to-sequence applications like machine translation (?; ?), image captioning (?; ?), abstractive summarization (?; ?) and speech recognition (?; ?). The most common approaches are based on neural networks that employ very large parameter sets to learn a transductive function between the input space and the target space.
The key problem faced by neural sequence-to-sequence models is how to learn a robust transductive function in such a high-dimensional space with rather sparse human-annotated data pairs. For example, machine translation takes an input sequence lying in a space of size |V_s|^{T_s} and outputs a sequence in another space of size |V_t|^{T_t}, where V_s, V_t denote the vocabularies and T_s, T_t denote the sequence lengths. In large-scale problems, the input and output spaces become so large that any amount of annotated data appears sparse. Such a data sparsity problem poses great challenges for the model to understand both the input and output diversity. It's worth noting that our claimed data sparsity problem is specific to the sequence-to-sequence scenario and slightly different from the curse of dimensionality (the curse of dimensionality arises when the feature dimension is too high for the limited data to fit, while our claimed data sparsity arises when the data space is too large for the limited data to cover). In general, our method runs parallel to methods that prevent model overfitting, such as regularization (?) and dropout (?).
In order to resolve this data sparsity problem in sequence-to-sequence learning, different data augmentation approaches (?; ?; ?; ?; ?; ?) have been proposed. These methods mainly focus on "augmenting" pseudo-parallel data to fully explore the data space. Their main weaknesses can be summarized as follows: 1) back-translation (?; ?) is specific to certain tasks like NMT; 2) reward-augmented training (?; ?) fails to consider source-side diversity; 3) dual learning (?) requires a duality property and additional resources.
In this paper, we aim to design a general-purpose sequence-to-sequence learning algorithm that alleviates the data sparsity problem without relying on any external resources. We first assume that every example in the dataset actually represents an unknown latent distribution, which we need to approximate. In the language domain, the latent distribution can be viewed as a set of paraphrases, while in the image domain, it can be thought of as a set of similar pictures. The current prevalent heuristics for approximating the latent distribution (?; ?) are mainly based on token-level replacement, which is known to suffer from the following problems: 1) Inconsistency: RAML (?) does not retain fidelity to the original data pairs and breaks the pairwise correspondence (e.g., RAML could turn an English sentence from "a girl is going to school" into "a girl is going to kitchen" while turning its German translation from "ein(a) Mädchen(girl) gehe(goes) …" into "ein(a) Junge(boy) gehe(goes) …"). 2) Broken structure: paraphrase replacement (?) potentially breaks the structure of the sequence and causes unnecessary errors (e.g., PPDB could turn a source sentence from "a girl is going to school" into "a girl plans to school"). 3) Discreteness: these methods only apply to sequences of discrete tokens and are not suitable for sequences of continuous vectors.
In order to address these issues and augment fluent, well-corresponded source-target pairs, we design our system to meet the following three criteria: 1) generability: we employ a generative model (the augmenter) to generate new sequences rather than using hard replacements, which avoids broken structure and is applicable to continuous-variable scenarios; 2) fidelity: we restrict each augmented pair to follow its original prototype by maximizing its likelihood computed by the sequence model; 3) diversity: we encourage the augmenters to output more unseen samples to cover the large data space. These designs enable the augmenters to better approximate the latent distributions, which in turn enhances the robustness of sequence-to-sequence learning. A pedagogical illustration is shown in Figure 1, where we learn the latent distributions and then employ the sequence model to align them. The testing error can be reduced by fully exploring the data space.
In conclusion, the major contributions of our paper are as follows:

We are the first to view sequence-to-sequence learning as a distribution matching problem.

We successfully apply our algorithm to two large-scale real-life tasks and design corresponding architectures for them.

We empirically demonstrate that our method remarkably outperforms existing algorithms such as MLE, RL, and RAML.
Related Literature
Neural Machine Translation
A major recent development in statistical machine translation is the adoption of neural networks. Neural network models promise better sharing of statistical evidence between similar words and inclusion of rich context. Since (?; ?) proposed the sequence-to-sequence model, it has been widely adopted in industry and academia. Later on, many new algorithms (?; ?; ?) have been proposed to enhance its performance on various MT datasets. Here, we follow the basic attentive RNN architecture of (?) to augment text data on both sides for the machine translation task.
Visual Captioning
The problem of visual concept detection has been studied in the vision community for decades, and it has become more popular with the surge of deep neural networks. Since then, a common theme has been utilizing both convolutional and recurrent neural networks for generating image descriptions. One of the early examples of this new paradigm is the work of (?), which utilizes a deep CNN to construct an image representation that is then fed to a bidirectional recurrent neural network. Later on, in order to exploit fine-grained regional attention information, a sequence-to-sequence architecture was proposed in (?) and widely adopted in subsequent research (?; ?). Here we follow this multimodal encoder-decoder architecture as our baseline for the image captioning task.
Reinforcement Learning
Exposure bias and train-test loss discrepancy are two major issues in the training of sequence prediction models for neural machine translation and image captioning. Many research works (?; ?; ?; ?; ?) have attempted to tackle these issues by exposing the model to its own distribution and directly maximizing task-level rewards. These methods are reported to achieve significant improvements in many applications like machine translation, image captioning and summarization. While these works encourage the sequence model to better exploit the target space by driving it with a human-crafted reward signal, our method encourages the sequence model to exploit both the source and target spaces with a sophisticated model-based reward signal.
Reward Augmented Training
One successful approach to data augmentation in neural machine translation is RAML (?), which proposes a novel payoff distribution to augment training samples based on task-level rewards (BLEU, edit distance, etc.). In order to sample from this intractable distribution, they further stratify the sampling process as first sampling an edit distance, then performing random substitution/deletion operations. In order to combat the unnecessary noise introduced by this random replacement strategy, our method considers semantic and syntactic context to perform paraphrase generation.
Preliminary
Here we first introduce the sequence-to-sequence model proposed in (?; ?), which applies two recurrent neural networks (?) to separately understand the input sequence and generate the output sequence. This framework has been widely applied in various sequence generation tasks due to its simplicity and end-to-end nature, which successfully avoids expensive human-crafted features. Given an input sequence x, the sequence model forms a distribution over the output space according to the chain rule as follows:
(1)  p(y|x) = ∏_{t=1}^{T} p(y_t | h_t),   h_t = f(h_{t−1}, y_{t−1}, attn(h_{t−1}, H))

where h_t are the recurrent units and attn is a global attention function that computes attention weights over the input information H. For generality, the sequence element y_t can be a discrete integer or a real-valued vector depending on the distribution p(y_t | h_t). In language-related tasks, y_t lies in the discrete space {1, …, |V|}, where the most frequently used choice is the multinomial distribution:
(2)  p(y_t = w | h_t) = exp(o_w) / Σ_{w′} exp(o_{w′})

where o = g(h_t) is the output of the projection function g.
In contrast, in visual captioning, y_t can be seen as an image representation lying in the continuous d-dimensional space R^d, where the most popular option is the multivariate Gaussian distribution:

(3)  p(y_t | h_t) = N(y_t; μ_t, diag(σ_t²))

where μ_t, σ_t are the Gaussian mean and deviation obtained from functions g_μ(h_t), g_σ(h_t). We will cover these two cases in the following sections.
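As a concrete reference for the two output parameterizations above, here is a minimal pure-Python sketch (the function names are illustrative, not from the paper): a softmax multinomial over a discrete vocabulary for the language case, and the log-density of an isotropic Gaussian for the continuous case.

```python
import math

def softmax(logits):
    """Multinomial output distribution over a discrete vocabulary (Eq. 2)."""
    m = max(logits)                       # subtract max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def gaussian_logpdf(x, mu, sigma):
    """Log-density of a diagonal Gaussian (Eq. 3), one independent
    dimension per coordinate of x."""
    return sum(-0.5 * math.log(2 * math.pi * s * s) - (xi - m) ** 2 / (2 * s * s)
               for xi, m, s in zip(x, mu, sigma))
```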
Model
Overview
Here we demonstrate our philosophy using a pedagogical illustration in Figure 1. The example demonstrates how our distribution-matching framework combats the data sparsity problem to improve the model's ability to understand the diversity on both sides. Our framework first introduces the parallel augmenters, which view a source-target pair (x, y) from the dataset as a prototype and aim at augmenting both sides simultaneously to output synthetic pairs (x′, y′). Specifically, we parameterize the source-side and target-side augmenters as q_φ(x′|x) and q_ψ(y′|y), both implemented with recurrent neural networks, and denote the sequence model as p_θ(y|x). Then we elaborate the above-mentioned constraints (see introduction) into two objective functions:

Matching loss: the transformed source distribution should match its corresponding local latent distribution in the target domain.
(4)  L_M = KL( q_ψ(y′|y) ‖ p̃(y′|x) )

where we use p̃(y′|x) to denote the marginal likelihood Σ_{x′} q_φ(x′|x) p_θ(y′|x′). However, we found that such a KL-divergence can degenerate into maximum likelihood estimation by collapsing the augmenters to Kronecker-delta functions centered on the ground truth. Such a scenario would violate the diversity constraint; therefore, we leverage an entropy regularization term on the source side to avert it. The matching loss can hence be expressed as follows:

(5)  L_M = KL( q_ψ(y′|y) ‖ p̃(y′|x) ) − λ H( q_φ(x′|x) )
Fidelity Loss: the randomly drawn samples should retain fidelity to their own ground truth.
(6)  L_F = − E_{x′∼q_φ(·|x)}[ R(x′, x) ] − E_{y′∼q_ψ(·|y)}[ R(y′, y) ]

where R denotes the similarity score (e.g. BLEU or METEOR in the discrete case, or another distance measure in the continuous case).
With the above two loss functions, we propose to sum them into a combined objective as follows:

(7)  L = L_M + L_F
We draw a pedagogical illustration of our proposed objective function in Figure 2. During optimization, we optimize the joint loss function directly with stochastic gradient descent.
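To make the combined objective concrete, the following toy sketch assembles a discrete KL matching term, a source-side entropy regularizer, and a negative-similarity fidelity term. The entropy weight `lam`, the averaging of similarity scores, and the function names are illustrative assumptions, not the paper's exact formulation.

```python
import math

def kl(p, q):
    """KL(p || q) for two discrete distributions given as probability lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def entropy(p):
    """Shannon entropy of a discrete distribution."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def combined_loss(target_aug, marginal, source_aug, sim_scores, lam=0.1):
    """Toy combined objective in the spirit of Eq. 7: a KL matching term
    minus a source-side entropy regularizer (Eq. 5), plus a fidelity term
    that rewards similarity of augmented samples to their prototypes (Eq. 6)."""
    matching = kl(target_aug, marginal) - lam * entropy(source_aug)
    fidelity = -sum(sim_scores) / len(sim_scores)  # negative mean similarity
    return matching + fidelity
```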
Optimization
Formally, we first write the gradient of the matching loss with respect to the two augmenters and the sequence model as follows:

(8)  ∇_ψ L_M = E_{y′∼q_ψ}[ (1 + log q_ψ(y′|y) − log p̃(y′|x)) ∇_ψ log q_ψ(y′|y) ]
     ∇_φ L_M = − E_{y′∼q_ψ}[ E_{x′∼q_φ}[ (p_θ(y′|x′) / p̃(y′|x)) ∇_φ log q_φ(x′|x) ] ]
     ∇_θ L_M = − E_{y′∼q_ψ}[ ∇_θ log p̃(y′|x) ]
Here we adopt a Monte-Carlo algorithm to approximate the gradients as follows: 1) sample N source sequences x′_1, …, x′_N and N target sequences y′_1, …, y′_N from the augmenters; 2) estimate p̃(y′|x) with (1/N) Σ_i p_θ(y′|x′_i); 3) use the sampled source and target sequences to estimate the above gradients.
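The Monte-Carlo estimation in steps 1-3 relies on the score-function (REINFORCE-style) identity ∇ E_{x∼p_θ}[f(x)] = E_{x∼p_θ}[f(x) ∇ log p_θ(x)]. The one-parameter Bernoulli toy below is an assumption purely for illustration of that identity, not the paper's model:

```python
import random

def score_function_gradient(theta, reward, n_samples=50000, seed=0):
    """Monte-Carlo estimate of d/d_theta E_{x~Bernoulli(theta)}[reward(x)]
    via grad = E[ reward(x) * d/d_theta log p(x; theta) ]."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        x = 1 if rng.random() < theta else 0
        # score: d/d_theta log p(x; theta) for a Bernoulli
        score = (x - theta) / (theta * (1 - theta))
        total += reward(x) * score
    return total / n_samples
```

For reward(x) = x the true gradient is 1, which the estimator approaches as the sample count grows.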
Then we write the gradient of the fidelity loss with respect to the two augmenters as follows:

(9)  ∇_φ L_F = − E_{x′∼q_φ}[ R(x′, x) ∇_φ log q_φ(x′|x) ]
     ∇_ψ L_F = − E_{y′∼q_ψ}[ R(y′, y) ∇_ψ log q_ψ(y′|y) ]
Since the augmenters and sequence model are mutually dependent, we adopt an alternate iterative training algorithm to update these terms as described in Algorithm 1.
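Since Algorithm 1 is not reproduced here, the skeleton below sketches one plausible alternating schedule under stated assumptions: the `sample`/`update` interfaces of the augmenters and the sequence model are hypothetical placeholders, not the paper's API.

```python
def alternate_training(data, seq_model, src_aug, tgt_aug, n_epochs=10, n_samples=4):
    """Alternating updates: each step draws augmented pairs from both
    augmenters, then updates the augmenters on the matching + fidelity
    losses and the sequence model on the matching loss."""
    for _ in range(n_epochs):
        for x, y in data:
            xs = [src_aug.sample(x) for _ in range(n_samples)]
            ys = [tgt_aug.sample(y) for _ in range(n_samples)]
            src_aug.update(xs, x, y, seq_model)   # fidelity + matching terms
            tgt_aug.update(ys, y, x, seq_model)
            seq_model.update(xs, ys)              # minimize KL matching loss
```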
Experiments
In order to evaluate our distribution matching framework on different sequence-to-sequence applications, we select the two most popular, machine translation and image captioning, as our benchmarks. We compare our method against state-of-the-art approaches as well as MLE, RAML and RL methods. We design two types of augmenters, as described in Figure 3, to handle the two different scenarios of machine translation and visual captioning. Our method is abbreviated as S2SDM in the following sections. For comparability, we follow the existing papers (?; ?) in adopting the same network architecture, and we also apply the learning rate annealing strategy described in (?) to further boost our system's performance. We trained all our models on Titan X GPUs; the experiments for both machine translation and visual captioning each take within 3 days (excluding pretraining) to achieve the reported scores. For machine translation, performance is reported with the standard BLEU-4 measure, while for image captioning, performance is reported with CIDEr, METEOR and BLEU-4 to measure different aspects of the generated captions.
Baseline systems
In both experiments, we specifically compare with the following three baselines:

MLE: Maximum likelihood estimation is the de facto algorithm for training sequence-to-sequence models; here we follow (?) to train the MLE sequence model.

RL: REINFORCE (?) has been frequently used in sequence training to maximize task-level metrics, as in (?; ?). Here we use delta-BLEU as the reward function and use policy gradient to update the sequence model.

RAML: We follow (?) to select the best temperature in all experiments. In order to sample from the intractable payoff distribution, we adopt the stratified sampling technique described in (?). Given a ground truth y, we first sample an edit distance e, and then randomly select e positions to replace the original labels. For each sentence, we randomly sample four candidates to perform RAML training.
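The stratified RAML sampling described above can be sketched as follows. The exponential prior over edit distances (temperature `tau`) is a common choice and an assumption here, as is the uniform-substitution step; the function name is illustrative.

```python
import math, random

def raml_sample(tokens, vocab, tau=0.85, max_dist=4, rng=random):
    """Stratified sampling: first draw an edit distance e with probability
    proportional to exp(-e / tau), then substitute e randomly chosen
    positions with random vocabulary tokens."""
    max_e = min(max_dist, len(tokens))
    weights = [math.exp(-e / tau) for e in range(max_e + 1)]
    # draw e from the (normalized) weights by inverse transform sampling
    r, e = rng.random() * sum(weights), 0
    for e, w in enumerate(weights):
        r -= w
        if r <= 0:
            break
    out = list(tokens)
    for pos in rng.sample(range(len(tokens)), e):
        out[pos] = rng.choice(vocab)
    return out
```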
Task 1: Machine Translation
In our experiments, we design the translator via an encoder-decoder network (?) with a maximum sentence length of 62 for both the encoder and decoder. The two augmenters both take the ground-truth tokens as inputs to approximate the latent distribution as a multinomial distribution, as depicted in Figure 3:

q(x′_t | x′_{<t}, x) = softmax( W h′_t )

where W is the output projection function and h′_t is the recurrent state obtained by the transition function at each step; the softmax output is the distribution over the whole vocabulary. The derivatives follow the same score-function form as Equation (9).
IWSLT2014 GermanEnglish Dataset
This corpus contains 153K sentence pairs for training, while the validation dataset contains 6,969 sentence pairs. The test set comprises dev2010, dev2012, tst2010, tst2011 and tst2012, for a total of 6,750 sentences. We adopt 512 as the RNN hidden state size and 256 as the embedding size. We use a bidirectional encoder and initialize both the decoder states and the augmenters' hidden states with the encoder's last hidden state. The experimental results for the IWSLT2014 German-English and English-German translation tasks are summarized in Table 1, from which we can observe that our method achieves significant improvements in both directions.
Model  DE2EN  EN2DE 
MIXER (?)  21.81   
BSO (?)  26.36   
AC (?)  28.53   
SoftmaxQ (?)  28.77   
MLE  29.10  24.40 
RL  29.70  24.75 
RAML  29.47  24.86 
S2SDM  30.70  25.46 
LDC ChineseEnglish Dataset
The LDC Chinese-English training corpus consists of 1.25M parallel sentences, with 27.9M Chinese words and 34.5M English words. We choose NIST 2003 as our development set and evaluate our results on NIST 2005 and NIST 2006. We adopt a similar setting to the IWSLT German-English translation task: 512 as the hidden size for the GRU cell and 256 as the embedding size. The experimental results for the LDC Chinese-English translation task are listed in Table 2.
Model
MLE  39.0 / 37.1 / 39.1  17.57 / 16.38 / 17.31  
RL  41.0 / 39.2 / 39.3  18.44 / 16.98 / 17.80  
RAML  40.2 / 37.3 / 37.2  17.83 / 16.52 / 16.79  
S2SDM  41.8 / 39.3 / 39.5  18.92 / 17.36 / 17.88 
Task 2: Image Captioning
In the image captioning experiments, we follow Att2in (?; ?) to design our sequence-to-sequence captioning model as depicted in Figure 4. We adopt the same target augmenter as in the MT experiment. For the source augmenter, we use the reparameterization trick to model the continuous visual representation as a multivariate Gaussian distribution as follows:

x′_t = μ_t + σ_t · ε,   ε ∼ N(0, I)

where we view the input vector x_t as the Gaussian mean μ_t and simplify the standard deviation to a scalar σ_t per step. We adopt an RNN to predict σ_t at step t as follows:

σ_t = g(h_t)

where ε denotes the noise sampled from an isotropic Gaussian and h_t denotes the RNN hidden state. Since the augmenter distributions are assumed Gaussian, the log-density of an augmented sample boils down to a scaled squared-error term:

log q(x′ | x) = − Σ_{t=1}^{T} ( ‖x′_t − μ_t‖² / (2σ_t²) + (d/2) log(2π σ_t²) )

The derivatives then follow the score-function form of Equation (9), where d represents the dimension of the multivariate Gaussian and T represents the length of the sequence.
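The reparameterization trick used by the source augmenter can be sketched in a few lines: drawing eps from a standard Gaussian and shifting/scaling it makes the sample a deterministic, differentiable function of (mu, sigma). The function name is illustrative.

```python
import random

def reparameterize(mu, sigma, rng=random):
    """Reparameterization trick: draw eps ~ N(0, I) and return
    mu + sigma * eps, elementwise over the vector dimensions."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]
```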
Dataset Description
We evaluate the performance of our model on MSCOCO (?). The MSCOCO dataset contains 123,287 images for training and validation, and 40,775 images for testing. Here we use the standard split described by Karpathy (https://github.com/karpathy/neuraltalk2), in which 5,000 images are used for validation, 5,000 for testing and the rest for training. We pretrain the model on this data using a batch size of 256 and validate on an out-of-domain held-out set; this stage ends when the validation score converges or the maximum number of epochs is reached. After pretraining, we continue distribution matching training on the original paired dataset.
Experimental Setting
The LSTM hidden, image, word and attention embedding dimensions are fixed to 512 for all of the models discussed herein. We initialize all models by training under the cross-entropy objective, annealing the learning rate by a factor of 0.8 every three epochs. At test time, we perform beam search with a beam size of 4 to decode words until the end-of-sentence symbol is reached. We use the standard evaluation metrics described in (?), including BLEU@N (?), a precision-based machine translation evaluation metric, METEOR (?), as well as CIDEr (?), which measures human consensus.
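For reference, the test-time decoding described above (beam size 4, stop at the end-of-sentence symbol) can be sketched as below; the `step_fn` interface, which returns (token, log-probability) candidates for a prefix, is a hypothetical stand-in for the captioning model.

```python
def beam_search(step_fn, start, end_token, beam_size=4, max_len=20):
    """Minimal beam-search decoder: keep the beam_size highest-scoring
    partial sequences, extending each with step_fn's candidates until
    every beam ends with end_token or max_len is reached."""
    beams = [(0.0, [start])]
    for _ in range(max_len):
        candidates = []
        for score, seq in beams:
            if seq[-1] == end_token:          # finished beams carry over
                candidates.append((score, seq))
                continue
            for tok, lp in step_fn(seq):
                candidates.append((score + lp, seq + [tok]))
        beams = sorted(candidates, key=lambda c: -c[0])[:beam_size]
        if all(seq[-1] == end_token for _, seq in beams):
            break
    return beams[0][1]
```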
  Reference  taihsi natives seeking work … being hired, and later their colleagues maintain …
  Sample  taihsi natives seeking work … being employed, and later their colleagues maintain …
Simplification  Reference  i once took mr tung … that a narrow alley could have accommodated so many people.
  Sample  i once took mr tung … that a narrow alley have a lot of people.
Re-Ordering  Reference  he and I went to the theater yesterday to see a film.
  Sample  I and he went to the theater yesterday to see a film.
Repetition/Missing  Reference  and i had recently discovered a bomb shelter …
  Sample  i have discovered a place place …
Model  CIDEr  BLEU-4  METEOR
  66.0  23.0  19.5
  66.7  24.3  23.9
  111.4  33.3  26.3
  101.3  31.3  26.0
  109.8  32.8  26.0
Att2in RAML  98.5  31.2  26.0
S2SDM  112.8  33.9  26.4
Experimental Results
We summarize the experimental results in Table 4, where we mainly compare with MLE, RL and RAML. We implement our Att2in RAML and S2SDM based on the open repository at https://github.com/ruotianluo/self-critical.pytorch. As can be seen, our method achieves remarkable gains across different metrics over RL, MLE and RAML; moreover, our best single-model results also slightly outperform the SCST training algorithm (?). These results consistently demonstrate the advantage of the distribution matching algorithm in continuous sequence scenarios, which can potentially be extended to more vision-related sequence-to-sequence tasks.
Results Analysis
From the above experiments on machine translation and image captioning, we can observe consistent and significant improvements of S2SDM over MLE, RL and RAML algorithms. To fully understand the strength of our method, we first detail its comparisons with other algorithms and then showcase some examples to analyze its advantages.
Comparison with RAML and RL
From the above results, we observe limited improvements from the RAML algorithm on most tasks, and even performance degradation on some (LDC Chinese-English, image captioning). We conjecture that this is caused by the heuristic replacement strategy, which breaks both semantic and structural information. Especially in image captioning, where five references already exist on the target side, further augmenting the target side yields very little gain. Reinforcement learning only focuses on enhancing the target-side decision process, while our method of augmenting the source sequence exposes the model to more unseen source-side sequences. This advantage makes our model better at handling unseen visual representations and generalizing to test cases. We empirically verify the effectiveness of the S2SDM algorithm in augmenting both discrete and continuous sequence data pairs.
Case Studies
In order to give a more intuitive view of the latent distribution approximated by our augmenters, we draw some high-probability samples from them. We observe that most of the sample pairs retain their fidelity to the original pair; their modifications against the original ground truth mainly fall into four types, which we demonstrate in Table 3. Though the augmenter introduces some noise into the references, this noise is still under control, and the most frequent noise types are missing and repeated words. Such observations confirm our intuition of building a semantics- and syntax-preserving cluster around the ground truth. We showcase the paired augmentation samples in Figure 5.
Further, we also demonstrate a few image captioning examples in Figure 6 to showcase the advantage of our distributionmatching framework. As can be seen, the generated samples adopt a more vivid and diverse language expression. More detailed descriptions about the objects in the picture are included.
Learning Curve
Here we show the learning curves of the sequence-to-sequence model for the IWSLT machine translation task and the image captioning task separately in Figure 7 and Figure 8. We observe very stable improvements of our distribution matching algorithm over the pretrained model. In the machine translation task, RL and RAML both boost the model by 0.5 to 0.8 BLEU, while distribution matching boosts it by roughly 1.5 BLEU. In image captioning, RAML does not benefit training evidently, while RL and distribution matching both improve performance remarkably in terms of CIDEr-D.
Conclusion
In this paper, we propose a new end-to-end training algorithm to resolve the data sparsity problem in sequence-to-sequence applications. We have verified the capability of our model in two popular applications (machine translation and image captioning) to understand more diverse inputs and generate more complicated outputs. We look forward to testing our algorithm on more sequence-to-sequence applications to verify its generality.
References
 [Bahdanau et al. 2016] Bahdanau, D.; Brakel, P.; Xu, K.; Goyal, A.; Lowe, R.; Pineau, J.; Courville, A.; and Bengio, Y. 2016. An actorcritic algorithm for sequence prediction. arXiv preprint arXiv:1607.07086.
 [Bahdanau, Cho, and Bengio 2014] Bahdanau, D.; Cho, K.; and Bengio, Y. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
 [Banerjee and Lavie 2005] Banerjee, S., and Lavie, A. 2005. Meteor: An automatic metric for mt evaluation with improved correlation with human judgments. In Proceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization, volume 29, 65–72.
 [Bühlmann and Van De Geer 2011] Bühlmann, P., and Van De Geer, S. 2011. Statistics for highdimensional data: methods, theory and applications. Springer Science & Business Media.
 [Chen et al. 2015] Chen, X.; Fang, H.; Lin, T.Y.; Vedantam, R.; Gupta, S.; Dollár, P.; and Zitnick, C. L. 2015. Microsoft coco captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325.
 [Chen et al. 2016] Chen, W.; Matusov, E.; Khadivi, S.; and Peter, J.T. 2016. Guided alignment training for topicaware neural machine translation. arXiv preprint arXiv:1607.01628.
 [Chen et al. 2017] Chen, W.; Li, G.; Ren, S.; Liu, S.; Zhang, Z.; Li, M.; and Zhou, M. 2017. Generative bridging network in neural sequence prediction. arXiv preprint arXiv:1706.09152.
 [Chen, Lucchi, and Hofmann 2016] Chen, W.; Lucchi, A.; and Hofmann, T. 2016. Bootstrap, review, decode: Using outofdomain textual data to improve image captioning. arXiv preprint arXiv:1611.05321.
 [Cho et al. 2014] Cho, K.; Van Merriënboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; and Bengio, Y. 2014. Learning phrase representations using rnn encoderdecoder for statistical machine translation. arXiv preprint arXiv:1406.1078.
 [Chung et al. 2014] Chung, J.; Gulcehre, C.; Cho, K.; and Bengio, Y. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555.
 [Ganitkevitch, Van Durme, and CallisonBurch 2013] Ganitkevitch, J.; Van Durme, B.; and CallisonBurch, C. 2013. Ppdb: The paraphrase database. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 758–764.
 [He et al. 2016] He, D.; Xia, Y.; Qin, T.; Wang, L.; Yu, N.; Liu, T.; and Ma, W.Y. 2016. Dual learning for machine translation. In Advances in Neural Information Processing Systems, 820–828.
 [Karpathy and FeiFei 2015] Karpathy, A., and FeiFei, L. 2015. Deep visualsemantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 3128–3137.
 [Lin et al. 2014] Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; and Zitnick, C. L. 2014. Microsoft coco: Common objects in context. In European Conference on Computer Vision, 740–755. Springer.
 [Lu et al. 2015] Lu, L.; Zhang, X.; Cho, K.; and Renals, S. 2015. A study of the recurrent neural network encoderdecoder for large vocabulary speech recognition. In Sixteenth Annual Conference of the International Speech Communication Association.
 [Ma et al. 2017] Ma, X.; Yin, P.; Liu, J.; Neubig, G.; and Hovy, E. 2017. Softmax qdistribution estimation for structured prediction: A theoretical interpretation for raml. arXiv preprint arXiv:1705.07136.
 [Mikolov et al. 2010] Mikolov, T.; Karafiát, M.; Burget, L.; Cernockỳ, J.; and Khudanpur, S. 2010. Recurrent neural network based language model. In Interspeech, volume 2, 3.
 [Norouzi et al. 2016] Norouzi, M.; Bengio, S.; Jaitly, N.; Schuster, M.; Wu, Y.; Schuurmans, D.; et al. 2016. Reward augmented maximum likelihood for neural structured prediction. In Advances In Neural Information Processing Systems, 1723–1731.
 [Papineni et al. 2002] Papineni, K.; Roukos, S.; Ward, T.; and Zhu, W.J. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics, 311–318. Association for Computational Linguistics.
 [Paulus, Xiong, and Socher 2017] Paulus, R.; Xiong, C.; and Socher, R. 2017. A deep reinforced model for abstractive summarization. arXiv preprint arXiv:1705.04304.
 [Ranzato et al. 2015] Ranzato, M.; Chopra, S.; Auli, M.; and Zaremba, W. 2015. Sequence level training with recurrent neural networks. arXiv preprint arXiv:1511.06732.
 [Rennie et al. 2016] Rennie, S. J.; Marcheret, E.; Mroueh, Y.; Ross, J.; and Goel, V. 2016. Selfcritical sequence training for image captioning. arXiv preprint arXiv:1612.00563.
 [Rush, Chopra, and Weston 2015] Rush, A. M.; Chopra, S.; and Weston, J. 2015. A neural attention model for abstractive sentence summarization. arXiv preprint arXiv:1509.00685.
 [Sennrich, Haddow, and Birch 2015a] Sennrich, R.; Haddow, B.; and Birch, A. 2015a. Improving neural machine translation models with monolingual data. arXiv preprint arXiv:1511.06709.
 [Sennrich, Haddow, and Birch 2015b] Sennrich, R.; Haddow, B.; and Birch, A. 2015b. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909.
 [Shen et al. 2015] Shen, S.; Cheng, Y.; He, Z.; He, W.; Wu, H.; Sun, M.; and Liu, Y. 2015. Minimum risk training for neural machine translation. arXiv preprint arXiv:1512.02433.
 [Srivastava et al. 2014] Srivastava, N.; Hinton, G.; Krizhevsky, A.; Sutskever, I.; and Salakhutdinov, R. 2014. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15(1):1929–1958.
 [Vedantam, Lawrence Zitnick, and Parikh 2015] Vedantam, R.; Lawrence Zitnick, C.; and Parikh, D. 2015. Cider: Consensusbased image description evaluation. In Proceedings of the IEEE conference on computer vision and pattern recognition, 4566–4575.
 [Williams 1992] Williams, R. J. 1992. Simple statistical gradientfollowing algorithms for connectionist reinforcement learning. Machine learning 8(34):229–256.
 [Wiseman and Rush 2016] Wiseman, S., and Rush, A. M. 2016. Sequencetosequence learning as beamsearch optimization. arXiv preprint arXiv:1606.02960.
 [Wu et al. 2016a] Wu, C.; Karanasou, P.; Gales, M. J.; and Sim, K. C. 2016a. Stimulated deep neural network for speech recognition. Technical report, University of Cambridge Cambridge.
 [Wu et al. 2016b] Wu, Y.; Schuster, M.; Chen, Z.; Le, Q. V.; Norouzi, M.; Macherey, W.; Krikun, M.; Cao, Y.; Gao, Q.; Macherey, K.; et al. 2016b. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.
 [Xu et al. 2015] Xu, K.; Ba, J.; Kiros, R.; Cho, K.; Courville, A.; Salakhudinov, R.; Zemel, R.; and Bengio, Y. 2015. Show, attend and tell: Neural image caption generation with visual attention. In International Conference on Machine Learning, 2048–2057.
 [Zhang and Zong 2016] Zhang, J., and Zong, C. 2016. Exploiting sourceside monolingual data in neural machine translation. In EMNLP, 1535–1545.