Improving Sequence-to-Sequence Learning
via Optimal Transport
Abstract
Sequence-to-sequence models are commonly trained via maximum likelihood estimation (MLE). However, standard MLE training considers a word-level objective, predicting the next word given the previous ground-truth partial sentence. This procedure focuses on modeling local syntactic patterns, and may fail to capture long-range semantic structure. We present a novel solution to alleviate these issues. Our approach imposes global sequence-level guidance via new supervision based on optimal transport, enabling the overall characterization and preservation of semantic features. We further show that this method can be understood as a Wasserstein gradient flow trying to match our model to the ground-truth sequence distribution. Extensive experiments are conducted to validate the utility of the proposed approach, showing consistent improvements over a wide variety of NLP tasks, including machine translation, abstractive text summarization, and image captioning.
Liqun Chen, Yizhe Zhang, Ruiyi Zhang, Chenyang Tao, Zhe Gan, Haichao Zhang, 

Bai Li, Dinghan Shen, Changyou Chen, Lawrence Carin 
Duke University, Microsoft Research, Microsoft Dynamics 365 AI Research 
Baidu Research, SUNY at Buffalo 
{liqun.chen}@duke.edu 
1 Introduction
Sequence-to-sequence (Seq2Seq) models are widely used in various natural language processing tasks, such as machine translation (Sutskever et al., 2014, Cho et al., 2014, Bahdanau et al., 2015), text summarization (Rush et al., 2015, Chopra et al., 2016) and image captioning (Vinyals et al., 2015, Xu et al., 2015). Typically, Seq2Seq models are based on an encoder-decoder architecture, with an encoder mapping a source sequence into a latent vector, and a decoder translating the latent vector into a target sequence. The goal of a Seq2Seq model is to optimize this encoder-decoder network to generate sequences close to the target. Therefore, a proper measure of the distance between sequences is crucial for model training.
Maximum likelihood estimation (MLE) is often used as the training paradigm in existing Seq2Seq models (Goodfellow et al., 2016, Lamb et al., 2016). The MLE-based approach maximizes the likelihood of the next word conditioned on its previous ground-truth words. Such an approach adopts cross-entropy loss as the objective, essentially measuring the word difference at each position of the target sequence (assuming truth for the preceding words). That is, MLE only provides a word-level training loss (Ranzato et al., 2016). Consequently, MLE-based methods suffer from the so-called exposure bias problem (Bengio et al., 2015, Ranzato et al., 2016), i.e., the discrepancy between training and inference stages. During inference, each word is generated sequentially based on previously generated words. However, ground-truth words are used in each time step during training (Huszár, 2015, Wiseman & Rush, 2016). Such discrepancy between training and testing leads to accumulated errors along the sequence-generation trajectory, and may therefore produce unstable results in practice. Further, commonly used metrics for evaluating the generated sentences at test time are sequence-level, such as BLEU (Papineni et al., 2002) and ROUGE (Lin, 2004). This also indicates a mismatch between the training loss and test-time evaluation metrics.
Attempts have been made to alleviate the above issues, via a sequence-level training loss that enables comparisons between the entire generated and reference sequences. Such efforts roughly fall into two categories: (i) reinforcement-learning-based (RL) methods (Ranzato et al., 2016, Bahdanau et al., 2017) and (ii) adversarial-learning-based methods (Yu et al., 2017, Zhang et al., 2017). These methods overcome the exposure bias issue by criticizing model output during training; however, both schemes have their own vulnerabilities. RL methods often suffer from large variance in policy-gradient estimation, and control variates and carefully designed baselines (such as a self-critic) are needed to make RL training more robust (Rennie et al., 2017, Liu et al., 2018). Further, the rewards used in RL training are often criticized as a bad proxy for human evaluation, as they are usually highly biased towards certain particular aspects (Wang et al., 2018b). On the other hand, adversarial supervision relies on the delicate balance of a minimax game, which can be easily undermined by mode-trapping and gradient-vanishing problems (Arjovsky et al., 2017, Zhang et al., 2017). Sophisticated tuning is often required for successful adversarial training.
We present a novel Seq2Seq learning scheme that leverages optimal transport (OT) to construct a sequence-level loss. Specifically, the OT objective aims to find an optimal matching of similar words/phrases between two sequences, providing a way to promote their semantic similarity (Kusner et al., 2015). Compared with the above RL and adversarial schemes, our approach has: (i) semantic invariance, allowing better preservation of sequence-level semantic information; and (ii) improved robustness, since neither the REINFORCE gradient nor a minimax game is involved. The OT loss allows end-to-end supervised training and acts as an effective sequence-level regularizer to the MLE loss.
Another novel strategy distinguishing our model from previous approaches is that during training we consider not only the OT distance between the generated sentence and ground-truth references, but also the OT distance between the generated sentence and its corresponding input. This enables our model to simultaneously match the generated output sentence with both the source sentence(s) and the target reference sentence, thus encouraging the generator to leverage information contained in the input sentence(s) during generation.
The main contributions of this paper are summarized as follows. (i) A new sequence-level training algorithm based on optimal transport is proposed for Seq2Seq learning. In practice, the OT distance is introduced as a regularization term to the MLE training loss. (ii) Our model can be interpreted as approximate Wasserstein gradient flows, learning to approximately match the sequence distribution induced by the generator and a target data distribution. (iii) In order to demonstrate the versatility of the proposed method, we conduct extensive empirical evaluations on three tasks: machine translation, text summarization, and image captioning.
2 Semantic Matching with Optimal Transport
We consider two components of a sentence: its syntactic and semantic parts. In a Seq2Seq model, it is often desirable to keep the semantic meaning, while the syntactic part can be more flexible. Conventional training schemes, such as MLE, are known to be well-suited for capturing the syntactic structure. As such, we focus on the semantic part. An intuitive way to assess semantic similarity is to directly match the "key words" between the synthesized and the reference sequences. Consider the respective sequences as sets $\mathcal{A}$ and $\mathcal{B}$, with vocabulary items as their elements. Then the matching can be evaluated by $\mu_c(\mathcal{A}\cap\mathcal{B})$, where $\mu_c$ is the counting measure for sets. We call this hard matching, as it seeks to exactly match words from both sequences.
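As a minimal illustration of hard matching (the helper name is hypothetical), the score is simply the size of the vocabulary overlap between the two sequences:

```python
def hard_match_score(seq_a, seq_b):
    """Hard matching: count vocabulary items shared by the two sequences.

    Each sequence is treated as a set, so word order and repetition are
    ignored; only exact lexical overlap contributes to the score.
    """
    return len(set(seq_a) & set(seq_b))

ref = "the cat sat on the mat".split()
hyp = "a cat lay on a rug".split()
score = hard_match_score(ref, hyp)  # shared items: {"cat", "on"}
```

This makes the limitation discussed next concrete: "lay" and "sat" contribute nothing, even though they are semantically close.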
For language models, the above hard matching could be an oversimplification. This is because words carry semantic meaning, and two different words can be close to each other in the semantic space. To account for such ambiguity, we can relax hard matching to soft bipartite matching (SBM). More specifically, assuming all sequences have the same length $n$, we pair words $a_i\in\mathcal{A}$ and $b_{\sigma(i)}\in\mathcal{B}$ for $i=1,\dots,n$, such that $\sigma(1),\dots,\sigma(n)$ are unique and $\sum_{i=1}^{n} c(a_i, b_{\sigma(i)})$ is minimized. Here $c(\cdot,\cdot)$ is a cost function measuring the semantic dissimilarity between two words. For instance, the cosine distance between two word embedding vectors, $c(a,b)=1-\frac{\mathbf{v}_a^\top \mathbf{v}_b}{\|\mathbf{v}_a\|\,\|\mathbf{v}_b\|}$, is a popular choice (Pennington et al., 2014). This minimization can be solved exactly, in $\mathcal{O}(n^3)$ time, via the Hungarian algorithm (Kuhn, 1955). Unfortunately, this complexity scales badly for common NLP tasks, and the objective is also non-differentiable w.r.t. model parameters. As such, end-to-end supervised training is not feasible with the Hungarian matching scheme. To overcome this difficulty, we propose to further relax the matching criteria while keeping the favorable features of a semantic bipartite matching. OT arises as a natural candidate.
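For illustration, the soft bipartite matching above can be sketched with SciPy's Hungarian solver; the function name and the cosine-distance cost are assumptions for this sketch, not the paper's implementation:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def soft_bipartite_match(emb_a, emb_b):
    """Exact soft bipartite matching via the Hungarian algorithm, O(n^3).

    emb_a, emb_b: (n, d) word-embedding matrices for two length-n
    sequences.  Returns the permutation sigma and the total cost
    sum_i c(a_i, b_sigma(i)) under the cosine-distance cost.
    """
    a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    cost = 1.0 - a @ b.T               # cosine-distance cost matrix C_ij
    rows, cols = linear_sum_assignment(cost)
    return cols, cost[rows, cols].sum()
```

Note the two drawbacks named above: the cubic cost of `linear_sum_assignment`, and the fact that the discrete permutation it returns gives no gradient back to the embeddings.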
2.1 Optimal transport and Wasserstein distance
We first provide a brief review of optimal transport, which defines distances between probability measures on a domain $\mathcal{X}$ (the sequence space in our setting). The optimal transport distance between two probability measures $\mu$ and $\nu$ is defined as (Peyré et al., 2017):

$$\mathcal{D}_c(\mu,\nu)=\inf_{\gamma\in\Pi(\mu,\nu)}\ \mathbb{E}_{(x,y)\sim\gamma}\,[\,c(x,y)\,] \qquad (1)$$

where $\Pi(\mu,\nu)$ denotes the set of all joint distributions $\gamma(x,y)$ with marginals $\mu(x)$ and $\nu(y)$; $c(x,y)$ is the cost function for moving $x$ to $y$, e.g., the Euclidean or cosine distance. Intuitively, the optimal transport distance is the minimum cost that $\gamma$ induces in order to transport from $\mu$ to $\nu$. When $c(x,y)$ is a metric on $\mathcal{X}$, $\mathcal{D}_c(\mu,\nu)$ induces a proper metric on the space of probability distributions supported on $\mathcal{X}$, commonly known as the Wasserstein distance (Villani, 2008). One of the most popular choices is the 2-Wasserstein distance, where the squared Euclidean distance $c(x,y)=\|x-y\|_2^2$ is used as the cost.
OT distance on discrete domains
We mainly focus on applying the OT distance to textual data, and therefore only consider OT between discrete distributions. Specifically, consider two discrete distributions $\mu,\nu\in\mathcal{P}(\mathcal{X})$, which can be written as $\mu=\sum_{i=1}^{n}u_i\,\delta_{x_i}$ and $\nu=\sum_{j=1}^{m}v_j\,\delta_{y_j}$, with $\delta_x$ the Dirac function centered on $x$. The weight vectors $\mathbf{u}=(u_1,\dots,u_n)$ and $\mathbf{v}=(v_1,\dots,v_m)$ respectively belong to the $n$- and $m$-dimensional simplex, i.e., $\sum_{i=1}^{n}u_i=\sum_{j=1}^{m}v_j=1$, as both $\mu$ and $\nu$ are probability distributions. Under such a setting, computing the OT distance as defined in (1) is equivalent to solving the following network-flow problem (Luise et al., 2018):

$$\mathcal{D}_c(\mu,\nu)=\min_{\mathbf{T}\in\Pi(\mathbf{u},\mathbf{v})}\ \langle\mathbf{T},\mathbf{C}\rangle \qquad (2)$$

where $\Pi(\mathbf{u},\mathbf{v})=\{\mathbf{T}\in\mathbb{R}_{+}^{n\times m}:\mathbf{T}\mathbf{1}_m=\mathbf{u},\ \mathbf{T}^\top\mathbf{1}_n=\mathbf{v}\}$, $\mathbf{1}_n$ denotes an $n$-dimensional all-one vector, $\mathbf{C}$ is the cost matrix given by $C_{ij}=c(x_i,y_j)$, and $\langle\mathbf{T},\mathbf{C}\rangle=\mathrm{Tr}(\mathbf{T}^\top\mathbf{C})$ represents the Frobenius dot-product. We refer to the minimizer $\mathbf{T}^*$ of (2) as the OT matching. Comparing the two objectives, one can readily recognize that soft bipartite matching represents a special constrained solution to (2), where, for uniform weights $u_i=v_j=\frac{1}{n}$, each row of $\mathbf{T}$ can only take values in $\{\frac{1}{n}\mathbf{e}_j\}_{j=1}^{m}$ instead of ranging over the full polytope $\Pi(\mathbf{u},\mathbf{v})$; that is, $\|\mathbf{T}_{i\cdot}\|_0=1$, where $\|\cdot\|_0$ is the $\ell_0$ norm and $\mathbf{e}_j$ is the unit vector along the $j$-th axis. As such, OT matching can be regarded as a relaxed version of soft bipartite matching. In Figure 1 we illustrate the three matching schemes discussed above.
The IPOT algorithm
Unfortunately, the exact minimization over $\mathbf{T}$ is in general computationally intractable (Arjovsky et al., 2017, Genevay et al., 2018, Salimans et al., 2018). To overcome such intractability, we consider an efficient iterative approach to approximate the OT distance. We propose to use the recently introduced Inexact Proximal point method for Optimal Transport (IPOT) algorithm to compute the OT matrix $\mathbf{T}^*$, and thus also the OT distance (Xie et al., 2018). IPOT provides a solution to the original OT problem specified in (2). Specifically, IPOT iteratively solves the following optimization problem using the proximal point method (Boyd & Vandenberghe, 2004):

$$\mathbf{T}^{(t+1)}=\operatorname*{arg\,min}_{\mathbf{T}\in\Pi(\mathbf{u},\mathbf{v})}\ \Big\{\langle\mathbf{T},\mathbf{C}\rangle+\beta\cdot B\big(\mathbf{T},\mathbf{T}^{(t)}\big)\Big\} \qquad (3)$$

where the proximity metric term $B(\mathbf{T},\mathbf{T}^{(t)})$ penalizes solutions that are too distant from the latest approximation, and $1/\beta$ is understood as the generalized step size. This renders a tractable iterative scheme converging towards the exact OT solution. In this work, we employ the generalized KL Bregman divergence $B(\mathbf{T},\mathbf{T}')=\sum_{i,j}T_{ij}\log\frac{T_{ij}}{T'_{ij}}-\sum_{i,j}T_{ij}+\sum_{i,j}T'_{ij}$ as the proximity metric. Algorithm 1 describes the implementation details for IPOT.
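A minimal NumPy sketch of the IPOT iteration described above (following the structure of Algorithm 1 in Xie et al., 2018; the default values for `beta`, `n_iter`, and the single inner Sinkhorn sweep are illustrative assumptions):

```python
import numpy as np

def ipot(u, v, C, beta=1.0, n_iter=50, inner=1):
    """IPOT sketch: proximal point iterations for the OT matrix T.

    u: (n,) and v: (m,) marginal weight vectors (each sums to 1);
    C: (n, m) cost matrix; 1/beta is the generalized step size.
    Each outer step solves the proximal problem in (3) inexactly,
    using `inner` Sinkhorn-style sweeps.
    """
    n, m = C.shape
    G = np.exp(-C / beta)          # Gibbs kernel from the KL proximity term
    T = np.ones((n, m)) / (n * m)
    b = np.ones(m)
    for _ in range(n_iter):
        Q = G * T                  # elementwise: current proximal kernel
        for _ in range(inner):
            a = u / (Q @ b)
            b = v / (Q.T @ a)
        T = a[:, None] * Q * b[None, :]
    return T

# The OT distance is then the Frobenius dot-product <T, C>:
# d = (ipot(u, v, C) * C).sum()
```

Unlike Sinkhorn, the returned plan converges to the unregularized OT solution rather than an entropy-smoothed one.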
Note that the Sinkhorn algorithm (Cuturi, 2013) can also be used to compute the OT matrix. Specifically, the Sinkhorn algorithm solves the entropy-regularized optimization problem $\min_{\mathbf{T}\in\Pi(\mathbf{u},\mathbf{v})}\ \langle\mathbf{T},\mathbf{C}\rangle-\epsilon H(\mathbf{T})$, where $H(\mathbf{T})=-\sum_{i,j}T_{ij}\log T_{ij}$ is the entropy regularization term and $\epsilon>0$ is the regularization strength. However, in our experiments we empirically found that the numerical stability and performance of the Sinkhorn algorithm are quite sensitive to the choice of the hyperparameter $\epsilon$; thus only IPOT is considered in our model training.
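For comparison, a minimal Sinkhorn sketch for the entropy-regularized problem; note how the kernel `exp(-C/eps)` underflows for small `eps`, which is the numerical sensitivity mentioned above:

```python
import numpy as np

def sinkhorn(u, v, C, eps=0.1, n_iter=200):
    """Sinkhorn iterations for entropy-regularized OT (Cuturi, 2013).

    Approximately solves min_{T in Pi(u,v)} <T, C> - eps * H(T).
    Smaller eps yields a sharper, closer-to-exact plan, but exp(-C/eps)
    under/overflows numerically, so the result is sensitive to eps.
    """
    K = np.exp(-C / eps)
    a = np.ones_like(u)
    b = np.ones_like(v)
    for _ in range(n_iter):
        b = v / (K.T @ a)
        a = u / (K @ b)
    return a[:, None] * K * b[None, :]
```

Setting `eps=1e-3` on a cost matrix with entries near 1 already drives `K` to zero in float64, illustrating why IPOT's fixed-kernel proximal scheme is preferred here.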
2.2 Optimal transport distance as a sequence-level loss
Figure 2 illustrates how OT is computed to construct the sequence-level loss. Given two sentences, we can construct their word-level or phrase-level embedding matrices $\mathbf{S}$ and $\mathbf{S}'$, where $\mathbf{S}$ is usually recognized as the reference sequence embedding and $\mathbf{S}'$ as the model output sequence embedding. The cost matrix $\mathbf{C}$ is then computed from the pairwise distances between the rows of $\mathbf{S}$ and $\mathbf{S}'$, and passed on to the IPOT algorithm to obtain the OT distance. Our full algorithm is summarized in Algorithm 2, and more detailed model specifications are given below.
Encoding model belief with a differentiable sequence generator
We first describe how to design a differentiable sequence generator, so that gradients can be back-propagated from the OT losses to update the model belief. The Long Short-Term Memory (LSTM) recurrent neural network (Hochreiter & Schmidhuber, 1997) is used as our sequence model. At each time step $t$, the LSTM decoder outputs a logit vector $\mathbf{z}_t$ over the vocabulary, based on its context. Directly sampling a token $\tilde{y}_t$ from the multinomial distribution $\text{softmax}(\mathbf{z}_t)$ is a non-differentiable operation,^1 so we consider the following differentiable alternatives:

^1 Here $\tilde{y}_t$ is understood as a one-hot vector, in order to be notationally consistent with its differentiable alternatives.

Soft-argmax: $\tilde{y}_t=\text{softmax}(\mathbf{z}_t/\tau)$, where $\tau>0$ is the annealing parameter (Zhang et al., 2017). This approximates the deterministic sampling scheme $\tilde{y}_t=\operatorname{one\_hot}\big(\operatorname{arg\,max}_w [\mathbf{z}_t]_w\big)$;

Gumbel-softmax (GS): $\tilde{y}_t=\text{softmax}\big((\mathbf{z}_t+\mathbf{g}_t)/\tau\big)$, where the entries of $\mathbf{g}_t$ are i.i.d. Gumbel noise. This approximates stochastic sampling from $\text{softmax}(\mathbf{z}_t)$.

Unstable training and suboptimal solutions have been observed for the GS-based scheme on the Seq2Seq tasks we considered (see Appendix G, Table 11), possibly due to the extra uncertainty introduced. As such, we will assume the use of soft-argmax to encode the model belief unless otherwise specified. Note that $\tilde{y}_t$ is a normalized non-negative vector that sums to one.
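A small NumPy sketch of the soft-argmax relaxation (the function name is hypothetical):

```python
import numpy as np

def soft_argmax(logits, tau=0.1):
    """Soft-argmax relaxation: softmax(logits / tau).

    As tau -> 0 this approaches the one-hot argmax of the logits,
    while remaining differentiable in them (tau is the annealing
    parameter).  Logits are shifted by their max for stability.
    """
    z = logits / tau
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 1.0, -1.0])
y_soft = soft_argmax(logits, tau=0.1)  # nearly one-hot at index 0
```

With a larger `tau` the output spreads probability mass over several tokens, which is exactly the smoothed "model belief" fed into the OT loss below.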
Sequence-level OT-matching loss
To pass the model belief on to the OT loss, we use the expected word embedding predicted by the model, given by $\mathbf{s}'_t=\mathbf{W}_e^\top\tilde{y}_t$, where $\mathbf{W}_e\in\mathbb{R}^{V\times d}$ is the word embedding matrix, $V$ is the vocabulary size and $d$ is the dimension of the embedding vectors. We collect the predicted sequence embeddings into $\mathbf{S}'=[\mathbf{s}'_1,\dots,\mathbf{s}'_m]$, where $m$ is the length of the generated sequence. Similarly, we denote the reference sequence embeddings as $\mathbf{S}=[\mathbf{s}_1,\dots,\mathbf{s}_n]$, computed from the ground-truth one-hot input token sequence. Based on the sequence embeddings $\mathbf{S}$ and $\mathbf{S}'$, we can compute the sequence-level OT loss between the ground truth and the model prediction using the IPOT algorithm described above for different Seq2Seq tasks:

$$\mathcal{L}_{\text{ot}}=\mathcal{D}_c(\mathbf{S},\mathbf{S}') \qquad (4)$$
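Putting the pieces together, a toy sketch of this sequence-level OT loss (function names are hypothetical). For uniform marginals of equal length the exact OT plan is a permutation, so SciPy's Hungarian solver stands in for IPOT here:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def sequence_ot_loss(logits, ref_onehot, W_e):
    """Sequence-level OT loss sketch.

    logits: (n, V) decoder logits; ref_onehot: (n, V) ground-truth
    tokens; W_e: (V, d) word-embedding matrix.  The model side uses
    belief-weighted mean embeddings S' = softmax(logits) @ W_e; the
    reference side uses the one-hot rows directly.  The cost is the
    pairwise cosine distance; with uniform equal-length marginals the
    optimal plan is a permutation, found exactly by the Hungarian solver.
    """
    S_pred = softmax(logits) @ W_e           # (n, d) predicted embeddings
    S_ref = ref_onehot @ W_e                 # (n, d) reference embeddings
    sp = S_pred / np.linalg.norm(S_pred, axis=1, keepdims=True)
    sr = S_ref / np.linalg.norm(S_ref, axis=1, keepdims=True)
    C = 1.0 - sr @ sp.T                      # cosine-distance cost matrix
    r, c = linear_sum_assignment(C)
    return C[r, c].mean()                    # <T*, C> with T* = perm / n
```

Because the loss depends on token positions only through the transport plan, reordering the predicted tokens leaves it unchanged, which is precisely the order-invariance that makes OT a semantic (not syntactic) criterion.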
Soft-copying mechanism
We additionally consider feature matching using the OT criterion between the source and the target. Intuitively, this encourages the global semantic meaning to be preserved from source to target. It is related to the copy network (Gu et al., 2016); however, in our framework the copying mechanism can be understood as soft, optimal-transport-based copying, instead of the hard retrieval-based copying used by Gu et al. (2016). This soft-copying mechanism considers semantic similarity in the embedding space, and thus presumably delivers a smoother transfer of information. In the case where the source and target sequences do not share a vocabulary (e.g., machine translation), this objective can still be applied by sharing the word embedding space between source and target. Ideally, the embeddings of the same concept in different languages will automatically be aligned by optimizing such a loss, making a cosine-similarity-based cost matrix available. This is also related to bilingual skip-gram (Luong et al., 2015b). We denote this loss as $\mathcal{L}_{\text{copy}}=\mathcal{D}_c(\mathbf{S}^{\text{src}},\mathbf{S}')$, where $\mathbf{S}^{\text{src}}$ represents the source sequence embeddings.
Complementing MLE training with OT regularization
The OT training objectives discussed above cannot train a proper language model on their own, as they do not explicitly consider word ordering, i.e., the syntactic structure of a language model. To overcome this issue, we propose to combine the OT loss with the de facto likelihood loss $\mathcal{L}_{\text{mle}}$, which gives the final training objective $\mathcal{L}=\mathcal{L}_{\text{mle}}+\gamma\,\mathcal{L}_{\text{ot}}$, where $\gamma>0$ is a hyperparameter to be tuned. For tasks with both input and output sentences, such as machine translation and text summarization, $\mathcal{L}_{\text{copy}}$ can also be applied, in which case the final objective can be written as $\mathcal{L}=\mathcal{L}_{\text{mle}}+\gamma\,(\mathcal{L}_{\text{ot}}+\mathcal{L}_{\text{copy}})$.
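The combined objective can be sketched as follows (the function name and the default `gamma` are illustrative assumptions; in practice the three terms would be differentiable tensors rather than floats):

```python
def total_loss(loss_mle, loss_ot_target, loss_ot_source=None, gamma=0.1):
    """Combined training objective sketch.

    L = L_mle + gamma * L_ot                      (e.g., image captioning)
    L = L_mle + gamma * (L_ot + L_copy)           (translation, summarization)

    gamma weights the sequence-level OT regularizers against the
    word-level MLE loss; loss_ot_source is the optional soft-copying term.
    """
    loss = loss_mle + gamma * loss_ot_target
    if loss_ot_source is not None:
        loss += gamma * loss_ot_source
    return loss
```

Since the OT terms enter as plain additive regularizers, any optimizer already used for MLE training applies unchanged.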
3 Interpretation as Approximate Wasserstein Gradient Flows
To further justify the use of our approach (minimizing the loss $\mathcal{L}=\mathcal{L}_{\text{mle}}+\gamma\,\mathcal{L}_{\text{ot}}$, where $\mathcal{L}_{\text{ot}}$ denotes the Wasserstein loss), we now explain how our model approximately learns to match the ground-truth sequence distribution. Our derivation is based on the theory of Wasserstein gradient flows (WGF) (Villani, 2008). In WGF, the Wasserstein distance describes the local geometry of a trajectory in the space of probability measures converging to a target distribution (Ambrosio et al., 2005). In the following, we show that the proposed method learns to approximately match the data distribution, from the perspective of WGF. For simplicity we only discuss the continuous case; a similar argument also holds for the discrete case (Li & Montufar, 2018).
We denote the induced distribution of the sequences generated from the decoder at the $k$-th iteration as $\rho_k$. Assume the sequence data distribution is given by $\mu$. Intuitively, the optimal generator in a Seq2Seq model learns a distribution that matches $\mu$. Based on Craig (2014), this can be achieved by composing a sequence of discretized WGFs given by:

$$\rho_{k+1}=J_h(\rho_k),\qquad k=0,1,2,\dots \qquad (5)$$

with $J_h$ defined as

$$J_h(\rho)\ \triangleq\ \operatorname*{arg\,min}_{\rho'\in\mathcal{P}_2(\mathcal{X})}\Big\{\mathrm{KL}(\rho'\,\|\,\mu)+\frac{1}{2h}W_2^2(\rho',\rho)\Big\} \qquad (6)$$

where $h>0$ is a regularization parameter ($h$ can be understood as a generalized learning rate); $W_2(\rho',\rho)$ denotes the 2-Wasserstein distance between $\rho'$ and $\rho$; $\mathcal{P}_2(\mathcal{X})$ is the space of distributions with finite second-order moments; and $\mathrm{KL}(\cdot\,\|\,\cdot)$ is the Kullback-Leibler (KL) divergence. It is not difficult to see that the discretized WGF is essentially optimizing the KL divergence with a proximal descent scheme, using the Wasserstein distance as the proximity metric.
We denote by $\rho_k^h$ the iterate obtained after $k$ steps of (5) with generalized learning rate $h$. It is well known that $\rho_k^h\to\mu$ as $k\to\infty$ (Chen et al., 2018a); that is to say, the induced model distribution asymptotically converges to the data distribution $\mu$. In our case, instead of using $\mathrm{KL}(\rho\,\|\,\mu)$ as the loss function, we define a surrogate loss using its upper bound $\mathcal{F}(\rho)\triangleq\mathrm{KL}(\rho\,\|\,\mu)+\frac{1}{2h}W_2^2(\rho,\rho_k)\,\ge\,\mathrm{KL}(\rho\,\|\,\mu)$, where the inequality holds since $W_2^2\ge 0$, and the bound tightens as (6) converges. When our model distribution $\rho_\theta$ is parameterized by $\theta$, (6) can be solved with stochastic updates on $\theta$ based on the following equation with step size $\eta$:

$$\theta_{k+1}=\theta_k-\eta\,\nabla_\theta\Big\{\mathrm{KL}(\rho_\theta\,\|\,\mu)+\frac{1}{2h}W_2^2(\rho_\theta,\rho_{\theta_k})\Big\} \qquad (7)$$
Unfortunately, (7) is an infeasible update, as we do not know $\mu$. However, we argue that this update is still locally valid when the current model approximation $\rho_\theta$ is close to $\mu$. To see this, recall that the KL divergence induces a natural Riemannian metric on the space of probability measures (Amari, 1985), and is therefore locally symmetric. So we can safely replace the term $\mathrm{KL}(\rho_\theta\,\|\,\mu)$ with $\mathrm{KL}(\mu\,\|\,\rho_\theta)$ when $\rho_\theta$ is close to $\mu$. This recovers the loss function derived in Section 2.2, as $\mathrm{KL}(\mu\,\|\,\rho_\theta)=\mathcal{L}_{\text{mle}}(\theta)-H(\mu)$, where $H(\mu)$ is the entropy of $\mu$, independent of $\theta$, and the proximity term $\frac{1}{2h}W_2^2(\rho_\theta,\rho_{\theta_k})$ plays the role of $\gamma\,\mathcal{L}_{\text{ot}}$. This justifies the use of our proposed scheme in a model-refinement stage, where the model distribution is sufficiently close to $\mu$. Empirically, we have observed that our scheme also improves training even when $\rho_\theta$ is distant from $\mu$. While the above justification is developed based on Euclidean transport, other non-Euclidean costs such as the cosine distance usually yield better empirical performance, as they are better adjusted to the geometry of sequence data.
4 Related Work and Discussion
Optimal transport in NLP
Although widely used in other fields such as computer vision (Rubner et al., 2000), OT has only recently been applied in NLP. Pioneered by the work of Kusner et al. (2015) on the word mover's distance (WMD), existing literature primarily considers OT either on a macroscopic level, such as topic modeling (Huang et al., 2016), or on a microscopic level, such as word embedding (Xu et al., 2018). The Euclidean distance, rather than other more general distances, is often used as the transportation cost, in order to approximate the OT distance with the Kantorovich-Rubinstein duality (Gulrajani et al., 2017) or a more efficient yet less accurate lower bound (Kusner et al., 2015). Our work employs OT for mesoscopic sequence-to-sequence models, presenting an efficient IPOT-based implementation to enable end-to-end learning for general cost functions. The proposed OT loss not only refines the word embedding matrix but also improves the Seq2Seq model (see Appendix H for details).
RL for sequence generation
A commonly employed strategy for sequence-level training is reinforcement learning (RL). Typically, this type of method employs RL by using the evaluation metrics as the reward to guide generation (Ranzato et al., 2016, Bahdanau et al., 2017, Rennie et al., 2017, Zhang et al., 2018a, Huang et al., 2018). However, these approaches often introduce procedures that may yield large-variance gradients, resulting in unstable training. Moreover, it has been recognized that these automatic metrics may correlate poorly with human judgments in many scenarios (Wang et al., 2018b). As such, reinforcing the evaluation metrics can potentially boost the quantitative scores without necessarily improving the generation quality, as such metrics usually encourage exact overlap of text snippets rather than semantic similarity. Some non-standard metrics like SPICE (Anderson et al., 2016) also consider semantic similarity; however, they too cannot train a good model on their own (Liu et al., 2017). Unlike RL methods, our method requires no human-defined rewards, preventing the model from overfitting to one specific metric. As a concrete example, the two semantically similar sentences "do you want to have lunch with us" and "would you like to join us for lunch" would be considered a bad match by automatic metrics like BLEU, but a reasonable match under the OT objective.
GAN for sequence generation
Another type of method adopts the framework of generative adversarial networks (GANs) (Goodfellow et al., 2014), providing sequence-level guidance based on a learned discriminator (or critic). To construct such a loss, Yu et al. (2017), Lin et al. (2017), Guo et al. (2018) and Fedus et al. (2018) combine the policy-gradient algorithm with the original GAN training procedure, while Zhang et al. (2017) and Chen et al. (2018b) use maximum mean discrepancy (MMD) and a so-called feature-mover's distance, respectively, to match features of real and generated sentences. However, mode-collapse and gradient-vanishing problems make the training of these methods challenging. Unlike GAN methods, since no min-max game is involved, the training of our model is more robust. Moreover, compared with GANs, no additional critic is introduced in our model, which keeps the model complexity comparable to MLE and makes it less demanding to tune.
5 Experiments
We consider a wide range of NLP tasks to experimentally validate the proposed model, and benchmark it against other strong baselines. All experiments are implemented in TensorFlow and run on a single NVIDIA TITAN X GPU. Code for our experiments is available at https://github.com/LiqunChen0606/Seq2Seq-OT.
5.1 Neural machine translation
We test our model on two datasets: (i) a small-scale English-Vietnamese parallel corpus of TED talks, with 133K sentence pairs, from the IWSLT 2015 Evaluation Campaign (Cettolo et al., 2015); and (ii) a large-scale English-German parallel corpus with 4.5M sentence pairs, from the WMT Evaluation Campaign (Vaswani et al., 2017). We used Google's Neural Machine Translation (GNMT) model (Wu et al., 2016) as our baseline, following the architecture and hyperparameter settings from the GNMT repository^2 to make a fair comparison. For the English-Vietnamese (i.e., VI-EN and EN-VI) tasks, a 2-layer LSTM with 512 units in each layer is adopted as the decoder, with a 1-layer bidirectional LSTM as the encoder; the word embedding dimension is set to 512. The attention mechanism proposed in Luong et al. (2015a) is used, together with dropout. For the English-German (i.e., DE-EN and EN-DE) tasks, we train a 4-layer LSTM decoder with 1024 units in each layer. A 2-layer bidirectional LSTM is used as the encoder, and we adopt the attention used in Wu et al. (2016). The word embedding dimension is set to 1024. Standard stochastic gradient descent is used for training with a decreasing learning rate, and the IPOT step-size parameter $\beta$ is kept fixed throughout. More training details are provided in Appendix A. In terms of wall-clock time, our model only slightly increases training time: for the German-English task, it took roughly 5.5 days to train the GNMT model and 6 days to train our proposed model from scratch, which amounts to only a roughly 9% increase.

^2 https://github.com/tensorflow/nmt
Table 1: BLEU scores on the IWSLT 2015 English-Vietnamese translation tasks.

Systems | NT2012 | NT2013
VI-EN: GNMT | 20.7 | 23.8
VI-EN: GNMT + $\mathcal{L}_{\text{ot}}$ | 21.9 | 25.4
VI-EN: GNMT + $\mathcal{L}_{\text{ot}}$ + $\mathcal{L}_{\text{copy}}$ | 21.9 | 25.5
EN-VI: GNMT | 23.8 | 26.1
EN-VI: GNMT + $\mathcal{L}_{\text{ot}}$ | 24.4 | 26.5
EN-VI: GNMT + $\mathcal{L}_{\text{ot}}$ + $\mathcal{L}_{\text{copy}}$ | 24.5 | 26.9
Table 2: BLEU scores on the WMT English-German translation tasks.

Systems | NT2013 | NT2015
DE-EN: GNMT | 29.0 | 29.9
DE-EN: GNMT + $\mathcal{L}_{\text{ot}}$ | 29.1 | 29.9
DE-EN: GNMT + $\mathcal{L}_{\text{ot}}$ + $\mathcal{L}_{\text{copy}}$ | 29.2 | 30.1
EN-DE: GNMT | 24.3 | 26.5
EN-DE: GNMT + $\mathcal{L}_{\text{ot}}$ | 24.3 | 26.6
EN-DE: GNMT + $\mathcal{L}_{\text{ot}}$ + $\mathcal{L}_{\text{copy}}$ | 24.6 | 26.8
We apply different combinations of $\mathcal{L}_{\text{ot}}$ and $\mathcal{L}_{\text{copy}}$ to fine-tune the pretrained GNMT model (Luong et al., 2018), and the results are summarized in Tables 1 and 2. Additional results for training from scratch are provided in Appendix B. The proposed OT approach consistently improves upon MLE training in all experimental setups.
We additionally tested our model with a more expressive, deeper LSTM model on the EN-DE task. On NT2015, our method outperforms both the GNMT model with the same architecture and a Transformer model (Vaswani et al., 2017), and is also competitive with the state-of-the-art BLEU score reported by Vaswani et al. (2017), which uses a highly sophisticated model design.
German-to-English translation examples are provided in Table 3 for qualitative assessment. The main differences among the reference translation, our translation, and the GNMT translation are highlighted in blue and red. Our OT-augmented translations are more faithful to the reference than their MLE-trained counterparts. The soft-copying mechanism introduced by OT successfully maintains the key semantic content of the reference. Presumably, the OT loss helps refine the word embedding matrices and promotes matching between words with similar semantic meanings. Vanilla GNMT translations, on the other hand, ignore or misinterpret some of the key terms. More examples are provided in Appendix E.
We also test the robustness of our method w.r.t. the hyperparameter $\gamma$. Results are summarized in Appendix C. Our OT-augmented model is robust to the choice of $\gamma$: the test BLEU scores are consistently higher than the baseline over the range of $\gamma$ considered.
5.2 Abstractive text summarization
We consider two datasets for abstractive text summarization. The first is the Gigaword corpus (Graff et al., 2003), which has around 3.8M training samples, together with held-out validation and test sets. The input pairs consist of the first sentence and the headline of an article. We also evaluate our model on the DUC-2004 test set (Over et al., 2007), which consists of 500 news articles. Our implementation of the Seq2Seq model adopts a simple architecture, consisting of a bidirectional GRU encoder and a GRU decoder with an attention mechanism (Bahdanau et al., 2015).^3

^3 https://github.com/thunlp/TensorFlow-Summarization
Results are summarized in Tables 4 and 5. Our OT-regularized models outperform their respective baselines. The state-of-the-art ROUGE results on the Gigaword dataset are reported by Wang et al. (2018a); however, a much more complex architecture is used to achieve those scores. We use a relatively simple Seq2Seq model in our experiments to demonstrate the versatility of the proposed OT method. Applying it to (i) more complicated models and (ii) more recent datasets such as CNN/DailyMail (See et al., 2017) is interesting future work.
Summarization examples are provided in Appendix D. Similar to the machine translation task, our proposed method captures the key semantic information in both the source and reference sentences.
Table 4: ROUGE scores on the Gigaword test set.

Systems | ROUGE-1 | ROUGE-2 | ROUGE-L
Seq2Seq | 33.4 | 15.7 | 32.4
Seq2Seq + $\mathcal{L}_{\text{ot}}$ | 35.8 | 17.5 | 33.7
Seq2Seq + $\mathcal{L}_{\text{ot}}$ + $\mathcal{L}_{\text{copy}}$ | 36.2 | 18.1 | 34.0
Table 5: ROUGE scores on the DUC-2004 test set.

Systems | ROUGE-1 | ROUGE-2 | ROUGE-L
Seq2Seq | 28.0 | 9.4 | 24.8
Seq2Seq + $\mathcal{L}_{\text{ot}}$ | 29.5 | 9.8 | 25.5
Seq2Seq + $\mathcal{L}_{\text{ot}}$ + $\mathcal{L}_{\text{copy}}$ | 30.1 | 10.1 | 26.0
5.3 Image captioning
We also consider an image captioning task using the COCO dataset (Lin et al., 2014), which contains 123,287 images in total, each annotated with at least 5 captions. Following Karpathy's split (Karpathy & Fei-Fei, 2015), 113,287 images are used for training and 5,000 images each are used for validation and testing. We follow the implementation of Show, Attend and Tell (Xu et al., 2015),^4 and use ResNet-152 (He et al., 2016), image tagging (Gan et al., 2017), and Faster R-CNN (Anderson et al., 2018) as the image feature extractor (encoder), and a one-layer LSTM with 1024 units as the decoder. The word embedding dimension is set to 512. Note that in this task the inputs are images instead of sequences; therefore $\mathcal{L}_{\text{copy}}$ cannot be applied.

^4 https://github.com/DeepRNN/image_captioning
We report BLEU-$k$ ($k$ from 1 to 4) (Papineni et al., 2002), CIDEr (Vedantam et al., 2015), and METEOR (Banerjee & Lavie, 2005) scores; results under different settings are shown in Table 6. Consistent across-the-board improvements are observed with the introduction of the OT loss, in contrast to RL-based methods, where drastic improvements are typically observed only for the optimized evaluation metric (Rennie et al., 2017). Consequently, the OT loss is a more reliable way to improve the quality of generated captions than RL methods, which aim to optimize, and therefore potentially overfit, one specific metric. Examples of generated captions are provided in Appendix F.
6 Conclusion
This work is motivated by a major deficiency in training Seq2Seq models: the MLE training loss does not operate at the sequence level. Inspired by soft bipartite matching, we propose using optimal transport as a sequence-level loss to improve Seq2Seq learning. Applying this new method to machine translation, text summarization, and image captioning, we demonstrate that our proposed model improves performance over strong baselines. We believe the proposed method is a general framework that will also be useful for other sequence generation tasks, such as conversational response generation (Li et al., 2017, Zhang et al., 2018c), which is left as future work.
References
 Amari (1985) Shun-ichi Amari. Differential-geometrical methods in statistics, volume 28. Springer Science & Business Media, 1985.
 Ambrosio et al. (2005) Luigi Ambrosio, Nicola Gigli, and Giuseppe Savaré. Gradient flows: in metric spaces and in the space of probability measures. Birkhäuser, 2005.
 Anderson et al. (2016) Peter Anderson, Basura Fernando, Mark Johnson, and Stephen Gould. Spice: Semantic propositional image caption evaluation. In European Conference on Computer Vision, pp. 382–398. Springer, 2016.
 Anderson et al. (2018) Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. Bottomup and topdown attention for image captioning and visual question answering. In CVPR, 2018.
 Arjovsky et al. (2017) Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In ICML, 2017.
 Bahdanau et al. (2015) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. In ICLR, 2015.
 Bahdanau et al. (2017) Dzmitry Bahdanau, Philemon Brakel, Kelvin Xu, Anirudh Goyal, Ryan Lowe, Joelle Pineau, Aaron Courville, and Yoshua Bengio. An actorcritic algorithm for sequence prediction. In ICLR, 2017.
 Banerjee & Lavie (2005) Satanjeev Banerjee and Alon Lavie. Meteor: An automatic metric for mt evaluation with improved correlation with human judgments. In ACL Workshop, 2005.
 Bengio et al. (2015) Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. Scheduled sampling for sequence prediction with recurrent neural networks. In NIPS, 2015.
 Boyd & Vandenberghe (2004) Stephen Boyd and Lieven Vandenberghe. Convex optimization. Cambridge university press, 2004.
 Cettolo et al. (2015) Mauro Cettolo, Jan Niehues, Sebastian Stüker, Luisa Bentivogli, Roldano Cattoni, and Marcello Federico. The IWSLT 2015 evaluation campaign. In IWSLT 2015, International Workshop on Spoken Language Translation, 2015.
 Chen et al. (2018a) Changyou Chen, Ruiyi Zhang, Wenlin Wang, Bai Li, and Liqun Chen. A unified particle-optimization framework for scalable Bayesian sampling. In UAI, 2018a.
 Chen et al. (2018b) Liqun Chen, Shuyang Dai, Chenyang Tao, Haichao Zhang, Zhe Gan, Dinghan Shen, Yizhe Zhang, Guoyin Wang, Ruiyi Zhang, and Lawrence Carin. Adversarial text generation via feature-mover’s distance. In NeurIPS, 2018b.
 Cho et al. (2014) Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In EMNLP, 2014.
 Chopra et al. (2016) Sumit Chopra, Michael Auli, and Alexander M Rush. Abstractive sentence summarization with attentive recurrent neural networks. In NAACL, 2016.
 Craig (2014) Katy Craig. The exponential formula for the Wasserstein metric. PhD thesis, Rutgers, The State University of New Jersey, 2014.
 Cuturi (2013) Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In NIPS, 2013.
 Fedus et al. (2018) William Fedus, Ian Goodfellow, and Andrew M Dai. MaskGAN: Better text generation via filling in the ____. In ICLR, 2018.
 Gan et al. (2017) Zhe Gan, Chuang Gan, Xiaodong He, Yunchen Pu, Kenneth Tran, Jianfeng Gao, Lawrence Carin, and Li Deng. Semantic compositional networks for visual captioning. In CVPR, 2017.
 Genevay et al. (2018) Aude Genevay, Gabriel Peyré, and Marco Cuturi. Learning generative models with sinkhorn divergences. In AISTATS, 2018.
 Goodfellow et al. (2014) Ian Goodfellow, Jean PougetAbadie, Mehdi Mirza, Bing Xu, David WardeFarley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In NIPS, 2014.
 Goodfellow et al. (2016) Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep learning, volume 1. MIT Press, Cambridge, 2016.
 Graff et al. (2003) David Graff, Junbo Kong, Ke Chen, and Kazuaki Maeda. English gigaword. Linguistic Data Consortium, Philadelphia, 2003.
 Gu et al. (2016) Jiatao Gu, Zhengdong Lu, Hang Li, and Victor OK Li. Incorporating copying mechanism in sequencetosequence learning. In ACL, 2016.
 Gulrajani et al. (2017) Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved training of Wasserstein GANs. In NIPS, 2017.
 Guo et al. (2018) Jiaxian Guo, Sidi Lu, Han Cai, Weinan Zhang, Yong Yu, and Jun Wang. Long text generation via adversarial training with leaked information. In AAAI, 2018.
 He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
 Hochreiter & Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 1997.
 Huang et al. (2016) Gao Huang, Chuan Guo, Matt J Kusner, Yu Sun, Fei Sha, and Kilian Q Weinberger. Supervised word mover’s distance. In NIPS, 2016.
 Huang et al. (2018) Qiuyuan Huang, Zhe Gan, Asli Celikyilmaz, Dapeng Wu, Jianfeng Wang, and Xiaodong He. Hierarchically structured reinforcement learning for topically coherent visual story generation. arXiv preprint arXiv:1805.08191, 2018.
 Huszár (2015) Ferenc Huszár. How (not) to train your generative model: scheduled sampling, likelihood, adversary? In arXiv:1511.05101, 2015.
 Jang et al. (2016) Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with Gumbel-softmax. In arXiv:1611.01144, 2016.
 Karpathy & Fei-Fei (2015) Andrej Karpathy and Li Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In CVPR, 2015.
 Kuhn (1955) Harold W Kuhn. The Hungarian method for the assignment problem. Naval research logistics quarterly, 1955.
 Kusner et al. (2015) Matt Kusner, Yu Sun, Nicholas Kolkin, and Kilian Weinberger. From word embeddings to document distances. In ICML, 2015.
 Lamb et al. (2016) Alex Lamb, Anirudh Goyal, Ying Zhang, Saizheng Zhang, Aaron C Courville, and Yoshua Bengio. Professor forcing: A new algorithm for training recurrent networks. In NIPS, 2016.
 Li et al. (2017) Jiwei Li, Will Monroe, Tianlin Shi, Alan Ritter, and Dan Jurafsky. Adversarial learning for neural dialogue generation. In EMNLP, 2017.
 Li & Montufar (2018) W. Li and G. Montufar. Natural gradient via optimal transport. arXiv preprint arXiv:1803.07033, 2018.
 Lin (2004) ChinYew Lin. Rouge: A package for automatic evaluation of summaries. Text Summarization Branches Out, 2004.
 Lin et al. (2017) Kevin Lin, Dianqi Li, Xiaodong He, Zhengyou Zhang, and MingTing Sun. Adversarial ranking for language generation. In NIPS, 2017.
 Lin et al. (2014) TsungYi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
 Liu et al. (2018) Hao Liu, Yihao Feng, Yi Mao, Dengyong Zhou, Jian Peng, and Qiang Liu. Action-dependent control variates for policy optimization via Stein’s identity. In ICLR, 2018.
 Liu et al. (2017) Siqi Liu, Zhenhai Zhu, Ning Ye, Sergio Guadarrama, and Kevin Murphy. Improved image captioning via policy gradient optimization of spider. In ICCV, 2017.
 Lu et al. (2017) Jiasen Lu, Caiming Xiong, Devi Parikh, and Richard Socher. Knowing when to look: Adaptive attention via a visual sentinel for image captioning. In CVPR, 2017.
 Luise et al. (2018) Giulia Luise, Alessandro Rudi, Massimiliano Pontil, and Carlo Ciliberto. Differential properties of Sinkhorn approximation for learning with Wasserstein distance. arXiv:1805.11897, 2018.
 Luong et al. (2015a) Minh-Thang Luong, Hieu Pham, and Christopher D Manning. Effective approaches to attention-based neural machine translation. arXiv:1508.04025, 2015a.
 Luong et al. (2015b) Thang Luong, Hieu Pham, and Christopher D Manning. Bilingual word representations with monolingual quality in mind. In Proceedings of the 1st Workshop on Vector Space Modeling for Natural Language Processing, 2015b.
 Luong et al. (2018) Thang Luong, Eugene Brevdo, and Rui Zhao. Neural machine translation (seq2seq) tutorial, 2018. URL https://github.com/tensorflow/nmt.
 Maddison et al. (2017) Chris J Maddison, Andriy Mnih, and Yee Whye Teh. The concrete distribution: A continuous relaxation of discrete random variables. ICLR, 2017.
 Over et al. (2007) Paul Over, Hoa Dang, and Donna Harman. DUC in context. Information Processing & Management, 2007.
 Papineni et al. (2002) Kishore Papineni, Salim Roukos, Todd Ward, and WeiJing Zhu. BLEU: a method for automatic evaluation of machine translation. In ACL, 2002.
 Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher Manning. Glove: Global vectors for word representation. In EMNLP, 2014.
 Peyré et al. (2017) Gabriel Peyré, Marco Cuturi, et al. Computational optimal transport. Technical report, 2017.
 Ranzato et al. (2016) Marc’Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. Sequence level training with recurrent neural networks. In ICLR, 2016.
 Rennie et al. (2017) Steven J Rennie, Etienne Marcheret, Youssef Mroueh, Jarret Ross, and Vaibhava Goel. Self-critical sequence training for image captioning. In CVPR, 2017.
 Rubner et al. (2000) Yossi Rubner, Carlo Tomasi, and Leonidas J Guibas. The earth mover’s distance as a metric for image retrieval. IJCV, 2000.
 Rush et al. (2015) Alexander M Rush, Sumit Chopra, and Jason Weston. A neural attention model for abstractive sentence summarization. In EMNLP, 2015.
 Salimans et al. (2018) Tim Salimans, Han Zhang, Alec Radford, and Dimitris Metaxas. Improving GANs using optimal transport. In ICLR, 2018.
 See et al. (2017) Abigail See, Peter J Liu, and Christopher D Manning. Get to the point: summarization with pointer-generator networks. ACL, 2017.
 Sutskever et al. (2014) Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In NIPS, 2014.
 Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NIPS, 2017.
 Vedantam et al. (2015) Ramakrishna Vedantam, C Lawrence Zitnick, and Devi Parikh. CIDEr: Consensus-based image description evaluation. In CVPR, 2015.
 Villani (2008) Cédric Villani. Optimal Transport: Old and New. Grundlehren der mathematischen Wissenschaften. Springer, 2008.
 Vinyals et al. (2015) Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and tell: A neural image caption generator. In CVPR, 2015.
 Wang et al. (2018a) Li Wang, Junlin Yao, Yunzhe Tao, Li Zhong, Wei Liu, and Qiang Du. A reinforced topic-aware convolutional sequence-to-sequence model for abstractive text summarization. IJCAI, 2018a.
 Wang et al. (2018b) Xin Wang, Wenhu Chen, YuanFang Wang, and William Yang Wang. No metrics are perfect: Adversarial reward learning for visual storytelling. In ACL, 2018b.
 Wiseman & Rush (2016) Sam Wiseman and Alexander M Rush. Sequencetosequence learning as beamsearch optimization. In EMNLP, 2016.
 Wu et al. (2016) Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv:1609.08144, 2016.
 Xie et al. (2018) Yujia Xie, Xiangfeng Wang, Ruijia Wang, and Hongyuan Zha. A fast proximal point method for Wasserstein distance. In arXiv:1802.04307, 2018.
 Xu et al. (2018) Hongteng Xu, Wenlin Wang, Wei Liu, and Lawrence Carin. Distilled Wasserstein learning for word embedding and topic modeling. In NeurIPS, 2018.
 Xu et al. (2015) Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron C Courville, Ruslan Salakhutdinov, Richard S Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. In ICML, 2015.
 You et al. (2016) Quanzeng You, Hailin Jin, Zhaowen Wang, Chen Fang, and Jiebo Luo. Image captioning with semantic attention. In CVPR, 2016.
 Yu et al. (2017) Lantao Yu, Weinan Zhang, Jun Wang, and Yong Yu. SeqGAN: Sequence generative adversarial nets with policy gradient. In AAAI, 2017.
 Zhang et al. (2018a) Ruiyi Zhang, Changyou Chen, Zhe Gan, Wenlin Wang, Liqun Chen, Dinghan Shen, Guoyin Wang, and Lawrence Carin. Sequence generation with guider network. arXiv preprint arXiv:1811.00696, 2018a.
 Zhang et al. (2018b) Ruiyi Zhang, Changyou Chen, Chunyuan Li, and Lawrence Carin. Policy optimization as Wasserstein gradient flows. In ICML, 2018b.
 Zhang et al. (2017) Yizhe Zhang, Zhe Gan, Kai Fan, Zhi Chen, Ricardo Henao, Dinghan Shen, and Lawrence Carin. Adversarial feature matching for text generation. In ICML, 2017.
 Zhang et al. (2018c) Yizhe Zhang, Michel Galley, Jianfeng Gao, Zhe Gan, Xiujun Li, Chris Brockett, and Bill Dolan. Generating informative and diverse conversational responses via adversarial information maximization. In NeurIPS, 2018c.
Appendix
Appendix A Training Details for NMT task
The following training details largely follow the instructions from the TensorFlow nmt GitHub repository:

We use multi-layer LSTMs with a bidirectional encoder (i.e., a bidirectional layer for the encoder); the embedding dimension is 512. Luong attention (Luong et al., 2015a) (scale=True) is used together with dropout (specified via keep-prob). All parameters are uniformly initialized. We use SGD with the following learning-rate schedule: train at a fixed rate for an initial number of steps; afterwards, we start halving the learning rate at regular intervals.

The training hyperparameters are similar to those of the EN-VI experiments except for the following details. The data is split into subword units using BPE (32K operations). We train multi-layer LSTMs with a bidirectional encoder (i.e., bidirectional layers for the encoder). We train at a fixed rate for an initial number of steps; afterwards, we start halving the learning rate at regular intervals.
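The halving schedule described above can be sketched as follows. This is a minimal illustration, not the repository's code; `base_lr`, `decay_start`, and `decay_every` are hypothetical names standing in for the step counts elided above, and whether the first halving happens exactly at `decay_start` is an assumption.

```python
def halving_schedule(step, base_lr, decay_start, decay_every):
    """SGD learning-rate schedule: keep base_lr until decay_start,
    then halve the rate every decay_every steps (first halving is
    assumed to occur at decay_start itself)."""
    if step < decay_start:
        return base_lr
    n_halvings = (step - decay_start) // decay_every + 1
    return base_lr * 0.5 ** n_halvings
```

For example, with `decay_start=100` and `decay_every=10`, the rate stays at `base_lr` for steps 0–99, drops to half at step 100, and to a quarter at step 110.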
Appendix B End-to-end neural machine translation
Systems  NT2012  NT2013
VI-EN: GNMT  20.7  23.8
VI-EN: GNMT+OT (Ours)  21.9  25.5
EN-VI: GNMT  23.0  25.4
EN-VI: GNMT+OT (Ours)  24.1  26.5
Systems  NT2013  NT2015
DE-EN: GNMT  28.5  29.0
DE-EN: GNMT+OT (Ours)  28.8  29.5
EN-DE: GNMT  23.7  25.3
EN-DE: GNMT+OT (Ours)  24.1  26.2
Appendix C BLEU score for different hyperparameters
We tested different weights for the OT loss term and summarized the results in Figure 3; the best-performing weight for the EN-VI experiment is indicated there. Overall, the results are robust with respect to the choice of this weight.
Appendix D Summarization examples
Example 1
Source: japan ’s nec corp. and UNK computer corp. of the united states said wednesday they had agreed to join forces in supercomputer sales .
Reference: nec UNK in computer sales tieup
Baseline: nec UNK computer corp.
Ours: nec UNK computer corp sales supercomputer.

Example 2
Source: five east timorese youths who scaled the french embassy ’s fence here thursday , left the embassy on their way to portugal friday .
Reference: UNK latest east timorese asylum seekers leave for portugal
Baseline: five east timorese youths leave embassy
Ours: five east timorese seekers leave embassy for portugal

Example 3
Source: the us space shuttle atlantis separated from the orbiting russian mir space station early saturday , after three days of test runs for life in a future space facility , nasa announced .
Reference: atlantis mir part ways after threeday space collaboration by emmanuel UNK
Baseline: atlantis separate from mir
Ours: atlantis separate from mir space by UNK

Example 4
Source: australia ’s news corp announced monday it was joining brazil ’s globo , mexico ’s grupo televisa and the us telecommunications inc. in a venture to broadcast ### channels via satellite to latin america .
Reference: news corp globo televisa and telecommunications in satellite venture
Baseline: australia ’s news corp joins brazil
Ours: australia ’s news corp joins brazil in satellite venture
Examples for abstractive summarization are provided in Table 9.
Appendix E Neural machine translation examples
In Table 10, we show more examples for comparison. These examples show that sentences generated by our model are more faithful to the reference sentences.
Appendix F Image Caption examples
Table 4 shows the comparison of our model with other baselines.
Appendix G Encoding model belief with Softmax and Gumbel-softmax
Systems  NT2012  NT2013
VI-EN: GNMT  20.7  23.8
VI-EN: GNMT+OT (GS)  20.9  24.5
VI-EN: GNMT+OT (softmax)  21.8  24.3
VI-EN: GNMT+OT (ours)  21.9  25.5
EN-VI: GNMT  23.0  25.4
EN-VI: GNMT+OT (GS)  23.3  25.7
EN-VI: GNMT+OT (softmax)  23.5  26.0
EN-VI: GNMT+OT (ours)  24.1  26.5
To identify the best differentiable sequence-generation mechanism, we also experimented with Softmax and Gumbel-softmax (a.k.a. the concrete distribution). Detailed results are summarized in Table 11. The Softmax- and Gumbel-softmax-based OT models provide less significant BLEU gains over the baseline MLE model; in some cases, performance even degrades. We hypothesize that this is because Softmax encodes more ambiguity, while Gumbel-softmax has larger variance due to the extra random variable involved; both in turn hurt learning. A more involved variance-reduction scheme might offset these negative impacts for Gumbel-softmax, which we leave as future work.
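To make the contrast concrete, the following sketch compares the two belief-encoding schemes: plain softmax yields a deterministic soft distribution over the vocabulary, while Gumbel-softmax perturbs the logits with Gumbel noise before the (temperature-controlled) softmax; that extra noise is the source of the additional variance discussed above. This is an illustrative implementation of the standard relaxations, not the paper's code.

```python
import numpy as np

def softmax(logits):
    """Deterministic soft distribution over the vocabulary."""
    e = np.exp(logits - logits.max())   # subtract max for numerical stability
    return e / e.sum()

def gumbel_softmax(logits, tau=1.0, rng=None):
    """Stochastic relaxation: perturb logits with Gumbel(0, 1) noise,
    then apply a softmax at temperature tau. Lower tau gives samples
    closer to one-hot, at the price of higher-variance gradients."""
    rng = rng or np.random.default_rng()
    g = -np.log(-np.log(rng.uniform(size=logits.shape)))  # Gumbel(0, 1) draws
    return softmax((logits + g) / tau)
```

Both return a valid probability vector, so either can be fed into a differentiable sequence-level loss in place of hard word samples.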
Metric  Baseline  Case 1  Case 2 

BLEU2  71.87  73.60  75.12 
BLEU3  61.18  63.07  64.82 
BLEU4  56.59  58.48  60.27 
BLEU5  53.73  55.69  57.50 
Appendix H OT improves both the model and the word embedding matrix
To identify the source of the performance gains, we designed a toy sequence-to-sequence experiment to show that OT helps to refine both the language model and the word embedding matrix. We use the English corpus from the WMT dataset (from our machine translation task) and train an autoencoder (a Seq2Seq model) on this dataset, evaluating reconstruction quality with the BLEU score. In Case 1, we stop the OT gradient from flowing back into the sequence model (so it only affects the word embedding matrix); in Case 2, the OT gradient can affect the entire model. Detailed results are shown in Table 12. Case 1 is better than the baseline model, which means OT helps to refine the word embedding matrix; Case 2 achieves the highest BLEU, which implies OT also helps to improve the language model.
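The two ablation cases amount to a simple gradient-routing rule, sketched below as a toy illustration (not the actual implementation; the function and argument names are hypothetical). Case 1 applies a stop-gradient so the OT loss only updates the embedding matrix, while Case 2 lets the OT gradient reach every parameter.

```python
import numpy as np

def route_ot_gradients(model_grad, embedding_grad, case):
    """Route the OT gradient as in the two ablation cases.

    Case 1: block the OT gradient into the sequence model (stop-gradient),
            so the OT loss only updates the word embedding matrix.
    Case 2: let the OT gradient update the entire model.
    """
    if case == 1:
        model_grad = np.zeros_like(model_grad)  # gradient blocked here
    return model_grad, embedding_grad
```

In an autodiff framework the same effect is usually achieved by detaching the sequence model's contribution from the OT loss in Case 1.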