Improving Sequence-to-Sequence Learning
via Optimal Transport
Sequence-to-sequence models are commonly trained via maximum likelihood estimation (MLE). However, standard MLE training considers a word-level objective, predicting the next word given the previous ground-truth partial sentence. This procedure focuses on modeling local syntactic patterns, and may fail to capture long-range semantic structure. We present a novel solution to alleviate these issues. Our approach imposes global sequence-level guidance via new supervision based on optimal transport, enabling the overall characterization and preservation of semantic features. We further show that this method can be understood as a Wasserstein gradient flow trying to match our model to the ground truth sequence distribution. Extensive experiments are conducted to validate the utility of the proposed approach, showing consistent improvements over a wide variety of NLP tasks, including machine translation, abstractive text summarization, and image captioning.
Liqun Chen, Yizhe Zhang, Ruiyi Zhang, Chenyang Tao, Zhe Gan, Haichao Zhang, Bai Li, Dinghan Shen, Changyou Chen, Lawrence Carin
Duke University, Microsoft Research, Microsoft Dynamics 365 AI Research, Baidu Research, SUNY at Buffalo
1 Introduction
Sequence-to-sequence (Seq2Seq) models are widely used in various natural language processing tasks, such as machine translation (Sutskever et al., 2014, Cho et al., 2014, Bahdanau et al., 2015), text summarization (Rush et al., 2015, Chopra et al., 2016) and image captioning (Vinyals et al., 2015, Xu et al., 2015). Typically, Seq2Seq models are based on an encoder-decoder architecture, with an encoder mapping a source sequence into a latent vector, and a decoder translating the latent vector into a target sequence. The goal of a Seq2Seq model is to optimize this encoder-decoder network to generate sequences close to the target. Therefore, a proper measure of the distance between sequences is crucial for model training.
Maximum likelihood estimation (MLE) is often used as the training paradigm in existing Seq2Seq models (Goodfellow et al., 2016, Lamb et al., 2016). The MLE-based approach maximizes the likelihood of the next word conditioned on its previous ground-truth words. Such an approach adopts cross-entropy loss as the objective, essentially measuring the word difference at each position of the target sequence (assuming truth for the preceding words). That is, MLE only provides a word-level training loss (Ranzato et al., 2016). Consequently, MLE-based methods suffer from the so-called exposure bias problem (Bengio et al., 2015, Ranzato et al., 2016), i.e., the discrepancy between training and inference stages. During inference, each word is generated sequentially based on previously generated words. However, ground-truth words are used in each timestep during training (Huszár, 2015, Wiseman & Rush, 2016). Such discrepancy in training and testing leads to accumulated errors along the sequence-generation trajectory, and may therefore produce unstable results in practice. Further, commonly used metrics for evaluating the generated sentences at test time are sequence-level, such as BLEU (Papineni et al., 2002) and ROUGE (Lin, 2004). This also indicates a mismatch of the training loss and test-time evaluation metrics.
Attempts have been made to alleviate the above issues, via a sequence-level training loss that enables comparisons between the entire generated and reference sequences. Such efforts roughly fall into two categories: (i) reinforcement-learning-based (RL) methods (Ranzato et al., 2016, Bahdanau et al., 2017) and (ii) adversarial-learning-based methods (Yu et al., 2017, Zhang et al., 2017). These methods overcome the exposure bias issue by criticizing model output during training; however, both schemes have their own vulnerabilities. RL methods often suffer from large variance in policy-gradient estimation, and control variates and carefully designed baselines (such as a self-critic) are needed to make RL training more robust (Rennie et al., 2017, Liu et al., 2018). Further, the rewards used by RL training are often criticized as a bad proxy for human evaluation, as they are usually highly biased towards particular aspects (Wang et al., 2018b). On the other hand, adversarial supervision relies on the delicate balance of a mini-max game, which can be easily undermined by mode-trapping and gradient-vanishing problems (Arjovsky et al., 2017, Zhang et al., 2017). Sophisticated tuning is often required for successful adversarial training.
We present a novel Seq2Seq learning scheme that leverages optimal transport (OT) to construct a sequence-level loss. Specifically, the OT objective aims to find an optimal matching of similar words/phrases between two sequences, providing a way to promote their semantic similarity (Kusner et al., 2015). Compared with the above RL and adversarial schemes, our approach has: (i) semantic-invariance, allowing better preservation of sequence-level semantic information; and (ii) improved robustness, since neither the REINFORCE gradient nor the mini-max game is involved. The OT loss allows end-to-end supervised training and acts as an effective sequence-level regularization to the MLE loss.
Another novel strategy distinguishing our model from previous approaches is that during training we consider not only the OT distance between the generated sentence and ground-truth references, but also the OT distance between the generated sentence and its corresponding input. This enables our model to simultaneously match the generated output sentence with both the source sentence(s) and target reference sentence, thus enforcing the generator to leverage information contained in the input sentence(s) during generation.
The main contributions of this paper are summarized as follows. (i) A new sequence-level training algorithm based on optimal transport is proposed for Seq2Seq learning. In practice, the OT distance is introduced as a regularization term to the MLE training loss. (ii) Our model can be interpreted as approximate Wasserstein gradient flows, learning to approximately match the sequence distribution induced by the generator and a target data distribution. (iii) In order to demonstrate the versatility of the proposed method, we conduct extensive empirical evaluations on three tasks: machine translation, text summarization, and image captioning.
2 Semantic Matching with Optimal Transport
We consider two components of a sentence: its syntactic and semantic parts. In a Seq2Seq model, it is often desirable to keep the semantic meaning while the syntactic part can be more flexible. Conventional training schemes, such as MLE, are known to be well-suited for capturing the syntactic structure. As such, we focus on the semantic part. An intuitive way to assess semantic similarity is to directly match the “key words” between the synthesized and the reference sequences. Consider the respective sequences as sets $A$ and $B$, with vocabularies as their elements. Then the matching can be evaluated by $|A \cap B|$, where $|\cdot|$ is the counting measure for sets. We call this hard matching, as it seeks to exactly match words from both sequences.
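As a minimal illustration of hard matching, the score is simply the size of the intersection of the two word sets. The function name `hard_match` and the toy sentences are assumptions made for this sketch and are not taken from the experiments.

```python
def hard_match(generated, reference):
    """Hard matching score: number of distinct words shared by both sequences."""
    return len(set(generated) & set(reference))


# Toy sentences (hypothetical, for illustration only).
generated = "a cat sits on the mat".split()
reference = "the cat is on a mat".split()

# Shared words: a, cat, on, the, mat. "sits" and "is" earn no credit
# even though they are closely related, because the match must be exact.
score = hard_match(generated, reference)
print(score)
```

Because only exact overlaps count, near-synonyms contribute nothing to this score.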
For language models, the above hard matching could be an oversimplification. This is because words have semantic meaning, and two different words can be close to each other in the semantic space. To account for such ambiguity, we can relax the hard matching to soft bipartite matching (SBM). More specifically, assuming all sequences have the same length $n$, we pair $w_{i_k}$ and $w'_{j_k}$ for $k \in \{1, \dots, n\}$, such that $\{i_k\}$, $\{j_k\}$ are unique and $\sum_{k} c(w_{i_k}, w'_{j_k})$ is minimized. Here $c(w, w')$ is a cost function measuring the semantic dissimilarity between the two words. For instance, the cosine distance $c(x, y) = 1 - \frac{x^\top y}{\|x\|_2 \|y\|_2}$ between two word embedding vectors $x$ and $y$ is a popular choice (Pennington et al., 2014). This minimization can be solved exactly, e.g., via the Hungarian algorithm (Kuhn, 1955). Unfortunately, its complexity scales badly for common NLP tasks, and the objective is also non-differentiable with respect to model parameters. As such, end-to-end supervised training is not feasible with the Hungarian matching scheme. To overcome this difficulty, we propose to further relax the matching criteria while keeping the favorable features of a semantic bipartite matching. OT arises as a natural candidate.
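The soft bipartite matching objective can be sketched on a toy example. The code below is not the Hungarian algorithm: it searches all one-to-one pairings exhaustively, which returns the same optimum but is only feasible for very short sequences. The embedding vectors and function names are hypothetical.

```python
import itertools
import math


def cosine_distance(x, y):
    """Cosine distance 1 - x.y / (|x| |y|) between two embedding vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / norm


def soft_bipartite_match(X, Y):
    """Exhaustive search over one-to-one pairings (a stand-in for Hungarian).

    Returns (minimal total cost, perm) where X[i] is paired with Y[perm[i]].
    """
    best_cost, best_perm = float("inf"), None
    for perm in itertools.permutations(range(len(X))):
        cost = sum(cosine_distance(X[i], Y[j]) for i, j in enumerate(perm))
        if cost < best_cost:
            best_cost, best_perm = cost, perm
    return best_cost, best_perm


# Toy 2-d "word embeddings" (hypothetical values).
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
Y = [[0.0, 2.0], [2.0, 2.0], [3.0, 0.0]]
cost, perm = soft_bipartite_match(X, Y)
print(perm, cost)
```

Each vector in `X` has a counterpart in `Y` pointing in the same direction, so the best pairing has a total cosine cost of (numerically) zero even though no two vectors are identical.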
2.1 Optimal transport and Wasserstein distance
We first provide a brief review of optimal transport, which defines distances between probability measures on a domain $\mathbb{X}$ (the sequence space in our setting). The optimal transport distance for two probability measures $\mu$ and $\nu$ is defined as (Peyré et al., 2017):
$$\mathcal{D}_c(\mu, \nu) = \inf_{\gamma \in \Pi(\mu, \nu)} \mathbb{E}_{(x, y) \sim \gamma}\,[c(x, y)], \qquad (1)$$
where $\Pi(\mu, \nu)$ denotes the set of all joint distributions $\gamma(x, y)$ with marginals $\mu(x)$ and $\nu(y)$; $c(x, y)$ is the cost function for moving $x$ to $y$, e.g., the Euclidean or cosine distance. Intuitively, the optimal transport distance is the minimum cost that $\gamma$ induces in order to transport from $\mu$ to $\nu$. When $c(x, y)$ is a metric on $\mathbb{X}$, $\mathcal{D}_c(\mu, \nu)$ induces a proper metric on the space of probability distributions supported on $\mathbb{X}$, commonly known as the Wasserstein distance (Villani, 2008). One of the most popular choices is the $2$-Wasserstein distance $W_2^2(\mu, \nu)$, where the squared Euclidean distance $c(x, y) = \|x - y\|^2$ is used as cost.
OT distance on discrete domains
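We mainly focus on applying the OT distance on textual data. Therefore, we only consider OT between discrete distributions. Specifically, consider two discrete distributions $\mu, \nu$ on $\mathbb{X}$, which can be written as $\mu = \sum_{i=1}^{n} u_i \delta_{x_i}$ and $\nu = \sum_{j=1}^{m} v_j \delta_{y_j}$, with $\delta_x$ the Dirac function centered on $x$. The weight vectors $u = \{u_i\}_{i=1}^{n}$ and $v = \{v_j\}_{j=1}^{m}$ respectively belong to the $n$- and $m$-dimensional simplex, i.e., $\sum_{i=1}^{n} u_i = \sum_{j=1}^{m} v_j = 1$, as both $\mu$ and $\nu$ are probability distributions. Under such a setting, computing the OT distance as defined in (1) is equivalent to solving the following network-flow problem (Luise et al., 2018):
$$\mathcal{L}_{\text{ot}}(\mu, \nu) = \min_{T \in \Pi(u, v)} \sum_{i=1}^{n} \sum_{j=1}^{m} T_{ij}\, c(x_i, y_j) = \min_{T \in \Pi(u, v)} \langle T, C \rangle, \qquad (2)$$
where $\Pi(u, v) = \{T \in \mathbb{R}_{+}^{n \times m} \mid T \mathbf{1}_m = u,\ T^\top \mathbf{1}_n = v\}$, $\mathbf{1}_n$ denotes an $n$-dimensional all-one vector, $C$ is the cost matrix given by $C_{ij} = c(x_i, y_j)$, and $\langle T, C \rangle = \mathrm{Tr}(T^\top C)$ represents the Frobenius dot-product. We refer to the minimizer $T^*$ of (2) as OT matching. Comparing the two objectives, one can readily recognize that soft bipartite matching represents a special constrained solution to (2), where $T$ can only take values in $\Gamma = \{T \mid \max_i \{\|T e_i\|_0, \|e_i^\top T\|_0\} \le 1,\ T_{ij} \in \{0, 1\}\}$ instead of $\Pi(u, v)$; here $\|\cdot\|_0$ is the $L_0$ norm and $e_i$ is the unit vector along the $i$-th axis. As such, OT matching can be regarded as a relaxed version of soft bipartite matching. In Figure 1 we illustrate the three matching schemes discussed above.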
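To make the discrete formulation concrete, the sketch below checks whether a candidate matrix is a valid transport plan (non-negative, with the prescribed row and column sums) and evaluates its transport cost as a Frobenius dot-product. The helper names and the two-point example are assumptions introduced for illustration.

```python
def frobenius_dot(T, C):
    """Frobenius dot-product: sum over i, j of T[i][j] * C[i][j]."""
    return sum(t * c for t_row, c_row in zip(T, C) for t, c in zip(t_row, c_row))


def is_transport_plan(T, u, v, tol=1e-9):
    """Check that T is non-negative with row sums u and column sums v."""
    nonneg = all(t >= 0.0 for row in T for t in row)
    rows_ok = all(abs(sum(row) - ui) <= tol for row, ui in zip(T, u))
    cols_ok = all(abs(sum(row[j] for row in T) - vj) <= tol
                  for j, vj in enumerate(v))
    return nonneg and rows_ok and cols_ok


# Two points on each side with uniform weights; moving a point to its
# counterpart is free, moving it to the other point costs 1.
u = [0.5, 0.5]
v = [0.5, 0.5]
C = [[0.0, 1.0], [1.0, 0.0]]

T_match = [[0.5, 0.0], [0.0, 0.5]]      # all mass sent to the free counterpart
T_indep = [[0.25, 0.25], [0.25, 0.25]]  # independent coupling of u and v

cost_match = frobenius_dot(T_match, C)  # 0.0
cost_indep = frobenius_dot(T_indep, C)  # 0.5
```

Both matrices satisfy the marginal constraints, but only `T_match` attains the minimum cost; the OT problem is the search for such a minimizer over all feasible plans.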
The IPOT algorithm
Unfortunately, the exact minimization over $T$ is in general computationally intractable (Arjovsky et al., 2017, Genevay et al., 2018, Salimans et al., 2018). To overcome such intractability, we consider an efficient iterative approach to approximate the OT distance. We propose to use the recently introduced Inexact Proximal point method for Optimal Transport (IPOT) algorithm to compute the OT matrix $T^*$, and thus also the OT distance (Xie et al., 2018). IPOT provides a solution to the original OT problem specified in (2). Specifically, IPOT iteratively solves the following optimization problem using the proximal point method (Boyd & Vandenberghe, 2004):
$$T^{(t+1)} = \arg\min_{T \in \Pi(u, v)} \Big\{ \langle T, C \rangle + \beta \cdot \mathcal{B}(T, T^{(t)}) \Big\}, \qquad (3)$$
where the proximity metric term $\mathcal{B}(T, T^{(t)})$ penalizes solutions that are too distant from the latest approximation, and $\frac{1}{\beta}$ is understood as the generalized stepsize. This renders a tractable iterative scheme towards the exact OT solution. In this work, we employ the generalized KL Bregman divergence $\mathcal{B}(T, T^{(t)}) = \sum_{i,j} T_{ij} \log \frac{T_{ij}}{T^{(t)}_{ij}} - \sum_{i,j} T_{ij} + \sum_{i,j} T^{(t)}_{ij}$ as the proximity metric. Algorithm 1 describes the implementation details for IPOT.
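The proximal point iteration can be sketched in a few lines of plain Python. This is a minimal sketch based on our reading of the IPOT scheme of Xie et al. (2018), not the released implementation: it assumes uniform marginals ($1/n$ and $1/m$), and the default values of `beta`, `n_outer` and `n_inner` are illustrative rather than the settings used in the experiments.

```python
import math


def ipot(C, beta=0.5, n_outer=50, n_inner=1):
    """Approximate the OT plan and distance for uniform marginals.

    C is an n x m cost matrix given as a list of lists. Returns (distance, T).
    """
    n, m = len(C), len(C[0])
    sigma = [1.0 / m] * m                       # column scaling vector
    T = [[1.0] * m for _ in range(n)]           # initial plan: all ones
    A = [[math.exp(-C[i][j] / beta) for j in range(m)] for i in range(n)]
    for _ in range(n_outer):
        # Proximal step: reweight the kernel by the previous plan.
        Q = [[A[i][j] * T[i][j] for j in range(m)] for i in range(n)]
        for _ in range(n_inner):
            # Alternately rescale rows and columns towards the marginals.
            delta = [1.0 / (n * sum(Q[i][j] * sigma[j] for j in range(m)))
                     for i in range(n)]
            sigma = [1.0 / (m * sum(Q[i][j] * delta[i] for i in range(n)))
                     for j in range(m)]
        T = [[delta[i] * Q[i][j] * sigma[j] for j in range(m)] for i in range(n)]
    distance = sum(T[i][j] * C[i][j] for i in range(n) for j in range(m))
    return distance, T


C = [[0.0, 1.0], [1.0, 0.0]]
distance, T = ipot(C)
print(distance)  # close to 0: the plan concentrates on the zero-cost pairs
```

Unlike a single entropy-regularized solve, the repeated proximal steps keep sharpening the plan, so on this toy problem the off-diagonal mass shrinks geometrically with the number of outer iterations.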
Note that the Sinkhorn algorithm (Cuturi, 2013) can also be used to compute the OT matrix. Specifically, the Sinkhorn algorithm tries to solve the entropy-regularized optimization problem $\min_{T \in \Pi(u, v)} \langle T, C \rangle - \frac{1}{\epsilon} H(T)$, where $H(T) = -\sum_{i,j} T_{ij} (\log T_{ij} - 1)$ is the entropy regularization term and $\epsilon > 0$ is the regularization strength. However, in our experiments, we empirically found that the numerical stability and performance of the Sinkhorn algorithm are quite sensitive to the choice of the hyper-parameter $\epsilon$; thus only IPOT is considered in our model training.
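For comparison, a minimal Sinkhorn sketch is given below. As a naming choice for this sketch, `reg` denotes the weight multiplying the entropy term, so the kernel is `exp(-C / reg)`; the function name and the toy problem are assumptions, not part of the original experiments.

```python
import math


def sinkhorn(C, u, v, reg=1.0, n_iter=200):
    """Entropy-regularized OT via Sinkhorn scaling. Returns (cost, T)."""
    n, m = len(C), len(C[0])
    K = [[math.exp(-C[i][j] / reg) for j in range(m)] for i in range(n)]
    a, b = [1.0] * n, [1.0] * m
    for _ in range(n_iter):
        # Alternate row and column rescaling to match the marginals u and v.
        a = [u[i] / sum(K[i][j] * b[j] for j in range(m)) for i in range(n)]
        b = [v[j] / sum(K[i][j] * a[i] for i in range(n)) for j in range(m)]
    T = [[a[i] * K[i][j] * b[j] for j in range(m)] for i in range(n)]
    cost = sum(T[i][j] * C[i][j] for i in range(n) for j in range(m))
    return cost, T


u = [0.5, 0.5]
v = [0.5, 0.5]
C = [[0.0, 1.0], [1.0, 0.0]]

cost_blurred, _ = sinkhorn(C, u, v, reg=1.0)   # heavy smoothing, biased cost
cost_sharp, _ = sinkhorn(C, u, v, reg=0.05)    # light smoothing, near exact
print(cost_blurred, cost_sharp)
```

On this toy problem the exact OT cost is 0, yet the heavily regularized solution reports a cost of roughly 0.27, which illustrates how strongly the result depends on the regularization weight. Very small weights reduce this bias but push the kernel entries towards numerical underflow.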
2.2 Optimal transport distance as a sequence level loss
Figure 2 illustrates how OT is computed to construct the sequence-level loss. Given two sentences, we can construct their word-level or phrase-level embedding matrices $S$ and $S'$, where $S$ is usually recognized as the reference sequence embedding and $S'$ as the model output sequence embedding. The cost matrix $C$ is then computed by $C_{ij} = c(S_i, S'_j)$ and passed on to the IPOT algorithm to get the OT distance. Our full algorithm is summarized in Algorithm 2, and more detailed model specifications are given below.
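The cost-matrix step can be sketched as follows, assuming cosine distance as the cost and plain lists of embedding vectors as inputs. The function names and the toy embeddings are hypothetical.

```python
import math


def cosine_distance(x, y):
    """Cosine distance 1 - x.y / (|x| |y|)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / norm


def cosine_cost_matrix(S_ref, S_gen):
    """Entry [i][j] is the cosine distance between S_ref[i] and S_gen[j]."""
    return [[cosine_distance(s, g) for g in S_gen] for s in S_ref]


# One embedding vector per token (toy 2-d values).
S_ref = [[1.0, 0.0], [0.0, 1.0]]
S_gen = [[2.0, 0.0], [1.0, 1.0]]
C = cosine_cost_matrix(S_ref, S_gen)
```

The resulting matrix has one row per reference token and one column per generated token, which is the input an OT solver needs; the two sequences do not have to share the same length.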
Encoding model belief with a differentiable sequence generator
We first describe how to design a differentiable sequence generator so that the gradients can be backpropagated from the OT losses to update the model belief. The Long Short-Term Memory (LSTM) recurrent neural network (Hochreiter & Schmidhuber, 1997) is used as our sequence model. At each timestep $t$, the LSTM decoder outputs a logit vector $v_t$ for the vocabularies, based on its context. Directly sampling $\hat{w}_t \sim \text{Softmax}(v_t)$ from the multinomial distribution is a non-differentiable operation (here $\hat{w}_t$ is understood as a one-hot vector in order to be notationally consistent with its differentiable alternatives), so we consider the following differentiable alternatives:
Soft-argmax: $\hat{w}_t^{\text{SA}} = \text{Softmax}(v_t / \tau)$, where $\tau \in (0, 1)$ is the annealing parameter (Zhang et al., 2017). This approximates the deterministic sampling scheme $\hat{w}_t^{\max} = \text{argmax}\{v_t\}$;
Gumbel-softmax (GS): $\hat{w}_t^{\text{GS}} = \text{Softmax}((v_t + \xi_t) / \tau)$, where $\xi_t$ are i.i.d. Gumbel random variables, one for each vocabulary entry.
Unstable training and sub-optimal solutions have been observed for the GS-based scheme for the Seq2Seq tasks we considered (see Appendix G, Table 11), possibly due to the extra uncertainty introduced. As such, we will assume the use of soft-argmax to encode model belief in $\hat{w}_t$ unless otherwise specified. Note that $\hat{w}_t$ is a normalized non-negative vector that sums up to one.
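A minimal sketch of the soft-argmax relaxation is shown below: a softmax applied to temperature-scaled logits, which approaches a one-hot vector as the temperature shrinks. The function name, logits and temperature values are illustrative assumptions.

```python
import math


def soft_argmax(logits, tau=0.1):
    """Softmax of logits / tau, computed stably by subtracting the maximum."""
    scaled = [v / tau for v in logits]
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]
w_sharp = soft_argmax(logits, tau=0.1)   # nearly one-hot on the largest logit
w_plain = soft_argmax(logits, tau=1.0)   # ordinary softmax, much flatter
print(w_sharp, w_plain)
```

Both outputs are valid probability vectors, so they remain differentiable functions of the logits while the low-temperature version stays close to the discrete argmax choice.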
Sequence-level OT-matching loss
To pass on the model belief to the OT loss, we use the mean word embedding predicted by the model, given by $\hat{z}_t = E^\top \hat{w}_t$, where $E \in \mathbb{R}^{V \times d}$ is the word embedding matrix, $V$ is the vocabulary size and $d$ is the dimension of the embedding vector. We collect the predicted sequence embeddings into $S_g = \{\hat{z}_t\}_{t=1}^{L}$, where $L$ is the length of the sequence. Similarly, we denote the reference sequence embeddings as $S_r = \{z_t\}_{t=1}^{L}$, using the ground-truth one-hot input token sequence $\{w_t\}$. Based on the sequence embeddings $S_r$ and $S_g$, we can compute the sequence-level OT loss between ground-truth and model prediction using the IPOT algorithm described above for different Seq2Seq tasks:
$$\mathcal{L}_{\text{seq}} \triangleq \text{IPOT}(S_g, S_r). \qquad (4)$$
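The mean word embedding is a probability-weighted average of the rows of the embedding matrix. The sketch below uses a toy vocabulary of three words with two-dimensional embeddings; all names and values are hypothetical.

```python
def expected_embedding(w, E):
    """Probability-weighted average of the rows of E (one row per word)."""
    dim = len(E[0])
    return [sum(w[k] * E[k][j] for k in range(len(E))) for j in range(dim)]


# Toy embedding matrix: 3 vocabulary entries, 2 dimensions.
E = [[1.0, 0.0],
     [0.0, 1.0],
     [1.0, 1.0]]

w_soft = [0.5, 0.25, 0.25]     # model belief over the vocabulary
w_onehot = [0.0, 0.0, 1.0]     # ground-truth token as a one-hot vector

z_soft = expected_embedding(w_soft, E)      # [0.75, 0.5]
z_onehot = expected_embedding(w_onehot, E)  # [1.0, 1.0], i.e. row 2 of E
```

With a one-hot vector the operation reduces to an ordinary embedding lookup, so predicted and reference tokens are embedded in the same way.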
We additionally consider feature matching using the OT criteria between the source and target. Intuitively, it will encourage the global semantic meaning to be preserved from source to target. This is related to the copy network (Gu et al., 2016). However, in our framework, the copying mechanism can be understood as a soft optimal-transport-based copying, instead of the original hard retrieval-based copying used by Gu et al. (2016). This soft copying mechanism considers semantic similarity in the embedding space, and thus presumably delivers smoother transformation of information. In the case where the source and target sequences do not share vocabulary (e.g., machine translation), this objective can still be applied by sharing the word embedding space between source and target. Ideally, the embedding for the same concept in different languages will automatically be aligned by optimizing such loss, making available a cosine-similarity-based cost matrix. This is also related to bilingual skip-gram (Luong et al., 2015b). We denote this loss as $\mathcal{L}_{\text{copy}} \triangleq \text{IPOT}(S_g, S_s)$, where $S_s$ represents the source sequence embeddings.
Complementing MLE training with OT regularization
The OT training objectives discussed above cannot train a proper language model on their own, as they do not explicitly consider word ordering, i.e., the syntactic structure of a language model. To overcome this issue, we propose to combine the OT loss with the de facto likelihood loss $\mathcal{L}_{\text{MLE}}$, which gives us the final training objective: $\mathcal{L} = \mathcal{L}_{\text{MLE}} + \gamma \mathcal{L}_{\text{seq}}$, where $\gamma$ is a hyper-parameter to be tuned. For tasks with both input and output sentences, such as machine translation and text summarization, $\mathcal{L}_{\text{copy}}$ can be applied, in which case the final objective can be written as $\mathcal{L} = \mathcal{L}_{\text{MLE}} + \gamma_1 \mathcal{L}_{\text{copy}} + \gamma_2 \mathcal{L}_{\text{seq}}$.
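As a small worked example of the combined objective, the sketch below adds weighted OT terms to a teacher-forced negative log-likelihood. The function names, the per-token probabilities, the OT loss values and the weights are all made up for illustration.

```python
import math


def mle_loss(token_probs):
    """Negative log-likelihood of the ground-truth tokens under teacher forcing."""
    return -sum(math.log(p) for p in token_probs)


def total_loss(l_mle, l_seq, weight_seq, l_copy=0.0, weight_copy=0.0):
    """MLE loss plus weighted OT regularizers (the copy term is optional)."""
    return l_mle + weight_seq * l_seq + weight_copy * l_copy


# Probabilities the model assigns to each ground-truth token (toy values).
l_mle = mle_loss([0.5, 0.25])           # log 2 + log 4 = 3 log 2

# OT losses would come from an OT solver; the values here are placeholders.
loss_seq_only = total_loss(l_mle, l_seq=0.2, weight_seq=0.1)
loss_with_copy = total_loss(l_mle, l_seq=0.2, weight_seq=0.1,
                            l_copy=0.4, weight_copy=0.1)
```

Keeping the likelihood term as the main component preserves word-order supervision, while the weights control how strongly the sequence-level matching acts as a regularizer.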
3 Interpretation as Approximate Wasserstein Gradient Flows
To further justify the use of our approach (minimizing the loss $\mathcal{L} = \mathcal{L}_{\text{MLE}} + \gamma \mathcal{L}_{\text{seq}}$, where $\mathcal{L}_{\text{seq}}$ denotes the Wasserstein loss), we now explain how our model approximately learns to match the ground-truth sequence distribution. Our derivation is based on the theory of Wasserstein gradient flows (WGF) (Villani, 2008). In WGF, the Wasserstein distance describes the local geometry of a trajectory in the space of probability measures converging to a target distribution (Ambrosio et al., 2005). In the following, we show that the proposed method learns to approximately match the data distribution, from the perspective of WGF. For simplicity we only discuss the continuous case, while a similar argument also holds for the discrete case (Li & Montufar, 2018).
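We denote the induced distribution of the sequences generated from the decoder at the $l$-th iteration as $\mu_l$. Assume the sequence data distribution is given by $p_d(x)$. Intuitively, the optimal generator in a Seq2Seq model learns a distribution $\mu^*(x)$ that matches $p_d(x)$. Based on Craig (2014), this can be achieved by composing a sequence of discretized WGFs given by:
$$\mu_l = J_h(\mu_{l-1}) = J_h(J_h(\cdots J_h(\mu_0))), \qquad (5)$$
with $J_h(\cdot)$ defined as
$$J_h(\mu) = \arg\min_{\nu \in \mathcal{P}_s} \Big\{ \frac{1}{2h} W_2^2(\mu, \nu) + \mathrm{KL}(\nu \,\|\, p_d) \Big\} = \arg\min_{\nu \in \mathcal{P}_s} \Big\{ -\mathbb{E}_{x \sim \nu}[\log p_d(x)] - H(\nu) + \frac{\lambda}{2} W_2^2(\mu, \nu) \Big\}, \qquad (6)$$
where $\lambda = 1/h$ is a regularization parameter ($h$ is the generalized learning rate); $W_2^2(\mu, \nu)$ denotes the $2$-Wasserstein distance between $\mu$ and $\nu$; $\mathcal{P}_s$ is the space of distributions with finite 2nd-order moments; $H(\nu)$ is the entropy of $\nu$; and $\mathrm{KL}(\nu \,\|\, p_d) = \mathbb{E}_{x \sim \nu}[\log \nu(x) - \log p_d(x)]$ is the Kullback-Leibler (KL) divergence. It is not difficult to see that discretized WGF is essentially optimizing the KL divergence with a proximal descent scheme, using the $2$-Wasserstein distance as the proximity metric.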
We denote $\mu_h^* = \lim_{l \to \infty} \mu_l$ with generalized learning rate $h$. It is well known that $\lim_{h \to 0} \mu_h^* = p_d$ (Chen et al., 2018a), that is to say the induced model distribution $\mu_l$ asymptotically converges to the data distribution $p_d$. In our case, instead of using as the loss function, we define a surrogate loss using its upper bound , where the inequality holds because (6) converges to . When our model distribution $\mu$ is parameterized by $\theta$, $\mu_l$ can be solved with stochastic updates on $\theta$ based on the following equation with stepsize $\eta$:
$$\theta \leftarrow \theta - \eta\, \nabla_\theta \Big\{ \mathrm{KL}(\mu_\theta \,\|\, p_d) + \frac{\lambda}{2} W_2^2(\mu_{l-1}, \mu_\theta) \Big\}. \qquad (7)$$
Unfortunately, (7) is an infeasible update as we do not know $p_d$. However, we argue that this update is still locally valid when the current model approximation $\mu_\theta$ is close to $p_d$. To see this, recall that the KL-divergence is a natural Riemannian metric on the space of probability measures (Amari, 1985), therefore it is locally symmetric. So we can safely replace the term $\mathrm{KL}(\mu_\theta \,\|\, p_d)$ with $\mathrm{KL}(p_d \,\|\, \mu_\theta)$ when $\mu_\theta$ is close to $p_d$. This recovers the loss function derived in Section 2.2 as $\mathcal{L}_{\text{MLE}} + \gamma \mathcal{L}_{\text{seq}} - H(p_d)$, where $H(p_d)$ is the entropy of $p_d$, independent of $\theta$, and $\gamma = \lambda / 2$. This justifies the use of our proposed scheme in a model-refinement stage, where the model distribution $\mu_\theta$ is sufficiently close to $p_d$. Empirically, we have observed that our scheme also improves training even when $\mu_\theta$ is distant from $p_d$. While the above justification is developed based on Euclidean transport, other non-Euclidean costs such as cosine distance usually yield better empirical performance as they are better adapted to the geometry of sequence data.
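The identity behind this replacement can be written out explicitly. As a notational assumption for this note, let $p_d$ be the data distribution, $\mu_\theta$ the model distribution, $H(p_d) = -\mathbb{E}_{x \sim p_d}[\log p_d(x)]$ the entropy of the data distribution, and $\mathcal{L}_{\text{MLE}}(\theta) = -\mathbb{E}_{x \sim p_d}[\log \mu_\theta(x)]$ the expected negative log-likelihood. The reversed KL term then splits into the MLE loss and a constant:

```latex
\begin{aligned}
\mathrm{KL}(p_d \,\|\, \mu_\theta)
  &= \mathbb{E}_{x \sim p_d}\big[\log p_d(x) - \log \mu_\theta(x)\big] \\
  &= -\mathbb{E}_{x \sim p_d}\big[\log \mu_\theta(x)\big] - H(p_d) \\
  &= \mathcal{L}_{\text{MLE}}(\theta) - H(p_d).
\end{aligned}
```

Since $H(p_d)$ does not depend on $\theta$, the gradient of the reversed KL term coincides with the gradient of the MLE loss, which can be estimated from training samples alone.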
4 Related Work and Discussion
Optimal transport in NLP
Although widely used in other fields such as computer vision (Rubner et al., 2000), OT has only been applied in NLP recently. Pioneered by the work of Kusner et al. (2015) on word mover’s distance (WMD), existing literature primarily considers OT either on a macroscopic level like topic modeling (Huang et al., 2016), or a microscopic level such as word embedding (Xu et al., 2018). Euclidean distance, rather than more general distances, is often used as the transportation cost, in order to approximate the OT distance with the Kantorovich-Rubinstein duality (Gulrajani et al., 2017) or a more efficient yet less accurate lower bound (Kusner et al., 2015). Our work employs OT for mesoscopic sequence-to-sequence models, presenting an efficient IPOT-based implementation to enable end-to-end learning for general cost functions. The proposed OT not only refines the word embedding matrix but also improves the Seq2Seq model (see Appendix H for details).
RL for sequence generation
A commonly employed strategy for sequence-level training is via reinforcement learning (RL). Typically, this type of method employs RL by considering the evaluation metrics as the reward to guide the generation (Ranzato et al., 2016, Bahdanau et al., 2017, Rennie et al., 2017, Zhang et al., 2018a, Huang et al., 2018). However, these approaches often introduce procedures that may yield large-variance gradients, resulting in unstable training. Moreover, it has been recognized that these automatic metrics may have poor correlation with human judgments in many scenarios (Wang et al., 2018b). As such, reinforcing the evaluation metrics can potentially boost the quantitative scores but not necessarily improve the generation quality, as such metrics usually encourage exact overlap of text snippets rather than semantic similarity. Some nonstandard metrics like SPICE (Anderson et al., 2016) also consider semantic similarity; however, they also cannot learn a good model on their own (Liu et al., 2017). Unlike RL methods, our method requires no human-defined rewards, thus preventing the model from over-fitting to one specific metric. As a concrete example, the two semantically similar sentences “do you want to have lunch with us” and “would you like to join us for lunch” would be considered a bad match based on automatic metrics like BLEU, but would be rated as a reasonable match under the OT objective.
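The lunch example can be checked directly by counting shared n-grams, the quantity that overlap-based metrics such as BLEU build on. The sketch below is not a BLEU implementation; it only lists the shared unigrams and bigrams, and the helper name `ngrams` is an assumption.

```python
def ngrams(tokens, n):
    """Set of contiguous n-grams (as tuples) in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


reference = "do you want to have lunch with us".split()
candidate = "would you like to join us for lunch".split()

shared_unigrams = ngrams(reference, 1) & ngrams(candidate, 1)
shared_bigrams = ngrams(reference, 2) & ngrams(candidate, 2)

print(sorted(w for (w,) in shared_unigrams))  # ['lunch', 'to', 'us', 'you']
print(len(shared_bigrams))                    # 0
```

The two sentences share four words but no two-word phrase. Because BLEU combines its n-gram precisions through a geometric mean, a bigram precision of zero drives the unsmoothed score to zero despite the near-identical meaning.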
GAN for sequence generation
Another type of method adopts the framework of generative adversarial networks (GANs) (Goodfellow et al., 2014), by providing sequence-level guidance based on a learned discriminator (or critic). To construct such a loss, Yu et al. (2017), Lin et al. (2017), Guo et al. (2018), and Fedus et al. (2018) combine the policy-gradient algorithm with the original GAN training procedure, while Zhang et al. (2017) and Chen et al. (2018b) use maximum mean discrepancy (MMD) and a so-called feature-mover’s distance, respectively, to match features of real and generated sentences. However, mode-collapse and gradient-vanishing problems make the training of these methods challenging. Unlike GAN methods, since no min-max games are involved, the training of our model is more robust. Moreover, compared with GAN, no additional critic is introduced in our model, which makes the model complexity comparable to MLE and less demanding to tune.
5 Experiments
We consider a wide range of NLP tasks to experimentally validate the proposed model, and benchmark it against other strong baselines. All experiments are implemented with TensorFlow and run on a single NVIDIA TITAN X GPU. Code for our experiments is available from https://github.com/LiqunChen0606/Seq2Seq-OT.
5.1 Neural machine translation
We test our model on two datasets: (i) a small-scale English-Vietnamese parallel corpus of TED-talks, which has K sentence pairs from the IWSLT Evaluation Campaign (Cettolo et al., 2015); and (ii) a large-scale English-German parallel corpus with 4.5M sentence pairs, from the WMT Evaluation Campaign (Vaswani et al., 2017). We used Google’s Neural Machine Translation (GNMT) model (Wu et al., 2016) as our baseline, following the architecture and hyper-parameter settings from the GNMT repository (https://github.com/tensorflow/nmt) to make a fair comparison. For the English-Vietnamese (i.e., VI-EN and EN-VI) tasks, a 2-layer LSTM with 512 units in each layer is adopted as the decoder, with a 1-layer bidirectional-LSTM adopted as the encoder; the word embedding dimension is set to 512. Attention proposed in Luong et al. (2015a) is used together with a dropout rate of . For the English-German (i.e., DE-EN and EN-DE) tasks, we train a 4-layer LSTM decoder with 1024 units in each layer. A 2-layer bidirectional-LSTM is used as the encoder, and we adopt the attention used in Wu et al. (2016). The word embedding dimension is set to 1024. Standard stochastic gradient descent is used for training with a decreasing learning rate, and we set for the IPOT algorithm. More training details are provided in Appendix A. In terms of wall-clock time, our model only slightly increases training time. For the German-English task, it took roughly 5.5 days to train the GNMT model, and 6 days to train our proposed model from scratch, which amounts to an increase of only roughly 9%.
We apply different combinations of $\mathcal{L}_{\text{seq}}$ and $\mathcal{L}_{\text{copy}}$ to fine-tune the pre-trained GNMT model (Luong et al., 2018) and the results are summarized in Tables 1 and 2. Additional results for training from scratch are provided in Appendix B. The proposed OT approach consistently improves upon MLE training in all experimental setups.
We additionally tested our model with a more expressive -layer LSTM model on the EN-DE task. The BLEU score of our method is on NT2015. For reference, the GNMT model (same architecture) and a Transformer model (Vaswani et al., 2017) respectively report a score of and . Our method outperforms both baselines, and it is also competitive to the state-of-the-art BLEU score reported by Vaswani et al. (2017) using a highly sophisticated model design.
The German-to-English translation examples are provided in Table 3 for qualitative assessment. The main differences among the reference translation, our translation and the GNMT translation are highlighted in blue and red. Our OT-augmented translations are more faithful to the reference than their MLE-trained counterparts. The soft-copying mechanism introduced by OT successfully maintains the key semantic content from the reference. Presumably, the OT loss helps refine the word embedding matrices, and promotes matching between words with similar semantic meanings. Vanilla GNMT translations, on the other hand, ignore or misinterpret some of the key terms. More examples are provided in Appendix E.
We also test the robustness of our method with respect to the hyper-parameter $\gamma$ (the weight on the OT loss). Results are summarized in Appendix C. Our OT-augmented model is robust to the choice of $\gamma$. The test BLEU scores are consistently higher than the baseline for the values of $\gamma$ considered.
5.2 Abstractive text summarization
We consider two datasets for abstractive text summarization. The first one is the Gigaword corpus (Graff et al., 2003), which has around M training samples, K validation samples, and test samples. The input pairs consist of the first sentence and the headline of an article. We also evaluate our model on the DUC-2004 test set (Over et al., 2007), which consists of 500 news articles. Our implementation of the Seq2Seq model adopts a simple architecture, which consists of a bidirectional GRU encoder and a GRU decoder with attention mechanism (Bahdanau et al., 2015) (https://github.com/thunlp/TensorFlow-Summarization).
Results are summarized in Tables 4 and 5. Our OT-regularized model outperforms the respective baselines. The state-of-the-art ROUGE result for the Gigaword dataset is reported by Wang et al. (2018a). However, much more complex architectures are used to achieve that score. We use a relatively simple Seq2Seq model in our experiments to demonstrate the versatility of the proposed OT method. Applying it to (i) more complicated models and (ii) more recent datasets such as CNN/DailyMail (See et al., 2017) will be interesting future work.
Summarization examples are provided in Appendix D. Similar to the machine translation task, our proposed method captures the key semantic information in both the source and reference sentences.
5.3 Image captioning
We also consider an image captioning task using the COCO dataset (Lin et al., 2014), which contains 123,287 images in total, each annotated with at least 5 captions. Following Karpathy’s split (Karpathy & Fei-Fei, 2015), 113,287 images are used for training and 5,000 images each are used for validation and testing. We follow the implementation of Show, Attend and Tell (Xu et al., 2015) (https://github.com/DeepRNN/image_captioning), and use Resnet-152 (He et al., 2016), image tagging (Gan et al., 2017), and FastRCNN (Anderson et al., 2018) as the image feature extractor (encoder), and a one-layer LSTM with 1024 units as the decoder. The word embedding dimension is set to 512. Note that in this task the inputs are images instead of sequences; therefore $\mathcal{L}_{\text{copy}}$ cannot be applied.
We report BLEU-$k$ ($k$ from 1 to 4) (Papineni et al., 2002), CIDEr (Vedantam et al., 2015), and METEOR (Banerjee & Lavie, 2005) scores, and the results with different settings are shown in Table 6. Consistent across-the-board improvements are observed with the introduction of the OT loss, in contrast to the RL-based methods where drastic improvements can only be observed for the optimized evaluation metric (Rennie et al., 2017). Consequently, the OT loss is a more reliable method to improve the quality of generated captions when compared with RL methods that aim to optimize, and therefore potentially overfit, one specific metric. Examples of generated captions are provided in Appendix F.
6 Conclusion
This work is motivated by a major deficiency in training Seq2Seq models: the MLE training loss does not operate at the sequence level. Inspired by soft bipartite matching, we propose the use of optimal transport as a sequence-level loss to improve Seq2Seq learning. By applying this new method to machine translation, text summarization, and image captioning, we demonstrate that our proposed model improves performance over strong baselines. We believe the proposed method is a general framework, and will be useful to other sequence generation tasks as well, such as conversational response generation (Li et al., 2017, Zhang et al., 2018c), which is left as future work.
References
- Amari (1985) Shun-ichi Amari. Differential-geometrical methods in statistics, volume 28. Springer Science & Business Media, 1985.
- Ambrosio et al. (2005) Luigi Ambrosio, Nicola Gigli, and Giuseppe Savaré. Gradient flows: in metric spaces and in the space of probability measures. 2005.
- Anderson et al. (2016) Peter Anderson, Basura Fernando, Mark Johnson, and Stephen Gould. SPICE: Semantic propositional image caption evaluation. In European Conference on Computer Vision, pp. 382–398. Springer, 2016.
- Anderson et al. (2018) Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. Bottom-up and top-down attention for image captioning and visual question answering. In CVPR, 2018.
- Arjovsky et al. (2017) Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In ICML, 2017.
- Bahdanau et al. (2015) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. In ICLR, 2015.
- Bahdanau et al. (2017) Dzmitry Bahdanau, Philemon Brakel, Kelvin Xu, Anirudh Goyal, Ryan Lowe, Joelle Pineau, Aaron Courville, and Yoshua Bengio. An actor-critic algorithm for sequence prediction. In ICLR, 2017.
- Banerjee & Lavie (2005) Satanjeev Banerjee and Alon Lavie. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In ACL Workshop, 2005.
- Bengio et al. (2015) Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. Scheduled sampling for sequence prediction with recurrent neural networks. In NIPS, 2015.
- Boyd & Vandenberghe (2004) Stephen Boyd and Lieven Vandenberghe. Convex optimization. Cambridge university press, 2004.
- Cettolo et al. (2015) Mauro Cettolo, Jan Niehues, Sebastian Stüker, Luisa Bentivogli, Roldano Cattoni, and Marcello Federico. The IWSLT 2015 evaluation campaign. In IWSLT 2015, International Workshop on Spoken Language Translation, 2015.
- Chen et al. (2018a) Changyou Chen, Ruiyi Zhang, Wenlin Wang, Bai Li, and Liqun Chen. A unified particle-optimization framework for scalable Bayesian sampling. In UAI, 2018a.
- Chen et al. (2018b) Liqun Chen, Shuyang Dai, Chenyang Tao, Haichao Zhang, Zhe Gan, Dinghan Shen, Yizhe Zhang, Guoyin Wang, Ruiyi Zhang, and Lawrence Carin. Adversarial text generation via feature-mover’s distance. In NeurIPS, 2018b.
- Cho et al. (2014) Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In EMNLP, 2014.
- Chopra et al. (2016) Sumit Chopra, Michael Auli, and Alexander M Rush. Abstractive sentence summarization with attentive recurrent neural networks. In NAACL, 2016.
- Craig (2014) Katy Craig. The exponential formula for the Wasserstein metric. PhD thesis, The State University of New Jersey, 2014.
- Cuturi (2013) Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In NIPS, 2013.
- Fedus et al. (2018) William Fedus, Ian Goodfellow, and Andrew M Dai. MaskGAN: Better text generation via filling in the ____. In ICLR, 2018.
- Gan et al. (2017) Zhe Gan, Chuang Gan, Xiaodong He, Yunchen Pu, Kenneth Tran, Jianfeng Gao, Lawrence Carin, and Li Deng. Semantic compositional networks for visual captioning. In CVPR, 2017.
- Genevay et al. (2018) Aude Genevay, Gabriel Peyré, and Marco Cuturi. Learning generative models with sinkhorn divergences. In AISTATS, 2018.
- Goodfellow et al. (2014) Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In NIPS, 2014.
- Goodfellow et al. (2016) Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep learning, volume 1. MIT press Cambridge, 2016.
- Graff et al. (2003) David Graff, Junbo Kong, Ke Chen, and Kazuaki Maeda. English gigaword. Linguistic Data Consortium, Philadelphia, 2003.
- Gu et al. (2016) Jiatao Gu, Zhengdong Lu, Hang Li, and Victor OK Li. Incorporating copying mechanism in sequence-to-sequence learning. In ACL, 2016.
- Gulrajani et al. (2017) Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved training of Wasserstein GANs. In NIPS, 2017.
- Guo et al. (2018) Jiaxian Guo, Sidi Lu, Han Cai, Weinan Zhang, Yong Yu, and Jun Wang. Long text generation via adversarial training with leaked information. In AAAI, 2018.
- He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
- Hochreiter & Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 1997.
- Huang et al. (2016) Gao Huang, Chuan Guo, Matt J Kusner, Yu Sun, Fei Sha, and Kilian Q Weinberger. Supervised word mover’s distance. In NIPS, 2016.
- Huang et al. (2018) Qiuyuan Huang, Zhe Gan, Asli Celikyilmaz, Dapeng Wu, Jianfeng Wang, and Xiaodong He. Hierarchically structured reinforcement learning for topically coherent visual story generation. arXiv preprint arXiv:1805.08191, 2018.
- Huszár (2015) Ferenc Huszár. How (not) to train your generative model: scheduled sampling, likelihood, adversary? In arXiv:1511.05101, 2015.
- Jang et al. (2016) Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with Gumbel-softmax. In arXiv:1611.01144, 2016.
- Karpathy & Fei-Fei (2015) Andrej Karpathy and Li Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In CVPR, 2015.
- Kuhn (1955) Harold W Kuhn. The Hungarian method for the assignment problem. Naval research logistics quarterly, 1955.
- Kusner et al. (2015) Matt Kusner, Yu Sun, Nicholas Kolkin, and Kilian Weinberger. From word embeddings to document distances. In ICML, 2015.
- Lamb et al. (2016) Alex Lamb, Anirudh Goyal, Ying Zhang, Saizheng Zhang, Aaron C Courville, and Yoshua Bengio. Professor forcing: A new algorithm for training recurrent networks. In NIPS, 2016.
- Li et al. (2017) Jiwei Li, Will Monroe, Tianlin Shi, Alan Ritter, and Dan Jurafsky. Adversarial learning for neural dialogue generation. In EMNLP, 2017.
- Li & Montufar (2018) W. Li and G. Montufar. Natural gradient via optimal transport. arXiv preprint arXiv:1803.07033, 2018.
- Lin (2004) Chin-Yew Lin. Rouge: A package for automatic evaluation of summaries. Text Summarization Branches Out, 2004.
- Lin et al. (2017) Kevin Lin, Dianqi Li, Xiaodong He, Zhengyou Zhang, and Ming-Ting Sun. Adversarial ranking for language generation. In NIPS, 2017.
- Lin et al. (2014) Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
- Liu et al. (2018) Hao Liu, Yihao Feng, Yi Mao, Dengyong Zhou, Jian Peng, and Qiang Liu. Action-dependent control variates for policy optimization via Stein’s identity. In ICLR, 2018.
- Liu et al. (2017) Siqi Liu, Zhenhai Zhu, Ning Ye, Sergio Guadarrama, and Kevin Murphy. Improved image captioning via policy gradient optimization of spider. In ICCV, 2017.
- Lu et al. (2017) Jiasen Lu, Caiming Xiong, Devi Parikh, and Richard Socher. Knowing when to look: Adaptive attention via a visual sentinel for image captioning. In CVPR, 2017.
- Luise et al. (2018) Giulia Luise, Alessandro Rudi, Massimiliano Pontil, and Carlo Ciliberto. Differential properties of sinkhorn approximation for learning with wasserstein distance. arXiv:1805.11897, 2018.
- Luong et al. (2015a) Minh-Thang Luong, Hieu Pham, and Christopher D Manning. Effective approaches to attention-based neural machine translation. arXiv:1508.04025, 2015a.
- Luong et al. (2015b) Thang Luong, Hieu Pham, and Christopher D Manning. Bilingual word representations with monolingual quality in mind. In Proceedings of the 1st Workshop on Vector Space Modeling for Natural Language Processing, 2015b.
- Luong et al. (2018) Thang Luong, Eugene Brevdo, and Rui Zhao. Neural machine translation (seq2seq) tutorial, 2018. URL https://github.com/tensorflow/nmt.
- Maddison et al. (2017) Chris J Maddison, Andriy Mnih, and Yee Whye Teh. The concrete distribution: A continuous relaxation of discrete random variables. ICLR, 2017.
- Over et al. (2007) Paul Over, Hoa Dang, and Donna Harman. DUC in context. Information Processing & Management, 2007.
- Papineni et al. (2002) Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In ACL, 2002.
- Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher Manning. Glove: Global vectors for word representation. In EMNLP, 2014.
- Peyré et al. (2017) Gabriel Peyré, Marco Cuturi, et al. Computational optimal transport. Technical report, 2017.
- Ranzato et al. (2016) Marc’Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. Sequence level training with recurrent neural networks. In ICLR, 2016.
- Rennie et al. (2017) Steven J Rennie, Etienne Marcheret, Youssef Mroueh, Jarret Ross, and Vaibhava Goel. Self-critical sequence training for image captioning. In CVPR, 2017.
- Rubner et al. (2000) Yossi Rubner, Carlo Tomasi, and Leonidas J Guibas. The earth mover’s distance as a metric for image retrieval. IJCV, 2000.
- Rush et al. (2015) Alexander M Rush, Sumit Chopra, and Jason Weston. A neural attention model for abstractive sentence summarization. In EMNLP, 2015.
- Salimans et al. (2018) Tim Salimans, Han Zhang, Alec Radford, and Dimitris Metaxas. Improving GANs using optimal transport. In ICLR, 2018.
- See et al. (2017) Abigail See, Peter J Liu, and Christopher D Manning. Get to the point: summarization with pointer-generator networks. ACL, 2017.
- Sutskever et al. (2014) Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In NIPS, 2014.
- Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NIPS, 2017.
- Vedantam et al. (2015) Ramakrishna Vedantam, C Lawrence Zitnick, and Devi Parikh. Cider: Consensus-based image description evaluation. In CVPR, 2015.
- Villani (2008) Cédric Villani. Optimal Transport: Old and New. Grundlehren der mathematischen Wissenschaften. Springer, 2008.
- Vinyals et al. (2015) Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and tell: A neural image caption generator. In CVPR, 2015.
- Wang et al. (2018a) Li Wang, Junlin Yao, Yunzhe Tao, Li Zhong, Wei Liu, and Qiang Du. A reinforced topic-aware convolutional sequence-to-sequence model for abstractive text summarization. IJCAI, 2018a.
- Wang et al. (2018b) Xin Wang, Wenhu Chen, Yuan-Fang Wang, and William Yang Wang. No metrics are perfect: Adversarial reward learning for visual storytelling. In ACL, 2018b.
- Wiseman & Rush (2016) Sam Wiseman and Alexander M Rush. Sequence-to-sequence learning as beam-search optimization. In EMNLP, 2016.
- Wu et al. (2016) Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv:1609.08144, 2016.
- Xie et al. (2018) Yujia Xie, Xiangfeng Wang, Ruijia Wang, and Hongyuan Zha. A fast proximal point method for Wasserstein distance. In arXiv:1802.04307, 2018.
- Xu et al. (2018) Hongteng Xu, Wenlin Wang, Wei Liu, and Lawrence Carin. Distilled Wasserstein learning for word embedding and topic modeling. In NeurIPS, 2018.
- Xu et al. (2015) Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron C Courville, Ruslan Salakhutdinov, Richard S Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. In ICML, 2015.
- You et al. (2016) Quanzeng You, Hailin Jin, Zhaowen Wang, Chen Fang, and Jiebo Luo. Image captioning with semantic attention. In CVPR, 2016.
- Yu et al. (2017) Lantao Yu, Weinan Zhang, Jun Wang, and Yong Yu. SeqGAN: Sequence generative adversarial nets with policy gradient. In AAAI, 2017.
- Zhang et al. (2018a) Ruiyi Zhang, Changyou Chen, Zhe Gan, Wenlin Wang, Liqun Chen, Dinghan Shen, Guoyin Wang, and Lawrence Carin. Sequence generation with guider network. arXiv preprint arXiv:1811.00696, 2018a.
- Zhang et al. (2018b) Ruiyi Zhang, Changyou Chen, Chunyuan Li, and Lawrence Carin. Policy optimization as wasserstein gradient flows. In ICML, 2018b.
- Zhang et al. (2017) Yizhe Zhang, Zhe Gan, Kai Fan, Zhi Chen, Ricardo Henao, Dinghan Shen, and Lawrence Carin. Adversarial feature matching for text generation. In ICML, 2017.
- Zhang et al. (2018c) Yizhe Zhang, Michel Galley, Jianfeng Gao, Zhe Gan, Xiujun Li, Chris Brockett, and Bill Dolan. Generating informative and diverse conversational responses via adversarial information maximization. In NeurIPS, 2018c.
Appendix A Training Details for NMT task
The following training details largely follow the instructions from the TensorFlow/nmt GitHub repository:
-layer LSTMs of units with bidirectional encoder (i.e., bidirectional layer for the encoder), embedding dim is 512. LuongAttention (Luong et al., 2015a) (scale=True) is used together with dropout keep-prob=. All parameters are uniformly initialized. We use SGD with learning rate as follows: train for K steps (around epochs); after K steps, we start halving learning rate every K step.
The training hyperparameters are similar to the EN-VI experiments except for the following details. The data is split into subword units using BPE (32K operations). We train -layer LSTMs of units with bidirectional encoder (i.e., bidirectional layers for the encoder), embedding dimension is . We train for K steps (around epochs); after K steps, we start halving learning rate every K step.
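The step-based learning-rate schedule described in this appendix (a constant rate for an initial number of steps, then halving at a fixed interval) can be sketched as follows. This is a minimal illustration rather than the training code used here: the function name, its parameters, and the numeric values in the usage lines are hypothetical, and the sketch assumes the first halving takes effect one full interval after decay begins.

```python
def learning_rate(step, base_lr, start_decay_step, decay_every):
    """Step-wise schedule: constant, then halved every `decay_every` steps.

    All argument names are illustrative; the actual step counts are not
    specified in this sketch and must be supplied by the caller.
    """
    if step < start_decay_step:
        # Warm phase: keep the initial learning rate unchanged.
        return base_lr
    # Number of complete decay intervals elapsed since decay started.
    # Assumption: no halving is applied exactly at start_decay_step.
    num_halvings = (step - start_decay_step) // decay_every
    return base_lr * (0.5 ** num_halvings)


# Hypothetical values, chosen only to show the shape of the schedule.
for s in (0, 8000, 9000, 10500):
    print(s, learning_rate(s, base_lr=1.0, start_decay_step=8000, decay_every=1000))
```

Expressing the schedule as a pure function of the step count makes it easy to reuse for both language pairs, which differ only in the step counts passed in.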
Appendix B End-to-End neural machine translation
|VI-EN: GNMT+OT (Ours)||21.9||25.5|
|EN-VI: GNMT+OT (Ours)||24.1||26.5|
|DE-EN: GNMT+OT (Ours)||28.8||29.5|
|EN-DE: GNMT+OT (Ours)||24.1||26.2|
Appendix C BLEU score for different hyper-parameters
We tested different for the OT loss term and summarized the results in Figure 3. gave the best performance for the EN-VI experiment. The results are robust with respect to the choice of .
Appendix D Summarization examples
|Source:||japan ’s nec corp. and UNK computer corp. of the united states said wednesday they had agreed to join forces in supercomputer sales .|
|Reference:||nec UNK in computer sales tie-up|
|Baseline:||nec UNK computer corp.|
|Ours:||nec UNK computer corp sales supercomputer.|
|Source:||five east timorese youths who scaled the french embassy ’s fence here thursday , left the embassy on their way to portugal friday .|
|Reference:||UNK latest east timorese asylum seekers leave for portugal|
|Baseline:||five east timorese youths leave embassy|
|Ours:||five east timorese seekers leave embassy for portugal|
|Source:||the us space shuttle atlantis separated from the orbiting russian mir space station early saturday , after three days of test runs for life in a future space facility , nasa announced .|
|Reference:||atlantis mir part ways after three-day space collaboration by emmanuel UNK|
|Baseline:||atlantis separate from mir|
|Ours:||atlantis separate from mir space by UNK|
|Source:||australia ’s news corp announced monday it was joining brazil ’s globo , mexico ’s grupo televisa and the us tele-communications inc. in a venture to broadcast ### channels via satellite to latin america .|
|Reference:||news corp globo televisa and tele-communications in satellite venture|
|Baseline:||australia ’s news corp joins brazil|
|Ours:||australia ’s news corp joins brazil in satellite venture|
Examples for abstractive summarization are provided in Table 9.
Appendix E Neural machine translation examples
In Table 10, we show more examples for comparison. In these examples, the sentences generated by our model are more faithful to the reference sentences.
Appendix F Image Caption examples
Table 4 shows the comparison of our model with other baselines.
Appendix G Encoding model belief with Softmax and Gumbel-softmax
|VI-EN: GNMT+OT (GS)||20.9||24.5|
|VI-EN: GNMT+OT (softmax)||21.8||24.3|
|VI-EN: GNMT+OT (ours)||21.9||25.5|
|EN-VI: GNMT+OT (GS)||23.3||25.7|
|EN-VI: GNMT+OT (softmax)||23.5||26.0|
|EN-VI: GNMT+OT (ours)||24.1||26.5|
To find the best differentiable sequence-generating mechanism, we also experimented with Softmax and Gumbel-softmax (a.k.a. the concrete distribution). Detailed results are summarized in Table 11. The Softmax-based and Gumbel-softmax-based OT models provide smaller BLEU gains over the baseline MLE model than our approach, and in some situations the performance even degrades. We hypothesize that this is because Softmax encodes more ambiguity, while Gumbel-softmax has a larger variance due to the extra random variable involved; both in turn hurt learning. A more involved variance-reduction scheme might offset this negative impact for Gumbel-softmax, which we leave as future work.
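The difference between the two relaxations compared above can be illustrated with a short sketch. It is not the implementation used in these experiments: the function names, the temperature values, and the example logits are assumptions made for illustration. The deterministic softmax returns the same weights on every call, while the Gumbel-softmax adds random Gumbel(0, 1) noise to the logits first, which is the source of the extra variance discussed above.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits (deterministic)."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def gumbel_softmax(logits, temperature=1.0, rng=random):
    """Gumbel-softmax relaxation: perturb logits with Gumbel(0, 1) noise."""
    noisy = []
    for x in logits:
        u = max(rng.random(), 1e-12)  # guard against log(0)
        g = -math.log(-math.log(u))   # Gumbel(0, 1) sample
        noisy.append(x + g)
    return softmax(noisy, temperature)


logits = [2.0, 1.0, 0.1]                  # hypothetical decoder logits
print(softmax(logits))                    # identical on every call
rng = random.Random(0)                    # seeded only for repeatability
print(gumbel_softmax(logits, 0.5, rng))   # changes from draw to draw
print(gumbel_softmax(logits, 0.5, rng))
```

Both functions return non-negative weights that sum to one, so either can serve as a soft, differentiable stand-in for a discrete word choice; lowering the temperature makes the weights more peaked.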
|Metric||Baseline||Case 1||Case 2|
Appendix H OT improves both model and word embedding matrix
To identify the source of the performance gains, we designed a toy sequence-to-sequence experiment to show that OT helps to refine both the language model and the word embedding matrix. We use the English corpus from the WMT dataset (from our machine translation task) and train an auto-encoder (Seq2Seq model) on it. We evaluate the reconstruction quality with the BLEU score. In Case 1, we stop the OT gradient from flowing back to the sequence model (so that it only affects the word embedding matrix), while in Case 2 the gradient from OT can affect the entire model. Detailed results are shown in Table 12. Case 1 is better than the baseline model, which means OT helps to refine the word embedding matrix. Case 2 achieves the highest BLEU, which implies OT also helps to improve the language model.