Disentangled Representation Learning for Text Style Transfer
This paper tackles the problem of disentangling the latent variables of style and content in language models. We propose a simple, yet effective approach, which incorporates auxiliary objectives: a multi-task classification objective, and dual adversarial objectives for label prediction and bag-of-words prediction, respectively. We show, both qualitatively and quantitatively, that the style and content are indeed disentangled in the latent space, using this approach. This disentangled latent representation learning method is applied to attribute (e.g. style) transfer on non-parallel corpora. We achieve similar content preservation scores compared to previous state-of-the-art approaches, and significantly better style-transfer strength scores. Our code is made publicly available for replicability and extension purposes (https://github.com/vineetjohn/linguistic-style-transfer).
Vineet John University of Waterloo firstname.lastname@example.org Lili Mou AdeptMind Research email@example.com firstname.lastname@example.org
Hareesh Bahuleyan University of Waterloo email@example.com Olga Vechtomova University of Waterloo firstname.lastname@example.org
Copyright © 2015, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
The neural network has been a successful learning machine during the past decade due to its highly expressive modeling capability, which is a consequence of multiple layers of non-linear transformations of input features. Such transformations, however, make intermediate features ‘latent’, in that they do not have explicit meaning and are not explainable. Therefore, neural networks are usually treated as black-box machinery.
Disentangling the latent space of neural networks has become an increasingly important research topic. In the image domain, for example, Chen et al. (2016) use adversarial and information maximization objectives to produce interpretable latent representations that can be tweaked to adjust writing style for handwritten digits, as well as lighting and orientation for face models. Kulkarni et al. (2015) utilize a convolutional autoencoder to achieve the same objective. However, this problem is not well explored in natural language processing.
In this paper, we address the problem of disentangling the latent space of neural networks for text generation. Our model is built on an autoencoder that encodes the latent space (vector representation) of a sentence by learning to reconstruct the sentence itself. We would like the latent space to be disentangled with respect to different features, namely, style and content.
To accomplish this, we propose a simple approach that combines multi-task and adversarial objectives. We artificially divide the latent representation into two parts: style space and content space. We learn model parameters that produce separate style and content latent spaces from the encoded sentence representation. The multi-task objective operates on the style space to ensure it encodes style information. The adversarial style objective, on the contrary, operates on the content space to minimize the predictability of style. In addition, the bag-of-words adversary minimizes the predictability of original words of the sentence in the style space. In this way, the style and content variables can be disentangled from each other.
These representation learning objectives can be directly used for style-transfer in sentence generation (?; ?; ?), in which a model generates a sentence with the same content, but a different style. We simply use an autoencoder to encode the content vector of a sentence, and ignore its encoded style vector. We then infer, from the training data, an empirical embedding of the style that we want to transfer to. The encoded content vector and the inferred style vector are concatenated and fed to the decoder. Using this grafting technique, we generate a new sentence similar in content but different in style.
We conducted experiments on two customer review benchmark datasets. Both qualitative and quantitative results show that the style vector does contain most style information, whereas the content vector contains little (if any). In the empirical style-transfer evaluations, we achieve significantly better style-transfer strength scores than previous results, while obtaining better or comparable content preservation scores. We also show, using ablation tests, that the auxiliary losses can be combined well, each playing its own role in disentangling the latent space.
2 Related Work
Disentanglement of latent spaces has been explored in the image processing domain in recent years, and researchers have successfully disentangled rotation features, color features, etc. (?; ?). Some image characteristics (e.g. artistic style) can be captured well by certain statistics, for example by minimizing the L2 loss between a noise image and a pair of style and content images (?). In other works, researchers adopt data augmentation techniques to learn a disentangled latent space (?; ?).
In natural language processing, the definition of ‘style’ itself is vague, and as a convenient starting point, NLP researchers typically treat sentiment as a salient style of text. Hu et al. (2017) manage to control the sentiment by using discriminators to reconstruct sentiment and content from generated sentences. However, there is no evidence that the latent space would be disentangled by this reconstruction. Shen et al. (2017) use a pair of adversarial discriminators to align the recurrent hidden decoder states of (a) original sentences with a given style and (b) sentences transferred to the given style, thereby performing style-transfer. Fu et al. (2018) propose two approaches, (a) using a style embedding matrix, and (b) using style-specific decoders for style-transfer. They apply an adversarial loss on the encoded space to discourage encoding style in the latent space of an autoencoding model.
Our paper differs from the previous work in that both our style space and content space are encoded from the input. We apply three auxiliary losses to ensure that each space encodes its own information, including a novel bag-of-words adversary that helps disentangle the style space from sentence content. Our disentangled representation learning function can then be directly applied to text style-transfer tasks, as in the aforementioned studies.
In this section, we describe our approach in detail. Our model is built upon an autoencoder with a sequence-to-sequence (Seq2Seq) neural network (Sutskever, Vinyals, and Le 2014), as described in Subsection 3.1. Then, we introduce three auxiliary losses: a multi-task classification loss, a style adversarial loss, and a content adversarial loss in Subsections 3.2, 3.3 and 3.4, respectively. Subsection 3.6 presents the approach to transfer style in the context of natural language generation. Figure 1 and Figure 2 depict the training and style-transferred sentence generation processes of our approach, respectively.
An autoencoder encodes an input to a latent vector space, which is usually of a much lower dimensionality than the input data. From this latent space, it decodes the input itself. By doing so, the autoencoder learns salient and compact representations of data. This serves as our primary learning objective. Besides, we also use the autoencoder for text generation in the style-transfer application.
Let $x = (x_1, x_2, \ldots, x_n)$ be an input sequence with $n$ tokens. The encoder recurrent neural network (RNN) with gated recurrent units (GRU) (Cho et al. 2014) encodes $x$ and obtains a hidden state $h$. Then a decoder RNN generates a sentence, which ideally should be $x$ itself. Suppose that, at time step $t$, the decoder RNN predicts the word $x_t$ with probability $p(x_t \mid h, x_1 \cdots x_{t-1})$. The deterministic autoencoder (DAE) is then trained using a sequence-aggregated cross-entropy loss, given by
$$J_{AE}(\theta_E, \theta_D) = -\sum_{t=1}^{n} \log p(x_t \mid h, x_1 \cdots x_{t-1}) \tag{1}$$
where $\theta_E$ and $\theta_D$ are the parameters of the encoder and decoder, respectively.
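The sequence-aggregated cross-entropy described above can be illustrated with a minimal sketch. It assumes the decoder has already produced, for each time step, the probability it assigns to the reference token; the function name and the example probabilities are hypothetical and are not taken from the released code.

```python
import math


def sequence_cross_entropy(step_probs):
    """Sum of negative log-probabilities assigned to the reference tokens.

    step_probs[t] is the probability the decoder gives to the correct
    token at time step t (hypothetical input for this sketch).
    """
    return -sum(math.log(p) for p in step_probs)


# Hypothetical decoder probabilities for a 4-token sentence.
example_probs = [0.9, 0.8, 0.95, 0.7]
example_loss = sequence_cross_entropy(example_probs)
```

A perfect reconstruction (probability 1 at every step) gives a loss of zero, and the loss grows as the decoder becomes less certain about any token.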
In addition to the deterministic auto-encoding objective presented above, we also implement a variational autoencoder (VAE) (Kingma and Welling 2014). The variational sampling and Kullback-Leibler (KL) divergence (Kullback and Leibler 1951) losses are applied to the style space $s$ and the content space $c$ separately. The weights applied to each KL divergence loss are tuned independently as model hyperparameters.
Equation 2 shows the reconstruction objective that is minimized in the VAE variant of our model.
$$J_{AE}(\theta_E, \theta_D) = -\mathbb{E}_{q_E(s|x)\, q_E(c|x)}\left[\log p(x \mid s, c)\right] + \lambda_{kl(s)}\, \mathrm{KL}\left(q_E(s|x) \,\|\, p(z)\right) + \lambda_{kl(c)}\, \mathrm{KL}\left(q_E(c|x) \,\|\, p(z)\right) \tag{2}$$
where $\lambda_{kl(s)}$ and $\lambda_{kl(c)}$ are the weights for each of the KL divergence terms, and $p(z)$ is the standard normal distribution that both the encoded style and content distributions are aligned to.
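As a rough illustration of how the two weighted KL terms enter the VAE objective, the sketch below assumes that each encoded distribution is a diagonal Gaussian parameterized by a mean and a log-variance, for which the KL divergence to the standard normal has a well-known closed form. This parameterization, the function names, and the weight values are assumptions made for the example and are not stated in the text.

```python
import math


def kl_to_standard_normal(mean, log_var):
    """Closed-form KL(N(mean, diag(exp(log_var))) || N(0, I))."""
    return 0.5 * sum(
        m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mean, log_var)
    )


def vae_objective(recon_nll, style_stats, content_stats,
                  kl_weight_style, kl_weight_content):
    """Reconstruction loss plus separately weighted KL terms.

    style_stats and content_stats are (mean, log_var) pairs; the two
    weights are tuned independently, as described above.
    """
    kl_style = kl_to_standard_normal(*style_stats)
    kl_content = kl_to_standard_normal(*content_stats)
    return (recon_nll
            + kl_weight_style * kl_style
            + kl_weight_content * kl_content)
```

When an encoded distribution already equals the standard normal (zero mean, zero log-variance), its KL term vanishes and only the reconstruction term remains.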
We compare the base deterministic autoencoder with a variational variant because VAEs learn smooth and continuous latent spaces, with few dead zones that cannot be decoded from. Bowman et al. (2016) show this empirically by using the latent space of a text variational autoencoder to interpolate between two valid sentences from a corpus and generate novel sentences from the intermediate points.
Since this objective is used to train the model to reconstruct $x$, it is also called the reconstruction loss. Besides the above reconstruction loss, we design three auxiliary losses to disentangle the latent space $h$. In particular, we hope that $h$ can be separated into two spaces $s$ and $c$, representing style and content respectively, i.e., $h = [s; c]$, where $[\cdot\,; \cdot]$ denotes concatenation. This is accomplished by the auxiliary losses described in the subsequent sections.
3.2 Multi-Task Classification Loss
Our first auxiliary loss ensures the style space does contain style information. We build a classifier on the style space with the objective of predicting the true style label, which is part of the training data.
This loss can be viewed as a multi-task loss, which incentivizes the model to not only decode the sentence, but also predict its sentiment from a fixed subset of the latent variables. Similar multi-task losses are used in previous work for sequence-to-sequence learning (Luong et al. 2015), sentence representation learning (Jernite, Bowman, and Sontag 2017) and sentiment analysis (Balikas, Moura, and Amini 2017), among others.
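In our application, we follow previous work (?; ?; ?) and treat the sentiment as the style of interest. We introduce a multi-label classifier
$$y_s = \mathrm{softmax}(W_{mul}\, s + b_{mul}) \tag{3}$$
where $\theta_{mul} = [W_{mul}; b_{mul}]$ are the classifier’s parameters for multi-task learning, and $y_s$ is the predicted label distribution.
This is trained using a simple cross-entropy loss
$$J_{mul}(\theta_E; \theta_{mul}) = -\sum_{l \in \text{labels}} t_s(l) \log y_s(l) \tag{4}$$
where $\theta_E$ are the encoder’s parameters, and $t_s$ is the true label distribution.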
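The style classifier and its cross-entropy loss can be sketched in a few lines of plain Python. The weight matrix, bias, and style vector below are hypothetical toy values chosen for illustration; in the model they would be learned parameters and an encoder output.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def linear(weights, bias, vec):
    """Affine map: one output per row of the weight matrix."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]


def cross_entropy(true_dist, pred_dist):
    """Cross-entropy between a true and a predicted label distribution."""
    return -sum(t * math.log(p)
                for t, p in zip(true_dist, pred_dist) if t > 0)


# Hypothetical 2-dimensional style vector and 2-class classifier.
style_vec = [0.5, -1.0]
W = [[1.0, 0.0], [0.0, 1.0]]
b = [0.0, 0.0]
predicted = softmax(linear(W, b, style_vec))
loss = cross_entropy([1.0, 0.0], predicted)
```

Because this loss is backpropagated into the encoder, minimizing it pushes style-relevant information into the style part of the latent vector.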
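|Model||Transfer Strength||Content Preservation||Word Overlap||Language Fluency|
|Style Embedding (Fu et al. 2018)||0.182||0.959||0.666||-16.17|

|Model||Transfer Strength||Content Preservation||Word Overlap||Language Fluency|
|Style Embedding (Fu et al. 2018)||0.417||0.933||0.359||-28.13|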
3.3 Adversarial Style Discrimination
The above multi-task loss only operates on the style space, but does not have an effect on the content space $c$.
We therefore apply an adversarial loss to disentangle the content space from style information, inspired by adversarial generation (Goodfellow et al. 2014), adversarial domain adaptation (Liu, Qiu, and Huang 2017), and adversarial style-transfer (Fu et al. 2018).
The idea of the adversarial loss is to introduce an objective that deliberately discriminates the true style label using the content vector $c$. Then the autoencoder is trained to learn a content vector space from which its adversary cannot predict style information.
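Concretely, the adversarial discriminator predicts style by computing a softmax distribution over the possible class labels
$$y_c = \mathrm{softmax}(W_{dis}\, c + b_{dis}) \tag{5}$$
where $\theta_{dis} = [W_{dis}; b_{dis}]$ are the parameters of the adversary, and $y_c$ is the label distribution it predicts from the content vector.
It is trained by minimizing the following cross-entropy objective
$$J_{dis}(\theta_{dis}) = -\sum_{l \in \text{labels}} t_s(l) \log y_c(l) \tag{6}$$
where $t_s$ is the true label distribution.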
The adversarial loss appears similar to the multi-task loss in Equation 4. However, it should be emphasized that, for the adversary, the gradients are not propagated back to the autoencoder, i.e. the variables in $c$ are treated as shallow features.
Having trained an adversary, we would like the autoencoder to be tuned such that $c$ is not discriminative of style. To this end, we use the Shannon entropy of the adversary’s prediction as a training signal, given by
$$J_{adv}(\theta_E) = \mathcal{H}(y_c \mid c; \theta_{dis}) \tag{7}$$
where $\mathcal{H}(p) = -\sum_{l} p(l) \log p(l)$ is the entropy and $y_c$ is the predicted distribution over the style labels. The adversarial objective is maximized, in this phase, with respect to the encoder. This objective helps the autoencoder maximize the uncertainty of the discriminator’s predicted probability distribution.
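To see why maximizing the entropy of the adversary's prediction removes style information from the content space, consider the following small sketch. The example distributions are hypothetical; the entropy definition itself is the standard Shannon entropy with natural logarithms.

```python
import math


def entropy(dist):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in dist if p > 0)


# Hypothetical adversary outputs for a binary style label.
confident = [0.9, 0.1]   # adversary can still read style from content
uncertain = [0.5, 0.5]   # adversary is reduced to guessing
```

The uniform prediction has the highest possible entropy for two labels (log 2), so an encoder that maximizes this quantity drives the adversary toward chance-level predictions.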
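|Objectives||Transfer Strength||Content Preservation||Word Overlap||Language Fluency|
|$J_{AE}$, $J_{mul}$, $J_{adv}$, $J_{badv}$||0.885||0.878||0.197||-14.05|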
3.4 Adversarial Bag-of-Words Discriminator
In addition to the auxiliary losses used above, we also propose a bag-of-words discriminator on the style space to make our approach complete.
The motivation is to emulate the adversarial signal provided by the style discriminator, and do the same for the content. Here, we define the content of the sentence as the words from the original sentence without any words that are discriminative of style.
The input sentences are represented as vectors of the same size as the corpus vocabulary, with each index of the vector denoting the discrete probability of a word’s presence in the sentence. Therefore, this bag-of-words representation consists of only 0s and 1s.
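The binary bag-of-words target described above can be built as follows. This is a minimal sketch with a hypothetical five-word vocabulary; the actual vocabulary and tokenization used in the experiments are not reproduced here.

```python
def bag_of_words(tokens, vocab):
    """Binary presence vector with one entry per vocabulary word."""
    present = set(tokens)
    return [1 if word in present else 0 for word in vocab]


# Hypothetical toy vocabulary and sentence.
vocab = ["the", "food", "is", "great", "bad"]
target = bag_of_words("the food is great".split(), vocab)
```

Repeated words do not change the vector, since each entry records presence rather than count.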
The bag-of-words discriminator uses the style vector $s$ produced by the autoencoder model and tries to predict the true bag-of-words distribution using a set of parameters that are distinct from those of the autoencoder. The discriminator uses logistic regression to predict the probability of each word’s occurrence in the original sentence, between 0 and 1.
$$y_b = \sigma(W_{bow}\, s + b_{bow}) \tag{8}$$
where $\theta_{bow} = [W_{bow}; b_{bow}]$ are the classifier’s parameters for bag-of-words prediction, and $y_b$ is the predicted word distribution.
This objective is trained in a similar manner to the style adversary, using a cross-entropy loss
$$J_{bow}(\theta_{bow}) = -\sum_{w \in \text{vocab}} t_b(w) \log y_b(w) \tag{9}$$
where $t_b$ is the true word distribution.
We also refrain from propagating the effects of this discriminator loss to the encoder parameters, ensuring that the parameters that each adversary and the autoencoder can update are mutually exclusive.
Similar to the style adversary, the empirical Shannon entropy of the predicted distribution is provided as a training signal for the autoencoder model to maximize.
$$J_{badv}(\theta_E) = \mathcal{H}(y_b \mid s; \theta_{bow}) \tag{10}$$
where $y_b$ is the predicted word distribution.
The motivation for this adversarial loss is similar to the one used in the context of the style discriminator. We want to encourage the encoder to learn a representation of style that the bag-of-words discriminator cannot predict most of the original words from.
3.5 Training Process
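The overall loss $J_{ovr}$, used for the autoencoder, thus comprises four distinct objectives: the reconstruction objective, the multi-task objective, and the adversarial objectives for style and content.
$$J_{ovr} = J_{AE}(\theta_E, \theta_D) + \lambda_{mul}\, J_{mul}(\theta_E) - \lambda_{adv}\, J_{adv}(\theta_E) - \lambda_{bow}\, J_{badv}(\theta_E) \tag{11}$$
where $\lambda_{mul}$, $\lambda_{adv}$ and $\lambda_{bow}$ balance the model’s auxiliary losses.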
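One way to read the combined objective is sketched below. The signs are an assumption derived from the text: the reconstruction and multi-task terms are minimized, while the two entropy-based adversarial terms are maximized and therefore enter a minimized loss with a negative sign. The argument names and weight values are hypothetical.

```python
def overall_loss(recon, multitask, style_entropy, bow_entropy,
                 w_mul, w_adv, w_bow):
    """Weighted combination of the four autoencoder objectives.

    The entropy terms are subtracted because the encoder is trained
    to maximize them (assumed sign convention for this sketch).
    """
    return (recon
            + w_mul * multitask
            - w_adv * style_entropy
            - w_bow * bow_entropy)


# Hypothetical loss values and weights.
example = overall_loss(recon=2.0, multitask=0.5,
                       style_entropy=0.6, bow_entropy=3.0,
                       w_mul=1.0, w_adv=0.5, w_bow=0.1)
```

Under this convention, a more confused adversary (higher entropy) lowers the autoencoder's loss, while the discriminators themselves are updated separately on their own cross-entropy objectives.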
To put it all together, the model training consists of a loop involving the processes as shown in Algorithm 1.
We use the Adam optimizer (Kingma and Ba 2014) for the autoencoder and the RMSProp optimizer (Tieleman and Hinton 2012) for the discriminators, each with an initial learning rate of , and train the model for 20 epochs. Both the autoencoder and the discriminators are trained once per epoch with , and . The recurrent unit size is , the style vector size is 8, and the content vector size is 128. We append the latent vector to the hidden state at every time step of the decoder (?). For the VAE model, we set and and use the KL-weight annealing schedule proposed by ? (?).
3.6 Generating Style-Transferred Sentences
A direct application of our disentangled latent space is style-transfer for natural language generation. For example, we can generate a sentence with generally the same meaning (content) but a different style (e.g. sentiment).
Let $x^*$ be an input sentence with $s^*$ and $c^*$ being the encoded, disentangled style and content vectors, respectively. If we would like to transfer its content to a different style, we compute an empirical estimate of the target style’s vector $\hat{s}$ by
$$\hat{s} = \frac{\sum_{i \in \text{target style}} s_i}{\#\,\text{target style samples}} \tag{12}$$
The inferred target style $\hat{s}$ is concatenated with the encoded content $c^*$ for decoding style-transferred sentences (Figure 2).
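The grafting step can be illustrated with a short sketch: average the style vectors of training sentences that carry the target style, then concatenate that mean with the content vector of the input sentence. The vectors below are hypothetical, and placing the style part before the content part in the concatenation is an assumption of this example.

```python
def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def graft(content_vec, target_style_vec):
    """Concatenate the inferred target style with the encoded content."""
    return list(target_style_vec) + list(content_vec)


# Hypothetical style vectors of training sentences with the target style.
target_style_vectors = [[1.0, 2.0], [3.0, 4.0]]
inferred_style = mean_vector(target_style_vectors)

# Hypothetical content vector of the sentence being transferred.
decoder_input = graft([0.1, 0.2, 0.3], inferred_style)
```

The resulting vector is what the decoder would receive in place of the original latent vector, so the content part is preserved while the style part is swapped.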
|Original (Positive)||DAE Transferred||VAE Transferred|
|i would recommend a visit here||i would not recommend this place again||i would not recommend this place for my experience|
|the restaurant itself is romantic and quiet||the restaurant itself is soooo quiet||the restaurant itself was dirty|
|my experience was brief but very good||my experience was very loud and very expensive||my experience was ok but not very much|
|the food is excellent and the service is exceptional||the food is by the worst part is the horrible costumer service||the food was bland and i am not thrilled with this|
|the food is very very amazing like beef and fish||the food is very horrible i have ever had mostly fish||the food is very bland and just not fresh|
|we will definitely come back here||we will not come back here again||we will never come back here|
|both were very good||everything was very bland||both were very bad|
|Original (Negative)||DAE Transferred||VAE Transferred|
|so nasty||so helpful||so fabulous|
|consistently slow||consistently awesome||fast service|
|crap fries hard hamburger buns burger tasted like crap||cheap and yummy sandwiches really something different||yummy hamburgers and blue cheese bagels are classic italian|
|oh and terrible tea||oh and awesome tea||oh and great tea|
|the interior is old and generally falling apart||the interior is clean and orderly as entertaining||the interior is old and noble|
|front office customer service does not exist here||front office is very professional does you||kudos to customer service is very professional|
|the crust was kinda gooey like||the crust is kinda traditional||the crust is soooooo worth it|
We conduct experiments on two datasets, the details for which are given below. Both of these datasets are comprised of sentences accompanied by binary sentiment labels (positive, negative) and are, therefore, used to evaluate the task of sentiment transfer.
Yelp Service Reviews
We use a Yelp review dataset (Challenge 2013), which has been sourced from the code repository accompanying the implementation of the paper by Shen et al. (2017) (https://github.com/shentianxiao/language-style-transfer). It contains 444101, 63483 and 126670 sentences for train, validation, and test, respectively. The maximum sentence length is 15, and the vocabulary size is about 9200.
Amazon Product Reviews
We also use an Amazon product reviews dataset (http://jmcauley.ucsd.edu/data/amazon/), following Fu et al. (2018). The reviews were sourced from the code repository accompanying the paper (https://github.com/fuzhenxin/text_style_transfer). It contains 559142, 2000 and 2000 sentences for train, validation, and test, respectively. The maximum sentence length is 20, and the vocabulary size is about 58000.
4.2 Evaluation Metrics
We evaluate our method using four metrics: style-transfer strength, content preservation, word overlap, and language fluency.
We train a convolutional neural network (CNN) style classifier (?) and predict the style of the generated sentences. While the style classifier itself may not be perfect, it provides a quantitative way of evaluating the strength of style-transfer (?; ?; ?). The validation accuracies of the sentiment classifier trained on the Yelp and Amazon datasets are and , respectively. The classifier accuracy on the style-transferred sentences, considering the target style to be the true label, is reported as the style-transfer strength.
We compute a sentence embedding by min, max, and average pooling its constituent word embeddings. Then, the cosine similarity between the source and generated sentence embedding is computed to evaluate how close they are in meaning. Here, sentiment words from a stop list (?) are removed (?). The cosine similarity is reported as the content preservation score.
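A minimal sketch of this content preservation score is given below. It assumes that the min-, max-, and average-pooled vectors are concatenated into a single sentence embedding, which the text does not state explicitly, and it uses hypothetical 2-dimensional word embeddings. Removal of sentiment words is assumed to have happened before pooling.

```python
import math


def sentence_embedding(word_vectors):
    """Concatenate min-, max-, and mean-pooled word embeddings."""
    columns = list(zip(*word_vectors))
    mins = [min(col) for col in columns]
    maxs = [max(col) for col in columns]
    means = [sum(col) / len(col) for col in columns]
    return mins + maxs + means


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical word embeddings for a two-word sentence.
source_embedding = sentence_embedding([[1.0, 4.0], [3.0, 2.0]])
```

A score close to 1 indicates that the pooled embeddings of the source and generated sentences point in nearly the same direction.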
We also utilize a simpler unigram overlap metric to evaluate the similarity between the source and generated sentences. Given a source sentence $x$ and an attribute style-transferred sentence $y$, let $w_x$ and $w_y$ be the sets of unique words present in $x$ and $y$ respectively, excluding sentiment words (Hu and Liu 2004) and stopwords (Bird and Loper 2004). Then, the word overlap score can be calculated using
$$\text{word overlap} = \frac{\mathrm{count}(w_x \cap w_y)}{\mathrm{count}(w_x \cup w_y)} \tag{13}$$
which is simply a normalized measure of overlapping unigrams in the source and target sentences.
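The normalized unigram overlap can be sketched as a Jaccard-style ratio of shared to total unique words. The excluded word set below is a hypothetical stand-in for the sentiment lexicon and stopword list, and returning 0.0 when both word sets are empty is a choice made for this example.

```python
def word_overlap(source_tokens, target_tokens, excluded):
    """Shared unique words divided by all unique words, after exclusions."""
    source_words = set(source_tokens) - excluded
    target_words = set(target_tokens) - excluded
    union = source_words | target_words
    if not union:
        return 0.0
    return len(source_words & target_words) / len(union)


# Hypothetical exclusion list (stopwords plus sentiment words).
excluded_words = {"the", "is", "and", "was", "romantic", "quiet", "dirty"}
score = word_overlap(
    "the restaurant itself is romantic and quiet".split(),
    "the restaurant itself was dirty".split(),
    excluded_words,
)
```

In this example only "restaurant" and "itself" survive the exclusions in both sentences, so the two word sets coincide.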
We use a trigram Kneser-Ney smoothed language model (?) as a quantifiable and automated scoring metric by which to assess the quality of generated sentences. It calculates the probability distribution of trigrams in a corpus, based on their occurrence statistics, to build a language model. We train this language model on the complete corpus that we evaluate our style-transfer models with. The log-likelihood score for a generated sentence, as predicted by the Kneser-Ney language model, is reported as the indicator of language fluency.
Manual Evaluation of Language Quality
We perform manual evaluations by randomly sampling sentences generated from each model trained on the Yelp dataset, and requesting 5 human annotators to rate them on a 1-5 Likert scale (Stent, Marge, and Singhai 2005) based on their syntax, grammar and fluency (5=Flawless, 4=Good, 3=Adequate, 2=Poor, 1=Incomprehensible). The aggregate score of all the evaluators is reported as a measure of language quality.
5.1 Disentangling Latent Space
We first analyze how the style (sentiment) and content of the latent space are disentangled. We train classifiers on the different latent spaces, and report their inference-time classification accuracies in Table 5. These are results from the experiments on the Yelp dataset.
|Latent Space||DAE||VAE|
|Content space ($c$)||0.6137||0.6567|
|Style space ($s$)||0.7927||0.7911|
|Complete space ($[s; c]$)||0.7918||0.7918|
We see that the 128-dimensional content vector $c$ is not particularly discriminative for style. It achieves accuracies that are only slightly better than random/majority guess. However, the 8-dimensional style vector $s$, despite its low dimensionality, achieves significantly higher style classification accuracy. When combining content and style vectors, we achieve no further improvement. These results verify the effectiveness of our disentangling approach, because the style space does contain style information, whereas the content space does not.
We show t-SNE plots of both the deterministic autoencoder (DAE) and the variational autoencoder (VAE) models in Figure 3 and Figure 4, respectively. As can be seen from the t-SNE plots, sentences with different styles are noticeably more cleanly separated in the style space (LHS), but are indistinguishable in the content space (RHS). It is also evident that the latent space learned by the variational autoencoder is considerably smoother and more continuous than the one learned by the deterministic autoencoder.
5.2 Style-Transfer Sentence Generation
We apply the disentangled latent space to a style-transfer sentence generation task, where the goal is to generate a sentence with different sentiment. We compare our approach with previous state-of-the-art work in Table 1 and Table 2. We replicated the experiments with their publicly available code and data.
We observe that the style embedding model (Fu et al. 2018) performs poorly on the style-transfer objective, resulting in inflated content preservation and word overlap scores. A qualitative analysis indicates that this model resorts to simply reconstructing most of the source sentences, and is not very effective at transferring style.
Results show that our approach achieves content preservation and word overlap scores comparable to previous work (?), and significantly better style-transfer strength scores than either of the compared models, showing that our disentangled latent space can be used for better style-transfer sentence generation. Our VAE model also produces the most fluent sentences for the Yelp dataset task, which is corroborated by the manual evaluation results.
Table 3 presents the results of an ablation test. We see that both the style adversarial loss and multi-task classification loss play a role in the strength of style-transfer, and that they can be combined to further improve performance. Also, the combination of all the losses described yields the best language fluency score. We observe that the bag-of-words adversary described in Section 3.4 does not provide much of an improvement in terms of our evaluation metrics. However, this idea can be improved upon in future work.
|Model||Language Quality|
|Style Embedding (Fu et al. 2018)||3.784|
The manual evaluation results are presented in Table 6. Our VAE model attains the best score for generated sentence quality amongst all the evaluated models. We also observe that the ranking of these models is positively correlated with the automated ‘Language Fluency’ metric presented in Table 1.
Some examples of style-transfer sentence generation are presented in Table 4. We see that, with the empirically estimated style vector, we can reliably control the sentiment of generated sentences.
6 Conclusion and Future Work
In this paper, we propose a simple yet effective approach for disentangling the latent space of neural networks using multi-task and adversarial objectives. Our learned disentanglement approach can be applied to text style-transfer tasks. It achieves similar content preservation scores, and significantly better style-transfer strength scores compared to previous state-of-the-art work.
For future work, we intend to evaluate the effects of disentangling the style space for datasets with greater than two distinct styles. We would also like to explore the possibility of aligning each encoded style distribution to a unique prior, which could be sampled from at inference time for style-transfer, as opposed to using the empirical mean of training-time style embeddings.
We acknowledge the support of the Natural Sciences and Engineering Research Council of Canada (NSERC) [261439-2013-RGPIN], and Amazon Research Award. The Titan Xp GPU used for this research was donated by the NVIDIA Corporation.
- [Bahuleyan et al. 2018a] Bahuleyan, H.; Mou, L.; Vamaraju, K.; Zhou, H.; and Vechtomova, O. 2018a. Probabilistic natural language generation with wasserstein autoencoders. arXiv preprint arXiv:1806.08462.
- [Bahuleyan et al. 2018b] Bahuleyan, H.; Mou, L.; Vechtomova, O.; and Poupart, P. 2018b. Variational attention for sequence-to-sequence models. Proceedings of the 27th International Conference on Computational Linguistics (COLING).
- [Balikas, Moura, and Amini 2017] Balikas, G.; Moura, S.; and Amini, M.-R. 2017. Multitask learning for fine-grained twitter sentiment analysis. In SIGIR, 1005–1008.
- [Bird and Loper 2004] Bird, S., and Loper, E. 2004. NLTK: the natural language toolkit. In Proceedings of the ACL 2004 on Interactive poster and demonstration sessions, 31. Association for Computational Linguistics.
- [Bowman et al. 2016] Bowman, S. R.; Vilnis, L.; Vinyals, O.; Dai, A.; Jozefowicz, R.; and Bengio, S. 2016. Generating sentences from a continuous space. In CoNLL, 10–21.
- [Challenge 2013] Challenge, Y. D. 2013. Yelp dataset challenge.
- [Champandard 2016] Champandard, A. J. 2016. Semantic style transfer and turning two-bit doodles into fine artworks. arXiv preprint arXiv:1603.01768.
- [Chen et al. 2016] Chen, X.; Duan, Y.; Houthooft, R.; Schulman, J.; Sutskever, I.; and Abbeel, P. 2016. Infogan: Interpretable representation learning by information maximizing generative adversarial nets. In NIPS, 2172–2180.
- [Cho et al. 2014] Cho, K.; van Merrienboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; and Bengio, Y. 2014. Learning phrase representations using rnn encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1724–1734.
- [Fu et al. 2018] Fu, Z.; Tan, X.; Peng, N.; Zhao, D.; and Yan, R. 2018. Style transfer in text: Exploration and evaluation. In AAAI, 663–670.
- [Gatys, Ecker, and Bethge 2016] Gatys, L. A.; Ecker, A. S.; and Bethge, M. 2016. Image style transfer using convolutional neural networks. In CVPR, 2414–2423.
- [Goodfellow et al. 2014] Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; and Bengio, Y. 2014. Generative adversarial nets. In NIPS, 2672–2680.
- [Hu and Liu 2004] Hu, M., and Liu, B. 2004. Mining and summarizing customer reviews. In KDD, 168–177.
- [Hu et al. 2017] Hu, Z.; Yang, Z.; Liang, X.; Salakhutdinov, R.; and Xing, E. P. 2017. Toward controlled generation of text. In ICML, 1587–1596.
- [Jernite, Bowman, and Sontag 2017] Jernite, Y.; Bowman, S. R.; and Sontag, D. 2017. Discourse-based objectives for fast unsupervised sentence representation learning. arXiv preprint arXiv:1705.00557.
- [Kim 2014] Kim, Y. 2014. Convolutional neural networks for sentence classification. In EMNLP, 1746–1751.
- [Kingma and Ba 2014] Kingma, D. P., and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- [Kingma and Welling 2014] Kingma, D. P., and Welling, M. 2014. Auto-encoding variational bayes. International Conference on Learning Representations.
- [Kneser and Ney 1995] Kneser, R., and Ney, H. 1995. Improved backing-off for m-gram language modeling. In ICASSP, volume 1, 181–184.
- [Kulkarni et al. 2015] Kulkarni, T. D.; Whitney, W. F.; Kohli, P.; and Tenenbaum, J. 2015. Deep convolutional inverse graphics network. In Advances in Neural Information Processing Systems, 2539–2547.
- [Kullback and Leibler 1951] Kullback, S., and Leibler, R. A. 1951. On information and sufficiency. The annals of mathematical statistics 22(1):79–86.
- [Liu, Qiu, and Huang 2017] Liu, P.; Qiu, X.; and Huang, X. 2017. Adversarial multi-task learning for text classification. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, 1–10.
- [Luan et al. 2017] Luan, F.; Paris, S.; Shechtman, E.; and Bala, K. 2017. Deep photo style transfer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 4990–4998.
- [Luong et al. 2015] Luong, M.-T.; Le, Q. V.; Sutskever, I.; Vinyals, O.; and Kaiser, L. 2015. Multi-task sequence to sequence learning. arXiv preprint arXiv:1511.06114.
- [Mathieu et al. 2016] Mathieu, M. F.; Zhao, J. J.; Zhao, J.; Ramesh, A.; Sprechmann, P.; and LeCun, Y. 2016. Disentangling factors of variation in deep representation using adversarial training. In Advances in Neural Information Processing Systems, 5040–5048.
- [Shen et al. 2017] Shen, T.; Lei, T.; Barzilay, R.; and Jaakkola, T. 2017. Style transfer from non-parallel text by cross-alignment. In NIPS, 6833–6844.
- [Stent, Marge, and Singhai 2005] Stent, A.; Marge, M.; and Singhai, M. 2005. Evaluating evaluation methods for generation in the presence of variation. In International Conference on Intelligent Text Processing and Computational Linguistics, 341–351. Springer.
- [Sutskever, Vinyals, and Le 2014] Sutskever, I.; Vinyals, O.; and Le, Q. V. 2014. Sequence to sequence learning with neural networks. In NIPS, 3104–3112.
- [Tieleman and Hinton 2012] Tieleman, T., and Hinton, G. 2012. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning 4(2):26–31.