Disentangled Representation Learning for Text Style Transfer

Vineet John
University of Waterloo
vineet.john@uwaterloo.ca

Lili Mou
AdeptMind Research
doublepower.mou@gmail.com
lili@adeptmind.ai

Hareesh Bahuleyan
University of Waterloo
hpallika@uwaterloo.ca

Olga Vechtomova
University of Waterloo
ovechtom@uwaterloo.ca
Abstract

This paper tackles the problem of disentangling the latent variables of style and content in language models. We propose a simple yet effective approach that incorporates auxiliary objectives: a multi-task classification objective, and dual adversarial objectives for label prediction and bag-of-words prediction, respectively. We show, both qualitatively and quantitatively, that style and content are indeed disentangled in the latent space using this approach. This disentangled latent representation learning method is applied to attribute (e.g., style) transfer on non-parallel corpora. We achieve content preservation scores similar to previous state-of-the-art approaches, and significantly better style-transfer strength scores. Our code is made publicly available for replicability and extension purposes: https://github.com/vineetjohn/linguistic-style-transfer.


1 Introduction

The neural network has been a successful learning machine during the past decade due to its highly expressive modeling capability, which is a consequence of multiple layers of non-linear transformations of input features. Such transformations, however, make intermediate features ‘latent’, in that they do not have explicit meaning and are not explainable. Therefore, neural networks are usually treated as black-box machinery.

Disentangling the latent space of neural networks has become an increasingly important research topic. In the image domain, for example, Chen et al. (2016) use adversarial and information-maximization objectives to produce interpretable latent representations that can be tweaked to adjust writing style for handwritten digits, as well as lighting and orientation for face models. Kulkarni et al. (2015) utilize a convolutional autoencoder to achieve the same objective. However, this problem is not well explored in natural language processing.

In this paper, we address the problem of disentangling the latent space of neural networks for text generation. Our model is built on an autoencoder that encodes the latent space (vector representation) of a sentence by learning to reconstruct the sentence itself. We would like the latent space to be disentangled with respect to different features, namely, style and content.

To accomplish this, we propose a simple approach that combines multi-task and adversarial objectives. We artificially divide the latent representation into two parts: style space and content space. We learn model parameters that produce separate style and content latent spaces from the encoded sentence representation. The multi-task objective operates on the style space to ensure it encodes style information. The adversarial style objective, on the contrary, operates on the content space to minimize the predictability of style. In addition, the bag-of-words adversary minimizes the predictability of original words of the sentence in the style space. In this way, the style and content variables can be disentangled from each other.

These representation learning objectives can be directly used for style transfer in sentence generation (Hu et al. 2017; Shen et al. 2017; Fu et al. 2018), in which a model generates a sentence with the same content but a different style. We use the autoencoder to encode the content vector of a sentence, but ignore its encoded style vector. Then we infer, from the training data, an empirical embedding of the style that we want to transfer. The encoded content vector and the inferred style vector are concatenated and fed to the decoder. Using this grafting technique, we generate a new sentence similar in content but different in style.

We conducted experiments on two customer review benchmark datasets. Both qualitative and quantitative results show that the style vector does contain most style information, whereas the content vector contains little (if any). In the empirical style-transfer evaluations, we achieve significantly better style-transfer strength scores than previous results, while obtaining better or comparable content preservation scores. We also show, using ablation tests, that the auxiliary losses can be combined well, each playing its own role in disentangling the latent space.

Figure 1: Model Training Overview
Figure 2: Model Generation Overview

2 Related Work

Disentanglement of latent spaces has been explored in the image processing domain in recent years, and researchers have successfully disentangled rotation features, color features, etc. (Chen et al. 2016; Kulkarni et al. 2015). Some image characteristics (e.g., artistic style) can be captured well by certain statistics, such as minimizing the L2 loss between a noise image and a pair of style and content images (Gatys, Ecker, and Bethge 2016). In other works, researchers adopt data augmentation techniques to learn a disentangled latent space (Kulkarni et al. 2015; Mathieu et al. 2016).

In natural language processing, the definition of 'style' itself is vague, and as a convenient starting point, NLP researchers typically treat sentiment as a salient style of text. Hu et al. (2017) manage to control the sentiment by using discriminators to reconstruct sentiment and content from generated sentences. However, there is no evidence that the latent space would be disentangled by this reconstruction. Shen et al. (2017) use a pair of adversarial discriminators to align the recurrent hidden decoder states of (a) original sentences with a given style and (b) sentences transferred to the given style, thereby performing style transfer. Fu et al. (2018) propose two approaches: (a) using a style embedding matrix, and (b) using style-specific decoders for style transfer. They apply an adversarial loss on the encoded space to discourage encoding style in the latent space of an autoencoding model.

Our paper differs from the previous work in that both our style space and content space are encoded from the input. We apply three auxiliary losses to ensure that each space encodes its own information, including a novel bag-of-words adversary that helps disentangle the style space from sentence content. Our disentangled representation learning function can then be directly applied to text style-transfer tasks, as in the aforementioned studies.

3 Approach

In this section, we describe our approach in detail. Our model is built upon an autoencoder with a sequence-to-sequence (Seq2Seq) neural network (Sutskever, Vinyals, and Le 2014), as described in Subsection 3.1. We then introduce three auxiliary losses: a multi-task classification loss, a style adversarial loss, and a content adversarial loss in Subsections 3.2, 3.3 and 3.4, respectively. Subsection 3.6 presents the approach to transferring style in the context of natural language generation. Figure 1 and Figure 2 depict the training and style-transferred sentence generation processes of our approach, respectively.

3.1 Autoencoder

An autoencoder encodes an input to a latent vector space, which is usually of much lower dimensionality than the input data, and then decodes the input itself from this latent space. By doing so, the autoencoder learns salient and compact representations of data. This serves as our primary learning objective. We also use the autoencoder for text generation in the style-transfer application.

Let $x = (x_1, x_2, \cdots, x_n)$ be an input sequence with $n$ tokens. The encoder recurrent neural network (RNN) with gated recurrent units (GRU) (Cho et al. 2014) encodes $x$ and obtains a hidden state $h$. Then a decoder RNN generates a sentence, which ideally should be $x$ itself. Suppose at a time step $t$, the decoder RNN predicts the word $x_t$ with probability $p(x_t \mid h, x_1, \cdots, x_{t-1})$; then the deterministic autoencoder (DAE) is trained using a sequence-aggregated cross-entropy loss, given by

$$J_{\mathrm{rec}}(\theta_E, \theta_D) = -\sum_{t=1}^{n} \log p(x_t \mid h, x_1, \cdots, x_{t-1}) \quad (1)$$

where $\theta_E$ and $\theta_D$ are the parameters of the encoder and decoder, respectively.
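For concreteness, the following PyTorch sketch implements a GRU autoencoder with the sequence-aggregated cross-entropy loss of Equation 1. The class names, layer sizes, and teacher-forcing setup are illustrative assumptions rather than the authors' exact implementation; the 136-dimensional hidden state mirrors the 8-dimensional style plus 128-dimensional content split reported in Section 5.1.

```python
import torch.nn as nn

class Seq2SeqAutoencoder(nn.Module):
    """GRU encoder-decoder that learns to reconstruct its input sentence."""

    def __init__(self, vocab_size, emb_dim=100, hid_dim=136):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.decoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, x):
        # x: (batch, seq_len) token ids, beginning with a start-of-sequence token
        emb = self.embed(x)
        _, h = self.encoder(emb)                      # h: (1, batch, hid_dim)
        dec_out, _ = self.decoder(emb[:, :-1, :], h)  # teacher forcing
        return self.out(dec_out), h.squeeze(0)        # logits predict x_2 .. x_n

def reconstruction_loss(logits, x):
    # Sequence-aggregated cross-entropy, i.e. Equation 1 summed over time steps.
    return nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)), x[:, 1:].reshape(-1), reduction="sum")
```

In the full model, the decoder would be conditioned on the concatenated style and content vectors $[s; c]$ rather than on $h$ directly; the sketch uses $h$ for brevity.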

Variational Autoencoder

In addition to the deterministic auto-encoding objective presented above, we also implement a variational autoencoder (VAE) (Kingma and Welling 2014). The variational sampling and Kullback-Leibler (KL) divergence (Kullback and Leibler 1951) losses are applied to the style space $s$ and the content space $c$ separately. The weight applied to each KL divergence loss is tuned independently as a model hyperparameter.

Equation 2 shows the objective that is minimized in the VAE variant of our model:

$$J_{\mathrm{VAE}}(\theta_E, \theta_D) = J_{\mathrm{rec}} + \lambda_{\mathrm{kl}(s)}\,\mathrm{KL}\big(q(s \mid x)\,\|\,\mathcal{N}(0, I)\big) + \lambda_{\mathrm{kl}(c)}\,\mathrm{KL}\big(q(c \mid x)\,\|\,\mathcal{N}(0, I)\big) \quad (2)$$

where $\lambda_{\mathrm{kl}(s)}$ and $\lambda_{\mathrm{kl}(c)}$ are the weights for each of the KL divergence terms, and $\mathcal{N}(0, I)$ is the standard normal distribution that both the encoded style and content distributions are aligned to.

The motivation for comparing the base deterministic autoencoder against a variational variant is the property of VAEs that enables them to learn smooth and continuous latent spaces, without many dead zones that cannot be decoded from. Bowman et al. (2016) show this empirically by using the latent space of a text variational autoencoder to interpolate between two valid corpus sentences and generate novel sentences from the intermediate points.
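A minimal sketch of the variational variant, assuming diagonal-Gaussian posteriors over the style and content spaces with independently weighted KL terms (the names and dimensions are illustrative):

```python
import torch
import torch.nn as nn

class LatentSampler(nn.Module):
    """Maps the encoder state h to Gaussian posteriors over style s and content c."""

    def __init__(self, hid_dim=136, style_dim=8, content_dim=128):
        super().__init__()
        self.style_mu = nn.Linear(hid_dim, style_dim)
        self.style_logvar = nn.Linear(hid_dim, style_dim)
        self.content_mu = nn.Linear(hid_dim, content_dim)
        self.content_logvar = nn.Linear(hid_dim, content_dim)

    @staticmethod
    def reparameterize(mu, logvar):
        # Reparameterization trick: sampling stays differentiable w.r.t. mu/logvar.
        return mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)

    def forward(self, h):
        s = self.reparameterize(self.style_mu(h), self.style_logvar(h))
        c = self.reparameterize(self.content_mu(h), self.content_logvar(h))
        return s, c

def kl_to_standard_normal(mu, logvar):
    # KL( N(mu, sigma^2) || N(0, I) ), summed over latent dimensions.
    return -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=-1).mean()

# J_VAE = J_rec + lam_kl_s * KL(style) + lam_kl_c * KL(content), as in Equation 2.
```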

Figure 3: t-SNE Plots: Deterministic Autoencoder

Since this objective is used to train the model to reconstruct $x$, it is also called the reconstruction loss. Besides the above reconstruction loss, we design three auxiliary losses to disentangle the latent space $h$. In particular, we hope that $h$ can be separated into two spaces $s$ and $c$, representing style and content respectively, i.e., $h = [s; c]$, where $[\cdot\,;\cdot]$ denotes concatenation. This is accomplished by the auxiliary losses described in the subsequent sections.

Figure 4: t-SNE Plots: Variational Autoencoder

3.2 Multi-Task Classification Loss

Our first auxiliary loss ensures the style space does contain style information. We build a classifier on the style space $s$ with the objective of predicting the true style label $t_s$, which is part of the training data.

This loss can be viewed as a multi-task loss, which incentivizes the model to not only decode the sentence, but also predict its sentiment from a fixed subset of the latent variables. Similar multi-task losses are used in previous work for sequence-to-sequence learning (?), sentence representation learning (?) and sentiment analysis (?), among others.

In our application, we follow previous work (Hu et al. 2017; Shen et al. 2017; Fu et al. 2018) and treat the sentiment as the style of interest. We introduce a multi-label classifier

$$y_s = \mathrm{softmax}(W_{\mathrm{mult}}\, s + b_{\mathrm{mult}}) \quad (3)$$

where $\theta_{\mathrm{mult}} = \{W_{\mathrm{mult}}, b_{\mathrm{mult}}\}$ are the classifier's parameters for multi-task learning, and $y_s$ is the predicted label distribution.

This is trained using a simple cross-entropy loss

$$J_{\mathrm{mult}}(\theta_E; \theta_{\mathrm{mult}}) = -\sum_{l \in \mathrm{labels}} t_s(l) \log y_s(l) \quad (4)$$

where $\theta_E$ are the encoder's parameters, and $t_s$ is the true label distribution.
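A sketch of the multi-task head: a single softmax layer over the style vector, whose gradients (unlike those of the adversaries below) do flow back into the encoder. The binary-label setup reflects the sentiment task; the sizes are assumptions.

```python
import torch.nn as nn

# Classifier on the style vector s; training it jointly with the autoencoder
# pushes style information into s.
multitask_clf = nn.Linear(8, 2)  # style_dim -> number of sentiment labels

def multitask_loss(s, labels):
    # Cross-entropy of Equation 4; cross_entropy applies the softmax internally.
    return nn.functional.cross_entropy(multitask_clf(s), labels)
```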

Model                               Transfer  Content       Word     Language
                                    Strength  Preservation  Overlap  Fluency
Cross-Alignment (Shen et al. 2017)  0.809     0.892         0.209    -23.39
Style Embedding (Fu et al. 2018)    0.182     0.959         0.666    -16.17
Ours (DAE)                          0.843     0.892         0.255    -16.48
Ours (VAE)                          0.890     0.882         0.211    -14.41
Table 1: Yelp Dataset - Comparison

Model                               Transfer  Content       Word     Language
                                    Strength  Preservation  Overlap  Fluency
Cross-Alignment (Shen et al. 2017)  0.606     0.893         0.024    -26.31
Style Embedding (Fu et al. 2018)    0.417     0.933         0.359    -28.13
Ours (DAE)                          0.703     0.918         0.131    -32.42
Ours (VAE)                          0.726     0.909         0.081    -28.50
Table 2: Amazon Dataset - Comparison

3.3 Adversarial Style Discrimination

The above multi-task loss only operates on the style space $s$; it has no effect on the content space $c$.

We therefore apply an adversarial loss to disentangle the content space from style information, inspired by adversarial generation (Goodfellow et al. 2014), adversarial domain adaptation, and adversarial style transfer (Shen et al. 2017; Fu et al. 2018).

The idea of the adversarial loss is to introduce an objective that deliberately discriminates the true style label using the content vector $c$. The autoencoder is then trained to learn a content vector space from which its adversary cannot predict style information.

Concretely, the adversarial discriminator predicts style by computing a softmax distribution over the possible class labels:

$$y_{\mathrm{adv}} = \mathrm{softmax}(W_{\mathrm{adv}}\, c + b_{\mathrm{adv}}) \quad (5)$$

where $\theta_{\mathrm{adv}} = \{W_{\mathrm{adv}}, b_{\mathrm{adv}}\}$ are the parameters of the adversary, and $y_{\mathrm{adv}}$ is the predicted label distribution.

It is trained by minimizing the cross-entropy loss

$$J_{\mathrm{adv}}(\theta_{\mathrm{adv}}) = -\sum_{l \in \mathrm{labels}} t_s(l) \log y_{\mathrm{adv}}(l) \quad (6)$$

where $t_s$ is the true label distribution.

The adversarial loss appears similar to the multi-task loss in Equation 4. It should be emphasized, however, that for the adversary the gradients are not propagated back to the autoencoder, i.e., the content vector $c$ is treated as a set of shallow features.

Having trained the adversary, we would like the autoencoder to be tuned such that the content space $c$ is not discriminative of style. In other words, we maximize the Shannon entropy of the adversary's prediction, given by

$$J_{\mathrm{adv(ent)}}(\theta_E) = \mathcal{H}(y_{\mathrm{adv}}) \quad (7)$$

where $\mathcal{H}(p) = -\sum_i p_i \log p_i$ is the entropy and $y_{\mathrm{adv}}$ is the predicted distribution over the style labels. This objective is maximized, in this phase, with respect to the encoder only, and it helps the autoencoder maximize the uncertainty of the discriminator's predicted probability distribution.
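The two phases can be sketched as follows: the discriminator trains on a detached content vector (Equation 6), while the encoder separately maximizes the entropy of the discriminator's prediction (Equation 7). The single-linear-layer discriminator is an illustrative simplification.

```python
import torch
import torch.nn as nn

style_adversary = nn.Linear(128, 2)  # content_dim -> number of style labels

def adversary_loss(c, labels):
    # Discriminator phase (Eq. 6): c is detached, so no gradient reaches the encoder.
    return nn.functional.cross_entropy(style_adversary(c.detach()), labels)

def adversarial_entropy(c):
    # Encoder phase (Eq. 7): entropy of the adversary's prediction, which the
    # autoencoder maximizes; only encoder parameters are stepped in this phase.
    probs = torch.softmax(style_adversary(c), dim=-1)
    return -(probs * torch.log(probs + 1e-8)).sum(dim=-1).mean()
```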

Objectives                     Transfer  Content       Word     Language
                               Strength  Preservation  Overlap  Fluency
J_rec                          0.144     0.915         0.329    -14.28
J_rec, J_mult                  0.727     0.880         0.204    -14.16
J_rec, J_adv                   0.789     0.898         0.259    -14.56
J_rec, J_bow                   0.168     0.915         0.328    -14.45
J_rec, J_mult, J_adv           0.890     0.882         0.211    -14.41
J_rec, J_mult, J_bow           0.749     0.883         0.202    -14.36
J_rec, J_adv, J_bow            0.783     0.896         0.257    -14.34
J_rec, J_mult, J_adv, J_bow    0.885     0.878         0.197    -14.05
Table 3: Ablation Tests

3.4 Adversarial Bag-of-Words Discriminator

In addition to the auxiliary losses used above, we also propose a bag-of-words discriminator on the style space to make our approach complete.

The motivation is to emulate the adversarial signal provided by the style discriminator, and do the same for the content. Here, we define the content of the sentence as the words from the original sentence without any words that are discriminative of style.

The input sentences are represented as vectors of the same size as the corpus vocabulary, with each index of the vector denoting whether the corresponding word is present in the sentence. This bag-of-words representation is therefore comprised of only 0s and 1s.

The bag-of-words discriminator takes the style vector $s$ produced by the autoencoder model and tries to predict the true bag-of-words distribution, using a set of parameters that are distinct from those of the autoencoder. The discriminator uses a logistic regression to predict the probability, between 0 and 1, of each word's occurrence in the original sentence:

$$y_{\mathrm{bow}} = \sigma(W_{\mathrm{bow}}\, s + b_{\mathrm{bow}}) \quad (8)$$

where $\theta_{\mathrm{bow}} = \{W_{\mathrm{bow}}, b_{\mathrm{bow}}\}$ are the classifier's parameters for bag-of-words prediction, and $y_{\mathrm{bow}}$ is the predicted word distribution.

This objective is trained similarly to the style adversary, using a cross-entropy loss

$$J_{\mathrm{bow}}(\theta_{\mathrm{bow}}) = -\sum_{w \in \mathcal{V}} \big[\, t_{\mathrm{bow}}(w) \log y_{\mathrm{bow}}(w) + (1 - t_{\mathrm{bow}}(w)) \log(1 - y_{\mathrm{bow}}(w)) \,\big] \quad (9)$$

where $t_{\mathrm{bow}}$ is the true word distribution and $\mathcal{V}$ is the vocabulary.

We also refrain from propagating the effects of this discriminator loss back to the encoder parameters, ensuring that the sets of parameters updated by each adversary and by the autoencoder are mutually exclusive.

Similar to the style adversary, the empirical Shannon entropy of the predicted distribution is provided as a training signal for the autoencoder model to maximize:

$$J_{\mathrm{bow(ent)}}(\theta_E) = \mathcal{H}(y_{\mathrm{bow}}) \quad (10)$$

where $y_{\mathrm{bow}}$ is the predicted word distribution.

The motivation for this adversarial loss is similar to the one used in the context of the style discriminator. We want to encourage the encoder to learn a representation of style that the bag-of-words discriminator cannot predict most of the original words from.
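A sketch under the assumption that each vocabulary entry is an independent Bernoulli prediction, so Equation 9 becomes a per-word binary cross-entropy and Equation 10 sums the per-word Bernoulli entropies. The vocabulary size shown is roughly the Yelp figure from Section 4.1.

```python
import torch
import torch.nn as nn

VOCAB_SIZE = 9200  # assumption: approximately the Yelp vocabulary size
bow_adversary = nn.Linear(8, VOCAB_SIZE)  # style_dim -> vocabulary size

def bow_adversary_loss(s, bow_targets):
    # Eq. 9: bow_targets is a (batch, VOCAB_SIZE) 0/1 word-presence vector;
    # the style vector is detached so this adversary cannot update the encoder.
    return nn.functional.binary_cross_entropy_with_logits(
        bow_adversary(s.detach()), bow_targets)

def bow_adversarial_entropy(s):
    # Eq. 10: summed entropy of the per-word Bernoulli predictions, which the
    # encoder maximizes to purge content words from the style space.
    p = torch.sigmoid(bow_adversary(s))
    return -(p * torch.log(p + 1e-8)
             + (1 - p) * torch.log(1 - p + 1e-8)).sum(dim=-1).mean()
```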

3.5 Training Process

The overall loss $J_{\mathrm{ovr}}$, used for the autoencoder, is thus comprised of four distinct objectives: the reconstruction objective, the multi-task objective, and the adversarial objectives for style and content:

$$J_{\mathrm{ovr}} = J_{\mathrm{rec}} + \lambda_{\mathrm{mult}} J_{\mathrm{mult}} - \lambda_{\mathrm{adv}} J_{\mathrm{adv(ent)}} - \lambda_{\mathrm{bow}} J_{\mathrm{bow(ent)}}$$

where $\lambda_{\mathrm{mult}}$, $\lambda_{\mathrm{adv}}$ and $\lambda_{\mathrm{bow}}$ balance the model's auxiliary losses.

To put it all together, model training consists of a loop over the steps shown in Algorithm 1.

1 while epochs remaining do
2       minimize $J_{\mathrm{adv}}$ w.r.t. $\theta_{\mathrm{adv}}$;
3       minimize $J_{\mathrm{bow}}$ w.r.t. $\theta_{\mathrm{bow}}$;
4       minimize $J_{\mathrm{ovr}}$ w.r.t. $\theta_E$, $\theta_D$, $\theta_{\mathrm{mult}}$;
5 end while
Algorithm 1 Training Process
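Reusing the components sketched in the previous subsections, one epoch of Algorithm 1 might look as follows. The per-batch alternation and optimizer wiring (Adam for the autoencoder, RMSProp for the discriminators, as noted below) are our assumptions about the schedule.

```python
def train_epoch(batches, model, sampler, ae_opt, style_opt, bow_opt,
                lam_mult=1.0, lam_adv=1.0, lam_bow=1.0):
    # ae_opt holds the encoder, decoder, sampler, and multi-task parameters;
    # style_opt and bow_opt each hold only their adversary's parameters.
    for x, labels, bow_targets in batches:
        logits, h = model(x)
        s, c = sampler(h)

        # 1) Train the style discriminator on detached content vectors (Eq. 6).
        style_opt.zero_grad()
        adversary_loss(c, labels).backward()
        style_opt.step()

        # 2) Train the bag-of-words discriminator on detached style vectors (Eq. 9).
        bow_opt.zero_grad()
        bow_adversary_loss(s, bow_targets).backward()
        bow_opt.step()

        # 3) Train the autoencoder on the combined objective J_ovr.
        ae_opt.zero_grad()
        loss = (reconstruction_loss(logits, x)
                + lam_mult * multitask_loss(s, labels)
                - lam_adv * adversarial_entropy(c)
                - lam_bow * bow_adversarial_entropy(s))
        loss.backward()
        ae_opt.step()
```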

We use the Adam optimizer (Kingma and Ba 2014) for the autoencoder and the RMSProp optimizer (Tieleman and Hinton 2012) for the discriminators, and train the model for 20 epochs. Both the autoencoder and the discriminators are trained once per epoch, with fixed loss weights $\lambda_{\mathrm{mult}}$, $\lambda_{\mathrm{adv}}$, and $\lambda_{\mathrm{bow}}$. The style vector size is 8, and the content vector size is 128. We append the latent vector to the hidden state at every time step of the decoder. For the VAE model, we set $\lambda_{\mathrm{kl}(s)}$ and $\lambda_{\mathrm{kl}(c)}$ independently and use the KL-weight annealing schedule proposed by Bowman et al. (2016).

3.6 Generating Style-Transferred Sentences

A direct application of our disentangled latent space is style-transfer for natural language generation. For example, we can generate a sentence with generally the same meaning (content) but a different style (e.g. sentiment).

Let $x^*$ be an input sentence, with $s^*$ and $c^*$ being its encoded, disentangled style and content vectors, respectively. If we would like to transfer its content to a different style, we compute an empirical estimate of the target style's vector by averaging over the training sentences that carry the target style:

$$\hat{s} = \frac{1}{|\mathcal{D}_{\mathrm{target}}|} \sum_{x \in \mathcal{D}_{\mathrm{target}}} s(x)$$

The inferred target style $\hat{s}$ is concatenated with the encoded content $c^*$ for decoding style-transferred sentences (Figure 2).
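A sketch of this grafting procedure; `decode_fn` (a greedy decoder over the concatenated latent vector) and the other helper names are hypothetical:

```python
import torch

@torch.no_grad()
def transfer_style(x, target_style_sentences, model, sampler, decode_fn):
    # Empirical target-style embedding: the mean style vector over training
    # sentences that carry the target label.
    style_vecs = []
    for xt in target_style_sentences:
        _, h = model(xt)
        s, _ = sampler(h)
        style_vecs.append(s)
    s_target = torch.cat(style_vecs, dim=0).mean(dim=0, keepdim=True)

    # Encode the source sentence, keep its content, and graft the target style.
    _, h = model(x)
    _, c = sampler(h)
    latent = torch.cat([s_target.expand(c.size(0), -1), c], dim=-1)
    return decode_fn(latent)
```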

Original (Positive) | DAE Transferred (Negative) | VAE Transferred (Negative)
i would recommend a visit here | i would not recommend this place again | i would not recommend this place for my experience
the restaurant itself is romantic and quiet | the restaurant itself is soooo quiet | the restaurant itself was dirty
my experience was brief but very good | my experience was very loud and very expensive | my experience was ok but not very much
the food is excellent and the service is exceptional | the food is by the worst part is the horrible costumer service | the food was bland and i am not thrilled with this
the food is very very amazing like beef and fish | the food is very horrible i have ever had mostly fish | the food is very bland and just not fresh
we will definitely come back here | we will not come back here again | we will never come back here
both were very good | everything was very bland | both were very bad

Original (Negative) | DAE Transferred (Positive) | VAE Transferred (Positive)
so nasty | so helpful | so fabulous
consistently slow | consistently awesome | fast service
crap fries hard hamburger buns burger tasted like crap | cheap and yummy sandwiches really something different | yummy hamburgers and blue cheese bagels are classic italian
oh and terrible tea | oh and awesome tea | oh and great tea
the interior is old and generally falling apart | the interior is clean and orderly as entertaining | the interior is old and noble
front office customer service does not exist here | front office is very professional does you | kudos to customer service is very professional
the crust was kinda gooey like | the crust is kinda traditional | the crust is soooooo worth it
Table 4: Examples of Style-Transfer Generation

4 Experiments

4.1 Datasets

We conduct experiments on two datasets, the details for which are given below. Both of these datasets are comprised of sentences accompanied by binary sentiment labels (positive, negative) and are, therefore, used to evaluate the task of sentiment transfer.

Yelp Service Reviews

We use a Yelp review dataset (Challenge 2013), sourced from the code repository accompanying Shen et al. (2017) (https://github.com/shentianxiao/language-style-transfer). It contains 444,101, 63,483, and 126,670 sentences for train, validation, and test, respectively. The maximum sentence length is 15, and the vocabulary size is about 9,200.

Amazon Product Reviews

We also use an Amazon product reviews dataset (http://jmcauley.ucsd.edu/data/amazon/), following Fu et al. (2018). The reviews were sourced from the code repository accompanying the paper (https://github.com/fuzhenxin/text_style_transfer). It contains 559,142, 2,000, and 2,000 sentences for train, validation, and test, respectively. The maximum sentence length is 20, and the vocabulary size is about 58,000.

4.2 Evaluation Metrics

We evaluate our method using four metrics: style-transfer strength, content preservation, word overlap, and language fluency.

Style-Transfer Strength

We train a convolutional neural network (CNN) style classifier (Kim 2014) and use it to predict the style of the generated sentences. While the style classifier itself may not be perfect, it provides a quantitative way of evaluating the strength of style transfer (Hu et al. 2017; Shen et al. 2017; Fu et al. 2018). We train and validate a separate sentiment classifier for each of the Yelp and Amazon datasets. The classifier accuracy on the style-transferred sentences, considering the target style to be the true label, is reported as the style-transfer strength.
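A sketch of a Kim (2014)-style CNN sentence classifier that could serve as the transfer-strength judge; the filter widths and counts are illustrative assumptions rather than the reported configuration.

```python
import torch
import torch.nn as nn

class CNNStyleClassifier(nn.Module):
    """Convolutional sentence classifier with multiple filter widths."""

    def __init__(self, vocab_size, emb_dim=100, n_filters=128, widths=(3, 4, 5)):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.convs = nn.ModuleList(nn.Conv1d(emb_dim, n_filters, w) for w in widths)
        self.fc = nn.Linear(n_filters * len(widths), 2)

    def forward(self, x):
        emb = self.embed(x).transpose(1, 2)  # (batch, emb_dim, seq_len)
        # Max-over-time pooling of each filter bank, then a linear output layer.
        pooled = [torch.relu(conv(emb)).max(dim=2).values for conv in self.convs]
        return self.fc(torch.cat(pooled, dim=1))  # sentiment logits
```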

Content Preservation

We compute a sentence embedding by min, max, and average pooling its constituent word embeddings, after removing sentiment words from a stop list (Hu and Liu 2004), following Fu et al. (2018). The cosine similarity between the source and generated sentence embeddings is then computed to evaluate how close they are in meaning, and is reported as the content preservation score.
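A sketch of the pooled sentence embedding and cosine score, assuming pre-trained word vectors in a dict and a sentiment-word exclusion set:

```python
import numpy as np

def sentence_embedding(words, word_vectors, excluded):
    # Concatenate min-, max-, and mean-pooled word embeddings, after
    # dropping words on the exclusion list.
    vecs = np.array([word_vectors[w] for w in words
                     if w in word_vectors and w not in excluded])
    if vecs.size == 0:
        raise ValueError("no embeddable words left after filtering")
    return np.concatenate([vecs.min(0), vecs.max(0), vecs.mean(0)])

def content_preservation(src_words, gen_words, word_vectors, excluded):
    a = sentence_embedding(src_words, word_vectors, excluded)
    b = sentence_embedding(gen_words, word_vectors, excluded)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```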

Word Overlap

We also utilize a simpler unigram overlap metric to evaluate the similarity between the source and generated sentences. Given a source sentence $x_{\mathrm{src}}$ and a style-transferred sentence $x_{\mathrm{tsf}}$, let $w_{\mathrm{src}}$ and $w_{\mathrm{tsf}}$ be the sets of unique words present in $x_{\mathrm{src}}$ and $x_{\mathrm{tsf}}$ respectively, excluding sentiment words (Hu and Liu 2004) and stopwords (Bird and Loper 2004). The word overlap score is then calculated as

$$\mathrm{score} = \frac{|w_{\mathrm{src}} \cap w_{\mathrm{tsf}}|}{|w_{\mathrm{src}} \cup w_{\mathrm{tsf}}|}$$

which is simply a normalized measure of overlapping unigrams in the source and target sentences.
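The metric reduces to a Jaccard coefficient over the filtered unigram sets, e.g.:

```python
def word_overlap(src_words, gen_words, excluded):
    # Jaccard overlap of unique unigrams, ignoring stopwords and sentiment words.
    a = {w for w in src_words if w not in excluded}
    b = {w for w in gen_words if w not in excluded}
    return len(a & b) / len(a | b) if (a | b) else 0.0
```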

Language Fluency

We use a trigram Kneser-Ney smoothed language model (Kneser and Ney 1995) as a quantifiable and automated metric for the quality of generated sentences. The model estimates the probability distribution of trigrams in a corpus based on their occurrence statistics. We train this language model on the complete corpus with which we evaluate our style-transfer models. The log-likelihood score that it assigns to a generated sentence is reported as the indicator of language fluency.
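A sketch of the fluency scorer using NLTK's Kneser-Ney implementation; the exact toolkit is not specified in the paper, and note that NLTK's logscore is log base 2.

```python
from nltk.lm import KneserNeyInterpolated
from nltk.lm.preprocessing import padded_everygram_pipeline, pad_both_ends
from nltk.util import ngrams

def train_fluency_lm(tokenized_sentences, order=3):
    # Fit a trigram Kneser-Ney language model on the evaluation corpus.
    train_data, vocab = padded_everygram_pipeline(order, tokenized_sentences)
    lm = KneserNeyInterpolated(order)
    lm.fit(train_data, vocab)
    return lm

def fluency_score(lm, sentence, order=3):
    # Sum of trigram log-probabilities for one generated (tokenized) sentence.
    padded = list(pad_both_ends(sentence, n=order))
    return sum(lm.logscore(gram[-1], gram[:-1]) for gram in ngrams(padded, order))
```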

Manual Evaluation of Language Quality

We perform manual evaluations by randomly sampling sentences generated from each model trained on the Yelp dataset, and asking 5 human annotators to rate them on a 1-5 Likert scale (5 = Flawless, 4 = Good, 3 = Adequate, 2 = Poor, 1 = Incomprehensible) (Stent, Marge, and Singhai 2005), based on their syntax, grammar, and fluency. The aggregate score across all evaluators is reported as a measure of language quality.

5 Discussion

5.1 Disentangling Latent Space

We first analyze how the style (sentiment) and content of the latent space are disentangled. We train classifiers on the different latent spaces, and report their inference-time classification accuracies in Table 5. These are results from the experiments on the Yelp dataset.

Random/Majority guess: 0.6018

Latent Space        DAE     VAE
Content space (c)   0.6137  0.6567
Style space (s)     0.7927  0.7911
Complete space (h)  0.7918  0.7918
Table 5: Style Classification Accuracy

We see that the 128-dimensional content vector $c$ is not particularly discriminative for style. It achieves accuracies that are only slightly better than the random/majority guess. However, the 8-dimensional style vector $s$, despite its low dimensionality, achieves significantly higher style classification accuracy. When combining content and style vectors, we achieve no further improvement. These results verify the effectiveness of our disentangling approach, because the style space does contain style information, whereas the content space does not.

We show t-SNE plots of both the deterministic autoencoder (DAE) and the variational autoencoder (VAE) models in Figure 3 and Figure 4, respectively. As can be seen from the t-SNE plots, sentences with different styles are separated much more cleanly in the style space (LHS) than in the content space (RHS), where they are indistinguishable. It is also evident that the latent space learned by the variational autoencoder is considerably smoother and more continuous than that learned by the deterministic autoencoder.

5.2 Style-Transfer Sentence Generation

We apply the disentangled latent space to a style-transfer sentence generation task, where the goal is to generate a sentence with different sentiment. We compare our approach with previous state-of-the-art work in Table 1 and Table 2. We replicated the experiments with their publicly available code and data.

We observe that the style embedding model (Fu et al. 2018) performs poorly on the style-transfer objective, which inflates its content preservation and word overlap scores. A qualitative analysis indicates that this model resorts to simply reconstructing most of the source sentences, and is not very effective at transferring style.

The results show that our approach achieves content preservation and word overlap scores comparable to previous work (Shen et al. 2017), and significantly better style-transfer strength scores than either of the compared models, showing that our disentangled latent space can be used for better style-transfer sentence generation. Our VAE model also produces the most fluent sentences for the Yelp dataset task, which is corroborated by the manual evaluation results.

Table 3 presents the results of an ablation test. We see that both the style adversarial loss and multi-task classification loss play a role in the strength of style-transfer, and that they can be combined to further improve performance. Also, the combination of all the losses described yields the best language fluency score. We observe that the bag-of-words adversary described in Section 3.4 does not provide much of an improvement in terms of our evaluation metrics. However, this idea can be improved upon in future work.

Model                               Language Quality
Cross-Alignment (Shen et al. 2017)  3.188
Style Embedding (Fu et al. 2018)    3.784
Ours (DAE)                          3.460
Ours (VAE)                          3.824
Table 6: Results - Manual Evaluation

The manual evaluation results are presented in Table 6. Our VAE model attains the best score for generated sentence quality amongst all the evaluated models. We also observe that the ranking of these models is positively correlated with the automated ‘Language Fluency’ metric presented in Table 1.

Some examples of style-transfer sentence generation are presented in Table 4. We see that, with the empirically estimated style vector, we can reliably control the sentiment of generated sentences.

6 Conclusion and Future Work

In this paper, we propose a simple yet effective approach for disentangling the latent space of neural networks using multi-task and adversarial objectives. Our learned disentanglement approach can be applied to text style-transfer tasks. It achieves similar content preservation scores, and significantly better style-transfer strength scores compared to previous state-of-the-art work.

For future work, we intend to evaluate the effects of disentangling the style space on datasets with more than two distinct styles. We would also like to explore the possibility of aligning each encoded style distribution to a unique prior, which could be sampled from at inference time for style transfer, as opposed to using the empirical mean of training-time style embeddings.

7 Acknowledgements

We acknowledge the support of the Natural Sciences and Engineering Research Council of Canada (NSERC) [261439-2013-RGPIN], and Amazon Research Award. The Titan Xp GPU used for this research was donated by the NVIDIA Corporation.

References

  • [Bahuleyan et al. 2018a] Bahuleyan, H.; Mou, L.; Vamaraju, K.; Zhou, H.; and Vechtomova, O. 2018a. Probabilistic natural language generation with wasserstein autoencoders. arXiv preprint arXiv:1806.08462.
  • [Bahuleyan et al. 2018b] Bahuleyan, H.; Mou, L.; Vechtomova, O.; and Poupart, P. 2018b. Variational attention for sequence-to-sequence models. Proceedings of the 27th International Conference on Computational Linguistics (COLING).
  • [Balikas, Moura, and Amini 2017] Balikas, G.; Moura, S.; and Amini, M.-R. 2017. Multitask learning for fine-grained twitter sentiment analysis. In SIGIR, 1005–1008.
  • [Bird and Loper 2004] Bird, S., and Loper, E. 2004. NLTK: The natural language toolkit. In Proceedings of the ACL 2004 on Interactive Poster and Demonstration Sessions, 31. Association for Computational Linguistics.
  • [Bowman et al. 2016] Bowman, S. R.; Vilnis, L.; Vinyals, O.; Dai, A.; Jozefowicz, R.; and Bengio, S. 2016. Generating sentences from a continuous space. In CoNLL, 10–21.
  • [Challenge 2013] Challenge, Y. D. 2013. Yelp dataset challenge.
  • [Champandard 2016] Champandard, A. J. 2016. Semantic style transfer and turning two-bit doodles into fine artworks. arXiv preprint arXiv:1603.01768.
  • [Chen et al. 2016] Chen, X.; Duan, Y.; Houthooft, R.; Schulman, J.; Sutskever, I.; and Abbeel, P. 2016. Infogan: Interpretable representation learning by information maximizing generative adversarial nets. In NIPS, 2172–2180.
  • [Cho et al. 2014] Cho, K.; van Merrienboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; and Bengio, Y. 2014. Learning phrase representations using rnn encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1724–1734.
  • [Fu et al. 2018] Fu, Z.; Tan, X.; Peng, N.; Zhao, D.; and Yan, R. 2018. Style transfer in text: Exploration and evaluation. In AAAI, 663–670.
  • [Gatys, Ecker, and Bethge 2016] Gatys, L. A.; Ecker, A. S.; and Bethge, M. 2016. Image style transfer using convolutional neural networks. In CVPR, 2414–2423.
  • [Goodfellow et al. 2014] Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; and Bengio, Y. 2014. Generative adversarial nets. In NIPS, 2672–2680.
  • [Hu and Liu 2004] Hu, M., and Liu, B. 2004. Mining and summarizing customer reviews. In KDD, 168–177.
  • [Hu et al. 2017] Hu, Z.; Yang, Z.; Liang, X.; Salakhutdinov, R.; and Xing, E. P. 2017. Toward controlled generation of text. In ICML, 1587–1596.
  • [Jernite, Bowman, and Sontag 2017] Jernite, Y.; Bowman, S. R.; and Sontag, D. 2017. Discourse-based objectives for fast unsupervised sentence representation learning. arXiv preprint arXiv:1705.00557.
  • [Kim 2014] Kim, Y. 2014. Convolutional neural networks for sentence classification. In EMNLP, 1746–1751.
  • [Kingma and Ba 2014] Kingma, D. P., and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  • [Kingma and Welling 2014] Kingma, D. P., and Welling, M. 2014. Auto-encoding variational bayes. International Conference on Learning Representations.
  • [Kneser and Ney 1995] Kneser, R., and Ney, H. 1995. Improved backing-off for m-gram language modeling. In ICASSP, volume 1, 181-184.
  • [Kulkarni et al. 2015] Kulkarni, T. D.; Whitney, W. F.; Kohli, P.; and Tenenbaum, J. 2015. Deep convolutional inverse graphics network. In Advances in Neural Information Processing Systems, 2539–2547.
  • [Kullback and Leibler 1951] Kullback, S., and Leibler, R. A. 1951. On information and sufficiency. The annals of mathematical statistics 22(1):79–86.
  • [Liu, Qiu, and Huang 2017] Liu, P.; Qiu, X.; and Huang, X. 2017. Adversarial multi-task learning for text classification. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, 1–10.
  • [Luan et al. 2017] Luan, F.; Paris, S.; Shechtman, E.; and Bala, K. 2017. Deep photo style transfer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 4990–4998.
  • [Luong et al. 2015] Luong, M.-T.; Le, Q. V.; Sutskever, I.; Vinyals, O.; and Kaiser, L. 2015. Multi-task sequence to sequence learning. arXiv preprint arXiv:1511.06114.
  • [Mathieu et al. 2016] Mathieu, M. F.; Zhao, J. J.; Zhao, J.; Ramesh, A.; Sprechmann, P.; and LeCun, Y. 2016. Disentangling factors of variation in deep representation using adversarial training. In Advances in Neural Information Processing Systems, 5040–5048.
  • [Shen et al. 2017] Shen, T.; Lei, T.; Barzilay, R.; and Jaakkola, T. 2017. Style transfer from non-parallel text by cross-alignment. In NIPS, 6833–6844.
  • [Stent, Marge, and Singhai 2005] Stent, A.; Marge, M.; and Singhai, M. 2005. Evaluating evaluation methods for generation in the presence of variation. In International Conference on Intelligent Text Processing and Computational Linguistics, 341–351. Springer.
  • [Sutskever, Vinyals, and Le 2014] Sutskever, I.; Vinyals, O.; and Le, Q. V. 2014. Sequence to sequence learning with neural networks. In NIPS, 3104–3112.
  • [Tieleman and Hinton 2012] Tieleman, T., and Hinton, G. 2012. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning 4(2):26–31.