Adversarial Symmetric Variational Autoencoder
A new form of variational autoencoder (VAE) is developed, in which the joint distribution of data and codes is considered in two (symmetric) forms: (i) from observed data fed through the encoder to yield codes, and (ii) from latent codes drawn from a simple prior and propagated through the decoder to manifest data. Lower bounds are learned for the marginal log-likelihood fits of observed data and latent codes. When learning with the variational bound, one seeks to minimize the symmetric Kullback-Leibler divergence of joint density functions from (i) and (ii), while simultaneously seeking to maximize the two marginal log-likelihoods. To facilitate learning, a new form of adversarial training is developed. An extensive set of experiments is performed, in which we demonstrate state-of-the-art data reconstruction and generation on several image benchmark datasets.
Recently there has been increasing interest in developing generative models of data, offering the promise of learning based on the often vast quantity of unlabeled data. With such learning, one typically seeks to build rich, hierarchical probabilistic models that are able to fit to the distribution of complex real data, and are also capable of realistic data synthesis.
Generative models are often characterized by latent variables (codes), and the variability in the codes encompasses the variation in the data. The generative adversarial network (GAN) employs a generative model in which the code is drawn from a simple distribution (e.g., isotropic Gaussian), and then the code is fed through a sophisticated deep neural network (decoder) to manifest the data. In the context of data synthesis, GANs have shown tremendous capabilities in generating realistic, sharp images from models that learn to mimic the structure of real data. The quality of GAN-generated images has been evaluated by somewhat ad hoc metrics like inception score.
However, the original GAN formulation does not allow inference of the underlying code, given observed data. This makes it difficult to quantify the quality of the generative model, as it is not possible to compute the quality of model fit to data. To provide a principled quantitative analysis of model fit, not only should the generative model synthesize realistic-looking data, one also desires the ability to infer the latent code given data (using an encoder). Recent GAN extensions  have sought to address this limitation by learning an inverse mapping (encoder) to project data into the latent space, achieving encouraging results on semi-supervised learning. However, these methods still fail to obtain faithful reproductions of the input data, partly due to model underfitting when learning from a fully adversarial objective .
Variational autoencoders (VAEs) are designed to learn both an encoder and decoder, leading to excellent data reconstruction and the ability to quantify a bound on the log-likelihood fit of the model to data. In addition, the inferred latent codes can be utilized in downstream applications, including classification and image captioning. However, new images synthesized by VAEs tend to be unspecific and/or blurry, with relatively low resolution. These limitations of VAEs are becoming increasingly understood. The traditional VAE seeks to maximize a lower bound on the log-likelihood of the generative model, and therefore VAEs inherit the limitations of maximum-likelihood (ML) learning. Specifically, in ML-based learning one optimizes the (one-way) Kullback-Leibler (KL) divergence between the distribution of the underlying data and the distribution of the model; such learning does not penalize a model that is capable of generating data that are different from those used for training.
Based on the above observations, it is desirable to build a generative-model learning framework with which one can compute and assess the log-likelihood fit to real (observed) data, while also being capable of generating synthetic samples of high realism. Since GANs and VAEs have complementary strengths, their integration appears desirable, and this is a principal contribution of this paper. While integration seems natural, we make important changes to both the VAE and GAN setups, to leverage the best of both. Specifically, we develop a new form of the variational lower bound, manifested jointly for the expected log-likelihood of the observed data and for the latent codes. Optimizing this variational bound involves maximizing the expected log-likelihood of the data and codes, while simultaneously minimizing a symmetric KL divergence involving the joint distribution of data and codes. To compute parts of this variational lower bound, a new form of adversarial learning is invoked. The proposed framework is termed Adversarial Symmetric VAE (AS-VAE), since within the model (i) the data and codes are treated in a symmetric manner, (ii) a symmetric form of KL divergence is minimized when learning, and (iii) adversarial training is utilized. To illustrate the utility of AS-VAE, we perform an extensive set of experiments, demonstrating state-of-the-art data reconstruction and generation on several benchmark datasets.
2 Background and Foundations
Consider an observed data sample $x$, modeled as being drawn from $p_\theta(x|z)$, with model parameters $\theta$ and latent code $z$. The prior distribution on the code is denoted $p(z)$, typically a distribution that is easy to draw from, such as an isotropic Gaussian. The posterior distribution on the code given data $x$ is $p_\theta(z|x)$, and since this is typically intractable, it is approximated as $q_\phi(z|x)$, parameterized by learned parameters $\phi$. Conditional distributions $q_\phi(z|x)$ and $p_\theta(x|z)$ are typically designed such that they are easily sampled and, for flexibility, modeled in terms of neural networks. Since $z$ is a latent code for $x$, $q_\phi(z|x)$ is also termed a stochastic encoder, with $p_\theta(x|z)$ a corresponding stochastic decoder. The observed data are assumed drawn from $q(x)$, for which we do not have an explicit form, but from which we have samples, i.e., the ensemble $\{x_i\}_{i=1}^{N}$ used for learning.
Our goal is to learn the model $p_\theta(x) = \int p_\theta(x|z)\,p(z)\,dz$ such that it synthesizes samples that are well matched to those drawn from $q(x)$. We simultaneously seek to learn a corresponding encoder $q_\phi(z|x)$ that is both accurate and efficient to implement. Samples $x$ are synthesized via $x \sim p_\theta(x|z)$ with $z \sim p(z)$; $z \sim q_\phi(z|x)$ provides an efficient coding of observed $x$ that may be used for other purposes (e.g., classification or caption generation when $x$ is an image).
2.1 Traditional Variational Autoencoders and Their Limitations
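Maximum likelihood (ML) learning of $\theta$ based on direct evaluation of $p_\theta(x)$ is typically intractable. The VAE seeks to bound $p_\theta(x)$ by maximizing the variational expression $\mathcal{L}_{\rm VAE}(\theta,\phi)$ with respect to parameters $\{\theta,\phi\}$, where
$$\mathcal{L}_{\rm VAE}(\theta,\phi) = \mathbb{E}_{q_\phi(x,z)}\left[\log\frac{p_\theta(x,z)}{q_\phi(z|x)}\right] = \mathbb{E}_{q(x)}\left[\log p_\theta(x) - \mathrm{KL}\big(q_\phi(z|x)\,\|\,p_\theta(z|x)\big)\right] = -\mathrm{KL}\big(q_\phi(x,z)\,\|\,p_\theta(x,z)\big) + \text{const}, \tag{1}$$
with expectations $\mathbb{E}_{q_\phi(x,z)}$ and $\mathbb{E}_{q(x)}$ performed approximately via sampling. Specifically, to evaluate $\mathbb{E}_{q_\phi(x,z)}$ we draw a finite set of samples $z_i \sim q_\phi(z_i|x_i)$, with $x_i \sim q(x)$ denoting the observed data, and for $\mathbb{E}_{q(x)}$ we directly use observed data $x_i \sim q(x)$. When learning $\{\theta,\phi\}$, the expectation using samples from $z_i \sim q_\phi(z_i|x_i)$ is implemented via the “reparametrization trick”.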
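To make the sampling-based evaluation concrete, the following is a minimal sketch of a reparametrized Monte Carlo estimate of the variational bound, i.e., the average of log p(x, z) - log q(z | x) over codes drawn from the encoder. The one-dimensional linear-Gaussian model, the function names, and all numerical values are assumptions chosen for illustration and are not taken from the experiments; this toy model is used only because its marginal log-likelihood is available in closed form, so the bound can be checked.

```python
import math
import random


def log_normal(v, mean, var):
    """Log density of a univariate Gaussian N(mean, var) evaluated at v."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (v - mean) ** 2 / var)


def elbo_estimate(x, theta, enc_scale, enc_std, n_samples=2000, seed=0):
    """Monte Carlo estimate of E_q[log p(x, z) - log q(z | x)].

    Toy model (assumed): p(z) = N(0, 1), p(x | z) = N(theta * z, 1),
    encoder q(z | x) = N(enc_scale * x, enc_std^2).
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        eps = rng.gauss(0.0, 1.0)
        # Reparametrization: z is a deterministic function of (x, eps),
        # so gradients with respect to encoder parameters could flow through z.
        z = enc_scale * x + enc_std * eps
        total += (
            log_normal(x, theta * z, 1.0)      # log p(x | z)
            + log_normal(z, 0.0, 1.0)          # log p(z)
            - log_normal(z, enc_scale * x, enc_std ** 2)  # log q(z | x)
        )
    return total / n_samples


theta, x = 2.0, 1.5
# For this toy model the marginal is p(x) = N(0, 1 + theta^2).
log_px = log_normal(x, 0.0, 1.0 + theta ** 2)
# Encoder equal to the exact posterior N(theta * x / (1 + theta^2), 1 / (1 + theta^2)).
exact = elbo_estimate(x, theta, theta / (1 + theta ** 2), math.sqrt(1 / (1 + theta ** 2)))
# Mismatched encoder that ignores x and samples from the prior.
loose = elbo_estimate(x, theta, 0.0, 1.0)
print(log_px, exact, loose)
```

When the encoder equals the exact posterior the bound is tight, and when the encoder is mismatched the estimate falls below the marginal log-likelihood by approximately the KL divergence between the encoder and the true posterior.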
Maximizing $\mathcal{L}_{\rm VAE}(\theta,\phi)$ wrt $\{\theta,\phi\}$ provides a lower bound on $\frac{1}{N}\sum_{i=1}^{N}\log p_\theta(x_i)$, hence the VAE setup is an approximation to ML learning of $\theta$. Learning $\theta$ based on $\frac{1}{N}\sum_{i=1}^{N}\log p_\theta(x_i)$ is equivalent to learning $\theta$ based on minimizing $\mathrm{KL}(q(x)\,\|\,p_\theta(x))$, again implemented in terms of the $N$ observed samples of $q(x)$. Such learning does not penalize $\theta$ severely for yielding $x$ of relatively high probability in $p_\theta(x)$ while being simultaneously of low probability in $q(x)$. This means that $\theta$ seeks to match $p_\theta(x)$ to the properties of the observed data samples, but $p_\theta(x)$ may also have high probability of generating samples that do not look like data drawn from $q(x)$. This is a fundamental limitation of ML-based learning, inherited by the traditional VAE in (Equation 1).
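The asymmetry described above can be seen in a small numerical sketch. The two three-outcome distributions below are made-up values for illustration: the "data" distribution puts almost no mass on the third outcome, while the "model" assigns it probability 0.2. The ML direction of the KL divergence barely notices this, whereas the reverse direction penalizes it heavily.

```python
import math


def kl(a, b):
    """KL(a || b) for two discrete distributions given as lists of probabilities."""
    return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)


# Hypothetical distributions over three outcomes.
data = [0.4995, 0.4995, 0.001]   # plays the role of the data distribution
model = [0.4, 0.4, 0.2]          # model that also generates "unrealistic" outcome 3

forward_kl = kl(data, model)   # direction minimized by ML learning
reverse_kl = kl(model, data)   # direction that penalizes unrealistic samples
print(forward_kl, reverse_kl)
```

This is the motivation for considering both directions of the divergence rather than the ML direction alone.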
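One reason for the failing of ML-based learning of $\theta$ is that the cumulative posterior on latent codes $\int p_\theta(z|x)\,q(x)\,dx \approx \int q_\phi(z|x)\,q(x)\,dx = q_\phi(z)$ is typically different from $p(z)$, which implies that $x \sim p_\theta(x|z)$ with $z \sim p(z)$ may yield samples $x$ that are different from those generated from $q(x)$. Hence, when learning $\{\theta,\phi\}$ one may seek to match $p_\theta(x)$ to samples of $q(x)$, as done in (Equation 1), while simultaneously matching $q_\phi(z)$ to samples of $p(z)$. The expression in (Equation 1) provides a variational bound for matching $p_\theta(x)$ to samples of $q(x)$, thus one may naively think to simultaneously set a similar variational expression for $q_\phi(z)$, with these two variational expressions optimized jointly. However, to compute this additional variational expression we require an analytic expression for $q_\phi(x,z) = q_\phi(z|x)\,q(x)$, which also means we need an analytic expression for $q(x)$, which we do not have.
Examining (Equation 1), we also note that $\mathcal{L}_{\rm VAE}(\theta,\phi)$ approximates $-\mathrm{KL}(q_\phi(x,z)\,\|\,p_\theta(x,z))$, which has limitations aligned with those discussed above for ML-based learning of $\theta$. Analogous to the above discussion, we would also like to consider $-\mathrm{KL}(p_\theta(x,z)\,\|\,q_\phi(x,z))$. So motivated, in Section 3 we develop a new form of variational lower bound, applicable to maximizing $\frac{1}{N}\sum_{i=1}^{N}\log p_\theta(x_i)$ and $\frac{1}{M}\sum_{j=1}^{M}\log q_\phi(z_j)$, where $z_j \sim p(z)$ is the $j$-th of $M$ samples from $p(z)$. We demonstrate that this new framework leverages both $\mathrm{KL}(p_\theta(x,z)\,\|\,q_\phi(x,z))$ and $\mathrm{KL}(q_\phi(x,z)\,\|\,p_\theta(x,z))$, by extending ideas from adversarial networks.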
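The original idea of GAN was to build an effective generative model $p_\theta(x|z)$, with $z \sim p(z)$, as discussed above. There was no desire to simultaneously design an inference network $q_\phi(z|x)$. More recently, authors have devised adversarial networks that seek both $p_\theta(x|z)$ and $q_\phi(z|x)$. As an important example, Adversarial Learned Inference (ALI) considers the following objective function:
$$\min_{\theta,\phi}\max_{\psi}\ \mathcal{L}_{\rm ALI}(\theta,\phi,\psi) = \mathbb{E}_{q_\phi(x,z)}\big[\log\sigma(f_\psi(x,z))\big] + \mathbb{E}_{p_\theta(x,z)}\big[\log\big(1-\sigma(f_\psi(x,z))\big)\big], \tag{2}$$
where the expectations are approximated with samples, as in (Equation 1). The function $f_\psi(x,z)$, termed a discriminator, is typically implemented using a neural network with parameters $\psi$. Note that in (Equation 2) we need only sample from $p_\theta(x,z) = p_\theta(x|z)\,p(z)$ and $q_\phi(x,z) = q_\phi(z|x)\,q(x)$, avoiding the need for an explicit form for $q(x)$.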
The framework in (Equation 2) can, in theory, match $p_\theta(x,z)$ and $q_\phi(x,z)$ by finding a Nash equilibrium of their respective non-convex objectives. However, training of such adversarial networks is typically based on stochastic gradient descent, which is designed to find a local mode of a cost function, rather than locating an equilibrium. This objective mismatch may lead to the well-known instability issues associated with GAN training.
To alleviate this problem, some researchers add a regularization term, such as reconstruction loss  or mutual information , to the GAN objective, to restrict the space of suitable mapping functions, thus avoiding some of the failure modes of GANs, i.e., mode collapsing. Below we will formally match the joint distributions as in (Equation 2), and reconstruction-based regularization will be manifested by generalizing the VAE setup via adversarial learning. Toward this goal we consider the following lemma, which is analogous to Proposition 1 in .
Under Lemma 1, we are able to estimate $\log q_\phi(x,z) - \log p_\theta(x)p(z)$ and $\log p_\theta(x,z) - \log q(x)q_\phi(z)$ using the following corollary.
The proof is provided in Appendix A. We also assume in Corollary 1.1 that $f_{\psi_1}(x,z)$ and $f_{\psi_2}(x,z)$ are sufficiently flexible such that there are parameters $\psi_1^*$ and $\psi_2^*$ capable of achieving the equalities stated there. Toward that end, $f_{\psi_1}$ and $f_{\psi_2}$ are implemented as $\psi_1$- and $\psi_2$-parameterized neural networks (details below), to encourage universal approximation.
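The role of the discriminator as a log density-ratio estimator can be checked numerically. The sketch below assumes the standard logistic form of the adversarial objective, in which samples from one density `p` are scored with log σ(f) and samples from another density `q` with log(1 - σ(f)); at a fixed point the contribution to the objective is p·log σ(f) + q·log(1 - σ(f)), and the maximizing f should equal log p - log q. The function names and density values are hypothetical.

```python
import math


def log_sigmoid(f):
    """Numerically simple log(sigmoid(f)) for moderate |f|."""
    return -math.log1p(math.exp(-f))


def pointwise_objective(f, p, q):
    """Contribution of one point to the logistic objective.

    p and q are the two densities at that point; log(1 - sigmoid(f)) = log_sigmoid(-f).
    """
    return p * log_sigmoid(f) + q * log_sigmoid(-f)


def best_f(p, q, lo=-10.0, hi=10.0, steps=20001):
    """Grid search for the discriminator output that maximizes the objective."""
    grid = [lo + (hi - lo) * i / (steps - 1) for i in range(steps)]
    return max(grid, key=lambda f: pointwise_objective(f, p, q))


p_val, q_val = 0.3, 0.1           # hypothetical density values at one point
f_star = best_f(p_val, q_val)
print(f_star, math.log(p_val / q_val))
```

Because the pointwise objective is concave in f, the grid maximizer lies within one grid step of the analytic optimum log(p/q); a sufficiently flexible network trained on samples approximates the same quantity without requiring either density in closed form.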
3 Adversarial Symmetric Variational Auto-Encoder (AS-VAE)
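Consider variational expressions
$$\mathcal{L}_x(\theta,\phi) = \mathbb{E}_{x\sim q(x)}\log p_\theta(x) - \mathrm{KL}\big(q_\phi(x,z)\,\|\,p_\theta(x,z)\big), \tag{3}$$
$$\mathcal{L}_z(\theta,\phi) = \mathbb{E}_{z\sim p(z)}\log q_\phi(z) - \mathrm{KL}\big(p_\theta(x,z)\,\|\,q_\phi(x,z)\big),$$
where all expectations are again performed approximately using samples from $q(x)$ and $p(z)$. Recall that $\mathbb{E}_{q(x)}\log p_\theta(x) = -\mathrm{KL}(q(x)\,\|\,p_\theta(x)) + \text{const}$ and $\mathbb{E}_{p(z)}\log q_\phi(z) = -\mathrm{KL}(p(z)\,\|\,q_\phi(z)) + \text{const}$, thus $\mathcal{L}_x$ is maximized when $q(x) = p_\theta(x)$ and $q_\phi(x,z) = p_\theta(x,z)$. Similarly, $\mathcal{L}_z$ is maximized when $p(z) = q_\phi(z)$ and $q_\phi(x,z) = p_\theta(x,z)$. Hence, $\mathcal{L}_x$ and $\mathcal{L}_z$ impose desired constraints on both the marginal and joint distributions. Note that the log-likelihood terms in $\mathcal{L}_x$ and $\mathcal{L}_z$ are analogous to the data-fit regularizers discussed above in the context of ALI, but here implemented in a generalized form of the VAE. Direct evaluation of $\mathcal{L}_x$ and $\mathcal{L}_z$ is not possible, as it requires an explicit form for $q(x)$ to evaluate $q_\phi(x,z)$.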
One may readily demonstrate that
A similar expression holds for , in terms of . This naturally suggests the cumulative variational expression
where and are updated using the adversarial objectives in and , respectively.
Note that to evaluate we must be able to sample from and , both of which are readily available, as discussed above. Further, we require explicit expressions for and , which we have. For and we similarly must be able to sample from the distributions involved, and we must be able to evaluate and , each of which is implemented via a neural network. Note as well that the bound in for is in terms of the KL distance between conditional distributions and , while utilizes the KL distance between joint distributions and (use of joint distributions is related to ALI). By combining and , the complete variational bound employs the symmetric KL between these two joint distributions. By contrast, from , the original variational lower bound only addresses a one-way KL distance between and . While  had a similar idea of employing adversarial methods in the context variational learning, it was only done within the context of the original form in , the limitations of which were discussed in Section ?.
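As a sketch of why a discriminator makes the data-side bound computable, assume (consistent with the description above, though the symbols here are reconstructed rather than quoted) that the data-side variational expression is $\mathcal{L}_x = \mathbb{E}_{q(x)}[\log p_\theta(x)] - \mathrm{KL}(q_\phi(x,z)\,\|\,p_\theta(x,z))$, with $q_\phi(x,z) = q_\phi(z|x)q(x)$ and $p_\theta(x,z) = p_\theta(x|z)p(z)$. Since the $x$-marginal of $q_\phi(x,z)$ is $q(x)$, both terms can be written under a single expectation:

```latex
\begin{aligned}
\mathcal{L}_x
 &= \mathbb{E}_{q_\phi(x,z)}\big[\log p_\theta(x) + \log p_\theta(x,z) - \log q_\phi(x,z)\big] \\
 &= \mathbb{E}_{q_\phi(x,z)}\big[\log p_\theta(x|z) + \log p_\theta(x)\,p(z) - \log q_\phi(x,z)\big] \\
 &= \mathbb{E}_{q_\phi(x,z)}\Big[\log p_\theta(x|z)
    - \big(\log q_\phi(x,z) - \log p_\theta(x)\,p(z)\big)\Big].
\end{aligned}
```

The first term is the explicit decoder likelihood, and the bracketed difference is a log density ratio between the encoder joint and the product of marginals $p_\theta(x)p(z)$, which is the kind of quantity a discriminator can estimate from samples alone. The same argument with the roles of $x$ and $z$ exchanged applies to the code-side expression.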
In the original VAE, in which (Equation 1) was optimized, the reparametrization trick  was invoked wrt , with samples and , as the expectation was performed wrt this distribution; this reparametrization is convenient for computing gradients wrt . In the AS-VAE in (Equation 4), expectations are also needed wrt . Hence, to implement gradients wrt , we also constitute a reparametrization of . Specifically, we consider samples with . in is re-expressed as
The expectations in are approximated via samples drawn from and , as well as samples of and . and can be implemented with a Gaussian assumption  or via density transformation , detailed when presenting experiments in Section ?.
The complete objective of the proposed Adversarial Symmetric VAE (AS-VAE) requires the cumulative variational in , which we maximize wrt and as in and , using the results in . Hence, we write
The following proposition characterizes the solutions of in terms of the joint distributions of and .
The proof is provided in Appendix A. This theoretical result implies that (i) $\theta^*$ is an estimator that yields good reconstruction, and (ii) $\phi^*$ matches the aggregated posterior $q_\phi(z)$ to the prior distribution $p(z)$.
VAEs represent one of the most successful deep generative models developed recently. Aided by the reparameterization trick, VAEs can be trained with stochastic gradient descent. The original VAEs implement a Gaussian assumption for the encoder. More recently, there has been a desire to remove this Gaussian assumption. Normalizing flow employs a sequence of invertible transformations to make the distribution of the latent codes arbitrarily flexible. This work was followed by inverse auto-regressive flow, which uses recurrent neural networks to make the latent codes more expressive. More recently, SteinVAE applies Stein variational gradient descent to infer the distribution of latent codes, discarding the assumption of a parametric form of posterior distribution for the latent code. However, these methods are not able to address the fundamental limitation of ML-based models, as they are all based on the variational formulation in (Equation 1).
GANs constitute another recent framework for learning a generative model. Recent extensions of GAN have focused on boosting the performance of image generation by improving the generator, discriminator or the training algorithm. More recently, some researchers have employed a bidirectional network structure within the adversarial learning framework, which in theory guarantees the matching of joint distributions over two domains. However, non-identifiability issues have been raised for these methods. For example, they have difficulties in providing good reconstruction in latent variable models, or discovering the correct pairing relationship in domain transformation tasks. It was shown that these problems are alleviated in DiscoGAN, CycleGAN and ALICE via additional $\ell_1$, $\ell_2$ or adversarial losses. However, these methods lack explicit probabilistic modeling of observations, and thus cannot directly evaluate the likelihood of given data samples.
A key component of the proposed framework concerns integrating a new VAE formulation with adversarial learning. There are several recent approaches that have tried to combine VAE and GAN; Adversarial Variational Bayes (AVB) is the one most closely related to our work. AVB employs adversarial learning to estimate the posterior of the latent codes, which makes the encoder arbitrarily flexible. However, AVB seeks to optimize the original VAE formulation in (Equation 1), and hence it inherits the limitations of ML-based learning of $\theta$. Unlike AVB, the proposed use of adversarial learning is based on a new VAE setup that seeks to minimize the symmetric KL distance between $p_\theta(x,z)$ and $q_\phi(x,z)$, while simultaneously seeking to maximize the marginal expected likelihoods $\mathbb{E}_{q(x)}[\log p_\theta(x)]$ and $\mathbb{E}_{p(z)}[\log q_\phi(z)]$.
We evaluate our model on three datasets: MNIST, CIFAR-10 and ImageNet. To balance performance and computational cost, $p_\theta(x|z)$ and $q_\phi(z|x)$ are approximated with a normalizing flow of length 80 for the MNIST dataset, and a Gaussian approximation for CIFAR-10 and ImageNet data. All network architectures are provided in Appendix B. All parameters were initialized with Xavier, and optimized via Adam with learning rate 0.0001. We do not perform any dataset-specific tuning or regularization other than dropout. Early stopping is employed based on average reconstruction loss of $x$ and $z$ on validation sets.
We show three types of results, using part of or all of our model to illustrate each component. (i) AS-VAE-r: this model is trained with only the first half of the objective, i.e., the variational expression for the data; it is an ML-based method which focuses on reconstruction. (ii) AS-VAE-g: this model is trained with only the second half of the objective, i.e., the variational expression for the codes; it can be considered as maximizing the likelihood of $q_\phi(z)$, and is designed for generation. (iii) AS-VAE: this is our proposed model, developed in Section 3.
We evaluate our model on both reconstruction and generation. The performance of the former is evaluated using negative log-likelihood (NLL) estimated via the variational lower bound defined in (Equation 1). Images are modeled as continuous. To do this, we add $[0,1]$-uniform noise to natural images (one color channel at a time), then divide by 256 to map 8-bit images (256 levels) to the unit interval. This technique is widely used in applications involving natural images, since it can be proved that in terms of log-likelihood, modeling in the discrete space is equivalent to modeling in the continuous space (with added noise). During testing, the likelihood is computed as $p(x = i\,|\,z) = p_\theta\big(x \in [i/256, (i+1)/256]\,\big|\,z\big)$ where $i = 0, \dots, 255$. This is done to guarantee a fair comparison with prior work (that assumed quantization). For the MNIST dataset, we treat the $[0,1]$-mapped continuous input as the probability of a binary pixel value (on or off). The inception score (IS), defined as $\exp\big(\mathbb{E}_{q(x)}\,\mathrm{KL}(p(y|x)\,\|\,p(y))\big)$, is employed to quantitatively evaluate the quality of generated natural images, where $p(y)$ is the empirical distribution of labels (we do not leverage any label information during training) and $p(y|x)$ is the output of the Inception model on each generated image.
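The two evaluation ingredients described above can be sketched in a few lines. The first function applies the uniform-noise dequantization to 8-bit pixel values; the second computes the inception score from a list of per-image class-probability vectors using the standard definition (exponential of the average KL divergence between each conditional label distribution and the marginal label distribution). The function names and the toy probability vectors are assumptions for illustration; in practice the probabilities would come from a pretrained Inception network, which is not included here.

```python
import math
import random


def dequantize(pixels, rng):
    """Map 8-bit values {0, ..., 255} to [0, 1) by adding U[0, 1) noise and dividing by 256."""
    return [(v + rng.random()) / 256.0 for v in pixels]


def inception_score(cond_probs):
    """exp( mean_x KL( p(y|x) || p(y) ) ), with p(y) the average of the p(y|x)."""
    n = len(cond_probs)
    k = len(cond_probs[0])
    marginal = [sum(p[c] for p in cond_probs) / n for c in range(k)]
    mean_kl = sum(
        sum(pc * math.log(pc / marginal[c]) for c, pc in enumerate(p) if pc > 0)
        for p in cond_probs
    ) / n
    return math.exp(mean_kl)


rng = random.Random(0)
noisy = dequantize([0, 17, 128, 255], rng)

# Toy predictions: confident and diverse versus completely uninformative.
sharp_and_diverse = [[1.0, 0.0], [0.0, 1.0]]
uninformative = [[0.5, 0.5], [0.5, 0.5]]
score_high = inception_score(sharp_and_diverse)
score_low = inception_score(uninformative)
print(noisy, score_high, score_low)
```

The score is bounded below by 1 (identical predictions for every image) and above by the number of classes (confident predictions spread evenly over all classes), which is why a higher value is read as better generation quality.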
To the authors’ knowledge, we are the first to report both inception score (IS) and NLL for natural images from a single model. For comparison, we implemented DCGAN and PixelCNN++ as baselines. The implementation of DCGAN is based on a network architecture similar to that of our model. Note that for NLL a lower value is better, whereas for IS a higher value is better.
| Method | NF (k=80) | IAF | AVB | PixelRNN | AS-VAE-r | AS-VAE-g | AS-VAE |
We first evaluate our model on the MNIST dataset. The log-likelihood results are summarized in Table ?. Our AS-VAE achieves a negative log-likelihood of 82.51 nats, outperforming normalizing flow (85.1 nats) with a similar architecture. The performance of AS-VAE-r (81.14 nats) is competitive with the state-of-the-art (79.2 nats). The generated samples are shown in Figure 1. AS-VAE-g and AS-VAE both generate good samples, while the results of AS-VAE-r are slightly more blurry, partly due to the fact that AS-VAE-r is an ML-based model.
Next we evaluate our models on the CIFAR-10 dataset. The quantitative results are listed in Table ?. AS-VAE-r and AS-VAE-g achieve encouraging results on reconstruction and generation, respectively, while our AS-VAE model (leveraging the full objective) achieves a good balance between these two tasks, which demonstrates the benefit of optimizing a symmetric objective. Compared with state-of-the-art ML-based models , we achieve competitive results on reconstruction but provide a much better performance on generation, also outperforming other adversarially-trained models. Note that our negative ELBO (evidence lower bound) is an upper bound of NLL as reported in . We also achieve a smaller root-mean-square-error (RMSE). Generated samples are shown in Figure ?. Additional results are provided in the Appendix C.
ALI, which also seeks to match the joint encoder and decoder distributions, is also implemented as a baseline. Since the decoder in ALI is a deterministic network, the NLL of ALI is impractical to compute. Alternatively, we report the RMSE of reconstruction as a reference. Figure 3 qualitatively compares the reconstruction performance of our model, ALI and VAE. As can be seen, the reconstruction of ALI is related to, but not a faithful reproduction of, the input data, which evidences the limited reconstruction ability of adversarial learning. This is also consistent with the RMSE results.
ImageNet 2012 is used to evaluate the scalability of our model to large datasets. The images are resized to . The quantitative results are shown in Table ?. Our model significantly improves the performance on generation compared with DCGAN and PixelCNN++, while achieving competitive results on reconstruction compared with PixelRNN and PixelCNN++.
Note that the PixelCNN++ takes more than two weeks (44 hours per epoch) for training and 52.0 seconds/image for generating samples while our model only requires less than 2 days (4 hours per epoch) for training and 0.01 seconds/image for generating on a single TITAN X GPU. As a reference, the true validation set of ImageNet 2012 achieves accuracy. This is because ImageNet has much greater variety of images than CIFAR-10. Figure 4 shows generated samples based on trained with ImageNet, compared with DCGAN and PixelCNN++. Our model is able to produce sharp images without label information while capturing more local spatial dependencies than PixelCNN++, and without suffering from mode collapse as DCGAN. Additional results are provided in the Appendix C.
We presented Adversarial Symmetric Variational Autoencoders, a novel deep generative model for unsupervised learning. The learning objective is to minimize a symmetric KL divergence between the joint distributions of data and latent codes from the encoder and decoder, while simultaneously maximizing the expected marginal likelihood of data and codes. An extensive set of results demonstrated excellent performance on both reconstruction and generation, while scaling to large datasets. A possible direction for future work is to apply AS-VAE to semi-supervised learning tasks.
This research was supported in part by ARO, DARPA, DOE, NGA, ONR and NSF.
Proof of Corollary 1.1 We start from a simple observation . The second term in (5) of main paper can be rewritten as
Therefore, the objective function in (5) can be expressed as
This integral of (Equation 8) is maximal as a function of if and only if the integrand is maximal for every . Note that the problem achieves maximum at and . Hence, we have the optimal function of at
Similarly, we have
Proof of Proposition 1 If achieves an equilibrium of (12) of main paper. The Corollary 1.1 indicates that and .
where and can be considered as constant. Therefore, maximize is equivalent to minimize
The minimum of first two terms is achieved if and only if while the minimums of last two terms are achieved at and , respectively. Note that the joint match is achieved, the marginals also matches which indicates the optimal is achieved if and only if .
The model architectures are shown as following. For and , we use the same architecture but the parameters are not shared.
- A deep generative deconvolutional image model.
Y. Pu, X. Yuan, A. Stevens, C. Li, and L. Carin. Artificial Intelligence and Statistics (AISTATS), 2016.
- Generative deep deconvolutional learning.
Y. Pu, X. Yuan, and L. Carin. In ICLR workshop, 2015.
- Generative adversarial nets.
I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. In NIPS, 2014.
- Infogan: Interpretable representation learning by information maximizing generative adversarial nets.
X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel. In NIPS, 2016.
- Unsupervised representation learning with deep convolutional generative adversarial networks.
A. Radford, L. Metz, and S. Chintala. In ICLR, 2016.
- Generative adversarial text to image synthesis.
S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee. In ICML, 2016.
- Adversarial feature matching for text generation.
Y. Zhang, Z. Gan, K. Fan, Z. Chen, R. Henao, D. Shen, and L. Carin. In ICML, 2017.
- Generating text with adversarial training.
Y. Zhang, Z. Gan, and L. Carin. In NIPS workshop, 2016.
- Improved techniques for training gans.
T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. In NIPS, 2016.
- Adversarially learned inference.
V. Dumoulin, I. Belghazi, B. Poole, O. Mastropietro, A. Lamb, M. Arjovsky, and A. Courville. In ICLR, 2017.
- Adversarial feature learning.
J. Donahue, P. Krähenbühl, and T. Darrell. In ICLR, 2017.
- Auto-encoding variational Bayes.
D. P. Kingma and M. Welling. In ICLR, 2014.
- Stochastic backpropagation and approximate inference in deep generative models.
D. J. Rezende, S. Mohamed, and D. Wierstra. In ICML, 2014.
- Variational inference with normalizing flows.
D.J. Rezende and S. Mohamed. In ICML, 2015.
- Importance weighted autoencoders.
Y. Burda, R. Grosse, and R. Salakhutdinov. In ICLR, 2016.
- Improving variational inference with inverse autoregressive flow.
D. P. Kingma, T. Salimans, R. Jozefowicz, X. Chen, I. Sutskever, and M. Welling. In NIPS, 2016.
- Deconvolutional paragraph representation learning.
Y. Zhang, D. Shen, G. Wang, Z. Gan, R. Henao, and L. Carin. In NIPS, 2017.
- Symmetric variational autoencoder and connections to adversarial learning.
L. Chen, S. Dai, Y. Pu, C. Li, Q. Su, and L. Carin. In arXiv, 2017.
- Deconvolutional latent-variable model for text sequence matching.
D. Shen, Y. Zhang, R. Henao, Q. Su, and L. Carin. In arXiv, 2017.
- Semi-supervised learning with deep generative models.
D.P. Kingma, D.J. Rezende, S. Mohamed, and M. Welling. In NIPS, 2014.
- Variational autoencoder for deep learning of images, labels and captions.
Y. Pu, Z. Gan, R. Henao, X. Yuan, C. Li, A. Stevens, and L. Carin. In NIPS, 2016.
- Towards principled methods for training generative adversarial networks.
M. Arjovsky and L. Bottou. In ICLR, 2017.
- Adversarial variational bayes: Unifying variational autoencoders and generative adversarial networks.
L. Mescheder, S. Nowozin, and A. Geiger. In arXiv, 2016.
- Learning to discover cross-domain relations with generative adversarial networks.
T. Kim, M. Cha, H. Kim, J. Lee, and J. Kim. In arXiv, 2017.
- Triple generative adversarial nets.
C. Li, K. Xu, J. Zhu, and B. Zhang. In arXiv, 2017.
- Unpaired image-to-image translation using cycle-consistent adversarial networks.
JY Zhu, T. Park, P. Isola, and A. Efros. In arXiv, 2017.
- Multilayer feedforward networks are universal approximators.
K. Hornik, M. Stinchcombe, and H. White. Neural networks, 1989.
- Vae learning via stein variational gradient descent.
Y. Pu, Z. Gan, R. Henao, C. Li, S. Han, and L. Carin. In NIPS, 2017.
- Stein variational gradient descent: A general purpose bayesian inference algorithm.
Q. Liu and D. Wang. In NIPS, 2016.
- Energy-based generative adversarial network.
J. Zhao, M. Mathieu, and Y. LeCun. In ICLR, 2017.
- Wasserstein gan.
M. Arjovsky, S. Chintala, and L. Bottou. In arXiv, 2017.
- Triangle generative adversarial networks.
Z. Gan, L. Chen, W. Wang, Y. Pu, Y. Zhang, H. Liu, C. Li, and L. Carin. In NIPS, 2017.
- Alice: Towards understanding adversarial learning for joint distribution matching.
C. Li, H. Liu, C. Chen, Y. Pu, L. Chen, R. Henao, and L. Carin. In NIPS, 2017.
- Adversarial autoencoders.
A. Makhzani, J. Shlens, N. Jaitly, I. Goodfellow, and B. Frey. In arXiv, 2015.
- Autoencoding beyond pixels using a learned similarity metric.
A. B. L. Larsen, S. K. Sønderby, H. Larochelle, and O. Winther. In ICML, 2016.
- Understanding the difficulty of training deep feedforward neural networks.
X. Glorot and Y. Bengio. In AISTATS, 2010.
- Adam: A method for stochastic optimization.
D. Kingma and J. Ba. In ICLR, 2015.
- Dropout: A simple way to prevent neural networks from overfitting.
N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. JMLR, 2014.
- Pixel recurrent neural network.
A. Oord, N. Kalchbrenner, and K. Kavukcuoglu. In ICML, 2016.
- Pixelcnn++: Improving the pixelcnn with discretized logistic mixture likelihood and other modifications.
T. Salimans, A. Karpathy, X. Chen, and D. P. Kingma. In ICLR, 2017.
- A note on the evaluation of generative models.
L. Theis, A. Oord, and M. Bethge. In ICLR, 2016.
- Going deeper with convolutions.
C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. In CVPR, 2015.
- Generalization and equilibrium in generative adversarial nets.
S. Arora, R. Ge, Y. Liang, T. Ma, and Y. Zhang. In arXiv, 2017.