Stacked Wasserstein Autoencoder
Approximating distributions over complicated manifolds, such as natural images, is conceptually attractive. Deep latent variable models, trained with variational autoencoders and generative adversarial networks, are now a key technique for representation learning. However, it is difficult to unify these two models for exact latent-variable inference and to perform well at both reconstruction and sampling, partly because the regularization imposed on the latent variables forces them to match a simple explicit prior distribution. Such approaches are prone to oversimplification and can characterize only a few modes of the true distribution. Building on the recently proposed Wasserstein autoencoder (WAE), which casts regularization as an optimal transport problem, this paper proposes a stacked Wasserstein autoencoder (SWAE) to learn a deep latent variable model. SWAE is a hierarchical model that relaxes the optimal transport constraints at two stages. At the first stage, SWAE flexibly learns a representation distribution, i.e., the encoded prior; at the second stage, the encoded representation distribution is approximated with a latent variable model under a regularization encouraging the latent distribution to match the explicit prior. This model allows us to generate natural image outputs as well as perform manipulations in the latent space to induce changes in the output space. Both quantitative and qualitative results demonstrate the superior performance of SWAE over state-of-the-art approaches in terms of faithful reconstruction and generation quality.
1 Introduction

Recent work on deep latent variable models, such as variational autoencoders (VAEs) and generative adversarial networks (GANs), has shown significant progress in learning smooth representations of complex, high-dimensional data. These latent variable representations make it possible to apply smooth transformations in latent space that produce complex modifications of the generated outputs while remaining on the data manifold. Learning latent variable models is a challenging problem. Initial work on VAEs showed that optimization is difficult when there are large variations in the data distribution, as the generative model can easily degenerate and produce blurry reconstructions. In contrast, GANs, which come without an encoder, have generated more impressive results in terms of the visual quality of images sampled from the model.
Specifically, most existing methods are designed to approximate the data distribution at a single scale. Because directly approximating a high-resolution data distribution such as images is difficult, most previous methods are limited to generating low-resolution images. To circumvent this difficulty, we observe that real-world data, especially natural images, can be modeled at different scales.
In this work, we propose a two-stage regularized autoencoder. The proposed model is built on the theoretical analysis presented in [30, 14]. Similar to the ARAE, our model provides flexibility in learning an autoencoder from the input space at the first stage. The encoder is adversarially regularized to encode a continuous latent space without explicit structure. On top of this encoded prior space, we stack another autoencoder to approximate the learned prior distribution with an explicitly simple distribution, such as a Gaussian.
Under this two-stage setup, the stacked Wasserstein autoencoder (SWAE) approximates the data space at two scales. It first learns a flexible autoencoder, which tends to produce faithful reconstructions of the inputs; the encoded representation, however, does not follow an explicit distribution. Taking this flexibly learned representation as a prior, we then learn a latent variable model that approximates this simplified low-dimensional distribution, with a regularization encouraging the encoded distribution to match an explicit prior, e.g., a Gaussian or uniform distribution. By combining the two models, we are able to generate varied unseen samples from random draws of the explicit prior, and to produce consistent image manipulations by moving around the latent space via interpolation and offset vector arithmetic. Extensive experiments demonstrate the effectiveness of our method in terms of the image quality of generation and reconstruction. The main contributions of this work are listed below.
A novel latent variable model, named the stacked Wasserstein autoencoder (SWAE), is proposed to approximate the complex and high-dimensional data distribution.
The optimal transport is minimized at two stages. This two-step setting encourages approximating the data space while endowing the encoded latent distribution with a nice explicit manifold structure.
We experimentally show that the SWAE model learns semantically meaningful latent variables of the observed data, enables interpolation of the latent representation and semantic manipulation, and generalizes to sampling unobserved data.
The remainder of this paper is organized as follows. We describe the background of this problem and review the recent literature in Section 2. The proposed approach is elaborated in detail in Section 3. Section 4 presents both the qualitative and quantitative results and analysis. Finally, the paper is concluded in Section 5.
2 Background and Related Work
Deep generative models have recently received increasing attention. They learn to approximate implicit probability distributions. Given data samples x ~ p_d(x), where p_d is the true but unknown distribution observed only through the training data, the purpose of a generative model is to fit the data samples with model parameters θ and a random code z sampled from an explicit prior distribution p(z). This process is denoted by x = G_θ(z), and training amounts to fitting a neural network G_θ that maps the representation vectors z to data x.
2.1 Regularized Autoencoder
Unregularized autoencoders (AEs) can learn a near-identity mapping whose encoded latent code space compactly captures the meaningful features needed to represent the observed data. However, this latent code space is free of any structure, which degrades the ability to sample from it. One popular solution is to regularize the code space through an explicit prior and employ a variational approximation to the posterior, leading to a family of models called variational autoencoders (VAEs).
The VAE formulation relies on a random encoder mapping Q(z|x) and uses the 'reparametrization trick' to optimize its parameters. Minimizing the KL divergence drives Q(z|x) to match the prior p(z), so the solution converges close to the optimum. One possible extension is to instead force the aggregate (mixture) posterior Q_Z to match the prior. With this observation, AAE and WAE regularize the latent code space with adversarial training. WAE minimizes a relaxed optimal transport by penalizing the divergence between Q_Z and P_Z as

    D_WAE(P_X, P_G) = inf_{Q(Z|X)} E_{P_X} E_{Q(Z|X)} [ c(X, G(Z)) ] + λ · D_Z(Q_Z, P_Z).
This formulation matches the encoded distribution of the training examples to the prior, as measured by any specified divergence, in order to guarantee that the latent codes provided to the decoder are informative enough to reconstruct the encoded training examples. It also admits non-random encoders that deterministically map inputs to their latent codes. This gives rise to the potential of unifying the two types of generative models [11, 27, 20, 25] in one framework. There is also work on making the prior more flexible through explicit parameterization, and it has been shown that standard deep architectures can adversarially approximate the latent space and explicitly represent factors of variation for image generation.
2.2 Generative Adversarial Network
Deep neural network models have shown great success in many pattern recognition [38, 36, 39, 44] and computer vision applications [3, 9, 24, 37, 34]. The deep generative network is one of the most successful models for a large variety of computer vision tasks, such as high-resolution image generation and image translation [13, 45, 32]. The success of GANs on images has inspired many researchers to apply GANs as a metric for matching two distributions. To approximate the true distribution p_d(x), the model is trained by introducing a second neural network as a discriminator D:

    min_G max_D E_{x~p_d(x)} [ log D(x) ] + E_{z~p(z)} [ log(1 − D(G(z))) ].
The discriminator provides a measure of how probable it is that a generated sample comes from the true data distribution. WGAN [1, 8] is trained using the Wasserstein-1 distance to obtain a stronger measure of probability divergence and thus improve training stability. However, the original GANs do not allow inference of the latent code. To address this, BEGAN applies an autoencoder as the discriminator. ALI and BiGAN propose to match distributions in an augmented space by simultaneously training the model and an inverse mapping from the random noise to the data. However, the ALI model tends to generate reconstructions that are not necessarily faithful reproductions of the inputs, the so-called non-identifiability issue. To solve this problem, ALICE extends ALI with a cross-entropy (CE) regularization. This additional regularization restricts the connection between the image and the latent variable, and thus enables faithful reconstruction. A recent successful extension, VEEGAN, is also trained by discriminating joint samples of the data x and the corresponding latent variable z, with an additional regularization penalizing the cross entropy of the inferred latent code.
2.3 Stacked Model
A number of works use multiple GANs to improve sample quality. LAPGAN is built on a series of GANs within a Laplacian pyramid framework. The StackGANs [12, 40, 41] generate high-resolution images conditioned on their low-resolution inputs. At each level, a discriminator is trained to distinguish the generated representations from the encoded 'real' representations produced by an encoder at that level, with random noise injected into the generator. Our proposed model differs from existing regularized autoencoder models in that it learns a hierarchical latent space and matches the encoded latent distribution to an explicit prior only at the second stage.
3 Proposed Method
To build an autoencoder that yields faithful reconstructions together with a nice latent manifold structure, we propose to learn stacked autoencoders in two stages, as shown in Figure 2. The proposed SWAE consists of two major components: the encoder-generator pair (E1, G1) at the first stage and a second encoder-generator pair (E2, G2) at the second stage. At each stage, we adversarially train the encoder-generator with an additional discriminator (D1 and D2, respectively). In this work, we aim at minimizing optimal transport at two scales. Given the true (but unknown) data distribution p_d(x), the first stage learns a latent variable model specified by the encoded prior distribution q(z) over the latent codes and the generative model G1 of the data points given z. We assume that a successfully trained autoencoder ensures that q(z), the output distribution of E1, is the true latent code distribution; its structure is unknown, and it cannot be sampled in closed form. To solve this sampling issue, at the second stage we train another encoder-generator pair by minimizing the optimal transport between the encoded (but unknown) latent distribution q(z) and a latent variable model G2 of the latent codes given y. Meanwhile q(y), the output distribution of E2, is enforced to match the explicit prior p(y). The joint objective is defined as
where q(y) is the output distribution of the encoder E2 and D(·, ·) is an arbitrary divergence between q(y) and p(y).
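Written out explicitly, a plausible form of this joint objective under the notation above, with a squared cost c and a weight λ on the divergence term (this is our reconstruction, not the authors' exact equation), is:

```latex
\min_{E_1, G_1, E_2, G_2}\;
\mathbb{E}_{p_d(x)}\big[c\big(x,\, G_1(E_1(x))\big)\big]
\;+\;
\mathbb{E}_{q(z)}\big[c\big(z,\, G_2(E_2(z))\big)\big]
\;+\;
\lambda\, D\big(q(y),\, p(y)\big)
```

The first two terms are the reconstruction costs of the two stages, and the last term pulls the second-stage code distribution toward the explicit prior.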
The above objective is not easy to solve directly. We optimize each term by considering: (1) the first encoder-generator pair, which minimizes the data reconstruction cost; (2) the second encoder-generator pair, which learns a latent variable model; and (3) adversarial training of the encoder-generator, which minimizes the divergence term. In the following, we discuss how to simplify and transform the cost function into a computable version at each stage.
Stage I: Instead of enforcing the encoded latent distribution to match an explicit prior, we simplify the task by first learning a flexible latent variable model that aims at faithful reconstruction of the observed data. As a result, the encoded latent space exactly reflects the data variation. Stage-I SWAE consists of the encoder E1 and generator G1, adversarially trained with the discriminator D1. For the measurable cost function c(x, y) we use the squared cost c(x, y) = ||x − y||², combined with a weighted adversarial objective, where L_adv(x, G1(E1(x))) is the adversarial loss between x, a sample of the data distribution, and G1(E1(x)), the output of the generator. Since this autoencoder is trained without direct regularization of the latent space, the adversarial training process is free of mode collapse and helps generate sharp image samples.
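A plausible form of the Stage-I objective, combining the squared cost with the weighted adversarial term (the weight γ and the exact arrangement are our assumptions), is:

```latex
\min_{E_1, G_1}\;\max_{D_1}\;
\mathbb{E}_{p_d(x)}\Big[\big\|x - G_1(E_1(x))\big\|^2
\;+\;\gamma\,\mathcal{L}_{\mathrm{adv}}\big(x,\, G_1(E_1(x));\, D_1\big)\Big]
```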
Stage II: The flexibly encoded representation z from Stage I can be considered a 'real' sample of the true code distribution q(z), but it is free of any explicit structure, so z is difficult to sample directly. The Stage-II model approximates the encoded representation space with a latent variable model specified by an explicitly simple prior distribution. Stage II consists of the encoder E2 and generator G2. The discriminator D2 is employed to enforce the match between q(y) and p(y). The objective function is defined as
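A plausible form of this objective, mirroring the WAE penalty on the latent divergence (our reconstruction; the weight λ is assumed), is:

```latex
\min_{E_2, G_2}\;
\mathbb{E}_{q(z)}\big[\big\|z - G_2(E_2(z))\big\|^2\big]
\;+\;
\lambda\, D\big(q(y),\, p(y)\big)
```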
This objective can be considered as minimizing the optimal transport between the encoded (but unknown) representation distribution q(z) and the output distribution of the latent variable model G2. Here we use the same squared cost function, but without an adversarial objective on the reconstruction.
Specifically, we introduce an adversary (the discriminator D2) in the latent space, trying to separate the "true" points sampled from the prior p(y) and the "fake" ones sampled from q(y).
The full training process is outlined in Algorithm 1.
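The alternating two-stage procedure can be illustrated with a toy, linear-algebra-only sketch. Encoders and generators are plain linear maps trained by gradient descent on the squared reconstruction cost; the adversarial terms (D1, D2) and the divergence penalty matching the explicit prior are omitted for brevity. All names, shapes, and hyperparameters here are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def mse(a, b):
    return float(np.mean((a - b) ** 2))

def train_swae_toy(n=256, d_x=8, d_z=4, d_y=2, steps=400, lr=0.02, seed=0):
    rng = np.random.default_rng(seed)
    # Synthetic data lying on a d_z-dimensional linear manifold in R^d_x.
    basis = rng.normal(size=(d_z, d_x)) / np.sqrt(d_z)
    x = rng.normal(size=(n, d_z)) @ basis

    # Stage I: encoder E1 (W1) and generator G1 (V1), trained for faithful
    # reconstruction of x.
    W1 = rng.normal(scale=0.1, size=(d_z, d_x))
    V1 = rng.normal(scale=0.1, size=(d_x, d_z))
    loss0 = mse(x, (x @ W1.T) @ V1.T)
    for _ in range(steps):
        z = x @ W1.T                      # encode
        err = (z @ V1.T) - x              # reconstruction residual
        V1 -= lr * err.T @ z / n          # generator step
        W1 -= lr * (err @ V1).T @ x / n   # encoder step

    # Stage II: a second pair (W2, V2) fits the encoded codes z, which act
    # as "real" samples of the unknown code distribution q(z).
    z = x @ W1.T
    W2 = rng.normal(scale=0.1, size=(d_y, d_z))
    V2 = rng.normal(scale=0.1, size=(d_z, d_y))
    for _ in range(steps):
        y = z @ W2.T
        err = (y @ V2.T) - z
        V2 -= lr * err.T @ y / n
        W2 -= lr * (err @ V2).T @ z / n

    loss_final = mse(x, (x @ W1.T) @ V1.T)
    return loss0, loss_final
```

The stage-I reconstruction loss drops well below its initial value, after which stage II fits the resulting code distribution, matching the order of updates in Algorithm 1.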
3.1 Stacked GAN-based Model
Empirically, the choice of the prior distribution strongly influences the performance of generative models. The simplest choice is a fixed distribution such as a Gaussian. However, this choice may be too constrained to achieve faithful reconstruction and can even suffer from mode collapse. Our model exploits the two-stage setup, which first maps the complex, high-dimensional data distribution to a low-dimensional representation and then learns a latent variable model to approximate the representation distribution. It is therefore not sensitive to the choice of prior, and we can stack several encoder-generators to learn multiple latent variable models, as illustrated in Figure 3. The trained model enables us to draw samples given different prior distributions.
3.2 Connection to WAE
The optimal transport (OT) problem considered in the WAE is defined, in Kantorovich's form, as:

    W_c(P_X, P_G) = inf_{Γ ∈ P(X~P_X, Y~P_G)} E_{(X,Y)~Γ} [ c(X, Y) ],

where P(X~P_X, Y~P_G) denotes the set of joint distributions (couplings) with marginals P_X and P_G, and c(·, ·) is a measurable cost function.
The WAE proves that learning an autoencoder can be interpreted as learning a generative model with latent variables, as long as we ensure that the marginalized encoded space is the same as the prior.
In practice, learning the marginalized encoded space to be the same as the prior is nontrivial. Thus, we seek to approximate the prior distribution at two stages:
where the stage-I model generates the representation distribution by minimizing the first term, while the second term learns a latent variable model specified by an explicit prior p(y).
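A plausible form of this two-stage decomposition, reusing the notation above (our reconstruction; the weights and the precise relation to the OT bound are assumptions), is:

```latex
\inf_{E_1}\ \mathbb{E}_{P_X}\big[c\big(X,\, G_1(E_1(X))\big)\big]
\;+\;
\inf_{E_2}\ \mathbb{E}_{Q_Z}\big[c\big(Z,\, G_2(E_2(Z))\big)\big]
\;+\;
\lambda\, D\big(Q_Y,\, P_Y\big)
```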
4 Experiments

In this section, we conduct extensive experiments to evaluate the proposed SWAE model. Three publicly available datasets are used to train the model: MNIST, consisting of 70k images; CIFAR-10, consisting of 60k images in 10 classes; and CelebA, containing roughly 203k images. The performance of our approach is compared quantitatively and qualitatively with state-of-the-art approaches. We report results on three aspects of the model. First, we measure the reconstruction accuracy on the observed inputs and the quality of the randomly generated samples. Next, we explore the latent space by manipulating the codes for consistent image transformation [17, 19]. Finally, we study the crucial aspects that affect the performance of both reconstruction and random generation.
Experiment setup: All models are optimized with Adam with a learning rate of 0.0001. We do not perform any dataset-specific tuning except for early stopping based on the average data reconstruction loss on the validation sets. For the CelebA dataset, we crop the original images to a square region centered on the faces and then resize them. The training process is shown in Figure 4. The MSE of the Stage-I reconstructions is consistently low, which means the stage-I model easily encodes the representation distribution and can reliably provide 'real' samples for training the stage-II model. A smooth learning process at stage I provides consistent 'real' representations. We use a relatively large batch size of 64, which is important for training stabilization. We adopt the patch discriminator [13, 45] for the image-space discriminator. The architectures of the model are provided in the supplemental material.
Quantitative evaluation protocol: We adopt the mean squared error (MSE) and the inception score (ICP) [28, 22] to quantitatively evaluate the generative models. MSE evaluates reconstruction quality, while ICP reflects the plausibility and variety of the generated samples. Based on the pretrained Inception model C, the ICP score is calculated by

    ICP = exp( E_x [ KL( C(y|x) || E_x[C(y|x)] ) ] ),
where KL(· || ·) denotes the Kullback-Leibler divergence between the conditional class distribution C(y|x) and the marginal class distribution; a higher ICP score indicates better performance. To quantitatively assess the quality of the generated images on the CelebA dataset, we adopt the Fréchet inception distance (FID). The FID score measures the distance between a Gaussian fitted to the real data, with mean μ_r and covariance Σ_r, and a Gaussian fitted to the generated data, with mean μ_g and covariance Σ_g. It is calculated by

    FID = ||μ_r − μ_g||² + Tr( Σ_r + Σ_g − 2 (Σ_r Σ_g)^{1/2} ).
In our experiments, the ICP and FID scores are computed over a large set of generated samples.
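As a concrete illustration, the ICP computation described above can be sketched in a few lines of numpy; the pretrained Inception network that would produce the class posteriors p(y|x) is assumed and not included here.

```python
import numpy as np

def inception_score(p_yx, eps=1e-12):
    """p_yx: (N, C) array of predicted class probabilities, one row per sample.

    Returns exp of the mean KL divergence between each per-sample posterior
    p(y|x) and the marginal class distribution p(y)."""
    p_y = p_yx.mean(axis=0, keepdims=True)  # marginal class distribution
    kl = np.sum(p_yx * (np.log(p_yx + eps) - np.log(p_y + eps)), axis=1)
    return float(np.exp(kl.mean()))
```

With identical posteriors for every sample the score is exactly 1; confident and diverse posteriors push it toward the number of classes.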
4.1 Random Samples and Reconstruction
The proposed method maps input data to two types of latent codes. At the first stage, we learn a flexible encoded distribution that tightly captures the useful features representing the observed inputs. At the second stage, the latent space distribution is regularized to match an explicit prior. Here, we learn two latent variable models, specified by a Gaussian and a uniform distribution, respectively. We begin our experiments by comparing our model against two closely related state-of-the-art approaches: ALI and WAE. The quantitative results are tabulated in Table 1. SWAE generates impressive synthesized images, achieving an MSE of 0.01 and an ICP of 8.91 on MNIST. This outperforms the GAN-based ALI (MSE 0.38 and ICP 8.84) while remaining competitive with the modified ALICE (MSE 0.07 and ICP 9.35). On more complicated datasets such as CelebA, ALI generates high-quality samples (FID 6.95) but fails to faithfully reconstruct the input images, as evidenced by its high reconstruction error (MSE 0.281). The proposed SWAE does not have this issue: it achieves a better balance between the two, with a competitive FID (17.14) and a much lower MSE (0.066). This is due to the stacked structure of our model: the first stage favors high-quality reconstruction, and the second stage learns a latent variable model for random generation. This two-step learning scheme enables our model to perform well in both random generation and faithful reconstruction.
Figures 4(a) and 4(b) show comparative reconstructions by SWAE, ALI, and WAE. It is evident that the reconstructions of ALI are not faithful reproductions of the input data, although they are related to the input images. These results demonstrate the limitation of adversarial regularization for reconstruction and are consistent with the MSE results in Table 1. Some generated samples are shown in Figures 4(c) and 4(d). We observe that the adversarial objective is crucial for image quality: Figure 4(e) shows the blurry random generations of the SWAE model trained without it, and the quantitative FID score (68.15) also reflects the degradation of image quality. However, when its weight is set too large, we observe serious artifacts in the generated samples. Similarly, the original WAE does not integrate this discriminator and can only generate blurry samples (FID 98.78); these random samples are shown in Figure 4(f). It is clear that adversarial learning helps generate sharp images matching the true data distribution.
The number of iterations of the inner loop in Algorithm 1 also affects performance. This inner loop updates the latent variable model to approximate the encoded representation distribution. With too few inner iterations, the latent variable model is not able to follow the changes in the encoded representation distribution, while a moderate number of iterations suffices for a good approximation. As a trade-off between efficiency and effectiveness, we use a small fixed number of inner iterations.
4.2 Latent Space Interpolation
The latent variable model is characterized by learning semantic representations of the observed data: the latent variables are disentangled and evenly distributed over a well-organized manifold structure. Figure 6 demonstrates the learned manifold. To explore the latent manifold structure [35, 42], we investigate latent space interpolations between an example pair by linearly interpolating between their latent codes with equal steps. We observe smooth transitions between the pairs of examples, with intermediate images remaining plausible and realistic, as shown in Figure 7. Figure 8 illustrates interpolations between images with two different attributes.
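The interpolation experiment above can be sketched directly on latent codes; the trained SWAE encoder and generator that would produce and decode the codes are assumed, so the sketch operates on raw vectors.

```python
import numpy as np

def interpolate_codes(z_a, z_b, n_steps=8):
    """Return n_steps codes sliding linearly from z_a to z_b (inclusive).

    Each intermediate code would be passed through the trained generator to
    obtain the interpolated images shown in Figure 7."""
    alphas = np.linspace(0.0, 1.0, n_steps)
    return np.stack([(1 - a) * z_a + a * z_b for a in alphas])

z_a, z_b = np.zeros(4), np.ones(4)   # stand-ins for two encoded inputs
path = interpolate_codes(z_a, z_b)   # shape (n_steps, code_dim)
```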
4.3 Semantic Manipulation
Learning disentangled latent features is an important computer vision topic: the latent codes are learned to represent different attributes of the observations [31, 43, 5]. To demonstrate the capability of learning disentangled latent codes, we cluster the learned latent codes according to image attributes. We then calculate the average latent vector for images with a given attribute and for images without it, and use their difference as a direction for manipulation. This is done after the model is trained, making it extremely easy to perform for a variety of different target attributes.
The manipulated code is obtained by adding a scaled copy of this attribute direction to the latent code of an input image, where the scale is used to emphasize the added attribute. The reconstructed results are shown in Figure 9. They show that SWAE achieves a reliable geometry of the latent space without any class information at the training stage.
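The attribute-direction arithmetic described above reduces to a few lines of numpy; the function and variable names and the scale value are illustrative assumptions, and the codes would come from the trained SWAE encoder.

```python
import numpy as np

def attribute_direction(z_with, z_without):
    """Mean code of images with the attribute minus that of images without.

    z_with, z_without: (N, d) arrays of latent codes."""
    return z_with.mean(axis=0) - z_without.mean(axis=0)

def manipulate(z, direction, alpha=1.0):
    """Add a scaled copy of the attribute direction to a latent code."""
    return z + alpha * direction
```

Decoding `manipulate(z, direction)` through the generator would add the target attribute to the reconstruction, as in Figure 9.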
4.4 Effect of Training Epoch
Figure 10 shows the random generation and reconstruction of the model trained for 50 epochs, in contrast to the best model trained for 25 epochs. The longer-trained model achieves better visual quality for random generation; the FID score drops to 14.6. However, the MSE rises to 0.15. At this point, the discriminator is more sensitive to samples outside the true data distribution, so the improvement in random generation comes at the cost of faithful reconstruction, and the latent manifold structure is destroyed, as shown in Figure 9(d). We therefore stop training early as a trade-off between generation and reconstruction.
5 Conclusion

In this paper, we have presented a stacked Wasserstein autoencoder that endows the latent code space with a manifold structure and generates high-quality samples. The model is realized by training an autoencoder in two stages with greater flexibility. The first stage learns a flexible autoencoder that tends to produce faithful reconstructions of the inputs; however, the encoded representation distribution is not an explicit distribution. This flexibly encoded representation distribution is then further approximated with a latent variable model. Experimental results demonstrate that images sampled from the learned distribution are of better quality while the reconstructions remain consistent with the inputs.
Acknowledgments

The work was supported in part by USDA NIFA under grant no. 2019-67021-28996, the Research Grant Opportunity program of the University of Kansas, and the Nvidia GPU grant.
References

- (2017) Wasserstein generative adversarial networks. In International Conference on Machine Learning, pp. 214–223.
- (2017) BEGAN: boundary equilibrium generative adversarial networks. arXiv preprint arXiv:1703.10717.
- (2019) Dictionary representation of deep features for occlusion-robust face recognition. IEEE Access.
- (2015) Deep generative image models using a Laplacian pyramid of adversarial networks. In Advances in Neural Information Processing Systems, pp. 1486–1494.
- (2017) Adversarial feature learning. International Conference on Learning Representations.
- (2017) Adversarially learned inference. International Conference on Learning Representations.
- (2014) Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680.
- (2017) Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems, pp. 5767–5777.
- (2018) Learning depth from single images with deep neural network embedding focal length. IEEE Transactions on Image Processing 27 (9), pp. 4676–4689.
- (2017) GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Advances in Neural Information Processing Systems.
- (2018) On unifying deep generative models. International Conference on Learning Representations.
- (2017) Stacked generative adversarial networks. In CVPR, Vol. 2, pp. 3.
- (2017) Image-to-image translation with conditional adversarial networks. IEEE Conference on Computer Vision and Pattern Recognition.
- (2017) Adversarially regularized autoencoders for generating discrete structures. arXiv preprint arXiv:1706.04223.
- (2015) Adam: a method for stochastic optimization. International Conference on Learning Representations.
- (2014) Auto-encoding variational Bayes. International Conference on Learning Representations.
- (2018) Glow: generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems, pp. 10215–10224.
- (2009) Learning multiple layers of features from tiny images.
- (2015) Autoencoding beyond pixels using a learned similarity metric. arXiv preprint arXiv:1512.09300.
- (2016) Autoencoding beyond pixels using a learned similarity metric. International Conference on Machine Learning 48, pp. 1558–1566.
- (2017) Photo-realistic single image super-resolution using a generative adversarial network. In CVPR, Vol. 2, pp. 4.
- (2017) ALICE: towards understanding adversarial learning for joint distribution matching. In Advances in Neural Information Processing Systems, pp. 5495–5503.
- (2015) Deep learning face attributes in the wild. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3730–3738.
- (2018) MDCN: multi-scale, deep inception convolutional neural networks for efficient object detection. In 2018 24th International Conference on Pattern Recognition (ICPR), pp. 2510–2515.
- (2017) PixelGAN autoencoders. In Advances in Neural Information Processing Systems, pp. 1975–1985.
- (2016) Adversarial autoencoders. Workshop Track of International Conference on Learning Representations.
- (2017) Adversarial variational Bayes: unifying variational autoencoders and generative adversarial networks. International Conference on Machine Learning.
- (2016) Improved techniques for training GANs. In Advances in Neural Information Processing Systems, pp. 2234–2242.
- (2017) VEEGAN: reducing mode collapse in GANs using implicit variational learning. In Advances in Neural Information Processing Systems, pp. 3308–3318.
- (2018) Wasserstein auto-encoders. International Conference on Learning Representations.
- (2017) Disentangled representation learning GAN for pose-invariant face recognition. In CVPR, Vol. 3, pp. 7.
- (2019) Toward learning a unified many-to-many mapping for diverse image translation. Pattern Recognition 93, pp. 570–580.
- (2019) Adversarially approximated autoencoder for image generation and manipulation. IEEE Transactions on Multimedia.
- (2019) Towards learning affine-invariant representations via data-efficient CNNs. arXiv preprint arXiv:1909.00114.
- (2018) Locally adaptive sparse representation on Riemannian manifolds for robust classification. Neurocomputing 310, pp. 69–76.
- (2017) Multitask autoencoder model for recovering human poses. IEEE Transactions on Industrial Electronics 65 (6), pp. 5060–5068.
- (2018) Leveraging content sensitiveness and user trustworthiness to recommend fine-grained privacy settings for social image sharing. IEEE Transactions on Information Forensics and Security 13 (5), pp. 1317–1332.
- (2016) Deep multimodal distance metric learning using click constraints for image ranking. IEEE Transactions on Cybernetics 47 (12), pp. 4014–4024.
- (2019) Spatial pyramid-enhanced NetVLAD with weighted triplet loss for place recognition. IEEE Transactions on Neural Networks and Learning Systems.
- (2017) StackGAN: text to photo-realistic image synthesis with stacked generative adversarial networks. pp. 5907–5915.
- (2018) StackGAN++: realistic image synthesis with stacked generative adversarial networks. IEEE Transactions on Pattern Analysis and Machine Intelligence.
- (2019) A nonlinear and explicit framework of supervised manifold-feature extraction for hyperspectral image classification. Neurocomputing.
- (2017) Age progression/regression by conditional adversarial autoencoder. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Vol. 2.
- (2014) A smart high accuracy silicon piezoresistive pressure sensor temperature compensation system. Sensors 14 (7), pp. 12174–12190.
- (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. International Conference on Computer Vision.