Towards Latent Space Optimality for Auto-Encoder Based Generative Models


The field of neural generative models is dominated by the highly successful Generative Adversarial Networks (GANs) despite their challenges, such as training instability and mode collapse. Auto-Encoders (AE) with a regularized latent space provide an alternative framework for generative models, although their performance has not yet reached that of GANs. In this work, we identify one of the causes for the under-performance of AE-based models and propose a remedial measure. Specifically, we hypothesise that the dimensionality of the AE model’s latent space has a critical effect on the quality of the generated data. Under the assumption that nature generates data by sampling from a “true” generative latent space followed by a deterministic non-linearity, we show that optimal performance is obtained when the dimensionality of the latent space of the AE-model matches that of the “true” generative latent space. Further, we propose an algorithm called the Latent Masked Generative Auto-Encoder (LMGAE), in which the dimensionality of the model’s latent space is brought closer to that of the “true” generative latent space via a novel procedure that masks the spurious latent dimensions. We demonstrate through experiments on synthetic and several real-world datasets that the proposed formulation yields generation quality that is better than the state-of-the-art AE-based generative models and is comparable to that of GANs.


1 Introduction

The primary objective of a generative model is to sample from the data-generating distribution. Deep generative models, especially Generative Adversarial Networks (GANs) [12], have shown remarkable success in this task by generating high-quality data [7]. GANs implicitly learn to sample from the data distribution by transforming a sample from a simple distribution (such as a Gaussian) into a sample from the data distribution. They do so by optimising a min-max objective through an adversarial game between a pair of function approximators called the generator and the discriminator. Although GANs generate high-quality data, they are known to suffer from problems such as training instability [4, 33], degenerate support of the generated data (mode collapse) [2, 34] and sensitivity to hyper-parameters [7].

Auto-Encoder (AE) based generative models provide an alternative to GAN-based models [39, 20, 29, 6]. The fundamental idea is to learn a lower-dimensional latent representation of the data through a deterministic or stochastic encoder and to learn to generate (decode) the data through a decoder. Typically, both the encoder and the decoder are realised through a learnable family of function approximators such as deep neural networks. To facilitate the generation process, the distribution of the latent space is forced to follow a known distribution so that sampling from it is feasible. Despite yielding higher data likelihood and stable training, the quality of the data generated by AE-based models is known to fall well short of state-of-the-art GAN models [11, 13, 35].

While the shortcomings of AE-based models have been examined from several angles [11, 18, 21, 36, 22, 5, 37], an important question seems to have remained unaddressed: what is the “optimal” dimensionality of the latent space to ensure good generation in AE-based models? It is well known that most naturally occurring data effectively lies on a manifold whose dimension is much lower than the original dimensionality of the data [9, 25, 30]. Intuitively, this suggests that with well-behaved functions (such as Deep Neural Networks) it is difficult to optimally represent such data using fewer or more latent dimensions than the effective manifold dimensionality. Specifically, too few dimensions may result in loss of information, and “extra” dimensions may lead to noise in the generated data when this latent space is used as input to a generative neural network. This observation is also corroborated by the empirical evidence in Fig. 3, where an AE-based generative model is constructed on synthetic and MNIST [26] datasets with varying latent dimensionality (everything else kept the same), and the standard generation quality metric is shown to follow a peaky U-shaped curve. Motivated by these observations, we explore the role of latent dimensionality in the generation quality of AE-models, with the following contributions:

  1. We provide theoretical understanding on the role of the dimensionality of the latent space on the quality of AE-based generative models.

  2. We model data generation as a two-stage process comprising sampling from a “true” latent space followed by a deterministic function, and show that under this model the optimal generation quality is achieved by an AE-model when its latent dimensionality is exactly equal to that of the “true” latent space that is assumed to generate the data.

  3. Because the dimensionality of the “true” latent space is unknown for real-life data, we propose a method to algorithmically “mask” the spurious dimensions in AE-based models.

  4. We demonstrate the efficacy of the proposed model on synthetic as well as large-scale image datasets by achieving better generation quality metrics compared to the state-of-the-art (SOTA) AE-based models. We also show that the proposed method ensures that the effective active dimensions remain the same irrespective of the initially assumed latent dimensionality.

2 Related work

Let the data points lie in a data space and conform to an underlying data distribution from which a generative model desires to sample. An Auto-Encoder based model constructs a lower-dimensional latent space, with an associated latent distribution, onto which the data is projected through a (probabilistic or deterministic) Encoder function. An inverse projection map from the latent space back to the data space is learned through a Decoder function, which can subsequently be used as a sampler for the data distribution. For this to happen, the latent space must be regularized to facilitate explicit sampling from the latent prior, so that the Decoder can generate data taking samples from the prior as input. Most AE-based models maximize a lower bound constructed on the data likelihood, which is shown [20, 17] to consist of the sum of two primary terms: (i) the likelihood of the data generated by the Decoder network and (ii) a divergence measure between the assumed latent prior and the distribution imposed on the latent space by the Encoder [17, 29]. This underlying commonality suggests that the success of an AE-based generative model depends upon simultaneously increasing the likelihood of the generated data and reducing the divergence between the prior and the Encoder-imposed latent distribution. The former criterion is fairly easily ensured in all AE models by minimizing a surrogate function such as the reconstruction error between samples of the true data and the output of the Decoder. It is observed that, with enough capacity, this error can be made arbitrarily small [8, 11, 1]. It is well recognized that the quality of the generated data relies heavily on achieving the second criterion, namely bringing the Encoder-imposed latent distribution close to the assumed latent prior [11, 17, 8].
This can be achieved either (i) by assuming a pre-defined primitive distribution for the prior and modifying the Encoder such that its output follows the assumed prior [20, 29, 6, 10, 16, 19, 21], or (ii) by modifying the latent prior to follow whatever distribution the Encoder imposes on the latent space [36, 5, 22, 18, 37].

The seminal paper on VAE [20] proposes a probabilistic Encoder which is tuned to output the parameters of the conditional posterior, which is in turn forced to follow the Normal prior assumed on the latent space. However, minimizing the divergence between the conditional latent distribution and the prior in the VAE leads to a trade-off between reconstruction quality and latent matching, as this procedure also minimizes the mutual information between the data and the latent variables, which in turn reduces the Decoder’s ability to render good reconstructions [19]. This issue is partially mitigated by altering the weights on the two terms of the ELBO during optimization [16, 8], or by introducing explicit penalty terms in the ELBO to strongly penalize the deviation of the Encoder-imposed latent distribution from the assumed prior [10, 19]. Adversarial Auto-Encoders (AAE) [29] and Wasserstein Auto-Encoders (WAE) [6] address this issue by using adversarial learning and maximum mean discrepancy, respectively, to match the Encoder-imposed latent distribution to the prior. There have also been attempts to employ normalizing flows for distribution estimation in order to bring the two distributions close [21, 31]. Although these methods improve generation quality over the vanilla VAE and provide additional properties such as disentanglement in the learned space, they fail to reach GAN-level generation quality.

In another class of methods, the latent prior is made learnable instead of being fixed to a primitive distribution, so that it matches the Encoder-imposed latent distribution. In VampPrior [36], the prior is taken as a mixture density whose components are learned using pseudo-inputs to the Encoder. [22] introduces a graph-based interpolation method to learn the prior in a hierarchical way. Two-stage VAE constructs two VAEs, one on the data space and the second on the latent space of the first-stage VAE [11]. [37, 24] employ a discrete latent space using vector quantization schemes and fit the prior using a discrete auto-regressive model.

While all these models have proposed pragmatic methods for matching the distributions by changing the distributional forms, a couple of important related questions seem to be unaddressed: (a) What is the best that one can achieve with a simplistic unimodal prior such as a Gaussian? and (b) What is the effect of the latent dimensionality on generation quality in such a case? Here, we focus on addressing these.

3 Effect of Latent Dimensionality

3.1 Preliminaries

In this section, we theoretically examine the effect of latent dimensionality on the quality of generated data in AE models. We show that it is impossible to simultaneously achieve both requirements R1 and R2 (stated in Sec. 3.2) unless the latent space has a certain optimal dimensionality. Specifically, we show that if the dimensionality of the latent space is more than the optimal dimensionality, the prior and the Encoder-imposed latent distribution diverge too much (Lemmas 2 and 3), whereas if it is less, information is lost (Lemma 1).

Figure 1: Depiction of assumed data generation process.

We allow a certain inductive bias in assuming that nature generates the data as in Figure 1, using the following two-step process: first, sample from some isotropic continuous latent distribution in a small number of dimensions, and then pass this sample through a function that maps it into the data space, whose dimensionality is that of the dataset. Typically, the data dimensionality is much larger than that of the latent distribution, so that the data lies on a low-dimensional manifold in the data space. Since this latent distribution can intuitively be viewed as the latent space from which nature generates the data, we call its dimensionality the true latent dimension and the function the data-generating function. Note that, within this ambit, the true latent space forms the domain of the data-generating function. We further make the following benign assumptions on the data-generating function:

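  1. The data-generating function is injective: This assumption follows from the uniqueness of the underlying true latent variable, given a data point.

  2. The data-generating function is Lipschitz: there exists some finite constant such that the distance between the images of any two latent points is at most that constant times the distance between the points themselves.

  3. The data-generating function is differentiable almost everywhere.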

Since universal function approximators such as deep neural networks, which adhere to the assumptions made above, have been shown to successfully generate data from a low-dimensional latent space, it is reasonable to impose these assumptions on the data-generating function. With these definitions and assumptions, we state the requirements for good generation in AE-models.

3.2 Conditions for Good Generation

An AE-based generative model would attempt to learn continuous Encoder and Decoder functions via some sufficiently deep neural network structure. We refer to the dimension to which the Encoder maps the data as the assumed latent dimension. As discussed earlier, for good-quality data generation the following conditions are to be satisfied:

  R1. The reconstruction error between a real data point and its reconstruction (the Decoder applied to the Encoder output), measured in some norm, is minimal.

  R2. The Kullback-Leibler divergence between the chosen prior and the distribution imposed by the Encoder on the latent space is minimal (see footnote 3).

With this, we state and prove the conditions required to ensure R1 and R2 are met.

Theorem 1.

Under the data-generating process assumed in Sec. 3.1, requirements R1 and R2 (Sec. 3.2) can be satisfied iff the true latent dimension is equal to the assumed latent dimension.

Proof: We show that either R1 or R2 is contradicted when the assumed latent dimension is smaller than the true latent dimension (Case A) or larger than it (Case B).

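Case A (assumed latent dimension smaller than the true latent dimension): Since the data-generating function is injective and differentiable everywhere, it must have a continuous left inverse. Also, R1 forces the Decoder to be the left inverse of the Encoder on the range of the data-generating function. Since the Encoder and the Decoder are both neural networks, and hence continuous, the composite of the Encoder with the data-generating function has a continuous left inverse, which is impossible due to the following lemma:
Lemma 1: A continuous function cannot have a continuous left inverse if the dimension of its codomain is smaller than that of its domain.
Proof: It follows from the fact that such a function would define a homeomorphism from the higher-dimensional space to a subset of the lower-dimensional space, whereas it is well known that no such homeomorphism exists.∎
This implies that R1 and Lemma 1 contradict each other in Case A, and thus, to obtain a good reconstruction, the assumed latent dimension should be at least equal to the true latent dimension.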

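Case B (assumed latent dimension larger than the true latent dimension): For the sake of simplicity, let us assume that the true latent distribution is supported on a unit cube in the true latent space (see footnote 4). We show in Lemmas 2 and 3 that R2 is contradicted in this case. The idea is to first show that the range of the Encoder composed with the data-generating function has Lebesgue measure 0 in the assumed latent space (Lemma 2), and that this makes the KL divergence arbitrarily large (Lemma 3).
 Lemma 2: Let a function from a unit cube into a space of strictly higher dimension be Lipschitz. Then its range has Lebesgue measure 0 in the higher-dimensional space.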

Proof: For some , consider the set of points:
Construct closed balls around each of these points, all having radius . It is easy to see that every point in the domain of is contained in at least one of these balls. This is because, for any given point, the nearest point in S can be at most units away along each dimension. Also, since is -Lipschitz, we can conclude that the image of a closed ball having radius and centre is a subset of the closed ball having centre and radius .
The range of is then a subset of the union of the image sets of all the closed balls defined around S. The volume of this set is upper bounded by the sum of the volumes of the individual image balls, each having volume where c is a constant having value . Therefore,


The final quantity of Eq. 1 can be made arbitrarily small by choosing the free parameter of the construction appropriately. Since the Lebesgue measure of a closed ball is the same as its volume, the range of the function has Lebesgue measure 0 in the higher-dimensional space.∎
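The covering argument behind Lemma 2 can be written out explicitly. The following is only a sketch under assumed notation, since the symbols are not legible in the text above: $\Omega$ denotes the $L$-Lipschitz map, $n$ the dimension of the unit-cube domain, $m > n$ the dimension of the codomain, $k$ a grid resolution, $\lambda_m$ the Lebesgue measure on $\mathbb{R}^m$, and $c_m$ the volume of the unit ball in $\mathbb{R}^m$. The particular grid spacing and ball radius are one valid choice and need not coincide with the authors'.

```latex
\begin{align*}
S_k &= \Bigl\{ \bigl(\tfrac{a_1}{k}, \dots, \tfrac{a_n}{k}\bigr) : a_i \in \{0, 1, \dots, k\} \Bigr\},
  \qquad |S_k| = (k+1)^n, \\
[0,1]^n &\subseteq \bigcup_{s \in S_k} B\bigl(s, r_k\bigr),
  \qquad r_k = \frac{\sqrt{n}}{2k}
  \quad \text{(each coordinate is within } \tfrac{1}{2k} \text{ of the grid)}, \\
\Omega\bigl(B(s, r_k)\bigr) &\subseteq B\bigl(\Omega(s), L\, r_k\bigr)
  \quad \text{(Lipschitz property)}, \\
\lambda_m\bigl(\Omega([0,1]^n)\bigr)
  &\le \sum_{s \in S_k} c_m \,(L\, r_k)^m
   = c_m \Bigl(\frac{L\sqrt{n}}{2}\Bigr)^{m} \frac{(k+1)^n}{k^{m}}
   \;\xrightarrow[k \to \infty]{}\; 0
  \quad \text{since } m > n .
\end{align*}
```

The bound holds for every $k$, so the measure of the range must be exactly zero; the only place the dimension gap enters is the ratio $(k+1)^n / k^m$.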
Since the Encoder and the data-generating function are Lipschitz, their composite must have a range with Lebesgue measure 0 as a consequence of Lemma 2. We now show that, because this range has measure 0, the KL divergence between the prior and the Encoder-imposed latent distribution goes to infinity.
Lemma 3: If the prior and the Encoder-imposed latent distribution are two distributions as defined in Sec. 3.1 such that the support of the latter has Lebesgue measure 0, then the KL divergence between them grows to be arbitrarily large.
Proof: To begin with, can be equivalently expressed as:


Define as the indicator function of , i.e.


Since has measure (Lemma 2), we have


Further, since is identically in the support of , we have


Now consider the cross-entropy between and given by:


for any arbitrarily large positive real . This holds because is identically 0 over the domain of integration. Further,


Combining 6 and 7, the required cross-entropy is lower bounded by an arbitrarily large quantity . Since cross entropy and KL-divergence differ only by the entropy of which is finite, the KL-divergence is arbitrarily large.∎
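 Thus Lemma 3 contradicts R2, which is required for good generation, in Case B. Therefore, the assumed latent dimension can be neither smaller nor larger than the true latent dimension if good generation is to be ensured. Thus, the two must be equal (this concludes Theorem 1).∎
One can ensure good generation by a trivial solution in which the assumed latent dimension is set equal to the true latent dimension, together with an appropriate choice of networks. However, since the true latent dimension is not known, one needs a practical method to make the assumed latent dimension approach the true one, which is described in the next section.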
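As a numerical illustration of Lemma 3, consider a single spurious latent dimension in which the encoded distribution collapses towards a set of measure zero. The sketch below models that collapse with a zero-mean Gaussian of shrinking standard deviation `sigma`; this Gaussian model, the function name, and the choice of a standard normal prior are assumptions made purely for illustration. The divergence is taken from the prior to the encoded distribution, which is the direction implied by the cross-entropy argument in the proof.

```python
import math


def kl_prior_to_collapsed(sigma: float) -> float:
    """Closed-form KL( N(0, 1) || N(0, sigma^2) ) for one latent dimension.

    N(0, 1) plays the role of the prior; N(0, sigma^2) stands in for an
    encoded distribution that collapses as sigma -> 0 (illustrative model).
    """
    return math.log(sigma) + 1.0 / (2.0 * sigma ** 2) - 0.5


# The divergence is zero when the two distributions coincide and grows
# without bound as the encoded distribution collapses.
for sigma in (1.0, 0.1, 0.01, 0.001):
    print(f"sigma={sigma:<6} KL={kl_prior_to_collapsed(sigma):.3f}")
```

The `1 / (2 * sigma**2)` term dominates for small `sigma`, which is the one-dimensional analogue of the cross-entropy lower bound used in the proof.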

4 Latent Masked Generative AE (LMGAE)

Figure 2: Block Diagram of Latent Masked Generative Auto Encoder. LMGAE consists of three pipelines: reconstruction pipeline consisting of , , and ; Masking pipeline consisting of , and ; and Distribution-Matching pipeline consisting of , , and .

In this section, we take the ideas presented in Section 3 and build an architecture in which the true latent dimension of the underlying data distribution can potentially be discovered in an AE set-up, resulting in better generation quality. Our key intuition is the following: if we start with a large enough estimated latent dimension and give the model the ability to train a mask that suppresses the set of spurious dimensions, then the model should automatically discover the right number of latent dimensions (those which are not masked). This is done by minimizing a combination of (1) a reconstruction loss and (2) a divergence loss, corresponding to R1 and R2 in Section 3, respectively. We next describe the details of our architecture, followed by the corresponding loss functions that we minimize over it. Figure 2 presents our model architecture. It has three components:

  1. Re-construction Pipeline: This is the standard pipeline in any AE-based model, which tries to minimize the reconstruction loss (R1). An input sample is passed through the encoder, resulting in the corresponding representation in the latent space. The new addition here is the Hadamard product with the mask (explained next), resulting in the masked latent-space representation. The masked representation is then fed to the decoder to obtain the reconstructed output. The goal here is to minimize the norm of the difference between the input and the reconstructed output.

  2. Masking Pipeline: The introduction of a mask is one of the novel contributions of our work; it is the second part of our architecture, shown in the middle of Figure 2. Our mask is a vector whose size equals the model capacity. Ideally, the mask would be a binary vector, but in order to make it learnable, we relax it to be continuous-valued while imposing certain regularizers so that it does not deviate too much from 0 or 1 during learning. Specifically, we parameterize the mask by a vector of real-valued parameters, with the mask obtained by applying a fixed function to it point-wise. Intuitively, this parameterization forces the mask entries to be close to 0 or 1.

  3. Distribution-Matching Pipeline: This is the third part of our architecture, shown at the bottom of Figure 2. The objective of this pipeline is to minimize the distribution loss between a prior distribution and the distribution imposed on the latent space by the encoder. A random vector is sampled from the prior distribution and its Hadamard product is taken with the mask (as in the case of the encoder), resulting in a masked vector. This masked vector is then passed through the distribution-matching network, where the goal is to separate the samples coming from the prior distribution from those coming from the encoded space using some divergence metric. We use the principles detailed in [3], using the Wasserstein distance to measure the distributional divergence. Note that the distribution-matching network has two inputs, namely the masked prior samples and the masked output of the encoder.

Intuitively, masking the latent-space vector and the sampled prior vector allows us to work with only a subset of the dimensions in the latent space. This means that even though the model capacity may be greater than the true latent dimension, we can use masking to effectively work in a space of the true latent dimension, satisfying the conditions of our theory for jointly meeting R1 and R2 (Section 3). Next, corresponding to each of the components above, we present a loss function, computed over a batch of samples.
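The masking step shared by the reconstruction and distribution-matching pipelines can be sketched as follows. The squashing function `soft_mask`, the parameter values, and all names below are hypothetical and chosen only for illustration; the exact parameterization used by LMGAE is not reproduced here.

```python
import numpy as np


def soft_mask(theta: np.ndarray) -> np.ndarray:
    """Map real-valued parameters to mask entries in [0, 1).

    Hypothetical squashing, used here only as an example of a relaxed
    binary mask: large positive theta -> close to 1, negative theta -> 0.
    """
    return np.clip(1.0 - np.exp(-theta), 0.0, 1.0)


def apply_mask(latents: np.ndarray, theta: np.ndarray) -> np.ndarray:
    """Hadamard product of each latent vector in the batch with the mask."""
    return latents * soft_mask(theta)  # mask broadcasts over the batch axis


rng = np.random.default_rng(0)
theta = np.array([5.0, 5.0, -3.0, -3.0])   # model capacity of 4 (assumed)
encoded = rng.normal(size=(8, 4))          # stand-in for encoder outputs
prior_samples = rng.normal(size=(8, 4))    # samples from a Gaussian prior

# The same mask is applied to both branches, so the distribution-matching
# network only ever compares the unmasked coordinates.
masked_encoded = apply_mask(encoded, theta)
masked_prior = apply_mask(prior_samples, theta)

active_dims = int(np.sum(soft_mask(theta) > 0.5))
print("active dimensions:", active_dims)
```

Because the identical mask multiplies both the encoded vectors and the prior samples, a dimension that is switched off contributes nothing to either the reconstruction or the divergence estimate.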

  1. Auto-Encoder Loss: This is the standard loss to capture the quality of re-construction as used earlier in the AE literature. In addition, we have a term corresponding to minimization of the variance over the masked dimensions in the encoded output in a batch. The intuition is that encoder should not inject information into the dimensions which are going to be masked anyway. The loss is specified as:


    represents the co-variance matrix for the encoding matrix , being the data matrix for the current batch. is the vector obtained by applying the function point-wise to . , and are hyperparameters.

  2. Generator Loss: This is the loss capturing the quality of generation in terms of how far is it from the prior distribution. This loss measures the ability of the encoder to generate the samples such that they are coming from which is ensured using the generator loss mentioned in [3]:

  3. Distribution-Matching Loss: This is the loss incurred by the Distribution-matching network, in matching the distributions. We use Wasserstein’s distance [3] to measure the distributional closeness with the following loss:


    Recall that . Further, we have used . are hyper parameters, with , and set as in [14].

  4. Masking Loss: This is the loss capturing the quality of the current mask. The loss is a function of three terms: (1) the auto-encoder loss, (2) the distribution-matching loss, and (3) a regularizer to ensure that the mask entries stay close to 0 or 1. This can be specified as:


    where is the Wasserstein’s distance. Here and and are hyper-parameters.

Training: During training, we optimize each of the four losses specified above in turn. Specifically, in each learning loop, we optimize them one after another in a fixed order using a learning schedule. We use RMSProp for optimization.

5 Experiments and Results

We divide our experiments into two parts: (a) Synthetic, and (b) Real. In synthetic experiments, we control the data generation process, with a known number of true latent dimensions. Hence, we can compare the performance of our proposed model for different number of true latent dimensions, and examine whether our method can discover the true number of latent dimensions. This also helps us validate some of the theoretical claims made in Section 3. On the other hand, the goal in real experiments is to examine whether our masking based approach can result in generation quality which can beat the state-of-the-art AE-based models, given sufficient model capacity. We would also like to understand the behaviour of the number of dimensions which are masked in this case (though the precise number of latent data dimensions may not be known).

5.1 Synthetic Experiments

In the following description, in line with the terminology used earlier in the paper, we distinguish the true latent dimension of the data from the estimated latent dimension (i.e., the one used by LMGAE), which we will often refer to as the model capacity. We are interested in answering the following questions, assuming that the data is generated according to the generation process described in Section 3: (a) Given sufficient model capacity (i.e., a capacity at least as large as the true latent dimension, and sufficiently powerful networks), can LMGAE discover the true number of latent dimensions? (b) What is the quality of the data generated by LMGAE for varying values of the model capacity?

In the ideal scenario, we would expect that whenever the model capacity exceeds the true latent dimension, the dimensions masked by LMGAE are exactly those not required, i.e., their number equals the difference between the model capacity and the true latent dimension. Further, we would expect the performance of LMGAE to be independent of the model capacity whenever it is at least the true latent dimension. The performance is expected to deteriorate when the model capacity is below the true latent dimension. For comparison, for each value of the model capacity that we experimented with, we also trained an equivalent WAE model with exactly the same architecture as ours but with no masking. If our theory holds, we would expect the performance of the WAE model to deteriorate whenever the model capacity is smaller or larger than the true latent dimension.

For the synthetic case, we experiment with the data generated using the following process.

  • Sample , where the mean was fixed to be zero and represents the diagonal co-variance matrix (isotropic Gaussian).

  • Compute , where is a non-linear function computed using a two-layer fully connected neural network with units in each layer, output units, and using leaky ReLU as the non-linearity. The weights of these networks are randomly fixed and was taken as 128.
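The two-step synthetic generation process above can be sketched as follows. The hidden width (256), leaky-ReLU slope (0.2), weight scaling, seeds, the true latent dimension of 8, and reading "128" as the output dimensionality are all assumptions made for illustration; the paper's actual values are not legible in the text above.

```python
import numpy as np


def leaky_relu(x: np.ndarray, slope: float = 0.2) -> np.ndarray:
    return np.where(x > 0, x, slope * x)


def make_generator(true_dim: int, hidden: int = 256, out_dim: int = 128, seed: int = 0):
    """Build a fixed random network f: R^true_dim -> R^out_dim.

    Two fully connected hidden layers with leaky ReLU and a linear output
    layer; the weights are drawn once and then kept fixed.
    """
    rng = np.random.default_rng(seed)
    w1 = rng.normal(scale=1.0 / np.sqrt(true_dim), size=(true_dim, hidden))
    w2 = rng.normal(scale=1.0 / np.sqrt(hidden), size=(hidden, hidden))
    w3 = rng.normal(scale=1.0 / np.sqrt(hidden), size=(hidden, out_dim))

    def f(z: np.ndarray) -> np.ndarray:
        h = leaky_relu(z @ w1)
        h = leaky_relu(h @ w2)
        return h @ w3

    return f


def sample_dataset(num: int, true_dim: int, seed: int = 1):
    """Step 1: z ~ N(0, I) in true_dim dimensions. Step 2: x = f(z)."""
    rng = np.random.default_rng(seed)
    z = rng.normal(size=(num, true_dim))
    f = make_generator(true_dim)
    return z, f(z)


z, x = sample_dataset(num=1000, true_dim=8)
print(z.shape, x.shape)
```

Since the true latent dimension is an explicit argument, the same helper can produce datasets for several ground-truth dimensionalities, against which the number of unmasked dimensions can then be compared.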

In our experiments, we set and , and varied in the range of for and for at intervals of in each case. We use the standard Fréchet Inception Distance (FID) [15] score between generated and real images to validate the quality of the generated data. Figure 3 (a) and (b) presents our results on synthetic data. On X-axis, we plot . Y-axis (left) plots the FID score comparing LMGAE and WAE for different values of . Y-axis (right) plots the number of active dimensions discovered by our algorithm. In both the figures, both the approaches achieve the best FID score when . But whereas the performance for WAE deteriorates with increasing , LMGAE retains the optimal FID score independent of the value of . Further, in each case, we get very close to the true number of latent dimensions, even with different values of (as long as or , respectively). This clearly validates our theoretical claims, and also the fact that LMGAE is capable of offering good quality generation in practice.
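For reference, the Fréchet distance underlying the FID score compares the means and covariances of two sets of feature vectors under a Gaussian approximation. The sketch below is a minimal NumPy version of that formula; it is not the evaluation code used in the paper, the function name is hypothetical, and in the standard FID the features come from a pretrained Inception network, whereas here any feature arrays can be passed in.

```python
import numpy as np


def frechet_distance(feats_a: np.ndarray, feats_b: np.ndarray) -> float:
    """||mu_a - mu_b||^2 + Tr(C_a + C_b - 2 (C_a C_b)^{1/2}).

    Inputs are arrays of shape (num_samples, feature_dim).
    """
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # Tr((C_a C_b)^{1/2}) equals the sum of square roots of the eigenvalues
    # of C_a C_b, which are real and non-negative for covariance matrices;
    # tiny negative or imaginary parts are numerical noise.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    trace_sqrt = np.sum(np.sqrt(np.clip(eigvals.real, 0.0, None)))
    mean_term = np.sum((mu_a - mu_b) ** 2)
    return float(mean_term + np.trace(cov_a) + np.trace(cov_b) - 2.0 * trace_sqrt)


rng = np.random.default_rng(0)
real = rng.normal(size=(2000, 5))
shifted = real + 3.0  # same covariance, mean moved by 3 in each of 5 dims

print(frechet_distance(real, real))
print(frechet_distance(real, shifted))
```

With identical covariances the trace terms cancel, so the distance between `real` and `shifted` reduces to the squared mean shift.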

Figure 3: (a) and (b) shows FID score for WAE and LMGAE and active dimension in a trained LMGAE model with varying model capacity, for synthetic dataset of true latent dimensions, and , represents the number of unmasked latent dimensions in the trained model and (c) shows the same plots for MNIST dataset.

5.2 Real Datasets

In this section, we examine the behavior of LMGAE on real-world datasets. In this case, the true latent dimension of the data is unknown, but we can still analyze the behavior as the estimated number of latent dimensions is varied. We work on four image datasets used extensively in the literature: (a) MNIST [26], (b) Fashion MNIST [38], (c) CIFAR-10 [23], and (d) CelebA [27]. We use the same train/test splits as used in the earlier literature.

Figure 4: Randomly generated (no cherry picking) images of (a) MNIST, (b) Fashion MNIST, (c) CelebA, and (d) CIFAR-10 datasets.

In our first set of experiments, we perform an analysis on MNIST similar to the one done for synthetic data. Specifically, we varied the estimated latent dimension (model capacity ) for MNIST from to , and analyzed the FID score as well as the number of active dimensions discovered by the model. For comparison, we also ran the same experiment using the WAE model. Figure 3 (c) shows the results. As in the case of synthetic data, we observe a U-shaped behavior for the WAE model, with the lowest value achieved at . This supports our thesis that the best performance is achieved at a specific value of the latent dimension, which is supposedly around in this case (see footnote 5). Further, looking at the LMGAE curve, we notice that the performance (FID score) more or less stabilizes (it varies between and ) for values of . In addition, the latent dimension discovered also stabilizes around irrespective of , without compromising much on the generation quality. Note that the same network architecture was used at all points of Figure 3. These observations are in line with the expected behavior of our model and with the claim that it can mask the spurious set of dimensions to achieve good generation quality.

Figure 5 shows the behaviour of the mask for model capacity and on the MNIST dataset. Interestingly, in each case, we are able to discover almost the same number of unmasked dimensions, independent of the starting point. It is also observed that the Wasserstein distance is minimized at the point where the mask reaches its optimal point.

Figure 5: Behaviour of the mask in LMGAE models with different for the MNIST dataset. Model capacity, , in figures (a), (b), and (c) is and , respectively. The active dimensions after training are and , respectively.

Finally to measure the generation quality, we present the FID scores of our method along with several state-of-the-art AE-based models mentioned in section 2 in Table 1. Clearly, our approach achieves the best FID score on all the datasets compared to the state-of-the-art AE based generative models. Performance of LMGAE is also comparable to that of GANs [28], despite using a simple norm based reconstruction loss and an isotropic uni-modal Gaussian prior. Figure 4 shows some non-cherry-picked samples generated by our algorithm for each of the datasets.

The better FID scores of LMGAE can be attributed to better distribution matching in the latent space between the prior and the Encoder-imposed distribution. However, quantitatively comparing the match between the two distributions is not easy, as LMGAE might mask out some of the latent dimensions, resulting in a mismatch between the dimensionality of the latent space in different models and thus rendering the usual metrics undefined. We therefore calculate the average off-diagonal covariance of the encoded latent vectors and report it in Table 2. Since the prior is assumed to be an isotropic Gaussian, this quantity should ideally be zero, and any deviation from zero indicates a mismatch. For LMGAE we use only the unmasked latent dimensions in the calculation, to ensure that the value is not underestimated by including the unused dimensions. It is observed that, for the same model capacity, LMGAE has a considerably lower value than the corresponding WAE, indicating better distribution matching.
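The average off-diagonal covariance described above could be computed along these lines. This is a sketch rather than the authors' evaluation code: the function name is hypothetical, and averaging the absolute values of the off-diagonal entries is an assumption (the text does not say whether signed or absolute values are averaged).

```python
import numpy as np


def avg_offdiag_cov(latents: np.ndarray, active=None) -> float:
    """Mean absolute off-diagonal entry of the latent covariance matrix.

    latents: array of shape (num_samples, latent_dim) of encoded vectors.
    active:  optional boolean mask or index array selecting the unmasked
             dimensions, so unused dimensions do not dilute the average.
    """
    if active is not None:
        latents = latents[:, active]
    cov = np.atleast_2d(np.cov(latents, rowvar=False))
    k = cov.shape[0]
    if k < 2:
        return 0.0  # a single dimension has no off-diagonal entries
    off_diagonal = cov[~np.eye(k, dtype=bool)]
    return float(np.mean(np.abs(off_diagonal)))


rng = np.random.default_rng(0)
independent = rng.normal(size=(5000, 4))
correlated = independent.copy()
correlated[:, 1] = correlated[:, 0] + 0.1 * rng.normal(size=5000)

print(avg_offdiag_cov(independent))   # near zero for an isotropic sample
print(avg_offdiag_cov(correlated))    # larger: dims 0 and 1 are coupled
# Restricting to the unmasked dimensions (here: dropping dimension 1).
print(avg_offdiag_cov(correlated, active=np.array([True, False, True, True])))
```

Restricting the computation to the active dimensions matters because masked coordinates are constant and would otherwise add zero entries to the average.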

MNIST Fashion CIFAR-10 CelebA
VAE (cross-entr.)
VAE (fixed variance)
VAE (learned variance)
VAE + Flow
2-Stage VAE
Table 1: FID scores for generated images from different AE-based generative models (Lower the better).
Dataset Model Capacity WAE LMGAE


Table 2: Average off-diagonal covariance for both WAE and LMGAE. represents the number of unmasked latent dimensions in the trained model. It is seen that LMGAE has lower values indicating lesser deviation of from as compared to a WAE.

These results indicate that not only can LMGAE achieve the best FID scores among AE-based models on a number of benchmark datasets, but it also gives us a practical handle on potentially discovering the true set of latent dimensions in real as well as artificial data. To the best of our knowledge, this is the first study analyzing (and discovering) the effect of latent dimensions on generation quality.

6 Discussion and Conclusions

Despite the pragmatic success of the method, we critically analyze the ways in which practical cases may deviate from the presented analysis. More often than not, naturally occurring data contains some noise superimposed on the actual image. Theoretically, one can argue that this noise can be utilized to minimize the divergence between the distributions. Practically, however, this noise has very low amplitude, so it can only account for a few extra dimensions, giving a slight overestimate of the true latent dimension. Further, in practice, not all latent dimensions contribute equally to data generation. Since the objective of our model is to ignore noise dimensions, it can at times end up discarding meaningful data dimensions which do not contribute significantly. This can lead to a slight underestimate of the true latent dimension (which is observed at times during experimentation). Finally, neural networks, however deep, can represent only a certain level of complexity in a function. This is both good and bad for our purposes. It is good because, while we have shown that certain losses cannot be made zero when the assumed and true latent dimensions differ, universal approximators could otherwise bring them arbitrarily close to zero, which is practically the same thing; due to their limitations, however, we end up getting a U-curve. It is bad because even when the assumed latent dimension equals the true one, the encoder and decoder networks might be unable to learn the appropriate functions, and when it exceeds the true one, the distribution-matching network may fail to tell the distributions apart. This implies that instead of discovering exactly the same number of dimensions every time, we might get a range of values near the true latent dimension. Also, the severity of this problem is likely to increase with the complexity of the dataset (again corroborated by the experiments).

To conclude, in this work we have taken a step towards constructing an optimal latent space for improving the generation quality of Auto-Encoder based neural generative models. We have argued that, under the assumption of a two-step generative process for natural data, the optimal latent space for the AE-model is one whose dimensionality matches that of the latent space of the generative process. Further, we have proposed a practical method to arrive at this optimal dimensionality from an arbitrary starting point.


  1. Subsequent results can be extended to any other divergence metric.
  2. One can easily obtain another function that scales and translates the unit cube appropriately. Note that for such a function to exist, the underlying support must be bounded, which may not be the case for certain distributions such as the Gaussian. Such distributions, however, can be approximated successively in the limiting sense by truncating at some large value [32].
  3. The true latent dimension is not known in this case. In fact, it is only an assumption that the dataset is generated using the data generation process described in Section 3.1.
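
The footnote on scaling and translating the unit cube can be illustrated with a minimal sketch. It assumes the map is a simple coordinate-wise affine transformation onto an axis-aligned box; the function name `scale_unit_cube` and the truncation bound of 3 are hypothetical choices made for illustration and are not taken from the paper.

```python
import math
from typing import List, Sequence


def scale_unit_cube(u: Sequence[float],
                    lower: Sequence[float],
                    upper: Sequence[float]) -> List[float]:
    """Map a point u in [0, 1]^n onto the box [lower, upper], coordinate-wise.

    Hypothetical helper: an affine map x_i = lower_i + (upper_i - lower_i) * u_i.
    It exists only when every bound is finite, which is why an unbounded
    distribution such as a Gaussian has to be truncated first.
    """
    if not (len(u) == len(lower) == len(upper)):
        raise ValueError("u, lower and upper must have the same length")
    out = []
    for ui, lo, hi in zip(u, lower, upper):
        if not (math.isfinite(lo) and math.isfinite(hi)):
            raise ValueError("bounds must be finite (truncate unbounded supports)")
        if not lo < hi:
            raise ValueError("each lower bound must be below its upper bound")
        out.append(lo + (hi - lo) * ui)
    return out


# Assumed truncation of a Gaussian-like support to [-3, 3] in each coordinate.
point = scale_unit_cube([0.0, 0.5, 1.0], [-3.0] * 3, [3.0] * 3)
```

With infinite bounds the helper raises a `ValueError`, mirroring the boundedness requirement; enlarging the truncation bound corresponds to the limiting approximation mentioned in the footnote.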


  1. G. Alain and Y. Bengio (2014) What regularized auto-encoders learn from the data-generating distribution. The Journal of Machine Learning Research 15 (1), pp. 3563–3593. Cited by: §2.
  2. M. Arjovsky and L. Bottou (2017) Towards principled methods for training generative adversarial networks. arxiv. Cited by: §1.
  3. M. Arjovsky, S. Chintala and L. Bottou (2017) Wasserstein generative adversarial networks. In International conference on machine learning, pp. 214–223. Cited by: item 3, item 2, item 3.
  4. S. Arora, R. Ge, Y. Liang, T. Ma and Y. Zhang (2017) Generalization and equilibrium in generative adversarial nets (gans). In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 224–232. Cited by: §1.
  5. M. Bauer and A. Mnih (2018) Resampled priors for variational autoencoders. arXiv preprint arXiv:1810.11428. Cited by: §1, §2.
  6. O. Bousquet, S. Gelly and B. Scholkopf (2018) Wasserstein auto-encoders. Cited by: §1, §2, §2, Table 1.
  7. A. Brock, J. Donahue and K. Simonyan (2018) Large scale gan training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096. Cited by: §1.
  8. C. P. Burgess, I. Higgins, A. Pal, L. Matthey, N. Watters, G. Desjardins and A. Lerchner (2018) Understanding disentangling in beta-vae. arXiv preprint arXiv:1804.03599. Cited by: §2, §2.
  9. L. Cayton (2005) Algorithms for manifold learning. Univ. of California at San Diego Tech. Rep 12 (1-17), pp. 1. Cited by: §1.
  10. T. Q. Chen, X. Li, R. B. Grosse and D. K. Duvenaud (2018) Isolating sources of disentanglement in variational autoencoders. In Advances in Neural Information Processing Systems, pp. 2610–2620. Cited by: §2, §2.
  11. B. Dai and D. Wipf (2019) Diagnosing and enhancing vae models. arXiv preprint arXiv:1903.05789. Cited by: §1, §1, §2, §2, Table 1.
  12. I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville and Y. Bengio (2014) Generative adversarial nets. In Advances in Neural Information Processing Systems 27, Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence and K. Q. Weinberger (Eds.), pp. 2672–2680. Cited by: §1.
  13. A. Grover, M. Dhar and S. Ermon (2017) Flow-gan: bridging implicit and prescribed learning in generative models. arXiv preprint arXiv:1705.08868. Cited by: §1.
  14. I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin and A. Courville (2017) Improved training of wasserstein gans. arXiv preprint arXiv:1704.00028. Cited by: item 3.
  15. M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler and S. Hochreiter (2017) GANs trained by a two time-scale update rule converge to a local nash equilibrium. In Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan and R. Garnett (Eds.), pp. 6626–6637. Cited by: §5.1.
  16. I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, S. Mohamed and A. Lerchner (2017) Beta-vae: learning basic visual concepts with a constrained variational framework. ICLR 2 (5), pp. 6. Cited by: §2, §2.
  17. M. D. Hoffman and M. J. Johnson (2016) Elbo surgery: yet another way to carve up the variational evidence lower bound. In Workshop in Advances in Approximate Bayesian Inference, NIPS, Vol. 1. Cited by: §2.
  18. Y. Hoshen, K. Li and J. Malik (2019) Non-adversarial image synthesis with generative latent nearest neighbors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5811–5819. Cited by: §1, §2.
  19. H. Kim and A. Mnih (2018) Disentangling by factorising. arXiv preprint arXiv:1802.05983. Cited by: §2, §2.
  20. D. P. Kingma and M. Welling (2013) Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114. Cited by: §1, §2, §2, Table 1.
  21. D. P. Kingma, T. Salimans, R. Jozefowicz, X. Chen, I. Sutskever and M. Welling (2016) Improved variational inference with inverse autoregressive flow. In Advances in neural information processing systems, pp. 4743–4751. Cited by: §1, §2, §2, Table 1.
  22. A. Klushyn, N. Chen, R. Kurle, B. Cseke and P. van der Smagt (2019) Learning hierarchical priors in vaes. arXiv preprint arXiv:1905.04982. Cited by: §1, §2, §2.
  23. A. Krizhevsky (2009) Learning multiple layers of features from tiny images. Technical report. Cited by: §5.2.
  24. V. Kyatham, D. Mishra, T. K. Yadav and D. Mundhra (2019) Variational inference with latent space quantization for adversarial resilience. arXiv preprint arXiv:1903.09940. Cited by: §2.
  25. M. H. Law and A. K. Jain (2006) Incremental nonlinear dimensionality reduction by manifold learning. IEEE transactions on pattern analysis and machine intelligence 28 (3), pp. 377–391. Cited by: §1.
  26. Y. Lecun (2010) The mnist database of handwritten digits. Cited by: §1, §5.2.
  27. Z. Liu, P. Luo, X. Wang and X. Tang (2015) Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV). Cited by: §5.2.
  28. M. Lucic, K. Kurach, M. Michalski, S. Gelly and O. Bousquet (2018) Are gans created equal? a large-scale study. In Advances in neural information processing systems, pp. 700–709. Cited by: §5.2.
  29. A. Makhzani, J. Shlens, N. Jaitly and I. Goodfellow (2016) Adversarial autoencoders. In International Conference on Learning Representations. Cited by: §1, §2, §2.
  30. H. Narayanan and S. Mitter (2010) Sample complexity of testing the manifold hypothesis. In Advances in Neural Information Processing Systems, pp. 1786–1794. Cited by: §1.
  31. D. J. Rezende and S. Mohamed (2015) Variational inference with normalizing flows. arXiv preprint arXiv:1505.05770. Cited by: §2.
  32. W. Rudin (1964) Principles of mathematical analysis. Vol. 3, McGraw-hill New York. Cited by: footnote 2.
  33. T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford and X. Chen (2016) Improved techniques for training gans. In Advances in neural information processing systems, pp. 2234–2242. Cited by: §1.
  34. A. Srivastava, L. Valkov, C. Russell, M. U. Gutmann and C. Sutton (2017) Veegan: reducing mode collapse in gans using implicit variational learning. In Advances in Neural Information Processing Systems, pp. 3308–3318. Cited by: §1.
  35. L. Theis, A. v. d. Oord and M. Bethge (2015) A note on the evaluation of generative models. arXiv preprint arXiv:1511.01844. Cited by: §1.
  36. J. M. Tomczak and M. Welling (2017) VAE with a vampprior. arXiv preprint arXiv:1705.07120. Cited by: §1, §2, §2.
  37. A. van den Oord and O. Vinyals (2017) Neural discrete representation learning. In Advances in Neural Information Processing Systems, pp. 6306–6315. Cited by: §1, §2, §2.
  38. H. Xiao, K. Rasul and R. Vollgraf (2017) Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747. Cited by: §5.2.
  39. S. Zhao, J. Song and S. Ermon (2017) InfoVAE: information maximizing variational autoencoders. arXiv preprint arXiv:1706.02262. Cited by: §1.