Towards Latent Space Optimality for AutoEncoder Based Generative Models
Abstract
The field of neural generative models is dominated by the highly successful Generative Adversarial Networks (GANs) despite their well-known challenges, such as training instability and mode collapse. AutoEncoders (AEs) with a regularized latent space provide an alternative framework for generative models, although their performance has not yet reached that of GANs. In this work, we identify one of the causes for the underperformance of AE-based models and propose a remedial measure. Specifically, we hypothesise that the dimensionality of the AE model's latent space has a critical effect on the quality of the generated data. Under the assumption that nature generates data by sampling from a "true" generative latent space followed by a deterministic nonlinearity, we show that optimal performance is obtained when the dimensionality of the latent space of the AE model matches that of the "true" generative latent space. Further, we propose an algorithm called the Latent Masked Generative AutoEncoder (LMGAE), in which the dimensionality of the model's latent space is brought closer to that of the "true" generative latent space via a novel procedure that masks the spurious latent dimensions. We demonstrate through experiments on synthetic and several real-world datasets that the proposed formulation yields generation quality better than that of state-of-the-art AE-based generative models and comparable to that of GANs.
1 Introduction
The primary objective of a generative model is to sample from the data-generating distribution. Deep generative models, especially Generative Adversarial Networks (GANs) [12], have shown remarkable success in this task by generating high-quality data [7]. GANs implicitly learn to sample from the data distribution by transforming a sample from a simplistic distribution (such as a Gaussian) into a sample from the data distribution, optimising a min-max objective through an adversarial game between a pair of function approximators called the generator and the discriminator. Although GANs generate high-quality data, they are known to suffer from problems such as training instability [4, 33], degenerate supports for the generated data (mode collapse) [2, 34] and sensitivity to hyperparameters [7].
AutoEncoder (AE) based generative models provide an alternative to GAN-based models [39, 20, 29, 6]. The fundamental idea is to learn a lower-dimensional latent representation of the data through a deterministic or stochastic encoder, and to learn to generate (decode) the data through a decoder. Typically, both the encoder and the decoder are realised through a learnable family of function approximators, i.e., deep neural networks. To facilitate the generation process, the distribution of the latent space is forced to follow a known distribution so that sampling from it is feasible. Despite resulting in higher data likelihood and stable training, the generation quality of AE-based models is known to be far from that of state-of-the-art GAN models [11, 13, 35].
While the shortcomings of AE-based models have been examined from several angles [11, 18, 21, 36, 22, 5, 37], an important question seems to have remained unaddressed: what is the "optimal" dimensionality of the latent space that ensures good generation in AE-based models? It is a well-known fact that most naturally occurring data effectively lies on a manifold of dimension much smaller than that of the ambient space [9, 25, 30]. Intuitively, this suggests that with well-behaved functions (such as deep neural networks) it is difficult to optimally represent such data using latent dimensionality either less or more than the effective manifold dimensionality. Specifically, "fewer" dimensions may result in loss of information, while "extra" dimensions may introduce noise into the generated data when this latent space is used as input to a generative neural network. This observation is also corroborated by the empirical evidence in Fig. 3, where an AE-based generative model is constructed on synthetic and MNIST [26] datasets with varying latent dimensionality (everything else kept the same), and the standard generation quality metric is shown to follow a peaky U-shaped curve. Motivated by the aforementioned observations, we explore the role of latent dimensionality in the generation quality of AE models, with the following contributions:

We provide a theoretical understanding of the role of the dimensionality of the latent space in the quality of AE-based generative models.

We model data generation as a two-stage process comprising sampling from a "true" latent space followed by a deterministic function, and show that under this model, optimal generation quality is achieved by an AE model exactly when its latent dimensionality equals that of the "true" latent space assumed to generate the data.

Since the dimensionality of the "true" latent space is unknown for real-life data, we propose a method to algorithmically "mask" the spurious dimensions in AE-based models.

We demonstrate the efficacy of the proposed model on synthetic as well as large-scale image datasets by achieving better generation quality metrics than state-of-the-art (SOTA) AE-based models. We also show that the proposed method ensures that the effective number of active dimensions remains the same irrespective of the initially assumed latent dimensionality.
2 Related work
Let $x \in \mathcal{X}$ denote data points lying in the space $\mathcal{X}$, conforming to an underlying distribution $p_X$ from which a generative model desires to sample. An AutoEncoder-based model constructs a lower-dimensional latent space $\mathcal{Z}$, conforming to a distribution $p_Z$, onto which the data is projected through a (probabilistic or deterministic) Encoder function $E: \mathcal{X} \to \mathcal{Z}$. An inverse projection map from $\mathcal{Z}$ to $\mathcal{X}$ is learned through a Decoder function $D$, which can subsequently be used as a sampler for $p_X$. For this to happen, it is necessary that the latent space be regularized to facilitate explicit sampling from $p_Z$, so that the decoder can generate data taking samples from $p_Z$ as input. Most AE-based models take the route of maximizing a lower bound constructed on the data likelihood, which is shown [20, 17] to consist of the sum of two primary terms: (i) the likelihood of the data generated by the Decoder network, and (ii) a divergence measure between the assumed latent distribution, $p_Z$, and the distribution imposed on the latent space by the Encoder, $\hat{p}_Z$ [17, 29]. This underlying commonality suggests that the success of an AE-based generative model depends upon simultaneously increasing the likelihood of the generated data and reducing the divergence between $p_Z$ and $\hat{p}_Z$. The former criterion is fairly easily ensured in all AE models by minimizing a surrogate function such as the reconstruction error between samples of the true data and the output of the decoder; it is observed that, with enough capacity, this can be made arbitrarily small [8, 11, 1]. It is well recognized that the quality of the generated data relies heavily on achieving the second criterion of bringing the Encoder-imposed latent distribution close to the assumed latent prior distribution [11, 17, 8].
This can be achieved either by (i) assuming a predefined primitive distribution for $p_Z$ and modifying the Encoder such that $\hat{p}_Z$ follows the assumed prior [20, 29, 6, 10, 16, 19, 21], or by (ii) modifying the latent prior to follow whatever distribution the Encoder imposes on the latent space [36, 5, 22, 18, 37].
The seminal paper on the VAE [20] proposes a probabilistic Encoder tuned to output the parameters of the conditional posterior, which is forced to follow the Normal prior assumed on the latent space. However, minimizing the divergence between the conditional latent distribution and the prior in the VAE leads to a trade-off between reconstruction quality and latent matching, as this procedure also minimizes the mutual information between the data and the latent code, which in turn reduces the Decoder's ability to render good reconstructions [19]. This issue is partially mitigated by altering the weights on the two terms of the ELBO during optimization [16, 8], or by introducing explicit penalty terms in the ELBO that strongly penalize the deviation of $\hat{p}_Z$ from the assumed prior [10, 19]. Adversarial AutoEncoders (AAE) [29] and Wasserstein AutoEncoders (WAE) [6] address this issue by using adversarial learning and maximum mean discrepancy, respectively, to match $p_Z$ and $\hat{p}_Z$. There have also been attempts to employ normalizing flows for distributional estimation to bring $\hat{p}_Z$ close to $p_Z$ [21, 31]. These methods, although improving generation quality over the vanilla VAE while providing additional properties such as disentanglement in the learned space, fail to reach GAN-level generation quality.
In another class of methods, the latent prior is made learnable instead of being fixed to a primitive distribution, so that it matches the Encoder-imposed $\hat{p}_Z$. In VampPrior [36], the prior is taken as a mixture density whose components are learned using pseudo-inputs to the Encoder. [22] introduces a graph-based interpolation method to learn the prior in a hierarchical way. The Two-stage VAE constructs two VAEs, one on the data space and a second on the latent space of the first-stage VAE [11]. [37, 24] employ a discrete latent space using vector quantization schemes and fit the prior using a discrete autoregressive model.
While all these models have proposed pragmatic methods for matching the distributions by changing their distributional forms, a couple of important related questions seem to be unaddressed: (a) what is the best one can achieve with a simplistic unimodal prior such as a Gaussian, and (b) what is the effect of the latent dimensionality on generation quality in that case? Here, we focus on addressing these.
3 Effect of Latent Dimensionality
3.1 Preliminaries
In this section, we theoretically examine the effect of latent dimensionality on the quality of data generated by an AE. We show that it is impossible to simultaneously achieve both requirements R1 and R2 (defined below) unless the latent space has a certain optimal dimensionality. Specifically, we show that if the dimensionality of the latent space is more than the optimal dimensionality, $p_Z$ and $\hat{p}_Z$ diverge too much (Lemmas 2 and 3), whereas a smaller dimensionality leads to information loss (Lemma 1).
We allow a certain inductive bias in assuming that nature generates the data as in Figure 1 using the following two-step process: first, sample from some isotropic continuous latent distribution in $d$ dimensions (call this $p_{Z_t}$ over $\mathbb{R}^d$), and then pass this sample through a function $g: \mathbb{R}^d \to \mathbb{R}^n$, where $n$ is the data dimensionality. Typically $d \ll n$, thereby making the data lie on a low-dimensional manifold in $\mathbb{R}^n$. Since $\mathbb{R}^d$ can intuitively be viewed as the latent space from which nature generates the data, we call $d$ the true latent dimension and the function $g$ the data-generating function. Note that within this ambit, $\mathbb{R}^d$ forms the domain of $g$. We further cast the following benign assumptions on the function $g$:

$g$ is injective: this assumption follows from the uniqueness of the underlying true latent variable $z_t$, given a data point.

$g$ is Lipschitz: there exists some finite $K$ satisfying $\|g(z_1) - g(z_2)\| \le K\|z_1 - z_2\|$ for all $z_1, z_2 \in \mathbb{R}^d$.

$g$ is differentiable almost everywhere.
Since universal function approximators such as deep neural networks are shown to successfully generate data while adhering to the assumptions made above on $g$, it is reasonable to impose them on the data-generating function. With these definitions and assumptions, we state the requirements for good generation in AE models.
3.2 Conditions for Good Generation
An AE-based generative model attempts to learn continuous functions $E$ (Encoder) and $D$ (Decoder) via some deep enough structure of neural networks. We refer to the dimension $k$ to which the Encoder maps the data as the assumed latent dimension. As discussed earlier, for good-quality data generation the following conditions are to be satisfied:

R1: $\|x - D(E(x))\|$ is minimal for every data point $x$, where $\|\cdot\|$ denotes some norm. This condition states that the reconstruction error between the real and generated data should be minimal.

R2: the Kullback-Leibler divergence $D_{KL}(\hat{p}_Z \,\|\, p_Z)$ between the distribution imposed on the latent space by the Encoder, $\hat{p}_Z$, and the chosen prior, $p_Z$, on $\mathbb{R}^k$ is minimal.
With this, we state and prove the conditions required to ensure R1 and R2 are met.
Theorem 1: Both R1 and R2 can be satisfied only if the assumed latent dimension equals the true latent dimension, i.e., $k = d$.
Proof: We prove this by contradicting either R1 or R2 in each of the cases $k < d$ and $k > d$.
Case A ($k < d$): Since $g$ is injective and differentiable almost everywhere, it has a continuous left inverse (call it $g^{-1}$). Also, R1 forces $D$ to be the left inverse of $E$ on the range of $g$. Since $E$ and $D$ are both (continuous) neural networks, the composite $E \circ g : \mathbb{R}^d \to \mathbb{R}^k$ has a continuous left inverse, namely $g^{-1} \circ D$, which is impossible due to the following lemma:
Lemma 1: A continuous function from $\mathbb{R}^d$ to $\mathbb{R}^k$ cannot have a continuous left inverse if $k < d$.
Proof: It follows from the fact that such a function would define a homeomorphism from $\mathbb{R}^d$ to a subset of $\mathbb{R}^k$, whereas it is well known that these two spaces are not homeomorphic when $k < d$.∎
This implies that R1 and Lemma 1 contradict each other in the case $k < d$; thus, to obtain good reconstruction, $k$ should at least be equal to $d$.
Case B ($k > d$): For the sake of simplicity, let us assume that the domain of $g$ is the unit cube in $\mathbb{R}^d$.
Lemma 2: Let $f: [0,1]^d \to \mathbb{R}^k$ be a $K$-Lipschitz function with $d < k$. Then its range has Lebesgue measure $0$ in $\mathbb{R}^k$.
Proof:
For some $\epsilon > 0$, consider the set of grid points
$S = \{(\epsilon i_1, \ldots, \epsilon i_d) : i_j \in \{0, 1, \ldots, \lceil 1/\epsilon \rceil\}\}$.
Construct closed balls of radius $\epsilon\sqrt{d}$ around the points of $S$. It is easy to see that every point in the domain of $f$ is contained in at least one of these balls, because for any given point the nearest point of $S$ is at most $\epsilon$ units away along each dimension. Also, since $f$ is $K$-Lipschitz, the image of a closed ball with radius $r$ and centre $z$ is a subset of the closed ball with centre $f(z)$ and radius $Kr$.
The range of $f$ is then a subset of the union of the images of all the closed balls defined around $S$. The volume of this union is upper bounded by the sum of the volumes of the individual image balls, each having volume $c\,(K\epsilon\sqrt{d})^k$, where $c$ is a constant equal to the volume of the unit ball in $\mathbb{R}^k$. Therefore,

$\mu(\mathrm{range}(f)) \;\le\; (\lceil 1/\epsilon \rceil + 1)^d \cdot c\,(K\epsilon\sqrt{d})^k \;=\; C\,\epsilon^{k-d}, \qquad (1)$

where $C$ is a constant independent of $\epsilon$ (for $\epsilon \le 1$). Since $k > d$, the final quantity of Eq. 1 can be made arbitrarily small by choosing $\epsilon$ appropriately. Since the Lebesgue measure of a closed ball is the same as its volume, the range of $f$ has measure $0$ in $\mathbb{R}^k$.∎
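As a quick numerical sanity check of the covering argument above, the following sketch evaluates the upper bound on the measure of the range, (number of grid balls) x (volume of one image ball), for shrinking grid spacing; the particular values of the dimensions and the Lipschitz constant are illustrative assumptions, not from the paper.

```python
import math

# Covering bound from Lemma 2: a K-Lipschitz f: [0,1]^d -> R^k (d < k) has
# its range covered by (ceil(1/eps)+1)^d balls of radius K*eps*sqrt(d),
# so its measure is at most a constant times eps^(k-d).

def ball_volume(k: int, r: float) -> float:
    """Volume of a closed ball of radius r in R^k."""
    return math.pi ** (k / 2) / math.gamma(k / 2 + 1) * r ** k

def covering_bound(eps: float, d: int, k: int, K: float) -> float:
    """Upper bound on the Lebesgue measure of range(f) from the grid covering."""
    n_balls = (math.ceil(1 / eps) + 1) ** d          # grid points in [0,1]^d
    return n_balls * ball_volume(k, K * eps * math.sqrt(d))

d, k, K = 2, 5, 10.0                                 # illustrative choices
bounds = [covering_bound(10.0 ** (-i), d, k, K) for i in range(1, 6)]
# The bound scales as eps^(k-d) = eps^3, so it shrinks rapidly with eps.
assert all(b2 < b1 for b1, b2 in zip(bounds, bounds[1:]))
```

Shrinking the grid spacing by 10 multiplies the number of balls by roughly 10^d but shrinks each image-ball volume by 10^k, so the product vanishes exactly when k > d.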
Since $E$ and $g$ are Lipschitz, the composite $E \circ g$ must have a range with Lebesgue measure $0$ in $\mathbb{R}^k$ as a consequence of Lemma 2. Now we show that, as a consequence of the range of $E \circ g$ (call it $S$) having measure $0$, the KL divergence between $\hat{p}_Z$ and $p_Z$ goes to infinity.
Lemma 3: If $p_Z$ and $\hat{p}_Z$ are two distributions as defined in Sec. 3.1 such that the support of the latter has zero Lebesgue measure, then $D_{KL}(\hat{p}_Z \,\|\, p_Z)$ grows arbitrarily large.
Proof: To begin with, the KL divergence can be equivalently expressed as

$D_{KL}(\hat{p}_Z \,\|\, p_Z) = H(\hat{p}_Z, p_Z) - H(\hat{p}_Z), \qquad (2)$

where $H(\hat{p}_Z, p_Z)$ denotes the cross-entropy and $H(\hat{p}_Z)$ the entropy. Define $\mathbb{1}_S$ as the indicator function of $S$, i.e.

$\mathbb{1}_S(z) = 1 \text{ if } z \in S, \text{ and } 0 \text{ otherwise.} \qquad (3)$

Since $S$ has measure $0$ (Lemma 2), we have

$\int_{\mathbb{R}^k} \mathbb{1}_S(z)\, p_Z(z)\, dz = 0. \qquad (4)$

Further, since $\mathbb{1}_S$ is identically $1$ on the support of $\hat{p}_Z$, we have

$\mathbb{E}_{z \sim \hat{p}_Z}[\mathbb{1}_S(z)] = 1. \qquad (5)$

Now consider the part of the cross-entropy between $\hat{p}_Z$ and $p_Z$ taken over the complement of $S$:

$-\int_{\mathbb{R}^k \setminus S} \hat{p}_Z(z) \log p_Z(z)\, dz = 0, \qquad (6)$

which holds because $\hat{p}_Z$ is identically $0$ over the domain of integration. Further, over $S$ itself, Eqs. 4 and 5 imply that unit $\hat{p}_Z$-mass is concentrated on a set to which $p_Z$ assigns zero probability, so that any sequence of densities for $\hat{p}_Z$ concentrating on $S$ makes the integrand unbounded; hence

$-\int_{S} \hat{p}_Z(z) \log p_Z(z)\, dz \ge M \qquad (7)$

for any arbitrarily large positive real $M$. Combining Eqs. 6 and 7, the required cross-entropy is lower bounded by an arbitrarily large quantity $M$. Since cross-entropy and KL divergence differ only by the entropy of $\hat{p}_Z$, which is finite, the KL divergence is arbitrarily large.∎
Thus Lemma 3 contradicts R2, required for good generation, in the case $k > d$. Therefore, neither $k < d$ nor $k > d$ can hold if good generation is to be ensured; one must have $k = d$ (which concludes Theorem 1).∎
One can ensure good generation by the trivial solution of setting $k = d$ with an appropriate Encoder-Decoder pair. However, since neither $g$ nor $d$ is known, one needs a practical method to make $k$ approach $d$, which is described in the next section.
4 Latent Masked Generative AE (LMGAE)
In this section, we take the ideas presented in Section 3 and build an architecture in which the true latent dimension of the underlying data distribution can potentially be discovered in an AE setup, resulting in better generation quality. Our key intuition lies in the following insight: if we start with a large enough estimated latent dimension and give the model the ability to train a mask that suppresses the set of spurious dimensions, then the model will automatically discover the right number of latent dimensions (those left unmasked). This is achieved by minimizing the combination of (1) a reconstruction loss and (2) a divergence loss, i.e., R1 and R2 of Section 3, respectively. We next describe the details of our architecture, followed by the corresponding loss functions that we minimize over it. Figure 2 presents our model architecture, which has the following three components:

Reconstruction Pipeline: This is the standard pipeline in any AE-based model, which tries to minimize the reconstruction loss (R1). An input sample $x$ is passed through the Encoder, resulting in $z = E(x)$, the corresponding representation in the latent space. The new addition here is the Hadamard product with the mask $m$ (explained next), resulting in the masked latent representation $\bar{z} = m \odot z$. The masked representation is then fed to the Decoder to obtain the reconstructed output $\hat{x} = D(\bar{z})$. The goal here is to minimize the norm of the difference between $x$ and $\hat{x}$.

Masking Pipeline: The introduction of a mask is one of the novel contributions of our work, and it forms the second part of our architecture, presented in the middle of Figure 2. Our mask $m$ is a vector of size $k$ (the model capacity). Ideally, the mask would be binary, but in order to make it learnable, we relax it to be continuous-valued, while imposing certain regularizers so that its entries do not deviate too much from $0$ or $1$ during learning. Specifically, we parameterize $m$ using a learnable vector $s$ passed through a squashing nonlinearity; intuitively, this parameterization forces each entry of $m$ to be close to $0$ or $1$.

Distribution-Matching Pipeline: This is the third part of our architecture, presented at the bottom of Figure 2. The objective of this pipeline is to minimize the distributional loss between the prior distribution, $p_Z$, and the distribution imposed on the latent space by the Encoder. A random vector $z_p$ is sampled from the prior distribution, and its Hadamard product with the mask is taken (as with $z$ in the case of the Encoder), resulting in a masked vector $\bar{z}_p = m \odot z_p$. This masked vector is then passed to a critic network $C$, whose goal is to separate the samples coming from the prior distribution ($\bar{z}_p$) from those coming from the encoded space ($\bar{z}$) using some divergence metric. We use the principles detailed in [3], employing the Wasserstein distance to measure the distributional divergence. Note that $C$ has two kinds of inputs, namely masked samples of the prior and masked outputs of the Encoder.
Intuitively, masking of the latent-space vector and the sampled prior vector allows us to work only with a subset of dimensions in the latent space. This means that even though $k$ may be greater than $d$, we can use the idea of masking to effectively work in a $d$-dimensional space, satisfying the conditions of our theory while minimizing the combined loss over R1 and R2 (Section 3). Next, corresponding to each of the components above, we present a loss function; below, $N$ represents the batch size.
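The three pipelines above can be sketched as follows in a minimal numpy mock-up, with linear maps standing in for the encoder and decoder and a sigmoid as the squashing nonlinearity for the mask; the sigmoid choice, the initialization scales, and all shapes here are illustrative assumptions, not the paper's exact design.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, batch = 128, 16, 8                     # data dim, model capacity, batch size

W_enc = rng.standard_normal((k, n)) * 0.01   # stand-in for encoder E
W_dec = rng.standard_normal((n, k)) * 0.01   # stand-in for decoder D
s = rng.standard_normal(k)                   # learnable mask parameters

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

m = sigmoid(s)                               # relaxed binary mask, entries in (0, 1)

# Reconstruction pipeline: x -> z = E(x) -> z_masked = m * z -> x_hat = D(z_masked)
x = rng.standard_normal((batch, n))
z = x @ W_enc.T
z_masked = m * z
x_hat = z_masked @ W_dec.T
recon_loss = np.mean(np.sum((x - x_hat) ** 2, axis=1))

# Distribution-matching pipeline: the prior sample is masked the same way,
# so both critic inputs live on the same unmasked coordinates.
z_prior = rng.standard_normal((batch, k))
z_prior_masked = m * z_prior

assert z_masked.shape == z_prior_masked.shape == (batch, k)
assert np.all((m > 0) & (m < 1))
```

In a real implementation the critic would consume `z_masked` and `z_prior_masked`, and gradients would flow into `s` through the Hadamard products in both pipelines.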

AutoEncoder Loss: This is the standard loss capturing the quality of reconstruction, as used earlier in the AE literature. In addition, we include a term minimizing the variance, within a batch, over the masked dimensions of the Encoder output; the intuition is that the Encoder should not inject information into dimensions that are going to be masked anyway. The loss is specified in Eq. 8, where $\Sigma$ represents the covariance matrix of the encoding matrix $E(X)$, $X$ being the data matrix for the current batch; $m$ is the vector obtained by applying the squashing function pointwise to $s$; and the relative weights of the terms are hyperparameters.

Generator Loss: This loss captures the quality of generation in terms of how far the encoded distribution is from the prior distribution. It measures the ability of the Encoder to generate samples that appear to come from $p_Z$, which is ensured using the generator loss of [3] (Eq. 9).
Distribution-Matching Loss: This is the loss incurred by the distribution-matching (critic) network in matching the distributions. We use the Wasserstein distance [3] to measure distributional closeness (Eq. 10). Recall that $\bar{z} = m \odot z$. A gradient-penalty term is also used, with its coefficient set as in [14]; the remaining weights are hyperparameters.

Masking Loss: This is the loss capturing the quality of the current mask. It is a function of three terms: (1) the autoencoder loss, (2) the distribution-matching loss, expressed through the Wasserstein distance $W$, and (3) a regularizer ensuring that the mask parameters stay close to $0$ or $1$ (Eq. 11). The relative weights of these terms are hyperparameters.
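Since the loss expressions themselves are only described in prose above, the following LaTeX sketch gives one plausible instantiation consistent with that description; the squared-error form, the $m(1-m)$ regularizer, and the $\lambda$ weightings are assumptions, not the paper's exact Eqs. 8-11.

```latex
% One plausible instantiation of the four LMGAE losses (cf. Eqs. 8-11).
% E: encoder, D: decoder, C: critic, m: mask, z_i ~ p_Z, N: batch size.
\begin{align*}
\mathcal{L}_{AE}     &= \frac{1}{N}\sum_{i=1}^{N}\big\|x_i - D(m \odot E(x_i))\big\|^2
                        + \lambda_v \,\mathrm{tr}\!\big(\mathrm{diag}(\mathbf{1}-m)\,\Sigma\big)\\
\mathcal{L}_{gen}    &= -\frac{1}{N}\sum_{i=1}^{N} C\big(m \odot E(x_i)\big)\\
\mathcal{L}_{critic} &= \frac{1}{N}\sum_{i=1}^{N}\Big[C\big(m \odot E(x_i)\big) - C\big(m \odot z_i\big)\Big]
                        + \lambda_{gp}\,\mathcal{L}_{GP}\\
\mathcal{L}_{mask}   &= \mathcal{L}_{AE} + \lambda_W\,\widehat{W}\big(p_Z, \hat{p}_Z\big)
                        + \lambda_b \sum_{j=1}^{k} m_j\,(1-m_j)
\end{align*}
```

Here $\widehat{W}$ denotes the critic's empirical estimate of the Wasserstein distance and $\mathcal{L}_{GP}$ the gradient penalty of [14]; the $m_j(1-m_j)$ term is one standard way to push relaxed mask entries towards $0$ or $1$.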
Training: During training, we optimize each of the four losses specified above in turn. Specifically, in each learning loop, we optimize the autoencoder, generator, distribution-matching, and masking losses, in that order, using a learning schedule. We use RMSProp for our optimization.
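The alternating schedule can be sketched as the following skeleton; the update functions are placeholders that only record the order of updates (real updates would be RMSProp steps on each loss), and the fixed one-step-each schedule within a loop is an assumption.

```python
# Skeleton of the LMGAE training loop: each learning loop updates the four
# losses in order (autoencoder, generator, critic/distribution-matching, mask).

def make_step(name, log):
    def step():
        log.append(name)   # stand-in for: compute loss, take an RMSProp step
    return step

log = []
steps = [make_step(name, log) for name in ("ae", "gen", "critic", "mask")]

n_loops = 3
for _ in range(n_loops):
    for step in steps:     # fixed order within every learning loop
        step()

assert log[:4] == ["ae", "gen", "critic", "mask"]
assert len(log) == 4 * n_loops
```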
5 Experiments and Results
We divide our experiments into two parts: (a) synthetic, and (b) real. In the synthetic experiments, we control the data generation process with a known number of true latent dimensions. Hence, we can compare the performance of our proposed model for different numbers of true latent dimensions and examine whether our method can discover the true number of latent dimensions. This also helps us validate some of the theoretical claims made in Section 3. On the other hand, the goal in the real experiments is to examine whether our masking-based approach can result in generation quality that beats the state-of-the-art AE-based models, given sufficient model capacity. We would also like to understand the behaviour of the number of masked dimensions in this case (though the precise number of true latent dimensions may not be known).
5.1 Synthetic Experiments
In the following description, we use $d$ to denote the true latent dimension and $k$ to denote the estimated latent dimension (i.e., the one used by LMGAE), in line with the notation used earlier in the paper. We often refer to $k$ as the model capacity. We are interested in answering the following questions: assuming that data is generated according to the process described in Section 3, (a) given sufficient model capacity (i.e., $k \ge d$ and sufficiently powerful $E$, $D$, and $C$), can LMGAE discover the true number of latent dimensions? (b) What is the quality of the data generated by LMGAE for varying values of $k$?
In the ideal scenario, we would expect that whenever $k \ge d$, the dimensions masked by LMGAE are exactly those not required, i.e., $k - d$ of them. Further, we would expect the performance of LMGAE to be independent of the value of $k$ whenever $k \ge d$, and to deteriorate for $k < d$. For comparison, for each value of $k$ that we experimented with, we also trained an equivalent WAE model with exactly the same architecture as ours but with no masking. If our theory holds, we would expect the performance of the WAE model to deteriorate whenever $k < d$ or $k > d$.
For the synthetic case, we experiment with the data generated using the following process.

Sample $z \sim \mathcal{N}(0, \Sigma)$, where the mean is fixed to zero and $\Sigma$ is a diagonal covariance matrix (isotropic Gaussian).

Compute $x = g(z)$, where $g$ is a nonlinear function computed using a two-layer fully connected neural network with $n$ output units and leaky ReLU as the nonlinearity. The weights of this network are randomly fixed, and $n$ was taken as 128.
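The two-step process above can be sketched as follows; the hidden width (128), the leaky-ReLU slope (0.2), and whether the output layer is also passed through the nonlinearity are assumptions where the text is not explicit.

```python
import numpy as np

# Synthetic generator per Section 5.1: z ~ N(0, I_d) pushed through a
# randomly fixed two-layer fully connected network with leaky-ReLU
# nonlinearities and n = 128 outputs.

def leaky_relu(t, alpha=0.2):
    return np.where(t > 0, t, alpha * t)

def make_generator(d, n=128, hidden=128, seed=0):
    rng = np.random.default_rng(seed)
    W1 = rng.standard_normal((hidden, d))    # weights are fixed once, never trained
    W2 = rng.standard_normal((n, hidden))
    def g(z):                                # z: (batch, d) -> x: (batch, n)
        h = leaky_relu(z @ W1.T)
        return leaky_relu(h @ W2.T)
    return g

d = 10
g = make_generator(d)
z = np.random.default_rng(1).standard_normal((256, d))   # z ~ N(0, I_d)
x = g(z)
assert x.shape == (256, 128)
```

Because the weights of `g` are frozen, the generated `x` lies (up to the piecewise-linear kinks) on a d-dimensional manifold inside the 128-dimensional data space, which is exactly the regime the theory addresses.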
In our experiments, we set $d$ to two different values and varied $k$ over a range around each, at regular intervals. We use the standard Fréchet Inception Distance (FID) [15] between generated and real samples to validate the quality of the generated data. Figures 3 (a) and (b) present our results on synthetic data. On the X-axis, we plot $k$. The left Y-axis plots the FID score, comparing LMGAE and WAE for different values of $k$; the right Y-axis plots the number of active dimensions discovered by our algorithm. In both figures, both approaches achieve the best FID score when $k = d$. But whereas the performance of WAE deteriorates with increasing $k$, LMGAE retains the optimal FID score independent of the value of $k$. Further, in each case, the number of active dimensions gets very close to the true number of latent dimensions, even for different values of $k$ (as long as $k \ge d$). This clearly validates our theoretical claims, as well as the fact that LMGAE is capable of offering good-quality generation in practice.
5.2 Real Datasets
In this section, we examine the behaviour of LMGAE on real-world datasets. In this case, the true latent dimension $d$ is unknown, but we can still analyze the behaviour as the estimated latent dimension $k$ is varied. We work on four image datasets used extensively in the literature: (a) MNIST [26], (b) Fashion-MNIST [38], (c) CIFAR-10 [23] and (d) CelebA [27]. We use the same test/train splits as used in the earlier literature.
In our first set of experiments, we perform an analysis similar to the one done for the synthetic data, on MNIST. Specifically, we varied the estimated latent dimension (model capacity $k$) for MNIST and analyzed the FID score as well as the number of active dimensions discovered by the model. For comparison, we ran the same experiment with the WAE model. Figure 3 (c) shows the results. As in the case of the synthetic data, we observe a U-shaped behaviour for the WAE model, with the lowest FID achieved at a specific value of $k$. This validates our thesis that the best performance is achieved at a specific latent dimensionality.
Figure 5 shows the behaviour of the mask for different model capacities $k$ on the MNIST dataset. Interestingly, in each case, we are able to discover almost the same number of unmasked dimensions, independent of the starting point. It is also observed that the Wasserstein distance is minimized at the point where the mask reaches its optimal configuration.
Finally, to measure generation quality, we present the FID scores of our method along with those of several state-of-the-art AE-based models mentioned in Section 2 in Table 1. Clearly, our approach achieves the best FID score on all the datasets compared to the state-of-the-art AE-based generative models. The performance of LMGAE is also comparable to that of GANs [28], despite using a simple norm-based reconstruction loss and an isotropic unimodal Gaussian prior. Figure 4 shows some non-cherry-picked samples generated by our algorithm for each of the datasets.
The better FID scores of LMGAE can be attributed to better distribution matching in the latent space between $p_Z$ and $\hat{p}_Z$. However, quantitatively comparing the match between the two distributions is not easy, as LMGAE may mask out some of the latent dimensions, resulting in a mismatch between the dimensionality of the latent spaces of different models and thus rendering the usual metrics undefined. We therefore calculate the average off-diagonal covariance of the encoded latent vectors and report it in Table 2. Since $p_Z$ is assumed to be an isotropic Gaussian, this quantity should ideally be zero, and any deviation from zero indicates a mismatch. For LMGAE we use only the unmasked latent dimensions in the calculation, to ensure that the statistic is not underestimated by counting the unused dimensions. It is observed that, for the same model capacity, LMGAE has a considerably smaller value than the corresponding WAE, indicating better distribution matching.
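The distribution-matching statistic above can be computed as follows; the exact normalization (mean absolute off-diagonal entry) is an assumption, but an isotropic Gaussian prior makes the ideal value zero under any such convention.

```python
import numpy as np

def avg_offdiag_cov(Z):
    """Average absolute off-diagonal covariance of latent vectors.

    Z: (num_samples, num_dims) matrix of encoded latent vectors
    (for LMGAE, restricted to the unmasked dimensions)."""
    C = np.cov(Z, rowvar=False)              # (num_dims, num_dims) covariance
    off = C - np.diag(np.diag(C))            # zero out the diagonal
    k = C.shape[0]
    return np.abs(off).sum() / (k * (k - 1))

# Perfectly correlated dimensions give a large value...
Z_corr = np.array([[1.0, 1.0], [-1.0, -1.0]])
assert np.isclose(avg_offdiag_cov(Z_corr), 2.0)

# ...while a large isotropic Gaussian sample gives a value near zero.
Z_iso = np.random.default_rng(0).standard_normal((100_000, 4))
assert avg_offdiag_cov(Z_iso) < 0.05
```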
[Table 1: FID scores of LMGAE and state-of-the-art AE-based generative models on MNIST, Fashion-MNIST, CIFAR-10 and CelebA (numeric entries not recovered)]
[Table 2: average off-diagonal covariance of encoded latent vectors for WAE and LMGAE at matched model capacities on MNIST, F-MNIST, CIFAR and CelebA (numeric entries not recovered)]
These results clearly demonstrate that not only can LMGAE achieve the best FID scores among AE-based models on a number of benchmark datasets, it also gives us a practical handle for potentially discovering the true set of latent dimensions in real as well as artificial data. To the best of our knowledge, this is the first study analyzing (and exploiting) the effect of the latent dimensionality on generation quality.
6 Discussion and Conclusions
Despite the pragmatic success demonstrated above, we critically analyze the possible deviations of practical cases from the presented analysis. More often than not, naturally occurring data contains some noise superimposed on the actual signal. Thus, theoretically, one can argue that this noise can be utilized to minimize the divergence between the distributions; practically, however, this noise has very low amplitude, so it can only account for a few extra dimensions, giving a slight overestimate of $d$. Further, in practice, not all latent dimensions contribute equally to data generation. Since the objective of our model is to ignore noise dimensions, it can at times end up discarding meaningful data dimensions that do not contribute significantly, leading to a slight underestimate of $d$ (which is observed at times during experimentation). Finally, neural networks, however deep, can represent only a certain level of complexity in a function. This is both good and bad for our purposes. It is good because, while we have shown that certain losses cannot be made zero for $k \ne d$, universal approximators could bring them arbitrarily close to zero, which is practically the same thing; due to their limitations, however, we end up getting a U-curve instead. It is bad because even at $k = d$ the encoder and decoder networks might be unable to learn the appropriate functions, and the critic might fail to tell the distributions apart. This implies that instead of discovering exactly the same number of dimensions every time, we might get a range of values near the true latent dimension. The severity of this problem is likely to increase with the complexity of the dataset (again corroborated by the experiments).
To conclude, in this work we have taken a step towards constructing an optimal latent space for improving the generation quality of AutoEncoder-based neural generative models. We have argued that, under the assumption of a two-step generative process behind natural data, the optimal latent space for an AE model is one whose dimensionality matches that of the latent space of the generative process. Further, we have proposed a practical method to arrive at this optimal dimensionality starting from an arbitrary initial choice.
Footnotes
 Subsequent results can be extended to any other divergence metric.
One can easily obtain another function that scales and translates the unit cube appropriately. Note that for such a function to exist, we need the support of the latent distribution to be bounded, which may not be the case for distributions such as the Gaussian. Such distributions, however, can be approximated in the limiting sense by truncating at some large value [32].
The true latent dimension is not known in this case. In fact, it is only an assumption that the dataset is generated using the data-generation process described in Section 3.1.
References
 (2014) What regularized auto-encoders learn from the data-generating distribution. The Journal of Machine Learning Research 15 (1), pp. 3563–3593.
 (2017) Towards principled methods for training generative adversarial networks. arXiv preprint.
 (2017) Wasserstein generative adversarial networks. In International Conference on Machine Learning, pp. 214–223.
 (2017) Generalization and equilibrium in generative adversarial nets (GANs). In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 224–232.
 (2018) Resampled priors for variational autoencoders. arXiv preprint arXiv:1810.11428.
 (2018) Wasserstein auto-encoders. In International Conference on Learning Representations.
 (2018) Large scale GAN training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096.
 (2018) Understanding disentangling in beta-VAE. arXiv preprint arXiv:1804.03599.
 (2005) Algorithms for manifold learning. Univ. of California at San Diego Tech. Rep. 12 (117), pp. 1.
 (2018) Isolating sources of disentanglement in variational autoencoders. In Advances in Neural Information Processing Systems, pp. 2610–2620.
 (2019) Diagnosing and enhancing VAE models. arXiv preprint arXiv:1903.05789.
 (2014) Generative adversarial nets. In Advances in Neural Information Processing Systems 27, pp. 2672–2680.
 (2017) Flow-GAN: bridging implicit and prescribed learning in generative models. arXiv preprint arXiv:1705.08868.
 (2017) Improved training of Wasserstein GANs. arXiv preprint arXiv:1704.00028.
 (2017) GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems 30, pp. 6626–6637.
 (2017) beta-VAE: learning basic visual concepts with a constrained variational framework. ICLR 2 (5), pp. 6.
 (2016) ELBO surgery: yet another way to carve up the variational evidence lower bound. In Workshop in Advances in Approximate Bayesian Inference, NIPS, Vol. 1.
 (2019) Non-adversarial image synthesis with generative latent nearest neighbors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5811–5819.
 (2018) Disentangling by factorising. arXiv preprint arXiv:1802.05983.
 (2013) Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.
 (2016) Improved variational inference with inverse autoregressive flow. In Advances in Neural Information Processing Systems, pp. 4743–4751.
 (2019) Learning hierarchical priors in VAEs. arXiv preprint arXiv:1905.04982.
 (2009) Learning multiple layers of features from tiny images. Technical report.
 (2019) Variational inference with latent space quantization for adversarial resilience. arXiv preprint arXiv:1903.09940.
 (2006) Incremental nonlinear dimensionality reduction by manifold learning. IEEE Transactions on Pattern Analysis and Machine Intelligence 28 (3), pp. 377–391.
 (2010) The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/
 (2015) Deep learning face attributes in the wild. In Proceedings of the International Conference on Computer Vision (ICCV).
 (2018) Are GANs created equal? A large-scale study. In Advances in Neural Information Processing Systems, pp. 700–709.
 (2016) Adversarial autoencoders. In International Conference on Learning Representations.
 (2010) Sample complexity of testing the manifold hypothesis. In Advances in Neural Information Processing Systems, pp. 1786–1794.
 (2015) Variational inference with normalizing flows. arXiv preprint arXiv:1505.05770.
 (1964) Principles of mathematical analysis. Vol. 3, McGraw-Hill, New York.
 (2016) Improved techniques for training GANs. In Advances in Neural Information Processing Systems, pp. 2234–2242.
 (2017) VEEGAN: reducing mode collapse in GANs using implicit variational learning. In Advances in Neural Information Processing Systems, pp. 3308–3318.
 (2015) A note on the evaluation of generative models. arXiv preprint arXiv:1511.01844.
 (2017) VAE with a VampPrior. arXiv preprint arXiv:1705.07120.
 (2017) Neural discrete representation learning. In Advances in Neural Information Processing Systems, pp. 6306–6315.
 (2017) Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747.
 (2017) InfoVAE: information maximizing variational autoencoders. arXiv preprint arXiv:1706.02262.