Leveraging the Exact Likelihood of Deep Latent Variable Models
Deep latent variable models combine the approximation abilities of deep neural networks and the statistical foundations of generative models. The induced data distribution is an infinite mixture model whose density is extremely delicate to compute. Variational methods are consequently used for inference, following the seminal work of rezende2014 and kingma2014. We study the well-posedness of the exact problem (maximum likelihood) that these techniques approximately solve. In particular, we show that most unconstrained models used for continuous data have an unbounded likelihood. This ill-posedness and the problems it causes are illustrated on real data. We also show how to ensure the existence of maximum likelihood estimates, and draw useful connections with nonparametric mixture models. Furthermore, we describe an algorithm that allows us to perform missing data imputation using the exact conditional likelihood of a deep latent variable model. On several real data sets, our algorithm consistently and significantly outperforms the usual imputation scheme used within deep latent variable models.
Dimension reduction aims at summarising multivariate data using a small number of features that constitute a code. The earliest attempts rested on linear projections, leading to Hotelling’s (hotelling1933) principal component analysis (PCA), which has been extensively explored and refined over the last century (jolliffe2016principal). In recent years, the field has been vividly animated by the successes of latent variable models, which probabilistically use the low-dimensional features to define powerful generative models. Usually, these latent variable models transform the random code into parameters of a simple output distribution. Linear mappings and Gaussian outputs were initially considered, giving rise to factor analysis (bartholomew2011) and probabilistic principal component analysis (tipping1999). Beyond the Gaussian distribution, Bernoulli or more general exponential family outputs have been considered. More recently, much work has been devoted to nonlinear mappings parametrised by deep neural networks, following the seminal papers of rezende2014 and kingma2014. These models have led to impressive empirical performance in unsupervised and semi-supervised generative modelling of images (narayanaswamy2017), molecular structures (gomez2018; kusner2017), and arithmetic expressions (kusner2017). This paper is an investigation of the statistical properties of these models, which remain essentially unknown.
1.1 Deep latent variable models
In their most common form, deep latent variable models (DLVMs) assume that we observe a data matrix that we wish to explain using some latent variables. We assume that the observations are independent and identically distributed (i.i.d.) random variables driven by the following generative model:
The unobserved random vector is called the latent variable and usually marginally follows a simple distribution called the prior distribution. The dimension of the latent space is called the intrinsic dimension, and is usually smaller than the dimensionality of the data. The function is called a decoder or a generative network, and is parametrised by a (deep) neural network whose weights are stored in . This parametrisation allows us to leverage recent advances in deep architectures, such as deep residual networks (kingma2016), recurrent networks (bowman2016), or batch normalisation (sonderby2016).
The collection is a parametric family of output densities with respect to a dominating measure (usually the Lebesgue or the counting measure). Several output families have been considered by recent work:
In the case of discrete multivariate data, products of multinomial distributions.
In the case of real multivariate data, multivariate Gaussian distributions.
When dealing with images, several specific proposals have been made (e.g. the discretised logistic mixture approach of salimans2017).
Dirac outputs correspond to deterministic decoders, that are used e.g. within generative adversarial networks (goodfellow2014), or non-volume preserving transformations (dinh2017).
The Gaussian and Bernoulli families are the most widely studied, and will be the focus of this article. The latent structure of these DLVMs leads to the following marginal distribution of the data:
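In generic notation (the symbols below are illustrative stand-ins, with $z$ the latent code, $p(z)$ the prior, and $p_\theta(x \mid z)$ the output density), this marginal can be sketched as:

```latex
% Marginal (data) distribution of a DLVM: a continuum of output
% distributions mixed by the prior over the code.
p_\theta(x) \;=\; \int p_\theta(x \mid z)\, p(z)\, \mathrm{d}z .
```

This integral over the code space is what makes the induced density an infinite mixture, and what makes it so delicate to compute.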
1.2 Scalable learning through amortised variational inference
The log-likelihood function of a DLVM is, for all ,
which is a quantity extremely challenging to compute, as it involves potentially high-dimensional integrals. Estimating by maximum likelihood therefore appears out of reach. Consequently, following rezende2014 and kingma2014, inference in DLVMs is usually performed using amortised variational inference. Variational inference approximately maximises the log-likelihood by maximising a lower bound known as the evidence lower bound (ELBO, see e.g. blei2017):
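With illustrative notation ($q$ a variational distribution over codes, $p_\theta(x \mid z)$ the output density, $p(z)$ the prior), the bound can be sketched, for a single observation $x$, as:

```latex
\mathrm{ELBO}(\theta, q)
  \;=\; \mathbb{E}_{z \sim q}\big[\log p_\theta(x \mid z)\big]
        - \mathrm{KL}\big(q(z) \,\|\, p(z)\big)
  \;=\; \log p_\theta(x) - \mathrm{KL}\big(q(z) \,\|\, p_\theta(z \mid x)\big).
```

The second identity shows that the gap between the ELBO and the log-likelihood is exactly the Kullback–Leibler divergence between the variational distribution and the true posterior.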
where the variational distribution is a distribution over the space of codes . The variational distribution plays the role of a tractable approximation of the posterior distribution of the codes; when this approximation is perfectly accurate, the ELBO is equal to the log-likelihood. In regular variational inference, the variational distribution is optimised over a space of (often factorised) distributions, which is computationally expensive when the number of samples is large. Amortised inference (gershman2014) scales up variational inference by learning, rather than a separate distribution for each data point, a single function called the inference network (parametrised by a deep neural network whose weights are stored in ). The role of this inference network is to transform each data point into the parameters of the corresponding approximate posterior. More specifically, the amortised family of posterior approximations is
where is a parametric family of distributions over – such as isotropic Gaussians as in kingma2014 or full-rank Gaussians as in cremer2018. Other kinds of families – built using e.g. normalising flows (rezende2015; kingma2016), auxiliary variables (maaloe2016; ranganath2016), or importance weights (burda2016; cremer2017) – have been considered for amortised inference, but they will not be a central focus of this paper. Within this framework, variational inference for DLVMs solves the optimisation problem
using variants of stochastic gradient ascent (see e.g. roeder2017, for strategies for computing gradients estimates of the ELBO).
As emphasised by kingma2014, the ELBO resembles the objective function of a popular deep learning model called an autoencoder (see e.g. goodfellow2016, Chapter 14). This motivates the popular denomination of encoder for the inference network and variational autoencoder (VAE) for the combination of a DLVM with amortised variational inference.
Our main goal is to investigate the theoretical properties of the marginal distribution of deep latent variable models, and to examine to what extend these properties can be leveraged in practice. In particular, we focus on the well-posedness of the maximum likelihood problem, and the use of deep latent variable models for data imputation. Our main contributions are:
We show that maximum likelihood is ill-posed for DLVMs with Gaussian outputs, and we propose several ways of tackling this problem using constraints or regularisation. We empirically show on real data that this theoretical result can lead to undesirable likelihood blow-ups.
We show that maximum likelihood is well-posed for DLVMs with discrete outputs.
We draw a connection between DLVMs and nonparametric statistics, and show that DLVMs can be seen as parsimonious submodels of nonparametric mixture models.
We leverage that connection to provide a way of finding an upper bound of the likelihood based on finite mixtures, under some conditions that guarantee the well-posedness of maximum likelihood. Although it is not tight, this bound, combined with the ELBO, allows us to provide for the first time useful “sandwichings” of the exact likelihood, which are illustrated on real data.
When missing data are present at test time, we show how a simple modification of an approximate scheme proposed by rezende2014 allows us to draw from the exact conditional distribution of the missing data. On several data sets and missing data scenarios, our algorithm consistently outperforms the one of rezende2014, while having the same computational cost.
2 Is maximum likelihood well-defined for deep latent variable models?
In this section, we investigate the well-posedness of maximum likelihood for DLVMs with Gaussian and Bernoulli outputs.
2.1 On the boundedness of the likelihood of deep latent variable models
Deep generative models with Gaussian outputs assume that the data space is , and that the family of output distributions is the family of -variate full-rank Gaussian distributions. The conditional distribution of each data point is consequently
where and are two continuous functions parametrised by neural networks whose weights are stored in a parameter . These two functions constitute the decoder of the model. This leads to the log-likelihood
This model can be seen as a special case of an infinite mixture of Gaussian distributions. However, it is well known that maximum likelihood is ill-posed for finite Gaussian mixtures (see e.g. lecam1990). Indeed, the likelihood function is unbounded above, and its infinite maxima happen to be very poor generative models. This problematic behaviour of a model quite similar to DLVMs motivates the question: is the likelihood function of DLVMs bounded above?
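The mechanism behind this kind of unboundedness can be seen numerically on a toy one-dimensional Gaussian mixture (a hypothetical example, not the paper’s model): pinning one component’s mean on a data point and shrinking its variance makes the log-likelihood diverge, even though the fitted density degenerates.

```python
import numpy as np

def gaussian_logpdf(x, mu, var):
    """Log-density of N(mu, var), applied entrywise."""
    return -0.5 * (np.log(2 * np.pi * var) + (x - mu) ** 2 / var)

x = np.array([0.0, 1.3, -0.7, 2.1])  # toy data set

def mixture_loglik(spike_var):
    # Equal-weight two-component mixture: a broad N(0, 1) component,
    # and a "spike" component whose mean is pinned on the first observation.
    broad = gaussian_logpdf(x, 0.0, 1.0)
    spike = gaussian_logpdf(x, x[0], spike_var)
    return np.logaddexp(np.log(0.5) + broad, np.log(0.5) + spike).sum()

# The log-likelihood grows without bound as the spike variance shrinks.
for v in [1e-2, 1e-4, 1e-8, 1e-12]:
    print(f"spike variance {v:.0e}: log-likelihood {mixture_loglik(v):.2f}")
```

The spike component contributes an arbitrarily large density at its pinned data point, while the broad component keeps every other contribution bounded below: exactly the structure exploited in the proof of Theorem 1.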
In this section, we will not make any particular parametric assumption about the prior distribution of the latent variable . Both simple isotropic Gaussian distributions and more complex learnable priors have been proposed, but all of them are affected by the negative results of this section. We simply make the natural assumptions that is continuous and has zero mean. Many different neural architectures have been explored regarding the parametrisation of the decoder. For example, kingma2017 considers multilayer perceptrons (MLPs) of the form
where and the tanh and exp functions are applied entrywise. The weights of the decoder are , and . The integer is the (common) number of hidden units of the MLPs. Much more complex parametrisations exist, but we will see that this one, arguably one of the most rigid, is already too flexible for maximum likelihood. In fact, we will show that an even less flexible family of MLPs, with a single hidden unit, is problematic. Let be a sequence of nonnegative real numbers such that as . For all , within the framework given by (6) and (7), we will consider the sequences of parameters . This leads to the following simplified decoders:
As shown by the next theorem, these sequences of decoders lead to the explosion of the log-likelihood function.
For all and , we have
Proofs of all results are available in the Appendix of this paper. The key element of the proof is the fact that the sequence of functions converges to a function that outputs both singular and nonsingular covariances, leading to the explosion of while all other terms in the likelihood remain bounded below by a constant.
Using simple MLP-based parametrisations such as the one of kingma2017 therefore brings about an unbounded log-likelihood function. A natural question that follows is: do these infinite suprema lead to useful generative models? The answer is no. In fact, none of the functions considered in Theorem 1 is particularly useful, because of the use of a constant mean function. This is formalised in the next proposition.
For all , , and , the distribution is spherically symmetric and unimodal around .
The spherical symmetry implies that these “optimal” deep generative models will lead to uncorrelated variables, and the unimodality will lead to poor sample diversity. Since being able to account for correlations and diversity is essential in deep generative modelling, the infinite suprema of the likelihood function unveiled by Theorem 1 are not desirable. Unregularised gradient-based optimisation of a tight lower bound of this unbounded likelihood is therefore likely to chase these (infinitely many) bad suprema. This gives a theoretical foundation to the necessary regularisation of VAEs that was already noted by rezende2014 and kingma2014. For example, using weight decay as in kingma2014 is likely to help avoid these infinite maxima.
This difficulty in learning the variance may also explain the choice made by several authors to use a constant variance function of the form , where can be either fixed (zhao2017) or learned via approximate maximum likelihood (pu2016).
Tackling the unboundedness of the likelihood.
Let us go back to a parametrisation which is not necessarily MLP-based. Even in this general context, it is possible to tackle the unboundedness of the likelihood using additional constraints on . Specifically, for each , we will consider the set
Note that . This simple spectral constraint ensures that the likelihood is bounded above.
Let . If the parametrisation of the decoder is such that the image of is included in for all , then the log-likelihood function is upper bounded by .
Similar constraints have been proposed to solve the ill-posedness of maximum likelihood for finite Gaussian mixtures (see e.g. hathaway1985; biernacki2011). In practice, implementing such constraints can be easily done by adding a constant diagonal matrix to the output of the covariance decoder.
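A minimal sketch of this idea (with hypothetical shapes and a constraint constant `xi`, not the paper’s code): build the covariance from an unconstrained decoder output and add a constant diagonal matrix, which floors every eigenvalue at `xi`.

```python
import numpy as np

def constrained_covariance(raw, xi=1e-3):
    """Turn an unconstrained p x k decoder output `raw` into a valid
    covariance matrix whose eigenvalues are all at least xi.
    (Hypothetical parametrisation; xi plays the role of the constant
    in the spectral constraint.)"""
    return raw @ raw.T + xi * np.eye(raw.shape[0])

rng = np.random.default_rng(0)
Sigma = constrained_covariance(rng.normal(size=(5, 2)), xi=1e-3)
print(np.linalg.eigvalsh(Sigma).min())  # never below xi
```

Since `raw @ raw.T` is positive semi-definite, the smallest eigenvalue of the output is bounded below by `xi`, which is exactly the spectral condition required for a bounded likelihood.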
What about other parametrisations?
We chose a specific and natural parametrisation in order to obtain a constructive proof of the unboundedness of the likelihood. However, virtually any other deep parametrisation that does not include covariance constraints will be affected by our result, because of the universal approximation abilities of neural networks (see e.g. goodfellow2016, Section 6.4.1).
Bernoulli DLVMs do not suffer from unbounded likelihood.
When , Bernoulli DLVMs assume that is the family of -variate multivariate Bernoulli distributions (that is, the family of products of univariate Bernoulli distributions). In this case, maximum likelihood is well-posed.
Given any possible parametrisation, the log-likelihood function of a deep latent variable model with Bernoulli outputs is everywhere negative.
2.2 Towards data-dependent likelihood upper bounds
We have determined under which conditions maximum likelihood estimates exist, and have computed simple upper bounds on the likelihood functions. Since they do not depend on the data, these bounds are likely to be very loose. A natural follow-up issue is to seek tighter, data-dependent upper bounds that remain easily computable. Such bounds are desirable because, combined with ELBOs, they would allow us to sandwich the likelihood between two bounds.
To study this problem, let us take a step backwards and consider a more general infinite mixture model. Precisely, given any distribution over the generic parameter space , we define the nonparametric mixture model (see e.g. lindsay1995, Chapter 1) as:
Note that there are many ways for a mixture model to be nonparametric (e.g. having some nonparametric components, an infinite but countable number of components, or an uncountable number of components). In this case, the nonparametric nature comes from the fact that the model parameter is the mixing distribution , which belongs to the set of all probability measures over . The likelihood of any is given by
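In generic notation (symbols ours: $G$ the mixing distribution, $\Theta$ the parameter space, $x_1, \dots, x_n$ the observations), the mixture density and likelihood can be sketched as:

```latex
% Nonparametric mixture: the model parameter is the mixing
% distribution G itself.
p_G(x) \;=\; \int_{\Theta} p(x \mid \theta)\, \mathrm{d}G(\theta),
\qquad
\ell(G) \;=\; \sum_{i=1}^{n} \log p_G(x_i).
```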
When has finite support of cardinality , is a finite mixture model with components. When the mixing distribution is generatively defined by the distribution of a random variable such that
we exactly end up with a deep generative model with decoder . Therefore, the nonparametric mixture is a more general model than the DLVM. The fact that the mixing distribution of a DLVM is intrinsically low-dimensional leads us to interpret the DLVM as a parsimonious submodel of the nonparametric mixture model. This also gives us an immediate upper bound on the likelihood of any decoder :
Of course, in many cases, this upper bound will be infinite (for example in the case of unconstrained Gaussian outputs). However, under the conditions of boundedness of the likelihood of deep Gaussian models, the bound is finite and attained for a finite mixture model with at most components.
Assume that is the family of multivariate Bernoulli distributions or the family of Gaussian distributions with the spectral constraint of Proposition 4. The likelihood of the corresponding nonparametric mixture model is maximised for a finite mixture model of distributions from the family .
This rather surprising result is based on Lindsay’s (lindsay1983) geometric analysis of the likelihood of nonparametric mixtures. Assume that the conditions of Theorem 2 are satisfied. Let us denote a maximum likelihood estimate of as . For all , we therefore have
which gives an upper bound on the likelihood. We call the difference the parsimony gap (see Figure 1). By sandwiching the exact likelihood between this bound and an ELBO, we can also have guarantees on how far a posterior approximation is from the true posterior:
From a computational perspective, the estimate can be found using the expectation-maximisation (EM) algorithm for finite mixtures (dempster1977) – although it is only guaranteed to find a local optimum. Some strategies guaranteed to find a global optimum have also been developed (see e.g. lindsay1995, Chapter 6; wang2007).
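As an illustration of the EM route (a self-contained one-dimensional toy, not the paper’s experimental code), the following fits a finite Gaussian mixture with a variance floor `xi` playing the role of the spectral constraint; the log-likelihood is non-decreasing across iterations, as EM guarantees.

```python
import numpy as np

def em_constrained_gmm(x, K=3, xi=1e-2, n_iter=50, seed=0):
    """EM for a univariate K-component Gaussian mixture with a variance
    floor xi (the one-dimensional analogue of the spectral constraint)."""
    rng = np.random.default_rng(seed)
    n = len(x)
    pi = np.full(K, 1.0 / K)                              # mixture weights
    mu = rng.choice(x, size=K, replace=False).astype(float)
    var = np.full(K, np.var(x))
    lls = []
    for _ in range(n_iter):
        # E-step: log of pi_k * N(x_i; mu_k, var_k), then responsibilities.
        logp = (np.log(pi) - 0.5 * np.log(2 * np.pi * var)
                - 0.5 * (x[:, None] - mu) ** 2 / var)
        ll = np.logaddexp.reduce(logp, axis=1)
        lls.append(ll.sum())
        r = np.exp(logp - ll[:, None])
        # M-step, with the variance floor enforced by clipping.
        nk = r.sum(axis=0)
        pi = nk / n
        mu = (r * x[:, None]).sum(axis=0) / nk
        var = np.maximum((r * (x[:, None] - mu) ** 2).sum(axis=0) / nk, xi)
    return pi, mu, var, np.array(lls)

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-2, 0.5, 100), rng.normal(2, 0.5, 100)])
_, _, var, lls = em_constrained_gmm(x)
print(lls[0], lls[-1])  # log-likelihood increases over the run
```

Clipping the variances at `xi` still maximises the EM surrogate over the constrained set, so the monotone-ascent property of EM is preserved, while the floor rules out the degenerate spikes discussed in Section 2.1.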
3 Handling missing data with variational autoencoders
In this section, we assume that a variational autoencoder has been trained, and that missing data are present at test time. The decoder/encoder couple obtained after training is denoted by and . Let be a new data point that consists of some observed features and missing data .
3.1 Conditional data generation
Since we have a probabilistic model of the data, an ideal way of imputing would be to generate some data according to the conditional distribution
Again, this distribution appears out of reach because of the integration of the latent variable . However, it is reasonable to assume that, for all ,
is easily computable (this is the case for Bernoulli or Gaussian outputs). Under this assumption, we will see that generating data according to the conditional distribution is actually possible.
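In generic notation (symbols ours), the target conditional decomposes over the latent code as:

```latex
p_\theta(x^{\mathrm{miss}} \mid x^{\mathrm{obs}})
  \;=\; \int p_\theta(x^{\mathrm{miss}} \mid x^{\mathrm{obs}}, z)\,
             p_\theta(z \mid x^{\mathrm{obs}})\, \mathrm{d}z ,
```

where the first factor is the easily computable output term, and the second is the intractable posterior of the code given the observed features.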
3.2 Pseudo-Gibbs sampling
rezende2014 proposed a simple way of imputing by following a Markov chain (initialised by randomly imputing the missing data with ). For all , they consider
This scheme closely resembles Gibbs sampling (Geman1984), and actually exactly coincides with Gibbs sampling when the amortised variational distribution is equal to the true posterior distribution for all possible . Following the terminology of heckerman2000, we will call this algorithm pseudo-Gibbs sampling. Very similar schemes have been proposed for more general autoencoder settings (goodfellow2016, Section 20.11). Because of its flexibility, this pseudo-Gibbs approach is routinely used for missing data imputation using DLVMs (see e.g. li2016; rezende2016; du2018).
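On a toy linear-Gaussian model where every distribution is available in closed form (a hypothetical stand-in for a trained VAE: prior z ~ N(0, 1), “decoder” x | z ~ N((z, z), tau2·I), and the exact posterior playing the role of a perfect encoder), the scheme can be sketched as follows; in this idealised case it coincides with exact Gibbs sampling.

```python
import numpy as np

rng = np.random.default_rng(0)
tau2 = 0.5  # output variance of the toy "decoder": x | z ~ N((z, z), tau2 I)

def encoder_sample(x_obs, x_miss):
    # Exact posterior of z given the full vector (x_obs, x_miss):
    # precision 1 + 2/tau2, mean ((x_obs + x_miss)/tau2) / precision.
    prec = 1.0 + 2.0 / tau2
    mean = (x_obs + x_miss) / tau2 / prec
    return rng.normal(mean, np.sqrt(1.0 / prec))

def pseudo_gibbs(x_obs, n_iter=20000):
    x_miss = rng.normal()  # random initial imputation
    draws = []
    for _ in range(n_iter):
        z = encoder_sample(x_obs, x_miss)      # encode current imputation
        x_miss = rng.normal(z, np.sqrt(tau2))  # decode the missing entry
        draws.append(x_miss)
    return np.array(draws)

draws = pseudo_gibbs(x_obs=2.0)
# Analytically, x_miss | x_obs = 2 has mean 2 / (1 + tau2) = 4/3 here.
print(draws[1000:].mean())
```

With an imperfect encoder the chain would instead target a perturbed distribution, which is exactly the gap the Metropolis-within-Gibbs correction of the next subsection removes.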
rezende2014 proved that, when these two distributions are close in some sense, pseudo-Gibbs sampling generates points that approximately follow the conditional distribution . We will see that a simple modification of this scheme allows us to generate samples exactly from the conditional distribution.
3.3 Metropolis-within-Gibbs sampling
At each step of the chain, rather than generating codes according to the approximate posterior distribution, we may use this approximation as a proposal within a Metropolis-Hastings algorithm (Metropolis1953; Hastings1970), using the fact that we have access to the unnormalised posterior density of the latent codes.
Specifically, at each step, we will generate a new code as a proposal using the approximate posterior . This proposal is accepted as a valid code with probability
This probability corresponds to a ratio of importance ratios, and is equal to one when the posterior approximation is perfect. This code-generating scheme exactly corresponds to performing a single iteration of an independent Metropolis-Hastings algorithm. With the obtained code , we can now generate a new imputation using the exact conditional . The obtained algorithm, detailed in Algorithm 1, is a particular instance of a Metropolis-within-Gibbs algorithm (Metropolis1953; Tierney1991). Actually, it exactly corresponds to the algorithm described by gelman1993, and is ensured to asymptotically produce samples from the true conditional distribution , even if the variational approximation is imperfect. Note that when the variational approximation is perfect, all proposals are accepted and the algorithm exactly reduces to Gibbs sampling.
The theoretical superiority of the Metropolis-within-Gibbs scheme over the pseudo-Gibbs sampler comes at almost no additional computational cost. Indeed, all the quantities needed to compute the acceptance probability also have to be computed within the pseudo-Gibbs scheme – except the prior evaluations, which are assumed to be computationally negligible. However, a poor initialisation of the missing values might lead to many rejections at the beginning of the chain, and to slow convergence. A good initialisation heuristic is to perform a few pseudo-Gibbs iterations first, in order to begin with a sensible imputation. Finally, note that, similarly to the pseudo-Gibbs sampler, our Metropolis-within-Gibbs scheme can be extended in a straightforward manner to many other variational approximations – such as normalising flows (rezende2015; kingma2016).
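The acceptance step can be sketched on a toy linear-Gaussian stand-in for a trained VAE (hypothetical: prior z ~ N(0, 1), “decoder” x | z ~ N((z, z), tau2·I), and a deliberately imperfect “encoder” used as the proposal); despite the poor proposal, the chain still targets the exact conditional of the missing entry.

```python
import numpy as np

rng = np.random.default_rng(1)
tau2 = 0.5  # toy "decoder": x | z ~ N((z, z), tau2 I), prior z ~ N(0, 1)

def log_norm(x, mu, var):
    return -0.5 * (np.log(2 * np.pi * var) + (x - mu) ** 2 / var)

def mwg(x_obs, n_iter=20000):
    # Deliberately imperfect "encoder": q(z | x) = N(0.5 * x_obs, 1).
    mu_q, var_q = 0.5 * x_obs, 1.0
    def log_target(z, x_miss):
        # unnormalised posterior of the code: prior times full likelihood
        return (log_norm(z, 0.0, 1.0) + log_norm(x_obs, z, tau2)
                + log_norm(x_miss, z, tau2))
    z = rng.normal(mu_q, np.sqrt(var_q))
    x_miss = rng.normal()
    draws = []
    for _ in range(n_iter):
        z_prop = rng.normal(mu_q, np.sqrt(var_q))
        # acceptance probability: a ratio of importance ratios
        log_alpha = ((log_target(z_prop, x_miss) - log_norm(z_prop, mu_q, var_q))
                     - (log_target(z, x_miss) - log_norm(z, mu_q, var_q)))
        if np.log(rng.uniform()) < log_alpha:
            z = z_prop
        x_miss = rng.normal(z, np.sqrt(tau2))  # exact conditional draw
        draws.append(x_miss)
    return np.array(draws)

draws = mwg(x_obs=2.0)
# Stationary mean of x_miss | x_obs = 2 is 2 / (1 + tau2) = 4/3 here.
print(draws[5000:].mean())
```

The rejection step is the only addition over a naive encode/decode loop: every quantity in `log_alpha` has to be evaluated anyway, apart from the prior terms, matching the cost analysis above.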
4 Empirical results
In this section, we investigate the empirical realisations of our theoretical findings on DLVMs.
Some computational choices are common to all experiments: the prior distribution is a standard Gaussian distribution; the chosen variational family is the family of Gaussian distributions with “diagonal + rank-1” covariance (as in rezende2014, Section 4.3); stochastic gradients of the ELBO are computed via the path derivative estimate of roeder2017; the Adam optimiser (kingma2014adam) is used with a learning rate of and mini-batches of size 10; neural networks are initialised following the heuristics of glorot2010; sampling for variational autoencoders and Markov chains is performed via the Distributions module of TensorFlow (dillon2017). For the Frey faces data set, we used a Gaussian DLVM together with the parametrisation presented in Section 2.1 (with and ). The architecture of the inference network follows the same parametrisation. Constrained finite Gaussian mixtures were fit using the scikit-learn package (pedregosa2011). For the data sets considered in the missing data experiments, we used the architecture of rezende2014, with hidden units and an intrinsic dimension of .
4.1 Concrete examples of likelihood explosions
To investigate whether the unboundedness of the likelihood of a DLVM with Gaussian outputs has concrete consequences for variational inference, we train two DLVMs on the Frey faces data set: one with no constraints, and one with the constraint of Proposition 2 (with ). The results are presented in Figure 3. One can notice that the unconstrained DLVM finds models with very high likelihood but very poor generalisation performance. This confirms that the unboundedness of the likelihood is not merely a theoretical concern. We also display the two upper bounds of the likelihood. The nonparametric bound offers a slight but significant improvement over the naive upper bound. On this example, using the nonparametric upper bound as an early stopping criterion for the unconstrained ELBO appears to provide a good regularisation scheme – one that performs better than the covariance constraints on this data set. This illustrates the potential practical usefulness of the connection that we drew between DLVMs and nonparametric mixtures.
4.2 Comparing the pseudo-Gibbs and Metropolis-within-Gibbs samplers
We compare the two samplers for single imputation of the test sets of three data sets: Caltech 101 Silhouettes and statically binarised versions of MNIST and OMNIGLOT. We consider two missing data scenarios: a first one with pixels missing uniformly at random (Figure 2), and one where the top or bottom half of the pixels is removed (Table 2). Both samplers use the same trained VAE and perform the same number of iterations ( for the first scenario, and for the second scenario). Note that convergence can be monitored using a validation set of complete data. The first iterations of the Metropolis-within-Gibbs sampler are actually pseudo-Gibbs iterations, in order to avoid too many early rejections (see Section 3.3). The Metropolis-within-Gibbs sampler consistently outperforms the pseudo-Gibbs scheme.
Although extremely difficult to compute in practice, the exact likelihood of DLVMs offers several important insights into deep generative modelling. Its unboundedness justifies the necessity of regularising DLVMs for continuous data, which we have demonstrated empirically. Furthermore, we showed that using the exact conditional distribution for data imputation gives better performance than the usual pseudo-Gibbs approach.
Missing data imputation is often used as a performance metric for DLVMs (e.g. li2016; du2018). Since both algorithms have essentially the same computational cost, this motivates replacing pseudo-Gibbs sampling with Metropolis-within-Gibbs when evaluating these models. Upon convergence, the samples generated by Metropolis-within-Gibbs do not depend on the inference network, and explicitly depend on the prior, which allows us to evaluate only the generative performance of the models.
We also interpreted DLVMs as parsimonious submodels of nonparametric mixture models. While we used this connection to provide upper bounds of the likelihood, many other applications could be derived. In particular, the important body of work regarding consistency of maximum likelihood estimates for nonparametric mixtures (e.g. kiefer1956; van2003; chen2017) could be leveraged to study the asymptotics of DLVMs.
While there are several ways of bounding above and below the exact likelihood function, the problem of estimating it accurately remains open. This problem is closely related to the computation of Bayesian marginal likelihoods. Even though exact computation is sometimes possible (e.g. holmes2015; bouveyron2017), marginal likelihoods are usually evaluated thanks to specific sampling schemes – see e.g. robert2009 or friel2012 for good reviews and grosse2013 or frellsen2016 for recent perspectives. While some work has been done towards using these schemes to compute the likelihoods of DLVMs (rezende2014; kingma2014; wu2017; cremer2018), it still constitutes an important and quite unexplored research theme.
Appendix A. Proof of Theorem 1
Let and . To avoid notational overload, we denote in the remainder of this proof. We will show that the contribution of the -th observation explodes while all other contributions remain bounded below.
A first useful remark is the fact that, since is continuous and has zero mean and , the univariate random variable is continuous and has zero mean. Therefore and .
Regarding the -th observation, we have for all ,
Let such that . The function
is strictly decreasing on . Indeed, its derivative is equal to
which is strictly negative because the image of the hyperbolic tangent function is . Moreover,
Therefore, the sequence is strictly increasing and diverges to for all such that . The monotone convergence theorem, combined with the fact that , ensures that the right-hand side of (15) diverges to , leading to .
Regarding the other contributions, let . Since , there exists such that . We have
and, since , we will have
By combining all contributions, we end up with . ∎
Appendix B. Proof of Proposition 1
Because of the constant mean function and the isotropic covariance, the density of is a decreasing function , hence the spherical symmetry and the unimodality. Note that there are several different definitions for multivariate unimodality (see e.g. dharmadhikari1988). Here, we mean that the only local maximum of the density is at . ∎
Appendix C. Proof of Proposition 2
For all , we have
using the fact that the determinant of is lower bounded by for all and that the exponential of a negative number is smaller than one. Therefore, the likelihood function is bounded above by . ∎
Appendix D. Proof of Proposition 3
This is a direct consequence of the fact that the density of a Bernoulli distribution is always smaller than one. ∎
Appendix E. Proof of Theorem 2
Let us assume that there are distinct observations . Following lindsay1983, let us denote
We will use the following theorem, which describes maximum likelihood estimators of nonparametric mixtures under some topological conditions.
Theorem 3 (Lindsay, 1983, Theorem 3.1).
If is compact, then the likelihood of the corresponding nonparametric mixture model is maximised for a finite mixture model of distributions from the family .
This theorem allows us to treat both cases:
Since is compact and the function is continuous, is compact and Lindsay’s theorem can be directly applied.
The parameter space is not compact in this case, but we can get around this problem using a compactification argument similar to the one of van1992. Consider the Alexandroff compactification of the parameter space (kelley1955, p. 150). Because of the definition of , we have for all . Therefore, we can continuously extend the function from to using the conventions for all . The space is therefore compact, and we can use Lindsay’s theorem to deduce that the nonparametric maximum likelihood estimator for the compactified parameter space is a finite mixture of distributions from . However, this finite mixture can only contain elements from . Indeed, if any mixture component were associated with , the likelihood could be improved by emptying said component. Therefore, the maximum likelihood estimator found using the compactified space is also the maximum likelihood estimator of the original space, which allows us to conclude. ∎