Automatic Differentiation Variational Inference with Mixtures
Abstract
Automatic Differentiation Variational Inference (ADVI) is a useful tool for efficiently learning probabilistic models in machine learning. Generally approximate posteriors learned by ADVI are forced to be unimodal in order to facilitate use of the reparameterization trick. In this paper, we show how stratified sampling may be used to enable mixture distributions as the approximate posterior, and derive a new lower bound on the evidence analogous to the importance weighted autoencoder (IWAE). We show that this “SIWAE” is a tighter bound than both IWAE and the traditional ELBO, both of which are special instances of this bound. We verify empirically that the traditional ELBO objective disfavors the presence of multimodal posterior distributions and may therefore not be able to fully capture structure in the latent space. Our experiments show that using the SIWAE objective allows the encoder to learn more complex distributions which regularly contain multimodality, resulting in higher accuracy and better calibration in the presence of incomplete, limited, or corrupted data.
1 Introduction
Variational inference has become a powerful tool for Bayesian modeling using deep neural networks, with successes ranging from image generation Kingma:14 (), classification Alemi:17 (), uncertainty quantification Snoek:19 () and outlier detection Bishop:94 (); Nalisnick:18 (). Much of the recent successes in variational inference have been driven by the relative ease of fitting models using ADVI, where small numbers of samples can be used for individual forward passes through a model, and noisy but unbiased gradients can be determined using the reparameterization trick, allowing the use of backpropagation in training and enabling traditional stochastic gradient methods Rezende:14 (); Kingma:14 (). Currently, one major limitation of ADVI is that it is only possible if the posterior distribution can use reparameterization. This has to date forced ADVI methods to utilize only a subset of possible distributions. While there have been developments in extending reparameterization to broader classes of distributions (e.g., gamma and beta distributions Ruiz:16 ()), multimodal distributions have remained elusive.
This paper focuses on using ADVI with mixture distributions. Mixture distributions potentially present an advantage over unimodal distributions due to their flexibility and ability to approximate more complex posterior distributions Bishop:98 (); West:93 (). The contributions of this paper are as follows:

We propose the SIWAE, a tighter lower bound on the evidence which allows for the specialization of different mixture components in the posterior.

We demonstrate using a toy problem that SIWAE is better suited to approximate a known posterior distribution than either the traditional ELBO and the score function estimator.

We empirically show that models trained using the traditional ELBO objective often fail to discover multimodality in the latent space even if mixtures are used for the posterior. We also show that SIWAE allows models to more easily infer multimodality when it exists.

We demonstrate that models trained with SIWAE achieve higher classification accuracy and better model calibration than ELBO using incomplete feature information.
2 Approach
Consider a simple latent variable model with a single observed data point and corresponding latent variable along with a prior distribution and likelihood . In probabilistic modeling, we are interested in the posterior distribution , but generally, computing the posterior analytically is intractable. Variational inference is a strategy that reframes Bayesian inference as an optimization problem by first introducing a surrogate variational posterior , where are free parameters, and then maximizing the evidence lower bound (ELBO) with respect to . The ELBO is defined as
(1)  
(2) 
and is a lowerbound on the marginal probability of the data . In ADVI, we aim to compute , but computing the ELBO itself is difficult. Both terms in are expectations over , so we approximate the gradient by first drawing samples from and computing the gradient of a MonteCarlo approximation of the ELBO, i.e., for a single sample , we see that .
When computing the gradient, ADVI differentiates through the sampling procedure itself, utilizing the reparameterization trick Kingma:14 (); Rezende:14 (). The reparameterization trick expresses sampling a random variable from its distribution as a transformation of noise drawn from a base distribution , where the transformation is a deterministic function of the parameters of the sampling distribution. For the posterior , we express sampling as and provided we can differentiate w.r.t. , ADVI can differentiate the sampling procedure correctly. For example, if is an unconditional Gaussian distribution parameterized by , we sample by first drawing and computing . In ADVI, we are restricted to “reparameterizable” posterior distributions – distributions whose sampling procedure can be expressed using the reparameterization trick. Although there has been notable work in growing this class of distributions, such as in Figurnov:18 () and Jankowiak:18 (), the choice of posterior in ADVI is limited.
In this paper, we consider mixture posteriors for ADVI, specifically mixtures whose component distributions are reparameterizable. Mixture distributions are a powerful class of posteriors, as growing the number of components can make them arbitrarily expressive, but are challenging to use as posteriors in ADVI as sampling from a mixture is not naively reparameterizable, due to the discrete categorical variable that is sampled to assign a data point to a mixture component. As seen in Roeder:17 (), stratified sampling can address this issue. In stratified sampling, we compute expectations by sampling evenly over each component “strata,” and averaging using the weights of each stratum. For a mixture distribution, the natural stratification is each of the mixture component distributions. For any continuous and differentiable function and mixture distribution , where are the mixture weights and are the components, we can compute the expectation as follows:
(3) 
By pulling the sum over the mixture components outside of the integral over and sampling from each of the mixture components, we are able to compute the expectation using the reparameterization trick, so long as the components distribution from the mixture are themselves subject to reparameterization. Returning to ADVI, when the posterior is a mixture distribution with weights and components , we can compute the “stratified ELBO”, or SELBO:
MonteCarlo estimates of the SELBO for ADVI can be computed by drawing reparameterizable samples from .
2.1 A tighter bound for mixture posteriors
While the SELBO objective allows us to fit a mixture posterior using ADVI, it does not aid in modeling multiple distinct modes in the true posterior if they exist. In fact, the ELBO objective appears to actively oppose the discovery of multiple distinct modes in the posterior. The main reason for this may be due to the mode seeking behavior of the ELBO objective. In particular, the term in the SELBO (the expected loglikelihood) penalizes a posterior for generating samples for which the decoder estimates a low loglikelihood, even if that sample is a priori improbable. ADVI will thus learn an overly conservative lowvariance posterior, which only produces highlikelihood samples, while neglecting to explore other distinct modes which may also be able to explain the data.
To address this limitation, we define a new lower bound on the logevidence using importance sampling. An importanceweighted estimate of the loglikelihood first draws i.i.d. samples from the posterior computes a lowerbound from the importance weights for each sample (called “IWAE” in Burda:16 ()):
Burda:16 () shows that if importance weights are bounded, then as increases the IWAE grows tighter and approaches as . Unlike the regular ELBO, the posterior in the IWAE is less penalized for generating samples that are unlikely. Instead, unlikely samples are weighted less and the learned posterior can have higher variance to better cover the space. This is a desirable property for mixturemodel posteriors, where in order to capture distinct modes, we would like to encourage the posterior to have higher variance.
Our main contribution is a novel importanceweighted estimator for the ELBO when using mixture posteriors. To incorporate importance sampling into the SELBO, we first draw samples from each of the mixture components, . We then compute importance weights that are themselves weighted by the mixture weights, arriving at the “stratified IWAE”, or SIWAE:
By repeated application of Jensen’s equality, we can demonstrate that is a valid lowerbound that is tighter than when .
Theorem 1.
is a lowerbound on the evidence .
Proof.
∎
Theorem 2.
When , is a tighter lowerbound than .
Proof.
∎
is also equivalent to and under certain circumstances ( and , respectively). Because is tighter than even when , is also tighter than . Furthermore the importance sampling step enables highervariance posteriors, as it mitigates the penalty for lowlikelihood samples. Consequently, the implicit posterior Cremer:17 () (defined by importance sampling the learned posterior) can better explore and model different modes.
SELBO and SIWAE are both easy to implement and are simple augmentations of existing variational inference code. See Figure 1 for a code snippet in TensorFlow which evaluates the SIWAE for a latent variable model.
3 Related works
Salimans:13 () and Kingma:14 () show that sampling from a distribution can be reparameterized as a deterministic function of the parameters of the distribution and some auxiliary variable, thereby facilitating the propagation of gradients through the distribution. They also introduced the Variational Auto Encoder (VAE), which uses an amortized variational posterior for a deep generativel model. Burda:16 () showed that the bound on the evidence could be tightened using importance sampling, and that the tightness of the bound was improved by the number of importance samples. Cremer:17 () suggest that the IWAE can be viewed as fitting the traditional ELBO, but using a richer latent space distribution defined by the importancereweighting of samples from the posterior, and further explore the functional forms of these implicit posteriors.
Mixture models have also appeared in variational inference as a popular choice of prior distribution. Dilokthanakul:16 () introduce a VAE which uses a learnable mixture of Gaussians to represent the prior distribution of a latent variable. However, while their training procedure involves computation over a discretized categorical variable, their inference model utilizes only a single component for the posterior, and avoids many of the difficulties posed in fitting a mixture for the posterior that we address in this work, i.e., differentiating through sampling. They find that their model achieves competitive performance on unsupervised clustering, with the mixture components learning clusters that approximate the different classes present in the data. Similarly, Tomczak:17 () use a mixture of Gaussians trained on learnable pseudoinputs as the prior, which allows them to introduce greater flexibility in the latent space distribution. They find that their generative performance improves on a number of benchmarks using this procedure. While using a mixture distribution as a prior enables modeling global structure in the latent space, it does not explicitly model ambiguity or competing explanations for a single observation. The uses of mixture distributions for either the prior or posterior are orthogonal and complementary, and a mixture distribution in either part of the model is a valid option.
Domke:19 () propose to use alternative sampling schemes (including stratified) from a uniform distribution defined over a state space, along with a coupling transformation to the latent space in order to design a sampling scheme which results in better coverage of the approximating posterior distribution. They also show that the divergence of this approximation from the true posterior is then bounded by the looseness of the evidence bound.
When using mixture distributions as the posterior, most works typically avoid having to explicitly sample from the mixture components by either fixing the component weights oh2018modeling (), or by using a continuous relaxation (e.g., the concrete relaxation of the categorical distribution Poduval:20 ()). Graves:16 () proposes an algorithm that allows for gradients to be backpropagated through the mixture distribution when the components distribution is diagonal by composing the sampling procedure as a recursion over the dimensions. Our method only requires that the components distribution is subject to reparameterization, and therefore can be used with a wider class of distributions. Furthermore it does not require explicit specification of the gradient updates to be hardcoded, making it easy to integrate mixtures into existing models. Roeder:17 () derives a pathwise gradient extension to the SELBO that lowers the variance of gradient estimates, but still suffers from the modeseeking properties of the SELBO.
4 Experimental Results
In this section, we compare experimental results with both deterministic models containing no latent variable, and with those containing only a single component parameterizing the latent space distribution.
4.1 Toy Problem
We define a latent variable model where the true posterior is multimodal by construction, with the hope of recovering the distinct modes. Specifically, we sample 1000 datapoints from the following twodimensional generative model:
where , i.e., we first sample a latent from a isotropic normal, but in contrast, observe with some Gaussian noise. For an observed , we know there exist distinct modes in space that could generate it, since is twodimensional. We initialize the variational posterior as a multilayer perceptron (MLP) with 2 layers of 100 hidden units that outputs a 4component mixture of Gaussians distribution. We evaluate three different estimators of the ELBO: (1) SELBO, (2) SIWAE, and (3) a score function estimator as a baseline. We fit the posterior for 1000 epochs, with a batch size of 32 and using the Adam Adam () optimizer with a learning rate of 0.001, using 10 importance samples for SIWAE and 100 for both SELBO and score function.
In Figure 2, we plot a sample SIWAE estimate at the end of each epoch and observe that the SIWAEtrained estimator achieves the highest value of 1.505, compared to 2.024 and 2.038 from the SELBO and score function estimators, respectively. Investigating further, we plot samples from each of the implicit importanceweighted posteriors in the latent space. We find that in many cases, the SELBO and score function posteriors are unable to capture the four distinct modes (see Figure 3), whereas the highervariance SIWAE posterior is able to cover the modes successfully. We also observe similar results to those found in Rainforth:18 (), where tighter variational bounds result in lower signaltonoise ratios in the gradients to the posterior. This is reflected by onaverage highervariance gradients while training a SIWAE posterior vs. a SELBO posterior (1.16 vs. 0.48 average elementwise variance, respectively). However, the score function estimator, with an empirical variance of 261.4, has significantly higher variance than that of both SIWAE and SELBO, indicating that the variance reduction coming from the use of the reparameterization trick offsets the additional variance from a tighter variational bound. We also found that using the “stickingthelanding” estimator Roeder:17 () (see Figure 8, Figure 9) does not significantly improve the SELBO or SIWAE in the toy experiment.
4.2 Single Column MNIST Classification
To evaluate SIWAE’s efficacy on a more challenging problem, we trained a classifier on the benchmark dataset MNIST Lecun:98 (). The classifier is a Variational Information Bottleneck (VIB) model Alemi:17 (), a variant of the VAE where decoder outputs a class rather than a reconstructed input. To better motivate the use of mixtures on this dataset, we consider the problem of classification under incomplete information. In particular, Doersch:16 () shows that training a VAE using only the centermost column of the image introduces multimodality into the dataset that is difficult to capture using a unimodal encoder. We replicate this multimodality in the classification setting by taking the centermost column of each training image. An example of a corrupted input can be seen in Figure 4. In general, it can be difficult even for a human to correctly classify the image given this type of corruption.
For our experiments, we use an MLP architecture with 4 layers of 128 hidden units and ELU activation functions for the encoder, with the last layer outputting a distribution over a two dimensional latent variable. For all models, we use a mixture of multivariate normal distributions with full covariance, and with the mixture weights as a learnable parameter. For models with , this reduces to a single multivariate normal distribution with no learnable mixture weights. For the decoder, we use an affine transformation that output the logits for a categorical distribution. We use such a simple architecture for decoder to encourage the encoder to capture potentially multimodal information about the class of an image. For our prior distribution , we use a trainable mixture of Gaussians, although we found the prior makes relatively little difference in the final results.
For a single component model, we optimize both the traditional evidence lower bound (ELBO), as well as the importance weighted estimate of the evidence (IWAE). For the mixture models, we use stratified sampling to compute the ELBO (SELBO), as well as the StratifiedIWAE (SIWAE) derived in Section 2. We use for the number of mixture components, and for the number of samples drawn per component. To regulate the information content of the posterior, we use a penalty on the KL divergence term (and the equivalent term in the SIWAE objective), as used in Higgins:17 (). Because onecolumn MNIST does not have an established benchmark, we also train two deterministic models to use as baselines: (1) a “pyramid” MLP with 5 layers of 256 hidden units to approximate the peak deterministic accuracy, and (2) a “bottleneck” MLP with the same architecture as our VIB models, therefore containing a two dimensional “latent space.” All models were trained for 50 epochs using the Adam optimizer Adam () with a learning rate of 0.001 which was decayed by 0.5 every 15000 train steps. When training SELBO models, refers to the number of samples drawn to compute the MonteCarlo estimate of the objective.
To evaluate the accuracy of the model, we first need the posterior predictive distribution. We sample the posterior predictive by decoding samples from and averaging the class probabilities returned by each sample. This marginalizes over the uncertainty in the latent variables and nominally produces more calibrated probabilities. From these probabilities, we take the highestprobability class, and consider that the prediction of the model.
In Figure 4, we show the twodimensional latent space representation for a single validation set example learned by optimizing the ELBO objective. We find that, while SELBO enables the use of multiple mixture components in the encoder distribution, the model only learns a unimodal representation for the latent variable. This is a direct consequence of the likelihood term in the ELBO objective, which disincentivizes exploration and encourages modeseeking in the variational posterior. In this case, we observe the posterior “hedging its bets,” where the single mode sits across several decision boundaries. Although the resulting test set accuracy numbers are comparable to the baseline, we show later that a unimodal posterior distribution negatively impacts calibration, which undermines the value proposition of using Bayesian inference. Furthermore, we find that the best SELBO accuracy is achieved by a model with and (where is the number of samples from the posterior, not importance samples). Even without producing distinct modes in the latent space, we find that the model with more mixture components trained with SELBO is able to marginally improve accuracy, suggesting the mixture distribution can also better model just a single mode of the true posterior. Finally, we observe approximating SELBO with more samples at a fixed does not increase the accuracy. This suggests that the reduced variance in the stochastic optimization results in a model that has worse generalization performance, possibly because it settles into a suboptimal local minimum.
In Figure 4 we visualize the latent space learned by optimizing the SIWAE objective. In stark contrast to the SELBO objective, we find that SIWAE learns posteriors that have many active and distinct modes. This implies that rather than “hedging its bets” as in the SELBO, a SIWAEtrained posterior offers multiple competing explanations, moving the uncertainty in the final prediction into the latent space rather than the output of the decoder. This can be directly seen by looking at the transparency of the background colors in Figure 4, which indicates the confidence in the decoder prediction. Where the SELBOtrained decoder tends to have fuzzier, more transparent decision regions, the SIWAEtrained decoder has sharper, more confident decision boundaries. We later see how this property is critical for wellcalibrated predictions. The best performing SIWAE model achieves 77.97% accuracy, roughly equivalent to the best of our deterministic baselines, but using far fewer parameters. Furthermore, while it is difficult to evaluate the interpretability of the latent space quantitatively, the SIWAE models are qualitatively easier to interpret using the latent space, with the model very clearly predicting the example shown as either a 5, an 8, or a 6 (with some additional limited probability that it is a 3). This appears to reflect our own intuition of the output class of this example.
While our models with large appear to achieve comparable accuracy to the deterministic baseline, it is also important to compare their calibration, since realworld decisionmaking systems not only require accurate models, but also ones which quantify their uncertainty correctly. To measure the calibration objectively, we use the Expected Calibration Error (ECE) from Guo:17 () both for the deterministic baseline, as well as for the models trained with SELBO and SIWAE. For the SELBO and SIWAE models, we use the posterior predictive distribution to evaluate the ECE, marginalized over 10000 samples from . We show the ECE as a function of the number of mixture components for our models (Figure 6). In addition to the improvements in accuracy noted above, we find that increasing also decreases the calibration error when training using the SIWAE objective, outperforming the deterministic baseline for , where the SIWAE objective should maximize a better approximation of the evidence. We also find that models trained with SELBO do not have a corresponding improvement in calibration, with a final ECE more than 10 times larger than the equivalent model trained with SIWAE. Additionally, models trained with SELBO never surpass the deterministic baseline of 4.5% calibration error, in contrast to SIWAE models which achieve 1% ECE for and . This confirms our hypothesis that having predictive uncertainty stem from the decoder worsens the model calibration and that having multiple competing explanations in the latent space results in better calibrated predictions.
We further investigate the properties of the objectives as the number of mixture components and number of samples are varied. Figure 5 shows the classification accuracy of a VIB model over a range of and . We find that for SIWAE models, increasing the number of components in the model monotonically increases the classification accuracy. This suggests that with sufficient computational budget, there is no downside to increasing , though there are diminishing returns. With SELBO, we observe the same trend, though test set accuracy is consistently lower than with SIWAE. For , we find that increasing (number of importance samples) also increases accuracy with the SIWAE objective (which for is equivalent to IWAE), indicating that more importance samples are helpful for fitting a better approximate posterior and generative model. For both and we find that increasing (the number of MonteCarlo samples) does not increase the accuracy when the ELBO or SELBO objectives are used. For , increasing generally appears to improve the classification accuracy, but the improvements are less clear both because the accuracy for is already close to the peak accuracy and because the optimization appears more difficult for large numbers of samples. In the case of SIWAE, it is important to note that arbitrarily growing the number of importance samples may also be harmful, a phenomenon observed by Rainforth:18 (). We do not observe any decreases in performance over the range of importance samples, suggesting that positive effect of importance sampling enabling fitting better mixture models outweighs the negative effect of worse gradients.
Onedimensional latent variable
We ran the same MNIST classification experiment using a onedimensional latent variable. In general, fitting a onedimensional latent variable should result in an appreciable drop in accuracy because a singledimensional bottleneck allows for a maximum of two decision boundaries for a particular class, and therefore forces a latent representation which becomes multimodal in the presence of any complex structure in the uncertainty. Therefore, a reasonable expectation is that this should force a model trained with SELBO to learn distinct modes in the encoder. However, we experimented with training for this objective using multiple components as well as with a single component, and in all cases achieve an accuracy of 51% or lower, which is substantially worse than can be reached in two dimensions. By dissecting this model, we see again that the model reduces to a single mode in the posterior, either by assigning all of the component weights to a single mode, or by merging all of the separate modes together (Figure 7). This strongly suggests that the SELBO objective actively opposes the formation of multiple modes in the posterior.
Using the SIWAE objective instead of the SELBO, we see our accuracy climb to , nearly equivalent to the peak accuracy in two dimensions. Interestingly, we also find that using the SIWAE objective with only a single component (IWAE) outperforms the traditional ELBO substantially as well, achieving 63% accuracy. However, there is still a substantial gap between the IWAE model and the SIWAE model. All of the probabilistic models outperform a deterministic model in this case, which achieves a peak accuracy of merely 25%.
5 Conclusion
In this work, we argue that although stratified sampling allows for ADVI with mixture posterior distributions, use of the traditional ELBO objective may not be sufficient to discover multimodality in the posterior distribution even if multimodality exists by construction. We also show how stratified sampling enables a tighter evidence lower bound analogous to the IWAE, but utilizing stratification over the encoder mixture components to make the bound tighter. We show in experiments that this SIWAE encourages multimodality in the latent space, and that it produces posterior distributions which are closer to the true posterior. We also show that multimodality in the posterior helps classifiers better represent predictive uncertainty, resulting in increased classification accuracy and better calibration using extremely limited input information. Both accuracy and calibration improve as the number of components in the mixture increases, suggesting that the training with the SIWAE objective ultimately results in a better approximation to the true posterior distribution.
Appendix A Supplementary Material
a.1 Additional Toy Experiment Results
References
 Alex Alemi, Ian Fischer, Josh Dillon, and Kevin Murphy. Deep variational information bottleneck. In ICLR, 2017.
 Christopher M Bishop. Novelty detection and neural network validation. ICANN ’93, 1993.
 Christopher M Bishop, Neil D Lawrence, Tommi Jaakkola, and Michael I Jordan. Approximating posterior distributions in belief networks using mixtures. In Advances in neural information processing systems, pages 416–422, 1998.
 Yuri Burda, Roger Grosse Roger, and Ruslan Salakhutdinov. Importance weighted autoencoders. arXiv preprint arXiv:1509.00519, 2015.
 Chris Cremer, Quaid Morris, and David Duvenaud. Reinterpreting importanceweighted autoencoders. arXiv preprint arXiv:1704.02916, 2017.
 P Kingma Diederik, Max Welling, et al. Autoencoding variational bayes. In Proceedings of the International Conference on Learning Representations (ICLR), volume 1, 2014.
 Nat Dilokthanakul, Pedro AM Mediano, Marta Garnelo, Matthew CH Lee, Hugh Salimbeni, Kai Arulkumaran, and Murray Shanahan. Deep unsupervised clustering with Gaussian mixture variational autoencoders. arXiv preprint arXiv:1611.02648, 2016.
 Carl Doersch. Tutorial on variational autoencoders. arXiv preprint arXiv:1606.05908, 2016.
 Justin Domke and Daniel R Sheldon. Divide and couple: Using Monte Carlo variational objectives for posterior approximation. In Advances in Neural Information Processing Systems, pages 338–347, 2019.
 Mikhail Figurnov, Shakir Mohamed, and Andriy Mnih. Implicit reparameterization gradients. In Advances in Neural Information Processing Systems, pages 441–452, 2018.
 Alex Graves. Stochastic backpropagation through mixture density distributions. arXiv preprint arXiv:1607.05690, 2016.
 Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q Weinberger. On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine LearningVolume 70, pages 1321–1330. JMLR. org, 2017.
 Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. BetaVAE: Learning basic visual concepts with a constrained variational framework. ICLR, 2(5):6, 2017.
 Martin Jankowiak and Fritz Obermeyer. Pathwise derivatives beyond the reparameterization trick. arXiv preprint arXiv:1806.01851, 2018.
 Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradientbased learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
 Eric Nalisnick, Akihiro Matsukawa, Yee Whye Teh, Dilan Gorur, and Balaji Lakshminarayanan. Do deep generative models know what they don’t know? arXiv preprint arXiv:1810.09136, 2018.
 Seong Joon Oh, Andrew C. Gallagher, Kevin P. Murphy, Florian Schroff, Jiyan Pan, and Joseph Roth. Modeling uncertainty with hedged instance embeddings. In International Conference on Learning Representations, 2019.
 Pranav Poduval, Hrushikesh Loya, Rajat Patel, and Sumit Jain. Mixture distributions for scalable Bayesian inference, 2020.
 Tom Rainforth, Adam R Kosiorek, Tuan Anh Le, Chris J Maddison, Maximilian Igl, Frank Wood, and Yee Whye Teh. Tighter variational bounds are not necessarily better. arXiv preprint arXiv:1802.04537, 2018.
 Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning, pages 1278–1286, 2014.
 Geoffrey Roeder, Yuhuai Wu, and David K Duvenaud. Sticking the landing: Simple, lowervariance gradient estimators for variational inference. In Advances in Neural Information Processing Systems, pages 6925–6934, 2017.
 Francisco J. R. Ruiz, Michalis K. Titsias, and David M. Blei. The generalized reparameterization gradient. In Advances in neural information processing systems, pages 460–468, 2016.
 Tim Salimans and David A Knowles. Fixedform variational posterior approximation through stochastic linear regression. Bayesian Analysis, 8(4):837–882, 2013.
 Jasper Snoek, Yaniv Ovadia, Emily Fertig, Balaji Lakshminarayanan, Sebastian Nowozin, D Sculley, Joshua Dillon, Jie Ren, and Zachary Nado. Can you trust your model’s uncertainty? evaluating predictive uncertainty under dataset shift. In Advances in Neural Information Processing Systems, pages 13969–13980, 2019.
 Jakub M Tomczak and Max Welling. VAE with a VampPrior. arXiv preprint arXiv:1705.07120, 2017.
 Mike West. Approximating posterior distributions by mixtures. Journal of the Royal Statistical Society: Series B (Methodological), 55(2):409–422, 1993.