Operator Variational Inference
Abstract
Variational inference is an umbrella term for algorithms which cast Bayesian inference as optimization. Classically, variational inference uses the KullbackLeibler divergence to define the optimization. Though this divergence has been widely used, the resultant posterior approximation can suffer from undesirable statistical properties. To address this, we reexamine variational inference from its roots as an optimization problem. We use operators, or functions of functions, to design variational objectives. As one example, we design a variational objective with a LangevinStein operator. We develop a black box algorithm, \glsOPVI, for optimizing any operator objective. Importantly, operators enable us to make explicit the statistical and computational tradeoffs for variational inference. We can characterize different properties of variational objectives, such as objectives that admit data subsampling—allowing inference to scale to massive data—as well as objectives that admit variational programs—a rich class of posterior approximations that does not require a tractable density. We illustrate the benefits of \glsOPVI on a mixture model and a generative model of images.
fontsize= \glsdisablehyper \newacronymKLklKullbackLeibler \newacronymELBOelboevidence lower bound \newacronymEPepexpectation propagation \newacronymMCmcMonte Carlo \newacronymMCMCmcmcMarkov chain Monte Carlo \newacronymVIvivariational inference \newacronymMFVImfvimeanfield variational inference \newacronymSVIsvistochastic variational inference \newacronymVMPvmpvariational message passing \newacronymADVIadviautomatic differentiation variational inference \newacronymRMSPROPrmsproprmsprop \newacronymNUTSnutsnoUturn sampler \newacronymHMChmcHamiltonian Monte Carlo \newacronymARDardautomatic relevance determination \newacronymGMMgmmGaussian mixture model \newacronymDLGMDLGMdeep latent Gaussian model \newacronymLSlsLangevinStein \newacronymOPVIopvioperator variational inference
1 Introduction
Variational inference is an umbrella term for algorithms that cast Bayesian inference as optimization [11]. Originally developed in the 1990s, recent advances in variational inference have scaled Bayesian computation to massive data [8], provided black box strategies for generic inference in many models [21], and enabled more accurate approximations of a model’s posterior without sacrificing efficiency [23, 22]. These innovations have both scaled Bayesian analysis and removed the analytic burdens that have traditionally taxed its practice.
Given a model of latent and observed variables , variational inference posits a family of distributions over its latent variables and then finds the member of that family closest to the posterior, . This is typically formalized as minimizing a \glsKL divergence from the approximating family to the posterior . However, while the objective offers many beneficial computational properties, it is ultimately designed for convenience; it sacrifices many desirable statistical properties of the resultant approximation.
When optimizing \glsKL, there are two issues with the posterior approximation that we highlight. First, it typically underestimates the variance of the posterior. Second, it can result in degenerate solutions that zero out the probability of certain configurations of the latent variables. While both of these issues can be partially circumvented by using more expressive approximating families, they ultimately stem from the choice of the objective. Under the \glsKL divergence, we pay a large price when is big where is tiny; this price becomes infinite when has larger support than .
In this paper, we revisit variational inference from its core principle as an optimization problem. We use operators—mappings from functions to functions—to design variational objectives, explicitly trading off computational properties of the optimization with statistical properties of the approximation. We use operators to formalize the basic properties needed for variational inference algorithms. We further outline how to use them to define new variational objectives; as one example, we design a variational objective using a LangevinStein operator.
We develop \glsresetOPVI\glsOPVI, a black box algorithm that optimizes any operator objective. In the context of \glsOPVI, we show that the LangevinStein objective enjoys two good properties. First, it is amenable to data subsampling, which allows inference to scale to massive data. Second, it permits rich approximating families, called variational programs, which do not require analytically tractable densities. This greatly expands the class of variational families and the fidelity of the resulting approximation. (We note that the traditional \glsKL is not amenable to using variational programs.) We study \glsOPVI with the LangevinStein objective on a mixture model and a generative model of images.
Related Work. There are several threads of research in variational inference with alternative divergences. An early example is \glsEP [18]. \glsEP promises approximate minimization of the inclusive \glsKL divergence to find overdispersed approximations to the posterior. \glsEP hinges on local minimization with respect to subsets of data and connects to work on divergence minimization [19, 7]. However, it does not have convergence guarantees and typically does not minimize \glsKL or an divergence because it is not a global optimization method. We note that these divergences can be written as operator variational objectives, but they do not satisfy the tractability criteria and thus require further approximations. Li and Turner [16] present a variant of divergences that satisfy the full requirements of \glsOPVI. Score matching [10], a method for estimating models by matching the score function of one distribution to another that can be sampled, also falls into the class of objectives we develop.
Here we show how to construct new objectives, including some not yet studied. We make explicit the requirements to construct objectives for variational inference. Finally, we discuss further properties that make them amenable to both scalable and flexible variational inference.
2 Operator Variational Objectives
We define operator variational objectives and the conditions needed for an objective to be useful for variational inference. We develop a new objective, the LangevinStein objective, and show how to place the classical \glsKL into this class. In the next section, we develop a general algorithm for optimizing operator variational objectives.
2.1 Variational Objectives
Consider a probabilistic model of data and latent variables . Given a data set , approximate Bayesian inference seeks to approximate the posterior distribution , which is applied in all downstream tasks. Variational inference posits a family of approximating distributions and optimizes a divergence function to find the member of the family closest to the posterior.
The divergence function is the variational objective, a function of both the posterior and the approximating distribution. Useful variational objectives hinge on two properties: first, optimizing the function yields a good posterior approximation; second, the problem is tractable when the posterior distribution is known up to a constant.
The classic construction that satisfies these properties is the \glsELBO,
(1) 
It is maximized when and it only depends on the posterior distribution up to a tractable constant, . The \glsELBO has been the focus in much of the classical literature. Maximizing the \glsELBO is equivalent to minimizing the \glsKL divergence to the posterior, and the expectations are analytic for a large class of models [5].
2.2 Operator Variational Objectives
We define a new class of variational objectives, operator variational objectives. An operator objective has three components. The first component is an operator that depends on and . (Recall that an operator maps functions to other functions.) The second component is a family of test functions , where each maps realizations of the latent variables to real vectors . In the objective, the operator and a function will combine in an expectation , designed such that values close to zero indicate that is close to . The third component is a distance function , which is applied to the expectation so that the objective is nonnegative. (Our example uses the square function .)
These three components combine to form the operator variational objective. It is a nonnegative function of the variational distribution,
(2) 
Intuitively, it is the worstcase expected value among all test functions . Operator variational inference seeks to minimize this objective with respect to the variational family .
We use operator objectives for posterior inference. This requires two conditions on the operator and function family.
 [leftmargin=*]

Closeness. The minimum of the variational objective is at the posterior, . We meet this condition by requiring that for all . Thus, optimizing the objective will produce if it is the only member of with zero expectation (otherwise it will produce a distribution in the equivalence class: with zero expectation). In practice, the minimum will be the closest member of to .

Tractability. We can calculate the variational objective up to a constant without involving the exact posterior . In other words, we do not require calculating the normalizing constant of the posterior, which is typically intractable. We meet this condition by requiring that the operator —originally in terms of and —can be written in terms of and . Tractability also imposes conditions on : it must be feasible to find the supremum. Below, we satisfy this by defining a parametric family for that is amenable to stochastic optimization.
Eq.LABEL:*eq:operatorobj and the two conditions provide a mechanism to design meaningful variational objectives for posterior inference. Operator variational objectives try to match expectations with respect to to those with respect to .
2.3 Understanding Operator Variational Objectives
Consider operators where only takes positive values. In this case, distance to zero can be measured with the identity , so tractability implies the operator need only be known up to a constant. This family includes tractable forms of familiar divergences like the \glsKL divergence (\glsELBO), Rényi’s divergence [16], and the divergence [20].
When the expectation can take positive or negative values, operator variational objectives are closely related to Stein divergences [2]. Consider a family of scalar test functions that have expectation zero with respect to the posterior, . Using this family, a Stein divergence is
Now recall the operator objective of Eq.LABEL:*eq:operatorobj. The closeness condition implies that
In other words, operators with positive or negative expectations lead to Stein divergences with a more generalized notion of distance.
2.4 LangevinStein Operator Variational Objective
We developed the operator variational objective. It is a class of tractable objectives, each of which can be optimized to yield an approximation to the posterior. An operator variational objective is built from an operator, function class, and distance function to zero. We now use this construction to design a new type of variational objective.
An operator objective involves a class of functions that has known expectations with respect to an intractable distribution. There are many ways to construct such classes [1, 2]. Here, we construct an operator objective from the generator Stein’s method applied to the Langevin diffusion.
Let denote the divergence of a vectorvalued function , that is, the sum of its individual gradients. Applying the generator method of Barbour [2] to Langevin diffusion gives the operator
(3) 
We call this the \glsLS operator. We obtain the corresponding variational objective by using the squared distance function and substituting Eq.LABEL:*eq:langstein into Eq.LABEL:*eq:operatorobj,
(4) 
The \glsLS operator satisfies both conditions. First, it satisfies closeness because it has expectation zero under the posterior (Appendix A) and its unique minimizer is the posterior (Appendix B). Second, it is tractable because it requires only the joint distribution. The functions will also be a parametric family, which we detail later.
Additionally, while the \glsKL divergence finds variational distributions that underestimate the variance, the \glsLS objective does not suffer from that pathology. The reason is that \glsKL is infinite when the support of is larger than ; here this is not the case.
We provided one example of a variational objectives using operators, which is specific to continuous variables. In general, operator objectives are not limited to continuous variables; Appendix C describes an operator for discrete variables.
2.5 The KL Divergence as an Operator Variational Objective
Finally, we demonstrate how classical variational methods fall inside the operator family. For example, traditional variational inference minimizes the \glsKL divergence from an approximating family to the posterior [11]. This can be construed as an operator variational objective,
(5) 
This operator does not use the family of functions—it trivially maps all functions to the same function. Further, because \glsKL is strictly positive, we use the identity distance .
The operator satisfies both conditions. It satisfies closeness because . It satisfies tractability because it can be computed up to a constant when used in the operator objective of Eq.LABEL:*eq:operatorobj. Tractability comes from the fact that .
3 Operator Variational Inference
\glsresetallWe described operator variational objectives, a broad class of objectives for variational inference. We now examine how it can be optimized. We develop a black box algorithm [30, 21] based on Monte Carlo estimation and stochastic optimization. Our algorithm applies to a general class of models and any operator objective.
Minimizing the operator objective involves two optimizations: minimizing the objective with respect to the approximating family and maximizing the objective with respect to the function class (which is part of the objective).
We index the family with variational parameters and require that it satisfies properties typically assumed by black box methods [21]: the variational distribution has a known and tractable density; we can sample from ; and we can tractably compute the score function . We index the function class with parameters , and require that is differentiable. In the experiments, we use neural networks, which are flexible enough to approximate a general family of test functions [9].
Given parameterizations of the variational family and test family, \glsOPVI seeks to solve a minimax problem,
(6) 
We will use stochastic optimization [25, 14]. In principle, we can find stochastic gradients of by rewriting the objective in terms of the optimized value of , . In practice, however, we simultaneously solve the maximization and minimization. Though computationally beneficial, this produces saddle points. In our experiments we found it to be stable enough. We derive gradients for the variational parameters and test function parameters . (We fix the distance function to be the square ; the identity also readily applies.)
Gradient with respect to . For a fixed test function with parameters , denote the objective
The gradient with respect to variational parameters is
Now write the second expectation with the score function gradient [21]. This gradient is
(7) 
Eq.LABEL:*eq:gradlambda lets us calculate unbiased stochastic gradients. We first generate two sets of independent samples from ; we then form Monte Carlo estimates of the first and second expectations. For the second expectation, we can use the variance reduction techniques developed for black box variational inference, such as RaoBlackwellization [21].
We described the score gradient because it is general. An alternative is to use the reparameterization gradient for the second expectation [12, 24]. It requires that the operator be differentiable with respect to and that samples from can be drawn as a transformation of a parameterfree noise source , . In our experiments, we use the reparameterization gradient.
Gradient with respect to . Mirroring the notation above, the operator objective for fixed variational is
The gradient with respect to test function parameters is
(8) 
Again, we can construct unbiased stochastic gradients with two sets of Monte Carlo estimates. Note that gradients for the test function do not require score gradients (or reparameterization gradients) because the expectation does not depend on .
Algorithm. Algorithm LABEL:*alg:operator_vi outlines \glsOPVI. We simultaneously minimize the variational objective with respect to the variational family while maximizing it with respect to the function class . Given a model, operator, and function class parameterization, we can use automatic differentiation to calculate the necessary gradients [3]. Provided the operator does not require modelspecific computation, this algorithm satisfies the black box criteria.
[t]
\SetKwInOutInputInput
\SetKwInOutOutputOutput
\InputModel , variational approximation
\OutputVariational parameters
Initialize and randomly.
\Whilenot converged
Compute unbiased estimates of from
Eq.LABEL:*eq:gradlambda.
Compute unbiased esimates of from
Eq.LABEL:*eq:gradtheta.
Update , with unbiased stochastic gradients.
3.1 Data Subsampling and \glsOpvi
With stochastic optimization, data subsampling scales up traditional variational inference to massive data [8, 28]. The idea is to calculate noisy gradients by repeatedly subsampling from the data set, without needing to pass through the entire data set for each gradient.
An as illustration, consider hierarchical models. Hierarchical models consist of global latent variables that are shared across data points and local latent variables each of which is associated to a data point . The model’s log joint density is
Hoffman et al. [8] calculate unbiased estimates of the log joint density (and its gradient) by subsampling data and appropriately scaling the sum.
We can characterize whether \glsOPVI with a particular operator supports data subsampling. \glsOPVI relies on evaluating the operator and its gradient at different realizations of the latent variables (Eq.LABEL:*eq:gradlambda and Eq.LABEL:*eq:gradtheta). Thus we can subsample data to calculate estimates of the operator when it derives from linear operators of the log density, such as differentiation and the identity. This follows as a linear operator of sums is a sum of linear operators, so the gradients in Eq.LABEL:*eq:gradlambda and Eq.LABEL:*eq:gradtheta decompose into a sum. The \acrlongLS and \acrshortKL operator are both linear in the log density; both support data subsampling.
3.2 Variational Programs
Given an operator and variational family, Algorithm LABEL:*alg:operator_vi optimizes the
corresponding operator objective.
Certain operators require the density of . For example, the
\acrshortKL operator (Eq.LABEL:*eq:KLoperator) requires its log
density. This potentially limits the construction of rich variational
approximations for which the density of is difficult to
compute.
Some operators, however, do not depend on having a analytic density; the \glsLS operator (Eq.LABEL:*eq:langstein) is an example. These operators can be used with a much richer class of variational approximations, those that can be sampled from but might not have analytically tractable densities. We call such approximating families variational programs.
Inference with a variational program requires the family to be reparameterizable [12, 24]. (Otherwise we need to use the score function, which requires the derivative of the density.) A reparameterizable variational program consists of a parametric deterministic transformation of random noise . Formally, let
(9) 
This generates samples for , is differentiable with respect to , and its density may be intractable. For operators that do not require the density of , it can be used as a powerful variational approximation. This is in contrast to the standard \glsKL operator.
As an example, consider the following variational program for a onedimensional random variable. Let denote the th dimension of and make the corresponding definition for :
(10) 
When outputs positive values, this separates the parametrization of the density to the positive and negative halves of the reals; its density is generally intractable. In Section LABEL:*sec:experiments, we will use this distribution as a variational approximation.
Eq.LABEL:*eq:reparamvarprogram contains many densities when the function class can approximate arbitrary continuous functions. We state it formally.
Theorem 1.
Consider a posterior distribution with a finite number of latent variables and continuous quantile function. Assume the operator variational objective has a unique root at the posterior and that can approximate continuous functions. Then there exists a sequence of parameters in the variational program, such that the operator variational objective converges to , and thus converges in distribution to .
This theorem says that we can use variational programs with an appropriate independent operator to approximate continuous distributions. The proof is in Appendix D.
4 Empirical Study
\glsresetallWe evaluate operator variational inference on a mixture of Gaussians, comparing different choices in the objective. We then study logistic factor analysis for images.
4.1 Mixture of Gaussians
Consider a onedimensional mixture of Gaussians as the posterior of interest, . The posterior contains multiple modes. We seek to approximate it with three variational objectives: \glsKL with a Gaussian approximating family, \glsLS with a Gaussian approximating family, and \glsLS with a variational program.
Figure LABEL:*fig:mog displays the posterior approximations. We find that the \glsKL divergence and \glsLS divergence choose a single mode and have slightly different variances. These operators do not produce good results because a single Gaussian is a poor approximation to the mixture. The remaining distribution in Figure LABEL:*fig:mog comes from the toy variational program described by Eq.LABEL:*eq:toy_var_prog with the \glsLS operator. Because this program captures different distributions for the positive and negative half of the real line, it is able to capture the posterior.
In general, the choice of an objective balances statistical and computational properties of variational inference. We highlight one tradeoff: the \glsLS objective admits the use of a variational program; however, the objective is more difficult to optimize than the \glsKL.
4.2 Logistic Factor Analysis
Logistic factor analysis models binary vectors with a matrix of parameters and biases ,
where has fixed dimension and is the sigmoid function. This model captures correlations of the entries in through .
We apply logistic factor analysis to analyze the binarized MNIST data set [26], which contains 28x28 binary pixel images of handwritten digits. (We set the latent dimensionality to 10.) We fix the model parameters to those learned with variational expectationmaximization using the \glsKL divergence, and focus on comparing posterior inferences.
We compare the \glsKL operator to the \glsLS operator and study two choices of variational models: a fully factorized Gaussian distribution and a variational program. The variational program generates samples by transforming a dimensional standard normal input with a twolayer neural network, using rectified linear activation functions and a hidden size of twice the latent dimensionality. Formally, the variational program we use generates samples of as follows:
The variational parameters are the weights and biases . For , we use a threelayer neural network with the same hidden size as the variational program and hyperbolic tangent activations where unit activations were bounded to have norm two. Bounding the unit norm bounds the divergence. We used the Adam optimizer [13] with learning rates for and for the variational approximation.
Inference method  Completed data loglikelihood 

Meanfield Gaussian + \glsKL  59.3 
Meanfield Gaussian + \glsLS  75.3 
Variational Program + \glsLS  58.9 
There is no standard for evaluating generative models and their inference algorithms [27]. Following Rezende et al. [24], we consider a missing data problem. We remove half of the pixels in the test set (at random) and reconstruct them from a fitted posterior predictive distribution. Table LABEL:*table:mnist summarizes the results on 100 test images; we report the loglikelihood of the completed image. \glsLS with the variational program performs best. It is followed by \glsKL and the simpler \glsLS inference. The \glsLS performs better than \glsKL even though the model parameters were learned with \glsKL.
5 Summary
We present operator variational objectives, a broad yet tractable class of optimization problems for approximating posterior distributions. Operator objectives are built from an operator, a family of test functions, and a distance function. We outline the connection between operator objectives and existing divergences such as the KL divergence, and develop a new variational objective using the LangevinStein operator. In general, operator objectives produce new ways of posing variational inference.
Given an operator objective, we develop a black box algorithm for optimizing it and show which operators allow scalable optimization through data subsampling. Further, unlike the popular evidence lower bound, not all operators explicitly depend on the approximating density. This permits flexible approximating families, called variational programs, where the distributional form is not tractable. We demonstrate this approach on a mixture model and a factor model of images.
There are several possible avenues for future directions such as developing new variational objectives, adversarially learning [6] model parameters with operators, and learning model parameters with operator variational objectives.
Acknowledgments. This work is supported by NSF IIS1247664, ONR N000141110651, DARPA FA87501420009, DARPA N6600115C4032, Adobe, NSERC PGSD, Porter Ogden Jacobus Fellowship, Seibel Foundation, and the Sloan Foundation. The authors would like to thank Dawen Liang, Ben Poole, Stephan Mandt, Kevin Murphy, Christian Naesseth, and the anonymous reviews for their helpful feedback and comments.
References
Appendix A Technical Conditions for LangevinStein Operators
Here we establish the conditions needed on the function class or the posterior distribution shorthanded for the operators to have expectation zero for all . W derive properties using integration by parts for supports that are bounded open sets. Then we extend the result to unbounded supports using limits. We start with the LangevinStein operator. Let be the set over which we integrate and let be its boundary. Let be the unit normal to the surface , and be the th component of the surface normal (which is dimensional). Then we have that
A sufficient condition for this expectation to be zero is that either goes to zero at its boundary or that the vector field is zero at the boundary.
For unbounded sets, the result can be written as a limit for a sequence of increasing sets and a set of boundaries using the dominated convergence theorem [4]. To use dominated convergence, we establish absolute integrability. Sufficient conditions for absolute integrability of the LangevinStein operator are for the gradient of to be bounded and the vector field and its derivatives to be bounded. Via dominated convergence, we get that for the LangevinStein operator to have expectation zero.
Appendix B Characterizing the zeros of the LangevinStein Operators
We provide analysis on how to characterize the equivalence class of distributions defined as . One general condition for equality in distribution comes from equality in probability on all Borel sets. We can build functions that have expectation zero with respect to the posterior that test this equality. Formally, for any Borel set with being the indicator, these functions on have the form:
We show that if the LangevinStein operator satisfies , then is equivalent to in distribution. We do this by showing the above functions are in the span of . Expanding the LangevinStein operator we have
Setting this equal to the desired function above yields the differential equation
To solve this, set for all but . This yields
which is an ordinary differential equation with solution for
This function is differentiable with respect to , so this gives the desired result. Plugging the function back into the operator variational objective gives
for all Borel measurable . This implies the induced distance captures total variation.
Appendix C Operators for Discrete Variables
Some operators based on Stein’s method are applicable only for latent variables in a continuous space. There are Stein operators that work with discrete variables [1, 15]. We present one amenable to operator variational objectives based on a discrete analogue to the LangevinStein operator developed in [15]. For simplicity, consider a onedimensional discrete posterior with support . Let be a function such that , then an operator can be defined as
Since the expectation of this operator with respect to the posterior is a telescoping sum with both endpoints , it has expectation zero.
This relates to the LangevinStein operator in the following. The LangevinStein operator in one dimension can be written as
This operator is the discrete analogue as the differential is replaced by a discrete difference. We can extend this operator to multiple dimensions by an ordered indexing. For example, binary numbers of length would work for binary latent variables.
Appendix D Proof of Universal Representations
Consider the optimal form of such that transformations of standard normal draws are equal in distribution to exact draws from the posterior. This means
where squashes the draw from a standard normal such that it is equal in distribution to a uniform random variable. The posterior’s inverse cumulative distribution function is applied to the uniform draws. The transformed samples are now equivalent to exact samples from the posterior. For a richenough parameterization of , we may hope to sufficiently approximate this function.
Indeed, as in the universal approximation theorem of Tran et al. [29] there exists a sequence of parameters such that the operator variational objective goes to zero, but the function class is no longer limited to local interpolation. Universal approximators like neural networks [9] also work. Further, under the assumption that is the unique root and by satisfying the conditions described in Section LABEL:*sec:optimal_operator for equality in distribution, this implies that the variational program given by drawing and applying converges in distribution to .
Footnotes
 It is possible to construct rich approximating families with , but this requires the introduction of an auxiliary distribution [17].
References
 Assaraf, R. and Caffarel, M. (1999). Zerovariance principle for monte carlo algorithms. In Phys. Rev. Let.
 Barbour, A. D. (1988). Stein’s method and poisson process convergence. Journal of Applied Probability.
 Carpenter, B., Hoffman, M. D., Brubaker, M., Lee, D., Li, P., and Betancourt, M. (2015). The Stan Math Library: Reversemode automatic differentiation in C++. arXiv preprint arXiv:1509.07164.
 Cinlar, E. (2011). Probability and Stochastics. Springer.
 Ghahramani, Z. and Beal, M. (2001). Propagation algorithms for variational Bayesian learning. In NIPS 13, pages 507–513.
 Goodfellow, I., PougetAbadie, J., Mirza, M., Xu, B., WardeFarley, D., Ozair, S., Courville, A., and Bengio, Y. (2014). Generative adversarial nets. In Neural Information Processing Systems.
 HernándezLobato, J. M., Li, Y., Rowland, M., HernándezLobato, D., Bui, T., and Turner, R. E. (2015). Blackbox divergence Minimization. arXiv.org.
 Hoffman, M., Blei, D., Wang, C., and Paisley, J. (2013). Stochastic variational inference. Journal of Machine Learning Research, 14(1303–1347).
 Hornik, K., Stinchcombe, M., and White, H. (1989). Multilayer feedforward networks are universal approximators. Neural networks, 2(5):359–366.
 Hyvärinen, A. (2005). Estimation of nonnormalized statistical models by score matching. Journal of Machine Learning Research, 6(Apr):695–709.
 Jordan, M., Ghahramani, Z., Jaakkola, T., and Saul, L. (1999). Introduction to variational methods for graphical models. Machine Learning, 37:183–233.
 Kingma, D. and Welling, M. (2014). Autoencoding variational bayes. In (ICLR).
 Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. In International Conference on Learning Representations.
 Kushner, H. and Yin, G. (1997). Stochastic Approximation Algorithms and Applications. Springer New York.
 Ley, C. and Swan, Y. (2011). Discrete Stein characterizations and discrete information distances. arXiv preprint arXiv:1201.0143.
 Li, Y. and Turner, R. E. (2016). Rényi divergence variational inference. arXiv preprint arXiv:1602.02311.
 Maaløe, L., Sønderby, C. K., Sønderby, S. K., and Winther, O. (2016). Auxiliary deep generative models. arXiv preprint arXiv:1602.05473.
 Minka, T. P. (2001). Expectation propagation for approximate Bayesian inference. In UAI.
 Minka, T. P. (2004). Power EP. Technical report, Microsoft Research, Cambridge.
 Nielsen, F. and Nock, R. (2013). On the chi square and higherorder chi distances for approximating fdivergences. arXiv preprint arXiv:1309.3029.
 Ranganath, R., Gerrish, S., and Blei, D. (2014). Black Box Variational Inference. In AISTATS.
 Ranganath, R., Tran, D., and Blei, D. M. (2016). Hierarchical variational models. In International Conference on Machine Learning.
 Rezende, D. J. and Mohamed, S. (2015). Variational inference with normalizing flows. In Proceedings of the 31st International Conference on Machine Learning (ICML15).
 Rezende, D. J., Mohamed, S., and Wierstra, D. (2014). Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning.
 Robbins, H. and Monro, S. (1951). A stochastic approximation method. The Annals of Mathematical Statistics, 22(3):pp. 400–407.
 Salakhutdinov, R. and Murray, I. (2008). On the quantitative analysis of deep belief networks. In International Conference on Machine Learning.
 Theis, L., van den Oord, A., and Bethge, M. (2016). A note on the evaluation of generative models. In International Conference on Learning Representations.
 Titsias, M. and LázaroGredilla, M. (2014). Doubly stochastic variational bayes for nonconjugate inference. In Proceedings of the 31st International Conference on Machine Learning (ICML14), pages 1971–1979.
 Tran, D., Ranganath, R., and Blei, D. M. (2016). The variational Gaussian process. In ICLR.
 Wingate, D. and Weber, T. (2013). Automated variational inference in probabilistic programming. ArXiv eprints.