Boosting Black Box Variational Inference
Abstract
Approximating a probability density in a tractable manner is a central task in Bayesian statistics. Variational Inference (VI) is a popular technique that achieves tractability by choosing a relatively simple variational family. Borrowing ideas from the classic boosting framework, recent approaches attempt to boost VI by replacing the selection of a single density with a greedily constructed mixture of densities. In order to guarantee convergence, previous works impose stringent assumptions that require significant effort for practitioners. Specifically, they require a custom implementation of the greedy step (called the LMO) for every probabilistic model with respect to an unnatural variational family of truncated distributions. Our work fixes these issues with novel theoretical and algorithmic insights. On the theoretical side, we show that boosting VI satisfies a relaxed smoothness assumption which is sufficient for the convergence of the functional FrankWolfe (FW) algorithm. Furthermore, we rephrase the LMO problem and propose to maximize the Residual ELBO (RELBO) which replaces the standard ELBO optimization in VI. These theoretical enhancements allow for black box implementation of the boosting subroutine. Finally, we present a stopping criterion drawn from the duality gap in the classic FW analyses and exhaustive experiments to illustrate the usefulness of our theoretical and algorithmic contributions.
noitemsep,topsep=0pt,parsep=2pt,partopsep=0pt
1 Introduction
Approximating probability densities is a core problem in Bayesian statistics, where inference translates to the computation of a posterior distribution. Posterior distributions depend on the modeling assumptions and can rarely be computed exactly. Variational Inference (VI) is a technique to approximate posterior distributions through optimization. It involves choosing a set of tractable densities, a.k.a. variational family, out of which the final approximation is to be chosen. The approximation is done by selecting a density in the candidate set that is close to the true posterior in terms of KullbackLeibler (KL) divergence [1]. There is an inherent tradeoff involved in fixing the set of tractable densities. Increasing the capacity of the variational family to approximate the posterior also increases the complexity of the optimization problem. Consider a degenerate case where the variational family contains just a single density. The optimization problem is trivial and runs in constant time, but the quality of the solution is poor and stands in no relation to the true posterior. This contrived example is clearly too restrictive, and in practice, the mean field approximation offers a good tradeoff between expressivity and tractability [2]. However, in many realworld applications, mean field approximations are lacking in their ability to accurately approximate the posterior.
Imagine a practitioner that, after designing a Bayesian model and using a VI algorithm to approximate the posterior, finds that the approximation is too poor to be useful. Standard VI does not give the practitioner the option to trade additional computational cost for a better approximation. As a result, there have been several efforts to ramp up the representative capacity of the variational family while maintaining tractability.
One line of work in this direction involves replacing the simple mean field family by a mixture of Gaussians. It is known that mixtures of Gaussians can approximate any distribution arbitrarily closely [17]. Boosting is a practical approach to finding the optimal approximating mixture and involves adding components to the mixture greedily one at a time [4, 12, 15]. Not only is boosting a practical solution, it also has wellstudied tradeoff bounds for number of iterations vs. approximation quality [12] by virtue of being essentially a variant of the classical FrankWolfe (FW) algorithm [7, 12]. Unfortunately, these greedy algorithms require a specialized, restricted variational family to ensure convergence and therefore a white box implementation of the boosting subroutine. These restrictions include that (a) each potential component of the mixture has a bounded support i.e., truncated densities, and (b) the subroutine should not return degenerate distributions. These assumptions require specialized care during implementation, and therefore, one cannot simply take existing VI solvers and boost them. This makes boosting VI unattractive for practitioners. In this work, we fix this issue by proposing a boosting black box VI algorithm that has many practical benefits.
Our work presents several key algorithmic and theoretical contributions, which we summarize below:

We relax the previously known conditions for guaranteed convergence from smoothness to bounded curvature. As a consequence, the set of candidate densities no longer needs to be truncated, thereby easing its implementation and improving on the convergence guarantees.

We propose a modified form of the ELBO optimization, the Residual ELBO (RELBO), which guarantees that the selected density is nondegenerate and is suitable for black box solvers.

We propose a novel stopping criterion using the duality gap from FW, which is applicable to any boosting VI algorithm.
In addition to these theoretical contributions, we present extensive simulated and realworld empirical experiments to show the applicability of our proposed algorithmic extensions.
While our work is motivated by applications to VI, our theoretical results have a more general impact. We essentially analyze the application of the functional FW algorithm to the general task of minimizing the KLdivergence over a space of probability densities.
1.1 Related work
There is a vast literature on VI. We refer to [1] for a thorough review. Our focus is to use boosting to increase the complexity of a density, similar to the goal of Normalizing Flows [18], MCMCVI hybrid methods [20, 19], or distribution transformations [21]. Our approach is in line with several previous approaches using mixtures of distributions to improve the expressiveness of the variational approximation [8, 5] but goes further to draw connections with classic optimization to obtain several novel theoretical and algorithmic insights.
While boosting has been well studied in other settings [14], it has only recently been applied to the problem of VI. Related works of [4] and [15] developed the algorithmic framework and conjectured a possible convergence rate of but without theoretical analyses. The authors in [4] identify the need of truncated densities to ensure smoothness of the KL cost function. A more recent work [12] provides a theoretical base for analyzing the algorithm. They identify the sufficient conditions for guaranteeing convergence and provide explicit constants to the conjectured rate. Unfortunately, these sufficient conditions amount to restrictive assumptions about the variational family and therefore require the practitioner to have white box access to the variational family and underlying VI algorithm. In this work, we remove these assumptions to allow use of black box VI methods.
Our analysis is mainly based on the FW algorithm [7], which is a commonly used algorithm for projection free optimization. The convergence rates and requisite assumptions are well studied in various settings [11, 10, 13]. Its applications include nonEuclidean spaces, e.g., a variational objective for approximate marginal inference over the marginal polytope [9].
The rest of the paper is organized as follows. We begin by introducing and formalizing the boosting VI framework in Section 2. In Section 3, we review and analyze the Functional FW algorithm to greedily solve the boosting VI. In Section 4, we first propose RELBO, an alternative of the contemporary ELBO optimization to implement a black box LMO (linear minimization oracle). Then, we propose a duality gap based stopping criterion for boosting VI algorithms. Finally, experimental evaluation is presented in Section 5. We refer the reader to the appendix for all proofs.
Notation. We represent vectors by small letters bold, (e.g. ) and matrices by capital bold, (e.g. ). For a nonempty subset of some Hilbert space , let denote its convex hull. is often called atom set in the literature, and its elements are called atoms. The support of a density function is a measurable set denoted by capital letters sans serif i.e., . The inner product between two density functions in is defined as .
2 Variational inference and boosting
Bayesian inference involves computing the posterior distribution given a model and the data. More formally, we choose a distribution for our observations given unobserved latent variables , called the likelihood and a prior distribution over the latent variables . Our goal is to infer the posterior, [1]. Bayes theorem relates these three distributions by expressing the posterior as equal to the product of prior and likelihood divided by the normalization constant, . The posterior is often intractable because the normalization constant requires integrating over the full latent variable space.
The goal of VI is to find a tractable approximation of . From an optimization viewpoint, one can think of the posterior as an unknown function where is a measurable set. The task of VI is to find the best approximation, in terms of KL divergence, to this unknown function within a family of tractable distributions . Therefore, VI can be written as the following optimization problem:
(1) 
It should be obvious that the quality of the approximation directly depends on the expressivity of the family . However, as we increase the complexity of , the optimization problem (1) also becomes more complex.
Rather than optimizing the objective (1) which requires access to an unknown function and is therefore not computable, VI equivalently maximizes instead the socalled Evidence Lower BOund (ELBO) [1]:
(2) 
A recent line of work [4, 15, 12] aims at replacing with thereby expanding the capacity of the variational approximation to the class of mixtures of the base family :
(3) 
The boosting approach to this problem consists of specifying an iterative procedure, in which the problem (3) is solved via the greedy combination of solutions from simpler surrogate problems. This approach was first proposed in [4], and its connection to the FW algorithm was studied in [12]. Contrary to the boosting approaches in the deep generative modeling literature initiated by [22], boosting VI does not enjoy a simple and elegant subproblem as we discuss in Section 3.1. Next, we show how to tackle (3) from a formal and yet very practical optimization perspective.
3 Functional FrankWolfe for boosting variational inference
Taking a step back from the problem (3), we first define the general optimization problem and the relevant quantities needed for proving the convergence of FW. Then, we present the extension to boosting black box VI.
We start with the general optimization problem:
(4) 
where is closed and bounded and is a convex functional with bounded curvature over its domain. Here the curvature is defined as in [7]:
(5) 
where
It is known that if is smooth over , then where [7].
The FW algorithm is depicted in Algorithm 1. Note that Algorithm 2 in [4] is a variant of Algorithm 1. In each iteration, the FW algorithm queries a socalled Linear Minimization Oracle (LMO) which solves the following inner optimization problem:
(6) 
for a given and . Depending on , computing an exact solution of (6) can be hard in practice. This motivates the approximate LMO which returns an approximate minimizer of (6) for some accuracy parameter and the current iterate such that:
(7) 
We discuss a simple algorithm to implement the LMO in Section 4. Finally, we find that Algorithm 1 is known to converge sublinearly to the minimizer of (4) with the following rate:
Theorem 1 ([7]).
Let be a closed and bounded set and let be a convex function with bounded curvature over . Then, the Affine Invariant FW algorithm (Algorithm 1) converges for as
where is the initial error in objective, and is the accuracy parameter of the approximate LMO.
Discussion. Theorem 1 has several implications for boosting VI. First, the LMO problem does not need to be solved exactly in order to guarantee convergence. Second, Theorem 1 guarantees that Algorithm 1 converges to the best approximation in which, depending on the expressivity of the base family, could even contain the full posterior. Furthermore, the theorem gives a convergence rate which states that, in order to achieve an error of , we need to perform iterations.
Similar discussions are also presented by [12]. The crucial question, which they leave unaddressed, is whether under their assumptions there exists a variational family of densities which (a) is expressive enough to wellapproximate the posterior; (b) satisfies the conditions required to guarantee convergence; and (c) allows for efficient implementation of the LMO.
3.1 Curvature of boosting variational inference
In order to boost VI using FW in practice, we need to ensure that the assumptions are not violated. Assume that is the set of probability density functions with compact parameter space as well as bounded infinity norm and norm. These assumptions on the search space are easily justified since it is reasonable to assume that the posterior is not degenerate (bounded infinity norm) and has modes that are not arbitrarily far away from each other (compactness). Under these assumptions, the optimization domain is closed and bounded. It is simple to show that the solution of the LMO problem over is an element of . Therefore, is closed. The troublesome condition that needs to be satisfied for the convergence of FW is smoothness. Indeed, the work of [4] already recognized that the boosting literature typically require a smooth objective and showed that densities bounded away from zero are sufficient. [12] showed that the necessary condition to achieve smoothness is that the approximation be not arbitrarily far from the optimum. They argue that while this is a practical assumption, the general theory would require truncated densities. We relax this assumption. As per Theorem 1, a bounded curvature is actually sufficient to guarantee convergence. This condition is weaker than smoothness, which was assumed by [12, 4]. For the KL divergence, the following holds.
Theorem 2.
is bounded for the KL divergence if the parameter space of the densities in is bounded.
The proof is provided in Appendix A.
Discussion. Surprisingly, a bounded curvature for the can be obtained as long as:
is bounded. The proof sketch proceeds as follows. For any pair and , we need to check that is bounded as a function of . The two limit points, for and for , are both bounded for any choice of and . Hence, the is bounded as it is a continuous function of in with bounded function values at the extreme points. is bounded because the parameter space is bounded. is bounded by the triangle inequality and bounded norm of the elements of . This result is particularly relevant, as it makes the complicated truncation described in [12] unnecessary without any additional assumption. Indeed, while a bounded parameter space was already assumed in [12] and is a practical assumption, truncation is tedious to implement. Note that [4] considers full densities as an approximation of the truncated one. They also argue that the theoretically grounded family of distributions for boosting should contain truncated densities. Avoiding truncation has another very important consequence for the optimization. Indeed, [12] proves convergence of boosting VI only to a truncated version of the posterior. Therefore, Theorem 8 in [12] contains a term that does not decrease when the number of iteration increases. While this term could be small, as it contains the error of the approximation on the tails, it introduces a crucial hyperparameter in the algorithm i.e., where to truncate. Instead, we here show that under much weaker assumptions on the set , it is possible to converge to the true posterior.
4 The residual ELBO: implementing a black box LMO
Note that the LMO is a constrained linear problem in a function space. A complicated heuristic is developed in [4] to deal with the fact that the unconstrained linear problem they consider has a degenerate solution. The authors of [12] propose to use projected stochastic gradient descent on the parameters of . The problem with this is that, to the best of our knowledge, the convergence of projected stochastic gradient descent is not yet understood in this setting. To guarantee convergence of the FW procedure, it is crucial to make sure that the solution of the LMO is not a degenerate distribution. This translates to a constraint on the infinity norm of . Such a constraint is hardly practical. Indeed, one must be able to compute the maximum value of as a function of its parameters which depends on the particular choice of . In contrast, the entropy is a general term that can be approximated via sampling and therefore allows for black box computation. We relate infinity norm and entropy in the following lemma.
Lemma 3.
A density with bounded infinity norm has entropy bounded from below. The converse is true for many of the distributions which are commonly used in VI (for example Gaussian, Cauchy and Laplace).
The proof is provided in Appendix A.
In general, bounded entropy does not always imply bounded infinity norm. While this is precisely the statement we need, a simple verification is sufficient to show that it holds in most cases of interest. We assume that is a family for which bounded entropy implies bounded infinity norm. Therefore, we can constrain the optimization problem with the entropy instead of the infinity norm. We call the family without the infinity norm constraint. At every iteration, we need to solve:
Note that the constraint on the entropy is crucial here, otherwise the solution of the LMO would be a degenerate distribution as also argued in [4].
We now replace this problem with its regularized form using Lagrange multipliers and solve for given a fixed value of :
(8)  
Therefore, the regularized LMO problem is equivalent to the following minimization problem:
where is the normalization constant of . From this optimization problem, we can write what we call the Residual Evidence Lower Bound (RELBO) as:
(9) 
Discussion. Let us now analyze the RELBO and compare it with the ELBO in standard VI [1]. First, note that we introduce the hyperparameter which controls the weight of the entropy. In order to obtain the true LMO solution, one would need to maximize the LHS of Equation (8) for and solve the saddle point problem. In light of the fact that an approximate solution is sufficient for convergence, we consider the regularized problem as a simple heuristic. One can then fix an arbitrary value for or decrease it when increases. The latter amounts to allowing increasingly sharp densities as optimization proceeds. The other important difference between ELBO and RELBO is the residual term which is expressed through . Maximizing this term amounts to looking for a density with low cross entropy with the joint and high cross entropy with the current iterate . In other words, the next component needs to be as close as possible to the target but also sufficiently different from the current approximation . Indeed, should capture the aspects of the posterior that the current mixture could not approximate yet.
Failure Modes. Using a black box VI as an implementation for the LMO represents an attractive practical solution. Indeed, one could just run VI once and, if the result is not good enough, run it again on the residual without changing the structure of the implementation. Unfortunately, there are two failure modes that should be discussed. First, if the target posterior is a perfectly symmetric multimodal distribution, then the residual is also symmetric and the algorithm may get stuck. A simple solution to this problem is to run the black box VI for fewer iterations, breaking the symmetry of the residual. The second problem arises in scenarios where the posterior distribution can be approximated well by a single element of . In such cases, most of the residual will be on the tails. The algorithm will then fit the tails and in the following iterations relearn a distribution close to . As a consequence, it is important to identify good solutions before investing additional computational effort by adding more components to the mixture. Note that the ELBO cannot be used for this purpose, as its value at the maximum is unknown.
Stopping criterion. We propose a stopping criterion for boosting VI which allows us to identify when a reasonably good approximation is reached and save computational effort. To this end, we rephrase the notion of duality gap [7, 6] in the context of boosting VI, which gives a surprisingly simple stopping criterion for the algorithm.
Lemma 4.
The duality gap computed at some iterate is an upper bound on the primal error .
The proof is provided in Appendix A.
Note that the is precisely the LMO solution to the problem (6). Therefore, with an exact LMO, one obtains a certificate on the primal error for free, without knowing the value of . It is possible to show that a convergence rate similar to Theorem 1 also holds for the duality gap [7]. If the oracle is inexact, the estimate of the duality gap satisfies that , as a consequence of (7).
5 Experimental evaluation
Notably, our VI algorithm is black box in the sense that it leaves the definition of the model and the choice of variational family up to the user. Therefore, we are able to reuse the same boosting black box VI solver to run all our experiments, and more generally, any probabilistic model and choice of variational family. We chose to implement our algorithm as an extension to the Edward probabilistic programming framework [23] thereby enabling users to apply boosting VI to any probabilistic model and variational family which are definable in Edward. In Appendix B, we show a code sample of our implementation of Bayesian logistic regression.
For comparisons to baseline VI, we use Edward’s builtin black box VI (BBVI) algorithm without modification. We run these baseline VI experiments for 10,000 iterations which is orders of magnitude more than what is required for convergence. Unless otherwise noted, we use Gaussians as our base family. Note that FW with fixed step size is not monotonic and so in the experiments which use fixed step size, it is not surprising that the last iteration is not the best one. We use the training loglikelihood to select the best iteration and we used the duality gap as a diagnostic tool in the implementation to understand the impact of . We found that worked well in all the experiments.
5.1 Synthetic data
First, we use synthetic data to visualize the approximation of our algorithm of a bimodal posterior. In particular, we consider a mixture of two Gaussians with parameters , , and mixing weights .
We performed experiments with all three variants of FW described in Algorithm 1. For the fully corrective variant, we used FW to solve the subproblem of finding the optimal weights for the current atom set. We observe that unlike BBVI, all three variants are able to fit both modes of the bimodal target distribution. The fully corrective version gives the best fit. Unfortunately, this improved solution comes at a computational cost — solving the line search and fully corrective subproblems is dramatically slower than the fixed step size variant. In the experiments that follow we were able to improve upon the initial VI solution using the simple fixed step size. We believe this is the most interesting variant for practitioners as it does not require any additional implementation other than the VI subroutine. Our synthetic data results are summarized in Figure 1.
5.2 Bayesian logistic regression on two realworld datasets
In this experiment, we consider two realworld binaryclassification tasks: predicting the reactivity of a chemical and predicting mortality in the intensive case unit (ICU). For both tasks we use the Bayesian logistic regression model. This allows us to compare to previous work in [12]. Bayesian logistic regression is a conditional prediction model with prior on the weights and conditional likelihood . This model is commonly used as an example of a simple model which does not have a closed form posterior. [1]
We use the ChemReact dataset which contains 26,733 chemicals, each with 100 features. For this experiment, we ran our algorithm for 35 iterations and found that iteration 17 had the best performance. We observe that running merely one single welltuned iteration of BBVI as implemented in the Edward framework using Gaussian as the variational class outperforms 10 iterations of boosting VI in [12]. While BBVI already has good performance in terms of AUROC, we are able to improve it further by using the fixed step size variant of FW and the Laplace distributions as the base family. In addition, our solution is more stable, namely it has lower standard deviation across replications. Results are summarized in Table 1.
For the mortality prediction task, we selected patient stays between 1 and 30 days and a subset of relevant features from the eICU Collaborative Research database [3]. We removed patients with missing values for one or more features resulting in a dataset with 71,366 patient stays and 70 relevant features ranging from age and gender to lab test results. We then performed a 7030% traintest split. We ran our algorithm for 29 iterations and again found that iteration 17 had the best performance. We observed that our method improves upon the AUROC of Edward’s baseline VI solution and is also more stable. Results are summarized in Table 2.
5.3 Bayesian matrix factorization
Bayesian Matrix Factorization [16] is a more complex model defined in terms of two latent variables, and for some choice of the latent dimension . Each entry of and has prior distribution . The observations matrix, , is assumed to be distributed as . The th iterate of our boosting BBVI, , is distributed where and are the approximations of and returned by the LMO at iteration .
We use the CBCL Face^{2}^{2}2http://cbcl.mit.edu/softwaredatasets/FaceData2.html dataset which is composed of 2,492 images of 361 pixels each, arranged into a matrix. Using a 50% mask for the traintest split, we performed matrix completion on this data using the above model. We compared our boosting BBVI to BBVI for three choices of the latent dimension and observe improvements in both meansquared error and heldout test loglikelihood for all three. Similar to the results for Bayesian linear regression, we observe that the variance across replications is also smaller for our method. Surprisingly, increasing does not have a significant effect on either of the metrics which may indicate that a relatively inexpressive model already contains a good approximation. Results are summarized in Table 3.
6 Conclusion
We have presented a refined yet practical theoretical analysis for the boosting variational inference paradigm. Our approach incorporates black box VI solvers into a general gradient boosting framework based on the FrankWolfe algorithm. Furthermore, we introduced a subroutine which is finally attractive to practitioners as it does not require any additional overhead beyond running a general black box VI solver multiple times. This is an important step forward in adding boosting VI to the standard toolbox of Bayesian inference.
Acknowledgements. FL is partially supported by the MaxPlanck ETH Center for Learning Systems. FL, GD are partially supported by an ETH core grant (to GR). RK is supported by NSF Grant IIS 1421729.
References
 [1] David M Blei, Alp Kucukelbir, and Jon D McAuliffe. Variational inference: A review for statisticians. arXiv preprint arXiv:1601.00670, 2016.
 [2] M Bishop Christopher. Pattern Recognition and Machine Learning. SpringerVerlag New York, 2016.
 [3] Ary L. Goldberger, Luis A. N. Amaral, Leon Glass, Jeffrey M. Hausdorff, Plamen Ch. Ivanov, Roger G. Mark, Joseph E. Mietus, George B. Moody, ChungKang Peng, and H. Eugene Stanley. Physiobank, physiotoolkit, and physionet. Circulation, 101(23):e215–e220, 2000.
 [4] Fangjian Guo, Xiangyu Wang, Kai Fan, Tamara Broderick, and David B Dunson. Boosting variational inference. arXiv preprint arXiv:1611.05559, 2016.
 [5] Tommi S. Jaakkola and Michael I. Jordan. Improving the mean field approximation via the use of mixture distributions, 1998.
 [6] Martin Jaggi. Convex Optimization without Projection Steps. arXiv math.OC, August 2011.
 [7] Martin Jaggi. Revisiting FrankWolfe: ProjectionFree Sparse Convex Optimization. In ICML 2013  Proceedings of the 30th International Conference on Machine Learning, 2013.
 [8] Ghassen Jerfel. Boosted stochastic backpropagation for variational inference, 2017.
 [9] Rahul G. Krishnan, Simon LacosteJulien, and David Sontag. Barrier frankwolfe for marginal inference. pages 532–540, 2015.
 [10] Simon LacosteJulien. Convergence rate of frankwolfe for nonconvex objectives. arXiv preprint arXiv:1607.00345, 2016.
 [11] Simon LacosteJulien and Martin Jaggi. On the Global Linear Convergence of FrankWolfe Optimization Variants. In NIPS 2015, pages 496–504, 2015.
 [12] Francesco Locatello, Rajiv Khanna, Joydeep Ghosh, and Gunnar Rätsch. Boosting variational inference: an optimization perspective. In Proceedings of the 21st International Conference on Artificial Intelligence and Statistics (AISTATS 2018), volume 84 of Proceedings of Machine Learning Research, pages 464–472. PMLR, 2018.
 [13] Francesco Locatello, Rajiv Khanna, Michael Tschannen, and Martin Jaggi. A unified optimization view on generalized matching pursuit and frankwolfe. In Proc. International Conference on Artificial Intelligence and Statistics (AISTATS), 2017.
 [14] Ron Meir and Gunnar Rätsch. An introduction to boosting and leveraging. In Advanced lectures on machine learning, pages 118–183. Springer, 2003.
 [15] Andrew C Miller, Nicholas Foti, and Ryan P Adams. Variational boosting: Iteratively refining posterior approximations. arXiv preprint arXiv:1611.06585, 2016.
 [16] Andriy Mnih and Ruslan R Salakhutdinov. Probabilistic matrix factorization. In Advances in neural information processing systems, pages 1257–1264, 2008.
 [17] Emanuel Parzen. On estimation of a probability density function and mode. The Annals of Mathematical Statistics, 33:pp. 1065–1076, 1962.
 [18] Danilo Jimenez Rezende and Shakir Mohamed. Variational inference with normalizing flows. In Francis R. Bach and David M. Blei, editors, ICML, volume 37 of JMLR Workshop and Conference Proceedings, pages 1530–1538. JMLR.org, 2015.
 [19] Ardavan Saeedi, Tejas D. Kulkarni, Vikash K. Mansinghka, and Samuel J. Gershman. Variational particle approximations. Journal of Machine Learning Research, 18(69):1–29, 2017.
 [20] Tim Salimans, Diederik P. Kingma, and Max Welling. Markov chain monte carlo and variational inference: Bridging the gap. In ICML, 2015.
 [21] Siddhartha Saxena, Shibhansh Dohare, and Jaivardhan Kapoor. Variational inference via transformations on distributions. CoRR, abs/1707.02510, 2017.
 [22] Ilya Tolstikhin, Sylvain Gelly, Olivier Bousquet, CarlJohann SimonGabriel, and Bernhard Schölkopf. Adagan: Boosting generative models. arXiv preprint arXiv:1701.02386, 2017.
 [23] Dustin Tran, Alp Kucukelbir, Adji B. Dieng, Maja Rudolph, Dawen Liang, and David M. Blei. Edward: A library for probabilistic modeling, inference, and criticism. arXiv preprint arXiv:1610.09787, 2016.
Appendix A Proof of the Main Results
Lemma’ 3.
A density with bounded infinity norm has entropy bounded from below.
Proof.
Assume the infinity norm of a density is bounded from above by a constant . This implies that . Therefore:
Therefore which concludes the proof. ∎
Theorem’ 2.
is bounded for the KL divergence if the parameters space of the densities in is bounded.
Proof.
In order to show that is bounded we then need to show that:
is bounded. For a fixed and we how that is continuous. Since the parameter space is bounded is always bounded for any and so is the , therefore the is continuous for . We only need to show that it also holds for in order to use the result that a continuous function on a bounded domain is bounded. When we have that both and . Therefore we use L’Hospital Rule (H) and obtain:
where for the derivative of the we used the functional chain rule. Again both numerator and denominators in the limit go to zero when , so we use L’Hospital Rule again and obtain:
which is bounded under the assumption of bounded parameters space and bounded infinity norm. Indeed:
Which is bounded under the assumption of bounded norm of the densities in by triangle inequality. ∎
Lemma’ 4.
The duality gap computed at some iterate is an upper bound on the primal error .
Proof.
Let be the gradient of the computed at some . The dual function of the is:
By definition, the gradient is a linear approximation to a function lying below its graph at any point. Therefore, we have that for any :
The duality gap at some point is the defined as the difference between the values of the primal and dual problems:
(10) 
Note that the duality gap is a bound on the primal error as:
(11) 
where the first inequality comes from the fact that the optimum and the second from the convexity of the KL divergence w.r.t. . ∎
Appendix B Code Example
Here we include a Python example of how a practitioner would use our method to do Bayesian logistic regression. Just as in the Edward framework, the user defines their probabilistic model in terms of the prior w, the input data X, and the likelihood of . Then, qt is initialized. This is ultimately the variational approximation to the posterior and what is returned by our algorithm. Finally, for each iteration t, we create a new variable sw to be optimized by the LMO and then run the black box LMO solver. Note that the user need only specify the model and the base family. The optimization is left up to the RELBO black box solver.