f-Divergence Variational Inference
Abstract
This paper introduces the f-divergence variational inference (f-VI) that generalizes variational inference to all f-divergences. Initiated from minimizing a crafty surrogate f-divergence that shares the statistical consistency with the f-divergence, the f-VI framework not only unifies a number of existing VI methods, e.g. Kullback–Leibler VI Jordan_ML_1999 (), Rényi's α-VI Li_NIPS_2016 (), and χ-VI Dieng_NIPS_2017 (), but also offers a standardized toolkit for VI subject to arbitrary divergences from the f-divergence family. A general f-variational bound is derived and provides a sandwich estimate of the marginal likelihood (or evidence). The development of the f-VI unfolds with a stochastic optimization scheme that utilizes the reparameterization trick, importance weighting and Monte Carlo approximation; a mean-field approximation scheme that generalizes the well-known coordinate ascent variational inference (CAVI) is also proposed for f-VI. Empirical examples, including variational autoencoders and Bayesian neural networks, are provided to demonstrate the effectiveness and the wide applicability of f-VI.
University of Illinois at UrbanaChampaign, Urbana, IL 61801
Anker Innovations, Shenzhen, China
1 Introduction
Variational inference (VI) is a machine learning method that makes Bayesian inference computationally efficient and scalable to large datasets. For decades, the dominant paradigm for approximate Bayesian inference has been Markov chain Monte Carlo (MCMC) algorithms, which approximate the posterior via sampling. However, since sampling tends to be a slow and computationally intensive process, these sampling-based approximate inference methods fall short when dealing with modern probabilistic machine learning problems that usually involve very complex models, high-dimensional feature spaces and large datasets. In these instances, VI becomes a good alternative for performing Bayesian inference. The foundation of VI is primarily optimization rather than sampling. To perform VI, we posit a family of approximate (or recognition) densities and find the member $q(z)$ that minimizes the statistical divergence to the true posterior $p(z|x)$. Meanwhile, since VI also has many elegant and favorable theoretical properties, e.g. variational bounds of the true evidence, it has become the foundation of many popular generative and machine learning models.
Recent advances in VI can be roughly categorized into three groups: improvements over traditional VI algorithms Knowles_NIPS_2011 (); Wang_JMLR_2013 (), developments of scalable VI methods Hoffman_JMLR_2013 (); Kingma_ICLR_2014 (); Li_NIPS_2015 (), and explorations of tighter variational bounds Burda_ICLR_2016 (); Tao_ICML_2018 (). Comprehensive reviews of VI's background and progression can be found in Blei_JASA_2017 (); Zhang_TPAMI_2019 (). While most of these advancements were built on the classical VI associated with the Kullback–Leibler (KL) divergence, some recent efforts tried to extend the VI framework to other statistical divergences and showed promising results. Among these efforts, Rényi's α-divergence and χ-divergence, as the root divergences (or generators) of the KL divergence, were employed for VI in Li_NIPS_2016 (); Dieng_NIPS_2017 (); Regli_arXiv_2018 (), which not only broadens the variety of statistical divergences for VI, but also makes KL-VI a special case of their methods. Stochastic optimization methods from KL-VI, such as stochastic VI Hoffman_JMLR_2013 () and black-box VI Ranganath_ICAIS_2014 (), were generalized to Rényi's α-VI and χ-VI in Li_NIPS_2016 (); Dieng_NIPS_2017 (), and the modified algorithms with new divergences outperformed the classical KL-VI in some benchmarks of Bayesian regression and image reconstruction. Nevertheless, mean-field approximation, an important class of KL-VI algorithms including the coordinate ascent variational inference (CAVI) and expectation propagation (EP) algorithms Bishop_2006 (); Minka_UAI_2001 (); Blei_JASA_2017 (), was regrettably not revisited or extended to these new divergences.
As the root divergence of Rényi's α-divergence, χ-divergence and many other useful divergences Sason_TIT_2016 (); Sason_Entropy_2018 (), the f-divergence is a more inclusive statistical divergence (family) and was utilized to improve the statistical properties Bamler_NIPS_2017 (); Wang_NIPS_2018 (), sharpness Tao_ICML_2018 (); Zhang_arxiv_2019 (), and surely the generality of variational bounds Tao_ICML_2018 (); Zhang_arxiv_2019 (); Knoblauch_arXiv_2019 (). However, most of these works only dealt with some portions of f-divergences for their favorable statistical properties, e.g. mass-covering Bamler_NIPS_2017 () and tail-adaptive Wang_NIPS_2018 (), and did not develop a systematic VI framework that harbors all f-divergences. Meanwhile, since i) the regular f-divergence does not explicitly induce an f-variational bound as elegant as the ELBO Blei_JASA_2017 (), χ upper bound (CUBO) Dieng_NIPS_2017 (), or Rényi variational bound (RVB) Li_NIPS_2016 (), and ii) only specific choices of f-divergence result in an f-variational bound that trivially depends on the evidence Zhang_TPAMI_2019 (), a thorough and comprehensive analysis of f-divergence VI has been due for a long time.
In this paper, we extend the traditional VI to the f-divergence, a rich family that comprises many well-known divergences as special cases Sason_TIT_2016 (), by offering some new insights into f-divergence VI and a unified f-VI framework that encompasses a number of recent developments in VI methods. An explicit benefit of f-VI is that it allows one to perform VI or Bayesian approximation with an even wider variety of divergences, which can potentially bring sharper variational bounds, more accurate estimates of the true evidence, faster convergence rates, more criteria for selecting the approximate model $q(z)$, etc. We hope this effort can be the last brick to complete the building of f-divergence VI and motivate more useful and efficient VI algorithms in the future. After reviewing the f-divergence and introducing a crafty surrogate f-divergence that is interchangeable with the regular f-divergence, we make the following contributions:

We enrich the f-divergence VI theory by introducing an f-VI scheme via minimizing a surrogate f-divergence, which makes our f-VI framework compatible with the traditional VI approaches and naturally unifies a number of existing VI methods, such as KL-VI Jordan_ML_1999 (), Rényi's α-VI Li_NIPS_2016 (), χ-VI Dieng_NIPS_2017 (), and their related developments Kingma_ICLR_2014 (); Burda_ICLR_2016 (); Wang_NIPS_2018 (); Tao_ICML_2018 (); Li_NIPS_2015 ().

We derive an f-variational bound for the evidence and equip it with upper/lower bound criteria and an importance-weighted (IW) f-bound. The f-variational bound is realized with an arbitrary f-divergence and unifies many existing bounds, such as the ELBO, CUBO, RVB, and a number of generalized evidence bounds (GLBO) Tao_ICML_2018 ().

We propose a universal optimization solution that comprises a stochastic optimization algorithm and a mean-field approximation algorithm for VI subject to all f-divergences, whether or not the variational bounds trivially depend on the evidence. Experiments on Bayesian neural networks and variational autoencoders (VAEs) show that f-VI can be comparable to, or even better than, a number of state-of-the-art variational methods.
2 Preliminary of f-divergence
We first introduce some definitions and properties related to the f-divergence, which are to be adopted in our subsequent exposition.
2.1 f-divergence
An f-divergence, which measures the difference between two continuous probability distributions $p$ and $q$, can be defined as follows Sason_TIT_2016 ().
Definition 1
The f-divergence from probability density function $q(z)$ to $p(z)$ is defined as
(1) $D_f(p\|q) = \int q(z)\, f\!\Big(\frac{p(z)}{q(z)}\Big)\, dz,$
where $f: (0,\infty) \to \mathbb{R}$ is a convex function with $f(1) = 0$.
Definition 1 assumes that $p$ is absolutely continuous w.r.t. $q$, which might not be exhaustive, but avoids unnecessary entanglement with measure-theoretic details; one can refer to Sason_TIT_2016 (); Sason_Entropy_2018 () for a more rigorous treatment. Most prevailing divergences adopted in VI can be regarded as special cases of the f-divergence and hence be restored by choosing a proper function $f$. Table 1 and Sason_TIT_2016 (); Sason_Entropy_2018 (); Zhang_arxiv_2019 () present the relationship between some well-known statistical divergences adopted in VI and their $f$ functions. Intuitively, one can perform VI by minimizing either the forward f-divergence $D_f(p(z|x)\|q(z))$ or the reverse f-divergence $D_f(q(z)\|p(z|x))$, and Murphy_2012 (); Zhang_arxiv_2019 () provide some heuristic discussions on their statistical differences. Since VI based on the reverse KL divergence is more tractable to compute and more statistically sensible, we will develop our f-VI framework primarily based on the reverse f-divergence, while one can still unify or commute between the forward and reverse f-divergences via the dual function $f^*$, which is also referred to as the perspective function or the conjugate symmetry of $f$ in Boyd_2004 (); Sason_TIT_2016 (); Dieng_NIPS_2017 ().
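As a quick numerical sanity check of Definition 1 (a sketch of ours, not part of the paper's experiments; the Gaussian toy and all names are our assumptions), the f-divergence $D_f(p\|q) = \mathbb{E}_{z\sim q}[f(p(z)/q(z))]$ can be estimated by Monte Carlo; with $f(u) = u\log u$ it recovers $\mathrm{KL}(p\|q)$:

```python
import numpy as np

# Monte Carlo sketch of Definition 1: D_f(p||q) = E_{z~q}[ f(p(z)/q(z)) ]
# for two univariate Gaussians; f(u) = u*log(u) recovers KL(p||q).

def gauss_logpdf(z, mu, sig):
    return -0.5 * ((z - mu) / sig) ** 2 - np.log(sig * np.sqrt(2.0 * np.pi))

def f_divergence(f, mu_p, sig_p, mu_q, sig_q, n=200_000, seed=0):
    rng = np.random.default_rng(seed)
    z = rng.normal(mu_q, sig_q, size=n)                  # samples from q
    ratio = np.exp(gauss_logpdf(z, mu_p, sig_p) - gauss_logpdf(z, mu_q, sig_q))
    return float(np.mean(f(ratio)))

f_kl = lambda u: u * np.log(u)                           # convex, f(1) = 0
est = f_divergence(f_kl, 0.5, 1.0, 0.0, 1.0)
exact = 0.5 * 0.5 ** 2                                   # KL(N(0.5,1)||N(0,1)) = mu^2/2
```

Swapping in another convex $f$ with $f(1)=0$ estimates the corresponding divergence with the same estimator.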
Definition 2
Given a function $f: (0,\infty) \to \mathbb{R}$, the dual function $f^*$ is defined as $f^*(u) = u\, f(1/u)$.
One can verify that the dual function has the following properties: i) $(f^*)^* = f$; ii) if $f$ is convex, $f^*$ is also convex; and iii) if $f(1) = 0$, then $f^*(1) = 0$. With the dual function $f^*$, an identity between the forward and reverse f-divergences can be established Dieng_NIPS_2017 (): $D_f(p\|q) = D_{f^*}(q\|p)$.
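These dual-function properties can be verified numerically (a sketch of ours with an assumed Gaussian toy; only the identities themselves come from the text above):

```python
import numpy as np

# Check f*(u) = u f(1/u), the involution (f*)* = f, and D_f(p||q) = D_{f*}(q||p).
f  = lambda u: u * np.log(u)                 # generator of KL(p||q)
fs = lambda u: u * f(1.0 / u)                # dual; closed form is -log(u)

u = np.linspace(0.1, 5.0, 200)
assert np.allclose(fs(u), -np.log(u))        # f*(u) = -log u
assert np.allclose(u * fs(1.0 / u), f(u))    # involution: (f*)* = f

# Forward/reverse identity with p = N(0.5,1), q = N(0,1):
rng = np.random.default_rng(1)
mu_p = 0.5
zq = rng.normal(0.0, 1.0, 400_000)           # samples from q
zp = rng.normal(mu_p, 1.0, 400_000)          # samples from p
logr_q = mu_p * zq - mu_p ** 2 / 2           # log p(z)/q(z) at z ~ q
logr_p = mu_p * zp - mu_p ** 2 / 2           # same ratio at z ~ p
d_f  = float(np.mean(np.exp(logr_q) * logr_q))   # E_q[f(p/q)]   = D_f(p||q)
d_fs = float(np.mean(logr_p))                    # E_p[f*(q/p)]  = D_{f*}(q||p)
```

Both estimates agree with the closed-form $\mathrm{KL}(p\|q) = 0.125$ up to Monte Carlo error, illustrating the identity.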
In order to facilitate the derivation of the f-variational bound, especially when a latent variable model is involved Nowozin_NIPS_2016 (); Zhang_arxiv_2019 (), we introduce a surrogate f-divergence defined by the generator function
(2) $\tilde f(u) \triangleq f(u/C) - f(1/C),$
where $C > 0$ is a constant. It is straightforward to verify that $\tilde f$ and $f$ have the same convexity and that $\tilde f(1) = 0$, which makes $\tilde f$ induce a valid (surrogate) divergence, denoted as $D_{\tilde f}(\cdot\|\cdot)$, that can virtually replace $D_f(\cdot\|\cdot)$ when needed.
Proposition 1
Given two probability distributions $p$ and $q$, a convergent sequence $\{C_n\}$ with $C_n \to 1$, and a convex function $f$ such that $f(1) = 0$ and $f$ is uniformly continuous, the divergences between $p$ and $q$ satisfy
(3) $D_{\tilde f_n}(p\|q) \to D_f(p\|q)$
almost everywhere as $n \to \infty$, where $\tilde f_n$ denotes the generator in (2) with constant $C_n$.
2.2 Shifted homogeneity
We then introduce a class of $f$ functions equipped with a structural advantage in decomposition, which will be invoked later to derive the coordinate-wise f-VI algorithm under the mean-field assumption.
Definition 3
A convex function $f$ belongs to $\mathcal{F}$ if $f(1) = 0$ and, for any $x, y \in (0,\infty)$, we have
(4) $f(xy) = y^{\alpha} f(x) + x^{\beta} f(y),$
where $\alpha \in \mathbb{R}$ and $\beta \in \{0, 1\}$. Function $f$ is type-I shifted homogeneous, or $f \in \mathcal{F}^{\mathrm{I}}_{\alpha}$, if $\beta = 0$, and type-II shifted homogeneous, or $f \in \mathcal{F}^{\mathrm{II}}_{\alpha}$, if $\beta = 1$.
This special class of functions allows one to decompose an $f$ function into two or more (by iteration) terms, each of which is a product of an $f$ function and an exponent. In Table 1, we show that the $f$ functions of many well-known divergences can be classified as $\mathcal{F}$ functions.
Divergences | $f(u)$ | $f^*(u)$
KL divergence Jordan_ML_1999 () | $u\log u$ | $-\log u$
General $\chi^n$ divergence Dieng_NIPS_2017 () | $u^{1-n} - u$ | $u^{n} - 1$
Hellinger divergence Sason_Entropy_2018 () | $(u^{1-\alpha} - u)/(\alpha - 1)$ | $(u^{\alpha} - 1)/(\alpha - 1)$
Rényi's $\alpha$-divergence | obtained from the Hellinger divergence via the monotone transform $\frac{1}{\alpha-1}\log\big(1 + (\alpha-1)(\cdot)\big)$
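The shifted-homogeneity decomposition can be spot-checked numerically. The sketch below is ours and assumes the standard generators $f(u) = u\log u$ for the KL divergence (exponents $\alpha = \beta = 1$) and $f(u) = u^{1-n} - u$ for the $\chi^n$ divergence ($\alpha = 1-n$, $\beta = 1$):

```python
import numpy as np

# Verify f(xy) = y^a f(x) + x^b f(y) for two common generators.
x = np.linspace(0.2, 3.0, 40)
y = 1.7

f_kl = lambda u: u * np.log(u)                      # a = 1, b = 1
assert np.allclose(f_kl(x * y), y * f_kl(x) + x * f_kl(y))

n = 2.0
f_chi = lambda u: u ** (1 - n) - u                  # a = 1 - n, b = 1
assert np.allclose(f_chi(x * y), y ** (1 - n) * f_chi(x) + x * f_chi(y))
```

Since both identities hold for all positive $x, y$, each generator can be repeatedly split into products of an $f$ value and a power, which is what the coordinate-wise mean-field derivation later exploits.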
The duality property between $\mathcal{F}^{\mathrm{I}}$ and $\mathcal{F}^{\mathrm{II}}$ functions is stated in Proposition 2.
Proposition 2
Given $f \in \mathcal{F}^{\mathrm{I}}_{\alpha}$ and $g \in \mathcal{F}^{\mathrm{II}}_{\alpha}$, the dual functions satisfy $f^* \in \mathcal{F}^{\mathrm{II}}_{1-\alpha}$ and $g^* \in \mathcal{F}^{\mathrm{I}}_{1-\alpha}$.
When $f \in \mathcal{F}$, we can establish a more profound relationship, in contrast with Proposition 1, between the f-divergence $D_f$ and the surrogate divergence $D_{\tilde f}$.
Proposition 3
When $f \in \mathcal{F}^{\mathrm{I}}_{\alpha}$ or $f \in \mathcal{F}^{\mathrm{II}}_{\alpha}$, an f-divergence and its surrogate divergence satisfy
(5) $D_{\tilde f}(p\|q) = C^{-\alpha}\, D_f(p\|q).$
By virtue of the equivalence relationships revealed in Propositions 1 and 3, we can interchangeably use the f-divergence $D_f$ and the surrogate divergence $D_{\tilde f}$, and the parameter $C$ of the surrogate divergence provides an additional degree of freedom when deriving the f-variational bounds and f-VI algorithms.
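The interchangeability can be illustrated numerically. The sketch below is ours and assumes the surrogate generator $\tilde f(u) = f(u/C) - f(1/C)$; for the KL generator $f(u) = u\log u$ the surrogate divergence then comes out proportional to the original one with factor $C^{-1}$:

```python
import numpy as np

# Surrogate vs. original divergence for f(u) = u log u (KL), p=N(0.5,1), q=N(0,1).
rng = np.random.default_rng(2)
C, mu_p = 2.5, 0.5

f = lambda u: u * np.log(u)
f_tilde = lambda u: f(u / C) - f(1.0 / C)        # assumed surrogate generator
assert abs(f_tilde(1.0)) < 1e-12                 # f~(1) = 0: valid generator

z = rng.normal(0.0, 1.0, 400_000)                # z ~ q
ratio = np.exp(mu_p * z - mu_p ** 2 / 2)         # p(z)/q(z)
d_f      = float(np.mean(f(ratio)))              # D_f(p||q)   ~ 0.125
d_ftilde = float(np.mean(f_tilde(ratio)))        # D_f~(p||q)  ~ 0.125 / C
```

Because the surrogate is a positive multiple of the original divergence here, minimizing one minimizes the other, which is the sense in which the two are interchangeable for VI.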
3 Variational bounds and optimization
While it was difficult to retrieve an f-variational bound Tao_ICML_2018 (); Wang_NIPS_2018 (); Zhang_arxiv_2019 (), i.e. an expectation over $q(z)$ that unifies the existing variational bounds Blei_JASA_2017 (); Li_NIPS_2016 (); Dieng_NIPS_2017 (), by directly manipulating the original f-divergence in (1), we will show that such a general f-variational bound can be found by minimizing a crafty surrogate f-divergence.
3.1 f-variational bounds
Given a convex function $f$ such that $f(1) = 0$ and a set of i.i.d. observations $x$, the generator function (2) with $C = p(x)$ induces a surrogate f-divergence. Our f-VI is then initiated from minimizing the following reverse (surrogate) f-divergence
(6) $D_{\tilde f}(q(z)\,\|\,p(z|x)) = \int p(z|x)\, \tilde f\Big(\frac{q(z)}{p(z|x)}\Big)\, dz.$
Multiplying both sides of (6) by the evidence $p(x)$ and rearranging terms, we have
(7) $p(x)\, D_{\tilde f}(q(z)\|p(z|x)) = \mathbb{E}_{q(z)}\Big[f^*\Big(\frac{p(x,z)}{q(z)}\Big)\Big] - f^*(p(x)).$
For a given evidence $p(x)$, we can minimize the divergence by minimizing the expectation in (7), which is defined as the f-variational bound $\mathcal{L}_f(q;x)$. Consequently, by the nonnegativity of the f-divergence Sason_TIT_2016 (); Sason_Entropy_2018 (), we can establish the following inequality.
Theorem 1
The dual function of the evidence, $f^*(p(x))$, is bounded above by the f-variational bound $\mathcal{L}_f(q;x)$:
(8) $f^*(p(x)) \,\leq\, \mathcal{L}_f(q;x) \triangleq \mathbb{E}_{q(z)}\Big[f^*\Big(\frac{p(x,z)}{q(z)}\Big)\Big],$
and the equality is attained when $D_{\tilde f}(q(z)\|p(z|x)) = 0$, i.e. $q(z) = p(z|x)$.
By properly choosing the $f$ function, the f-variational bound and (8) can restore most existing variational bounds and the corresponding inequalities, e.g. $f^*(u) = -\log u$ for the ELBO in Blei_JASA_2017 () and $f^*(u) = u^n - 1$ for the CUBO in Dieng_NIPS_2017 (). See the Supplementary Material (SM) for more restoration examples and some new variational bounds, e.g. an evidence upper bound (EUBO) under the KL divergence. While the assumption on $f$ or the existence of $\tilde f$ in (6) might lay additional restrictions in some situations, we can circumvent them by resorting to the f-VI minimizing the forward surrogate f-divergence $D_{\tilde f}(p(z|x)\|q(z))$. The SM provides more details on this alternative. Additionally, the bound in (8) can be further sharpened by leveraging multiple weighted posterior samples Burda_ICLR_2016 (), i.e., importance-weighted f-VI.
Corollary 1
For any integer $K \geq 1$, the importance-weighted f-variational bound $\mathcal{L}^K_f(q;x)$ and the f-variational bound $\mathcal{L}_f(q;x)$ satisfy
$f^*(p(x)) \,\leq\, \mathcal{L}^K_f(q;x) \,\leq\, \mathcal{L}_f(q;x),$
where $\mathcal{L}^K_f(q;x)$ is defined as
$\mathcal{L}^K_f(q;x) \triangleq \mathbb{E}_{z_1,\dots,z_K \sim q(z)}\Big[f^*\Big(\frac{1}{K}\sum_{k=1}^{K}\frac{p(x,z_k)}{q(z_k)}\Big)\Big],$
and $z_1, \dots, z_K$ are i.i.d. samples from $q(z)$.
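The importance-weighted tightening can be demonstrated on a conjugate toy model (a sketch of ours; the model, the recognition density $q = \mathcal{N}(0.2, 1)$, and all names are our assumptions). For $f^*(u) = -\log u$, the bound $\mathbb{E}[f^*(\frac1K\sum_k p(x,z_k)/q(z_k))]$ decreases toward $f^*(p(x)) = -\log p(x)$ as $K$ grows:

```python
import numpy as np

# Toy model: z ~ N(0,1), x|z ~ N(z,1), so p(x) = N(x; 0, 2) exactly.
rng = np.random.default_rng(0)
x0 = 0.7

def log_w(z):                                    # log p(x0, z) - log q(z)
    log_joint = -0.5 * z**2 - 0.5 * (x0 - z)**2 - np.log(2.0 * np.pi)
    log_q = -0.5 * (z - 0.2)**2 - 0.5 * np.log(2.0 * np.pi)
    return log_joint - log_q

def bound(K, m=100_000):                         # L_K with f*(u) = -log u
    z = rng.normal(0.2, 1.0, size=(m, K))
    return float(np.mean(-np.log(np.mean(np.exp(log_w(z)), axis=1))))

L1, L10 = bound(1), bound(10)
neg_log_px = 0.5 * np.log(4.0 * np.pi) + x0 ** 2 / 4.0   # -log N(x0; 0, 2)
```

Here $L_1$ is the negative ELBO and $L_{10}$ sits strictly between it and $-\log p(x)$, matching the ordering in Corollary 1.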
For clarity and conciseness, we will develop the subsequent results primarily based on $\mathcal{L}_f(q;x)$. Nevertheless, our readers should feel safe to replace $\mathcal{L}_f(q;x)$ with $\mathcal{L}^K_f(q;x)$ in the following context and obtain improved outcomes. More interesting results can be observed from (8). After composing both sides of (8) with the inverse dual function $(f^*)^{-1}$, we have the following observations:

When the dual function $f^*$ is increasing (or non-decreasing) on $(0,\infty)$, the composition gives an evidence upper bound: $p(x) \leq (f^*)^{-1}(\mathcal{L}_f(q;x))$.

When the dual function $f^*$ is decreasing (or non-increasing) on $(0,\infty)$, the composition gives an evidence lower bound: $p(x) \geq (f^*)^{-1}(\mathcal{L}_f(q;x))$.

When the dual function $f^*$ is non-monotonic on $(0,\infty)$, the composition gives a local evidence bound by applying $(f^*)^{-1}$ on a monotonic interval of $f^*$.
Based on these observations, we can readily derive a sandwich formula for the evidence $p(x)$, which is essential for an accurate VI Zhang_TPAMI_2019 ().
Corollary 2
Given convex functions $f$ and $g$ such that $f(1) = g(1) = 0$, on an interval where $f^*$ is increasing and $g^*$ is decreasing, the evidence $p(x)$ satisfies
(9) $(g^*)^{-1}(\mathcal{L}_g(q;x)) \,\leq\, p(x) \,\leq\, (f^*)^{-1}(\mathcal{L}_f(q;x)).$
The evidence bounds in (9) are akin to the GLBO, which was proposed on the basis of a few assumptions and intuitions in Tao_ICML_2018 (). Corollary 1 and Corollary 2 interpret and supplement the GLBO with a rigorous f-VI analysis and explicit instructions on choosing the $f$ function. Compared with unilateral variational bounds, the bilateral bounds in (9) reveal more information and allow us to estimate the evidence with more accuracy. To sharpen these bilateral bounds, we need to properly choose the functions $f$ and $g$ and the recognition model $q(z)$ so that tight upper and lower bounds can be attained. For a selected family of $q(z)$, various choices of $f$ and $g$ will lead to evidence bounds of different sharpness and optimization efficiency. The model selection of the approximate distribution $q(z)$ is a fundamental problem inherited by all VI algorithms, and a feasible solution is to compare the performance of candidate models while fixing an $f$ or $g$ function Tao_ICML_2018 () or alternating among some common divergences. Once the functions $f$ and $g$ and the recognition model are determined, we can approximate the optimal distribution or minimize $\mathcal{L}_f(q;x)$ by adjusting the parameters in $q(z)$, which does not require the dual function $f^*$ or $g^*$ to be invertible as in (9) and will be discussed in the succeeding subsections.
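A sandwich estimate in the spirit of Corollary 2 can be sketched on a conjugate Gaussian model (ours; the model, the choice $q = \mathcal{N}(0.2, 1)$, and the pair $g^*(u) = -\log u$ (ELBO) and $f^*(u) = u^2 - 1$ ($\chi^2$, with $(f^*)^{-1}(v) = \sqrt{v+1}$) are assumptions for illustration):

```python
import numpy as np

# z ~ N(0,1), x|z ~ N(z,1), so p(x) = N(x; 0, 2) is known exactly.
rng = np.random.default_rng(1)
x0 = 0.7
log_px = -0.5 * np.log(4.0 * np.pi) - x0 ** 2 / 4.0        # exact log-evidence

z = rng.normal(0.2, 1.0, 400_000)                          # z ~ q = N(0.2, 1)
log_w = (-0.5 * z**2 - 0.5 * (x0 - z)**2 - np.log(2.0 * np.pi)) \
        - (-0.5 * (z - 0.2)**2 - 0.5 * np.log(2.0 * np.pi))

elbo = float(np.mean(log_w))                               # log of lower bound
cubo = float(0.5 * np.log(np.mean(np.exp(2.0 * log_w))))   # log sqrt(E_q[w^2]),
                                                           # i.e. log (f*)^{-1}(L_f)
```

On the log scale the two composed bounds bracket the exact log-evidence, and the gap shrinks as $q$ approaches the true posterior $\mathcal{N}(x_0/2, 1/2)$.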
3.2 Stochastic optimization
While classical VI is limited to conditionally conjugate exponential-family models Murphy_2012 (); Blei_JASA_2017 (); Zhang_TPAMI_2019 (), stochastic optimization makes VI applicable to more modern and complicated problems Hoffman_JMLR_2013 (); Ranganath_ICAIS_2014 (). To minimize $\mathcal{L}_f(q;x)$ with stochastic optimization, we supplement the preceding f-VI formulation with more details. The approximate model is formulated as $q_\phi(z)$, where $\phi$ are the parameters to be optimized. While some papers Mnih_ICML_2014 (); Kingma_ICLR_2014 (); Tao_ICML_2018 () also consider and optimize the parameters in the generative model, we prefer to treat the model parameters as latent variables for conciseness. An intuitive approach to apply stochastic optimization is to compute the standard gradient of $\mathcal{L}_f$ or $\mathcal{L}^K_f$ w.r.t. $\phi$:
(10) $\nabla_\phi \mathcal{L}_f = \mathbb{E}_{q_\phi(z)}\big[\big(f^*(w_\phi) - w_\phi\, f^{*\prime}(w_\phi)\big)\,\nabla_\phi \log q_\phi(z)\big],$
where $w_\phi$ denotes the ratio $p(x,z)/q_\phi(z)$. Since $\nabla_\phi \log q_\phi(z)$ is known as the score function in statistics Cox_1979 () and is a part of the REINFORCE algorithm Williams_ML_1992 (); Mnih_ICML_2014 (), (10) is called the score-function or REINFORCE gradient. An unbiased Monte Carlo (MC) estimator for (10) can be obtained by drawing samples $z_1, \dots, z_K$ from $q_\phi(z)$:
(11) $\widehat\nabla_\phi \mathcal{L}_f = \frac{1}{K}\sum_{k=1}^{K}\big(f^*(w_\phi(z_k)) - w_\phi(z_k)\, f^{*\prime}(w_\phi(z_k))\big)\,\nabla_\phi \log q_\phi(z_k).$
However, since the variance of the estimator (11) can be too large to be useful in practice, the score-function gradient is usually employed along with some variance reduction techniques, such as control variates and Rao–Blackwellization Paisley_ICML_2012 (); Mnih_ICML_2014 (); Ranganath_ICAIS_2014 ().
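A score-function estimator can be sketched on a closed-form toy (ours; the choices $q = \mathcal{N}(\mu, 1)$, $p = \mathcal{N}(0,1)$, $f^*(u) = -\log u$ are assumptions). The objective is then $\mathrm{KL}(q\|p) = \mu^2/2$, whose exact gradient w.r.t. $\mu$ is $\mu$:

```python
import numpy as np

# REINFORCE sketch for grad_mu E_{q}[f*(p/q)] with f*(u) = -log u.
rng = np.random.default_rng(0)
mu, n = 1.0, 200_000
z = rng.normal(mu, 1.0, n)                           # z ~ q = N(mu, 1)
log_w = (-0.5 * z ** 2) - (-0.5 * (z - mu) ** 2)     # log p(z) - log q(z)
w = np.exp(log_w)
fstar, dfstar = -log_w, -1.0 / w                     # f*(w) and f*'(w)
score = z - mu                                       # grad_mu log q(z)
grad_hat = float(np.mean((fstar - w * dfstar) * score))
```

With $\mu = 1$ the estimate concentrates around the exact gradient $1$; the per-sample variance here is noticeably larger than that of the pathwise estimator on the same problem, which is why variance reduction is usually needed.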
An alternative to the score-function gradient is the reparameterization gradient, which empirically has a lower estimation variance Kingma_ICLR_2014 (); Zhang_arxiv_2019 () and can be integrated with neural networks. The reparameterization trick requires the existence of a noise variable $\epsilon \sim s(\epsilon)$ and a mapping $g_\phi(\epsilon)$ such that $z = g_\phi(\epsilon) \sim q_\phi(z)$. Instead of directly sampling from $q_\phi(z)$, the reparameterization estimators rely on the samples drawn from $s(\epsilon)$; for example, a Gaussian latent variable $z \sim \mathcal{N}(\mu, \sigma^2)$ can be reparameterized with a standard Gaussian variable $\epsilon \sim \mathcal{N}(0,1)$ and the mapping $z = \mu + \sigma\epsilon$. More detailed interpretations as well as recent advances in the reparameterization trick can be found in Kingma_ICLR_2014 (); Ruiz_NIPS_2016 (); Figurnov_NIPS_2018 (); Jankowiak_ICML_2018 (). The gradient of $\mathcal{L}_f$ after reparameterization becomes
(12) $\nabla_\phi \mathcal{L}_f = \mathbb{E}_{s(\epsilon)}\big[f^{*\prime}(w_\phi(g_\phi(\epsilon)))\,\nabla_\phi w_\phi(g_\phi(\epsilon))\big].$
An unbiased MC estimator for (12) is
(13) $\widehat\nabla_\phi \mathcal{L}_f = \frac{1}{K}\sum_{k=1}^{K} f^{*\prime}(w_\phi(g_\phi(\epsilon_k)))\,\nabla_\phi w_\phi(g_\phi(\epsilon_k)),$
where $\epsilon_1, \dots, \epsilon_K$ are drawn from $s(\epsilon)$.
where are drawn from . We also give an unbiased MC estimator for the importanceweighted reparameterization gradient in (14), which will be utilized in later experiments:
(14) 
where noise samples are drawn from . All the aforementioned estimators for variational bounds and gradients are unbiased, while composing these estimator with other functions, e.g. inverse dual functions in (9), can sometimes trade the unbiasedness for numerical stability Li_NIPS_2016 (); Dieng_NIPS_2017 (); Tao_ICML_2018 ().
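The pathwise (reparameterization) estimator can be sketched on the same KL toy as above (ours; $q = \mathcal{N}(\mu,1)$ via $z = \mu + \epsilon$, $p = \mathcal{N}(0,1)$, $f^*(u) = -\log u$ are assumptions). The per-sample derivative of $-\log w(z(\mu)) = 0.5 z^2 - 0.5\epsilon^2 + \text{const}$ w.r.t. $\mu$ is simply $z$, so the exact gradient is again $\mathbb{E}[z] = \mu$:

```python
import numpy as np

# Pathwise gradient sketch: z = mu + eps, eps ~ N(0, 1).
rng = np.random.default_rng(0)
mu, n = 1.0, 20_000
eps = rng.normal(0.0, 1.0, n)
z = mu + eps                                  # reparameterized sample from q
grad_samples = z                              # d/dmu of -log w along the path
grad_hat = float(np.mean(grad_samples))
```

Note that the pathwise per-sample standard deviation here is $1$, versus roughly $2$ for the score-function estimator on the same problem, consistent with the empirically lower variance claimed above.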
Nonetheless, the preceding estimators and f-VI algorithms rely on the full dataset and can be handicapped when tackling problems with large datasets. Meanwhile, since the properties of $f$ functions are flexible, it is nontrivial to represent the f-variational bounds as an expectation of a datapoint-wise loss, except for some specific divergences, such as the KL divergence Kingma_ICLR_2014 () or divergences whose dual functions allow the bound to factorize across datapoints. Therefore, to deploy minibatch training, we integrate the aforementioned estimators with the average-likelihood technique Li_NIPS_2016 (). Given a minibatch of $M$ datapoints, we approximate the full-data log-likelihood by scaling the minibatch log-likelihood by $N/M$, where $N$ is the dataset size. Hence, the ratio $p(x,z)/q_\phi(z)$ in (10)–(14) can be approximated accordingly. When $z$ contains local hidden variables, the prior distribution and the approximate distribution should also be approximated accordingly. This proxy to the full dataset wraps up our black-box f-VI algorithm, which is essentially a stochastic optimization algorithm that relies on only a minibatch of data in each iteration. A reference black-box f-VI algorithm and the optimization schemes for a few concrete f-divergences are given in the SM.
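The average-likelihood proxy can be sketched as follows (ours; the synthetic data, likelihood $p(x|z) = \mathcal{N}(x; z, 1)$, and all names are assumptions for illustration):

```python
import numpy as np

# Approximate sum_i log p(x_i|z) over N points by (N/M) * minibatch sum.
rng = np.random.default_rng(0)
N, M = 1000, 32
x = rng.normal(1.0, 1.0, N)                       # synthetic observations
z = 0.9                                           # a current latent sample

def loglik(xs, z):                                # p(x|z) = N(x; z, 1)
    return float(np.sum(-0.5 * (xs - z) ** 2 - 0.5 * np.log(2.0 * np.pi)))

batch = rng.choice(N, size=M, replace=False)
full = loglik(x, z)                               # full-data log-likelihood
proxy = (N / M) * loglik(x[batch], z)             # minibatch proxy
```

The proxy is unbiased over random minibatch draws, so plugging it into the gradient estimators yields a stochastic optimization scheme that touches only $M$ datapoints per iteration.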
3.3 Mean-field approximation
Mean-field approximation, which simplifies the original VI problem for tractable computation, was historically an important VI algorithm before the emergence of stochastic VI. As the cornerstone of several variational message passing algorithms Winn_JMLR_2005 (); Wand_BA_2011 (), mean-field VI is still evolving Knowles_NIPS_2011 (); Wang_JMLR_2013 (); Blei_JASA_2017 (); Zhang_TPAMI_2019 () and worthy of being generalized for f-VI. A mean-field approximation assumes that all latent variables are independent, so that the recognition model can be fully factorized as $q(z) = \prod_j q_j(z_j)$, which simplifies the derivations and computation but might lead to less accurate results. The mean-field f-VI algorithm alternately updates each marginal distribution $q_j(z_j)$ to minimize the f-variational bound. For the divergences with $f \in \mathcal{F}^{\mathrm{II}}_1$, such as the KL divergence, the coordinate-wise update rule for $q_j(z_j)$ is obtained by fixing the other variational factors and singling $q_j(z_j)$ out from the variational bound in (8), which gives
(15)
For the divergences with $f \in \mathcal{F}^{\mathrm{II}}_\alpha$, $\alpha \neq 1$, such as the $\chi$- or Rényi's divergences, the coordinate-wise update rule for $q_j(z_j)$ is obtained by applying the same procedure to the variational bound from the forward f-VI (see SM), which gives
(16)
When deriving these mean-field f-VI update rules (see SM), we only exploit the homogeneity of the $f$ or $f^*$ function. CAVI Bishop_2006 (); Blei_JASA_2017 (), EP Minka_UAI_2001 (), and other types of mean-field VI algorithms can be restored from (15) and (16) by choosing a proper $f$ or $f^*$ function. A reference mean-field f-VI algorithm along with a concrete realization example under the KL divergence is provided in the SM. When the inverse function in (15) or (16) is not analytically solvable, we can either generate a lookup table and numerically evaluate (15) or (16), or resort to the stochastic f-VI.
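The classical CAVI special case can be sketched on a toy target (ours; the 2-D Gaussian with unit marginal variances, correlation $\rho$, and all numbers are assumptions). For a Gaussian target, the coordinate-wise mean updates are $m_1 \leftarrow \mu_1 + \rho(m_2 - \mu_2)$ and $m_2 \leftarrow \mu_2 + \rho(m_1 - \mu_1)$, which converge to the true posterior mean:

```python
import numpy as np

# CAVI mean updates for a 2-D Gaussian target N(mu, Sigma),
# Sigma = [[1, rho], [rho, 1]]; each sweep contracts the error by rho^2.
mu, rho = np.array([1.0, 2.0]), 0.8
m = np.array([0.0, 0.0])                     # factor means of q1, q2
for _ in range(60):
    m[0] = mu[0] + rho * (m[1] - mu[1])      # update q1 given q2
    m[1] = mu[1] + rho * (m[0] - mu[0])      # update q2 given q1
```

This is the KL instance of the coordinate-wise scheme; other $f$ functions change the update rule but keep the same alternating structure.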
4 Experiments
The effectiveness and the wide applicability of f-VI are demonstrated with three empirical examples in this section. We first verify the theoretical results with a synthetic example. The f-VI is then implemented for a Bayesian neural network for linear regression and for a VAE for image reconstruction and generation. The Adam optimizer with the recommended parameters in Kingma_ICLR_2015 () is employed for stochastic optimization, if not specified otherwise. Empirical results and data are reported by their mean values and confidence intervals. More detailed descriptions of the experimental settings, supplemental results, and the demonstration of the mean-field approximation method are provided in the SM.
4.1 Synthetic example
We first demonstrate the f-VI theory with a vanilla example. Consider a batch of i.i.d. datapoints generated by a latent variable model, where $\mathcal{N}(\mu, \sigma^2)$ denotes a univariate normal distribution with mean $\mu$ and variance $\sigma^2$, and $\mathcal{U}(a, b)$ denotes a uniform distribution on the interval $[a, b]$. Subsequently, for simplicity, we posit a prior distribution, a likelihood distribution, and an approximate model $q(z)$, which is a uniform distribution centered at an adjustable location with an adjustable width. To verify the order and the sharpness of the f-variational bounds, we fix $q(z)$ and approximate the true evidence, the IW-RVB, the (IW-)CUBO, and the (IW-)ELBO in Figure 1(a), which substantiates Theorem 1, Corollary 1 and Corollary 2. A variational bound associated with the total variation distance, an f-divergence with a non-monotonic dual function, is analyzed in the SM, and more approximation results can be found in Tao_ICML_2018 (). To demonstrate the effectiveness of the stochastic f-VI algorithm, we set an initial value and update the recognition distribution by optimizing the IW-RVB, the (IW-)CUBO, and the (IW-)ELBO. The IW-reparameterization gradient (14) is adopted for training on a dataset of observations, and the variational bounds in Figure 1(b) are evaluated on a test set of observations. The sandwich-type bounds in Figure 1(b) give an estimate of the test log-evidence.
4.2 Bayesian neural network
We then implement the f-VI for a single-layer neural network for Bayesian linear regression. Our experimental setup generally follows the regression settings in Li_NIPS_2016 (), while some parameters vary to adapt to the f-VI framework. The linear regression is performed with twelve datasets from the UCI Machine Learning Repository UCI (). Each dataset is randomly split into training and testing subsets, and six different dual functions are selected such that three well-established VIs (KL-VI, Rényi's α-VI, and χ-VI) and three new f-VIs (VIs subject to the total variation distance and two custom f-divergences) are tested and compared. One of the custom divergences, inspired by Bamler_NIPS_2017 (), is defined by a convex dual function with a parameter to be optimized during training. The IW-reparameterization gradient with a minibatch size of 32 is employed for training. After 20 trials with 500 training epochs in each trial, the regression results are evaluated by the test root mean squared error (RMSE) and the test negative log-likelihood reported in Table 2. The performance of the custom f-VI matches the results of the well-established VIs on most datasets, and the custom f-VI quantitatively outperforms the others on some datasets, e.g. Fish Toxicity and Stock. A complete version of Table 2, including the regression results of the other two new f-VIs, and more detailed descriptions of the training process, such as the architecture of the neural network, training parameters, numerical stability and estimator biasedness, are provided in the SM.
Dataset | Test RMSE (lower is better) | Test negative log-likelihood (lower is better)

 | KL-VI | Rényi-VI | χ-VI | f-VI (custom) | KL-VI | Rényi-VI | χ-VI | f-VI (custom)
Airfoil | 2.16±.07 | 2.36±.14 | 2.30±.08 | 2.34±.09 | 2.17±.03 | 2.27±.03 | 2.26±.02 | 2.29±.02
Aquatic | 1.12±.06 | 1.20±.06 | 1.14±.07 | 1.14±.06 | 1.54±.04 | 1.60±.08 | 1.54±.07 | 1.54±.06
Boston | 2.76±.36 | 2.99±.37 | 2.86±.36 | 2.87±.36 | 2.49±.08 | 2.54±.18 | 2.48±.13 | 2.49±.13
Building | 1.38±.12 | 2.82±.51 | 1.83±.22 | 1.80±.21 | 6.62±.02 | 6.94±.13 | 6.79±.03 | 6.74±.04
CCPP | 4.05±.09 | 4.14±.11 | 4.06±.08 | 4.33±.12 | 2.82±.02 | 2.84±.03 | 2.82±.02 | 2.95±.01
Concrete | 5.40±.24 | 3.32±.34 | 5.32±.27 | 5.26±.21 | 3.10±.04 | 2.61±.18 | 3.09±.04 | 3.09±.03
Fish Toxicity | 0.88±.04 | 0.90±.04 | 0.89±.04 | 0.88±.03 | 1.28±.04 | 1.27±.04 | 1.29±.04 | 1.29±.03
Protein | 1.93±.19 | 2.45±.42 | 1.87±.17 | 1.97±.21 | 2.00±.07 | 2.01±.08 | 2.04±.08 | 2.21±.04
Real Estate | 7.48±1.41 | 7.51±1.44 | 7.46±1.42 | 7.52±1.40 | 3.60±.30 | 3.70±.45 | 3.59±.32 | 3.62±.33
Stock | 3.85±1.12 | 3.90±1.09 | 3.88±1.13 | 3.82±1.11 | 1.09±.04 | 1.09±.04 | 1.09±.04 | 1.09±.04
Wine | .642±.018 | .640±.021 | .638±.018 | .643±.019 | .966±.027 | .965±.028 | .964±.025 | .975±.027
Yacht | 0.78±.12 | 1.18±.18 | 0.99±.12 | 1.00±.18 | 1.70±.02 | 1.79±.03 | 1.82±.01 | 2.05±.01
4.3 Bayesian variational autoencoder
We also integrate the f-VI with a Bayesian VAE for image reconstruction and generation on the datasets of Caltech 101 Silhouettes Caltech101 (), Frey Face FreyFace (), MNIST MNIST (), and Omniglot OMNIGLOT (). By replacing the conventional ELBO loss function of the VAE Kingma_ICLR_2014 (); MATLAB_VAE () with the more flexible f-variational bound loss functions, we test and compare the VAEs associated with three well-known divergences (the KL divergence, Rényi's α-divergence, and the χ-divergence) and three new divergences (the total variation distance and two custom f-divergences). The dual function for the total variation distance is $f^*(u) = \frac{1}{2}|u - 1|$. One custom variational bound loss is induced by the aforementioned dual function; the other is induced by a second custom dual function, which is convex on the domain of interest. The reparameterization gradient is used for training. After 20 trials with 200 training epochs in each trial, the average test reconstruction errors (lower is better) measured by cross-entropy are listed in Table 3. In the VAE example, the performances of the three new f-VIs also rival the results of the three well-known VIs on most datasets. Reconstructed and generated images, architectures of the encoder and decoder networks, and more detailed interpretations of the custom $f$ functions and the training process of the VAEs are given in the SM.
Dataset | KL-VI | Rényi-VI | χ-VI | TV-VI | f₁-VI (custom) | f₂-VI (custom)

Caltech 101 | 73.80±2.27 | 73.84±2.16 | 74.95±2.76 | 74.32±2.26 | 74.87±2.56 | 74.85±2.94
Frey Face | 160.85±.72 | 160.57±.95 | 161.06±1.16 | 161.11±1.00 | 160.52±.88 | 160.65±.87
MNIST | 59.06±.40 | 62.13±.50 | 61.90±.69 | 62.44±.41 | 59.60±.25 | 59.53±.42
Omniglot | 109.62±.20 | 110.57±.28 | 110.81±.32 | 110.21±.31 | 107.13±.39 | 108.29±.28
5 Conclusion
We have introduced a general f-divergence VI framework equipped with a rigorous theoretical analysis and a standardized optimization solution, which together extend the current VI methods to a broader range of statistical divergences. Empirical experiments on popular benchmarks imply that this f-VI method is flexible, effective, and widely applicable, and that some custom f-VI instances can attain state-of-the-art results. Future work on f-VI may include finding f-VI instances with more favorable properties, more efficient f-VI optimization methods, and VI frameworks and theories that are more universal than f-VI.
Broader Impact
This work does not present any foreseeable societal consequence.
This work was supported by AFOSR under Grant FA9550-15-1-0518 and NSF NRI under Grant ECCS-1830639. The authors would like to thank the anonymous editors and reviewers for their constructive comments, Dr. Xinyue Chang (Iowa State Univ.), Lei Ding (Univ. of Alberta), Zhaobin Kuang (Stanford), Yang Wang (Univ. of Alabama), and Yanbo Xu (Georgia Tech.) for their helpful suggestions, and Prof. Evangelos A. Theodorou for his insightful and heuristic comments on this paper. In this arXiv version, the authors would also like to thank the readers and staff on arXiv.org.
Supplementary Material for
"f-Divergence Variational Inference"
Neng Wan
nengwan2@illinois.edu
&Dapeng Li
dapeng.ustc@gmail.com
&Naira Hovakimyan
nhovakim@illinois.edu
This supplementary material provides additional details for some results in the original paper.
Appendix A Proofs of the main results
This section provides i) elaboration on the surrogate f-divergence, including the proofs of Proposition 1, Proposition 2 and Proposition 3, ii) derivations of the f-variational bound generated from both the reverse and the forward surrogate f-divergence, and iii) an importance-weighted f-variational bound and the proof of Corollary 1.
A.1 Proof of Proposition 1
We first expand the LHS of (3) by substituting the definitions of the f-divergence (1) and the generator function (2).
In order to prove (3), we only need to show that
(17) 
which can be proved by showing that the generator is continuous in its constant, since this continuity brings each convergent sequence of constants to a convergent sequence of generator values. The continuity can be justified as follows. For an arbitrary $\epsilon > 0$, there exists a $\delta > 0$ such that
where we have used the uniform continuity of $f$. This completes the proof.
A.2 Proof of Proposition 2
We first consider the scenario when $f \in \mathcal{F}^{\mathrm{I}}_{\alpha}$. Since
$f^*(xy) = xy\, f\Big(\frac{1}{xy}\Big) = xy\Big[y^{-\alpha} f\Big(\frac{1}{x}\Big) + f\Big(\frac{1}{y}\Big)\Big] = y^{1-\alpha} f^*(x) + x\, f^*(y),$
we can conclude that $f^* \in \mathcal{F}^{\mathrm{II}}_{1-\alpha}$. We then consider the case when $g \in \mathcal{F}^{\mathrm{II}}_{\alpha}$. Since
$g^*(xy) = xy\, g\Big(\frac{1}{xy}\Big) = xy\Big[y^{-\alpha} g\Big(\frac{1}{x}\Big) + \frac{1}{x}\, g\Big(\frac{1}{y}\Big)\Big] = y^{1-\alpha} g^*(x) + g^*(y),$
we can conclude that $g^* \in \mathcal{F}^{\mathrm{I}}_{1-\alpha}$. This completes the proof.
A.3 Proof of Proposition 3
Substituting the generator (2) into Definition 1 and invoking the decomposition (4) with $x = p/q$ and $y = 1/C$, we have $\tilde f(p/q) = C^{-\alpha} f(p/q) + \big[(p/q)^{\beta} - 1\big] f(1/C)$, and hence $D_{\tilde f}(p\|q) = C^{-\alpha} D_f(p\|q) + f(1/C)\int q(z)\big[(p(z)/q(z))^{\beta} - 1\big]\, dz$, where the last integral vanishes for both $\beta = 0$ and $\beta = 1$. This completes the proof.
A.4 f-variational bound from the reverse f-divergence
A.5 f-variational bound from the forward f-divergence
As we mentioned in Section 3.1, the assumption on $f$ or the existence of $\tilde f$ in (6) can be circumvented by using the f-VI that minimizes the forward surrogate divergence $D_{\tilde f}(p(z|x)\|q(z))$. Meanwhile, in Section 3.3, the coordinate-wise update rule (16) for $q_j(z_j)$ is also based on the variational bound induced by the forward divergence. The variational bound and a sandwich estimate of the evidence from the forward surrogate divergence are derived below. First, we notice that the forward surrogate divergence can be decomposed as follows:
$D_{\tilde f}(p(z|x)\|q(z)) = \mathbb{E}_{q(z)}\Big[f\Big(\frac{p(x,z)}{q(z)}\Big)\Big] - f(p(x)),$
where the generator (2) is taken with $C = 1/p(x)$. By the nonnegativity of the f-divergence [17], i.e. $D_{\tilde f}(p(z|x)\|q(z)) \geq 0$, the variational bound from the forward divergence follows
(18) $f(p(x)) \,\leq\, \mathbb{E}_{q(z)}\Big[f\Big(\frac{p(x,z)}{q(z)}\Big)\Big],$
where the equality holds when $q(z) = p(z|x)$. Inequality (18) formulates the f-variational bound induced by the forward f-divergence and supplements Theorem 1, which is based on the reverse f-divergence. Given convex functions $f$ and $g$ such that $f(1) = g(1) = 0$, on an interval where $f$ is non-decreasing and $g$ is non-increasing, a sandwich estimate of the evidence is given as follows