f-Divergence Variational Inference

# f-Divergence Variational Inference

## Abstract

This paper introduces the -divergence variational inference (-VI) that generalizes variational inference to all -divergences. Initiated from minimizing a crafty surrogate -divergence that shares the statistical consistency with the -divergence, the -VI framework not only unifies a number of existing VI methods, e.g. Kullback–Leibler VI Jordan_ML_1999 (), Rényi’s -VI Li_NIPS_2016 (), and -VI Dieng_NIPS_2017 (), but offers a standardized toolkit for VI subject to arbitrary divergences from -divergence family. A general -variational bound is derived and provides a sandwich estimate of marginal likelihood (or evidence). The development of the -VI unfolds with a stochastic optimization scheme that utilizes the reparameterization trick, importance weighting and Monte Carlo approximation; a mean-field approximation scheme that generalizes the well-known coordinate ascent variational inference (CAVI) is also proposed for -VI. Empirical examples, including variational autoencoders and Bayesian neural networks, are provided to demonstrate the effectiveness and the wide applicability of -VI.

University of Illinois at Urbana-Champaign, Urbana, IL 61801

Anker Innovations, Shenzhen, China

## 1 Introduction

Variational inference (VI) is a machine learning method that makes Bayesian inference computationally efficient and scalable to large datasets. For decades, the dominant paradigm for approximate Bayesian inference has been Markov-Chain Monte-Carlo (MCMC) algorithms, which estimate the evidence via sampling. However, since sampling tends to be a slow and computationally intensive process, these sampling-based approximate inference methods fade when dealing with the modern probabilistic machine learning problems that usually involve very complex models, high-dimensional feature spaces and large datasets. In these instances, VI becomes a good alternative to perform Bayesian inference. The foundation of VI is primarily optimization rather than sampling. To perform VI, we posit as a family of approximate (or recognition) densities and find the member that minimizes the statistical divergence to the true posterior . Meanwhile, since VI also has many elegant and favorable theoretical properties, e.g. variational bounds of the true evidence, it has become the foundation of many popular generative and machine learning models.

Recent advances in VI can be roughly categorized into three groups, improvements over traditional VI algorithms Knowles_NIPS_2011 (); Wang_JMLR_2013 (), developments of scalable VI methods Hoffman_JMLR_2013 (); Kingma_ICLR_2014 (); Li_NIPS_2015 (), and explorations for tighter variational bounds Burda_ICLR_2016 (); Tao_ICML_2018 (). Comprehensive reviews on VI’s background and progression can be found in Blei_JASA_2017 (); Zhang_TPAMI_2019 (). While most of these advancements were built on the classical VI associated with the Kullback–Leibler (KL) divergence, some recent efforts tried to extend the VI framework to other statistical divergences and showed promising results. Among these efforts, Rényi’s -divergence and -divergence as the root divergences (or generators) of the KL divergence were employed for VI in Li_NIPS_2016 (); Dieng_NIPS_2017 (); Regli_arXiv_2018 (), which not only broadens the variety of statistical divergences for VI, but makes KL-VI a special case of their methods. Stochastic optimization methods from KL-VI, such as stochastic VI Hoffman_JMLR_2013 () and black-box VI Ranganath_ICAIS_2014 (), were generalized to Rényi’s -VI and -VI in Li_NIPS_2016 (); Dieng_NIPS_2017 (), and the modified algorithms with new divergences outperformed the classical KL-VI in some benchmarks of Bayesian regressions and image reconstruction. Nevertheless, mean-field approximation, an important type of KL-VI algorithms including the coordinate ascent variational inference (CAVI) and expectation propagation (EP) algorithms Bishop_2006 (); Minka_UAI_2001 (); Blei_JASA_2017 (), were regretfully not revisited or extended for these new divergences.

As the root divergence of the Rényi’s -divergence, -divergence and many other useful divergences Sason_TIT_2016 (); Sason_Entropy_2018 (), -divergence is a more inclusive statistical divergence (family) and was utilized to improve the statistical properties Bamler_NIPS_2017 (); Wang_NIPS_2018 (), sharpness Tao_ICML_2018 (); Zhang_arxiv_2019 (), and surely the generality of variational bounds Tao_ICML_2018 (); Zhang_arxiv_2019 (); Knoblauch_arXiv_2019 (). However, most of these works only dealt with some portions of -divergences for their favorable statistical properties, e.g. mass-covering Bamler_NIPS_2017 () and tail-adaptive Wang_NIPS_2018 (), and did not develop a systematic VI framework that harbors all -divergences. Meanwhile, since i) the regular -divergence does not explicitly induce an -variational bound as elegant as the ELBO Blei_JASA_2017 (), upper bound (CUBO) Dieng_NIPS_2017 (), or Rényi variational bound (RVB) Li_NIPS_2016 (), and ii) only specific choices of -divergence result in an -variational bound that trivially depends on the evidence Zhang_TPAMI_2019 (), a thorough and comprehensive analysis on the -divergence VI has been due for a long time.

In this paper, we extend the traditional VI to -divergence, a rich family that comprises many well-known divergences as special cases Sason_TIT_2016 (), by offering some new insights into the -divergence VI and a unified -VI framework that encompasses a number of recent developments in VI methods. An explicit benefit of -VI is that it allows to perform VI or Bayesian approximation with even more variety of divergences, which can potentially bring us sharper variational bounds, more accurate estimate of true evidence, faster convergence rates, more criteria for selecting approximate model , etc. We hope this effort can be the last brick to complete the building of -divergence VI and motivate more useful and efficient VI algorithms in the future. After reviewing the -divergence and introducing a crafty surrogate -divergence that is interchangeable with the regular -divergence, we make the following contributions:

1. We enrich the -divergence VI theory by introducing an -VI scheme via minimizing a surrogate -divergence, which makes our -VI framework compatible with the traditional VI approaches and naturally unifies an amount of existing VI methods, such as KL-VI Jordan_ML_1999 (), -VI Li_NIPS_2016 (), -VI Dieng_NIPS_2017 (), and their related developments Kingma_ICLR_2014 (); Burda_ICLR_2016 (); Wang_NIPS_2018 (); Tao_ICML_2018 (); Li_NIPS_2015 ().

2. We derive an -variational bound for the evidence and equip it with the upper/lower bound criteria and an importance-weighted (IW-)bound. The -variational bound is realized with an arbitrary -divergence and unifies many existing bounds, such as ELBO, CUBO, RVB, and a number of generalized evidence bounds (GLBO) Tao_ICML_2018 ().

3. We propose a universal optimization solution that comprises a stochastic optimization algorithm and a mean-field approximation algorithm for -VI subject to all -divergences, whether or not the -variational bounds trivially depend on the evidence. Experiments on Bayesian neural networks and variational autoencoders (VAEs) show that -VI can be comparable to, or even better than, a number of the state-of-the-art variational methods.

## 2 Preliminary of f-divergence

We first introduce some definitions and properties related to -divergence, which are to be adopted in our subsequent exposition.

### 2.1 f-divergence

An -divergence that measures the difference between two continuous probability distributions and can be defined as follows Sason_TIT_2016 ().

###### Definition 1

The -divergence from probability density functions to is defined as

 Df(q(z)∥p(z))=:∫f(q(z)p(z))p(z) dz=Ep[f(q(z)p(z))], (1)

where is a convex function with .

Definition 1 assumes that is absolutely continuous w.r.t. , which might not be exhaustive, but avoids the unnecessary entanglements with measure theory details. One can however refer to Sason_TIT_2016 (); Sason_Entropy_2018 () for a more rigorous treatment. Most prevailing divergences adopted in VI can be regarded as the special cases of -divergence and hence be restored by choosing a proper -function . Table 1 and Sason_TIT_2016 (); Sason_Entropy_2018 (); Zhang_arxiv_2019 () present the relationship between some well-known statistical divergences adopted in VI and their -functions. Intuitively, one can perform -VI by minimizing either the forward -divergence or the reverse -divergence , and Murphy_2012 (); Zhang_arxiv_2019 () provide some heuristic discussions on their statistical differences. Since VI based on the reverse KL divergence is more tractable to compute and more statistically sensible, we will develop our -VI framework primarily based on the reverse -divergence, while one can still unify or commute between the forward and reverse -divergences via the dual function , which is also referred to as the perspective function or the conjugate symmetry of in Boyd_2004 (); Sason_TIT_2016 (); Dieng_NIPS_2017 ().

###### Definition 2

Given a function , the dual function is defined as

 f∗(t)=t⋅f(1/t).

One can verify that the dual function has the following properties: i) ; ii) if is convex, is also convex, and iii) if , then . With dual function , an identity between the forward and reverse -divergences can be established Dieng_NIPS_2017 ():

 Df∗(p∥q)=∫p(z)q(z)⋅f(q(z)p(z))⋅q(z) dz=Df(q∥p).

In order to facilitate the derivation of -variational bound, especially when the latent variable model is involved Nowozin_NIPS_2016 (); Zhang_arxiv_2019 (), we introduce a surrogate -divergence defined by the generator function

 fλ(⋅)=f(λ⋅)−f(λ), (2)

where is constant. It is straightforward to verify that and have the same convexity, and implies , which induces a valid (surrogate) -divergence, denoted as , that can virtually replace when needed1. To justify the closeness between divergences and , we first note that and share the same minimum point at , then we have the following statement.

###### Proposition 1

Given two probability distributions and , a convergent sequence , and a convex function such that and is uniformly continuous, the -divergences between and satisfy

 Dfλn(q∥p)→Df(q∥p) (3)

almost everywhere as .

### 2.2 Shifted homogeneity

We then introduce a class of -functions equipped with a structural advantage in decomposition, which will be invoked later to derive the coordinate-wise VI algorithm under mean-field assumption.

###### Definition 3

A convex function belongs to , if , and for any , we have

 f(t~t)=tγf(~t)+f(t)~tη, (4)

where , and . Function is type shifted homogeneous or if , and type shifted homogeneous or if .

This special class of functions allows to decompose an -function into two or more (by iterations) terms, each of which is a product of an -function and an exponent. In Table 1, we show that the -functions of many well-known divergences can be classified as functions.

The duality property between and is stated in Proposition 2.

###### Proposition 2

Given and , the dual functions and .

When , we can establish a more profound relationship, in contrast with Proposition 1, between -divergence and surrogate divergence .

###### Proposition 3

When and an -divergence and its surrogate divergence satisfy

 Dfλ(q∥p)=λγDf(q∥p). (5)

By virtue of the equivalence relationship revealed in Proposition 1 and 3, we can interchangeably use -divergence and surrogate divergence , and the parameter of surrogate divergence provides an additional degree of freedom when deriving the variational bounds and VI algorithms.

## 3 Variational bounds and optimization

While it was difficult to retrieve an -variational bound Tao_ICML_2018 (); Wang_NIPS_2018 (); Zhang_arxiv_2019 (), which is an expectation over and unifies the existing variational bounds Blei_JASA_2017 (); Li_NIPS_2016 (); Dieng_NIPS_2017 (), by directly manipulating the original -divergence in (1), we will show that such a general variational bound can be found when minimizing a crafty surrogate -divergence.

### 3.1 f-variational bounds

Given a convex function such that and a set of i.i.d. samples , the generator function with can induce a surrogate -divergence. Our -VI is then initiated from minimizing the following reverse (surrogate) -divergence

 Dfp(D)−1(q(z)∥p(z|D))=1p(D)⋅Eq(z)[f∗(p(z,D)q(z))]−f(1p(D)). (6)

Multiplying both sides of (6) by and with rearrangements, we have

 Lf(q,D)=Eq(z)[f∗(p(z,D)q(z))]=f∗(p(D))+p(D)⋅Dp(D)−1(q(z)∥p(z|D)). (7)

For a given evidence , we can minimize the -divergence by minimizing the expectation in (7), which is defined as the -variational bound . Consequently, by the non-negativity of -divergence Sason_TIT_2016 (); Sason_Entropy_2018 (), we can establish the following inequality.

###### Theorem 1

Dual function of evidence is bounded above by -variational bound

 Lf(q,D)=Eq(z)[f∗(p(z,D)q(z))]≥f∗(p(D)), (8)

and equality is attained when , i.e. .3

By properly choosing -function, -variational bound and (8) can restore the most existing variational bounds and the corresponding inequalities, e.g. for ELBO in Blei_JASA_2017 () and for CUBO in Dieng_NIPS_2017 (). See Supplementary Material (SM) for more restoration examples and some new variational bounds, e.g. an evidence upper bound (EUBO) under KL divergence. While the assumption of or the existence of in (6) might lay additional restrictions in some situations, we can circumvent them by resorting to the -VI minimizing the forward surrogate -divergence . SM provides more details for this alternative. Additionally, in (8) can be further sharpened by leveraging multiply-weighted posterior samples Burda_ICLR_2016 (), i.e., importance-weighted VI.

###### Corollary 1

When , the importance-weighted -variational bound and the -variational bound satisfy

 Lf(q,D)≥L\rm IWf(q,D,L1)≥L\rm IWf(q,D,L2)L→∞−−−→f∗(p(D)),

where is defined as

 L\rm IWf(q,D,L)=EZ1:L∼q(z)[f∗(1LL∑l=1p(zl,D)q(zl))],

and are i.i.d. samples from .

For clarity and conciseness, we will develop the subsequent results primarily based on . Nevertheless, our readers should feel safe to replace with in the following context and obtain improved outcomes. More interesting results can be observed from (8). After composing both sides of (8) with the inverse dual function , we have the following observations:

1. When the dual function is increasing (or non-decreasing) on , the composition gives an evidence upper bound:

 (f∗)−1∘Lf(q,D)≥p(D).
2. When the dual function is decreasing (or non-increasing) on , the composition gives an evidence lower bound:

 (f∗)−1∘Lf(q,D)≤p(D).
3. When the dual function is non-monotonic on , the composition gives a local evidence bound by applying or on a monotonic interval of .

Based on these observations, we can readily imply a sandwich formula for evidence , which is essential for accurate VI Zhang_TPAMI_2019 ().

###### Corollary 2

Given convex functions and such that , on an interval where is increasing and is decreasing, the evidence satisfies

 (g∗)−1∘Eq(z)[g∗(p(z,D)q(z))]≤p(D)≤(f∗)−1∘Eq(z)[f∗(p(z,D)q(z))]. (9)

The evidence bounds in (9) are akin to the GLBO, which was proposed on the basis of a few assumptions and intuitions in Tao_ICML_2018 (). Corollary 1 and Corollary 2 interprets and supplements GLBO with rigorous -VI analysis and explicit instructions on choosing -function. Compared with the unilateral variational bounds, the bilateral bounds in (9) reveal more information and allow to estimate with more accuracy. To sharpen these bilateral bounds, we need to properly choose the functions and and the recognition model such that and can be attained. For a selected family of , various choices of and will lead to evidence bounds of different sharpness and optimization efficiency. The model selection of approximate distribution is a fundamental problem inherited by all VI algorithms, and a feasible solution is to compare the performance of candidate models while fixing an - or -function Tao_ICML_2018 () or alternating among some common divergences. Once the functions and and the recognition model are determined, we can approximate the optimal distribution in or minimize by adjusting the parameters in , which does not require the dual function or be invertible as in (9) and will be discussed in the succeeding subsections.

### 3.2 Stochastic optimization

While classical VI is limited to conditionally conjugate exponential family models Murphy_2012 (); Blei_JASA_2017 (); Zhang_TPAMI_2019 (), the stochastic optimization makes VI applicable for more modern and complicated problems Hoffman_JMLR_2013 (); Ranganath_ICAIS_2014 (). To minimize with stochastic optimization, we supplement the preceding VI formulation with more details. The approximate model is formulated as , where are the parameters to be optimized. While some papers Mnih_ICML_2014 (); Kingma_ICLR_2014 (); Tao_ICML_2018 () also consider and optimize the parameters in the generative model , we prefer to treat the parameters as latent variables for conciseness. An intuitive approach to apply stochastic optimization is to compute the standard gradient of or w.r.t.

 ∇θLf(qθ,D)=Eqθ(z)[f′(qθ(z)p(z,D))⋅∇θlogqθ(z)], (10)

where denotes . Since is known as the score function in statistics Cox_1979 () and is a part of the REINFORCE algorithm Williams_ML_1992 (); Mnih_ICML_2014 (), (10) is called score function or REINFORCE gradient. An unbiased Monte Carlo (MC) estimator for (10) can be obtained by drawing from and

 ∇θ^Lf(qθ,D)=1KK∑k=1[f′(qθ(zk)p(zk,D))⋅∇θlogqθ(zk)]. (11)

However, since the variance of estimator (11) can be too large to be useful in practice, the score function gradient is usually employed along with some variation reduction techniques, such as the control variates and Rao-Blackwellization Paisley_ICML_2012 (); Mnih_ICML_2014 (); Ranganath_ICAIS_2014 ().

An alternative to the score function gradient is the reparameterization gradient, which empirically has a lower estimation variance Kingma_ICLR_2014 (); Zhang_arxiv_2019 () and can be integrated with neural networks. The reparameterization trick requires the existence of a noise variable and a mapping such that . Instead of directly sampling from , the reparameterization estimators rely on the samples drawn from , for example, a Gaussian latent variable can be reparameterized with a standard Gaussian variable and a mapping . More detailed interpretations as well as recent advances in the reparameterization trick can be found in Kingma_ICLR_2014 (); Ruiz_NIPS_2016 (); Figurnov_NIPS_2018 (); Jankowiak_ICML_2018 (). The gradient of after reparameterization becomes

 ∇θLrepf(qθ,D)=∇θEp(ε)[f∗(p(gθ(ε),D)qθ(gθ(ε)))]. (12)

An unbiased MC estimator for (12) is

 ∇θ^Lrepf(qθ,D)=1KK∑k=1∇θf∗(p(gθ(εk),D)qθ(gθ(εk))), (13)

where are drawn from . We also give an unbiased MC estimator for the importance-weighted reparameterization gradient in (14), which will be utilized in later experiments:

 ∇θ^LIW, repf(qθ,D,L)=1KK∑k=1∇θf∗(1LL∑l=1p(gθ(εk,l),D)qθ(gθ(εk,l))), (14)

where noise samples are drawn from . All the aforementioned estimators for -variational bounds and gradients are unbiased, while composing these estimator with other functions, e.g. inverse dual functions in (9), can sometimes trade the unbiasedness for numerical stability Li_NIPS_2016 (); Dieng_NIPS_2017 (); Tao_ICML_2018 ().

Nonetheless, the preceding estimators and VI algorithms rely on the full dataset and can be handicapped to tackle the problems with large datasets. Meanwhile, since the properties of -functions are flexible, it is non-trivial to represent the -variational bounds by the expectation on a datapoint-wise loss, except for some specific divergences, such as KL divergence Kingma_ICLR_2014 () or divergences with dual functions satisfying , i.e. with . Therefore, to deploy the mini-batch training, we integrate the aforementioned estimators with the average likelihood technique Li_NIPS_2016 (). Given a mini-batch of datapoints , we approximate the full log-likelihood by . Hence, the ratio  in (10-14) can be approximated by . When contains local hidden variables, the prior distribution and approximate distribution should also be approximated accordingly. This proxy to the full dataset wraps up our black-box -VI algorithm, which is essentially a stochastic optimization algorithm that only relies on a mini-batch of data in each iteration. A reference black-box -VI algorithm and the optimization schemes for a few concrete divergences are given in the SM.

### 3.3 Mean-field approximation

Mean-field approximation, which simplifies the original VI problem for tractable computation, is historically an important VI algorithm before the emergence of stochastic VI. As the cornerstone of several variational message passing algorithms Winn_JMLR_2005 (); Wand_BA_2011 (), mean-field VI is still evolving Knowles_NIPS_2011 (); Wang_JMLR_2013 (); Blei_JASA_2017 (); Zhang_TPAMI_2019 () and worthy to be generalized for -VI. A mean-field approximation assumes that all latent variables are independent, and the recognition model can be fully factorized as , which simplifies the derivations and computation but might lead to less accurate results. The mean-field -VI algorithm alternatively updates each marginal distribution to minimize the -variational bound . For the -divergences with , such as KL divergence, the coordinate-wise update rule for is obtained from fixing the other variational factors and singling out from -variational bound in (8), which gives

 (15)

For the -divergences with , such as - or Rényi’s -divergences, the coordinate-wise update rule for is obtained by applying the same procedures to the -variational bound from the forward -VI (see SM), which gives

 q∗j(zj)∝f−1(Eq−j[f(p(z,D)q−j(z−j))]). (16)

When deriving these mean-field -VI update rules (see SM), we only exploit the homogeneity of - or -function. CAVI Bishop_2006 (); Blei_JASA_2017 (), EP Minka_UAI_2001 (), and other types of mean-field VI algorithms can be restored from (15) and (16) by choosing a proper - or -function. A reference mean-field VI algorithm along with a concrete realization example under KL divergence is provided in the SM. When the inverse function or in (15) or (16) is not analytically solvable, we can either generate a lookup table for or and numerically evaluate (15) or (16) or resort to the stochastic -VI.

## 4 Experiments

The effectiveness and the wide applicability of -VI are demonstrated with three empirical examples in this section. We first verify the theoretical results with a synthetic example. The -VI is then respectively implemented for a Bayesian neural network for linear regression and a VAE for image reconstruction and generation. Adam optimizer with recommended parameters in Kingma_ICLR_2015 () is employed for stochastic optimization, if not specified. Empirical results and data are reported by their mean value and confidence intervals. More detailed descriptions on the experimental settings, supplemental results, and the demonstration of the mean-field approximation method are provided in the SM.

### 4.1 Synthetic example

We first demonstrate the -VI theory with a vanilla example. Consider a batch of i.i.d. datapoints generated by a latent variable model , , where denotes a univariate normal distribution with mean and variance , and denotes a uniform distribution on interval . Subsequently, for simplicity, we posit a prior distribution , a likelihood distribution , and an approximate model , which is a uniform distribution centered at with width . To verify the order and the sharpness of -variational bounds, we fix and approximate the true evidence , IW-RVB (), (IW-)CUBO (), and (IW-)ELBO () in Figure 1(a), which substantiates Theorem 1, Corollary 1 and 2. A variational bound associated with the total variation distance, an -divergence with non-monotonic function, is analyzed in the SM, and more approximation results when can be found in Tao_ICML_2018 (). To demonstrate the effectiveness of stochastic -VI algorithm, we set an initial value and update the recognition distribution by optimizing the IW-RVB (), (IW-)CUBO (), and (IW-)ELBO. The IW-reparameterization gradient (14) with and is adopted for the training on a dataset of observations, and the -variational bounds in Figure 1(b) are evaluated on a test set of observations. The sandwich-type bounds in Figure 1(b) give an estimate of the test log-evidence, which is roughly between and .

### 4.2 Bayesian neural network

We then implement the -VI for a single-layer neural network for Bayesian linear regression. Our experimental setup generally follows the regression settings in Li_NIPS_2016 (), while some parameters vary to adapt to the -VI framework. The linear regression is performed with twelve datasets from the UCI Machine Learning Repository UCI (). Each dataset is randomly split into for training and testing, and six different dual functions in are selected such that three well-established -VIs (KL-VI, Rényi’s -VI with , and -VI with ) and three new -VIs (VIs subject to total variation distance and two custom -divergences) are tested and compared. One of the custom -divergences, inspired by Bamler_NIPS_2017 (), is defined by a convex dual function , where , , and is a parameter to be optimized. The IW-reparameterization gradient with , and mini-batch size of 32 is employed for training. After 20 trials with 500 training epochs in each trial, the regression results are evaluated by the test root mean squared error (RMSE) and test negative log-likelihood reported in Table 2. The performance of custom -VI matches the results of well-established -VIs on most datasets, and the custom -VI quantitatively outperforms others on some datasets, e.g. Fish Toxicity and Stock. A complete version of Table 2, including the regression results of the other two new -VIs, and more detailed descriptions on the training process, such as the architecture of neural network, training parameters, numerical stability and estimator biasedness, are provided in the SM.

### 4.3 Bayesian variational autoencoder

We also integrate the -VI with a Bayesian VAE for image reconstruction and generation on the datasets of Caltech 101 Silhouettes Caltech101 (), Frey Face FreyFace (), MNIST MNIST (), and Omniglot OMNIGLOT (). By replacing the conventional ELBO loss function of VAE Kingma_ICLR_2014 (); MATLAB_VAE () with the more flexible -variational bound loss functions, we test and compare the -VAEs associated with three well-known -divergences (KL-divergence, Rényi’s -divergence with , and -divergence with ) and three new -divergences (total variation distance and two custom -divergences). The dual function for total variation distance is . The custom -variational bound loss is induced by the aforementioned dual function with . The custom -variational bound loss is induced by dual function , which is convex on . The reparameterization gradient with , is used for training. After 20 trials with 200 training epochs in each trial, the average test reconstruction errors (lower is better) measured by cross-entropy are listed in Table 3. In -VAE example, the performances of three new -VIs also rival the results of three well-known -VIs on most datasets. Reconstructed and generated images, architectures of the encoder and decoder networks, and more detailed interpretations on the custom -functions and training process of -VAEs are given in the SM.

## 5 Conclusion

We have introduced a general -divergence VI framework equipped with a rigorous theoretical analysis and a standardized optimization solution, which together extend the current VI methods to a broader range of statistical divergences. Empirical experiments on the popular benchmarks imply that this -VI method is flexible, effective, and widely applicable, and some custom -VI instances can attain state-of-the art results. Future work on -VI may include finding the -VI instances with more favorable properties, more efficient -VI optimization methods, and VI frameworks and theories that are more universal than the -VI.

This work does not present any foreseeable societal consequence.

{ack}

This work was supported by AFSOR under Grant FA9550-15-1-0518 and NSF NRI under Grant ECCS-1830639. The authors would like to thank the anonymous editors and reviewers for their constructive comments, Dr. Xinyue Chang (Iowa State Univ.), Lei Ding (Univ. of Alberta), Zhaobin Kuang (Stanford), Yang Wang (Univ. of Alabama), and Yanbo Xu (Georgia Tech.) for their helpful suggestions, and Prof. Evangelos A. Theodorou for his insightful and heuristic comments on this paper. In this arXiv version, the authors would also like to thank the readers and staff on arXiv.org.

Supplementary Material for
-Divergence Variational Inference”

Neng Wan
nengwan2@illinois.edu
&Dapeng Li
dapeng.ustc@gmail.com
&Naira Hovakimyan
nhovakim@illinois.edu

4

University of Illinois at Urbana-Champaign, Urbana, IL 61801

Anker Innovations, Shenzhen, China

This supplementary material provides additional details for some results in the original paper.

## Appendix A Proofs of the main results

This section provides i) elaboration on the surrogate -divergence including the proofs of Proposition 1, Proposition 2 and Proposition 3, ii) deviations of the -variational bound generated from both the reverse and forward surrogate -divergence, and iii) an importance-weighted -variational bound and the proof of Corollary 1.

### a.1 Proof of Proposition 1

We first expand the LHS of (3) by substituting the definitions of -divergence (1) and generator function (2).

 limn→∞Dfλn(q∥p)=limn→∞∫p(z)⋅[f(λn⋅q(z)p(z))−f(λn)] dz=limn→∞∫p(z)⋅f(λn⋅q(z)p(z)) dz−limn→∞f(λn)⋅∫p(z) dz=limn→∞∫p(z)⋅f(λn⋅q(z)p(z)) dz.

In order to prove (3), we only need to show that

 limn→∞∫p(z)⋅f(λn⋅q(z)p(z)) dz=∫limn→∞p(z)⋅f(λn⋅q(z)p(z)) dz=Df(q∥p), (17)

which can be proved by showing that function is continuous in , since the continuity of brings each convergent sequence in to a convergent sequence in . The continuity of can be justified as follows. For arbitrary and , there exists such that

 |g(λ+δ)−g(λ)|=∣∣ ∣∣∫p(z)⋅[f((λ+δ)⋅q(z)p(z))−f(λ⋅q(z)p(z))]dz∣∣ ∣∣≤∫p(z)⋅∣∣ ∣∣f((λ+δ)⋅q(z)p(z))−f(λ⋅q(z)p(z))∣∣ ∣∣dz≤∫p(z)⋅ϵ dz=ε,

where we have used the uniform continuity of . This completes the proof.

### a.2 Proof of Proposition 2

We first consider the scenario when . Since

 f∗(t~t)=t~t⋅f(1t~t)=t~t⋅[(1t)γ0⋅f(1~t)+f(1t)]=t1−γ0⋅f∗0(~t)+f∗0(t)⋅~t,

by letting , we can conclude that . We then consider the case when . Since

 f∗(t~t) =t~t⋅f(1t~t) =t~t⋅[(1t)γ1⋅f(1~t)+f(1t)⋅1~t] =t1−γ1⋅f∗1(~t)+f∗1(t),

by letting , we can conclude that . This completes the proof.

### a.3 Proof of Proposition 3

We start this proof by substituting (1), (2) and (4) into the LHS of (5)

Since when , and when , we have

 Dfλ(q∥p)=λγDf(q∥p).

This completes the proof.

### a.4 f-variational bound from reverse divergence

We provide detailed steps for deriving (6), which is a preliminary step for Theorem 1 and the -variational bound induced by forward surrogate -divergence. A forward surrogate -divergence can be decomposed as

### a.5 f-variational bound from forward divergence

As we mentioned in Section 3.1, the assumption on or the existence of in (6) can be circumvented by using the -VI that minimizes the forward surrogate -divergence . Meanwhile, in Section 3.3, the coordinate-wise update rule (16) for is also based on the -variational bound induced by . The -variational bound and a sandwich estimate of evidence from forward surrogate -divergence are derived below. First, we notice that the forward surrogate -divergence can be decomposed as follows

 Dfp(D)(p(z|D)∥q(z)) =∫q(z)⋅fp(D)(p(z|D)q(z))dz =∫q(z)⋅[f(p(z,D)q(z)⋅p(D)⋅p(D))−f(p(D))]dz =Eq(z)[f(p(z,D)q(z))]−f(p(D)).

By the non-negativity of -divergence [17], i.e. , the -variational bound from forward divergence follows

 Lf(q,D)=Eq(z)[f(p(z,D)q(z))]≥f(p(D)), (18)

where equality holds when . Inequality (18) formulates the -variational bound induced by forward divergence and supplements Theorem 1, which is based on the reverse -divergence. Given convex functions and such that , on an interval where is non-decreasing and is non-increasing, a sandwich estimate of evidence is given as follows

 (g)