
# Frequentist Consistency of Variational Bayes

Yixin Wang
Department of Statistics
Columbia University
yixin.wang@columbia.edu

David M. Blei
Department of Statistics
Department of Computer Science
Columbia University
david.blei@columbia.edu
July 15, 2019
###### Abstract

A key challenge for modern Bayesian statistics is how to perform scalable inference of posterior distributions. To address this challenge, variational Bayes (VB) methods have emerged as a popular alternative to the classical Markov chain Monte Carlo (MCMC) methods. VB methods tend to be faster while achieving comparable predictive performance. However, there are few theoretical results around VB. In this paper, we establish frequentist consistency and asymptotic normality of VB methods. Specifically, we connect VB methods to point estimates based on variational approximations, called frequentist variational approximations, and we use the connection to prove a variational Bernstein–von Mises theorem. The theorem leverages the theoretical characterizations of frequentist variational approximations to understand asymptotic properties of VB. In summary, we prove that (1) the VB posterior converges to the Kullback-Leibler (KL) minimizer of a normal distribution, centered at the truth, and (2) the corresponding variational expectation of the parameter is consistent and asymptotically normal. As applications of the theorem, we derive asymptotic properties of VB posteriors in Bayesian mixture models, Bayesian generalized linear mixed models, and Bayesian stochastic block models. We conduct a simulation study to illustrate these theoretical results.


Keywords: Bernstein–von Mises, Bayesian inference, variational methods, consistency, asymptotic normality, statistical computing

## 1 Introduction

Bayesian modeling is a powerful approach for discovering hidden patterns in data. We begin by setting up a probability model of latent variables and observations. We incorporate prior knowledge by setting priors on latent variables and a functional form of the likelihood. Finally we infer the posterior, the conditional distribution of the latent variables given the observations.

For many modern Bayesian models, exact computation of the posterior is intractable and statisticians must resort to approximate posterior inference. For decades, Markov chain Monte Carlo (MCMC) sampling (Hastings:1970; gelfand1990sampling; Robert:2004) has maintained its status as the dominant approach to this problem. MCMC algorithms are easy to use and theoretically sound. In recent years, however, data sizes have soared. This challenges MCMC methods, for which convergence can be slow, and calls for scalable alternatives. One popular class of alternatives is variational Bayes (VB) methods.

To describe VB, we introduce notation for the posterior inference problem. Consider observations $x = x_{1:n}$. We posit local latent variables $z = z_{1:n}$, one per observation, and global latent variables $\theta = \theta_{1:d}$. This gives a joint,

$$p(\theta, x, z) = p(\theta) \prod_{i=1}^{n} p(z_i \mid \theta)\, p(x_i \mid z_i, \theta). \tag{1}$$

The posterior inference problem is to calculate the posterior $p(\theta, z \mid x)$.

This division of latent variables is common in modern Bayesian statistics. (Footnote 1: In particular, our results are applicable to general models with local and global latent variables (Hoffman:2013). The number of local variables $z$ increases with the sample size $n$; the number of global variables $\theta$ does not. We also note that the conditional independence of Equation 1 is not necessary for our results. But we use this common setup to simplify the presentation.) In the Bayesian Gaussian mixture model (GMM) (roberts1998bayesian), the component means, covariances, and mixture proportions are global latent variables; the mixture assignments of each observation are local latent variables. In the Bayesian generalized linear mixed model (GLMM) (breslow1993approximate), the intercept and slope are global latent variables; the group-specific random effects are local latent variables. In the Bayesian stochastic block model (SBM) (Wiggins:2008), the cluster assignment probabilities and edge probabilities matrix are two sets of global latent variables; the node-specific cluster assignments are local latent variables. In the latent Dirichlet allocation (LDA) model (blei2003latent), the topic-specific word distributions are global latent variables; the document-specific topic distributions are local latent variables. We will study all these examples below.

VB methods formulate posterior inference as an optimization (jordan1999introduction; wainwright2008graphical; Blei:2016). We consider a family of distributions of the latent variables and then find the member of that family that is closest to the posterior.

Here we focus on mean-field variational inference (though our results apply more widely). First, we posit a family of factorizable probability distributions on latent variables

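$$\mathcal{Q}^{n+d} = \Big\{ q : q(\theta, z) = \prod_{i=1}^{d} q_{\theta_i}(\theta_i) \prod_{j=1}^{n} q_{z_j}(z_j) \Big\}.$$

This family is called the mean-field family. It represents a joint of the latent variables with $n + d$ (parametric) marginal distributions, $\{q_{\theta_1}, \ldots, q_{\theta_d}, q_{z_1}, \ldots, q_{z_n}\}$.

VB finds the member of the family closest to the exact posterior $p(\theta, z \mid x)$, where closeness is measured by KL divergence. Thus VB seeks to solve the optimization,

$$q^*(\theta, z) = \arg\min_{q(\theta, z) \in \mathcal{Q}^{n+d}} \mathrm{KL}\big(q(\theta, z) \,\|\, p(\theta, z \mid x)\big). \tag{2}$$

In practice, VB finds $q^*(\theta, z)$ by optimizing an alternative objective, the evidence lower bound (ELBO),

$$\mathrm{ELBO}(q(\theta, z)) = \int\!\!\int q(\theta, z) \log \frac{p(\theta, x, z)}{q(\theta, z)} \, d\theta \, dz. \tag{3}$$

This objective is called the ELBO because it is a lower bound on the evidence $\log p(x)$. More importantly, the ELBO is equal to the negative KL plus $\log p(x)$, which does not depend on $q(\cdot)$. Maximizing the ELBO minimizes the KL (jordan1999introduction).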
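The identity between the ELBO, the KL divergence, and the log evidence can be checked numerically. The sketch below is only an illustration under simplifying assumptions: all latent variables are collapsed into a single discrete variable `z` with three states, and the joint values in `joint` are made up for the example.

```python
import math

def elbo(q, joint):
    """ELBO = sum_z q(z) * [log p(x, z) - log q(z)] for a discrete latent z."""
    return sum(qz * (math.log(pz) - math.log(qz))
               for qz, pz in zip(q, joint) if qz > 0)

def kl(q, p):
    """KL(q || p) for two discrete distributions on the same support."""
    return sum(qz * math.log(qz / pz) for qz, pz in zip(q, p) if qz > 0)

# Hypothetical joint p(x, z) at one fixed observation x, for z in {0, 1, 2}.
joint = [0.10, 0.25, 0.05]
evidence = sum(joint)                          # p(x)
posterior = [pz / evidence for pz in joint]    # p(z | x)

q = [0.2, 0.5, 0.3]                            # an arbitrary variational distribution
gap = math.log(evidence) - elbo(q, joint)      # equals KL(q || posterior)
elbo_at_posterior = elbo(posterior, joint)     # the bound is tight at the posterior
```

Because the gap between the log evidence and the ELBO is exactly the KL divergence, raising the ELBO over a family of `q` is the same as lowering the KL to the posterior over that family.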

The optimum $q^*(\theta, z) = q^*(\theta)\, q^*(z)$ approximates the posterior, and we call it the VB posterior. (Footnote 2: For simplicity we will write $q(\theta, z) = \prod_{i=1}^{d} q(\theta_i) \prod_{j=1}^{n} q(z_j)$, omitting the subscript on the factors $q(\cdot)$. The understanding is that the factor is indicated by its argument.) Though it cannot capture posterior dependence across latent variables, it can hope to capture each of their marginals. In particular, this paper is about the theoretical properties of $q^*(\theta)$, the VB posterior of $\theta$. We will also focus on the corresponding expectation of the global variable, i.e., an estimate of the parameter. It is

$$\hat{\theta}_n^* := \int \theta \cdot q^*(\theta)\, d\theta.$$

We call $\hat{\theta}_n^*$ the variational Bayes estimate (VBE).

VB methods are fast and yield good predictive performance in empirical experiments (Blei:2016). However, there are few rigorous theoretical results. In this paper, we prove that (1) the VB posterior converges in total variation (TV) distance to the KL minimizer of a normal distribution centered at the truth and (2) the VBE is consistent and asymptotically normal.

These theorems are frequentist in the sense that we assume the data come from $p(x, z \mid \theta_0)$ with a true (nonrandom) $\theta_0$. We then study properties of the corresponding posterior distribution $p(\theta \mid x)$ when approximating it with variational inference. What this work shows is that the VB posterior is consistent even though the mean field approximating family can be a crude approximation. In this sense, VB is a theoretically sound approximate inference procedure.

### 1.1 Main ideas

We describe the results of the paper. Along the way, we will need to define some terms: the variational frequentist estimate (VFE), the variational log likelihood, the VB posterior, the VBE, and the VB ideal. Our results center around the VB posterior and the VBE. (Table 1 contains a glossary of terms.)

The variational frequentist estimate (VFE) and the variational log likelihood.  The first idea that we define is the variational frequentist estimate (VFE). It is a point estimate of $\theta$ that maximizes a local variational objective with respect to an optimal variational distribution of the local variables. (The VFE treats the variable $\theta$ as a parameter rather than a random variable.) We call the objective the variational log likelihood,

$$M_n(\theta; x) = \max_{q(z)} \mathbb{E}_{q(z)}\big[\log p(x, z \mid \theta) - \log q(z)\big]. \tag{4}$$

In this objective, the optimal variational distribution $q^\dagger(z)$ solves the local variational inference problem,

$$q^\dagger(z) = \arg\min_{q} \mathrm{KL}\big(q(z) \,\|\, p(z \mid x, \theta)\big). \tag{5}$$

Note that $q^\dagger(z)$ implicitly depends on both the data $x$ and the parameter $\theta$.

With the objective defined, the VFE is

$$\hat{\theta}_n = \arg\max_{\theta} M_n(\theta; x). \tag{6}$$

It is usually calculated with variational expectation maximization (EM) (wainwright2008graphical; ormerod2010explaining), which iterates between the E step of Equation 5 and the M step of Equation 6. Recent research has explored the theoretical properties of the VFE for stochastic block models (bickel2013asymptotic), generalized linear mixed models (hall2011asymptotic), and Gaussian mixture models (westling2015establishing).
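The alternation between the E step of Equation 5 and the M step of Equation 6 can be sketched on a toy model. The code below is an illustration under stated assumptions, not the procedure used in the papers cited above: the model is a hypothetical symmetric two-component mixture, $z_i \sim \mathrm{Bernoulli}(1/2)$ and $x_i \mid z_i \sim \mathcal{N}(\pm\theta, 1)$, and the family for $q(z)$ is left unrestricted, so the E step returns the exact conditional and the iteration reduces to ordinary EM. All function names are made up for the example.

```python
import math
import random

def log_lik(theta, xs):
    """M_n(theta; x) when q(z) is unrestricted: the exact log likelihood."""
    total = 0.0
    for x in xs:
        a = -0.5 * (x - theta) ** 2
        b = -0.5 * (x + theta) ** 2
        m = max(a, b)  # log-sum-exp for numerical stability
        total += m + math.log(0.5 * math.exp(a - m) + 0.5 * math.exp(b - m))
        total -= 0.5 * math.log(2 * math.pi)
    return total

def e_step(theta, xs):
    """Equation 5: q(z_i = 1) = p(z_i = 1 | x_i, theta), the +theta component."""
    return [1.0 / (1.0 + math.exp(-2.0 * theta * x)) for x in xs]

def m_step(resp, xs):
    """Equation 6 with q(z) held fixed: closed-form maximizer in theta."""
    return sum((2.0 * r - 1.0) * x for r, x in zip(resp, xs)) / len(xs)

random.seed(0)
theta_true = 2.0
xs = [random.gauss(theta_true if random.random() < 0.5 else -theta_true, 1.0)
      for _ in range(500)]

theta = 0.5                      # arbitrary starting value
trace = [log_lik(theta, xs)]     # objective value after each iteration
for _ in range(50):
    theta = m_step(e_step(theta, xs), xs)
    trace.append(log_lik(theta, xs))
```

With a restricted family for $q(z)$, the E step would return the best member of that family rather than the exact conditional, and `log_lik` would be replaced by the corresponding variational log likelihood.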

We make two remarks. First, the maximizing variational distribution $q^\dagger(z)$ of Equation 5 is different from $q^*(z)$ in the VB posterior: $q^\dagger(z)$ is implicitly a function of individual values of $\theta$, while $q^*(z)$ is implicitly a function of the variational distribution $q(\theta)$. Second, the variational log likelihood in Equation 4 is similar to the original objective function for the EM algorithm (Dempster:1977). The difference is that the EM objective is an expectation with respect to the exact conditional $p(z \mid x, \theta)$, whereas the variational log likelihood uses a variational distribution $q(z)$.

Variational Bayes and ideal variational Bayes.  While earlier applications of variational inference appealed to variational EM and the VFE, most modern applications do not. Rather they use VB, as we described above, where there is a prior on $\theta$ and we approximate its posterior with a global variational distribution $q(\theta)$. One advantage of VB is that it provides regularization through the prior. Another is that it requires only one type of optimization: the same considerations around updating the local variational factors $q(z)$ are also at play when updating the global factor $q(\theta)$.

To develop theoretical properties of VB, we connect the VB posterior to the variational log likelihood; this is a stepping stone to the final analysis. In particular, we define the VB ideal posterior $\pi^*(\theta \mid x)$,

$$\pi^*(\theta \mid x) = \frac{p(\theta) \exp\{M_n(\theta; x)\}}{\int p(\theta) \exp\{M_n(\theta; x)\}\, d\theta}. \tag{7}$$

Here the local latent variables are constrained under the variational family but the global latent variables are not. Note that because it depends on the variational log likelihood $M_n(\theta; x)$, this distribution implicitly contains an optimal variational distribution $q^\dagger(z)$ for each value of $\theta$; see Equations 5 and 4.

Loosely, the VB ideal lies between the exact posterior $p(\theta \mid x)$ and a variational approximation $q(\theta)$. It recovers the exact posterior when $p(z \mid \theta, x)$ degenerates to a point mass and $q^\dagger(z)$ is always equal to $p(z \mid \theta, x)$; in that case the variational log likelihood is equal to the log likelihood and Equation 7 is the posterior. But $q^\dagger(z)$ is usually an approximation to the conditional. Thus the VB ideal usually falls short of the exact posterior.
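To make the relation between the VB ideal and the exact posterior concrete, the sketch below evaluates Equation 7 on a grid of $\theta$ values for two choices of the family for $q(z)$. Everything specific here is an assumption made for illustration: the toy model ($z_i \sim \mathrm{Bernoulli}(1/2)$, $x_i \mid z_i \sim \mathcal{N}(\pm\theta, 1)$), the made-up data, the $\mathcal{N}(0, 5^2)$ prior, and the deliberately crude "point mass" family for $q(z)$.

```python
import math

def log_joint(x, z, theta):
    """log p(x, z | theta): z ~ Bernoulli(1/2), x | z ~ N(+theta or -theta, 1)."""
    mean = theta if z == 1 else -theta
    return math.log(0.5) - 0.5 * math.log(2 * math.pi) - 0.5 * (x - mean) ** 2

def m_exact(theta, xs):
    """Unrestricted q(z): the maximum in Equation 4 is log p(x | theta)."""
    total = 0.0
    for x in xs:
        a, b = log_joint(x, 0, theta), log_joint(x, 1, theta)
        m = max(a, b)
        total += m + math.log(math.exp(a - m) + math.exp(b - m))
    return total

def m_point(theta, xs):
    """q(z) restricted to point masses: zero entropy, so each term is
    max_z log p(x_i, z | theta), which never exceeds log p(x_i | theta)."""
    return sum(max(log_joint(x, 0, theta), log_joint(x, 1, theta)) for x in xs)

def log_prior(theta):
    """N(0, 5^2) prior on theta, up to an additive constant."""
    return -0.5 * theta * theta / 25.0

def vb_ideal(grid, m_n, xs):
    """Equation 7 on a grid: normalize p(theta) * exp(M_n(theta; x))."""
    logs = [log_prior(t) + m_n(t, xs) for t in grid]
    top = max(logs)
    weights = [math.exp(v - top) for v in logs]
    norm = sum(weights)
    return [w / norm for w in weights]

xs = [1.8, -2.3, 2.1, 0.4, -1.7, 2.6]        # made-up observations
grid = [i / 100.0 for i in range(1, 401)]    # theta values in (0, 4]

exact_post = vb_ideal(grid, m_exact, xs)     # coincides with the exact posterior
ideal_point = vb_ideal(grid, m_point, xs)    # VB ideal under the restricted family
```

With the unrestricted family the VB ideal is the exact posterior on the grid; with the restricted family the variational log likelihood is a lower bound whose gap varies with $\theta$, so the two grid distributions differ.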

That said, the VB ideal is more complex than a simple parametric variational factor $q(\theta)$. The reason is that its value for each $\theta$ is defined by the optimization within $M_n(\theta; x)$. Such a distribution will usually lie outside the distributions attainable with a simple family.

In this work, we first establish the theoretical properties of the VB ideal. We then connect it to the VB posterior.

Variational Bernstein–von Mises.  We have set up the main concepts. We now describe the main results.

Suppose the data come from a true (finite-dimensional) parameter $\theta_0$. The classical Bernstein–von Mises theorem says that, under certain conditions, the exact posterior approaches a normal distribution, independent of the prior, as the number of observations tends to infinity. In this paper, we extend the theory around Bernstein–von Mises to the variational posterior. Here we summarize our results.

• Lemma 1 shows that the VB ideal is consistent and converges to a normal distribution around the VFE. If the VFE is consistent, the VB ideal converges to a normal distribution whose mean parameter is a random vector centered at the true parameter. (Note the randomness in the mean parameter is due to the randomness in the observations $x$.)

• We next consider the point in the variational family that is closest to the VB ideal in KL divergence. Lemma 2 and Lemma 3 show that this KL minimizer is consistent and converges to the KL minimizer of a normal distribution around the VFE. If the VFE is consistent (bickel2013asymptotic; hall2011asymptotic) then the KL minimizer converges to the KL minimizer of a normal distribution with a random mean centered at the true parameter.

• Lemma 4 shows that the VB posterior enjoys the same asymptotic properties as the KL minimizers of the VB ideal.

• Theorem 5 is the variational Bernstein–von Mises theorem. It shows that the VB posterior is asymptotically normal around the VFE. Again, if the VFE is consistent then the VB posterior converges to a normal with a random mean centered at the true parameter. Further, Theorem 6 shows that the VBE is consistent for the true parameter and asymptotically normal.

• Finally, we prove two corollaries. First, if we use a full rank Gaussian variational family then the corresponding VB posterior recovers the true mean and covariance. Second, if we use a mean-field Gaussian variational family then the VB posterior recovers the true mean and the marginal variance, but not the off-diagonal terms. The mean-field VB posterior is underdispersed.

Related work. This work draws on two themes. The first is the body of work on theoretical properties of variational inference. you2014variational and ormerod2014variational studied variational Bayes for a classical Bayesian linear model. They used normal priors and spike-and-slab priors on the coefficients, respectively. wang2004convergence studied variational Bayesian approximations for exponential family models with missing values. wang2005inadequacy and wang2006convergence analyzed variational Bayes in Bayesian mixture models with conjugate priors. More recently, zhang2017theoretical studied mean field variational inference in stochastic block models (SBMs) with a batch coordinate ascent algorithm: it has a linear convergence rate and converges to the minimax rate within $\log n$ iterations. sheth2017excess proved a bound for the excess Bayes risk using variational inference in latent Gaussian models. ghorbani2018instability studied a version of latent Dirichlet allocation (LDA) and identified an instability in variational inference in certain signal-to-noise ratio (SNR) regimes. zhang2017convergence characterized the convergence rate of variational posteriors for nonparametric and high-dimensional inference. pati2017statistical provided general conditions for obtaining optimal risk bounds for point estimates acquired from mean field variational Bayes. alquier2016properties and alquier2017concentration studied the concentration of variational approximations of Gibbs posteriors and fractional posteriors based on PAC-Bayesian inequalities. yang2017alpha proposed $\alpha$-variational inference and developed variational inequalities for the Bayes risk under the variational solution.

On the frequentist side, hall2011theory; hall2011asymptotic established the consistency of Gaussian variational EM estimates in a Poisson mixed-effects model with a single predictor and a grouped random intercept. westling2015establishing studied the consistency of variational EM estimates in mixture models through a connection to M-estimation. celisse2012consistency and bickel2013asymptotic proved the asymptotic normality of parameter estimates in the SBM under a mean field variational approximation.

However, many of these treatments of variational methods, Bayesian or frequentist, are constrained to specific models and priors. Our work broadens these works by considering more general models. Moreover, the frequentist works focus on estimation procedures under a variational approximation. We expand on these works by proving a variational Bernstein–von Mises theorem, leveraging the frequentist results to analyze VB posteriors.

The second theme is the Bernstein–von Mises theorem. The classical (parametric) Bernstein–von Mises theorem roughly says that the posterior distribution of $\sqrt{n}(\theta - \theta_0)$ “converges”, under the true parameter value $\theta_0$, to $\mathcal{N}(X, 1/I(\theta_0))$, where $X \sim \mathcal{N}(0, 1/I(\theta_0))$ and $I(\theta_0)$ is the Fisher information (ghosh2003bayesian; van2000asymptotic; le1953some; le2012asymptotics). Early forms of this theorem date back to Laplace, Bernstein, and von Mises (laplace1809memoire; bernstein; vonmises). A version also appeared in lehmann2006theory. kleijn2012bernstein established the Bernstein–von Mises theorem under model misspecification. Recent advances include extensions to semiparametric cases (murphy2000profile; kim2006bernstein; de2009bernstein; rivoirard2012bernstein; bickel2012semiparametric; castillo2012semiparametric; castillo2012semiparametric2; castillo2014bernstein; panov2014critical; castillo2015bernstein; ghosal2017fundamentals) and nonparametric cases (cox1993analysis; diaconis1986consistency; diaconis1997consistency; diaconis1998consistency; freedman1999wald; kim2004bernstein; boucheron2009bernstein; james2008large; johnstone2010high; bontemps2011bernstein; kim2009bernstein; knapik2011bayesian; leahu2011bernstein; rivoirard2012bernstein; castillo2012nonparametric; castillo2013nonparametric; spokoiny2013bernstein; castillo2014bayesian; castillo2014bernstein; ray2014adaptive; panov2015finite; lu2017bernstein). In particular, lu2016gaussapprox proved a Bernstein–von Mises type result for Bayesian inverse problems, characterizing Gaussian approximations of probability measures with respect to the KL divergence. Below, we borrow proof techniques from lu2016gaussapprox. But we move beyond the Gaussian approximation to establish the consistency of variational Bayes.

This paper.  The rest of the paper is organized as follows. Section 2 characterizes theoretical properties of the VB ideal. Section 3 contains the central results of the paper. It first connects the VB ideal and the VB posterior. It then proves the variational Bernstein–von Mises theorem, which characterizes the asymptotic properties of the VB posterior and the VB estimate. Section 4 studies three models under this theoretical lens, illustrating how to establish consistency and asymptotic normality of specific VB estimates. Section 5 reports simulation studies to illustrate these theoretical results. Finally, Section 6 concludes the paper with a discussion.

## 2 The VB ideal

To study the VB posterior $q^*(\theta)$, we first study the VB ideal of Equation 7. In the next section we connect it to the VB posterior.

Recall the VB ideal is

$$\pi^*(\theta \mid x) = \frac{p(\theta) \exp(M_n(\theta; x))}{\int p(\theta) \exp(M_n(\theta; x))\, d\theta},$$

where $M_n(\theta; x)$ is the variational log likelihood of Equation 4. If we embed the variational log likelihood $M_n(\theta; x)$ in a statistical model of $x$, this model has likelihood

$$\ell(\theta; x) \propto \exp(M_n(\theta; x)).$$

We call it the frequentist variational model. The VB ideal is thus the classical posterior under the frequentist variational model $\ell(\theta; x)$; the VFE is the classical maximum likelihood estimate (MLE).

Consider the results around frequentist estimation of $\theta$ under variational approximations of the local variables $z$ (bickel2013asymptotic; hall2011asymptotic; westling2015establishing). These works consider asymptotic properties of estimators that maximize $M_n(\theta; x)$ with respect to $\theta$. We will first leverage these results to prove properties of the VB ideal and its KL minimizer in the mean field variational family $\mathcal{Q}^d$. Then we will use these properties to study the VB posterior, which is what is estimated in practice.

This section relies on the consistent testability and the local asymptotic normality (LAN) of $M_n(\theta; x)$ (defined later) to show the VB ideal is consistent and asymptotically normal. We will then show that its KL minimizer in the mean field family is also consistent and converges to the KL minimizer of a normal distribution in TV distance.

These results are not surprising. Suppose the variational log likelihood behaves similarly to the true log likelihood, i.e., they produce consistent parameter estimates. Then, in the spirit of the classical Bernstein–von Mises theorem under model misspecification (kleijn2012bernstein), we expect the VB ideal to be consistent as well. Moreover, the approximation through a factorizable variational family should not ruin this consistency: point masses are factorizable and thus the limiting distribution lies in the approximating family.

### 2.1 The VB ideal

The lemma statements and proofs adapt ideas from ghosh2003bayesian; van2000asymptotic; bickel1967asymptotically; kleijn2012bernstein; lu2016gaussapprox to the variational log likelihood. Let $\Theta$ be an open subset of $\mathbb{R}^d$. Suppose the observations $x = x_{1:n}$ are a random sample from the measure $P_{\theta_0}$ with density $\int p(x, z \mid \theta = \theta_0)\, dz$ for some fixed, nonrandom value $\theta_0 \in \Theta$. Here $z = z_{1:n}$ are local latent variables, and $\theta = \theta_{1:d} \in \Theta$ are global latent variables. We assume that the density maps $(\theta, x) \mapsto \int p(x, z \mid \theta)\, dz$ of the true model and $(\theta, x) \mapsto \ell(\theta; x)$ of the variational frequentist models are measurable. For simplicity, we also assume that for each $n$ there exists a single measure that dominates all measures with densities $\ell(\theta; x)$, $\theta \in \Theta$, as well as the true measure $P_{\theta_0}$.

###### Assumption 1.

We assume the following conditions for the rest of the paper:

1. (Prior mass) The prior measure with Lebesgue-density $p(\theta)$ on $\Theta$ is continuous and positive on a neighborhood of $\theta_0$. There exists a constant $M_p > 0$ such that $|(\log p(\theta))''| \le M_p e^{|\theta|^2}$.

2. (Consistent testability) For every $\epsilon > 0$ there exists a sequence of tests $\phi_n$ such that

 $$\int \phi_n(x)\, p(x, z \mid \theta_0)\, dz\, dx \to 0$$

 and

 $$\sup_{\theta : \|\theta - \theta_0\| \ge \epsilon} \int (1 - \phi_n(x))\, \frac{\ell(\theta; x)}{\ell(\theta_0; x)}\, p(x, z \mid \theta_0)\, dz\, dx \to 0.$$

3. (Local asymptotic normality (LAN)) For every compact set $K \subset \mathbb{R}^d$, there exist random vectors $\Delta_{n,\theta_0}$ bounded in probability and nonsingular matrices $V_{\theta_0}$ such that

 $$\sup_{h \in K} \Big| M_n(\theta_0 + \delta_n h; x) - M_n(\theta_0; x) - h^\top V_{\theta_0} \Delta_{n,\theta_0} + \tfrac{1}{2} h^\top V_{\theta_0} h \Big| \xrightarrow{P_{\theta_0}} 0,$$

 where $\delta_n$ is a $d \times d$ diagonal matrix. We have $\delta_n \to 0$ as $n \to \infty$. For $d = 1$, we commonly have $\delta_n = 1/\sqrt{n}$.

These three assumptions are standard for Bernstein–von Mises theorems. The first assumption is a prior mass assumption. It says the prior on $\theta$ puts enough mass on sufficiently small balls around $\theta_0$. This allows for optimal rates of convergence of the posterior. The first assumption further bounds the second derivative of the log prior density. This is a mild technical assumption satisfied by most non-heavy-tailed distributions.

The second assumption is a consistent testability assumption. It says there exists a sequence of uniformly consistent (under $P_{\theta_0}$) tests for testing $H_0: \theta = \theta_0$ against $H_1: \|\theta - \theta_0\| \ge \epsilon$ for every $\epsilon > 0$ based on the frequentist variational model. This is a weak assumption. For example, it suffices to have a compact $\Theta$ and a continuous and identifiable $M_n(\theta; x)$. It is also true when there exists a consistent estimator $T_n$ of $\theta$. In this case, we can set $\phi_n := 1\{\|T_n - \theta_0\| \ge \epsilon/2\}$.

The last assumption is a local asymptotic normality assumption on $M_n(\theta; x)$ around the true value $\theta_0$. It says the frequentist variational model can be asymptotically approximated by a normal location model centered at $\theta_0$ after a rescaling of $\delta_n^{-1}$. This normalizing sequence $\delta_n$ determines the optimal rates of convergence of the posterior. For example, if $\delta_n = 1/\sqrt{n}$, then we commonly have $\theta - \theta_0 = O_p(1/\sqrt{n})$. We often need model-specific analysis to verify this condition, as we do in Section 4. We discuss sufficient conditions and general proof strategies in Section 3.4.
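A quick numerical check can make the local asymptotic normality condition less abstract. The sketch below uses the simplest possible case, chosen here as an assumption for illustration: $x_i \sim \mathcal{N}(\theta, 1)$ with no local latent variables, so the variational log likelihood is just the log likelihood, with $\delta_n = 1/\sqrt{n}$, $V_{\theta_0} = 1$ and $\Delta_{n,\theta_0} = \sqrt{n}(\bar{x} - \theta_0)$. In this model the quadratic expansion is exact, so the remainder is zero up to floating-point error.

```python
import math
import random

def log_lik(theta, xs):
    """log p(x | theta) for x_i ~ N(theta, 1); with no local latent
    variables this plays the role of M_n(theta; x)."""
    return sum(-0.5 * math.log(2 * math.pi) - 0.5 * (x - theta) ** 2 for x in xs)

random.seed(1)
theta0, n = 0.7, 2000
xs = [random.gauss(theta0, 1.0) for _ in range(n)]

delta_n = 1.0 / math.sqrt(n)                     # rescaling sequence
V = 1.0                                          # Fisher information of N(theta, 1)
Delta = math.sqrt(n) * (sum(xs) / n - theta0)    # Delta_{n, theta0}

def lan_remainder(h):
    """M_n(theta0 + delta_n h) - M_n(theta0) - h V Delta + h V h / 2."""
    lhs = log_lik(theta0 + delta_n * h, xs) - log_lik(theta0, xs)
    return lhs - h * V * Delta + 0.5 * h * V * h

remainders = [abs(lan_remainder(h)) for h in (-3.0, -1.0, 0.5, 2.0)]
```

For models with local latent variables the remainder is not identically zero, and showing that it vanishes uniformly over compact sets of $h$ is the model-specific work the assumption asks for.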

In the spirit of the last assumption, we perform a change-of-variable step:

$$\tilde{\theta} = \delta_n^{-1}(\theta - \theta_0). \tag{8}$$

We center $\theta$ at the true value $\theta_0$ and rescale it by the reciprocal of the rate of convergence, $\delta_n^{-1}$. This ensures that the asymptotic distribution of $\tilde{\theta}$ is not degenerate, i.e., it does not converge to a point mass. We define $\pi^*_{\tilde{\theta}}(\cdot \mid x)$ as the density of $\tilde{\theta}$ when $\theta$ has density $\pi^*(\cdot \mid x)$:

$$\pi^*_{\tilde{\theta}}(\tilde{\theta} \mid x) = \pi^*(\theta_0 + \delta_n \tilde{\theta} \mid x) \cdot |\det(\delta_n)|.$$

Now we characterize the asymptotic properties of the VB ideal.

###### Lemma 1.

The VB ideal converges in total variation to a sequence of normal distributions,

$$\Big\| \pi^*_{\tilde{\theta}}(\cdot \mid x) - \mathcal{N}(\cdot\,; \Delta_{n,\theta_0}, V_{\theta_0}^{-1}) \Big\|_{\mathrm{TV}} \xrightarrow{P_{\theta_0}} 0.$$

Proof sketch of Lemma 1. This is a consequence of the classical finite-dimensional Bernstein–von Mises theorem under model misspecification (kleijn2012bernstein). Theorem 2.1 of kleijn2012bernstein roughly says that the posterior is consistent if the model is locally asymptotically normal around the true parameter value $\theta_0$. Here the true data generating measure is $P_{\theta_0}$ with density $\int p(x, z \mid \theta = \theta_0)\, dz$, while the frequentist variational model has densities $\ell(\theta; x)$, $\theta \in \Theta$.

What we need to show is that the consistent testability assumption in Assumption 1 implies assumption (2.3) in kleijn2012bernstein:

$$\int_{|\tilde{\theta}| > M_n} \pi^*_{\tilde{\theta}}(\tilde{\theta} \mid x)\, d\tilde{\theta} \xrightarrow{P_{\theta_0}} 0$$

for every sequence of constants $M_n \to \infty$. To show this, we mimic the argument of Theorem 3.1 of kleijn2012bernstein, where they show this implication for the iid case with a common convergence rate for all dimensions of $\theta$. See Appendix A for details. ∎

This lemma says the VB ideal of the rescaled $\theta$, $\tilde{\theta} = \delta_n^{-1}(\theta - \theta_0)$, is asymptotically normal with mean $\Delta_{n,\theta_0}$. The mean, $\Delta_{n,\theta_0}$, as assumed in Assumption 1, is a random vector bounded in probability. The asymptotic distribution is thus also random, where the randomness is due to the data $x$ being random draws from the true data generating measure $P_{\theta_0}$. We notice that if the VFE, $\hat{\theta}_n$, is consistent and asymptotically normal, we commonly have $\Delta_{n,\theta_0} = \delta_n^{-1}(\hat{\theta}_n - \theta_0)$ with $\mathbb{E}(\Delta_{n,\theta_0}) = 0$. Hence, the VB ideal will converge to a normal distribution with a random mean centered at the true value $\theta_0$.

### 2.2 The KL minimizer of the VB ideal

Next we study the KL minimizer of the VB ideal in the mean field variational family. We show its consistency and asymptotic normality. To be clear, the asymptotic normality is in the sense that the KL minimizer of the VB ideal converges to the KL minimizer of a normal distribution in TV distance.

###### Lemma 2.

The KL minimizer of the VB ideal over the mean field family is consistent: almost surely under $P_{\theta_0}$, it converges to a point mass,

$$\arg\min_{q(\theta) \in \mathcal{Q}^d} \mathrm{KL}\big(q(\theta) \,\|\, \pi^*(\theta \mid x)\big) \xrightarrow{d} \delta_{\theta_0}.$$

Proof sketch of Lemma 2. The key insight here is that point masses are factorizable. Lemma 1 above suggests that the VB ideal converges in distribution to a point mass. We thus have its KL minimizer also converging to a point mass, because point masses reside within the mean field family. In other words, there is no loss, in the limit, incurred by positing a factorizable variational family for approximation.

To prove this lemma, we bound the mass of $B^c(\theta_0, \eta_n)$ under $q(\theta)$, where $B^c(\theta_0, \eta_n)$ is the complement of an $\eta_n$-sized ball centered at $\theta_0$ with $\eta_n \to 0$ as $n \to \infty$. In this step, we borrow ideas from the proof of Lemma 3.6 and Lemma 3.7 in lu2016gaussapprox. See Appendix B for details. ∎

###### Lemma 3.

The KL minimizer of the VB ideal of $\tilde{\theta}$ converges to that of $\mathcal{N}(\cdot\,; \Delta_{n,\theta_0}, V_{\theta_0}^{-1})$ in total variation: under mild technical conditions on the tail behavior of $\mathcal{Q}^d$ (see Assumption 2 in Appendix C),

$$\Big\| \arg\min_{q \in \mathcal{Q}^d} \mathrm{KL}\big(q(\cdot) \,\|\, \pi^*_{\tilde{\theta}}(\cdot \mid x)\big) - \arg\min_{q \in \mathcal{Q}^d} \mathrm{KL}\big(q(\cdot) \,\|\, \mathcal{N}(\cdot\,; \Delta_{n,\theta_0}, V_{\theta_0}^{-1})\big) \Big\|_{\mathrm{TV}} \xrightarrow{P_{\theta_0}} 0.$$

Proof sketch of Lemma 3. The intuition here is that, if the two distributions are close in the limit, their KL minimizers should also be close in the limit. Lemma 1 says that the VB ideal of $\tilde{\theta}$ converges to $\mathcal{N}(\cdot\,; \Delta_{n,\theta_0}, V_{\theta_0}^{-1})$ in total variation. We would expect their KL minimizers to converge in some metric as well. This result is also true for the (full-rank) Gaussian variational family if rescaled appropriately.

Here we show their convergence in total variation. This is achieved by showing the $\Gamma$-convergence of the functionals of $q$: $\mathrm{KL}(q(\cdot) \,\|\, \pi^*_{\tilde{\theta}}(\cdot \mid x))$ to $\mathrm{KL}(q(\cdot) \,\|\, \mathcal{N}(\cdot\,; \Delta_{n,\theta_0}, V_{\theta_0}^{-1}))$, for parametric $q$'s. $\Gamma$-convergence is a classical tool for characterizing variational problems; $\Gamma$-convergence of functionals ensures convergence of their minimizers (dal2012introduction; braides2006handbook). See Appendix C for proof details and a review of $\Gamma$-convergence. ∎

We characterized the limiting properties of the VB ideal and its KL minimizers. We will next show that the VB posterior is close to the KL divergence minimizer of the VB ideal. Section 3 culminates in the main theorem of this paper, the variational Bernstein–von Mises theorem, showing that the VB posterior shares consistency and asymptotic normality with the KL divergence minimizer of the VB ideal.

## 3 Frequentist consistency of variational Bayes

We now study the VB posterior. In the previous section, we proved theoretical properties for the VB ideal and its KL minimizer in the variational family. Here we first connect the VB ideal to the VB posterior, the quantity that is used in practice. We then use this connection to understand the theoretical properties of the VB posterior.

We begin by characterizing the optimal variational distribution in a useful way. Decompose the variational family as

$$q(\theta, z) = q(\theta)\, q(z),$$

where $q(\theta) = \prod_{i=1}^{d} q(\theta_i)$ and $q(z) = \prod_{i=1}^{n} q(z_i)$. Denote the prior $p(\theta)$. Note $d$ does not grow with the size of the data. We will develop a theory around VB that considers asymptotic properties of the VB posterior $q^*(\theta)$.

We decompose the ELBO of Equation 3 into the portion associated with the global variable and the portion associated with the local variables,

$$\begin{aligned}
\mathrm{ELBO}(q(\theta) q(z)) &= \int\!\!\int q(\theta) q(z) \log \frac{p(\theta, x, z)}{q(\theta) q(z)} \, d\theta \, dz \\
&= \int\!\!\int q(\theta) q(z) \log \frac{p(\theta)\, p(x, z \mid \theta)}{q(\theta) q(z)} \, d\theta \, dz \\
&= \int q(\theta) \log \frac{p(\theta)}{q(\theta)} \, d\theta + \int q(\theta) \int q(z) \log \frac{p(x, z \mid \theta)}{q(z)} \, dz \, d\theta.
\end{aligned}$$

The optimal variational factor for the global variables, i.e., the VB posterior, maximizes the ELBO. From the decomposition, we can write it as a function of the optimized local variational factor,

$$q^*(\theta) = \arg\max_{q(\theta)} \sup_{q(z)} \int q(\theta) \left( \log\left[ p(\theta) \exp\left\{ \int q(z) \log \frac{p(x, z \mid \theta)}{q(z)} \, dz \right\} \right] - \log q(\theta) \right) d\theta. \tag{9}$$

One way to see the objective for the VB posterior is as the ELBO profiled over $q(z)$, i.e., where the optimal $q(z)$ is a function of $q(\theta)$ (Hoffman:2013). With this perspective, the ELBO becomes a function of $q(\theta)$ only. We denote it as a functional $\mathrm{ELBO}_p(\cdot)$:

$$\mathrm{ELBO}_p(q(\theta)) := \sup_{q(z)} \int q(\theta) \left( \log\left[ p(\theta) \exp\left\{ \int q(z) \log \frac{p(x, z \mid \theta)}{q(z)} \, dz \right\} \right] - \log q(\theta) \right) d\theta. \tag{10}$$

We then rewrite Equation 9 as $q^*(\theta) = \arg\max_{q(\theta) \in \mathcal{Q}^d} \mathrm{ELBO}_p(q(\theta))$. This expression for the VB posterior is key to our results.

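### 3.1 KL minimizers of the VB ideal

Recall that the KL minimization objective to the ideal VB posterior is the functional $\mathrm{KL}(\cdot \,\|\, \pi^*(\theta \mid x))$. We first show that the two optimization objectives $\mathrm{KL}(\cdot \,\|\, \pi^*(\theta \mid x))$ and $\mathrm{ELBO}_p(\cdot)$ are close in the limit. Given the continuity of both $\mathrm{KL}(\cdot \,\|\, \pi^*(\theta \mid x))$ and $\mathrm{ELBO}_p(\cdot)$, this implies the asymptotic properties of optimizers of $\mathrm{KL}(\cdot \,\|\, \pi^*(\theta \mid x))$ will be shared by the optimizers of $\mathrm{ELBO}_p(\cdot)$.

###### Lemma 4.

The negative KL divergence to the VB ideal is equivalent to the profiled ELBO in the limit: under mild technical conditions on the tail behavior of $\mathcal{Q}^d$ (see for example Assumption 3 in Appendix D), for $q(\theta) \in \mathcal{Q}^d$,

$$\mathrm{ELBO}_p(q(\theta)) = -\mathrm{KL}\big(q(\theta) \,\|\, \pi^*(\theta \mid x)\big) + o_P(1).$$

Proof sketch of Lemma 4. We first notice that, up to an additive constant that does not depend on $q(\theta)$ (the log normalizer of Equation 7),

$$\begin{aligned}
& -\mathrm{KL}\big(q(\theta) \,\|\, \pi^*(\theta \mid x)\big) && \text{(11)} \\
&= \int q(\theta) \log \frac{p(\theta) \exp(M_n(\theta; x))}{q(\theta)} \, d\theta && \text{(12)} \\
&= \int q(\theta) \left( \log\left[ p(\theta) \exp\left\{ \sup_{q(z)} \int q(z) \log \frac{p(x, z \mid \theta)}{q(z)} \, dz \right\} \right] - \log q(\theta) \right) d\theta. && \text{(13)}
\end{aligned}$$

Comparing Equation 13 with Equation 10, we can see that the only difference between $-\mathrm{KL}(\cdot \,\|\, \pi^*(\theta \mid x))$ and $\mathrm{ELBO}_p(\cdot)$ is in the position of $\sup_{q(z)}$. $\mathrm{ELBO}_p(\cdot)$ allows for a single choice of optimal $q(z)$ given $q(\theta)$, while $-\mathrm{KL}(\cdot \,\|\, \pi^*(\theta \mid x))$ allows for a different optimal $q(z)$ for each value of $\theta$. In this sense, if we restrict the variational family of $q(\theta)$ to be point masses, then $\mathrm{ELBO}_p(\cdot)$ and $-\mathrm{KL}(\cdot \,\|\, \pi^*(\theta \mid x))$ will be the same.

The only members of the variational family of $q(\theta)$ that admit finite $-\mathrm{KL}(q(\theta) \,\|\, \pi^*(\theta \mid x))$ are ones that converge to point masses at rate $\delta_n$, so we expect $\mathrm{ELBO}_p(\cdot)$ and $-\mathrm{KL}(\cdot \,\|\, \pi^*(\theta \mid x))$ to be close as $n \to \infty$. We prove this by bounding the remainder in the Taylor expansion of $M_n(\theta; x)$ by a sequence converging to zero in probability. See Appendix D for details. ∎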

### 3.2 The vbposterior

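Section 2 characterizes the asymptotic behavior of the VB ideal $\pi^*(\theta \mid x)$ and its KL minimizers. Lemma 4 establishes the connection between the VB posterior $q^*(\theta)$ and the KL minimizers of the VB ideal $\pi^*(\theta \mid x)$. Recall that $\arg\min_{q(\theta) \in \mathcal{Q}^d} \mathrm{KL}(q(\theta) \,\|\, \pi^*(\theta \mid x))$ is consistent and converges to the KL minimizer of a normal distribution. We now build on these results to study the VB posterior $q^*(\theta)$.

Now we are ready to state the main theorem. It establishes the asymptotic behavior of the VB posterior $q^*(\theta)$.

###### Theorem 5.

(Variational Bernstein–von Mises theorem)

1. The VB posterior is consistent: almost surely under $P_{\theta_0}$,

 $$q^*(\theta) \xrightarrow{d} \delta_{\theta_0}.$$

2. The VB posterior is asymptotically normal in the sense that it converges to the KL minimizer of a normal distribution:

 $$\Big\| q^*_{\tilde{\theta}}(\cdot) - \arg\min_{q \in \mathcal{Q}^d} \mathrm{KL}\big(q(\cdot) \,\|\, \mathcal{N}(\cdot\,; \Delta_{n,\theta_0}, V_{\theta_0}^{-1})\big) \Big\|_{\mathrm{TV}} \xrightarrow{P_{\theta_0}} 0. \tag{14}$$

Here we transform $q^*(\theta)$ to $q^*_{\tilde{\theta}}(\tilde{\theta})$, which is centered around the true $\theta_0$ and scaled by the convergence rate; see Equation 8. When $\mathcal{Q}^d$ is the mean field variational family, then the limiting VB posterior is normal:

$$\arg\min_{q \in \mathcal{Q}^d} \mathrm{KL}\big(q(\cdot) \,\|\, \mathcal{N}(\cdot\,; \Delta_{n,\theta_0}, V_{\theta_0}^{-1})\big) = \mathcal{N}(\cdot\,; \Delta_{n,\theta_0}, V_{\theta_0}'^{-1}), \tag{15}$$

where $V_{\theta_0}'$ is diagonal and has the same diagonal terms as $V_{\theta_0}$.

Proof sketch of Theorem 5. This theorem is a direct consequence of Lemma 2, Lemma 3, and Lemma 4. We need the same mild technical conditions on $\mathcal{Q}^d$ as in Lemma 3 and Lemma 4. Equation 15 can be proved by first establishing the normality of the optimal variational factor (see Section 10.1.2 of bishop2006pattern for details) and proceeding with Lemma 8. See Appendix E for details. ∎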

Given the convergence of the vbposterior, we can now establish the asymptotic properties of the vbe.

###### Theorem 6.

(Asymptotics of the VBE)

Assume . Let $\hat{\theta}_n^*$ denote the VBE.

1. The VBE is consistent: under $P_{\theta_0}$,

 $$\hat{\theta}_n^* \xrightarrow{a.s.} \theta_0.$$
2. The VBE is asymptotically normal in the sense that it converges in distribution to the mean of the KL minimizer (footnote 3: the randomness in the mean of the KL minimizer comes from ): if for some ,

Proof sketch of Theorem 6. As the posterior mean is a continuous function of the posterior distribution, we would expect the VBE to be consistent given that the VB posterior is. We also know that the posterior mean is the Bayes estimator under squared loss. Thus we would expect the VBE to converge in distribution to the squared loss minimizer of the KL minimizer of the VB ideal. The result follows from an argument very similar to that of Theorem 2.3 of kleijn2012bernstein, which shows that the posterior mean estimate is consistent and asymptotically normal under model misspecification as a consequence of the Bernstein–von Mises theorem and the argmax theorem. See Appendix E for details. ∎
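To spell out the squared-loss step used in the sketch above: for any distribution $q$ over $\theta$ with finite second moment (a generic $q$, introduced here only for illustration) and any candidate point $a$, the cross term vanishes and

$$\mathbb{E}_q\|\theta - a\|^2 = \mathbb{E}_q\|\theta - \mathbb{E}_q[\theta]\|^2 + \|\mathbb{E}_q[\theta] - a\|^2 .$$

The first term does not depend on $a$, so the expected squared loss is minimized at $a = \mathbb{E}_q[\theta]$. Taking $q$ to be the VB posterior is what makes its mean the natural point estimate under squared loss.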

We remark that , as in Assumption 1, is a random vector bounded in probability. The randomness is due to the data being a random sample generated from $P_{\theta_0}$.

In cases where the VFE is consistent, as in all the examples in Section 4, is a zero mean random vector with finite variance. For particular realizations of the value of might not be zero; however, because we scale by , this does not destroy the consistency of the VB posterior or the VBE.

### 3.3 Gaussian VB posteriors

We illustrate the implications of Theorem 5 and Theorem 6 on two choices of variational families: a full rank Gaussian variational family and a factorizable Gaussian variational family. In both cases, the VB posterior and the VBE are consistent and asymptotically normal, but with different covariance matrices. The VB posterior under the factorizable family is underdispersed.

###### Corollary 7.

Posit a full rank Gaussian variational family, that is

 $$\mathcal{Q}^d = \{q : q(\theta) = \mathcal{N}(m, \Sigma)\}, \tag{16}$$

with $\Sigma$ positive definite. Then

1. almost surely under .

2. .

Proof sketch of Corollary 7. This is a direct consequence of Theorem 5 and Theorem 6. We only need to show that Lemma 3 also holds for the full rank Gaussian variational family. The last conclusion implies if for some random variable . We defer the proof to Appendix F. ∎

This corollary says that under a full rank Gaussian variational family, VB is consistent and asymptotically normal in the classical sense. It accurately recovers the asymptotic normal distribution implied by the local asymptotic normality of .

Before stating the corollary for the factorizable Gaussian variational family, we first present a lemma on the KL minimizer of a Gaussian distribution over the factorizable Gaussian family. We show that the minimizer keeps the mean but has a diagonal covariance matrix that matches the precision. We also show that the entropy of the minimizer is no larger than that of the original distribution. This echoes the well-known phenomenon of VB algorithms underestimating the variance.

###### Lemma 8.

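The factorizable KL minimizer of a Gaussian distribution keeps the mean and matches the precision:

 $$\operatorname*{arg\,min}_{\mu,\ \Sigma \text{ diagonal}} \mathrm{KL}\big(\mathcal{N}(\cdot;\mu,\Sigma)\,\|\,\mathcal{N}(\cdot;\mu_0,\Sigma_1)\big) = (\mu_0, \Sigma_1^*),$$

where $\Sigma_1^*$ is diagonal with $\Sigma^*_{1,ii} = ((\Sigma_1^{-1})_{ii})^{-1}$ for $i = 1, \ldots, d$. Hence, the entropy of the factorizable KL minimizer is smaller than or equal to that of the original distribution:

 $$H(\mathcal{N}(\cdot;\mu_0,\Sigma_1^*)) \le H(\mathcal{N}(\cdot;\mu_0,\Sigma_1)).$$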

Proof sketch of Lemma 8. The first statement is a consequence of a technical calculation of the KL divergence between two normal distributions. We differentiate the KL divergence with respect to the mean and the diagonal covariance terms of the factorizable Gaussian and obtain the result. The second statement follows from the inequality that the determinant of a positive definite matrix is always smaller than or equal to the product of its diagonal terms (amir1969product; beckenbach2012inequalities). In this sense, mean field variational inference underestimates posterior variance. See Appendix G for details. ∎
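The two claims of Lemma 8 can be checked numerically. The sketch below is illustrative only: the function names, the random 3×3 covariance, and the perturbation scheme are assumptions made for this example, not part of the paper. It computes the closed-form factorizable minimizer (same mean, diagonal variances equal to the reciprocals of the diagonal of the precision matrix), confirms that random diagonal perturbations do not achieve a lower KL divergence, and compares entropies.

```python
import numpy as np

def factorized_kl_minimizer(mu0, sigma1):
    """Closed form described in Lemma 8: keep the mean, match the precision diagonal."""
    precision = np.linalg.inv(sigma1)
    return mu0.copy(), np.diag(1.0 / np.diag(precision))

def kl_gauss(m_q, s_q, m_p, s_p):
    """KL( N(m_q, s_q) || N(m_p, s_p) ) for multivariate normals."""
    d = m_q.shape[0]
    s_p_inv = np.linalg.inv(s_p)
    diff = m_p - m_q
    _, logdet_p = np.linalg.slogdet(s_p)
    _, logdet_q = np.linalg.slogdet(s_q)
    return 0.5 * (np.trace(s_p_inv @ s_q) + diff @ s_p_inv @ diff
                  - d + logdet_p - logdet_q)

def gauss_entropy(s):
    """Differential entropy of a multivariate normal with covariance s."""
    d = s.shape[0]
    _, logdet = np.linalg.slogdet(s)
    return 0.5 * (d * np.log(2.0 * np.pi * np.e) + logdet)

rng = np.random.default_rng(0)
a = rng.normal(size=(3, 3))
sigma1 = a @ a.T + 3.0 * np.eye(3)      # arbitrary positive definite covariance
mu0 = np.array([1.0, -2.0, 0.5])

m_star, s_star = factorized_kl_minimizer(mu0, sigma1)
kl_star = kl_gauss(m_star, s_star, mu0, sigma1)

# KL at randomly perturbed factorizable Gaussians (diagonal covariance kept positive).
perturbed_kls = []
for _ in range(200):
    m = m_star + 0.1 * rng.normal(size=3)
    s = np.diag(np.diag(s_star) * np.exp(0.1 * rng.normal(size=3)))
    perturbed_kls.append(kl_gauss(m, s, mu0, sigma1))

print("KL at closed form:", kl_star, "| smallest perturbed KL:", min(perturbed_kls))
print("entropy (factorized):", gauss_entropy(s_star),
      "| entropy (original):", gauss_entropy(sigma1))
```

The perturbation check is only a local sanity test; the objective is convex in the diagonal variances, which is why the stationary point is the global minimizer.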

The next corollary studies the VB posterior and the VBE under a factorizable Gaussian variational family.

###### Corollary 9.

Posit a factorizable Gaussian variational family,

 $$\mathcal{Q}^d = \{q : q(\theta) = \mathcal{N}(m, \Sigma)\}, \tag{17}$$

where $\Sigma$ is positive definite and diagonal. Then

1. almost surely under .

2. where is diagonal and has the same diagonal entries as .

3. .

Proof of Corollary 9. This is a direct consequence of Lemma 8, Theorem 5, and Theorem 6. ∎

This corollary says that under the factorizable Gaussian variational family, VB is consistent and asymptotically normal in the classical sense. The rescaled asymptotic distribution for recovers the mean but underestimates the covariance. This underdispersion is a common phenomenon in mean field variational Bayes.

As we mentioned, the VB posterior is underdispersed. One consequence of this property is that its credible sets can suffer from under-coverage. In the literature on VB, there are two main ways to correct this inadequacy. One way is to increase the expressiveness of the variational family to one that accounts for dependencies among latent variables. This approach is taken by much of the recent VB literature, e.g. tran2015copula; tran2015variational; ranganath2016hierarchical; ranganath2016operator; liu2016stein. As long as the expanded variational family contains the mean field family, Theorem 5 and Theorem 6 remain true.

Alternative methods for handling underdispersion center on sensitivity analysis and the bootstrap. giordano2017covariances identified the close relationship between Bayesian sensitivity and posterior covariance. They estimated the covariance with the sensitivity of the VB posterior means with respect to perturbations of the data. chen2017use explored the use of the bootstrap in assessing the uncertainty of a variational point estimate. They also studied the underlying bootstrap theory. giordano2017measuring assessed the clustering stability in Bayesian nonparametric models based on an approximation to the infinitesimal jackknife.
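The under-coverage of mean field credible sets discussed in this section can be quantified in a toy setting. The sketch below assumes, purely for illustration, a bivariate Gaussian target with unit marginal variances and correlation `rho`; the function names are hypothetical. In that case the precision-matching mean field variance of each coordinate is `1 - rho**2`, so a nominal 95% interval built from the mean field approximation covers less than 95% of the true marginal whenever `rho` is nonzero.

```python
import math

def normal_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def mean_field_coverage(rho, z=1.959963984540054):
    """Mass of the true N(0, 1) marginal inside the central interval of
    half-width z * sd, where sd is the mean field standard deviation.

    Assumes a bivariate Gaussian target with unit variances and correlation
    rho, for which the mean field variance is 1 - rho**2.
    """
    mean_field_sd = math.sqrt(1.0 - rho ** 2)
    half_width = z * mean_field_sd
    return 2.0 * normal_cdf(half_width) - 1.0

coverages = {rho: mean_field_coverage(rho) for rho in (0.0, 0.5, 0.9)}
for rho, cov in coverages.items():
    print(f"rho = {rho:.1f}: actual coverage of nominal 95% interval = {cov:.3f}")
```

With no correlation the interval attains its nominal level; as the correlation grows, the actual coverage falls further below it.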

### 3.4 The LAN condition of the variational log likelihood

Our results rest on Assumption 1.3, the LAN expansion of the variational log likelihood. For models without local latent variables $z$, the variational log likelihood is the same as the log likelihood. The LAN expansion for these models has been widely studied. In particular, iid sampling from a regular parametric model is locally asymptotically normal; it satisfies Assumption 1.3 (van2000asymptotic). When models do contain local latent variables, however, as we will see in Section 4, finding the LAN expansion requires model-specific characterization.
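As a concrete instance of the case without local latent variables, consider iid draws from $\mathcal{N}(\theta, 1)$. The sketch below is an illustration under assumptions of our own: it uses the textbook LAN form with rate $1/\sqrt{n}$, curvature $V = 1$, and $\Delta_n = \sqrt{n}(\bar{x} - \theta_0)$, which may differ in notation from the paper's condition, and all names in the code are hypothetical. For this model the log likelihood ratio at $\theta_0 + h/\sqrt{n}$ equals the quadratic $h V \Delta_n - \tfrac{1}{2} V h^2$ exactly, so the remainder term is zero up to floating point error.

```python
import numpy as np

def log_lik(theta, x):
    """Log likelihood of iid N(theta, 1) observations."""
    return -0.5 * np.sum((x - theta) ** 2) - 0.5 * x.size * np.log(2.0 * np.pi)

def lan_quadratic(h, x, theta0):
    """Quadratic LAN approximation h*V*Delta_n - 0.5*V*h^2 (assumed textbook form)."""
    n = x.size
    v = 1.0                                      # Fisher information of N(theta, 1)
    delta_n = np.sqrt(n) * (x.mean() - theta0)   # rescaled score, bounded in probability
    return h * v * delta_n - 0.5 * v * h ** 2

rng = np.random.default_rng(1)
theta0 = 0.7
x = rng.normal(loc=theta0, scale=1.0, size=500)

hs = np.linspace(-3.0, 3.0, 7)
exact = np.array([log_lik(theta0 + h / np.sqrt(x.size), x) - log_lik(theta0, x)
                  for h in hs])
approx = np.array([lan_quadratic(h, x, theta0) for h in hs])
max_gap = np.max(np.abs(exact - approx))
print("largest gap between exact log ratio and LAN quadratic:", max_gap)
```

For models outside this Gaussian location family the two curves agree only asymptotically, which is what the remainder term in a LAN expansion captures.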

For a certain class of models with local latent variables, the LAN expansion for the (complete) log likelihood concurs with the expansion of the variational log likelihood. Below we provide a sufficient condition for such a shared LAN expansion. It is satisfied, for example, by the stochastic block model (bickel2013asymptotic) under mild identifiability conditions.

###### Proposition 10.

The log likelihood and the variational log likelihood will have the same LAN expansion if:

1. The conditioned nuisance posterior is consistent under -perturbation at some rate with and :

For all bounded, stochastic , the conditional nuisance posterior converges as

 ∫Dc(θ,ρn)p(z\leavevmode\nobreak |\leavevmode\nobreak x,θ=θ0+