Multivariate Stein Factors for a Class of Strongly Log-concave Distributions
Lester Mackey
1 Introduction
In 1972, Stein [22] introduced a powerful method for bounding the maximum expected discrepancy, $d_{\mathcal{H}}(Q,P) = \sup_{h \in \mathcal{H}} |\mathbb{E}_Q[h(X)] - \mathbb{E}_P[h(Z)]|$, between a target distribution $P$ and an approximating distribution $Q$. Stein’s method classically proceeds in three steps:

First, one identifies a linear operator $\mathcal{A}$ that generates mean-zero functions under the target distribution. A common choice for a continuous target on $\mathbb{R}^d$ is the infinitesimal generator of the overdamped Langevin diffusion (also known as the Smoluchowski dynamics) [19, Secs. 6.5 and 4.5] with stationary distribution $P$:
$(\mathcal{A}u)(x) = \frac{1}{2}\langle \nabla u(x), \nabla \log p(x) \rangle + \frac{1}{2}\langle \nabla, \nabla u(x) \rangle.$ (1)
Here, $p$ represents the density of $P$ with respect to Lebesgue measure.

Next, one shows that, for every test function $h$ in a convergence-determining class $\mathcal{H}$, the Stein equation
$h(x) - \mathbb{E}_P[h(Z)] = (\mathcal{A}u_h)(x)$ (2)
admits a solution $u_h$ in a set $\mathcal{U}$ of functions with uniformly bounded low-order derivatives. These uniform derivative bounds are commonly termed Stein factors.

Finally, one uses whatever tools necessary to upper bound the Stein discrepancy
$\sup_{u \in \mathcal{U}} |\mathbb{E}_Q[(\mathcal{A}u)(X)]|$ (3)
which by construction upper bounds the reference metric $d_{\mathcal{H}}(Q,P)$.
To date, this recipe has been successfully used with the Langevin operator (1) to obtain explicit approximation error bounds for a wide variety of univariate targets [see, e.g., 7, 6].
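To see why the Langevin operator generates mean-zero functions under the target, the following sketch integrates by parts. It assumes the generator takes the standard form $(\mathcal{A}u)(x) = \frac{1}{2}\langle \nabla u(x), \nabla \log p(x) \rangle + \frac{1}{2}\langle \nabla, \nabla u(x) \rangle$, that $u$ and $p$ are smooth, and that $p \nabla u$ vanishes at infinity; the symbols $\mathcal{A}$, $u$, $p$, and $Z \sim P$ are notational assumptions made for illustration.

```latex
\begin{align*}
\mathbb{E}_P[(\mathcal{A}u)(Z)]
  &= \frac{1}{2} \int_{\mathbb{R}^d} \big( \langle \nabla u(x), \nabla \log p(x) \rangle
     + \langle \nabla, \nabla u(x) \rangle \big)\, p(x)\, dx \\
  &= \frac{1}{2} \int_{\mathbb{R}^d} \big( \langle \nabla u(x), \nabla p(x) \rangle
     + p(x)\, \langle \nabla, \nabla u(x) \rangle \big)\, dx \\
  &= \frac{1}{2} \int_{\mathbb{R}^d} \langle \nabla, p(x) \nabla u(x) \rangle \, dx = 0.
\end{align*}
```

The second line uses $p \nabla \log p = \nabla p$, and the last line is the divergence theorem under the stated decay assumption.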
Notation and terminology We let denote the set of real-valued functions on with continuous derivatives. We further let denote the norm on and define the operator norms for vectors , for matrices , and for tensors . We say a function is strongly concave for if
and we term a function strongly log-concave if is strongly concave. We finally let for all functions and define the Lipschitz constants
2 Stein factors for strongly log-concave distributions
Consider a target distribution on with strongly log-concave density . The following result bounds the derivatives of Stein equation solutions in terms of the smoothness of and the underlying test function . The proof, found in Section 3, is probabilistic, in the spirit of the generator method of Barbour [1] and Götze [14], and features the synchronous coupling of multiple overdamped Langevin diffusions.
Theorem 2 (Stein factors for strongly log-concave distributions). Suppose that is strongly concave with and . For each , let represent the overdamped Langevin diffusion with infinitesimal generator (1) and initial state . Then, for each Lipschitz , the function
solves the Stein equation (2) and satisfies
Theorem 2 implies that the Stein discrepancy (3) with set
bounds the smooth function distance for
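As a hedged sanity check on the theorem, consider the one-dimensional standard Gaussian target, whose log density is 1-strongly concave, together with the Lipschitz test function $h(x) = x$. The sketch below assumes the solution takes the generator-method form $u_h(x) = \int_0^\infty \mathbb{E}_P[h(Z)] - \mathbb{E}[h(Z_{t,x})]\, dt$ and that the Langevin diffusion solves $dZ_{t,x} = \frac{1}{2}\nabla \log p(Z_{t,x})\, dt + dW_t$; both conventions, and the symbols used, are assumptions made for illustration.

```latex
\begin{align*}
\nabla \log p(x) &= -x, \qquad
\mathbb{E}[Z_{t,x}] = x e^{-t/2}, \qquad
\mathbb{E}_P[h(Z)] = 0, \\
u_h(x) &= \int_0^\infty \big( 0 - x e^{-t/2} \big)\, dt = -2x, \\
(\mathcal{A}u_h)(x) &= \tfrac{1}{2}\, u_h'(x) \cdot (-x) + \tfrac{1}{2}\, u_h''(x)
  = x = h(x) - \mathbb{E}_P[h(Z)].
\end{align*}
```

In this example $u_h$ is linear, so its gradient is bounded and its higher derivatives vanish.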
Our next result shows that control over the smooth function distance also grants control over the Wasserstein distance (also known as the Kantorovich-Rubinstein or earth mover’s distance), , and the bounded-Lipschitz metric, , which exactly metrizes convergence in distribution on . These metrics govern the test function classes
Lemma 2 (Smooth-Wasserstein inequality). If and are probability measures on with finite means, and is a standard normal random vector, then
Proof.
The first inequality follows directly from the inclusions and .
To establish the second, we fix and and define the smoothed function
where is the density of a vector of independent standard normal variables. We first show that is a close approximation to when is small. Specifically, if is an integrable random vector, independent of , then, by the Lipschitz assumption on ,
We next show that the derivatives of are bounded. Fix any . Since is Lipschitz, it admits a weak gradient, , bounded uniformly by 1 in . We alternate differentiation and integration by parts to develop the representations
for each . The uniform bound on now yields ,
In the final equality we have used the fact that and are jointly normal with zero mean and covariance , so that the product has the distribution of the off-diagonal element of the Wishart distribution [23] with scale and degree of freedom.
We can now develop a bound for using our smoothed functions. Let
represent the maximum derivative bound of , and select and to satisfy . If we let , we then have
where we have chosen to achieve the third inequality. ∎
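The smoothing step of the proof can be illustrated numerically. The sketch below assumes the smoothed function takes the form $h_t(x) = \mathbb{E}[h(x + tG)]$ for a standard normal vector $G$, estimates it by Monte Carlo, and compares the approximation gap with $t\,\mathbb{E}\|G\|_2$, the bound that the Lipschitz assumption suggests. The function names, the test function, and the sample size are hypothetical choices, not part of the original argument.

```python
import math
import random


def smooth(h, x, t, n_samples=2000, seed=0):
    """Monte Carlo estimate of h_t(x) = E[h(x + t G)] with G ~ N(0, I).

    Also returns the sample mean of ||G||_2 over the same draws.
    """
    rng = random.Random(seed)
    total, norm_total = 0.0, 0.0
    for _ in range(n_samples):
        g = [rng.gauss(0.0, 1.0) for _ in x]
        total += h([xi + t * gi for xi, gi in zip(x, g)])
        norm_total += math.sqrt(sum(gi * gi for gi in g))
    return total / n_samples, norm_total / n_samples


def h(x):
    # A 1-Lipschitz but non-smooth test function: the Euclidean norm.
    return math.sqrt(sum(xi * xi for xi in x))


x = [0.3, -1.2, 0.5]
t = 0.1
h_t, mean_norm_g = smooth(h, x, t)
gap = abs(h_t - h(x))      # |h_t(x) - h(x)|, estimated
bound = t * mean_norm_g    # t * (sample mean of ||G||_2)
print(gap <= bound)
```

Because the same Gaussian draws are used for both quantities, the inequality holds for the sample averages themselves and not only in expectation.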
While Lemma 2 targets Lipschitz test functions, comparable results can be obtained for nonsmooth functions, like the indicators of convex sets, by adapting the smoothing technique of [3, Lem. 2.1].
2.1 Example application to Bayesian logistic regression
Before turning to the proof of Theorem 2, we illustrate a practical application to measuring the quality of Monte Carlo or cubature sample points in Bayesian inference. Consider the Bayesian logistic regression posterior density [see, e.g., 11]
based on observed datapoints and a known prior hyperparameter . In this standard model of binary classification, represents our inferential target, an unknown parameter vector with a multivariate Gaussian prior; is the class label of the th observed datapoint; and is an associated vector of covariates.
Since the normalizing constant of is unknown, it is common practice to approximate expectations under with sample estimates, , based on sample points drawn from a Markov chain or a cubature rule [11]. Theorem 2 furnishes a way to uniformly bound the error of this approximation, , for all sufficiently smooth functions .
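To make the ingredients concrete, the following minimal sketch evaluates the unnormalized log posterior and its gradient, which is all the Langevin operator needs since the normalizing constant drops out of the gradient. It assumes labels in {-1, +1}, a zero-mean Gaussian prior with covariance `sigma2 * I`, and synthetic data; these conventions and all function names are assumptions made for illustration. Under them the log posterior is strongly concave with parameter `1 / sigma2`, which the last lines check on one pair of points.

```python
import math
import random


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def log_posterior(beta, V, y, sigma2):
    """Unnormalized log density: Gaussian prior times logistic likelihood."""
    value = -dot(beta, beta) / (2.0 * sigma2)
    for v, label in zip(V, y):
        m = label * dot(v, beta)
        # -log(1 + exp(-m)), computed without overflow
        value -= max(-m, 0.0) + math.log1p(math.exp(-abs(m)))
    return value


def grad_log_posterior(beta, V, y, sigma2):
    """Gradient of log_posterior with respect to beta."""
    grad = [-b / sigma2 for b in beta]
    for v, label in zip(V, y):
        m = label * dot(v, beta)
        weight = label / (1.0 + math.exp(m))  # label * sigmoid(-m); fine for moderate m
        for j, vj in enumerate(v):
            grad[j] += weight * vj
    return grad


# Synthetic data (hypothetical): n covariate vectors in d dimensions.
rng = random.Random(0)
d, n, sigma2 = 3, 20, 4.0
V = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
y = [rng.choice([-1, 1]) for _ in range(n)]
k = 1.0 / sigma2  # strong concavity parameter contributed by the prior

# Strong concavity check: <grad(a) - grad(b), a - b> <= -k ||a - b||^2.
a = [rng.gauss(0.0, 1.0) for _ in range(d)]
b = [rng.gauss(0.0, 1.0) for _ in range(d)]
diff = [ai - bi for ai, bi in zip(a, b)]
ga = grad_log_posterior(a, V, y, sigma2)
gb = grad_log_posterior(b, V, y, sigma2)
monotonicity_gap = dot([p - q for p, q in zip(ga, gb)], diff) + k * dot(diff, diff)
print(monotonicity_gap <= 0.0)
```

The likelihood term only adds curvature, so the prior alone supplies the strong concavity constant in this sketch.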
3 Proof of Theorem 2
Before tackling the main proof, we will establish a series of useful lemmas. We will make regular use of the following well-known Lipschitz property:
(4) 
3.1 Properties of overdamped Langevin diffusions
Our first lemma enumerates several properties of the overdamped Langevin diffusion that will prove useful in the proofs to follow.
Lemma 3.1 (Overdamped Langevin properties). If is strongly concave, then the overdamped Langevin diffusion with infinitesimal generator (1) and is well-defined for all times , has stationary distribution , and satisfies strong continuity on with norm , that is, as for all .
Proof.
Consider the Lyapunov function . The strong log-concavity of , the Cauchy-Schwarz inequality, and the arithmetic-geometric mean inequality imply that
for some constants . Since is locally Lipschitz, [15, Thm. 3.5] implies that the diffusion is well-defined, and [21, Thm. 2.1] guarantees that is a stationary distribution. The argument of [13, Prop. 15] with [15, Thm. 3.5] substituted for [15, Thm. 3.4] and [10, Sec. 5, Cor. 1.2] now yields strong continuity. ∎
3.2 High-order weighted difference bounds
A second, technical lemma bounds the growth of weighted smooth function differences in terms of the proximity of function arguments. The result will be used to characterize the smoothness of as a function of the starting point (Lemma 3.3) and, ultimately, to establish the smoothness of (Theorem 2).
Lemma 3.2 (High-order weighted difference bounds). Fix any weights and any vectors . If , then
(5) 
Moreover, if , then
(6)  
Proof.
To establish the second-order difference bound (5), we first apply Taylor’s theorem with mean-value remainder to and to obtain
for some . Cauchy-Schwarz, the definition of the operator norm, and the Lipschitz gradient relation (4) now yield the advertised conclusion (5).
To derive the third-order difference bound (6), we apply Taylor’s theorem with mean-value remainder to , , , and to write
(7)  
for some . We will bound each line in this expression in turn. First we see, by Cauchy-Schwarz and the Lipschitz property (4), that
Next, we invoke our second-order difference bound (5) on the function , apply the Cauchy-Schwarz inequality, and use the definition of the operator norm to conclude that
To bound the subsequent line, we note that Cauchy-Schwarz, the definition of the operator norm, and the Lipschitz property (4) imply that
Similarly,
Finally, Cauchy-Schwarz and the definition of the operator norm give
Bounding the third-order difference (7) in terms of these four estimates yields (6). ∎
3.3 Synchronous coupling lemma
Our proof of Theorem 2 additionally rests upon a series of coupling inequalities which serve to characterize the smoothness of as a function of . The couplings espoused in the lemma to follow are termed synchronous, because the same Brownian motion is used to drive each process.
Lemma 3.3 (Synchronous coupling inequalities). Suppose that is strongly concave with and . Fix a $d$-dimensional Wiener process , any vectors with , and any weights , and define the growth factors
(8) 
For each starting point of the form with , , and , consider an overdamped Langevin diffusion solving the stochastic differential equation
(9) 
and define the differenced processes
These coupled processes almost surely satisfy the synchronous coupling bounds,
(10)  
(11)  
(12) 
the second-order differenced function bound,
(13)
and the third-order differenced function bound,
(14)  
for each , , and .
Proof.
By Lemma 3.1, each process with , , and is well-defined for all times . The first-order bound (10) is well known, and a concise proof can be found in [4].
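A small simulation can illustrate the first-order contraction under synchronous coupling. The sketch below assumes the diffusion solves $dZ_t = \frac{1}{2}\nabla \log p(Z_t)\, dt + dW_t$ and that the first-order bound has the familiar form $\|Z_{t,x} - Z_{t,x'}\|_2 \le e^{-kt/2}\|x - x'\|_2$ for a $k$-strongly log-concave target; both are assumptions, as are the example target, the Euler-Maruyama discretization, and all names. The two chains share every Brownian increment, so the noise cancels in their difference.

```python
import math
import random


def grad_log_p(z, k=1.0):
    # Example target (an assumption): log p(z) = -k ||z||^2 / 2 - sum_i log cosh(z_i) + const,
    # which is k-strongly concave with a (k + 1)-Lipschitz gradient.
    return [-k * zi - math.tanh(zi) for zi in z]


def distance(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def synchronous_coupling(x, x_prime, k=1.0, step=0.01, n_steps=500, seed=0):
    """Euler-Maruyama paths from x and x_prime driven by the same noise.

    Returns the list of distances between the two paths after each step.
    """
    rng = random.Random(seed)
    z, z_prime = list(x), list(x_prime)
    distances = [distance(z, z_prime)]
    for _ in range(n_steps):
        # One Brownian increment, shared by both chains.
        dw = [rng.gauss(0.0, math.sqrt(step)) for _ in z]
        gz, gzp = grad_log_p(z, k), grad_log_p(z_prime, k)
        z = [zi + 0.5 * step * gi + wi for zi, gi, wi in zip(z, gz, dw)]
        z_prime = [zi + 0.5 * step * gi + wi for zi, gi, wi in zip(z_prime, gzp, dw)]
        distances.append(distance(z, z_prime))
    return distances


k, step = 1.0, 0.01
distances = synchronous_coupling([2.0, -1.0], [-0.5, 3.0], k=k, step=step)
# Assumed continuous-time envelope exp(-k t / 2) * ||x - x'|| at times t = i * step.
bounds = [math.exp(-k * step * i / 2.0) * distances[0] for i in range(len(distances))]
print(all(dist <= bnd + 1e-12 for dist, bnd in zip(distances, bounds)))
```

For this step size the discretized chains contract at least as fast as the assumed continuous-time envelope, because each step shrinks every coordinate of the difference by a factor of at most `1 - k * step / 2`.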
Second-order bounds
To establish the second conclusion (11), we consider the Itô process of second-order differences
and apply Itô’s lemma to the mapping . This yields
Fix a value . For any , the Lemma 3.2 second-order difference inequality (5), the first-order coupling bound (10), Cauchy-Schwarz, and the Lipschitz identity (4) together give the estimates
(15)  
(16) 
Applying the estimate (15) to the function with yields
where, to achieve the second inequality, we used the strong log-concavity of . Now we may derive the second-order synchronous coupling bound (11), since
Applying the synchronous coupling bound (11) to the estimate (16) finally delivers the second-order differenced function bound (13).
Third-order bounds
To establish the third conclusion (12), we consider the Itô process of third-order differences
and invoke Itô’s lemma once more for the mapping . This produces
Fix a value , and introduce the shorthand and . For any , the Lemma 3.2 third-order difference inequality (6), the coupling bounds (10) and (11), Cauchy-Schwarz, and the Lipschitz identity (4) together imply the estimates
(17)
(18)
where we have applied the triangle inequality to achieve (17). Applying the bound (17) to the thrice continuously differentiable function with gives
In the final line, we used the strong log-concavity of . Our efforts now yield (12) via
The third-order differenced function bound (14) then follows by applying the third-order synchronous coupling bound (12) to the estimate (18). ∎
3.4 Proof of Theorem 2
By Lemma 3.1, for each , the overdamped Langevin diffusion is well-defined with stationary distribution . Moreover, for each , the diffusion , by definition, satisfies
for a $d$-dimensional Wiener process. In what follows, when considering the joint distribution of a finite collection of overdamped Langevin diffusions, we will assume that the diffusions are coupled in the manner of Lemma 3.3, so that each diffusion is driven by a shared $d$-dimensional Wiener process .
Fix any and any with bounded first, second, and third derivatives. We divide the remainder of our proof into five components, establishing that exists, is Lipschitz, has a Lipschitz gradient, has a Lipschitz Hessian, and solves the Stein equation (2).
Existence of $u_h$
To see that the integral representation of is well-defined, note that
The first relation uses the stationarity of , the second uses the Lipschitz relation (4), the third uses the first-order coupling inequality (10) of Lemma 3.3, and the last uses the fact that strongly log-concave distributions have subexponential tails and therefore finite moments of all orders [8, Lem. 1].
Lipschitz continuity of $u_h$
Lipschitz continuity of $\nabla u_h$
To demonstrate that is differentiable with Lipschitz gradient, we first establish a weighted second-order difference inequality for .
Lemma 3.4. For any vectors with and weights ,
(20) 
Proof.
Now, fix any with . As a first application of the Lemma 3.4 second-order difference inequality (20), we will demonstrate the existence of the directional derivative
(21) 
Indeed, Lemma 3.4 implies that, for any integers ,
Hence, the sequence is Cauchy, and the directional derivative (21) exists.
To see that the directional derivative (21) is also Lipschitz, fix any , and consider the bound
(22) 
where the second inequality follows from Lemma 3.4. Since each directional derivative is Lipschitz continuous, we may conclude that is continuously differentiable with Lipschitz continuous gradient . Our Lipschitz function deduction (19) and the Lipschitz relation (4) additionally supply the uniform bound
Lipschitz continuity of $\nabla^2 u_h$
To demonstrate that is differentiable with Lipschitz gradient, we begin by establishing a weighted third-order difference inequality for .
Fix any vectors with and weights , and define and as in (3.3) . Then,
(23)  