1 Introduction

# Multivariate Stein Factors for a Class of Strongly Log-concave Distributions

plus 0.3ex \SHORTTITLEMultivariate Stein Factors for a Class of Strongly Log-concave Distributions \TITLEMultivariate Stein Factors for a Class of Strongly Log-concave Distributions \AUTHORSLester Mackey1 and Jackson Gorham2\KEYWORDSStein’s method; Stein factors; multivariate log-concave distribution; overdamped Langevin diffusion; generator method; synchronous coupling; Stein discrepancy \AMSSUBJ 60J60; 62E17; 60E15 \SUBMITTEDApril 4, 2016 \ACCEPTEDAugust 8, 2016 \VOLUME0 \YEAR2012 \PAPERNUM0 \DOIvVOL-PID \ABSTRACTWe establish uniform bounds on the low-order derivatives of Stein equation solutions for a broad class of multivariate, strongly log-concave target distributions. These “Stein factor” bounds deliver control over Wasserstein and related smooth function distances and are well-suited to analyzing the computable Stein discrepancy measures of Gorham and Mackey. Our arguments of proof are probabilistic and feature the synchronous coupling of multiple overdamped Langevin diffusions.

## 1 Introduction

In 1972, Stein [22] introduced a powerful method for bounding the maximum expected discrepancy, , between a target distribution and an approximating distribution . Stein’s method classically proceeds in three steps:

1. First, one identifies a linear operator that generates mean-zero functions under the target distribution. A common choice for a continuous target on is the infinitesimal generator of the overdamped Langevin diffusion3 (also known as the Smoluchowski dynamics) [19, Secs. 6.5 and 4.5] with stationary distribution :

 (Au)(x)=12⟨∇u(x),∇logp(x)⟩+12⟨∇,∇u(x)⟩. (1)

Here, represents the density of with respect to Lebesgue measure.

2. Next, one shows that, for every test function in a convergence-determining class , the Stein equation

 (2)

admits a solution in a set of functions with uniformly bounded low-order derivatives. These uniform derivative bounds are commonly termed Stein factors.

3. Finally, one uses whatever tools necessary to upper bound the Stein discrepancy4

 (3)

which by construction upper bounds the reference metric .

To date, this recipe has been successfully used with the Langevin operator (1) to obtain explicit approximation error bounds for a wide variety of univariate targets  [see, e.g., 7, 6].5 The same operator has been used to analyze multivariate Gaussian approximation [2, 14, 20, 5, 17, 18], but few other multivariate distributions have established Stein factors. To extend the reach of the multivariate literature, we derive uniform Stein factor bounds for a broad class of strongly log-concave target distributions in Theorem 2. The result covers common Bayesian target distributions, including Bayesian logistic regression posteriors under Gaussian priors, and explicitly relates the Stein discrepancy (3) and practical Monte Carlo diagnostics based thereupon [12] to standard probability metrics, like the Wasserstein distance.

Notation and terminology We let denote the set of real-valued functions on with continuous derivatives. We further let denote the norm on and define the operator norms for vectors , for matrices , and for tensors . We say a function is -strongly concave for if

and we term a function -strongly log-concave if is -strongly concave. We finally let for all functions and define the Lipschitz constants

 Mk(h)

## 2 Stein factors for strongly log-concave distributions

Consider a target distribution on with strongly log-concave density . The following result bounds the derivatives of Stein equation solutions in terms of the smoothness of and the underlying test function . The proof, found in Section 3, is probabilistic, in the spirit of the generator method of Barbour [1] and Gotze [14], and features the synchronous coupling of multiple overdamped Langevin diffusions. {theorem}[Stein factors for strongly log-concave distributions] Suppose that is -strongly concave with and . For each , let represent the overdamped Langevin diffusion with infinitestimal generator (1) and initial state . Then, for each Lipschitz , the function

solves the the Stein equation (2) and satisfies

 M1(uh)≤ 2kM1(h),M2(uh)≤2L3k2M1(h)+1kM2(h), and M3(uh)≤

Theorem 2 implies that the Stein discrepancy (3) with set

bounds the smooth function distance for

Our next result shows that control over the smooth function distance also grants control over the -Wasserstein distance (also known as the Kantorovich-Rubenstein or earth mover’s distance), , and the bounded-Lipschitz metric, , which exactly metrizes convergence in distribution on . These metrics govern the test function classes

 W≜{h:Rd→R∣M1(h)≤1}andBL≜W∩{h:Rd→R∣supx∈Rd|h(x)|≤1}.
{lemma}

[Smooth-Wasserstein inequality] If and are probability measures on with finite means, and is a standard normal random vector, then

###### Proof.

The first inequality follows directly from the inclusions and .

To establish the second, we fix and and define the smoothed function

 ht(x)=∫Rdh(x+tz)ϕ(z)dzfor eachx∈Rd,

where is the density of a vector of independent standard normal variables. We first show that is a close approximation to when is small. Specifically, if is an integrable random vector, independent of , then, by the Lipschitz assumption on ,

We next show that the derivatives of are bounded. Fix any . Since is Lipschitz, it admits a weak gradient, , bounded uniformly by 1 in . We alternate differentiation and integration by parts to develop the representations

 ∇ht(x) =∫Rd∇h(x+tz)ϕ(z)dz=1t∫Rdzh(x+tz)ϕ(z)dz, ∇2ht(x) =1t∫Rd∇h(x+tz)z⊤ϕ(z)dz=1t2∫Rd(zz⊤−I)h(x+tz)ϕ(z)dz,and ∇3ht(x)[v] =1t2∫Rd∇h(x+tz)v⊤(zz⊤−I)ϕ(z)dz

for each . The uniform bound on now yields ,

 M2(ht) M3(ht)

In the final equality we have used the fact that and are jointly normal with zero mean and covariance , so that the product has the distribution of the off-diagonal element of the Wishart distribution [23] with scale and degree of freedom.

We can now develop a bound for using our smoothed functions. Let

represent the maximum derivative bound of , and select and to satisfy . If we let , we then have

 dW(μ,ν)

where we have chosen to achieve the third inequality. ∎

{remark}

While Lemma 2 targets Lipschitz test functions, comparable results can be obtained for non-smooth functions, like the indicators of convex sets, by adapting the smoothing technique of [3, Lem. 2.1].

### 2.1 Example application to Bayesian logistic regression

Before turning to the proof of Theorem 2, we illustrate a practical application to measuring the quality of Monte Carlo or cubature sample points in Bayesian inference. Consider the Bayesian logistic regression posterior density [see, e.g., 11]

based on observed datapoints and a known prior hyperparameter . In this standard model of binary classification, represents our inferential target, an unknown parameter vector with a multivariate Gaussian prior; is the class label of the -th observed datapoint; and is an associated vector of covariates.

Since the normalizing constant of is unknown, it is common practice to approximate expectations under with sample estimates, , based on sample points drawn from a Markov chain or a cubature rule [11]. Theorem 2 furnishes a way to uniformly bound the error of this approximation, , for all sufficiently smooth functions .

Concretely, we have, for all unit vectors ,

 u⊤1∇2logp(β)u1 =−1/σ2−L∑l=1e⟨β,vl⟩(1+e⟨β,vl⟩)2⟨vl,u1⟩2≤−1/σ2, ∇3logp(β)[u1,u2,u3] ∇4logp(β)[u1,u2,u3,u4]

Hence, Theorem 2 applies with and . We may now plug the associated Stein factors

into the non-uniform graph Stein discrepancy of [12] to obtain a computable upper bound on or for any discrete probability measure .

## 3 Proof of Theorem 2

Before tackling the main proof, we will establish a series of useful lemmas. We will make regular use of the following well-known Lipschitz property:

 Mk(h) (4)

### 3.1 Properties of overdamped Langevin diffusions

Our first lemma enumerates several properties of the overdamped Langevin diffusion that will prove useful in the proofs to follow. {lemma}[Overdamped Langevin properties] If is strongly concave, then the overdamped Langevin diffusion with infinitesimal generator (1) and is well-defined for all times , has stationary distribution , and satisfies strong continuity on with norm , that is, as for all .

###### Proof.

Consider the Lyapunov function . The strong log-concavity of , the Cauchy-Schwarz inequality, and the arithmetic-geometric mean inequality imply that

 (AV)(x)=⟨x,∇logp(x)⟩+d=⟨x,∇logp(x)−∇logp(0)⟩+⟨x,∇logp(0)⟩+d

for some constants . Since is locally Lipschitz, [15, Thm. 3.5] implies that the diffusion is well-defined, and [21, Thm. 2.1] guarantees that is a stationary distribution. The argument of [13, Prop. 15] with [15, Thm. 3.5] substituted for [15, Thm. 3.4] and [10, Sec. 5, Cor. 1.2] now yields strong continuity. ∎

### 3.2 High-order weighted difference bounds

A second, technical lemma bounds the growth of weighted smooth function differences in terms of the proximity of function arguments. The result will be used to characterize the smoothness of as a function of the starting point (Lemma 3.3) and, ultimately, to establish the smoothness of (Theorem 2). {lemma}[High-order weighted difference bounds] Fix any weights and any vectors . If , then

 |λ(h(x)−h(y))−λ′(h(x′)−h(y′))−⟨∇h(y),λ(x−y)−λ′(x′−y′)⟩| (5)

Moreover, if , then

 |λ(h(x)−h(y)−(h(z)−h(w)))−λ′(h(x′)−h(y′)−(h(z′)−h(w′))) −⟨∇h(z),λ(x−y−(z−w))−λ′(x′−y′−(z′−w′))⟩| (6)
###### Proof.

To establish the second-order difference bound (5), we first apply Taylor’s theorem with mean-value remainder to and to obtain

 λ(h(x)−h(y))−λ′(h(x′)−h(y′))−⟨∇h(y),λ(x−y)−λ′(x′−y′)⟩ = λ′⟨∇h(y)−∇h(y′),x′−y′⟩+(λ⟨∇2h(ζ)(x−y),x−y⟩−λ′⟨∇2h(ζ′)(x′−y′),x′−y′⟩)/2

for some . Cauchy-Schwarz, the definition of the operator norm, and the Lipschitz gradient relation (4) now yield the advertised conclusion (5).

To derive the third-order difference bound (6), we apply Taylor’s theorem with mean-value remainder to , , , and to write

 |λ(h(x)−h(y)−(h(z)−h(w)))−λ′(h(x′)−h(y′)−(h(z′)−h(w′))) −⟨∇h(z),λ(x−y−(z−w))−λ′(x′−y′−(z′−w′))⟩| (7) =|λ′⟨∇h(z)−∇h(z′),x′−y′−(z′−w′)⟩+λ⟨∇h(z)−∇h(x),(y−x)−(y′−x′)⟩ +⟨λ(∇h(z)−∇h(x))−λ′(∇h(z′)−∇h(x′)),y′−x′⟩ +λ⟨∇2h(z)(w−z),w−z⟩/2−λ⟨∇2h(x)(y−x),y−x⟩/2 −λ′⟨∇2h(z′)(w′−z′),w′−z′⟩/2+λ′⟨∇2h(x′)(y′−x′),y′−x′⟩/2 +λ∇3h(ζ′′)[w−z,w−z,w−z]/6−λ∇3h(ζ′′′′)[y−x,y−x,y−x]/6 −λ′∇3h(ζ′′′)[w′−z′,w′−z′,w′−z′]/6+λ′∇3h(ζ′′′′′)[y′−x′,y′−x′,y′−x′]/6|

for some . We will bound each line in this expression in turn. First we see, by Cauchy-Schwarz and the Lipschitz property (4), that

 |λ′⟨∇h(z)−∇h(z′),x′−y′−(z′−w′)⟩+λ⟨∇h(z)−∇h(x),(y−x)−(y′−x′)⟩|

Next, we invoke our second-order difference bound (5) on the function , apply the Cauchy-Schwarz inequality, and use the definition of the operator norm to conclude that

 |⟨λ(∇h(z)−∇h(x))−λ′(∇h(z′)−∇h(x′)),y′−x′⟩|

To bound the subsequent line, we note that Cauchy-Schwarz, the definition of the operator norm, and the Lipschitz property (4) imply that

 |⟨∇2h(z)(w−z),w−z⟩−⟨∇2h(x)(y−x),y−x⟩| =|⟨∇2h(z)(w−z+y−x),x−y−(z−w)⟩+⟨(∇2h(z)−∇2h(x))(y−x),y−x⟩|

Similarly,

 |⟨∇2h(z′)(w′−z′),w′−z′⟩−⟨∇2h(x′)(y′−x′),y′−x′⟩|

Finally, Cauchy-Schwarz and the definition of the operator norm give

 |λ∇3h(ζ′′)[w−z,w−z,w−z]−λ∇3h(ζ′′′′)[y−x,y−x,y−x] −λ′∇3h(ζ′′′)[w′−z′,w′−z′,w′−z′]+λ′∇3h(ζ′′′′′)[y′−x′,y′−x′,y′−x′]|

Bounding the third-order difference (7) in terms of these four estimates yields (6). ∎

### 3.3 Synchronous coupling lemma

Our proof of Theorem 2 additionally rests upon a series of coupling inequalities which serve to characterize the smoothness of as a function of . The couplings espoused in the lemma to follow are termed synchronous, because the same Brownian motion is used to drive each process.

{lemma}

[Synchronous coupling inequalities] Suppose that is -strongly concave with and . Fix a -dimensional Wiener process , any vectors with , and any weights , and define the growth factors

 f1(x,x′,ϵ,ϵ′,ϵ′′) f2(x,x′,ϵ,ϵ′,ϵ′′) (8)

For each starting point of the form with , , and , consider an overdamped Langevin diffusion solving the stochastic differential equation

 dZt,z+b′v′+bv =12∇logp(Zt,z+b′v′+bv)dt+dWtwithZ0,z+b′v′+bv=z+b′v′+bv, (9)

and define the differenced processes

 Vt ≜(Zt,x′+ϵ′′v′−Zt,x′)/ϵ′′−(Zt,x+ϵ′v′−Zt,x)/ϵ′and Ut ≜Zt,x′+ϵ′′v′+ϵv−Zt,x′+ϵ′′v′−(Zt,x′+ϵv−Zt,x′)/ϵϵ′′ −Zt,x+ϵ′v′+ϵv−Zt,x+ϵ′v′−(Zt,x+ϵv−Zt,x)/ϵϵ′.

These coupled processes almost surely satisfy the synchronous coupling bounds,

 ekt/2 (10) ekt/2 (11) ekt/2 (12)

the second-order differenced function bound,

 (h2(Zt,x′+ϵ′′v′)−h2(Zt,x′))/ϵ′′−(h2(Zt,x+ϵ′v′)−h2(Zt,x))/ϵ′ (13) ≤

and the third-order differenced function bound,

 (h3(Zt,x′+ϵ′′v′+ϵv)−h3(Zt,x′+ϵ′′v′)−(h3(Zt,x′+ϵv)−h3(Zt,x′)))/(ϵϵ′′) −(h3(Zt,x+ϵ′v′+ϵv)−h3(Zt,x+ϵ′v′)−(h3(Zt,x+ϵv)−h3(Zt,x)))/(ϵϵ′) (14) ≤ +

for each , , and .

###### Proof.

By Lemma 3.1, each process with , , and is well-defined for all times . The first-order bound (10) is well known, and a concise proof can be found in [4].

#### Second-order bounds

To establish the second conclusion (11), we consider the Itô process of second-order differences

 Vt =12∫t0∇logp(Zs,x′+ϵ′′v′)−∇logp(Zs,x′)ϵ′′−∇logp(Zs,x+ϵ′v′)−∇logp(Zs,x)ϵ′ds

and apply Itô’s lemma to the mapping . This yields

Fix a value . For any , the Lemma 3.2 second-order difference inequality (5), the first order coupling bound (10), Cauchy-Schwarz, and the Lipschitz identity (4) together give the estimates

 (h2(Zs,x′+ϵ′′v′)−h2(Zs,x′))/ϵ′′−(h2(Zs,x+ϵ′v′)−h2(Zs,x))/ϵ′ ≤ ≤ (15) ≤ (16)

Applying the estimate (15) to the function with yields

 ⟨Vs,(∇logp(Zs,x′+ϵ′′v′)−∇logp(Zs,x′))/ϵ′′−(∇logp(Zs,x+ϵ′v′)−∇logp(Zs,x))/ϵ′⟩ ≤ ≤

where, to achieve the second inequality, we used the -strong log-concavity of . Now we may derive the second-order synchronous coupling bound (11), since

Applying the synchronous coupling bound (11) to the estimate (16) finally delivers the second-order differenced function bound (13).

#### Third-order bounds

To establish the third conclusion (12), we consider the Itô process of third-order differences and invoke Itô’s lemma once more for the mapping . This produces Fix a value , and introduce the shorthand and . For any , the Lemma 3.2 third-order difference inequality (6), the coupling bounds (10) and (11), Cauchy-Schwarz, and the Lipschitz identity (4) together imply the estimates (17) (18) where we have applied the triangle inequality to achieve (17). Applying the bound (17) to the thrice continuously differentiable function with gives In the final line, we used the -strong log-concavity of . Our efforts now yield (12) via The third-order differenced function bound (3.3) then follows by applying the third-order synchronous coupling bound (12) to the estimate (18). ∎

### 3.4 Proof of Theorem 2

By Lemma 3.1, for each , the overdamped Langevin diffusion is well-defined with stationary distribution . Moreover, for each , the diffusion , by definition, satisfies

 dZt,x=12∇logp(Zt,x)dt+dWtwithZ0,x=x,

for a -dimensional Wiener process. In what follows, when considering the joint distribution of a finite collection of overdamped Langevin diffusions, we will assume that the diffusions are coupled in the manner of Lemma 3.3, so that each diffusion is driven by a shared -dimensional Wiener process .

Fix any and any with bounded first, second, and third derivatives. We divide the remainder of our proof into five components, establishing that exists, is Lipschitz, has a Lipschitz gradient, has a Lipschitz Hessian, and solves the Stein equation (2).

#### Existence of uh

To see that the integral representation of is well-defined, note that

The first relation uses the stationarity of , the second uses the Lipschitz relation (4), the third uses the first-order coupling inequality (10) of Lemma 3.3, and the last uses the fact that strongly log-concave distributions have subexponential tails and therefore finite moments of all orders [8, Lem. 1].

#### Lipschitz continuity of uh

We next show that is Lipschitz. Fix any vector , and consider the difference

 |uh(x+v)−uh(x)| (19)

The second relation is an application of the Lipschitz relation (4), and the third applies the first-order coupling inequality (10) of Lemma 3.3.

#### Lipschitz continuity of ∇uh

To demonstrate that is differentiable with Lipschitz gradient, we first establish a weighted second-order difference inequality for .

{lemma}

For any vectors with and weights ,

 (20)
###### Proof.

We apply the Lemma 3.3 second-order function coupling inequality (13) to obtain

 = ≤

The desired bound follows by integrating the final expression. ∎

Now, fix any with . As a first application of the Lemma 3.4 second-order difference inequality (20), we will demonstrate the existence of the directional derivative

 ∇vuh(x)≜limϵ→0uh(x+ϵv)−uh(x)ϵ. (21)

Indeed, Lemma 3.4 implies that, for any integers ,

 ≤

Hence, the sequence is Cauchy, and the directional derivative (21) exists.

To see that the directional derivative (21) is also Lipschitz, fix any , and consider the bound

 ≤ (22)

where the second inequality follows from Lemma 3.4. Since each directional derivative is Lipschitz continuous, we may conclude that is continuously differentiable with Lipschitz continuous gradient . Our Lipschitz function deduction (19) and the Lipschitz relation (4) additionally supply the uniform bound

#### Lipschitz continuity of ∇2uh

To demonstrate that is differentiable with Lipschitz gradient, we begin by establishing a weighted third-order difference inequality for .

{lemma}

Fix any vectors with and weights , and define and as in (3.3) . Then,

 |(uh(x′+ϵ′′v′+ϵv)−uh(x′+ϵ′′v′)−(uh(x′+ϵv)−uh(x′)))/(ϵϵ′′) −(uh(x+ϵ′v′+ϵv)−uh(x+ϵ′v′)−(uh(x+ϵv)−uh(x)))/(ϵϵ′)| (23) ≤
###### Proof.

Introduce the shorthand and . We apply the Lemma 3.3 third-order function coupling inequality (3.3) to the thrice continuously differentiable function to obtain

 |(uh(x′+ϵ′′v′+ϵv)−uh(x′+ϵ′′v′)−(uh(x′+ϵv)−uh(x′)))/(ϵϵ′′) −(uh(x+ϵ′v′+ϵv)−uh(x+ϵ′v′)−(uh(x+ϵv)−uh(x)))/(ϵϵ′)| = ≤

Integrating this final expression yields the advertised bound. ∎

Now, fix any with . As a first application of the Lemma 3.4 third-order difference inequality (3.4), we will demonstrate the existence of the second-order directional derivative

 ∇v′∇vuh(x) ≜limϵ′→0∇vuh(x+ϵ′v′)−∇vuh(x)ϵ′ (24) =limϵ′→0limϵ→0uh(x+ϵ′v′+ϵv)−uh(x+ϵv)−(uh(x+ϵ′v′)−uh(x))ϵϵ′.

Lemma