Error bounds for Metropolis–Hastings algorithms applied to perturbations of Gaussian measures in high dimensions

Error bounds for Metropolis–Hastings algorithms applied to perturbations of Gaussian measures in high dimensions

\fnmsAndreas \snmEberle\correflabel=e1]eberle@uni-bonn.de [ Institut für Angewandte Mathematik
Universität Bonn
Endenicher Allee 60
53115 Bonn
Germany
\printeade1
Universität Bonn
\smonth11 \syear2012\smonth2 \syear2013
\smonth11 \syear2012\smonth2 \syear2013
\smonth11 \syear2012\smonth2 \syear2013
Abstract

The Metropolis-adjusted Langevin algorithm (MALA) is a Metropolis–Hastings method for approximate sampling from continuous distributions. We derive upper bounds for the contraction rate in Kantorovich–Rubinstein–Wasserstein distance of the MALA chain with semi-implicit Euler proposals applied to log-concave probability measures that have a density w.r.t. a Gaussian reference measure. For sufficiently “regular” densities, the estimates are dimension-independent, and they hold for sufficiently small step sizes that do not depend on the dimension either. In the limit , the bounds approach the known optimal contraction rates for overdamped Langevin diffusions in a convex potential.

A similar approach also applies to Metropolis–Hastings chains with Ornstein–Uhlenbeck proposals. In this case, the resulting estimates are still independent of the dimension but less optimal, reflecting the fact that MALA is a higher order approximation of the diffusion limit than Metropolis–Hastings with Ornstein–Uhlenbeck proposals.

[
\kwd
\doi

10.1214/13-AAP926 \volume24 \issue1 2014 \firstpage337 \lastpage377 \newproclaimdefi[prop]Definition \newproclaimrmk[prop]Remark \newproclaimex[prop]Example \newproclaimassumption[prop]Assumption

\runtitle

Error bounds for Metropolis–Hastings algorithms

{aug}

class=AMS] \kwd[Primary ]60J22 \kwd[; secondary ]60J05 \kwd65C40 \kwd65C05

Metropolis algorithm \kwdMarkov chain Monte Carlo \kwdLangevin diffusion \kwdEuler scheme \kwdcoupling \kwdcontractivity of Markov kernels

1 Introduction

The performance of Metropolis–Hastings (MH) methods MRTT , H , RC for sampling probability measures on high-dimensional continuous state spaces has attracted growing attention in recent years. The pioneering works by Roberts, Gelman and Gilks RGG and Roberts and Rosenthal RR show in particular that for product measures on , the average acceptance probabilities for the Random Walk Metropolis algorithm (RWM) and the Metropolis adjusted Langevin algorithm (MALA) converge to a strictly positive limit as only if the step sizes go to zero of order , , respectively. In this case, a diffusion limit as has been derived, leading to an optimal scaling of the step sizes maximizing the speed of the limiting diffusion, and an asymptotically optimal acceptance probability.

Recently, the optimal scaling results for RWM and MALA have been extended significantly to targets that are not of product form but have a sufficiently regular density w.r.t. a Gaussian measure; cf. MPS , PSTb . On the other hand, it has been pointed out BRSV , BS , HSV , CRSW that for corresponding perturbations of Gaussian measures, the acceptance probability has a strictly positive limit as for small step sizes that do not depend on the dimension, provided the random walk or Euler proposals in RWM and MALA are replaced by Ornstein–Uhlenbeck or semi-implicit (“preconditioned”) Euler proposals, respectively; cf. also below. Pillai, Stuart and Thiéry PST show that in this case, the Metropolis–Hastings algorithm can be realized directly on an infinite-dimensional Hilbert space arising in the limit as , and the corresponding Markov chain converges weakly to an infinite-dimensional overdamped Langevin diffusion as .

Mixing properties and convergence to equilibrium of Langevin diffusions have been studied intensively Has , MT , B , BCG , PZ , Roy . In particular, it is well-known that contractivity and exponential convergence to equilibrium in Wasserstein distance can be quantified if the stationary distribution is strictly log-concave CL , RS ; cf. also E11 for a recent extension to the nonlog-concave case. Because of the diffusion limit results, one might expect that the approximating Metropolis–Hastings chains have similar convergence properties. However, this heuristics may also be wrong, since the convergence of the Markov chains to the diffusion is known only in a weak and nonquantitative sense.

Although there is a huge number of results quantifying the speed of convergence to equilibrium for Markov chains on discrete state spaces (cf. P , S-C for an overview), there are relatively few quantitative results on Metropolis–Hastings chains on when is large. The most remarkable exception are the well-known works DFK , KLS , LV , LV1 , LV2 which prove an upper bound for the mixing time that is polynomial in the dimension for Metropolis chains with ball walk proposals for uniform measures on convex sets and more general log-concave measures.

Below, we develop an approach to quantify Wasserstein contractivity and convergence to equilibrium in a dimension-independent way for the Metropolis–Hastings chains with Ornstein–Uhlenbeck and semi-implicit Euler proposals. Our approach applies in the strictly log-concave case (or, more generally, if the measure is strictly log-concave on an appropriate ball) and yields bounds for small step sizes that are very precise. The results for semi-implicit Euler proposals require less restrictive assumptions than those for Ornstein–Uhlenbeck proposals, reflecting the fact that the corresponding Markov chain is a higher order approximation of the diffusion.

Our results are closely related and complementary to the recent work HSV11 , and to the dimension-dependent geometric ergodicity results in B-RHV . In particular, in HSV11 , Hairer, Stuart and Vollmer apply related methods to establish exponential convergence to equilibrium in Wasserstein distance for Metropolis–Hastings chains with Ornstein–Uhlenbeck proposals in a less quantitative way, but without assuming log-concavity. In the context of probability measures on function spaces, the techniques developed here are applied in the PhD Thesis of Gruhlke G .

We now recall some basic facts on Metropolis–Hastings algorithms and describe our setup and the main results. Sections 2 and 3 contain basic results on Wasserstein contractivity of Metropolis–Hastings kernels, and contractivity of the proposal kernels. In Sections 4 and 5, we prove bounds quantifying rejection probabilities and the dependence of the rejection event on the current state for Ornstein–Uhlenbeck and semi-implicit Euler proposals. These bounds, combined with an upper bound for the exit probability of the corresponding Metropolis–Hastings chains from a given ball derived in Section 6 are crucial for the proof of the main results in Section 7.

1.1 Metropolis–Hastings algorithms

Let be a lower bounded measurable function such that

and let denote the probability measure on with density proportional to . We use the same letter for the measure and its density, that is,

(1)

Below, we view the measure defined by (1) as a perturbation of the standard normal distribution in ; that is, we decompose

(2)

with a measurable function , and obtain the representation

(3)

with normalization constant . Here denotes the Euclidean norm.

Note that in , any probability measure with a strictly positive density can be represented as an absolutely continuous perturbation of as in (3). In an infinite-dimensional limit, however, the density may degenerate. Nevertheless, also on infinite-dimensional spaces, absolutely continuous perturbations of Gaussian measures form an important and widely used class of models.

{ex}

[(Transition path sampling)] We briefly describe a typical application; cf. HSV and G for details. Suppose that we are interested in sampling a trajectory of a diffusion process in conditioned to a given endpoint at time . We assume that the unconditioned diffusion process satisfies a stochastic differential equation of the form

(4)

where is an -dimensional Brownian motion, and is bounded from below. Then, by Girsanov’s theorem and Itô’s formula, a regular version of the law of the conditioned process satisfying and on the path space is given by

(5)

where is the law of the Brownian bridge from to ,

(6)

and ; cf. Roy . In order to obtain finite-dimensional approximations of the measure on , we consider the Wiener–Lévy expansion

(7)

of a path in terms of the basis functions and with . Here the coefficients , , , , are real numbers. Recall that truncating the series at corresponds to taking the polygonal interpolation of the path adapted to the dyadic partition of the interval . Now fix , let and let

denote the vector consisting of the first components in the basis expansion of a path . Then the image of the Brownian bridge measure under the projection that maps to is the -dimensional standard normal distribution ; for example, cf. Ste . Therefore, a natural finite-dimensional approximation to the infinite-dimensional sampling problem described above consists in sampling from the probability measure

(8)

on where is a normalization constant, and

(9)

with denoting the polygonal path corresponding to .

Returning to our general setup, suppose that is an absolutely continuous transition kernel on with strictly positive densities . Let

(10)

Note that does not depend on . The Metropolis–Hastings algorithm with proposal kernel is the following Markov chain Monte Carlo method for approximate sampling and Monte Carlo integration w.r.t. :

{longlist}

[(1)]

Choose an initial state .

For :

  • Sample and independently.

  • If , then accept the proposal, and set , else reject the proposal and set .

The algorithm generates a time-homogeneous Markov chain with initial state and transition kernel

(11)

Here

(12)

is the average rejection probability for the proposal when the Markov chain is at . Note that restricted to is again absolutely continuous with density

Since

is a symmetric function in and , the kernel satisfies the detailed balance condition

(13)

In particular, is a stationary distribution for the Metropolis–Hastings chain, and the chain with initial distribution is reversible. Therefore, under appropriate ergodicity assumptions, the distribution of will converge to as .

To analyze Metropolis–Hastings algorithms it is convenient to introduce the function

(14)

For any ,

(15)

In particular, for any ,

(16)
(17)
(18)

The function defined by (14) can also be represented in terms of : Indeed, since

we have

(19)

where denotes the standard normal density in .

1.2 Metropolis–Hastings algorithms with Gaussian proposals

We aim at proving contractivity of Metropolis–Hastings kernels w.r.t. appropriate Kantorovich–Rubinstein–Wasserstein distances. For this purpose, we are looking for a proposal kernel that has adequate contractivity properties and sufficiently small rejection probabilities. The rejection probability is small if the proposal kernel approximately satisfies the detailed balance condition w.r.t. .

1.2.1 Ornstein–Uhlenbeck proposals

A straightforward approach would be to use a proposal density that satisfies the detailed balance condition

(20)

w.r.t. the standard normal distribution. In this case,

(21)

The simplest form of proposal distributions satisfying (20) are the transition kernels of AR (discrete Ornstein–Uhlenbeck) processes given by

(22)

for some constant . If is a standard normally distributed -valued random variable, then the random variables

(23)

have distributions . Note that by (21), the acceptance probabilities

(24)

for Ornstein–Uhlenbeck proposals do not depend on .

1.2.2 Euler proposals

In continuous time, under appropriate regularity and growth conditions on , detailed balance w.r.t. is satsfied exactly by the transition functions of the diffusion process solving the over-damped Langevin stochastic differential equation

(25)

because the generator

is a self-adjoint operator on an appropriate dense subspace of ; cf. Roy . Although we cannot compute and sample from the transition functions exactly, we can use approximations as proposals in a Metropolis–Hastings algorithm. A corresponding MH algorithm where the proposals are obtained from a discretization scheme for the SDE (25) is called a Metropolis-adjusted Langevin algorithm (MALA); cf. RT , RC .

In this paper, we focus on the MALA scheme with proposal kernel

(26)

for some constant ; that is, is the distribution of

where is a standard normal random variable with values in .

Note that if is replaced by , then (1.2.2) is a standard Euler discretization step for the SDE (25). Replacing by ensures that detailed balance is satisfied exactly for . Alternatively, (1.2.2) can be viewed as a semi-implicit Euler discretization step for (25):

{rmk}

[(Euler schemes)] The explicit Euler discretization of the over-damped Langevin equation (25) with time step size is given by

(28)

where , are i.i.d. random variables with distribution . The process defined by (28) is a time-homogeneous Markov chain with transition kernel

(29)

Even for , the measure is not a stationary distribution for the kernel . A semi-implicit Euler scheme for (25) with time-step size is given by

(30)

with i.i.d. with distribution ; cf. HSV . Note that the scheme is implicit only in the linear part of the drift but explicit in . Solving for in (30) and substituting with yields the equivalent equation

(31)

We call the Metropolis–Hastings algorithm with proposal kernel a semi-implicit MALA scheme with step size .

Proposition 1.1 ((Acceptance probabilities for semi-implicit MALA))

Let and . Then the acceptance probabilities for the Metropolis-adjusted Langevin algorithm with proposal kernels are given by with

(32)

For explicit Euler proposals with step size , a corresponding representation holds with

The proof of the proposition is given in Section 4 below.

{rmk}

For explicit Euler proposals, the correction term in (1.1) does not vanish for . More significantly, this term goes to infinity as , and the variance of w.r.t. the proposal distribution is of order .

1.3 Bounds for rejection probabilities

We fix a norm on such that

(34)

We assume that is sufficiently smooth w.r.t. with derivatives growing at most polynomially:

{assumption}

The function is in , and for any , there exist finite constants , such that

holds for any and . For discretizations of infinite-dimensional models, will typically be a finite-dimensional approximation of a norm that is almost surely finite w.r.t. the limit measure in infinite-dimensions.

{ex}

[(Transition path sampling)] Consider the situation of Example 1.1, and assume that is in . Then by (9) and (6), is . For and , the directional derivatives of are given by

(35)

where are the polygonal paths in corresponding to , respectively, for and . Assuming for some integer as , we can estimate

where , , is a discrete norm of the polygonal path and are finite constants that do not depend on the dimension . One could now choose for the minus norm the norm on corresponding to the discrete norm on polygonal paths. However, it is more convenient to choose a norm coming from an inner product. To this end, we consider the norms

on path space , and the induced norms

on where . One can show that for , the norm can be bounded from above by independently of the dimension; cf. G . On the other hand, if , then for -almost every path of the Brownian bridge. This property will be crucial when restricting to balls w.r.t. . For with , both requirements are satisfied, and Assumption 1.3 holds with constants that do not depend on the dimension.

The next proposition yields in particular an upper bound for the average rejection probability w.r.t. both Ornstein–Uhlenbeck and semi-implicit Euler proposals at a given position ; cf. BouRabee for an analogue result:

Proposition 1.2 ((Upper bounds for MH rejection probabilities))

Suppose that Assumption 1.3 is satisfied and let . Then there exist polynomials and of degrees , , respectively, such that for any and ,

The result is a consequence of Proposition 1.1. The proof is given in Section 4 below.

{rmk}

(1) The polynomials and in Proposition 1.2 are explicit; cf. the proof below. They depend only on the values in Assumption 1.3 for , , respectively, and on the moments

(36)

but they do not depend on the dimension . For semi-implicit Euler proposals, the upper bound in Proposition 1.2 is stated in explicit form for the case and in (4) below.

(2) For explicit Euler proposals, corresponding estimates hold with replaced by ; cf. Remark 4 below. Note, however, that as .

Our next result is a bound of order , , respectively, for the average dependence of the acceptance event on the current state w.r.t. Ornstein–Uhlenbeck and semi-implicit Euler proposals. Let denote the dual norm of on , that is,

Note that

For a function ,

that is, the plus norm of determines the Lipschitz constant w.r.t. the minus norm.

Proposition 1.3 ((Dependence of rejection on the current state))

Suppose that Assumption 1.3 is satisfied, and let . Then there exist polynomials and of degrees , , respectively, such that for any and ,

(37)
(38)
(39)
(40)

where denotes the line segment between and .

The proof of the proposition is given in Section 5 below.

{rmk}

Again, the polynomials and are explicit. They depend only on the values in Assumption 1.3 for , , respectively, and on the moments for , , respectively, but they do not depend on the dimension . For semi-implicit Euler proposals, the upper bound in Proposition 1.3 is made explicit for the case and in (5) below.

For Ornstein–Uhlenbeck proposals, it will be useful to state the bounds in Propositions 1.2 and 1.3 more explicitly for the case , that is, when the second derivatives of are uniformly bounded w.r.t. the minus norm:

Proposition 1.4

Suppose that Assumption 1.3 is satisfied for with . Then for any and ,

and

The proof is given in Sections 4 and 5 below. Again, corresponding bounds also hold for norms for .

1.4 Wasserstein contractivity

The bounds in Propositions 1.2, 1.3 and 1.4 can be applied to study contractivity properties of Metropolis–Hastings transition kernels. Recall that the Kantorovich–Rubinstein or -Wasserstein distance of two probability measures and on the Borel -algebra w.r.t. a given metric  on is defined by

where consists of all couplings of and , that is, all probability measures on with marginals and ; cf., for example, Villani . Recall that a coupling of and can be realized by random variables and defined on a joint probability space such that and .

In order to derive upper bounds for the distances , and, more generally, , we define a coupling of the MALA transition probabilities , by setting

Here , , is the basic coupling of the proposal distributions defined by (1.2.2) with , and the random variable is uniformly distributed in and independent of .

Correspondingly, we define a coupling of the Metropolis–Hastings transition kernels based on Ornstein–Uhlenbeck proposals by setting

Let

denote the centered ball of radius w.r.t. . As a consequence of Proposition 1.4 above, we obtain the following upper bound for the Kantorovich–Rubinstein–Wasserstein distance of and w.r.t. the metric :

Theorem 1.5 ((Contractivity of MH transitions based on OU proposals))

Suppose that Assumption 1.3 is satisfied for with . Then for any , , and ,

where

with an explicit constant that only depends on the values and .

The proof is given in Section 7 below.

Theorem 1.5 shows that Wasserstein contractivity holds on the ball provided and is chosen sufficiently small depending on [with ]. In this case, the contraction constant depends on the dimension only through the values of the constants and . On the other hand, the following one-dimensional example shows that for , the acceptance-rejection step may destroy the contraction properties of the OU proposals:

{ex}

Suppose that and . If with a constant , then by Theorem 1.5, Wasserstein contractivity holds for the Metropolis–Hastings chain with Ornstein–Uhlenbeck proposals on the interval provided is chosen sufficiently small. On the other hand, if for with a constant , then the logarithmic density

is strictly concave for , and it can be easily seen that Wasserstein contractivity on does not hold for the MH chain with OU proposals if is sufficiently small.

A disadvantage of the result for Ornstein–Uhlenbeck proposals stated above is that not only a lower bound on the second derivative of is required (this would be a fairly natural condition as the example indicates), but also an upper bound of the same size. For semi-implicit Euler proposals, we can derive a better result that requires only a strictly positive lower bound on the second derivative of and Assumption 1.3 with arbitrary constants to be satisfied. For this purpose we assume that

for an inner product on , and we make the following assumption on :

{assumption}

There exists a strictly positive constant such that

(41)

Of course, Assumption 1.4 is still restrictive, and it will often be satisfied only in a suitable ball around a local minimum of . Most of the results below are stated on a given ball w.r.t. the minus norm. In this case it is enough to assume that 1.4 holds on that ball. If coincides with the Euclidean norm , then the assumption is equivalent to convexity of . Moreover, since , a sufficient condition for (41) to hold is

(42)

As a consequence of Propositions 1.2 and 1.3 above, we obtain the following upper bound for the Kantorovich–Rubinstein–Wasserstein distance of and w.r.t. the metric :

Theorem 1.6 ((Contractivity of semi-implicit MALA transitions))

Suppose that Assumptions 1.3 and 1.4 are satisfied. Then for any , and ,

where

with