A Proof of Lemma 1

# Optimal scaling for the transient phase of Metropolis Hastings algorithms: The longtime behavior

## Abstract

We consider the Random Walk Metropolis algorithm on with Gaussian proposals, and when the target probability measure is the -fold product of a one-dimensional law. It is well known (see Roberts et al. (Ann. Appl. Probab. 7 (1997) 110–120)) that, in the limit , starting at equilibrium and for an appropriate scaling of the variance and of the timescale as a function of the dimension , a diffusive limit is obtained for each component of the Markov chain. In Jourdain et al. (Optimal scaling for the transient phase of the random walk Metropolis algorithm: The mean-field limit (2012) Preprint), we generalize this result when the initial distribution is not the target probability measure. The obtained diffusive limit is the solution to a stochastic differential equation nonlinear in the sense of McKean. In the present paper, we prove convergence to equilibrium for this equation. We discuss practical counterparts in order to optimize the variance of the proposal distribution to accelerate convergence to equilibrium. Our analysis confirms the interest of the constant acceptance rate strategy (with acceptance rate between and ) first suggested in Roberts et al. (Ann. Appl. Probab. 7 (1997) 110–120).

We also address scaling of the Metropolis-Adjusted Langevin Algorithm. When starting at equilibrium, a diffusive limit for an optimal scaling of the variance is obtained in Roberts and Rosenthal (J. R. Stat. Soc. Ser. B. Stat. Methodol. 60 (1998) 255–268). In the transient case, we obtain formally that the optimal variance scales very differently in depending on the sign of a moment of the distribution, which vanishes at equilibrium. This suggest that it is difficult to derive practical recommendations for MALA from such asymptotic results.

\kwd
\aid

0 \volume20 \issue4 2014 \firstpage1930 \lastpage1978 \doi10.3150/13-BEJ546 \newproclaimdefinitionDefinition \newremarkremarkRemark

\runtitle

Optimal scaling for the transient phase of MH algorithms

{aug}

, and

diffusion limits \kwdMALA \kwdoptimal scaling \kwdpropagation of chaos \kwdrandom walk Metropolis

## 1 Introduction

Many Markov Chain Monte Carlo (MCMC) methods are based on the Metropolis–Hastings algorithm, see [16, 13]. To set up the notation, let us recall this well-known sampling technique. Let us consider a target probability distribution on with density . Starting from an initial random variable , the Metropolis–Hastings algorithm generates iteratively a Markov chain in two steps. At time , given , a candidate is sampled using a proposal distribution with density . Then, the proposal is accepted with probability , where

 α(x,y)=1∧p(y)q(y,x)p(x)q(x,y).

Here and in the following, we use the standard notation . If the proposed value is accepted, then otherwise . The Markov Chain is by construction reversible with respect to the target density , and thus admits as an invariant distribution. The efficiency of this algorithm highly depends on the choice of the proposal distribution . One common choice is a Gaussian proposal centered at point with variance :

 q(x,y)=1(2\uppiσ2)n/2exp(−|x−y|22σ2).

Since the proposal is symmetric (), the acceptance probability reduces to

 α(x,y)=1∧p(y)p(x). (1)

Metropolis–Hastings algorithms with symmetric kernels are called random walk Metropolis (RWM). Another popular choice yields the so called Metropolis adjusted Langevin algorithm (MALA). In this case, the proposal distribution is a Gaussian random variable with variance and centered at point (in particular, it is not symmetric). It corresponds to one step of a time-discretization with timestep of the (overdamped) Langevin dynamics: which is ergodic with respect to (here, is a standard -dimensional Brownian motion).

In both cases (RWM and MALA), the variance remains to be chosen. It should be sufficiently large to ensure a good exploration of the state space, but not too large otherwise the rejection rate becomes typically very high since the proposed moves fall in low probability regions, in particular in high dimension. It is expected that the higher the dimension, the smaller the variance of the proposal should be. The first theoretical results to optimize the choice of in terms of the dimension can be found in [22]. The authors study the RWM algorithm under two fundamental (and somewhat restrictive) assumptions: (i) the target probability distribution is the -fold tensor product of a one-dimensional density:

 p(x)=n∏i=1exp(−V(xi)), (2)

where and , and (ii) the initial distribution is the target probability (what we refer to as the stationarity assumption in the following):

 Xn0∼p(x)dx.

The superscript in the Markov chain explicitly indicates the dependency on the dimension . Then, under additional regularity assumption on , the authors prove that for a proper scaling of the variance as a function of the dimension, namely

 σ2n=ℓ2n,

where is a fixed constant, the Markov process (where denotes the first component of ) converges in law to a diffusion process:

 dXt=√h(I,ℓ)dBt−h(I,ℓ)12V′(Xt)dt, (3)

where

 h(I,ℓ)=2ℓ2Φ(−ℓ√I2)and% I=∫R(V′)2exp(−V). (4)

Here and in the following, denotes the integer part (for , and ) and is the cumulative distribution function of the normal distribution (). The scalings of the variance and of the time as a function of the dimension are indications on how to make the RWM algorithm efficient in high dimension. Moreover, a practical counterpart of this result is that should be chosen such that is maximum (the optimal value of is with ), in order to optimize the time scaling in (3). This optimal value of corresponds equivalently to a constant average acceptance rate, with approximate value : for this choice of , in the limit large,

 ∫∫α(x,y)p(x)q(x,y)dxdy≃0.234.

Notice that the optimal average acceptance rate does not depend on , and is thus the same whatever the target probability. Thus, the practical way to choose is to scale it in such a way that the average acceptance rate is roughly . Similar results have been obtained for the MALA algorithm in [23]. In this case, the scaling for the variance is , the time scaling is and the optimal average acceptance rate is .

There exists several extensions of such results for various Metropolis–Hastings algorithms, see [23, 24, 10, 17, 18, 7, 6, 8], and some of them relax in particular the first main assumption mentioned above about the separated form of the target distribution, see [11, 4, 5, 9]. Extensions to infinite dimensional settings have also been explored, see [15, 21, 9].

All these results assume stationarity: the initial measure is the target probability. To the best of the authors’ knowledge, the only works which deal with a nonstationary case are [12] where the RWM and the MALA algorithms are analyzed in the Gaussian case and [20]. In the latter paper, the target measure is assumed to be absolutely continuous with respect to the law of an infinite dimensional Gaussian random field and this measure is approximated in a space of dimension where the MCMC algorithm is performed. The authors consider a modified RWM algorithm (called preconditioned Crank–Nicolson walk) started at a deterministic initial condition and prove that when tends to as tends to (with no restriction on the rate of convergence of to ), the rescaled algorithm converges to a stochastic partial differential equation, started at the same initial condition.

The aim of this article is to discuss extensions of the previous results for the RWM and the MALA algorithms, without assuming stationarity. The main findings are the following.

Concerning the RWM algorithm, in the companion paper [14], we prove that, using the same scaling for the variance and the time as in the stationary case (namely and considering ), one obtains in the limit goes to infinity a diffusion process nonlinear in the sense of McKean (see Equation (7) below). This is discussed in Section 2. Contrary to (3), this diffusion process cannot be obtained from the simple Langevin dynamics by a deterministic time-change and its long-time behavior is not obvious. In Section 3, we first prove that its unique stationary distribution is . Assuming that this measure satisfies a logarithmic Sobolev inequality, we prove that the Kullback–Leibler divergence of the marginal distribution at time with respect to converges to at an exponential rate. In Section 4, we discuss optimizing strategies which take into account the transient phase. Roughly speaking, the usual strategy which consists in choosing (recall that ) such that the average acceptance rate is constant (with value between and ) seems to be a very good strategy. This is numerically illustrated in Section 5.

Concerning the MALA algorithm, the situation is much more complicated. The scaling to be used seems to depend on the sign of an average quantity (see Section 6.1.3). In particular, the scaling which has been identified in [23] under the stationary assumption is not adapted to the transient phase. It seems difficult to draw any practical recommendation from this analysis. This is explained with details in Section 6.

## 2 Scaling limit for the RWM algorithm

In this section, we state the diffusion limit for the RWM algorithm, and explain formally why this result holds. A rigorous proof can be found in [14].

### 2.1 The RWM algorithm and the convergence result

We consider a Random Walk Metropolis algorithm using Gaussian proposal with variance , and with target defined by (2). The Markov chain generated using this algorithm writes:

 Xi,nk+1=Xi,nk+ℓ√nGik+11Ak+1,1≤i≤n (5)

with

 Ak+1={Uk+1≤e∑ni=1(V(Xi,nk)−V(Xi,nk+(ℓ/√n)Gik+1))},

where is a sequence of independent and identically distributed (i.i.d.) normal random variables independent from a sequence of i.i.d. random variables with uniform law on . We assume that the initial positions are exchangeable (namely the law of the vector is invariant under permutation of the indices) and independent from all the previous random variables. Exchangeability is preserved by the dynamics: for all , are exchangeable. We denote by the sigma field generated by and .

For and , let

 Yi,nt = (⌈nt⌉−nt)Xi,n⌊nt⌋+(nt−⌊nt⌋)Xi,n⌈nt⌉ = Xi,n⌊nt⌋+(nt−⌊nt⌋)ℓ√nGi⌈nt⌉1A⌈nt⌉

be the linear interpolation of the Markov chain obtained by rescaling time (the characteristic time scale is ). This is the classical diffusive time-scale for a random walk, since the variance of each move is of order .

Let us define the notion of convergence (namely the propagation of chaos) that will be useful to study the convergence of the interacting particle system in the limit goes to infinity. {definition} Let be a separable metric space. A sequence of exchangeable -valued random variables is said to be -chaotic where is a probability measure on if for fixed , the law of converges in distribution to as goes to .

According to [25], Proposition 2.2, the -chaoticity is equivalent to a law of large numbers result, namely the convergence in probability of the empirical measures to the constant when the space of probability measures on is endowed with the weak convergence topology.

We are now in position to state the convergence result for the RWM algorithm, taken from [14]. Here and in the following, the bracket notation refers to the duality bracket for probability measures on : for a probability measure and a bounded measurable function,

 ⟨μ,ϕ⟩=∫ϕdμ.
###### Theorem 1

Let be a probability measure on such that . Let us also assume that

 V is a C3 function on R with bounded second and third order derivatives. (6)

If the initial positions are -chaotic and such that

 Missing or unrecognized delimiter for \bigr

then the processes are -chaotic where denotes the law (on the space of continuous function with values in ) of the unique solution to the stochastic differential equation nonlinear in the sense of McKean

 dXt = Γ1/2(E[(V′(Xt))2],E[V′′(Xt)],ℓ)dBt (7) −G(E[(V′(Xt))2],E[V′′(Xt)],ℓ)V′(Xt)dt,

where is a Brownian motion independent from the initial position distributed according to . The functions and are, respectively, defined by: for , and ,

 Γ(a,b,ℓ)=⎧⎪ ⎪ ⎪ ⎪ ⎪ ⎪⎨⎪ ⎪ ⎪ ⎪ ⎪ ⎪⎩ℓ2Φ(−ℓb2√a)+ℓ2eℓ2(a−b)/2Φ(ℓ(b2√a−√a)),\quad% \mbox{if }a∈(0,+∞),ℓ22,\quad\mbox{if % }a=+∞,ℓ2e−ℓ2b+/2,\quad\mbox{if }a=0, (8)

where , and

 Missing or unrecognized delimiter for \biggr (9)

Notice that the assumption on is for example satisfied when the random variables are i.i.d. according to some probability measure on .

This convergence result generalizes the previous result by Roberts et al. [22] where the same diffusive limit is obtained under the restrictive assumption that the vector of initial positions is distributed according to the target distribution . In this case, indeed solves the stochastic differential equation (3)–(4) with time-homogeneous coefficients (here, we use the fact that where , see [14], Lemma 1). Moreover, by taking , this theorem also yields similar results as [12], where the authors consider a nonstationary case, but restrict their analysis to the evolution of for Gaussian targets.

In addition to the previous convergence result, we are able to identify the limiting average acceptance rate.

###### Proposition 1

Under the assumptions of Theorem 1, the function

 Missing or unrecognized delimiter for \bigl

converges locally uniformly to and in particular, the average acceptance rate converges locally uniformly to where for any and ,

 acc(a,b,ℓ)=Γ(a,b,ℓ)ℓ2. (10)

### 2.2 A formal derivation of the limiting process (7)

Let us introduce the infinitesimal generator associated to (7):

 Missing or unrecognized delimiter for \bigr (11)

For a probability measure on , is well defined by boundedness of (see (6)), and is also well defined in .

The relationship between (7) and (11) is the following: if satisfies (7), then for any smooth test function , is a martingale, where denotes the law of : for any ,

 Extra open brace or missing close brace (12)

Actually, as explained in [14], Section 3.1, the martingale representation of the solution is a weak formulation of (7): solutions to (12) are solutions in distribution to (7).

Let us now present formally how (7) is derived. First, let us explain how the scaling of as a function of is chosen. The idea (see [24]) is to choose in such a way that the limiting acceptance rate (when ) is neither zero nor one. In the first case, this would mean that the variance of the proposal is too large, so that all proposed moves are rejected. In the second case, the variance of the proposal is too small, and the rate of convergence to equilibrium is thus not optimal. In particular, it is easy to check that should go to zero as goes to infinity. Now, notice that the limiting acceptance rate is:

 E(1Ak+1|Fnk) = E(e∑ni=1(V(Xi,nk)−V(Xi,nk+σnGik+1))∧1|Fnk) (13) = E(e−∑ni=1(V′(Xi,nk)σnGik+1+V′′(Xi,nk)(σ2n/2))∧1|Fnk)+O(nσ3n)+O(√nσ2n) = exp(an−bn2)Φ(bn2√an−√an)+Φ(−bn2√an) +O(nσ3n)+O(√nσ2n),

where and . The formula (13) is obtained by explicit computations (see [22], Proposition 2.4). From this expression, assuming a propagation of chaos (law of large numbers) on the random variables , one can check that the correct scaling for the variance is in order to obtain a nontrivial limiting acceptance rate (see [14], Section 2.3).

Using this scaling, we observe that, for a test function ,

 E(φ(X1,nk+1)|Fnk) =E(φ(X1,nk+ℓ√nG1k+11Ak+1)∣∣Fnk) =φ(X1,nk)+φ′(X1,nk)ℓ√nE(G1k+11Ak+1|Fnk) +ℓ22nφ′′(X1,nk)E((G1k+1)21Ak+1|Fnk)+O(n−3/2). (14)

We compute (by conditioning with respect to ):

 E(G1k+11Ak+1|Fnk) =E(G1k+1e∑ni=1(V(Xi,nk)−V(Xi,nk+(ℓ/√n)Gik+1))∧1|Fnk) =E(G1k+1e−∑ni=1(V′(Xi,nk)(ℓ/√n)Gik+1+V′′(Xi,nk)(ℓ2/(2n)))∧1|Fnk)+O(n−1) =−V′(X1,nk)1ℓ√nG(⟨νnk,(V′)2⟩,⟨νnk,V′′⟩,ℓ)+O(n−1), (15)

where

 νnk=1nn∑i=1δXi,nk

denotes the empirical distribution associated to the interacting particle system. The equation (2.2) is again a consequence of explicit computations (see [14], Equation (A.3)), and the fact that the remainder is of order requires a detailed analysis (see [14], Proposition 7). Likewise, for the diffusion term, we get

 E((G1k+1)21Ak+1|Fnk) =E((G1k+1)2e∑ni=1(V(Xi,nk)−V(Xi,nk+(ℓ/√n)Gik+1))∧1|Fnk) =E((G1k+1)2e−∑ni=1(V′(Xi,nk)(ℓ/√n)Gik+1+V′′(Xi,nk)(ℓ2/(2n)))∧1|Fnk)+O(n−1) =1ℓ2Γ(⟨νnk,(V′)2⟩,⟨νnk,V′′⟩,ℓ)+O(n−1). (16)

To obtain (2.2), we again used an explicit computation (see [14], Equation (A.5)).

By plugging (2.2) and (2.2) into (2.2), we see that the correct scaling in time is to consider such that (diffusive timescale), and we get:

 E(φ(Y1,n(k+1)/n)|Fnk) = φ(Y1,nk/n)−φ′(Y1,nk/n)1nV′(Y1,nk/n)G(⟨νnk,(V′)2⟩,⟨νnk,V′′⟩,ℓ) Missing or unrecognized delimiter for \bigr = φ(Y1,nk/n)+1n(Lνnkφ)(Y1,nk/n)+O(n−3/2),

where is defined by (11). This can be seen as a discrete-in-time version (over a timestep of size ) of the martingale property (12). Thus, by sending to infinity, assuming that converges to the law of , we expect to converge to a solution to (7). For a rigorous proof, we refer to [14].

## 3 Longtime convergence for the RWM nonlinear dynamics

We would like to study the limiting dynamics (7) obtained for the RWM algorithm, that we recall for convenience

 dXt=Γ1/2(E[(V′(Xt))2],E[V′′(Xt)],ℓ)dBt−G(E[(V′(Xt))2],E[V′′(Xt)],ℓ)V′(Xt)dt,

where and are, respectively, defined by (8) and (9). The associated Fokker–Planck equation is ( denotes the density of the random variable ):

 ⎧⎨⎩∂tψt=∂x(G(a[ψt],b[ψt],ℓ)V′ψt+Γ(a[ψt],b[ψt],ℓ)∂xψt/2),wherea[ψt]=∫R(V′)2ψtandb[ψt]=∫RV′′ψt. (17)

Let us denote . Notice that and . We thus expect to be the longtime limit of .

### 3.1 Stationary solution

We start the analysis of the limiting process by checking that the solution of (7) has the expected stationary distribution.

###### Proposition 2

There exists a unique stationary distribution for the process defined by (7). In addition, this distribution is absolutely continuous with respect to the Lebesgue measure, with density .

Before proving Proposition 2, we need some preliminary facts the proof of which is postponed to Appendix A.

###### Lemma 1

Defining the function by:

one has

 sign(Γ(a,b,ℓ)−2G(a,b,ℓ))=sign(a−b). (19)

Moreover, the function defined for , and by

is a continuous function satisfying

 ∀ℓ>0,∀M∈(0,+∞),inf(a,b)∈[0,M]×[−M,M]F(a,b,ℓ)>0. (21)
{pf*}

Proof of Proposition 2 Let . Since is bounded then one can check that (see [14], Lemma 1). By (19), we get that . Let us define the Langevin diffusion

 d~Xt=√2G(c,c,ℓ)dBt−G(c,c,ℓ)V′(~Xt)dt

with distributed according to the density . It is well known that for any the density of is and therefore . Then it is clear that the process satisfies (7). Hence, is a stationary probability distribution for the stochastic differential equation (7).

Let us now prove the uniqueness of the invariant measure. Assume that there exists another stationary probability measure with density (the fact that the stationary measure admits a density is standard, since the diffusion term is bounded from below). Assume . Since and , the stochastic differential equation (7) with distributed according to the density reduces in this case to which does not admit a stationary distribution. Thus, necessarily, we have

 ∫RV′2p∞<∞.

Let us denote and . Then, Equation (7) with distributed according to the density reduces to

 dXt=Γ1/2(a,b,ℓ)dBt−G(a,b,ℓ)V′(Xt)dt.

The stationary distribution thus writes

 p∞∝exp(−2G(a,b,ℓ)Γ(a,b,ℓ)V).

By integration by parts, we obtain that

 bΓ(a,b,ℓ)=2aG(a,b,ℓ).

Hence, by definition of and (21), we obtain and by (19) we get that . In conclusion, .

### 3.2 Longtime convergence

It is actually possible to prove that, for fixed , the law of solution to (7) converges exponentially fast to the equilibrium density . The proof is based on entropy estimates, using the Fokker–Planck equation (17), and requires the notion of logarithmic Sobolev inequality. {definition} The probability measure satisfies a logarithmic Sobolev inequality with constant (in short ) if and only if, for any probability measure absolutely continuous with respect to ,

 H(μ|ν)≤12ρI(μ|ν), (22)

where is the Kullback–Leibler divergence (or relative entropy) of with respect to and is the Fisher information of with respect to .

With a slight abuse of notation, we will denote in the following and the Kullback–Leibler divergence and the Fisher information associated with the continuous probability distributions and . We recall that, by the Csiszar–Kullback inequality (see, for instance, [2], Théorème 8.2.7, page 139), for any probability densities and ,

 ∫R|ψ−ϕ|≤√2H(ψ|ϕ), (23)

so that may be seen as a measure of the “distance” between and .

###### Theorem 2

Let us assume (6), and that admits a density such that and . Then, for all ,

 ddtH(ψt|ψ∞)≤−F(a[ψt],b[ψt],ℓ)2I(ψt|ψ∞), (24)

and the function is decreasing.

Let us assume moreover that satisfies a . Then there exists a positive and nonincreasing function such that

 H(ψt|ψ∞)≤e−tλ(H(ψ0|ψ∞))H(ψ0|ψ∞). (25)

Equation (25) shows that converges exponentially fast to . {remark} Roughly speaking, satisfies a LSI if grows sufficiently fast at infinity. For example, according to [2], Théorème 6.4.3, a sufficient condition for to satisfy a LSI, is that does not vanish outside of some compact subset of and

 lim|x|→∞V′′(x)(V′(x))2=0%andlimsup|x|→∞|V(x)+ln|V′(x)||(V′(x))2<+∞.

In the Gaussian case , satisfies . {pf*}Proof of Theorem 2 By simple computation, we have (for notational convenience, we write for ):

 Extra close brace or missing open brace = Extra close brace or missing open brace (26) = Extra close brace or missing open brace Extra close brace or missing open brace = Extra close brace or missing open brace Extra close brace or missing open brace = Extra close brace or missing open brace −G(a,b,ℓ)a+Γ(a,b,ℓ)b/2.

On the other hand, we have

 Extra close brace or missing open brace = Extra close brace or missing open brace = Extra close brace or missing open brace = Extra close brace or missing open brace

We thus obtain

 Extra close brace or missing open brace Extra close brace or missing open brace −G(a,b,ℓ)a+Γ(a,b,ℓ)b/2 Extra close brace or missing open brace Extra close brace or missing open brace

where the ratio is nonnegative by (19). We remark that

 (a−b)2 = Extra close brace or missing open brace = Extra close brace or missing open brace

Using the function defined in (20), we deduce that

 Extra close brace or missing open brace ≤ Extra close brace or missing open brace ≤ Extra close brace or missing open brace

which is (24). Since by Lemma 1, is positive, we deduce that

 Extra close brace or missing open brace

Let us now assume that satisfies a logarithmic-Sobolev inequality (22) with parameter . We thus have, from (24),

 ddtH(ψt|ψ∞)≤−ρF(a[ψt],b[ψt],ℓ)H(ψt|ψ∞). (27)

Thus, to obtain exponential convergence, in view of Lemma 1 and since , we need a (uniform-in-time) upper bound on , to get a (uniform-in-time) positive lower bound on . This is the aim of the next paragraph.

First, notice that by [14], Lemma 1 and Lemma 3, . Now, according to [19], Theorem 1, since satisfies a , also satisfies the transport inequality: for any probability density on ,

 Extra close brace or missing open brace

where, in the definition of the quadratic Wasserstein distance , the infimum is taken over all coupling measures on with marginals and . Moreover, for a coupling measure between the probability measures and , we have, using Cauchy–Schwarz inequality,