Optimal scaling of random walk Metropolis algorithms with discontinuous target densities

# Optimal scaling of random walk Metropolis algorithms with discontinuous target densities

[ [    [ [    [ [ University of Manchester, University of Warwick and Brock University P. Neal
School of Mathematics
University of Manchester
Sackville Street
Manchester
M60 1QD
United Kingdom
G. Roberts
Department of Statistics
University of Warwick
Coventry
CV4 7AL
United Kingdom
W. K. Yuen
Department of Mathematics
Brock University
St. Catharines
ON L2S 3A1
\smonth2 \syear2011\smonth9 \syear2011
\smonth2 \syear2011\smonth9 \syear2011
\smonth2 \syear2011\smonth9 \syear2011
###### Abstract

We consider the optimal scaling problem for high-dimensional random walk Metropolis (RWM) algorithms where the target distribution has a discontinuous probability density function. Almost all previous analysis has focused upon continuous target densities. The main result is a weak convergence result as the dimensionality of the target densities converges to . In particular, when the proposal variance is scaled by , the sequence of stochastic processes formed by the first component of each Markov chain converges to an appropriate Langevin diffusion process. Therefore optimizing the efficiency of the RWM algorithm is equivalent to maximizing the speed of the limiting diffusion. This leads to an asymptotic optimal acceptance rate of under quite general conditions. The results have major practical implications for the implementation of RWM algorithms by highlighting the detrimental effect of choosing RWM algorithms over Metropolis-within-Gibbs algorithms.

[
\kwd
\doi

10.1214/11-AAP817 \volume22 \issue5 2012 \firstpage1880 \lastpage1927

\runtitle

Optimal scaling of RWM algorithms

{aug}

A]\fnmsPeter \snmNeal\correflabel=e1]P.Neal-2@manchester.ac.uk, B]\fnmsGareth \snmRobertslabel=e2]Gareth.O.Roberts@warwick.ac.uk and C]\fnmsWai Kong \snmYuenlabel=e3]jyuen@brocku.ca

class=AMS] \kwd[Primary ]60F05 \kwd[; secondary ]65C05. Random walk Metropolis \kwdMarkov chain Monte Carlo \kwdoptimal scaling.

## 1 Introduction

Random walk Metropolis (RWM) algorithms are widely used generic Markov chain Monte Carlo (MCMC) algorithms. The ease with which RWM algorithms can be constructed has no doubt played a pivotal role in their popularity. The efficiency of a RWM algorithm depends fundamentally upon the scaling of the proposal density. Choose the variance of the proposal to be too small and the RWM will converge slowly since all its increments are small. Conversely, choose the variance of the proposal to be too large and too high a proportion of proposed moves will be rejected. Of particular interest is how the scaling of the proposal variance depends upon the dimensionality of the target distribution. The target distribution is the distribution of interest and the MCMC algorithm is constructed such that the stationary distribution of the Markov chain is the target distribution.

The Introduction is structured as follows. We outline known results for continuous independent and identically distributed product densities from RGG () and subsequent work. We highlight the scope and limitations of the results before introducing the discontinuous target densities to be studied in this paper. While the statements of the key results (Theorem 2.1) in this paper are similar to those given for continuous target densities, the proofs are markedly different. A discussion of why a new method of proof is required for discontinuous target densities is given. Finally, we give an outline of the remainder of the paper.

The results of this paper have quite general consequences for the implementation of Metropolis algorithms on discontinuous densities (as are commonly applied in many Bayesian Statistics problems), namely: {longlist}[(1)]

Full- (high-) dimensional update rules can be an order of magnitude slower than strategies involving smaller dimensional updates. (See Theorem 3.3 below.)

For target densities with bounded support, Metropolis algorithms can be an order of magnitude slower than algorithms which first transform the target support to for some .

In RGG (), a sequence of target densities of the form

 πd(xd)=d∏i=1f(xdi) (1)

were considered as , where is twice differentiable and satisfies certain mild moment conditions; see RGG (), (A1) and (A2). The following random walk Metropolis algorithm was used to obtain a sample from . Draw from . For and let be independent and identically distributed (i.i.d.) according to and . At time , propose

 Yd=Xdt+σdZdt, (2)

where is the proposal standard deviation to be discussed shortly. Set with probability

 α(Xdt,Yd)≡1∧πd(Yd)πd(Xdt). (3)

Otherwise set . It is straightforward to check that has stationary distribution , and hence, for all , . The key question addressed in RGG () was: starting from the stationary distribution, how should be chosen to optimize the rate at which the RWM algorithm explores the stationary distribution? Since the components of are i.i.d., it suffices to study the marginal behavior of the first component, . In RGG (), it was shown that if and , then

 Ud⇒Uas d→∞, (4)

where satisfies the Langevin SDE

 dUt=√h(l)dBt+ϕ(l)f′(Ut)2f(Ut)dt (5)

with and with being the standard normal c.d.f. and . Note that the “speed measure” of the diffusion only depends upon through . The diffusion limit for is unsurprising in that for a time interval of length , moves are made each of size . Therefore the movements in the first component (appropriately normalized) converge to those of a Langevin diffusion with the “most efficient” asymptotic diffusion having the largest speed measure . Since the diffusion limit involves speeding up time by a factor of , we say that the mixing of the algorithm is . The optimal value of  is , which leads to an average optimal acceptance rate (aoar) of 0.234. This has major practical implications for practitioners, in that, to monitor the (asymptotic) efficiency of the RWM algorithm it is sufficient to study the proportion of proposed moves accepted.

There are three key assumptions made in RGG (). First, , that is, the algorithm starts in the stationary distribution and is chosen to optimize exploration of the stationary distribution. This assumption has been made in virtually all subsequent optimal scaling work; see, for example, BR00 (), NR08 (), NR06 (), Bed06 () and RR01 (). The one exception is CRR05 (), where is started from the mode of with explicit calculations given for a standard multivariate normal distribution. In CRR05 (), it is shown that is optimal for maximizing the rate of convergence to the stationary distribution. Since convergence is shown to occur within iterations, the time taken to explore the stationary distribution dominates the time taken to converge to the stationary distribution, and thus overall it is optimal to choose . It is difficult to prove generic results for . However, the findings of CRR05 () suggest that even when , it is best to scale the proposal distribution based upon . It is worth noting that in CRR05 () it was found that for the Metropolis adjusted Langevin algorithm (MALA), the optimal scaling of for started at the mode of a multivariate normal is compared to for .

Second, is an i.i.d. product density. This assumption has been relaxed by a number of authors with and an aoar of 0.234 still being the case, for example, independent, scaled product densities (RR01 () and Bed06 ()), Gibbs random fields BR00 (), exchangeable normals NR06 () and elliptical densities SR09 (). Thus the simple rule of thumb of tuning such that one in four proposed moves are accepted holds quite generally. In Bed08 () and SR09 (), examples where the aoar is strictly less than 0.234 are given. These correspond to different orders of magnitude being appropriate for the scaling of the proposed moves in different components.

Third, the results are asymptotic as . However, simulations have shown that for i.i.d. product densities an acceptance rate of 0.234 is close to optimal for ; see, for example, NR06 (). Departures from the i.i.d. product density require larger for the asymptotic results to be optimal, but is often seen in practical MCMC problems. In NR10 () and Sherlock (), optimal acceptance rates are obtained for finite for some special cases.

With the exceptions of NR08 (), NR10 () and SR09 (), in the above works is assumed to have a continuous (and suitably differentiable) probability density function (p.d.f.). The aim of the current work is to investigate the situation where the target distribution has a discontinuous p.d.f., and specifically, target distributions confined to the -dimensional hypercube . That is, we consider target distributions of the form

 πd(xd)=d∏i=1f(xdi), (6)

where

 f(x)∝exp(g(x))1{0

and is twice differentiable upon with

 g∗=sup0≤y≤1|g′(y)|<∞. (8)

We then use the following random walk Metropolis algorithm to obtain a sample from . Draw from . For and let be independent and identically distributed (i.i.d.) according to and . At time , propose

 Yd=Xdt+σdZdt. (9)

Set with probability

 α(Xdt,Yd)≡1∧πd(Yd)πd(Xdt). (10)

Otherwise set .

In NR08 () and SR09 (), spherical and elliptical densities are considered which have very different geometry to the hypercube restricted densities. Therefore different approaches are taken in these papers with results akin to those obtained for continuous target densities. Densities of the form (7) have previously been studied in NR10 (), where the expected square jumping distance (ESJD) has been computed. The ESJD is

 (11)

the mean squared distance between and , where . In NR10 (), Appendix B, it is shown that for and ,

 dEπd[d∑i=1(Xd1,i−Xd0,i)2]→l23exp(−l2)as d→∞. (12)

Thus asymptotically (as ) the ESJD is maximized by taking which corresponds to an aoar of (0.1353). In this paper, we show that and an aoar of holds more generally for target distributions of the form given by (6) and (7). Moreover, we prove a much stronger result than that given in NR10 (), in that, we prove that converges weakly to an appropriate Langevin diffusion with speed measure as , where . This gives a clear indication of how the Markov chain explores the stationary distribution. By contrast the ESJD only gives a measure of average behavior and does not take account of the possibility of the Markov chain becoming “stuck.” If is very low, the Markov chain started is likely to spend a large number of iterations at before accepting a move away from . Note that since involves speeding up time by a factor of , we say that the mixing of the algorithm is . The ESJD is easy to compute and asymptotically, as , the ESJD (appropriately scaled) converges to . Thus in discussing possible extensions of the Langevin diffusion limit proved in Theorem 2.1 for i.i.d. product densities of the form given in (6) and (7), we make considerable use of the ESJD. However, we highlight the limitations of the ESJD in discussing extensions of Theorem 2.1.

In most previous work on optimal scaling, the components of are taken to be independent and identically distributed random variables. The reason for choosing for discontinuous target densities is mathematical convenience. The results proved in this paper hold with Gaussian rather than uniform proposal distributions, but some elements of the proof are less straightforward. For discussion of the ESJD for densities (6) for general subject to , see NR10 (), Appendix B.

While the key result, a Langevin diffusion limit for the movement in the first component, is the same as RGG (), the proof is markedly different. Note that, for finite , and are not Markov chains since whether or not a proposed move is accepted depends upon all the components in . In RGG (), it is shown that there exists such that as and

 supxd∈Fd∣∣∣E[α(xd,xd+σdZd)]−2Φ(−l√I2)∣∣∣≤εd, (13)

where as . While (13) is not explicitly stated in RGG (), it is the essence of the requirements of the sets , stating that for large , with high probability over the first iterations the acceptance probability of the Markov chain is approximately constant, being within of . (Note rather than is used for dimensionality in RGG ().) Thus in the limit as the effect of the other components on movements in the first component converges to a deterministic acceptance probability . The situation is more complex for of the form given by (6) and (7) as the acceptance rate in the limit as is inherently stochastic. For example, suppose is the uniform distribution on the -dimensional hypercube so that . Letting and , this gives

 E[α(xd,xd+σdZd)]=∏i∈RLd(12+xi2σd)×∏i∈RUd(12+1−xi2σd). (14)

Thus the acceptance probability is totally determined by the components at the boundary (within of 0 or 1). The total number of components in is which converges in distribution to as . Thus the number of components close to the boundary is inherently stochastic. Moreover, the location of the components within plays a crucial role in the acceptance probability; see (14). Therefore there is no hope of replicating directly the method of proof applied in RGG () and subsequently, in BR00 () and NR06 ().

We need a homogenization argument which involves looking at over steps; cf. NR08 (). In particular, we show that the acceptance probability converges very rapidly to its stationary measure, so that over iterations approximately proposed moves are accepted. By comparison, ; thus the value of an individual component only makes small changes over iterations. That is, we show that there exists such that, for any , as and for ,

 supxd∈~Fd∣∣∣1[dδ][dδ]−1∑t=0E[α(Xdt,Xdt+σdZdt)|Xd0=xd]−exp(−lf∗2)∣∣∣≤εd (15)

for some as . For large , with high probability over the first iterations the Markov chain stays in , where the average number of accepted proposed moves in the following iterations is . The arguments are considerably more involved than in NR08 (), where spherically constrained target distributions were studied, due to the very different geometry of the hypercube and spherical constraints applied in this paper and NR08 (), respectively. In particular, in NR08 (), with an aoar of .

By exploiting the homogenization argument it is possible to prove that  converges weakly to an appropriate Langevin diffusion , given in Theorem 2.1. In Section 2, Theorem 2.1 is presented along with an outline of the proof. Also in Section 2, a description of the pseudo-RWM algorithm is given. The pseudo-RWM algorithm plays a key role in the proof of Theorem 2.1. The pseudo-RWM process moves at each iteration and the moves in the pseudo-RWM process are identical to those of the RWM process, conditioned upon a proposed move in the RWM process being accepted. The proof of Theorem 2.1 is long and technical with the details split into three key sections which are given in the Appendix; see Section 2 for more details. In Section 3, two interesting extensions of Theorem 2.1 are given. In particular, Theorem 3.3 has major practical implications for the implementation of RWM algorithms by highlighting the detrimental effect of choosing RWM algorithms over Metropolis-within-Gibbs algorithms. The target densities for which theoretical results can be proved are limited, so discussion of possible extensions of Theorem 2.1 are given. In particular, we discuss general  restricted to the hypercube, general discontinuities in and .

## 2 Pseudo-RWM algorithm and Theorem 2.1

We begin by defining the pseudo-random walk Metropolis (pseudo-RWM) process. We will then be in position to formally state the main theorem, Theorem 2.1. An outline of the proof of Theorem 2.1 is given, with the details, which are long and technical, placed in the Appendix.

For , let

Let denote the probability of accepting a move in the RWM process given the current state of the process is . Then

 Jd(xd)=∫hd(zd){1∧πd(xd+σdzd)πd(xd)}dzd. (16)

Let , the total number of components of in . By Taylor’s theorem for all and ,

 g(xi+σdzi)−g(xi)≥−g∗σd (17)

with defined in (8). Hence, for all ,

 Jd(xd) = ∫hd(zd){1∧d∏i=1exp(g(xi+σdzi))exp(g(xi))}1{xd+σdzd∈[0,1]d}dzd ≥ ∫hd(zd){1∧exp(−dg∗σd)}1{xd+σdzd∈[0,1]d}dzd = exp(−lg∗)∫hd(zd)1{xd+σdzd∈[0,1]d}dzd ≥ exp(−lg∗)(12)bld(xd).

This lower bound for will be used repeatedly.

The pseudo-RWM process moves at each iteration, which is the key difference to the RWM process. Furthermore, the moves in the pseudo-RWM process are identical to those of the RWM process, conditioned upon a move in the RWM process being accepted, that is, its jump chain. For , let denote the successive states of the pseudo-RWM process, where . The pseudo-RWM process is a Markov process, where for , and given that , has p.d.f.

Note that for . Since , we can couple the two processes to have the same starting value . A continued coupling of the two processes is outlined below. Suppose that . Then for any ,

 P(s⋃j=1{Xdt+j=xd}|Xdt=xd)=(1−Jd(xd))s. (19)

That is, the number of iterations the RWM algorithm stays at before moving follows a geometric distribution with “success” probability . Therefore for , let denote independent geometric random variables, where for , denotes a geometric random variable with “success” probability . For , let and for , let

 Udt=sup{s∈Z+\dvtxs−1∑j=0Mj(Jd(^Xdj))≤t},

where the sum is zero if vacuous. For , attach to . Thus denotes the total number of iterations the RWM process spends at before moving to . Hence, the RWM process can be constructed from by setting and for all , . Obviously the above process can be reversed by setting equal to the th accepted move in the RWM process.

For each , the components of are independent and identically distributed. Therefore we focus attention on the first component as this is indicative of the behavior of the whole process. For and , let and .

###### Theorem 2.1

Fix . For all , let . Then, as ,

 Vd⇒V

in the Skorokhod topology on , where satisfies the (reflected) Langevin SDE on

 dVt=√ϕ(l)dBt+12ϕ(l)g′(Vt)dt+dL0t(V)−dL1t(V) (20)

with . Note that is standard Brownian motion,

 ϕ(l)=l23exp(−f∗l2)

and .

Here denotes the local time of at and the SDE (20) corresponds to standard reflection at the boundaries and (see, e.g., Chapter VI of RevYor ()).

{pf}

As noted in Section 1, the acceptance probability of the RWM process is inherently random and therefore it is necessary to consider the behavior of the RWM process averaged over iterations, for . Fix and let be a sequence of positive integers satisfying . For , let and for , let . For all , and . Hence, for all ,

 sup0≤s≤T|~Vds−Vds|≤[dδ]σd. (21)

Therefore by Bill (), Theorem 4.1, as , if as . Hence we proceed by showing that

 ~Vd⇒Vas d→∞. (22)

Let be the (discrete-time) generator of and let be an arbitrary test function of the first component only. Thus

 GδdH(xd)=d2[dδ]E[H(~Xd1)−H(~Xd0)|~Xd0=xd]. (23)

The generator of the (limiting) one-dimensional diffusion for an arbitrary test function is given by

 GH(x)=ϕ(l){12g′(x)H′(x)+12H′′(x)} (24)

for all at least for all , where is defined in (25) below.

First note that the diffusion defined by (24) is regular; see EK (), page 366. Therefore by EK (), Chapter 8, Corollary 1.2, it is sufficient to restrict attention to functions

 H∈D≡{h\dvtxh∈^C([0,1])∩C2((0,1))∩D∗,Gh∈^C([0,1])}, (25)

where is the set of twice differentiable functions upon , is the set of bounded continuous functions upon and is obtained by setting in EK (), page 367, (1.11) and is given by

 D∗={h\dvtxh′(0)=h′(1)=0}. (26)

Let and . Then combined with implies that . It then follows from being bounded on and that . These observations will play a key role in Appendix C.

Now (22) is proved using EK (), Chapter 4, Corollary 8.7, by showing that there exists a sequence of sets such that for any ,

 P([Td2/[dδ]]⋃j=0{Xdj∉~Fd})→0as d→∞ (27)

and

 supxd∈~Fd|GδdH(xd)−GH(x1)|→0as d→∞. (28)

Let the sets and be such that and

 ~Fd={xd;P([dδ]⋃j=0{^Xdj∉Fd}|^Xd0=xd)≤d−3}, (29)

where , , and are defined below. Recall that , the total number of components of in . We term  the rejection region, in that, for any component in , there is positive probability of proposing a move outside the hypercube with such moves automatically being rejected. Let

 F1d = {xd;bld(xd)≤γlogd}, (30) F2d = [dδ]⋂k=[dβ]{xd;|bk3/4d(xd)−E[bk3/4d(Xd0)]|≤√k}, (31) F3d = {xd;sup[dβ]≤kd≤[dδ]sup0≤r≤l|λd(xd;r;kd)−λ(r)|≤d−γ}, (32) F4d = {xd;∣∣∣1dd∑j=1g′(xj)2−Ef[g′(X1)2]∣∣∣

where and . In Appendix A, we prove (27) for the sets given in (29). Note that (27) follows immediately from Theorem A.13, (119) since . An outline of the roles played by each is given below. For () the total number of components in (close to) the rejection region are controlled. For after iterations the total number and position of the points in are approximately from the stationary distribution of . Finally, for , ; this is the key requirement for the sets given in RGG (), cf. RGG (), page 114, .

The proof of (28) splits into two parts and exploits the pseudo-RWM process. Let

 Pd=max{K=0,1,…,[dδ−1];1[dδ]K−1∑j=0Mj(Jd(^Xdj))≤1}/[dδ], (34)

the proportion of accepted moves in the first iterations, where the sum is set equal to zero if vacuous. Then and

 GδdH(xd)=d2[dδ]E[H(^Xd[Pddδ])−H(^Xd0)|^Xd0=xd]. (35)

In Appendix B, we show that for all , as . Consequently, it is useful to introduce which is defined for fixed as

 ^Gδ,πdH(xd) = = d2[dδ][πdδ−1]∑j=0E[H(^Xdj+1)−H(^Xdj)|^Xd0=xd] = 1[dδ][πdδ−1]∑j=0E[^GdH(^Xdj)|^Xd0=xd],

where

 ^GdH(^Xdj)=d2E[H(^Xd1−^Xd0)|^Xd0=xd]. (37)

Finally in Appendix C, we prove in Lemma C.6 that

 sup0≤π≤1supxd∈~Fd|^Gδ,πdH(xd)−GH(x1)|→0as d→∞. (38)

The triangle inequality is then utilized to prove (28) in Lemma C.6 using (38) and as .

It should be noted that in Appendix C, we assume that , in particular in Lemma C.1. In Appendixes A and B we make no such assumption. However, corresponds to (uniform distribution), and proving Lemma C.6 in this case follows similar but simpler arguments to those given in Appendix C.

A key difference between the diffusion limits for continuous and discontinuous i.i.d. product densities is the dependence of the speed measure upon . For continuous (suitably differentiable) , depends upon , which is a measure of the “roughness” of . For discontinuous densities of the form (7), depends upon , the (mean of the) limit of the density at the boundaries (discontinuities). Discussion of the role of the density in the behavior of the RWM algorithm is given in Section 3.

The most important consequence of Theorem 2.1 is the following result.

###### Corollary 2.2

Let . Then

 EπdE[Jd(Xd0)]→a(l)as d→∞.

is maximized by with

 a(^l)=exp(−2)=0.1353.

Clearly, if is known, can be calculated explicitly. However, where MCMC is used, will often only be known up to the constant of proportionality. This is where Corollary 2.2 has major practical implications, in that, to maximize the speed of the limiting diffusion, and hence, the efficiency of the RWM algorithm, it is sufficient to monitor the average acceptance rate, and to choose such that the average acceptance rate is approximately . Therefore there is no need to explicitly calculate or estimate the constant of proportionality.

## 3 Extensions

In this section, we discuss the extent to which the conclusions of Theorem 2.1 extend beyond being an i.i.d. product density upon the -dimensional hypercube and . First we present two extensions of Theorem 2.1. The second extension, Theorem 3.3, is an important practical result concerning lower-dimensional updating schema.

Suppose that is nonzero on the positive half-line. That is,

 f(x)∝exp(g(x))(x>0) (39)

and otherwise.

###### Theorem 3.1

Fix . For all , let , given by (39), with . Then, as ,

 Vd⇒V

in the Skorokhod topology on , where satisfies the (reflected) Langevin SDE on

 dVt=√ϕ(l)dBt+12ϕ(l)g′(