
# On the ergodicity of the adaptive Metropolis algorithm on unbounded domains

University of Helsinki and University of Jyväskylä

Department of Mathematics and Statistics, P.O. Box 68, FI-00014 University of Helsinki, Finland

Department of Mathematics and Statistics, FI-40014 University of Jyväskylä, Finland

Received June 2008; revised February 2009.
###### Abstract

This paper describes sufficient conditions to ensure the correct ergodicity of the Adaptive Metropolis (AM) algorithm of Haario, Saksman and Tamminen [Bernoulli 7 (2001) 223–242] for target distributions with a noncompact support. The conditions ensuring a strong law of large numbers require that the tails of the target density decay super-exponentially and have regular contours. The result is based on the ergodicity of an auxiliary process that is sequentially constrained to feasible adaptation sets, independent estimates of the growth rate of the AM chain and the corresponding geometric drift constants. The ergodicity result of the constrained process is obtained through a modification of the approach due to Andrieu and Moulines [Ann. Appl. Probab. 16 (2006) 1462–1505].

Annals of Applied Probability 2010, Vol. 20, No. 6, 2178–2203. DOI: 10.1214/10-AAP682.

Running head: On the ergodicity of adaptive Metropolis

Authors: Eero Saksman (e-mail: eero.saksman@helsinki.fi) and Matti Vihola (e-mail: matti.vihola@iki.fi, url: http://iki.fi/mvihola).

E. Saksman was supported by the Academy of Finland, Projects 113826 and 118765, and by the Finnish Centre of Excellence in Analysis and Dynamics Research.

M. Vihola was supported by the Academy of Finland, Projects 110599 and 201392, by the Finnish Academy of Science and Letters, Vilho, Yrjö and Kalle Väisälä Foundation, by the Finnish Centre of Excellence in Analysis and Dynamics Research and by the Finnish Graduate School in Stochastics and Statistics.

AMS subject classifications: Primary 65C05; secondary 65C40, 60J27, 93E15, 93E35.

Keywords: adaptive Markov chain Monte Carlo, convergence, ergodicity, Metropolis algorithm, stochastic approximation.

## 1 Introduction

The Markov chain Monte Carlo (MCMC) method, first proposed by metropolis, is a commonly used device for numerical approximation of integrals of the type

 π(f) = ∫ f(x) π(x) dx,

where π is a probability density function. Intuitively, the method is based on producing a sample X_1, X_2, …, X_n of random variables from the distribution π defines. The integral π(f) is approximated with the average n^{−1} ∑_{k=1}^n f(X_k). In particular, the random variables X_1, …, X_n are a realization of a Markov chain, constructed so that the chain has π as the unique invariant distribution.

One of the most commonly applied constructions of such a chain in R^d is to let X_1 := x_0 with some fixed point x_0 ∈ R^d, and recursively for n ≥ 2:

1. simulate Y_n = X_{n−1} + W_n, where W_n is an independent random variable distributed according to some symmetric proposal distribution q, for example, a zero-mean Gaussian, and

2. with probability min(1, π(Y_n)/π(X_{n−1})), the proposal is accepted and X_n = Y_n; otherwise the proposal is rejected and X_n = X_{n−1}.

This symmetric random-walk Metropolis algorithm is often efficient enough, even in relatively complex and high-dimensional situations, provided that the proposal distribution q is selected properly. Finding a good proposal for a particular problem can, however, be a difficult task.
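For concreteness, the two-step recursion above can be sketched in code. This is an illustrative sketch only (the function name, the log-density interface and the Gaussian proposal scale are our own choices, not from the paper):

```python
import numpy as np

def rw_metropolis(log_pi, x0, n_steps, proposal_std=1.0, rng=None):
    """Symmetric random-walk Metropolis with a fixed Gaussian proposal."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.atleast_1d(np.asarray(x0, dtype=float))
    chain = np.empty((n_steps, x.size))
    for n in range(n_steps):
        # Step 1: propose Y_n = X_{n-1} + W_n with W_n ~ N(0, proposal_std^2 I).
        y = x + proposal_std * rng.standard_normal(x.size)
        # Step 2: accept with probability min(1, pi(Y_n)/pi(X_{n-1})),
        # computed on the log scale for numerical stability.
        if np.log(rng.uniform()) < log_pi(y) - log_pi(x):
            x = y
        chain[n] = x
    return chain

# Toy target: standard normal density, pi(x) proportional to exp(-x^2/2).
chain = rw_metropolis(lambda x: -0.5 * float(np.sum(x ** 2)),
                      x0=3.0, n_steps=20000, rng=np.random.default_rng(1))
```

Averaging a function f over `chain` then approximates π(f), exactly as in the integral approximation described above.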

Recently, there have been a number of publications describing different adaptation techniques aiming to find a good proposal automatically saksman-am, andrieu-robert, atchade-rosenthal, andrieu-moulines, roberts-rosenthal (see also the review article andrieu-thoms). It has been a common practice to perform trial runs, and determine the proposal from the outcome. The recently proposed methods are different in that they adapt on-the-fly, continuously during the estimation run. In this paper, we focus on the forerunner of these methods, the Adaptive Metropolis (AM) algorithm saksman-am, which is a random-walk Metropolis sampler with a Gaussian proposal having a covariance Σ_n. The proposal covariance is updated continuously during the run, according to the history of the chain. In general, such an adaptation may, if carelessly implemented, destroy the correct ergodicity properties, that is, the averages n^{−1} ∑_{k=1}^n f(X_k) need not converge to π(f) as n → ∞ (see, e.g., roberts-rosenthal). For practical considerations of the AM algorithm, the reader may consult saksman-jrstatsoc, roberts-rosenthal-examples.
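A minimal sketch of the covariance adaptation, assuming the familiar form from Haario, Saksman and Tamminen (2001): the proposal covariance at time n is s_d(Cov(X_0, …, X_{n−1}) + ε I) with the scaling s_d = (2.4)²/d. The function name and interface are our own illustration:

```python
import numpy as np

def am_proposal_cov(history, epsilon=1e-6):
    """AM proposal covariance from the chain history (shape (n, d)).

    The ridge epsilon * I keeps the covariance nondegenerate, so the
    Gaussian proposal never collapses onto a lower-dimensional subspace.
    """
    history = np.atleast_2d(np.asarray(history, dtype=float))
    d = history.shape[1]
    s_d = 2.4 ** 2 / d                      # dimension-dependent scaling
    emp_cov = np.cov(history, rowvar=False).reshape(d, d)
    return s_d * (emp_cov + epsilon * np.eye(d))
```

In the algorithm itself this covariance is recomputed (in practice, updated recursively) at each step from the growing history; this continuous adaptation is exactly what the ergodicity analysis below must handle.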

This paper describes sufficient conditions under which the AM algorithm preserves the correct ergodicity properties, and n^{−1} ∑_{k=1}^n f(X_k) → π(f) almost surely as n → ∞ for any function f that is bounded on compact sets and grows at most exponentially as |x| → ∞. Our main result (Theorem 5) holds for the original AM process (without re-projections) having a target distribution supported on R^d. Essentially, the tails of the target density must decay super-exponentially and, for large enough radii, the contours of the density must be uniformly regular. Our assumptions are very close to the well-known conditions proposed by Jarner and Hansen jarner-hansen to ensure the geometric convergence of a (nonadaptive) Metropolis process. By the techniques of this paper, one may also establish a central limit theorem (see Theorem 5).

The ergodicity results for the AM process rely on three main contributions. First, in Section 2, we describe an adaptive MCMC framework, in which the adaptation parameter is constrained at each time to a feasible adaptation set. In Section 3, we prove a strong law of large numbers for such a process, through a modification of the technique of Andrieu and Moulines andrieu-moulines . Second, we propose an independent estimate for the growth rate of a process satisfying a general drift condition in Section 4. Third, in Section 5, we provide an estimate for constants of geometric drift for a symmetric random-walk Metropolis process, when the target distribution has super-exponentially decaying tails with regular contours.

The paper is essentially self-contained, and assumes little background knowledge. Only the basic martingale theory is needed to follow the argument, with the exception of Theorem A by Meyn and Tweedie meyn-tweedie-computable , restated in Appendix A. Even though we consider only the AM algorithm, our techniques apply also to many other adaptive MCMC schemes of similar type.

## 2 General framework and notation

We consider an adaptive Markov chain Monte Carlo (MCMC) chain evolving in the space X × S, where X is the state space of the “MCMC” chain X_n and the adaptation parameter S_n evolves in S, where S is a separable normed vector space. We assume an underlying probability space (Ω, F, P), and denote the expectation with respect to P by E. The natural filtration of the chain is denoted with {F_n}_{n≥0}, where F_n := σ(X_0, S_0, …, X_n, S_n). We also assume that we are given an increasing sequence 𝒮_1 ⊂ 𝒮_2 ⊂ ⋯ of subsets of the adaptation parameter space S. The random variables (X_n, S_n) form a stochastic chain, starting from X_0 := x_0 ∈ X and S_0 := s_0 ∈ 𝒮_1, and for n ≥ 0, satisfying the following recursion:

 X_{n+1} ∼ P_{S_n}(X_n, ·), (1)
 S_{n+1} = σ_{n+1}(S_n, η_{n+1} H(S_n, X_{n+1})), (2)

where P_s is a transition probability for each s ∈ S, H: S × X → S is an adaptation function, and (η_n)_{n≥1} is a decreasing sequence of positive adaptation step sizes. The constraint functions σ_n map the proposed update into 𝒮_n; in particular, σ_n(s, t) = s + t whenever s + t ∈ 𝒮_n. Thus, σ_n ensures that S_n lies in 𝒮_n for each n ≥ 1. The recursion (2) can also be considered as constrained Robbins–Monro stochastic approximation (see andrieu-moulines, sa-verifiable and references therein).
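A toy sketch of the constrained recursion (1)–(2). All interfaces here are hypothetical illustrations: `H` is the adaptation function, `sigma(n, s, t)` accepts the increment t only if s + t stays in the n-th feasible set, and the “kernel” simply draws i.i.d. samples so that the example stays short:

```python
import numpy as np

def constrained_sa(H, sigma, eta, draw_x, s0, x0, n_steps):
    """Constrained Robbins-Monro recursion: X_{n+1} ~ P_{S_n}(X_n, .),
    then S_{n+1} = sigma_{n+1}(S_n, eta_{n+1} H(S_n, X_{n+1}))."""
    s, x = s0, x0
    for n in range(1, n_steps + 1):
        x = draw_x(s, x)                    # step (1)
        s = sigma(n, s, eta(n) * H(s, x))   # step (2), constrained update
    return s

# Toy usage: track the mean of N(2, 1) with feasible sets S_n = [-n, n].
rng = np.random.default_rng(0)
s_final = constrained_sa(
    H=lambda s, x: x - s,                              # mean-tracking update
    sigma=lambda n, s, t: s + t if abs(s + t) <= n else s,
    eta=lambda n: 1.0 / (n + 1),                       # decreasing step sizes
    draw_x=lambda s, x: rng.normal(2.0, 1.0),          # kernel ignores s here
    s0=0.0, x0=0.0, n_steps=5000)
```

Because the sets [−n, n] grow, the constraint is active only finitely often in this toy run, and `s_final` settles near the target value 2.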

Let V: X → [1, ∞) be a function. We define the V-norm of a function f: X → R as

 ∥f∥_V := sup_{x∈X} |f(x)| / V(x).

As usual, we denote the integration of a function f with respect to a (signed) measure μ as μ(f) := ∫ f(x) μ(dx), and define P_s f(x) := ∫ f(y) P_s(x, dy) for a transition probability P_s. The V-norm of a signed measure μ is defined as

 ∥μ∥_V := sup_{|f|≤V} |μ(f)|.

The indicator function of a set A ⊂ X is denoted as 1_A(x) and equals one if x ∈ A and zero otherwise. In addition, we use the notation a ∨ b := max(a, b) and a ∧ b := min(a, b).
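As a concrete illustration (our own, not from the paper), the V-norm of f(x) = x under the drift function V(x) = 1 + x² can be approximated numerically; the ratio |x|/(1 + x²) is maximized at |x| = 1, so ∥f∥_V = 1/2:

```python
import numpy as np

def v_norm(f, V, grid):
    """Grid approximation of the V-norm: sup over x of |f(x)| / V(x)."""
    x = np.asarray(grid, dtype=float)
    return float(np.max(np.abs(f(x)) / V(x)))

grid = np.linspace(-10.0, 10.0, 100001)
val = v_norm(lambda x: x, lambda x: 1.0 + x ** 2, grid)   # close to 1/2
```

A finite V-norm thus expresses that f grows no faster than a constant multiple of the drift function V.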

Finally, we define the following regularity property for a family of functions.

###### Definition

Suppose V: X → [1, ∞). Given an increasing sequence of subsets 𝒮_n ⊂ 𝒮_{n+1} ⊂ S, n ≥ 1, we say that a family of functions f_s: X → R, with s ∈ S, is V-polynomially Lipschitz with constants (c, ε), c ≥ 1, ε ≥ 0, if for all n ≥ 1 and all s, s′ ∈ 𝒮_n, we have

 ∥f_s∥_V ≤ c n^ε  and  ∥f_s − f_{s′}∥_V ≤ c n^ε |s − s′|.

## 3 Ergodicity of sequentially constrained adaptive MCMC

This section contains general ergodicity results for a sequentially constrained process defined in Section 2. These results can be seen as auxiliary to our results on Adaptive Metropolis in Section 5, but they may be applied to other adaptive MCMC methods as well.

Suppose that the adaptation algorithm has the form given in (1) and (2), and that the following assumptions are satisfied for some constants c ≥ 1 and ε ≥ 0.

(A1) For each s ∈ S, the transition probability P_s has π as the unique invariant distribution.

(A2) For each n ≥ 1, the following uniform drift and minorization condition holds for all s ∈ 𝒮_n:

 P_s V(x) ≤ λ_n V(x) + b_n 1_{C_n}(x)  ∀x ∈ X, (3)
 P_s(x, A) ≥ δ_n ν_s(A)  ∀x ∈ C_n, ∀A ⊂ X, (4)

where C_n ⊂ X is a subset (a minorization set), V: X → [1, ∞) is a drift function such that sup_{x∈C_n} V(x) < ∞, and ν_s is a probability measure on X, concentrated on C_n. Furthermore, the constants λ_n ∈ (0, 1) and b_n ∈ [0, ∞) are increasing, and δ_n ∈ (0, 1] is decreasing, with respect to n, and they are polynomially bounded so that

 (1 − λ_n)^{−1} ∨ δ_n^{−1} ∨ b_n ≤ c n^ε.

(A3) For all n ≥ 1 and any r ∈ (0, 1], there is c′ = c′(r) ≥ 1 such that for all s, s′ ∈ 𝒮_n,

 ∥P_s f − P_{s′} f∥_{V^r} ≤ c′ n^ε ∥f∥_{V^r} |s − s′|.

(A4) There is a β ∈ [0, 1) such that for all n ≥ 1, s ∈ 𝒮_n and x ∈ X,

 |H(s, x)| ≤ c n^ε V^β(x).
###### Theorem

Assume (A1)–(A4) hold and let f be a function with ∥f∥_{V^α} < ∞ for some α ∈ (0, 1 − β). Assume ε ≤ ε_0, where ε_0 > 0 is an independent constant, and that ∑_{k=1}^∞ k^{ε_0 − 1} η_k < ∞. Then

 (1/n) ∑_{k=1}^n f(X_k) → π(f) as n → ∞, almost surely. (5)

The proof of Theorem 3 is postponed to the end of this section. We start with the following lemma, whose proof is given in Appendix A. It shows that if we have polynomially worse bounds for the drift and minorization constants, then the speed of geometric convergence gets only polynomially worse.

###### Lemma

Suppose (A2) holds. Then, one has for any r ∈ (0, 1] that for all n ≥ 1, s ∈ 𝒮_n, x ∈ X and k ≥ 1,

 ∥P_s^k(x, ·) − π(·)∥_{V^r} ≤ V^r(x) L_n ρ_n^k

with the bound

 L_n ∨ (1 − ρ_n)^{−1} ≤ c_2 n^{κ_2 ε},

where κ_2 ≥ 1 is an independent constant, and c_2 = c_2(r, c) ≥ 1.

Observe that the statement in Lemma 3 entails that any function f with ∥f∥_{V^r} < ∞ is integrable with respect to the measures π(·) and P_s^k(x, ·), for all s ∈ 𝒮_n, x ∈ X and k ≥ 1. The next three results are modified from Proposition 3, Lemma 5 and Proposition 6 of andrieu-moulines, respectively. The first one bounds the regularity of the solutions of the Poisson equation

 f̂_s − P_s f̂_s = f_s − π(f_s) (6)

for a polynomially Lipschitz family of functions.

###### Proposition

Suppose that (A1)–(A3) hold, and that the family of functions {f_s}_{s∈S} is V^r-polynomially Lipschitz with constants (c, ε), for some r ∈ (0, 1]. There is an independent constant κ_3 ≥ 1 and a constant c_3 = c_3(r) ≥ 1, such that:

(i) The family {P_s f_s}_{s∈S} is V^r-polynomially Lipschitz with constants (c_3, κ_3 ε).

(ii) Define, for any s ∈ S, the function

 f̂_s := ∑_{k=0}^∞ [P_s^k f_s − π(f_s)]. (7)

Then, f̂_s solves the Poisson equation (6), and the families {f̂_s}_{s∈S} and {P_s f̂_s}_{s∈S} are V^r-polynomially Lipschitz with constants (c_3, κ_3 ε). In other words,

 ∥f̂_s∥_{V^r} + ∥P_s f̂_s∥_{V^r} ≤ c_3 n^{κ_3 ε}, (8)
 ∥f̂_s − f̂_{s′}∥_{V^r} + ∥P_s f̂_s − P_{s′} f̂_{s′}∥_{V^r} ≤ c_3 n^{κ_3 ε} |s − s′| (9)

for all s, s′ ∈ 𝒮_n.

Proof. Throughout the proof, suppose s, s′ ∈ 𝒮_n.

The part (i) follows easily from Lemma 3, since

 ∥P_s f_s∥_{V^r} ≤ ∥P_s f_s − π(f_s)∥_{V^r} + |π(f_s)| ≤ [c_2 n^{κ_2 ε} + π(V^r)] ∥f_s∥_{V^r},
 ∥P_s f_s − P_{s′} f_{s′}∥_{V^r} ≤ ∥(P_s − P_{s′}) f_s∥_{V^r} + ∥P_{s′}(f_s − f_{s′})∥_{V^r}
  ≤ c′ n^ε ∥f_s∥_{V^r} |s − s′| + c̃ n^{κ_2 ε} ∥f_s − f_{s′}∥_{V^r} ≤ c̃ n^{(κ_2+1)ε} |s − s′|.

Consider then (ii). Estimate (8) follows by the definition of f̂_s and Lemma 3,

 ∥f̂_s∥_{V^r} ≤ ∑_{k=0}^∞ ∥P_s^k f_s − π(f_s)∥_{V^r} ≤ L_n ∥f_s∥_{V^r} ∑_{k=0}^∞ ρ_n^k
  = (L_n / (1 − ρ_n)) ∥f_s∥_{V^r} ≤ (c_2 n^{κ_2 ε})^2 c n^ε = c_2^2 c n^{(2κ_2+1)ε}.

The above bound clearly applies also to ∥P_s f̂_s∥_{V^r}, and the convergence implies that f̂_s solves (6).

For (9), define an auxiliary transition probability Π by setting Πf := π(f), and write

 P_s^k f − P_{s′}^k f = ∑_{j=0}^{k−1} (P_s^j − Π)(P_s − P_{s′})[P_{s′}^{k−j−1} f − π(f)]

since Π(P_s − P_{s′})g = 0 for all g. By Lemma 3 and assumption (A3), we have for all 0 ≤ j ≤ k − 1

 ∥(P_s^j − Π)(P_s − P_{s′})[P_{s′}^{k−j−1} f − π(f)]∥_{V^r}
  ≤ L_n ρ_n^j ∥(P_s − P_{s′})[P_{s′}^{k−j−1} f − π(f)]∥_{V^r}
  ≤ L_n ρ_n^j c′ n^ε |s − s′| ∥P_{s′}^{k−j−1} f − π(f)∥_{V^r}
  ≤ L_n^2 ρ_n^{k−1} c′ n^ε |s − s′| ∥f∥_{V^r},

which gives that

 ∥P_s^k f − P_{s′}^k f∥_{V^r} ≤ k L_n^2 ρ_n^{k−1} c′ n^ε |s − s′| ∥f∥_{V^r}. (10)

Write then

 f̂_s − f̂_{s′} = ∑_{k=0}^∞ [P_s^k f_s − P_{s′}^k f_s] − ∑_{k=0}^∞ [P_{s′}^k (f_{s′} − f_s) − π(f_{s′} − f_s)].

By Lemma 3 and estimate (10) we have

 ∥f̂_s − f̂_{s′}∥_{V^r} ≤ L_n^2 c′ n^ε |s − s′| (∑_{k=0}^∞ k ρ_n^{k−1}) ∥f_s∥_{V^r} + L_n (∑_{k=0}^∞ ρ_n^k) ∥f_{s′} − f_s∥_{V^r}
  ≤ [L_n^2 c′ n^ε (1 − ρ_n)^{−2} c n^ε + L_n (1 − ρ_n)^{−1} c n^ε] |s − s′|
  ≤ [(c_2 n^{κ_2 ε})^2 c′ n^ε (c_2 n^{κ_2 ε})^2 c n^ε + (c_2 n^{κ_2 ε})(c_2 n^{κ_2 ε}) c n^ε] |s − s′|
  ≤ 2 c_2^4 c′ c n^{(4κ_2+2)ε} |s − s′|.

The same bound applies, with a similar argument, to ∥P_s f̂_s − P_{s′} f̂_{s′}∥_{V^r}.

###### Lemma

Assume that (A2) holds. Then, for all r ∈ (0, 1], any sequence (a_j)_{j≥1} of positive numbers, and all 1 ≤ m ≤ k, we have that

 E[V^r(X_k)] ≤ c_4^r k^{2rε} V^r(x_0), (11)
 E[max_{m≤j≤k} (a_j V(X_j))^r] ≤ c_4^r (∑_{j=m}^k a_j j^{2ε})^r V^r(x_0), (12)

where the constant c_4 ≥ 1 depends only on c.

Proof. For k ≥ 1 and r = 1, we can apply the drift inequality (3) and the monotonicity of λ_n and b_n to obtain

 E[V(X_k)] = E[E[V(X_k) | F_{k−1}]] = E[P_{S_{k−1}} V(X_{k−1})]
  ≤ λ_k E[V(X_{k−1})] + b_k ≤ ⋯ ≤ λ_k^k V(x_0) + b_k ∑_{j=0}^{k−1} λ_k^j
  ≤ (1 + b_k ∑_{j=0}^∞ λ_k^j) V(x_0) ≤ (1 + c^2 k^{2ε}) V(x_0) ≤ c_4 k^{2ε} V(x_0). (13)

This estimate with Jensen’s inequality yields for r ∈ (0, 1] that

 E[V^r(X_k)] ≤ (E[V(X_k)])^r ≤ c_4^r k^{2rε} V^r(x_0).

Similarly, we have

 E[max_{m≤j≤k} (a_j V(X_j))^r] ≤ E[(∑_{j=m}^k a_j V(X_j))^r] ≤ (∑_{j=m}^k a_j E[V(X_j)])^r ≤ c_4^r (∑_{j=m}^k a_j j^{2ε})^r V^r(x_0)

by Jensen’s inequality and estimate (13).

Assume that {f_s}_{s∈S} is a regular enough family of functions. Consider the following decomposition, which is one of the key observations in andrieu-moulines,

 ∑_{j=1}^k [f_{S_j}(X_j) − π(f_{S_j})] = M_k + R_k^{(1)} + R_k^{(2)}, (14)

where (M_k)_{k≥1} is a martingale with respect to (F_k)_{k≥1}, and R_k^{(1)} and R_k^{(2)} are “residual” sequences, given by

 M_k := ∑_{j=1}^k [f̂_{S_{j−1}}(X_j) − P_{S_{j−1}} f̂_{S_{j−1}}(X_{j−1})],
 R_k^{(1)} := ∑_{j=1}^k [f̂_{S_j}(X_j) − f̂_{S_{j−1}}(X_j)],
 R_k^{(2)} := P_{S_0} f̂_{S_0}(X_0) − P_{S_k} f̂_{S_k}(X_k).
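As a numerical sanity check (ours, not part of the paper): for a nonadaptive chain, where S_j is constant so that R_k^{(1)} ≡ 0, the decomposition can be verified exactly on a two-state chain by solving the Poisson equation with the fundamental matrix:

```python
import numpy as np

rng = np.random.default_rng(3)
P = np.array([[0.7, 0.3], [0.4, 0.6]])   # fixed transition matrix
f = np.array([1.0, -2.0])
pi = np.array([4.0, 3.0]) / 7.0          # stationary distribution: pi P = pi

# Solve the Poisson equation fhat - P fhat = f - pi(f)
# via the fundamental matrix (I - P + 1 pi^T)^{-1}.
pi_f = float(pi @ f)
fhat = np.linalg.solve(np.eye(2) - P + np.outer(np.ones(2), pi), f - pi_f)
Pfhat = P @ fhat

# Simulate the chain and accumulate both sides of the decomposition.
x_prev = 0                                # X_0
x = x_prev
lhs, M = 0.0, 0.0
for _ in range(1000):
    x = rng.choice(2, p=P[x])
    lhs += f[x] - pi_f                    # sum of f(X_j) - pi(f)
    M += fhat[x] - Pfhat[x_prev]          # martingale increment
    x_prev = x
R2 = Pfhat[0] - Pfhat[x]                  # P fhat(X_0) - P fhat(X_k)
assert abs(lhs - (M + R2)) < 1e-9         # identity holds path by path
```

The identity holds exactly (up to rounding) for every realized path, not just in expectation, which is what makes the decomposition useful for controlling the ergodic averages term by term.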

Recall that f̂_s solves the Poisson equation (6). The following proposition controls the fluctuations of these terms individually.

###### Proposition

Assume (A1)–(A4) hold, and let {f_s}_{s∈S} be V^α-polynomially Lipschitz with constants (c, ε) for some α ∈ (0, 1 − β]. Then, for any p > 1 such that (α + β)p ≤ 1, for all ξ ∈ (0, 1] and δ > 0, there is a c* = c*(p) ≥ 1, such that for all n ≥ 1,

 P[sup_{k≥n} |M_k|/k ≥ δ] ≤ c* δ^{−p} n^{pε* − (p/2)∧(p−1)} V^{αp}(x_0), (15)
 P[sup_{k≥n} |R_k^{(1)}|/k^ξ ≥ δ] ≤ c* δ^{−p} (∑_{j=1}^∞ (j∨n)^{ε*−ξ} η_j)^p V^{(α+β)p}(x_0), (16)
 P[sup_{k≥n} |R_k^{(2)}|/k^ξ ≥ δ] ≤ c* δ^{−p} n^{pε* − (ξ−α)p} V^{αp}(x_0), (17)

whenever ε is small enough to ensure that ε* := κ*ε < (1/2) ∧ (1 − 1/p), where κ* ≥ 1 is an independent constant.

Proof. In this proof, c̃ ≥ 1 stands for a constant that can take different values at each appearance. By Proposition 3, we have that ∥f̂_s∥_{V^α} ∨ ∥P_s f̂_s∥_{V^α} ≤ c_3 ℓ^{κ_3 ε} for all s ∈ 𝒮_ℓ. Since |f̂_s(x)| ≤ ∥f̂_s∥_{V^α} V^α(x), we can bound the martingale differences dM_ℓ := M_ℓ − M_{ℓ−1} for ℓ ≥ 1 as follows:

 E|dM_ℓ|^p = E|f̂_{S_{ℓ−1}}(X_ℓ) − P_{S_{ℓ−1}} f̂_{S_{ℓ−1}}(X_{ℓ−1})|^p
  ≤ E|∥f̂_{S_{ℓ−1}}∥_{V^α} V^α(X_ℓ) + ∥P_{S_{ℓ−1}} f̂_{S_{ℓ−1}}∥_{V^α} V^α(X_{ℓ−1})|^p
  ≤ 2^p (c_3 ℓ^{κ_3 ε})^p (E[V^{αp}(X_ℓ)] + E[V^{αp}(X_{ℓ−1})])
  ≤ 2^{p+1} c_3^p c_4^{αp} ℓ^{pκ_3 ε} ℓ^{2αpε} V^{αp}(x_0) ≤ c̃ ℓ^{(κ_3+2α)pε} V^{αp}(x_0) (18)

by (11) of Lemma 3. For p ≥ 2, we have, by Burkholder and Minkowski’s inequalities,

 E|M_k|^p ≤ c_p E[(∑_{ℓ=1}^k (dM_ℓ)^2)^{p/2}] ≤ c_p (∑_{ℓ=1}^k (E|dM_ℓ|^p)^{2/p})^{p/2}
  ≤ c̃ k^{(κ_3+2α)pε + p/2} V^{αp}(x_0),

where the constant c_p depends only on p. For 1 < p < 2, the estimate (18) yields, by Burkholder’s inequality,

 E|M_k|^p ≤ c_p E[(∑_{ℓ=1}^k (|dM_ℓ|^p)^{2/p})^{p/2}] ≤ c_p ∑_{ℓ=1}^k E|dM_ℓ|^p
  ≤ c̃ k^{(κ_3+2α)pε + 1} V^{αp}(x_0).

The two cases combined give that

 E|M_k|^p ≤ c̃ k^{(κ_3+2α)pε + (p/2)∨1} V^{αp}(x_0). (19)

Now, by Corollary B of Birnbaum and Marshall’s inequality in Appendix B,

 P[max_{n≤k≤m} |M_k|/k ≥ δ] ≤ δ^{−p} [m^{−p} E|M_m|^p + ∑_{k=n}^{m−1} (k^{−p} − (k+1)^{−p}) E|M_k|^p]
  ≤ δ^{−p} [m^{−p} E|M_m|^p + p ∑_{k=n}^{m−1} k^{−p−1} E|M_k|^p]

for all m ≥ n. By letting m → ∞, we have from (19)

 m^{−p} E|M_m|^p ≤ c̃ m^{p(κ*ε + (1/2)∨(1/p) − 1)} → 0 as m → ∞,

since κ*ε < (1/2) ∧ (1 − 1/p). Now, (15) follows by

 P[sup_{k≥n} |M_k|/k ≥ δ] ≤ c̃ δ^{−p} [∑_{k=n}^∞ k^{(κ_3+2α)pε + (p/2)∨1 − p − 1}] V^{αp}(x_0)
  ≤ c̃ δ^{−p} n^{pκ*ε − (p/2)∧(p−1)} V^{αp}(x_0),

since we have that κ* ≥ κ_3 + 2α.

By Proposition 3, ∥f̂_s − f̂_{s′}∥_{V^α} ≤ c_3 ℓ^{κ_3 ε} |s − s′| for all s, s′ ∈ 𝒮_ℓ. By construction, |S_ℓ − S_{ℓ−1}| ≤ η_ℓ |H(S_{ℓ−1}, X_ℓ)|, and assumption (A4) ensures that |H(S_{ℓ−1}, X_ℓ)| ≤ c ℓ^ε V^β(X_ℓ), so

 |f̂_{S_ℓ}(X_ℓ) − f̂_{S_{ℓ−1}}(X_ℓ)| ≤ c_3 ℓ^{κ_3 ε} |S_ℓ − S_{ℓ−1}| V^α(X_ℓ) ≤ c_3 ℓ^{κ_3 ε} η_ℓ c ℓ^ε V^{α+β}(X_ℓ).

Let ξ ∈ (0, 1]. Since k^{−ξ} ≤ (ℓ ∨ n)^{−ξ} for ℓ ≤ k and n ≤ k, we obtain

 k^{−ξ} |R_k^{(1)}| ≤ k^{−ξ} ∑_{ℓ=1}^k |f̂_{S_ℓ}(X_ℓ) − f̂_{S_{ℓ−1}}(X_ℓ)| ≤ c̃ ∑_{ℓ=1}^k (ℓ ∨ n)^{(κ_3+1)ε − ξ} η_ℓ V^{α+β}(X_ℓ)

and then by Minkowski’s inequality and (11) of Lemma 3,

 E[max_{n≤k≤m} k^{−ξp} |R_k^{(1)}|^p] ≤ E[∑_{ℓ=1}^m c̃ (ℓ ∨ n)^{(κ_3+1)ε − ξ} η_ℓ V^{α+β}(X_ℓ)]^p (20)
  ≤ c̃ [∑_{ℓ=1}^m (ℓ ∨ n)^{(κ_3+1)ε − ξ} η_ℓ (E[V^{(α+β)p}(X_ℓ)])^{1/p}]^p
  ≤ c̃ [∑_{ℓ=1}^∞ (ℓ ∨ n)^{(κ_3+1+2α+2β)ε − ξ} η_ℓ]^p V^{(α+β)p}(x_0).

Finally, consider R_k^{(2)}. From Proposition 3, we have that |P_{S_k} f̂_{S_k}(X_k)| ≤ c_3 k^{κ_3 ε} V^α(X_k), and by (12) of Lemma 3,

 E[max_{n≤k≤m} k^{−ξp} |P_{S_k} f̂_{S_k}(X_k)|^p] ≤ c_3^p E[max_{n≤k≤m} (k^{(κ_3 ε − ξ)/α} V(X_k))^{αp}]
  ≤ c_3^p c_4^{αp} (∑_{k=n}^m k^{(κ_3 ε − ξ)/α + 2ε})^{αp} V^{αp}(x_0)
  ≤ c̃ n^{(κ_3+2α)pε + (α−ξ)p} V^{αp}(x_0).