On the ergodicity of the adaptive Metropolis algorithm on unbounded domains
Abstract
This paper describes sufficient conditions to ensure the correct ergodicity of the Adaptive Metropolis (AM) algorithm of Haario, Saksman and Tamminen [Bernoulli 7 (2001) 223–242] for target distributions with a noncompact support. The conditions ensuring a strong law of large numbers require that the tails of the target density decay superexponentially and have regular contours. The result is based on the ergodicity of an auxiliary process that is sequentially constrained to feasible adaptation sets, independent estimates of the growth rate of the AM chain and the corresponding geometric drift constants. The ergodicity result of the constrained process is obtained through a modification of the approach due to Andrieu and Moulines [Ann. Appl. Probab. 16 (2006) 1462–1505].
10.1214/10AAP682 \volume20 \issue6 2010 \firstpage2178 \lastpage2203 \newproclaimdefinitionDefinition\newproclaimremark[definition]Remark
On the ergodicity of adaptive Metropolis
A]\fnmsEero \snmSaksman\thanksreft1label=e1]eero.saksman@helsinki.fi and B]\fnmsMatti \snmVihola\corref\thanksreft2label=e2]matti.vihola@iki.fi label=u2,url]http://iki.fi/mvihola
t1Supported by the Academy of Finland, Projects 113826 and 118765, and by the Finnish Centre of Excellence in Analysis and Dynamics Research.
t2Supported by the Academy of Finland, Projects 110599 and 201392, by the Finnish Academy of Science and Letters, Vilho, Yrjö and Kalle Väisälä Foundation, by the Finnish Centre of Excellence in Analysis and Dynamics Research and by the Finnish Graduate School in Stochastics and Statistics.
class=AMS] \kwd[Primary ]65C05 \kwd[; secondary ]65C40 \kwd60J27 \kwd93E15 \kwd93E35. Adaptive Markov chain Monte Carlo \kwdconvergence \kwdergodicity \kwdMetropolis algorithm \kwdstochastic approximation.
1 Introduction
The Markov chain Monte Carlo (MCMC) method, first proposed by metropolis , is a commonly used device for numerical approximation of integrals of the type
where is a probability density function. Intuitively, the method is based on producing a sample of random variables from the distribution defines. The integral is approximated with the average . In particular, the random variables are a realization of a Markov chain, constructed so that the chain has as the unique invariant distribution.
One of the most commonly applied constructions of such a chain in is to let with some fixed point , and recursively for :

simulate , where is an independent random variable distributed according to some symmetric proposal distribution , for example, a zeromean Gaussian, and

with probability , the proposal is accepted and ; otherwise the proposal is rejected and .
This symmetric randomwalk Metropolis algorithm is often efficient enough, even in a relatively complex and highdimensional situation, provided that the proposal distribution is selected properly. Finding a good proposal for a particular problem can, however, be a difficult task.
Recently, there has been a number of publications describing different adaptation techniques aiming to find a good proposal automatically saksmanam , andrieurobert , atchaderosenthal , andrieumoulines , robertsrosenthal (see also the review article andrieuthoms ). It has been a common practice to perform trial runs, and determine the proposal from the outcome. The recently proposed methods are different in that they adapt onthefly, continuously during the estimation run. In this paper, we focus on the forerunner of these methods, the Adaptive Metropolis (AM) algorithm saksmanam , which is a randomwalk Metropolis sampler with a Gaussian proposal having a covariance . The proposal covariance is updated continuously during the run, according to the history of the chain. In general, such an adaptation may, if carelessly implemented, destroy the correct ergodicity properties, that is, that does not converge to as (see, e.g., robertsrosenthal ). For practical considerations of the AM algorithm, the reader may consult saksmanjrstatsoc , robertsrosenthalexamples .
In the original paper saksmanam presenting the AM algorithm, the first ergodicity result for such adaptive algorithms was obtained. More precisely, a strong law of large numbers was proved for bounded functionals, when the algorithm is run on a compact subset of . After that, several authors have obtained more general conditions under which an adaptive MCMC process preserves the correct ergodicity properties. Andrieu and Robert andrieurobert established the connection between adaptive MCMC and stochastic approximation, and proposed a general framework for adaptation. Atchadé and Rosenthal atchaderosenthal developed further the technique of saksmanam . Andrieu and Moulines andrieumoulines made important progress by generalizing the Poisson equation and martingale approximation techniques to the adaptive setting. They proved the ergodicity and a central limit theorem for a class of adaptive MCMC schemes. Roberts and Rosenthal robertsrosenthal use an interesting approach based on coupling to show a weak law of large numbers. However, in the case of AM, all the techniques essentially assume that the adapted parameter is constrained to a predefined compact set, or do not present concrete verifiable conditions. The only result to overcome this assumption is the one by Andrieu and Moulines andrieumoulines . Their result, however, requires a modification of the algorithm, including additional reprojections back to some fixed compact set.
This paper describes sufficient conditions under which the AM algorithm preserves the correct ergodicity properties, and almost surely as for any function that is bounded on compact sets and grows at most exponentially as . Our main result (Theorem 5) holds for the original AM process (without reprojections) having a target distribution supported on . Essentially, the target density must have asymptotically lighter tails than for some , and for large enough , the sets must have uniformly regular contours. Our assumptions are very close to the wellknown conditions proposed by Jarner and Hansen jarnerhansen to ensure the geometric convergence of a (nonadaptive) Metropolis process. By the techniques of this paper, one may also establish a central limit theorem (see Theorem 5).
The ergodicity results for the AM process rely on three main contributions. First, in Section 2, we describe an adaptive MCMC framework, in which the adaptation parameter is constrained at each time to a feasible adaptation set. In Section 3, we prove a strong law of large numbers for such a process, through a modification of the technique of Andrieu and Moulines andrieumoulines . Second, we propose an independent estimate for the growth rate of a process satisfying a general drift condition in Section 4. Third, in Section 5, we provide an estimate for constants of geometric drift for a symmetric randomwalk Metropolis process, when the target distribution has superexponentially decaying tails with regular contours.
The paper is essentially selfcontained, and assumes little background knowledge. Only the basic martingale theory is needed to follow the argument, with the exception of Theorem A by Meyn and Tweedie meyntweediecomputable , restated in Appendix A. Even though we consider only the AM algorithm, our techniques apply also to many other adaptive MCMC schemes of similar type.
2 General framework and notation
We consider an adaptive Markov chain Monte Carlo (MCMC) chain evolving in space , where is the state space of the “MCMC” chain and the adaptation parameter evolves in , where is a separable normed vector space. We assume an underlying probability space , and denote the expectation with respect to by . The natural filtration of the chain is denoted with where . We also assume that we are given an increasing sequence of subsets of the adaptation parameter space . The random variables form a stochastic chain, starting from and , and for , satisfying the following recursion:
(1)  
(2) 
where is a transition probability for each , is an adaptation function, and is a decreasing sequence of adaptation step sizes . The functions are defined as
Thus, ensures that lies in for each . The recursion (2) can also be considered as constrained Robbins–Monro stochastic approximation (see andrieumoulines , saverifiable and references therein).
Let be a function. We define a norm of a function as
As usual, we denote the integration of a function with respect to a (signed) measure as , and define for a transition probability . The norm of a signed measure is defined as
The indicator function of a set is denoted as and equals one if and zero otherwise. In addition, we use the notation and .
Finally, we define the following regularity property for a family of functions . {definition} Suppose . Given an increasing sequence of subsets , , we say that a family of functions , with , is polynomially Lipschitz with constants , if for all , we have
3 Ergodicity of sequentially constrained adaptive MCMC
This section contains general ergodicity results for a sequentially constrained process defined in Section 2. These results can be seen auxiliary to our results on Adaptive Metropolis in Section 5, but may be applied to other adaptive MCMC methods as well.
Suppose that the adaptation algorithm has the form given in (1) and (2), and the following assumptions are satisfied for some and .

[(A1)]

For each , the transition probability has as the unique invariant distribution.

For each , the following uniform drift and minorization condition holds for all :
(3) (4) where is a subset (a minorization set), is a drift function such that , and is a probability measure on , concentrated on . Furthermore, the constants and are increasing, and is decreasing with respect to , and they are polynomially bounded so that

For all and any , there is such that for all ,

There is a such that for all , and
Theorem \thedefinition
Assume (A1)–(A4) hold and let be a function with for some . Assume , where is an independent constant, and that . Then
(5) 
The proof of Theorem 3 is postponed to the end of this section. We start by the following lemma, whose proof is given in Appendix A. It shows that if we have polynomially worse bounds for drift and minorization constants, then the speed of geometric convergence can get only polynomially worse.
Lemma \thedefinition
Suppose (A2) holds. Then, one has for that for all and ,
with bound
where is an independent constant, and .
Observe that the statement in Lemma 3 entails that any function is integrable with respect to the measures and , for all , and . The next three results are modified from Proposition 3, Lemma 5 and Proposition 6 of andrieumoulines , respectively. The first one bounds the regularity of the solutions of the Poisson equation
(6) 
for a polynomially Lipschitz family of functions.
Proposition \thedefinition
Suppose that (A1)–(A3) hold, and the family of functions is polynomially Lipschitz with constants , for some . There is an independent constant and a constant , such that:
The family is polynomially Lipschitz with constants .
Define, for any , the function
(7) 
Then, solves the Poisson equation (6), and the families and are polynomially Lipschitz with constants . In other words,
(8)  
(9) 
for all .
Throughout the proof, suppose .
The part (i) follows easily from Lemma 3, since
Consider then (ii). Estimate (8) follows by the definition of and Lemma 3,
The above bound clearly applies also to , and the convergence implies that solves (6).
For (9), define an auxiliary transition probability by setting , and write
since for all . By Lemma 3 and assumption (A3), we have for all and
which gives that
(10) 
Write then
By Lemma 3 and estimate (10) we have
The same bound applies, with a similar argument, to .
Lemma \thedefinition
Assume that (A2) holds. Then, for all , any sequence of positive numbers, and , we have that
(11)  
(12) 
where the constant depends only on .
For and , we can apply the drift inequality (3) and the monotonicity of and to obtain
This estimate with Jensen’s inequality yields for that
Similarly, we have
by Jensen’s inequality and estimate (3).
Assume that is a regular enough family of functions. Consider the following decomposition, which is one of the key observations in andrieumoulines ,
(14) 
where is a martingale with respect to , and and are “residual” sequences, given by
Recall that solves the Poisson equation (6). The following proposition controls the fluctuations of these terms individually.
Proposition \thedefinition
Assume (A1)–(A4) hold, and let be polynomially Lipschitz with constants for some . Then, for any , for all and , there is a , such that for all ,
(15)  
(16)  
(17) 
whenever is small enough to ensure that , where is an independent constant.
In this proof, is a constant that can take different values at each appearance. By Proposition 3, we have that for all . Since , we can bound the martingale differences for as follows:
by (11) of Lemma 3. For , we have, by Burkholder and Minkowski’s inequalities,
where the constant depends only on . For , the estimate (3) yields, by Burkholder’s inequality,
The two cases combined give that
(19) 
Now, by Corollary B of Birnbaum and Marshall’s inequality in Appendix B,
for all . By letting , we have from (19)
since . Now, (15) follows by
since we have that .