On the ergodicity of the adaptive Metropolis algorithm on unbounded domains

University of Helsinki and University of Jyväskylä

Department of Mathematics and Statistics, P.O. Box 68, FI-00014 University of Helsinki, Finland
Department of Mathematics and Statistics, P.O. Box 35 (MaD), FI-40014 University of Jyväskylä, Finland

Received June 2008; revised February 2009
Abstract

This paper describes sufficient conditions to ensure the correct ergodicity of the Adaptive Metropolis (AM) algorithm of Haario, Saksman and Tamminen [Bernoulli 7 (2001) 223–242] for target distributions with a noncompact support. The conditions ensuring a strong law of large numbers require that the tails of the target density decay super-exponentially and have regular contours. The result is based on the ergodicity of an auxiliary process that is sequentially constrained to feasible adaptation sets, independent estimates of the growth rate of the AM chain and the corresponding geometric drift constants. The ergodicity result of the constrained process is obtained through a modification of the approach due to Andrieu and Moulines [Ann. Appl. Probab. 16 (2006) 1462–1505].

Ann. Appl. Probab. 20 (2010), no. 6, 2178–2203. DOI: 10.1214/10-AAP682.

Running head: On the ergodicity of adaptive Metropolis

Eero Saksman¹ (eero.saksman@helsinki.fi) and Matti Vihola² (matti.vihola@iki.fi, http://iki.fi/mvihola)

¹ Supported by the Academy of Finland, Projects 113826 and 118765, and by the Finnish Centre of Excellence in Analysis and Dynamics Research.

² Supported by the Academy of Finland, Projects 110599 and 201392, by the Finnish Academy of Science and Letters, Vilho, Yrjö and Kalle Väisälä Foundation, by the Finnish Centre of Excellence in Analysis and Dynamics Research and by the Finnish Graduate School in Stochastics and Statistics.

AMS subject classifications: Primary 65C05; secondary 65C40, 60J27, 93E15, 93E35.

Keywords: Adaptive Markov chain Monte Carlo, convergence, ergodicity, Metropolis algorithm, stochastic approximation.

1 Introduction

The Markov chain Monte Carlo (MCMC) method, first proposed in [metropolis], is a commonly used device for the numerical approximation of integrals of the type
\[
I \equiv \int_{\mathbb{R}^d} f(x)\,\pi(x)\,dx,
\]
where $\pi$ is a probability density function. Intuitively, the method is based on producing a sample $X_1, X_2, \ldots, X_n$ of random variables from the distribution that $\pi$ defines. The integral $I$ is approximated with the average $I_n \equiv n^{-1}\sum_{k=1}^{n} f(X_k)$. In particular, the random variables $X_1,\ldots,X_n$ are a realization of a Markov chain, constructed so that the chain has $\pi$ as its unique invariant distribution.

One of the most commonly applied constructions of such a chain in $\mathbb{R}^d$ is to let $X_1 \equiv x_0$ with some fixed point $x_0 \in \mathbb{R}^d$, and recursively for $n \ge 1$:

  1. simulate $Y_{n+1} = X_n + W_{n+1}$, where $W_{n+1}$ is an independent random variable distributed according to some symmetric proposal distribution $q$, for example, a zero-mean Gaussian, and

  2. with probability $\min\{1, \pi(Y_{n+1})/\pi(X_n)\}$, the proposal is accepted and $X_{n+1} \equiv Y_{n+1}$; otherwise the proposal is rejected and $X_{n+1} \equiv X_n$.

This symmetric random-walk Metropolis algorithm is often efficient enough, even in a relatively complex and high-dimensional situation, provided that the proposal distribution is selected properly. Finding a good proposal for a particular problem can, however, be a difficult task.
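To make the algorithm concrete, the following is a minimal sketch of this symmetric random-walk Metropolis recursion with a fixed zero-mean Gaussian proposal; the log-density log_pi, the initial point x0 and the proposal covariance cov are assumed to be supplied by the user and are not part of the paper's notation.

import numpy as np

def random_walk_metropolis(log_pi, x0, cov, n_iter, rng=None):
    """Symmetric random-walk Metropolis with a fixed Gaussian proposal.

    log_pi : callable returning the log target density (up to an additive constant)
    x0     : initial point in R^d
    cov    : proposal covariance matrix (d x d)
    """
    rng = np.random.default_rng() if rng is None else rng
    d = len(x0)
    chain = np.empty((n_iter, d))
    x, log_px = np.asarray(x0, dtype=float), log_pi(x0)
    for n in range(n_iter):
        y = x + rng.multivariate_normal(np.zeros(d), cov)  # propose Y = X + W
        log_py = log_pi(y)
        # accept with probability min{1, pi(Y)/pi(X)}
        if np.log(rng.uniform()) < log_py - log_px:
            x, log_px = y, log_py
        chain[n] = x
    return chain

For instance, with the standard Gaussian target log_pi = lambda x: -0.5 * float(x @ x), the empirical average of f over the returned chain approximates the corresponding integral I.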

Recently, there have been a number of publications describing different adaptation techniques that aim to find a good proposal automatically [saksman-am, andrieu-robert, atchade-rosenthal, andrieu-moulines, roberts-rosenthal] (see also the review article [andrieu-thoms]). It has been a common practice to perform trial runs and determine the proposal from their outcome. The recently proposed methods are different in that they adapt on-the-fly, continuously during the estimation run. In this paper, we focus on the forerunner of these methods, the Adaptive Metropolis (AM) algorithm [saksman-am], which is a random-walk Metropolis sampler with a Gaussian proposal whose covariance is updated continuously during the run, according to the history of the chain. In general, such an adaptation may, if carelessly implemented, destroy the correct ergodicity properties, that is, the distribution of $X_n$ need not converge to $\pi$ as $n\to\infty$ (see, e.g., [roberts-rosenthal]). For practical considerations of the AM algorithm, the reader may consult [saksman-jrstatsoc, roberts-rosenthal-examples].
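For concreteness, here is a minimal sketch of the covariance adaptation in the spirit of the AM algorithm: a running mean and covariance are updated recursively and turned into a proposal covariance with the scaling $2.4^2/d$ and a small regularization eps suggested in [saksman-am]. The exact recursion and its parameters below are illustrative assumptions rather than a verbatim restatement of the algorithm analyzed in this paper.

import numpy as np

def am_update(mean, cov, x_new, n, eps=1e-6):
    """One recursive update of the running mean and covariance after n samples,
    producing the Gaussian proposal covariance of an AM-type sampler.

    mean, cov : running mean and covariance of the first n chain states
    x_new     : the (n+1)th chain state
    """
    d = len(x_new)
    s_d = 2.4**2 / d                                  # scaling of Haario et al.
    new_mean = mean + (x_new - mean) / (n + 1)
    # rank-one (Welford-type) update of the empirical covariance
    new_cov = (n / (n + 1)) * cov + np.outer(x_new - mean, x_new - new_mean) / (n + 1)
    proposal_cov = s_d * (new_cov + eps * np.eye(d))  # regularized proposal covariance
    return new_mean, new_cov, proposal_cov

In an AM-type sampler, proposal_cov would replace the fixed proposal covariance of the random-walk step at each iteration, possibly only after an initial nonadaptive period.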

In the original paper [saksman-am] presenting the AM algorithm, the first ergodicity result for such adaptive algorithms was obtained. More precisely, a strong law of large numbers was proved for bounded functionals, when the algorithm is run on a compact subset of $\mathbb{R}^d$. Since then, several authors have obtained more general conditions under which an adaptive MCMC process preserves the correct ergodicity properties. Andrieu and Robert [andrieu-robert] established the connection between adaptive MCMC and stochastic approximation, and proposed a general framework for adaptation. Atchadé and Rosenthal [atchade-rosenthal] further developed the technique of [saksman-am]. Andrieu and Moulines [andrieu-moulines] made important progress by generalizing the Poisson equation and martingale approximation techniques to the adaptive setting; they proved ergodicity and a central limit theorem for a class of adaptive MCMC schemes. Roberts and Rosenthal [roberts-rosenthal] used an interesting approach based on coupling to show a weak law of large numbers. However, in the case of AM, all these techniques essentially assume that the adapted parameter is constrained to a predefined compact set, or do not present concrete verifiable conditions. The only result to overcome this assumption is the one by Andrieu and Moulines [andrieu-moulines]. Their result, however, requires a modification of the algorithm, including additional re-projections back to some fixed compact set.

This paper describes sufficient conditions under which the AM algorithm preserves the correct ergodicity properties, so that $I_n \to I$ almost surely as $n\to\infty$ for any function $f$ that is bounded on compact sets and grows at most exponentially as $|x|\to\infty$. Our main result (Theorem 5) holds for the original AM process (without re-projections) having a target distribution supported on $\mathbb{R}^d$. Essentially, the target density must have super-exponentially decaying tails and, for all large enough $|x|$, uniformly regular contours. Our assumptions are very close to the well-known conditions proposed by Jarner and Hansen [jarner-hansen] to ensure the geometric convergence of a (nonadaptive) Metropolis process. By the techniques of this paper, one may also establish a central limit theorem (see Theorem 5).
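For reference, the conditions of Jarner and Hansen [jarner-hansen] alluded to above are commonly written as follows (the precise assumptions used later in this paper may differ in detail):
\[
\lim_{|x|\to\infty} \frac{x}{|x|}\cdot\nabla\log\pi(x) = -\infty
\qquad\text{(super-exponentially decaying tails)}
\]
and
\[
\limsup_{|x|\to\infty} \frac{x}{|x|}\cdot\frac{\nabla\pi(x)}{|\nabla\pi(x)|} < 0
\qquad\text{(uniformly regular contours)}.
\]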

The ergodicity results for the AM process rely on three main contributions. First, in Section 2, we describe an adaptive MCMC framework, in which the adaptation parameter is constrained at each time to a feasible adaptation set. In Section 3, we prove a strong law of large numbers for such a process, through a modification of the technique of Andrieu and Moulines andrieu-moulines . Second, we propose an independent estimate for the growth rate of a process satisfying a general drift condition in Section 4. Third, in Section 5, we provide an estimate for constants of geometric drift for a symmetric random-walk Metropolis process, when the target distribution has super-exponentially decaying tails with regular contours.

The paper is essentially self-contained and assumes little background knowledge. Only basic martingale theory is needed to follow the argument, with the exception of Theorem A by Meyn and Tweedie [meyn-tweedie-computable], restated in Appendix A. Even though we consider only the AM algorithm, our techniques also apply to many other adaptive MCMC schemes of a similar type.

2 General framework and notation

We consider an adaptive Markov chain Monte Carlo (MCMC) chain $(X_n, S_n)_{n\ge0}$ evolving in the product space $\mathbb{X}\times\mathbb{S}$, where $\mathbb{X}$ is the state space of the "MCMC" chain $(X_n)_{n\ge0}$, and the adaptation parameter $S_n$ evolves in $\mathbb{S}$, a separable normed vector space. We assume an underlying probability space $(\Omega,\mathcal{F},\mathbb{P})$, and denote the expectation with respect to $\mathbb{P}$ by $\mathbb{E}$. The natural filtration of the chain is denoted by $(\mathcal{F}_n)_{n\ge0}$, where $\mathcal{F}_n\equiv\sigma(X_0,S_0,\ldots,X_n,S_n)$. We also assume that we are given an increasing sequence $(K_n)_{n\ge1}$ of subsets of the adaptation parameter space $\mathbb{S}$. The random variables $(X_n,S_n)_{n\ge0}$ form a stochastic chain, starting from some fixed $X_0\equiv x_0$ and $S_0\equiv s_0$, and satisfying for $n\ge0$ the following recursion:

(1)  $X_{n+1} \sim P_{S_n}(X_n, \cdot)$,
(2)  $S_{n+1} = \sigma_{n+1}\bigl(S_n + \eta_{n+1}\, H(S_n, X_{n+1})\bigr)$,

where $P_s$ is a transition probability for each $s\in\mathbb{S}$, $H$ is an adaptation function, and $(\eta_n)_{n\ge1}$ is a decreasing sequence of adaptation step sizes. The constraint functions $\sigma_n$ take values in $K_n$; thus, $\sigma_n$ ensures that $S_n$ lies in $K_n$ for each $n\ge1$. The recursion (2) can also be considered as a constrained Robbins–Monro stochastic approximation (see [andrieu-moulines, sa-verifiable] and references therein).
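The following is a minimal sketch of the recursion (1) and (2); the callables transition, H and sigma and the step-size sequence eta stand in for $P_s$, $H$, $\sigma_n$ and $(\eta_n)_{n\ge1}$ above, and are assumptions supplied by the user rather than objects defined in the paper.

import numpy as np

def constrained_adaptive_mcmc(transition, H, sigma, eta, x0, s0, n_iter, rng=None):
    """Sequentially constrained adaptive MCMC recursion.

    transition(s, x, rng) -> a draw from P_s(x, .)
    H(s, x)               -> adaptation direction in the parameter space
    sigma(n, s)           -> constrains the parameter to the feasible set K_n
    eta                   -> decreasing step sizes; eta[n] plays the role of eta_{n+1}
    """
    rng = np.random.default_rng() if rng is None else rng
    x, s = np.asarray(x0, dtype=float), np.asarray(s0, dtype=float)
    states = np.empty((n_iter, len(x)))
    for n in range(n_iter):
        x = transition(s, x, rng)                   # (1): X_{n+1} ~ P_{S_n}(X_n, .)
        s = sigma(n + 1, s + eta[n] * H(s, x))      # (2): constrained adaptation step
        states[n] = x
    return states, s

With sigma(n, s) = s, this reduces to an unconstrained Robbins–Monro-type adaptation.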

Let $V\colon \mathbb{X}\to[1,\infty)$ be a function. We define the $V$-norm of a function $f\colon\mathbb{X}\to\mathbb{R}$ as
\[
\|f\|_{V} \equiv \sup_{x\in\mathbb{X}} \frac{|f(x)|}{V(x)}.
\]
As usual, we denote the integration of a function $f$ with respect to a (signed) measure $\mu$ as $\mu(f)\equiv\int f(x)\,\mu(dx)$, and define $Pf(x)\equiv\int P(x,dy)f(y)$ for a transition probability $P$. The $V$-norm of a signed measure $\mu$ is defined as
\[
\|\mu\|_{V} \equiv \sup_{\|f\|_{V}\le 1} |\mu(f)|.
\]
The indicator function of a set $A$ is denoted by $\mathbb{1}_{A}(x)$ and equals one if $x\in A$ and zero otherwise. In addition, we use the notation $a\wedge b\equiv\min\{a,b\}$ and $a\vee b\equiv\max\{a,b\}$.

Finally, we define the following regularity property for a family of functions.

Definition. Suppose $V\colon\mathbb{X}\to[1,\infty)$ and $\beta\in(0,1]$. Given an increasing sequence of subsets $K_n\subset\mathbb{S}$, $n\ge1$, we say that a family of functions $f_s\colon\mathbb{X}\to\mathbb{R}$, with $s\in\mathbb{S}$, is polynomially Lipschitz if, for all $n\ge1$ and all $s, s'\in K_n$, the norms $\|f_s\|_{V^{\beta}}$ and the differences $\|f_s - f_{s'}\|_{V^{\beta}}$, the latter relative to $\|s-s'\|$, are bounded by constants that grow at most polynomially in $n$.

3 Ergodicity of sequentially constrained adaptive MCMC

This section contains general ergodicity results for the sequentially constrained process defined in Section 2. These results can be seen as auxiliary to our results on Adaptive Metropolis in Section 5, but they may be applied to other adaptive MCMC methods as well.

Suppose that the adaptation algorithm has the form given in (1) and (2), and that the following assumptions are satisfied.

(A1) For each $s\in\mathbb{S}$, the transition probability $P_s$ has $\pi$ as its unique invariant distribution.

(A2) For each $n\ge1$, the following uniform drift and minorization condition holds for all $s\in K_n$ and $x\in\mathbb{X}$:

    (3)  $P_s V(x) \le \lambda_n V(x) + b_n \mathbb{1}_{C_n}(x)$,
    (4)  $P_s(x,\cdot) \ge \delta_n\, \nu_s(\cdot)$ for all $x\in C_n$,

    where $C_n\subset\mathbb{X}$ is a subset (a minorization set), $V\colon\mathbb{X}\to[1,\infty)$ is a drift function, and $\nu_s$ is a probability measure on $\mathbb{X}$, concentrated on $C_n$. Furthermore, the constants $\lambda_n\in(0,1)$ and $b_n<\infty$ are increasing, and $\delta_n\in(0,1]$ is decreasing, with respect to $n$, and they are polynomially bounded so that $(1-\lambda_n)^{-1}$, $b_n$ and $\delta_n^{-1}$ grow at most polynomially in $n$.

(A3) For each $n\ge1$ and any $s, s'\in K_n$, the transition probabilities are Lipschitz in the parameter, in the sense that $\|P_s f - P_{s'} f\|_{V^{\beta}}$ is bounded by a constant times $\|f\|_{V^{\beta}}\,\|s - s'\|$, with the constant growing at most polynomially in $n$.

(A4) There is a constant such that, for all $n\ge1$ and $s\in K_n$, the adaptation function $H(s,\cdot)$ grows at most polynomially in $n$ when measured in the $V^{\beta}$-norm.

Theorem

Assume that (A1)–(A4) hold and let $f$ be a function with $\|f\|_{V^{\beta}}<\infty$ for some $\beta$. Assume also that the adaptation step sizes $(\eta_n)_{n\ge1}$ and the constraint sets $(K_n)_{n\ge1}$ satisfy suitable growth conditions, one involving an independent constant. Then

(5)  $\displaystyle \lim_{n\to\infty} \frac{1}{n}\sum_{k=1}^{n} f(X_k) = \pi(f)$ almost surely.

The proof of Theorem 3 is postponed to the end of this section. We start with the following lemma, whose proof is given in Appendix A. It shows that if the bounds for the drift and minorization constants become polynomially worse, then the speed of geometric convergence becomes only polynomially worse.

Lemma

Suppose (A2) holds. Then, for all $n\ge1$, $s\in K_n$, $x\in\mathbb{X}$ and $k\ge0$, the iterates $P_s^{k}(x,\cdot)$ converge to $\pi$ geometrically fast in the $V^{\beta}$-norm, with a multiplicative constant and a geometric rate that deteriorate at most polynomially in $n$; the bound involves an independent constant.
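Schematically, and with hypothetical symbols $L_n$ and $\rho_n$ that are not the paper's notation, the conclusion of the lemma is a bound of the form
\[
\|P_s^{k}(x,\cdot)-\pi\|_{V^{\beta}} \le L_n\,\rho_n^{k}\,V^{\beta}(x),
\qquad s\in K_n,\ x\in\mathbb{X},\ k\ge0,
\]
where $L_n$ and $(1-\rho_n)^{-1}$ grow at most polynomially in $n$; in the paper such constants are derived from the drift and minorization constants of (A2) via the computable bounds restated in Appendix A.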

Observe that the statement in Lemma 3 entails that any function $f$ with $\|f\|_{V^{\beta}}<\infty$ is integrable with respect to the measures $\pi$ and $P_s^{k}(x,\cdot)$, for all $s\in K_n$, $x\in\mathbb{X}$ and $k\ge0$. The next three results are modified from Proposition 3, Lemma 5 and Proposition 6 of [andrieu-moulines], respectively. The first one bounds the regularity of the solutions $\hat{f}_s$ of the Poisson equation

(6)  $\hat{f}_s - P_s\hat{f}_s = f_s - \pi(f_s)$

for a polynomially Lipschitz family of functions $(f_s)_{s\in\mathbb{S}}$.

Proposition

Suppose that (A1)–(A3) hold, and that the family of functions $(f_s)_{s\in\mathbb{S}}$ is polynomially Lipschitz. Then there are an independent constant and a constant depending on the family such that:

(i) The centered family $f_s - \pi(f_s)$, $s\in\mathbb{S}$, is polynomially Lipschitz.

(ii) Define, for any $s\in\mathbb{S}$, the function

(7)  $\hat{f}_s(x) \equiv \sum_{k=0}^{\infty}\bigl(P_s^{k} f_s(x) - \pi(f_s)\bigr)$.

Then $\hat{f}_s$ solves the Poisson equation (6), and the families $(\hat{f}_s)_{s\in\mathbb{S}}$ and $(P_s\hat{f}_s)_{s\in\mathbb{S}}$ are polynomially Lipschitz. In other words,

(8)  the norms $\|\hat{f}_s\|_{V^{\beta}}$ and $\|P_s\hat{f}_s\|_{V^{\beta}}$, and
(9)  the differences $\|\hat{f}_s - \hat{f}_{s'}\|_{V^{\beta}}$ and $\|P_s\hat{f}_s - P_{s'}\hat{f}_{s'}\|_{V^{\beta}}$, measured relative to $\|s-s'\|$,

are bounded by constants that grow at most polynomially in $n$, for all $s, s'\in K_n$.

Proof.

Throughout the proof, suppose .

Part (i) follows easily from Lemma 3, since

Consider then (ii). Estimate (8) follows from the definition (7) of $\hat{f}_s$ and Lemma 3; the bound clearly applies also to $P_s\hat{f}_s$, and the convergence of the series implies that $\hat{f}_s$ solves (6).

For (9), define an auxiliary transition probability by setting , and write

since for all . By Lemma 3 and assumption (A3), we have for all and

which gives that

(10)

Write then

By Lemma 3 and estimate (10) we have

The same bound applies, with a similar argument, to .

Lemma

Assume that (A2) holds. Then, for all , any sequence of positive numbers, and , we have that

(11)
(12)

where the constant depends only on .

Proof.

For and , we can apply the drift inequality (3) and the monotonicity of and to obtain

This estimate with Jensen’s inequality yields for that

Similarly, we have

by Jensen’s inequality and estimate (3).

Assume that $(f_s)_{s\in\mathbb{S}}$ is a regular enough family of functions, and consider the following decomposition, which is one of the key observations in [andrieu-moulines]: in (14), the centered partial sums are written as the sum of a martingale term and two "residual" sequences, where the martingale is with respect to the filtration $(\mathcal{F}_n)_{n\ge0}$ and the residual sequences are given in terms of the solutions of the Poisson equation. Recall that $\hat{f}_s$ solves the Poisson equation (6). The following proposition controls the fluctuations of these terms individually.
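For orientation, one standard way to write such a decomposition for a fixed function $f$ (a schematic version; the indexing and bookkeeping of (14) in the paper, which handles a whole family $f_s$, may differ) is
\[
\sum_{k=1}^{n}\bigl[f(X_k)-\pi(f)\bigr]
= M_n
+ \sum_{k=1}^{n}\bigl[P_{S_k}\hat{f}_{S_k}(X_k)-P_{S_{k-1}}\hat{f}_{S_{k-1}}(X_k)\bigr]
+ P_{S_0}\hat{f}_{S_0}(X_0)-P_{S_n}\hat{f}_{S_n}(X_n),
\]
with the martingale
\[
M_n \equiv \sum_{k=1}^{n}\bigl[\hat{f}_{S_{k-1}}(X_k)-P_{S_{k-1}}\hat{f}_{S_{k-1}}(X_{k-1})\bigr].
\]
In the proofs below, the martingale term is controlled with Burkholder's inequality and the residual terms with the Lipschitz and moment bounds established above.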

Proposition

Assume that (A1)–(A4) hold, and let $(f_s)_{s\in\mathbb{S}}$ be polynomially Lipschitz. Then, for all $n\ge1$, there is a constant such that the bounds (15), (16) and (17) on the terms of the decomposition (14) hold, provided the relevant exponent is small enough, with a threshold given by an independent constant.

Proof.

In this proof, generic constants may take different values at each appearance. By Proposition 3, we have that for all . Since , we can bound the martingale differences for as follows:

by (11) of Lemma 3. For , we have, by Burkholder and Minkowski’s inequalities,

where the constant depends only on . For , the estimate (3) yields, by Burkholder’s inequality,

The two cases combined give that

(19)

Now, by Corollary B of Birnbaum and Marshall’s inequality in Appendix B,

for all . By letting , we have from (19)

since . Now, (15) follows by

since we have that .

By Proposition 3, for . By construction, , and assumption (A4) ensures that , so

Let . Since for , we obtain

and then by Minkowski’s inequality and (11) of Lemma 3,

(20)

Finally, consider . From Proposition 3, we have that , and by (12) of Lemma 3,