Sequentially interacting Markov chain Monte Carlo methods

Sequentially interacting Markov chain Monte Carlo methods

[ [    [ [    [ [ Carnegie Mellon University, INRIA Bordeaux Sud-Ouest and University of British Columbia A. Brockwell
Department of Statistics
Carnegie Mellon University
5000 Forbes Avenue
Pittsburgh, Pennsylvania 15123
USA
\printeade1
P. Del Moral
Centre INRIA Bordeaux Sud-Ouest
Institut de Mathématiques
Université Bordeaux I
33405 Talence Cedex
France
\printeade2
A. Doucet
Departments of Computer Science
 and Statistics
University of British Columbia
333-6356 Agricultural Road
Vancouver, BC, V6T 1Z2
Canada
\printeade3
\smonth1 \syear2008\smonth8 \syear2009
\smonth1 \syear2008\smonth8 \syear2009
\smonth1 \syear2008\smonth8 \syear2009
Abstract

Sequential Monte Carlo (SMC) is a methodology for sampling approximately from a sequence of probability distributions of increasing dimension and estimating their normalizing constants. We propose here an alternative methodology named Sequentially Interacting Markov Chain Monte Carlo (SIMCMC). SIMCMC methods work by generating interacting non-Markovian sequences which behave asymptotically like independent Metropolis–Hastings (MH) Markov chains with the desired limiting distributions. Contrary to SMC, SIMCMC allows us to iteratively improve our estimates in an MCMC-like fashion. We establish convergence results under realistic verifiable assumptions and demonstrate its performance on several examples arising in Bayesian time series analysis.

[
\kwd
\doi

10.1214/09-AOS747 \volume38 \issue6 2010 \firstpage3387 \lastpage3411 \newproclaimassumptionAssumption

\runtitle

Sequentially interacting MCMC

{aug}

A]\fnmsAnthony \snmBrockwelllabel=e1]Abrock@stat.cmu.edu, B]\fnmsPierre \snmDel Morallabel=e2]Pierre.Del-Moral@inria.fr and C]\fnmsArnaud \snmDoucet\correflabel=e3]Arnaud@stat.ubc.ca

class=AMS] \kwd[Primary ]65C05 \kwd60J05 \kwd[; secondary ]62F15. Markov chain Monte Carlo \kwdnormalizing constants \kwdsequential Monte Carlo \kwdstate-space models.

1 Introduction

Let us consider a sequence of probability distributions where , which we will refer to as “target” distributions. We shall also refer to as the time index. For ease of presentation, we shall assume here that is defined on a measurable space where , and , . We denote where for . Each is assumed to admit a density with respect to a -finite dominating measure denoted and . Additionally, we have

where is known pointwise and the normalizing constant is unknown.

In a number of important applications, it is desirable to be able to sample from the sequence of distributions and to estimate their normalizing constants ; the most popular statistical application is the class of nonlinear non-Gaussian state-space models detailed in Section 4. In this context, is the posterior distribution of the hidden state variables from time to given the observations from time to and is the marginal likelihood of these observations. Many other applications, including contingency tables and population genetics, are discussed in Delm04 (), Douc01 () and Liu2001 ().

A standard approach to solve this class of problems relies on Sequential Monte Carlo (SMC) methods; see Douc01 () and Liu2001 () for a review of the literature. In the SMC approach, the target distributions are approximated by a large number of random samples, termed particles, which are carried forward over time by using a combination of sequential importance sampling and resampling steps. These methods have become the tools of choice for sequential Bayesian inference but, even when there is no requirement for “real-time” inference, SMC algorithms are increasingly used as an alternative to MCMC; see, for example, Chopin2002 (), DelMoralDoucetJasra2006 () and Liu2001 () for applications to econometrics models, finite mixture models and contingency tables. They also allow us to implement goodness-of-fit tests easily in a time series context whereas a standard MCMC implementation is cumbersome gerlach1999 (). Moreover, they provide an estimate of the marginal likelihood of the data.

The SMC methodology is now well established and many theoretical convergence results are available Delm04 (). Nevertheless, in practice, it is typically impossible to, a priori, determine the number of particles necessary to achieve a fixed precision for a given application. In such scenarios, users typically perform multiple runs for an increasing number of particles until stabilization of the Monte Carlo estimates is observed. Moreover, SMC algorithms are substantially different from MCMC algorithms and can appear difficult to implement for nonspecialists.

In this paper, we propose an alternative to SMC named Sequentially Interacting Markov Chain Monte Carlo (SIMCMC). SIMCMC methods allow us to compute Monte Carlo estimates of the quantities of interest iteratively as they are, for instance, when using MCMC methods. This allows us to refine the Monte Carlo estimates until a suitably chosen stopping time. Furthermore, for people familiar with MCMC methods, SIMCMC methods are somewhat simpler than SMC methods to implement, because they only rely on MH steps. However, SIMCMC methods are not a class of MCMC methods. These are non-Markovian algorithms which can be interpreted as an approximation of “ideal” standard MCMC chains. It is based on the same key idea as SMC methods; that is as is often “close” to , it is sensible to use as part of a proposal distribution to sample . In SMC methods, the correction between the proposal distribution and the target distribution is performed using Importance Sampling whereas in SIMCMC methods it is performed using an MH step. Such a strategy is computationally much more efficient than sampling separately from each target distribution using standard MCMC methods and also provides direct estimates of the normalizing constants .

The potential real-time applications of SIMCMC methods are also worth commenting on. SMC methods have been used in various real-time engineering applications, for example, in neural decoding brockwell2004 () and in target tracking Gilks99 (), septier2009 (). In these problems, it is important to be able to compute functionals of the posterior distributions of some quantity of interest, but it must also be done in real-time. SMC methods work with collections of particles that are updated sequentially to reflect these distributions. Clearly, in such real-time problems it is important that the collections of particles are not too large, or else the computational burden can cause the SMC algorithm to fall behind the system being analyzed. SIMCMC methods provide a very convenient way to make optimal use of what computing power is available. Since SIMCMC works by adding one particle at a time to collections representing distributions, we can simply run it continually in between arrival of successive observations, and it will accrue as many particles as it can in whatever amount of time is taken.

Related ideas where we also have a sequence of nested MCMC-like chains “feeding” each other and targeting a sequence of increasingly complex distributions have recently appeared in statistics kou2006 () and physics lyman2006 (). In the equi-energy sampler kou2006 (), the authors consider a sequence of distributions indexed by a temperature and an energy truncation whereas in lyman2006 () the authors consider a sequence of coarse-grained distributions. It is also possible to think of SIMCMC methods and the algorithms in kou2006 () and lyman2006 () as nonstandard adaptive MCMC schemes andrieumoulines2006 (), robertsrosenthal2007 () where the parameters to be adapted are probability distributions instead of finite-dimensional parameters. Our convergence results rely partly on ideas developed in this field andrieumoulines2006 ().

The rest of the paper is organized as follows. In Section 2, we describe SIMCMC methods, give some guidelines for the design of efficient algorithms and discuss implementation issues. In Section 3, we present some convergence results. In Section 4, we demonstrate the performance of this algorithm for various Bayesian time series problems and compare it to SMC. Finally, we discuss a number of further potential extensions in Section 5. The proofs of the results in Section 3 can be found in Appendix.

2 Sequentially interacting Markov chain Monte Carlo

2.1 The SIMCMC algorithm

Let be the iteration counter. The SIMCMC algorithm constructs sequences . In Section 3, we establish weak necessary conditions ensuring that as approaches infinity, the distribution of approaches ; we will assume here that these conditions are satisfied to explain the rationale behind our algorithm. To specify the algorithm, we require a sequence of proposal distributions, specified by their densities

Each () is a probability density in its last argument with respect to , which may depend (for ) on the first argument. Proposals are drawn from for updates of the sequence , from for updates of the sequence , and so on. (Selection of proposal distributions is discussed in Section 2.2.) Based on these proposals, we define the weights

(1)

For any measure on , we define

and

(2)

We also denote by the empirical measure approximation of the target distribution given by

(3)

Intuitively, the SIMCMC algorithm proceeds as follows. At each iteration of the algorithm, the algorithm samples for by first sampling , then and so on. For , is a standard Markov chain generated using an independent MH sampler of invariant distribution and proposal distribution . For , we would like to approximate an independent MH sampler of invariant distribution and proposal distribution . As it is impossible to sample from exactly, we replace at iteration by its current empirical measure approximation . Similarly for , we approximate an MH sampler of invariant distribution and proposal distribution by replacing at iteration by its current empirical measure approximation . The sequences generated this way are clearly non-Markovian.

Sequentially interacting Markov chain Monte Carlo

Initialization,

For , set randomly .

For iteration

For

Sample .

With probability

(4)

set , otherwise set .

For

Sample .

With probability

(5)

set , otherwise set .

The (ratio of) normalizing constants can easily be estimated by

(6)

Equation (6) follows from the identity

and the fact that asymptotically (as ) is distributed according to under weak conditions given in Section 3.

2.2 Algorithm settings

In a similar manner to SMC methods, the performance of the SIMCMC algorithm depends on the proposal distributions. However, it is possible to devise some useful guidelines for this sequence of (pseudo-)independent samplers, using reasoning similar to that adopted in the SMC framework. Asymptotically, is distributed according to and is just the importance weight (up to a normalizing constant) between and . The proposal distribution minimizing the variance of this importance weight is simply given by

(7)

where is the conditional density of given under , that is,

(8)

In the SMC literature, is called the “optimal” importance density Douc00 (). This yields

(9)

where

(10)

with

In this case, as is independent of , the algorithm described above can be further simplified. It is indeed possible to decide whether to accept or reject a candidate even before sampling it. This is more computationally efficient because if the move is to be rejected there is no need to sample the candidate. In most applications, it will be difficult to sample from (7) and/or to compute (9) as it involves computing up to a normalizing constant. In this case, we recommend approximating (7). Similar strategies have been developed successfully in the SMC framework fearnhead1999 (), Douc00 (), liuchen1998 (), Pitt99 (). The advantages of such sampling strategies in the SIMCMC case will be demonstrated in the simulation section.

Generally speaking, most of the methodology developed in the SMC setting can be directly reapplied here. This includes the use of Rao-Blackwellisation techniques to reduce the dimensionality of the target distributions Douc00 (), liuchen1998 () or of auxiliary particle-type ideas where we build target distributions biased toward “promising” regions of the space fearnhead1999 (), Pitt99 ().

2.3 Implementation issues

2.3.1 Burn-in and storage requirements

We have presented the algorithm without any burn-in. This can be easily included if necessary by considering at iteration of the algorithm

where

where is an appropriate number of initial samples to be discarded as burn-in. Note that when , we have .

Note that in its original form, the SIMCMC algorithm requires storing the sequences . This could be expensive if the number of target distributions and/or the number of iterations of the SIMCMC are large. However, in many scenarios of interest, including nonlinear non-Gaussian state-space models or the scenarios considered in DelMoralDoucetJasra2006 (), it is possible to drastically reduce these storage requirements as we are only interested in estimating the marginals and we have and . In such cases, we only need to store , resulting in significant memory savings.

2.3.2 Combining sampling strategies

In practice, it is possible to combine the SIMCMC strategy with SMC methods; that is we can generate say (approximate) samples from using SMC then use the SIMCMC strategy to increase the number of particles until the Monte Carlo estimates stabilize. We emphasize that SIMCMC will be primarily useful in the context where we do not have a predetermined computational budget. Indeed, if the computational budget is fixed, then better estimates could be obtained by switching the iteration and time loops in the SIMCMC algorithm.

2.4 Discussion and extensions

Standard MCMC methods do not address the problem solved by SIMCMC methods. Trans-dimensional MCMC methods Green03 () allow us to sample from a sequence of “related” target distributions of different dimensions but require the knowledge of the ratio of normalizing constants between these target distributions. Simulated tempering and parallel tempering require all target distributions to be defined on the same space and rely on MCMC kernels to explore each target distribution; see jasra2006 () for a recent discussion of such techniques. As mentioned earlier in the Introduction, ideas related to SIMCMC where a sequence of “ideal” MCMC algorithms is approximated have recently appeared in statistics kou2006 () and physics lyman2006 (). However, contrary to these algorithms, the target distributions considered here are of increasing dimension and the proposed interacting mechanism is simpler. Whereas the equi-energy sampler kou2006 () allows “swap” moves between chains, we only use the samples of the sequence to feed but is never used to generate .

There are many possible extensions of the SIMCMC algorithm. In this respect, the SIMCMC algorithm is somehow a proof-of-concept algorithm demonstrating that it is possible to make sequences targeting different distributions interact without the need to define a target distribution on an extended state space. For example, a simple modification of the SIMCMC algorithm can be easily parallelized. Instead of sampling our candidate at iteration according to we can sample it instead from : this allows us to simulate the sequences on parallel processors. It is straightforward to adapt the convergence results given in Section 3 to this parallel version of SIMCMC.

In the context of real-time applications where is typically the posterior distribution of some states given the observations , SIMCMC methods can also be very useful. Indeed, SMC methods cannot easily address situations where the observations arrive at random times whereas SIMCMC methods allow us to make optimal use of what computing power is available by adding as many particles as possible until the arrival of a new observation. In such cases, a standard implementation would consist of updating our approximation of at “time” by adding iteratively particles to the approximations for a lag until the arrival of data . For , such schemes have been recently proposed independently in septier2009 ().

3 Convergence results

We now present some convergence results for SIMCMC. Despite the non-Markovian nature of this algorithm, we are here able to provide realistic verifiable assumptions ensuring the asymptotic consistency of the SIMCMC estimates. Our technique of proof rely on Poisson’s equation glynn1996 (); similar tools have been used in andrieumoulines2006 () to study the convergence of adaptive MCMC schemes and in delmoralmiclo2003 () to study the stability of self-interacting Markov chains.

Let us introduce such that where . We denote by the expectation with respect to the distribution of the simulated sequences initialized at and . For any measure and test function , we write .

Our proofs rely on the following assumption. {assumption} For any , there exists such that for any

(11)

This assumption is quite weak and can be easily checked in all the examples presented in Section 4. Note that a similar assumption also appears when bounds are established for SMC methods Delm04 ().

Our first result establishes the convergence of the empirical averages toward the correct expectations at the standard Monte Carlo rate.

Theorem 3.1

Given Assumption 3, for any and any , there exist such that for any , and

Using (6), a straightforward corollary of Theorem 3.1 is the following result.

Theorem 3.2

Given Assumption 3, for any and any , there exist such that for any , and

and for

By the Borel–Cantelli lemma, Theorems 3.1 and 3.2 also ensure almost sure convergence of the empirical averages and of the normalizing constant estimates.

Our final result establishes that each sequence converges toward .

Theorem 3.3

Given Assumption 3, for any , and we have

4 Applications

In this section, we will focus on the applications of SIMCMC to nonlinear non-Gaussian state-space models. Consider an unobserved -valued Markov process satisfying

We assume that we have access to observations which, conditionally on , are independent and distributed according to

(12)

This family of models is important, because almost every stationary time series model appearing in the literature can be cast into this form. Given , we are often interested in computing the sequence of posterior distributions to perform goodness-of-fit and/or to compute the marginal likelihood . By defining the unnormalized distribution as

(13)

(which is typically known pointwise), we have and so that SIMCMC can be applied.

We will consider here three examples where the SIMCMC algorithms are compared to their SMC counterparts. For a fixed number of iterations/particles, SMC and SIMCMC have approximately the same computational complexity. The same proposals and the same number of samples were thus used to allow for a fair comparison. Note that we chose not to use any burn-in period for the SIMCMC and we initialize the algorithm by picking for any where is a sample from the prior. The SMC algorithms were implemented using a stratified resampling procedure kitagawa1996 (). The first two examples compare SMC to SIMCMC in terms of log-likelihood estimates. The third example demonstrates the use of SIMCMC in a real-time tracking application.

4.1 Linear Gaussian model

We consider a linear Gaussian model where :

(14)

with , , , and is a random (doubly stochastic) matrix. Here is a Gaussian distribution of mean and variance-covariance matrix . For this model we can compute the marginal likelihood exactly using the Kalman filter. This allows us to compare our results to the ground truth.

We use two proposal densities: the prior density and the optimal density (14) given by which is a Gaussian. In both cases, it is easy to check that Assumption 3 is satisfied.

For , we simulated a realization of observations for and . In Tables 1 and 2, we present the performance of both SIMCMC and SMC for a varying , a varying number of samples and the two proposal distributions in terms on Root Mean Square Error (RMSE) of the log-likelihood estimate.

1000 2500 5000 10,000 25,000
SMC,
SIMCMC,
SMC,
SIMCMC,
SMC,
SIMCMC,
Table 1: RMSE for SMC and SIMCMC algorithms over 100 realizations for prior proposal
1000 2500 5000 10,000 25,000
SMC, 0.33 0.17 0.09 0.06 0.04
SIMCMC, 0.37 0.19 0.14 0.11 0.06
SMC, 0.28 0.16 0.10 0.07 0.06
SIMCMC, 0.29 0.23 0.15 0.12 0.07
SMC, 0.18 0.14 0.09 0.05 0.07
SIMCMC, 0.31 0.20 0.16 0.12 0.10
Table 2: RMSE for SMC and SIMCMC algorithms over 100 realizations for optimal proposal

As expected, the performance of our estimates is very significantly improved when the optimal distribution is used as the observations are quite informative. Unsurprisingly, SMC outperform SIMCMC for a fixed computational complexity.

4.2 A nonlinear non-Gaussian state-space model

We now consider a nonlinear non-Gaussian state-space model introduced in kitagawa1996 () which has been used in many SMC publications:

where , and . As the sign of the state is not observed, the posterior distributions are multimodal. SMC approximations are able to capture properly the multimodality of the posteriors. This allows us to assess here whether SIMCMC can also explore properly these multimodal distributions by comparing SIMCMC estimates to SMC estimates.

We use as a proposal density the prior density . In this case, it is easy to check that Assumption 3 is satisfied.

2500 5000 10,000 25,000 50,000
SMC, 0.80 0.55 0.40 0.24 0.17
SIMCMC, 0.95 0.60 0.75 0.59 0.41
SMC, 0.33 0.23 0.17 0.11 0.07
SIMCMC, 0.91 0.70 0.50 0.38 0.29
SMC, 0.13 0.10 0.08 0.05 0.03
SIMCMC, 0.28 0.21 0.19 0.12 0.08
Table 3: Average RMSE of log-likelihood estimates for SMC and SIMCMC algorithms over 100 realizations

In Table 3, we present the performance of both SIMCMC and SMC for a varying number of samples and a varying whereas we set . Both SMC and SIMCMC are performing better as the signal to noise ratio degrades. This should not come as a surprise. As we are using the prior as a proposal, it is preferable to have a diffuse likelihood for good performance. Experimentally we observed that SIMCMC and SMC estimates coincide for large . However for a fixed computational complexity, SIMCMC is outperformed by SMC in terms of RMSE.

4.3 Target tracking

We consider here a target tracking problem Gilks99 (), Liu2001 (). The target is modeled using a standard constant velocity model

(15)

where , with and

The state vector is such that (resp., ) corresponds to the horizontal (resp., vertical) position of the target whereas (resp., ) corresponds to the horizontal (resp., vertical) velocity. In many real tracking applications, the observations are collected at random times fox2001 (). We have the following measurement equation:

(16)

where ; these parameters are representative of real-world tracking scenarios. We assume that we only observe at the time indexes where and, when , we observe with probability . We are here interested in estimating the associated sequence of posterior distributions on-line.

Assume the computational complexity of the SMC method is such that only particles can be used in one time unit. In such scenarios, we can either use SMC with particles to estimate the sequence of posterior distributions of interest or SMC with say particles and chose to ignore observations that might appear when . These are two standard approaches used in applications. Alternatively, we can use the SIMCMC method to select adaptively the number of particles as discussed in Section 2.4. If SIMCMC algorithm only adds particles to the approximation of the current posterior at time , it will use approximately particles to approximate this posterior if the next observation only appears at time .

Algorithm SMC with SMC with SIMCMC
Average RMSE 2.14 3.21 1.62
Table 4: Average RMSE of the Monte Carlo state estimate

We simulated 50 realizations of observations according to the model (15) and (16) and use as a proposal density the prior density . This ensures that Assumption 3 is satisfied. In Table 4, we display the performance of SMC with particles, particles (ignoring some observations) and SIMCMC using an adaptive number of particles in terms of the average RMSE of the estimate of the conditional expectation . In such scenarios, SIMCMC methods clearly outperforms SMC methods.

5 Discussion

We have described an iterative algorithm based on interacting non-Markovian sequences which is an alternative to SMC and have established convergence results validating this methodology. The algorithm is straightforward to implement for people already familiar with MCMC. The main strength of SIMCMC compared to SMC is that it allows us to iteratively improve our estimates in an MCMC-like fashion until a suitable stopping criterion is met. This is useful as in numerous applications the number of particles required to ensure the estimates are reasonably precise is unknown. It is also useful in real-time applications when one is unsure of exactly how much time will be available between successive arrivals of data points.

As discussed in Section 2.4, numerous variants of SIMCMC can be easily developed which have no SMC equivalent. The fact that such schemes do not need to define a target distribution on an extended state-space admitting as marginals offers a lot of flexibility. For example, if we have access to multiple processors, it is possible to sample from each independently using standard MCMC and perform several interactions simultaneously. Adaptive versions of the algorithms can also be proposed by monitoring the acceptance ratio of the MH steps. If the acceptance probability of the MH move between say and is low, we could, for example, increase the number of proposals at this time index.

From a theoretical point of view, there are a number of interesting questions to explore. Under additional stability assumptions on the Feynman–Kac semigroup induced by and  Delm04 (), Chapter 4, we have recently established in DelMoral2009 () convergence results ensuring that, for functions of the form , the bound in Theorem 3.1 is independent of . A central limit theorem has also been established in bercu2008 ().

Appendix: Proofs of convergence

Our proofs rely on the Poisson equation glynn1996 () and are inspired by ideas developed in andrieumoulines2006 (), andrieuajay2007 (), delmoralmiclo2003 (). However, contrary to standard adaptive MCMC schemes andrieumoulines2006 (), robertsrosenthal2007 (), the Markov kernels we consider do not necessarily admit the target distribution as invariant distribution; see delmoralmiclo2003 () for similar scenarios. However, in our context, it is still possible to establish stronger results than in delmoralmiclo2003 () as we can characterize exactly these invariant distributions; see Proposition 1.

.1 Notation

We denote by the set of probability measures on . We introduce the independent Metropolis–Hastings (MH) kernel defined by

For , we associate with any the Markov kernel defined by

where . In (.1) and (.1), we have for

We use to denote the total variation norm and for any Markov kernel

.2 Preliminary results

We establish here the expression of the invariant distributions of the kernels and and establish that these kernels are geometrically ergodic. We also provide some perturbation bounds for and its invariant distribution.

Lemma 1

Given Assumption 3, is uniformly geometrically ergodic of invariant distribution .

By construction, is an MH algorithm of invariant distribution . Uniform ergodicity follows from Assumption 3; see, for example, Theorem 2.1. in mengersentweedie1996 ().

Proposition 1

Given Assumption 3, for any and any , is uniformly geometrically ergodic of invariant distribution

(19)

where and are defined, respectively, in (8) and (10).

{pf}

To establish the result, it is sufficient to rewrite

This shows that is nothing but a standard MH algorithm of proposal distribution and target distribution given by (19). This distribution is always proper because Assumption 3 implies that over . Uniform ergodicity follows from Theorem 2.1. in mengersentweedie1996 (). {Corollary*} It follows that for any there exists such that for any and

(20)
Proposition 2

Given Assumption 3, for any , we have for any , , and

(21)

and

(22)
{pf}

For any , we have the following decomposition:

From Assumption 3, we know that for any

and from (.1) for any and

thus