Efficient Bayesian Inference for Switching State-Space Models using Discrete Particle Markov Chain Monte Carlo Methods

Nick Whiteley, Christophe Andrieu
Department of Mathematics,
University of Bristol,
University Walk,
Bristol BS8 1TW, UK.
Email: {Nick.Whiteley,C.Andrieu}@bris.ac.uk
   Arnaud Doucet
Department of Statistics,
University of British Columbia,
Vancouver V6T 1Z2, BC, Canada.
Email: Arnaud@stat.ubc.ca
Abstract

Switching state-space models (SSSM) are a very popular class of time series models that have found many applications in statistics, econometrics and advanced signal processing. Bayesian inference for these models typically relies on Markov chain Monte Carlo (MCMC) techniques. However, even sophisticated MCMC methods dedicated to SSSM can prove quite inefficient as they update potentially strongly correlated discrete-valued latent variables one-at-a-time (Carter and Kohn, 1996; Gerlach et al., 2000; Giordani and Kohn, 2008). Particle Markov chain Monte Carlo (PMCMC) methods are a recently developed class of MCMC algorithms which use particle filters to build efficient proposal distributions in high dimensions (Andrieu et al., 2010). The existing PMCMC methods of Andrieu et al. (2010) are applicable to SSSM, but are restricted to employing standard particle filtering techniques. Yet, in the context of discrete-valued latent variables, specialised particle techniques have been developed which can outperform standard methods by up to an order of magnitude (Fearnhead, 1998; Fearnhead and Clifford, 2003; Fearnhead, 2004). In this paper we develop a novel class of PMCMC methods relying on these very efficient particle algorithms. We establish the theoretical validity of this new generic methodology, referred to as discrete PMCMC, and demonstrate it on a variety of examples including a multiple change-points model for well-log data and a model for U.S./U.K. exchange rate data. Discrete PMCMC algorithms are shown experimentally to outperform state-of-the-art MCMC techniques for a fixed computational complexity. Additionally they can be easily parallelized (Lee et al., 2010), which allows further substantial gains.

Keywords: Bayesian inference, Markov chain Monte Carlo, optimal resampling, particle filters, sequential Monte Carlo, switching state-space models.

1 Introduction

Linear Gaussian Switching State-Space Models (SSSM) are a class of time series models in which the parameters of a linear Gaussian model switch according to a discrete latent process. They are ubiquitous in statistics (Cappé et al., 2005; Frühwirth-Schnatter, 2006), econometrics (Kim and Nelson, 1999; Giordani et al., 2007) and advanced signal processing (Barembruch et al., 2009; Costa et al., 2005) as they allow us to describe regime-switching time series in a compact and interpretable way. SSSM have been successfully used to describe, among others, multiple change-point models (Fearnhead and Clifford, 2003; Giordani and Kohn, 2008), nonparametric regression models with outliers (Carter and Kohn, 1996) and Markov switching autoregressions (Billio and Monfort, 1998; Frühwirth-Schnatter, 2006; Kim and Nelson, 1999).

Performing Bayesian inference for SSSM requires the use of Markov chain Monte Carlo (MCMC) techniques. The design of efficient sampling techniques for this class of models has been a subject of active research for over fifteen years, dating back at least as far as Carter and Kohn (1994); Shephard (1994). A recent overview of MCMC in this context can be found in Cappé et al. (2005); Frühwirth-Schnatter (2006). The main practical difficulty lies in simulating from the conditional distribution of the trajectory of the discrete-valued latent process. The cost of computing this distribution grows exponentially in the length of the observation record and therefore obtaining an exact sample from it is impractical for all but tiny data sets. A standard strategy is instead to update the components of the discrete latent process one-at-a-time (Carter and Kohn, 1996; Gerlach et al., 2000; Giordani and Kohn, 2008). However, it is well-known that such an approach can significantly slow down the convergence of MCMC algorithms. An alternative is to sample approximately from the joint distribution of the latent discrete trajectory using particle filters: non-iterative methods based on a combination of importance sampling and resampling techniques, see Doucet et al. (2001); Liu (2001) for a review of the literature. Empirical evidence suggests that particle filters are able to provide samples whose distribution is close to the target distribution of interest and this evidence is backed up by the rigorous quantitative bounds established in Del Moral (2004, chapter 8). This motivates using particle filters as proposal distributions within MCMC.

This idea is very natural, but its realization is far from trivial as the distribution of a sample generated by a particle filter does not admit a closed-form expression, hence preventing us from directly using the standard Metropolis-Hastings (MH) algorithm. In a recent paper, Andrieu et al. (2010) have shown that it is possible to bypass this problem. The authors have proposed a whole class of MCMC algorithms named Particle MCMC (PMCMC) relying on proposals built using particle filters. These algorithms have been demonstrated in the context of non-linear non-Gaussian state-space models and are directly applicable to SSSM; see also Flury and Shephard (2010) for applications in financial econometrics. However, the standard particle methods employed in Andrieu et al. (2010) do not fully exploit the discrete nature of the latent process in SSSM. This was recognized early on by Paul Fearnhead, who proposed an alternative generic algorithm which we refer to as the Discrete Particle Filter (DPF) (Fearnhead, 1998). The DPF bypasses the importance sampling step of standard particle techniques and can be interpreted as using a clever random pruning mechanism to select support points from the exponentially growing sequence of discrete latent state spaces. The DPF methodology has been demonstrated successfully in a variety of applications (Cappé et al., 2005; Fearnhead, 1998; Fearnhead and Clifford, 2003; Fearnhead, 2004). It has been shown to significantly outperform alternative sophisticated approaches, such as the Rao-Blackwellized particle filters developed in Chen and Liu (2000); Doucet et al. (2000, 2001), by up to an order of magnitude for a fixed computational complexity.

The main contribution of this article is to propose a novel class of PMCMC algorithms referred to as discrete PMCMC methods relying on the DPF for this important class of statistical models. The practical efficiency of the proposed methods relies on an original backward sampling procedure. We show that on a variety of applications this new generic methodology outperforms state-of-the-art MCMC algorithms for a fixed computational complexity. Moreover, as in the case of standard particle filters (Lee et al., 2010), the DPF can be parallelized easily. This suggests that even greater computational gains can be achieved.

The rest of the paper is organised as follows. In Section 2 we present the general class of SSSM considered in this paper and give an illustrative example. In Section 3, we discuss the intractability of exact inference in SSSM and present the DPF algorithm (Fearnhead, 1998; Fearnhead and Clifford, 2003; Fearnhead, 2004). Our presentation is slightly non-standard and explicitly introduces the random support sets generated by the algorithm. This allows us to describe the DPF precisely and compactly in a probabilistic way which proves useful to establish the validity of the proposed algorithms. We also review standard MCMC techniques used in this context. In Section 4 we introduce discrete PMCMC algorithms relying on the DPF to perform inference in SSSM and present some theoretical results. In Section 5, we review generic practical issues and demonstrate the efficiency of the proposed methods in the context of three examples. Finally in Section 6 we discuss several extensions of this work.

2 Switching state-space models

2.1 Model

From here on, we use the standard convention whereby capital letters are used for random variables while lower case letters are used for their values. Hereafter, for any generic process $\{\alpha_n\}_{n \geq 1}$ we will denote $\alpha_{i:j} := (\alpha_i, \alpha_{i+1}, \ldots, \alpha_j)$. The identity matrix of size $k \times k$ is denoted $I_k$ and the matrix of zeros of size $k \times l$ by $0_{k \times l}$.

Consider the following SSSM, also known in the literature as a conditionally linear Gaussian state-space model or a jump linear system. The latent state process $\{X_n\}_{n \geq 1}$ is such that each $X_n$ takes values in a finite set $\mathcal{X}$. It is characterized by its initial distribution $\nu_\theta(x_1)$ and transition probabilities, for $n \geq 2$,

$f_\theta(x_n \mid x_{n-1}) := \mathbb{P}(X_n = x_n \mid X_{n-1} = x_{n-1}).$ (1)

Conditional upon $\{X_n\}_{n \geq 1}$, we have a linear Gaussian state-space model defined through $Z_1 \mid (X_1 = x_1) \sim \mathcal{N}(m_\theta(x_1), \Sigma_\theta(x_1))$ and, for $n \geq 2$,

$Z_n = A_\theta(X_n) Z_{n-1} + B_\theta(X_n) V_n + F_\theta(X_n) u_n,$ (2)
$Y_n = C_\theta(X_n) Z_n + D_\theta(X_n) W_n + G_\theta(X_n) u_n,$ (3)

where $\mathcal{N}(m, \Sigma)$ is the normal distribution of mean $m$ and covariance $\Sigma$, $V_n \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, I)$, $W_n \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, I)$, $A_\theta(\cdot)$, $B_\theta(\cdot)$, $C_\theta(\cdot)$, $D_\theta(\cdot)$, $F_\theta(\cdot)$, $G_\theta(\cdot)$ are matrices of appropriate dimension and $u_n$ is an exogenous input. Here $\theta$ is some static parameter which may be multidimensional. For purposes of precise specification of resampling algorithms in the sequel, and without loss of generality, we label the elements of $\mathcal{X}$ with numbers, for example $\mathcal{X} = \{1, 2, \ldots, s\}$ for some $s \in \mathbb{N}$. We may then endow each Cartesian product space $\mathcal{X}^n$ with the corresponding lexicographical order relation. Henceforth, whenever we refer to ordering of a set of points in $\mathcal{X}^n$ it is with respect to the latter relation.
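To fix ideas, the following minimal Python sketch simulates from a two-regime instance of the model (1)-(3). All parameter values and names (P, nu, A, B, C, D) are illustrative assumptions for this sketch, not taken from the paper, and the exogenous input is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative two-regime SSSM with scalar state and observation.
s = 2
P = np.array([[0.95, 0.05],
              [0.10, 0.90]])        # transition matrix of {X_n}
nu = np.array([0.5, 0.5])           # initial distribution of X_1
A = [np.array([[0.9]]), np.array([[0.5]])]
B = [np.array([[0.3]]), np.array([[1.0]])]
C = [np.array([[1.0]]), np.array([[1.0]])]
D = [np.array([[0.2]]), np.array([[0.2]])]

def simulate(T):
    """Draw (x_{1:T}, z_{1:T}, y_{1:T}) from Eqs. (1)-(3)."""
    x = np.empty(T, dtype=int)
    z = np.empty((T, 1))
    y = np.empty((T, 1))
    x[0] = rng.choice(s, p=nu)
    z[0] = B[x[0]] @ rng.standard_normal(1)      # Z_1 ~ N(0, B B')
    y[0] = C[x[0]] @ z[0] + D[x[0]] @ rng.standard_normal(1)
    for n in range(1, T):
        x[n] = rng.choice(s, p=P[x[n - 1]])
        z[n] = A[x[n]] @ z[n - 1] + B[x[n]] @ rng.standard_normal(1)
        y[n] = C[x[n]] @ z[n] + D[x[n]] @ rng.standard_normal(1)
    return x, z, y

x, z, y = simulate(200)
```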

We give here a simple example of a SSSM. Two more sophisticated examples are discussed in Section 5.

2.1.1 Example: Auto-regression with shifting level

Let $\mathcal{X} = \{1, \ldots, s\}$ and, for a Markov chain $\{X_n\}_{n \geq 1}$ on $\mathcal{X}$ with transition matrix $P$, consider the process $\{Y_n\}_{n \geq 1}$ defined by

$Y_n = \mu_{X_n} + \zeta_n + \sigma_W W_n, \qquad \zeta_n = a\, \zeta_{n-1} + \sigma_V V_n,$

where for each $j \in \mathcal{X}$, $\mu_j$ and $a$ are real-valued and $\{V_n\}$ and $\{W_n\}$ are i.i.d. $\mathcal{N}(0, 1)$. The initial distribution on $\mathcal{X}$ is $\nu$ and is assumed known. This is a natural generalization of a first order autoregressive model to the case where the level is time-varying with shifts driven by the latent process $\{X_n\}$. This model can be expressed in state-space form by setting

$Z_n = \zeta_n, \quad A_\theta(x) = a, \quad B_\theta(x) = \sigma_V, \quad C_\theta(x) = 1, \quad D_\theta(x) = \sigma_W, \quad G_\theta(x) = \mu_x, \quad u_n = 1.$

The unknown parameters of this model are $\theta = (\mu_1, \ldots, \mu_s, a, \sigma_V, \sigma_W, P)$, where $P$ is the transition matrix of $\{X_n\}$. In this model, and more generally in SSSMs, inferences about the latent processes $\{X_n\}$ and $\{Z_n\}$ from a particular data set are likely to be highly sensitive to the values of these parameters if they are assumed known.
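As a concrete illustration, the following sketch simulates the shifting-level autoregression under the form given above; all parameter values here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical parameter values for the shifting-level AR(1) example.
mu = np.array([-1.0, 2.0])          # regime levels mu_j
a, sig_v, sig_w = 0.8, 0.5, 0.1     # AR coefficient and noise scales
P = np.array([[0.98, 0.02],
              [0.05, 0.95]])

def simulate(T):
    x = np.empty(T, dtype=int)
    zeta = np.empty(T)               # AR(1) deviation from the level
    y = np.empty(T)
    x[0] = rng.integers(2)
    zeta[0] = sig_v * rng.standard_normal()
    y[0] = mu[x[0]] + zeta[0] + sig_w * rng.standard_normal()
    for n in range(1, T):
        x[n] = rng.choice(2, p=P[x[n - 1]])
        zeta[n] = a * zeta[n - 1] + sig_v * rng.standard_normal()
        # the observed level shifts whenever the latent regime switches
        y[n] = mu[x[n]] + zeta[n] + sig_w * rng.standard_normal()
    return x, y

x, y = simulate(500)
```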

2.2 Inference aims

Our aim is to perform Bayesian inference in SSSMs, conditional upon some observations $y_{1:T}$ for some $T \in \mathbb{N}$, treating both the latent trajectories $X_{1:T}$, $Z_{1:T}$ and the parameter $\theta$ as unknowns. Where applicable, the values of the input sequence $\{u_n\}$ are assumed known, but for clarity we suppress them from our notation. We ascribe a prior density $p(\theta)$ to $\theta$, so Bayesian inference relies on the joint density

$\pi(\theta, x_{1:T}, z_{1:T}) := p(\theta, x_{1:T}, z_{1:T} \mid y_{1:T}) \propto p(\theta)\, p_\theta(x_{1:T}, z_{1:T}, y_{1:T}),$ (4)

where the definition of $p_\theta(x_{1:T}, z_{1:T}, y_{1:T})$ follows from Eq. (1)-(2)-(3). This posterior can be factorized as follows

$\pi(\theta, x_{1:T}, z_{1:T}) = p_\theta(z_{1:T} \mid y_{1:T}, x_{1:T})\, \pi(\theta, x_{1:T}),$ (5)

where

$\pi(\theta, x_{1:T}) := p(\theta, x_{1:T} \mid y_{1:T}) = \frac{p(\theta)\, p_\theta(x_{1:T})\, p_\theta(y_{1:T} \mid x_{1:T})}{p(y_{1:T})}.$ (6)

Conditional upon $x_{1:T}$, Eq. (2)-(3) define a linear Gaussian state-space model, so it is possible to compute efficiently the statistics of the conditional multivariate Gaussian density $p_\theta(z_{1:T} \mid y_{1:T}, x_{1:T})$ in Eq. (5) and the conditional marginal likelihood $p_\theta(y_{1:T} \mid x_{1:T})$ in Eq. (6) using Kalman techniques. For example, $p_\theta(y_{1:T} \mid x_{1:T})$ can be computed using the product of predictive densities

$p_\theta(y_{1:T} \mid x_{1:T}) = \prod_{n=1}^{T} p_\theta(y_n \mid y_{1:n-1}, x_{1:n}),$ (7)

where $p_\theta(y_1 \mid y_{1:0}, x_1) := p_\theta(y_1 \mid x_1)$. The statistics of these Gaussian predictive densities can be computed using the Kalman filter, which is recalled in Appendix A for the sake of convenience. For simplicity of presentation, throughout the following we assume that for each $\theta$ and $x_{n-1} \in \mathcal{X}$ the support of $f_\theta(\cdot \mid x_{n-1})$ is $\mathcal{X}$. This assumption is satisfied in the vast majority of cases considered in practice and in all the examples we consider. The techniques discussed below can be transferred to cases where this assumption is not met with only cosmetic changes.
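As a minimal sketch of the computation behind Eq. (7), the following Python function evaluates $\log p_\theta(y_{1:T} \mid x_{1:T})$ by running the Kalman filter along a fixed discrete path. It assumes a fixed prior `(m0, S0)` on $Z_1$ and omits exogenous inputs; both are simplifying assumptions for illustration.

```python
import numpy as np
from scipy.stats import multivariate_normal

def log_lik_given_path(y, x, A, B, C, D, m0, S0):
    """Compute log p_theta(y_{1:T} | x_{1:T}) via the product of Kalman
    one-step predictive densities in Eq. (7). A[j], B[j], C[j], D[j] are
    the regime-j system matrices."""
    m, S = m0, S0
    ll = 0.0
    for n in range(len(y)):
        j = x[n]
        if n > 0:                                 # prediction step
            m = A[j] @ m
            S = A[j] @ S @ A[j].T + B[j] @ B[j].T
        y_hat = C[j] @ m                          # predictive mean of Y_n
        Sy = C[j] @ S @ C[j].T + D[j] @ D[j].T    # predictive covariance
        ll += multivariate_normal(mean=y_hat, cov=Sy).logpdf(y[n])
        K = S @ C[j].T @ np.linalg.inv(Sy)        # update step
        m = m + K @ (y[n] - y_hat)
        S = S - K @ C[j] @ S
    return ll
```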

3 Inference techniques for switching state-space models

3.1 Exact Inference and Intractability

The main difficulty faced in the exact computation of $\pi(\theta, x_{1:T})$ is the need to perform the summation in the denominator of Eq. (6) over up to $s^T$ values of $x_{1:T}$, where $s$ is the cardinality of $\mathcal{X}$. For even modest values of $T$, this sum is too expensive to compute exactly. In the applications we consider, $T$ is of the order of thousands, so exact computation is practically impossible.

Even if $\theta$ is treated as fixed, inference is intractable. In this case, we wish to compute $p_\theta(x_{1:T} \mid y_{1:T})$, whose normalization involves the same problematic summation. One approach is to obtain $p_\theta(x_{1:T} \mid y_{1:T})$ by sequential computation of $p_\theta(x_{1:n} \mid y_{1:n})$ via the recursive relationship

$p_\theta(x_{1:n} \mid y_{1:n}) \propto p_\theta(x_{1:n-1} \mid y_{1:n-1})\, f_\theta(x_n \mid x_{n-1})\, p_\theta(y_n \mid y_{1:n-1}, x_{1:n})$

(with the predictive $p_\theta(y_n \mid y_{1:n-1}, x_{1:n})$ as defined in the previous section), but the computation involved increases exponentially in $n$. For purposes of exposition in the sequel, we remark that, as for each $n$ the support of $p_\theta(x_{1:n} \mid y_{1:n})$ is $\mathcal{X}^n$, the sequence of such supports satisfies the trivial recursion

$\mathcal{X}^n = \mathcal{X}^{n-1} \times \mathcal{X}$

and is evidently growing in cardinality with $n$. Hence, in both the cases of computing $\pi(\theta, x_{1:T})$ and $p_\theta(x_{1:T} \mid y_{1:T})$ it is necessary to rely on approximations, and we focus here on Monte Carlo methods.
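The exponential blow-up is easy to make concrete: the sketch below computes the exact posterior over discrete paths by brute-force enumeration of all $s^n$ paths, which is only feasible for tiny $n$. The helpers `log_prior_path` and `log_lik_path` are assumed to return $\log p_\theta(x_{1:n})$ and $\log p_\theta(y_{1:n} \mid x_{1:n})$ (e.g. via the Kalman routine sketched earlier).

```python
import math
from itertools import product

def exact_posterior(y, n, s, log_prior_path, log_lik_path):
    """Exact p_theta(x_{1:n} | y_{1:n}) by enumerating all s**n paths;
    the cost of the `paths` list is itself exponential in n."""
    paths = list(product(range(s), repeat=n))          # |X|^n paths
    logw = [log_prior_path(p) + log_lik_path(y[:n], p) for p in paths]
    mx = max(logw)
    w = [math.exp(lw - mx) for lw in logw]             # stabilised
    Z = sum(w)
    return {p: wi / Z for p, wi in zip(paths, w)}
```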

3.2 Monte Carlo Methods

We next review two classes of Monte Carlo techniques to perform inference in SSSM. The first method we discuss is the DPF algorithm of Fearnhead (1998). For a fixed parameter value $\theta$, this algorithm allows us to compute an approximation of the posterior distribution $p_\theta(x_{1:T} \mid y_{1:T})$ and an approximation of the marginal likelihood $p_\theta(y_{1:T})$. We present this algorithm in a slightly non-standard way which allows us to describe it probabilistically in a concise and precise manner. This will prove useful for the development of the discrete PMCMC algorithms in Section 4. We also review MCMC methods which have been developed to approximate $\pi(\theta, x_{1:T})$ and discuss their advantages and limitations.

3.3 The discrete particle filter

The DPF algorithm proposed in Fearnhead (1998); Fearnhead and Clifford (2003) is a non-iterative procedure approximating the posterior distribution $p_\theta(x_{1:T} \mid y_{1:T})$ and the marginal likelihood $p_\theta(y_{1:T})$. Practically, the DPF approximation of the posterior distributions $\{p_\theta(x_{1:n} \mid y_{1:n})\}_{n \geq 1}$ is made sequentially in time using a collection of weighted trajectories or “particles” $\{(x^{(i)}_{1:n}, W^{(i)}_n)\}_i$.

The parameter $N$ controls the precision of the algorithm. The larger it is, the more accurate (on average) the approximation of the target distribution. It has been demonstrated experimentally in Cappé et al. (2005); Fearnhead and Clifford (2003); Fearnhead (2004) that the DPF algorithm significantly outperforms, sometimes by one order of magnitude, the Rao-Blackwellized particle filters proposed in Chen and Liu (2000); Doucet et al. (2000, 2001) and that it is able to provide very good approximations of $p_\theta(x_{1:T} \mid y_{1:T})$ in realistic scenarios even with a moderate number of particles. The action of the DPF can be summarised as follows.

Assume that we have, at time step $n-1$, obtained $\widehat{p}_\theta(x_{1:n-1} \mid y_{1:n-1})$ consisting of distinct particles with weights that sum to 1. A resampling step is then applied: exactly $N$ of the trajectories survive and their weights are adjusted accordingly. The resampling mechanism is chosen in such a way as to be optimal in some sense. Throughout the remainder of the paper we treat the case of minimising the sum of variances of the importance weights, as in Fearnhead and Clifford (2003), but exactly the same method applies to the other schemes discussed in Barembruch et al. (2009). Features of this resampling scheme which distinguish it from standard methods, such as multinomial resampling, are that it results in no duplicated particles and gives post-resampling weights which are non-uniform.

Whereas standard particle methods rely on a stochastic proposal mechanism to explore the space, the DPF performs all its exploration deterministically. This is possible because of the finite cardinality of the latent discrete space. Consider one of the $N$ particles which survived the resampling operation, each of which is a point in $\mathcal{X}^{n-1}$. Call the point in question $x_{1:n-1}$ and denote by $m_{n-1}(x_{1:n-1})$ and $\Sigma_{n-1}(x_{1:n-1})$ respectively the mean and covariance of the Gaussian density $p_\theta(z_{n-1} \mid y_{1:n-1}, x_{1:n-1})$. From this point $s$ new particles are formed, one for each element of $\mathcal{X}$, and for each one of them, $m_n(x_{1:n})$, $\Sigma_n(x_{1:n})$ and the associated unnormalized weight are calculated using the Kalman filtering recursions (included for reference in Appendix A). This procedure is repeated for the remaining $N-1$ particles, resulting in $sN$ weighted trajectories. The weights are then normalized to yield a probability distribution constituting $\widehat{p}_\theta(x_{1:n} \mid y_{1:n})$.

This outline of the DPF operations highlights the function of the resampling step: in the case of the DPF it acts to prune the exponentially growing (in $n$) tree of possible paths $x_{1:n}$. It is convenient to specify the DPF in a slightly non-standard way which highlights that the only randomness in this algorithm arises from the resampling step. To this end, we introduce random support sets $\{S_n\}_{n \geq 1}$, with each $S_n$ taking a value which is a subset of $\mathcal{X}^n$. It is stressed that, in the following interpretation, the $x_{1:n}$’s are not random variables, and are just points in the state space (and Cartesian products thereof) used for indexing. With this notation, we write the DPF approximation for $p_\theta(x_{1:n} \mid y_{1:n})$ as

$\widehat{p}_\theta(x_{1:n} \mid y_{1:n}) = \sum_{x'_{1:n} \in S_n} W_n(x'_{1:n})\, \delta_{x'_{1:n}}(x_{1:n}).$ (8)

Under the probability law of the DPF algorithm, which we discuss in more detail later, for each $n$, $|S_n| = \min(s^n, sN)$ with probability 1. We thus see in Eq. (8) the effect of the parameter $N$: it specifies the number of support points of the approximation $\widehat{p}_\theta(x_{1:n} \mid y_{1:n})$. We next provide pseudo code for the DPF algorithm and then go on to discuss several issues related to its practical use and its theoretical representation.

 

DPF algorithm

At time $n = 1$

Set $S_1 = \mathcal{X}$ and for each $x_1 \in S_1$, compute $m_1(x_1)$, $\Sigma_1(x_1)$ and $p_\theta(y_1 \mid x_1)$ using the Kalman filter.

Compute and normalise the weights. For each $x_1 \in S_1$,

$W_1(x_1) = \frac{\nu_\theta(x_1)\, p_\theta(y_1 \mid x_1)}{\sum_{x'_1 \in S_1} \nu_\theta(x'_1)\, p_\theta(y_1 \mid x'_1)}.$ (9)

At times $n \geq 2$

If $|S_{n-1}| \leq N$ set $R_{n-1} = S_{n-1}$, otherwise set $c_n$ to the unique solution of

$\sum_{x_{1:n-1} \in S_{n-1}} \min\{1, c\, W_{n-1}(x_{1:n-1})\} = N.$

Maintain the trajectories in $S_{n-1}$ which have weights strictly superior to $1/c_n$, then apply the stratified resampling mechanism to the other trajectories to yield $N$ survivors in total, each resampled survivor being assigned weight $1/c_n$. Set $R_{n-1}$ to the set of surviving and maintained trajectories.

Set $S_n = R_{n-1} \times \mathcal{X}$.

For each $x_{1:n} \in S_n$, compute $m_n(x_{1:n})$, $\Sigma_n(x_{1:n})$ and $p_\theta(y_n \mid y_{1:n-1}, x_{1:n})$ using the Kalman filter.

Compute and normalise the weights. For each $x_{1:n} \in S_n$,

$w_n(x_{1:n}) = W_{n-1}(x_{1:n-1})\, f_\theta(x_n \mid x_{n-1})\, p_\theta(y_n \mid y_{1:n-1}, x_{1:n}),$ (10)
$W_n(x_{1:n}) = \frac{w_n(x_{1:n})}{\sum_{x'_{1:n} \in S_n} w_n(x'_{1:n})}.$ (11)

 
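To make the deterministic extension step concrete, here is a minimal Python sketch of one DPF iteration after resampling, under the notation above. The helpers `kalman_step` and `log_trans` are assumptions standing in for the Kalman recursions of Appendix A and the transition probabilities $f_\theta$; they are not part of the paper's code.

```python
import numpy as np

def dpf_extend(support, logw, stats, y_n, s, log_trans, kalman_step):
    """One deterministic DPF extension step, Eqs. (10)-(11): every
    trajectory surviving the resampling step spawns all s children, and
    each child is weighted by parent weight x transition probability x
    Kalman predictive likelihood. kalman_step((m, S), j, y_n) returns
    the updated (m, S) and log p(y_n | y_{1:n-1}, x_{1:n}); log_trans(i, j)
    returns log f_theta(j | i)."""
    children, child_stats, child_logw = [], [], []
    for path, lw, st in zip(support, logw, stats):
        for j in range(s):                       # deterministic exploration
            st_new, log_pred = kalman_step(st, j, y_n)
            children.append(path + (j,))
            child_stats.append(st_new)
            child_logw.append(lw + log_trans(path[-1], j) + log_pred)
    child_logw = np.asarray(child_logw)
    child_logw = child_logw - np.logaddexp.reduce(child_logw)  # Eq. (11)
    return children, child_logw, child_stats
```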

3.3.1 Exact computation at the early iterations

For small $n$ it is practically possible to compute $p_\theta(x_{1:n} \mid y_{1:n})$ exactly. It is only once $n$ is large enough that $|S_{n-1}| > N$ that we need to employ the resampling mechanism to prune the set of trajectories. This action is represented conceptually in the DPF algorithm above by the artifice of setting $R_{n-1} = S_{n-1}$ if $n$ is such that $|S_{n-1}| \leq N$. When this condition is satisfied, the resampling step is not called into action. Of course, in the practically unrealistic case that $N \geq s^{T-1}$, the DPF, unlike standard SMC algorithms, thus reduces to exact recursive computation of $p_\theta(x_{1:n} \mid y_{1:n})$.

3.3.2 Computing $c_n$ and stratified resampling

The threshold $c_n$ is a deterministic function of the weights $\{W_{n-1}(x_{1:n-1}) : x_{1:n-1} \in S_{n-1}\}$. A method for solving $\sum_{x_{1:n-1} \in S_{n-1}} \min\{1, c\, W_{n-1}(x_{1:n-1})\} = N$ is given in Fearnhead and Clifford (2003). The stratified resampling mechanism, which is employed once $c_n$ has been computed, proceeds as follows at time $n$; this mechanism was originally proposed in Carpenter et al. (1999); Kitagawa (1996), although not in the context of the DPF.

 

Stratified resampling

Normalise the weights of the $K$ particles to be resampled and label them according to the order of the corresponding $x_{1:n-1}$ to obtain $\widetilde{W}^{(1)}, \ldots, \widetilde{W}^{(K)}$;

Construct the corresponding cumulative distribution function: for $k = 1, \ldots, K$, $Q^{(k)} = \sum_{j=1}^{k} \widetilde{W}^{(j)}$, with $Q^{(0)} := 0$;

Sample $U_1$ uniformly on $[0, 1/M]$, where $M$ is the number of survivors required, and set $U_m = U_1 + (m-1)/M$ for $m = 2, \ldots, M$;

For $k = 1, \ldots, K$, if there exists $m \in \{1, \ldots, M\}$ such that $Q^{(k-1)} < U_m \leq Q^{(k)}$, then particle $k$ survives.

 
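A sketch implementation of the whole resampling step follows, assuming normalised weights and lexicographically ordered particles. `optimal_threshold` solves $\sum_i \min\{1, c\, w_i\} = N$ by a descending scan in the spirit of Fearnhead and Clifford (2003), and a single uniform drives the stratified selection, as in the pseudocode above.

```python
import numpy as np

def optimal_threshold(w, N):
    """Solve sum_i min(1, c * w_i) = N for c. w: normalised weights,
    with len(w) > N."""
    ws = np.sort(w)[::-1]
    tail = np.cumsum(ws[::-1])[::-1]          # tail[k] = sum of ws[k:]
    for k in range(len(ws)):                  # suppose top-k weights kept
        c = (N - k) / tail[k]
        if c * ws[k] <= 1.0:
            return c
    raise ValueError("need len(w) > N")

def optimal_resample(paths, w, N, rng):
    """Keep every particle with weight > 1/c unchanged; select the rest
    by stratified resampling driven by one uniform, each resampled
    survivor receiving weight 1/c. `paths` is assumed ordered."""
    w = np.asarray(w, dtype=float)
    w = w / w.sum()
    c = optimal_threshold(w, N)
    kept = np.flatnonzero(w * c > 1.0)
    rest = np.flatnonzero(w * c <= 1.0)
    M = N - len(kept)                         # survivors still required
    if M == 0:
        idx, new_w = kept, w[kept]
    else:
        rw = w[rest] / w[rest].sum()
        cdf = np.cumsum(rw)
        u = rng.uniform(0.0, 1.0 / M) + np.arange(M) / M
        surv = rest[np.searchsorted(cdf, u)]
        idx = np.concatenate([kept, surv])
        new_w = np.concatenate([w[kept], np.full(M, 1.0 / c)])
    return [paths[i] for i in idx], new_w / new_w.sum()
```

Note that each below-threshold particle has normalised weight at most $1/M$ within the resampled block, so at most one of the points $U_1, \ldots, U_M$ can fall in its interval; this is the "no duplicated particles" property mentioned above.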

3.3.3 Computational Requirements

Assuming that the cost of evaluating $p_\theta(y_n \mid y_{1:n-1}, x_{1:n})$ is $O(1)$ for all $n$, the computational complexity of the DPF is $O(sN)$ at each time step, due to the propagation of $O(sN)$ Kalman filtering operations and the generation of a single uniform random variable. The parallelisation techniques described in Lee et al. (2010) could readily be exploited when performing the Kalman computations.

3.3.4 Estimating $p_\theta(y_{1:T})$

Of particular interest in the sequel is the fact that the DPF provides us with an estimate of the marginal likelihood $p_\theta(y_{1:T})$ given by

$\widehat{p}_\theta(y_{1:T}) = \widehat{p}_\theta(y_1) \prod_{n=2}^{T} \widehat{p}_\theta(y_n \mid y_{1:n-1}),$ (12)

where

$\widehat{p}_\theta(y_1) = \sum_{x_1 \in S_1} \nu_\theta(x_1)\, p_\theta(y_1 \mid x_1), \qquad \widehat{p}_\theta(y_n \mid y_{1:n-1}) = \sum_{x_{1:n} \in S_n} w_n(x_{1:n}).$ (13)

Inevitably, for fixed $N$, the quality of the particle approximation to the distribution $p_\theta(x_{1:n} \mid y_{1:n})$ decreases as $n$ increases. For fixed $T$, once $N$ is large enough that no resampling occurs, the DPF computes $p_\theta(y_{1:T})$ exactly.
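In an implementation, the estimate in Eq. (12) is naturally accumulated in log space as the filter runs, as in the following sketch; `unnorm_logw_at(n)` is a hypothetical accessor returning the unnormalised log weights $w_n$ of Eq. (10) at time $n$.

```python
import numpy as np

def dpf_log_marginal_likelihood(unnorm_logw_at, T):
    """Accumulate log p_hat_theta(y_{1:T}) as in Eq. (12): each factor
    p_hat(y_n | y_{1:n-1}) is the sum of the unnormalised weights of
    Eq. (10), summed here in log space for numerical stability."""
    return sum(np.logaddexp.reduce(unnorm_logw_at(n))
               for n in range(1, T + 1))
```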

Before introducing the details of the new PMCMC algorithms, we review some existing MCMC algorithms for performing inference in SSSM.

3.4 Standard Markov chain Monte Carlo methods

Designing efficient MCMC algorithms to sample from $\pi(\theta, x_{1:T}, z_{1:T})$ is a difficult task. Most existing MCMC methods approach this problem using some form of Gibbs sampler and can be summarized as cycling in some manner through the sequence of distributions $p(\theta \mid y_{1:T}, x_{1:T}, z_{1:T})$, $p_\theta(z_{1:T} \mid y_{1:T}, x_{1:T})$, and $p_\theta(x_{1:T} \mid y_{1:T}, z_{1:T})$ or $p_\theta(x_{1:T} \mid y_{1:T})$.

Sampling efficiently from $p(\theta \mid y_{1:T}, x_{1:T}, z_{1:T})$ is often feasible due to the small or moderate size of $\theta$ and the fact that, for many models and parameters of interest, conjugate priors are available. When conjugate priors are not used, Metropolis-within-Gibbs steps may be applied.

A variety of efficient algorithms have been developed to sample from $p_\theta(z_{1:T} \mid y_{1:T}, x_{1:T})$. These methods rely on the conditionally linear Gaussian structure of the model and involve some form of forward filtering backward sampling recursion (Carter and Kohn, 1994; Frühwirth-Schnatter, 1994). Variants of these schemes which approach the task by explicitly sampling the state disturbances may be more efficient and/or numerically stable for some classes of models (De Jong and Shephard, 1995; Durbin and Koopman, 2002). In all the numerical examples we consider, sampling from $p_\theta(z_{1:T} \mid y_{1:T}, x_{1:T})$ was performed using the simulation smoother of Durbin and Koopman (2002).

Sampling from $p_\theta(x_{1:T} \mid y_{1:T}, z_{1:T})$ can also be performed efficiently using a forward filtering backward sampling recursion (Carter and Kohn, 1994; Chib, 1996) when $\{X_n\}$ is a Markov chain. The resulting Gibbs sampler is elegant but it can mix very slowly as $X_{1:T}$ and $Z_{1:T}$ are usually strongly correlated. To bypass this problem, Carter and Kohn (1996); Gerlach et al. (2000) proposed to integrate out $z_{1:T}$ using the Kalman filter, as discussed in Subsection 2.2. However, as mentioned in the introduction, exact sampling from $p_\theta(x_{1:T} \mid y_{1:T})$ is typically infeasible as the cost of computing this distribution is exponential in $T$. Therefore, in the algorithms of Carter and Kohn (1996); Gerlach et al. (2000), the discrete variables are updated one-at-a-time according to their full conditional distributions $p_\theta(x_n \mid y_{1:T}, x_{1:n-1}, x_{n+1:T})$. It was shown in Carter and Kohn (1996); Gerlach et al. (2000) that this strategy can improve performance drastically compared to algorithms where $x_{1:T}$ is updated conditional upon $z_{1:T}$. From hereon we refer to the Gibbs sampler of Gerlach et al. (2000) as the “standard Gibbs” algorithm.

At this stage, we comment a little further on the method of Gerlach et al. (2000) as it is relevant to the new algorithms described in the later sections. The Gibbs sampler of Gerlach et al. (2000) achieves a sweep of samples from $p_\theta(x_1 \mid y_{1:T}, x_{2:T})$, $p_\theta(x_2 \mid y_{1:T}, x_1, x_{3:T})$, etc. by a “backward-forward” procedure exploiting the identities

$p_\theta(x_n \mid y_{1:T}, x_{1:n-1}, x_{n+1:T}) \propto p_\theta(y_{n:T} \mid y_{1:n-1}, x_{1:T})\, f_\theta(x_n \mid x_{n-1})\, f_\theta(x_{n+1} \mid x_n)$ (14)

and

$p_\theta(y_{n:T} \mid y_{1:n-1}, x_{1:T}) = p_\theta(y_n \mid y_{1:n-1}, x_{1:n})\, p_\theta(y_{n+1:T} \mid y_{1:n}, x_{1:T}).$ (15)

In Gerlach et al. (2000), it was shown that the coefficients in $p_\theta(y_{n+1:T} \mid y_{1:n}, x_{1:T})$ which are needed to evaluate (15) can be computed recursively for $n = T, T-1, \ldots, 1$ (the backward step). Then, for each $n = 1, \ldots, T$, $m_n(x_{1:n})$ and $\Sigma_n(x_{1:n})$ are obtained through standard Kalman filtering recursions, (15) is computed for each $x_n \in \mathcal{X}$ and a draw is made from (14) (the forward step). In the resulting algorithm, if the computational cost of evaluating (15) is $O(1)$, the cost of one sampling sweep through $p_\theta(x_1 \mid y_{1:T}, x_{2:T})$, $p_\theta(x_2 \mid y_{1:T}, x_1, x_{3:T})$, etc. grows linearly in $T$.
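The shape of such a one-at-a-time sweep is sketched below. The helper `log_cond(x, n, j)` is an assumption standing in for the evaluation, up to an additive constant, of $\log p_\theta(x_n = j \mid y_{1:T}, x_{-n})$; in Gerlach et al. (2000) it would be computed via the recursions of Eqs. (14)-(15). This is an illustration of the update structure, not their implementation.

```python
import numpy as np

def single_site_sweep(x, log_cond, s, rng):
    """One sweep of one-at-a-time updates of the discrete path in the
    spirit of Carter and Kohn (1996) and Gerlach et al. (2000): each
    x_n is redrawn from its full conditional, with z_{1:T} integrated
    out of the model."""
    x = x.copy()
    for n in range(len(x)):
        logp = np.array([log_cond(x, n, j) for j in range(s)])
        p = np.exp(logp - logp.max())          # stabilised softmax
        x[n] = rng.choice(s, p=p / p.sum())
    return x
```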

More recently, adaptive MCMC methods have been suggested to make one-at-a-time updates (Giordani and Kohn, 2008). However, these algorithms are still susceptible to slow mixing if the components of $X_{1:T}$ are strongly correlated. Moreover, even if we were able to sample efficiently from $p_\theta(x_{1:T} \mid y_{1:T})$ using one-at-a-time updates, such an algorithm might still converge slowly if $X_{1:T}$ and $\theta$ are strongly correlated; e.g. if $\{X_n\}$ is a Markov chain and $\theta$ includes the transition matrix of this chain. Hammer and Tjelmeland (2011) have suggested an approximate, deterministic algorithm for forward filtering-backward smoothing in switching state-space models, and the use of this method for making independent proposals as part of a Metropolis-Hastings scheme. By contrast, and as we shall see in the following section, the stochastic nature of the DPF algorithm allows the construction not only of exact Metropolis-Hastings-type algorithms, but also exact Particle Gibbs samplers. It is not clear how to achieve the latter using the deterministic forward-backward algorithm of Hammer and Tjelmeland (2011).

4 Discrete particle Markov chain Monte Carlo methods for switching state-space models

A natural idea arising from the previous section is to use the output of the DPF algorithm as part of a proposal distribution for an MCMC algorithm targeting $\pi(\theta, x_{1:T})$ or $p_\theta(x_{1:T} \mid y_{1:T})$. This could allow us, in principle, to design automatically an efficient high-dimensional proposal for MCMC. However, a direct application of this idea would require us to be able to both sample from and evaluate pointwise the unconditional distribution of a particle sampled from $\widehat{p}_\theta(x_{1:T} \mid y_{1:T})$. This distribution is given by

$q_\theta(x_{1:T}) = \mathbb{E}\left[\widehat{p}_\theta(x_{1:T} \mid y_{1:T})\right],$

where the expectation is with respect to the probability law of the DPF algorithm: the stochasticity which produces the random probability measure in Eq. (8). While sampling from $q_\theta$ is straightforward, as it only requires running the DPF algorithm to obtain $\widehat{p}_\theta(x_{1:T} \mid y_{1:T})$ and then sampling from this random measure, an analytical expression for this distribution is clearly not available.

The novel MCMC updates presented in this section, under the umbrella term discrete PMCMC, circumvent this problem by considering target distributions on an extended space, over all the random variables of the DPF algorithm. Details of their theoretical validity are given in Subsection 4.3 but are not required for implementation of the algorithms. The key feature of these discrete PMCMC algorithms is that they are “exact approximations” to standard MCMC updates targeting $\pi(\theta, x_{1:T})$. More precisely, on the one hand these algorithms can be thought of as approximations to possibly “idealized” standard MH updates, parametrized by the number of particles used to construct the DPF approximation. On the other hand, under mild assumptions, discrete PMCMC algorithms are guaranteed to generate asymptotically (in the number of MCMC iterations used) samples from $\pi(\theta, x_{1:T})$, for any fixed number of particles; in other words, for virtually any degree of approximation.

In Subsection 4.1, we describe the Particle MMH (Marginal Metropolis-Hastings) algorithm, which can be thought of as an exact approximation of an idealised “Marginal MH” (MMH) algorithm targeting directly the marginal distribution $\pi(\theta, x_{1:T})$ of $(\theta, X_{1:T})$. This algorithm admits a form similar to the PMMH discussed in Andrieu et al. (2010) but its validity relies on different arguments. In Subsection 4.2 we present a particle approximation of a Gibbs sampler targeting $\pi(\theta, x_{1:T})$, called the Particle Gibbs (PG) algorithm. It is a particle approximation of the “ideal” block Gibbs sampler which samples from $\pi(\theta, x_{1:T})$ by sampling iteratively from the full conditionals $p(\theta \mid y_{1:T}, x_{1:T})$ and $p_\theta(x_{1:T} \mid y_{1:T})$. This algorithm is significantly different from the PG sampler presented in Andrieu et al. (2010) and incorporates a novel backward sampling mechanism. Convergence results for these algorithms are established in Subsection 4.3.

4.1 Particle marginal Metropolis-Hastings sampler

Let us consider the following ideal “marginal” MH (MMH) algorithm to sample from $\pi(\theta, x_{1:T})$, where $\theta$ and $x_{1:T}$ are updated simultaneously using the proposal given by

$q\left((\theta', x'_{1:T}) \mid (\theta, x_{1:T})\right) = q(\theta' \mid \theta)\, p_{\theta'}(x'_{1:T} \mid y_{1:T}).$

In this scenario the proposed $x'_{1:T}$ is perfectly “adapted” to the proposed $\theta'$ and the resulting MH acceptance ratio is given by

$1 \wedge \frac{p_{\theta'}(y_{1:T})\, p(\theta')}{p_\theta(y_{1:T})\, p(\theta)}\, \frac{q(\theta \mid \theta')}{q(\theta' \mid \theta)}.$ (16)

This algorithm is equivalent to a MH update working directly on the marginal density $p(\theta \mid y_{1:T})$, justifying the MMH terminology. This algorithm is appealing but typically cannot be implemented, as the marginal likelihood terms $p_\theta(y_{1:T})$ and $p_{\theta'}(y_{1:T})$ cannot be computed exactly and it is impossible to sample exactly from $p_{\theta'}(x_{1:T} \mid y_{1:T})$. We propose the following particle approximation of the MMH algorithm where, whenever a sample from $p_\theta(x_{1:T} \mid y_{1:T})$ and the expression for the marginal likelihood $p_\theta(y_{1:T})$ are needed, their DPF approximation counterparts are used instead.

 

PMMH sampler for SSSM

Initialisation, $i = 0$

Set $\theta(0)$ arbitrarily.

Run the DPF targeting $p_{\theta(0)}(x_{1:T} \mid y_{1:T})$, sample $X_{1:T}(0) \sim \widehat{p}_{\theta(0)}(\cdot \mid y_{1:T})$ and denote

$\widehat{p}_{\theta(0)}(y_{1:T})$ the marginal likelihood estimate.

For iteration $i \geq 1$

Sample $\theta' \sim q(\cdot \mid \theta(i-1))$.

Run the DPF targeting $p_{\theta'}(x_{1:T} \mid y_{1:T})$, sample $X'_{1:T} \sim \widehat{p}_{\theta'}(\cdot \mid y_{1:T})$ and denote

$\widehat{p}_{\theta'}(y_{1:T})$ the marginal likelihood estimate.

With probability

$1 \wedge \frac{\widehat{p}_{\theta'}(y_{1:T})\, p(\theta')}{\widehat{p}_{\theta(i-1)}(y_{1:T})\, p(\theta(i-1))}\, \frac{q(\theta(i-1) \mid \theta')}{q(\theta' \mid \theta(i-1))}$ (17)

set $\theta(i) = \theta'$, $X_{1:T}(i) = X'_{1:T}$, $\widehat{p}_{\theta(i)}(y_{1:T}) = \widehat{p}_{\theta'}(y_{1:T})$;

otherwise set $\theta(i) = \theta(i-1)$, $X_{1:T}(i) = X_{1:T}(i-1)$, $\widehat{p}_{\theta(i)}(y_{1:T}) = \widehat{p}_{\theta(i-1)}(y_{1:T})$.

 
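The structure of this sampler is compact enough to sketch directly. In the following Python sketch, `run_dpf(theta)` is an assumed helper that runs the DPF at parameter `theta` and returns a sampled trajectory together with $\log \widehat{p}_\theta(y_{1:T})$; `q_sample`, `q_logpdf` and `log_prior` specify the proposal and the prior. This is an illustration of the update rule (17), not the authors' code.

```python
import numpy as np

def pmmh(theta0, n_iter, q_sample, q_logpdf, log_prior, run_dpf, rng):
    """Sketch of the PMMH sweep above, working in log space."""
    theta = theta0
    x, ll = run_dpf(theta)                         # initialisation
    chain = []
    for _ in range(n_iter):
        theta_p = q_sample(theta, rng)
        x_p, ll_p = run_dpf(theta_p)               # fresh DPF run
        log_ratio = (ll_p + log_prior(theta_p) + q_logpdf(theta, theta_p)
                     - ll - log_prior(theta) - q_logpdf(theta_p, theta))
        if np.log(rng.uniform()) < log_ratio:      # accept per Eq. (17)
            theta, x, ll = theta_p, x_p, ll_p
        chain.append((theta, x))
    return chain
```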

4.2 Particle Gibbs sampler

As discussed in Section 3.4, an attractive but impractical strategy to sample from $\pi(\theta, x_{1:T})$ consists of using the Gibbs sampler which iterates sampling steps from $p(\theta \mid y_{1:T}, x_{1:T})$ and $p_\theta(x_{1:T} \mid y_{1:T})$, or a modified Gibbs sampler where we insert a sampling step from $p_\theta(z_{1:T} \mid y_{1:T}, x_{1:T})$ after having sampled from $p_\theta(x_{1:T} \mid y_{1:T})$, to update $\theta$ according to $p(\theta \mid y_{1:T}, x_{1:T}, z_{1:T})$. Numerous implementations rely on the fact that sampling from the conditional density $p(\theta \mid y_{1:T}, x_{1:T})$ or $p(\theta \mid y_{1:T}, x_{1:T}, z_{1:T})$ is feasible and thus the potentially difficult design of a proposal density for $\theta$ can be bypassed. However, as mentioned before, it is typically impossible to sample from $p_\theta(x_{1:T} \mid y_{1:T})$. Clearly, substituting for the sampling step from $p_\theta(x_{1:T} \mid y_{1:T})$ a draw from the DPF approximation $\widehat{p}_\theta(x_{1:T} \mid y_{1:T})$ would not provide Gibbs samplers admitting the correct invariant distribution.

We now present a valid particle approximation of the Gibbs sampler, which assumes we can sample from $p(\theta \mid y_{1:T}, x_{1:T})$. It is possible to build a valid particle approximation of the modified Gibbs sampler by the same arguments, but we omit the details here for brevity.

 

PG sampler for SSSM

Initialisation, $i = 0$

Set $\theta(0)$, $X_{1:T}(0)$ arbitrarily.

For iteration $i \geq 1$

Sample $\theta(i) \sim p(\cdot \mid y_{1:T}, X_{1:T}(i-1))$.

Run a conditional DPF algorithm targeting $p_{\theta(i)}(x_{1:T} \mid y_{1:T})$, conditional upon $X_{1:T}(i-1)$.

Run a backward sampling algorithm to obtain $X_{1:T}(i)$.

 
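The driver loop of this sampler can be sketched as follows. `sample_theta`, `cond_dpf` and `backward_sample` are assumed helpers standing in for the three steps above: a draw from $p(\theta \mid y_{1:T}, x_{1:T})$, the conditional DPF run keeping the current trajectory alive, and the backward sampling draw from the resulting particle system.

```python
def particle_gibbs(theta0, x0, n_iter, sample_theta, cond_dpf,
                   backward_sample, rng):
    """Sketch of the PG sweep above; each iteration refreshes theta
    conditional on the retained trajectory, then refreshes the
    trajectory via the conditional DPF plus backward sampling."""
    theta, x = theta0, x0
    chain = []
    for _ in range(n_iter):
        theta = sample_theta(x, rng)
        particles = cond_dpf(theta, x)
        x = backward_sample(particles, rng)
        chain.append((theta, x))
    return chain
```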

The remarkable property enjoyed by the PG algorithm is that, under weak assumptions, it generates samples from $\pi(\theta, x_{1:T})$ in steady state for any number of particles used to build the required DPF approximations. The non-standard steps of the PG sampler are the conditional DPF algorithm and the backward sampling algorithm, which we now describe.

Given a value of $\theta$ and a trajectory $x^\star_{1:T}$, the conditional DPF algorithm proceeds as follows.

 

Conditional DPF algorithm

At time $n = 1$

Set $S_1 = \mathcal{X}$ and for each $x_1 \in S_1$ (which includes $x^\star_1$), compute $m_1(x_1)$, $\Sigma_1(x_1)$ and $p_\theta(y_1 \mid x_1)$ using the Kalman filter.

Compute and normalise the weights. For each $x_1 \in S_1$,

$W_1(x_1) = \frac{\nu_\theta(x_1)\, p_\theta(y_1 \mid x_1)}{\sum_{x'_1 \in S_1} \nu_\theta(x'_1)\, p_\theta(y_1 \mid x'_1)}.$ (18)

At times $n \geq 2$

If $|S_{n-1}| \leq N$ set $R_{n-1} = S_{n-1}$, otherwise set $c_n$ to the unique solution of