Rao-Blackwellized particle smoothers for conditionally linear Gaussian models

Supported by the projects Learning of complex dynamical systems (Contract number: 637-2014-466) and Probabilistic modeling of dynamical systems (Contract number: 621-2013-5524), both funded by the Swedish Research Council, and the project Bayesian Tracking and Reasoning over Time (Reference: EP/K020153/1), funded by the EPSRC.

Fredrik Lindsten, Pete Bunch, Simo Särkkä,
Thomas B. Schön, and Simon J. Godsill
23 May 2015
Abstract

Sequential Monte Carlo (SMC) methods, such as the particle filter, are by now one of the standard computational techniques for addressing the filtering problem in general state-space models. However, many applications require post-processing of data offline. In such scenarios the smoothing problem—in which all the available data is used to compute state estimates—is of central interest. We consider the smoothing problem for a class of conditionally linear Gaussian models. We present a forward-backward-type Rao-Blackwellized particle smoother (RBPS) that is able to exploit the tractable substructure present in these models. Akin to the well-known Rao-Blackwellized particle filter, the proposed RBPS marginalizes out a conditionally tractable subset of state variables, effectively making use of SMC only for the “intractable part” of the model. Compared to existing RBPS, two key features of the proposed method are: (i) it does not require structural approximations of the model, and (ii) the aforementioned marginalization is done both in the forward direction and in the backward direction.

1 Introduction

State-space models (SSMs) comprise one of the most important model classes in statistical signal processing, automatic control, econometrics, and related areas. A general discrete-time SSM is given by

$x_{t+1} \sim p(x_{t+1} \mid x_t),$ (1a)
$y_t \sim p(y_t \mid x_t),$ (1b)

where $\{x_t\}_{t \geq 1}$ is the latent state process and $\{y_t\}_{t \geq 1}$ is the observed measurement process (we use the common convention that $p(\cdot)$ denotes an arbitrary probability density function (PDF) induced by the model (1), which is identified by its arguments). When the model is linear and Gaussian, the filtering and smoothing problems can be solved optimally by using methods such as the Kalman filter and the Rauch-Tung-Striebel smoother, respectively (see, e.g., [KailathSH:2000]). When going beyond the linear Gaussian case, however, no analytical solution to the optimal state inference problem is available, which calls for approximate computational methods.
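For the linear Gaussian special case mentioned above, the optimal filter is the Kalman filter. The following is a minimal sketch (the model and all numbers are illustrative, not from the paper):

```python
import numpy as np

def kf_step(m, P, y, A, C, Q, R):
    """One Kalman filter iteration: time update followed by measurement update."""
    m_pred = A @ m                         # predicted mean
    P_pred = A @ P @ A.T + Q               # predicted covariance
    S = C @ P_pred @ C.T + R               # innovation covariance
    K = P_pred @ C.T @ np.linalg.inv(S)    # Kalman gain
    m_new = m_pred + K @ (y - C @ m_pred)
    P_new = P_pred - K @ S @ K.T
    return m_new, P_new

# Toy scalar random walk observed in noise (hypothetical parameters)
rng = np.random.default_rng(0)
A = np.array([[1.0]]); C = np.array([[1.0]])
Q = np.array([[0.1]]); R = np.array([[1.0]])
m, P = np.zeros(1), np.eye(1)
x = 0.0
for _ in range(50):
    x = x + rng.normal(scale=np.sqrt(0.1))     # simulate the true state
    y = C @ np.array([x]) + rng.normal(size=1) # noisy observation
    m, P = kf_step(m, P, y, A, C, Q, R)
```

The posterior variance contracts from the prior towards the steady-state Riccati solution, while the mean tracks the state.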

Many popular deterministic methods are based on Gaussian approximations, for instance through linearization and related techniques. An alternative approach, for which the accuracy of the approximation is limited basically only by the computational budget, is to use Monte Carlo methods. Among these, sequential Monte Carlo (SMC) methods such as particle filters (PF) and particle smoothers (PS) play a prominent role (see, e.g., [DoucetJ:2011, Gustafsson:2010a]).

While SMC can be applied directly to the general model (1), it has been recognized that, in many cases, there is a tractable substructure available in the model. This structure can then be exploited to improve the performance of the SMC method. In particular, the Rao-Blackwellized PF (RBPF) [SchonGN:2005, ChenL:2000] has been found to be very useful for addressing the filtering problem in conditionally linear Gaussian (CLG) SSMs (see Section 2). As pointed out in [CappeMR:2005], CLG models have found an “exceptionally broad range of applications”.

However, many applications of SSMs, as well as system identification, rely on batch analysis of data. The central object of interest is then the smoothing distribution, that is, the distribution of the system state(s) conditioned on all the observed data. While there exist many SMC-based smoothers (see, e.g., [LindstenS:2013] and the references therein), variance reduction by Rao-Blackwellization has not been as well explored for smoothing as for filtering.

In this paper, we present a Rao-Blackwellized PS (RBPS) for general CLG models. The proposed method is based on the forward filter/backward simulator (FFBS) [GodsillDW:2004]. Contrary to the related forward-backward-type RBPS used by [Kim:1994], the proposed method does not require any structural approximations of the model. Another key feature of the proposed method is that it employs Rao-Blackwellization both in the forward and backward directions, as opposed to [FongGDW:2002] who sample the full system state in the backward direction. The use of Rao-Blackwellization also in the backward direction is necessary for the smoother to be truly Rao-Blackwellized. An alternative RBPS, specifically targeting the marginal smoothing distribution, which is based on the generalized two-filter formula is presented in [BriersDM:2010].

This contribution builds upon two previous conference publications, [SarkkaBG:2012] and [LindstenBGS:2013], where we studied two specific model classes: hierarchical models and mixed linear/nonlinear models, respectively (see the next section for definitions). Furthermore, independently of [SarkkaBG:2012], Whiteley et al. [WhiteleyAD:2010] have derived essentially the same RBPS for hierarchical models as we present here, although they explicitly study the special case of jump Markov systems. The present work goes beyond [SarkkaBG:2012, LindstenBGS:2013] in several respects. First, the techniques used for the derivations and, as a consequence, details of the algorithmic specifications in these two papers differ substantially. Here, we harmonise the derivation and provide a general algorithm which is applicable to both types of CLG models under study. We also improve the previous results by extending the method to more general models. Specifically, we allow for correlation between the process noises entering the conditionally linear and the nonlinear parts of the model, and for rank-deficient process noise covariances in the conditionally linear parts. This comprises an important class of models in practical applications [GustafssonGBFJKN:2002]. Finally, we provide several extensions to the main method (Section 5) that we view as a key part of the proposed methodology.

For a vector $a$ and a positive semidefinite matrix $P$, we write $\|a\|_P^2 = a^\top P a$. We write $\det(\cdot)$ for the matrix determinant, and $\mathcal{N}(m, P)$ and $\mathcal{N}(x; m, P)$ for the Gaussian distribution and PDF, respectively.

2 Conditionally linear Gaussian models

Let the system state be partitioned into two parts, $x_t = (\xi_t, z_t)$, where $\xi_t$ is referred to as the nonlinear state and $z_t$ is referred to as the linear state. The SSM (1) is said to be CLG if, conditionally on the nonlinear state trajectory $\xi_{1:T}$, the process $\{z_t, y_t\}$ follows a time-inhomogeneous linear Gaussian SSM. For concreteness, we will study two specific classes of CLG models, which are of particular practical interest. However, by combining these two model classes the proposed method can straightforwardly be generalized to other CLG models.

Model 1 (Hierarchical CLG model)

A hierarchical CLG model is given by,

$\xi_{t+1} \sim p(\xi_{t+1} \mid \xi_t),$ (2a)
$z_{t+1} = f_t(\xi_{t+1}) + A_t(\xi_{t+1})\, z_t + v_t,$ (2b)
$y_t = h_t(\xi_t) + C_t(\xi_t)\, z_t + e_t,$ (2c)

with process noise $v_t \sim \mathcal{N}(0, Q_t(\xi_{t+1}))$ and measurement noise $e_t \sim \mathcal{N}(0, R_t(\xi_t))$, respectively, where $R_t(\xi_t)$ is a positive definite matrix for any $\xi_t$.
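As an illustration (not part of the paper's development), a scalar hierarchical CLG model of the form (2) can be simulated as follows; all functions and coefficients are hypothetical choices:

```python
import numpy as np

rng = np.random.default_rng(1)
T = 100
xi = np.zeros(T); z = np.zeros(T); y = np.zeros(T)
z[0] = rng.normal()
for t in range(T - 1):
    # (2a): the nonlinear state evolves on its own; here a heavy-tailed
    # autoregression, which is Markov but not Gaussian
    xi[t + 1] = 0.9 * xi[t] + rng.standard_t(df=3)
    # (2b): the linear state is conditionally linear Gaussian given xi
    A = 0.8 + 0.1 * np.tanh(xi[t + 1])     # hypothetical xi-dependent coefficient
    z[t + 1] = A * z[t] + rng.normal(scale=0.5)
for t in range(T):
    # (2c): measurement, linear in z with a xi-dependent offset
    y[t] = np.sin(xi[t]) + z[t] + rng.normal(scale=0.3)
```

Conditioned on the simulated $\xi$-trajectory, the pair $(z_t, y_t)$ is exactly a time-varying linear Gaussian SSM, which is what the RBPF and RBPS exploit.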

Model 1 can be seen as a generalization of a jump Markov system, in which the “jump” or “mode” variable is allowed to be continuous. However, the hierarchical structure of Model 1 can sometimes be limiting. We will therefore study also the following model class.

Model 2 (Mixed linear/nonlinear model)

A mixed linear/nonlinear CLG model is given by,

$\xi_{t+1} = f_t^{\xi}(\xi_t) + A_t^{\xi}(\xi_t)\, z_t + v_t^{\xi},$
$z_{t+1} = f_t^{z}(\xi_t) + A_t^{z}(\xi_t)\, z_t + v_t^{z},$ (3a)
$y_t = h_t(\xi_t) + C_t(\xi_t)\, z_t + e_t,$ (3b)

with process noise $v_t = (v_t^{\xi}, v_t^{z}) \sim \mathcal{N}(0, Q_t(\xi_t))$ and measurement noise $e_t \sim \mathcal{N}(0, R_t(\xi_t))$, respectively, where $Q_t^{\xi}(\xi_t)$ (the block of $Q_t(\xi_t)$ corresponding to $v_t^{\xi}$) and $R_t(\xi_t)$ are assumed to be positive definite matrices for any $\xi_t$.

Note that Model 2 allows for a cross-dependence in the dynamics of the two state-components, that is depends explicitly on and vice versa. Mixed linear/nonlinear models arise, for instance, when the observations depend nonlinearly on a subset of the states in a system with linear dynamics. See [GustafssonGBFJKN:2002] for several examples from target tracking where this model is used.
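For concreteness, a scalar mixed linear/nonlinear model of the form (3) with cross-correlated process noise can be simulated as below (hypothetical coefficients; the off-diagonal entry of the joint noise covariance induces the cross-correlation discussed in Remark 1):

```python
import numpy as np

rng = np.random.default_rng(2)
T = 100
Q = np.array([[0.5, 0.2],      # joint process-noise covariance: the off-diagonal
              [0.2, 0.3]])     # term correlates the xi- and z-components
L = np.linalg.cholesky(Q)
xi = np.zeros(T); z = np.zeros(T); y = np.zeros(T)
for t in range(T - 1):
    v = L @ rng.standard_normal(2)          # correlated (v_xi, v_z)
    # (3a): both states depend nonlinearly on xi_t and linearly on z_t
    xi[t + 1] = np.cos(xi[t]) + 0.1 * z[t] + v[0]
    z[t + 1] = 0.9 * z[t] + 0.1 * xi[t] + v[1]
for t in range(T):
    y[t] = 0.5 * xi[t] ** 2 + z[t] + rng.normal(scale=0.3)  # (3b)
```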

Remark 1

We do not assume that $Q_t(\xi_t)$ is full rank; that is, it is only the part $v_t^{\xi}$ of the process noise that enters on the nonlinear state that is assumed to be non-degenerate. Note also that the model (3a) readily allows for correlation between the components of the process noise entering on the nonlinear state and on the linear state, respectively.

It is worth emphasizing that Model 1 is not a special case of Model 2, since $p(\xi_{t+1} \mid \xi_t)$ may be non-Gaussian in (2a). Nevertheless, Model 1 is simpler than Model 2 in many respects. Indeed, one reason why we study both model classes in parallel is to more clearly convey the idea of the derivation. This is possible since we can start by looking at the (simpler) hierarchical CLG model before generalizing the expressions to the (more involved) mixed linear/nonlinear model.

3 Background

3.1 Particle filtering and smoothing

Consider first the general SSM (1). A PF is an SMC algorithm used to approximate the intractable filtering density $p(x_t \mid y_{1:t})$ (see, e.g., [DoucetJ:2011, Gustafsson:2010a]). Rather than targeting the sequence of filtering densities directly, however, the PF targets the sequence of joint smoothing densities $p(x_{1:t} \mid y_{1:t})$ for $t = 1, 2, \dots$. This is done by representing $p(x_{1:t} \mid y_{1:t})$ with a set of weighted particles $\{x_{1:t}^i, w_t^i\}_{i=1}^N$, each of which is a state trajectory $x_{1:t}^i = (x_1^i, \dots, x_t^i)$. These particles define the point-mass approximation,

$\hat p(x_{1:t} \mid y_{1:t}) = \sum_{i=1}^N w_t^i \, \delta_{x_{1:t}^i}(x_{1:t}),$ (4)

where $\delta_x$ denotes a Dirac distribution at point $x$. In the simplest particle filter, the $t$-th set of particles is formed by sampling ancestors from the previous distribution (resampling) and then propagating each particle using an importance distribution $q(x_t \mid x_{t-1}, y_t)$. A weight is assigned to each particle to account for the discrepancy between the proposal and the target density. The importance weight is given by the ratio of target and proposal densities, which simplifies to,

$w_t^i \propto \dfrac{p(y_t \mid x_t^i)\, p(x_t^i \mid x_{t-1}^i)}{q(x_t^i \mid x_{t-1}^i, y_t)}.$ (5)

Note that an approximation to the filtering density $p(x_t \mid y_{1:t})$ is obtained by marginalization of (4), which equates to simply discarding $x_{1:t-1}^i$ for each particle $i$.
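The recursion described above can be sketched as a minimal bootstrap particle filter; with the bootstrap proposal $q = p(x_t \mid x_{t-1})$, the weight (5) reduces to the likelihood. The toy model and all parameters below are illustrative:

```python
import numpy as np

def bootstrap_pf(y, N, f, g_loglik, rng):
    """Minimal bootstrap PF storing full trajectories, cf. (4)-(5)."""
    T = len(y)
    X = np.zeros((T, N))                   # ancestral paths x_{1:t}^i
    w = np.full(N, 1.0 / N)
    x = np.zeros(N)                        # simple fixed initialization
    for t in range(T):
        idx = rng.choice(N, size=N, p=w)   # resample
        X[:t] = X[:t, idx]                 # keep paths consistent with ancestors
        x = f(x[idx], rng)                 # propagate through the dynamics
        X[t] = x
        logw = g_loglik(y[t], x)           # reweight by the likelihood
        w = np.exp(logw - logw.max())
        w /= w.sum()
    return X, w

# Toy model: x_{t+1} = 0.9 x_t + N(0, 0.5), y_t = x_t + N(0, 1)
rng = np.random.default_rng(3)
T = 30
x_true = np.zeros(T)
for t in range(1, T):
    x_true[t] = 0.9 * x_true[t - 1] + rng.normal(scale=np.sqrt(0.5))
y = x_true + rng.normal(size=T)
X, w = bootstrap_pf(
    y, N=500,
    f=lambda x, r: 0.9 * x + r.normal(scale=np.sqrt(0.5), size=x.size),
    g_loglik=lambda yt, x: -0.5 * (yt - x) ** 2,
    rng=rng,
)
x_filt = float(np.sum(w * X[-1]))          # filtered mean estimate at time T
```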

The term “smoothing” encompasses a number of related inference problems. Basically, it amounts to computing the posterior PDF of some (past) state variable, given a batch of measurements $y_{1:T}$. Here we focus on the estimation of the complete joint smoothing density $p(x_{1:T} \mid y_{1:T})$. Any marginal smoothing density can be computed from the joint smoothing density by marginalization.

In fact, the joint smoothing distribution is approximated at the final step of the particle filter [Kitagawa:1996]. However, this approximation suffers from the problem of path degeneracy; that is, the number of unique particles decreases rapidly for $t \ll T$ [GodsillDW:2004, DoucetJ:2011]. To mitigate this issue, a diverse set of particles may be generated by sampling state trajectories using the forward filtering/backward simulation (FFBS) algorithm [GodsillDW:2004]. FFBS exploits a sequential factorization of the joint smoothing density:

$p(x_{1:T} \mid y_{1:T}) = p(x_T \mid y_{1:T}) \prod_{t=1}^{T-1} p(x_t \mid x_{t+1}, y_{1:t}).$ (6)

At time $T$, a final state $\tilde x_T$ is first sampled from the particle filter approximation $\hat p(x_T \mid y_{1:T})$. Then, working backward from time $T-1$ to time $1$, each subsequent state $\tilde x_t$ is sampled (approximately) from the backward kernel $p(x_t \mid \tilde x_{t+1}, y_{1:t})$. The resulting trajectory $\tilde x_{1:T}$ is then an approximate sample from the joint smoothing distribution.

Using the Markov property, the backward kernel may be expressed as

$p(x_t \mid x_{t+1}, y_{1:t}) = \dfrac{p(x_{t+1} \mid x_t)\, p(x_t \mid y_{1:t})}{p(x_{t+1} \mid y_{1:t})}.$ (7)

By using the PF approximation of the filtering distribution, we obtain the following point-mass approximation of the backward kernel:

$\hat p(x_t \mid x_{t+1}, y_{1:t}) = \sum_{i=1}^N \tilde w_t^i \, \delta_{x_t^i}(x_t),$ (8)

with $\tilde w_t^i \propto w_t^i \, p(x_{t+1} \mid x_t^i)$. The FFBS algorithm samples from this approximation in the backward simulation pass.

Typically, we repeat the backward simulation, say, $M$ times. This generates a collection of backward trajectories $\{\tilde x_{1:T}^j\}_{j=1}^M$ which defines a point-mass approximation of the joint smoothing distribution according to,

$\hat p(x_{1:T} \mid y_{1:T}) = \frac{1}{M} \sum_{j=1}^M \delta_{\tilde x_{1:T}^j}(x_{1:T}).$ (9)

From this, any marginal or fixed-interval smoothing distribution can be approximated by simply discarding the parts of the backward trajectories which are not of interest.
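The backward simulation pass can be sketched as follows; the stored particle system and the transition log-density below are synthetic placeholders standing in for the output of an actual forward PF:

```python
import numpy as np

def ffbs_trajectory(X, W, trans_logpdf, rng):
    """Draw one backward trajectory from stored PF particles X[t, i] and
    normalized weights W[t, i], cf. (6)-(8)."""
    T, N = X.shape
    traj = np.empty(T)
    j = rng.choice(N, p=W[-1])             # initialize from the filter at time T
    traj[-1] = X[-1, j]
    for t in range(T - 2, -1, -1):
        # Backward weights proportional to w_t^i * p(x_{t+1} | x_t^i), cf. (8)
        logw = np.log(W[t]) + trans_logpdf(traj[t + 1], X[t])
        w = np.exp(logw - logw.max())
        w /= w.sum()
        j = rng.choice(N, p=w)
        traj[t] = X[t, j]
    return traj

# Illustrative run on synthetic particles (not a real PF output)
rng = np.random.default_rng(4)
T, N = 20, 100
X = rng.normal(size=(T, N))
W = rng.random((T, N)); W /= W.sum(axis=1, keepdims=True)
traj = ffbs_trajectory(X, W, lambda xn, x: -0.5 * (xn - 0.9 * x) ** 2, rng)
```

Each backward draw picks one existing forward particle per time step, so the sampled trajectory is pieced together from the forward particle system.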

3.2 Rao-Blackwellized particle filter

We now turn our attention specifically to CLG models, such as Model 1 or Model 2. The structure inherent in these models can be exploited when addressing the filtering problem. This is done in the RBPF, which is based on the factorization $p(\xi_{1:t}, z_t \mid y_{1:t}) = p(z_t \mid \xi_{1:t}, y_{1:t})\, p(\xi_{1:t} \mid y_{1:t})$. Since the model is CLG, it holds that

$p(z_t \mid \xi_{1:t}, y_{1:t}) = \mathcal{N}\left(z_t; \bar z_t(\xi_{1:t}), P_t(\xi_{1:t})\right)$ (10)

for some mean and covariance functions $\bar z_t(\cdot)$ and $P_t(\cdot)$, respectively. A PF is used to estimate only the nonlinear state marginal density $p(\xi_{1:t} \mid y_{1:t})$, while conditional Kalman filters, one for each particle, are used to compute the moments for the linear state in (10). The resulting RBPF approximation is given by

$\hat p(\xi_{1:t}, z_t \mid y_{1:t}) = \sum_{i=1}^N w_t^i\, \mathcal{N}(z_t; \bar z_t^i, P_t^i)\, \delta_{\xi_{1:t}^i}(\xi_{1:t}),$

where $\bar z_t^i = \bar z_t(\xi_{1:t}^i)$ and $P_t^i = P_t(\xi_{1:t}^i)$. The particle weights are given by the ratio of the target density and the importance density. See [ChenL:2000] for details on the implementation for the hierarchical CLG model and [SchonGN:2005] for the mixed linear/nonlinear model. The reduced dimensionality of the particle approximation results in a reduction in the variance of associated estimators [LindstenSO:2011, Chopin:2004].
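A compact sketch of one RBPF iteration for a scalar toy hierarchical model (all coefficients hypothetical) may help fix ideas: a bootstrap PF step for the nonlinear state, a bank of scalar Kalman filters for the linear state, and weighting by the per-particle marginal likelihood:

```python
import numpy as np

def rbpf_step(xi, zbar, P, y, rng):
    """One RBPF iteration for an illustrative scalar hierarchical CLG model:
    xi_{t+1} = 0.9 xi_t + N(0, 0.09), z_{t+1} = 0.8 z_t + N(0, 0.25),
    y_t = sin(xi_t) + z_t + N(0, 0.5). All coefficients are hypothetical."""
    A, Qz, C, R = 0.8, 0.25, 1.0, 0.5
    N = xi.size
    # Propagate nonlinear particles with a bootstrap proposal
    xi = 0.9 * xi + rng.normal(scale=0.3, size=N)
    # Per-particle Kalman time update for the linear state
    zbar = A * zbar
    P = A * P * A + Qz
    # Weight by the marginal likelihood p(y_t | xi_{1:t}^i, y_{1:t-1})
    S = C * P * C + R
    resid = y - (np.sin(xi) + C * zbar)
    logw = -0.5 * (np.log(2 * np.pi * S) + resid ** 2 / S)
    w = np.exp(logw - logw.max()); w /= w.sum()
    # Per-particle Kalman measurement update
    K = P * C / S
    zbar = zbar + K * resid
    P = (1.0 - K * C) * P
    # Resample
    idx = rng.choice(N, size=N, p=w)
    return xi[idx], zbar[idx], P[idx], w

rng = np.random.default_rng(6)
N = 300
xi, zbar, P = rng.normal(size=N), np.zeros(N), np.ones(N)
for y in (0.5, -0.2, 0.1):                 # a few synthetic observations
    xi, zbar, P, w = rbpf_step(xi, zbar, P, y, rng)
```

Note that only the nonlinear state is represented by samples; the linear state is carried analytically through the per-particle moments.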

For numerical stability, it is recommended to implement the conditional Kalman filters in square-root form. That is, we propagate, e.g., the Cholesky factor $P_t^{1/2}$ of the conditional covariance matrix, rather than the covariance matrix itself, where $P_t^{1/2}$ is such that

$P_t = P_t^{1/2} \left(P_t^{1/2}\right)^{\!\top}.$ (11)

See Section 5.3.
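A standard way to realize the square-root time update is via a QR decomposition of the stacked factors; the following sketch (our own notation, not the paper's) verifies it against the naive computation:

```python
import numpy as np

def sqrt_predict(S, A, Qchol):
    """Square-root covariance time update: given S with P = S S^T, return a
    square root of A P A^T + Q without forming P explicitly. The transposed
    R factor of a QR decomposition of [A S, Q^{1/2}]^T does the job; it need
    not have a positive diagonal, but it is a valid matrix square root."""
    M = np.hstack([A @ S, Qchol])
    _, Rf = np.linalg.qr(M.T)
    return Rf.T

rng = np.random.default_rng(5)
A = rng.normal(size=(3, 3))
P = np.eye(3)
Q = 0.1 * np.eye(3)
S_pred = sqrt_predict(np.linalg.cholesky(P), A, np.linalg.cholesky(Q))
P_direct = A @ P @ A.T + Q                 # reference computed the naive way
```

The correctness follows from $M M^\top = A P A^\top + Q = R^\top R$, so no subtraction of matrices (the usual source of lost positive definiteness) ever occurs.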

4 Rao-Blackwellized particle smoothing

We now turn to the derivation of the new RBPS. The method is an FFBS which uses the RBPF as a forward filter. The novelty lies in the construction of a backward simulator which samples only the nonlinear state in the backward pass. Difficulty arises because marginally (and conditionally on the observations) the nonlinear state process is non-Markovian. Practically, this means that the backward kernel cannot be expressed in a simple way, as in (7). We address this difficulty by deriving a backward recursion for a set of sufficient statistics for the backward kernel. This backward recursion is reminiscent of the backward filter in the two-filter smoothing formula for a linear Gaussian SSM (see e.g., [KailathSH:2000, Chapter 10]).

The basic idea is presented in Section 4.1, together with the statement of a general algorithm which samples state trajectories for the nonlinear states. We then consider the two specific model classes, the hierarchical CLG model and the mixed linear/nonlinear model, in Section 4.2 and Section 4.3, respectively. In Section 4.4 we discuss how to compute the smoothing distribution for the linear states.

4.1 Rao-Blackwellized backward simulation

We wish to derive a backward simulator for the nonlinear process $\xi_{1:T}$; that is, the target density is $p(\xi_{1:T} \mid y_{1:T})$. However, when marginalizing the linear states $z_{1:T}$, we introduce a dependence in the measurement likelihood on the complete history $\xi_{1:t}$. As a consequence, we must sample among the complete trajectories produced by the RBPF when simulating the nonlinear backward trajectories; see [LindstenS:2013, Chapter 4] for a general treatment of backward simulation in the non-Markovian setting. To solidify the idea, note that the target density can be expressed as

$p(\xi_{1:T} \mid y_{1:T}) = p(\xi_{1:t} \mid \xi_{t+1:T}, y_{1:T})\, p(\xi_{t+1:T} \mid y_{1:T}).$ (12)

Assume that we have run a backward simulator from time $T$ down to time $t+1$. Hence, we have generated a partial, nonlinear backward trajectory $\tilde\xi_{t+1:T}$, which is an approximate sample from $p(\xi_{t+1:T} \mid y_{1:T})$. To extend this trajectory to time $t$, we draw one of the RBPF particles $\xi_{1:t}^i$ (with probabilities computed below). We then set $\tilde\xi_{t:T} = (\xi_t^i, \tilde\xi_{t+1:T})$ and discard $\xi_{1:t-1}^i$. This procedure is then repeated for each time $t = T-1, \dots, 1$, resulting in a complete backward trajectory $\tilde\xi_{1:T}$.

To compute the backward sampling probabilities, we note that the first factor in (12) can be expressed as,

$p(\xi_{1:t} \mid \xi_{t+1:T}, y_{1:T}) \propto p(y_{t+1:T}, \xi_{t+1:T} \mid \xi_{1:t}, y_{1:t})\, p(\xi_{1:t} \mid y_{1:t}).$ (13)

The second factor in this expression can be approximated by the forward RBPF, analogously to a standard FFBS. Similarly to (8), this results in a point-mass approximation of the backward kernel, given by

$\hat p(\xi_{1:t} \mid \tilde\xi_{t+1:T}, y_{1:T}) = \sum_{i=1}^N \tilde w_{t|T}^i\, \delta_{\xi_{1:t}^i}(\xi_{1:t}),$ (14)

with

$\tilde w_{t|T}^i \propto w_t^i\, p(y_{t+1:T}, \tilde\xi_{t+1:T} \mid \xi_{1:t}^i, y_{1:t}).$ (15)

We thus employ the following backward simulation strategy to sample $\tilde\xi_{1:T}$:

  1. Run a forward RBPF for times $t = 1, \dots, T$.

  2. Sample $\tilde\xi_T = \xi_T^i$ with probability $w_T^i$.

  3. For $t = T-1$ to $1$:

    1. Sample $\tilde\xi_t = \xi_t^i$ with probability $\tilde w_{t|T}^i$.

    2. Set $\tilde\xi_{t:T} = (\tilde\xi_t, \tilde\xi_{t+1:T})$.

Note that Step 3a) effectively means that we simulate from the point-mass approximation of the backward kernel (14), discard $\xi_{1:t-1}^i$, and set $\tilde\xi_t = \xi_t^i$. More detailed pseudo-code is given in Algorithm 1 below.

It remains to find an expression (up to proportionality) for the predictive PDF in (15), in order to compute the backward sampling weights. In fact, since the model is CLG, this PDF can be computed straightforwardly by running a conditional Kalman filter from time $t$ up to $T$. However, using such an approach to calculate the weights at time $t$ would require $N$ separate Kalman filters to run over $T - t$ time steps, resulting in a total computational complexity scaling quadratically with $T$. To avoid this, we seek a more efficient computation of the weights (15). This is accomplished by propagating a set of sufficient statistics backward in time, as the trajectory $\tilde\xi_{t+1:T}$ is generated. Specifically, these statistics are computed by running a conditional backward information filter for $z_t$, conditionally on $\tilde\xi_{t+1:T}$ and $y_{t+1:T}$. The idea stems from [GerlachCK:2000], who use the same approach for Markov chain Monte Carlo sampling in jump Markov systems.

To see how this can be done, note that the predictive PDF in (15) can be expressed as

$p(y_{t+1:T}, \tilde\xi_{t+1:T} \mid \xi_{1:t}, y_{1:t}) = \int p(y_{t+1:T}, \tilde\xi_{t+1:T} \mid \xi_t, z_t)\, p(z_t \mid \xi_{1:t}, y_{1:t})\, \mathrm{d}z_t.$ (16)

This expression is related to the factorization used in the two-filter smoothing formula. The second factor of the integrand is the conditional forward filtering density. This density, computed in the forward RBPF, is given by (10). Similarly, we can view the first factor of the integrand as the density targeted by a conditional backward filter, akin to the one used in the two-filter smoothing formula.

Indeed, we will derive a conditional backward information filter for this density, and thereby show that

$p(y_{t+1:T}, \tilde\xi_{t+1:T} \mid \xi_t, z_t) \propto Z_t(\xi_t) \exp\left(-\tfrac{1}{2} z_t^\top \Omega_t z_t + \lambda_t^\top z_t\right),$ (17)

where $Z_t$, $\lambda_t$, and $\Omega_t$ depend on $\tilde\xi_{t+1:T}$, $y_{t+1:T}$, and $\xi_t$, but are independent of $z_t$, and the proportionality is with respect to $z_t$.¹ Note that (17) is not a PDF in $z_t$. Still, it can be instructive to think about the above expression as a Gaussian PDF with information vector $\lambda_t$ and information matrix $\Omega_t$. We choose to express the backward statistics in information form since, as we shall see later, $\Omega_t$ is not necessarily invertible. The interpretation of (17) as a Gaussian PDF for $z_t$ implicitly corresponds to the assumption of a non-informative (flat) prior on $z_t$. As pointed out above, this interpretation might be useful for understanding the role of the backward statistics, but it does not affect our derivation in any way.

¹ By proportionality with respect to some variable, we mean that the constant hidden in the proportionality sign is independent of this variable.

As an intermediate step of the derivation, we will also show the related identity,

$p(y_{t:T}, \tilde\xi_{t+1:T} \mid \xi_t, z_t) \propto \hat Z_t(\xi_t) \exp\left(-\tfrac{1}{2} z_t^\top \hat\Omega_t z_t + \hat\lambda_t^\top z_t\right),$ (18)

for some $\hat Z_t$, $\hat\lambda_t$, and $\hat\Omega_t$, and where the proportionality is with respect to $z_t$. Computing (18) given (17) corresponds to the measurement update of the backward information filter (the measurement $y_t$ is taken into account). Similarly, computing (17) given (18), with $t$ replaced by $t+1$, corresponds to a backward prediction.

Assume for now that (17) holds. To compute the integral in (16) we make use of the following lemma. The proof is omitted for brevity, but follows straightforwardly by plugging in the expressions for the two factors of the integrand and carrying out the integration with respect to $z_t$.

Lemma 1

Let and let , for some constant vectors and and matrices and , respectively. Let and be a constant matrix and vector, respectively. Then with,

where and .
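As an illustration of the kind of Gaussian integral that Lemma 1 encapsulates, the following sketch (in our own notation; all quantities illustrative) verifies the closed form of $\int \mathcal{N}(z; m, P)\exp(-\tfrac12 z^\top\Omega z + \lambda^\top z)\,\mathrm{d}z$ against a Monte Carlo estimate:

```python
import numpy as np

# Closed form (our derivation): with Lam = P^{-1} + Om and eta = P^{-1} m + lam,
#   I = det(P @ Lam)^{-1/2} * exp(0.5 eta' Lam^{-1} eta - 0.5 m' P^{-1} m)
rng = np.random.default_rng(7)
m = np.array([0.3, -0.1])
P = np.diag([0.5, 0.8])
Om = 0.5 * np.eye(2)
lam = np.array([0.1, -0.2])

Pinv = np.linalg.inv(P)
Lam = Pinv + Om
eta = Pinv @ m + lam
analytic = (np.linalg.det(P @ Lam) ** -0.5
            * np.exp(0.5 * eta @ np.linalg.solve(Lam, eta) - 0.5 * m @ Pinv @ m))

# Monte Carlo check: average the exponential factor over z ~ N(m, P)
z = rng.multivariate_normal(m, P, size=400_000)
vals = np.exp(-0.5 * np.einsum('ni,ij,nj->n', z, Om, z) + z @ lam)
mc = vals.mean()
```

Because the integrand is bounded here, the Monte Carlo average converges quickly and agrees with the closed form to within sampling error.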

From (10) and (11) it follows that if we write

$z_t = \bar z_t + P_t^{1/2} \varepsilon_t,$ (19)

then the distribution of $\varepsilon_t$ in (19) is $\mathcal{N}(0, I)$. In the above, we have dropped the dependence on $\xi_{1:t}$ for brevity. The integral in (16) can thus be computed by applying Lemma 1, with the substitutions implied by (17) and (19). It follows that,

(20)

where the proportionality is with respect to $\xi_{1:t}$ and with,

(21a)
(21b)

By plugging this result into (15), we obtain an expression for the backward sampling weights. It remains to show the identity (17) and to find the updating equations for the statistics $(\lambda_t, \Omega_t)$. These recursions will be derived explicitly for the two model classes under study in the next two sections. The resulting Rao-Blackwellized backward simulator is given in Algorithm 1. As for a standard FFBS, the backward simulation is typically repeated $M$ times, to generate a collection of backward trajectories which can be used to approximate $p(\xi_{1:T} \mid y_{1:T})$.

  1. Forward filter: Run an RBPF for times $t = 1, \dots, T$. For each $t$, store $\{\xi_t^i, w_t^i, \bar z_t^i, P_t^i\}_{i=1}^N$.

  2. Initialize: Draw $\tilde\xi_T = \xi_T^i$ with probability $w_T^i$. Compute $\hat\lambda_T$ and $\hat\Omega_T$ according to (23).

  3. For $t = T-1$ to $1$:

    1. Backward filter prediction:

    2. (Model 1: hierarchical)

      • Compute for .

      • Compute according to (25).

    3. (Model 2: mixed)

      • Compute according to (33) for each forward filter particle .

    4. For :

      1. Compute according to (21).

      2. Compute .

    5. Normalize the weights, .

    6. Draw $\tilde\xi_t = \xi_t^i$ with probability $\tilde w_{t|T}^i$.

    7. Set $\tilde\xi_{t:T} = (\tilde\xi_t, \tilde\xi_{t+1:T})$.

    8. (Model 2: mixed) Set .

    9. Backward filter measurement update: Compute $\hat\lambda_t$ and $\hat\Omega_t$ according to (27).

Algorithm 1 Rao-Blackwellized backward simulator
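To illustrate how the backward sampling weights combine the forward filter statistics with the backward information pair, consider a scalar linear state (our own notation; (15), (16), and (20)-(21) refer to the general case):

```python
import numpy as np

def gauss_info_marginal_log(zbar, P, lam, Om):
    """log of ∫ N(z; zbar, P) exp(-0.5*Om*z^2 + lam*z) dz for a scalar linear
    state (our notation; this plays the role of the integral in (16))."""
    Lam = 1.0 / P + Om                     # combined information
    eta = zbar / P + lam
    return -0.5 * np.log(P * Lam) + 0.5 * eta ** 2 / Lam - 0.5 * zbar ** 2 / P

def backward_weights(logw_fwd, zbar, P, lam, Om, logtrans):
    """Self-normalized backward sampling weights, cf. (15): forward weight times
    the nonlinear transition density times the marginalized Gaussian factor."""
    logw = logw_fwd + logtrans + gauss_info_marginal_log(zbar, P, lam, Om)
    w = np.exp(logw - logw.max())
    return w / w.sum()

# Monte Carlo sanity check of the marginalization for one particle
rng = np.random.default_rng(8)
zbar, P, lam, Om = 0.4, 0.6, 0.3, 0.7
exact = np.exp(gauss_info_marginal_log(zbar, P, lam, Om))
z = rng.normal(zbar, np.sqrt(P), size=400_000)
mc = np.mean(np.exp(-0.5 * Om * z ** 2 + lam * z))

# Weights over three hypothetical particles with uniform forward weights
w = backward_weights(np.log(np.full(3, 1 / 3)), np.array([0.0, 0.4, -0.4]),
                     np.full(3, 0.6), lam, Om, np.zeros(3))
```

The key point is that the integral over the linear state is evaluated in closed form, so the backward pass never has to sample $z_t$.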

4.2 Model 1 – Hierarchical CLG model

We now consider Model 1, the hierarchical CLG model, and prove the identities (17) and (18). We also derive explicit updating equations for the statistics and , respectively.

Remark 2

The expressions derived in this section have previously been presented by [WhiteleyAD:2010] who, independently from our preliminary work in [SarkkaBG:2012], have derived an RBPS for hierarchical CLG models. Nevertheless, we believe that the present section will be useful in order to make the derivation for the (more involved) mixed linear/nonlinear model in Section 4.3 more accessible.

For notational simplicity, we write $h_t$ for $h_t(\xi_t)$ and similarly for other functions of $\xi_t$. To initialize the backward statistics at time $T$, we note that (2c) can be written as,

$\mathcal{N}(y_T; h_T + C_T z_T, R_T) \propto \exp\left(-\tfrac{1}{2} z_T^\top \hat\Omega_T z_T + \hat\lambda_T^\top z_T\right),$ (22)

with

$\hat\lambda_T = C_T^\top R_T^{-1}(y_T - h_T),$ (23a)
$\hat\Omega_T = C_T^\top R_T^{-1} C_T,$ (23b)

which shows that (18) holds at time $T$ (with the convention that $\tilde\xi_{T+1:T}$ and $y_{T+1:T}$ are empty). We continue by using an inductive argument. Hence, assume that (18) holds at time $t+1$. To prove (17) we do a backward prediction step. We have, for $t < T$,

$p(y_{t+1:T}, \tilde\xi_{t+1:T} \mid \xi_t, z_t) = p(\tilde\xi_{t+1} \mid \xi_t) \int p(y_{t+1:T}, \tilde\xi_{t+2:T} \mid \tilde\xi_{t+1}, z_{t+1})\, p(z_{t+1} \mid \tilde\xi_{t+1}, z_t)\, \mathrm{d}z_{t+1}.$ (24)

Using the induction hypothesis and (2b), the above integral can be computed by applying Lemma 1 with , , , , and . It follows that (24) coincides with (17), with

(25a)
(25b)
(25c)
where we have defined the quantities and .

Note that the above statistics depend on $\xi_t$ only through the factor $p(\tilde\xi_{t+1} \mid \xi_t)$ in (25a). This is important from an implementation point of view, since it implies that we do not need to make the backward prediction for each forward filter particle; see Algorithm 1.

Next, to prove (18) for $t < T$, we assume that (17) holds at time $t$. We have,

$p(y_{t:T}, \tilde\xi_{t+1:T} \mid \xi_t, z_t) = p(y_t \mid \xi_t, z_t)\, p(y_{t+1:T}, \tilde\xi_{t+1:T} \mid \xi_t, z_t).$ (26)

The first factor is given by (2c), analogously to (23), and the second factor is given by (17). By collecting terms from the two factors, we see that (26) coincides with (18), where

$\hat\lambda_t = \lambda_t + C_t^\top R_t^{-1}(y_t - h_t),$ (27a)
$\hat\Omega_t = \Omega_t + C_t^\top R_t^{-1} C_t.$ (27b)

As pointed out above, this corresponds to the backward measurement update. Since we are working with the information form of the backward filter, the measurement update simply corresponds to the addition of a term to the information vector and to the information matrix, respectively.
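In code, this additive information-form update (our notation for a measurement $y_t = h + C z_t + e_t$ with $e_t \sim \mathcal{N}(0, R)$) is a one-liner per statistic:

```python
import numpy as np

def backward_meas_update(lam, Om, y, h, C, R):
    """Information-form measurement update, cf. (27): the z-dependent part of
    the log-likelihood of y = h + C z + e, e ~ N(0, R), adds C'R^{-1}(y - h)
    to the information vector and C'R^{-1}C to the information matrix."""
    Rinv = np.linalg.inv(R)
    return lam + C.T @ Rinv @ (y - h), Om + C.T @ Rinv @ C

# Single update starting from vacuous backward statistics (illustrative numbers)
lam1, Om1 = backward_meas_update(
    np.zeros(2), np.zeros((2, 2)),
    y=np.array([1.0]), h=np.array([0.2]),
    C=np.array([[1.0, 0.5]]), R=np.array([[2.0]]),
)
```

No matrix inversion of the backward statistics is ever needed, which is exactly why the information form tolerates a singular $\Omega_t$.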

4.3 Model 2 – Mixed linear/nonlinear CLG model

We now turn to the mixed linear/nonlinear model (3) and prove the identities (17) and (18) for this class of systems. First, note that the measurement equations are identical for the models (2) and (3). Consequently, the initialization (23) and the backward measurement update (27) hold for the mixed linear/nonlinear model as well. We will thus focus on the backward prediction step.

Similarly to (24) we factorize the backward prediction density according to,

$p(y_{t+1:T}, \tilde\xi_{t+1:T} \mid \xi_t, z_t) = p(\tilde\xi_{t+1} \mid \xi_t, z_t) \int p(y_{t+1:T}, \tilde\xi_{t+2:T} \mid \tilde\xi_{t+1}, z_{t+1})\, p(z_{t+1} \mid \xi_t, z_t, \tilde\xi_{t+1})\, \mathrm{d}z_{t+1}.$ (28)

Note that the first factor now depends on $z_t$. From (3a), we can express this density as,

$p(\tilde\xi_{t+1} \mid \xi_t, z_t) = \mathcal{N}\left(\tilde\xi_{t+1};\, f_t^{\xi}(\xi_t) + A_t^{\xi}(\xi_t)\, z_t,\, Q_t^{\xi}(\xi_t)\right).$ (29)

Next, we address the integral in (28). Since the process noise enters the expressions for both $\xi_{t+1}$ and $z_{t+1}$ in (3a), there is a statistical dependence between $\xi_{t+1}$ and $z_{t+1}$. In other words, since we allow for cross-correlation between the process noises entering on $\xi_{t+1}$ and $z_{t+1}$, respectively, knowledge about $\xi_{t+1}$ will contain information about $z_{t+1}$. This has to be taken into account when computing the second factor of the integrand in (28). To handle this, we make use of a Gram-Schmidt orthogonalization to decorrelate the process noises. Let , where

(30)

Note that is a projection matrix: . It follows that and . We can then rewrite the dynamical equation (3a) as,

(31a)
(31b)
where
(31c)
(31d)

and where the process noises entering on and are now independent. Hence, from (31b), we can write

(32)

The integral in (28) can now be computed by applying Lemma 1 with , , , , and . Combining this result with (29) and collecting the terms, we see that (28) coincides with (17) with,

(33a)
(33b)
(33c)
where we have defined the quantities

As opposed to the hierarchical model, the predicted backward statistics all depend explicitly on $\xi_t$ for this model. This implies that the backward prediction has to be done for each forward filter particle; see Algorithm 1. It should be noted, however, that the updating equations (33) can be simplified for some special cases of the mixed linear/nonlinear model (3). In particular, if the dynamics (3a) are Gaussian and linear in both $\xi_t$ and $z_t$ (the measurement equation (3b) may be nonlinear in $\xi_t$), it is enough to do a single backward prediction. Models with linear dynamics and nonlinear measurement equations are indeed common in many applications; see [GustafssonGBFJKN:2002].

4.4 Smoothing the linear states

Algorithm 1 provides a way of simulating nonlinear state trajectories, approximately distributed according to $p(\xi_{1:T} \mid y_{1:T})$. However, it is often the case that we are also interested in smoothed estimates of the linear states $z_{1:T}$. These estimates can be obtained by fusing the statistics from a forward conditional Kalman filter with the backward statistics computed during the backward simulation. Note, however, that the forward statistics need to be computed anew; that is, we cannot simply use the statistics from the forward RBPF. The reason is that the statistics should be computed conditionally on the nonlinear trajectories simulated in the backward sweep, which are in general different from the trajectories simulated by the RBPF.
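Assuming the fusion is carried out in the standard information form (a sketch in our own notation, not the paper's exact expressions), combining the forward Kalman filter moments with the backward information pair looks as follows:

```python
import numpy as np

def fuse_smoothed(m, P, lam, Om):
    """Combine forward KF moments N(m, P) with backward information (lam, Om):
    under a flat-prior interpretation of the backward statistics, the smoothed
    conditional is Gaussian with the moments returned here."""
    Pinv = np.linalg.inv(P)
    P_s = np.linalg.inv(Pinv + Om)         # smoothed covariance
    m_s = P_s @ (Pinv @ m + lam)           # smoothed mean
    return m_s, P_s

# Check against the product of two Gaussians: fusing N(0, 1) with the
# information pair of N(2, 1), i.e. lam = 2, Om = 1, must give N(1, 0.5)
m_s, P_s = fuse_smoothed(np.zeros(1), np.eye(1), np.array([2.0]), np.eye(1))
```

Because $\Omega_t$ may be singular, the fusion is always well defined as long as the forward covariance $P$ is invertible.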

Let $\tilde\xi_{1:T}$ be a backward trajectory generated by Algorithm 1. To compute the conditional smoothing PDF for the linear states we start by noting that

(34)

Since the model is CLG, the latter factor can be computed by running a Kalman filter, conditionally on the fixed nonlinear state trajectory $\tilde\xi_{1:T}$. We get,

(35)

for some mean vector and covariance matrix , respectively (cf., (10)). By fusing this information with the backward information filter, given by (17), we get,