
# Particle Filtering and Smoothing Using Windowed Rejection Sampling

J. N. Corcoran and D. Jennings
Postal Address: Department of Applied Mathematics, University of Colorado, Box 526 Boulder CO 80309-0526, USA; email: corcoran@colorado.edu, phone: 303-492-0685
July 9, 2014
###### Abstract

“Particle methods” are sequential Monte Carlo algorithms, typically involving importance sampling, that are used to estimate and sample from joint and marginal densities of a collection of a (presumably increasing) number of random variables. In particular, a particle filter aims to estimate the current state $X_n$ of a stochastic system that is not directly observable by estimating a posterior distribution $\pi(x_n\,|\,y_1,\ldots,y_n)$, where the $Y_i$ are observations related to the $X_i$ through some measurement model $\pi(y_i|x_i)$. A particle smoother aims to estimate a marginal distribution $\pi(x_k\,|\,y_1,\ldots,y_n)$ for $k < n$. Particle methods are used extensively for hidden Markov models, where $\{X_n\}$ is a Markov chain, as well as for more general state space models.

Existing particle filtering algorithms are extremely fast and easy to implement. Although they suffer from issues of degeneracy and “sample impoverishment”, steps can be taken to minimize these problems, and overall they are excellent tools for inference. However, if one wishes to sample from a posterior distribution of interest, a particle filter is only able to produce dependent draws. Particle smoothing algorithms are complicated and far less robust, often requiring cumbersome post-processing, “forward-backward” recursions, and multiple passes through subroutines. In this paper we introduce an alternative algorithm for both filtering and smoothing that is based on rejection sampling “in windows”. We compare both speed and accuracy of the traditional particle filter and this “windowed rejection sampler” (WRS) for several examples and show that good estimates for smoothing distributions are obtained at no extra cost.

Keywords: particle filtering, rejection sampling, hidden Markov models
AMS Subject classification: 65C05, 65C60, 62G07

## 1 Introduction

Particle filters and smoothers are sequential Monte Carlo methods, typically importance sampling methods, that are often employed to sample from and provide estimates of the distribution of a set or subset of latent variables in a hidden Markov model given observations. They are constructed specifically to provide updated sampling and estimation when additional observations become available without reprocessing all observations.

Consider the hidden Markov model with underlying and unobserved states $X_0, X_1, X_2, \ldots$, transition density $\pi(x_{n+1}|x_n)$, and an initial distribution with density $\pi(x_0)$. (In this paper we will assume a continuous state space, though the sampling techniques described will apply in the discrete case as well.) Suppose that $Y_1, Y_2, \ldots$ represents a sequence of observable variables that are conditionally independent when the unobserved states are given and where each $Y_i$ is related to the unobserved process through a “measurement model” density $\pi(y_i|x_i)$. Such a model is depicted in Figure 1(a).

Our goal is to sample from the density $\pi(x_{0:n}\,|\,y_{1:n})$, where $x_{i:j}$ denotes the vector $(x_i, x_{i+1}, \ldots, x_j)$ for $i \le j$, sequentially in the sense that samples from $\pi(x_{0:n-1}\,|\,y_{1:n-1})$ will be used along with a new observation $y_n$ to produce the desired points. The sampled points can then be used to approximate, for example, expectations of the form $E[f(X_{0:n})\,|\,y_{1:n}]$ and marginal distributions $\pi(x_k\,|\,y_{1:n})$. The estimation of $\pi(x_n\,|\,y_{1:n})$ is known as the filtering problem and that of $\pi(x_k\,|\,y_{1:n})$ for $k < n$ is known as the smoothing problem. In Section 4, we will see that our proposed algorithm, based on rejection sampling “in windows”, vastly outperforms traditional particle methods for the smoothing problem and has some advantages for filtering as well.

The Markov and conditional independence assumptions allow us to write our “target density” for the model in Figure 1(a) in the recursive form

$$\begin{aligned} \pi(x_{0:n}\,|\,y_{1:n}) &\propto \pi(x_0)\prod_{i=1}^{n}\pi(y_i|x_i)\,\pi(x_i|x_{i-1}) &\text{(1a)} \\ &\propto \pi(y_n|x_n)\cdot\pi(x_n|x_{n-1})\cdot\pi(x_{0:n-1}\,|\,y_{1:n-1}) &\text{(1b)} \end{aligned}$$

so that draws from $\pi(x_{0:n}\,|\,y_{1:n})$ can be related to draws from $\pi(x_{0:n-1}\,|\,y_{1:n-1})$.

In Section 2, we review the sequential importance sampling and resampling methods that make up the most commonly used particle filtering algorithm, which is typically used to estimate and sample from (1). We will then offer our new rejection sampling alternative in Section 3. In Section 4 we give simulation results, including direct comparisons to traditional particle filtering and an application to a second hidden layer Markov model.

## 2 Importance Sampling Based Sequential Methods

We now briefly describe existing importance sampling and resampling approaches to the problem of estimating and/or sampling sequentially from a target density

$$\pi(x_{1:n}) = h(x_{1:n})/Z_n, \tag{2}$$

where $Z_n = \int h(x_{1:n})\,dx_{1:n}$. A much more complete review is given by Doucet and Johansen [7]. Without loss of generality, $x_{1:n}$ can be replaced by $x_{0:n}$ and the target density can be a conditional density.

To establish notation, note that, if $X_{1:n}^{(1)}, \ldots, X_{1:n}^{(N)}$ represent points (“particles”) sampled from $\pi(x_{1:n})$, sequentially or otherwise, a simple unbiased estimator of the target density is given by the empirical density that puts weight $1/N$ on each of the $N$ points. We may write this succinctly as

$$\hat{\pi}(x_{1:n}) = \frac{1}{N}\sum_{i=1}^{N} \mathbb{1}\!\left[X_{1:n}^{(i)} = x_{1:n}\right] \tag{3}$$

where $\mathbb{1}[\,\cdot\,]$ is the indicator function that takes on the value $1$ when $X_{1:n}^{(i)} = x_{1:n}$, and zero otherwise.

We can then estimate, for example, $E[f(X_{1:n})]$, for a given function $f$, by

$$\hat{E}[f(X_{1:n})] = \frac{1}{N}\sum_{i=1}^{N} f\!\left(X_{1:n}^{(i)}\right). \tag{4}$$

It is easy to verify that (3) and (4) are unbiased for $\pi(x_{1:n})$ and $E[f(X_{1:n})]$, respectively.

### 2.1 A Review of Non-Sequential Importance Sampling

When one cannot sample directly from $\pi(x_{1:n})$, importance sampling can be used to instead sample points from a more tractable density, and these points can be used to estimate the target density. Suppose that $q(x_{1:n})$ is another density, presumably tractable, with the same support as $\pi(x_{1:n})$.

We can write

$$\pi(x_{1:n}) = \frac{h(x_{1:n})}{Z_n} = \frac{w(x_{1:n})\,q(x_{1:n})}{Z_n} \tag{5}$$

where

$$w(x_{1:n}) := \frac{h(x_{1:n})}{q(x_{1:n})}. \tag{6}$$

Then we proceed to estimate $\pi(x_{1:n})$ and $Z_n$ as follows.

• Sample $X_{1:n}^{(1)}, \ldots, X_{1:n}^{(N)} \overset{\text{iid}}{\sim} q(x_{1:n})$. (“iid” denotes “independent and identically distributed” values.)

• Estimate $q(x_{1:n})$ with

$$\hat{q}(x_{1:n}) = \frac{1}{N}\sum_{i=1}^{N} \mathbb{1}\!\left[X_{1:n}^{(i)} = x_{1:n}\right] \tag{7}$$
• Estimate $Z_n$, when it is unknown, with

$$\hat{Z}_n = \frac{1}{N}\sum_{i=1}^{N} w\!\left(X_{1:n}^{(i)}\right). \tag{8}$$
• Estimate $\pi(x_{1:n})$ with

$$\hat{\pi}(x_{1:n}) = \sum_{i=1}^{N} W_n^{(i)}\,\mathbb{1}\!\left[X_{1:n}^{(i)} = x_{1:n}\right] \tag{9}$$

where

$$W_n^{(i)} = \frac{w\!\left(X_{1:n}^{(i)}\right)}{\sum_{j=1}^{N} w\!\left(X_{1:n}^{(j)}\right)}.$$

It is routine to show that (9) is an unbiased estimator of $\pi(x_{1:n})$ and that the optimal importance sampling density $q(x_{1:n})$, in terms of minimizing variance, is $\pi(x_{1:n})$ itself.
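The estimators above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code; the unnormalized target $h$, the proposal $q$, and the test function $f$ below are hypothetical choices made so that the true values of $Z_n$ and $E[f]$ are known.

```python
import math
import random

def importance_estimates(h, q_sample, q_pdf, f, N=100_000, seed=1):
    """Estimate Z_n as in (8) and E[f(X)] via the normalized weights,
    using N iid draws from the proposal density q."""
    rng = random.Random(seed)
    xs = [q_sample(rng) for _ in range(N)]
    ws = [h(x) / q_pdf(x) for x in xs]       # w(x) = h(x)/q(x), eq. (6)
    Z_hat = sum(ws) / N                       # eq. (8)
    tot = sum(ws)
    Ef_hat = sum(w * f(x) for w, x in zip(ws, xs)) / tot  # normalized weights
    return Z_hat, Ef_hat

# Example: unnormalized N(0,1) target h(x) = exp(-x^2/2), so Z = sqrt(2*pi),
# with a wider N(0, 2^2) proposal and f(x) = x^2 (so E[f] = 1).
h = lambda x: math.exp(-0.5 * x * x)
q_pdf = lambda x: math.exp(-0.5 * (x / 2.0) ** 2) / (2.0 * math.sqrt(2.0 * math.pi))
q_sample = lambda rng: rng.gauss(0.0, 2.0)
Z_hat, Ef_hat = importance_estimates(h, q_sample, q_pdf, lambda x: x * x)
```

Note the choice of a proposal with heavier spread than the target: the weights $w = h/q$ stay bounded, which keeps the variance of both estimators finite.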

### 2.2 Sequential Importance Sampling (SIS)

Sequential importance sampling (SIS) is importance sampling for $\pi(x_{1:n})$ in such a way that draws from $q(x_{1:n-1})$ are “extended” to $n$-dimensional points that are then reweighted to produce draws from $\pi(x_{1:n})$. To this end, the importance sampling density is chosen to have the form

$$q(x_{1:n}) = q(x_1)\prod_{i=2}^{n} q(x_i|x_{i-1})$$

so that it may be sampled from recursively.

The associated importance sampling weights may then also be computed recursively since, for $n \ge 2$,

$$\begin{aligned} w(x_{1:n}) &= \frac{h(x_{1:n})}{q(x_{1:n})} = \frac{h(x_{1:n})}{q(x_{1:n-1})\,q(x_n|x_{n-1})} \\ &= \frac{h(x_{1:n-1})}{q(x_{1:n-1})}\cdot\frac{h(x_{1:n})}{h(x_{1:n-1})\,q(x_n|x_{n-1})} \\ &=: w(x_{1:n-1})\cdot\alpha(x_{1:n}) \end{aligned} \tag{10}$$

where $\alpha(x_{1:n})$ is an incremental weight function that is defined as

$$\alpha(x_{1:n}) := \frac{h(x_{1:n})}{h(x_{1:n-1})\,q(x_n|x_{n-1})}.$$

SIS Algorithm

At the first time step ($n = 1$),

• Sample $X_1^{(1)}, \ldots, X_1^{(N)} \overset{\text{iid}}{\sim} q(x_1)$.

• Compute weights $w\!\left(X_1^{(i)}\right)$ for $i = 1, 2, \ldots, N$ using (6).

At times $n \ge 2$,

• Sample $X_n^{(i)} \sim q\!\left(x_n\,|\,X_{n-1}^{(i)}\right)$ independently for $i = 1, 2, \ldots, N$.

• Compute weights $w\!\left(X_{1:n}^{(i)}\right) = w\!\left(X_{1:n-1}^{(i)}\right)\cdot\alpha\!\left(X_{1:n}^{(i)}\right)$ for $i = 1, 2, \ldots, N$.

At any time $n$, one can estimate $\pi(x_{1:n})$ and $Z_n$ using (9) and (8). One can also obtain approximate dependent draws from $\pi(x_{1:n})$ by sampling from (9). That is, by sampling from the set of values $\left\{X_{1:n}^{(1)}, \ldots, X_{1:n}^{(N)}\right\}$ using respective weights $W_n^{(1)}, \ldots, W_n^{(N)}$.
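The SIS recursion can be sketched as follows. For concreteness this assumes a bootstrap-style setup (a hypothetical choice, not mandated by the algorithm): a linear-Gaussian state space model in which the proposal $q$ equals the model prior, so the incremental weight $\alpha$ in (10) reduces to the likelihood $\pi(y_n|x_n)$; all parameter values are illustrative.

```python
import math
import random

def norm_pdf(x, mu, sd):
    return math.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

def sis_filter_mean(ys, a=0.9, sx=1.0, b=1.0, sy=1.0, N=5000, seed=2):
    """Plain SIS (no resampling) for X_1 ~ N(0,1), X_n = a X_{n-1} + sx*eps,
    Y_n = b X_n + sy*nu.  With q = prior, alpha(x_{1:n}) = pi(y_n | x_n).
    Returns the weighted estimate of E[X_n | y_{1:n}]."""
    rng = random.Random(seed)
    parts = [rng.gauss(0.0, 1.0) for _ in range(N)]       # X_1^(i) ~ q(x_1)
    w = [norm_pdf(ys[0], b * x, sy) for x in parts]        # w(X_1^(i))
    for y in ys[1:]:
        parts = [rng.gauss(a * x, sx) for x in parts]      # extend particles
        w = [wi * norm_pdf(y, b * x, sy)                   # recursion (10)
             for wi, x in zip(w, parts)]
    tot = sum(w)
    return sum(wi * x for wi, x in zip(w, parts)) / tot    # weights as in (9)

est = sis_filter_mean([0.5, 1.0, 0.2])
```

For this linear-Gaussian toy model the exact filtering mean is available from the Kalman filter, which makes the sketch easy to sanity-check; for longer observation sequences the unresampled weights degenerate, which is exactly the problem discussed next.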

In practice, iteration of the SIS algorithm leads to a “degeneracy of weights” problem (see, for example, [2], [5], [6], and [9]) where the weights of all but one particle will approach zero, causing the method to break down and give meaningless results.

One way to avoid the issue of degenerate weights is to implement a resampling scheme at each time step.

SIS Algorithm With Resampling (SIR)

At the first time step ($n = 1$),

• Sample $X_1^{(1)}, \ldots, X_1^{(N)} \overset{\text{iid}}{\sim} q(x_1)$.

• Compute weights $w\!\left(X_1^{(i)}\right)$ for $i = 1, 2, \ldots, N$.

• Compute the normalized weights

$$W_1^{(i)} = \frac{w\!\left(X_1^{(i)}\right)}{\sum_{j=1}^{N} w\!\left(X_1^{(j)}\right)}$$

for $i = 1, 2, \ldots, N$.

• Sample $N$ points, denoted $\tilde{X}_1^{(1)}, \ldots, \tilde{X}_1^{(N)}$, with replacement, from the set $\left\{X_1^{(1)}, \ldots, X_1^{(N)}\right\}$ with respective probabilities $W_1^{(1)}, \ldots, W_1^{(N)}$. That is, sample from

$$\hat{\pi}(x_1) = \sum_{i=1}^{N} W_1^{(i)}\,\mathbb{1}\!\left[X_1^{(i)} = x_1\right].$$

Note that $\tilde{X}_1^{(1)}, \ldots, \tilde{X}_1^{(N)}$ are now equally weighted particles, each with weight $1/N$.

Assign weights $w\!\left(\tilde{X}_1^{(i)}\right) = 1/N$ for $i = 1, 2, \ldots, N$.

At times $n \ge 2$,

• Sample $X_n^{(i)} \sim q\!\left(x_n\,|\,\tilde{X}_{n-1}^{(i)}\right)$ independently for $i = 1, 2, \ldots, N$.

• Extend each particle $\tilde{X}_{1:n-1}^{(i)}$ to $\left(\tilde{X}_{1:n-1}^{(i)}, X_n^{(i)}\right)$.

• Compute associated weights $w\!\left(\tilde{X}_{1:n-1}^{(i)}, X_n^{(i)}\right) = \alpha\!\left(\tilde{X}_{1:n-1}^{(i)}, X_n^{(i)}\right)$ for $i = 1, 2, \ldots, N$.

(This is consistent with SIS since the previous weights $w\!\left(\tilde{X}_{1:n-1}^{(i)}\right)$ in (10) have been replaced by the constant $1/N$ and the current weights have yet to be normalized.)

• Compute the normalized weights

$$W_n^{(i)} = \frac{w\!\left(\tilde{X}_{1:n-1}^{(i)}, X_n^{(i)}\right)}{\sum_{j=1}^{N} w\!\left(\tilde{X}_{1:n-1}^{(j)}, X_n^{(j)}\right)}$$

for $i = 1, 2, \ldots, N$.

• Sample $N$ $n$-dimensional points, denoted $\tilde{X}_{1:n}^{(1)}, \ldots, \tilde{X}_{1:n}^{(N)}$, with replacement, from the set $\left\{\left(\tilde{X}_{1:n-1}^{(i)}, X_n^{(i)}\right)\right\}_{i=1}^{N}$ with respective probabilities $W_n^{(1)}, \ldots, W_n^{(N)}$. That is, sample from

$$\hat{\pi}(x_{1:n}) = \sum_{i=1}^{N} W_n^{(i)}\,\mathbb{1}\!\left[\left(\tilde{X}_{1:n-1}^{(i)}, X_n^{(i)}\right) = x_{1:n}\right]. \tag{11}$$

Note that $\tilde{X}_{1:n}^{(1)}, \ldots, \tilde{X}_{1:n}^{(N)}$ are now equally weighted particles, each with weight $1/N$.

Assign weights $w\!\left(\tilde{X}_{1:n}^{(i)}\right) = 1/N$ for $i = 1, 2, \ldots, N$.

As with SIS, we may, at any time $n$, estimate $\pi(x_{1:n})$ using (11) or by using

$$\hat{\pi}(x_{1:n}) = \frac{1}{N}\sum_{i=1}^{N} \mathbb{1}\!\left[\tilde{X}_{1:n}^{(i)} = x_{1:n}\right]. \tag{12}$$

We may obtain approximate dependent draws from $\pi(x_{1:n})$ by sampling from (12). That is, by sampling uniformly, with replacement, from the set of values $\left\{\tilde{X}_{1:n}^{(1)}, \ldots, \tilde{X}_{1:n}^{(N)}\right\}$.

An obvious issue with the SIR algorithm, known as the sample impoverishment problem, is a decreasing diversity of particles, as those with higher weights will likely be drawn multiple times. Additionally, the resampling step may make the algorithm prohibitively slow and can lead to increased variance of estimates. One way to address the speed and variance is to resample only when there is a small (user-chosen threshold for the) effective sample size

$$N_{\mathrm{eff}} := \left[\sum_{i=1}^{N}\left(W_n^{(i)}\right)^{2}\right]^{-1}. \tag{13}$$

A small effective sample size indicates higher variability of the weights which, in turn, indicates impending degeneracy. In this case, resampling would be especially prudent.

It is important to note that, when resampling, one should originally sample a very large number of points in order to minimize the effects of resampling repeated values from a discrete distribution. Additional details about SIS and SIR procedures can be found in Doucet et al. [5, 6] and in Andrieu and Doucet [1].
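The SIR steps above, with resampling triggered by the effective sample size (13), can be sketched as follows. As with the SIS sketch, the bootstrap proposal, the linear-Gaussian model, and all parameter values are hypothetical illustrations, not the authors' implementation.

```python
import math
import random

def norm_pdf(x, mu, sd):
    return math.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

def sir_filter_mean(ys, a=0.9, sx=1.0, b=1.0, sy=1.0, N=5000, seed=3):
    """Bootstrap SIR for X_1 ~ N(0,1), X_n = a X_{n-1} + sx*eps,
    Y_n = b X_n + sy*nu.  Multinomial resampling is performed only when
    the effective sample size (13) falls below N/2."""
    rng = random.Random(seed)
    parts = [rng.gauss(0.0, 1.0) for _ in range(N)]        # X_1^(i) ~ pi(x_1)
    W = [1.0 / N] * N
    for t, y in enumerate(ys):
        if t > 0:
            parts = [rng.gauss(a * x, sx) for x in parts]  # propagate
        w = [Wi * norm_pdf(y, b * x, sy) for Wi, x in zip(W, parts)]
        tot = sum(w)
        W = [wi / tot for wi in w]                         # normalize
        n_eff = 1.0 / sum(Wi * Wi for Wi in W)             # eq. (13)
        if n_eff < N / 2:                                  # impoverishment guard
            parts = rng.choices(parts, weights=W, k=N)     # resample, with repl.
            W = [1.0 / N] * N
    return sum(Wi * x for Wi, x in zip(W, parts))

est = sir_filter_mean([0.5, 1.0, 0.2])
```

Skipping resampling when $N_{\mathrm{eff}}$ is large avoids both the extra cost and the needless duplication of particles that drives sample impoverishment.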

## 3 The Windowed Rejection Sampler

In order to draw from a possibly unnormalized density $h(x)$ using rejection sampling, one must find a density $q(x)$ and a constant $M$ such that

$$h(x) \le M\,q(x) \quad \text{for all } x.$$

Then, one proceeds as in Algorithm 1.
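The accept/reject loop of Algorithm 1 can be sketched generically as follows. The Beta(2,2) target, Uniform proposal, and bound $M$ in the example are hypothetical choices made so that the correct answer is known in closed form.

```python
import random

def rejection_sample(h, q_sample, q_pdf, M, rng):
    """Repeatedly draw X ~ q and U ~ Uniform(0,1); accept X the first time
    U <= h(X) / (M q(X)).  Requires h(x) <= M q(x) for all x."""
    while True:
        x = q_sample(rng)
        if rng.random() <= h(x) / (M * q_pdf(x)):
            return x

# Example: unnormalized Beta(2,2) target h(x) = x(1-x) on [0,1] with a
# Uniform(0,1) proposal; h is maximized at x = 1/2 with value 1/4, so M = 1/4.
rng = random.Random(5)
draws = [rejection_sample(lambda x: x * (1.0 - x),
                          lambda r: r.random(),
                          lambda x: 1.0,
                          0.25, rng) for _ in range(20_000)]
mean = sum(draws) / len(draws)
```

The accepted draws are exact (not approximate) samples from the normalized target; the price is the random number of proposals per accepted draw, which grows as $M$ grows relative to the normalizing constant of $h$.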

Since $h$ may be a high-dimensional and conditional density, we could, in theory, use rejection sampling to draw exactly from

$$\pi(x_{0:n}\,|\,y_{1:n}) \propto h(x_{0:n}\,|\,y_{1:n}) := \pi(x_0)\prod_{i=1}^{n}\pi(y_i|x_i)\,\pi(x_i|x_{i-1}) \le \underbrace{\left[\prod_{i=1}^{n}\pi(y_i|x_i^{*})\right]}_{M}\cdot\underbrace{\pi(x_0)\prod_{i=1}^{n}\pi(x_i|x_{i-1})}_{q(x_{0:n})}$$

where

$$x_i^{*} = \arg\max_{x_i}\,\pi(y_i|x_i).$$

For most applications of particle filtering in the literature, $x_i^{*}$, or at least an upper bound on $\pi(y_i|x_i)$ with respect to $x_i$, is easily attainable. However, for even small values of $n$, the acceptance probability, which is the ratio of the normalizing constant of $h$ to $M$, can be prohibitively small, and of course we are losing the benefit of sequential sampling. Thus, we propose rejection sampling “in windows”.

### Rejection Sampling in Windows

“Windowed Rejection Sampling” (WRS) is based on the idea that, depending on the covariance structure of the chain, at some point future observed $y$'s will eventually cease to significantly affect earlier $x$'s. For example, if $n = 4$, we can write

$$\begin{aligned} \pi(x_0,x_1,x_2,x_3,x_4\,|\,y_1,y_2,y_3,y_4) &= \pi(x_1,x_2,x_3,x_4\,|\,x_0,y_1,y_2,y_3,y_4)\cdot\pi(x_0\,|\,y_1,y_2,y_3,y_4) \\ &\approx \pi(x_1,x_2,x_3,x_4\,|\,x_0,y_1,y_2,y_3,y_4)\cdot\pi(x_0\,|\,y_1,y_2,y_3), \end{aligned}$$

and so we can sample approximately from $\pi(x_{0:4}\,|\,y_{1:4})$ by first sampling $x_0$ from $\pi(x_0\,|\,y_{1:3})$ and then sampling from $\pi(x_{1:4}\,|\,x_0, y_{1:4})$. Sampling from $\pi(x_0\,|\,y_{1:3})$ can be achieved by sampling from $\pi(x_{0:3}\,|\,y_{1:3})$ and considering only the $x_0$ values. In this example, we say that we are using rejection sampling in a “window of length 4”.

More formally, the WRS algorithm is described in Algorithm 2 for a given window length $w$. We discuss the choice of this tuning parameter in Sections 4 and 5.

## 4 Examples

### 4.1 One Dimensional Normals with One Hidden Layer

We begin, for the purpose of illustration, with a very explicit description of the algorithm with a window length of $w = 3$, for the simple model

$$X_0 \sim N(\mu_0, \sigma_0^2), \qquad X_{n+1} = aX_n + \sigma_X\,\varepsilon_{n+1}, \qquad Y_n = bX_n + \sigma_Y\,\nu_n$$

where $\{\varepsilon_n\}$ and $\{\nu_n\}$ are independent iid $N(0,1)$ sequences. Assume that only the $Y_n$ are observed.

Let $N(x;\,\mu,\sigma^2)$ denote the $N(\mu,\sigma^2)$ density evaluated at $x$.

We begin by using rejection sampling to produce $N$ iid draws from

$$\pi(x_{0:2}\,|\,y_{1:2}) \propto \big[\pi(y_1|x_1)\,\pi(y_2|x_2)\big]\cdot\big[\pi(x_2|x_1)\,\pi(x_1|x_0)\,\pi(x_0)\big]$$

using $q(x_{0:2}) = \pi(x_2|x_1)\,\pi(x_1|x_0)\,\pi(x_0)$.

It is easy to see that

$$x_i^{*} = \arg\max_{x_i}\pi(y_i|x_i) = \arg\max_{x_i} N(y_i;\, bx_i,\, \sigma_Y^2) = y_i/b \qquad \text{for } i = 1, 2, \ldots.$$

We repeatedly draw independent realizations of $x_{0:2} \sim q(x_{0:2})$ and of $u \sim \text{Uniform}(0,1)$ until the first time that

$$u \le \frac{\pi(y_1|x_1)\,\pi(y_2|x_2)}{\pi(y_1|x_1^{*})\,\pi(y_2|x_2^{*})}.$$

Repeating this procedure $N$ times, we collect only the values of $x_0$ and record them as $X_0^{(1)}, X_0^{(2)}, \ldots, X_0^{(N)}$.

Now, moving our window of length $3$ to the right by $1$, we use rejection sampling to produce iid draws from

$$\pi(x_{1:3}\,|\,x_0, y_{1:3}) \propto \big[\pi(y_1|x_1)\,\pi(y_2|x_2)\,\pi(y_3|x_3)\big]\cdot\big[\pi(x_3|x_2)\,\pi(x_2|x_1)\,\pi(x_1|x_0)\big]$$

using

$$q(x_{1:3}\,|\,x_0) = \pi(x_3|x_2)\,\pi(x_2|x_1)\,\pi(x_1|x_0) = N(x_3;\, ax_2,\, \sigma_X^2)\cdot N(x_2;\, ax_1,\, \sigma_X^2)\cdot N(x_1;\, ax_0,\, \sigma_X^2).$$

That is, for each $i = 1, 2, \ldots, N$, we repeatedly draw independent realizations of $x_{1:3} \sim q\!\left(x_{1:3}\,|\,X_0^{(i)}\right)$ and of $u \sim \text{Uniform}(0,1)$ until the first time that

$$u \le \frac{\pi(y_1|x_1)\,\pi(y_2|x_2)\,\pi(y_3|x_3)}{\pi(y_1|x_1^{*})\,\pi(y_2|x_2^{*})\,\pi(y_3|x_3^{*})},$$

collecting the resulting value of $x_1$ and recording it as $X_1^{(i)}$.

Move the window of length $3$ to the right by $1$ to produce draws from $\pi(x_{2:4}\,|\,x_1, y_{1:4})$, and retain the resulting values of $x_2$, recorded as $X_2^{(i)}$, for $i = 1, 2, \ldots, N$.
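The windowed pass just described can be sketched end-to-end as follows. This is a simplified, unoptimized rendering of the procedure for the linear-Gaussian model of this section, not the authors' C++ code; the observation values, window length, and parameter defaults in the example are hypothetical.

```python
import math
import random

def norm_pdf(x, mu, sd):
    return math.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

def wrs(ys, w, mu0=0.0, s0=1.0, a=0.9, sx=1.0, b=1.0, sy=1.0, N=500, seed=4):
    """Windowed rejection sampling for X_0 ~ N(mu0, s0^2),
    X_{n+1} = a X_n + sx*eps, Y_n = b X_n + sy*nu.  Each window of length w
    is drawn from the prior and accepted with probability
    prod_k pi(y_k|x_k) / pi(y_k|x_k^*), where x_k^* = y_k / b.
    Returns kept[k], the N retained draws of X_k, for k = 0, ..., len(ys)."""
    n = len(ys)
    rng = random.Random(seed)
    y_max = 1.0 / (sy * math.sqrt(2.0 * math.pi))   # pi(y_k | x_k^*), any k
    kept = [[] for _ in range(n + 1)]
    for _ in range(N):
        x_prev, j = None, 0                          # x_prev fixes X_{j-1}
        while True:
            hi = min(j + w - 1, n)                   # window covers X_j..X_hi
            while True:                              # rejection loop
                xs, last = [], x_prev
                for k in range(j, hi + 1):
                    last = (rng.gauss(mu0, s0) if last is None
                            else rng.gauss(a * last, sx))
                    xs.append(last)
                ratio = 1.0
                for k, xk in zip(range(j, hi + 1), xs):
                    if k >= 1:                       # Y_k observed for k >= 1
                        ratio *= norm_pdf(ys[k - 1], b * xk, sy) / y_max
                if rng.random() <= ratio:
                    break
            if hi == n:                              # final window: keep all
                for k, xk in zip(range(j, hi + 1), xs):
                    kept[k].append(xk)
                break
            kept[j].append(xs[0])                    # retain X_j, slide window
            x_prev, j = xs[0], j + 1
    return kept

kept = wrs([0.1, -0.2, 0.3, 0.0], w=3, N=400)
```

Every retained draw is exact within its window: the approximation enters only through truncating the influence of observations beyond the window, which is the trade-off governed by $w$.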

Continuing in this manner, we generated $N$ independent values of $X_k$ for $k = 0, 1, \ldots, 10$ for this model using fixed values of the parameters $\mu_0$, $\sigma_0^2$, $a$, $\sigma_X$, $b$, and $\sigma_Y$. (We looked first at such a large $N$ in order to really see the resulting distribution without being concerned about sampling variability.) For comparison, we produced $N$ independent draws directly from the 11-dimensional distribution $\pi(x_{0:10}\,|\,y_{1:10})$ using 11-dimensional (non-windowed) rejection sampling. (As expected, the acceptance probabilities are quite small in this case, and such direct simulation from the target density is not reasonably achieved for much larger $n$.) Finally, we produced $N$ dependent draws approximately from $\pi(x_{0:10}\,|\,y_{1:10})$ using the SIR algorithm. Figure 2 shows the average marginal values for $X_0$ through $X_{10}$ for each algorithm. Only the WRS results change between graphs, showing the anticipated increasing accuracy of the WRS algorithm as the window length increases. For the given model and parameters, a relatively short window length appears to be sufficient. Indeed, as the perfect high-dimensional rejection sampling algorithm is a special case of the WRS algorithm for fixed $n$ and maximal window length, it is easy to code the WRS algorithm once and run it first to get perfect draws from $\pi(x_{0:n}\,|\,y_{1:n})$ for the highest feasible $n$, and then to run it with lower $w$, gradually increasing $w$ until sample statistics such as those shown in Figure 2 reach the desired accuracy.

For this example, the WRS algorithm produced $N$ iid draws from an approximation to $\pi(x_{0:10}\,|\,y_{1:10})$, whereas the SIR algorithm (resampling every time) produced $N$ dependent draws containing far fewer unique 11-dimensional values, and fewer still unique values for $X_0$ marginally. Figure 3 shows marginal distributions for $X_0$ produced by the WRS and SIR algorithms. The overlayed curves are the target normal densities that can be computed analytically for this simple illustrative model. The times to run the WRS and SIR algorithms were comparable. Coded in C++, the WRS algorithm completed in a matter of seconds on a standard laptop (we wish to convey relative speed between algorithms in lieu of particular machine specifications), while the SIR algorithm completed in 8-10 seconds. The SIR algorithm would obviously speed up if resampling were not done at every step and would likely be the faster algorithm if both were programmed in R, where resampling is efficient and accept/reject looping is inefficient.

When comparing the algorithms for larger $n$, it becomes crucial for efficiency not to resample at every time step of the SIR algorithm. As per the discussion at the end of Section 2, we resample only when the effective sample size $N_{\mathrm{eff}}$, given by (13), falls below a threshold chosen following a suggestion in [6]. Due to the Markov nature of the model, we are able to continue forward with the same window length for increasing $n$. The estimated filtering distribution $\pi(x_n\,|\,y_{1:n})$ for a larger value of $n$ is shown in Figure 4 for both algorithms. The SIR algorithm resulted in relatively few distinct values. An estimated smoothing distribution, namely an estimate of $\pi(x_k\,|\,y_{1:n})$ for $k$ small relative to $n$, is shown in Figure 5. It is well known (see [7]) that the SIR algorithm will produce severely impoverished samples for $k$ small relative to $n$, and in this example we are left with only a small percentage of distinct values. Suggested approaches to mitigate degeneracy in SIR are discussed in [7], but the WRS algorithm performs well without modification.

### 4.2 Stochastic Volatility Model

We now consider the stochastic volatility model, often used in finance and also considered in [7], where

$$X_0 \sim N\!\left(0,\ \sigma^2/(1-\alpha^2)\right), \qquad X_{n+1} = \alpha X_n + \sigma\,\varepsilon_{n+1}, \qquad Y_n = \beta e^{X_n/2}\,\nu_n$$

where $\{\varepsilon_n\}$ and $\{\nu_n\}$ are independent iid $N(0,1)$ sequences. Again we assume that only the $Y_n$ are observed.

Following the same procedure as in Section 4.1, using a general window length $w$, we need to define $x_i^{*}$, $q(x_{0:w-1})$, and $q(x_{j:(j+w-1)}\,|\,x_{j-1})$. It is easy to show that

$$x_i^{*} = \arg\max_{x_i}\pi(y_i|x_i) = \arg\max_{x_i} N(y_i;\, 0,\, \beta^2 e^{x_i}) = \ln(y_i^2/\beta^2) \qquad \text{for } i = 1, 2, \ldots.$$
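The maximizer above is easy to check numerically; the following quick sanity check uses hypothetical values of $y$ and $\beta$. At $x_i^{*} = \ln(y_i^2/\beta^2)$ the measurement variance $\beta^2 e^{x_i}$ equals $y_i^2$, so the density value there is $e^{-1/2}/(|y_i|\sqrt{2\pi})$.

```python
import math

def sv_lik(y, x, beta):
    """Measurement density pi(y|x) = N(y; 0, beta^2 * exp(x))."""
    var = beta * beta * math.exp(x)
    return math.exp(-0.5 * y * y / var) / math.sqrt(2.0 * math.pi * var)

y, beta = 1.7, 0.8                         # hypothetical values
x_star = math.log(y * y / (beta * beta))   # claimed maximizer ln(y^2 / beta^2)
# the likelihood at x_star dominates nearby and distant alternatives
grid = [x_star + d for d in (-2.0, -0.5, -0.1, 0.1, 0.5, 2.0)]
best_alternative = max(sv_lik(y, g, beta) for g in grid)
```

The same one-line computation of $x_i^{*}$ gives the bound $M$ for the acceptance ratio in each window, exactly as $y_i/b$ did in the linear-Gaussian example.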

Taking

$$q(x_0) = \pi(x_0) = N\!\left(x_0;\, 0,\, \sigma^2/(1-\alpha^2)\right) \qquad\text{and}\qquad q(x_i|x_{i-1}) = \pi(x_i|x_{i-1}) = N\!\left(x_i;\, \alpha x_{i-1},\, \sigma^2\right),$$

we have

$$q(x_{0:w-1}) = q(x_0)\prod_{i=1}^{w-1} q(x_i|x_{i-1}) \tag{14}$$

and

$$q(x_{j:(j+w-1)}\,|\,x_{j-1}) = \prod_{i=j}^{j+w-1} q(x_i|x_{i-1}). \tag{15}$$

We ran Step 1 of the WRS algorithm (Algorithm 2) with $n = 10$ in order to get perfect draws via rejection sampling from $\pi(x_{0:10}\,|\,y_{1:10})$. ($n = 10$ was chosen as the maximal value after which rejection sampling from $\pi(x_{0:n}\,|\,y_{1:n})$ became prohibitively slow.) We then ran the complete WRS algorithm with $n = 10$ and increasing window lengths $w$, starting from a small value, until the marginal sample means from WRS for $X_0, X_1, \ldots, X_{10}$ produced rough (non-simultaneous) two standard deviation confidence intervals which contained the perfect draw estimates of the means. That is, $w$ was increased until, for each $k$, the perfect draw estimate of $E[X_k\,|\,y_{1:10}]$ fell within two estimated standard errors of the corresponding WRS sample mean. Figure 6 shows the sample means for the marginal draws with WRS aligning with those from both perfect and SIR draws. Figure 7 shows the proportion of surviving distinct values for each marginal distribution using SIR, which is in contrast to the WRS algorithm giving 100% distinct values for all marginals. Figure 8 shows histograms of an early smoothing marginal distribution obtained from the draws from $\pi(x_{0:10}\,|\,y_{1:10})$ using 11-dimensional rejection sampling, the SIR algorithm, and the WRS algorithm. The overlayed curve in each case is the density estimated by 100,000 perfect draws marginalized from 11-dimensional rejection sampling. The SIR algorithm results stand out as the roughest estimate, though, again, it is well known [7] that this “straight” application of the SIR algorithm does not produce good smoothing estimates and that additional steps should be taken to improve it. The WRS algorithm, on the other hand, gives a nice estimate of the smoothing distribution straight away with a speed that is comparable to the initial pass of the SIR algorithm. (By “comparable” we mean that, for the examples in this paper as well as others we tried, the WRS algorithm was either faster than the SIR algorithm or slower but still finishing only a few seconds behind. In the worst case, the WRS algorithm was about one minute behind the SIR algorithm. No particular effort was made to fully optimize either code.)

### 4.3 A “Highly Nonlinear” Model

In this section, we apply the WRS algorithm to the “highly nonlinear” model considered in several papers, including [3], [4], [6], and [8]:

$$X_0 \sim N(\mu, \sigma^2), \qquad X_{n+1} = 0.5X_n + \frac{25X_n}{1+X_n^2} + 8\cos(1.2n) + \varepsilon_{n+1}, \qquad Y_n = 0.05X_n^2 + \nu_n$$

for independent sequences $\{\varepsilon_n\}$ and $\{\nu_n\}$ with $\varepsilon_n \sim N(0, \sigma_X^2)$ and $\nu_n \sim N(0, \sigma_Y^2)$. The parameters were fixed at chosen values.

Taking

$$q(x_0) = \pi(x_0) = N\!\left(x_0;\, \mu,\, \sigma^2\right) \qquad\text{and}\qquad q(x_i|x_{i-1}) = \pi(x_i|x_{i-1}) = N\!\left(x_i;\, m(x_{i-1}),\, \sigma_X^2\right),$$

with $m(x_{i-1}) = 0.5x_{i-1} + \dfrac{25x_{i-1}}{1+x_{i-1}^2} + 8\cos(1.2(i-1))$, we define $q(x_{0:w-1})$ and $q(x_{j:(j+w-1)}\,|\,x_{j-1})$ as in (14) and (15).