Bandwidth Selection In Pre-Smoothed Particle Filters
For the purpose of maximum likelihood estimation of static parameters, we apply a kernel smoother to the particles in the standard SIR filter for non-linear state space models with additive Gaussian observation noise. This reduces the Monte Carlo error in the estimates of both the posterior density of the states and the marginal density of the observation at each time point. We correct for variance inflation in the smoother, which together with the use of Gaussian kernels, results in a Gaussian (Kalman) update when the amount of smoothing turns to infinity. We propose and study of a criterion for choosing the optimal bandwidth in the kernel smoother. Finally, we illustrate our approach using examples from econometrics. Our filter is shown to be highly suited for dynamic models with high signal-to-noise ratio, for which the SIR filter has problems.
Keywords: adaptive bandwidth selection; kernel smoothing; likelihood estimation; particle filter; state space model; variance inflation
State space models are commonly used to represent dynamical systems in a wide range of scientific fields. For linear and Gaussian state space models, the Kalman Filter can be used to sequentially obtain the posterior mean and covariance of the current state vector, as well as the likelihood function required for estimation of model parameters. Gaussian mixture filters (Alspach and Sorenson, 1972) were among the first attempts to account for non-normality in the posterior, resulting from non-linearity, either in the state equation or in the observation equation. Later, sequential Monte Carlo (MC) based filtering methods, collectively known as particle filters, were introduced (Gordon et al., 1993; Kitagawa, 1996; Liu and Chen, 1998). The particle filter has the prospect of providing a sampling-based consistent estimate of the posterior distribution, but in many cases the sample size (number of particles) required to bring the MC error within tolerable bounds is prohibitively large. Consequently, there is now a large literature on improving the baseline particle filtering algorithms to work for a moderate numbers of particles. These include Pitt and Shephard (1999), various methods proposed in the chapters of Doucet et al. (2001) and more recently Polson et al. (2008), Chorin and Tu (2009) and Chopin et al. (2013).
Recently, a renewed interest in the use of particle filters for computing marginal likelihood (integrating over state variables) for the purpose of parameter estimation has emerged (Fernandez-Villaverde and Rubio-Ramirez, 2007; Andrieu et al., 2010; Kantas et al., 2009; Malik and Pitt, 2011; DeJong et al., 2013). This is also the context of the present paper. Similar to Malik and Pitt (2011) and DeJong et al. (2013) we obtain a likelihood approximation which is continuous in the parameters, hence facilitating numerical optimization. We target situations with highly non-linear state evolution, high signal-to-noise ratios, and with low-to-moderate dimensional state vector, for which adaptation is difficult.
Throughout, we assume that the measurement model is linear and Gaussian, which at first glance may appear restrictive. However, non-linear measurement equations with additive Gaussian noise can also be handled by a simple augmentation of the state vector, as shown in Section 3.1 below.
Let and denote the state vector and observation vector, respectively, at time , and define . In particle or ensemble methods the predictive density is represented by a random sample. We use a kernel smoother which can be updated analytically against a linear Gaussian measurement model . From the resulting mixture approximation of the posterior we draw a uniformly weighted sample of particles, which after a parallel run through the state equations, constitutes the approximation of the next predictive distribution . The resulting filter, which we call the Pre-Smoothed Particle Filter (PSPF) is a special case of the preregularized particle filter (Le Gland et al., 1998; Hürzeler and Künsch, 1998).
The main contribution of the present paper is to determine the optimal amount of smoothing in each updating step of the PSPF. This is done adaptively, i.e. for each time point an optimal bandwidth paramter is sought. For small the PSPF approaches the SIR filter, i.e. has low bias but high variance. Further, we correct for variance inflation (Jones, 1991), and hence when the kernel estimate reduces to a Gaussian density with mean and covariance calculated from the ensemble representation of . At this end of the spectrum the PSPF is strongly related to the Ensemble Kalman Filter (Stordal et al., 2011), which has low MC variance but high bias.
The rest of this paper is laid out as follows. Section 2 introduces notation and explains challenges related to particle filtering. Section 3 explains the pre-smoothed update and provides a method for automatic bandwidth selection. Section 4 introduces the PSPF, and also compares the PSPF to other particle filters using simulation experiments. Finally, Section 5 outlines two realistic applications, and Section 6 provides a discussion.
2 Model, Notation and Background
2.1 Model and notation
We consider a generic state space model consisting of a state transition equation and an observation equation, with the former given by
where is the state transition function
(). The random disturbance term , which can account for incomplete model specification, may be either absent, of fixed dimension, or of infinite dimension in the case that the state dynamics are governed by a stochastic differential equation. Under the assumption that the s are independent (1) describes a Markov process, with transition probability density denoted by . Given the realization of , evaluation of typically amounts to solving a differential equation. The system (1) is initialized by drawing from a distribution with density . It is assumed that and are sufficiently regular to ensure that is continuous, and thereby that all involved conditional distributions can be estimated consistently using kernel density estimators.
The observation equation is
where , is non-degenerate and the matrix is independent of the state, but may vary non-stochastically with time . Moreover, we use the notation , . denotes the multivariate Gaussian probability density function evaluated at , the identity matrix. Finally, we indicate which stochastic variable an expectation or variance is taken over using subscripts (e.g. when expectation is taken over variable ).
2.2 The SIR filter and sample impoverishment
This section introduces particle filters and the Sampling Importance Resampling (SIR) filter of Gordon et al. (1993), which is the limit of PSPF as . Any particle filtering approach relies on alternating between two steps: prediction (p) in which is represented by a random sample , and filtering (f) in which similarly is approximated by . These random samples of size are referred to as filter- and predictive swarms, respectively, and are updated iteratively from each other. The prediction step, used in both SIR and PSPF, consists of , where the are independent random draws from the distribution of . In the filtering step Bayes formula is invoked:
where is a normalizing constant. Obtaining a uniformly weighted sample to complete next time-step’s prediction is simply a matter of drawing multinomially from with weights (4). A byproduct of the SIR filter is that the marginal likelihood of needed for parameter estimation can be approximated as
for large (see e.g. Del Moral (2004, Proposition 7.4.1.)).
Sample impoverishment in the SIR filter occurs when, at time step , the predictive particle swarm and the data likelihood are poorly aligned (see e.g. Pitt and Shephard (1999)). The multinomial probabilities (4) then become very unevenly distributed, and the multinomial sampling will yield many repeated particles. Over time the swarm will degenerate in the sense that all particles can be traced back to a single particle in the initial swarm ( Sample impoverishment also increases the MC error of the likelihood estimator (5). This is likely to occur during numerical optimization of the likelihood, when the optimization algorithm tries an infeasible parameter value rendering the particle swarm and the data likelihood incompatible. The effect is amplified by a high signal-to-noise ratio in the system. Numerous strategies have been proposed for aligning (adapting) the predictive swarm to the coming observation (see e.g. Cappe et al. (2007) for an overview), but these typically rely on evaluation of (or some of the characteristics of ) which may be costly. The PSPF, on the other hand, avoids evaluation of , and relies only on the ability to simulate (1) efficiently.
3 The Pre-Smoothing Update
In this section we consider the pre-smoothing (PS) update, as an alternative to Rubin (1987)’s SIR update when the observation equation is linear in the state and additively Gaussian. Focusing on a single updating step we can drop the index in our noation. In a general perspective, the problem we address is the Bayesian updating problem of evaluating the posterior density and the marginal density when the prior is represented by a random sample. In particular, we focus on optimal selection of the smoothing parameter in the PS update, with the overarching objective of producing accurate estimates of . In Section 4 we again return to the filter setting.
3.1 The updating problem
Consider the evaluation of the posterior and marginal , for the model
in a setting where is an unknown prior density, while and are given matrices. The available information about is a random sample drawn from . Our aim is to estimate both and for a given . We denote by and the empirical mean and covariance matrix of the sample , respectively.
where the smoothing parameter governs the proportions of the total variance under stemming from inter-kernel variance () and intra-kernel variance () in the Gaussian mixture (8). The replacement of the more conventional bandwidth parameter by simplifies certain expressions in what follows. The estimator (8) avoids the “variance inflation” to which the standard kernel estimator (Silvermann, 1986) is prone, as it is easily verified that the mean and variance under is and , respectively. For close to 1 ( close 0) behaves as a standard kernel estimator with equally weighted point masses located at as the limit. For () a Gaussian limit is obtained, i.e. .
By substituting for in the Bayes rule (3), we obtain the PS estimators
In our notation we have omitted the dependence on .
|Gaussian ()||SIR ()|
|Property||High bias, low variance||Low bias, high variance|
As varies from 1 to 0 the PS updates moves from a SIR update to the Gaussian update, both of which are summarized in Table 1. The mean of each posterior mixture component concurrently moves smoothly from () toward what is dictated by the likelihood, reducing the potential for sample impoverishment. In the same vein, we have as , i.e. uniform weighting. These properties of the PS update (and the updates employed in other pre-regularized filters) differ from those of the update mechanisms employed in post-smoothed particle filters advocated by Musso et al. (2001) and Flury and Shephard (2009), where the (one step) posterior locations and weights are unchanged relative to the SIR. However, these latter approaches do not require the Gaussian data-likelihood which is underlying the PS update.
The fact that is a finite Gaussian mixture (for ) has a number practical advantages. Firstly, moments, marginal- and conditional distributions of the approximate posterior are easily derived from the representation (12). Further, has continuous support, and therefore direct copying of particles, which is applied in the SIR filter, is avoided in the resampling step. Sampling from is trivial. Moreover continuous (with respect to the parameters) sampling, resulting in a continuous simulated likelihood function, can be implemented.
The apparently restrictive linearity assumption (6) can be relaxed by augmenting the state variable . The case of a non-linear measurement function with additive Gaussian noise can be accommodated without any conceptual change to the framework. The measurement variance is then split in two parts and , , and the augmented state vector is where is an auxiliary variable introduced for convenience. For the augmented system, equations (6)-(7) take the form
where is the induced prior and is the matrix that selects from . Now, is a tuning parameter that must be chosen jointly with . Estimates of are easily obtained as a marginal in the finite mixture representation of . An application of this approach is given in section 5.2 below.
3.2 Criterion for smoothing parameter selection
A critical part of the PS update is the selection of the smoothing parameter, with the aim of obtaining both a representative posterior particle swarm and and accurate estimate of the marginal likelihood. For this purpose Flury and Shephard (2009) argue that the integrated mean squared error (MISE) of , which is commonly used in the kernel smoothing literature (Silvermann, 1986) is not a suitable criterion in a particle filter setting. Nevertheless, is has been used for pre- and post-smoothed particle filters by e.g. Le Gland et al. (1998); Hürzeler and Künsch (1998); Musso et al. (2001). Instead, Flury and Shephard (2009) propose to minimize the MISE of the posterior cumulative distribution function. We propose a third criterion, namely to minimize the mean squared error (MSE) of , which is given as
This criterion has the advantage of being analytically simple, in addition to targetting minimal Monte Carlo error in the likelihood function as explained below. Minimization of gives an optimal bias-variance balance that depends on the observation .
Switching momentarily to a dynamical system setting (with ) for the reminder of this paragraph, estimates , and thus choosing (13) as the criterion targets directly the factors involved in the likelihood function (5). However, it should be noted in the dynamic setting that current period’s filter distribution must be represented accurately as it serves as an important input to next period’s likelihood evaluation. We show in section 4.2 that using an approximation to (which targets ) also leads to competitive estimates of the filtering distribution (i.e. ). This will in particular be true whenever most of the information carried in comes from the likelihood (which is typical for the class of models we consider) as is almost proportional to , and therefore the posterior estimator is relatively insensitive to the choice of smoothing parameter. On the other hand, assuming a concentrated observation likelihood, and in addition that is invertible (i.e. ), will be highly sensitive to the choice of smoothing parameter since a zeroth order approximation of is proportional to . Hence it appears sensible to choose even in a dynamic setting, in particular in high signal-to-noise situations.
3.3 Plug-in and approximation
There are two obstacles to direct use of the criterion as given in (13). First, the expectation is taken over which has unknown distribution (see (7)). The same problem occurs in standard kernel estimation, and is solved by the use of a plug-in estimator (Silvermann, 1986). The second problem is caused by the use of the shrunk kernel estimator, which involves the empirical (depending on ) quantities and through (9) and (10). Even if was known, analytical evaluation of the expectation in (13) would not be possible, and we have to resort to an approximation. Jones (1991) encounters the same problem, but argues that the effect can be ignored asymptotically when and . However, we consider the full range so the same asymptotic arguments do not apply. Instead we attempt to approximate the expectation (13) for finite .
We start by addressing the second problem. For the purpose of evaluating the mean and variance in (13) we replace and in expressions (8-11) by new random variables, and , respectively, both taken to be independent of . The simplification (approximation) lies mostly in this independence assumption, but also in the distributional assumptions made about and below. The reason that we cannot ignore the sampling variability in and is that the variance term in (13) would then be exactly zero for . Hence, for small we would underestimate the MSE of .
We make the following distributional choices
i.e. and are distributed as if they where calculated from a sample of iid vectors. Plug-in versions of (14) and (15), i.e. where has been replaced by and by , are used immediately below for notational convenience. Strictly speaking these replacements take place after all moment calculations have been carried out.
After these simplifications, it is necessary to restate our criterion
where expectation and variance now is taken relative to , and , which we emphasize are independent by assumption. Writing out the details of (16) we get where
with and .
The next sections outline pilot distributions and develop asymptotic approximations (in ) that will enable us to evaluate the mean and variance in (16).
|where and .|
For the variance term in , we employ for convenience a Gaussian pilot,
For the squared bias term in (16) a Gaussian pilot is ruled out because, as shown below, this leads asymptotically to zero bias for all . Instead a two-component Gaussian mixture
is used. The bias pilot is flexible, allowing for analytical
computations of moments, and
may be estimated from using an EM-algorithm (see e.g. McLachlan and Peel, 2000, section 2.8 for details). To minimize the computational burden we perform only a few EM-iterations, and further computational savings are obtained by running the EM on a subsample of when is large.
Practical squared bias
Under the above introduced simplifying approximations, and in particular under pilot density , we have that
A closed form expression for is given in Table 2. The expectation over in (20) does not appear to have closed form, and we therefore employ the asymptotical (in ) mean statement of Corollary 2.2 of Iwashita and Siotani (1994) to obtain
where serves as the practical approximation to .
Note that in (27) depends on and is hence estimated using the pilot density needs to be estimated. Under the pilot density it has a closed form expression
Finally, is taken as the practical squared bias term. Note in particular that it is easily verified that also the practical squared bias vanishes for , as in this case.
To underpin the claim that a non-Gaussian bias pilot is needed, we momentarily choose the parameters of so that coincides with , e.g. via , , , . Then whereas which shows that the practical bias would vanish as for all if a Gaussian bias pilot was employed.
The variance of , taken under pilot density , relies on the identity developed in Appendix A:
where explicit expressions for are can be found in Table 2. As in the calculations leading to the squared bias, the expectations and variance over in (23) do not appear to have closed form expressions. Consequently we employ mean statement of Corollary 2.2 of Iwashita and Siotani (1994) to obtain
Here, is the Jacobian matrix of up to a factor with explicit expression given in Table 2, and tr denotes the matrix trace. Combining (24) and (25) with (23) yields the practical variance approximation as
which will be used throughout the rest of the paper.
Given a prior sample , observation and matrices , , the location of one approximately optimal smoothing parameter involves the following steps:
Calculate , from and estimate from (possibly a subset of) using a few iterations of the EM-algorithm.
Compute using (22).
Numerically minimize (27) with respect to , and return the minimizer as .
Our MATLAB implementation of the smoothing parameter selection procedure uses the built in function gmdistribution.fit for estimating and the minimization of is carried out using the fminbnd optimizer. Our corresponding C++ implementation uses routines similar to those provided in Press et al. (2007), section 16.1 (EM) and section 10.4 (minimization).
4 Pre-Smoothed Particle Filters
Equipped with an optimal one-step PS update, PSPF for accumulating the log-likelihood follows quite directly, and consists of the following steps:
Simulate , set and .
As for the SIR filter (section 2.2) set to obtain an approximate sample from .
Compute optimal smoothing parameter using the algorithm given in section 3.3.4 with , and given by the model specification.
Sample from the posterior representation (12) to obtain an equally weighted approximate sample from .
Set so that now approximates .
If , set and go to step 2, else stop.
Various bias reduction techniques are available (see e.g. Shephard and Pitt (1997)) for computing the log-likelihood increment , but we do not employ those here in order to make comparisons easy. The PSPF has, like the SIR filter computational complexity . We have found that in practice the bottleneck in the PS update is actually fitting the prior pilot using the EM algorithm. Other, more problem specific pilots are conceivable, but we do not discuss this further here.
4.1 Continuous resampling
The resampling step obtains equally weighted particles from a finite mixture representation (12) of the filter distribution. Resampling is typically done by repetition of first drawing a component in the mixture, and subsequently sampling from this component. This process originates discontinuities in the simulated likelihood function, even if common random numbers are applied for repeated evaluation, and makes maximizing the simulated log-likelihood difficult. This issue was first addressed by Pitt (2002), who obtains continuous draws from a univariate mixture of point masses. For multivariate finite Gaussian mixture representations, Malik and Pitt (2011) provide an algorithm that may be used to as the resampling step in the PSPF for arbitrary (as the variance in each component of are equal). However for we have found algorithms based on computing on a fine grid using fast Fourier transform (FFT) methods and sampling from the corresponding CDFs desirable (more detailed descriptions are given in Appendix B). Like Malik and Pitt (2011)’s algorithm, these FFT-based algorithms have linear complexity in , but we find them easier to program and tune.
4.2 Comparison with other particle filters
To compare the proposed methodology with currently preferred particle filters, we carry out some simulation experiments.
The first model we consider is is given as
where denotes a matrix with each element equal to 1.
The distribution of is a Gaussian mixture consisting of three equally weighted components
. We consider each combination of dimensions and measurement error scales , The log-likelihood is available via the Kalman filter by conditioning on each component in the mixture, and therefore admit comparison between the particle filter-based log-likelihood approximations and the truth for this globally non-Gaussian model. We consider a short time series so that the non-Gaussian features introduced by the initial Gaussian mixture do not die out. The contending filters are PSPF, SIR, Ensemble Kalman Filter (EnKF), and post- and pre-smoothed regularized filters as described in Musso et al. (2001). The latter two are implemented using Gaussian kernels with bandwidth selection based on the standard MISE-based plug-in formulas (Musso et al., 2001, equation 2.3), and are therefore referred to as MISE-Post and MISE-Pre respectively. The above mentioned filters rely only on simulation from the state equation, and are therefore directly comparable with respect to scope. As additional references, we also compare with Auxiliary SIR (ASIR, based on the mean of as described in section 3.2 of Pitt and Shephard (1999)) and a fully adapted Auxiliary SIR (FASIR, Pitt and Shephard (1999)). ASIR and FASIR use knowledge of the mean of , and knowledge of the full distribution of respectively, and are therefore not directly comparable to PSPF with respect to scope. We report in Table 3 the bias (loglike. bias), standard error (loglike. std.dev.) and RMSE (loglike. RMSE) of the respective estimates of across 10,000 data sets simulated from (28-29). In addition, as a crude measure for comparing the posterior simulation performance, we also report the square root of the expected squared Euclidian distance between the mean of and the simulated (filter RMSE). We used for PSPF and for the other filters so that the computing times for each filter are comparable when implemented in MATLAB. The mean computing times relative to PSPF (relative CPU time) are also reported in Table 3. For all filters but EnKF, non-continuous resampling was performed in each time step.
From Table 3, we see that PSPF produces smaller log-likelihood RMSEs then other filters that are based only on simulation of the state, except in the case where MISE-Pre has the smallest log-likelihood RMSE. However, assuming momentarily that the log-likelihood RMSEs are , it should be noted that even in case, the log-likelihood RMSE would be the smallest for PSPF if the same was applied. The log-likelihood performances of the PSPF and the EnKF are fairly similar, but it should be noted that the EnKF is not consistent, and therefore the biases cannot be eliminated. For increasing dimensions and fixed , PSPF and EnKF becomes more similar, which is a consequence of being chosen closer to 0 in the high- cases (to counteract the curse of dimensionality). SIR and MISE-Post perform poorly with respect to log-likelihood estimation in all the high signal-to-noise ratio () cases, and also in the moderate signal-to-noise ratio () cases for . MISE-Pre performs well in the cases, but the performance relative to PSPF deteriorates as grows.
The ASIR exhibit highly variable second stage weights, suggesting that the generic importance sampling density implicitly introduced works poorly for this model. As it is the optimal one step ahead particle filter, FASIR works extremely well in all cases, with log-likelihood RMSEs that are two orders of magnitude smaller than PSPF. Thus in the (not too often encountered) cases where full adaptation is possible, one should opt for the FASIR over the PSPF.
With respect to posterior simulation performance, PSPF produces filter RMSE results almost identical to those of FASIR, indicating that the posterior samples of PSPF are close to those of the optimal one step ahead filter. MISE-Pre also produces filter RMSEs close to those of FASIR, which underpins the claim made in Section 3.2 for this model, namely that the posterior estimator of pre-smoothed updates are relatively insensitive to the choice of smoothing parameter. In the same vein, the log-likelihood results of PSPF relative to those of MISE-Pre show that log-likelihood estimation is more sensitive to smoothing parameter selection and therefore targeting MSE() as is done here seems highly sensible.
|log-like. std. dev.||0.303||27.49||0.418||24.82||0.328||27.55||0.006|
|relative CPU time||1.0||0.6||0.5||1.2||1.0||0.9||1.2|
|log-like. std. dev.||0.608||649.5||0.676||647.4||1.192||585.2||0.007|
|relative CPU time||1.0||1.1||1.2||2.4||1.7||1.6||2.1|
|log-like. std. dev.||0.813||4.6e3||0.798||4.6e3||2.540||4.3e3||0.008|
|relative CPU time||1.0||1.4||1.7||2.9||2.1||2.1||2.6|
|log-like. std. dev.||0.291||0.300||0.404||0.244||0.173||6.772||0.006|
|relative CPU time||1.0||0.8||0.5||1.5||1.0||1.0||1.1|
|log-like. std. dev.||0.597||4.510||0.671||4.420||0.761||8.899||0.009|
|relative CPU time||1.0||1.1||1.2||2.2||1.7||1.7||2.0|
|log-like. std. dev.||0.809||41.74||0.797||41.99||1.983||29.49||0.012|
|relative CPU time||1.0||1.4||1.7||2.7||2.1||2.1||2.6|
For the near-Gaussian model (28-29), the EnKF has a similar performance to PSPF. To further explore the difference between PSPF and EnKF, we consider a second simulation experiment that involves the non-linear model
where . The non-linear measurement equation is taken from a well-known test case used by e.g. Andrieu et al. (2010). In particular, such models are capable of generating bimodal filtering distributions as the sign of cannot be determined from observations . For the PSPF and EnKF filters to be applicable, we need to augment the state as indicated in section 3.1, namely