Sequential Bayesian Inference for Dynamic State Space Model Parameters

# Sequential Bayesian Inference for Dynamic State Space Model Parameters

Arnab Bhattacharya and Simon Wilson

## 1. Introduction

Dynamic state-space models (Durbin and Koopman, 2001), consisting of an unknown state Markov process and noisy observations of that process that are conditionally independent, are used in a wide variety of applications e.g. wireless networks (Haykin et al., 2004), object tracking (Ristic et al., 2004) etc. The model is specified by an initial distribution , a transition kernel and an observation distribution . These distributions are defined in terms of a collection of static (e.g. non-time varying) parameters . The joint model to time is:

 p(y1:t,x0:t,θ)=(t∏j=1p(yj|xj,θ)p(xj|xj−1,θ))p(x0|θ)p(θ), (1.1)

where .

In this paper, we propose a new approach for approximating . The principal idea is to approximate the posterior on a carefully constructed grid to which points are added or deleted where necessary. Our method is used alongside a large range of state process estimation algorithms making it highly flexible, even for non-linear or non-Gaussian models. We provide evidence corresponding to our claims through a series of examples.

The literature has tended to focus on computation of the predictive distributions and , and the filtering distributions . Updating of these distributions is by the well-known forward equations (Arulampalam et al., 2002); for example for the prediction of state process we need to compute

 p(xt|y1:t−1,θ)=∫p(xt|xt−1,θ)p(xt−1|y1:t−1,θ)dxt−1. (1.2)

For the linear Gaussian case and assuming known parameters, these computations reduce to the closed form of the Kalman filter (Kalman, 1960). Generally, Gaussian forms of the model allows filtering and prediction to be done quickly, exactly and sequentially, without the need to store the data sequence (West and Harrison, 1997). Exact inference has also been achieved for models where the state process is discrete (Cappé et al., 2005).

Extending inference to non-linear and/or non-Gaussian models has proved to be challenging since analytical solutions to the above integrals do not exist. Functional approximation approaches, derived from the Kalman filter, such as the extended Kalman filter (Haykin, 2001), unscented Kalman filter (Julier and Uhlmann, 1997) etc have been proposed. Monte Carlo approaches such as the bootstrap filter (Gordon et al., 1993) and auxiliary particle filter (Pitt and Shephard, 1999) have also been widely used. Attempts have also been made to combine sampling based methods with functional approximations (Merwe et al., 2001).

As regards the static parameter estimation problem, few closed form solutions are available. Different approaches are employed, often involving joint inference of static parameters and the state process (Kantas et al., 2015). We will discuss some of the existing methodologies in the next section.

Section 2. is a review of some of the online static parameter estimation methods, some of which are used later for comparison. Section 3. outlines the principle of the method. Sections 4. and 5. describe the two main issues to be resolved in order to implement the method: approximations to one-step ahead filtering and prediction densities, and updating the grid defining posterior density. A discussion on the theoretical aspects of error accumulation is provided in section 6. Section 7. illustrates the method and assesses its performance against alternative approaches. Section 8. contains some concluding remarks.

## 2. Online Parameter Inference Review

In this paper, interest is in online inference of the the posterior distribution , along with the filtering and prediction densities associated with the state process (and not on the joint density); a recent review is Kantas et al. (2015). Several approaches have been proposed in the literature:

Joint KF/EKF/UKF: A common practice in the engineering literature is to add dynamics to static parameters, such as assuming with variance decreasing with , and make inference on at each time point using a single Kalman filter or extensions thereof. One advantage of this is that the state and parameters are typically correlated a posteriori, even in linear systems (Haykin, 2001); however this is known to suffer from numerical instability issues.

Dual KF/EKF/UKF: Dual filters assume that the state and the parameter have separate state space representations, and thus two filters can be run concurrently (Wan and Nelson, 1997). The prediction is derived using the parameter mean and is updated using the filtered state mean. Because each posterior only uses the first moment of the other posterior and ignores the variance, these methods are known to produce low variance posteriors.

Online gradient method: Sequential optimization of is also possible. If is the estimate after the first observations, it is updated to after receiving a new data (Poyiadjis et al., 2005). The problem of evaluating the gradient for the whole data has been bypassed in the case of hidden Markov models (LeGland and Mever, 1997). A problem associated with any gradient method is that it is extremely sensitive to initialization and may converge to a local maximum.

Online EM: Online versions of the EM algorithm (Dempster et al., 1977), suitable for a dynamic state space model with unknown model parameters, have been proposed in Cappé (2011). A major advantage is that it always attempts to maximize the likelihood, allowing methods such as variational inference to be used to estimate the parameters. However this method can also converge to a local maximum.

Liu and West filter: This is the most generic among particle filter that performs dual estimation of the state and parameters. Artificial dynamics are introduced for the parameter and subsequently a kernel density estimate of is proposed from which can be sampled. Shrinkage is introduced to control for over-dispersion in the kernel density function. A major drawback though is that it requires a significant amount of tuning for quantities such as the kernel bandwidth.

Storvik’s filter: Storvik (2002) generates particles from the parameter’s posterior distribution without assuming any associated random drift. It is further assumed that the posterior distribution of depends on a low-dimensional set of sufficient statistics that can be efficiently updated for each . The choice of this set is the biggest stumbling block for this algorithm.

Particle learning: Carvalho et al. (2010) have provided a modified version of Storvik’s filter which has proved to be more efficient. Sufficient statistics are derived for the class of conditional dynamic linear models (CDLM), thus providing additional structure to the algorithm. Further, the position of the resampling step in Storvik’s algorithm is now interchanged with the propagation step, that allows particle deficiency of the posterior of to be reduced.

For large , particle degeneracy for sampling filters is difficult to avoid, even where artificial drift has been added (Andrieu et al., 2005; Kantas et al., 2015). Online EM and gradient ascent methods do not have this problem. The method in this paper has the same advantage as that of the latter approaches, making it suitable for long sequences of data. It is to be kept in mind that ours is an on-line sequential algorithm and the computational load does not increase over time, unlike other generic methods like (Chopin et al., 2013).

## 3. Principle

The principle of the proposed method is based on two observations.

The first observation is that many dynamic state space models have a relatively small number of static parameters, so that in principle can be computed and stored on a discrete grid of practical size. This has been noted as a property of many latent models (Rue et al., 2009).

The second observation is that there are useful identities for parameter estimation in latent models. The one of interest here is a sequential version of the basic marginal likelihood identity or BMI (Chib, 1995),

 p(θ|y1:t)∝p(y1:t,x0:t,θ)/p(x0:t|y1:t,θ)|x0:t=x∗(θ), (3.1)

that is valid for any for which and forms the basis for the integrated nested Laplace approximation (INLA); see Rue et al. (2009).

The sequential version of the BMI, which lies at the heart of this approach, is:

 p(θ|y1:t)∝p(θ|y1:t−1)p(yt|xt,θ)p(xt|y1:t−1,θ)/p(xt|y1:t,θ)∣∣xt=x∗(θ). (3.2)

Typically although the identity is valid for any where . Equation 3.2 is useful for sequential estimation because its computation does not grow with , as is the case with Equation 3.1. Taking Equation 3.2, then if prediction and filtering approximations and are available, any approximation can be updated:

 ~p(θ|y1:t)∝~p(θ|y1:t−1)p(yt|xt,θ)~p(xt|y1:t−1,θ)/~p(xt|y1:t,θ)∣∣xt=x∗(θ). (3.3)

Assuming that of low dimension, computing Equation 3.3 on a discrete grid offers the potential for fast sequential estimation.

Alternatively, the sequential version of Bayes’ theorem,

 p(θ|y1:t) ∝p(θ|y1:t−1)p(yt|y1:t−1,θ). (3.4)

and the availability of an approximation from any of the filters mentioned, also allows for updating. Furthermore, using , one can compute the marginal log likelihood , where

 p(yj|y1:j−1)=∫p(yj|y1:j−1,θ)p(θ|y1:j)dθ≈∫~p(yj|y1:j−1,θ)~p(θ|y1:j)dθ. (3.5)

The integration in Equation 3.5 is approximated quickly by a sum over the grid of values of .

This suggests the following sequential estimation algorithm when an approximate posterior predictive distribution, or just state prediction and filtering distributions, are available. The process is initiated by choosing a prior and then deterministically selecting a discrete set of points, say , to approximately cover the region of high density of the prior, and compute at each of the selected points. The posterior of is updated at each point by Equation 3.3 or 3.4. We would like to reiterate that the objective of this study is not to track the complete trajectory of the state process, but rather to update the posterior parameter and state-filtering density over time.

In this method, may also be approximated by INLA up to some specified time , and from then on the sequential update of Equation 3.3 or 3.4 is used. This can be particularly helpful because INLA produces a grid that usually provides good support. The value of is determined by how quickly the update is required, given that the INLA computation slows down with . Note that can be equal to , i.e. INLA is not used at all; this means the starting grid is constructed on the prior .

A few issues remain to be addressed in order to implement this algorithm: the form of the approximations or that of and and how to adapt the grid so that it tracks the support of the posterior over time. These issues are addressed in the next sections.

## 4. Posterior Predictive and Filtering Density Approximations

For the Kalman filter, all the distributions required in Equations 3.2 and 3.4 are Gaussian, producing exact updates. The means and variances of these Gaussians are sequentially updated. West and Harrison (1997) is a comprehensive study of models of this type.

There is a large literature about more general filtering algorithms; see Doucet et al. (2001). In this paper, we have used functional approximations of the filtering distributions based on extensions to the Kalman filter that incorporate non-linearity and non-Gaussian error while holding on to the basic principle of updating the first two moments. The extended Kalman filter linearizes a non-linear model to create a Gaussian approximation to the filtering and prediction densities. The unscented Kalman filter (Julier and Uhlmann, 1997) also produces Gaussian approximations to the filtering and prediction densities but propagates the means and covariances through the non-linear function. It tends to be more accurate than the extended Kalman filter, more so for strongly non-linear models.

Other finite dimensional suboptimal filters have also been proposed. Some of them are the Gaussian sum filter (Ito and Xiong, 1999), quadrature filters (Davis and Rabinowitz, 2007) etc. In our examples, we have used the Kalman filter, the unscented Kalman filter and the quadrature filter.

## 5. Updating the Grid

Recall that is of dimension and assume that there is an initial grid available at time . If INLA is not used then and an initial grid is assumed that covers the region of high density of the prior. It is also assumed that the grid is made up of points along a set of basis vectors for , so that it can be written as the Cartesian product

 Θt=Θt1×Θt2×⋯×ΘtK, (5.1)

where each is a set of ordered values for the th component of .

### 5.1 Checking the Grid

For , the grid is checked every observations to see if it is still an appropriate support for .

The following simple approach appears to perform well in the examples of the next section. First, the basis that represents the coordinate system for is not changed; it is assumed that by there is a basis that will be satisfactory for all future . Second, changes to the grid are based on the univariate marginals of the posterior along the grid coordinates. For the th coordinate, the marginal is:

 ~pk(θtki|y1:t)=∑θ∈Θt,−k,i~p(θ|y1:t)⎛⎜ ⎜⎝K∏l=1l≠kδ(θtl)⎞⎟ ⎟⎠,i=1,…,ntk,

where, to give the appropriate marginalizing sum, is with , summation is over all coordinates except the th, and is the step size on the grid at along the th coordinate at time .

Changes are considered when the support of the marginals is not adequately covered by the grid in one of the following situations:

1. If or (i.e. the values at each end of the grid support) are greater than a certain proportion, say (chosen to be between 0.15 and 0.3) of the value at the mode of the marginal then an extra point will be added at that end in e.g. at and/or , where is a step size along coordinate . Either addition has the effect, through Equation 5.1, of adding a dimensional hyperplane to . This is referred to as an external point addition.

2. Similarly, if either edge value or is less than a certain small proportion, say (chosen to be small, say 0.001) of the marginal mode then either point is removed for . Either deletion removes a dimensional hyperplane from . The resulting new edge points in are then also checked and removed if they also satisfy the criterion, allowing more than 1 point to be removed from each end of each . Internal grid points are not dropped in this approach. This is referred to as external deletion.

3. If the change in marginal between two existing points, relative to its modal value,

 ∣∣ ∣∣~p(θtki|y1:t)−~p(θtk,i−1|y1:t)maxntkm=1~p(θtkm|y1:t)∣∣ ∣∣,

is larger than a certain threshold value, say (usually chosen between 0.3 and 0.4) then an extra point is added to at . This adds a dimensional hyperplane to . This is called internal point addition.

Figure 1 shows examples of these additions and deletions. Note that it is possible to increase or decrease the thresholds over time and can be tweaked by the user in advance; for example keeping in mind the fact that the posterior shrinks over time one can assign a lower value to later in time such that more points are added to the region of high density.

### 5.2 Computing ~p(θ|y1:t) at added points in Θt

For a set of newly added points to the grid, the most obvious next step is to compute at any that has been added from . However, this is computationally infeasible as gets large and furthermore it would remove the property of the algorithm that it does not require the entire filtering sequence to be rerun from scratch.

Instead, extrapolation or interpolation methods are used to compute a value of at any added external and internal point respectively from the existing values of . Several methods exist e.g. kriging (Cressie, 1993), thin plate splines (Wahba, 1990) etc. We use tensor product spline surfaces (Schumaker, 2007), more specifically tensor product linear interpolation on the log scale. In our experience so far, this method is both fast and produces good values for new .

Tensor product linear interpolation has also been used to compute the moments of the state estimation filter corresponding to the added external or internal grid points, which in turn provides us with necessary filtering densities such as and . For such points, this simple method works in most cases since we are essentially providing an informative starting value for moments in the Kalman filter framework, even in multivariate settings. Table 1 summarises the complete algorithm.

## 6. Approximation Error

Theoretical results about the error in approximating using this approach can be divided amongst the various aspects of the approximation:

1. Error from using a discrete grid;

2. Convergence of the filtering algorithm, including state filtering;

3. Interpolation error in relation to adding grid points.

In this section, existing results around each of these 3 aspects are discussed and a roadmap to proving error bounds is laid out.

It is noted that the unnormalised evolves independently at each point in the grid, except for the case of points that are added, where the initial value of is an interpolation of neighbouring values. Convergence results for the polynomial interpolation that is used here are well documented (Kress, 1998). For densities with uniformly bounded derivatives, we can ensure that the interpolation is arbitrarily accurate by increasing the order of the polynomial. However, for the rest of this section, we consider a simpler case where adding/dropping points is not permitted, allowing the third source of error to be ignored and making it sufficient to consider the error in at one value only, and then separately the consequences of that for errors related to the use of a discrete grid.

For the discretization error, were the approximation to be exact on the grid e.g.  then the standard theory of Riemann integration gives us that all integrals of , at least over compact sets, will be approximated arbitrarily well by finite difference summations of as the grid density increases, including over irregular grids.

Therefore the key to results about bounding the error in the approximation of by lies in the choice of starting values (that come from the prior), error from the use of approximate filters and how that error propagates through to by repeated application of Equations 3.3 or 3.4 at a single grid point. These results are the most challenging of the 3 error sources to discuss.

At time , given that our algorithm provides the following filtering distribution

 ~p(θ|y1:t)=C~p(θ)∏i~p(yi|y1:i−1,θ),

where is the accumulated constant of proportionality, the total variation norm between the log of the approximation and the actual data generating process is the following (see Appendix for details):

 |logp(θ|y1:t)−log~p(θ|y1:t)|≤|logp(θ)−log~p(θ)|+∑i|logp(yi|y1:i−1,θ)−log~p(yi|y1:i−1,θ)|.

Given that the error is bounded by a sum that is infinite in the limit, it seems that at best it will be possible to show that the error is bounded rather than converging to 0.

Convergence of to can possibly be achieved under the following relatively strong conditions:

• converges to as for each .

• Further, assuming that both and are bounded away from both 0 and on a compact subset of in the limit is then sufficient for convergence between and .

• Convergence need to be sufficiently fast between and such that is bounded. For example, geometric convergence of to will give a bounded error to by the geometric sum formula.

Convergence results in the literature pertain to the state process and hence to differences between and . These results can easily be extended to the prediction distribution . Focus is on the stability of the filter, so that the total variation norm

 ∥p(xt|y1:t,θ)−~p(xt|y1:t,θ)∥−−→t→00.

In the simplest setting, convergence results are well researched on:

• For the linear filter, properties such as uniform detectability and uniform stabilizability provide the conditions for convergence (Anderson and Moore, 1981). Since for , the Kalman filter converges for any , convergence for our algorithm would essentially depend only on , where represents the prior chosen by us, while is the prior in the true data generating process.

• Finite state space models provide simple settings to study convergence of the filter. Assuming that the actual data generating process arises from , whereas we use as the prior; one can calculate and respectively. The total variation between and is bounded above under ergodic signal, bounded error density, conditions on initial values and certain “mixing” conditions (Atar and Zeitouni, 1997). The upper bound is dependent on the Hilbert metric between and and mixing parameters (see Appendix for more details). For our algorithm, this would mean that convergence of would again be dependent on our choice of the prior along with the aforementioned conditions.

Convergence results for non-linear systems are much more difficult to compute; the existing proofs are based on assumptions which are very restrictive, especially in terms of practical application. An illustration of a potential approach in general nonlinear filtering is provided in Chigansky (2006), where convergence is proved along the same lines as that of finite state space models.

Most solutions to general state space filtering are sub-optimal, and initial estimate is found to play a very important role. Kluge et al. (2010) and others before them proved that estimation error in EKF is exponentially bounded in mean square under permitted nonlinearity and very conservative bounds for each of initial error and noise covariance matrix. Other filters like UKF or Gaussian filters require tuning of signal covariance matrices to achieve stability; no quantitative bounds have been computed and verification of existing assumptions prove to be exceptionally complex (Xiong et al., 2009). It is extremely difficult to generalise these results to our method, and hence pointwise convergence of would possibly depend on our choice of prior along with the conditions associated with each filter, as well as those mentioned under general conditions.

A simulation study we have performed substantiates the claim above, vis a vis a linear Gaussian model. Both for state and parameter, priors have been constructed to be extremely narrow and the true starting values for the data generating process are “very distant” from the region of high density of these priors. Our filter was occasionally found to diverge under these circumstances. An ill-defined state process prior caused more issues than a parameter prior.

## 7. Examples and Comparison With Alternative Approaches

Our method has been implemented for four real data sets and its performance is compared to offline methods such as INLA, MCMC, iterated filtering (Ionides et al., 2011)and PMCMC (Andrieu et al., 2010), as well as the online method of Liu and West (henceforth referred to as L&W filter). Computation times have not been compared because of the difficulties in accounting for differences in implementation and code optimization that have been made across all of these methods, even given that they are all written in R. For INLA, the R-INLA package was used (see www.r-inla.org), for MCMC in stochastic volatility model the package stochvol was used (Kastner, 2016b) and for iterated filtering, PMCMC and the L&W filter, the R package pomp was used (King et al., 2010).

For all the examples the heuristic parameters are set at and . The starting grid has been constructed by placing a regular grid on the region of high density of the prior.

### 7.1 Nile data

The local level model (West and Harrison, 1997) is a simple dlm of the form

 yt =xt−1+ηt, (7.1) xt =xt−1+ϵt, (7.2)

where and . Here it is fitted to the well-known Nile data set, being annual measurements of the river Nile at Ashwan from to . Paremeter inference is done for . These data have been elaborately studied by others e.g. Cobb (1978), Durbin and Koopman (2001) etc. Maximum likelihood estimates are and .

The sequential inference algorithm is implemented with a variety of informative and non-informative priors for and . The starting grid has been constructed on points in each dimension. Performance is evaluated qualitatively through trace plots of the posterior median. We would also like to mention at the outset that average runtime was about 3 secs.

Because of a lack of identifiability between and , there is significant sensitivity to the prior and outlier observations. This can cause the approximation to concentrate around a local mode in the posterior.

This is illustrated in Figure 2, where trace plots of parameters with three different prior distributions are given:

Prior :

Flat log inverse gamma prior with and for both log variance parameters and an uninformative prior for , namely normal with mean and variance ;

Prior :

An informative log inverse gamma prior for with and , and that for with and . Also an informative normal() for ;

Prior :

Flat prior for and (same as Prior ) and a prior for similar to Prior but with larger variance ().

It can be seen that in this example, the prior on the state process (or equivalently the starting value for ) plays the most significant role. The marginal posterior distributions are largely in agreement with the MLE in spite of the prior influence, as is evident in Figure 3. It should be mentioned here that when we used INLA to provide the starting grid, our algorithm managed to identify the MLE even when the priors were uninformative for all unknowns (prior ).

As regards model fit, Figure 4 plots the posterior predictions with the data. Predictive performance is good and the effect of the prior is seen in generally tighter prediction intervals where the prior was informative.

It has already been mentioned that the maximum likelihood estimation of the unknown static parameters in this model are and . The L&W filter was run 50 times for each of the above mentioned priors. Table 2 shows the median of each of the , and posterior quantiles at . The methods largely agree under priors 2 and 3, but L&W fails to track the MLEs in the uninformative prior case (prior 1).

### 7.2 Tokyo rainfall data

Figure 5 displays the number of instances of rainfall over mm in Tokyo for each day during the years (Kitagawa, 1987). For each calendar day the data is defined as the following:

 yt =0if no rain in either year, =1if rain in only 1 year and, =2if rain in both years,

with only 1 observation at , corresponding to 29th February . These data have been analyzed before by several authors, for example Fahrmeir and Tutz (2001) and Rue and Held (2005).

Both Fahrmeir and Tutz (2001) and Kitagawa (1987) fitted the same model, in which the state process, which is related to the probability through a logit link, is a random walk process. The model is given by,

 yt ={Bin(1,πt),t=60(February $29$)Bin(2,πt),t≠60. (7.3) πt =1/(1+exp(−xt)), (7.4) xt =xt−1+ϵtϵt∼N(0,σ2). (7.5)

Rue and Held (2005) use a circular random walk of order for the state process that cannot be directly analyzed using the “forward looking” state space approach employed here. Instead a second order difference random walk, also in Fahrmeir and Knorr-Held (2000), is used:

 xt=2xt−1+xt−2+ϵtϵt∼N(0,σ2). (7.6)

The state variance parameter is the lone parameter in this problem. We have used EKF to compute the estimation in the state process as discussed extensively in Fahrmeir (1992). A flat prior for the state variance parameter is used and INLA is used for the first iterations which is sufficient to get a grid with good posterior support. Average runtime for this data was approximately 30 secs.

Figure 6 shows the sequential parameter learning and comparison with ’batch’ methods: INLA and an MCMC procedure. All 3 methods give a similar inference by the time that all the data are observed.

Figure 7 plots the smoothing mean along with bands denoting and quantiles. The smoothing mean seems to capture the “rainy patches” in Tokyo quite well and matches with similar inferences made in other works.

### 7.3 Stochastic volatility model

The stochastic volatility model (Taylor, 1982) can be written:

 yt =exp(xt/2)ϵt, (7.7) xt =μ+ϕ(xt−1−μ)+σηt, (7.8) xt ∼N(μ,σ2/(1−ϕ2)),

where and are independent standard Gaussian variables. The set of unknown static parameters in the model are : the level of log-variance , the persistence of log-variance and the volatility of log-variance .

The daily difference of the dollar-pound exchange rates between October and June , , plotted in Figure 8, are analysed with this model.

The algorithm was applied to these data with a set of 32 different priors, corresponding to a mixture of informative and uninformative cases. The starting grids for each prior were 40 for each parameter.

Medians of the posterior quantiles for all the parameters, across the 32 prior specifications, along with step ahead predictions, are plotted in Figure 9, along with the posterior mean from an analysis of all the data using the MCMC approach of Kastner (2016a) and INLA. It can be seen that there is good robustness to the prior specification, at least once most of the data are analysed. A much larger estimate of is given by INLA; our method and the MCMC approach are in much better agreement. The bottom-right panel assesses model fit via box plots of posterior p-values (Gelman, 2003), which demonstrates good predictive performance of our method regardless of the prior specifications. The average runtime for our method is approx. 280 sec including calculation of posterior quantiles and p-values.

In addition, Table 3 shows a similar comparison with the method of L&W. The table summarises the posterior distribution across the 32 priors, showing the median of the posterior median, 2.5% and 97/5% quantiles once all the data were analysed. In general, the table shows the difficulties of parameter estimation for this model, with large differences between the approaches. Our method is in rough agreement with L&W, but it is also able to provide a reasonable estimate for the AR parameter. INLA showed the largest sensitivity across the different priors that were used. L&W provide narrower probability bounds than our method.

### 7.4 Univariate non-stationary growth model

The univariate non-stationary growth model (UNGM) is a very challenging model for parameter estimation and has been discussed in numerous occasions in the literature; see Kitagawa (1996); Andrieu et al. (2010) for discrete time and Xu and Yoon (2009) for continuous time version. The model is given by:

 yt =x2t/20+ϵt, where ϵt∼N(0,σ2y), (7.9) xt =0.5xt−1+25xt−1/(1+x2t−1)+8cos(1.2t)+ηt, where ηt∼N(0,σ2x), (7.10) x0 ∼N(0,1).

The parameters are . Estimation procedures for parameters for this model have focussed on batch analysis rather than online. As well as being highly nonlinear, both in the state and the observation process, the state process is multimodal, making filtering and prediction difficult, particularly when is small relative to (Kotecha and Djurić, 2003).

Data were simulated from this model with . For estimation of the state process, both the augmented UKF (Wu et al., 2005) and quadrature methods were applied, and the latter performed better, at the cost of a longer computation time. Updating the posterior using — that is, the sequential version of Bayes Law — produced better performance. Another issue is prior sensitivity, which can be considerable because of the nonlinearity and lack of identifiability in the state process. Because of this, to evaluate our method, multiple data sets from this model were generated, while keeping the prior on the state and parameters to be the same in all the runs with the grid to be 40 for both.

Figure 10 summarises the inference over these runs which ran over an average time of 850 sec again including computation for posterior quantiles and p-values which slows the process down signifiantly. Note the extremely narrow posterior bounds for each run (top row of the plot), along with consistent under-estimation of the state variance parameter.

An explanation for the under-estimation of is that while has been identified by our method, note the high variability between the runs, mostly indicating over-estimation, an indication of an identifiability problem. In several cases, the posterior distribution diverged considerably from the true value, usually due to the presence of severe nonlinearity. Furthermore, the individual runs were found to have extremely narrow probability bounds, implying that the posterior is concentrated in a narrow region of parameter space. However, Figure 11, shows that predictive performance is good (this is for a single run). It is possible to identify the problem faced in identifying the observation when it is close to , i.e. when is small, in the figure.

Table 4 compares the performance of the L&W method and two other offline methods: iterated filtering and PMCMC. All the algorithms were provided with a similar set of priors and their posterior medians were recorded. The values provided in the table are the , and percentiles of these medians from multiple runs. L&W algorithm seem to have estimated the state variance parameter correctly, while our method clearly under-estimated it. However, performance of our algorithm is better for the measurement variance parameter. A major disadvantage for the L&W method is that all the runs ended in a single particle, thus degeneracy is certain, a problem our method does not face. Iterated filtering does contain the true parameter values in its bounds, but at the cost of very large standard errors. PMCMC is the only method which estimates both the parameters correctly, even though it is extremely sensitive to the proposal distribution for the parameters and very easily can get caught in local modes. This goes on to display the extreme complication associated with the UNGM model, because of the nonlinearity in both levels of the state space models. In conclusion one may say that for models with severe nonlinearity and identifiability issues existing online algorithms fail to fit the model well.

## 8. Conclusion

A method of sequential static parameter estimation for generalized dynamic state space models has been proposed and compared to multiple alternative approaches, both online and offline. Our method achieved similar accuracy to the offline methods in almost all the examples provided and usually performed better than the particle filter. Furthermore, it does not suffer from usual scenarios of degeneracy.

Our method has several appealing properties. It can be applied to models with a non-Gaussian state process, and so is more general than INLA. When the underlying process is Gaussian, it can make use of INLA for a good initialization. It is a very flexible framework in that any filtering and prediction algorithm, not just those used here, can be plugged in as the approximations for posterior predictive or filtering distributions. Dynamic grid updating by tensor product splines is fast, simple and seems to work well, at least in the examples described in the paper; many alternative interpolation methods could be implemented once this approach fails to be sufficiently accurate. Finally, the algorithm is trivially parallelizable over the grid; the computation of at each is completely independent.

The principal disadvantage of the approach is that it is restricted to models with a relatively small number of fixed parameters, hence suited for example to inference on hyper-parameters related to the case of hierarchical model in a dynamic setting. Since the number of hyper-parameters in such models are usually much smaller than the number of parameters, which themselves could be time dependent, our algorithm can be fast enough for online Bayesian inference.

Finally, a short discussion on computational complexity of our algorithm is provided. In this paper, our method has not been formally compared to the other online algorithms in terms of computation time, such as L&W method, because the latter is applied from an R package, which would have more optimized coding than ours. This method has computational complexity where is the number of grid points in each dimension and . Because of the recursive nature of the algorithm, the computation is independent of . For each grid point, the complexity also depends on that of the Kalman filter, EKF, UKF or other state estimation methods; total complexity then becomes , where corresponds to computation in the state estimation process. For the Kalman filter, for example, where and denote the dimensions of the state and observation process respectively. However note that because of the nature of the algorithm, the state estimation steps can be trivially parallelized across points in the grid.

## 9. Appendix

### 9.1 Convergence of filtering distribution under Hilbert metric approach

We will first present the general form for a discrete time filter. Let be a Markov sequence with values in and transition probability density and initial probability density . The observation process is an i.i.d sequence conditional of , such that we have as the likelihood. We drop the dependence on from future references as they don’t play a role in the convergence proofs involving state filtering.

The filtering equation of given for the general problem can be written as

 πt(xt)=p(yt|xt)∫Rdp(xt|xt−1)πt−1(xt−1)dxt−1∫Rdp(yt|xt)∫Rdp(xt|xt−1)πt−1(xt−1)dxt−1dxt.

Even though the filtering equation above is nonlinear, its solution at each time point is obtained by using the linear Zakai equation:

 ρt(xt)=p(yt|xt)∫Rdp(xt|xt−1)πt−1(xt−1)dxt−1,for t≥0, (9.1)

and thus we have

 πt(xt)=ρt(xt)∫Rdρt(u)du. (9.2)

The Hilbert projective metric approach has been used to prove the stability result of the state filter for a finite state space model by Atar and Zeitouni (1997) and by LeGland and Oudjane (2004) in particle filters. Let be a measurable set and be the space of nonnegative measures on with the partial order relation if for any measurable . The measures and are comparable if for some constants .

• (Hilbert projective distance metric) The Hilbert projective distance is defined as

 h(p,q)=log

,  p, q ∈M_+ are comparable and otherwise.

The following properties of the Hilbert metric are useful:

1. is a non-negative symmetric function;

2. It satisfies the triangle inequality

 h(p,q)≤h(p,r)+h(r,q),p,q,r∈M+;
3. iff for some ;

4. for any and any scalars ;

5. for any .

Further let be a linear operator, mapping to itself, then the Birkhoff contraction coefficient is defined as

 τ(K)=sup0

Defining -diameter as , we have

 τ(K)=tanh(H(K)4). (9.4)

In the filtering concept, the operator is of a particular integral structure and then if the transition kernel satisfies the “mixing conditions”, i.e. if and s.t.

 0<κ⋆≤p(xt|xt−1)≤κ⋆<∞,

it can be shown that

 τ(K)≤κ⋆−κ⋆κ⋆+κ⋆<1. (9.5)

Let be the filtering posterior that one can get from the data generating process and be the applied model, and and be the respective initializations. The condition , i.e. is sufficient for to be a valid initialization. Under these assumptions we have the following theorem

###### Theorem 9..1

Suppose the signal evolves on a subset , i.e.  for all . Assume that and there exists a reference measure on , with respect to which the transition law of the signal has a uniformly positive and bounded density, i.e.

 0<λ⋆≤p(xt|xt−1)≤λ⋆<∞.

Then

 ∥πn−~πn∥≤2log(3)h(p(x0),~p(x0))(λ⋆−λ⋆λ⋆+λ⋆)n,n≥1.

### 9.2 Outline for convergence proof in our algorithm

Given that our algorithm provides the following filtering distribution:

 p(θ|y1:t) ∝p(θ|y1:t−1)p(yt|y1:t−1,θ), =Ctp(θ|y1:t−1)p(yt|y1:t−1,θ), =Cp(θ)∏jp(yj|y1:j−1,θ).

We now have

 logp(θ|y1:t)=logp(θ)+logC+∑jlogp(yj|y1:j−1,θ).

Assuming is the approximate filter and is the true posterior coming from the true data generating process, the total variation norm between the corresponding logs is

 |logp(θ|y1:t)−log~p(θ|y1:t)| =∣∣ ∣∣[logp(θ)+logC+∑jlogp(yj|y1:j−1,θ)]−[log~p(θ)+log~C+∑jlog~p(yj|y1:j−1,θ)]∣∣ ∣∣, ≤|logp(θ)−log~p(θ)|+∑j∣∣logp(yj|y1:j−1,θ)−log~p(yj|y1:j−1,