Adapting the Number of Particles in Sequential Monte Carlo Methods through an Online Scheme for Convergence Assessment
Particle filters are broadly used to approximate posterior distributions of hidden states in state-space models by means of sets of weighted particles. While the convergence of the filter is guaranteed when the number of particles tends to infinity, the quality of the approximation is usually unknown but strongly dependent on the number of particles. In this paper, we propose a novel method for assessing the convergence of particle filters in an online manner, as well as a simple scheme for the online adaptation of the number of particles based on the convergence assessment. The method is based on a sequential comparison between the actual observations and their predictive probability distributions approximated by the filter. We provide a rigorous theoretical analysis of the proposed methodology and, as an example of its practical use, we present simulations of a simple algorithm for the dynamic and online adaptation of the number of particles during the operation of a particle filter on a stochastic version of the Lorenz system.
Many problems in science and engineering can be described by dynamical models where hidden states of the systems change over time and observations that are functions of the states are available. Often, the observations are sequentially acquired and the interest is in making recursive inference on the hidden states. In many applications, the Bayesian approach to the problem is adopted because it allows for optimal inclusion of prior knowledge of the unknown state in the estimation process [1, 2]. In this case, the prior information and the likelihood function that relates the hidden state and the observation are combined yielding a posterior distribution of the state.
Exact Bayesian inference, however, is only possible in a small number of scenarios, including linear Gaussian state-space models (using the Kalman filter [3, 4]) and finite state-space hidden Markov models (HMM filters ). Therefore, in many other practical problems, only approximate inference methods can be used. One class of suboptimal methods is particle filtering, which is also known as sequential Monte Carlo sampling [6, 7, 8, 9, 10]. Since the publication of , where the sampling importance resampling (SIR) filter was introduced, particle filtering has received outstanding attention in research and practice. Particle filters approximate posterior distributions of the hidden states sequentially and recursively. They do it by exploiting the principle of importance sampling and by using sets of weighted particles [6, 7, 12].
One key parameter of particle filters is the number of particles. It can be proved that the rate of convergence of the approximate probability distribution towards the true posterior is inversely proportional to the square root of the number of particles used in the filter [12, 13]. This, too, entails that the filter “perfectly” approximates the posterior distribution when the number of particles tends to infinity. However, since the computational cost grows with the number of particles, practitioners must choose a specific number of particles in the design of their filters.
In many applications, the observations arrive sequentially, and there is a strict deadline for processing each new observation. Then, one could argue that the best solution in terms of filter performance is to increase the number of particles as much as possible and keep it fixed. Also, in some hardware implementations, the number of particles is a design parameter that cannot be modified during implementation. Nevertheless, in many other applications where resources are scarce or are shared with a dynamical allocation and/or with energy restrictions, one might be interested in adapting the number of particles in a smart way. One would use enough particles to achieve a certain performance requirement but without wasting resources by using many more particles if they do not translate into a significant improvement of the filter performance.
The selection of the number of particles, however, is often a delicate subject because, (1) the performance of the filter (the quality of the approximation) cannot usually be described in advance as a function of the number of particles, and (2) the mismatch between the approximation provided by the filter and the unknown posterior distribution is obviously also unknown. Therefore, although there is a clear trade-off between performance and computational cost, this relation is not straightforward; e.g., increasing the number of particles over a certain value may not significantly improve the quality of the approximation while decreasing the number of particles below some other value can dramatically affect the performance of the filter.
Few papers in the wide literature have addressed the problem of online assessment of the filter convergence for the purpose of adapting the number of particles. In , the number of particles is selected so that a bound on the approximation error does not exceed a threshold with certain probability. The latter error is defined as the Kullback-Leibler divergence (KLD) between the approximate filter distribution and a grid-discretized version of the true one (which is itself a potentially-costly approximation with an unknown error). In , an adaptation of the number of particles is proposed, based on the KLD approach of  and an estimate of the variance of the estimators computed via the particle filter, along with an improvement of the proposal distributions. In , the adaptation of the number of particles is based on the effective sample size. These methods are heuristic: they do not enjoy any theoretical guarantees (in the assessment of the approximation errors made by the particle filter) and the allocation of particles, therefore, cannot be ensured to be optimal according to any probabilistic criterion. Some techniques based on more solid theoretical ground have been proposed, within the applied probability community, during the last few years. We discuss them below.
Two types of unbiased estimators of the variance in the approximation of integrals using a class of particle filters were analyzed in  using the Feynman-Kac framework of . As an application of these results, it was suggested to use these estimators to select the number of particles in the filter. In particular, the scheme proposed in  is a batch procedure in which a particle filter is run several times over the whole data sequence, with increasing number of particles, until the variance of the integral of interest is found to fall below a prescribed threshold. This approach cannot be used for online assessment, which is the goal of the present paper. Another batch method (thus, also not applicable for online assessment) for particle allocation has been recently proposed in , where an ad hoc autoregressive model is fitted to estimate the variance of the estimators produced by the particle filter.
Papers on so-called alive particle filters can also be found in the literature [20, 21, 22]. These articles focus on models where the likelihood function can take zero value for some regions of the state space, in such a way that there is the risk that a collection of zero-weight particles are generated if a standard algorithm is employed. To avoid this limitation, alive particle filters are based on sampling schemes where new particles are generated until a prescribed number of them, , attain non-zero weights. The computational cost of the algorithm per time step is, therefore, random. Moreover, the number is chosen a priori and there is no assessment of whether allows for reaching adequate accuracy of the estimators (the methodology proposed in the present manuscript can be directly applied to alive particle filters in order to adapt ).
In order to guarantee that the particle set yields a sufficiently good representation, in  it is proposed to test whether the particle estimate of the predictive density of the observation at time given the previous data is sufficiently large, i.e., whether it is above a prescribed (heuristically chosen) threshold. When the particle set does not satisfy this condition, it is discarded and a new collection of particles is generated. The number of particles is not adapted, since all generated sets have the same size. The computational cost of this algorithm is random.
Finally, in [24, Chapter 4] it is proposed to use the coefficient of variation of the weights (or, equivalently, the effective sample size) in order to detect those observations for which there is a large -divergence between the proposal distribution used to generate the set of particles and the target distribution. This connection is rigorously established in . The algorithm, however, is computationally costly compared to classical methods: at each time step, a complete set of particles and weights are generated, and the coefficient of variation is computed. If this coefficient is too high, the particles are discarded, the algorithm “rolls back,” and a new, larger set of particles is generated for better representation of the target distribution (this step is termed “refuelling” in ). Although the algorithm enjoys theoretical guarantees, it relies on keeping the particle approximation “locked” to the target distribution at all times. It is known that, once the particle filter has lost track of the state distribution, the effective sample size (and, hence, coefficient of variation) becomes uninformative  and, therefore, the link with the -divergence is lost.
We introduce a model–independent methodology for the online assessment of the convergence of particle filters and carry out a rigorous analysis that ensures the consistency of the proposed scheme under fairly standard assumptions. The method is an extension of our previous work presented in . In the proposed scheme, the observations are processed one at a time and the filter performance is assessed by measuring the discrepancy between the actual observation at each time step and a number of fictitious data-points drawn from the particle approximation of the predictive probability distribution of the observations. The method can be exploited to adjust the number of particles dynamically when the performance of the filter degrades below a certain desired level. This would allow a practitioner to select the operation point by considering performance-computational cost tradeoffs. Based on the method, we propose a simple and efficient algorithm that adjusts the number of particles in real time. We demonstrate the performance of the algorithm numerically by running it for a stochastic version of the -dimensional Lorenz 63 system. As already noted, this paper builds on the method from . However, the main difference here is that the underlying model is not questioned – instead, it is assumed to be correct. The connection between  and the present work is that they both build upon the ability to compute predictive statistics of the upcoming observations that turn out to be independent of the underlying state space model. In this paper we have rigorous theoretical results regarding the particle approximations of the predictive distribution of the observations (while this issue was ignored in ). Finally, we suggest practical schemes for the online adjustment of the number of particles.
Let us point out that the adaptive procedure for the online selection of the number of particles described herein is only one of many that can exploit the results of the convergence analysis. In other words, our analysis opens the door for development of new families of algorithms for online adaptation of the number of particles by way of online convergence assessment.
I-C Organization of the paper
The rest of the paper is organized as follows. In Section II we describe the class of state space Markov models and provide a basic background on the well-known bootstrap particle filter of . The theoretical results that enable the online assessment of particle filters are stated in Section III, with full details and proofs contained in Appendix A. The proposed methodology for online convergence assessment of the particle filter is introduced in Section IV. Furthermore, this section provides a simple algorithm for the dynamic, online adaptation of the number of particles. In Section V, we illustrate the validity of the method by means of computer simulations for a stochastic Lorenz 63 model. Finally, Section VI contains a summary of results and some concluding remarks.
Ii Particle filtering
In this section we describe the class of state space models of interest and then present the standard particle filter (PF), which is the basic building block for the methods to be introduced later.
Ii-a State space models and stochastic filtering
Let us consider discrete-time, Markov dynamic systems in state-space form described by the triplet111In most of the paper we abide by a simplified notation where denotes the probability density function (pdf) of the random variable . This notation is argument-wise, hence if we have two random variables and , then and denote the corresponding density functions, possibly different; denotes the joint pdf and is the conditional pdf of given . A more accurate notation, which avoids ambiguities, is used for the analysis and the statement of the theoretical results. Besides, vectors are denoted by bold-face letters, e.g., x, while regular-face is used for scalars, e.g., .
denotes discrete time;
is the -dimensional (random) system state at time , which takes variables in the set ,
is the a priori pdf of the state, while
denotes the conditional density of the state given ;
is the -dimensional observation vector at time , which takes values in the set and is assumed to be conditionally independent of all other observations given the state ,
is the conditional pdf of given . It is often referred to as the likelihood of , when it is viewed as a function of given .
The model described by Eqs. (1)–(3) includes a broad class of systems, both linear and nonlinear, with Gaussian or non-Gaussian perturbations. Here we focus on the case where all the model parameters are known. However, the proposed method can also be used for models with unknown parameters for which suitable particle filtering methods are available [27, 28, 29]. We assume that the prior distribution of the state is also known.
The stochastic filtering problem consists in the computation of the sequence of posterior probability distributions given by the so-called filtering densities , . The pdf is closely related to the one-step-ahead predictive state density , which is of major interest in many applications and can be written down by way of the Chapman-Kolmogorov equation,
Using Bayes’ theorem together with Eq. (4), we obtain the well-known recursive factorization of the filtering pdf
For conciseness and notational accuracy, we use the measure-theoretic notation
to represent the filtering and the predictive posterior probability distributions of the state, respectively. Note that and are probability measures, hence, given a Borel set , and denote the posterior probability of the event conditional on and , respectively.
However, the object of main interest for the convergence assessment method to be introduced in this paper is the predictive pdf of the observations, namely the function and the associated probability measure
The density is the normalization constant of the filtering density , and it is related to the predictive state pdf through the integral
Ii-B The standard particle filter
A PF is an algorithm that processes the observations sequentially in order to compute Monte Carlo approximations of the sequence of probability measures . The simplest algorithm is the so-called bootstrap particle filter (BPF)  (see also ), which consists of a recursive importance sampling procedure and a resampling step. The term “particle” refers to a Monte Carlo sample in the state space , which is assigned an importance weight. Below, we outline the BPF algorithm with particles.
Bootstrap particle filter.
Initialization. At time , draw i.i.d. samples, , , from the prior .
Recursive step. Let be the particles at time . At time , proceed with the two steps below.
For , draw from the model transition pdf . Then compute the normalized importance weights
Resample times with replacement: for , let with probability , where .
For the sake of simplicity, in step 2.(b) above we assume that multinomial resampling  is carried out for every . The results and methods to be presented in subsequent sections remain valid when resampling is carried out periodically and/or using alternative schemes such as residual , stratified  or minimum-variance  resampling (see also ).
The simple BPF yields several useful approximations. After sampling at step 2.(a), the predictive state probability measure can be approximated as
where denotes the Dirac delta measure located at . The filter measure can be similarly approximated, either using the particles and weights computed at step 2.(a) or the resampled particles after step 2.(b), i.e.,
respectively. In addition, the BPF yields natural approximations of the predictive pdf’s of and given the earlier observations . If we specifically denote these functions as and , then we readily obtain their respective estimates as mixture distributions with mixands, or,
for any and .
Iii A novel asymptotic convergence result
The convergence of the approximate measures, e.g., , towards the true ones is usually assessed in terms of the estimates of 1-dimensional statistics of the corresponding probability distribution. To be specific, let be a real integrable function in the state space and denote222 Let be a measurable space, where for some integer and is the Borel -algebra of subsets of . If is a measure on and the function is integrable with respect to (w.r.t.) , then we use the shorthand notation .
Under mild assumptions on the state space model, it can be proved that
According to (5), the predictive observation pdf is an integral w.r.t. and, as a consequence, Eq. (7) implies that a.s. and point-wise for every under mild assumptions . However, existing theoretical results do not ensure that can converge uniformly on towards and this fact prevents us from claiming that in some proper sense for integrable real functions .
Important contributions of this paper are (a) the proof of a.s. convergence of the random probability measure
towards (as ) under mild regularity assumptions on the state space model, and (b) the provision of explicit error rates. We point out that is not a classical point-mass Monte Carlo approximation of (as, for example, is an approximation of ). Instead, the measure is absolutely continuous with respect to the Lebesgue measure (the same as itself). If a different reference measure were used to define the pdf’s and , say , then both and would be absolutely continuous with respect to . In order to describe how converges to in a rigorous manner , we need to introduce some notation:
For each , let us define the function , i.e., the conditional pdf of given . When this function is used as a likelihood, we write to emphasize that it is a function of .
Let be a real function on some set . We denote the absolute supremum of as . The set of bounded real functions on is .
Let be a multi-index, where each , , is a non-negative integer. Let be a real function on a -dimensional set . We use to denote the partial derivative of w.r.t. the variable z determined by the entries of a, namely,
The order of the derivative operator is .
The minimum out of two scalar quantities, , is denoted .
We make the following assumptions on the likelihood function and the predictive observation measure .
For each , the function is positive and bounded, i.e., for any and .
For each , the function is differentiable with respect to y, with bounded derivatives up to order , hence exists and
For any and any , the sequence of hypercubes
satisfies the inequality for some constants and independent of (yet possibly dependent on and ), where is the complement of .
Assumptions () and () refer to regularity conditions (differentiability and boundedness) that the likelihood function of the state space model should satisfy. Models of observations, for example, of the form , where is a (possibly nonlinear) transformation of the state and is noise with some differentiable, exponential-type pdf (e.g., Gaussian or mixture-Gaussian), readily satisfy these assumptions. Typical two-sided heavy-tailed distributions, such as Student’s distribution, also satisfy () and ().
Assumption () states an explicit bound on the probability under the tails of the pdf . The bound is polynomial, namely
and therefore immediately verified, e.g., by all distributions of the exponential family as well as for many heavy-tailed distributions. For example, when (i.e., the observations are scalars), one can choose the constants and such that is an upper bound for the tails of the (heavy-tailed) Pareto, Weibull, Burr or Levy distributions.
It is actually possible to find simple conditions on the conditional pdf of the observations, , that turn out sufficient for assumption () to hold true. Let us keep , for simplicity, and assume that there exists a sequence of positive constants such that has a polynomial upper bound itself, namely
for some and every such that (note that the smallest set in the sequence is ). For probability distributions with infinite support and continuous with respect to the Lebesgue measure, the inequality (8) implies that the densities are integrable for every possible choice of . Then, the probability below the right tail of is
where the inequality follows from the application of (8). Since is a pdf, we have and some elementary calculations yield
The same result is easily obtained for the left tail of , hence
By comparing (10) and the inequality , we readily see that we can choose and to uphold assumption (). A similar derivation can be carried out when .
Assume that (), () and () hold and the observations are fixed (and otherwise arbitrary). Then, for every and any there exists an a.s. finite r.v. , independent of , such that
See Appendix A for a proof.
Note that the r.v. in the statement of Theorem 1 depends on the time instant . It is possible to remove this dependence if the constants and in assumption () are chosen to be independent of and we impose further constraints on the likelihood function and the Markov kernel of the state space model (similar to the sufficient conditions for uniform convergence in, e.g.,  or ).
Iv Online selection of the number of particles
In the sequel we assume scalar observations, hence and (while is arbitrary). A discussion of how to proceed when is provided in Section IV-E.
Our goal is to evaluate the convergence of the BPF (namely, the accuracy of the approximation ) in real time and, based on the convergence assessment, adapt the computational effort of the algorithm, i.e., the number of used particles .
To that end, we run the BPF in the usual way with a light addition of computations. At each iteration we generate “fictitious observations”, denoted , from the approximate predictive pdf . If the BPF is operating with a small enough level of error, then Theorem 1 states that these fictitious observations come approximately from the same distribution as the acquired observation, i.e., . In that case, as we explain in Subsection IV-B, a statistic can be constructed using , which necessarily has an (approximately) uniform distribution independently of the specific form of the state-space model (1)–(3). By collecting a sequence of such statistics, say for some window size , one can easily test whether their empirical distribution is close to uniform using standard procedures. The better the approximation generated by the BPF, the better fit with the uniform distribution can be expected.
If and is not too large, the cost of the added computations is negligible compared to the cost of running the BPF with particles and, as we numerically show in Section V, the ability to adapt the number of particles online leads to a very significant reduction of the running times without compromising the estimation accuracy.
Below we describe the method, justify its theoretical validity and discuss its computational complexity as well as its extension to the case of multidimensional ’s.
Iv-a Generation of fictitious observations
The proposed method demands at each time the generation of fictitious observations (i.e., Monte Carlo samples), denoted , from the approximate predictive observation pdf . Since the latter density is a finite mixture, drawing from is straightforward as long as the conditional density of the observations, , is itself amenable to sampling. In order to generate , it is enough to draw a sample from the discrete uniform distribution on and then generate .
Iv-B Assessing convergence via invariant statistics
For simplicity, let us assume first that , i.e., there is no approximation error and, therefore, the fictitious observations have the same distribution as the true observation . We define the set and the r.v. . Note that is the set of fictitious observations which are smaller than the actual one, while is the number of such observations. If we let denote the probability mass function (pmf) of , it is not hard to show that is uniform independently of the value and distribution of . This is rigorously given by the Proposition below.
If are i.i.d. samples from a common continuous (but otherwise arbitrary) probability distribution, then the pmf of the r.v. is
Proof: Since are i.i.d., all possible orderings of the samples are a priori equally probable, and the value of the r.v. depends uniquely on the relative position of after the samples are sorted (e.g., if is the smallest sample, then , if there is exactly one then , etc.). There are different ways in which the samples can be ordered, but can only take values from to . In particular, given the relative position of , there are different ways in which the remaining samples can be arranged. Therefore, for every . ∎
For the case of interest in this paper, the r.v.’s (the actual and fictitious observations) have a common probability distribution given by the measure and are generated independently. For the class of state space models described in Section II, and the explicit assumptions in Section III, the measure is absolutely continuous w.r.t. the Lebesgue measure (with associated density ) and, therefore, are indeed continuous r.v.’s and the assumptions of Proposition 1 are met. Moreover, it can also be proved that the variables in the sequence are independent.
If the r.v.’s are i.i.d. with common pdf , then the r.v.’s in the sequence are independent.
See Appendix B for a proof.
In practice, is just an approximation of the predictive observation pdf and, therefore, the actual and fictitious observations are not i.i.d. However, under the assumptions of Theorem 1, the a.s. convergence of the approximate measure enables us to obtain an “approximate version” of the uniform distribution in Proposition 1, with the error vanishing as . To be specific, we introduce the set , which depends on because of the mismatch between and , and the associated r.v. with pmf . We have the following convergence result for .
Let be a sample from and let be i.i.d. samples from . If the observations are fixed and Assumptions (), () and () hold, then there exists a sequence of non-negative r.v.’s such that a.s. and
In particular, a.s.
See Appendix C for a proof. Proposition 1 states that the statistic is distribution-invariant, since independently of and the state space model. Similarly, Theorem 2 implies that the statistic is asymptotically distribution-invariant (independently of and the model) since when , as the BPF converges.333Specifically note that, under assumptions (), () and (), the convergence of the continuous random measure computed via the BPF (which is sufficient to obtain (12); see Appendix C) is guaranteed by Theorem 1.
Iv-C Algorithm with adaptive number of particles
We propose an algorithm that dynamically adjusts the number of particles of the filter based on the transformed r.v. . Table II summarizes the proposed algorithm, that is embedded into a standard BPF (see Section II-B) but can be applied to virtually any other particle filter in a straightforward manner. The parameters of the algorithm are shown in Table I.
The BPF is initialized in Step 1(a) with initial particles. At each recursion, in Step 2(a), the filtered distribution of the current state is approximated. In Step 2(b), fictitious observations are drawn and the statistic is computed. In Step 2(c), once a set of consecutive statistics have been acquired, , a statistical test is performed for checking whether is a sequence of samples from the uniform pmf given by Eq. (11).
There are several approaches that can be used to exploit the information contained in . Here we perform a Pearson’s chi-squared test , where the statistic is computed according to Eq. (13) (see Table II). Then, a p-value for testing the hypothesis that the empirical distribution of is uniform is computed. The value is obtained by comparing the statistic with the distribution with degrees of freedom. Intuitively, a large suggests a good match of the sequence with an i.i.d. sample from the uniform distribution on , while a small indicates a mismatch. Therefore, the p-value is compared with two different significance levels: a low threshold and a high threshold . If , the number of particles is increased according to the rule whereas, if , the number of particles is decreased according to the rule . When , the number of particles remains fixed. These two significance levels allow the practitioner to select the operation range by considering a performance-to-computational-cost tradeoff. Note that we set and , maximum and minimum values for the number of particles, respectively.
A large window yields a more accurate convergence assessment but increases the latency (or decreases the responsiveness) of the algorithm. If the algorithm must be run online, this latency can be critical for detecting a malfunction of the filter and adapting consequently the number of particles. Therefore there is a tradeoff between the accuracy of the convergence assessment procedure and latency of the algorithm.
Iv-D Computational cost
Compared to the BPF, the additional computational cost of the method is mainly driven by the generation of the fictitious observations at each iteration as shown in Subsection IV-A. The generation of these fictitious observations is a two-step procedure, where in the first step, we draw discrete indices, say , from the set with uniform probabilities, and in the second step, we draw samples from , respectively.
In the proposed algorithm, a Pearson’s test is performed with a sequence of samples, that is, it is carried out only once every consecutive time steps. Therefore, the computational cost will depend on the parameters and . We will show in Section V that the algorithm can work very well with a low number of fictitious observations, which imposes a very light extra computational load.
Iv-E Multidimensional observations
Through this section, we have assumed scalar observations. In the multidimensional case, with , the same assessment scheme can be applied over each marginal of the predictive observation pdf. Theoretical guarantees readily follow from the convergence of the marginal measures under the same assumptions as the joint measure (see Appendix A).
The algorithm proposed in Section IV-C can be extended to the case with multidimensional observations. One of way of doing it is by performing an independent assessment for each marginal pdf . As a result, p-values , with