An algorithm for approximating the second moment of the normalizing constant estimate from a particle filter

# An algorithm for approximating the second moment of the normalizing constant estimate from a particle filter

Svetoslav Kostov Svetoslav Kostov School of Mathematics, University Walk, Bristol, BS8 1TW, UK
1Nick Whiteley School of Mathematics, University Walk, Bristol, BS8 1TW, UK
2
Nick Whiteley Svetoslav Kostov School of Mathematics, University Walk, Bristol, BS8 1TW, UK
1Nick Whiteley School of Mathematics, University Walk, Bristol, BS8 1TW, UK
2
2email: svetoslav.kostov@bristol.ac.uk
4email: nick.whiteley@bristol.ac.uk
###### Abstract

We propose a new algorithm for approximating the non-asymptotic second moment of the marginal likelihood estimate, or normalizing constant, provided by a particle filter. The computational cost of the new method is per time step, independently of the number of particles in the particle filter, where is a parameter controlling the quality of the approximation. This is in contrast to for a simple averaging technique using i.i.d. replicates of a particle filter with particles. We establish that the approximation delivered by the new algorithm is unbiased, strongly consistent and, under standard regularity conditions, increasing linearly with time is sufficient to prevent growth of the relative variance of the approximation, whereas for the simple averaging technique it can be necessary to increase exponentially with time in order to achieve the same effect. This makes the new algorithm useful as part of strategies for estimating Monte Carlo variance. Numerical examples illustrate performance in the context of a stochastic Lotka–Volterra system and a simple AR(1) model.

###### Keywords:
marginal likelihood normalizing constant hidden Markov model particle filter

## 1 Introduction

Particle filters, also known as Sequential Monte Carlo (SMC) methods (Doucet et al, 2001), are used across a variety of disciplines including systems biology, econometrics, neuroscience and signal processing, to perform approximate inferential calculations in general state-space Hidden Markov Models (HMM) and in particular, provide an unbiased estimate of the marginal likelihood. Recent application areas of these techniques include for example, systems biology (Golightly and Wilkinson, 2011; Golightly et al, 2015), where the calculation of the marginal likelihood (ML) plays an important role in the estimation of the parameters of stochastic models of biochemical networks. Estimation of the marginal likelihood also features centrally in Particle Markov Chain Monte Carlo methods (Andrieu et al, 2010).

In the present paper we address the problem of approximating the non-asymptotic second moment of the particle filter estimate of the marginal likelihood, henceforth for brevity “the second moment”. As part of strategies to estimate Monte Carlo variance, this allows one to report a numerical measure of the reliability of the particle filter estimate. Our contributions are to introduce a new particle “Pairs algorithm” and prove that it unbiasedly and consistently approximates the second moment. We also establish, under regularity conditions, a linear-in-time bound on the relative variance of the approximation to the second moment, and illustrate through a simple calculation and numerical simulations, that the Pairs algorithm performs more reliably than a default strategy which uses independent copies of the particle filter. In order to discuss the connections between our work and the existing literature, we first need to introduce some notation and definitions.

A HMM is a process , where , called the signal process, is a Markov chain with state space , initial distribution and transition kernel . Each of the observations , is conditionally independent of the rest of the signal process given , with conditional distribution, , where is a probability kernel from to . The HMM can be represented as:

 X0∼π0(⋅), Xn∣Xn−1∼f(Xn−1,⋅),n≥1 Yn∣Xn∼g(Xn,⋅),n≥0.

We consider a fixed observation sequence , assume that admits a density w.r.t. to some dominating measure and write for brevity . For simplicity we also assume throughout that for all , and , . We then define the sequence of distributions , called prediction filters, as

where is the -algebra associated with the space , and the sequence

 (Zn)n≥0,Z0:=∫Xg0(x)π0(dx),Zn:=Zn−1∫Xgn(x)πn(dx),n≥1. (2)

The interpretation of these definitions is the following: is the distribution of , where for any sequence we write , and is the marginal likelihood of the first observations . In many cases of interest, the distributions and constants cannot be computed exactly, and numerical approximations are needed. A particle filter, shown in Algorithm 1, provides such approximations, denoted respectively and . In Algorithm 1 and , are respectively a distribution and Markov kernels on , which may depend on the observations sequence , but this dependence is suppressed from the notation. We assume throughout the rest of the paper that , and and admit a density w.r.t. to some common dominating measure , and with a slight abuse of notation, the corresponding densities are denoted by , , and .

It is well known that Algorithm 1 provides an unbiased estimate of , i.e. . A detailed account of this fact is given in (Del Moral, 2004, Ch. 9). The main contribution of the present paper is to propose and study a new method to approximate . The approximation is delivered by Algorithm 2 – the Pairs algorithm – which we introduce in the next section, and which must be run in addition to the particle filter used to estimate . Our main motivation for approximating is to calculate . In a recent arXiv manuscript (Lee and Whiteley, 2015), A. Lee and the second author of the present paper have introduced a method which allows one to unbiasedly approximate using the same single run of the particle filter which delivers . As , the method of Lee and Whiteley (2015) allows one to consistently approximate asymptotic variance .

We stress that the Pairs algorithm performs the different task of approximating, for any fixed , the non-asymptotic quantity to arbitrary accuracy controlled by an auxiliary parameter (this statement is made precise in Theorem 2.1 below). Thus the Pairs algorithm allows one to reliably approximate without requiring that is large. We shall later illustrate how this property makes the Pairs algorithm useful within strategies for estimating .

Moreover in Theorem 2.1 we prove an important result regarding the time dependence of the error of the approximation of delivered by the Pairs algorithm, showing that under standard regularity conditions, it is sufficient to increase linearly with to control the relative variance of this approximation. This is in contrast to Lee and Whiteley (2015), who do not provide any results concerning the time-dependence of the errors associated with their estimators.

We note that Chan and Lai (2013) investigated numerical techniques for assessing the asymptotic variance associated with particle estimates of expectations with respect to filtering distributions, but they didn’t explore methods for approximating . We also note that Bhadra and Ionides (2014) proposed to approximate using a “meta-model”, for purposes of optimizing parameters of the particle filter. Their method amounts to fitting an AR(1) process to the output of the particle filter; it seems difficult to assess the bias of their approach and no proof of consistency is given.

## 2 Pairs algorithm

### 2.1 Outline of how the algorithm is derived

The full details of the derivation of the Pairs algorithm are given in Appendix A. We now give an account of some of the main ideas behind this derivation. For this some more notation is needed. Let us introduce the nonnegative integral kernels: for ,

 Q1(x,dy)=g0(x)π0(x)q0(x)q1(y1,y2)δx(dy1)dy2, (3)

and for and ,

 Qn(x,dy)=gn−1(x2)f(x1,x2)qn−1(x1,x2)qn(y1,y2)δx2(dy1)dy2. (4)

In terms of compositions of these kernels, the lack-of-bias property of the particle filter reads as:

 E[ZNn]=π0Q1⋯Qn(1). (5)

The kernels also encapsulate the main ingredients of the particle filter itself, indeed one may take the point of view that Algorithm 1 is actually derived from the , in the sense that resampling is performed according to weights given by evaluating the functions

 Q1(x,X2)=g0(x)π0(x)q0(x),Qn(x,X2)=gn−1(x2)f(x1,x2)qn−1(x1,x2),n≥2, (6)

and sampling is performed using the the Markov kernels:

 Qn(x,⋅)Qn(x,X2). (7)

Now introduce the so–called coalescence operator which acts on functions as . Cérou et al (2011) derived the following representation of the second moment of ,

 E[(ZNn)2]=E[π⊗20Cϵ0Q⊗21Cϵ1⋯CϵnQ⊗2n+1(1)], (8)

where , , is a sequence of i.i.d., -valued random variables with distribution

 P(ϵn=1)=1−P(ϵn=0)=1N,

and is the two-fold tensor product of .

The main idea behind the Pairs algorithm is to identify, using (8), certain nonnegative kernels such that the second moment can be written

 E[(ZNn)2]=π⊗20Q(N)1⋯Q(N)n+1(1).

The details of these kernels are given in the Appendix. Observing the similarity with (5), to obtain the Pairs algorithm we shall derive a particle algorithm from the weighting functions and Markov kernels which are associated with in the same way as (6)-(7) are associated with , the result being the Pairs algorithm. Results for standard particle filters then transfer to the Pairs algorithm directly, which leads to our Theorem 2.1 below.

### 2.2 The algorithm and its properties

In Algorithm 2 both and are parameters. The computational cost of Algorithm 2 is per time step, uniformly in , and the quantity which it delivers can be considered an approximation to , in the sense of Theorem 2.1 below.

###### Theorem 2.1.

If

 supxg0(x)π0(x)q0(x)<+∞ and supx1,x2gn(x2)f(x1,x2)qn(x1,x2)<+∞,∀n≥1, (9)

then for any and ,

 E[Ξ(N,M)n] = E[(ZNn)2],∀M≥1, Ξ(N,M)n [M→∞]a.s.⟶ E[(ZNn)2].

If additionally for each there exist constants , and for each , constants and a probability measure such that

 w−0≤g0(x)π0(x)/q0(x)≤w+0,∀x, (10) w−n≤gn(x2)f(x1,x2)/qn(x1,x2)≤w+n,∀x1,x2,n≥1, (11) ϵ−nμn(⋅)≤qn(x,⋅)≤ϵ+nμn(⋅),∀x,n≥1, (12)

then for any and ,

 M>n+1∑s=0Δs⇒E⎡⎢ ⎢ ⎢⎣⎛⎜ ⎜⎝Ξ(N,M)nE[(ZNn)2]−1⎞⎟ ⎟⎠2⎤⎥ ⎥ ⎥⎦≤4Mn+1∑s=0Δs

where is independent of and .

The proof of Theorem 2.1 is given in Appendix A. The conditions in (10)-(12) are fairly standard in the stability theory of particle filters, but are rather strong: they rarely hold when is an unbounded subset of Attempting to establish similar results under more realistic conditions, for example via the techniques of Whiteley (2013), seems to be a much more difficult task, beyond the scope of the present work, and we leave a full investigation of this matter to future research.

### 2.3 Comparison to using i.i.d. replicates of ZNn

A natural alternative to as an approximation to is to use i.i.d. replicates of and simple averaging,

 ˜Ξ(N,M)n:=1MM∑j=1(ZN,jn)2. (13)

The cost of computing is per time step since it involves copies of Algorithm 1, each using particles.

To illustrate why is to be preferred over in terms of relative variance, consider for simplicity of exposition the case: for , ; for , ; and for , . In this case, for all , we have and in Algorithm 1, are i.i.d. draws from . Then with , , and

 E⎡⎢ ⎢ ⎢⎣⎛⎜ ⎜⎝˜Ξ(N,M)nE[(ZNn)2]−1⎞⎟ ⎟⎠2⎤⎥ ⎥ ⎥⎦ = 1M⎛⎜ ⎜⎝E[(ZNn)4]E[(ZNn)2]2−1⎞⎟ ⎟⎠ (14) = 1M⎛⎝n∏p=0E[πNp(g)4]E[πNp(g)2]2−1⎞⎠ = 1M(Cn+1−1),

where by Jensen’s inequality, with equality holding if and only if is a.s. constant. So if exhibits any stochastic variability at all, in the sense that , then must be scaled exponentially fast with in order to control (14), cf. the linear-in- scaling in Theorem 2.1.

## 3 Numerical examples

We will illustrate the properties of the Pairs algorithm using two numerical examples. The first, in Section 3.1 is a simple toy example, based on a auto-regressive process. The second, in Section 3.2, is a more realistic example involving a Lotka - Volterra system of ODEs, observed in noise. In Section 3.3 we investigated the performance of the pairs algorithm within a strategy for estimating Monte Carlo variance.

Throughout section 3 we denote by a number of pairs used in the pairs algorithm to obtain a reliable, benchmark estimate of the true quantity .

### 3.1 Ar(1) example

The signal of this model is an process, defined by , where we set , , . Assume that . We will also assume that , i.e. we will propose using the actual signal density and we will set , given by , i.e. the process is stationary a priori.

In Figure 1 we compare two approaches for estimating : using the Pairs algorithm, and the standard MC approach using i.i.d. replicates as in (13). We consider two sub–examples: the first one is for comparatively small number of particles , and the second sub–example is with higher number of particles . The plots show for the Pairs algorithm and for the standard MC approach (please refer to Algorithm 2 and (13)). Here we take so that is a reliable, benchmark value of .

In the top left plot of Figure 1 we have chosen . For the equal cost plot on the top right we have chosen and . Here, by “equal cost” we mean that and are chosen such that the execution times of the standard MC algorithm and the Pairs algorithm are the same. The time parameter varies from to in both plots and we plot independent runs of both algorithms in order to compare their variability properties.

The second row of plots in Figure 1 consists of plots for the case of larger number of particles . Again, in the bottom left we are comparing the case where , and in bottom right we are comparing the equal cost case where and . The fact that is lower here than in the case reflects the fact that the cost of the standard MC approach is per time step, compared to for the Pairs algorithm. We have plotted 20 independent runs for both algorithms.

Figure 2 shows boxplots based on 100 independent runs for both algorithms for the case of equal and equal cost. We also have . It is apparent that the estimates of that we obtain using the Pairs algorithm have much less variability than the estimates produced using the standard Monte Carlo approach with i.i.d. replicates (especially for big values of the time parameter ).

### 3.2 Lotka - Volterra system example

In this section we illustrate the numerical performance of the pairs algorithm in the context of a partially observed Langevin approximation to Lotka-Volterra ODE system (Golightly and Wilkinson, 2011). The signal process in the HMM is obtained from a discretization of the stochastic differential equation (SDE) , where , . Here is a vector, each of the components of which is independent standard Brownian motion, are parameters and and are the drift and diffusion coefficients given for the Lotka-Volterra system by

with .

We consider Euler discretization of the SDE with time resolution for some , with the resulting process satisfying

 Xn+(j+1)Δt−Xn+jΔt=α(Xn+jΔt,c)Δt+√β(Xn+jΔt,c)Δtχj (15)

for and , where is a sequence of –independent random variables. The signal process in the HMM, denoted by , consists of a –valued random variable and for a –valued random variable . The model for the observations is , where , , where is the identity matrix. We also assume that we have observed the process at integer times . Following Golightly and Wilkinson (2011), we consider two values of the observation noise variance and . We fix the rate constants , and we will use for the discretization parameter.

We adopt the same approach to constructing the proposal kernels suggested in Golightly and Wilkinson (2011, Section 4.3), in which is chosen to be a tractable Gaussian approximation to the conditional density of given ,. The proposal kernel is given by

 qn+1(xn,xn+1)=m−1∏j=0ψn+(j+1)Δt(xn+jΔt,xn+(j+1)Δt)

where , where , , , , . We consider the process as a HMM, to which the particle algorithms are applied to.

We first obtain a reliable benchmark value of , denoted by , using a single run of the Pairs algorithm with . We compare from the Pairs algorithm with the simple Monte Carlo approximation based on i.i.d. replicates, defined in (13) in Figure 3 for two different values of the observation noise - and . In both cases we plot again for the Pairs algorithm and for the standard MC approach.

On the top left of Figure 3 we have the low noise example. In this example, we set , and . On the top right plot we present the large noise case where we set , and in order to equalize the computational cost. Again, as in the previous example, we have plotted 20 independent runs for both algorithms.

In the two plots, and especially for large values of the time parameter , the estimate that we obtain with the help of the Pairs algorithm has much less variability than the estimate calculated using standard Monte Carlo with i.i.d. replicates. We can clearly see that with the increase of the time parameter , the rate of growth of the variability of the estimates of obtained using the Pairs algorithm is far less than the corresponding rate for the standard Monte Carlo approach (using i.i.d. replicates). The observations about the variability of the estimates in Figure 3 are also supported by the corresponding boxplots, based on 100 independent runs of the two algorithms.

Table 1 shows numerical values for and for different values of the time parameter for the Lotka–Volterra example. We see, that although the scale of the values in Table 1 is small, we still have, by Jensen’s inequality, that .

### 3.3 Estimating Monte Carlo variance

The purpose of this example is to show that the benefits of approximating using the Pairs algorithm carry over to its use within a strategy for both estimating and reporting Monte Carlo variance. As a benchmark for comparisons, we consider the following standard approach based on i.i.d. replicates of a particle filter.

##### MC strategy.

Run independent particle filters, each with particles, to give . Then report:

• as an estimate of

• as an estimate of