Twisted particle filters

# Twisted particle filters

## Abstract

We investigate sampling laws for particle algorithms and the influence of these laws on the efficiency of particle approximations of marginal likelihoods in hidden Markov models. Among a broad class of candidates we characterize the essentially unique family of particle system transition kernels which is optimal with respect to an asymptotic-in-time variance growth rate criterion. The sampling structure of the algorithm defined by these optimal transitions turns out to be only subtly different from standard algorithms and yet the fluctuation properties of the estimates it provides can be dramatically different. The structure of the optimal transition suggests a new class of algorithms, which we term “twisted” particle filters and which we validate with asymptotic analysis of a more traditional nature, in the regime where the number of particles tends to infinity.

[
\kwd
\doi

10.1214/13-AOS1167 \volume42 \issue1 2014 \firstpage115 \lastpage141 \newproclaimremRemark \newproclaimdefn*Definition

\runtitle

Twisted particle filters

{aug}

a]\fnmsNick \snmWhiteley\corref\thanksreft1label=e1]nick.whiteley@bristol.ac.uk and b]\fnmsAnthony \snmLee\thanksreft2label=e2]anthony.lee@warwick.ac.uk \thankstextt1Supported by EPSRC Grant EP/K023330/1. \thankstextt2Supported by CRiSM, an EPSRC-HEFCE UK grant.

class=AMS] \kwd[Primary ]60K35 \kwd62M20 \kwd[; secondary ]60G35 Sequential Monte Carlo \kwdfiltering

## 1 Introduction

A hidden Markov model (HMM) with measurable state space and observation space is a process where is a Markov chain on , and each observation , valued in , is conditionally independent of the rest of the process given . Let and be respectively a probability distribution and a Markov kernel on , and let be a Markov kernel acting from to , with admitting a strictly positive density, denoted similarly by , with respect to some dominating -finite measure. The HMM specified by , and is

 X0∼μ0(⋅),Xn|{Xn−1=xn−1} ∼ f(xn−1,⋅),n≥1, (1) Yn|{Xn=xn} ∼ g(xn,⋅),n≥0.

In practice, one often seeks to fit a HMM to data . This motivates computation of the marginal likelihood of under the model (1). We consider methods for approximate performance of this computation.

Let be the set of doubly infinite sequences valued in . For we shall write the coordinate projection and take as a recursive definition of the prediction filters, the sequence of distributions given by

 πω0 := μ0, (2) πωn(A) := ∫Xπωn−1(dx)g(x,Yn−1(ω))f(x,A)∫Xπωn−1(dx)g(x,Yn−1(ω)),A∈X,n≥1.

We are centrally concerned with the sequence defined by

 Zω0:=1,Zωn:=Zωn−1∫Xπωn−1(dx)g(x,Yn−1(ω)),n≥1. (3)

Due to the conditional independence structure of the HMM, is the conditional distribution of given ; and is the marginal likelihood evaluated at the point . The simplest particle filter, known as the “bootstrap” algorithm gordon1993novel (), is given below. It yields an approximation, , of each .

Convergence properties of particle algorithms in the regime are well understood del1999central (), smc:the:C04 (), smc:the:K05 (), smc:the:DM08 () and their stability properties have been expressed through finite- error bounds smc:the:DMG01 (), smc:the:CdMG11 (), whiteley2011 (), time-uniform convergence legland2003robustification (), le2004stability (), smc:the:OR05 (), smc:the:vH09 () and control on asymptotic variance expressions del2001interacting (), favetto2012asymptotic (), whiteley2011 (), douc2012long (). Our aim is to rigorously address comparative questions of how and why one algorithm may outperform another, and how it is possible to modify standard algorithms in order to improve performance. Our study is formulated in a generic framework which accommodates standard particle algorithms and novel extensions. As an introduction we discuss some of our intentions and findings in the context of the bootstrap particle filter as per Algorithm 1; more precise statements are given later.

Writing for expectation with respect to the law of the bootstrap particle filter processing a fixed observation sequence , the well-known lack-of-bias property (smc:theory:Dm04 (), Proposition 7.4.1) reads

 EωN[Zω,Nn]=Zωn, (4)

and holds for any and . This property is desirable because it allows particle filters to be used within “pseudo-marginal”-type algorithms (see Andrieu2009 () and references therein), and plays a role in explaining the validity of some compound Monte Carlo techniques such as particle Markov chain Monte Carlo Andrieu2010 (). The accuracy of influences the performance of such schemes Andrieu2012 (). We shall analyze novel particle algorithms arising through changes of measure on the left-hand side of (4), similarly enjoying lack-of-bias, and which could therefore be used in lieu of more standard particle filters in compound Monte Carlo algorithms. The resulting approximation of will be of the form

 ˜Zω,Nn:=Zω,Nn⋅n∏p=1ϕω,Np, (5)

where is exactly the same functional of the particles as in Algorithm 1 and is a sequence of functionals chosen such that, if we write for expectation under the (as yet unspecified) alternative sampling law, then the lack of bias property is preserved:

 (6)

Our main objective is to identify “good” choices of alternative sampling laws, possibly allowing the transitions of the particles to depend on past and/or future observations. Our criterion for performance arises from a study of the normalized second moment of , in the regime where is fixed and , in an -pathwise fashion.

For now, let us still consider as fixed. Then under the probability law corresponding to Algorithm 1, the generations of the particle system, with , form an -valued time-inhomogeneous Markov chain. Let be the family of Markov kernels such that for each , is given by

 Mω(x,dz)=N∏i=1∑Nj=1g(xj,Y0(ω))f(xj,dzi)∑Nj=1g(xj,Y0(ω)), (7)

with and . Let be the shift operator, , so that, for example, . The -fold iterate of will be written with . It is then clear that the sampling steps of Algorithm 1 implement

 ζ0∼μ⊗N0,ζn|ζn−1∼Mθn−1ω(ζn−1,⋅),n≥1. (8)

Variance growth rates. For a family of Markov kernels belonging to a broad class of candidates and which may depend on in a rather general fashion, but subject to and other regularity conditions, we shall consider sampling the particle system according to

 ζ0∼μ⊗N0,ζn|ζn−1∼˜Mθn−1ω(ζn−1,⋅),n≥1, (9)

and simply setting

 ϕω,Nn:=dMθn−1ω(ζn−1,⋅)d˜Mθn−1ω(ζn−1,⋅)(ζn),n≥1. (10)

Then letting denote expectation under the Markov law (9), and with as in (5), we of course achieve (6).

Let be endowed with the product -algebra and let be a probability measure on . We stress that is not necessarily a measure on observation sequences derived from the particular HMM (1), nor indeed any HMM. Under the assumption that is -preserving and ergodic, and under certain other regularity conditions, application of our first main result, Proposition 4, establishes, for any fixed , existence of a deterministic constant , depending on such that

 1nlog˜EωN[(˜Zω,Nn)2](Zωn)2⟶ΥN(˜M)as n⟶∞, for P-a.a. ω. (11)

It must be the case that , because variance is nonnegative and the lack of bias property (6) holds. We shall see that typically .

Optimal sampling. Our second main result (Theorem 2.4) identifies, for any fixed and among the class of candidates, the essentially unique choice of the family which achieves . It turns out that this optimal choice arises from a particular form of re-weighting applied to each transition kernel and is defined in terms of a family of functions which are, in abstract terms, generalized eigenfunctions associated with algebraic structures underlying the particle algorithm. In the context of the bootstrap particle filter, has the following interpretation. is the prediction filter initialized at time and run forward to time zero, giving a distribution over conditional on . Then, letting be the distribution over obtained by further conditioning on , arises as the pointwise limit:

 hω(x)=limn→∞dΠωndπθ−nωn(x).

Theorem 2.4 establishes that for any , if and only if, for -almost all there exists a set such that is null (with respect to an as yet unnamed measure) and for any ,

 ˜Mω(x,B)=∫BMω(x,dx′)hθω(x′)∫XNMω(x,dx′)hθω(x′)for % all B∈X⊗N, (12)

where

 hω\dvtxx=(x1,…,xN)∈XN⟼N−1N∑i=1hω(xi)∈R+.

In the rare-event and large deviations literatures, the action of re-weighting Markov kernels using nonnegative eigenfunctions is generically referred to as “twisting.” Since in the present context we are applying re-weighting to the transitions of the entire particle system, we shall adopt this terminology and consider a class of algorithms which we refer to as twisted particle filters.

Twisted particle filters. The form of the optimal transition (12), where is re-weighted by an additive, nonnegative functional, leads us to consider a new class of particle algorithms. Consider a family of functions and let be defined by

 ˜Mω(x,dx′) = Mω(x,dx′)\boldsψθω(x′)∫XNMω(x,dz)\boldsψθω(z), (13) \boldsψω\dvtxx = (x1,…,xN)∈XN⟼N−1N∑i=1ψω(xi)∈R+. (14)

This setup clearly admits the optimal transition () and the standard transition (take , for some positive constant ) as special cases. Then introducing

 ˜gω(x):=g(x,Y0(ω))∫Xf(x,dz)ψθω(z),

we observe that , defined in (10), is given by

 ϕω,Nn=[1N−1∑Ni=1g(ζin−1,Yn−1(ω))]∑Ni=1˜gθn−1ω(ζin−1)∑Ni=1ψθnω(ζin). (15)

Since is an additive functional, it is clear that as per (13) is of mixture form, and introducing the -dependent Markov kernel

 ˜fω(x,dx′):=f(x,dx′)ψθω(x′)∫Xf(x,dz)ψθω(z),

the procedure of sampling from (9) and evaluating can be implemented through Algorithm 2, in which and are auxiliary random variables employed for algorithmic purposes, and the recursion for arises from the definition of combined with (5) and (15).

The difference between the sampling steps of Algorithm 2 and Algorithm 1 is fairly subtle: loosely speaking, at each time step, of the particles in Algorithm 2 are propagated by the same mechanism as in Algorithm 1. However, with an appropriate choice of , the fluctuation properties of under (13)–(14) can be dramatically different to those of under (8)–(7). Our third main result (Theorem 3.1) concerns asymptotic fluctuation properties of twisted particle approximations when and are fixed and . Under mild regularity conditions, we prove central limit theorems for generic particle systems under transitions like (13)–(14). For bounded functions centered w.r.t. , we find that the asymptotic variance associated with is the same when sampled under Algorithms 1 and 2, but the asymptotic variances of and are, in general, different.

The finite-, finite- behavior of the relative variance of the standard estimate from Algorithm 1 is well understood. Under certain regularity assumptions, it can be deduced from smc:the:CdMG11 (), Theorem 5.1, that in our setting must satisfy

 ΥN(M)≤log[1+CN−1] (16)

for some finite constant which depends on and . Our fourth main result (Proposition 5) generalizes (16) to the case of twisted particle filters. With as in (11), as in (13), and under some regularity conditions,

 ΥN(˜M)≤log[1+C′N−1supω,x,x′∣∣∣hω(x)ψω(x)−hω(x′)ψω(x′)∣∣∣],

where is a constant. Thus, whenever , by choosing “close” to , we can in principle achieve .

The rest of the paper is structured as follows. Section 2 introduces our general setting, addressing the generalized eigenvalue properties of families of nonnegative kernels and sampling laws of the particle systems we consider. Section 3 narrows attention to twisted particle filters and considers some properties in the regime where is fixed and , and vice-versa. Section 4 discusses the application of our main results to sequential importance sampling, bootstrap and auxiliary particle filters. The proofs of Lemmas 34, Propositions 15 and Theorems 2.43.1 are housed in the supplementary material tpf-supp ().

## 2 Nonnegative kernels, sampling particles and variance growth

### 2.1 Notation and conventions

Let , , , , and and be as in Section 1. Expectation w.r.t. will be denoted by . Let , and be respectively the collections of measures, probability measures and real-valued, bounded, -measurable functions on . We write

 ∥φ∥:=supx∣∣φ(x)∣∣

and

 μ(φ):=∫Xφ(x)μ(dx)for any φ∈L(X),μ∈M(X). (17)

We will be dealing throughout with various real-valued functions on (and more generally , etc.). For any such function , we write the -section of as , . For a function it will sometimes be convenient to write instead of the more standard . We will need to express various integration operations involving functions on and their -sections, so for completeness we quote the following facts of measure theory (see, e.g., doob1994measure (), Chapter VI, which will be used repeatedly without further comment): when is measurable w.r.t. to , then for every , the -section is measurable w.r.t. ; and, furthermore, for any -finite measure on , if is integrable w.r.t. to , then the function acting which maps is measurable w.r.t. and is -integrable.

Let be two functions, each acting and each measurable w.r.t. . We will need to talk about the sets on which such functions take the same values. For any , let and let be a -finite measure on . In order to avoid having to make the sets explicit in various statements, we will write by convention

 \mbox{for }P-a.a. ω,φω(x)=˜φω(x)\mbox{for }μ-a.a. x

to mean .

### 2.2 Generalized eigenvalue theory for nonnegative kernels

Fix arbitrarily and let be such that for each , and is -measurable for each . Then for any , is a Markov kernel on and when it is important to emphasize this perspective, we shall often write instead of . We shall adopt similar notation for other kernels.

For any fixed , let denote expectation with respect to the law of the time-inhomogeneous Markov chain , with each valued in , initialized from and , for . Let be a -measurable, strictly positive and bounded function. {rem} This setup is purposefully generic and accommodates, as one particular instance, the case

 Gω(x)=g(x,Y0(ω)),Mω(x,dx′)=f(x,dx′)∀ω∈Ω, (18)

where and are as in Section 1, and then . Other instances will be discussed in Section 4. We next introduce two hypotheses. Since , (H1) amounts to saying that the observation process is stationary and ergodic. (H2) is a strong mixing condition that rarely holds when and are noncompact, and some results do not rely on both (19) and (20) simultaneously but their combination allows us to avoid a layer of technical presentation which would further lengthen and complicate our proofs.

{longlist}

[(H2)]

The shift operator preserves and is ergodic.

There exist constants , , and such that

 G(ω,x)G(ω′,x′) ≤ β∀(ω,ω′,x,x′)∈Ω2×X2, (19) ε−ν(⋅) ≤ M(ω,x,⋅)≤ε+ν(⋅)∀(ω,x)∈Ω×X. (20)

We now introduce the nonnegative kernel

 Q\dvtxΩ×X×X→R+,Q(ω,x,dx′):=G(ω,x)M(ω,x,dx′). (21)

For any fixed , we define the operators

 Qω(φ)(x) := ∫XQω(x,dx′)φ(x′),φ∈L(X), (22) μQω(⋅) := ∫Xμ(dx)Qω(x,⋅),μ∈M(X), (23)

and let be defined recursively by

 Qω0:=Id,Qωn=Qωn−1Qθn−1ω,n≥1. (24)

This operator notation allows us to express

 μ0Qωn(φ)=Eω[φ(Xn)n−1∏p=0Gθpω(Xp)],n≥1,φ∈L(X). (25)

It is well known that (H1) and (H2) together are sufficient to establish the following result; see leroux1992maximum () for related ideas in the context of HMMs.

###### Proposition 1

Assume (H1) and (H2). Then there exists a constant independent of the initial distribution such that

 1nlogEω[n−1∏p=0Gθpω(Xp)]→Λas n→∞,P-a.s. (26)

It turns out that Proposition 1 is one element of a generalized eigenvalue theory for the nonnegative kernel . Another element is Proposition 2, which involves the following objects. Let be defined by

 Φω(μ)=μQωμQω(1),μ∈P(X),

and let be the family of operators defined recursively by

 Φω0:=Id,Φωn:=Φθn−1ω∘Φωn−1,

so that each acts . Under these definitions, for any ,

 Φωn(μ)=μQωnμQωn(1), (27)

which can be verified by induction, since from the above definitions , and when (27) holds,

 Φωn+1(μ)=(Φθnω∘Φωn)(μ)=Φωn(μ)QθnωΦωn(μ)Qθnω(1)=μQωnQθnωμQωnQθnω(1)=μQωn+1μQωn+1(1).
{rem}

In the setting , , then if and are respectively the initial distribution and prediction-filter as in (2), we have

 πωn+1=Φθnω(πωn),n≥0.
{rem}

Under (H2), it is known that is exponentially stable with respect to initial conditions (e.g., smc:theory:Dm04 (), Chapter 4) that is, there exist constants and such that for any and any ,

 supω∈Ωsupμ,μ′∈P(X)∣∣[Φωn(μ)−Φωn(μ′)](φ)∣∣ ≤ ∥φ∥Cρn. (28)

Equation (28) is used extensively in the proof of the following proposition, which is a variation on the theme of Kifer’s Perron–Frobenius theorem for positive operators in a random environment kifer1996perron (), Theorem 3.1.

###### Proposition 2

Assume (H2). {longlist}[(1)]

Fix . Then the limits

 ηω(A) := limn→∞Φθ−nωn(μ)(A),ω∈Ω,A∈X, (29) h(ω,x) := limn→∞Qωn(1)(x)Φθ−nωn(μ)Qωn(1),ω∈Ω,x∈X, (30)

exist and define a family of probability measures and an -measurable function .

In fact, and are independent of the particular chosen in part (1) and there exist constants and such that for any ,

 supω∈Ωsupμ∈P(X)∣∣[Φθ−nωn(μ)−ηω](φ)∣∣≤∥φ∥Cρn,n≥1 (31)

and

 supω∈Ωsupx∈Xsupμ∈P(X)∣∣∣Qωn(1)(x)Φθ−nωn(μ)Qωn(1)−h(ω,x)∣∣∣≤Cρn,n≥1. (32)

is measurable w.r.t. and we have

 sup(ω,ω′)∈Ω2λωλω′<∞,sup(ω,ω′,x,x′)∈Ω2×X2h(ω,x)h(ω′,x′)<∞. (33)

Among all triples which consist of (i) an -indexed family of probability measures on , (ii) an -valued, not identically zero, measurable function on , and (iii) a measurable function on , the triple , with as in part (1) and as in part (3), uniquely satisfies the system of equations

 ηωQω=λωηθω,Qω(hθω)=λωhω,ηω(hω)=1for all ω∈Ω. (34)

The connection with Proposition 1 is as follows:

###### Proposition 3

Assume (H1), (H2) and let be as in Proposition 1 and be as in Proposition 2. Then

 (35)

In the setting of HMMs as per Remark 2.2, equalities like the first one in (35) appear routinely in the study of likelihood-based estimators leroux1992maximum (), randal2011asymptotic (). However, it is the second equality in (35), and generalizations thereof, which shall be crucial for our purposes in the sequel. {rem} If one weakens the “-step” condition (20) to an -step version for some , then Propositions 13 can easily be generalized, working with the kernel instead of . Part of the utility of the uniform in and bounds in (H2) is that various parts of Proposition 2 hold uniformly over . If one allows -dependent constants and measures in (19) and (20), and imposes certain explicit compactness and continuity assumptions and (H1), then kifer1996perron (), Theorem 3.1, provides a partial alternative to our Proposition 2. We proceed by introducing the laws of the particle systems of interest.

### 2.3 Law of the standard particle system

Unless stated otherwise, in this section we fix arbitrarily and write for the collection of probability measures on .

Let be given, in integral form, by

 M(ω,x,dz)=N∏i=1[∑Nj=1G(ω,xj)M(ω,xj,dzi)∑Nj=1G(ω,xj)], (36)

where . Each member of the family is a Markov transition kernel for the entire -particle system according to a “multinomial” resampling scheme with fitness function , followed by conditionally independent mutation according to .

Now for any given , we shall denote by expectation with respect to the law of the Markov chain , with each valued in and

 ζ0∼μ⊗N0,ζn|ζn−1∼Mθn−1ω(ζn−1,⋅). (37)

We define, with ,

 G\dvtx(ω,x)∈Ω×XN⟼1NN∑i=1G(ω,xi)∈R+. (38)
{rem}

For any , if we define the function

 \boldsφ\dvtxx=(x1,…,xN)∈XN⟼1NN∑i=1φ(xi)∈R,

then the lack-of-bias property of the particle approximation smc:theory:Dm04 (), Proposition 7.4.1, is

 EωN[\boldsφ(ζn)n−1∏p=0Gθpω(ζp)]=Eω[φ(Xn)n−1∏p=0Gθpω(Xp)]. (39)
{rem}

When and , the sampling recipe for simulating the process according to (37) is the bootstrap particle filter: Algorithm 1. Furthermore, the particle approximation of is then . To see it is unbiased, apply (39) with .

Part of our investigation will develop some limit theory for

 EωN[∏n−1p=0Gθpω(ζp)2]Eω[∏n−1p=0Gθpω(Xp)]2, (40)

when is fixed and . Our notation and (39) are intended to hint that the phenomena described in Propositions 13 are relevant to the study of (40). Indeed, this is the direction in which we are heading. However, we will actually study an object more general than (40), arising from a more general form of particle approximation, for the particle system is distributed according to some Markov law, possibly different to (37).

### 2.4 Alternative sampling of the particle system

Let us introduce , possibly different from . For fixed , now denote by expectation with respect to law of the Markov chain

 ζ0∼μ⊗N