
# Multilevel Monte Carlo for Smoothing via Transport Methods

Jeremie Houssineau, DSAP, National University of Singapore. Email: stahje@nus.edu.sg
Ajay Jasra, DSAP, National University of Singapore. Email: staja@nus.edu.sg
Sumeetpal S. Singh, Department of Engineering, University of Cambridge and The Alan Turing Institute. Email: sss40@cam.ac.uk
###### Abstract

In this article we consider recursive approximations of the smoothing distribution associated to partially observed \glsplsde, which are observed discretely in time. Such models appear in a wide variety of applications, including econometrics, finance and engineering. This problem is notoriously challenging, as the smoother is not available analytically and hence requires numerical approximation. This usually consists of applying a time-discretization to the \glssde, for instance the Euler method, and then applying a numerical (e.g. Monte Carlo) method to approximate the smoother. This has led to a vast literature on methodology for solving such problems, perhaps the most popular of which is based upon the \glspf, e.g. [9]. It is well-known in the context of this problem that, when filtering, the cost of the particle filter to achieve a given \glsmse for estimates can be improved upon, in the sense that the computational effort needed to reach this target \glsmse can be reduced by using \glsml methods [12, 13, 18], via the \glsmlpf [16, 20, 21]. For instance, under assumptions, for the filter, some diffusions and the specific scenario of Euler discretizations with non-constant diffusion coefficients, to obtain a \glsmse of $\mathcal{O}(\epsilon^2)$ for some $\epsilon>0$, the cost of the \glspf is $\mathcal{O}(\epsilon^{-3})$ and that of the \glsmlpf is $\mathcal{O}(\epsilon^{-2.5})$. In this article we consider a new approach that replaces the particle filter, using the transport methods of [27]. The proposed method improves upon the \glsmlpf in that one expects, under assumptions and for the filter in the same case mentioned above, that to obtain a \glsmse of $\mathcal{O}(\epsilon^2)$ the cost is $\mathcal{O}(\epsilon^{-2})$. This is established theoretically in an “ideal” example and numerically in numerous examples.

Keywords: Transport map, Stochastic differential equation, Multilevel Monte Carlo

AMS subject classifications: 62M05, 60J60

## 1 Introduction

The smoothing problem often refers to the scenario where one has an unobserved Markov chain (or signal) in discrete or continuous time and one is interested in inferring the hidden process on the basis of observations, which depend upon the hidden chain. The case we consider is where the hidden process follows a \glssde and the observations are regularly recorded at discrete times; given the signal at a time, the observation is assumed to be conditionally independent of all other random variables. Filtering is to infer some functional of the hidden state at a given time on the basis of all the observations up to that time, and the smoothing problem is to infer some functional of potentially all the states at the discrete observation times, again given all the observations. It is often of interest to do this recursively in time. This modelling context is relevant for many real applications in econometrics, finance and engineering; see e.g. [4] and the references therein.

The smoothing problem is notoriously challenging. Supposing one has access to the exact transition of the \glssde, then unless the observation density is Gaussian and depends linearly on the hidden state, and the transition density is also Gaussian depending linearly on the previous state, the filter and smoother are not analytically tractable (unless the state-space of the position of the diffusion at any given time is finite and of small cardinality); see [5]. However, it is seldom the case that even the transition density (or some unbiased approximation of it, e.g. [11] and the references therein) is available; this is assumed throughout the article. Thus, typically, one time-discretizes the diffusion process and then seeks to perform filtering and smoothing for the time-discretized model. This latter task is still challenging as it remains analytically intractable. There is a vast literature on how to numerically approximate the filter/smoother (e.g. [7]), perhaps the most popular method of which is the particle filter. This is a method whose cost grows linearly with the time parameter and which generates samples in parallel. These samples are put through sampling and resampling operations. It is well-known that when estimating the filter, the error is uniform in time. For the smoother, the error often grows due to the so-called path degeneracy problem and, indeed, there are many smoothing problems for which it is not appropriate; see [23] for some review and discussion. In the context of the problem in this article, when only considering the filter, ignoring the time parameter and under assumptions, to obtain a \glsmse of $\mathcal{O}(\epsilon^2)$ for some $\epsilon>0$, the cost of the \glspf is $\mathcal{O}(\epsilon^{-3})$. The \glsmse takes into account the exact filter (i.e. the one with no time discretization).

\Glsmlmc methods [12, 13, 14, 15, 18] are of interest in continuum systems which have to be discretized in one dimension, just as in this article (extensions to discretization in multiple dimensions have been proposed and studied in [6, 17]). We explain the idea informally as follows: let the time parameter $t$ be fixed and denote by $p^l_t$ the filter associated to a (say Euler) discretization level $l$ with time-step $h_l = 2^{-l}$, and for bounded $\varphi$ denote by $p^l_t(\varphi)$ the expectation of $\varphi$ \glswrt the filter. Then the \glsmlmc method is based upon the following approach. Consider a sequence of discretizations $l\in\{0,\dots,L\}$, where $L$ is the most accurate (finest) discretization and $0$ the least (coarsest); the \glsml identity is

$$p^L_t(\varphi) = \sum_{l=0}^{L}\bigl(p^l_t - p^{l-1}_t\bigr)(\varphi)$$

where $p^{-1}_t$ is an arbitrary measure satisfying $p^{-1}_t(\varphi) = 0$ for every $\varphi$. The idea is then to sample $N_0$ independent samples from $p^0_t$ and then, independently for each $l\in\{1,\dots,L\}$, to independently sample $N_l$ coupled pairs from a coupling of the pair $(p^l_t, p^{l-1}_t)$. The \glsmlmc estimator is then

$$\frac{1}{N_0}\sum_{i=1}^{N_0}\varphi(X^0_{t,i}) + \sum_{l=1}^{L}\frac{1}{N_l}\sum_{i=1}^{N_l}\bigl[\varphi(X^l_{t,i}) - \varphi(X^{l-}_{t,i})\bigr]$$

where the $X^0_{t,i}$ are i.i.d. according to $p^0_t$ and the pairs $(X^l_{t,i}, X^{l-}_{t,i})$ are i.i.d. from a coupling of $(p^l_t, p^{l-1}_t)$. To obtain a \glsmse of $\mathcal{O}(\epsilon^2)$, one sets $L$ such that the squared bias is $\mathcal{O}(\epsilon^2)$ (the bias is known in the context of interest). If one has $\mathbb{E}\bigl[(\varphi(X^l_t)-\varphi(X^{l-}_t))^2\bigr] = \mathcal{O}(h_l^\beta)$ for some $\beta>0$, then one can try to minimize (\glswrt $N_0,\dots,N_L$) the cost ($\mathcal{O}(\sum_l N_l h_l^{-1})$ for an Euler discretization) subject to the variance being $\mathcal{O}(\epsilon^2)$. [12] finds a solution to this problem. The main issue in the context of smoothing is that one (typically) does not know how to sample from the smoothers, nor from the couplings.
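The telescoping-plus-coupling idea can be sketched in a few lines. The following toy example (not from the paper) uses a hypothetical level-$l$ approximation `f_level` with a known $\mathcal{O}(2^{-l})$ bias, and feeds the same base samples to both levels of each difference, as a coupling would:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical level-l approximation f_l(z) = z**2 + 2**(-l) of f(z) = z**2,
# so the bias at level l is exactly 2**(-l) (an illustrative stand-in).
def f_level(z, l):
    return z**2 + 2.0 ** (-l)

L, N = 5, 100_000
est = np.mean(f_level(rng.standard_normal(N), 0))   # coarsest term
for l in range(1, L + 1):
    z = rng.standard_normal(N)    # the SAME draws feed both levels: a coupling
    est += np.mean(f_level(z, l) - f_level(z, l - 1))
# est targets E[f_L(Z)] = 1 + 2**(-L) for Z ~ N(0, 1)
```

Because each difference is evaluated on shared inputs, its variance is much smaller than that of two independent level estimates, which is the source of the multilevel cost reduction.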

In [16, 20, 21] it is shown how to utilize the \glspf to leverage the potential decrease in cost to obtain a given \glsmse. This has been termed the MLPF. The idea is to use couplings in the Euler dynamics and in the resampling operation of a \glspf. This has later been refined in [26]. To our knowledge, the only theoretical work for the \glsmlpf, in [20], shows that to obtain a \glsmse of $\mathcal{O}(\epsilon^2)$ the cost of the \glsmlpf is $\mathcal{O}(\epsilon^{-2.5})$, for some specific (non-constant diffusion coefficient) models and under particular assumptions. This is known to be worse than the rates obtained in [12] in the case where there are no observations. Here and throughout, the time parameter is omitted from the discussion on cost and error, despite the fact that these are important considerations in general.

The main idea in this article is to adopt an alternative method to the \glspf. The approach is to use transport methods [27]. Transport maps have been used for Bayesian inference [10, 19] and more specifically for parameter estimation in [24], based on a related multi-scale idea. The basic idea is, given a simple probability distribution from which one can sample, to obtain a mapping such that the distribution of the mapping evaluated at a sample from the simple distribution is exactly the distribution one desires. In [27] it is shown how to develop numerical approximations of maps associated exactly to the distributions of interest in this article. These approximations often induce i.i.d. Monte Carlo approximations of expectations of interest, albeit with a numerical error associated to the approximation of the transport map. As mentioned in [22], it is simple to induce coupled pairs using the method of [27], and this is exactly what is done in this paper. The potential advantages of this method relative to the \glsmlpf are then as follows:

1. The \glsml rate lost by coupled resampling can be regained.

2. The method is useful for approximating the smoother, whereas the approach in [20, 21] is typically not useful for smoothing at large time-lags.

In this article we establish that 1 can hold in an ideal special case, where the model is linear and Gaussian and the transport map is exact. This result is reinforced by numerical examples which show that the result seems to hold more generally. The significance of 1 is that to obtain a \glsmse of $\mathcal{O}(\epsilon^2)$ the cost is $\mathcal{O}(\epsilon^{-2})$; this is better than the \glsmlpf. Point 2 relates to the aforementioned path degeneracy effect, which can mean \glsplpf (and hence the \glsmlpf) are not so useful in the context of large-lag smoothing.

The structure of the article is as follows: Section 2 introduces the model and transport methodology. Section 3 presents the multilevel approach and the MLPF as well as the mechanisms underlying the computation of transport maps for a given level of discretization. The efficiency of the proposed approach is shown numerically on increasingly challenging scenarios in Section 4.

## 2 Methodology for \glssde smoothing

In this section, the considered notations and assumptions for the smoothing of \glsplsde are presented, together with a brief overview of the transport methodology.

### 2.1 The \glssde model

Throughout the article, all random variables will be assumed to be defined on the same complete probability space and will be denoted by upper-case letters, while their realisations will be in lower case. We consider a diffusion process on $\mathbb{R}^d$ of the form

$$dX_t = a(X_t)\,dt + b(X_t)\,dW_t, \qquad t\in[0,T], \tag{1}$$

where $T$ is the final time, $\{W_t\}_{t\ge0}$ is a Brownian motion on $\mathbb{R}^d$, $a$ is in the set of twice continuously differentiable mappings from $\mathbb{R}^d$ to itself and $b$ is twice continuously differentiable from $\mathbb{R}^d$ to the space of square matrices of size $d$. The mapping $b$ is assumed to be such that $b(x)b(x)^{\mathrm{t}}$ is positive definite for all $x\in\mathbb{R}^d$, with $\cdot^{\mathrm{t}}$ denoting transposition. Moreover, the drift and diffusion coefficients are assumed to be globally Lipschitz, i.e. there exists $c>0$ such that

$$|a(x)-a(x')| + |b(x)-b(x')| \le c\,|x-x'|$$

for all $x, x'\in\mathbb{R}^d$. The initial distribution of the process, i.e. the distribution of $X_0$, is denoted $p_0$ (and might be equal to the Dirac measure $\delta_{x_0}$ for some initial condition $x_0\in\mathbb{R}^d$). It is assumed that the $n$th-order moment of $p_0$, defined as $\int |x|^n\,p_0(x)\,dx$, is finite for any $n\in\mathbb{N}$. Probability density functions will be considered with respect to the Lebesgue measure on $\mathbb{R}^d$ and both probability measures and their corresponding density functions will be referred to by the same notation.

The distribution of $X_k$, $k\in\{1,\dots,T\}$, given a realisation $x_{k-1}$ of the state $X_{k-1}$ is denoted $Q(x_{k-1},\cdot)$. In addition to the fact that the expression of the Markov transition $Q$ is unavailable in general, it is not usually possible to devise an unbiased estimator for it or even to sample from it. In the case where $d = 1$, one can obtain “skeletons” of exact paths using the algorithm of [3, 2]; however, the extension of this approach to \glsplsde of higher dimensions might not be possible [1].

The diffusion process is assumed to be observed at all the integer-valued times in $[0,T]$, so that the final time $T$ is also assumed to be an integer. These assumptions are made for the sake of notational simplicity and can be easily removed. For all $k\in\{0,\dots,T\}$, the observation $Y_k$ is a random variable that is conditionally independent of the state at times other than $k$ given $X_k$. The observation process can be expressed in general as

$$Y_k = g_k(X_k, V_k) \tag{2}$$

where $g_k$ is a deterministic observation function and where $\{V_k\}_{k\in\{0,\dots,T\}}$ is a collection of independent random variables. It is assumed, without any real loss of generality, that both $g_k$ and the distribution of $V_k$ do not depend on the time index $k$; the corresponding likelihood for a realisation $y_k$ of $Y_k$ is denoted $\ell(\cdot, y_k)$.

### 2.2 Smoothing for \glsplsde

Throughout the article, joint states will be denoted either by $x_{0:k} \doteq (x_0,\dots,x_k)$ for some $k\in\mathbb{N}$ or by $x_S \doteq (x_t)_{t\in S}$, with $S$ a finite subset of $[0,T]$. The smoothing distribution associated with the \glssde Eq. 1 is defined formally as the joint law of the diffusion process at all the integer times given realisations $y_{0:T}$ of the observation process Eq. 2, and can be expressed for any $x_{0:T}\in(\mathbb{R}^d)^{T+1}$ as

$$p(x_{0:T}) = \frac{\ell(x_0,y_0)\,p_0(x_0)\prod_{k=1}^{T}\bigl[Q(x_{k-1},x_k)\,\ell(x_k,y_k)\bigr]}{\int \ell(x'_0,y_0)\,p_0(x'_0)\prod_{k=1}^{T}\bigl[Q(x'_{k-1},x'_k)\,\ell(x'_k,y_k)\bigr]\,dx'_{0:T}}.$$

The dependence of the smoothing distribution on the realisations of the observation process is omitted for the sake of notational simplicity. This is justified by the fact that these observations will be fixed in the remainder of the article, so that the smoothing distribution and its approximations will always be conditioned on the same given observations. The expression of $p$ is a direct consequence of Bayes’ theorem applied to the prior describing the law of the unobserved (hidden) diffusion process, together with the joint likelihood whose expression results from the conditional independence of the observations.

Using the same principle of implicit conditioning as with the smoothing distribution, the filtering distribution at time $k$ is defined as the law of $X_k$ given the realisations $y_{0:k}$ and is expressed recursively as

$$p_k(x_k) = \frac{\ell(x_k,y_k)\int Q(x_{k-1},x_k)\,p_{k-1}(x_{k-1})\,dx_{k-1}}{\int \ell(x'_k,y_k)\,Q(x'_{k-1},x'_k)\,p_{k-1}(x'_{k-1})\,dx'_k\,dx'_{k-1}}$$

for any $k\in\{1,\dots,T\}$ and any $x_k\in\mathbb{R}^d$. The marginal distribution of $x_k$ induced by the smoothing distribution corresponds to the filtering distribution when $k = T$ only.

The objective in this article can now be formally expressed as follows: to compute the expectation $p(\varphi)$ of some bounded measurable function $\varphi$ on $(\mathbb{R}^d)^{T+1}$. Although the above formulation casts the considered problem into the standard Bayesian inference framework, the Markov transition $Q$ is unavailable in general, so that expressing analytically the distributions $p$ and $p_k$ is not usually possible. The first step toward our objective is then to apply a time-discretization to the \glssde Eq. 1, which, for the sake of simplicity, is illustrated with Euler’s method at some discretization level $l\in\mathbb{N}_0$:

$$X_{t+h_l} = X_t + h_l\,a(X_t) + \sqrt{h_l}\,b(X_t)\,U_t, \tag{3}$$

for the time-step $h_l = 2^{-l}$ and for all $t\in\{0, h_l, 2h_l, \dots, T-h_l\}$, with $\{U_t\}_t$ a collection of independent Gaussian random variables with density $\phi(\cdot\,;0, I_d)$, where $I_d$ is the identity matrix of size $d$. The choice of time step $h_l = 2^{-l}$ is made for the sake of convenience and is not necessary; the only requirement for both the \glsmlpf and the multilevel transport is that the ratio $h_{l-1}/h_l$ has to be an integer. The number of time steps from a given observation time up to and including the next observation time, that is in the interval $(k, k+1]$ for some $k\in\{0,\dots,T-1\}$, is $M_l = 2^l$. The numerical scheme Eq. 3 yields a Markov transition between two successive discretization times defined as

$$K_l(x,\cdot) = \phi\bigl(\cdot\,;\,x + h_l\,a(x),\;h_l\,b(x)b(x)^{\mathrm{t}}\bigr)$$

for any $x\in\mathbb{R}^d$, which enables the approximation of $Q$ by another Markov kernel $Q_l$ defined as

$$Q_l(x,\cdot) = \underbrace{K_l \cdots K_l}_{M_l \text{ times}}(x,\cdot),$$

where, for any transition kernels $K$ and $K'$, $KK'(x,\cdot) = \int K(x, dx')\,K'(x',\cdot)$. The smoothing distribution induced by Eq. 3, which approximates $p$, is expressed on $(\mathbb{R}^d)^{|T_l|}$, with $T_l \doteq \{0, h_l, 2h_l, \dots, T\}$ the set of discretization times, instead of $(\mathbb{R}^d)^{T+1}$, and is characterised by

$$p^l(x_{T_l}) \propto p_0(x_0)\prod_{t\in T_l\setminus\{T\}} K_l(x_t, x_{t+h_l})\prod_{k=0}^{T}\ell(x_k, y_k)$$

for any $x_{T_l}\in(\mathbb{R}^d)^{|T_l|}$. Marginalising \glswrt all $x_t$ such that $t\notin\{0,\dots,T\}$ gives a distribution on $(\mathbb{R}^d)^{T+1}$ which depends on the same time steps as $p$. It is understood that the error in the approximation of $Q$ and $p$ by $Q_l$ and $p^l$ decreases when $l$ increases and tends to $0$ as $l$ tends to infinity. The measure $p^l(\varphi)$ of the function $\varphi$ is understood as the measure of the canonical extension $\bar\varphi$ of $\varphi$ from $(\mathbb{R}^d)^{T+1}$ to $(\mathbb{R}^d)^{|T_l|}$, defined as $\bar\varphi(x_{T_l}) \doteq \varphi(x_{0:T})$.

The extension $\bar\varphi$ of the function $\varphi$ can indeed be seen as canonical since it holds that

$$p^l(\bar\varphi) \propto \int \bar\varphi(x_{T_l})\,p_0(x_0)\prod_{t\in T_l\setminus\{T\}} K_l(x_t, x_{t+h_l})\prod_{k=0}^{T}\ell(x_k,y_k)\,dx_{T_l} = \int \varphi(x_{0:T})\,\ell(x_0,y_0)\,p_0(x_0)\prod_{k=1}^{T}\bigl[Q_l(x_{k-1},x_k)\,\ell(x_k,y_k)\bigr]\,dx_{0:T},$$

as expected. Henceforth, $p^l(\varphi)$ will be used as a shorthand notation for $p^l(\bar\varphi)$ when there is no ambiguity.
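As a concrete illustration of Eq. 3 and of the composed kernel $Q_l$, the following sketch (in one dimension, with an illustrative Ornstein–Uhlenbeck choice $a(x) = -x$, $b(x) = 1$ that is not taken from the paper) samples from $Q_l(x,\cdot)$ by composing $M_l = 2^l$ Euler steps:

```python
import numpy as np

def sample_Ql(x0, l, a, b, rng):
    """Draw from Q_l(x0, .) by composing M_l = 2**l Euler steps of size h_l = 2**-l."""
    h = 2.0 ** (-l)
    x = x0
    for u in rng.standard_normal(2 ** l):
        x = x + h * a(x) + np.sqrt(h) * b(x) * u   # one step of Eq. (3), d = 1
    return x

# Illustrative dynamics: a(x) = -x, b(x) = 1 (Ornstein-Uhlenbeck)
rng = np.random.default_rng(1)
xs = np.array([sample_Ql(0.0, 6, lambda x: -x, lambda x: 1.0, rng)
               for _ in range(20_000)])
# The exact one-step law here is N(0, (1 - exp(-2)) / 2); the Euler samples
# should be close to it for a fine enough level l.
```

For this toy model the exact transition is Gaussian, which gives a direct check of the discretization bias at a given level.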

At this stage, standard Bayesian inference methods can be easily applied. For instance, if $a$ and $b$ are linear and constant functions respectively, and if the observation equation (2) takes the form

$$Y_k = g_k(X_k) + V_k$$

with $g_k$ a linear map and with $V_k$ normally distributed, then the Kalman methodology can be used to determine the filtering and smoothing distributions. When this is not the case, the \glspf methodology can be used instead, the approach exposed in [9] being one of the most popular versions. The latter applies sampling and resampling mechanisms to determine the filtering distribution with an error that is uniform in time. It is however less efficient for smoothing problems [23], mostly because of the path degeneracy induced by the use of repeated resampling procedures.

The proposed second step toward the efficient computation of $p(\varphi)$ is to use a method that enables i.i.d. samples to be drawn directly from the smoothing distribution, hence avoiding path degeneracy. This has been made possible by transport methods [28, 27], which are presented in the next section.

### 2.3 Transport methodology

The general principle of transport methods, when applied to the considered problem, is to compute a deterministic coupling between the base probability distribution $\eta^l$ of a convenient i.i.d. process and the target distribution $p^l$, that is, to compute a mapping $G^l$ from $(\mathbb{R}^d)^{|T_l|}$ to itself that pushes $\eta^l$ forward to $p^l$, i.e. such that

$$p^l(x_{T_l}) = G^l_{\#}\eta^l(x_{T_l}) \doteq \eta^l\bigl((G^l)^{-1}(x_{T_l})\bigr)\,\bigl|\det\nabla(G^l)^{-1}(x_{T_l})\bigr|,$$

where $\nabla(G^l)^{-1}(x_{T_l})$ is the gradient of the inverse transport map evaluated at $x_{T_l}$. The method introduced in [27] makes use of the specific structure of $p^l$, which is induced by the Markov property of the underlying diffusion process $X$, to divide the problem into a sequence of low-dimensional couplings. Each of these deterministic couplings, say $M^l_t$ for some $t\in T_l\setminus\{T\}$, is a mapping from $\mathbb{R}^d\times\mathbb{R}^d$ to itself which is assumed to take the form

$$M^l_t : (x_t, x_{t+h_l}) \mapsto \bigl(M^{l,1}_t(x_t, x_{t+h_l}),\; M^{l,2}_t(x_{t+h_l})\bigr)^{\mathrm{t}},$$

for some mappings $M^{l,1}_t : \mathbb{R}^d\times\mathbb{R}^d\to\mathbb{R}^d$ and $M^{l,2}_t : \mathbb{R}^d\to\mathbb{R}^d$. Under additional assumptions on $M^{l,1}_t$ and $M^{l,2}_t$ (see Eq. 5 below), the mapping $M^l_t$ can be characterised by

$$(M^l_t)_{\#}\,\eta^l_{t,t+h_l} = \pi_{t,t+h_l},$$

where the probability distribution $\eta^l_{t,t+h_l}$ on $\mathbb{R}^d\times\mathbb{R}^d$ is the marginal of $\eta^l$ at discretization steps $t$ and $t+h_l$ and where $\pi_{t,t+h_l}$ is related to the marginal law of the smoothing distribution and is characterised, when $t > 0$, by

$$\pi_{t,t+h_l}(x_t, x_{t+h_l}) \propto \begin{cases} \eta^l_t(x_t)\,\ell(x_t,y_t)\,K_l\bigl(M^{l,2}_{t-h_l}(x_t), x_{t+h_l}\bigr) & \text{if } t\in\mathbb{N} \\ \eta^l_t(x_t)\,K_l\bigl(M^{l,2}_{t-h_l}(x_t), x_{t+h_l}\bigr) & \text{otherwise,} \end{cases}$$

where $\eta^l_t$ is the marginal of $\eta^l$ on $\mathbb{R}^d$ at discretization time $t$, and by

$$\pi_{0,h_l}(x_0, x_{h_l}) \propto p_0(x_0)\,\ell(x_0,y_0)\,K_l(x_0, x_{h_l}).$$

The distribution $\eta^l$ is a design variable which is chosen to be the standard normal distribution for the sake of convenience (so that $\eta^l_t$ and $\eta^l_{t,t+h_l}$ do not depend on $t$). The two components of the mapping $M^l_t$ are instrumental for the proposed approach since they allow samples from a convenient distribution to be transported to samples from the filtering or smoothing distributions. The filtering case is straightforward since it holds [27, Theorem 7.1] that $M^{l,2}_{T-h_l}$ pushes $\eta^l_T$ forward to the filtering distribution $p^l_T$. To obtain samples from the smoothing distribution, it is necessary to first embed $M^l_t$ into the identity function on $(\mathbb{R}^d)^{|T_l|}$, which results in a function $G^l_t$ defined as

$$G^l_t : (x_0, x_{h_l}, \dots, x_T) \mapsto \bigl(x_0, \dots, x_{t-h_l},\, M^{l,1}_t(x_t, x_{t+h_l}),\, M^{l,2}_t(x_{t+h_l}),\, x_{t+2h_l}, \dots, x_T\bigr)^{\mathrm{t}}.$$

It is also demonstrated in [27, Theorem 7.1] that the desired mapping $G^l$, that is, the one that pushes $\eta^l$ forward to the smoothing distribution $p^l$, is defined by the composition

$$G^l = G^l_0 \circ G^l_{h_l} \circ \dots \circ G^l_{T-h_l}.$$

Although the transport maps have been identified, their computation is not straightforward. Assuming that the mappings $M^{l,1}_t$ and $M^{l,2}_t$ are of the form

$$M^{l,1}_t(x_{1:d}, x'_{1:d}) = \begin{bmatrix} M^{l,1,1}_t(x_{1:d}, x'_{1:d}) \\ \vdots \\ M^{l,1,d}_t(x_d, x'_{1:d}) \end{bmatrix} \quad\text{and}\quad M^{l,2}_t(x_{1:d}) = \begin{bmatrix} M^{l,2,1}_t(x_{1:d}) \\ \vdots \\ M^{l,2,d}_t(x_d) \end{bmatrix}, \tag{5}$$

for any $x_{1:d}, x'_{1:d}\in\mathbb{R}^d$, i.e., loosely speaking, that $M^{l,1}_t$ and $M^{l,2}_t$ are upper triangular, it follows that $M^l_t$ is a $\sigma$-generalised Knothe–Rosenblatt (KR) rearrangement for some permutation $\sigma$ of $\{1,\dots,2d\}$, that is, informally, a map whose $i$th component depends only on a subset of the variables determined by $\sigma$ and which pushes forward the corresponding conditional of the base distribution to that of the target distribution (see [27, Definition A.3] for more details). In order to find $M^l_t$, we first have to solve the following optimisation problem:

$$M^{l,*} = \operatorname*{arg\,min}_{M}\; -\mathbb{E}\Bigl(\log\pi_{t,t+h_l}\bigl(S_\sigma(M(Z))\bigr) + \sum_{i=1}^{2d}\log\partial_i M^i(Z) - \log\eta^l_{t,t+h_l}\bigl(S_\sigma(Z)\bigr)\Bigr) \tag{6}$$

subject to $M$ being a monotone increasing lower triangular mapping, where the expectation is \glswrt $Z\sim\eta^l_{t,t+h_l}$ and where $S_\sigma$ is the linear map corresponding to the transposition matrix induced by $\sigma$. The map $M^l_t$ can then be recovered from $M^{l,*}$ via $S_\sigma$, since it holds that $S_\sigma^{-1} = S_\sigma$ for the considered permutation $\sigma$. The above optimisation problem can be solved in different ways, e.g. by Gauss quadrature or by having recourse to Monte Carlo techniques [25, 8].

The transport map $G^l$ enables an approximation of $p^l(\varphi)$ to be computed by drawing samples $z_1,\dots,z_N$ from $\eta^l$ and by computing the empirical average

$$\tilde p^l(\varphi) \doteq \frac{1}{N}\sum_{i=1}^{N}\varphi\bigl(G^l(z_i)\bigr) \approx p^l(\varphi).$$

The \glsmse corresponding to the approximation of $p(\varphi)$ by $\tilde p^l(\varphi)$ can be expressed as the sum of a variance term and a squared bias term as follows:

$$\mathbb{E}\bigl((\tilde p^l - p)(\varphi)^2\bigr) = \mathbb{E}\bigl((\tilde p^l - p^l)(\varphi)^2\bigr) + (p^l - p)(\varphi)^2.$$

We propose to further enhance the estimation by having recourse to a multilevel strategy for which transport methods will appear to be particularly well suited.
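To make the pushforward mechanism concrete, here is a minimal sketch in a one-dimensional Gaussian toy case where the exact transport map is affine; the target mean and standard deviation are illustrative assumptions, not quantities from the paper:

```python
import numpy as np

rng = np.random.default_rng(2)

# In a 1-d Gaussian toy model the exact transport map from the standard normal
# base to a N(mu, sigma**2) target is affine (assumed values below).
mu, sigma = 2.0, 0.5
G = lambda z: mu + sigma * z          # pushes N(0, 1) forward to N(mu, sigma**2)

z = rng.standard_normal(100_000)      # i.i.d. base samples
x = G(z)                              # i.i.d. samples from the target
phi_hat = x.mean()                    # empirical average for phi(x) = x
```

The estimator is a plain i.i.d. Monte Carlo average, which is precisely what the transport construction buys: no weights and no resampling, at the price of having to compute the map itself.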

## 3 Multilevel Monte Carlo

We now consider that the discretization Eq. 3 of the \glssde Eq. 1 is performed at the $L+1$ discretization levels $l\in\{0,\dots,L\}$ for some considered value of $L\in\mathbb{N}$. This implies that the solution at the coarsest level $l = 0$ is computationally efficient but possibly inaccurate, whereas the solution at the finest level $l = L$ is more accurate but slower to compute. The principle of \glsmlmc is that the respective advantages of the coarsest and finest levels can be combined within a single estimation procedure by coupling the estimations at adjacent levels. More specifically, the first step is to notice that the smoothing distribution $p^L$ corresponding to the discretization at level $L$ can be expressed via a telescopic sum involving the smoothing distributions at the other levels $l < L$, that is

$$p^L(\varphi) = \sum_{l=0}^{L}\bigl(p^l - p^{l-1}\bigr)(\varphi) \tag{7}$$

where $p^{-1}$ is an arbitrary measure satisfying $p^{-1}(\varphi) = 0$, e.g. the null measure. Equation 7 motivates the introduction of some i.i.d. random variables $X^0_1,\dots,X^0_{N_0}$ with law $p^0$ and, for each $l\in\{1,\dots,L\}$, some i.i.d. random variables expressed as pairs $(X^l_i, X^{l-}_i)$, $i\in\{1,\dots,N_l\}$, such that $X^l_i$ and $X^{l-}_i$ have marginal laws $p^l$ and $p^{l-1}$ respectively. This enables an approximation of $p^L(\varphi)$ as

$$p^L(\varphi) \approx \tilde p^L(\varphi) \doteq \frac{1}{N_0}\sum_{i=1}^{N_0}\varphi(X^0_i) + \sum_{l=1}^{L}\frac{1}{N_l}\sum_{i=1}^{N_l}\bigl(\varphi(X^l_i) - \varphi(X^{l-}_i)\bigr). \tag{8}$$

This approximation of $p^L(\varphi)$ is useful if the random variables are independent across levels and if the components $X^l_i$ and $X^{l-}_i$ of each pair are as correlated as possible for all $i\in\{1,\dots,N_l\}$ and all $l\in\{1,\dots,L\}$ (it is enough to consider one pair per level since the pairs are i.i.d.).

In order to determine the number of samples required at each level, we first express the \glsmse related to Eq. 8 as the sum of variance terms and a squared bias term as

$$\mathbb{E}\bigl((\tilde p^L - p)(\varphi)^2\bigr) = \sum_{l=0}^{L} V_l + (p^L - p)(\varphi)^2 \tag{9}$$

with $V_0 \doteq N_0^{-1}\,\mathrm{Var}\bigl(\varphi(X^0_1)\bigr)$ and $V_l \doteq N_l^{-1}\,\mathrm{Var}\bigl(\varphi(X^l_1) - \varphi(X^{l-}_1)\bigr)$ for $l\in\{1,\dots,L\}$.

Assuming that the bias $(p^L - p)(\varphi)$ is of order $2^{-\alpha L}$ for some integer $\alpha\ge1$, it follows that a bias proportional to $\epsilon$ requires

$$L \propto -\frac{1}{\alpha}\log_2(\epsilon).$$

We also assume that the variance $V_l$ at level $l$ is of order $N_l^{-1}2^{-\beta l}$ and that the cost at level $l$ is of order $N_l 2^{\zeta l}$, for some positive constants $\beta$ and $\zeta$. The number of samples at level $l$ can then be determined by optimising the total cost for a given total variance. This leads to

$$N_l = N_1\,2^{-(\beta+\zeta)(l-1)/2}, \tag{10}$$

so that, to obtain a \glsmse of order $\epsilon^2$, that is, a bias of order $\epsilon$ and a total variance of order $\epsilon^2$, one must take $L \propto -\log_2(\epsilon)/\alpha$ and

$$N_1 \propto \epsilon^{-2}\sum_{l=1}^{L}2^{(\zeta-\beta)l/2}.$$

Therefore, the number of samples and the cost for a \glsmse of order $\epsilon^2$ depend on the respective values of $\beta$ and $\zeta$. For instance, if $\beta > \zeta$, then both $N_1$ and the total cost are of order $\epsilon^{-2}$.
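The choices of $L$ and $N_l$ above can be turned into a short routine; the constant `c` below is an unspecified proportionality constant and the rounding is an implementation choice, so this is a sketch rather than a prescription:

```python
import math

def ml_allocation(eps, alpha=1.0, beta=1.0, zeta=1.0, c=1.0):
    """Sketch of the level and sample-size choices: L from the bias constraint
    2**(-alpha * L) ~ eps, and N_l from Eq. (10); c is a free constant."""
    L = max(1, math.ceil(-math.log2(eps) / alpha))
    N1 = c * eps ** -2 * sum(2.0 ** ((zeta - beta) * l / 2) for l in range(1, L + 1))
    Ns = [max(1, round(N1 * 2.0 ** (-(beta + zeta) * (l - 1) / 2)))
          for l in range(1, L + 1)]
    return L, Ns
```

For example, with $\epsilon = 0.1$ and $\alpha = \beta = \zeta = 1$ this gives $L = 4$ and sample sizes that halve from one level to the next.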

### 3.1 Multilevel particle filter

It is assumed in this section that the interest lies in estimating the filtering distribution at a given time through the multilevel identity Eq. 7. Since it is generally difficult to sample directly from a reasonable candidate for a coupling of $p^l_k$ and $p^{l-1}_k$, one solution is to adopt a \glspf strategy within the \glsml formulation. In order to obtain samples that are correlated between two adjacent levels, a special joint Markov transition can be devised, together with a resampling procedure that retains the correlation of the samples. This is the principle of the \glsmlpf, which is briefly discussed here. Assume that we have some collections of samples $\{x^l_{i,k-1}\}_{i=1}^{N_l}$ and $\{x^{l-}_{i,k-1}\}_{i=1}^{N_l}$ at time $k-1$ approximating $p^l_{k-1}$ and $p^{l-1}_{k-1}$ respectively. For all $l\in\{1,\dots,L\}$ and all $i\in\{1,\dots,N_l\}$, samples $x^l_{i,k}$ and $x^{l-}_{i,k}$ at time $k$ are produced through the Markov transition as follows:

1. Simulate Eq. 3 at level $l$ starting from the initial condition $x^l_{i,k-1}$ over $M_l$ time steps; denote by $x^l_{i,k}$ the obtained state of the process and by $\{u^l_t\}_t$ the collection of realisations of the perturbation drawn during the procedure.

2. Using the initial condition $x^{l-}_{i,k-1}$, define $x^{l-}_{i,k}$ as the result of the deterministic recursion

$$x^{l-}_{t+h_{l-1}} = x^{l-}_t + h_{l-1}\,a(x^{l-}_t) + \sqrt{h_{l-1}}\,b(x^{l-}_t)\,\frac{u^l_t + u^l_{t+h_l}}{\sqrt{2}},$$

for any $t\in\{k-1,\,k-1+h_{l-1},\,\dots,\,k-h_{l-1}\}$. This recursion is meaningful since $h_{l-1} = 2h_l$, so that $\sqrt{h_{l-1}}\,(u^l_t + u^l_{t+h_l})/\sqrt{2} = \sqrt{h_l}\,(u^l_t + u^l_{t+h_l})$ corresponds to the noise in the step from $t$ to $t+h_{l-1}$ induced by the fine-level perturbations.
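The coupled transition of steps 1 and 2 can be sketched as follows for $d = 1$; the Ornstein–Uhlenbeck coefficients used in the check are illustrative assumptions, not the models studied in the paper:

```python
import numpy as np

def coupled_euler(x_f, x_c, l, a, b, rng):
    """One observation interval of the coupled dynamics (d = 1): the fine path
    uses 2**l Euler steps; the coarse path reuses the summed fine noise."""
    hf, hc = 2.0 ** (-l), 2.0 ** (-(l - 1))
    u = rng.standard_normal(2 ** l)
    for j in range(0, len(u), 2):
        for uu in (u[j], u[j + 1]):                   # two fine steps of size h_l
            x_f = x_f + hf * a(x_f) + np.sqrt(hf) * b(x_f) * uu
        # one coarse step of size h_{l-1} = 2 h_l driven by the same noise
        x_c = x_c + hc * a(x_c) + np.sqrt(hf) * b(x_c) * (u[j] + u[j + 1])
    return x_f, x_c

# Illustrative check with a(x) = -x, b(x) = 1: the pair stays strongly correlated
rng = np.random.default_rng(4)
pairs = np.array([coupled_euler(0.0, 0.0, 6, lambda x: -x, lambda x: 1.0, rng)
                  for _ in range(5_000)])
```

The strong correlation of each pair is what makes the variance of the level differences small in Eq. 9.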

This procedure yields pairs of correlated samples distributed according to the predictive distribution at time $k$ given observations up to time $k-1$. The information provided by the observation $y_k$ is simply taken into account by attributing the respective weights $w^l_{i,k}$ and $w^{l-}_{i,k}$ to the samples $x^l_{i,k}$ and $x^{l-}_{i,k}$ in a similar fashion:

$$w^l_{i,k} = \frac{\ell(x^l_{i,k}, y_k)}{\sum_{j=1}^{N_l}\ell(x^l_{j,k}, y_k)} \quad\text{and}\quad w^{l-}_{i,k} = \frac{\ell(x^{l-}_{i,k}, y_k)}{\sum_{j=1}^{N_l}\ell(x^{l-}_{j,k}, y_k)}.$$

Following the weighting of the samples, the difference $(p^l_k - p^{l-1}_k)(\varphi)$ can be estimated via

$$(p^l_k - p^{l-1}_k)(\varphi) \approx \sum_{i=1}^{N_l}\bigl(w^l_{i,k}\,\varphi(x^l_{i,k}) - w^{l-}_{i,k}\,\varphi(x^{l-}_{i,k})\bigr).$$

Although this approximation would behave well in general, most of the sample weights would tend to $0$ if we were to apply the same procedure repeatedly in order to reach the next observation times, resulting in a rapid increase of the empirical variance. The usual way to address this problem in the standard \glspf formulation is to perform resampling, that is, to draw new samples from the old ones according, for instance, to the multinomial distribution induced by the weights. Applying the same approach to the \glsmlpf would result in the loss of the correlation between the samples at adjacent levels. A coupled resampling is used instead, as follows. For all $l\in\{1,\dots,L\}$ and all $i\in\{1,\dots,N_l\}$:

1. With probability $\rho^l_k \doteq \sum_{j=1}^{N_l}\min\{w^l_{j,k}, w^{l-}_{j,k}\}$, draw an index $j^l_i$ according to the probability mass function (p.m.f.) $\hat m^l_k$ on $\{1,\dots,N_l\}$ characterised by

$$\hat m^l_k(j) = \frac{1}{\rho^l_k}\min\{w^l_{j,k}, w^{l-}_{j,k}\}$$

and set $j^{l-}_i = j^l_i$.

2. If 1 is not selected (with probability $1-\rho^l_k$), draw the indices $j^l_i$ and $j^{l-}_i$ independently according to the p.m.f.s $m^l_k$ and $m^{l-}_k$ on $\{1,\dots,N_l\}$ characterised by

$$m^l_k(j) \propto w^l_{j,k} - \min\{w^l_{j,k}, w^{l-}_{j,k}\} \quad\text{and}\quad m^{l-}_k(j) \propto w^{l-}_{j,k} - \min\{w^l_{j,k}, w^{l-}_{j,k}\}.$$
3. Define the new pair of samples as $\bigl(x^l_{j^l_i,k},\, x^{l-}_{j^{l-}_i,k}\bigr)$.

Although the coupled resampling addresses the problem of reducing the empirical variance without completely losing the correlation between samples at adjacent levels, it nevertheless has a negative impact on the \glsml rate. Indeed, as demonstrated in [20], one needs a cost of order $\epsilon^{-2-(\zeta-\beta)/\alpha}$ to obtain a \glsmse of order $\epsilon^2$ when $\beta < \zeta$. In the case where $\beta = 1/2$, e.g. for Euler’s scheme ($\alpha = \zeta = 1$) with non-constant diffusion coefficient, the cost is of order $\epsilon^{-2.5}$.
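The coupled resampling of steps 1–3 amounts to a maximal-coupling construction on the two weight vectors; a sketch (drawing `n_draws` index pairs rather than using the in-place notation of the text) is:

```python
import numpy as np

def coupled_resample(w_f, w_c, n_draws, rng):
    """Maximal-coupling-style resampling: with probability rho both indices
    agree, drawn from the overlap of the two weight vectors; otherwise they
    are drawn independently from the normalised residuals."""
    w_f, w_c = np.asarray(w_f, float), np.asarray(w_c, float)
    overlap = np.minimum(w_f, w_c)
    rho = overlap.sum()
    n = len(w_f)
    idx_f = np.empty(n_draws, dtype=int)
    idx_c = np.empty(n_draws, dtype=int)
    for i in range(n_draws):
        if rng.random() < rho:
            idx_f[i] = idx_c[i] = rng.choice(n, p=overlap / rho)
        else:
            idx_f[i] = rng.choice(n, p=(w_f - overlap) / (1.0 - rho))
            idx_c[i] = rng.choice(n, p=(w_c - overlap) / (1.0 - rho))
    return idx_f, idx_c

rng = np.random.default_rng(5)
i_f, i_c = coupled_resample([0.5, 0.5], [0.9, 0.1], 20_000, rng)
# Each marginal matches its own weights, while the two indices agree with
# probability rho, the total overlap of the weight vectors.
```

The construction preserves both marginals exactly while making the indices coincide as often as the weights allow, which is the sense in which correlation is only partially retained.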

Also, even if the \glsmlpf can handle smoothing on a short time window, i.e. it can successfully approximate the distribution of the last few states given all the observations so far, the error in the approximation of the full smoothing distribution would increase with time because of the path degeneracy effect. Indeed, resampling tends to multiply the samples of higher weights so that, after a certain number of time steps, all samples will be descendants of the same earlier sample.

### 3.2 Multilevel transport

In order to avoid the path degeneracy inherent to any \glspf approach and to regain the \glsml rate lost through the coupled resampling of the \glsmlpf, we propose to compute samples from the distributions $p^l$ via the transport maps $G^l$ characterised by $p^l = G^l_{\#}\eta^l$ for all $l\in\{0,\dots,L\}$. The specific procedure is described as follows. For all $l\in\{1,\dots,L\}$ and all $i\in\{1,\dots,N_l\}$:

1. draw a sample $z^l_i$ from $\eta^l$;

2. map $z^l_i$ through $G^l$ to obtain a sample $x^l_i$ from $p^l$;

3. define a thinned sample $z^{l-}_i$ by retaining the components of $z^l_i$ corresponding to the discretization times in $T_{l-1}\subseteq T_l$;

4. map $z^{l-}_i$ through $G^{l-1}$ to obtain a sample $x^{l-}_i$ from $p^{l-1}$.

This simple procedure yields two collections $\{x^l_i\}_{i=1}^{N_l}$ and $\{x^{l-}_i\}_{i=1}^{N_l}$ of samples drawn from a joint distribution that obviously has marginals $p^l$ and $p^{l-1}$ and that correlates adjacent levels as desired. The efficiency of the approach comes from the fact that the transport maps have to be computed only once; given the maps, it is relatively fast to obtain the samples.
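In a one-dimensional Gaussian toy model where the maps are affine and exact, the four steps above reduce to pushing a shared base sample through two maps; the level-dependent means and standard deviations below are illustrative assumptions, not quantities from the paper:

```python
import numpy as np

rng = np.random.default_rng(6)

# Hypothetical exact maps for a 1-d Gaussian toy smoother: the level-l target
# is N(1 + 2**-l, (0.5 + 2**-l)**2), converging as l grows (assumed values).
def G(z, l):
    return (1.0 + 2.0 ** (-l)) + (0.5 + 2.0 ** (-l)) * z

z_f = rng.standard_normal(50_000)   # step 1: base samples for the fine level
x_f = G(z_f, 5)                     # step 2: push through the fine map
z_c = z_f                           # step 3: "thinned" base sample (identical in 1-d)
x_c = G(z_c, 4)                     # step 4: push through the coarse map

var_diff = np.var(x_f - x_c)        # small: the levels are strongly correlated
var_indep = np.var(x_f) + np.var(x_c)
```

Because the base sample is shared, the variance of the level difference is far below what two independent samples would give, which is the property the multilevel estimator needs.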

It is assumed that the procedure underlying the computation of the transport maps is deterministic, so that there is no undesired correlation between samples at levels $l$ and $l'$ when $l \neq l'$. Further neglecting the numerical error in the computed transport maps, it follows that the expression Eq. 9 of the \glsmse holds for the considered approach.

Before proceeding to a numerical study, the legitimacy of the proposed approach is verified in the linear-Gaussian case. Consider the \glssde Eq. 1 in dimension $d = 1$ and with $\ell(\cdot, y_0) = 1$ (so that the observation at time $0$ has no impact). The corresponding filtering distribution at time $k$ and at level $l$ simplifies to

$$p^l_k(x_k) \propto \int \prod_{n=1}^{k}\bigl[Q_l(x_{n-1}, x_n)\,\ell(x_n, y_n)\bigr]\,dx_{1:k-1}$$

for any $x_k\in\mathbb{R}$. Denote