Maximum a Posteriori  State Path Estimation: Discretization Limits and their Interpretation\thanksreffootnoteinfo

# Maximum a Posteriori  State Path Estimation: Discretization Limits and their Interpretation\thanksreffootnoteinfo

Dimas Abreu Dutra ddutra@cpdee.ufmg.br    Bruno Otávio Soares Teixeira brunoot@ufmg.br    Luis Antonio Aguirre aguirre@cpdee.ufmg.br Programa de Pós-Graduação em Engenharia Elétrica - Universidade Federal de Minas Gerais - Av. Antônio Carlos 6627, 31270-901, Belo Horizonte, MG, Brasil
###### Abstract

Continuous-discrete models with dynamics described by stochastic differential equations are used in a wide variety of applications. For these systems, the maximum a posteriori (MAP) state path can be defined as the curves around which lie the infinitesimal tubes with greatest posterior probability, which can be found by maximizing a merit function built upon the Onsager–Machlup functional. A common approach used in the engineering literature to obtain the MAP state path is to discretize the dynamics and obtain the MAP state path for the discretized system. In this paper, we prove that if the trapezoidal scheme is used for discretization, then the discretized MAP state path estimation converges hypographically to the continuous-discrete MAP state path estimation as the discretization gets finer. However, if the stochastic Euler scheme is used instead, then the discretized estimation converges to the minimum energy estimation. The minimum energy estimates are, in turn, proved to be the state paths associated with the MAP noise paths, which in some cases differ from the MAP state paths. Therefore, the discretized MAP state paths can have different interpretations depending on the discretization scheme used.

E
thanks: [

footnoteinfo]This work has been financially supported by the Brazilian agencies CAPES, CNPq, FAPEMIG and FINEP.

, , .

stimation theory; optimization problems; variational analysis; stochastic systems.

## 1 Introduction

A wide variety of phenomena of engineering interest are continuous-time in nature and can be modelled by stochastic differential equations (SDEs). In practical situations, the measurements of these systems are usually sampled, leading to continuous-discrete models, i.e., models with continuous-time dynamics and measurements taken in discrete-time. For this class of models, the maximum a posteriori (MAP) state paths can be defined as the curves which have the greatest posterior sojourn probability, i.e., that maximize the probability that the state process lies inside an infinitesimal tube around them. The asymptotic prior probability of -tubes around state paths, as vanishes, is given by the Onsager–Machlup functional (Fujita and Kotani, 1982; Zeitouni and Dembo, 1987; Capitaine, 1995), which can be used to construct a variational problem for MAP state path estimation in continuous-discrete models (Aihara and Bagchi, 1999b). It should be noted, however, that variational MAP state path estimation using the Onsager–Machlup functional is not common in the engineering literature. One of the few exceptions is the work of Aihara and Bagchi (1999a, b), where only experiments with simulated data are shown.

In the context of discrete-time dynamical systems, there exist several methods for maximum a posteriori (MAP) state path estimation (Bell, 1994; Godsill et al., 2001; Aravkin et al., 2011a; Monin, 2013). The MAP state path consists of the joint posterior mode of the states over all time indices, given a series of measurements. These estimators have been the focus of many recent publications due to their robustness properties and larger applicability than nonlinear Kalman smoothers (Farahmand et al., 2011; Aravkin et al., 2011a, b, 2012a, 2012b; Dutra et al., 2012; Monin, 2013).

To use discrete-time MAP state path estimation in the context of continuous-discrete systems, the system dynamics needs to be converted to discrete-time. However, while the discretization of linear systems can be exact, for general nonlinear SDEs it is necessary to employ approximations such as the stochastic Euler, trapezoidal or Itō–Taylor schemes (Kloeden and Platen, 1992). Applications of MAP state path estimation on discretized continuous-discrete systems described by SDEs are presented in the work of Bell et al. (2009) and Aravkin et al. (2011a, 2012a). In these applications, the former used the exact discretization for linear systems, while the latter used the Euler discretization for nonlinear systems.

Discrete-time approximations of SDEs improve as the discretization step decreases. Applications in Kalman filtering that use such discretization schemes, for example, usually require fine discretizations to produce meaningful results (Särkkä and Solin, 2012), especially when the measurements are sparse. An important question then arises: how are the MAP state paths of the discretized systems related to the MAP state path of the original system? Do the discretized estimates converge to continuous-time functions?

In this paper, we use the theory of consistent discretizations (Polak, 1997) to show that, under some regularity conditions, the MAP estimates using the Euler and trapezoidal schemes converge to maxima of variational problems. The limit variational problem for the trapezoidal discretization is the MAP state path estimation problem. For the Euler discretization, however, the limit variational problem is the minimum energy estimation problem (Cox, 1963, Sec. 3.7), whose merit function is similar to that of the MAP problem but lacks a correction term which is related to the amplification of noise by the drift. The minimum energy estimator was, for many years, believed to be a good MAP estimator (Zeitouni and Dembo, 1987, p. 234). Its estimates are MAP state paths, however, only when the correction term is constant. In this paper we prove that, in the general case, the minimum energy estimates correspond to the state paths associated with the MAP noise paths.

The main contributions of this paper are i) providing the statistical interpretation for minimum energy estimates as the state path associated with the MAP noise path, ii) proving a link between variational and discretized MAP state path estimation, and iii) proving that the limits of the discretized MAP state path estimation problems depend on the choice of discretization used. The remainder of this paper is organized as follows. In Sec. 2 a simple motivating example is presented to illustrate the main ideas of the paper. In Sec. 3 we present a formal definition of mode, in the framework of Bayesian decision theory, that generalizes well to infinite-dimensional spaces. We then define the continuous-discrete estimation problems in Secs. 46 and the discretized problems in Sec. 7. Finally, in Sec. 8 the hypo-convergence of the problems is proved. Simulated examples are presented in Sec. 9 and the conclusions and future work are presented in Sec. 10.

## 2 Motivating example

Take the simple scalar SDE below, for which the exact transition probability is known (Daum, 1986):

 dXt=tanh(Xt)dt+dWt. (1)

Then, consider a state path estimation problem where the only information available is the initial state prior distribution and a single measurement at with value . The initial state and measurement are normally distributed with mean and covariance as follows:

 X0 ∼N(0,0.16), Y|X5 ∼N(X5,0.16). (2)

If we discretize this SDE at equally spaced time-points using the exact transition density, the Euler scheme and the trapezoidal scheme, it can be seen that the MAP state path estimates converge to continuous-time functions as . The estimates with the trapezoidal and exact transition densities have the same limit, which is the MAP state path obtained using the Onsager–Machlup functional. The estimates with the Euler discretization, however, converge to a different curve, which is the minimum energy state path. These results are summarized in Fig. 1. The maximization problems corresponding to each solution are presented in Secs. 67.

This example shows that, for some systems, the Euler discretization is not appropriate for approximating the joint density of the state path, even with very fine discretizations. The MAP state paths obtained with the Euler scheme converge to the minimum energy estimates, not to the true MAP state paths obtained with the exact transition density or the Onsager–Machlup functional.

## 3 Bayesian decision and MAP estimation

This section is a short introduction to Bayesian estimation theory to argue that the maximization of the probability of tubes of vanishing radius is a natural extension of the concept of MAP estimation to continuous-time stochastic processes. We also formalize the concept with a clear mathematical definition that takes multimodality into account.

To begin, let be the probability space on which all random variables are defined. Random variables will be denoted by uppercase letters and their values by lowercase, so that if is a -valued random variable, will be used to refer to specific values it might take. The same applies for stochastic processes. The dependency on the random outcome will be omitted when unambiguous, to simplify the notation.

Bayesian estimation can be formulated as taking a decision which minimizes an expected loss, given an observation. Say is a random variable, which is to be estimated, and is an observed event, summarizing all the information available. The Bayesian choice (Robert, 2001) for the estimate corresponds to a value which minimizes of the expected loss, i.e.,

 E[ℓ(X,^x)|A]=infxE[ℓ(X,x)|A], (3)

where is an -valued function which represents the loss associated with each outcome of and choice for .

Several known loss functions correspond to certain statistics of the posterior distribution. When the loss is the absolute distance and squared distance between the estimate and the outcome of the random variable,

 ℓ1(X,x) :=∥X−x∥, ℓ2(X,x) :=∥X−x∥2, (4)

then the Bayesian estimator is the posterior median and mean, respectively. For discrete random variables, i.e., when the image of is a countable set, the posterior modes are the Bayesian estimates associated with the 0–1 loss function, given by

 ℓ01(X,x):=0 if X=x, 1 otherwise. (5)

For -valued random variables which admit continuous posterior densities, the expected value of is always unity, so we must instead consider a family of 0–1 loss functions given by

 ℓϵ(X,x):=0 if ∥X−x∥≤ϵ%,1 otherwise. (6)

One possible definition for the MAP estimate is then the limit of Bayesian estimates associated with the losses , as (Robert, 2001, Sec. 4.1.2). It can be shown that the limit of any convergent subsequence of is always a maximum of the posterior density.

However, the above definition for MAP estimate has several problems. First, convergence of the Bayesian estimates is hard to guarantee. Even the existence of convergent subsequences requires additional assumptions on the distribution of the random variable. In addition, there can exist maxima of posterior density that cannot be limits of the Bayesian estimates. These limitations are often overlooked because the de facto definition of MAP estimates is the location of the maxima of the posterior density, i.e., the modes of the posterior distribution. For random variables in infinite-dimensional spaces, for which there is no probability density in the usual sense, this de facto definition cannot be used. To overcome the problems mentioned above, we present below a formal definition of mode which agrees with the de facto one and generalizes well to infinite-dimensional spaces.

{defn}

[mode] Let be a normed vector space and be a -valued random variable. An element is a mode of if, for all ,

 limsupϵ↓0P(∥X−x∥≤ϵ)P(∥X−^x∥≤ϵ)≤1. (7)
{defn}

The maximum a posteriori (MAP) estimates of a random variable are the modes of its posterior distribution, according to Definition 3.

Definitions 3 and 3 can be interpreted in the framework of Bayesian decision theory because

 E[ℓϵ(X,x)|A]=1−P(∥X−x∥≤ϵ|A). (8)

Therefore, if is a MAP estimate and is not, there exists an such that, for all , the expected loss associated with is no greater than that associated with . If two points are MAP estimates, then the ratio of the expected losses associated with them converges to unity. Furtheremore, if is an -valued random variable that admits a continuous density , then for all with

 limϵ↓0P(∥X−a∥≤ϵ)P(∥X−b∥≤ϵ)=pX(a)pX(b), (9)

so a point is a MAP estimate if and only if it maximizes the density. Consequently, Definition 3 agrees with the de facto definition of mode.

Having formalized the definition of mode for general random variables, we can now show how the Onsager–Machlup functional can be used to obtain the mode of the state paths of systems described by SDEs.

## 4 Probability of tubes around state paths

Let be an -valued stochastic process representing the state of a dynamic system at the instant over the time interval . The evolution of is given by the SDE

 dXt=f(t,Xt)dt+GdWt, (10)

where is the drift function, is the diffusion matrix and is an -dimensional Wiener process which represents the process noise.

We now make some basic assumptions about the state process and the functions which define its dynamics. A function will be said to be of class if it is times differentiable and all its derivatives up to degree , including the zeroth, are bounded and continuous.

\needspace

2 {assum}[state process]

1. The drift is continuous with respect to both arguments and is of class with respect to its second argument , uniformly over its first argument .

2. The diffusion matrix has full rank.

3. The initial state is independent of the Wiener process .

4. The initial state admits a continuous density of class

Together, these assumptions allow the calculation of the asymptotic probability of -tubes of the state process, also called -sausages. The -tube around a given corresponds to all curves which lie within a distance of , with respect to the supremum norm. We denote by the event that the state process is contained within the -tube around , i.e.,

 Bϵϕ:={ω∈Ω:supt∈T∥Xt(ω)−ϕ(t)∥≤ϵ}, (11)

where, for elements of , denotes the Euclidean norm. The probability of is also referred to as the sojourn probability around . The tube events define regions of the space, as illustrated in Fig. 2, and an outcome belongs to the tube event if all of its pairs lie in the shaded region.

Of special interest are the tubes around absolutely continuous functions with square-integrable derivatives, i.e., the Cameron–Martin space of the law of the Wiener process:

 H:={(x:T→\mathdsRn):x abs.\ % continuous, ˙x∈L2(T)}. (12)

For any such path , the Onsager–Machlup functional is defined as

 J(ϕ):=−12∫T0[∥˙ϕ(t)−f(t,ϕ(t))∥2Q+divf(t,ϕ(t))]dt, (13)

where , and is the divergence of the drift vector field. The importance of this functional is outlined in the theorem below.

{thm}

[Fujita and Kotani (1982)] Consider a system (10) with state process satisfying Assumption 4. Then, for all with ,

 limϵ↓0P(Bϵϕ)P(Bϵφ)=exp(J(ϕ))ν(ϕ(0))exp(J(φ))ν(φ(0)). (14)

where is the Onsager–Machlup functional defined in (13) and is the initial state density satisfying Assumption 4d.

Comparing (14) to (9), Thm. 4 implies that, for the purpose of obtaining the mode, the product of the exponential of the Onsager–Machlup functional and the initial state density plays a role analogous to a probability density. For this reason, these terms are sometimes referred to as a fictitious density of state paths (Zeitouni, 1989). Also note that (14) is valid for tubes with respect to other norms of function spaces besides the supremum norm, and that the limit does not depend on the norm used (Capitaine, 1995).

For tubes around , outside the Cameron–Martin space, we assume that their probability decays much faster than that of tubes around that belong to the Cameron–Martin space, i.e., . This implies that the mode of the state path, according to Def. 3, lies in and can be obtained by maximizing the denominator of the right-hand side of (14). This assumption is implicit whenever the Onsager–Machlup functional is used to obtain the mode of state paths (Zeitouni and Dembo, 1987; Aihara and Bagchi, 1999a, b).

Whenever the divergence of the drift vector field does not depend on the state , then the last term of the integral in (13) can be dropped from the Onsager–Machlup functional, leading to the energy functional. Even for systems with state-dependent drift divergence, the energy functional gives the asymptotic probability of -tubes around the path of the Wiener process associated with the state path, as is shown in the next section.

## 5 Probability of tubes around noise paths

The Wiener process is also a diffusion process for which the Onsager–Machlup functional can be used for calculating the asymptotic probability of -tubes. For any with and , let represent the event that the initial state and the Wiener process lie within an radius of and , i.e.,

 Cϵη,x0:={ω∈Ω:supt∈T∥Wt(ω)−η(t)∥≤ϵ,∥X0(ω)−x0∥≤ϵ}. (15)

Applying a variant of Thm. 4 which considers diffusions with a fixed initial condition (cf. Fujita and Kotani, 1982), we have that, for any and such that and the initial state density ,

 limϵ↓0P(Cϵζ,χ0)P(Cϵη,x0)=exp(−12∫T0˙ζ(t)⊺˙ζ(t)dt)ν(χ0)exp(−12∫T0˙η(t)⊺˙η(t)dt)ν(x0). (16)

To relate the sojourn probability of state paths and noise paths, we now show that the event corresponding to each noise path tube is contained in an event corresponding to a state path tube.

{lem}

If the system satisfies Assumption 4, then there exists a constant such that for all and such that and

 ˙ϕ(t)=f(t,ϕ(t))+G˙η(t). (17)
###### Proof.

The SDE (10) can be written in the integral form as

 Xt=X0+∫t0f(t,Xt)dt+GWt. (18)

Writing the associated ordinary differential equation (ODE) (17) in the same form and taking the difference, we have that

 ∥Xt−ϕ(t)∥≤∥X0−ϕ(0)∥+∥GWt−Gη(t)∥+∫t0∥f(t,Xt)−f(t,ϕ(t))∥dt. (19)

Since the Jacobian matrix of is assumed to be bounded, is Lipschitz continuous, i.e., there exist such that

 ∥f(t,a)−f(t,b)∥≤K∥a−b∥∀t∈T and a,b∈\mathdsRn. (20)

Defining , where the norm of is induced by the Euclidean norm, we have that, for all ,

 ∥Xt−ϕ(t)∥≤c2ϵ+∫t0K∥Xt−ϕ(t)∥dt. (21)

Using the Bellman–Grönwall inequality we then have that

 ∥Xt−ϕ(t)∥≤ϵc2exp(TK).\qed (22)

Lem. 5 means that every state path in is associated with a noise path in and a initial state through the associated ODE (17). Rewriting (16) in terms of the state path associated with the noise path , we note that

 −12˙η(t)⊺˙η(t)=−12∥∥˙ϕ(t)−f(t,ϕ(t))∥∥2Q, (23)

which leads to the energy functional:

 Je(ϕ):=−12∫T0∥∥˙ϕ(t)−f(t,ϕ(t))∥∥2Qdt, (24)

so called because it corresponds to the energy of the noise signal associated with each state path. The energy functional is the Onsager–Machlup functional without the drift divergence term, which accounts for the amplification of noise by the system dynamics.

Next, we show how to incorporate the measurement likelihood so that the asymptotic posterior sojourn probability can be obtained and, with it, the MAP estimator.

## 6 Continuous-discrete MAP estimation

Let the finite set be the time instants where measurements are taken. In addition, let the -valued random variables denote the measurements taken at each , with the dimension of the measurement vector possibly varying across different instants. We then make the following assumptions.

{assum}

[measurements]

1. Each measurement is conditionally independent, given the state at the measurement instant, of all the states and measurements at the remaining time instants .

2. Each measurement admits a conditional density , given the state , which is of class with respect to its second argument.

3. All measured values have a nonzero predictive prior density, i.e., .

Given a sequence of measurement values, we wish to obtain the mode of the state path conditioned on . We then have the following result.

{thm}

Let with for all . In addition, let and be two families of events measurable with respect to the -algebra generated by such that , and converges as . Then, for a system with state process satisfying Assumption 4 and measurements satisfying Assumption 6,

 (25)
###### Proof.

Using Bayes’ rule, the posterior probability of any event measurable with respect to the -algebra generated by is

 (26)

Due to the continuity of the measurement likelihood with respect to its second argument, we have that for all ,

 limϵ↓0E[ψt(yt|Xt)|Aϵ1]=ψt(yt|ϕ(t)), (27)

and analogously for and . Substituting (26) and (27) into the left-hand side of (25) we obtain the result on the right-hand side of the equation. ∎

The MAP state path estimation problem then follows from Theorem 6, by setting . We will use the logarithm of the asymptotic sojourn probability because it is more tractable numerically and simplifies some convergence proofs. As some densities can be equal to zero, the resulting merit functions will take values on the extended real number line .

{lem}

The continuous-discrete MAP state paths are curves that maximize the MAP merit given by

 H(ϕ):=J(ϕ)+lnν(ϕ(0))+∑t∈Tmlnψt(yt|ϕ(t)). (28)

If instead , Theorem 6 leads to the minimum energy estimation problem.

{lem}

The minimum energy state paths, which are associated with the MAP noise paths, are curves that maximize the energy merit given by

 He(ϕ):=Je(ϕ)+lnν(ϕ(0))+∑t∈Tmlnψt(yt|ϕ(t)). (29)

We now present the discretization schemes and their corresponding discretized MAP state path estimation problems.

## 7 Discretized MAP state path estimation

To use discrete-time MAP state path estimation methods in systems with dynamics described by SDEs, the state transition density between points in a partition of the experiment’s time interval must be known. As the transition densities of SDEs do not have closed-form solutions for general nonlinear systems, approximations must be used. In this work, we consider approximations used in the numerical solution of SDEs.

We start with an ordered partition , which includes the endpoints , and measurement times . The discretization schemes then represent the evolution of the state at the partition points by difference equations. For brevity when working with partition points, we will use the following notation when unambiguous: will be shortened to and sequences like will be shortened to . The sequence will be denoted the discretized state path.

Change of variables is used to obtain the discretized state path densities. For this we need the well-known theorem below.

{thm}

[Georgii (2008, Prop. 9.1)] Let be an -valued random variable admitting a probability density. If is a diffeomorphism and the -valued random variable is defined as , then admits a probability density, which is given by

 pB(b)=∣∣det∇q−1(b)∣∣pA(q−1(b)), (30)

where is the Jacobian determinant of the inverse of , evaluated at .

In (30), the Jacobian determinant represents how the diffeomorphism changes the volume element at each point. We now present each discretization and the corresponding discretized MAP state path estimation problems.

### 7.1 Euler scheme

The stochastic Euler scheme (Kloeden and Platen, 1992, Sec. 9.1), also known as the Euler–Maruyama scheme, is one of the simplest SDE discretization schemes. It approximates the evolution of the states by

 Xk+1=Xk+f(tk,Xk)δk+GΔWk, (31)

where and . Recall that the increments of the Wiener process are, by definition, independent and normally distributed with zero mean and covariance , i.e., .

From (31) and the assumptions that is of class and is invertible, it can be seen that can be obtained from by a diffeomorphism. Applying Thm. 7, the joint prior density of the state path over the discretization points is given by

 p(x0:N)=ν(x0)exp(R(x0:N))∏N−1k=0|detG|√(2π)nδk, (32)

where is the Euler energy functional, defined as

 R(x0:N):=−12N−1∑k=0δk∥∥∥Δxkδk−f(tk,xk)∥∥∥2Q. (33)

Assumption 6a implies that the joint measurement likelihood, i.e., the joint density of all measurements, conditioned on the state path, is given by

 p(Y|x0:N)=∏t∈Tmψt(yt|xt). (34)

Applying Bayes’ rule we have that, for fixed , the posterior state path density can be obtained from (32) and (34):

 p(x0:N|Y)∝p(Y|x0:N)p(x0:N). (35)

Taking the logarithm of (35) and removing the constant terms, which do not influence the location of maxima, we obtain the merit function for MAP estimation.

{lem}

The discretized MAP state paths under the Euler scheme are the vectors which maximize the Euler merit

 S(x0:N):=R(x0:N)+lnν(x0)+∑t∈Tmlnψt(yt|xt). (36)

### 7.2 Trapezoidal scheme

The trapezoidal scheme (Kloeden and Platen, 1992, p. 500) represents the SDE over the partition points by an implicit difference equation:

 Xk+1=Xk+δk2[f(tk,Xk)+f(tk+1,Xk+1)]+GΔWk. (37)

Since the drift is assumed to be of class , for sufficiently small (37) is a contraction mapping and always has a unique solution. We shall assume that the partition is sufficiently fine such that this holds.

As with the Euler scheme, applying Thm. 7 we obtain the joint discretized state path prior density:

 p(x0:N)=ν(x0)exp(U(x0:N))∏N−1k=0|detG|√(2π)nδk, (38)

where is the trapezoidal Onsager–Machlup functional defined as

 U(x0:N):=∑N−1k=0lndet(I−12∇xf(tk+1,xk+1)δk)−12∑N−1k=0δk∥Δxkδk−12[f(tk,xk)+f(tk+1,xk+1)]∥2Q (39)

with being the Jacobian matrix of with respect to , evaluated at . Using equations (34), (35) and (38) and taking their logarithm, we obtain the discretized MAP state path estimation problem using the trapezoidal scheme. {lem} The discretized MAP state paths under the trapezoidal scheme are the vectors which maximize the trapezoidal merit

 V(x0:N):=U(x0:N)+lnν(x0)+∑t∈Tmlnψt(yt|xt). (40)

Unlike the Euler scheme, in the trapezoidal the state depends nonlinearly on the Wiener process increment , due to the implicit nature of (37). This means that the prior density of a state path does not depend only on the density of the associated noise, but also on the change of the volume element, represented by the Jacobian determinant of the drift function. This extra term represents how much the system dynamics amplifies or attenuates the noise around each path, akin to the drift divergence in the Onsager–Machlup functional. We note that this term has a numerical and conceptual connection to Lyapunov exponents, whose sum quantifies the average change in hypervolume due to pertubations of a path of a noise-free discrete-time map.

We will now present the theory of hypographical convergence to relate the discretized and continuous-discrete MAP estimation problems.

## 8 Limit of the discretized problems

We now consider the discretized problems over a nested sequence of partitions . The last index of each partition will be denoted , their elements by and the time increments by . In addition, will denote the mesh (or norm) of the partition , i.e., the largest distance between consecutive points

 ¯δi:=max0≤k

We consider partition sequences which satisfy

 Pi ⊂Pi+1, limi→∞¯δi =0, ¯δiNi

where is an upper bound.

To relate the various discretizations and the continuous-discrete problems, we must formulate all problems on the same space. We do this by representing the discretized state paths as the piecewise linear interpolants of given . We also denote by the space of all piecewise linear functions from to with breaks along . The discretized merit functions for each partition will be denoted by and , and are accordingly extended for inputs from . The same goes for the Euler energy functional and the trapezoidal Onsager–Machlup functional .

The Cameron–Martin space is a reproducing kernel Hilbert space. It can be endowed with the following inner product and its associated norm:

 ⟨a,b⟩H:=a(0)⊺b(0)+∫T0˙a(t)⊺˙b(t)dt. (43)

This norm for dominates the supremum norm (Daniel, 1969, Prop. 4.1), i.e., there exists a constant such that, for all ,

 (44)

We use hypographical convergence, or hypo-convergence, to relate the discretized problems and their continuous-discrete counterparts. Hypo-convergence is a powerful tool because when a series of optimization problems hypo-converge, then any cluster point of their solutions solves the limit problem (Polak, 1997, Thm. 3.3.3). Hypographical limits are formally defined as the limit of the hypographs of a series of optimization problems but, in metric spaces, the following definition is equivalent and more convenient to use.

{thm}

[adapted from Polak (1997, Thm. 3.3.2)] Let be a normed linear space and be a nested family of finite-dimensional subspaces whose limit is dense in , i.e.,

 Hi⊂Hi+1 ⊂H, H∞ :=∪∞i=1Hi, closH∞=H, (45)

where stands for topological closure. Additionally, let be the upper semi-continuous merit function of a maximization problem and be a family of upper semi-continuous merit functions of a family of maximization problems. A sufficient condition for the family of maximization problems to hypo-converge to the maximization of is that for every convergent infinite sequence where and , then

 limsupi→∞Hi(ϕi)=H(ϕ). (46)

We now present a few lemmas that will help prove hypo-convergence.

{lem}

Let be a continuous function and be a convergent infinite sequence where and . Additionally, let be the piecewise constant interpolations of with breaks along the partitions and define . Then and are bounded and

 limi→∞supt∈T∥g(t)−gi(t)∥=0. (47)
###### Proof.

From (44) and the fact that the sequence is convergent with respect to the -norm, it follows that . We can then restrict our analysis of to a compact subset of its domain, where it is bounded and uniformly continuous. Let denote a modulus of continuity of , i.e., for all and in the compact subset of being analysed,

 ∥q(t,a)−q(τ,b)∥≤ρq(|t−τ|,∥a−b∥). (48)

Similarly, let denote the modulus of continuity of and define . Then and

 supt,τ∈[tki,tk+1,i]∥ϕi(t)−ϕ(τ)∥≤ei. (49)

Combining (48) and (49) we have that

 supt∈T∥gi(t)−g(t)∥ ≤ρq(¯δi,ei), ρq(¯δi,ei)→0.

We now prove that, for any convergent sequence of piecewise linear functions over the partitions, the discretized merits converge to the continuous-discrete merits. The space for -valued functions will be denoted by , and corresponds to the direct sum of copies of the standard space, with its inner product given by

 ⟨a,b⟩L2n:=∫T0a(t)⊺b(t)dt. (50)
{lem}

Let be a convergent infinite sequence where and . Then

 limi→∞Si(ϕi)=He(ϕ), (51)

where is defined in (29) and is given by (36) for partition .

###### Proof.

Since by (44) convergence in implies uniform convergence and the initial state density and measurement likelihoods are continuous, we have that

 ν(ϕi(0)) →ν(ϕ(0)), ψt(ϕi(t)) →ψt(ϕ(t)), (52)

so, as the logarithm is also continuous in , it only remains for us to prove that .

Let be the piecewise constant interpolation of with breaks across and with the left endpoint of each piece as the interpolation point. In addition, define . Then, using the bound and the fact that is piecewise constant, from the definitions (24) and (33) of the functionals we have that

 |Ri(ϕi)−Je(ϕ)|≤∥Q∥(∥˙ϕi−gi∥2L2n−∥˙ϕ−g∥2L2n). (53)

As the norm and exponentiation are continuous functions, it suffices to show that and with respect to the norm. The former follows trivially from the fact that for all . The latter follows from Lem. 8 and the fact that uniform convergence implies convergence in for finite measure spaces. ∎

{lem}

Let be a convergent infinite sequence where and . Then

 limi→∞Vi(ϕi)=H(ϕ), (54)

where is defined in (28) and is given by (40) for partition .

###### Proof.

As in the proof of Lem. 8, the convergence of the terms corresponding to and follows trivially, so we will concentrate on proving that . Define

 (55)

We now show that .

First, recall that for any nonsingular matrix , , where is any matrix logarithm of . In addition, for the principal logarithm we have that, for all matrices with spectral radius smaller than unity, (Higham and Al-Mohy, 2010, Sec. 5.2). Since the drift is assumed to be of class , we have that for small enought (or, equivalently, sufficiently large ) always has spectral radius smaller than unity. This implies that there exists a constant such that, for all above a certain limit, and ,

 ∣∣lndet(I−12∇xf(t,x)δki)−12divf(t,x)δki∣∣≤c10¯δ2i. (56)

Let denote the piecewise constant interpolation of with breaks across and with the right endpoint of each piece as the interpolation point. Likewise, define . We then have that, for all sufficiently large ,

 |ei|≤12∫T0|d(t)−di(t)|dt+c10Ni¯δ2i. (57)

From Lem. 8 and the fact that uniform convergence implies convergence in in finite measure spaces we have that in . Also, from (42) we have that is bounded, so .

We now show that the quadratic terms of the merit functions converge. Let and denote piecewise constant interpolations of with breaks across ; with the left endpoint of each piece as the interpolation point and with the right endpoint. It suffices to prove that

 limi→∞∥˙ϕi−12(gi+hi)∥2L2n−∥˙ϕ−g∥2L2n=0. (58)

From Lem. 8 we have that and in . Furthermore, , so due to continuity of the norm and exponentiation . ∎

{lem}

The space is dense in .

###### Proof.

Denote by the space of piecewise constant functions from to with breaks along . The are spaces of order 1 splines and their union is dense in the space of continuous functions equipped with the supremum norm (de Boor, 2001, p. 147). Since continuous functions are dense in and denseness is a transitive relation, is dense in . Thus, for any and , there exists an and such that . If we then define we have that and . ∎

We are now ready to prove one of the main results of this paper.

{thm}

The discretized MAP state path estimation with the trapezoidal scheme hypo-converges to the continuous-discrete MAP state path estimation; and the discretized MAP state path estimation with the Euler scheme hypo-converges to the minimum energy estimation.

###### Proof.

From Lem. 8 we know that is dense in . From Lemmas 8 and 8 we then have that the merits converge for any convergent sequence of . ∎

In the sequence, we present simulated examples to illustrate an application of robust estimation under the presence of outliers using both the MAP and minimum energy estimators.

## 9 Simulated Examples

The stochastic Van der Pol oscillator is a nonlinear system commonly used to compare state estimators. This system, discretized with the Euler scheme, was used by Aravkin et al. (2011a, 2012a) for a comparison of robust discrete-time estimators. Denoting its states by , its dynamics are defined by

 f(t,x) :=[v−u+2(1−u2)v], G :=[[c]\numprint0.1\numprint0\numprint0\numprint0.1]. (59)

This system has a nonlinear drift divergence, given by , which is higher near the origin. Consequentely, the MAP merit will favor paths with larger while the energy merit will not make such distinctions. To highlight this difference between the estimators, the initial state was chosen around the origin, drawn from the normal distribution .

The system was simulated using the strong explicit order 1.5 scheme (Kloeden and Platen, 1992, Sec. 11.2) with time step 5×10-4 and experiment duration . The measurements were generated according to the Gaussian mixture distribution

 Yt|Xt∼(1−po)N(Ut,σ2y)+poN(Ut,σ2o), (60)

where is the regular measurements’ standard deviation, is the outliers’ standard deviation and is the outlier probability, the latter two of which were varied across experiments. The measurements were taken at time intervals of 0.1, i.e., .

Student’s -distribution was used as the measurement likelihood for the MAP, minimum energy and discretized estimators to confer robustness against outliers. The expression of the log-likelihood used in the estimation is

 lnψt(y|x)=−52ln(1+(u−y)24σ2y), (61)

which corresponds to the -distribution with 4 degrees of freedom and scaling . The estimates were also compared against the unscented Kalman smoother (UKS) of Särkkä (2008), using the continuous-discrete SDE prediction step of Arasaratnam et al. (2010). The measurement covariance of was used for the UKS.

Optimal control techniques were used to solve the MAP and minimum energy estimation problems. The estimation was transcribed to an nonlinear programming (NLP) problem using a third-order Legendre–Gauss–Lobatto direct collocation method equivalent to the Hermite–Simpson method (Betts, 2010, Sec. 4.5). The resulting NLP was then solved using the IPOPT solver of Wächter and Biegler (2006), part of the open-source COIN-OR project. The discretized MAP state path estimation problems with the Euler and trapezoidal schemes were also solved using the IPOPT solver. The estimation time step of 5×10-3 was used for all estimators.

The experiment and estimation were repeated 2,000 times for each combination of , and the integrated squared error (ISE) was calculated for each estimate:

 ISE=∫T0(Xt−x(t))⊺(Xt−x(t))dt, (62)

where is the simulated state and its estimate. The results are summarized in Fig. 3. It can be seen that, in all scenarios, the performance of the minimum energy and MAP estimators was comparable. However, the performance of the UKS becomes significantly worse as the outliers become more frequent and with a higher variance. These results mirror those of Aravkin et al. (2012a).

It should be noted that although the ISE distribution of both the MAP and minimum energy estimates was similar, the estimates were often different. This point can be seen in Fig. 4, which shows a portion of the simulated and estimated timeseries for one experiment. It is not apparent, however, whether any one is intrinsically better statistically than the other, except in very specific scenarios. They represent different statistics with different interpretations; it is up to the user to determine which one is preferrable for each application.

We also note that, for systems with student’s -distribution as the measurement likelihood, the discretized MAP state path estimator with the Euler scheme is the -robust smoother of Aravkin et al. (2012a). Therefore, a practical consequence of the hypo-convergence result of Thm. 8 is that, for sufficiently small discretization steps, the estimates of the -robust smoother should be close to the minimum energy estimates. This was verified in the simulated experiments, and can be observed in Figs. 3 and 4. The MAP state path estimates were also close to the discretized state path estimates with the trapezoidal discretization.

## 10 Conclusions and Future Work

In this paper, we addressed MAP state path estimation in continuous-discrete systems described by SDEs. We began with a formal definition of mode for continuous-time stochastic processes so that the target statistic and its interpretation are well-defined. We then showed how the Onsager–Machlup functional can be used to construct the MAP state path and the MAP noise path estimators, the latter of which is associated with the minimum energy state paths. In this way, a clear statistical interpretation of the minimum energy state paths was provided, together with the conditions under which they coincide with the MAP state paths.

Additionally, we showed that the popular technique to solve these estimation problems by discretizing the dynamics and obtaining the discretized MAP state path estimate leads to different results depending on the discretization scheme used. The Euler scheme, which is the simplest and therefore most widely used method for discretizing nonlinear SDEs, generates discretized estimators which hypo-converge to the MAP noise path estimation. The trapezoidal scheme, on the other hand, generates discretized estimators which hypo-converge to the MAP state path estimation.

As mentioned in the end of Sec. 7, the difference between the energy and Onsager–Machlup functionals can be interpreted as accounting for the amplification of noise by the system. The latter favours paths around more locally stable regions of the state-space, quantified by the sum of the eigenvalues of the drift Jacobian matrix , i.e., the drift divergence . Paths around less stable regions of state space amplify more greatly the perturbations due to noise, which reduces the probability that the state stays inside small tubes around it.

Having proved the link between discretized state path estimation and variational problems also opens the possibility of using optimal control techniques to solve the former, as was done in the simulated examples. These methods are better designed for the solution of infinite-dimensional optimization and are characterized by their order of convergence to the variational solution. SDE discretization methods, on the other hand, are characterized by order of convergence in mean (or of the means) of the simulated and true stochastic processes (Kloeden and Platen, 1992, Secs. 9.6 and 9.7). There also exist optimal control methods of arbitrarily high order; SDE methods for MAP calculation are limited to strong order 1.5 or weak order 4.

The results herein can be generalized in a number of ways. Rank-deficient matrices can be allowed by using the same approach as Aihara and Bagchi (1999b), for example. Hypoconvergence can also be used to analyze the limits of discretized joint MAP state path and parameter estimation (Dutra et al., 2012). Finally, other popular higher order discretization schemes such as the order 1.5 Itō–Taylor should generate problems which hypo-converge to MAP state path estimation as well.

## References

• Aihara and Bagchi [1999a] S.I. Aihara and A. Bagchi. On the Mortensen equation for maximum likelihood state estimation. IEEE Trans. Autom. Control, 44(10):1955–1961, 1999a.
• Aihara and Bagchi [1999b] S.I. Aihara and A. Bagchi. On maximum likelihood nonlinear filter under discrete-time observations. In IEEE ACC, volume 1, pages 450–454, 1999b.
• Arasaratnam et al. [2010] I. Arasaratnam, S. Haykin, and T.R. Hurd. Cubature Kalman filtering for continuous-discrete systems: Theory and simulations. IEEE Trans. Signal Process., 58(10):4977–4993, 2010.
• Aravkin et al. [2011a] A.Y. Aravkin, B.M. Bell, J.V. Burke, and G. Pillonetto. An -Laplace robust Kalman smoother. IEEE Trans. Autom. Control, 56(12):2898–2911, 2011a.
• Aravkin et al. [2011b] A.Y. Aravkin, B.M. Bell, J.V. Burke, and G. Pillonetto. Learning using state space kernel machines. In IFAC, pages 2296–2302, 2011b.
• Aravkin et al. [2012a] A.Y. Aravkin, J.V. Burke, and G. Pillonetto. Robust and trend-following Kalman smoothers using Student’s t. In SYSID, pages 1215–1220, 2012a.
• Aravkin et al. [2012b] A.Y. Aravkin, J.V. Burke, and G. Pillonetto. A statistical and computational theory for robust and sparse Kalman smoothing. In SYSID, pages 894–899, 2012b.
• Bell [1994] B.M. Bell. The iterated Kalman smoother as a Gauss–Newton method. SIAM J. on Optimiz., 4(3):626, 1994.
• Bell et al. [2009] B.M. Bell, J.V. Burke, and G. Pillonetto. An inequality constrained nonlinear Kalman–Bucy smoother by interior point likelihood maximization. Automatica, 45(1):25–33, 2009.
• Betts [2010] J.T. Betts. Practical methods for optimal control and estimation using nonlinear programming. SIAM, 2nd edition, 2010.
• Capitaine [1995] M. Capitaine. Onsager–Machlup functional for some smooth norms on Wiener space. Probab. Theory Rel., 102(2):189–201, 1995.
• Cox [1963] H. Cox. Estimation of state variables for noisy dynamic systems. PhD thesis, Mass. Inst. Technol., 1963.
• Daniel [1969] J.W. Daniel. On the approximate minimization of functionals. Math. Comput., 23:573–581, 1969.
• Daum [1986] F.E. Daum. Exact finite-dimensional nonlinear filters. IEEE Trans. Autom. Control, 31:616–622, 1986.
• de Boor [2001] C. de Boor. A Practical Guide to Splines. Springer, 2001.
• Dutra et al. [2012] D.A. Dutra, B.O.S. Teixeira, and L.A. Aguirre. Joint maximum a posteriori smoother for state and parameter estimation in nonlinear dynamical systems. In SYSID, pages 900–905, 2012.
• Farahmand et al. [2011] S. Farahmand, G.B. Giannakis, and D. Angelosante. Doubly robust smoothing of dynamical processes via outlier sparsity constraints. IEEE Trans. Signal Process., 59(10):4529–4543, 2011.
• Fujita and Kotani [1982] T. Fujita and S.-I. Kotani. The Onsager–Machlup function for diffusion processes. J. Math. Kyoto U., 22(1):115–130, 1982.
• Georgii [2008] H.-O. Georgii. Stochastics: Introduction to Probability and Statistics. Walter de Gruyter, 2008.
• Godsill et al. [2001] S. Godsill, A. Doucet, and M. West. Maximum a posteriori sequence estimation using Monte-Carlo particle filters. Ann. Stat. Math., 53(1):82–96, 2001.
• Higham and Al-Mohy [2010] N.J. Higham and A.H. Al-Mohy. Computing matrix functions. Acta Numer., 19:159–208, 2010.
• Kloeden and Platen [1992] P.E. Kloeden and E. Platen. Numerical Solution of Stochastic Differential Equations. Springer, 1992.
• Monin [2013] A. Monin. Modal trajectory estimation using maximum Gaussian mixture. IEEE Trans. Autom. Control, 58(3):763–768, 2013.
• Polak [1997] E. Polak. Optimization: algorithms and consistent approximations. Springer, 1997.
• Robert [2001] C. Robert. The Bayesian Choice: From Decision-Theoretic Foundations to Computational Implementation. Springer, 2 edition, 2001.
• Särkkä [2008] S. Särkkä. Unscented Rauch–Tung–Striebel smoother. IEEE Trans. Autom. Control, 53(3):845–849, 2008.
• Särkkä and Solin [2012] S. Särkkä and A. Solin. On continuous-discrete cubature Kalman filtering. In SYSID, pages 1221–1226, 2012.
• Wächter and Biegler [2006] A. Wächter and L.T. Biegler. On the implementation of an interior-point filter line-search algorithm for large-scale nonlinear programming. Math. Program., 106(1):25–57, 2006.
• Zeitouni [1989] O. Zeitouni. On the onsager–machlup functional of diffusion processes around non curves. Ann. Probab., 17:1037–1054, 1989.
• Zeitouni and Dembo [1987] O. Zeitouni and A. Dembo. A maximum a posteriori estimator for trajectories of diffusion processes. Stochastics, 20(3):221–246, 1987.
You are adding the first comment!
How to quickly get a good reply:
• Give credit where it’s due by listing out the positive aspects of a paper before getting into which changes should be made.
• Be specific in your critique, and provide supporting evidence with appropriate references to substantiate general statements.
• Your comment should inspire ideas to flow and help the author improves the paper.

The better we are at sharing our knowledge with each other, the faster we move forward.
The feedback must be of minimum 40 characters and the title a minimum of 5 characters