# Parameter Estimation for Partially Observed Hypoelliptic Diffusions

Yvo Pokern    Andrew M. Stuart    Petter Wiberg

###### Abstract

Hypoelliptic diffusion processes can be used to model a variety of phenomena in applications ranging from molecular dynamics to audio signal analysis. We study parameter estimation for such processes in situations where we observe some components of the solution at discrete times. Since exact likelihoods for the transition densities are typically not known, approximations are used that are expected to work well in the limit of small inter-sample times and large total observation times. Hypoellipticity together with partial observation leads to ill-conditioning requiring a judicious combination of approximate likelihoods for the various parameters to be estimated. We combine these in a deterministic scan Gibbs sampler alternating between missing data in the unobserved solution components, and parameters. Numerical experiments illustrate asymptotic consistency of the method when applied to simulated data. The paper concludes with application of the Gibbs sampler to molecular dynamics data.

## 1 Introduction

In many application areas it is of interest to model some components of a large deterministic system by a low dimensional stochastic model. In some of these applications, insight from the deterministic problem itself forces structure on the form of the stochastic model, and this structure must be reflected in parameter estimation. In this paper, we study the fitting of stochastic differential equations (SDEs) to discrete time series data in situations where the model is a hypoelliptic diffusion process, meaning that the covariance matrix of the noise is degenerate, but the probability densities are smooth, and also where observations are only made of variables that are not directly forced by white noise. Such a structure arises naturally in a number of applications.

One application is the modelling of macro-molecular systems, see Grubmüller and Tavan (1994) and Hummer (2005). In its basic form, molecular dynamics describes the molecule by a large Hamiltonian system of ordinary differential equations (ODEs). As is commonplace in chemistry and physics, we will refer to data obtained from numerical simulation of such models as molecular dynamics data. If the molecule spends most of its time in a small number of macroscopic configurations then it may be appropriate to model the dynamics within, and in some cases between, these states by a hypoelliptic diffusion. While this phrasing of the question is relatively recent, under the name of the "Kramers problem" it dates back to Kramers (1940) with a brief summary in section 5.3.6a of Gardiner (1985). Another application, audio signal analysis, is referred to in Giannopoulos and Godsill (2001) where a continuous time ARMA model is used; see also Godsill and Yang (2006) for more on the type of methodology used.

We consider SDE models of the form {linenomath}

 {dx=ΘA(x)dt+CdB, x(0)=x0 (0)

where B is an m-dimensional Wiener process and x a k-dimensional continuous process with x(0) = x0. A is a vector of (possibly non-linear) globally Lipschitz force functions. The parameters which we estimate are the last m rows of the drift matrix Θ (the first k − m rows of which are assumed to be known), and the diffusivity matrix C which we assume to be of the form {linenomath}

 C = [0Γ]∈Rk×m

where Γ ∈ Rm×m is a constant nonsingular matrix and the zero block has k − m rows. Thus, we are estimating drift and diffusion parameters only in the coordinates which are directly driven by white noise.

It is known that under suitable hypotheses on Θ, A and C, a unique square-integrable solution exists almost surely for all times t ≥ 0, see e.g. Theorem 5.2.1 in Oksendal (2000). We also assume that the process defined by (1) is hypoelliptic as defined in Nualart (1991). Intuitively, this corresponds to the noise being spread into all components of the system (1) via the drift.

The structure of C implies that the noise acts directly only on a subset of the variables which we refer to as rough. It may then be transmitted, through the coupling in the drift, to the remaining parts of the system which we refer to as smooth (we do not mean C∞ here, but they are at least C1). To distinguish between rough and smooth variables, we introduce the notation x = (u, v) where u is smooth and v is rough. It is helpful to define projections P by Px = u and Q by Qx = v.

We denote the sample path at N + 1 equally spaced points in time by {xn}Nn=0, and we write xn = (un, vn) to separate the smooth and rough components. Also, for any sequence {an} we write Δan = an+1 − an to denote forward differences. We are mainly interested in cases where only the smooth component, u, is observed and our focus is on parameter estimation for ΓΓT and for entries of those rows of Θ corresponding to the rough path, on the assumption that {un} are samples from a true solution of (1); such a parameter estimation problem arises naturally in many applications and an example is given in section 7. We will describe a deterministic scan Gibbs sampler to approach this problem, sampling alternatingly from the missing path v, the drift parameters Θ and the covariance ΓΓT. It is natural to consider ΓΓT, rather than Γ itself, as the object of inference.

Given prior distributions for the parameters, p0(Θ, ΓΓT), the posterior distribution can be constructed as follows: {linenomath}

 P(v,Θ,ΓΓT|u)=P(v,Θ,ΓΓT,u)P(u)∝L(u,v|Θ,ΓΓT)p0(Θ,ΓΓT)P(u) (0)

Here, L has been introduced as a measure equal to the probability density up to a constant of proportionality. When u and v are fixed and L is thought of as a function of Θ and ΓΓT it is a likelihood.

Similarly, the probability densities , and are replaced by corresponding expressions using when omitting constants of proportionality that are irrelevant to estimation of the posterior probability. The probability density gives rise to the transition density which we will write as when omitting constants of proportionality.

In principle, (1) can be used as the basis for Bayesian sampling of (v, Θ, ΓΓT), viewing v as missing data. However, the exact probability of the path, L(u, v|Θ, ΓΓT), is typically unavailable. In this paper we will combine judicious approximations of this density to solve the sampling problem.

The sequence {xn} defined above is generated by a Markov chain. The random map xn ↦ xn+1 is determined by the integral equation {linenomath}

 xn+1 = xn+∫(n+1)ΔtnΔtΘA(x(s))ds+∫(n+1)ΔtnΔtCdB(s).

The Euler-Maruyama approximation of this map gives {linenomath}

 Xn+1 ≈ Xn+ΔtΘA(Xn)+√ΔtR(0,Θ)ξn (0)

where Xn ≈ xn, {ξn} is an iid sequence of normally distributed random variables, ξn ∼ N(0, I), and {linenomath}

 R(0,Θ)=[000Γ]∈Rk×k

is not invertible. (Here, as throughout, we use uppercase letters to denote discrete-time approximations of the continuous time process.) This approximation corresponds to retaining the terms of order Δt in the drift and of order Δt1/2 in the noise when performing an Ito-Taylor expansion (see chapter 5 of Kloeden and Platen (1992)). Due to the non-invertibility of R(0, Θ), this approximation is unsuitable for many purposes and we extend it by adding the first non-zero noise terms arising in the first k − m rows of the Itô-Taylor expansion for xn+1. This results in the expression {linenomath}

 Xn+1≈Xn+ΔtΘA(Xn)+√ΔtR(Δt;Θ)ξn (0)

where ξn is distributed as before and R(Δt; Θ) → R(0, Θ) as Δt → 0. Because of the hypoellipticity, R(Δt; Θ) is now invertible, but the zeros in R(0, Θ) mean that it is highly ill-conditioned (or near-degenerate) for Δt ≪ 1. Specific examples for the matrix R(Δt; Θ) will be given later.

Ideally we would like to implement the following deterministic scan Gibbs sampler:

1. Sample Θ from P(Θ|u, v, ΓΓT).

2. Sample ΓΓT from P(ΓΓT|u, v, Θ).

3. Sample v from P(v|u, Θ, ΓΓT).

4. Restart from step (a) unless sufficiently equilibrated.

In practice, however, approximations to the densities will be needed. We refer to expressions of the form (1) as auxiliary models and we will use them to approximate the exact density on path-space, L(u, v|Θ, ΓΓT), of the path for given parameter values. The resulting approximations, LE and LIT of L, are derived from the Euler model and from the extended model with R(Δt; Θ), respectively. We again use LE and LIT in the same way as for the exact distribution above when omitting constants of proportionality.

The questions we address in this paper are:

1. How does the ill-conditioning of the Markov chain affect parameter estimation for ΓΓT and for the last m rows of Θ in the regime Δt ≪ 1?

2. In many applications, it is natural that only the smooth data u is observed, and not the rough data v. What effect does the absence of observations of the rough data have on the estimation for Θ and ΓΓT?

3. The exact likelihood is usually not available; what approximations of the likelihood should be used, in view of the ill-conditioning?

4. How should the answers to these questions be combined to produce an effective Gibbs loop to sample the distribution of the parameters (Θ, ΓΓT) and the missing data v?

To tackle these issues, we use a combination of analysis and numerical simulation, based on three model problems which are conceived to highlight issues central to the questions above. We will use analysis to explain why some seemingly reasonable methods fail, and simulation will be used both to extend the validity of the analysis and to illustrate good behaviour of the new method we introduce.

For the numerical simulations, we will use either exact discrete time samples of (1) in simple Gaussian cases, or trajectories obtained by Euler-Maruyama simulation of the SDE on a temporal grid with a spacing δt considerably finer than the observation time interval Δt.

In section 2 we will introduce our three model problems and in section 3 we study the performance of LE to estimate the diffusion coefficient. Observing and analysing its failure in the case with partial observation leads to the improved statistical model yielding LIT which eliminates these problems; we introduce this in section 4. In section 5 we show that LIT is inappropriate for drift estimation, but that LE is effective in this context. In section 6, the individual estimators will be combined into a Gibbs sampler to solve the overall estimation problem with asymptotically consistent performance being demonstrated numerically. Section 7 contains an application to molecular dynamics and section 8 provides concluding discussion.

We introduce one item of notation to simplify the presentation. Given an invertible matrix R we introduce a new norm using the Euclidean norm |·| on Rk by setting ∥x∥R = |R−1x| for vectors x ∈ Rk.

### 1.1 Two classical estimators

From previous work on hypoelliptic diffusions, we note a classical estimator for the covariance matrix ΓΓT and one for the drift matrix Θ in the linear fully observed case which will be useful for reference later in the paper.

Firstly, it is straightforward to estimate the covariance matrix ΓΓT from the quadratic variation, noting that {linenomath}

 1TN−1∑n=0(vn+1−vn)(vn+1−vn)T→ΓΓTasN→∞, (0)

with T fixed, see Durrett (1996).

The Girsanov formula gives rise to a maximum likelihood estimator for the lower m rows of Θ, and in the linear case, where A is just the identity, the maximum likelihood estimate for the whole of Θ is given by {linenomath}

 ^Θ=[∫T0dxxT][∫T0xxTdt]−1. (0)

For the hypoelliptic case, this is proved to be consistent as T → ∞ in Breton and Musiela (1985).
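
To make (1.1) concrete in the scalar case, the following sketch (our own illustration, with arbitrary values for σ, Δt and N) applies the quadratic variation estimator to exact samples of a pure diffusion path:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, dt, N = 1.0, 0.01, 100_000  # illustrative values

# Exact samples of a scalar rough component:
# v_{n+1} = v_n + sigma * sqrt(dt) * zeta_n,  zeta_n iid N(0, 1)
v = sigma * np.sqrt(dt) * np.concatenate([[0.0], rng.standard_normal(N)]).cumsum()

# Quadratic variation estimator (1.1): (1/T) * sum of squared increments
T = N * dt
sigma2_hat = np.sum(np.diff(v) ** 2) / T
print(sigma2_hat)  # close to sigma^2 = 1
```

The relative error of this estimator is of order N−1/2, so it is already accurate for moderate sample sizes.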

## 2 Model Problems

To study the performance of parameter estimators, we have selected a sequence of three Model Problems ranging from simple linear stochastic growth through a linear oscillator subject to noise and damping to a nonlinear oscillator of similar form. All these problems are second order hypoelliptic and they have a physical background, so we use q (position) and p (momentum) to denote smooth and rough components in the Model Problems instead of u and v which we used in the general case. Their general form is given as the second order Langevin equation {linenomath}

 {dq=pdt,dp=(−γp+f(q;D))dt+σdB (0)

where f is some (possibly nonlinear) force-function parametrised by D and the variables q and p are scalar. The parameters γ, D and σ are to be estimated.

### 2.1 Model Problem I: Stochastic Growth

Here, x = (q, p)T satisfies {linenomath}

 {dq=pdtdp=σdB. (0)

The process has one parameter, the diffusion parameter σ, that describes the size of the fluctuations. In the setting of (1) we have {linenomath}

 A(x)=x,Θ=[0100],C=[0σ]

and k = 2, m = 1. The process is Gaussian with mean and covariance {linenomath}

 μ(t)=[1t01][q0p0]andΣ(t)=σ2[t3/3t2/2t2/2t].

The exact discrete samples may be written as {linenomath}

 ⎧⎪⎨⎪⎩qn+1=qn+pnΔt+σ(Δt)3/2√12ζ(1)n+σ(Δt)3/22ζ(2)n,pn+1=pn+σ√Δtζ(2)n, (0)

with ζn = (ζ(1)n, ζ(2)n)T ∼ N(0, I) and {ζn} being i.i.d.; individual components of ζn are referred to as ζ(1)n and ζ(2)n respectively. The matrix R(Δt; Θ) from (1) is given here as {linenomath}

 R=σ[1√12Δt12Δt01].

In the case of this model problem, the auxiliary model (1) is actually exact.
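
The exactness can be checked numerically: the one-step noise covariance ΔtRRT of the auxiliary model coincides with the exact covariance Σ(Δt). A small check (ours, with arbitrary σ and Δt):

```python
import numpy as np

sigma, dt = 1.3, 0.01  # arbitrary illustrative values

# R from the auxiliary model; the one-step noise is sqrt(dt) * R @ xi
R = sigma * np.array([[dt / np.sqrt(12.0), dt / 2.0],
                      [0.0,                1.0]])
one_step_cov = dt * R @ R.T

# Exact one-step covariance Sigma(dt) of Model Problem I
Sigma = sigma**2 * np.array([[dt**3 / 3.0, dt**2 / 2.0],
                             [dt**2 / 2.0, dt]])

match = np.allclose(one_step_cov, Sigma)
print(match)  # True: the noise model reproduces the exact covariance
```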

### 2.2 Model Problem II: Harmonic Oscillator

As our second model problem we consider a damped harmonic oscillator driven by a white noise forcing, where D, γ > 0:

{linenomath}
 {dq=pdtdp=−Dqdt−γpdt+σdB. (0)

This model is obtained from the general SDE (1) for the choice {linenomath}

 A(x)=x,Θ=[01−D−γ],C=[0σ]

and k = 2, m = 1. The process is Gaussian and the mean and covariance of the solution can be explicitly calculated. The matrix R(Δt; Θ) is the same as in Model Problem I.

### 2.3 Model Problem III: Oscillator with Trigonometric Potential

In the third model problem, x = (q, p)T describes the dynamics of a particle moving in a potential which is a superposition of trigonometric functions and in contact with a heat bath obeying the fluctuation-dissipation relation, see Lasota and Mackey (1994). This potential is sometimes used in molecular dynamics in connection with the dynamics of dihedral angles – see section 7. The model is {linenomath}

 {dq=pdt,dp=(−γp−∑cj=1Djsin(q)cosj−1(q))dt+σdB. (0)

This equation has parameters γ, σ and D = (D1, …, Dc). It can be obtained from the general SDE (1) for the choice {linenomath}

 A([qp])=⎡⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢⎣sin(q)sin(q)cos(q)⋮sin(q)cosc−1(q)p⎤⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥⎦,Θ=[0…01−D1…−Dc−γ],C=[0σ]

and k = 2, m = 1. No explicit closed-form expression for the solution of the SDE is known in this case; the process is not Gaussian. The matrix R(Δt; Θ) in the statistical model (1) is the same as the one obtained in Model Problem I.
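
To make the factorisation ΘA(x) concrete, the following sketch (the function name and test values are ours) evaluates the drift of Model Problem III and compares it with the right hand side written out directly:

```python
import numpy as np

def drift_model3(q, p, D, gamma):
    """Drift Theta @ A(x) of Model Problem III with c = len(D) force terms."""
    c = len(D)
    # A(x) = (sin q, sin q cos q, ..., sin q cos^{c-1} q, p)
    A = np.array([np.sin(q) * np.cos(q) ** (j - 1) for j in range(1, c + 1)] + [p])
    # Theta = [[0, ..., 0, 1], [-D_1, ..., -D_c, -gamma]]
    Theta = np.vstack([np.concatenate([np.zeros(c), [1.0]]),
                       np.concatenate([-np.asarray(D, dtype=float), [-gamma]])])
    return Theta @ A

# Check against dq/dt = p, dp/dt = -gamma*p - sum_j D_j sin(q) cos^{j-1}(q)
q, p, D, gamma = 0.7, -1.2, [4.0, 1.5], 0.5
direct = np.array([p, -gamma * p - (D[0] * np.sin(q) + D[1] * np.sin(q) * np.cos(q))])
print(np.allclose(drift_model3(q, p, D, gamma), direct))  # True
```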

## 3 Euler Auxiliary Model

As discussed in the introduction, we need to find appropriate approximations for use in steps (a)–(c) of the desired Gibbs loop. The purpose of this section is to show that use of LE in step (c), to sample the missing component of the path, leads to incorrect estimation of the diffusion coefficient. The root cause is the numerical differentiation for the missing path which is implied by the Euler approximation.

### 3.1 Auxiliary Model

If the force function is nonlinear, closed-form expressions for the transition density are in general unavailable. To overcome this obstacle, one can use a discrete time auxiliary model. The Euler model (1) is commonly used and we apply it to a simple linear model problem to highlight its deficiencies in the case of partially observed data from hypoelliptic diffusions.

The Euler-Maruyama approximation of the SDE (1) is

 Xn+1 = Xn+ΔtΘA(Xn)+√ΔtCξn (0)

where {ξn} is an i.i.d. sequence of m-dimensional vectors with standard normal distribution. This corresponds to (1) with R(Δt; Θ) replaced by R(0, Θ) from (1). Thus we obtain {linenomath}

 {Un+1=Un+ΔtPΘA(Xn)Vn+1=Vn+ΔtQΘA(Xn)+√ΔtΓξn} (0)

where now each element of the i.i.d. sequence {ξn} is distributed as N(0, I) in Rm. This model gives rise to the following density: {linenomath}

 LND(U,V|Θ,ΓΓT)=∏N−1n=0exp(−12∥ΔVn−ΔtQΘA(Xn)∥2Γ)√2π|ΓΓT|δ(Un+1−UnΔt−PΘA(Xn)). (0)

The Dirac mass insists that the data is compatible with the auxiliary model (3.1), i.e. the rough path must be given by numerical differentiation (ND) of the smooth path in the case of (2), and similar formulae in the general case. To estimate parameters we will use the following expression: {linenomath}

 LE(U,V|Θ,ΓΓT)=∏N−1n=0exp(−12∥ΔVn−ΔtQΘA(Xn)∥2Γ)√2π|ΓΓT|. (0)
In the case when the Euler model is used to estimate missing components we assume that U and V are related so that the data is compatible with the auxiliary model – that is, numerical differentiation is used to find V from U.

### 3.2 Model Problem I

The Euler auxiliary model for this model problem is {linenomath}

 {Qn+1=Qn+PnΔt,Pn+1=Pn+σ√Δtξn. (0)

Here, {ξn} is an i.i.d. sequence with ξn ∼ N(0, 1). The root cause of the phenomena that we discuss in this paper is manifest in comparing (2.1) and (3.2). The difference is that the white noise contributions to the first component in the exact time series (2.1) do not appear in the equation for Qn+1. We will see that this plays havoc with parameter estimation, even though the Euler method is path-wise convergent.

We assume that observations of the smooth component only, {Qn}, are available. In this case the Euler method for estimation (3.2) gives the formula {linenomath}

 Pn=Qn+1−QnΔt (0)

for the missing data. In the following numerical experiment we generate exact data from (2.1) using the parameter value σ = 1. We substitute Pn given by (3.2) into (3.1) and find the maximum likelihood estimator for σ in the case of partial observation. In the case of complete observation we use the exact value for Pn, from (2.1), and again use a maximum likelihood estimator for σ from (3.1).

Using timesteps Δt = 0.1 for a final time of T = 10, the histograms for the estimated diffusion coefficient presented in the middle column of Figure 1 are obtained. The top row contains histograms obtained in the case of complete observation where good agreement between the true σ and the estimates is observed. The bottom row contains the histograms obtained for partial observation using (3.2). The observed mean value of the estimates indicates that the method yields biased estimates. Increasing the final time to T = 100 (see left column of graphs in Figure 1) or increasing the resolution to Δt = 0.01 (see right column of graphs in Figure 1) does not remove this bias.

Thus we see that, in the case of partial observation, the estimate of σ contains O(1) errors which do not diminish with decreasing Δt and/or increasing T.
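
The bias is easy to reproduce. The following simulation (our own illustration) draws exact samples from (2.1), recovers P by numerical differentiation as in (3.2), and evaluates the estimator (3.3); the estimate settles near (2/3)σ2 rather than σ2:

```python
import numpy as np

rng = np.random.default_rng(1)
sigma, dt, N = 1.0, 0.01, 200_000

# Exact samples of Model Problem I, following (2.1)
z1 = rng.standard_normal(N)
z2 = rng.standard_normal(N)
p = sigma * np.sqrt(dt) * np.concatenate([[0.0], z2]).cumsum()
dq = p[:-1] * dt + sigma * dt**1.5 * (z1 / np.sqrt(12.0) + z2 / 2.0)
q = np.concatenate([[0.0], dq]).cumsum()

# Partial observation: recover P by numerical differentiation, then apply (3.3)
P_hat = np.diff(q) / dt
sigma2_hat = np.sum(np.diff(P_hat) ** 2) / ((N - 1) * dt)
print(sigma2_hat)  # concentrates near 2/3, not 1
```

This persistent factor 2/3 is exactly the limit derived analytically in the next subsection.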

### 3.3 Analysis of why the missing data method fails

Model Problem I can be used to illustrate why this method fails. We first argue that the method works without hidden data. Interpreting (3.1) as a log-likelihood function with respect to σ, we obtain the following expression in the case of stochastic growth: {linenomath}

 logLE(σ|Q,P)=−2Nlogσ−1σ2ΔtN−1∑n=0(ΔPn)2

where Δ is the forward difference operator. The maximum of the log-likelihood function gives the maximum likelihood estimate, {linenomath}

 ˆσ2=1NΔtN−1∑n=0(ΔPn)2. (0)

In the case of complete data, (2.1) gives {linenomath}

 ˆσ2=σ2NN−1∑n=0(ζ(2)n)2. (0)

By the law of large numbers, ˆσ2 → σ2 almost surely as N → ∞. This shows that the method works when the complete data is observed.

Let us consider what happens when P is hidden. In this case, Pn is estimated by {linenomath}

 ˆPn=Qn+1−QnΔt.

But since is generated by (2.1) we find that {linenomath}

 ˆPn=Pn+1+Pn2+σ√Δt√12ζ(1)n

and {linenomath}

 ΔˆPn =ΔPn+12+ΔPn2+σ√Δt√12(ζ(1)n+1−ζ(1)n) =σ√Δt2(ζ(2)n+1+ζ(2)n+1√3ζ(1)n+1−1√3ζ(1)n)

When ΔˆPn is inserted in (3.3) it follows that {linenomath}

 ˆσ2 =σ24NN−1∑n=0⎛⎝ζ(2)n+1+ζ(2)n+ζ(1)n+1−ζ(1)n√3⎞⎠2 =σ24N[N−1∑n=0⎛⎝ζ(2)n+1+ζ(1)n+1√3⎞⎠2+N−1∑n=0(ζ(2)n−ζ(1)n√3)2 +2N−1∑n=0(ζ(2)n−ζ(1)n√3)⎛⎝ζ(2)n+1+ζ(1)n+1√3⎞⎠].

The random variables ζ(2)n − ζ(1)n/√3 and ζ(2)n+1 + ζ(1)n+1/√3 are i.i.d. with mean zero and variance 4/3. So, by the law of large numbers, ˆσ2 → (2/3)σ2 almost surely as N → ∞. Furthermore, the limits hold in either of the cases where T or Δt are fixed as N → ∞. This means that independently of what limit is considered, a seemingly reasonable estimation scheme based on Euler approximation results in O(1) errors in the diffusion coefficient. There is similarity here with work of Gaines and Lyons (1997) showing that adaptive methods for SDEs get the quadratic variation wrong if the adaptive strategy is not chosen carefully.

## 4 Improved Auxiliary Model

The Euler auxiliary model fails to propagate noise to the smooth component of the solution and thus leads to estimating missing paths with incorrect quadratic variation. A new auxiliary model is thus proposed which propagates the noise using what amounts to an Itô-Taylor expansion, retaining the leading order component of the noise in each row of the equation. The model is used to set up an estimator for the missing path using a Langevin sampler on path-space which is then simplified to a direct sampler in the Gaussian case. Numerical experiments indicate that the method yields the correct quadratic variation for the simulated missing path.

The model is motivated using the common framework of Model Problems I, II and III, namely (2). The improved auxiliary model is based on the observation that in the second row of an Itô-Taylor expansion of (2) the drift terms are of size Δt whereas the random forcing term is "typically" (in root mean square) of size Δt1/2. Thus, neglecting the contribution of the drift term in the second row on the first row leads to the following approximation of (2): {linenomath}

 [Qn+1Pn+1]=[QnPn]+Δt[Pnf(Qn)−γPn]+σ[∫(n+1)ΔtnΔt(B(s)−B(nΔt))dsB((n+1)Δt)−B(nΔt)]

The random vector on the right hand side is Gaussian, and can be expressed as a linear combination of two independent normally distributed Gaussian random variables. Computation of the variances and the correlation is straightforward leading to the following statistical model: {linenomath}

 [Qn+1Pn+1]=[QnPn]+Δt[Pnf(Qn)−γPn]+σ√ΔtR[ξ1ξ2] (0)

Here, ξ1 and ξ2 are independent standard normal random variables and R is given as {linenomath}

 R = [Δt√12Δt201]

This is a specific instance of (1). It should be noted that this model is in agreement with the Ito-Taylor approximation up to error terms of order Δt2 in the first row and Δt3/2 in the second row and that higher order hypoelliptic processes can be approximated using a similarly truncated Ito-Taylor expansion. The key idea is to propagate noise into all components of the system, to leading order.

If complete observations are available, this model performs satisfactorily for the estimation of σ. This can be verified analytically for Model Problem I in the same fashion as in section 3.3. Numerically, this can be seen from the first row (referring to complete observation) of Figure 2 for Model Problem I and from the first row of Figure 3 for Model Problem II. In both cases the true value is given by σ = 1. See subsection 4.2 for a full discussion of these numerical experiments.

If only partial observations are available, however, a means of reconstructing the hidden component of the path must be procured. A standard procedure would be the use of the Kalman filter/smoother (Kalman (1960), Catlin (1989)) which could then be combined with the expectation-maximisation (EM) algorithm (Dempster et al. (1977), Meng and van Dyk (1997)) to estimate parameters. In this paper, however, we employ a Bayesian approach sampling directly from the posterior distribution for the rough component, P, without factorising the sampling into forward and backward sweeps.

### 4.1 Path Sampling

The logarithm of the density on path space for the missing data induced by the auxiliary model (1) can be written as follows: {linenomath}

 logLIT(p|q,Θ,ΓΓT)=−12N∑l=0∥ΔXl−ΘA(Xl)Δt∥2R+const. (0)

We will apply this in the case (4) which is a specific instance of (1).

One way to sample from the density on path space, LIT(P|Q, Θ, ΓΓT), for rough paths is via the Langevin equation (see section 6.5.2 in Robert and Casella (1999)) and, in general, we expect this to be effective in view of the high dimensionality of the path P. Other MCMC approaches may also be used.

However, when the joint distribution of (Q, P) is Gaussian it is possible to generate independent samples as follows: note first that, in the Gaussian case, when logLIT in (4.1) is quadratic in P, the derivative of logLIT with respect to the rough path can be computed explicitly, which is carried out in Pokern (2007). For our oscillator framework, the derivative can be expressed using a tridiagonal, negative definite matrix Pmat with highest order stencil acting on the P-vector plus a possibly nonlinear contribution acting on the Q-vector only: {linenomath}

 ∇plogLIT(Q,P)=PmatP+Q(Q).

Then, the suggested direct sampler for P-paths is simply: {linenomath}

 P=−P−1matQ(Q)+U−1ξ (0)

Here, UTU = −Pmat defines the Cholesky factorisation with U upper triangular, and ξ is a vector of i.i.d. standard normal random numbers of the same dimension as the P-path.
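
As an illustration of the direct sampler (4.1), the following sketch (ours) uses a hypothetical tridiagonal negative definite Pmat built from a second-difference stencil, and a random stand-in for the Q-dependent contribution; neither is taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(2)
N = 200
dt = 0.1

# Hypothetical tridiagonal, negative definite Pmat (second-difference stencil)
Pmat = (np.diag(-2.0 * np.ones(N)) +
        np.diag(np.ones(N - 1), 1) +
        np.diag(np.ones(N - 1), -1)) / dt
g = rng.standard_normal(N)        # stand-in for the Q-dependent contribution

# Direct sampler (4.1): P = -Pmat^{-1} g + U^{-1} xi  with  U^T U = -Pmat
U = np.linalg.cholesky(-Pmat).T   # upper triangular Cholesky factor
mean = -np.linalg.solve(Pmat, g)
P_sample = mean + np.linalg.solve(U, rng.standard_normal(N))
print(P_sample.shape)
```

Since the precision −Pmat is tridiagonal, a banded Cholesky factorisation would reduce the cost from O(N3) to O(N); dense linear algebra is used above only for brevity.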

### 4.2 Estimating Diffusion Coefficient and Missing Path

The approximation LIT can be used to estimate both the missing path and the diffusion coefficient for our Model Problems I, II and III.

In order to estimate σ, the derivative of the logarithm of {linenomath}

 logLIT(σ|P,Q,Θ)=logLIT(P,Q|σ,Θ)+logp0(Θ,σ)+const

(where priors p0 are assumed to be given and constants in σ have been omitted) with respect to σ is computed: {linenomath}

 ∂∂σlogLIT = −2Nσ+1σ3Z+∂∂σlog(p0(Θ,σ)).

Here, we have used the abbreviation {linenomath}

 Z := N−1∑p=0∥∥∥((Qp+1Pp+1)−(QpPp)−Δt(Ppf(Qp)−γPp))∥∥∥2R.

In this case no prior distribution was felt necessary as, when N ≫ 1, its importance would diminish rapidly. Thus we set p0 ≡ 1.

We use a Langevin type sampler for this distribution. In order to avoid the singularity at σ = 0 we use the transformation ζ = σ4. Using the Itô formula, this yields the following Langevin equation which we use to sample ζ and hence σ: {linenomath}

 dζ = ((12−8N)√ζ+4Z)ds+4√2ζ34dW. (0)

A simple explicit Euler-Maruyama discretisation in s is used to simulate paths for this SDE. The timestep Δs needs to be tuned with N to ensure convergence of the explicit integrator. Since this is a one-dimensional problem, conservatively small timesteps and long integration times can be afforded. With this choice of timestep the theoretically possible transient behaviour (see Roberts and Tweedie (1997)) was not observed and we expect accurate samples from the posterior in σ.
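
A minimal sketch of such a sampler (our own illustration; N, the synthetic statistic Z and the timestep Δs are arbitrary choices, with Z fixed so that the posterior mode lies at σ = 1):

```python
import numpy as np

rng = np.random.default_rng(4)

N = 500            # number of observations (illustrative)
Z = 2.0 * N        # synthetic sufficient statistic: posterior mode at sigma^2 = Z/(2N) = 1

ds, n_steps = 1.0e-4, 200_000
zeta = 1.0         # zeta = sigma^4 (transformed variable)
sigmas = []
for i in range(n_steps):
    # Euler-Maruyama step for (4.2)
    drift = (12.0 - 8.0 * N) * np.sqrt(zeta) + 4.0 * Z
    zeta += drift * ds + 4.0 * np.sqrt(2.0) * zeta**0.75 * np.sqrt(ds) * rng.standard_normal()
    zeta = max(zeta, 1e-12)   # guard against leaving the positive half-line
    if i >= n_steps // 2:     # discard the first half as burn-in
        sigmas.append(zeta**0.25)

sigma_est = float(np.mean(sigmas))
print(sigma_est)  # close to 1
```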

This Langevin-type sampler (4.2) can then be alternated in a Systematic Scan Gibbs Sampler (as described on p.130 of Liu (2001)) with the direct sampler for the paths, (4.1), using NGibbs iterations. This yields estimates of the missing path and of the diffusion coefficient, the latter being estimated by averaging over the latter half of the samples. We illustrate this with an example using Model Problem I with the following parameters:

 σ=1, T∈{10,100}, Δt∈{0.1,0.01}, NGibbs=50.

The sample paths used for the fitting are generated using a sub-sampled Euler-Maruyama method on a finer temporal grid. The resulting histogram of mean posterior estimators is given in Figure 2 where the first row corresponds to the behaviour when complete observations are available and the second row corresponds to only the smooth component being observed and missing data being sampled according to (4.1). For Model Problem II we use the following parameters:

 σ=1, D=4, γ=0.5, T∈{10,100}, Δt∈{0.02,0.002}, NGibbs=50.

The sample paths used for the fitting are generated as for Model Problem I and the experimental results are given in Figure 3.

It appears from these figures that the estimator for this joint problem performs well for Model Problems I and II for Δt sufficiently small and T sufficiently large. A more careful investigation of the convergence properties is postponed to section 6 where drift estimation will be incorporated in the procedure.

## 5 Drift Estimation

### 5.1 Overview

With the approximations LE and LIT in place, the question arises which of these should be used to estimate the drift parameters. Using Model Problem II we numerically observe that an LE based maximum likelihood estimator performs well. In contrast, ill-conditioning due to hypoellipticity leads to error amplification and affects the performance of the LIT based maximum likelihood estimator.

### 5.2 Drift Parameters from LE

In order to simplify the analysis, we illustrate the estimator using Model Problems II, (2.2), and III, (2.3). For the latter, the Euler auxiliary model is given as follows: {linenomath}

 {Qn+1=Qn+ΔtPnPn+1=Pn−Δt∑ci=1Difi(Qn)−ΔtγPn+√Δtσξn, (0)

where we abbreviated the trigonometric force terms by fi(q) = sin(q)cosi−1(q). The functional LE in this case is given by: {linenomath}

 LE(γ,D|Q,P,σ)∝exp(−∑N−1n=0(ΔPn+Δt∑ci=1Difi(Qn)+ΔtγPn)22Δtσ2). (0)
Clearly, this posterior is Gaussian with distribution {linenomath}

 ˆΘ∼N(M−1EbE,M−1E), (0)

where the matrix ME and the vector bE can be read off from (5.2).
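
For fully observed data from Model Problem II the mean of (5.2) is a linear least squares solution, which can be sketched as follows (our own illustration; for brevity the data is simulated by Euler-Maruyama on the observation grid itself rather than on a finer grid):

```python
import numpy as np

rng = np.random.default_rng(5)
D_true, gamma_true, sigma, dt, N = 4.0, 0.5, 0.5, 0.005, 200_000

# Simulate Model Problem II by Euler-Maruyama on the observation grid
q = np.zeros(N + 1)
p = np.zeros(N + 1)
for n in range(N):
    q[n + 1] = q[n] + p[n] * dt
    p[n + 1] = p[n] + (-D_true * q[n] - gamma_true * p[n]) * dt \
                    + sigma * np.sqrt(dt) * rng.standard_normal()

# Mean of the L^E posterior (5.2): least squares of Delta P on dt * (-q, -p)
X = np.column_stack([-q[:-1], -p[:-1]]) * dt
y = np.diff(p)
D_hat, gamma_hat = np.linalg.lstsq(X, y, rcond=None)[0]
print(D_hat, gamma_hat)  # close to (4.0, 0.5)
```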

### 5.3 Drift Parameters from LIT

As the approximate model based on LIT is observed to resolve the difficulty with estimating σ for hidden P-paths, it is interesting to see whether it can also be used to estimate the drift parameters.

The logarithm of the density on path space up to an additive constant is given by (4.1). To illustrate the problems arising from the use of LIT we use Model Problem II, so that (4.1) becomes {linenomath}

 logLIT(Θ|Q,P,σ) = −12ΔtN−1∑n=0∥(ΔXn−ΔtΘA(Xn))∥2R+const (0)

where Xn = (Qn, Pn)T, irrelevant constants have been omitted and we have {linenomath}

 A([QnPn])=[QnPn],Θ=[01−D−γ].

In order to obtain a maximum likelihood estimator from this, we take the derivative with respect to the parameters D and γ and equate it to zero. This yields the following linear system: {linenomath}

 [∑nQ2nΔt∑nPnQnΔt∑nPnQnΔt∑nP2nΔt][^D^γ] = [−∑nQnΔPn−∑nPnΔPn]+⎡⎢⎣∑n32Qn(ΔQnΔt−Pn)∑n32Pn(ΔQnΔt−Pn)⎤⎥⎦ (0)

Comparing this linear system to the mean of the successful estimator (5.2) we note the presence of an additional term on the right hand side. This term leads to the failure of the above estimator. Thus, LIT is not an appropriate approximation for use in step (a) of the Gibbs sampler.

### 5.4 Numerical Check: Drift

There are two factors influencing convergence: Δt and T. To illustrate their influence, consider the following series of numerical tests. All of the tests share these parameters:

 D=4 γ=0.5 σ=0.5 k=30

Data for the tests is again generated using an Euler-Maruyama method on a finer temporal grid.

In the plot given in Figure 4 the top row contains histograms for the maximum likelihood estimate of the drift parameter D whereas the second row contains histograms for the drift parameter γ, in each case using the full sample path for maximum likelihood inference, i.e. formula (5.3). It is clear from these experiments summarised in Figure 4 that both D and γ are grossly underestimated by (5.3). This problem does not resolve for smaller Δt (see the right column of that figure); it does not disappear for longer intervals of observation, either, as can be inferred from the left column of Figure 4.

### 5.5 Why the LIT Model Fails for the Drift Parameters

The key is to compare (5.3) with the mean in (5.2). This reveals that the last term in (5.3) is an error term which we now study.

Using the 2nd order Itô-Taylor approximation {linenomath}

 Xn+1 = Xn+ΔtAXn+[10−γ1]R[ξ1ξ2]+12Δt2A2Xn+O(Δt52)

we can compute the second term on the right hand side of (5.3): {linenomath}

 (0)

Here, D and γ refer to the exact drift parameters used to generate the sample path, whereas ˆD and ˆγ in (5.3) and (5.5) are the drift parameters estimated using the improved auxiliary model. The term on the right hand side contains stochastic integrals whose expected value is zero.

As the mean error terms can be written in terms of the matrix elements themselves, (5.5) can be substituted in (5.3) to obtain: {linenomath}

 E^D = 14D+O(Δt) (0) E^γ = 14γ+O(Δt). (0)

This seems to be corroborated by the numerical tests.

### 5.6 Conclusion for Drift Estimation

We observed numerically, but do not show here, that LE associated with an Euler model for the SDE (1) yields asymptotically consistent Langevin and maximum likelihood estimators for Model Problem II.

While it is aesthetically desirable to base the estimation of all parameters as well as the missing data on the same approximation of the true density (up to multiplicative constants), and although this approximation was found to work well for the estimation of missing data and the diffusion coefficient, it does not work for the drift parameters.

It is possible to trace this failure to the fact that only the second row of $A$ is estimated, so that errors in the first row get amplified into errors in the second row. Estimating all entries of $A$, while being outside the specification of the problem under consideration, also yields errors if the Itô-Taylor likelihood is used, and so does not remedy the problem. This problem is not shared by the discretised version of the diffusion independent estimator (1.1), but this is not a maximum likelihood estimator.

In summary, for the purposes of fitting our model problems to observed data we employ the Euler auxiliary model (5.2) for the drift parameters.

## 6 The Gibbs Loop

In this section, we combine the insights obtained in previous sections to formulate an effective algorithm to fit hypoelliptic diffusions to partial observations of data at discrete times. We apply a deterministic scan Gibbs sampler alternating between missing data (the rough component of the path), drift parameters and diffusion parameters.

We combine the approximations developed and motivated in previous sections in the following Gibbs sampler:

1. Sample the drift parameters from their conditional distribution given the current missing data and diffusion parameters.

2. Sample the diffusion parameters from their conditional distribution given the current missing data and drift parameters.

3. Sample the missing data from its conditional distribution given the current drift and diffusion parameters.

4. Restart from step 1 unless sufficiently equilibrated.
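A toy analogue of this deterministic scan can be sketched in code. The model below is hypothetical and chosen only so that each block can be sampled exactly: it is a conjugate normal model with missing data, where the mean plays the role of the drift parameters and the precision that of the diffusion parameter; it does not use the samplers of Sections 4 and 5.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: y_obs observed, n_mis values treated as missing
true_mu, true_tau = 2.0, 4.0
n_obs, n_mis = 400, 100
y_obs = rng.normal(true_mu, true_tau**-0.5, n_obs)

mu, tau = 0.0, 1.0
mus, taus = [], []
for k in range(2000):
    # (1) missing data | mu, tau
    y_mis = rng.normal(mu, tau**-0.5, n_mis)
    y = np.concatenate([y_obs, y_mis])
    # (2) mu | y, tau  (flat prior: normal conditional)
    mu = rng.normal(y.mean(), (tau * y.size)**-0.5)
    # (3) tau | y, mu  (flat prior on log tau: Gamma conditional)
    tau = rng.gamma(y.size / 2.0, 2.0 / np.sum((y - mu)**2))
    mus.append(mu)
    taus.append(tau)

# posterior means from the latter half of the chain
mu_hat = np.mean(mus[1000:])
tau_hat = np.mean(taus[1000:])
```

The deterministic scan visits the three blocks in a fixed order each sweep; because every conditional here is exact, the chain equilibrates quickly and the posterior means recover the generating values.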

Our numerical results will show that this judicious combination of approximations results in an effective algorithm. Theoretical justification remains an interesting open problem.

When applied to Model Problem III the detailed algorithm reads as follows:

###### Algorithm 1

Given observations $Q_i$, the initial $P$-path is obtained using numerical differentiation:

$$P^{(0)}_i = \frac{\Delta Q_i}{\Delta t}.$$

The initial drift parameter estimate is just set to zero. Then start the Gibbs loop:

For $k = 1, \dots, N_{\mathrm{Gibbs}}$:

1. Estimate the drift parameters $D_i$ and $\gamma$ by sampling from their conditional distribution given the current $P$-path, via (5.2).

2. Estimate the diffusivity $\sigma$ using the Langevin sampler (4.2), given the current path and the drift parameters $D_i$, $\gamma$.

3. Draw an independent sample of the $P$-path using (4.1), given the parameters $D_i$, $\gamma$ and $\sigma$.
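The initialisation and scan structure of Algorithm 1 can be sketched as follows. The three conditional samplers are hypothetical placeholders standing in for (5.2), (4.2) and (4.1); only the finite-difference initialisation, the zero drift initialisation and the loop order follow the text.

```python
import numpy as np

# Hypothetical stand-ins for the conditional samplers (5.2), (4.2), (4.1);
# they mimic the interfaces only, not the actual conditional densities.
def sample_drift(Q, P, sigma, rng):
    return rng.standard_normal(4)            # (D0, D1, D2, gamma)

def sample_diffusion(Q, P, theta, rng):
    return 0.1 + abs(rng.standard_normal())  # positive diffusivity

def sample_path(Q, theta, sigma, rng):
    return np.diff(Q) + sigma * rng.standard_normal(Q.size - 1)

def gibbs_fit(Q, dt, n_gibbs=50, seed=0):
    """Skeleton of Algorithm 1: initialisation by numerical
    differentiation, zero initial drift, then the deterministic scan."""
    rng = np.random.default_rng(seed)
    P = np.diff(Q) / dt      # P^(0)_i = Delta Q_i / Delta t
    theta = np.zeros(4)      # initial drift parameter estimate set to zero
    sigma = 1.0
    for k in range(n_gibbs):
        theta = sample_drift(Q, P, sigma, rng)      # step 1, cf. (5.2)
        sigma = sample_diffusion(Q, P, theta, rng)  # step 2, cf. (4.2)
        P = sample_path(Q, theta, sigma, rng)       # step 3, cf. (4.1)
    return theta, sigma, P
```

Replacing the three stubs with the samplers of Sections 4 and 5 yields the full algorithm; the skeleton makes the data flow between the blocks explicit.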

We test this algorithm numerically, where sample paths of (2.3) are generated using a sub-sampled Euler-Maruyama approximation of the SDE. The data is generated using a timestep that is smaller than the observation timestep $\Delta t$ by one of two subsampling factors. Comparing the results for these two and other non-reported cases, they are found not to depend on the rate of subsampling, provided this is chosen large enough. The parameters used for these simulations are as follows:

$$D_0 = 1, \quad D_1 = -8, \quad D_2 = 8, \quad \gamma = 0.5, \quad \sigma = 0.7, \quad T = 500, \quad \Delta t \in \left\{\tfrac{1}{2}, \dots, \tfrac{1}{128}\right\}, \quad N_{\mathrm{Gibbs}} = 50.$$
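Sub-sampled data generation can be sketched as follows. This is a minimal illustration: the trigonometric force below, $F(q) = -(D_0 + D_1 \sin q + D_2 \cos q)$, is our own stand-in for the drift of Model Problem III (defined in Section 2), and the function name is hypothetical.

```python
import numpy as np

def subsampled_em(force, gamma, sigma, dt_obs, n_obs, sub=100,
                  q0=0.0, p0=0.0, seed=0):
    """Integrate dQ = P dt, dP = (-gamma P + F(Q)) dt + sigma dB with
    Euler-Maruyama at the fine step dt_obs/sub, keeping every sub-th
    value of Q as an observation."""
    rng = np.random.default_rng(seed)
    h = dt_obs / sub
    q, p = q0, p0
    Q = np.empty(n_obs + 1)
    Q[0] = q
    for i in range(n_obs):
        for _ in range(sub):
            q, p = (q + h * p,
                    p + h * (-gamma * p + force(q))
                      + sigma * np.sqrt(h) * rng.standard_normal())
        Q[i + 1] = q
    return Q

D0, D1, D2, gamma, sigma = 1.0, -8.0, 8.0, 0.5, 0.7
# hypothetical trigonometric drift for illustration only
force = lambda q: -(D0 + D1 * np.sin(q) + D2 * np.cos(q))
Q = subsampled_em(force, gamma, sigma, dt_obs=1/16, n_obs=2000, sub=100)
```

Only the coarse-grid values of $Q$ are retained, mimicking partial observation of the smooth component at discrete times.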

The trigonometric potential resulting from this choice of drift parameters is depicted on the left of Figure 5, and a typical sample path is given on the right side of Figure 5. It should be noted that all sample paths are started at the same initial condition.

The performance of the Gibbs sampler for the sample $Q$-path given in Figure 5 is shown in Figure 7, where $N_{\mathrm{Gibbs}}$ steps sampling from the posterior distribution of drift and diffusion parameters are shown for the setup described above, except that two of the parameter values are modified. Mean posterior estimators are computed by averaging over the latter half of the iterations, as before. This sampling is repeated up to 64000 times, and we label the repeated-sampling averages of these mean posterior estimators accordingly. We then compute their deviations from the true values and plot these deviations versus $\Delta t$ in a doubly logarithmic plot given in Figure 6.

We seek to fit a straight line to these data in a doubly logarithmic plot to ascertain the order of convergence. Since a standard least squares fit proves inadequate, we employ the following procedure:

Given averaged numerically observed parameter estimates $y_i$ and their numerically observed Monte Carlo standard deviations $\alpha_i$, obtained at timesteps $\Delta t_i$, we fit $b$ and $c$ in the following linear regression:

$$\alpha_i \xi_i = y_i - b - c\,\Delta t_i.$$
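Under the normal-error assumption below, the maximum likelihood fit of $b$ and $c$ reduces to least squares weighted by $1/\alpha_i$. A minimal sketch (the function name and the synthetic check are our own):

```python
import numpy as np

def extrapolate_to_zero(dt, y, alpha):
    """ML fit of y_i = b + c*dt_i + alpha_i*xi_i with known noise levels
    alpha_i and xi_i ~ N(0,1): weighted least squares with weights
    1/alpha_i. Returns (b, c); b is the predicted value at dt = 0."""
    dt, y, alpha = map(np.asarray, (dt, y, alpha))
    X = np.column_stack([np.ones_like(dt), dt])  # design matrix [1, dt]
    w = 1.0 / alpha
    # scale rows by the weights, then solve the ordinary LS problem
    coef, *_ = np.linalg.lstsq(X * w[:, None], y * w, rcond=None)
    return coef

# Synthetic check on noiseless data: the fit recovers b and c exactly
dt = np.array([1/2, 1/4, 1/8, 1/16])
y = -8.0 + 3.0 * dt
alpha = np.array([0.1, 0.2, 0.1, 0.3])
b, c = extrapolate_to_zero(dt, y, alpha)
```

The intercept $b$ is the extrapolated parameter value at $\Delta t = 0$, which is the quantity of interest for assessing asymptotic consistency.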

Assuming that the errors $\xi_i$ are normally distributed (which is empirically found to be the case), a maximum likelihood fit for the parameters $b$ and $c$ can be performed, and yields the asymptotic (for $\Delta t \to 0$) drift parameter values reported in Figure 6. Note that this fit constrains the slope of the fitted line in the doubly logarithmic plot to one. This is to minimise the number of parameters fitted and to improve the accuracy of the extrapolated value $b$, which is the predicted parameter value at $\Delta t = 0$. It can be observed in Figure 6 that this leads to good agreement with the observed average parameter values.