
# √n-consistent parameter estimation for systems of ordinary differential equations: bypassing numerical integration via smoothing

Department of Mathematics, Vrije Universiteit Amsterdam, De Boelelaan 1081, 1081 HV Amsterdam, The Netherlands. Korteweg-de Vries Institute for Mathematics, Universiteit van Amsterdam, P.O. Box 94248, 1090 GE Amsterdam, The Netherlands.
Received July 2010; revised February 2011.
###### Abstract

We consider the problem of parameter estimation for a system of ordinary differential equations from noisy observations on a solution of the system. In case the system is nonlinear, as it typically is in practical applications, an analytic solution to it usually does not exist. Consequently, straightforward estimation methods like the ordinary least squares method depend on repetitive use of numerical integration in order to determine the solution of the system for each of the parameter values considered, and to find subsequently the parameter estimate that minimises the objective function. This imposes a huge computational burden on such estimation methods. We study the consistency of an alternative estimator that is defined as a minimiser of an appropriate distance between a nonparametrically estimated derivative of the solution and the right-hand side of the system applied to a nonparametrically estimated solution. This smooth and match estimator (SME) bypasses numerical integration altogether and reduces the amount of computational time drastically compared to ordinary least squares. Moreover, we show that under suitable regularity conditions this smooth and match estimation procedure leads to a √n-consistent estimator of the parameter of interest.


Bernoulli 18(3), 2012, 1061–1098. DOI: 10.3150/11-BEJ362.

Running title: Parameter estimation for ODEs

Shota Gugushvili (e-mail: s.gugushvili@vu.nl) and Chris A.J. Klaassen (e-mail: c.a.j.klaassen@uva.nl)

Keywords: M-estimator; √n-consistency; nonparametric regression; ODE system; Priestley–Chao estimator

## 1 Brief introduction

Many dynamical systems in science and applications are modeled by a d-dimensional system of ordinary differential equations

$$x'(t) = F(x(t), \theta), \qquad x(0) = \xi, \qquad t \in [0, 1], \tag{1}$$

where θ is the unknown parameter of interest and ξ is the initial condition. With x_θ the solution vector corresponding to the parameter value θ, we observe

$$Y_{ij} = x_{\theta j}(t_i) + \epsilon_{ij}, \qquad i = 1, \ldots, n,\ j = 1, \ldots, d,$$

where the observation times t_i are known and the random variables ε_ij have mean 0 and model measurement errors combined with latent random deviations from the idealised model (1). Under regularity conditions, the ordinary least squares estimator

$$\tilde{\theta}_n = \operatorname*{argmin}_{\eta} \sum_{i=1}^{n} \sum_{j=1}^{d} \bigl(Y_{ij} - x_{\eta j}(t_i)\bigr)^2 \tag{2}$$

of θ is √n-consistent, at least theoretically. For systems (1) that do not have explicit solutions, one typically uses iterative procedures to approximate this ordinary least squares estimator. However, since every iteration in such a procedure involves numerical integration of the system (1) and since the number of iterations is typically very large, in practice it is often extremely difficult if not impossible to compute (2), cf. page 172 in voit (). Here, we present a feasible and computationally much faster method to estimate the parameter θ. To define our estimator of θ, we first construct kernel estimators

$$\hat{x}_j(t) = \sum_{i=1}^{n} (t_i - t_{i-1}) \frac{1}{b} K\Bigl(\frac{t - t_i}{b}\Bigr) Y_{ij}$$

of the x_{θj} with K a kernel function and b a bandwidth. Now, the estimator of θ is defined as

$$\hat{\theta}_n = \operatorname*{argmin}_{\eta} \int_0^1 \bigl\| \hat{x}'(t) - F(\hat{x}(t), \eta) \bigr\|^2 w(t)\,\mathrm{d}t, \tag{3}$$

where ‖·‖ denotes the usual Euclidean norm and w is a weight function. Related approaches have been suggested in the computational biology and numerical analysis literature, see, for example, bellman (), savageau () and varah ().
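As an illustration, the two-step procedure in (3) can be sketched numerically. The following Python fragment is a minimal sketch, not the authors' implementation: it assumes a hypothetical scalar system x'(t) = θx(t) with x(0) = 1, equidistant observations, a triweight kernel, and a polynomial weight function vanishing at the endpoints. Since the right-hand side is linear in the parameter, the matching step here even has a closed-form minimiser.

```python
import numpy as np

rng = np.random.default_rng(1)

# hypothetical toy system: x'(t) = theta * x(t), x(0) = 1, solution exp(theta * t)
theta, sigma, n, b, delta = 0.5, 0.02, 1000, 0.08, 0.2
t_obs = np.arange(1, n + 1) / n                      # equidistant design t_i = i/n
y = np.exp(theta * t_obs) + sigma * rng.standard_normal(n)

# triweight kernel (twice continuously differentiable, support [-1, 1]) and its derivative
K = lambda u: np.where(np.abs(u) <= 1, (35 / 32) * (1 - u**2) ** 3, 0.0)
dK = lambda u: np.where(np.abs(u) <= 1, -(105 / 16) * u * (1 - u**2) ** 2, 0.0)

# smoothing step: Priestley-Chao estimates of x and x' on a grid in [delta, 1 - delta]
t = np.linspace(delta, 1 - delta, 601)
U = (t[:, None] - t_obs[None, :]) / b
x_hat = K(U) @ y / (n * b)
dx_hat = dK(U) @ y / (n * b**2)

# matching step: minimise int (x_hat' - eta * x_hat)^2 w dt over eta;
# linearity in eta gives the weighted-least-squares solution in closed form
w = ((t - delta) * (1 - delta - t)) ** 2             # vanishes at delta and 1 - delta
h = t[1] - t[0]                                      # Riemann-sum step
theta_hat = np.sum(dx_hat * x_hat * w) * h / (np.sum(x_hat**2 * w) * h)
```

No numerical integration of the system appears anywhere: only two kernel smooths and one ratio of integrals.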

The main result of this paper is that this smooth and match estimator is √n-consistent under mild regularity conditions. So, asymptotically the SME is comparable to the ordinary least squares estimator in statistical performance, but it avoids the computationally costly repeated numerical integration of (1).

## 2 Introduction

Let us introduce the contents of this paper in more detail. Systems of ordinary differential equations play a fundamental role in many branches of the natural sciences, for example, mathematical biology, see keshet (), biochemistry, see voit (), or the theory of chemical reaction networks in general, see, for instance, feinberg () and sontag (). Such systems usually depend on parameters, which in practice are often only approximately known, or are plainly unknown. Knowledge of these parameters is critical for the study of the dynamical system or process that the system of ordinary differential equations describes. Since these parameters usually cannot be measured directly, they have to be inferred from, as a rule, noisy measurements of various quantities associated with the process under study. More formally, in this paper we consider the following setting: let, as in (1),

$$x'(t) = F(x(t), \theta), \qquad x(0) = \xi, \qquad t \in [0, 1], \tag{4}$$

be a system of autonomous differential equations depending on a vector of real-valued parameters. Here x(t) is a d-dimensional state variable, θ denotes a p-dimensional parameter, while the column d-vector ξ defines the initial condition. Whether the latter is known or unknown is not relevant in the present context, as long as it stays fixed. Denote a solution to (4) corresponding to parameter value θ by x_θ. Suppose that at known time instances t_1, …, t_n noisy observations

$$Y_{ij} = x_{\theta j}(t_i) + \epsilon_{ij}, \qquad i = 1, \ldots, n,\ j = 1, \ldots, d, \tag{5}$$

on the solution x_θ are available. The random variables ε_ij model measurement errors, but they might also contain latent random deviations from the idealised model (1). Such random deviations are often seen in real-world applications. Based on these observations, the goal is to infer the value of the parameter of interest θ.

The standard approach to estimation of θ is based on the least squares method (credited to Gauß and Legendre, see stigler ()), see, for example, hemker () and stortelder (). The least squares estimator is defined as a minimiser of the sum of squares, that is,

$$\tilde{\theta}_n = \operatorname*{argmin}_{\eta} R_n(\eta) = \operatorname*{argmin}_{\eta} \sum_{i=1}^{n} \sum_{j=1}^{d} \bigl(Y_{ij} - x_{\eta j}(t_i)\bigr)^2.$$

If the measurement errors are Gaussian, then ~θ_n coincides with the maximum likelihood estimator and is asymptotically efficient. Since the differential equations setting is covered by the general theory of nonlinear least squares, theoretical results available for the latter apply in the differential equations setting as well; we refer, for example, to jennrich () and wu (), or more generally to geer1 (), geer2 () and pollard () for a thorough treatment of the asymptotics of the nonlinear least squares estimator. A paper that explicitly deals with the ordinary differential equations setting is xue (). Despite its appealing theoretical properties, in practice the performance of the least squares method can degrade dramatically if (4) is a nonlinear high-dimensional system and if θ is high-dimensional. In such a case, we have to face a nonlinear optimisation problem (quite often with many local minima) and search for a global minimum of the least squares criterion function R_n in a high-dimensional parameter space. The search process is most often done via gradient-based methods, for example, the Levenberg–Marquardt method, see levenberg (), or via random search algorithms, see Section 4.5.2 in voit () for a literature overview. Since nonlinear systems in general do not have solutions in closed form, use of numerical integration within a gradient-based search method and the serious computational time associated with it seem to be inevitable. For instance, in a relatively simple example of a four-dimensional system considered in Appendix 1 of almeida (), repetitive numerical integration of the system takes up to 95% of the total computational time required to compute the least squares estimator via a gradient-based optimisation method.
Likewise, random search algorithms are also computationally very costly and, in general, computational time will be a problem for any optimisation algorithm that relies on numerical integration of any relatively realistic nonlinear system of ordinary differential equations, cf. page 172 in voit (). One example is furnished by kikuchi (), where a system that consists of five differential equations, contains sixty parameters and describes a simple gene regulatory network from hlavacek () is considered. The optimisation algorithm (a genetic algorithm) was run for seven loops, each lasting for about ten hours, on the AIST CBRC Magi Cluster with 1040 CPUs (Pentium III 933 MHz); see http://www.cbrc.jp/magi for the cluster specifications. This amounted to a total of ca. 70 000 CPU hours. The authors also remarked that a gradient-based search algorithm would not be feasible in their setting at all. The problems become aggravated for systems of ordinary differential equations that exhibit stiff behaviour, that is, systems that contain both ‘slow’ and ‘fast’ variables and that are difficult to integrate via explicit numerical integration schemes; see, for example, hairer () for a comprehensive treatment of methods for numerically solving stiff systems. Even if a system is not stiff for the true parameter value, during the numerical optimisation procedure one might pass the vicinity of parameter values for which the system is stiff, which will necessarily slow down the optimisation process.
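To make the computational point concrete, here is a small Python sketch of ours (not taken from any of the cited works): a crude grid-search least squares fit of a hypothetical logistic system x' = ηx(1 − x), in which every single candidate value of η triggers a full Runge–Kutta solve of the system. A gradient-based or random search method would incur the same per-candidate cost.

```python
import numpy as np

n_solves = [0]  # counts how many times the ODE system is integrated numerically

def rk4(f, x0, ts):
    # classical fourth-order Runge-Kutta for the autonomous equation x' = f(x)
    n_solves[0] += 1
    xs = [x0]
    for t0, t1 in zip(ts[:-1], ts[1:]):
        h, x = t1 - t0, xs[-1]
        k1 = f(x); k2 = f(x + h * k1 / 2); k3 = f(x + h * k2 / 2); k4 = f(x + h * k3)
        xs.append(x + h * (k1 + 2 * k2 + 2 * k3 + k4) / 6)
    return np.array(xs)

rng = np.random.default_rng(0)
ts = np.linspace(0.0, 1.0, 101)
theta = 2.0                                    # true parameter of the toy model
y = rk4(lambda x: theta * x * (1 - x), 0.1, ts) + 0.01 * rng.standard_normal(ts.size)

# naive least squares over a parameter grid: each candidate needs a fresh solve
candidates = np.linspace(1.0, 3.0, 81)
rss = [np.sum((y - rk4(lambda x, e=e: e * x * (1 - x), 0.1, ts)) ** 2)
       for e in candidates]
theta_ls = candidates[np.argmin(rss)]
```

Even this toy example performs 82 numerical solves (one to generate the data and one per candidate); realistic systems with many parameters multiply this cost enormously.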

The Bayesian approach to estimation of θ, see, for example, gelman () and girolami (), encounters similarly huge computational problems. In the Bayesian approach, one puts a prior on the parameter θ and then obtains the posterior via Bayes’ formula. The posterior contains all the information required in the Bayesian paradigm and can be used to compute, for example, point estimates of θ or Bayesian credible intervals. If θ is high-dimensional, the posterior will typically not be manageable by numerical integration and one will have to resort to Markov chain Monte Carlo (MCMC) methods. However, sampling from the posterior distribution of θ via MCMC necessitates at each step numerical integration of the system (4), in case the latter does not have a closed form solution. Computational time might thus become a problem in this case as well. Also, since in general the likelihood surface will have a complex shape with many local optima, ripples and ridges, see, for example, girolami () for an example, serious convergence problems might arise for MCMC samplers.

Yet another point is that in practice both the least squares method and the Bayesian approach require good initial guesses of the parameter values. If these are not available, then both approaches might have problems with convergence to the true parameter value within a reasonable amount of time.

Over the years a number of improvements upon the classical methods to compute the least squares estimator have been proposed in the literature. In particular, the multiple shooting method of bock () and the interior-point or barrier method for large-scale nonlinear programming as in biegler () have proved to be quite successful. These two approaches tend to be much more stable than classical gradient-based methods, have a better chance to converge even from poor initial guesses of the parameters, and in general require far fewer iterations until convergence is achieved. However, they still require a nontrivial amount of computational power.

A general overview of the typical difficulties in parameter estimation for systems of ordinary differential equations is given in ramsay (), to which we refer for more details. For a recent overview of typical approaches to parameter estimation for systems of ordinary differential equations in biochemistry and associated challenges see, for example, chou ().

To evade the difficulties associated with the least squares method, or more precisely with the numerical integration that it usually requires, a two-step method was proposed in bellman () and varah (). In the first step, the solution x_θ of (4) is estimated by considering estimation of the individual components x_{θj} as nonparametric regression problems and by using the regression spline method for estimation of these components. The derivatives x'_{θj} are also estimated from the data by differentiating the estimators of the x_{θj} with respect to time t. Thus, no numerical integration of the system (4) is needed. In the second step, the obtained estimate of x_θ and its derivative are plugged into (4), and an estimator of θ is defined as a minimiser in η of an appropriate distance between the estimated left- and right-hand sides of (4), as, for example, in (3). Since this estimator of θ results from a minimisation procedure, it is an M-estimator, see, for example, the classical monograph huber (), or Chapter 7 of bkrw (), Chapter 5 of vdvaart (), and Chapter 3.2 of wellner () for a more modern exposition of the theory of M-estimators. For an approach to estimation of θ related to bellman () and varah (), see also savageau (), as well as almeida (), where a practical implementation based on neural networks is studied. The intuitive idea behind the use of this two-step estimator is clear: among all functions defined on [0, 1], any reasonably defined distance between the left- and right-hand sides of (4) is minimal (namely, it is zero) for the solution x_θ of (4) and the true parameter value θ. For estimates close enough, in an appropriate sense, to the solution x_θ, the minimisation procedure will produce a minimiser close to the true parameter value, provided certain identifiability and continuity conditions hold. This intuitive idea was exploited in brunel (), where a more general setting than the one in bellman () and varah () was considered. Another paper in the same spirit as bellman () and varah () is liang ().

This two-step approach will typically lead to considerable savings in computational time: unlike the straightforward least squares estimator, in its first step it just requires finding nonparametric estimates of x_θ and x'_θ, for which fast and numerically reliable recipes are available, whereas the gradient-based least squares method will still rely on successive numerical integrations of (4) for different parameter values in order to find a global minimiser of the least squares criterion function. We refer to almeida () for a particular example demonstrating the gains in computational time achieved by the two-step estimator in comparison to the ordinary least squares estimator. When the right-hand side F of (4) is linear in η, further simplifications occur in the second step of the two-step estimation procedure, as one will then essentially only have to face a weighted linear regression problem. This is unlike the least squares approach, which cannot exploit linearity of F in η. However, we would also like to stress the fact that the two-step estimator does not necessarily have to be considered a competitor of either the least squares or the Bayesian approach. Indeed, since in practice both of these approaches require good initial guesses for the parameter values, these can be supplied by the two-step estimator. In this sense, the proposed two-step estimation approach can be thought of as complementing both the least squares and the Bayesian approaches. Moreover, an additional modified Newton–Raphson step suffices to arrive at an estimator that is asymptotically equivalent to the exact ordinary least squares estimator, as will be shown elsewhere.
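The weighted linear regression remark can be made explicit. Suppose, purely for illustration, that the right-hand side factorises as F(x, η) = G(x)η for some matrix-valued function G (our notation, not used elsewhere in this paper). Then the criterion in (3) is quadratic in η, and the matching step admits the closed-form minimiser

```latex
\hat{\theta}_n
  = \Bigl( \int_0^1 G(\hat{x}(t))^{\mathsf{T}} G(\hat{x}(t))\, w(t)\, \mathrm{d}t \Bigr)^{-1}
    \int_0^1 G(\hat{x}(t))^{\mathsf{T}} \hat{x}'(t)\, w(t)\, \mathrm{d}t,
```

provided the matrix in brackets is nonsingular, so that no iterative optimisation is needed at all in this case.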

A certain limitation of the two-step approach is that it requires that measurements on all state variables x_{θj}, j = 1, …, d, are available. The latter is not always the case in practical applications. In some cases, the unobserved variables can be eliminated by transforming the first order system into a higher order one and next applying a generalisation of the smooth and match method to this higher order system. Under appropriate conditions, this approach should yield a consistent estimator. Although one can always formally perform the least squares procedure, without further assumptions on the system it is far from clear that it leads to a consistent estimator of the parameter of interest.

Our goal in the present work is to undertake a rigorous study of the asymptotics of a two-step estimator of θ. Our exposition is to some degree similar to that in brunel (), but one of the differences is that instead of regression spline estimators we use kernel-type estimators for estimation of x_θ and x'_θ. (The proofs of the main results in brunel () are incomplete and the main theorems require further conditions in order to hold. The conditions are also different.) We hope that our contribution will motivate further research into the interesting topic of parameter estimation for systems of ordinary differential equations.

There exists an alternative to the approaches described here, which also employs nonparametric smoothing, see ramsay (). For information on its asymptotic properties, we refer to qi (). For nonlinear systems, this approach will typically reduce to one of the realisations of the ordinary least squares method, for example, the Newton–Raphson algorithm, where, however, numerical integration of (4) is replaced by approximation of the solution of the system (4) by an appropriately chosen element of some finite-dimensional function space. This seems to reduce the computational load considerably in comparison to gradient-based optimisation methods that employ numerical integration of (4). However, it still appears to be computationally more intense than the two-step approach advocated in the present work.

We conclude the discussion in this section by noting that when modelling various processes, some authors prefer not to specify the right-hand side of (4) explicitly (the latter amounts to explicit specification of F in (4)), but simply assume that the right-hand side of (4) is some unknown function of x, that is, the system is given by x'(t) = F(x(t)) with F unknown, and proceed to its estimation via nonparametric methods, see, for example, ellner (). This has the advantage of safeguarding against possible model misspecification. However, the question whether one has or has not to specify F explicitly appears to us to be more of a philosophical nature and boils down to a discussion on the use of parametric or nonparametric models, that is, whether one has strong enough reasons to believe that the process under study can be described by a model as in (4) with F known or not. We do not address this question here, because an answer to it obviously depends on the process under study and varies from case to case. For a related discussion, see hooker ().

The rest of the paper is organised as follows: in the next section, we will detail the approach that we use and present its theoretical properties. In particular, we will show that under appropriate conditions our two-step approach leads to a consistent estimator with a √n convergence rate, which is the best possible rate in regular parametric models. (It is claimed in liang () that the two-step estimation procedure suggested there leads to a rate faster than √n, which is impossible. Indeed, Theorem 2 of liang () and its proof are incorrect.) Section 4 contains a discussion of the obtained results together with simulation examples. The proofs of the main results are relegated to Section 5, while the Appendix contains some auxiliary statements.

## 3 Results

First of all, we point out that in the present study we will be concerned with the asymptotic behaviour of an appropriate two-step estimator of θ under a suitable sampling scheme. We will primarily be interested in an intuitive understanding of the behaviour of a relatively simple estimator of θ, as well as in a clear presentation of the obtained results and the proofs. Consequently, the stated conditions will not always be minimal and can typically be relaxed at appropriate places.

We first define the sampling scheme.

{cnd}

The observation times t_i are deterministic and known, and there exists a constant c_0 > 0 such that for all n

$$\max_{2 \le i \le n} |t_i - t_{i-1}| \le \frac{c_0}{n}$$

holds. Furthermore, there exists a constant c_1 > 0 such that for any interval A ⊂ [0, 1] of length |A| and all n the inequality

$$\frac{1}{n} \sum_{i=1}^{n} 1_{[t_i \in A]} \le c_1 \max\Bigl(|A|, \frac{1}{n}\Bigr)$$

holds.

Hence, we observe the solution of the system (4) on the interval [0, 1]. Instead of [0, 1] we could have taken any other bounded interval. Conditions on t_1, …, t_n as in Condition 3 are typical in nonparametric regression, see, for example, gasser () and Section 1.7 in tsybakov (), and they imply that t_1, …, t_n are distributed over [0, 1] in a sufficiently uniform manner. The most important example in which Condition 3 is satisfied is when the observations are spaced equidistantly over [0, 1], that is, when t_i = i/n for i = 1, …, n. In this case, one may take c_0 = 1 and c_1 = 2. Notice that we do not necessarily assume that the initial condition ξ is measured or is known. If it is, then it is incorporated into the observations and is used in the first step of the two-step estimation procedure.
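For the equidistant design t_i = i/n the two inequalities can be checked directly; the following sketch (with the constants c_0 = 1 and c_1 = 2, our choice, not prescribed by the paper) verifies them numerically on randomly drawn subintervals of [0, 1].

```python
import numpy as np

n = 500
t = np.arange(1, n + 1) / n                      # equidistant design t_i = i/n

# first inequality: the maximal spacing is exactly 1/n, so c0 = 1 works
spacing_ok = np.max(np.diff(t)) <= 1.0 / n + 1e-12

# second inequality with c1 = 2: an interval A of length |A| contains at most
# n|A| + 1 design points, and |A| + 1/n <= 2 max(|A|, 1/n)
rng = np.random.default_rng(0)
interval_ok = True
for _ in range(1000):
    a = rng.uniform(0.0, 1.0)
    length = rng.uniform(0.0, 1.0 - a)
    frac = np.sum((t >= a) & (t <= a + length)) / n
    interval_ok = interval_ok and frac <= 2.0 * max(length, 1.0 / n)
```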

{cnd}

The random variables ε_ij from (5) are independent and normally distributed with mean zero and finite variance σ² > 0.

This assumption of Gaussianity of the ε_ij’s may be dropped in various ways, as we will see below; see the note after Proposition 3.1 and Appendix .2.

We next state a condition on the parameter set.

{cnd}

The parameter set Θ is a compact subset of ℝ^p.

Compactness of Θ allows one to put relatively weak conditions on the structure of the system (4), that is, on the function F.

Just as the least squares method, see, for example, jennrich (), our smooth and match approach also requires some regularity of the solutions of (4). In what follows, a derivative of any function with respect to the variable will be denoted by For the second derivative of with respect to , we will use the notation with a similar convention for mixed derivatives. An integral of a vector- or matrix-valued function will be understood componentwise.

{cnd}

The following conditions hold: {longlist}[(iii)]

the mapping F from (4) is such that its second derivatives are continuous;

for all parameter values, the solution of (4) is defined on the interval [0, 1];

for all parameter values, the solution of (4) is unique on [0, 1];

for all parameter values, the solution of (4) is a function of class C^α on the interval [0, 1] for some positive integer α.

Observe that Condition 3(i) implies existence and uniqueness of the solution of (4) in some neighbourhood of the initial time point. However, we want the existence and uniqueness to hold on the whole interval [0, 1] and therefore a priori require (ii) and (iii). Furthermore, the smoothness in (iv) is required when establishing appropriate asymptotic properties of nonparametric estimators of the solution x_θ and its derivative x'_θ, and is needed in Propositions 3.3 and 3.4, and in Theorem 3, respectively. Notice that for every parameter value the solution is smooth enough in t in a neighbourhood of the initial time point, provided the function F is smooth enough in its first argument. However, we want this to hold on the whole interval [0, 1] and therefore require (iv). Since in the theory of chemical reaction networks, see, for instance, sontag (), the components of F are usually polynomial or rational functions of x_θ, the solution of (4) will be smooth enough in many examples, and (iv) is satisfied in a large number of practical cases. For the above-mentioned facts from the theory of ordinary differential equations, see, for example, Chapter 2 in arnold (). Also notice that the condition on F in liang (), see Assumption C on page 1573, puts severe restrictions on F and excludes, for example, quadratic nonlinearities of F in x_θ. This, of course, has to be avoided.

Recall that our observations are Y_ij = x_{θj}(t_i) + ε_ij for i = 1, …, n and j = 1, …, d. We propose the following nonparametric estimator for x_{θj}:

$$\hat{x}_j(t) = \sum_{i=1}^{n} (t_i - t_{i-1}) \frac{1}{b} K\Bigl(\frac{t - t_i}{b}\Bigr) Y_{ij}, \tag{6}$$

where K is a kernel function, while the number b > 0 denotes a bandwidth that we take to depend on the sample size n in such a way that b → 0 as n → ∞. In line with a traditional convention in kernel estimation theory, we will suppress the dependence of b on n in our notation, since no confusion will arise. When the t_i’s are equispaced, the estimator (6) can in essence be obtained by modifying the Nadaraya–Watson regression estimator, cf. page 34 in tsybakov (). It is usually called the Priestley–Chao estimator after the authors who first proposed it in priestley (). As far as an estimator of x'_{θj} is concerned, we define it as the derivative of ^x_j with respect to t, choosing K as a differentiable function. Notice that the bandwidth b plays the role of a regularisation parameter: too small a bandwidth results in an estimator with small bias, but large variance, while too large a bandwidth results in an estimator with small variance, but large bias, see, for example, pages 7–8 and 32 in tsybakov () for a relevant discussion. In principle one could use different bandwidth sequences for estimation of x_{θj} for different j’s, but as can be seen from the proofs in Section 5, asymptotically this will not make a difference for the estimator of θ. A similar remark applies to the use of different bandwidths for estimation of x_θ and its derivative x'_θ. Arguably, the estimator (6) is simple, and there exist other estimators that may outperform it in certain respects in practice. However, as we will show later on, even such a simple estimator leads to a √n-consistent estimator of θ.
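The bias–variance role of the bandwidth is easy to visualise numerically. In the following sketch (a hypothetical regression function μ(t) = sin(2πt), chosen by us and not tied to any system of differential equations), a very small bandwidth undersmooths, a very large one oversmooths, and an intermediate one does best in supremum error over an interior interval.

```python
import numpy as np

rng = np.random.default_rng(2)
n, sigma, delta = 300, 0.3, 0.2
t_obs = np.arange(1, n + 1) / n
mu = lambda t: np.sin(2 * np.pi * t)
y = mu(t_obs) + sigma * rng.standard_normal(n)

# triweight kernel, support [-1, 1]
K = lambda u: np.where(np.abs(u) <= 1, (35 / 32) * (1 - u**2) ** 3, 0.0)
t = np.linspace(delta, 1 - delta, 301)

def sup_error(b):
    # Priestley-Chao estimate with bandwidth b; sup error on [delta, 1 - delta]
    x_hat = K((t[:, None] - t_obs[None, :]) / b) @ y / (n * b)
    return np.max(np.abs(x_hat - mu(t)))

# undersmoothed, moderate, and oversmoothed fits
err_small, err_mid, err_large = sup_error(0.01), sup_error(0.1), sup_error(0.45)
```

With this seed the intermediate bandwidth yields a clearly smaller supremum error than either extreme, illustrating the regularisation trade-off described above.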

Theoretical properties of the Priestley–Chao estimator were studied in benedetti (), priestley (), schuster (). However, the first two papers do not cover its convergence in the (supremum) norm, while the third one does not do it in the form required in the present work. Since this is needed in the sequel, we will supply the required statement, see Proposition 3.1 below.

To put things in a somewhat more general context than the one in our differential equations setting, consider the following regression model:

$$Y_i = \mu(t_i) + \epsilon_i, \qquad i = 1, \ldots, n, \qquad t_1, \ldots, t_n \text{ satisfy Condition 3}, \tag{7}$$
$$\epsilon_1, \ldots, \epsilon_n \text{ are i.i.d. Gaussian with } \mathrm{E}[\epsilon_i] = 0 \text{ and } \operatorname{Var}[\epsilon_i] = \sigma^2 > 0.$$

Our goal is to estimate the regression function μ and its derivative μ'. The estimator of μ will be given by an expression similar to (6), namely

$$\hat{\mu}_n(t) = \sum_{i=1}^{n} (t_i - t_{i-1}) \frac{1}{b} K\Bigl(\frac{t - t_i}{b}\Bigr) Y_i, \tag{8}$$

while the estimator of μ' will be given by ^μ'_n. We postulate the following condition on the kernel K for some strictly positive integer α. {cnd} The kernel K is symmetric and twice continuously differentiable, it has support within [−1, 1], and it satisfies the integrability conditions ∫ K(u) du = 1 and ∫ u^j K(u) du = 0 for j = 1, …, α − 1. If α = 1, only the first of the two integrability conditions is required.
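As a concrete example (our choice, not prescribed by the paper), the triweight kernel K(u) = (35/32)(1 − u²)³ on [−1, 1] is symmetric, twice continuously differentiable and compactly supported; it integrates to one, its first moment vanishes by symmetry, and its second moment equals 1/9 ≠ 0, so it is a kernel of order α = 2 in the above sense. A quick numerical check:

```python
import numpy as np

# triweight kernel on [-1, 1]
K = lambda u: np.where(np.abs(u) <= 1, (35 / 32) * (1 - u**2) ** 3, 0.0)

u = np.linspace(-1.0, 1.0, 200001)
du = u[1] - u[0]

m0 = np.sum(K(u)) * du           # should equal 1 (normalisation)
m1 = np.sum(u * K(u)) * du       # should equal 0 (symmetry)
m2 = np.sum(u**2 * K(u)) * du    # equals 1/9, nonzero, hence order alpha = 2
```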

The following proposition holds.

###### Proposition 3.1

Suppose the regression model (7) is given and Condition 3 holds. Fix a number δ such that 0 < δ < 1/2. {longlist}[(ii)]

(i) If μ is α times continuously differentiable and b → 0 as n → ∞, then

$$\sup_{t \in [\delta, 1-\delta]} |\hat{\mu}_n(t) - \mu(t)| = O_P\Bigl(b^{\alpha} + \frac{1}{nb^2} + \sqrt{\frac{\log n}{nb}}\Bigr). \tag{9}$$

(ii) If μ is α times continuously differentiable and b → 0 as n → ∞, then

$$\sup_{t \in [\delta, 1-\delta]} |\hat{\mu}'_n(t) - \mu'(t)| = O_P\Bigl(b^{\alpha-1} + \frac{1}{nb^3} + \sqrt{\frac{\log n}{nb^3}}\Bigr) \tag{10}$$

is valid. In particular, ^μ_n and ^μ'_n are consistent on [δ, 1 − δ], if additionally nb³/log n → ∞ holds.

Gaussianity of the ε_i’s allows one to prove (9) and (10) by relatively elementary means. This assumption can be modified in various ways, for instance by assuming that the ε_i’s are bounded, and we state and prove the corresponding modification of Proposition 3.1 in Appendix .2, see Proposition .1. In general, normality of the measurement errors is a standard assumption in parameter estimation for systems of ordinary differential equations, see, for example, girolami (), hemker () and ramsay ().

The following corollary is immediate from Proposition 3.1.

###### Corollary 3.1

Let δ be the same as in Condition 3. Under the conditions above, we have for the estimator ^x_j of x_{θj}

$$\sup_{t \in [\delta, 1-\delta]} |\hat{x}_j(t) - x_{\theta j}(t)| = O_P\Bigl(b^{\alpha} + \frac{1}{nb^2} + \sqrt{\frac{\log n}{nb}}\Bigr) \tag{11}$$

and

$$\sup_{t \in [\delta, 1-\delta]} |\hat{x}'_j(t) - x'_{\theta j}(t)| = O_P\Bigl(b^{\alpha-1} + \frac{1}{nb^3} + \sqrt{\frac{\log n}{nb^3}}\Bigr), \tag{12}$$

provided b → 0 as n → ∞. In particular, ^x_j and ^x'_j are consistent, if additionally nb³/log n → ∞ holds.

In the proof of Proposition 3.2, we will apply the continuous mapping theorem in order to prove convergence in probability of certain integrals of F and its derivatives with the estimators ^x_j plugged in. This is where Corollary 3.1 is used.

Now that we have consistent (in an appropriate sense) estimators of x_θ and x'_θ from the smoothing step, we can move to the matching step in the construction of our smooth and match estimator of θ. In particular, we define the estimator of θ as

$$\hat{\theta}_n = \operatorname*{argmin}_{\eta \in \Theta} \int_0^1 \bigl\| \hat{x}'(t) - F(\hat{x}(t), \eta) \bigr\|^2 w(t)\,\mathrm{d}t = \operatorname*{argmin}_{\eta \in \Theta} M_{n,w}(\eta),$$

where ‖·‖ denotes the usual Euclidean norm and w is a weight function. We will refer to M_{n,w} as a (random) criterion function. Since Θ is compact and under our conditions M_{n,w} is continuous in η, the minimiser always exists. The fact that ^θ_n is a measurable function of the observations follows from Lemma 2 of jennrich (). Notice that in liang () and varah () the criterion function is given by

$$\sum_{i=1}^{n} \bigl\| \tilde{x}'(t_i) - F(\tilde{x}(t_i), \eta) \bigr\|^2,$$

where ~x and ~x' are appropriate estimators of x_θ and x'_θ. However, in order to obtain a √n-consistent estimator of θ, it is important to use an integral-type criterion: the nonparametric estimators of x_θ and x'_θ have a slower convergence rate than √n, and this is counterbalanced by the integral criterion from (3). Indeed, stationarity at ^θ_n leads to (5). The first factor at the left-hand side of this equality converges to a constant nondegenerate matrix, and the right-hand side behaves like a linear combination of the observations with coefficients of order n⁻¹, thanks to the integration; cf. Proposition 3.4 and its proof. In light of this, the choice of the weight function also appears to be important. Furthermore, the observations from (5) indirectly carry information on the entire curves x_{θj} and not only on the points x_{θj}(t_i). An integral-type criterion allows one to exploit this fact in the second step of this smooth and match procedure.

Introduce the asymptotic criterion

$$M_w(\eta) = \int_0^1 \bigl\| F(x_\theta(t), \theta) - F(x_\theta(t), \eta) \bigr\|^2 w(t)\,\mathrm{d}t$$

corresponding to M_{n,w}. Observe that by Condition 3 it is bounded. Using Corollary 3.1 as a building block, one can show that the SME is consistent. To this end, we will need the following condition on the weight function w.

{cnd}

The weight function w is a nonnegative, continuously differentiable function, is supported on the interval [δ, 1 − δ] for some fixed number δ such that 0 < δ < 1/2, and is such that the Lebesgue measure of the set {t : w(t) > 0} is positive.

The fact that w vanishes at the endpoints of the interval [δ, 1 − δ] and beyond is needed to obtain a √n-consistent estimator of θ. In particular, together with differentiability of w, it is used in order to establish (5). The condition that w is supported on [δ, 1 − δ] takes care of the boundary bias effects characteristic of conventional kernel-type estimators, see, for example, gasser () for more information on this. Boundary effects in kernel estimation are usually remedied by using special boundary kernels, see, for example, vanes (), gasser2 (), goldstein2 (). Using such a kernel, it can be expected that in our case as well the boundary effects will be eliminated and one may relax the support requirement from Condition 3, that is, allow w to be supported on all of [0, 1]. The condition that the weight function is positive on a set of positive Lebesgue measure is important for (14) to hold, and in fact w = 0 a.e. would be a strange choice.

The following proposition is valid.

###### Proposition 3.2

Suppose b → 0 and nb³/log n → ∞ as n → ∞. Under the conditions above and the additional identifiability condition

$$\forall \varepsilon > 0\colon \inf_{\|\eta - \theta\| \ge \varepsilon} M_w(\eta) > M_w(\theta), \tag{14}$$

we have ^θ_n → θ in probability as n → ∞.

The proposition is proved via a reasoning standard in the theory of M-estimation: we show that M_{n,w} converges to M_w and that the convergence is strong enough to imply the convergence of a minimiser of M_{n,w} to a minimiser of M_w, cf. Section 5.2 of vdvaart (). A necessary condition for (14) to hold is that F(x_θ(·), η) and F(x_θ(·), θ) do not coincide on the support of w for η ≠ θ. The latter is a minimal assumption for the statistical identifiability of the parameter θ. The identifiability condition (14) is common in the theory of M-estimation, see Theorem 5.7 of vdvaart (). It means that θ is a point of minimum of M_w and that it is a well-separated point of minimum. The most trivial example with this condition satisfied is the scalar equation x'_θ(t) = θ x_θ(t) with initial condition x(0) = ξ, where ξ ≠ 0. In fact, in this case

$$M_w(\eta) = (\theta - \eta)^2 \xi^2 \int_{\delta}^{1-\delta} e^{2\theta t} w(t)\,\mathrm{d}t,$$

and this is zero for η = θ and strictly positive for η ≠ θ, whence (14) follows. More generally, since Θ is compact and M_w is continuous, uniqueness of the minimiser of M_w will imply (14), cf. Exercise 27 on page 84 of vdvaart ().
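For completeness, the expression for M_w(η) above follows in one line from the definition of M_w: for this scalar system F(x, η) = ηx, so

```latex
F(x_\theta(t), \theta) - F(x_\theta(t), \eta)
  = (\theta - \eta)\, x_\theta(t)
  = (\theta - \eta)\, \xi e^{\theta t},
```

and squaring and integrating against w over its support [δ, 1 − δ] yields the display.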

In practice, (14) might be difficult to check globally, and one might prefer to concentrate on a simpler local condition: if the first order condition $M_w'(\theta)=0$ holds and the Hessian matrix of $M_w$ is strictly positive definite at $\theta$, then (14) will be satisfied for $\eta$ restricted to some neighbourhood of $\theta$, because $M_w$ will have a local minimum at $\theta$, and a neighbourhood around it can be taken to be compact with small enough diameter, so that (14) holds on this neighbourhood. The conclusion of the theorem will then hold for the parameter set restricted to this neighbourhood of $\theta$.

In a statement analogous to Proposition 3.2, brunel () requires that the solutions of (4) belong to a compact set for all $t$ and $\theta$, and that $F$ from (1) is Lipschitz in its first argument on this compact set, uniformly in $\theta$. It is also assumed that the nonparametric estimators belong a.s. to this compact set for all $n$ and $t$. However, the latter typically will not hold for linear smoothers, see Definition 1.7 in tsybakov (), which constitute the most popular choice of nonparametric regression estimators in practice. For instance, local polynomial estimators, see Section 1.6 in tsybakov (), projection estimators, see Section 1.7 in tsybakov (), and the Gasser–Müller estimator, see gasser (), are all examples of linear smoothers. Hence, we prefer to avoid this condition altogether, although this somewhat complicates the proof.

Under the conditions in this section, it turns out that the estimator $\hat\theta_n$ is not merely consistent, but $\sqrt{n}$-consistent for $\theta$ in the sense of (18) below. This result follows in essence from the fact that, up to a higher order term, the difference $\hat\theta_n-\theta$ can be represented through the difference of the images of $\hat x$ and $x_\theta$ under a certain linear mapping; cf. Proposition 3.3. It is known that even though nonparametric curve estimators cannot usually attain the $\sqrt{n}$ convergence rate, see, for example, Chapters 1 and 2 of tsybakov (), extra smoothness, often coming from the structure of linear functionals, allows one in many cases to construct $\sqrt{n}$-consistent estimators of these functionals by plugging in nonparametric estimators; see, for example, bickel () and goldstein () for more information. The variance of such plug-in estimators can often be proven to be of order $n^{-1}$, while the squared bias can be made of smaller order by undersmoothing, that is, by selecting the smoothing parameter smaller than the optimal choice in nonparametric curve estimation when the object of interest is the curve itself; cf. goldstein (). Precisely this happens in our case as well: if the mean integrated squared error is used as a performance criterion of a nonparametric estimator, then under our conditions the optimal bandwidth for estimation of the curve is of a larger order than the optimal bandwidth for estimation of $\theta$, which is in fact smaller; see Theorem 3 below. Note that undersmoothing is a different approach from the one in bickel (), where it is assumed that the nonparametric estimators attain the minimax rate of convergence, and the $\sqrt{n}$-rate for estimation of a functional in concrete examples, if possible, is achieved by different means exploiting extra smoothness coming from the structure of the functional; see, for example, the first example in Section 2 there. In many cases, it can be proved that such plug-in type estimators are efficient, see bickel (). Notice, however, that in our case this will not imply that $\hat\theta_n$ is efficient.

First, we provide an asymptotic representation for the difference $\hat\theta_n-\theta$.

###### Proposition 3.3

Let $\theta$ be an interior point of $\Theta$. Suppose that the conditions of Proposition 3.2 hold and let the matrix $J_\theta$ defined by

 J_\theta=\int_\delta^{1-\delta}\bigl(F_\theta'(x_\theta(t),\theta)\bigr)^{\mathrm{T}}F_\theta'(x_\theta(t),\theta)\,w(t)\,dt (15)

be nonsingular. Suppose furthermore that the bandwidth $b=b_n$ satisfies $b\asymp n^{-\gamma}$ for a suitable exponent $\gamma$. Then

 \hat\theta_n-\theta=O_P\bigl(J_\theta^{-1}(\Gamma(\hat x)-\Gamma(x_\theta))\bigr)+o_P(n^{-1/2}) (16)

is valid, with the mapping $\Gamma$ given by

 \Gamma(z)=\int_\delta^{1-\delta}\Bigl\{-\bigl(F_\theta'(x_\theta(t),\theta)\bigr)^{\mathrm{T}}F_x'(x_\theta(t),\theta)\,w(t)-\frac{d}{dt}\bigl[\bigl(F_\theta'(x_\theta(t),\theta)\bigr)^{\mathrm{T}}w(t)\bigr]\Bigr\}z(t)\,dt. (17)

With the above result in mind, in order to complete the study of the asymptotics of $\hat\theta_n$, it remains to study the mapping $\Gamma$. Clearly, it suffices to study the asymptotic behaviour of differences of the form

 \Delta(\hat\mu_n)-\Delta(\mu)=\int_{\mathbb{R}}v(t)k(t)\hat\mu_n(t)\,dt-\int_{\mathbb{R}}v(t)k(t)\mu(t)\,dt,

where $v$ is a known function that satisfies appropriate assumptions, while $\hat\mu_n$ stands either for the nonparametric estimator of a regression function $\mu$ or for its derivative. The next proposition deals with the asymptotics of this difference.

###### Proposition 3.4

Under the conditions of this section and for any continuous function $v$, it holds in the regression model (3) that

 \Delta(\hat\mu_n)-\Delta(\mu)=O_P(n^{-1/2}),

provided $\mu$ is sufficiently many times differentiable and the bandwidth satisfies $b\asymp n^{-\gamma}$ for a suitable exponent $\gamma$.

Our main result is a simple consequence of Propositions 3.3 and 3.4.

###### Theorem 3

Let $\theta$ be an interior point of $\Theta$. Assume that the conditions of this section, together with (14), hold and that the matrix (15) is nonsingular. If the bandwidth satisfies $b\asymp n^{-\gamma}$ for a suitable exponent $\gamma$, then

 \sqrt{n}(\hat\theta_n-\theta)=O_P(1) (18)

is valid.

Thus any bandwidth sequence satisfying the conditions in Theorem 3 is optimal, in the sense that it leads to an estimator of $\theta$ with the same asymptotic behaviour. In particular, each such bandwidth sequence ensures a convergence rate of $n^{-1/2}$. Consequently, the dependence of the asymptotic properties of the estimator on the bandwidth is less critical than it typically is in nonparametric curve estimation. Notice that the bandwidth condition in Theorem 3 is needed in order to make the conditions in Propositions 3.3 and 3.4 compatible.

## 4 Discussion

The main result of the paper, Theorem 3, is that under certain conditions, parameter estimation for systems of ordinary differential equations at the $\sqrt{n}$ rate is possible without employing numerical integration. Although we have shown this in the case when in the first step of the two-step procedure a particular kernel-type estimator is used, it may be expected that a similar result holds for other nonparametric estimators. For instance, the arguments for the Nadaraya–Watson estimator seem to be similar, with extra technicalities arising, for example, from the fact that it is a ratio of two functions. Furthermore, from formula (5) it can be seen that the proof of Proposition 3.3 requires that the derivative of an estimator of a regression function be used as an estimator of the derivative of that function. Not all popular nonparametric estimators of the derivatives of a regression function are of this type. In practice, for small or moderate sample sizes it might be advantageous to use more sophisticated nonparametric estimators than the Priestley–Chao estimator, but asymptotically this does not make a difference.

Once a $\sqrt{n}$-consistent estimator of $\theta$ is available, one might ask for more, namely whether one can construct an estimator that is asymptotically equivalent to the ordinary least squares estimator (2), or one that is semiparametrically efficient. It is expected that this can be achieved without repeated numerical integration of (1) by using $\hat\theta_n$ as a starting point and performing a one-step Newton–Raphson type procedure; see, for example, Section 7.8 of bkrw () or Chapter 25 of vdvaart (). We intend to address this issue of efficient and ordinary least squares estimation in a separate publication.

Doubtless, the main challenge in implementing the smooth and match estimation procedure lies in selecting the smoothing parameter $b$. This is true for any two-step parameter estimation procedure for ordinary differential equations, for example, the one based on regression splines as in brunel () or on local polynomial estimators as in liang (), and not only for our specific estimator. The observations that we supply below apply in principle to any two-step estimator, not only to the specific kernel-type one considered in the present work; hence, they are of general interest.

Some attention has been paid in the literature to the selection of the smoothing parameter in the context of parameter estimation for ordinary differential equations. The considered options range from subjective choices and smoothing by hand to more advanced possibilities. Perhaps the simplest solution would be to assume that the targets of the estimation procedure are the regression functions themselves and to select $b$ (a different one for every component $j$) via a cross-validation procedure; see, for example, Section 5.3 in wasserman () for a description of cross-validation techniques in the context of nonparametric regression. This should produce reasonable results, at least for relatively large sample sizes, cf. the simulation examples considered in brunel (). However, it is clear from Theorem 3 and its proof that despite its simplicity, such a choice of $b$ will be suboptimal. Another practical approach to bandwidth selection is computation of $\hat\theta_n(b)$ for a range of values of the bandwidth on some discrete grid $B$ and then choosing

 \hat b=\mathop{\mathrm{argmin}}_{b\in B}\sum_{i=1}^{n}\sum_{j=1}^{d}\bigl(Y_{ij}-x_j^{\hat\theta_n(b)}(t_i)\bigr)^2.

This seems a reasonable choice, although the asymptotics of $\hat b$ are unclear. One other possibility for practical bandwidth selection is a variation on the plug-in bandwidth selection method as described, for example, in jones (): one can see from the proof in Section 5 that the terms that depend on the bandwidth are lower order terms in the expansion of $\hat\theta_n-\theta$. One can then minimise with respect to $b$ a bound on these lower order terms. A minimiser, say $b^*$, will depend on the unknown true parameter $\theta$, also via the solution $x_\theta$ and its derivative, as well as on the error variances. However, the solution and its derivative can be re-estimated via $\hat x$ and $\hat x'$ using a different, pilot bandwidth. Of course, instead of $\hat x$ and $\hat x'$, the use of any other nonparametric estimators of a regression function and its derivative, for example, local polynomial estimators, see Section 1.6 of tsybakov (), or the Gasser–Müller estimator, see gasser (), is also a valid option. Error term variances can be estimated via one of the methods described in hall () or Section 5.6 of wasserman (). Once the pilot estimators of the solution and its derivative, together with estimators of the error variances, are available, they can be plugged back into $b^*$, and in this way one obtains a bandwidth that estimates the optimal bandwidth $b^*$. The final step would be computation of $\hat\theta_n$ with this new bandwidth. Unfortunately, this method leads to extremely cumbersome expressions, and furthermore, since we are minimising an upper bound involving numerous remainder terms, it will probably tend to oversmooth, that is, produce a bandwidth larger than required. Moreover, the plug-in approach in general is subject to some controversy, having both supporters and critics; see, for example, loader () and references therein. An alternative to the plug-in approach might be one based on resampling methods: cross-validation, jackknife, or bootstrap. Computationally, these resampling methods will be quite intensive, and theoretical analysis of the properties of such bandwidth selectors is a rather nontrivial task. Also, a thorough simulation study is needed before the practical value of different bandwidth selection methods can be assessed. We do not address these issues here.
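As an illustration of the simplest of these strategies, the following sketch selects a bandwidth for a Priestley–Chao-type estimator by leave-one-out cross-validation on synthetic data. The kernel, grid, data-generating curve, and bandwidth grid are all illustrative assumptions, not the paper's setup:

```python
import math
import random

def kernel(u):
    # Epanechnikov kernel, used here purely for simplicity.
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0

def pc_estimate(t, ts, ys, b, skip=None):
    # Priestley-Chao estimator: sum_i (t_i - t_{i-1}) b^{-1} K((t - t_i)/b) Y_i,
    # optionally leaving out observation `skip` for cross-validation.
    s = 0.0
    for i in range(1, len(ts)):
        if i == skip:
            continue
        s += (ts[i] - ts[i - 1]) / b * kernel((t - ts[i]) / b) * ys[i]
    return s

def loo_cv_score(ts, ys, b):
    # Leave-one-out prediction error, restricted to interior points
    # to avoid boundary effects.
    err = 0.0
    for i in range(len(ts)):
        if not (0.1 <= ts[i] <= 0.9):
            continue
        err += (ys[i] - pc_estimate(ts[i], ts, ys, b, skip=i)) ** 2
    return err

random.seed(1)
n = 100
ts = [i / n for i in range(1, n + 1)]
ys = [math.sin(2.0 * math.pi * t) + random.gauss(0.0, 0.1) for t in ts]

grid = [0.02, 0.05, 0.1, 0.2, 0.4]  # assumed candidate bandwidths
b_hat = min(grid, key=lambda b: loo_cv_score(ts, ys, b))
```

As the text notes, a bandwidth tuned for curve estimation in this way is simple but suboptimal for the parameter estimation problem, where undersmoothing is called for.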

The next observation of this section concerns numerical computation of our SME. The kernel-type nonparametric regression estimates can be quickly evaluated on any regular grid of points, for example, via techniques using the Fast Fourier Transform (FFT) similar to those described in Appendix D of wand2 (); see also marron (). Furthermore, in the match step of the two-step estimation procedure, the criterion function can be approximated by a finite sum by discretising the integral in its definition. If $F$ is linear in $\theta$, then, as already observed in varah (), see pages 29 and 31, cf. page 1262 in brunel () and page 1573 in liang (), this will lead to a weighted linear least squares problem, which can be solved in a routine fashion without using, for example, random search methods. This is a great simplification in comparison to the ordinary least squares estimator, which moreover will still tend to get trapped in local minima of the least squares criterion function, despite the fact that $F$ is linear in its parameters.
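To illustrate the reduction to weighted linear least squares, the following sketch carries out the match step for the Lotka–Volterra system (19) of the example below, which is linear in its parameters. All numerical choices (parameter values, initial condition, grid, weights) are assumptions; moreover, for testability we plug in exact solution values and derivatives obtained by Runge–Kutta integration, so the true parameter is recovered, whereas in practice the kernel estimates of the solution and its derivative would be used:

```python
def lv_rhs(x, th):
    # Right-hand side of the Lotka-Volterra system (19).
    return [th[0] * x[0] - th[1] * x[0] * x[1],
            -th[2] * x[1] + th[3] * x[0] * x[1]]

def rk4_path(x0, th, t_end, steps):
    # Classical fourth-order Runge-Kutta on an equidistant grid.
    h = t_end / steps
    xs, x = [list(x0)], list(x0)
    for _ in range(steps):
        k1 = lv_rhs(x, th)
        k2 = lv_rhs([x[j] + 0.5 * h * k1[j] for j in range(2)], th)
        k3 = lv_rhs([x[j] + 0.5 * h * k2[j] for j in range(2)], th)
        k4 = lv_rhs([x[j] + h * k3[j] for j in range(2)], th)
        x = [x[j] + h / 6.0 * (k1[j] + 2 * k2[j] + 2 * k3[j] + k4[j])
             for j in range(2)]
        xs.append(list(x))
    return xs

def solve2(a11, a12, a22, b1, b2):
    # Solve the symmetric 2x2 normal equations [[a11,a12],[a12,a22]] eta = b.
    det = a11 * a22 - a12 * a12
    return [(b1 * a22 - b2 * a12) / det, (a11 * b2 - a12 * b1) / det]

def match_step(xs, dxs, w):
    # Each equation of (19) is linear in its two parameters, so the
    # discretised criterion splits into two weighted least squares problems.
    g1 = [[x[0], -x[0] * x[1]] for x in xs]   # dx1 = eta1*x1 + eta2*(-x1 x2)
    g2 = [[-x[1], x[0] * x[1]] for x in xs]   # dx2 = eta3*(-x2) + eta4*(x1 x2)
    est = []
    for g, y in ((g1, [d[0] for d in dxs]), (g2, [d[1] for d in dxs])):
        a11 = sum(wk * gk[0] ** 2 for wk, gk in zip(w, g))
        a12 = sum(wk * gk[0] * gk[1] for wk, gk in zip(w, g))
        a22 = sum(wk * gk[1] ** 2 for wk, gk in zip(w, g))
        b1 = sum(wk * gk[0] * yk for wk, gk, yk in zip(w, g, y))
        b2 = sum(wk * gk[1] * yk for wk, gk, yk in zip(w, g, y))
        est.extend(solve2(a11, a12, a22, b1, b2))
    return est

theta = [0.5, 0.5, 0.5, 0.5]                 # assumed true parameter
xs = rk4_path([1.0, 0.5], theta, 25.0, 500)  # assumed trajectory
dxs = [lv_rhs(x, theta) for x in xs]          # exact derivatives for the test
theta_hat = match_step(xs, dxs, [1.0] * len(xs))
```

Because the design matrix is small and fixed, the normal-equations solve is essentially instantaneous, which is the computational advantage over repeated numerical integration discussed above.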

We conclude this section with two simple problems illustrating parameter estimation for systems of ordinary differential equations via the smooth and match method studied in the present paper. Our first example deals with the Lotka–Volterra system, a basic model in population dynamics. It describes the evolution over time of the populations of two species, predators and their prey. In mathematical terms, the Lotka–Volterra model is described by a system of two ordinary differential equations depending on the parameter $\theta=(\theta_1,\theta_2,\theta_3,\theta_4)^{\mathrm{T}}$:

 \left\{\begin{array}{l} x_1'(t)=\theta_1 x_1(t)-\theta_2 x_1(t)x_2(t),\\ x_2'(t)=-\theta_3 x_2(t)+\theta_4 x_1(t)x_2(t). \end{array}\right. (19)

Here, $x_1$ represents the prey population and $x_2$ the predator population. For additional information on the Lotka–Volterra system see, for example, Section 6.2 in keshet (). We took fixed parameter values and a fixed initial condition. The solution to (19) corresponding to these parameter values is plotted in Figure 1 with a thin line, one component in each panel.

The solution components $x_1$ and $x_2$ are of oscillatory nature and are out of phase with each other. Next, we simulated a small data set of $n=50$ observations on the solution of (19) over the time interval $[0,25]$ by taking an equidistant grid of time points $t_1,\dots,t_{50}$ and setting

 Y_{ij}=x_j^{\theta}(t_i)+\epsilon_{ij},\qquad i=1,\dots,50,\ j=1,2, (20)

where the i.i.d. measurement errors $\epsilon_{ij}$ were generated from the normal distribution with mean zero and a fixed variance. These observations are represented by crosses in Figure 1.
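A minimal sketch of this data-generating step, with assumed parameter values, initial condition, and noise level (the paper's exact settings are not reproduced here); the trajectory is obtained by Runge–Kutta discretisation of (19):

```python
import random

def lv_rhs(x, th):
    # Right-hand side of the Lotka-Volterra system (19).
    return (th[0] * x[0] - th[1] * x[0] * x[1],
            -th[2] * x[1] + th[3] * x[0] * x[1])

def rk4_step(x, th, h):
    # One classical fourth-order Runge-Kutta step.
    k1 = lv_rhs(x, th)
    k2 = lv_rhs((x[0] + 0.5 * h * k1[0], x[1] + 0.5 * h * k1[1]), th)
    k3 = lv_rhs((x[0] + 0.5 * h * k2[0], x[1] + 0.5 * h * k2[1]), th)
    k4 = lv_rhs((x[0] + h * k3[0], x[1] + h * k3[1]), th)
    return tuple(x[j] + h / 6.0 * (k1[j] + 2 * k2[j] + 2 * k3[j] + k4[j])
                 for j in range(2))

random.seed(0)
theta = (0.5, 0.5, 0.5, 0.5)   # assumed true parameter
x = (1.0, 0.5)                 # assumed initial condition
sub, h = 10, 0.05              # 10 integration substeps per observation time
ts, Y = [], []
for i in range(1, 51):         # n = 50 equidistant observation times on (0, 25]
    for _ in range(sub):
        x = rk4_step(x, theta, h)
    ts.append(0.5 * i)
    Y.append((x[0] + random.gauss(0.0, 0.1),   # model (20), assumed sd 0.1
              x[1] + random.gauss(0.0, 0.1)))
```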

The three required ingredients for the construction of an estimator are the kernel $K$, the weight function $w$, and the bandwidth $b$. A general recipe for the construction of kernels of an arbitrary order is given in Section 1.2.2 of tsybakov () and is based on the use of polynomials that are orthonormal in $L^2$ with respect to suitable weights. In particular, we used the ultraspherical, or Gegenbauer, polynomials and constructed a fourth order kernel with them. Notice that our definition of a kernel of a given order in Condition 3 is slightly different from the one in Definition 1.3 of tsybakov (), cf. also the remark on page 6 there. For ultraspherical polynomials, see Section 4.7 in szego (). Our fourth order kernel took the form

 K(t)=\Bigl(\frac{105}{64}-\frac{315}{64}t^2\Bigr)(1-t^2)^2\,\mathbf{1}_{[|t|\leq 1]}. (21)

Notice that $K$ is a symmetric function. The kernel is plotted in Figure 2 in the left panel.
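As a sanity check, the following sketch evaluates the kernel (21) and verifies its fourth order moment conditions numerically: the kernel integrates to one, while its first, second, and third moments vanish. The trapezoidal grid size is an arbitrary choice:

```python
def K(t):
    # Fourth order kernel (21), supported on [-1, 1].
    if abs(t) > 1.0:
        return 0.0
    return (105.0 / 64.0 - 315.0 / 64.0 * t * t) * (1.0 - t * t) ** 2

def moment(j, n=40001):
    # Trapezoidal approximation of the j-th moment of K over [-1, 1].
    h = 2.0 / (n - 1)
    s = 0.0
    for i in range(n):
        t = -1.0 + i * h
        wgt = 0.5 if i in (0, n - 1) else 1.0
        s += wgt * (t ** j) * K(t)
    return s * h
```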

An alternative here is to use the Gaussian-based kernels as in schucany (), although they do not have compact support. As far as the weight function is concerned, any nonnegative function that is smooth, equal to zero close to the end points of the interval, and equal to one on the greater part of the interval could have been used. We opted to simply rescale and shift the function $\lambda_{c,\beta}$

that arose in a different context in mcmurry (), see formula (3) on page 552 there, so that it has the required properties in our context. We took fixed values of the constants $c$ and $\beta$ and then set

 w(t)=\lambda_{c,\beta}\Bigl(\frac{1.05(t-12.5)}{12.5}\Bigr). (22)

The function $w$ is plotted in the right panel of Figure 2. Finally, since in the present work construction of a bandwidth selector is not our primary goal, we simply selected $b$ by hand and set it to 1.2.
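Since the exact function $\lambda_{c,\beta}$ of mcmurry () is not reproduced above, the following sketch builds a generic weight function with the same qualitative properties: smooth, vanishing at the endpoints of the observation interval, and equal to one on the greater part of it. The interval $[0,25]$ and the transition width are assumptions:

```python
import math

def bump(u):
    # C-infinity transition function: 0 for u <= 0, 1 for u >= 1.
    if u <= 0.0:
        return 0.0
    if u >= 1.0:
        return 1.0
    a = math.exp(-1.0 / u)
    b = math.exp(-1.0 / (1.0 - u))
    return a / (a + b)

def w(t, lo=0.0, hi=25.0, taper=2.0):
    # Rises smoothly from 0 at t = lo, equals 1 on [lo + taper, hi - taper],
    # and falls back to 0 at t = hi.
    return bump((t - lo) / taper) * bump((hi - t) / taper)
```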

The smooth and match estimation procedure was implemented in Mathematica 6.0, see mathematica (). We first evaluated the kernel estimates of the regression functions and their derivatives at an equidistant grid of points. With this number of grid points and the sample size, there was no need to use binning to compute the estimates; moreover, binning would probably have resulted in a slower procedure, cf. Figure 3(b) in marron (), so we did not employ it. However, the fact that many of the kernel evaluations are actually the same, cf. marron (), was taken into account and led to savings in computation time over a naive implementation of the Priestley–Chao estimator that would compute every kernel evaluation separately. The estimates $\hat x_1$ and $\hat x_2$ are plotted in Figure 1 with a solid line, while the derivative estimates $\hat x_1'$ and $\hat x_2'$ are plotted in Figure 3.

Notice that the derivative estimates $\hat x_1'$ and $\hat x_2'$ are severely undersmoothed. We next approximated the criterion function by the Riemann sum

 \sum_{k=0}^{249}\bigl(\hat x_1'(0.1k)-\eta_1\hat x_1(0.1k)+\eta_2\hat x_1(0.1k)\hat x_2(0.1k)\bigr)^2 w(0.1k)\cdot 0.1 +\sum_{k=0}^{249}\bigl(\hat x_2'(0.1k)+\eta_3\hat x_2(0.1k)-\eta_4\hat x_1(0.1k)\hat x_2(0.1k)\bigr)^2 w(0.1k)\cdot 0.1.

Note that when performing the minimisation, the factor $0.1$ can be omitted from both terms in the above display. The minimisation procedure resulted in the estimate

 \hat\theta_n=(0.52,0.50,0.50,0.51)^{\mathrm{T}}.

With our implementation, the total time needed for computation of the estimate $\hat\theta_n$ (including time needed for kernel and weight function evaluations, but excluding time needed for loading observations) was about 0.5 seconds on a notebook with an Intel(R) Pentium(R) Dual CPU T3200 @ 2.00 GHz processor and 4.00 GB RAM. The parameter estimates appear to be sufficiently accurate in this particular case.

Our second example deals with the Van der Pol oscillator, which describes an electric circuit containing a nonlinear element; see page 333, Problem 12 on page 365, and the references on page 373 in keshet (). The corresponding system of ordinary differential equations takes the form

 \left\{\begin{array}{l} x_1'(t)=\theta^{-1}\bigl(x_1(t)-\frac{1}{3}(x_1(t))^3+x_2(t)\bigr),\\ x_2'(t)=-\theta x_1(t). \end{array}\right. (23)

We took a fixed value of $\theta$ and a fixed initial condition. The solution to (23) is of oscillatory nature, and the components $x_1$ and $x_2$ are out of phase with each other. The solution is plotted in Figure 4 with a thin line.

We then simulated a data set of $n=50$ observations on the solution of (23) over a time interval at an equidistant grid of time points $t_1,\dots,t_{50}$ by setting

 Y_{ij}=x_j^{\theta}(t_i)+\epsilon_{ij},\qquad i=1,\dots,50,\ j=1,2, (24)

where the i.i.d. measurement errors $\epsilon_{ij}$ were generated from the normal distribution with mean zero and a fixed variance. These observations are plotted with crosses in Figure 4. When computing the estimate, we used the same kernel and the same weight function as in the previous example, while the bandwidth was again selected by hand. The estimates of the solution components $x_1$ and $x_2$ are depicted by a solid line in Figure 4, while the derivatives $x_1'$ and $x_2'$ together with their estimates are given in Figure 5.

The estimation procedure resulted in an estimate of $\theta$, and the computation time was of the same order as in the previous example.

We intend to perform a more practically oriented study exploring some of the ideas mentioned in this section in a separate publication.

## 5 Proofs

When comparing two sequences of real numbers, we will use the symbol $\lesssim$, meaning less than or equal up to a universal multiplicative constant that is independent of the index. The symbol $\asymp$ will denote the fact that two sequences of real numbers are asymptotically of the same order.

Proof of Proposition 3.1. We first prove (9). For any positive $\varepsilon$, by Chebyshev's inequality we have

 \mathbb{P}\Bigl(\sup_{t\in[\delta,1-\delta]}|\hat\mu_n(t)-\mu(t)|>\varepsilon\Bigr) \leq \frac{2}{\varepsilon^2}\Bigl\{\sup_{t\in[\delta,1-\delta]}|\mathbb{E}[\hat\mu_n(t)]-\mu(t)|^2 + \mathbb{E}\Bigl[\sup_{t\in[\delta,1-\delta]}|\hat\mu_n(t)-\mathbb{E}[\hat\mu_n(t)]|^2\Bigr]\Bigr\} = \frac{2}{\varepsilon^2}(T_1+T_2).

By formula (1) from the Appendix, we can write

 \mathbb{E}[\hat\mu_n(t)]-\mu(t)=\int_0^1\mu(s)\frac{1}{b}K\Bigl(\frac{t-s}{b}\Bigr)\,ds-\mu(t)+O\Bigl(\frac{1}{nb^2}\Bigr).

For all $n$ large enough, we have $b<\delta$, because $b\to 0$. Then, for all such $n$, a standard argument (cf. page 6 in tsybakov ()), namely Taylor's formula up to order $\alpha$ applied to $\mu$ and the moment conditions on the kernel formulated in Condition 3, yields

 \sup_{t\in[\delta,1-\delta]}|\mathbb{E}[\hat\mu_n(t)]-\mu(t)| \leq \frac{b^\alpha\|\mu^{(\alpha)}\|_\infty}{\alpha!}\int_{-1}^1|u^\alpha K(u)|\,du + O\Bigl(\frac{1}{nb^2}\Bigr). (26)

Next we turn to $T_2$. With argumentation similar to that in the proof of Theorem 1.8 of tsybakov () and setting

 S_i(t)=\frac{t_i-t_{i-1}}{b}K\Bigl(\frac{t-t_i}{b}\Bigr),\qquad N=n^2,\qquad s_j=\frac{j}{N},

for $j=0,1,\dots,N$, we have

 A = supt∈[δ,1−δ]|^μn(t)−E[^μn(t)]| = supt∈[δ,1−δ]∣∣∣n∑i=1Si(t)ϵi∣∣∣ ≤ max1