State and Parameter Estimation of Partially Observed Linear Ordinary Differential Equations with Deterministic Optimal Control
Résumé
Ordinary Differential Equations are a simple but powerful framework for modeling complex systems. Parameter estimation from times series can be done by Nonlinear Least Squares (or other classical approaches), but this can give unsatisfactory results because the inverse problem can be illposed, even when the differential equation is linear.
Following recent approaches that use approximate solutions of the ODE model, we propose a new method that converts parameter estimation into an optimal control problem: our objective is to determine a control and a parameter that are as close as possible to the data. We derive then a criterion that makes a balance between discrepancy with data and with the model, and we minimize it by using optimization in functions spaces: our approach is related to the socalled Deterministic Kalman Filtering, but different from the usual statistical Kalman filtering.
We show the root consistency and asymptotic normality of the estimators for the parameter and for the states. Experiments in a toy model and in a real case shows that our approach is generally more accurate and more reliable than Nonlinear Least Squares and Generalized Smoothing, even in misspecified cases.
Résumé
Keywords:
Ordinary Differential Equation, Optimal Control, Parameter Estimation, Smoothing, Riccati Equation, Mestimation.
Keywords:
Ordinary Differential Equation, Optimal Control, Parameter Estimation, Smoothing, Riccati Equation, Mestimation.
1 Introduction
Ordinary Differential Equations (ODE) are a widely used class of mathematical models in biology, physics, engineering, …Indeed, it is a relatively simple but powerful framework for expressing the main mechanisms and interactions of potentially complex systems. It is often a reference framework in population dynamics and epidemiology [13], virology [29], or in genetics for describing gene regulation networks [26, 40]. The model takes the form , where is a vector field, is the state, and is a parameter that can be partly known. The parameter is often of high interest, as it represents rates of changes, phenomenological constants needed for interpretability and analysis of the system. Typically, can be related to the sensitivity of a variable with respect to other variables.
Hence, the parameter estimation of ODEs from experimental data is a longstanding statistical subject that have been adressed with many different tools. Estimation can be done with classical estimators such as Nonlinear Least Squares (NLS) and Maximum Likelihood Estimator (MLE) [24, 39, 31] or Bayesian approaches [21, 15, 16, 9]. Nevertheless, the statistical estimation of an ODE model by NLS leads to a difficult nonlinear estimation problem. Some difficulties were pointed out by Ramsay et al. [33] such as computational complexity, due to ODE integration and nonlinear optimization. These difficulties are in fact reminiscent of intrinsic difficulties in the parameter estimation problem, that makes it an illposed inverse problem, that needs some regularization [14, 36] .
Alternative statistical estimators have been developped to deal with this particular framework, such as Generalized Smoothing [33, 32, 12, 10] or TwoStep estimators [38, 5, 25, 17, 6]. Twostep estimators use a nonparametric estimator and aim at minimizing quantities characterizing the differential models, such as the weighted distance . These estimators have a good computational efficiency as they avoid repeated ODE integration. In practice, the used criteria are also smoother and easier to optimize than the NLS criterion. Twostep estimators are consistent in general, but there is a tradeoff with the statistical precision, and some care in the use of nonparametric estimate has to be taken in order to keep a parametric rate [5, 17].
In the case of Generalized Smoothing [33], the solution is approximated by a basis expansion that solves approximately the ODE model; hence, the parameter inference is performed by dealing with an imperfect model. Based on the Generalized Profiling approach, Hooker proposed a criteria that estimates the lackoffit through the estimation of a “forcing function” in the ODE , where is a previous estimate obtained by Generalized Profiling.
In [7], the authors have proposed a twostep estimator for linear models, that avoids the use of and introduces a forcing function without the finite basis decomposition by using control theory. The principle is to transform the estimation problem into a control problem: we have to find the best (or smallest) control such that the ODE is close to the data. The limitations of the results provided in [7] were the restriction to fully observed system with known initial condition. The objective of this paper is to provide a similar twostep estimate that permits the estimation of without knowing , that deals with the partially observed case and provides state estimates.
One interest of the approach used is to deal directly with the optimization in a function space without using of series expansion for function estimation. Moreover, infinite dimensional optimization tools give a powerful characterization of the solutions, useful in practice. This work can be seen as an extension of the previous one [7], aiming to use control theory result for parameter inference. We deal now with the partially observed case with unknown initial condition, that gives rise to a methodology close to the socalled “Deterministic Kalman Filter”. Indeed, in that paper, we assume that the system is linear, with a linear observation function.
Our method provides a consistent parametric estimator when the model is correct. We show that it is rootn consistent and asymptotically normal. At the same time, we get a discrepancy measure between the model and the data under the form of an optimal control analogous to the forcing function in [19], and we show that we can estimate the final and initial conditions and hence all the states if needed, in particular the hidden ones.
In the next section, we introduce the notations and we motivate our approach by discussing the Generalized Smoothing approach, and the link with Optimal Control Theory. In section 3, we investigate the existence and regularity of our new criterion; in particular, we derive necessary and sufficient conditions for defining our approach in partially observed case. We show that the estimator is consistent under some regularity assumption about the model. Then in section 4, we show that we reach the root rate using regression splines for the nonparametric estimator of the observed signal. We derive then the consistency of the state estimator derived. Finally, we show the interest of our method on a toy model and in a real model used in chemical engineering, by a comparison with Nonlinear Least Squares and Generalized Smoothing.
2 Model and methodology
We introduce first the statistical ODE model of interest, and the basic notations for defining our estimator. We relate this work to the Generalized Smoothing estimator and the Tracking estimator.
2.1 Model and Notations
We partially observe a “true” trajectory at random times , such that we have observations defined as
where is a random noise and is the observation matrix of size .
We assume that there is a true parameter belonging to a subset of , such that is the unique solution of the linear ODE
(1) 
with initial condition ; where and . More generally, we denote the solution of (1) for a given , and initial condition . We assume that and are unknown, and that they must be estimated from the data . The parameter is the main parameter of interest, whereas the initial condition is considered as a nuisance parameter, needed essentially for the computation of candidate trajectories .
For linear equations, a central role is played by the solutions of the homogeneous ODE
(2) 
Indeed, for each in , we denote the solution to the matrix ODE (2), with initial condition at time (i.e ). The function is a matrix valued function, called the resolvant of the ODE. It permits to give an explicit dependence of the solutions of (1) in and the initial condition , thanks to Duhamel’s formula:
A consistent and classical method for the estimation of is Nonlinear Least Squares (NLS), that minimizes
A classical alternative is Generalized Smoothing (GS), that uses approximate solutions of the ODE (1). GS replaces the solutions by splines that smooth data and solve approximately the ODE with a penalty based on the ODE model. A basis expansion is computed for each , where is obtained by minimizing in the criterion
(3) 
This first step is considered as profiling along the nuisance parameter , whereas the estimation of the parameter of interest is obtained by minimizing the sum of squared errors of the proxy :
(4) 
In practice, the hyperparameter needs to be selected from the data with adaptive procedures, see [11].
The essential difference with NLS is the replacement of the exact solution by the approximation (that depends also on the data). This change induces a new source of error in the estimation of the true trajectory as the functions are splines that do not solve exactly the ODE model (1). The ODE constraint is relaxed into an inequality constraint defined on the interval . The model constraint is never set to 0 because of the tradeoff with the datafitting term . For this reason, the ODE model (1) is not solved and it is useful to introduce the discrepancy term that corresponds to a model error. In fact, the proxy satisfies the perturbed ODE . This forcing function is an outcome of the optimization process and can be relatively hard to analyze or understand, but its analysis provides a good insight into the relevancy of the model [19, 20].
Based on these remarks, we introduce the perturbed linear ODE
(5) 
where the function can be any function in . The solution of the corresponding Initial Value Problem
is denoted . Instead of using the spline proxy for approximating , we use the trajectories of the ODE (5) controlled by the additional functional parameter .
In [7], the same perturbed model is introduced but the cost function is simpler as the observation matrix is the identity, and the initial condition is fixed. In that framework, an Mestimator for is proposed, based on the optimization of the criterion
(6) 
The proper definition of and the derivation of its properties were obtained by using some classical results of Optimal Control Theory. Essentially, the computation of corresponds to the classical "tracking problem" that can be solved by the LinearQuadratic theory (LQ theory). LQ theory solves the minimization problem in of the cost function
(7) 
The criteria used for parameter estimation is associated to the value function defined in Optimal Control as . The value function plays a critical role in the analysis of optimal control problems, typically for the computation of an optimal policy. Under regularity assumptions, the value function is the solution of the HamiltonJacobiBellman Equation, which is a first order Partial Differential Equation [1]. Quite remarkably, for a linear ODE with a quadratic cost such as (7), the value function is a quadratic form in the state , i.e , where is the solution of a matrix ODE (the Riccati equation), which makes its computation very tractable in practice.
LQ theory can be adapted for tracking of an output signal with a perturbed linear ODE, see chapter 7 in [35]. When we do not know the initial condition, some adaptations are required. Indeed, as the initial condition can have a strong influence on the optimal control and the optimal cost; it seems much harder to solve the control problem when the initial condition is not known: the current state is unknown and all the admissible trajectories must be considered. Nevertheless, this problem is solved by the Deterministic Kalman Filter (DKF) by using the fact that the value function is a quadratic form on the state.
We show in the next section that the Deterministic Kalman Filtering (DKF) is well adapted for developing parameter estimation, as it enables to profile on , considered as a nuisance parameter. In a twostep approach, it is critical as we need to control the influence of the nonparametric estimate of on the convergence rate. As we use as a proxy for , we need to show that the rate of the twostep estimator is not polluted by the use of nonparametric estimates of the boundary conditions, and that we keep a parametric rate for and . This property was carefully checked in [5, 6, 25]; in that paper, as we do not use implicitly or explicitly the derivative of the nonparametric estimate, the mechanics of the proof are different.
In the next section, we give some details on LQ theory and on the criterion . The classical costs in optimal control consist of an integral term plus a penalty term on the final state, such as . A preliminary timereversing transformation is used for introducing properly the initial state in the cost , rather than the final state. In a second step, we derive the criterion , and we give a tractable expression for estimation. Finally, we discuss the importance of identifiability and observability in the definition on our criterion.
2.2 The Deterministic Kalman Filter and the profiled cost
Following the Tracking estimator, we look for a candidate that minimizes at the same time the discrepancy with the data and the size of the perturbations . We consider nearly the same cost as in [7]
(8) 
for given . We can also add a positive quadratic form , where is a positive symmetric matrix . This additional term permits to introduce easily some prior knowledge on such that we have a cost defined as
Moreover, the matrix avoids some technical problems in the definition of our criterion .
For each in , we denote
(9) 
obtained by “profiling” on the function and then in the initial condition . The function is the criterion used in the case of fixed and known initial conditions. Our approach is rather "natural" as we simply profile the regularized criterion .
The definition of mimics the minimization of except that GS uses a discretized solution, defined on a Bsplines basis. Nevertheless, our estimator possesses two other essential differences with Generalized Smoothing. As it was already mentioned in [7], we define our estimator as the global minimum of the profiled cost:
(10) 
whereas the GS estimator minimizes a different criterion .
This means that in our methodology, we try to find a parameter
that maintain a reasonable tradeoff between model and data, whereas
the Generalized Smoothing Estimator is dedicated
to fit the data with the proxy , without considering
the size of model error represented by .
Another important difference is in the way we deal with the unobserved part of the system. For simplicity, let us consider that we observe only the first components of , such that the state vector can be written . For Generalized Smoothing, both functions and are decomposed in a Bspline basis, and the corresponding coefficients and are obtained by minimizing . Because does not have to make a tradeoff between the data and the ODE model, the estimated missing part is the exact solution to the ODE (5).
At the contrary, even in the case of partial observations, the perturbed solution is used for estimating the missing states and a perturbation exists for each component. Consequently, the estimated hidden states are not solution of the initial ODE. We think that this an advantage for state and parameter estimation with respect to Generalized Smoothing (and NLS) because it avoids to rely too strongly on a uncertain model during estimation. This uncertainty can be caused by errors in parameter estimation, or it can be due model misspecification, such as the presence of a forcing function . In our experiments, we show that imposing model uncertainty for the unobserved variables is beneficial for error prediction.
Before going deeper into the interpretation and analysis of our estimator, we need to show that the criterion is properly defined and that we can obtain a tractable expression for computations and for the theoretical analysis of (10). We use the Deterministic Kalman Filter (DKF) to obtain a closedform expression for the minimal cost w.r.t the control and (9).
The initial aim of the DKF is to propose an estimation of the final state by making a balance between the information brought by the noisy signal and the ODE model (see [35] for an introduction). We recall the two steps necessary for the filter construction, more details are given in appendix:

For a given initial condition , we find the minimum cost thanks the fundamental theorem in LQ Theory (presented in A.1),

We minimize the quadratic form w.r.t the final condition.
We give now the main theorem of that section about the existence of the criterion defined in equation (9).
Theorem and Definition of .
Let be a function belonging to
and be the solution to the controlled ODE (5).
For any in , , , there exists a unique optimal control
and initial condition
that minimizes the cost function
(11) 
The optimal control is
(12) 
where and are solutions of the Initial Value Problem
(13) 
with . The functions and are defined by
For all , the matrix is symmetric, and the ODE defining the matrixvalued function is called the Matrix Riccati Differential Equation of the ODE (5).
Finally, the Profiled Cost has the closed form:
(14) 
and the final state is estimated by
(15) 
Remarque 2.1.
The functions are classically called the adjoint model. They depend also on , and because of their definition via equation (13). Nevertheless, we do not write it systematically for notational brevity. As mentioned in the theorem, it is possible to compute in a “closedloop” form as we can solve in a preliminary stage the adjoint model (13) that gives the function and for all . Thanks to equation (12), the closedform expression of the optimal control can be plugged into (5). We can compute by solving the following Final Value Problem:
(16) 
The estimate of of the initial condition is simply the initial value of the Backward ODE (16). Then by using , we can compute effectively the control thanks to (12).
The existence of the criterion and the fundamental expression (14) heavily relies on the nonsingularity of the final value of the Riccati solution . In particular, the final state is estimated by , and it is then critical to identify the assumptions that could prevent to be singular. Our "Theorem and Definition" is legitimate (and proved in the appendix), because the assumption ensures that is nonsingular for all in . Moreover, the matrix can be thought as a kind of prior for helping the state inference. In our basic definition of the cost (2.2), we put a prior on the norm of the initial condition and our regularization penalizes "huge" solutions. Nevertheless, we can have a more refined prior and use a preliminary guess . The modification of the criterion is straightforward by setting
By reparameterizing the initial conditions with and exploiting the relation (consequence of the linearity of the ODE) , we get that
At the opposite, it might be inappropriate in some circumstances to impose such kind of information for the initial condition. This can be the case if the number of observations tends to infinity and becomes quite close to the truth. Another situation is when the initial conditions of the unobserved part are largely unknown. Hence, we extend our estimator to the case , that corresponds also to our framework for studying the asymptotics of . In order to derive relevant and tractable conditions for ensuring the existence of , we need to ensure that only one trajectory, with a unique initial condition (or final condition), is the global minimum of . The nonsingularity of is in fact related to the concept of observability in control theory. In the next proposition, we will pave the way to the assumptions on and the vector field that can guarantee the general existence of our method.
Proposition 2.2.
For a given parameter and observation matrix , the properties 1 and 2 are equivalent:

The system outputs satisfy
(17) 
The (final) observability matrix is nonsingular
(18)
If one of the properties is satisfied, then is nonsingular and is defined for .
An important feature of that proposition is that the criterion does not depend on . Moreover, if is full rank, the matrix is always nonsingular for all in . The criterion 1 means that for a given , any solution and of (1) can be distinguished by their partial observation . The matrix "gives" enough information about the system so that the observed part is sufficient to uniquely characterize the whole system’s state.
The next section is dedicated to the derivation of the regularity properties of . Thanks to the different possible expressions for the criterion , we can show the smoothness in and , and compute directly the needed derivatives.
3 Consistency of the Deterministic Kalman Filter Estimator
3.1 Properties of the criterion
We have a tractable expression of the cost function for a given , but we still need to derive the properties of and on , and shows some convergence properties. First of all, we need to ensure the existence of ; this is the case if the nonparametric estimator belongs to (more explanations are given in appendix A). We show that for all in , the function is well defined and on , under some regularity and identifiability assumptions, detailed below:
 C1:

is a compact subset of and is in the interior ,
 C2a:

and for all in , is nonsingular,
 C2b:

The model is identifiable at i.e
 C3:

and are continuous,
 C4:

and are continuous.
Condition 2 is about identifiability condition: condition 2a is needed for the existence of the criterion , and is related to the identifiability of the initial condition. But C2a is not sufficient, and we need Condition 2b for structural identifiability, based on the joint identifiability at . We require that the observed output can be generated on by the couple . The identifiability problem of systems can be difficult (more than observability). For linear system, several approaches can be used, such as Laplace Transform [2], or Power Expansions [30], see [27] for a review. So far, most of existing methods are poorly used because they rely on (heavy) formal computations, which limit their interest to low dimensional system. Nonetheless, progress in automatic formal computation has promoted new methods based on differential algebra and the Ritt’s algorithm, that improves identifiability checking, [22, 8, 23].
According to the context, the norm will denote the Euclidean norm in , or the Frobenius matrix norm . We use the functional norm in defined by: . Continuity and differentiability have to be understood according to these norms.
Proposition 3.1.
Under conditions 1, 2a and 3 we have:
and
Hence, for all in , the map is well defined on (i.e )
We have shown that for all in , the maps is well defined and so are and as long as the nonparametric estimator are welldefined on .
Proposition 3.2.
Under conditions 1, 2, 3
is continuous on . Under conditions 1, 2a, 3, 4 it is on .
In proposition 3.1 we have shown that our criteria is well defined ( i.e ) and here we have demonstrated (using regularity assumptions on the model) that our finite and asymptotic criteria are continuous or even on . Theses regularity properties justify the use of classical optimization method to retrieve the minimum of .
3.2 Consistency
We show the consistency of the parameter estimator when the model is wellspecified. As already mentioned, we have defined an estimator, and we can proove the consistency (see [37]), by showing

the uniform convergence of to on ,

is the unique global minimum of the asymptotic criterion on .
The second point is assessed in proposition 3.3, and it is related to the structure identifiability of the model provided by condition 2b.
Proposition 3.3.
Under conditions 1, 2a, 2b, we have:
Point 1 is proved by studying the regularity of the map and by obtaining appropriate controls of the variations by , see Supplementary Materials. Theorem (3.4) can be claimed with some generality on the nonparametric proxy .
Thï¿œorï¿œme 3.4.
Under conditions 1, 2a, 2b, 3 and if is consistent in probability in , then .
4 Asymptotics of
The aim of this section is to derive the rate of convergence and asymptotic law of . For this reason, we need more precise assumptions on . The way we proceed is based on the plugin properties of nonparametric estimates, when the functionals of interest are relatively smooth. In the case of series expansion, these properties are well understood [28, 4]. We focus here on regression splines, as they are wellused in practice and relatively simple to study, although more refined nonparametric estimators can be used in the same context, such as Penalized Splines. We assume that has a BSpline expansion
where is computed by linear leastsquares, and the dimension increases with . We introduce then additional regularity conditions on the ODE model, and on the distribution of observations:
 C5:

and are continuous,
 C6:

is nonsingular,
 C7:

The observations are i.i.d with with ,
 C8:

The observation times are uniformly distributed on ,
 C9:

It exists such that , are and
 C10:

The meshsize when
The proofs of the rate and asymptotic normality are somewhat technical, and they are relegated in the Supplementary Materials. We obtain a parametric convergence rate, and the asymptotic normality, by using two facts:

behaves like the difference , where is a linear functional,

if is smooth enough, is asymptotically normal in the case of regression splines.
Conditions C5 and C6 ensures the sufficiency of second order optimality conditions for the criteria . Conditions C7 to C10 are sufficient for the consistency of , as well as for the consistency and the asymptotic normality of the plugin estimators of linear functionals.
Thï¿œorï¿œme 4.1.
If conditions C1C10 are satisfied, then and is asymptotically normal.
5 State Estimation
Once the unknown model has been estimated with ,
we focus on the problem of state estimation. From the definition (2.2),
the criterion is built with an estimation of the
state based on the solution of the pertubed ODE .
The estimate of the initial condition is
derived from a Final Value Problem with the final state .
The state estimate that we have used,
is different from the state estimation classically done when using the Deterministic
Kalman Filter. The classical DKF state estimate is ,
and it corresponds to the best estimate of computed from
the available information at time , .
Whereas the estimate can be very bad at the beginning for small ,
the quality of improves as we get more data. A
remarkable feature of the filter is that
it can be computed recursively with an Ordinary Differential Equation
.
The matrix is the continuous counterpart of
the classical Kalman Gain Matrix, derived from the Filtering Riccati
Differential Equation, see page 313 in [35]. In that
recursive form, the Deterministic Kalman Filter is somehow similar
to the KalmanBucy Filter, which is the continuous version of
the usual Kalman Filter. Nevertheless, there is a
huge difference in the assumptions because
the KalmanBucy Filter assumes that is a Stochastic Differential
Equation, driven by a Brownian Motion . This means that the
deterministic perturbation is replaced by a random
pertubation . The state estimate is then different
from the one we consider as it can be shown that the filter is the
solution of a stochastic differential equation driven by the stochastic
process , see for instance [3].
The state estimate
is solution of the pertubed ODE, with the control computed
from all the data :
hence, our state estimation is based on Kalman Smoothing and not on
Filtering, as we have a backward integration step. In the rest of
that section, we show that the estimator
is also a consistent estimator of the state . In order
to do that, we show first that
is a consistent estimator of the final state.
5.1 Final state estimation
In a way, the consistency of the final state estimator is a rather obvious conclusion. The Deterministic Kalman Filter is initially designed for getting the best possible estimate of the final state, starting from any initial condition . It is then normal that we have a good estimator of when is close to and is close to .
Proposition 5.1.
We assume that conditions C1C4 are satisfied and that is a consistent estimator of . Then, the final state estimator converges in probability to .
Démonstration.
We show first that the true final state value is reached for and i.e . We recall that
and that . The identifiability condition 2b implies that the reconstructed state is the exact one. In our case, the minimum is reached when the optimal control is equal to , i.e.
which implies that ( is nonsingular). We can decompose the difference :
The convergence will come from the consistency of and :