
# Numerical Aspects for Approximating Governing Equations Using Data

Kailiang Wu and Dongbin Xiu
Department of Mathematics, The Ohio State University, Columbus, OH 43210, USA.
###### Abstract

We present effective numerical algorithms for locally recovering unknown governing differential equations from measurement data. We employ a set of standard basis functions, e.g., polynomials, to approximate the governing equation with high accuracy. Upon recasting the problem into a function approximation problem, we discuss several important aspects for accurate approximation. Most notably, we discuss the importance of using a large number of short bursts of trajectory data, rather than using data from a single long trajectory. Several options for the numerical algorithms to perform accurate approximation are then presented, along with an error estimate of the final equation approximation. We then present an extensive set of numerical examples of both linear and nonlinear systems to demonstrate the properties and effectiveness of our equation recovery algorithms.

###### keywords:
Ordinary differential equation, differential-algebraic equation, measurement data, data-driven discovery, regression, sequential approximation.

## 1 Introduction

Recently there has been a growing interest in discovering governing equations of physical problems from observational data. This interest stems from the wide recognition that all physical laws, e.g., Newton's laws, Kepler's laws, etc., were established based on empirical observational data. Since all models and physical laws are approximations to the true physics, a numerically constructed law or governing equation, albeit one that may not possess a concise analytical form, could serve as a good model, so long as it is reasonably accurate.

Early work in this direction includes [bongard2007automated, schmidt2009distilling], where symbolic regression was used to select physical laws and determine the structure of the underlying dynamical system. More recently, a sparsity-promoting method [brunton2016discovering] was proposed to discover governing equations from noisy measurement data. It was based on a sparse regression method [tibshirani1996regression] and a Pareto analysis to construct parsimonious governing equations from a combinatorially large set of candidate models. This methodology was further applied to recovering partial differential equations [rudy2017data, schaeffer2017learning], as well as to inferring structures of network models [mangan2016inferring]. Conditions were provided in [tran2017exact] for recovering the governing equations from possibly highly corrupted measurement data, along with a sparse convex optimization method to recover the coefficients of the underlying equations in the space of multivariate polynomials. A nonconvex sparse regression approach was proposed in [schaeffer2017sparse] for selection and identification of a dynamical system directly from noisy data. A model selection approach was developed in [Mangan20170009] for dynamical systems via sparse regression and information criteria. Combining delay embedding and Koopman theory, a data-driven method was designed in [brunton2017chaos] to decompose chaotic systems into intermittently forced linear systems. Very recently, three sampling strategies were developed in [schaeffer2017extracting] for learning quadratic high-dimensional differential equations from under-sampled data. There are other techniques that address various aspects of the dynamical system discovery problem. These include, but are not limited to, methods to reconstruct the equations of motion from a data series [crutchfield1987equations], artificial neural networks [gonzalez1998identification], nonlinear regression [voss1999amplitude], equation-free modeling [kevrekidis2003equation], normal form identification [majda2009normal], empirical dynamic modeling [sugihara2012detecting, ye2015equation], nonlinear Laplacian spectral analysis [giannakis2012nonlinear], modeling emergent behavior [roberts2014model], and automated inference of dynamics [schmidt2011automated, daniels2015automated, daniels2015efficient]. The readers are also referred to the introduction in [BruntonSIAMNews2017].

The focus of this paper is on the local recovery/approximation of general first-order nonlinear ordinary differential/algebraic equations from time-varying measurement data. The major new features of the proposed numerical framework include the following. First, our method seeks "accurate approximation" of the underlying governing equations. This is different from many of the existing studies, where "exact recovery" of the equations is the goal. The justification for "approximation", rather than "recovery", is that all governing equations are eventually discretized and approximated by certain numerical schemes during implementation. As far as computations are concerned, all numerical simulations are conducted using "approximated" equations. Therefore, as long as the errors in the equation approximation/recovery step are small, comparable to the discretization errors, the approximated equations can serve as a sufficient model for simulations. By relaxing the requirement from "exact recovery" to "accurate approximation", we are able to utilize many standard approximation methods using standard basis functions such as polynomials, which are used in this paper. Consequently, one does not need to choose a (combinatorially) large dictionary in order to include all possible terms in the equations. Since most of the governing equations we encounter in practice have relatively simple and smooth forms, polynomial approximation can be highly accurate at moderately low degrees. Secondly, we propose the use of multiple, in fact a large number of, trajectories of very short span, instead of a single long trajectory as in many existing studies. The length of each burst of trajectory data can be exceptionally short – it needs to be merely long enough to enable accurate estimation of the time derivatives.
For noiseless trajectories, the length can be as little as 2 for first-order estimation of the time derivative, 3 for second-order estimation, and so on. We do require the number of such short bursts of trajectories to be large, so that they provide "good" samples in the domain of interest. Upon doing so, we convert the equation recovery problem into a function approximation problem with over-sampled data. A large number of well established approximation algorithms can then be employed. (The readers are referred to books such as [Cheney66, Davis75, Powell81, Rivlin69] for classical results and methods in function approximation theory.) In this paper, we discuss the results obtained by using the least squares method, which is one of the most established methods; the least absolute deviations regression method, which is capable of removing data corruption [candes2005decoding, ShinXiu]; and a matrix-free sequential algorithm [ShinXiu_SISC17, WuShinXiu_SISC17, WuXiu_JCP18], which is suitable for extraordinarily large data sets or stream data. We remark that the use of a very large number of trajectory bursts does not necessarily induce large simulation cost, as each burst is of very short length. In fact, in many cases the total amount of data in the large number of bursts is less than that in a single long trajectory. We also remark that the idea of using multiple bursts of trajectories has also been discussed in [schaeffer2017extracting]. The major difference between [schaeffer2017extracting] and this work is that the goal in [schaeffer2017extracting] is the exact recovery of equations of quadratic form, with the bursts randomly selected based on the theory of sparse recovery (compressed sensing). In this paper, the bursts are selected using approximation theory, as we only seek to approximate the equations and not recover them, and the equations can be of arbitrary form, beyond polynomials and including algebraic constraints.
The choices of the bursts can be random or deterministic, as long as they enable one to obtain an accurate approximation.

This paper is organized as follows. After the basic problem setup in Section 2, we present the main method in Section 3, which includes the general formulation, basis selection, trajectory selection, choices of approximation algorithms, and an error bound. We then present, in Section 4, an extensive set of numerical examples, covering a wide variety of linear/nonlinear differential/algebraic equations, to demonstrate the effectiveness of the proposed algorithms.

## 2 Problem Setup

We consider an autonomous system in the form of

$$\frac{du(t)}{dt} = f(u), \qquad u(t_0) = u_0, \tag{1}$$

where $u = (u_1, \dots, u_d) \in \mathbb{R}^d$ are the state variables, $t_0$ is the initial time, and $f : \mathbb{R}^d \to \mathbb{R}^d$ defines the evolution of the states. Throughout this paper we call $u(t; u_0)$ a trajectory, whose dependence on the initial state $u_0$ is explicitly shown. The right-hand side $f$ is the "true" and unknown dynamics. Our goal is to construct an accurate representation of $f$ using measurement data of the state variables $u$.

### 2.1 Domain-of-Interest

While the state variables may reside in the entire space $\mathbb{R}^d$, we restrict our equation recovery to a bounded domain $D \subset \mathbb{R}^d$. We refer to $D$ as the domain-of-interest, which stands for the range of the state variables where one focuses on the underlying problem. Depending on the (unknown) dynamical system and one's interest in studying it, the definition of the domain-of-interest varies.

Let

$$U_i = [u_{\min,i}, \, u_{\max,i}], \qquad 1 \le i \le d, \tag{2}$$

be a closed interval representing the range of the state variable $u_i$, $1 \le i \le d$, over which one wishes to understand the behavior of the system. We then define $D$ as the product of these intervals, i.e.,

$$D = \times_{i=1}^{d} U_i. \tag{3}$$

Effectively this creates a hypercube in $d$ dimensions. We then equip it with a measure $\omega$, which is absolutely continuous and (without loss of generality) satisfies $\int_D d\omega = 1$. We then invoke the standard definition of the inner product

$$(g, h)_{L^2_\omega} := \int_D g(x)\, h(x)\, d\omega(x),$$

and its induced norm $\|g\|_{L^2_\omega} := (g, g)_{L^2_\omega}^{1/2}$.

###### Remark 1

Upon defining the domain-of-interest, we have restricted our equation recovery effort to a "local" manner. This is because recovering the equation in the entire space $\mathbb{R}^d$, albeit desirable, may be highly challenging when the underlying dynamical system exhibits complex behaviors that differ in different parts of the space. The use of the domain-of-interest allows one to focus on a region where the problem is well behaved and possesses a regular/smooth form, which is highly advantageous for the recovery effort. The size of $D$ depends on the underlying problem and is not necessarily small.

### 2.2 Data

Let $t_0 < t_1 < t_2 < \cdots$ be a sequence of time instances. We assume that the true state variables can be observed at these time instances, and denote by

$$x(t_j; u_0) = u(t_j; u_0) + \epsilon_j, \qquad j = 0, 1, \dots,$$

the observed data for the state trajectory, where $\epsilon_j$ stands for the observation errors. The trajectory data can come from different sequences of trajectories, originated from different initial states. Let $u_0^{(m)}$, $m = 1, \dots, M$, be a set of $M$ different initial states. We use the following shorthand notation to denote the trajectory data from the $m$-th initial condition,

$$x^{(m)}(t_j) = u\big(t_j; u_0^{(m)}\big) + \epsilon_j^{(m)} \in D, \qquad 0 \le j \le J_m, \quad 1 \le m \le M, \tag{4}$$

where $J_m$ is the number of intervals in the $m$-th trajectory data. The collection of all these data shall be used to estimate the true state equation.

### 2.3 Extension to Differential-Algebraic Equations

It is worth noting that the method discussed in this paper is applicable to semi-explicit differential algebraic equations (DAEs), i.e., DAEs with index 1, in the following form,

$$\frac{du(t)}{dt} = F(u, v), \tag{5a}$$
$$G(u, v) = 0, \tag{5b}$$

where $u$ and $v$ are the state variables and $G$ represents the algebraic constraints. Let us assume that the implicit function theorem holds, such that the constraint (5b) induces the following explicit relation

$$v = g(u). \tag{6}$$

Then, by letting $f(u) = F(u, g(u))$, the DAEs can be written in the same form as the autonomous system (1).

## 3 Main Numerical Algorithm

In this section we discuss numerical aspects of our equation approximation/recovery method. We first present the general formulation, which is rather similar to that in many of the existing works (e.g., [bongard2007automated, brunton2016discovering, schaeffer2017sparse]). The details of the numerical algorithm, which are different and new, are then presented, along with an error analysis.

### 3.1 General Formulation

Let $\{x^{(m)}(t_j)\}$ be a collection of trajectory data in the form of (4). We assume that the time derivatives of the trajectories are also available at some of the time instances, either produced by the measurement procedure or approximated numerically using the trajectory data. Let

$$\dot{x}(t) = \dot{u}(t) + \tau(t),$$

be the derivative data, where $\tau(t)$ stands for the errors or noises. This creates a set of data pairings at certain time instances, where both the state variables and their derivatives are available. The collection of such pairings is denoted as

$$\big\{x^{(m)}(t_j), \, \dot{x}^{(m)}(t_j)\big\}, \qquad j \in \Theta_m, \quad 1 \le m \le M, \tag{7}$$

where the total number of data pairings is

$$M_{\mathrm{tot}} = \sum_{m=1}^{M} \#\Theta_m.$$

We then choose a (large) dictionary of basis functions to represent $f$ in the governing equation (1). For the vector function $f$, the approximation is typically conducted component-wise. That is, for each component $f_\ell$, $\ell = 1, \dots, d$, we seek

$$f_\ell(x) \approx \tilde{f}_\ell(x) := \sum_{j=1}^{N} c_{j,\ell} \, \phi_j(x). \tag{8}$$

Here the functions in the dictionary should contain all the possible terms one wishes to incorporate for the accurate approximation of $f$. An example is

$$\Phi = \{1, x, x^2, \dots, x^n, \sin x, \cos x, \dots, \sin mx, \cos mx, \exp(\pm x), \dots, \exp(\pm kx), \dots\}. \tag{9}$$

Let $A$ be the model matrix

$$A = (a_{ij}), \qquad a_{ij} = \phi_j(x_i), \qquad 1 \le i \le M_{\mathrm{tot}}, \quad 1 \le j \le N,$$

where $x_i$ represents the state in the $i$-th data pairing in (7). The equation recovery problem is then cast into an approximation problem

$$A c_\ell \approx \dot{X}_\ell, \qquad \ell = 1, \dots, d, \tag{10}$$

where $\dot{X}_\ell$ is the vector of the $\ell$-th components of all the available derivatives in the pairings (7), and $c_\ell$ is the coefficient vector to be solved for.

Once an accurate approximation $\tilde{f}$ of $f$ is obtained, we consider

$$\frac{dx(t)}{dt} = \tilde{f}(x), \tag{11}$$

as an approximation of the true system (1).

It should be noted that converting the equation recovery problem into an approximation problem in the form of (10) is a rather standard framework in the existing studies. In most of the existing studies, one typically chooses the dictionary (9) to be very large (often combinatorially large) and then uses a certain sparsity-promoting method to construct a parsimonious solution, e.g., $\ell_1$ minimization for underdetermined systems. By doing so, if the terms in the true equations are already included in the dictionary (9), it is then possible to exactly recover the form of the governing equations.

Hereafter, we present the details of our numerical approach, which is different from most of the existing approaches.
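As a concrete illustration of the formulation above, the following sketch assembles the model matrix $A$ of (10) for a hypothetical two-state system with a total-degree-2 monomial dictionary, and solves the least squares problem $A c_\ell \approx \dot{X}_\ell$ component-wise. The dictionary, system, and sample sizes are illustrative choices, not those of the paper.

```python
import numpy as np

def dictionary(x):
    """Monomial dictionary Phi(x) = [1, x1, x2, x1^2, x1*x2, x2^2]."""
    x1, x2 = x
    return np.array([1.0, x1, x2, x1**2, x1 * x2, x2**2])

def fit_coefficients(states, derivs):
    """states: (M_tot, d) array of x_i; derivs: (M_tot, d) array of dx_i/dt.
    Returns the N x d coefficient matrix, one column per state component."""
    A = np.array([dictionary(x) for x in states])   # M_tot x N model matrix
    # One least squares solve per component (component-wise recovery)
    C, *_ = np.linalg.lstsq(A, derivs, rcond=None)
    return C

# Toy check on a system whose terms ARE in the dictionary:
# dx1/dt = x2,  dx2/dt = -x1 - 0.1 x1^2
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))
dX = np.column_stack([X[:, 1], -X[:, 0] - 0.1 * X[:, 0] ** 2])
C = fit_coefficients(X, dX)
```

Since the right-hand side here lies exactly in the span of the dictionary and the data are noiseless, the least squares solve recovers the coefficients to machine precision.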

### 3.2 Equation Approximation using Standard Basis

Contrary to most of the existing studies, our method does not utilize a combinatorially large dictionary to include all possible terms in the equations. Instead, we propose to use a "standard" set of basis functions, particularly polynomials, to approximate $f$ directly. One is certainly not restricted to polynomials and can use other bases such as radial basis functions. Here we choose polynomials merely because they form one of the most widely used bases for function approximation. By doing so, we give up the idea of "exact recovery" of $f$, as exact recovery is not possible unless the true governing equations are strictly of polynomial form (which in fact is not uncommon). Instead, our approach focuses on obtaining an approximation of $f$ with sufficiently high accuracy. We remark that this is satisfactory for practical computations, as the exact governing equations (even if known) are often discretized and approximated by a certain numerical scheme after all. So long as the errors in the equation recovery/approximation step are sufficiently small, compared to the discretization errors, the approximated governing equations can be as effective as the true model. Even though our method will not recover the true equations exactly in most cases, we shall continue to loosely use the term "equation recovery" hereafter, interchangeably with "equation approximation".

To construct the approximation, we confine ourselves to a finite dimensional linear subspace $V \subset L^2_\omega(D)$ with $N = \dim V$. Without loss of generality, we let the subspace be $\Pi_n^d$, the linear space of polynomials of degree up to $n$. That is,

$$\Pi_n^d = \mathrm{span}\big\{x^{\mathbf{i}} = x_1^{i_1} \cdots x_d^{i_d}, \; |\mathbf{i}| \le n\big\}, \tag{12}$$

where $\mathbf{i} = (i_1, \dots, i_d)$ is a multi-index with $|\mathbf{i}| = i_1 + \cdots + i_d$. Immediately, we have

$$N = \dim \Pi_n^d = \binom{n+d}{d} = \frac{(n+d)!}{n! \, d!}. \tag{13}$$
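The dimension count (13) is a one-line computation; a minimal helper (names are ours, not the paper's):

```python
from math import comb

def poly_space_dim(n, d):
    """Dimension of Pi^d_n: d-variate polynomials of total degree <= n,
    i.e. binomial(n + d, d), per Eq. (13)."""
    return comb(n + d, d)
```

For instance, degree 3 in 2 variables gives 10 basis functions, while degree 5 in 3 variables already gives 56, which illustrates how quickly the number of coefficients grows with $n$ and $d$.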

Let $\{\phi_j(x)\}_{j=1}^{N}$ be a basis of $\Pi_n^d$. We then seek an approximation in a component-wise manner. That is, for the $\ell$-th component, we seek

$$\tilde{f}_\ell(x) = \sum_{j=1}^{N} c_{j,\ell} \, \phi_j(x), \qquad \ell = 1, \dots, d. \tag{14}$$

Upon adopting the following vector notations

$$c_\ell = (c_{1,\ell}, \dots, c_{N,\ell})^\top, \qquad \Phi(x) = (\phi_1(x), \dots, \phi_N(x))^\top, \tag{15}$$

we write

$$\tilde{f}_\ell(x) = \langle c_\ell, \Phi(x) \rangle, \qquad \ell = 1, \dots, d. \tag{16}$$

Our goal is to compute the expansion coefficients $c_\ell$, $\ell = 1, \dots, d$. Later, in Section 3.4, we introduce several algorithms to compute these coefficients.

Unless the true system (1) is of polynomial form, we shall not obtain an exact recovery of the equation. For most physical systems, however, the governing equations (1) consist of smooth functions, and we therefore expect polynomials to produce highly accurate approximations at low or modest degrees. On the other hand, if one possesses sufficient prior knowledge of the underlying system and is certain that particular terms should appear in the governing equations, then those terms can be incorporated in the dictionary, in addition to the standard polynomial basis.

### 3.3 Selection of Trajectories

Upon setting the basis for the approximation, it is important to "select" the data. Since the problem is now cast into an approximation problem in $D$, many results from approximation theory can be relied upon. One of the key facts from approximation theory is that, in order to construct an accurate function approximation, the data should fill up the underlying domain in a systematic manner. This implies that we should enforce the data pairings (7) to "spread" over the domain-of-interest $D$. Since the trajectories of many dynamical systems often evolve on certain low-dimensional manifolds, using a single trajectory, no matter how long it is, is unlikely to produce a set of data pairings suitable for accurate approximation, except for some special underlying systems such as chaotic systems [tran2017exact]. This is the reason why we propose to use a large number of short trajectories. Specifically, the trajectories are started from different initial conditions. The length of each trajectory may be very short – just long enough to allow one to estimate the time derivatives. More specifically, we conduct the following steps.

• Let $M$ be the number of trajectories, and let

$$x_0^{(m)} = u_0^{(m)} + \epsilon_0^{(m)}, \qquad m = 1, \dots, M,$$

be the different initial data, where $\epsilon_0^{(m)}$ is the observation error.

• For each initial state, collect trajectory data generated by the underlying system (1)

$$x^{(m)}(t_j) = u\big(t_j; u_0^{(m)}\big) + \epsilon_j^{(m)} \in D, \qquad 0 \le j \le J_m, \quad 1 \le m \le M, \tag{17}$$

where $J_m + 1$ is the number of data points collected for the $m$-th trajectory and $\epsilon_j^{(m)}$ is the observation noise.

• Collect or estimate the time derivatives of the trajectories by using the data in (17)

$$\dot{x}^{(m)}(t_j) = \dot{u}\big(t_j; u_0^{(m)}\big) + \tau^{(m)}(t_j), \qquad j \in \Theta_m, \quad 1 \le m \le M, \tag{18}$$

where $\tau^{(m)}$ stands for the estimation/observation errors, and $\Theta_m \subseteq \{0, 1, \dots, J_m\}$.

• The pairings

$$\big\{x^{(m)}(t_j), \, \dot{x}^{(m)}(t_j)\big\}, \qquad j \in \Theta_m, \quad 1 \le m \le M, \tag{19}$$

are the data we use to approximate the equation, where $x^{(m)}(t_0) = x_0^{(m)}$ is the initial state.

We propose to use a large number of short bursts of trajectories, i.e.,

$$J_m \sim O(1), \quad m = 1, \dots, M, \qquad M \gg 1. \tag{20}$$

The length of each trajectory, $J_m$, is very small – it should be merely long enough to allow accurate estimation of the time derivatives. In case the time derivatives are available by other means, e.g., direct measurement, then $J_m$ can be set to 0.

There are several approaches to choosing the different initial states from the region $D$. They include, but are not limited to:

• Sampling the initial states from some random distribution defined on , e.g., the uniform distribution.

• Sampling the initial states by quasi-Monte Carlo sequence, such as the Sobol or Halton sequence, or other lattice rules.

• If $D$ is a hypercube, the initial state set can be taken as a set of quadrature points (e.g., the Gaussian points) or a random subset thereof [ZhouNarayanXiu_2015]. One can also sample the quadrature points according to some discrete measure [WuShinXiu_SISC17].
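The data-collection steps above can be sketched as follows, using the first sampling option (uniformly random initial states). The RK4 stepper stands in for the unknown system that produced the measurements, and all numbers ($M$, $J_m$, the time step, and the toy right-hand side) are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def rk4_step(f, u, dt):
    """One classical fourth-order Runge-Kutta step."""
    k1 = f(u)
    k2 = f(u + 0.5 * dt * k1)
    k3 = f(u + 0.5 * dt * k2)
    k4 = f(u + dt * k3)
    return u + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)

def collect_bursts(f, lo, hi, M=1000, J=2, dt=0.01, rng=None):
    """M short bursts of J+1 points each, started from uniform samples in
    the hypercube [lo, hi] (the domain-of-interest D)."""
    rng = rng or np.random.default_rng(0)
    bursts = []
    for _ in range(M):
        u = rng.uniform(lo, hi)          # random initial state in D
        traj = [u]
        for _ in range(J):               # very short burst: J steps
            u = rk4_step(f, u, dt)
            traj.append(u)
        bursts.append(np.array(traj))
    return bursts

# Toy stand-in for the unknown dynamics: a linear oscillator
f = lambda u: np.array([u[1], -u[0]])
bursts = collect_bursts(f, np.array([-1.0, -1.0]), np.array([1.0, 1.0]), M=50)
```

Each burst is a $(J_m + 1) \times d$ array; the collection of all bursts is then fed to the derivative-estimation and regression steps of Section 3.4.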

### 3.4 Numerical Strategies

We now discuss several key aspects of the numerical implementation of the method.

#### 3.4.1 Computation of time derivatives

In most realistic situations, only the trajectory data are available, and the time derivatives need to be approximated numerically. Given a set of trajectory data, we seek to numerically estimate the time derivatives at certain time instances $t_j$, $j \in \Theta_m$. This is the classical problem of numerical differentiation of discrete data.

If the data samples are noiseless, the derivatives can be computed by a high-order finite difference method; e.g., for a burst with $J_m = 2$ and uniform spacing $\Delta t$, a second-order central approximation at $t_1$ is

$$\dot{x}^{(m)}(t_1) = \frac{x^{(m)}(t_2) - x^{(m)}(t_0)}{2\Delta t}. \tag{21}$$

However, when the data contain noise, estimation of derivatives can be challenging. In this case, if one uses the standard finite difference methods, the noise in the resulting first derivatives scales as $O(\Delta t^{-1})$ relative to the noise in the data. (See [wagner2015regularised] for more analysis.) Several approaches have been developed for numerical differentiation of noisy data. They include least squares approximation methods [knowles2014methods, wagner2015regularised], Tikhonov regularization [doi:10.1137/0708026], total variation regularization [chartrand2011numerical], the Knowles and Wallace variational method [knowles1995variational], etc. For an overview, see [knowles2014methods].

In this paper, we use the least squares approach to denoise the derivatives. The approach consists of constructing a least squares polynomial approximation of the trajectory, followed by an analytical differentiation of the approximation to obtain the derivatives. With the $m$-th burst of trajectory data, we first construct a polynomial $p^{(m)}(t)$ as a least squares approximation to the trajectory, and then estimate the derivatives as $\dot{x}^{(m)}(t_j) \approx \dot{p}^{(m)}(t_j)$.
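A minimal sketch of this denoising step, under assumptions of our own choosing (a fixed fitting degree of 3, derivatives evaluated at every sample time):

```python
import numpy as np

def burst_derivative(t, x, degree=3):
    """Fit each component of a (possibly noisy) burst x, of shape
    (len(t), d), with a degree-`degree` polynomial in t by least squares,
    then differentiate the fitted polynomial analytically at the times t."""
    t = np.asarray(t)
    x = np.atleast_2d(x)
    dx = np.empty_like(x, dtype=float)
    for l in range(x.shape[1]):
        coeffs = np.polyfit(t, x[:, l], degree)        # least squares fit
        dx[:, l] = np.polyval(np.polyder(coeffs), t)   # derivative of the fit
    return dx

# Noisy samples of u(t) = (sin t, cos t); true derivative is (cos t, -sin t)
rng = np.random.default_rng(1)
t = np.linspace(0.0, 0.5, 11)
u = np.column_stack([np.sin(t), np.cos(t)])
x = u + 1e-4 * rng.standard_normal(u.shape)
dx = burst_derivative(t, x)
```

Because the differentiation is applied to the smooth fitted polynomial rather than to the raw samples, the noise is not amplified by a factor of $\Delta t^{-1}$ as it would be with direct finite differences.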

We remark that other methods for derivative computation with a denoising effect can certainly be used. We found the least squares method to be fairly robust and easy to implement. When the noise level in the data becomes too large to obtain accurate derivatives, our method becomes unreliable. This is, however, a universal challenge for all equation recovery methods.

We also remark that even though the number of trajectories can be very large, i.e., $M \gg 1$, each trajectory is very short, with $J_m \sim O(1)$, so the total amount of data for the state variables scales as $O(M)$. In many cases this is fairly small.

#### 3.4.2 Recovery algorithms: Regression methods

The regression methods seek the expansion coefficients $c_\ell$ by solving the following minimization problems:

$$c_\ell = \operatorname*{argmin}_{c \in \mathbb{R}^N} \| A c - \dot{X}_\ell \|, \qquad \ell = 1, \dots, d, \tag{22}$$

where the norm can be taken as the vector $\ell_2$-norm, which corresponds to the well known least squares method and can handle noisy data. It can also be taken as the vector $\ell_1$-norm, which gives the least absolute deviations (LAD) method and can eliminate potential corruption errors in the data (cf. [candes2005decoding, ShinXiu]). If one would like to seek sparse coefficients $c_\ell$, the LASSO (least absolute shrinkage and selection operator) method [tibshirani1996regression] may be used,

$$c_\ell = \operatorname*{argmin}_{c \in \mathbb{R}^N} \frac{1}{2} \| A c - \dot{X}_\ell \|_2^2 + \lambda \|c\|_1, \qquad \ell = 1, \dots, d, \tag{23}$$

where $\lambda > 0$ is a parameter to be chosen. Since our method seeks an accurate approximation to the governing equations, rather than exact recovery, we focus on the use of least squares and LAD.
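The LAD option in (22) can be solved as a linear program. The following sketch (our own recasting, not the paper's implementation, which uses the $\ell_1$-Magic package) minimizes $\|Ac - b\|_1$ by introducing slack variables $t$ with $|Ac - b| \le t$ elementwise:

```python
import numpy as np
from scipy.optimize import linprog

def lad_fit(A, b):
    """Least absolute deviations: argmin_c ||A c - b||_1 via an LP in (c, t)."""
    M, N = A.shape
    cost = np.concatenate([np.zeros(N), np.ones(M)])      # minimize sum(t)
    A_ub = np.block([[A, -np.eye(M)],                     #  A c - t <= b
                     [-A, -np.eye(M)]])                   # -A c - t <= -b
    b_ub = np.concatenate([b, -b])
    bounds = [(None, None)] * N + [(0, None)] * M         # c free, t >= 0
    res = linprog(cost, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    return res.x[:N]

# Toy check: sparse corruption in b is rejected by the l1 fit
rng = np.random.default_rng(2)
A = rng.standard_normal((100, 3))
c_true = np.array([1.0, -2.0, 0.5])
b = A @ c_true
b[::20] += 5.0                    # corrupt 5% of the entries
c_lad = lad_fit(A, b)
```

In this toy setting the uncorrupted rows outnumber the corrupted ones by far, so the $\ell_1$ minimizer fits the clean rows exactly and ignores the outliers, which is precisely the corruption-rejection behavior exploited in the examples of Section 4.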

###### Remark 2

The methods discussed here are based on component-wise approximation: each component of $f$ is approximated independently. It is possible to seek a joint approximation of all components of $f$ simultaneously. A joint approximation may offer computational advantages and yet introduce new challenges. We shall leave this to a separate study.

#### 3.4.3 Recovery algorithms: Matrix-free method

When the data set is exceptionally large, or unlimited as in stream data, a matrix-free method such as the sequential approximation (SA) method (cf. [ShinXiu_SISC17, WuShinXiu_SISC17, WuXiu_JCP18, ShinWuXiu_JCP18]) can be more effective.

The SA method is iterative in nature. It aims at conducting the approximation using only one data point at each step. Starting from an arbitrary initial approximation, which corresponds to an arbitrary choice of the coefficient vectors $c_\ell^{(0)}$ at step $k = 0$, the iteration of the coefficient vector at step $k$ is

$$c_\ell^{(k)} = c_\ell^{(k-1)} + \frac{\dot{x}_\ell^{(k)}(t_{j_k}) - \big\langle c_\ell^{(k-1)}, \Phi\big(x^{(k)}(t_{j_k})\big) \big\rangle}{\big\| \Phi\big(x^{(k)}(t_{j_k})\big) \big\|_2^2 + \gamma_k} \, \Phi\big(x^{(k)}(t_{j_k})\big), \qquad k = 1, \dots, \tag{24}$$

where $\gamma_k \ge 0$ are parameters chosen according to the noise level in the $k$-th piece of data, and $\dot{x}_\ell^{(k)}$ denotes the $\ell$-th component of $\dot{x}^{(k)}$. For noiseless data, one can set $\gamma_k = 0$.

The scheme is derived by minimizing a cost function that consists of the data mismatch at the currently arriving data point and the difference in the polynomial approximation between two consecutive steps. The method is particularly suitable for stream data arriving sequentially in time, as it uses only one data point at each step and does not require storage or knowledge of the entire data set. Moreover, the scheme uses only vector operations and does not involve any matrices. For more details of the method, including convergence and error analysis, see [ShinXiu_SISC17, WuShinXiu_SISC17, WuXiu_JCP18, ShinWuXiu_JCP18].
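The update (24) is a single vector operation per incoming sample. A minimal sketch for one component, on a scalar toy model of our own choosing (with $\gamma_k = 0$ for noiseless data):

```python
import numpy as np

def sa_update(c, phi, xdot, gamma=0.0):
    """One sequential-approximation step, Eq. (24):
    c <- c + (xdot - <c, phi>) / (||phi||^2 + gamma) * phi."""
    phi = np.asarray(phi)
    return c + (xdot - c @ phi) / (phi @ phi + gamma) * phi

# Stream of samples from the scalar model dx/dt = 1 - 2x + 0.5 x^2,
# approximated in the dictionary Phi(x) = [1, x, x^2]
rng = np.random.default_rng(3)
c = np.zeros(3)                        # arbitrary initial coefficients
for _ in range(5000):                  # one data point per step, no matrix
    x = rng.uniform(-1, 1)
    phi = np.array([1.0, x, x * x])
    c = sa_update(c, phi, 1.0 - 2.0 * x + 0.5 * x * x)
```

Note that only the current data point is touched at each step; no model matrix is ever assembled, which is what makes the method suitable for streaming or extremely large data sets.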

### 3.5 Error analysis

Assuming the aforementioned numerical algorithms construct a sufficiently accurate approximation $\tilde{f}$ of $f$, we can estimate the error in the solution of the recovered system (11).

Let $x(t)$ be the exact solution of the recovered system (11) and $u(t)$ the exact solution of the true system (1). Assuming that both solutions remain in $D$, we have the following error bound for the difference between $x(t)$ and $u(t)$.

###### Theorem 1

Assume $f$ is Lipschitz continuous on $D$,

$$\| f(u_1) - f(u_2) \|_2 \le L_f \| u_1 - u_2 \|_2, \qquad \forall u_1, u_2 \in D.$$

If $x(t_0) = u(t_0) = u_0$, then for $t \ge t_0$,

$$\| x(t) - u(t) \|_2 \le (t - t_0) \, \| \tilde{f} - f \|_{2, L^\infty} + L_f \int_{t_0}^{t} \| x(s) - u(s) \|_2 \, ds. \tag{25}$$

Furthermore,

$$\| x(t) - u(t) \|_2 \le L_f^{-1} \big( e^{L_f (t - t_0)} - 1 \big) \Big( \| f - P_V f \|_{2, L^\infty} + \| \tilde{f} - P_V f \|_{2, L^2_\omega} \, \big\| \sqrt{K} \big\|_{L^\infty} \Big), \tag{26}$$

where the operator $P_V$ denotes the component-wise projection onto $V$, and $K(x)$ denotes the "diagonal" of the reproducing kernel of $V$. With any orthonormal basis $\{\psi_j\}_{j=1}^N$ of $V$, the kernel can be expressed as $K(x) = \sum_{j=1}^{N} \psi_j^2(x)$.

The proof can be found in Appendix A. We remark that the error bound here is rather crude, as it addresses a general system of equations. More refined error bounds can be derived when one has more knowledge of the underlying system of equations.

## 4 Numerical Examples

In this section we present numerical examples to verify the performance of the proposed methods for recovering the ordinary differential equations (ODEs) and semi-explicit differential algebraic equations (DAEs).

In all the test cases, we assume that there is no prior knowledge about the governing equations. Unless otherwise stated, the initial states are sampled from the uniform distribution over the region $D$, and the basis is taken as polynomials. We employ a variety of test cases of both linear and nonlinear systems of ODEs. The examples include relatively simple and well understood ODE systems from the textbook [boyce2009elementary], as well as more complex nonlinear systems of coupled ODEs and DAEs. Moreover, we also introduce corruption errors in the trajectory data, in addition to the standard random noises. We examine not only the errors in approximating the exact equations, but also the errors in the predictions made by the recovered equations.

Our primary focus is to demonstrate the flexibility of using polynomial approximation for accurate equation approximation, rather than exact recovery, as well as the effect of using a large number of short bursts of trajectories. For the actual computation of the approximation, we primarily use the standard least squares method. When data corruption exists, we also use LAD, i.e., $\ell_1$ regression, employing the software package $\ell_1$-Magic [candes2005l1]. We also include an example with a very large data set and use the sequential method to demonstrate the flexibility of the matrix-free approach.

### 4.1 Linear ODEs

We first present a set of numerical results for simple linear ODEs with $d = 2$, to examine the performance of the methods in recovering the structure of ODEs around different types of critical points. All examples are textbook problems from [boyce2009elementary].

###### Example 1

Six linear ODE systems are considered here, and their structures cover all the elementary critical points. The descriptions of the systems and the corresponding numerical results are displayed in Table 1. The first and second tests use noiseless trajectory data. The third and fourth tests use noisy trajectory data, where the noise follows a uniform distribution. The last two tests consider trajectory data in which two uncertain (randomly chosen) sequences contain large corruption errors following a normal distribution. It is seen from Table 1 that the proposed methods are able to recover the correct structure of the ODE systems from few measurement data. The values of the approximate coefficients in the learned governing equations are very close to the true values.

We now discuss the need for using a large number of short bursts of trajectory data. Let us consider the fourth linear ODE system in Table 1, which is the nodal sink system

$$\dot{u}_1 = -2u_1 + u_2 - 2, \qquad \dot{u}_2 = u_1 - 2u_2 + 1. \tag{27}$$

Assume that the observed data are from a single trajectory

$$x(t_j; u_0) = u(t_j; u_0),$$

originated from the initial state $u_0$. If $u_0$ belongs to the line $\ell := \{u_2 = u_1 + 1\}$, which is invariant under (27), then all the data on the trajectory belong to $\ell$. It is straightforward to see that the solution of the following linear system

$$\dot{u}_1 = -u_2 + \eta_1 (u_2 - u_1 - 1), \qquad \dot{u}_2 = -u_2 + \eta_2 (u_2 - u_1 - 1), \tag{28}$$

can exactly fit the data on $\ell$, for arbitrary $\eta_1, \eta_2 \in \mathbb{R}$. This implies that recovering (27) becomes an ill-posed problem, as there are infinitely many linear systems (28) consistent with such a single trajectory data set. From the point of view of function approximation, this is not surprising, as a single trajectory of a given system often falls onto a certain low-dimensional manifold (unless the system is chaotic). Approximating an unknown function using data from a low-dimensional manifold is highly undesirable and can be ill-posed. The use of a large number of short bursts of trajectories effectively eliminates this potential difficulty.

### 4.2 Nonlinear ODEs

We now test the proposed methods on nonlinear ODE systems exhibiting more complex dynamics. In general, nonlinear ODEs do not have explicit analytical solutions; we therefore generate the measurement data by numerically solving the ODEs using the fourth-order explicit Runge-Kutta method with a small time step-size.

###### Example 2

We first consider the undamped Duffing equation

$$\ddot{u} + u + \epsilon u^3 = 0,$$

where $\epsilon$ is a small positive number. Taking $u_1 = u$ and $u_2 = \dot{u}$ gives

$$\dot{u}_1 = u_2, \qquad \dot{u}_2 = -u_1 - \epsilon u_1^3. \tag{29}$$

From a set of short bursts of noiseless trajectory data, we learn the governing system using different regression methods. The errors of the approximate expansion coefficients in the learned systems are displayed in Fig. 1. As we can see, the coefficients are recovered with high accuracy by both the least squares and the LAD regressions.
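The full pipeline on the Duffing system (29) can be sketched end to end, combining short-burst collection, central-difference derivatives (21), and the least squares solve of (10). All parameters here ($\epsilon = 0.1$, 200 three-point bursts, $\Delta t = 0.01$, sampling box, degree-3 dictionary) are our own illustrative choices, not those used in the paper's experiments.

```python
import numpy as np

eps = 0.1                                     # assumed value for illustration
f = lambda u: np.array([u[1], -u[0] - eps * u[0] ** 3])

def rk4(u, dt):
    k1 = f(u); k2 = f(u + 0.5*dt*k1); k3 = f(u + 0.5*dt*k2); k4 = f(u + dt*k3)
    return u + dt / 6.0 * (k1 + 2*k2 + 2*k3 + k4)

def phi(x):
    """Total-degree-3 monomial dictionary in (x1, x2); N = 10, cf. Eq. (13)."""
    x1, x2 = x
    return np.array([1, x1, x2, x1**2, x1*x2, x2**2,
                     x1**3, x1**2 * x2, x1 * x2**2, x2**3], dtype=float)

rng = np.random.default_rng(4)
dt, rows, rhs = 0.01, [], []
for _ in range(200):                          # 200 bursts of 3 points each
    u0 = rng.uniform(-2, 2, size=2)           # random initial state
    u1 = rk4(u0, dt)
    u2 = rk4(u1, dt)
    rows.append(phi(u1))                      # state at the midpoint ...
    rhs.append((u2 - u0) / (2 * dt))          # ... central-difference derivative
C, *_ = np.linalg.lstsq(np.array(rows), np.array(rhs), rcond=None)
```

Since (29) is itself a degree-3 polynomial system, the recovered columns of `C` should match the true coefficients up to the $O(\Delta t^2)$ differentiation error, which is the "exact form happens to lie in the dictionary" situation noted in Section 3.2.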

###### Example 3

We consider

$$\dot{u}_1 = u_1 (1 - u_1 - u_2), \qquad \dot{u}_2 = u_2 (0.5 - 0.25 u_2 - 0.75 u_1), \tag{30}$$

whose qualitative behavior was studied in [boyce2009elementary]. We take a domain-of-interest containing the four critical points: an unstable node $(0, 0)$, two asymptotically stable nodes $(1, 0)$ and $(0, 2)$, and a saddle point $(0.5, 0.5)$. Suppose we have a set of noisy trajectory data, where all the data contain i.i.d. random noises following a uniform distribution. Learning from these data, the least squares regression method produced the following approximate system

$$\dot{x}_1 = x_1 (1.0217 - 1.0166 x_1 - 1.0164 x_2) - 0.01695 + 0.0350 x_2 - 0.0128 x_2^2,$$
$$\dot{x}_2 = x_2 (0.50914 - 0.2500 x_2 - 0.7544 x_1) - 0.0160 + 0.0086 x_1 - 0.0016 x_1^2. \tag{31}$$

Although the data contain noises at a relatively large level in this case, the proposed method still recovers the coefficients with small errors. In Figure 2, the phase portraits of the original and recovered systems are plotted to provide a qualitative comparison between the true and the learned dynamics. Good agreement is observed in the structures, and the four critical points are correctly detected. Figure 3 shows the solutions of the original and recovered systems from the same initial state, as well as the errors in the learned solution.

###### Example 4

We now consider

 \dot{u}_1 = u_2 - u_1(u_1^2 + u_2^2 - 1), \qquad \dot{u}_2 = -u_1 - u_2(u_1^2 + u_2^2 - 1). (32)

An important feature of this example is that we assume the state variables inside the unit disk are not accessible; therefore we have data only from outside this region. This rather arbitrary choice of data accessibility mimics practical situations where data collection may be subject to certain restrictions. We also introduce relatively large corruption errors into the data. We therefore have a nonstandard region where data are available and corrupted.

First, we use bursts of trajectory data that are very short, each consisting of only 3 data points, where a portion of the data contain relatively large corruption errors following an i.i.d. normal distribution. (Note the relatively large bias in the mean value.) We then use ℓ1-regression to learn the governing equations from these data, and obtain

 \dot{x}_1 = 0.9986x_2 - x_1(1.0020x_1^2 + 1.0018x_2^2 - 1.0047) - 0.0006 + 0.0005x_1^2 - 0.0001x_1x_2 + 0.0002x_2^2 + 0.0006x_1^2x_2 + 0.0005x_2^3,
 \dot{x}_2 = -1.0025x_1 - x_2(1.0015x_1^2 + 1.0011x_2^2 - 1.0025) - 0.0005 - 0.00008x_1^2 + 0.0004x_1x_2 + 0.0004x_2^2 + 0.0006x_1^3 + 0.0005x_1x_2^2. (33)

We see that the ℓ1-regression can effectively eliminate the sparse corruption errors, and the coefficients are recovered with high accuracy. The phase portraits in Figure 4 further exhibit the accurate dynamics of the learned system (33). We also validate the solution of the recovered system (33) with an arbitrarily chosen initial state; see Figure 5. The solution agrees well with the reference solution obtained from the exact equation. The growth of the error over time is expected, as with any numerical solver for time-dependent problems.
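The robustness of the ℓ1 objective against sparse corruption can be illustrated with a self-contained toy fit. The data, the 10% corruption rate, and the solver below are made up for the demonstration: instead of the convex program used in the paper, the sketch approximates the ℓ1 minimizer by iteratively reweighted least squares.

```python
import numpy as np

def l1_regression(Phi, b, n_iter=50, delta=1e-8):
    """Approximately minimize ||Phi c - b||_1 by iteratively reweighted
    least squares: each pass solves a weighted L2 problem whose weights
    1/|residual| down-weight the few grossly corrupted rows."""
    c, *_ = np.linalg.lstsq(Phi, b, rcond=None)
    for _ in range(n_iter):
        w = 1.0 / np.sqrt(np.abs(Phi @ c - b) + delta)
        c, *_ = np.linalg.lstsq(w[:, None] * Phi, w * b, rcond=None)
    return c

# Toy check (made-up data): fit y = 2x + 1 when 10% of the observations
# carry a large bias, mimicking sparsely corrupted trajectory data.
rng = np.random.default_rng(2)
x = rng.uniform(-1.0, 1.0, 300)
y = 2.0 * x + 1.0
y[:30] += 5.0  # sparse gross corruption
Phi = np.column_stack([np.ones_like(x), x])

c_l2, *_ = np.linalg.lstsq(Phi, y, rcond=None)
c_l1 = l1_regression(Phi, y)
print(c_l2, c_l1)  # the L2 fit is biased; the L1 fit stays near (1, 2)
```

The least squares fit absorbs the corruption bias into its coefficients, while the ℓ1 fit effectively ignores the outlying rows, which is the same mechanism at work in the recovery of (33).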

We now consider data with standard random noises only, and study the effect of the noise level on the accuracy of the recovered equations. Suppose we have sets of trajectory data, all of which contain i.i.d. random noises following a uniform distribution whose half-width represents the noise level; a zero half-width corresponds to noiseless data. The errors of the expansion coefficients of the learned systems are displayed in Fig. 6 for different noise levels. The vector 2-norms of the coefficient errors are shown in Fig. 7. The errors clearly grow with the noise level in the data, which is not surprising.

###### Example 5

We now consider a damped pendulum problem. Let l be the length of the pendulum, α the damping constant, and g the gravitational constant. Then, for unit mass, the governing equation is well known:

 \frac{d^2\theta}{dt^2} + \frac{\alpha}{l}\frac{d\theta}{dt} + \frac{g}{l}\sin\theta = 0,

where θ is the angle from the equilibrium vertical position. By defining u1 = θ and u2 = dθ/dt, we rewrite the equation as

 \dot{u}_1 = u_2, \qquad \dot{u}_2 = -\frac{g}{l}\sin u_1 - \frac{\alpha}{l} u_2. (34)

We set the parameter values and use a large number of short bursts of trajectory data to recover the equation. The initial state of each trajectory is randomly and uniformly chosen from the region. The least squares regression is used, with polynomial approximations of different degrees. Figure 8 shows the phase portraits of the system (34) and the learned systems. As the polynomial degree increases, the recovered dynamics becomes more accurate and captures the true structure of the solution. A more detailed examination of the learned system is shown in Figure 9 for a fixed polynomial degree. We observe good agreement between the solution of the learned system and the true solution from an arbitrarily chosen initial state. The error remains bounded at a small level over time.
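The role of the degree here is transparent: the only non-polynomial term in (34) is sin(u1), so the learned model can be at best as accurate as the best polynomial approximation of sin over the sampled angle range. A short sketch (the range [-π, π] is an illustrative choice) shows how quickly that approximation error decays with the degree:

```python
import numpy as np

# Least squares polynomial fits of sin on [-pi, pi]: this bounds the
# attainable accuracy of a degree-n polynomial model of the pendulum (34).
u = np.linspace(-np.pi, np.pi, 2000)
target = np.sin(u)

errors = {}
for n in (1, 3, 5, 7):
    coef = np.polynomial.polynomial.polyfit(u, target, n)
    fit = np.polynomial.polynomial.polyval(u, coef)
    errors[n] = np.max(np.abs(fit - target))
    print(n, errors[n])  # the max error drops rapidly with the degree
```

This rapid decay is consistent with the observation in Figure 8 that the recovered dynamics sharpens as the degree increases.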

### 4.3 Nonlinear DAEs

We now study the recovery of semi-explicit DAEs (5b) from measurement data. The learned approximating DAEs are denoted by

 \frac{dx(t)}{dt} = \tilde{f}(x(t)), (35a)
 y(t) = \tilde{g}(x(t)), (35b)

where ~f and ~g are approximations to f and g, respectively. In the tests, the measurement data are first generated by numerically solving the true DAEs with the fourth-order explicit Runge-Kutta method and a small time step. Noises or corruption errors are then added to the trajectories to generate our synthetic data.

###### Example 6

The first DAE model we consider describes a nonlinear electric network pulch2013polynomial (), defined as

 \dot{u}_1 = v_2/C,
 \dot{u}_2 = u_1/L,
 0 = v_1 - (G_0 - G_\infty) U_0 \tanh(u_1/U_0) - G_\infty u_1,
 0 = v_2 + u_2 + v_1, (36)

where the unknowns consist of a node voltage and branch currents. Following pulch2013polynomial (), the physical parameters are specified accordingly. In our test, sequences of trajectory data are generated by numerically solving (36) and then adding random noises: the noises in two of the variables follow uniform distributions, while the noises in the other two follow normal distributions. We then use regression to learn the governing equations from these data. The phase portraits of the system and the learned system are compared in Figure 10. It is observed that the proposed method accurately reproduces the dynamics and detects the correct features. A comparison of the solutions of the recovered DAEs and the true system is shown in Figure 11, using an arbitrary choice of the initial state. The evolution of the numerical errors in the solution of the learned system is shown in Figure 12.

###### Example 7

We now consider a model for a genetic toggle switch in Escherichia coli, which was proposed in gardner2000construction () and numerically studied in xiu2007efficient (). It is composed of two repressors and two constitutive promoters, where each promoter is inhibited by the repressor that is transcribed by the opposing promoter. Details of the experimental measurements can be found in gardner2000construction (), where the following model was also constructed:

 \dot{u}_1 = \frac{\alpha_1}{1 + u_2^{\beta}} - u_1,
 \dot{u}_2 = \frac{\alpha_2}{1 + v_1^{\gamma}} - u_2,
 0 = v_1 + \varepsilon \sin v_1 - \frac{u_1}{(1 + [\mathrm{IPTG}]/K)^{\eta}}. (37)

In the last equation, we add the small perturbation term ε sin v1 to increase the nonlinearity. Here, u1 and u2 are the concentrations of the two repressors; α1 and α2 are the effective rates of synthesis of the repressors; β and γ represent the cooperativity of repression of the two promoters; and [IPTG] is the concentration of IPTG, the chemical compound that induces the switch. The values of these parameters are fixed in the test, and we generate the measurement data by numerically solving (37). Suppose we have bursts of trajectory data, of which five arbitrarily chosen bursts contain corruption errors following a normal distribution. We employ the ℓ1-regression approach with a normalized Legendre polynomial basis. The phase portraits of the system and the learned system are plotted in Figure 13. As we can see, the learned structures agree well with the true structures everywhere in the region. This demonstrates the ability of our method to find the underlying governing equations. The accuracy of the learned model is further examined in Figures 14 and 15.

###### Example 8

Our last example is a kinetic model of an isothermal batch reactor system, which was proposed in biegler1986nonlinear () and studied in becerra2001applying (); kadu2010modified (), among others. It was mentioned in becerra2001applying () that the example in its original form was given by the Dow Chemical Company as a challenge test for parameter estimation methods. More details about this model can be found in becerra2001applying (). The model consists of six differential mass-balance equations, an electroneutrality condition, and three equilibrium conditions:

 \dot{u}_1 = -k_2 u_2 v_2,
 \dot{u}_2 = -k_1 u_2 u_6 + k_{-1} u_4 - k_2 u_2 v_2,
 \dot{u}_3 = k_2 u_2 v_2 + k_3 u_4 u_6 - k_{-3} v_3,
 \dot{u}_4 = -k_3 u_4 u_6 + k_{-3} v_3,
 \dot{u}_5 = k_1 u_2 u_6 - k_{-1} v_4,
 \dot{u}_6 = -k_1 u_2 u_6 + k_{-1} v_4 - k_3 u_4 u_6 + k_{-3} v_3,
 [Q^+] - u_6 + v_1 - v_2 - v_3 - v_4 = 0,
 v_2 - K_2 u_1/(K_2 + v_1) = 0,
 v_3 - K_3 u_3/(K_3 + v_1) = 0,
 v_4 - K_1 u_5/(K_1 + v_1) = 0. (38)

In our test, the values of the parameters are fixed. Assume that one can collect many sets of time-series measurement data of the differential and algebraic variables, starting from arbitrary initial states. We collect bursts of trajectory data whose initial states are sampled from the tensorized Chebyshev distribution on the region. The data set is deliberately large, so as to enable accurate discovery of the complicated reaction process. Handling such a (relatively) large-scale problem with the standard regression algorithms can be challenging, primarily due to the large size of the matrices. Without resorting to more sophisticated regression methods for large matrices (a topic outside our scope), we employ the sequential approximation (SA) algorithm to demonstrate its capability of handling large data sets. We employ a normalized Legendre polynomial basis, which induces a large number of basis functions. Convergence is examined via the errors of the function approximation at each iteration step, compared with the exact functions. To evaluate the errors, we independently draw a set of 20,000 uniform samples and compute the difference between the approximate and exact functions at these sampling points; the vector 2-norm is then used to report the difference. The relative errors in each iteration of SA are displayed in Figure 16 with respect to the number of iterations. The SA method uses one piece of data at a time and proves effective in this case. As the iteration continues, more data are used and more accurate approximations are obtained, yielding more accurate models. Exponential-type convergence of the errors is observed, which is consistent with the theoretical analysis in ShinXiu_SISC17 (); WuXiu_JCP18 (); ShinWuXiu_JCP18 (). We also validate the final system from a chosen initial state and observe good agreement between the solutions of the recovered system and the exact system in Figure 17.
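The one-data-pair-per-iteration character of SA can be conveyed by a minimal Kaczmarz-type sketch; this is not a restatement of the precise update in ShinXiu_SISC17 (), and the one-dimensional dictionary and target function below are hypothetical choices for illustration only.

```python
import numpy as np

rng = np.random.default_rng(3)

def psi(x):
    """Hypothetical one-dimensional dictionary: monomials 1, x, x^2, x^3."""
    return np.array([1.0, x, x ** 2, x ** 3])

# Target 'governing function', chosen to lie in the span of the dictionary.
c_true = np.array([0.5, -1.0, 0.0, 2.0])
f = lambda x: psi(x) @ c_true

# Sequential approximation sketch: each iteration takes one data pair
# (x, f(x)) and projects the current coefficient vector onto the
# constraint <c, psi(x)> = f(x), a Kaczmarz-type update.
c = np.zeros(4)
for _ in range(5000):
    x = rng.uniform(-1.0, 1.0)
    p = psi(x)
    c += (f(x) - p @ c) * p / (p @ p)

print(np.linalg.norm(c - c_true))  # decays with the iteration count
```

Because each update is an orthogonal projection, the coefficient error never increases and, for consistent data, contracts at an exponential-type rate on average, matching the convergence behavior reported in Figure 16.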

## 5 Conclusion

In this paper we studied several effective numerical algorithms for the data-driven discovery of nonlinear differential/algebraic equations. We proposed to use standard basis functions, such as polynomials, to construct accurate approximations of the true governing equations, rather than seeking exact recovery. We also discussed the importance of using a (large) number of short bursts of trajectory data, rather than data from a single (long) trajectory. In conjunction with standard approximation algorithms, e.g., the least squares method, the overall approach can produce highly accurate approximations of the true governing equations. Using an extensive set of test examples, ranging from textbook problems to more complicated practical models, we demonstrated that the proposed method can be highly effective in many situations.

## Appendix A Proof of Theorem 1

We first introduce the following lemma.

###### Lemma 1

Under the assumptions of Theorem 1, we have

 \|\tilde{f} - f\|_{2,L^\infty} \le \|f - P_V f\|_{2,L^\infty} + \|\tilde{f} - P_V f\|_{2,L^2_\omega} \, \|\sqrt{K}\|_{L^\infty}. (39)
##### Proof

The triangle inequality gives

 \|\tilde{f} - f\|_{2,L^\infty} \le \|f - P_V f\|_{2,L^\infty} + \|\tilde{f} - P_V f\|_{2,L^\infty}. (40)

Let Ψ(x) be an orthonormal basis of V, and write

 P_V f_\ell =: \langle \hat{c}_\ell, \Psi(x) \rangle, \qquad \tilde{f}_\ell =: \langle \tilde{c}_\ell, \Psi(x) \rangle.

Then K(x) = \|\Psi(x)\|_2^2. Note for any x that

 \|\tilde{f}(x) - P_V f(x)\|_2^2 = \sum_{\ell=1}^d \big|\tilde{f}_\ell(x) - P_V f_\ell(x)\big|^2 = \sum_{\ell=1}^d \big|\langle \tilde{c}_\ell - \hat{c}_\ell, \Psi(x) \rangle\big|^2
 \le \sum_{\ell=1}^d \|\tilde{c}_\ell - \hat{c}_\ell\|_2^2 \, \|\Psi(x)\|_2^2 = \sum_{\ell=1}^d \|\tilde{f}_\ell - P_V f_\ell\|_{L^2_\omega}^2 \, K(x) = \|\tilde{f} - P_V f\|_{2,L^2_\omega}^2 \, K(x),

where the Cauchy-Schwarz inequality and the orthonormality of the basis have been used. Taking the L^\infty-norm in the above inequality and using (40) gives (39).

Based on the above lemma, the proof of Theorem 1 is given as follows.

##### Proof

Integrating the systems (1) and (11) respectively gives

 u(t) = u(t_0) + \int_{t_0}^t f(u(s))\, ds, (41)
 x(t) = x(t_0) + \int_{t_0}^t \tilde{f}(x(s))\, ds. (42)

Subtracting (41) from (42), with x(t_0) = u(t_0), we obtain

 \|x(t) - u(t)\|_2 = \Big\| \int_{t_0}^t \tilde{f}(x(s)) - f(u(s))\, ds \Big\|_2 = (t - t_0)\, \eta\Big( \frac{1}{t - t_0} \int_{t_0}^t \tilde{f}(x(s)) - f(u(s