Neural Ordinary Differential Equations
Abstract
We introduce a new family of deep neural network models. Instead of specifying a discrete sequence of hidden layers, we parameterize the derivative of the hidden state using a neural network. The output of the network is computed using a black-box differential equation solver. These continuous-depth models have constant memory cost, adapt their evaluation strategy to each input, and can explicitly trade numerical precision for speed. We demonstrate these properties in continuous-depth residual networks and continuous-time latent variable models. We also construct continuous normalizing flows, a generative model that can train by maximum likelihood, without partitioning or ordering the data dimensions. For training, we show how to scalably backpropagate through any ODE solver, without access to its internal operations. This allows end-to-end training of ODEs within larger models.
[Figure 1 — panels: Residual Network, ODE Network.]

1 Introduction
Models such as residual networks, recurrent neural network decoders, and normalizing flows build complicated transformations by composing a sequence of transformations to a hidden state:
$\mathbf{h}_{t+1} = \mathbf{h}_t + f(\mathbf{h}_t, \theta_t)$   (1)
where $t \in \{0, \dots, T\}$ and $\mathbf{h}_t \in \mathbb{R}^D$. These iterative updates can be seen as an Euler discretization of a continuous transformation (Lu et al., 2017; Haber and Ruthotto, 2017; Ruthotto and Haber, 2018).
What happens as we add more layers and take smaller steps? In the limit, we parameterize the continuous dynamics of hidden units using an ordinary differential equation (ODE) specified by a neural network:
$\frac{d\mathbf{h}(t)}{dt} = f(\mathbf{h}(t), t, \theta)$   (2)
Starting from the input layer $\mathbf{h}(0)$, we can define the output layer $\mathbf{h}(T)$ to be the solution to this ODE initial value problem at some time $T$. This value can be computed by a black-box differential equation solver, which evaluates the hidden unit dynamics $f$ wherever necessary to determine the solution with the desired accuracy. Figure 1 contrasts these two approaches.
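As an illustrative sketch (not the paper's implementation), the forward pass of an ODE network can be written with an off-the-shelf adaptive solver; the dynamics function, its weights, and the integration interval $[0, 1]$ below are hypothetical choices:

```python
import numpy as np
from scipy.integrate import solve_ivp

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(4, 4))    # hypothetical dynamics weights (theta)

def hidden_dynamics(t, h):
    # f(h(t), t, theta): a single tanh layer parameterizing dh/dt
    return np.tanh(W @ h)

h0 = np.ones(4)                           # "input layer" h(0)
sol = solve_ivp(hidden_dynamics, (0.0, 1.0), h0, rtol=1e-6, atol=1e-8)
h_T = sol.y[:, -1]                        # "output layer" h(T)
```

The solver chooses its own evaluation points; `sol.nfev` reports how many times the dynamics were evaluated, the quantity later interpreted as an implicit depth.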
Defining and evaluating models using ODE solvers has several benefits:
Memory efficiency
In Section 2, we show how to compute gradients of a scalar-valued loss with respect to all inputs of any ODE solver, without backpropagating through the operations of the solver. Not storing any intermediate quantities of the forward pass allows us to train our models with nearly constant memory cost as a function of depth, a major bottleneck of training deep models.
Adaptive computation
Euler’s method is perhaps the simplest method for solving ODEs. There have since been more than 120 years of development of efficient and accurate ODE solvers (Runge, 1895; Kutta, 1901; Hairer et al., 1987). Modern ODE solvers provide guarantees about the growth of approximation error, monitor the level of error, and adapt their evaluation strategy on the fly to achieve the requested level of accuracy. This allows the cost of evaluating a model to scale with problem complexity. After training, accuracy can be reduced for real-time or low-power applications.
Parameter efficiency
When the hidden unit dynamics are parameterized as a continuous function of time, the parameters of nearby “layers” are automatically tied together. In Section 3, we show that this reduces the number of parameters required on a supervised learning task.
Scalable and invertible normalizing flows
An unexpected side-benefit of continuous transformations is that the change of variables formula becomes easier to compute. In Section 4, we derive this result and use it to construct a new class of invertible density models that avoids the single-unit bottleneck of normalizing flows, and can be trained directly by maximum likelihood.
Continuous timeseries models
Unlike recurrent neural networks, which require discretizing observation and emission intervals, continuously-defined dynamics can naturally incorporate data which arrives at arbitrary times. In Section 5, we construct and demonstrate such a model.
2 Reverse-mode automatic differentiation of ODE solutions
The main technical difficulty in training continuous-depth networks is performing reverse-mode differentiation (also known as backpropagation) through the ODE solver. Differentiating through the operations of the forward pass is straightforward, but incurs a high memory cost and introduces additional numerical error.
We treat the ODE solver as a black box, and compute gradients using the adjoint sensitivity method. It computes gradients by solving a second, augmented ODE backwards in time, and is applicable to all ODE solvers. This approach scales linearly with problem size, has low memory cost, and explicitly controls numerical error.
Consider optimizing a scalar-valued loss function $L$, whose input is the result of an ODE solver:
$L(\mathbf{z}(t_1)) = L\Big(\mathbf{z}(t_0) + \int_{t_0}^{t_1} f(\mathbf{z}(t), t, \theta)\, dt\Big) = L(\mathrm{ODESolve}(\mathbf{z}(t_0), f, t_0, t_1, \theta))$   (3)
To optimize $L$, we require gradients with respect to its inputs: the parameters $\theta$, the initial state $\mathbf{z}(t_0)$, and the times $t_0$ and $t_1$. The first step is to determine how the gradient of the loss depends on the hidden state $\mathbf{z}(t)$ at each instant. This quantity is called the adjoint $\mathbf{a}(t) = \partial L / \partial \mathbf{z}(t)$. Its dynamics are given by another ODE, which can be thought of as the instantaneous analog of the chain rule:
$\frac{d\mathbf{a}(t)}{dt} = -\mathbf{a}(t)^\top \frac{\partial f(\mathbf{z}(t), t, \theta)}{\partial \mathbf{z}}$   (4)
We can compute $\mathbf{a}(t)$ by another call to an ODE solver. This solver must run backwards, starting from the “initial” value $\mathbf{a}(t_1) = \partial L / \partial \mathbf{z}(t_1)$. One complication is that solving this ODE requires knowing the value of $\mathbf{z}(t)$ along its entire trajectory. However, we can simply recompute $\mathbf{z}(t)$ backwards in time together with the adjoint, starting from its final value $\mathbf{z}(t_1)$.
Computing the gradients with respect to the parameters $\theta$ requires evaluating a third integral, which depends on both $\mathbf{z}(t)$ and $\mathbf{a}(t)$:
$\frac{dL}{d\theta} = -\int_{t_1}^{t_0} \mathbf{a}(t)^\top \frac{\partial f(\mathbf{z}(t), t, \theta)}{\partial \theta}\, dt$   (5)
All three of these integrals can be computed in a single call to an ODE solver, which concatenates the original state, the adjoint, and the other partial derivatives into a single vector. Algorithm 1 shows how to construct the necessary dynamics, and call an ODE solver to compute all gradients at once.
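A toy sketch of this single augmented backward solve, for hypothetical linear dynamics $f(\mathbf{z}, t, \theta) = A\mathbf{z}$ (where $\theta$ is the matrix $A$); the loss, tolerances, and finite-difference check are illustrative choices, not the paper's setup:

```python
import numpy as np
from scipy.integrate import solve_ivp

A = np.array([[0.0, 1.0], [-1.0, -0.5]])         # hypothetical parameters theta
z0, t0, t1 = np.array([1.0, 0.0]), 0.0, 1.0

def forward(Amat):
    # forward pass: z(t1) = ODESolve(z0, f, t0, t1, theta)
    sol = solve_ivp(lambda t, z: Amat @ z, (t0, t1), z0, rtol=1e-10, atol=1e-10)
    return sol.y[:, -1]

z1 = forward(A)
loss = z1.sum()                                  # L(z(t1)) = sum of final state

def aug_dynamics(t, s):
    # augmented state s = [z, a, dL/dtheta], solved backwards in time
    z, a = s[:2], s[2:4]
    return np.concatenate([
        A @ z,                                   # recompute z(t) backwards
        -A.T @ a,                                # adjoint: da/dt = -a^T df/dz
        -np.outer(a, z).ravel(),                 # accumulate -a^T df/dtheta
    ])

s1 = np.concatenate([z1, np.ones(2), np.zeros(4)])   # a(t1) = dL/dz(t1) = 1
s0 = solve_ivp(aug_dynamics, (t1, t0), s1, rtol=1e-10, atol=1e-10).y[:, -1]
dL_dA = s0[4:].reshape(2, 2)                     # all gradients from one backward solve

# sanity check: finite differences of the loss w.r.t. each entry of A
fd = np.zeros((2, 2))
for i in range(2):
    for j in range(2):
        E = np.zeros((2, 2)); E[i, j] = 1e-6
        fd[i, j] = (forward(A + E).sum() - forward(A - E).sum()) / 2e-6
```

For this linear $f$, $\partial f / \partial \theta$ reduces to the outer product of the adjoint and the state, so the third integral is accumulated alongside $\mathbf{z}$ and $\mathbf{a}$ in the same backward solve.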
Most ODE solvers have the option to output the state $\mathbf{z}(t)$ at multiple times. When the loss depends on these intermediate states, the reverse-mode derivative must be broken into a sequence of separate solves, one between each consecutive pair of outputs (Figure 2). At each observation, the adjoint must be adjusted by the corresponding direct gradient $\partial L / \partial \mathbf{z}(t_i)$.
The results above extend those of Stapor et al. (2018, section 2.4.2). Detailed derivations are provided in Appendix E. Appendix B provides Python code which computes all derivatives for scipy.integrate.odeint by extending the autograd automatic differentiation package. This code also supports all higher-order derivatives.
3 Replacing residual networks with ODEs for supervised learning
In this section, we experimentally investigate the training of neural ODEs for supervised learning.
Software
To solve ODE initial value problems numerically, we use the implicit Adams method implemented in LSODE and VODE and interfaced through the scipy.integrate package. Being an implicit method, it has better guarantees than explicit methods such as Runge–Kutta, but requires solving a nonlinear optimization problem at every step. This setup makes direct backpropagation through the integrator difficult. We implement the adjoint sensitivity method in Python’s autograd framework (Maclaurin et al., 2015). For the experiments in this section, we evaluated the hidden state dynamics and their derivatives on the GPU using TensorFlow; these were called from the Fortran ODE solvers, which were in turn called from Python autograd code.
Model Architectures
We experiment with a small residual network which downsamples the input twice and then applies 6 residual blocks, which are replaced by an ODESolve module in the ODE-Net variant. We also test a network which directly backpropagates through a Runge–Kutta integrator, referred to as RK-Net.
Model          Test Error   # Params   Memory           Time
1-Layer MLP    1.60%        0.24 M     -                -
ResNet         0.41%        0.60 M     $O(L)$           $O(L)$
RK-Net         0.47%        0.22 M     $O(\tilde{L})$   $O(\tilde{L})$
ODE-Net        0.42%        0.22 M     $O(1)$           $O(\tilde{L})$
Table 1 shows test error, number of parameters, and memory cost. $L$ denotes the number of layers in the ResNet, and $\tilde{L}$ is the number of function evaluations that the ODE solver requests in a single forward pass, which can be interpreted as an implicit number of layers.
We find that ODE-Nets and RK-Nets can achieve around the same performance as the ResNet, while using fewer parameters. For reference, a neural net with a single hidden layer of 300 units has around the same number of parameters as the ODE-Net and RK-Net architectures that we tested.
Error Control in ODENets
ODE solvers can approximately ensure that the output is within a given tolerance of the true solution. Changing this tolerance changes the behavior of the network. We first verify that error can indeed be controlled in Figure 3a. The time spent by the forward call is proportional to the number of function evaluations (Figure 3b), so tuning the tolerance gives us a tradeoff between accuracy and computational cost. One could train with high accuracy, but switch to a lower accuracy at test time.
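This tolerance/compute tradeoff is visible with any adaptive solver; the toy pendulum dynamics below are a hypothetical stand-in for a trained model's dynamics:

```python
import numpy as np
from scipy.integrate import solve_ivp

def f(t, z):
    # illustrative nonlinear dynamics (a pendulum)
    return np.array([z[1], -np.sin(z[0])])

z0 = np.array([1.0, 0.0])
nfe = {}
for rtol in (1e-3, 1e-6, 1e-9):
    sol = solve_ivp(f, (0.0, 10.0), z0, rtol=rtol, atol=rtol)
    nfe[rtol] = sol.nfev          # function evaluations at this tolerance
# tighter tolerance -> more function evaluations -> more compute
```

Loosening `rtol` at test time reduces `sol.nfev`, and hence wall-clock time, at the cost of accuracy.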
Figure 3c shows a surprising result: the number of evaluations in the backward pass is roughly half that of the forward pass. This suggests that the adjoint sensitivity method is not only more memory-efficient, but also more computationally efficient than directly backpropagating through the integrator, because the latter approach would need to backprop through each function evaluation in the forward pass.
Network Depth
It is not clear how to define the ‘depth’ of an ODE solution. A related quantity is the number of evaluations of the hidden state dynamics required, a detail delegated to the ODE solver and dependent on the initial state or input. Figure 3d shows that the number of function evaluations increases throughout training, presumably adapting to increasing complexity of the model.
4 Continuous Normalizing Flows
The discretized equation (1) also appears in normalizing flows (Rezende and Mohamed, 2015) and the NICE framework (Dinh et al., 2014). These methods use the change of variables theorem to compute exact changes in probability if samples are transformed through a bijective function $f$:
$\log p(\mathbf{z}_1) = \log p(\mathbf{z}_0) - \log \left| \det \frac{\partial f}{\partial \mathbf{z}_0} \right|$   (6)
An example is the planar normalizing flow (Rezende and Mohamed, 2015):
$\mathbf{z}(t+1) = \mathbf{z}(t) + u\, h(w^\top \mathbf{z}(t) + b)$   (7)
Generally, the main bottleneck to using the change of variables formula is computing the determinant of the Jacobian $\partial f / \partial \mathbf{z}$, which has a cubic cost in either the dimension of $\mathbf{z}$ or the number of hidden units. Recent work explores the tradeoff between the expressiveness of normalizing flow layers and computational cost (Kingma et al., 2016; Tomczak and Welling, 2016; Berg et al., 2018).
Surprisingly, moving from a discrete set of layers to a continuous transformation simplifies the computation of the change in normalizing constant:
Theorem 1 (Instantaneous Change of Variables).
Let $\mathbf{z}(t)$ be a finite continuous random variable with probability $p(\mathbf{z}(t))$ dependent on time. Let $\frac{d\mathbf{z}}{dt} = f(\mathbf{z}(t), t)$ be a differential equation describing a continuous-in-time transformation of $\mathbf{z}$. Assuming that $f$ is uniformly Lipschitz continuous in $\mathbf{z}$ and continuous in $t$, the change in log probability also follows a differential equation,
$\frac{\partial \log p(\mathbf{z}(t))}{\partial t} = -\mathrm{tr}\left( \frac{df}{d\mathbf{z}(t)} \right)$   (8)
Proof in Appendix A. Instead of the log determinant in (6), we now only require a trace operation. Also unlike standard finite flows, the differential equation $f$ does not need to be bijective, since the entire transformation is automatically bijective if uniqueness of the solution is satisfied.
As an example application of the instantaneous change of variables, we can examine the continuous analog of the planar flow, and its change in normalization constant:
$\frac{d\mathbf{z}(t)}{dt} = u\, h(w^\top \mathbf{z}(t) + b), \qquad \frac{\partial \log p(\mathbf{z}(t))}{\partial t} = -u^\top \frac{\partial h}{\partial \mathbf{z}(t)}$   (9)
Given an initial distribution $p(\mathbf{z}(0))$, we can sample from $p(\mathbf{z}(t))$ and evaluate its density by solving this combined ODE.
Using multiple hidden units with linear cost
While $\det$ is not a linear function, the trace function is, which implies $\mathrm{tr}(\sum_n J_n) = \sum_n \mathrm{tr}(J_n)$. Thus if our dynamics is given by a sum of functions, then the differential equation for the log density is also a sum:
$\frac{d \log p(\mathbf{z}(t))}{dt} = -\sum_{n=1}^{M} \mathrm{tr}\left( \frac{\partial f_n}{\partial \mathbf{z}} \right)$   (10)
This means we can cheaply evaluate flow models having many hidden units, with a cost only linear in the number of hidden units $M$. Evaluating such ‘wide’ flow layers using standard normalizing flows costs $O(M^3)$, which is why standard NF architectures use many layers of only a single hidden unit.
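A small numerical check of this identity, with hypothetical planar-flow dynamics: the trace of the full Jacobian equals a sum over hidden units, computable in $O(MD)$ without ever forming the Jacobian.

```python
import numpy as np

rng = np.random.default_rng(1)
D, M = 5, 3                                  # dimension, number of hidden units
U = rng.normal(size=(M, D))                  # hypothetical parameters u_n
Wt = rng.normal(size=(M, D))                 # hypothetical parameters w_n
b = rng.normal(size=M)

def dynamics(z):
    # f(z) = sum_n u_n * tanh(w_n^T z + b_n)
    return sum(U[n] * np.tanh(Wt[n] @ z + b[n]) for n in range(M))

def trace_analytic(z):
    # tr(df/dz) = sum_n (u_n . w_n) * tanh'(w_n^T z + b_n): O(M*D), no Jacobian
    pre = Wt @ z + b
    return float(np.sum((U * Wt).sum(1) * (1 - np.tanh(pre) ** 2)))

def trace_numeric(z, eps=1e-6):
    # finite-difference trace of the full D x D Jacobian, for comparison
    return float(sum((dynamics(z + eps * e) - dynamics(z - eps * e))[i] / (2 * eps)
                     for i, e in enumerate(np.eye(D))))

z = rng.normal(size=D)
```

The analytic trace is the per-unit sum from equation (10), up to sign, and never touches a $D \times D$ matrix.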
Timedependent dynamics
We can specify the parameters of a flow as a function of $t$, making the differential equation $f(\mathbf{z}(t), t)$ change with $t$. This parameterization is a kind of hypernetwork (Ha et al., 2016). We also introduce a gating mechanism for each hidden unit, $\frac{d\mathbf{z}}{dt} = \sum_n \sigma_n(t) f_n(\mathbf{z})$, where $\sigma_n(t) \in (0, 1)$ is a neural network that learns when each dynamic $f_n$ should be applied. We call these models continuous normalizing flows (CNF).
4.1 Experiments with Continuous Normalizing Flows


[Figure 4 — panels compare NF with depth K = 2, 8, 32 against CNF with M = 2, 8, 32 hidden units.]
We first compare continuous and discrete planar flows at learning to sample from a known distribution. We show that a planar CNF with $M$ hidden units can be at least as expressive as a planar NF with $K = M$ layers, and sometimes much more expressive.
Density matching
We configure the CNF as described above, and train for 10,000 iterations using Adam (Kingma and Ba, 2014). In contrast, the NF is trained for 500,000 iterations using RMSprop (Hinton et al., 2012), as suggested by Rezende and Mohamed (2015). For this task, we minimize $\mathrm{KL}(q(\mathbf{x}) \,\|\, p(\mathbf{x}))$ as the loss function, where $q$ is the flow model and the target density $p(\mathbf{x})$ can be evaluated. Figure 4 shows that CNF generally achieves lower loss.
Maximum Likelihood Training


A useful property of continuous-time normalizing flows is that we can compute the reverse transformation for about the same cost as the forward pass, which cannot be said for normalizing flows. This lets us train the flow on a density estimation task by performing maximum likelihood estimation, which maximizes $\mathbb{E}_{p(\mathbf{x})}[\log q(\mathbf{x})]$, where $q(\mathbf{x})$ is computed using the appropriate change of variables theorem; afterwards we reverse the CNF to generate random samples from $q(\mathbf{x})$.
For this task, we use 64 hidden units for the CNF, and 64 stacked one-hidden-unit layers for the NF. Figure 5 shows the learned dynamics. Instead of showing the initial Gaussian distribution, we display the transformed distribution after a small amount of time, which shows the locations of the initial planar flows. Interestingly, to fit the Two Circles distribution, the CNF rotates the planar flows so that the particles can be evenly spread into circles. While the CNF transformations are smooth and interpretable, we find that NF transformations are very unintuitive, and this model has difficulty fitting the two moons dataset in Figure 4(b).
5 A generative latent function time-series model
Applying neural networks to irregularly-sampled data such as medical records, network traffic, or neural spiking data is difficult. Typically, observations are put into bins of fixed duration, and the latent dynamics are discretized in the same way. This leads to difficulties with missing data and ill-defined latent variables. Missing data can be addressed using generative time-series models (Álvarez and Lawrence, 2011; Futoma et al., 2017; Mei and Eisner, 2017; Soleimani et al., 2017a) or data imputation (Che et al., 2018). Another approach concatenates time-stamp information to the input of an RNN (Choi et al., 2016; Lipton et al., 2016; Du et al., 2016; Li, 2017).
We present a continuous-time, generative approach to modeling time series. Our model represents each time series by a latent trajectory. Each trajectory is determined from a local initial state, $\mathbf{z}_{t_0}$, and a global set of latent dynamics shared across all time series. Given observation times $t_0, t_1, \dots, t_N$ and an initial state $\mathbf{z}_{t_0}$, an ODE solver produces $\mathbf{z}_{t_1}, \dots, \mathbf{z}_{t_N}$, which describe the latent state at each observation. We define this generative model formally through a sampling procedure:
$\mathbf{z}_{t_0} \sim p(\mathbf{z}_{t_0})$   (11)
$\mathbf{z}_{t_1}, \dots, \mathbf{z}_{t_N} = \mathrm{ODESolve}(\mathbf{z}_{t_0}, f, \theta_f, t_0, \dots, t_N)$   (12)
$\text{each } \mathbf{x}_{t_i} \sim p(\mathbf{x} \mid \mathbf{z}_{t_i}, \theta_x)$   (13)
Function $f$ is a time-invariant function that takes the value $\mathbf{z}$ at the current time step and outputs the gradient: $\partial \mathbf{z}(t) / \partial t = f(\mathbf{z}(t), \theta_f)$. We parametrize this function using a neural net. Because $f$ is time-invariant, given any latent state $\mathbf{z}(t)$, the entire latent trajectory is uniquely defined. Extrapolating this latent trajectory lets us make predictions arbitrarily far forwards or backwards in time.
Training and Prediction
We can train this latent-variable model as a variational autoencoder (Kingma and Welling, 2014; Rezende et al., 2014), with sequence-valued observations. Our recognition net is an RNN, which consumes the data sequentially backwards in time, and outputs $q_\phi(\mathbf{z}_{t_0} \mid \mathbf{x}_{t_0}, \dots, \mathbf{x}_{t_N})$. A detailed algorithm can be found in Appendix D. Using ODEs as a generative model allows us to make predictions for arbitrary time points $t_{N+1}, \dots, t_{N+M}$ on a continuous timeline.
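The generative half of this model, equations (11) to (13), reduces to a few lines with a generic solver; the weights, linear decoder, and observation times below are hypothetical choices:

```python
import numpy as np
from scipy.integrate import solve_ivp

rng = np.random.default_rng(2)
latent_dim, obs_dim = 4, 2
Wf = rng.normal(scale=0.3, size=(latent_dim, latent_dim))   # dynamics weights theta_f
Wd = rng.normal(size=(obs_dim, latent_dim))                  # decoder weights theta_x

def latent_dynamics(t, z):
    # time-invariant f(z, theta_f)
    return np.tanh(Wf @ z)

def decode(z):
    # mean of p(x | z, theta_x), here a linear map
    return Wd @ z

t_obs = np.array([0.0, 0.5, 1.3, 2.0])           # arbitrary observation times
z0 = rng.normal(size=latent_dim)                  # z_{t0} ~ p(z_{t0})
sol = solve_ivp(latent_dynamics, (t_obs[0], t_obs[-1]), z0, t_eval=t_obs,
                rtol=1e-8, atol=1e-8)
x_means = np.array([decode(z) for z in sol.y.T])  # one decoded mean per time
```

Because the times are passed to the solver rather than baked into the architecture, the same trained model queries any set of time points, including extrapolations.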
Poisson Process likelihoods
The fact that an observation occurred often tells us something about the latent state. For example, a patient may be more likely to take a medical test if they are sick. The rate of events can be parameterized by a function of the latent state: $\lambda(\mathbf{z}(t))$. Given this rate function, the likelihood of a set of independent observation times in the interval $[t_{\text{start}}, t_{\text{end}}]$ is given by an inhomogeneous Poisson process (Palm, 1943):

$\log p(t_1, \dots, t_N \mid t_{\text{start}}, t_{\text{end}}) = \sum_{i=1}^{N} \log \lambda(\mathbf{z}(t_i)) - \int_{t_{\text{start}}}^{t_{\text{end}}} \lambda(\mathbf{z}(t))\, dt$
We can parameterize $\lambda(\cdot)$ using another neural network. Conveniently, we can evaluate both the latent trajectory and the Poisson process likelihood together in a single call to an ODE solver. Figure 7 shows the event rate learned by such a model on a toy dataset.
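Evaluating trajectory and likelihood in one solver call amounts to appending the running integral of $\lambda$ as one extra state dimension; everything below (dynamics, rate parameters, event times) is an illustrative toy, not the paper's model:

```python
import numpy as np
from scipy.integrate import solve_ivp

A = np.array([[0.0, 1.0], [-1.0, 0.0]])       # hypothetical latent dynamics
w = np.array([1.0, 0.5])                      # hypothetical rate parameters

def rate(z):
    # lambda(z(t)) > 0 via softplus
    return np.log1p(np.exp(w @ z))

def aug(t, s):
    # s = [z, Lambda]: integrate dynamics and cumulative intensity together
    z = s[:2]
    return np.concatenate([A @ z, [rate(z)]])

events = np.array([0.3, 1.1, 1.9])            # observed event times in [0, 2]
s0 = np.array([1.0, 0.0, 0.0])                # z(0), plus Lambda(0) = 0
sol = solve_ivp(aug, (0.0, 2.0), s0, t_eval=np.concatenate([events, [2.0]]),
                rtol=1e-8, atol=1e-8)
log_lambdas = [np.log(rate(sol.y[:2, i])) for i in range(len(events))]
log_lik = sum(log_lambdas) - sol.y[2, -1]     # sum log lambda(t_i) - integral of lambda
```

The final extra state component is exactly the integral term of the Poisson log-likelihood, obtained for free during the trajectory solve.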
A Poisson process likelihood on observation times can easily be combined with a data likelihood to give a joint distribution over observations and their times.
5.1 Time-series Latent ODE Experiments
We investigate the ability of the latent ODE model to fit and extrapolate time series. The recognition network is an RNN with 25 hidden units. We use a 4-dimensional latent space. We parameterize the dynamics function $f$ with a one-hidden-layer network with 20 hidden units. The decoder computing $p(\mathbf{x}_{t_i} \mid \mathbf{z}_{t_i})$ is another neural network with one hidden layer of 20 hidden units. Our baseline was a recurrent neural net with 25 hidden units trained to minimize negative Gaussian log-likelihood. We trained a second version of this RNN whose inputs were concatenated with the time difference to the next observation, to aid the RNN with irregular observations.
Bidirectional spiral dataset
We generated a dataset of 1000 2-dimensional spirals, each starting at a different point, sampled at 100 equally-spaced timesteps. The dataset contains two types of spirals: half are clockwise while the other half are counter-clockwise. To make the task more realistic, we add Gaussian noise to the observations.




Time series with irregular time points
To generate irregular timestamps, we randomly sample points from each trajectory without replacement ($n \in \{30, 50, 100\}$). We report predictive root-mean-squared error (RMSE) on 100 time points extending beyond those that were used for training. Table 2 shows that the latent ODE has substantially lower predictive RMSE.
Figure 8 shows examples of spiral reconstructions with 30 subsampled points. Reconstructions from the latent ODE were obtained by sampling from the posterior over latent trajectories and decoding it to dataspace. Examples with varying number of time points are shown in Appendix C. We observed that reconstructions and extrapolations are consistent with the ground truth regardless of number of observed points and despite the noise.
# Observations   30/100   50/100   100/100
RNN              0.3937   0.3202   0.1813
Latent ODE       0.1642   0.1502   0.1346
Latent space interpolation
Figure 7(c) shows latent trajectories projected onto the first two dimensions of the latent space. The trajectories form two separate clusters, one decoding to clockwise spirals, the other to counter-clockwise. Figure 9 shows that the latent trajectories change smoothly as a function of the initial point $\mathbf{z}(t_0)$, switching from a clockwise to a counter-clockwise spiral.
6 Scope and Limitations
Minibatching
The use of minibatches is less straightforward than for standard neural networks. One can still batch together evaluations through the ODE solver by concatenating the states of each batch element, creating a combined ODE with dimension $D \times K$. In some cases, controlling error on all batch elements together might require evaluating the combined system more often than if each system were solved individually, leading to a more than $K$-fold cost to evaluate $K$ batch elements. However, in practice the number of evaluations did not increase substantially when using minibatches.
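Concatenating batch elements into one ODE of dimension $D \cdot K$ can be sketched as follows (hypothetical dynamics); each element still follows its own trajectory, but the solver's step-size control is shared across the batch:

```python
import numpy as np
from scipy.integrate import solve_ivp

K, D = 8, 3                                   # batch size, state dimension
rng = np.random.default_rng(3)
W = rng.normal(scale=0.2, size=(D, D))        # hypothetical shared dynamics weights

def f_single(z):
    # per-example dynamics
    return np.tanh(W @ z)

def f_batched(t, flat):
    # treat the whole batch as one ODE of dimension D*K
    Z = flat.reshape(K, D)
    return np.array([f_single(z) for z in Z]).ravel()

Z0 = rng.normal(size=(K, D))
sol = solve_ivp(f_batched, (0.0, 1.0), Z0.ravel(), rtol=1e-6, atol=1e-8)
Z1 = sol.y[:, -1].reshape(K, D)               # per-example final states
```

Solving each element separately gives the same answers up to solver tolerance; the shared step-size control is what can, in the worst case, force extra evaluations.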
Uniqueness
When do continuous dynamics have a unique solution? Picard’s existence theorem (Coddington and Levinson, 1955) states that the solution to an initial value problem exists and is unique if the differential equation is uniformly Lipschitz continuous in $\mathbf{z}$ and continuous in $t$. This theorem holds for our model if the neural network has finite weights and uses Lipschitz nonlinearities, such as tanh or ReLU.
Reversibility
Even if the forward trajectory is invertible in principle, in practice there will be three compounding sources of error in the gradient:

Numerical error introduced in the forward ODE solver;

Information lost due to multiple initial values mapping to the same final state; and

Numerical error introduced in the reverse ODE solver.
Errors from 1 and 3 can be made as small as desired at the cost of more computation. Error from 2 could be expected to be a problem if the system dynamics encode optimization-like, convergent dynamics. Note that exactly convergent transformations are disallowed by Lipschitz dynamics. In our experiments we did not find this to be a practical problem, and we informally checked that reversing many layers of continuous normalizing flows recovered the initial Gaussian density.
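The reversibility check described above can be reproduced empirically: for smooth, non-convergent dynamics, integrating forward and then backward recovers the initial state to within solver tolerance. The dynamics below are a hypothetical example, not the paper's trained flow:

```python
import numpy as np
from scipy.integrate import solve_ivp

def f(t, z):
    # smooth, Lipschitz, non-convergent dynamics (a softened rotation)
    return np.array([np.tanh(z[1]), -np.tanh(z[0])])

z0 = np.array([0.7, -0.3])
# forward solve from t=0 to t=5, then reverse from t=5 back to t=0
zT = solve_ivp(f, (0.0, 5.0), z0, rtol=1e-9, atol=1e-9).y[:, -1]
z0_rec = solve_ivp(f, (5.0, 0.0), zT, rtol=1e-9, atol=1e-9).y[:, -1]
recovery_error = np.max(np.abs(z0_rec - z0))  # sources 1 and 3 only, here tiny
```

With convergent (attracting) dynamics, the backward solve would instead amplify error, which is source 2 above.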
If necessary, exact reversibility could be recovered by storing the initial state and solving the reverse pass as a boundary-value problem instead of as an initial-value problem. If truly convergent dynamics are required, replacing the ODE solver with a differentiable black-box optimizer, as in OptNet (Amos and Kolter, 2017), would be more appropriate.
7 Related Work
The interpretation of residual networks (He et al., 2016) as approximate ODE solvers spurred research into exploiting reversibility and approximate computation in ResNets (Chang et al., 2017; Lu et al., 2017). We demonstrate these same properties in more generality by directly using an ODE solver.
Adaptive computation
One can adapt computation time by training secondary neural networks to choose the number of evaluations of recurrent or residual networks (Graves, 2016; Jernite et al., 2016; Figurnov et al., 2017). However, this introduces overhead both at training and test time, and extra parameters that need to be fit. In contrast, ODE solvers offer wellstudied, computationally cheap, and generalizable rules for adapting the amount of computation.
Reversibility
Recent work developed reversible versions of residual networks (Gomez et al., 2017; Haber and Ruthotto, 2017; Chang et al., 2017), which give the same memory savings as our approach. However, these methods require restricted architectures which partition the hidden units. Our approach does not have these restrictions.
Learning differential equations
Much recent work has proposed learning differential equations from data. One can train feedforward or recurrent neural networks to approximate a differential equation (Raissi and Karniadakis, 2018; Raissi et al., 2018a; Long et al., 2017), with applications such as fluid simulation (Wiewel et al., 2018). There is also significant work on connecting Gaussian Processes (GPs) and ODE solvers (Schober et al., 2014). GPs have been adapted to fit differential equations (Raissi et al., 2018b) and can naturally model continuoustime effects and interventions (Soleimani et al., 2017b; Schulam and Saria, 2017). Ryder et al. (2018) use stochastic variational inference to recover the solution of a given stochastic differential equation.
Differentiating through ODE solvers
The dolfin library (Farrell et al., 2013) implements adjoint computation for general ODE and PDE solutions, but only by backpropagating through the individual operations of the forward solver. The Stan library (Carpenter et al., 2015) implements gradient estimation through ODE solutions using forward sensitivity analysis. However, forward sensitivity analysis is quadratic-time in the number of variables, whereas the adjoint sensitivity analysis is linear (Carpenter et al., 2015; Zhang and Sandu, 2014). Melicher et al. (2017) used the adjoint method to train bespoke latent dynamic models.
In contrast, by providing a generic vector-Jacobian product, we allow an ODE solver to be trained end-to-end with any other differentiable model components. We believe that the code provided in Appendix B is the first fully general implementation of memory- and time-efficient reverse-mode automatic differentiation through ODE solutions.
8 Conclusion
We investigated the use of blackbox ODE solvers as a model component, developing new models for timeseries modeling, supervised learning, and density estimation. These models are evaluated adaptively, and allow explicit control of the tradeoff between computation speed and accuracy. Finally, we derived an instantaneous version of the change of variables formula, and developed continuoustime normalizing flows, which can scale to large layer sizes.
9 Acknowledgements
We thank Wenyi Wang and Geoff Roeder for help with proofs, and Daniel Duckworth, Ethan Fetaya, Hossein Soleimani, Eldad Haber, Ken Caluwaerts, and Daniel FlamShepherd for feedback on early drafts. We thank Chris Rackauckas, Dougal Maclaurin, and Matthew James Johnson for helpful discussions.
Appendix A Proof of the Instantaneous Change of Variables Theorem
Theorem (Instantaneous Change of Variables).
Let $\mathbf{z}(t)$ be a finite continuous random variable with probability $p(\mathbf{z}(t))$ dependent on time. Let $\frac{d\mathbf{z}}{dt} = f(\mathbf{z}(t), t)$ be a differential equation describing a continuous-in-time transformation of $\mathbf{z}$. Assuming that $f$ is uniformly Lipschitz continuous in $\mathbf{z}$ and continuous in $t$, the change in log probability also follows a differential equation:

$\frac{\partial \log p(\mathbf{z}(t))}{\partial t} = -\mathrm{tr}\left( \frac{df}{d\mathbf{z}(t)} \right)$
Proof.
To prove this theorem, we take the infinitesimal limit of finite changes of $\log p(\mathbf{z}(t))$ through time. First we denote the transformation of $\mathbf{z}$ over an $\varepsilon$ change in time as

$\mathbf{z}(t + \varepsilon) = T_\varepsilon(\mathbf{z}(t))$   (14)
We assume that $f$ is Lipschitz continuous in $\mathbf{z}$ and continuous in $t$, so every initial value problem has a unique solution by Picard’s existence theorem. We also assume $\mathbf{z}(t)$ is bounded. These conditions imply that $f$, $T_\varepsilon$, and $\frac{\partial}{\partial \mathbf{z}} T_\varepsilon$ are all bounded. In the following, we use these conditions to exchange limits and products.
We can write $\frac{\partial \log p(\mathbf{z}(t))}{\partial t}$ using the discrete change of variables formula, and the definition of the derivative:

$\frac{\partial \log p(\mathbf{z}(t))}{\partial t} = \lim_{\varepsilon \to 0^+} \frac{\log p(\mathbf{z}(t+\varepsilon)) - \log p(\mathbf{z}(t))}{\varepsilon}$   (15)
$= \lim_{\varepsilon \to 0^+} \frac{\log p(\mathbf{z}(t)) - \log \left| \det \frac{\partial}{\partial \mathbf{z}} T_\varepsilon(\mathbf{z}(t)) \right| - \log p(\mathbf{z}(t))}{\varepsilon}$   (16)
$= -\lim_{\varepsilon \to 0^+} \frac{\log \left| \det \frac{\partial}{\partial \mathbf{z}} T_\varepsilon(\mathbf{z}(t)) \right|}{\varepsilon}$   (17)
$= -\lim_{\varepsilon \to 0^+} \frac{\frac{\partial}{\partial \varepsilon} \log \left| \det \frac{\partial}{\partial \mathbf{z}} T_\varepsilon(\mathbf{z}(t)) \right|}{\frac{\partial}{\partial \varepsilon} \varepsilon}$ (by L’Hôpital’s rule)   (18)
$= -\lim_{\varepsilon \to 0^+} \frac{\frac{\partial}{\partial \varepsilon} \left| \det \frac{\partial}{\partial \mathbf{z}} T_\varepsilon(\mathbf{z}(t)) \right|}{\left| \det \frac{\partial}{\partial \mathbf{z}} T_\varepsilon(\mathbf{z}(t)) \right|}$   (19)
$= -\lim_{\varepsilon \to 0^+} \frac{\partial}{\partial \varepsilon} \left| \det \frac{\partial}{\partial \mathbf{z}} T_\varepsilon(\mathbf{z}(t)) \right|$   (20)

where the last step uses that $T_0$ is the identity map, so the determinant in the denominator tends to 1.
The derivative of the determinant can be expressed using Jacobi’s formula, which gives

$\frac{\partial \log p(\mathbf{z}(t))}{\partial t} = -\lim_{\varepsilon \to 0^+} \mathrm{tr}\left( \mathrm{adj}\left( \frac{\partial}{\partial \mathbf{z}} T_\varepsilon(\mathbf{z}(t)) \right) \frac{\partial}{\partial \varepsilon} \frac{\partial}{\partial \mathbf{z}} T_\varepsilon(\mathbf{z}(t)) \right)$   (21)
$= -\mathrm{tr}\left( \lim_{\varepsilon \to 0^+} \mathrm{adj}\left( \frac{\partial}{\partial \mathbf{z}} T_\varepsilon(\mathbf{z}(t)) \right) \lim_{\varepsilon \to 0^+} \frac{\partial}{\partial \varepsilon} \frac{\partial}{\partial \mathbf{z}} T_\varepsilon(\mathbf{z}(t)) \right)$   (22)
$= -\mathrm{tr}\left( \lim_{\varepsilon \to 0^+} \frac{\partial}{\partial \varepsilon} \frac{\partial}{\partial \mathbf{z}} T_\varepsilon(\mathbf{z}(t)) \right)$   (23)

since $\mathrm{adj}\big( \frac{\partial}{\partial \mathbf{z}} T_0 \big) = \mathrm{adj}(I) = I$.
Substituting $T_\varepsilon$ with its Taylor series expansion $T_\varepsilon(\mathbf{z}(t)) = \mathbf{z}(t) + \varepsilon f(\mathbf{z}(t), t) + O(\varepsilon^2)$ and taking the limit, we complete the proof.

$\frac{\partial \log p(\mathbf{z}(t))}{\partial t} = -\mathrm{tr}\left( \lim_{\varepsilon \to 0^+} \frac{\partial}{\partial \varepsilon} \frac{\partial}{\partial \mathbf{z}} \left( \mathbf{z}(t) + \varepsilon f(\mathbf{z}(t), t) + O(\varepsilon^2) \right) \right)$   (24)
$= -\mathrm{tr}\left( \lim_{\varepsilon \to 0^+} \frac{\partial}{\partial \varepsilon} \left( I + \varepsilon \frac{\partial f(\mathbf{z}(t), t)}{\partial \mathbf{z}} + O(\varepsilon^2) \right) \right)$   (25)
$= -\mathrm{tr}\left( \lim_{\varepsilon \to 0^+} \left( \frac{\partial f(\mathbf{z}(t), t)}{\partial \mathbf{z}} + O(\varepsilon) \right) \right)$   (26)
$= -\mathrm{tr}\left( \frac{\partial f(\mathbf{z}(t), t)}{\partial \mathbf{z}} \right)$   (27)
∎
Continuous Planar Flows
Let $f(\mathbf{z}) = u\, h(w^\top \mathbf{z} + b)$; then $\frac{\partial f}{\partial \mathbf{z}} = u \left( \frac{\partial h}{\partial \mathbf{z}} \right)^\top$. Since the trace of an outer product is the inner product, we have

$\frac{d \log p(\mathbf{z}(t))}{dt} = -\mathrm{tr}\left( u \left( \frac{\partial h}{\partial \mathbf{z}} \right)^\top \right) = -u^\top \frac{\partial h}{\partial \mathbf{z}}$   (28)
Continuous Hamiltonian Flows
The continuous analog of NICE (Dinh et al., 2014) is a Hamiltonian flow, which splits the data into two equal partitions and is a volume-preserving transformation, implying that $\frac{d \log p(\mathbf{z}(t))}{dt} = 0$. We can verify this. Let

$\frac{d}{dt} \begin{bmatrix} \mathbf{z}_{1:d} \\ \mathbf{z}_{d+1:D} \end{bmatrix} = \begin{bmatrix} f(\mathbf{z}_{d+1:D}) \\ g(\mathbf{z}_{1:d}) \end{bmatrix}$   (29)

Then, because the Jacobian is all zeros on its diagonal, the trace is zero.
Appendix B Autograd Implementation
Appendix C Extra Figures




Appendix D Algorithm for training the latent ODE model
To obtain the latent representation $\mathbf{z}_{t_0}$, we traverse the sequence using an RNN and obtain parameters of the distribution $q(\mathbf{z}_{t_0} \mid \{\mathbf{x}_{t_i}, t_i\}_i)$. The algorithm is the following:

Run an RNN encoder through the time series and infer the parameters of the approximate posterior over $\mathbf{z}_{t_0}$:

$q(\mathbf{z}_{t_0} \mid \{\mathbf{x}_{t_i}, t_i\}_i) = \mathcal{N}(\mu_{\mathbf{z}_{t_0}}, \sigma_{\mathbf{z}_{t_0}})$   (30)

where $\mu_{\mathbf{z}_{t_0}}$ and $\sigma_{\mathbf{z}_{t_0}}$ come from the hidden state of $\mathrm{RNN}(\{\mathbf{x}_{t_i}, t_i\}_i)$

Sample $\mathbf{z}_{t_0} \sim q(\mathbf{z}_{t_0} \mid \{\mathbf{x}_{t_i}, t_i\}_i)$

Obtain $\mathbf{z}_{t_1}, \dots, \mathbf{z}_{t_N}$ by solving the ODE $\frac{d\mathbf{z}}{dt} = f(\mathbf{z}, \theta_f)$, where $f$ is the function defining the gradient $\frac{d\mathbf{z}}{dt}$ as a function of $\mathbf{z}$

Maximize the evidence lower bound,
where $\mathrm{ELBO} = \mathbb{E}_{q(\mathbf{z}_{t_0})}\left[ \sum_{i} \log p(\mathbf{x}_{t_i} \mid \mathbf{z}_{t_i}) \right] - \mathrm{KL}\left( q(\mathbf{z}_{t_0} \mid \{\mathbf{x}_{t_i}, t_i\}_i) \,\|\, p(\mathbf{z}_{t_0}) \right)$
Appendix E Derivations of The Adjoint Sensitivity Method
Here we describe the adjoint sensitivity method, which allows one to efficiently compute gradients through an ODE solution. Consider the problem of optimizing a loss that depends on the solution of an ODE with parameters $\theta$. We want to find the gradients with respect to the parameters $\theta$. For simplicity of notation, we formulate the problem on the time interval $[0, T]$; it can be easily generalized to an interval $[t_0, t_1]$.
Consider a function $L$ such that our loss can be represented as $L(\mathbf{z}(T))$, where $\mathbf{z}(T)$ is implicitly defined by the dynamics.
We can formally define the problem as follows:
$\min_\theta L(\mathbf{z}(T))$   (31)
given an ODE system
$\frac{d\mathbf{z}(t)}{dt} = f(\mathbf{z}(t), t, \theta), \qquad \mathbf{z}(0) = \mathbf{z}_0$   (32)
where $\mathbf{z}$ is a function of $t$. For the sake of shorter notation, we use $\dot{\mathbf{z}}$ to denote the derivative with respect to time $t$. Note that $\mathbf{z}$ implicitly depends on $\theta$.
We wish to compute the derivative of $L$ with respect to $\theta$:

$\frac{dL}{d\theta}$   (33)
The system of equations (31)–(32) can be rewritten as an unconstrained optimization problem using Lagrange multipliers $\lambda(t)$ and $\mu$, where $\lambda(t)$ is a function over time. We introduce a new loss $\psi$:

$\psi = L(\mathbf{z}(T)) - \int_0^T \lambda(t)^\top \left( \dot{\mathbf{z}}(t) - f(\mathbf{z}(t), t, \theta) \right) dt - \mu^\top \left( \mathbf{z}(0) - \mathbf{z}_0 \right)$   (34)
Now we want to take the derivative over $\theta$. Note that $\psi = L$ for any choice of $\lambda$ and $\mu$, because both constraint terms vanish along solutions of (32); hence $\frac{d\psi}{d\theta} = \frac{dL}{d\theta}$.
First, let’s differentiate $L(\mathbf{z}(T))$ by $\theta$. Note that $\mathbf{z}(T)$, and through it the function $L$ itself, depends on $\theta$:

$\frac{d L(\mathbf{z}(T))}{d\theta} = \frac{\partial L}{\partial \mathbf{z}(T)} \frac{d \mathbf{z}(T)}{d\theta}$   (35)
Now let’s differentiate $\lambda^\top (\dot{\mathbf{z}} - f)$ by $\theta$. Note that $\mathbf{z}$ implicitly depends on $\theta$:

$\frac{d}{d\theta} \left[ \lambda^\top \left( \dot{\mathbf{z}} - f \right) \right] = \lambda^\top \left( \frac{d \dot{\mathbf{z}}}{d\theta} - \frac{\partial f}{\partial \mathbf{z}} \frac{d \mathbf{z}}{d\theta} - \frac{\partial f}{\partial \theta} \right)$   (36)
Finally, compute $\frac{d\psi}{d\theta}$, using the fact that $\frac{d\mathbf{z}(0)}{d\theta} = 0$ since $\mathbf{z}(0) = \mathbf{z}_0$ does not depend on $\theta$:

$\frac{d\psi}{d\theta} = \frac{\partial L}{\partial \mathbf{z}(T)} \frac{d \mathbf{z}(T)}{d\theta} - \int_0^T \lambda^\top \left( \frac{d \dot{\mathbf{z}}}{d\theta} - \frac{\partial f}{\partial \mathbf{z}} \frac{d \mathbf{z}}{d\theta} - \frac{\partial f}{\partial \theta} \right) dt$   (37)
To compute $\int_0^T \lambda^\top \frac{d\dot{\mathbf{z}}}{d\theta}\, dt$, integrate by parts. Recall that $\lambda$ is also a function of $t$:

$\int_0^T \lambda^\top \frac{d \dot{\mathbf{z}}}{d\theta}\, dt = \left[ \lambda^\top \frac{d \mathbf{z}}{d\theta} \right]_0^T - \int_0^T \dot{\lambda}^\top \frac{d \mathbf{z}}{d\theta}\, dt$   (38)
$= \lambda(T)^\top \frac{d \mathbf{z}(T)}{d\theta} - \lambda(0)^\top \frac{d \mathbf{z}(0)}{d\theta} - \int_0^T \dot{\lambda}^\top \frac{d \mathbf{z}}{d\theta}\, dt$   (39)
$= \lambda(T)^\top \frac{d \mathbf{z}(T)}{d\theta} - \int_0^T \dot{\lambda}^\top \frac{d \mathbf{z}}{d\theta}\, dt$   (40)
Grouping the terms by $\frac{d\mathbf{z}(T)}{d\theta}$ and $\frac{d\mathbf{z}}{d\theta}$:

$\frac{d\psi}{d\theta} = \frac{\partial L}{\partial \mathbf{z}(T)} \frac{d \mathbf{z}(T)}{d\theta} - \lambda(T)^\top \frac{d \mathbf{z}(T)}{d\theta} + \int_0^T \dot{\lambda}^\top \frac{d \mathbf{z}}{d\theta}\, dt + \int_0^T \lambda^\top \left( \frac{\partial f}{\partial \mathbf{z}} \frac{d \mathbf{z}}{d\theta} + \frac{\partial f}{\partial \theta} \right) dt$   (41)
$= \left( \frac{\partial L}{\partial \mathbf{z}(T)} - \lambda(T)^\top \right) \frac{d \mathbf{z}(T)}{d\theta} + \int_0^T \left( \dot{\lambda}^\top + \lambda^\top \frac{\partial f}{\partial \mathbf{z}} \right) \frac{d \mathbf{z}}{d\theta}\, dt + \int_0^T \lambda^\top \frac{\partial f}{\partial \theta}\, dt$   (42)
It is difficult to calculate $\frac{d\mathbf{z}(T)}{d\theta}$. As we are free to choose $\lambda(T)$ and $\mu$, set $\lambda(T)^\top = \frac{\partial L}{\partial \mathbf{z}(T)}$ (the choice of $\mu$ does not matter, since $\frac{d\mathbf{z}(0)}{d\theta} = 0$). This allows us to remove the $\frac{d\mathbf{z}(T)}{d\theta}$ term.
We can also choose $\lambda$ such that the term multiplying $\frac{d\mathbf{z}}{d\theta}$ equals zero:

$\dot{\lambda}^\top + \lambda^\top \frac{\partial f}{\partial \mathbf{z}} = 0 \quad \Longleftrightarrow \quad \frac{d\lambda(t)^\top}{dt} = -\lambda(t)^\top \frac{\partial f}{\partial \mathbf{z}}$   (43)
(recall that $\psi = L$)

$\frac{dL}{d\theta} = \frac{d\psi}{d\theta} = \int_0^T \lambda(t)^\top \frac{\partial f}{\partial \theta}\, dt$   (44)
Here $\frac{\partial f}{\partial \theta}$ is the partial derivative of $f$ with respect to $\theta$, i.e., the derivative of the direct dependence of $f$ on $\theta$. Through the adjoint method we have computed $\frac{dL}{d\theta}$, a full derivative which includes the dependence of $L$ on $\theta$ through other variables, such as $\mathbf{z}(t)$.
The adjoint ODE:

$\frac{d\lambda(t)^\top}{dt} = -\lambda(t)^\top \frac{\partial f(\mathbf{z}(t), t, \theta)}{\partial \mathbf{z}}$   (45)
Algorithm for computing the gradient $\frac{dL}{d\theta}$:
1) Solve the ODE for $\mathbf{z}(t)$ from $t = 0$ to $T$:

$\frac{d\mathbf{z}(t)}{dt} = f(\mathbf{z}(t), t, \theta), \qquad \mathbf{z}(0) = \mathbf{z}_0$   (46)
2) Solve the adjoint ODE for $\lambda(t)$ from $t = T$ to $0$:

$\frac{d\lambda(t)^\top}{dt} = -\lambda(t)^\top \frac{\partial f(\mathbf{z}(t), t, \theta)}{\partial \mathbf{z}}, \qquad \lambda(T)^\top = \frac{\partial L}{\partial \mathbf{z}(T)}$   (47)
3) Compute

$\frac{dL}{d\theta} = \int_0^T \lambda(t)^\top \frac{\partial f(\mathbf{z}(t), t, \theta)}{\partial \theta}\, dt$   (48)
Note that the end value $\lambda(T)$ in equation 47 contains $\frac{\partial L}{\partial \mathbf{z}(T)}$. Although we can compute the direct gradient of the loss at each observed state, the dependence of $L$ on $\mathbf{z}(t)$ between observations is unknown.
In this case, we can use the following algorithm to solve the adjoint system:
Algorithm when the loss depends on intermediate states $\mathbf{z}(t_1), \dots, \mathbf{z}(t_T)$:
1) Set the end value for the adjoint state: $\lambda(t_T)^\top = \frac{\partial L}{\partial \mathbf{z}(t_T)}$
2) for $i = T$ to $1$ do:
a) Compute the end value of $\lambda$ on the interval by adding the direct gradient at $t_i$ (for $i < T$; at $i = T$ it was set in step 1), where $\lambda(t_i^+)$ denotes the value obtained from the solve on $[t_i, t_{i+1}]$:

$\lambda(t_i)^\top = \lambda(t_i^+)^\top + \frac{\partial L}{\partial \mathbf{z}(t_i)}$   (49)

b) Solve the adjoint system backwards on the time interval $[t_{i-1}, t_i]$:

$\frac{d\lambda(t)^\top}{dt} = -\lambda(t)^\top \frac{\partial f}{\partial \mathbf{z}}$   (50)

3) Compute the gradient as in equation 48
E.1 Interpretation of $\lambda(t)$
Here we provide an interpretation of the adjoint state $\lambda(t)$. Consider the gradient of the loss with respect to the state $\mathbf{z}(t)$ of an ODE at time $t$:

$\frac{dL}{d\mathbf{z}(t)}$   (51)

We can make the same calculations as with $\frac{dL}{d\theta}$ and obtain an equation similar to 44:

$\frac{dL}{d\mathbf{z}(t)} = \lambda(t)^\top$   (52)

That is, the adjoint state is the gradient of the loss with respect to the hidden state at time $t$.
E.2 Gradients wrt $\mathbf{z}(0)$
Consider a loss that depends on $\mathbf{z}(0)$. The derivative $\frac{\partial L}{\partial \mathbf{z}(0)}$ is the partial derivative of $L$ with respect to $\mathbf{z}(0)$, which can be easily computed. We aim to compute the full derivative $\frac{dL}{d\mathbf{z}(0)}$ through the adjoint method.
Algorithm:
1) Solve the adjoint ODE backwards from $t = T$ to $0$ to obtain $\lambda(0)$
2) Return $\frac{dL}{d\mathbf{z}(0)} = \lambda(0)^\top + \frac{\partial L}{\partial \mathbf{z}(0)}$