Dissecting Neural ODEs

Abstract

Continuous deep learning architectures have recently re–emerged as variants of Neural Ordinary Differential Equations (Neural ODEs). The infinite–depth approach offered by these models theoretically bridges the gap between deep learning and dynamical systems; however, deciphering their inner workings is still an open challenge and most of their applications are currently limited to their inclusion as generic black–box modules. In this work, we “open the box” and offer a system–theoretic perspective, including state augmentation strategies and robustness, with the aim of clarifying the influence of several design choices on the underlying dynamics. We also introduce novel architectures: among them, a Galerkin–inspired depth–varying parameter model and neural ODEs with data–controlled vector fields.

1 Introduction

Continuous–depth neural network architectures are built upon the observation that, for particular classes of discrete models with coherent input and output dimensions, e.g. residual networks (ResNets) he2016deep, their inter–layer dynamics can be expressed as:

(1)   $\mathbf{h}_{s+1} = \mathbf{h}_s + f(\mathbf{h}_s, \theta_s), \quad \mathbf{h}_0 = \mathbf{x}$

where $\mathbf{x}$ is the input data. System (1) resembles the Euler discretization of the ordinary differential equation (ODE):

(2)   $\dot{\mathbf{z}}(s) = f(s, \mathbf{z}(s), \theta(s)), \quad \mathbf{z}(0) = \mathbf{x}$

with $s \in \mathcal{S} \subset \mathbb{R}$, $\mathbf{z}(s) \in \mathbb{R}^n$ and $\theta : \mathcal{S} \to \mathbb{R}^p$.
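To make the correspondence concrete, the following minimal PyTorch sketch (with illustrative dimensions and a depth–invariant vector field, not tied to any experiment in the paper) compares a residual update with an explicit Euler step of the same vector field:

```python
import torch
import torch.nn as nn

# A shared "vector field": any network with matching input/output dimension (sizes are illustrative).
f = nn.Sequential(nn.Linear(2, 16), nn.Tanh(), nn.Linear(16, 2))

x = torch.randn(8, 2)              # input data: batch of 8, dimension 2

# Residual-network update, Eq. (1): h_{s+1} = h_s + f(h_s)
h_next = x + f(x)

# Explicit Euler step of the ODE in Eq. (2) with step size ds
ds = 1.0                           # a ResNet layer corresponds to a unit step
z_next = x + ds * f(x)

print(torch.allclose(h_next, z_next))  # identical when ds = 1
```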

Figure 1: Evolution of the input data through the depth of the neural ODEs over the learned vector fields.

Neural ODEs chen2018neural represent the latest instance of continuous deep learning models, first developed in the context of continuous recurrent networks cohen1983absolute. Since their introduction, research on neural ODE variants tzen2019neural; jia2019neural; zhang2019anodev2; yildiz2019ode; poli2019graph has progressed at a rapid pace. However, the search for concise explanations and experimental evaluations of novel architectures has left many fundamental questions unanswered.

In this work, we adopt a system–theoretic approach to develop a deeper understanding of neural ODE dynamics. We leverage diagnostic tools for dynamical systems such as visualization of vector fields and Lyapunov exponents oseledec1968multiplicative to provide practical insights that can guide the design of neural ODE architectures. Furthermore, in contrast with what emerges in dupont2019augmented, we provide theoretical and empirical results proving that neural ODEs without state augmentation are capable of learning non–homeomorphic maps when equipped with additional structure.

Our work is structured as follows. In Section 3 we highlight important differences between the vanilla neural ODE formulation and system (2) and introduce a Galerkin–inspired depth–varying parameter model to approximate the true deep limit of residual networks. We discuss how different activation functions shape the vector field and provide insight into the role of universal function approximation theory in neural ODEs. In Section 4 we treat state augmentation and introduce physics–inspired and parameter–efficient alternatives. Moreover, we revisit the image classification results of dupont2019augmented and verify that the proposed augmentation approaches are more effective in terms of performance and computation. In Section 5, we highlight the need to exercise caution when extrapolating empirical insights from low–dimensional to high–dimensional spaces when dealing with dynamical systems, as some of their defining properties do not naturally translate between settings. We prove that state augmentation is not always necessary to learn non–homeomorphic maps. First, we show that depth–varying vector fields alone are sufficient in dimensions greater than one. Then, we propose two strategies for incorporating additional structure into the neural ODE to solve the above–mentioned problem without state augmentation. This results in two new classes of models, namely, data–controlled and adaptive–depth neural ODEs. Finally, in Section 6 we link chaos lorenz1963deterministic with the robustness of neural ODEs. We estimate Lyapunov exponents oseledec1968multiplicative and show that enforcing stability can improve robustness by regularizing chaos.

2 Background and Related Work

We begin with a brief history of classical approaches to continuous and dynamical system–inspired deep learning.

A brief historical note on continuous deep learning

Continuous neural networks have a long history that goes back to continuous–time variants of recurrent networks cohen1983absolute. Since then, several works have explored the connection between dynamical systems, control theory and continuous recurrent networks zhang2014comprehensive, providing stability analyses and introducing delays marcus1989stability. Many of these concepts have yet to resurface in the context of neural ODEs. haber2017stable provided an analysis of ResNet dynamics and linked stability with robustness. Injecting stability into discrete neural networks has inspired the design of a series of architectures chang2019antisymmetricrnn; haber2019imexnet; NIPS2019_8358. hauser2019state explored the algebraic structure of neural networks governed by finite difference equations. Approximating ODEs with neural networks has been discussed in wang1998runge; filici2008neural. On the optimization front, several works leverage dynamical system formalism in continuous time wibisono2016variational; maddison2018hamiltonian; massaroli2019port.

Related work

This work concerns neural ordinary differential equations chen2018neural and a system–theoretic discussion of their dynamical behavior. The main focus is on neural ODEs and not the extensions to other classes of differential equations tzen2019neural; jia2019neural, though the insights developed here can be broadly applied to continuous–depth models. We further develop augmentation dupont2019augmented and robustness through stability hanshu2019robustness of neural ODEs. We note that a discussion of Lyapunov exponents for discrete neural networks is included in li2018analysis.

Notation

We refer to the set of real numbers as $\mathbb{R}$. $\|\cdot\|$ is the norm induced by the inner product of $\mathbb{R}^n$, whose origin is denoted by $\mathbf{0}$. Scalars are denoted as lower–case letters, vectors as bold lower–case and matrices as bold capital letters. Indices of arrays and matrices are reported as superscripts in round brackets. We use $s$ instead of $t$ as the depth variable to generalize the concept of time to depth. Throughout this paper the nested $n$–spheres benchmark task introduced in dupont2019augmented is used extensively. Namely, given $\mathbf{x} \in \mathbb{R}^n$, define

(3)

We consider learning this non–homeomorphic map with neural ODEs, prepending a linear layer to the model. Notice that the map has been slightly modified with respect to dupont2019augmented, so as to be well–defined on its domain. For the one–dimensional case, we will often instead refer to the non–homeomorphic map as the crossing trajectories problem. Unless otherwise stated, all ODEs are solved with the adaptive Dormand–Prince prince1981high method available in the torchdiffeq chen2018neural PyTorch package. In the nested $n$–spheres and crossing trajectories problems, we minimize mean squared error (MSE) losses between model outputs and the target mapping. Additionally, for all experiments, complete information on the implementation is reported in the supplementary materials. The code will be open–sourced after the review phase and is included in the submission.

3 Anatomy of the Neural ODE Formulation

Well–posedness

Without any loss of generality, let $\mathcal{S} = [0, 1]$. Under mild conditions on $f$, namely Lipschitz continuity with respect to $\mathbf{z}$ and uniform continuity with respect to $s$, for each initial condition $\mathbf{z}(0) = \mathbf{x}$ the ODE in (2) admits a unique solution defined on the whole of $\mathcal{S}$. Thus there is a mapping from $\mathbb{R}^n$ to the space of absolutely continuous functions $\mathcal{S} \to \mathbb{R}^n$ assigning to each $\mathbf{x}$ the trajectory $\mathbf{z}(\cdot)$ satisfying the ODE in (2). The output of the neural ODE is then defined as the terminal state of this trajectory, i.e., symbolically,

$\hat{\mathbf{y}} := \mathbf{z}(1) = \mathbf{x} + \int_{0}^{1} f(s, \mathbf{z}(s), \theta(s))\, ds.$

For the sake of compactness of notation, the flow of a neural ODE at an arbitrary depth $s$ is referred to as $\phi_s(\mathbf{x})$.
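As a hedged illustration of this input–flow–output pipeline, the sketch below solves a small neural ODE with the Dormand–Prince solver from the torchdiffeq package mentioned in the Notation paragraph, taking the terminal state of the flow as the model output; the network sizes and depth domain are illustrative assumptions rather than the paper's exact setup.

```python
import torch
import torch.nn as nn
from torchdiffeq import odeint  # adaptive Dormand-Prince solver ('dopri5')

class VectorField(nn.Module):
    """f(s, z): right-hand side of the neural ODE (sizes are illustrative)."""
    def __init__(self, dim=2, hidden=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.Tanh(), nn.Linear(hidden, dim))

    def forward(self, s, z):
        # Depth-invariant vector field: the depth s is accepted but unused here.
        return self.net(z)

f = VectorField()
x = torch.randn(16, 2)                 # batch of inputs = initial conditions z(0)
s_span = torch.tensor([0.0, 1.0])      # depth domain [0, 1]

z_traj = odeint(f, x, s_span, method='dopri5')  # shape: (len(s_span), batch, dim)
y_hat = z_traj[-1]                     # output of the neural ODE: the flow at the final depth
```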

Are neural ODEs networks with infinite layers?

Vanilla neural ODEs, as they appear in the original paper chen2018neural, cannot be considered the deep limit of residual networks (1). In fact, the authors consider the model

(4)   $\dot{\mathbf{z}}(s) = f(s, \mathbf{z}(s), \theta), \quad \mathbf{z}(0) = \mathbf{x}$

where the depth variable $s$ enters the dynamics per se¹ rather than through the map $s \mapsto \theta(s)$. Model (4) is rather the deep limit of residual networks with identical layers. Nevertheless, this simplified formulation allows the optimization of the network parameters with standard gradient–based techniques, differentiating directly through the ODE solver or by adjoint sensitivity analysis pontryagin1962mathematical. In “truly continuous” neural ODEs (2) the network training must be carried out by optimizing $\theta(\cdot)$ in a functional space. While practical implementations of these truly continuous networks do not exist, two alternative finite–dimensional solutions can be used to parametrize $\theta(s)$:

Figure 2: Depth trajectories of the hidden state and the corresponding vector fields for different activation functions in a nonlinear classification task. The models with tanh and ELU outperform the others, as their vector fields are able to steer the hidden state along negative directions.
  • Define a hypernetwork describing the variation of the parameters $\theta(s)$ within the depth domain zhang2019anodev2, governed by another tensor of trainable parameters. This approach redefines neural ODEs as the system

    (5)
  • Use a Galerkin–style approach, performing an eigenfunction expansion over some functional space of mappings from the depth domain to the parameter space:

    where the expansion is taken over the eigenfunctions of the chosen basis with the corresponding coefficients. In Galerkin neural ODEs we approximate $\theta(s)$ with the first modes of the expansion and learn the coefficients of the basis functions with standard methods. This yields

    (6)
    (7)

    and the coefficients become the set of trainable parameters of the network. We refer to this class of neural ODEs as GalNODEs.

In the case of Galerkin neural ODEs, however, we must fix a prior on the functional class to which $\theta(s)$ is restricted. E.g., if $\theta(s)$ is assumed to be a continuous periodic function, we can use the Fourier series. In particular, different assumptions on the problem geometry² determine the optimal class of eigenfunction expansions, such as Chebyshev polynomials, Bessel functions, etc. There exists a significant difference in parameter efficiency between the hypernetwork and Galerkin methods. While hypernetworks generally scale polynomially in the number of parameters of $f$, depending on the multi–layer structure of the hypernetwork, the number of parameters in GalNODEs always scales linearly with the dimension of $\theta$ and the number of modes considered in the series expansion.
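A minimal sketch of a Galerkin–style depth–varying layer, assuming a truncated Fourier basis for the depth–varying weights; the class name, layer structure and mode count are illustrative assumptions, not the paper's exact implementation.

```python
import math
import torch
import torch.nn as nn

class GalerkinLinear(nn.Module):
    """Linear layer whose weights are a truncated Fourier series in the depth s (sketch of Eqs. 6-7)."""
    def __init__(self, in_dim, out_dim, n_modes=5, depth=1.0):
        super().__init__()
        self.in_dim, self.out_dim, self.depth, self.n_modes = in_dim, out_dim, depth, n_modes
        n_params = out_dim * in_dim + out_dim                 # flattened weight matrix + bias
        # Learnable coefficients: one constant term plus a (cos, sin) pair per mode.
        self.coeffs = nn.Parameter(0.1 * torch.randn(2 * n_modes + 1, n_params))

    def theta(self, s):
        # Evaluate theta(s); s is expected to be a scalar tensor (as supplied by the ODE solver).
        basis = [torch.ones(1)]
        for m in range(1, self.n_modes + 1):
            w = 2 * math.pi * m * s / self.depth
            basis += [torch.cos(w).reshape(1), torch.sin(w).reshape(1)]
        return torch.cat(basis) @ self.coeffs                 # shape: (n_params,)

    def forward(self, s, z):
        p = self.theta(s)
        W = p[: self.out_dim * self.in_dim].view(self.out_dim, self.in_dim)
        b = p[self.out_dim * self.in_dim:]
        return z @ W.t() + b

# A GalNODE-style vector field can stack such layers and be passed to the solver as before.
gal1, gal2 = GalerkinLinear(2, 32), GalerkinLinear(32, 2)
f = lambda s, z: gal2(s, torch.tanh(gal1(s, z)))
```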

Neural ODEs as universal approximators

Vanilla neural ODEs are not, in general, universal function approximators (UFAs) zhang2019approximation. Despite some recent works on the topic zhang2019approximation; li2019deep, this apparent limitation is still not well–understood in the context of continuous–depth models. When neural ODEs are employed as general–purpose black–box modules, some assurances on the approximation capabilities of the model are necessary. zhang2019approximation noticed that a depth–invariant augmented neural ODE

(8)   $\begin{cases} \dot{\mathbf{z}}(s) = \mathbf{0} \\ \dot{\mathbf{a}}(s) = h(\mathbf{z}(s)) \end{cases} \qquad \begin{bmatrix}\mathbf{z}(0)\\ \mathbf{a}(0)\end{bmatrix} = \begin{bmatrix}\mathbf{x}\\ \mathbf{0}\end{bmatrix}$

where the output is picked as $\mathbf{a}(1)$, can approximate any function provided that the neural network $h$ is an approximator of it, since $\mathbf{a}(1) = \int_0^1 h(\mathbf{z}(s))\,ds = h(\mathbf{x})$, mimicking the mapping $\mathbf{x} \mapsto h(\mathbf{x})$. Although this simple result is not sufficient to provide a constructive blueprint for the design of neural ODE models, it suggests the following (open) questions:

  • Why should we use a neural ODE if its vector field can solve the approximation problem as a standalone neural network?

  • Can neural ODEs be UFAs with non-UFA vector fields?

On the other hand, if neural ODEs are used in a scientific machine learning context rackauckas2020universal, requiring a UFA neural network to parametrize the vector field provides the model with the ability to approximate arbitrary dynamical systems.

Mind your activation

When using a neural ODE within structured architectures, we have to be especially careful with the choice of activation function in the last layer of the network parametrizing the vector field. In fact, the chosen nonlinearity strongly affects the “shape” of the vector field and, as a consequence, the flows learnable by the model. Therefore, when designing the vector field as a multi–layer neural network, it is generally advisable to append a linear layer to maximize the expressiveness of the underlying vector field. In some applications, conditioning the vector field (and thus the flows) with a specific nonlinearity can be desirable, e.g., when there exist priors on the desired transformation.

In Figure 2 we see how different activation functions in the last layer of the vector field network condition the vector fields and the depth evolution of the hidden state in the classification of nonlinearly separable data. It is worth noticing that the best–performing models are the ones with hyperbolic tangent (tanh) and ELU clevert2015fast, as their vector fields can assume both positive and negative values and can thus “force” the hidden state in different directions. On the other hand, with sigmoid, ReLU or softplus zheng2015improving, the vector field is nonnegative in all directions and thus has limited freedom.
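As a concrete illustration of this design choice, the snippet below builds the same vector field network with different final nonlinearities; with a non–negative final activation, every component of the output is non–negative, restricting the directions along which the flow can move the state. Sizes and names are illustrative.

```python
import torch.nn as nn

def vector_field(dim=2, hidden=16, final_activation=None):
    layers = [nn.Linear(dim, hidden), nn.Tanh(), nn.Linear(hidden, dim)]
    if final_activation is not None:          # e.g. nn.Tanh(), nn.Softplus(), nn.ReLU()
        layers.append(final_activation)
    return nn.Sequential(*layers)

f_linear   = vector_field()                                 # recommended default: linear last layer
f_tanh     = vector_field(final_activation=nn.Tanh())       # bounded, sign-indefinite outputs
f_softplus = vector_field(final_activation=nn.Softplus())   # strictly positive outputs
```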

4 Augmenting Neural ODEs

Augmented neural ODEs (ANODEs) dupont2019augmented propose solving the initial value problem (IVP) in a higher–dimensional space to limit the complexity of the learned flows and therefore reduce the computational burden of the numerical integration scheme. The proposed approach relies on augmenting the baseline state vector with another vector $\mathbf{a}$, initialized at $\mathbf{0}$:

(9)   $\frac{d}{ds}\begin{bmatrix}\mathbf{z}(s)\\ \mathbf{a}(s)\end{bmatrix} = f\left(s, \begin{bmatrix}\mathbf{z}(s)\\ \mathbf{a}(s)\end{bmatrix}, \theta\right), \qquad \begin{bmatrix}\mathbf{z}(0)\\ \mathbf{a}(0)\end{bmatrix} = \begin{bmatrix}\mathbf{x}\\ \mathbf{0}\end{bmatrix}$

where, in this case, the vector field operates on the augmented state space. We will henceforth refer to this augmentation strategy as 0–augmentation.

In this section we discuss alternative augmentation strategies for neural ODEs that match or improve on 0–augmentation in terms of performance or parameter efficiency.

Wide neural ODEs

Determining the increase in deep neural network capacity as a function of layer width has been the focal topic of several analyses lu2017expressive. In particular, increasing width is known to improve performance in fully–connected architectures as well as convolutional networks zagoruyko2016wide, provided the application at hand benefits from the added model capacity. Following this standard deep learning approach, 0–augmentation can be generalized as follows:

(10)   $\dot{\mathbf{z}}(s) = f(s, \mathbf{z}(s), \theta(s)), \qquad \mathbf{z}(0) = h_x(\mathbf{x})$

where $h_x$ is a generic neural network lifting the input to the augmented state space. System (10) gives the model more freedom in determining the initial condition for the IVP, instead of constraining it to a concatenation of $\mathbf{x}$ and $\mathbf{0}$, at a small parameter cost. We refer to this type of augmentation as input layer (IL) augmentation and to the model as IL–neural ODE (IL–NODE). In applications where maintaining the structure of the original, non–augmented dimensions is important, e.g. approximation of dynamical systems, a parameter–efficient alternative of (10) can be obtained by modifying the input network to only affect the additional dimensions.
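A hedged sketch contrasting the two strategies, using torchdiffeq and illustrative dimensions; the variable names and sizes are assumptions, not the paper's code.

```python
import torch
import torch.nn as nn
from torchdiffeq import odeint

n_x, n_a = 2, 3                          # data dimension and number of augmented dimensions (illustrative)
f = nn.Sequential(nn.Linear(n_x + n_a, 32), nn.Tanh(), nn.Linear(32, n_x + n_a))
ode_rhs = lambda s, z: f(z)              # depth-invariant vector field on the augmented space

x = torch.randn(16, n_x)
s_span = torch.tensor([0.0, 1.0])

# 0-augmentation (ANODE, Eq. 9): concatenate zeros to the input.
z0_anode = torch.cat([x, torch.zeros(x.size(0), n_a)], dim=1)
y_anode = odeint(ode_rhs, z0_anode, s_span)[-1]

# Input-layer augmentation (IL-NODE, Eq. 10): a learned map picks the initial condition.
h_x = nn.Linear(n_x, n_x + n_a)          # here a single linear layer, as in the MNIST experiments
z0_il = h_x(x)
y_il = odeint(ode_rhs, z0_il, s_span)[-1]
```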

Higher–order neural ODEs

Further parameter efficiency can be achieved by lifting the neural ODE to higher orders. For example, consider a second–order neural ODE of the form:

(11)   $\ddot{\mathbf{z}}(s) = f(s, \mathbf{z}(s), \dot{\mathbf{z}}(s), \theta)$

By setting $\mathbf{q}(s) := \dot{\mathbf{z}}(s)$, (11) is equivalent to the first–order system

(12)   $\begin{cases} \dot{\mathbf{z}}(s) = \mathbf{q}(s) \\ \dot{\mathbf{q}}(s) = f(s, \mathbf{z}(s), \mathbf{q}(s), \theta) \end{cases}$

The above can be extended to $k$–th order neural ODEs as

(13)   $\mathbf{z}^{(k)}(s) = f\big(s, \mathbf{z}(s), \dot{\mathbf{z}}(s), \dots, \mathbf{z}^{(k-1)}(s), \theta\big)$

A limitation of system (12) is that a naive extension to second order requires a number of augmented dimensions equal to the state dimension. To allow for flexible augmentations of just a few dimensions, the formulation of second–order neural ODEs can be modified as follows. We can decide to give second–order dynamics only to the first few states, while the dynamics of the remaining states is left unchanged. This approach yields

(14)

where the dimensions of the second–order and first–order blocks sum to the total state dimension. Both higher–order and selective higher–order neural ODEs are compatible with input layer augmentation.
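A possible implementation of the first–order rewriting (12), where the network outputs only the velocity dynamics and the state stacks positions and velocities; initializing the velocities at zero is one possible choice used here for illustration, not prescribed by the paper.

```python
import torch
import torch.nn as nn
from torchdiffeq import odeint

class SecondOrderField(nn.Module):
    """State is [z, q] with q = dz/ds; the network predicts dq/ds only (sketch of Eq. 12)."""
    def __init__(self, dim=2, hidden=32):
        super().__init__()
        self.dim = dim
        self.net = nn.Sequential(nn.Linear(2 * dim, hidden), nn.Tanh(), nn.Linear(hidden, dim))

    def forward(self, s, state):
        z, q = state[:, : self.dim], state[:, self.dim:]
        dz = q                               # position dynamics: dz/ds = q
        dq = self.net(state)                 # velocity dynamics: dq/ds = f(z, q)
        return torch.cat([dz, dq], dim=1)

x = torch.randn(16, 2)
state0 = torch.cat([x, torch.zeros_like(x)], dim=1)   # initial velocities set to zero (one possible choice)
sol = odeint(SecondOrderField(), state0, torch.tensor([0.0, 1.0]))
z_S = sol[-1][:, :2]                                   # keep only the positions, as in the experiments
```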

Augmenting convolution and graph based architectures

In the case of convolutional neural network (CNN) or graph neural network (GNN) architectures, augmentation can be performed along different dimensions, i.e. channel, height and width, or similarly node features and number of nodes. The most physically consistent approach, employed in dupont2019augmented for CNNs, is augmenting along the channel dimension, equivalent to providing each pixel in the image with additional states. By viewing an image as a lattice graph, the generalization to GNN–based neural ODEs poli2019graph operating on arbitrary graphs can be achieved by augmenting each node feature with additional states.

NODE ANODE IL–NODE 2nd
Test acc.
NFE
Param.
Table 1: Test results across 5 seeded runs on MNIST (mean and standard deviation). We report the mean NFE at convergence.

Revisiting MNIST results for augmented neural ODEs

In higher–dimensional state spaces, such as those of image classification settings, the benefits of augmentation become subtle and manifest as performance improvements and a lower number of function evaluations (NFE) chen2018neural. We revisit the MNIST experiments of dupont2019augmented and evaluate four classes of depth–invariant neural ODEs: namely, vanilla (no augmentation), ANODE (0–augmentation), IL–NODE (input–layer augmentation), and 2nd order (input–layer augmentation). For a fair comparison, the CNNs parametrizing the vector fields of all augmented neural ODEs take as input 6–channel images: in the case of ANODEs this is achieved by augmenting with 5 additional channels, whereas for IL–NODEs and 2nd order NODEs a linear convolutional layer is used to lift the channel dimension from 1 to 6. All CNN architectures, aside from vanilla neural ODEs, as well as hyperparameter setups, match those of dupont2019augmented. The input augmentation network is composed of a single, linear layer. The results of 5 experiments are reported in Table 1. IL–NODEs consistently achieve lower NFEs and higher performance than the other variants, whereas second–order neural ODEs offer the best relative parameter efficiency.

It should be noted that prepending an input multi–layer neural network to the neural ODE was the approach chosen in the experimental evaluations of the original neural ODE paper chen2018neural, and that dupont2019augmented decided to forego this approach in favor of a comparison between no input layer and 0–augmentation. However, a significant difference exists between architectures depending on the depth and expressivity of the input network. Indeed, utilizing non–linear and multi–layer input networks can be detrimental, as discussed in Sec. 5. Empirical evaluations on MNIST show that an augmentation through a single linear layer is sufficient to relieve vanilla neural ODEs of their limitations and achieve the best performance.

5 Augmentation is not always necessary

Blessing of dimensionality

Let us consider a worst case scenario for vanilla neural ODEs, e.g., the pathological case of function (3). Is it always necessary to augment dimensions to solve “hard” problems? Although the answer is no, we must be cautious when extrapolating properties from examples in different (low) dimensions. In fact, dynamics in the first two (spatial) dimensions are substantially different from those in higher ones, e.g. no chaotic behaviors are possible khalil2002nonlinear. For example, in one dimension, it is correct to state that “two distinct trajectories can never intersect in the state–space” due to the “narrowness” of $\mathbb{R}$. However, in the “infinitely wider” $\mathbb{R}^2$ (and so in $\mathbb{R}^n$, $n \geq 2$), distinct trajectories of a time–varying process can well intersect in the state–space, provided that they do not pass through the same point at the same time. In turn, this implies that depth–varying models can solve problems such as the nested spheres problem in all dimensions but one.

The above–described phenomenon is deeply linked to the curse of dimensionality, which refers to the exponential increase in complexity known to stem from a transition to higher–dimensional problems friedman1997bias. When dealing with finite datasets, however, the curse turns into a blessing: the number of data points required to approximate a full cover of the surface of the $n$–sphere, for example, grows exponentially in $n$. Moreover, the specific data configuration required for the mapping to approximate a true non–homeomorphism becomes less and less likely as the dimension increases. As a result, the theoretical limitation of depth–invariant neural ODEs turns out, in practice, to have only a mild effect on higher–dimensional problems.

Hereafter, starting from the one–dimensional case, we propose new classes of models allowing neural ODEs to perform “difficult tasks” without the need for any augmentation.

Figure 3: Depth trajectories over the vector field of the data–controlled neural ODE (16). As the vector field depends on the input data, the model is able to learn the non–homeomorphic mapping.

Data–controlled neural ODEs

We hereby derive a new class of neural ODEs, namely data–controlled neural ODEs. To introduce the proposed approach, we start with an analytical result for the approximation of the crossing trajectories map. We show how a simple handcrafted ODE can approximate it with arbitrary accuracy by passing the input data to the vector field. The aim is to highlight that, with proper additional structure, neural ODEs can perform non–homeomorphic mappings without augmentation, contrary to what emerges from dupont2019augmented. The result is the following:

Theorem 1

For all $\epsilon > 0$, there exists a parameter $\theta$ such that

(15)

where $z(s)$ is the solution of the neural ODE

(16)

The proof is reported in the appendix. Figure 3 shows how model (16) is able to approximate the target map when trained with standard backpropagation.

Figure 4: Depth trajectories over the vector field of the adaptive–depth neural ODE. The mapping can be learned by the proposed model: the key is to assign different integration times to the inputs, thus not requiring trajectories to intersect.

From this hand–crafted model, we can define the general data–controlled neural ODE by incorporating the input data $\mathbf{x}$ into the vector field, i.e.

(17)   $\dot{\mathbf{z}}(s) = f(s, \mathbf{z}(s), \mathbf{x}, \theta(s)), \quad \mathbf{z}(0) = \mathbf{x}$

where the conditioning on $\mathbf{x}$ can be realized in any way. In particular, in this work we consider $\mathbf{z}$ and the input $\mathbf{x}$ to be concatenated and subsequently fed to $f$. Further experimental results with the latter general model on the representation of the crossing trajectories map are reported in the appendix. The main advantage of the proposed approach is that we can condition the vector field on $\mathbf{x}$, effectively learning ad–hoc dynamics for each input data point. Although this may seem to lead to overfitting, a later experiment shows that this is not the case, as data–controlled models exhibit the best generalization performance.
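A minimal sketch of the concatenation–based conditioning described above (not the paper's exact code); the field stores the current batch of inputs and appends it to the state at every function evaluation. Sizes and names are illustrative.

```python
import torch
import torch.nn as nn
from torchdiffeq import odeint

class DataControlledField(nn.Module):
    """Vector field conditioned on the input x by concatenation, in the spirit of Eq. (17)."""
    def __init__(self, dim=1, hidden=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2 * dim, hidden), nn.Tanh(), nn.Linear(hidden, dim))
        self.x = None                          # set before each solve

    def forward(self, s, z):
        return self.net(torch.cat([z, self.x], dim=1))

f = DataControlledField()
x = torch.linspace(-1.0, 1.0, 16).unsqueeze(1)   # e.g. 1-D crossing-trajectories inputs
f.x = x                                          # condition the field on the batch of inputs
z_S = odeint(f, x, torch.tensor([0.0, 1.0]))[-1]
```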

Figure 5: [Above] Depth-flows of the data in the state–space. [Below] Resulting decision boundaries.
Figure 6: Solving nested 2–spheres without augmentation by prepending a non–linear transformation performed by a 2–layer fully–connected network.

Adaptive depth neural ODEs

Let us come back to the approximation of the crossing trajectories map. Indeed, without incorporating the input data into the vector field, it is not possible to realize a mapping mimicking it, due to the topology–preserving property of the flows. Nevertheless, a neural ODE can be employed to approximate the map without the need for any crossing trajectory. In fact, if each input is integrated in a different depth domain, it is possible to learn the map without crossing flows, as shown in Figure 4. In general, we can use a hypernetwork trained to learn the integration depth of each sample. In this setting, we define the general adaptive–depth class as neural ODEs performing the mapping $\mathbf{x} \mapsto \phi_{g(\mathbf{x},\, \omega)}(\mathbf{x})$, i.e.

(18)

where $g$ is a neural network, $\omega$ are its trainable parameters and $\phi$ is the flow of any neural ODE architecture.
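A hedged sketch of the forward pass of an adaptive–depth model: a hypothetical depth network g assigns each sample its own integration depth, and each sample is integrated separately for simplicity. The "sum to one, then absolute value" construction follows the appendix description; everything else is an illustrative assumption.

```python
import torch
import torch.nn as nn
from torchdiffeq import odeint

f = nn.Sequential(nn.Linear(1, 8), nn.Tanh(), nn.Linear(8, 1))
rhs = lambda s, z: f(z)

# Hypothetical depth network g: maps each input to its own positive integration depth.
g = nn.Sequential(nn.Linear(1, 8), nn.ReLU(), nn.Linear(8, 1))
depth = lambda x: torch.abs(1.0 + g(x))          # output summed to one, then absolute value

x = torch.tensor([[1.0], [-1.0]])
outputs = []
for xi, si in zip(x, depth(x)):                  # naive per-sample solve; batching is omitted for clarity
    t_span = torch.stack([torch.zeros(()), si.squeeze()])
    zi = odeint(rhs, xi.unsqueeze(0), t_span)[-1]
    outputs.append(zi)
y_hat = torch.cat(outputs)                       # each sample evolved for its own depth
```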

Qualitative study of non–augmented neural ODE variants

We qualitatively inspect the performance of different neural ODE variants: depth–invariant, depth–variant, Galerkin (GalNODE) and data–controlled. We utilize a nested 2–spheres problem in which all data points are available during training. Models are then qualitatively evaluated based on the complexity of the learned flows and on how accurately they extrapolate to unseen points in the continuous space of the 2–spheres, i.e. the learned decision boundaries. The training is carried out utilizing Adam kingma2014adam; complete hyperparameters are reported in the supplementary materials. The models share an architecture comprising 2 hidden, fully–connected layers. We choose a Fourier series as the eigenfunction basis to compute the parameters of the GalNODE as described in (6).

Figure 5 shows how data–controlled neural ODEs accurately extrapolate outside of the training data with the best performance in terms of NFEs, and thus the simplest learned flows. Conversely, depth–variant models tend to overfit, yielding lower training losses but less precise boundaries. Finally, depth–invariant neural ODEs are able to squeeze through the gaps and solve the problem, at a significant NFE cost.

Mind your input networks

An alternative approach to learning non–homeomorphic maps involves solving the ODE in a latent state space. Figure 6 shows that, with no augmentation, a network composed of 2 fully–connected layers with non–linear activation followed by a neural ODE can solve the nested 2–spheres problem. However, the flows learned by the neural ODE are superfluous: indeed, the clusters were already linearly separable after the first non–linear transformation. This example warns against superficial evaluations of neural ODE architectures preceded or followed by several layers of non–linear input and output transformations. In these scenarios, the learned flows risk performing unnecessary transformations and in pathological cases can collapse into a simple identity map. To sidestep these issues, we propose visually inspecting trajectories or performing an ablation experiment on the neural ODE block.

6 Chaos in Neural ODEs

Blending deep learning with system–theory provides a variety of well–studied diagnostic tools for dynamical systems capable of shining light on the behavior of neural differential equations. Here, we discuss how establishing a connection between Lyapunov exponents (LEs) oseledec1968multiplicative and neural ODEs provides insight into their robustness and can guide the discovery of alternative regularization techniques.

Lyapunov exponents for Neural ODEs

Chaos in dynamical systems is often described as sensitivity to perturbations of the initial condition. In system theory, the study of chaos has a long history lorenz1963deterministic; shaw1981strange, particularly concerning its detection.

Intuitively, LEs describe how different phase–space directions contract or expand throughout the trajectory: positive exponents are a strong indicator of chaos benettin1980lyapunov. Computing the full spectrum of LEs is challenging and can be accomplished analytically only if certain conditions are met wolf1985determining. However, the largest LE alone provides an upper bound on the exponential rate of divergence of trajectories with $\epsilon$–close initial conditions:

(19)   $\|\delta\mathbf{z}(s)\| \lesssim \|\delta\mathbf{z}(0)\|\, e^{\lambda_{\max} s}$

We employ a simplified³ version of the non–negative LE estimation algorithm proposed in wolf1985determining to investigate the exponential divergence of nearby trajectories, described in Algorithm 1. We denote by a uniform $\epsilon$–spherical distribution centered at a point one whose probability density is constant on the surface of the corresponding sphere and null everywhere else. Algorithm 1 is specified for a single data point for clarity; the extension to arbitrary batch size is straightforward.

1:Input: $\mathbf{z}_0$ initial condition, $T$ integration time, $\epsilon$ radius
2:$\tilde{\mathbf{z}}_0 \leftarrow$ sample from the $\epsilon$–spherical uniform centered at $\mathbf{z}_0$
3:$\mathbf{z}(T) \leftarrow \mathrm{ODESolve}(f, \mathbf{z}_0, T)$
4:$\tilde{\mathbf{z}}(T) \leftarrow \mathrm{ODESolve}(f, \tilde{\mathbf{z}}_0, T)$
5:$d \leftarrow \|\mathbf{z}(T) - \tilde{\mathbf{z}}(T)\|$   (final separation of the two trajectories)
6:$\lambda \leftarrow \frac{1}{T}\log\frac{d}{\epsilon}$   (LE estimation)
7:Return: $\lambda$, $d$
Algorithm 1 LE estimation for neural ODEs
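A possible Python implementation of the estimator for a trained vector field f accepting a batched state; the default hyperparameter values are illustrative assumptions.

```python
import torch
from torchdiffeq import odeint

def estimate_largest_le(f, z0, T=1.0, eps=1e-3):
    """Largest-LE estimate for a single initial condition z0 of shape (dim,)."""
    # Sample a perturbation uniformly on the eps-sphere around z0.
    delta = torch.randn_like(z0)
    delta = eps * delta / delta.norm()
    z0_pert = z0 + delta

    s_span = torch.tensor([0.0, T])
    zT = odeint(f, z0.unsqueeze(0), s_span)[-1].squeeze(0)           # nominal trajectory
    zT_pert = odeint(f, z0_pert.unsqueeze(0), s_span)[-1].squeeze(0) # perturbed trajectory

    d = (zT - zT_pert).norm()               # final separation of the two trajectories
    le = torch.log(d / eps) / T             # exponential divergence rate estimate
    return le.item(), d.item()
```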

Stabilizing via regularization terms

The concepts of stability and chaos can be used to regularize neural ODEs through a variety of additional terms. hanshu2019robustness proposes minimizing a loss term:

(20)

to achieve stability, where the sum is taken over the batch. However, (20) requires integration beyond the nominal depth domain. Inspired by (19), a simple alternative stabilizing regularization term can be considered at no significant additional computational cost:

(21)

which penalizes terminal states that are not fixed points of the vector field. We measure the exponential divergence of $\epsilon$–close trajectories for depth–invariant, depth–variant and stable neural ODEs regularized with (21), trained to solve the nested 3–spheres problem. To do so, Algorithm 1 is employed. The results are shown in Figure 7: stability leads to smaller LEs and non–diverging error dynamics.
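A sketch of how such a term can be added to the training loss, under the assumption that (21) penalizes the norm of the vector field evaluated at the terminal states of the batch (the exact form of (21) is not reproduced here); the weighting and names are illustrative.

```python
import torch

def stability_regularizer(f, z_final, S=1.0):
    """Penalize non-fixed terminal states: mean_i ||f(S, z_i(S))||^2 (assumed form of Eq. 21)."""
    s = torch.tensor(S)
    return (f(s, z_final) ** 2).sum(dim=1).mean()

# Usage sketch: total_loss = task_loss + lam * stability_regularizer(f, z_traj[-1])
```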

Are Neural ODEs chaotic systems?

Estimated non–negative LEs suggest that neural ODEs may exhibit chaotic dynamics. It should be noted that Algorithm 1 involves the choice of several hyperparameters and can underestimate LEs when the chosen evolution time is too large wolf1985determining. However, the obtained values are significantly larger than those observed in other well–known chaotic dynamical systems, e.g. the Lorenz attractor lorenz1963deterministic. The above discussion confirms recent results on the adversarial robustness of neural ODEs hanshu2019robustness, as stability is observed to act as a chaos regularizer and therefore to reduce sensitivity to perturbations. Future extensions include alternative chaos metrics and additional regularization terms. Moreover, it may be possible to constructively design a non–chaotic, and therefore robust, neural ODE.

Figure 7: Evaluating sensitivity to perturbations of neural ODEs. Stability is observed to regularize chaos. LEs are estimated for each data point individually.

7 Conclusion

In this work, we develop a system–theoretic perspective on the dynamical behavior of neural ODEs. With the aim of shining light on fundamental questions regarding depth–variance, state augmentation and robustness, we provide theoretical and empirical insights with practical implications for architecture design. The discussion covers the choice of input layers preceding neural ODE blocks, the vector–field–shaping effect of different activation functions and the important differences between low–dimensional and high–dimensional flows. Moreover, we introduce novel variants of neural ODEs: namely, a Galerkin–inspired depth–variant model, higher–order and data–controlled neural ODEs. Finally, we prove that neural ODEs can learn non–homeomorphic maps without augmentation and we link chaos to robustness via Lyapunov exponents and a novel regularization term.

References

Appendix A Proof of Theorem 1

Analytic solution for crossing trajectories

Proof 1

The general solution of (16) is

Indeed it holds: . Thus,

It follows that

Appendix B Further Experimental Results

Computational resources

The experiments were carried out on a cluster of 2x12GB NVIDIA® Titan Xp GPUs with CUDA 10.1. All neural ODEs were trained on GPU using the torchdiffeq [chen2018neural] PyTorch package.

General experimental setup

We report here information and hyperparameters that are shared across several experiments. For the nested $n$–spheres problem we use data points evenly split between the two classes. In general, the models are trained using Adam [kingma2014adam] with no learning rate scheduling. The adjoint sensitivity method is used only in the MNIST experiments, whereas we choose backpropagation through the numerical solver for all the other empirical evaluations. This choice is motivated by the need for smaller numerical errors on the gradient, necessary to learn the more complex and dynamic flows often required by depth–invariant neural ODE variants. All neural ODEs are solved numerically via the Dormand–Prince method [prince1981high], with fixed relative and absolute tolerances (one setting for MNIST classification and another for all other cases).

Figure 8: Decision boundaries learned by the vector field of a neural ODE are directly conditioned by the choice of activation function.

Concat refers to depth–variant neural ODE variants where the depth variable is concatenated to the input of the vector field, as done in [chen2018neural].

Figure 9: Training curves on MNIST for the different augmentation variants of NODEs. Shaded area indicates the 1 standard deviation interval.
Vanilla ANODE
Layer In dim. Out dim. In dim. Out dim. Activation
in layer None
–1 ReLU
–2 64 64 ReLU
–3 None
IL–NODE 2nd
Layer In dim. Out dim. In dim. Out dim. Activation
in layer None
–1 ReLU
–2 64 64 ReLU
–3 None
Table 2: Channel dimensions across the architectures used for the MNIST experiments. The output is passed to a single linear layer to obtain classification probabilities for the digit classes. For 2nd, the CNN is only tasked with computing the velocity vector field and thus has smaller output dimensions.

b.1 Section 3

Effects of activations

In order to compare the effect of different activation functions in the last layer of the vector field network, we set up a nonlinear classification task with the half–moons dataset. For the sake of completeness, we selected activations of different “types”, i.e.,

  • tanh: bounded;

  • sigmoid: bounded, non–negative output;

  • ReLU/softplus: unbounded, non–negative output;

  • ELU: positively unbounded and negatively bounded.

We utilize the entire half–moons dataset for both training and evaluation, since the experiment aims at delivering a qualitative description of the learned vector fields. The vector field has been selected as a multilayer perceptron with two hidden layers of 16 neurons each. The training has been carried out using the Adam [kingma2014adam] optimizer with weight decay.

Figure 8 shows how different activation functions shape the vector field and as a result the decision boundary.

b.2 Section 4

MNIST experiments

We closely follow the experimental setup of [dupont2019augmented], using the Adam optimizer and 3–layer depth–invariant CNNs to parametrize the vector fields. The choice of depth–invariance is motivated by the discussion carried out in Section 5: both augmentation and depth–variance can relieve the approximation limitations of vanilla, depth–invariant neural ODEs. As a result, including both renders the ablation study for augmentation strategies less accurate.

For input–layer augmented neural ODE models, namely IL–NODE and 2nd order, we prepend to the neural ODE a single, linear CNN layer. The hidden channel dimension of the CNN parametrizing the vector field is kept fixed. Second–order neural ODEs (2nd) use the CNN only to compute the velocity vector field: therefore, its output has fewer channels, and the remaining outputs to concatenate (the vector field of positions) are obtained as the last elements of the state, following (12). Architectures are shown in detail in Table 2.

Figure 9 highlights how vanilla neural ODEs are capable of converging without any spikes in loss or NFEs. We speculate that the numerical issues encountered in [dupont2019augmented] are a consequence of the specific neural network architecture used to parametrize the vector field, which employed an excessive number of channels.

Augmentation experiments on nested 2-spheres

We qualitatively compare ANODE with second–order neural ODEs (2nd) on the nested 2–spheres problem to verify whether the proposed second–order model is capable of learning simple, non–stiff flows in a more parameter–efficient manner. Both models have the vector field parametrized by a neural network with hidden layers of 16 neurons. Moreover, for both models, only the first two “non–augmented” hidden states are retained and fed to the linear classification layer following the neural ODE. The training is carried out by backpropagating directly through the solver. Figures 10 and 11 show the flows of states and augmented states for both models.

(a) Flows of the hidden state
(b) Flows of the augmented dimensions
Figure 10: Second Order versus ANODE. Trajectories in the respective hidden spaces of the data.
Figure 11: Depth evolution of the hidden and augmented dimensions.

b.3 Section 5

Experiments on crossing trajectories

We trained both state–of–the–art and proposed models to learn the crossing trajectories map. We created a training dataset by sampling points equally spaced in the input domain. The models have been trained to minimize L1 losses using Adam [kingma2014adam] with weight decay for 1000 epochs, using the whole batch.

  • Vanilla neural ODEs We trained vanilla neural ODEs, i.e. both depth–invariant and depth–variant models (“concat” and GalNODE). As expected, these models cannot approximate the map due to its non–homeomorphic nature. The depth–invariant and concat models have been selected with two hidden layers of 16 and 32 neurons each, respectively, and tanh activation. The GalNODE has been designed with one hidden layer of 32 neurons whose depth–varying weights are parametrized by a Fourier series of five modes. The resulting trajectories over the learned vector fields are shown in Fig. 12.

    Figure 12: Depth evolution over the learned vector fields of the standard models: depth–invariant and depth–variant (“concat” and GalNODE). As expected, these neural ODEs cannot approximate the non–homeomorphic mapping.
  • Data–controlled neural ODEs We evaluate both the handcrafted linear depth–invariant model (16) and the general formulation of data–controlled models (17), realized with two hidden layers of 32 neurons each and tanh activation in all layers but the output. Note that the loss of the handcrafted model turns out to be convex and continuously differentiable. Moreover, Proof 1 analytically provides a lower bound on the model parameter ensuring the loss is upper–bounded by any desired tolerance, making its training superfluous. Nevertheless, we provide results with a trained version to show that the benefits of data–controlled neural ODEs are compatible with gradient–based learning.

    Figure 13: Depth evolution over the learned vector fields of the data–controlled models: the handcrafted model (16) and the general formulation (17).

    The results are shown in Figs. 12 and 13. The input data information embedded into the vector field allows the neural ODE to steer the hidden state towards the desired label through its continuous depth. Data–controlled neural ODEs can be used to learn non–homeomorphic mappings without augmentation.

  • Adaptive depth neural ODEs The experiments have been carried out with a depth–variant neural ODE in “concat” style, whose vector field was parametrized by a neural network with two hidden layers of 8 units. Moreover, the function computing the data–adaptive depth of the neural ODE was composed of a neural network with one hidden layer (8 neurons and ReLU activation) whose output is summed to one and then taken in absolute value.

    In particular, the summation to one has been employed to help the network “sparsify” the learned integration depths and avoid highly stiff vector fields, while the absolute value is needed to avoid infeasible integration intervals. The training results can be visualized in Fig. 14. It is worth noting that these early results should be intended as a new research direction for neural ODEs rather than a definitive evaluation of the proposed method, which is out of the scope of this paper and postponed to future work. Furthermore, for the sake of simplicity, the result of Fig. 4 shown in the main text has been obtained by training the model only on part of the data and setting the remaining integration depths by hand.

Figure 14: Evolution of the input data through the depth of the neural ODEs

Qualitative study of non–augmented neural ODE variants

Further experimental results on the performance of the different neural ODEs on the nested 2–spheres task are reported here. In particular, Fig. 15 shows the depth evolution of the data for the different architectures, and Fig. 16 provides an example of the weight functions learned by the GalNODE. Finally, Fig. 17 is an extended version of Fig. 1, displaying how the input data are transported by the vector fields through the neural ODEs’ depth towards linearly separable manifolds.

Figure 15: Evolution of the input data through the depth of the neural ODEs
Figure 16: Weights functions learned by the GalNODE for the nested 2–spheres task.

Experiments on design of input networks for neural ODEs

We tackle the nested 2–spheres task with a neural ODE preceded by a simple 2–layer neural network with ReLU activation on the first layer. The second layer is linear.

b.4 Section 6

Definition of Lyapunov exponents

Given a dynamical system, its $i$–th Lyapunov exponent measures the local deformation of an infinitesimal $n$–sphere of initial conditions [wolf1985determining]:

(22)   $\lambda_i = \lim_{s \to \infty} \frac{1}{s} \log \frac{p_i(s)}{p_i(0)}$

where $p_i(s)$ denotes the length of the $i$–th ellipsoidal principal axis.

Estimation of Lyapunov exponents

To estimate the largest Lyapunov exponents of neural ODEs, we employ Algorithm 1 on three neural ODE variants: depth–invariant, depth–variant (“concat”) and depth–variant regularized via (21). The neural ODEs are trained on the nested 3–spheres problem following the general experimental setup. The neural network parametrizing the vector field is comprised of two layers of 16 units each.

Figure 17: Depth evolution of the learned vector fields

Footnotes

  1. In practice, the depth variable $s$ is concatenated to the state $\mathbf{z}(s)$ and fed to the network.
  2. Since the basis functions are defined on the depth domain, if the domain is scaled or parametrized differently, the result changes.
  3. The complete version for longer time series involves replacements in case the error grows above a specified threshold.