Approximation Capabilities of Neural ODEs and Invertible Residual Networks
Abstract
Neural ODEs and iResNet are recently proposed methods for enforcing invertibility of residual neural models. Having a generic technique for constructing invertible models can open new avenues for advances in learning systems, but so far the question of whether Neural ODEs and iResNets can model any continuous invertible function remained unresolved. Here, we show that both of these models are limited in their approximation capabilities. We then prove that any homeomorphism on a dimensional Euclidean space can be approximated by a Neural ODE operating on a dimensional Euclidean space, and a similar result for iResNets. We conclude by showing that capping a Neural ODE or an iResNet with a single linear layer is sufficient to turn the model into a universal approximator for noninvertible continuous functions.
8.5in \setlength11in
1 Introduction
A neural network block is a function that maps an input vector to output vector , and is parameterized by a weight vector . We require that is almost everywhere differentiable with respect to both of its arguments, allowing the use of gradient methods for tuning based on training data and an optimization criterion, and for passing the gradient to preceding network blocks.
One type of neural building blocks that has received attention in recent years is a residual block [HZRS16], where , with being some differentiable, nonlinear, possibly multilayer transformation. Input and output dimensionality of a residual block are the same, , and such blocks are usually stacked in a sequence, a ResNet, . Often, the functional form of is the same for all blocks in the sequence. Then, we can represent the sequence through , where consists of trainable parameters for all blocks in the sequence; the second argument, , allows us to pick the proper subset of parameters, . If we allow arbitrary , for example a neural network with input and output dimensionality but with many hidden layers of dimensionality higher than , a sequence of residual blocks can, in principle, model arbitrary mappings , where we define to be the result of applying the sequence of residual blocks to the initial input . For example, a linear layer preceded by a deep sequence of residual blocks is a universal approximator for Lebesgue integrable functions [LJ18].
Recently, models arising from residual blocks have gained attention as a means to construct invertible networks; that is, training a network results in mappings for which an inverse mapping exist. Ability to train a mapping that is guaranteed to be invertible has practical applications; for example, they give rise to normalizing flows [DB95, RM15], which allow for sampling from a complicated, multimodal probability distribution by generating samples from a simple one, and transforming them through an invertible mapping. For any given architecture for invertible neural networks, it is important to know whether it can be trained to approximate arbitrary invertible mappings, its approximation capabilities are limited.
1.1 Invertible Models
We focus our attention on two invertible neural network architectures: iResNet, a constrained ResNet, and Neural ODE, a continuous generalization of a ResNet.
Invertible Residual Networks
While ResNets refer to arbitrary networks with any residual blocks , that is, can have any residual mapping , iResNets [BGC19], and their improved variant, Residual Flows [CBDJ19], are built from blocks in which is Lipschitzcontinuous with constant lower than 1 as a function of for fixed , which we denote by . This simple constraint is sufficient [BGC19] to guarantee invertibility of the residual network, that is, to make a onetoone mapping.
Given the constraint on the Lipschitz constant, an invertible mapping cannot be performed by a single iResNet layer. But a stack of two layers, each of the form and thus Lipschitzcontinuous with constant lower than 1, yields the desired mapping. A single iResNet layer , where is the identity mapping, is , and a composition of such layers has Lipschitz constant of at most . Thus, for any finite , it might be possible to approximate any invertible mapping with by a series of iResNet layers, with the number of layers depending on . However, the question whether the possibility outlined above true, and iResNet have universal approximation capability within the class of invertible continuous mappings, has not been considered thus far.
Neural Ordinary Differential Equations
Neural ODEs (ODENets) [CRBD18] are a recently proposed class of differentiable neural network building blocks. ODENets were formulated by observing that processing an initial input vector through a sequence of residual blocks can be seen as evolution of in time . Then, a residual block (eq. 1) is a discretization of a continuoustime system of ordinary differential equations (eq. 2)
(1)  
(2) 
The transformation taking into realized by an ODENet for some chosen, fixed time is not specified directly through a functional relationship for some neural network , but indirectly, through the solutions to the initial value problem (IVP) of the ODE
(3) 
involving some underlying neural network with trainable parameters . By a ODENet we denote an ODENet that takes a dimensional sample vector on input, and produces a dimensional vector on output. The underlying network must match those dimensions on its input and output, but in principle can have arbitrary internal architecture, including multiple layers of much higher dimensionality.
By the properties of ODEs, ODENets are always invertible, we can just reverse the limits of integration, or alternatively integrate . The adjoint sensitivity method [PMBG62] based on reversetime integration of an expanded ODE allows for finding gradients of the IVP solutions with respect to parameters and the initial values . This allows training ODENets using gradient descent, as well as combining them with other neural network blocks. Since their introduction, ODENets have seen improved implementations [RIM19] and enhancements in training and stability [GKB19, ZYG19].
Unlike an unconstrained residual block, a Neural ODE on its own does not have universal approximation capability. Consider a continuous, differentiable, invertible function on . There is no ODE defined on that would result in . Informally, in ODEs, paths between the initial value and final value have to be continuous and cannot intersect in for two different initial values, and paths corresponding to and would need to intersect. By contrast, in an unconstrained residual block sequence, a discrete dynamical system on , we do not have continuous paths, only points at unittime intervals, with an arbitrary transformation between points; finding an unconstrained ResNet for is easy. While ODENets used outofthebox have limited modeling capability, some evidence exists that this limitation can be overcome by changing the way way ODENets are applied. Yet, the question whether they can be turned into universal approximators remains open.
1.2 Our Contribution
We analyze the approximation capabilities of ODENets and iResNets. The results most closely related to ours have been recently provided by the authors of ANODE [DDT19], who focus on a ODENet followed by a linear layer. They provide counterexamples showing that such an architecture is not a universal approximator of functions. However, they show empirical evidence indicating that expanding the dimensionality and using ODENet for instead of a ODENet has positive impact on training of the model and on its generalization capabilities. The authors of iResNet [BGC19] also use expanded dimensionality in their experiments, observing that it leads to a modest increase in model’s accuracy.
Here, we prove that setting is enough to turn Neural ODE followed by a linear layer into a universal approximator for . We show similar result for iResNet. Our main focus is on modeling invertible functions – homeomorphisms – by exploring pure ODENets and iResNets, not capped by a linear layer. We show a class of invertible mappings that cannot be expressed by these modeling approaches when they operate within . We then prove that any homeomorphism , for , can be modeled by a Neural ODE / iResNet operating on an Euclidean space of dimensionality that embeds as a linear subspace.
2 Background on ODEs, Flows, and Embeddings
This section provides background on invertible mappings and ODEs; we recapitulate standard material, for details see [Utz81, Lee01, BS02, You10].
2.1 Flows
A mapping is a homeomorphism if is a onetoone mapping of onto itself, and both and its inverse are continuous. Here, we will assume that for some , and we will use the term homeomorphism where dimensionality matters.
A topological transformation group or a flow [Utz81] is an ordered triple involving an additive group with neutral element 0, and a mapping such that and for all , all . Further, mapping is assumed to be continuous with respect to the first argument. The mapping gives rise to a parametric family of homeomorphisms defined as , with the inverse being .
Given a flow, an orbit or a trajectory associated with is a subspace . Given , either or ; two orbits are either identical or disjoint, they never intersect. A point is a fixed point if . A path is a part of the trajectory defined by a specific starting and end points. A path is a subset of ; we will also consider a spacetime path composed of points if we need to make the time evolution explicit.
A discrete flow is defined by setting . For arbitrary homeomorphism of onto itself, we easily get a corresponding discrete flow, an iterated discrete dynamical system, , , . Setting gives us a ResNet corresponding to , though not necessarily an iResNet, since there is no constraint. For Neural ODEs, the type of flow that is relevant is a continuous flow, defined by setting , and adding an assumption that the family of homeomorphisms, the function , is differentiable with respect to its second argument, , with continuous . The key difference compared to a discrete flow is that the flow at time , , is now defined for arbitrary , not just for integers. We will use the term flow to indicate that .
Informally, in a continuous flow the orbits are continuous, and the property that orbits never intersect has consequences for what homeomorphisms can result from a flow. Unlike in the discrete case, for a given homeomorphism there may not be a continuous flow such that for some . We cannot just set , what is required is a continuous family of homeomorphisms such that and is identity – such family may not exist for some . In such case, a Neural ODE would not be able to model . While iResNets are discrete, the way they are constructed may also limit the space of mappings they can model to a subset of all homeomorphisms, even if each residual mapping is made arbitrarily complex within the Lipschitz constraint.
2.2 Continuous Flows and ODEs
Given a continuous flow one can define a corresponding ODE on by defining a vector for every such that . Then, the ODE corresponds to continuous flow . Indeed, , is identity, and for timeindependent . Thus, for any homeomorphism family defining a continuous flow, there is a corresponding ODE that, integrated for time , models the flow at time , .
The vectors of derivatives for all are continuous over and are constant in time, and define a continuous vector field over . The ODEs evolving according to such a timeinvariant vector field, where the righthand side of eq. 2 depends on but not directly on time , are called autonomous ODEs, and take the form of .
Any timedependent ODE (eq. 2) can be transformed into an autonomous ODE by removing time from being a separate argument of , and adding it as part of the vector .
Specifically, we add an additional dimension
Given time and an ODE defined by , , the flow at time , may not be well defined, for example if diverges to infinity along the way. However, if is well behaved, the flow will exist at least locally around the initial value. Specifically, PicardLindelöf theorem states that if an ODE is defined by a Lipschitzcontinuous function , then there exists such that the flow at time , , is welldefined and unique for . If exists, is a homeomorphism, since the inverse exists and is continuous; simply, is the inverse of .
2.3 Flow Embedding Problem for Homeomorphisms
Given a flow, we can always find a corresponding ODE. Given an ODE, under mild conditions, we can find a corresponding flow at time , , and it necessarily is a homeomorphism. Is the class of flows equivalent to the class of homeomorphisms, or only to its subset? That is, given a homeomorphism , does a flow such that exist? This question is referred to as the problem of embedding the homeomorphism into a flow.
For a homeomorphism , its restricted embedding into a flow is a flow such that for some ; the flow is restricted to be on the same domain as the homeomorphism. Studies of homeomorphisms on simple domains such as a 1D segment [For55] or a 2D plane [And65] showed that a restricted embedding does not always exist.
An unrestricted embedding into a flow [Utz81] is a flow on some space of dimensionality higher than . It involves a homeomorphism that maps into some subset , such that the flow on results in mappings on that are equivalent to on for some , that is, . While a solution to the unrestricted embedding problem always exists, it involves a smooth, nonEuclidean manifold . For a homeomorphism , the manifold , variously referred to as the twisted cylinder [Utz81], or a suspension under a ceiling function [BS02], or a mapping torus [Bro66], is a quotient space defined through the equivalence relation . The flow that maps at to at and at involves trajectories in in the following way: for going from 0 to 1, the trajectory tracks in a straight line from to ; in the quotient space is equivalent to . Then, for going from 1 to 2, the trajectory proceeds from to .
The fact that the solution to the unrestricted embedding problem involves a flow on a nonEuclidean manifold makes applying it in the context of gradienttrained ODENets difficult.
3 Approximation of Homeomorphisms by Neural ODEs
In exploring the approximation capabilities of Neural ODEs for homeomorphisms, we will assume that the neural network on the right hand side of the ODE is a universal approximator and, if needed, can be made large enough to closely approximate any desired function. Thus, our concern is with what flows can be modeled by a ODENet assuming that can have arbitrary internal architecture, including depth and dimensionality, as long as its inputoutput dimensionality remains fixed at . We consider two scenarios, , and .
3.1 Restricting the Dimensionality Limits Capabilities of Neural ODEs
We show a class of functions that a Neural ODE cannot model, a class that generalizes the onedimensional example.
Theorem 1.
Let , and let be a set that partitions into two or more disjoint, connected subsets , for . Consider a mapping that

is an identity transformation on , that is, ,

maps some into , for .
Then, no ODENet can model .
Proof.
A ODENet can model if a restricted flow embedding of exists. Suppose that it does, a continuous flow can be found for such that the trajectory of is continuous on with and for some , for all .
If maps some into , for , the trajectory from to crosses – there is such that for some . From uniqueness and reversibility of ODE trajectories, we then have . From additive property of flows, we have .
Since is identity over and , thus . That is, the trajectory over time is a closed curve starting and ending at , and for any . Specifically, . Thus, . We arrive at a contradiction with the assumption that and are in two disjoint subsets of separated by . Thus, no ODENet can model .
∎
The result above shows that Neural ODEs applied in the most natural way, with , are severely restricted in the way distinct regions of the input space can be rearranged in order to learn and generalize from the training set, and the restrictions go well beyond requiring invertibility and continuity.
3.2 Neural ODEs with Extra Dimensions are Universal Approximators for Homeomorphisms
If we allow the Neural ODE to operate on Euclidean space of dimensionality , we can approximate arbitrary homeomorphism , as long as is high enough. Here, we show that is suffices to take .
We construct a mapping from the original problem space, into that

preserves as a dimensional linear subspace consisting of vectors ,

leads to an ODE that maps .
Thus, we provide a solution with a structure that is convenient for outofthebox training and inference using Neural ODEs – it is sufficient to add zeros to input vectors.
Theorem 2.
For any homeomorphism , , there exists a ODENet for such that for any .
Proof.
We prove the existence in a constructive way, by showing a vector field in , and thus an ODE, with the desired properties.
We start with the extended space with a variable corresponding to time added as the last dimension, as in the construction of an autonomous ODE from timedependent ODE. We then define a mapping that will represent paths starting from at time . For , the mapping (see Fig. 1) is defined through
(4) 
For each , let be defined as . The functions are required to have continuous first derivative, and have , , iff , and the derivatives and are null at and only there. The mapping indeed just adds dimensions of 0 to at time , and at time it gives the result of the homeomorphism applied to , again with dimensions of 0
For the purpose of constructing an ODENet with universal approximation capabilities, suffices. However, more generally we can define the mapping for , by setting ; for example, . Intuitively, the mapping will provide the position in of the time evolution for duration of an ODE on starting from a position corresponding to .
For two distinct , the paths in given by eq. 4 do not intersect at the same position at the same point in time. First, consider the case where is not parallel to . Then, the second set of variables is equal only if , that is, only at integer . At those time points, the first set of variables takes iterates of , that is, , which are different for different . Second, consider such that for some . Then, either and thus for all noninteger , that is, the second set of variables are always different except at corresponding to iterates of , which are distinct; or , and the second set of variables are always the same. In the latter case, also , hence the first set of variables is is only equal to if . Thus, in , paths starting from two distinct points do not intersect at the same point in time. Intuitively, we have added enough dimensions to the original space so that we can reroute all trajectories without intersections.
We have correspond directly to time, that is, The mapping has continuous derivative with respect to , defining a vector field over the image of , a subset of
From the conditions on , we can verify that this timedependent vector field defined through derivatives of with respect to time has the same values for and for any
Thus, the vector field is wellbehaved at , it is continuous over the whole image of . The vector field above is defined over a closed subset of , and can be (see [Lee01], Lemma 8.6) extended to the whole . A ODENet with a universal approximator network on the right hand side can be designed to approximate the vector field arbitrarily well. The resulting ODENet approximates to . ∎
Based on the above result, we now have a simple method for training a Neural ODE to approximate a given continuous, invertible mapping and, for free, obtain also its continuous inverse . On input, each sample is augmented with zeros. For a given , the output of the ODENet is split into two parts. The first dimensions are connected to a loss function that penalizes deviation from . The remaining dimensions are connected to a loss function that penalizes for any deviation from 0. Once the network is trained, we can get by using an ODENet with instead of used in the trained ODENet.
4 Approximation of Homeomorphisms by iResNets
4.1 Restricting the Dimensionality Limits Capabilities of iResNets
We show that similarly to Neural ODEs, iResNets cannot model a simple homeomorphism , indicating that their approximation capabilities are limited.
Theorem 3.
Let be an layer iResNet, and let and . If for all , then there are is no number and no functions for all such that .
Proof.
Consider and . Then, and . From we have that . Let . Then, we have
That is, has the same sign as . Thus, applying the reasoning to arbitrary , instead of , if , then , and if , then , for all . Assume we can construct an iResNet such that ; then for any , and cannot map into . ∎
The result above leads to a more general observation about paths in spaces of dimensionality higher than one. As with ODENets, we will use iResNet to denote an iResNet operating on .
Corollary 4.
Let the straight line connecting to be called an extended path of a timediscrete topological transformation group on . In iResNet, for , extended paths and do not intersect.
Proof.
For two extended paths to intersect, vectors have to be coplanar. If we restrict attention to dynamics starting form , we can view it as a onedimensional system, with the space axis parallel to , and time axis orthogonal to it.
First, consider the situation where the line connecting is parallel, in the original space, to the line connecting . Applying Theorem 3 shows that if is above , then is above ; extended paths do not intersect in this case.
In the more general case, without loss of generality assume that is the one extending farther from the line than . We can construct another iResNet, keeping but with , for such that we arrive at the parallel case above; . The intersection of extended paths for both iResNets, if exists, is at the same position. But the second iResNet is the parallel case analyzed above, with no intersection.
∎
The result allows us to show that iResNets faces a similar constraint in its capabilities as Neural ODEs
Theorem 5.
Let , and let and be the same as in Theorem 1. Then, no iResNet can model .
Proof.
Consider a layered iResNet on , giving riving rise to extended spacetime paths in , with integer corresponding to activations in subsequent layers. For any , the extended path in starts at and ends at . Since iResNet layers are continuous transformations, the union of all extended paths arising from is a simply connected subset of ; it has no holes and partitions into separate regions. Since extended paths cannot intersect, remains in the same region as , which is in contradiction with mapping . ∎
The proof shows that the limitation in capabilities of the two architectures for invertible mappings analyzed here arises from the fact that paths in invertible mappings constructed through NeuralODEs and iResNets are not allowed to intersect and from continuity in .
4.2 iResNets with Extra Dimensions are Universal Approximators for Homeomorphisms
Similarly to Neural ODEs, expanding the dimensionality of the iResNet from to by adding zeros on input guarantees that any homeomorphism can be approximated, as long as its Lipschitz constant is finite and an upper bound on it is known during iResNet architecture construction.
Theorem 6.
For any homeomorphism , with , there exists a iResNet with residual layers such that for any .
Proof.
For a given invertible iResNet approximating , define a possibly noninvertible mapping , where ; we have . An iResNet that models using layers for can be constructed in the following way:
The first layer maps into and stores it in the second set of activations. The subsequent layers progress in a straight line from to in constantlength steps, and the last layer restores null values in the second set of activations.
All layers are continuous mappings. The residual part of the first layer has Lipschitz constant below one, since . The middle layers have residual part constant in and contractive in , with Lipschitz constant below one. The residual part of the last layer is a mapping of the form . For a pair , let , . We have , with equality only if . From invertibility of we have that implies and thus ; hence, the residual part of the last layer also has Lipschitz constant below one. ∎
The theoretical construction above suggests that while on the order of layers may be needed to approximate arbitrary homeomorphism with , only the first and last layers depend on and need to be trained; the middle layers are simple, fixed linear layers. The last layer for is the same as the first layer of would be, but the inverse of the first layer, but since iResNet construct invertible mappings using possibly noninvertible , it has to be trained along with the first layer.
The construction for iResNets is similar to that for NeuralODEs, except one does not need to enforce differentiability in the time domain, hence we do not need smooth accumulation and removal of in the second set of activations, and the movement from to in the original dimensions does not need to be smooth. In both cases, the transition from to progresses along a straight line in the first dimensions, with the direction of movement stored in the second set of variables.
5 Invertible Networks capped by a Linear Layer are Universal Approximators
We show that a Neural ODE or an iResNet followed by a single linear layer can approximate functions, including noninvertible functions, equally well as any traditional feedforward neural network. Since networks with shallowbutwide fullyconnected architecture [Cyb89, Hor91], or narrowbutdeep unconstrained ResNetbased architecture [LJ18] are universal approximators, so are ODENets and iResNets. Consider a function . For any such that , the mapping is a homeomorphism, and as we have shown, can be approximated by a ODENet or iResNet; can be extracted from the result by a simple linear layer. Through a simple construction, we show that actually using just dimensions is sufficient.
Theorem 7.
Consider a neural network that approximates function that is Lebesgue integrable for each of the output dimensions, with being a compact subset. For , there exists a linear layercapped ODENet that can perform the mapping . If is Lipschitz, there also is a linear layercapped iResNet for .
Proof.
Let be a neural network that takes input vectors and produces dimensional output vectors , where is the desired transformation. is constructed as follows: use to produce , ignore , and always output . Consider a ODENet defined through . Let the initial value be . The ODE will not alter the first dimensions throughout time, hence for any , . After time , we will have
Thus, for any , the output can be recovered from the output of the ODENet by a simple, sparse linear layer that ignores all dimensions except the last , which it returns. A similar construction can be used for defining layers of iResNet. We define residual layers, each with residual mapping . If , then the residual mapping has Lipschitz constant below 1. ∎
6 Experimental Results
6.1 iResNets
We tested whether iResNet operating in one dimension can learn to perform the mapping, and whether adding one more dimension has impact on the ability learn the mapping. To this end, we constructed a network with five residual blocks. In each block, the residual mapping is a single linear transformation, that is, the residual block is . We used the official iResNet PyTorch package [BGC19] that relies on spectral normalization [MKKY18] to limit the Lipschitz constant to less than unity. We trained the network on a set of 10,000 randomly generated values of uniformly distributed in for 100 epochs, and used an independent test set of 2,000 samples generated similarly.
For the onedimensional and the twodimensional target mapping, we used MSE as the loss. Adding one extra dimension results in successful learning of the mapping, confirming Theorem 6. The test MSE on each output is below ; the network learned to negate , and to bring the additional dimension back to null, allowing for invertibility of the model. For the iResNet operating in the original, onedimensional space, learning is not successful (MSE of 33.39), the network learned a mapping for a small positive , that is, the mapping closest to negation of that can be achieved while keeping nonintersecting paths, confirming experimentally Corollary 4.
6.2 Neural ODEs
We performed experiments to validate if the dimensions threshold beyond which any homeomorphism can be approximated by a ODENet can be observed empirically in a practical classification problem. We used the CIFAR10 dataset [Kri09] that consists of 32 x 32 RGB images, that is, each input image has dimensionality of . We constructed a series of ODENets with dimensionality , and for each measured the crossentropy loss for the problem of classifying CIFAR10 images into one of ten classes. We used the default split of the dataset into 50,000 training and 10,000 test images. In designing the architecture of the neural network underlying the ODE we followed ANODE [DDT19]. Briefly, the network is composed of three 2D convolutional layers. The first two convolutional layers use filters, and the last one uses the number of input channels as the number of filters, to ensure that the dimensionalities of the input and output of the network match. The convolution stack is followed by a ReLU activation function. A linear layer, with softmax activation and crossentropy loss, operates on top the ODE block. We used torchdiffeq package [CRBD18] and trained on a single NVIDIA Tesla V100 GPU card.
To extended the dimensionality of the space in which the ODE operates, we introduce additional null channels on input, that is, we use input images of the form . Then, to achieve , we need . We tested . To analyze how the capacity of the network interplays with the increases in input dimensionality, we also experimented with varying the number of convolutional filters, , in the layers inside the ODE block.
The results in Fig. 2 show that once the networks with small network capacity, below 64 filters, behaves differently than networks with 64 or more filters. Once the network capacity is high enough, at 64 filters or more, adding dimensions past beyond 3, that is, beyond , results in slower decrease in test set loss. To quantify if this slowdown is likely to arise by chance, we calculated the change in test set loss for dimensionality as the increases by one, , for =1,…,7. We pooled the results from experiments with 64 convolution filter or more. Twotailed nonparametric MannWhitney U test between and shows the change of trend is significant (=.002).
Acknowledgments
T.A. is supported by NSF grant IIS1453658.
Footnotes
 To avoid confusion with indicating time, we use to denote th component of vector .
 We use upper subscript to denote dimensionality of vectors; that is, .
References
 Stephen A Andrea. On homeomorphisms of the plane, and their embedding in flows. Bulletin of the American Mathematical Society, 71(2):381–383, 1965.
 Jens Behrmann, Will Grathwohl, Ricky TQ Chen, David Duvenaud, and JörnHenrik Jacobsen. Invertible residual networks. In International Conference on Machine Learning, pages 573–582, 2019.
 William Browder. Manifolds with . Bulletin of the American Mathematical Society, 72(2):238–244, 1966.
 Michael Brin and Garrett Stuck. Introduction to dynamical systems. Cambridge university press, 2002.
 Tian Qi Chen, Jens Behrmann, David K Duvenaud, and JörnHenrik Jacobsen. Residual flows for invertible generative modeling. In Advances in Neural Information Processing Systems, pages 9913–9923, 2019.
 Tian Qi Chen, Yulia Rubanova, Jesse Bettencourt, and David K Duvenaud. Neural ordinary differential equations. In Advances in Neural Information Processing Systems, pages 6571–6583, 2018.
 George Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of control, signals and systems, 2(4):303–314, 1989.
 Gustavo Deco and Wilfried Brauer. Nonlinear higherorder statistical decorrelation by volumeconserving neural architectures. Neural Networks, 8(4):525–535, 1995.
 Emilien Dupont, Arnaud Doucet, and Yee Whye Teh. Augmented neural ODEs. arXiv preprint arXiv:1904.01681, 2019.
 Marion Kirkland Fort. The embedding of homeomorphisms in flows. Proceedings of the American Mathematical Society, 6(6):960–967, 1955.
 Amir Gholami, Kurt Keutzer, and George Biros. ANODE: unconditionally accurate memoryefficient gradients for neural ODEs. arXiv preprint arXiv:1902.10298, 2019.
 Kurt Hornik. Approximation capabilities of multilayer feedforward networks. Neural networks, 4(2):251–257, 1991.
 Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
 Alex Krizhevsky. Learning multiple layers of features from tiny images. 2009.
 John M Lee. Introduction to smooth manifolds. Springer, 2001.
 Hongzhou Lin and Stefanie Jegelka. ResNet with oneneuron hidden layers is a universal approximator. In Advances in Neural Information Processing Systems, pages 6169–6178, 2018.
 Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. In International Conference on Learning Representations, 2018.
 Lev Semenovich Pontryagin, EF Mishchenko, VG Boltyanskii, and RV Gamkrelidze. The mathematical theory of optimal processes. 1962.
 Chris Rackauckas, Mike Innes, Yingbo Ma, Jesse Bettencourt, Lyndon White, and Vaibhav Dixit. DiffEqFlux.jla Julia library for neural differential equations. arXiv preprint arXiv:1902.02376, 2019.
 Danilo Rezende and Shakir Mohamed. Variational inference with normalizing flows. In International Conference on Machine Learning, pages 1530–1538, 2015.
 WR Utz. The embedding of homeomorphisms in continuous flows. In Topology Proc, volume 6, pages 159–177, 1981.
 Hassler Whitney. The singularities of a smooth manifold in (2 1)space. Ann. of Math, 45(2):247–293, 1944.
 Laurent Younes. Shapes and diffeomorphisms, volume 171. Springer, 2010.
 Tianjun Zhang, Zhewei Yao, Amir Gholami, Kurt Keutzer, Joseph Gonzalez, George Biros, and Michael Mahoney. ANODEV2: A coupled neural ODE evolution framework. arXiv preprint arXiv:1906.04596, 2019.