Approximation Capabilities of Neural ODEsand Invertible Residual Networks

# Approximation Capabilities of Neural ODEs and Invertible Residual Networks

## Abstract

Neural ODEs and i-ResNet are recently proposed methods for enforcing invertibility of residual neural models. Having a generic technique for constructing invertible models can open new avenues for advances in learning systems, but so far the question of whether Neural ODEs and i-ResNets can model any continuous invertible function remained unresolved. Here, we show that both of these models are limited in their approximation capabilities. We then prove that any homeomorphism on a -dimensional Euclidean space can be approximated by a Neural ODE operating on a -dimensional Euclidean space, and a similar result for i-ResNets. We conclude by showing that capping a Neural ODE or an i-ResNet with a single linear layer is sufficient to turn the model into a universal approximator for non-invertible continuous functions.

\setlength

8.5in \setlength11in

## 1 Introduction

A neural network block is a function that maps an input vector to output vector , and is parameterized by a weight vector . We require that is almost everywhere differentiable with respect to both of its arguments, allowing the use of gradient methods for tuning based on training data and an optimization criterion, and for passing the gradient to preceding network blocks.

One type of neural building blocks that has received attention in recent years is a residual block [HZRS16], where , with being some differentiable, nonlinear, possibly multi-layer transformation. Input and output dimensionality of a residual block are the same, , and such blocks are usually stacked in a sequence, a ResNet, . Often, the functional form of is the same for all blocks in the sequence. Then, we can represent the sequence through , where consists of trainable parameters for all blocks in the sequence; the second argument, , allows us to pick the proper subset of parameters, . If we allow arbitrary , for example a neural network with input and output dimensionality but with many hidden layers of dimensionality higher than , a sequence of residual blocks can, in principle, model arbitrary mappings , where we define to be the result of applying the sequence of residual blocks to the initial input . For example, a linear layer preceded by a deep sequence of residual blocks is a universal approximator for Lebesgue integrable functions [LJ18].

Recently, models arising from residual blocks have gained attention as a means to construct invertible networks; that is, training a network results in mappings for which an inverse mapping exist. Ability to train a mapping that is guaranteed to be invertible has practical applications; for example, they give rise to normalizing flows [DB95, RM15], which allow for sampling from a complicated, multi-modal probability distribution by generating samples from a simple one, and transforming them through an invertible mapping. For any given architecture for invertible neural networks, it is important to know whether it can be trained to approximate arbitrary invertible mappings, its approximation capabilities are limited.

### 1.1 Invertible Models

We focus our attention on two invertible neural network architectures: i-ResNet, a constrained ResNet, and Neural ODE, a continuous generalization of a ResNet.

#### Invertible Residual Networks

While ResNets refer to arbitrary networks with any residual blocks , that is, can have any residual mapping , i-ResNets [BGC19], and their improved variant, Residual Flows [CBDJ19], are built from blocks in which is Lipschitz-continuous with constant lower than 1 as a function of for fixed , which we denote by . This simple constraint is sufficient [BGC19] to guarantee invertibility of the residual network, that is, to make a one-to-one mapping.

Given the constraint on the Lipschitz constant, an invertible mapping cannot be performed by a single i-ResNet layer. But a stack of two layers, each of the form and thus Lipschitz-continuous with constant lower than 1, yields the desired mapping. A single i-ResNet layer , where is the identity mapping, is , and a composition of such layers has Lipschitz constant of at most . Thus, for any finite , it might be possible to approximate any invertible mapping with by a series of i-ResNet layers, with the number of layers depending on . However, the question whether the possibility outlined above true, and i-ResNet have universal approximation capability within the class of invertible continuous mappings, has not been considered thus far.

#### Neural Ordinary Differential Equations

Neural ODEs (ODE-Nets) [CRBD18] are a recently proposed class of differentiable neural network building blocks. ODE-Nets were formulated by observing that processing an initial input vector through a sequence of residual blocks can be seen as evolution of in time . Then, a residual block (eq. 1) is a discretization of a continuous-time system of ordinary differential equations (eq. 2)

 xt+1−xt =fΘ(xt,t), (1) dxtdt=limδt→0xt+δt−xtδt =fΘ(xt,t). (2)

The transformation taking into realized by an ODE-Net for some chosen, fixed time is not specified directly through a functional relationship for some neural network , but indirectly, through the solutions to the initial value problem (IVP) of the ODE

 xT=ϕT(x0)=x0+∫T0fΘ(xt,t)dt (3)

involving some underlying neural network with trainable parameters . By a -ODE-Net we denote an ODE-Net that takes a -dimensional sample vector on input, and produces a -dimensional vector on output. The underlying network must match those dimensions on its input and output, but in principle can have arbitrary internal architecture, including multiple layers of much higher dimensionality.

By the properties of ODEs, ODE-Nets are always invertible, we can just reverse the limits of integration, or alternatively integrate . The adjoint sensitivity method [PMBG62] based on reverse-time integration of an expanded ODE allows for finding gradients of the IVP solutions with respect to parameters and the initial values . This allows training ODE-Nets using gradient descent, as well as combining them with other neural network blocks. Since their introduction, ODE-Nets have seen improved implementations [RIM19] and enhancements in training and stability [GKB19, ZYG19].

Unlike an unconstrained residual block, a Neural ODE on its own does not have universal approximation capability. Consider a continuous, differentiable, invertible function on . There is no ODE defined on that would result in . Informally, in ODEs, paths between the initial value and final value have to be continuous and cannot intersect in for two different initial values, and paths corresponding to and would need to intersect. By contrast, in an unconstrained residual block sequence, a discrete dynamical system on , we do not have continuous paths, only points at unit-time intervals, with an arbitrary transformation between points; finding an unconstrained ResNet for is easy. While ODE-Nets used out-of-the-box have limited modeling capability, some evidence exists that this limitation can be overcome by changing the way way ODE-Nets are applied. Yet, the question whether they can be turned into universal approximators remains open.

### 1.2 Our Contribution

We analyze the approximation capabilities of ODE-Nets and i-ResNets. The results most closely related to ours have been recently provided by the authors of ANODE [DDT19], who focus on a -ODE-Net followed by a linear layer. They provide counterexamples showing that such an architecture is not a universal approximator of functions. However, they show empirical evidence indicating that expanding the dimensionality and using -ODE-Net for instead of a -ODE-Net has positive impact on training of the model and on its generalization capabilities. The authors of i-ResNet [BGC19] also use expanded dimensionality in their experiments, observing that it leads to a modest increase in model’s accuracy.

Here, we prove that setting is enough to turn Neural ODE followed by a linear layer into a universal approximator for . We show similar result for i-ResNet. Our main focus is on modeling invertible functions – homeomorphisms – by exploring pure ODE-Nets and i-ResNets, not capped by a linear layer. We show a class of invertible mappings that cannot be expressed by these modeling approaches when they operate within . We then prove that any homeomorphism , for , can be modeled by a Neural ODE / i-ResNet operating on an Euclidean space of dimensionality that embeds as a linear subspace.

## 2 Background on ODEs, Flows, and Embeddings

This section provides background on invertible mappings and ODEs; we recapitulate standard material, for details see [Utz81, Lee01, BS02, You10].

### 2.1 Flows

A mapping is a homeomorphism if is a one-to-one mapping of onto itself, and both and its inverse are continuous. Here, we will assume that for some , and we will use the term -homeomorphism where dimensionality matters.

A topological transformation group or a flow [Utz81] is an ordered triple involving an additive group with neutral element 0, and a mapping such that and for all , all . Further, mapping is assumed to be continuous with respect to the first argument. The mapping gives rise to a parametric family of homeomorphisms defined as , with the inverse being .

Given a flow, an orbit or a trajectory associated with is a subspace . Given , either or ; two orbits are either identical or disjoint, they never intersect. A point is a fixed point if . A path is a part of the trajectory defined by a specific starting and end points. A path is a subset of ; we will also consider a space-time path composed of points if we need to make the time evolution explicit.

A discrete flow is defined by setting . For arbitrary homeomorphism of onto itself, we easily get a corresponding discrete flow, an iterated discrete dynamical system, , , . Setting gives us a ResNet corresponding to , though not necessarily an i-ResNet, since there is no constraint. For Neural ODEs, the type of flow that is relevant is a continuous flow, defined by setting , and adding an assumption that the family of homeomorphisms, the function , is differentiable with respect to its second argument, , with continuous . The key difference compared to a discrete flow is that the flow at time , , is now defined for arbitrary , not just for integers. We will use the term -flow to indicate that .

Informally, in a continuous flow the orbits are continuous, and the property that orbits never intersect has consequences for what homeomorphisms can result from a flow. Unlike in the discrete case, for a given homeomorphism there may not be a continuous flow such that for some . We cannot just set , what is required is a continuous family of homeomorphisms such that and is identity – such family may not exist for some . In such case, a Neural ODE would not be able to model . While i-ResNets are discrete, the way they are constructed may also limit the space of mappings they can model to a subset of all homeomorphisms, even if each residual mapping is made arbitrarily complex within the Lipschitz constraint.

### 2.2 Continuous Flows and ODEs

Given a continuous flow one can define a corresponding ODE on by defining a vector for every such that . Then, the ODE corresponds to continuous flow . Indeed, , is identity, and for time-independent . Thus, for any homeomorphism family defining a continuous flow, there is a corresponding ODE that, integrated for time , models the flow at time , .

The vectors of derivatives for all are continuous over and are constant in time, and define a continuous vector field over . The ODEs evolving according to such a time-invariant vector field, where the right-hand side of eq. 2 depends on but not directly on time , are called autonomous ODEs, and take the form of .

Any time-dependent ODE (eq. 2) can be transformed into an autonomous ODE by removing time from being a separate argument of , and adding it as part of the vector . Specifically, we add an additional dimension1 to vector , with . We equate it with time, , by including in the definition of how acts on , and including in the initial value . In defining , explicit use of as a variable is being replaced by using the component of vector . The result is an autonomous ODE.

Given time and an ODE defined by , , the flow at time , may not be well defined, for example if diverges to infinity along the way. However, if is well behaved, the flow will exist at least locally around the initial value. Specifically, Picard-Lindelöf theorem states that if an ODE is defined by a Lipschitz-continuous function , then there exists such that the flow at time , , is well-defined and unique for . If exists, is a homeomorphism, since the inverse exists and is continuous; simply, is the inverse of .

### 2.3 Flow Embedding Problem for Homeomorphisms

Given a -flow, we can always find a corresponding ODE. Given an ODE, under mild conditions, we can find a corresponding flow at time , , and it necessarily is a homeomorphism. Is the class of -flows equivalent to the class of -homeomorphisms, or only to its subset? That is, given a homeomorphism , does a -flow such that exist? This question is referred to as the problem of embedding the homeomorphism into a flow.

For a homeomorphism , its restricted embedding into a flow is a flow such that for some ; the flow is restricted to be on the same domain as the homeomorphism. Studies of homeomorphisms on simple domains such as a 1D segment [For55] or a 2D plane [And65] showed that a restricted embedding does not always exist.

An unrestricted embedding into a flow [Utz81] is a flow on some space of dimensionality higher than . It involves a homeomorphism that maps into some subset , such that the flow on results in mappings on that are equivalent to on for some , that is, . While a solution to the unrestricted embedding problem always exists, it involves a smooth, non-Euclidean manifold . For a homeomorphism , the manifold , variously referred to as the twisted cylinder [Utz81], or a suspension under a ceiling function [BS02], or a mapping torus [Bro66], is a quotient space defined through the equivalence relation . The flow that maps at to at and at involves trajectories in in the following way: for going from 0 to 1, the trajectory tracks in a straight line from to ; in the quotient space is equivalent to . Then, for going from 1 to 2, the trajectory proceeds from to .

The fact that the solution to the unrestricted embedding problem involves a flow on a non-Euclidean manifold makes applying it in the context of gradient-trained ODE-Nets difficult.

## 3 Approximation of Homeomorphisms by Neural ODEs

In exploring the approximation capabilities of Neural ODEs for -homeomorphisms, we will assume that the neural network on the right hand side of the ODE is a universal approximator and, if needed, can be made large enough to closely approximate any desired function. Thus, our concern is with what flows can be modeled by a -ODE-Net assuming that can have arbitrary internal architecture, including depth and dimensionality, as long as its input-output dimensionality remains fixed at . We consider two scenarios, , and .

### 3.1 Restricting the Dimensionality Limits Capabilities of Neural ODEs

We show a class of functions that a Neural ODE cannot model, a class that generalizes the one-dimensional example.

###### Theorem 1.

Let , and let be a set that partitions into two or more disjoint, connected subsets , for . Consider a mapping that

• is an identity transformation on , that is, ,

• maps some into , for .

Then, no -ODE-Net can model .

###### Proof.

A -ODE-Net can model if a restricted flow embedding of exists. Suppose that it does, a continuous flow can be found for such that the trajectory of is continuous on with and for some , for all .

If maps some into , for , the trajectory from to crosses – there is such that for some . From uniqueness and reversibility of ODE trajectories, we then have . From additive property of flows, we have .

Since is identity over and , thus . That is, the trajectory over time is a closed curve starting and ending at , and for any . Specifically, . Thus, . We arrive at a contradiction with the assumption that and are in two disjoint subsets of separated by . Thus, no -ODE-Net can model .

The result above shows that Neural ODEs applied in the most natural way, with , are severely restricted in the way distinct regions of the input space can be rearranged in order to learn and generalize from the training set, and the restrictions go well beyond requiring invertibility and continuity.

### 3.2 Neural ODEs with Extra Dimensions are Universal Approximators for Homeomorphisms

If we allow the Neural ODE to operate on Euclidean space of dimensionality , we can approximate arbitrary -homeomorphism , as long as is high enough. Here, we show that is suffices to take . We construct a mapping from the original problem space, into that 2

• preserves as a -dimensional linear subspace consisting of vectors ,

• leads to an ODE that maps .

Thus, we provide a solution with a structure that is convenient for out-of-the-box training and inference using Neural ODEs – it is sufficient to add zeros to input vectors.

###### Theorem 2.

For any homeomorphism , , there exists a -ODE-Net for such that for any .

###### Proof.

We prove the existence in a constructive way, by showing a vector field in , and thus an ODE, with the desired properties.

We start with the extended space with a variable corresponding to time added as the last dimension, as in the construction of an autonomous ODE from time-dependent ODE. We then define a mapping that will represent paths starting from at time . For , the mapping (see Fig. 1) is defined through

 y(x,τ) =[x+f(τ)δx,δxg(τ)]. (4)

For each , let be defined as . The functions are required to have continuous first derivative, and have , , iff , and the derivatives and are null at and only there. The mapping indeed just adds dimensions of 0 to at time , and at time it gives the result of the homeomorphism applied to , again with dimensions of 0

 y(x,0)) =[x,0(p)], y(x,1) =[x+δx,0(p)]=[h(x),0(p)]=y(h(x),0).

For the purpose of constructing an ODE-Net with universal approximation capabilities, suffices. However, more generally we can define the mapping for , by setting ; for example, . Intuitively, the mapping will provide the position in of the time evolution for duration of an ODE on starting from a position corresponding to .

For two distinct , the paths in given by eq. 4 do not intersect at the same position at the same point in time. First, consider the case where is not parallel to . Then, the second set of variables is equal only if , that is, only at integer . At those time points, the first set of variables takes iterates of , that is, , which are different for different . Second, consider such that for some . Then, either and thus for all non-integer , that is, the second set of variables are always different except at corresponding to iterates of , which are distinct; or , and the second set of variables are always the same. In the latter case, also , hence the first set of variables is is only equal to if . Thus, in , paths starting from two distinct points do not intersect at the same point in time. Intuitively, we have added enough dimensions to the original space so that we can reroute all trajectories without intersections.

We have correspond directly to time, that is, The mapping has continuous derivative with respect to , defining a vector field over the image of , a subset of

 dydt =[dfdtδx,dgdtδx].

From the conditions on , we can verify that this time-dependent vector field defined through derivatives of with respect to time has the same values for and for any

 dydt(x,0)=[0(p),0(p)]=dydt(x,1)=dydt(h(x),0)

Thus, the vector field is well-behaved at , it is continuous over the whole image of . The vector field above is defined over a closed subset of , and can be (see [Lee01], Lemma 8.6) extended to the whole . A -ODE-Net with a universal approximator network on the right hand side can be designed to approximate the vector field arbitrarily well. The resulting ODE-Net approximates to . ∎

Based on the above result, we now have a simple method for training a Neural ODE to approximate a given continuous, invertible mapping and, for free, obtain also its continuous inverse . On input, each sample is augmented with zeros. For a given , the output of the ODE-Net is split into two parts. The first dimensions are connected to a loss function that penalizes deviation from . The remaining dimensions are connected to a loss function that penalizes for any deviation from 0. Once the network is trained, we can get by using an ODE-Net with instead of used in the trained ODE-Net.

## 4 Approximation of Homeomorphisms by i-ResNets

### 4.1 Restricting the Dimensionality Limits Capabilities of i-ResNets

We show that similarly to Neural ODEs, i-ResNets cannot model a simple homeomorphism , indicating that their approximation capabilities are limited.

###### Theorem 3.

Let be an -layer i-ResNet, and let and . If for all , then there are is no number and no functions for all such that .

###### Proof.

Consider and . Then, and . From we have that . Let . Then, we have

 δ1 =a0+δ0+f1(a0+δ0)−a0−f1(a0) =δ0+f1(a0+δ0)−f1(a0), δ1 >δ0−|δ0|, δ1 <δ0+|δ0|.

That is, has the same sign as . Thus, applying the reasoning to arbitrary , instead of , if , then , and if , then , for all . Assume we can construct an i-ResNet such that ; then for any , and cannot map into . ∎

The result above leads to a more general observation about paths in spaces of dimensionality higher than one. As with ODE-Nets, we will use -i-ResNet to denote an i-ResNet operating on .

###### Corollary 4.

Let the straight line connecting to be called an extended path of a time-discrete topological transformation group on . In -i-ResNet, for , extended paths and do not intersect.

###### Proof.

For two extended paths to intersect, vectors have to be co-planar. If we restrict attention to dynamics starting form , we can view it as a one-dimensional system, with the space axis parallel to , and time axis orthogonal to it.

First, consider the situation where the line connecting is parallel, in the original space, to the line connecting . Applying Theorem 3 shows that if is above , then is above ; extended paths do not intersect in this case.

In the more general case, without loss of generality assume that is the one extending farther from the line than . We can construct another i-ResNet, keeping but with , for such that we arrive at the parallel case above; . The intersection of extended paths for both i-ResNets, if exists, is at the same position. But the second i-ResNet is the parallel case analyzed above, with no intersection.

The result allows us to show that i-ResNets faces a similar constraint in its capabilities as Neural ODEs

###### Theorem 5.

Let , and let and be the same as in Theorem 1. Then, no -i-ResNet can model .

###### Proof.

Consider a -layered i-ResNet on , giving riving rise to extended space-time paths in , with integer corresponding to activations in subsequent layers. For any , the extended path in starts at and ends at . Since i-ResNet layers are continuous transformations, the union of all extended paths arising from is a simply connected subset of ; it has no holes and partitions into separate regions. Since extended paths cannot intersect, remains in the same region as , which is in contradiction with mapping . ∎

The proof shows that the limitation in capabilities of the two architectures for invertible mappings analyzed here arises from the fact that paths in invertible mappings constructed through NeuralODEs and i-ResNets are not allowed to intersect and from continuity in .

### 4.2 i-ResNets with Extra Dimensions are Universal Approximators for Homeomorphisms

Similarly to Neural ODEs, expanding the dimensionality of the i-ResNet from to by adding zeros on input guarantees that any -homeomorphism can be approximated, as long as its Lipschitz constant is finite and an upper bound on it is known during i-ResNet architecture construction.

###### Theorem 6.

For any homeomorphism , with , there exists a -i-ResNet with residual layers such that for any .

###### Proof.

For a given invertible i-ResNet approximating , define a possibly non-invertible mapping , where ; we have . An i-ResNet that models using layers for can be constructed in the following way:

 ϕ0([x,0]) →[x,0]+[0,δ(x)], ϕi([z,y]) →[z,y]+[yT/(T+1),0]i=1,...,T+1, ϕT+2([h(x),δ(x)]) →[h(x),δ(x)]+[0,−δ(x)]

The first layer maps into and stores it in the second set of activations. The subsequent layers progress in a straight line from to in constant-length steps, and the last layer restores null values in the second set of activations.

All layers are continuous mappings. The residual part of the first layer has Lipschitz constant below one, since . The middle layers have residual part constant in and contractive in , with Lipschitz constant below one. The residual part of the last layer is a mapping of the form . For a pair , let , . We have , with equality only if . From invertibility of we have that implies and thus ; hence, the residual part of the last layer also has Lipschitz constant below one. ∎

The theoretical construction above suggests that while on the order of layers may be needed to approximate arbitrary homeomorphism with , only the first and last layers depend on and need to be trained; the middle layers are simple, fixed linear layers. The last layer for is the same as the first layer of would be, but the inverse of the first layer, but since i-ResNet construct invertible mappings using possibly non-invertible , it has to be trained along with the first layer.

The construction for i-ResNets is similar to that for NeuralODEs, except one does not need to enforce differentiability in the time domain, hence we do not need smooth accumulation and removal of in the second set of activations, and the movement from to in the original dimensions does not need to be smooth. In both cases, the transition from to progresses along a straight line in the first dimensions, with the direction of movement stored in the second set of variables.

## 5 Invertible Networks capped by a Linear Layer are Universal Approximators

We show that a Neural ODE or an i-ResNet followed by a single linear layer can approximate functions, including non-invertible functions, equally well as any traditional feed-forward neural network. Since networks with shallow-but-wide fully-connected architecture [Cyb89, Hor91], or narrow-but-deep unconstrained ResNet-based architecture [LJ18] are universal approximators, so are ODE-Nets and i-ResNets. Consider a function . For any such that , the mapping is a -homeomorphism, and as we have shown, can be approximated by a -ODE-Net or -i-ResNet; can be extracted from the result by a simple linear layer. Through a simple construction, we show that actually using just dimensions is sufficient.

###### Theorem 7.

Consider a neural network that approximates function that is Lebesgue integrable for each of the output dimensions, with being a compact subset. For , there exists a linear layer-capped -ODE-Net that can perform the mapping . If is Lipschitz, there also is a linear layer-capped -i-ResNet for .

###### Proof.

Let be a neural network that takes input vectors and produces -dimensional output vectors , where is the desired transformation. is constructed as follows: use to produce , ignore , and always output . Consider a -ODE-Net defined through . Let the initial value be . The ODE will not alter the first dimensions throughout time, hence for any , . After time , we will have

 xT =x0+∫10G(xt)dt=[x(p),0(r)]+∫10[0(p),y(r)]dt =[x(p),F(x(p))].

Thus, for any , the output can be recovered from the output of the ODE-Net by a simple, sparse linear layer that ignores all dimensions except the last , which it returns. A similar construction can be used for defining layers of i-ResNet. We define residual layers, each with residual mapping . If , then the residual mapping has Lipschitz constant below 1. ∎

## 6 Experimental Results

### 6.1 i-ResNets

We tested whether i-ResNet operating in one dimension can learn to perform the mapping, and whether adding one more dimension has impact on the ability learn the mapping. To this end, we constructed a network with five residual blocks. In each block, the residual mapping is a single linear transformation, that is, the residual block is . We used the official i-ResNet PyTorch package [BGC19] that relies on spectral normalization [MKKY18] to limit the Lipschitz constant to less than unity. We trained the network on a set of 10,000 randomly generated values of uniformly distributed in for 100 epochs, and used an independent test set of 2,000 samples generated similarly.

For the one-dimensional and the two-dimensional target mapping, we used MSE as the loss. Adding one extra dimension results in successful learning of the mapping, confirming Theorem 6. The test MSE on each output is below ; the network learned to negate , and to bring the additional dimension back to null, allowing for invertibility of the model. For the i-ResNet operating in the original, one-dimensional space, learning is not successful (MSE of 33.39), the network learned a mapping for a small positive , that is, the mapping closest to negation of that can be achieved while keeping non-intersecting paths, confirming experimentally Corollary 4.

### 6.2 Neural ODEs

We performed experiments to validate if the -dimensions threshold beyond which any -homeomorphism can be approximated by a -ODE-Net can be observed empirically in a practical classification problem. We used the CIFAR10 dataset [Kri09] that consists of 32 x 32 RGB images, that is, each input image has dimensionality of . We constructed a series of -ODE-Nets with dimensionality , and for each measured the cross-entropy loss for the problem of classifying CIFAR10 images into one of ten classes. We used the default split of the dataset into 50,000 training and 10,000 test images. In designing the architecture of the neural network underlying the ODE we followed ANODE [DDT19]. Briefly, the network is composed of three 2D convolutional layers. The first two convolutional layers use filters, and the last one uses the number of input channels as the number of filters, to ensure that the dimensionalities of the input and output of the network match. The convolution stack is followed by a ReLU activation function. A linear layer, with softmax activation and cross-entropy loss, operates on top the ODE block. We used torchdiffeq package [CRBD18] and trained on a single NVIDIA Tesla V100 GPU card.

To extended the dimensionality of the space in which the ODE operates, we introduce additional null channels on input, that is, we use input images of the form . Then, to achieve , we need . We tested . To analyze how the capacity of the network interplays with the increases in input dimensionality, we also experimented with varying the number of convolutional filters, , in the layers inside the ODE block.

The results in Fig. 2 show that once the networks with small network capacity, below 64 filters, behaves differently than networks with 64 or more filters. Once the network capacity is high enough, at 64 filters or more, adding dimensions past beyond 3, that is, beyond , results in slower decrease in test set loss. To quantify if this slowdown is likely to arise by chance, we calculated the change in test set loss for dimensionality as the increases by one, , for =1,…,7. We pooled the results from experiments with 64 convolution filter or more. Two-tailed nonparametric Mann-Whitney U test between and shows the change of trend is significant (=.002).

## Acknowledgments

T.A. is supported by NSF grant IIS-1453658.

### Footnotes

1. To avoid confusion with indicating time, we use to denote -th component of vector .
2. We use upper subscript to denote dimensionality of vectors; that is, .

### References

1. Stephen A Andrea. On homeomorphisms of the plane, and their embedding in flows. Bulletin of the American Mathematical Society, 71(2):381–383, 1965.
2. Jens Behrmann, Will Grathwohl, Ricky TQ Chen, David Duvenaud, and Jörn-Henrik Jacobsen. Invertible residual networks. In International Conference on Machine Learning, pages 573–582, 2019.
3. William Browder. Manifolds with . Bulletin of the American Mathematical Society, 72(2):238–244, 1966.
4. Michael Brin and Garrett Stuck. Introduction to dynamical systems. Cambridge university press, 2002.
5. Tian Qi Chen, Jens Behrmann, David K Duvenaud, and Jörn-Henrik Jacobsen. Residual flows for invertible generative modeling. In Advances in Neural Information Processing Systems, pages 9913–9923, 2019.
6. Tian Qi Chen, Yulia Rubanova, Jesse Bettencourt, and David K Duvenaud. Neural ordinary differential equations. In Advances in Neural Information Processing Systems, pages 6571–6583, 2018.
7. George Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of control, signals and systems, 2(4):303–314, 1989.
8. Gustavo Deco and Wilfried Brauer. Nonlinear higher-order statistical decorrelation by volume-conserving neural architectures. Neural Networks, 8(4):525–535, 1995.
9. Emilien Dupont, Arnaud Doucet, and Yee Whye Teh. Augmented neural ODEs. arXiv preprint arXiv:1904.01681, 2019.
10. Marion Kirkland Fort. The embedding of homeomorphisms in flows. Proceedings of the American Mathematical Society, 6(6):960–967, 1955.
11. Amir Gholami, Kurt Keutzer, and George Biros. ANODE: unconditionally accurate memory-efficient gradients for neural ODEs. arXiv preprint arXiv:1902.10298, 2019.
12. Kurt Hornik. Approximation capabilities of multilayer feedforward networks. Neural networks, 4(2):251–257, 1991.
13. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
14. Alex Krizhevsky. Learning multiple layers of features from tiny images. 2009.
15. John M Lee. Introduction to smooth manifolds. Springer, 2001.
16. Hongzhou Lin and Stefanie Jegelka. ResNet with one-neuron hidden layers is a universal approximator. In Advances in Neural Information Processing Systems, pages 6169–6178, 2018.
17. Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. In International Conference on Learning Representations, 2018.
18. Lev Semenovich Pontryagin, EF Mishchenko, VG Boltyanskii, and RV Gamkrelidze. The mathematical theory of optimal processes. 1962.
19. Chris Rackauckas, Mike Innes, Yingbo Ma, Jesse Bettencourt, Lyndon White, and Vaibhav Dixit. DiffEqFlux.jl-a Julia library for neural differential equations. arXiv preprint arXiv:1902.02376, 2019.
20. Danilo Rezende and Shakir Mohamed. Variational inference with normalizing flows. In International Conference on Machine Learning, pages 1530–1538, 2015.
21. WR Utz. The embedding of homeomorphisms in continuous flows. In Topology Proc, volume 6, pages 159–177, 1981.
22. Hassler Whitney. The singularities of a smooth -manifold in (2- 1)-space. Ann. of Math, 45(2):247–293, 1944.
23. Laurent Younes. Shapes and diffeomorphisms, volume 171. Springer, 2010.
24. Tianjun Zhang, Zhewei Yao, Amir Gholami, Kurt Keutzer, Joseph Gonzalez, George Biros, and Michael Mahoney. ANODEV2: A coupled neural ODE evolution framework. arXiv preprint arXiv:1906.04596, 2019.
You are adding the first comment!
How to quickly get a good reply:
• Give credit where it’s due by listing out the positive aspects of a paper before getting into which changes should be made.
• Be specific in your critique, and provide supporting evidence with appropriate references to substantiate general statements.
• Your comment should inspire ideas to flow and help the author improves the paper.

The better we are at sharing our knowledge with each other, the faster we move forward.
The feedback must be of minimum 40 characters and the title a minimum of 5 characters