A Flow Model of Neural Networks
Abstract
Based on a natural connection between ResNets and the transport equation, or equivalently its characteristic equation, we propose a continuous flow model for both ResNets and plain nets. Through this continuous model, a ResNet can be explicitly constructed as a refinement of a plain net. The flow model provides an alternative perspective for understanding phenomena in deep neural networks, such as why it is necessary and sufficient to use 2-layer blocks in ResNets, why deeper is better, why ResNets can be even deeper, and so on. It also opens a gate to bring in more tools from the huge area of differential equations.
1 Introduction
Deep neural networks have proven impressively successful on certain supervised learning tasks (LeCun et al., 2015). Such a network successively maps a dataset to a feature space on which simple output functions (e.g. a softmax classifier) are sufficient to achieve high performance. Although each single layer is only a simple transformation, the composition of many layers can represent very complicated functions. Guided by this philosophy and supported by powerful computers and massive amounts of data, deeper and deeper neural networks have been invented (Krizhevsky et al., 2012, Zeiler and Fergus, 2014, Simonyan and Zisserman, 2014, Szegedy et al., 2015). A remarkable event is that He et al. (2016) set a new record on the ImageNet competition (Deng et al., 2009) using their ResNets with more than a hundred layers. Going deeper is believed to be helpful. However, the mechanism for that, like many other mysteries about the ‘black box’, is still under exploration.
Our contributions. In this short note, we construct flow models of neural networks. Our aim is not restricted to answering any specific question about neural networks, but to build a framework which connects neural networks with differential equations. As a bridge, it could bring in new perspectives and new methods, which could be applied to understand or solve learning problems.
We observed that a ResNet is the same as a discretization of the characteristic equation of a transport equation. Conversely, the transport equation can be regarded as a continuous model of the ResNet. In physics, transport equations describe the dynamics of quantities carried by continuous flows. Hence we call the continuous model a flow model.
As a natural extension, we also construct a flow model for the plain net (a neural network without residual shortcuts). It is built in a different way, because the non-residual maps between layers cannot be considered as discretizations of a transport velocity field.
The flow models are immediately available to explain some phenomena in neural networks. For example, they naturally support the belief in the power of depth of neural networks. They also relate plain nets to ResNets explicitly; this connection is used to explain the super depth of ResNets. Besides, they explain why it is necessary to use 2-layer blocks in ResNets with ReLU activations, and so on.
Related works. Li and Shi (2017a) consider solving supervised and semi-supervised learning problems through PDEs on the point cloud of data. They propose alternative methods for initializing and training ResNets. Recently, we noted that we are not the only ones who observed the connection between neural networks and differential equations. E (2017) proposes to study ResNets as dynamical systems. Based on that, Li et al. (2017) consider training algorithms from the optimal control point of view. Chang et al. (2017) present an empirical study on the training of ResNets as dynamical systems. However, all these papers focus on ResNets. We have not seen any paper considering plain nets from a similar point of view.
The structure of this note is as follows. In Section 2, we start with a transport equation and its characteristic equation and end up with a ResNet. In Section 3, we build a continuous flow model for a plain net; this is done for the linear map and the activation separately, and the pieces are then glued up. In Section 4, the flow model of the plain net is discretized to get a ResNet. Considering the relationship between neural networks and their flow models, we make some comments, which are summarized in Section 5.
2 Residual Networks
2.1 Transport Equation
Consider the following terminal value problem (TVP) for the linear transport equation:

(1) $\partial_t u(x,t) + v(x,t)\cdot\nabla u(x,t) = 0, \quad (x,t) \in \mathbb{R}^d \times [0,T]; \qquad u(x,T) = f(x).$
Here $v(x,t)$ is an $\mathbb{R}^d$-valued function, called the transport velocity field. It can be chosen in different ways. We will consider the general form first, then a special type:

(2) $v(x,t) = U(t)\,\sigma\big(W(t)\,x + b(t)\big) + c(t),$

where $W(t), U(t) \in \mathbb{R}^{d\times d}$ and $b(t), c(t) \in \mathbb{R}^d$. The activation $\sigma$ is a nonlinear scalar function applied entrywise, which is Lipschitz continuous.
It is well known that the solution of equation (1) is transported along characteristics, which are defined as solutions of the initial value problem (IVP) of the ODE:

(3) $\dot{x}(t) = v(x(t), t), \quad x(0) = x_0,$
where $x_0 \in \mathbb{R}^d$. Along the solution curve $x(t)$, it is easy to verify that

(4) $\dfrac{d}{dt}\, u(x(t), t) = \nabla u(x(t), t) \cdot \dot{x}(t) + \partial_t u(x(t), t)$
(5) $= \nabla u(x(t), t) \cdot v(x(t), t) + \partial_t u(x(t), t) = 0.$

In the last step we used the transport equation (1). So $u$ remains unchanged along the curve. See Figure 1 for a conceptual illustration. Therefore
(6) $u(x_0, 0) = u(x(T), T) = f(x(T)).$
We have solved the transport equation (1) by integrating the ODE (3). This is the so-called method of characteristics.
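The correspondence between the method of characteristics and residual updates can be checked numerically. Below is a minimal sketch in NumPy; the velocity field `v` is an arbitrary smooth choice for illustration, not one from this note.

```python
import numpy as np

def v(x, t):
    # An arbitrary smooth velocity field on R^d, chosen only for illustration.
    return np.tanh(x) * np.cos(t)

def euler_characteristic(x0, T=1.0, N=100):
    # Integrate dx/dt = v(x, t) by Euler's method. Each step has the form
    # x_{k+1} = x_k + dt * v(x_k, t_k), i.e. exactly a residual (ResNet) update.
    x, dt = np.array(x0, dtype=float), T / N
    for k in range(N):
        x = x + dt * v(x, k * dt)
    return x
```

Refining the partition changes the result only at order $\Delta t$, which is why a deeper stack of residual blocks with smaller steps approximates the same flow.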
2.2 Connection with ResNets
Discretizing the ODE (3) by Euler’s method naturally leads to a ResNet. In order to make the following approximations reasonable, we assume that $u$ varies regularly enough in $x$ and $t$. In particular, we assume that the solutions of (1) and (3) exist and are regular enough.
Let $0 = t_0 < t_1 < \cdots < t_N = T$ be a partition of $[0, T]$ such that for any $k$, $\Delta t_k = t_{k+1} - t_k$ is small enough. Let $x(t)$ be a characteristic of the transport equation (1), i.e. a solution of (3), and denote $x_k = x(t_k)$. Denote $v_k(x) = v(x, t_k)$ for any $k$. See Figure 2 for an illustration of the discretization.
Near time $t_k$, the ODE (3) is approximately

(7) $\dot{x}(t) \approx v(x(t), t_k) = v_k(x(t)).$
Using Euler’s method to integrate this ODE from $t_k$ to $t_{k+1}$, we get

(8) $x_{k+1} \approx x_k + \Delta t_k\, v(x_k, t_k)$
(9) $= x_k + \Delta t_k\, v_k(x_k)$
(10) $= (\mathrm{id} + \Delta t_k\, v_k)(x_k),$

where $\mathrm{id}$ is the identity map. Therefore
(11) $x_N \approx (\mathrm{id} + \Delta t_{N-1}\, v_{N-1})(x_{N-1})$
(12) $\approx (\mathrm{id} + \Delta t_{N-1}\, v_{N-1}) \circ \cdots \circ (\mathrm{id} + \Delta t_0\, v_0)(x_0).$
If the terminal value of $u$ is given as $u(x, T) = f(x)$, we might be able to use (12) to get the initial value at any $x_0$. According to (6),

(13) $u(x_0, 0) = f(x(T)) \approx f \circ (\mathrm{id} + \Delta t_{N-1}\, v_{N-1}) \circ \cdots \circ (\mathrm{id} + \Delta t_0\, v_0)(x_0).$
The discrete solution (13) of the terminal value problem of the transport equation (1) is valid for any $x_0 \in \mathbb{R}^d$. Its basic structure is shown in Figure 3. This structure reminds us of the ResNet (He et al., 2016), but it is merely a formal one. In order to see the actual structure, we need to specify the definition of the $v_k$’s.
A Special Type. In order to get a ResNet with explicit 2-layer blocks, consider the special type of transport velocity field given by (2). Denote

(14) $W_k = W(t_k), \quad U_k = U(t_k),$
(15) $b_k = b(t_k), \quad c_k = c(t_k),$
(16) $v_k(x) = U_k\, \sigma(W_k\, x + b_k) + c_k.$
By using the method of characteristics as before, we can get

(17) $x_{k+1} \approx x_k + \Delta t_k \big( U_k\, \sigma(W_k\, x_k + b_k) + c_k \big).$
It generates a 2-layer ResNet block, which is much more like the original ResNet. Figure 4 illustrates its basic structure.
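A single step of (17) can be sketched as follows; the parameter values are random placeholders, and ReLU stands in for a generic activation.

```python
import numpy as np

def resnet_block(x, W, b, U, c, dt):
    # One Euler step of the special velocity field (2):
    # x + dt * (U @ sigma(W @ x + b) + c).
    # Inner parameters (W, b) select the location; outer parameters (U, c)
    # set the direction and magnitude of the velocity there.
    sigma = lambda z: np.maximum(z, 0.0)   # ReLU
    return x + dt * (U @ sigma(W @ x + b) + c)

rng = np.random.default_rng(0)
d, dt = 4, 1e-2
W, U = rng.standard_normal((d, d)), rng.standard_normal((d, d))
b, c = rng.standard_normal(d), rng.standard_normal(d)
x = rng.standard_normal(d)
y = resnet_block(x, W, b, U, c, dt)
```

For small dt the update stays close to the identity, which is exactly the regime in which the transport-equation model applies.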
At first glance, it appears that simply defining the transport velocity as (2) is not natural. But it is actually reasonable. The inner parameters $W(t)$ and $b(t)$ are used to specify the location in the space of data: they control where to assign a velocity vector. If the activation $\sigma$ is nonnegative, or even bounded, which is often the case, then the outer parameters $U(t)$ and $c(t)$ are necessary to adjust the direction and magnitude of the transport velocity. Both inner parameters and outer parameters are necessary ingredients of the transport velocity field. Of course, if $\sigma$ is symmetric, the outer parameters are not necessary for this purpose.
The ResNet obtained here is special. Firstly, as we can see in (10) and (17), due to the small time step $\Delta t_k$, the residual term can be made sufficiently small compared with the leading identity term. This is a necessary condition for the ResNet to be modeled by a transport equation.
Secondly, the parameters of the ResNet change slowly from block to block. More specifically, the parameters at the same positions of adjacent ResNet blocks should be close to each other, because they are assumed to be discretizations of continuous functions of time. For example, $W_k$ is close to $W_{k+1}$, $b_k$ is close to $b_{k+1}$, and so on.
3 Continuous Model of Plain Networks
We have seen that the method of characteristics for transport equations corresponds to ResNets. The key to this connection is the transport velocity field that generates the residual terms between layers. It is natural to consider a similar relationship for a plain net, whose typical layer is

(18) $x_{n+1} = \sigma(W_n\, x_n + b_n),$
where $\sigma$ is the activation, $W_n$ the multiplicative weight matrix and $b_n$ the bias vector. In (18), however, the non-residual term $\sigma(W_n x_n + b_n)$ defines a finite (rather than infinitesimal) transformation of $x_n$. It cannot be naturally interpreted as a velocity, which makes it difficult to model by a transport equation directly. In this section we will construct a continuous flow for the map (18). It is done for the linear map and the nonlinear activation separately. Later, this flow will be used to construct the ResNet approximation of the plain net (18).
As a preparation, we define the time scaling function $\varphi$. If the flow is only required to be continuous in time, then $\varphi(s) = s$ with $s \in [0,1]$ is sufficient. Here we require the flow to be smooth, so $\varphi$ needs to be nonlinear. Let $\varphi$ be a smooth increasing function that satisfies:

- $\varphi(s) = 0$ for $s \le 0$,
- $\varphi(s) = 1$ for $s \ge 1$,
- $\varphi'(s) > 0$ for $0 < s < 1$.
With the above properties of $\varphi$, the transport velocity fields of adjacent layers can be glued up smoothly. Since we only consider the $n$-th layer in this section, let us drop the subscript $n$ of the parameters for simplicity.
3.1 Linear Map
Approximation by matrix exponentials
The main object considered here is the weight matrix $W$. Without loss of generality, assume that $W$ and the data have been embedded into a space of sufficiently high dimension $d$, such that $W$ is a square matrix with $\det(W) \ge 0$. If it can be written in an exponential form, we are done. Unfortunately, this is generally not possible. So we consider its full-size singular value decomposition

(19) $W = U\, \Sigma\, V.$
Notice that we use $V$ instead of its adjoint $V^{T}$ in the decomposition. The requirement $\det(W) \ge 0$ is to ensure that $U$ and $V$ can be taken as proper rotations even if $W$ includes a mirror reflection on its invariant subspace. Since $U$ and $V$ are rotations of finite angles, they can be expressed as exponentials of angular velocity (antisymmetric) matrices:

(20) $U = e^{A}, \quad V = e^{B}.$
The matrix $\Sigma$ is a combination of finite stretches (nonzero diagonal entries) and projections (zero diagonal entries). But projections can be considered as limits of stretches, so $\Sigma$ can be approximated by a matrix exponential:

(21) $\Sigma \approx e^{D_m}, \quad D_m = \mathrm{diag}(\ln\sigma_1, \ldots, \ln\sigma_r, -m, \ldots, -m),$

where $\sigma_1, \ldots, \sigma_r$ are the nonzero singular values and the last $d - r$ entries are $-m$. Thus
(22) $W = U\, \Sigma\, V$
(23) $\approx e^{A}\, e^{D_m}\, e^{B}$

for large $m$. So the map (18) can be approximated by
(24) $x_{n+1} \approx \sigma\big( e^{A}\, e^{D_m}\, e^{B}\, x_n + b \big).$
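The factorization (19)-(23) can be checked numerically. The sketch below makes simplifying assumptions: $W$ is built from rotations with small angles and strictly positive singular values, so the principal matrix logarithm recovers the generators and no small-$\varepsilon$ regularization of zero singular values is needed. `expm`/`logm` are SciPy's matrix exponential and logarithm.

```python
import numpy as np
from scipy.linalg import expm, logm

rng = np.random.default_rng(1)
d = 4

def skew(rng, d, scale=0.3):
    # Random antisymmetric ("angular velocity") matrix with small angles,
    # so that logm(expm(.)) recovers it exactly (principal branch).
    M = scale * rng.standard_normal((d, d))
    return (M - M.T) / 2.0

A, B = skew(rng, d), skew(rng, d)
s = rng.uniform(0.5, 2.0, size=d)     # nonzero singular values: no epsilon needed
D = np.diag(np.log(s))

# Build W = U Sigma V with proper rotations U = e^A, V = e^B, cf. (19)-(20).
W = expm(A) @ np.diag(s) @ expm(B)

# Recover the left generator from its rotation factor, then reassemble
# the exponential factorization e^A e^D e^B of (22)-(23).
A_rec = np.real(logm(expm(A)))
W_exp = expm(A_rec) @ expm(D) @ expm(B)
```

For a weight matrix with zero singular values, `np.log(s)` would diverge; that is exactly where the $-m$ regularization of (21) comes in.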
Flows of linear maps
The linear map in (24) can be approached by a composition of continuous flows. Denote, for $s \in [0,1]$,

(25) $R_s = e^{\varphi(s) A}, \quad S_s = e^{\varphi(s) D_m}, \quad Q_s = e^{\varphi(s) B},$

then $R_0 = S_0 = Q_0 = \mathrm{id}$ and $R_1 = e^{A}$, $S_1 = e^{D_m}$, $Q_1 = e^{B}$. For any $s \in [0,1]$, define the translation flow

(26) $T_s(x) = x + \varphi(s)\, b.$

Then $T_0 = \mathrm{id}$ and $T_1(x) = x + b$. So the linear maps $e^{A}$, $e^{D_m}$, $e^{B}$ and the translation $x \mapsto x + b$ can be modeled by continuous flows, each taking one unit of time. In the following we consider their transport velocity fields.
The rotation flow $Q_s$ can also be described by the initial value problem of the ODE

(27) $\dot{x}(s) = \varphi'(s)\, B\, x(s), \quad x(0) = x_0,$

because its solution is just

(28) $x(s) = e^{\varphi(s) B}\, x_0.$

It means that the transport velocity field from $\mathrm{id}$ to $e^{B}$ is defined by $v(x, s) = \varphi'(s)\, B\, x$.
In a similar way, the stretch flow $S_s$ can also be described by

(29) $\dot{x}(s) = \varphi'(s)\, D_m\, x(s), \quad x(0) = x_0,$

because its solution is just

(30) $x(s) = e^{\varphi(s) D_m}\, x_0.$

It means that the transport velocity field from $\mathrm{id}$ to $e^{D_m}$ is defined by $v(x, s) = \varphi'(s)\, D_m\, x$.
In a similar way, the rotation flow $R_s$ can also be described by

(31) $\dot{x}(s) = \varphi'(s)\, A\, x(s), \quad x(0) = x_0,$

because its solution is just

(32) $x(s) = e^{\varphi(s) A}\, x_0.$

It means that the transport velocity field from $\mathrm{id}$ to $e^{A}$ is defined by $v(x, s) = \varphi'(s)\, A\, x$.
Finally, the translation flow $T_s$ can also be described by

(33) $\dot{x}(s) = \varphi'(s)\, b, \quad x(0) = x_0,$

because its solution is simply

(34) $x(s) = x_0 + \varphi(s)\, b.$

It means that the transport velocity field from $\mathrm{id}$ to $x \mapsto x + b$ is just $v(x, s) = \varphi'(s)\, b$.
By Euler’s method, it can be shown that these linear exponential flows can all be approximated by several linear ResNet blocks.
3.2 Activation
Now let us consider the nonlinear activation $\sigma$. Assume that $\sigma$ is nondecreasing, differentiable almost everywhere and Lipschitz. From now on, denote

(35) $g(x) = \sigma(x) - x,$

so that $\sigma = \mathrm{id} + g$. For any $x$ and $s \in [0,1]$, define

(36) $\sigma_s(x) = x + \varphi(s)\, g(x) = (1 - \varphi(s))\, x + \varphi(s)\, \sigma(x).$
Clearly,

(37) $\sigma_0(x) = x, \qquad \sigma_1(x) = \sigma(x).$
So it takes one unit of time to move $x$ to $\sigma(x)$. For any fixed $s \in [0,1)$, the value of $\sigma_s(x)$ is strictly increasing in $x$, hence invertible. Denote the inverse by $\sigma_s^{-1}$, so that $\sigma_s^{-1}(\sigma_s(x)) = x$. As $s$ goes from $0$ to $1$, $\sigma_s$ is a flow that continuously moves $x$ to $\sigma(x)$. The transport velocity field is given by
(38) $v(\sigma_s(x), s) = \dfrac{\partial}{\partial s}\, \sigma_s(x)$
(39) $= \varphi'(s)\, g(x)$
(40) $= \varphi'(s)\, g\big( \sigma_s^{-1}(\sigma_s(x)) \big),$

or simply

(41) $v(y, s) = \varphi'(s)\, g\big( \sigma_s^{-1}(y) \big).$
Thus $y(s) = \sigma_s(x)$ is the solution to the initial value problem

(42) $\dot{y}(s) = \varphi'(s)\, g\big( \sigma_s^{-1}(y(s)) \big), \quad y(0) = x,$

and $y(1) = \sigma(x)$.
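As a sanity check of (36)-(42), the sketch below takes $\sigma = \tanh$ and a smoothstep-style $\varphi$ (an illustrative choice with $\varphi(0)=0$, $\varphi(1)=1$, $\varphi'(0)=\varphi'(1)=0$), inverts $\sigma_s$ by bisection, and verifies the velocity formula (41) by a finite difference.

```python
import numpy as np

phi  = lambda s: 3*s**2 - 2*s**3        # smooth increasing time scaling on [0, 1]
dphi = lambda s: 6*s - 6*s**2           # phi'
sigma = np.tanh
g = lambda x: sigma(x) - x              # g = sigma - id, cf. (35)

def sigma_s(x, s):
    # Activation flow (36): identity at s = 0, sigma at s = 1.
    return x + phi(s) * g(x)

def sigma_s_inv(y, s, lo=-10.0, hi=10.0):
    # sigma_s is strictly increasing in x, so invert it by bisection.
    for _ in range(80):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if sigma_s(mid, s) < y else (lo, mid)
    return 0.5 * (lo + hi)
```

The exact $s$-derivative of $\sigma_s(x)$ is $\varphi'(s)\, g(x)$, which is the velocity (41) evaluated at $y = \sigma_s(x)$.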
Example. Before moving on, let us look at an example of the activation flow $\sigma_s$. Let $\sigma(x) = \max(x, 0)$ be ReLU. For any $x$,

(43) $g(x) = \sigma(x) - x = -\min(x, 0).$

By definition (36), the activation flow is

(44) $\sigma_s(x) = \begin{cases} x, & x \ge 0, \\ (1 - \varphi(s))\, x, & x < 0. \end{cases}$

Notice that for any $s < 1$, it is a leaky ReLU. If $y = \sigma_s(x)$, then

(45) $\sigma_s^{-1}(y) = \begin{cases} y, & y \ge 0, \\ y / (1 - \varphi(s)), & y < 0. \end{cases}$

Hence the transport velocity field is

(46) $v(y, s) = \varphi'(s)\, g\big( \sigma_s^{-1}(y) \big)$
(47) $= \begin{cases} 0, & y \ge 0, \\ -\varphi'(s)\, y / (1 - \varphi(s)), & y < 0. \end{cases}$
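The ReLU velocity field (46)-(47) can be integrated numerically. In the sketch below, $\varphi$ is again the illustrative smoothstep; we stop at $s = 0.9$ rather than $s = 1$, since the field becomes stiff for plain Euler integration as $\varphi(s) \to 1$, and compare against the exact leaky ReLU $\sigma_s$ with slope $1 - \varphi(0.9)$.

```python
import numpy as np

phi  = lambda s: 3*s**2 - 2*s**3        # illustrative smooth time scaling
dphi = lambda s: 6*s - 6*s**2

def relu_flow(x, s_end=0.9, M=4000):
    # Euler-integrate dy/ds = v(y, s) with the field (47):
    # v(y, s) = 0 for y >= 0, and -phi'(s) y / (1 - phi(s)) for y < 0.
    y, ds = np.array(x, dtype=float), s_end / M
    for m in range(M):
        s = m * ds
        y = y + ds * np.where(y >= 0.0, 0.0, -dphi(s) * y / (1.0 - phi(s)))
    return y

x = np.array([-2.0, -0.5, 0.3, 1.5])
y = relu_flow(x)
# Exact flow at s_end: the leaky ReLU sigma_s(x) with slope 1 - phi(s_end).
```

Positive inputs are untouched by the field, while negative inputs are gradually compressed toward zero, exactly as (44) predicts.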
3.3 Gluing Up
In summary, the map of the nonlinear plain layer (18) can be modeled successively by the flows $Q_s$, $S_s$, $R_s$, $T_s$, $\sigma_s$. So it takes four units of time to move from $x_n$ to $W x_n + b$, then one unit of time to move from $W x_n + b$ to $\sigma(W x_n + b)$. For technical completeness, let us glue these flows together. For any $t \in [0, 5]$, define

(48) $F_t = \sigma_{t-4} \circ T_{t-3} \circ R_{t-2} \circ S_{t-1} \circ Q_{t},$

where, thanks to the properties of $\varphi$, each factor is the identity before its stage starts and frozen after it ends. For convenience, the above sequentially glued flow (48) is called the layer flow of the $n$-th layer.
The layer flow (48) can also be described by the ODE

(49) $\dot{x}(t) = \begin{cases} \varphi'(t)\, B\, x(t), & t \in [0,1], \\ \varphi'(t-1)\, D_m\, x(t), & t \in [1,2], \\ \varphi'(t-2)\, A\, x(t), & t \in [2,3], \\ \varphi'(t-3)\, b, & t \in [3,4], \\ \varphi'(t-4)\, g\big( \sigma_{t-4}^{-1}(x(t)) \big), & t \in [4,5], \end{cases}$

with initial condition $x(0) = x_n$. Then $x(5) = x_{n+1}$. Notice that at the integer times $t = 0, 1, \ldots, 5$, the velocity vanishes.
Notice that the above sequential gluing procedure is only one of many possible ways to construct a continuous flow for (18). There are infinitely many flows that produce the same nonlinear map (18), although most of them do not have such an explicit formulation.
Previously, we constructed a transport velocity field for a typical single layer of a plain net. Let us now construct the velocity field for the whole network. Consider the terminal value problem of the linear transport equation (1). Now the transport velocity field is defined by gluing up (49) for the different layers. The detail is as follows. Let $0 = t_0 < t_1 < \cdots < t_N = T$ be a uniform partition of $[0, T]$ such that $\Delta t = T/N$ is small enough. Then for $t \in [t_n, t_{n+1}]$,

(50) $v(x, t) = v_n\big( x,\; 5\,(t - t_n)/\Delta t \big),$

where $v_n(\cdot, s)$, $s \in [0,5]$, denotes the layer velocity field (49) of the $n$-th layer. Notice that the time is scaled such that $\Delta t$ units of time here are equivalent to $5$ units of time in (49). Notice also that for any $n$, $v(x, t_n) = 0$, so $v$ is smooth in $t$. Thus we have seen that the transport equation is a continuous model for the plain net. Given any plain net, we can construct a transport equation using its parameters and activations.
4 ReDiscretization as ResNet
In Section 2, we showed that ResNets can be modeled by continuous flows. In Section 3, we showed that plain nets can also be modeled by continuous flows. It is natural to connect the two types of neural networks through their continuous models. In this section, we show that by re-discretizing the flow model obtained from the plain net, we get a ResNet which approximates the plain net. More specifically, each layer of the plain net is approximated by several ResNet blocks.
4.1 Linear map
We have two options for the linear map

(51) $x \mapsto e^{A}\, e^{D_m}\, e^{B}\, x + b.$

One option is to leave it as a whole map. The other is to discretize its continuous model in the same way as we did in Section 2. For the second option, one only needs to apply Euler’s method to the ODEs in (49) corresponding to the linear map. Let us discretize the first equation in (49) as an example. Let $0 = s_0 < s_1 < \cdots < s_M = 1$ be a uniform partition of $[0,1]$, such that $\Delta s = 1/M$ is small enough. Denote $x^{(m)} = x(s_m)$, so $x^{(0)} = x_n$. By Euler’s method, we have

(52) $x^{(m+1)} = \big( I + \Delta s\, \varphi'(s_m)\, B \big)\, x^{(m)},$

which is a linear 1-layer ResNet block. Repeating this iteration $M$ times, we have

(53) $x^{(M)} = \big( I + \Delta s\, \varphi'(s_{M-1})\, B \big) \cdots \big( I + \Delta s\, \varphi'(s_0)\, B \big)\, x^{(0)} \approx e^{B}\, x^{(0)}.$
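The iteration (52)-(53) can be verified numerically. Below, $B$ is a small skew-symmetric generator so that $e^{B}$ is a rotation, $\varphi$ is the illustrative smoothstep used earlier, and `expm` is SciPy's matrix exponential.

```python
import numpy as np
from scipy.linalg import expm

phi  = lambda s: 3*s**2 - 2*s**3        # illustrative smooth time scaling
dphi = lambda s: 6*s - 6*s**2

B = np.array([[0.0, -0.5],
              [0.5,  0.0]])             # skew-symmetric: e^B is a rotation
x0 = np.array([1.0, 0.0])

M = 1000
ds = 1.0 / M
x = x0.copy()
for m in range(M):
    # One linear 1-layer ResNet block (52): x <- (I + ds * phi'(s_m) B) x.
    x = x + ds * dphi(m * ds) * (B @ x)
```

A stack of a thousand tiny linear residual blocks thus reproduces the single rotation $e^{B} x_0$ up to the Euler discretization error.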
We can apply the same procedure to the second, third and fourth equations in (49). The discretizations of these equations are very similar, and hence are omitted here.
4.2 Activation
In the following, let us focus on the nonlinear part. The activation flow is solved from (42) in the following way. Recall that it takes one unit of time to move from $x$ to $\sigma(x)$. For clarity of notation, we still use $[0,1]$ as the range of the time $s$. Let $0 = s_0 < s_1 < \cdots < s_M = 1$ be a uniform partition of $[0,1]$, such that $\Delta s = 1/M$ is small enough. Denote $y_m = y(s_m)$. Then $y_0 = x$ and $y_M = \sigma(x)$. Solving (42) by Euler’s method iteratively, we have

(54) $y_{m+1} = y_m + \Delta s\, \varphi'(s_m)\, g\big( \sigma_{s_m}^{-1}(y_m) \big)$
(55) $= y_m + \Delta s\, \varphi'(s_m)\, (\sigma - \mathrm{id})\big( \sigma_{s_m}^{-1}(y_m) \big).$
To see the basic structure of ResNet, let’s make (55) explicit.
Example. For the ReLU activation $\sigma(x) = \max(x, 0)$, it is straightforward. According to (47),

(56) $g\big( \sigma_s^{-1}(y) \big) = \dfrac{\sigma(-y)}{1 - \varphi(s)}.$

Therefore,

(57) $y_{m+1} = y_m + \Delta s\, \varphi'(s_m)\, \dfrac{\sigma(-y_m)}{1 - \varphi(s_m)}$
(58) $= y_m + a_m\, \sigma(-y_m),$

which is a 1-layer ResNet block with scalar weight

(59) $a_m = \dfrac{\Delta s\, \varphi'(s_m)}{1 - \varphi(s_m)}$

and inner weight $-1$. Thus for the ReLU activation, the approximation of plain nets by ResNets is quite trivial.
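The scalar-weight blocks (58)-(59) can be run directly. With the illustrative smoothstep $\varphi$, the iteration reproduces ReLU to high accuracy; note that $a_m$ stays finite because $s_m \le 1 - \Delta s$.

```python
import numpy as np

phi  = lambda s: 3*s**2 - 2*s**3        # illustrative smooth time scaling
dphi = lambda s: 6*s - 6*s**2
sigma = lambda z: np.maximum(z, 0.0)    # ReLU

x = np.array([-2.0, -0.5, 0.3, 1.5])
M = 1000
ds = 1.0 / M
y = x.copy()
for m in range(M):
    s = m * ds                          # s_m ranges over 0, ..., 1 - ds
    a = ds * dphi(s) / (1.0 - phi(s))   # scalar weight a_m of (59)
    y = y + a * sigma(-y)               # 1-layer residual block (58)
```

Each block nudges negative entries toward zero and leaves positive entries exactly fixed, so the deep stack of trivial residual blocks converges to the single ReLU layer.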
If $\sigma_s^{-1}$ has no explicit expression or is nonlinear, we may consider its linearization at and near a point $x_*$. According to the definition (36) of $\sigma_s$, the Jacobian of $\sigma_s$ at any $x$ is

(60) $D\sigma_s(x) = I + \varphi(s)\, Dg(x)$
(61) $= \mathrm{diag}\big( 1 + \varphi(s)\, (\sigma'(x) - 1) \big)$
(62) $= \mathrm{diag}\big( (1 - \varphi(s)) + \varphi(s)\, \sigma'(x) \big),$

whose inverse is the Jacobian of $\sigma_s^{-1}$ in terms of $y = \sigma_s(x)$:

(63) $D\sigma_s^{-1}(y) = \mathrm{diag}\left( \dfrac{1}{(1 - \varphi(s)) + \varphi(s)\, \sigma'(x)} \right).$

Notice that $\sigma'(x)$ is a vector and the fraction is entrywise. Since the linearization of the inverse is the inverse of the linearization, we first linearize $\sigma_s$ at $x_*$, then compute its inverse:

(64) $\sigma_s(x) \approx \sigma_s(x_*) + D\sigma_s(x_*)\, (x - x_*),$

therefore

(65) $\sigma_s^{-1}(y) \approx x_* + D\sigma_s(x_*)^{-1}\, \big( y - \sigma_s(x_*) \big).$
For simplicity, denote

(66) $A_m = D\sigma_{s_m}(x_*)^{-1},$
(67) $c_m = x_* - D\sigma_{s_m}(x_*)^{-1}\, \sigma_{s_m}(x_*),$

and take $x_* = y_m$; we have

(68) $\sigma_{s_m}^{-1}(y) \approx A_m\, y + c_m.$

Then the iteration (55) becomes

(69) $y_{m+1} \approx y_m + \Delta s\, \varphi'(s_m)\, \big( \sigma(A_m y_m + c_m) - (A_m y_m + c_m) \big)$
(70) $= \big( I - \Delta s\, \varphi'(s_m)\, A_m \big)\, y_m - \Delta s\, \varphi'(s_m)\, c_m + \Delta s\, \varphi'(s_m)\, \sigma(A_m y_m + c_m).$

Let

(71) $L_m(y) = \big( I - \Delta s\, \varphi'(s_m)\, A_m \big)\, y - \Delta s\, \varphi'(s_m)\, c_m.$

Then we get

(72) $y_{m+1} = L_m(y_m) + \Delta s\, \varphi'(s_m)\, \sigma(A_m y_m + c_m),$

which is the approximation of (55). It contains a 2-layer ResNet block followed by a non-residual linear map, as shown in Figure 5. The whole activation flow is composed of several iterations of (55) or its approximation (72).
Together with the linear map (51), the single $n$-th layer of the plain net (18) is approximated by a composition of linear maps and 2-layer ResNet blocks. See Figure 6. Alternatively, we can also use (53) and its successors instead of the whole linear map (51). See Figure 7.
Now it may be a little confusing: the multi-layer ResNet still contains several activations (the small orange circles with solid border lines, within each dashed green box in Figures 6 and 7). Why bother to replace one activation (the orange dotted ellipse) by such a multi-layer structure containing more activations? The answer is as follows. The roles of the activations in the original plain net and in the new ResNet are different. In the plain net, the activation causes a nonlinear distortion of the map between two layers, or poses a geometric constraint on the layer flow. The effect is significant and immediate. In the ResNet obtained above, however, the activation causes a nonlinear distortion of the transport velocity field, or poses a differential constraint on the layer flow. The effect becomes significant only after accumulation.
Another confusing point is the continuous change of parameters from layer to layer. Since the neural networks here are obtained by discretizing a continuous flow, it is natural to guess that the parameters of the networks vary slowly from layer to layer. However, we should be careful about this idea. It is generally not true for nonlinear networks. For the nonlinear plain net (18), we have seen in (49) that the continuous transport velocity field is NOT simply of the form

(73) $v(x, t) = \sigma\big( W(t)\, x + b(t) \big).$

So the parameters $W_n$ and $b_n$ in (18) should not themselves be regarded as discretizations of some continuous parameters of a velocity field.
For the nonlinear ResNet shown in Figure 6, the situation is more subtle. To approximate the activation in one layer of the original plain net (18) (the dotted orange ellipse), several basic structures (the dashed green boxes in Figure 6) of the ResNet are used. These basic structures share the same form, and the parameters at their corresponding positions vary slowly. It means that as $m$ changes, $A_m$ changes slowly, $c_m$ changes slowly, and so on. In this sense, the parameters of the ResNet change continuously. But if we only naively go through the parameters layer by layer, we will not find this continuity.
5 Discussions
In Sections 2 and 3 respectively, we used a transport equation and its characteristic equation as continuous flow models for ResNets and plain nets. This correspondence between a neural network and its flow model is very natural, or even obvious for ResNets. It is summarized in Table 1 and illustrated in Figure 8.
Inspired by the connection between neural networks and transport equations, the well studied methods in the area of differential equations might help to understand neural networks or to solve related problems. Here are just a few examples:

- We have seen the reason for using 2-layer blocks in ResNets. In the language of the transport equation, the inner parameters are used to specify the location in the space of data: they tell the network where to assign a velocity vector. The outer parameters are used to adjust the magnitude and direction of the velocity vector at the specified location. The outer parameters are necessary because ReLU is asymmetric.

- The correspondence provides one way to see why depth is good for neural networks. From the perspective of the TVP of the transport equation, in order to transform the terminal value function into the initial value function, the transport velocity field may need to be complicated. To make the discretization converge and to control the error, it is necessary to use a small time step and many iterations. This makes the discretization more regular, in that each step makes only a small amount of progress. For neural networks, the transformation provided by each layer is likewise very limited, so more layers are needed to accomplish the required deformation of the dataset.

- In practice, ResNets can usually be significantly deeper than plain nets. Considering their connections with flow models, the reason for this is quite transparent. On the one hand, the plain net is equivalent to its flow model, which is constructed in Section 3. On the other hand, the flow model can be discretized in an iterative way to get a ResNet, as described in Section 4. Combining the two facts, we can say that the ResNet is a refinement of the original plain net. Naturally, it is deeper than the plain net.

- Although ResNets can be very deep, many authors have shown that training ResNets is easier than training plain nets of comparable depth. From the differential equation point of view, this is because ResNets deform the dataset in an incremental way, which is much more regular than the way plain nets do.

- When solving PDEs, people often use dissipative terms to increase the regularity of solutions. In terms of neural networks, this corresponds to adding randomness to the feedforward process. This idea is very close to the dropout technique (Srivastava et al., 2014).

- We have already seen that ResNets correspond to the method of characteristics for transport equations. But there are other methods to solve PDEs (Li and Shi, 2017a), which might lead to alternative architectures equivalent to neural networks.

- The training of neural networks can be considered as solving an inverse problem of the transport equation: both the initial value and the terminal value are given, and the task is to find a time-dependent velocity field that transports the initial value to the terminal value. Of course, the solution to this inverse problem is highly non-unique; there are uncountably many velocity fields that can do the job. Thus the inverse problem is usually formulated as an optimization problem constrained by the transport equation as well as the initial and terminal conditions. There are many methods to solve such problems, and some of them could be modified to train neural networks (Li et al., 2017).
One possible question about the continuous model is the dimension matching problem. In practice, one has the flexibility to choose different dimensions for different layers, but in the continuous model it seems difficult to do so. Since the main concern of this paper is theoretical, this is not a serious problem for us. Actually, the dimension matching problem already exists in ResNets: shortcuts are only used when dimensions match; otherwise, extra projection matrices are needed. In this note, we have adopted the simple assumption that the dataset is embedded into a space of sufficiently high dimension at the beginning, and this ambient dimension does not change with time. In order to approximate the necessary reduction of the intrinsic dimension of the dataset over time, we used compressing flows. Of course, this theoretical approach is inefficient in practice. An alternative approach is to glue up flow models with different dimensions.
Acknowledgement
Zhen Li would like to show his gratitude to the support of professors Yuan Yao and Yang Wang from the Department of Mathematics, HKUST.
Footnotes
 Most of this work was submitted to arXiv as two separate notes, on 22 August (Li and Shi, 2017b) and 6 September respectively, but the latter was not announced due to technical reasons. This note is a combination of the two previous notes.
References
 Chang, B., Meng, L., Haber, E., Tung, F., and Begert, D. (2017). Multilevel Residual Networks from Dynamical Systems View. ArXiv eprints.
 Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., and FeiFei, L. (2009). ImageNet: A LargeScale Hierarchical Image Database. In CVPR09.
 E, W. (2017). A proposal on machine learning via dynamical systems. Communications in Mathematics and Statistics, 5(1):1–11. Dedicated to Professor ChiWang Shu on the occasion of his 60th birthday.
 He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image recognition. In Computer Vision and Pattern Recognition (CVPR), 2016 IEEE Conference on.
 Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. In Pereira, F., Burges, C. J. C., Bottou, L., and Weinberger, K. Q., editors, Advances in Neural Information Processing Systems 25, pages 1097–1105. Curran Associates, Inc.
 LeCun, Y., Bengio, Y., and Hinton, G. (2015). Deep learning. Nature, 521(7553):436–444.
 Li, Q., Chen, L., Tai, C., and E, W. (2017). Maximum principle based algorithms for deep learning. CoRR, abs/1710.09513.
 Li, Z. and Shi, Z. (2017a). Deep residual learning and pdes on manifold. CoRR, abs/1708.05115.
 Li, Z. and Shi, Z. (2017b). Notes: A continuous model of neural networks. part I: residual networks. CoRR, abs/1708.06257.
 Simonyan, K. and Zisserman, A. (2014). Very deep convolutional networks for largescale image recognition. CoRR, abs/1409.1556.
 Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2014). Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res., 15(1):1929–1958.
 Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., and Rabinovich, A. (2015). Going deeper with convolutions. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1–9. IEEE.
 Zeiler, M. D. and Fergus, R. (2014). Visualizing and Understanding Convolutional Networks, pages 818–833. Springer International Publishing, Cham.