A Flow Model of Neural Networks
Based on a natural connection between ResNet and the transport equation, or rather its characteristic equation, we propose a continuous flow model for both ResNets and plain nets. Through this continuous model, a ResNet can be explicitly constructed as a refinement of a plain net. The flow model provides an alternative perspective for understanding phenomena in deep neural networks, such as why it is necessary to use 2-layer blocks in ResNets, why deeper is better, why ResNets can be even deeper, and so on. It also opens a gate to bringing in more tools from the vast area of differential equations.
Deep neural networks have proven impressively successful on certain supervised learning tasks (LeCun et al., 2015). Such a network successively maps a dataset to a feature space on which simple output functions (e.g. a softmax classifier) are sufficient to achieve high performance. Although each single layer is only a simple transformation, the composition of many layers can represent very complicated functions. Guided by this philosophy and supported by powerful computers and massive amounts of data, deeper and deeper neural networks have been invented (Krizhevsky et al., 2012, Zeiler and Fergus, 2014, Simonyan and Zisserman, 2014, Szegedy et al., 2015). A remarkable event is that He et al. (2016) set a new record on the ImageNet competition (Deng et al., 2009) using their ResNets with over a hundred layers. Going deeper is believed to be helpful. However, the mechanism for that, and many other mysteries about the 'black box', are still under exploration.
Our contributions. In this short note, we construct flow models of neural networks. Our aim is not restricted to answering any specific question about neural networks, but to build a framework which connects neural networks with differential equations. As a bridge, it could bring in new perspectives and new methods, which could be applied to understand or solve learning problems.
We observe that a ResNet is the same as a discretization of the characteristic equation of a transport equation. Conversely, the transport equation can be regarded as a continuous model of the ResNet. In physics, transport equations are models describing the dynamics of quantities that are transported by continuous flows. Hence we call the continuous model a flow model.
As a natural extension, we also construct a flow model for plain nets (neural networks without residual shortcuts). It is built in a different way, because non-residual maps between layers cannot be considered as discretizations of a transport velocity field.
The flow models are immediately available to explain some phenomena in neural networks. For example, they naturally support the belief in the power of depth of neural networks. They also relate plain nets to ResNets explicitly; this connection is used to explain the super depth of ResNets. Besides, they explain why it is necessary to use 2-layer blocks in ResNets with ReLU activations, and so on.
Related works. Li and Shi (2017a) consider solving supervised and semi-supervised learning problems through PDEs on the point cloud of data. They propose alternative methods for initializing and training ResNets. Recently, we noted that we are not the only ones who have observed the connection between neural networks and differential equations. E (2017) proposes to study ResNets as dynamical systems. Based on that, Li et al. (2017) consider training algorithms from the optimal control point of view. Chang et al. (2017) present an empirical study on the training of ResNets as dynamical systems. However, all these papers focus on ResNets. We haven't seen any paper considering plain nets from a similar point of view.
The structure of this note is as follows. In Section 2, we start with a transport equation and its characteristic equation and end up with a ResNet. In Section 3, we build a continuous flow model for a plain net, which is done for linear map and activation respectively and then glued up. In Section 4, the flow model of plain net is discretized to get a ResNet. Considering the relationship between neural networks and their flow models, we have some comments, which are summarized in Section 5.
2 Residual Networks
2.1 Transport Equation
Consider the following terminal value problem (TVP) for the linear transport equation:

$$\partial_t u(x,t) + v(x,t)\cdot\nabla u(x,t) = 0, \quad x\in\mathbb{R}^d,\ t\in[0,1]; \qquad u(x,1) = f(x). \tag{1}$$

Here $v$ is an $\mathbb{R}^d$-valued function, called the transport velocity field. It can be chosen in different ways. We will consider the general form first, then a special type:

$$v(x,t) = W_2(t)\,\sigma\big(W_1(t)\,x + b_1(t)\big) + b_2(t), \tag{2}$$

where $W_1(t), W_2(t)\in\mathbb{R}^{d\times d}$ and $b_1(t), b_2(t)\in\mathbb{R}^d$. The activation $\sigma$ is an $\mathbb{R}^d$-valued nonlinear function, which is Lipschitz continuous.
It is well known that the solution of equation (1) is transported along characteristics, which are defined as solutions of initial value problems (IVP) of the ODE:

$$\frac{dx(t)}{dt} = v(x(t), t), \qquad x(0) = x_0, \tag{3}$$

where $x_0\in\mathbb{R}^d$. Along the solution curve $x(t)$, it is easy to verify that

$$\frac{d}{dt}\,u(x(t), t) = \partial_t u + \frac{dx(t)}{dt}\cdot\nabla u = 0.$$
2.2 Connection with ResNets
Discretizing the ODE (3) by Euler's method naturally leads to a ResNet. In order to make the following approximations reasonable, we assume that the change of $v$ with $x$ and $t$ is regular enough. In particular, we assume that the solutions of (1) and (3) exist and are regular enough.
Let $\{t_n\}_{n=0}^N$ with $t_0 = 0$ and $t_N = 1$ be a partition of $[0,1]$ such that for any $n$, $\Delta t_n = t_{n+1} - t_n$ is small enough. Let $x(t)$ be a characteristic of the transport equation (1), i.e. a solution of (3), and denote $x_n = x(t_n)$. Denote $v_n(\cdot) = v(\cdot, t_n)$ for any $n$. See Figure 2 for an illustration of the discretization.
Near time $t_n$, the ODE (3) is approximately

$$\frac{dx(t)}{dt} \approx v(x(t), t_n) = v_n(x(t)).$$

Using Euler's method to integrate this ODE from $t_n$ to $t_{n+1}$, we get

$$x_{n+1} \approx x_n + \Delta t_n\, v_n(x_n) = (\mathrm{id} + \Delta t_n\, v_n)(x_n),$$

where $\mathrm{id}$ is the identity map. Therefore

$$x_N \approx (\mathrm{id} + \Delta t_{N-1}\, v_{N-1}) \circ \cdots \circ (\mathrm{id} + \Delta t_0\, v_0)(x_0).$$
This discrete solution of the terminal value problem of the transport equation (1) is valid for any $v$. Its basic structure is shown in Figure 3. This structure reminds us of the ResNet (He et al., 2016), but it is merely a formal one. In order to see the actual structure, we need to specify the definition of the $v_n$'s.
A Special Type. In order to get a ResNet with explicit 2-layer blocks, consider the special type of transport velocity field given by (2). Denote $W_{i,n} = W_i(t_n)$ and $b_{i,n} = b_i(t_n)$ for $i = 1, 2$. By using the method of characteristics as before, we get

$$x_{n+1} \approx x_n + \Delta t_n \big( W_{2,n}\, \sigma(W_{1,n}\, x_n + b_{1,n}) + b_{2,n} \big).$$
It generates a 2-layer ResNet block, which is much more like the original ResNet. Figure 4 illustrates its basic structure.
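This correspondence can be sketched numerically. The snippet below is a minimal sketch (the parameter schedule `params`, the dimension $d=3$ and the random weights are hypothetical): integrating the characteristic ODE $dx/dt = v(x,t)$ with Euler's method makes every step a 2-layer residual block $x_{n+1} = x_n + \Delta t\,(W_2\,\sigma(W_1 x_n + b_1) + b_2)$.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def velocity(x, t, params):
    # 2-layer transport velocity field: v(x,t) = W2 sigma(W1 x + b1) + b2
    W1, b1, W2, b2 = params(t)
    return W2 @ relu(W1 @ x + b1) + b2

def resnet_forward(x0, params, n_steps=100):
    # Euler integration of dx/dt = v(x,t); each step is a 2-layer ResNet block:
    # x_{n+1} = x_n + dt * (W2 relu(W1 x_n + b1) + b2)
    dt = 1.0 / n_steps
    x = x0
    for n in range(n_steps):
        x = x + dt * velocity(x, n * dt, params)
    return x

# Hypothetical time-independent parameters in dimension d = 3
rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(3, 3)) / 3, rng.normal(size=(3, 3)) / 3
b1, b2 = np.zeros(3), np.zeros(3)
params = lambda t: (W1, b1, W2, b2)
x1 = resnet_forward(np.ones(3), params)
```

Refining the partition (larger `n_steps`) changes the output only slightly, as expected from a discretization of a fixed continuous flow.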
At first glance, simply defining the transport velocity as in (2) may appear unnatural, but it is actually reasonable. The inner parameters $W_1$ and $b_1$ are used to specify a location in the space of data; they control where to assign a velocity vector. If the activation is non-negative, or even bounded, which is often the case, then the outer parameters $W_2$ and $b_2$ are necessary to adjust the direction and magnitude of the transport velocity. Both the inner parameters and the outer parameters are necessary ingredients of the transport velocity field. Of course, if $\sigma$ is symmetric (such as $\tanh$), the outer parameters are not necessary for this purpose.
The ResNet obtained here is special. Firstly, due to the small time step $\Delta t_n$, the residual term $\Delta t_n\, v_n(x_n)$ can be made sufficiently small compared with the leading term $x_n$. This is a necessary condition for the ResNet to be modeled by a transport equation.
Secondly, the parameters of the ResNet change slowly from block to block. More specifically, the parameters at the same positions of adjacent ResNet blocks should be close to each other, because they are assumed to be discretizations of continuous functions of time. For example, $W_{1,n}$ is close to $W_{1,n+1}$, $b_{1,n}$ is close to $b_{1,n+1}$, and so on.
3 Continuous Model of Plain Networks
We have seen that the method of characteristics for transport equations corresponds to ResNets. The key of this connection is the transport velocity field that generates the residual terms between layers. It is natural to consider a similar relationship for a plain net, whose typical layer is

$$x_{l+1} = \sigma(W_l\, x_l + b_l), \tag{18}$$

where $\sigma$ is the activation, $W_l$ the multiplication weight matrix and $b_l$ the bias vector. In (18), however, the non-residual term $\sigma(W_l x_l + b_l)$ defines a finite (rather than infinitesimal) transformation of $x_l$. It cannot be naturally interpreted as a velocity, which makes it difficult to model by a transport equation directly. In this section we construct a continuous flow for the map (18). It is done for the linear map and the nonlinear activation separately. Later, this flow will be used to construct the ResNet approximation of the plain net (18).
As a preparation, we define the time scaling function $g$. If the flow is only required to be continuous in time, then $g(s) = s$ with $s\in[0,1]$ is sufficient. Here we require the flow to be smooth, so $g$ needs to be nonlinear. Let $g: [0,1]\to[0,1]$ be a smooth increasing function that satisfies $g(0) = 0$, $g(1) = 1$ and $g'(0) = g'(1) = 0$.
With the above properties of $g$, the transport velocity fields of adjacent layers can be glued up smoothly. Since we only consider the $l$-th layer in this section, let us drop the layer index from the parameters for simplicity.
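The note does not fix a particular $g$; one concrete choice (an assumption for illustration) is the cubic smoothstep, which is smooth and increasing, fixes the endpoints, and has vanishing derivative at both endpoints so that adjacent velocity fields glue up smoothly:

```python
def g(s):
    # Smooth increasing time scaling on [0,1]: g(0)=0, g(1)=1
    return 3 * s**2 - 2 * s**3

def g_prime(s):
    # Derivative of g; vanishes at both endpoints, g'(0)=g'(1)=0,
    # so the glued velocity field is continuous across layer boundaries
    return 6 * s * (1 - s)
```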
3.1 Linear Map
Approximation by matrix exponentials
The main object considered here is the weight matrix $W$. Without loss of generality, assume that the data has been embedded into a space of sufficiently high dimension $d$, such that $W$ is a square matrix with $\det W \ge 0$. If $W$ could be written in exponential form, we would be done. Unfortunately, this is generally not possible. So we consider its full size singular value decomposition

$$W = U\,\Sigma\,V.$$

Notice that we use $V$ instead of its adjoint $V^*$ in the decomposition. The requirement $\det W \ge 0$ is to ensure that $U$ and $V$ can be taken as proper rotations even if $W$ includes a mirror reflection on its invariant subspace. Since $U$ and $V$ are rotations of finite angles, they can be expressed as exponentials of angular velocity (antisymmetric) matrices:

$$U = e^{A}, \qquad V = e^{B}.$$
The matrix $\Sigma$ is a combination of finite stretches (nonzero diagonal entries) and projections (zero diagonal entries). But projections can be considered as limits of stretches, so $\Sigma$ can be approximated by a matrix exponential:

$$\Sigma \approx e^{D_\lambda}, \qquad D_\lambda = \mathrm{diag}(\log \sigma_1, \dots, \log \sigma_r, -\lambda, \dots, -\lambda),$$

where $\sigma_1, \dots, \sigma_r$ are the nonzero singular values and the last $d - r$ entries are $-\lambda$ with $\lambda > 0$ large. Thus

$$W \approx e^{A}\, e^{D_\lambda}\, e^{B}$$

for large $\lambda$. So the map (18) can be approximated by

$$x \mapsto \sigma\big(e^{A}\, e^{D_\lambda}\, e^{B}\, x + b\big). \tag{24}$$
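The approximation of $\Sigma$ by a matrix exponential is easy to check numerically for a small diagonal example (the singular values below are made up): the projection direction is replaced by the exponentially small stretch $e^{-\lambda}$.

```python
import numpy as np

# Sigma combines finite stretches (2.0, 0.5) and a projection (0.0)
Sigma = np.diag([2.0, 0.5, 0.0])

# Replace the zero singular value by e^{-lambda} with lambda large
lam = 50.0
D = np.diag([np.log(2.0), np.log(0.5), -lam])
Sigma_approx = np.diag(np.exp(np.diag(D)))  # e^D for a diagonal D

err = np.abs(Sigma_approx - Sigma).max()  # of order e^{-50}
```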
Flows of linear maps
The linear part of the map (24) can be approached by a composition of continuous flows. For any $s\in[0,1]$, denote

$$R^B_s(x) = e^{g(s)B}x, \qquad S_s(x) = e^{g(s)D_\lambda}x, \qquad R^A_s(x) = e^{g(s)A}x,$$

then $R^B_0 = S_0 = R^A_0 = \mathrm{id}$, while $R^B_1(x) = e^{B}x$, $S_1(x) = e^{D_\lambda}x$ and $R^A_1(x) = e^{A}x$. For any $s\in[0,1]$, define the translation flow

$$T_s(x) = x + g(s)\,b.$$

Then $T_0 = \mathrm{id}$ and $T_1(x) = x + b$. So the linear maps $e^{B}$, $e^{D_\lambda}$, $e^{A}$ and the translation by $b$ can each be modeled by a continuous flow taking one unit of time. In the following we consider their transport velocity fields.
The rotation flow $R^B_s$ can also be described by the initial value problem of the ODE

$$\frac{dy(s)}{ds} = g'(s)\,B\,y(s), \qquad y(0) = x,$$

because its solution is just $y(s) = e^{g(s)B}x$. It means that the transport velocity field carrying $x$ to $e^{B}x$ is defined by $v(y,s) = g'(s)\,B\,y$.
In a similar way, the stretch flow $S_s$ can also be described by

$$\frac{dy(s)}{ds} = g'(s)\,D_\lambda\,y(s), \qquad y(0) = x,$$

because its solution is just $y(s) = e^{g(s)D_\lambda}x$. It means that the transport velocity field carrying $x$ to $e^{D_\lambda}x$ is defined by $v(y,s) = g'(s)\,D_\lambda\,y$.
In a similar way, the rotation map $R^A_s$ can also be described by

$$\frac{dy(s)}{ds} = g'(s)\,A\,y(s), \qquad y(0) = x,$$

because its solution is just $y(s) = e^{g(s)A}x$. It means that the transport velocity field carrying $x$ to $e^{A}x$ is defined by $v(y,s) = g'(s)\,A\,y$.
Finally, the translation flow $T_s$ can also be described by

$$\frac{dy(s)}{ds} = g'(s)\,b, \qquad y(0) = x,$$

because its solution is simply $y(s) = x + g(s)\,b$. It means that the transport velocity field carrying $x$ to $x + b$ is just $v(y,s) = g'(s)\,b$.
By Euler's method, it can be shown that these linear exponential layers can all be approximated by several linear ResNet blocks.
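As a small sketch of this claim for a planar rotation (taking $g(s) = s$ for simplicity, so $g' \equiv 1$): composing $m$ linear 1-layer blocks $y \mapsto (I + B/m)\,y$ approximates the rotation $e^{B}$.

```python
import numpy as np

theta = 0.7
B = np.array([[0.0, -theta],
              [theta, 0.0]])  # antisymmetric angular velocity matrix
expB = np.array([[np.cos(theta), -np.sin(theta)],
                 [np.sin(theta),  np.cos(theta)]])  # e^B: rotation by theta

def euler_blocks(B, x, m):
    # m linear 1-layer ResNet blocks: y <- y + (B y)/m, i.e. y <- (I + B/m) y
    for _ in range(m):
        x = x + (B @ x) / m
    return x

x0 = np.array([1.0, 0.0])
approx = euler_blocks(B, x0, 1000)  # close to expB @ x0
```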
Now let’s consider the nonlinear activation . Assume that is non-decreasing, differentiable almost everywhere and Lipschitz. From now on, denote
then we have . For any and , define
So it takes one unit of time to move from to . Fix any , the value of is strictly increasing in , hence invertible. Denote , hence . As goes from to , is a flow that continuously moves to . The transport velocity field is given by
Thus is the solution to the initial value problem
Example. Before moving on, let us look at an example of the activation flow $\sigma_s$. Let $\sigma$ be ReLU, i.e. for any $x$,

$$\sigma(x) = \max(x, 0).$$

By definition (36), the activation flow is

$$\sigma_s(x) = (1-s)x + s\max(x,0) = \begin{cases} x, & x \ge 0,\\ (1-s)x, & x < 0.\end{cases}$$

Notice that for any $s\in(0,1)$, it is a leaky ReLU. If $y = \sigma_s(x) < 0$, then $x = \sigma_s^{-1}(y) = y/(1-s)$. Hence the transport velocity field is

$$v(y,s) = \begin{cases} 0, & y \ge 0,\\ -\dfrac{y}{1-s}, & y < 0. \end{cases} \tag{47}$$
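A small numerical check of this example: Euler-integrating the ReLU flow velocity (zero for $y \ge 0$, $-y/(1-s)$ for $y < 0$) carries each point $x$ to $\mathrm{ReLU}(x)$, and the intermediate flow is a leaky ReLU.

```python
import numpy as np

def sigma_s(x, s):
    # Activation flow of ReLU: identity at s=0, ReLU at s=1,
    # a leaky ReLU (negative slope 1-s) in between
    return (1 - s) * x + s * np.maximum(x, 0.0)

def velocity(y, s):
    # Transport velocity field: 0 for y >= 0, -y/(1-s) for y < 0
    return np.where(y >= 0, 0.0, -y / (1 - s))

x = np.array([-2.0, -0.5, 0.3, 1.0])
y = x.copy()
m = 1000
for k in range(m):
    y = y + velocity(y, k / m) / m  # Euler step of dy/ds = v(y, s)
# y now equals ReLU(x) up to floating point error
```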
3.3 Gluing Up
In summary, the map of the nonlinear plain layer (18) can be modeled successively by the flows $R^B_s$, $S_s$, $R^A_s$, $T_s$, $\sigma_s$. It takes 4 units of time to move from $x$ to $Wx + b$, then one more unit of time to move from $Wx + b$ to $\sigma(Wx + b)$. For technical completeness, let us glue these flows together. For any $s\in[0,5]$, define

$$\phi_s = \begin{cases} R^B_s, & s\in[0,1],\\ S_{s-1}\circ R^B_1, & s\in[1,2],\\ R^A_{s-2}\circ S_1\circ R^B_1, & s\in[2,3],\\ T_{s-3}\circ R^A_1\circ S_1\circ R^B_1, & s\in[3,4],\\ \sigma_{g(s-4)}\circ T_1\circ R^A_1\circ S_1\circ R^B_1, & s\in[4,5]. \end{cases} \tag{48}$$
For convenience, the sequentially glued flow (48) is called the layer flow of the $l$-th layer.
The layer flow (48) can also be described by the ODE

$$\frac{dy(s)}{ds} = \begin{cases} g'(s)\,B\,y, & s\in[0,1],\\ g'(s-1)\,D_\lambda\,y, & s\in[1,2],\\ g'(s-2)\,A\,y, & s\in[2,3],\\ g'(s-3)\,b, & s\in[3,4],\\ g'(s-4)\,\big(\sigma(\sigma_{g(s-4)}^{-1}(y)) - \sigma_{g(s-4)}^{-1}(y)\big), & s\in[4,5], \end{cases} \tag{49}$$

with initial condition $y(0) = x$. Then $y(5) = \phi_5(x) \approx \sigma(Wx + b)$. Notice that at the gluing times $s = 1, 2, 3, 4$, the velocity vanishes, because $g'(0) = g'(1) = 0$.
Notice that the above sequential gluing procedure is only one possible way to construct a continuous flow for (18). There are infinitely many flows that produce the same nonlinear map (18), although most of them do not have such an explicit formulation.
Previously, we constructed a transport velocity field for a typical single layer of a plain net. Let us now construct the velocity field for the whole network. Consider the terminal value problem of the linear transport equation (1). Now the transport velocity field is defined by gluing up the layer flows (49) of the different layers. The detail is as follows. Let $\{t_l\}_{l=0}^L$ with $t_0 = 0$ and $t_L = 1$ be a uniform partition of $[0,1]$ such that $\Delta t = 1/L$ is small enough. Then for $t\in[t_l, t_{l+1}]$, define $v(x,t)$ to be the velocity field (49) of the $l$-th layer, with the time rescaled so that the 5 units of time in (49) are compressed into the interval $[t_l, t_{l+1}]$. Notice that for any $l$, the transport velocity field vanishes at $t = t_l$. It means that $v$ is smooth in $t$. Thus we have seen that the transport equation is a continuous model for the plain net: given any plain net, we can construct a transport equation using its parameters and activations.
4 Re-Discretization as ResNet
In Section 2, we have shown that ResNets can be modeled by continuous flows. In Section 3, we have shown that plain nets can also be modeled by continuous flows. It is natural to connect the two types of neural networks through their continuous models. In this section, we show that by re-discretizing the flow model obtained from a plain net, we get a ResNet which approximates the plain net. More specifically, each layer of the plain net is approximated by several ResNet blocks.
4.1 Linear Map
We have two options for the linear part of the map,

$$x \mapsto e^{A}\, e^{D_\lambda}\, e^{B}\, x + b. \tag{51}$$

One option is to leave it as a whole map. The other is to discretize its continuous model in the same way as we did in Section 2. For the second option, one only needs to apply Euler's method to the ODEs in (49) which correspond to the linear map. Let us discretize the first equation in (49) as an example. Let $\{s_k\}_{k=0}^m$ with $s_0 = 0$ and $s_m = 1$ be a uniform partition of $[0,1]$, such that $\Delta s = 1/m$ is small enough. Denote $y_k = y(s_k)$, then $y_0 = x$. By Euler's method, we have

$$y_{k+1} \approx y_k + \Delta s\, g'(s_k)\, B\, y_k = \big(I + \Delta s\, g'(s_k)\, B\big)\, y_k, \tag{53}$$

which is a linear 1-layer ResNet block. Repeating this iteration $m$ times, we have

$$e^{B} x = y(1) \approx \big(I + \Delta s\, g'(s_{m-1})\, B\big) \cdots \big(I + \Delta s\, g'(s_0)\, B\big)\, x.$$
We can apply the same procedure to the second, third and fourth equations in (49). The discretizations of these equations are very similar, hence omitted here.
In the following, let’s focus on the nonlinear part. The activation flow is solved from (42) in the following way. Recall that it takes one unit of time to move from to . For clarity in notations, we still use as the range of time . Let with and be a uniform partition of , such that is small enough. Denote and . Then and . Solve (42) by Euler method iteratively, we have
To see the basic structure of the ResNet, let us make (55) explicit.
Example. For the ReLU activation, it is straightforward. According to (47),

$$y_{k+1} \approx y_k + \Delta s\, v(y_k, s_k) = y_k - \frac{\Delta s}{1 - s_k}\, \min(y_k, 0),$$

which is a 1-layer ResNet block with scalar weight

$$w_k = -\frac{\Delta s}{1 - s_k}$$

and activation $\min(\cdot\,, 0)$. Thus for the ReLU activation, the approximation of plain nets by ResNets is quite trivial.
If $\sigma_s^{-1}$ has no explicit expression or is nonlinear, we may consider its linearization near $y_k$. According to the definition (36), the Jacobian of $\sigma_s$ at any $x$ is

$$\nabla_x \sigma_s(x) = (1 - s) + s\,\sigma'(x),$$

whose inverse is the Jacobian of $\sigma_s^{-1}$ in terms of $y = \sigma_s(x)$:

$$\nabla_y \sigma_s^{-1}(y) = \frac{1}{(1 - s) + s\,\sigma'(x)}.$$

Notice that $\sigma'(x)$ is a vector and the fraction is entry-wise. Since the linearization of the inverse is the inverse of the linearization, we first linearize $\sigma_s$, then compute its inverse.
For simplicity, denote
and take , we have
Then the iteration (55) becomes
Then we get
which is the approximation of (55). It contains a 2-layer ResNet block followed by a non-residual linear map, as shown in Figure 5. The whole activation flow is composed of several iterations of (55) or its approximation (72).
Together with the linear map (51), the single $l$-th layer of the plain net (18) is approximated by a composition of linear maps and 2-layer ResNet blocks. See Figure 6. Alternatively, we can use (53) and its successors instead of the whole linear map (51). See Figure 7.
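The first option can be illustrated end-to-end (a minimal sketch; the random weights and the dimension are made up): a plain ReLU layer $\sigma(Wx+b)$ is reproduced by applying the linear map as a whole and then stacking the 1-layer blocks from the ReLU example.

```python
import numpy as np

def plain_layer(x, W, b):
    return np.maximum(W @ x + b, 0.0)

def resnet_layer(x, W, b, m=1000):
    # Option 1: apply the linear map as a whole, then approximate
    # the ReLU activation flow by m 1-layer ResNet blocks
    y = W @ x + b
    for k in range(m):
        w_k = -(1.0 / m) / (1 - k / m)     # scalar weight of the k-th block
        y = y + w_k * np.minimum(y, 0.0)   # 1-layer block
    return y

rng = np.random.default_rng(1)
W, b = rng.normal(size=(4, 4)), rng.normal(size=4)
x = rng.normal(size=4)
out_plain = plain_layer(x, W, b)
out_resnet = resnet_layer(x, W, b)  # agrees with out_plain up to float error
```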
Now it may be a little confusing: the multi-layer ResNet still contains several activations (the small orange circles with solid borders, within each dashed green box in Figures 6 and 7). Why bother to replace one activation (the orange dotted ellipse) by such a multi-layer structure containing more activations? The answer is as follows. The roles of the activations in the original plain net and in the new ResNet are different. In the plain net, the activation causes a nonlinear distortion of the map between two layers, or poses a geometric constraint on the layer flow; the effect is significant and immediate. In the ResNet obtained above, however, the activation causes a nonlinear distortion of the transport velocity field, or poses a differential constraint on the layer flow; the effect becomes significant only after accumulation.
Another confusing point concerns the continuous change of parameters from layer to layer. Since the neural networks here are obtained by discretizing a continuous flow, it is natural to guess that the parameters of the networks vary slowly from layer to layer. However, we should be careful about this idea: it is generally not true for nonlinear networks. For the nonlinear plain net (18), we have seen in (49) that the continuous transport velocity field is NOT simply of the form

$$v(x,t) = \sigma\big(W(t)\,x + b(t)\big).$$

So the parameters $W_l$ and $b_l$ in (18) themselves should not be regarded as discretizations of some continuous parameters of a velocity field.
For the nonlinear ResNet shown in Figure 6, the situation is more subtle. To approximate the activation in one layer of the original plain net (18) (the dotted orange ellipse), several basic structures (the dashed green boxes in Figure 6) of the ResNet are used. These basic structures are all the same, and the parameters at their corresponding positions vary slowly from one to the next. In this sense, the parameters of the ResNet change continuously. But if we only naively go through the parameters layer by layer, we will not find this continuity.
5 Comments
In Sections 2 and 3 respectively, we used a transport equation and its characteristic equation as continuous flow models for ResNets and plain nets. This correspondence between a neural network and its flow model is very natural, even obvious for ResNets. It is summarized in Table 1 and illustrated in Figure 8.
Inspired by the connection between neural networks and transport equations, well-studied methods from the area of differential equations might help to understand neural networks or to solve related problems. Here are just a few examples:
We have seen the reason for using 2-layer blocks in ResNets. In the language of transport equations, the inner parameters specify a location in the space of data; they tell the network where to assign a velocity vector. The outer parameters adjust the magnitude and direction of the velocity vector at the specified location. The outer parameters are necessary because ReLU is asymmetric.
The correspondence provides one way to see why depth is good for neural networks. From the perspective of the TVP of the transport equation, in order to transform the terminal value function into the initial value function, the transport velocity field may need to be complicated. To make the discretization converge and to control the error, it is necessary to use a small time step and many iterations; this keeps the discretization regular, with each step making only a small amount of progress. For neural networks, the transformation provided by each layer is likewise very limited, so more layers are needed to accomplish the required deformation of the dataset.
In practice, ResNets can usually be significantly deeper than plain nets. Considering their connections with flow models, the reason is quite transparent. On the one hand, a plain net is equivalent to its flow model, which is constructed in Section 3. On the other hand, the flow model can be discretized in an iterative way to get a ResNet, as described in Section 4. Combining the two facts, the ResNet is a refinement of the original plain net; naturally, it is deeper than the plain net.
Although ResNets can be very deep, many authors have shown that training ResNets is easier than training plain nets of comparable depth. From the differential equation point of view, this is because ResNets deform the dataset in an incremental way, which is much more regular than the way plain nets do.
When solving PDEs, people often use dissipative terms to increase the regularity of solutions. In terms of neural networks, this means adding randomness to the feedforward process. This idea is very close to the dropout technique (Srivastava et al., 2014).
We have seen that ResNets correspond to the method of characteristics for transport equations. But there are other methods to solve PDEs (Li and Shi, 2017a), which might lead to alternative architectures equivalent to neural networks.
The training of neural networks can be considered as solving an inverse problem for the transport equation: both the initial value and the terminal value are given, and the task is to find a time-dependent velocity field that transports the initial value to the terminal value. Of course, the solution of this inverse problem is highly non-unique; there are uncountably many velocity fields that can do the job. Thus the inverse problem is usually formulated as an optimization problem constrained by the transport equation as well as the initial and terminal conditions. There are many methods to solve such problems, and some of them could be adapted to train neural networks (Li et al., 2017).
One possible question about the continuous model is the dimension matching problem. In practice, one has the flexibility to choose different dimensions for different layers, but in the continuous model it seems difficult to do so. Since the main concern of this note is theoretical, this is not a serious problem for us. Actually, the dimension matching problem already exists in ResNets: shortcuts are only used when dimensions match; otherwise, extra projection matrices are needed. In this note, we have adopted the simple assumption that the dataset is embedded into a space of sufficiently high dimension at the beginning, and this ambient dimension does not change with time. In order to approximate the necessary reduction of the intrinsic dimension of the dataset over time, we used compressing flows. Of course, this theoretical approach is inefficient in practice. An alternative approach is to glue up flow models with different dimensions.
Zhen Li would like to express his gratitude for the support of Professors Yuan Yao and Yang Wang from the Department of Mathematics, HKUST.
- Most of this work was submitted to arXiv as two separate notes, on 22 August (Li and Shi, 2017b) and 6 September respectively, but the latter was not announced due to technical reasons. This note is a combination of the two previous notes.
- Chang, B., Meng, L., Haber, E., Tung, F., and Begert, D. (2017). Multi-level Residual Networks from Dynamical Systems View. ArXiv e-prints.
- Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. (2009). ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09.
- E, W. (2017). A proposal on machine learning via dynamical systems. Communications in Mathematics and Statistics, 5(1):1–11. Dedicated to Professor Chi-Wang Shu on the occasion of his 60th birthday.
- He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image recognition. In Computer Vision and Pattern Recognition (CVPR), 2016 IEEE Conference on.
- Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. In Pereira, F., Burges, C. J. C., Bottou, L., and Weinberger, K. Q., editors, Advances in Neural Information Processing Systems 25, pages 1097–1105. Curran Associates, Inc.
- LeCun, Y., Bengio, Y., and Hinton, G. (2015). Deep learning. Nature, 521(7553):436–444.
- Li, Q., Chen, L., Tai, C., and E, W. (2017). Maximum principle based algorithms for deep learning. CoRR, abs/1710.09513.
- Li, Z. and Shi, Z. (2017a). Deep residual learning and pdes on manifold. CoRR, abs/1708.05115.
- Li, Z. and Shi, Z. (2017b). Notes: A continuous model of neural networks. part I: residual networks. CoRR, abs/1708.06257.
- Simonyan, K. and Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556.
- Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2014). Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res., 15(1):1929–1958.
- Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., and Rabinovich, A. (2015). Going deeper with convolutions. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1–9. IEEE.
- Zeiler, M. D. and Fergus, R. (2014). Visualizing and Understanding Convolutional Networks, pages 818–833. Springer International Publishing, Cham.