Robust learning with implicit residual networks
In this effort, we propose a new deep architecture utilizing residual blocks inspired by implicit discretization schemes. As opposed to the standard feed-forward networks, the outputs of the proposed implicit residual blocks are defined as the fixed points of appropriately chosen nonlinear transformations. We show that this choice leads to improved stability of both forward and backward propagations, has a favorable impact on the generalization power of the network, and allows for higher learning rates. In addition, we consider a reformulation of ResNet which does not introduce new parameters and can potentially lead to a reduction in the number of required layers due to improved forward stability and robustness. Finally, we derive a memory-efficient reversible training algorithm and provide numerical results in support of our findings.
Viktor Reshniak Computational and Applied Mathematics Oak Ridge National Lab Oak Ridge, TN 37831 email@example.com Clayton G. Webster Department of Mathematics University of Tennessee at Knoxville Knoxville, TN 37996 firstname.lastname@example.org, and Computational and Applied Mathematics Oak Ridge National Lab Oak Ridge, TN 37831 email@example.com
Preprint. Under review.
1 Introduction and related works
A large volume of empirical results has been collected in recent years illustrating the striking success of deep neural networks (DNNs) in approximating complicated maps by a mere composition of relatively simple functions LeCun2015 (). The universal approximation property of DNNs with a relatively small number of parameters has also been shown for a large class of functions hanin2017universal (); NIPS2017_7203 (). The training of deep networks nevertheless remains a notoriously difficult task due to the issues of exploding and vanishing gradients, which become more pronounced with increasing depth Bengio1994 (). These issues have accelerated the efforts of the research community to explain this behavior and to gain new insights into the design of better architectures and faster algorithms. A promising approach in this direction was obtained by casting the evolution of the hidden states of a DNN as a dynamical system E2017 (), i.e.,
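$$y_{l+1} = \Phi_l(y_l, \gamma_l), \qquad l = 0, \dots, L-1,$$
where for each layer $l$, $\Phi_l : Y_l \times \Gamma_l \to Y_{l+1}$ is a nonlinear transformation parameterized by the weights $\gamma_l \in \Gamma_l$, and $Y_l$, $\Gamma_l$ are the appropriately chosen spaces. In the case of a very deep network, when $L \to \infty$, it is convenient to consider the continuous time limit of the above expression such that
$$y(t) = \Phi(y_0, \gamma, t), \qquad t \in [0, T],$$
where the parametric evolution function $\Phi$ defines a continuous flow through the input data $y_0$. Parameter estimation for such continuous evolution can be viewed as an optimal control problem E2018 (), given by
$$\min_{\gamma} \; \mathbb{E}_{(y_0, y^*) \sim P} \left[ \mathcal{L}\big(y(T), y^*\big) + \int_0^T \mathcal{R}\big(\gamma(t)\big)\, dt \right] \tag{1}$$
$$\text{subject to} \quad y(t) = \Phi(y_0, \gamma, t), \qquad t \in [0, T], \tag{2}$$
where $\mathcal{L}$ is a terminal loss function, $\mathcal{R}$ is a regularizer, and $P$ is a probability distribution of the input-target data pairs $(y_0, y^*)$. More general models additionally consider spatially continuous networks by using differential Ruthotto2018 () or integral formulations Sonoda2017 (). A continuous time formulation based on ordinary differential equations (ODEs) was proposed in NIPS2018_7892 () with the state equation (2) of the form
$$\dot{y}(t) = F\big(y(t), \gamma(t)\big), \qquad y(0) = y_0. \tag{3}$$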
In the work NIPS2018_7892 (), the authors relied on black-box ODE solvers and used adjoint sensitivity analysis (see, e.g., Strang2007 () for an introduction to adjoint methods) to derive equations for the backpropagation of errors through the continuous system.
The authors of Haber_2017 () concentrated on the well-posedness of the learning problem for ODE-constrained control and emphasized the importance of stability in the design of deep architectures. For instance, the solution of a homogeneous linear ODE with constant coefficients
$$\dot{y}(t) = A\, y(t), \qquad y(0) = y_0,$$
is given by
$$y(t) = Q\, e^{\Lambda t}\, Q^{-1} y_0,$$
where $A = Q \Lambda Q^{-1}$ is the eigen-decomposition of the matrix $A$, and $\Lambda$ is the diagonal matrix with the corresponding eigenvalues. A similar equation holds for the backpropagation of gradients. To guarantee the efficient propagation of information through the network, one must ensure that the elements of $e^{\Lambda t}$ have magnitudes close to one. This condition is satisfied when all eigenvalues of the matrix $A$ have real parts close to zero, i.e., when they are (nearly) purely imaginary. In order to preserve this property, the authors of Haber_2017 () proposed several time continuous architectures of the form
$$\dot{y}(t) = F\big(y(t), z(t), \gamma(t)\big), \qquad \dot{z}(t) = G\big(y(t), z(t), \gamma(t)\big). \tag{4}$$
When $F = \nabla_z H$, $G = -\nabla_y H$, the equations above provide an example of a conservative Hamiltonian system with the total energy $H(y, z)$.
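The role of the spectrum can be checked numerically. The following is a minimal sketch, assuming a diagonalizable coefficient matrix and using NumPy; the function name `linear_ode_solution` and the two test matrices are illustrative choices, not part of the original work. It evaluates the eigen-decomposition formula for a skew-symmetric matrix (purely imaginary spectrum) and for a damped variant (negative real parts).

```python
import numpy as np

def linear_ode_solution(A, y0, t):
    """Evaluate y(t) = Q exp(Lambda t) Q^{-1} y0 for dy/dt = A y (A assumed diagonalizable)."""
    lam, Q = np.linalg.eig(A)                         # A = Q diag(lam) Q^{-1}
    coeffs = np.linalg.solve(Q, y0.astype(complex))   # Q^{-1} y0
    # For real A and y0 the imaginary part of the result vanishes up to round-off.
    return (Q @ (np.exp(lam * t) * coeffs)).real

y0 = np.array([1.0, 0.5])
A_skew = np.array([[0.0, 1.0], [-1.0, 0.0]])      # eigenvalues +i and -i
A_damped = np.array([[-1.0, 1.0], [-1.0, -1.0]])  # eigenvalues -1 + i and -1 - i

norm_skew = np.linalg.norm(linear_ode_solution(A_skew, y0, 10.0))
norm_damped = np.linalg.norm(linear_ode_solution(A_damped, y0, 10.0))
print(norm_skew, norm_damped)
```

With the imaginary spectrum the norm of the state is preserved over time, whereas with negative real parts the signal decays exponentially and little information about the input survives at the final time.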
In the discrete setting of the ordinary feed forward networks, the necessary conditions for the optimal solution of (1)-(2) recover the well-known equations for the forward propagation (state equation (2)), the backward gradient propagation (co-state equation), and the optimality condition used to compute the weights (gradient descent algorithm), see, e.g., LeCun1988 (). The continuous setting offers additional flexibility in the construction of discrete networks with the desired properties and efficient learning algorithms. The classical feed forward network (Figure 1, left) is just a particular, and the simplest, example of such a discretization, and it is prone to all the issues of deep learning mentioned above. In order to facilitate the training process, a skip-connection is often added to the network (Figure 1, middle), yielding
$$y_{l+1} = y_l + h\, F(y_l, \gamma_l), \tag{5}$$
where $h$ is a positive hyperparameter. Equation (5) can be viewed as a forward Euler scheme to solve the ODE in (3) numerically on the time grid $t_l = l h$, $l = 0, \dots, L$, with step size $h = T/L$. While it was shown that such residual layers help to mitigate the problem of vanishing gradients and speed up the training process ResNet2016 (), the scheme has very restrictive stability properties Hairer1993 (). This can result in the uncontrolled accumulation of errors at the inference stage, reducing the generalization ability of the trained network. Moreover, the Euler scheme is not capable of preserving the geometric structure of conservative flows and is thus a bad choice for the long time integration of such ODEs Hairer2006 (). In other words, the residual blocks in (5) are not well suited for very deep networks.
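To make the accumulation of errors concrete, here is a minimal NumPy sketch of the residual update $y_{l+1} = y_l + h F(y_l, \gamma_l)$. The linear layer function `F_lin` and the skew-symmetric weight `K` are hypothetical choices made only so that the exact flow is a norm-preserving rotation and the growth of the discrete solution can be computed in closed form.

```python
import numpy as np

def resnet_forward(y0, weights, h, F):
    """Apply residual layers y_{l+1} = y_l + h * F(y_l, gamma_l) (forward Euler steps)."""
    y = y0
    for gamma in weights:
        y = y + h * F(y, gamma)
    return y

# Hypothetical layer function: a linear map with a skew-symmetric weight,
# so the underlying ODE dy/dt = K y preserves the norm of y exactly.
F_lin = lambda y, K: K @ y
K = np.array([[0.0, 1.0], [-1.0, 0.0]])

L, T = 100, 10.0
h = T / L
y0 = np.array([1.0, 0.0])

yT = resnet_forward(y0, [K] * L, h, F_lin)
growth = np.linalg.norm(yT) / np.linalg.norm(y0)
print(growth, (1 + h**2) ** (L / 2))
```

Each layer multiplies the norm by $\sqrt{1 + h^2}$, so the discrete state drifts away from the conservative flow even though the step size is small.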
Memory efficient explicit reversible architectures can be obtained by considering time discretization of the partitioned system of ODEs in (4). The reversibility property allows one to recover the internal states of the system by propagating through the network in both directions, so these values do not need to be cached for the evaluation of the gradients. The first such architecture (RevNet) was proposed in NIPS2017_6816 () without reference to the discrete solutions of ODEs; it has the form
$$y_{l+1} = y_l + F(z_l, \gamma_l), \qquad z_{l+1} = z_l + G(y_{l+1}, \gamma_l).$$
It was later recognized as the Verlet method applied to the particular form of the system in (4), see Haber_2017 (); chang2018reversible (). The leapfrog and midpoint networks are two other examples of reversible architectures proposed in chang2018reversible ().
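The reversibility property can be illustrated with a short NumPy sketch. It assumes the additive coupling used by RevNet, $y_{l+1} = y_l + F(z_l)$, $z_{l+1} = z_l + G(y_{l+1})$; the functions `F` and `G` and the random weights below are hypothetical stand-ins for trained sub-networks.

```python
import numpy as np

def rev_forward(y, z, F, G):
    """One reversible block: y' = y + F(z), z' = z + G(y')."""
    y_next = y + F(z)
    z_next = z + G(y_next)
    return y_next, z_next

def rev_inverse(y_next, z_next, F, G):
    """Recover the block input from its output without any cached activations."""
    z = z_next - G(y_next)
    y = y_next - F(z)
    return y, z

rng = np.random.default_rng(0)
K1, K2 = rng.standard_normal((3, 3)), rng.standard_normal((3, 3))
F = lambda z: np.tanh(K1 @ z)    # hypothetical residual functions
G = lambda y: -np.tanh(K2 @ y)

y0, z0 = rng.standard_normal(3), rng.standard_normal(3)
y, z = y0, z0
for _ in range(10):              # forward pass through 10 blocks
    y, z = rev_forward(y, z, F, G)

yb, zb = y, z
for _ in range(10):              # backward sweep reconstructs the input
    yb, zb = rev_inverse(yb, zb, F, G)
print(np.max(np.abs(yb - y0)), np.max(np.abs(zb - z0)))
```

Because every block can be inverted in closed form, the hidden states needed for the gradient computation can be recomputed during the backward sweep instead of being stored.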
Other residual architectures can also be found in the literature, including Resnet in Resnet (RiR) targ2016resnet (), Dense Convolutional Network (DenseNet) Huang_2017_CVPR () and the linearly implicit network (IMEXnet) haber2019imexnet (). For some problems, all of these networks show a substantial improvement over the classical ResNet but still have an explicit structure, which has limited robustness to perturbations of the input data and of the parameters of the network. Instead, in this effort we propose a new fully implicit residual architecture which, unlike the above mentioned examples, is unconditionally stable and robust. As opposed to the standard feed-forward networks, the outputs of the proposed implicit residual blocks are defined as the fixed points of appropriately chosen nonlinear transformations as follows:
$$y_{l+1} = y_l + \mathcal{F}(y_l, y_{l+1}, \gamma_l).$$
The right part of Figure 1 provides a graphical illustration of the proposed layer. The choice of the nonlinear transformation and the design of the learning algorithm are discussed in the next section.
2 Description of the method
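We first motivate the need for the new method by letting the continuous model of a network be given by the ordinary differential equations in (4), that is,
$$\dot{y} = F(t, y, z), \qquad \dot{z} = G(t, y, z),$$
where the dependence on the parameters $\gamma(t)$ is absorbed into the explicit dependence on $t$. An $s$-stage Runge-Kutta method for the approximate solution of the above equations is given by
$$Y_i = y_n + h \sum_{j=1}^{s} a_{ij}\, F(t_n + c_j h, Y_j, Z_j), \qquad Z_i = z_n + h \sum_{j=1}^{s} \hat{a}_{ij}\, G(t_n + \hat{c}_j h, Y_j, Z_j), \qquad i = 1, \dots, s,$$
$$y_{n+1} = y_n + h \sum_{i=1}^{s} b_i\, F(t_n + c_i h, Y_i, Z_i), \qquad z_{n+1} = z_n + h \sum_{i=1}^{s} \hat{b}_i\, G(t_n + \hat{c}_i h, Y_i, Z_i).$$
The order conditions for the coefficients $a_{ij}$, $b_i$, $c_i$, $\hat{a}_{ij}$, $\hat{b}_i$, and $\hat{c}_i$, which guarantee convergence of the numerical solution, are well known and can be found in any topical text, see, e.g., Hairer1993 (). Note that when $a_{ij} \neq 0$ or $\hat{a}_{ij} \neq 0$ for at least some $j \geq i$, the scheme is implicit and a system of nonlinear equations has to be solved at each iteration, which increases the complexity of the solver. Nevertheless, the following example illustrates the benefits of using implicit approximations.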
Linear stability analysis.
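Consider the following linear differential system
$$\dot{y} = \omega z, \qquad \dot{z} = -\omega y, \tag{6}$$
and four simple discretization schemes:
$$\text{Forward Euler:} \qquad y_{n+1} = y_n + h\omega z_n, \qquad z_{n+1} = z_n - h\omega y_n,$$
$$\text{Backward Euler:} \qquad y_{n+1} = y_n + h\omega z_{n+1}, \qquad z_{n+1} = z_n - h\omega y_{n+1},$$
$$\text{Trapezoidal:} \qquad y_{n+1} = y_n + \tfrac{h\omega}{2}(z_n + z_{n+1}), \qquad z_{n+1} = z_n - \tfrac{h\omega}{2}(y_n + y_{n+1}),$$
$$\text{Verlet:} \qquad z_{n+1/2} = z_n - \tfrac{h\omega}{2} y_n, \qquad y_{n+1} = y_n + h\omega z_{n+1/2}, \qquad z_{n+1} = z_{n+1/2} - \tfrac{h\omega}{2} y_{n+1}.$$
Due to linearity of the system in (6), we can write the generated numerical solutions as
$$\begin{pmatrix} y_{n+1} \\ z_{n+1} \end{pmatrix} = R \begin{pmatrix} y_n \\ z_n \end{pmatrix},$$
where the $2 \times 2$ matrix $R = R(h\omega)$ depends on the scheme.
The long time behavior of the discrete dynamics is hence determined by the spectral radius $\rho(R)$ of the matrix $R$, which needs to be less than or equal to one for the sake of stability. For example, we have $\rho(R) = \sqrt{1 + h^2\omega^2} > 1$ for the forward Euler scheme and the method is unconditionally unstable. The backward Euler scheme gives $\rho(R) = 1/\sqrt{1 + h^2\omega^2} < 1$ and the method is unconditionally stable. The corresponding eigenvalues of the trapezoidal scheme have magnitude equal to one for all $h$ and $\omega$. Finally, the characteristic polynomial for the matrix of the Verlet scheme is given by $\lambda^2 - (2 - h^2\omega^2)\lambda + 1$, i.e., the method is only conditionally stable when $h|\omega| < 2$.
Figure 2 illustrates this behavior for a particular choice of the parameters. Notice that the flows of the forward and backward Euler schemes are strictly expanding and contracting, respectively, which makes the training process inherently ill-posed as the dynamics are not easily invertible. In contrast, the implicit trapezoidal and explicit Verlet schemes reproduce the original flow very well, but the latter does so only under a restriction on the size of the step $h$. Another nice property of the trapezoidal and Verlet schemes is their symmetry with respect to exchanging $(y_n, z_n) \leftrightarrow (y_{n+1}, z_{n+1})$ and $h \leftrightarrow -h$. Such methods play a central role in the geometric integration of reversible differential flows and are handy in the construction of memory efficient reversible network architectures. Conditions for the reversibility of general Runge-Kutta schemes can be found in Hairer2006 ().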
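The stability statements above can be verified numerically. The sketch below assumes the test system $\dot{y} = \omega z$, $\dot{z} = -\omega y$ and builds the one-step propagation matrix of each scheme as a function of $a = h\omega$; the Verlet matrix is written for the half-step (kick-drift-kick) form. Function and key names are illustrative.

```python
import numpy as np

def one_step_matrices(a):
    """One-step matrices for dy/dt = w z, dz/dt = -w y, with a = h * w."""
    I = np.eye(2)
    J = np.array([[0.0, 1.0], [-1.0, 0.0]])   # d/dt (y, z) = w * J (y, z)
    return {
        "forward_euler": I + a * J,
        "backward_euler": np.linalg.inv(I - a * J),
        "trapezoidal": np.linalg.solve(I - 0.5 * a * J, I + 0.5 * a * J),
        # half step in z, full step in y, half step in z, written as one matrix
        "verlet": np.array([[1 - a**2 / 2, a],
                            [-a * (1 - a**2 / 4), 1 - a**2 / 2]]),
    }

def spectral_radius(M):
    return float(np.max(np.abs(np.linalg.eigvals(M))))

radii_small = {k: spectral_radius(M) for k, M in one_step_matrices(0.5).items()}
radii_large = {k: spectral_radius(M) for k, M in one_step_matrices(2.5).items()}
for name in radii_small:
    print(f"{name:15s} a=0.5: {radii_small[name]:.4f}   a=2.5: {radii_large[name]:.4f}")
```

For the small step both the trapezoidal and Verlet schemes have unit spectral radius, while for $h\omega = 2.5$ only the trapezoidal scheme remains on the unit circle.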
2.1 Implicit ResNet.
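Motivated by the discussion above, we propose an implicit variant of the residual networks given by
$$y = x + h\big[\theta\, F(x, \gamma) + (1 - \theta)\, F(y, \gamma)\big], \qquad \theta \in [0, 1], \tag{7}$$
where $x$, $y$, $\gamma$ are the input, output and parameters of the layer and $F$ is a nonlinear function.
To solve the nonlinear equation in (7), consider the equivalent minimization problem
$$y = \arg\min_{v} \tfrac{1}{2} \| r(v) \|_2^2, \qquad r(v) := v - x - h\big[\theta\, F(x, \gamma) + (1 - \theta)\, F(v, \gamma)\big].$$
One way to construct the required solution is by applying the gradient descent algorithm with a step size $\eta > 0$,
$$v^{(k+1)} = v^{(k)} - \eta \left(I - h(1 - \theta) \frac{\partial F(v^{(k)}, \gamma)}{\partial v}\right)^{T} r\big(v^{(k)}\big).$$
Alternatively, the fixed point iteration
$$v^{(k+1)} = x + h\big[\theta\, F(x, \gamma) + (1 - \theta)\, F(v^{(k)}, \gamma)\big]$$
can also be used when the initial guess $v^{(0)}$ is sufficiently close to the minimizer.
Finally, by linearizing $F(y, \gamma)$ around $x$, we obtain the closed form estimate of the solution
$$y \approx x + h \left(I - h(1 - \theta) \frac{\partial F(x, \gamma)}{\partial x}\right)^{-1} F(x, \gamma),$$
which can be used as an initial guess for the mentioned iterative algorithms.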
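The forward pass described above can be sketched in a few lines of NumPy. The sketch assumes a layer of the form $y = x + h[\theta F(x, \gamma) + (1 - \theta) F(y, \gamma)]$ with the hypothetical choice $F(y, K) = \tanh(K y)$; the names `implicit_layer`, `F`, `dF_dy` and the default values of `h`, `theta` and `tol` are illustrative assumptions. It starts from the linearized estimate and refines it with the fixed point iteration.

```python
import numpy as np

def F(y, K):
    """Hypothetical layer function F(y, gamma) = tanh(K y)."""
    return np.tanh(K @ y)

def dF_dy(y, K):
    """Jacobian of F with respect to y: diag(1 - tanh^2(K y)) K."""
    return (1.0 - np.tanh(K @ y) ** 2)[:, None] * K

def implicit_layer(x, K, h=0.1, theta=0.5, tol=1e-12, max_iter=100):
    """Solve y = x + h*[theta*F(x) + (1 - theta)*F(y)] by fixed point iteration."""
    n = x.size
    explicit_part = x + h * theta * F(x, K)
    # Initial guess from linearizing F(y) around x.
    y = x + np.linalg.solve(np.eye(n) - h * (1 - theta) * dF_dy(x, K), h * F(x, K))
    for k in range(max_iter):
        y_new = explicit_part + h * (1 - theta) * F(y, K)
        if np.linalg.norm(y_new - y) < tol:
            return y_new, k + 1
        y = y_new
    return y, max_iter

rng = np.random.default_rng(0)
K = rng.standard_normal((4, 4))
K /= np.linalg.norm(K, 2)          # unit spectral norm keeps the iteration contractive
x = rng.standard_normal(4)

y, n_iter = implicit_layer(x, K)
residual = np.linalg.norm(y - x - 0.1 * (0.5 * F(x, K) + 0.5 * F(y, K)))
print(n_iter, residual)
```

The normalization of `K` is a convenience of this example: it bounds the Lipschitz constant of the fixed point map by $h(1 - \theta)$, so convergence is guaranteed regardless of the initial guess.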
It is worth noting that, even though the nonlinearity in (7) adds to the complexity of the forward propagation, the backpropagation through the nonlinear solver is not required as is shown below.
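Using the chain rule we can easily find the Jacobian matrices of the implicit residual layer $y = x + h[\theta F(x, \gamma) + (1 - \theta) F(y, \gamma)]$ in (7) as follows
$$\frac{\partial y}{\partial x} = \left(I - h(1 - \theta) \frac{\partial F(y, \gamma)}{\partial y}\right)^{-1} \left(I + h\theta \frac{\partial F(x, \gamma)}{\partial x}\right), \qquad \frac{\partial y}{\partial \gamma} = h \left(I - h(1 - \theta) \frac{\partial F(y, \gamma)}{\partial y}\right)^{-1} \left(\theta \frac{\partial F(x, \gamma)}{\partial \gamma} + (1 - \theta) \frac{\partial F(y, \gamma)}{\partial \gamma}\right).$$
The backpropagation formulas then follow immediately
$$\lambda = \left(I - h(1 - \theta) \frac{\partial F(y, \gamma)}{\partial y}\right)^{-T} \frac{\partial \mathcal{L}}{\partial y}, \qquad \frac{\partial \mathcal{L}}{\partial x} = \left(I + h\theta \frac{\partial F(x, \gamma)}{\partial x}\right)^{T} \lambda, \qquad \frac{\partial \mathcal{L}}{\partial \gamma} = h \left(\theta \frac{\partial F(x, \gamma)}{\partial \gamma} + (1 - \theta) \frac{\partial F(y, \gamma)}{\partial \gamma}\right)^{T} \lambda.$$
One can see that the backpropagation is an essentially linear process and only one linear solve, for the vector $\lambda$, is required at the beginning of each layer. It is also clear that the fully explicit case $\theta = 1$ corresponds to the standard ResNet architecture with essentially no control over the propagation of perturbations through the network. At the other extreme, in the fully implicit case $\theta = 0$, the network has excellent forward stability but cannot be trained. The trapezoidal scheme with $\theta = 1/2$ instead has a proper balance between the stability and controllability while also being a reversible second-order integrator.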
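The claim that a single linear solve suffices for the backward pass can be checked against finite differences. The sketch below is an illustration under stated assumptions, not the authors' implementation: it assumes the layer $y = x + h[\theta F(x, \gamma) + (1 - \theta) F(y, \gamma)]$, the hypothetical layer function $F(y, K) = \tanh(K y)$, and the loss $\frac{1}{2}\|y\|^2$; all function names are illustrative.

```python
import numpy as np

def F(y, K):
    return np.tanh(K @ y)

def dF_dy(y, K):
    return (1.0 - np.tanh(K @ y) ** 2)[:, None] * K

def solve_layer(x, K, h, theta, iters=200):
    """Forward pass: fixed point iteration for y = x + h*[theta*F(x) + (1-theta)*F(y)]."""
    y = x.copy()
    for _ in range(iters):
        y = x + h * theta * F(x, K) + h * (1 - theta) * F(y, K)
    return y

def backprop_input(x, y, K, h, theta, grad_y):
    """Gradient w.r.t. the layer input using one linear solve (implicit differentiation)."""
    n = x.size
    A = np.eye(n) - h * (1 - theta) * dF_dy(y, K)
    lam = np.linalg.solve(A.T, grad_y)              # the only linear solve
    return (np.eye(n) + h * theta * dF_dy(x, K)).T @ lam

rng = np.random.default_rng(1)
K = rng.standard_normal((4, 4))
K /= np.linalg.norm(K, 2)
x = rng.standard_normal(4)
h, theta = 0.1, 0.5

loss = lambda v: 0.5 * np.sum(solve_layer(v, K, h, theta) ** 2)
y = solve_layer(x, K, h, theta)
grad_x = backprop_input(x, y, K, h, theta, grad_y=y)   # dLoss/dy = y for this loss

# Central finite differences as an independent check.
eps = 1e-6
fd_grad = np.array([(loss(x + eps * e) - loss(x - eps * e)) / (2 * eps) for e in np.eye(4)])
print(np.max(np.abs(grad_x - fd_grad)))
```

Note that the backward pass never differentiates through the iterations of `solve_layer`; it only needs the converged output `y`, which is the property that `tf.stop_gradient` combined with a custom gradient would exploit in a framework implementation.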
The proposed residual architecture can be easily implemented using any existing deep learning framework such as PyTorch or TensorFlow. The code snippet in Listing 1 gives an example of such an implementation in TensorFlow. First, we use tf.stop_gradient to avoid backpropagating through the nonlinear solver, and then we compose the output of the layer with the custom_backprop function, which is an identity map decorated with tf.custom_gradient. This decorator is responsible for the linear solve in the backpropagation formulas above, while the remaining operations are handled by the automatic differentiation algorithm supplied with the framework.
Let $L$ be the depth of the network and denote by $M$ the memory complexity of the standard ResNet layer. Then the memory effort of the proposed architecture is $O(LM)$. In fact, it can often be reduced due to the improved stability and hence the potentially smaller required depth. Moreover, the memory complexity can be made $O(M)$, i.e., independent of the depth, when using reversible methods such as the trapezoidal scheme in (7) with $\theta = 1/2$. On the other hand, the computational cost of the implicit network is necessarily larger when compared to a ResNet of the same depth since additional nonlinear and linear solves are required at each layer. The cost of the linear solver strongly depends on the structure of the linear operator. For general dense matrices it is on the order of $O(n^p)$ for some $2 < p \leq 3$, where $n$ is the dimension of the hidden state. In practice, the dimension of hidden states is often not very large or the linear operator is of special structure. For instance, sparse convolutional operators should not be cast into the matrix form, and the corresponding linear systems can be solved by iterative methods which only require one to know how to apply a particular operator to the given tensor. The cost of the nonlinear solver is more difficult to estimate since the convergence is highly dependent on the initialization. For example, for contractive Lipschitz continuous maps one can show that the error of the fixed point iteration after $k$ steps decays as $O(q^k)$, where $q < 1$ is the Lipschitz constant.
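As an illustration of a matrix-free linear solve, the sketch below solves a system of the form $(I - cJ)^T \lambda = g$ using only products with $J^T$. It uses a simple fixed point (Neumann series) iteration, which is an assumption of this example and converges only when $c\|J\| < 1$; in practice a Krylov method would be a more robust choice. The operator `apply_Jt` is a hypothetical stand-in for, e.g., a transposed convolution.

```python
import numpy as np

def matrix_free_solve(apply_Jt, g, c, tol=1e-12, max_iter=500):
    """Solve (I - c*J)^T lam = g via lam <- g + c * J^T lam, using only operator products."""
    lam = g.copy()
    for k in range(max_iter):
        lam_new = g + c * apply_Jt(lam)
        if np.linalg.norm(lam_new - lam) <= tol:
            return lam_new, k + 1
        lam = lam_new
    return lam, max_iter

rng = np.random.default_rng(2)
n = 50
J = rng.standard_normal((n, n))
J /= np.linalg.norm(J, 2)              # unit spectral norm, so c*||J|| = c < 1
g = rng.standard_normal(n)
c = 0.05                               # plays the role of h*(1 - theta)

# The solver only ever sees the action of J^T on a vector, never the matrix itself.
lam, n_iter = matrix_free_solve(lambda v: J.T @ v, g, c)
lam_dense = np.linalg.solve((np.eye(n) - c * J).T, g)   # dense reference solution
print(n_iter, np.max(np.abs(lam - lam_dense)))
```

The error contracts by a factor of at most $c\|J\|$ per iteration, which mirrors the geometric convergence rate quoted above for the nonlinear fixed point iteration.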
3 Examples
For the first example, we consider the problem of approximating a simple one-dimensional function given by
For this purpose, we use the following ordinary differential equation
with a skew symmetric coefficient matrix
Note that a skew-symmetric matrix has a purely imaginary spectrum, which guarantees stability of the continuous dynamics.
We compare the behaviour of two networks derived from (7), namely the standard ResNet (the fully explicit scheme) and the new implicit trapezoidal network (the scheme with equal weights on the explicit and implicit terms). We used the training dataset of randomly sampled points and the standard loss function, and trained the networks using a batch gradient descent optimizer on the batches of size . The validation dataset was evaluated on points. Both networks were initialized with the Glorot uniform initializer and we used the weight regularization of the form
where $L$ is the number of layers. We set for the ResNet and for the trapezoidal network with the correspondingly adjusted values of the hyperparameter $h$ so that both networks approximate the same ODE. We chose these values using the stability argument, which is why ResNet needs more layers.
Figures 3-4 show the convergence of the loss on the training and validation datasets for several independent simulations. With nodes at each hidden layer, ResNet has parameters while the implicit scheme needs only parameters to guarantee the same accuracy of approximation. These results illustrate the superior forward and backward stability of the implicit architecture in comparison to ResNet.
For the second example, we consider another small test problem from Haber_2017 (). The dataset is illustrated in Figure 6. It consists of points organized in two differently labeled spirals. Every other point was removed to be used as the validation dataset. We used the same network architecture as in the previous example but with hidden nodes at each of the hidden layers and activation instead of . The final classification layer has sigmoid activation. Figures 5 and 6 illustrate the convergence of the networks and the classification results. One can see that the proposed implicit scheme is more accurate and robust than the classical ResNet.
Acknowledgments
This material is based upon work supported in part by: the U.S. Department of Energy, Office of Science, Early Career Research Program under award number ERKJ314; U.S. Department of Energy, Office of Advanced Scientific Computing Research under award numbers ERKJ331 and ERKJ345; the National Science Foundation, Division of Mathematical Sciences, Computational Mathematics program under contract number DMS1620280; and by the Laboratory Directed Research and Development program at the Oak Ridge National Laboratory, which is operated by UT-Battelle, LLC., for the U.S. Department of Energy under contract DE-AC05-00OR22725.
References
-  Yoshua Bengio, Patrice Simard, and Paolo Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157–166, 1994.
-  Bo Chang, Lili Meng, Eldad Haber, Lars Ruthotto, David Begert, and Elliot Holtham. Reversible architectures for arbitrarily deep residual neural networks. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
-  Tian Qi Chen, Yulia Rubanova, Jesse Bettencourt, and David K Duvenaud. Neural ordinary differential equations. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 6571–6583. Curran Associates, Inc., 2018.
-  Weinan E. A proposal on machine learning via dynamical systems. Communications in Mathematics and Statistics, 5(1):1–11, 2017.
-  Weinan E, Jiequn Han, and Qianxiao Li. A mean-field optimal control formulation of deep learning. Research in the Mathematical Sciences, 6(1):10, 2018.
-  Aidan N Gomez, Mengye Ren, Raquel Urtasun, and Roger B Grosse. The reversible residual network: Backpropagation without storing activations. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 2214–2224. Curran Associates, Inc., 2017.
-  Eldad Haber, Keegan Lensink, Eran Treister, and Lars Ruthotto. IMEXnet: A forward stable deep neural network. arXiv e-prints, page arXiv:1903.02639, 2019.
-  Eldad Haber and Lars Ruthotto. Stable architectures for deep neural networks. Inverse Problems, 34(1):014004, 2017.
-  Ernst Hairer, Christian Lubich, and Gerhard Wanner. Geometric Numerical Integration: Structure-Preserving Algorithms for Ordinary Differential Equations, volume 31 of Springer Series in Computational Mathematics. Springer-Verlag Berlin Heidelberg, 2006.
-  Ernst Hairer, Syvert P. Nørsett, and Gerhard Wanner. Solving Ordinary Differential Equations I, Nonstiff Problems, volume 8 of Springer Series in Computational Mathematics. Springer-Verlag Berlin Heidelberg, 1993.
-  Boris Hanin. Universal function approximation by deep neural nets with bounded width and ReLU activations. arXiv preprint arXiv:1708.02691, 2017.
-  Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In Bastian Leibe, Jiri Matas, Nicu Sebe, and Max Welling, editors, Computer Vision – ECCV 2016, pages 630–645, Cham, 2016. Springer International Publishing.
-  Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. Densely connected convolutional networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
-  Yann LeCun. A theoretical framework for back-propagation. In D. Touretzky, G. Hinton, and T. Sejnowski, editors, Proceedings of the 1988 Connectionist Models Summer School, pages 21–28, CMU, Pittsburgh, Pa, 1988. Morgan Kaufmann.
-  Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521:436–444, 2015.
-  Zhou Lu, Hongming Pu, Feicheng Wang, Zhiqiang Hu, and Liwei Wang. The expressive power of neural networks: A view from the width. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 6231–6239. Curran Associates, Inc., 2017.
-  Lars Ruthotto and Eldad Haber. Deep Neural Networks Motivated by Partial Differential Equations. arXiv e-prints, page arXiv:1804.04272, 2018.
-  Sho Sonoda and Noboru Murata. Double continuum limit of deep neural networks. In ICML Workshop on Principled Approaches to Deep Learning, 2017.
-  Gilbert Strang. Computational science and engineering, volume 791. Wellesley-Cambridge Press Wellesley, 2007.
-  Sasha Targ, Diogo Almeida, and Kevin Lyman. Resnet in resnet: Generalizing residual architectures. arXiv preprint arXiv:1603.08029, 2016.