Step Size Matters in Deep Learning
Training a neural network with the gradient descent algorithm gives rise to a discrete-time nonlinear dynamical system. Consequently, behaviors that are typically observed in these systems emerge during training, such as convergence to an orbit but not to a fixed point, or dependence of convergence on the initialization. The step size of the algorithm plays a critical role in these behaviors: it determines the subset of the local optima that the algorithm can converge to, and it specifies the magnitude of the oscillations if the algorithm converges to an orbit. To elucidate the effects of the step size on the training of neural networks, we study the gradient descent algorithm as a discrete-time dynamical system, and by analyzing the Lyapunov stability of different solutions, we show the relationship between the step size of the algorithm and the solutions that can be obtained with this algorithm. The results provide an explanation for several phenomena observed in practice, including the deterioration in the training error with increased depth, the hardness of estimating linear mappings with large singular values, and the distinct performance of deep residual networks.
Kamil Nar and S. Shankar Sastry, Electrical Engineering and Computer Sciences, University of California, Berkeley
Preprint. Work in progress.
1 Introduction
The depth of a neural network determines the size of the class of functions that it can represent. As the depth is increased, this class of functions expands provided that the new layers are able to express the identity mapping. Therefore, the minimum training error that can be achieved by a network diminishes as its depth is increased. However, the training error of most neural networks degrades in practice once the number of layers exceeds a certain value; and deeper networks start to perform worse than their shallower counterparts, as shown in Figure 1 (He et al., 2016). This deterioration in the training error with increased depth indicates a problem with the method used for training the neural network; namely, a problem with the convergence of the gradient descent algorithm.
When gradient descent is used to minimize a function, say $f$, it leads to a discrete-time dynamical system:
$$x[k+1] = x[k] - \delta \nabla f(x[k]), \tag{1}$$
where $x[k]$ is the state of the system, which consists of the parameters updated by the algorithm, and $\delta$ is the step size of the algorithm. Every fixed point of the system (1) is called an equilibrium of the system, and the equilibria correspond to the critical points of the function $f$.
Unless $f$ is a quadratic function of the parameters, the system described by (1) is either a nonlinear system or a hybrid system that switches from one dynamics to another over time. Consequently, the system (1) can exhibit behaviors that are typically observed in nonlinear and hybrid systems, such as convergence to an orbit but not to a fixed point, or dependence of the convergence on the initialization. The step size of the gradient descent algorithm has a critical effect on these behaviors, as shown in the following two examples.
Example 1. Convergence to a periodic orbit: Consider the continuously differentiable and convex function $f_1(x) = \tfrac{2}{3}|x|^{3/2}$, which has a unique local minimum at the origin. The gradient descent algorithm on this function yields
$$x[k+1] = \begin{cases} x[k] - \delta\sqrt{x[k]}, & x[k] \ge 0, \\ x[k] + \delta\sqrt{-x[k]}, & x[k] < 0. \end{cases}$$
As expected, the origin is the only equilibrium of this system. Interestingly, however, $x[k]$ converges to the origin only when the initial state $x[0]$ belongs to a countable set $S$:
$$S = \{0\} \cup \{-\delta^2, \delta^2\} \cup \left\{-\tfrac{3+\sqrt{5}}{2}\delta^2, \tfrac{3+\sqrt{5}}{2}\delta^2\right\} \cup \cdots$$
For all other initializations, $x[k]$ converges to an oscillation between $-\delta^2/4$ and $\delta^2/4$. This implies that, if the initial state $x[0]$ is randomly drawn from a continuous distribution, then almost surely, $x[k]$ does not converge to the origin, yet $|x[k]|$ converges to $\delta^2/4$. In other words, with probability 1, the state $x[k]$ does not converge to a fixed point, such as a local optimum or a saddle point, even though the estimation error converges to a finite non-optimal value.
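As a numerical illustration of Example 1, the following sketch iterates gradient descent on $f_1(x) = \tfrac{2}{3}|x|^{3/2}$, a continuously differentiable convex function with a unique minimum at the origin. The choice of this function, the helper names `grad_f1` and `run_gd`, the step size 0.1, and the initial state 0.7 are assumptions made for this illustration only.

```python
import math

def grad_f1(x):
    """Gradient of f1(x) = (2/3)|x|^(3/2), which equals sign(x) * sqrt(|x|)."""
    return math.copysign(math.sqrt(abs(x)), x)

def run_gd(x0, delta, steps):
    """Iterate x[k+1] = x[k] - delta * grad_f1(x[k]) and return the final state."""
    x = x0
    for _ in range(steps):
        x = x - delta * grad_f1(x)
    return x

delta = 0.1                      # assumed step size
x_final = run_gd(x0=0.7, delta=delta, steps=2000)
x_next = x_final - delta * grad_f1(x_final)   # one more step, to expose the orbit
amplitude = delta ** 2 / 4       # radius of the period-2 orbit for this f1

print(abs(x_final), abs(x_next), amplitude)
```

In this sketch the sign of the state flips at every step while its magnitude settles at $\delta^2/4$, so the cost stays at $\delta^3/12$ instead of reaching zero. Halving the step size would shrink the orbit radius by a factor of four.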
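Example 2. Dependence of convergence on the initialization: Consider the function $f_2(x) = x^L$, where $L$ is an even number larger than 2. The gradient descent results in the system
$$x[k+1] = x[k] - \delta L\, x[k]^{L-1}.$$
The state $x[k]$ converges to the origin if the initial state satisfies $|x[0]|^{L-2} < 2/(L\delta)$ and diverges if $|x[0]|^{L-2} > 2/(L\delta)$.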
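The dependence on the initialization in Example 2 can be checked with a short simulation. The sketch below assumes the quartic $f(x) = x^4$ as one concrete even-power cost, for which the update is $x[k+1] = x[k](1 - 4\delta x[k]^2)$ and the boundary between convergence and divergence is $|x[0]| = 1/\sqrt{2\delta}$. The function name `run_gd_quartic`, the step size, and the divergence cutoff of $10^6$ are illustrative assumptions.

```python
def run_gd_quartic(x0, delta, steps):
    """Gradient descent on f(x) = x**4; returns inf once |x| exceeds 1e6."""
    x = x0
    for _ in range(steps):
        x = x - delta * 4.0 * x ** 3
        if abs(x) > 1e6:          # treat this as divergence and stop early
            return float("inf")
    return x

delta = 0.1                                   # assumed step size
threshold = (2.0 / (4.0 * delta)) ** 0.5      # |x[0]| = 1 / sqrt(2 * delta)

inside = run_gd_quartic(0.99 * threshold, delta, 10_000)
outside = run_gd_quartic(1.01 * threshold, delta, 10_000)

print(threshold, inside, outside)
```

Starting just inside the threshold, the iterates shrink toward the origin (slowly, because the curvature vanishes there). Starting just outside, the contraction factor has magnitude larger than 1 and grows with $|x|$, so the iterates blow up.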
These two examples demonstrate:
- the convergence of the training error does not imply the convergence of the gradient descent algorithm to a local optimum or a saddle point,
- the step size determines the magnitude of the oscillations if the algorithm converges to an orbit but not to a fixed point,
- the step size influences the convergence of the algorithm differently for each initialization.
Note that these are consequences of the nonlinear dynamics of the algorithm and not of the (non)convexity of the function to be minimized. While both of the functions used in the examples are convex, identical behaviors are observed during the minimization of the nonconvex cost functions of neural networks as well. In fact, only these behaviors can provide a satisfactory explanation for a phenomenon observed in Figure 1: right after the step size of the algorithm is decreased, the training error plummets. It cannot be the case that the parameters are slowly converging to an equilibrium right before the step size is changed, nor can they be stuck near a local optimum or a saddle point, because in either case, decreasing the step size would have slowed down the convergence even further. These sharp falls can only be explained by the fact that the initial step size is too large for some regions in the parameter space, and the parameters are oscillating around a local optimum right before the step size is changed. Once the step size is decreased, the radius of the oscillations around the equilibrium point diminishes, the distance to the equilibrium point in the parameter space falls sharply, and consequently, so does the training error.
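The sharp fall after a step size reduction can be reproduced on a toy problem. The sketch below is an illustration under assumed values, not a reproduction of Figure 1: it runs gradient descent on the convex function $f(x) = \tfrac{2}{3}|x|^{3/2}$ with step size 0.4 until the state settles on an orbit, then continues with step size 0.1. The function, the helper names, and both step sizes are assumptions.

```python
import math

def f1(x):
    """Toy cost f(x) = (2/3)|x|^(3/2)."""
    return (2.0 / 3.0) * abs(x) ** 1.5

def gd_step(x, delta):
    """One gradient descent step; the gradient is sign(x) * sqrt(|x|)."""
    return x - delta * math.copysign(math.sqrt(abs(x)), x)

def run(x, delta, steps):
    for _ in range(steps):
        x = gd_step(x, delta)
    return x

# Phase 1: large step size, the state ends up oscillating around the minimum.
x_large = run(0.9, 0.4, 200)
error_before = f1(x_large)

# Phase 2: smaller step size, the oscillation radius shrinks immediately.
x_small = run(x_large, 0.1, 200)
error_after = f1(x_small)

print(error_before, error_after)
```

For this function the cost on the orbit is $\delta^3/12$, so reducing the step size by a factor of 4 lowers the plateau by a factor of 64, even though the state never converges to the minimum in either phase.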
While training a deep neural network, the dynamical system created by the gradient descent will usually have multiple equilibria, which coincide with the critical points of the training cost function. Convergence to these equilibria is in general affected unequally by the step size. For example, for a given step size, the algorithm might be able to converge to a subset of the local optima but not to the others independent of the initializations. Therefore, the step size also plays a critical role in understanding why some solutions are more likely to be obtained instead of the others when the gradient descent algorithm is used.
1.1 Our contributions
In this paper, we study the gradient descent algorithm as a discrete-time dynamical system during the training of deep neural networks, and we show the relationship between the step size of the algorithm and the solutions that can be obtained with this algorithm. In particular, we achieve the following:
- We highlight that one of the reasons the training error stops decreasing during training is that the step size of the algorithm becomes larger than it should be for certain regions in the parameter space, so that the parameters keep oscillating around a local optimum rather than slowly converging to it or being stuck near a saddle point. In particular, for every fixed step size, there is a significant possibility that the algorithm converges to an orbit instead of a critical point of the training cost function.
- We analyze the Lyapunov stability of the gradient descent algorithm on deep linear networks and find different upper bounds on the step size to enable convergence to each solution. We show that for every step size, the algorithm can converge to only a subset of the local optima, and there are always some local optima that the algorithm cannot converge to independent of the initialization.
- We show that symmetric positive definite matrices can be estimated with a deep linear network by initializing the weight matrices as the identity, and this initialization allows the use of the largest step size. Conversely, the gradient descent is most likely to converge for an arbitrarily chosen step size if the weight matrices are initialized as the identity.
- We show that symmetric matrices with negative eigenvalues cannot be estimated with the identity initialization, and the gradient descent converges to the closest positive semidefinite matrix in the Frobenius norm.
- For 2-layer neural networks with ReLU activations, we obtain an explicit relationship between the step size of the gradient descent and the output of the solution that the algorithm can converge to.
1.2 Related work
It is a well-known problem that the gradient of the training cost function can become disproportionate for different parameters when training a neural network. Several works in the literature tried to address this problem. For example, changing the geometry of optimization was proposed in (Neyshabur et al., 2017) and a regularized descent algorithm was proposed to prevent the gradients from exploding and vanishing during training.
Deep residual networks, which are a specific class of neural networks, yielded exceptional results in practice with their peculiar structure (He et al., 2016). By keeping each layer of the network close to the identity function, these networks were able to attain lower training and test errors as the depth of the network was increased. To explain their distinct behavior, the training cost function of their linear versions was shown to possess some crucial properties (Hardt & Ma, 2016). Later, equivalent results were also derived for nonlinear residual networks under some conditions (Bartlett et al., 2018a).
The effect of the step size on training neural networks was empirically investigated in (Daniel et al., 2016). A step size adaptation scheme was proposed in (Rolinek & Martius, 2018) for the stochastic gradient method and shown to outperform the training with a constant step size. Similarly, some heuristic methods with variable step size were introduced and tested empirically in (Magoulas et al., 1997) and (Jacobs, 1988).
Two-layer linear networks were first studied in (Baldi & Hornik, 1989). The analysis was extended to deep linear networks in (Kawaguchi, 2016), and it was shown that all local optima of these networks were also the global optima. It was discovered in (Hardt & Ma, 2016) that the only critical points of these networks were actually the global optima as long as all layers remained close to the identity function during training. The dynamics of training these networks were also analyzed in (Saxe et al., 2013) and (Gunasekar et al., 2017) by assuming an infinitesimal step size and using a continuous-time approximation to the dynamics.
Lyapunov analysis from the dynamical system theory (Khalil, 2002; Sastry, 1999), which is the main tool for our results in this work, was used in the past to understand and improve the training of neural networks – especially that of the recurrent neural networks (Michel et al., 1988; Matsuoka, 1992; Barabanov & Prokhorov, 2002). State-of-the-art feedforward networks, however, have not been analyzed from this perspective.
We summarize the major differences between our contributions and the previous works as follows:
- We relate the vanishing and exploding gradients that arise during the training of feedforward networks to the Lyapunov stability of the gradient descent algorithm.
- Unlike the continuous-time analyses given in (Saxe et al., 2013) and (Gunasekar et al., 2017), we study the discrete-time dynamics of the gradient descent with an emphasis on the step size. By doing so, we obtain upper bounds on the step size to be used, and we show that the step size restricts the set of local optima that the algorithm can converge to. Note that these results cannot be obtained with a continuous-time approximation.
- For deep linear networks with residual structure, (Hardt & Ma, 2016) shows that the gradient of the cost function cannot vanish away from a global optimum. This is not enough, however, to suggest the fast convergence of the algorithm. Given a fixed step size, the algorithm may also converge to an oscillation around a local optimum, as in the case of Example 1. We rule out this possibility and provide a step size so that the algorithm converges to a global optimum with a linear rate.
- We recently found out that the convergence of the gradient descent algorithm was also studied in (Bartlett et al., 2018b) for symmetric positive definite matrices independently of and concurrently with our preliminary work (Nar & Sastry, 2018). However, unlike (Bartlett et al., 2018b), we explicitly give a step size value for the algorithm to converge with a linear rate, and we emphasize the fact that the identity initialization allows convergence with the largest step size.
2 Upper bounds on the step size for training deep linear networks
Deep linear networks are a special class of neural networks that do not contain nonlinear activations. They represent a linear mapping and can be described by a multiplication of a set of matrices:
$$x \mapsto W_L W_{L-1} \cdots W_1 x,$$
where $W_j \in \mathbb{R}^{n_j \times n_{j-1}}$ for each $j \in [L] := \{1, 2, \dots, L\}$. Even though the functions they represent are linear, their training cost is not a quadratic function of the parameters, and therefore, the dynamics of the gradient descent is always nonlinear during the training of these networks. For this reason, they provide a simple model to study some of the nonlinear behaviors observed during training.
The training cost functions of deep linear networks always have multiple optima. In fact, given a cost function $\ell(W_L \cdots W_1)$, if the point $(\hat W_1, \dots, \hat W_L)$ is a local minimum, then $(\alpha_1 \hat W_1, \dots, \alpha_L \hat W_L)$ is also a local minimum for every set of scalars $\{\alpha_j\}_{j \in [L]}$ satisfying $\alpha_1 \alpha_2 \cdots \alpha_L = 1$. Consequently, independent of the function $\ell$, none of the local optima is isolated in the parameter space, and the cost function is not strongly convex at any point.
Although multiple local optima attain the same training cost for deep linear networks, the dynamics of the gradient descent algorithm exhibits distinct behaviors around these points. In particular, the step size required to render each of these local optima stable in the sense of Lyapunov is very different. Since the Lyapunov stability of a point is a necessary condition for the convergence of the algorithm to that point, the step size that allows convergence to each solution is also different, which is formalized in Theorem 1.
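Theorem 1. Given a nonzero matrix $R \in \mathbb{R}^{m \times n}$ and a set of points $\{x_i\}_{i \in [N]}$ in $\mathbb{R}^n$ that satisfy $\frac{1}{N}\sum_{i=1}^N x_i x_i^\top = I$, assume that $R$ is estimated as a multiplication of the matrices $\{W_j\}_{j \in [L]}$ by minimizing the squared error loss
$$\frac{1}{2N}\sum_{i=1}^N \left\| R x_i - W_L W_{L-1} \cdots W_1 x_i \right\|_2^2, \tag{2}$$
where $W_j \in \mathbb{R}^{n_j \times n_{j-1}}$ for all $j \in [L]$, with $n_0 = n$ and $n_L = m$. Then the gradient descent algorithm can converge to a solution $\{\hat W_j\}_{j \in [L]}$ only if the step size $\delta$ satisfies
$$\delta \le \frac{2}{\sum_{j=1}^L p_{j-1}^2\, q_{j+1}^2}, \tag{3}$$
where $p_{j-1} = \|\hat W_{j-1} \cdots \hat W_1 v\|$, $q_{j+1} = \|u^\top \hat W_L \cdots \hat W_{j+1}\|$, and $u$ and $v$ are the left and right singular vectors of $\hat R = \hat W_L \cdots \hat W_1$ corresponding to its largest singular value.
Considering all the solutions $\{\hat W_j\}_{j \in [L]}$ that satisfy $\hat W_L \cdots \hat W_1 = R$, the bound in (3) can be arbitrarily small for some of the local optima. Therefore, given a fixed step size $\delta$, the gradient descent can converge to only a subset of the local optima, and there are always some solutions that the gradient descent cannot converge to independent of the initialization.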
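To make the dependence of the bound on the particular solution concrete, the sketch below evaluates a step size bound of the form $\delta \le 2 / \sum_{j=1}^{L} p_{j-1}^2 q_{j+1}^2$, with $p_{j-1} = \|W_{j-1}\cdots W_1 v\|$ and $q_{j+1} = \|u^\top W_L \cdots W_{j+1}\|$, for two factorizations of the same matrix. This form of the bound is taken as an assumption of the sketch; the function name `step_size_bound` and the example matrices are illustrative.

```python
import numpy as np

def step_size_bound(weights):
    """weights = [W_1, ..., W_L]; returns 2 / sum_j p_{j-1}^2 * q_{j+1}^2."""
    product = weights[0]
    for W in weights[1:]:
        product = W @ product                  # W_L ... W_1
    U, _, Vt = np.linalg.svd(product)
    u, v = U[:, 0], Vt[0, :]                   # top left / right singular vectors
    total = 0.0
    for j in range(len(weights)):              # 0-based j stands for layer j + 1
        right = v
        for W in weights[:j]:                  # W_{j} ... W_1 v in 1-based terms: layers below
            right = W @ right
        left = u
        for W in reversed(weights[j + 1:]):    # u^T W_L ... (layers above)
            left = left @ W
        total += float(right @ right) * float(left @ left)
    return 2.0 / total

identity = np.eye(2)
balanced = [identity, identity]                # W_2 W_1 = I
rescaled = [4.0 * identity, identity / 4.0]    # same product, unbalanced factors

bound_balanced = step_size_bound(balanced)
bound_rescaled = step_size_bound(rescaled)
print(bound_balanced, bound_rescaled)
```

Both factorizations represent the identity map and attain the same cost, yet the admissible step size drops from 1 to about 0.12 after rescaling the layers by 4 and 1/4. Larger rescalings push the bound toward zero, which is the sense in which some solutions are unreachable for any fixed step size.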
Remark 1. Theorem 1 provides a necessary condition for convergence to a specific solution. It does not state that, given a step size $\delta$, the algorithm converges to a solution which satisfies (3). It only rules out the possibility of converging to a large subset of the local optima. It might be the case, for example, that the algorithm converges to an oscillation around a local optimum which violates (3) even though there are some other local optima which satisfy (3).
As a necessary condition for the convergence to a global optimum, we can also find an upper bound on the step size independent of the weight matrices of the solution, which is given next.
Corollary 1. For the minimization problem in Theorem 1, the gradient descent with step size $\delta$ cannot converge to a global optimum unless the step size satisfies
$$\delta \le \frac{2}{L\,\rho(R)^{2(L-1)/L}}, \tag{4}$$
where $\rho(R)$ is the largest singular value of $R$.
Remark 2. Corollary 1 shows that, unlike the optimization of the ordinary least squares problem, the step size required for the convergence of the algorithm depends on the parameter to be estimated, $R$. Consequently, estimating linear mappings with larger singular values requires the use of a smaller step size. Moreover, the step size used during training gives information about the solution obtained if the algorithm converges, which is stated next.
Corollary 2. Assume that the gradient descent with step size $\delta$ has converged to a local optimum $\{\hat W_j\}_{j \in [L]}$ for the minimization problem in Theorem 1. Then the largest singular value of $\hat R = \hat W_L \cdots \hat W_1$ satisfies
$$\rho(\hat R) \le \left(\frac{2}{L\delta}\right)^{L/(2L-2)}.$$
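The scalar case gives a quick sanity check of how the target magnitude limits the step size. The sketch below assumes a bound of the form $\delta \le 2 / (L\,\rho^{2(L-1)/L})$ and its inverse $\rho \le (2/(L\delta))^{L/(2L-2)}$, and simulates an $L$-layer scalar network whose weights are all equal, so that every weight follows $w \leftarrow w - \delta\, w^{L-1}(w^L - r)$. The function names, the target $r = 4$, and the two step sizes are assumptions chosen for illustration.

```python
def max_step_size(rho, num_layers):
    """Assumed form of the bound: 2 / (L * rho^(2(L-1)/L))."""
    return 2.0 / (num_layers * rho ** (2.0 * (num_layers - 1) / num_layers))

def max_singular_value(delta, num_layers):
    """Inverse relation: (2 / (L * delta))^(L / (2L - 2))."""
    return (2.0 / (num_layers * delta)) ** (num_layers / (2.0 * (num_layers - 1)))

def run_balanced_scalar(w0, target, delta, num_layers, steps):
    """Gradient descent on 0.5 * (w^L - target)^2 with all L weights equal to w."""
    w = w0
    for _ in range(steps):
        w = w - delta * w ** (num_layers - 1) * (w ** num_layers - target)
    return w

target, num_layers = 4.0, 2
bound = max_step_size(target, num_layers)          # equals 0.25 here

# Slightly below the bound: the solution w = 2 attracts nearby states.
w_stable = run_balanced_scalar(2.1, target, 0.2, num_layers, 2000)

# Slightly above the bound: even a start at 2 + 1e-6 is pushed away.
w_unstable = run_balanced_scalar(2.000001, target, 0.275, num_layers, 2000)

print(bound, w_stable, w_unstable)
```

With the smaller step size the weight settles at $\sqrt{4} = 2$. With the larger one the linearization at the solution has slope $1 - \delta L r^{2(L-1)/L} = -1.2$, so the state leaves every neighborhood of the solution and ends up oscillating around it instead.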
3 Identity initialization allows the largest step size for estimating symmetric positive definite matrices
Corollary 1 provides only a necessary condition for the convergence of the algorithm, and the bound (4) is not tight for every estimation problem. However, if the matrix to be estimated is symmetric and positive definite, the algorithm can converge to a solution with step sizes close to (4), which requires a specific initialization of the weight parameters.
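Theorem 2. Assume that $R \in \mathbb{R}^{n \times n}$ is a symmetric positive semidefinite matrix, and given a set of points $\{x_i\}_{i \in [N]}$ which satisfy $\frac{1}{N}\sum_{i=1}^N x_i x_i^\top = I$, the matrix $R$ is estimated as a multiplication of the square matrices $\{W_j\}_{j \in [L]}$ by minimizing
$$\frac{1}{2N}\sum_{i=1}^N \left\| R x_i - W_L \cdots W_1 x_i \right\|_2^2.$$
If the weight parameters are initialized as $W_j[0] = I$ for all $j \in [L]$ and the step size satisfies
$$\delta \le \min\left\{ \frac{1}{L},\ \frac{1}{L\,\rho(R)^{2(L-1)/L}} \right\},$$
then each $W_j$ converges to $R^{1/L}$ with a linear rate.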
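A small experiment illustrates the identity initialization. The sketch below minimizes $\tfrac{1}{2}\|W_L \cdots W_1 - R\|_F^2$, which is what the squared error loss reduces to when the inputs satisfy $\frac{1}{N}\sum_i x_i x_i^\top = I$, starting from $W_j = I$. The step size rule $\min\{1/L,\ 1/(L\rho(R)^{2(L-1)/L})\}$ is used as an assumed form of the condition; the matrix $R$, the depth, the iteration count, and the helper names are illustrative choices.

```python
import numpy as np

def chain(mats, size):
    """Return M_k ... M_1 for mats = [M_1, ..., M_k] (identity if empty)."""
    out = np.eye(size)
    for M in mats:
        out = M @ out
    return out

def train_identity_init(R, num_layers, delta, steps):
    """Gradient descent on 0.5 * ||W_L ... W_1 - R||_F^2 from W_j = I."""
    n = R.shape[0]
    weights = [np.eye(n) for _ in range(num_layers)]
    for _ in range(steps):
        error = chain(weights, n) - R
        grads = []
        for j in range(num_layers):
            below = chain(weights[:j], n)        # product of the layers below
            above = chain(weights[j + 1:], n)    # product of the layers above
            grads.append(above.T @ error @ below.T)
        weights = [W - delta * G for W, G in zip(weights, grads)]
    return weights

R = np.array([[2.0, 1.0], [1.0, 2.0]])           # symmetric, eigenvalues 1 and 3
num_layers = 3
rho = np.linalg.norm(R, 2)                       # largest singular value
delta = min(1.0 / num_layers,
            1.0 / (num_layers * rho ** (2.0 * (num_layers - 1) / num_layers)))

weights = train_identity_init(R, num_layers, delta, steps=500)

eigvals, eigvecs = np.linalg.eigh(R)
root = eigvecs @ np.diag(eigvals ** (1.0 / num_layers)) @ eigvecs.T   # R^(1/L)
print(np.max(np.abs(weights[0] - root)))
```

In this run every layer ends up at the symmetric $L$-th root of $R$, so the product reproduces $R$ and the layers stay balanced throughout training.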
Remark 3. Theorem 2 shows that the algorithm converges to a global optimum despite the nonconvexity of the optimization, and it provides a case where the bound (4) is tight. The tightness of the bound implies that for the same step size, most of the other global optima are unstable in the sense of Lyapunov, and therefore, the algorithm cannot converge to them independent of the initialization. Consequently, using identity initialization allows convergence to a solution which is most likely to be stable in the sense of Lyapunov for an arbitrarily chosen step size.
Remark 4. The fact that the identity initialization allows convergence to a global optimum even for large step sizes is remarkable. Given that the identity initialization on deep linear networks is equivalent to the zero initialization of linear residual networks (Hardt & Ma, 2016), Theorem 2 provides an explanation for the exceptional performance of residual networks as well (He et al., 2016).
When the matrix to be estimated is symmetric but not positive semidefinite, the bound (4) is still tight for some of the global optima. In this case, however, the eigenvalues of the estimate cannot attain negative values if the weight matrices are initialized with the identity.
Let in Theorem 2 be a symmetric matrix such that the minimum eigenvalue of , , is negative. If the weight parameters are initialized as for all and the step size satisfies
then the estimate converges to the closest positive semidefinite matrix to in Frobenius norm.
4 Effect of step size on training two-layer networks with ReLU activations
In Section 2, we analyzed the relationship between the step size of the gradient descent algorithm and the solutions that can be obtained by training deep linear networks. A similar relationship exists for nonlinear networks as well. The following theorem, for example, provides an upper bound on the step size for the convergence of the algorithm when the network has two layers and ReLU activations.
Given a set of points in , let a function be estimated by a two-layer neural network with ReLU activations by minimizing the squared error loss:
where is the ReLU function, is the fixed bias vector, and the optimization is only over the weight parameters and . If the gradient descent algorithm with step size converges to a solution , then the estimate satisfies
Theorem 4 shows that if the algorithm is able to converge with a large step size, then the estimate must have a small magnitude for large values of .
Similar to Corollary 1, the bound given by Theorem 4 is not necessarily tight. Nevertheless, it highlights the effect of the step size on the convergence of the algorithm. To demonstrate that small changes in the step size could lead to significantly different solutions, we generated a piecewise continuous function and estimated it with a two-layer network by minimizing
with two different step sizes , where , , and for all . The initial values of and the constant vector were all drawn from independent standard normal distributions; and the vector was kept the same for both of the step sizes used. As shown in Figure 2, training with converged to a fixed solution, which provided an estimate close to the original function . In contrast, training with converged to an oscillation and not to a fixed point. That is, after sufficient training, the estimate kept switching between and at each iteration of the gradient descent.
5 Discussion
When gradient descent is used to minimize a function, typically only three possibilities are considered: convergence to a local optimum, to a global optimum, or to a saddle point. In this work, we considered the fourth possibility: the gradient descent may not converge at all – even in the deterministic setting. The training error may not reflect the oscillations in the dynamics, or when a stochastic optimization method is used, the oscillations in the training error might be wrongly attributed to the stochasticity of the algorithm. We underlined that, if the training error of an algorithm converges to a non-optimal value, that does not imply the algorithm is stuck near a bad local optimum or a saddle point; it might simply be the case that the algorithm has not converged at all.
We showed that the step size of the gradient descent influences the dynamics of the algorithm substantially. It renders some of the local optima unstable in the sense of Lyapunov, and the algorithm cannot converge to these points independent of the initialization. It also determines the magnitude of the oscillations if the algorithm converges to an orbit around an equilibrium point in the parameter space.
In Corollary 2 and Theorem 4, we showed that the step size required for convergence to a specific solution depends on the solution itself. This reveals that some solutions, such as linear functions with large singular values, are harder to converge to. Given that there exists a relationship between the Lipschitz constants of the estimated functions and their generalization error (Bartlett et al., 2017), this result could provide a better understanding of the generalization of deep neural networks.
References
-  P. Baldi and K. Hornik. Neural networks and principal component analysis: Learning from examples without local minima. Neural Networks, Vol. 2, pp. 53–58, 1989.
-  N. E. Barabanov and D. V. Prokhorov. Stability analysis of discrete-time recurrent neural networks. IEEE Transactions on Neural Networks, Vol. 13, No. 2, pp. 292–303, 2002.
-  P. L. Bartlett, S. Evans, and P. Long. Representing smooth functions as compositions of near-identity functions with implications for deep network optimization. arXiv:1804.05012 [cs.LG], 2018a.
-  P. L. Bartlett, D. J. Foster, and M. Telgarsky. Spectrally-normalized margin bounds for neural networks. In Advances in Neural Information Processing Systems, 2017.
-  P. L. Bartlett, D. P. Helmbold, and P. Long. Gradient descent with identity initialization efficiently learns positive definite linear transformations by deep residual networks. arXiv:1802.06093 [cs.LG], 2018b.
-  C. Daniel, J. Taylor, and S. Nowozin. Learning step size controllers for robust neural network training. In AAAI Conference on Artificial Intelligence, 2016.
-  S. Gunasekar, B. E. Woodworth, S. Bhojanapalli, B. Neyshabur, and N. Srebro. Implicit regularization in matrix factorization. In Advances in Neural Information Processing Systems, pp. 6152–6160, 2017.
-  M. Hardt and T. Ma. Identity matters in deep learning. arXiv:1611.04231 [cs.LG], 2016.
-  K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.
-  R. A. Jacobs. Increased rates of convergence through learning rate adaptation. Neural Networks, Vol. 1, pp. 295–307, 1988.
-  K. Kawaguchi. Deep learning without poor local minima. In Advances in Neural Information Processing Systems, pp. 586–594, 2016.
-  H. K. Khalil. Nonlinear Systems, 3rd Edition. Prentice Hall, 2002.
-  G. D. Magoulas, M. N. Vrahatis and G. S. Androulakis. Effective backpropagation training with variable stepsize. Neural Networks, Vol. 10, No. 1, pp. 69–82, 1997.
-  K. Matsuoka. Stability conditions for nonlinear continuous neural networks with asymmetric connection weights. Neural Networks, Vol. 5, No. 3, pp. 495–500, 1992.
-  A. N. Michel, J. A. Farrell, and W. Porod. Stability results for neural networks. In Neural Information Processing Systems, pp. 554–563, 1988.
-  K. Nar and S. Shankar Sastry. Residual networks: Lyapunov stability and convex decomposition. arXiv:1803.08203 [cs.LG], 2018.
-  B. Neyshabur, R. Tomioka, R. Salakhutdinov, and N. Srebro. Geometry of optimization and implicit regularization in deep learning. arXiv:1705.03071 [cs.LG], 2017.
-  S. Sastry. Nonlinear Systems: Analysis, Stability, and Control. Springer: New York, NY, 1999.
-  A. M. Saxe, J. L. McClelland, and S. Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv:1312.6120 [cs.NE], 2013.
-  M. Rolinek and G. Martius. L4: Practical loss-based stepsize adaptation for deep learning. arXiv:1802.05074 [cs.LG], 2018.
Appendix A Proof of Theorem 1 and Corollary 1
Let be symmetric and positive semidefinite. Then, .
We can write as , where for all and if . Then,
Let be a linear map defined as where and are symmetric positive semidefinite matrices for all . Then, for every nonzero and , the largest eigenvalue of satisfies
First, we show that is symmetric and positive semidefinite. Given two matrices , we can write
where the last inequality follows from Lemma 1. This shows that is symmetric and positive semidefinite. Then, for every nonzero , we have
In particular, given two nonzero vectors and ,
Proof of Theorem 1. The cost function (2) in Theorem 1 can be written as
Let denote the error in the estimate, i.e. . The gradient descent yields
By multiplying the update equations of and subtracting , we can obtain the dynamics of as
where denotes the higher order terms, and
Lyapunov’s indirect method of stability (Khalil, 2002; Sastry, 1999) states that given a dynamical system its equilibrium is stable in the sense of Lyapunov only if the linearization of the system around
does not have any eigenvalue larger than 1 in magnitude. By using this fact for the system defined by (5)-(6), we can observe that an equilibrium with is stable in the sense of Lyapunov only if the system
does not have any eigenvalue larger than 1 in magnitude, which requires that the mapping
does not have any real eigenvalue larger than . Let and be the left and right singular vectors of corresponding to its largest singular value, and let and be defined as in the statement of Theorem 1. Then, by Lemma 2, the mapping in (7) does not have an eigenvalue larger than only if
which completes the proof.
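The linearization test used in this proof can be checked numerically in the smallest case, a two-layer scalar network with cost $\tfrac{1}{2}(w_2 w_1 - r)^2$. At a zero-error solution the Hessian is the outer product of $(w_2, w_1)$ with itself, so the gradient descent map has Jacobian $I - \delta\,(w_2, w_1)^\top (w_2, w_1)$, and the solution can be stable only if $\delta \le 2/(w_1^2 + w_2^2)$. The sketch below is a hedged illustration with assumed values ($r = 1$, $\delta = 0.2$) and hypothetical helper names.

```python
import numpy as np

def gd_jacobian(w1, w2, delta):
    """Jacobian of the gradient descent map at a solution with w1 * w2 = r."""
    grad_e = np.array([w2, w1])            # gradient of the error e = w1 * w2 - r
    hessian = np.outer(grad_e, grad_e)     # Hessian of 0.5 * e**2 where e = 0
    return np.eye(2) - delta * hessian

def spectral_radius(matrix):
    """Largest eigenvalue magnitude of a symmetric matrix."""
    return float(np.max(np.abs(np.linalg.eigvalsh(matrix))))

delta = 0.2                                # assumed step size
radius_balanced = spectral_radius(gd_jacobian(1.0, 1.0, delta))
radius_unbalanced = spectral_radius(gd_jacobian(4.0, 0.25, delta))

print(radius_balanced, radius_unbalanced)
```

Both points solve $w_1 w_2 = 1$, but only the balanced one passes the necessary condition: its Jacobian has eigenvalues 1 and 0.6, whereas the unbalanced solution has an eigenvalue of $-2.2125$, so gradient descent with this step size cannot converge to it.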
Appendix B Proof of Theorem 2
Let be estimated as a multiplication of the scalar parameters by minimizing via gradient descent. Assume that for all . If the step size is chosen to be less than or equal to
then for all , where
Due to symmetry, for all for all . Denoting any of them by , we have
To show that converges to , we can write
If there exists some such that
then is always larger or always smaller than , and its distance to decreases by a factor of at each step. Since is a monotonic function in , the condition (9) holds for all if it holds only for and , which gives us and ∎
Proof of Theorem 2. There exists a common invertible matrix that can diagonalize all the matrices in the system created by the gradient descent: , for all . Then the dynamical system turns into independent update rules for the diagonal elements of and . Lemma 3 can be applied to each of the systems involving the diagonal elements. Since in Lemma 3 is monotonically decreasing in , the bound for the maximum eigenvalue of guarantees linear convergence.
Appendix C Proof of Theorem 3
Assume that and is used for all to initialize the gradient descent algorithm to solve
Then, each converges to 0 unless .
We can write the update rule for any weight as
which has one equilibrium at and another at . If and , it can be shown by induction that
for all . As a result, converges to 0. ∎
Proof of Theorem 3. Similar to the proof of Theorem 2, the system created by the gradient descent can be decomposed into independent systems of the diagonal elements of the matrices and . Then, Lemma 3 and Lemma 4 can be applied to the systems with positive and negative eigenvalues of , respectively.
Appendix D Proof of Theorem 4
To find a necessary condition for the convergence of the gradient descent algorithm to , we analyze the local stability of that solution in the sense of Lyapunov. Since the analysis is local and the function is fixed, for each point we can use a matrix that satisfies . Note that is a diagonal matrix and all of its diagonal elements are either 0 or 1. Then, we can write the cost function around an equilibrium as
Denoting the error by , the gradient descent gives
Let denote the vector . Then we can write the update equation of as
Similar to the proof of Theorem 1, the equilibrium can be stable in the sense of Lyapunov only if the system
does not have any eigenvalue larger than 1 in magnitude. Note that the linear system in (10) can be described by a symmetric matrix, whose eigenvalues cannot be larger in magnitude than the eigenvalues of its sub-blocks on the diagonal, in particular those of the system
The eigenvalues of the system are less than 1 in magnitude only if the eigenvalues of the system
are less than . This requires that for all for which ,
As a result, Lyapunov stability of the solution requires