The asymptotic spectrum of the Hessian of DNN throughout training
The dynamics of DNNs during gradient descent is described by the so-called Neural Tangent Kernel (NTK). In this article, we show that the NTK allows one to gain precise insight into the Hessian of the cost of DNNs: we obtain a full characterization of the asymptotics of the spectrum of the Hessian, at initialization and during training.
The advent of deep learning has sparked a lot of interest in the loss surface of deep neural networks (DNN), and in particular its Hessian. However to our knowledge, there is still no theoretical description of the spectrum of the Hessian. Nevertheless a number of phenomena have been observed numerically.
The loss surface of neural networks has been compared to the energy landscape of different physical models [1, 2, 3]. It appears that the loss surface of DNNs may change significantly depending on the width of the network (the number of neurons in the hidden layer), motivating the distinction between the under- and over-parametrized regimes [4, 2, 5].
The non-convexity of the loss function implies the existence of a very large number of saddle points, which could slow down training. In particular, in [6, 7], a relation between the rank of saddle points (the number of negative eigenvalues of the Hessian) and their loss has been observed.
For overparametrized DNNs, a possibly more important phenomenon is the large number of flat directions . The existence of these flat minima is conjectured to be related to the generalization of DNNs and may depend on the training procedure [8, 9, 10].
In  it has been shown, using a functional approach, that in the infinite width-limit, DNNs behave like kernel methods with respect to the so-called Neural Tangent Kernel, which is determined by the architecture of the network. This strengthens the connections between neural networks and kernel methods [16, 17, 18].
This raises the question: can we use these new results to gain insight into the behavior of the Hessian of the loss of DNNs, at least in the small region explored by the parameters during training?
Following ideas introduced in , we consider the training of -layered DNNs in a functional setting. For a functional cost , the Hessian of the loss is the sum of two matrices and . We show the following results for large and for a fixed number of datapoints :
The first matrix is positive semi-definite and its eigenvalues are given by the (weighted) kernel PCA of the dataset with respect to the NTK. The dominating eigenvalues are the principal components of the data followed by a high number of small eigenvalues. The “flat directions” are spanned by the small eigenvalues and the null-space (of dimension at least when there is a single output). Because the NTK is asymptotically constant , these results apply at initialization, during training and at convergence.
The second matrix can be viewed as residual contribution to , since it vanishes as the network converges to a global minimum. We compute the limit of the first moment and characterize its evolution during training, of the second moment which stays constant during training, and show that the higher moments vanish.
Regarding the sum , we show that the matrices and are asymptotically orthogonal to each other at initialization and during training. In particular, the moments of the matrices and add up: .
These results give, for any depth and a fairly general non-linearity, a complete description of the spectrum of the Hessian in terms of the NTK at initialization and throughout training. This gives theoretical confirmation of a number of observations about the Hessian [8, 6, 7, 9, 10, 19, 2], and sheds a new light on them.
1.2 Related works
For least-squares and cross-entropy costs, the first matrix is equal to the Fisher matrix [21, 22], whose moments have been described for shallow networks in . For deep networks, the first two moments and the operator norm of the Fisher matrix for a least squares loss were computed at initialization in  conditionally on a certain independence assumption; our method does not require such assumptions. Note that their approach implicitly uses the NTK.
The second matrix has been studied in [19, 2] for shallow networks, conditionally on a number of assumptions. Note that in the setting of , the matrices and are assumed to be freely independent, which allows them to study the spectrum of the Hessian; in our setting, we show that the two matrices and are asymptotically orthogonal to each other.
We consider deep fully connected artificial neural networks (DNNs) using the setup and NTK parametrization of , taking an arbitrary nonlinearity (i.e. that is 4 times continuously differentiable function with all four derivatives bounded). The layers are numbered from (input) to (output), each containing neurons for . The parameters consist of the weight matrices and bias vectors for . We aggregate the parameters into the vector .
The activations and pre-activations of the layers are defined recursively for an input , setting :
The parameter is added to tune the influence of the bias on training111In our experiments, we take .. All parameters are initialized as iid Gaussians.
We will in particular study the network function, which maps inputs to the activation of the output layer (before the last non-linearity):
In this paper, we will study the limit of various objects as sequentially, i.e. we first take then , etc. This greatly simplifies the proofs, but they could in principle be extended to the simultaneous limit, i.e. when . All our numerical experiments are done with ‘rectangular’ networks (with ) and match closely the predictions for the sequential limit.
In the limit we study in this paper, the NTK is asymptotically fixed, as in [13, 15, 14, 25]. By rescaling the outputs of DNNs as the width increases, one can reach another limit where the NTK is not fixed [12, 26, 11, 27]. The behavior of the Hessian in this other limit may be significantly different.
2.1 Functional viewpoint
The network function lives in a function space and we call the function that maps the parameters to the network function the realization function. We study the differential behavior of :
The derivative is a function-valued vector of dimension . The -th entry represents how modifying the parameter modifies the function in the space .
The Hessian is a function-valued matrix.
The network is trained with respect to the cost functional:
for strictly convex , summing over a finite dataset of size . The parameters are then trained with gradient descent on the composition , which defines the usual loss surface of neural networks.
In this setting, we define the finite realization function mapping parameters to be the restriction of the network function to the training set . The Jacobian is hence an matrix and its Hessian is a tensor. Defining the restricted cost , we have .
For our analysis, we require that the gradient norm does not explode during training. The following condition is sufficient:
A loss has bounded gradients over sublevel sets (BGOSS) if the norm of the gradient is bounded over all sets .
2.2 Neural Tangent Kernel
The behavior during training of the network function in the function space is described by a (multi-dimensional) kernel, the Neural Tangent Kernel (NTK)
During training, the function follows the so-called kernel gradient descent with respect to the NTK, which is defined as
In the infinite-width limit (letting sequentially) and for losses with BGOSS, the NTK converges to a deterministic limit , which is constant during training, uniformly on finite time intervals . In the case of the MSE loss, the uniform convergence of the NTK was proven for in .
The limiting NTK is constructed as follows:
For and a kernel , define the kernel by
for a centered Gaussian vector with covariance matrix . For , we denote by the kernel .
We define the kernels for each layer of the network, starting with and then recursively by , for , where is the network non-linearity.
The limiting NTK is defined in terms of the kernels and the kernels :
2.3 Gram Matrices
For a finite dataset and a fixed depth , we denote by the Gram matrix of with respect to the limiting NTK, defined by
It is block diagonal because different outputs are asymptotically uncorrelated.
Similarly, for any (scalar) kernel (such as the limiting kernels introduced later), we will use the same notation, denoting the Gram matrix of the datapoints by .
3 Main Theorems
3.1 Hessian as
Using the above setup, the Hessian of the loss is the sum of two terms, with the entry given by
For a finite dataset, the Hessian matrix is equal to the sum of two matrices
where is a matrix, is a matrix and is a tensor to which we apply a scalar product (denoted by ) in its last dimension with the vector to obtain a matrix.
Our main contribution is the following theorem, which describes the limiting moments in terms of the moments of and :
For any loss with BGOSS and , in the sequential limit , we have for all
The limits of and can be expressed in terms of the NTK , the kernel and the non-symmetric kernels , defined in Appendix C:
The moments converge to the following limits (with the convention that ):
The first moment converges to the limit:
At initialization form a Gaussian pair of -vectors, independent for differing output indices and with covariance for a (non-symmetric) limiting kernel . During training, both vectors follow the differential equations
The second moment converges to the following limit defined in terms of the Gram matrix :
The higher moments for vanish.
In the case of a MSE loss , the first and second derivatives take simple forms and and the differential equations can be solved to obtain more explicit formulae:
For the MSE loss and , in the limit , we have uniformly over
In expectation we have:
The moments of are constant because is constant. For the moments of , we first solve the differential equation for :
Noting , we have
The expectation of the first moment of then follows. ∎
3.2 Mutual Orthogonality of and
A first key ingredient to prove Theorem 1 is the asymptotic mutual orthogonality of the matrices and
For any loss with BGOSS and , we have uniformly over
As a consequence .
If two matrices and are mutualy orthogonal (i.e. ) the range of is contained in the nullspace of and vice versa. The non-zero eigenvalues of the sum are therefore given by the union of the non-zero eigenvalues of and . Furthermore the moments of and add up: . Proposition 5 shows that this is what happens asymptotically for and .
Note that both matrices and have large nullspaces: indeed assuming a constant width , we have and (see Appendix C), while the number of parameters scales as (when ).
Figure 3 illustrates the mutual orthogonality of and .
3.3 The matrix
The matrix is best understood as a perturbation to , which vanishes as the network converges because . To calculate its moments, we note that
where the vector is the evaluation of the function on the training set.
For the second moment we have
for the Gram matrix of the kernel .
The following proposition desribes the limit of the function and the kernel and the vanishing of the higher moments:
For any loss with BGOSS and , the first two moments of take the form
- At initialization, and converge to a (centered) Gaussian pair with covariances
and during training evolves according to
- Uniformly over any interval , the kernel has a deterministic and fixed limit with limiting kernel:
- The higher moment vanish: .
This result has a number of consequences for infinitely wide networks:
At initialization, the matrix has a finite Frobenius norm , because converges to a fixed limit. As the network converges, the derivative of the cost goes to zero and so does the Frobenius norm of .
In contrast the operator norm of vanishes already at initialization (because for all even , we have ). At initialization, the vanishing of in operator norm but not in Frobenius norm can be explained by the matrix having a growing number of eigenvalues of shrinking intensity as the width grows.
When it comes to the first moment of , Proposition 4 shows that the spectrum of is in general not symmetric. For the MSE loss the expectation of the first moment at initialization is
which may be positive or negative depending on the choice of nonlinearity: with a smooth ReLU, it is positive, while for the arc-tangent or the normalized smooth ReLU, it can be negative (see Figure 1).
This is in contrast to the result obtained in [19, 2] for the shallow ReLU networks, taking the second derivative of the ReLU to be zero. Under this assumption the spectrum of is symmetric: if the eigenvalues are ordered from lowest to highest, and .
These observations suggest that has little influence on the shape of the surface, especially towards the end of training, the matrix however has an interesting structure.
3.4 The matrix
At a global minimizer , the spectrum of describes how the loss behaves around . Along the eigenvectors of the biggest eigenvalues of , the loss increases rapidely, while small eigenvalues correspond to flat directions. Numerically, it has been observed that the matrix features a few dominating eigenvalues and a bulk of small eigenvalues [28, 20, 29, 30]. This leads to a narrow valley structure of the loss around a minimum: the biggest eigenvalues are the ‘cliffs’ of the valley, i.e. the directions along which the loss grows fastest, while the small eigenvalues form the ‘flat directions’or the bottom of the valley.
Note that the rank of is bounded by and in the overparametrized regime, when , the matrix will have a large nullspace, these are directions along which the value of the function on the training set does not change. Note that in the overparametrized regime, global minima are not isolated: they lie in a manifold of dimension at least and the nullspace of is tangent to this solution manifold.
The matrix is closely related to the NTK Gram matrix:
As a result, the limiting spectrum of the matrix can be directly obtained from the NTK222This result was already obtained in , but without identifying the NTK explicitely and only at initialization.
For any loss with BGOSS and , uniformly over any interval , the moments converge to the following limit (with the convention that ):
It follows from and the asymptotic of the NTK . ∎
3.4.1 Mean-Square Error
When the loss is the MSE, is equal to . As a result, and have the same non-zero eigenvalues up to a scaling of . Because the NTK is assymptotically fixed, the spectrum of is also fixed in the limit.
The eigenvectors of the NTK Gram matrix are the kernel principal components of the data. The biggest principal components are the directions in function space which are most favorised by the NTK. This gives a functional interpretation of the narrow valley structure in DNNs: the cliffs of the valley are the biggest principal components, while the flat directions are the smallest components.
As the depth of the network increases, one can observe two regimes [31, 32]: Order/Freeze where the NTK converges to a constant and Chaos where the NTK converges to a Kronecker delta. In the Order/Freeze the Gram matrix approaches a block diagonal matrix with constant blocks, and as a result eigenvalues of dominate the other ones, corresponding to constant directions along each outputs (this is line with the observations of ). This leads to a narrow valley for the loss and slows down training. In contrast, in the Chaos regime, the NTK Gram matrix approaches a scaled identity matrix, and the spectrum of should hence concentrate around a positive value, hence speeding up training. Figure 3 illustrates this phenomenon: with the smooth ReLU we observe a narrow valley, while with the normalized smooth ReLU (which lies in the Chaos according to ) the narrowness of the loss is reduced. A similar phenomenon may explain why normalization helps smoothing the loss surface and speed up training [33, 34].
3.4.2 Cross-Entropy Loss
For a binary cross-entropy loss with labels
is a diagonal matrix whose entries depend on (but not on ):
The eigenvectors of then correspond to the weighted kernel principal component of the data. The positive weights approach as goes to , i.e. when it is close to the decision boundary from one class to the other, and as the weight go to zero. The weights evolve in time through , the spectrum of is therefore not asymptotically fixed as in the MSE case, but the functional interpretation of the spectrum in terms of the kernel principal components remains.
We have given an explicit formula for the limiting moments of the Hessian of DNNs throughout training. We have used the common decomposition of the Hessian in two terms and and have shown that the two terms are asymptotically mutually orthogonal, such that they can be studied separately.
The matrix vanishes in Frobenius norm as the network converges and has vanishing operator norm throughout training. The matrix is arguably the most important as it describes the narrow valley structure of the loss around a global minimum. The eigendecomposition of is related to the (weighted) kernel principal components of the data w.r.t. the NTK.
-  Anna Choromanska, Mikael Henaff, Michael Mathieu, Gérard Ben Arous, and Yann LeCun. The Loss Surfaces of Multilayer Networks. Journal of Machine Learning Research, 38:192–204, nov 2015.
-  Mario Geiger, Stefano Spigler, Stéphane d’Ascoli, Levent Sagun, Marco Baity-Jesi, Giulio Biroli, and Matthieu Wyart. The jamming transition as a paradigm to understand the loss landscape of deep neural networks. arXiv preprint arXiv:1809.09349, 2018.
-  Song Mei, Andrea Montanari, and Phan-Minh Nguyen. A mean field view of the landscape of two-layer neural networks. Proceedings of the National Academy of Sciences, 115(33):E7665–E7671, 2018.
-  Marco Baity-Jesi, Levent Sagun, Mario Geiger, Stefano Spigler, Gerard Ben Arous, Chiara Cammarota, Yann LeCun, Matthieu Wyart, and Giulio Biroli. Comparing Dynamics: Deep Neural Networks versus Glassy Systems. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80, pages 314–323. PMLR, 10–15 Jul 2018.
-  Mario Geiger, Arthur Jacot, Stefano Spigler, Franck Gabriel, Levent Sagun, Stéphane d’Ascoli, Giulio Biroli, Clément Hongler, and Matthieu Wyart. Scaling description of generalization with number of parameters in deep learning . abs/1901.01608, 2019.
-  Razvan Pascanu, Yann N Dauphin, Surya Ganguli, and Yoshua Bengio. On the saddle point problem for non-convex optimization. arXiv preprint, 2014.
-  Yann N. Dauphin, Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, Surya Ganguli, and Yoshua Bengio. Identifying and Attacking the Saddle Point Problem in High-dimensional Non-convex Optimization. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS’14, pages 2933–2941, Cambridge, MA, USA, 2014. MIT Press.
-  Sepp Hochreiter and Jürgen Schmidhuber. Flat minima. Neural Computation, 9(1):1–42, 1997.
-  Pratik Chaudhari, Anna Choromanska, Stefano Soatto, Yann LeCun, Carlo Baldassi, Christian Borgs, Jennifer Chayes, Levent Sagun, and Riccardo Zecchina. Entropy-sgd: Biasing gradient descent into wide valleys. arXiv preprint arXiv:1611.01838, 2016.
-  Lei Wu, Zhanxing Zhu, and Weinan E. Towards Understanding Generalization of Deep Learning: Perspective of Loss Landscapes. CoRR, abs/1706.10239, 2017.
-  Grant Rotskoff and Eric Vanden-Eijnden. Parameters as interacting particles: long time convergence and asymptotic error scaling of neural networks. In Advances in Neural Information Processing Systems 31, pages 7146–7155. Curran Associates, Inc., 2018.
-  Lénaïc Chizat and Francis Bach. On the Global Convergence of Gradient Descent for Over-parameterized Models using Optimal Transport. In Advances in Neural Information Processing Systems 31, pages 3040–3050. Curran Associates, Inc., 2018.
-  Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural Tangent Kernel: Convergence and Generalization in Neural Networks. In Advances in Neural Information Processing Systems 31, pages 8580–8589. Curran Associates, Inc., 2018.
-  Simon S. Du, Xiyu Zhai, Barnabás Póczos, and Aarti Singh. Gradient Descent Provably Optimizes Over-parameterized Neural Networks. 2019.
-  Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. A Convergence Theory for Deep Learning via Over-Parameterization. CoRR, abs/1811.03962, 2018.
-  Radford M. Neal. Bayesian Learning for Neural Networks. Springer-Verlag New York, Inc., Secaucus, NJ, USA, 1996.
-  Youngmin Cho and Lawrence K. Saul. Kernel Methods for Deep Learning. In Advances in Neural Information Processing Systems 22, pages 342–350. Curran Associates, Inc., 2009.
-  Jae Hoon Lee, Yasaman Bahri, Roman Novak, Samuel S. Schoenholz, Jeffrey Pennington, and Jascha Sohl-Dickstein. Deep Neural Networks as Gaussian Processes. ICLR, 2018.
-  Jeffrey Pennington and Yasaman Bahri. Geometry of Neural Network Loss Surfaces via Random Matrix Theory. In Proceedings of the 34th International Conference on Machine Learning, volume 70, pages 2798–2806. PMLR, 06–11 Aug 2017.
-  Levent Sagun, Utku Evci, V. Ugur Güney, Yann Dauphin, and Léon Bottou. Empirical Analysis of the Hessian of Over-Parametrized Neural Networks. CoRR, abs/1706.04454, 2017.
-  Daniel Wagenaar. Information geometry of neural networks. 1998.
-  Razvan Pascanu and Yoshua Bengio. Revisiting Natural Gradient for Deep Networks. jan 2013.
-  Jeffrey Pennington and Pratik Worah. The Spectrum of the Fisher Information Matrix of a Single-Hidden-Layer Neural Network. In Advances in Neural Information Processing Systems 31, pages 5415–5424. Curran Associates, Inc., 2018.
-  Ryo Karakida, Shotaro Akaho, and Shun-Ichi Amari. Universal Statistics of Fisher Information in Deep Neural Networks: Mean Field Approach. jun 2018.
-  Sanjeev Arora, Simon S Du, Wei Hu, Zhiyuan Li, Ruslan Salakhutdinov, and Ruosong Wang. On exact computation with an infinitely wide neural net. arXiv preprint arXiv:1904.11955, 2019.
-  Lenaic Chizat and Francis Bach. A note on lazy training in supervised differentiable programming. arXiv preprint arXiv:1812.07956, 2018.
-  Song Mei, Theodor Misiakiewicz, and Andrea Montanari. Mean-field theory of two-layers neural networks: dimension-free bounds and kernel limit. arXiv preprint arXiv:1902.06015, 2019.
-  Levent Sagun, Léon Bottou, and Yann LeCun. Singularity of the hessian in deep learning. CoRR, abs/1611.07476, 2016.
-  Guy Gur-Ari, Daniel A. Roberts, and Ethan Dyer. Gradient descent happens in a tiny subspace. CoRR, abs/1812.04754, 2018.
-  Vardan Papyan. Measurements of three-level hierarchical structure in the outliers in the spectrum of deepnet hessians. CoRR, abs/1901.08244, 2019.
-  Ben Poole, Subhaneil Lahiri, Maithra Raghu, Jascha Sohl-Dickstein, and Surya Ganguli. Exponential expressivity in deep neural networks through transient chaos. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 3360–3368. Curran Associates, Inc., 2016.
-  Arthur Jacot, Franck Gabriel, and Clément Hongler. Freeze and chaos for dnns: an NTK view of batch normalization, checkerboard and boundary effects. CoRR, abs/1907.05715, 2019.
-  Shibani Santurkar, Dimitris Tsipras, Andrew Ilyas, and Aleksander Madry. How does batch normalization help optimization? In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 2483–2493. Curran Associates, Inc., 2018.
-  Behrooz Ghorbani, Shankar Krishnan, and Ying Xiao. An investigation into neural net optimization via hessian eigenvalue density. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 2232–2241, Long Beach, California, USA, 09–15 Jun 2019. PMLR.
Appendix A Proofs
For the proofs of the theorems and propositions presented in the main text, we reformulate the setup of . For a fixed training set , we consider a (possibly random) time-varying training direction which describes how each of the outputs must be modified. In the case of gradient descent on a cost , the training direction is . The parameters are updated according to the differential equation
Under the condition that is stochastically bounded as the width of the network goes to infinity, the NTK converges to its fixed limit uniformly over .
The reason we consider a general training direction (and not only a gradient of a loss) is that we can split a network in two at a layer and the training of the smaller network will be according to the training direction given by
because the derivatives are bounded and by Lemma 1 of the Appendix of , this training direction satisfies the constraints even though it is not the gradient of a loss. As a consequence, as the NTK of the smaller network also converges to its limit uniformly over . As we let the pre-activations and weights move at a rate of . We will use this rate of change to prove that other types of kernels are constant during training.
When a network is trained with gradient descent on a loss with BGOSS, the integral is stochastically bounded. Because the loss is decreasing during training, the outputs lie in the sublevel set for all times . The norm of the gradient is hence bounded for all times . Because the distribution of converges to a multivariate Gaussian, is stochastically bounded as the width grows, where is a bound on the norm of the gradient on . We then have the bound which is itself stochastically bounded.
For the binary and softmax cross-entropy losses the gradient is uniformly bounded:
For the binary cross-entropy loss and any , .
For the softmax cross-entropy loss C on classes and any , .
The binary cross-entropy loss with labels is
and the gradient at an input is
which is bounded in absolute value by for both such that .
The softmax cross-entropy loss over classes with labels is defined by
The gradient is at an input and output class is
which is bounded in absolute value by such that . ∎
Appendix B Preliminaries
To study the moments of the matrix , we first have to show that two tensors vanish as :
We study these tensors recursively, for this, we need a recursive definition for the first derivatives and second derivatives . The value of these derivatives depend on the layer the parameters and belong to, and on whether they are connection weights or biases . The derivatives with respect to the parameters of the last layer are