The asymptotic spectrum of the Hessian of DNN throughout training

Arthur Jacot
EPFL
arthur.jacot@epfl.ch
   Franck Gabriel
EPFL
franck.gabriel@epfl.ch
   Clément Hongler
EPFL
clement.hongler@epfl.ch
Abstract

The dynamics of DNNs during gradient descent is described by the so-called Neural Tangent Kernel (NTK). In this article, we show that the NTK allows one to gain precise insight into the Hessian of the cost of DNNs: we obtain a full characterization of the asymptotics of the spectrum of the Hessian, at initialization and during training.

1 Introduction

The advent of deep learning has sparked a lot of interest in the loss surface of deep neural networks (DNNs), and in particular in its Hessian. However, to our knowledge, there is still no theoretical description of the spectrum of the Hessian. Nevertheless, a number of phenomena have been observed numerically.

The loss surface of neural networks has been compared to the energy landscape of different physical models [1, 2, 3]. It appears that the loss surface of DNNs may change significantly depending on the width of the network (the number of neurons in the hidden layer), motivating the distinction between the under- and over-parametrized regimes [4, 2, 5].

The non-convexity of the loss function implies the existence of a very large number of saddle points, which could slow down training. In particular, in [6, 7], a relation between the rank of saddle points (the number of negative eigenvalues of the Hessian) and their loss has been observed.

For overparametrized DNNs, a possibly more important phenomenon is the large number of flat directions [4]. The existence of these flat minima is conjectured to be related to the generalization of DNNs and may depend on the training procedure [8, 9, 10].

Recent results have obtained strong convergence guarantees for shallow networks [11, 12, 3] and also for deep networks in the over-parametrized regime [13, 14, 15].

In [13], it has been shown, using a functional approach, that in the infinite-width limit, DNNs behave like kernel methods with respect to the so-called Neural Tangent Kernel, which is determined by the architecture of the network. This strengthens the connections between neural networks and kernel methods [16, 17, 18].

This raises the question: can we use these new results to gain insight into the behavior of the Hessian of the loss of DNNs, at least in the small region explored by the parameters during training?

1.1 Contributions

Following ideas introduced in [13], we consider the training of $L$-layered DNNs in a functional setting. For a functional cost $C$, the Hessian $H$ of the loss is the sum of two matrices $I$ and $S$. We show the following results for large width and for a fixed number of datapoints $N$:

  • The first matrix $I$ is positive semi-definite and its eigenvalues are given by the (weighted) kernel PCA of the dataset with respect to the NTK. The dominating eigenvalues are the principal components of the data, followed by a high number of small eigenvalues. The “flat directions” are spanned by the small eigenvalues and the null-space (of dimension at least $P - N$ when there is a single output). Because the NTK is asymptotically constant [13], these results apply at initialization, during training and at convergence.

  • The second matrix $S$ can be viewed as a residual contribution to the Hessian, since it vanishes as the network converges to a global minimum. We compute the limit of the first moment and characterize its evolution during training, the limit of the second moment, which stays constant during training, and show that the higher moments vanish.

  • Regarding the sum $H = I + S$, we show that the matrices $I$ and $S$ are asymptotically orthogonal to each other at initialization and during training. In particular, the moments of $I$ and $S$ add up: the moments of $H$ are asymptotically the sums of the corresponding moments of $I$ and $S$.

These results give, for any depth $L$ and a fairly general non-linearity, a complete description of the spectrum of the Hessian in terms of the NTK at initialization and throughout training. This gives theoretical confirmation of a number of observations about the Hessian [8, 6, 7, 9, 10, 19, 2], and sheds new light on them.
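To fix ideas about the decomposition $H = I + S$ used above, consider the simplest possible case of a single parameter $\theta$, a single training point with output $f(\theta)$ and a convex per-sample cost $c$; the splitting is then just the chain rule:

\[
\frac{d^2}{d\theta^2}\, c\big(f(\theta)\big)
\;=\;
\underbrace{c''\big(f(\theta)\big)\, f'(\theta)^2}_{\text{first term: positive, Fisher-like}}
\;+\;
\underbrace{c'\big(f(\theta)\big)\, f''(\theta)}_{\text{second term: vanishes at a global minimum, where } c' = 0}.
\]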

1.2 Related works

The Hessian of the loss has been studied through the decomposition $H = I + S$ in a number of previous works [20, 19, 2].

For least-squares and cross-entropy costs, the first matrix $I$ is equal to the Fisher information matrix [21, 22], whose moments have been described for shallow networks in [23]. For deep networks, the first two moments and the operator norm of the Fisher matrix for a least-squares loss were computed at initialization in [24], conditionally on a certain independence assumption; our method does not require such assumptions. Note that their approach implicitly uses the NTK.

The second matrix $S$ has been studied in [19, 2] for shallow networks, conditionally on a number of assumptions. Note that in the setting of [19], the matrices $I$ and $S$ are assumed to be freely independent, which allows them to study the spectrum of the Hessian; in our setting, we show that the two matrices $I$ and $S$ are asymptotically orthogonal to each other.

2 Setup

We consider deep fully connected artificial neural networks (DNNs) using the setup and NTK parametrization of [13], taking an arbitrary nonlinearity $\sigma : \mathbb{R} \to \mathbb{R}$ that is four times continuously differentiable with all four derivatives bounded. The layers are numbered from $0$ (input) to $L$ (output), each containing $n_0, \ldots, n_L$ neurons. The parameters consist of the weight matrices $W^{(\ell)} \in \mathbb{R}^{n_{\ell+1} \times n_\ell}$ and bias vectors $b^{(\ell)} \in \mathbb{R}^{n_{\ell+1}}$ for $\ell = 0, \ldots, L-1$. We aggregate the parameters into the vector $\theta \in \mathbb{R}^P$.

The activations $\alpha^{(\ell)}(x;\theta)$ and pre-activations $\tilde\alpha^{(\ell)}(x;\theta)$ of the layers are defined recursively for an input $x \in \mathbb{R}^{n_0}$, setting $\alpha^{(0)}(x;\theta) = x$:

\[ \tilde\alpha^{(\ell+1)}(x;\theta) = \frac{1}{\sqrt{n_\ell}}\, W^{(\ell)} \alpha^{(\ell)}(x;\theta) + \beta\, b^{(\ell)}, \qquad \alpha^{(\ell+1)}(x;\theta) = \sigma\big(\tilde\alpha^{(\ell+1)}(x;\theta)\big). \]

The parameter $\beta$ is added to tune the influence of the bias on training (in our experiments, we take $\beta = 0.1$). All parameters are initialized as iid standard Gaussians $\mathcal{N}(0,1)$.

We will in particular study the network function $f_\theta : \mathbb{R}^{n_0} \to \mathbb{R}^{n_L}$, which maps an input $x$ to the activations of the output layer (before the last non-linearity): $f_\theta(x) = \tilde\alpha^{(L)}(x;\theta)$.
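As a concrete reference for this parametrization, here is a minimal NumPy sketch of the forward pass; the widths, the nonlinearity (tanh) and the value of $\beta$ are illustrative choices, not the settings of the paper's experiments.

```python
import numpy as np

def init_params(widths, rng):
    """Initialize all weights and biases as iid standard Gaussians (NTK parametrization)."""
    Ws = [rng.standard_normal((widths[l + 1], widths[l])) for l in range(len(widths) - 1)]
    bs = [rng.standard_normal(widths[l + 1]) for l in range(len(widths) - 1)]
    return Ws, bs

def network(x, Ws, bs, beta=0.1, sigma=np.tanh):
    """NTK-parametrized forward pass: pre-activations are rescaled by 1/sqrt(n_l).
    Returns the output-layer pre-activations (no nonlinearity on the last layer)."""
    alpha = x
    for l, (W, b) in enumerate(zip(Ws, bs)):
        pre = W @ alpha / np.sqrt(alpha.shape[0]) + beta * b
        alpha = sigma(pre) if l < len(Ws) - 1 else pre  # last layer: keep the pre-activations
    return alpha

rng = np.random.default_rng(0)
widths = [3, 256, 256, 1]            # n_0, n_1, n_2, n_L (illustrative)
Ws, bs = init_params(widths, rng)
x = rng.standard_normal(3)
print(network(x, Ws, bs))            # f_theta(x), a vector of size n_L
```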

In this paper, we will study the limit of various objects as $n_1, \ldots, n_{L-1} \to \infty$ sequentially, i.e. we first take $n_1 \to \infty$, then $n_2 \to \infty$, etc. This greatly simplifies the proofs, but they could in principle be extended to the simultaneous limit, i.e. when $\min(n_1, \ldots, n_{L-1}) \to \infty$. All our numerical experiments are done with ‘rectangular’ networks (with $n_1 = \cdots = n_{L-1}$) and match closely the predictions for the sequential limit.

In the limit we study in this paper, the NTK is asymptotically fixed, as in [13, 15, 14, 25]. By rescaling the outputs of DNNs as the width increases, one can reach another limit where the NTK is not fixed [12, 26, 11, 27]. The behavior of the Hessian in this other limit may be significantly different.

2.1 Functional viewpoint

The network function $f_\theta$ lives in a function space $\mathcal{F}$, and we call the map $F^{(L)} : \mathbb{R}^P \to \mathcal{F}$, $\theta \mapsto f_\theta$, which maps the parameters to the network function, the realization function. We study the differential behavior of $F^{(L)}$:

  • The derivative $DF^{(L)}(\theta)$ is a function-valued vector of dimension $P$. The $p$-th entry $\partial_{\theta_p} F^{(L)}(\theta)$ represents how modifying the parameter $\theta_p$ modifies the function in the space $\mathcal{F}$.

  • The Hessian $D^2 F^{(L)}(\theta)$ is a function-valued $P \times P$ matrix.

The network is trained with respect to the cost functional

\[ C(f) = \sum_{i=1}^{N} c\big(f(x_i), y_i\big) \]

for a strictly convex $c$, summing over a finite dataset $x_1, \ldots, x_N$ of size $N$ with labels $y_1, \ldots, y_N$. The parameters are then trained with gradient descent on the composition $C \circ F^{(L)}$, which defines the usual loss surface of neural networks.

In this setting, we define the finite realization function $Y : \mathbb{R}^P \to \mathbb{R}^{N n_L}$, mapping the parameters $\theta$ to the restriction of the network function to the training set, $Y(\theta) = \big(f_\theta(x_i)\big)_{i=1,\ldots,N}$. The Jacobian $DY(\theta)$ is hence an $(N n_L) \times P$ matrix and its Hessian $D^2 Y(\theta)$ is a $P \times P \times (N n_L)$ tensor. Defining the restricted cost $\bar C : \mathbb{R}^{N n_L} \to \mathbb{R}$, we have $C \circ F^{(L)} = \bar C \circ Y$.
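To make the dimensions concrete, here is a hedged sketch that assembles $Y(\theta)$ and its Jacobian by central finite differences for a toy network; the packing of the parameters into a single vector, the sizes and the nonlinearity are illustrative (in practice automatic differentiation would be used).

```python
import numpy as np

def flat_network(theta, x, widths, beta=0.1, sigma=np.tanh):
    """Forward pass with all parameters packed into a single vector theta (NTK parametrization)."""
    alpha, k = x, 0
    for l in range(len(widths) - 1):
        n_in, n_out = widths[l], widths[l + 1]
        W = theta[k:k + n_out * n_in].reshape(n_out, n_in); k += n_out * n_in
        b = theta[k:k + n_out]; k += n_out
        pre = W @ alpha / np.sqrt(n_in) + beta * b
        alpha = sigma(pre) if l < len(widths) - 2 else pre
    return alpha

def finite_realization(theta, X, widths):
    """Y(theta): stack the network outputs over the training set, a vector of size N*n_L."""
    return np.concatenate([flat_network(theta, x, widths) for x in X])

def jacobian(theta, X, widths, eps=1e-5):
    """DY(theta), an (N*n_L) x P matrix, by central finite differences."""
    P, Y0 = theta.size, finite_realization(theta, X, widths)
    J = np.zeros((Y0.size, P))
    for p in range(P):
        e = np.zeros(P); e[p] = eps
        J[:, p] = (finite_realization(theta + e, X, widths)
                   - finite_realization(theta - e, X, widths)) / (2 * eps)
    return J

rng = np.random.default_rng(1)
widths = [2, 8, 8, 1]                              # tiny network so P stays small
P = sum((widths[l] + 1) * widths[l + 1] for l in range(len(widths) - 1))
theta = rng.standard_normal(P)
X = rng.standard_normal((5, 2))                    # N = 5 datapoints
J = jacobian(theta, X, widths)
print(J.shape)                                     # (N*n_L, P)
print((J @ J.T).shape)                             # empirical NTK Gram matrix (see Section 2.2)
```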

For our analysis, we require that the gradient norm does not explode during training. The following condition is sufficient:

Definition 1.

A loss $C$ has bounded gradients over sublevel sets (BGOSS) if the norm of the gradient $\|\nabla C\|$ is bounded over all sublevel sets $\{C \leq a\}$, $a > 0$.

For example, the Mean Square Error (MSE) $C(f) = \frac{1}{2}\sum_i \|f(x_i) - y_i\|^2$ for the labels $y_i$ has BGOSS because $\|\nabla C\| = \sqrt{2C} \leq \sqrt{2a}$ over the sublevel set $\{C \leq a\}$. For the binary and softmax cross-entropy losses, the gradient is uniformly bounded, see Proposition 2 in Appendix A.

2.2 Neural Tangent Kernel

The behavior during training of the network function $f_{\theta_t}$ in the function space $\mathcal{F}$ is described by a (multi-dimensional) kernel, the Neural Tangent Kernel (NTK):

\[ \Theta^{(L)}_{kk'}(x, x') = \sum_{p=1}^{P} \partial_{\theta_p} f_{\theta,k}(x)\, \partial_{\theta_p} f_{\theta,k'}(x'). \]

During training, the function $f_{\theta_t}$ follows the so-called kernel gradient descent with respect to the NTK, which is defined as

\[ \partial_t f_{\theta_t}(x) = -\sum_{i=1}^{N} \Theta^{(L)}(x, x_i)\, \nabla c\big(f_{\theta_t}(x_i), y_i\big). \]

In the infinite-width limit (letting $n_1, \ldots, n_{L-1} \to \infty$ sequentially) and for losses with BGOSS, the NTK converges to a deterministic limit $\Theta^{(L)}_\infty$, which is constant during training, uniformly on finite time intervals [13]. In the case of the MSE loss, the uniform convergence of the NTK was proven in [25].
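In this limit, the function-space dynamics on the training set only involve the (fixed) NTK Gram matrix. The sketch below integrates the kernel gradient descent ODE for an MSE cost with an arbitrary positive semi-definite Gram matrix standing in for the limiting NTK; the Gaussian kernel, the step size and the data are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
N = 20
X = rng.standard_normal((N, 3))

# Stand-in for the limiting NTK Gram matrix: any PSD kernel works for this illustration.
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
Theta = np.exp(-0.5 * sq_dists) + 1e-6 * np.eye(N)

Y_star = rng.standard_normal(N)        # targets
Y = rng.standard_normal(N)             # network outputs on the training set at t = 0
Y0 = Y.copy()

# Kernel gradient descent for the MSE cost C(Y) = 0.5 * ||Y - Y*||^2:
#   dY/dt = -Theta @ grad C(Y) = -Theta @ (Y - Y*)
dt, T = 0.01, 10.0
for _ in range(int(T / dt)):
    Y = Y - dt * Theta @ (Y - Y_star)  # explicit Euler step

# Closed-form solution of this linear ODE: Y_t - Y* = exp(-Theta t) (Y_0 - Y*)
lam, V = np.linalg.eigh(Theta)
Y_closed = Y_star + V @ (np.exp(-lam * T) * (V.T @ (Y0 - Y_star)))
print(np.max(np.abs(Y - Y_closed)))    # small: Euler tracks the exponential decay, up to discretization error
```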

The limiting NTK is constructed as follows:

  1. For a function $f : \mathbb{R} \to \mathbb{R}$ and a kernel $K$, define the kernel $L^K_f$ by

    \[ L^K_f(x, x') = \mathbb{E}\big[f(u)\, f(v)\big] \]

    for a centered Gaussian vector $(u, v)$ with covariance matrix $\begin{pmatrix} K(x,x) & K(x,x') \\ K(x,x') & K(x',x') \end{pmatrix}$. For the derivative $\dot\sigma$ of the nonlinearity, we denote by $\dot L^K_\sigma$ the kernel $L^K_{\dot\sigma}$.

  2. We define the kernels $\Sigma^{(\ell)}$ for each layer of the network, starting with $\Sigma^{(1)}(x, x') = \frac{1}{n_0} x^T x' + \beta^2$ and then recursively by $\Sigma^{(\ell+1)} = L^{\Sigma^{(\ell)}}_\sigma + \beta^2$, for $\ell = 1, \ldots, L-1$, where $\sigma$ is the network non-linearity.

  3. The limiting NTK $\Theta^{(L)}_\infty$ is defined in terms of the kernels $\Sigma^{(\ell)}$ and the kernels $\dot\Sigma^{(\ell)} = \dot L^{\Sigma^{(\ell-1)}}_\sigma$:

\[ \Theta^{(L)}_{\infty, kk'}(x, x') = \delta_{kk'} \sum_{\ell=1}^{L} \Sigma^{(\ell)}(x, x') \prod_{\ell' = \ell+1}^{L} \dot\Sigma^{(\ell')}(x, x'). \]

The NTK leads to convergence guarantees for DNNs in the infinite-width limit, and connects their generalization properties to those of kernel methods [13, 25].
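The recursion above can be evaluated numerically; the following sketch estimates the Gaussian expectations by Monte Carlo (closed forms exist for specific nonlinearities). The choice of tanh, the value of $\beta$ and the sample sizes are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def E_pair(K2, f, n=200_000, seed=0):
    """Monte-Carlo estimate of E[f(u) f(v)] for (u, v) ~ N(0, K2), K2 a 2x2 covariance matrix."""
    rng = np.random.default_rng(seed)
    L = np.linalg.cholesky(K2 + 1e-12 * np.eye(2))
    uv = rng.standard_normal((n, 2)) @ L.T
    return float(np.mean(f(uv[:, 0]) * f(uv[:, 1])))

def limiting_ntk(x, xp, L_depth, beta=0.1, sigma=np.tanh,
                 dsigma=lambda u: 1.0 - np.tanh(u) ** 2):
    """Scalar limiting NTK Theta^(L)(x, x') via the layer-wise recursion:
    Sigma^(l+1) = L_sigma^{Sigma^(l)} + beta^2,  Sigma_dot^(l+1) = L_{sigma'}^{Sigma^(l)},
    Theta^(1) = Sigma^(1),  Theta^(l+1) = Theta^(l) * Sigma_dot^(l+1) + Sigma^(l+1)."""
    n0 = x.shape[0]
    S = {('x', 'x'): x @ x / n0 + beta**2,
         ('x', 'p'): x @ xp / n0 + beta**2,
         ('p', 'p'): xp @ xp / n0 + beta**2}
    Theta = S[('x', 'p')]
    for _ in range(L_depth - 1):
        K2 = np.array([[S[('x', 'x')], S[('x', 'p')]],
                       [S[('x', 'p')], S[('p', 'p')]]])
        Sdot = E_pair(K2, dsigma)                                     # Sigma_dot^(l+1)(x, x')
        S_new = {('x', 'x'): E_pair(np.full((2, 2), S[('x', 'x')]), sigma) + beta**2,
                 ('x', 'p'): E_pair(K2, sigma) + beta**2,
                 ('p', 'p'): E_pair(np.full((2, 2), S[('p', 'p')]), sigma) + beta**2}
        Theta = Theta * Sdot + S_new[('x', 'p')]                      # NTK recursion
        S = S_new
    return Theta

x, xp = np.array([1.0, 0.5]), np.array([-0.3, 0.8])
print(limiting_ntk(x, xp, L_depth=3))   # Monte-Carlo estimate of Theta^(3)(x, x')
```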

2.3 Gram Matrices

For a finite dataset $x_1, \ldots, x_N$ and a fixed depth $L$, we denote by $\tilde\Theta^{(L)}_\infty$ the $(N n_L) \times (N n_L)$ Gram matrix of the data with respect to the limiting NTK, defined by

\[ \big(\tilde\Theta^{(L)}_\infty\big)_{(i,k),(j,k')} = \Theta^{(L)}_{\infty, kk'}(x_i, x_j). \]

It is block diagonal because different outputs are asymptotically uncorrelated.

Similarly, for any (scalar) kernel $K$ (such as the limiting kernels introduced later), we will use the same notation, denoting the $N \times N$ Gram matrix of the datapoints by $\tilde K$.

3 Main Theorems

3.1 Hessian as the sum $I + S$

Using the above setup, the Hessian of the loss $C \circ F^{(L)} = \bar C \circ Y$ is the sum of two terms, with the $(k, l)$ entry given by the chain rule

\[ H_{kl}(\theta) = \sum_{i,j} \partial_{\theta_k} Y_i(\theta)\, \big(\partial^2_{ij} \bar C\big)\big(Y(\theta)\big)\, \partial_{\theta_l} Y_j(\theta) \;+\; \sum_{i} \big(\partial_i \bar C\big)\big(Y(\theta)\big)\, \partial_{\theta_k}\partial_{\theta_l} Y_i(\theta). \]

For a finite dataset, the Hessian matrix is therefore equal to the sum $H = I + S$ of the two matrices

\[ I(\theta) = DY(\theta)^T \big(D^2 \bar C\big)\, DY(\theta), \qquad S(\theta) = D^2 Y(\theta) \cdot \nabla \bar C\big(Y(\theta)\big), \]

where $DY(\theta)$ is an $(N n_L) \times P$ matrix, $D^2 \bar C$ is an $(N n_L) \times (N n_L)$ matrix and $D^2 Y(\theta)$ is a $P \times P \times (N n_L)$ tensor to which we apply a scalar product (denoted by $\cdot$) in its last dimension with the vector $\nabla \bar C\big(Y(\theta)\big)$ to obtain a $P \times P$ matrix.
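The decomposition can be checked numerically on a toy example. The following self-contained sketch uses a tiny network, the MSE cost and finite differences in place of automatic differentiation; all sizes, names and the nonlinearity are illustrative.

```python
import numpy as np

# tiny NTK-parametrized network with parameters packed into a single vector (illustrative sizes)
widths, beta = [2, 6, 1], 0.1

def Y(theta, X):
    """Finite realization: network outputs over the whole training set, a vector of size N*n_L."""
    out = []
    for x in X:
        alpha, k = x, 0
        for l in range(len(widths) - 1):
            n_in, n_out = widths[l], widths[l + 1]
            W = theta[k:k + n_out * n_in].reshape(n_out, n_in); k += n_out * n_in
            b = theta[k:k + n_out]; k += n_out
            pre = W @ alpha / np.sqrt(n_in) + beta * b
            alpha = np.tanh(pre) if l < len(widths) - 2 else pre
        out.append(alpha)
    return np.concatenate(out)

rng = np.random.default_rng(3)
P = sum((widths[l] + 1) * widths[l + 1] for l in range(len(widths) - 1))
theta = rng.standard_normal(P)
X = rng.standard_normal((4, 2))          # N = 4 datapoints
Y_star = rng.standard_normal(4)          # MSE targets, so grad C = Y - Y* and D^2 C = Id
eps = 1e-3

def dY(p):                               # column p of the Jacobian DY(theta)
    e = np.zeros(P); e[p] = eps
    return (Y(theta + e, X) - Y(theta - e, X)) / (2 * eps)

J = np.stack([dY(p) for p in range(P)], axis=1)       # (N*n_L) x P
I_mat = J.T @ J                                        # I = DY^T (D^2 C) DY, with D^2 C = Id

# S = D^2 Y . grad C: second derivatives of Y by a 4-point stencil, contracted with grad C
g = Y(theta, X) - Y_star
S_mat = np.zeros((P, P))
for p in range(P):
    for q in range(P):
        ep = np.zeros(P); ep[p] = eps
        eq = np.zeros(P); eq[q] = eps
        d2Y = (Y(theta + ep + eq, X) - Y(theta + ep - eq, X)
               - Y(theta - ep + eq, X) + Y(theta - ep - eq, X)) / (4 * eps**2)
        S_mat[p, q] = g @ d2Y

# full Hessian of the loss, computed independently with the same stencil
def loss(th): return 0.5 * np.sum((Y(th, X) - Y_star) ** 2)
H = np.zeros((P, P))
for p in range(P):
    for q in range(P):
        ep = np.zeros(P); ep[p] = eps
        eq = np.zeros(P); eq[q] = eps
        H[p, q] = (loss(theta + ep + eq) - loss(theta + ep - eq)
                   - loss(theta - ep + eq) + loss(theta - ep - eq)) / (4 * eps**2)

print(np.max(np.abs(H - (I_mat + S_mat))))   # small, up to finite-difference error: H = I + S
```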

Our main contribution is the following theorem, which describes the limiting moments of the Hessian in terms of the moments of $I$ and $S$:

Theorem 1.

For any loss with BGOSS and a nonlinearity $\sigma$ as in Section 2, in the sequential limit $n_1, \ldots, n_{L-1} \to \infty$, we have for all $k \geq 1$ that the $k$-th moment of the Hessian $H = I + S$ converges to the sum of the limiting $k$-th moments of $I$ and $S$ described below.

The limits of the moments of $I$ and $S$ can be expressed in terms of the NTK $\Theta^{(L)}_\infty$ and of further (possibly non-symmetric) limiting kernels defined in Appendix C:

  • The moments of $I$ converge to limits expressed in terms of the NTK Gram matrix $\tilde\Theta^{(L)}_\infty$ (see Proposition 1 below).

  • The first moment of $S$ converges to a limit determined by a pair of vectors which, at initialization, form a Gaussian pair of $N$-vectors, independent for differing output indices and with covariance given by a (non-symmetric) limiting kernel. During training, both vectors follow explicit differential equations (see Proposition 4 below).

  • The second moment of $S$ converges to a limit defined in terms of the Gram matrix of a limiting kernel (see Proposition 4 below), and stays constant during training.

  • The higher moments of $S$, of order $k \geq 3$, vanish.

Proof.

The moments of $I$ and $S$ can be studied separately because the moments of their sum are asymptotically equal to the sums of their moments, by Proposition 5 below. The limiting moments of $I$ and $S$ are respectively described by Propositions 1 and 4 below. ∎

In the case of the MSE loss $\bar C(Y) = \frac{1}{2}\|Y - Y^*\|^2$, the first and second derivatives take the simple forms $\nabla \bar C(Y) = Y - Y^*$ and $D^2 \bar C = \mathrm{Id}$, and the differential equations can be solved to obtain more explicit formulae:

Corollary 1.

For the MSE loss and a nonlinearity as above, in the limit $n_1, \ldots, n_{L-1} \to \infty$, we have, uniformly over finite time intervals,

where

In expectation we have:

Proof.

The moments of $I$ are constant because the limiting NTK Gram matrix is constant. For the moments of $S$, we first solve the differential equation for $Y_t$: for the MSE loss, kernel gradient descent gives $\partial_t Y_t = -\tilde\Theta^{(L)}_\infty (Y_t - Y^*)$.

Noting $\Delta_t = Y_t - Y^*$, we have $\Delta_t = e^{-\tilde\Theta^{(L)}_\infty t}\, \Delta_0$.

The expectation of the first moment of $S$ then follows. ∎

Figure 1: Comparison of the theoretical prediction of Corollary 1 for the expectation of the first 4 moments (colored lines) to the empirical average over 250 trials (black crosses) for a rectangular network with two hidden layers of finite widths, with the smooth ReLU (left) and the normalized smooth ReLU (right), for the MSE loss on scaled-down 14x14 MNIST. Only the first two moments are affected by $S$ at the beginning of training.

3.2 Mutual Orthogonality of $I$ and $S$

A first key ingredient to prove Theorem 1 is the asymptotic mutual orthogonality of the matrices $I$ and $S$:

Proposition (Proposition 5 in Appendix D).

For any loss with BGOSS and a nonlinearity as above, uniformly over finite time intervals, the products $IS$ and $SI$ converge to zero.

As a consequence, the moments of $H = I + S$ are asymptotically the sums of the corresponding moments of $I$ and $S$.

Remark 1.

If two symmetric matrices $A$ and $B$ are mutually orthogonal (i.e. $AB = BA = 0$), the range of $A$ is contained in the nullspace of $B$ and vice versa. The non-zero eigenvalues of the sum $A + B$ are therefore given by the union of the non-zero eigenvalues of $A$ and $B$. Furthermore, the moments of $A$ and $B$ add up: $\operatorname{Tr}\big[(A+B)^k\big] = \operatorname{Tr}\big[A^k\big] + \operatorname{Tr}\big[B^k\big]$. Proposition 5 shows that this is what happens asymptotically for $I$ and $S$.
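Remark 1 is a purely linear-algebraic statement; the following sketch illustrates it with synthetic symmetric matrices whose ranges lie in orthogonal subspaces (the matrices here are unrelated to any network).

```python
import numpy as np

rng = np.random.default_rng(4)
P = 50
# Two symmetric matrices with ranges in orthogonal subspaces, so that A @ B = B @ A = 0.
Q, _ = np.linalg.qr(rng.standard_normal((P, P)))   # random orthonormal basis
U, V = Q[:, :5], Q[:, 5:12]                        # two disjoint sets of basis vectors
A = U @ np.diag(rng.uniform(1, 2, 5)) @ U.T        # PSD, like I
B = V @ np.diag(rng.uniform(-2, -1, 7)) @ V.T      # indefinite is allowed, like S

print(np.max(np.abs(A @ B)))                       # ~0: mutual orthogonality

# Non-zero eigenvalues of A + B = union of those of A and B ...
nz = lambda v: np.sort(v[np.abs(v) > 1e-8])
print(np.allclose(nz(np.linalg.eigvalsh(A + B)),
                  nz(np.concatenate([np.linalg.eigvalsh(A), np.linalg.eigvalsh(B)]))))

# ... and the moments add up: Tr[(A+B)^k] = Tr[A^k] + Tr[B^k]
for k in range(1, 5):
    lhs = np.trace(np.linalg.matrix_power(A + B, k))
    rhs = np.trace(np.linalg.matrix_power(A, k)) + np.trace(np.linalg.matrix_power(B, k))
    print(k, np.isclose(lhs, rhs))
```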

Note that both matrices $I$ and $S$ have large nullspaces: indeed, assuming a constant width $n$, the rank of $I$ is bounded by $N n_L$ and the rank of $S$ grows much more slowly than the number of parameters (see Appendix C), while the number of parameters $P$ scales quadratically in $n$ (when $L \geq 3$).

Figure 2 illustrates the mutual orthogonality of $I$ and $S$.

3.3 The matrix $S$

The matrix $S$ is best understood as a perturbation of $I$, which vanishes as the network converges because the gradient $\nabla \bar C\big(Y(\theta_t)\big)$ goes to zero. To calculate its moments, we note that the first moment is the scalar product

\[ \operatorname{Tr}[S] = \nabla \bar C\big(Y(\theta)\big) \cdot v, \qquad v_i = \operatorname{Tr}\big[D^2 Y_i(\theta)\big], \]

where the vector $v$ is the evaluation of the function $x \mapsto \sum_{p=1}^{P} \partial^2_{\theta_p} f_\theta(x)$ on the training set.

For the second moment we have the quadratic form

\[ \operatorname{Tr}\big[S^2\big] = \nabla \bar C\big(Y(\theta)\big)^T\, \tilde K\, \nabla \bar C\big(Y(\theta)\big) \]

for the Gram matrix $\tilde K$ of the kernel $K_{kk'}(x, x') = \operatorname{Tr}\big[D^2 f_{\theta,k}(x)\, D^2 f_{\theta,k'}(x')\big]$.

The following proposition describes the limit of this function and kernel, and the vanishing of the higher moments:

Proposition (Proposition 4 in Appendix C).

For any loss with BGOSS and a nonlinearity as above, the first two moments of $S$ take the form given above, and their limits are as follows:

- At initialization, the two vectors entering the first moment converge to a (centered) Gaussian pair with covariances given by limiting kernels, and during training the pair evolves according to explicit differential equations.

- Uniformly over any interval $[0, T]$, the kernel entering the second moment has a deterministic limit, which is fixed during training.

- The higher moments vanish: $\operatorname{Tr}\big[S^k\big] \to 0$ for $k \geq 3$.

This result has a number of consequences for infinitely wide networks:

  1. At initialization, the matrix $S$ has a finite Frobenius norm $\|S\|_F = \sqrt{\operatorname{Tr}[S^2]}$, because the second moment converges to a fixed limit. As the network converges, the derivative of the cost goes to zero and so does the Frobenius norm of $S$.

  2. In contrast, the operator norm of $S$ vanishes already at initialization (because for all even $k \geq 4$, we have $\|S\|_{op}^k \leq \operatorname{Tr}\big[S^k\big] \to 0$). At initialization, the vanishing of $S$ in operator norm but not in Frobenius norm can be explained by the matrix having a growing number of eigenvalues of shrinking magnitude as the width grows.

  3. When it comes to the first moment of $S$, Proposition 4 shows that the spectrum of $S$ is in general not symmetric. For the MSE loss, the expectation of the first moment at initialization takes an explicit form which may be positive or negative depending on the choice of nonlinearity: with a smooth ReLU it is positive, while for the arc-tangent or the normalized smooth ReLU it can be negative (see Figure 1).
    This is in contrast to the result obtained in [19, 2] for shallow ReLU networks, taking the second derivative of the ReLU to be zero. Under this assumption, the spectrum of $S$ is symmetric: if the eigenvalues $\lambda_1 \leq \cdots \leq \lambda_P$ are ordered from lowest to highest, then $\lambda_i = -\lambda_{P+1-i}$.

These observations suggest that $S$ has little influence on the shape of the loss surface, especially towards the end of training; the matrix $I$, however, has an interesting structure.

Figure 2: Illustration of the mutual orthogonality of $I$ and $S$. For the first 20 eigenvectors of $I$ (blue) and $S$ (orange), we plot the Rayleigh quotients with respect to $I$ and $S$ (with the normalized ReLU on 14x14 MNIST). We see that the directions where $I$ is large are directions where $S$ is small and vice versa.
Figure 3: Plot of the loss surface around a global minimum along the first (y coordinate) and fourth (x coordinate) eigenvectors of $I$, for a network with the smooth ReLU (left) and the normalized smooth ReLU (right). The data is uniform on the unit disk. Normalizing the non-linearity greatly reduces the narrow valley structure of the loss, thus speeding up training.

3.4 The matrix $I$

At a global minimizer $\theta^*$, the spectrum of $I$ describes how the loss behaves around $\theta^*$ (since $S$ vanishes there). Along the eigenvectors of the biggest eigenvalues of $I$, the loss increases rapidly, while small eigenvalues correspond to flat directions. Numerically, it has been observed that the matrix features a few dominating eigenvalues and a bulk of small eigenvalues [28, 20, 29, 30]. This leads to a narrow valley structure of the loss around a minimum: the biggest eigenvalues are the ‘cliffs’ of the valley, i.e. the directions along which the loss grows fastest, while the small eigenvalues form the ‘flat directions’ or the bottom of the valley.

Note that the rank of $I$ is bounded by $N n_L$, and in the overparametrized regime, when $P > N n_L$, the matrix will have a large nullspace; these are directions along which the values of the function on the training set do not change. Note that in the overparametrized regime, global minima are not isolated: they lie on a manifold of dimension at least $P - N n_L$, and the nullspace of $I$ is tangent to this solution manifold.

The matrix $I$ is closely related to the NTK Gram matrix: since $\tilde\Theta^{(L)} = DY(\theta)\, DY(\theta)^T$, the non-zero eigenvalues of $I = DY(\theta)^T (D^2 \bar C)\, DY(\theta)$ are exactly the non-zero eigenvalues of $(D^2 \bar C)\, \tilde\Theta^{(L)}$.

As a result, the limiting spectrum of the matrix $I$ can be directly obtained from the NTK (this result was already obtained in [24], but without identifying the NTK explicitly and only at initialization).

Proposition 1.

For any loss with BGOSS and a nonlinearity as above, uniformly over any interval $[0, T]$, the moments of $I$ converge to the limits

\[ \operatorname{Tr}\big[I^k\big] \;\longrightarrow\; \operatorname{Tr}\Big[\big(D^2 \bar C(Y_t)\; \tilde\Theta^{(L)}_\infty\big)^{k}\Big], \]

with the convention that $D^2 \bar C$ is evaluated along the limiting kernel gradient descent dynamics $Y_t$.

Proof.

It follows from the identity between the non-zero spectra of $I$ and $(D^2 \bar C)\, \tilde\Theta^{(L)}$, and from the asymptotics of the NTK [13]. ∎

3.4.1 Mean-Square Error

When the loss is the MSE, $D^2 \bar C$ is proportional to the identity. As a result, $I$ and the NTK Gram matrix $\tilde\Theta^{(L)}$ have the same non-zero eigenvalues up to a constant scaling. Because the NTK is asymptotically fixed, the spectrum of $I$ is also fixed in the limit.

The eigenvectors of the NTK Gram matrix are the kernel principal components of the data. The biggest principal components are the directions in function space which are most favored by the NTK. This gives a functional interpretation of the narrow valley structure in DNNs: the cliffs of the valley are the biggest principal components, while the flat directions are the smallest components.
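The identification of the non-zero spectrum of $I$ with that of the NTK Gram matrix is the elementary fact that $DY\,DY^T$ and $DY^T DY$ share their non-zero eigenvalues. A minimal sketch, with a random matrix standing in for the Jacobian:

```python
import numpy as np

rng = np.random.default_rng(5)
N, P = 12, 300                              # few datapoints, many parameters (overparametrized)
J = rng.standard_normal((N, P))             # stand-in for the Jacobian DY(theta)

I_mat = J.T @ J                             # I for the MSE loss (D^2 C proportional to Id), P x P
Theta = J @ J.T                             # NTK Gram matrix on the training set, N x N

eig_I = np.sort(np.linalg.eigvalsh(I_mat))[::-1]
eig_Theta = np.sort(np.linalg.eigvalsh(Theta))[::-1]
print(np.allclose(eig_I[:N], eig_Theta))    # same non-zero eigenvalues (kernel principal components)
print(np.max(np.abs(eig_I[N:])))            # the other P - N eigenvalues are ~0: the flat directions
```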

Remark 2.

As the depth of the network increases, one can observe two regimes [31, 32]: Order/Freeze, where the NTK converges to a constant, and Chaos, where the NTK converges to a Kronecker delta. In the Order/Freeze regime, the Gram matrix approaches a block-diagonal matrix with constant blocks, and as a result a few eigenvalues of $\tilde\Theta^{(L)}_\infty$ dominate the other ones, corresponding to constant directions along each output (this is in line with the observations of [30]). This leads to a narrow valley for the loss and slows down training. In contrast, in the Chaos regime, the NTK Gram matrix approaches a scaled identity matrix, and the spectrum of $I$ should hence concentrate around a positive value, hence speeding up training. Figure 3 illustrates this phenomenon: with the smooth ReLU we observe a narrow valley, while with the normalized smooth ReLU (which lies in the Chaos regime according to [32]) the narrowness of the loss is reduced. A similar phenomenon may explain why normalization helps smooth the loss surface and speed up training [33, 34].
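A caricature of the two regimes (the Gram matrices below are purely illustrative, a constant kernel versus a near-delta kernel):

```python
import numpy as np

N = 30
# Order/Freeze caricature: the kernel is nearly constant, so the Gram matrix has constant blocks.
ordered = np.full((N, N), 1.0) + 0.01 * np.eye(N)
# Chaos caricature: the kernel approaches a Kronecker delta, so the Gram matrix is close to identity.
chaotic = np.eye(N) + 0.01 * np.full((N, N), 1.0)

print(np.sort(np.linalg.eigvalsh(ordered))[::-1][:4])  # one dominating eigenvalue ~N, the rest tiny
print(np.sort(np.linalg.eigvalsh(chaotic))[::-1][:4])  # spectrum concentrated around 1
```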

3.4.2 Cross-Entropy Loss

For a binary cross-entropy loss with labels $y_i \in \{0, 1\}$ and probabilities $p_i = \big(1 + e^{-f_\theta(x_i)}\big)^{-1}$, the second derivative $D^2 \bar C$ is a diagonal matrix whose entries $p_i(1 - p_i)$ depend on the outputs $f_\theta(x_i)$ (but not on the labels $y_i$).

The eigenvectors of $I$ then correspond to the weighted kernel principal components of the data. The positive weights $p_i(1 - p_i)$ approach $\tfrac14$ as the output $f_\theta(x_i)$ goes to $0$, i.e. when it is close to the decision boundary from one class to the other, and as $|f_\theta(x_i)| \to \infty$ the weights go to zero. The weights evolve in time through the outputs $f_{\theta_t}(x_i)$; the spectrum of $I$ is therefore not asymptotically fixed as in the MSE case, but the functional interpretation of the spectrum in terms of the kernel principal components remains.
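A sketch of this weighted kernel PCA point of view, under the sigmoid convention assumed above; the random Jacobian and outputs are stand-ins for a real network.

```python
import numpy as np

rng = np.random.default_rng(6)
N, P = 10, 200
J = rng.standard_normal((N, P))                  # stand-in for the Jacobian DY(theta)
f = rng.standard_normal(N)                       # network outputs on the training set
p = 1.0 / (1.0 + np.exp(-f))                     # sigmoid probabilities
D = np.diag(p * (1.0 - p))                       # D^2 C for the binary cross-entropy: weights p(1-p)

I_mat = J.T @ D @ J                              # the matrix I for the cross-entropy loss
Theta = J @ J.T                                  # NTK Gram matrix
weighted = np.sqrt(D) @ Theta @ np.sqrt(D)       # weighted kernel PCA matrix

eig_I = np.sort(np.linalg.eigvalsh(I_mat))[::-1][:N]
eig_w = np.sort(np.linalg.eigvalsh(weighted))[::-1]
print(np.allclose(eig_I, eig_w))                 # non-zero spectrum of I = weighted kernel PCA spectrum
print(p * (1 - p))                               # weights: ~0.25 near the decision boundary, ~0 far from it
```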

4 Conclusion

We have given an explicit formula for the limiting moments of the Hessian of DNNs throughout training. We have used the common decomposition of the Hessian into the two terms $I$ and $S$ and have shown that the two terms are asymptotically mutually orthogonal, so that they can be studied separately.

The matrix $S$ vanishes in Frobenius norm as the network converges and has vanishing operator norm throughout training. The matrix $I$ is arguably the most important, as it describes the narrow valley structure of the loss around a global minimum. The eigendecomposition of $I$ is related to the (weighted) kernel principal components of the data w.r.t. the NTK.

References

Appendix A Proofs

For the proofs of the theorems and propositions presented in the main text, we reformulate the setup of [13]. For a fixed training set $x_1, \ldots, x_N$, we consider a (possibly random) time-varying training direction $d_t \in \mathbb{R}^{N n_L}$ which describes how each of the outputs must be modified. In the case of gradient descent on a cost $C$, the training direction is $d_t = -\nabla \bar C\big(Y(\theta_t)\big)$. The parameters are updated according to the differential equation

\[ \partial_t \theta_t = DY(\theta_t)^T d_t. \]

Under the condition that $\int_0^T \|d_t\|\, dt$ is stochastically bounded as the width of the network goes to infinity, the NTK converges to its fixed limit uniformly over $[0, T]$.

The reason we consider a general training direction (and not only the gradient of a loss) is that we can split a network in two at a layer $\ell$, and the training of the smaller network (the first $\ell$ layers) is then driven by the training direction obtained by backpropagating $d_t$ through the upper layers.

Because the derivatives of $\sigma$ are bounded and by Lemma 1 of the Appendix of [13], this training direction satisfies the required boundedness constraints even though it is not the gradient of a loss. As a consequence, the NTK of the smaller network also converges to its limit uniformly over $[0, T]$. As the width grows, the pre-activations and weights move at a rate of $O(1/\sqrt{n})$. We will use this rate of change to prove that other types of kernels are constant during training.

When a network is trained with gradient descent on a loss with BGOSS, the integral $\int_0^T \|d_t\|\, dt$ is stochastically bounded. Because the loss is decreasing during training, the outputs $Y(\theta_t)$ lie in the sublevel set $\{\bar C \leq \bar C(Y(\theta_0))\}$ for all times $t$. The norm of the gradient along the trajectory is hence bounded by a constant $B(\theta_0)$, a bound on the norm of the gradient over this sublevel set, for all times $t$. Because the distribution of the outputs at initialization converges to a multivariate Gaussian, $B(\theta_0)$ is stochastically bounded as the width grows. We then have the bound $\int_0^T \|d_t\|\, dt \leq T\, B(\theta_0)$, which is itself stochastically bounded.

For the binary and softmax cross-entropy losses the gradient is uniformly bounded:

Proposition 2.

For the binary cross-entropy loss and any output vector, the entries of the gradient are bounded in absolute value by $1$.

For the softmax cross-entropy loss $C$ on $m$ classes and any output vector, the entries of the gradient are bounded in absolute value by $1$.

Proof.

The binary cross-entropy loss with labels $y_i \in \{0, 1\}$ is

\[ C(f) = -\sum_{i=1}^{N} \Big( y_i \log p_i + (1 - y_i) \log(1 - p_i) \Big), \qquad p_i = \frac{1}{1 + e^{-f(x_i)}}, \]

and the gradient at an input $x_i$ is

\[ \partial_{f(x_i)} C = p_i - y_i, \]

which is bounded in absolute value by $1$ for both labels $y_i = 0$ and $y_i = 1$, such that the gradient is uniformly bounded.

The softmax cross-entropy loss over $m$ classes with labels $y_i \in \{1, \ldots, m\}$ is defined by

\[ C(f) = -\sum_{i=1}^{N} \log\!\left( \frac{e^{f_{y_i}(x_i)}}{\sum_{k=1}^{m} e^{f_k(x_i)}} \right). \]

The gradient at an input $x_i$ and output class $k$ is

\[ \partial_{f_k(x_i)} C = \frac{e^{f_k(x_i)}}{\sum_{k'=1}^{m} e^{f_{k'}(x_i)}} - \delta_{k, y_i}, \]

which is bounded in absolute value by $1$, such that the gradient is uniformly bounded. ∎

Appendix B Preliminaries

To study the moments of the matrix $S$, we first have to show that two tensors vanish as $n_1, \ldots, n_{L-1} \to \infty$ (sequentially):

We study these tensors recursively; for this, we need a recursive definition of the first derivatives $\partial_{\theta_p} f_\theta$ and second derivatives $\partial_{\theta_p} \partial_{\theta_q} f_\theta$. The values of these derivatives depend on the layers the parameters $\theta_p$ and $\theta_q$ belong to, and on whether they are connection weights or biases. The derivatives with respect to the parameters of the last layer are