An Investigation into Neural Net Optimization via Hessian Eigenvalue Density
Abstract
To understand the dynamics of optimization in deep neural networks, we develop a tool to study the evolution of the entire Hessian spectrum throughout the optimization process. Using this, we study a number of hypotheses concerning smoothness, curvature, and sharpness in the deep learning literature. We then thoroughly analyze a crucial structural feature of the spectra: in nonbatch normalized networks, we observe the rapid appearance of large isolated eigenvalues in the spectrum, along with a surprising concentration of the gradient in the corresponding eigenspaces. In batch normalized networks, these two effects are almost absent. We characterize these effects, and explain how they affect optimization speed through both theory and experiments. As part of this work, we adapt advanced tools from numerical linear algebra that allow scalable and accurate estimation of the entire Hessian spectrum of ImageNetscale neural networks; this technique may be of independent interest in other applications.
An Investigation into Neural Net Optimization via Hessian Eigenvalue Density
Behrooz Ghorbani^{†}^{†}thanks: Work was done while author was an intern at Google. Department of Electrical Engineering Stanford University ghorbani@stanford.edu Shankar Krishnan Machine Perception, Google Inc. skrishnan@google.com Ying Xiao Machine Perception, Google Inc. yingxiao@google.com
noticebox[b] \end@float
1 Introduction
The Hessian of the training loss (with respect to the parameters) is crucial in determining many behaviors of neural networks. The eigenvalues of the Hessian characterize the local curvature of the loss which, for example, determine how fast models can be optimized via firstorder methods (at least for convex problems), and is also conjectured to influence the generalization properties. Unfortunately, even for moderate sized models, exact computation of the Hessian eigenvalues is computationally impossible. Previous studies on the Hessian have focused on small models, or are limited to computing only a few eigenvalues [23, 24, 30]. In the absence of such concrete information about the eigenvalue spectrum, many researchers have developed clever ad hoc methods to understand notions of smoothness, curvature, sharpness, and poor conditioning in the landscape of the loss surface. Examples of such work, where some surrogate is defined for the curvature, include the debate on flat vs sharp minima [16, 5, 29, 15], explanations of the efficacy of residual connections [19] and batch normalization [25], the construction of lowenergy paths between different local minima [6], qualitative studies and visualizations of the loss surface [11], and characterization of the intrinsic dimensionality of the loss [18]. In each of these cases, detailed knowledge of the entire Hessian spectrum would surely be informative, if not decisive, in explaining the phenomena at hand.
In this paper, we develop a tool that allows us access to the entire spectrum of a deep neural network. The tool is both highly accurate (we validate it to a doubleprecision accuracy of for a 15000 parameter model), and highly scalable (we are able to generate the spectra of Resnets [13] and Inception V3 [27] on ImageNet in a small multiple of the time it takes to train the model). The underlying algorithm is extremely elegant, and has been known in the numerical analysis literature for decades [10]; here we introduce it to the machine learning community, and build (and release) a system to run it at modern deep learning scale.
This algorithm allows us to peer into the optimization process with unprecedented clarity. By generating Hessian spectra with fine time resolution, we are able to study all phases of training, and are able to comment fruitfully on a number of hypotheses in the literature about the geometry of the loss surface. Our main experimental result focuses on the role of outlier eigenvalues, we analyze how the outlier eigenvalues affect the speed of optimization; this in turn provides significant insight into how batch normalization [14], one of the most popular innovations in training deep neural nets, speeds up optimization.
We believe our tool and style of analysis will open up new avenues of research in optimization, generalization, architecture design etc. So we release our code to the community to accelerate a Hessian based analysis of deep learning.
1.1 Contributions
In this paper, we empirically study the full Hessian spectrum of the loss function of deep neural networks. Our contributions are as follows:
In Section 2, we introduce a tool and a system, for estimating the full Hessian spectrum, capable of tackling models with tens of millions of parameters, and millions of data points. We both theoretically prove convergence properties of the underlying algorithm, and validate the system to double precision accuracy on a toy model.
In Section 3, we use our tool to generate Hessian spectra along the optimization trajectory of a variety of deep learning models. In doing so, we revisit a number of hypotheses in the machine learning literature surrounding curvature and optimization. With access to the entire Hessian spectrum, we are able to provide new perspectives on a variety of interesting problems: we concur with many of the coarse descriptions of the loss surface, but disagree with a number of hypotheses about how learning rate and residual connections interact with the loss surface. Our goal is not necessarily to provide proofs or refutation – at the very least, that would require the study of a more diverse set of models – but to provide strong evidence for/against certain interesting ideas, and simultaneously to highlight some applications of our tool.
In Section 4, we observe that models with significant outlier Hessian eigenvalues exhibit slow training behavior. We provide a theoretical justification for this in Section 4.1 – we argue that a nontrivial fraction of energy of the Hessian is distributed across the bulk in tiny eigenvalues, and that a coupling between the stochastic gradients and the outlier eigenvalues prevents progress in those directions. We then show that batch normalization pushes these outliers back into the bulk, and are able to isolate this effect by ablating the batch normalization operation. In Section 4.2, we confirm the predictions of our hypothesis by studying a careful intervention to batch normalization that causes the resurgence of outlier eigenvalues, and dramatic slowdowns in optimization.
1.2 Related Work
Empirical analysis of the Hessian has been of significance interest in the deep learning community. Due to computational costs of computing the exact eigenvalues ( for an explicit matrix), most of the papers in this line of research either focus on smaller models or on lowdimensional projections of the loss surface. Sagun et al. [23, 24] study the spectrum of the Hessian for small twolayer feedforward networks. They show that the spectrum is divided into two parts: (1) a bulk concentrated near zero which includes almost all of the eigenvalues and (2) roughly “number of classes  1” outlier eigenvalues emerging from the bulk. We extend this analysis in two ways. First, we calculate the Hessian for models with parameters on datasets with examples – we find that many, but not all of the above observations hold at this scale, and refine some of their observations. Secondly, we leverage the scalability of our algorithm to compute and track the Hessian spectrum throughout the optimization (as opposed to only at the end). Observing this evolution allows us to study how individual architecture choices affect optimization. There is an extensive literature regarding estimating the eigenvalues distribution of large matrices (for a small survey, see [20]). The algorithm we use is due to Golub and Welsch [10]. While many of these algorithms have theoretical guarantees, their empirical success is highly dependent on the problem structure. We perform a thorough comparison of our work to the recent proposal of [2] in Appendix D.
Batch Normalization (BN) [14] is one of the most influential innovations in optimizing deep neural networks as it substantially reduces the training time and the dependence of the training on initialization. There has been much interest in determining the underlying reasons for this effect. The original BN paper suggests that as the model trains, the distribution of inputs to each layer changes drastically, a phenomenon called internal covariance shift (ICS). They suggest that BN improves training by reducing ICS. There has been a series of exciting new works exploring the effects of BN on the loss surface. Santurkar et al. [25] empirically show that ICS is not necessarily related to the success of the optimization. They instead prove that under certain conditions, the Lipschitz constant of the loss and smoothness of the loss with respect to the activations and weights of a linear layer are improved when BN is present. Unfortunately, these bounds are on a perlayer basis; this yields bounds on the diagonal blocks of the overall Hessian, but does not directly imply anything about the overall smoothness of the entire Hessian. In fact even exact knowledge of for the entire Hessian and parameter norms (to control the distance from the optimum) is insufficient to determine the speed of optimization: in Section 4.2, we exhibit two almost identical networks that differ only in the way batch norm statistics are calculated; they have almost exactly the same largest eigenvalue and the parameters have the same scale, yet the optimization speeds are vastly different.
During the preparation of this paper, [21] appeared on Arxiv which briefly introduces the same spectrum estimation methodology and studies the Hessian on small subsamples of MNIST and CIFAR10 at the end of the training. In comparison, we provide a detailed exposition, error analysis and validation of the estimator in Section 2, and present optimization results on full datasets, up to and including ImageNet.
1.3 Notation
Neural networks are trained iteratively. We call the estimated weights at optimization iteration , , . We define the loss associated with batch be . The fullbatch loss is defined as where is the number of batches.^{1}^{1}1We define the loss in terms of perbatch loss (as opposed to the per sample loss) in order to accommodate networks that use batch normalization. The Hessian, is a symmetric matrix such that . Note that our Hessians are all “fullbatch” Hessians (i.e., they are computed using the entire dataset). When there is no confusion, we represent with . Throughout the paper, has the spectral decomposition where , and .
2 Accurate and Scalable Estimation of Hessian Eigenvalue Densities for
To understand the Hessian, we would like to compute the eigenvalue (or spectral) density, defined as where is the Dirac delta operator. The naive approach requires calculating ; however, when the number of parameters, , is large this is not tractable. We relax the problem by convolving with a Gaussian density of variance to obtain:
(1) 
where For small enough , provides all practically relevant information regarding the eigenvalues of . Explicit representation of the Hessian matrix is infeasible when is large, but using Pearlmutter’s trick [22] we are able to compute Hessianvector products for any chosen vector.
2.1 Stochastic Lanczos Quadrature
It has long been known in the numerical analysis literature that accurate stochastic approximations to the eigenvalue density can be achieved with much less computation than a full eigenvalue decomposition. In this section, we describe the stochastic Lanczos quadrature algorithm [10]. Although the algorithm is already known, its mathematical complexity and potential as a research tool warrant a clear exposition for a machine learning audience. We give the pseudocode in Algorithm 1, and describe the individual steps below, deferring a discussion of the various approximations to Section 2.2.
Since is diagonalizable and is analytic, we can define where acts pointwise on the diagonal of . Now observe that if , we have
(2) 
Thus, as long as concentrates fast enough, to estimate , it suffices to sample a small number of random ’s and average .
By definition, we can write
(3) 
where . Instead of summing over the discrete index variable , we can rewrite this as a RiemannStieltjes integral over a continuous variable weighted by :
(4) 
where is a CDF (note that the probability density is a sum of delta functions that directly recovers Equation 2.1)^{2}^{2}2Technically is a positive measure, not a probability distribution, because only concentrates on 1. This wrinkle is irrelevant..
To evaluate this integral, we apply a quadrature rule (a quadrature rule approximates an integral as a weighted sum – the wellknown highschool trapezoid rule is a simple example). In particular, we want to pick a set of weights and a set of nodes so that
(5) 
The hope is that there exists a good choice of where such that and are close for all , and that we can find the nodes and weights efficiently for our particular integrand and the CDF . The construction of a set of suitable nodes and weights is a somewhat complicated affair. It turns out that if the integrand were a polynomial of degree , with small enough compared to , it is possible to compute the integral exactly,
(6) 
Theorem 2.1 ([9] Chapter 6).
Fix . For all , there exists an approximation rule generating nodeweight pairs such that for any polynomial, with , (6) is true. This approximation rule is called the Gaussian quadrature. The degree achieved is maximal: for a general , no other approximation rule can guarantee exactness of Equation (6) for higher degree polynomials.
The Gaussian quadrature rule always generates nonnegative weights. Therefore, as , it is guaranteed that which is a desirable property for a density estimate. For these reasons, despite the fact that our integrand is not a polynomial, we use the Gaussian quadrature rule. For the construction of the Gaussian quadrature nodes and weights, we rely on a deep connection between Gaussian quadrature and Krylov subspaces via orthogonal polynomials. We refer the interested reader to the excellent [9] for this connection.
Theorem 2.2 ([10]).
Let and be the incomplete basis resulting from applying QR factorization on . Let and be the spectral decomposition of . Then the Gaussian quadrature nodes are given by , and the Gaussian quadrature weights are given by .
Theorem 2.2 presents a theoretical way to compute the Gaussian quadrature rule (i.e., apply the matrix repeatedly and orthogonalize the resulting vectors). There are wellknown algorithms that circumvent calculating the numerically unstable , and compute and directly. We use Lanczos algorithm [17] (with full reorthogonalization) to perform this computation in a numerically stable manner.
2.2 Accuracy of Gaussian Quadrature Approximation
Intuition suggests that as long as is close to some polynomial of degree at most , our approximation must be accurate (i.e., Theorem 2.1). Crucially, it is not necessary to know the exact approximating polynomial, its mere existence is sufficient for an accurate estimate. There exists an extensive literature on bounding this error; [28] prove that under suitable conditions that
(7) 
where . The constant is closely tied to how well can be approximated by Chebyshev polynomials. ^{3}^{3}3We refer the interested reader to [28, 4] for more details In our setting, as decreases, higherorder polynomials become necessary to approximate well. Therefore, as decreases, decreases and more Lanczos iterations become necessary to approximate the integral well.
To establish a suitable value of , we perform an empirical analysis of the error decay when corresponds to a neural network loss Hessian. In Appendix B, we study this error on a 15910 parameter feedforward MNIST network, where the model is small enough that we can compute exactly. For , a quadrature approximation of order achieves maximum doubleprecision accuracy of . Following these results, we use for our experiments. Equation 7 implies that the error decreases exponentially in , and since GPUs are typically run in single precision, our is an extremely conservative choice.
2.3 Concentration of the Quadratic Forms
Although is an unbiased estimator for , we must still study its concentration towards its mean. We prove:
Claim 2.3.
Let be a fixed evaluation point and be the number of realizations of in step II. Let and . Then for any ,
Alternatively, since is a Gaussian density, we can give norm independent bounds: ,
(8) 
where .
2.4 Implementation, Validation and Runtime
We implemented a large scale version of Algorithm 1 in TensorFlow [1]; the main component is a distributed Lanczos Algorithm. We describe the implementation and performance in Appendix C. To validate our system, we computed the exact eigenvalue distribution on the 15910 parameter MNIST model. Our proposed framework achieves which corresponds to an extremely accurate solution. The largest model we’ve run our algorithm on is Inception V3 on ImageNet. The runtime is dominated by the application of the Hessianvector products within the Lanczos algorithm; we run fullbatch Hessian vector products. The remaining cost of the Lanczos algorithm is negligible at floating point operations. For a Resnet18 on ImageNet, running a single draw takes about half the time of training the model.
3 Spectral densities throughout optimization
The tool we developed in Section 2 gives us an unprecedented ability to examine the loss landscape of deep neural networks. In particular, we can track the spectral density throughout the entire optimization process. Our goal in this section is to provide direct curvature evidence for (and against) a number of hypotheses about the loss surface and optimization in the literature. We certainly can not conclusively prove or refute even a single hypothesis within the space constraints, but we believe that the evidence is very strong in many of these cases.
For our analysis, we study a variety of Resnet and VGG [26] architectures on both CIFAR10 and ImageNet. Details are presented in Appendix F. The Resnet32 on CIFAR10 has parameters; all other models have at least . For the sake of consistency, our plots in this section are of Resnet spectral densities; we have reproduced all these results on nonresidual (VGG) architectures.
At initialization, we observe that large negative eigenvalues dominate the spectrum. However, as Figure 2 shows, in only very few steps ( of the total number of steps; we made no attempt to optimize this bound), these large negative eigenvalues disappear and the overall shape of the spectrum stabilizes. Sagun et al. [23] had observed a similar disappearance of negative eigenvalues for toy feedforward models after the training, but we are able to pinpoint this phase to the very start of optimization. This observation is readily reproducible on ImageNet.
Throughout the rest of the optimization, the spectrum is almost entirely flat, with the vast majority ( of eigenvalues being close to 0). This is in accordance with the ideas of Li et al. [18], who hypothesize that the loss surface has low intrinsic dimensionality, and also with results of Sagun et al. on toy models. In the case of class classification with small twolayer feedforward networks, Sagun et al. had observed that the Hessian spectrum contains roughly outliers which are a few orders of magnitudes larger than the rest of the eigenvalues. Contrary to this, we find that the emergence of these outliers is highly dependent on whether BN is present in the model or not. We study this behavior in depth in Section 4.
Also in Sagun et al. is the observation that the negative eigenvalues at the end of the training are orders of magnitude smaller than the positive ones. While we are able to observe this on CIFAR10, what happens on ImageNet seems to be less clear (Figure 3). We can derive a useful metric by integrating the spectral densities. At the end of optimization, the total energy of the negative eigenvalues is comparable to that of the positive eigenvalues (0.434 vs 0.449), and the energy is smaller, but still far from zero (0.025 vs 0.036). In comparison, on CIFAR10 the energies are 0.025 and 0.179 in the negative and positive components respectively. We believe that the observation of Sagun et al. may be an artifact of the tiny datasets used – on MNIST and CIFAR10 one can easily attain zero classification loss (presumably a global minimum); on ImageNet, even a much larger model will fail to find a zero loss solution.
Jastrzkebski et al. [15], building on a line of work surrounding flat and sharp minima, hypothesized that lower learning rates correspond to sharper optima. We consider this question by inspecting the spectral densities immediately preceding and following a learning rate drop. According to the hypothesis, we would then expect the spectral density to exhibit more extremal eigenvalues. In fact, we find the exact opposite to be true in Figure 4 – not only do the large eigenvalues contract substantially after the learning rate drop at 40k steps, we have a lower density at all values of except in a tiny ball around 0. This is an extremely surprising result, and violates the common intuition that lower learning rates allow one to slip into small, sharp crevices in the loss surface. We note that this is not a transient phenomenon – the spectrum before and afterwards are stable over time.
Finally, Li et al. [19] recently hypothesized that adding residual connections significantly smooths the optimization landscape, producing a series of compelling twodimensional visualizations. We compared a Resnet32 with and without residual connections, and we observe in Figure 5 that all eigenvalues contract substantially towards zero. This is contrary to the visualizations of Li et al.
4 Outlier Eigenvalues Slow Optimization; Batch Norm Suppresses Outliers
In some of the spectral densities presented so far, perhaps the most salient feature is the presence of a small number of outlier eigenvalues that are located far from the bulk of the spectrum. We noticed that these outliers are much larger and much further from the bulk for some architectures than others (i.e., for VGG the outliers are extremely far, less so for Resnets). Suspecting that batch normalization was the crucial difference, we ran a series of ablation experiments contrasting the spectral density in the presence and absence of batch normalization (i.e., we added BN to models that did not already have it, and removed BN from models that already did). Figure 8 contrasts the the Hessian spectrum in the presence of BN vs the spectrum when BN is removed. The experiment yields the same results on VGG on CIFAR10 (Figure 9), and Resnet18 on ImageNet (Figure 7), and at various points through training.
Our experiments reveal that, in the presence of BN, the largest eigenvalue of the Hessian, tend to not to deviate as much from the bulk. In contrast, in nonBN networks, the outliers grow much larger, and further from the bulk. To probe this behavior further we formalize the notion of an outlier with a metric:
This provides a scaleinvariant measure of the presence of outliers in the spectrum. In particular, if (as suggested by Sagun et al. [23, 24] outliers are present in the spectrum, we expect . Figure 6 plots throughout training. It is evident that relative large eigenvalues appear in the spectrum. Normalization layer induces an odd dependency on parameter scale – scaling the (batch normalized) weights leads to unchanged activations, and inversely scales the gradients. Obviously, we can not conclude that the problem is much easier! Thus, for studying the optimization performance of batch normalization, we must have at least a global scaling invariant quantity – which is. In contrast, the analysis in [25] varies wildly with scale^{4}^{4}4We have also tried normalizing individual weights matrices and filters, but this leads to blowup in some gradient components..
Informed by the experimental results in this section, we hypothesize a mechanistic explanation for why batch normalization speeds up optimization: it does so via suppression of outlier eigenvalues which slow down optimization.
4.1 Mechanisms by which outliers slow optimization
In this section, we seek to answer the question “Why do outlier eigenvalues slow optimization?” One answer to this question is obvious. Large implies that one must use a very low learning rate; but this an incomplete explanation – has to be large with respect to the rest of the spectrum. To make this explicit, consider a simple quadratic approximation to the loss around the optimum, :
(9) 
where without loss of generality, we assume with . We can easily show that when optimized with gradient descent with a learning rate sufficiently small for convergence that in the eigenbasis, we have:
(10) 
For all directions where is small with respect to , we expect convergence to be slow. One might hope that these small do not contribute significantly to the loss; unfortunately, when we measure this in a Resnet32 with no batch normalization, a small ball around 0 accounts for almost 50% of the total energy of the Hessian eigenvalues for a converged model (the reflects the loss function ). Thus to achieve successful optimization, we are forced to optimize these slowly converging directions^{5}^{5}5While the loss function in deep nets is not quadratic, the intuition that the result above provides is still valid in practice..
A second, more pernicious reason lies in the interaction between the large eigenvalues of the Hessian and the stochastic gradients. Define the covariance of the (stochastic) gradients at time to be
(11) 
The eigenvalue density of characterizes how the energy of the (minibatch) gradients is distributed (the tools of Section 2 apply just as well here). As with the Hessian, we observe that in nonBN networks the spectrum of has outlier eigenvalues (Figure 10). Throughout the optimization, we observe that almost all of the gradient energy is concentrated in these outlier subspaces (Figure 11), reproducing an observation of GurAri et al. [12]^{6}^{6}6In addition, we numerically verify that the outlier subspaces of and mostly coincide: throughout the optimization, for a Resnet32, of the energy of the outlier Hessian eigenvectors lie in the outlier subspace of .. We observe that when BN is introduced in the model, this concentration subsides substantially.
Since almost all of the gradient energy is in the very few outlier directions, the projection of the gradient in the complement of this subspace is minuscule. Thus, most gradient updates do not optimize the model in the flatter directions of the loss. As argued earlier, a significant portion of the loss comes from these flatter directions and a large fraction of the path towards the optimum lies in these subspaces. The fact that the gradient vanishes in these directions forces the training to be very slow.
Stated differently, the argument above suggest that, in nonBN networks, the gradient is uninformative for optimization, i.e., moving towards the (negative) gradient hardly takes us closer to the optimum . To support this argument, we plot the normalized inner product between the path towards the optimum, , ^{7}^{7}7We use the parameter at the end of the training as a surrogate for . and the gradients, , throughout the training trajectory (Figure 12). The figure suggests that the direction given by the gradient is almost orthogonal to the path towards the optimum. Moreover, the plot suggests that in BN networks, where the gradient is less concentrated in the highcurvature directions, the situation is significantly better.
In Appendix E, we study the relationship of the Hessian outliers with the concentration of the gradient phenomenon in a simple stochastic quadratic model. We show that when the model is optimized via stochastic gradients, outliers in the Hessian spectrum overinfluence the gradient and cause it to concentrate in their direction. As argued above, gradient concentration is detrimental to the optimization process. Therefore, this result suggests yet another way in which outlier eigenvalues in disrupt training.
4.2 Testing our hypothesis
Our hypothesis that batch norm suppresses outliers, and hence speeds up training, is simple enough to allow us to make predictions based on it. The original batch normalization paper [14] observed that the normalization parameters of BN, and , have to be computed (and backpropagated through) using the minibatch. If are computed using the complete dataset, the training becomes slow and unstable. Therefore, we postulate that when and are calculated from the population (i.e. fullbatch) statistics, the outliers persist in the spectrum.
To test our prediction, we train a Resnet32 on Cifar10 once using minibatch normalization constants (denoted by minibatchBN network), and once using fullbatch normalization constants (denoted by fullbatchBN network). The model trained with fullbatch statistics trains much slower (Appendix G). Figure 13 compares the spectrum of the two networks in the early stages of the training (the behavior is the same during the rest of training). The plot suggests strong outliers are present in the spectrum with fullbatchBN. This observation supports our hypothesis. Moreover, we observe that the magnitude of the largest eigenvalue of the Hessian in between the two models is roughly the same throughout the training. Given that fullbatchBN network trains much more slowly, this observation shows that analyses based on the top eigenvalue of the Hessian do not provide the fullpicture of the optimization hardness.
5 Conclusion
We presented tools from advanced numerical analysis that allow for computing the spectrum of the Hessian of deep neural networks in an extremely accurate and scalable manner. We believe this tool is valuable for the research community as it gives a comprehensive view of the local geometry of the loss. This information can be used to further our understanding of neural networks.
We used this toolbox to study how the loss landscape locally evolves throughout the optimization. We uncovered surprising phenomena, some of which run contrary to the widely held beliefs in the machine learning community. In addition, we provided simple and clear answers to how batchnormalization speeds up training. We believe that BN is only one of the many architecture choices that can be studied using our framework. Studying these other architecture choices can be an interesting avenue for future research.
References
 [1] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. Tensorflow: a system for largescale machine learning. In OSDI, volume 16, pages 265–283, 2016.
 [2] Ryan P Adams, Jeffrey Pennington, Matthew J Johnson, Jamie Smith, Yaniv Ovadia, Brian Patton, and James Saunderson. Estimating the spectral density of large implicit matrices. arXiv preprint arXiv:1802.03451, 2018.
 [3] P Bellec. Concentration of quadratic forms under a Bernstein moment assumption. Technical report, Technical report, Ecole Polytechnique, 2014.
 [4] Laurent Demanet and Lexing Ying. On chebyshev interpolation of analytic functions. preprint, 2010.
 [5] Laurent Dinh, Razvan Pascanu, Samy Bengio, and Yoshua Bengio. Sharp minima can generalize for deep nets. arXiv preprint arXiv:1703.04933, 2017.
 [6] Felix Draxler, Kambis Veschgini, Manfred Salmhofer, and Fred A Hamprecht. Essentially no barriers in neural network energy landscape. arXiv preprint arXiv:1803.00885, 2018.
 [7] Amparo Gil, Javier Segura, and Nico M Temme. Numerical methods for special functions, volume 99. Siam, 2007.
 [8] Github. Tensorflow models. https://github.com/tensorflow/models/blob/master/official/, 2017.
 [9] Gene H Golub and Gérard Meurant. Matrices, moments and quadrature with applications, volume 30. Princeton University Press, 2009.
 [10] Gene H Golub and John H Welsch. Calculation of gauss quadrature rules. Mathematics of computation, 23(106):221–230, 1969.
 [11] Ian J Goodfellow, Oriol Vinyals, and Andrew M Saxe. Qualitatively characterizing neural network optimization problems. arXiv preprint arXiv:1412.6544, 2014.
 [12] Guy GurAri, Daniel A Roberts, and Ethan Dyer. Gradient descent happens in a tiny subspace. arXiv preprint arXiv:1812.04754, 2018.
 [13] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
 [14] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
 [15] Stanisław Jastrzębski, Zachary Kenton, Devansh Arpit, Nicolas Ballas, Asja Fischer, Yoshua Bengio, and Amos Storkey. Three factors influencing minima in sgd. arXiv preprint arXiv:1711.04623, 2017.
 [16] Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On largebatch training for deep learning: Generalization gap and sharp minima. arXiv preprint arXiv:1609.04836, 2016.
 [17] Cornelius Lanczos. An iteration method for the solution of the eigenvalue problem of linear differential and integral operators. United States Governm. Press Office Los Angeles, CA, 1950.
 [18] Chunyuan Li, Heerad Farkhoor, Rosanne Liu, and Jason Yosinski. Measuring the intrinsic dimension of objective landscapes. arXiv preprint arXiv:1804.08838, 2018.
 [19] Hao Li, Zheng Xu, Gavin Taylor, Christoph Studer, and Tom Goldstein. Visualizing the loss landscape of neural nets. In Advances in Neural Information Processing Systems, pages 6391–6401, 2018.
 [20] Lin Lin, Yousef Saad, and Chao Yang. Approximating spectral densities of large matrices. SIAM review, 58(1):34–65, 2016.
 [21] Vardan Papyan. The full spectrum of deep net hessians at scale: Dynamics with sample size. arXiv preprint arXiv:1811.07062, 2018.
 [22] Barak A Pearlmutter. Fast exact multiplication by the hessian. Neural computation, 6(1):147–160, 1994.
 [23] Levent Sagun, Leon Bottou, and Yann LeCun. Eigenvalues of the hessian in deep learning: Singularity and beyond. arXiv preprint arXiv:1611.07476, 2016.
 [24] Levent Sagun, Utku Evci, V Ugur Guney, Yann Dauphin, and Leon Bottou. Empirical analysis of the hessian of overparametrized neural networks. arXiv preprint arXiv:1706.04454, 2017.
 [25] Shibani Santurkar, Dimitris Tsipras, Andrew Ilyas, and Aleksander Madry. How does batch normalization help optimization?(no, it is not about internal covariate shift). arXiv preprint arXiv:1805.11604, 2018.
 [26] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for largescale image recognition. arXiv preprint arXiv:1409.1556, 2014.
 [27] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2818–2826, 2016.
 [28] Shashanka Ubaru, Jie Chen, and Yousef Saad. Fast estimation of tr(f(a)) via stochastic lanczos quadrature. SIAM Journal on Matrix Analysis and Applications, 38(4):1075–1099, 2017.
 [29] Lei Wu, Zhanxing Zhu, et al. Towards understanding generalization of deep learning: Perspective of loss landscapes. arXiv preprint arXiv:1706.10239, 2017.
 [30] Zhewei Yao, Amir Gholami, Qi Lei, Kurt Keutzer, and Michael W Mahoney. Hessianbased analysis of large batch training and robustness to adversaries. arXiv preprint arXiv:1802.08241, 2018.
Appendix A Concentration of Quadratic Forms
The following lemma is one result on the concentration of quadratic forms:
Lemma A.1 (Concentration of Quadratic Forms, [3]).
Let . Let be any matrix. Then, ,
We are now ready to prove Claim 2.3.
Proof.
Consider the blockdiagonal matrix . Then, where is the concatenation of the realizations of divided by . Now observe that is i.i.d . Therefore, by Lemma A.1,
Now observe that and . Therefore, we get
(12) 
Figure 14 shows how changes with respect to probability bound in the worst case bound . We can see that even with modest values of , we can achieve tight bounds on with high probability.
Appendix B Numerical Verification on Small Models
Figure 15 shows how fast converges to as increases in terms of total variation () distance.
Before going to large scale experiments, we empirically demonstrate the accuracy of our proposed framework on a small model where the Hessian eigenvalues can be computed exactly. Let’s consider a feedforward neural network trained on MNIST examples with hidden layer of size , corresponding to parameters. The Hessian of networks of this type were studied earlier in [24] where it was shown that, after training, the spectrum consists of a bulk near zero and a few outlier eigenvalues. In our example, the range roughly corresponds to the bulk and corresponds to the outlier eigenvalues. Figures 1 and 16 compare our estimates with the exact smoothed density on each of these intervals. Our results show that with a modest number of quadrature points (90 here) we are able to approximate the density extremely well. Our proposed framework achieves which corresponds to an extremely accurate solution. As demonstrated in Figure 16, our estimator detects the presence of outlier eigenvalues. Therefore, the information at the edges of is also recovered.
Appendix C Implementation Details
The implementation of Algorithm 1 for a single machine is straightforward and can be done in a few lines of code. Scaling it to run on a 27 million parameter Inception V3 [27] on ImageNet (where we performed our largest scale experiments) requires a significant engineering effort.
The major component is a distributed Lanczos algorithm. Because modern deep learning models and datasets are so large, it is important to be able to run Hessianvector products in parallel across multiple machines. At each iteration of the Lanczos algorithm, we need to compute a Hessianvector product on the entire dataset. To do so, we split the data across all our workers (each one of which is endowed with one or more GPUs), each worker computes minibatch Hessianvector products, and these products are summed globally in an accumulator. Once worker is done on its partition of the data, it signals via semaphore to the chief that it is done. When all workers are done, the chief computes completes the Lanczos iteration by applying a QR orthogonalization step to total Hessianvector product. When the chief is done, it writes the result to shared memory and raises all the semaphores to signal to the workers to start on a new iteration.
For the Hessianvector products, we are careful to eliminate all nondeterminism from the computation, including potential subsampling from the data, shuffle order (this affects e.g., batch normalization), random number seeds for dropout and data augmentation, parallel threads consuming data elements for summaries etc. Otherwise, it is unclear what matrix the Lanczos iteration is actually using.
Although GPUs typically run in single precision, it is important to perform the Hessianvector accumulation in double precision. Similarly, we run the orthogonalization in the Lanczos algorithm in double precision. TensorFlow variable updates are not atomic by default, so it is important to turn on locking, especially on the accumulators. TensorFlow lacks communication capability between workers, so the coordination via semaphores (untrainable tf.Variables) is crude but necessary.
For a CIFAR10, on 10 Tesla P100 GPUs, it takes about an hour to compute 90 Lanczos iterations. For ImageNet, a Resnet18 takes about 20 hours to run 90 Lanczos iterations. An Inception V3 takes far longer, at about 3 days, due to needing to use 2 GPUs per worker to fit the computation graph. We were unable to run any larger models due to an unexpected OOM bugs in TensorFlow. It should be straightforward to obtain a 50100% speedup – we use the default TensorFlow parameter server setup, and one could easily reduce wasteful network transfers of model parameters from parameter servers for every minibatch, and conversely from transferring every minibatch Hessianvector product back to the parameter servers. We made no attempt to optimize these variable placement issues.
For the largest models, TensorFlow graph optimizations via Grappler can dramatically increase peak GPU memory usage, and we found it necessary to manage these carefully.
Appendix D Comparison with Other Spectrum Estimation Methods
There is an extensive literature on estimation of spectrum of large matrices. A large fraction of the algorithms in this literature relay on explicit polynomial approximations to . To be more specific, these methods approximate with a polynomial of degree , . In step I of Algorithm 1, is approximated by
(14) 
If is a good approximation for , we expect .
Since is a polynomial, (14) can be exactly evaluated as soon as
(15) 
are known. Note that by definition,
Therefore, if done carefully, can be computed by performing Hessianvector products in total. Hence, by performing Hessianvector products one can run Algorithm 1 with different realizations of .
This approximation framework is arguably simpler than Gaussian quadrature method as it does not have to cope with complexities of Lanczos algorithm. Therefore, it is has been extensively used in the numerical linear algebra literature. The polynomial approximation step is usually done via Chebyshev polynomials. This class of polynomials enjoy strong computational and theoretical properties that make them suitable for approximating smooth functions. For more details on Chebyshev polynomials we refer the reader to [7].
Recently, there has been a proposal to use Chebyshev approximation for estimating the Hessian spectrum for deep networks [2]. For completeness, we compare the performance of this algorithm with the Gaussian quadrature rule on the feedforward network defined earlier.
Figure 17 shows the performance of the Chebyshev method in approximating . The hyperparameters are selected such that the performance of the Chebyshev method in Figure 17 is directly comparable with the performance of Gaussian quadrature in Figure 1. In particular, both approximations take the same amount of computation (as measured by the number of Hessianvector products) and they both use the same kernel width (). As the figure shows, the Chebyshev method utterly fails to provide a decent approximation to the spectrum. As it can be seen from the figure, almost all of the details of the spectrum are masked by the artifacts of the polynomial approximation. In general, we expect the Chebyshev method to require orders of magnitude more Hessianvector products to match the accuracy of the Gaussian quadrature.
It is not a surprise that explicit polynomial approximation fails to provide a good solution. For small kernel widths, extremely high order polynomials are necessary to approximate the kernel well. Figure 18 shows how well Chebyshev polynomials approximate the kernel with . The figure suggests that even with a degree approximation, there is a significant difference between the polynomial approximation and the exact kernel.
Appendix E Gradient Concentration in the Quadratic Case
In this section, we theoretically show the phenomenon of gradient concentration on a simple quadratic loss function with stochastic gradient descent. The loss function is of the form
where the ordered (in decreasing order) eigenpairs of are (implies ) and the iteration starts at . We model the stochastic loss (from which we compute the gradients for SGD) as
where is a random variable such that and . In order to understand gradient concentration, we look at the alignment of individual SGD updates with individual eigenvectors of the Hessian. We are now ready to prove the following theorem.
Theorem E.1.
Consider a single gradient descent iteration, with a constant learning rate for a constant . Then,
(16) 
for some sufficiently large constant as .
Proof.
Each stochastic gradient step has the form . Expanding the recurrence induced by gradient step over steps, we can write
Therefore a single update can be expanded as
We can write the above equation as , where
Consider the dot product of this update with one of the eigenvectors . Clearly from the form of the update . We now quantify the variance of the update in the direction of . Using the identity , it is easy to see that
Squaring the sum of the two terms above and taking expectations, only the squared terms survive. We write the
As , the first term above goes to 0. This suggests that in the absence of noise in the gradients there is no reason to expect any alignment of the gradient updates with the eigenvectors of the Hessian. However, the second term (after some algebraic simplification) can be written as
Parameterizing completes the proof.
∎
A couple of observations are appropriate here. We can see that as the separation of eigenvalues increases, gradient updates align quadratically with the top eigenspaces. By manipulating the alignment of with the top eigenspaces of , we can dramatically change the concentration of updates. For example, if was similar to , the alignment with the top eigenspaces can be enhanced. If was similar to , the alignment with the top eigenspaces can be diminished. We have seen that, even in practice, if we could control the noise in the gradients, we can hamper or improve optimization in significant ways.
Appendix F Experimental Details
On CIFAR10, our models of interest are:
 Resnet32:

This model is a standard Resnet32 with 460k parameters. We train with SGD and a batch size of 128, and decay the learning from 0.1 by factors of 10 at step 40k, 60k, 80k. This attains a validation of 92% with data augmentation (and around 85% without)
 VGG11:

This model is a slightly modified VGG11 architecture. Instead of the enormous final fully connected layers, we are able to reduce these to 256 neurons with only a little degradation in validation accuracy (81% vs 83% with a 2048 size fully connected layers). We train with a constant SGD learning rate of 0.1, and a batch size of 128. This model has over 10 million parameters.
To ensure that our models have a finite local minimum, we introduce a small label smoothing of 0.1. This does not affect the validation accuracy; the only visible effect is that the lowest attained cross entropy loss is the entropy 0.509.
On ImageNet, our primary model of interest is Resnet18. We use the model in the official TensorFlow Models repository [8]. However, we train the model on resolution images, in an asynchronous fashion on 50 GPUs with an exponentially decaying learning rate starting at 0.045 and batch size 32. This attains 71.9% validation accuracy. This model has over 11 million parameters.
Appendix G Batch normalization with population statistics
The population loss experiment is quite difficult to run on CIFAR10 (we were unable to make Inception V3 train in this way without using a tiny learning rate of ). In particular, it is important to divide the learning rate by a factor of 100, and also to spend at least 400 steps at the start of optimization with a learning rate of 0: this allows the batch normalization population statistics to stabilize with a better initialization than the default mean of 0.0 and variance of 1.0.