PyHessian: Neural Networks Through the Lens of the Hessian

Abstract

We present PyHessian, a new scalable framework that enables fast computation of Hessian (i.e., second-order derivative) information for deep neural networks. PyHessian enables fast computations of the top Hessian eigenvalues, the Hessian trace, and the full Hessian eigenvalue/spectral density, and it supports distributed-memory execution on cloud/supercomputer systems and is available as open source [pyhessian]. This general framework can be used to analyze neural network models, including the topology of the loss landscape (i.e., curvature information) to gain insight into the behavior of different models/optimizers. To illustrate this, we analyze the effect of residual connections and Batch Normalization layers on the trainability of neural networks. One recent claim, based on simpler first-order analysis, is that residual connections and Batch Normalization make the loss landscape “smoother”, thus making it easier for Stochastic Gradient Descent to converge to a good solution. Our extensive analysis shows new finer-scale insights, demonstrating that, while conventional wisdom is sometimes validated, in other cases it is simply incorrect. In particular, we find that Batch Normalization does not necessarily make the loss landscape smoother, especially for shallower networks.

I Introduction

Residual neural networks [he2016deep] (ResNets) are widely used Neural Networks (NNs) for various learning tasks. The two main architectural components of ResNets are residual connections [he2016deep] and Batch Normalization (BN) layers [ioffe2015batch]. However, going beyond motivating stories to characterize precisely when and why these two popular architectural ingredients help or hurt training/generalization—especially in terms of measurable properties of the model—is still largely unsolved. Relatedly, characterizing whether other suggested architectural changes will help or hurt training/generalization is still done in a largely ad hoc manner. For example, it is often motivated by plausible but untested intuitions, and it is not characterized in terms of measurable properties of the model.

In this work, we present and apply PyHessian, an open source scalable framework with which one can directly analyze Hessian information, i.e., second-derivative information w.r.t. model parameters, in order to address these and related questions. PyHessian computes Hessian information by applying known techniques from Numerical Linear Algebra (NLA) [bai1996some, golub2009matrices, lin2016approximating] and Randomized NLA (RandNLA) [Mah-mat-rev_BOOK, RandNLA_PCMIchapter_chapter, drineas2006fast, yao2018hessian, avron2011randomized, ubaru2017fast] (which are approximate but come with rigorous theory). PyHessian enables the computation of Hessian information, including the top Hessian eigenvalues, the Hessian trace, and the Hessian eigenvalue spectral density (ESD). It also supports a distributed implementation, allowing distributed-memory execution on both cloud (e.g., AWS, Google Cloud) and supercomputer systems for fast and efficient Hessian computation.
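
To make the workflow concrete, the sketch below shows how these quantities can be queried through the open-source PyHessian package. The class and method names (hessian, eigenvalues, trace, density) follow the released API at the time of writing, but exact signatures may differ across versions, and the model and data here are placeholders; treat this as an illustrative sketch rather than authoritative documentation.

```python
import torch
from pyhessian import hessian  # open-source PyHessian package [pyhessian]

# Placeholder model, loss, and mini-batch standing in for a ResNet on Cifar-10.
model = torch.nn.Sequential(torch.nn.Flatten(),
                            torch.nn.Linear(3 * 32 * 32, 128),
                            torch.nn.ReLU(),
                            torch.nn.Linear(128, 10))
criterion = torch.nn.CrossEntropyLoss()
inputs = torch.randn(64, 3, 32, 32)
targets = torch.randint(0, 10, (64,))

# Build the Hessian "oracle"; the full n x n Hessian is never formed explicitly.
hessian_comp = hessian(model, criterion, data=(inputs, targets), cuda=False)

# Top Hessian eigenvalues/eigenvectors via power iteration (Algorithm 2).
top_eigenvalues, top_eigenvectors = hessian_comp.eigenvalues(top_n=2)

# Hessian trace via Hutchinson's method (Algorithm 3).
trace_estimates = hessian_comp.trace()

# Full eigenvalue spectral density via stochastic Lanczos quadrature (Algorithm 1).
density_eigen, density_weight = hessian_comp.density()

print(top_eigenvalues, sum(trace_estimates) / len(trace_estimates))
```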

As an application of PyHessian, we use it to analyze the impact of residual connections and BN on the trainability of NNs, leading to new insights. In more detail, our main contributions are the following:

Fig. 1: The parametric loss landscapes of ResNet20 (top) and ResNet38 (bottom) on Cifar-10, plotted by perturbing the model parameters at the end of training along the first and second Hessian eigenvectors. Results for the original ResNet architecture (left), ResNet without BN (middle; denoted as ResNet-NoBN), and ResNet without residual connections (right; denoted as ResNet-NoRes). It can be clearly seen that removing BN from ResNet20 actually leads to a smoother loss landscape, which is opposite to the common belief that adding BN leads to a smoother loss landscape [santurkar2018does]. We only observed the claimed smoothness property for the deeper ResNet38 model (second row). This smoothness can be quantified by measuring the trace of the Hessian operator, reported in Figure 2, as well as the full Hessian ESD, shown in Figures 3 and 13. We also visualize the loss landscape throughout training for different epochs, as shown in Figures 18 and 20, which provide further evidence. Models trained on Cifar-100 also exhibit a similar behavior (as shown in Figures 22, 23, and 24).
  • We introduce PyHessian, a new framework for direct and efficient computation of Hessian information, including the top eigenvalue, the trace, and the full ESD [pyhessian]. We also apply PyHessian to study how residual connections and BN affect training.

  • We observe that removing BN layers from ResNet (denoted below as ResNet-NoBN) leads to a rapid increase of the Hessian spectrum (the top eigenvalue, the trace, and the ESD support range). This increase is significantly more rapid for deeper models. See Figures 2, 11, 12, and 13 on Cifar-10, as well as Figures 9, 15, 16, and 17 on Cifar-100.

  • We observe that, for shallower networks (ResNet20), removing the BN layer results in a flatter Hessian spectrum, as compared to standard ResNet20 with BN. See Figures 2 and 3 on Cifar-10 and Figures 9 and 15 on Cifar-100. This observation is the opposite of the common belief that the addition of BN layers makes the loss landscape smoother (which we observe to hold only for deeper networks).

  • We observe that, for deeper networks (in our case, ResNet32/38), removing BN results in converging to sharper local minima, as compared to ResNet with BN. See Figures 2, 3, 19, and 20 on Cifar-10, as well as Figures 9, 15, 23, and 24 on Cifar-100.

  • We show that removing residual connections from ResNet generally makes the top eigenvalue, the trace, and the Hessian ESD support range increase slightly. This increase is consistent for both shallower and deeper models (ResNet20/32/38/56). See Figures 2, 3, 14, 18, 19, 20, and 21 on Cifar-10, as well as Figures 9, 15, 22, 23, and 24 on Cifar-100.

  • We perform Hessian analysis for different stages of ResNet models (details in Section IV-A), and we find that generally BN is more important for the final stages than for earlier stages. In particular, removing BN from the last stage significantly degrades testing performance, with a strong correlation with the Hessian trace. See the comparison between the orange and blue curves in Figures 4 and 8, and the accuracy reported in Table II on Cifar-10 (see also Figure 10 and the accuracy reported on Cifar-100 in Table VI).

II Related Work

Here, we review work related to Hessian-based analysis for NN training and inference, as well as work that studies the impact of different architectural components on the topology of the NN loss landscape.

Hessian and Large-scale Hessian Computation: Hessian-based analysis/computation is widely used in scientific computing. However, due to the (incorrect) belief that Hessian-based computations are infeasible for large NN problems, the majority of work in ML (except for quite small problems) performs only first-order analysis.1 Yet, using implicit or matrix-free methods, it is not even necessary to form the Hessian matrix explicitly in order to extract second-order information. Instead, it is possible to use stochastic methods from RandNLA to extract this information. For example, [bai1996some, avron2011randomized] proposed fast algorithms for trace computation; and [lin2016approximating, ubaru2017fast] provided efficient randomized algorithms to estimate the ESD of a positive semi-definite matrix. These algorithms only require an oracle for computing the product of the Hessian matrix with a given random vector. It is possible to compute this so-called “matvec” and extract Hessian information without explicitly forming the Hessian [becker1988improving, martens2010deep]. In particular, using the so-called R-operator, the Hessian matvec can be computed with the same computational graph used for backpropagating the gradient [martens2010deep].

Hessian eigenvalues of small NN models were analyzed [sagun2016eigenvalues, sagun2017empirical]; and the work of [pennington2017geometry] studied the geometry of NN loss landscapes by computing the distribution of Hessian eigenvalues at critical points. More recently, [yao2018hessian] used a deflated power-iteration method to compute the top eigenvalues for deep NNs during training. Moreover, the work of [ghorbani2019investigation] measured the Hessian ESD, based on the Stochastic Lanczos algorithm of [lin2016approximating, ubaru2017fast]. Here, we extend the analysis of [ghorbani2019investigation, yao2018hessian] by studying how the depth of the NN model as well as its architecture affect the Hessian spectrum (in terms of top eigenvalue, trace, and full ESD). Furthermore, we also perform block diagonal Hessian spectrum analysis, and we observe a fine-scale relationship between the Hessian spectrum and the impact of adding/removing residual connections and BN.

Hessian-based analysis has also been used in the context of NN training and inference. For example, [lecun1991second] analytically computes Hessian information for a single linear layer and uses the Hessian spectrum to determine the optimal learning rate to accelerate training. In [lecun1990optimal], the authors approximated the Hessian as a diagonal operator and used the inverse of this diagonal matrix to prune NN parameters. Subsequently, [hassibi1993second] used the inverse of the full Hessian matrix to develop an “Optimal Brain Surgeon” method for pruning NN parameters. The authors argued that a diagonal approximation may not be very accurate, as off-diagonal elements of the Hessian are important; and they showed that capturing these off-diagonal elements does indeed lead to better performance, as compared to [lecun1990optimal]. In the recent work of [dong2017learning], a layer-wise pruning method was proposed. This restricts the Hessian computations to each layer, and it provides bounds on the performance drop after pruning. More recently, [dong2019hawq, shen2019q, dong2019hawqv2] proposed a Hessian-based method for quantizing2 NN models, achieving significantly better performance, as compared to first-order based methods.

(Quasi-)Newton (second-order) methods [agarwal2016second, dembo1982inexact, pearlmutter1994fast, pilanci2017newton, pratt1998gauss, amari1998natural, bottou2018optimization] have been extensively explored for convex optimization problems [boyd2004convex]. In particular, in the seminal work of [nocedal1980updating, liu1989limited], a Quasi-Newton method was proposed to accelerate first-order optimization methods. The idea is to precondition the gradient vector with the inverse of the Hessian. However, instead of directly using the Hessian, a series of approximate rank-1 updates is used. Follow-up work of [schraudolph2007stochastic] extended this method and proposed a stochastic BFGS algorithm. More recently, the work of [bollapragada2018progressive] proposed an adaptive batch size Limited-memory BFGS method [liu1989limited] for large-scale machine learning problems; and an adaptive batch size method based on directly measuring the spectrum of the Hessian has been proposed [yao2018large] for large-scale NN training.

Hessian-based methods have also been explored for non-convex problems, including trust-region (TR) [conn2000trust], cubic regularization (CR) [nesterov2006cubic], and its adaptive variant (ARC)  [cartis2011adaptiveI, cartis2011adaptiveII]. For these problems, [byrd2011use, erdogdu2015convergence, roosta2016sub, xu2016sub] provide sketching/sampling techniques for Newton methods, where guarantees are established for sampling size and convergence rates; and [xu2017newton, xu2016sub, xu2017second, yao2018inexact] show that sketching/sampling methods can significantly reduce the need for data in approximate Hessian computation.

One important concern for applying second-order methods to training is the cost of computing Hessian information at every iteration. The work of [martens2015optimizing] proposed the so-called Kronecker-Factored Approximate Curvature (K-FAC) method, which approximates the Fisher information matrix as a Kronecker product. However, the approach comes with several new hyperparameters, which can actually be more expensive to tune than those of first-order methods [ma2019inefficiency].

A major limitation in most of this prior work is that tests are typically restricted to small/simple NN models that may not be representative of NN workloads that are encountered in practice. This is in part due to the lack of a scalable and easily programmable framework that could be used to test second-order methods for a wide range of state-of-the-art models. Addressing this is the main motivation behind our development of PyHessian, which is released as open-source software and is available to researchers [pyhessian]. In this paper, we illustrate how PyHessian can be used for analyzing the NN behaviour during training, even for very deep state-of-the-art models. Future work includes using this framework for second-order based optimization, by testing it on modern NN models, as well as fairly gauging the benefit that may arise from such methods, in light of the cost for any extra hyperparameter tuning that may be needed [ma2019inefficiency].

Residual Connections and Batch Normalization: Residual connections [he2016deep] and BN [ioffe2015batch] are two of the most important ingredients in modern convolutional NNs. Different hypotheses have been offered for why these two components help training/generalization. First, the original motivation for residual connections was that they allow gradient information to flow to earlier layers of the NN, thereby reducing the vanishing gradient problem during training. The empirical study of [li2018visualizing] found that deep NNs with residual connections exhibit a significantly smoother loss landscape, as compared to models without residual connections. This was shown using the so-called filter-normalized random direction method to plot 3D loss landscapes, i.e., not through direct analysis of the Hessian spectrum. This result is interesting, but it is hard to draw conclusions from perturbations in two directions, for a model that has millions of parameters (and thus millions of possible perturbation directions).

Second, the benefit of BN for training/generalization was originally attributed to reducing the so-called Internal Covariate Shift (ICS) [ioffe2015batch]. However, this was disputed in the recent study of [santurkar2018does]. In particular, the work of [santurkar2018does] used first-order analysis to analyze the loss landscape, and found that adding a BN layer results in a smoother loss landscape. Importantly, they found that adding BN does not reduce the so-called ICS. Again, while interesting, such first-order analysis may not fully capture the topology of the landscape; and, as we will show with our second-order analysis, this smoothness claim is not correct in general.

The work of [santurkar2018does] also performed an interesting theoretical analysis, showing a connection between adding the BN layer and the Lipschitz constant of the gradient (i.e., the top Hessian eigenvalue). It was argued that adding the BN layer leads to a smaller Lipschitz constant. However, the theoretical analysis is only valid for the per-layer Lipschitz constant, as it ignores the complex interaction between different layers. It cannot be extended to the Lipschitz constant of the entire model (and, as we will show, this result does not hold for shallow networks).

III Methodology

For a supervised learning problem, we seek to minimize:

$L(\theta) = \frac{1}{N}\sum_{i=1}^{N} l\big(f(x_i, \theta), y_i\big),$   (1)

where $\theta \in \mathbb{R}^{n}$ is the learnable weight parameter, $l(\cdot,\cdot)$ is the loss function, $(x_i, y_i)$ is the $i$-th input pair, $f$ is the NN architecture, and $N$ is the size of the training data. Below we first discuss how PyHessian computes the second-order statistics, and we then discuss the impact of architectural components on the trainability of the model.

III-A Neural Network Hessian Matvec

For a NN with $n$ parameters, the gradient of the loss w.r.t. the model parameters $\theta$ is a vector

$g = \frac{\partial L}{\partial \theta} \in \mathbb{R}^{n},$

and the second derivative of the loss is a matrix,

$H = \frac{\partial^{2} L}{\partial \theta^{2}} \in \mathbb{R}^{n \times n},$

commonly called the Hessian. A typical NN model involves millions of parameters, and thus even forming the Hessian is computationally infeasible. However, it is possible to compute properties of the Hessian spectrum without explicitly forming the Hessian matrix. Instead, all we need is an oracle to compute the application of the Hessian to a (random) vector $v$. This can be achieved by observing the following:

$\frac{\partial (g^{T} v)}{\partial \theta} = \frac{\partial g^{T}}{\partial \theta} v + g^{T}\frac{\partial v}{\partial \theta} = \frac{\partial g^{T}}{\partial \theta} v = H v.$   (2)

Here, the first equality is the product rule, the second is due to the independence of $v$ from $\theta$, and the third equality is the definition of the Hessian. Importantly, note that the cost of this Hessian matrix-vector multiply (hereafter referred to as Hessian matvec) is the same as one gradient backpropagation. Having this oracle, we can easily compute the top Hessian eigenvalues using power iteration [yao2018hessian]; see Algorithm 2. However, for a typical NN with millions of parameters, the top eigenvalues may not be representative of how the loss landscape behaves. Therefore, we also compute the trace and ESD of the Hessian, as described below.
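
As an illustration of Eq. 2, the following PyTorch sketch computes the Hessian matvec with two calls to torch.autograd.grad and uses it inside power iteration (Algorithm 2). The helper names hessian_vector_product and top_eigenvalue are ours, not part of PyHessian, and the iteration count and tolerance are arbitrary; this is a minimal sketch rather than the library implementation.

```python
import torch

def hessian_vector_product(loss, params, vec):
    """Compute Hv without forming H, via Eq. 2: Hv = d(g^T v)/d(theta)."""
    grads = torch.autograd.grad(loss, params, create_graph=True)   # g, kept on the graph
    dot = sum((g * v).sum() for g, v in zip(grads, vec))           # scalar g^T v
    return torch.autograd.grad(dot, params, retain_graph=True)     # d(g^T v)/d(theta) = Hv

def top_eigenvalue(loss, params, n_iter=50, tol=1e-4):
    """Power iteration (Algorithm 2) on the Hessian of `loss` w.r.t. `params`."""
    v = [torch.randn_like(p) for p in params]                      # random start vector
    eig = None
    for _ in range(n_iter):
        norm = torch.sqrt(sum((x * x).sum() for x in v))
        v = [x / norm for x in v]                                  # normalize v
        hv = hessian_vector_product(loss, params, v)
        new_eig = sum((h * x).sum() for h, x in zip(hv, v)).item() # Rayleigh quotient v^T H v
        if eig is not None and abs(new_eig - eig) / (abs(eig) + 1e-12) < tol:
            eig = new_eig
            break
        eig, v = new_eig, list(hv)                                 # reset v <- Hv for next step
    return eig

# Example: top Hessian eigenvalue of a tiny model on one random batch.
model = torch.nn.Sequential(torch.nn.Linear(10, 32), torch.nn.ReLU(), torch.nn.Linear(32, 2))
x, y = torch.randn(16, 10), torch.randint(0, 2, (16,))
loss = torch.nn.functional.cross_entropy(model(x), y)
print(top_eigenvalue(loss, list(model.parameters())))
```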

III-B Hutchinson Method for Hessian Trace Computation

The trace of the Hessian can be computed using RandNLA, and in particular with Hutchinson's method [bai1996some, avron2011randomized] for the fast computation of the trace, using only Hessian matvec computations (as given in Eq. 2). In particular, since we are interested in the Hessian, i.e., a symmetric matrix $H \in \mathbb{R}^{n \times n}$, suppose we have a random vector $v \in \mathbb{R}^{n}$, whose components are i.i.d. sampled from a Rademacher distribution (or a Gaussian distribution with mean 0 and variance 1). Then, we have the identity

$\mathrm{Tr}(H) = \mathrm{Tr}(H I) = \mathrm{Tr}\big(H\,\mathbb{E}[v v^{T}]\big) = \mathbb{E}\big[\mathrm{Tr}(H v v^{T})\big] = \mathbb{E}\big[v^{T} H v\big],$   (3)

where $I$ is the identity matrix of appropriate size. That is, the trace of $H$ can be estimated by computing $\mathbb{E}[v^{T} H v]$, where we approximate the expectation by drawing multiple random samples. Note that $Hv$ can be efficiently computed from Eq. 2, and then $v^{T} H v$ is simply a dot product between the Hessian matvec and the original vector $v$. See Algorithm 3 for a description.
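
For concreteness, a minimal sketch of Hutchinson's estimator built on the same autograd-based matvec is given below; the function name hutchinson_trace and the default number of probe vectors are our own choices, not PyHessian internals.

```python
import torch

def hutchinson_trace(loss, params, n_samples=100):
    """Estimate Tr(H) = E[v^T H v] with Rademacher probe vectors (Eq. 3)."""
    grads = torch.autograd.grad(loss, params, create_graph=True)        # g, kept on the graph
    estimates = []
    for _ in range(n_samples):
        # Rademacher probe: entries are +1 or -1 with equal probability.
        v = [torch.randint_like(p, 2) * 2.0 - 1.0 for p in params]
        dot = sum((g * vi).sum() for g, vi in zip(grads, v))            # g^T v
        hv = torch.autograd.grad(dot, params, retain_graph=True)        # Hv via Eq. 2
        estimates.append(sum((h * vi).sum() for h, vi in zip(hv, v)).item())  # v^T H v
    return sum(estimates) / len(estimates)

# Example usage on a tiny model and one random batch.
model = torch.nn.Sequential(torch.nn.Linear(10, 32), torch.nn.Tanh(), torch.nn.Linear(32, 2))
x, y = torch.randn(16, 10), torch.randint(0, 2, (16,))
loss = torch.nn.functional.cross_entropy(model(x), y)
print(hutchinson_trace(loss, list(model.parameters())))
```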

Fig. 2: The Hessian trace of the entire network for ResNet/ResNet-NoBN/ResNet-NoRes with depth 20/32/38/56 on Cifar-10. For each depth, ResNet (blue) is the baseline. It can be clearly seen that removing BN from the architecture (orange) generally results in a rapid increase of the Hessian trace. This increase is more pronounced for deeper networks such as ResNet32 and ResNet38. Importantly, the Hessian trace of ResNet20 without BN is lower than that of the original model (blue). This is in contrast to the claim of [santurkar2018does]. Also, we generally observe that residual connections reduce the Hessian trace for both shallow and deep networks (compare the blue and green lines). Results on Cifar-100 exhibit the same behaviour (as shown in Figure 9).

III-C Full Eigenvalue Spectral Density

To provide finer-grained information on the Hessian spectrum than is provided by the top eigenvalues or the trace, we need to compute the full empirical spectral density (ESD) of the Hessian eigenvalues, defined as

$\phi(t) = \frac{1}{n}\sum_{i=1}^{n} \delta(t - \lambda_i),$   (4)

where $\delta$ is the Dirac distribution and $\lambda_i$ is the $i$-th eigenvalue of $H$, in descending order.

Input: Parameter $\theta$, Lanczos degree $m$, and number of runs $n_v$.
Compute the gradient of $\theta$ by backpropagation, i.e., compute $g = \frac{\partial L}{\partial \theta}$. for $i = 1, 2, \ldots, n_v$ do        // Different Seeds
       Draw a random vector $v$ from $N(0, 1)$ and normalize it (same dimension as $\theta$). Get the $m \times m$ tridiagonal matrix $T$ through the Lanczos algorithm, using the Hessian matvec of Eq. 2. Compute $\tilde{\lambda}_k$ and $\omega_k$ from $T$ (Eq. 14).
Return $\phi_\sigma(t)$ computed by Eq. 15.
Algorithm 1 Stochastic Lanczos Quadrature for ESD Computation

Recent work in NLA/RandNLA has provided efficient matrix-free algorithms to estimate this ESD [lin2016approximating, golub2009matrices, ubaru2017fast] through Stochastic Lanczos Quadrature (SLQ). Here, we briefly describe SLQ in simple terms. This approach was also used in [ghorbani2019investigation] to compute the Hessian ESD. For more details, see [lin2016approximating, golub2009matrices, ubaru2017fast].

Here is a summary of our approach to compute the ESD $\phi(t)$. First, we approximate $\phi$ (of Eq. 4) by $\phi_\sigma$ (Eq. 5 below) by applying a Gaussian kernel (first approximation), and we express this in the same expectation form as in the Hutchinson algorithm (Eq. 9 below). Next, since the computation inside the expectation depends directly on the kernel $f$ and the unknown eigenvalues (denoted by $\lambda_i$'s), we further simplify the problem by using Gaussian quadrature (Eq. 13 below) (second approximation). Then, since the weights and nodes in the Gaussian quadrature are still unknown, we use the stochastic Lanczos algorithm to approximate them (Eq. 14 below) (third approximation). Finally, we approximate the expectation of the eigenvalue distribution as a sum (Eq. 15 below) (fourth approximation).

In more detail, for the first approximation, we apply a Gaussian kernel $f$ with variance $\sigma^{2}$ to Eq. 4 to obtain

$\phi_\sigma(t) = \frac{1}{n}\sum_{i=1}^{n} f(\lambda_i; t, \sigma),$   (5)

where $f(\lambda; t, \sigma) = \frac{1}{\sigma\sqrt{2\pi}}\exp\!\big(-\frac{(t-\lambda)^{2}}{2\sigma^{2}}\big)$ is the Gaussian kernel. Clearly, $\phi_\sigma(t) \rightarrow \phi(t)$ as $\sigma \rightarrow 0$. Thus, if we had an algorithm to approximate Eq. 5, then we could take the limit and reduce the standard deviation of the Gaussian kernel to approximate Eq. 4. In our context, the question of how to compute $\phi_\sigma(t)$ amounts to computing the density distribution of the Hessian convolved with a Gaussian kernel.

To do this, observe that

$\sum_{i=1}^{n} f(\lambda_i; t, \sigma) = \mathrm{Tr}\big(f(H; t, \sigma)\big),$   (6)

where $H = Q \Lambda Q^{T}$ is the eigendecomposition of $H$, and let $f(H; t, \sigma)$ be the matrix function defined as

$f(H; t, \sigma) = Q\, f(\Lambda; t, \sigma)\, Q^{T} = Q\,\mathrm{diag}\big(f(\lambda_1; t, \sigma), \ldots, f(\lambda_n; t, \sigma)\big)\, Q^{T}.$   (7)

We can plug Eq. 6 into Eq. 5 to get

$\phi_\sigma(t) = \frac{1}{n}\,\mathrm{Tr}\big(f(H; t, \sigma)\big).$   (8)

For a given value of $t$, the trace of $f(H; t, \sigma)$ can be efficiently computed using the Hutchinson algorithm (described in §III-B). That is, we draw a random Rademacher vector $v$ and compute the expectation to get

$\phi_\sigma(t) = \frac{1}{n}\,\mathbb{E}\big[v^{T} f(H; t, \sigma)\, v\big].$   (9)

However, this is still intractable, as the computation needs to be repeated for every value of $t$, and evaluating $f(H; t, \sigma)v$ exactly would require the full eigendecomposition of $H$ (which scales with the number of model parameters).

To get around this, we relax this problem further [lin2016approximating, ubaru2017fast]. Define $\phi_\sigma^{v}(t) = v^{T} f(H; t, \sigma)\, v$, in which case we have

$\phi_\sigma^{v}(t) = \sum_{i=1}^{n} \mu_i^{2}\, f(\lambda_i; t, \sigma),$   (10)

where $\mu_i = q_i^{T} v$ is the magnitude (or dot product) of $v$ along the $i$-th eigenvector $q_i$ of $H$. Now let us define a probability distribution w.r.t. $\lambda$ with the cumulative distribution function, $\mu(\lambda)$, as the following piece-wise function:

$\mu(\lambda) = \begin{cases} \sum_{i=1}^{n} \mu_i^{2}, & \lambda \ge \lambda_1, \\ \sum_{i=k+1}^{n} \mu_i^{2}, & \lambda_{k+1} \le \lambda < \lambda_k, \\ 0, & \lambda < \lambda_n. \end{cases}$   (11)

Then, by the Riemann-Stieltjes integral, it follows that

$\phi_\sigma^{v}(t) = \int_{\lambda_n}^{\lambda_1} f(\lambda; t, \sigma)\, d\mu(\lambda).$   (12)

This integral can be estimated by the Gauss quadrature rule [golub1969calculation],

$\int_{\lambda_n}^{\lambda_1} f(\lambda; t, \sigma)\, d\mu(\lambda) \approx \sum_{i=1}^{m} \omega_i\, f(t_i; t, \sigma),$   (13)

where $(\omega_i, t_i)$ is the $i$-th weight-node pair used to estimate the integral. The stochastic Lanczos algorithm can then be used to estimate this quantity accurately [ubaru2017fast, golub2009matrices, lin2016approximating]. Specifically, for the $m$-step Lanczos algorithm, we have $m$ eigenpairs $(\tilde{\lambda}_i, \tilde{q}_i)$ of the resulting tridiagonal matrix, in which case, letting $\omega_i = \big(\tilde{q}_i[1]\big)^{2}$, where $\tilde{q}_i[1]$ is the first component of $\tilde{q}_i$, it follows that

$\phi_\sigma^{v}(t) \approx \sum_{i=1}^{m} \omega_i\, f(\tilde{\lambda}_i; t, \sigma).$   (14)

Therefore, as in the Hutchinson algorithm, with multiple different runs (e.g., $n_v$ times) of the Lanczos algorithm, $\phi_\sigma(t)$ can be approximated by

$\phi_\sigma(t) \approx \frac{1}{n_v}\sum_{j=1}^{n_v}\Big(\sum_{i=1}^{m} \omega_i^{(j)}\, f\big(\tilde{\lambda}_i^{(j)}; t, \sigma\big)\Big).$   (15)

See Algorithm 1 for a description of the SLQ algorithm.
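
For concreteness, the following sketch implements SLQ against a generic matvec oracle: Lanczos with full reorthogonalization produces a tridiagonal matrix $T$, its eigenpairs give the nodes and weights of Eq. 14, and a Gaussian kernel smooths the resulting spectrum as in Eq. 15. The helper names (lanczos_tridiag, slq_density), the flattened-vector interface, and the example with an explicit symmetric matrix standing in for the Hessian oracle are our own simplifications, not the PyHessian implementation.

```python
import torch

def lanczos_tridiag(hvp, dim, m):
    """m-step Lanczos with full reorthogonalization for a symmetric operator
    accessed only through its matvec `hvp` (e.g., the Hessian matvec of Eq. 2).
    Returns the m x m tridiagonal matrix T."""
    v = torch.randn(dim)
    v = v / v.norm()
    V = [v]                                   # orthonormal Lanczos basis
    alphas, betas = [], []
    for j in range(m):
        w = hvp(V[j])
        alpha = torch.dot(w, V[j])
        alphas.append(alpha)
        w = w - alpha * V[j]
        if j > 0:
            w = w - betas[-1] * V[j - 1]
        # Full reorthogonalization against all previous basis vectors.
        for u in V:
            w = w - torch.dot(w, u) * u
        beta = w.norm()
        if j < m - 1:
            betas.append(beta)
            V.append(w / (beta + 1e-12))
    T = torch.diag(torch.stack(alphas))
    for j, b in enumerate(betas):
        T[j, j + 1] = T[j + 1, j] = b
    return T

def slq_density(hvp, dim, m=80, n_runs=10, sigma=0.01, n_grid=1000):
    """SLQ estimate of the ESD (Eq. 15): Ritz values of T are the nodes, squared
    first components of the Ritz vectors are the weights (Eq. 14), and a Gaussian
    kernel of width sigma smooths the spectrum (Eq. 5)."""
    nodes, weights = [], []
    for _ in range(n_runs):
        T = lanczos_tridiag(hvp, dim, m)
        evals, evecs = torch.linalg.eigh(T)    # eigenpairs of T
        nodes.append(evals)
        weights.append(evecs[0, :] ** 2)       # omega_i = (first component of i-th eigenvector)^2
    nodes, weights = torch.cat(nodes), torch.cat(weights)
    grid = torch.linspace(float(nodes.min()) - 1.0, float(nodes.max()) + 1.0, n_grid)
    kernel = torch.exp(-((grid[:, None] - nodes[None, :]) ** 2) / (2 * sigma ** 2))
    kernel = kernel / (sigma * (2 * torch.pi) ** 0.5)
    density = (kernel * weights[None, :]).sum(dim=1) / n_runs
    return grid, density

# Example: a random symmetric matrix stands in for the Hessian matvec oracle.
A = torch.randn(200, 200)
A = (A + A.T) / 2
grid, density = slq_density(lambda v: A @ v, dim=200)
```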

IV Results

Here, we provide extensive experiments to study the impact of BN and residual connections on the Hessian spectrum. We first discuss the experimental settings in §IV-A, and then present the Hessian spectrum results for the entire model in §IV-B, as well as for different ResNet stages in §IV-C.

IV-A Experimental Setting

Using PyHessian, we measure all three Hessian spectrum metrics (i.e., top eigenvalues, trace, and full ESD) throughout the training process of SGD with momentum. We consider various ResNet [he2016deep] architectures, in particular ResNet20/32/38/56 on Cifar-10, and we analyze these models and variants with/without BN and with/without residual connections. We also experimented with the same networks on the Cifar-100 dataset, and all of the observations were consistent. These results are presented in the Appendix.

For clarity, we refer to the networks without BN as ResNet-NoBN, and we refer to the networks without residual connections as ResNet-NoRes. We train each model with various initial learning rates, and we pick the best performing result for analysis. See Appendix -C for more details on the training settings. We analyze the spectrum throughout training at all checkpoints. The accuracy of each model is reported in Table I, and the testing curve is shown in Figure 6.

Model\Depth 20 32 38 56
ResNet 92.01% 92.05% 92.37% 93.59%
ResNet-NoBN 87.27% 66.57% 53.65% N/A
ResNet-NoRes 90.66% 89.8% 88.92% 87.38%
TABLE I: Accuracy of ResNet, ResNet-NoBN, and ResNet-NoRes, with different depths, on Cifar-10. The accuracy drops if the BN layer is removed (ResNet-NoBN), and this degradation is more pronounced for deeper models. In fact, ResNet-NoBN 56 cannot be trained at all. Removing the residual connections (ResNet-NoRes) also results in a slight performance degradation. Accuracy of models on Cifar-100 is reported in Table V.

IV-B Full Network Hessian Analysis

We start with the original ResNet model with BN and residual connections. Hereafter we refer to this as ResNet. The behaviour of the Hessian trace throughout training is shown in Figure 2. Furthermore, we show the evolution of the Hessian ESD throughout training in Figure 3 for Cifar-10.

Batch Normalization

As discussed before, a BN layer is crucial for training NN models, and removing this component can adversely affect the generalization performance, as is shown in Table I. The drop in performance is very significant for deeper models. For example, we could not even train ResNet56 on Cifar-10 without a BN layer, even with hyperparameter tuning.

The first interesting observation is that removing the BN layers (ResNet-NoBN) leads to different behaviour for shallower versus deeper models. For example, for ResNet20 we see that removing BN results in smaller trace and Hessian ESD values, as compared to the baseline, as shown in Figure 2 (orange curve versus blue curve) and Figure 3 (second versus first column). In more detail, from the evolution plot of Figure 3 throughout training, it can be seen that the ESD of ResNet-NoBN 20 initially reduces significantly and centers around zero. That is, the model gets attracted to areas with a significantly large number of small/degenerate Hessian directions. This continues until epoch 30, at which point the training gets attracted to regions of the loss landscape with several non-degenerate Hessian directions.

This clearly shows that removing BN makes training harder, but it does not necessarily mean that the Hessian spectrum is going to be larger than that of the baseline model, contrary to the claim made by [santurkar2018does]. In fact, we only observe the smoothing behaviour proposed by [santurkar2018does] for deeper NN models. For example, observe the Hessian trace plot of ResNet32/38, shown in Figure 2. Here, the Hessian trace of ResNet-NoBN 32 increases from zero to roughly 10000, as compared to roughly 2000 for ResNet32. The Hessian ESD also exhibits the same behaviour, as shown in Figures 12 and 13. We can clearly see that the range of eigenvalues of ResNet-NoBN is significantly larger, as compared to ResNet.

Fig. 3: (first row) We show the Hessian ESD throughout training for ResNet/ResNet-NoBN/ResNet-NoRes (each shown in a different column) with depth 20 on Cifar-10. For a fixed epoch, every point corresponds to a Hessian eigenvalue. These plots show several important phenomena. First, note that removing BN (middle column) does not lead to a non-smooth loss landscape, as was claimed by [santurkar2018does]. We can clearly see that this is true throughout training. However, removing the residual connection makes the loss landscape non-smooth throughout training (middle/last row). We show the Hessian ESD at epoch 0 and epoch 180. This clearly shows that removing BN leads to a maximum eigenvalue of approximately 100, whereas the baseline has a maximum Hessian eigenvalue of approximately 150. See Figure 11, where we plot the Hessian ESD for several other epochs throughout training. We observed the same behaviour on the Cifar-100 dataset (as shown in Figure 15).

The Hessian ESD of ResNet32 and ResNet38 throughout the training process is shown in Figures 12 and 13. Again, we see the interesting behaviour that, without the BN layer, the spectrum initially converges to degenerate Hessian directions, before finding non-degenerate directions in later epochs of training. The Hessian trace and the range of the Hessian ESD significantly increase as the model gets deeper.

These plots show the numerical values of the Hessian spectrum. However, the results can be presented more intuitively via parametric plots of the loss landscape. We plot the parametric 3D loss landscapes of ResNet20/38 on Cifar-10 with/without BN in Figure 1 (compare the left and middle columns). These plots are computed by perturbing the model parameters along the first and second eigenvectors of the Hessian. For ResNet20, it can be clearly seen that removing the BN layer (middle plot) results in convergence to a flatter local minimum, as compared to ResNet20 with BN. This observation is the opposite of the common belief that adding a BN layer makes the loss landscape smoother [santurkar2018does]. However, for ResNet38, we can see that removing the BN layer results in convergence to a point with a higher value of the loss. The visualizations corroborate our finding that ResNet-NoBN initially finds points with degenerate Hessian directions, before converging to a point with non-degenerate directions. We provide more visualizations for ResNet20 (Figure 18), ResNet32 (Figure 19), and ResNet38 (Figure 20), which show the same behaviour.

In summary, our empirical results highlight two points. First, our findings show several fine-scale behaviours when the BN layer is removed. Importantly, we find that the observation made in [santurkar2018does] only holds for deeper models, and is not necessarily true for shallow networks. Second, using the scalable Hessian-based techniques implemented in PyHessian, one can test the hypotheses that these or other claims hold more generally. For example, we observed very similar behaviour on the Cifar-100 dataset, as shown in Appendix -E.

Fig. 4: Stage-wise Hessian trace of ResNet/ResNet-NoBN/ResNet-NoRes 32 on Cifar-10. (See Figure 8 for the other depths; and see Figure 5 for the stage illustration.) Removing the BN layer from the third stage significantly increases the trace, compared to removing the BN layer from the first/second stage. This has a direct correlation with the final generalization performance, as shown in Table II. ResNet/ResNet-NoBN/ResNet-NoRes on Cifar-100 show a similar trend, as shown in Figure 10.
Model\Depth 20 32 38 56
ResNet 92.01% 92.05% 92.37% 93.59%
RM BN stage 1 91.28% 91.98% 92.20% 92.19%
RM BN stage 2 91.49% 91.94% 91.70% 92.20%
RM BN stage 3 90.59% 88.57% 86.96% 73.77%
TABLE II: Accuracy of ResNet models on Cifar-10 with different depths is shown in the first row. Accuracy of the corresponding architectures, but with BN removed from one of the stages, is shown in the next three rows, respectively. (See Figure 5 for the stage definition.) For instance, the last row is a ResNet model with no BN layer in the third stage. We observe a general correlation between the accuracy drop and the stage-wise Hessian analysis, shown in Figure 4. In particular, we see that stages which significantly affect accuracy also exhibit a significant increase in the Hessian trace. Models on Cifar-100 show a similar trend, as shown in Table VI.

Residual Connection

We next study the impact of residual connections on the smoothness of the loss landscape. Removing residual connections leads to slightly poorer generalization, as shown in Table I, although the degradation is much smaller than removing the BN layer.

We report the behaviour of the Hessian trace for ResNet-NoRes in Figure 2 for ResNet20/32/38/56 on Cifar-10. It can clearly be seen that the trace of ResNet-NoRes is consistently higher than that of ResNet, for both shallow and deep models on different datasets.

In addition, from the Hessian ESD in Figures 3, 11, 12, 13, and 14, we can see that the top eigenvalues, as well as the support range of the ESD, of ResNet-NoRes increase for deeper models. These results are in line with the findings of [li2018visualizing].

We also visualize the loss landscape of these models in Figures 1, 18, 19, 20, and 21. It can clearly be seen that the convergence point for ResNet-NoRes becomes sharper, as compared with ResNet, as the depth grows.

Again, our empirical results highlight two points. First, we make observations that provide a finer-scale understanding of seemingly contradictory claims in the previous literature. Second, using the scalable Hessian-based techniques implemented in PyHessian, one can ask these questions and test the hypotheses that these or other claims hold more generally. As in the previous section, we observed very similar behaviour for Cifar-100, as reported in Appendix -E.

IV-C Stage-wise Hessian Analysis

We also analyzed the impact of removing BN and residual connections from different stages of the model. We define each stage of ResNet as the set of blocks with the same activation resolution, as schematically shown in Figure 5.

We plot the Hessian trace for the three stages of ResNet32 on Cifar-10 in Figure 4 (similar plots for ResNet20/38/56 on Cifar-10 are shown in Figure 8). We can clearly see that removing the BN from the last stage of ResNet32 results in a more rapid increase in the Hessian trace, as compared to removing BN from the first or second stage. Interestingly, this has a direct correlation with the final generalization performance reported in Table II. We can see that removing BN from the third stage results in a larger accuracy drop, as compared to removing it from other stages. A similar trend exists for the other models (ResNet20/38); and we generally observe the same behaviour on Cifar-100, as reported in Figure 10 and Table VI.

As for the residual connections, we can see that removing them results in a relatively smaller increase in the Hessian trace, and correspondingly the impact of removing the residual connections on accuracy is smaller, as compared to removing BN. See Table III for Cifar-10.

Model\Depth 20 32 38 56
ResNet 92.01% 92.05% 92.37% 93.59%
RM Res stage 1 91.52% 92.27% 91.74% 91.79%
RM Res stage 2 91.06% 91.07% 91.08% 91.28%
RM Res stage 3 91.54% 92.09% 92.14% 92.34%
TABLE III: Accuracy of ResNet on Cifar-10 is reported for baseline (first row), along with architectures where the residual connection is removed at different stages.

IV-D Summary of Results

Table IV presents a summary of the tables and figures used in this work and their corresponding properties, i.e., Accuracy, Trace, ESD, and Loss Landscape.

Cifar-10 Cifar-100
Accuracy Table I, Figure 6 Table V, Figure 7
RM BN Acc. Table II Table VI
RM Res Acc. Table III Table VII
Trace Figure 2 Figure 9
Stage-wise Trace Figures 4, 8 Figure 10
ESD Figures 3, 11, 12, 13, 14 Figures 15, 16, 17
Loss Landscape Figures 1, 18, 19, 20, 21 Figures 22, 23, 24
TABLE IV: Navigation summary for all figures and tables used throughout this paper.

V Conclusions

We have developed PyHessian, an open-source framework for analyzing NN behaviour through the lens of the Hessian [pyhessian]. PyHessian enables direct and efficient computation of Hessian-based statistics, including the top eigenvalues, the trace, and the full ESD, with support for distributed-memory execution on cloud/supercomputer systems. Importantly, since it uses matrix-free techniques, PyHessian accomplishes this without the need to form the full Hessian. This means that we can compute second-order statistics for state-of-the-art NNs in times that are only marginally longer than the time used by popular stochastic gradient-based techniques.

As a typical application, we have also shown how PyHessian can be used to study in detail the impact of popular NN architectural changes (such as adding/modifying BN and residual connections) on the NN loss landscape. Importantly, we found that adding BN layers does not necessarily result in a smoother loss landscape, as claimed by [santurkar2018does]. We have observed this phenomenon only for deeper models, where removing the BN layer results in convergence to "sharp" local minima that have high training loss and poor generalization, but it does not seem to hold for shallower models. We also showed that removing residual connections resulted in a slightly sharper loss landscape, a finding which we illustrated with parametric 3D visualizations, and which all three Hessian spectrum metrics confirmed. We have open-sourced PyHessian to encourage reproducibility and as a scalable framework for research on second-order optimization methods, on practical diagnostics for NN learning/generalization, as well as on analytics tools for NNs more generally.

Acknowledgments

This work was supported by a gracious fund from Intel and Samsung. We are also grateful for a gracious fund from Google Cloud, Google TFTC team, as well as support from the Amazon AWS. We would also like to acknowledge ARO, DARPA, NSF, and ONR for providing partial support of this work.

Appendix

In this appendix, we present additional results to complement and extend the results presented in the main text.


-A Illustration of ResNet Stages

In Figure 5, we show the illustration of ResNet20 on Cifar-10/100 and its three stages.

Fig. 5: Illustration of ResNet20 on Cifar-10/100 and its three stages. Blue, green, and purple boxes show the first, second, and third stages, respectively.

-B Algorithms

We provide the pseudo-code for power iteration, Hutchinson algorithm, and stochastic Lanczos Quadrature in this section. See Algorithm 2 and Algorithm 3. (Algorithm 1 is presented in the main text.)

Input: Parameter $\theta$ and number of iterations $n$.
Compute the gradient of $\theta$ by backpropagation, i.e., compute $g = \frac{\partial L}{\partial \theta}$. Draw a random vector $v$ from $N(0, 1)$ (same dimension as $\theta$). Normalize $v$, $v = \frac{v}{\|v\|}$. for $i = 1, 2, \ldots, n$ do        // Power Iteration
       Compute $g^{T} v$   // Inner product
       Compute $Hv$ by backpropagation, $Hv = \frac{\partial (g^{T} v)}{\partial \theta}$   // Get Hessian vector product
       Normalize and reset $v$, $v = \frac{Hv}{\|Hv\|}$
Return the top eigenvalue estimate $v^{T} H v$.
Algorithm 2 Power Iteration for Top Eigenvalue Computation
Input: Parameter $\theta$ and number of iterations $n_v$.
Compute the gradient of $\theta$ by backpropagation, i.e., compute $g = \frac{\partial L}{\partial \theta}$. for $i = 1, 2, \ldots, n_v$ do        // Hutchinson Steps
       Draw a random vector $v$ from the Rademacher distribution (same dimension as $\theta$). Compute $g^{T} v$   // Inner product
       Compute $Hv$ by backpropagation, $Hv = \frac{\partial (g^{T} v)}{\partial \theta}$   // Get Hessian vector product
       Compute and record $v^{T} H v$
Return the average of all computed $v^{T} H v$.
Algorithm 3 Hutchinson Method for Trace Computation

-C Training Details

We train each model (ResNet, ResNet-NoBN, and ResNet-NoRes) for 180 epochs, with five different initial learning rates (0.1, 0.05, 0.01, 0.005, 0.001) on Cifar-10, and ten different initial learning rates (0.1, 0.05, 0.01, 0.005, 0.001, 0.0005, 0.0004, 0.0003, 0.0002, 0.00001) on Cifar-100. The optimizer is SGD with momentum (0.9). The learning rate decays by a factor of 10 at epochs 80 and 120.
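
For reference, a minimal PyTorch sketch of this training configuration (SGD with momentum 0.9, step decay by a factor of 10 at epochs 80 and 120, 180 epochs, and a sweep over initial learning rates) is shown below; the model is a placeholder and the data loading and per-epoch training loop are elided, so this only illustrates the optimizer and schedule settings.

```python
import torch

def make_optimizer_and_scheduler(model, initial_lr):
    # SGD with momentum 0.9, as used for all runs reported in this paper.
    optimizer = torch.optim.SGD(model.parameters(), lr=initial_lr, momentum=0.9)
    # Decay the learning rate by 10x at epochs 80 and 120 (180 epochs total).
    scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[80, 120], gamma=0.1)
    return optimizer, scheduler

# Sweep over the Cifar-10 initial learning rates and keep the best run.
for lr in [0.1, 0.05, 0.01, 0.005, 0.001]:
    model = torch.nn.Linear(3 * 32 * 32, 10)     # placeholder for a ResNet variant
    optimizer, scheduler = make_optimizer_and_scheduler(model, lr)
    for epoch in range(180):
        # ... one training epoch over Cifar-10 (forward, loss, backward, optimizer.step()) ...
        scheduler.step()
```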

Model\Depth 20 32 38
ResNet 66.47% 68.26% 69.06%
ResNet-NoBN 62.82% 25.89% 11.25%
ResNet-NoRes 64.59% 62.08% 62.75%
TABLE V: Accuracy of ResNet, ResNet-NoBN, and ResNet-NoRes with different depths, on Cifar-100. Results are similar to those shown in Table I, i.e., removing BN (ResNet-NoBN) or residual connections (ResNet-NoRes) results in performance degradation.
Fig. 6: Testing curve of all models reported in Table I. The generalization performance of models without BN (ResNet-NoBN) is much worse than that of the baseline (ResNet). We see a similar but much smaller generalization loss when the residual connection is removed (ResNet-NoRes).
Fig. 7: Testing curve of all models reported in Table V. The generalization performance of models without BN (ResNet-NoBN) is much worse than that of the baseline (ResNet). We see a similar but much smaller generalization loss when the residual connection is removed (ResNet-NoRes).

-D Loss Landscape Details

The parametric loss landscape plots are produced by perturbing the model parameters, $\theta^{*}$, along the first and second top eigenvectors of the Hessian, denoted as $v_1$ and $v_2$. Then, we compute the loss over $K$ data points (in our case, a mini-batch of 4096 samples, as noted in the corresponding figure captions) with the following formula,

$\mathcal{L}(\epsilon_1, \epsilon_2) = \frac{1}{K}\sum_{i=1}^{K} l\big(f(x_i, \theta^{*} + \epsilon_1 v_1 + \epsilon_2 v_2), y_i\big),$

where $\epsilon_1$ and $\epsilon_2$ are the perturbation magnitudes along $v_1$ and $v_2$.
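
A minimal sketch of this perturbation procedure is shown below; it assumes the top two Hessian eigenvectors v1 and v2 (e.g., obtained via Algorithm 2 or PyHessian) are given as lists of tensors matching the model parameters, and the helper name loss_surface is ours.

```python
import torch

def loss_surface(model, criterion, batch, v1, v2, radius=0.5, steps=21):
    """Evaluate L(theta* + e1*v1 + e2*v2) on a 2-D grid of perturbation magnitudes."""
    inputs, targets = batch
    base = [p.detach().clone() for p in model.parameters()]   # theta*
    eps = torch.linspace(-radius, radius, steps)
    surface = torch.zeros(steps, steps)
    with torch.no_grad():
        for i, e1 in enumerate(eps):
            for j, e2 in enumerate(eps):
                # Perturb every parameter tensor along the two eigenvector directions.
                for p, p0, d1, d2 in zip(model.parameters(), base, v1, v2):
                    p.copy_(p0 + e1 * d1 + e2 * d2)
                surface[i, j] = criterion(model(inputs), targets).item()
        for p, p0 in zip(model.parameters(), base):            # restore theta*
            p.copy_(p0)
    return eps, surface
```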

-E Extra Results

In the remainder of this appendix, we present additional results that we described in the main text. See Table IV for a summary.

Fig. 8: Stage-wise Hessian trace of ResNet/ResNet-NoBN/ResNet-NoRes with depth 20/32/38/56 on Cifar-10. See Figure 5 for the stage illustration. Removing the BN layer from the third stage significantly increases the trace, compared to removing the BN layer from the first/second stage. This has a direct correlation with the final generalization performance, as shown in Table II.
Fig. 9: The Hessian trace of the entire network for ResNet/ResNet-NoBN/ResNet-NoRes with depth 20/32/38 on Cifar-100. Similar to the results for Cifar-10, shown in Figure 2, we see that removing the BN layer results in a rapid increase of the Hessian trace, and that removing the residual connection leads to a sharper loss landscape throughout training.
Fig. 10: Stage-wise Hessian trace of ResNet/ResNet-NoBN/ResNet-NoRes with depth 20/32/38 on Cifar-100. See Figure 5 for the stage illustration. Removing the BN layer from the third stage significantly increases the trace, compared to removing the BN layer from the first/second stage. This has a direct correlation with the final generalization performance, as shown in Table VI.
Fig. 11: Hessian ESD of the entire network for ResNet/ResNet-NoBN/ResNet-NoRes with depth 20 on Cifar-10, with Hessian batch size 50000. This figure shows the Hessian ESD throughout the training process and is a full version of Figure 3. One notable point is that, although ResNet-NoBN 20 has a smaller Hessian ESD support range than ResNet 20 does, the Hessian ESD of ResNet-NoBN 20 centers around zero (at least) until epoch 90. This clearly shows that training without BN is indeed harder.
Fig. 12: Hessian ESD of the entire network for ResNet/ResNet-NoBN/ResNet-NoRes with depth 32 on Cifar-10, with Hessian batch size 50000. This figure shows the Hessian ESD throughout the training process. One notable point is that the Hessian ESD of ResNet-NoBN 32 centers around zero (at least) until epoch 5. This clearly shows that training without BN is indeed harder.
Fig. 13: Hessian ESD of the entire network for ResNet/ResNet-NoBN/ResNet-NoRes with depth 38 on Cifar-10, with Hessian batch size 50000. This figure shows the Hessian ESD throughout the training process. One notable point is that the Hessian ESD of ResNet-NoBN 38 centers around zero only in the very beginning phase of training. This clearly shows that training without BN is indeed harder.
Fig. 14: Hessian ESD of the entire network for ResNet/ResNet-NoRes with depth 56 on Cifar-10, with Hessian batch size 50000. The residual connection helps smooth the loss landscape.
Fig. 15: Hessian ESD of the entire network for ResNet/ResNet-NoBN/ResNet-NoRes with depth 20 on Cifar-100, with Hessian batch size 50000. This figure shows the Hessian ESD throughout the training process. One notable point is that, although ResNet-NoBN 20 has a smaller Hessian ESD support range than ResNet 20 does, the Hessian ESD of ResNet-NoBN 20 centers around zero at the very beginning of training (epoch 1). This clearly shows that training without BN is indeed harder.
Fig. 16: Hessian ESD of the entire network for ResNet/ResNet-NoBN/ResNet-NoRes with depth 32 on Cifar-100, with Hessian batch size 50000. This figure shows the Hessian ESD throughout the training process. One notable point is that the Hessian ESD of ResNet-NoBN 32 centers around zero (at least) until epoch 5. This clearly shows that training without BN is indeed harder.
Fig. 17: Hessian ESD of the entire network for ResNet/ResNet-NoBN/ResNet-NoRes with depth 38 on Cifar-100, with Hessian batch size 50000. This figure shows the Hessian ESD throughout the training process. One notable point is that the Hessian ESD of ResNet-NoBN 38 centers around zero (at least) until epoch 5. This clearly shows that training without BN is indeed harder.
Fig. 18: Loss landscape of ResNet/ResNet-NoBN/ResNet-NoRes 20 on Cifar-10 with batch size 4096, obtained by perturbing the parameters along the first two dominant eigenvectors of the Hessian. The loss landscape of ResNet-NoBN 20 (ResNet-NoRes 20) is indeed smoother (sharper) than that of ResNet 20, which is aligned with the trace plot in Figure 2 and the Hessian ESD plot in Figure 3.
Fig. 19: Loss landscape of ResNet/ResNet-NoBN/ResNet-NoRes 32 on Cifar-10 with batch size 4096, obtained by perturbing the parameters along the first two dominant eigenvectors of the Hessian. The loss landscapes of ResNet-NoBN 32 and ResNet-NoRes 32 are indeed sharper than that of ResNet 32, which is aligned with the trace plot in Figure 2 and the Hessian ESD plot in Figure 12.
Fig. 20: Loss landscape of ResNet/ResNet-NoBN/ResNet-NoRes 38 on Cifar-10 with batch size 4096, obtained by perturbing the parameters along the first two dominant eigenvectors of the Hessian. The loss landscapes of ResNet-NoBN 38 and ResNet-NoRes 38 are indeed sharper than that of ResNet 38, which is aligned with the trace plot in Figure 2 and the Hessian ESD plot in Figure 13.
Fig. 21: Loss landscape of ResNet/ResNet-NoRes 56 on Cifar-10 with batch size 4096, obtained by perturbing the parameters along the first two dominant eigenvectors of the Hessian. Note that the z-axis of ResNet56 at epoch 0 has a different range than all the others. The loss landscape of ResNet-NoRes 56 is indeed sharper than that of ResNet 56, which is aligned with the trace plot in Figure 2 and the Hessian ESD plot in Figure 14.
Fig. 22: Loss landscape of ResNet/ResNet-NoBN/ResNet-NoRes 20 on Cifar-100 with batch size 4096, obtained by perturbing the parameters along the first two dominant eigenvectors of the Hessian. The loss landscape of ResNet-NoBN 20 (ResNet-NoRes 20) is indeed smoother (sharper) than that of ResNet 20, which is aligned with the trace plot in Figure 9 and the Hessian ESD plot in Figure 15.
Fig. 23: Loss landscape of ResNet/ResNet-NoBN/ResNet-NoRes 32 on Cifar-100 with batch size 4096, obtained by perturbing the parameters along the first two dominant eigenvectors of the Hessian. The loss landscapes of ResNet-NoBN 32 and ResNet-NoRes 32 are indeed sharper than that of ResNet 32, which is aligned with the trace plot in Figure 9 and the Hessian ESD plot in Figure 16.
Fig. 24: Loss landscape of ResNet/ResNet-NoBN/ResNet-NoRes 38 on Cifar-100 with batch size 4096, obtained by perturbing the parameters along the first two dominant eigenvectors of the Hessian. The loss landscapes of ResNet-NoBN 38 and ResNet-NoRes 38 are indeed sharper than that of ResNet 38, which is aligned with the trace plot in Figure 9 and the Hessian ESD plot in Figure 17.
Model\Depth 20 32 38
ResNet 66.47% 68.26% 69.06%
RM BN stage 1 65.69% 65.74% 67.31%
RM BN stage 2 65.62% 64.68% 66.46%
RM BN stage 3 65.63% 64.57% 61.04%
TABLE VI: Accuracy of ResNet models on Cifar-100 with different depths is shown in the first row. In the second through last rows, we report the accuracy of the corresponding architectures, but with the BN layer removed from one of the stages, respectively. (See Figure 5 for the stage definition.) For instance, the last row reports a ResNet model with no BN layer in the third stage.
Model\Depth 20 32 38
ResNet 66.47% 68.26% 69.06%
RM Res stage 1 66.46% 66.94% 67.61%
RM Res stage 2 65.70% 66.05% 66.70%
RM Res stage 3 66.21% 66.38% 66.03%
TABLE VII: Accuracy of ResNet on Cifar-100 is reported for baseline (first row), along with architectures where the residual connection is removed at different stages.

Footnotes

  1. The naïve view arises since the Hessian matrix is of size (say) $n \times n$, where $n$ is the number of model parameters. Thus, like most linear algebra computations, exact full spectral computations (which are sufficient but never necessary) cost $O(n^3)$ time.
  2. Quantization is a process in which the precision of the parameters is reduced from single precision (32-bits) to a lower precision (such as 8-bits).