PyHessian: Neural Networks Through the Lens of the Hessian
We present PyHessian, a new scalable framework that enables fast computation of Hessian (i.e., second-order derivative) information for deep neural networks. PyHessian enables fast computations of the top Hessian eigenvalues, the Hessian trace, and the full Hessian eigenvalue/spectral density, and it supports distributed-memory execution on cloud/supercomputer systems and is available as open source [pyhessian]. This general framework can be used to analyze neural network models, including the topology of the loss landscape (i.e., curvature information) to gain insight into the behavior of different models/optimizers. To illustrate this, we analyze the effect of residual connections and Batch Normalization layers on the trainability of neural networks. One recent claim, based on simpler first-order analysis, is that residual connections and Batch Normalization make the loss landscape “smoother”, thus making it easier for Stochastic Gradient Descent to converge to a good solution. Our extensive analysis shows new finer-scale insights, demonstrating that, while conventional wisdom is sometimes validated, in other cases it is simply incorrect. In particular, we find that Batch Normalization does not necessarily make the loss landscape smoother, especially for shallower networks.
\NewDocumentCommand\varOs m O
Residual neural networks [he2016deep] (ResNets) are widely used Neural Networks (NNs) for various learning tasks. The two main architectural components of ResNets are residual connections [he2016deep] and Batch Normalization (BN) layers [ioffe2015batch]. However, going beyond motivating stories to characterize precisely when and why these two popular architectural ingredients help or hurt training/generalization—especially in terms of measurable properties of the model—is still largely unsolved. Relatedly, characterizing whether other suggested architectural changes will help or hurt training/generalization is still done in a largely ad hoc manner. For example, it is often motivated by plausible but untested intuitions, and it is not characterized in terms of measurable properties of the model.
In this work, we present and apply PyHessian, an open source scalable framework with which one can directly analyze Hessian information, i.e., second-derivative information w.r.t. model parameters, in order to address these and related questions. PyHessian computes Hessian information by applying known techniques from Numerical Linear Algebra (NLA) [bai1996some, golub2009matrices, lin2016approximating] and Randomized NLA (RandNLA) [Mah-mat-rev_BOOK, RandNLA_PCMIchapter_chapter, drineas2006fast, yao2018hessian, avron2011randomized, ubaru2017fast] (that are approximate but come with rigorous theory). PyHessian enables computing Hessian information—including top Hessian eigenvalues, Hessian trace, and Hessian eigenvalue spectral density (ESD), and it supports distributed implementation—allowing distributed-memory execution on both cloud (e.g., AWS, Google Cloud) and supercomputer systems, for fast and efficient Hessian computation.
As an application of PyHessian, we use it to analyze the impact of residual connections and BN on the trainability of NNs, leading to new insights. In more detail, our main contributions are the following:
We introduce PyHessian, a new framework for direct and efficient computation of Hessian information, including the top eigenvalue, the trace, and the full ESD [pyhessian]. We also apply PyHessian to study how residual connections and BN affect training.
We observe that removing BN layers from ResNet (denoted below as ResNet) leads to rapid increase of the Hessian spectrum (the top eigenvalue, the trace, and the ESD support range). This increase is significantly more rapid for deeper models. See Figure 2, 11, 12, and 13 on Cifar-10 as well as Figure 9, 15, 16, and 17 on Cifar-100.
We observe that, for shallower networks (ResNet20), removing the BN layer results in a flatter Hessian spectrum, as compared to standard ResNet20 with BN. See Figure 2 and 3 on Cifar-10 and Figure 9 and 15 on Cifar-100. This observation is the opposite of the common belief that the addition of BN layers make the loss landscape smoother (which we observe to hold only for deeper networks).
We show that removing residual connections from ResNet generally makes the top eigenvalue, the trace, and the Hessian ESD support range increase slightly. This increase is consistent for both shallower and deeper models (ResNet20/32/38/56). See Figure 2, 3, 14, 18, 19, 20, and 21 on Cifar-10 as well as Figure 9, 15, 22, 23, and 24 on Cifar-100.
We perform Hessian analysis for different stages of ResNet models (details in Section IV-A), and we find that generally BN is more important for the final stages than for earlier stages. In particular, removing BN from the last stage significantly degrades testing performance, with a strong correlation with the Hessian trace. See the comparison between the orange curve and the blue curve in Figure 4 and 8, the accuracy reported in Table II on Cifar-10 (Figure 10, and the accuracy reported on Cifar-100 in Table VI).
Ii Related work
Here, we review work related to Hessian-based analysis for NN training and inference, as well as work that studies the impact of different architectural components on the topology of the NN loss landscape.
Hessian and Large-scale Hessian Computation:
Hessian-based analysis/computation is widely used in scientific computing.
However, due to the (incorrect) belief that Hessian-based computations are infeasible for large NN problems, the majority of work in ML (except for quite small problems) performs only first-order analysis.
Hessian eigenvalues of small NN models were analyzed [sagun2016eigenvalues, sagun2017empirical]; and the work of [pennington2017geometry] studied the geometry of NN loss landscapes by computing the distribution of Hessian eigenvalues at critical points. More recently, [yao2018hessian] used a deflated power-iteration method to compute the top eigenvalues for deep NNs during training. Moreover, the work of [ghorbani2019investigation] measured the Hessian ESD, based on the Stochastic Lanczos algorithm of [lin2016approximating, ubaru2017fast]. Here, we extend the analysis of [ghorbani2019investigation, yao2018hessian] by studying how the depth of the NN model as well as its architecture affect the Hessian spectrum (in terms of top eigenvalue, trace, and full ESD). Furthermore, we also perform block diagonal Hessian spectrum analysis, and we observe a fine-scale relationship between the Hessian spectrum and the impact of adding/removing residual connections and BN.
Hessian-based analysis has also been used in
the context of NN training and inference.
For example, [lecun1991second] analytically computes Hessian information for a single linear layer and uses the Hessian spectrum to determine the optimal learning rate to accelerate training.
In [lecun1990optimal], the authors approximated the Hessian as a diagonal operator and used the inverse of this diagonal matrix to prune NN parameters.
[hassibi1993second] used the inverse of the full Hessian matrix to develop an “Optimal Brain Surgeon” method for pruning NN parameters.
The authors argued that a diagonal approximation may not be very accurate, as off-diagonal elements of the Hessian are important; and they showed that capturing these off-diagonal elements does indeed lead to better performance, as compared to [lecun1990optimal].
In the recent work of [dong2017learning], a layer-wise pruning method was proposed.
This restricts the Hessian computations to each layer, and it provides bounds on the performance drop after pruning.
More recently, [dong2019hawq, shen2019q, dong2019hawqv2] proposed a Hessian-based method for quantizing
(Quasi-)Newton (second-order) methods [agarwal2016second, dembo1982inexact, pearlmutter1994fast, pilanci2017newton, pratt1998gauss, amari1998natural, bottou2018optimization] have been extensively explored for convex optimization problems [boyd2004convex]. In particular, in the seminal work of [nocedal1980updating, liu1989limited], a Quasi-Newton method was proposed to accelerate first-order based optimization methods. The idea is to precondition the gradient vector with the inverse of the Hessian. However, instead of directly using the Hessian, a series of approximate rank-1 updates are used instead. Follow up work of [schraudolph2007stochastic] extended this method and proposed a stochastic BFGS algorithm. More recently, the work of [bollapragada2018progressive] proposed an adaptive batch size Limited-memory BFGS method [liu1989limited] for large-scale machine learning problems; and an adaptive batch size method based on measuring directly the spectrum of the Hessian has been proposed [yao2018large] for large-scale NN training.
Hessian-based methods have also been explored for non-convex problems, including trust-region (TR) [conn2000trust], cubic regularization (CR) [nesterov2006cubic], and its adaptive variant (ARC) [cartis2011adaptiveI, cartis2011adaptiveII]. For these problems, [byrd2011use, erdogdu2015convergence, roosta2016sub, xu2016sub] provide sketching/sampling techniques for Newton methods, where guarantees are established for sampling size and convergence rates; and [xu2017newton, xu2016sub, xu2017second, yao2018inexact] show that sketching/sampling methods can significantly reduce the need for data in approximate Hessian computation.
One important concern for applying second-order methods to training is the cost of computing Hessian information at every iteration. The work of [martens2015optimizing] proposed the so-called Kronecker-Factored Approximations (K-FAC) method, which approximates the Fisher information matrix into a Kronecker product. However, the approach comes with several new hyperparameters, which can actually be more expensive to tune, compared to first-order methods [ma2019inefficiency].
A major limitation in most of this prior work is that tests are typically restricted to small/simple NN models that may not be representative of NN workloads that are encountered in practice. This is in part due to the lack of a scalable and easily programmable framework that could be used to test second-order methods for a wide range of state-of-the-art models. Addressing this is the main motivation behind our development of PyHessian, which is released as open-source software and is available to researchers [pyhessian]. In this paper, we illustrate how PyHessian can be used for analyzing the NN behaviour during training, even for very deep state-of-the-art models. Future work includes using this framework for second-order based optimization, by testing it on modern NN models, as well as fairly gauging the benefit that may arise from such methods, in light of the cost for any extra hyperparameter tuning that may be needed [ma2019inefficiency].
Residual Connections and Batch Normalization: Residual connections [he2016deep] and BN [ioffe2015batch] are two of the most important ingredients in modern convolutional NNs. There have been different hypothesis offered for why these two components help training/generalization. First, the original motivation for residual connections was that they allow gradient information to flow to earlier layers of the NN, thereby reducing the vanishing gradient problem during training. The empirical study of [li2018visualizing] found that deep NNs with residual connections exhibit a significantly smoother loss landscape, as compared to models without residual connections. This was achieved by the so-called filter-normalized random direction method to plot 3D loss landscapes, i.e., not through direct analysis of the Hessian spectrum. This result is interesting, but it is hard to draw conclusions with perturbations in two directions, for a model that has millions of parameters (and thus millions of possible perturbation directions).
Second, the original motivation for why BN helps training/generalization was originally attributed to reducing the so-called Internal Covariance Shift (ICS) [ioffe2015batch]. However, this was disputed in the recent study of [santurkar2018does]. In particular, the work of [santurkar2018does] used first-order analysis to analyze the loss landscape, and found that adding a BN layer results in a smoother loss landscape. Importantly, they found that adding BN does not reduce the so called ICS. Again, while interesting, such first-order analysis may not fully capture the topology of the landscape; and, as we will show with our second-order analysis, this smoothness claim is not correct in general.
The work of [santurkar2018does] also performed an interesting theoretical analysis, showing a connection between adding the BN layer and the Lipschitz constant of the gradient (i.e., the top Hessian eigenvalue). It was argued that adding the BN layer leads to a smaller Lipschitz constant. However, the theoretical analysis is only valid for per-layer Lipschitz constant, as it ignores the complex interaction between different layers. It cannot be extended to the Lipschitz constant of the entire model (and, as we will show, this result does not hold for shallow networks).
For a supervised learning problem, we seek to minimize:
where is the learnable weight parameter, is the loss function, is the input pair, is the NN architecture, and is the size of training data. Below we first discuss how PyHessian computes the second-order statistics, and we then discuss the impact of architectural components on the trainability of the model.
Iii-a Neural Network Hessian Matvec
For a NN with parameters, the gradient of the loss w.r.t. model parameters is a vector
and the second derivative of the loss is a matrix,
commonly called the Hessian. A typical NN model involves millions of parameters, and thus even forming the Hessian is computationally infeasible. However, it is possible to compute properties of the Hessian spectrum without explicitly forming the Hessian matrix. Instead, all we need is an oracle to compute the application of the Hessian to a random vector . This can be achieved by observing the following:
Here, the first equality is the chain rule, the second is due to the independence of to , and the third equality is the definition of the Hessian. Importantly, note that the cost of this Hessian matrix-vector multiply (hereafter referred to as Hessian matvec) is the same as one gradient backpropagation. Having this oracle, we can easily compute the top Hessian eigenvalues using power iteration [yao2018hessian]; see Algorithm 2. However, for a typical NN with millions of parameters, the top eigenvalues may not be representative of how the loss landscape behaves. Therefore, we also compute the trace and ESD of the Hessian, as described below.
Iii-B Hutchinson Method for Hessian Trace Computation
The trace of the Hessian can be computed using RandNLA, and in particular with Hutchinson’s method [bai1996some, avron2011randomized] for the fast computation of the trace, using only Hessian matvec computations (as given in Eq. 2). In particular, since we are interested in the Hessian, i.e., a symmetric matrix, suppose we have a random vector , whose components are i.i.d. sampled from a Rademacher distribution (or Gaussian distribution with mean 0 and variance 1). Then, we have the identity
where is the identity matrix of appropriate size. That is, the trace of can be estimated by computing , where we compute the expectation by drawing multiple random samples. Note that can be efficiently computed from Eq. 2, and then is simply a dot product between the Hessian matvec and the original vector . See Algorithm 3 for a description.
Iii-C Full Eigenvalue Spectral Density
To provide finer-grained information on the Hessian spectrum than is provided by the top eigenvalues or the trace, we need to compute the full empirical spectral density (ESD) of the Hessian eigenvalues, defined as
where is the Dirac distribution and is the eigenvalue of , in descending order.
Recent work in NLA/RandNLA has provided efficient matrix-free algorithms to estimate this ESD [lin2016approximating, golub2009matrices, ubaru2017fast] through Stochastic Lanczos Quadrature (SLQ). Here, we briefly describe SLQ in simple terms. This approach was also used in [ghorbani2019investigation] to compute the Hessian ESD. For more details, see [lin2016approximating, golub2009matrices, ubaru2017fast].
Here is a summary of our approach to compute the ESD . First, we approximate (of Eq. 4) by (Eq. 5 below) by applying a Gaussian kernel (first approximation), and we express this in the same expectation form as in the Hutchinson algorithm (Eq. 9 below). Next, since the computation inside the expectation depends directly on and the unknown eigenvalues (denoted by s), we further simplify the problem by using Gaussian quadrature (Eq. 13 below) (second approximation). Then, since the weights and s in the Gaussian quadrature are still unknown, we use the stochastic Lanczos algorithm to approximate the weights and s (Eq. 14 below) (third approximation). Finally, we approximate the expectation of the eigenvalue distribution as a sum (Eq. 15 below) (forth approximation).
In more detail, for the first approximation, we apply a Gaussian kernel, , with variance to Eq. 4 to obtain
where is the Gaussian kernel. Clearly, , as . Thus, if we had an algorithm to approximate Eq. 5, then we could take the limit and reduce the standard deviation of the Gaussian kernel to approximate Eq. 4. In our context, the question of how to compute amounts to computing the density distribution of the Hessian convolved with a Gaussian kernel.
To do this, observe that
where is the eigendecomposition of , and let be the matrix function defined as
For a given value of , the trace can be efficiently computed using the Hutchinson algorithm (described in §III-B). That is, we draw a random Rademacher vector and compute the expectation to get
However, this is still intractable, as the trace computation needs to be repeated for every value of (which scales with the number of model parameters).
To get around this, we relax this problem further [lin2016approximating, ubaru2017fast]. Define , in which case we have
where is the magnitude (or dot product) of along the eigenvector of . Now let us define a probability distribution w.r.t. with the cumulative distribution function, , as the following piece-wise function:
Then, by the Riemann-Stieltjes integral, it follows that
This integral can be estimated by the Gauss quadrature rule [golub1969calculation],
where is the weight-node pair to estimate the integral. The stochastic Lanczos algorithm can then be used to estimate accurately this quantity [ubaru2017fast, golub2009matrices, lin2016approximating]. Specifically, for the -step Lanczos algorithm, we have eigenpairs . Let , where is the first component of , in which case it follows that
Therefore, as in the Hutchinson algorithm, with multiple different runs (e.g., times) of Lanczos algorithm, can be approximated by
See Algorithm 1 for a description of the SLQ algorithm.
Here, we provide extensive experiments to study the impact of BN and residual connection on the Hessian spectrum. We first start by discussing the experimental settings in §IV-A, followed by presenting the Hessian spectrum results for the entire model in §IV-B as well different ResNet stages in §IV-C.
Iv-a Experimental Setting
Using PyHessian, we measure all three Hessian spectrum metrics (i.e., top eigenvalues, trace, and full ESD) throughout the training process of SGD with momentum. We consider various ResNet [he2016deep] architectures, and in particular ResNet20/32/38/56 on the Cifar-10, and we analyze these models and variants with/without BN and with/without residual connections. We also experimented with same networks tested on Cifar-100 dataset, and all of the observations were consistent. These results are presented in Appendix.
For clarity, we refer to the networks without BN as ResNet, and we refer to the networks without residual connections as ResNet. We train each model with various initial learning rates, and we pick the best performing result for analysis. See Appendix -C for more details on training settings. We analyze the spectrum throughout training at all checkpoints. The accuracy of each model is reported in Table I, and the testing curve is shown in Figure 6.
Iv-B Full Network Hessian Analysis
We start with the original ResNet model with BN and residual connections. Hereafter we refer to this as ResNet. The behaviour of the Hessian trace throughout training is shown in Figure 2. Furthermore, we show the evolution of the Hessian ESD throughout training in Figure 3 for Cifar-10.
As discussed before, a BN layer is crucial for training NN models, and removing this component can adversely affect the generalization performance, as is shown in Table I. The drop in performance is very significant for deeper models. For example, we could not even train ResNet56 on Cifar-10 without a BN layer, even with hyperparameter tuning.
The first interesting observation is that removing BN layer (denoted by ResNet) exhibits different behaviour for shallower versus deeper models. For example, for ResNet20 we see that removing BN results in smaller trace and Hessian ESD values, as compared to baseline, as shown in Figure 2 (orange curve versus blue curve), and 3 (second versus first column). In more detail, from the evolution plot of Figure 3 throughout training, it can be seen that the ESD of ResNet 20 initially reduces significantly and centers around zero. That is, the model gets attracted to areas with a significantly large number of small/degenerate Hessian directions. This continues until epoch 30, at which point the training gets attracted to regions of the loss landscape with several non-degenerate Hessian directions.
This clearly shows that training without BN makes training harder, but it does not necessarily mean that the Hessian spectrum is going to be larger than the baseline model, despite the claim made by [santurkar2018does]. In fact, we only observe the smoothing behaviour proposed by [santurkar2018does] for deeper NN models. For example, observe the Hessian trace plot of ResNet32/38, shown in Figure 2. Here, the Hessian trace of ResNet 32 increases to 10000 from zero, as compared to 2000 for ResNet. The Hessian ESD also exhibits the same behaviour, as shown in Figure 12 and 13. We can clearly see that the range of eigenvalues of ResNet is significantly larger, as compared to ResNet.
The Hessian ESD of ResNet32 and ResNet38 throughout the training process is shown in Figure 12, 13. Again, we see the interesting behaviour that without the BN layer, the spectrum initially converges to degenerate Hessian directions, before finding non-degenerate directions in later epochs of training. The Hessian trace and the range of the Hessian ESD significantly increases as the model gets deeper.
These plots show the numerical values of the Hessian spectrum. However, the results could be more intuitively presented via parametric plots of the loss landscape. We plot the parametric 3D loss landscapes of ResNet20/38 on Cifar-10 with/without BN in Figure 1 (compare left and middle columns). These plots are computed by perturbing the model parameters across the first and second eigenvectors of the Hessian. For ResNet20, it can be clearly seen that removing the BN layer (middle plot) results in convergence to a flatter local minimum, as compared to ResNet20 with BN. This observation is the opposite of the common belief that adding BN layer makes the loss landscape smoother [santurkar2018does]. However, for ResNet38, we can also see that removing the BN layer results in convergence to a point with a higher value of loss. The visualizations corroborate our finding that initially ResNet finds points with degenerate Hessian directions, before converging to a point with non-degenerate directions. We provide more visualizations for ResNet20 (Figure 18), ResNet32 (Figure 19), and ResNet38 (Figure 20), which show the same behaviour.
In summary, our empirical results highlight two points. First, our findings show several fine-scale behaviours when the BN layer is removed. Importantly, we find that the observation made in [santurkar2018does] only holds for deeper models, and are not necessarily true for shallow networks. Second, using the scalable Hessian-based techniques implemented in PyHessian, one can test the hypotheses that these or other claims hold more generally. For example, we observed exactly similar behaviour for Cifar-100 dataset, as shown in Appendix -E.
|RM BN stage 1||91.28%||91.98%||92.20%||92.19%|
|RM BN stage 2||91.49%||91.94%||91.70%||92.20%|
|RM BN stage 3||90.59%||88.57%||86.96%||73.77%|
We next study the impact of residual connections on the smoothness of the loss landscape. Removing residual connections leads to slightly poorer generalization, as shown in Table I, although the degradation is much smaller than removing the BN layer.
We report the behaviour of the Hessian trace for ResNet in Figure 2 for ResNet20/32/38/56 on Cifar-10. It can clearly be seen that the trace of ResNet is consistently higher than that of ResNet, for both shallow and deep models on different datasets.
In addition, from the Hessian ESD in Figure 3, 11, 12, 13, and 14, we can see that the top eigenvalues as well as the support range of ESD of ResNet increases for deeper models. These results are in line with the findings of [li2018visualizing].
We also visualize the loss landscape of these models in Figure 1, 18, 19, 20, and 21. It can clearly be seen that the converging point for ResNet becomes sharper, as compared with ResNet, as the depth grows.
Again, our empirical results highlight two points. First, we make observations that provide a finer-scale understanding of seemingly-contradictory claims in the previous literature. Second, using the scalable Hessian-based techniques that are implemented in PyHessian, one can ask these questions and test the hypotheses that these or other claims hold more generally. Similar to the previous section, we saw exactly similar behaviour for Cifar-100, as reported in Appendix -E.
Iv-C Stage-wise Hessian Analysis
We also analyzed the impact of removing BN and residual connection from different stages of the model. We define each stage of ResNet as blocks with the same activation resolution, as schematically shown in Figure 5.
We plot the Hessian trace for the three stages of ResNet32 on Cifar-10 in Figure 4 (similar plots for ResNet20/38/56 on Cifar-10 is shown in Figure 8). We can clearly see that removing the BN from the last stage of ResNet32 results in a more rapid increase in the Hessian trace, as compared to removing BN from the first or second stage. Interestingly, this has a direct correlation with the final generalization performance reported in Table II. We can see that removing BN in the third stage results in higher accuracy drop, as compared to removing it from other stages. A similar trend exists for other models (ResNet20/38); and we generally observe the same behaviour on Cifar-100, as reported in Figure 10 and Table VI.
As for the residual connections, we can see that removing them results in a relatively smaller increase in the Hessian trace, and correspondingly the impact of removing the residual connections on accuracy is smaller, as compared to removing BN. See Table III for Cifar-10.
|RM Res stage 1||91.52%||92.27%||91.74%||91.79%|
|RM Res stage 2||91.06%||91.07%||91.08%||91.28%|
|RM Res stage 3||91.54%||92.09%||92.14%||92.34%|
Iv-D Summary of Results
Table IV presents a summary of the tables and figures used in this work and their corresponding properties, i.e., Accuracy, Trace, ESD, and Loss Landscape.
|Accuracy||Table I, Figure 6||Table V, Figure 7|
|RM BN Acc.||Table II||Table VI|
|RM Res Acc||Table III||Table VII|
|Trace||Figure 2||Figure 9|
|Stage-wise Trace||Figure 4, 8||Figure 10|
|ESD||Figure 3, 11, 12,||Figure 15, 16,|
|Loss Landscape||Figure 1, 18, 19,||Figure 22, 23,|
We have developed PyHessian, an open-source framework for analyzing NN behaviour through the lens of the Hessian [pyhessian]. PyHessian enables direct and efficient computation of Hessian-based statistics, including the top eigenvalues, the trace, and the full ESD, with support for distributed-memory execution on cloud/supercomputer systems. Importantly, since it uses matrix-free techniques, PyHessian accomplishes this without the need to form the full Hessian. This means that we can compute second-order statistics for state-of-the-art NNs in times that are only marginally longer than the time used by popular stochastic gradient-based techniques.
As a typical application, we have also shown how PyHessian can be used to study in detail the impact of popular NN architectural changes (such as adding/modifying BN and residual connections) on the NN loss landscape. Importantly, we found that adding BN layers does not necessarily result in a smoother loss landscape, as claimed by [santurkar2018does]. We have observed this phenomenon only for deeper models, where removing the BN layer results in convergence to “sharp” local minima that have high training loss and poor generalization, but it does not seem to hold for shallower models. We also showed that removing residual connections resulted in a slightly coarser loss landscape, a finding which we illustrated with parametric 3D visualizations, and which all three Hessian spectrum metrics confirmed. We have open-sourced PyHessian to encourage reproducibility and as a scalable framework for research on second-order optimization methods, on practical diagnostics for NN learning/generalization, as well as on analytics tools for NNs more generally.
This work was supported by a gracious fund from Intel and Samsung. We are also grateful for a gracious fund from Google Cloud, Google TFTC team, as well as support from the Amazon AWS. We would also like to acknowledge ARO, DARPA, NSF, and ONR for providing partial support of this work.
In this appendix, we present additional results to complement and extend the results presented in the main text.
-a Illustration of ResNet Stages
In Figure 5, we show the illustration of ResNet20 on Cifar-10/100 and its three stages.
-C Training Details
We train each model (ResNet, ResNet, and ResNet) for 180 epochs, with five different initial learning rates (0.1, 0.05, 0.01, 0.005, 0.001) on Cifar-10, and ten different initial learning rates (0.1, 0.05, 0.01, 0.005, 0.001, 0.0005, 0.0004, 0.0003, 0.0002, 0.00001) on Cifar-100. The optimizer is SGD with momentum (0.9). The learning rate decays by a factor of 10 at epoch 80, 120.
-D Loss Landscape Details
The parametric loss landscape plots are plotted by perturbing the model parameters, , along the first and second top eigenvectors of the Hessian, denoted as and . Then, we compute the loss of K (in our case, ) data points with the following formula,
-E Extra Results
In the remainder of this appendix, we present additional results that we described in the main text. See Table IV for a summary.
|RM BN stage 1||65.69%||65.74%||67.31%|
|RM BN stage 2||65.62%||64.68%||66.46%|
|RM BN stage 3||65.63%||64.57%||61.04%|
|RM Res stage 1||66.46%||66.94%||67.61%|
|RM Res stage 2||65.70%||66.05%||66.70%|
|RM Res stage 3||66.21%||66.38%||66.03%|
- The naïve view arises since the Hessian matrix is of size (say) . Thus, like most linear algebra computations, exact full spectral computations (which are sufficient but never necessary) cost time.
- Quantization is a process in which the precision of the parameters is reduced from single precision (32-bits) to a lower precision (such as 8-bits).