Scaling Laws for the Principled Design, Initialization and Preconditioning of ReLU Networks
In this work, we describe a set of rules for the design and initialization of well-conditioned neural networks, guided by the goal of naturally balancing the diagonal blocks of the Hessian at the start of training. Our design principle balances multiple sensible measures of the conditioning of neural networks. We prove that for a ReLU-based deep multilayer perceptron, a simple initialization scheme using the geometric mean of the fan-in and fan-out satisfies our scaling rule. For more sophisticated architectures, we show how our scaling principle can be used to guide design choices to produce well-conditioned neural networks, reducing guess-work.
The design of neural networks is often considered a black art, driven by trial and error rather than foundational principles. This is exemplified by the success of recent architecture random-search techniques (Zoph and Le, 2016; Li and Talwalkar, 2019), which take the extreme of applying no human guidance at all. Although as a field we are far from fully understanding the nature of learning and generalization in neural networks, this does not mean that we should proceed blindly.
This work derives various scaling laws by investigating a simple guiding principle:
All else being equal, the diagonal blocks of the Hessian corresponding to each weight matrix should have similar average singular values.
This condition is important when the stochastic gradient algorithm is used and can help even when adaptive optimization methods are used, as they have no notion of the correct conditioning right at initialization.
In this work we define a scaling quantity $\gamma_l$ for each layer $l$ that approximates the average singular value, involving the second moments of the forward-propagated values and the second moments of the backward-propagated gradients. We argue that networks with constant $\gamma_l$ across layers are better conditioned than those without, and we analyze how common layer types affect this quantity. We call networks that obey this rule preconditioned neural networks, in analogy to preconditioning of linear systems.
As an example of some of the possible applications of our theory, we:
Propose a principled weight initialization scheme that can often provide an improvement over existing schemes;
Show which common layer types automatically result in well-conditioned networks;
Show how to improve the conditioning of common structures such as bottlenecked residual blocks by the addition of fixed scaling constants to the network (Detailed in Appendix E).
We will use the multilayer perceptron (i.e. a classical feed-forward deep neural network) as a running example, as it is the simplest non-trivial deep neural network structure. We use ReLU activation functions, and use the following notation for layer $l$ (following He et al., 2015):
$$y_l = W_l x_l + b_l, \qquad x_{l+1} = \mathrm{ReLU}(y_l),$$
where $W_l$ is an $n_l^{out} \times n_l^{in}$ matrix of weights, $b_l$ is the bias vector, $y_l$ is the preactivation vector and $x_l$ is the input activation vector for the layer. The quantities $n_l^{out}$ and $n_l^{in}$ are called the fan-out and fan-in of the layer respectively. We also denote the gradient of the loss with respect to a quantity (i.e. the back-propagated gradient) with the prefix $\Delta$. We initially focus on the least-squares loss. Additionally, we assume that each bias vector is initialized with zeros unless otherwise stated.
3 Conditioning by balancing the Hessian
Our proposed approach focuses on the spectrum of the diagonal blocks of the Hessian. In the case of an MLP network, each diagonal block corresponds to the weights from a single weight matrix $W_l$ or bias vector $b_l$. This block structure is used by existing approaches such as K-FAC and variants (Martens and Grosse, 2015; Grosse and Martens, 2016; Ba et al., 2017; George et al., 2018), which correct the gradient step using estimates of second-order information. In contrast, our approach modifies the network to improve the Hessian without modifying the step.
When using the ReLU activation function, as we consider in this work, a neural network is no longer a smooth function of its inputs, and the Hessian becomes ill-defined at some points in the parameter space. Fortunately, the spectrum is still well-defined at any twice-differentiable point, and this gives a local measure of the curvature. ReLU networks are typically twice-differentiable almost everywhere, which is the case when none of the activations or weights are exactly 0 for instance. We assume this throughout the remainder of this work.
3.1 GR scaling: A measure of Hessian average conditioning
Our analysis will proceed with batch-size 1 and a network with outputs. We consider the network at initialization, where weights are centered, symmetric and i.i.d random variables, and biases are set to zero.
ReLU networks have a particularly simple structure for the Hessian with respect to any set of activations, as the network’s output is a piecewise-linear function fed into a final layer consisting of a loss. This structure results in greatly simplified expressions for diagonal blocks of the Hessian with respect to the weights.
We will consider the output of the network as a composition of two functions: the current layer $F$, and the remainder of the network $R$. We write this as a function of the weights, i.e. $\mathrm{net}(W_l) = R(F(W_l))$. The dependence on the input to the network is implicit in this notation, and the network below layer $l$ does not need to be considered.
Let $R_l$ be the Hessian of $R$, the remainder of the network after application of layer $l$ (recall $y_l = W_l x_l$). Let $G_l$ be the Jacobian of $y_l$ with respect to $W_l$. The Jacobian has shape $n_l^{out} \times (n_l^{out} n_l^{in})$. Given these quantities, the diagonal block of the Hessian corresponding to $W_l$ is equal to:
$$G_l^T R_l G_l,$$
which is the $l$th diagonal block of the (Generalized) Gauss-Newton matrix (Martens, 2014). We will use this fact to simplify our analysis. We discuss this decomposition further in Appendix B.1. Note that each row of $G_l$ has $n_l^{in}$ non-zero elements, each containing a value from $x_l$. This structure can be written as a block matrix,
$$G_l = \begin{bmatrix} x_l & 0 & \cdots & 0 \\ 0 & x_l & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & x_l \end{bmatrix}, \tag{1}$$
where each $x_l$ is a row vector. This can also be written as a Kronecker product with an identity matrix as $G_l = I_{n_l^{out}} \otimes x_l$.
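The block structure of the layer Jacobian can be checked numerically. The following is a minimal NumPy sketch, not the authors' code: the sizes are illustrative, and the row-major flattening of the weight matrix is an assumption made here so that the Kronecker form lines up with the weights.

```python
import numpy as np

rng = np.random.default_rng(0)
n_out, n_in = 3, 4                      # illustrative fan-out and fan-in
W = rng.standard_normal((n_out, n_in))  # weight matrix of one layer
x = rng.standard_normal(n_in)           # input activations of that layer

# Jacobian of y = W x with respect to the row-major flattening of W:
# an identity matrix Kronecker-multiplied with x viewed as a row vector.
G = np.kron(np.eye(n_out), x[None, :])

# Multiplying G by the flattened weights reproduces the layer output.
y_from_jacobian = G @ W.reshape(-1)

# Each row of G holds one copy of x, so it has n_in non-zero entries.
nonzeros_per_row = (G != 0).sum(axis=1)
```

Because the layer is linear in its weights, the Jacobian does not depend on the weights themselves, only on the input activations.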
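Our quantity of interest is the average squared singular value of the diagonal block $G_l^T R_l G_l$ (layer Jacobian $G_l$, upper Hessian $R_l$), which is simply equal to the (element-wise) second moment of the product of $G_l^T R_l G_l$ with an i.i.d. normal random vector $r$:
$$\gamma_l := E\left[(G_l^T R_l G_l r)^2\right] \approx n_l^{in} E[x_l^2]^2 \frac{E[\Delta y_l^2]}{E[y_l^2]}.$$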
Balancing this theoretically derived GR scaling quantity in a network will produce an initial optimization problem for which the blocks of the Hessian are expected to be approximately balanced with respect to their average singular value.
Due to the large number of approximations needed for this derivation, including complete independence between forward and backward signals (Appendix A), we don’t claim that this theoretical approximation is accurate, or that the blocks will be closely matched in practice. Rather, we make the lesser claim that a network with very disproportionate values of the GR scaling between layers is likely to have more convergence difficulties during the early stages of optimization than one for which the scalings are balanced.
To check the quality of our approximation, we computed the ratio of the convolutional version of the GR scaling equation (Equation 8) to the actual product for a strided (rather than max-pooled, see Table 1) LeNet model, where we use random input data and a random loss (i.e. for outputs we use for an i.i.d normal matrix ), with batch-size 1024, and input images. The results are shown in Figure 1 for 100 sampled setups; there is generally good agreement with the theoretical expectation.
4 Preconditioning balances weight-to-gradient ratios
We provide further motivation for the utility of preconditioning by comparing it to another simple quantity of interest. Consider, at network initialization, the ratio of the (element-wise) second moment of each weight-matrix gradient to that of the weight matrix itself:
$$\nu_l := \frac{E[\Delta W_l^2]}{E[W_l^2]}.$$
This ratio approximately captures the relative change that a single SGD step with unit step size on $W_l$ will produce. We call this quantity the weight-to-gradient ratio. When $E[\Delta W_l^2]$ is very small compared to $E[W_l^2]$, the weights will stay close to their initial values for longer than when $E[\Delta W_l^2]$ is large. In contrast, if $E[\Delta W_l^2]$ is very large compared to $E[W_l^2]$, then learning can be expected to be unstable, as the sign of the elements of $W_l$ may change rapidly between optimization steps.
In either case the global learning rate can be chosen to correct the step’s magnitude; however, this affects all weight matrices equally, possibly making the step too small for some weight matrices and too large for others. By matching $\nu_l$ across the network, we avoid this problem. Remarkably, this weight-to-gradient ratio turns out to be equivalent to the GR scaling for MLP networks:
$$\nu_l = \frac{E[\Delta W_l^2]}{E[W_l^2]} = n_l^{in} E[x_l^2]^2 \frac{E[\Delta y_l^2]}{E[y_l^2]} = \gamma_l.$$
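As an illustration of the weight-to-gradient ratio for a single layer with batch-size 1, the sketch below computes it directly from a weight matrix and its gradient. The function name `weight_to_gradient_ratio`, the layer sizes and the random stand-ins for the activations and back-propagated gradient are assumptions made for this example only.

```python
import numpy as np

def weight_to_gradient_ratio(W, grad_W):
    """Ratio of element-wise second moments, E[grad_W^2] / E[W^2]."""
    return np.mean(grad_W ** 2) / np.mean(W ** 2)

rng = np.random.default_rng(0)
n_out, n_in = 64, 384
W = 0.05 * rng.standard_normal((n_out, n_in))   # arbitrary illustrative scale
x = np.maximum(rng.standard_normal(n_in), 0.0)  # ReLU-like input activations
dy = rng.standard_normal(n_out)                 # stand-in back-propagated gradient

grad_W = np.outer(dy, x)                        # batch-size-1 gradient of y = W x
nu = weight_to_gradient_ratio(W, grad_W)

# For an outer product the element-wise second moment factorizes exactly.
nu_factored = np.mean(dy ** 2) * np.mean(x ** 2) / np.mean(W ** 2)
```

The factorized form is the one used when the ratio is related to the second moments of the forward and backward signals.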
5 Preconditioning of neural networks via initialization
For ReLU networks with a classical multilayer-perceptron (i.e. non-convolutional, non-residual) structure, we show in this section that initialization using i.i.d. mean-zero random variables with second moment inversely proportional to the geometric mean of the fans,
$$E[W_l^2] = \frac{c}{\sqrt{n_l^{in} n_l^{out}}}, \tag{2}$$
for some fixed constant $c$, gives a constant GR scaling throughout the network.
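A minimal sketch of such an initializer is given below. The Gaussian distribution is an assumption (the text only requires i.i.d. mean-zero weights with the stated second moment), and the function name `geometric_init` and the value of the constant used in the example are illustrative.

```python
import numpy as np

def geometric_init(fan_out, fan_in, c, rng):
    """Gaussian weights with second moment c / sqrt(fan_in * fan_out)."""
    second_moment = c / np.sqrt(fan_in * fan_out)
    return rng.standard_normal((fan_out, fan_in)) * np.sqrt(second_moment)

rng = np.random.default_rng(0)
W = geometric_init(fan_out=256, fan_in=512, c=2.0, rng=rng)  # c chosen for illustration

target = 2.0 / np.sqrt(512 * 256)   # prescribed second moment
empirical = np.mean(W ** 2)         # sample second moment of the drawn weights
```

Any other centered, symmetric distribution with the same second moment could be swapped in for the Gaussian.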
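Let $W_l$ and $W_{l+1}$ be weight matrices satisfying the geometric initialization criteria of Equation 2, and let $b_l$ and $b_{l+1}$ be zero-initialized bias parameters. Then consider the following sequence of two layers, where $x_l$ and $\Delta y_{l+1}$ are i.i.d., mean 0, uncorrelated and symmetrically distributed:
$$y_l = W_l x_l + b_l, \qquad x_{l+1} = \mathrm{ReLU}(y_l), \qquad y_{l+1} = W_{l+1} x_{l+1} + b_{l+1}.$$
Then the weight-to-gradient ratios satisfy $\nu_l = \nu_{l+1}$, and so the GR scalings satisfy $\gamma_l = \gamma_{l+1}$.
Note that the ReLU operation halves both the forward and backward second moments, due to our assumptions on the distributions of $x_l$ and $\Delta y_{l+1}$. So:
$$E[x_{l+1}^2] = \tfrac{1}{2} E[y_l^2], \qquad E[\Delta y_l^2] = \tfrac{1}{2} E[\Delta x_{l+1}^2]. \tag{3}$$
Consider the first weight-gradient ratio, using $E[\Delta W_l^2] = E[x_l^2] E[\Delta y_l^2]$:
$$\nu_l = \frac{E[x_l^2] E[\Delta y_l^2]}{E[W_l^2]}.$$
Under our assumptions, backwards propagation to $\Delta x_{l+1}$ results in $E[\Delta x_{l+1}^2] = n_{l+1}^{out} E[W_{l+1}^2] E[\Delta y_{l+1}^2]$, so:
$$\nu_l = \frac{1}{2} n_{l+1}^{out} \frac{E[W_{l+1}^2]}{E[W_l^2]} E[x_l^2] E[\Delta y_{l+1}^2] = \frac{1}{2} \sqrt{n_l^{in} n_{l+1}^{out}} \, E[x_l^2] E[\Delta y_{l+1}^2], \tag{4}$$
where the last step uses Equation 2 together with $n_{l+1}^{in} = n_l^{out}$.
Now consider the second weight-gradient ratio:
$$\nu_{l+1} = \frac{E[x_{l+1}^2] E[\Delta y_{l+1}^2]}{E[W_{l+1}^2]}.$$
Under our assumptions, applying forward propagation gives $E[y_l^2] = n_l^{in} E[W_l^2] E[x_l^2]$ and so from Equation 3 we have:
$$\nu_{l+1} = \frac{1}{2} n_l^{in} \frac{E[W_l^2]}{E[W_{l+1}^2]} E[x_l^2] E[\Delta y_{l+1}^2] = \frac{1}{2} \sqrt{n_l^{in} n_{l+1}^{out}} \, E[x_l^2] E[\Delta y_{l+1}^2],$$
which matches Equation 4, so $\nu_l = \nu_{l+1}$. ∎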
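The two-layer argument can be checked for a deeper network by propagating second moments layer by layer. The sketch below is an idealized calculation under the independence assumptions of the text, not a simulation of a trained network; the function name `weight_gradient_ratios`, the layer widths and the choice of unit input and output-gradient second moments are all assumptions of this example.

```python
import math

def weight_gradient_ratios(widths, second_moment, input_m2=1.0, out_grad_m2=1.0):
    """Propagate second moments through a ReLU MLP; return one ratio per layer.

    widths = [n_0, ..., n_L]; layer l maps widths[l] -> widths[l + 1].
    """
    fans = list(zip(widths[:-1], widths[1:]))          # (fan_in, fan_out)
    w2 = [second_moment(fi, fo) for fi, fo in fans]    # E[W_l^2]

    # Forward: E[y^2] = fan_in * E[W^2] * E[x^2]; ReLU halves the second moment.
    x2 = []
    m = input_m2
    for (fi, _), w in zip(fans, w2):
        x2.append(m)
        m = 0.5 * fi * w * m

    # Backward: E[dx^2] = fan_out * E[W^2] * E[dy^2]; ReLU halves it again.
    dy2 = [0.0] * len(fans)
    g = out_grad_m2
    for l in reversed(range(len(fans))):
        dy2[l] = g
        g = 0.5 * fans[l][1] * w2[l] * g

    # E[dW^2] / E[W^2], using E[dW^2] = E[dy^2] * E[x^2].
    return [dy2[l] * x2[l] / w2[l] for l in range(len(fans))]

widths = [100, 384, 64, 10]
geometric = weight_gradient_ratios(widths, lambda fi, fo: 2.0 / math.sqrt(fi * fo))
fan_in = weight_gradient_ratios(widths, lambda fi, fo: 2.0 / fi)
```

With these widths the geometric scheme yields the same ratio for every layer, while the fan-in scheme yields ratios that differ by more than an order of magnitude between the first and last layers.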
This relation also holds for sequences of (potentially) strided convolutions, but only if the same kernel size is used everywhere, and a nonzero-padding scheme is used such as reflective padding.
5.1 Traditional Initialization schemes
For the feed-forward network above, assuming random activations, the forward-activation variance will remain constant in expectation throughout the network if fan-in initialization of weights (LeCun et al., 2012) is used:
$$\mathrm{Var}[W_l] = \frac{2}{n_l^{in}},$$
whereas the fan-out variant maintains a constant variance of the back-propagated signal:
$$\mathrm{Var}[W_l] = \frac{2}{n_l^{out}}.$$
The constant factor 2 in the above expressions corrects for the variance-reducing effect of the ReLU activation. Although popularized by He et al. (2015), similar scaling was in use in early neural network models that used tanh activation functions (Bottou, 1988).
These two principles are clearly in conflict; unless $n_l^{in} = n_l^{out}$, either the forward variance or backward variance will become non-constant, or as more commonly expressed, either explode or vanish. No prima facie reason for preferring one initialization over the other is provided. Unfortunately, there is some confusion in the literature, as many works reference using Kaiming initialization without specifying whether the fan-in or fan-out variant is used.
The Xavier initialization (Glorot and Bengio, 2010) is the closest to our proposed approach. They balance these conflicting objectives using the arithmetic mean:
$$\mathrm{Var}[W_l] = \frac{2}{n_l^{in} + n_l^{out}} = \frac{1}{\frac{1}{2}\left(n_l^{in} + n_l^{out}\right)},$$
to “… approximately satisfy our objectives of maintaining activation variances and back-propagated gradients variance as one moves up or down the network”. This approach to balancing is essentially heuristic, in contrast to the geometric mean approach that our theory directly guides us to.
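To make the difference between the schemes concrete, the sketch below tabulates the weight second moment each one prescribes for a single layer. The helper name `init_second_moments` is hypothetical, the Xavier entry uses the original Glorot and Bengio formula without a ReLU correction, and the constant 2 in the geometric entry is an assumption chosen so that it is comparable with the two ReLU-corrected variants.

```python
import math

def init_second_moments(fan_in, fan_out):
    """Weight second moments prescribed by four schemes for one layer."""
    return {
        "fan_in": 2.0 / fan_in,                          # forward-preserving variant
        "fan_out": 2.0 / fan_out,                        # backward-preserving variant
        "xavier": 2.0 / (fan_in + fan_out),              # Glorot and Bengio (2010)
        "geometric": 2.0 / math.sqrt(fan_in * fan_out),  # constant 2 assumed here
    }

wide_to_narrow = init_second_moments(fan_in=384, fan_out=64)
square = init_second_moments(fan_in=128, fan_out=128)
```

With this constant, the geometric value is exactly the geometric mean of the fan-in and fan-out values, and all three coincide for a square layer.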
5.2 Geometric initialization balances biases
We can use the same proof technique to compute the GR scaling for the bias parameters in a network. Our update equations change to include the bias term: $y_l = W_l x_l + b_l$, with $b_l$ assumed to be initialized at zero. We show in Appendix D that, for a multilayer perceptron:
$$\gamma_l^b = \frac{E[\Delta y_l^2]}{E[y_l^2]}.$$
It is easy to show using the techniques of Section 5 that the biases of consecutive layers have equal GR scaling as long as geometric initialization is used. However, unlike in the case of weights, we have less flexibility in the choice of the numerator. Instead of allowing all weights to be scaled by $c$ for any positive $c$, we require that $c = 2$, so that:
$$E[W_l^2] = \frac{2}{\sqrt{n_l^{in} n_l^{out}}}. \tag{7}$$
5.3 Network input scaling balances weights against biases
It is traditional to normalize a dataset before applying a neural network so that the input vector has mean 0 and variance 1 in expectation. This principle is rarely questioned in modern neural networks, despite the fact that there is no longer a good justification for its use in modern ReLU-based networks. In contrast, our theory provides direct guidance for the choice of input scaling. We show that the second moment of the input affects the GR scaling of bias and weight parameters differently, and that they can be balanced by careful choice of the initialization.
Consider the GR scaling values for the bias and weight parameters in the first layer of a ReLU-based multilayer perceptron network, as considered in previous sections. We assume the data is already centered. Then the scaling factors for the weight and bias layers are:
$$\gamma_0 = n_0^{in} E[x_0^2]^2 \frac{E[\Delta y_0^2]}{E[y_0^2]}, \qquad \gamma_0^b = \frac{E[\Delta y_0^2]}{E[y_0^2]}.$$
We can cancel terms to find the value of $E[x_0^2]$ that makes these two quantities equal:
$$E[x_0^2] = \frac{1}{\sqrt{n_0^{in}}}.$$
In common computer vision architectures such as VGG (detailed below), the input planes are the 3 color channels, and the kernel size is $k = 3$, giving an effective fan-in of $3k^2 = 27$ and $E[x_0^2] = 1/\sqrt{27} \approx 0.19$. Using the traditional variance-one normalization will result in the effective learning rate for the bias terms being lower than that of the weight terms. This will result in potentially slower learning of the bias terms than for the input scaling we propose.
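The balancing input scale can be computed directly. The sketch below assumes that the weight scaling exceeds the bias scaling by a factor of $n^{in} k^2 E[x^2]^2$ for a convolution with $k \times k$ kernels (with $k = 1$ recovering the dense case); this factor is an assumption of the example, as is the helper name `balanced_input_second_moment`.

```python
import math

def balanced_input_second_moment(in_channels, kernel_size=1):
    """Input second moment at which in_channels * k^2 * E[x^2]^2 equals 1."""
    effective_fan_in = in_channels * kernel_size ** 2
    return 1.0 / math.sqrt(effective_fan_in)

# First layer of a VGG-style network: 3 color channels, 3x3 kernels.
vgg_first_layer = balanced_input_second_moment(in_channels=3, kernel_size=3)

# Factor by which the weight scaling exceeds the bias scaling when the
# input is instead normalized to unit second moment.
imbalance_at_unit_variance = 3 * 3 ** 2 * 1.0 ** 2
```

Under this assumption, unit-variance inputs leave the first-layer weights scaled 27 times more strongly than the biases.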
5.4 Output second moments
A neural network’s behavior is also very sensitive to the second moment of the outputs. For a convolutional network without pooling layers (but potentially with strided dimensionality reduction), if geometric-mean initialization is used the activation second moments are given by:
The application of a sequence of these layers gives a telescoping product:
We potentially have independent control over this second moment at initialization, as we can insert a fixed scalar multiplication factor at the end of the network that modifies it. This may be necessary when adapting a network architecture that was designed and tested under a different initialization scheme, as the success of the architecture may be partially due to the output scaling that happens to be produced by that original initialization. We are not aware of any existing theory guiding the choice of output variance at initialization for the case of log-softmax losses, where it has a non-trivial effect on the back-propagated signals, although output variances of 0.01 to 0.1 appear to work well. The output variance should always be checked and potentially corrected when switching initialization schemes.
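One way to apply such a correction is to measure the output second moment at initialization and derive a fixed multiplier from it. The sketch below uses random numbers as a stand-in for the network outputs; the function name `output_scale` and the target value of 0.05 (picked from within the 0.01 to 0.1 range mentioned above) are assumptions of this example.

```python
import numpy as np

def output_scale(outputs, target_second_moment):
    """Fixed scalar that rescales `outputs` to a target second moment."""
    current = np.mean(outputs ** 2)
    return np.sqrt(target_second_moment / current)

rng = np.random.default_rng(0)
outputs = 7.0 * rng.standard_normal((1024, 10))  # stand-in for outputs at initialization
s = output_scale(outputs, target_second_moment=0.05)
rescaled = s * outputs                           # outputs after the fixed multiplier
```

The multiplier is computed once at initialization and then kept constant rather than trained.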
6 Designing well-conditioned neural networks
|Operation||Preserves scaling||Notes|
|Linear layer||✓||Will not be well-conditioned against other layers unless geometric initialization is used|
|(Strided) convolution||✓||As above, but only if all kernel sizes are the same|
|Skip connections||✗||Operations in residual blocks will be scaled correctly against each other, but not against non-residual operations|
|ReLU/LeakyReLU||✓||Any positively-homogeneous function with degree 1|
|Tanh||✗||Maintains scaling if entirely within the linear regime|
Our scaling principle can be used for the design of more complex network structures as well. In this section, we detail the general principles that can be used to design well-conditioned networks with more complicated structures.
6.1 Convolutional networks
The concept of GR scaling may be extended to convolutional layers with kernel width , batch-size , and output resolution . A straight-forward derivation gives expressions for the convolution weight and biases of:
This requires an assumption of independence of the values of activations within a channel that is not true in practice, so the approximation tends to be further away from empirical estimates for convolutional layers than for non-convolutional layers, although it is still a useful guide. The effect of padding is also ignored here. The standard technique of padding with zeros will only cause a modest decrease in output variance, and so it is typically safe to ignore this additional complication except in extremely deep networks. Sequences of convolutions are well scaled against each other as long as the kernel size remains the same. The scaling of layers involving differing kernel sizes can be corrected using the addition of constants into the network (Appendix E).
6.2 Maintaining conditioning
Consider a network where the GR scaling $\gamma_l$ is constant throughout. We may add an additional layer between any two existing layers without affecting this conditioning, as long as the new layer maintains the activation-gradient second-moment product:
$$E[\Delta x_{l+1}^2] E[x_{l+1}^2] = E[\Delta x_l^2] E[x_l^2],$$
and dimensionality; this follows from Equation 8. For instance, adding a simple scaling layer of the form $x_{l+1} = \sqrt{2}\, x_l$ doubles the second moment during the forward pass and doubles the backward second moment during back-propagation, which maintains this product:
$$E[\Delta x_{l+1}^2] E[x_{l+1}^2] = \tfrac{1}{2} E[\Delta x_l^2] \cdot 2 E[x_l^2] = E[\Delta x_l^2] E[x_l^2].$$
When spatial dimensionality changes between layers we can see that the GR scaling is no longer maintained just by balancing this product, as the GR scaling depends directly on the square of the spatial dimension. Instead, a pooling operation that changes the forward and backward signals in a way that counteracts the change in spatial dimension is needed. The use of stride-2 convolutions, as well as average pooling, results in the correct scaling, but other types of pooling generally do not. Table 1 lists operations that preserve scaling when inserted into an existing preconditioned network. Operations such as linear layers preserve the scaling of existing layers but are only themselves well-scaled if they are initialized correctly. For an architecture such as ResNet-50 that uses operations that break scaling, some constants should be introduced into the network to correct scaling. In a ResNet-50, residual connections, max-pooling and varying kernel sizes need to be corrected for (we describe this procedure in Appendix E).
6.3 Conditioning multipliers
We can change the value of for a single layer without modifying the forward or backward propagated signals in the network via reparametrization (Lafond et al., 2017). If we introduce an additional scalar :
and modify the initialization of such that is unchanged, then the backward signal is unchanged, however the value of changes and the GR scaling is then multiplied by :
By introducing these untrained conditioning constants we may modify the GR scaling of each block independently, and potentially improve the initial conditioning of any given network.
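The effect of a conditioning constant can be illustrated on a single dense layer. The sketch below is an assumption-laden example rather than the paper's construction: it places a fixed scalar `alpha` in front of a rescaled weight matrix `W_prime`, and it measures conditioning with the weight-to-gradient ratio of Section 4; all names and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_out, n_in, alpha = 8, 16, 2.0
W = rng.standard_normal((n_out, n_in))
x = rng.standard_normal(n_in)
dy = rng.standard_normal(n_out)          # back-propagated gradient at the layer output

# Original layer y = W x and its weight gradient.
y = W @ x
grad_W = np.outer(dy, x)

# Reparametrized layer y = alpha * (W_prime x), with alpha * W_prime == W.
W_prime = W / alpha
y_prime = alpha * (W_prime @ x)
grad_W_prime = alpha * np.outer(dy, x)   # chain rule through the fixed scalar
grad_x_prime = alpha * (W_prime.T @ dy)  # backward signal to the layer input

ratio = np.mean(grad_W ** 2) / np.mean(W ** 2)
ratio_prime = np.mean(grad_W_prime ** 2) / np.mean(W_prime ** 2)
```

In this sketch the forward output and the backward signal are unchanged, while the weight-to-gradient ratio of the layer changes by a factor of `alpha ** 4`, since the gradient grows by `alpha` and the weights shrink by `alpha`.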
7 Experimental Results
|Method||Average Normalized loss||Worst in #||Best in #|
We considered a selection of dense and moderate-sparsity multi-class classification datasets from the LibSVM repository, 26 in total. The same model was used for all datasets: a non-convolutional ReLU network with 3 weight layers in total. The inner two layers were fixed at 384 and 64 nodes respectively. These numbers were chosen to result in a larger gap between the optimization methods, as very little difference could be expected if a more typical gap was used.
For every dataset, learning rate and initialization combination we ran 10 seeds and picked the median loss after 5 epochs as the focus of our study (the largest differences can be expected early in training). A range of learning rates (in powers of 2) was checked for each dataset and initialization combination, with the best learning rate chosen in each case based on the median of the 10 seeds. Training loss was used as the basis of our comparison as we care primarily about convergence rate, and are comparing identical network architectures. Some additional details concerning the experimental setup and which datasets were used are available in the Appendix.
Table 2 shows that geometric initialization is the most consistent of the initialization approaches considered. It has the lowest loss, after normalizing each dataset, and it is never the worst of the 4 methods on any dataset. Interestingly, the fan-out method is most often the best method, but consideration of the per-dataset plots (Appendix F) shows that it often completely fails to learn for some problems, which pulls down its average loss and results in it being the worst for 9 datasets.
Although not a panacea, by using the scaling principle we have introduced, neural networks can be designed with a reasonable expectation that they will be optimizable by stochastic gradient methods, minimizing the amount of guess-and-check neural network design. As a consequence of our scaling principle, we have derived an initialization scheme that automatically preconditions common network architectures. Most developments in neural network theory attempt to explain the success of existing techniques post-hoc. Instead, we show the power of the scaling law approach by deriving a new initialization technique from theory directly.
- Ba et al.  Jimmy Ba, Roger Grosse, and James Martens. Distributed second-order optimization using kronecker-factored approximations. International Conference On Learning Representations (ICLR2017), 2017.
- Balles and Hennig  Lukas Balles and Philipp Hennig. Dissecting adam: The sign, magnitude and variance of stochastic gradients. International conference on machine learning (ICML), 2018.
- Bernstein et al.  Jeremy Bernstein, Yu-Xiang Wang, Kamyar Azizzadenesheli, and Anima Anandkumar. signSGD: Compressed optimisation for non-convex problems. International conference on machine learning (ICML), 2018.
- Bottou  Léon Bottou. Reconnaissance de la parole par reseaux connexionnistes. In Proceedings of Neuro Nimes 88, pages 197–218, Nimes, France, 1988. URL http://leon.bottou.org/papers/bottou-88b.
- Du et al.  Simon S. Du, Wei Hu, and Jason D. Lee. Algorithmic regularization in learning deep homogeneous models: Layers are automatically balanced. Neural Information Processing Systems (NIPS), 2018.
- George et al.  Thomas George, César Laurent, Xavier Bouthillier, Nicolas Ballas, and Pascal Vincent. Fast approximate natural gradient descent in a kronecker-factored eigenbasis, 2018.
- Glorot and Bengio  Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, 2010.
- Grosse and Martens  Roger Grosse and James Martens. A kronecker-factored approximate fisher matrix for convolution layers. International Conference on Machine Learning (ICML2016), 2016.
- Hanin  Boris Hanin. Which neural net architectures give rise to exploding and vanishing gradients? In Advances in Neural Information Processing Systems 31, pages 582–591. Curran Associates, Inc., 2018.
- He et al.  Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), 2015.
- He et al.  Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
- Jacobi  C. G. J. Jacobi. Ueber eine neue auflösungsart der bei der methode der kleinsten quadrate vorkommenden lineären gleichungen. Astron. Nachrichten, 1845.
- Kingma and Ba  Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. International Conference for Learning Representations (ICLR2015), 2015.
- Lafond et al.  Jean Lafond, Nicolas Vasilache, and Léon Bottou. Diagonal rescaling for neural networks. ArXiv e-prints, 2017.
- Latala  Rafal Latala. Some estimates of norms of random matrices. Proc. Amer. Math. Soc., 2005.
- LeCun et al.  Yann A. LeCun, Léon Bottou, Genevieve B. Orr, and Klaus-Robert Müller. Neural Networks: Tricks of the Trade, chapter Efficient BackProp. Springer, 2012.
- Li and Talwalkar  Liam Li and Ameet S. Talwalkar. Random search and reproducibility for neural architecture search. CoRR, abs/1902.07638, 2019.
- Martens  James Martens. New insights and perspectives on the natural gradient method. In ArXiv e-prints, 2014.
- Martens and Grosse  James Martens and Roger Grosse. Optimizing neural networks with kronecker-factored approximate curvature. International conference on machine learning (ICML), 2015.
- Radford et al.  Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In ArXiv e-prints, 2015.
- Ronneberger et al.  Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In Nassir Navab, Joachim Hornegger, William M. Wells, and Alejandro F. Frangi, editors, Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015. Springer, 2015.
- Rudelson and Vershynin  Mark Rudelson and Roman Vershynin. Non-asymptotic theory of random matrices: extreme singular values. Proceedings of the International Congress of Mathematicians, 2010.
- Russakovsky et al.  Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015. doi: 10.1007/s11263-015-0816-y.
- Saxe et al.  Andrew M. Saxe, James L. McClelland, and Surya Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. In ArXiv e-prints, 2014.
- Simonyan and Zisserman  Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. International Conference on Learning Representations (ICLR2015), 2015.
- Sutskever et al.  Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. On the importance of initialization and momentum in deep learning. In International conference on machine learning (ICML), pages 1139–1147, 2013.
- Tao  Terence Tao. Topics in Random Matrix Theory. American Mathematical Soc., 2012.
- Tieleman and Hinton  T. Tieleman and G. Hinton. Lecture 6.5—RmsProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012.
- Vershynin  Roman Vershynin. Introduction to the non-asymptotic analysis of random matrices. Compressed sensing, 2012.
- Xiao et al.  Lechao Xiao, Yasaman Bahri, Jascha Sohl-Dickstein, Samuel S. Schoenholz, and Jeffrey Pennington. Dynamical isometry and a mean field theory of cnns: How to train 10,000-layer vanilla convolutional neural networks. ICML, 2018.
- Zbontar et al.  Jure Zbontar, Florian Knoll, Anuroop Sriram, Matthew J. Muckley, Mary Bruno, Aaron Defazio, Marc Parente, Krzysztof J. Geras, Joe Katsnelson, Hersh Chandarana, Zizhao Zhang, Michal Drozdzal, Adriana Romero, Michael Rabbat, Pascal Vincent, James Pinkerton, Duo Wang, Nafissa Yakubova, Erich Owens, C. Lawrence Zitnick, Michael P. Recht, Daniel K. Sodickson, and Yvonne W. Lui. fastMRI: An open dataset and benchmarks for accelerated MRI. In ArXiv e-prints, 2018.
- Zhang et al.  Hongyi Zhang, Yann N. Dauphin, and Tengyu Ma. Residual learning without normalization via better initialization. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=H1gsz30cKX.
- Zoph and Le  Barret Zoph and Quoc V. Le. Neural architecture search with reinforcement learning. CoRR, abs/1611.01578, 2016. URL http://arxiv.org/abs/1611.01578.
Appendix A Assumptions
The following assumptions are used in the derivation of the GR scaling:
(A1) The input and target values are drawn element-wise i.i.d from a centered symmetric distribution with known variance.
(A2) The Hessian of the remainder of the network above each block, with respect to the output, has Frobenius norm much larger than 1. We make this assumption so that we can neglect all but the highest order terms that are polynomial in this norm.
(A3) All activations, pre-activations and gradients are independently distributed element-wise. In practice due to the mixing effect of multiplication by random weight matrices, only the magnitudes of these quantities are correlated, and the effect is small for wide networks due to the law of large numbers. Independence assumptions of this kind are common when approximating second-order methods; the block-diagonal variant of K-FAC [Martens and Grosse, 2015] makes similar assumptions for instance.
Assumption A2 is the most problematic of these assumptions, and we make no claim that it holds in practice. However, we are primarily interested in the properties of blocks and their scaling with respect to each other, not their absolute scaling. Assumption A2 results in very simple expressions for the scaling of the blocks without requiring a more complicated analysis of the top of the network. Similar theory can be derived for other assumptions on the output structure, such as the assumption that the target values are much smaller than the outputs of the network.
Appendix B GR scaling derivation
To compute the second moment of the elements of $G_l^T R_l G_l r$, we can calculate the second moment of matrix-random-vector products against $G_l$, $R_l$ and $G_l^T$ separately, since $R_l$ is uncorrelated with $G_l$, and the back-propagated gradient $\Delta y_l$ is uncorrelated with $y_l$ (Assumption A3).
Jacobian products $G_l r$ and $G_l^T r$
Recall that each row of $G_l$ has $n_l^{in}$ non-zero elements (Equation 1), each containing a value from $x_l$. The value $x_l$ is i.i.d. random at the bottom layer of the network (Assumption A1). For layers further up, the multiplication by a random weight matrix from the previous layer ensures that the entries of $x_l$ are identically distributed (Assumption A3). So we have:
$$E[(G_l r)^2] = n_l^{in} E[r^2] E[x_l^2].$$
Note that we didn’t assume that the input $x_l$ is mean zero, so $\mathrm{Var}[x_l] \neq E[x_l^2]$. This is needed as often the input to a layer is the output from a ReLU operation, which will not be mean zero.
For the transposed case, we have a single entry per column, so:
$$E[(G_l^T r)^2] = E[r^2] E[x_l^2].$$
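The two Jacobian-product second moments can be verified without sampling the random vector, because for a unit-variance i.i.d. vector $r$ the expected squared norm of a matrix-vector product equals the squared Frobenius norm of the matrix. The sketch below assumes the Kronecker form of the layer Jacobian; the sizes and the ReLU-like input are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_out, n_in = 5, 7
x = np.maximum(rng.standard_normal(n_in), 0.0)   # ReLU-like input, not mean zero
G = np.kron(np.eye(n_out), x[None, :])           # shape (n_out, n_out * n_in)

# For unit-variance i.i.d. r, E||M r||^2 equals the squared Frobenius norm of M.
frob2 = np.sum(G ** 2)
m2_Gr = frob2 / G.shape[0]    # per-element second moment of G r
m2_GTr = frob2 / G.shape[1]   # per-element second moment of G^T r

mean_x2 = np.mean(x ** 2)     # empirical second moment of the input
```

The forward product picks up a factor of the fan-in, while the transposed product does not, which is where the asymmetry between the two expressions comes from.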
Upper Hessian product
Instead of using , for any random , we will instead compute it for , it will have the same expectation since both and are uncorrelated with (Assumption A3). The piecewise linear structure of the network above with respect to the makes the structure of particularly simple. It is a least-squares problem for some that is the linearization of the remainder of the network. The gradient is and the Hessian is simply . So we have that
Applying this gives:
b.1 The Gauss-Newton matrix
Standard ReLU classification and regression networks have a particularly simple structure for the Hessian with respect to the input, as the network’s output is a piecewise-linear function fed into a final layer consisting of a convex log-softmax operation, or a least-squares loss. This structure results in the Hessian with respect to the input being equivalent to its Gauss-Newton approximation. The Gauss-Newton matrix can be written in a factored form, which is used in the analysis we perform in this work. We emphasize that this is just used as a convenience when working with diagonal blocks; the GN representation is not an approximation in this case.
The (Generalized) Gauss-Newton matrix $G$ is a positive semi-definite approximation of the Hessian of a non-convex function $f$, given by factoring $f$ into the composition of two functions $f(x) = h(g(x))$, where $h$ is convex and $g$ is approximated by its Jacobian matrix $J$ at $x$ for the purpose of computing $G$:
$$G = J^T \left(\nabla^2 h(g(x))\right) J.$$
The GN matrix also has close ties to the Fisher information matrix [Martens, 2014], providing another justification for its use.
Surprisingly, the Gauss-Newton decomposition can be used to compute diagonal blocks of the Hessian with respect to the weights as well as the inputs [Martens, 2014]. To see this, note that for any activation $y_l$, the layers above may be treated in a combined fashion as the $h$ in an $f = h \circ g$ decomposition of the network structure, as they are the composition of a (locally) linear function and a convex function and thus convex. In this decomposition $g(W_l) = W_l x_l + b_l$ is a function of $W_l$ with $x_l$ fixed, and as this is linear in $W_l$, the Gauss-Newton approximation to the block is thus not an approximation.
Appendix C The Weight gradient ratio is equal to GR scaling for MLP models
The weight-gradient ratio is equal to the GR scaling for i.i.d. mean-zero, randomly initialized multilayer perceptron layers under the independence assumptions of Appendix A.
To see the equivalence, note that under the zero-bias initialization, we have from that:
The gradient of the weights is given by and so its second moment is:
Combining these quantities gives:
Appendix D Bias scaling
We consider the case of a convolutional neural network with spatial resolution for greater generality. Consider the Jacobian of with respect to the bias. It has shape . Each row corresponds to an output, and each column to a bias weight. As before, we will approximate the product of with a random i.i.d. unit-variance vector :
The structure of is that each block of rows has the same set of 1s in the same column, with only a single 1 per row. It follows that:
The calculation of the product of with is approximated in the same way as in the weight scaling calculation. For the product, note that there is an additional as each column has non-zero entries, each equal to 1. Combining these three quantities gives:
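The Jacobian structure described above can be made concrete with a small sketch. It assumes the convolution output is flattened channel-major, with `rho` a hypothetical name for the spatial side length, so each bias entry is broadcast over `rho * rho` output positions. It also assumes the column count matters for the transposed Jacobian applied to a random unit-variance vector. The function name and sizes are illustrative only.

```python
import random


def bias_jacobian(n_channels, rho):
    """Jacobian of a conv output (flattened channel-major) w.r.t. the bias.

    Rows are indexed by (channel, spatial position), columns by bias entry.
    """
    rows = []
    for c in range(n_channels):
        for _ in range(rho * rho):
            rows.append([1.0 if b == c else 0.0 for b in range(n_channels)])
    return rows


n_channels, rho = 3, 4
J = bias_jacobian(n_channels, rho)
row_sums = [sum(row) for row in J]                                 # one 1 per row
col_sums = [sum(row[b] for row in J) for b in range(n_channels)]  # rho^2 per column

# Monte Carlo estimate of the per-element second moment of J^T r,
# where r has i.i.d. unit-variance entries.
rng = random.Random(0)
trials = 2000
second_moment = 0.0
for _ in range(trials):
    r = [rng.gauss(0.0, 1.0) for _ in J]
    jt_r = [sum(J[k][b] * r[k] for k in range(len(J)))
            for b in range(n_channels)]
    second_moment += sum(v * v for v in jt_r) / n_channels
second_moment /= trials
print(col_sums, second_moment)
```

Each column sums `rho * rho` independent unit-variance entries, so the estimated second moment should sit close to `rho * rho`, which is the extra spatial factor that appears in the bias scaling.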
Consider the setup of Proposition 3, with the addition of biases:
As long as the weights are initialized following Equation 7 and the biases are initialized to 0, we have that
We will include as a variable as it clarifies its relation to other quantities. We reuse some calculations from Proposition 3, namely that:
Plugging these into the definition of :
For , we require the additional quantity:
Again plugging this in:
So comparing these expressions for and , we see that if and only if
Appendix E Conditioning of ResNets Without Normalization Layers
There has been significant recent interest in training residual networks without the use of batch-normalization or other normalization layers [Zhang et al., 2019]. In this section, we explore the modifications that are necessary to a network for this to be possible and show how to apply our preconditioning principle to these networks.
The building block of a ResNet model is the residual block:
where in this notation is a composition of layers. Unlike classical feedforward architectures, the pass-through connection results in an exponential increase in the variance of the activations in the network as the depth increases. A side effect of this is that the output of the network becomes exponentially more sensitive to the input of the network as depth increases, a property characterized by the Lipschitz constant of the network [Hanin, 2018].
This exponential dependence can be reduced by the introduction of scaling constants to each block:
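The growth described above, and the effect of a scaling constant, can be illustrated with a rough simulation. This is a sketch under simplifying assumptions, not the setup analyzed in this work: the residual branch is a single random linear map with variance-preserving initialization (no ReLU, no convolution), the scaling constant is assumed to multiply the residual branch, and `residual_growth` and `scale` are hypothetical names.

```python
import math
import random


def second_moment(v):
    return sum(x * x for x in v) / len(v)


def residual_growth(depth, width, scale, seed=0):
    """Second-moment growth through `depth` blocks x <- x + scale * W x.

    Each W is a fresh random matrix with N(0, 1/width) entries, so W x has
    the same expected second moment as x. This linear branch is a stand-in
    for a multi-layer residual branch.
    """
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(width)]
    start = second_moment(x)
    std = 1.0 / math.sqrt(width)
    for _ in range(depth):
        branch = [sum(rng.gauss(0.0, std) * xi for xi in x)
                  for _ in range(width)]
        x = [xi + scale * bi for xi, bi in zip(x, branch)]
    return second_moment(x) / start


depth, width = 8, 128
unscaled = residual_growth(depth, width, scale=1.0)
scaled = residual_growth(depth, width, scale=0.25)
print("unscaled growth:", unscaled, "reference 2^depth:", 2.0 ** depth)
print("scaled growth:  ", scaled, "reference:", (1 + 0.25 ** 2) ** depth)
```

In this toy model the expected second moment is multiplied by `1 + scale**2` per block, so the growth is still exponential in depth but with a base much closer to one when `scale` is small. A single random draw only tracks these reference values approximately.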
The introduction of these constants requires a modification of the block structure to ensure constant conditioning between blocks. A standard bottleneck block, as used in the ResNet-50 architecture, has the following form:
In this notation, is a convolution that reduces the number of channels 4-fold, is a convolution with equal input and output channels, and is a convolution that increases the number of channels back up 4-fold to the original input count.
If we introduce a scaling factor to each block , then we must also add conditioning multipliers to each convolution to change their GR scaling, as we described in Section 6.3. The correct scaling constant depends on the scaling constant of the previous block. A simple calculation gives the equation:
Since scaling is relative, the first block may be scaled with and . We recommend using a flat for all to avoid having to introduce the factors. The block structure including the factors is:
The weights of each convolution must then be initialized with the standard deviation modified such that the combined convolution-scaling operation gives the same output variance as would be given if the geometric-mean initialization scheme is used without extra scaling constants. For instance, the initialization of the convolution must have standard deviation scaled down by dividing by so that the multiplication by during the forward pass results in the correct forward variance. The factor corrects for the change in kernel shape for the middle convolution.
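The idea of folding a conditioning multiplier into the initialization can be sketched as follows. This is only an illustration under stated assumptions: a dense layer stands in for the convolution, the kernel-shape correction is omitted, `gain` is a placeholder for the constant fixed by the initialization rule of this work (not reproduced here), and all function names are hypothetical.

```python
import math
import random


def geometric_mean_std(fan_in, fan_out, gain=1.0):
    """Std whose variance is proportional to 1 / sqrt(fan_in * fan_out).

    `gain` is a placeholder constant, not the value used in the paper.
    """
    return math.sqrt(gain / math.sqrt(fan_in * fan_out))


def init_std_with_multiplier(fan_in, fan_out, multiplier, gain=1.0):
    """Shrink the init std so that multiplier * (W x) keeps the base variance."""
    return geometric_mean_std(fan_in, fan_out, gain) / multiplier


def output_second_moment(std, multiplier, fan_in, fan_out, seed=0):
    """Second moment of multiplier * (W x) for one random W and x."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(fan_in)]
    out = [multiplier * sum(rng.gauss(0.0, std) * xi for xi in x)
           for _ in range(fan_out)]
    return sum(o * o for o in out) / fan_out


fan_in, fan_out, alpha = 64, 256, 4.0
plain = output_second_moment(
    geometric_mean_std(fan_in, fan_out), 1.0, fan_in, fan_out)
folded = output_second_moment(
    init_std_with_multiplier(fan_in, fan_out, alpha), alpha, fan_in, fan_out)
print(plain, folded)
```

Both calls use the same seed, so the two weight matrices differ only by the factor `alpha`, and the forward second moments agree. What changes is the scale of the stored weights, which is the quantity the conditioning multiplier is meant to adjust.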
E.1 Correction for mixed residual and non-residual blocks
Since the initial convolution in a ResNet-50 model is also not within a residual block, its GR scaling is different from that of the convolutions within residual blocks. Consider the composition of a non-residual followed by a residual block, without max-pooling or ReLUs for simplicity of exposition:
Without loss of generality, we assume that , and assume a single channel input and output.
Our goal is to find a constant , so that . Note that the initialization of is tied to inversely, so that the variance of is independent of the choice of . Our scaling factor will also depend on the kernel sizes used in the two convolutions, so we must include those in the calculations.
From Equation 9, the GR scaling for is
Note that so:
For the residual convolution, we need to use a modification of the standard GR equation due to the residual branch. The derivation of for non-residual convolutions assumes that the remainder of the network above the convolution responds linearly (locally) with the scaling of the convolution, but here due to the residual connection, this is no longer the case. For instance, if the weight were scaled to zero, the output of the network would not also become zero (recall our assumption of zero-initialization for bias terms). This can be avoided by noting that the ratio in the GR scaling may be computed further up the network, as long as any scaling in between is corrected for. In particular, we may compute this ratio at the point after the residual addition, as long as we include the factor to account for this. So we in fact have:
We now equate :
Therefore to ensure that we need:
A similar calculation applies when the residual block is before the non-residual convolution, as in the final linear layer of the ResNet, giving a scaling factor for the linear layer (effective kernel size 1) of:
Appendix F Full experimental results
Details of input/output scaling
To prevent the results from being skewed by the number of classes and the number of inputs affecting the output variance, the logit output of the network was scaled to have standard deviation 0.05 after the first minibatch evaluation for every method, with the scaling constant fixed thereafter. LayerNorm was used on the input to whiten the data. Weight decay of 0.00001 was used for every dataset. To aggregate the losses across datasets we divided by the worst loss across the initializations before averaging.
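The two bookkeeping steps above, fixing the logit scale from the first minibatch and normalizing losses by the worst initialization before averaging, can be sketched as below. The function names and all numbers are made up for illustration. The sketch assumes the population standard deviation over all logits in the batch and takes "worst" to mean the largest loss; neither detail is specified in the text.

```python
import statistics


def logit_scale(first_batch_logits, target_std=0.05):
    """Constant that rescales the first minibatch's logits to `target_std`."""
    flat = [v for row in first_batch_logits for v in row]
    return target_std / statistics.pstdev(flat)


def aggregate_losses(losses):
    """losses[dataset][init] -> mean over datasets of loss / worst loss."""
    inits = sorted(next(iter(losses.values())))
    result = {}
    for init in inits:
        normalized = [per_init[init] / max(per_init.values())
                      for per_init in losses.values()]
        result[init] = sum(normalized) / len(normalized)
    return result


# Made-up numbers purely to exercise the functions.
toy_logits = [[2.0, -1.0, 0.5], [0.0, 3.0, -2.5]]
scale = logit_scale(toy_logits)          # computed once, then held fixed
scaled_logits = [[scale * v for v in row] for row in toy_logits]

toy_losses = {
    "dataset_a": {"init_x": 0.5, "init_y": 1.0},
    "dataset_b": {"init_x": 4.0, "init_y": 3.0},
}
scores = aggregate_losses(toy_losses)
print(scale, scores)
```

Dividing by the worst loss puts every dataset on a scale from 0 to 1 before averaging, so a dataset with large raw losses does not dominate the aggregate.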
Plots show the interquartile range (25%, 50% and 75% quantiles) of the best learning rate for each case.