Scaling Laws for the Principled Design, Initialization and Preconditioning of ReLU Networks
Abstract
In this work, we describe a set of rules for the design and initialization of well-conditioned neural networks, guided by the goal of naturally balancing the diagonal blocks of the Hessian at the start of training. Our design principle balances multiple sensible measures of the conditioning of neural networks. We prove that for a ReLU-based deep multilayer perceptron, a simple initialization scheme using the geometric mean of the fan-in and fan-out satisfies our scaling rule. For more sophisticated architectures, we show how our scaling principle can be used to guide design choices to produce well-conditioned neural networks, reducing guesswork.
1 Introduction
The design of neural networks is often considered a black art, driven by trial and error rather than foundational principles. This is exemplified by the success of recent architecture random-search techniques (Zoph and Le, 2016; Li and Talwalkar, 2019), which take the extreme of applying no human guidance at all. Although as a field we are far from fully understanding the nature of learning and generalization in neural networks, this does not mean that we should proceed blindly.
This work derives various scaling laws by investigating a simple guiding principle:
All else being equal, the diagonal blocks of the Hessian corresponding to each weight matrix should have similar average singular values.
This condition is important when the stochastic gradient algorithm is used, and can help even when adaptive optimization methods are used, as such methods have no notion of the correct conditioning at initialization.
In this work we define a scaling quantity for each layer that approximates the average singular value of the corresponding Hessian block, involving the second moments of the forward-propagated values and the second moments of the backward-propagated gradients. We argue that networks in which this quantity is constant across layers are better conditioned than those in which it is not, and we analyze how common layer types affect this quantity. We call networks that obey this rule preconditioned neural networks, in analogy to the preconditioning of linear systems.
As an example of some of the possible applications of our theory, we:

Propose a principled weight initialization scheme that can often provide an improvement over existing schemes;

Show which common layer types automatically result in well-conditioned networks;

Show how to improve the conditioning of common structures such as bottlenecked residual blocks by the addition of fixed scaling constants to the network (detailed in Appendix E).
2 Notation
We will use the multilayer perceptron (i.e. a classical feed-forward deep neural network) as a running example, as it is the simplest non-trivial deep neural network structure. We use ReLU activation functions, and use the following notation for layer $l$ (following He et al., 2015):
$$y_l = W_l x_l + b_l, \qquad x_{l+1} = \operatorname{ReLU}(y_l),$$
where $W_l$ is an $n_{\text{out}} \times n_{\text{in}}$ matrix of weights, $b_l$ is the bias vector, $y_l$ the pre-activation vector and $x_l$ is the input activation vector for the layer. The quantities $n_{\text{out}}$ and $n_{\text{in}}$ are called the fan-out and fan-in of the layer respectively. We also denote the gradient of a quantity with respect to the loss (i.e. the back-propagated gradient) with the prefix $\Delta$. We initially focus on the least-squares loss. Additionally, we assume that each bias vector is initialized with zeros unless otherwise stated.
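As a concrete sketch of this notation in numpy (the shapes and the simple $1/\sqrt{n_{\text{in}}}$ scale here are illustrative placeholders, not the initialization derived later in the paper):

```python
import numpy as np

def layer_forward(W, b, x):
    """One MLP layer: pre-activation y = W x + b, then x_{l+1} = ReLU(y)."""
    y = W @ x + b
    return y, np.maximum(y, 0.0)

fan_out, fan_in = 64, 128                    # rows and columns of W
rng = np.random.default_rng(0)
W = rng.normal(0.0, 1.0 / np.sqrt(fan_in), size=(fan_out, fan_in))
b = np.zeros(fan_out)                        # biases initialized to zero
x = rng.normal(size=fan_in)
y, x_next = layer_forward(W, b, x)
print(y.shape, x_next.shape)                 # (64,) (64,)
```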
3 Conditioning by balancing the Hessian
Our proposed approach focuses on the spectrum of the diagonal blocks of the Hessian. In the case of a MLP network, each diagonal block corresponds to the weights from a single weight matrix $W_l$ or bias vector $b_l$. This block structure is used by existing approaches such as K-FAC and variants (Martens and Grosse, 2015; Grosse and Martens, 2016; Ba et al., 2017; George et al., 2018), which correct the gradient step using estimates of second-order information. In contrast, our approach modifies the network to improve the Hessian, without modifying the step.
When using the ReLU activation function, as we consider in this work, a neural network is no longer a smooth function of its inputs, and the Hessian becomes ill-defined at some points in the parameter space. Fortunately, the spectrum is still well-defined at any twice-differentiable point, and this gives a local measure of the curvature. ReLU networks are typically twice-differentiable almost everywhere; this is the case, for instance, when none of the activations or weights are exactly 0. We assume this throughout the remainder of this work.
3.1 GR scaling: A measure of Hessian average conditioning
Our analysis will proceed with batch-size 1 and a network with $k$ outputs. We consider the network at initialization, where the weights are centered, symmetric, i.i.d. random variables, and the biases are set to zero.
ReLU networks have a particularly simple structure for the Hessian with respect to any set of activations, as the network's output is a piecewise-linear function fed into a final layer consisting of a convex loss. This structure results in greatly simplified expressions for the diagonal blocks of the Hessian with respect to the weights.
We will consider the output of the network as a composition of two functions: the current layer $y_l = W_l x_l + b_l$, and $R_l$, the remainder of the network above layer $l$. We write the loss as a function of the weights, i.e. $R_l(W_l x_l + b_l)$. The dependence on the input to the network is implicit in this notation, and the network below layer $l$ does not need to be considered.
Let $\mathcal{R}_l$ be the Hessian of $R_l$, the remainder of the network after application of layer $l$, with respect to $y_l$. Let $J_l$ be the Jacobian of $y_l$ with respect to $W_l$; the Jacobian has shape $n_{\text{out}} \times n_{\text{out}} n_{\text{in}}$. Given these quantities, the diagonal block of the Hessian corresponding to $W_l$ is equal to
$$G_l = J_l^T \mathcal{R}_l J_l,$$
the $l$-th diagonal block of the (Generalized) Gauss-Newton matrix (Martens, 2014). We will use this fact to simplify our analysis. We discuss this decomposition further in Appendix B.1. Note that each row of $J_l$ has $n_{\text{in}}$ nonzero elements, each containing a value from $x_l$. This structure can be written as a block matrix,
$$J_l = \begin{bmatrix} x_l^T & & \\ & \ddots & \\ & & x_l^T \end{bmatrix}, \tag{1}$$
where each $x_l^T$ is a row vector. This can also be written as a Kronecker product with an identity matrix, as $J_l = I \otimes x_l^T$.
Our quantity of interest is the average squared singular value of $G_l$, which is simply equal to the (element-wise) second moment of the product of $G_l$ with an i.i.d. normal random vector $r$:
Proposition 1.
Balancing this theoretically derived GR scaling quantity in a network will produce an initial optimization problem for which the blocks of the Hessian are expected to be approximately balanced with respect to their average singular value.
Due to the large number of approximations needed for this derivation, including complete independence between the forward and backward signals (Appendix A), we do not claim that this theoretical approximation is accurate, or that the blocks will be closely matched in practice. Rather, we make the lesser claim that a network with very disproportionate values of the GR scaling between layers is likely to have more convergence difficulties during the early stages of optimization than one for which they are balanced.
To check the quality of our approximation, we computed the ratio of the convolutional version of the GR scaling equation (Equation 8) to the actual empirical value for a strided (rather than max-pooled, see Table 1) LeNet model, where we use random input data and a random least-squares loss (i.e. for outputs $z$ we use $\frac{1}{2}\|Rz\|^2$ for an i.i.d. normal matrix $R$), with batch-size 1024 and randomly generated input images. The results are shown in Figure 1 for 100 sampled setups; there is generally good agreement with the theoretical expectation.
4 Preconditioning balances weight-to-gradient ratios
We provide further motivation for the utility of preconditioning by comparing it to another simple quantity of interest. Consider, at network initialization, the ratio of the (element-wise) second moments of each weight-matrix gradient to the weight matrix itself:
$$\rho_l = \frac{\mathbb{E}\left[\Delta W_l^2\right]}{\mathbb{E}\left[W_l^2\right]}.$$
This ratio approximately captures the relative change that a single SGD step with unit step size on $W_l$ will produce. We call this quantity the weight-to-gradient ratio. When $\Delta W_l$ is very small compared to $W_l$, the weights will stay close to their initial values for longer than when it is large. In contrast, if $\Delta W_l$ is very large compared to $W_l$, then learning can be expected to be unstable, as the sign of the elements of $W_l$ may change rapidly between optimization steps.
In either case the global learning rate can be chosen to correct the step's magnitude; however, this affects all weight matrices equally, possibly making the step too small for some weight matrices and too large for others. By matching $\rho_l$ across the network, we avoid this problem. Remarkably, this weight-to-gradient ratio turns out to be equivalent to the GR scaling for MLP networks (Proposition 6, Appendix C).
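This ratio is directly measurable. The following is a minimal numeric sketch of the quantity for a single layer at initialization (all shapes and the initialization scale are hypothetical placeholders), using the outer-product form of the weight gradient:

```python
import numpy as np

rng = np.random.default_rng(0)
fan_out, fan_in = 64, 256

W  = rng.normal(0.0, 0.1, size=(fan_out, fan_in))  # some initialization
x  = rng.normal(size=fan_in)    # layer input at initialization
dy = rng.normal(size=fan_out)   # back-propagated gradient at the output

dW = np.outer(dy, x)            # gradient of the loss w.r.t. W for batch-size 1
ratio = np.mean(dW ** 2) / np.mean(W ** 2)
print(ratio)                    # the weight-to-gradient ratio for this layer
```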
5 Preconditioning of neural networks via initialization
For ReLU networks with a classical multilayer perceptron (i.e. non-convolutional, non-residual) structure, we show in this section that initialization using i.i.d. mean-zero random variables with second moment inversely proportional to the geometric mean of the fans:
$$\mathbb{E}\left[W_{ij}^2\right] = \frac{c}{\sqrt{n_{\text{in}}\, n_{\text{out}}}} \tag{2}$$
for some fixed constant $c$, gives a constant GR scaling throughout the network.
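Equation 2 can be implemented directly; the sketch below (a hypothetical helper with illustrative fan sizes and $c = 2$) draws a normal matrix and checks that its empirical second moment matches the geometric-mean prescription:

```python
import numpy as np

def geometric_init(fan_out, fan_in, c=2.0, rng=None):
    """Draw W with E[W_ij^2] = c / sqrt(fan_in * fan_out) (Equation 2)."""
    rng = rng if rng is not None else np.random.default_rng()
    std = np.sqrt(c / np.sqrt(fan_in * fan_out))
    return rng.normal(0.0, std, size=(fan_out, fan_in))

W = geometric_init(256, 1024, rng=np.random.default_rng(1))
target = 2.0 / np.sqrt(1024 * 256)
print(np.mean(W ** 2), target)   # empirical second moment vs. Equation 2
```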
Proposition 3.
Let $W_1$ (of shape $n_1 \times n_0$) and $W_2$ (of shape $n_2 \times n_1$) be weight matrices satisfying the geometric initialization criterion of Equation 2, and let $b_1$ and $b_2$ be zero-initialized bias parameters. Then consider the following sequence of two layers, where the entries of $x_1$ and $\Delta y_2$ are i.i.d., mean 0, uncorrelated and symmetrically distributed:
$$y_1 = W_1 x_1 + b_1, \qquad x_2 = \operatorname{ReLU}(y_1), \qquad y_2 = W_2 x_2 + b_2.$$
Then $\frac{\mathbb{E}\left[\Delta W_1^2\right]}{\mathbb{E}\left[W_1^2\right]} = \frac{\mathbb{E}\left[\Delta W_2^2\right]}{\mathbb{E}\left[W_2^2\right]}$, and so the two layers have equal GR scaling.
Proof.
Note that the ReLU operation halves both the forward and backward second moments, due to our assumptions on the distributions of $y_1$ and $\Delta x_2$. So:
$$\mathbb{E}\left[x_2^2\right] = \tfrac{1}{2}\mathbb{E}\left[y_1^2\right], \qquad \mathbb{E}\left[\Delta y_1^2\right] = \tfrac{1}{2}\mathbb{E}\left[\Delta x_2^2\right]. \tag{3}$$
Consider the first weight-gradient ratio, using $\Delta W_1 = \Delta y_1 x_1^T$:
$$\frac{\mathbb{E}\left[\Delta W_1^2\right]}{\mathbb{E}\left[W_1^2\right]} = \frac{\mathbb{E}\left[\Delta y_1^2\right]\mathbb{E}\left[x_1^2\right]}{\mathbb{E}\left[W_1^2\right]}.$$
Under our assumptions, backwards propagation, $\Delta x_2 = W_2^T \Delta y_2$, results in $\mathbb{E}\left[\Delta x_2^2\right] = n_2\,\mathbb{E}\left[W_2^2\right]\mathbb{E}\left[\Delta y_2^2\right]$, so:
$$\mathbb{E}\left[\Delta y_1^2\right] = \tfrac{1}{2}\, n_2\,\mathbb{E}\left[W_2^2\right]\mathbb{E}\left[\Delta y_2^2\right].$$
So:
$$\frac{\mathbb{E}\left[\Delta W_1^2\right]}{\mathbb{E}\left[W_1^2\right]} = \frac{n_2\,\mathbb{E}\left[W_2^2\right]}{2\,\mathbb{E}\left[W_1^2\right]}\,\mathbb{E}\left[\Delta y_2^2\right]\mathbb{E}\left[x_1^2\right] = \tfrac{1}{2}\sqrt{n_0 n_2}\;\mathbb{E}\left[\Delta y_2^2\right]\mathbb{E}\left[x_1^2\right]. \tag{4}$$
Now consider the second weight-gradient ratio:
$$\frac{\mathbb{E}\left[\Delta W_2^2\right]}{\mathbb{E}\left[W_2^2\right]} = \frac{\mathbb{E}\left[\Delta y_2^2\right]\mathbb{E}\left[x_2^2\right]}{\mathbb{E}\left[W_2^2\right]}.$$
Under our assumptions, applying forward propagation gives $\mathbb{E}\left[y_1^2\right] = n_0\,\mathbb{E}\left[W_1^2\right]\mathbb{E}\left[x_1^2\right]$, and so from Equation 3 we have:
$$\frac{\mathbb{E}\left[\Delta W_2^2\right]}{\mathbb{E}\left[W_2^2\right]} = \frac{n_0\,\mathbb{E}\left[W_1^2\right]}{2\,\mathbb{E}\left[W_2^2\right]}\,\mathbb{E}\left[x_1^2\right]\mathbb{E}\left[\Delta y_2^2\right] = \tfrac{1}{2}\sqrt{n_0 n_2}\;\mathbb{E}\left[x_1^2\right]\mathbb{E}\left[\Delta y_2^2\right],$$
which matches Equation 4, so the two ratios are equal. ∎
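This balance can also be checked numerically. The sketch below (hypothetical widths; numpy with hand-written backpropagation) measures the weight-to-gradient ratio of every layer of a geometrically initialized MLP at initialization; the ratios come out roughly equal, in line with the proposition:

```python
import numpy as np

rng = np.random.default_rng(0)

def geometric_init(fan_out, fan_in, c=2.0):
    # E[W_ij^2] = c / sqrt(fan_in * fan_out)
    std = np.sqrt(c / np.sqrt(fan_in * fan_out))
    return rng.normal(0.0, std, size=(fan_out, fan_in))

widths = [512, 256, 128, 64, 32]             # hypothetical layer widths
Ws = [geometric_init(widths[i + 1], widths[i]) for i in range(len(widths) - 1)]

# Forward pass (biases are zero at initialization, so they are omitted).
xs, ys = [rng.normal(size=widths[0])], []
for W in Ws:
    y = W @ xs[-1]
    ys.append(y)
    xs.append(np.maximum(y, 0.0))

# Backward pass from a random gradient on the final pre-activation.
dy = rng.normal(size=widths[-1])
ratios = []
for l in reversed(range(len(Ws))):
    dW = np.outer(dy, xs[l])                 # weight gradient of layer l
    ratios.append(np.mean(dW ** 2) / np.mean(Ws[l] ** 2))
    if l > 0:                                # propagate through W_l and ReLU
        dx = Ws[l].T @ dy
        dy = dx * (ys[l - 1] > 0)

print(ratios)   # similar magnitudes across layers under geometric init
```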
Remark 4.
This relation also holds for sequences of (potentially) strided convolutions, but only if the same kernel size is used everywhere, and a non-zero padding scheme is used, such as reflective padding.
5.1 Traditional Initialization schemes
The most common approaches are the Kaiming (He et al., 2015) and Xavier (Glorot and Bengio, 2010) initializations. The Kaiming technique for ReLU networks is actually one of two approaches:
$$\mathbb{E}\left[W_{ij}^2\right] = \frac{2}{n_{\text{in}}} \;\;\text{(fan-in)} \qquad \text{or} \qquad \mathbb{E}\left[W_{ij}^2\right] = \frac{2}{n_{\text{out}}} \;\;\text{(fan-out)}. \tag{5}$$
For the feed-forward network above, assuming random activations, the forward-activation variance will remain constant in expectation throughout the network if fan-in initialization of weights (LeCun et al., 2012) is used, whereas the fan-out variant maintains a constant variance of the back-propagated signal. The constant factor 2 in the above expressions corrects for the variance-reducing effect of the ReLU activation. Although popularized by He et al. (2015), similar scaling was in use in early neural network models that used tanh activation functions (Bottou, 1988).
These two principles are clearly in conflict: unless $n_{\text{in}} = n_{\text{out}}$, either the forward variance or the backward variance will become non-constant or, as it is more commonly expressed, will either explode or vanish. No prima facie reason for preferring one initialization over the other is provided. Unfortunately, there is some confusion in the literature, as many works reference using Kaiming initialization without specifying whether the fan-in or fan-out variant is used.
The Xavier initialization (Glorot and Bengio, 2010) is the closest to our proposed approach. It balances these conflicting objectives using the arithmetic mean:
$$\mathbb{E}\left[W_{ij}^2\right] = \frac{2}{n_{\text{in}} + n_{\text{out}}} \tag{6}$$
to "… approximately satisfy our objectives of maintaining activation variances and back-propagated gradients variance as one moves up or down the network". This approach to balancing is essentially heuristic, in contrast to the geometric-mean approach that our theory directly guides us to.
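The four prescriptions diverge most for unbalanced layers. The sketch below (hypothetical fan sizes) compares the second moments each scheme assigns to the same layer; note that the geometric mean sits exactly halfway between the two Kaiming variants on a log scale:

```python
import numpy as np

fan_in, fan_out = 2048, 64   # a deliberately unbalanced (bottleneck) layer

second_moment = {
    "fan_in":     2.0 / fan_in,                     # Kaiming, fan-in variant
    "fan_out":    2.0 / fan_out,                    # Kaiming, fan-out variant
    "arithmetic": 2.0 / (fan_in + fan_out),         # Xavier (Equation 6)
    "geometric":  2.0 / np.sqrt(fan_in * fan_out),  # geometric mean (Equation 2)
}
for name, m2 in second_moment.items():
    print(f"{name:10s} E[W^2] = {m2:.2e}")

# Log-midpoint property of the geometric mean:
assert np.isclose(second_moment["geometric"],
                  np.sqrt(second_moment["fan_in"] * second_moment["fan_out"]))
```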
5.2 Geometric initialization balances biases
We can use the same proof technique to compute the GR scaling for the bias parameters in a network. Our update equations change to include the bias term, $y_l = W_l x_l + b_l$, with $b_l$ assumed to be initialized at zero. We show in Appendix D that:
It is easy to show, using the techniques of Section 5, that the biases of consecutive layers have equal GR scaling as long as geometric initialization is used. However, unlike in the case of the weights, we have less flexibility in the choice of the numerator: instead of allowing the weights to be scaled by $c$ for any positive $c$, we require that $c = 1$, so that:
$$\mathbb{E}\left[W_{ij}^2\right] = \frac{1}{\sqrt{n_{\text{in}}\, n_{\text{out}}}}. \tag{7}$$
5.3 Network input scaling balances weights against biases
It is traditional to normalize a dataset before applying a neural network, so that the input vector has mean 0 and variance 1 in expectation. This principle is rarely questioned in modern neural networks, despite the fact that there is no longer a good justification for its use in modern ReLU-based networks. In contrast, our theory provides direct guidance for the choice of input scaling. We show that the second moment of the input affects the GR scaling of the bias and weight parameters differently, and that they can be balanced by careful choice of the initialization.
Consider the GR scaling values for the bias and weight parameters in the first layer of a ReLUbased multilayer perceptron network, as considered in previous sections. We assume the data is already centered. Then the scaling factors for the weight and bias layers are:
We can cancel terms to find the value of the input second moment that makes these two quantities equal:
In common computer vision architectures such as VGG (detailed below), the input planes are the 3 color channels and the kernel size is fixed, giving a balanced input second moment larger than one. Using the traditional variance-one normalization will therefore result in the effective learning rate for the bias terms being lower than that of the weight terms, potentially making the bias terms learn more slowly than under the input scaling we propose.
5.4 Output second moments
A neural network's behavior is also very sensitive to the second moment of the outputs. For a convolutional network without pooling layers (but potentially with strided dimensionality reduction), if geometric-mean initialization is used, the activation second moments are given by:
The application of a sequence of these layers gives a telescoping product:
We potentially have independent control over this second moment at initialization, as we can insert a fixed scalar multiplication factor at the end of the network that modifies it. This may be necessary when adapting a network architecture that was designed and tested under a different initialization scheme, as the success of the architecture may be partially due to the output scaling that happens to be produced by that original initialization. We are not aware of any existing theory guiding the choice of output variance at initialization for the case of log-softmax losses, where it has a non-trivial effect on the back-propagated signals, although output variances of 0.01 to 0.1 appear to work well. The output variance should always be checked and potentially corrected when switching initialization schemes.
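Checking and correcting the output second moment is a one-line measurement plus one fixed scalar. The sketch below (hypothetical widths, a geometric-mean-style initialization, and a target of 0.05 in the 0.01–0.1 range noted above) illustrates the correction:

```python
import numpy as np

rng = np.random.default_rng(4)
widths = [256, 256, 256, 10]                 # hypothetical MLP widths
Ws = [rng.normal(0.0, np.sqrt(1.0 / np.sqrt(widths[i] * widths[i + 1])),
                 size=(widths[i + 1], widths[i])) for i in range(3)]

X = rng.normal(size=(512, widths[0]))        # a batch of random inputs
h = X
for W in Ws[:-1]:                            # hidden layers with ReLU
    h = np.maximum(h @ W.T, 0.0)
out = h @ Ws[-1].T                           # linear output layer

m2 = np.mean(out ** 2)                       # measured output second moment
target = 0.05
scale = np.sqrt(target / m2)                 # fixed scalar appended to the net
corrected = scale * out
print(np.mean(corrected ** 2))               # equals the target by construction
```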
6 Designing wellconditioned neural networks
Table 1: Operations that preserve GR scaling when inserted into a preconditioned network.

Method  Maintains scaling  Notes

Linear layer  ✓  Will not be well-conditioned against other layers unless geometric initialization is used
(Strided) convolution  ✓  As above, but only if all kernel sizes are the same
Skip connections  ✗  Operations in residual blocks will be scaled correctly against each other, but not against non-residual operations
Average pooling  ✓
Max pooling  ✗
Dropout  ✓
ReLU/LeakyReLU  ✓  Any positively-homogeneous function of degree 1
Sigmoid  ✗
Tanh  ✗  Maintains scaling if operating entirely within its linear regime
Our scaling principle can be used for the design of more complex network structures as well. In this section, we detail the general principles that can be used to design wellconditioned networks with more complicated structures.
6.1 Convolutional networks
The concept of GR scaling may be extended to convolutional layers with kernel width $k$, batch-size $b$, and output resolution $m \times m$. A straightforward derivation gives expressions for the convolution weights and biases of:
(8) 
This requires an assumption of independence between the values of the activations within a channel that is not true in practice, so the GR scaling tends to be further from empirical estimates for convolutional layers than for non-convolutional layers, although it is still a useful guide. The effect of padding is also ignored here: the standard technique of padding with zeros will only cause a modest decrease in output variance, and so it is typically safe to ignore this additional complication except in extremely deep networks. Sequences of convolutions are well scaled against each other as long as the kernel size remains the same. The scaling of layers involving differing kernel sizes can be corrected by the addition of constants into the network (Appendix E).
6.2 Maintaining conditioning
Consider a network where the GR scaling is constant throughout. We may add an additional layer between any two existing layers without affecting this conditioning, as long as the new layer maintains the activation-gradient second-moment product
$$\mathbb{E}\left[x^2\right]\mathbb{E}\left[\Delta x^2\right]$$
and the dimensionality; this follows from Equation 8. For instance, adding a simple scaling layer of the form $y = \sqrt{2}\,x$ doubles the second moment during the forward pass and halves the second moment of the back-propagated gradient, which maintains this product:
$$\mathbb{E}\left[y^2\right]\mathbb{E}\left[\Delta y^2\right] = 2\,\mathbb{E}\left[x^2\right]\cdot\tfrac{1}{2}\,\mathbb{E}\left[\Delta x^2\right] = \mathbb{E}\left[x^2\right]\mathbb{E}\left[\Delta x^2\right].$$
When the spatial dimensionality changes between layers, the GR scaling is no longer maintained just by balancing this product, as it depends directly on the square of the spatial dimension. Instead, a pooling operation that changes the forward and backward signals in a way that counteracts the change in spatial dimension is needed. The use of stride-2 convolutions, as well as average pooling, results in the correct scaling, but other types of pooling generally do not. Table 1 lists operations that preserve scaling when inserted into an existing preconditioned network. Operations such as linear layers preserve the scaling of existing layers, but are only themselves well-scaled if they are initialized correctly. For an architecture such as ResNet-50 that uses operations that break scaling, constants should be introduced into the network to correct the scaling: residual connections, max-pooling and varying kernel sizes all need to be corrected for (we describe this procedure in Appendix E).
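The invariance of the second-moment product under a $\sqrt{2}$ scaling layer can be verified directly; in the sketch below (synthetic signals in numpy), the product measured on the input side of the layer matches the product on the output side:

```python
import numpy as np

rng = np.random.default_rng(2)
alpha = np.sqrt(2.0)

x  = rng.normal(size=100_000)    # forward activations entering the layer
dy = rng.normal(size=100_000)    # gradient arriving at the layer output

y  = alpha * x                   # forward pass through y = sqrt(2) * x
dx = alpha * dy                  # backward pass: dx = alpha * dy

out_side = np.mean(y ** 2) * np.mean(dy ** 2)   # E[y^2] E[dy^2]
in_side  = np.mean(x ** 2) * np.mean(dx ** 2)   # E[x^2] E[dx^2]
print(out_side, in_side)         # the two products agree
```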
6.3 Conditioning multipliers
We can change the value of the GR scaling for a single layer, without modifying the forward- or backward-propagated signals in the network, via reparametrization (Lafond et al., 2017). If we introduce an additional scalar $\alpha$:
$$y_l = \alpha W_l x_l + b_l,$$
and modify the initialization of $W_l$ such that $\mathbb{E}\left[y_l^2\right]$ is unchanged (i.e. divide $\mathbb{E}\left[W_l^2\right]$ by $\alpha^2$), then the backward signal is unchanged; however, the value of $\mathbb{E}\left[W_l^2\right]$ changes, and the GR scaling is then multiplied by $\alpha^4$:
$$\frac{\mathbb{E}\left[\Delta W_l^2\right]}{\mathbb{E}\left[W_l^2\right]} \;\longrightarrow\; \alpha^4\, \frac{\mathbb{E}\left[\Delta W_l^2\right]}{\mathbb{E}\left[W_l^2\right]}. \tag{9}$$
By introducing these untrained conditioning constants we may modify the GR scaling of each block independently, and potentially improve the initial conditioning of any given network.
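Under the weight-to-gradient view of Section 4, the effect of such a multiplier can be checked numerically. The sketch below (hypothetical shapes and $\alpha = 2$) verifies that the forward and backward signals are untouched by the reparametrization, while the weight-to-gradient ratio is scaled by $\alpha^4$:

```python
import numpy as np

rng = np.random.default_rng(3)
fan_out, fan_in, alpha = 128, 256, 2.0

W  = rng.normal(0.0, 0.05, size=(fan_out, fan_in))
Wa = W / alpha                        # rescaled init so that alpha * Wa == W

x  = rng.normal(size=fan_in)          # layer input
dy = rng.normal(size=fan_out)         # gradient at the layer output

# Forward and backward signals are unchanged by the reparametrization.
assert np.allclose(W @ x, alpha * (Wa @ x))
assert np.allclose(W.T @ dy, alpha * (Wa.T @ dy))

# The weight gradient picks up a factor of alpha, while E[W^2] shrinks
# by alpha^2, so the weight-to-gradient ratio is multiplied by alpha^4.
dW, dWa = np.outer(dy, x), alpha * np.outer(dy, x)
r  = np.mean(dW ** 2)  / np.mean(W ** 2)
ra = np.mean(dWa ** 2) / np.mean(Wa ** 2)
print(ra / r)                         # = alpha ** 4
```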
7 Experimental Results
Table 2: Comparison of initialization methods over 26 LibSVM datasets (losses normalized per dataset).

Method  Average normalized loss  Worst in #  Best in #

Arithmetic mean  0.90  14  3
Fan-in  0.84  3  5
Fan-out  0.88  9  12
Geometric mean  0.81  0  6
We considered a selection of dense and moderate-sparsity multi-class classification datasets from the LibSVM repository, 26 in total. The same model was used for all datasets: a non-convolutional ReLU network with 3 weight layers in total. The inner two layers were fixed at 384 and 64 nodes respectively. These numbers were chosen to result in a larger gap between the fans, and hence between the initialization methods; very little difference could be expected if a more typical gap were used.
For every dataset, learning rate and initialization combination, we ran 10 seeds and took the median loss after 5 epochs as the focus of our study (the largest differences can be expected early in training). Learning rates over a wide range (in powers of 2) were checked for each dataset and initialization combination, with the best learning rate chosen in each case based on the median of the 10 seeds. Training loss was used as the basis of our comparison, as we care primarily about convergence rate and are comparing identical network architectures. Some additional details concerning the experimental setup, and which datasets were used, are available in the Appendix.
Table 2 shows that geometric initialization is the most consistent of the initialization approaches considered. It has the lowest loss after normalizing each dataset, and it is never the worst of the 4 methods on any dataset. Interestingly, the fan-out method is most often the best method, but consideration of the per-dataset plots (Appendix F) shows that it often completely fails to learn on some problems, which pulls down its average loss and results in it being the worst on 9 datasets.
8 Conclusion
Although not a panacea, by using the scaling principle we have introduced, neural networks can be designed with a reasonable expectation that they will be optimizable by stochastic gradient methods, minimizing the amount of guess-and-check neural network design. As a consequence of our scaling principle, we have derived an initialization scheme that automatically preconditions common network architectures. Most developments in neural network theory attempt to explain the success of existing techniques post hoc; instead, we show the power of the scaling-law approach by deriving a new initialization technique directly from theory.
References
 Ba et al. [2017] Jimmy Ba, Roger Grosse, and James Martens. Distributed second-order optimization using Kronecker-factored approximations. International Conference on Learning Representations (ICLR 2017), 2017.
 Balles and Hennig [2018] Lukas Balles and Philipp Hennig. Dissecting adam: The sign, magnitude and variance of stochastic gradients. International conference on machine learning (ICML), 2018.
 Bernstein et al. [2018] Jeremy Bernstein, Yu-Xiang Wang, Kamyar Azizzadenesheli, and Anima Anandkumar. signSGD: Compressed optimisation for non-convex problems. International conference on machine learning (ICML), 2018.
 Bottou [1988] Léon Bottou. Reconnaissance de la parole par reseaux connexionnistes. In Proceedings of Neuro Nimes 88, pages 197–218, Nimes, France, 1988. URL http://leon.bottou.org/papers/bottou88b.
 Du et al. [2018] Simon S. Du, Wei Hu, and Jason D. Lee. Algorithmic regularization in learning deep homogeneous models: Layers are automatically balanced. Neural Information Processing Systems (NIPS), 2018.
 George et al. [2018] Thomas George, César Laurent, Xavier Bouthillier, Nicolas Ballas, and Pascal Vincent. Fast approximate natural gradient descent in a Kronecker-factored eigenbasis, 2018.
 Glorot and Bengio [2010] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, 2010.
 Grosse and Martens [2016] Roger Grosse and James Martens. A Kronecker-factored approximate Fisher matrix for convolution layers. International Conference on Machine Learning (ICML 2016), 2016.
 Hanin [2018] Boris Hanin. Which neural net architectures give rise to exploding and vanishing gradients? In Advances in Neural Information Processing Systems 31, pages 582–591. Curran Associates, Inc., 2018.
 He et al. [2015] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing humanlevel performance on imagenet classification. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), 2015.
 He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
 Jacobi [1845] C. G. J. Jacobi. Ueber eine neue auflösungsart der bei der methode der kleinsten quadrate vorkommenden lineären gleichungen. Astron. Nachrichten, 1845.
 Kingma and Ba [2015] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. International Conference for Learning Representations (ICLR2015), 2015.
 Lafond et al. [2017] Jean Lafond, Nicolas Vasilache, and Léon Bottou. Diagonal rescaling for neural networks. ArXiv eprints, 2017.
 Latala [2005] Rafal Latala. Some estimates of norms of random matrices. Proc. Amer. Math. Soc., 2005.
 LeCun et al. [2012] Yann A. LeCun, Léon Bottou, Genevieve B. Orr, and KlausRobert Müller. Neural Networks: Tricks of the Trade, chapter Efficient BackProp. Springer, 2012.
 Li and Talwalkar [2019] Liam Li and Ameet S. Talwalkar. Random search and reproducibility for neural architecture search. CoRR, abs/1902.07638, 2019.
 Martens [2014] James Martens. New insights and perspectives on the natural gradient method. In ArXiv eprints, 2014.
 Martens and Grosse [2015] James Martens and Roger Grosse. Optimizing neural networks with Kronecker-factored approximate curvature. International conference on machine learning (ICML), 2015.
 Radford et al. [2015] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In ArXiv eprints, 2015.
 Ronneberger et al. [2015] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In Nassir Navab, Joachim Hornegger, William M. Wells, and Alejandro F. Frangi, editors, Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015. Springer, 2015.
 Rudelson and Vershynin [2010] Mark Rudelson and Roman Vershynin. Nonasymptotic theory of random matrices: extreme singular values. Proceedings of the International Congress of Mathematicians, 2010.
 Russakovsky et al. [2015] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li FeiFei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015. doi: 10.1007/s112630150816y.
 Saxe et al. [2014] Andrew M. Saxe, James L. McClelland, and Surya Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. In ArXiv eprints, 2014.
 Simonyan and Zisserman [2015] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for largescale image recognition. international conference on learning representations (ICLR2015), 2015.
 Sutskever et al. [2013] Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. On the importance of initialization and momentum in deep learning. In International conference on machine learning (ICML), pages 1139–1147, 2013.
 Tao [2012] Terence Tao. Topics in Random Matrix Theory. American Mathematical Soc., 2012.
 Tieleman and Hinton [2012] T. Tieleman and G. Hinton. Lecture 6.5—RmsProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012.
 Vershynin [2012] Roman Vershynin. Introduction to the nonasymptotic analysis of random matrices. Compressed sensing, 2012.
 Xiao et al. [2018] Lechao Xiao, Yasaman Bahri, Jascha SohlDickstein, Samuel S. Schoenholz, and Jeffrey Pennington. Dynamical isometry and a mean field theory of cnns: How to train 10,000layer vanilla convolutional neural networks. ICML, 2018.
 Zbontar et al. [2018] Jure Zbontar, Florian Knoll, Anuroop Sriram, Matthew J. Muckley, Mary Bruno, Aaron Defazio, Marc Parente, Krzysztof J. Geras, Joe Katsnelson, Hersh Chandarana, Zizhao Zhang, Michal Drozdzal, Adriana Romero, Michael Rabbat, Pascal Vincent, James Pinkerton, Duo Wang, Nafissa Yakubova, Erich Owens, C. Lawrence Zitnick, Michael P. Recht, Daniel K. Sodickson, and Yvonne W. Lui. fastMRI: An open dataset and benchmarks for accelerated MRI. In ArXiv eprints, 2018.
 Zhang et al. [2019] Hongyi Zhang, Yann N. Dauphin, and Tengyu Ma. Residual learning without normalization via better initialization. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=H1gsz30cKX.
 Zoph and Le [2016] Barret Zoph and Quoc V. Le. Neural architecture search with reinforcement learning. CoRR, abs/1611.01578, 2016. URL http://arxiv.org/abs/1611.01578.
Appendix A Assumptions
The following assumptions are used in the derivation of the GR scaling:
(A1) The input and target values are drawn element-wise i.i.d. from a centered symmetric distribution with known variance.
(A2) The Hessian of the remainder of the network above each block, with respect to the output, has Frobenius norm much larger than 1. We make this assumption so that we can neglect all but the highest-order terms that are polynomial in this norm.
(A3) All activations, pre-activations and gradients are independently distributed element-wise. In practice, due to the mixing effect of multiplication by random weight matrices, only the magnitudes of these quantities are correlated, and the effect is small for wide networks due to the law of large numbers. Independence assumptions of this kind are common when approximating second-order methods; the block-diagonal variant of K-FAC [Martens and Grosse, 2015] makes similar assumptions, for instance.
Assumption A2 is the most problematic of these assumptions, and we make no claim that it holds in practice. However, we are primarily interested in the properties of blocks and their scaling with respect to each other, not their absolute scaling. Assumption A2 results in very simple expressions for the scaling of the blocks without requiring a more complicated analysis of the top of the network. Similar theory can be derived for other assumptions on the output structure, such as the assumption that the target values are much smaller than the outputs of the network.
Appendix B GR scaling derivation
To compute the second moment of the elements of $G_l r$, we can calculate the second moments of the matrix-random-vector products against $J_l$, $\mathcal{R}_l$ and $J_l^T$ separately, since $x_l$ is uncorrelated with $\mathcal{R}_l$, and the back-propagated gradient is uncorrelated with $x_l$ (Assumption A3).
Jacobian products $J_l r$ and $J_l^T r$
Recall that each row of $J_l$ has $n_{\text{in}}$ nonzero elements (Equation 1), each containing a value from $x_l$. The vector $x_l$ is i.i.d. random at the bottom layer of the network (Assumption A1). For layers further up, the multiplication by a random weight matrix from the previous layer ensures that the entries of $x_l$ are identically distributed (Assumption A3). So, for an i.i.d. unit-variance vector $r$, we have:
$$\mathbb{E}\left[(J_l r)_i^2\right] = n_{\text{in}}\,\mathbb{E}\left[x_l^2\right].$$
Note that we did not assume that the input is mean zero, so second moments rather than variances appear here. This is needed as often the input to a layer is the output of a ReLU operation, which will not be mean zero.
For the transposed case, we have a single nonzero entry per row of $J_l^T$, so:
$$\mathbb{E}\left[(J_l^T r)_i^2\right] = \mathbb{E}\left[x_l^2\right].$$
Upper Hessian product
Instead of computing the product of $\mathcal{R}_l$ with $J_l r$ for a random $r$, we will instead compute it for an i.i.d. normal vector of matching second moment; it will have the same expectation, since both $J_l$ and $J_l^T$ are uncorrelated with $\mathcal{R}_l$ (Assumption A3). The piecewise-linear structure of the network above $y_l$ makes the structure of $\mathcal{R}_l$ particularly simple: the remainder of the network is a least-squares problem in some matrix $A$ that is the linearization of the remainder of the network. The gradient is then $A^T(A y_l - t)$ for targets $t$, and the Hessian is simply $A^T A$. So we have that
Applying this gives:
B.1 The Gauss-Newton matrix
Standard ReLU classification and regression networks have a particularly simple structure for the Hessian with respect to the input, as the network's output is a piecewise-linear function fed into a final layer consisting of a convex log-softmax operation or a least-squares loss. This structure results in the Hessian with respect to the input being equivalent to its Gauss-Newton approximation. The Gauss-Newton matrix can be written in a factored form, which is used in the analysis we perform in this work. We emphasize that this is just used as a convenience when working with diagonal blocks; the GN representation is not an approximation in this case.
The (Generalized) Gauss-Newton matrix is a positive semi-definite approximation of the Hessian of a non-convex function $h$, given by factoring $h$ into the composition of two functions $h = g \circ f$, where $g$ is convex and $f$ is approximated by its Jacobian matrix $J_f$ at the current point, for the purpose of computing the Hessian:
$$G = J_f^T \left(\nabla^2 g\right) J_f.$$
The GN matrix also has close ties to the Fisher information matrix [Martens, 2014], providing another justification for its use.
Surprisingly, the Gauss-Newton decomposition can be used to compute diagonal blocks of the Hessian with respect to the weights as well as the inputs [Martens, 2014]. To see this, note that for any activation , the layers above may be treated in a combined fashion as the in a decomposition of the network structure: they are the composition of a convex function with a (locally) linear function, and thus convex. In this decomposition, is a function of with fixed, and as this is linear in , the Gauss-Newton approximation to the block is not an approximation at all.
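As a concrete illustration of the exactness claim (an assumption-laden sketch of ours, not the paper's code), the following verifies numerically that for a tiny ReLU network feeding a least-squares loss, the finite-difference Hessian with respect to the input matches the Gauss-Newton matrix JᵀJ within a fixed activation pattern, since the least-squares loss Hessian is the identity.

```python
import numpy as np

def network(x, W1, W2):
    """Tiny ReLU network: x -> ReLU(W1 x) -> W2 (.)."""
    return W2 @ np.maximum(W1 @ x, 0.0)

def gauss_newton(x, W1, W2):
    """GN matrix J^T H_h J for the least-squares loss h(z) = 0.5||z - y||^2,
    whose Hessian H_h is the identity. The Jacobian J is exact within a
    fixed ReLU activation pattern, where the network is linear."""
    mask = (W1 @ x > 0.0).astype(float)
    J = W2 @ (mask[:, None] * W1)   # Jacobian of the piecewise-linear network
    return J.T @ J

def fd_hessian(x, W1, W2, y, eps=1e-4):
    """Central finite-difference Hessian of the composed loss w.r.t. the input."""
    def loss(z):
        r = network(z, W1, W2) - y
        return 0.5 * float(r @ r)
    n = x.size
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei, ej = np.eye(n)[i], np.eye(n)[j]
            H[i, j] = (loss(x + eps*ei + eps*ej) - loss(x + eps*ei - eps*ej)
                       - loss(x - eps*ei + eps*ej) + loss(x - eps*ei - eps*ej)) / (4 * eps**2)
    return H
```

Within a linear region of the ReLU network the loss is exactly quadratic in the input, so the two matrices agree up to finite-difference rounding; the equality would fail for a network whose output is not piecewise linear.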
Appendix C The weight-gradient ratio is equal to the GR scaling for MLP models
Proposition 6.
The weight-gradient ratio is equal to the GR scaling for i.i.d. mean-zero randomly-initialized multilayer perceptron layers, under the independence assumptions of Appendix A.
Proof.
To see the equivalence, note that under the zero-bias initialization, we have from that:
(10) 
and so:
The gradient of the weights is given by and so its second moment is:
(11) 
Combining these quantities gives:
∎
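The factorization of the second moment used in the proof can be sanity-checked numerically. The sketch below (illustrative names; a ReLU-style input distribution is our assumption) confirms that for independent back-propagated gradients δ and layer inputs x, the weight-gradient entries δ·x have second moment E[δ²]·E[x²].

```python
import numpy as np

def weight_grad_second_moment(trials=200_000, seed=0):
    """Check that for independent delta (back-propagated gradient) and x
    (layer input), the weight gradient entry delta * x has second moment
    E[delta^2] * E[x^2], as used in the proposition."""
    rng = np.random.default_rng(seed)
    delta = rng.normal(0.0, 0.3, size=trials)        # mean-zero gradient signal
    x = np.maximum(rng.normal(size=trials), 0.0)     # ReLU-style layer input
    est = np.mean((delta * x) ** 2)
    pred = np.mean(delta**2) * np.mean(x**2)
    return est, pred
```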
Appendix D Bias scaling
We consider the case of a convolutional neural network with spatial resolution for greater generality. Consider the Jacobian of with respect to the bias. It has shape . Each row corresponds to an output, and each column to a bias weight. As before, we will approximate the product of with a random i.i.d. unit-variance vector :
The structure of is that each block of rows has a single 1 per row, all in the same column. It follows that:
The calculation of the product of with is approximated in the same way as in the weight scaling calculation. For the product, note that there is an additional factor, as each column has nonzero entries, each equal to 1. Combining these three quantities gives:
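The Jacobian structure described above can be made concrete with a small sketch (the dimensions and names here are illustrative, not from the paper): flattened to shape (c·h·w, c), each row of the bias Jacobian contains a single 1, in the column of its channel's bias, and each column contains h·w ones.

```python
import numpy as np

def bias_jacobian(c_out=3, h=4, w=4):
    """Jacobian of a conv layer's output w.r.t. its bias, flattened to shape
    (c_out*h*w, c_out): since output[c, i, j] = (...) + b[c], each row has a
    single 1 (in its channel's column) and each column has h*w ones."""
    J = np.zeros((c_out * h * w, c_out))
    for c in range(c_out):
        for i in range(h):
            for j in range(w):
                row = c * h * w + i * w + j
                J[row, c] = 1.0
    return J
```

The column sums give the extra factor of h·w that appears in the transposed product above.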
Proposition 7.
As long as the weights are initialized following Equation 7 and the biases are initialized to 0, we have that
We will include as a variable, as this clarifies its relation to other quantities. We reuse some calculations from Proposition 3, namely that:
Plugging these into the definition of :
For , we require the additional quantity:
Again plugging this in:
So comparing these expressions for and , we see that if and only if
Appendix E Conditioning of ResNets Without Normalization Layers
There has been significant recent interest in training residual networks without batch normalization or other normalization layers [Zhang et al., 2019]. In this section, we explore the modifications to a network that are necessary for this to be possible, and show how to apply our preconditioning principle to these networks.
The building block of a ResNet model is the residual block:
where in this notation is a composition of layers. Unlike classical feed-forward architectures, the pass-through connection results in an exponential increase in the variance of the activations as the depth increases. A side effect is that the output of the network becomes exponentially more sensitive to its input as depth increases, a property characterized by the Lipschitz constant of the network [Hanin, 2018].
This exponential dependence can be reduced by the introduction of scaling constants to each block:
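A minimal Monte Carlo sketch (ours; a random second-moment-preserving linear map stands in for the block function F, and all names are illustrative) demonstrates both effects: with scaling constant α = 1 the second moment grows as 2^L with depth L, while a small α tames the growth to (1+α²)^L.

```python
import numpy as np

def residual_variance(depth, alpha, width=256, trials=2000, seed=0):
    """Monte Carlo second moment of the activations after `depth` scaled
    residual blocks x <- x + alpha * F(x), where each F is a fresh random
    linear map initialized to preserve the second moment (entry variance
    1/width). Returns the mean squared activation entry."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(trials, width))   # unit second moment at the input
    for _ in range(depth):
        W = rng.normal(0.0, np.sqrt(1.0 / width), size=(width, width))
        # The branch output is approximately uncorrelated with x, so each
        # block multiplies the second moment by (1 + alpha^2) in expectation.
        x = x + alpha * (x @ W.T)
    return float(np.mean(x**2))
```

With depth 8 and α = 1 the second moment is near 2⁸ = 256, while α = 0.1 keeps it near (1.01)⁸ ≈ 1.08, illustrating why the scaling constants prevent the exponential blow-up.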
The introduction of these constants requires a modification of the block structure to ensure constant conditioning between blocks. A standard bottleneck block, as used in the ResNet-50 architecture, has the following form:
In this notation, is a convolution that reduces the number of channels four-fold, is a convolution with equal input and output channels, and is a convolution that increases the number of channels four-fold, back to the original input count.
If we introduce a scaling factor to each block , then we must also add conditioning multipliers to each convolution to change their GR scaling, as we described in Section 6.3. The correct scaling constant depends on the scaling constant of the previous block. A simple calculation gives the equation:
Since scaling is relative, the first block may be scaled with and . We recommend using a flat for all to avoid having to introduce the factors. The block structure including the factors is:
The weights of each convolution must then be initialized with the standard deviation modified so that the combined convolution-scaling operation gives the same output variance as the geometric-mean initialization scheme would give without extra scaling constants. For instance, the initialization of the convolution must have its standard deviation scaled down by dividing by , so that the multiplication by during the forward pass results in the correct forward variance. The factor corrects for the change in kernel shape for the middle convolution.
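As a hedged sketch of this initialization arithmetic (the ReLU gain constant 2 is our assumption, and the kernel-shape correction factor from the text is omitted here), the geometric-mean rule and the division by the block scaling constant can be written as:

```python
import numpy as np

def geometric_mean_std(fan_in, fan_out, gain=2.0):
    """Geometric-mean initialization: variance gain / sqrt(fan_in * fan_out).
    The gain constant (2.0, the usual ReLU choice) is an assumption here."""
    return float(np.sqrt(gain / np.sqrt(fan_in * fan_out)))

def scaled_block_std(fan_in, fan_out, alpha, gain=2.0):
    """For a convolution whose output is multiplied by a block constant alpha,
    divide the standard deviation by alpha so the combined convolution-scaling
    operation preserves the target forward variance."""
    return geometric_mean_std(fan_in, fan_out, gain) / alpha
```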
E.1 Correction for mixed residual and non-residual blocks
Since the initial convolution in a ResNet-50 model is also not within a residual block, its GR scaling differs from that of the convolutions within residual blocks. Consider the composition of a non-residual convolution followed by a residual block, without max-pooling or ReLUs, for simplicity of exposition:
Without loss of generality, we assume that , and assume a single channel input and output.
Our goal is to find a constant , so that . Note that the initialization of is tied to inversely, so that the variance of is independent of the choice of . Our scaling factor will also depend on the kernel sizes used in the two convolutions, so we must include those in the calculations.
From Equation 9, the GR scaling for is
Note that so:
For the residual convolution, we need a modification of the standard GR equation due to the residual branch. The derivation of for non-residual convolutions assumes that the remainder of the network above the convolution responds linearly (locally) to the scaling of the convolution, but here, due to the residual connection, this is no longer the case. For instance, if the weight were scaled to zero, the output of the network would not also become zero (recall our assumption of zero-initialization for bias terms). This can be avoided by noting that the ratio in the GR scaling may be computed further up the network, as long as any scaling in between is corrected for. In particular, we may compute this ratio at the point after the residual addition, as long as we include the factor to account for this. So we in fact have:
We now equate :
Therefore to ensure that we need:
Final layer
A similar calculation applies when the residual block comes before the non-residual convolution, as for the final linear layer in the ResNet network, giving a scaling factor for the linear layer (effective kernel size 1) of:
Appendix F Full experimental results
Details of input/output scaling
To prevent the results from being skewed by the number of classes and the number of inputs affecting the output variance, the logit output of the network was scaled to have standard deviation 0.05 after the first minibatch evaluation for every method, with the scaling constant fixed thereafter. LayerNorm was used on the input to whiten the data. Weight decay of 0.00001 was used for every dataset. To aggregate the losses across datasets we divided by the worst loss across the initializations before averaging.
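The output-scaling step can be sketched as follows (the function name is ours): the scaling constant is computed once from the logits of the first minibatch and then frozen for the remainder of training.

```python
import numpy as np

def calibrate_output_scale(first_batch_logits, target_std=0.05):
    """Compute a fixed scaling constant from the first minibatch so that the
    scaled logit output has standard deviation `target_std`. The constant is
    computed once and then frozen, as described above."""
    return target_std / float(np.std(first_batch_logits))
```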
Plots
Plots show the interquartile range (25%, 50% and 75% quantiles) of the best learning rate for each case.