Layer rotation: a surprisingly powerful indicator of generalization in deep networks?
Abstract
Our work presents extensive empirical evidence that layer rotation, i.e. the evolution across training of the cosine distance between each layer’s weight vector and its initialization, constitutes an impressively consistent indicator of generalization performance. In particular, larger cosine distances between the final and initial weights of each layer consistently translate into better generalization performance of the final model. Interestingly, this relation admits a network-independent optimum: training procedures during which all layers’ weights reach a cosine distance of 1 from their initialization consistently outperform other configurations in test accuracy. Moreover, we show that layer rotations are easily monitored and controlled (helpful for hyperparameter tuning) and potentially provide a unified framework to explain the impact of learning rate tuning, weight decay, learning rate warmups and adaptive gradient methods on generalization and training speed. In an attempt to explain the surprising properties of layer rotation, we show on a single hidden layer MLP trained on MNIST that layer rotation correlates with the degree to which the features of intermediate layers have been trained.
Simon Carbonnelle, Christophe De Vleeschouwer
FNRS research fellows
ICTEAM, Université catholique de Louvain
Louvain-la-Neuve, Belgium
{simon.carbonnelle, christophe.devleeschouwer}@uclouvain.be
This paper constitutes an extended version of our work presented at the ICML 2019 workshop ’Identifying and Understanding Deep Learning Phenomena’.
1 Introduction
In order to understand the intriguing generalization properties of deep neural networks highlighted by (Neyshabur2015; Zhang2017; Keskar2017), the identification of numerical indicators of generalization performance that remain applicable across a diverse set of training settings is critical. A well-known and extensively studied example of such an indicator is the width of the minima the network has converged to (Hochreiter1997a; Keskar2017). These indicators provide important insights for theoretical works on generalization in deep learning, and help explain why commonly used tricks and techniques influence generalization performance.
In this paper, we present empirical evidence supporting the discovery of a novel indicator of generalization: the evolution across training of the cosine distance between each layer’s weight vector and its initialization (denoted by layer rotation). Indeed, we show across a diverse set of experiments (with varying datasets, networks and training procedures) that larger layer rotations (i.e. larger cosine distances between the final and initial weights of each layer) consistently translate into better generalization performance. In addition to providing an original perspective on generalization, our experiments suggest that layer rotation also benefits from the following properties compared to alternative indicators of generalization:

It has a network-independent optimum (all layers reaching a cosine distance of 1)

It is easily monitored and, since it only depends on the evolution of the network’s weights, can be controlled along the optimization through appropriate weight update adjustments

It provides a unified framework to explain the impact of learning rate tuning, weight decay, learning rate warmups and adaptive gradient methods on generalization and training speed.
In comparison, other indicators usually provide a metric to optimize (e.g. the wider the minimum, the better) but no clear optimum to be reached (what is the optimal width?), nor a precise methodology to tune it (how to converge to a minimum with a specific width?). By disclosing simple guidelines to tune layer rotations and an easy-to-use controlling tool, our work can also help practitioners get the best out of their network with minimal hyperparameter tuning.
After a discussion of related work (Section 2), the presentation of our experimental study is structured according to three successive steps:

Development of tools to monitor and control layer rotation (Section 3);

Systematic study of layer rotation configurations in a controlled setting (Section 4). (Our study focuses on convolutional neural networks used for image classification.)

Study of layer rotation configurations in standard training settings, with a special focus on SGD, weight decay and adaptive gradient methods (Section 5).
Finally, Section 6 provides preliminary results and discussion for explaining the surprising properties of layer rotation.
To encourage further validation of our claims, the tools and source code used to create all the figures of this paper are provided at https://github.com/ispgroupucl/layer-rotation-paper-experiments (the code uses the Keras (chollet2015keras) and TensorFlow (Agarwal2016) libraries). In order to facilitate the usage of our controlling and monitoring tools by practitioners, implementations in different deep learning libraries are available at https://github.com/ispgroupucl/layer-rotation-tools.
2 Related work
After the intriguing generalization properties of deep neural networks were highlighted by (Neyshabur2015; Zhang2017; Keskar2017), several works have tried to identify aspects of training which could predict generalization performance in a consistent and general way.
A first line of work tries to identify the characteristics of trained models that correlate with good generalization properties. Different complexity metrics have been proposed: norm-based metrics are studied in (Bartlett1998; Neyshabur2015a; Liang2019), sensitivity-based metrics in (Xu2012; Novak2018) and sharpness-based metrics in (Hochreiter1997a; Keskar2017). These three approaches are compared and further studied in (Neyshabur2017). Other works proposed complexity metrics that explicitly use the decomposition of deep neural networks into layers. Such decomposition is also key to our work and has already been very successful when analysing the training difficulties of deep neural networks (Bengio1994; Hochreiter1998; Glorot2010). In (Morcos2018), the similarity of hidden representations resulting from training with different initializations is used as an indicator of generalization. In (Morcos2018a), sensitivity to perturbation of hidden representations is studied.
While the works described above help us understand the characteristics of models that generalize well, they do not explicitly disclose how the training procedure itself leads to such characteristics. A second line of work, to which this paper belongs, studies indicators of generalization that characterize the whole training trajectory instead of solely focusing on its endpoint. The sharpness metric is revisited in (Jastrzebski2019), and (Fort) analyses stiffness. While the presence of noise is believed to affect generalization, the mechanisms at play are poorly understood (Goodfellow2015; Xing2018) and a clear metric to quantify its influence on generalization is still lacking. (Hoffer2017) studies the evolution of the Euclidean distance between the model’s weight vector and its initialization in the context of large batch training. This metric is the most similar to layer rotation, and also has the particularity of being both easy to monitor and to control. Our work differs by the distance metric used (layer-level cosine distance instead of model-level Euclidean distance) and by performing a larger study that extends beyond the context of large-batch training.
3 Tools for monitoring and controlling layer rotation
This section describes the tools for monitoring and controlling layer rotation during training, so that its relation with generalization can be studied in Sections 4 and 5.
3.1 Monitoring layer rotation with layer rotation curves
Layer rotation is defined as the evolution of the cosine distance between each layer’s weight vector and its initialization during training. More precisely, let $w_l^t$ be the flattened weight tensor of layer $l$ at optimization step $t$ ($t = 0$ corresponding to initialization); the rotation of layer $l$ at training step $t$ is then defined as the cosine distance between $w_l^0$ and $w_l^t$. (It is worth noting that our study focuses on weights that multiply the inputs of a layer, e.g. kernels of fully connected and convolutional layers. Studying the training of additive weights (biases) is left as future work.) In order to visualize the evolution of layer rotation during training, we record how the cosine distance between each layer’s current weight vector and its initialization evolves across training steps. We denote this visualization tool by layer rotation curves hereafter.
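As a minimal sketch of the monitored quantity (using plain Python lists as stand-ins for flattened weight tensors; the paper's released implementation relies on Keras/TensorFlow, and the dictionary-based bookkeeping below is our own illustrative choice):

```python
import math

def cosine_distance(w, w0):
    """Cosine distance between a layer's current flattened weights w
    and its initial weights w0: 1 - cos(angle between the two vectors)."""
    dot = sum(a * b for a, b in zip(w, w0))
    norm = math.sqrt(sum(a * a for a in w)) * math.sqrt(sum(b * b for b in w0))
    return 1.0 - dot / norm

def update_rotation_curves(curves, weights_now, weights_init):
    """Append one point per layer to the layer rotation curves.

    curves maps a layer name to the list of cosine distances recorded
    so far; calling this once per logged step builds the curves."""
    for name, w in weights_now.items():
        curves.setdefault(name, []).append(cosine_distance(w, weights_init[name]))
    return curves
```

A curve that climbs toward 1 indicates a layer whose weights have rotated as far from their initialization as the metric allows.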
3.2 Controlling layer rotation with Layca
The ability to control layer rotations during training would enable a systematic study of their relation with generalization. We therefore present Layca (LAYer-level Controlled Amount of weight rotation), an algorithm in which the layer-wise learning rates directly determine the amount of rotation performed by each layer’s weight vector during each optimization step (the layer rotation rates), in a direction specified by an optimizer (SGD being the default choice). Inspired by techniques for optimization on manifolds (absil2010), and on spheres in particular, Layca applies layer-wise orthogonal projection and normalization operations to SGD’s updates, as detailed in Algorithm 1 in the Supplementary Material. These operations induce a simple relation between the learning rate $\rho_l(t)$ of layer $l$ at training step $t$ and the angle between $w_l^t$ and $w_l^{t+1}$.
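One Layca-style update for a single layer can be sketched as follows. This is our reading of the projection and normalization operations described above, not the authoritative version (which is Algorithm 1 in the Supplementary Material); in particular, rescaling the projected update to the norm of the weights is an assumption of this sketch:

```python
import math

def layca_step(w, update, rho):
    """One Layca-style update of a single layer's flattened weights w.

    rho is the layer's learning rate, which directly controls the rotation
    performed during this step. `update` is the raw step proposed by the
    inner optimizer (e.g. SGD); it must not be parallel to w."""
    norm_w2 = sum(x * x for x in w)
    norm_w = math.sqrt(norm_w2)
    # 1) project the update orthogonally to the weight vector
    dot_uw = sum(u * x for u, x in zip(update, w))
    u = [ui - (dot_uw / norm_w2) * wi for ui, wi in zip(update, w)]
    # 2) rescale the projected update to the norm of the weights
    norm_u = math.sqrt(sum(x * x for x in u))
    u = [ui * (norm_w / norm_u) for ui in u]
    # 3) take the gradient step
    w_new = [wi - rho * ui for wi, ui in zip(w, u)]
    # 4) restore the layer's original weight norm
    norm_new = math.sqrt(sum(x * x for x in w_new))
    return [wi * (norm_w / norm_new) for wi in w_new]
```

Under this construction the rotation angle of a single step is arctan(rho), independently of the gradient's scale, which is how the learning rate becomes a layer rotation rate.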
Our controlling tool is based on a strong assumption: that controlling the amount of rotation performed during each individual training step (i.e. the layer rotation rate) enables control of the cumulative amount of rotation performed since the start of training (i.e. layer rotation). This assumption is not trivial, since the aggregated rotation is a priori highly dependent on the structure of the loss landscape. For example, for an identical layer rotation rate, the layer rotation will be much smaller if iterates oscillate around a minimum instead of following a stable downward slope. As will be attested by inspection of the layer rotation curves, this assumption nevertheless appeared to be sufficiently valid, and the control of layer rotation was effective in our experiments.
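The caveat can be illustrated numerically in two dimensions (a toy illustration of our own, not an experiment from the paper): two trajectories with an identical per-step rotation rate accumulate very different layer rotations depending on whether the steps are consistent or oscillating.

```python
import math

def rotate(v, theta):
    """Rotate a 2-D vector by theta radians."""
    c, s = math.cos(theta), math.sin(theta)
    return (c * v[0] - s * v[1], s * v[0] + c * v[1])

def cosine_dist(a, b):
    dot = a[0] * b[0] + a[1] * b[1]
    return 1.0 - dot / (math.hypot(*a) * math.hypot(*b))

w0 = (1.0, 0.0)
steps, theta = 10, 0.1  # identical per-step rotation rate in both runs

# stable trajectory: every step rotates in the same direction
w_stable = w0
for _ in range(steps):
    w_stable = rotate(w_stable, theta)

# oscillating trajectory: direction alternates, as around a minimum
w_osc = w0
for i in range(steps):
    w_osc = rotate(w_osc, theta if i % 2 == 0 else -theta)

stable_rotation = cosine_dist(w0, w_stable)       # ~ 1 - cos(1.0)
oscillating_rotation = cosine_dist(w0, w_osc)     # ~ 0: iterates end near w0
```

The per-step rotation rate is 0.1 radians in both runs, yet only the stable trajectory accumulates a large cosine distance from its starting point.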
4 A systematic study of layer rotation configurations with Layca
Section 3 provides tools to monitor and control layer rotation. The purpose of this section is to use these tools to conduct a systematic experimental study of layer rotation configurations. We adopt SGD as the default optimizer, but use Layca (cfr. Algorithm 1) to vary the relative rotation rates (faster rotation for the first layers, the last layers, or no prioritization) and the global rotation rate value (high or low rate, for all layers). The experiments are conducted on five different tasks, which vary in network architecture and dataset complexity, and are further described in Table 1.
Table 1: The five tasks used in our experiments.

Name | Architecture | Dataset
C10CNN1 | VGG-style 25-layer-deep CNN | CIFAR-10
C100resnet | ResNet-32 | CIFAR-100
tinyCNN | VGG-style 11-layer-deep CNN | Tiny ImageNet
C10CNN2 | deep CNN from torch blog | CIFAR-10 + data augm.
C100WRN | Wide ResNet 28-10 with 0.3 dropout | CIFAR-100 + data augm.
4.1 Layer rotation rate configurations
Layca enables us to specify layer rotation rate configurations, i.e. the amount of rotation performed by each layer’s weight vector during one optimization step, by setting the layer-wise learning rates. To explore the large space of possible layer rotation rate configurations, our study restricts itself to two directions of variation. First, we vary the initial global learning rate $\rho_0$, which affects the layer rotation rate of all the layers. During training, the global learning rate $\rho(t)$ drops following a fixed decay scheme (hence the dependence on $t$), as is common in the literature (cfr. Supplementary Material B.2). The second direction of variation tunes the relative rotation rates between different layers. More precisely, we apply static, layer-wise learning rate multipliers that exponentially increase or decrease with layer depth (as typically encountered with exploding or vanishing gradients). The multipliers are parametrized by the layer index $l$ (in forward pass ordering) and a parameter $\alpha$ such that the learning rate of layer $l$ becomes:
(1) 
Values of $\alpha$ at one extreme of its range correspond to faster rotation of the first layers, $\alpha = 0$ corresponds to uniform rotation rates, and values at the other extreme to faster rotation of the last layers. A visualization of the layer-wise multipliers for different values of $\alpha$ is provided in the Supplementary Material (B.1).
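The exact parametrization is given by Equation (1); purely as an illustration, a simple exponential form centred on the middle layer behaves as described (the base and the centring are assumptions of this sketch, not the paper's formula):

```python
def layerwise_multipliers(num_layers, alpha, base=5.0):
    """Static learning rate multipliers that grow or decay exponentially
    with layer depth, controlled by a single parameter alpha.

    alpha = 0 yields uniform multipliers; the two extremes of alpha
    favour the first or the last layers respectively."""
    mid = (num_layers - 1) / 2.0
    return [base ** (alpha * (l - mid)) for l in range(num_layers)]
```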
We explore 10 logarithmically spaced values of the initial global learning rate in one direction of variation, and 13 different values of $\alpha$ in the other. A lower number of configurations is investigated for the C10CNN2 and C100WRN tasks, given the increased computational cost required for their training.
4.2 Study of the relation between layer rotation and generalization
Figure 1 depicts the layer rotation curves (cfr. Section 3.1) and the corresponding test accuracies obtained with different layer rotation rate configurations (results for a larger set of configurations are provided in the Supplementary Material, Figure 9). While each configuration solves the classification task on the training data (100% training accuracy in all configurations, cfr. Supplementary Material B.3), we observe huge differences in generalization ability. More importantly, these differences in generalization ability seem to be tightly connected to differences in layer rotations. In particular, we extract the following rule of thumb that is applicable across the five considered tasks: the larger the layer rotations, the better the generalization performance. The best performance is consistently obtained when nearly all layers reach the largest possible distance from their initialization: a cosine distance of 1 (cfr. fifth column of Figure 1). This observation would have limited value if many configurations (amongst which the best one) led to cosine distances of 1. However, we notice that most configurations do not. In particular, rotating the layers’ weights very slightly is sufficient for the network to achieve 100% training accuracy (cfr. third column of Figure 1).
We also observe that layer rotation rates (rotation with respect to the previous iterate) translate remarkably well into layer rotations (rotation with respect to the initialization). For example, the configuration used in the fifth column indeed leads all layers to rotate quasi-synchronously. As discussed in Section 3.2, this is not self-evident. Understanding why this happens (and why the first and last layers seem to be less tameable) is an interesting direction of research resulting from our work.
5 A study of layer rotation in standard training settings
Section 4 uses Layca to study the relation between layer rotations and generalization in a controlled setting. This section investigates the layer rotation configurations that naturally emerge when using SGD, weight decay or adaptive gradient methods for training. First of all, these experiments provide supplementary evidence for the rule of thumb proposed in Section 4. Second, we will see that studying training methods from the perspective of layer rotation can provide useful insights to explain their behaviour.
The experiments are performed on the five tasks of Table 1. The learning rate parameter is tuned independently for each training setting through grid search over 10 logarithmically spaced values, except for C10CNN2 and C100WRN, where learning rates are taken from their original implementations when using SGD + weight decay, and from (Wilson2017) when using adaptive gradient methods for training. The test accuracies obtained in standard settings will often be compared to the best results obtained with Layca, which are provided in the fifth column of Figure 1.
5.1 Analysis of SGD’s learning rate
The influence of SGD’s learning rate on generalization has been highlighted by several works (Jastrz2017; SmithSam2017; Smith2017; Hoffer2017; masters2018revisiting). The learning rate parameter directly affects layer rotation rates, since it changes the size of the updates. In this section, we verify whether the learning rate’s impact on generalization is coherent with our rule of thumb around layer rotation.
Figure 2 displays the layer rotation curves and test accuracies generated by different learning rate configurations during vanilla SGD training on the five tasks of Table 1. We observe that test accuracy increases for larger layer rotations (consistent with our rule of thumb) until a tipping point where it starts to decrease (inconsistent with our rule of thumb). Interestingly, these problematic cases also correspond to cases with extremely abrupt layer rotations that do not translate into improvements of the training loss (cfr. Figure 10 in Supplementary Material). These observations thus highlight an expected yet important condition for our rule of thumb to hold true: the monitored layer rotations should coincide with actual training (i.e. improvements of the loss function).
A second interesting observation is that the layer rotation curves obtained with vanilla SGD are far from the ideal scenario disclosed in Section 4, where the majority of the layers’ weights reached a cosine distance of 1 from their initialization. In accordance with our rule of thumb, SGD also reaches considerably lower test performance than Layca. A more extensive tuning of the learning rate (over 10 logarithmically spaced values) did not help SGD solve its two systematic problems: 1) layer rotations are not uniform, and 2) the layers’ weights stop rotating before reaching a cosine distance of 1.
5.2 Analysis of SGD and weight decay
Several papers have recently shown that, in batch-normalized networks, the regularization effect of weight decay is caused by an increase of the effective learning rate (VanLaarhoven2017; Hoffer2018; Zhang2019). More generally, reducing the norm of the weights increases the amount of rotation induced by a given training step. It is thus interesting to see how weight decay affects layer rotations, and whether its impact on generalization is coherent with our rule of thumb. Figure 3 displays, for the five tasks, the layer rotation curves generated by SGD when combined with weight decay (in this case, equivalent to $L_2$ regularization). We observe that weight decay solves SGD’s two problems (cfr. Section 5.1): all layers’ weights reach a cosine distance of 1 from their initialization, and the resulting test performances are on par with the ones obtained with Layca.
This experiment not only provides important supplementary evidence for our rules of thumb, but also novel insights around weight decay’s regularization ability in deep nets: weight decay seems to be key for enabling large layer rotations (weights reaching a cosine distance of 1 from their initialization) during SGD training. Since the same behaviour can be achieved with tools that control layer rotation rates (cfr. Layca), without an extra parameter to tune, our results could potentially lead weight decay to disappear from the standard deep learning toolkit.
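The mechanism invoked above (a smaller weight norm means more rotation per update) can be checked with a one-line computation: for an update u orthogonal to the weights w, the angle travelled in one step is arctan(‖u‖/‖w‖), so shrinking the weight norm, as weight decay does, increases the rotation produced by an update of the same size (a toy illustration, not a result from the paper):

```python
import math

def rotation_per_step(weight_norm, update_norm):
    """Angle (radians) between w and w - u, for an update u orthogonal to w."""
    return math.atan(update_norm / weight_norm)

# the same update rotates a decayed weight vector further
full = rotation_per_step(1.0, 0.1)
decayed = rotation_per_step(0.5, 0.1)
```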
5.3 Analysis of learning rate warmups
We have seen in Section 5.1 that, during SGD training, high learning rates can generate abrupt layer rotations at the very beginning of training that do not improve the training loss. In this section, we investigate whether these unstable layer rotations could be the reason why learning rate warmups are sometimes necessary when using high learning rates (He2016; Goyal2018). For this experiment, we use a deeper network that notoriously requires warmups for training: ResNet-110 (He2016). The network is trained on the CIFAR-10 dataset with standard data augmentation techniques. We use a warmup strategy that starts at a 10 times smaller learning rate and linearly increases it to reach the final learning rate in a specified number of epochs.
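The warmup schedule just described can be sketched as follows (the per-epoch linear interpolation is our reading of the strategy; the exact granularity is an assumption of this sketch):

```python
def warmup_lr(epoch, final_lr, warmup_epochs):
    """Linear warmup: start at final_lr / 10 and reach final_lr after
    warmup_epochs epochs; warmup_epochs = 0 disables the warmup."""
    if warmup_epochs == 0 or epoch >= warmup_epochs:
        return final_lr
    start = final_lr / 10.0
    return start + (final_lr - start) * epoch / warmup_epochs
```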
Figure 4 displays the layer rotation and training curves when training with a high learning rate and different warmup durations (0, 5, 10 or 15 epochs of warmup). We observe that without warmup, SGD generates unstable layer rotations and the training accuracy does not improve before the 25th epoch. Using warmups brings significant improvements: high training accuracy is reached within 25 epochs, with only some instabilities in the training curves that are again synchronized with abrupt layer rotations. Finally, we also use Layca for training. Thanks to its controlling ability, Layca does not suffer from SGD’s instabilities in terms of layer rotation. We observe that this enables Layca to achieve large layer rotations and good generalization performance without the need for warmups.
5.4 Analysis of adaptive gradient methods
Recent years have seen the rise of adaptive gradient methods in the context of machine learning (e.g. RMSProp (Tieleman2012), Adagrad (Duchi2011), Adam (Kingma2015)). The key element distinguishing adaptive gradient methods from their non-adaptive equivalents is a parameter-level tuning of the learning rate based on the statistics of each parameter’s partial derivative. Initially introduced to improve training speed, these methods were observed by (Wilson2017) to also have a considerable impact on generalization. Since these methods affect the rate at which individual parameters change, they might also influence the rate at which layers change (and thus layer rotations). To support this claim, we first observe to what extent the parameter-level learning rates of Adam vary across layers.
We monitor Adam’s estimate of the second raw moment when training on the C10CNN1 task. The estimate is computed as

$v_t = \beta_2 \, v_{t-1} + (1 - \beta_2) \, g_t^2,$

where $g_t$ and $v_t$ are vectors containing respectively the gradient and the estimates of the second raw moment at training step $t$ (the square is applied element-wise), and $\beta_2$ is a parameter specifying the decay rate of the moment estimate. While our experiment focuses on Adam, the other adaptive methods (RMSProp, Adagrad) also use statistics of the squared gradients to compute parameter-level learning rates. Figure 5 displays per-layer percentiles of the moment estimates, as measured at the end of epochs 1, 10 and 50. The conclusion is clear: the estimates vary much more across layers than inside layers. This suggests that adaptive gradient methods might have a drastic impact on layer rotations.
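The running estimate above is standard Adam bookkeeping and can be reproduced in a few lines; the nearest-rank percentile helper is our own simplification of the per-layer summary that Figure 5 reports:

```python
def second_moment_estimates(grads_per_step, beta2=0.999):
    """Adam's running estimate v_t = beta2 * v_{t-1} + (1 - beta2) * g_t^2,
    applied element-wise to one layer's flattened gradients over steps."""
    v = [0.0] * len(grads_per_step[0])
    for g in grads_per_step:
        v = [beta2 * vi + (1 - beta2) * gi * gi for vi, gi in zip(v, g)]
    return v

def percentile(values, q):
    """Simple nearest-rank percentile, used to summarize v per layer."""
    s = sorted(values)
    idx = min(len(s) - 1, int(q / 100.0 * len(s)))
    return s[idx]
```

Computing a few percentiles of v for each layer separately reproduces the kind of per-layer spread summary discussed above.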
Adaptive gradient methods can reach SGD’s generalization ability with Layca
Since adaptive gradient methods probably affect layer rotations, we verify whether their influence on generalization is coherent with our rule of thumb. Figure 6 provides the layer rotation curves and test accuracies obtained when using adaptive gradient methods to train the five tasks described in Table 1. We observe an overall worse generalization ability compared to Layca’s optimal configuration, together with small and/or non-uniform layer rotations. We also observe that the layer rotations of adaptive gradient methods are considerably different from the ones induced by SGD (cfr. Figure 2). For example, adaptive gradient methods seem to induce larger rotations of the last layers’ weights, while SGD usually favors rotation of the first layers’ weights. Could these differences explain the impact of parameter-level adaptivity on generalization in deep learning? In Figure 6, we show that when Layca is used on top of adaptive methods (to control layer rotation), adaptive methods can reach test accuracies on par with SGD + weight decay. Our observations thus suggest that adaptive gradient methods’ poor generalization properties are due to their impact on layer rotations. Moreover, the results again provide supplementary evidence for our rule of thumb.
SGD can achieve adaptive gradient methods’ training speed with Layca
We have seen that the negative impact of adaptive gradient methods on generalization is largely due to their influence on layer rotations. Could layer rotations also explain their positive impact on training speed? To test this hypothesis, we recorded the layer rotation rates emerging from training with adaptive gradient methods, and reproduced them during SGD training with the help of Layca. We then observe whether this SGD-Layca optimization procedure (which does not perform parameter-level adaptivity) can achieve the improved training speed of adaptive gradient methods. Figure 5 shows the training curves of the five tasks of Table 1 trained with adaptive gradient methods, SGD + weight decay, and SGD-Layca-AdaptCopy (which copies the layer rotation rates of adaptive gradient methods). While adaptive gradient methods train significantly faster than SGD + weight decay, we observe that their training curves are nearly indistinguishable from those of SGD-Layca-AdaptCopy. Our study thus suggests that adaptive gradient methods’ impact on both generalization and training speed is due to their influence on layer rotations. This result relativizes the importance of parameter-level adaptivity in deep learning applications, suggesting that layer-level adaptivity is what matters most. The influence of layer rotation on training speed is further studied in Supplementary Material A.3.
6 How to interpret layer rotations?
The previous sections of this paper demonstrate the remarkable consistency and explanatory power of layer rotation’s relation with generalization in deep learning. The fundamental character of layer rotation that we observe experimentally clashes, however, with the complete lack of theory or intuition to support these observations. In this section, we provide a preliminary experiment and discussion to initiate a reflection on how to relate layer rotation to more established concepts in machine learning whose role during learning we can grasp more easily.
We use a toy experiment to visualize how layer rotation affects the features learned by a network. We train an MLP with a single hidden layer of 784 neurons on a reduced MNIST dataset (1,000 samples per class, to increase overparametrization). This toy network has the advantage of having intermediate features that are easily visualized: the weights associated with hidden neurons live in the same space as the input images. Starting from an identical initialization, we train the network with four different learning rates (using Layca), leading to four different layer rotation configurations that all reach 100% training accuracy but different generalization abilities (in accordance with our rule of thumb).
Figure 8 displays the features obtained by the different layer rotation configurations (for five randomly selected hidden neurons). This visualization unveils a remarkable phenomenon: layer rotation does not seem to affect which features are learned, but rather to what extent they have been learned during the training process. The larger the layer rotation, the more prominent the features and the less retrievable the initialization. Ultimately, for a layer rotation close to 1, the final weights of the network retain no trace of the initialization. This connection between layer rotation and the degree to which features have been learned suggests a novel interpretation of this paper’s results: perfectly learning intermediate features is not necessary to reach 100% training accuracy (probably because of overparametrization), but training procedures that manage to do so anyway lead to better generalization performance.
7 Conclusion
This paper contains extensive empirical evidence that layer rotations constitute a remarkably powerful indicator of generalization performance. The consistency (a rule of thumb that is widely applicable), simplicity (a network-independent optimum) and explanatory power (novel insights around widely used techniques) of their relation with generalization suggest that layer rotations are tightly connected to a fundamental aspect of deep neural network training. We initiate the quest for this aspect through a preliminary experiment that reveals a connection between layer rotations and the degree to which intermediate features have been learned during the training process. We look forward to the future investigations emerging from these observations. We also hope that our (publicly available) tools for monitoring and controlling layer rotation will reduce the current struggle of practitioners to optimize hyperparameters.
Acknowledgements
Thanks to the anonymous NeurIPS and ICLR reviewers for their helpful feedback on previous versions of this paper. Thanks to the organizers of the ICML 2019 workshop Deep Phenomena for enabling a live presentation of this work to the community.
Special thanks to the reddit r/MachineLearning community for helping outsiders stay up to date with the latest discoveries and discussions of our fast-moving field.
Simon and Christophe are Research Fellows of the Fonds de la Recherche Scientifique – FNRS, which provided funding for this work.
References
 (1) P.A. Absil, R. Mahony, and R. Sepulchre. Optimization On Manifolds: Methods And Applications. In Recent Advances in Optimization and its Applications in Engineering, pages 125–144. Springer, 2010.
 (2) Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems. arXiv preprint arXiv:1603.04467, 2016.
 (3) Peter L Bartlett. The Sample Complexity of Pattern Classification with Neural Networks : The Size of the Weights is More Important than the Size of the Network. IEEE transactions on Information Theory, 44(2):525–536, 1998.
 (4) Yoshua Bengio, Patrice Simard, and Paolo Frasconi. Learning longterm dependencies with gradient descent is difficult. IEEE transactions on neural networks, 5(2):157–166, 1994.
 (5) François Chollet et al. Keras, 2015.
 (6) Stanford CS231N. Tiny ImageNet Visual Recognition Challenge. https://tinyimagenet.herokuapp.com/, 2016.
 (7) Jia Deng, Wei Dong, Richard Socher, LiJia Li, Kai Li, and Li FeiFei. ImageNet: A LargeScale Hierarchical Image Database. In CVPR, pages 248–255, 2009.
 (8) John Duchi, Elad Hazan, and Yoram Singer. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.
 (9) Stanislav Fort, Paweł Krzysztof Nowak, and Srini Narayanan. Stiffness: A New Perspective on Generalization in Neural Networks. arXiv:1901.09491, 2019.
 (10) Boris Ginsburg, Igor Gitman, and Yang You. Large Batch Training of Convolutional Networks with Layerwise Adaptive Rate Scaling, 2018.
 (11) Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In AISTATS, pages 249–256, 2010.
 (12) Ian J Goodfellow, Oriol Vinyals, and Andrew M Saxe. Qualitatively characterizing neural network optimization problems. In ICLR, 2015.
 (13) Priya Goyal, Piotr Dollar, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour. arXiv:1706.02677, 2018.
 (14) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for Image Recognition. In CVPR, pages 770–778, 2016.
 (15) Sepp Hochreiter. The vanishing gradient problem during learning recurrent neural nets and problem solutions. IJUFKS, 6(2):1–10, 1998.
 (16) Sepp Hochreiter and Jürgen Schmidhuber. Flat minima. Neural Computation, 9(1):1–42, 1997.
 (17) Elad Hoffer, Ron Banner, Itay Golan, and Daniel Soudry. Norm matters: efficient and accurate normalization schemes in deep networks. arXiv:1803.01814, 2018.
 (18) Elad Hoffer, Itay Hubara, and Daniel Soudry. Train longer, generalize better: closing the generalization gap in large batch training of neural networks. In NIPS, pages 1729–1739, 2017.
 (19) Stanislaw Jastrzebski, Zachary Kenton, Devansh Arpit, Nicolas Ballas, Asja Fischer, Yoshua Bengio, and Amos Storkey. Three Factors Influencing Minima in SGD. arXiv:1711.04623, 2017.
 (20) Stanisław Jastrzȩbski, Zachary Kenton, Nicolas Ballas, Asja Fischer, Yoshua Bengio, and Amos Storkey. On the relation between the sharpest directions of DNN loss and the SGD step length. In ICLR, 2019.
 (21) Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima. In ICLR, 2017.
 (22) Diederik P Kingma and Jimmy Lei Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
 (23) Alex Krizhevsky and Geoffrey Hinton. Learning Multiple Layers of Features from Tiny Images. Technical report, University of Toronto, 2009.
 (24) Tengyuan Liang, Tomaso Poggio, Alexander Rakhlin, and James Stokes. FisherRao Metric, Geometry, and Complexity of Neural Networks. In AISTATS, volume 89, 2019.
 (25) Dominic Masters and Carlo Luschi. Revisiting Small Batch Training for Deep Neural Networks. arXiv:1804.07612, 2018.
 (26) Ari Morcos, Maithra Raghu, and Samy Bengio. Insights on representational similarity in neural networks with canonical correlation. In NIPS, 2018.
 (27) Ari S Morcos, David G T Barrett, Neil C Rabinowitz, and Matthew Botvinick. On the importance of single directions for generalization. In ICLR, 2018.
 (28) Behnam Neyshabur, Srinadh Bhojanapalli, David Mcallester, and Nathan Srebro. Exploring Generalization in Deep Learning. In NIPS, 2017.
 (29) Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro. In search of the real inductive bias: On the role of implicit regularization in deep learning. In ICLR, 2015.
 (30) Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro. Norm-Based Capacity Control in Neural Networks. In Conference on Learning Theory, pages 1376–1401, 2015.
 (31) Roman Novak, Yasaman Bahri, Daniel A Abolafia, Jeffrey Pennington, and Jascha Sohl-Dickstein. Sensitivity and generalization in neural networks: an empirical study. In ICLR, 2018.
 (32) Karen Simonyan and Andrew Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv:1409.1556, 2014.
 (33) Leslie N Smith and Nicholay Topin. Super-Convergence: Very Fast Training of Residual Networks Using Large Learning Rates. arXiv:1708.07120, 2017.
 (34) Samuel L Smith and Quoc V Le. A bayesian perspective on generalization and stochastic gradient descent. In Proceedings of Second workshop on Bayesian Deep Learning (NIPS 2017), 2017.
 (35) Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5—RmsProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012.
 (36) Twan van Laarhoven. L2 Regularization versus Batch and Weight Normalization. arXiv:1706.05350, 2017.
 (37) Ashia C Wilson, Rebecca Roelofs, Mitchell Stern, Nathan Srebro, and Benjamin Recht. The Marginal Value of Adaptive Gradient Methods in Machine Learning. In NIPS, pages 4151–4161, 2017.
 (38) Chen Xing, Devansh Arpit, Christos Tsirigotis, and Yoshua Bengio. A Walk with SGD. arXiv:1802.08770, 2018.
 (39) Huan Xu and Shie Mannor. Robustness and generalization. Machine learning, 86(3):391–423, 2012.
 (40) Adams Wei Yu, Qihang Lin, Ruslan Salakhutdinov, and Jaime Carbonell. Normalized gradient with adaptive stepsize method for deep neural network training. arXiv:1707.04822, 2017.
 (41) Sergey Zagoruyko. 92.45% on CIFAR10 in Torch. http://torch.ch/blog/2015/07/30/cifar.html, 2015.
 (42) Sergey Zagoruyko and Nikos Komodakis. Wide Residual Networks. In BMVC, 2016.
 (43) Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. In ICLR, 2017.
 (44) Guodong Zhang, Chaoqi Wang, Bowen Xu, and Roger Grosse. Three mechanisms of weight decay regularization. In ICLR, 2019.
Supplementary Material
The supplementary material of this paper is divided into two sections. Section A contains supplementary results, which are not essential for the main message of the paper but could be useful for researchers interested in pursuing our line of work. Section B contains supplementary information about the experimental procedures used in the paper.
Appendix A Supplementary results
A.1 Test accuracies for supplementary and configurations
See Figure 9.
A.2 Further analysis of high learning rates
Figure 2 reveals unstable layer rotations when using high learning rates with SGD. Figure 10 takes a closer look at this phenomenon by plotting layer rotation and training curves at iteration-level (instead of epoch-level) precision during the first epoch of training. The visualization reveals large layer rotations that are sometimes performed in a single iteration. Importantly, these iterations do not induce improvements in training accuracy. Such configurations escape the scope of our rule of thumb.
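Monitoring layer rotation at iteration-level precision only requires the cosine distance between each layer's current and initial weight vectors. A minimal sketch (the helper name is ours, not taken from the paper's code):

```python
import numpy as np

def layer_rotation(w_init, w_current):
    """Cosine distance between a layer's current weight vector and its
    initialization: 0 means the direction is unchanged, 1 means orthogonal."""
    a = np.asarray(w_init, dtype=np.float64).ravel()
    b = np.asarray(w_current, dtype=np.float64).ravel()
    cos_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return 1.0 - cos_sim

# To reproduce iteration-level curves, record one value per layer after
# every optimizer step, e.g.:
# history.append([layer_rotation(w0, w) for w0, w in zip(init_weights, weights)])
```

This is the quantity plotted on the y-axis of the layer rotation curves throughout the paper.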
A.3 How layer rotation influences training speed
While generalization is the main focus of our paper, we observed through our experiments that layer rotation rates also influence the loss curves of our models in a remarkable way. Figure 11 depicts the loss curves obtained for different values of and on the first three tasks of Table 1. It appears that the larger or the more uniform the layer rotation rates are, the higher the plateaus at which the loss curves get stuck. This curious phenomenon again emphasizes the fundamental character of layer rotation; however, we have no explanation for this behaviour.
Following our rule of thumb, this result also suggests that high plateaus are an additional indicator of good generalization performance. This is consistent with the systematic occurrence of high plateaus in the loss curves of state-of-the-art networks [14, 42] (which are usually trained with SGD and weight decay).
A.4 Not all operations of Layca are necessary in practice
The four main operations of Layca are repeated in Algorithm 2. The first operation projects the step onto the space orthogonal to the layer's current weights. A step orthogonal to the current weights is necessary for operation 2 to normalize the rotation performed during the update. However, since a layer typically has thousands of parameters or more (i.e. lives in a high-dimensional space), the step proposed by an optimizer is, with high probability, already approximately orthogonal to the current weights. Explicitly orthogonalizing the step with respect to the weights through operation 1 is thus potentially redundant. Operation 4 keeps the norm of the weights fixed during the whole training process. It prevents the weights from growing excessively (the first three operations increase the norm of the weights at every training step), which would cause numerical problems. However, this operation is not fundamental for controlling the layer rotation rates.
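Based on the description above, the four operations can be sketched for a single layer as follows. This is a hypothetical NumPy rendering of Algorithm 2, not the paper's implementation; the function name and exact role of the learning rate are our assumptions:

```python
import numpy as np

def layca_update(w, step, lr, init_norm):
    """One Layca update for a single layer; `w` and `step` are the flattened
    weight vector and the raw step proposed by the inner optimizer."""
    # Op 1: project the step onto the space orthogonal to the current weights.
    step = step - (np.dot(step, w) / np.dot(w, w)) * w
    # Op 2: rescale the step so that the learning rate directly controls
    # the rotation performed by the update.
    step = step * np.linalg.norm(w) / (np.linalg.norm(step) + 1e-12)
    # Op 3: apply the update.
    w = w - lr * step
    # Op 4: project the weights back onto their initial norm.
    return w * init_norm / np.linalg.norm(w)
```

Because of operations 1 and 2, the angle rotated per step depends only on the learning rate, not on the scale of the gradients.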
We experimented with a subversion of Layca that does not perform operations 1 and 4. Interestingly, the resulting algorithm is equivalent to the algorithms introduced by [40] and [10] (the latter known as LARS). Both works reported improved test performance when using this algorithm. Figure 12 shows the layer rotation curves and associated test accuracies when applying LARS on the C10CNN1, C100resnet and tinyCNN tasks.⁶ The layer rotation rate configuration parameters are and . We observe that this configuration also induces large layer rotations, and that the test accuracies are on par with Layca's. This observation indicates that operations 1 and 4 of Layca can be omitted in at least some practical applications.
⁶ While the norm of each layer's weight vector is not fixed by LARS, we still had to limit the amount of norm increase per training step to prevent numerical errors. We limited it to times the initial norm of each layer's weight vector.
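For comparison, dropping operations 1 and 4 yields a LARS-style update. The sketch below is again hypothetical; in particular, `cap` stands in for the norm bound mentioned in footnote 6 (the exact factor is not reproduced here):

```python
import numpy as np

def lars_like_update(w, step, lr, cap):
    """Layca without operations 1 and 4 (a LARS-style update)."""
    # Op 2 only: rescale the step relative to the current weight norm.
    step = step * np.linalg.norm(w) / (np.linalg.norm(step) + 1e-12)
    # Op 3: apply the update.
    w = w - lr * step
    # Instead of fixing the norm (op 4), merely bound its growth to avoid
    # numerical blow-up, as described in footnote 6.
    n = np.linalg.norm(w)
    return w * cap / n if n > cap else w
```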
Appendix B Supplementary information
B.1 Visualizing the parameter
B.2 Learning rate decay schemes
Our work uses standard learning rate decay schemes, as follows:

- C10CNN1: 100 epochs, with the learning rate divided by 5 at epochs 80, 90 and 97
- C100resnet: 100 epochs, with the learning rate divided by 10 at epochs 70, 90 and 97
- tinyCNN: 80 epochs, with the learning rate divided by 5 at epoch 70
- C10CNN2: 250 epochs, with the learning rate divided by 5 at epochs 100, 170 and 220
- C100WRN: 250 epochs, with the learning rate divided by 5 at epochs 100, 170 and 220

The only exceptions are C10CNN2 and C100WRN training with SGD + weight decay and with adaptive methods, where the learning rate decay schemes are the ones used in their original implementations or in [37].
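Schedules of this form are piecewise-constant and can be written as a small function of the epoch index (e.g. for use with a Keras `LearningRateScheduler` callback). The base learning rate of 0.1 below is an illustrative assumption, not a value taken from the paper:

```python
def make_step_decay(base_lr, milestones, factor):
    """Return a schedule that divides `base_lr` by `factor` at each
    epoch listed in `milestones` (piecewise-constant decay)."""
    def schedule(epoch):
        lr = base_lr
        for m in milestones:
            if epoch >= m:
                lr /= factor
        return lr
    return schedule

# C10CNN1-style scheme: divide by 5 at epochs 80, 90 and 97.
c10_cnn1_schedule = make_step_decay(base_lr=0.1, milestones=[80, 90, 97], factor=5)
```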
B.3 Training errors associated with the layer rotation curves
In Figures 1, 2, 3, 4 and 6, the test accuracies corresponding to each layer rotation curve visualization are provided. While it is briefly mentioned that training accuracy is close to perfect in most cases, Tables 2, 3, 4, 5 and 6 provide the exact values for completeness.
(Tables 2–6: training accuracies for the C10CNN1, C100resnet, tinyCNN, C10CNN2 and C100WRN tasks under the configurations of the corresponding figures, including SGD with warmups of 0, 5, 10 and 15 epochs, Layca without warmup, adaptive methods, and adaptive methods combined with Layca. The numerical entries of the tables are not reproduced here.)
B.4 Momentum scheme used by SGD_AMom and Adam
SGD_AMom was designed for Section 5.4 as a non-adaptive equivalent of Adam. In particular, SGD_AMom uses the same momentum scheme as Adam:
\[ m_t = \beta\, m_{t-1} + (1-\beta)\, g_t, \qquad \theta_{t+1} = \theta_t - \eta\, \frac{m_t}{1-\beta^t}, \]
where $g_t$ is the gradient at step $t$, $\eta$ the learning rate, and $\beta$ the momentum parameter.
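The momentum scheme described above (Adam's bias-corrected first moment, without any adaptive per-parameter scaling) can be sketched as follows; the function name is ours, and the exact SGD_AMom formulation may differ in details:

```python
import numpy as np

def amom_step(theta, grad, m, t, lr, beta=0.9):
    """One step of an Adam-style momentum update (no adaptive scaling).
    `t` is the 1-based step counter used for bias correction."""
    m = beta * m + (1.0 - beta) * grad   # exponential moving average of gradients
    m_hat = m / (1.0 - beta ** t)        # bias correction, as in Adam
    theta = theta - lr * m_hat           # plain SGD step on the corrected moment
    return theta, m
```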