On layer-level control of DNN training
and its impact on generalization
Abstract
The generalization ability of a neural network depends on the optimization procedure used for training it. For practitioners and theoreticians, it is essential to identify which properties of the optimization procedure influence generalization. In this paper, we observe that prioritizing the training of distinct layers in a network significantly impacts its generalization ability, sometimes causing large differences in test accuracy. In order to better monitor and control such prioritization, we propose to define layer-level training speed as the rotation rate of the layer’s weight vector (denoted by layer rotation rate hereafter), and develop Layca, an optimization algorithm that enables direct control over it through each layer’s learning rate parameter, without being affected by gradient propagation phenomena (e.g. vanishing gradients). We show that controlling layer rotation rates enables Layca to significantly outperform SGD with the same amount of learning rate tuning on three different tasks. Furthermore, we provide experiments suggesting that several intriguing observations related to the training of deep models, i.e. the presence of plateaus in learning curves, the impact of weight decay, and the poor generalization properties of adaptive gradient methods, are all due to specific configurations of layer rotation rates. Overall, our work reveals that layer rotation rates are an important factor for generalization, and that monitoring them should be a key component of any deep learning experiment.
Simon Carbonnelle, Christophe De Vleeschouwer
FNRS research fellows
ICTEAM, Université catholique de Louvain
Louvain-la-Neuve, Belgium
{simon.carbonnelle, christophe.devleeschouwer}@uclouvain.be
Preprint. Work in progress.
1 Introduction
Generalization and gradient propagation are two popular themes in the deep learning literature that motivated the questions studied in this paper. On the one hand, it has been observed that a network’s ability to generalize depends on a subtle interaction between the training data and stochastic gradient descent [36, 4]. Considered at layer level, this idea suggests that each layer of a deep network might behave differently with respect to generalization, since the training of each layer is guided by different input and feedback signals. On the other hand, several works have shown that the norm of gradients can gradually increase or decrease across layers, creating inequalities amongst layers during SGD training: some are trained more than others (vanishing and exploding gradients are two well-known examples [5, 16, 13]).
This work explores an interaction between generalization and the intricate nature of gradient propagation, and focuses on the following research question: how do inequalities between layers during training influence generalization? To study this question in depth, we need a method for controlling and monitoring layer-level training speeds that is robust to gradient propagation problems. In our first experiment, we achieve this by focusing on an extreme scenario where only one layer of the network is trained; controlling the layer-level training speeds then reduces to selecting the layer to train. When switching to full training, we propose the rotation rate of a layer’s weight vector (denoted by layer rotation rate hereafter) as a measure of layer-level training speed, and develop the LAYer-level Controlled Amount of weight rotation algorithm (Layca), which enables control over it through the layer-wise learning rate parameters. In both scenarios, the conclusion is clear: controlling the training speed on a per-layer basis influences the generalization ability of the network.
Motivated by this observation, we then study the configurations of layer rotation rates that emerge from training with commonly used optimizers and practices. On the one hand, we show that minima that generalize well and are easily found by Layca may require extensive tuning of layer-wise learning rates with SGD, due to the intricate way gradients propagate in deep networks. With the same amount of learning rate tuning, we show that Layca significantly outperforms SGD on three different tasks. On the other hand, we show that several mysterious observations around deep learning, such as the poor generalization ability of adaptive gradient methods [34], could be due to differences in layer rotation rate configurations. For example, we show that with the same experimental settings as in [34], applying Layca makes the training curves of adaptive methods and their non-adaptive equivalents indistinguishable.
In addition to revealing a novel factor, unique to deep learning, that influences generalization, we expect that our work, and the development of Layca in particular, will help practitioners by reducing the hyperparameter tuning required to train state-of-the-art networks. Moreover, through our analysis of previous observations around deep learning, we show that our work can also provide common ground for investigating and clarifying seemingly unrelated deep learning mysteries, thus facilitating the work of theoreticians. Source code to reproduce all the figures of this paper is provided at https://github.com/Simoncarbo/LayerlevelcontrolofDNNtraining (the code uses the Keras [7] and TensorFlow [2] libraries).
2 Related work
Recent works have demonstrated that generalization in deep neural networks is mostly due to a puzzling interaction between the training data and the optimization method [36, 4]. Our paper discloses one way in which optimization influences generalization in deep learning: by prioritizing the training of specific layers. This novel factor complements batch size and global learning rate, two parameters of SGD that have been extensively studied in the light of generalization [22, 20, 32, 31, 17].
The works studying the vanishing and exploding gradients problems [5, 16, 13] heavily inspired this paper. These works introduce two ideas which are central to our investigation: the intuitive notion of layerlevel training speed and the fact that SGD does not necessarily treat all layers equally during training. Our work explores the same phenomena, but studies them in the light of generalization instead of trainability and speed of convergence.
Our paper also proposes Layca, an algorithm to control inequalities in training speed across layers. It is thus related to works that sought solutions to the gradient propagation problems at the optimization level [27, 14, 30]. Again, our work differs by focusing on the impact on generalization instead of on convergence speed and trainability. Recently, a series of papers proposed optimization algorithms similar to Layca and observed an impact on generalization [35, 37, 12]. Our paper complements these works by providing an extensive analysis of the reasons behind such observations.
3 Layer-level analysis matters when studying generalization
Many previous works on generalization in deep learning do not consider the decomposition of networks into layers, but treat networks as black-box functions without internal structure. The goal of this section is to show, through a toy example, that apprehending the network at layer-level granularity is necessary when studying generalization in deep nets. In particular, it constitutes a first motivating example showing that prioritizing the training of specific layers can influence generalization. The model used in this experiment is an eleven-layer MLP (multilayer perceptron) whose hidden layers are all composed of 784 units with ReLU activation [26]. The data is the MNIST dataset [25], where only 10 randomly selected examples per class are used for training. The particularity of this toy example is that, first of all, the hidden layers are equivalent up to parameter initialization: each layer has 784 inputs, 784 outputs, and a linear+ReLU mapping between them. Second, training only one layer is sufficient to fit the training set. This setting allows us to ask the question: will all layers, if trained in isolation, reach the same accuracy on the test set? In other words, are all layers equivalent with respect to generalization?
3.1 Which layer would you train?
The generalization ability of a model depends on the training data used. In our toy example, the training of the different hidden layers is guided by different input and feedback signals. At initialization, the difference lies in the number of random non-linear transformations applied to the network inputs (forward pass) and model errors (backward pass) before reaching the layer. Do forward and backward signals degrade (w.r.t. generalization) as they go through random non-linear transformations? Are forward and backward signals equally robust/sensitive to random transformations? These questions are at the heart of the proposed dilemma: if you could train only a single layer, which one would you train to reach maximum test accuracy?
Figure 1(a) shows, as a function of the trained layer index, the test accuracy obtained by training a single layer in our toy example network. The results are averaged over 10 experiments. There is a clear trend: the test accuracy degrades with the depth of the trained layer, with a substantial difference in test accuracy between the extremes. Layers are thus not made equal with respect to generalization, and in our toy example, training the first layers of the network tends to generalize better than training the layers at the end of the network.
3.2 Improving its own feedback: the first layers’ secret trick
To understand why different layers have different generalization properties, Section 3.1 highlighted that each layer’s training is guided by different input and feedback signals. Here, using the same toy example, we present another key element that comes into play.
Training a layer in isolation does not impact the inputs the layer receives. However, it impacts its feedback in a non-negligible way: a layer’s training influences the way errors propagate through every subsequent layer, because the backward pass depends on the activations of the forward pass. Our analysis reveals that this is a key ingredient behind the first layers’ performance. Figure 1(b) shows, using the Silhouette coefficient [21], how the feedback received by the first layer gets more correlated with the classes/targets through training, a sign of improved feedback quality. Furthermore, Figure 1(c) reports the test accuracies obtained when the influence on feedback is prevented. (We prevent feedback improvement by storing, at initialization, each ReLU’s regime, i.e. operating or saturated for positive and negative pre-activations respectively, for each neuron and each sample of the training and test sets, and by keeping it fixed during training even if the activation of a sample crosses the zero threshold.) We observe that the trend is inverted: the last layers generalize better than the first layers.
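The feedback-freezing trick described above can be sketched as follows. This is a minimal illustration under our own naming, not the paper's code: `init_preact` and `preact` stand for a layer's per-sample pre-activations at initialization and during training.

```python
import numpy as np

def make_frozen_relu_mask(init_preact):
    """Record which units operate in the linear regime at initialization."""
    return (init_preact > 0).astype(init_preact.dtype)

def frozen_relu(preact, mask):
    """Apply ReLU with the regime fixed by the stored mask: a unit that was
    operating at init stays linear, even if its pre-activation later
    crosses the zero threshold."""
    return preact * mask
```

Because the mask is computed once and reused, gradients flow through the same set of units for the whole of training, so a layer's updates can no longer improve the feedback received by earlier layers.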
In conclusion, based on this toy example, we have revealed novel mechanisms, unique to deep networks, that influence generalization in complex ways and need to be taken into account by a theory of generalization in deep learning.
4 Monitoring and controlling layer-level training speed
In Section 3, we show that not all layers are made equal with respect to generalization, through a toy experiment where one layer is trained at a time. In a realistic situation, however, all layers of a deep network are trained simultaneously. In this case, the speed at which each layer trains could also have an impact on the final generalization ability of the network (training only one layer being an extreme configuration). However, the notion of layer-level training speed is still unclear, and its control through SGD is potentially difficult because of the intricate nature of gradient propagation (cf. vanishing and exploding gradients). The goal of this section is to present tools to monitor and control training speed at layer level, such that its impact on generalization can be studied in Section 5.
4.1 How to define training speed at layer level?
Training speed can be understood as the speed with which a model converges to its optimal solution. It should not be confused with the learning rate, which is only one of the parameters that affect training speed in current deep learning applications. The notion of layer-level training speed is ill-posed, since a layer does not have a loss of its own: all layers optimize the same global loss function. Given a training step, how can we know by how much each layer’s update contributed to the improvement of the global loss? Previous work on vanishing and exploding gradients focused on the norm and variance of gradients as a measure of layer-level training speed [5, 16, 13]. Given the empirical work on activation and weight binarization during [8, 28, 18] or after training [3, 6], we argue that changes to the norm of a weight vector do not matter, and that only its orientation matters. Therefore, we suggest measuring training speed through the rotation rate of a layer’s weight vector (also denoted by layer rotation rate in this paper). (It is worth noting that our measure focuses on weights that multiply the inputs of a layer, typically the kernels of fully connected and convolutional layers. Additive weights, i.e. biases, are not used in our models, and we leave their study as future work.)
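For concreteness, the rotation rate of a layer between two consecutive updates can be measured as the angle between the corresponding weight vectors. This is a hypothetical helper under our own naming, not the paper's code; it deliberately ignores norm changes, in line with the argument above.

```python
import numpy as np

def rotation_angle(w_before, w_after):
    """Angle (in radians) between a layer's weight vector before and
    after an update step; changes in norm are ignored."""
    a, b = np.ravel(w_before), np.ravel(w_after)
    cos_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    # clip guards against round-off pushing cos_sim slightly outside [-1, 1]
    return np.arccos(np.clip(cos_sim, -1.0, 1.0))
```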
4.2 Layca: an algorithm for layer-level control
Given our definition of layer-level training speed, we now develop an algorithm to control it. Ideally, the layer rotation rates should be directly controllable through the layer-wise learning rates, ignoring the peculiarities of gradient propagation. We propose Layca (SGD-guided LAYer-level Controlled Amount of weight rotation), an algorithm where the layer-wise learning rates directly determine the amount of rotation performed by each layer’s weight vector during an optimization step, in a direction specified by an optimizer (SGD being the default choice). Inspired by techniques for optimization on manifolds [1], and on spheres in particular, Layca is composed of four operations, applied individually to each layer: projecting the optimizer’s step onto the space orthogonal to the current weights, applying a rotation-based normalization to the step, performing the update scaled by the learning rate, and projecting the resulting weights back onto the sphere. Algorithm 1 details these operations.
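The four operations can be sketched for a single layer's flattened weight vector as follows. This is a minimal NumPy sketch, not the paper's exact implementation; `step` is assumed to be the raw step proposed by the inner optimizer (e.g. the negative gradient for SGD), and edge cases such as an all-zero step are only crudely handled.

```python
import numpy as np

def layca_step(w, step, lr):
    """One Layca update for a single layer's flattened weight vector."""
    w_norm = np.linalg.norm(w)
    # 1. Project the optimizer's step onto the space orthogonal to w.
    step_orth = step - (np.dot(step, w) / np.dot(w, w)) * w
    # 2. Rotation-based normalization: rescale the step so that the
    #    learning rate alone determines the amount of rotation.
    step_orth = step_orth / (np.linalg.norm(step_orth) + 1e-12) * w_norm
    # 3. Perform the update, scaled by the layer-wise learning rate.
    w_new = w + lr * step_orth
    # 4. Project the result back onto the sphere of radius ||w||.
    return w_new / np.linalg.norm(w_new) * w_norm
```

With this parametrization, a learning rate `lr` rotates the weight vector by an angle of arctan(lr), independently of the gradient's magnitude, which is what makes the rotation rate directly controllable.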
5 A study of layer prioritization during end-to-end training
Section 4.2 provides a tool (Layca) to control layer rotation rates, our tentative definition of layer-level training speed designed to facilitate control over layer prioritization during end-to-end training. In this section, we analyse how prioritizing the training of specific layers impacts generalization by varying the layer-wise learning rates used by Layca. In order to evaluate Layca’s benefit, we perform the same experiments with SGD. The experiments are conducted on a 25-layer VGG-style CNN [29], a ResNet-32 [15], and an 11-layer CNN, trained on CIFAR-10 [24], CIFAR-100 [24], and the Tiny ImageNet dataset [10, 9], respectively. All our networks use batch normalization [19]. More information about the networks and training procedure can be found in the Supplementary Material.
5.1 Layer-wise learning rate configurations
In this paper, we restrict ourselves to a static configuration of layer-wise learning rates that increase or decrease exponentially with layer depth. The learning rate of layer $l$ is parametrized by $\alpha$ as follows:

$\rho_l(t) = \rho(t) \cdot m(\alpha)^{l}$   (1)

where $l$ is the index of the layer in forward-pass ordering, $\rho(t)$ is a global learning rate schedule parametrized by $t$, the current training step, and $m(\alpha)$ is an increasing function of $\alpha$ with $m(0) = 1$. Negative values of $\alpha$ thus correspond to prioritizing the first layers, $\alpha = 0$ corresponds to no prioritization, and positive values of $\alpha$ to prioritizing the last layers. We study 13 values of $\alpha$ in our experiments. A visualization of the layer-wise learning rates as a function of the studied $\alpha$ values is provided in the Supplementary Material. The initial global learning rate $\rho(0)$ is determined by grid search over 10 values in the $\alpha = 0$ setting (SGD’s detailed results are available in the Supplementary Material).
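The exponential layer-wise learning rate configuration described above can be sketched as follows. The multiplicative base is an illustrative assumption of ours, not the paper's constant; only the qualitative shape (uniform at zero, exponentially decaying or growing with depth otherwise) matters here.

```python
def layerwise_lrs(rho_t, alpha, n_layers, base=2.0):
    """Exponential layer-wise learning rate configuration.

    `base` is an illustrative choice: each step in depth multiplies the
    learning rate by base**alpha, so alpha < 0 favours early layers,
    alpha = 0 is uniform, and alpha > 0 favours late layers.
    """
    return [rho_t * base ** (alpha * l) for l in range(n_layers)]
```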
5.2 Controlling layer prioritization with Layca
Figure 2 shows, for the three tasks, how the test error evolves with $\alpha$ when Layca is used for training. Test errors are only reported for values of $\alpha$ for which a sufficiently high training accuracy could be obtained. First of all, we observe that prioritizing the early layers (negative $\alpha$ values) generalizes better than prioritizing the last layers (positive $\alpha$ values), which is in line with the observations of Section 3. Moreover, we observe that the best test accuracies are systematically obtained for $\alpha = 0$, i.e. for uniform rotation rates, without prioritization. This observation might appear to contradict Section 3, where the first layers’ superior ability to promote generalization is highlighted. There is, however, no contradiction, as the mechanisms governing the improvement of the forward and backward signals w.r.t. generalization during training are different in the isolated and simultaneous training settings. In particular, when all layers are trained simultaneously, training a layer will impact the inputs it receives (with a delay of two training steps), whereas it will not when the layer is trained in isolation. A thorough explanation of our observations is left as future work. We hope that our toy example (see Section 3) will provide the necessary intuitions to understand these complex phenomena.
5.3 Comparison of Layca and SGD
The same experiment is performed with SGD, and the results are shown in Figure 2. We discuss two key differences between SGD’s and Layca’s performances; both suggest Layca’s superior ability to control layer prioritization. First of all, while the evolution of the test accuracy as a function of $\alpha$ is consistent across tasks when Layca is used, it is not for SGD. In particular, while a clear rule emerged from the Layca experiment (choosing $\alpha = 0$), SGD’s experiment does not provide any convincing recommendation on how to set $\alpha$ to maximize generalization. Secondly, Layca is able to cover a larger range of test accuracies with the same $\alpha$ values and, importantly, reaches test accuracies that outperform SGD by a significant margin.
To analyse the different behaviours of Layca and SGD further, we monitor the rotation of each layer’s weight vector across training for $\alpha = 0$. We argue in Section 4 that the rotation of a layer’s weight vector constitutes the bulk of DNN training. Figure 3 shows the evolution of the cosine distance between each layer’s weight vector and its initialization (denoted by layer-wise angle deviation curves hereafter). For the three tasks, Layca training exhibits a similar pattern where most layers’ weight vectors are rotated significantly and synchronously. Our results suggest that such behaviour, induced by high and uniform layer rotation rates, is indicative of good generalization performance. SGD does not exhibit such dynamics: first, gradient propagation phenomena induce different layer rotation rates even when layers use the same learning rate (especially visible on the CIFAR-10 task); second, the total amount of rotation resulting from SGD training is significantly smaller than with Layca, although the global learning rate is determined by grid search in both cases (especially visible on the Tiny ImageNet task).
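Layer-wise angle deviation curves of the kind shown in Figure 3 can be computed with a simple helper. This is a sketch under our own naming; `checkpoints` is a hypothetical list of weight snapshots saved during training.

```python
import numpy as np

def angle_deviation(w_init, w_current):
    """Cosine distance between a layer's current weight vector and its
    value at initialization."""
    a, b = np.ravel(w_init), np.ravel(w_current)
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def angle_deviation_curve(w_init, checkpoints):
    """One layer-wise angle deviation curve: deviation from the initial
    weights at each saved training checkpoint."""
    return [angle_deviation(w_init, w) for w in checkpoints]
```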
6 A second look at previous observations around deep learning
Section 5 demonstrates that layer rotation rates and Layca provide a way to monitor and control layer prioritization during training, which influences generalization. In this section, we use these tools to shed new light on previous observations concerning the training dynamics and generalization properties of deep nets.
6.1 Occurrence of plateaus in learning curves
The learning curves of most state-of-the-art networks exhibit a curious phenomenon: plateaus that are escaped by a reduction of the learning rate (e.g. [15]). To our knowledge, there is currently no explanation for this behaviour. While generalization is the main focus of our paper, we have observed through our experiments that layer rotation rate configurations are closely related to the emergence of such training plateaus. Figure 4(a) shows that when Layca is used on the Tiny ImageNet task (see Section 5), the layer-wise learning rate configurations (parametrized by $\alpha$ as described in Section 5) directly influence the height of the plateaus. More specifically, we observe that the higher and more uniform the rotation rates are ($\alpha$ closer to 0), the higher the plateaus, or equivalently, the more difficult it is to reduce the loss in the early stages of training. The fact that plateaus are most prominent when all layers are trained with a high learning rate suggests that they are caused by some kind of interference between the layers during training. The same observation on the CIFAR-10 and CIFAR-100 tasks is presented in the Supplementary Material.
6.2 Parameter-level adaptivity in deep learning
It has recently been shown that adaptive gradient methods exhibit worse generalization properties than SGD in typical deep learning applications [34]. The key characteristic of adaptive methods is their tuning of the learning rate at parameter level, based on the statistics of each parameter’s partial derivatives. In this section, we argue that in typical deep learning applications these approaches produce learning rates that differ mostly across layers and negligibly inside layers, and that the observed generalization drop is due to the different layer rotation rate configurations that emerge from these methods. When the same layer rotation rate configuration is enforced by Layca, we thus expect the drop in generalization to disappear. To verify this hypothesis, we train the same convolutional network as [34] and show that when Layca is applied on top of the different optimization algorithms, the test curves of adaptive methods (RMSProp [33], Adagrad [11], Adam [23]) and their non-adaptive equivalents (SGD, and SGD_AMom, a version of SGD with a momentum scheme similar to Adam’s; see Supplementary Material) become indistinguishable (Figure 4). Moreover, Layca enables all methods to reach the best test accuracy reported by [34].
6.3 The impact of weight decay on training of deep networks
In the experiment described in Section 6.2, SGD generalizes as well as Layca. Moreover, in Section 6.1, we observe that learning curve plateaus appear during the training of most state-of-the-art networks. These observations are consistent with our previous experiments only if, for state-of-the-art models whose hyperparameters were carefully tuned, SGD is able to generate high and uniform layer rotation rates, thereby acquiring the key strength of Layca. We verify this by analysing the layer-wise angle deviation curves emerging from SGD training of the network used in Section 6.2. We observe in Figure 4(b) that training the network with SGD indeed resulted in significant and synchronized rotation of each layer’s weight vector. Further analysis showed that this beneficial behaviour is very sensitive to the tuning of weight decay: training the same network without weight decay resulted in very different layer-wise angle deviation curves (Figure 4(b)) and a drop in test accuracy. This suggests that in some cases, through a mechanism whose study we leave as future work, weight decay enables SGD to induce high and uniform layer rotation rates and thus to generalize as well as Layca. Interestingly, Layca reached SGD’s performance without the need for weight decay.
7 Conclusion
Inspired by work on generalization and gradient propagation in deep networks, this paper tackles the following research question: how does prioritizing layers during training influence generalization? A toy experiment in Section 3 demonstrates the importance of this research direction. In order to extend our analysis to common training settings, we propose to define layer-level training speed as the rotation rate of a layer’s weights (i.e. the layer rotation rate), and develop Layca, an algorithm that provides control over it.
In Section 5, we demonstrate that Layca enables a more pertinent study of the impact of layer prioritization on generalization than SGD, which is subject to the intricacies of gradient propagation in deep nets. Moreover, Layca’s built-in ability to use high and uniform layer rotation rates enables it to significantly outperform SGD in terms of test accuracy on three different tasks. In Section 6, we show that Layca enables precise control of the height of the plateaus emerging in training curves, that Layca can eliminate the differences in generalization between adaptive methods and their non-adaptive equivalents, and finally, that state-of-the-art models can exhibit, through tuning of the weight decay, high and uniform rotation rates, enabling SGD to generalize as well as Layca.
Overall, the observations of Sections 5 and 6 provide evidence that layer rotation rates are a pertinent definition of layer-level training speed, and that, as such, they are important indicators of the generalization properties that will emerge from training. Our hope is that this discovery will facilitate the training of state-of-the-art networks for practitioners, and provide guidance for theoreticians to solve deep learning’s remaining mysteries.
Acknowledgements
Special thanks to the reddit r/MachineLearning community for enabling outsiders to stay up to date with the last discoveries and discussions of our fast moving field.
References
 [1] P.-A. Absil, R. Mahony, and R. Sepulchre. Optimization on Manifolds: Methods and Applications. In Recent Advances in Optimization and its Applications in Engineering, pages 125–144. Springer, 2010.
 [2] Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems. arXiv preprint arXiv:1603.04467, 2016.
 [3] Pulkit Agrawal, Ross Girshick, and Jitendra Malik. Analyzing the Performance of Multilayer Neural Networks for Object Recognition. In ECCV, pages 329–344, 2014.
 [4] Devansh Arpit, Stanisław Jastrzebski, Nicolas Ballas, David Krueger, Emmanuel Bengio, Maxinder S Kanwal, Tegan Maharaj, Asja Fischer, Aaron Courville, Yoshua Bengio, and Simon LacosteJulien. A Closer Look at Memorization in Deep Networks. In ICML, 2017.
 [5] Yoshua Bengio, Patrice Simard, and Paolo Frasconi. Learning longterm dependencies with gradient descent is difficult. IEEE transactions on neural networks, 5(2):157–166, 1994.
 [6] Simon Carbonnelle and Christophe De Vleeschouwer. Discovering the mechanics of hidden neurons. https://openreview.net/forum?id=H1srNebAZ, 2018.
 [7] François Chollet et al. Keras, 2015.
 [8] Matthieu Courbariaux and Jean-Pierre David. BinaryConnect: Training Deep Neural Networks with binary weights during propagations. In NIPS, pages 3123–3131, 2015.
 [9] Stanford CS231N. Tiny ImageNet Visual Recognition Challenge. https://tinyimagenet.herokuapp.com/, 2016.
 [10] Jia Deng, Wei Dong, Richard Socher, LiJia Li, Kai Li, and Li FeiFei. ImageNet: A LargeScale Hierarchical Image Database. In CVPR, pages 248–255, 2009.
 [11] John Duchi, Elad Hazan, and Yoram Singer. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.
 [12] Boris Ginsburg, Igor Gitman, and Yang You. Large Batch Training of Convolutional Networks with Layer-wise Adaptive Rate Scaling, 2018.
 [13] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In AISTATS, pages 249–256, 2010.
 [14] Elad Hazan, Kfir Levy, and Shai Shalev-Shwartz. Beyond convexity: Stochastic quasi-convex optimization. In NIPS, pages 1594–1602, 2015.
 [15] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for Image Recognition. In CVPR, pages 770–778, 2016.
 [16] Sepp Hochreiter. The vanishing gradient problem during learning recurrent neural nets and problem solutions. IJUFKS, 6(2):1–10, 1998.
 [17] Elad Hoffer, Itay Hubara, and Daniel Soudry. Train longer, generalize better: closing the generalization gap in large batch training of neural networks. In NIPS, pages 1729–1739, 2017.
 [18] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio. Binarized Neural Networks. In NIPS, 2016.
 [19] Sergey Ioffe and Christian Szegedy. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In ICML, pages 448–456, 2015.
 [20] Stanislaw Jastrzebski, Zachary Kenton, Devansh Arpit, Nicolas Ballas, Asja Fischer, Yoshua Bengio, and Amos Storkey. Three Factors Influencing Minima in SGD. arXiv:1711.04623, 2017.
 [21] Leonard Kaufman and Peter J Rousseeuw. Finding groups in data: an introduction to cluster analysis. John Wiley & Sons, 2009.
 [22] Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On LargeBatch Training for Deep Learning: Generalization Gap and Sharp Minima. In ICLR, 2017.
 [23] Diederik P Kingma and Jimmy Lei Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
 [24] Alex Krizhevsky and Geoffrey Hinton. Learning Multiple Layers of Features from Tiny Images. Technical report, University of Toronto, 2009.
 [25] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradientbased learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2323, 1998.
 [26] Vinod Nair and Geoffrey E Hinton. Rectified Linear Units Improve Restricted Boltzmann Machines. In ICML, pages 807–814, 2010.
 [27] Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of training recurrent neural networks. In ICML, pages 1310–1318, 2013.
 [28] Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. XNOR-Net: ImageNet classification using binary convolutional neural networks. In ECCV, pages 525–542. Springer, 2016.
 [29] Karen Simonyan and Andrew Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv:1409.1556, 2014.
 [30] Bharat Singh, Soham De, Yangmuzi Zhang, Thomas Goldstein, and Gavin Taylor. Layer-specific adaptive learning rates for deep networks. In ICMLA, pages 364–368, 2015.
 [31] Leslie N Smith and Nicholay Topin. Super-Convergence: Very Fast Training of Residual Networks Using Large Learning Rates. arXiv:1708.07120, 2017.
 [32] Samuel L Smith and Quoc V Le. A bayesian perspective on generalization and stochastic gradient descent. In Proceedings of Second workshop on Bayesian Deep Learning (NIPS 2017), 2017.
 [33] Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5—RmsProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012.
 [34] Ashia C Wilson, Rebecca Roelofs, Mitchell Stern, Nathan Srebro, and Benjamin Recht. The Marginal Value of Adaptive Gradient Methods in Machine Learning. In NIPS, pages 4151–4161, 2017.
 [35] Adams Wei Yu, Qihang Lin, Ruslan Salakhutdinov, and Jaime Carbonell. Normalized gradient with adaptive stepsize method for deep neural network training. arXiv:1707.04822, 2017.
 [36] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. In ICLR, 2017.
 [37] Zijun Zhang, Lin Ma, Zongpeng Li, and Chuan Wu. Normalized Direction-preserving Adam. arXiv:1709.04546, 2017.
Supplementary Material
Appendix A Additional notes
A.1 Information about models and training procedure
Please refer to the source code, provided at https://github.com/Simoncarbo/LayerlevelcontrolofDNNtraining.
A.2 Some recommendations when using Layca

- The same learning rate was optimal in all our experiments with Layca, and thus constitutes a good default value.
- Using batch normalization is recommended. Early experiments suggest that removing batch normalization sometimes disables Layca’s ability to perform significant and synchronized rotation of the layers’ weight vectors.
- Staying on plateaus for a large number of epochs (in other words, waiting before reducing the learning rate) systematically improved generalization performance (this has also been observed for SGD in [17]).
- Layca’s operations are prone to numerical instabilities. Replacing any NaN values in the update with zeros is required.
- Layca was not evaluated on networks with additive biases. We suggest removing biases for now (also in the batch normalization layers). If you use biases anyway, do not initialize them to zero: rotating a zero vector makes no sense.
A.3 SGD_AMom
SGD_AMom was designed for Section 6.2 as a non-adaptive equivalent of Adam. In particular, SGD_AMom uses the same momentum scheme as Adam:

$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t, \qquad \hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \theta_t = \theta_{t-1} - \rho(t)\, \hat{m}_t$

where $g_t$ is the gradient at step $t$, $\rho(t)$ the learning rate, and $\beta_1$ the momentum parameter.
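A minimal sketch of one SGD_AMom update, assuming the scheme is Adam's bias-corrected first-moment estimate with the adaptive (second-moment) scaling removed:

```python
def sgd_amom_step(w, g, m, t, lr, beta1=0.9):
    """One SGD_AMom update on parameter w: Adam-style bias-corrected
    momentum, but with a single learning rate instead of per-parameter
    adaptivity. `t` is the 1-based step counter."""
    m = beta1 * m + (1.0 - beta1) * g   # first-moment (momentum) estimate
    m_hat = m / (1.0 - beta1 ** t)      # bias correction
    return w - lr * m_hat, m
```

For a constant gradient, the bias correction makes every step exactly `-lr * g`, so the method matches plain SGD in that regime while still smoothing noisy gradients.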