Learning with Random Learning Rates
Hyperparameter tuning is a bothersome step in the training of deep learning models. One of the most sensitive hyperparameters is the learning rate of the gradient descent. We present the All Learning Rates At Once (Alrao) optimization method for neural networks: each unit or feature in the network gets its own learning rate sampled from a random distribution spanning several orders of magnitude. This comes at practically no computational cost. Perhaps surprisingly, stochastic gradient descent (SGD) with Alrao performs close to SGD with an optimally tuned learning rate, for various architectures and problems. Alrao could save time when testing deep learning models: a range of models could be quickly assessed with Alrao, and the most promising models could then be trained more extensively. This text comes with a PyTorch implementation of the method, which can be plugged into an existing PyTorch model.
Hyperparameter tuning is a notable source of computational cost with deep learning models [zoph2016neural]. One of the most critical hyperparameters is the learning rate of the gradient descent [theodoridis2015machine, p. 892]. With too large learning rates, the model does not learn; with too small learning rates, optimization is slow and can lead to local minima and poor generalization [jastrzkebski2017three, Kurita2018, Mack2016, Surmenok2017]. Although popular optimizers like Adam [Kingma2015] come with default hyperparameters, fine-tuning and scheduling of the Adam learning rate is still frequent [denkowski2017stronger], and we suspect the default setting might be somewhat specific to current problems and architecture sizes. Such hyperparameter tuning takes up a lot of engineering time. These and other issues largely prevent deep learning models from working out-of-the-box on new problems, or on a wide range of problems, without human intervention (AutoML setup, [guyon2016brief]).
We propose All Learning Rates At Once (Alrao), an alteration of standard optimization methods for deep learning models. Alrao uses multiple learning rates at the same time in the same network. By sampling one learning rate per feature, Alrao reaches performance close to the performance of the optimal learning rate, without having to try multiple learning rates. Alrao can be used on top of various optimization algorithms; we tested SGD and Adam [Kingma2015]. Alrao with Adam typically led to strong overfit with good train but poor test performance (see Sec. 4), and our experimental results are obtained with Alrao on top of SGD.
Alrao could be useful when testing architectures: an architecture could first be trained with Alrao to obtain an approximation of the performance it would have with an optimal learning rate. Then it would be possible to select a subset of promising architectures based on Alrao, and search for the best learning rate on those architectures, fine-tuning with any optimizer.
Alrao increases the size of a model on the output layer, but not on the internal layers: this usually adds little computational cost unless most parameters occur on the output layer. This text comes with a PyTorch implementation usable on a wide set of architectures.
Automatically using the “right” learning rate for each parameter was one motivation behind “adaptive” methods such as RMSProp [tieleman2012lecture], AdaGrad [adagrad] or Adam [Kingma2015]. Adam with its default setting is currently considered the default go-to method in many works [wilson2017marginal], and we use it as a baseline. However, further global adjustment of the learning rate in Adam is common [liu2017progressive]. Many other heuristics for setting the learning rate have been proposed, e.g., [pesky]; most start with the idea of approximating a second-order Newton step to define an optimal learning rate [lecun1998efficient].
Methods that directly set per-parameter learning rates are equivalent to preconditioning the gradient descent with a diagonal matrix. Asymptotically, an arguably optimal preconditioner is either the Hessian of the loss (Newton method) or the Fisher information matrix [Amari98]. These can be viewed as setting a per-direction learning rate after redefining directions in parameter space. From this viewpoint, Alrao just replaces these preconditioners with a random diagonal matrix whose entries span several orders of magnitude.
Another approach to optimize the learning rate is to perform a gradient descent on the learning rate itself through the whole training procedure (for instance [maclaurin2015gradient]). This can be applied online to avoid backpropagating through multiple training rounds [masse2015speed]. This idea has a long history, see, e.g., [schraudolph1999local] or [mahmood2012tuning] and the references therein.
The learning rate can also be treated within the framework of architecture search, which can explore both the architecture and learning rate at the same time (e.g., [real2017large]). Existing methods range from reinforcement learning [zoph2016neural, baker2016designing] to bandits [li2017hyperband], evolutionary algorithms (e.g., [stanley2002evolving, jozefowicz2015empirical, real2017large]), Bayesian optimization [bergstra2013making] or differentiable architecture search [liu2018darts]. These powerful methods are resource-intensive and do not allow for finding a good learning rate in a single run.
Alrao was inspired by the intuition that not all units in a neural network end up being useful. Hopefully, in a large enough network, a sub-network made of units with a good learning rate could learn well, and hopefully the units with a wrong learning rate will just be ignored. (Units with a too large learning rate may produce large activation values, so this assumes the model has some form of protection against those, such as BatchNorm or sigmoid/tanh activations.)
Several lines of work support the idea that not all units of a network are useful or need to be trained. First, it is possible to prune a trained network without reducing the performance too much (e.g., [lecun1990, Han2015, Han2015a, See]). [Li2018] even show that performance is reasonable if learning only within a very small-dimensional affine subspace of the parameters, chosen in advance at random rather than post-selected.
Second, training only some of the weights in a neural network while leaving the others at their initial values performs reasonably well (see experiments in Appendix F). So in Alrao, units with a very small learning rate should not hinder training.
Alrao is consistent with the lottery ticket hypothesis, which posits that “large networks that train successfully contain subnetworks that—when trained in isolation—converge in a comparable number of iterations to comparable accuracy” [Frankle2018]. This subnetwork is the lottery ticket winner: the one which had the best initial values. Arguably, given the combinatorial number of subnetworks in a large network, with high probability one of them is able to learn alone, and will make the whole network converge.
Viewing the per-feature learning rates of Alrao as part of the initialization, this hypothesis suggests there might be enough sub-networks whose initialization leads to good convergence.
2 All Learning Rates At Once: Description
Alrao starts with a standard optimization method such as SGD, and a range of possible learning rates $(\eta_{\min}, \eta_{\max})$. Instead of using a single learning rate, we sample once and for all one learning rate for each feature, log-uniformly in $(\eta_{\min}, \eta_{\max})$. Then these learning rates are used in the usual optimization update:
$$\theta_{l,i} \leftarrow \theta_{l,i} - \eta_{l,i} \cdot \nabla_{\theta_{l,i}} \ell(\Phi_\theta(x), y) \qquad (1)$$
where $\theta_{l,i}$ is the set of parameters used to compute the feature $i$ of layer $l$ from the activations of layer $l-1$ (the incoming weights of feature $i$). Thus we build “slow-learning” and “fast-learning” features, in the hope of getting enough features in the “Goldilocks zone”.
What constitutes a feature depends on the type of layers in the model. For example, in a fully connected layer, each component of a layer is considered as a feature: all incoming weights of the same unit share the same learning rate. On the other hand, in a convolutional layer we consider each convolution filter as constituting a feature: there is one learning rate per filter (or channel), thus keeping translation-invariance over the input image. In LSTMs, we apply the same learning rate to all components in each LSTM unit (thus in the implementation, the vector of learning rates is the same for input gates, for forget gates, etc.).
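As a minimal sketch of the per-feature update for a fully connected layer, the snippet below applies one SGD step in which all incoming weights of a unit share that unit's learning rate. It uses plain Python lists rather than the released implementation; the function name `alrao_sgd_step` and the toy values are illustrative assumptions.

```python
def alrao_sgd_step(weights, grads, feature_lrs):
    """One SGD step with one learning rate per feature (hypothetical helper).

    weights[i] and grads[i] hold the incoming weights of feature i and their
    gradients; feature_lrs[i] is the learning rate shared by that whole row.
    """
    return [
        [w - lr * g for w, g in zip(w_row, g_row)]
        for w_row, g_row, lr in zip(weights, grads, feature_lrs)
    ]


# Toy layer with two features and two inputs each (made-up numbers).
weights = [[1.0, 2.0], [3.0, 4.0]]
grads = [[1.0, 1.0], [1.0, 1.0]]
feature_lrs = [0.1, 0.0]  # the second feature has rate 0 and stays frozen
updated = alrao_sgd_step(weights, grads, feature_lrs)
```

For a convolutional layer the same idea would apply with one row per filter, so that every weight of a filter moves with the same rate.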
However, the update (1) cannot be used directly in the last layer. For instance, for regression there may be only one output feature. For classification, each feature in the final classification layer represents a single category, and so using different learning rates for these features would favor some categories during learning. Instead, on the output layer we chose to duplicate the layer using several learning rate values, and use a (Bayesian) model averaging method to obtain the overall network output (Fig. 1).
We set a learning rate per feature, rather than per parameter. Otherwise, every feature would have some parameters with large learning rates, and we would expect even a few large incoming weights to be able to derail a feature. So having diverging parameters within a feature is hurtful, while having diverging features in a layer is not necessarily hurtful since the next layer can choose to disregard them. Still, we tested this option; the results are compatible with this intuition (Appendix E).
Definitions and notation.
We now describe Alrao more precisely for deep learning models with softmax output, on classification tasks (the case of regression is similar).
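Let $\mathcal{D} = \{(x_1, y_1), \ldots, (x_n, y_n)\}$, with $y_i \in \{1, \ldots, K\}$, be a classification dataset. The goal is to predict the $y_i$ given the $x_i$, using a deep learning model $\Phi_\theta$. For each input $x$, $\Phi_\theta(x)$ is a probability distribution over $\{1, \ldots, K\}$, and we want to minimize the categorical cross-entropy loss $\ell$ over the dataset: $\frac{1}{n} \sum_i \ell(\Phi_\theta(x_i), y_i)$.
A deep learning model for classification $\Phi_\theta$ is made of two parts: a pre-classifier $\phi_{\theta^{pc}}$ which computes some quantities fed to a final classifier layer $C_{\theta^c}$, namely, $\Phi_\theta(x) = C_{\theta^c}(\phi_{\theta^{pc}}(x))$. The classifier layer $C_{\theta^c}$ with $K$ categories is defined by $C_{\theta^c}(z) = \mathrm{softmax}(W^T z + b)$ with $\theta^c = (W, b)$, and $\mathrm{softmax}(z_1, \ldots, z_K)_k = e^{z_k} / \sum_i e^{z_i}$. The pre-classifier is a computational graph composed of any number of layers, and each layer is made of multiple features.
We denote by $\log\text{-}U(\cdot\,; \eta_{\min}, \eta_{\max})$ the log-uniform probability distribution on an interval $(\eta_{\min}, \eta_{\max})$: namely, if $\eta \sim \log\text{-}U(\cdot\,; \eta_{\min}, \eta_{\max})$, then $\log \eta$ is uniformly distributed between $\log \eta_{\min}$ and $\log \eta_{\max}$. Its density function is
$$\log\text{-}U(\eta; \eta_{\min}, \eta_{\max}) = \frac{\mathbb{1}_{\eta_{\min} \le \eta \le \eta_{\max}}}{\log \eta_{\max} - \log \eta_{\min}} \times \frac{1}{\eta} \qquad (2)$$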
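A log-uniform draw can be obtained by sampling uniformly in log space and exponentiating. The sketch below assumes interval bounds named `eta_min` and `eta_max`; the function names and the numeric bounds in the example are illustrative and are not the values used in the experiments.

```python
import math
import random


def sample_log_uniform(eta_min, eta_max, rng=random):
    """Draw eta such that log(eta) is uniform on [log(eta_min), log(eta_max)]."""
    if not 0 < eta_min <= eta_max:
        raise ValueError("need 0 < eta_min <= eta_max")
    return math.exp(rng.uniform(math.log(eta_min), math.log(eta_max)))


def log_uniform_density(eta, eta_min, eta_max):
    """Density of the log-uniform distribution: 1 / (eta * log(eta_max / eta_min))."""
    if eta < eta_min or eta > eta_max:
        return 0.0
    return 1.0 / (eta * (math.log(eta_max) - math.log(eta_min)))


rng = random.Random(0)  # fixed seed so the sketch is reproducible
# One learning rate per feature of a hypothetical layer with 8 features.
feature_lrs = [sample_log_uniform(1e-4, 1e1, rng) for _ in range(8)]
```

Because the sampling is uniform in log space, each order of magnitude in the interval receives the same expected share of features.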
Alrao for the pre-classifier: A random learning rate for each feature.
In the pre-classifier, for each feature $i$ in each layer $l$, a learning rate $\eta_{l,i}$ is sampled from the probability distribution $\log\text{-}U(\cdot\,; \eta_{\min}, \eta_{\max})$, once and for all at the beginning of training. (Footnote 1: With learning rates resampled at each time step, each step would be, in expectation, an ordinary SGD step with learning rate $\mathbb{E}\,\eta_{l,i}$, thus just yielding an ordinary SGD trajectory with more noise.) Then the incoming parameters of each feature in the pre-classifier are updated in the usual way with this learning rate (Eq. 4).
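The expected learning rate that appears in the resampling remark has a closed form for a log-uniform distribution. Writing $\eta_{\min}$ and $\eta_{\max}$ for the bounds of the sampling interval (notation assumed here), integrating $\eta$ against the log-uniform density gives:

$$\mathbb{E}[\eta] = \int_{\eta_{\min}}^{\eta_{\max}} \eta \cdot \frac{1}{\eta \, (\log \eta_{\max} - \log \eta_{\min})} \, d\eta = \frac{\eta_{\max} - \eta_{\min}}{\log \eta_{\max} - \log \eta_{\min}}$$

For a wide interval this mean is dominated by $\eta_{\max}$, so resampling at every step would behave roughly like plain SGD with a learning rate close to the upper end of the range.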
Alrao for the classifier layer: Model averaging from classifiers with different learning rates.
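In the classifier layer, we build multiple clones of the original classifier layer, set a different learning rate for each, and then use a model averaging method among them. The averaged classifier and the overall Alrao model are:
$$C^{\text{Alrao}}_{\theta^c}(z) := \sum_{j=1}^{N_{cl}} a_j \, C_{\theta^c_j}(z), \qquad \Phi^{\text{Alrao}}_\theta(x) := C^{\text{Alrao}}_{\theta^c}(\phi_{\theta^{pc}}(x)) \qquad (3)$$
where the $C_{\theta^c_j}$ are copies of the original classifier layer, with non-tied parameters, and $\theta^c := (\theta^c_1, \ldots, \theta^c_{N_{cl}})$. The $a_j$ are the parameters of the model averaging, and are such that for all $j$, $0 \le a_j \le 1$, and $\sum_j a_j = 1$. These are not updated by gradient descent, but via a model averaging method from the literature (see below).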
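The averaged classifier can be sketched as a convex combination of the class probabilities produced by each clone. The snippet below is a toy illustration in plain Python, not the released implementation; `averaged_prediction`, the logits, and the weights are made-up names and values.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def averaged_prediction(clone_logits, weights):
    """Mix the class probabilities of several classifier clones.

    clone_logits: one logit vector per clone, all computed from the same
    pre-classifier output. weights: nonnegative averaging weights summing to 1.
    """
    if min(weights) < 0 or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be nonnegative and sum to 1")
    probs = [softmax(logits) for logits in clone_logits]
    n_classes = len(probs[0])
    return [sum(a * p[k] for a, p in zip(weights, probs)) for k in range(n_classes)]


# Two clones, three classes (illustrative numbers only).
mixed = averaged_prediction([[2.0, 0.0, -1.0], [0.0, 1.0, 0.0]], [0.7, 0.3])
```

Since each clone outputs a probability distribution and the weights sum to one, the mixture is again a probability distribution over the classes.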
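For each classifier $C_{\theta^c_j}$, we set a learning rate $\log \eta_j = \log \eta_{\min} + \frac{j-1}{N_{cl}-1} \log(\eta_{\max}/\eta_{\min})$, so that the classifiers’ learning rates are log-uniformly spread on the interval $(\eta_{\min}, \eta_{\max})$.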
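A deterministic log-uniform spread amounts to a geometric progression between the two bounds. The sketch below shows one way to compute it; the function name, the requirement of at least two classifiers, and the example bounds are assumptions made for illustration.

```python
import math


def classifier_learning_rates(eta_min, eta_max, n_classifiers):
    """Learning rates evenly spaced in log space, endpoints included."""
    if n_classifiers < 2:
        # Assumption of this sketch: a spread needs at least two classifiers.
        raise ValueError("need at least two classifiers")
    log_min, log_max = math.log(eta_min), math.log(eta_max)
    step = (log_max - log_min) / (n_classifiers - 1)
    return [math.exp(log_min + j * step) for j in range(n_classifiers)]


# Five classifiers between two illustrative bounds: consecutive rates
# differ by a constant factor (here a factor of about 10).
rates = classifier_learning_rates(1e-3, 1e1, 5)
```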
Thus, the original model $\Phi_\theta(x)$ leads to the Alrao model $\Phi^{\text{Alrao}}_\theta(x)$. Only the classifier layer is modified, the pre-classifier architecture being unchanged.
The Alrao update.
Alg. 1 presents the full Alrao algorithm for use with SGD (other optimizers like Adam are treated similarly). The updates for the pre-classifier, classifier, and model averaging weights are as follows.
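The update rule for the pre-classifier is the usual SGD one, with per-feature learning rates. For each feature $i$ in each layer $l$, its incoming parameters are updated as:
$$\theta_{l,i} \leftarrow \theta_{l,i} - \eta_{l,i} \cdot \nabla_{\theta_{l,i}} \ell(\Phi^{\text{Alrao}}_\theta(x), y) \qquad (4)$$
The parameters $\theta^c_j$ of each classifier clone $j$ on the classifier layer are updated as if this classifier alone were the only output of the model:
$$\theta^c_j \leftarrow \theta^c_j - \eta_j \cdot \nabla_{\theta^c_j} \ell(C_{\theta^c_j}(\phi_{\theta^{pc}}(x)), y) \qquad (5)$$
(still sharing the same pre-classifier $\phi_{\theta^{pc}}$). This ensures classifiers with low weights $a_j$ still learn, and is consistent with model averaging philosophy. Algorithmically this requires differentiating the loss $N_{cl}$ times with respect to the last layer (but no additional backpropagations through the pre-classifier).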
To set the weights $a_j$, several model averaging techniques are available, such as Bayesian Model Averaging [Wasserman2000]. We decided to use the Switch model averaging [VanErven2011], a Bayesian method which is simple, principled, and very responsive to changes in performance of the various models. After each sample or mini-batch, the switch computes a modified posterior distribution over the classifiers. This computation is directly taken from [VanErven2011] and explained in Appendix A. The observed evolution of this posterior during training is discussed in Appendix B.
We release along with this paper a PyTorch [paszke2017automatic] implementation of this method. It can be used on an existing model with little modification. A short tutorial is given in Appendix H. The features (sets of weights which will share the same learning rate) need to be defined for each layer type: for now we have done this for linear, convolutional, and LSTM layers.
We tested Alrao on various convolutional networks for image classification (CIFAR10), and on LSTMs for text prediction. The baselines are SGD with an optimal learning rate, and Adam with its default setting, arguably the current default method [wilson2017marginal].
Image classification on CIFAR10.
For image classification, we used the CIFAR10 dataset [Krizhevsky2009]. It is made of 50,000 training and 10,000 test samples; we split the training set into a smaller training set with 40,000 samples, and a validation set with 10,000 samples. For each architecture, training on the smaller training set was stopped when the validation loss had not improved for 20 epochs. The epoch with the best validation loss was selected and the corresponding model tested on the test set. The inputs are normalized. Training used data augmentation (random cropping and random horizontal flipping). The batch size is always 32. Each setting was run 10 times: the reported confidence intervals are the standard deviation over these runs.
We tested Alrao on three architectures known to perform well on this task: GoogLeNet [szegedy2015going], VGG19 [Simonyan14c] and MobileNet [howard2017mobilenets]. The exact implementation for each can be found in our code.
The Alrao learning rates were sampled log-uniformly from to . For the output layer we used 10 classifiers with switch model averaging (Appendix A); the learning rates of the output classifiers are deterministic and log-uniformly spread in .
In addition, each model was trained with SGD for every learning rate in the set . The best SGD learning rate is selected on the validation set, then reported in Table 1. We also compare to Adam with its default hyperparameters ().
Table 1: For each model, the columns report SGD with optimal LR (LR, Loss, Acc (%)), Adam - Default (Loss, Acc (%)), and Alrao-SGD (Loss, Acc (%)).
Recurrent learning on Penn Treebank.
To test Alrao on a different kind of architecture, we used a recurrent neural network for text prediction on the Penn Treebank [Marcus1993] dataset. The experimental procedure is the same, with and output classifiers for Alrao. The results appear in Table 1, where the loss is given in bits per character and the accuracy is the proportion of correct character predictions.
The model was trained for character prediction rather than word prediction. This is technically easier for Alrao implementation: since Alrao uses copies of the output layer, memory issues arise for models with most parameters on the output layer. Word prediction (10,000 classes on PTB) requires more output parameters than character prediction; see Section 4 and Appendix D.
The model is a two-layer LSTM [hochreiter1997long] with an embedding size of 100 and with 100 hidden features. A dropout layer with rate is included before the decoder. The training set is divided into 20 minibatches. Gradients are computed via truncated backprop through time [werbos1990backpropagation] with truncation every 70 characters.
As expected, Alrao performs slightly worse than the best learning rate. Still, even with wide intervals $(\eta_{\min}, \eta_{\max})$, Alrao comes reasonably close to the best learning rate, across all setups; hence Alrao’s possible use as a quick assessment method. Although Adam with its default parameters often almost matches optimal SGD, this is not always the case, for example with the MobileNet model (Fig. 1(b)). This confirms a known risk of overfitting with Adam [wilson2017marginal]. In our setup, Alrao seems to be a more stable default method.
Our results, with either SGD, Adam, or SGD-Alrao, are somewhat below the state of the art: in part this is because we train on only 40,000 CIFAR samples, and do not use stepsize schedules.
4 Limitations, further remarks, and future directions
Increased number of parameters for the classification layer.
Alrao modifies the output layer of the optimized model. The number of parameters for the classification layer is multiplied by the number of classifier copies used (the number of parameters in the pre-classifier is unchanged). On CIFAR10 (10 classes), the number of parameters increased by less than 5% for the models used. On Penn Treebank, the number of parameters increased by in our setup (working at the character level); working at word level it would have increased threefold (Appendix D).
This is clearly a limitation for models with most parameters in the classifier layer. For output-layer-heavy models, this can be mitigated by handling the copies of the classifiers on distinct computing units: in Alrao these copies work in parallel given the pre-classifier.
Still, models dealing with a very large number of output classes usually rely on other parameterizations than a direct softmax, such as a hierarchical softmax (see references in [jozefowicz2016exploring]); Alrao could be used in conjunction with such methods.
Adding two hyperparameters.
We claim to remove a hyperparameter, the learning rate, but replace it with two hyperparameters $\eta_{\min}$ and $\eta_{\max}$.
Formally, this is true. But a systematic study of the impact of these two hyperparameters (Fig. 3) shows that the sensitivity to $\eta_{\min}$ and $\eta_{\max}$ is much lower than the original sensitivity to the learning rate. In our experiments, convergence happens as soon as the interval $(\eta_{\min}, \eta_{\max})$ contains a reasonable learning rate (Fig. 3).
A wide range of values of will contain one good learning rate and achieve close-to-optimal performance (Fig. 3). Typically, we recommend to just use an interval containing all the learning rates that would have been tested in a grid search, e.g., to .
So, even if the choice of and is important, the results are much more stable to varying these two hyperparameters than to the learning rate. For instance, standard SGD fails due to numerical issues for while Alrao with works with any (Fig. 3), and is thus stable to relatively large learning rates. We would still expect numerical issues with very large , but this has not been observed in our experiments.
Alrao with Adam.
Alrao is much less reliable with Adam than with SGD. Surprisingly, this occurs mostly for test performance, which can even diverge, while training curves mostly look good (Appendix C). We have no definitive explanation for this at present. It might be that changing the learning rate in Adam also requires changing the momentum parameters in a correlated way. It might be that Alrao does not work on Adam because Adam is more sensitive to its hyperparameters. The stark train/test discrepancy might also suggest that Alrao-Adam performs well as a pure optimization method but exacerbates the underlying risk of overfit of Adam [wilson2017marginal, keskar2017improving].
Increasing network size.
With Alrao, neurons with unsuitable learning rates will not learn: those with a too large learning rate might learn nothing, while those with too small learning rates will learn too slowly to be used. Thus, Alrao may reduce the effective size of the network to only a fraction of the actual architecture size, depending on $(\eta_{\min}, \eta_{\max})$.
Our first intuition was that increasing the width of the network was going to be necessary with Alrao, to avoid wasting too many units. In a fully connected network, the number of weights is quadratic in the width, so increasing width (beyond a factor three in our experiments) can be bothersome. Comparisons of Alrao with increased width are reported in Appendix G. Width is indeed a limiting factor for the models considered, even without Alrao (Appendix G). But to our surprise, Alrao worked well even without width augmentation.
Other optimization algorithms, other hyperparameters, learning rate schedulers…
Using a learning rate schedule instead of a fixed learning rate is often effective [bengio2012practical]. We did not use learning rate schedulers here; this may partially explain why the results in Table 1 are worse than the state-of-the-art. Nothing prevents using such a scheduler within Alrao, e.g., by dividing all Alrao learning rates by a time-dependent constant; we did not experiment with this yet.
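The suggestion of dividing all Alrao learning rates by a time-dependent constant could look like the following sketch. The step schedule, its `factor` and `every` parameters, and the function name are hypothetical choices made for illustration; the paper did not experiment with schedulers.

```python
def scheduled_rates(base_rates, epoch, factor=10.0, every=30):
    """Divide every sampled learning rate by factor ** (epoch // every).

    base_rates are the rates sampled once at the start of training; the
    common divisor keeps the ratios between features unchanged.
    """
    divisor = factor ** (epoch // every)
    return [rate / divisor for rate in base_rates]


base = [1e-3, 1e-1, 1.0]  # illustrative per-feature rates
later = scheduled_rates(base, epoch=60)  # all rates divided by 100
```

Because every feature is scaled by the same constant, the spread of the rates over several orders of magnitude is preserved while the whole range shifts downward.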
The idea behind Alrao could be used on other hyperparameters as well, such as momentum. However, if more hyperparameters are initialized randomly for each feature, the fraction of features having all their hyperparameters in the “Goldilocks zone” will quickly decrease.
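A rough way to quantify this decrease, under the assumption (not stated in the text) that the $k$ hyperparameters of a feature are sampled independently and that hyperparameter $h$ falls in its suitable range with probability $p_h$:

$$\Pr(\text{all } k \text{ hyperparameters suitable}) = \prod_{h=1}^{k} p_h$$

As a purely illustrative example, with $p_h = 0.3$ for each of $k = 3$ hyperparameters, only $0.3^3 = 0.027$ of the features would have all of them in a suitable range, compared with $0.3$ when the learning rate alone is randomized.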
Applying stochastic gradient descent with random learning rates for different features is surprisingly resilient in our experiments, and provides performance close enough to SGD with an optimal learning rate, as soon as the range of random learning rates contains a suitable one. This could save time when testing deep learning models, opening the door to more out-of-the-box uses of deep learning.
We would like to thank Corentin Tallec for his technical help, and his many remarks and advice. We thank Olivier Teytaud for pointing out useful references.
Appendix A Model Averaging with the Switch
As explained in Section 2, we use a model averaging method on the classifiers of the output layer. We could have used the Bayesian Model Averaging method [Wasserman2000]. But one of its main weaknesses is the catch-up phenomenon [VanErven2011]: plain Bayesian posteriors are slow to react when the relative performance of models changes over time. Typically, for instance, some larger-dimensional models need more training data to reach good performance: at the time they become better than lower-dimensional models for predicting current data, their Bayesian posterior is so bad that they are not used right away (their posterior needs to “catch up” on their bad initial performance). This leads to very conservative model averaging methods.
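The catch-up phenomenon can be illustrated with plain Bayesian model averaging, where each model's weight is multiplied by the probability it assigned to the observed label and the weights are then renormalized. The sketch below implements only this plain Bayesian update, not the switch method; the two models and their per-step probabilities are made-up numbers.

```python
import math


def bma_update(log_weights, probs_of_true_label):
    """One online Bayesian update: posterior proportional to prior * likelihood.

    probs_of_true_label[j] is the probability model j gave to the observed
    label; it must be strictly positive.
    """
    new = [lw + math.log(p) for lw, p in zip(log_weights, probs_of_true_label)]
    m = max(new)
    log_norm = m + math.log(sum(math.exp(v - m) for v in new))
    return [v - log_norm for v in new]


# Model 0 is decent throughout; model 1 is poor for 100 steps, then better.
log_w = [math.log(0.5), math.log(0.5)]
for t in range(200):
    probs = [0.6, 0.2] if t < 100 else [0.6, 0.9]
    log_w = bma_update(log_w, probs)
weights = [math.exp(v) for v in log_w]
```

In this toy run, model 1 has been the better predictor for the last 100 steps, yet its posterior weight is still negligible because it has not made up for its early losses. This is the behaviour the switch method is designed to avoid.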
The solution from [VanErven2011] against the catch-up phenomenon is to switch between models. It is based on previous methods for prediction with expert advice (see for instance [herbster1998tracking, volf1998switching] and the references in [koolen2008combining, VanErven2011]), and is well rooted in information theory. The switch method maintains a Bayesian posterior distribution, not over the set of models, but over the set of switching strategies between models. Intuitively, the model selected can be adapted online to the number of samples seen.
We now give a quick overview of the switch method from [VanErven2011]: this is how the model averaging weights are chosen in Alrao.
Assume that we have a set of prediction strategies . We define the set of switch sequences, . Let be a switch sequence. The associated prediction strategy uses model on the time interval , namely
where is such that for . We fix a prior distribution over switching sequences. In this work, the prior is, for a switch sequence :
with a geometric distribution over the switch sequences lengths, the uniform distribution over the models (here the classifiers) and .
This defines a Bayesian mixture distribution:
Then, the model averaging weight for the classifier after seeing samples is the posterior of the switch distribution: .
These weights can be computed online exactly in a quick and simple way [VanErven2011], thanks to dynamic programming methods from hidden Markov models.
The implementation of the switch used in Alrao exactly follows the pseudo-code from [NIPS2007_3277], with hyperparameter (allowing for many switches a priori). It can be found in the accompanying online code.
Appendix B Evolution of the Posterior
The evolution of the model averaging weights can be observed during training. In Figure 4, we can see their evolution during the training of the GoogLeNet model with Alrao, 10 classifiers, with and .
We can make several observations. First, after only a few gradient descent steps, the model averaging weights corresponding to the three classifiers with the largest learning rates go to zero. This means that their parameters are moving too fast, and their loss is getting very large.
Next, for a short time, a classifier with a moderately large learning rate gets the largest posterior weight, presumably because it is the first to learn a useful model.
Finally, after the model has seen approximately 4,000 samples, a classifier with a slightly smaller learning rate is assigned a posterior weight close to 1, while all the others go to 0. This means that after a number of gradient steps, the model averaging method acts like a model selection method.
Appendix C Alrao-Adam
As noted in Section 4, Alrao is much less reliable with Adam than with SGD. This is especially true for the test performance, which can even diverge while training performance remains either good or acceptable (Fig. 5). Thus Alrao-Adam seems to send the model into atypical regions of the search space.
Appendix D Number of Parameters
As explained in Section 4, Alrao increases the number of parameters of a model, due to output layer copies. The additional number of parameters is approximately equal to $(N_{cl} - 1) \times K \times d$ where $N_{cl}$ is the number of classifier copies used in Alrao, $d$ is the dimension of the output of the pre-classifier, and $K$ is the number of classes in the classification task (assuming a standard softmax output; classification with many classes often uses other kinds of output parameterization instead).
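The parameter overhead can be estimated with a short calculation, assuming each classifier copy is a dense softmax layer with one weight per (pre-classifier output, class) pair and optionally one bias per class. The function name and the layer sizes below are hypothetical; they are not the sizes of the models in Table 2.

```python
def extra_classifier_parameters(n_copies, pre_classifier_dim, n_classes, with_bias=True):
    """Parameters added by the extra classifier copies (the original is kept)."""
    per_copy = pre_classifier_dim * n_classes
    if with_bias:
        per_copy += n_classes
    return (n_copies - 1) * per_copy


# Hypothetical image model: 10 copies, 512-dim features, 10 classes.
small = extra_classifier_parameters(10, 512, 10, with_bias=False)  # 9 * 512 * 10 = 46080

# Hypothetical word-level model: 6 copies, 100-dim features, 10,000 classes.
large = extra_classifier_parameters(6, 100, 10000, with_bias=False)  # 5 * 100 * 10000 = 5000000
```

The overhead scales with the number of classes, which is why it stays small for a 10-class problem and becomes significant for a word-level vocabulary.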
Table 2: Number of parameters for each model, without Alrao and with Alrao.
The number of parameters for the models used, with and without Alrao, are in Table 2. We used 10 classifiers in Alrao for convolutional neural networks, and 6 classifiers for LSTMs. Using Alrao for classification tasks with many classes, such as word prediction (10,000 classes on PTB), increases the number of parameters noticeably.
For models with a significant parameter increase, the various classifier copies may be handled on parallel GPUs.
Appendix E Other Ways of Sampling the Learning Rates
In Alrao we sample a learning rate for each feature. Intuitively, each feature (or neuron) is a computation unit of its own, using a number of inputs from the previous layer. If we assume that there is a “right” learning rate for learning new features based on information from the previous layer, then we should try a learning rate per feature; some features will be useless, while others will be used further down in the network.
An obvious variant is to set a random learning rate per weight, instead of for all incoming weights of a given feature. However, this runs the risk that every feature in a layer will have a few incoming weights with a large rate, so intuitively every feature is at risk of diverging. This is why we favored per-feature rates.
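This risk can be made concrete with a back-of-the-envelope calculation. Suppose, as a simplifying assumption, that each weight independently receives a "too large" learning rate with probability $q$, and that a feature has $n$ incoming weights. Then with per-weight sampling:

$$\Pr(\text{at least one incoming weight has a too large rate}) = 1 - (1 - q)^n$$

For instance, with the illustrative values $q = 0.1$ and $n = 100$, this is $1 - 0.9^{100} \approx 0.99997$, so essentially every feature is affected. With per-feature sampling, only a fraction $q = 0.1$ of the features receive a too large rate.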
Still, we tested sampling a learning rate for each weight in the pre-classifier (while keeping the same Alrao method for the classifier layer).
In our experiments on LSTMs, per-weight learning rates sometimes perform well but are less stable and more sensitive to the interval $(\eta_{\min}, \eta_{\max})$: for some intervals with very large $\eta_{\max}$, results with per-weight learning rates are a lot worse than with per-feature learning rates. This is consistent with the intuition above.
Appendix F Learning a Fraction of the Features
As explained in the introduction, several works support the idea that not all units are useful when learning a deep learning model. Additional results supporting this hypothesis are presented in Figure 7. We trained a GoogLeNet architecture on CIFAR10 with standard SGD with learning rate $\eta_0$, but learned only a random fraction $p$ of the features (chosen at startup), and kept the others at their initial value. This is equivalent to sampling each learning rate $\eta$ from the probability distribution $P(\eta = \eta_0) = p$ and $P(\eta = 0) = 1 - p$.
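The two-point learning rate distribution used in this experiment can be sketched as follows: each feature keeps the base learning rate with some probability and is frozen otherwise. The function name, the base rate, and the fraction in the example are illustrative assumptions.

```python
import random


def sample_partial_learning_rates(n_features, base_lr, fraction, rng):
    """Give each feature base_lr with probability `fraction`, else 0 (frozen)."""
    return [base_lr if rng.random() < fraction else 0.0 for _ in range(n_features)]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
# Hypothetical layer with 1,000 features, about 30% of which are trained.
lrs = sample_partial_learning_rates(1000, 0.01, 0.3, rng)
trained_fraction = sum(1 for lr in lrs if lr > 0) / len(lrs)
```

A feature with learning rate 0 never moves from its initial value, which matches the description of keeping the untrained features at initialization.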
We observe that even with a fraction of the weights not being learned, the model’s performance is close to its performance when fully trained.
When training a model with Alrao, many features might not learn at all, due to too small learning rates. But Alrao is still able to reach good results. This could be explained by the resilience of neural networks to partial training.
Appendix G Increasing Network Size
As explained in Section 4, learning with Alrao reduces the effective size of the network to only a fraction of the actual architecture size, depending on $(\eta_{\min}, \eta_{\max})$. We first thought that increasing the width of each layer was going to be necessary in order to use Alrao. However, our experiments show that this is not necessary.
Alrao and SGD experiments with increased width are reported in Figure 8. As expected, Alrao with increased width has better performance, since the effective size increases. However, increasing the width also improves performance of standard SGD, by roughly the same amount.
Thus, width is still a limiting factor both for GoogLeNet and MobileNet. This shows that Alrao can perform well even when network size is a limiting factor; this runs contrary to our initial intuition that Alrao would require very large networks in order to have enough features with suitable learning rates.
Appendix H Tutorial
In this section, we briefly show how Alrao can be used in practice on an already implemented model in PyTorch. The code will be available soon.
The first step is to build the preclassifier. Here, we use the VGG19 architecture. The model is built without a classifier. Nothing else is required for Alrao at this step.
Then, we can build the Alrao model with this preclassifier, sample the learning rates for the model, and define the Alrao optimizer.
Finally, we can train the model. The only differences from the usual training procedure are that each classifier needs to be updated as if it were alone, and that we need to update the model averaging weights, here the switch weights.