# Online Learning Rate Adaptation with Hypergradient Descent

## Abstract

We introduce a general method for improving the convergence rate of gradient-based optimizers that is easy to implement and works well in practice. We analyze the effectiveness of the method by applying it to stochastic gradient descent, stochastic gradient descent with Nesterov momentum, and Adam, showing that it improves upon these commonly used algorithms on a range of optimization problems; in particular the kinds of objective functions that arise frequently in deep neural network training. Our method works by dynamically updating the learning rate during optimization using the gradient with respect to the learning rate of the update rule itself. Computing this “hypergradient” needs little additional computation, requires only one extra copy of the original gradient to be stored in memory, and relies upon nothing more than what is provided by reverse-mode automatic differentiation.

## 1 Introduction

In nearly all gradient descent algorithms the choice of learning rate remains central to efficiency; [4] asserts that it is “often the single most important hyper-parameter” and that it always should be tuned. This is because choosing to follow your gradient signal by something other than the right amount, either too much or too little, can be very costly in terms of how fast the overall descent procedure achieves a particular level of objective value.

Understanding that adapting the learning rate is a good thing to do, particularly on a per parameter basis dynamically, led to the development of a family of widely-used optimizers including AdaGrad [9], RMSProp [28], and Adam [14]. However, a persisting commonality of these methods is that they are parameterized by a “pesky” fixed global learning rate hyperparameter which still needs tuning. There have been methods proposed that do away with needing to tune such hyperparameters altogether [21] but their adoption has not been widespread, owing perhaps to their complexity, applicability in practice, or performance relative to the aforementioned family of algorithms.

Our initial conceptualization of the learning rate adaptation problem was one of automatic differentiation [2]. We hypothesized that the derivative of a parameter update procedure with respect to its global learning rate ought to be useful for improving optimizer performance. This conceptualization is not unique, having been explored, for instance, by [16]. While the automatic differentiation perspective was integral to our conceptualization, the resulting algorithm turns out to simplify elegantly and not require additional automatic differentiation machinery. In fact, it is easily adaptable to nearly any gradient update procedure while only requiring one extra copy of a gradient to be held in memory and very little computational overhead; just a dot product in the dimension of the parameter. Considering the general applicability of this method and adopting the name “hypergradient” introduced by [16] to mean a derivative taken with respect to a hyperparameter, we call our method hypergradient descent.

In addition to outperforming the regular variants of algorithms in their class, hypergradient algorithms significantly reduce the need for the expensive and time-consuming practice of hyperparameter search [10], which in practice is performed with grid search, random search [5], Bayesian optimization [26], and model-based approaches [6].

## 2 Hypergradient Descent

We define the hypergradient descent (HD) method by applying gradient descent on the learning rate of an underlying gradient descent algorithm, independently discovering a technique that has been previously considered in the optimization literature, most notably by [1]. This differs from the reversible learning approach of [16] in that we apply gradient-based updates to a hyperparameter (in particular, the learning rate) at each iteration in an online fashion, instead of propagating derivatives through an entire inner optimization that consists of many iterations.

The method is based solely on the partial derivative of an objective function, following an update step, with respect to the learning rate. It follows directly from the chain rule of differential calculus and involves no other arbitrary terms originating from empirical insights or running estimates. In this paper we consider and report the case where the learning rate $\alpha$ is a scalar. It is straightforward to generalize the introduced method to the case where $\alpha$ is a vector of per-parameter learning rates.

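The most basic form of HD can be derived from regular gradient descent as follows. Regular gradient descent, given an objective function $f$ and previous parameters $\theta_{t-1}$, evaluates the gradient $\nabla f(\theta_{t-1})$ and moves against it to arrive at updated parameters

$$\theta_t = \theta_{t-1} - \alpha \nabla f(\theta_{t-1}), \tag{1}$$

where $\alpha$ is the learning rate. In addition to this update rule, we would like to derive an update rule for the learning rate $\alpha$ itself. For this, we will compute $\partial f(\theta_{t-1}) / \partial \alpha$, the partial derivative of the objective $f$ with respect to the learning rate $\alpha$. Noting that $\theta_{t-1} = \theta_{t-2} - \alpha \nabla f(\theta_{t-2})$, i.e., the result of the previous update step, and applying the chain rule, we get

$$\frac{\partial f(\theta_{t-1})}{\partial \alpha} = \nabla f(\theta_{t-1}) \cdot \frac{\partial \left(\theta_{t-2} - \alpha \nabla f(\theta_{t-2})\right)}{\partial \alpha} = \nabla f(\theta_{t-1}) \cdot \left(-\nabla f(\theta_{t-2})\right), \tag{2}$$

which allows us to compute the needed hypergradient with a simple dot product and the memory cost of only one extra copy of the original gradient. Using this hypergradient, we construct a higher level update rule for the learning rate as

$$\alpha_t = \alpha_{t-1} - \beta \frac{\partial f(\theta_{t-1})}{\partial \alpha} = \alpha_{t-1} + \beta \, \nabla f(\theta_{t-1}) \cdot \nabla f(\theta_{t-2}), \tag{3}$$

introducing $\beta$ as the hypergradient learning rate. We then modify Equation 1 to use the sequence $\alpha_t$ to become

$$\theta_t = \theta_{t-1} - \alpha_t \nabla f(\theta_{t-1}). \tag{4}$$

Equations 3 and 4 thus define the most basic form of the HD algorithm, updating both $\theta_t$ and $\alpha_t$ at each iteration. Note that this derivation is applicable to any gradient-based optimization algorithm with an update rule expressed in the form of Equation 1, where the multiplier of $\alpha$ in the original update rule will appear on the right-hand side of the dot product in the formula for $\partial f / \partial \alpha$ in Equation 2.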
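As a minimal sketch of Equations 3 and 4 (not the authors' Torch implementation), the loop below applies the hypergradient update to plain gradient descent on a toy quadratic. The objective, the `CURVATURE` constants, the function names, and the values of `alpha0` and `beta` are assumptions chosen only for illustration.

```python
# Toy objective (an assumption for illustration): f(theta) = 0.5 * sum(c_i * theta_i**2)
CURVATURE = [1.0, 10.0]


def f(theta):
    return 0.5 * sum(c * x * x for c, x in zip(CURVATURE, theta))


def grad_f(theta):
    return [c * x for c, x in zip(CURVATURE, theta)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def gd_hd(theta0, alpha0, beta, steps):
    """Gradient descent with a hypergradient-adapted scalar learning rate."""
    theta = list(theta0)
    alpha = alpha0
    # No previous gradient exists yet, so the first learning rate update is a no-op.
    prev_grad = [0.0] * len(theta)
    alphas = []
    for _ in range(steps):
        g = grad_f(theta)
        # Equation 2: hypergradient = grad f(theta_{t-1}) . (-grad f(theta_{t-2}))
        h = -dot(g, prev_grad)
        # Equation 3: gradient step on the learning rate itself
        alpha = alpha - beta * h
        # Equation 4: parameter step using the freshly updated learning rate
        theta = [x - alpha * gi for x, gi in zip(theta, g)]
        prev_grad = g  # the single extra gradient copy kept in memory
        alphas.append(alpha)
    return theta, alphas


theta_hd, alphas = gd_hd([1.0, 1.0], alpha0=0.01, beta=1e-4, steps=200)
theta_gd, _ = gd_hd([1.0, 1.0], alpha0=0.01, beta=0.0, steps=200)  # beta = 0 is plain GD
print("final alpha:", alphas[-1])
print("loss with HD:", f(theta_hd), "loss without:", f(theta_gd))
```

On this toy problem consecutive gradients stay aligned, so the dot product is positive and the learning rate grows from its initial value. Whether the same happens on another objective depends on its geometry.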

Applying these derivation steps to stochastic gradient descent (SGD) (Algorithm ?), we arrive at the hypergradient variant of SGD that we abbreviate as SGD-HD (Algorithm ?). As all gradient-based algorithms that we consider have a common core where one iterates through a loop of gradient evaluations and parameter updates, for the sake of brevity, we define all algorithmic variants with reference to Algorithm ?, where one substitutes the initialization statement (red) and the update rule (blue) with their counterparts in the variant algorithms. In this way, from SGD with Nesterov momentum (SGDN) (Algorithm ?) and Adam (Algorithm ?), we formulate the hypergradient variants SGDN-HD (Algorithm ?) and Adam-HD (Algorithm ?).
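The same recipe can be sketched for Adam. The code below is an illustrative reading of the rule that the multiplier of the learning rate appears in the dot product, not a transcription of the paper's algorithm listing. The toy objective, the function and variable names, the value of `hyper_lr`, and the use of the commonly quoted Adam defaults for `beta1`, `beta2`, and `eps` are all assumptions. The hypergradient learning rate is called `hyper_lr` here to avoid a clash with Adam's own decay coefficients.

```python
import math


def adam_hd(grad_fn, theta0, alpha0=0.001, hyper_lr=1e-5, steps=300,
            beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam with a hypergradient-adapted global learning rate (illustrative sketch)."""
    theta = list(theta0)
    n = len(theta)
    m = [0.0] * n          # first-moment estimate
    v = [0.0] * n          # second-moment estimate
    prev_dir = [0.0] * n   # multiplier of alpha in the previous update; zero before step 1
    alpha = alpha0
    alphas = []
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        # Hypergradient: current gradient dotted with the previous update direction.
        h = sum(gi * di for gi, di in zip(g, prev_dir))
        alpha -= hyper_lr * h
        m = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(m, g)]
        v = [beta2 * vi + (1 - beta2) * gi * gi for vi, gi in zip(v, g)]
        m_hat = [mi / (1 - beta1 ** t) for mi in m]
        v_hat = [vi / (1 - beta2 ** t) for vi in v]
        # Adam moves theta by alpha * direction, so `direction` is the multiplier of alpha.
        direction = [-mh / (math.sqrt(vh) + eps) for mh, vh in zip(m_hat, v_hat)]
        theta = [x + alpha * d for x, d in zip(theta, direction)]
        prev_dir = direction
        alphas.append(alpha)
    return theta, alphas


def toy_loss(theta):
    return 0.5 * (theta[0] ** 2 + 10.0 * theta[1] ** 2)


def toy_grad(theta):
    return [theta[0], 10.0 * theta[1]]


theta, alphas = adam_hd(toy_grad, [1.0, 1.0])
plain_theta, plain_alphas = adam_hd(toy_grad, [1.0, 1.0], hyper_lr=0.0)
print("Adam-HD loss:", toy_loss(theta), "Adam loss:", toy_loss(plain_theta))
```

Setting `hyper_lr=0.0` leaves the learning rate untouched, so the second call runs unmodified Adam and serves as the baseline.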

In Section 4, we empirically demonstrate the performance of these hypergradient algorithms for the problems of logistic regression and training of multilayer and convolutional neural networks for image classification, also investigating good settings for the hypergradient learning rate $\beta$ and the initial learning rate $\alpha_0$. Section 5 discusses extensions to this technique and examines the convergence of HD for convex objective functions.

### 2.1 Statistical Justification

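Here we provide a more detailed account of the intuition behind this technique. Assume that we want to optimize a function $f$ and we want to take a local approach by updating, in sequence, our current guess of where the minimum of $f$ is. Let $\Theta_t = \{\theta_i\}_{i=0}^{t}$ be the sequence of all parameter values visited during the course of optimization and let $\alpha$ be the learning rate. Let $g(\Theta_t, \alpha)$ be the function that defines the next step $\theta_{t+1}$. Here we will always take $g(\Theta_t, \alpha) = \theta_t + \alpha \, u(\Theta_t)$, where $u$ is an algorithm-specific update rule. For instance, ordinary gradient descent corresponds to an update rule of $u(\Theta_t) = -\nabla f(\theta_t)$.

In each step, our goal is to update the value of $\alpha$ towards its optimum value that would minimize the expected value of the objective in the next iteration of learning, that is, we want to minimize $\mathbb{E}[f(g(\Theta_t, \alpha))]$. As we shall see, we can compute $\alpha_{t+1}$ using a gradient update so that it is closer than $\alpha_t$ to the optimum value of $\alpha$ in step $t$. In order to do that, we compute the derivative of the former expression obtaining $\partial \, \mathbb{E}[f(g(\Theta_t, \alpha))] / \partial \alpha$, which, evaluated at $\Theta_t$ and $\alpha_t$, equals

$$\frac{\partial \, \mathbb{E}[f(g(\Theta_t, \alpha))]}{\partial \alpha} = \mathbb{E}\left[\nabla f(\theta_{t+1})^{\top} u(\Theta_t)\right]. \tag{5}$$

If $\alpha_t$ were the optimum value, the product on the right-hand side would be zero. If we knew the value $\nabla f(\theta_{t+1})$, the gradient of $f$ yet to be evaluated, we could use Equation 5 to perform gradient updates on the learning rate to bring it closer to its optimum by using $\alpha_{t+1} = \alpha_t - \beta \, \nabla f(\theta_{t+1})^{\top} u(\Theta_t)$. Note that this expression has a circular dependency on $\alpha_{t+1}$ due to $\theta_{t+1}$ (Equation 4). However, assuming that the optimum value for $\alpha$ in each step changes slowly, we can use the values of the previous step to approximate $\nabla f(\theta_{t+1})^{\top} u(\Theta_t)$, obtaining the update rule

$$\alpha_{t+1} = \alpha_t - \beta \, \nabla f(\theta_t)^{\top} u(\Theta_{t-1}). \tag{6}$$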

Note that this can be also done with an unbiased estimator for the gradient, so the method can be applied to SGD variants. A major focus of machine learning is the optimization of scalar valued objective functions of the form , where with is our dataset and our goal is usually to find the argument that produces the minimum value . In big data settings where it is infeasible to compute exact gradients of this objective in every iteration, SGD is commonly used for computing the gradient at each iteration using -item subsets, or “minibatches”, of the data, , where indexes all possible subsets. With the minibatch loss and with minibatches drawn i.i.d., the minibatch-based estimator of the complete-data gradient with respect to is unbiased because .

Furthermore, assuming is appropriately large, we can appeal to the central limit theorem and assume that the minibatch gradient estimate is normally distributed about the true complete-data gradient , where for some . Abstracting this slightly, these assumptions are equivalent to assuming that we only have access to a noisy estimate of the gradient. The previous procedure can be applied in this case because

## 3 Related Work

### 3.1 Learning Rate Adaptation

[1] previously considered the adaptation of the learning rate using the derivative of the objective function with respect to the learning rate. [19] proposed methods using gradient-related information of up to two previous steps in adapting the learning rate. In any case, the approach can be interpreted as either applying gradient updates to the learning rate or simply as a heuristic of increasing the learning rate after a “successful” step and decreasing it otherwise.

Similarly, [24] propose a way of controlling the learning rate of a main algorithm by using an averaging algorithm based on the mean of a sequence of adapted learning rates, also investigating rates of convergence. The stochastic meta-descent (SMD) algorithm [23], developed as an extension of the gain adaptation work by [27], operates by multiplicatively adapting local learning rates using a meta-learning rate, employing second-order information from fast Hessian-vector products [17]. Other work that merits mention includes RPROP [20], where local adaptation of weight updates is performed by using only the temporal behavior of the gradient’s sign, and Delta-Bar-Delta [12], where the learning rate is varied based on a sign comparison between the current gradient and an exponential average of the previous gradients.

Recently popular optimization methods with adaptive learning rates include AdaGrad [9], RMSProp [28], vSGD [21], and Adam [14], where different heuristics are used to estimate aspects of the geometry of the traversed objective.

### 3.2 Hyperparameter Optimization Using Derivatives

Previous authors, most notably [3], have noted that the search for good hyperparameter values for gradient descent can be cast as an optimization problem itself, which can potentially be tackled via another level of gradient descent using backpropagation. More recent work includes [8], where an optimization procedure is truncated to a fixed number of iterations to compute the gradient of the loss with respect to hyperparameters, and [16], applying nested reverse automatic differentiation to larger scale problems in a similar setting.

A common point of these works has been their focus on computing the gradient of a validation loss at the end of a regular training session of many iterations with respect to hyperparameters supplied to the training in the beginning. This requires a large number of intermediate variables to be kept in memory for later use in the reverse pass of automatic differentiation. [16] introduce a reversible learning technique to efficiently store the information needed for exactly reversing the learning dynamics during the hyperparameter optimization step. As described in Sections 1 and 2, the main difference between our method and this one is that we compute the hypergradients and apply hyperparameter updates in an online manner at each iteration,¹ overcoming the costly requirement of keeping intermediate values during training and differentiating through whole training sessions per hyperparameter update.

## 4 Experiments

We evaluate the impact of hypergradient updates on the learning rate on several tasks, comparing the behavior of the variant algorithms SGD-HD (Algorithm ?), SGDN-HD (Algorithm ?), and Adam-HD (Algorithm ?) to that of their ancestors SGD (Algorithm ?), SGDN (Algorithm ?), and Adam (Algorithm ?), showing, in all cases, superior performance. The algorithms are implemented in Torch [7] using an API compatible with the popular torch.optim package,² to which we are planning to contribute via a pull request on GitHub.

### 4.1 Logistic Regression

We start by evaluating the performance of the set of algorithms for fitting a logistic regression classifier to the MNIST database,³ assigning membership probabilities for ten classes to input vectors of length 784. We use a learning rate of $\alpha = 0.001$ for all algorithms, where for the HD variants this is taken as the initial $\alpha_0$. We take for SGDN and SGDN-HD. For Adam, we use , , , and apply a decay to the learning rate as used in [14] only for the logistic regression problem. We use the full 60,000 images in MNIST for training and compute the validation loss using the 10,000 test images. L2 regularization is used with a coefficient of . We use a minibatch size of 128 for all the experiments in the paper.

Figure 1 (left column) shows the negative log-likelihood loss for training and validation along with the evolution of the learning rate during training, using for SGD-HD and SGDN-HD, and for Adam-HD. Our main observation in this experiment, and the following experiments, is that the HD variants consistently outperform their regular versions.⁴ While this might not come as a surprise for the case of vanilla SGD, which has no capability for adapting the learning rate or the update speed, the improvement is also observed for SGD with Nesterov momentum (SGDN) and Adam. The improvement upon Adam is particularly striking because this method itself is based on adaptive learning rates.

An important feature to note is the initial smooth increase of the learning rates from $\alpha_0 = 0.001$ to approximately 0.05 for SGD-HD and SGDN-HD. For Adam-HD, the increase is up to 0.001174 (a 17% change), virtually imperceptible in the plot due to scale. For all HD algorithms, this initial increase is followed by a decay to a range around zero. We conjecture that this initial increase and the later decay of $\alpha$, automatically adapting to the geometry of the problem, is behind the performance increase observed.

### 4.2 Multi-Layer Neural Networks

We next evaluate the effectiveness of HD algorithms on training a multi-layer neural network, again on the MNIST database. The network consists of two fully connected hidden layers with 1,000 units each and ReLU activations. We again use a learning rate of $\alpha_0 = 0.001$ for all algorithms. We use for SGD-HD and SGDN-HD, and for Adam-HD. L2 regularization is applied with a coefficient of .

As seen in the results in Figure 1 (middle column), the hypergradient variants again consistently outperform their regular counterparts. In particular, we see that Adam-HD converges to a level of validation loss not achieved by Adam, and shows an order of magnitude improvement over Adam in the training loss.

Of particular note is, again, the initial rise and fall in the learning rates, where we see the learning rate climb to 0.05 for SGD-HD and SGDN-HD, whereas for Adam-HD the overall behavior of the learning rate is that of decay following a minute initial increase to 0.001083 (invisible in the plot due to scale). Compared with logistic regression results, the initial rise of the learning rate for SGDN-HD happens noticeably before SGD-HD, possibly caused by the speedup from the momentum updates.

### 4.3 Convolutional Neural Networks

To investigate whether the performance we have seen in the previous sections scales to deep architectures and large-scale high-dimensional problems, we apply these algorithms to train a VGG Net [25] on the CIFAR-10 image recognition dataset [15]. We base our implementation on the VGG Net architecture for Torch by Sergey Zagoruyko.⁵ The network used has an architecture of (conv-64) maxpool (conv-128) maxpool (conv-256) maxpool (conv-512) maxpool (conv-512) maxpool fc-512 fc-10, corresponding closely to the “D configuration” in [25]. All convolutions have 3×3 filters and a padding of 1; all max pooling layers are 2×2 with a stride of 2. We use and for SGD-HD and SGDN-HD, and for Adam-HD. We use the 50,000 training images in CIFAR-10 for training and the 10,000 test images for evaluating the validation loss.
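The layer arithmetic implied by this description can be checked with a short sketch. It assumes 32×32 CIFAR-10 inputs and reads the filter and pooling sizes as 3×3 and 2×2. The helper names are hypothetical, and the number of convolutions per stage is deliberately left out, since a 3×3 convolution with padding 1 and stride 1 preserves the spatial size however many are stacked.

```python
def conv_out(size, kernel=3, padding=1, stride=1):
    """Spatial size after a convolution layer."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_out(size, kernel=2, stride=2):
    """Spatial size after a max pooling layer."""
    return (size - kernel) // stride + 1


size = 32  # CIFAR-10 images are 32x32 pixels
channels_per_stage = [64, 128, 256, 512, 512]
shapes = []
for channels in channels_per_stage:
    size = conv_out(size)  # 3x3 filter, padding 1: spatial size unchanged
    size = pool_out(size)  # 2x2 pooling, stride 2: spatial size halved
    shapes.append((channels, size, size))

# Number of features entering the first fully connected layer (fc-512).
flattened = shapes[-1][0] * shapes[-1][1] * shapes[-1][2]
print(shapes, flattened)
```

After five pooling stages the feature map is reduced to a single spatial position, so the classifier head receives one value per channel of the last convolutional stage.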

Looking at Figure 1 (right column), once again we see consistent improvements of the hypergradient variants over their regular counterparts. SGD-HD and SGDN-HD perform significantly better than their regular versions in the validation loss, whereas Adam and Adam-HD reach the same validation loss at roughly the same speed. Adam-HD performs significantly better than Adam in the training loss. For SGD-HD and SGDN-HD we see an initial rise of $\alpha$ to approximately 0.025, this rise happening, again, with SGDN-HD before SGD-HD. During this initial rise, the learning rate of Adam-HD rises only up to 0.001002.

### 4.4 Guidelines for Choosing α₀ and β

Throughout the preceding sections, we have made recommendations for the “optimal” values of $\alpha_0$ and $\beta$ to use for the architectures and algorithms considered. These specific recommendations come from searching for the best combination in the space of $\alpha_0$ and $\beta$ in terms of the number of iterations required to achieve a certain loss. In Figure ? we report the results of a grid search for all the algorithms on the logistic regression objective; similar results have been observed for the multi-layer neural network and CNN objectives as well.

Figure ? supports several empirical arguments. For one, independent of these results, and even if one acknowledges that using hypergradients for online learning rate adaptation improves on the baseline algorithm, one might worry that using hypergradients makes the hyperparameter search problem worse. One might imagine that their use would require tuning both the initial learning rate $\alpha_0$ and the hypergradient learning rate $\beta$. In fact, what we have repeatedly observed, and what can be seen in this figure, is that, given a good value of $\beta$, HD is somewhat insensitive to the value of $\alpha_0$. So, in practice, tuning $\beta$ by itself, if hyperparameters are to be tuned at all, is actually sufficient.

Also note that in reasonable ranges for $\alpha_0$ and $\beta$, no matter which values of $\alpha_0$ and $\beta$ you choose, you improve upon the original method. The corollary to this is that if you have tuned to a particular value of $\alpha_0$ and use our method with an arbitrary small $\beta$ (no tuning), you will still improve upon the original method started at the same $\alpha_0$; remembering of course that $\beta = 0$ recovers the original method in all cases.
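A grid search of this kind can be sketched on a toy problem. The example below counts the iterations that gradient descent with hypergradient updates needs to reach a target loss for a few combinations of initial learning rate and hypergradient learning rate. The quadratic objective, the grid values, the target loss, and all names are assumptions for illustration and are unrelated to the paper's reported experiments.

```python
CURVATURE = [1.0, 10.0]  # toy quadratic f(theta) = 0.5 * sum(c_i * theta_i**2), an assumption


def loss(theta):
    return 0.5 * sum(c * x * x for c, x in zip(CURVATURE, theta))


def grad(theta):
    return [c * x for c, x in zip(CURVATURE, theta)]


def steps_to_reach(target, alpha0, beta, max_steps=10000):
    """Iterations of gradient descent with hypergradient updates until loss < target."""
    theta, alpha, prev = [1.0, 1.0], alpha0, [0.0, 0.0]
    for step in range(1, max_steps + 1):
        g = grad(theta)
        # Learning rate update: alpha grows when consecutive gradients align.
        alpha += beta * sum(a * b for a, b in zip(g, prev))
        # Parameter update with the adapted learning rate.
        theta = [x - alpha * gi for x, gi in zip(theta, g)]
        prev = g
        if loss(theta) < target:
            return step
    return max_steps


grid = {(a0, b): steps_to_reach(1e-3, a0, b)
        for a0 in (1e-3, 1e-2)
        for b in (0.0, 1e-5, 1e-4)}
for (a0, b), n in sorted(grid.items()):
    print(f"alpha0={a0:g}  beta={b:g}  steps={n}")
```

The rows with `beta=0` are the unmodified baseline, since a zero hypergradient learning rate never changes the learning rate.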

## 5 Convergence and Extensions

### 5.1 Transitioning to the Underlying Algorithm

We observed in our experiments that $\alpha_t$ follows a consistent trajectory. As shown in Figure 1, it initially grows large, then shrinks, and thereafter fluctuates around a small value that is comparable to the best fixed $\alpha$ we could find for the underlying algorithm without hypergradients. This suggests that hypergradient updates improve performance partially due to their effect on the algorithm’s early behaviour, and motivates our first proposed extension, which involves smoothly transitioning to a fixed learning rate as the algorithm progresses.

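More precisely, in this extension we update $\alpha_t$ exactly as previously via Equation 6, and when we come to the update of $\theta_t$, we use as our learning rate a new value $\gamma_t$ instead of $\alpha_t$ directly, so that our update rule (written here for gradient descent) is $\theta_t = \theta_{t-1} - \gamma_t \nabla f(\theta_{t-1})$ instead of $\theta_t = \theta_{t-1} - \alpha_t \nabla f(\theta_{t-1})$ as previously. Our $\gamma_t$ satisfies $\gamma_t \approx \alpha_t$ when $t$ is small, and $\gamma_t \approx \alpha_\infty$ as $t \to \infty$, where $\alpha_\infty$ is some constant we choose. Specifically, $\gamma_t = \delta(t) \, \alpha_t + (1 - \delta(t)) \, \alpha_\infty$, where $\delta$ is some function such that $\delta(1) = 1$ and $\delta(t) \to 0$ as $t \to \infty$ (e.g., $\delta(t) = 1/t^2$).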

Intuitively, this extension will behave roughly like HD at the beginning of the optimization process, and roughly like the original underlying algorithm by the end. We suggest choosing a value for $\alpha_\infty$ that would produce good performance when used as a fixed learning rate throughout.
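The transition can be sketched as a blend between the adapted learning rate and a fixed limiting value. The mixing function below, which equals 1 at the first iteration and decays as the inverse square of the iteration count, is only one assumed choice with the limiting behaviour described above. The function and argument names are hypothetical.

```python
def delta(t):
    """Assumed mixing weight: 1 at t = 1, tending to 0 as t grows."""
    return 1.0 / (t * t)


def blended_rate(alpha_t, alpha_inf, t):
    """Learning rate used for the parameter update at iteration t (t >= 1).

    alpha_t   -- learning rate produced by the hypergradient update
    alpha_inf -- fixed learning rate the schedule transitions towards
    """
    d = delta(t)
    return d * alpha_t + (1.0 - d) * alpha_inf


# Early iterations follow the adapted rate; late iterations approach the fixed one.
for t in (1, 2, 10, 1000):
    print(t, blended_rate(alpha_t=0.05, alpha_inf=0.001, t=t))
```

In a full optimizer the hypergradient update of the adapted rate would proceed unchanged, and only the parameter update would use the blended value.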

Our preliminary experimental evaluation of this extension shows that it gives good convergence performance for a larger range of $\beta$ than without, and hence can improve the robustness of our approach. It also allows us to prove theoretical convergence under certain assumptions about $f$:

Note that

where the right-hand side is as . Our assumption about the limiting behaviour of then entails and therefore as . For large enough , we thus have , and the algorithm converges by the fact that standard gradient descent converges for such a (potentially non-constant) learning rate under our assumptions about (see, e.g., [13]).

While our method adapts $\alpha_t$ during training, we still make use of a fixed $\beta$, and it is natural to wonder whether one can use hypergradients to adapt this value as well. To do so would involve the addition of an update rule analogous to Equation 3, using a gradient of our objective function computed now with respect to $\beta$. We would require a fixed learning rate for this $\beta$ update, but then may consider doing hypergradient updates for this quantity also, and so on arbitrarily. Since our use of a single hypergradient appears to make a gradient descent algorithm less sensitive to hyperparameter selection, it is possible that the use of higher-order hypergradients in this way would improve robustness even further. We leave this hypothesis to be explored in future work.

## 6 Conclusion

Having rediscovered a general method for adapting global parameters of gradient-based optimization procedures, we have used it to produce hypergradient descent variants of SGD, SGD with Nesterov momentum, and Adam that empirically appear to improve, sometimes significantly, upon their unmodified variants. The method is general, memory and computation efficient, and easy to implement. An important advantage of the presented method is that, with a small $\beta$, it requires less tuning to give performance better than (or in the worst case the same as) the baseline. We believe that the significant performance improvement demonstrated and the ease with which the method can be applied to existing optimizers give this method the potential to become a standard tool and significantly impact the utilization of time and hardware resources in machine learning practice.

Our start towards the establishment of theoretical convergence guarantees in this paper is limited, and as such there remains much to be done, both in terms of working towards a convergence result for the non-transitioning variant of hypergradient descent and a more general result for the mixed variant. Establishing convergence rates would be better still, but remains future work.

### Footnotes

1. Note that we use the training objective, as opposed to the validation objective as in [16], for computing hypergradients. Modifications of HD computing gradients for both training and validation sets at each iteration and using the validation gradient only for updating $\alpha$ are possible, but not presented in this paper.
2. https://github.com/torch/optim
3. http://yann.lecun.com/exdb/mnist/
4. We would like to remark that the results in plots showing loss versus training iteration/epoch remain virtually the same when they are plotted versus wall-clock time.
5. http://torch.ch/blog/2015/07/30/cifar.html

### References

1. Parameter adaptation in stochastic optimization.
L. B. Almeida, T. Langlois, J. D. Amaral, and A. Plakhov. In D. Saad, editor, On-Line Learning in Neural Networks. Cambridge University Press, 1998.
2. Automatic differentiation in machine learning: a survey.
A. G. Baydin, B. A. Pearlmutter, A. A. Radul, and J. M. Siskind. arXiv preprint arXiv:1502.05767
3. Gradient-based optimization of hyperparameters.
Y. Bengio. Neural Computation
4. Practical recommendations for gradient-based training of deep architectures.
Y. Bengio. In Neural Networks: Tricks of the Trade, volume 7700, pages 437–478. Springer, 2012.
5. Random search for hyper-parameter optimization.
J. Bergstra and Y. Bengio. Journal of Machine Learning Research
6. Making a science of model search: Hyperparameter optimization in hundreds of dimensions for vision architectures.
J. Bergstra, D. Yamins, and D. D. Cox. In International Conference on Machine Learning, 2013.
7. Torch7: A MATLAB-like environment for machine learning.
R. Collobert, K. Kavukcuoglu, and C. Farabet. In BigLearn, NIPS Workshop, number EPFL-CONF-192376, 2011.
8. Generic methods for optimization-based modeling.
J. Domke. In Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics, volume 22, pages 318–326, 2012.
9. Adaptive subgradient methods for online learning and stochastic optimization.
J. Duchi, E. Hazan, and Y. Singer. Journal of Machine Learning Research
10. Practical methodology.
I. Goodfellow, Y. Bengio, and A. Courville. In Deep Learning, chapter 11. MIT Press, 2016.
11. An evaluation of sequential model-based optimization for expensive blackbox functions.
F. Hutter, H. Hoos, and K. Leyton-Brown. In Proceedings of the 15th Annual Conference Companion on Genetic and Evolutionary Computation, pages 1209–1216. ACM, 2013.
12. Increased rates of convergence through learning rate adaptation.
R. A. Jacobs. Neural Networks
13. Linear convergence of gradient and proximal-gradient methods under the Polyak-Lojasiewicz condition.
H. Karimi, J. Nutini, and M. Schmidt. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 795–811. Springer, 2016.
14. Adam: A method for stochastic optimization.
D. Kingma and J. Ba. In The International Conference on Learning Representations (ICLR), San Diego, 2015.
15. Learning multiple layers of features from tiny images.
A. Krizhevsky. Master’s thesis, Department of Computer Science, University of Toronto, 2009.
16. Gradient-based hyperparameter optimization through reversible learning.
D. Maclaurin, D. K. Duvenaud, and R. P. Adams. In Proceedings of the 32nd International Conference on Machine Learning, pages 2113–2122, 2015.
17. Fast exact multiplication by the Hessian.
B. A. Pearlmutter. Neural Computation
18. An improved backpropagation method with adaptive learning rate.
V. P. Plagianakos, D. G. Sotiropoulos, and M. N. Vrahatis. Technical Report TR98-02, University of Patras, Department of Mathematics, 1998.
19. Learning rate adaptation in stochastic gradient descent.
V. P. Plagianakos, G. D. Magoulas, and M. N. Vrahatis. In Advances in Convex Analysis and Global Optimization, pages 433–444. Springer, 2001.
20. A direct adaptive method for faster backpropagation learning: The RPROP algorithm.
M. Riedmiller and H. Braun. In IEEE International Conference on Neural Networks, pages 586–591. IEEE, 1993.
21. No more pesky learning rates.
T. Schaul, S. Zhang, and Y. LeCun. Proceedings of the 30th International Conference on Machine Learning
22. Local gain adaptation in stochastic gradient descent.
N. N. Schraudolph. In Proceedings of the 9th International Conference on Neural Networks (ICANN), volume 2, pages 569–574, 1999.
23. Fast online policy gradient learning with SMD gain vector adaptation.
N. N. Schraudolph, J. Yu, and D. Aberdeen. In Advances in Neural Information Processing Systems, page 1185, 2006.
24. Rates of convergence of adaptive step-size of stochastic approximation algorithms.
S. Shao and P. P. C. Yip. Journal of Mathematical Analysis and Applications
25. Very deep convolutional networks for large-scale image recognition.
K. Simonyan and A. Zisserman. arXiv preprint arXiv:1409.1556
26. Practical Bayesian optimization of machine learning algorithms.
J. Snoek, H. Larochelle, and R. P. Adams. In Advances in Neural Information Processing Systems, pages 2951–2959, 2012.
27. Gain adaptation beats least squares?
R. S. Sutton. In Proceedings of the Seventh Yale Workshop on Adaptive and Learning Systems, pages 161–166, 1992.
28. Lecture 6.5 – RMSProp: Divide the gradient by a running average of its recent magnitude.
T. Tieleman and G. Hinton. COURSERA: Neural Networks for Machine Learning