First-Order Preconditioning via Hypergradient Descent

Ted Moskovitz,  Rui Wang,  Janice Lan,  Sanyam Kapoor,
Thomas Miconi,  Jason Yosinski,  Aditya Rawal
Uber AI
{tmoskovitz, ruiwang, janlan, sanyam, tmiconi, yosinski, aditya.rawal}@uber.com
Work done while an intern at Uber. Ted can be reached at ted@gatsby.ucl.ac.uk.
Abstract

Standard gradient descent methods are susceptible to a range of issues that can impede training, such as strong correlations between parameters and differing scales across directions in parameter space. These difficulties can be addressed by second-order approaches that apply a preconditioning matrix to the gradient to improve convergence. Unfortunately, such algorithms typically struggle to scale to high-dimensional problems, in part because computing specific preconditioners such as the inverse Hessian or the inverse Fisher information matrix is highly expensive. We introduce first-order preconditioning (FOP), a fast, scalable approach that generalizes previous work on hypergradient descent (Almeida_1999; maclaurin_2015; hypergrad_2018) to learn a preconditioning matrix using only first-order information. Experiments show that FOP improves the performance of standard deep learning optimizers on several visual classification tasks with minimal computational overhead. We also investigate the properties of the learned preconditioning matrices and perform a preliminary theoretical analysis of the algorithm.

1 Introduction

High-dimensional nonlinear optimization problems often present a number of difficulties, such as strongly-correlated parameters and variable scaling along different directions in parameter space (martens_2016). Despite this, deep neural networks and other large-scale machine learning models applied to such problems are typically trained with simple variants of gradient descent, which are known to be highly sensitive to these difficulties. While this approach often works well in practice, addressing the underlying issues directly could provide stronger theoretical guarantees, accelerate training, and improve generalization.

Adaptive learning rate methods such as Adam (adam_2014) and RMSProp (rmsprop_2012) provide some degree of higher-order approximation by re-scaling updates based on per-parameter behavior. Newton-based methods make use of the curvature of the loss surface to both re-scale and rotate the gradient in order to improve convergence. Natural gradient methods (Amari_nat_grad_og) do the same in order to enforce smoothness in the evolution of the model's conditional distribution. In the latter two cases, the focus is on computing a specific linear transformation of the gradient that improves the conditioning of the problem. This transformation is typically known as a preconditioning, or curvature, matrix. For quasi-Newton methods, the preconditioning matrix takes the form of the inverse Hessian, while for natural gradient methods it is the inverse Fisher information matrix. Computing these transformations is typically intractable for high-dimensional problems, and while a number of approximate methods exist for both (e.g., byrd_nocedal_zhu_1996; KFAC_2015; KFC_2016), they are often still too expensive for the performance gain they provide. These approaches also impose rigid inductive biases about the problems to which they are applied, in that they seek to compute or approximate one specific transformation. In large, non-convex problems, however, the optimal gradient transformation may be less obvious, or may even change over the course of training.

In this paper, we attempt to address these issues through a method we term first-order preconditioning (FOP). FOP doesn’t attempt to compute a specific preconditioner, such as the inverse Hessian, but rather uses first-order hypergradient descent (maclaurin_2015) to learn an adaptable transformation online directly from the task objective function. Our method adds minimal computational and memory cost to standard deep network training settings and results in improved convergence speed and generalization compared to standard approaches. FOP can flexibly be applied to any gradient-based optimization problem, and we show that when used in conjunction with standard optimizers, it improves their performance.

2 First Order Preconditioning

2.1 The Basic Approach

Consider a parameter vector $\theta \in \mathbb{R}^n$ and a loss function $\mathcal{L}(\theta)$. A traditional gradient update with a preconditioning matrix $M$ can be written as

$\theta_{t+1} = \theta_t - \alpha\, M\, \nabla_\theta \mathcal{L}(\theta_t),$    (1)

where $\alpha$ is the learning rate.
Algorithm 1: Learned First-Order Preconditioning (FOP)

Require: model parameters $\theta$, objective function $\mathcal{L}$, FOP matrix $M$, learning rate $\alpha$, hypergradient learning rate $\beta$
for t = 1, 2, … do
    Draw data $(x_t, y_t)$
    Perform forward pass: $\hat{y}_t = f_{\theta_t}(x_t)$
    Compute loss $\mathcal{L}_t = \mathcal{L}(\hat{y}_t, y_t)$
    Update inference parameters $\theta$: $\theta_{t+1} = \theta_t - \alpha\, M_t M_t^\top \nabla_\theta \mathcal{L}_t$
    Update preconditioning matrices: $M_{t+1} = M_t - \beta\, \nabla_M \mathcal{L}_t$, with $\nabla_M \mathcal{L}_t$ computed via Equation 4 using the cached gradient $\nabla_\theta \mathcal{L}_{t-1}$
    Cache $\nabla_\theta \mathcal{L}_t$
end for

Our goal is to learn $M$ online. We place no other constraints on the preconditioner, but in order to ensure that the applied transformation is positive definite, and therefore does not reverse the direction of the gradient, we replace Equation 1 with the following:

$\theta_{t+1} = \theta_t - \alpha\, M_t M_t^\top\, \nabla_\theta \mathcal{L}(\theta_t).$    (2)

Under reasonable assumptions, gradient descent is guaranteed to converge even with a random symmetric, positive-definite preconditioner, as we show in Supplementary Section A.2. Because $\theta_t$ is a function of $M_{t-1}$, we can backpropagate from the loss at iteration $t$ (writing $\mathcal{L}_t$ for $\mathcal{L}(\theta_t)$) to the previous iteration's preconditioner via a simple application of the chain rule:

$\frac{\partial \mathcal{L}_t}{\partial M_{t-1}} = \frac{\partial \mathcal{L}_t}{\partial \theta_t}\, \frac{\partial \theta_t}{\partial M_{t-1}}.$    (3)

The gradient with respect to the preconditioner is then simply

$\nabla_{M_{t-1}} \mathcal{L}_t = -\alpha \left( \nabla_\theta \mathcal{L}_t\, \nabla_\theta \mathcal{L}_{t-1}^\top + \nabla_\theta \mathcal{L}_{t-1}\, \nabla_\theta \mathcal{L}_t^\top \right) M_{t-1}.$    (4)

Note that, ideally, we would compute $\nabla_{M_t} \mathcal{L}_{t+1}$ to update $M_t$, but as we do not yet have access to $\mathcal{L}_{t+1}$, we follow the example of Almeida_1999 and assume that the gradient does not change dramatically across a single iteration, using $\nabla_{M_{t-1}} \mathcal{L}_t$ in its place. The basic approach is summarized in Algorithm 1. We use supervised learning as an example, but the same method applies to any gradient-based optimization. The preconditioned gradient can then be passed to any standard optimizer to produce an update for $\theta$. For example, we describe the procedure for using FOP with momentum (momentum_1964) in Section 2.4. For multi-layer networks, in order to make the simplest modification to normal backpropagation, we learn a separate $M$ for each layer rather than a single global curvature matrix.
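For concreteness, the following NumPy sketch implements Equations 2 and 4 for a single linear layer on synthetic data. The shapes, hyperparameters, loss, and the summing of outer products over the output dimension are illustrative assumptions rather than our exact implementation.

    import numpy as np

    rng = np.random.default_rng(0)
    n_in, n_out, batch = 20, 5, 32
    alpha, beta = 0.05, 1e-4               # learning rate and hypergradient learning rate

    W = rng.normal(0.0, 0.1, (n_in, n_out))      # layer parameters (theta)
    W_true = rng.normal(0.0, 1.0, (n_in, n_out)) # hypothetical target mapping
    M = np.eye(n_in)                              # FOP matrix, initialized at the identity
    prev_grad = None                              # cached gradient from the previous iteration

    def loss_and_grad(W, X, Y):
        """Mean squared error of a linear model and its gradient w.r.t. W."""
        err = X @ W - Y                           # (batch, n_out)
        loss = 0.5 * np.sum(err ** 2) / len(X)
        grad = X.T @ err / len(X)                 # (n_in, n_out)
        return loss, grad

    for t in range(500):
        X = rng.normal(size=(batch, n_in))
        Y = X @ W_true
        loss, grad = loss_and_grad(W, X, Y)

        # Equation 2: precondition the gradient with M M^T before the parameter update.
        W = W - alpha * (M @ M.T) @ grad

        # Equation 4, using the cached previous gradient as in Algorithm 1. For a
        # matrix-shaped layer we sum the outer products over the output dimension
        # (an assumption for this sketch, since M is shared across output units).
        if prev_grad is not None:
            A = grad @ prev_grad.T                # (n_in, n_in)
            M = M - beta * (-alpha * (A + A.T) @ M)
        prev_grad = grad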

To get an intuition for the behavior of FOP compared to standard algorithms, we observed its trajectories on a set of low-dimensional optimization problems (Figure 1). Interestingly, while FOP converged in fewer iterations than SGD and Adam (adam_2014), it took more jagged paths along the objective function surface, suggesting that while it takes more aggressive steps, it is perhaps also able to change direction more rapidly.

2.2 Low-Rank FOP

If $\nabla_\theta \mathcal{L}$ is an $n \times m$ matrix and we only apply the preconditioning matrix over the input dimension (sharing it across output dimensions), then a full-rank $M$ would necessarily be $n \times n$. When $n$ is large, preconditioning the gradient becomes expensive. Instead, we can apply a rank-$k$ preconditioner parameterized by $U \in \mathbb{R}^{n \times k}$, with $k \ll n$. To ensure stable performance at the beginning of training, we initialize the preconditioning matrix as close as possible to the identity matrix, even for a low rank $k$, so that the algorithm begins as straightforward gradient descent and learns to depart from vanilla SGD over time. Thus, we set the preconditioner to be

$M = I_n + U U^\top,$    (5)

where $I_n$ is the $n \times n$ identity matrix and $U_{ij} \sim \mathcal{N}(0, \sigma^2)$, where $\sigma$ is small so that $M$ starts out close to the identity matrix. We tested the effect of rank on a simple fully-connected network trained on the MNIST dataset (lecun_mnist). The results, shown in Figure 2, indicate that FOP is able to accelerate training compared to standard SGD (with momentum 0.9) for all values of $k$ and to improve final test accuracy even with fairly low values of $k$.
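As a sketch of how the low-rank parameterization keeps the update cheap (assuming the $I_n + UU^\top$ form of Equation 5 and illustrative shapes), the preconditioned gradient can be formed without ever materializing the full $n \times n$ matrix:

    import numpy as np

    rng = np.random.default_rng(1)
    n, m, k, sigma = 784, 100, 20, 0.01

    U = rng.normal(0.0, sigma, (n, k))   # low-rank factor, small so M starts near the identity
    grad = rng.normal(size=(n, m))       # stand-in for a layer gradient

    # M @ grad with M = I_n + U U^T, computed in O(n k m) rather than O(n^2 m).
    Mg = grad + U @ (U.T @ grad)
    # Applying M M^T (M is symmetric here) just applies M twice.
    preconditioned = Mg + U @ (U.T @ Mg)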

Figure 1: A comparison of FOP to common optimizers on toy problems. The red dot indicates the initial position on the loss surface. The purpose of these visualizations is not to establish the superiority of one optimizer over another, but rather to gain an intuition for their qualitative behavior. (Left) Gradient descent on the Booth function. FOP converges in 543 iterations, while SGD takes 832 steps and Adam 6,221. (Right) Gradient descent on Himmelblau's function. Adam, converging in 5,398 iterations, finds a different global minimum from FOP and SGD, which converge in 289 and 386 steps, respectively. In both cases, we can see that FOP moves more aggressively across the objective function surface compared to the other methods. The poor performance of Adam is likely attributable to the non-stochastic nature of these toy settings.

2.3 Spatial Preconditioning for Convolutional Networks

In order to further reduce the computational cost of FOP in convolutional networks (CNNs), we implemented layer-wise spatial preconditioners, sharing the matrices across both input and output channels (results shown in Section 4). More concretely, if a convolutional layer has spatial kernels of shape $h \times w$, we can learn a preconditioner that is $hw \times hw$. To implement this, when $\nabla_W \mathcal{L}$ is a 4-tensor of kernel gradients with shape $h \times w \times c_{\mathrm{in}} \times c_{\mathrm{out}}$, where $c_{\mathrm{in}}$ and $c_{\mathrm{out}}$ are the numbers of input and output channels, respectively, we can reshape it to a matrix of size $hw \times c_{\mathrm{in}} c_{\mathrm{out}}$, left-multiply it by the learned curvature matrix, and then reshape it back to its original dimensions. When $hw$ is small, as is typically the case in deep CNNs, this preconditioner is small as well, resulting in both computational and memory efficiency.
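The reshaping described above can be sketched as follows; the shapes are illustrative, and a $3 \times 3$ kernel yields a $9 \times 9$ spatial preconditioner:

    import numpy as np

    rng = np.random.default_rng(2)
    h, w, c_in, c_out = 3, 3, 64, 128

    kernel_grad = rng.normal(size=(h, w, c_in, c_out))  # gradient of the conv kernels
    M = np.eye(h * w)                                   # spatial FOP matrix, shared over channels

    flat = kernel_grad.reshape(h * w, c_in * c_out)     # (9, 8192) for a 3x3 kernel
    flat = (M @ M.T) @ flat                             # apply the spatial preconditioner
    preconditioned = flat.reshape(h, w, c_in, c_out)    # restore the original kernel shape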

2.4 FOP for Momentum

FOP can be implemented alongside any standard optimizer, such as gradient descent with momentum (momentum_1964). Given a parameter vector $\theta$ and a loss function $\mathcal{L}$, a basic momentum update with FOP matrix $M$ is typically expressed as two steps:

$v_t = \mu v_{t-1} - \alpha\, M_t M_t^\top\, \nabla_\theta \mathcal{L}(\theta_t),$    (6)

$\theta_{t+1} = \theta_t + v_t,$    (7)

where $v$ is the velocity term and $\mu$ is the momentum parameter. Combining Equations 6 and 7 allows us to write the full update as

$\theta_{t+1} = \theta_t + \mu v_{t-1} - \alpha\, M_t M_t^\top\, \nabla_\theta \mathcal{L}(\theta_t).$    (8)

If, as in maclaurin_2015, we were meta-learning $M$ or only updating it after a certain number of iterations, we would then have to backpropagate through the velocity $v$ to calculate the gradient for $M$. As we are updating online, however, we only need to calculate $\partial \theta_t / \partial M_{t-1}$, and the cached velocity does not depend on $M_{t-1}$. Therefore, the $M$ update is the same as for standard gradient descent (Equation 4). The experiments in Section 4 were performed using this modification of momentum.
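A minimal sketch of the combined update (Equations 6-8), with illustrative names, shapes, and hyperparameters, is:

    import numpy as np

    def fop_momentum_step(W, v, M, grad, prev_grad, alpha=0.05, mu=0.9, beta=1e-4):
        """One FOP-with-momentum update. Returns the new (W, v, M)."""
        # Equations 6 and 7: the velocity accumulates the preconditioned gradient.
        v = mu * v - alpha * (M @ M.T) @ grad
        W = W + v

        # The M update is unchanged from plain FOP (Equation 4), since the cached
        # velocity does not depend on the current preconditioner.
        if prev_grad is not None:
            A = grad @ prev_grad.T
            M = M - beta * (-alpha * (A + A.T) @ M)
        return W, v, M

    # Example usage with small random arrays (shapes are illustrative):
    rng = np.random.default_rng(0)
    W, v, M = rng.normal(size=(20, 5)), np.zeros((20, 5)), np.eye(20)
    g_prev, g = rng.normal(size=(20, 5)), rng.normal(size=(20, 5))
    W, v, M = fop_momentum_step(W, v, M, g, g_prev)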

Figure 2: FOP is able to improve training even with very low ranks. We plot the test accuracy over the course of training for a 4-layer fully-connected network with 100 units per layer trained on MNIST for different FOP matrix ranks $k$, averaged over 5 runs each. The gradient was preconditioned using the low-rank FOP matrix of Equation 5, where the learned component $U$ was rank $k$. A rank of 784 is full rank for the first layer, and 100 is full rank for all layers except the first. The larger $k$ is, the better the performance, both in terms of speed and final accuracy, compared to vanilla SGD with momentum 0.9. For the lowest ranks tested, final accuracy is no longer better, but initial training remains slightly faster. Final test accuracy with the full-rank matrix is better than the baseline test accuracy by 0.6% ($p$-value 1e-4).

3 Related Work

Almeida_1999 introduced the idea of using gradients from the objective function to learn optimization parameters such as the learning rate or a curvature matrix. However, their preconditioning matrix was strictly diagonal, amounting to an approximate Newton algorithm (martens_2016), and they only tested their framework on simple optimization problems with either gradient descent or SGD. They also noted that a truly online stochastic update rule for the curvature matrix would involve a product of the gradients from the current iteration and the following, but to avoid the computational cost of producing an estimate for the next step’s gradient, they relied on the smoothness of the objective function and used the product of the current gradient and the previous gradient. FOP uses the same compromise. More recently, maclaurin_2015 applied this approach in a neural network context, terming the process of backpropagating through iterations hypergradient descent. Their method backpropagates through multiple iterations of the training process of a relatively shallow network to meta-learn a learning rate. However, this method can incur significant memory and computational cost for large models and long training times. hypergrad_2018 instead proposed an online framework directly inherited from Almeida_1999 that used hypergradient-based optimization in which the learning rate is updated after each iteration. Our method extends this idea to not only learn existing optimizer parameters (e.g., learning rate, momentum), but to introduce novel ones in the form of a non-diagonal, layer-specific preconditioning matrix for the gradient.

Figure 3: Results on CIFAR-10 (top) and ImageNet (bottom), averaged over 3 runs. All models are trained with momentum as the base optimizer. We can see that FOP converges more quickly than standard and baseline methods, with slightly superior generalization performance. Learning a spatial curvature matrix adds negligible computational cost to the training process.

CIFAR-10
Method      Test Accuracy    Adtl. Params    Adtl. Time (%)
momentum    –                0               0.0
S-HD        –                9               0.2
PP-HD       –                M               5.9
FOP-norm    –                65              1.0
FOP         –                65              0.8

ImageNet
Method      Test Accuracy    Adtl. Params    Adtl. Time (%)
momentum    –                0               –
S-HD        –                22              –
PP-HD       –                M               –
FOP-norm    –                –               –
FOP         –                –               –

It's also important to discuss the relationship of FOP to other, non-hypergradient preconditioning methods for deep networks. These can mostly be sorted into one of two categories: quasi-Newton algorithms and natural gradient approaches. Quasi-Newton methods seek to learn an approximation of the inverse Hessian. L-BFGS, for example, does this by tracking the differences between gradients across iterations (byrd_nocedal_zhu_1996). This is significantly different from FOP, although the outer product of gradients used in the FOP update can also be viewed as an approximation of the Hessian. Natural gradient methods, such as K-FAC (KFAC_2015) and KFC (KFC_2016), which approximate the inverse Fisher information matrix, bear a much stronger resemblance to FOP. However, there are notable differences. First, unlike FOP, these methods perform extra computation to ensure the invertibility of their curvature matrices. Second, the learning process for the preconditioner in these methods is completely different, as they do not backpropagate across iterations.

Figure 4: Adding FOP improves the hyperparameter robustness of standard optimizers. (a) The final test accuracy of a 9-layer CNN trained on CIFAR-10 for different settings of SGD with momentum (top) and SGD with momentum and FOP (bottom), averaged over three runs. Settings in which adding FOP improves performance by at least one standard deviation are highlighted in blue. FOP appears to be most useful for higher values of the learning rate and momentum parameters. The FOP matrices were trained with Adam as the hypergradient optimizer. (b) The performance of Adam for a range of learning rates both with and without FOP, averaged over three runs. While performance is similar, the top-performing models are improved by the addition of FOP. The FOP matrices were trained using the same settings as the models in (a).

4 Experiments

We measured the performance of FOP on several visual classification tasks. In order to measure the importance of the rotation induced by the preconditioning matrices in addition to the scaling, we also implemented hypergradient descent (HD) methods that learn a scalar layer-wise learning rate (S-HD) and a per-parameter learning rate (PP-HD). The former is the same method implemented by hypergrad_2018, and the latter is equivalent to a strictly diagonal curvature matrix. We also implement a method we call normalized FOP (FOP-norm), in which we rescale the preconditioning matrix to avoid any effect on the learning rate and rely solely on standard learning rate settings. Further details on this method can be found in Supplementary Section A.1. All experiments were run using the TensorFlow library (tensorflow2015-whitepaper). Code is currently available at https://drive.google.com/file/d/1vhB4fxDuxaYJcNP6ioEJQf4CLHxhy1ka/view?usp=sharing.

4.1 CIFAR-10

For CIFAR-10 (CIFAR10), an image dataset consisting of 50,000 training and 10,000 test RGB images divided into 10 object classes, we implemented a 9-layer convolutional architecture inspired by all_conv_2014. We trained each model for 150 epochs with a batch size of 128 and an initial learning rate of 0.05, decaying the learning rate by a factor of 10 after 80 epochs. For S-HD, PP-HD, and FOP, we use Adam as the hypergradient optimizer with a learning rate of 1e-4. The results are plotted in Figure 3. FOP produces a significant speed-up in training and improves final test accuracy compared to baseline methods, including FOP-norm, indicating that both the rotation and the scaling learned by FOP are useful for learning.

4.2 ImageNet

The ImageNet dataset consists of 1,281,167 training and 50,000 validation RGB images divided into 1,000 categories (imagenet_cvpr09). Here, we trained a ResNet-18 (resnet2015) model for 60 epochs with a batch size of 256 and an initial learning rate of 0.1, decaying by a factor of 10 at the 25th and 50th epochs. A summary of our results is displayed in Figure 3. We can see that the improved convergence speed and test performance observed on CIFAR-10 is maintained on this deeper model and more difficult dataset.

4.3 Hyperparameter Robustness

In addition to measuring peak performance, we also tested the effect FOP had on the robustness of standard optimizers to hyperparameter selection. hypergrad_2018 demonstrated the ability of scalar hypergradients to mitigate the effect of the initial learning rate on performance. We therefore tested whether this benefit was preserved by FOP, as well as whether it extended to other hyperparameter choices, such as the momentum coefficient. Our results, summarized in Figure 4, support this idea, showing that FOP can improve performance on a wide array of optimizer settings. For momentum (Fig. 4a), we see that the performance gap is greater for higher learning rates and momentum values, and in several instances FOP is able to train successfully where pure SGD with momentum fails. For Adam, the difference is smaller, although in the highest-performance learning rate region adding FOP to Adam outperforms the standard method. We hypothesize that this smaller difference is in large part due to unanticipated effects that preconditioning the gradient has on the adaptive moment estimation performed by Adam. We leave further investigation into this interaction for future work.

Figure 5: Understanding the learned preconditioning matrices ($M$). (a) The evolution of an example preconditioning matrix throughout the training process, taken from the ninth layer of a ResNet-18 model trained on ImageNet. Each layer learned a similar whitening structure. (b) Histograms of matrix values across layers during training. The training process is traced by going from back to front in the plots. We can see that the convergence of the matrix values, corresponding to a stronger decorrelation structure and a reduced norm, is stronger in the higher layers of the network. (c) The sorted eigenvalues of the final learned preconditioning matrices for the first seven layers of a 9-layer network trained on CIFAR-10 (the top two layers had $1 \times 1$ kernels). We can see that the distribution shifts downward and becomes more uniform in higher layers. This is interesting: while a uniform distribution of eigenvalues is considered helpful in aiding convergence, the downward shift in values pushes the matrix closer to singularity.

5 What is FOP learning?

By studying the learned preconditioning matrices, it's possible to gain an intuition for the effect FOP has on the training process. Interestingly, we found that visual tasks induced similar structures in the preconditioning matrices across initializations and across layers. Visualizing the matrices (Figure 5a) shows that they develop a decorrelating, or whitening, structure: each of the 9 positions in the 3x3 convolutional filter sends a strong positive weight to itself and negative weights to its immediate neighbors, without wrapping over the corners of the filter. This is interesting, as images are known to have a high degree of spatial autocorrelation (barlow_1961; barlow_1989). As a mechanism for reducing redundant computation, whitening visual inputs is known to be beneficial both for retinal processing in the brain (attick_whitening_1992; hateren_1992) and in artificial networks (pascanu_naturalnets_2015; whitened_batchnorm_2018). However, it is more unusual to consider whitening of the learning signal, as observed in FOP, rather than of the forward activations.

This learned pattern is accompanied by a shift in the norm of the curvature matrix, as the diagonal elements, initialized to ones, shrink in value, and the off-diagonal elements increase. This shift in distribution is visualized in Figure 5b, and grows stronger deeper in the network. It is possible that this indicates that standard gradient descent is more ill-conditioned in higher layers, or perhaps equivalently, that a greater degree of decorrelation is helpful for kernels disentangling higher-level representations.

We can also examine the eigenvalue spectra of the learned matrices across layers (Figure 5c). We can see that the basic requirement that a preconditioning matrix be positive definite, so as not to reverse the direction of the gradient, is met. However, we also note that the eigenvalues are very small in magnitude, indicating a near-zero determinant. This results in a matrix that is essentially non-invertible, an interesting property, as quasi-Newton and natural gradient methods seek to compute or approximate the inverse of either the Hessian or the Fisher information matrix. The implications of this near-singularity are an avenue for future study. Furthermore, the eigenvalues grow smaller, and their distribution more uniform, higher in the network, in accordance with the pattern observed in Figure 5b. A uniform eigenspectrum is seen as an attribute of an effective preconditioning matrix (precond_sgd_2015), as it indicates an even convergence rate across parameter space.

6 Convergence

Our experiments indicate that the curvature matrices learned by FOP converge to a relatively fixed norm roughly two-thirds of the way through training (Figure 5b). This is important, as it indicates that the effective learning rate induced by the preconditioners stabilizes. This allows us to perform a preliminary convergence analysis of the algorithm in a manner analogous to hypergrad_2018.

Consider a modification of FOP in which the symmetric, positive-definite preconditioning matrix $M_t M_t^\top$ is rescaled at each iteration to have a prescribed norm $p(t)$, such that $p(t) \approx \|M_t M_t^\top\|_F$ when $t$ is small and $p(t) \approx p_\infty$ when $t$ is large, where $p_\infty$ is some chosen constant. Specifically, as in hypergrad_2018, we set $p(t) = \delta(t)\, \|M_t M_t^\top\|_F + (1 - \delta(t))\, p_\infty$, where $\delta(t)$ is some function that decays over time and satisfies $\delta(1) = 1$ at the start of training (e.g., $\delta(t) = 1/t^2$).
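A small sketch of this rescaling, under the assumed form of $p(t)$ above, is:

    import numpy as np

    def rescale_preconditioner(M, t, p_inf, delta=lambda t: 1.0 / t ** 2):
        """Rescale M so that ||M M^T||_F = p(t) = delta(t)*||M M^T||_F + (1 - delta(t))*p_inf.

        Assumes t >= 1 so that delta(t) is well defined.
        """
        P = M @ M.T
        norm = np.linalg.norm(P)                   # Frobenius norm by default for 2-D arrays
        p_t = delta(t) * norm + (1.0 - delta(t)) * p_inf
        return M * np.sqrt(p_t / norm)             # scaling M by c scales M M^T by c**2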

This formulation allows us to extend the convergence proof of hypergrad_2018 to FOP, under the same assumptions about the objective function $f$:

Theorem 1

Suppose that $f$ is convex and $L$-Lipschitz smooth with $\|\nabla f(\theta)\| \le K$ for some fixed $K$ and all model parameters $\theta$. Then $\theta_t \rightarrow \theta^*$ if $p(t) \rightarrow p_\infty$ and $\alpha\, p_\infty < 2/L$ as $t \rightarrow \infty$, where the $\theta_t$ are generated by (non-stochastic) gradient descent.

Proof. We can write $p(t) = \delta(t)\, \|M_t M_t^\top\|_F + (1 - \delta(t))\, p_\infty$, where the change in $\|M_t M_t^\top\|_F$ at each step is controlled by the gradient-norm bound $K$ and by $\beta$, the hypergradient learning rate (Equation 4). Our assumptions about the limiting behavior of $\delta(t)$ then imply that $\delta(t)\, \|M_t M_t^\top\|_F \rightarrow 0$ and so $p(t) \rightarrow p_\infty$ as $t \rightarrow \infty$. For sufficiently large $t$, we therefore have $\alpha\, p(t) < 2/L$. Note also that as $M_t M_t^\top$ is symmetric and positive definite, it will not prevent the convergence of gradient descent (Supplementary Section A.2). Moreover, preliminary investigation showed that the angle of rotation induced by the FOP matrices is significantly below $90^\circ$ (Supplementary Figure A.1). Previous work by fa-2016 shows that such a rotation does not impede the convergence of gradient descent. Because standard SGD converges under these conditions (karimi_convergence_2016), FOP must as well.

7 Conclusion

In this paper, we introduced a novel optimization technique, FOP, that learns a preconditioning matrix online to improve convergence and generalization in large-scale machine learning models. We tested FOP on several problems and architectures, examined the nature of the learned transformations, and provided a preliminary analysis of FOP's convergence properties (Section 6 and Supplementary Section A.2). There are a number of opportunities for future work, including learning transformations of other optimization parameters (e.g., the momentum parameter in the case of the momentum optimizer), expanding and generalizing our theoretical analysis, and testing FOP on a wider variety of models and datasets.

Acknowledgements

We would like to thank Kenneth Stanley, Jeff Clune, Rosanne Liu, and members of the Horizons and Deep Collective research groups at Uber AI for useful feedback and discussions.

References

Appendix A Supplemental Information

A.1 Normalized FOP

In order to control for the scaling induced by FOP and measure the effect of its rotation only, we introduce normalized FOP, which performs the following parameter update:

$\theta_{t+1} = \theta_t - \alpha \sqrt{n}\, \frac{M_t M_t^\top}{\|M_t M_t^\top\|_F}\, \nabla_\theta \mathcal{L}(\theta_t),$    (9)

where $n$ is the first dimension of $M$. This update has the effect of normalizing the preconditioner $M_t M_t^\top$ by its Frobenius norm and then re-scaling the update by $\sqrt{n}$ to match the norm of gradient descent, as $\|I_n\|_F = \sqrt{n}$, where $I_n$ is the identity matrix of size $n \times n$.
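A minimal sketch of this normalized update (with illustrative shapes) is:

    import numpy as np

    def fop_norm_update(W, M, grad, alpha=0.05):
        """Normalized-FOP step: rescale M M^T to have the Frobenius norm of the identity."""
        n = M.shape[0]
        P = M @ M.T
        P = P * (np.sqrt(n) / np.linalg.norm(P))   # ||I_n||_F = sqrt(n)
        return W - alpha * P @ grad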

A.2 Convergence of Preconditioned Gradient Descent

We demonstrate that, given a random, symmetric positive-definite preconditioning matrix $G$, gradient descent will still converge at a linear rate, modeling our proof after that of karimi_convergence_2016.

Theorem 2

Consider a convex, $L$-Lipschitz smooth objective function $f$ with global minimum $f^* = f(\theta^*)$ which obeys the Polyak-Łojasiewicz (PL) inequality (polyak_1963),

$\frac{1}{2} \|\nabla f(\theta)\|^2 \ge \mu \left( f(\theta) - f^* \right) \quad \forall\, \theta,$    (10)

for some $\mu > 0$. Then applying the gradient update method given by

$\theta_{t+1} = \theta_t - \eta\, G\, \nabla f(\theta_t),$    (11)

where $G$ is a real, symmetric positive semi-definite matrix and $\eta = \frac{\lambda_{\min}}{L \lambda_{\max}^2}$ is the step size, where $\lambda_{\min}$ and $\lambda_{\max}$ are the minimum and maximum eigenvalues of $G$, respectively, results in a global linear convergence rate given by

$f(\theta_T) - f^* \le \left( 1 - \frac{\mu \lambda_{\min}^2}{L \lambda_{\max}^2} \right)^T \left( f(\theta_0) - f^* \right).$    (12)

Proof. First, assume a step size of $\eta = \frac{\lambda_{\min}}{L \lambda_{\max}^2}$. Given that $\nabla f$ is $L$-Lipschitz continuous, we can write

$f(\theta_{t+1}) - f(\theta_t) \le \nabla f_t^\top (\theta_{t+1} - \theta_t) + \frac{L}{2} \|\theta_{t+1} - \theta_t\|^2,$

where $\nabla f_t$ denotes $\nabla f(\theta_t)$. Plugging in the gradient update equation gives

$f(\theta_{t+1}) - f(\theta_t) \le -\eta\, \nabla f_t^\top G\, \nabla f_t + \frac{L \eta^2}{2}\, \nabla f_t^\top G^\top G\, \nabla f_t.$    (13)

Let $G = Q \Lambda Q^\top$ be the eigendecomposition of $G$, such that the columns of $Q$ are the orthonormal eigenvectors and $\Lambda$ is a diagonal matrix whose entries are the eigenvalues of $G$, which are all non-negative due to the symmetric positive semi-definiteness of $G$. We can then rewrite the first term of Equation 13 as

$\nabla f_t^\top G\, \nabla f_t = \nabla f_t^\top Q \Lambda Q^\top \nabla f_t.$

We now change our basis, letting $z = Q^\top \nabla f_t$, and define $\lambda_{\min} = \min_i \lambda_i$ and $\lambda_{\max} = \max_i \lambda_i$, where the $\lambda_i$ are the eigenvalues. Then we have

$\nabla f_t^\top G\, \nabla f_t = z^\top \Lambda z = \sum_i \lambda_i z_i^2 \ge \lambda_{\min} \|z\|^2.$    (14)

Examining the second term of Equation 13, we have

$\nabla f_t^\top G^\top G\, \nabla f_t = z^\top \Lambda^2 z = \sum_i \lambda_i^2 z_i^2 \le \lambda_{\max}^2 \|z\|^2.$    (15)

Combining the results of Equations 14 and 15 gives

$f(\theta_{t+1}) - f(\theta_t) \le -\eta \lambda_{\min} \|z\|^2 + \frac{L \eta^2 \lambda_{\max}^2}{2} \|z\|^2.$    (16)

We can then revert to the original basis, using the fact that $Q$ is orthogonal:

$\|z\|^2 = \|Q^\top \nabla f_t\|^2 = \|\nabla f_t\|^2,$

giving us

$f(\theta_{t+1}) - f(\theta_t) \le \left( -\eta \lambda_{\min} + \frac{L \eta^2 \lambda_{\max}^2}{2} \right) \|\nabla f_t\|^2 = -\frac{\lambda_{\min}^2}{2 L \lambda_{\max}^2} \|\nabla f_t\|^2.$    (17)

By the PL inequality we have that $\|\nabla f(\theta_t)\|^2 \ge 2\mu \left( f(\theta_t) - f^* \right)$ for the PL constant $\mu > 0$. Plugging this in gives

$f(\theta_{t+1}) - f(\theta_t) \le -\frac{\mu \lambda_{\min}^2}{L \lambda_{\max}^2} \left( f(\theta_t) - f^* \right).$    (18)

Then let $\kappa = \frac{\mu \lambda_{\min}^2}{L \lambda_{\max}^2}$. We have

$f(\theta_{t+1}) - f(\theta_t) \le -\kappa \left( f(\theta_t) - f^* \right).$

Rearranging and subtracting $f^*$ from both sides gives

$f(\theta_{t+1}) - f^* \le (1 - \kappa) \left( f(\theta_t) - f^* \right).$    (19)

Applying Equation 19 recursively gives the desired convergence:

$f(\theta_T) - f^* \le (1 - \kappa)^T \left( f(\theta_0) - f^* \right).$    (20)

Thus this preconditioned gradient descent converges linearly with a step size of $\eta = \frac{\lambda_{\min}}{L \lambda_{\max}^2}$, which recovers the standard step size of $\frac{1}{L}$ when $G = I$. Standard gradient descent converges with a step size of $\frac{1}{L}$ under these assumptions (karimi_convergence_2016). We also note that even if $G$ changes over the course of training, the required step size will vary, but convergence will still occur.
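As an illustrative numerical check of this result (not part of the analysis above), the following NumPy snippet runs preconditioned gradient descent on a random convex quadratic with a random symmetric positive-definite $G$ and the step size used in the proof; the objective value decays toward the minimum $f^* = 0$:

    import numpy as np

    rng = np.random.default_rng(0)
    d = 10

    # A convex quadratic f(x) = 0.5 * x^T H x, for which L = lambda_max(H)
    # and the PL constant is mu = lambda_min(H); the minimum is f* = 0 at x = 0.
    A = rng.normal(size=(d, d))
    H = A @ A.T / d + np.eye(d)
    L = np.linalg.eigvalsh(H).max()

    # A random symmetric positive-definite preconditioner G.
    B = rng.normal(size=(d, d))
    G = B @ B.T / d + np.eye(d)
    lams = np.linalg.eigvalsh(G)
    eta = lams.min() / (L * lams.max() ** 2)   # step size from the proof above

    x = rng.normal(size=d)
    for _ in range(5000):
        x = x - eta * G @ (H @ x)              # preconditioned gradient step
    print(0.5 * x @ H @ x)                     # objective value, close to f* = 0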

Figure A.1: Angles in degrees between the FOP-preconditioned gradient $M M^\top \nabla_\theta \mathcal{L}$ and the raw gradient $\nabla_\theta \mathcal{L}$ for each layer of a 4-layer fully-connected network trained on MNIST. Values are averaged over 5 runs and displayed every 200 iterations. Shaded areas represent the minimum and maximum values over all runs, showing that the angle only varies slightly. Notice that the angle remains well below $90^\circ$, indicating that while FOP does induce a rotation, the update is far from orthogonal to the vanilla learning signal.