# First-Order Preconditioning via Hypergradient Descent

###### Abstract

Standard gradient descent methods are susceptible to a range of issues that can impede training, such as high correlations and different scaling in parameter space. These difficulties can be addressed by second-order approaches that apply a preconditioning matrix to the gradient to improve convergence. Unfortunately, such algorithms typically struggle to scale to high-dimensional problems, in part because the calculation of specific preconditioners such as the inverse Hessian or Fisher information matrix is highly expensive. We introduce first-order preconditioning (FOP), a fast, scalable approach that generalizes previous work on hypergradient descent (Almeida_1999; maclaurin_2015; hypergrad_2018) to learn a preconditioning matrix that only makes use of first-order information. Experiments show that FOP is able to improve the performance of standard deep learning optimizers on several visual classification tasks with minimal computational overhead. We also investigate the properties of the learned preconditioning matrices and perform a preliminary theoretical analysis of the algorithm.

## 1 Introduction

High-dimensional nonlinear optimization problems often present a number of difficulties, such as strongly-correlated parameters and variable scaling along different directions in parameter space (martens_2016). Despite this, deep neural networks and other large-scale machine learning models applied to such problems typically rely on simple variations of gradient descent to train, which is known to be highly sensitive to these difficulties. While this approach often works well in practice, addressing the underlying issues directly could provide stronger theoretical guarantees, accelerate training, and improve generalization.

Adaptive learning rate models such as Adam (adam_2014) and RMSProp (rmsprop_2012) provide some degree of higher-order approximation to re-scale updates based on per-parameter behavior. Newton-based methods make use of the curvature of the loss surface to both re-scale and rotate the gradient in order to improve convergence. Natural gradient methods (Amari_nat_grad_og) do the same in order to enforce smoothness in the evolution of the model’s conditional distribution. In each of the latter two cases, the focus is on computing a specific linear transformation of the gradient that improves the conditioning of the problem. This transformation is typically known as a preconditioning, or curvature, matrix. In the case of Newton-based methods, the preconditioning matrix takes the form of the inverse Hessian (which quasi-Newton methods approximate), while for natural gradient methods it is the inverse Fisher information matrix. Computing these transformations is typically intractable for high-dimensional problems, and while a number of approximate methods exist for both (e.g., byrd_nocedal_zhu_1996; KFAC_2015; KFC_2016), they are often still too expensive for the performance gain they provide. These approaches also suffer from rigid inductive biases regarding the nature of the problems to which they are applied, in that they seek to compute or approximate specific transformations. However, in large, non-convex problems, the optimal gradient transformation may be less obvious, or may even change over the course of training.

In this paper, we attempt to address these issues through a method we term first-order preconditioning (FOP). FOP doesn’t attempt to compute a specific preconditioner, such as the inverse Hessian, but rather uses first-order hypergradient descent (maclaurin_2015) to learn an adaptable transformation online directly from the task objective function. Our method adds minimal computational and memory cost to standard deep network training settings and results in improved convergence speed and generalization compared to standard approaches. FOP can flexibly be applied to any gradient-based optimization problem, and we show that when used in conjunction with standard optimizers, it improves their performance.

## 2 First Order Preconditioning

### 2.1 The Basic Approach

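Consider a parameter vector $\theta$ and a loss function $J$. A traditional gradient update with a preconditioning matrix $P$ can be written as

$$\theta^{(t+1)} = \theta^{(t)} - \epsilon P \nabla_\theta J^{(t)}, \tag{1}$$

where $\epsilon$ is the learning rate and $J^{(t)}$ is the loss evaluated at $\theta^{(t)}$. Our goal is to learn $P$. However, while we place no other constraints on our preconditioner, in order to ensure that it is positive definite, and therefore does not reverse the direction of the gradient, we parameterize it as $P = MM^\top$ (strictly, $MM^\top$ is positive semi-definite, and positive definite when $M$ has full rank) and replace Equation 1 with the following:

$$\theta^{(t+1)} = \theta^{(t)} - \epsilon M^{(t)} M^{(t)\top} \nabla_\theta J^{(t)}. \tag{2}$$

Under reasonable assumptions, gradient descent is guaranteed to converge with the use of even a random symmetric, positive-definite preconditioner, as we show in Supplementary Section A.2. Because $\theta^{(t)}$ is a function of $M^{(t-1)}$, we can then backpropagate from the loss at iteration $t$ to the previous iteration’s preconditioner via a simple application of the chain rule:

$$\frac{\partial J^{(t)}}{\partial M^{(t-1)}} = \frac{\partial J^{(t)}}{\partial \theta^{(t)}} \frac{\partial \theta^{(t)}}{\partial M^{(t-1)}}. \tag{3}$$

The gradient with respect to the preconditioner is then simply

$$\frac{\partial J^{(t)}}{\partial M^{(t-1)}} = -\epsilon \left( \nabla_\theta J^{(t)} \big(\nabla_\theta J^{(t-1)}\big)^\top + \nabla_\theta J^{(t-1)} \big(\nabla_\theta J^{(t)}\big)^\top \right) M^{(t-1)}. \tag{4}$$

Note that ideally, we would compute $\partial J^{(t+1)} / \partial M^{(t)}$ to update $M^{(t)}$, but as we don’t have access to $J^{(t+1)}$ yet, we follow the example of Almeida_1999 and assume that the loss surface does not change dramatically across a single iteration. The basic approach is summarized in Algorithm 1. We use supervised learning as an example, but the same method applies to any gradient-based optimization. The preconditioned gradient can then be passed to any standard optimizer to produce an update for $\theta$. For example, we describe the procedure for using FOP with momentum (momentum_1964) in Section 2.4. For multi-layer networks, in order to make the simplest modification to normal backpropagation, we learn a separate $M$ for each layer, not a global curvature matrix.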
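The online procedure described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: it assumes the preconditioner is parameterized as `M @ M.T`, uses a toy quadratic loss, and updates `M` with plain gradient descent. The names `run_fop`, `fop_hypergrad`, and the hyper learning rate `rho` are hypothetical.

```python
import numpy as np

def loss(theta, H):
    """Toy quadratic objective J(theta) = 0.5 * theta^T H theta."""
    return 0.5 * theta @ H @ theta

def grad(theta, H):
    """Gradient of the toy quadratic objective."""
    return H @ theta

def fop_hypergrad(g_curr, g_prev, M, eps):
    """Gradient of the current loss w.r.t. the previous factor M, assuming
    the step theta <- theta - eps * M M^T g_prev produced the current theta."""
    S = np.outer(g_curr, g_prev) + np.outer(g_prev, g_curr)
    return -eps * S @ M

def run_fop(H, theta0, eps=0.1, rho=0.01, steps=100):
    """Online FOP on the toy problem; rho is an assumed hyper learning rate."""
    theta = theta0.copy()
    M = np.eye(theta.size)                    # start as vanilla gradient descent
    g_prev = grad(theta, H)
    theta = theta - eps * M @ (M.T @ g_prev)  # first preconditioned step
    for _ in range(steps - 1):
        g = grad(theta, H)
        M = M - rho * fop_hypergrad(g, g_prev, M, eps)  # update the factor
        theta = theta - eps * M @ (M.T @ g)             # preconditioned step
        g_prev = g
    return theta, M

H = np.array([[3.0, 1.0], [1.0, 2.0]])
theta0 = np.array([1.0, -1.0])
theta_final, M_final = run_fop(H, theta0)
```

Because the hypergradient only involves the two most recent gradients and the current factor, the extra cost per step in this sketch is one outer-product sum and one matrix product.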

To get an intuition for the behavior of FOP compared to standard algorithms, we observed its trajectories on a set of low-dimensional optimization problems (Figure 1). Interestingly, while FOP converged in fewer iterations than SGD and Adam (adam_2014), it took more jagged paths along the objective function surface, suggesting that while it takes more aggressive steps, it is perhaps also able to change direction more rapidly.

### 2.2 Low-Rank FOP

If is an matrix, and we only apply the preconditioning matrix over input dimensions (and share it across output dimensions), then a full-rank would necessarily be . When is large, preconditioning the gradient becomes expensive. Instead, we can apply a rank- , with and . To ensure stable performance at the beginning of training, we initialize the preconditioning matrix as close as possible to the identity matrix, even for a low rank , so that the algorithm begins as straightforward gradient descent and learns to depart from vanilla SGD over time. Thus, we set the preconditioner to be

(5) |

where is the identity matrix and , where is small so that starts out close to the identity matrix. We tested the effect of rank on a simple fully-connected network trained on the MNIST dataset (lecun_mnist). The results, shown in Figure 2, indicate that FOP is able to accelerate training compared to standard SGD (with momentum 0.9) with all values of and improve final test accuracy even with fairly low values of .
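A minimal sketch of the low-rank idea follows. It assumes a preconditioner of the form `I + M M^T` with an `n x r` factor `M` drawn from a small-variance Gaussian, which is one parameterization consistent with the description above and may differ from the authors' exact construction. The function names, the gradient shape `(n, m)`, and `sigma` are illustrative assumptions.

```python
import numpy as np

def init_low_rank_factor(n, r, sigma=1e-3, seed=0):
    """Small random n x r factor so the preconditioner starts near the identity."""
    rng = np.random.default_rng(seed)
    return sigma * rng.standard_normal((n, r))

def apply_low_rank_preconditioner(M, G):
    """Compute (I + M M^T) @ G without ever forming the n x n matrix."""
    return G + M @ (M.T @ G)

n, m, r = 512, 64, 8
M = init_low_rank_factor(n, r)
G = np.random.default_rng(1).standard_normal((n, m))  # stand-in gradient
PG = apply_low_rank_preconditioner(M, G)
```

Applying the factor as two thin products costs on the order of `n * r * m` operations instead of `n * n * m`, which is where the saving for large `n` comes from in this sketch.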

### 2.3 Spatial Preconditioning for Convolutional Networks

In order to further reduce the computational cost of FOP in convolutional networks (CNNs), we implemented layer-wise spatial preconditioners, sharing the matrices across both input and output channels (results shown in Section 4). More concretely, if a convolutional layer has spatial kernels with shape $k \times k$, we can learn a preconditioner that is $k^2 \times k^2$. To implement this, when $\theta$ is a 4-tensor of kernels of shape $k \times k \times c_{in} \times c_{out}$, where $c_{in}$ and $c_{out}$ are the input and output channels, respectively, we can reshape its gradient to a matrix of size $k^2 \times c_{in} c_{out}$, left-multiply it by the learned curvature matrix, and then reshape it back to its original dimensions. When $k$ is small, as is typically the case in deep CNNs, this preconditioner is small as well, resulting in both computational and memory efficiency.
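The reshape, multiply, reshape procedure can be sketched as follows. The kernel layout `(k, k, c_in, c_out)` and the function name `spatial_precondition` are assumptions made for illustration.

```python
import numpy as np

def spatial_precondition(G, P):
    """Apply a (k*k x k*k) preconditioner P to a kernel gradient G
    of shape (k, k, c_in, c_out), shared across all channel pairs."""
    k1, k2, c_in, c_out = G.shape
    flat = G.reshape(k1 * k2, c_in * c_out)   # one column per (c_in, c_out) pair
    return (P @ flat).reshape(G.shape)        # restore the 4-tensor layout

k, c_in, c_out = 3, 16, 32
G = np.random.default_rng(0).standard_normal((k, k, c_in, c_out))
M = np.eye(k * k)          # identity factor: preconditioning is a no-op
P = M @ M.T
out = spatial_precondition(G, P)
```

For 3x3 kernels the preconditioner has only 81 entries per layer, regardless of the number of channels.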

### 2.4 FOP for Momentum

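FOP can be implemented alongside any standard optimizer, such as gradient descent with momentum (momentum_1964). Given a parameter vector $\theta$ and a loss function $J$, a basic momentum update with FOP matrix $M$ is typically expressed as two steps:

$$v^{(t+1)} = \beta v^{(t)} - \epsilon M^{(t)} M^{(t)\top} \nabla_\theta J^{(t)}, \tag{6}$$

$$\theta^{(t+1)} = \theta^{(t)} + v^{(t+1)}, \tag{7}$$

where $v$ is the velocity term, $\beta$ is the momentum parameter, and $\epsilon$ is the learning rate. Combining Equations 6 and 7 allows us to write the full update as

$$\theta^{(t+1)} = \theta^{(t)} + \beta v^{(t)} - \epsilon M^{(t)} M^{(t)\top} \nabla_\theta J^{(t)}. \tag{8}$$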

If, as in maclaurin_2015, we were meta-learning or only updating the preconditioner after a certain number of iterations, we would then have to backpropagate through the velocity term to calculate the gradient for the preconditioner. As we are updating online, however, we only need to calculate the gradient of the current loss with respect to the previous iteration’s preconditioner. Therefore, the preconditioner update is the same as for standard gradient descent. The experiments in Section 4 were performed using this modification of momentum.
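A small sketch of the two-step momentum update with a preconditioned gradient follows. The sign convention (velocity accumulates the negative scaled gradient) and the names `fop_momentum_step`, `eps`, and `beta` are assumptions for illustration; other equivalent momentum conventions exist.

```python
import numpy as np

def fop_momentum_step(theta, v, M, g, eps=0.05, beta=0.9):
    """One momentum step on the preconditioned gradient M M^T g."""
    v_next = beta * v - eps * M @ (M.T @ g)   # velocity update (assumed convention)
    theta_next = theta + v_next               # parameter update
    return theta_next, v_next

theta = np.array([1.0, -1.0])
v = np.array([0.2, 0.1])
M = np.array([[1.0, 0.1], [0.0, 0.9]])
g = np.array([2.0, -1.0])                     # stand-in gradient at theta

theta_next, v_next = fop_momentum_step(theta, v, M, g)
# Single-expression form of the same update
combined = theta + 0.9 * v - 0.05 * (M @ M.T) @ g
```

In the online setting the velocity from earlier iterations is treated as a constant when differentiating with respect to the preconditioner, so no extra state needs to be stored for the hypergradient.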

## 3 Related Work

Almeida_1999 introduced the idea of using gradients from the objective function to learn optimization parameters such as the learning rate or a curvature matrix. However, their preconditioning matrix was strictly diagonal, amounting to an approximate Newton algorithm (martens_2016), and they only tested their framework on simple optimization problems with either gradient descent or SGD. They also noted that a truly online stochastic update rule for the curvature matrix would involve a product of the gradients from the current iteration and the following, but to avoid the computational cost of producing an estimate for the next step’s gradient, they relied on the smoothness of the objective function and used the product of the current gradient and the previous gradient. FOP uses the same compromise. More recently, maclaurin_2015 applied this approach in a neural network context, terming the process of backpropagating through iterations hypergradient descent. Their method backpropagates through multiple iterations of the training process of a relatively shallow network to meta-learn a learning rate. However, this method can incur significant memory and computational cost for large models and long training times. hypergrad_2018 instead proposed an online framework directly inherited from Almeida_1999 that used hypergradient-based optimization in which the learning rate is updated after each iteration. Our method extends this idea to not only learn existing optimizer parameters (e.g., learning rate, momentum), but to introduce novel ones in the form of a non-diagonal, layer-specific preconditioning matrix for the gradient.

It’s also important to discuss the relationship of FOP to other, non-hypergradient preconditioning methods for deep networks. These mostly can be sorted into one of two categories, quasi-Newton algorithms and natural gradient approaches. Quasi-Newton methods seek to learn an approximation of the inverse Hessian. L-BFGS, for example, does this through tracking the differences between gradients across iterations (byrd_nocedal_zhu_1996). This is significantly different from FOP, although the outer product of gradients used in the update for FOP can also be viewed as an approximation of the Hessian. Natural gradient methods, such as K-FAC (KFAC_2015) and KFC (KFC_2016), which approximate the inverse Fisher information matrix, bear a much stronger resemblance to FOP. However, there are notable differences. First, unlike FOP, these methods perform extra computation to ensure the invertibility of their curvature matrices. Second, the learning process for the preconditioner in these methods is completely different, as they do not backpropagate across iterations.

## 4 Experiments

We measured the performance of FOP on several visual classification tasks. In order to measure the importance of the rotation induced by the preconditioning matrices in addition to the scaling, we also implemented hypergradient descent (HD) methods to learn a scalar layer-wise learning rate (S-HD) and a per-parameter (PP-HD) learning rate. The former is the same method implemented by hypergrad_2018, and the latter is equivalent to a strictly diagonal curvature matrix. We also implement a method we call normalized FOP (FOP-norm), in which we rescale the preconditioning matrix to avoid any effect on the learning rate and rely solely on standard learning rate settings. Further details on this method can be found in Supplementary Section A.1. All experiments were run using the TensorFlow library (tensorflow2015-whitepaper). Code is currently available at https://drive.google.com/file/d/1vhB4fxDuxaYJcNP6ioEJQf4CLHxhy1ka/view?usp=sharing.
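For reference, the two hypergradient baselines can be sketched as below. The updates follow from differentiating the loss after a plain gradient step with respect to the learning rate; a plain gradient step on the learning rate is used here for simplicity, whereas the experiments use Adam as the hypergradient optimizer. The function names and `hyper_lr` are hypothetical.

```python
import numpy as np

def s_hd_update(alpha, g_curr, g_prev, hyper_lr):
    """Scalar learning rate (S-HD): move alpha along the dot product
    of the current and previous gradients."""
    return alpha + hyper_lr * float(np.sum(g_curr * g_prev))

def pp_hd_update(alpha_vec, g_curr, g_prev, hyper_lr):
    """Per-parameter learning rates (PP-HD): same rule applied elementwise,
    equivalent to learning a diagonal preconditioner."""
    return alpha_vec + hyper_lr * g_curr * g_prev

g_prev = np.array([1.0, -2.0, 0.5])
g_curr = np.array([0.5, 1.0, 0.5])
alpha = s_hd_update(0.05, g_curr, g_prev, hyper_lr=1e-2)
alpha_vec = pp_hd_update(np.full(3, 0.05), g_curr, g_prev, hyper_lr=1e-2)
```

Neither baseline can rotate the gradient, which is what makes them useful controls for separating the effect of scaling from that of rotation.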

### 4.1 CIFAR-10

For CIFAR-10 (CIFAR10), an image dataset consisting of 50,000 training and 10,000 test RGB images divided into 10 object classes, we implemented a 9-layer convolutional architecture inspired by all_conv_2014. We trained each model for 150 epochs with a batch size of 128 and initial learning rate 0.05, decaying the learning rate by a factor of 10 after 80 epochs. For S-HD, PP-HD, and FOP, we use Adam as the hypergradient optimizer with a learning rate of 1e-4. The results are plotted in Figure 3. FOP produces a significant speed-up in training and improves final test accuracy compared to baseline methods, including FOP-norm, indicating that both the rotation and the scaling learned by FOP are useful for learning.

### 4.2 ImageNet

The ImageNet dataset consists of 1,281,167 training and 50,000 validation RGB images divided into 1,000 categories (imagenet_cvpr09). Here, we trained a ResNet-18 (resnet2015) model for 60 epochs with a batch size of 256 and an initial learning rate of 0.1, decaying by a factor of 10 at the 25th and 50th epochs. A summary of our results is displayed in Figure 3. We can see that the improved convergence speed and test performance observed on CIFAR-10 are maintained on this deeper model and more difficult dataset.
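The step-decay schedules used in the CIFAR-10 and ImageNet experiments can be written as a single helper. This is a sketch under an assumed boundary convention (zero-indexed epochs, with the decay taking effect at the boundary epoch itself); the function name `step_decay_lr` is hypothetical.

```python
def step_decay_lr(epoch, base_lr, boundaries, factor=10.0):
    """Divide base_lr by `factor` once for every boundary already reached."""
    n_decays = sum(epoch >= b for b in boundaries)
    return base_lr / (factor ** n_decays)

# CIFAR-10 setting: 150 epochs, initial rate 0.05, one decay at epoch 80
cifar_lr = [step_decay_lr(e, 0.05, [80]) for e in range(150)]
# ImageNet setting: 60 epochs, initial rate 0.1, decays at epochs 25 and 50
imagenet_lr = [step_decay_lr(e, 0.1, [25, 50]) for e in range(60)]
```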

### 4.3 Hyperparameter Robustness

In addition to measuring peak performance, we also tested the effect FOP had on the robustness of standard optimizers to hyperparameter selection. hypergrad_2018 demonstrated the ability of scalar hypergradients to mitigate the effect of the initial learning rate on performance. We therefore tested whether this benefit was preserved by FOP, as well as whether it extended to other hyperparameter choices, such as the momentum coefficient. Our results, summarized in Figure 4, support this idea, showing that FOP can improve performance on a wide array of optimizer settings. For momentum (Fig. 4a), we see that the performance gap is greater for higher learning rates and momentum values, and in several instances FOP is able to train successfully where pure SGD with momentum fails. For Adam, the difference is smaller, although in the highest-performance learning rate region adding FOP to Adam outperforms the standard method. We hypothesize that this smaller difference is in large part due to unanticipated effects that preconditioning the gradient has on the adaptive moment estimation performed by Adam. We leave further investigation into this interaction for future work.

## 5 What is FOP learning?

By studying the learned preconditioning matrices, it’s possible to gain an intuition for the effect FOP has on the training process. Interestingly, we found visual tasks induced similar structures in the preconditioning matrices across initializations and across layers. Visualizing the matrices (Figure 5a) shows that they develop a decorrelating, or whitening, structure: each of the 9 positions in the 3x3 convolutional filter sends a strong positive weight to itself, and negative weights to its immediate neighbors, without wrapping over the corners of the filter. This is interesting, as images are known to have a high degree of spatial autocorrelation (barlow_1961; barlow_1989). As a mechanism for reducing redundant computation, whitening visual inputs is known to be beneficial for both retinal processing in the brain (attick_whitening_1992; hateren_1992) and in artificial networks (pascanu_naturalnets_2015; whitened_batchnorm_2018). However, it is more unusual to consider whitening of the learning signal, as observed in FOP, rather than the forward activation.

This learned pattern is accompanied by a shift in the norm of the curvature matrix, as the diagonal elements, initialized to ones, shrink in value, and the off-diagonal elements increase. This shift in distribution is visualized in Figure 5b, and grows stronger deeper in the network. It is possible that this indicates that standard gradient descent is more ill-conditioned in higher layers, or perhaps equivalently, that a greater degree of decorrelation is helpful for kernels disentangling higher-level representations.

We can also examine the eigenvalue spectra for the learned matrices across layers (Figure 5c). The basic requirement for a preconditioning matrix, that it be positive definite so as not to reverse the direction of the gradient, is met. However, we also note that the eigenvalues are very small in magnitude, indicating a near-zero determinant. This results in a matrix that is essentially non-invertible, an interesting property, as quasi-Newton and natural gradient methods seek to compute or approximate the inverse of either the Hessian or Fisher information matrix. The implications of this non-invertibility are avenues for future study. Furthermore, the eigenvalues grow smaller, and their distribution more uniform, higher in the network, in accordance with the pattern observed in Figure 5b. A uniform eigenspectrum is seen as an attribute of an effective preconditioning matrix (precond_sgd_2015), as it indicates an even convergence rate in parameter space.
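The kind of spectral check described above can be sketched with NumPy. The factor `M` below is a made-up example with small, uniform singular values, chosen only to show how a matrix can be positive definite while having a determinant close to zero; it is not data from the experiments.

```python
import numpy as np

def preconditioner_spectrum(M):
    """Eigenvalues (ascending) and log-determinant of P = M M^T."""
    P = M @ M.T
    eigvals = np.linalg.eigvalsh(P)        # real eigenvalues of a symmetric matrix
    sign, logdet = np.linalg.slogdet(P)    # numerically stable determinant
    return eigvals, sign, logdet

M = 0.05 * np.eye(9)                       # hypothetical 9 x 9 factor (3x3 kernels)
eigvals, sign, logdet = preconditioner_spectrum(M)
condition_number = eigvals[-1] / eigvals[0]
```

Using `slogdet` avoids underflow when the eigenvalues are very small, and the ratio of the largest to the smallest eigenvalue gives a simple measure of how uniform the spectrum is.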

## 6 Convergence

Our experiments indicate that the curvature matrices learned by FOP converge to a relatively fixed norm roughly two-thirds of the way through training (Figure 5b). This is important, as it indicates that the effective learning rate induced by the preconditioners stabilizes. This allows us to perform a preliminary convergence analysis of the algorithm in a manner analogous to hypergrad_2018.

Consider a modification of FOP in which the symmetric, positive-definite preconditioning matrix is rescaled at each iteration to have a certain norm , such that when is small and when is large, where is some chosen constant. Specifically, as in hypergrad_2018, we set , where is some function that decays over time and starts training at (e.g., ).

This formulation allows us to extend the convergence proof of hypergrad_2018 to FOP, under the same assumptions about the objective function :

###### Theorem 1

Suppose that is convex and -Lipschitz smooth with for some fixed and all model parameters . Then if and as , where the are generated by (non-stochastic) gradient descent.

Proof. We can write

where is the hypergradient learning rate. Our assumptions about the limiting behavior of then imply that and so as . For sufficiently large , we therefore have . Note also that as is symmetric and positive definite, it will not prevent the convergence of gradient descent (Supplementary Section A.2). Moreover, preliminary investigation showed that the angle of rotation induced by the FOP matrices is significantly below $90^\circ$ (Supplementary Figure A.1). Previous work by fa-2016 shows that such a rotation does not impede the convergence of gradient descent. Because standard SGD converges under these conditions (karimi_convergence_2016), FOP must as well.

## 7 Conclusion

In this paper, we introduced a novel optimization technique, FOP, that learns a preconditioning matrix online to improve convergence and generalization in large-scale machine learning models. We tested FOP on several problems and architectures, examined the nature of the learned transformations, and provided a preliminary analysis of FOP’s convergence properties (Section 6). There are a number of opportunities for future work, including learning transformations for other optimization parameters (e.g., the momentum parameter in the case of the momentum optimizer), expanding and generalizing our theoretical analysis, and testing FOP on a wider variety of models and data sets.

#### Acknowledgements

We would like to thank Kenneth Stanley, Jeff Clune, Rosanne Liu, and members of the Horizons and Deep Collective research groups at Uber AI for useful feedback and discussions.

## References

## Appendix A Supplemental Information

### A.1 Normalized FOP

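In order to control for the scaling induced by FOP and measure the effect of its rotation only, we introduce normalized FOP, which performs the following parameter update:

$$\theta^{(t+1)} = \theta^{(t)} - \epsilon \sqrt{n} \, \frac{M^{(t)} M^{(t)\top}}{\lVert M^{(t)} M^{(t)\top} \rVert_F} \nabla_\theta J^{(t)}, \tag{9}$$

where $M M^\top$ is the learned preconditioner, $\epsilon$ is the learning rate, and $n$ is the first dimension of $M$. This update has the effect of normalizing the preconditioner $M M^\top$, then re-scaling the update by $\sqrt{n}$ to match the norm of gradient descent, as $\lVert I_n \rVert_F = \sqrt{n}$, where $I_n$ is the identity matrix of size $n$.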
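A minimal sketch of the normalization step, assuming the Frobenius norm and a re-scaling factor of `sqrt(n)` so that an identity preconditioner reproduces plain gradient descent. The function name `normalized_fop_direction` is hypothetical.

```python
import numpy as np

def normalized_fop_direction(M, g):
    """Preconditioned gradient with the scale of P = M M^T removed."""
    P = M @ M.T
    n = M.shape[0]
    # Dividing by the Frobenius norm and multiplying by sqrt(n) maps the
    # identity preconditioner (norm sqrt(n)) to an unchanged gradient.
    return np.sqrt(n) * (P @ g) / np.linalg.norm(P, ord="fro")

g = np.array([2.0, -1.0, 0.5])
same_as_gd = normalized_fop_direction(np.eye(3), g)
scale_removed = normalized_fop_direction(3.0 * np.eye(3), g)
```

Any uniform scaling of `M` cancels out in this form, so only the rotation and relative scaling encoded by the preconditioner affect the update.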

### A.2 Convergence of Preconditioned Gradient Descent

We demonstrate that given a random, symmetric positive definite preconditioning matrix , gradient descent will still converge at a linear rate, modeling our proof after that of karimi_convergence_2016.

###### Theorem 2

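Consider a convex, $L$-Lipschitz smooth objective function $f$ with global minimum $f^*$ which obeys the Polyak-Łojasiewicz (PL) Inequality (polyak_1963),

$$\frac{1}{2} \lVert \nabla f(\theta) \rVert^2 \ge \mu \left( f(\theta) - f^* \right) \quad \text{for some } \mu > 0 \text{ and all } \theta. \tag{10}$$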

Then applying the gradient update method given by

(11) |

where is a real, symmetric positive semi-definite matrix and is the step size, where and are the minimum and maximum eigenvalues of , respectively, results in a global linear convergence rate given by

(12) |

Proof. First, assume a step-size of . Given that is -Lipschitz continuous, we can write

where denotes . Plugging in the gradient update equation gives

(13) |

Let be the eigendecomposition of , such that the columns of are the orthonormal eigenvectors and is a diagonal matrix whose entries are the eigenvalues of , which are all non-negative due to symmetric positive semi-definiteness of . We can then rewrite the first term of Equation 13 as

We now change our basis, letting and define and , where are the eigenvalues. Then we have

(14) |

Examining the second term of Equation 13, we have

(15) |

Combining the results of Equations 14 and 15 gives

(16) |

We can then revert to the original basis:

giving us

(17) |

By the PL inequality we have that for some . Plugging this in gives

(18) |

Then let . We have

Rearranging and subtracting from both sides gives

(19) |

Applying Equation 19 recursively gives the desired convergence:

(20) |

Thus this preconditioned gradient descent converges with a step size . Standard gradient descent converges with a step size under these assumptions. We also note that even if changes over the course of training, the required step-size will vary, but convergence will still occur.
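The linear convergence of preconditioned gradient descent can be checked numerically on a small quadratic. The step size and rate used below are assumptions obtained from the standard descent lemma combined with the PL inequality (bounding the linear term by the smallest eigenvalue of the preconditioner and the quadratic term by the square of the largest), and they may differ from the constants in the theorem. The matrices `H` and `P` are arbitrary illustrative choices.

```python
import numpy as np

H = np.array([[3.0, 1.0], [1.0, 2.0]])   # f(theta) = 0.5 theta^T H theta, f* = 0
P = np.array([[1.5, 0.3], [0.3, 0.8]])   # fixed symmetric positive-definite preconditioner

L = np.linalg.eigvalsh(H)[-1]            # smoothness constant of f
mu = np.linalg.eigvalsh(H)[0]            # PL constant for this quadratic
lam = np.linalg.eigvalsh(P)
lam_min, lam_max = lam[0], lam[-1]

alpha = lam_min / (L * lam_max ** 2)                  # assumed step size
rate = 1.0 - mu * lam_min ** 2 / (L * lam_max ** 2)   # assumed per-step contraction

def f(theta):
    return 0.5 * theta @ H @ theta

theta = np.array([1.0, -1.0])
values = [f(theta)]
for _ in range(50):
    theta = theta - alpha * P @ (H @ theta)           # preconditioned gradient step
    values.append(f(theta))
```

In this sketch the step size shrinks as the preconditioner becomes more ill-conditioned, which matches the remark that the required step size varies with the preconditioner while convergence is retained.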