SGD momentum optimizer with
with step estimation by online parabola model
Abstract
In stochastic gradient descent, especially for neural network training, there are currently dominating first order methods: not modeling local distance to minimum. This information required for optimal step size is provided by second order methods, however, they have many difficulties, starting with full Hessian having square of dimension number of coefficients.
This article proposes a minimal step from successful first order momentum method toward second order: online parabola modelling in just a single direction: normalized from momentum method. It is done by estimating linear trend of gradients in direction: such that for , , . Using linear regression, , are MSE estimated by just updating four averages (of , , , ) in the considered direction. Exponential moving averages allow here for inexpensive online estimation, weakening contribution of the old gradients. Controlling sign of curvature , we can repel from saddles in contrast to attraction in standard Newton method. In the remaining directions: not considered in second order model, we can simultaneously perform e.g. gradient descent.
ptptptptptptptptptptptptptptptptptptptptptptptptptptptptptptptptptptptptptptptpt
SGD momentum optimizer with
with step estimation by online parabola model
Jarek Duda
Jagiellonian University, Golebia 24, 31007 Krakow, Poland, Email: dudajar@gmail.com
Keywords: nonconvex optimization, stochastic gradient descent, convergence, deep learning, Hessian, linear regression, saddlefree Newton method
I Introduction
In many optimization scenairios like neural network trainig, we search for a local minimum of objective/loss function of parameters , which number is often in millions. The real function is usually unknown, only modelled based on a size dataset: . Due to its large size, there is often used SGD (stochastic gradient descent) [1] philosophy: dataset is split into minibatches used to calculate succeeding stochastic gradients , which can be imagined as noisy (approximate) gradients:
Efficient training especially of deep neural network requires extraction and exploitation of statistical trends from such noisy gradients: calculated on subsets of samples. Their e.g. exponential moving averaging in momentum method [2] tries to estimate the real gradient of minimized objective function and use it for gradient descent. However, there remains a difficult problem of choosing the step size for such descent.
For example in a plateau we should greatly increase the step, simultaneously being careful not to jump over a valley. We will use linear trend of gradients to estimate position of bottom of such valley as parabola: where the linear trend of derivatives in the considered direction intersects zero, as visualized in Fig. 1 with basic pseudocode as Algorithm 1.
Linear trend can be estimated in MSE (mean squared error) optimal way with linear regression, which requires four averages: here of , , , for the linear relation between position and the first derivative . There will be used exponential moving averages for inexpensive online update and to reduce reliance on the old gradients  they have exponentially weakening weights.
Linear trend of gradients is a second order model. Generally, higher than first order methods are often imagined as impractically costly, for example full Hessian would need coefficients. We focus here on the opposite end of cost spectrum  only model parabola in a single direction (parameterized by just 2 additional coefficients), for example in direction found by the momentum method: suggesting increased local activity, hence deserving a higher order model. Calculated gradient, beside updating momentum and parabola model, can be also simultaneously used for e.g. gradient descent in the remaining directions.
For low cost it would be preferred to estimate second order behavior from the stochastic gradients only. It is done for example in LBFGS [3]. However, it estimates inverted Hessian from just a few recent noisy gradients: leading to stability issues and having relatively large cost of processing all these large gradients in each step. In contrast, thanks to working on updated averages, this processing cost becomes practically negligible in the proposed online gradient linear regression approach. We should also get a better estimation as instead of just a few recent noisy gradients, here we are using all of them with exponentially weakening weights in updated averages.
Another addressed here problem of many second order methods is attraction to saddles e.g. by standard Newton method, which handling can lead to large improvements as shown e.g. in saddlefree Newton (SFN) method article [4]. This repairment requires to control the signs of curvatures, what is relatively difficult and costly. In the presented approach it becomes simple as we need to control it in only a single direction.
A natural extension is analogously performing such second order modelling in a few dimensions, which was the original approach [5]. The purpose of this separate article is focusing on the simplest case for introduction and better understanding of the basics.
Ii 1D case with linear regression of derivatives
We would like to estimate second order behavior from a sequence of gradients: first order derivatives, which linear behavior corresponds to second order derivative. A basic approach is finite differences [6], for Hessian :
However, we have noisy gradients here, hence we need to use much more than two of them to estimate linear trend from their statistics. A standard tool for extraction of linear trend is least squares linear regression: optimal in MSE way. Additionally, it is very convenient due to working on averages: we can replace it with exponential moving averages for online estimation and to weaken contribution of less certain old gradients. Let us now focus on dimensional case, its general dimensional expansion is discussed in [5].
Iia 1D static case  parabola approximation
Let us start with 1D case  static parabola model as:
and MSE optimizing its parameters for sequence:
For parabola and times we can choose uniform weights . Later we will use exponential moving average  reducing weights of old noisy gradients, seeing such parabola as only local approximation. The necessary condition (neglecting case) becomes:
for averages:
(1) 
Their solution (least squares linear regression) is:
(2) 
Observe that estimator is covariance divided by variance of (positive if not all values are equal).
IiB Online update by exponential moving average
The optimized function is rather not a parabola, should be only locally approximated this way. To focus on local situation we can reduce weights of the old gradients. It is very convenient to use exponential moving averages for some for this purpose as they can be inexpensively updated to get online estimation of local situation. Starting with all 0 values for , for we get update rules:
IiC 1D linear regression based optimizer
Linear regression requires values in at least two points, hence there is needed at least one step (better a few) warmup  evolving using e.g. gradient descent, simultaneously updating averages (3), starting from initial . Then we can start using linear model for derivative: , using updated parameters from (2) regression formula.
Getting curvature, the parabola has minimum in , the modeled optimal position would be . However, as we do not have a complete confidence in such model, would like to work in online setting, a safer step is for some parameter describing trust in the model, which generally can vary e.g. depending on estimated uncertainty of parameters. Natural gradient method corresponds to complete trust. Lower would still give exponential decrease of distance from a fixed minimum.
Getting , parabola has maximum instead  second order method does not longer suggest a position of minimum. Such directions are relatively rare in neural network training [8], especially focusing on the steepest descent direction here. In many second order methods curvature signs are ignored  attracting to saddles e.g. in standard Newton method. Controlling sign of here, we can handle these cases  there are two basic approaches [6]: switch to gradient descent there, or reverse sign of step from second order method.
There are also cases, which are problematic as corresponding to very far predicted extremum in (2)  require some clipping of step size. Such situation can correspond to plateau, or to inflection point: switching between convex and concave behavior. For plateaus we need to use large steps.
While it leaves opportunities for improvements, for simplicity we can for example use SFNlike step: just reversing sign for directions. Applied clipping prevents cases, alternatively we could e.g. replace sign with :
(4) 
with example of clipping: .
Iii Momentum with online parabola methods
Having above 1D approach we can use it to model behavior of our function as locally parabola in dimensional affine space of space of parameters, still performing first order e.g. gradient descent in the remaining directions.
There is a freedom of choosing this emphasized direction , but for better use of such additional cost of higher order model we should choose a locally more promising direction  for example pointed by momentum method. Wanting a few dimensional promising local subspace instead, we could obtain them e.g. from onlinePCA [9] of recent gradients.
Iiia Common functions and basic OGR1d
Algorithms 2, 3, 4, 5 contain common functions, used e.g. in basic dimensional OGR (online gradient regression) as Algorithm 1:

upd_avg(avg, ) updates all averages (packed into avg vector) based on current position , gradient and considered direction ,

step(avg, ) finds parameters of linear trend of derivatives in direction and use them to perform step in this direction. It also optionally performs first order e.g. gradient descent in the remaining directions,

initialize() chooses initial , hyperparameters, sets averages and momentum to zero,

warmup() uses steps of momentum method to choose initial direction and averages avg.
Then the basic approach is presented as Algorithm 1: just regularly (online) update the direction of second order model accordingly to momentum method. However, such modification of assumes that averages remain the same in the new direction  effectively rotating the second order model. Such rotation might be problematic, should be performed much slower than updating the averages .
The next two subsections suggest ways to improve it: safer approach updating averages simultaneously for the old and new direction, and less expensive shifting center of rotation for updates of .
IiiB Safe variant: updating averages for old and new direction
Algorithm 6 suggests a safe solution for modification of modelled direction by simultaneously updating two sets of averages: for the previous direction (avgw for ) used for the proper step, and for the new one (avg for ). After steps it switches to the new direction and starts building from zero averages for the next switch.
IiiC Faster option: shifing rotation center
Update of in Algorithm 1 effective rotates second order model around , which is usually far from the current position , hence such rotation can essentially damage the model. Such rotation is much safer if shifting its center of rotation closer, e.g. to . For this purpose, instead of operating on function, we can work on , what does not change gradients  only shifts their positions.
We can periodically update this center to current position, shifting representation: replacing projection with for . This shift requires to modify 3 of averages:

transforms to

transforms to

transforms to .
Algorithm 7 is example of such modification of Algorithm 1.
Iv Conclusions and further work
While first order methods do not have a direct way for choosing step size accordingly to local situation, second order parabola model in current direction can provide such optimal step size. While it could be used in a separate line search, here is suggested to be combined e.g. with momentum method. Thanks to linear regression of gradients, 1) get this information online: continuously adapting to local situation, 2) using only gradients already required for momentum method, 3) in practically negligible cost thanks to operating on averages.
Choosing the details like hyperparameters, which generally could evolve in time, is a difficult experimental problem which will require further work.
The general possibility of combining different optimization approaches seems promising and nearly unexplored, starting e.g. with momentum+SGD hybrid: rare large certain steps interleaved with frequent small noisy steps.
There is popular technique of strengthening underrepresented coordinates e.g. in ADAM [7], which might be worth combining with simple second order methods like discussed. They exploit simple exponential moving averages  here we got motivation for exploring further possible averages.
Getting a successful second order method for dimensional subspace, a natural research direction will be increasing this dimension discussed in [5], e.g. to OGR10d. Choosing promising subspace (worth second order modelling) will require going from momentum method to e.g. onlinePCA [9] of recent gradients. Averages , become dimensional vectors, and become dimensional matrices, with estimated Hessian: .
References
 [1] H. Robbins and S. Monro, “A stochastic approximation method,” in Herbert Robbins Selected Papers. Springer, 1985, pp. 102–109.
 [2] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning representations by backpropagating errors,” nature, vol. 323, no. 6088, p. 533, 1986.
 [3] D. C. Liu and J. Nocedal, “On the limited memory bfgs method for large scale optimization,” Mathematical programming, vol. 45, no. 13, pp. 503–528, 1989.
 [4] Y. N. Dauphin, R. Pascanu, C. Gulcehre, K. Cho, S. Ganguli, and Y. Bengio, “Identifying and attacking the saddle point problem in highdimensional nonconvex optimization,” in Advances in neural information processing systems, 2014, pp. 2933–2941.
 [5] J. Duda, “Improving sgd convergence by tracing multiple promising directions and estimating distance to minimum,” arXiv preprint arXiv:1901.11457, 2019.
 [6] J. Martens, “Deep learning via hessianfree optimization.” 2010.
 [7] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
 [8] G. Alain, N. L. Roux, and P.A. Manzagol, “Negative eigenvalues of the hessian in deep neural networks,” arXiv preprint arXiv:1902.02366, 2019.
 [9] C. Boutsidis, D. Garber, Z. Karnin, and E. Liberty, “Online principal components analysis,” in Proceedings of the twentysixth annual ACMSIAM symposium on Discrete algorithms. Society for Industrial and Applied Mathematics, 2015, pp. 887–901.