SGD momentum optimizer with step estimation by online parabola model

Jarek Duda
Jagiellonian University, Golebia 24, 31-007 Krakow, Poland, Email: dudajar@gmail.com
Abstract

In stochastic gradient descent, especially for neural network training, first order methods currently dominate: they do not model the local distance to the minimum. This information, required for an optimal step size, is provided by second order methods; however, they bring many difficulties, starting with the full Hessian, whose number of coefficients is the square of the dimension D.

This article proposes a minimal step from the successful first order momentum method toward second order: online parabola modelling in just a single direction - the normalized direction v from the momentum method. It is done by estimating the linear trend of gradients in this direction: g ≈ λ(θ − p) for position θ and derivative g along v. Using linear regression, λ and p are MSE estimated by just updating four averages (of θ, g, θg, θ²) in the considered direction. Exponential moving averages allow here for inexpensive online estimation, weakening the contribution of old gradients. Controlling the sign of curvature λ, we can repel from saddles, in contrast to attraction in the standard Newton method. In the remaining directions, not considered in the second order model, we can simultaneously perform e.g. gradient descent.

Keywords: non-convex optimization, stochastic gradient descent, convergence, deep learning, Hessian, linear regression, saddle-free Newton method

I Introduction

In many optimization scenarios like neural network training, we search for a local minimum of an objective/loss function F(θ) of parameters θ, whose number D is often in millions. The real function is usually unknown, only modelled based on a finite size dataset. Due to its large size, there is often used the SGD (stochastic gradient descent) [1] philosophy: the dataset is split into minibatches used to calculate succeeding stochastic gradients g_t, which can be imagined as noisy (approximate) gradients of the objective: g_t ≈ ∇F(θ_t).

Efficient training, especially of deep neural networks, requires extraction and exploitation of statistical trends from such noisy gradients, calculated on subsets of samples. For example, their exponential moving averaging in the momentum method [2] tries to estimate the real gradient of the minimized objective function and use it for gradient descent. However, there remains the difficult problem of choosing the step size for such descent.
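To make this averaging concrete, below is a minimal NumPy sketch of momentum as an exponential moving average of noisy minibatch gradients; the noisy_gradient oracle and all constants are illustrative stand-ins, not part of the proposed method.

import numpy as np

def noisy_gradient(theta, rng, noise=0.5):
    """Hypothetical minibatch gradient oracle: gradient of F(theta)=0.5*|theta|^2 plus noise."""
    return theta + noise * rng.standard_normal(theta.shape)

rng = np.random.default_rng(0)
theta = np.ones(4)
m = np.zeros_like(theta)   # momentum: exponential moving average of gradients
mu, lr = 0.9, 0.1          # averaging rate and step size (illustrative values)

for t in range(100):
    g = noisy_gradient(theta, rng)
    m = mu * m + (1.0 - mu) * g   # smooth the noisy gradients
    theta -= lr * m               # gradient descent along the averaged direction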

Figure 1: General diagram of the considered approach. Left: we perform the momentum method to choose online the direction v for the 2nd order model; in the remaining directions we can e.g. simultaneously perform gradient descent. Right: in direction v we search for a linear trend of gradients using linear regression, and choose the step size according to this model - e.g. proportionally to the distance to the minimum of the modelled parabola. Grayness of the considered points represents their fading weights in the used exponential moving averages.
  initialize()   {and choose hyperparameters}
  warmup()   {find initial direction v and averages avg}
  repeat
     g ← grad(θ);   m ← μ m + (1−μ) g       {momentum}
     upd_avg(avg, θ, g, v)   {update θ_v, g_v, θ_v g_v, θ_v² averages}
     step(avg, θ, g, v)  {get parabola from avg, make step}
     v ← m/‖m‖        {update modeled direction}
  until step limit or convergence condition
Algorithm 1 OGR1d() {basic online grad. regr.}

For example, on a plateau we should greatly increase the step, while simultaneously being careful not to jump over a valley. We will use the linear trend of gradients to estimate the position of the bottom of such a valley, modelled as a parabola: the point where the linear trend of derivatives in the considered direction intersects zero, as visualized in Fig. 1, with basic pseudocode as Algorithm 1.

A linear trend can be estimated in an MSE (mean squared error) optimal way with linear regression, which requires four averages: here of θ, g, θg and θ², for the linear relation between the position θ and the first derivative g in the considered direction. Exponential moving averages will be used for inexpensive online updates and to reduce reliance on old gradients - they get exponentially weakening weights.

A linear trend of gradients is a second order model. Generally, higher than first order methods are often imagined as impractically costly; for example, the full Hessian would need D² coefficients. We focus here on the opposite end of the cost spectrum - modelling a parabola in only a single direction (parameterized by just 2 additional coefficients), for example the direction found by the momentum method: it suggests increased local activity, hence deserves a higher order model. The calculated gradient, beside updating the momentum and the parabola model, can also simultaneously be used for e.g. gradient descent in the remaining directions.
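The split between the modelled direction and the remaining ones is a plain projection; a small sketch with assumed NumPy arrays and a hypothetical helper name:

import numpy as np

def split_gradient(g, v):
    """Decompose gradient g into its component along unit direction v and the rest."""
    g_par = np.dot(g, v) * v   # part handled by the 1D second order model
    g_perp = g - g_par         # remaining directions: e.g. plain gradient descent
    return g_par, g_perp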

For low cost it would be preferable to estimate second order behavior from the stochastic gradients only. It is done for example in L-BFGS [3]. However, it estimates the inverted Hessian from just a few recent noisy gradients, leading to stability issues and a relatively large cost of processing all these large gradients in each step. In contrast, thanks to working on updated averages, this processing cost becomes practically negligible in the proposed online gradient linear regression approach. We should also get a better estimation: instead of just a few recent noisy gradients, here we use all of them, with exponentially weakening weights in the updated averages.

Another problem of many second order methods addressed here is attraction to saddles, e.g. by the standard Newton method; handling it can lead to large improvements, as shown e.g. in the saddle-free Newton (SFN) method article [4]. This repair requires controlling the signs of curvatures, which is relatively difficult and costly. In the presented approach it becomes simple, as we need to control the sign in only a single direction.

A natural extension is analogously performing such second order modelling in a few dimensions, which was the original approach [5]. The purpose of this separate article is to focus on the simplest case, as an introduction and for better understanding of the basics.

II 1D case with linear regression of derivatives

We would like to estimate second order behavior from a sequence of gradients: first order derivatives, whose linear behavior corresponds to the second order derivative. A basic approach is finite differences [6], estimating the Hessian H from two gradients:

$$H\,(\theta_2 - \theta_1) \approx \nabla F(\theta_2) - \nabla F(\theta_1)$$

However, we have noisy gradients here, hence we need to use many more than two of them to estimate the linear trend from their statistics. A standard tool for extraction of a linear trend is least squares linear regression: optimal in the MSE sense. Additionally, it is very convenient due to working on averages: we can replace them with exponential moving averages for online estimation and to weaken the contribution of less certain old gradients. Let us now focus on the one-dimensional case; its general d-dimensional expansion is discussed in [5].
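As a point of reference, here is a minimal sketch of this finite-difference idea restricted to a single direction, assuming some gradient oracle grad_F is available:

import numpy as np

def directional_curvature(grad_F, theta, v, eps=1e-3):
    """Estimate v^T H v by a finite difference of gradients along unit direction v."""
    g1 = grad_F(theta)
    g2 = grad_F(theta + eps * v)
    return np.dot(g2 - g1, v) / eps   # ~ second derivative of F along v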

II-A 1D static case - parabola approximation

Let us start with the 1D case: a static parabola model

$$f(\theta) = \frac{\lambda}{2}(\theta - p)^2 + f(p), \qquad f'(\theta) = \lambda\,(\theta - p),$$

and MSE optimization of its parameters (λ, p) for a sequence of positions and derivatives (θ_k, g_k), k = 1,…,T, with weights (w_k):

$$\arg\min_{\lambda,\,p}\ \sum_{k=1}^{T} w_k \left(g_k - \lambda(\theta_k - p)\right)^2.$$

For a parabola and times k = 1,…,T we can choose uniform weights w_k = 1/T. Later we will use exponential moving averages - reducing the weights of old noisy gradients, seeing such a parabola as only a local approximation. The necessary condition (neglecting the λ = 0 case) becomes:

$$\overline{g} = \lambda\,(\overline{\theta} - p), \qquad \overline{\theta g} = \lambda\,(\overline{\theta^2} - p\,\overline{\theta})$$
for averages:

$$\overline{\theta} = \sum_k w_k\,\theta_k,\quad \overline{g} = \sum_k w_k\, g_k,\quad \overline{\theta g} = \sum_k w_k\,\theta_k g_k,\quad \overline{\theta^2} = \sum_k w_k\,\theta_k^2. \qquad (1)$$
Their solution (least squares linear regression) is:

$$\lambda = \frac{\overline{\theta g} - \overline{\theta}\;\overline{g}}{\overline{\theta^2} - \overline{\theta}^{\,2}}, \qquad p = \overline{\theta} - \frac{\overline{g}}{\lambda}. \qquad (2)$$
Observe that the λ estimator is the covariance of (θ, g) divided by the variance of θ (positive if not all θ_k values are equal).
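A short sketch of this static estimation on a batch of noisy 1D derivatives with uniform weights; fit_parabola_1d follows formula (2), and the toy check uses a known quadratic:

import numpy as np

def fit_parabola_1d(thetas, grads):
    """Least-squares fit of the linear derivative trend g ~ lam*(theta - p)."""
    m_t, m_g = thetas.mean(), grads.mean()
    m_tg, m_tt = (thetas * grads).mean(), (thetas * thetas).mean()
    lam = (m_tg - m_t * m_g) / (m_tt - m_t ** 2)   # slope = cov / var, formula (2)
    p = m_t - m_g / lam                            # where the trend crosses zero
    return lam, p

# toy check: F(x) = 2*(x-1)^2, so F'(x) = 4*(x-1): lam ~ 4, p ~ 1
rng = np.random.default_rng(1)
xs = rng.uniform(-1.0, 3.0, 200)
gs = 4.0 * (xs - 1.0) + 0.3 * rng.standard_normal(200)
print(fit_parabola_1d(xs, gs))   # approximately (4.0, 1.0)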

II-B Online update by exponential moving average

The optimized function is generally not a parabola and should only be locally approximated this way. To focus on the local situation we can reduce the weights of old gradients. It is very convenient to use exponential moving averages with some β ∈ (0,1) for this purpose, as they can be inexpensively updated to get an online estimation of the local situation. Starting with all averages at 0, for x ∈ {θ, g, θg, θ²} we get the update rules:

$$\overline{x}_t = \beta\,\overline{x}_{t-1} + (1-\beta)\,x_t, \qquad \overline{x} = \frac{\overline{x}_t}{1 - \beta^{\,t}}. \qquad (3)$$

The 1 − β^t normalization is analogous e.g. to the bias correction in the ADAM method [7]; in later training it can be assumed to be just 1.
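A sketch of the online version of these averages following (3); the class name, the beta default and the scalar interface are illustrative assumptions:

class RunningAverages:
    """Exponential moving averages of theta, g, theta*g, theta^2 with bias correction."""
    def __init__(self, beta=0.9):
        self.beta = beta
        self.t = 0
        self.a = [0.0, 0.0, 0.0, 0.0]   # raw averages of theta, g, theta*g, theta^2

    def update(self, theta, g):
        self.t += 1
        b = self.beta
        for i, x in enumerate((theta, g, theta * g, theta * theta)):
            self.a[i] = b * self.a[i] + (1.0 - b) * x

    def corrected(self):
        corr = 1.0 - self.beta ** self.t    # bias correction, -> 1 in later training
        return [x / corr for x in self.a]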

II-C 1D linear regression based optimizer

Linear regression requires values in at least two points, hence at least one step (better a few) of warmup is needed - evolving θ using e.g. gradient descent while simultaneously updating the averages (3), starting from some initial θ⁰. Then we can start using the linear model for the derivative, g ≈ λ(θ − p), with parameters λ, p taken from the updated regression formula (2).

For λ > 0 curvature, the parabola has a minimum in p, so the modelled optimal position would be θ → p. However, as we do not have complete confidence in such a model and would like to work in an online setting, a safer step is θ → θ − γ(θ − p) for some γ ∈ (0,1] parameter describing trust in the model, which generally can vary e.g. depending on the estimated uncertainty of parameters. The natural γ = 1 choice corresponds to complete trust in the model. A lower γ would still give an exponential decrease of the distance from a fixed minimum.

For λ < 0, the parabola has a maximum instead - the second order method no longer suggests a position of minimum. Such directions are relatively rare in neural network training [8], especially as we focus here on the steepest descent direction. In many second order methods curvature signs are ignored - attracting to saddles, e.g. in the standard Newton method. Controlling the sign of λ here, we can handle these cases - there are two basic approaches [6]: switch to gradient descent there, or reverse the sign of the step from the second order method.

There are also λ ≈ 0 cases, which are problematic as they correspond to a very far predicted extremum in (2) - they require some clipping of the step size. Such a situation can correspond to a plateau, or to an inflection point: switching between convex and concave behavior. For plateaus we need to use large steps.

While it leaves opportunities for improvements, for simplicity we can for example use an SFN-like step: just reversing the step sign for λ < 0 directions. The applied clipping prevents problems in the λ ≈ 0 cases; alternatively we could e.g. replace sign(λ) with λ/√(λ² + ε):

$$\theta \;\to\; \theta - \gamma\,\mathrm{sign}(\lambda)\,(\theta - p) \qquad (4)$$

with an example of clipping: limiting the size of this step to some maximal value C.
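The resulting 1D step rule can be sketched as below; gamma is the trust parameter, the sign of lambda gives the SFN-like repulsion from maxima/saddles, and the clipping constant bounds steps for near-flat trends (all defaults are illustrative):

def parabola_step(theta, lam, p, gamma=0.5, clip=1.0, eps=1e-12):
    """SFN-like 1D step toward (for lam>0) or away from (lam<0) the modelled extremum p."""
    if abs(lam) < eps:
        return theta                               # degenerate trend: leave to gradient descent
    step = gamma * (1.0 if lam > 0 else -1.0) * (theta - p)
    step = max(-clip, min(clip, step))             # clip very large predicted moves (lam ~ 0)
    return theta - step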

III Momentum with online parabola methods

Having the above 1D approach, we can use it to model the behavior of our function as locally a parabola in a one-dimensional affine subspace of the space of parameters, while still performing first order e.g. gradient descent in the remaining directions.

There is freedom in choosing this emphasized direction v, but for better use of the additional cost of a higher order model we should choose a locally more promising direction - for example the one pointed to by the momentum method. Wanting a few-dimensional promising local subspace instead, we could obtain it e.g. from online-PCA [9] of recent gradients.
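If a few-dimensional subspace were wanted, one simple offline stand-in for online-PCA is an SVD of a buffer of recent gradients; a rough sketch (not the online-PCA algorithm of [9] itself):

import numpy as np

def top_gradient_directions(recent_grads, k=3):
    """Return k orthonormal directions capturing most variance of recent gradients."""
    G = np.stack(recent_grads)                 # shape (n_recent, D)
    G = G - G.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(G, full_matrices=False)
    return vt[:k]                              # rows: leading directions in parameter space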

III-A Common functions and basic OGR1d

Algorithms 2, 3, 4, 5 contain common functions, used e.g. in the basic one-dimensional OGR (online gradient regression) presented as Algorithm 1:

  • upd_avg(avg, θ, g, v) updates all averages (packed into the avg vector) based on the current position θ, gradient g and considered direction v,

  • step(avg, θ, g, v) finds the parameters of the linear trend of derivatives in direction v and uses them to perform a step in this direction. It also optionally performs first order e.g. gradient descent in the remaining directions,

  • initialize() chooses the initial θ and hyperparameters, sets the averages and momentum to zero,

  • warmup() uses steps of the momentum method to choose the initial direction v and averages avg.

Then the basic approach is presented as Algorithm 1: just regularly (online) update the direction v of the second order model according to the momentum method. However, such a modification of v assumes that the averages remain the same in the new direction - effectively rotating the second order model. Such rotation might be problematic and should be performed much more slowly than the updating of the averages.

The next two subsections suggest ways to improve it: a safer approach updating averages simultaneously for the old and new direction, and a less expensive shifting of the center of rotation for the updates of v.

   θ_v ← θ·v;   g_v ← g·v      {β and the averages are global variables here}
   for x ∈ {θ_v, g_v, θ_v g_v, θ_v²}: avg(x) ← β avg(x) + (1−β) x;   W ← β W + (1−β)      {Formula (3): update 4 averages and bias W}
Algorithm 2 upd_avg(avg, θ, g, v) {update of avg}
   λ ← (avg(θ_v g_v) − avg(θ_v) avg(g_v)) / (avg(θ_v²) − avg(θ_v)²);   p ← avg(θ_v) − avg(g_v)/λ      {Calculate trend (2) of derivatives along v from avg}
   θ ← θ − γ clip(sign(λ)(θ·v − p), C) v      {Step along v, clipping to [−C, C]}
   θ ← θ − η (g − (g·v) v)      {Optional gradient descent in remaining directions}
Algorithm 3 step(avg, θ, g, v)  {proper parameter update}
   θ ← θ⁰      {initial parameters}
   choose γ ∈ (0,1]      {step size: confidence in parabola model}
   choose β ∈ (0,1)      {forgetting rate for linear regression}
   choose μ ∈ (0,1)      {rate for momentum choosing direction}
   choose C      {for clipping - handling λ ≈ 0 situations}
   choose η      {for neglected directions gradient descent}
   choose T      {number of steps for warmup and stages}
   avg = (0,0,0,0,0)  {θ_v, g_v, θ_v g_v, θ_v² averages and bias W}
   m ← 0      {momentum}
Algorithm 4 initialize() {and choose hyperparameters}
  {Initial direction and averages using momentum method:}
  for i = 1 to T do
     g ← grad(θ);   m ← μ m + (1−μ) g;   θ ← θ − η m
  end for
  v ← m/‖m‖      {normalize and use to find averages:}
  for i = 1 to T do
     g ← grad(θ);   m ← μ m + (1−μ) g;   θ ← θ − η m
     upd_avg(avg, θ, g, v)     {find initial averages}
  end for
Algorithm 5 warmup() {initial direction v and avg}
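Putting Algorithms 1-5 together, here is a compact Python sketch of one way the OGR1d loop could look; all constants and the deliberately minimal warmup are illustrative assumptions rather than recommended settings:

import numpy as np

class OGR1d:
    """Rough sketch of the basic OGR1d loop (Algorithm 1); constants are illustrative."""

    def __init__(self, theta0, gamma=0.5, beta=0.9, mu=0.9, eta=0.01, clip=1.0):
        self.theta = np.asarray(theta0, dtype=float)
        self.gamma, self.beta, self.mu, self.eta, self.clip = gamma, beta, mu, eta, clip
        self.m = np.zeros_like(self.theta)   # momentum
        self.v = None                        # modelled direction (unit vector)
        self.avg = np.zeros(4)               # EMAs of theta_v, g_v, theta_v*g_v, theta_v^2
        self.t = 0                           # step counter for bias correction

    def _upd_avg(self, g):
        tv, gv = self.theta @ self.v, g @ self.v
        x = np.array([tv, gv, tv * gv, tv * tv])
        self.avg = self.beta * self.avg + (1.0 - self.beta) * x
        self.t += 1

    def _step(self, g):
        a_t, a_g, a_tg, a_tt = self.avg / (1.0 - self.beta ** self.t)  # bias-corrected
        g_par = (g @ self.v) * self.v
        self.theta -= self.eta * (g - g_par)          # gradient descent in remaining directions
        var = a_tt - a_t ** 2
        if var < 1e-12:
            return                                    # not enough spread along v yet
        lam = (a_tg - a_t * a_g) / var                # formula (2): slope of derivative trend
        if abs(lam) < 1e-12:
            return                                    # near-flat trend: skip the 1D step
        p = a_t - a_g / lam                           # modelled extremum position along v
        step = self.gamma * np.sign(lam) * ((self.theta @ self.v) - p)   # SFN-like step (4)
        step = float(np.clip(step, -self.clip, self.clip))
        self.theta -= step * self.v

    def update(self, g):
        g = np.asarray(g, dtype=float)
        self.m = self.mu * self.m + (1.0 - self.mu) * g   # momentum
        if self.v is None:                                # crude one-step warmup
            self.theta -= self.eta * g
            if np.linalg.norm(self.m) > 0.0:
                self.v = self.m / np.linalg.norm(self.m)
            return
        self._upd_avg(g)
        self._step(g)
        self.v = self.m / np.linalg.norm(self.m)          # update modelled direction

Used like a standard optimizer (opt.update(g) once per stochastic gradient), this mirrors the structure of Algorithm 1; a proper warmup as in Algorithm 5 would run the momentum method for T steps before trusting the regression.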

III-B Safe variant: updating averages for old and new direction

Algorithm 6 suggests a safe solution for modification of the modelled direction by simultaneously updating two sets of averages: for the previous direction (avgw for w), used for the proper step, and for the new one (avg for v). After T steps it switches to the new direction and starts building averages from zero for the next switch.

  initialize()   {and choose hyperparameters}
  warmup()   {find initial direction v and averages avg}
  repeat
     w ← v          {previous direction}
     v ← m/‖m‖      {new direction}
     avgw ← avg;   avg ← (0,0,0,0,0)
     for i = 1 to T do
        g ← grad(θ);   m ← μ m + (1−μ) g      {momentum}
        upd_avg(avgw, θ, g, w)  {update for previous direction}
        upd_avg(avg, θ, g, v)   {update for new direction}
        step(avgw, θ, g, w)  {step using previous direction}
     end for
  until step limit or convergence condition
Algorithm 6 OGR1ds() {safe online grad. regress.}
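The switching logic of Algorithm 6 can be sketched as follows, assuming a gradient oracle grad_F and a step_fn analogous to Algorithm 3 (or the parabola step sketched earlier); T and the rates are illustrative:

import numpy as np

def upd_avg(avg, beta, theta, g, v):
    """EMAs of theta_v, g_v, theta_v*g_v, theta_v^2 for direction v (as in Algorithm 2)."""
    tv, gv = theta @ v, g @ v
    return beta * avg + (1.0 - beta) * np.array([tv, gv, tv * gv, tv * tv])

def ogr1ds_stage(theta, m, w, v, avgw, grad_F, step_fn, T=20, beta=0.9, mu=0.9):
    """One stage of Algorithm 6: step along the old direction w, build averages for the new v."""
    avg = np.zeros(4)                            # fresh averages for the new direction
    for _ in range(T):
        g = grad_F(theta)
        m = mu * m + (1.0 - mu) * g              # momentum
        avgw = upd_avg(avgw, beta, theta, g, w)  # keep updating old-direction averages
        avg = upd_avg(avg, beta, theta, g, v)    # build averages for the new direction
        theta = step_fn(avgw, theta, g, w)       # the proper step uses the old direction
    return theta, m, avg                         # avg becomes avgw for the next stage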

III-C Faster option: shifting rotation center

The update of v in Algorithm 1 effectively rotates the second order model around the origin, which is usually far from the current position θ, hence such rotation can essentially damage the model. Such rotation is much safer if we shift its center of rotation closer, e.g. to the current position. For this purpose, instead of operating on the function F(θ), we can work on F(θ + c) for a chosen center c, which does not change gradients - it only shifts their positions.

We can periodically update this center to the current position, shifting the representation: replacing the projected positions θ·v with (θ − c)·v for the new center c. Denoting by δ the resulting shift of the projections, this requires modifying 3 of the averages (see the sketch after this list):

  • avg(θ_v) transforms to avg(θ_v) − δ,

  • avg(θ_v g_v) transforms to avg(θ_v g_v) − δ avg(g_v),

  • avg(θ_v²) transforms to avg(θ_v²) − 2δ avg(θ_v) + δ².
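These three transformations amount to shifting every stored position by the same delta; a tiny sketch together with a numerical check of the identities (layout of the four averages as in the earlier sketches):

import numpy as np

def shift_center(avg, delta):
    """Shift the projected positions by delta: theta_v -> theta_v - delta."""
    a_t, a_g, a_tg, a_tt = avg
    return np.array([a_t - delta,                              # avg(theta_v)
                     a_g,                                      # avg(g_v) unchanged
                     a_tg - delta * a_g,                       # avg(theta_v * g_v)
                     a_tt - 2.0 * delta * a_t + delta ** 2])   # avg(theta_v^2)

# numerical check on random data:
rng = np.random.default_rng(0)
tv, gv, d = rng.normal(size=100), rng.normal(size=100), 0.7
avg = np.array([tv.mean(), gv.mean(), (tv * gv).mean(), (tv * tv).mean()])
shifted = np.array([(tv - d).mean(), gv.mean(), ((tv - d) * gv).mean(), ((tv - d) ** 2).mean()])
assert np.allclose(shift_center(avg, d), shifted)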

Algorithm 7 is an example of such a modification of Algorithm 1.

  initialize()   {and choose hyperparameters}
  warmup()   {find initial direction v and averages avg}
  c ← θ      {center of rotation}
  repeat
     {Update direction v and center of rotation c:}
     δ ← (θ − c)·v;   c ← θ
     avg(θ_v) ← avg(θ_v) − δ
     avg(θ_v g_v) ← avg(θ_v g_v) − δ avg(g_v)
     avg(θ_v²) ← avg(θ_v²) − 2δ avg(θ_v) + δ²
     v ← m/‖m‖
     for i = 1 to T do
        g ← grad(θ);   m ← μ m + (1−μ) g     {momentum}
        upd_avg(avg, θ − c, g, v)   {update averages in the shifted representation}
        step(avg, θ − c, g, v)    {step for the shifted position θ − c}
     end for
  until step limit or convergence condition
Algorithm 7 OGR1dc() {centered rotations}

IV Conclusions and further work

While first order methods do not have a direct way of choosing the step size according to the local situation, a second order parabola model in the current direction can provide such an optimal step size. While it could be used in a separate line search, here it is suggested to be combined e.g. with the momentum method. Thanks to linear regression of gradients, we 1) get this information online, continuously adapting to the local situation, 2) use only gradients already required for the momentum method, 3) at practically negligible cost thanks to operating on averages.

Choosing the details like hyperparameters, which generally could evolve in time, is a difficult experimental problem which will require further work.

The general possibility of combining different optimization approaches seems promising and nearly unexplored, starting e.g. with momentum+SGD hybrid: rare large certain steps interleaved with frequent small noisy steps.

There is a popular technique of strengthening underrepresented coordinates, e.g. in ADAM [7], which might be worth combining with simple second order methods like the one discussed here. Such methods exploit simple exponential moving averages - here we get motivation for exploring further possible averages.

Having a successful second order method for a one-dimensional subspace, a natural research direction is increasing this dimension as discussed in [5], e.g. to OGR10d. Choosing a promising subspace (worth second order modelling) will require going from the momentum method to e.g. online-PCA [9] of recent gradients. The averages of θ and g become d-dimensional vectors, the averages of θgᵀ and θθᵀ become d×d matrices, with estimated Hessian H ≈ (avg(gθᵀ) − avg(g) avg(θ)ᵀ)(avg(θθᵀ) − avg(θ) avg(θ)ᵀ)⁻¹.
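A sketch of that d-dimensional generalization, estimating the Hessian by multi-dimensional linear regression of gradients against positions (batch averages for clarity; this is the natural analog of covariance over variance, my reading rather than code from [5]):

import numpy as np

def hessian_estimate(thetas, grads):
    """Estimate H from positions and gradients via multi-dimensional linear regression."""
    t_mean, g_mean = thetas.mean(axis=0), grads.mean(axis=0)
    dt, dg = thetas - t_mean, grads - g_mean
    cov_gt = dg.T @ dt / len(thetas)           # d x d: covariance of gradients with positions
    cov_tt = dt.T @ dt / len(thetas)           # d x d: covariance of positions
    return cov_gt @ np.linalg.inv(cov_tt)      # H ~ cov(g, theta) cov(theta, theta)^{-1}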

References

  • [1] H. Robbins and S. Monro, “A stochastic approximation method,” in Herbert Robbins Selected Papers.  Springer, 1985, pp. 102–109.
  • [2] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning representations by back-propagating errors,” Nature, vol. 323, no. 6088, p. 533, 1986.
  • [3] D. C. Liu and J. Nocedal, “On the limited memory BFGS method for large scale optimization,” Mathematical Programming, vol. 45, no. 1-3, pp. 503–528, 1989.
  • [4] Y. N. Dauphin, R. Pascanu, C. Gulcehre, K. Cho, S. Ganguli, and Y. Bengio, “Identifying and attacking the saddle point problem in high-dimensional non-convex optimization,” in Advances in Neural Information Processing Systems, 2014, pp. 2933–2941.
  • [5] J. Duda, “Improving SGD convergence by tracing multiple promising directions and estimating distance to minimum,” arXiv preprint arXiv:1901.11457, 2019.
  • [6] J. Martens, “Deep learning via Hessian-free optimization,” in International Conference on Machine Learning (ICML), 2010.
  • [7] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
  • [8] G. Alain, N. L. Roux, and P.-A. Manzagol, “Negative eigenvalues of the Hessian in deep neural networks,” arXiv preprint arXiv:1902.02366, 2019.
  • [9] C. Boutsidis, D. Garber, Z. Karnin, and E. Liberty, “Online principal components analysis,” in Proceedings of the Twenty-Sixth Annual ACM-SIAM Symposium on Discrete Algorithms.  Society for Industrial and Applied Mathematics, 2015, pp. 887–901.