SGD momentum optimizer withwith step estimation by online parabola model

SGD momentum optimizer with with step estimation by online parabola model

Abstract

In stochastic gradient descent, especially for neural network training, there are currently dominating first order methods: not modeling local distance to minimum. This information required for optimal step size is provided by second order methods, however, they have many difficulties, starting with full Hessian having square of dimension number of coefficients.

This article proposes a minimal step from successful first order momentum method toward second order: online parabola modelling in just a single direction: normalized from momentum method. It is done by estimating linear trend of gradients in direction: such that for , , . Using linear regression, , are MSE estimated by just updating four averages (of , , , ) in the considered direction. Exponential moving averages allow here for inexpensive online estimation, weakening contribution of the old gradients. Controlling sign of curvature , we can repel from saddles in contrast to attraction in standard Newton method. In the remaining directions: not considered in second order model, we can simultaneously perform e.g. gradient descent. There is also discussed its learning rate approximation as , allowing e.g. for adaptive SGD - with learning rate separately optimized (2nd order) for each parameter.

Keywords: non-convex optimization, stochastic gradient descent, convergence, deep learning, Hessian, linear regression, saddle-free Newton method

I Introduction

In many optimization scenairios like neural network trainig, we search for a local minimum of objective/loss function of parameters , which number is often in millions. The real function is usually unknown, only modelled based on a size dataset: . Due to its large size, there is often used SGD (stochastic gradient descent) [1] philosophy: dataset is split into minibatches used to calculate succeeding stochastic gradients , which can be imagined as noisy (approximate) gradients:

 ∇θF(→θt)≈→gt=1|%minibatch|∑i∈minibatch∇θFi(→θt)

Efficient training especially of deep neural network requires extraction and exploitation of statistical trends from such noisy gradients: calculated on subsets of samples. Their e.g. exponential moving averaging in momentum method [2] tries to estimate the real gradient of minimized objective function and use it for gradient descent. However, there remains a difficult problem of choosing the step size for such descent.

For example in a plateau we should greatly increase the step, simultaneously being careful not to jump over a valley. We will use linear trend of gradients to estimate position of bottom of such valley as parabola: where the linear trend of derivatives in the considered direction intersects zero, as visualized in Fig. 1 with basic pseudocode as Algorithm 1.

Linear trend can be estimated in MSE (mean squared error) optimal way with linear regression, which requires four averages: here of , , , for the linear relation between position and the first derivative . There will be used exponential moving averages for inexpensive online update and to reduce reliance on the old gradients - they have exponentially weakening weights.

There will be also discussed approximation which should resolve necessity of clipping, additionally simplifying the method to learning rate, e.g. for adaptive SGD: inexpensive online optimization of learning rate separately for each parameter.

Linear trend of gradients is a second order model. Generally, higher than first order methods are often imagined as impractically costly, for example full Hessian would need coefficients. We focus here on the opposite end of cost spectrum - only model parabola in a single direction (parameterized by just 2 additional coefficients), for example in direction found by the momentum method: suggesting increased local activity, hence deserving a higher order model. Calculated gradient, beside updating momentum and parabola model, can be also simultaneously used for e.g. gradient descent in the remaining directions.

For low cost it would be preferred to estimate second order behavior from the stochastic gradients only. It is done for example in L-BFGS [3]. However, it estimates inverted Hessian from just a few recent noisy gradients: leading to stability issues and having relatively large cost of processing all these large gradients in each step. In contrast, thanks to working on updated averages, this processing cost becomes practically negligible in the proposed online gradient linear regression approach. We should also get a better estimation as instead of just a few recent noisy gradients, here we are using all of them with exponentially weakening weights in updated averages.

Another addressed here problem of many second order methods is attraction to saddles e.g. by standard Newton method, which handling can lead to large improvements as shown e.g. in saddle-free Newton (SFN) method article [4]. This repairment requires to control the signs of curvatures, what is relatively difficult and costly. In the presented approach it becomes simple as we need to control it in only a single direction.

A natural extension is analogously performing such second order modelling in a few dimensions, which was the original approach [5]. The purpose of this separate article is focusing on the simplest case for introduction and better understanding of the basics.

Ii 1D case with linear regression of derivatives

We would like to estimate second order behavior from a sequence of gradients: first order derivatives, which linear behavior corresponds to second order derivative. A basic approach is finite differences [6], for Hessian :

 H→v=limϵ→0(∇f(→θ+ϵ→v)−∇f(→θ))/ϵ

However, we have noisy gradients here, hence we need to use much more than two of them to estimate linear trend from their statistics. A standard tool for extraction of linear trend is least squares linear regression: optimal in MSE way. Additionally, it is very convenient due to working on averages: we can replace it with exponential moving averages for online estimation and to weaken contribution of less certain old gradients. Let us now focus on dimensional case, its general -dimensional expansion is discussed in [5].

Ii-a 1D static case - parabola approximation

 f(θ)=h+12λ(θ−p)2f′(θ)=λ(θ−p)

and MSE optimizing its parameters for sequence:

 argminλ,p∑twt(gt−λ(θt−p))2for some weights (wt)

For parabola and times we can choose uniform weights . Later we will use exponential moving average - reducing weights of old noisy gradients, seeing such parabola as only local approximation. The necessary condition (neglecting case) becomes:

 ∑twt(gt−λ(θt−p))=0=∑twt(θt−p)(gt−λθt+λp)
 ¯¯¯g−λ¯¯¯θ+sλp=0=¯¯¯¯¯θg−λ¯¯¯¯¯θ2+2λp¯¯¯θ−p¯¯¯g−sλp2

for averages:

 s=∑twt¯¯¯θ=∑twtθt¯¯¯g=∑twtgt
 ¯¯¯¯¯gθ=∑twtgtθt¯¯¯¯¯θ2=∑twt(θt)2 (1)

Their solution (least squares linear regression) is:

 λ=s¯¯¯¯¯gθ−¯¯¯g¯¯¯θs¯¯¯¯¯θ2−¯¯¯θ2p=λ¯¯¯θ−¯¯¯gsλ=¯¯¯¯¯θ2¯¯¯g−¯¯¯θ¯¯¯¯¯gθ¯¯¯θ¯¯¯g−s¯¯¯¯¯gθ (2)

Observe that estimator is covariance divided by variance of (positive if not all values are equal).

Ii-B Online update by exponential moving average

The optimized function is rather not a parabola, should be only locally approximated this way. To focus on local situation we can reduce weights of the old gradients. It is very convenient to use exponential moving averages for some for this purpose as they can be inexpensively updated to get online estimation of local situation. Starting with all 0 values for , for we get update rules:

 ¯¯¯θt=β¯¯¯θt−1+(1−β)θt=(1−β)t∑t′=1βt−t′θt′
 ¯¯¯gt=β¯¯¯gt−1+(1−β)gt¯¯¯¯¯gθt=β¯¯¯¯¯gθt−1+(1−β)gtθt
 ¯¯¯¯¯θ2t=β¯¯¯¯¯θ2t−1+(1−β)(θt)2
 st=(1−β)t∑t′=1βt−t′=1−βt=βst−1+(1−β) (3)

The is analogous e.g. to bias in ADAM method [7], in later training it can be assumed as just .

Ii-C 1D linear regression based optimizer

Linear regression requires values in at least two points, hence there is needed at least one step (better a few) warmup - evolving using e.g. gradient descent, simultaneously updating averages (3), starting from initial . Then we can start using linear model for derivative: , using updated parameters from (2) regression formula.

Getting curvature, the parabola has minimum in , the modeled optimal position would be . However, as we do not have a complete confidence in such model, would like to work in online setting, a safer step is for some parameter describing trust in the model, which generally can vary e.g. depending on estimated uncertainty of parameters. Natural gradient method corresponds to complete trust. Lower would still give exponential decrease of distance from a fixed minimum.

Getting , parabola has maximum instead - second order method does not longer suggest a position of minimum. Such directions are relatively rare in neural network training [8], especially focusing on the steepest descent direction here. In many second order methods curvature signs are ignored - attracting to saddles e.g. in standard Newton method. Controlling sign of here, we can handle these cases - there are two basic approaches [6]: switch to gradient descent there, or reverse sign of step from second order method.

There are also cases, which are problematic as corresponding to very far predicted extremum in (2) - require some clipping of step size. Such situation can correspond to plateau, or to inflection point: switching between convex and concave behavior. For plateaus we need to use large steps.

While it leaves opportunities for improvements, for simplicity we can for example use SFN-like step: just reversing sign for directions. Applied clipping prevents cases, alternatively we could e.g. replace sign with :

 θt+1=θt+αsign(λt)C(pt−θt) (4)

with example of clipping: .

Iii Momentum with online parabola methods

Having above 1D approach we can use it to model behavior of our function as locally parabola in dimensional affine space of space of parameters, still performing first order e.g. gradient descent in the remaining directions.

There is a freedom of choosing this emphasized direction , but for better use of such additional cost of higher order model we should choose a locally more promising direction - for example pointed by momentum method. Wanting a few -dimensional promising local subspace instead, we could obtain them e.g. from online-PCA [9] of recent gradients.

Iii-a Common functions and basic OGR1d

Algorithms 2, 3, 4, 5 contain common functions, used e.g. in basic dimensional OGR (online gradient regression) as Algorithm 1:

• upd_avg(avg, ) updates all averages (packed into avg vector) based on current position , gradient and considered direction ,

• step(avg, ) finds parameters of linear trend of derivatives in direction and use them to perform step in this direction. It also optionally performs first order e.g. gradient descent in the remaining directions,

• initialize() chooses initial , hyperparameters, sets averages and momentum to zero,

• warmup() uses steps of momentum method to choose initial direction and averages avg.

Then the basic approach is presented as Algorithm 1: just regularly (online) update the direction of second order model accordingly to momentum method. However, such modification of assumes that averages remain the same in the new direction - effectively rotating the second order model. Such rotation might be problematic, should be performed much slower than updating the averages .

The next two subsections suggest ways to improve it: safer approach updating averages simultaneously for the old and new direction, and less expensive shifting center of rotation for updates of .

Iii-B Safe variant: updating averages for old and new direction

Algorithm 6 suggests a safe solution for modification of modelled direction by simultaneously updating two sets of averages: for the previous direction (avgw for ) used for the proper step, and for the new one (avg for ). After steps it switches to the new direction and starts building from zero averages for the next switch.

Iii-C Faster option: shifing rotation center

Update of in Algorithm 1 effective rotates second order model around , which is usually far from the current position , hence such rotation can essentially damage the model. Such rotation is much safer if shifting its center of rotation closer, e.g. to . For this purpose, instead of operating on function, we can work on , what does not change gradients - only shifts their positions.

We can periodically update this center to current position, shifting representation: replacing projection with for . This shift requires to modify 3 of averages:

• transforms to

• transforms to

• transforms to .

Algorithm 7 is example of such modification of Algorithm 1, in practice we can shift the center in every step by using .

Iv Learning rate estimation as σθ/σg

Linear regression of gradients has lead to (2) formula for estimated curvature and position of extremum . Let us rewrite it for and sum of weights in the used averages:

 λ−1=¯¯¯¯¯θ2−¯¯¯θ2¯¯¯¯¯gθ−¯¯¯g¯¯¯θp=¯¯¯θ−λ−1¯¯¯g (5)

This formula is similar to gradient descent using learning rate, but starting from averaged position and using averaged gradient. Wanting to use more standard: last position instead, we can do it maintaining the slope if using the last gradient instead of averaged, as visualized in Fig. 2:

 p=θt−λ−1gt (6)

Observe that in (5) is (weighted) variance of , divided by covariance of , . Let us define their (weighted) correlation in standard way:

 corr(g,θ)=¯¯¯¯¯gθ−¯¯¯g¯¯¯θσgσθ∈[−1,1] (7)
 forσθ=√¯¯¯¯¯θ2−¯¯¯θ2σg=√¯¯¯¯¯g2−¯¯¯g2

where standard comes from Cauchy-Schwartz inequality , also applying to the discussed weighted case (by using ).

It allows to write as:

 λ−1=ηcorr(g,θ)forη=σθσg (8)

Thanks to , we get lower bound for by using practical approximation instead as in Fig. 2. This way using learning rate instead, it is always positive: step direction can only point downhill also for negative curvature (repelling saddles), and we have removed looking problematic increase of step size with growing noise. To include such noise dependence, also expressing incomplete trust in parabola model (the real function is not parabola), we could e.g. try to tune additional hyperparmeter and use proper step:

 θt+1=θt−αηtgt (9)

Iv-a Adaptive SGD: coordinate-wise optimized learning rate

While we could apply such learning rate to a single adapted direction as previously, thanks to simplicity we can alternatively just model and apply it individually for each parameter . Such ASGD example is presented in Algorithm 8 as optimization for standard small batch SGD by updating four exponential moving averages separately for each parameter - getting a method resembling standard ones like RMSprop [10] or ADAM [7], but finally being second order: having potential to minimize parabola in one step, independently for each coordinate.

The details have yet to be optimized based on experiments. Some remarks:

• There might be additionally needed some tiny positive in denominator of proper step to prevent its zeroing (use ), but especially for small batches this variance should be relatively large as it also includes variance of gradient estimation (discussed in the next subsection). It might be also worth to consider using negative to try to subtract contribution of this noise.

• , are some initial dispersions (e.g. 1), is initial learning rate - which can be chosen large at the beginning, then it will be adapted from the data in the first few steps.

• There might be needed some clipping of learning rate - but rather from below to prevent approaching zero learning rate (e.g. for some small ), the next subsection discusses this issue.

• The and are estimated local dispersions by updating exponential averages of value and squared value like and . Instead, alternative approach for the latter might be directly updating variance as central moment, e.g. with not necessarily equal rate.

While nominator seems not having been considered in literature before, it is popular to use related denominator started probably by RMSprop [10]. Its motivation is rater to strengthen evolution of rare coordinates. For 2nd order modelling it is crucial to also subtract to get variance evaluating dispersion, instead of just squared distance from zero.

Iv-B Including noise in learning rate approximations

In SGD instead of knowing the real gradients (denote them as ), we work on noisy gradients due to finite batch size - we can assume that they are e.g. from Gaussian distribution of standard deviation (which can vary between parameters , generally also time ):

 gti∼N(ˇgti,σi)σi∝1√batch size (10)

This gradient noise is included in the above learning rate estimation, artificially increasing denominator - to remove it we would need to use denominator instead, but it seems difficult to realize, especially when this difference is close to zero.

Such artificially increased denominator can lead to nearly zero learning rate near minimum. It is not necessarily bad, e.g. should give tendency to find wider minima - which are believed to be better from generalization perspective [11]. It can be also prevented e.g. by using instead for some small determining minimal learning rate. However, e.g. when we would like to remain in a varying minimum for some continuously optimized parameters [12], this error of gradient estimation becomes dominant.

Let us discuss a more sophisticated way for estimating learning rate from linear regression of gradients - including this noise from gradient estimation. Let us derive variance of assuming variance of is for weighted case. For simplicity let us shift to the center for a moment: assume , hence :

 var(λ)=var(∑iwigiθi¯¯¯¯¯θ2)=σ2¯¯¯¯¯θ22∑iw2iθ2i (11)

In the used exponential moving average for , leading to - we can approximate sum in (11) with its product with (assuming variance of is locally constant, could be also calculated more accurately), leading to formula for general :

 √var(λ)≈σ√¯¯¯¯¯θ2−¯¯¯θ2√1−β1+β (12)

We can assume that is approximately from Gaussian distribution of the above standard deviation. To prevent problematic , we can shift it from 0 by for some determining confidence level in units of standard deviations of Gaussian, multiplied by . Additionally, taking absolute value to repel from saddles, and multiplying by parameter describing confidence in parabola model (as the real function is not exactly parabola), we get a looking safe practical approximation of the original learning rate:

 η=α⎛⎜ ⎜⎝|¯¯¯¯¯gθ−¯¯¯g¯¯¯θ|¯¯¯¯¯θ2−¯¯¯θ2+γσ√¯¯¯¯¯θ2−¯¯¯θ2⎞⎟ ⎟⎠−1learning% rate (13)

We additionally need to know , which could be estimated e.g. by looking at variance of gradients in a single point. More sophisticated way is adapting it e.g. by approximating the real gradient using the found linear regression: where , getting:

 g−ˇg≈g−λ(θ−(¯¯¯θ−λ−1¯¯¯g))=g−¯¯¯g−λ(θ−¯¯¯θ)

Algorithm 9 is example of its realization with such adaptation of gradient estimation error, a simpler way is using fixed e.g. from gradient variance in one point.

Iv-C Estimation and diagonalization of Hessian

The -th coordinate of in proper step is estimation of , hence this way we approximate Hessian as diagonal matrix. We can analogously estimate nondiagonal terms by updating also averages of products of and pairs (but it neglects sign):

 Hij≈ ⎷¯¯¯¯¯¯¯¯¯gigj−¯¯¯gi¯¯¯gj¯¯¯¯¯¯¯¯θiθj−¯¯¯θi¯¯¯θj (14)

Performance of 2nd order methods can often be improved if diagonalizing Hessian e.g. in some significant subspace - especially that it turns out that gradient is mainly in dimensional top Hessian eigenspace, which is approximately maintained during training [13].

Hence it seems sufficient to model Hessian in dimensional top eigenspace, which should be pointed by sequence of recent gradients as statistically dominating - such model could be maintained using online PCA [9] of gradient sequence (with exponential moving average), or simpler approach proposed in [5] - just adding new gradient times some tiny constant to all vectors of the considered basis, additionally maintaining their orthonormality.

Having such subspace defined in a current moment by approximately orthornormal basis , we can project and on this basis, use projections to update the four averages in dimensions to periodically calculate Hessian using (14), to rotate the basis for maintaining diagonalized Hessian - analogously as discussed in [5].

V Conclusions and further work

While first order methods do not have a direct way for choosing step size accordingly to local situation, second order parabola model in current direction can provide such optimal step size. While it could be used in a separate line search, here is suggested to be combined e.g. with momentum method. Thanks to linear regression of gradients, 1) get this information online: continuously adapting to local situation, 2) using only gradients already required for momentum method, 3) in practically negligible cost thanks to operating on averages.

Choosing the details like hyperparameters, which generally could evolve in time, is a difficult experimental problem which will require further work. Approximated learning rate seems a simpler practical 2nd order approach.

The general possibility of combining different optimization approaches seems promising and nearly unexplored, starting e.g. with momentum+SGD hybrid: rare large certain steps interleaved with frequent small noisy steps.

There is popular technique of strengthening underrepresented coordinates e.g. in ADAM [7], which might be worth combining with simple second order methods like discussed. They exploit simple exponential moving averages - here we got motivation for exploring further possible averages.

Getting a successful second order method for dimensional subspace, a natural research direction will be increasing this dimension discussed in [5], e.g. to OGR10d. Choosing promising subspace (worth second order modelling) will require going from momentum method to e.g. online-PCA [9] of recent gradients. Averages , become dimensional vectors, and become dimensional matrices, with estimated Hessian: .

References

1. H. Robbins and S. Monro, “A stochastic approximation method,” in Herbert Robbins Selected Papers.   Springer, 1985, pp. 102–109.
2. D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning representations by back-propagating errors,” nature, vol. 323, no. 6088, p. 533, 1986.
3. D. C. Liu and J. Nocedal, “On the limited memory bfgs method for large scale optimization,” Mathematical programming, vol. 45, no. 1-3, pp. 503–528, 1989.
4. Y. N. Dauphin, R. Pascanu, C. Gulcehre, K. Cho, S. Ganguli, and Y. Bengio, “Identifying and attacking the saddle point problem in high-dimensional non-convex optimization,” in Advances in neural information processing systems, 2014, pp. 2933–2941.
5. J. Duda, “Improving sgd convergence by tracing multiple promising directions and estimating distance to minimum,” arXiv preprint arXiv:1901.11457, 2019.
6. J. Martens, “Deep learning via hessian-free optimization.” 2010.
7. D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
8. G. Alain, N. L. Roux, and P.-A. Manzagol, “Negative eigenvalues of the hessian in deep neural networks,” arXiv preprint arXiv:1902.02366, 2019.
9. C. Boutsidis, D. Garber, Z. Karnin, and E. Liberty, “Online principal components analysis,” in Proceedings of the twenty-sixth annual ACM-SIAM symposium on Discrete algorithms.   Society for Industrial and Applied Mathematics, 2015, pp. 887–901.
10. T. Tieleman and G. Hinton, “Lecture 6.5—RmsProp: Divide the gradient by a running average of its recent magnitude,” COURSERA: Neural Networks for Machine Learning, 2012.
11. L. Sagun, U. Evci, V. U. Guney, Y. Dauphin, and L. Bottou, “Empirical analysis of the hessian of over-parametrized neural networks,” arXiv preprint arXiv:1706.04454, 2017.
12. J. Duda, “Parametric context adaptive laplace distribution for multimedia compression,” arXiv preprint arXiv:1906.03238, 2019.
13. G. Gur-Ari, D. A. Roberts, and E. Dyer, “Gradient descent happens in a tiny subspace,” arXiv preprint arXiv:1812.04754, 2018.
You are adding the first comment!
How to quickly get a good reply:
• Give credit where it’s due by listing out the positive aspects of a paper before getting into which changes should be made.
• Be specific in your critique, and provide supporting evidence with appropriate references to substantiate general statements.
• Your comment should inspire ideas to flow and help the author improves the paper.

The better we are at sharing our knowledge with each other, the faster we move forward.
The feedback must be of minimum 40 characters and the title a minimum of 5 characters