Neograd: gradient descent with an adaptive learning rate
Since its inception by Cauchy in 1847, the gradient descent algorithm has been without guidance as to how to efficiently set the learning rate. This paper identifies a concept, defines metrics, and introduces algorithms to provide such guidance. The result is a family of algorithms (Neograd) based on a constant-ρ ansatz, where ρ is a metric based on the error of the updates. This allows one to adjust the learning rate at each step, using a formulaic estimate based on ρ. It is no longer necessary to do trial runs beforehand to estimate a single learning rate for an entire optimization run. The additional costs of operating this metric are trivial. One member of this family of algorithms, NeogradM, can quickly reach much lower cost function values than other first-order algorithms. Comparisons are made mainly between NeogradM and Adam on an array of test functions and on a neural network model for identifying hand-written digits. The results show significant performance improvements with NeogradM.
Gradient descent (GD) is an iterative optimization method, designed to find the minimum of a specified cost function (CF). The method has found diverse applications across numerous fields since its discovery by Cauchy in 1847. In particular, it finds frequent use in the field of machine learning (ML), where the lack of a robust and efficient adaptive learning rate has been a long-standing impediment to its application. The GD algorithm operates by searching the parameter space over which the function is defined, starting from some initial value. The updates to this starting parameter value are proportional to the product of the gradient of the function and a scalar (the learning rate). While this is a straightforward and fairly stable method, it is challenged by certain types of CFs. The purpose of this paper is to explain a new method of controlling the learning rate, so that the GD method is (almost) never overly challenged by such CFs. Also, it should be highlighted that, in the context of ML, this paper primarily addresses the use of GD with batch processing, as opposed to mini-batch or stochastic gradient descent.
This paper begins with the section “Gradient descent”, reviewing the most basic form of GD. It is followed by sub-sections on certain extant variations of GD, considerations on an ideal GD method, and finally a new derivation of GD. The next section, “Measures of control”, introduces the concept of “Update Fidelity” and three diagnostic metrics (rho, dotp, sumd) which are meant to inform the user of the progress of the algorithm. In particular, the rho (ρ) diagnostic will be seen to form the basis for a new family of algorithms. Next, in the section “Computing rho”, a convenient means of computing ρ for batch GD is introduced, which involves almost no additional cost over the usual approach. In the following section, “Solutions for simple cases”, an examination is made of the role and use of ρ in GD when the CF is one of several simple forms; these forms include a quadratic, a quartic, an ellipse, and a simple well formed from two sigmoids. The purpose of this is to help build intuition about ρ, as it is a new metric. The next section, “Algorithms”, begins with two sub-sections on the theoretical basis for adaptively updating the learning rate. The sub-sections that follow contain pseudocode for the algorithms that comprise the Neograd family. The following section, “Experiments”, contains selected comparisons and analyses of this family of algorithms. Each experiment was chosen to demonstrate at least one particular point or feature. For the most part, this section uses special functions, but it also includes a neural net model to recognize handwritten digits. The last section, “Final Remarks”, contains a summary, comparison, and perspective. Two appendices are also included: one providing background on the linear algebra solution used for the new derivation of GD, and another discussing an alternative definition of ρ. Accompanying code for this paper can be found on GitHub.
2 Gradient Descent (GD)
The GD algorithm is an iterative method for finding the minimum of a function f(θ). With f defined over the parameters θ, the algorithm begins with an initial choice θ₀, and defines updates according to the formula:
θ_{i+1} = θ_i − α ∇f(θ_i)
where α is a small positive constant known as the learning rate. This update step, along with a re-evaluation of f and ∇f, is repeated until a suitable stopping condition is met. This algorithm is motivated by the fact that the gradient of a function points in the direction of its maximal increase, and so its negative must point toward its maximal decrease.
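In code, the basic method amounts to a few lines. The following is a minimal sketch (the function names and the quadratic example are illustrative, not from the paper):

```python
import numpy as np

def gd_minimize(grad_f, theta0, alpha=0.1, num_iters=100):
    """Basic gradient descent: theta <- theta - alpha * grad_f(theta)."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(num_iters):
        theta = theta - alpha * grad_f(theta)
    return theta

# Example: f(theta) = |theta|^2 has gradient 2*theta and its minimum at 0.
theta_min = gd_minimize(lambda th: 2.0 * th, np.array([3.0, -2.0]),
                        alpha=0.1, num_iters=200)
```

Each step contracts θ by the factor (1 − 2α) in this example, so the iterates approach the minimum geometrically.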
This would appear to be a straightforward, perhaps even robust approach to determining the minimum, and one should only have the trivial task of setting α to some suitably small number. However, even for simple functions such as f(x) = x⁴, it can be shown (cf. Section 5.2) that any value of α will lead to unstable behavior for a sufficiently large starting point. When this trivial quartic function is replaced by the much more complex CFs used in ML applications, the situation becomes greatly compounded.
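This instability is easy to exhibit numerically. In the sketch below (the specific α and starting points are illustrative), the GD update for f(x) = x⁴ is x′ = x(1 − 4αx²), so a fixed α that converges from one starting point diverges from a larger one:

```python
import numpy as np

def quartic_gd(x0, alpha, num_iters=50):
    """Run GD on f(x) = x**4 (gradient 4*x**3) and return the final |x|."""
    x = np.float64(x0)
    for _ in range(num_iters):
        x = x - alpha * 4.0 * x**3
        if not np.isfinite(x):
            break
    return abs(x)

# The update factor is (1 - 4*alpha*x**2): stable for small x0, but any
# fixed alpha diverges once 4*alpha*x0**2 > 2.
small_start = quartic_gd(x0=1.0, alpha=0.01)    # shrinks toward 0
large_start = quartic_gd(x0=10.0, alpha=0.01)   # same alpha, blows up
```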
2.1 Existing variations on gradient descent
This section does not attempt to compare and contrast all the existing GD variations (here called the “GD family”). Instead, the purpose is to mention two prominent themes that appear in some of the more widely used algorithms. The first such theme is momentum, which instead of using the current gradient utilizes an exponentially weighted average of past gradients (weighted in favor of the more recent updates). While this can be useful in smoothing over zig-zagging updates, it can be detrimental if the momentum parameter is set too large, since it may cause the updates to adjust too slowly. A variation on this is Nesterov momentum, which uses a look-ahead approach toward computing the momentum, and hence acquires some second-order capabilities.
The second theme is a sort of “preferential weighting”, which uses heuristics to re-weight the update from basic GD. Into this category the approaches of AdaGrad, AdaDelta, RMSProp, Adam, and others may be grouped. The heuristic represented by this group is basically that gradient components that are small should contribute more, and those that are large should contribute less. In doing this it is hoped that the problems mentioned previously (zig-zagging, flat/steep CFs) will be ameliorated. Also, note that these two themes can overlap, as with Adam, producing an algorithm that incorporates both momentum and a preferential weighting scheme.
There are other approaches that aren’t easily classified into one of these two schemes. Among them is the line search method, which tries several values of the learning rate and selects the one yielding the lowest CF. Also, the whole class of second-order methods will not be considered here, as it’s outside the scope of this paper. As a general comment, though, those methods tend to yield improved results, but at a higher computational cost. Finally, there are a number of heuristics, both with respect to learning rate schedules (e.g., decay schedules) and modified gradients (e.g., norm clipping). To the author’s knowledge, these heuristics aren’t known to work reliably well. In addition, while they have some technical motivation, they all seem to lack a theoretical basis for being generally applicable, and will not be considered herein.
2.2 An ideal gradient descent algorithm
An ideal variation of the basic GD algorithm will gracefully handle any challenge that may appear. These challenges arise from the landscape of the CF, which may have one or many minima and potentially many saddle points, as well as narrow valleys, steep cliffs, and nearly flat regions. The narrow valleys tend to lead to oscillating behavior in the updates; the flat regions may lead to vanishing gradients, and the cliffs to exploding gradients.
An important concept here is the basin of attraction, which is just the region of parameter space defined by all the initial values that eventually lead to a small neighborhood of the same minimum, after a sufficient number of arbitrarily small updates. If the value of α is set so large that updates frequently cause θ to leave its basin of attraction, then the method is essentially no longer acting as a GD algorithm. Instead, it is perhaps more like a simple random sampling algorithm.
The approach that’s normally taken is to attempt to find a single “best” learning rate. This is done by performing a number of optimization runs with different values of α, and comparing the plots of training cost vs. iterations. However, even the idea of finding a single best α is flawed. By analogy, it’s equivalent to trying to find the “best speed” at which to drive between two distant cities. E.g., to drive from Chicago to St Louis, would the best speed be 10 mph, 50 mph, or 100 mph? As with optimization, a higher rate does not always lead to a better outcome. Since a single α does not seem to suffice, this paper will pursue a different route: an adaptive algorithm for α, one that can navigate between the Scylla and Charybdis of high and low learning rates.
Another issue is the judging of optimization algorithms by how well the resulting ML model generalizes. While generalization is the ultimate goal of creating an ML model, it largely consists of optimization plus regularization. The focus here is solely on optimization; the ability to generalize will be ignored.
2.3 Another derivation of gradient descent
This section is provided not only to give a new derivation of the GD algorithm, but to do so in a way that provides insight into the relationship between basic GD and some of its variations (i.e., momentum, preferential weighting). Begin with a first-order Taylor series expansion of the cost function
f(θ + Δθ) ≈ f(θ) + g·Δθ
where g = ∇f(θ). What is sought is a decrease in f, so that Δf ≈ g·Δθ is negative. To make the sign explicit, Δf is replaced with −c, where c > 0. Thus a first-order solution for Δθ is sought for this equation:
g·Δθ = −c
This equation can be interpreted as a system of linear equations Ax = b, where A = gᵀ, x = Δθ, and b = −c. If θ is n-dimensional, A is a 1-by-n matrix, x is an n-dimensional column vector, and b is a scalar. So long as n > 1, this is an under-determined system of equations, and the inverse of A will not exist. Nevertheless, it’s possible to instead formulate the solution using a generalized inverse. A new technique for solving it in this manner is given in an appendix. It is based on the row space of A and yields the following solution
Δθ = −c g/|g|² + Δθ_h ,  with g·Δθ_h = 0
where the first term on the right-hand side is the particular solution, and the second term is the homogeneous (or null space) solution. After identifying the learning rate as α = c/|g|², which is a positive constant, the final form for the update of θ is:
θ_{i+1} = θ_i − α ∇f(θ_i) + Δθ_h
This solution becomes a minimum-norm solution when Δθ_h = 0. This means that for the updates to make the quickest descent possible, one should use Δθ = −α∇f, which is just the usual GD. Observe that when momentum or preferential weighting is used, Δθ_h ≠ 0; this contribution to the solution is perpendicular to the gradient. It leads to an update partially following the level-set contours of constant f, as opposed to the direction dictated by pure GD.
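The minimum-norm claim can be checked numerically. The sketch below (variable names illustrative; NumPy assumed) verifies that Δθ = −c g/|g|² satisfies the constraint g·Δθ = −c, and that adding a null-space component perpendicular to g preserves the constraint while lengthening the step:

```python
import numpy as np

rng = np.random.default_rng(0)
g = rng.normal(size=5)          # stand-in gradient (row of the 1-by-n system)
c = 0.7                         # desired first-order decrease, c > 0

# Particular (minimum-norm) solution of g @ dtheta = -c: proportional to -g.
dtheta_min = -c * g / (g @ g)

# Any vector perpendicular to g is a homogeneous solution; adding it keeps
# the constraint satisfied but increases the norm of the step.
null = rng.normal(size=5)
null -= (null @ g) / (g @ g) * g      # project out the g-component
dtheta_alt = dtheta_min + null
```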
3 Measures of control
“Measure what is measurable, and make measurable what is not so.”
Toward achieving the goal of a GD algorithm with an adaptive learning rate, a concept and several diagnostic metrics are introduced. As will be seen, the most important is the rho diagnostic, which gives a measure of whether α is too large or too small. Also, the dotp diagnostic informs the decision on the target value used for ρ, and the sumd diagnostic informs the decision on whether a stopping condition has been met.
3.1 Update Fidelity (UF)
This section puts a name on something that probably everyone who has worked with GD has understood at some level: it’s possible to choose the learning rate to be too large or too small, and in either case the result will be suboptimal; the best choice lies somewhere in between. That situation is illustrated in Fig. 1 with three plots of GD updates on a quadratic CF, each with a different α.
In each figure the initial point is marked with a red dot. A GD update is made, and its prediction for the new CF value is indicated by the end of a straight (red) line. The new (actual) CF value is marked with another red dot on the CF; from there, another GD update is made, and another line drawn. It’s important to remember that GD is a linear extrapolation in α for both θ and f. The three figures, from left to right, have increasing values of α, which correspond to decreasing UF. Thus, UF may be figuratively understood as the degree to which these linear extensions/predictions coincide with the underlying curve.
With that in mind, what these three figures indicate is that with high UF, the linear updates/predictions (i.e., the red lines) are close to the underlying function, but progress toward the minimum of the function is slow. With low UF, the linear updates/predictions differ significantly from the underlying function, and perhaps can’t be trusted to lead to stable behavior during optimization. In contrast to these two extremes, the case of moderate UF represents the compromise that the user of GD aims to achieve: it maintains reasonable fidelity to the underlying function, while also making reasonable progress toward the minimum (cf. Eq. (4)). These are the main points to be made using the concept of UF. It will now be quantified.
3.2 The rho diagnostic
Recall the GD parameter update step:
θ_{i+1} = θ_i − α ∇f(θ_i)
where ∇f is evaluated at θ_i. Upon defining the first-order (linear) prediction of the new CF value
f̃₁ = f₀ − α |∇f(θ_i)|²
where f₀ = f(θ_i) and the actual new value is f₁ = f(θ_{i+1}), a generic figure for GD can be illustrated as in Fig. 2.
The main insight this figure is meant to deliver is that for GD to be valid, f̃₁ should not be far from f₁, as measured on the scale of Δf̃ = f̃₁ − f₀. In deciding on a suitably general measure, it is also reasonable to require scale and translation invariance with respect to f. That is, multiplying f by some number shouldn’t change the measure, and shifting f by some constant should also not change it. (It is understood that the parameter values θ would stay fixed during such transformations of f.) Such requirements immediately disallow the obvious metrics, such as an absolute error (e.g., |f₁ − f̃₁|) or a relative error (e.g., |f₁ − f̃₁|/|f₁|). Instead, the following measure is introduced, which consists of the error between f̃₁ and f₁ measured on the scale |Δf̃|:
ρ = |f₁ − f̃₁| / |f̃₁ − f₀|    (3)
This definition leaves it manifestly invariant to scale and translation transformations of f, assuming the updates to θ are fixed. By inspection, one can see that for high UF, ρ is small; when ρ is large, the UF is low. (Also, an alternative definition of ρ is given in an appendix.)
If one does a Taylor series expansion, the leading terms comprising ρ for small α yield
ρ ≈ α |gᵀ H g| / (2 |g|²)
where g = ∇f and H is the Hessian of f. This result applies when f̃₁ is computed using the usual first-order GD estimate. This relationship will become useful when control algorithms are developed later in the paper. The Hessian in the numerator also reveals that ρ implicitly takes into account the local curvature of f, if it wasn’t already obvious from the above figure. However, ρ is not a direct measure of the curvature. Also, note that there is nothing intrinsically limiting the use of ρ to a first-order GD algorithm. It can be used within the context of second-order algorithms as well; in that case the above Taylor series would produce ρ = O(α²).
The ρ diagnostic may be thought of as a quality control parameter. When ρ is too small (or too large), the updates will be too small (or too large), leading to a high UF (or a low UF). However, if ρ has an appropriate size, the updates will be appropriately sized, leading to a moderate UF, which is the goal. For comparison, another quality control parameter from a very different context is the use of 0.05 as a cutoff for p-values in statistics. In both cases it is desirable to have the parameter take on small values, but it is often impractical to require values close to zero.
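As a concrete sketch, the reading of Eq. (3) used here is ρ = |f₁ − f̃₁|/|f̃₁ − f₀|, with f̃₁ the linear prediction f₀ − α|∇f|². The helper name get_rho echoes the pseudocode mentioned later; the quadratic example values are illustrative:

```python
def get_rho(f_old, f_new, f_pred):
    """rho = |f_new - f_pred| / |f_pred - f_old|: the error of the linear
    prediction, measured on the scale of the predicted drop.
    Small rho => high update fidelity."""
    return abs(f_new - f_pred) / abs(f_pred - f_old)

# Quadratic example f(x) = x^2: for the update x -> x - alpha*grad, the
# linear model predicts f_pred = f_old - alpha * grad**2.
alpha, x = 0.05, 3.0
grad = 2.0 * x
f_old = x**2
f_pred = f_old - alpha * grad**2
x_new = x - alpha * grad
f_new = x_new**2
rho = get_rho(f_old, f_new, f_pred)
```

For this CF the algebra gives ρ = α exactly (the quadratic case of Section 5), so the computed value here is 0.05.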
3.3 The dotp diagnostic
As mentioned in Section 2, there is an issue with GD in that it may repeatedly overshoot the minimum, effectively zig-zagging (or oscillating) about it. A straightforward way to measure whether this is occurring is with a normalized dot product (i.e., “cosine similarity”) between successive updates Δθ_i and Δθ_{i+1}:
dotp = (Δθ_i · Δθ_{i+1}) / (|Δθ_i| |Δθ_{i+1}|)
The normalization is used so it can provide a measure independent of the size of the update. Also, this diagnostic is meant to be evaluated after each iteration, since its value may rapidly change. When dotp strays from being near 1.0, it indicates the learning rate may be too large; this point will be revisited later in the paper.
The first drawback of this metric is that while dotp can easily detect a period of 2, it cannot easily reveal higher-order periods. E.g., if it takes several updates for Δθ to change direction, there won’t be a clear indication from this metric. The second drawback is that dotp gives values independent of the scale of Δθ. However, this design choice is what is needed most of the time. (If it’s necessary to also consider the scale, one might create the vector metric (dotp, |Δθ|).) Though this metric remains valuable as-is, these drawbacks do help motivate the introduction of the sumd metric in the next section.
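A minimal implementation of dotp (names and example vectors illustrative):

```python
import numpy as np

def dotp(dtheta_prev, dtheta_curr):
    """Normalized dot product (cosine similarity) of successive updates.
    Near +1: updates keep pointing the same way; near -1: period-2 zig-zag."""
    return (dtheta_prev @ dtheta_curr) / (
        np.linalg.norm(dtheta_prev) * np.linalg.norm(dtheta_curr))

smooth = dotp(np.array([1.0, 0.0]), np.array([2.0, 0.1]))   # nearly aligned
zigzag = dotp(np.array([1.0, 0.0]), np.array([-0.5, 0.0]))  # reversal
```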
3.4 The sumd diagnostic
To understand the progress of the updates of GD, it can be helpful to visualize them in θ-space. However, the dimension of θ can be extremely large, effectively precluding that possibility. Nevertheless, it will be shown in this section that a diagnostic metric can be formed from two measurables, yielding a visual depiction of the progress. With this metric, one can tell whether the updates are making progress or are exhibiting oscillations (of any periodicity). This information can help inform the decision of whether to stop the optimization run or to make some adjustments. The reader should note that this metric will not be directly used in forming the algorithms of this paper. However, it is related to the dotp metric, which itself may be used in conjunction with ρ. In particular, sumd fills in the gaps left by dotp, in that it can detect limit cycles of period greater than two, and in that it is sensitive to the magnitude of the updates.
The sumd metric is a 2D curve, essentially a distance (d) vs. accumulated arclength (a) graph, computed with respect to a reference point θ₀. The points comprising the curve are (a_i, d_i) and are computed as:
a_i = Σ_{j=1..i} |θ_j − θ_{j−1}| ,  d_i = |θ_i − θ₀|
where a₀ is defined as 0. In words, the arclength a_i is defined as the sum of the (magnitudes of the) differences in θ (hence the name “sumd”) that lead up to θ_i, starting from θ₀. Thus, if the series of parameters is θ₀, θ₁, θ₂, the first three points of sumd are:
(0, 0), (|θ₁ − θ₀|, |θ₁ − θ₀|), (|θ₁ − θ₀| + |θ₂ − θ₁|, |θ₂ − θ₀|)
To illustrate the relationship between the path of and the associated sumd curve, consider Fig. 3, where the path is a hyperbolic spiral.
Although the path of θ (left figure) was not based on GD updates, it should be imagined that way for the purpose of this example. The path of θ begins at the red star and spirals down toward the origin. Initially, the arclength and the distance are nearly equal; this is because there is little initial curvature in the path of θ. However, as the path begins to spiral, the distance from the initial point reaches a maximum, while the arclength continues to grow. Following that, the distance decreases, then increases again, etc. This oscillation on the right-hand side of Fig. 3 is generically indicative of a limit cycle in θ-space. Clearly, the construction of this metric makes it capable of detecting a cycle of any periodicity.
At any given step the arclength (a_i) is greater than or equal to the distance (d_i) from the reference point θ₀. This is easily seen from the triangle inequality:
d_i = |θ_i − θ₀| = |Σ_{j=1..i} (θ_j − θ_{j−1})| ≤ Σ_{j=1..i} |θ_j − θ_{j−1}| = a_i
It is also easily understood from intuition that these individual steps produce a maximal-length vector when they all point in the same direction. In any case, this is why the sumd curve lies below the 45-degree line; a red dotted line at 45 degrees is added to the sumd plot as a visual reference. To summarize, if the path of θ is curving relative to θ₀, this will be reflected by the sumd curve veering away from 45 degrees. If the path later straightens out, the sumd curve will again advance at a 45-degree angle (although now displaced beneath the 45-degree red dotted line in the figure).
It should be remembered that the horizontal axis is not the number of iterations. The points of sumd are not required to march along at a constant rate from left to right; in fact, they typically don’t do so. Instead, it is often the case that initial updates of are large, and are later followed by relatively small changes. Indeed, at some point it may even show that the changes have effectively become zero. To help identify these varying changes in later sumd figures, thin vertical lines will be added after a set number of updates; this will help display the rate of progress.
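The construction above is short to implement. The sketch below (function name illustrative) computes the sumd points for a recorded sequence of parameter vectors:

```python
import numpy as np

def sumd_curve(thetas):
    """Given a sequence of parameter vectors, return the sumd points
    (a_i, d_i): accumulated arclength vs. distance from the start."""
    thetas = np.asarray(thetas, dtype=float)
    steps = np.linalg.norm(np.diff(thetas, axis=0), axis=1)
    a = np.concatenate([[0.0], np.cumsum(steps)])      # arclength a_i
    d = np.linalg.norm(thetas - thetas[0], axis=1)     # distance d_i
    return a, d

# A straight-line path advances along the 45-degree line (a == d); a path
# that doubles back keeps growing in a while d shrinks.
a, d = sumd_curve([[0, 0], [1, 0], [2, 0], [1, 0]])
```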
4 Computing rho
A quick inspection of the formula for ρ shows that it depends on two successive values of the CF. It might then seem that in order to incorporate it into a typical optimization loop, another evaluation of the CF would be necessary. However, this is not the case. First, recall the standard sequence of steps for the optimization of a CF using GD, as shown in Algo. 1.
Here, some of the steps that are normally combined are placed on separate lines, in order to more easily compare with later pseudocode. Also, the dependence of these function calls on other variables (e.g., the data, cached variables, etc.) is suppressed, as are steps such as reading in the data and initializing the parameters (θ, α, etc.). The “stopping condition” may depend on f, θ, etc.; it’s left out for simplicity, and a fixed number of iterations is used instead. Finally, the number of iterations is set to num+1, to make the comparison easier for the following pseudocode.
In Algo. 1, the most important feature to notice is the ordering of the computations within the loop: f; ∇f; Δθ; update of θ. The approach now will be to do an initial computation of the CF before entering the loop, but to otherwise respect this order. Thus, inside the new loop, the order of the computations is: ∇f; Δθ; update of θ; f. This rewriting allows one to easily access the old and new values of f, and so compute ρ with essentially no new steps. Again, this approach is convenient because ρ (and for that matter dotp and sumd) is computed “between steps”. With these changes, the standard algorithm to be used now is shown in Algo. 2:
The function get_rho refers to Eq. (3). This is perhaps the simplest rewriting of Algo. 1 that computes ρ without extra computation; alternatives might include performing additional steps before entering the main loop, such as the computation of the gradient ∇f. One might find such a variation more convenient if also computing the other diagnostics (dotp and sumd), which were left out here for simplicity. Finally, as an exercise, one can verify for a small value of num that each algorithm computes the CF the same number of times. If anything, the new ordering may be preferable, since the final value of ρ corresponds to the final value of the CF; in the former algorithm it doesn’t.
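The reordering can be sketched as follows (the helper names and the quadratic test function are illustrative; the ρ formula is the reading of Eq. (3) used throughout): f is evaluated once before the loop and then exactly once per iteration, yet both the old and new CF values are in hand at each step.

```python
import numpy as np

def gd_with_rho(f, grad_f, theta0, alpha, num_iters):
    """GD loop reordered so rho can be formed 'between steps':
    the CF is evaluated once up front, then once per iteration."""
    theta = np.asarray(theta0, dtype=float)
    f_old = f(theta)                      # one CF evaluation before the loop
    rhos = []
    for _ in range(num_iters):
        g = grad_f(theta)
        f_pred = f_old - alpha * (g @ g)  # first-order prediction
        theta = theta - alpha * g         # parameter update
        f_new = f(theta)                  # the only CF evaluation in the loop
        rhos.append(abs(f_new - f_pred) / abs(f_pred - f_old))
        f_old = f_new
    return theta, rhos

# For f = |theta|^2 the algebra gives rho = alpha at every step.
theta_fin, rhos = gd_with_rho(lambda th: th @ th, lambda th: 2.0 * th,
                              np.array([3.0, -1.0]), alpha=0.05, num_iters=20)
```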
5 Solutions for simple cases
The purpose of this section is to build an understanding of the use of these metrics by applying them to several simple examples. Particular attention will be paid to the relationship between ρ and α. In the discussion that follows, the value of ρ will be compared to a target interval of moderate values.
This is meant to correspond to a “moderate UF”, as defined earlier. This range for ρ is meant as a guide, based on the experience of the author; the values weren’t derived from other considerations. In connection with the remarks made in Sec. 3.1, this range of ρ is meant to ensure reasonable progress with reasonable fidelity.
5.1 Quadratic
In this section, the CF is a quadratic function of an n-dimensional parameter θ, defined as f(θ) = (k/2)|θ|², where k is a positive constant. An update from θ to θ′ using GD leads to
θ′ = (1 − αk) θ ,  f(θ′) = (1 − αk)² f(θ)
Using these expressions in Eq. (3), and recalling f̃′ = f − α|∇f|², leads to
ρ = αk/2    (5)
It may be puzzling to the reader why the constant k appears here, when by construction ρ should have no dependence on the scale of f. However, in that earlier argument it was required that the updates to θ be held constant, which is not the case here: the GD update itself scales with k. Also, in later examples it will be seen that ρ may carry a dependence on θ.
The combination αk in Eq. (5) is interesting, since it implies that ρ can have the same value for cases where α is very different: a small α with a large k produces the same value for ρ as a large α with a small k. This may help partially explain why a choice of α that gives good results in one application may give poor results in another. In any event, it already appears from just one example that ρ, not α, is the better “universal parameter” by which to make comparisons between applications.
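This invariance is easy to confirm numerically. The sketch below (names and values illustrative) computes ρ for one GD step on f(θ) = (k/2)|θ|² using the Eq. (3) reading above; two very different (α, k) pairs with the same product αk produce the same ρ (= αk/2 for this CF):

```python
import numpy as np

def rho_quadratic(alpha, k, theta):
    """rho for one GD step on f(theta) = (k/2)|theta|^2, via the
    linear-prediction-error reading of Eq. (3)."""
    theta = np.asarray(theta, dtype=float)
    g = k * theta
    f_old = 0.5 * k * (theta @ theta)
    f_pred = f_old - alpha * (g @ g)
    theta_new = theta - alpha * g
    f_new = 0.5 * k * (theta_new @ theta_new)
    return abs(f_new - f_pred) / abs(f_pred - f_old)

# Same product alpha*k = 0.1, very different alpha and k (and theta):
r1 = rho_quadratic(alpha=0.001, k=100.0, theta=[1.0, 2.0])
r2 = rho_quadratic(alpha=0.1,   k=1.0,   theta=[5.0, -3.0])
```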
Even though this is an elementary example, one should not miss the conceptually important point that it is now possible to formulaically determine a value for the learning rate, vis-à-vis α = 2ρ/k, in which it depends only on a “quality control” parameter (ρ) and a constant (k). This is fundamentally different from the usual approach, in which one tries a number of different values for α and makes a subjective decision based on the shape of a training error vs. iterations plot. In contrast, by using ρ it is possible to set α in an a priori fashion.
As this is a basic example, it’s possible to further quantitatively evaluate the dependence on ρ using an exact solution. (In particular, the substitution α = 2ρ/k will be assumed.) The value of θ at the nth iteration (beginning from θ₀) equals
θ_n = (1 − 2ρ)ⁿ θ₀
This reveals how the updates, upon increasing ρ, become oscillatory when ρ > 1/2 and divergent when ρ > 1. The period-2 oscillatory nature of θ when ρ > 1/2 is also evident from the dotp diagnostic
dotp = sign(1 − 2ρ)
In addition, the log of the CF at the nth iteration equals
log f_n = 2n log|1 − 2ρ| + log f₀
So in a plot of log f vs. n, the slope should be 2 log|1 − 2ρ| and the intercept log f₀. Thus, the closer ρ is to 1/2, the faster the CF decreases.
Next, the sumd plot is examined in Fig. 4, using one value of ρ below 1/2 and one above.
As expected, the plot for ρ < 1/2 shows a steady progression along the 45-degree line, since the updates take a direct path toward the minimum at θ = 0. In the right-hand plot, with ρ > 1/2, the graph reveals the characteristics of an oscillation in parameter space: the arclength continually increases while the distance oscillates. Note it is not necessary to know the period of the oscillations beforehand in order to detect them with sumd. In both cases the vertical red lines were added (after every 10 iterations) so the reader can more easily assess the progress. Also, the sumd points appear in the plots in blue. Finally, plots of dotp are not shown since they are especially simple: for ρ < 1/2, dotp = 1, and for ρ > 1/2, dotp = −1.
5.2 Quartic
Here the CF is defined as f(θ) = |θ|⁴, and a GD update is θ′ = θ − 4α|θ|²θ. Following the same approach as in the last sub-section, ρ is computed as
ρ = u(6 − 4u + u²)/4    (6)
where u = 4α|θ|². The fact that ρ depends on the combination α|θ|² reveals a dependency that may not have been appreciated hitherto. Inverting this equation while assuming small u reveals, approximately,
α ≈ ρ / (6|θ|²)    (7)
Thus, once again, α may be expressed in an a priori way as a function of ρ. What is new is that α now also depends on θ. This means that, in order to maintain a certain value of ρ, α must be continually adjusted, taking θ into account. It also means that an α suitable for one initial θ₀ may be wholly inappropriate for a different initial θ₀. Once again, it becomes apparent that ρ, not α, is the better universal measure for characterizing an optimization run.
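The θ-dependence is easy to see numerically on a 1-D instance f(x) = x⁴ (names and values illustrative; same Eq. (3) reading as before): a fixed α gives wildly different ρ at different x, while rescaling α by 1/x² restores the original ρ.

```python
import numpy as np

def rho_quartic(alpha, x):
    """rho for one GD step on f(x) = x**4 (gradient 4x^3)."""
    g = 4.0 * x**3
    f_old = x**4
    f_pred = f_old - alpha * g**2
    x_new = x - alpha * g
    f_new = x_new**4
    return abs(f_new - f_pred) / abs(f_pred - f_old)

# rho tracks the combination alpha*x**2, so holding rho fixed
# forces alpha ~ 1/x**2.
rho_near = rho_quartic(alpha=0.001, x=1.0)
rho_far  = rho_quartic(alpha=0.001, x=10.0)              # same alpha, far larger rho
rho_far_rescaled = rho_quartic(alpha=0.001 / 100.0, x=10.0)  # alpha/x^2 matched
```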
Upon using α from Eq. (7), the nth update for θ becomes
θ_n ≈ (1 − 2ρ/3)ⁿ θ₀
This formula predicts converging oscillations when ρ > 3/2, which become diverging when ρ > 3; these thresholds are equivalent to u > 1 and u > 2, respectively. This formula leads to the following expression for the log of the CF at the nth update:
log f_n ≈ 4n log|1 − 2ρ/3| + log f₀
Keep in mind that this is true when α is adjusted according to Eq. (7). The algorithm that results when α is adjusted this way is named “Ideal GD”; it is compared against GD (here called “Basic GD”) in Fig. 5.
The Basic GD algorithm used a fixed α, and the Ideal GD used a fixed ρ (or, equivalently, a fixed u, according to Eq. (6)). The plots reveal that Ideal GD leads to a far smaller value of the CF, by about 10 orders of magnitude. The ρ plot reveals that Basic GD did not keep ρ in the target interval; in fact, it has no apparent control over this metric. (The decrease of ρ for Basic GD and the appearance of a plateau in the CF will be seen to be a recurring criticism of the GD family of algorithms in this paper.) In contrast, Ideal GD had a constant value of ρ, by construction. The third plot of the figure tracks the distance between θ and the minimum (at θ = 0); Ideal GD again outperforms Basic GD. Finally, note that Eq. (7) also leads to α becoming increasingly large for Ideal GD, as it now grows as 1/|θ|².
5.3 Ellipse
Working in two dimensions with θ = (θ₁, θ₂), the CF for the ellipse is
f(θ) = (k₁θ₁² + k₂θ₂²)/2
where k₁, k₂ > 0. The update step from basic GD is θᵢ′ = (1 − αkᵢ)θᵢ. Computing ρ using Eq. (3) leads to ρ = αk̄/2, where
k̄ = (k₁³θ₁² + k₂³θ₂²) / (k₁²θ₁² + k₂²θ₂²)
This result immediately leads to a formula for adjusting α according to the changing value of θ and a pre-set value of ρ: α = 2ρ/k̄.
This dynamically set α is used in an “Ideal GD” algorithm, which is compared to Basic GD in Fig. 6 when applied to the ellipse CF.
Once again, Ideal GD leads to a lower value of the CF by many orders of magnitude. Also, as with the quartic CF, the derived ρ for Basic GD is seen to drop below the target range, taking on very small values. This small ρ explains why the approach toward the minimum by Basic GD in the distance plot is much slower than for Ideal GD. In this case the distance is measured with respect to the minimum at the origin.
These simple examples have demonstrated that it’s possible to solve for α as a function of ρ. This is a unique situation: it means the learning rate can be set at each iteration, and will depend not just on θ, but also on the diagnostic metric ρ. Of course, for more complex CFs it won’t be possible to obtain an exact solution for α as has been done here. Nevertheless, it will be shown how to obtain similar results.
These results establish the approach of the constant-ρ ansatz that will be used going forward: at each iteration, adjustments will be made to α to keep ρ at a fixed target value (as much as possible). Indeed, this was seen to be very successful in these simple cases, with the value of log f decreasing at a linear rate, far outstripping the results of Basic GD for the quartic and ellipse CFs. (The results were identical for the trivial quadratic CF, since there ρ is constant for a fixed α.) Such a linear descent in log f is taken as a gold standard for algorithms based on GD.
6 Algorithms
6.1 Setting the learning rate dynamically
The approach in this section is to seek a means to isolate a formula for α, in a manner similar to what was done in the previous section. Of course, for realistic examples the CF will have a complicated dependence on θ, and so a slightly varied strategy may be required. Also, a side goal in this approach is to avoid any additional computation, as much as possible. It is noted that
f₁ − f̃₁ = (α²/2) gᵀHg + O(α³)
This follows from the Taylor series expansion of f. Next, the leading α-dependence is pulled out by defining the following quantities
A = |f₁ − f̃₁| / α² ,  B = |∇f|²
Observe that A and B involve little additional computation. Recall that from the re-writing of GD in Algo. 2, f₀ and f₁ are already available, and f̃₁ is easily found. To lowest order in α, A is a constant; however, it does in general retain α-dependence. Also, B is a constant; it equals |∇f|² and has no α-dependence. With these definitions, the original equation for ρ (i.e., Eq. (3)) becomes
ρ = α A / B
This immediately yields a solution for α, which is used to define the function h:
α = h(ρ) ≡ ρ B / A    (9)
This will be called the adaptation formula for α.
Finally, though it’s a moot point since f₁ − f̃₁ appears inside absolute value signs, the sign of f₁ − f̃₁ depends on the concavity of f. When f is concave down, gᵀHg < 0; when it’s concave up, gᵀHg > 0. In both cases, by construction, one expects that f̃₁ < f₀ and f₁ < f₀. It follows that if f is concave up (down), the sign of f₁ − f̃₁ is positive (negative). In contrast, the sign of f̃₁ − f₀ is always negative. Also, in all cases, A and B are positive by construction.
The adaptation formula just introduced requires the computation of A and B, which can only be done after an update of θ has already been made. This isn’t quite as convenient as with the Ideal GD algorithms used with the simple functions discussed earlier. Nevertheless, it is still useful, in that the result can be used in the subsequent iteration as an approximate α. This approximation will work well provided A doesn’t change too rapidly.
A slightly more detailed way of understanding the situation is the following. Suppose an iteration has completed, so that α, ρ, and the associated quantities are known. These quantities exactly satisfy the adaptation formula
In general, this ρ will differ from its target value. What is sought here is the ability to change ρ to a new value and, from the equation, compute the new α to be used. Denoting these new values with primes, the new adaptation equation would appear as
(Recall that one of the defined quantities is independent of α.) As an approximation to this, on the assumption that the other isn't changing too rapidly with respect to α, one can write
It is expected that so long as α and α′ are sufficiently close, the resulting ρ′ will be a good approximation to its intended value.
The planned use of this equation in an optimization algorithm is summarized as follows. After an iteration, the quantities α, ρ, and the CF values are computed. If ρ isn't too different from its target, the target value is used in Eq. (10) to compute α′, which becomes the learning rate for the following iteration. This is the essence of the Neograd_v0 algorithm, to be introduced in the next section. However, if ρ differs too significantly from the target, a different value for ρ′ will be used instead, one somewhere between ρ and the target. That heuristic recognizes that Eq. (10) has a limited range of usefulness as an approximation, and is the approach taken in Neograd_v1, also to be introduced below.
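To make the planned loop concrete, here is a minimal sketch in Python. The paper's exact definitions of ρ and of the adaptation formula (Eqs. (3)–(10)) are not reproduced here; as a labeled stand-in, ρ is computed as the relative error between the actual CF change and its first-order prediction, and α is simply rescaled toward the target. The function names and the ρ formula below are illustrative assumptions, not the paper's definitions.

```python
import numpy as np

def rho_metric(f0, f1, g, dp):
    """Relative error between the actual CF change (f1 - f0) and its
    first-order prediction g . dp. (An assumed stand-in for Eq. (3).)"""
    actual = f1 - f0
    pred = g @ dp
    return abs(actual - pred) / (abs(actual) + abs(pred))

def gd_adaptive(f, grad, p, alpha, rho_target=0.1, iters=30):
    """Basic GD in which alpha is nudged after each step so that the
    measured rho tracks rho_target (a simplified Neograd_v0-style loop)."""
    for _ in range(iters):
        g = grad(p)
        dp = -alpha * g
        f0, f1 = f(p), f(p + dp)
        rho = rho_metric(f0, f1, g, dp)
        # crude multiplicative adjustment toward the target (an assumption):
        alpha *= (rho_target / max(rho, 1e-12)) ** 0.5
        p = p + dp
    return p, alpha
```

On a simple quadratic, this loop settles α near the value at which the measured ρ equals the target, without any prior tuning runs.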
As alluded to in the previous section, this algorithm will compute the learning rate to be used in the subsequent iteration from the function defined in Eq. (9). As a convenience, that function is implemented as get_h in Algo. 3, which accepts slightly different inputs
The Neograd_v0 algorithm is defined in Algo. 4.
As already mentioned, this algorithm is expected to work well (and will be shown to do so) in cases where the CF landscape doesn't change too rapidly. It represents the simplest way to use Eq. (9).
This algorithm is a refinement of Neograd_v0: when the current ρ differs too greatly from its target, it doesn't try to make the adjustment in a single step, but instead uses a heuristic to make a partial adjustment. It has been the experience of the author that this caution matters most when raising ρ toward the target. When ρ is below the target, the main risk is that the algorithm will simply become inefficient, as the updates aren't as large as they could be. On the other hand, when ρ is above the target, there is a risk that the updates will cause p to leave its local basin of attraction, which in general should be avoided (see Sec. 2.2). Thus, as a simplification for exposition, in the latter case the next value of ρ is immediately set to the target. This logic of choosing a new value for ρ is captured in Algo. 5
E.g., with a target value of 0.1, the returned value lies, with respect to log values, 3/4 of the way toward the current ρ and 1/4 of the way toward the target. The Neograd_v1 algorithm can now be defined in Algo. 6:
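A sketch of the partial-adjustment heuristic (Algo. 5's get_rho_prime) might look as follows; the branch assignment (interpolate in log space when ρ is below the target, jump straight to the target otherwise) and the argument names are assumptions reconstructed from the surrounding text.

```python
import math

def get_rho_prime(rho, rho_target, w=0.75):
    """Partial adjustment of rho toward rho_target, done in log space.
    w = 0.75 reproduces the '3/4 of the way' weighting described in the
    text. The branch structure here is an assumed reading of Algo. 5."""
    if rho >= rho_target:
        return rho_target          # above target: drop to the target at once
    # below target: move only 1/4 of the way (in log space) toward the target
    return math.exp(w * math.log(rho) + (1.0 - w) * math.log(rho_target))
```

For instance, with rho = 1e-3 and a target of 0.1, the interpolation returns 10^(-2.5) ≈ 0.0032, a cautious quarter-step (in log space) toward the target.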
In this algorithm, the computation of the diagnostics (dotp, sumd, etc.) was suppressed for simplicity. There are (at least) two ways that get_rho_prime can be improved, so as to better deal with CFs that exhibit extremely challenging features (i.e., either extremely flat or steep). The first is to recognize that the number 0.75 in Algo. 5 is adjustable; when it is closer to 1.0, the algorithm is better able to handle rapidly changing CF landscapes. The second is to recognize that in both cases (ρ above or below the target) it is prudent to verify that the proposed α actually leads to an acceptable ρ; this may require a repeat calculation (cf. Neograd_dbl).
″Double, double toil and trouble…″ 
Neograd_dbl is a refinement that can be based on either Neograd_v0 or Neograd_v1; in this section it is based on v1. The idea is to repeat some of the steps in the main loop, so that a more accurate estimate of α can be found. This means that when Eq. (10) is used, the current and target ρ will in general not be as different. As Algo. 7 shows, a portion of the loop is repeated nrep times. It is suggested to use only nrep=2, since there are diminishing returns in setting it higher. (Computations in this loop will be expensive for a real application.)
Note that when nrep=1, this reduces to Neograd_v1. The algorithm involves a trade-off between the increased computation time of the inner loop and the benefit of a more accurate assessment of an appropriate α. For CFs with a challenging landscape, it may be advantageous to have this extra control. Finally, it must be noted that this method isn't fool-proof; it's still possible for it to produce a ρ that exceeds the upper limit of the target interval. This will be seen in the initial iterations of Figs. 19 and 20.
6.6 Hybrid versions
The most essential aspects of the above Neograd algorithms can be combined with existing algorithms of the GD family. E.g., Neograd_v1 can be combined with them as shown in Algo. 8
The function get_dp_combo is a compact writing of the GD family algorithms (see Table 1). In its argument list, m and v represent the exponentially weighted averages of the first and second moments of the gradient, respectively. E.g., NeogradM uses m but not v; Adam uses both. The variable type_opt indicates which GD family algorithm is being combined with a Neograd algorithm. (The usual Adam parameters β1, β2, and ε are suppressed in this pseudocode.) The actual implementation is available elsewhere. Also, as alluded to, while Algo. 8 is based on Neograd_v1, it could equally be based on Neograd_v0. A suggested naming rubric for the hybrid algorithms is in Table 1.
| Original name | New name |
|---|---|
| GD + momentum | NeogradM |
| GD + Nesterov momentum | NeoNAG |
To clarify which version of Neograd is being hybridized, one might consider appending the suffix _v0, _v1, or _dbl, according to which member of the “Neograd family” is being used.
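As a sketch of what get_dp_combo might look like, the dispatcher below implements three representative updates (Basic GD, momentum as used by NeogradM, and Adam). The signature and names are assumptions, and Adam's bias correction is omitted for brevity.

```python
import numpy as np

def get_dp_combo(g, m, v, alpha, type_opt, beta1=0.9, beta2=0.999, eps=1e-8):
    """Compact parameter-update dispatcher for the GD family (a sketch of
    Algo. 8's helper; the paper's exact signature may differ).
    m, v are exponentially weighted first/second moments of the gradient."""
    if type_opt == "gd":
        dp = -alpha * g
    elif type_opt == "momentum":          # NeogradM uses m only
        m = beta1 * m + (1 - beta1) * g
        dp = -alpha * m
    elif type_opt == "adam":              # Adam uses both m and v
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        dp = -alpha * m / (np.sqrt(v) + eps)
    else:
        raise ValueError(type_opt)
    return dp, m, v
```

In a hybrid algorithm, the Neograd machinery supplies alpha afresh at each iteration, while get_dp_combo supplies the direction and moment bookkeeping.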
6.7 Choosing an initial learning rate
No matter which optimization algorithm is used (from the Neograd or GD family), it is convenient to begin with an α that immediately leads to an acceptable value of ρ. Note that in the absence of the algorithms introduced in this paper, one had no way to assess whether a given value of α was appropriate before launching into an optimization run. As explained earlier, the modus operandi in that case was to do several runs and choose the α corresponding to the "best" run.
However, given that the ρ metric is available, which can assess the appropriateness of an α using only two measurements of the CF, an entirely more effective approach becomes possible. One can now simply choose an initial p and α, and measure the CF before and after an update. If the resulting ρ is too large or small, α can be adjusted appropriately. There are a variety of ways of accomplishing this, offering different levels of simplicity or robustness. Perhaps the simplest (cf. a slightly more robust approach available elsewhere) is to borrow elements of Neograd_dbl, in particular the inner loop that adjusts nrep times. This approach is shown in Algo. 9
This function should be called before entering the main loop of any of these optimization algorithms. Since Algo. 9 runs at the start of a run, there's no need for exponentially weighted averages, and get_dp_combo in Algo. 8 was replaced with Basic GD. Note there are a number of ways to write this kind of algorithm. E.g., the for-loop can be replaced with a while-loop, with a break condition based on rerr, or on whether ρ is in an acceptable range (cf. Eq. (4)). Also, note that this algorithm takes a starting α as one of its inputs. The author has found it beneficial to choose this starting value to be overly small. With a value that's too large, it's possible for an update to skip over a minimum in the CF profile and yet still produce a ρ that's within the target interval; such a situation is to be avoided. In particular, the initial α should preferably produce a ρ below the lower limit of the target interval.
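A minimal version of the initial-α search might be written as below. It grows a deliberately tiny starting α until the measured ρ approaches the target, in the spirit of Algo. 9; the ρ formula is the same illustrative stand-in used earlier, and the growth factor and stopping rule are assumptions.

```python
import numpy as np

def set_init_alpha(f, grad, p, alpha0=1e-8, rho_target=0.1, nrep=20):
    """Grow an overly small starting alpha until the measured rho enters
    an acceptable range (a sketch of Algo. 9's intent; names are assumed).
    Starting small avoids an update that skips over a minimum yet still
    lands rho inside the target interval."""
    alpha = alpha0
    g = grad(p)
    for _ in range(nrep):
        dp = -alpha * g
        f0, f1 = f(p), f(p + dp)
        pred = g @ dp
        actual = f1 - f0
        rho = abs(actual - pred) / (abs(actual) + abs(pred))
        if rho >= rho_target / 10:   # close enough to the target interval
            break
        alpha *= 10.0                # rho too small: alpha can be raised
    return alpha
```

The per-step cost is only a CF evaluation, so this search is cheap compared to trial optimization runs.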
An increasing amount of sophistication can be layered on top of get_rho_prime in Algo. 9. E.g., as already mentioned, the constant 0.75 could be made adjustable, perhaps to reflect other variables. Also, in the branch where ρ is above the target, the new ρ could be set to something smaller than the target itself. In addition, in either case a second computation of the CF could be done to refine the estimate, in a manner similar to what was done with Neograd_dbl. Doing this would provide an increased level of robustness that might be warranted for the most challenging CF landscapes (cf. Sec. 7.5). An implementation with additional features is available elsewhere.
In this section, various test functions and a NN model are used to evaluate and compare the algorithms that have been introduced. While the test functions are fairly simple, they provide a clean window into the operation of the algorithms, revealing their strengths and weaknesses. Moreover, in the course of an optimization run, the CF of an actual application will at times locally approximate one or more of these test functions; thus they can serve as a partial proxy for a real CF.
There are a large number of potential comparisons that could be made between all the different algorithms on all the different CFs. However, it is not the purpose of this paper to make such an exhaustive study. Instead, this section examines certain algorithms or scenarios associated with the CFs of interest, in order to illustrate a point or highlight a comparison. Most comparisons will be between Adam and NeogradM, since Adam is a currently popular choice and NeogradM appears to be the best performing member of the Neograd family (excluding those with repeat computations, such as NeogradM_dbl).
An effort was made to find the "best" learning rate for algorithms of the GD family, to enable a fair comparison with the Neograd family; the remaining parameters for Adam were set to their usual default values (β1 = 0.9, β2 = 0.999, ε = 1e-8). (Finding a best α wasn't necessary with the Neograd family.) The best α values are defined as those leading to the lowest CF value while generally exhibiting stable behavior throughout. The diagnostics dotp and sumd were used to help evaluate stability, and all algorithms were also evaluated with respect to the ρ diagnostic. Recall that the sought-after behavior of ρ is that it remain in the target interval defined in Eq. (4), and that excursions of ρ below the interval are judged merely inefficient, while those above are deemed potentially hazardous. Finally, in some cases a plot will be made of the log(distance), where distance = ||p − p*||, p is the current parameter value, and p* corresponds to the minimum of the CF (assuming it's known).
The quartic CF is perhaps the simplest CF in which interesting differences begin to appear; the quadratic is a bit too simple. In Fig. 7 are shown the most compelling contrasts between Adam and NeogradM.
The log of the CF (left plot) shows that NeogradM produces a linear descent (cf. Ideal GD in Fig. 5), while the profile for Adam shows a fast initial descent followed by a near plateau. This switch-over occurs near iteration 50, and has significant consequences: after 200 iterations there is an approximately 15-order-of-magnitude ratio between the two CF values. This behavior is mirrored in the log(distance) plot, which shows that NeogradM brings p very close to the minimum, whereas Adam does not.
The reason for this plateau can be understood from the middle plot, that of ρ. Recall that the ideal value for ρ should lie between the red lines, which define the target interval. For Adam, ρ is seen to drop precipitously near iteration 50; recall that this means it is exhibiting high UF, and the updates are overly conservative. (By contrast, observe that NeogradM maintains a nearly constant ρ within the interval.)
These plateaus in the CF and distance plots will be seen to be a recurring theme in the GD family algorithms. They essentially implement an early stopping form of regularization, and greatly limit the performance of the optimization. At present it is unclear why they behave in this way.
In the other experiments of this paper, comparisons are made between different algorithms. This section is unique in that it focuses on one algorithm (NeogradM) and attempts to shed light on its behavior by examining how α changes in response to a challenging CF.
The CF itself is composed of two sigmoid functions: one increases to the left, and one to the right; they are shifted and added. Together they produce a CF named the sigmoid-well
(Note that p is 1-dimensional here.) The motivation for this CF was to test an algorithm in scenarios that are notoriously difficult for GD family algorithms: very flat and very steep CF landscapes. The two parameters control the steepness and the location of the drop, respectively. Fig. 8 reveals these features:
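Since the exact parameter settings were lost in extraction, the following is only a sketch of such a sigmoid-well, with assumed steepness and shift values:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_well(p, k=10.0, c=1.0):
    """Two shifted sigmoids, one rising to the left and one to the right,
    summed to form a flat-bottomed well centered at p = 0. The steepness k
    and shift c are illustrative assumptions, not the paper's settings."""
    return sigmoid(k * (p - c)) + sigmoid(-k * (p + c))
```

The resulting profile is nearly flat far from the well and very steep near the walls, which is exactly the combination that stresses a fixed learning rate.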
In the optimization run to be discussed, the initial parameter value is marked by a red star in this figure, and the minimum by a black star. The updates will take p from the red star, past the inflection point, and eventually close to the black star. The plots of the CF, p, and α together tell the story of the optimization run, as shown in Fig. 9.
In the plots, vertical red lines are drawn at iterations 55 and 171. As shown in the rightmost figure, they correspond to turning points in the behavior of α vs. iterations. The turning points can be understood by examining the plots of the CF and p vs. iterations. Note that near iteration 50 in the leftmost plot, the CF has dropped to near 0; at that point p is still between -1.5 and -2.0. Also, p won't be close to the minimum until about iteration 170. Given these observations, the sharp descents and ascents of α in the rightmost plot become understandable. Near the beginning of the run, the updates of p are just beginning to lead to a drop in the CF. Because this descent is so steep, α must be reduced quickly; that is exactly what is seen between approximately iterations 13 and 50. Following that, p is near -1.5, where the CF is relatively flat, and the updates still have to bring it to the minimum. One would expect α to now increase in order to effect such change, and that is exactly what the rightmost plot shows between iterations 55 and 171. Having reached iteration 171, p is near the minimum, but not exactly at it. In order to make a "soft landing," α must start decreasing; indeed, that is what the rightmost plot reveals from iteration 171 onwards.
The quick changes of α when NeogradM is used on this CF highlight the importance of the responsiveness of an algorithm. It is not enough for an algorithm to slowly adapt to a changing CF landscape, as other algorithms do (cf. [3, 16, 5]); it must be able to make suitable adjustments almost immediately.
7.3 Beale’s function
Beale's function is used as a standard test for optimization algorithms, as it offers several challenging features in its CF landscape. It is defined as
f(x, y) = (1.5 − x + xy)^2 + (2.25 − x + xy^2)^2 + (2.625 − x + xy^3)^2
It has a global minimum at (x, y) = (3, 0.5). The initial condition used here is one from which the minimum is known to be reachable. Fig. 10 shows comparison plots between Adam and NeogradM applied to this CF.
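For reference, Beale's function and a hand-derived gradient can be coded directly; since it is easy to get a term wrong, the gradient is worth checking against a finite difference.

```python
import numpy as np

def beale(x, y):
    """Beale's function; global minimum f = 0 at (x, y) = (3, 0.5)."""
    return ((1.5 - x + x * y) ** 2
            + (2.25 - x + x * y ** 2) ** 2
            + (2.625 - x + x * y ** 3) ** 2)

def beale_grad(x, y):
    """Analytic gradient of Beale's function (hand-derived)."""
    t1 = 1.5 - x + x * y
    t2 = 2.25 - x + x * y ** 2
    t3 = 2.625 - x + x * y ** 3
    dx = 2 * t1 * (y - 1) + 2 * t2 * (y ** 2 - 1) + 2 * t3 * (y ** 3 - 1)
    dy = 2 * t1 * x + 2 * t2 * 2 * x * y + 2 * t3 * 3 * x * y ** 2
    return np.array([dx, dy])
```

The flat, curved valley around the minimum is what makes this function a useful stress test for learning-rate control.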
Once again, NeogradM reaches a much lower CF value than Adam; the exact gap depends on how long the run is. Also, the plot of ρ shows that Adam again naturally evolves into a situation where it produces very small ρ values, which as discussed are indicative of high UF (i.e., slow progress). The graph of ρ for NeogradM does a good job of staying within the red lines, except for a narrow spike to a low value. Recall that such overly small values are merely inefficient, whereas overly large values are potentially problematic. This deviation appears to be a response to a very flat region in the CF profile. Finally, the right plot in the figure shows that NeogradM reaches its target after about 250 iterations, at which point Adam is still far from it.
It is illuminating to examine a plot of the updates of each algorithm as they move through the CF landscape. In Fig. 11 a red star indicates the starting point for the algorithms, and a black star the minimum.
The path taken by NeogradM (the blue dots) takes a very direct route; it is nearly perpendicular to the contour lines. Recall that a pure GD algorithm would be perfectly perpendicular; the slight discrepancy is due to the momentum in NeogradM. In contrast, Adam takes a relatively oblique path toward the black star, clearly not perpendicular to the contour lines. As discussed in Sec. 2.1, Adam utilizes preferential weighting, which adds significant weight to the null space terms in the solution for the update (cf. Eq. (2)); those terms are parallel to the contour lines. Thus, this plot is a visualization of that contribution.
7.4 Digit recognition
Prior to this section, the CFs that were examined had only a 1- or 2-dimensional parameter space. In contrast, the current experiment uses a model in which the dimensionality of p is 2260. Specifically, the model is a fully connected feed-forward neural network (NN) used to classify the digits 0-9. The digits are supplied through scikit-learn in the form of 8-by-8 images; there are 1437 labeled training images and 360 test images. The NN has one hidden layer, with 64 input nodes, 30 hidden nodes, and 10 output nodes. No preprocessing was done on the input data, although the labels were one-hot encoded. A tanh activation was used on the hidden layer and a softmax on the final layer; a cross-entropy penalty CF was constructed between the output of the final layer and the training label. The model had no regularization. Both Adam and NeogradM were used to train this NN for a single run of 3500 iterations, resulting in Fig. 12
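The architecture just described can be sketched as follows; the initialization scale and function names are assumptions, but the shapes reproduce the stated parameter count of 2260 (64·30 + 30 + 30·10 + 10).

```python
import numpy as np

def init_params(n_in=64, n_hid=30, n_out=10, seed=0):
    """Parameter shapes for the fully connected 64-30-10 network described
    in the text; the total parameter count should be 2260."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(0, 0.1, (n_hid, n_in)); b1 = np.zeros(n_hid)
    W2 = rng.normal(0, 0.1, (n_out, n_hid)); b2 = np.zeros(n_out)
    return W1, b1, W2, b2

def forward(x, W1, b1, W2, b2):
    """tanh hidden layer, softmax output (as in the text)."""
    h = np.tanh(W1 @ x + b1)
    z = W2 @ h + b2
    e = np.exp(z - z.max())       # shift for numerical stability
    return e / e.sum()
```

During training, all four arrays are flattened into the single parameter vector p on which the optimizers operate.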
The most noticeable feature in the plot is that NeogradM reaches a significantly smaller value of the CF. Associated with that is the plot of α, which reveals a dramatic increase of α throughout the course of optimization. (In this figure the value of α for Adam is shown for reference.) This plot shows that α can, and should, be increased in order to keep ρ in the target interval.
Next, in Fig. 13, a side-by-side comparison is made of ρ for the two algorithms.
As expected, NeogradM does a good job of keeping ρ in the target interval of Eq. (4), whereas Adam falls far below it. This is similar to what was seen before with Adam. It also reveals that in this (somewhat) realistic application it's necessary to adjust α rather frequently, as is clear from the raggedness of the ρ plot for NeogradM.
In Fig. 14 a comparison is made between the sumd plots for the two algorithms.
These plots have several interesting features, one of which is the rate of change of the updates. Here the red vertical lines are drawn every 50 iterations, and they reveal that while NeogradM makes slow and steady progress, Adam instead makes rapid initial progress but then slows suddenly. The vertical axes also show that p changes by almost twice as much for NeogradM: 38.9 for NeogradM versus 20.3 for Adam. In addition, the plot for Adam shows extra curvature not present for NeogradM, indicating that the updates for Adam are going in a different direction.
To further investigate the differences in the p-paths of these two algorithms, three plots are presented in Fig. 15.
The left plot is a 2D comparison of the two final parameter vectors, one from Adam and one from NeogradM. The angle between them is larger than might be expected, equalling over 58 degrees. To understand how this difference might arise, a plot of the angle between the two parameter vectors is given as a function of iterations. Somewhat surprisingly, almost all of the change in direction happens after the first iteration. This must reflect the preferential weighting used by Adam, which simply does not follow the direction of steepest descent; this was seen previously in the section on Beale's function (Sec. 7.3), where the presence of the different directions was notable. Finally, the third plot was included to make a simple comparison between the basins of attraction reached by each algorithm. (Keep in mind that this comparison is a select 1D slice, and the underlying dimensionality of the space is 2260.) The interpolating vector between the two final parameter vectors is defined as
In this plot, the interpolation parameter is swept across (and slightly beyond) the two endpoints. The plot shows the basins are distinct (at least along this one dimension). While it may seem that Adam has simply discovered a higher, slightly wider basin, it's actually the case that it simply hasn't followed the gradient down to lower values. Interestingly, if Adam were allowed to continue the optimization for an additional 3500 iterations, its CF would decrease only marginally. However, if NeogradM instead continued from where Adam left off at 3500 iterations, the CF would drop much further. This shows it's not that Adam somehow found a shallower basin; it's that it wasn't as effective at reaching a lower value in that basin.
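The interpolation and the angle computation can be sketched as below; the exact form of the paper's interpolating vector was lost in extraction, so the standard linear interpolation here is an assumption.

```python
import numpy as np

def interpolate(p_a, p_b, s):
    """Linear interpolation p(s) = (1 - s) * p_a + s * p_b between the two
    final parameter vectors; s slightly outside [0, 1] probes past each
    endpoint. (Assumed form of the paper's interpolating vector.)"""
    return (1.0 - s) * p_a + s * p_b

def angle_deg(u, v):
    """Angle between two parameter vectors, in degrees."""
    c = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.degrees(np.arccos(np.clip(c, -1.0, 1.0)))
```

Evaluating the CF along interpolate(p_adam, p_neogradm, s) for a sweep of s produces the 1D basin comparison shown in the third plot.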
The plots of this section have revealed that NeogradM outperforms Adam. In order to quantify this, a speedup is computed, defined as the ratio of the number of iterations needed by Adam to that needed by NeogradM to reach a specified CF value. (Another way to measure the performance increase is to compare CF values at a given iteration.) To reduce variation in the results due to different initial p values, 10 runs were made. In Fig. 16, three plots show the results of this experiment
Adam was run for 80000 iterations, while NeogradM was run until the limits of machine precision caused it to stop. In the left plot, all runs are plotted. Although it may seem that NeogradM has a higher variance (along the iteration axis), Adam actually has a higher variance for a given CF value because of its strong plateau. The center plot was obtained by computing, for each algorithm, the average number of iterations to reach a fixed CF value; it also gives a visual depiction of how the speedup is defined in this close-up view. Finally, in the right plot the speedup is computed over a larger range of the CF, revealing that in reaching these lower CF values, NeogradM is much faster than Adam. This reflects the effective plateau in CF values produced by Adam; near the plateau, the speedup appears to become arbitrarily large. Note the dashed red lines in the right plot, which show that for the indicated CF value, the speedup of NeogradM over Adam is approximately 6.3.
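The speedup computation amounts to finding, for each target CF value, the first iteration at which each algorithm's CF history reaches it, then taking the ratio. A sketch (the helper names are assumptions):

```python
import numpy as np

def iters_to_reach(cf_history, target):
    """First iteration index at which the CF history falls to the target;
    returns None if it never does (e.g., an algorithm stuck on a plateau)."""
    hits = np.nonzero(np.asarray(cf_history) <= target)[0]
    return int(hits[0]) if hits.size else None

def speedup(iters_adam, iters_neogradm):
    """Elementwise ratio N_Adam / N_NeogradM at matched CF values."""
    return np.asarray(iters_adam, float) / np.asarray(iters_neogradm, float)
```

When one algorithm plateaus above the target CF value, iters_to_reach returns None and the speedup at that value is effectively unbounded, which is the behavior seen near Adam's plateau.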
To understand the cause of the variance in the left plot, the angles were computed between each pair of final p values among the 10 runs for Adam, and likewise for NeogradM. It was found that every such pair was nearly perpendicular. This is likely a result of the extreme degeneracy of minima for NNs.
Finally, the analysis in this section evaluated the CF on the training data. Of course, in a real application one evaluates the resulting model on the testing data, to assess how well it generalizes. However, the goal here is to evaluate how well the algorithms optimize, not generalize. Nevertheless, to satisfy the curiosity of the reader, the results indicated comparable generalizability. This may indicate that in order to generalize better, the model itself needs to be refined, perhaps by adding some form of regularization.
7.5 2D shell
While a simple polynomial function with gradual slopes is easy fodder for a GD algorithm, CFs with narrow wells are known to be much more challenging; they often lead to updates that oscillate about the minimum. The CF introduced here is meant as an extreme test in that regard, since it has a very narrow, curving well. This function is defined over the 2D plane as:
where the sigmoid is as defined in Sec. 7.2, and the remaining parameters set the radii, steepness, and tilt. As with the sigmoid-well CF, this function is composed of two sigmoids. One creates a circular "well" of a given radius; the other creates a circular "bump" of slightly smaller radius. When the two are added together, they create a narrow valley that follows a circular path, representing a degenerate minimum. To break this degeneracy, a linear tilt term is added, creating a local minimum on one side of the circle. Although this term tilts the entire CF landscape, the experiment is constrained so that updates should follow the local minimum of the circle. A view of a cross-section of this CF in Fig. 17 illustrates the effect of this term.
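A sketch of such a shell-shaped CF is given below, with all parameter values chosen merely for illustration (the paper's settings were lost in extraction):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def shell_cf(p, k=50.0, r_well=1.05, r_bump=0.95, tilt=0.05):
    """A narrow circular valley: a sigmoid 'well' of radius r_well plus a
    sigmoid 'bump' of slightly smaller radius, with a linear tilt term that
    breaks the circular degeneracy. All parameter values are illustrative
    assumptions."""
    r = np.linalg.norm(p)
    well = sigmoid(k * (r - r_well))     # rises outside the outer circle
    bump = sigmoid(-k * (r - r_bump))    # rises inside the inner circle
    return well + bump + tilt * p[0]
```

With these settings the valley floor sits near radius 1, and the tilt makes the left side of the circle the preferred resting point.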
In previous sections the emphasis was on plots of the CF vs. iterations. Here it is easier to form a comparison by examining two plots: ρ vs. iterations, and an overhead view of the path of updates in the plane. In the figures, two white stars mark the start and end points. The ideal behavior of an algorithm is to move slightly left, drop down into the local well, and then follow it counter-clockwise until it reaches the local minimum. The best algorithm will be judged by its ability to follow this path in a stable and efficient manner, and to get as close as possible to the end point; the ρ plot will be used to judge its stability and efficiency. For all of the following figures, the number of iterations was 1200, and the learning rate was chosen to show (near) optimal behavior of the algorithms.
With the Adam algorithm, the best α found by tuning was used. Immediately noticeable in Fig. 18 is the erratic graph of ρ, which shows a limited amount of stability.
When a smaller α was used, stability improved, but at the expense of progress toward the minimum. In comparison to RMSProp (figure available on Github), Adam clearly does better. In fact, the general trend was that any algorithm utilizing momentum did much better than its momentum-free version. Apparently, when faced with a narrow well, the ability of momentum to average out mistakes is more important than simply making smaller mistakes.
The results from NeogradM, shown in Fig. 19, present a significant contrast.
First, it was very stable, as ρ stayed within its target interval to a large degree. The updates also took p closer to the minimum. The Neograd algorithm (i.e., the version of NeogradM without momentum) didn't do nearly as well (cf. results on Github). As mentioned earlier, the role of momentum is apparently very important in dealing with narrow valleys.
The algorithm NeogradM_dbl was also used for updates on this CF, as shown in Fig. 20.
The ρ plot is much improved, in that it better respects the boundaries of the target interval. Also, the path of the updates ends up close to the target white star on the left. From these two plots, this algorithm is visibly better than all the others; this was to be expected, as it is basically NeogradM with a double computation of the CF. In the ρ plots in Figs. 19 and 20, one of the first values is large, lying above the target interval. This is due to the updates taking p through the sudden drop as it enters the narrow valley. This issue will be addressed elsewhere.
In Fig. 21 is shown a comparison between the dotp and sumd diagnostics for Adam.
The most notable aspects displayed here are the reversals in direction, apparent in both plots. Recall that when successive updates point in opposite directions, dotp ≈ −1, and the overall distance change is approximately zero. This is exactly what's indicated in these two plots: the two regions of dotp with negative values correspond to the two regions of the sumd plot where the slope is approximately zero. (The correspondence can be determined after realizing the red vertical lines in the sumd plot are drawn every 50 iterations.) Another way for the slope of the sumd plot to be near zero is for the update to be perpendicular to the net displacement; this actually occurs at the beginning of the run, soon after the updates take p into the well. Whenever the slope of sumd is near 45 degrees, the update and the net displacement point in the same direction. An inspection of Fig. 18 reveals that such could have been the case for the entire latter half of the sumd plot, were it not for the brief interruption where dotp was negative (as already discussed).
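The two diagnostics, as described here, can be reconstructed as follows; the paper's precise definitions may differ, so these are assumptions based on the surrounding description (dotp as the cosine between successive updates, and sumd as net displacement read against cumulative path length):

```python
import numpy as np

def diagnostics(p_history):
    """dotp: cosine between successive updates (approx. -1 on reversals).
    sumd: net distance from the start, to be read against cumulative path
    length (a 45-degree slope means straight-line motion; near-zero slope
    means reversals or motion perpendicular to the net displacement).
    Reconstructed from the text's description, not the paper's exact code."""
    ps = np.asarray(p_history, float)
    dps = np.diff(ps, axis=0)                       # successive updates
    norms = np.linalg.norm(dps, axis=1)
    dotp = np.einsum('ij,ij->i', dps[:-1], dps[1:]) / (norms[:-1] * norms[1:])
    path = np.cumsum(norms)                         # cumulative path length
    sumd = np.linalg.norm(ps[1:] - ps[0], axis=1)   # net distance from start
    return dotp, path, sumd
```

A straight-line run gives dotp = 1 everywhere and sumd equal to the path length; a reversal flips dotp negative while sumd stalls.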
In Fig. 22 the same comparison plot is shown for NeogradM_dbl.
Here, dotp is almost exactly 1.0 for the whole course of the optimization, indicating the process was very stable, with no reversals. Also, the vertical red lines in the sumd plot occur very regularly, indicating the updates have a roughly constant size. Toward the end of the run, the sumd graph has a slope veering away from 45 degrees toward zero. As already explained, a small slope occurs when the updates to p are becoming perpendicular to the net displacement. Since was , t