Yura Malitsky (Laboratory for Information and Inference Systems, EPFL, Lausanne, Switzerland, yurii.malitskyi@epfl.ch)

Konstantin Mishchenko (KAUST, Thuwal, Saudi Arabia, konstantin.mishchenko@kaust.edu.sa)
###### Abstract

We present a strikingly simple proof that two rules are sufficient to automate gradient descent: 1) don’t increase the stepsize too fast and 2) don’t overstep the local curvature. No need for functional values, no line search, no information about the function except for the gradients. By following these rules, you get a method adaptive to the local geometry, with convergence guarantees depending only on smoothness in a neighborhood of a solution. Given that the problem is convex, our method will converge even if the global smoothness constant is infinity. As an illustration, it can minimize any twice continuously differentiable convex function. We examine its performance on a range of convex and nonconvex problems, including matrix factorization and training of ResNet-18.

### 1 Introduction

Since the early days of optimization it has been evident that there is a need for algorithms that are as independent from the user as possible. First-order methods have proven to be versatile and efficient in a wide range of applications, but one drawback has been present all that time: the stepsize. Despite certain success stories, line search procedures and adaptive online methods have not removed the need to manually tune the optimization parameters. Even in smooth convex optimization, which is often believed to be much simpler than its nonconvex counterpart, robust rules for stepsize selection have been elusive. The purpose of this work is to remedy this deficiency.

The problem formulation that we consider is the basic unconstrained optimization problem

$$\min_{x\in\mathbb{R}^d} f(x), \tag{1}$$

where $f\colon\mathbb{R}^d\to\mathbb{R}$ is a differentiable function. Throughout the paper we assume that (1) has a solution and we denote its optimal value by $f_*$.

The simplest and best-known approach to this problem is the gradient descent method (GD), whose origin can be traced back to Cauchy [7, 20]. Although it is probably the oldest optimization method, it continues to play a central role in modern algorithmic theory and applications. Its definition can be written in a single line,

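$$x^{k+1} = x^k - \lambda\nabla f(x^k), \quad k\ge 0, \tag{2}$$

where $x^0\in\mathbb{R}^d$ is arbitrary and $\lambda>0$. Under the assumptions that $f$ is convex and $L$-smooth (equivalently, we will say that $\nabla f$ is $L$-Lipschitz), that is

$$\|\nabla f(x)-\nabla f(y)\| \le L\|x-y\|, \quad \forall x,y, \tag{3}$$

one can show that GD with $\lambda\in(0,\frac{2}{L})$ converges to an optimal solution [30]. Moreover, with $\lambda=\frac{1}{L}$ the convergence rate [10] is

$$f(x^k)-f_* \le \frac{L\|x^0-x^*\|^2}{2(2k+1)}, \tag{4}$$

where $x^*$ is any solution of (1). Note that this bound is not improvable [10].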
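As a small illustration of (2) and (4), the following Python sketch runs GD with the fixed stepsize $1/L$ and checks the bound (4) at every iterate. The diagonal quadratic, its coefficients `a`, and the helper name `gradient_descent` are assumptions made for this example only; they are not taken from the experiments of the paper.

```python
def gradient_descent(grad, x0, step, iters):
    """Plain GD (2) with a fixed stepsize; returns the list of all iterates."""
    xs = [list(x0)]
    for _ in range(iters):
        x = xs[-1]
        g = grad(x)
        xs.append([xi - step * gi for xi, gi in zip(x, g)])
    return xs


# Hypothetical test problem: f(x) = 0.5 * sum(a_i * x_i^2),
# so that L = max(a_i), x^* = 0 and f_* = 0.
a = [1.0, 0.1, 0.01]
L = max(a)


def f(x):
    return 0.5 * sum(ai * xi * xi for ai, xi in zip(a, x))


def grad(x):
    return [ai * xi for ai, xi in zip(a, x)]


x0 = [1.0, 1.0, 1.0]
xs = gradient_descent(grad, x0, 1.0 / L, 50)

# Right-hand side of (4) uses ||x^0 - x^*||^2 with x^* = 0.
r0_sq = sum(xi * xi for xi in x0)
bounds_hold = all(
    f(xs[k]) <= L * r0_sq / (2 * (2 * k + 1)) + 1e-12 for k in range(len(xs))
)
```

On this example the bound is far from tight for the well-conditioned coordinate, which is one way to see why a single global $L$ can be pessimistic.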

We identify four important challenges that limit the applications of gradient descent even in the convex case:

1. GD is not general: many functions do not satisfy (3) globally.

2. GD is not a free lunch: one needs to guess $\lambda$, potentially trying many values before a success.

3. GD is not robust: failing to provide $\lambda<\frac{2}{L}$ may lead to divergence.

4. GD is slow: even if $L$ is finite, it might be arbitrarily larger than local smoothness.

Certain ways to address some of the issues above already exist in the literature. They include line search, adaptive Polyak’s stepsize, mirror descent, dual preconditioning and stepsize estimation for subgradient methods. We discuss them one by one below, in a process reminiscent of cutting off Hydra’s limbs: if one issue is fixed, two others take its place.

If our goal is to decrease the function value robustly, without user input and by a significant amount, why don’t we just estimate the best stepsize at each iteration? This idea has been around for decades under the name line search (or backtracking) and, indeed, it remains the most practical and generic solution to tackle the aforementioned issues. This direction of research started from the seminal works of Goldstein [13] and Armijo [1] and continues to attract attention, see [5, 35] and references therein.

In general, at each iteration the line search executes another subroutine with additional evaluations of $\nabla f$ and/or $f$ until some condition is met, potentially exploding the cost of a single iteration. Naturally, this makes complexity bounds no longer informative.

At the same time, the famous Polyak’s stepsize [31] stands out as a very fast alternative to gradient descent. Furthermore, it does not depend on the global smoothness constant and uses the current gradient to estimate the geometry. The formula might look deceptively simple, $\lambda_k = \frac{f(x^k)-f_*}{\|\nabla f(x^k)\|^2}$, but there is a catch: it is rarely possible to know $f_*$. This method, again, requires the user to guess $f_*$. What is more, with $\lambda$ it was fine to underestimate it by a factor of 10, but the guess for $f_*$ must be tight, otherwise it has to be reestimated later [15].
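To make the dependence on $f_*$ concrete, here is a minimal Python sketch of Polyak's rule, $(f(x)-f_*)/\|\nabla f(x)\|^2$. The function name `polyak_step` and the one-dimensional example are hypothetical; the point is only that `f_star` has to be passed in by the user.

```python
def polyak_step(f_val, f_star, g):
    """Polyak's stepsize (f(x) - f_*) / ||g||^2; f_star must be supplied by the user."""
    g_sq = sum(gi * gi for gi in g)
    # At a stationary point there is nothing to scale, so return a zero step.
    return (f_val - f_star) / g_sq if g_sq > 0.0 else 0.0


# Hypothetical example: f(x) = 0.5 * x^2 at x = 2, where f_* = 0 is known exactly.
x = [2.0]
step = polyak_step(0.5 * x[0] ** 2, 0.0, [x[0]])
x_next = [x[0] - step * x[0]]
```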

Seemingly no issue is present in the Barzilai-Borwein [3] stepsize. Motivated by the quasi-Newton schemes, the authors of [3] suggested using steps

$$\lambda_k = \frac{\langle x^k-x^{k-1},\ \nabla f(x^k)-\nabla f(x^{k-1})\rangle}{\|\nabla f(x^k)-\nabla f(x^{k-1})\|^2}.$$

Alas, the convergence results regarding this choice of $\lambda_k$ are very limited: only the case of a quadratic $f$ is well understood [32, 9]; otherwise one needs to add a line search again [33], see also the counter-example for constrained minimization in [8].
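For reference, the Barzilai-Borwein formula above translates into a few lines of Python. This is only a sketch: the name `bb_step` and the choice to return `inf` when the two gradients coincide are assumptions of this example.

```python
def bb_step(x, x_prev, g, g_prev):
    """Barzilai-Borwein stepsize <dx, dg> / ||dg||^2 from two consecutive iterates."""
    dx = [a - b for a, b in zip(x, x_prev)]
    dg = [a - b for a, b in zip(g, g_prev)]
    denom = sum(d * d for d in dg)
    if denom == 0.0:
        # Identical gradients carry no curvature information (assumed convention).
        return float("inf")
    return sum(a * b for a, b in zip(dx, dg)) / denom


# Hypothetical check on f(x) = 2 * x^2, whose gradient is 4 * x:
# the step equals the inverse curvature 1 / 4.
step = bb_step([1.0], [2.0], [4.0], [8.0])
```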

Other more interesting ways to deal with non-Lipschitzness of $\nabla f$ use the problem structure. The first method, pioneered recently in [4], shows that the mirror descent method, which is another extension of GD, can be used with a fixed stepsize whenever $f$ satisfies a certain generalization of (3). In the other paper [21], the authors proposed the dual preconditioning method, which is also a refined version of GD. Similarly to the former technique, it also goes beyond the standard smoothness assumption on $f$, but in a different way. Unfortunately, these two simple and elegant approaches cannot resolve all issues yet. First, not many functions fulfill the respective generalized conditions. And second, both methods still get us back to the problem of not knowing the allowed range of stepsizes.

A whole branch of optimization considers adaptive extensions of GD that deal with functions whose (sub)gradients are bounded. Probably the earliest work in that direction was written by Shor [36]. He showed that the method

$$x^{k+1} = x^k - \lambda_k\frac{g^k}{\|g^k\|},$$

where $g^k\in\partial f(x^k)$ is a subgradient, converges for properly chosen sequences $(\lambda_k)$, see, e.g., Section 3.2.3 in [26]. Moreover, $\lambda_k$ requires no knowledge about the function whatsoever.

Similar methods that work in online setting such as Adagrad [11, 23] received a lot of attention in recent years and remain an active topic of research [41]. Methods similar to Adagrad—Adam [18, 34], RMSprop [40] and Adadelta [43]—remain state-of-the-art for training neural networks. The corresponding objective is usually neither smooth nor convex, and the theory often assumes Lipschitzness of the function rather than of the gradients. Therefore, this direction of research is mostly orthogonal to ours, although we do compare with some of these methods in our ResNet-18 experiment.

We also note that without momentum Adam and RMSprop reduce to signSGD [6], which is known to be non-convergent for arbitrary stepsizes on a simple quadratic problem [16]. This issue seems to be related to a pathological dependence on gradient noise, see the bounds in [42, 6].

Closely related to ours is the recent work [22], which proposed an adaptive golden ratio algorithm for monotone variational inequalities. As it solves a more general problem, it does not exploit the structure of (1) and, like most variational inequality methods, has a more conservative update. Although the method estimates the smoothness, it still requires an upper bound on the stepsize as input.

##### Contribution.

We propose a new version of GD that at no cost resolves all aforementioned issues. The idea is simple, and it is surprising that it has not yet been discovered. In each iteration we choose $\lambda_k$ as a certain approximation of the inverse local Lipschitz constant. With such a choice, we prove that convexity and local smoothness of $f$ are sufficient for convergence of the iterates with the complexity $\mathcal{O}(1/k)$ for $f(x^k)-f_*$ in the worst case.

##### Discussion.

Let us now briefly discuss why we believe that proofs based on monotonicity and global smoothness lead to slower methods.

Gradient descent is by far not a recent method, so optimal rates of convergence have long been obtained for it. However, we argue that adaptive methods require rethinking the optimality of the stepsizes. Take as an example a simple quadratic problem, $f(x,y)=\frac12x^2+\frac{\delta}{2}y^2$, where $\delta\ll 1$. Clearly, the smoothness constant of this problem is equal to $L=1$ and the strong convexity one is $\mu=\delta$. If we run GD from an arbitrary point $(x^0,y^0)$ with the “optimal” stepsize $\lambda=\frac1L=1$, then one iteration of GD gives us $(x^1,y^1)=(0,(1-\delta)y^0)$, and similarly $(x^k,y^k)=(0,(1-\delta)^ky^0)$. Evidently, for $\delta$ small enough it might take a long time to converge to the solution. Instead, GD would converge in two iterations if it adjusted its step after the first iteration to $\lambda=\frac1\delta$.
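The effect is easy to reproduce numerically. The sketch below assumes the two-dimensional quadratic $f(x,y)=\frac12x^2+\frac{\delta}{2}y^2$ with a hypothetical value $\delta=10^{-3}$ and compares 100 steps of size $1/L=1$ with two steps of sizes $1$ and $1/\delta$; the helper `run` is introduced only for this illustration.

```python
delta = 1e-3  # hypothetical value of the small curvature


def grad(p):
    """Gradient of f(x, y) = 0.5 * x^2 + 0.5 * delta * y^2."""
    return (p[0], delta * p[1])


def run(steps, p0=(1.0, 1.0)):
    """Apply one gradient step per entry of `steps`, starting from p0."""
    p = p0
    for lam in steps:
        g = grad(p)
        p = (p[0] - lam * g[0], p[1] - lam * g[1])
    return p


fixed = run([1.0] * 100)             # lambda = 1 / L at every iteration
adjusted = run([1.0, 1.0 / delta])   # switch to 1 / delta after the first step
```

With the fixed step the second coordinate has barely moved after 100 iterations, while the adjusted run reaches the solution up to rounding after two.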

Nevertheless, all existing analyses of gradient descent with $L$-smooth $f$ use stepsizes bounded by $\frac{2}{L}$. In addition, the analysis based on functional values gives

$$f(x^{k+1}) \le f(x^k) - \lambda\Bigl(1-\frac{\lambda L}{2}\Bigr)\|\nabla f(x^k)\|^2,$$

from which $\lambda=\frac1L$ can be seen as the “optimal” stepsize. Alternatively, we can assume that $f$ is $\mu$-strongly convex, and the analysis in norms gives

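$$\|x^{k+1}-x^*\|^2 \le \Bigl(1-\frac{2\lambda L\mu}{L+\mu}\Bigr)\|x^k-x^*\|^2 - \lambda\Bigl(\frac{2}{L+\mu}-\lambda\Bigr)\|\nabla f(x^k)-\nabla f(x^*)\|^2,$$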

from which the “optimal” step is $\lambda=\frac{2}{L+\mu}$.

Finally, line search procedures rely on some type of monotonicity, for instance ensuring that $f(x^{k+1})\le f(x^k)-c\|\nabla f(x^k)\|^2$ for some $c>0$. We break with this tradition and merely ask for convergence in the end.

### 2 Main part

#### 2.1 Local smoothness of f

Recall that a mapping is locally Lipschitz if it is Lipschitz over any bounded set. It is natural to ask whether some interesting functions are smooth locally, but not globally.

It turns out there is no shortage of examples, most prominently among highly nonlinear functions. In $\mathbb{R}$, they include $x\mapsto\exp(x)$, $\log(x)$, $\tan(x)$, $x^p$ for $p>2$, etc. More generally, they include any twice continuously differentiable $f$, since $\nabla^2f$, as a continuous mapping, is bounded over any bounded set $\mathcal{C}$. In this case, we have that $\nabla f$ is Lipschitz on $\mathcal{C}$, due to the mean value inequality

$$\|\nabla f(x)-\nabla f(y)\| \le \max_{z\in\mathcal{C}}\|\nabla^2f(z)\|\,\|x-y\|, \quad \forall x,y\in\mathcal{C}.$$

Algorithm 1 that we propose is just a slight modification of GD. The quick explanation why local Lipschitzness of $\nabla f$ does not cause us any problems, unlike most other methods, lies in the way we prove its convergence. (It can be shown that instead of the second condition below it is enough to ask for $\lambda_k^2\le\frac{\|x^k-x^{k-1}\|^2}{[3\|\nabla f(x^k)\|^2-4\langle\nabla f(x^k),\nabla f(x^{k-1})\rangle]_+}$, where $[a]_+=\max\{0,a\}$, but we prefer the option written in the main text for its simplicity.) Whenever the stepsize $\lambda_k$ satisfies the two inequalities

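$$\begin{cases}\lambda_k^2 \le (1+\theta_{k-1})\lambda_{k-1}^2,\\[2pt] \lambda_k \le \dfrac{\|x^k-x^{k-1}\|}{2\|\nabla f(x^k)-\nabla f(x^{k-1})\|},\end{cases}$$

independently of the properties of $f$ (apart from convexity), we can show that the iterates $(x^k)$ remain bounded. Here $\theta_k=\lambda_k/\lambda_{k-1}$, and here and everywhere else we use the convention $\frac10=+\infty$, so if $\nabla f(x^k)-\nabla f(x^{k-1})=0$, the second inequality can be ignored.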

In the introduction we stated that we are interested in algorithms as independent from the user as possible. However, Algorithm 1 still needs $x^0$ and $\lambda_0$ as input. This is not an issue, as one can simply fix $x^0=0$ and $\lambda_0=10^{-10}$. Equipped with a tiny $\lambda_0$, we ensure that $x^1$ will be close enough to $x^0$ and will likely give a good estimate for $\lambda_1$. Otherwise, this has no influence on the further steps.
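The two stepsize rules can be turned into code almost verbatim. Below is a minimal Python sketch of gradient descent driven by them, assuming $\theta_k=\lambda_k/\lambda_{k-1}$, $\theta_0=+\infty$ and the stepsize taken as the largest value allowed by both inequalities. The names `adaptive_gd` and `norm`, the fallback when both bounds are infinite, the early exit at a zero gradient, and the test quadratic are choices of this sketch rather than part of the text.

```python
import math


def norm(v):
    return math.sqrt(sum(vi * vi for vi in v))


def adaptive_gd(grad, x0, iters, lam0=1e-10):
    """Gradient descent with the two-rule stepsize; returns iterates and stepsizes."""
    xs, lams = [list(x0)], [lam0]
    g_prev = grad(xs[0])
    xs.append([xi - lam0 * gi for xi, gi in zip(xs[0], g_prev)])
    theta = math.inf                      # theta_0 = +inf: rule 1 is void at k = 1
    for _ in range(iters):
        x, x_prev, lam_prev = xs[-1], xs[-2], lams[-1]
        g = grad(x)
        if norm(g) == 0.0:                # exact stationary point, nothing left to do
            break
        dx = norm([a - b for a, b in zip(x, x_prev)])
        dg = norm([a - b for a, b in zip(g, g_prev)])
        growth = math.sqrt(1.0 + theta) * lam_prev              # rule 1
        curvature = dx / (2.0 * dg) if dg > 0.0 else math.inf   # rule 2, with 1/0 = +inf
        lam = min(growth, curvature)
        if math.isinf(lam):               # assumption of this sketch: keep the old step
            lam = lam_prev
        xs.append([xi - lam * gi for xi, gi in zip(x, g)])
        theta, g_prev = lam / lam_prev, g
        lams.append(lam)
    return xs, lams


# Hypothetical quadratic f(x) = 0.5 * sum(a_i * x_i^2) with L = max(a_i) = 1.
a = [1.0, 0.1, 0.01]


def f(x):
    return 0.5 * sum(ai * xi * xi for ai, xi in zip(a, x))


xs, lams = adaptive_gd(lambda x: [ai * xi for ai, xi in zip(a, x)], [1.0, 1.0, 1.0], 200)
```

Note that no smoothness constant is passed in: the only inputs are the gradient oracle, the starting point and the tiny initial step.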

#### 2.2 Analysis without descent

It is now time to show our main contribution, the new analysis technique. The tools that we are going to use are the well-known Cauchy-Schwarz and convexity inequalities. In addition, our methods are related to potential functions [38], which is a powerful tool for producing tight bounds for GD.

Another divergence from the common practice is that our main lemma includes not only $x^{k+1}$ and $x^k$, but also $x^{k-1}$. This can be seen as a two-step analysis, while the majority of optimization methods have one-step bounds. However, as we want to adapt to the local geometry of our objective, it is rather natural to have two terms to capture the change in the gradients.

Now, it is time to derive a characteristic inequality for a specific Lyapunov energy.

###### Lemma 1.

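Let $x^*$ be any solution of (1). Then for $(x^k)$ generated by Algorithm 1 it holds

$$\|x^{k+1}-x^*\|^2+\frac12\|x^{k+1}-x^k\|^2+2\lambda_k(1+\theta_k)\bigl(f(x^k)-f_*\bigr) \le \|x^k-x^*\|^2+\frac12\|x^k-x^{k-1}\|^2+2\lambda_k\theta_k\bigl(f(x^{k-1})-f_*\bigr). \tag{5}$$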
###### Proof.

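Let $k\ge1$. We start from the standard way of analyzing GD:

$$\begin{aligned}\|x^{k+1}-x^*\|^2 &= \|x^k-x^*\|^2+2\langle x^{k+1}-x^k,x^k-x^*\rangle+\|x^{k+1}-x^k\|^2\\ &= \|x^k-x^*\|^2-2\lambda_k\langle\nabla f(x^k),x^k-x^*\rangle+\|x^{k+1}-x^k\|^2.\end{aligned}$$

As usual, we can bound the scalar product by convexity of $f$:

$$-2\lambda_k\langle\nabla f(x^k),x^k-x^*\rangle \le -2\lambda_k\bigl(f(x^k)-f_*\bigr), \tag{6}$$

which gives us

$$\|x^{k+1}-x^*\|^2 \le \|x^k-x^*\|^2-2\lambda_k\bigl(f(x^k)-f_*\bigr)+\|x^{k+1}-x^k\|^2. \tag{7}$$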

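These two steps have been repeated thousands of times, but now we continue in a completely different manner. We have precisely one “bad” term in (7), which is $\|x^{k+1}-x^k\|^2$. The reason it appears lies in the fact that gradients change from point to point, so we will bound it using the difference of gradients:

$$\begin{aligned}\|x^{k+1}-x^k\|^2 &= 2\|x^{k+1}-x^k\|^2-\|x^{k+1}-x^k\|^2\\ &= -2\lambda_k\langle\nabla f(x^k),x^{k+1}-x^k\rangle-\|x^{k+1}-x^k\|^2\\ &= 2\lambda_k\langle\nabla f(x^k)-\nabla f(x^{k-1}),x^k-x^{k+1}\rangle+2\lambda_k\langle\nabla f(x^{k-1}),x^k-x^{k+1}\rangle-\|x^{k+1}-x^k\|^2.\end{aligned} \tag{8}$$

Let us estimate the first two terms in the right-hand side above. First, the definition of $\lambda_k$, followed by the Cauchy-Schwarz and Young inequalities, yields

$$\begin{aligned}2\lambda_k\langle\nabla f(x^k)-\nabla f(x^{k-1}),x^k-x^{k+1}\rangle &\le 2\lambda_k\|\nabla f(x^k)-\nabla f(x^{k-1})\|\,\|x^k-x^{k+1}\|\\ &\le \|x^k-x^{k-1}\|\,\|x^k-x^{k+1}\|\\ &\le \frac12\|x^k-x^{k-1}\|^2+\frac12\|x^{k+1}-x^k\|^2.\end{aligned} \tag{9}$$

Second, by convexity of $f$,

$$\begin{aligned}2\lambda_k\langle\nabla f(x^{k-1}),x^k-x^{k+1}\rangle &= \frac{2\lambda_k}{\lambda_{k-1}}\langle x^{k-1}-x^k,x^k-x^{k+1}\rangle\\ &= 2\lambda_k\theta_k\langle x^{k-1}-x^k,\nabla f(x^k)\rangle \le 2\lambda_k\theta_k\bigl(f(x^{k-1})-f(x^k)\bigr).\end{aligned} \tag{10}$$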

Plugging (9) and (10) in (8), we obtain

$$\|x^{k+1}-x^k\|^2 \le \frac12\|x^k-x^{k-1}\|^2-\frac12\|x^{k+1}-x^k\|^2+2\lambda_k\theta_k\bigl(f(x^{k-1})-f(x^k)\bigr).$$

Finally, using the produced estimate for $\|x^{k+1}-x^k\|^2$ in (7), we deduce the desired inequality (5). ∎
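For readers who want the last substitution spelled out, the following is a sketch of the final step. It uses only (7), the bound on $\|x^{k+1}-x^k\|^2$ that results from (8), (9) and (10) (note that the term $-\|x^{k+1}-x^k\|^2$ from (8) turns the $+\frac12$ of (9) into $-\frac12$), and the identity $f(x^{k-1})-f(x^k)=(f(x^{k-1})-f_*)-(f(x^k)-f_*)$.

```latex
\begin{align*}
\|x^{k+1}-x^*\|^2
  &\le \|x^k-x^*\|^2 - 2\lambda_k\bigl(f(x^k)-f_*\bigr)
     + \tfrac12\|x^k-x^{k-1}\|^2 - \tfrac12\|x^{k+1}-x^k\|^2 \\
  &\quad + 2\lambda_k\theta_k\bigl(f(x^{k-1})-f_*\bigr)
     - 2\lambda_k\theta_k\bigl(f(x^k)-f_*\bigr).
\end{align*}
```

Moving $\frac12\|x^{k+1}-x^k\|^2$ and both terms containing $f(x^k)-f_*$ to the left-hand side gives exactly (5).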

The above lemma might already give a good hint why our method works. From inequality (5) together with the condition $\lambda_k^2\le(1+\theta_{k-1})\lambda_{k-1}^2$, we obtain that the Lyapunov energy (the left-hand side of (5)) is decreasing. This gives us boundedness of $(x^k)$, which is often the key ingredient for proving convergence. In the next theorem we formally state our result.

###### Theorem 1.

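Suppose that $f\colon\mathbb{R}^d\to\mathbb{R}$ is convex with locally Lipschitz gradient $\nabla f$. Then $(x^k)$ generated by Algorithm 1 converges to a solution of (1) and we have that

$$f(\hat x^k)-f_* \le \frac{D}{2S_k} = \mathcal{O}\Bigl(\frac1k\Bigr),$$

where $\hat x^k=\frac{1}{S_k}\bigl(\lambda_k(1+\theta_k)x^k+\sum_{i=1}^{k-1}w_ix^i\bigr)$ with $w_i=\lambda_i(1+\theta_i)-\lambda_{i+1}\theta_{i+1}$, $S_k=\sum_{i=1}^k\lambda_i+\lambda_1\theta_1$, and $D$ is a constant that explicitly depends on the initial data and the solution set.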

Our proof will consist of two parts. The first one is a straightforward application of Lemma 1, from which we derive boundedness of $(x^k)$ and the complexity result. Due to its conciseness, we provide it directly after this remark. In the second part, we prove that the whole sequence $(x^k)$ converges to a solution. Surprisingly, this part is a bit more technical than expected, and thus we postpone it to the appendix.

###### Proof.

(Boundedness and complexity result).

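Fix any $x^*$ from the solution set of (1). Telescoping inequality (5), we deduce

$$\begin{aligned}&\|x^{k+1}-x^*\|^2+\frac12\|x^{k+1}-x^k\|^2+2\lambda_k(1+\theta_k)\bigl(f(x^k)-f_*\bigr)\\ &\qquad+2\sum_{i=1}^{k-1}\bigl[\lambda_i(1+\theta_i)-\lambda_{i+1}\theta_{i+1}\bigr]\bigl(f(x^i)-f_*\bigr)\\ &\le \|x^1-x^*\|^2+\frac12\|x^1-x^0\|^2+2\lambda_1\theta_1\bigl[f(x^0)-f_*\bigr] =: D.\end{aligned} \tag{11}$$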

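Note that by definition of $\lambda_k$, the second line above is always nonnegative. Thus, the sequence $(x^k)$ is bounded. Since $\nabla f$ is locally Lipschitz, it is Lipschitz continuous on bounded sets. It means that for the bounded set $\mathcal{C}=\overline{\mathrm{conv}}\{x^*,x^0,x^1,\dots\}$ (the closed convex hull of a bounded set remains bounded) there exists $L>0$ such that

$$\|\nabla f(x)-\nabla f(y)\| \le L\|x-y\| \quad \forall x,y\in\mathcal{C}.$$

Clearly, $\lambda_1=\frac{\|x^1-x^0\|}{2\|\nabla f(x^1)-\nabla f(x^0)\|}\ge\frac{1}{2L}$; thus, by induction one can prove that $\lambda_k\ge\frac{1}{2L}$. In other words, the sequence $(\lambda_k)$ is separated from zero.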

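Now we want to apply the Jensen inequality for the sum of all terms $f(x^i)-f_*$ in the left-hand side of (11). Notice that the total sum of coefficients at these terms is

$$\lambda_k(1+\theta_k)+\sum_{i=1}^{k-1}\bigl[\lambda_i(1+\theta_i)-\lambda_{i+1}\theta_{i+1}\bigr] = \sum_{i=1}^k\lambda_i+\lambda_1\theta_1 = S_k.$$

Thus, by Jensen’s inequality,

$$\frac D2 \ge \frac{\text{LHS of (11)}}{2} \ge S_k\bigl(f(\hat x^k)-f_*\bigr),$$

where $\hat x^k=\frac{1}{S_k}\Bigl(\lambda_k(1+\theta_k)x^k+\sum_{i=1}^{k-1}\bigl[\lambda_i(1+\theta_i)-\lambda_{i+1}\theta_{i+1}\bigr]x^i\Bigr)$. The first part of the proof is complete. ∎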

As we have shown that $\lambda_i\ge\frac{1}{2L}$ for all $i$, we have a theoretical upper bound $f(\hat x^k)-f_*\le\frac{DL}{k}$. Note, however, that in practice $\lambda_i$ might be much larger than the pessimistic lower bound $\frac{1}{2L}$, which we observe in our experiments together with a faster convergence.
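The weights behind $\hat x^k$ and $S_k$ are simple to compute from the stepsizes alone. The following Python sketch assumes that the lists `xs` and `lams` hold $x^0,\dots,x^k$ and $\lambda_0,\dots,\lambda_k$, and that $\theta_i=\lambda_i/\lambda_{i-1}$; the function name `ergodic_average` and the toy data are hypothetical.

```python
def ergodic_average(xs, lams):
    """Return (x_hat^k, S_k) for iterates xs[0..k] and stepsizes lams[0..k]."""
    k = len(lams) - 1
    theta = [0.0] + [lams[i] / lams[i - 1] for i in range(1, k + 1)]  # theta[0] unused
    w = [0.0] * (k + 1)
    for i in range(1, k):
        # Coefficient of f(x^i) - f_* in the telescoped inequality (11).
        w[i] = lams[i] * (1.0 + theta[i]) - lams[i + 1] * theta[i + 1]
    w[k] = lams[k] * (1.0 + theta[k])
    S = sum(w[1:])
    dim = len(xs[0])
    x_hat = [sum(w[i] * xs[i][j] for i in range(1, k + 1)) / S for j in range(dim)]
    return x_hat, S


# Toy data: constant stepsizes, so theta_i = 1 and S_k = sum(lambda_i) + lambda_1.
xs = [[0.0], [1.0], [2.0], [3.0]]
lams = [0.5, 0.5, 0.5, 0.5]
x_hat, S = ergodic_average(xs, lams)
```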

#### 2.3 f is L-smooth

The proof demonstrated above is novel even when $f$ is $L$-smooth and we use a fixed stepsize. A simple distinguishing feature of it is that we did not use the descent lemma. This has, however, a drawback, as with a fixed stepsize our analysis only allows one to take $\lambda_k\le\frac{1}{2L}$ instead of the larger steps permitted by the classical theory. The natural question is then whether one can improve our algorithm and/or its analysis provided that $f$ is $L$-smooth and $L$ is known.

In this case we propose Algorithm 2, which is just a slight modification of the previous one. Interestingly, Lemma 1 remains valid and we get the next theorem (the proofs are in the appendix).

###### Theorem 2.

Let $f$ be convex and $L$-smooth. Then for $(x^k)$ generated by Algorithm 2 inequality (5) holds. As a corollary, it holds for some ergodic vector $\hat x^k$ that $f(\hat x^k)-f_*=\mathcal{O}\bigl(\frac1k\bigr)$.

Clearly, analysis of Algorithm 2 now allows one to take any fixed step .

#### 2.4 f is μ-strongly convex

Since one of our goals is to make optimization easy to use, we believe that a good method should have state-of-the-art guarantees in various scenarios. For strongly convex functions, this means that we want to see linear convergence, which is not covered by normalized GD or online methods. In Section 2.2 we have shown that Algorithm 1 matches the complexity of GD on convex problems. Now we show that it also matches the complexity of GD when $f$ is $\mu$-strongly convex.

For simplicity of the proof, instead of using the bound $\lambda_k\le\sqrt{1+\theta_{k-1}}\,\lambda_{k-1}$ as in step 3 of Algorithm 1 we will use a more conservative bound $\lambda_k\le\sqrt{1+\frac{\theta_{k-1}}{2}}\,\lambda_{k-1}$ (otherwise the derivation would be too technical). It is clear that with such a change Theorem 1 still holds true, so the sequence $(x^k)$ is bounded and we can rely on local smoothness and local strong convexity.

###### Theorem 3.

Suppose that $f$ is locally strongly convex and $\nabla f$ is locally Lipschitz. Then $(x^k)$ generated by Algorithm 1 (with the modification mentioned above) converges to the solution $x^*$ of (1). The complexity to get $\|x^k-x^*\|^2\le\varepsilon$ is $\mathcal{O}\bigl(\kappa\log\frac1\varepsilon\bigr)$, where $\kappa=\frac L\mu$ and $L$, $\mu$ are the smoothness and strong convexity constants of $f$ on the set $\mathcal{C}=\overline{\mathrm{conv}}\{x^*,x^0,x^1,\dots\}$.

We want to highlight that our rate depends on the local Lipschitz and strong convexity constants $L$ and $\mu$, which is meaningful even when these properties are not satisfied globally. Similarly, if $f$ is globally smooth and strongly convex, our rate is still faster as it depends on the smaller local constant.

### 3 Heuristics

In this section we describe several extensions of our method. We do not have full theory for them, but believe that they are of interest in applications.

#### 3.1 Acceleration

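Suppose that $f$ is $\mu$-strongly convex. One version of the accelerated gradient method proposed by Nesterov [26] is

$$\begin{aligned}y^{k+1} &= x^k-\frac1L\nabla f(x^k),\\ x^{k+1} &= y^{k+1}+\beta(y^{k+1}-y^k),\end{aligned}$$

where $\beta=\frac{\sqrt L-\sqrt\mu}{\sqrt L+\sqrt\mu}$. We know that $\frac1L$ in gradient descent can be efficiently estimated using the scheme

$$\lambda_k=\min\Bigl\{\sqrt{1+\theta_{k-1}}\,\lambda_{k-1},\ \frac{\|x^k-x^{k-1}\|}{2\|\nabla f(x^k)-\nabla f(x^{k-1})\|}\Bigr\}.$$

What about the strong convexity constant $\mu$? We know that it equals the inverse smoothness constant of the conjugate $f^*(y)=\sup_x\{\langle x,y\rangle-f(x)\}$. Thus, it is tempting to estimate this inverse constant just as we estimated the inverse smoothness of $f$, i.e., by the formula

$$\Lambda_k=\min\Bigl\{\sqrt{1+\Theta_{k-1}}\,\Lambda_{k-1},\ \frac{\|p^k-p^{k-1}\|}{2\|\nabla f^*(p^k)-\nabla f^*(p^{k-1})\|}\Bigr\},$$

where $p^k$ and $p^{k-1}$ are some elements of the dual space and $\Theta_k=\frac{\Lambda_k}{\Lambda_{k-1}}$. A natural choice then is $p^k=\nabla f(x^k)$ since it is an element of the dual space that we use. What is the value of $\nabla f^*(p^k)$? It is well known that $\nabla f^*(\nabla f(x))=x$, so we come up with the update rule

$$\Lambda_k=\min\Bigl\{\sqrt{1+\Theta_{k-1}}\,\Lambda_{k-1},\ \frac{\|\nabla f(x^k)-\nabla f(x^{k-1})\|}{2\|x^k-x^{k-1}\|}\Bigr\},$$

and hence we can estimate $\beta$ by $\beta_k=\frac{\sqrt{1/\lambda_k}-\sqrt{\Lambda_k}}{\sqrt{1/\lambda_k}+\sqrt{\Lambda_k}}$.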

We summarize our arguments in Algorithm 3. Unfortunately, we do not have any theoretical guarantees for it.
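To illustrate how the two estimates combine, here is a hedged Python sketch of a single parameter update. It assumes that $1/\lambda_k$ stands in for $L$ and $\Lambda_k$ for $\mu$ in Nesterov's momentum coefficient, and that both norms passed in are positive; the function name `accelerated_parameters` and its argument names are hypothetical, and the sketch is not meant to reproduce Algorithm 3 line by line.

```python
import math


def accelerated_parameters(lam_prev, theta_prev, Lam_prev, Theta_prev, dx_norm, dg_norm):
    """One update of (lambda_k, Lambda_k, beta_k).

    dx_norm = ||x^k - x^{k-1}|| and dg_norm = ||grad f(x^k) - grad f(x^{k-1})||,
    both assumed strictly positive in this sketch.
    """
    # Estimate of the inverse smoothness constant.
    lam = min(math.sqrt(1.0 + theta_prev) * lam_prev, dx_norm / (2.0 * dg_norm))
    # Estimate of the strong convexity constant (same rule with the ratio inverted).
    Lam = min(math.sqrt(1.0 + Theta_prev) * Lam_prev, dg_norm / (2.0 * dx_norm))
    root_L, root_mu = math.sqrt(1.0 / lam), math.sqrt(Lam)
    beta = (root_L - root_mu) / (root_L + root_mu)
    return lam, Lam, beta


# Hypothetical numbers chosen so that the square roots are exact.
lam, Lam, beta = accelerated_parameters(1.0, 3.0, 0.5, 3.0, dx_norm=1.0, dg_norm=4.0)
```

Whenever both estimates are set by the gradient-difference ratio, $1/\lambda_k\ge4\Lambda_k$, so the resulting momentum satisfies $\frac13\le\beta_k<1$.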

Estimating the strong convexity parameter is very important in practice. The most common approaches rely on restarting techniques, proposed by Nesterov [25], see also [12] and references therein. Unlike Algorithm 3, these works have theoretical guarantees; however, the methods themselves are more complicated and still require tuning of other unknown parameters.

#### 3.2 Uniting our steps with stochastic gradients

Here we would like to discuss applications of our method to the problem

$$\min_x \mathbb{E}\bigl[f_\xi(x)\bigr],$$

where $f_\xi$ is almost surely $L$-smooth and $\mu$-strongly convex. Assume that at each iteration we get a sample $\xi^k$ to make a stochastic gradient step,

$$x^{k+1}=x^k-\lambda_k\nabla f_{\xi^k}(x^k).$$

Then, we have two ways of incorporating our stepsize into SGD: by using in the estimate of , or using an extra sample . If we use , then we estimate , and compute and similarly we do if we used . Clearly, the update with will not be unbiased, although it is more intuitive to estimate smoothness on the same function that we use in the update. Surprisingly, the option with a biased estimate performed much better in our experiments. Full algorithm formulation is given in Algorithm 4.

The theorem below provides convergence guarantees for both cases, but with different assumptions. Unfortunately, we cannot guarantee similarly to Theorem 1 that one can use .

###### Theorem 4.

Let be -smooth and -strongly convex almost surely. Assuming and estimating with , the complexity to get is not worse than . Furthermore, if the model is overparameterized, i.e.,  almost surely, then one can estimate with and the complexity is .

Note that in both cases we match the known dependency on up to logarithmic terms, but we get an extra as the price for adaptive estimation of the stepsize.

#### 3.3 Decreasing stepsizes in SGD

Another potential application of our techniques is estimation of decreasing stepsizes in SGD. Namely, the best known rates for SGD [37], which are optimal up to some constants, are obtained using $\lambda_k$ that decreases as $\mathcal{O}(1/k)$. For instance, [37] proposes to use stepsize where , and is the total number of iterations. This requires estimates of both smoothness and strong convexity, which can be borrowed from the previous discussion. We leave a rigorous proof of such schemes for future work.

### 4 Experiments

In the experiments (the code is available at https://github.com/ymalitsky/adaptive_gd), we compare our approach with the two most related methods: GD and Nesterov’s acceleration of GD for convex functions [24] (AGD). We always tune the stepsize for GD and AGD unless $L$ is known. If no finite $L$ exists, we still run the methods and observe their convergence even though it is not backed by theory.

First, we consider probably the simplest possible optimization problem—minimization of the quadratic , for in different settings. We run experiments for three scenarios: 1) with a Gaussian matrix ; 2) , where ; 3) with the Hilbert matrix : . In the experiments we used and random initialization. The results are presented in Figure 1. This problem illustrates well that the “convergence rates are in the eye of a practitioner”.

##### Logistic regression.

The $\ell_2$-regularized logistic regression objective is given by $f(x)=\frac1n\sum_{i=1}^n\log\bigl(1+\exp(-b_ia_i^\top x)\bigr)+\frac\gamma2\|x\|^2$, where $n$ is the number of observations and $(a_i,b_i)\in\mathbb{R}^d\times\mathbb{R}$, $i=1,\dots,n$, are the observations. We use ‘mushrooms’ and ‘covtype’ datasets to run the experiments. The results are provided in Figure 2. We choose the amount of regularization $\gamma$ proportionally to $\frac1n$, as is often done in practice. Since we have closed-form expressions to estimate $L$ and we know $\mu$, we use them directly in GD and its acceleration.

##### Matrix factorization.

Given a matrix $A\in\mathbb{R}^{m\times n}$ and $r<\min\{m,n\}$, we want to solve $\min_{U,V}f(U,V)=\frac12\|UV^\top-A\|_F^2$ for $U\in\mathbb{R}^{m\times r}$ and $V\in\mathbb{R}^{n\times r}$. This is useful when we want to approximate $A$ by some low rank matrix. It is a nonconvex problem; moreover, the gradient $\nabla f$ is not globally Lipschitz. With some tuning one can still apply GD and Nesterov’s accelerated GD, but (and we want to emphasize it) it was not a trivial thing to find the steps in practice. The steps we have chosen were in some sense optimal, i.e., GD and AGD did not converge if we doubled the steps. Interestingly, for both these methods their “optimal” steps were quite different. In contrast, our methods do not require any tuning, so even in this regard they are much more practical. For the experiments we used the MovieLens 100K dataset [14] with more than a million entries and several values of $r$. All algorithms were initialized at the same point, chosen randomly. The results are presented in Figure 3.
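For concreteness, the factorization objective $\frac12\|UV^\top-A\|_F^2$ and its partial gradients $(UV^\top-A)V$ and $(UV^\top-A)^\top U$ can be sketched in plain Python as follows. The helper names (`matmul`, `transpose`, `mf_loss_and_grads`) and the tiny rank-one example are assumptions of this illustration and are unrelated to the actual experimental code.

```python
def matmul(X, Y):
    """Product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def transpose(X):
    return [list(col) for col in zip(*X)]


def mf_loss_and_grads(U, V, A):
    """f(U, V) = 0.5 * ||U V^T - A||_F^2 together with its gradients in U and V."""
    P = matmul(U, transpose(V))
    R = [[p - a for p, a in zip(prow, arow)] for prow, arow in zip(P, A)]  # residual
    loss = 0.5 * sum(r * r for row in R for r in row)
    return loss, matmul(R, V), matmul(transpose(R), U)


# Tiny rank-one example: U V^T = [[3, 4], [6, 8]].
U = [[1.0], [2.0]]
V = [[3.0], [4.0]]
loss_exact, gU_exact, gV_exact = mf_loss_and_grads(U, V, [[3.0, 4.0], [6.0, 8.0]])
loss_zero, gU_zero, gV_zero = mf_loss_and_grads(U, V, [[0.0, 0.0], [0.0, 0.0]])
```

The gradient is cubic in the entries of $U$ and $V$, which is why no global Lipschitz constant exists for it.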

##### Neural network.

We use the standard ResNet-18 architecture implemented in PyTorch [29] and train it to classify images from the CIFAR-10 dataset [19] with cross-entropy loss. We use batch size 128 for all methods. For Adam and Adadelta, we took the default parameters, and learning rate for SGD.

For our method, we observed that works better than . In addition, we ran it with and in the other factor and the latter performed the best. For reference, we provide the result for the theoretical estimate as well. We did not test any other values than the specified combinations, so we imagine that some other variation can perform better. We used the variant of SGD with computed using as well. See Figure 4 for the results.

##### Cubic regularization.

In cubic regularization of the Newton method [27], at each iteration we need to minimize $f(x)=g^\top x+\frac12x^\top Hx+\frac M6\|x\|^3$, where $g\in\mathbb{R}^d$, $H\in\mathbb{R}^{d\times d}$ and $M>0$ are given. This objective is smooth only locally due to the cubic term, which is our motivation to consider it. Here $g$ and $H$ were the gradient and the Hessian of the logistic loss with the Covtype dataset, evaluated at $x=0$. Although the values of $M$ led to similar results, they also required different numbers of iterations, so we present the corresponding results in Figure 5.
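The cubic model is a convenient test function because its gradient is available in closed form. The Python sketch below assumes the standard form $g^\top x+\frac12x^\top Hx+\frac M6\|x\|^3$, whose gradient is $g+Hx+\frac M2\|x\|x$; the function name `cubic_model` and the one-dimensional numbers are hypothetical.

```python
import math


def cubic_model(x, g, H, M):
    """Value and gradient of g^T x + 0.5 * x^T H x + (M / 6) * ||x||^3."""
    Hx = [sum(h * xj for h, xj in zip(row, x)) for row in H]
    nx = math.sqrt(sum(xi * xi for xi in x))
    value = (
        sum(gi * xi for gi, xi in zip(g, x))
        + 0.5 * sum(xi * hxi for xi, hxi in zip(x, Hx))
        + M / 6.0 * nx ** 3
    )
    # The cubic term contributes (M / 2) * ||x|| * x, which grows quadratically in ||x||.
    grad = [gi + hxi + 0.5 * M * nx * xi for gi, hxi, xi in zip(g, Hx, x)]
    return value, grad


# Hypothetical one-dimensional instance: g = 1, H = 3, M = 6 at x = 2.
value, grad = cubic_model([2.0], [1.0], [[3.0]], 6.0)
```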

### 5 Perspectives

We briefly provide a few directions which we personally consider to be important and challenging.

1. Nonconvex case. A great challenge for us is to understand theoretical guarantees of the proposed method in the nonconvex settings. We are not aware of any generic first-order method for nonconvex optimization that does not rely on the descent lemma (or its generalization), see, e.g., [2].

2. Performance estimation. In our experiments we often observed much better performance of Algorithm 1 than of GD or AGD. However, the theoretical rate we can show coincides with that of GD. The challenge here is to bridge this gap and we hope that the approach pioneered by [10] and further developed in [39, 17, 38] has the potential to do that.

3. Composite minimization. In classical first-order methods the transition from smooth to composite minimization [25] is rather straightforward. Unfortunately, the proposed proof of Algorithm 1 does not seem to provide any route for generalization and we hope there is some way of resolving this issue.

4. Stochastic optimization. The derived bounds for the stochastic case are not satisfactory and have suboptimal dependency on . However, it is not clear to us whether one can extend the techniques from the deterministic analysis to improve the rate.

5. Heuristics. Finally, we want to have some solid ground in understanding the performance of the proposed heuristics.

##### Acknowledgments.

Yura Malitsky wishes to thank Roman Cheplyaka for his interest in optimization that partly inspired the current work.

### References

• [1] L. Armijo, Minimization of functions having Lipschitz continuous first partial derivatives, Pacific Journal of Mathematics, 16 (1966), pp. 1–3.
• [2] H. Attouch, J. Bolte, and B. F. Svaiter, Convergence of descent methods for semi-algebraic and tame problems: proximal algorithms, forward–backward splitting, and regularized Gauss–Seidel methods, Mathematical Programming, 137 (2013), pp. 91–129.
• [3] J. Barzilai and J. M. Borwein, Two-point step size gradient methods, IMA Journal of Numerical Analysis, 8 (1988), pp. 141–148.
• [4] H. H. Bauschke, J. Bolte, and M. Teboulle, A descent lemma beyond Lipschitz gradient continuity: first-order methods revisited and applications, Mathematics of Operations Research, 42 (2016), pp. 330–348.
• [5] J. Y. Bello Cruz and T. T. Nghia, On the convergence of the forward–backward splitting method with linesearches, Optimization Methods and Software, 31 (2016), pp. 1209–1238.
• [6] J. Bernstein, Y.-X. Wang, K. Azizzadenesheli, and A. Anandkumar, signSGD: Compressed optimisation for non-convex problems, in International Conference on Machine Learning, 2018, pp. 559–568.
• [7] A. Cauchy, Méthode générale pour la résolution des systemes d’équations simultanées, Comp. Rend. Sci. Paris, 25 (1847), pp. 536–538.
• [8] Y.-H. Dai and R. Fletcher, Projected Barzilai-Borwein methods for large-scale box-constrained quadratic programming, Numerische Mathematik, 100 (2005), pp. 21–47.
• [9] Y.-H. Dai and L.-Z. Liao, R-linear convergence of the Barzilai and Borwein gradient method, IMA Journal of Numerical Analysis, 22 (2002), pp. 1–10.
• [10] Y. Drori and M. Teboulle, Performance of first-order methods for smooth convex minimization: a novel approach, Mathematical Programming, 145 (2014), pp. 451–482.
• [11] J. Duchi, E. Hazan, and Y. Singer, Adaptive subgradient methods for online learning and stochastic optimization, Journal of Machine Learning Research, 12 (2011), pp. 2121–2159.
• [12] O. Fercoq and Z. Qu, Adaptive restart of accelerated gradient methods under local quadratic growth condition, arXiv preprint arXiv:1709.02300, (2017).
• [13] A. Goldstein, Cauchy’s method of minimization, Numerische Mathematik, 4 (1962), pp. 146–150.
• [14] F. M. Harper and J. A. Konstan, The movielens datasets: History and context, ACM transactions on interactive intelligent systems (tiis), 5 (2016), p. 19.
• [15] E. Hazan and S. Kakade, Revisiting the Polyak step size, arXiv preprint arXiv:1905.00313, (2019).
• [16] S. P. Karimireddy, Q. Rebjock, S. Stich, and M. Jaggi, Error feedback fixes signsgd and other gradient compression schemes, in International Conference on Machine Learning, 2019, pp. 3252–3261.
• [17] D. Kim and J. A. Fessler, Optimized first-order methods for smooth convex minimization, Mathematical programming, 159 (2016), pp. 81–107.
• [18] D. Kingma and J. Ba, Adam: A method for stochastic optimization, International Conference on Learning Representations, (2014).
• [19] A. Krizhevsky, G. Hinton, et al., Learning multiple layers of features from tiny images, tech. rep., Citeseer, 2009.
• [20] C. Lemaréchal, Cauchy and the gradient method, Doc Math Extra, 251 (2012), p. 254.
• [21] C. J. Maddison, D. Paulin, Y. W. Teh, and A. Doucet, Dual space preconditioning for gradient descent, arXiv preprint arXiv:1902.02257, (2019).
• [22] Y. Malitsky, Golden ratio algorithms for variational inequalities, Mathematical Programming, (2019).
• [23] H. B. McMahan and M. Streeter, Adaptive bound optimization for online convex optimization, arXiv preprint arXiv:1002.4908, (2010).
• [24] Y. Nesterov, A method for unconstrained convex minimization problem with the rate of convergence $O(1/k^2)$, Doklady AN SSSR, 269 (1983), pp. 543–547.
• [25] Y. Nesterov, Gradient methods for minimizing composite functions, Mathematical Programming, 140 (2013), pp. 125–161.
• [26] Y. Nesterov, Introductory lectures on convex optimization: A basic course, vol. 87, Springer Science & Business Media, 2013.
• [27] Y. Nesterov and B. T. Polyak, Cubic regularization of Newton method and its global performance, Mathematical Programming, 108 (2006), pp. 177–205.
• [28] L. Nguyen, P. H. Nguyen, M. Dijk, P. Richtárik, K. Scheinberg, and M. Takac, SGD and Hogwild! convergence without the bounded gradients assumption, in International Conference on Machine Learning, 2018, pp. 3747–3755.
• [29] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, Automatic differentiation in pytorch, (2017).
• [30] B. T. Polyak, Gradient methods for minimizing functionals, Zhurnal Vychislitel’noi Matematiki i Matematicheskoi Fiziki, 3 (1963), pp. 643–653.
• [31] B. T. Polyak, Minimization of nonsmooth functionals, Zhurnal Vychislitel’noi Matematiki i Matematicheskoi Fiziki, 9 (1969), pp. 509–521.
• [32] M. Raydan, On the Barzilai and Borwein choice of steplength for the gradient method, IMA Journal of Numerical Analysis, 13 (1993), pp. 321–326.
• [33] M. Raydan, The Barzilai and Borwein gradient method for the large scale unconstrained minimization problem, SIAM Journal on Optimization, 7 (1997), pp. 26–33.
• [34] S. J. Reddi, S. Kale, and S. Kumar, On the convergence of adam and beyond, in International Conference on Learning Representations, 2018.
• [35] S. Salzo, The variable metric forward-backward splitting algorithm under mild differentiability assumptions, SIAM Journal on Optimization, 27 (2017), pp. 2153–2181.
• [36] N. Shor, An application of the method of gradient descent to the solution of the network transportation problem, Materialy Naucnovo Seminara po Teoret i Priklad. Voprosam Kibernet. i Issted. Operacii, Nucnyi Sov. po Kibernet, Akad. Nauk Ukrain. SSSR, vyp, 1 (1962), pp. 9–17.
• [37] S. U. Stich, Unified optimal analysis of the (stochastic) gradient method, arXiv preprint arXiv:1907.04232, (2019).
• [38] A. Taylor and F. Bach, Stochastic first-order methods: non-asymptotic and computer-aided analyses via potential functions, in Proceedings of the Thirty-Second Conference on Learning Theory, A. Beygelzimer and D. Hsu, eds., vol. 99 of Proceedings of Machine Learning Research, Phoenix, USA, 25–28 Jun 2019, PMLR, pp. 2934–2992.
• [39] A. B. Taylor, J. M. Hendrickx, and F. Glineur, Smooth strongly convex interpolation and exact worst-case performance of first-order methods, Mathematical Programming, 161 (2017), pp. 307–345.
• [40] T. Tieleman and G. Hinton, Lecture 6.5—RmsProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012.
• [41] R. Ward, X. Wu, and L. Bottou, AdaGrad stepsizes: Sharp convergence over nonconvex landscapes, in Proceedings of the 36th International Conference on Machine Learning, K. Chaudhuri and R. Salakhutdinov, eds., vol. 97 of Proceedings of Machine Learning Research, Long Beach, California, USA, 09–15 Jun 2019, PMLR, pp. 6677–6686.
• [42] M. Zaheer, S. Reddi, D. Sachan, S. Kale, and S. Kumar, Adaptive methods for nonconvex optimization, in Advances in Neural Information Processing Systems, 2018, pp. 9793–9803.

## Appendix:

### Appendix A Missing proofs

###### Lemma 2 (Theorem 2.1.5 in [26]).

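Let $\mathcal{C}$ be a closed convex set in $\mathbb{R}^d$. If $f\colon\mathcal{C}\to\mathbb{R}$ is convex and $L$-smooth, then for all $x,y\in\mathcal{C}$ it holds

$$f(x)-f(y)-\langle\nabla f(y),x-y\rangle \ge \frac{1}{2L}\|\nabla f(x)-\nabla f(y)\|^2. \tag{12}$$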

We also need some variation of the Opial lemma.

###### Lemma 3.

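Let $(x^k)$ and $(a_k)$ be two sequences in $\mathbb{R}^d$ and $\mathbb{R}_+$ respectively. Suppose that $(x^k)$ is a bounded sequence whose cluster points belong to $\mathcal{S}\subset\mathbb{R}^d$ and it also holds that

$$\|x^{k+1}-x\|^2+a_{k+1} \le \|x^k-x\|^2+a_k, \quad \forall x\in\mathcal{S}. \tag{13}$$

Then $(x^k)$ converges to some element in $\mathcal{S}$.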

###### Proof.

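Let $\bar x_1$, $\bar x_2$ be any cluster points of $(x^k)$. Thus, there exist two subsequences $(x^{k_i})$ and $(x^{k_j})$ such that $x^{k_i}\to\bar x_1$ and $x^{k_j}\to\bar x_2$. Obviously, $\lim_{k\to\infty}\bigl(\|x^k-x\|^2+a_k\bigr)$ exists for any $x\in\mathcal{S}$, since by (13) this sequence is nonincreasing and bounded below. Let $x=\bar x_1$. This yields

$$\begin{aligned}\lim_{k\to\infty}\bigl(\|x^k-\bar x_1\|^2+a_k\bigr) &= \lim_{i\to\infty}\bigl(\|x^{k_i}-\bar x_1\|^2+a_{k_i}\bigr) = \lim_{i\to\infty}a_{k_i}\\ &= \lim_{j\to\infty}\bigl(\|x^{k_j}-\bar x_1\|^2+a_{k_j}\bigr) = \|\bar x_2-\bar x_1\|^2+\lim_{j\to\infty}a_{k_j}.\end{aligned}$$

Hence, $\lim_{i\to\infty}a_{k_i}=\|\bar x_2-\bar x_1\|^2+\lim_{j\to\infty}a_{k_j}$. Doing the same with $\bar x_2$ instead of $\bar x_1$ yields $\lim_{j\to\infty}a_{k_j}=\|\bar x_1-\bar x_2\|^2+\lim_{i\to\infty}a_{k_i}$. Hence, we obtain that $\bar x_1=\bar x_2$, which finishes the proof. ∎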

###### Proof of Theorem 1.

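(Convergence of $(x^k)$).

Note that in the first part we have already proved that $(x^k)$ is bounded and that $\nabla f$ is $L$-Lipschitz on $\mathcal{C}=\overline{\mathrm{conv}}\{x^*,x^0,x^1,\dots\}$. Invoking Lemma 2 with $x=x^*$ and $y=x^k$, we deduce that

$$\lambda_k\bigl(f(x^*)-f(x^k)\bigr) \ge \lambda_k\langle\nabla f(x^k),x^*-x^k\rangle+\frac{\lambda_k}{2L}\|\nabla f(x^k)\|^2. \tag{14}$$

This indicates that instead of using inequality (6) in the proof of Lemma 1, we could use the better estimate (14). However, we want to emphasize that we did not assume that $\nabla f$ is globally Lipschitz, but rather obtained Lipschitzness on $\mathcal{C}$ as an artifact of our analysis. Clearly, in the end this improvement gives us an additional term $\frac{\lambda_k}{L}\|\nabla f(x^k)\|^2$ in the left-hand side of (5). Thus, now telescoping (5), one obtains that $\sum_{i=1}^k\frac{\lambda_i}{L}\|\nabla f(x^i)\|^2\le D$. As $\lambda_i\ge\frac{1}{2L}$, one has that $\nabla f(x^i)\to0$. Now we might conclude that all cluster points of $(x^k)$ are solutions of (1).

Let $\mathcal{S}$ be the solution set of (1) and $a_k=\frac12\|x^k-x^{k-1}\|^2+2\lambda_k\theta_k\bigl(f(x^{k-1})-f_*\bigr)$. We want to finish the proof applying Lemma 3. To this end, notice that inequality (5) yields (13), since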