On the Convergence of Stochastic Gradient Descent with Adaptive Stepsizes
Stochastic gradient descent (SGD) is the method of choice for large-scale optimization of machine learning objective functions. Yet, its performance is highly variable and depends heavily on the choice of the stepsizes. This has motivated a large body of research on adaptive stepsizes. However, there is currently a gap in our theoretical understanding of these methods, especially in the non-convex setting. In this paper, we start closing this gap: we theoretically analyze the use of adaptive stepsizes, like the ones in AdaGrad, in the non-convex setting. We show sufficient conditions for almost sure convergence to a stationary point when the adaptive stepsizes are used, proving the first guarantee for AdaGrad in the non-convex setting. Moreover, we show explicit rates of convergence that automatically interpolate between O(1/T) and O(1/√T), depending on the noise of the stochastic gradients, in both the convex and non-convex settings.
Xiaoyu Li Department of Applied Mathematics & Statistics Stony Brook University Stony Brook, NY 11790 firstname.lastname@example.org Francesco Orabona Department of Computer Science Stony Brook University Stony Brook, NY 11790 email@example.com
Preprint. Work in progress.
1 Introduction
In recent years, Stochastic Gradient Descent (SGD) has become the tool of choice to train machine learning models. In particular, in the Deep Learning community it is widely used to minimize the training error of deep networks. In this setting, the stochasticity arises from the use of so-called mini-batches, which keep the complexity per iteration constant with respect to the size of the training set.
Classic convergence analyses of the SGD algorithm rely on conditions on the positive stepsizes η_t (Robbins and Monro, 1951). In particular, sufficient conditions are that
∑_{t=1}^∞ η_t = ∞ and ∑_{t=1}^∞ η_t² < ∞.  (1)
The first condition is necessary and also very intuitive: the algorithm must be able to travel arbitrary distances, in order to reach a stationary point from the initial point. On the other hand, the second condition is not necessary, and many popular choices of stepsize, e.g. the one in AdaGrad (Duchi et al., 2011), do not satisfy it while still guaranteeing convergence in the convex setting.
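As a quick numerical illustration (ours, not from the paper), the two conditions in (1) can be checked for the classic decay η_t = 1/t and for the slower 1/√t decay typical of AdaGrad-style stepsizes; the latter keeps the first condition but violates the second:

```python
import math

def partial_sums(stepsize, T):
    """Return the partial sums of eta_t and eta_t^2 for t = 1..T."""
    s, s2 = 0.0, 0.0
    for t in range(1, T + 1):
        eta = stepsize(t)
        s += eta
        s2 += eta * eta
    return s, s2

# eta_t = 1/t: the sum grows like log(T), the sum of squares stays below pi^2/6.
s_rm, s2_rm = partial_sums(lambda t: 1.0 / t, 10**6)

# eta_t = 1/sqrt(t): the sum grows like 2*sqrt(T), but the sum of squares is
# again the harmonic series, so the second condition in (1) fails.
s_ada, s2_ada = partial_sums(lambda t: 1.0 / math.sqrt(t), 10**6)
```

With T = 10^6 the contrast is visible: the squared sums of the first schedule stay bounded, while those of the second keep growing like log T.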
However, for a large number of SGD variants employed by practitioners, the conditions above are not satisfied, and not much is known about their convergence in the non-convex setting. In fact, these algorithms are often designed and analyzed for the convex setting (e.g. Duchi et al., 2011), or they come with no convergence guarantees at all (e.g. Zeiler, 2012), or, even worse, they are known to fail to converge on simple one-dimensional convex stochastic optimization problems (Reddi et al., 2018). This lack of understanding is particularly unsettling when we consider that we do not even know under which conditions these popular variants of SGD converge to a stationary point, even with an infinite number of iterations.
In this paper we start closing this gap, finding sufficient conditions under which popular variants of SGD converge to a stationary point. In particular, we focus on the adaptive stepsizes popularized by AdaGrad (Duchi et al., 2011). This kind of update has become the basis of all the other adaptive optimization algorithms used in machine learning, e.g. (Zeiler, 2012; Tieleman and Hinton, 2012; Kingma and Ba, 2015; Reddi et al., 2018). We analyze the coordinate-wise AdaGrad stepsizes as well as a global version. Also, we show novel theoretical properties of the adaptive global stepsizes in both the convex and non-convex settings.
In more detail, the contributions of this paper are the following:
In Section 5, in the convex setting we show that global adaptive stepsizes give rise to convergence rates that are adaptive to the noise level, interpolating between the convergence rates of Gradient Descent (GD) and SGD. In doing so, we also remove the strong assumption of having a bounded domain present in many previous analyses.
In Section 6, in the non-convex setting we prove almost sure convergence of SGD with adaptive stepsizes, for both coordinate-wise and global adaptive stepsizes. As far as we know, this is the first theoretical justification for the use of AdaGrad in the non-convex setting.
In Section 7, in the non-convex setting we show a finite-time convergence rate to stationary points, adaptive to the level of noise for the global stepsizes.
2 Related Work
In the convex setting, adaptive stepsizes have a long history. They were first proposed in the online learning literature (Auer et al., 2002) and adopted into the stochastic one later (Duchi et al., 2011). Yet, most of these works assumed the optimization to be constrained in a convex bounded set. While this is a reasonable assumption in some settings, it is completely unreasonable in many applications of optimization for machine learning. Yousefian et al. (2012) analyze different adaptive stepsizes, but only for strongly convex optimization. Recently, Wu et al. (2018) have analyzed a choice of adaptive stepsizes similar to the global stepsizes we consider, but their result in the convex setting requires the very strong assumption of having the norm of the gradients strictly greater than zero.
The almost sure convergence of SGD for non-convex smooth functions under the weakest assumptions was established in Bertsekas and Tsitsiklis (2000): the variance of the noise on the gradient at x can grow as O(1 + ‖∇f(x)‖²), f is lower bounded, and the stepsizes satisfy (1). The convergence of a random iterate of SGD for non-convex smooth functions was proved by Ghadimi and Lan (2013). However, neither approach covers adaptive stepsizes.
The only work we know of on adaptive stepsizes for non-convex stochastic optimization is Kresoja et al. (2017). They study the convergence of a choice of adaptive stepsizes that requires access to the function values, under strict conditions on the direction of the gradients. Wu et al. (2018) also consider adaptive stepsizes, but only with deterministic gradients in the non-convex setting.
A different route is to assume some property of the non-convex function that allows one to prove convergence rates. A number of such conditions have been proposed, such as the Polyak-Łojasiewicz condition (Polyak, 1963) (see Karimi et al. (2016) for a recent review of these conditions). However, all these conditions are a substitute for strong convexity and are used to prove a linear convergence rate in the non-convex deterministic setting through a contractive mapping. In this view, these conditions are actually very strong, and it is still unclear how useful they are for modeling the optimization problems we encounter in machine learning.
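For reference, a standard formulation of the Polyak-Łojasiewicz condition mentioned above reads as follows (μ > 0 is a problem-dependent constant; the notation here is ours, not the paper's):

```latex
% PL condition: for some \mu > 0 and every x,
\frac{1}{2}\,\|\nabla f(x)\|^2 \;\ge\; \mu\,\bigl(f(x) - f^\star\bigr),
\qquad f^\star := \inf_{x} f(x).
```

Every μ-strongly convex function satisfies it, but so do some non-convex functions; combined with smoothness, it yields the linear rates mentioned above, which is why it is regarded as a substitute for strong convexity.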
A very weak condition for almost sure convergence to the global optimum of non-convex functions was proposed in Bottou (1998) and recently independently reproposed in Zhou et al. (2017). However, this condition implies the very strong assumption that the gradients never point in the opposite direction of the global optimum. In this paper, in our most restrictive case, we will only assume the function to be smooth and Lipschitz.
3 Problem Set-Up
We denote vectors and matrices by bold letters, e.g. x ∈ R^d. The i-th coordinate of a vector x is denoted by x_i, and similarly for the gradient, whose i-th coordinate is ∇f(x)_i. We denote by E[·] the expectation with respect to the underlying probability space and by E_t[·] the conditional expectation with respect to the past, that is, with respect to the first t − 1 stochastic gradients. All norms are L2 norms.
Setting and Assumptions.
We consider the following optimization problem
min_{x ∈ R^d} f(x),
where f : R^d → R is a function bounded from below. We will make different assumptions on the objective function f, depending on the setting. In particular, we will always assume that
(H1) f is M-smooth, that is, f is differentiable and its gradient is M-Lipschitz, i.e. ‖∇f(x) − ∇f(y)‖ ≤ M‖x − y‖ for all x, y.
Note that (H1) implies, for all x, y (Nesterov, 2003, Lemma 1.2.3),
f(y) ≤ f(x) + ⟨∇f(x), y − x⟩ + (M/2)‖y − x‖².
Sometimes, we will also assume that
(H2) f is L-Lipschitz, i.e. |f(x) − f(y)| ≤ L‖x − y‖ for all x, y.
We assume that we have access to a stochastic first-order black-box oracle that returns a noisy estimate of the gradient of f at any point x. That is, we will use the following assumption:
(H3) We receive a vector g(x, ξ) such that E_ξ[g(x, ξ)] = ∇f(x) for any x.
We will also make alternatively one of the following assumptions on the variance of the noise.
(H4) The noise in the stochastic gradient has bounded support, that is, ‖g(x, ξ) − ∇f(x)‖ ≤ S for some S > 0.
(H4’) The stochastic gradient satisfies E[exp(‖g(x, ξ) − ∇f(x)‖²/σ²)] ≤ exp(1) for some σ > 0.
Assumption (H4’) has already been used by Nemirovski et al. (2009) to prove high-probability convergence guarantees. Note that this condition implies a bounded variance; in fact, by Jensen’s inequality, exp(E[‖g(x, ξ) − ∇f(x)‖²]/σ²) ≤ E[exp(‖g(x, ξ) − ∇f(x)‖²/σ²)] ≤ exp(1), hence E[‖g(x, ξ) − ∇f(x)‖²] ≤ σ².
This condition will be needed to control the expectation of the maximum of the noise terms over the iterations. Note that (H4) implies (H4’).
Stochastic Gradient Descent.
The optimization algorithm we consider is SGD, which iteratively updates the solution as x_{t+1} = x_t − η_t g_t, where g_t = g(x_t, ξ_t), starting from an arbitrary point x_1. Differently from previous work, we allow the stepsizes η_t to depend on the past, effectively making them random variables. Also, we will consider the more general setting in which the stepsizes are diagonal matrices whose elements on the diagonal are η_{t,1}, …, η_{t,d}, and the update becomes x_{t+1,i} = x_{t,i} − η_{t,i} g_{t,i} for i = 1, …, d.
4 Adaptive Stepsizes
The adaptive stepsizes we analyze are a generalization of ones widely used in the online and stochastic optimization literature. As such, their good performance has already been validated by numerous empirical results. In particular, we consider the following stepsizes:
η_t = α / (β + ∑_{i=1}^{t−1} ‖g_i‖²)^{1/2+ε}  (4)
and
η_{t,j} = α / (β + ∑_{i=1}^{t−1} g_{i,j}²)^{1/2+ε},  j = 1, …, d,  (5)
where α, β > 0 and ε ≥ 0. Depending on the particular setting, we might have more constraints on ε. Note that, with ε = 0, (5) are the coordinate-wise stepsizes used in AdaGrad (Duchi et al., 2011), while (4) have been used in online convex optimization to achieve adaptive regret guarantees (e.g. Rakhlin and Sridharan, 2013; Orabona and Pál, 2018).
A key difference between (4)-(5) and the standard adaptive stepsizes is the fact that the current gradient g_t is not used in η_t. This is a key property for the theoretical analysis, because it allows us to calculate the conditional expectation of quantities involving η_t and g_t. Also, we claim this should be the right way to implement adaptive stepsizes. Indeed, as we show in the Example below, if the stepsize does depend on the current gradient, things can go wrong. The details can be found in the Appendix.
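To make this concrete, here is a minimal sketch (our own code, not the authors’; names such as `sgd_adaptive` and `grad_oracle` are ours) of SGD with stepsizes of the form α/(β + ∑_{i<t} ‖g_i‖²)^{1/2+ε}, where the accumulator is read strictly before the current stochastic gradient is drawn:

```python
import numpy as np

def sgd_adaptive(grad_oracle, x0, alpha=1.0, beta=1.0, eps=0.1,
                 T=1000, coordinate_wise=False):
    """SGD where the stepsize at time t uses only the past gradients g_1..g_{t-1}.

    grad_oracle(x) must return a stochastic gradient g with E[g] = grad f(x).
    coordinate_wise=False sketches a global stepsize as in (4),
    coordinate_wise=True a coordinate-wise stepsize as in (5)."""
    x = np.asarray(x0, dtype=float)
    # Accumulator of past squared gradients: a scalar for (4), a vector for (5).
    accum = np.zeros_like(x) if coordinate_wise else 0.0
    for _ in range(T):
        # The stepsize is computed BEFORE drawing the current gradient,
        # so it is measurable with respect to the past.
        eta = alpha / (beta + accum) ** (0.5 + eps)
        g = grad_oracle(x)
        x = x - eta * g
        accum = accum + (g * g if coordinate_wise else float(np.dot(g, g)))
    return x
```

For instance, on f(x) = ‖x‖²/2 with a noisy gradient oracle, the iterates approach the minimizer at the origin without any tuning of α and β to the noise level.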
There exists a convex differentiable function f satisfying (H1), an additive noise on the gradients satisfying (H4), and a sequence of gradients such that, for a given x_t, we have E_t[η_t ⟨∇f(x_t), g_t⟩] < 0 when η_t is computed using the current gradient g_t as well.
In words, the example says that including the current noisy gradient in the stepsize (that is, summing up to i = t rather than t − 1 in the denominator) can make the algorithm deviate in expectation by more than 90 degrees from the correct direction. While in the convex bounded case the algorithm can recover, it is intuitive that this could have catastrophic consequences in the unconstrained non-convex setting.
On the other hand, this difference makes the analysis more involved, because the key quantities cannot be bounded anymore in a straightforward way; see Lemma 2 in the next section. Previous analyses (e.g. Duchi et al., 2011) solved this issue by assuming knowledge of the Lipschitz constant of the function f, while we will assume the function to be Lipschitz only to prove the asymptotic guarantee, and without knowledge of the constant.
In the following, we will show that these stepsizes allow us to prove adaptive guarantees in both the convex and non-convex settings.
5 Adaptive Convergence Rates for Convex Functions
In this section, we show that the global stepsizes (4) give adaptive rates of convergence that interpolate between the rates of GD and SGD, without knowledge of the variance of the noise. Differently from other analyses of SGD with adaptive stepsizes (e.g. Duchi et al., 2011), we do not assume the use of projections. This makes the proof technically more challenging, but at the same time it mirrors the setting of many applications of SGD to machine learning optimization problems.
We first state some technical lemmas, whose proofs are in the Appendix.
Assume (H1). Then ‖∇f(x)‖² ≤ 2M(f(x) − inf_y f(y)) for all x.
Assume (H1, H3, H4’). The stepsizes are chosen as (4), where . Then,
We can now state the adaptive convergence guarantee.
Assume (H1, H3, H4’) and f convex. Let δ ∈ (0, 1) and the stepsizes be set as in (4). Then, with probability at least 1 − 2δ, the iterates of SGD satisfy the following bound
where , , and .
From the update of SGD we have that
Taking the conditional expectation with respect to the past, we have that
where in the inequality we used the fact that f is convex. Hence, summing over t = 1, …, T, we have
We can also lower bound the l.h.s. of (6) with
Putting everything together and using Markov’s inequality, we have, with probability at least 1 − δ,
Using the expression of , we have
where in the first inequality we used an elementary inequality, and Lemma 1 in the second one. We use Markov’s inequality again to obtain, with probability at least 1 − δ,
Hence, putting all together, using the notation of the theorem, and overapproximating, we have
Through a case analysis, we have that
Using Jensen’s inequality on the l.h.s. of last inequality together with the union bound concludes the proof. ∎
Up to polylogarithmic terms, if σ = 0 we recover the GD rate, O(1/T), and otherwise we get the rate of SGD, O(1/√T). The same behavior was proved in Dekel et al. (2012). However, here we do not need to know the noise level, nor do we assume a bounded domain. When the constants of the slow term are small compared with those of the first term, we can expect a first quickly convergent phase, followed by a slow one, as is often observed in empirical experiments.
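Schematically, and with generic constants C₁, C₂ in place of the precise ones in the theorem, a noise-adaptive bound of this kind has the form:

```latex
% sigma^2 bounds the noise of the stochastic gradients;
% C_1, C_2 are problem-dependent constants, up to polylog terms.
f(\bar{x}_T) - f^\star \;\lesssim\; \frac{C_1}{T} \;+\; \frac{C_2\,\sigma}{\sqrt{T}} .
% With sigma = 0 the second term vanishes and the O(1/T) rate of GD is
% recovered; otherwise the O(sigma/sqrt(T)) term of SGD dominates.
```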
Observe that this bound holds not in expectation but in probability. While all bounds on the expected sub-optimality can be expressed in the same way, here the use of Markov’s inequality is actually necessary to be able to solve the implicit inequality in the proof.
6 Almost Sure Convergence for Non-Convex Functions
In this section, we show that SGD with the adaptive stepsizes in (4) and (5) converges to a stationary point almost surely, that is, with probability 1. Note that the stepsizes in (4) and (5) do not satisfy (1), not even in expectation, because they could decrease fast enough to have ∑_{t=1}^∞ η_t < ∞. Hence, the results here cannot be obtained from the classic results in stochastic approximation (e.g. Bertsekas and Tsitsiklis, 2000).
Here, we will need our strongest assumptions. In particular, we will need the function to be Lipschitz and the noise to have bounded support. This is mainly needed in order to make sure that the sum of the stepsizes diverges.
We first state some technical lemmas that we will use in the following; all the proofs are in the Appendix.
(Mairal, 2013, Lemma A.5) Let (a_t)_{t≥1}, (b_t)_{t≥1} be two non-negative real sequences such that (b_t) is bounded, ∑_t a_t b_t converges, ∑_t a_t diverges, and there exists K ≥ 0 such that |b_{t+1} − b_t| ≤ K a_t. Then b_t converges to 0.
Let a_t ≥ 0 for t = 0, …, T and f : [0, +∞) → [0, +∞) a nonincreasing function. Then ∑_{t=1}^T a_t f(a_0 + ∑_{i=1}^t a_i) ≤ ∫_{a_0}^{a_0 + ∑_{t=1}^T a_t} f(x) dx.
Assume (H1, H3). Then, the iterates of SGD satisfy the following inequality
We now state the almost sure convergence of SGD with adaptive stepsizes.
Assume (H1, H2, H3, H4). The stepsizes are chosen as in (4), where α, β > 0 and ε ∈ (0, 1/2). Then, SGD converges to a stationary point almost surely, i.e. with probability 1. Moreover, with probability 1.
From the result in Lemma 5, taking the limit for and exchanging the expectation and the limits because the terms are non-negative, we have
where in the first inequality we have used Lemma 4, and in the third one the elementary inequality .
Hence, we have E[∑_{t=1}^∞ η_t ‖∇f(x_t)‖²] < ∞. Now, note that E[X] < ∞, where X is a non-negative random variable, implies that X < ∞ with probability 1. In fact, P(X = ∞) > 0 would imply E[X] = ∞, contradicting our assumption. Hence, with probability 1, we have ∑_{t=1}^∞ η_t ‖∇f(x_t)‖² < ∞.
Now, observe that the Lipschitzness of f and the bounded support of the noise on the gradients give ‖g_t‖ ≤ ‖∇f(x_t)‖ + S ≤ L + S, so that η_t ≥ α/(β + t(L + S)²)^{1/2+ε} and hence ∑_{t=1}^∞ η_t = ∞, because 1/2 + ε < 1.
Using the fact that f is L-Lipschitz and M-smooth, we have
|‖∇f(x_{t+1})‖² − ‖∇f(x_t)‖²| ≤ (‖∇f(x_{t+1})‖ + ‖∇f(x_t)‖) ‖∇f(x_{t+1}) − ∇f(x_t)‖ ≤ 2LM ‖x_{t+1} − x_t‖ ≤ 2LM(L + S) η_t.
Also, ∑_{t=1}^∞ η_t = ∞, as shown above. Hence, we can use Lemma 3 with a_t = η_t and b_t = ‖∇f(x_t)‖² to obtain lim_{t→∞} ‖∇f(x_t)‖ = 0 with probability 1.
For the second statement, observe that, with probability 1,
where in the first inequality we used the Lipschitzness of f and the bounded support of the noise on the gradients. Hence, noting that , we have that . ∎
We now state a similar result for a version of SGD similar to AdaGrad (Duchi et al., 2011). We use the coordinate-wise adaptive stepsizes (5) as in AdaGrad, but with the power of the denominator equal to 1/2 + ε with ε ∈ (0, 1/2), rather than 1/2. Also, differently from what is stated in the original AdaGrad paper, here we do not project onto a bounded closed convex set. This mirrors the actual implementation of AdaGrad in machine learning libraries, e.g. TensorFlow (Abadi et al., 2015). Given that the proof is virtually identical to the one of Theorem 2, we defer it to the Appendix.
Assume (H1, H2, H3, H4). The stepsizes are given by a diagonal matrix whose diagonal values are defined in (5), where α, β > 0 and ε ∈ (0, 1/2). Then, SGD converges to a stationary point almost surely, i.e. with probability 1. Moreover, with probability 1.
As far as we know, the theorem above is the first result on the convergence of AdaGrad to a stationary point, albeit assuming ε > 0. Also, the almost sure asymptotic convergence is the first theoretical support for the common heuristic of selecting the last iterate, rather than the minimum over the iterations.
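As a toy sanity check (ours; the paper reports no experiments, and all names here are our own), running this variant with denominator power 1/2 + ε on a smooth, Lipschitz, non-convex function with bounded-support noise drives the gradient of the last iterate toward zero:

```python
import numpy as np

def adagrad_variant(x0, grad, noise_scale=0.1, alpha=1.0, beta=1.0,
                    eps=0.1, T=5000, seed=0):
    """Coordinate-wise stepsizes with denominator power 1/2 + eps, eps in (0, 1/2).

    The noise is uniform, hence with bounded support as in (H4).
    Returns the LAST iterate, the one the almost-sure result supports selecting."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    accum = np.zeros_like(x)
    for _ in range(T):
        eta = alpha / (beta + accum) ** (0.5 + eps)  # uses past gradients only
        g = grad(x) + noise_scale * rng.uniform(-1.0, 1.0, x.shape)
        x = x - eta * g
        accum += g * g
    return x

# A smooth, Lipschitz, non-convex test function: f(x) = sum_i log(1 + x_i^2),
# whose gradient is 2 x_i / (1 + x_i^2); its only stationary point is the origin.
grad_f = lambda x: 2.0 * x / (1.0 + x * x)
x_last = adagrad_variant(np.array([3.0, -2.0]), grad_f)
```

On this example the last iterate ends up close to the stationary point at the origin, despite the noise never vanishing.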
Yet, in the above convergence guarantees, the rate at which the gradient converges to zero is only asymptotic. In the next Section, we show a finite-time convergence rate for the minimum gradient over the iterates that precisely quantifies the effect of the noise on the rate.
7 Non-Asymptotic Adaptive Convergence Rates for Non-Convex Functions
We now prove non-asymptotic adaptive convergence rates to stationary points using the global stepsizes (4). This result is complementary to the one in the previous section. Given that SGD is not a descent method, we are not aware of any result with an explicit convergence rate for the last iterate on non-convex functions. Hence, here we prove a convergence guarantee for the best iterate over T iterations rather than for the last one. Note that choosing a random stopping time as in Ghadimi and Lan (2013) would be equivalent in expectation to choosing the best iterate. For simplicity, we state the theorem for the best iterate.
Assume (H1, H3, H4’). Let δ ∈ (0, 1) and the stepsizes be set as in (4). Then, with probability at least 1 − δ, the iterates of SGD satisfy the following bound
where and .
From Lemma 5, we have
Using Lemma 2, we can upper bound the expected sum on the r.h.s. of the last inequality, to have
Denoting by , we have
where in the second inequality we used the elementary inequality . Using the definition of in the Theorem and defining , we have that
We now consider two cases: and . In the first case, we have that
Using Markov’s inequality, we have that with probability at least , , that is . In the second case, we have
Using Markov’s inequality, we have that with probability at least , that implies . Using Markov’s inequality again, we have with probability at least , that gives us .
Putting all together and using the union bound, we have the stated bound. ∎
This theorem mirrors Theorem 1, proving again a convergence rate that is adaptive to the noise level. Hence, the same observations on adaptation to the noise level and convergence hold here as well. The main difference w.r.t. Theorem 1 is that here we only prove convergence to a stationary point because we do not assume convexity.
Note that such bounds were already known with an oracle tuning of the stepsizes, in particular with knowledge of the variance of the noise, see, e.g., Ghadimi and Lan (2013). In fact, the required stepsize in the deterministic case must be a constant of order 1/M, while it has to be of order 1/(σ√T) in the stochastic case. However, here we obtain the same behavior automatically, without having to estimate the variance of the noise, thanks to the adaptive stepsizes.
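Schematically, the oracle tuning discussed above takes roughly the following form (constants are generic; this is our reconstruction, not the exact expression of Ghadimi and Lan):

```latex
% Oracle stepsize: constant of order 1/M when sigma = 0 (deterministic case),
% of order 1/(sigma sqrt(T)) when sigma > 0 (stochastic case).
\eta \;=\; \min\Bigl\{ \frac{1}{M},\; \frac{c}{\sigma\sqrt{T}} \Bigr\},
\qquad c > 0 \text{ a constant.}
```

The right choice thus requires knowing σ in advance, which is precisely what the adaptive stepsizes avoid.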
8 Discussion and Future Work
We have presented an analysis of adaptive stepsizes for stochastic gradient descent, with convex and non-convex functions. In the convex setting, our result overcomes the limitations of previous results, removing the assumption of a bounded domain, yet showing an adaptive convergence rate. In the non-convex setting, we show almost sure convergence and adaptive convergence rates to stationary points. Moreover, we show for the first time a convergence guarantee for non-convex functions for a minor variation of AdaGrad.
In the future, we would like to understand whether the conditions we impose can be weakened. For example, the almost sure convergence requires noise with bounded support, which, while it might hold in many practical scenarios, still seems unsatisfying from a theoretical point of view. Also, we would like to address the issue of proving high-probability finite-time convergence guarantees.
The authors thank Dávid Pál for the comments and discussions. This material is based upon work supported by the National Science Foundation under grant no. 1740762 “Collaborative Research: TRIPODS Institute for Optimization and Learning” and by a Google Research Award.
- Abadi et al.  M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. URL http://tensorflow.org/. Software available from tensorflow.org.
- Auer et al.  P. Auer, N. Cesa-Bianchi, and C. Gentile. Adaptive and self-confident on-line learning algorithms. J. Comput. Syst. Sci., 64(1):48–75, 2002.
- Bertsekas and Tsitsiklis  D. P. Bertsekas and J. N. Tsitsiklis. Gradient convergence in gradient methods with errors. SIAM Journal on Optimization, 10(3):627–642, 2000.
- Bottou  L. Bottou. Online learning and stochastic approximations. On-line learning in neural networks, 17(9):142, 1998.
- Dekel et al.  O. Dekel, R. Gilad-Bachrach, O. Shamir, and L. Xiao. Optimal distributed online prediction using mini-batches. Journal of Machine Learning Research, 13(Jan):165–202, 2012.
- Duchi et al.  J. C. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121–2159, 2011.
- Ghadimi and Lan  S. Ghadimi and G. Lan. Stochastic first-and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013.
- Karimi et al.  H. Karimi, J. Nutini, and M. Schmidt. Linear convergence of gradient and proximal-gradient methods under the Polyak-Łojasiewicz condition. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 795–811. Springer, 2016.
- Kingma and Ba  D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), 2015.
- Kresoja et al.  M. Kresoja, Z. Lužanin, and I. Stojkovska. Adaptive stochastic approximation algorithm. Numerical Algorithms, 76(4):917–937, Dec 2017.
- Mairal  J. Mairal. Stochastic majorization-minimization algorithms for large-scale optimization. In Advances in Neural Information Processing Systems, pages 2283–2291, 2013.
- Nemirovski et al.  A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on optimization, 19(4):1574–1609, 2009.
- Nesterov  Y. Nesterov. Introductory lectures on convex optimization: A basic course, volume 87. Springer, 2003.
- Orabona and Pál  F. Orabona and D. Pál. Scale-free online learning. Theoretical Computer Science, 716:50–69, 2018. Special Issue on ALT 2015.
- Polyak  B. T. Polyak. Gradient methods for minimizing functionals. Zhurnal Vychislitel’noi Matematiki i Matematicheskoi Fiziki, 3(4):643–653, 1963.
- Rakhlin and Sridharan  A. Rakhlin and K. Sridharan. Optimization, learning, and games with predictable sequences. In Advances in Neural Information Processing Systems, pages 3066–3074, 2013.
- Reddi et al.  S. J. Reddi, S. Kale, and S. Kumar. On the convergence of Adam and beyond. In International Conference on Learning Representations, 2018.
- Robbins and Monro  H. Robbins and S. Monro. A stochastic approximation method. Annals of Mathematical Statistics, 22:400–407, 1951.
- Tieleman and Hinton  T. Tieleman and G. Hinton. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012.
- Wu et al.  X. Wu, R. Ward, and L. Bottou. WNGrad: Learn the learning rate in gradient descent. arXiv preprint arXiv:1803.02865, 2018.
- Yousefian et al.  F. Yousefian, A. Nedić, and U. V. Shanbhag. On stochastic gradient and subgradient methods with adaptive steplength sequences. Automatica, 48(1):56–67, 2012.
- Zeiler  M. D. Zeiler. ADADELTA: an adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.
- Zhou et al.  Z. Zhou, P. Mertikopoulos, N. Bambos, S. Boyd, and P. W. Glynn. Stochastic mirror descent in variationally coherent optimization problems. In Advances in Neural Information Processing Systems, pages 7043–7052, 2017.
Appendix A Appendix
Here, we report the proofs missing from the main text.
A.1 Details of Example 1
Consider the function f(x). The gradient at the t-th iteration is ∇f(x_t). Let the stochastic gradient be defined as g_t = ∇f(x_t) + ξ_t, where ξ_t is the additive noise.
Let . Then
This expression can be negative, for example, setting , , and .
A.2 Proof of Lemma 4
Let a_t ≥ 0 for t = 0, …, T and f : [0, +∞) → [0, +∞) a nonincreasing function. Then
∑_{t=1}^T a_t f(a_0 + ∑_{i=1}^t a_i) ≤ ∫_{a_0}^{a_0 + ∑_{t=1}^T a_t} f(x) dx.
Denote by s_t = a_0 + ∑_{i=1}^t a_i. Since f is nonincreasing, a_t f(s_t) = ∫_{s_{t−1}}^{s_t} f(s_t) dx ≤ ∫_{s_{t−1}}^{s_t} f(x) dx.
Summing over t = 1, …, T, we have the stated bound. ∎
A.3 Proofs of Section 5
Proof of Lemma 2.
Using the assumption on the noise, we have
Hence, we have