Fast Convergence of Stochastic Gradient Descent under a Strong Growth Condition
Abstract
We consider optimizing a smooth convex function $f$ that is the average of a set of $N$ differentiable functions $f_i$, under the assumption considered by Solodov (1998) and Tseng (1998) that the norm of each gradient $\nabla f_i$ is bounded by a linear function of the norm of the average gradient $\nabla f$. We show that under these assumptions the basic stochastic gradient method with a sufficiently small constant step size has an $O(1/k)$ convergence rate, and has a linear convergence rate if $f$ is strongly convex.
1 Deterministic vs. Stochastic Gradient Descent
We consider optimizing a function $f$ that is the average of a set of $N$ differentiable functions $f_i$,
$$f(x) = \frac{1}{N}\sum_{i=1}^{N} f_i(x), \qquad (1)$$
where we assume that $f$ is convex and its gradient $\nabla f$ is Lipschitz-continuous with constant $L$, meaning that for all $x$ and $y$ we have
$$\|\nabla f(x) - \nabla f(y)\| \le L\,\|x - y\|.$$
If $f$ is twice-differentiable, these assumptions are equivalent to assuming that the eigenvalues of the Hessian $\nabla^2 f(x)$ are bounded between $0$ and $L$ for all $x$.
Deterministic gradient methods for problems of this form use the iteration
$$x_{k+1} = x_k - \alpha_k \nabla f(x_k), \qquad (2)$$
for a sequence of step sizes $\{\alpha_k\}$. In contrast, stochastic gradient methods use the iteration
$$x_{k+1} = x_k - \alpha_k \nabla f_{i_k}(x_k), \qquad (3)$$
for an individual data sample $i_k$ selected uniformly at random from the set $\{1, 2, \ldots, N\}$.
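To make the cost difference concrete, the following sketch (our own illustration, not from the paper; the least-squares instance $f_i(x) = \frac{1}{2}(a_i^T x - b_i)^2$ and all constants are assumptions) runs iterations (2) and (3) side by side. Each deterministic step touches all $N$ samples, while each stochastic step touches only one.

```python
import numpy as np

# A minimal sketch contrasting iterations (2) and (3); the quadratic
# f_i(x) = 0.5*(a_i^T x - b_i)^2 and all constants are our own choices.
rng = np.random.default_rng(0)
N, d = 50, 5
A = rng.standard_normal((N, d))
b = A @ rng.standard_normal(d)          # consistent system: a minimizer exists

def grad_f(x):                          # gradient of the average f: costs O(N d)
    return A.T @ (A @ x - b) / N

def grad_fi(x, i):                      # gradient of one sampled f_i: costs O(d)
    return A[i] * (A[i] @ x - b[i])

alpha = 0.05                            # constant step size (an arbitrary choice)
x_det = np.zeros(d)
x_sgd = np.zeros(d)
for k in range(2000):
    x_det = x_det - alpha * grad_f(x_det)        # deterministic iteration (2)
    i = rng.integers(N)                          # i_k uniform on {1, ..., N}
    x_sgd = x_sgd - alpha * grad_fi(x_sgd, i)    # stochastic iteration (3)

print("deterministic error:", np.linalg.norm(A @ x_det - b))
print("stochastic error:   ", np.linalg.norm(A @ x_sgd - b))
```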
The stochastic gradient method is appealing because the cost of its iterations is independent of $N$. However, in order to guarantee convergence, stochastic gradient methods require a decreasing sequence of step sizes, and this leads to a slower convergence rate. In particular, for convex objective functions the stochastic gradient method with a decreasing sequence of step sizes has an expected error on iteration $k$ of $O(1/\sqrt{k})$ (Nemirovski, 1994, §14.1), meaning that
$$\mathbb{E}[f(x_k)] - f(x^*) = O(1/\sqrt{k}).$$
In contrast, the deterministic gradient method with a constant step size has a smaller error of $O(1/k)$ (Nesterov, 2004, §2.1.5). The situation is more dramatic when $f$ is strongly convex, meaning that
$$f(y) \ge f(x) + \nabla f(x)^{T}(y - x) + \frac{\mu}{2}\,\|y - x\|^2, \qquad (4)$$
for all $x$ and $y$ and some $\mu > 0$. For twice-differentiable functions, this is equivalent to assuming that the eigenvalues of the Hessian are bounded below by $\mu$. For strongly convex objective functions, the stochastic gradient method with a decreasing sequence of step sizes has an error of $O(1/k)$ (Nemirovski et al., 2009, §2.1), while the deterministic method with a constant step size has a linear convergence rate. In particular, the deterministic method satisfies
$$f(x_k) - f(x^*) \le \rho^{k}\,[f(x_0) - f(x^*)],$$
for some $\rho < 1$ (Luenberger and Ye, 2008, §8.6).
The purpose of this note is to show that, if the individual gradients $\nabla f_i$ satisfy a certain strong growth condition relative to the full gradient $\nabla f$, the stochastic gradient method with a sufficiently small constant step size achieves (in expectation) the convergence rates stated above for the deterministic gradient method.
2 A Strong Growth Condition
The particular condition we consider in this work is that for all $x$ and all $i$ we have
$$\|\nabla f_i(x)\| \le B\,\|\nabla f(x)\|, \qquad (5)$$
for some constant $B$ (note that necessarily $B \ge 1$ unless $f$ is constant, since $\nabla f$ is the average of the $\nabla f_i$). This condition states that the norms of the gradients of the individual functions are bounded by a linear function of the norm of the average gradient. Note that this condition is very strong and is not satisfied in most applications. In particular, this condition requires that any optimal solution $x^*$ for problem (1) must also be a stationary point for each $f_i$, so that
$$\nabla f(x^*) = 0 \;\implies\; \nabla f_i(x^*) = 0 \text{ for all } i.$$
In the context of non-linear least squares problems this condition requires that all residuals be zero at the solution, a property that can be used to show local superlinear convergence of Gauss-Newton algorithms (Bertsekas, 1999, §1.5.1).
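As a sanity check, condition (5) can be probed numerically. The sketch below (our own construction, not from the paper; the zero-residual least-squares instance and the random test points are assumptions) estimates the smallest $B$ consistent with (5) over sampled points. This only gives an empirical lower bound on $B$; it does not prove that (5) holds for all $x$.

```python
import numpy as np

# Empirically probing the strong growth constant B in (5) on a zero-residual
# least-squares problem (our own instance); this estimates B from random test
# points and is only a lower bound, not a verification of (5) for all x.
rng = np.random.default_rng(1)
N, d = 20, 5
A = rng.standard_normal((N, d))
x_star = rng.standard_normal(d)
b = A @ x_star                           # all residuals are zero at x_star

ratios = []
for _ in range(10000):
    x = x_star + rng.standard_normal(d)  # random test point
    r = A @ x - b
    g = A.T @ r / N                                    # full gradient
    gi_norms = np.abs(r) * np.linalg.norm(A, axis=1)   # ||grad f_i(x)|| for each i
    ratios.append(gi_norms.max() / np.linalg.norm(g))

print("empirical estimate of B:", max(ratios))
```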
Under condition (5), Solodov (1998) and Tseng (1998) have analyzed convergence properties of deterministic incremental gradient methods. In these methods, the iteration (3) is used but the data sample $i_k$ is chosen in a deterministic fashion by proceeding through the samples in a cyclic order. Normally, the deterministic incremental gradient method requires a decreasing sequence of step sizes to achieve convergence, but Solodov shows that under condition (5) the deterministic incremental gradient method converges with a sufficiently small constant step size. Further, Tseng shows that a deterministic incremental gradient method with a sufficiently small step size may have a form of linear convergence under condition (5). However, this form of linear convergence treats full passes through the data as iterations, similar to the deterministic gradient method. Below, we show that the stochastic gradient method achieves a linear convergence rate in expectation, using iterations that only look at one training example.
3 Error Properties
It will be convenient to re-write the stochastic gradient iteration (3) in the form
$$x_{k+1} = x_k - \alpha\,[\nabla f(x_k) + e_k], \qquad (6)$$
where we have assumed a constant step size $\alpha$ and where the error $e_k$ is given by
$$e_k = \nabla f_{i_k}(x_k) - \nabla f(x_k). \qquad (7)$$
That is, we treat the stochastic gradient iteration as a full gradient iteration of the form (2) but with an error $e_k$ in the gradient calculation. Because $i_k$ is sampled uniformly from the set $\{1, 2, \ldots, N\}$, note that we have
$$\mathbb{E}[\nabla f_{i_k}(x_k)] = \frac{1}{N}\sum_{i=1}^{N}\nabla f_i(x_k) = \nabla f(x_k), \qquad (8)$$
and subsequently that the error has a mean of zero,
$$\mathbb{E}[e_k] = 0, \qquad (9)$$
where the expectations are with respect to the choice of $i_k$, conditioned on $x_k$.
In addition to this simple property, our analysis will also use a bound on the variance term $\mathbb{E}[\|e_k\|^2]$ in terms of $\|\nabla f(x_k)\|^2$. To obtain this we first use (7), then expand and use (8), and finally use our assumption (5) to get
$$\mathbb{E}[\|e_k\|^2] = \mathbb{E}[\|\nabla f_{i_k}(x_k) - \nabla f(x_k)\|^2] = \mathbb{E}[\|\nabla f_{i_k}(x_k)\|^2] - \|\nabla f(x_k)\|^2 \le (B^2 - 1)\,\|\nabla f(x_k)\|^2. \qquad (10)$$
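Because the expectation in (8)–(10) is over the finite set $\{1, \ldots, N\}$, it can be evaluated exactly by averaging over all $N$ choices of $i_k$. The sketch below (our own check, on an assumed zero-residual least-squares instance) verifies (9) and (10) at a single point, using the smallest $B$ for which (5) holds at that point.

```python
import numpy as np

# Exact check of (9) and (10) at one point x_k, averaging over all N choices
# of i_k; the zero-residual least-squares instance is our own assumption.
rng = np.random.default_rng(2)
N, d = 20, 5
A = rng.standard_normal((N, d))
b = A @ rng.standard_normal(d)

x = rng.standard_normal(d)               # the current iterate x_k
r = A @ x - b
grads = A * r[:, None]                   # row i holds grad f_i(x)
g = grads.mean(axis=0)                   # full gradient, as in (8)
e = grads - g                            # the N equally-likely errors e_k

# smallest B satisfying (5) at this particular x
B = np.linalg.norm(grads, axis=1).max() / np.linalg.norm(g)

print("||E[e_k]|| =", np.linalg.norm(e.mean(axis=0)))   # (9): ~0 up to round-off
lhs = (np.linalg.norm(e, axis=1) ** 2).mean()           # E[||e_k||^2]
rhs = (B**2 - 1) * np.linalg.norm(g) ** 2
print("variance bound (10) holds:", lhs <= rhs + 1e-10)
```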
4 Upper Bound on Progress
From the Lipschitz continuity of $\nabla f$, we have for all $x$ and $y$ that
$$f(y) \le f(x) + \nabla f(x)^{T}(y - x) + \frac{L}{2}\,\|y - x\|^2.$$
Applying this inequality with $y = x_{k+1}$ and $x = x_k$, and substituting the iteration (6), we obtain an upper bound on the progress made by one stochastic gradient step:
$$f(x_{k+1}) \le f(x_k) - \alpha\,\nabla f(x_k)^{T}[\nabla f(x_k) + e_k] + \frac{L\alpha^2}{2}\,\|\nabla f(x_k) + e_k\|^2. \qquad (11)$$
5 Descent Property
We now show that, if the step size $\alpha$ is sufficiently small and the error $e_k$ is as described in Section 3, the expected value of $f(x_{k+1})$ is less than $f(x_k)$. In particular, we take the expectation of both sides of (11) with respect to $i_k$, and use (9) and (10) to obtain
$$\mathbb{E}[f(x_{k+1})] \le f(x_k) - \alpha\,\|\nabla f(x_k)\|^2 + \frac{L\alpha^2}{2}\Big(\|\nabla f(x_k)\|^2 + \mathbb{E}[\|e_k\|^2]\Big) \le f(x_k) - \alpha\Big(1 - \frac{LB^2\alpha}{2}\Big)\|\nabla f(x_k)\|^2. \qquad (12)$$
This inequality shows that if $x_k$ is not a minimizer, then the stochastic gradient iteration is expected to decrease the objective function for any step size satisfying
$$0 < \alpha < \frac{2}{LB^2}. \qquad (13)$$
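The expected descent in (12) can likewise be checked exactly for a single step by averaging $f(x_{k+1})$ over the $N$ possible samples. The following sketch (our own check; the instance, the test point, and the step size are assumptions) uses $\alpha = 1/(LB^2)$, which satisfies (13).

```python
import numpy as np

# Exact one-step check of the descent bound (12) under step size (13);
# the least-squares instance and the test point are our own choices.
rng = np.random.default_rng(3)
N, d = 20, 5
A = rng.standard_normal((N, d))
b = A @ rng.standard_normal(d)

f = lambda x: 0.5 * np.mean((A @ x - b) ** 2)
x = rng.standard_normal(d)
r = A @ x - b
grads = A * r[:, None]                                  # grad f_i(x) in each row
g = grads.mean(axis=0)

L = np.linalg.eigvalsh(A.T @ A / N).max()               # Lipschitz constant of grad f
B = np.linalg.norm(grads, axis=1).max() / np.linalg.norm(g)  # growth constant at x
alpha = 1.0 / (L * B**2)                                # satisfies (13)

Ef_next = np.mean([f(x - alpha * grads[i]) for i in range(N)])  # exact E[f(x_{k+1})]
bound = f(x) - alpha * (1 - L * B**2 * alpha / 2) * np.linalg.norm(g) ** 2
print("E[f(x_{k+1})] <= bound of (12):", Ef_next <= bound + 1e-12)
print("expected decrease:", f(x) - Ef_next)
```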
6 Linear Convergence for Strongly Convex Objectives
We now use the bound (12) to show that, for strongly convex functions, constant step sizes satisfying (13) lead to an expected linear convergence rate. First, set $x = x_k$ in (4) and minimize both sides of (4) with respect to $y$: the left side is minimized by the value $f(x^*)$, and the right side is minimized by $y = x_k - \frac{1}{\mu}\nabla f(x_k)$, so we obtain
$$f(x^*) \ge f(x_k) - \frac{1}{2\mu}\,\|\nabla f(x_k)\|^2,$$
where $x^*$ is the minimizer of $f$. Subsequently, we have
$$\|\nabla f(x_k)\|^2 \ge 2\mu\,[f(x_k) - f(x^*)].$$
Now use this in (12) and assume the step size satisfies (13) to get
$$\mathbb{E}[f(x_{k+1})] \le f(x_k) - 2\mu\alpha\Big(1 - \frac{LB^2\alpha}{2}\Big)[f(x_k) - f(x^*)].$$
We now subtract $f(x^*)$ from both sides and take the expectation with respect to the entire history of samples to obtain
$$\mathbb{E}[f(x_{k+1})] - f(x^*) \le \rho\,\big[\mathbb{E}[f(x_k)] - f(x^*)\big], \qquad \rho = 1 - 2\mu\alpha\Big(1 - \frac{LB^2\alpha}{2}\Big).$$
Applying this recursively, we have
$$\mathbb{E}[f(x_k)] - f(x^*) \le \rho^{k}\,\big[f(x_0) - f(x^*)\big],$$
for some $\rho < 1$, where $\rho < 1$ holds because the quantity $2\mu\alpha(1 - LB^2\alpha/2)$ is positive under (13). Thus, the difference between the expected function value and the optimal function value decreases geometrically in the iteration number $k$.
In the particular case of $\alpha = \frac{1}{LB^2}$, this expression simplifies to
$$\mathbb{E}[f(x_k)] - f(x^*) \le \Big(1 - \frac{\mu}{LB^2}\Big)^{k}\,\big[f(x_0) - f(x^*)\big],$$
and thus the method approaches the rate of the deterministic method with a step size of $1/L$ (see Luenberger and Ye, 2008, §8.6) as $B$ approaches one.
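As an illustration (not part of the original analysis), the sketch below runs constant-step SGD with $\alpha = 1/(LB^2)$ on a strongly convex least-squares problem whose residuals all vanish at the solution, so that (5) can hold. The instance is our own, and $B$ is estimated heuristically from random points rather than computed exactly; the printed suboptimality decreases roughly geometrically, in line with the rate above.

```python
import numpy as np

# Constant-step SGD on a strongly convex, zero-residual least-squares problem;
# the instance is our own, and B is estimated heuristically from random points.
rng = np.random.default_rng(4)
N, d = 100, 5
A = rng.standard_normal((N, d))
x_star = rng.standard_normal(d)
b = A @ x_star                            # zero residuals: f(x_star) = 0

f = lambda x: 0.5 * np.mean((A @ x - b) ** 2)
L = np.linalg.eigvalsh(A.T @ A / N).max()

def growth_ratio(x):                      # max_i ||grad f_i|| / ||grad f|| at x
    r = A @ x - b
    return np.linalg.norm(A * r[:, None], axis=1).max() / np.linalg.norm(A.T @ r / N)

B = max(growth_ratio(x_star + rng.standard_normal(d)) for _ in range(1000))

alpha = 1.0 / (L * B**2)
x = np.zeros(d)
for k in range(1, 5001):
    i = rng.integers(N)
    x = x - alpha * A[i] * (A[i] @ x - b[i])   # iteration (3) with constant step
    if k % 1000 == 0:
        print(k, f(x))                    # decreases roughly geometrically in k
```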
7 Sublinear Convergence for Convex Objectives
We now turn to the case where $f$ is convex but not necessarily strongly convex. In this case, we show that if at least one minimizer $x^*$ exists, then a step size of $\alpha = \frac{1}{LB^2}$ leads to an $O(1/k)$ error. By convexity, we have for any minimizer $x^*$ that
$$f(x^*) \ge f(x_k) + \nabla f(x_k)^{T}(x^* - x_k),$$
and thus for any $k$ that
$$f(x_k) \le f(x^*) + \nabla f(x_k)^{T}(x_k - x^*).$$
We use this to bound $f(x_k)$ in (11) to get
$$f(x_{k+1}) \le f(x^*) + \nabla f(x_k)^{T}(x_k - x^*) - \alpha\,\nabla f(x_k)^{T}[\nabla f(x_k) + e_k] + \frac{L\alpha^2}{2}\,\|\nabla f(x_k) + e_k\|^2. \qquad (14)$$
Note that, by expanding $\|x_{k+1} - x^*\|^2 = \|x_k - x^* - \alpha[\nabla f(x_k) + e_k]\|^2$,
$$\nabla f(x_k)^{T}(x_k - x^*) = \frac{1}{2\alpha}\Big[\|x_k - x^*\|^2 - \|x_{k+1} - x^*\|^2\Big] + \frac{\alpha}{2}\,\|\nabla f(x_k) + e_k\|^2 - e_k^{T}(x_k - x^*),$$
and using this to replace $\nabla f(x_k)^{T}(x_k - x^*)$ in (14) we obtain the ugly expression
$$f(x_{k+1}) \le f(x^*) + \frac{1}{2\alpha}\Big[\|x_k - x^*\|^2 - \|x_{k+1} - x^*\|^2\Big] - e_k^{T}(x_k - x^*) + \frac{\alpha(1 + L\alpha)}{2}\,\|\nabla f(x_k) + e_k\|^2 - \alpha\,\nabla f(x_k)^{T}[\nabla f(x_k) + e_k].$$
Taking the expectation with respect to $i_k$ and using properties (9) and (10), so that $\mathbb{E}[e_k^{T}(x_k - x^*)] = 0$, $\mathbb{E}[\nabla f(x_k)^{T}(\nabla f(x_k) + e_k)] = \|\nabla f(x_k)\|^2$, and $\mathbb{E}[\|\nabla f(x_k) + e_k\|^2] \le B^2\|\nabla f(x_k)\|^2$, this becomes
$$\mathbb{E}[f(x_{k+1})] \le f(x^*) + \frac{1}{2\alpha}\Big[\|x_k - x^*\|^2 - \mathbb{E}\|x_{k+1} - x^*\|^2\Big] + \frac{\alpha}{2}\Big[(1 + L\alpha)B^2 - 2\Big]\|\nabla f(x_k)\|^2. \qquad (15)$$
Using the choice $\alpha = \frac{1}{LB^2}$, the coefficient of $\|\nabla f(x_k)\|^2$ in (15) becomes $\frac{\alpha}{2}(B^2 - 1)$, and we can make all terms in $\|\nabla f(x_k)\|^2$ cancel out by invoking the descent bound (12), which for this step size gives $\|\nabla f(x_k)\|^2 \le \frac{2}{\alpha}\big[f(x_k) - \mathbb{E}[f(x_{k+1})]\big]$, because then
$$\frac{\alpha}{2}(B^2 - 1)\,\|\nabla f(x_k)\|^2 \le (B^2 - 1)\big[f(x_k) - \mathbb{E}[f(x_{k+1})]\big].$$
We now take the expectation of (15) with respect to the entire history of samples $\{i_0, i_1, \ldots, i_k\}$ and note that $\mathbb{E}\big[\mathbb{E}[\,\cdot \mid x_k]\big] = \mathbb{E}[\,\cdot\,]$ to obtain, writing $\delta_j = \mathbb{E}[f(x_j)] - f(x^*)$,
$$\delta_{k+1} \le \frac{1}{2\alpha}\Big[\mathbb{E}\|x_k - x^*\|^2 - \mathbb{E}\|x_{k+1} - x^*\|^2\Big] + (B^2 - 1)\big[\delta_k - \delta_{k+1}\big].$$
If we sum up the error from iteration $0$ to iteration $k-1$, both bracketed terms telescope and we have
$$\sum_{i=1}^{k}\delta_i \le \frac{1}{2\alpha}\Big[\|x_0 - x^*\|^2 - \mathbb{E}\|x_k - x^*\|^2\Big] + (B^2 - 1)\big[\delta_0 - \delta_k\big].$$
Hence, dropping the subtracted non-negative terms and substituting $\alpha = \frac{1}{LB^2}$, we have
$$\sum_{i=1}^{k}\delta_i \le \frac{LB^2}{2}\,\|x_0 - x^*\|^2 + (B^2 - 1)\,\big[f(x_0) - f(x^*)\big].$$
Since $\delta_i$ is a non-increasing function of $i$ (by the descent property (12)), the sum on the left-hand side is larger than $k$ times its last element. Hence, we get
$$\mathbb{E}[f(x_k)] - f(x^*) \le \frac{1}{k}\left[\frac{LB^2}{2}\,\|x_0 - x^*\|^2 + (B^2 - 1)\,\big[f(x_0) - f(x^*)\big]\right] = O(1/k).$$
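A matching illustration for the non-strongly-convex case (again our own construction, not from the paper: a consistent but rank-deficient least-squares instance, with $B$ estimated heuristically) prints $k \cdot f(x_k)$, which stays bounded along a single run, consistent with the $O(1/k)$ rate; a single stochastic run only indicates the trend.

```python
import numpy as np

# Constant-step SGD on a convex but NOT strongly convex problem: consistent
# least squares with rank-deficient A (singular Hessian). The instance is our
# own, and B is estimated heuristically from random points.
rng = np.random.default_rng(5)
N, d, rank = 100, 10, 3
A = rng.standard_normal((N, rank)) @ rng.standard_normal((rank, d))
x_star = rng.standard_normal(d)
b = A @ x_star                            # consistent: the optimal value is 0

f = lambda x: 0.5 * np.mean((A @ x - b) ** 2)
L = np.linalg.eigvalsh(A.T @ A / N).max()

def growth_ratio(x):                      # max_i ||grad f_i|| / ||grad f|| at x
    r = A @ x - b
    return np.linalg.norm(A * r[:, None], axis=1).max() / np.linalg.norm(A.T @ r / N)

B = max(growth_ratio(x_star + rng.standard_normal(d)) for _ in range(1000))

alpha = 1.0 / (L * B**2)
x = np.zeros(d)
for k in range(1, 20001):
    i = rng.integers(N)
    x = x - alpha * A[i] * (A[i] @ x - b[i])
    if k % 4000 == 0:
        print(k, k * f(x))                # k * suboptimality stays bounded
```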
References
- Bertsekas [1999] D. Bertsekas. Nonlinear programming. Athena Scientific, 1999.
- Luenberger and Ye [2008] D. Luenberger and Y. Ye. Linear and nonlinear programming. Springer Verlag, 2008.
- Nemirovski [1994] A. Nemirovski. Efficient methods in convex programming. Lecture notes, 1994.
- Nemirovski et al. [2009] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.
- Nesterov [2004] Y. Nesterov. Introductory lectures on convex optimization: A basic course. Springer Netherlands, 2004.
- Solodov [1998] M. Solodov. Incremental gradient algorithms with stepsizes bounded away from zero. Computational Optimization and Applications, 11(1):23–35, 1998.
- Tseng [1998] P. Tseng. An incremental gradient(-projection) method with momentum term and adaptive stepsize rule. SIAM Journal on Optimization, 8(2):506–531, 1998.