Fast Convergence of Stochastic Gradient Descent under a Strong Growth Condition

Mark Schmidt and Nicolas Le Roux
Abstract

We consider optimizing a smooth convex function $g$ that is the average of a set of differentiable functions $f_i$, under the assumption considered by Solodov (1998) and Tseng (1998) that the norm of each gradient $f_i'$ is bounded by a linear function of the norm of the average gradient $g'$. We show that under this assumption the basic stochastic gradient method with a sufficiently-small constant step-size has an $O(1/k)$ convergence rate, and has a linear convergence rate if $g$ is strongly-convex.

1 Deterministic vs. Stochastic Gradient Descent

We consider optimizing a function $g$ that is the average of a set of $N$ differentiable functions $f_i$,

$\min_{x} \; g(x) := \frac{1}{N} \sum_{i=1}^{N} f_i(x),$   (1)

where we assume that $g$ is convex and its gradient $g'$ is Lipschitz-continuous with constant $L$, meaning that for all $x$ and $y$ we have

$\|g'(x) - g'(y)\| \leq L \|x - y\|.$

If $g$ is twice-differentiable, these assumptions are equivalent to assuming that the eigenvalues of the Hessian $g''(x)$ are bounded between $0$ and $L$ for all $x$.

Deterministic gradient methods for problems of this form use the iteration

$x_{k+1} = x_k - \alpha_k \, g'(x_k),$   (2)

for a sequence of step sizes $\alpha_k$. In contrast, stochastic gradient methods use the iteration

$x_{k+1} = x_k - \alpha_k \, f'_{i_k}(x_k),$   (3)

for an individual data sample $i_k$ selected uniformly at random from the set $\{1, 2, \dots, N\}$.
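
As a concrete illustration of the two iterations (a sketch, not part of the original text), the following Python functions implement (2) and (3) for user-supplied gradients; the names grad_g and grad_f are placeholders for the gradients $g'$ and $f_i'$:

    import numpy as np

    def gradient_descent(grad_g, x0, alpha, iters):
        # Deterministic iteration (2) with a constant step size alpha.
        x = np.asarray(x0, dtype=float).copy()
        for _ in range(iters):
            x -= alpha * grad_g(x)
        return x

    def stochastic_gradient(grad_f, x0, alpha, iters, seed=0):
        # Stochastic iteration (3): pick i_k uniformly at random, then step
        # along the negative gradient of the single sampled function f_{i_k}.
        rng = np.random.default_rng(seed)
        x = np.asarray(x0, dtype=float).copy()
        N = len(grad_f)
        for _ in range(iters):
            i = rng.integers(N)
            x -= alpha * grad_f[i](x)
        return x

Here grad_f is a list of the $N$ individual gradient functions, so each stochastic iteration touches only one of them, while each deterministic iteration implicitly costs $N$ gradient evaluations.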

The stochastic gradient method is appealing because the cost of its iterations is independent of $N$. However, in order to guarantee convergence, stochastic gradient methods require a decreasing sequence of step sizes, and this leads to a slower convergence rate. In particular, for convex objective functions the stochastic gradient method with a decreasing sequence of step sizes has an expected error on iteration $k$ of $O(1/\sqrt{k})$ (Nemirovski, 1994, §14.1), meaning that

$\mathbb{E}[g(x_k)] - g(x^*) = O(1/\sqrt{k}).$

In contrast, the deterministic gradient method with a constant step size has a smaller error of $O(1/k)$ (Nesterov, 2004, §2.1.5). The situation is more dramatic when $g$ is strongly convex, meaning that

$g(y) \geq g(x) + \langle g'(x), y - x \rangle + \frac{\mu}{2} \|y - x\|^2,$   (4)

for all $x$ and $y$ and some $\mu > 0$. For twice-differentiable functions, this is equivalent to assuming that the eigenvalues of the Hessian $g''(x)$ are bounded below by $\mu$. For strongly convex objective functions, the stochastic gradient method with a decreasing sequence of step sizes has an error of $O(1/k)$ (Nemirovski et al., 2009, §2.1), while the deterministic method with a constant step size has a linear convergence rate. In particular, the deterministic method satisfies

$g(x_k) - g(x^*) = O(\rho^k),$

for some $\rho < 1$ (Luenberger and Ye, 2008, §8.6).

The purpose of this note is to show that, if the individual gradients $f_i'$ satisfy a certain strong growth condition relative to the full gradient $g'$, the stochastic gradient method with a sufficiently small constant step size achieves (in expectation) the convergence rates stated above for the deterministic gradient method.

2 A Strong Growth Condition

The particular condition we consider in this work is that for all $x$ and all $i$ we have

$\|f_i'(x)\| \leq B \, \|g'(x)\|,$   (5)

for some constant $B$. This condition states that the norms of the gradients of the individual functions are bounded by a linear function of the norm of the average gradient. Note that this condition is very strong and is not satisfied in most applications. In particular, this condition requires that any $x^*$ that is optimal for problem (1) must also be a stationary point of each $f_i$, so that

$g'(x^*) = 0 \quad \Rightarrow \quad f_i'(x^*) = 0 \text{ for all } i.$

In the context of non-linear least squares problems this condition requires that all residuals be zero at the solution, a property that can be used to show local superlinear convergence of Gauss-Newton algorithms (Bertsekas, 1999, §1.5.1).
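
As an illustration of a problem family where (5) does hold (a sketch of my own, not an example from the paper), consider one-dimensional quadratics $f_i(x) = \frac{a_i}{2} x^2$ with curvatures $a_i > 0$: all $f_i$ share the minimizer $x^* = 0$, and $\|f_i'(x)\| / \|g'(x)\| = a_i / \bar{a}$ is independent of $x$, so (5) holds with $B = \max_i a_i / \bar{a}$. The following sketch checks this numerically:

    import numpy as np

    rng = np.random.default_rng(0)
    a = rng.uniform(0.5, 2.0, size=10)   # curvatures a_i > 0
    a_bar = a.mean()

    # f_i(x) = (a_i/2) x^2, so f_i'(x) = a_i x and g'(x) = a_bar x.
    # The ratio ||f_i'(x)|| / ||g'(x)|| = a_i / a_bar does not depend on x,
    # so condition (5) holds with B = max_i a_i / a_bar.
    B_exact = a.max() / a_bar

    # Numerical check of (5) at random points x != 0.
    xs = rng.standard_normal(1000)
    ratios = np.abs(np.outer(a, xs)) / np.abs(a_bar * xs)
    print(B_exact, ratios.max())         # the two values should agree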

Under condition (5), Solodov (1998) and Tseng (1998) have analyzed convergence properties of deterministic incremental gradient methods. In these methods, the iteration (3) is used but the data sample $i_k$ is chosen in a deterministic fashion by proceeding through the samples in a cyclic order. Normally, the deterministic incremental gradient method requires a decreasing sequence of step sizes to achieve convergence, but Solodov shows that under condition (5) the deterministic incremental gradient method converges with a sufficiently small constant step size. Further, Tseng shows that a deterministic incremental gradient method with a sufficiently small step size may have a form of linear convergence under condition (5). However, this form of linear convergence treats full passes through the data as iterations, similar to the deterministic gradient method. Below, we show that the stochastic gradient descent method achieves a linear convergence rate in expectation, using iterations that only look at one training example.

3 Error Properties

It will be convenient to re-write the stochastic gradient iteration (3) in the form

$x_{k+1} = x_k - \alpha \, [\, g'(x_k) + e_k \,],$   (6)

where we have assumed a constant step size $\alpha$ and where the error $e_k$ is given by

$e_k = f'_{i_k}(x_k) - g'(x_k).$   (7)

That is, we treat the stochastic gradient descent iteration as a full gradient iteration of the form (2) but with an error $e_k$ in the gradient calculation. Because $i_k$ is sampled uniformly from the set $\{1, 2, \dots, N\}$, note that we have

$\mathbb{E}[f'_{i_k}(x_k)] = \frac{1}{N} \sum_{i=1}^{N} f_i'(x_k) = g'(x_k),$   (8)

and subsequently that the error has a mean of zero,

$\mathbb{E}[e_k] = 0.$   (9)

In addition to this simple property, our analysis will also use a bound on the variance term $\mathbb{E}[\|e_k\|^2]$ in terms of $\|g'(x_k)\|^2$. To obtain this we first use (7), then expand and use (8), and finally use our assumption (5) to get

$\mathbb{E}[\|e_k\|^2] = \mathbb{E}[\|f'_{i_k}(x_k) - g'(x_k)\|^2] = \mathbb{E}[\|f'_{i_k}(x_k)\|^2] - \|g'(x_k)\|^2 \leq (B^2 - 1) \, \|g'(x_k)\|^2.$   (10)
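
To make these error properties concrete, here is a small numerical check (an illustration, not part of the original analysis) reusing the quadratic example sketched in Section 2. Because $i_k$ takes only $N$ values, both expectations can be computed exactly as finite averages:

    import numpy as np

    rng = np.random.default_rng(1)
    a = rng.uniform(0.5, 2.0, size=10)   # curvatures of f_i(x) = (a_i/2) x^2
    a_bar, x_k = a.mean(), 3.0

    grads = a * x_k                      # the N possible values of f'_{i_k}(x_k)
    g_prime = a_bar * x_k                # g'(x_k)
    errors = grads - g_prime             # the N possible errors e_k, each with probability 1/N
    B = a.max() / a_bar                  # constant from the strong growth condition (5)

    print(np.isclose(errors.mean(), 0.0))                         # property (9)
    print((errors**2).mean() <= (B**2 - 1) * g_prime**2 + 1e-12)  # property (10)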

4 Upper Bound on Progress

We first review a basic inequality for inexact gradient methods of the form (6), when applied to functions that have a Lipschitz-continuous gradient. In particular, because $g'$ is Lipschitz-continuous, we have for all $x$ and $y$ that

$g(y) \leq g(x) + \langle g'(x), y - x \rangle + \frac{L}{2} \|y - x\|^2.$

Plugging in $y = x_{k+1}$ and $x = x_k$ we get

$g(x_{k+1}) \leq g(x_k) + \langle g'(x_k), x_{k+1} - x_k \rangle + \frac{L}{2} \|x_{k+1} - x_k\|^2.$

From (6) we have that $x_{k+1} - x_k = -\alpha \, [\, g'(x_k) + e_k \,]$, so we obtain

$g(x_{k+1}) \leq g(x_k) - \alpha \, \|g'(x_k)\|^2 - \alpha \, \langle g'(x_k), e_k \rangle + \frac{\alpha^2 L}{2} \, \|g'(x_k) + e_k\|^2.$   (11)

5 Descent Property

We now show that, if the step size $\alpha$ is sufficiently small and the error $e_k$ is as described in Section 3, the expected value of $g(x_{k+1})$ is less than $g(x_k)$. In particular, we take the expectation of both sides of (11) with respect to $i_k$, and use (9) and (10) to obtain

$\mathbb{E}[g(x_{k+1})] \leq g(x_k) - \alpha \left(1 - \frac{\alpha L B^2}{2}\right) \|g'(x_k)\|^2.$   (12)

This inequality shows that if $x_k$ is not a minimizer, then the stochastic gradient descent iteration is expected to decrease the objective function for any step size satisfying

$0 < \alpha < \frac{2}{L B^2}.$   (13)
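
The descent property is easy to check exactly on the small quadratic example from the earlier sketches, since the expectation over $i_k$ is a finite average (again, an illustration rather than part of the original analysis):

    import numpy as np

    rng = np.random.default_rng(2)
    a = rng.uniform(0.5, 2.0, size=10)      # f_i(x) = (a_i/2) x^2
    a_bar = a.mean()
    L, B = a_bar, a.max() / a_bar           # g''(x) = a_bar = L; growth constant B
    alpha = 1.9 / (L * B**2)                # just inside the range (13)

    g = lambda x: 0.5 * a_bar * x**2
    x_k = 3.0
    # Exact expectation over i_k: average g(x_{k+1}) across all N possible samples.
    expected_next = np.mean([g(x_k - alpha * ai * x_k) for ai in a])
    print(expected_next < g(x_k))           # True: expected descent, as (12) predicts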

6 Linear Convergence for Strongly Convex Objectives

We now use the bound (12) to show that, for strongly convex functions, constant step sizes satisfying (13) lead to an expected linear convergence rate. First, use $x = x_k$ in (4) and minimize both sides of (4) with respect to $y$ to obtain

$g(x^*) \geq g(x_k) - \frac{1}{2\mu} \|g'(x_k)\|^2,$

where $x^*$ is the minimizer of $g$. Subsequently, we have

$\|g'(x_k)\|^2 \geq 2\mu \, [\, g(x_k) - g(x^*) \,].$

Now use this in (12) and assume the step sizes satisfy (13) to get

$\mathbb{E}[g(x_{k+1})] \leq g(x_k) - 2\mu\alpha \left(1 - \frac{\alpha L B^2}{2}\right) [\, g(x_k) - g(x^*) \,].$

We now subtract $g(x^*)$ from both sides and take the expectation with respect to the sequence $\{i_0, i_1, \dots, i_{k-1}\}$ to obtain

$\mathbb{E}[g(x_{k+1})] - g(x^*) \leq \left[1 - 2\mu\alpha \left(1 - \frac{\alpha L B^2}{2}\right)\right] \left(\mathbb{E}[g(x_k)] - g(x^*)\right).$

Applying this recursively we have

$\mathbb{E}[g(x_k)] - g(x^*) \leq \rho^k \, [\, g(x_0) - g(x^*) \,],$

for some $\rho := 1 - 2\mu\alpha(1 - \alpha L B^2/2) < 1$. Thus, the difference between the expected function value and the optimal function value decreases geometrically in the iteration number $k$.

In the particular case of $\alpha = 1/(L B^2)$, this expression simplifies to

$\mathbb{E}[g(x_k)] - g(x^*) \leq \left(1 - \frac{\mu}{L B^2}\right)^k [\, g(x_0) - g(x^*) \,],$

and thus the method approaches the rate $(1 - \mu/L)^k$ of the deterministic method with a step size of $1/L$ (see Luenberger and Ye, 2008, §8.6) as $B$ approaches one.
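
The geometric decrease is easy to observe numerically. The sketch below (illustrative only) reuses the one-dimensional quadratic example, for which $\mu = L = \bar{a}$ and the expected optimality gap contracts by an exactly computable factor each iteration; that factor is never worse than the $1 - \mu/(LB^2)$ rate guaranteed above:

    import numpy as np

    rng = np.random.default_rng(3)
    a = rng.uniform(0.5, 2.0, size=10)   # f_i(x) = (a_i/2) x^2, so mu = L = a.mean() here
    a_bar = a.mean()
    L = mu = a_bar
    B = a.max() / a_bar
    alpha = 1.0 / (L * B**2)

    # For these quadratics x_{k+1} = (1 - alpha * a_{i_k}) x_k, so the expected
    # gap E[g(x_k)] - g(x*) contracts by exactly E[(1 - alpha*a)^2] per iteration.
    exact_rate = np.mean((1.0 - alpha * a) ** 2)
    bound_rate = 1.0 - mu / (L * B**2)   # the rate guaranteed by the analysis
    print(exact_rate <= bound_rate)      # True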

7 Sublinear Convergence for Convex Objectives

We now turn to the case where $g$ is convex but not necessarily strongly convex. In this case, we show that if at least one minimizer $x^*$ exists, then a step size of $\alpha = 1/(2LB^2)$ leads to an $O(1/k)$ error. By convexity, we have for any minimizer $x^*$ that

$g(x^*) \geq g(x_k) + \langle g'(x_k), x^* - x_k \rangle,$

and thus for any $x_k$ that

$g(x_k) \leq g(x^*) + \langle g'(x_k), x_k - x^* \rangle.$

We use this to bound $g(x_k)$ in (11) to get

$g(x_{k+1}) \leq g(x^*) + \langle g'(x_k), x_k - x^* \rangle - \alpha \, \|g'(x_k)\|^2 - \alpha \, \langle g'(x_k), e_k \rangle + \frac{\alpha^2 L}{2} \, \|g'(x_k) + e_k\|^2.$   (14)

Note that, by expanding $\|x_{k+1} - x^*\|^2 = \|x_k - x^* - \alpha \, [\, g'(x_k) + e_k \,]\|^2$,

$\langle g'(x_k) + e_k, x_k - x^* \rangle = \frac{1}{2\alpha} \left[ \|x_k - x^*\|^2 - \|x_{k+1} - x^*\|^2 \right] + \frac{\alpha}{2} \, \|g'(x_k) + e_k\|^2,$

and using this to replace $\langle g'(x_k), x_k - x^* \rangle = \langle g'(x_k) + e_k, x_k - x^* \rangle - \langle e_k, x_k - x^* \rangle$ in (14) we obtain the ugly expression

$g(x_{k+1}) \leq g(x^*) + \frac{1}{2\alpha} \left[ \|x_k - x^*\|^2 - \|x_{k+1} - x^*\|^2 \right] - \langle e_k, x_k - x^* \rangle - \alpha \, \|g'(x_k)\|^2 - \alpha \, \langle g'(x_k), e_k \rangle + \frac{\alpha + \alpha^2 L}{2} \, \|g'(x_k) + e_k\|^2.$

Taking the expectation with respect to $i_k$ and using properties (9) and (10), this becomes

$\mathbb{E}[g(x_{k+1})] \leq g(x^*) + \frac{1}{2\alpha} \left[ \|x_k - x^*\|^2 - \mathbb{E}\|x_{k+1} - x^*\|^2 \right] + \alpha \left[ \frac{(1 + \alpha L) B^2}{2} - 1 \right] \|g'(x_k)\|^2.$   (15)

Using $\|g'(x_k)\|^2 \leq 2L \, [\, g(x_k) - g(x^*) \,]$, which follows by minimizing both sides of the Lipschitz bound from Section 4 with respect to $y$, we can bound the remaining $\|g'(x_k)\|^2$ term by $\frac{1}{2} [\, g(x_k) - g(x^*) \,]$ with the choice $\alpha = 1/(2LB^2)$ because

$2L\alpha \left[ \frac{(1 + \alpha L) B^2}{2} - 1 \right] = \frac{1}{B^2} \left[ \frac{B^2}{2} + \frac{1}{4} - 1 \right] = \frac{1}{2} - \frac{3}{4B^2} \leq \frac{1}{2}.$
We now take the expectation of (15) with respect to the sequence $\{i_0, i_1, \dots, i_{k-1}\}$ and apply this bound to obtain

$\mathbb{E}[g(x_{k+1})] - g(x^*) \leq \frac{1}{2\alpha} \left[ \mathbb{E}\|x_k - x^*\|^2 - \mathbb{E}\|x_{k+1} - x^*\|^2 \right] + \frac{1}{2} \left( \mathbb{E}[g(x_k)] - g(x^*) \right).$

If we sum up the error from $k = 0$ to $n - 1$, the distance terms telescope and we have

$\sum_{k=1}^{n} \left( \mathbb{E}[g(x_k)] - g(x^*) \right) \leq \frac{1}{2} \sum_{k=0}^{n-1} \left( \mathbb{E}[g(x_k)] - g(x^*) \right) + \frac{1}{2\alpha} \|x_0 - x^*\|^2.$

Hence, collecting the terms common to the two sums on the left-hand side, we have

$\sum_{k=1}^{n} \left( \mathbb{E}[g(x_k)] - g(x^*) \right) \leq g(x_0) - g(x^*) + \frac{1}{\alpha} \|x_0 - x^*\|^2 = g(x_0) - g(x^*) + 2LB^2 \, \|x_0 - x^*\|^2.$

Since $\mathbb{E}[g(x_k)]$ is a non-increasing function of $k$ (by the descent property (12), which applies because $\alpha = 1/(2LB^2)$ satisfies (13)), the sum on the left-hand side is larger than $n$ times its last element. Hence, we get

$\mathbb{E}[g(x_n)] - g(x^*) \leq \frac{g(x_0) - g(x^*) + 2LB^2 \, \|x_0 - x^*\|^2}{n} = O(1/n).$
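
A numerical illustration of the $O(1/k)$ rate needs an objective that is convex but not strongly convex; one option (an assumption of this sketch, not an example from the paper) is $f_i(x) = a_i h(x)$ with $h(x) = \sqrt{1 + x^2} - 1$, which satisfies (5) with $B = \max_i a_i / \bar{a}$ and has $L = \bar{a}$ since $|h''| \leq 1$. The sketch estimates the expected gap by Monte Carlo averaging and checks it against the bound above:

    import numpy as np

    rng = np.random.default_rng(4)
    a = rng.uniform(0.5, 2.0, size=10)          # f_i(x) = a_i * h(x)
    a_bar = a.mean()
    h = lambda x: np.sqrt(1.0 + x**2) - 1.0     # smooth, convex, not strongly convex
    h_prime = lambda x: x / np.sqrt(1.0 + x**2)

    L = a_bar                                   # |h''| <= 1, so g' is a_bar-Lipschitz
    B = a.max() / a_bar                         # f_i' = a_i h', so (5) holds with this B
    alpha = 1.0 / (2.0 * L * B**2)              # the step size used in the analysis

    x0, n, runs = 20.0, 2000, 200
    gaps = np.zeros(n)
    for _ in range(runs):                       # Monte Carlo estimate of E[g(x_k)] - g(x*)
        x = x0
        for k in range(n):
            gaps[k] += a_bar * h(x) / runs
            x -= alpha * a[rng.integers(len(a))] * h_prime(x)

    bound = (a_bar * h(x0) + 2*L*B**2 * x0**2) / np.arange(1, n + 1)
    print(np.all(gaps[1:] <= bound[:-1]))       # empirical gap sits below the O(1/k) bound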

References

  • D. Bertsekas. Nonlinear Programming. Athena Scientific, 1999.
  • D. Luenberger and Y. Ye. Linear and Nonlinear Programming. Springer, 2008.
  • A. Nemirovski. Efficient Methods in Convex Programming. Lecture notes, 1994.
  • A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.
  • Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Springer, 2004.
  • M. Solodov. Incremental gradient algorithms with stepsizes bounded away from zero. Computational Optimization and Applications, 11(1):23–35, 1998.
  • P. Tseng. An incremental gradient(-projection) method with momentum term and adaptive stepsize rule. SIAM Journal on Optimization, 8(2):506–531, 1998.