
# Fast Convergence of Stochastic Gradient Descent under a Strong Growth Condition

Mark Schmidt and Nicolas Le Roux
###### Abstract

We consider optimizing a smooth convex function $f$ that is the average of a set of $N$ differentiable functions $f_i$, under the assumption considered by Solodov (1998) and Tseng (1998) that the norm of each gradient $f_i'$ is bounded by a linear function of the norm of the average gradient $f'$. We show that under this assumption the basic stochastic gradient method with a sufficiently small constant step size has an $O(1/k)$ convergence rate, and has a linear convergence rate if $f$ is strongly convex.

## 1 Deterministic vs. Stochastic Gradient Descent

We consider optimizing a function $f$ that is the average of a set of $N$ differentiable functions $f_i$,

$$\min_{x\in\mathbb{R}^P} f(x) := \frac{1}{N}\sum_{i=1}^{N} f_i(x), \qquad (1)$$

where we assume that $f$ is convex and that its gradient $f'$ is Lipschitz-continuous with constant $L$, meaning that for all $x$ and $y$ we have

$$\|f'(x) - f'(y)\| \leq L\,\|x - y\|.$$

If $f$ is twice-differentiable, these assumptions are equivalent to assuming that the eigenvalues of the Hessian $f''(x)$ are bounded between $0$ and $L$ for all $x$.

Deterministic gradient methods for problems of this form use the iteration

$$x_{k+1} = x_k - \alpha_k f'(x_k), \qquad (2)$$

for a sequence of step sizes $\{\alpha_k\}$. In contrast, stochastic gradient methods use the iteration

$$x_{k+1} = x_k - \alpha_k f_i'(x_k), \qquad (3)$$

for an individual data sample $i$ selected uniformly at random from the set $\{1, 2, \dots, N\}$.
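As a concrete illustration (ours, not from the paper), iteration (3) can be sketched in a few lines of Python; the gradient oracle `grad_i` and the toy data below are hypothetical names:

```python
import random

def sgd(grad_i, x0, N, alpha, iters, seed=0):
    """Stochastic gradient iteration (3): x_{k+1} = x_k - alpha * f_i'(x_k),
    with i drawn uniformly from {1, ..., N} at every step."""
    rng = random.Random(seed)
    x = x0
    for _ in range(iters):
        i = rng.randrange(N)           # uniform sample; cost per step is independent of N
        x = x - alpha * grad_i(i, x)   # constant step size, as analyzed in this note
    return x

# Hypothetical toy instance: f_i(x) = (x - c_i)^2 / 2, so f'(x) = x - mean(c).
c = [1.0, 2.0, 3.0]
x_final = sgd(lambda i, x: x - c[i], x0=0.0, N=3, alpha=0.1, iters=2000)
```

Note that this toy does not satisfy the strong growth condition of Section 2 (the $f_i'$ do not vanish simultaneously at the minimizer $\bar{c} = 2$), so with a constant step size the iterates only hover near the solution; the point of this note is that under condition (5) this residual noise disappears.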

The stochastic gradient method is appealing because the cost of its iterations is independent of $N$. However, in order to guarantee convergence, stochastic gradient methods require a decreasing sequence of step sizes, and this leads to a slower convergence rate. In particular, for convex objective functions the stochastic gradient method with a decreasing sequence of step sizes has an expected error on iteration $k$ of $O(1/\sqrt{k})$ (Nemirovski, 1994, §14.1), meaning that

$$\mathbb{E}[f(x_k)] - f(x^*) = O(1/\sqrt{k}).$$

In contrast, the deterministic gradient method with a constant step size has a smaller error of $O(1/k)$ (Nesterov, 2004, §2.1.5). The situation is more dramatic when $f$ is strongly convex, meaning that

$$f(y) \geq f(x) + \langle f'(x), y - x\rangle + \frac{\mu}{2}\|y - x\|^2, \qquad (4)$$

for all $x$ and $y$ and some $\mu > 0$. For twice-differentiable functions, this is equivalent to assuming that the eigenvalues of the Hessian are bounded below by $\mu$. For strongly convex objective functions, the stochastic gradient method with a decreasing sequence of step sizes has an error of $O(1/k)$ (Nemirovski et al., 2009, §2.1), while the deterministic method with a constant step size has a linear convergence rate. In particular, the deterministic method satisfies

$$f(x_k) - f(x^*) \leq \rho^k\,[f(x_0) - f(x^*)],$$

for some $\rho < 1$ (Luenberger and Ye, 2008, §8.6).

The purpose of this note is to show that, if the individual gradients $f_i'$ satisfy a certain strong growth condition relative to the full gradient $f'$, the stochastic gradient method with a sufficiently small constant step size achieves (in expectation) the convergence rates stated above for the deterministic gradient method.

## 2 A Strong Growth Condition

The particular condition we consider in this work is that for all $x$ we have

$$\max_i\{\|f_i'(x)\|\} \leq B\,\|f'(x)\|, \qquad (5)$$

for some constant $B \geq 1$. This condition states that the norms of the gradients of the individual functions are bounded by a linear function of the norm of the average gradient. Note that this condition is very strong and is not satisfied in most applications. In particular, this condition requires that any optimal solution of problem (1) must also be a stationary point of each $f_i$, so that

$$\big(f'(x) = 0\big) \;\Rightarrow\; \big(f_i'(x) = 0\big), \quad \forall i.$$

In the context of non-linear least squares problems this condition requires that all residuals be zero at the solution, a property that can be used to show local superlinear convergence of Gauss-Newton algorithms (Bertsekas, 1999, §1.5.1).
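For concreteness, condition (5) can be spot-checked numerically on a consistent least-squares problem, where $f_i(x) = \frac{1}{2}(a_i^\top x - b_i)^2$ and every residual is zero at the solution. The sketch below is our illustration, not from the paper, and all variable names are ours:

```python
import numpy as np

rng = np.random.default_rng(0)
N, P = 5, 3
A = rng.standard_normal((N, P))
x_star = rng.standard_normal(P)
b = A @ x_star                      # consistent system: every residual is zero at x*

def grad_i(i, x):                   # f_i'(x) = a_i (a_i^T x - b_i)
    return A[i] * (A[i] @ x - b[i])

def grad(x):                        # f'(x) = average of the f_i'(x)
    return A.T @ (A @ x - b) / N

# For this quadratic family, f_i'(x) = a_i a_i^T (x - x*) and f'(x) = H (x - x*)
# with H = A^T A / N, so condition (5) holds with B = max_i ||a_i||^2 / lambda_min(H)
# whenever H is positive definite (a simple sufficient choice, not the tightest B).
H = A.T @ A / N
B = max(np.linalg.norm(a) ** 2 for a in A) / np.linalg.eigvalsh(H)[0]

# Spot-check (5) at random points.
for _ in range(100):
    x = rng.standard_normal(P)
    lhs = max(np.linalg.norm(grad_i(i, x)) for i in range(N))
    assert lhs <= B * np.linalg.norm(grad(x)) + 1e-9
```

Because $\lambda_{\min}(H) > 0$ here, this family is also strongly convex, so it illustrates the setting of Section 6 as well.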

Under condition (5), Solodov (1998) and Tseng (1998) have analyzed convergence properties of deterministic incremental gradient methods. In these methods, the iteration (3) is used but the data sample $i$ is chosen in a deterministic fashion by proceeding through the samples in a cyclic order. Normally, the deterministic incremental gradient method requires a decreasing sequence of step sizes to achieve convergence, but Solodov shows that under condition (5) the deterministic incremental gradient method converges with a sufficiently small constant step size. Further, Tseng shows that a deterministic incremental gradient method with a sufficiently small step size may have a form of linear convergence under condition (5). However, this form of linear convergence treats full passes through the data as iterations, similar to the deterministic gradient method. Below, we show that the stochastic gradient descent method achieves a linear convergence rate in expectation, using iterations that only look at one training example.

## 3 Error Properties

It will be convenient to re-write the stochastic gradient iteration (3) in the form

$$x_{k+1} = x_k - \alpha\big(f'(x_k) + e_k\big), \qquad (6)$$

where we have assumed a constant step size $\alpha$ and where the error $e_k$ is given by

$$e_k = f_i'(x_k) - f'(x_k). \qquad (7)$$

That is, we treat the stochastic gradient descent iteration as a full gradient iteration of the form (2), but with an error $e_k$ in the gradient calculation. Because $i$ is sampled uniformly from the set $\{1, 2, \dots, N\}$, note that we have

$$\mathbb{E}[f_i'(x_k)] = \frac{1}{N}\sum_{i=1}^{N} f_i'(x_k) = f'(x_k), \qquad (8)$$

and subsequently that the error has a mean of zero,

$$\mathbb{E}[e_k] = \mathbb{E}[f_i'(x_k) - f'(x_k)] = \mathbb{E}[f_i'(x_k)] - f'(x_k) = 0. \qquad (9)$$

In addition to this simple property, our analysis will also use a bound on the variance term $\mathbb{E}[\|e_k\|^2]$ in terms of $\|f'(x_k)\|^2$. To obtain this we first use (7), then expand and use (8), and finally use our assumption (5) to get

$$\begin{aligned}
\mathbb{E}[\|e_k\|^2] &= \mathbb{E}\big[\|f_i'(x_k) - f'(x_k)\|^2\big] \\
&= \mathbb{E}\big[\|f_i'(x_k)\|^2 - 2\langle f_i'(x_k), f'(x_k)\rangle + \|f'(x_k)\|^2\big] \\
&= \mathbb{E}\big[\|f_i'(x_k)\|^2\big] - 2\langle \mathbb{E}[f_i'(x_k)], f'(x_k)\rangle + \|f'(x_k)\|^2 \\
&= \frac{1}{N}\sum_{i=1}^{N}\|f_i'(x_k)\|^2 - \|f'(x_k)\|^2 \\
&\leq (B^2 - 1)\,\|f'(x_k)\|^2. \qquad (10)
\end{aligned}$$
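The variance bound (10) can be spot-checked numerically on a toy family (our example, not from the paper): scalar quadratics $f_i(x) = c_i x^2/2$, which share the minimizer $0$ and hence satisfy (5) with $B = \max_i c_i / \bar{c}$, where $\bar{c}$ is the mean of the $c_i$:

```python
# Numeric spot-check of the variance bound (10) for the hypothetical family
# f_i(x) = c_i * x^2 / 2; every f_i has minimizer 0, so condition (5) holds
# with B = max_i c_i / mean(c).
c = [1.0, 2.0, 3.0]
cbar = sum(c) / len(c)
B = max(c) / cbar

for x in (-2.0, 0.5, 10.0):
    g = cbar * x                                    # f'(x)
    errs = [ci * x - g for ci in c]                 # e = f_i'(x) - f'(x)
    mean_sq = sum(e * e for e in errs) / len(errs)  # E[||e_k||^2]
    assert mean_sq <= (B ** 2 - 1) * g * g + 1e-12  # bound (10)
```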

## 4 Upper Bound on Progress

We first review a basic inequality for inexact gradient methods of the form (6) when applied to functions with a Lipschitz-continuous gradient. In particular, because $f'$ is Lipschitz-continuous, we have for all $x$ and $y$ that

$$f(y) \leq f(x) + \langle f'(x), y - x\rangle + \frac{L}{2}\|y - x\|^2.$$

Plugging in $y = x_{k+1}$ and $x = x_k$, we get

$$f(x_{k+1}) \leq f(x_k) + \langle f'(x_k), x_{k+1} - x_k\rangle + \frac{L}{2}\|x_{k+1} - x_k\|^2.$$

From (6) we have that $x_{k+1} - x_k = -\alpha(f'(x_k) + e_k)$, so we obtain

$$\begin{aligned}
f(x_{k+1}) &\leq f(x_k) - \alpha\langle f'(x_k), f'(x_k) + e_k\rangle + \frac{\alpha^2 L}{2}\|f'(x_k) + e_k\|^2 \\
&= f(x_k) - \alpha\Big(1 - \frac{\alpha L}{2}\Big)\|f'(x_k)\|^2 - \alpha(1 - \alpha L)\langle f'(x_k), e_k\rangle + \frac{\alpha^2 L}{2}\|e_k\|^2. \qquad (11)
\end{aligned}$$

## 5 Descent Property

We now show that, if the step size $\alpha$ is sufficiently small and the error $e_k$ is as described in Section 3, the expected value of $f(x_{k+1})$ is less than $f(x_k)$. In particular, we take the expectation of both sides of (11) with respect to $i$, and use (9) and (10) to obtain

$$\begin{aligned}
\mathbb{E}[f(x_{k+1})] &\leq f(x_k) - \alpha\Big(1 - \frac{\alpha L}{2}\Big)\|f'(x_k)\|^2 - \alpha(1 - \alpha L)\langle f'(x_k), \mathbb{E}[e_k]\rangle + \frac{\alpha^2 L}{2}\mathbb{E}[\|e_k\|^2] \\
&\leq f(x_k) - \alpha\Big(1 - \frac{\alpha L}{2}\Big)\|f'(x_k)\|^2 + \frac{\alpha^2 L (B^2 - 1)}{2}\|f'(x_k)\|^2 \\
&= f(x_k) - \alpha\Big(1 - \frac{\alpha L B^2}{2}\Big)\|f'(x_k)\|^2. \qquad (12)
\end{aligned}$$

This inequality shows that if $x_k$ is not a minimizer, then the stochastic gradient descent iteration is expected to decrease the objective function for any step size $\alpha$ satisfying

$$0 < \alpha < \frac{2}{L B^2}. \qquad (13)$$

## 6 Linear Convergence for Strongly Convex Objectives

We now use the bound (12) to show that, for strongly convex functions, constant step sizes satisfying (13) lead to an expected linear convergence rate. First, set $x = x_k$ in (4) and minimize both sides of (4) with respect to $y$ to obtain

$$f(x^*) \geq f(x_k) - \frac{1}{2\mu}\|f'(x_k)\|^2,$$

where $x^*$ is the minimizer of $f$. Subsequently, we have

$$-\|f'(x_k)\|^2 \leq -2\mu\big(f(x_k) - f(x^*)\big).$$

Now use this in (12) and assume the step sizes satisfy (13) to get

$$\mathbb{E}[f(x_{k+1})] \leq f(x_k) - 2\mu\alpha\Big(1 - \frac{\alpha L B^2}{2}\Big)\big[f(x_k) - f(x^*)\big].$$

We now subtract $f(x^*)$ from both sides and take the expectation with respect to the full sequence of data samples to obtain

$$\begin{aligned}
\mathbb{E}[f(x_{k+1})] - f(x^*) &\leq \mathbb{E}[f(x_k)] - f(x^*) - 2\mu\alpha\Big(1 - \frac{\alpha L B^2}{2}\Big)\big[\mathbb{E}[f(x_k)] - f(x^*)\big] \\
&= \Big(1 - 2\mu\alpha\Big(1 - \frac{\alpha L B^2}{2}\Big)\Big)\big[\mathbb{E}[f(x_k)] - f(x^*)\big].
\end{aligned}$$

Applying this recursively we have

$$\mathbb{E}[f(x_k)] - f(x^*) \leq \rho^k\,[f(x_0) - f(x^*)],$$

for some $\rho < 1$. Thus, the difference between the expected function value and the optimal function value decreases geometrically in the iteration number $k$.

In the particular case of $\alpha = 1/LB^2$, this expression simplifies to

$$\mathbb{E}[f(x_k)] - f(x^*) \leq \Big(1 - \frac{\mu}{L B^2}\Big)^k\,[f(x_0) - f(x^*)],$$

and thus the method approaches the rate of the deterministic method with a step size of $1/L$ (see Luenberger and Ye, 2008, §8.6) as $B$ approaches one.
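The linear rate can be observed on a toy problem (our sketch, not an experiment from the paper): scalar quadratics $f_i(x) = c_i(x-1)^2/2$ share the minimizer $x^* = 1$, so condition (5) holds with $B = \max_i c_i/\bar{c}$, and we run iteration (3) with the constant step $\alpha = 1/LB^2$:

```python
# Hedged toy (ours): f_i(x) = c_i * (x - 1)^2 / 2 with a shared minimizer x* = 1,
# so f'(x) = cbar * (x - 1), L = cbar, and condition (5) holds with B = max(c)/cbar.
import random

c = [1.0, 2.0, 3.0]
cbar = sum(c) / len(c)            # L = cbar, since f'(x) = cbar * (x - 1)
B = max(c) / cbar                 # here B = 1.5
alpha = 1.0 / (cbar * B ** 2)     # constant step size from Section 6

rng = random.Random(0)
x = 5.0
gaps = [0.5 * cbar * (x - 1.0) ** 2]           # f(x_k) - f(x*)
for _ in range(50):
    i = rng.randrange(len(c))
    x -= alpha * c[i] * (x - 1.0)              # f_i'(x) = c_i * (x - 1)
    gaps.append(0.5 * cbar * (x - 1.0) ** 2)

# In this toy, every factor 1 - alpha * c_i lies in (0, 1), so the gap shrinks
# geometrically on every sample path, not just in expectation.
assert gaps[-1] < 1e-8 * gaps[0]
```

The pathwise decrease is a special feature of this one-dimensional example; the result in this section only guarantees a geometric decrease of the expected gap.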

## 7 Sublinear O(1/k) Convergence for Convex Objectives

We now turn to the case where $f$ is convex but not necessarily strongly convex. In this case, we show that if at least one minimizer $x^*$ exists, then a step size of $\alpha = 1/LB^2$ leads to an $O(1/k)$ error. By convexity, we have for any minimizer $x^*$ that

$$f(x_k) \leq f(x^*) + \langle f'(x_k), x_k - x^*\rangle,$$

and thus for any $\beta \in [0, 1]$ that

$$f(x_k) \leq \beta f(x_k) + (1 - \beta) f(x^*) + (1 - \beta)\langle f'(x_k), x_k - x^*\rangle.$$

We use this to bound $f(x_k)$ in (11) to get

$$\begin{aligned}
f(x_{k+1}) \leq{}& \beta f(x_k) + (1 - \beta) f(x^*) + (1 - \beta)\langle f'(x_k), x_k - x^*\rangle \\
&- \alpha\Big(1 - \frac{\alpha L}{2}\Big)\|f'(x_k)\|^2 - \alpha(1 - \alpha L)\langle f'(x_k), e_k\rangle + \frac{\alpha^2 L}{2}\|e_k\|^2. \qquad (14)
\end{aligned}$$

Note that

$$\begin{aligned}
\frac{1}{2\alpha}\big(\|x_k - x^*\|^2 - \|x_{k+1} - x^*\|^2\big) &= \frac{1}{2\alpha}\big(\|x_k - x^*\|^2 - \|x_k - \alpha f'(x_k) - \alpha e_k - x^*\|^2\big) \\
&= -\frac{\alpha}{2}\|f'(x_k)\|^2 - \frac{\alpha}{2}\|e_k\|^2 - \alpha\langle f'(x_k), e_k\rangle \\
&\quad + \langle f'(x_k), x_k - x^*\rangle + \langle e_k, x_k - x^*\rangle,
\end{aligned}$$

and using this to replace $\langle f'(x_k), x_k - x^*\rangle$ in (14), we obtain the ugly expression

$$\begin{aligned}
f(x_{k+1}) \leq{}& \beta f(x_k) + (1 - \beta) f(x^*) + \frac{1 - \beta}{2\alpha}\big(\|x_k - x^*\|^2 - \|x_{k+1} - x^*\|^2\big) \\
&+ \frac{\alpha(1 - \beta)}{2}\big(\|f'(x_k)\|^2 + \|e_k\|^2\big) + (1 - \beta)\alpha\langle f'(x_k), e_k\rangle - (1 - \beta)\langle e_k, x_k - x^*\rangle \\
&- \alpha\Big(1 - \frac{\alpha L}{2}\Big)\|f'(x_k)\|^2 - \alpha(1 - \alpha L)\langle f'(x_k), e_k\rangle + \frac{\alpha^2 L}{2}\|e_k\|^2.
\end{aligned}$$

Taking the expectation with respect to $i$ and using properties (9) and (10), this becomes

$$\begin{aligned}
\mathbb{E}[f(x_{k+1})] \leq{}& \beta f(x_k) + (1 - \beta) f(x^*) + \frac{1 - \beta}{2\alpha}\big(\|x_k - x^*\|^2 - \mathbb{E}[\|x_{k+1} - x^*\|^2]\big) \\
&+ \frac{\alpha(1 - \beta)}{2}\big(\|f'(x_k)\|^2 + (B^2 - 1)\|f'(x_k)\|^2\big) \\
&- \alpha\Big(1 - \frac{\alpha L}{2}\Big)\|f'(x_k)\|^2 + \frac{L\alpha^2(B^2 - 1)}{2}\|f'(x_k)\|^2. \qquad (15)
\end{aligned}$$

Using $\alpha = 1/LB^2$, we can make all the terms in $\|f'(x_k)\|^2$ cancel out by choosing $\beta = 1 - 1/B^2$, because

$$\alpha(1 - \beta)B^2 - 2\alpha + L\alpha^2 B^2 = \alpha - 2\alpha + \alpha = 0.$$

We now take the expectation of (15) with respect to the full sequence of data samples and note that $\frac{1 - \beta}{2\alpha} = \frac{L}{2}$ to obtain

$$\mathbb{E}[f(x_{k+1})] - f(x^*) \leq \beta\,\mathbb{E}[f(x_k)] - \beta f(x^*) + \frac{L}{2}\big(\mathbb{E}[\|x_k - x^*\|^2] - \mathbb{E}[\|x_{k+1} - x^*\|^2]\big).$$

If we sum up the error from $k = 0$ to $k = n - 1$, we have

$$\begin{aligned}
\sum_{k=0}^{n-1}\big(\mathbb{E}[f(x_{k+1})] - f(x^*)\big) &\leq \beta\sum_{k=0}^{n-1}\big(\mathbb{E}[f(x_k)] - f(x^*)\big) + \frac{L}{2}\big(\|x_0 - x^*\|^2 - \mathbb{E}[\|x_n - x^*\|^2]\big) \\
&\leq \beta\sum_{k=1}^{n}\big(\mathbb{E}[f(x_k)] - f(x^*)\big) + \beta\big(f(x_0) - f(x^*)\big) + \frac{L}{2}\|x_0 - x^*\|^2.
\end{aligned}$$

Hence, we have

$$(1 - \beta)\sum_{k=0}^{n-1}\big(\mathbb{E}[f(x_{k+1})] - f(x^*)\big) \leq \beta\big(f(x_0) - f(x^*)\big) + \frac{L}{2}\|x_0 - x^*\|^2.$$

Since $\mathbb{E}[f(x_{k+1})]$ is a non-increasing function of $k$, the sum on the left-hand side is larger than the number of terms times its last element. Hence, taking $n = k$, we get

$$\begin{aligned}
\mathbb{E}[f(x_{k+1})] - f(x^*) &\leq \frac{1}{k}\sum_{i=0}^{k-1}\big(\mathbb{E}[f(x_{i+1})] - f(x^*)\big) \\
&\leq \frac{\beta\big(f(x_0) - f(x^*)\big) + \frac{L}{2}\|x_0 - x^*\|^2}{k(1 - \beta)} \\
&= \frac{2(B^2 - 1)\big(f(x_0) - f(x^*)\big) + L B^2\|x_0 - x^*\|^2}{2k} \\
&= O(1/k).
\end{aligned}$$

## References

• D. Bertsekas. Nonlinear programming. Athena Scientific, 1999.
• D. Luenberger and Y. Ye. Linear and nonlinear programming. Springer Verlag, 2008.
• A. Nemirovski. Efficient methods in convex programming. Lecture notes, 1994.
• A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.
• Y. Nesterov. Introductory lectures on convex optimization: A basic course. Springer Netherlands, 2004.
• M. Solodov. Incremental gradient algorithms with stepsizes bounded away from zero. Computational Optimization and Applications, 11(1):23–35, 1998.
• P. Tseng. An incremental gradient(-projection) method with momentum term and adaptive stepsize rule. SIAM Journal on Optimization, 8(2):506–531, 1998.