*This research was partially supported by NSF grants CMMI-1000347, CMMI-1254446, DMS-1319050, and ONR grant N00014-13-1-0036.*

# Accelerated Gradient Methods for Nonconvex Nonlinear and Stochastic Programming

## Abstract

In this paper, we generalize the well-known Nesterov’s accelerated gradient (AG) method, originally designed for convex smooth optimization, to solve nonconvex and possibly stochastic optimization problems. We demonstrate that by properly specifying the stepsize policy, the AG method exhibits the best known rate of convergence for solving general nonconvex smooth optimization problems by using first-order information, similarly to the gradient descent method. We then consider an important class of composite optimization problems and show that the AG method can solve them uniformly, i.e., by using the same aggressive stepsize policy as in the convex case, even if the problem turns out to be nonconvex. We demonstrate that the AG method exhibits an optimal rate of convergence if the composite problem is convex, and improves the best known rate of convergence if the problem is nonconvex. Based on the AG method, we also present new nonconvex stochastic approximation methods and show that they can improve a few existing rates of convergence for nonconvex stochastic optimization. To the best of our knowledge, this is the first time that the convergence of the AG method has been established for solving nonconvex nonlinear programming in the literature.

Keywords: nonconvex optimization, stochastic programming, accelerated gradient, complexity

AMS 2000 subject classification: 62L20, 90C25, 90C15, 68Q25

## 1 Introduction

In 1983, Nesterov in a celebrated work [Nest83-1] presented the accelerated gradient (AG) method for solving a class of convex programming (CP) problems given by

 Ψ* = min_{x ∈ R^n} Ψ(x). (1.1)

Here Ψ: R^n → R is a convex function with Lipschitz continuous gradient, i.e., such that (s.t.)

 ‖∇Ψ(y) − ∇Ψ(x)‖ ≤ L_Ψ ‖y − x‖   ∀x, y ∈ R^n. (1.2)

Nesterov shows that the number of iterations performed by this algorithm to find a solution x̄ s.t. Ψ(x̄) − Ψ* ≤ ε can be bounded by O(√(L_Ψ/ε)), which significantly improves the O(L_Ψ/ε) complexity bound possessed by the gradient descent method. Moreover, in view of the classic complexity theory for convex optimization by Nemirovski and Yudin [nemyud:83], the above iteration complexity bound is not improvable for smooth convex optimization when the dimension n is sufficiently large.

Nesterov’s AG method has attracted much interest recently due to the increasing need to solve large-scale CP problems by using fast first-order methods. In particular, Nesterov in an important work [Nest05-1] shows that by using the AG method and a novel smoothing scheme, one can improve the iteration complexity for solving a broad class of saddle-point problems from O(1/ε²) to O(1/ε). The AG method has also been generalized by Nesterov [Nest07-1], Beck and Teboulle [BecTeb09-2], and Tseng [tseng08-1] to solve an emerging class of composite CP problems whose objective function is given by the summation of a smooth component and another relatively simple nonsmooth component (e.g., the ℓ1 norm). Lan [Lan10-3] further shows that the AG method, when employed with proper stepsize policies, is optimal for solving not only smooth CP problems, but also general (not necessarily simple) nonsmooth and stochastic CP problems. More recently, some key elements of the AG method, e.g., the multi-step acceleration scheme, have been adapted to significantly improve the convergence properties of a few other first-order methods (e.g., level methods [Lan13-2]). However, to the best of our knowledge, all the aforementioned developments explicitly require the convexity assumption on Ψ. Otherwise, if Ψ in (1.1) is not necessarily convex, it is unclear whether the AG method still converges or not.

This paper aims to generalize the AG method, originally designed for smooth convex optimization, to solve more general nonlinear programming (NLP) (possibly nonconvex and stochastic) problems, and thus to present a unified treatment and analysis for convex, nonconvex and stochastic optimization. While this paper focuses on the theoretical development of the AG method, our study has also been motivated by the following more practical considerations in solving nonlinear programming problems. First, many general nonlinear objective functions are locally convex. A unified treatment for both convex and nonconvex problems will help us to make use of such local convexity. In particular, we intend to understand whether one can apply the well-known aggressive stepsize policy in the AG method under a more general setting to benefit from such local convexity. Second, many nonlinear objective functions arising from sparse optimization (e.g., [ChGeWangYe12-1, FengMitPangShenW13-1]) and machine learning (e.g., [FanLi01-1, Mairal09]) consist of both convex and nonconvex components, corresponding to the data fidelity and sparsity regularization terms respectively. One interesting question is whether one can design more efficient algorithms for solving these nonconvex composite problems by utilizing their convexity structure. Third, the convexity of some objective functions represented by a black-box procedure is usually unknown, e.g., in simulation-based optimization [Andr98-1, Fu02-1, AsmGlynn00, Law07]. A unified treatment and analysis can thus help us to deal with such structural ambiguity. Fourth, in some cases, the objective functions are nonconvex with respect to (w.r.t.) a few decision variables jointly, but convex w.r.t. each one of them separately. Many machine learning/image processing problems are given in this form (e.g., [Mairal09]).
Current practice is to first run an NLP solver to find a stationary point, and then a CP solver after one variable (e.g., the dictionary in [Mairal09]) is fixed. A more powerful, unified treatment for both convex and nonconvex problems is desirable to better handle these types of problems.

Our contribution mainly lies in the following three aspects. First, we consider the classic NLP problem given in the form of (1.1), where Ψ is a smooth (possibly nonconvex) function whose gradient satisfies (1.2). In addition, we assume that Ψ is bounded from below. We demonstrate that the AG method, when employed with a certain stepsize policy, can find an ε-solution of (1.1), i.e., a point x̄ such that ‖∇Ψ(x̄)‖² ≤ ε, in at most O(L_Ψ/ε) iterations, which is the best-known complexity bound possessed by first-order methods to solve general NLP problems (e.g., the gradient descent method [Nest04, CarGouToi10-1] and the trust region method [GraSarToi08]). Note that if Ψ is convex and a more aggressive stepsize policy is applied in the AG method, then the aforementioned complexity bound can be improved to O(1/ε^{1/3}).

Second, we consider a class of composite problems (see, e.g., Lewis and Wright [LewWri09-1], Chen et al. [ChGeWangYe12-1]) given by

 min_{x ∈ R^n} Ψ(x) + X(x),  Ψ(x) := f(x) + h(x), (1.3)

where f is a smooth function with Lipschitz continuous gradient (with constant L_f) which is possibly nonconvex, h is a smooth convex function with Lipschitz continuous gradient (with constant L_h), and X is a simple convex (possibly non-smooth) function with bounded domain (e.g., X(x) = I_X(x) with I_X being the indicator function of a convex compact set X). Clearly, Ψ satisfies (1.2) with L_Ψ = L_f + L_h. Since X is possibly non-differentiable, we need to employ a different termination criterion based on the gradient mapping (see (2.38)) to analyze the complexity of the AG method. Observe, however, that if X(x) = 0, then the gradient mapping reduces to the gradient ∇Ψ(x) for any c > 0. We show that the same aggressive stepsize policy as the AG method uses for convex problems can be applied for solving problem (1.3), no matter whether Ψ is convex or not. More specifically, the AG method exhibits an optimal rate of convergence in terms of the functional optimality gap if Ψ turns out to be convex. In addition, we show that one can find a solution x̄ s.t. the squared norm of the gradient mapping at x̄ is at most ε in at most

 O( (L_Ψ²/ε)^{1/3} + L_Ψ L_f/ε )

iterations. The above complexity bound improves the one established in [GhaLanZhang13-1] for the projected gradient method applied to problem (1.3) in terms of the dependence on the Lipschitz constant L_f. In addition, it is significantly better than the latter bound when L_f is small enough (see Section 2.2 for more details).

Third, we consider stochastic NLP problems in the form of (1.1) or (1.3), where only noisy first-order information about Ψ is available via subsequent calls to a stochastic oracle (SO). More specifically, at the k-th call, x_k ∈ R^n being the input, the SO outputs a stochastic gradient G(x_k, ξ_k), where {ξ_k}_{k≥1} are random vectors whose distributions are supported on Ξ_k ⊆ R^d. The following assumptions are also made for the stochastic gradient G(x_k, ξ_k).

###### Assumption 1

For any x ∈ R^n and k ≥ 1, we have

 a) E[G(x, ξ_k)] = ∇Ψ(x), (1.4)
 b) E[‖G(x, ξ_k) − ∇Ψ(x)‖²] ≤ σ². (1.5)

Currently, the randomized stochastic gradient (RSG) method initially studied by Ghadimi and Lan [GhaLan12] and later improved in [GhaLanZhang13-1, DangLan13-1] seems to be the only available stochastic approximation (SA) algorithm for solving the aforementioned general stochastic NLP problems, while other SA methods (see, e.g., [RobMon51-1, NJLS09-1, Spall03, pol92, Lan10-3, GhaLan12, GhaLan10-1b]) require the convexity assumption on Ψ. However, the RSG method and its variants are only nearly optimal for solving convex SP problems. Based on the AG method, we present a randomized stochastic AG (RSAG) method for solving general stochastic NLP problems and show that if Ψ is convex, then the RSAG method exhibits an optimal rate of convergence in terms of the functional optimality gap, similarly to the accelerated SA method in [Lan10-3]. In this case, the complexity bound in (1.6) in terms of the residual of gradients can be improved to

 O( L_Ψ^{2/3}/ε^{1/3} + L_Ψ^{2/3} σ²/ε^{4/3} ).

Moreover, if Ψ is nonconvex, then the RSAG method can find an ε-solution of (1.1), i.e., a point x̄ s.t. E[‖∇Ψ(x̄)‖²] ≤ ε, in at most

 O( L_Ψ/ε + L_Ψ σ²/ε² ) (1.6)

calls to the SO. We also generalize these complexity analyses to a class of nonconvex stochastic composite optimization problems by introducing a mini-batch approach into the RSAG method, and improve a few complexity results presented in [GhaLanZhang13-1] for solving these stochastic composite optimization problems.
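To make conditions (1.4) and (1.5) concrete, the following sketch builds a hypothetical stochastic oracle by adding zero-mean Gaussian noise to an exact gradient; the noise model, the function names, and the mini-batch helper are our illustrative assumptions, not part of the RSAG method itself.

```python
import numpy as np

rng = np.random.default_rng(0)

def stochastic_grad(grad, x, sigma):
    """Hypothetical SO: E[G(x, xi)] = grad(x), matching (1.4), and
    E||G(x, xi) - grad(x)||^2 = sigma^2, matching the bound in (1.5)."""
    # Per-coordinate std sigma/sqrt(n) makes the total noise variance sigma^2.
    noise = rng.normal(0.0, sigma / np.sqrt(x.size), size=x.shape)
    return grad(x) + noise

def minibatch_grad(grad, x, sigma, m):
    """Averaging m independent oracle calls reduces the variance bound
    in (1.5) to sigma^2 / m -- the mini-batch idea used for RSAG."""
    return np.mean([stochastic_grad(grad, x, sigma) for _ in range(m)], axis=0)
```

Averaging oracle calls trades more gradient samples per iteration for a smaller effective σ², which is exactly the knob exploited in the stochastic composite analysis of Section 3.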

This paper is organized as follows. In Section 2, we present the AG algorithm and establish its convergence properties for solving problems (1.1) and (1.3). We then generalize the AG method for solving stochastic nonlinear and composite optimization problems in Section 3. Some brief concluding remarks are given in Section 4.

## 2 The accelerated gradient algorithm

Our goal in this section is to show that the AG method, which was originally designed for smooth convex optimization, also converges for solving nonconvex optimization problems after incorporating some proper modifications. More specifically, we first present an AG method for solving a general class of nonlinear optimization problems in Subsection 2.1, and then describe the AG method for solving a special class of nonconvex composite optimization problems in Subsection 2.2.

### 2.1 Minimization of smooth functions

In this subsection, we assume that Ψ is a differentiable nonconvex function, bounded from below, and that its gradient satisfies the Lipschitz condition in (1.2). It then follows that (see, e.g., [Nest04])

 |Ψ(y) − Ψ(x) − ⟨∇Ψ(x), y − x⟩| ≤ (L_Ψ/2) ‖y − x‖²   ∀x, y ∈ R^n. (2.1)

While the gradient descent method converges for the above class of nonconvex optimization problems, it does not achieve the optimal rate of convergence, in terms of the functional optimality gap, when Ψ is convex. On the other hand, the original AG method in [Nest83-1] is optimal for solving convex optimization problems, but does not necessarily converge for solving nonconvex optimization problems. Below, we present a modified AG method and show that by properly specifying the stepsize policy, it not only achieves the optimal rate of convergence for convex optimization, but also exhibits the best-known rate of convergence shown in [Nest04, CarGouToi10-1] for solving general smooth NLP problems by using first-order methods.

###### Algorithm 1 (the AG method)

Input: x_0 ∈ R^n, {α_k} ⊂ (0, 1], {β_k > 0}, and {λ_k > 0}. Set x^ag_0 = x_0 and k = 1.

Step 1. Set

 x^md_k = (1 − α_k) x^ag_{k−1} + α_k x_{k−1}. (2.2)

Step 2. Compute ∇Ψ(x^md_k) and set

 x_k = x_{k−1} − λ_k ∇Ψ(x^md_k), (2.3)
 x^ag_k = x^md_k − β_k ∇Ψ(x^md_k). (2.4)

Step 3. Set k ← k + 1 and go to Step 1.

Note that, if Ψ is convex and the stepsizes are chosen properly (e.g., as in part b) of Corollary 1 below), then the above AG method is equivalent to one of the simplest variants of the well-known Nesterov’s method (see, e.g., [Nest04]). On the other hand, if λ_k = β_k for all k, then it can be shown by induction that x^md_k = x_{k−1} and x^ag_k = x_k. In this case, Algorithm 1 reduces to the gradient descent method. We will show in this subsection that the above AG method actually converges for different selections of {α_k}, {β_k}, and {λ_k} in both the convex and nonconvex cases.
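As a concrete illustration, the iteration x^md_k = (1−α_k)x^ag_{k−1} + α_k x_{k−1}, x_k = x_{k−1} − λ_k∇Ψ(x^md_k), x^ag_k = x^md_k − β_k∇Ψ(x^md_k), run with the conservative stepsizes α_k = 2/(k+1) and β_k = λ_k = 1/(2L_Ψ) of Corollary 1.a) below, can be sketched as follows; the NumPy implementation and the test function are our own illustrative assumptions, not part of the paper.

```python
import numpy as np

def ag_method(grad, x0, L, N):
    """Minimal sketch of the AG iteration (2.2)-(2.4) with the nonconvex
    stepsize policy alpha_k = 2/(k+1), beta_k = lambda_k = 1/(2L)."""
    x = np.asarray(x0, dtype=float)   # x_k
    x_ag = x.copy()                   # x_k^{ag}, initialized as x_0^{ag} = x_0
    best_grad_norm = np.inf
    for k in range(1, N + 1):
        alpha = 2.0 / (k + 1)
        beta = 1.0 / (2.0 * L)
        lam = beta                     # any value in [beta, (1 + alpha/4) beta]
        x_md = (1 - alpha) * x_ag + alpha * x   # (2.2): middle point
        g = grad(x_md)
        x = x - lam * g                          # (2.3)
        x_ag = x_md - beta * g                   # (2.4)
        best_grad_norm = min(best_grad_norm, float(np.linalg.norm(g)))
    return x_ag, best_grad_norm
```

On the quadratic Ψ(x) = ‖x‖²/2 (so L_Ψ = 1 and Ψ* = 0), the smallest observed gradient norm decays in line with the O(1/N) bound on min_k ‖∇Ψ(x^md_k)‖² established below.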

To establish the convergence of the above AG method, we need the following simple technical result (see Lemma 3 of [Lan13-2] for a slightly more general result).

###### Lemma 1

Let {α_k} ⊂ (0, 1] with α_1 = 1 be the stepsizes in the AG method, and suppose that the sequence {θ_k} satisfies

 θ_k ≤ (1 − α_k) θ_{k−1} + η_k,  k = 1, 2, …. (2.5)

Define

 Γ_k := { 1, k = 1; (1 − α_k) Γ_{k−1}, k ≥ 2. (2.6)

Then we have θ_k ≤ Γ_k Σ_{τ=1}^k (η_τ/Γ_τ) for any k ≥ 1.

###### Proof

Noting that α_1 = 1 and α_k ∈ (0, 1) for any k ≥ 2, these observations together with (2.6) imply that Γ_k > 0 for any k ≥ 1. Dividing both sides of (2.5) by Γ_k, we obtain

 θ_1/Γ_1 ≤ (1 − α_1) θ_0/Γ_1 + η_1/Γ_1 = η_1/Γ_1

and

 θ_i/Γ_i ≤ (1 − α_i) θ_{i−1}/Γ_i + η_i/Γ_i = θ_{i−1}/Γ_{i−1} + η_i/Γ_i,   ∀i ≥ 2.

The result then immediately follows by summing up the above inequalities and rearranging the terms.
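Lemma 1 is also easy to verify numerically; the following sketch (with randomly generated sequences, an illustrative assumption of ours) checks the bound θ_k ≤ Γ_k Σ_{τ≤k} η_τ/Γ_τ along a trajectory satisfying (2.5) with α_1 = 1.

```python
import numpy as np

rng = np.random.default_rng(1)

def check_lemma1(N=50):
    """Check Lemma 1: theta_k <= (1 - alpha_k) theta_{k-1} + eta_k with
    alpha_1 = 1 implies theta_k <= Gamma_k * sum_{i<=k} eta_i / Gamma_i,
    where Gamma_1 = 1 and Gamma_k = (1 - alpha_k) Gamma_{k-1} for k >= 2."""
    alpha = np.concatenate(([1.0], rng.uniform(0.1, 0.9, N - 1)))
    eta = rng.uniform(0.0, 1.0, N)
    Gamma = np.empty(N)
    theta = np.empty(N)
    prev_theta, prev_Gamma = rng.uniform(0.0, 1.0), 1.0
    for k in range(N):
        Gamma[k] = 1.0 if k == 0 else (1.0 - alpha[k]) * prev_Gamma
        # satisfy the recursion (2.5) with some nonnegative slack
        theta[k] = (1.0 - alpha[k]) * prev_theta + eta[k] - rng.uniform(0.0, 0.1)
        bound = Gamma[k] * np.sum(eta[: k + 1] / Gamma[: k + 1])
        assert theta[k] <= bound + 1e-9
        prev_theta, prev_Gamma = theta[k], Gamma[k]
    return True
```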

We are now ready to describe the main convergence properties of the AG method.

###### Theorem 2.1

Let {x^md_k, x^ag_k} be computed by Algorithm 1 and Γ_k be defined in (2.6).

• If {α_k}, {β_k}, and {λ_k} are chosen such that

 C_k := 1 − L_Ψ λ_k − [L_Ψ (λ_k − β_k)²/(2 α_k Γ_k λ_k)] (Σ_{τ=k}^N Γ_τ) > 0, (2.7)

then for any N ≥ 1, we have

 min_{k=1,…,N} ‖∇Ψ(x^md_k)‖² ≤ [Ψ(x_0) − Ψ*] / (Σ_{k=1}^N λ_k C_k). (2.8)
• Suppose that Ψ is convex and that an optimal solution x* exists for problem (1.1). If {α_k}, {β_k}, and {λ_k} are chosen such that

 α_k λ_k ≤ β_k < 1/L_Ψ, (2.9)
 α_1/(λ_1 Γ_1) ≥ α_2/(λ_2 Γ_2) ≥ …, (2.10)

then for any N ≥ 1, we have

 min_{k=1,…,N} ‖∇Ψ(x^md_k)‖² ≤ ‖x_0 − x*‖² / (λ_1 Σ_{k=1}^N Γ_k^{−1} β_k (1 − L_Ψ β_k)), (2.11)
 Ψ(x^ag_N) − Ψ(x*) ≤ Γ_N ‖x_0 − x*‖² / (2 λ_1). (2.12)
###### Proof

We first show part a). Denote Δ_k := ∇Ψ(x_{k−1}) − ∇Ψ(x^md_k). By (1.2) and (2.2), we have

 ‖Δ_k‖ = ‖∇Ψ(x_{k−1}) − ∇Ψ(x^md_k)‖ ≤ L_Ψ ‖x_{k−1} − x^md_k‖ = L_Ψ (1 − α_k) ‖x^ag_{k−1} − x_{k−1}‖. (2.13)

Also by (2.1) and (2.3), we have

 Ψ(x_k) ≤ Ψ(x_{k−1}) + ⟨∇Ψ(x_{k−1}), x_k − x_{k−1}⟩ + (L_Ψ/2) ‖x_k − x_{k−1}‖²
  = Ψ(x_{k−1}) + ⟨Δ_k + ∇Ψ(x^md_k), −λ_k ∇Ψ(x^md_k)⟩ + (L_Ψ λ_k²/2) ‖∇Ψ(x^md_k)‖²
  = Ψ(x_{k−1}) − λ_k (1 − L_Ψ λ_k/2) ‖∇Ψ(x^md_k)‖² − λ_k ⟨Δ_k, ∇Ψ(x^md_k)⟩
  ≤ Ψ(x_{k−1}) − λ_k (1 − L_Ψ λ_k/2) ‖∇Ψ(x^md_k)‖² + λ_k ‖Δ_k‖ ⋅ ‖∇Ψ(x^md_k)‖, (2.14)

where the last inequality follows from the Cauchy-Schwarz inequality. Combining the previous two inequalities, we obtain

 Ψ(x_k) ≤ Ψ(x_{k−1}) − λ_k (1 − L_Ψ λ_k/2) ‖∇Ψ(x^md_k)‖² + L_Ψ (1 − α_k) λ_k ‖∇Ψ(x^md_k)‖ ⋅ ‖x^ag_{k−1} − x_{k−1}‖
  ≤ Ψ(x_{k−1}) − λ_k (1 − L_Ψ λ_k/2) ‖∇Ψ(x^md_k)‖² + (L_Ψ λ_k²/2) ‖∇Ψ(x^md_k)‖² + [L_Ψ (1 − α_k)²/2] ‖x^ag_{k−1} − x_{k−1}‖²
  = Ψ(x_{k−1}) − λ_k (1 − L_Ψ λ_k) ‖∇Ψ(x^md_k)‖² + [L_Ψ (1 − α_k)²/2] ‖x^ag_{k−1} − x_{k−1}‖², (2.15)

where the second inequality follows from the fact that ab ≤ (a² + b²)/2. Now, by (2.2), (2.3), and (2.4), we have

 x^ag_k − x_k = (1 − α_k) x^ag_{k−1} + α_k x_{k−1} − β_k ∇Ψ(x^md_k) − [x_{k−1} − λ_k ∇Ψ(x^md_k)]
  = (1 − α_k)(x^ag_{k−1} − x_{k−1}) + (λ_k − β_k) ∇Ψ(x^md_k),

which, in view of Lemma 1, implies that

 x^ag_k − x_k = Γ_k Σ_{τ=1}^k [(λ_τ − β_τ)/Γ_τ] ∇Ψ(x^md_τ).

Using the above identity, Jensen’s inequality applied to ‖⋅‖², and the fact that

 Σ_{τ=1}^k (α_τ/Γ_τ) = α_1/Γ_1 + Σ_{τ=2}^k (1/Γ_τ)(1 − Γ_τ/Γ_{τ−1}) = 1/Γ_1 + Σ_{τ=2}^k (1/Γ_τ − 1/Γ_{τ−1}) = 1/Γ_k, (2.16)

we have

 ‖x^ag_k − x_k‖² = ‖Γ_k Σ_{τ=1}^k (α_τ/Γ_τ) [(λ_τ − β_τ)/α_τ] ∇Ψ(x^md_τ)‖²
  ≤ Γ_k Σ_{τ=1}^k (α_τ/Γ_τ) ‖[(λ_τ − β_τ)/α_τ] ∇Ψ(x^md_τ)‖² = Γ_k Σ_{τ=1}^k [(λ_τ − β_τ)²/(Γ_τ α_τ)] ‖∇Ψ(x^md_τ)‖², (2.17)

Substituting the above bound into (2.15), we obtain

 Ψ(x_k) ≤ Ψ(x_{k−1}) − λ_k (1 − L_Ψ λ_k) ‖∇Ψ(x^md_k)‖² + [L_Ψ Γ_{k−1} (1 − α_k)²/2] Σ_{τ=1}^{k−1} [(λ_τ − β_τ)²/(Γ_τ α_τ)] ‖∇Ψ(x^md_τ)‖²
  ≤ Ψ(x_{k−1}) − λ_k (1 − L_Ψ λ_k) ‖∇Ψ(x^md_k)‖² + (L_Ψ Γ_k/2) Σ_{τ=1}^k [(λ_τ − β_τ)²/(Γ_τ α_τ)] ‖∇Ψ(x^md_τ)‖² (2.18)

for any k ≥ 1, where the last inequality follows from the definition of Γ_k in (2.6) and the fact that (1 − α_k)² ≤ 1 − α_k for α_k ∈ (0, 1]. Summing up the above inequalities and using the definition of C_k in (2.7), we have

 Ψ(x_N) ≤ Ψ(x_0) − Σ_{k=1}^N λ_k (1 − L_Ψ λ_k) ‖∇Ψ(x^md_k)‖² + (L_Ψ/2) Σ_{k=1}^N Γ_k Σ_{τ=1}^k [(λ_τ − β_τ)²/(Γ_τ α_τ)] ‖∇Ψ(x^md_τ)‖²
  = Ψ(x_0) − Σ_{k=1}^N λ_k (1 − L_Ψ λ_k) ‖∇Ψ(x^md_k)‖² + (L_Ψ/2) Σ_{k=1}^N [(λ_k − β_k)²/(Γ_k α_k)] (Σ_{τ=k}^N Γ_τ) ‖∇Ψ(x^md_k)‖²
  = Ψ(x_0) − Σ_{k=1}^N λ_k C_k ‖∇Ψ(x^md_k)‖². (2.19)

Re-arranging the terms in the above inequality and noting that Ψ(x_N) ≥ Ψ*, we obtain

 min_{k=1,…,N} ‖∇Ψ(x^md_k)‖² (Σ_{k=1}^N λ_k C_k) ≤ Σ_{k=1}^N λ_k C_k ‖∇Ψ(x^md_k)‖² ≤ Ψ(x_0) − Ψ*,

which, in view of the assumption that C_k > 0, clearly implies (2.8).

We now show part b). First, note that by (2.1) and (2.4), we have

 Ψ(x^ag_k) ≤ Ψ(x^md_k) + ⟨∇Ψ(x^md_k), x^ag_k − x^md_k⟩ + (L_Ψ/2) ‖x^ag_k − x^md_k‖²
  = Ψ(x^md_k) − β_k ‖∇Ψ(x^md_k)‖² + (L_Ψ β_k²/2) ‖∇Ψ(x^md_k)‖². (2.20)

Also, by the convexity of Ψ and (2.2),

 Ψ(x^md_k) − [(1 − α_k) Ψ(x^ag_{k−1}) + α_k Ψ(x)]
 = α_k [Ψ(x^md_k) − Ψ(x)] + (1 − α_k) [Ψ(x^md_k) − Ψ(x^ag_{k−1})]
 ≤ α_k ⟨∇Ψ(x^md_k), x^md_k − x⟩ + (1 − α_k) ⟨∇Ψ(x^md_k), x^md_k − x^ag_{k−1}⟩
 = ⟨∇Ψ(x^md_k), α_k (x^md_k − x) + (1 − α_k)(x^md_k − x^ag_{k−1})⟩
 = α_k ⟨∇Ψ(x^md_k), x_{k−1} − x⟩. (2.21)

It also follows from (2.3) that

 ‖x_{k−1} − x‖² − 2 λ_k ⟨∇Ψ(x^md_k), x_{k−1} − x⟩ + λ_k² ‖∇Ψ(x^md_k)‖²
 = ‖x_{k−1} − λ_k ∇Ψ(x^md_k) − x‖² = ‖x_k − x‖²,

and hence that

 α_k ⟨∇Ψ(x^md_k), x_{k−1} − x⟩ = [α_k/(2 λ_k)] [‖x_{k−1} − x‖² − ‖x_k − x‖²] + (α_k λ_k/2) ‖∇Ψ(x^md_k)‖². (2.22)

Combining (2.20), (2.21), and (2.22), we obtain

 Ψ(x^ag_k) ≤ (1 − α_k) Ψ(x^ag_{k−1}) + α_k Ψ(x) + [α_k/(2 λ_k)] [‖x_{k−1} − x‖² − ‖x_k − x‖²] − β_k (1 − L_Ψ β_k/2 − α_k λ_k/(2 β_k)) ‖∇Ψ(x^md_k)‖²
  ≤ (1 − α_k) Ψ(x^ag_{k−1}) + α_k Ψ(x) + [α_k/(2 λ_k)] [‖x_{k−1} − x‖² − ‖x_k − x‖²] − (β_k/2)(1 − L_Ψ β_k) ‖∇Ψ(x^md_k)‖², (2.23)

where the last inequality follows from the assumption in (2.9). Subtracting Ψ(x) from both sides of the above inequality and using Lemma 1, we conclude that

 [Ψ(x^ag_N) − Ψ(x)]/Γ_N ≤ Σ_{k=1}^N [α_k/(2 λ_k Γ_k)] [‖x_{k−1} − x‖² − ‖x_k − x‖²] − Σ_{k=1}^N [β_k/(2 Γ_k)] (1 − L_Ψ β_k) ‖∇Ψ(x^md_k)‖²
  ≤ ‖x_0 − x‖²/(2 λ_1) − Σ_{k=1}^N [β_k/(2 Γ_k)] (1 − L_Ψ β_k) ‖∇Ψ(x^md_k)‖²  ∀x ∈ R^n, (2.24)

where the second inequality follows from the simple relation that

 Σ_{k=1}^N [α_k/(λ_k Γ_k)] [‖x_{k−1} − x‖² − ‖x_k − x‖²] ≤ [α_1/(λ_1 Γ_1)] ‖x_0 − x‖² = ‖x_0 − x‖²/λ_1 (2.25)

due to (2.10) and the fact that α_1 = Γ_1 = 1. Hence, (2.12) immediately follows from the above inequality and the assumption in (2.9). Moreover, fixing x = x*, re-arranging the terms in (2.24), and noting the fact that Ψ(x^ag_N) ≥ Ψ(x*), we obtain

 min_{k=1,…,N} ‖∇Ψ(x^md_k)‖² Σ_{k=1}^N [β_k/(2 Γ_k)] (1 − L_Ψ β_k) ≤ Σ_{k=1}^N [β_k/(2 Γ_k)] (1 − L_Ψ β_k) ‖∇Ψ(x^md_k)‖² ≤ ‖x* − x_0‖²/(2 λ_1),

which, together with (2.9), clearly implies (2.11).

We add a few observations about Theorem 2.1. First, in view of (2.23), it is possible to use a different assumption than the one in (2.9) on the stepsize policies for the convex case. In particular, we only need

 2 − L_Ψ β_k − α_k λ_k/β_k > 0 (2.26)

to show the convergence of the AG method for minimizing smooth convex problems. However, since the condition given by (2.9) is required for minimizing composite problems in Subsections 2.2 and 3.2, we state this assumption for the sake of simplicity. Second, there are various options for selecting {α_k}, {β_k}, and {λ_k} to guarantee the convergence of the AG algorithm. Below we provide some of these selections for solving both convex and nonconvex problems.

###### Corollary 1

Suppose that {α_k} and {β_k} in the AG method are set to

 α_k = 2/(k+1)   and   β_k = 1/(2 L_Ψ). (2.27)
• If {λ_k} satisfies

 λ_k ∈ [β_k, (1 + α_k/4) β_k]  ∀k ≥ 1, (2.28)

then for any N ≥ 1, we have

 min_{k=1,…,N} ‖∇Ψ(x^md_k)‖² ≤ 6 L_Ψ [Ψ(x_0) − Ψ*]/N. (2.29)
• Assume that Ψ is convex and that an optimal solution x* exists for problem (1.1). If {λ_k} satisfies

 λ_k = k β_k/2  ∀k ≥ 1, (2.30)

then for any N ≥ 1, we have

 min_{k=1,…,N} ‖∇Ψ(x^md_k)‖² ≤ 96 L_Ψ² ‖x_0 − x*‖²/(N²(N+1)), (2.31)
 Ψ(x^ag_N) − Ψ(x*) ≤ 4 L_Ψ ‖x_0 − x*‖²/(N(N+1)). (2.32)
###### Proof

We first show part a). Note that by (2.6) and (2.27), we have

 Γ_k = 2/(k(k+1)), (2.33)

which implies that

 Σ_{τ=k}^N Γ_τ = Σ_{τ=k}^N 2/(τ(τ+1)) = 2 Σ_{τ=k}^N (1/τ − 1/(τ+1)) ≤ 2/k. (2.34)

It can also be easily seen from (2.28) that 0 ≤ λ_k − β_k ≤ α_k β_k/4. Using these observations, (2.27), and (2.28), we have

 C_k = 1 − L_Ψ [λ_k + ((λ_k − β_k)²/(2 α_k Γ_k λ_k)) (Σ_{τ=k}^N Γ_τ)]
  ≥ 1 − L_Ψ [(1 + α_k/4) β_k + (α_k² β_k²/16) ⋅ (1/(k α_k Γ_k β_k))]
  = 1 − β_k L_Ψ (1 + α_k/4 + 1/16)
  ≥ 1 − β_k L_Ψ ⋅ 21/16 = 11/32, (2.35)
 λ_k C_k ≥ 11 β_k/32 = 11/(64 L_Ψ) ≥ 1/(6 L_Ψ).

Combining the above relation with (2.8), we obtain (2.29).

We now show part b). Observe that by (2.27) and (2.30), we have

 α_k λ_k = [k/(k+1)] β_k < β_k,
 α_1/(λ_1 Γ_1) = α_2/(λ_2 Γ_2) = … = 4 L_Ψ,

which implies that conditions (2.9) and (2.10) hold. Moreover, we have

 Σ_{k=1}^N Γ_k^{−1} β_k (1 − L_Ψ β_k) = [1/(4 L_Ψ)] Σ_{k=1}^N Γ_k^{−1} ≥ [1/(8 L_Ψ)] Σ_{k=1}^N k² = N(N+1)(2N+1)/(48 L_Ψ) ≥ N²(N+1)/(24 L_Ψ). (2.36)

Using (2.33) and the above bounds in (2.11) and (2.12), we obtain (2.31) and (2.32).

We now add a few remarks about the results obtained in Corollary 1. First, the rate of convergence in (2.29) for the AG method is in the same order of magnitude as that for the gradient descent method [Nest04]. It is also worth noting that by choosing λ_k = β_k in (2.28), the rate of convergence for the AG method only changes up to a constant factor. However, in this case, the AG method reduces to the gradient descent method, as mentioned earlier in this subsection. Second, if the problem is convex, by choosing the more aggressive stepsize {λ_k} in (2.30), the AG method exhibits the optimal rate of convergence in (2.32). Moreover, with such a selection of {λ_k}, the AG method can find a solution x̄ such that ‖∇Ψ(x̄)‖² ≤ ε in at most O(1/ε^{1/3}) iterations according to (2.31). The latter result has also been established in (MonSva11-1, Proposition 5.2) for an accelerated hybrid proximal extragradient method when applied to convex problems.
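To illustrate the effect of the aggressive policy (2.30), the following sketch (the NumPy implementation and the test problem are our own illustrative assumptions) runs the iteration (2.2)-(2.4) with α_k = 2/(k+1), β_k = 1/(2L_Ψ), and λ_k = kβ_k/2 on a convex quadratic.

```python
import numpy as np

def ag_convex(grad, x0, L, N):
    """Sketch of the AG iteration with the aggressive convex stepsizes of
    Corollary 1.b): alpha_k = 2/(k+1), beta_k = 1/(2L), lambda_k = k*beta_k/2."""
    x = np.asarray(x0, dtype=float)
    x_ag = x.copy()
    for k in range(1, N + 1):
        alpha = 2.0 / (k + 1)
        beta = 1.0 / (2.0 * L)
        lam = k * beta / 2.0                    # note: grows linearly in k
        x_md = (1 - alpha) * x_ag + alpha * x   # (2.2)
        g = grad(x_md)
        x = x - lam * g                          # (2.3)
        x_ag = x_md - beta * g                   # (2.4)
    return x_ag
```

For Ψ(x) = ‖x‖²/2 (so L_Ψ = 1 and Ψ* = 0), the bound (2.32) guarantees Ψ(x^ag_N) ≤ 4‖x_0‖²/(N(N+1)), i.e., an O(1/N²) optimality gap, in contrast to the O(1/N) gap of gradient descent.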

Observe that the stepsize {λ_k} in (2.28) for general nonconvex problems is in the order of 1/L_Ψ, while the one in (2.30) for convex problems is more aggressive (in the order of k/L_Ψ). An interesting question is whether we can apply the same stepsize policy in (2.30) for solving general NLP problems, regardless of whether they are convex or not. We will discuss such a uniform treatment for both convex and nonconvex optimization for a certain class of composite problems in the next subsection.

### 2.2 Minimization of nonconvex composite functions

In this subsection, we consider a special class of NLP problems given in the form of (1.3). Our goal here is to show that we can employ a more aggressive stepsize policy in the AG method, similar to the one used in the convex case (see Theorem 2.1.b) and Corollary 1.b)), to solve these composite problems, even if Ψ is possibly nonconvex.

Throughout this subsection, we make the following assumption about the convex (possibly non-differentiable) component X(⋅) in (1.3).

###### Assumption 2

There exists a constant M such that ‖P(x, y, c)‖ ≤ M for any c ∈ (0, +∞) and x, y ∈ R^n, where P(x, y, c) is given by

 P(x, y, c) := argmin_{u ∈ R^n} { ⟨y, u⟩ + (1/(2c)) ‖u − x‖² + X(u) }