Accelerated Gradient Methods for Nonconvex Nonlinear and Stochastic Programming

This research was partially supported by NSF grants CMMI-1000347, CMMI-1254446, DMS-1319050, and ONR grant N00014-13-1-0036.



In this paper, we generalize the well-known Nesterov’s accelerated gradient (AG) method, originally designed for convex smooth optimization, to solve nonconvex and possibly stochastic optimization problems. We demonstrate that by properly specifying the stepsize policy, the AG method exhibits the best known rate of convergence for solving general nonconvex smooth optimization problems by using first-order information, similarly to the gradient descent method. We then consider an important class of composite optimization problems and show that the AG method can solve them uniformly, i.e., by using the same aggressive stepsize policy as in the convex case, even if the problem turns out to be nonconvex. We demonstrate that the AG method exhibits an optimal rate of convergence if the composite problem is convex, and improves the best known rate of convergence if the problem is nonconvex. Based on the AG method, we also present new nonconvex stochastic approximation methods and show that they can improve a few existing rates of convergence for nonconvex stochastic optimization. To the best of our knowledge, this is the first time that the convergence of the AG method has been established for solving nonconvex nonlinear programming in the literature.

Keywords: nonconvex optimization, stochastic programming, accelerated gradient, complexity

AMS 2000 subject classification: 62L20, 90C25, 90C15, 68Q25

1 Introduction

In 1983, Nesterov in a celebrated work Nest83-1 () presented the accelerated gradient (AG) method for solving a class of convex programming (CP) problems given by

$f^* := \min_{x \in \mathbb{R}^n} f(x). \qquad (1.1)$
Here $f$ is a convex function with Lipschitz continuous gradient, i.e., there exists $L > 0$ such that (s.t.)

$\|\nabla f(y) - \nabla f(x)\| \le L \|y - x\| \quad \text{for any } x, y \in \mathbb{R}^n. \qquad (1.2)$
Nesterov shows that the number of iterations performed by this algorithm to find a solution $\bar{x}$ s.t. $f(\bar{x}) - f^* \le \epsilon$ can be bounded by $\mathcal{O}(1/\sqrt{\epsilon})$, which significantly improves the $\mathcal{O}(1/\epsilon)$ complexity bound possessed by the gradient descent method. Moreover, in view of the classic complexity theory for convex optimization by Nemirovski and Yudin nemyud:83 (), the above iteration complexity bound is not improvable for smooth convex optimization when the dimension $n$ is sufficiently large.

Nesterov’s AG method has attracted much interest recently due to the increasing need to solve large-scale CP problems by using fast first-order methods. In particular, Nesterov in an important work Nest05-1 () shows that by using the AG method and a novel smoothing scheme, one can improve the complexity for solving a broad class of saddle-point problems from $\mathcal{O}(1/\epsilon^2)$ to $\mathcal{O}(1/\epsilon)$. The AG method has also been generalized by Nesterov Nest07-1 (), Beck and Teboulle BecTeb09-2 (), and Tseng tseng08-1 () to solve an emerging class of composite CP problems whose objective function is given by the summation of a smooth component and another relatively simple nonsmooth component (e.g., the $\ell_1$ norm). Lan Lan10-3 () further shows that the AG method, when employed with proper stepsize policies, is optimal for solving not only smooth CP problems, but also general (not necessarily simple) nonsmooth and stochastic CP problems. More recently, some key elements of the AG method, e.g., the multi-step acceleration scheme, have been adapted to significantly improve the convergence properties of a few other first-order methods (e.g., level methods Lan13-2 ()). However, to the best of our knowledge, all the aforementioned developments explicitly require the convexity assumption about the objective function. Otherwise, if $f$ in (1.1) is not necessarily convex, it is unclear whether the AG method still converges.

This paper aims to generalize the AG method, originally designed for smooth convex optimization, to solve more general nonlinear programming (NLP) (possibly nonconvex and stochastic) problems, and thus to present a unified treatment and analysis for convex, nonconvex and stochastic optimization. While this paper focuses on the theoretical development of the AG method, our study has also been motivated by the following more practical considerations in solving nonlinear programming problems. First, many general nonlinear objective functions are locally convex. A unified treatment for both convex and nonconvex problems will help us to make use of such local convexity properties. In particular, we intend to understand whether one can apply the well-known aggressive stepsize policy in the AG method under a more general setting to benefit from such local convexity. Second, many nonlinear objective functions arising from sparse optimization (e.g., ChGeWangYe12-1 (); FengMitPangShenW13-1 ()) and machine learning (e.g., FanLi01-1 (); Mairal09 ()) consist of both convex and nonconvex components, corresponding to the data fidelity and sparsity regularization terms respectively. One interesting question is whether one can design more efficient algorithms for solving these nonconvex composite problems by utilizing their convexity structure. Third, the convexity of some objective functions represented by a black-box procedure is usually unknown, e.g., in simulation-based optimization Andr98-1 (); Fu02-1 (); AsmGlynn00 (); Law07 (). A unified treatment and analysis can thus help us to deal with such structural ambiguity. Fourth, in some cases, the objective functions are nonconvex with respect to (w.r.t.) a few decision variables jointly, but convex w.r.t. each one of them separately. Many machine learning/image processing problems are given in this form (e.g., Mairal09 ()).
Current practice is to first run an NLP solver to find a stationary point, and then a CP solver after one variable (e.g., the dictionary in Mairal09 ()) is fixed. A more powerful, unified treatment for both convex and nonconvex problems is desirable to better handle these types of problems.

Our contribution mainly lies in the following three aspects. First, we consider the classic NLP problem given in the form of (1.1), where $f$ is a smooth (possibly nonconvex) function satisfying (1.2) (denoted by $f \in \mathcal{C}^{1,1}_L(\mathbb{R}^n)$). In addition, we assume that $f$ is bounded from below. We demonstrate that the AG method, when employed with a certain stepsize policy, can find an $\epsilon$-solution of (1.1), i.e., a point $\bar{x}$ such that $\|\nabla f(\bar{x})\|^2 \le \epsilon$, in at most $\mathcal{O}(1/\epsilon)$ iterations, which is the best-known complexity bound possessed by first-order methods to solve general NLP problems (e.g., the gradient descent method Nest04 (); CarGouToi10-1 () and the trust region method GraSarToi08 ()). Note that if $f$ is convex and a more aggressive stepsize policy is applied in the AG method, then the aforementioned complexity bound can be improved to $\mathcal{O}(1/\epsilon^{1/3})$.

Second, we consider a class of composite problems (see, e.g., Lewis and Wright LewWri09-1 (), Chen et al. ChGeWangYe12-1 ()) given by

$\min_{x \in \mathbb{R}^n} \left\{ \Psi(x) + \mathcal{X}(x) \right\}, \quad \text{where } \Psi(x) := f(x) + h(x), \qquad (1.3)$
where $f \in \mathcal{C}^{1,1}_{L_f}(\mathbb{R}^n)$ is possibly nonconvex, $h \in \mathcal{C}^{1,1}_{L_h}(\mathbb{R}^n)$ is convex, and $\mathcal{X}$ is a simple convex (possibly non-smooth) function with bounded domain (e.g., $\mathcal{X}(x) = I_X(x)$ with $I_X$ being the indicator function of a convex compact set $X$). Clearly, we have $\Psi \in \mathcal{C}^{1,1}_{L_\Psi}(\mathbb{R}^n)$ with $L_\Psi = L_f + L_h$. Since $\mathcal{X}$ is possibly non-differentiable, we need to employ a different termination criterion based on the gradient mapping (see (2.38)) to analyze the complexity of the AG method. Observe, however, that if $\mathcal{X}$ vanishes, then the norm of the gradient mapping reduces to $\|\nabla \Psi(\cdot)\|$ for any choice of stepsize. We show that the same aggressive stepsize policy as the AG method for the convex problems can be applied for solving problem (1.3), no matter whether $\Psi$ is convex or not. More specifically, the AG method exhibits an optimal rate of convergence in terms of the functional optimality gap if $\Psi$ turns out to be convex. In addition, we show that one can find a solution $\bar{x}$ s.t. the squared norm of its gradient mapping is at most $\epsilon$ in at most

iterations. The above complexity bound improves the one established in GhaLanZhang13-1 () for the projected gradient method applied to problem (1.3) in terms of its dependence on the Lipschitz constant. In addition, it is significantly better than the latter bound when the Lipschitz constant of the nonconvex component is small enough (see Section 2.2 for more details).
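Since the gradient mapping serves as the termination criterion throughout the composite case, the following sketch may help fix ideas. The prox operator, the box constraint, and the quadratic objective below are illustrative assumptions, not the paper's notation.

```python
import numpy as np

def grad_mapping(x, grad_psi, prox, gamma):
    # g = (x - x_plus) / gamma, where x_plus is one prox-gradient step;
    # with chi = 0 (prox = identity) this reduces to grad_psi(x).
    x_plus = prox(x - gamma * grad_psi(x))
    return (x - x_plus) / gamma

# Illustrative setup: Psi(x) = 0.5*||x - 2||^2 and chi = indicator of [-1, 1]^n,
# whose prox is coordinate-wise clipping.
grad_psi = lambda x: x - 2.0
prox_box = lambda z: np.clip(z, -1.0, 1.0)

x = np.array([0.5])
g_box = grad_mapping(x, grad_psi, prox_box, gamma=0.5)      # constrained case
g_free = grad_mapping(x, grad_psi, lambda z: z, gamma=0.5)  # chi = 0 case
```

The unconstrained call returns exactly `grad_psi(x)`, mirroring the observation above that the criterion collapses to the ordinary gradient norm when the nonsmooth term vanishes.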

Third, we consider stochastic NLP problems in the form of (1.1) or (1.3), where only noisy first-order information about the objective is available via subsequent calls to a stochastic oracle ($\mathcal{SO}$). More specifically, at the $k$-th call, with $x_k$ being the input, the $\mathcal{SO}$ outputs a stochastic gradient $G(x_k, \xi_k)$, where $\{\xi_k\}_{k \ge 1}$ are random vectors whose distributions are supported on $\Xi_k \subseteq \mathbb{R}^d$. The following assumptions are also made for the stochastic gradient $G(x_k, \xi_k)$.

Assumption 1

For any $x \in \mathbb{R}^n$ and $k \ge 1$, we have

a) $\mathbb{E}\left[G(x, \xi_k)\right] = \nabla f(x)$, (1.4)
b) $\mathbb{E}\left[\|G(x, \xi_k) - \nabla f(x)\|^2\right] \le \sigma^2$. (1.5)
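To make Assumption 1 concrete, here is a minimal sketch of a stochastic oracle on a toy objective; the quadratic $f$, the Gaussian noise model, and the value of sigma are our own illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_grad(x):
    # Toy smooth objective f(x) = 0.5*||x||^2, so grad f(x) = x.
    return x

def stochastic_oracle(x, sigma=0.1):
    # Returns grad f(x) plus zero-mean noise, so that
    #  (a) E[G(x, xi)] = grad f(x)                         (unbiasedness)
    #  (b) E||G(x, xi) - grad f(x)||^2 = sigma^2 * len(x)  (bounded variance)
    return true_grad(x) + sigma * rng.standard_normal(x.shape)

# Empirical sanity check of unbiasedness at a fixed point.
x = np.ones(3)
samples = np.stack([stochastic_oracle(x) for _ in range(20000)])
```

Averaging many oracle outputs at the same point recovers the true gradient to within sampling error, which is exactly what condition a) asserts.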

Currently, the randomized stochastic gradient (RSG) method initially studied by Ghadimi and Lan GhaLan12 () and later improved in GhaLanZhang13-1 (); DangLan13-1 () seems to be the only available stochastic approximation (SA) algorithm for solving the aforementioned general stochastic NLP problems, while other SA methods (see, e.g., RobMon51-1 (); NJLS09-1 (); Spall03 (); pol92 (); Lan10-3 (); GhaLan12 (); GhaLan10-1b ()) require the convexity assumption about the objective function. However, the RSG method and its variants are only nearly optimal for solving convex SP problems. Based on the AG method, we present a randomized stochastic AG (RSAG) method for solving general stochastic NLP problems and show that if $f$ is convex, then the RSAG method exhibits an optimal rate of convergence in terms of the functional optimality gap, similarly to the accelerated SA method in Lan10-3 (). In this case, the complexity bound in (1.6) in terms of the residual of gradients can be improved to

Moreover, if $f$ is nonconvex, then the RSAG method can find an $\epsilon$-solution of (1.1), i.e., a point $\bar{x}$ s.t. $\mathbb{E}\left[\|\nabla f(\bar{x})\|^2\right] \le \epsilon$, in at most


calls to the $\mathcal{SO}$. We also generalize these complexity analyses to a class of nonconvex stochastic composite optimization problems by introducing a mini-batch approach into the RSAG method, and improve a few complexity results presented in GhaLanZhang13-1 () for solving these stochastic composite optimization problems.
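The mini-batch idea mentioned above can be sketched as follows: averaging $m$ independent oracle outputs keeps the gradient estimator unbiased while cutting its variance by a factor of $m$. The toy oracle and constants are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
grad_f = lambda x: x                                          # toy f(x) = 0.5*||x||^2
oracle = lambda x: grad_f(x) + rng.standard_normal(x.shape)   # sigma = 1 noise

def minibatch_grad(x, m):
    # Average m independent SO calls: still unbiased, variance sigma^2 / m.
    return np.mean([oracle(x) for _ in range(m)], axis=0)

x = np.zeros(2)
single = np.stack([oracle(x) for _ in range(4000)])
batch = np.stack([minibatch_grad(x, 25) for _ in range(4000)])
# Empirically, the per-coordinate variance drops by roughly a factor of 25.
```

This variance reduction is what lets the mini-batch variant trade extra oracle calls per iteration for a smaller stochastic error term in the complexity analysis.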

This paper is organized as follows. In Section 2, we present the AG algorithm and establish its convergence properties for solving problems (1.1) and (1.3). We then generalize the AG method for solving stochastic nonlinear and composite optimization problems in Section 3. Some brief concluding remarks are given in Section 4.

2 The accelerated gradient algorithm

Our goal in this section is to show that the AG method, which was originally designed for smooth convex optimization, also converges for solving nonconvex optimization problems after incorporating some proper modifications. More specifically, we first present an AG method for solving a general class of nonlinear optimization problems in Subsection 2.1 and then describe the AG method for solving a special class of nonconvex composite optimization problems in Subsection 2.2.

2.1 Minimization of smooth functions

In this subsection, we assume that $f$ is a differentiable nonconvex function, bounded from below, and that its gradient satisfies the Lipschitz condition in (1.2). It then follows that (see, e.g., Nest04 ())

$|f(y) - f(x) - \langle \nabla f(x), y - x \rangle| \le \frac{L}{2} \|y - x\|^2 \quad \text{for any } x, y \in \mathbb{R}^n.$
While the gradient descent method converges for solving the above class of nonconvex optimization problems, it does not achieve the optimal rate of convergence, in terms of the functional optimality gap, when is convex. On the other hand, the original AG method in Nest83-1 () is optimal for solving convex optimization problems, but does not necessarily converge for solving nonconvex optimization problems. Below, we present a modified AG method and show that by properly specifying the stepsize policy, it not only achieves the optimal rate of convergence for convex optimization, but also exhibits the best-known rate of convergence as shown in  Nest04 (); CarGouToi10-1 () for solving general smooth NLP problems by using first-order methods.

  Input: $x_0 \in \mathbb{R}^n$, $\{\alpha_k\}$ s.t. $\alpha_1 = 1$ and $\alpha_k \in (0, 1)$ for any $k \ge 2$, $\{\beta_k > 0\}$, and $\{\lambda_k > 0\}$.
  0. Set the initial points $x_0^{ag} = x_0$ and $k = 1$.
  1. Set $x_k^{md} = (1 - \alpha_k) x_{k-1}^{ag} + \alpha_k x_{k-1}$.
  2. Compute $\nabla f(x_k^{md})$ and set
     $x_k = x_{k-1} - \lambda_k \nabla f(x_k^{md})$, $\quad x_k^{ag} = x_k^{md} - \beta_k \nabla f(x_k^{md})$.
  3. Set $k \leftarrow k + 1$ and go to step 1.
Algorithm 1 The accelerated gradient (AG) algorithm

Note that, if $\beta_k = \lambda_k \alpha_k$, then we have $x_k^{ag} = (1 - \alpha_k) x_{k-1}^{ag} + \alpha_k x_k$. In this case, the above AG method is equivalent to one of the simplest variants of the well-known Nesterov’s method (see, e.g., Nest04 ()). On the other hand, if $\lambda_k = \beta_k$, then it can be shown by induction that $x_k^{md} = x_{k-1}$ and $x_k^{ag} = x_k$. In this case, Algorithm 1 reduces to the gradient descent method. We will show in this subsection that the above AG method actually converges for different selections of $\{\alpha_k\}$, $\{\beta_k\}$, and $\{\lambda_k\}$ in both the convex and nonconvex cases.
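To make the three-step recursion concrete, here is a sketch of the AG iteration in the convex regime. The stepsize formulas used below (alpha_k = 2/(k+1), beta_k = 1/(2L), lambda_k = k*beta_k/2) follow the kind of policy discussed later in Corollary 1, but should be read as an illustrative assumption rather than a verbatim transcription of the paper.

```python
import numpy as np

def ag_method(grad, x0, L, num_iters):
    # Sketch of the AG recursion: x_md aggregates the two sequences,
    # x takes the aggressive lambda_k step, and x_ag takes the short
    # beta_k gradient step from x_md.
    x = x_ag = np.asarray(x0, dtype=float)
    for k in range(1, num_iters + 1):
        alpha = 2.0 / (k + 1)          # alpha_1 = 1, then decreasing
        beta = 1.0 / (2.0 * L)         # short (descent-type) stepsize
        lam = k * beta / 2.0           # aggressive stepsize (convex policy)
        x_md = (1 - alpha) * x_ag + alpha * x
        g = grad(x_md)
        x = x - lam * g
        x_ag = x_md - beta * g
    return x_ag

# Usage on a convex quadratic f(x) = 0.5*||x||^2 (so L = 1, minimizer 0):
sol = ag_method(lambda z: z, x0=[1.0, -1.0], L=1.0, num_iters=300)
```

On this toy convex instance the aggregated iterate `x_ag` approaches the minimizer, consistent with the accelerated rate discussed below.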

To establish the convergence of the above AG method, we need the following simple technical result (see Lemma 3 of Lan13-2 () for a slightly more general result).

Lemma 1

Let $\{\alpha_k\}$ be the stepsizes in the AG method and suppose that the sequence satisfies




Then we have for any .


Note that and for any . These observations together with (2.6) then imply that for any . Dividing both sides of (2.5) by , we obtain


The result then immediately follows by summing up the above inequalities and rearranging the terms.

We are now ready to describe the main convergence properties of the AG method.

Theorem 2.1

Let $\{x_k^{md}, x_k^{ag}\}$ be computed by Algorithm 1 and $\Gamma_k$ be defined in (2.6).

  • If $\{\alpha_k\}$, $\{\beta_k\}$, and $\{\lambda_k\}$ are chosen such that


    then for any , we have

  • Suppose that $f$ is convex and that an optimal solution $x^*$ exists for problem (1.1). If $\{\alpha_k\}$, $\{\beta_k\}$, and $\{\lambda_k\}$ are chosen such that


    then for any , we have


We first show part a). Denote . By (1.2) and (2.2), we have


Also by (2.1) and (2.3), we have


where the last inequality follows from the Cauchy-Schwarz inequality. Combining the previous two inequalities, we obtain


where the second inequality follows from the fact that . Now, by (2.2), (2.3), and (2.4), we have

which, in the view of Lemma 1, implies that

Using the above identity, Jensen's inequality, and the fact that


we have


Replacing the above bound in (2.15), we obtain


for any , where the last inequality follows from the definition of in (2.6) and the fact that for all . Summing up the above inequalities and using the definition of in (2.7), we have


Re-arranging the terms in the above inequality and noting that , we obtain

which, in view of the assumption that , clearly implies (2.8).

We now show part b). First, note that by (2.4), we have


Also by the convexity of and (2.2),


It also follows from (2.3) that

and hence that


Combining (2.20), (2.21), and (2.22), we obtain


where the last inequality follows from the assumption in (2.9). Subtracting from both sides of the above inequality and using Lemma 1, we conclude that


where the second inequality follows from the simple relation that


due to (2.10) and the fact that . Hence, (2.12) immediately follows from the above inequality and the assumption in (2.9). Moreover, fixing , re-arranging the terms in (2.24), and noting the fact that , we obtain

which, together with (2.9), clearly implies (2.11).

We add a few observations about Theorem 2.1. First, in view of (2.23), it is possible to use a different assumption than the one in (2.9) on the stepsize policies for the convex case. In particular, we only need


to show the convergence of the AG method for minimizing smooth convex problems. However, since the condition given by (2.9) is required for minimizing composite problems in Subsections 2.2 and 3.2, we state this assumption for the sake of simplicity. Second, there are various options for selecting $\{\alpha_k\}$, $\{\beta_k\}$, and $\{\lambda_k\}$ to guarantee the convergence of the AG algorithm. Below we provide some of these selections for solving both convex and nonconvex problems.

Corollary 1

Suppose that $\{\alpha_k\}$ and $\{\beta_k\}$ in the AG method are set to

$\alpha_k = \frac{2}{k+1} \quad \text{and} \quad \beta_k = \frac{1}{2L}. \qquad (2.27)$
  • If $\{\lambda_k\}$ satisfies


    then for any , we have

  • Assume that $f$ is convex and that an optimal solution $x^*$ exists for problem (1.1). If $\{\lambda_k\}$ satisfies


    then for any , we have


We first show part a). Note that by (2.6) and (2.27), we have


which implies that


It can also be easily seen from (2.28) that . Using these observations, (2.27), and (2.28), we have


Combining the above relation with (2.8), we obtain (2.29).

We now show part b). Observe that by (2.27) and (2.30), we have

which implies that conditions (2.9) and (2.10) hold. Moreover, we have


Using (2.33) and the above bounds in (2.11) and (2.12), we obtain (2.31) and (2.32).

We now add a few remarks about the results obtained in Corollary 1. First, the rate of convergence in (2.29) for the AG method is in the same order of magnitude as that for the gradient descent method (Nest04 ()). It is also worth noting that by choosing $\lambda_k = \beta_k$ in (2.28), the rate of convergence for the AG method only changes up to a constant factor. However, in this case, the AG method reduces to the gradient descent method as mentioned earlier in this subsection. Second, if the problem is convex, by choosing the more aggressive stepsize $\{\lambda_k\}$ in (2.30), the AG method exhibits the optimal rate of convergence in (2.32). Moreover, with such a selection of $\{\lambda_k\}$, the AG method can find a solution $\bar{x}$ such that $\|\nabla f(\bar{x})\|^2 \le \epsilon$ in at most $\mathcal{O}(1/\epsilon^{1/3})$ iterations according to (2.31). The latter result has also been established in (MonSva11-1, Proposition 5.2) for an accelerated hybrid proximal extra-gradient method when applied to convex problems.
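The reduction noted above, that taking the lambda stepsize equal to the beta stepsize collapses the AG method to plain gradient descent, can be checked numerically. The recursion and stepsize formulas below are our own illustrative reading of Algorithm 1 and Corollary 1, and the nonconvex test function is an assumption for the demo.

```python
import numpy as np

def ag_iterates(grad, x0, L, num_iters, aggressive):
    # Illustrative AG recursion; with aggressive=False, lambda_k = beta_k.
    x = x_ag = float(x0)
    out = []
    for k in range(1, num_iters + 1):
        alpha = 2.0 / (k + 1)
        beta = 1.0 / (2.0 * L)
        lam = k * beta / 2.0 if aggressive else beta
        x_md = (1 - alpha) * x_ag + alpha * x
        g = grad(x_md)
        x, x_ag = x - lam * g, x_md - beta * g
        out.append(x_ag)
    return out

grad = lambda x: np.cos(x)   # f(x) = sin(x): smooth, nonconvex, L = 1
ag_short = ag_iterates(grad, 1.0, 1.0, 30, aggressive=False)

gd = [1.0]                   # plain gradient descent with the same stepsize 1/(2L)
for _ in range(30):
    gd.append(gd[-1] - 0.5 * grad(gd[-1]))
# The two trajectories coincide: with lambda_k = beta_k, x_md_k = x_{k-1}.
```

Both sequences also drift to the nearby stationary point of the nonconvex test function, so the short-stepsize policy behaves exactly like gradient descent here.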

Observe that $\{\lambda_k\}$ in (2.28) for general nonconvex problems is in the order of $1/L$, while the one in (2.30) for convex problems is more aggressive (in the order of $k/L$). An interesting question is whether we can apply the same stepsize policy in (2.30) for solving general NLP problems, no matter whether they are convex or not. We will discuss such a uniform treatment of both convex and nonconvex optimization for solving a certain class of composite problems in the next subsection.

2.2 Minimization of nonconvex composite functions

In this subsection, we consider a special class of NLP problems given in the form of (1.3). Our goal in this subsection is to show that we can employ a more aggressive stepsize policy in the AG method, similar to the one used in the convex case (see Theorem 2.1.b) and Corollary 1.b)), to solve these composite problems, even if the smooth component is possibly nonconvex.

Throughout this subsection, we make the following assumption about the convex (possibly non-differentiable) component in (1.3).

Assumption 2

There exists a constant $M > 0$ such that $\|x^+(y, c)\| \le M$ for any $c \in (0, +\infty)$ and $y \in \mathbb{R}^n$, where $x^+(y, c)$ is given by