Accelerated Gradient Methods for
Nonconvex Nonlinear and Stochastic Programming
Abstract
In this paper, we generalize Nesterov’s well-known accelerated gradient (AG) method, originally designed for convex smooth optimization, to solve nonconvex and possibly stochastic optimization problems. We demonstrate that by properly specifying the stepsize policy, the AG method exhibits the best known rate of convergence for solving general nonconvex smooth optimization problems by using first-order information, similarly to the gradient descent method. We then consider an important class of composite optimization problems and show that the AG method can solve them uniformly, i.e., by using the same aggressive stepsize policy as in the convex case, even if the problem turns out to be nonconvex. We demonstrate that the AG method exhibits an optimal rate of convergence if the composite problem is convex, and improves the best known rate of convergence if the problem is nonconvex. Based on the AG method, we also present new nonconvex stochastic approximation methods and show that they can improve a few existing rates of convergence for nonconvex stochastic optimization. To the best of our knowledge, this is the first time that the convergence of the AG method has been established for solving nonconvex nonlinear programming in the literature.
Keywords: nonconvex optimization, stochastic programming, accelerated gradient, complexity
AMS 2000 subject classification: 62L20, 90C25, 90C15, 68Q25
1 Introduction
In 1983, Nesterov in a celebrated work Nest831 () presented the accelerated gradient (AG) method for solving a class of convex programming (CP) problems given by
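(1.1)  $$\Psi^* := \min_{x \in \mathbb{R}^n} \Psi(x).$$
Here $\Psi(\cdot)$ is a convex function with Lipschitz continuous gradients, i.e., there exists $L_\Psi > 0$ such that (s.t.)
(1.2)  $$\|\nabla \Psi(y) - \nabla \Psi(x)\| \le L_\Psi \|y - x\| \quad \forall x, y \in \mathbb{R}^n.$$
Nesterov shows that the number of iterations performed by this algorithm to find a solution $\bar{x}$ s.t. $\Psi(\bar{x}) - \Psi^* \le \epsilon$ can be bounded by $\mathcal{O}(1/\sqrt{\epsilon})$, which significantly improves the $\mathcal{O}(1/\epsilon)$ complexity bound possessed by the gradient descent method. Moreover, in view of the classic complexity theory for convex optimization by Nemirovski and Yudin nemyud:83 (), the above iteration complexity bound is not improvable for smooth convex optimization when $n$ is sufficiently large.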
Nesterov’s AG method has attracted much interest recently due to the increasing need to solve large-scale CP problems by using fast first-order methods. In particular, Nesterov in an important work Nest051 () shows that by using the AG method and a novel smoothing scheme, one can improve the complexity for solving a broad class of saddle-point problems from $\mathcal{O}(1/\epsilon^2)$ to $\mathcal{O}(1/\epsilon)$. The AG method has also been generalized by Nesterov Nest071 (), Beck and Teboulle BecTeb092 (), and Tseng tseng081 () to solve an emerging class of composite CP problems whose objective function is given by the summation of a smooth component and another relatively simple nonsmooth component (e.g., the $l_1$ norm). Lan Lan103 () further shows that the AG method, when employed with proper stepsize policies, is optimal for solving not only smooth CP problems, but also general (not necessarily simple) nonsmooth and stochastic CP problems. More recently, some key elements of the AG method, e.g., the multi-step acceleration scheme, have been adapted to significantly improve the convergence properties of a few other first-order methods (e.g., level methods Lan132 ()). However, to the best of our knowledge, all the aforementioned developments explicitly require the convexity assumption about $\Psi$. Otherwise, if $\Psi$ in (1.1) is not necessarily convex, it is unclear whether the AG method still converges.
This paper aims to generalize the AG method, originally designed for smooth convex optimization, to solve more general nonlinear programming (NLP) (possibly nonconvex and stochastic) problems, and thus to present a unified treatment and analysis for convex, nonconvex and stochastic optimization. While this paper focuses on the theoretical development of the AG method, our study has also been motivated by the following more practical considerations in solving nonlinear programming problems. First, many general nonlinear objective functions are locally convex. A unified treatment for both convex and nonconvex problems will help us to make use of such local convex properties. In particular, we intend to understand whether one can apply the well-known aggressive stepsize policy in the AG method under a more general setting to benefit from such local convexity. Second, many nonlinear objective functions arising from sparse optimization (e.g., ChGeWangYe121 (); FengMitPangShenW131 ()) and machine learning (e.g., FanLi011 (); Mairal09 ()) consist of both convex and nonconvex components, corresponding to the data fidelity and sparsity regularization terms respectively. One interesting question is whether one can design more efficient algorithms for solving these nonconvex composite problems by utilizing their convexity structure. Third, the convexity of some objective functions represented by a black-box procedure is usually unknown, e.g., in simulation-based optimization Andr981 (); Fu021 (); AsmGlynn00 (); Law07 (). A unified treatment and analysis can thus help us to deal with such structural ambiguity. Fourth, in some cases, the objective functions are nonconvex with respect to (w.r.t.) a few decision variables jointly, but convex w.r.t. each one of them separately. Many machine learning and image processing problems are given in this form (e.g., Mairal09 ()).
Current practice is to first run an NLP solver to find a stationary point, and then a CP solver after one variable (e.g., dictionary in Mairal09 ()) is fixed. A more powerful, unified treatment for both convex and nonconvex problems is desirable to better handle these types of problems.
Our contribution mainly lies in the following three aspects. First, we consider the classic NLP problem given in the form of (1.1), where $\Psi(\cdot)$ is a smooth (possibly nonconvex) function satisfying (1.2) (denoted by $\Psi \in \mathcal{C}^{1,1}_{L_\Psi}(\mathbb{R}^n)$). In addition, we assume that $\Psi(\cdot)$ is bounded from below. We demonstrate that the AG method, when employed with a certain stepsize policy, can find an $\epsilon$-solution of (1.1), i.e., a point $\bar{x}$ such that $\|\nabla\Psi(\bar{x})\|^2 \le \epsilon$, in at most $\mathcal{O}(1/\epsilon)$ iterations, which is the best-known complexity bound possessed by first-order methods to solve general NLP problems (e.g., the gradient descent method Nest04 (); CarGouToi101 () and the trust region method GraSarToi08 ()). Note that if $\Psi$ is convex and a more aggressive stepsize policy is applied in the AG method, then the aforementioned complexity bound can be improved to $\mathcal{O}(1/\epsilon^{1/3})$.
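Second, we consider a class of composite problems (see, e.g., Lewis and Wright LewWri091 (), Chen et al. ChGeWangYe121 ()) given by
(1.3)  $$\min_{x \in \mathbb{R}^n} \Psi(x) + \mathcal{X}(x), \qquad \Psi(x) := f(x) + h(x),$$
where $f \in \mathcal{C}^{1,1}_{L_f}(\mathbb{R}^n)$ is possibly nonconvex, $h \in \mathcal{C}^{1,1}_{L_h}(\mathbb{R}^n)$ is convex, and $\mathcal{X}$ is a simple convex (possibly nonsmooth) function with bounded domain (e.g., $\mathcal{X}(x) = \mathcal{I}_X(x)$ with $\mathcal{I}_X(\cdot)$ being the indicator function of a convex compact set $X \subset \mathbb{R}^n$). Clearly, we have $\Psi \in \mathcal{C}^{1,1}_{L_\Psi}(\mathbb{R}^n)$ with $L_\Psi = L_f + L_h$. Since $\mathcal{X}$ is possibly nondifferentiable, we need to employ a different termination criterion based on the gradient mapping $\mathcal{G}(\cdot, \cdot, \cdot)$ (see (2.38)) to analyze the complexity of the AG method. Observe, however, that if $\mathcal{X}(x) = 0$, then we have $\mathcal{G}(x, \nabla\Psi(x), c) = \nabla\Psi(x)$ for any $c > 0$. We show that the same aggressive stepsize policy as the one used by the AG method for convex problems can be applied for solving problem (1.3) regardless of whether $\Psi(\cdot)$ is convex. More specifically, the AG method exhibits an optimal rate of convergence in terms of functional optimality gap if $f(\cdot)$ turns out to be convex. In addition, we show that one can find a solution $\bar{x}$ s.t. $\|\mathcal{G}(\bar{x}, \nabla\Psi(\bar{x}), c)\|^2 \le \epsilon$ in at most
$$\mathcal{O}\left\{ \left(\frac{L_\Psi^2}{\epsilon}\right)^{1/3} + \frac{L_\Psi L_f}{\epsilon} \right\}$$
iterations. The above complexity bound improves the one established in GhaLanZhang131 () for the projected gradient method applied to problem (1.3) in terms of their dependence on the Lipschitz constant $L_h$. In addition, it is significantly better than the latter bound when $L_f$ is small enough (see Section 2.2 for more details).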
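Third, we consider stochastic NLP problems in the form of (1.1) or (1.3), where only noisy first-order information about $\Psi$ is available via subsequent calls to a stochastic oracle (SO). More specifically, at the $k$th call, $x_k \in \mathbb{R}^n$ being the input, the SO outputs a stochastic gradient $G(x_k, \xi_k)$, where $\{\xi_k\}_{k \ge 1}$ are random vectors whose distributions $P_k$ are supported on $\Xi_k \subseteq \mathbb{R}^d$. The following assumptions are also made for the stochastic gradient $G(x_k, \xi_k)$.
Assumption 1
For any $x \in \mathbb{R}^n$ and $k \ge 1$, we have
a)  (1.4)  $$\mathbb{E}[G(x, \xi_k)] = \nabla\Psi(x),$$
b)  (1.5)  $$\mathbb{E}\left[\|G(x, \xi_k) - \nabla\Psi(x)\|^2\right] \le \sigma^2.$$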
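As an illustration of Assumption 1, the following minimal sketch builds a toy stochastic oracle that adds zero-mean Gaussian noise to an exact gradient and checks the two moment conditions empirically. The test function, the Gaussian noise model, and the names `grad_psi`, `make_stochastic_oracle`, and `empirical_moments` are assumptions made for this example only; they are not taken from the paper.

```python
import math
import random


def grad_psi(x):
    """Gradient of the illustrative nonconvex function Psi(x) = sum_i (x_i^2 + 3 sin^2 x_i)."""
    return [2.0 * xi + 3.0 * math.sin(2.0 * xi) for xi in x]


def make_stochastic_oracle(grad, sigma, seed=0):
    """Hypothetical SO returning G(x, xi) = grad(x) + xi with zero-mean Gaussian xi.

    Each coordinate of xi has standard deviation sigma / sqrt(n), so that
    E[xi] = 0 (unbiasedness) and E||xi||^2 = sigma^2 (bounded variance).
    """
    rng = random.Random(seed)

    def oracle(x):
        scale = sigma / math.sqrt(len(x))
        return [g + rng.gauss(0.0, scale) for g in grad(x)]

    return oracle


def empirical_moments(oracle, grad, x, num_samples):
    """Estimate E[G - grad] per coordinate and E||G - grad||^2 by sampling."""
    g = grad(x)
    bias = [0.0] * len(x)
    second_moment = 0.0
    for _ in range(num_samples):
        diff = [s - gi for s, gi in zip(oracle(x), g)]
        bias = [b + d / num_samples for b, d in zip(bias, diff)]
        second_moment += sum(d * d for d in diff) / num_samples
    return bias, second_moment


oracle = make_stochastic_oracle(grad_psi, sigma=0.5, seed=0)
bias, second_moment = empirical_moments(oracle, grad_psi, [1.0, -2.0], 20000)
print("empirical bias per coordinate:", bias)
print("empirical E||G - grad||^2:", second_moment, "(target sigma^2 = 0.25)")
```

Any noise distribution with zero mean and second moment at most $\sigma^2$ would serve equally well here; the Gaussian choice is only for convenience.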
Currently, the randomized stochastic gradient (RSG) method initially studied by Ghadimi and Lan GhaLan12 () and later improved in GhaLanZhang131 (); DangLan131 () seems to be the only available stochastic approximation (SA) algorithm for solving the aforementioned general stochastic NLP problems, while other SA methods (see, e.g., RobMon511 (); NJLS091 (); Spall03 (); pol92 (); Lan103 (); GhaLan12 (); GhaLan101b ()) require the convexity assumption about $\Psi$. However, the RSG method and its variants are only nearly optimal for solving convex SP problems. Based on the AG method, we present a randomized stochastic AG (RSAG) method for solving general stochastic NLP problems and show that if $\Psi$ is convex, then the RSAG exhibits an optimal rate of convergence in terms of functional optimality gap, similarly to the accelerated SA method in Lan103 (). In this case, the complexity bound in (1.6) in terms of the residual of gradients can be improved to
$$\mathcal{O}\left( \frac{L_\Psi^{2/3}}{\epsilon^{1/3}} + \frac{L_\Psi^{2/3} \sigma^2}{\epsilon^{4/3}} \right).$$
Moreover, if $\Psi$ is nonconvex, then the RSAG method can find an $\epsilon$-solution of (1.1), i.e., a point $\bar{x}$ s.t. $\mathbb{E}[\|\nabla\Psi(\bar{x})\|^2] \le \epsilon$ in at most
(1.6)  $$\mathcal{O}\left( \frac{L_\Psi}{\epsilon} + \frac{L_\Psi \sigma^2}{\epsilon^2} \right)$$
calls to the SO. We also generalize these complexity analyses to a class of nonconvex stochastic composite optimization problems by introducing a mini-batch approach into the RSAG method and improve a few complexity results presented in GhaLanZhang131 () for solving these stochastic composite optimization problems.
This paper is organized as follows. In Section 2, we present the AG algorithm and establish its convergence properties for solving problems (1.1) and (1.3). We then generalize the AG method for solving stochastic nonlinear and composite optimization problems in Section 3. Some brief concluding remarks are given in Section 4.
2 The accelerated gradient algorithm
Our goal in this section is to show that the AG method, which was originally designed for smooth convex optimization, also converges for nonconvex optimization problems after some proper modifications. More specifically, we first present an AG method for solving a general class of nonlinear optimization problems in Subsection 2.1 and then describe the AG method for solving a special class of nonconvex composite optimization problems in Subsection 2.2.
2.1 Minimization of smooth functions
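In this subsection, we assume that $\Psi(\cdot)$ is a differentiable nonconvex function that is bounded from below and whose gradient satisfies (1.2). It then follows that (see, e.g., Nest04 ())
(2.1)  $$|\Psi(y) - \Psi(x) - \langle \nabla\Psi(x), y - x \rangle| \le \frac{L_\Psi}{2} \|y - x\|^2 \quad \forall x, y \in \mathbb{R}^n.$$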
While the gradient descent method converges for solving the above class of nonconvex optimization problems, it does not achieve the optimal rate of convergence, in terms of the functional optimality gap, when $\Psi(\cdot)$ is convex. On the other hand, the original AG method in Nest831 () is optimal for solving convex optimization problems, but does not necessarily converge for solving nonconvex optimization problems. Below, we present a modified AG method and show that by properly specifying the stepsize policy, it not only achieves the optimal rate of convergence for convex optimization, but also exhibits the best-known rate of convergence as shown in Nest04 (); CarGouToi101 () for solving general smooth NLP problems by using first-order methods.
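Algorithm 1 The accelerated gradient (AG) algorithm
Input: $x_0 \in \mathbb{R}^n$, $\{\alpha_k\}$ s.t. $\alpha_1 = 1$ and $\alpha_k \in (0, 1)$ for any $k \ge 2$, $\{\beta_k > 0\}$, and $\{\lambda_k > 0\}$.
0. Set the initial points $x_0^{ag} = x_0$ and $k = 1$.
1. Set
(2.2)  $$x_k^{md} = (1 - \alpha_k) x_{k-1}^{ag} + \alpha_k x_{k-1}.$$
2. Compute $\nabla\Psi(x_k^{md})$ and set
(2.3)  $$x_k = x_{k-1} - \lambda_k \nabla\Psi(x_k^{md}),$$
(2.4)  $$x_k^{ag} = x_k^{md} - \beta_k \nabla\Psi(x_k^{md}).$$
3. Set $k \leftarrow k + 1$ and go to step 1.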
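Note that, if $\beta_k = \alpha_k \lambda_k$ for all $k \ge 1$, then we have $x_k^{ag} = \alpha_k x_k + (1 - \alpha_k) x_{k-1}^{ag}$. In this case, the above AG method is equivalent to one of the simplest variants of the well-known Nesterov’s method (see, e.g., Nest04 ()). On the other hand, if $\beta_k = \lambda_k$, $k = 1, 2, \ldots$, then it can be shown by induction that $x_k^{md} = x_{k-1}$ and $x_k^{ag} = x_k$. In this case, Algorithm 1 reduces to the gradient descent method. We will show in this subsection that the above AG method actually converges for different selections of $\{\alpha_k\}$, $\{\beta_k\}$, and $\{\lambda_k\}$ in both the convex and nonconvex cases.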
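The three-sequence scheme discussed above can be sketched in a few lines of Python. This is a minimal illustration, assuming the updates $x_k^{md} = (1-\alpha_k) x_{k-1}^{ag} + \alpha_k x_{k-1}$, $x_k = x_{k-1} - \lambda_k \nabla\Psi(x_k^{md})$, and $x_k^{ag} = x_k^{md} - \beta_k \nabla\Psi(x_k^{md})$ for (2.2)-(2.4); the test function and the names `ag_method`, `grad_psi`, `alpha`, and `step` are hypothetical. It also checks numerically the remark that choosing $\beta_k = \lambda_k$ recovers gradient descent.

```python
import math


def grad_psi(x):
    """Gradient of the illustrative nonconvex function Psi(x) = sum_i (x_i^2 + 3 sin^2 x_i)."""
    return [2.0 * xi + 3.0 * math.sin(2.0 * xi) for xi in x]


def ag_method(grad, x0, alpha, beta, lam, num_iters):
    """Run the AG updates for num_iters iterations; alpha, beta, lam map k to a stepsize."""
    x = list(x0)       # sequence x_k
    x_ag = list(x0)    # sequence x_k^{ag}, initialized to x_0
    for k in range(1, num_iters + 1):
        a = alpha(k)
        x_md = [(1.0 - a) * xa + a * xi for xa, xi in zip(x_ag, x)]
        g = grad(x_md)
        x = [xi - lam(k) * gi for xi, gi in zip(x, g)]
        x_ag = [xm - beta(k) * gi for xm, gi in zip(x_md, g)]
    return x, x_ag


L = 8.0  # Lipschitz constant of grad_psi, since |2 + 6 cos(2t)| <= 8


def alpha(k):
    return 2.0 / (k + 1)  # alpha_1 = 1 and alpha_k in (0, 1) for k >= 2


def step(k):
    return 1.0 / (2.0 * L)


x0 = [2.0, -1.5]
# beta_k = lambda_k: the AG iterates should coincide with gradient descent.
x_last, x_ag_last = ag_method(grad_psi, x0, alpha, step, step, 30)

x_gd = list(x0)
for k in range(1, 31):
    g = grad_psi(x_gd)
    x_gd = [xi - step(k) * gi for xi, gi in zip(x_gd, g)]

gap_to_gd = max(abs(a - b) for a, b in zip(x_ag_last, x_gd))
gap_x_ag = max(abs(a - b) for a, b in zip(x_last, x_ag_last))
print("max |x_ag - x_gd|:", gap_to_gd)
print("max |x - x_ag|:", gap_x_ag)
```

Passing the stepsizes as functions of $k$ keeps the sketch agnostic to the policy, so the same routine can be reused with other choices of the three sequences.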
To establish the convergence of the above AG method, we need the following simple technical result (see Lemma 3 of Lan132 () for a slightly more general result).
Lemma 1
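Let $\{\alpha_k\}$ be the stepsizes in the AG method and suppose that the sequence $\{h_k\}$ satisfies
(2.5)  $$h_k \le (1 - \alpha_k) h_{k-1} + \eta_k, \quad k = 1, 2, \ldots,$$
where
(2.6)  $$\Gamma_k := \begin{cases} 1, & k = 1, \\ (1 - \alpha_k) \Gamma_{k-1}, & k \ge 2. \end{cases}$$
Then we have $h_k \le \Gamma_k \sum_{i=1}^{k} (\eta_i / \Gamma_i)$ for any $k \ge 1$.
Proof
Since $\alpha_1 = 1$ and $\alpha_k \in (0, 1)$ for any $k \ge 2$, the definition in (2.6) implies that $\Gamma_k > 0$ for any $k \ge 1$. Dividing both sides of (2.5) by $\Gamma_k$, we obtain $h_1 / \Gamma_1 \le \eta_1 / \Gamma_1$ and $h_i / \Gamma_i \le h_{i-1} / \Gamma_{i-1} + \eta_i / \Gamma_i$ for any $i \ge 2$. The result then follows by summing up these inequalities and rearranging the terms.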
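The statement of Lemma 1 can be checked numerically with a short sketch. It assumes the reading $h_k \le (1-\alpha_k) h_{k-1} + \eta_k$ for (2.5) and $\Gamma_1 = 1$, $\Gamma_k = (1-\alpha_k)\Gamma_{k-1}$ for (2.6), takes the recursion with equality, and uses the stepsize $\alpha_k = 2/(k+1)$ and an arbitrary sequence $\eta_k = 1/k$ purely as examples. The helper name `gamma_sequence` is hypothetical.

```python
def gamma_sequence(alpha, num_terms):
    """Return [Gamma_1, ..., Gamma_N] with Gamma_1 = 1 and Gamma_k = (1 - alpha_k) Gamma_{k-1}."""
    gammas = []
    for k in range(1, num_terms + 1):
        gammas.append(1.0 if k == 1 else (1.0 - alpha(k)) * gammas[-1])
    return gammas


def alpha(k):
    return 2.0 / (k + 1)  # alpha_1 = 1, alpha_k in (0, 1) for k >= 2


N = 20
gammas = gamma_sequence(alpha, N)
etas = [1.0 / k for k in range(1, N + 1)]  # arbitrary illustrative sequence eta_k

# Run the recursion with equality; h_0 is arbitrary because alpha_1 = 1 removes it.
h = 123.0
max_gap = 0.0
for k in range(1, N + 1):
    h = (1.0 - alpha(k)) * h + etas[k - 1]
    closed_form = gammas[k - 1] * sum(etas[i] / gammas[i] for i in range(k))
    max_gap = max(max_gap, abs(h - closed_form))

# For alpha_k = 2 / (k + 1), the product telescopes to Gamma_k = 2 / (k (k + 1)).
closed_form_gap = max(
    abs(gammas[k - 1] - 2.0 / (k * (k + 1))) for k in range(1, N + 1)
)
print("max |h_k - Gamma_k * sum(eta_i / Gamma_i)|:", max_gap)
print("max |Gamma_k - 2 / (k (k + 1))|:", closed_form_gap)
```

With equality in the recursion the bound of the lemma is attained, which is why the two quantities agree up to rounding.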
We are now ready to describe the main convergence properties of the AG method.
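Theorem 2.1
Let $\{x_k^{md}, x_k^{ag}\}_{k \ge 1}$ be computed by Algorithm 1 and $\Gamma_k$ be defined in (2.6).
a) If $\{\alpha_k\}$, $\{\beta_k\}$, and $\{\lambda_k\}$ are chosen such that
(2.7)  $$C_k := 1 - L_\Psi \lambda_k - \frac{L_\Psi (\lambda_k - \beta_k)^2}{2 \alpha_k \Gamma_k \lambda_k} \left( \sum_{\tau=k}^{N} \Gamma_\tau \right) > 0, \quad 1 \le k \le N,$$
then for any $N \ge 1$, we have
(2.8)  $$\min_{k=1,\ldots,N} \|\nabla\Psi(x_k^{md})\|^2 \le \frac{\Psi(x_0) - \Psi^*}{\sum_{k=1}^{N} \lambda_k C_k}.$$
b) Suppose that $\Psi(\cdot)$ is convex and that an optimal solution $x^*$ exists for problem (1.1). If $\{\alpha_k\}$, $\{\beta_k\}$, and $\{\lambda_k\}$ are chosen such that
(2.9)  $$\alpha_k \lambda_k \le \beta_k < \frac{1}{L_\Psi},$$
(2.10)  $$\frac{\alpha_1}{\lambda_1 \Gamma_1} \ge \frac{\alpha_2}{\lambda_2 \Gamma_2} \ge \cdots,$$
then for any $N \ge 1$, we have
(2.11)  $$\min_{k=1,\ldots,N} \|\nabla\Psi(x_k^{md})\|^2 \le \frac{\|x_0 - x^*\|^2}{\lambda_1 \sum_{k=1}^{N} \Gamma_k^{-1} \beta_k (1 - L_\Psi \beta_k)},$$
(2.12)  $$\Psi(x_N^{ag}) - \Psi(x^*) \le \frac{\Gamma_N \|x_0 - x^*\|^2}{2 \lambda_1}.$$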
Proof
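We first show part a). Denote $\Delta_k := \nabla\Psi(x_{k-1}) - \nabla\Psi(x_k^{md})$. By (1.2) and (2.2), we have
(2.13)  $$\|\Delta_k\| = \|\nabla\Psi(x_{k-1}) - \nabla\Psi(x_k^{md})\| \le L_\Psi \|x_{k-1} - x_k^{md}\| = L_\Psi (1 - \alpha_k) \|x_{k-1}^{ag} - x_{k-1}\|.$$
Also by (2.1) and (2.3), we have
(2.14)  $$\begin{aligned} \Psi(x_k) &\le \Psi(x_{k-1}) + \langle \nabla\Psi(x_{k-1}), x_k - x_{k-1} \rangle + \frac{L_\Psi}{2} \|x_k - x_{k-1}\|^2 \\ &= \Psi(x_{k-1}) + \langle \Delta_k + \nabla\Psi(x_k^{md}), -\lambda_k \nabla\Psi(x_k^{md}) \rangle + \frac{L_\Psi \lambda_k^2}{2} \|\nabla\Psi(x_k^{md})\|^2 \\ &= \Psi(x_{k-1}) - \lambda_k \left(1 - \frac{L_\Psi \lambda_k}{2}\right) \|\nabla\Psi(x_k^{md})\|^2 - \lambda_k \langle \Delta_k, \nabla\Psi(x_k^{md}) \rangle \\ &\le \Psi(x_{k-1}) - \lambda_k \left(1 - \frac{L_\Psi \lambda_k}{2}\right) \|\nabla\Psi(x_k^{md})\|^2 + \lambda_k \|\Delta_k\| \, \|\nabla\Psi(x_k^{md})\|, \end{aligned}$$
where the last inequality follows from the Cauchy-Schwarz inequality. Combining the previous two inequalities, we obtain
(2.15)  $$\begin{aligned} \Psi(x_k) &\le \Psi(x_{k-1}) - \lambda_k \left(1 - \frac{L_\Psi \lambda_k}{2}\right) \|\nabla\Psi(x_k^{md})\|^2 + L_\Psi (1 - \alpha_k) \lambda_k \|\nabla\Psi(x_k^{md})\| \, \|x_{k-1}^{ag} - x_{k-1}\| \\ &\le \Psi(x_{k-1}) - \lambda_k \left(1 - \frac{L_\Psi \lambda_k}{2}\right) \|\nabla\Psi(x_k^{md})\|^2 + \frac{L_\Psi \lambda_k^2}{2} \|\nabla\Psi(x_k^{md})\|^2 + \frac{L_\Psi (1 - \alpha_k)^2}{2} \|x_{k-1}^{ag} - x_{k-1}\|^2 \\ &= \Psi(x_{k-1}) - \lambda_k (1 - L_\Psi \lambda_k) \|\nabla\Psi(x_k^{md})\|^2 + \frac{L_\Psi (1 - \alpha_k)^2}{2} \|x_{k-1}^{ag} - x_{k-1}\|^2, \end{aligned}$$
where the second inequality follows from the fact that $ab \le (a^2 + b^2)/2$. Now, by (2.2), (2.3), and (2.4), we have
$$\begin{aligned} x_k^{ag} - x_k &= (1 - \alpha_k) x_{k-1}^{ag} + \alpha_k x_{k-1} - \beta_k \nabla\Psi(x_k^{md}) - [x_{k-1} - \lambda_k \nabla\Psi(x_k^{md})] \\ &= (1 - \alpha_k)(x_{k-1}^{ag} - x_{k-1}) + (\lambda_k - \beta_k) \nabla\Psi(x_k^{md}), \end{aligned}$$
which, in view of Lemma 1, implies that
$$x_k^{ag} - x_k = \Gamma_k \sum_{\tau=1}^{k} \frac{\lambda_\tau - \beta_\tau}{\Gamma_\tau} \nabla\Psi(x_\tau^{md}).$$
Using the above identity, Jensen’s inequality for $\|\cdot\|^2$, and the fact that
(2.16)  $$\sum_{\tau=1}^{k} \frac{\alpha_\tau}{\Gamma_\tau} = \frac{\alpha_1}{\Gamma_1} + \sum_{\tau=2}^{k} \frac{1}{\Gamma_\tau} \left(1 - \frac{\Gamma_\tau}{\Gamma_{\tau-1}}\right) = \frac{1}{\Gamma_1} + \sum_{\tau=2}^{k} \left(\frac{1}{\Gamma_\tau} - \frac{1}{\Gamma_{\tau-1}}\right) = \frac{1}{\Gamma_k},$$
we have
(2.17)  $$\|x_k^{ag} - x_k\|^2 = \left\| \Gamma_k \sum_{\tau=1}^{k} \frac{\alpha_\tau}{\Gamma_\tau} \cdot \frac{\lambda_\tau - \beta_\tau}{\alpha_\tau} \nabla\Psi(x_\tau^{md}) \right\|^2 \le \Gamma_k \sum_{\tau=1}^{k} \frac{(\lambda_\tau - \beta_\tau)^2}{\Gamma_\tau \alpha_\tau} \|\nabla\Psi(x_\tau^{md})\|^2.$$
Replacing the above bound in (2.15), we obtain
(2.18)  $$\begin{aligned} \Psi(x_k) &\le \Psi(x_{k-1}) - \lambda_k (1 - L_\Psi \lambda_k) \|\nabla\Psi(x_k^{md})\|^2 + \frac{L_\Psi \Gamma_{k-1} (1 - \alpha_k)^2}{2} \sum_{\tau=1}^{k-1} \frac{(\lambda_\tau - \beta_\tau)^2}{\Gamma_\tau \alpha_\tau} \|\nabla\Psi(x_\tau^{md})\|^2 \\ &\le \Psi(x_{k-1}) - \lambda_k (1 - L_\Psi \lambda_k) \|\nabla\Psi(x_k^{md})\|^2 + \frac{L_\Psi \Gamma_k}{2} \sum_{\tau=1}^{k} \frac{(\lambda_\tau - \beta_\tau)^2}{\Gamma_\tau \alpha_\tau} \|\nabla\Psi(x_\tau^{md})\|^2 \end{aligned}$$
for any $k \ge 1$, where the last inequality follows from the definition of $\Gamma_k$ in (2.6) and the fact that $\alpha_k \in (0, 1]$ for all $k \ge 1$. Summing up the above inequalities and using the definition of $C_k$ in (2.7), we have
(2.19)  $$\begin{aligned} \Psi(x_N) &\le \Psi(x_0) - \sum_{k=1}^{N} \lambda_k (1 - L_\Psi \lambda_k) \|\nabla\Psi(x_k^{md})\|^2 + \frac{L_\Psi}{2} \sum_{k=1}^{N} \Gamma_k \sum_{\tau=1}^{k} \frac{(\lambda_\tau - \beta_\tau)^2}{\Gamma_\tau \alpha_\tau} \|\nabla\Psi(x_\tau^{md})\|^2 \\ &= \Psi(x_0) - \sum_{k=1}^{N} \lambda_k (1 - L_\Psi \lambda_k) \|\nabla\Psi(x_k^{md})\|^2 + \frac{L_\Psi}{2} \sum_{k=1}^{N} \frac{(\lambda_k - \beta_k)^2}{\Gamma_k \alpha_k} \left( \sum_{\tau=k}^{N} \Gamma_\tau \right) \|\nabla\Psi(x_k^{md})\|^2 \\ &= \Psi(x_0) - \sum_{k=1}^{N} \lambda_k C_k \|\nabla\Psi(x_k^{md})\|^2. \end{aligned}$$
Rearranging the terms in the above inequality and noting that $\Psi(x_N) \ge \Psi^*$, we obtain
$$\min_{k=1,\ldots,N} \|\nabla\Psi(x_k^{md})\|^2 \left( \sum_{k=1}^{N} \lambda_k C_k \right) \le \sum_{k=1}^{N} \lambda_k C_k \|\nabla\Psi(x_k^{md})\|^2 \le \Psi(x_0) - \Psi^*,$$
which, in view of the assumption that $C_k > 0$, clearly implies (2.8).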
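We now show part b). First, note that by (2.1) and (2.4), we have
(2.20)  $$\begin{aligned} \Psi(x_k^{ag}) &\le \Psi(x_k^{md}) + \langle \nabla\Psi(x_k^{md}), x_k^{ag} - x_k^{md} \rangle + \frac{L_\Psi}{2} \|x_k^{ag} - x_k^{md}\|^2 \\ &= \Psi(x_k^{md}) - \beta_k \|\nabla\Psi(x_k^{md})\|^2 + \frac{L_\Psi \beta_k^2}{2} \|\nabla\Psi(x_k^{md})\|^2. \end{aligned}$$
Also by the convexity of $\Psi(\cdot)$ and (2.2), for any $x \in \mathbb{R}^n$,
(2.21)  $$\begin{aligned} \Psi(x_k^{md}) - [(1 - \alpha_k) \Psi(x_{k-1}^{ag}) + \alpha_k \Psi(x)] &= \alpha_k [\Psi(x_k^{md}) - \Psi(x)] + (1 - \alpha_k) [\Psi(x_k^{md}) - \Psi(x_{k-1}^{ag})] \\ &\le \alpha_k \langle \nabla\Psi(x_k^{md}), x_k^{md} - x \rangle + (1 - \alpha_k) \langle \nabla\Psi(x_k^{md}), x_k^{md} - x_{k-1}^{ag} \rangle \\ &= \langle \nabla\Psi(x_k^{md}), \alpha_k (x_k^{md} - x) + (1 - \alpha_k)(x_k^{md} - x_{k-1}^{ag}) \rangle \\ &= \alpha_k \langle \nabla\Psi(x_k^{md}), x_{k-1} - x \rangle. \end{aligned}$$
It also follows from (2.3) that
$$\|x_{k-1} - x\|^2 - 2 \lambda_k \langle \nabla\Psi(x_k^{md}), x_{k-1} - x \rangle + \lambda_k^2 \|\nabla\Psi(x_k^{md})\|^2 = \|x_{k-1} - \lambda_k \nabla\Psi(x_k^{md}) - x\|^2 = \|x_k - x\|^2,$$
and hence that
(2.22)  $$\alpha_k \langle \nabla\Psi(x_k^{md}), x_{k-1} - x \rangle = \frac{\alpha_k}{2 \lambda_k} \left[ \|x_{k-1} - x\|^2 - \|x_k - x\|^2 \right] + \frac{\alpha_k \lambda_k}{2} \|\nabla\Psi(x_k^{md})\|^2.$$
Combining (2.20), (2.21), and (2.22), we obtain
(2.23)  $$\begin{aligned} \Psi(x_k^{ag}) &\le (1 - \alpha_k) \Psi(x_{k-1}^{ag}) + \alpha_k \Psi(x) + \frac{\alpha_k}{2 \lambda_k} \left[ \|x_{k-1} - x\|^2 - \|x_k - x\|^2 \right] - \beta_k \left( 1 - \frac{L_\Psi \beta_k}{2} - \frac{\alpha_k \lambda_k}{2 \beta_k} \right) \|\nabla\Psi(x_k^{md})\|^2 \\ &\le (1 - \alpha_k) \Psi(x_{k-1}^{ag}) + \alpha_k \Psi(x) + \frac{\alpha_k}{2 \lambda_k} \left[ \|x_{k-1} - x\|^2 - \|x_k - x\|^2 \right] - \frac{\beta_k}{2} (1 - L_\Psi \beta_k) \|\nabla\Psi(x_k^{md})\|^2, \end{aligned}$$
where the last inequality follows from the assumption in (2.9). Subtracting $\Psi(x)$ from both sides of the above inequality and using Lemma 1, we conclude that
(2.24)  $$\begin{aligned} \frac{\Psi(x_N^{ag}) - \Psi(x)}{\Gamma_N} &\le \sum_{k=1}^{N} \frac{\alpha_k}{2 \lambda_k \Gamma_k} \left[ \|x_{k-1} - x\|^2 - \|x_k - x\|^2 \right] - \sum_{k=1}^{N} \frac{\beta_k}{2 \Gamma_k} (1 - L_\Psi \beta_k) \|\nabla\Psi(x_k^{md})\|^2 \\ &\le \frac{\|x_0 - x\|^2}{2 \lambda_1} - \sum_{k=1}^{N} \frac{\beta_k}{2 \Gamma_k} (1 - L_\Psi \beta_k) \|\nabla\Psi(x_k^{md})\|^2 \quad \forall x \in \mathbb{R}^n, \end{aligned}$$
where the second inequality follows from the simple relation that
(2.25)  $$\sum_{k=1}^{N} \frac{\alpha_k}{\lambda_k \Gamma_k} \left[ \|x_{k-1} - x\|^2 - \|x_k - x\|^2 \right] \le \frac{\alpha_1 \|x_0 - x\|^2}{\lambda_1 \Gamma_1} = \frac{\|x_0 - x\|^2}{\lambda_1},$$
due to (2.10) and the fact that $\alpha_1 = \Gamma_1 = 1$. Hence, (2.12) immediately follows from the above inequality and the assumption in (2.9). Moreover, fixing $x = x^*$, rearranging the terms in (2.24), and noting the fact that $\Psi(x_N^{ag}) \ge \Psi(x^*)$, we obtain
$$\min_{k=1,\ldots,N} \|\nabla\Psi(x_k^{md})\|^2 \sum_{k=1}^{N} \frac{\beta_k}{2 \Gamma_k} (1 - L_\Psi \beta_k) \le \sum_{k=1}^{N} \frac{\beta_k}{2 \Gamma_k} (1 - L_\Psi \beta_k) \|\nabla\Psi(x_k^{md})\|^2 \le \frac{\|x_0 - x^*\|^2}{2 \lambda_1},$$
which, together with (2.9), clearly implies (2.11).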
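To make part a) of Theorem 2.1 concrete, the sketch below evaluates the constants $C_k$ for one stepsize policy and compares the smallest squared gradient norm along the AG iterates with the bound in (2.8). It assumes the reading $C_k = 1 - L_\Psi\lambda_k - \frac{L_\Psi(\lambda_k-\beta_k)^2}{2\alpha_k\Gamma_k\lambda_k}\sum_{\tau=k}^N \Gamma_\tau$ for (2.7) and the bound $(\Psi(x_0) - \Psi^*)/\sum_k \lambda_k C_k$ for (2.8). The test function (with $\Psi^* = 0$ and $L_\Psi = 8$), the starting point, and the particular policy are illustrative choices, not taken from the paper.

```python
import math

L = 8.0  # Lipschitz constant of grad_psi, since |2 + 6 cos(2t)| <= 8


def psi(x):
    """Illustrative nonconvex function, bounded below by 0 and equal to 0 at the origin."""
    return sum(xi * xi + 3.0 * math.sin(xi) ** 2 for xi in x)


def grad_psi(x):
    return [2.0 * xi + 3.0 * math.sin(2.0 * xi) for xi in x]


def alpha(k):
    return 2.0 / (k + 1)


def beta(k):
    return 1.0 / (2.0 * L)


def lam(k):
    return (1.0 + alpha(k) / 4.0) * beta(k)  # slightly larger than beta_k


N = 200
gammas = []
for k in range(1, N + 1):
    gammas.append(1.0 if k == 1 else (1.0 - alpha(k)) * gammas[-1])

# tail[k] = Gamma_k + ... + Gamma_N (1-based index k)
tail = [0.0] * (N + 2)
for k in range(N, 0, -1):
    tail[k] = tail[k + 1] + gammas[k - 1]

C = []
for k in range(1, N + 1):
    coupling = L * (lam(k) - beta(k)) ** 2 / (2.0 * alpha(k) * gammas[k - 1] * lam(k))
    C.append(1.0 - L * lam(k) - coupling * tail[k])

x0 = [2.0, -1.5]
x, x_ag = list(x0), list(x0)
min_sq_grad = float("inf")
for k in range(1, N + 1):
    a = alpha(k)
    x_md = [(1.0 - a) * xa + a * xi for xa, xi in zip(x_ag, x)]
    g = grad_psi(x_md)
    min_sq_grad = min(min_sq_grad, sum(gi * gi for gi in g))
    x = [xi - lam(k) * gi for xi, gi in zip(x, g)]
    x_ag = [xm - beta(k) * gi for xm, gi in zip(x_md, g)]

psi_star = 0.0
bound = (psi(x0) - psi_star) / sum(lam(k) * C[k - 1] for k in range(1, N + 1))
print("min C_k:", min(C))
print("min ||grad||^2:", min_sq_grad, " bound:", bound)
```

The bound is a worst-case guarantee, so the observed gradient norm on a small example like this one can be far below it.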
We add a few observations about Theorem 2.1. First, in view of (2.23), it is possible to use a different assumption than the one in (2.9) on the stepsize policies for the convex case. In particular, we only need
(2.26)  $$2 - L_\Psi \beta_k - \frac{\alpha_k \lambda_k}{\beta_k} > 0$$
to show the convergence of the AG method for minimizing smooth convex problems. However, since the condition given by (2.9) is required for minimizing composite problems in Subsections 2.2 and 3.2, we state this assumption for the sake of simplicity. Second, there are various options for selecting $\{\alpha_k\}$, $\{\beta_k\}$, and $\{\lambda_k\}$ to guarantee the convergence of the AG algorithm. Below we provide some of these selections for solving both convex and nonconvex problems.
Corollary 1
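Suppose that $\{\alpha_k\}$ and $\{\beta_k\}$ in the AG method are set to
(2.27)  $$\alpha_k = \frac{2}{k+1} \quad \text{and} \quad \beta_k = \frac{1}{2 L_\Psi}.$$
a) If $\{\lambda_k\}$ satisfies
(2.28)  $$\lambda_k \in \left[ \beta_k, \left(1 + \frac{\alpha_k}{4}\right) \beta_k \right] \quad \forall k \ge 1,$$
then for any $N \ge 1$, we have
(2.29)  $$\min_{k=1,\ldots,N} \|\nabla\Psi(x_k^{md})\|^2 \le \frac{6 L_\Psi [\Psi(x_0) - \Psi^*]}{N}.$$
b) Assume that $\Psi(\cdot)$ is convex and that an optimal solution $x^*$ exists for problem (1.1). If $\{\lambda_k\}$ satisfies
(2.30)  $$\lambda_k = \frac{k \beta_k}{2} \quad \forall k \ge 1,$$
then for any $N \ge 1$, we have
(2.31)  $$\min_{k=1,\ldots,N} \|\nabla\Psi(x_k^{md})\|^2 \le \frac{96 L_\Psi^2 \|x_0 - x^*\|^2}{N(N+1)(N+2)},$$
(2.32)  $$\Psi(x_N^{ag}) - \Psi(x^*) \le \frac{4 L_\Psi \|x_0 - x^*\|^2}{N(N+1)}.$$
Proof
We first show part a). By (2.6) and (2.27), we have $\Gamma_k = 2/(k(k+1))$, which implies that $\sum_{\tau=k}^{N} \Gamma_\tau = 2 \left( \frac{1}{k} - \frac{1}{N+1} \right) \le \frac{2}{k}$ and $k \Gamma_k = \alpha_k$. These relations, together with (2.28), give $C_k \ge 1 - L_\Psi \left[ \lambda_k + \frac{(\lambda_k - \beta_k)^2}{k \alpha_k \Gamma_k \lambda_k} \right] \ge 1 - L_\Psi \beta_k \left( 1 + \frac{\alpha_k}{4} + \frac{1}{16} \right) \ge \frac{11}{32}$, and hence $\lambda_k C_k \ge \frac{11}{64 L_\Psi} \ge \frac{1}{6 L_\Psi}$. Part a) then follows from (2.8). We now show part b). With the selection in (2.27) and (2.30), we have $\alpha_k \lambda_k = \frac{k}{k+1} \beta_k \le \beta_k < \frac{1}{L_\Psi}$ and $\frac{\alpha_k}{\lambda_k \Gamma_k} = 4 L_\Psi$ for all $k \ge 1$, so (2.9) and (2.10) hold. Moreover, $\lambda_1 = \frac{1}{4 L_\Psi}$ and $\sum_{k=1}^{N} \Gamma_k^{-1} \beta_k (1 - L_\Psi \beta_k) = \frac{1}{8 L_\Psi} \sum_{k=1}^{N} k(k+1) = \frac{N(N+1)(N+2)}{24 L_\Psi}$. Substituting these values into (2.11) and (2.12) yields (2.31) and (2.32).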
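As a sanity check of the convex case in Corollary 1, the following sketch runs the AG updates on a small ill-conditioned quadratic with the aggressive policy and compares the outcome with the two bounds. It assumes $\alpha_k = 2/(k+1)$, $\beta_k = 1/(2L_\Psi)$, and $\lambda_k = k\beta_k/2$ for (2.27) and (2.30), and the bounds $96 L_\Psi^2\|x_0 - x^*\|^2/(N(N+1)(N+2))$ and $4 L_\Psi\|x_0 - x^*\|^2/(N(N+1))$ for (2.31) and (2.32). The quadratic, its diagonal `diag`, and the starting point are illustrative choices only.

```python
diag = [1.0, 10.0, 100.0]  # Hessian eigenvalues of the illustrative convex quadratic
L = max(diag)              # Lipschitz constant of the gradient; minimizer is x* = 0


def psi(x):
    return 0.5 * sum(d * xi * xi for d, xi in zip(diag, x))


def grad_psi(x):
    return [d * xi for d, xi in zip(diag, x)]


def alpha(k):
    return 2.0 / (k + 1)


def beta(k):
    return 1.0 / (2.0 * L)


def lam(k):
    return k * beta(k) / 2.0  # aggressive stepsize, growing linearly in k


N = 50
x0 = [1.0, 1.0, 1.0]
x, x_ag = list(x0), list(x0)
min_sq_grad = float("inf")
ratios = []  # alpha_k / (lambda_k * Gamma_k), expected to be constant (4 L)
gamma = 1.0
for k in range(1, N + 1):
    a = alpha(k)
    gamma = 1.0 if k == 1 else (1.0 - a) * gamma
    ratios.append(a / (lam(k) * gamma))
    x_md = [(1.0 - a) * xa + a * xi for xa, xi in zip(x_ag, x)]
    g = grad_psi(x_md)
    min_sq_grad = min(min_sq_grad, sum(gi * gi for gi in g))
    x = [xi - lam(k) * gi for xi, gi in zip(x, g)]
    x_ag = [xm - beta(k) * gi for xm, gi in zip(x_md, g)]

dist_sq = sum(xi * xi for xi in x0)  # ||x_0 - x*||^2 with x* = 0
grad_bound = 96.0 * L * L * dist_sq / (N * (N + 1) * (N + 2))
gap_bound = 4.0 * L * dist_sq / (N * (N + 1))
print("Psi(x_N^ag) - Psi*:", psi(x_ag), " bound:", gap_bound)
print("min ||grad||^2:", min_sq_grad, " bound:", grad_bound)
print("alpha_k/(lambda_k Gamma_k) range:", min(ratios), max(ratios))
```

The constant ratio printed at the end reflects the monotonicity condition on $\alpha_k/(\lambda_k\Gamma_k)$, which holds with equality for this policy.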
We now add a few remarks about the results obtained in Corollary 1. First, the rate of convergence in (2.29) for the AG method is in the same order of magnitude as that for the gradient descent method (Nest04 ()). It is also worth noting that by choosing $\lambda_k = \beta_k$ for all $k \ge 1$ in (2.28), the rate of convergence for the AG method changes only by a constant factor. However, in this case, the AG method is reduced to the gradient descent method as mentioned earlier in this subsection. Second, if the problem is convex, by choosing the more aggressive stepsize $\{\lambda_k\}$ in (2.30), the AG method exhibits the optimal rate of convergence in (2.32). Moreover, with such a selection of $\{\lambda_k\}$, the AG method can find a solution $\bar{x}$ such that $\|\nabla\Psi(\bar{x})\|^2 \le \epsilon$ in at most $\mathcal{O}(1/\epsilon^{1/3})$ iterations according to (2.31). The latter result has also been established in MonSva111 (Proposition 5.2) for an accelerated hybrid proximal extragradient method when applied to convex problems.
Observe that $\{\lambda_k\}$ in (2.28) for general nonconvex problems is in the order of $\mathcal{O}(1/L_\Psi)$, while the one in (2.30) for convex problems is more aggressive (in $\mathcal{O}(k/L_\Psi)$). An interesting question is whether we can apply the same stepsize policy in (2.30) for solving general NLP problems regardless of whether they are convex. We will discuss such a uniform treatment for both convex and nonconvex optimization for solving a certain class of composite problems in the next subsection.
2.2 Minimization of nonconvex composite functions
In this subsection, we consider a special class of NLP problems given in the form of (1.3). Our goal in this subsection is to show that we can employ a more aggressive stepsize policy in the AG method, similar to the one used in the convex case (see Theorem 2.1.b) and Corollary 1.b)), to solve these composite problems, even if $\Psi(\cdot)$ is possibly nonconvex.
Throughout this subsection, we make the following assumption about the convex (possibly nondifferentiable) component $\mathcal{X}(\cdot)$ in (1.3).
Assumption 2
There exists a constant $M$ such that $\|P(x, y, c)\| \le M$ for any $c \in (0, +\infty)$ and $x, y \in \mathbb{R}^n$, where $P(x, y, c)$ is given by