Accelerated Gradient Descent Escapes Saddle Points Faster than Gradient Descent
Abstract
Nesterov’s accelerated gradient descent (AGD), an instance of the general family of “momentum methods,” provably achieves a faster convergence rate than gradient descent (GD) in the convex setting. However, whether these methods are superior to GD in the nonconvex setting remains open. This paper studies a simple variant of AGD, and shows that it escapes saddle points and finds a second-order stationary point in $\tilde{O}(1/\epsilon^{7/4})$ iterations, faster than the $\tilde{O}(1/\epsilon^{2})$ iterations required by GD. To the best of our knowledge, this is the first Hessian-free algorithm to find a second-order stationary point faster than GD, and also the first single-loop algorithm with a faster rate than GD even in the setting of finding a first-order stationary point. Our analysis is based on two key ideas: (1) the use of a simple Hamiltonian function, inspired by a continuous-time perspective, which AGD monotonically decreases per step even for nonconvex functions, and (2) a novel framework called improve or localize, which is useful for tracking the long-term behavior of gradient-based optimization algorithms. We believe that these techniques may deepen our understanding of both acceleration algorithms and nonconvex optimization.
1 Introduction
Nonconvex optimization problems are ubiquitous in modern machine learning. While it is NP-hard to find global minima of a nonconvex function in the worst case, in the setting of machine learning it has proved useful to consider a less stringent notion of success, namely that of convergence to a first-order stationary point (where $\nabla f(\mathbf{x}) = 0$). Gradient descent (GD), a simple and fundamental optimization algorithm that has proved its value in large-scale machine learning, is known to find an $\epsilon$-first-order stationary point (where $\|\nabla f(\mathbf{x})\| \le \epsilon$) in $O(1/\epsilon^{2})$ iterations (Nesterov, 1998), and this rate is sharp (Cartis et al., 2010). Such results, however, do not seem to address the practical success of gradient descent; first-order stationarity includes local minima, saddle points or even local maxima, and a mere guarantee of convergence to such points seems unsatisfying. Indeed, architectures such as deep neural networks induce optimization surfaces that can be teeming with such highly suboptimal saddle points (Dauphin et al., 2014). It is important to study to what extent gradient descent avoids such points, particularly in the high-dimensional setting in which the directions of escape from saddle points may be few.
This paper focuses on convergence to a second-order stationary point (where $\nabla f(\mathbf{x}) = 0$ and $\lambda_{\min}(\nabla^2 f(\mathbf{x})) \ge 0$). Second-order stationarity rules out many common types of saddle points (strict saddle points where $\lambda_{\min}(\nabla^2 f(\mathbf{x})) < 0$), allowing only local minima and higher-order saddle points. A significant body of recent work, some theoretical and some empirical, shows that for a large class of well-studied machine learning problems, neither higher-order saddle points nor spurious local minima exist. That is, all second-order stationary points are (approximate) global minima for these problems. Choromanska et al. (2014) and Kawaguchi (2016) present such a result for learning multilayer neural networks, Bandeira et al. (2016) and Mei et al. (2017) for synchronization and MaxCut, Boumal et al. (2016) for smooth semidefinite programs, Bhojanapalli et al. (2016) for matrix sensing, Ge et al. (2016) for matrix completion, and Ge et al. (2017) for robust PCA. These results strongly motivate the quest for efficient algorithms to find second-order stationary points.
Hessian-based algorithms can explicitly compute curvatures and thereby avoid saddle points (e.g., Nesterov and Polyak, 2006; Curtis et al., 2014), but these algorithms are computationally infeasible in the high-dimensional regime. GD, by contrast, is known to get stuck at strict saddle points (Nesterov, 1998, Section 1.2.3). Recent work has reconciled this conundrum in favor of GD; Jin et al. (2017), building on earlier work of Ge et al. (2015), show that a perturbed version of GD converges to an $\epsilon$-relaxed version of a second-order stationary point (see Definition 5) in $\tilde{O}(1/\epsilon^{2})$ iterations. That is, perturbed GD in fact finds second-order stationary points as fast as standard GD finds first-order stationary points, up to logarithmic factors in dimension.
On the other hand, GD is known to be suboptimal in the convex case. In a celebrated paper, Nesterov (1983) showed that an accelerated version of gradient descent (AGD) finds an $\epsilon$-suboptimal point (see Section 2.2) in $O(1/\sqrt{\epsilon})$ steps, while gradient descent takes $O(1/\epsilon)$ steps. The basic idea of acceleration has been used to design faster algorithms for a range of other convex optimization problems (Beck and Teboulle, 2009; Nesterov, 2012; Lee and Sidford, 2013; Shalev-Shwartz and Zhang, 2014). We will refer to this general family as “momentum-based methods.”
Such results have focused on the convex setting. It is open as to whether momentum-based methods yield faster rates in the nonconvex setting, specifically when we consider the convergence criterion of second-order stationarity. We are thus led to ask the following question: Do momentum-based methods yield faster convergence than GD in the presence of saddle points?
This paper answers this question in the affirmative. We present a simple momentum-based algorithm (PAGD for “perturbed AGD”) that finds an $\epsilon$-second-order stationary point in $\tilde{O}(1/\epsilon^{7/4})$ iterations, faster than the $\tilde{O}(1/\epsilon^{2})$ iterations required by GD.
The pseudocode of our algorithm is presented in Algorithm 2.

Perturbation (Lines 3-4): when the gradient is small, we add a small perturbation sampled uniformly from a $d$-dimensional ball with radius $r$. The homogeneous nature of this perturbation mitigates our lack of knowledge of the curvature tensor at or near saddle points.

Negative Curvature Exploitation (NCE, Lines 8-9; pseudocode in Algorithm 3): when the function becomes “too nonconvex” along $\mathbf{y}_t$ to $\mathbf{x}_t$, we reset the momentum and decide whether to exploit negative curvature depending on the magnitude of the current momentum $\mathbf{v}_t$.
We note that both components are straightforward to implement and increase computation by a constant factor. The perturbation idea follows from Ge et al. (2015) and Jin et al. (2017), while NCE is inspired by Carmon et al. (2017). To the best of our knowledge, PAGD is the first Hessian-free algorithm to find a second-order stationary point in $\tilde{O}(1/\epsilon^{7/4})$ steps. Note also that PAGD is a “single-loop algorithm,” meaning that it does not require an inner loop of optimization of a surrogate function. It is the first single-loop algorithm to achieve a $\tilde{O}(1/\epsilon^{7/4})$ rate even in the setting of finding a first-order stationary point.
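The two components described above can be sketched in code. This is a minimal Python/NumPy sketch assembled from the prose description, not the authors' reference implementation: the names `sample_ball`, `nce_step` and `pagd`, the argument names, and the guard that skips the curvature test when the look-ahead point coincides with the iterate are assumptions made for illustration. All hyperparameters are supplied by the caller.

```python
import numpy as np

def sample_ball(rng, dim, radius):
    """Draw a point uniformly from the dim-dimensional ball of the given radius."""
    direction = rng.standard_normal(dim)
    direction /= np.linalg.norm(direction)
    # Scaling by U^(1/dim) makes the sample uniform in volume rather than in radius.
    return radius * rng.uniform() ** (1.0 / dim) * direction

def nce_step(f, x, v, s):
    """Negative curvature exploitation; the momentum is reset to zero in both cases."""
    v_norm = np.linalg.norm(v)
    if v_norm >= s:
        x_next = x  # large momentum: discarding it already lowers the Hamiltonian
    else:
        delta = s * v / v_norm
        # small momentum: try both points along +/- v and keep the lower one
        x_next = min((x + delta, x - delta), key=f)
    return x_next, np.zeros_like(x)

def pagd(f, grad, x0, eta, theta, gamma, s, r, eps, gap, n_iters, seed=0):
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    v = np.zeros_like(x)
    last_perturb = -gap - 1  # assumption: a perturbation is allowed at the first step
    for t in range(n_iters):
        # Perturbation: small gradient, and none added in the last `gap` steps.
        if np.linalg.norm(grad(x)) <= eps and t - last_perturb > gap:
            x = x + sample_ball(rng, x.size, r)
            last_perturb = t
        # Standard AGD step from the look-ahead point y.
        y = x + (1.0 - theta) * v
        g = grad(y)
        x_next = y - eta * g
        v_next = x_next - x
        # Curvature test: is f "too nonconvex" between y and x?
        diff = x - y
        too_nonconvex = f(x) <= f(y) + g @ diff - 0.5 * gamma * (diff @ diff)
        if np.any(diff != 0) and too_nonconvex:  # guard against x == y (assumption)
            x_next, v_next = nce_step(f, x, v, s)
        x, v = x_next, v_next
    return x
```

As a usage example, starting exactly at the strict saddle of the hypothetical test function $f(x_1, x_2) = x_1^4/4 - x_1^2/2 + x_2^2/2$, the call `pagd(f, grad, [0.0, 0.0], eta=0.05, theta=0.1, gamma=0.2, s=0.01, r=0.01, eps=1e-3, gap=50, n_iters=2000)` is expected to leave the saddle and settle near one of the two minima at $(\pm 1, 0)$; the parameter values here are illustrative and not those prescribed by the analysis.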
1.1 Related Work
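Guarantees | Oracle | Algorithm | Iterations | Simplicity
First-order stationary point | Gradient | GD (Nesterov, 1998) | $O(1/\epsilon^{2})$ | Single-loop
First-order stationary point | Gradient | AGD (Ghadimi and Lan, 2016) | $\tilde{O}(1/\epsilon^{2})$ | Single-loop
First-order stationary point | Gradient | Carmon et al. (2017) | $\tilde{O}(1/\epsilon^{7/4})$ | Nested-loop
Second-order stationary point | Hessian-vector | Carmon et al. (2016) | $\tilde{O}(1/\epsilon^{7/4})$ | Nested-loop
Second-order stationary point | Hessian-vector | Agarwal et al. (2017) | $\tilde{O}(1/\epsilon^{7/4})$ | Nested-loop
Second-order stationary point | Gradient | Noisy GD (Ge et al., 2015) | $O(\mathrm{poly}(d/\epsilon))$ | Single-loop
Second-order stationary point | Gradient | Perturbed GD (Jin et al., 2017) | $\tilde{O}(1/\epsilon^{2})$ | Single-loop
Second-order stationary point | Gradient | Perturbed AGD [This Work] | $\tilde{O}(1/\epsilon^{7/4})$ | Single-loop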

In this section, we review related work from the perspective of both nonconvex optimization and momentum/acceleration. For clarity of presentation, when discussing rates, we focus on the dependence on the accuracy $\epsilon$ and the dimension $d$ while assuming all other problem parameters are constant. Table 1 presents a comparison of the current work with previous work.
Convergence to first-order stationary points: Traditional analyses in this case assume only Lipschitz gradients (see Definition 1). Nesterov (1998) shows that GD finds an $\epsilon$-first-order stationary point in $O(1/\epsilon^{2})$ steps. Ghadimi and Lan (2016) guarantee that AGD also converges in $\tilde{O}(1/\epsilon^{2})$ steps. Under the additional assumption of Lipschitz Hessians (see Definition 4), Carmon et al. (2017) develop a new algorithm that converges in $\tilde{O}(1/\epsilon^{7/4})$ steps. Their algorithm is a nested-loop algorithm, where the outer loop adds a proximal term to reduce the nonconvex problem to a convex subproblem. A key novelty in their algorithm is the idea of “negative curvature exploitation,” which inspired a similar step in our algorithm. In addition to the qualitative and quantitative differences between Carmon et al. (2017) and the current work, as summarized in Table 1, we note that while Carmon et al. (2017) analyze AGD applied to convex subproblems, we analyze AGD applied directly to nonconvex functions through a novel Hamiltonian framework.
Convergence to second-order stationary points: All results in this setting assume Lipschitz conditions for both the gradient and Hessian. Classical approaches, such as cubic regularization (Nesterov and Polyak, 2006) and trust region algorithms (Curtis et al., 2014), require access to Hessians, and are known to find $\epsilon$-second-order stationary points in $O(1/\epsilon^{1.5})$ steps. However, the requirement of these algorithms to form the Hessian makes them infeasible for high-dimensional problems. A second set of algorithms utilizes only Hessian-vector products instead of the explicit Hessian; in many applications such products can be computed efficiently. Rates of $\tilde{O}(1/\epsilon^{7/4})$ have been established for such algorithms (Carmon et al., 2016; Agarwal et al., 2017; Royer and Wright, 2017). Finally, in the realm of purely gradient-based algorithms, Ge et al. (2015) present the first polynomial guarantees for a perturbed version of GD, and Jin et al. (2017) sharpen it to $\tilde{O}(1/\epsilon^{2})$. For the special case of quadratic functions, O’Neill and Wright (2017) analyze the behavior of AGD around critical points and show that it escapes saddle points faster than GD. We note that the current work is the first to achieve a rate of $\tilde{O}(1/\epsilon^{7/4})$ for general nonconvex functions.
Acceleration: There is also a rich literature that aims to understand momentum methods; e.g., Allen-Zhu and Orecchia (2014) view AGD as a linear coupling of GD and mirror descent, Su et al. (2016) and Wibisono et al. (2016) view AGD as a second-order differential equation, and Bubeck et al. (2015) view AGD from a geometric perspective. Most of this work is tailored to the convex setting, and it is unclear and nontrivial to generalize the results to a nonconvex setting. There are also several papers that study AGD with relaxed versions of convexity; see Necoara et al. (2015), Li and Lin (2017) and references therein for overviews of these results.
1.2 Main Techniques
Our results rely on the following three key ideas. To the best of our knowledge, the first two are novel, while the third one was delineated in Jin et al. (2017).
Hamiltonian: A major challenge in analyzing momentum-based algorithms is that the objective function does not decrease monotonically, as is the case for GD. To overcome this in the convex setting, several Lyapunov functions have been proposed (Wilson et al., 2016). However, these Lyapunov functions involve the global minimum $\mathbf{x}^\star$, which cannot be computed by the algorithm, and are thus of limited value in the nonconvex setting. A key technical contribution of this paper is the design of a function which is both computable and tracks the progress of AGD. The function takes the form of a Hamiltonian:
$$E_t := f(\mathbf{x}_t) + \frac{1}{2\eta}\|\mathbf{v}_t\|^2, \tag{1}$$
i.e., a sum of potential energy and kinetic energy terms. It is monotonically decreasing in the continuous-time setting. This is not the case in general in the discrete-time setting, a fact which requires us to incorporate the NCE step.
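To make the Hamiltonian concrete, the following sketch runs plain AGD (a gradient step from the look-ahead point, with momentum taken as the difference of consecutive iterates) and records $f(\mathbf{x}_t) + \|\mathbf{v}_t\|^2/(2\eta)$ at every step. The convex quadratic test function, the step size and the momentum parameter are illustrative assumptions, and the function name `agd_hamiltonian_trace` is hypothetical.

```python
import numpy as np

def agd_hamiltonian_trace(f, grad, x0, eta, theta, n_iters):
    """Run AGD and return the objective values and Hamiltonian values along the path."""
    x = np.asarray(x0, dtype=float)
    v = np.zeros_like(x)
    f_vals = [f(x)]
    energies = [f(x) + v @ v / (2 * eta)]
    for _ in range(n_iters):
        y = x + (1 - theta) * v        # look-ahead point
        x_next = y - eta * grad(y)     # gradient step taken from y
        v = x_next - x                 # new momentum
        x = x_next
        f_vals.append(f(x))
        energies.append(f(x) + v @ v / (2 * eta))  # potential + kinetic energy
    return np.array(f_vals), np.array(energies)

# Illustrative convex quadratic with smoothness constant 1 (assumption).
eigs = np.array([0.01, 1.0])
f = lambda x: 0.5 * np.sum(eigs * x**2)
grad = lambda x: eigs * x
f_vals, energies = agd_hamiltonian_trace(f, grad, [1.0, 1.0], eta=0.25, theta=0.1, n_iters=200)
```

On this convex example the recorded Hamiltonian is non-increasing from step to step, whereas the objective values alone carry no such guarantee under momentum.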
Improve or localize: Another key technical contribution of this paper is in formalizing a simple but powerful framework for analyzing nonconvex optimization algorithms. This framework requires us to show that for a given algorithm, either the algorithm makes significant progress or the iterates do not move much. We call this the improve-or-localize phenomenon. For instance, when progress is measured by function value, it is easy to show that for GD, with proper choice of learning rate ($\eta \le 1/\ell$), we have:
$$\frac{1}{2\eta}\sum_{\tau=0}^{t-1}\|\mathbf{x}_{\tau+1}-\mathbf{x}_{\tau}\|^2 \le f(\mathbf{x}_0) - f(\mathbf{x}_t).$$
For AGD, a similar lemma can be shown by replacing the objective function with the Hamiltonian (see Lemma 4). Once this phenomenon is established, we can conclude that if an algorithm does not make much progress, it is localized to a small ball, and we can then approximate the objective function by either a linear or a quadratic function (depending on smoothness assumptions) in this small local region. Moreover, an upper bound on the sum of squared step lengths $\sum_{\tau}\|\mathbf{x}_{\tau+1}-\mathbf{x}_{\tau}\|^2$ lets us conclude that iterates do not oscillate much in this local region (oscillation is a unique phenomenon of momentum algorithms, as can be seen even in the convex setting). This gives us better control of approximation error.
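The GD version of improve-or-localize can be checked numerically. For an $\ell$-smooth function and $\eta \le 1/\ell$, the descent lemma gives $f(\mathbf{x}_{\tau+1}) \le f(\mathbf{x}_\tau) - \frac{1}{2\eta}\|\mathbf{x}_{\tau+1}-\mathbf{x}_\tau\|^2$, and summing over steps bounds the total squared movement by $2\eta$ times the total decrease. The sketch below verifies this on a hypothetical nonconvex test function whose Hessian eigenvalues lie in $[-0.8, 1.2]$; the function and the step size are assumptions chosen for illustration.

```python
import numpy as np

def gd_path(grad, x0, eta, n_iters):
    """Return all GD iterates as an array of shape (n_iters + 1, dim)."""
    xs = [np.asarray(x0, dtype=float)]
    for _ in range(n_iters):
        xs.append(xs[-1] - eta * grad(xs[-1]))
    return np.array(xs)

# Nonconvex and 1.2-smooth: the Hessian is diag(-cos(x_i) + 0.2).
f = lambda x: np.sum(np.cos(x)) + 0.1 * (x @ x)
grad = lambda x: -np.sin(x) + 0.2 * x

eta, n_iters = 0.5, 100                      # eta <= 1 / 1.2
xs = gd_path(grad, [0.3, -2.0, 1.0], eta, n_iters)
moved = np.sum(np.diff(xs, axis=0) ** 2)     # sum of squared step lengths
progress = f(xs[0]) - f(xs[-1])              # total decrease in function value
radius = np.linalg.norm(xs[-1] - xs[0])      # net displacement from the start
```

By the Cauchy-Schwarz inequality the same bound localizes the iterates: `radius` is at most `sqrt(2 * eta * n_iters * progress)`, so little progress implies that the path stays in a small ball.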
Coupling sequences for escaping saddle points: When an algorithm arrives in the neighborhood of a strict saddle point, where $\lambda_{\min}(\nabla^2 f(\mathbf{x})) < 0$, all we know is that there exists a direction of escape (the direction of the minimum eigenvector of $\nabla^2 f(\mathbf{x})$); denote it by $\mathbf{e}_1$. To avoid such points, the algorithm randomly perturbs the current iterate uniformly in a small ball, and runs AGD starting from this point $\tilde{\mathbf{x}}_0$. As in Jin et al. (2017), we can divide this ball into a “stuck region,” $\mathcal{X}_{\text{stuck}}$, starting from which AGD does not escape the saddle quickly, and its complement from which AGD escapes quickly. In order to show quick escape from a saddle point, we must show that the volume of $\mathcal{X}_{\text{stuck}}$ is very small compared to that of the ball. Though $\mathcal{X}_{\text{stuck}}$ may be without an analytical form, one can control the rate of escape by studying two AGD sequences that start from two realizations of perturbation, $\mathbf{x}_0$ and $\mathbf{x}_0'$, which are separated along $\mathbf{e}_1$ by a small distance $r_0$. In this case, at least one of the sequences escapes the saddle point quickly, which proves that the width of $\mathcal{X}_{\text{stuck}}$ along $\mathbf{e}_1$ cannot be greater than $r_0$, and hence $\mathcal{X}_{\text{stuck}}$ has small volume.
2 Preliminaries
In this section, we will review some well-known results on GD and AGD in the strongly convex setting, and existing results on convergence of GD to second-order stationary points.
2.1 Notation
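Bold uppercase letters ($\mathbf{A}, \mathbf{B}$) denote matrices and bold lowercase letters ($\mathbf{x}, \mathbf{y}$) denote vectors. For vectors, $\|\cdot\|$ denotes the $\ell_2$-norm. For matrices, $\|\cdot\|$ denotes the spectral norm and $\lambda_{\min}(\cdot)$ denotes the minimum eigenvalue. For $f: \mathbb{R}^d \to \mathbb{R}$, $\nabla f(\cdot)$ and $\nabla^2 f(\cdot)$ denote its gradient and Hessian respectively, and $f^\star$ denotes its global minimum. We use $O(\cdot), \Theta(\cdot), \Omega(\cdot)$ to hide absolute constants, and $\tilde{O}(\cdot), \tilde{\Theta}(\cdot), \tilde{\Omega}(\cdot)$ to hide absolute constants and polylog factors for all problem parameters.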
2.2 Convex Setting
To minimize a function $f(\cdot)$, GD performs the following sequence of steps:
$$\mathbf{x}_{t+1} = \mathbf{x}_t - \eta\nabla f(\mathbf{x}_t).$$
The suboptimality of GD and the improvement achieved by AGD can be clearly illustrated for the case of smooth and strongly convex functions.
Definition 1.
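A differentiable function $f(\cdot)$ is $\ell$-smooth (or $\ell$-gradient Lipschitz) if:
$$\|\nabla f(\mathbf{x}_1) - \nabla f(\mathbf{x}_2)\| \le \ell\,\|\mathbf{x}_1 - \mathbf{x}_2\| \quad \forall\, \mathbf{x}_1, \mathbf{x}_2.$$
The gradient Lipschitz property asserts that the gradient cannot change too rapidly in a small local region.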
Definition 2.
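A twice-differentiable function $f(\cdot)$ is $\alpha$-strongly convex if $\lambda_{\min}(\nabla^2 f(\mathbf{x})) \ge \alpha$ for all $\mathbf{x}$.
Let $f^\star := \min_{\mathbf{y}} f(\mathbf{y})$. A point $\mathbf{x}$ is said to be $\epsilon$-suboptimal if $f(\mathbf{x}) \le f^\star + \epsilon$. The following theorem gives the convergence rate of GD and AGD for smooth and strongly convex functions.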
Theorem 1 (Nesterov (2004)).
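Assume that the function $f(\cdot)$ is $\ell$-smooth and $\alpha$-strongly convex. Then, for any $\epsilon > 0$, the iteration complexities to find an $\epsilon$-suboptimal point are as follows:

GD with $\eta = 1/\ell$: $O\big((\ell/\alpha)\cdot\log((f(\mathbf{x}_0) - f^\star)/\epsilon)\big)$.

AGD (Algorithm 1) with $\eta = 1/\ell$ and $\theta = \sqrt{\alpha/\ell}$: $O\big(\sqrt{\ell/\alpha}\cdot\log((f(\mathbf{x}_0) - f^\star)/\epsilon)\big)$.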
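The number of iterations of GD depends linearly on the ratio $\ell/\alpha$, which is called the condition number of $f(\cdot)$ since $\alpha\mathbf{I} \preceq \nabla^2 f(\mathbf{x}) \preceq \ell\mathbf{I}$. Clearly $\ell \ge \alpha$ and hence the condition number is always at least one. Denoting the condition number by $\kappa$, we highlight two important aspects of AGD: (1) the momentum parameter satisfies $\theta = 1/\sqrt{\kappa}$ and (2) AGD improves upon GD by a factor of $\sqrt{\kappa}$.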
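The gap between the two rates is easy to observe on a quadratic. The sketch below counts the iterations that GD and AGD need to reach a target suboptimality on a diagonal quadratic with condition number 100. The test problem, the tolerance, and the choice of momentum parameter as the inverse square root of the condition number are illustrative assumptions, and the helper names are hypothetical.

```python
import numpy as np

ell, alpha = 1.0, 0.01                       # smoothness and strong convexity
eigs = np.linspace(alpha, ell, 20)           # Hessian eigenvalues, kappa = 100
f = lambda x: 0.5 * np.sum(eigs * x**2)      # minimum value 0, attained at x = 0
grad = lambda x: eigs * x

def gd_iters(x0, eta, eps, max_iters=100_000):
    """Number of GD steps until f(x) <= eps."""
    x = np.array(x0, dtype=float)
    for t in range(max_iters):
        if f(x) <= eps:
            return t
        x = x - eta * grad(x)
    return max_iters

def agd_iters(x0, eta, theta, eps, max_iters=100_000):
    """Number of AGD steps until f(x) <= eps."""
    x = np.array(x0, dtype=float)
    v = np.zeros_like(x)
    for t in range(max_iters):
        if f(x) <= eps:
            return t
        y = x + (1 - theta) * v              # look-ahead point
        x_next = y - eta * grad(y)
        v, x = x_next - x, x_next
    return max_iters

x0 = np.ones(20)
n_gd = gd_iters(x0, eta=1 / ell, eps=1e-8)
n_agd = agd_iters(x0, eta=1 / ell, theta=np.sqrt(alpha / ell), eps=1e-8)
```

On this problem `n_agd` comes out several times smaller than `n_gd`, in line with the square-root improvement in the condition number.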
2.3 Nonconvex Setting
For nonconvex functions, finding global minima is NP-hard in the worst case. The best one can hope for in this setting is convergence to stationary points. There are various levels of stationarity.
Definition 3.
$\mathbf{x}$ is an $\epsilon$-first-order stationary point of function $f(\cdot)$ if $\|\nabla f(\mathbf{x})\| \le \epsilon$.
As mentioned in Section 1, for most nonconvex problems encountered in practice, a majority of first-order stationary points turn out to be saddle points. Second-order stationary points require not only zero gradient, but also positive semidefinite Hessian, ruling out most saddle points. Second-order stationary points are meaningful, however, only when the Hessian is continuous.
Definition 4.
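A twice-differentiable function $f(\cdot)$ is $\rho$-Hessian Lipschitz if:
$$\|\nabla^2 f(\mathbf{x}_1) - \nabla^2 f(\mathbf{x}_2)\| \le \rho\,\|\mathbf{x}_1 - \mathbf{x}_2\| \quad \forall\, \mathbf{x}_1, \mathbf{x}_2.$$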
Definition 5 (Nesterov and Polyak (2006)).
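For a $\rho$-Hessian Lipschitz function $f(\cdot)$, $\mathbf{x}$ is an $\epsilon$-second-order stationary point if:
$$\|\nabla f(\mathbf{x})\| \le \epsilon \quad\text{and}\quad \lambda_{\min}(\nabla^2 f(\mathbf{x})) \ge -\sqrt{\rho\epsilon}.$$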
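For small problems, the definition can be checked directly. The sketch below assumes the standard form of the condition from Nesterov and Polyak (2006), namely a gradient norm of at most $\epsilon$ together with a minimum Hessian eigenvalue of at least $-\sqrt{\rho\epsilon}$. The function name is hypothetical, and forming the full Hessian is meant only as a diagnostic, since the algorithm studied here is Hessian-free.

```python
import numpy as np

def is_second_order_stationary(grad_x, hess_x, eps, rho):
    """Check ||grad|| <= eps and lambda_min(Hessian) >= -sqrt(rho * eps)."""
    grad_ok = np.linalg.norm(grad_x) <= eps
    lam_min = np.linalg.eigvalsh(hess_x)[0]   # eigvalsh returns eigenvalues in ascending order
    return bool(grad_ok and lam_min >= -np.sqrt(rho * eps))

# A strict saddle (Hessian with eigenvalues 1 and -1) versus a local minimum.
at_saddle = is_second_order_stationary(np.zeros(2), np.diag([1.0, -1.0]), eps=0.01, rho=1.0)
at_minimum = is_second_order_stationary(np.zeros(2), np.eye(2), eps=0.01, rho=1.0)
```

Both points have zero gradient, so both are first-order stationary; only the second passes the curvature test.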
The following theorem gives the convergence rate of a perturbed version of GD to secondorder stationary points. See Jin et al. (2017) for a detailed description of the algorithm.
Theorem 2 ((Jin et al., 2017)).
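Assume that the function $f(\cdot)$ is $\ell$-smooth and $\rho$-Hessian Lipschitz. Then, for any $\epsilon > 0$, perturbed GD outputs an $\epsilon$-second-order stationary point w.h.p. in iterations:
$$\tilde{O}\left(\frac{\ell\,(f(\mathbf{x}_0) - f^\star)}{\epsilon^{2}}\right).$$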
Note that this rate is essentially the same as that of GD for convergence to firstorder stationary points. In particular, it only has polylogarithmic dependence on the dimension.
3 Main Result
In this section, we present our algorithm and main result. As mentioned in Section 1, the algorithm we propose is essentially AGD with two key differences (see Algorithm 2): perturbation and negative curvature exploitation (NCE). A perturbation is added when the gradient is small (to escape saddle points), and no more frequently than once in $\mathscr{T}$ steps. The perturbation is sampled uniformly from a $d$-dimensional ball with radius $r$. The specific choices of gap and uniform distribution are for technical convenience (they are sufficient for our theoretical result but not necessary).
NCE (Algorithm 3) is explicitly designed to guarantee decrease of the Hamiltonian (1). When it is triggered, i.e., when
$$f(\mathbf{x}_t) \le f(\mathbf{y}_t) + \langle\nabla f(\mathbf{y}_t), \mathbf{x}_t - \mathbf{y}_t\rangle - \frac{\gamma}{2}\|\mathbf{x}_t - \mathbf{y}_t\|^2, \tag{2}$$
the function has a large negative curvature between the current iterates $\mathbf{x}_t$ and $\mathbf{y}_t$. In this case, if the momentum $\mathbf{v}_t$ is small, then $\mathbf{y}_t$ and $\mathbf{x}_t$ are close, so the large negative curvature also carries over to the Hessian at $\mathbf{x}_t$ due to the Lipschitz property. Assaying two points along $\pm(\mathbf{y}_t - \mathbf{x}_t)$ around $\mathbf{x}_t$ gives one point that is negatively aligned with $\nabla f(\mathbf{x}_t)$ and yields a decreasing function value and Hamiltonian. If the momentum $\mathbf{v}_t$ is large, negative curvature can no longer be exploited, but fortunately resetting the momentum to zero kills the second term in (1), significantly decreasing the Hamiltonian.
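Setting of hyperparameters: Let $\epsilon$ be the target accuracy for a second-order stationary point, let $\ell$ and $\rho$ be the gradient- and Hessian-Lipschitz parameters, and let $c$ and $\chi$ be an absolute constant and a log factor to be specified later. Let $\kappa := \ell/\sqrt{\rho\epsilon}$, and set
$$\eta = \frac{1}{4\ell}, \quad \theta = \frac{1}{4\sqrt{\kappa}}, \quad \gamma = \frac{\theta^2}{\eta}, \quad s = \frac{\gamma}{4\rho}, \quad \mathscr{T} = \sqrt{\kappa}\cdot\chi c, \quad r = \eta\epsilon\cdot\chi^{-5}c^{-8}. \tag{3}$$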
The following theorem is the main result of this paper.
Theorem 3.
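Assume that the function $f(\cdot)$ is $\ell$-smooth and $\rho$-Hessian Lipschitz. There exists an absolute constant $c_{\max}$ such that for any $\delta > 0$, $\epsilon \le \ell^2/\rho$, and $\Delta_f \ge f(\mathbf{x}_0) - f^\star$, if $\chi = \max\{1, \log\frac{d\ell\Delta_f}{\rho\epsilon\delta}\}$, $c \ge c_{\max}$, and we run PAGD (Algorithm 2) with choice of parameters according to (3), then with probability at least $1 - \delta$, one of the iterates $\mathbf{x}_t$ will be an $\epsilon$-second-order stationary point in the following number of iterations:
$$O\left(\frac{\ell^{1/2}\rho^{1/4}(f(\mathbf{x}_0) - f^\star)}{\epsilon^{7/4}}\log^{6}\left(\frac{d\ell\Delta_f}{\rho\epsilon\delta}\right)\right).$$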
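Theorem 3 says that when PAGD is run for the designated number of steps (which is polylogarithmic in dimension), at least one of the iterates is an $\epsilon$-second-order stationary point. We focus on the case of small $\epsilon$ (i.e., $\epsilon \le \ell^2/\rho$) so that the Hessian requirement for the $\epsilon$-second-order stationary point ($\lambda_{\min}(\nabla^2 f(\mathbf{x})) \ge -\sqrt{\rho\epsilon}$) is nontrivial. Note that $\epsilon \le \ell^2/\rho$ implies $\kappa = \ell/\sqrt{\rho\epsilon} \ge 1$, which can be viewed as a condition number, akin to that in the convex setting. Comparing Theorem 3 with Theorem 2, PAGD, with a momentum parameter $\theta = \Theta(1/\sqrt{\kappa})$, achieves $\tilde{\Theta}(\sqrt{\kappa})$ better iteration complexity compared to PGD.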
Output $\epsilon$-second-order stationary point: Although Theorem 3 only guarantees that one of the iterates is an $\epsilon$-second-order stationary point, it is straightforward to identify one of them by adding a proper termination condition: once the gradient is small and satisfies the precondition to add a perturbation, we can keep track of the point $\tilde{\mathbf{x}}_{t_0}$ prior to adding the perturbation, and compare the Hamiltonian at $\tilde{\mathbf{x}}_{t_0}$ with the one $\mathscr{T}$ steps after. If the Hamiltonian decreases by $\mathscr{E} = \tilde{\Theta}(\sqrt{\epsilon^3/\rho})$, then the algorithm has made progress; otherwise $\tilde{\mathbf{x}}_{t_0}$ is an $\epsilon$-second-order stationary point according to Lemma 8. Doing so adds a hyperparameter (threshold $\mathscr{E}$) but does not increase complexity.
4 Overview of Analysis
In this section, we will present an overview of the proof of Theorem 3. Section 4.1 presents the Hamiltonian for AGD and its key property of monotonic decrease. This leads to Section 4.2 where the improve-or-localize lemma is stated, as well as the main intuition behind acceleration. Section 4.3 demonstrates how to apply these tools to prove Theorem 3. Complete details can be found in the appendix.
4.1 Hamiltonian
While GD guarantees decrease of function value in every step (even for nonconvex problems), the biggest stumbling block to analyzing AGD is that it is less clear how to keep track of “progress.” Known Lyapunov functions for AGD (Wilson et al., 2016) are restricted to the convex setting and furthermore are not computable by the algorithm (as they depend on $\mathbf{x}^\star$).
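To deepen the understanding of AGD in a nonconvex setting, we inspect it from a dynamical systems perspective, where we fix the ratio $\tilde{\theta} = \theta/\sqrt{\eta}$ to be a constant, while letting $\eta \to 0$. This leads to an ODE which is the continuous limit of AGD (Su et al., 2016):
$$\ddot{\mathbf{x}} + \tilde{\theta}\dot{\mathbf{x}} + \nabla f(\mathbf{x}) = 0, \tag{4}$$
where $\ddot{\mathbf{x}}$ and $\dot{\mathbf{x}}$ are derivatives with respect to time $t$. This equation is a second-order dynamical equation with dissipative forces $-\tilde{\theta}\dot{\mathbf{x}}$. Integrating both sides, we obtain:
$$f(\mathbf{x}(t_2)) + \frac{1}{2}\|\dot{\mathbf{x}}(t_2)\|^2 = f(\mathbf{x}(t_1)) + \frac{1}{2}\|\dot{\mathbf{x}}(t_1)\|^2 - \tilde{\theta}\int_{t_1}^{t_2}\|\dot{\mathbf{x}}(t)\|^2\,dt. \tag{5}$$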
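The integration step can be spelled out. The derivation below is a short sketch assuming the ODE has the damped form $\ddot{\mathbf{x}} + \tilde{\theta}\dot{\mathbf{x}} + \nabla f(\mathbf{x}) = 0$ with constant damping $\tilde{\theta}$: differentiate the energy along a trajectory, substitute the ODE, and integrate from $t_1$ to $t_2$.

```latex
\begin{align*}
\frac{d}{dt}\Big[f(\mathbf{x}) + \tfrac{1}{2}\|\dot{\mathbf{x}}\|^2\Big]
  &= \langle \nabla f(\mathbf{x}), \dot{\mathbf{x}}\rangle
   + \langle \dot{\mathbf{x}}, \ddot{\mathbf{x}}\rangle \\
  &= \langle \dot{\mathbf{x}}, \nabla f(\mathbf{x}) + \ddot{\mathbf{x}}\rangle
   = -\tilde{\theta}\,\|\dot{\mathbf{x}}\|^2 \le 0, \\
\Big[f(\mathbf{x}(t)) + \tfrac{1}{2}\|\dot{\mathbf{x}}(t)\|^2\Big]_{t_1}^{t_2}
  &= -\tilde{\theta}\int_{t_1}^{t_2}\|\dot{\mathbf{x}}(t)\|^2\,dt .
\end{align*}
```

No convexity of $f$ is used anywhere in this computation, only differentiability.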
Using physical language, $f(\mathbf{x})$ is a potential energy while $\|\dot{\mathbf{x}}\|^2/2$ is a kinetic energy, and the sum is a Hamiltonian. The integral shows that the Hamiltonian decreases monotonically with time $t$, and the decrease is given by the dissipation term $\tilde{\theta}\int_{t_1}^{t_2}\|\dot{\mathbf{x}}(t)\|^2\,dt$. Note that (5) holds regardless of the convexity of $f(\cdot)$. This monotonic decrease of the Hamiltonian can in fact be extended to the discretized version of AGD when the function is convex, or mildly nonconvex:
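Lemma 4 (Hamiltonian decreases monotonically).
Assume that the function $f(\cdot)$ is $\ell$-smooth and set the learning rate to be $\eta \le \frac{1}{2\ell}$ and $\theta \in [2\eta\gamma, \frac{1}{2}]$ in AGD (Algorithm 1). Then, for every iteration $t$ where (2) does not hold, we have:
$$f(\mathbf{x}_{t+1}) + \frac{1}{2\eta}\|\mathbf{v}_{t+1}\|^2 \le f(\mathbf{x}_t) + \frac{1}{2\eta}\|\mathbf{v}_t\|^2 - \frac{\theta}{2\eta}\|\mathbf{v}_t\|^2 - \frac{\eta}{4}\|\nabla f(\mathbf{y}_t)\|^2. \tag{6}$$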
Denote the discrete Hamiltonian as $E_t := f(\mathbf{x}_t) + \frac{1}{2\eta}\|\mathbf{v}_t\|^2$, and note that in AGD, $\mathbf{v}_t = \mathbf{x}_t - \mathbf{x}_{t-1}$. Lemma 4 tolerates nonconvexity with curvature at most $\gamma = \Theta(\theta/\eta)$. Unfortunately, when the function becomes too nonconvex in certain regions (so that (2) holds), the analogy between the continuous and discretized versions breaks down and (6) no longer holds. In fact, standard AGD can even increase the Hamiltonian in this regime (see Appendix A.1 for more details). This motivates us to modify the algorithm by adding the NCE step, which addresses this issue. We have the following result:
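Lemma 5.
Assume that $f(\cdot)$ is $\ell$-smooth and $\rho$-Hessian Lipschitz. For every iteration $t$ of Algorithm 2 where (2) holds (thus running NCE), we have:
$$E_{t+1} \le E_t - \min\Big\{\frac{s^2}{2\eta},\ \frac{1}{2}(\gamma - 2\rho s)s^2\Big\}.$$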
4.2 Improve or Localize
One significant challenge in the analysis of gradient-based algorithms for nonconvex optimization is that many phenomena (for instance, the accumulation of momentum and the escape from saddle points via perturbation) are multiple-step behaviors; they do not happen in each step. We address this issue by developing a general technique for analyzing the long-term behavior of such algorithms.
In our case, to track the long-term behavior of AGD, one key observation from Lemma 4 is that the amount of progress actually relates to movement of the iterates, which leads to the following improve-or-localize lemma:
Corollary 6 (Improve or localize).
Corollary 6 says that the algorithm either makes progress in terms of the Hamiltonian, or the iterates do not move much. In the second case, Corollary 6 allows us to approximate the dynamics of the iterates with a quadratic approximation of $f$.
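The acceleration phenomenon is rooted in, and can be seen clearly for, a quadratic, where the function can be decomposed into eigen-directions. Consider an eigen-direction with eigenvalue $\lambda$ and linear term $g$ (i.e., in this direction $f(x) = \frac{\lambda}{2}x^2 + gx$). The GD update becomes $x_{\tau+1} = (1 - \eta\lambda)x_\tau - \eta g$, with $\mu_{\text{GD}}(\lambda) := 1 - \eta\lambda$ determining the rate of GD. The update of AGD is $(x_{\tau+1}, x_\tau)^\top = \mathbf{A}\,(x_\tau, x_{\tau-1})^\top - (\eta g, 0)^\top$ with matrix $\mathbf{A}$ defined as follows:
$$\mathbf{A} := \begin{pmatrix}(2-\theta)(1-\eta\lambda) & -(1-\theta)(1-\eta\lambda)\\ 1 & 0\end{pmatrix}.$$
The rate of AGD is determined by the largest eigenvalue of the matrix $\mathbf{A}$, which is denoted by $\mu_{\text{AGD}}(\lambda)$. Recall the choice of parameters (3), and divide the eigen-directions into the following three categories.

Strongly convex directions $\lambda \in [\sqrt{\rho\epsilon}, \ell]$: the slowest case is $\lambda = \sqrt{\rho\epsilon}$, where $\mu_{\text{GD}}(\lambda) = 1 - \Theta(1/\kappa)$ while $\mu_{\text{AGD}}(\lambda) = 1 - \Theta(1/\sqrt{\kappa})$, which results in AGD converging faster than GD.

Flat directions $\lambda \in [-\sqrt{\rho\epsilon}, \sqrt{\rho\epsilon}]$: the representative case is $\lambda = 0$, where the AGD update becomes $x_{\tau+1} - x_\tau = (1-\theta)(x_\tau - x_{\tau-1}) - \eta g$. For $\tau \le 1/\theta$, we have $|x_{t+\tau} - x_t| = \Theta(\tau)$ for GD while $|x_{t+\tau} - x_t| = \Theta(\tau^2)$ for AGD, which results in AGD moving along negative gradient directions faster than GD.

Strongly nonconvex directions $\lambda \in [-\ell, -\sqrt{\rho\epsilon}]$: similar to the strongly convex case, the slowest rate is for $\lambda = -\sqrt{\rho\epsilon}$, where $\mu_{\text{GD}}(\lambda) = 1 + \Theta(1/\kappa)$ while $\mu_{\text{AGD}}(\lambda) = 1 + \Theta(1/\sqrt{\kappa})$, which results in AGD escaping saddle points faster than GD.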
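The three cases above can be explored numerically by computing the per-step rates along a single eigen-direction. The sketch below builds the 2x2 matrix that maps $(x_t, x_{t-1})$ to $(x_{t+1}, x_t)$ for the AGD recursion $y_t = x_t + (1-\theta)(x_t - x_{t-1})$, $x_{t+1} = (1-\eta\lambda)y_t$ (linear term dropped) and compares its spectral radius with the GD factor $|1 - \eta\lambda|$. The numerical values of the step size, momentum parameter and condition number are illustrative assumptions, and the function names are hypothetical.

```python
import numpy as np

def gd_rate(lam, eta):
    """Per-step contraction (or expansion) factor of GD along an eigen-direction."""
    return abs(1 - eta * lam)

def agd_rate(lam, eta, theta):
    """Spectral radius of the 2x2 matrix mapping (x_t, x_{t-1}) to (x_{t+1}, x_t)."""
    a = 1 - eta * lam
    A = np.array([[(2 - theta) * a, -(1 - theta) * a],
                  [1.0, 0.0]])
    return np.max(np.abs(np.linalg.eigvals(A)))

# Assumed setup: ell = 1 and condition number 100; lam stands in for the
# curvature threshold separating "strongly" convex/nonconvex from flat directions.
ell, kappa = 1.0, 100.0
eta, theta = 1 / (4 * ell), 1 / (4 * np.sqrt(kappa))
lam = ell / kappa

for name, value in [("strongly convex", lam), ("strongly nonconvex", -lam)]:
    print(name, "GD:", gd_rate(value, eta), "AGD:", agd_rate(value, eta, theta))
```

In the convex direction the AGD factor lies further below one than the GD factor (faster contraction), and in the nonconvex direction it lies further above one (faster escape).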
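Finally, the approximation error (from a quadratic) is also under control in this framework. With appropriate choice of $T$ and threshold for $E_0 - E_T$ in Corollary 6, by the Cauchy-Schwarz inequality we can restrict the iterates to all lie within a local ball around $\mathbf{x}_0$ with radius $\sqrt{\epsilon/\rho}$, where both the gradient and Hessian of $f(\cdot)$ and its quadratic approximation $\tilde{f}(\mathbf{x}) := f(\mathbf{x}_0) + \langle\nabla f(\mathbf{x}_0), \mathbf{x} - \mathbf{x}_0\rangle + \frac{1}{2}(\mathbf{x} - \mathbf{x}_0)^\top\nabla^2 f(\mathbf{x}_0)(\mathbf{x} - \mathbf{x}_0)$ are close: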
Fact.
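Assume $f$ is $\rho$-Hessian Lipschitz. Then for all $\mathbf{x}$ so that $\|\mathbf{x} - \mathbf{x}_0\| \le \sqrt{\epsilon/\rho}$, we have $\|\nabla f(\mathbf{x}) - \nabla\tilde{f}(\mathbf{x})\| \le \epsilon$ and $\|\nabla^2 f(\mathbf{x}) - \nabla^2\tilde{f}(\mathbf{x})\| = \|\nabla^2 f(\mathbf{x}) - \nabla^2 f(\mathbf{x}_0)\| \le \sqrt{\rho\epsilon}$.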
4.3 Main Framework
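For simplicity of presentation, recall $\mathscr{T} = \sqrt{\kappa}\cdot\chi c$ and denote $\mathscr{E} := \sqrt{\epsilon^3/\rho}\cdot\chi^{-5}c^{-7}$, where $c$ is a sufficiently large constant as in Theorem 3. Our overall proof strategy will be to show the following “average descent claim”: Algorithm 2 decreases the Hamiltonian by $\mathscr{E}$ in every set of $\mathscr{T}$ iterations as long as it does not reach an $\epsilon$-second-order stationary point. Since the Hamiltonian cannot decrease by more than $f(\mathbf{x}_0) - f^\star$, this immediately shows that it has to reach an $\epsilon$-second-order stationary point in $O((f(\mathbf{x}_0) - f^\star)\mathscr{T}/\mathscr{E})$ steps, proving Theorem 3.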
It can be verified by the choice of parameters (3) and Lemma 5 that whenever (2) holds so that NCE is triggered, the Hamiltonian decreases by at least $\mathscr{E}$ in one step. So, if an NCE step is performed even once in each round of $\mathscr{T}$ steps, we achieve enough average decrease. The troublesome case is when in some time interval of $\mathscr{T}$ steps starting with $\mathbf{x}_0$, only AGD steps are performed without NCE. If $\mathbf{x}_0$ is not an $\epsilon$-second-order stationary point, either the gradient is large or the Hessian has a large negative direction. We prove the average decrease claim by considering these two cases.
Lemma 7 (Large gradient).
Lemma 8 (Negative curvature).
We note that an important aspect of these two lemmas is that the Hamiltonian decreases by $\Omega(\mathscr{E})$ in $\mathscr{T} = \tilde{\Theta}(\sqrt{\kappa})$ steps, which is faster compared to PGD, which decreases the function value by $\Omega(\mathscr{E})$ in $\mathscr{T}^2 = \tilde{\Theta}(\kappa)$ steps (Jin et al., 2017). That is, the acceleration phenomenon in PAGD happens in both cases. We also stress that under both of these settings, PAGD cannot achieve $\Omega(\mathscr{E}/\mathscr{T})$ decrease in each step; it has to accumulate momentum over time to achieve $\Omega(\mathscr{E}/\mathscr{T})$ amortized decrease.
Large Gradient Scenario
For AGD, gradient and momentum interact, and both play important roles in the dynamics. Fortunately, according to Lemma 4, the Hamiltonian decreases sufficiently whenever the momentum is large; so it is sufficient to discuss the case where the momentum is small.
The main difficulty in proving Lemma 7 is enforcing the precondition that the gradients of all iterates are large, even under the quadratic approximation. Intuitively, we hope that the large initial gradient $\|\nabla f(\mathbf{x}_0)\| \ge \epsilon$ suffices to give a sufficient decrease of the Hamiltonian. Unfortunately, this is not true. Let $\mathcal{S}$ be the subspace of eigenvectors of $\nabla^2 f(\mathbf{x}_0)$ with eigenvalues in $[\theta^2/(\eta(2-\theta)^2), \ell]$, consisting of all the strongly convex directions, and let $\mathcal{S}^c$ be the orthogonal subspace. It turns out that the initial gradient component in $\mathcal{S}$ is not very helpful in decreasing the Hamiltonian since AGD rapidly decreases the gradient in these directions. We instead prove Lemma 7 in two steps.
Lemma 9.
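(informal) If $\|\mathbf{v}_0\|$ is small, $\|\nabla f(\mathbf{x}_0)\|$ not too large and $E_0 - E_{\mathscr{T}/2} \le \mathscr{E}$, then for all $\tau \in [\mathscr{T}/4, \mathscr{T}/2]$ we have $\|\mathcal{P}_{\mathcal{S}}\nabla f(\mathbf{x}_\tau)\| \le \epsilon/2$.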
Lemma 10.
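(informal) If $\|\mathbf{v}_0\|$ is small and $\|\mathcal{P}_{\mathcal{S}^c}\nabla f(\mathbf{x}_0)\| \ge \epsilon/2$, then we have $E_{\mathscr{T}/4} - E_0 \le -\mathscr{E}$.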
See the formal versions, Lemma 16 and Lemma 17, for more details. We see that if the Hamiltonian does not decrease much (and so the iterates are localized in a small ball), the gradient in the strongly convex subspace $\|\mathcal{P}_{\mathcal{S}}\nabla f(\mathbf{x}_\tau)\|$ vanishes in $\mathscr{T}/4$ steps by Lemma 9. Since the hypothesis of Lemma 7 guarantees a large gradient for all of the $\mathscr{T}$ steps, this means that $\|\mathcal{P}_{\mathcal{S}^c}\nabla f(\mathbf{x}_\tau)\|$ is large after $\mathscr{T}/4$ steps, thereby decreasing the Hamiltonian in the next $\mathscr{T}/4$ steps (by Lemma 10).
Negative Curvature Scenario
In this section, we will show that the volume of the set around a strict saddle point from which AGD does not escape quickly is very small (Lemma 8). We do this using the coupling mechanism introduced in Jin et al. (2017), which gives a fine-grained understanding of the geometry around saddle points. More concretely, with the perturbation radius $r$ as specified in (3), we show the following lemma.
Lemma 11.
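(informal) Suppose $\|\nabla f(\tilde{\mathbf{x}})\| \le \epsilon$ and $\lambda_{\min}(\nabla^2 f(\tilde{\mathbf{x}})) \le -\sqrt{\rho\epsilon}$. Let $\mathbf{x}_0$ and $\mathbf{x}_0'$ be at distance at most $r$ from $\tilde{\mathbf{x}}$, and $\mathbf{x}_0 - \mathbf{x}_0' = r_0\mathbf{e}_1$ where $\mathbf{e}_1$ is the minimum eigen-direction of $\nabla^2 f(\tilde{\mathbf{x}})$ and $r_0 \ge \frac{\delta r}{2\sqrt{d}}$. Then for AGD starting at $(\mathbf{x}_0, \tilde{\mathbf{v}})$ and $(\mathbf{x}_0', \tilde{\mathbf{v}})$, we have:
$$\min\{E_{\mathscr{T}} - \tilde{E},\ E'_{\mathscr{T}} - \tilde{E}\} \le -\mathscr{E},$$
where $\tilde{E}$, $E_{\mathscr{T}}$ and $E'_{\mathscr{T}}$ are the Hamiltonians at $(\tilde{\mathbf{x}}, \tilde{\mathbf{v}})$, $(\mathbf{x}_{\mathscr{T}}, \mathbf{v}_{\mathscr{T}})$ and $(\mathbf{x}'_{\mathscr{T}}, \mathbf{v}'_{\mathscr{T}})$ respectively.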
See the formal version in Lemma 18. We note that $\delta$ in the above lemma is a small number characterizing the failure probability of the algorithm (as defined in Theorem 3), and $\mathscr{T}$ has logarithmic dependence on $\delta$ according to (3). Lemma 11 says that around any strict saddle, for any two points that are separated along the smallest eigen-direction by at least $\frac{\delta r}{2\sqrt{d}}$, PAGD, starting from at least one of those points, decreases the Hamiltonian, and hence escapes the strict saddle. This implies that the region from which AGD is stuck has width at most $\frac{\delta r}{2\sqrt{d}}$ along this direction, and thus has small volume.
5 Conclusions
In this paper, we show that a variant of AGD can escape saddle points faster than GD, demonstrating that momentum techniques can indeed accelerate convergence even for nonconvex optimization. Our algorithm finds an $\epsilon$-second-order stationary point in $\tilde{O}(1/\epsilon^{7/4})$ iterations, faster than the $\tilde{O}(1/\epsilon^{2})$ iterations taken by GD. This is the first algorithm that is both Hessian-free and single-loop that achieves this rate. Our analysis relies on novel techniques that lead to a better understanding of momentum techniques as well as nonconvex optimization.
The results here also give rise to several questions. The first concerns lower bounds; is the rate of $\tilde{O}(1/\epsilon^{7/4})$ that we have established here optimal for gradient-based methods under the setting of gradient- and Hessian-Lipschitz? We believe this upper bound is very likely sharp up to log factors, and developing a tight algorithm-independent lower bound will be necessary to settle this question. The second is whether the negative-curvature-exploitation component of our algorithm is actually necessary for the fast rate. To attempt to answer this question, we may either explore other ways to track the progress of standard AGD (other than the particular Hamiltonian that we have presented here), or consider other discretizations of the ODE (4) so that the property (5) is preserved even for the most nonconvex region. A final direction for future research is the extension of our results to the finite-sum setting and the stochastic setting.
Appendix A Proof of Hamiltonian Lemmas
In this section, we prove Lemma 4, Lemma 5 and Corollary 6, which are presented in Section 4.1 and Section 4.2. In Section A.1 we also give an example where standard AGD without negative curvature exploitation can increase the Hamiltonian.
Recall that we define the Hamiltonian as $E_t := f(\mathbf{x}_t) + \frac{1}{2\eta}\|\mathbf{v}_t\|^2$, where, for AGD, we define $\mathbf{v}_t = \mathbf{x}_t - \mathbf{x}_{t-1}$. The first lemma shows that this Hamiltonian decreases in every step of AGD for mildly nonconvex functions.
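Lemma 4 (Hamiltonian decreases monotonically).
Assume that the function $f(\cdot)$ is $\ell$-smooth and set the learning rate to be $\eta \le \frac{1}{2\ell}$ and $\theta \in [2\eta\gamma, \frac{1}{2}]$ in AGD (Algorithm 1). Then, for every iteration $t$ where (2) does not hold, we have:
$$f(\mathbf{x}_{t+1}) + \frac{1}{2\eta}\|\mathbf{v}_{t+1}\|^2 \le f(\mathbf{x}_t) + \frac{1}{2\eta}\|\mathbf{v}_t\|^2 - \frac{\theta}{2\eta}\|\mathbf{v}_t\|^2 - \frac{\eta}{4}\|\nabla f(\mathbf{y}_t)\|^2. \tag{6}$$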
Proof.
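Recall that the update equation of accelerated gradient descent has the following form:
$$\mathbf{x}_{t+1} \leftarrow \mathbf{y}_t - \eta\nabla f(\mathbf{y}_t), \qquad \mathbf{y}_{t+1} \leftarrow \mathbf{x}_{t+1} + (1-\theta)(\mathbf{x}_{t+1} - \mathbf{x}_t).$$
By smoothness, with $\eta \le \frac{1}{2\ell}$:
$$f(\mathbf{x}_{t+1}) \le f(\mathbf{y}_t) - \eta\|\nabla f(\mathbf{y}_t)\|^2 + \frac{\ell\eta^2}{2}\|\nabla f(\mathbf{y}_t)\|^2 \le f(\mathbf{y}_t) - \frac{3\eta}{4}\|\nabla f(\mathbf{y}_t)\|^2, \tag{7}$$
assuming that the precondition (2) does not hold:
$$f(\mathbf{x}_t) \ge f(\mathbf{y}_t) + \langle\nabla f(\mathbf{y}_t), \mathbf{x}_t - \mathbf{y}_t\rangle - \frac{\gamma}{2}\|\mathbf{y}_t - \mathbf{x}_t\|^2, \tag{8}$$
and given the following update equation:
$$\|\mathbf{x}_{t+1} - \mathbf{x}_t\|^2 = \|\mathbf{y}_t - \mathbf{x}_t - \eta\nabla f(\mathbf{y}_t)\|^2 = (1-\theta)^2\|\mathbf{x}_t - \mathbf{x}_{t-1}\|^2 - 2\eta\langle\nabla f(\mathbf{y}_t), \mathbf{y}_t - \mathbf{x}_t\rangle + \eta^2\|\nabla f(\mathbf{y}_t)\|^2, \tag{9}$$
we have:
$$\begin{aligned} f(\mathbf{x}_{t+1}) + \frac{1}{2\eta}\|\mathbf{x}_{t+1} - \mathbf{x}_t\|^2 &\le f(\mathbf{x}_t) + \langle\nabla f(\mathbf{y}_t), \mathbf{y}_t - \mathbf{x}_t\rangle - \frac{3\eta}{4}\|\nabla f(\mathbf{y}_t)\|^2 + \frac{\gamma}{2}\|\mathbf{y}_t - \mathbf{x}_t\|^2 \\ &\quad + \frac{1}{2\eta}\Big[(1-\theta)^2\|\mathbf{x}_t - \mathbf{x}_{t-1}\|^2 - 2\eta\langle\nabla f(\mathbf{y}_t), \mathbf{y}_t - \mathbf{x}_t\rangle + \eta^2\|\nabla f(\mathbf{y}_t)\|^2\Big] \\ &= f(\mathbf{x}_t) + \frac{1}{2\eta}\|\mathbf{x}_t - \mathbf{x}_{t-1}\|^2 - \frac{2\theta - \theta^2 - \eta\gamma(1-\theta)^2}{2\eta}\|\mathbf{x}_t - \mathbf{x}_{t-1}\|^2 - \frac{\eta}{4}\|\nabla f(\mathbf{y}_t)\|^2 \\ &\le f(\mathbf{x}_t) + \frac{1}{2\eta}\|\mathbf{x}_t - \mathbf{x}_{t-1}\|^2 - \frac{\theta}{2\eta}\|\mathbf{x}_t - \mathbf{x}_{t-1}\|^2 - \frac{\eta}{4}\|\nabla f(\mathbf{y}_t)\|^2. \end{aligned}$$
The equality uses $\mathbf{y}_t - \mathbf{x}_t = (1-\theta)(\mathbf{x}_t - \mathbf{x}_{t-1})$. The last inequality uses the fact that $\theta \in [2\eta\gamma, \frac{1}{2}]$ so that $\theta^2 \le \frac{\theta}{2}$ and $\eta\gamma \le \frac{\theta}{2}$. We substitute in the definition of $\mathbf{v}_t$ and $E_t$ to finish the proof. ∎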
We see from this proof that (8) relies on approximate convexity of $f(\cdot)$, which explains why in all existing proofs, the convexity between $\mathbf{x}_t$ and $\mathbf{y}_t$ is so important. A perhaps surprising fact to note is that the above proof can in fact go through even with mild nonconvexity (captured in line 8 of Algorithm 2). Thus, high nonconvexity is the problematic situation. To overcome this, we need to slightly modify AGD so that the Hamiltonian is decreasing. This is formalized in the following lemma.
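Lemma 5.
Assume that $f(\cdot)$ is $\ell$-smooth and $\rho$-Hessian Lipschitz. For every iteration $t$ of Algorithm 2 where (2) holds (thus running NCE), we have:
$$E_{t+1} \le E_t - \min\Big\{\frac{s^2}{2\eta},\ \frac{1}{2}(\gamma - 2\rho s)s^2\Big\}.$$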
Proof.
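When we perform an NCE step, we know that (2) holds. In the first case ($\|\mathbf{v}_t\| \ge s$), we set $\mathbf{x}_{t+1} = \mathbf{x}_t$ and set the momentum $\mathbf{v}_{t+1}$ to zero, which gives:
$$E_{t+1} = f(\mathbf{x}_{t+1}) = f(\mathbf{x}_t) = E_t - \frac{1}{2\eta}\|\mathbf{v}_t\|^2 \le E_t - \frac{s^2}{2\eta}.$$
In the second case ($\|\mathbf{v}_t\| < s$), expanding in a Taylor series with Lagrange remainder, we have:
$$f(\mathbf{x}_t) = f(\mathbf{y}_t) + \langle\nabla f(\mathbf{y}_t), \mathbf{x}_t - \mathbf{y}_t\rangle + \frac{1}{2}(\mathbf{x}_t - \mathbf{y}_t)^\top\nabla^2 f(\boldsymbol{\zeta}_t)(\mathbf{x}_t - \mathbf{y}_t),$$
where $\boldsymbol{\zeta}_t = \phi\mathbf{x}_t + (1-\phi)\mathbf{y}_t$ and $\phi \in [0, 1]$. Due to the certificate (2) we have
$$\frac{1}{2}(\mathbf{x}_t - \mathbf{y}_t)^\top\nabla^2 f(\boldsymbol{\zeta}_t)(\mathbf{x}_t - \mathbf{y}_t) \le -\frac{\gamma}{2}\|\mathbf{x}_t - \mathbf{y}_t\|^2.$$
On the other hand, with $\boldsymbol{\delta} := s\,\mathbf{v}_t/\|\mathbf{v}_t\|$ denoting the displacement tried by NCE, clearly $\min\{\langle\nabla f(\mathbf{x}_t), \boldsymbol{\delta}\rangle, \langle\nabla f(\mathbf{x}_t), -\boldsymbol{\delta}\rangle\} \le 0$. WLOG, suppose $\langle\nabla f(\mathbf{x}_t), \boldsymbol{\delta}\rangle \le 0$, then, by definition of $\mathbf{x}_{t+1}$, we have:
$$f(\mathbf{x}_{t+1}) \le f(\mathbf{x}_t + \boldsymbol{\delta}) = f(\mathbf{x}_t) + \langle\nabla f(\mathbf{x}_t), \boldsymbol{\delta}\rangle + \frac{1}{2}\boldsymbol{\delta}^\top\nabla^2 f(\boldsymbol{\zeta}_t')\boldsymbol{\delta} \le f(\mathbf{x}_t) + \frac{1}{2}\boldsymbol{\delta}^\top\nabla^2 f(\boldsymbol{\zeta}_t')\boldsymbol{\delta},$$
where $\boldsymbol{\zeta}_t' = \mathbf{x}_t + \phi'\boldsymbol{\delta}$ and $\phi' \in [0, 1]$. Since $\|\boldsymbol{\zeta}_t - \boldsymbol{\zeta}_t'\| \le 2s$, and $\boldsymbol{\delta}$ also lines up with $\mathbf{y}_t - \mathbf{x}_t$:
$$\boldsymbol{\delta}^\top\nabla^2 f(\boldsymbol{\zeta}_t')\boldsymbol{\delta} \le \boldsymbol{\delta}^\top\nabla^2 f(\boldsymbol{\zeta}_t)\boldsymbol{\delta} + \|\nabla^2 f(\boldsymbol{\zeta}_t') - \nabla^2 f(\boldsymbol{\zeta}_t)\|\,\|\boldsymbol{\delta}\|^2 \le -\gamma\|\boldsymbol{\delta}\|^2 + 2\rho s\|\boldsymbol{\delta}\|^2.$$
Therefore, since $\|\boldsymbol{\delta}\| = s$, this gives
$$E_{t+1} = f(\mathbf{x}_{t+1}) \le f(\mathbf{x}_t) - \frac{1}{2}(\gamma - 2\rho s)s^2 \le E_t - \frac{1}{2}(\gamma - 2\rho s)s^2,$$
which finishes the proof. ∎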
The Hamiltonian decrease has an important consequence: if the Hamiltonian does not decrease much, then all the iterates are localized in a small ball around the starting point. Moreover, the iterates do not oscillate much in this ball. We call this the improve-or-localize phenomenon.
Corollary 6 (Improve or localize).
Proof.
The proof follows immediately from telescoping the argument of Lemma 4. ∎
A.1 AGD can increase the Hamiltonian under nonconvexity
In the previous section, we proved Lemma 4 which requires , that is, . In this section, we show Lemma 4 is almost tight in the sense that when in (2), we have:
Monotonic decrease of the Hamiltonian may no longer hold, indeed, AGD can increase the Hamiltonian for those steps.
Consider a simple onedimensional example, , where (2) always holds. Define the initial condition . By update equation in Algorithm 1, the next iterate will be , and . By the definition of Hamiltonian, we haveï¼
since . It is not hard to verify that whenever , we will have ; that is, the Hamiltonian increases in this step.
This fact implies that when we pick a large learning rate and small momentum parameter (both are essential for acceleration), standard AGD does not decrease the Hamiltonian in a very nonconvex region. We need another mechanism such as NCE to restore the monotonic decrease.
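A one-step increase of the Hamiltonian is easy to reproduce numerically. The sketch below is an illustration under stated assumptions rather than a transcription of the example above: it uses the concave quadratic $f(x) = -\frac{\gamma}{2}x^2$, whose curvature is $-\gamma$ everywhere, and a starting pair $(x_0, v_0)$ chosen so that the look-ahead point lands on the stationary point $0$. The numerical values of $\gamma$, $\eta$ and $\theta$ are likewise assumed.

```python
def hamiltonian(f, x, v, eta):
    """Discrete Hamiltonian: potential energy plus kinetic energy."""
    return f(x) + v * v / (2 * eta)

gamma, eta, theta = 1.0, 0.25, 0.05      # illustrative values with eta * gamma much larger than theta
f = lambda x: -0.5 * gamma * x * x       # concave quadratic: curvature -gamma everywhere
grad = lambda x: -gamma * x

x0, v0 = -1.0, 1.0 / (1.0 - theta)       # assumed initial condition
y0 = x0 + (1 - theta) * v0               # look-ahead point; equals 0 up to rounding
x1 = y0 - eta * grad(y0)                 # the gradient step barely moves the iterate
v1 = x1 - x0                             # new momentum, approximately 1

e0 = hamiltonian(f, x0, v0, eta)
e1 = hamiltonian(f, x1, v1, eta)
```

Here the kinetic term drops only slightly (from $v_0^2/(2\eta)$ to about $1/(2\eta)$) while the potential term rises from $-\gamma/2$ to about $0$, so `e1` exceeds `e0`.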
Appendix B Proof of Main Result
In this section, we set up the machinery needed to prove our main result, Theorem 3. We first present the generic setup, then, as in Section 4.3, we split the proof into two cases, one where gradient is large and the other where the Hessian has negative curvature. In the end, we put everything together and prove Theorem 3.
To simplify the proof, we introduce some notation for this section, and state a convention regarding absolute constants. Recall the choice of parameters in Eq.(3):