Accelerated Gradient Descent Escapes Saddle Points Faster than Gradient Descent
Nesterov’s accelerated gradient descent (AGD), an instance of the general family of “momentum methods,” provably achieves a faster convergence rate than gradient descent (GD) in the convex setting. However, whether these methods are superior to GD in the nonconvex setting remains open. This paper studies a simple variant of AGD, and shows that it escapes saddle points and finds a second-order stationary point in $\tilde{O}(1/\epsilon^{7/4})$ iterations, faster than the $\tilde{O}(1/\epsilon^{2})$ iterations required by GD. To the best of our knowledge, this is the first Hessian-free algorithm to find a second-order stationary point faster than GD, and also the first single-loop algorithm with a faster rate than GD even in the setting of finding a first-order stationary point. Our analysis is based on two key ideas: (1) the use of a simple Hamiltonian function, inspired by a continuous-time perspective, which AGD monotonically decreases per step even for nonconvex functions, and (2) a novel framework called improve or localize, which is useful for tracking the long-term behavior of gradient-based optimization algorithms. We believe that these techniques may deepen our understanding of both acceleration algorithms and nonconvex optimization.
1 Introduction

Nonconvex optimization problems are ubiquitous in modern machine learning. While it is NP-hard to find global minima of a nonconvex function in the worst case, in the setting of machine learning it has proved useful to consider a less stringent notion of success, namely that of convergence to a first-order stationary point (where $\nabla f(x) = 0$). Gradient descent (GD), a simple and fundamental optimization algorithm that has proved its value in large-scale machine learning, is known to find an $\epsilon$-first-order stationary point (where $\|\nabla f(x)\| \le \epsilon$) in $O(1/\epsilon^2)$ iterations (Nesterov, 1998), and this rate is sharp (Cartis et al., 2010). Such results, however, do not seem to address the practical success of gradient descent; first-order stationarity includes local minima, saddle points and even local maxima, and a mere guarantee of convergence to such points seems unsatisfying. Indeed, architectures such as deep neural networks induce optimization surfaces that can be teeming with such highly suboptimal saddle points (Dauphin et al., 2014). It is important to study to what extent gradient descent avoids such points, particularly in the high-dimensional setting in which the directions of escape from saddle points may be few.
This paper focuses on convergence to a second-order stationary point (where $\nabla f(x) = 0$ and $\nabla^2 f(x) \succeq 0$). Second-order stationarity rules out many common types of saddle points (strict saddle points, where $\lambda_{\min}(\nabla^2 f(x)) < 0$), allowing only local minima and higher-order saddle points. A significant body of recent work, some theoretical and some empirical, shows that for a large class of well-studied machine learning problems, neither higher-order saddle points nor spurious local minima exist. That is, all second-order stationary points are (approximate) global minima for these problems. Choromanska et al. (2014); Kawaguchi (2016) present such a result for learning multi-layer neural networks, Bandeira et al. (2016); Mei et al. (2017) for synchronization and MaxCut, Boumal et al. (2016) for smooth semidefinite programs, Bhojanapalli et al. (2016) for matrix sensing, Ge et al. (2016) for matrix completion, and Ge et al. (2017) for robust PCA. These results strongly motivate the quest for efficient algorithms to find second-order stationary points.
Hessian-based algorithms can explicitly compute curvatures and thereby avoid saddle points (e.g., (Nesterov and Polyak, 2006; Curtis et al., 2014)), but these algorithms are computationally infeasible in the high-dimensional regime. GD, by contrast, is known to get stuck at strict saddle points (Nesterov, 1998, Section 1.2.3). Recent work has reconciled this conundrum in favor of GD; Jin et al. (2017), building on earlier work of Ge et al. (2015), show that a perturbed version of GD converges to an $\epsilon$-relaxed version of a second-order stationary point (see Definition 5) in $\tilde{O}(1/\epsilon^2)$ iterations. That is, perturbed GD in fact finds second-order stationary points as fast as standard GD finds first-order stationary points, up to logarithmic factors in dimension.
On the other hand, GD is known to be suboptimal in the convex case. In a celebrated paper, Nesterov (1983) showed that an accelerated version of gradient descent (AGD) finds an $\epsilon$-suboptimal point (see Section 2.2) in $O(1/\sqrt{\epsilon})$ steps, while gradient descent takes $O(1/\epsilon)$ steps. The basic idea of acceleration has been used to design faster algorithms for a range of other convex optimization problems (Beck and Teboulle, 2009; Nesterov, 2012; Lee and Sidford, 2013; Shalev-Shwartz and Zhang, 2014). We will refer to this general family as “momentum-based methods.”
Such results have focused on the convex setting. It is open as to whether momentum-based methods yield faster rates in the nonconvex setting, specifically when we consider the convergence criterion of second-order stationarity. We are thus led to ask the following question: Do momentum-based methods yield faster convergence than GD in the presence of saddle points?
This paper answers this question in the affirmative. We present a simple momentum-based algorithm (PAGD for “perturbed AGD”) that finds an $\epsilon$-second-order stationary point in $\tilde{O}(1/\epsilon^{7/4})$ iterations, faster than the $\tilde{O}(1/\epsilon^{2})$ iterations required by GD.
The pseudocode of our algorithm is presented in Algorithm 2.
Perturbation (Lines 3-4): when the gradient is small, we add a small perturbation sampled uniformly from a $d$-dimensional ball with radius $r$. The homogeneous nature of this perturbation mitigates our lack of knowledge of the curvature tensor at or near saddle points.
Negative Curvature Exploitation (NCE, Lines 8-9; pseudocode in Algorithm 3): when the function becomes “too nonconvex” along the direction from $x_t$ to $y_t$, we reset the momentum and decide whether to exploit negative curvature depending on the magnitude of the current momentum $v_t$.
We note that both components are straightforward to implement and increase computation by a constant factor. The perturbation idea follows from Ge et al. (2015) and Jin et al. (2017), while NCE is inspired by (Carmon et al., 2017). To the best of our knowledge, PAGD is the first Hessian-free algorithm to find a second-order stationary point in $\tilde{O}(1/\epsilon^{7/4})$ steps. Note also that PAGD is a “single-loop algorithm,” meaning that it does not require an inner loop of optimization of a surrogate function. It is the first single-loop algorithm to achieve a $\tilde{O}(1/\epsilon^{7/4})$ rate even in the setting of finding a first-order stationary point.
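The two components above can be sketched in a few lines. The following is a minimal Python sketch of PAGD (our own simplification, not the authors' reference implementation); the hyperparameter names ($\eta$, $\theta$, $\gamma$, $s$, $r$, the perturbation gap) mirror the description above, and the termination logic is reduced to a fixed iteration budget:

```python
import numpy as np

def nce(f, x, v, s):
    """Negative Curvature Exploitation (a sketch of Algorithm 3): if the
    momentum is large, resetting it alone decreases the Hamiltonian;
    otherwise step along +/- v and keep whichever point is lower."""
    if np.linalg.norm(v) < s:
        delta = s * v / (np.linalg.norm(v) + 1e-12)
        x = min((x + delta, x - delta), key=f)
    return x, np.zeros_like(v)  # momentum is reset to zero in both cases

def pagd(f, grad, x0, eta, theta, gamma, s, r, T_gap, eps, n_iters, seed=0):
    """A minimal sketch of PAGD (in the spirit of Algorithm 2); the
    stopping rule is simplified to a fixed budget of iterations."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float).copy()
    v = np.zeros_like(x)
    last_perturb = -np.inf
    for t in range(n_iters):
        if np.linalg.norm(grad(x)) <= eps and t - last_perturb > T_gap:
            u = rng.normal(size=x.shape)          # uniform point in a ball
            u *= r * rng.uniform() ** (1 / x.size) / np.linalg.norm(u)
            x = x + u
            last_perturb = t
        y = x + (1 - theta) * v                   # momentum step
        gy = grad(y)
        x_next = y - eta * gy                     # gradient step
        v_next = x_next - x
        # certificate (2): f is "too nonconvex" between x_t and y_t
        if f(x) < f(y) + gy @ (x - y) - (gamma / 2) * np.sum((x - y) ** 2):
            x_next, v_next = nce(f, x, v, s)
        x, v = x_next, v_next
    return x
```

On a convex function the certificate never fires and the method reduces to plain AGD; the perturbation and NCE branches only activate near first-order stationary points and in regions of large negative curvature, respectively.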
1.1 Related Work
Table 1: Complexity of finding stationary points. $\tilde{O}(\cdot)$ ignores polylog factors in $d$ and $\epsilon$.

| Guarantee | Oracle | Algorithm | Iterations | Simplicity |
| --- | --- | --- | --- | --- |
| First-order stationary point | Gradient | GD (Nesterov, 1998) | $O(1/\epsilon^{2})$ | Single-loop |
| First-order stationary point | Gradient | AGD (Ghadimi and Lan, 2016) | $O(1/\epsilon^{2})$ | Single-loop |
| First-order stationary point | Gradient | Carmon et al. (2017) | $\tilde{O}(1/\epsilon^{7/4})$ | Nested-loop |
| Second-order stationary point | Hessian-vector product | Carmon et al. (2016) | $\tilde{O}(1/\epsilon^{7/4})$ | Nested-loop |
| Second-order stationary point | Hessian-vector product | Agarwal et al. (2017) | $\tilde{O}(1/\epsilon^{7/4})$ | Nested-loop |
| Second-order stationary point | Gradient | Noisy GD (Ge et al., 2015) | $O(\mathrm{poly}(d)/\epsilon^{4})$ | Single-loop |
| Second-order stationary point | Gradient | Perturbed GD (Jin et al., 2017) | $\tilde{O}(1/\epsilon^{2})$ | Single-loop |
| Second-order stationary point | Gradient | Perturbed AGD [This Work] | $\tilde{O}(1/\epsilon^{7/4})$ | Single-loop |
In this section, we review related work from the perspective of both nonconvex optimization and momentum/acceleration. For clarity of presentation, when discussing rates, we focus on the dependence on the accuracy $\epsilon$ and the dimension $d$ while assuming all other problem parameters are constant. Table 1 presents a comparison of the current work with previous work.
Convergence to first-order stationary points: Traditional analyses in this case assume only Lipschitz gradients (see Definition 1). Nesterov (1998) shows that GD finds an $\epsilon$-first-order stationary point in $O(1/\epsilon^2)$ steps. Ghadimi and Lan (2016) guarantee that AGD also converges in $O(1/\epsilon^2)$ steps. Under the additional assumption of Lipschitz Hessians (see Definition 4), Carmon et al. (2017) develop a new algorithm that converges in $\tilde{O}(1/\epsilon^{7/4})$ steps. Their algorithm is a nested-loop algorithm, where the outer loop adds a proximal term to reduce the nonconvex problem to a convex subproblem. A key novelty in their algorithm is the idea of “negative curvature exploitation,” which inspired a similar step in our algorithm. In addition to the qualitative and quantitative differences between Carmon et al. (2017) and the current work, as summarized in Table 1, we note that while Carmon et al. (2017) analyze AGD applied to convex subproblems, we analyze AGD applied directly to nonconvex functions through a novel Hamiltonian framework.
Convergence to second-order stationary points: All results in this setting assume Lipschitz conditions for both the gradient and Hessian. Classical approaches, such as cubic regularization (Nesterov and Polyak, 2006) and trust region algorithms (Curtis et al., 2014), require access to Hessians, and are known to find $\epsilon$-second-order stationary points in $O(1/\epsilon^{1.5})$ steps. However, the requirement of these algorithms to form the Hessian makes them infeasible for high-dimensional problems. A second set of algorithms utilize only Hessian-vector products instead of the explicit Hessian; in many applications such products can be computed efficiently. Rates of $\tilde{O}(1/\epsilon^{7/4})$ have been established for such algorithms (Carmon et al., 2016; Agarwal et al., 2017; Royer and Wright, 2017). Finally, in the realm of purely gradient-based algorithms, Ge et al. (2015) present the first polynomial guarantees for a perturbed version of GD, and Jin et al. (2017) sharpen it to $\tilde{O}(1/\epsilon^{2})$. For the special case of quadratic functions, O’Neill and Wright (2017) analyze the behavior of AGD around critical points and show that it escapes saddle points faster than GD. We note that the current work is the first achieving a rate of $\tilde{O}(1/\epsilon^{7/4})$ for general nonconvex functions.
Acceleration: There is also a rich literature that aims to understand momentum methods; e.g., Allen-Zhu and Orecchia (2014) view AGD as a linear coupling of GD and mirror descent, Su et al. (2016) and Wibisono et al. (2016) view AGD as a second-order differential equation, and Bubeck et al. (2015) view AGD from a geometric perspective. Most of this work is tailored to the convex setting, and it is nontrivial to generalize these results to the nonconvex setting. There are also several papers that study AGD with relaxed versions of convexity; see Necoara et al. (2015); Li and Lin (2017) and references therein for overviews of these results.
1.2 Main Techniques
Our results rely on the following three key ideas. To the best of our knowledge, the first two are novel, while the third one was delineated in Jin et al. (2017).
Hamiltonian: A major challenge in analyzing momentum-based algorithms is that the objective function does not decrease monotonically as is the case for GD. To overcome this in the convex setting, several Lyapunov functions have been proposed (Wilson et al., 2016). However these Lyapunov functions involve the global minimum $x^\star$, which cannot be computed by the algorithm, and are thus of limited value in the nonconvex setting. A key technical contribution of this paper is the design of a function which is both computable and tracks the progress of AGD. The function takes the form of a Hamiltonian:
$$E_t = f(x_t) + \frac{1}{2\eta}\|v_t\|^2, \qquad (1)$$
i.e., a sum of potential energy and kinetic energy terms. It is monotonically decreasing in the continuous-time setting. This is not the case in general in the discrete-time setting, a fact which requires us to incorporate the NCE step.
Improve or localize: Another key technical contribution of this paper is in formalizing a simple but powerful framework for analyzing nonconvex optimization algorithms. This framework requires us to show that for a given algorithm, either the algorithm makes significant progress or the iterates do not move much. We call this the improve-or-localize phenomenon. For instance, when progress is measured by function value, it is easy to show that for GD, with proper choice of learning rate, we have:
$$f(x_0) - f(x_T) \ge \frac{1}{2\eta}\sum_{\tau=0}^{T-1}\|x_{\tau+1} - x_\tau\|^2.$$
For AGD, a similar lemma can be shown by replacing the objective function with the Hamiltonian (see Lemma 4). Once this phenomenon is established, we can conclude that if an algorithm does not make much progress, it is localized to a small ball, and we can then approximate the objective function by either a linear or a quadratic function (depending on smoothness assumptions) in this small local region. Moreover, an upper bound on $\sum_{\tau}\|x_{\tau+1} - x_\tau\|^2$ lets us conclude that iterates do not oscillate much in this local region (oscillation is a unique phenomenon of momentum algorithms as can be seen even in the convex setting). This gives us better control of approximation error.
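As a concrete instance of the improve-or-localize inequality for GD (a toy check under our own choice of function and step size, with $\eta\ell \le 1$ on the region traversed):

```python
import numpy as np

def gd_improve_or_localize(f, grad_f, x0, eta, T):
    """Run GD and return (function decrease, sum of squared step lengths)."""
    x = float(x0)
    sq_move = 0.0
    f0 = f(x)
    for _ in range(T):
        x_new = x - eta * grad_f(x)
        sq_move += (x_new - x) ** 2
        x = x_new
    return f0 - f(x), sq_move

# nonconvex test function f(x) = x^4/4 - x^2/2, smooth on the region traversed
f = lambda x: x**4 / 4 - x**2 / 2
grad_f = lambda x: x**3 - x
dec, sq = gd_improve_or_localize(f, grad_f, x0=1.5, eta=0.1, T=100)
# either the function value dropped, or the iterates barely moved:
assert dec >= sq / (2 * 0.1) - 1e-12
```

A small total decrease therefore forces the total squared movement to be small, which is exactly the localization used later in the analysis.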
Coupling sequences for escaping saddle points: When an algorithm arrives in the neighborhood of a strict saddle point, where $\lambda_{\min}(\nabla^2 f(x)) \le -\sqrt{\rho\epsilon}$, all we know is that there exists a direction of escape (the direction of the minimum eigenvector of the Hessian); denote it by $e_1$. To avoid such points, the algorithm randomly perturbs the current iterate uniformly in a small ball, and runs AGD starting from this perturbed point $w$. As in Jin et al. (2017), we can divide this ball into a “stuck region,” $\mathcal{X}_{\text{stuck}}$, starting from which AGD does not escape the saddle quickly, and its complement, from which AGD escapes quickly. In order to show quick escape from a saddle point, we must show that the volume of $\mathcal{X}_{\text{stuck}}$ is very small compared to that of the ball. Though $\mathcal{X}_{\text{stuck}}$ may not have an analytical form, one can control the rate of escape by studying two AGD sequences that start from two realizations of the perturbation, $w$ and $w'$, which are separated along $e_1$ by a small distance $r_0$. In this case, at least one of the sequences escapes the saddle point quickly, which proves that the width of $\mathcal{X}_{\text{stuck}}$ along $e_1$ cannot be greater than $r_0$, and hence $\mathcal{X}_{\text{stuck}}$ has small volume.
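The coupling argument can be visualized on a toy strict saddle (our own example, not from the paper): two AGD sequences whose starting points differ only by a tiny shift along the escape direction behave very differently:

```python
import numpy as np

def agd(grad, x0, eta, theta, T):
    """Plain AGD (no perturbation, no NCE), for illustration only."""
    x = np.array(x0, dtype=float)
    v = np.zeros_like(x)
    for _ in range(T):
        y = x + (1 - theta) * v
        x_new = y - eta * grad(y)
        v = x_new - x
        x = x_new
    return x

# strict saddle at the origin: f(z) = 0.5*(z1^2 - z2^2); escape direction e_2
grad = lambda z: np.array([z[0], -z[1]])
w = np.array([0.1, 0.0])     # exactly on the stuck set: the e_2 component stays 0
w2 = np.array([0.1, 1e-5])   # separated from w along e_2 by only 1e-5
a = agd(grad, w, eta=0.25, theta=0.1, T=50)
b = agd(grad, w2, eta=0.25, theta=0.1, T=50)
assert abs(a[1]) == 0.0 and abs(b[1]) > 1.0  # the second sequence escapes
```

Since any two starts separated along $e_1$ cannot both be stuck, the stuck region is thin along $e_1$, which is the volume bound the proof needs.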
2 Preliminaries

In this section, we will review some well-known results on GD and AGD in the strongly convex setting, and existing results on convergence of GD to second-order stationary points.
2.1 Notation

Bold upper-case letters ($\mathbf{A}$, $\mathbf{B}$) denote matrices and bold lower-case letters ($\mathbf{x}$, $\mathbf{y}$) denote vectors. For vectors, $\|\cdot\|$ denotes the $\ell_2$-norm. For matrices, $\|\cdot\|$ denotes the spectral norm and $\lambda_{\min}(\cdot)$ denotes the minimum eigenvalue. For a function $f:\mathbb{R}^d \to \mathbb{R}$, $\nabla f$ and $\nabla^2 f$ denote its gradient and Hessian respectively, and $f^\star$ denotes its global minimum. We use $O(\cdot), \Theta(\cdot)$ to hide absolute constants, and $\tilde{O}(\cdot), \tilde{\Theta}(\cdot)$ to hide absolute constants and polylog factors for all problem parameters.
2.2 Convex Setting
To minimize a function $f:\mathbb{R}^d \to \mathbb{R}$, GD performs the following sequence of steps:
$$x_{t+1} = x_t - \eta\nabla f(x_t).$$
The suboptimality of GD and the improvement achieved by AGD can be clearly illustrated for the case of smooth and strongly convex functions.
Definition 1. A differentiable function $f$ is $\ell$-smooth (or $\ell$-gradient Lipschitz) if:
$$\|\nabla f(x_1) - \nabla f(x_2)\| \le \ell\,\|x_1 - x_2\| \quad \forall\, x_1, x_2.$$
The gradient Lipschitz property asserts that the gradient cannot change too rapidly in a small local region.
Definition 2. A twice-differentiable function $f$ is $\alpha$-strongly convex if $\lambda_{\min}(\nabla^2 f(x)) \ge \alpha$ for all $x$.
Let $f^\star := \min_x f(x)$. A point $x$ is said to be $\epsilon$-suboptimal if $f(x) \le f^\star + \epsilon$. The following theorem gives the convergence rate of GD and AGD for smooth and strongly convex functions.
Theorem 1 (Nesterov (2004)).
Assume that the function $f$ is $\ell$-smooth and $\alpha$-strongly convex. Then, for any $\epsilon > 0$, the iteration complexities to find an $\epsilon$-suboptimal point are as follows:
GD with $\eta = 1/\ell$: $O\!\left(\frac{\ell}{\alpha}\log\frac{f(x_0)-f^\star}{\epsilon}\right)$.
AGD (Algorithm 1) with $\eta = 1/\ell$ and $\theta = \sqrt{\alpha/\ell}$: $O\!\left(\sqrt{\frac{\ell}{\alpha}}\log\frac{f(x_0)-f^\star}{\epsilon}\right)$.
The number of iterations of GD depends linearly on the ratio $\ell/\alpha$, which is called the condition number of $f$ since $\alpha\mathbf{I} \preceq \nabla^2 f(x) \preceq \ell\mathbf{I}$. Clearly $\ell \ge \alpha$, and hence the condition number is always at least one. Denoting the condition number by $\kappa = \ell/\alpha$, we highlight two important aspects of AGD: (1) the momentum parameter is $1-\theta$ with $\theta = \Theta(1/\sqrt{\kappa})$, and (2) AGD improves upon GD by a factor of $\sqrt{\kappa}$.
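The $\sqrt{\kappa}$ gap is easy to observe numerically. The sketch below (our own toy experiment) runs GD and AGD, in the paper's parameterization $y_t = x_t + (1-\theta)v_t$, on an ill-conditioned quadratic with $\kappa = 100$:

```python
import numpy as np

# Quadratic with eigenvalues in [alpha, l]: l-smooth and alpha-strongly convex.
alpha, l = 1.0, 100.0
kappa = l / alpha
diag = np.linspace(alpha, l, 50)
f = lambda x: 0.5 * float(np.sum(diag * x**2))
grad = lambda x: diag * x

def run_gd(x0, eta, n):
    x = x0.copy()
    for _ in range(n):
        x = x - eta * grad(x)
    return x

def run_agd(x0, eta, theta, n):
    x, v = x0.copy(), np.zeros_like(x0)
    for _ in range(n):
        y = x + (1 - theta) * v
        x_next = y - eta * grad(y)
        x, v = x_next, x_next - x
    return x

x0 = np.ones(50)
x_gd = run_gd(x0, eta=1 / l, n=200)
x_agd = run_agd(x0, eta=1 / l, theta=1 / np.sqrt(kappa), n=200)
assert f(x_agd) < 1e-3 * f(x_gd)  # AGD is far ahead after the same budget
```

With the same step size and iteration budget, AGD's suboptimality is orders of magnitude smaller, driven entirely by the momentum term.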
2.3 Nonconvex Setting
For nonconvex functions, finding global minima is NP-hard in the worst case. The best one can hope for in this setting is convergence to stationary points. There are various levels of stationarity.
Definition 3. A point $x$ is an $\epsilon$-first-order stationary point of a differentiable function $f$ if $\|\nabla f(x)\| \le \epsilon$.
As mentioned in Section 1, for most nonconvex problems encountered in practice, a majority of first-order stationary points turn out to be saddle points. Second-order stationary points require not only zero gradient, but also positive semidefinite Hessian, ruling out most saddle points. Second-order stationary points are meaningful, however, only when the Hessian is continuous.
Definition 4. A twice-differentiable function $f$ is $\rho$-Hessian Lipschitz if:
$$\|\nabla^2 f(x_1) - \nabla^2 f(x_2)\| \le \rho\,\|x_1 - x_2\| \quad \forall\, x_1, x_2.$$
Definition 5 (Nesterov and Polyak (2006)).
For a $\rho$-Hessian Lipschitz function $f$, a point $x$ is an $\epsilon$-second-order stationary point if:
$$\|\nabla f(x)\| \le \epsilon \quad \text{and} \quad \lambda_{\min}(\nabla^2 f(x)) \ge -\sqrt{\rho\epsilon}.$$
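Definition 5 translates directly into a check, assuming the full Hessian is available for this small-scale illustration (in the Hessian-free regime one would estimate $\lambda_{\min}$ instead of forming the matrix):

```python
import numpy as np

def is_sosp(g, hess, eps, rho):
    """Check the epsilon-second-order stationarity condition of Definition 5:
    small gradient and an almost positive semidefinite Hessian."""
    lam_min = np.linalg.eigvalsh(hess)[0]  # smallest eigenvalue
    return bool(np.linalg.norm(g) <= eps and lam_min >= -np.sqrt(rho * eps))

# a strict saddle passes the first-order test but fails the second-order one
assert not is_sosp(np.zeros(2), np.diag([2.0, -2.0]), eps=0.01, rho=1.0)
assert is_sosp(np.zeros(2), np.diag([2.0, 2.0]), eps=0.01, rho=1.0)
```

The $-\sqrt{\rho\epsilon}$ tolerance couples the two conditions: a smaller gradient target also tightens the curvature requirement.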
The following theorem gives the convergence rate of a perturbed version of GD to second-order stationary points. See Jin et al. (2017) for a detailed description of the algorithm.
Theorem 2 (Jin et al., 2017).
Assume that the function $f$ is $\ell$-smooth and $\rho$-Hessian Lipschitz. Then, for any $\epsilon, \delta > 0$, perturbed GD outputs an $\epsilon$-second-order stationary point w.h.p. in the following number of iterations:
$$\tilde{O}\!\left(\frac{\ell\,(f(x_0)-f^\star)}{\epsilon^2}\right).$$
Note that this rate is essentially the same as that of GD for convergence to first-order stationary points. In particular, it only has polylogarithmic dependence on the dimension.
3 Main Result
In this section, we present our algorithm and main result. As mentioned in Section 1, the algorithm we propose is essentially AGD with two key differences (see Algorithm 2): perturbation and negative curvature exploitation (NCE). A perturbation is added when the gradient is small (to escape saddle points), and no more frequently than once in $\mathscr{T}$ steps. The perturbation is sampled uniformly from a $d$-dimensional ball with radius $r$. The specific choices of the gap $\mathscr{T}$ and the uniform distribution are for technical convenience (they are sufficient for our theoretical result but not necessary).
The certificate (2) indicates that the function has large negative curvature between the current iterates $x_t$ and $y_t$. In this case, if the momentum $v_t$ is small, then $x_t$ and $y_t$ are close, so the large negative curvature also carries over to the Hessian at $x_t$ due to the Lipschitz property. Assaying two points along $\pm v_t$ around $x_t$ gives at least one point that is negatively aligned with $\nabla f(x_t)$ and yields a decreasing function value and Hamiltonian. If the momentum $v_t$ is large, negative curvature can no longer be exploited in this way, but fortunately resetting the momentum to zero kills the second term in (1), significantly decreasing the Hamiltonian.
Setting of hyperparameters: Let $\epsilon$ be the target accuracy for a second-order stationary point, let $\ell$ and $\rho$ be the gradient/Hessian Lipschitz parameters, and let $c$ and $\chi$ be an absolute constant and a log factor to be specified later. Let $\kappa := \ell/\sqrt{\rho\epsilon}$, and set
$$\eta = \frac{1}{4\ell},\quad \theta = \frac{1}{4\sqrt{\kappa}},\quad \gamma = \frac{\theta^2}{\eta},\quad s = \frac{\gamma}{4\rho},\quad \mathscr{T} = \sqrt{\kappa}\cdot\chi c,\quad r = \eta\epsilon\cdot\chi^{-5}c^{-8}. \qquad (3)$$
The following theorem is the main result of this paper.
Theorem 3 (Main Theorem). Assume that the function $f$ is $\ell$-smooth and $\rho$-Hessian Lipschitz. There exists an absolute constant $c_{\max}$ such that for any $\delta > 0$, $\epsilon \le \frac{\ell^2}{\rho}$, $\Delta_f \ge f(x_0)-f^\star$, if $\chi = \max\{1, \log\frac{d\ell\Delta_f}{\rho\epsilon\delta}\}$, $c \ge c_{\max}$, and we run PAGD (Algorithm 2) with the choice of parameters according to (3), then with probability at least $1-\delta$, one of the iterates $x_t$ will be an $\epsilon$-second-order stationary point in the following number of iterations:
$$\tilde{O}\!\left(\frac{\ell^{1/2}\rho^{1/4}\,(f(x_0)-f^\star)}{\epsilon^{7/4}}\right).$$
Theorem 3 says that when PAGD is run for the designated number of steps (which is poly-logarithmic in dimension), at least one of the iterates is an $\epsilon$-second-order stationary point. We focus on the case of small $\epsilon$ (i.e., $\epsilon \le \ell^2/\rho$) so that the Hessian requirement for the $\epsilon$-second-order stationary point ($\lambda_{\min}(\nabla^2 f(x)) \ge -\sqrt{\rho\epsilon}$) is nontrivial. Note that $\epsilon \le \ell^2/\rho$ implies $\kappa = \ell/\sqrt{\rho\epsilon} \ge 1$, so $\kappa$ can be viewed as a condition number, akin to that in the convex setting. Comparing Theorem 3 with Theorem 2, PAGD, with a momentum parameter $\theta = \Theta(1/\sqrt{\kappa})$, achieves $\tilde{O}(1/\epsilon^{7/4})$ iteration complexity, better than the $\tilde{O}(1/\epsilon^{2})$ complexity of PGD.
Output an $\epsilon$-second-order stationary point: Although Theorem 3 only guarantees that one of the iterates is an $\epsilon$-second-order stationary point, it is straightforward to identify one of them by adding a proper termination condition: once the gradient is small and satisfies the precondition to add a perturbation, we can keep track of the point $\tilde{x}$ prior to adding the perturbation, and compare the Hamiltonian at $\tilde{x}$ with the one $\mathscr{T}$ steps after. If the Hamiltonian decreases by at least $\mathscr{E}$, then the algorithm has made progress; otherwise $\tilde{x}$ is an $\epsilon$-second-order stationary point according to Lemma 8. Doing so will add a hyperparameter (the threshold $\mathscr{E}$) but does not increase the asymptotic complexity.
4 Overview of Analysis
In this section, we will present an overview of the proof of Theorem 3. Section 4.1 presents the Hamiltonian for AGD and its key property of monotonic decrease. This leads to Section 4.2 where the improve-or-localize lemma is stated, as well as the main intuition behind acceleration. Section 4.3 demonstrates how to apply these tools to prove Theorem 3. Complete details can be found in the appendix.
4.1 Hamiltonian

While GD guarantees decrease of function value in every step (even for nonconvex problems), the biggest stumbling block to analyzing AGD is that it is less clear how to keep track of “progress.” Known Lyapunov functions for AGD (Wilson et al., 2016) are restricted to the convex setting and furthermore are not computable by the algorithm (as they depend on $x^\star$).
To deepen the understanding of AGD in a nonconvex setting, we inspect it from a dynamical systems perspective, where we fix the ratio $\theta/\sqrt{\eta}$ to be a constant, while letting $\eta \to 0$. This leads to an ODE which is the continuous limit of AGD (Su et al., 2016):
$$\ddot{x} + \theta\dot{x} + \nabla f(x) = 0, \qquad (4)$$
where $\dot{x}$ and $\ddot{x}$ are derivatives with respect to time $t$ (and $\theta$ here denotes the limiting ratio). This equation is a second-order dynamical equation with dissipative force $-\theta\dot{x}$. Integrating both sides, we obtain:
$$f(x(t_2)) + \frac{1}{2}\|\dot{x}(t_2)\|^2 = f(x(t_1)) + \frac{1}{2}\|\dot{x}(t_1)\|^2 - \theta\int_{t_1}^{t_2}\|\dot{x}(t)\|^2\,\mathrm{d}t. \qquad (5)$$
Using physical language, $f(x)$ is a potential energy while $\frac{1}{2}\|\dot{x}\|^2$ is a kinetic energy, and the sum is a Hamiltonian. The integral shows that the Hamiltonian decreases monotonically with time $t$, and the decrease is given by the dissipation term $\theta\int_{t_1}^{t_2}\|\dot{x}(t)\|^2\mathrm{d}t$. Note that (5) holds regardless of the convexity of $f$. This monotonic decrease of the Hamiltonian can in fact be extended to the discretized version of AGD when the function is convex, or mildly nonconvex:
Lemma 4 (Hamiltonian decreases monotonically). Assume that the function $f$ is $\ell$-smooth, and set the learning rate $\eta \le \frac{1}{2\ell}$ and $\theta \in [2\eta\gamma, \frac{1}{2}]$ in AGD (Algorithm 1). Then, for every iteration $t$ in which (2) does not hold, we have:
$$E_{t+1} \le E_t - \frac{\theta}{2\eta}\|v_t\|^2 - \frac{\eta}{4}\|\nabla f(y_t)\|^2. \qquad (6)$$
Denote the discrete Hamiltonian as $E_t := f(x_t) + \frac{1}{2\eta}\|v_t\|^2$, and note that in AGD, $v_t = x_t - x_{t-1}$. Lemma 4 tolerates nonconvexity with curvature at most $\gamma = \Theta(\theta^2/\eta)$. Unfortunately, when the function becomes too nonconvex in certain regions (so that (2) holds), the analogy between the continuous and discretized versions breaks and (6) no longer holds. In fact, standard AGD can even increase the Hamiltonian in this regime (see Appendix A.1 for more details). This motivates us to modify the algorithm by adding the NCE step, which addresses this issue. We have the following result:

Lemma 5. Assume that $f$ is $\ell$-smooth and $\rho$-Hessian Lipschitz. If we perform the NCE step (Algorithm 3) at iteration $t$, with parameters as in (3), then:
$$E_{t+1} \le E_t - \min\left\{\frac{s^2}{2\eta},\ \frac{\gamma s^2}{4}\right\}.$$
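As a sanity check of the monotonic decrease (a toy convex quadratic of our own choosing, not an instance from the paper), the discrete Hamiltonian $E_t = f(x_t) + \frac{1}{2\eta}\|v_t\|^2$ is nonincreasing along AGD iterates:

```python
import numpy as np

# Convex quadratic with eigenvalues in (0, 1], so l = 1 and eta <= 1/(2l).
diag = np.linspace(0.1, 1.0, 10)
f = lambda x: 0.5 * float(np.sum(diag * x**2))
grad = lambda x: diag * x

eta, theta = 0.25, 0.1
x, v = np.ones(10), np.zeros(10)
energies = []
for _ in range(100):
    energies.append(f(x) + float(np.sum(v**2)) / (2 * eta))
    y = x + (1 - theta) * v
    x_next = y - eta * grad(y)
    x, v = x_next, x_next - x

# the function value itself oscillates, but the Hamiltonian never increases
assert all(e1 <= e0 + 1e-12 for e0, e1 in zip(energies, energies[1:]))
```

The kinetic term is exactly what absorbs the oscillation of $f(x_t)$ that makes a direct function-value analysis of AGD fail.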
4.2 Improve or Localize
One significant challenge in the analysis of gradient-based algorithms for nonconvex optimization is that many phenomena—for instance the accumulation of momentum and the escape from saddle points via perturbation—are multiple-step behaviors; they do not happen in each step. We address this issue by developing a general technique for analyzing the long-term behavior of such algorithms.
In our case, to track the long-term behavior of AGD, one key observation from Lemma 4 is that the amount of progress actually relates to movement of the iterates, which leads to the following improve-or-localize lemma:
Corollary 6 (Improve or localize). Under the same setting as in Lemma 4, if (2) does not hold for all steps in $[t, t+T]$, we have:
$$\sum_{\tau=t+1}^{t+T}\|x_\tau - x_{\tau-1}\|^2 \le \frac{2\eta}{\theta}\,(E_t - E_{t+T}).$$
Corollary 6 says that the algorithm either makes progress in terms of the Hamiltonian, or the iterates do not move much. In the second case, Corollary 6 allows us to approximate the dynamics of the iterates with the dynamics on a quadratic approximation of $f$.
The acceleration phenomenon is rooted in and can be seen clearly for a quadratic, where the function can be decomposed into eigen-directions. Consider an eigen-direction with eigenvalue $\lambda$ and linear term $b$ (i.e., in this direction $f(x) = \frac{\lambda}{2}x^2 + bx$). The GD update becomes $x_{t+1} = (1-\eta\lambda)x_t - \eta b$, with $(1-\eta\lambda)$ determining the rate of GD. The update of AGD is $(x_{t+1}, x_t)^\top = \mathbf{A}(x_t, x_{t-1})^\top - \eta b\,(1, 0)^\top$ with the matrix $\mathbf{A}$ defined as follows:
$$\mathbf{A} = \begin{pmatrix} (2-\theta)(1-\eta\lambda) & -(1-\theta)(1-\eta\lambda) \\ 1 & 0 \end{pmatrix}.$$
The rate of AGD is determined by the largest (in magnitude) eigenvalue of the matrix $\mathbf{A}$, which we denote by $\mu(\lambda)$. Recall the choice of parameters (3), and divide the eigen-directions into the following three categories.
Strongly convex directions ($\lambda \in [\sqrt{\rho\epsilon}, \ell]$): the slowest case is $\lambda = \sqrt{\rho\epsilon}$, where the rate of GD is $(1-\eta\lambda)$ while the rate of AGD is $(1-\Theta(\sqrt{\eta\lambda}))$, which results in AGD converging faster than GD.
Flat directions ($\lambda \in [-\sqrt{\rho\epsilon}, \sqrt{\rho\epsilon}]$): the representative case is $\lambda = 0$, where the AGD update becomes $x_{t+1} - x_t = (1-\theta)(x_t - x_{t-1}) - \eta b$. For a fixed number of steps $t$, we have $x_t - x_0 = -\Theta(\eta b t)$ for GD while $x_t - x_0 = -\Theta(\eta b t^2)$ for AGD, which results in AGD moving along negative gradient directions faster than GD.
Strongly nonconvex directions ($\lambda \in [-\ell, -\sqrt{\rho\epsilon}]$): similar to the strongly convex case, the slowest rate is for $\lambda = -\sqrt{\rho\epsilon}$, where the rate of GD is $(1+\eta|\lambda|)$ while the rate of AGD is $(1+\Theta(\sqrt{\eta|\lambda|}))$, which results in AGD escaping the saddle point faster than GD.
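These per-direction rates can be read off the spectral radius of the AGD transition matrix. The snippet below is our own numerical check; for illustration the momentum is tuned as $\theta = \sqrt{\eta\lambda}$ for the curvature of interest, rather than by (3). It compares GD and AGD rates on a slow strongly convex direction and on its negative-curvature mirror:

```python
import numpy as np

def agd_rate(lam, eta, theta):
    """Spectral radius of the 2x2 AGD transition matrix for eigenvalue lam."""
    a = 1 - eta * lam
    A = np.array([[(2 - theta) * a, -(1 - theta) * a], [1.0, 0.0]])
    return max(abs(np.linalg.eigvals(A)))

eta, lam = 0.25, 0.01
theta = np.sqrt(eta * lam)          # momentum tuned to this curvature
# convergence: AGD contracts faster than GD's (1 - eta*lam) per step
assert agd_rate(lam, eta, theta) < 1 - eta * lam
# escape: for -lam both rates exceed 1, but AGD's expansion factor is larger
assert agd_rate(-lam, eta, theta) > 1 + eta * lam
```

The gap between $1-\eta\lambda$ and $1-\Theta(\sqrt{\eta\lambda})$ is exactly the per-direction origin of the overall $\sqrt{\kappa}$ speedup.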
Finally, the approximation error (from a quadratic) is also under control in this framework. With an appropriate choice of $\mathscr{T}$ and threshold for $E_0 - E_{\mathscr{T}}$ in Corollary 6, by the Cauchy–Schwarz inequality we can restrict the iterates to all lie within a local ball around $x_0$ with radius $O(\sqrt{\epsilon/\rho})$, where both the gradient and Hessian of $f$ and its quadratic approximation are close:
Assume $f$ is $\rho$-Hessian Lipschitz. Then for all $x$ such that $\|x - x_0\| \le R$, we have $\|\nabla f(x) - \nabla f_{x_0}(x)\| \le \frac{\rho}{2}R^2$ and $\|\nabla^2 f(x) - \nabla^2 f(x_0)\| \le \rho R$, where $f_{x_0}$ denotes the quadratic approximation of $f$ at $x_0$.
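A one-dimensional sanity check (our own example): $f(x) = x^3$ has $f''(x) = 6x$, hence $\rho = 6$, and its gradient and Hessian deviate from the quadratic approximation at $x_0 = 0$ by at most $\frac{\rho}{2}R^2$ and $\rho R$ on $|x - x_0| \le R$:

```python
import numpy as np

# f(x) = x^3: gradient 3x^2, Hessian 6x, so the Hessian is 6-Lipschitz.
f_grad = lambda x: 3 * x**2
f_hess = lambda x: 6 * x
rho, x0, R = 6.0, 0.0, 0.2

for x in np.linspace(x0 - R, x0 + R, 41):
    # gradient of the quadratic approximation of f at x0
    quad_grad = f_grad(x0) + f_hess(x0) * (x - x0)
    assert abs(f_grad(x) - quad_grad) <= rho / 2 * (x - x0) ** 2 + 1e-12
    assert abs(f_hess(x) - f_hess(x0)) <= rho * abs(x - x0) + 1e-12
```

For this cubic the gradient bound holds with equality, so the $\frac{\rho}{2}R^2$ rate cannot be improved in general.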
4.3 Main Framework
For simplicity of presentation, recall $\mathscr{T} := \sqrt{\kappa}\cdot\chi c$ and denote $\mathscr{E} := \sqrt{\frac{\epsilon^3}{\rho}}\,\chi^{-5}c^{-7}$, where $c$ is a sufficiently large constant as in Theorem 3. Our overall proof strategy will be to show the following “average descent claim”: Algorithm 2 decreases the Hamiltonian by $\mathscr{E}$ in every set of $\mathscr{T}$ iterations as long as it does not reach an $\epsilon$-second-order stationary point. Since the Hamiltonian cannot decrease more than $E_0 - E^\star = f(x_0) - f^\star$, this immediately shows that it has to reach an $\epsilon$-second-order stationary point in $O\!\left(\frac{(f(x_0)-f^\star)\,\mathscr{T}}{\mathscr{E}}\right)$ steps, proving Theorem 3.
It can be verified by the choice of parameters (3) and Lemma 5 that whenever (2) holds so that NCE is triggered, the Hamiltonian decreases by at least $\mathscr{E}$ in one step. So, if an NCE step is performed even once in each round of $\mathscr{T}$ steps, we achieve enough average decrease. The troublesome case is when, in some time interval of $\mathscr{T}$ steps starting with $x_0$, only AGD steps are performed without NCE. If $x_0$ is not an $\epsilon$-second-order stationary point, either the gradient is large or the Hessian has a large negative eigenvalue. We prove the average decrease claim by considering these two cases.
Lemma 7 (Large gradient). If $\|\nabla f(x_\tau)\| \ge \epsilon$ for all $\tau \in [0, \mathscr{T}]$, then by running Algorithm 2 we have $E_{\mathscr{T}} - E_0 \le -\mathscr{E}$.
Lemma 8 (Negative curvature). If $\|\nabla f(x_0)\| \le \epsilon$, $\lambda_{\min}(\nabla^2 f(x_0)) < -\sqrt{\rho\epsilon}$, and a perturbation has not been added in the last $\mathscr{T}$ steps, then by running Algorithm 2 we have $E_{\mathscr{T}} - E_0 \le -\mathscr{E}$ with high probability.
We note that an important aspect of these two lemmas is that the Hamiltonian decreases by $\mathscr{E}$ in $\mathscr{T} = \tilde{O}(\sqrt{\kappa})$ steps, which is faster compared to PGD, which decreases the function value by a comparable amount only in $\tilde{O}(\kappa)$ steps (Jin et al., 2017). That is, the acceleration phenomenon in PAGD happens in both cases. We also stress that under both of these settings, PAGD cannot achieve an $\mathscr{E}/\mathscr{T}$ decrease in each step; it has to accumulate momentum over time to achieve this amortized decrease.
Large Gradient Scenario
For AGD, gradient and momentum interact, and both play important roles in the dynamics. Fortunately, according to Lemma 4, the Hamiltonian decreases sufficiently whenever the momentum is large; so it is sufficient to discuss the case where the momentum is small.
One difficulty in proving Lemma 7 lies in enforcing the precondition that the gradients of all iterates are large, even under quadratic approximation. Intuitively, we hope that the large initial gradient suffices to give a sufficient decrease of the Hamiltonian. Unfortunately, this is not true. Let $\mathfrak{S}$ be the subspace of eigenvectors of $\nabla^2 f(x_0)$ with eigenvalues in $[\sqrt{\rho\epsilon}, \ell]$, consisting of all the strongly convex directions, and let $\mathfrak{S}^c$ be the orthogonal subspace. It turns out that the initial gradient component in $\mathfrak{S}$ is not very helpful in decreasing the Hamiltonian since AGD rapidly decreases the gradient in these directions. We instead prove Lemma 7 in two steps.
Lemma 9 (informal). If $\|v_0\|$ is small and $E_0 - E_{\mathscr{T}/2}$ is not too large, then for all $t \in [\mathscr{T}/4, \mathscr{T}/2]$ we have $\|\mathcal{P}_{\mathfrak{S}}\nabla f(x_t)\|$ small, where $\mathcal{P}_{\mathfrak{S}}$ denotes the projection onto $\mathfrak{S}$.
Lemma 10 (informal). If $\|v_0\|$ is small and $\|\mathcal{P}_{\mathfrak{S}^c}\nabla f(x_0)\|$ is large, then we have $E_{\mathscr{T}} - E_0 \le -\mathscr{E}$.
See the formal versions, Lemma 16 and Lemma 17, for more details. We see that if the Hamiltonian does not decrease much (and so the iterates are localized in a small ball), the gradient in the strongly convex subspace $\mathfrak{S}$ vanishes in $\mathscr{T}/2$ steps by Lemma 9. Since the hypothesis of Lemma 7 guarantees a large gradient for all of the $\mathscr{T}$ steps, this means that the gradient component in $\mathfrak{S}^c$ is large after $\mathscr{T}/2$ steps, thereby decreasing the Hamiltonian in the next $\mathscr{T}/2$ steps (by Lemma 10).
Negative Curvature Scenario
In this section, we will show that the volume of the set around a strict saddle point from which AGD does not escape quickly is very small (Lemma 8). We do this using the coupling mechanism introduced in Jin et al. (2017), which gives a fine-grained understanding of the geometry around saddle points. More concretely, letting the perturbation radius $r$ be as specified in (3), we show the following lemma.
Lemma 11 (informal). Suppose $\|\nabla f(\tilde{x})\| \le \epsilon$ and $\lambda_{\min}(\nabla^2 f(\tilde{x})) \le -\sqrt{\rho\epsilon}$. Let $w$ and $w'$ be at distance at most $r$ from $\tilde{x}$, with $w' = w + \mu r e_1$, where $e_1$ is the minimum eigen-direction of $\nabla^2 f(\tilde{x})$ and $\mu \ge \delta/(2\sqrt{d})$. Then for AGD starting at $w$ and $w'$, we have:
$$\min\{E_{\mathscr{T}} - E_0,\; E'_{\mathscr{T}} - E'_0\} \le -\mathscr{E},$$
where $E_t$ and $E'_t$ are the Hamiltonians of the sequences starting at $w$ and $w'$, respectively.
See the formal version in Lemma 18. We note that $\delta$ in the above lemma is a small number characterizing the failure probability of the algorithm (as defined in Theorem 3), and $\mathscr{T}$ has logarithmic dependence on $\delta$ according to (3). Lemma 11 says that around any strict saddle, for any two points that are separated along the minimum eigen-direction $e_1$ by at least $\delta r/(2\sqrt{d})$, PAGD, starting from at least one of those points, decreases the Hamiltonian, and hence escapes the strict saddle. This implies that the stuck region has width at most $\delta r/(2\sqrt{d})$ along $e_1$, and thus has small volume.
In this paper, we show that a variant of AGD can escape saddle points faster than GD, demonstrating that momentum techniques can indeed accelerate convergence even for nonconvex optimization. Our algorithm finds an $\epsilon$-second-order stationary point in $\tilde{O}(1/\epsilon^{7/4})$ iterations, faster than the $\tilde{O}(1/\epsilon^{2})$ iterations taken by GD. This is the first algorithm that is both Hessian-free and single-loop that achieves this rate. Our analysis relies on novel techniques that lead to a better understanding of momentum techniques as well as nonconvex optimization.
The results here also give rise to several questions. The first concerns lower bounds; is the rate of $\tilde{O}(1/\epsilon^{7/4})$ that we have established here optimal for gradient-based methods under the setting of gradient and Hessian Lipschitz functions? We believe this upper bound is very likely sharp up to log factors, and developing a tight algorithm-independent lower bound will be necessary to settle this question. The second is whether the negative-curvature-exploitation component of our algorithm is actually necessary for the fast rate. To attempt to answer this question, we may either explore other ways to track the progress of standard AGD (other than the particular Hamiltonian that we have presented here), or consider other discretizations of the ODE (4) so that the property (5) is preserved even in the most nonconvex region. A final direction for future research is the extension of our results to the finite-sum setting and the stochastic setting.
Appendix A Proof of Hamiltonian Lemmas
In this section, we prove Lemma 4, Lemma 5 and Corollary 6, which are presented in Section 4.1 and Section 4.2. In Section A.1 we also give an example where standard AGD (without negative curvature exploitation) can increase the Hamiltonian.
Recall that we define the Hamiltonian as $E_t := f(x_t) + \frac{1}{2\eta}\|v_t\|^2$, where, for AGD, we define $v_t := x_t - x_{t-1}$. The first lemma shows that this Hamiltonian decreases in every step of AGD for mildly nonconvex functions.
Lemma 4 (Hamiltonian decreases monotonically). Assume that the function $f$ is $\ell$-smooth, and set the learning rate $\eta \le \frac{1}{2\ell}$ and $\theta \in [2\eta\gamma, \frac{1}{2}]$ in AGD (Algorithm 1). Then, for every iteration $t$ in which (2) does not hold, we have:
$$E_{t+1} \le E_t - \frac{\theta}{2\eta}\|v_t\|^2 - \frac{\eta}{4}\|\nabla f(y_t)\|^2.$$
Recall that the update equation of accelerated gradient descent has the following form:
$$y_t = x_t + (1-\theta)v_t, \qquad x_{t+1} = y_t - \eta\nabla f(y_t), \qquad v_{t+1} = x_{t+1} - x_t.$$
By smoothness, with $\eta \le \frac{1}{2\ell}$:
$$f(x_{t+1}) \le f(y_t) - \eta\Big(1 - \frac{\eta\ell}{2}\Big)\|\nabla f(y_t)\|^2 \le f(y_t) - \frac{3\eta}{4}\|\nabla f(y_t)\|^2,$$
assuming that the precondition (2) does not hold, i.e.,
$$f(x_t) \ge f(y_t) + \langle\nabla f(y_t), x_t - y_t\rangle - \frac{\gamma}{2}\|x_t - y_t\|^2,$$
and given the following update equation:
$$y_t - x_t = (1-\theta)v_t.$$
The last inequality uses the fact that $\theta \in [2\eta\gamma, \frac{1}{2}]$, so that $\eta\gamma \le \frac{\theta}{2}$ and $(1-\theta)^2 \le 1$. We substitute in the definitions of $v_{t+1}$ and $E_t$ to finish the proof. ∎
We see from this proof that (8) relies on approximate convexity of $f$, which explains why, in all existing proofs, the convexity between $x_t$ and $y_t$ is so important. A perhaps surprising fact to note is that the above proof can in fact go through even with mild nonconvexity (captured by the certificate (2) used in Algorithm 2). Thus, high nonconvexity is the problematic situation. To overcome this, we need to slightly modify AGD so that the Hamiltonian is decreasing. This is formalized in the following lemma.
When we perform an NCE step, we know that (2) holds. In the first case ($\|v_t\| \ge s$), we set $x_{t+1} = x_t$ and set the momentum $v_{t+1}$ to zero, which gives:
$$E_{t+1} = f(x_t) = E_t - \frac{1}{2\eta}\|v_t\|^2 \le E_t - \frac{s^2}{2\eta}.$$
In the second case ($\|v_t\| < s$), expanding $f(x_t \pm \delta)$ in a Taylor series with Lagrange remainder, we have:
$$f(x_t \pm \delta) = f(x_t) \pm \langle\nabla f(x_t), \delta\rangle + \frac{1}{2}\delta^\top \nabla^2 f(\zeta_\pm)\,\delta,$$
where $\zeta_\pm$ lies on the segment between $x_t$ and $x_t \pm \delta$, and $\delta = s\,v_t/\|v_t\|$. Due to the certificate (2) we have
$$\frac{v_t^\top \nabla^2 f(\zeta')\,v_t}{\|v_t\|^2} \le -\gamma$$
for some $\zeta'$ on the segment between $x_t$ and $y_t$.
On the other hand, clearly $\min\{\langle\nabla f(x_t), \delta\rangle,\ \langle\nabla f(x_t), -\delta\rangle\} \le 0$. WLOG, suppose $\langle\nabla f(x_t), \delta\rangle \le 0$; then, by the definition of $x_{t+1}$, we have:
$$f(x_{t+1}) \le f(x_t + \delta) \le f(x_t) + \frac{1}{2}\delta^\top \nabla^2 f(\zeta_+)\,\delta,$$
where $\zeta_+ = x_t + \phi\delta$ for some $\phi \in [0, 1]$. Since $\delta$ is parallel to $v_t$, the segment from $x_t$ to $\zeta_+$ also lines up with $v_t$; combining with the Hessian Lipschitz property and $s = \gamma/(4\rho)$:
$$\frac{\delta^\top \nabla^2 f(\zeta_+)\,\delta}{\|\delta\|^2} \le -\gamma + \rho\,(\|\delta\| + \|v_t\|) \le -\frac{\gamma}{2}.$$
Therefore, since the momentum is also reset to zero, this gives
$$E_{t+1} = f(x_{t+1}) \le f(x_t) - \frac{\gamma}{4}\|\delta\|^2 \le E_t - \frac{\gamma s^2}{4},$$
which finishes the proof. ∎
The Hamiltonian decrease has an important consequence: if the Hamiltonian does not decrease much, then all the iterates are localized in a small ball around the starting point. Moreover, the iterates do not oscillate much in this ball. We call this the improve-or-localize phenomenon.
Corollary 6 (Improve or localize). Under the same setting as in Lemma 4, if (2) does not hold for all steps in $[t, t+T]$, we have:
$$\sum_{\tau=t+1}^{t+T}\|x_\tau - x_{\tau-1}\|^2 \le \frac{2\eta}{\theta}\,(E_t - E_{t+T}).$$
The proof follows immediately from telescoping the argument of Lemma 4. ∎
A.1 AGD can increase the Hamiltonian under nonconvexity
When the function is highly nonconvex between $x_t$ and $y_t$, the monotonic decrease of the Hamiltonian may no longer hold; indeed, AGD can increase the Hamiltonian in those steps.
Consider a simple one-dimensional example, $f(x) = -\frac{1}{2}x^2$, for which (2) always holds. Define the initial condition $x_0 = 1$, $v_0 = -1$. By the update equation in Algorithm 1, the next iterate will be $x_1 = (1+\eta)\big(x_0 + (1-\theta)v_0\big)$, and $v_1 = x_1 - x_0$. By the definition of the Hamiltonian, we have:
$$E_0 = f(x_0) + \frac{1}{2\eta}|v_0|^2, \qquad E_1 = f(x_1) + \frac{1}{2\eta}|v_1|^2,$$
since $f$ has constant curvature $-1$. It is not hard to verify that when the momentum points against the iterate and $\theta$ is small (e.g., $\eta = 1/4$, $\theta = 1/10$), we will have $E_1 > E_0$; that is, the Hamiltonian increases in this step.
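This increase is easy to reproduce numerically (our own concrete instance: $f(x) = -x^2/2$ with $\eta = 1/4$, $\theta = 1/10$, and the momentum pointing against the iterate):

```python
# One-dimensional f(x) = -x^2/2 (curvature -1, so (2) holds everywhere).
eta, theta = 0.25, 0.1
f = lambda x: -0.5 * x**2
grad = lambda x: -x

x, v = 1.0, -1.0                 # momentum pointing against the position
E0 = f(x) + v**2 / (2 * eta)     # E0 = -0.5 + 2.0 = 1.5
y = x + (1 - theta) * v          # 0.1
x1 = y - eta * grad(y)           # 0.125
v1 = x1 - x                      # -0.875
E1 = f(x1) + v1**2 / (2 * eta)   # about 1.5234
assert E1 > E0                   # the Hamiltonian increases in this AGD step
```

The gradient step expands the iterate away from the origin while the kinetic term barely shrinks, so the sum grows, which is exactly the failure mode the NCE step is designed to remove.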
This fact implies that when we pick a large learning rate and a small momentum parameter $\theta$ (both of which are essential for acceleration), standard AGD does not decrease the Hamiltonian in a very nonconvex region. We need another mechanism, such as NCE, to restore the monotonic decrease of the Hamiltonian.
Appendix B Proof of Main Result
In this section, we set up the machinery needed to prove our main result, Theorem 3. We first present the generic setup, then, as in Section 4.3, we split the proof into two cases, one where gradient is large and the other where the Hessian has negative curvature. In the end, we put everything together and prove Theorem 3.
To simplify the proof, we introduce some notation for this section, and state a convention regarding absolute constants. Recall the choice of parameters in Eq.(3):