Escaping Saddles with Stochastic Gradients
Abstract
We analyze the variance of stochastic gradients along negative curvature directions in certain nonconvex machine learning models and show that stochastic gradients exhibit a strong component along these directions. Furthermore, we show that, contrary to the case of isotropic noise, this variance is proportional to the magnitude of the corresponding eigenvalues and does not decrease with the dimensionality. Based upon this observation we propose a new assumption under which we show that the injection of explicit, isotropic noise usually applied to make gradient descent escape saddle points can successfully be replaced by a simple SGD step. Additionally, and under the same condition, we derive the first convergence rate for plain SGD to a second-order stationary point in a number of iterations that is independent of the problem dimension.
1 Introduction
In this paper we analyze the use of gradient descent (GD) and its stochastic variant (SGD) to minimize objectives of the form

(1)  f(w) := E_{z∼P}[f(w, z)],

where f(·, z) is a not necessarily convex loss function and P is a probability distribution.
In the era of big data and deep neural networks, (stochastic) gradient descent is a core component of many training algorithms [Bottou(2010)]. What makes SGD so attractive is its simplicity, its seemingly universal applicability and a convergence rate that is independent of the size of the training set. One specific trait of SGD is the inherent noise, originating from sampling training points, whose variance has to be controlled in order to guarantee convergence, either through a conservative step size [Nesterov(2013)] or via explicit variance-reduction techniques [Johnson & Zhang(2013)Johnson and Zhang].
While the convergence behavior of SGD is well-understood for convex functions [Bottou(2010)], we are here interested in the optimization of nonconvex functions, which pose additional challenges, in particular due to the presence of saddle points and suboptimal local minima [Dauphin et al.(2014)Dauphin, Pascanu, Gulcehre, Cho, Ganguli, and Bengio, Choromanska et al.(2015)Choromanska, Henaff, Mathieu, Arous, and LeCun]. For example, finding the global minimum of even a degree-4 polynomial can be NP-hard [Hillar & Lim(2013)Hillar and Lim]. Instead of aiming for a global minimizer, a more practical goal is to search for a local optimum of the objective. In this paper we thus focus on reaching a second-order stationary point of smooth nonconvex functions. Formally, we aim to find an (ε_g, ε_H)-second-order stationary point w such that the following conditions hold:
(2)  ‖∇f(w)‖ ≤ ε_g  and  λ_min(∇²f(w)) ≥ −ε_H,

where ε_g, ε_H > 0.
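As a toy numerical check of condition (2), the following sketch (with an illustrative helper name) classifies a point given its gradient and Hessian; a saddle passes the first-order test but fails the second-order one:

```python
import numpy as np

def is_second_order_stationary(grad, hess, eps_g, eps_h):
    # First-order condition: small gradient norm.
    if np.linalg.norm(grad) > eps_g:
        return False
    # Second-order condition: no strongly negative curvature.
    return np.linalg.eigvalsh(hess).min() >= -eps_h

# Saddle of f(w) = w0^2 - w1^2 at the origin: zero gradient, negative curvature.
w = np.zeros(2)
grad = np.array([2 * w[0], -2 * w[1]])
hess = np.diag([2.0, -2.0])
print(is_second_order_stationary(grad, hess, 1e-3, 1e-3))  # False
```

At a strict local minimum (e.g. Hessian equal to the identity) the same check returns True.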
Existing work, such as [Ge et al.(2015)Ge, Huang, Jin, and Yuan, Jin et al.(2017a)Jin, Ge, Netrapalli, Kakade, and Jordan], proved convergence to a point satisfying Eq. (2) for modified variants of gradient descent and its stochastic variant, requiring additional noise to be explicitly added to the iterates either along the entire path (former) or whenever the gradient is sufficiently small (latter). Formally, this yields the following update steps for the perturbed GD and SGD versions:
(3)  w_{t+1} = w_t − η (∇f(w_t) + ζ_t)  (perturbed GD)

(4)  w_{t+1} = w_t − η (∇f_{z_t}(w_t) + ζ_t)  (perturbed SGD)

where ζ_t is typically zero-mean noise sampled uniformly from a unit sphere.
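A minimal sketch of one perturbed step in the style of Eq. (3), with the noise ζ drawn uniformly from the unit sphere (the function and step size here are illustrative placeholders):

```python
import numpy as np

def perturbed_gd_step(w, grad_f, eta, rng):
    # zeta: zero-mean noise sampled uniformly from the unit sphere.
    zeta = rng.standard_normal(w.shape)
    zeta /= np.linalg.norm(zeta)          # normalize to a uniform direction
    return w - eta * (grad_f(w) + zeta)

rng = np.random.default_rng(0)
w = np.array([1.0, 1.0])
grad_f = lambda w: 2 * w                  # gradient of f(w) = ||w||^2
w_next = perturbed_gd_step(w, grad_f, 0.1, rng)
```

Since ‖ζ‖ = 1, the perturbed iterate deviates from the plain GD iterate by exactly η in Euclidean norm.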
Isotropic noise  The perturbed variants of GD and SGD in Eqs. (3)-(4) have been analyzed for the case where the added noise is isotropic [Ge et al.(2015)Ge, Huang, Jin, and Yuan, Levy(2016), Jin et al.(2017a)Jin, Ge, Netrapalli, Kakade, and Jordan] or at least exhibits a certain amount of variance along all directions in R^d [Ge et al.(2015)Ge, Huang, Jin, and Yuan]. As shown in Table 1, an immediate consequence of such conditions is that they introduce a dependency on the input dimension d in the convergence rate. Furthermore, it is unknown as of today whether this condition is satisfied by the intrinsic noise of vanilla SGD for any specific class of machine learning models. Recent empirical observations show that this is not the case for training neural networks [Chaudhari & Soatto(2017)Chaudhari and Soatto].
Algorithm  First-order Complexity  Second-order Complexity  Dependency on d
Perturbed SGD [Ge et al.(2015)Ge, Huang, Jin, and Yuan]  poly
SGLD [Zhang et al.(2017)Zhang, Liang, and Charikar]  poly
PGD [Jin et al.(2017a)Jin, Ge, Netrapalli, Kakade, and Jordan]  polylog
SGD+NEON [Xu & Yang(2017)Xu and Yang]  polylog
CNC-GD (Algorithm 2)  free
CNC-SGD (Algorithm 4)  free
In this work, we therefore turn our attention to the following question. Do we need to perturb iterates along all dimensions in order for (S)GD to converge to a second-order stationary point? Or is it enough to simply rely on the inherent variance of SGD induced by sampling? More than a purely theoretical exercise, this question has very important practical implications, since in practice the vast majority of existing SGD methods do not add additional noise and therefore do not meet the requirement of isotropic noise. Thus we instead focus our attention on a less restrictive condition for which perturbations only have a guaranteed variance along directions of negative curvature of the objective, i.e. along the eigenvector(s) associated with the minimum eigenvalue of the Hessian. Instead of explicitly adding noise as done in Eqs. (3) and (4), we will from now on consider the simple SGD step:

(5)  w_{t+1} = w_t − η ∇f_{z_t}(w_t),

and propose the following sufficient condition on the stochastic gradient to guarantee convergence to a second-order stationary point.
Assumption 1 (Correlated Negative Curvature (CNC)).
Let v_w be the eigenvector corresponding to the minimum eigenvalue of the Hessian matrix ∇²f(w). The stochastic gradient ∇f_z(w) satisfies the CNC assumption if the second moment of its projection along the direction v_w is uniformly bounded away from zero, i.e.

(6)  E[⟨v_w, ∇f_z(w)⟩²] ≥ γ > 0.
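The CNC quantity of Eq. (6) can be estimated by Monte Carlo. The following toy construction is entirely hypothetical (it is not one of the paper's models): the mean gradient vanishes, yet the sampling noise carries a strong component along the negative-curvature direction v, so the estimated second moment stays bounded away from zero:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 100
v = np.zeros(d); v[0] = 1.0               # direction of most negative curvature

def stoch_grad():
    # Per-sample gradient: unit-scale noise along v, tiny noise elsewhere.
    return rng.standard_normal() * v + 0.01 * rng.standard_normal(d)

# Monte-Carlo estimate of E[<v, grad f_z(w)>^2] (close to 1 here).
samples = [float((stoch_grad() @ v) ** 2) for _ in range(20000)]
gamma_hat = float(np.mean(samples))
print(gamma_hat)
```

In contrast, replacing the stochastic gradients by isotropic unit-sphere noise would drive the same estimate down to 1/d.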
Contributions  Our contribution is fourfold: First, we analyze the convergence of GD perturbed by SGD steps (Algorithm 2). Under the CNC assumption, we demonstrate that this method converges to a second-order stationary point (in the sense of Eq. (2)) in a number of iterations polynomial in the accuracy, and with high probability. Second, we prove that vanilla SGD, as stated in Algorithm 4 and again under Assumption 1, also converges to a second-order stationary point in a polynomial number of iterations with high probability. To the best of our knowledge, this is the first second-order convergence result for SGD without adding additional noise. One important consequence of not relying on isotropic noise is that the rate of convergence becomes independent of the input dimension d. This can be a very significant practical advantage when optimizing deep neural networks that contain millions of trainable parameters.
Third, we prove that stochastic gradients satisfy Assumption 1 in the setting of learning halfspaces, which is ubiquitous in machine learning. Finally, we provide experimental evidence suggesting the validity of this condition for training neural networks. In particular we show that, while the variance of uniform noise along eigenvectors corresponding to the most negative eigenvalue decreases as 1/d, stochastic gradients have a significant component along this direction, independent of the width and depth of the neural net. When looking at the entire eigenspectrum, we find that this variance increases with the magnitude of the associated eigenvalues. Hereby, we contribute to a better understanding of the success of training deep nets with SGD and its extensions.
2 Background & Related work
Reaching a 1st-order stationary point  For smooth functions, a first-order stationary point satisfying ‖∇f(w)‖ ≤ ε can be reached by GD and SGD in O(ε^{-2}) and O(ε^{-4}) iterations, respectively [Nesterov(2013)].
Reaching a 2nd-order stationary point  In order to reach second-order stationary points, existing first-order techniques rely on explicitly adding isotropic noise with a known variance (see Eq. (3)). The key motivation for this step is the insight that the basin of attraction of a saddle point constitutes an unstable manifold, and thus gradient descent methods are unlikely to get stuck; but if they do, adding noise allows them to escape [Lee et al.(2016)Lee, Simchowitz, Jordan, and Recht]. Based upon this observation, recent works prove second-order convergence of normalized GD [Levy(2016)] and perturbed GD [Jin et al.(2017a)Jin, Ge, Netrapalli, Kakade, and Jordan]. The latter is the first to achieve a polylog dependency on the dimensionality. The convergence of SGD with additional noise was analyzed in [Ge et al.(2015)Ge, Huang, Jin, and Yuan], but to the best of our knowledge, no prior work demonstrated convergence of SGD without explicitly adding noise.
Using curvature information  Since negative curvature signals potential descent directions, it seems logical to apply a second-order method to exploit this curvature direction in order to escape saddle points. Yet, the prototypical Newton's method has no global convergence guarantee and is locally attracted by saddle points and even local maximizers [Dauphin et al.(2014)Dauphin, Pascanu, Gulcehre, Cho, Ganguli, and Bengio]. Another issue is the computation (and perhaps storage) of the Hessian matrix, which requires O(d²) operations, as well as computing its inverse, which requires O(d³) computations.
The first problem can be resolved by using trust-region methods that guarantee convergence to a second-order stationary point [Conn et al.(2000)Conn, Gould, and Toint]. Among these methods, the Cubic Regularization technique initially proposed by [Nesterov & Polyak(2006)Nesterov and Polyak] has been shown to achieve the optimal worst-case iteration bound [Cartis et al.(2012)Cartis, Gould, and Toint]. The second problem can be addressed by replacing the computation of the Hessian with Hessian-vector products that can be computed efficiently in O(d) [Pearlmutter(1994)]. This is applied e.g. using matrix-free Lanczos iterations [Curtis & Robinson(2017)Curtis and Robinson, Reddi et al.(2017)Reddi, Zaheer, Sra, Poczos, Bach, Salakhutdinov, and Smola] or online variants such as Oja's algorithm [AllenZhu(2017)]. Subsampling the Hessian can furthermore reduce the dependence on the sample size by using various sampling schemes [Kohler & Lucchi(2017)Kohler and Lucchi, Xu et al.(2017)Xu, RoostaKhorasani, and Mahoney]. Finally, [Xu & Yang(2017)Xu and Yang] and [AllenZhu & Li(2017)AllenZhu and Li] showed that noisy gradient updates act as a noisy power method, allowing one to find a negative curvature direction using only first-order information. Despite the recent theoretical improvements obtained by such techniques, first-order methods still dominate for training large deep neural networks. Their theoretical properties are, however, not perfectly well understood in the general case, and we here aim to deepen the current understanding.
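Pearlmutter's trick computes exact Hessian-vector products via automatic differentiation; as a dependency-free stand-in, a central finite-difference approximation illustrates the same idea of an O(d)-per-product cost without ever forming the d x d Hessian (toy sketch, not the paper's implementation):

```python
import numpy as np

def hvp(grad_f, w, v, eps=1e-5):
    # Hessian-vector product via central finite differences of the gradient:
    # H v ~ (grad f(w + eps v) - grad f(w - eps v)) / (2 eps).
    return (grad_f(w + eps * v) - grad_f(w - eps * v)) / (2 * eps)

# Check on f(w) = 0.5 * w^T A w, whose Hessian is exactly A.
A = np.array([[2.0, 0.5], [0.5, 1.0]])
grad_f = lambda w: A @ w
w = np.array([0.3, -0.7])
v = np.array([1.0, 2.0])
print(np.allclose(hvp(grad_f, w, v), A @ v, atol=1e-4))  # True
```

Repeated products of this kind are exactly what matrix-free Lanczos or power iterations need to approximate the most negative eigenvalue.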
3 GD Perturbed by Stochastic Gradients
In this section we derive a convergence guarantee for a combination of gradient descent and stochastic gradient steps, as presented in Algorithm 2, for the case where the stochastic gradient sequence meets the CNC assumption introduced in Eq. (6). We name this algorithm CNC-PGD since it is a modified version of the PGD method [Jin et al.(2017a)Jin, Ge, Netrapalli, Kakade, and Jordan] that uses the intrinsic noise of SGD instead of requiring noise isotropy. Our theoretical analysis relies on the following smoothness conditions on the objective function f.
Assumption 2 (Smoothness Assumption).
We assume that the function f is L-gradient Lipschitz and ρ-Hessian Lipschitz, and that each function f(·, z) has an ℓ-bounded gradient.
Note that smoothness and Hessian Lipschitzness are standard assumptions for convergence analysis to a second-order stationary point [Ge et al.(2015)Ge, Huang, Jin, and Yuan, Jin et al.(2017a)Jin, Ge, Netrapalli, Kakade, and Jordan, Nesterov & Polyak(2006)Nesterov and Polyak]. The boundedness of the stochastic gradient is often used in stochastic optimization [Moulines & Bach(2011)Moulines and Bach].
Algorithm 2 (CNC-PGD)
Input: step size η, perturbation step size r, gradient threshold g_thres, number of iterations T
for t = 1, …, T:
    if ‖∇f(w_t)‖ ≤ g_thres:
        w_{t+1} = w_t − r ∇f_{z_t}(w_t)   # SGD step used as perturbation (used in the analysis)
    else:
        w_{t+1} = w_t − η ∇f(w_t)   # plain GD step
return w chosen uniformly from {w_1, …, w_T}
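The control flow described in the text (GD steps, with a single SGD step serving as the perturbation whenever the gradient is small) can be sketched as follows. Parameter names (eta, r, g_thres) are illustrative, and the paper's exact schedule additionally spaces perturbations apart in time:

```python
import numpy as np

def cnc_pgd(grad_f, stoch_grad_f, w0, eta, r, g_thres, T, rng):
    """Hedged sketch of the CNC-PGD idea: GD until the gradient is small,
    then one SGD step whose CNC noise acts as the perturbation."""
    iterates = [w0]
    w = w0
    for _ in range(T):
        if np.linalg.norm(grad_f(w)) <= g_thres:
            w = w - r * stoch_grad_f(w, rng)   # SGD step used as perturbation
        else:
            w = w - eta * grad_f(w)            # plain GD step
        iterates.append(w)
    return iterates                            # algorithm returns one iterate uniformly

# Toy run on f(w) = 0.5 ||w||^2 with additive sampling noise.
rng = np.random.default_rng(0)
grad_f = lambda w: w
stoch_grad_f = lambda w, rng: w + 0.1 * rng.standard_normal(w.shape)
path = cnc_pgd(grad_f, stoch_grad_f, np.ones(5), 0.5, 0.5, 1e-3, 200, rng)
w_out = path[rng.integers(len(path))]          # uniform draw over the path
```

On this convex toy objective the iterates contract quickly and then hover at the perturbation scale.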
Parameters  The analysis presented below relies on a particular choice of parameters. Their values are set based on the desired accuracy and are presented in Table 2.
Parameter  Value  Dependency on ε

Independent  
3.1 PGD Convergence Result
Theorem 1.
Remark  In plain English: CNC-PGD converges polynomially to a second-order stationary point under Assumption 1. By relying on isotropic noise, [Jin et al.(2017a)Jin, Ge, Netrapalli, Kakade, and Jordan] prove convergence to a second-order stationary point. The result of Theorem 1 matches this rate in terms of first-order optimality but is worse in terms of the second-order condition. Yet, we do not know whether our rate is the best achievable rate under the CNC condition, nor whether having isotropic noise is necessary to obtain a faster rate of convergence. As mentioned previously, a major benefit of employing the CNC condition is that it results in a convergence rate that does not depend on the dimension of the parameter space. We also believe that the dependency of the number of steps on γ (see Eq. (6)) can be significantly improved.
3.2 Proof sketch of Theorem 1
In order to prove Theorem 1, we consider three different scenarios depending on the magnitude of the gradient and the amount of negative curvature. Our proof scheme is mainly inspired by the analysis of perturbed gradient descent [Jin et al.(2017a)Jin, Ge, Netrapalli, Kakade, and Jordan], where a deterministic sufficient condition is established for escaping from saddle points (see Lemma 11). This condition is shown to hold in the case of isotropic noise. However, the non-isotropic noise coming from stochastic gradients is more difficult to analyze. Our contribution is to show that a less restrictive assumption on the perturbation noise still allows the iterates to escape saddle points. Detailed proofs of each lemma are provided in the Appendix.
Large gradient regime When the gradient is large enough, we can invoke existing results on the analysis of gradient descent for nonconvex functions [Nesterov(2013)].
Lemma 1.
Consider a gradient descent step w_{t+1} = w_t − η ∇f(w_t) on an L-smooth function f. For η ≤ 1/L this yields the following function decrease:

(7)  f(w_{t+1}) ≤ f(w_t) − (η/2) ‖∇f(w_t)‖².
Using the above result, we can guarantee the desired decrease whenever the norm of the gradient is large enough. Suppose that ‖∇f(w_t)‖ ≥ g_thres; then Lemma 1 immediately yields

(8)  f(w_{t+1}) ≤ f(w_t) − (η/2) g_thres².
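The descent guarantee underlying Eqs. (7)-(8) can be checked numerically on an L-smooth quadratic (a toy example; the matrix and step size are chosen for illustration):

```python
import numpy as np

# Descent-lemma check on f(w) = 0.5 * w^T A w, where L equals the largest
# eigenvalue of A and the step size is eta = 1/L.
A = np.diag([4.0, 1.0])
L = 4.0
eta = 1.0 / L
f = lambda w: 0.5 * w @ A @ w
grad = lambda w: A @ w

w = np.array([1.0, -2.0])
w_next = w - eta * grad(w)
decrease = f(w) - f(w_next)
# The observed decrease dominates (eta/2) * ||grad f(w)||^2.
print(decrease >= 0.5 * eta * np.linalg.norm(grad(w)) ** 2)  # True
```

Here the observed decrease is 2.875, comfortably above the guaranteed 2.5.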
Small gradient and sharp negative curvature regime  Consider the setting where the norm of the gradient is small, i.e. ‖∇f(w_t)‖ ≤ g_thres, but the minimum eigenvalue of the Hessian matrix is significantly less than zero, i.e. λ_min(∇²f(w_t)) ≤ −ε_H. In such a case, exploiting Assumption 1 (CNC) provides a guaranteed decrease in the function value after t_thres iterations, in expectation.
Lemma 2.
Let Assumptions 1 and 2 hold. Consider perturbed gradient steps (Algorithm 2 with parameters as in Table 2) starting from w_t such that ‖∇f(w_t)‖ ≤ g_thres. Assume the Hessian matrix has a large negative eigenvalue, i.e.

(9)  λ_min(∇²f(w_t)) ≤ −ε_H.

Then, after t_thres iterations the function value decreases as

(10)  E[f(w_{t+t_thres})] − f(w_t) ≤ −f_thres,

where the expectation is over the sequence of stochastic gradient samples.
Small gradient with moderate negative curvature regime  Suppose that ‖∇f(w_t)‖ ≤ g_thres and that the absolute value of the minimum eigenvalue of the Hessian is close to zero, i.e. we have already reached the desired first- and second-order optimality. In this case, we can guarantee that adding noise only leads to a limited increase of the function value in expectation.
Joint analysis  We now combine the results of the three scenarios discussed so far. Towards this end we introduce the set of approximate second-order stationary points

A := {w : ‖∇f(w)‖ ≤ g_thres and λ_min(∇²f(w)) ≥ −ε_H}.

Each of the visited parameters w_t constitutes a random variable. For each of these random variables, we define the event that w_t is not an approximate second-order stationary point, i.e. w_t ∉ A.
When this event occurs, the function value decreases in expectation. Since the numbers of steps required in the analysis of the large gradient regime and the sharp curvature regime are different, we use an amortized analysis similar to [Jin et al.(2017a)Jin, Ge, Netrapalli, Kakade, and Jordan] where we consider the per-step decrease
(12) 
The large gradient norm regime of Lemma 1 guarantees a decrease of the same order and hence
(13) 
follows from combining the two results. Let us now consider the case when the complementary event occurs. Then the result of Lemma 3 allows us to bound the increase in function value, i.e.
(14) 
Probabilistic bound  The results established so far have shown that in expectation the function value decreases until the iterates reach a second-order stationary point, for which Lemma 3 guarantees that the function value does not subsequently increase too much.
The idea is simple: if the number of steps is sufficiently large, then the results of Lemmas 1-3 guarantee that the number of times we visit a second-order stationary point is high. Let R be the random variable that measures the ratio of second-order stationary points along the optimization path. Formally,
(15) 
where 1[·] is the indicator function. Let P_t denote the probability of the event at step t, and P_t^c the probability of its complement. The probability of returning a second-order stationary point is simply
(16) 
Estimating these probabilities individually is very difficult due to the interdependence of the random variables. However, we can upper bound the sum of the individual probabilities. Using the law of total expectation and the results from Eqs. (13) and (14), we bound the expectation of the function value decrease as:
(17) 
Summing over iterations yields
(18) 
which, after rearranging terms, leads to the following upperbound
(19) 
Therefore, the probability that the returned iterate is a second-order stationary point is lower bounded as
(20) 
which concludes the proof of Theorem 1.
4 SGD without Perturbation
We now turn our attention to the stochastic variant of gradient descent under the assumption that the stochastic gradients fulfill the CNC condition (Assumption 1). We name this method CNC-SGD and demonstrate that it converges to a second-order stationary point without any additional perturbation. Note that in order to provide the convergence guarantee, we periodically enlarge the step size throughout the optimization process, as outlined in Algorithm 4. This periodic step size increase amplifies the variance along eigenvectors corresponding to the minimum eigenvalue of the Hessian, allowing SGD to exploit the negative curvature in the subsequent steps (using a smaller step size). Increasing the step size therefore plays a role similar to the perturbation step used in CNC-PGD (Algorithm 2).
Algorithm 4 (CNC-SGD)
Input: step sizes η and r (r > η), period W, number of iterations T
for t = 1, …, T:
    if t mod W = 0:
        w_{t+1} = w_t − r ∇f_{z_t}(w_t)   # enlarged step size (used in the analysis)
    else:
        w_{t+1} = w_t − η ∇f_{z_t}(w_t)   # plain SGD step
return w chosen uniformly from {w_1, …, w_T}.
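The scheme described in the text (plain SGD with a periodically enlarged step size that amplifies the CNC variance) can be sketched as follows; all parameter names and the toy objective are illustrative:

```python
import numpy as np

def cnc_sgd(stoch_grad, w0, eta, r, period, T, rng):
    """Hedged sketch of the CNC-SGD idea: every `period` steps the step size
    is enlarged from eta to r; all updates use stochastic gradients."""
    w = w0
    iterates = []
    for t in range(1, T + 1):
        step = r if t % period == 0 else eta   # periodic step size increase
        w = w - step * stoch_grad(w, rng)
        iterates.append(w)
    return iterates                            # algorithm returns one iterate uniformly

# Toy run on f(w) = 0.5 ||w||^2 with additive sampling noise.
rng = np.random.default_rng(1)
stoch_grad = lambda w, rng: w + 0.1 * rng.standard_normal(w.shape)
path = cnc_sgd(stoch_grad, np.ones(5), 0.1, 0.5, 50, 500, rng)
```

The large steps briefly inflate the iterate spread, after which the small-step phases contract it again.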
Parameters The analysis of CNCSGD relies on the particular choice of parameters presented in Table 3.
Parameter  Value  Dependency on ε

Remarks  As reported in Table 1, perturbed SGD, i.e. with isotropic noise, converges to a second-order stationary point with a polynomial dependency on the dimension [Ge et al.(2015)Ge, Huang, Jin, and Yuan]. Here, we prove that under the CNC assumption, vanilla SGD, i.e. without perturbations, converges to a second-order stationary point using a dimension-free number of stochastic gradient steps. Our result matches the result of [Ge et al.(2015)Ge, Huang, Jin, and Yuan] in terms of first-order optimality and yields an improvement in terms of second-order optimality. However, this second-order optimality rate is still worse than the best known convergence rate for perturbed SGD, established by [Zhang et al.(2017)Zhang, Liang, and Charikar]. One can even improve the convergence guarantee of SGD by using the NEON framework [AllenZhu & Li(2017)AllenZhu and Li, Xu & Yang(2017)Xu and Yang], but a perturbation with isotropic noise is still required. The theoretical guarantees we provide in Theorem 2, however, are based on a less restrictive assumption. As we prove in the following section, this assumption actually holds for stochastic gradients when learning halfspaces. Subsequently, in Section 6, we present empirical observations that suggest its validity even for training wide and deep neural networks.
5 Learning Halfspaces with Correlated Negative Curvature
The analysis presented in the previous sections relies on the CNC assumption introduced in Eq. (6). As mentioned before, this assumption is weaker than the isotropic noise condition required in previous work. In this section we confirm the validity of this condition for the problem of learning halfspaces, which is a core problem in machine learning, commonly encountered when training Perceptrons, Support Vector Machines or Neural Networks [Zhang et al.(2015)Zhang, Lee, Wainwright, and Jordan]. Learning a halfspace reduces to a minimization problem of the following form

(21)  f(w) := E_{z∼P}[φ(z^⊤ w)],

where φ is an arbitrary loss function and the data distribution P might have a finite or infinite support. There are different choices for the loss function φ, e.g. the zero-one loss, sigmoid loss or piecewise linear loss [Zhang et al.(2015)Zhang, Lee, Wainwright, and Jordan]. Here, we assume that φ is differentiable. Generally, the objective f is nonconvex and might exhibit many local minima and saddle points.
Note that the stochastic gradient is unbiased and defined as

(22)  ∇f_z(w) = φ′(z^⊤ w) z,

where the samples z are drawn from the distribution P.
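The single-sample gradient of Eq. (22) is simple to implement. The sketch below uses the sigmoid loss φ(a) = 1/(1 + e^a) as one concrete choice among the losses mentioned above (an illustrative choice, not the only one the analysis covers):

```python
import numpy as np

def phi(a):
    # Sigmoid loss: phi(a) = 1 / (1 + exp(a)).
    return 1.0 / (1.0 + np.exp(a))

def phi_prime(a):
    s = 1.0 / (1.0 + np.exp(-a))      # logistic function sigma(a)
    return -s * (1.0 - s)             # d/da [1 / (1 + e^a)]

def stoch_grad(w, z):
    # Eq. (22): unbiased gradient from a single sample z ~ P.
    return phi_prime(z @ w) * z

rng = np.random.default_rng(0)
w, z = rng.standard_normal(4), rng.standard_normal(4)
g = stoch_grad(w, z)
```

A finite-difference check against φ(z^⊤w) confirms the chain-rule form φ′(z^⊤w) z.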
[Figure: (1) stochastic gradients; (2) isotropic noise; (3) extreme eigenvalues.]
Noise isotropy vs. CNC assumption.  First, one can easily find a scenario where the noise isotropy condition is violated for stochastic gradients. Take for example the case where the data distribution from which z is sampled lives in a low-dimensional subspace. In this case, there exists a vector v orthogonal to all samples z. Then clearly ⟨v, ∇f_z(w)⟩ = 0, and thus the noise does not have components along all directions.
However, under mild assumptions, we show that the stochastic gradients do have a significant component along directions of negative curvature. Lemma 4 makes this argument precise by establishing a lower bound on the second moment of the stochastic gradients projected onto eigenvectors corresponding to negative eigenvalues of the Hessian matrix ∇²f(w). To establish this lower bound we require the following structural property of the loss function φ.
Assumption 3.
Suppose that the magnitude of the second-order derivative of φ is bounded by a constant factor of its first-order derivative, i.e.

(23)  |φ″(α)| ≤ β |φ′(α)|

holds for all α in the domain of φ, with β > 0.
The reader might notice that this condition resembles the self-concordance assumption often used in the optimization literature [Nesterov(2013)], for which the third derivative is bounded in terms of the second derivative. One can easily check that this condition is fulfilled by commonly used activation functions in neural networks, such as the sigmoid and softplus. We now leverage this property to prove that the stochastic gradient satisfies Assumption 1 (CNC).
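For the sigmoid loss φ(a) = 1/(1 + e^a) one has φ′(a) = −s(1 − s) and φ″(a) = −s(1 − s)(1 − 2s) with s = 1/(1 + e^{−a}), so |φ″| = |φ′| · |1 − 2s| ≤ |φ′| and Assumption 3 holds with β = 1. A quick numerical sweep confirms the closed-form check:

```python
import numpy as np

# Verify |phi''(a)| <= beta * |phi'(a)| with beta = 1 for the sigmoid loss
# phi(a) = 1 / (1 + e^a), using s = sigma(a) = 1 / (1 + e^{-a}):
#   phi'(a)  = -s (1 - s)
#   phi''(a) = -s (1 - s) (1 - 2s)
a = np.linspace(-20, 20, 10001)
s = 1.0 / (1.0 + np.exp(-a))
phi1 = -s * (1 - s)
phi2 = -s * (1 - s) * (1 - 2 * s)
print(np.all(np.abs(phi2) <= np.abs(phi1) + 1e-12))  # True
```

Since |1 − 2s| < 1 for all finite a, the bound is in fact strict away from the tails.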
Discussion  Since the result of Lemma 4 holds for any eigenvector associated with a negative eigenvalue, this naturally includes the eigenvector(s) corresponding to λ_min. As a result, Assumption 1 (CNC) holds for stochastic gradients when learning halfspaces. Combining this result with the convergence guarantees derived in Theorem 1 implies that a mix of SGD and GD steps (Algorithm 2) obtains a second-order stationary point in polynomial time. Furthermore, according to Theorem 2, vanilla SGD obtains a second-order stationary point in polynomial time without any explicit perturbations. Notably, both established convergence guarantees are dimension-free.
Furthermore, Lemma 4 reveals an interesting relationship between stochastic gradients and the eigenvectors of the Hessian at a given iterate. Namely, the variance of stochastic gradients along these vectors scales proportionally to the magnitude of the corresponding negative eigenvalues. This is in clear contrast to the case of isotropic noise, whose variance is uniformly distributed along all eigenvectors of the Hessian matrix. The difference can be important from a generalization point of view. Consider the simplified setting where φ is the square loss. Then the eigenvectors with large eigenvalues correspond to the principal directions of the data. In this regard, having a lower variance along the non-principal directions avoids overfitting.
In the following section we confirm the above results and furthermore show experiments on Neural Networks that suggest the validity of these results beyond the setting of learning halfspaces.
6 Experiments
In this section we first show that vanilla SGD (Algorithm 4), as well as GD with a stochastic gradient step as perturbation (Algorithm 2), indeed escape saddle points. Towards this end, we initialize SGD, GD, perturbed GD with isotropic noise (ISO-PGD) [Jin et al.(2017a)Jin, Ge, Netrapalli, Kakade, and Jordan] and CNC-PGD close to a saddle point on a low-dimensional learning-halfspaces problem with Gaussian input data and sigmoid loss. Figure 2 shows suboptimality over epochs, averaged over 10 runs. The results are in line with our analysis, since all stochastic methods quickly find a negative curvature direction to escape the saddle point. See Appendix E for more details.
Secondly, and more importantly, we study the properties of the variance of stochastic gradients depending on the width and depth of neural networks. All of these experiments are conducted using feedforward networks on the well-known MNIST classification task. Specifically, we draw random parameters w in each of these networks and test Assumption 1 by estimating the second moment of the stochastic gradients projected onto the eigenvectors v of ∇²f(w), averaging ⟨v, ∇f_z(w)⟩² over samples z (Eq. (25)). We do the same for isotropic noise vectors drawn from the unit ball around each w.

(a) As expected, the variance of isotropic noise along eigenvectors corresponding to λ_min decreases as 1/d.

(b) The stochastic gradients, however, maintain a significant component along the directions of most negative curvature, independent of the width and depth of the neural network (see Figure 1).

(c) Finally, the stochastic gradients yield an increasing variance along eigenvectors corresponding to larger eigenvalues (see Figure 3).
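The 1/d decay of isotropic noise along any fixed direction is a simple geometric fact that can be reproduced with a few lines of Monte Carlo (a toy illustration, independent of the neural-network experiments):

```python
import numpy as np

# For u drawn uniformly from the unit sphere in R^d and any fixed unit
# direction v, E[(v^T u)^2] = 1/d: isotropic perturbations lose their
# strength along any single direction as the dimension grows.
rng = np.random.default_rng(0)
estimates = {}
for d in (10, 100, 1000):
    u = rng.standard_normal((20000, d))
    u /= np.linalg.norm(u, axis=1, keepdims=True)   # uniform on the sphere
    v = np.zeros(d); v[0] = 1.0                     # a fixed direction
    estimates[d] = float(np.mean((u @ v) ** 2))     # close to 1/d
print(estimates)
```

This is precisely the dimension dependence that stochastic gradients avoid under the CNC assumption.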
These findings have important implications. Observations (a) and (b) justify the use and explain the success of training wide and deep neural networks with pure SGD despite the presence of saddle points. Observation (c) suggests that the bound established in Lemma 4 may well extend to more general settings such as training neural networks, and illustrates the implicit regularization of optimization methods that rely on stochastic gradients, since directions of large curvature correspond to principal (more robust) components of the data for many machine learning models.
7 Conclusion
In this work we have analyzed the convergence of PGD and SGD for optimizing nonconvex functions under a new assumption, named CNC, that requires the stochastic noise to exhibit a certain amount of variance along the directions of most negative curvature. This is a less restrictive assumption than the noise isotropy condition required by previous work, which causes a dependency on the problem dimensionality in the convergence rate. We have shown theoretically that, for the problem of learning halfspaces, stochastic gradients satisfy the CNC assumption and exhibit a variance proportional to the eigenvalue's magnitude. Furthermore, we provided empirical evidence which suggests the validity of this assumption in the context of neural networks, and thus contributed to a better understanding of training these models with stochastic gradients. Proving this observation theoretically and investigating its implications for the optimization and generalization properties of stochastic gradient methods is an interesting direction of future research.
Acknowledgments We would like to thank Kfir Levy, Gary Becigneul, Yannic Kilcher and Kevin Roth for their helpful discussions.
Appendix
Appendix A Preliminaries
Assumptions  Recall that we assumed the function f is L-smooth (or L-gradient Lipschitz) and ρ-Hessian Lipschitz. We define these properties below.
Definition 1 (Smooth function).
A differentiable function f is L-smooth (or L-gradient Lipschitz) if

(26)  ‖∇f(w₁) − ∇f(w₂)‖ ≤ L ‖w₁ − w₂‖  for all w₁, w₂.

Definition 2 (Hessian Lipschitz).
A twice-differentiable function f is ρ-Hessian Lipschitz if

(27)  ‖∇²f(w₁) − ∇²f(w₂)‖ ≤ ρ ‖w₁ − w₂‖  for all w₁, w₂.

Definition 3 (Bounded Gradient).
A differentiable function f(·, z) has an ℓ-bounded gradient if

(28)  ‖∇f(w, z)‖ ≤ ℓ  for all w and z.
Convergence of SGD on a smooth function
Lemma 5.
Let w′ be obtained from one stochastic gradient step at w on the L-smooth objective f, namely

w′ = w − η ∇f_z(w),

where E[∇f_z(w)] = ∇f(w) and f(·, z) has an ℓ-bounded gradient. Then the function value decreases in expectation as

(29)  E[f(w′)] ≤ f(w) − η ‖∇f(w)‖² + (η² L ℓ²)/2.

Proof.
The proof is based on a straightforward application of smoothness:

(30)  f(w′) ≤ f(w) + ⟨∇f(w), w′ − w⟩ + (L/2) ‖w′ − w‖²
(31)        = f(w) − η ⟨∇f(w), ∇f_z(w)⟩ + (η² L/2) ‖∇f_z(w)‖²
(32)  E[f(w′)] ≤ f(w) − η ‖∇f(w)‖² + (η² L ℓ²)/2.

∎
Bounded series
Lemma 6.
For all , the following series are bounded as
(33)  
(34)  
(35) 
Proof.
The proof is based on the following bounds on power series for :
(36)  
(37)  
(38) 
Yet, for the sake of brevity, we omit the subsequent (straightforward) derivations needed to prove the statement. ∎
Appendix B PGD analysis
B.1 Choosing the parameters
B.2 Sharp negative curvature regime
Lemma 7 (Restated Lemma 2).
Let Assumptions 1 and 2 hold. Consider perturbed gradient steps (Algorithm 2 with parameters as in Table 2) starting from w_t such that ‖∇f(w_t)‖ ≤ g_thres. Assume the Hessian matrix has a large negative eigenvalue, i.e.

(39)  λ_min(∇²f(w_t)) ≤ −ε_H.

Then, after t_thres iterations the function value decreases as

(40)  E[f(w_{t+t_thres})] − f(w_t) ≤ −f_thres,

where the expectation is over the sequence of stochastic gradient samples.
Notation  Without loss of generality, we assume t = 0. Let v denote the eigenvector corresponding to the minimum eigenvalue of the Hessian at the starting point. We use the following compact notations:
(41) 
Note that the first iterate denotes the parameter before the perturbation, while the subsequent iterates are obtained by GD steps after the perturbation. Recall also the compact notation
Finally, we set .
Proof sketch  The proof presented below proceeds by contradiction and is inspired by the analysis of accelerated gradient descent in nonconvex settings by [Jin et al.(2017b)Jin, Netrapalli, and Jordan]. We first assume that the sufficient decrease condition is not met and show that this implies an upper bound on the distance moved over a given number of iterations. We then derive a lower bound on the iterate distance and show that, for the specific choice of parameters introduced earlier, this lower bound contradicts the upper bound for a large enough number of steps. We therefore conclude that we get sufficient decrease after t_thres steps.
Proof of Lemma 7
Part 1: Upper bounding the distance of the iterates in terms of the function decrease.  We assume that PGD does not obtain the desired function decrease in t_thres iterations, i.e.

(43)  E[f(w_{t_thres})] − f(w_0) > −f_thres.

The above assumption implies that the iterates stay close to w_0 for all t ≤ t_thres. We formalize this result in the following lemma.
Lemma 8 (Distance Bound).
Proof.
Here, we use the proof strategy proposed for normalized gradient descent [Levy(2016)]. First of all, we bound the effect of the noise in the first step. Recall the first update of Algorithm 2 in this setting:
(45) 
Then, by a straightforward application of Lemma 5, we have
(46) 
We proceed using the result of Lemma 1 that relates the function decrease to the norm of the visited gradients:
(47) 
According to Eq. (43), the function value does not decrease too much. Plugging this bound into the above inequality yields an upper bound on the sum of the squared norms of the visited gradients, i.e.
(48) 
Using the above result allows us to bound the expected distance in the parameter space as:
(49)  
(50)  
(51) 
Plugging the above inequality into the following bound completes the proof:
(52) 
∎
Part 2: Quadratic approximation  Since the parameter vector stays close to w_0 under the condition in Eq. (43), we can use a "stale" Taylor expansion of the function at w_0:
Using a stale Taylor approximation over all iterations is the essential part of the proof, first proposed by [Ge et al.(2015)Ge, Huang, Jin, and Yuan] for the analysis of the PSGD method. The next lemma proves that the gradient of f can be approximated by the gradient of the quadratic approximation as long as the iterate is close enough to w_0.
Lemma 9 (Taylor expansion bound for the gradient [Nesterov(2013)]).
For every twice-differentiable, ρ-Hessian Lipschitz function f the following bound holds:

(53)  ‖∇f(w₁) − ∇f(w₂) − ∇²f(w₂)(w₁ − w₂)‖ ≤ (ρ/2) ‖w₁ − w₂‖².
Furthermore, the guaranteed closeness to the initial parameter allows us to use the gradient of the quadratic objective in the GD steps as follows,

(54) 

where the vectors involved are defined in Table 5.
Vector  Formula  Indication 

Power Iteration  