
## Abstract

We analyze the variance of stochastic gradients along negative curvature directions in certain non-convex machine learning models and show that stochastic gradients exhibit a strong component along these directions. Furthermore, we show that - contrary to the case of isotropic noise - this variance is proportional to the magnitude of the corresponding eigenvalues and not decreasing in the dimensionality. Based upon this observation we propose a new assumption under which we show that the injection of explicit, isotropic noise usually applied to make gradient descent escape saddle points can successfully be replaced by a simple SGD step. Additionally - and under the same condition - we derive the first convergence rate for plain SGD to a second-order stationary point in a number of iterations that is independent of the problem dimension.


### 1 Introduction

In this paper we analyze the use of gradient descent (GD) and its stochastic variant (SGD) to minimize objectives of the form

$$w^* = \operatorname*{arg\,min}_{w \in \mathbb{R}^d} \big[\, f(w) := \mathbb{E}_{z \sim P}[f_z(w)] \,\big], \tag{1}$$

where $f_z$ is a not necessarily convex loss function and $P$ is a probability distribution.

In the era of big data and deep neural networks, (stochastic) gradient descent is a core component of many training algorithms [Bottou(2010)]. What makes SGD so attractive is its simplicity, its seemingly universal applicability and a convergence rate that is independent of the size of the training set. One specific trait of SGD is the inherent noise, originating from sampling training points, whose variance has to be controlled in order to guarantee convergence either through a conservative step size [Nesterov(2013)] or via explicit variance-reduction techniques [Johnson & Zhang(2013)Johnson and Zhang].

While the convergence behavior of SGD is well-understood for convex functions [Bottou(2010)], we are here interested in the optimization of non-convex functions, which pose additional challenges for optimization, in particular due to the presence of saddle points and suboptimal local minima [Dauphin et al.(2014)Dauphin, Pascanu, Gulcehre, Cho, Ganguli, and Bengio, Choromanska et al.(2015)Choromanska, Henaff, Mathieu, Arous, and LeCun]. For example, finding the global minimum of even a degree 4 polynomial can be NP-hard [Hillar & Lim(2013)Hillar and Lim]. Instead of aiming for a global minimizer, a more practical goal is to search for a local optimum of the objective. In this paper we thus focus on reaching a second-order stationary point of smooth non-convex functions. Formally, we aim to find an $(\epsilon_g, \epsilon_h)$-second-order stationary point $w$ such that the following conditions hold:

$$\|\nabla f(w)\| \le \epsilon_g \quad \text{and} \quad \nabla^2 f(w) \succeq -\epsilon_h I, \tag{2}$$

where $\epsilon_g, \epsilon_h > 0$.

Existing work, such as [Ge et al.(2015)Ge, Huang, Jin, and Yuan, Jin et al.(2017a)Jin, Ge, Netrapalli, Kakade, and Jordan], proved convergence to a point satisfying Eq. (2) for modified variants of gradient descent and its stochastic variant by requiring additional noise to be explicitly added to the iterates along the entire path (former) or whenever the gradient is sufficiently small (latter). Formally, this yields the following update step for the perturbed GD and SGD versions:

$$\text{PGD:}\quad w_{t+1} = w_t - \eta_t \nabla f(w_t) + r\,\zeta_t, \tag{3}$$
$$\text{PSGD:}\quad w_{t+1} = w_t - \eta_t\big(\nabla f_z(w_t) + \zeta_t\big), \tag{4}$$

where $\zeta_t$ is typically zero-mean noise sampled uniformly from a unit sphere.
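For concreteness, the two perturbed updates can be sketched in a few lines of NumPy; the toy saddle function, names and parameter values below are ours, purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def sphere_noise(d):
    """Zero-mean noise sampled uniformly from the unit sphere in R^d."""
    u = rng.normal(size=d)
    return u / np.linalg.norm(u)

def pgd_step(w, grad_f, eta, r):
    """Perturbed GD update of Eq. (3): full gradient plus explicit noise of radius r."""
    return w - eta * grad_f(w) + r * sphere_noise(w.shape[0])

def psgd_step(w, grad_fz, eta):
    """Perturbed SGD update of Eq. (4): stochastic gradient plus isotropic noise."""
    return w - eta * (grad_fz(w) + sphere_noise(w.shape[0]))

# Toy saddle f(w) = (w_0^2 - w_1^2) / 2; its gradient vanishes at the origin.
grad_f = lambda w: np.array([w[0], -w[1]])
w = pgd_step(np.zeros(2), grad_f, eta=0.1, r=0.01)
print(np.linalg.norm(w))  # ~0.01: the explicit noise moves the iterate off the saddle
```

Started exactly at the saddle, plain GD would never move; either perturbed variant displaces the iterate immediately.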

Isotropic noise The perturbed variants of GD and SGD in Eqs. (3)-(4) have been analyzed for the case where the added noise is isotropic [Ge et al.(2015)Ge, Huang, Jin, and Yuan, Levy(2016), Jin et al.(2017a)Jin, Ge, Netrapalli, Kakade, and Jordan] or at least exhibits a certain amount of variance along all directions in $\mathbb{R}^d$ [Ge et al.(2015)Ge, Huang, Jin, and Yuan]. As shown in Table 1, an immediate consequence of such conditions is that they introduce a dependency on the input dimension $d$ in the convergence rate. Furthermore, it is unknown, as of today, whether this condition is satisfied by the intrinsic noise of vanilla SGD for any specific class of machine learning models. Recent empirical observations show that this is not the case for training neural networks [Chaudhari & Soatto(2017)Chaudhari and Soatto].

In this work, we therefore turn our attention to the following question. Do we need to perturb iterates along all dimensions in order for (S)GD to converge to a second-order stationary point? Or is it enough to simply rely on the inherent variance of SGD induced by sampling? More than a purely theoretical exercise, this question has some very important practical implications since in practice the vast majority of existing SGD methods do not add additional noise and therefore do not meet the requirement of isotropic noise. Thus we instead focus our attention on a less restrictive condition for which perturbations only have a guaranteed variance along directions of negative curvature of the objective, i.e. along the eigenvector(s) associated with the minimum eigenvalue of the Hessian. Instead of explicitly adding noise as done in Eqs. (3) and (4), we will from now on consider the simple SGD step:

 wt+1=wt−η∇fz(wt) (5)

and propose the following sufficient condition on the stochastic gradient to guarantee convergence to a second-order stationary point.

###### Assumption 1 (Correlated Negative Curvature (CNC)).

Let $v_w$ be the eigenvector corresponding to the minimum eigenvalue of the Hessian matrix $\nabla^2 f(w)$. The stochastic gradient $\nabla f_z(w)$ satisfies the CNC assumption if the second moment of its projection along the direction $v_w$ is uniformly bounded away from zero, i.e.

$$\exists\, \gamma > 0 \ \text{ s.t. } \ \forall w:\ \mathbb{E}\big[\langle v_w, \nabla f_z(w)\rangle^2\big] > \gamma. \tag{6}$$
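The quantity bounded in Eq. (6) can be estimated by Monte-Carlo sampling. The following sketch does so for a toy quadratic model (our construction, not from the paper) whose Hessian and stochastic gradients are known in closed form:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy objective f_z(w) = 0.5 (w - z)^T A (w - z) with z ~ N(0, I):
# the Hessian is A everywhere and the stochastic gradient is A (w - z).
d = 10
A = np.diag(np.linspace(-1.0, 2.0, d))   # one negative eigenvalue: -1
w = rng.normal(size=d)

eigvals, eigvecs = np.linalg.eigh(A)
v = eigvecs[:, np.argmin(eigvals)]       # v_w: direction of most negative curvature

# Monte-Carlo estimate of E[<v_w, grad f_z(w)>^2], the CNC quantity of Eq. (6)
samples = rng.normal(size=(100_000, d))
projections = (A @ (w[:, None] - samples.T)).T @ v
second_moment = np.mean(projections**2)
print(second_moment)  # strictly positive; here ~ 1 + <v, w>^2 since lambda_min = -1
```

For this model the CNC constant is $\gamma = \lambda_{\min}^2$, since the data noise has unit variance along every direction.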

Contributions Our contribution is fourfold: First, we analyze the convergence of GD perturbed by SGD steps (Algorithm 2). Under the CNC assumption, we demonstrate that this method converges to an $\epsilon$-second-order stationary point with high probability, in the number of iterations given in Theorem 1. Second, we prove that vanilla SGD as stated in Algorithm 4 - again under Assumption 1 - also converges to an $\epsilon$-second-order stationary point with high probability, in the number of iterations given in Theorem 2. To the best of our knowledge, this is the first second-order convergence result for SGD without additional noise. One important consequence of not relying on isotropic noise is that the rate of convergence becomes independent of the input dimension $d$. This can be a very significant practical advantage when optimizing deep neural networks that contain millions of trainable parameters.

Third, we prove that stochastic gradients satisfy Assumption 1 in the setting of learning half-spaces, which is ubiquitous in machine learning. Finally, we provide experimental evidence suggesting the validity of this condition for training neural networks. In particular, we show that, while the variance of uniform noise along eigenvectors corresponding to the most negative eigenvalue decreases as $1/d$, stochastic gradients have a significant component along this direction, independent of the width and depth of the neural net. When looking at the entire eigenspectrum, we find that this variance increases with the magnitude of the associated eigenvalues. Hereby, we contribute to a better understanding of the success of training deep nets with SGD and its extensions.

### 2 Background & Related work

Reaching a 1st-order stationary point For smooth functions, a first-order stationary point satisfying $\|\nabla f(w)\| \le \epsilon$ can be reached by GD and SGD in $\mathcal{O}(\epsilon^{-2})$ and $\mathcal{O}(\epsilon^{-4})$ iterations, respectively [Nesterov(2013)].

Reaching a 2nd-order stationary point In order to reach second-order stationary points, existing first-order techniques rely on explicitly adding isotropic noise with a known variance (see Eq. (3)). The key motivation for this step is the insight that the area of attraction to a saddle point constitutes an unstable manifold, and thus gradient descent methods are unlikely to get stuck; but if they do, adding noise allows them to escape [Lee et al.(2016)Lee, Simchowitz, Jordan, and Recht]. Based upon this observation, recent works prove second-order convergence of normalized GD [Levy(2016)] and perturbed GD [Jin et al.(2017a)Jin, Ge, Netrapalli, Kakade, and Jordan]. The latter needs at most $\tilde{\mathcal{O}}(\epsilon^{-2})$ iterations and is thus the first to achieve a poly-log dependency on the dimensionality. The convergence of SGD with additional noise was analyzed in [Ge et al.(2015)Ge, Huang, Jin, and Yuan] but, to the best of our knowledge, no prior work demonstrated convergence of SGD without explicitly adding noise.

Using curvature information Since negative curvature signals potential descent directions, it seems logical to apply a second-order method to exploit this curvature direction in order to escape saddle points. Yet, the prototypical Newton's method has no global convergence guarantee and is locally attracted by saddle points and even local maximizers [Dauphin et al.(2014)Dauphin, Pascanu, Gulcehre, Cho, Ganguli, and Bengio]. Another issue is the computation (and perhaps storage) of the Hessian matrix, which requires $\mathcal{O}(d^2)$ operations, as well as computing its inverse, which requires $\mathcal{O}(d^3)$ operations.

The first problem can be resolved by using trust-region methods that guarantee convergence to a second-order stationary point [Conn et al.(2000)Conn, Gould, and Toint]. Among these methods, the Cubic Regularization technique initially proposed by [Nesterov & Polyak(2006)Nesterov and Polyak] has been shown to achieve the optimal worst-case iteration bound [Cartis et al.(2012)Cartis, Gould, and Toint]. The second problem can be addressed by replacing the computation of the Hessian with Hessian-vector products, which can be computed efficiently in $\mathcal{O}(d)$ [Pearlmutter(1994)]. This is applied e.g. using matrix-free Lanczos iterations [Curtis & Robinson(2017)Curtis and Robinson, Reddi et al.(2017)Reddi, Zaheer, Sra, Poczos, Bach, Salakhutdinov, and Smola] or online variants such as Oja's algorithm [Allen-Zhu(2017)]. Sub-sampling the Hessian can furthermore reduce the dependence on the number of data points by using various sampling schemes [Kohler & Lucchi(2017)Kohler and Lucchi, Xu et al.(2017)Xu, Roosta-Khorasani, and Mahoney]. Finally, [Xu & Yang(2017)Xu and Yang] and [Allen-Zhu & Li(2017)Allen-Zhu and Li] showed that noisy gradient updates act as a noisy Power method, allowing one to find a negative curvature direction using only first-order information. Despite the recent theoretical improvements obtained by such techniques, first-order methods still dominate for training large deep neural networks. Their theoretical properties are, however, not yet fully understood in the general case, and we here aim to deepen the current understanding.
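Pearlmutter's trick is usually implemented with automatic differentiation; a finite-difference variant conveys the same matrix-free idea with two gradient evaluations. The function below is our illustrative sketch, not the implementation used in any of the cited works:

```python
import numpy as np

def hessian_vector_product(grad, w, v, eps=1e-5):
    """Matrix-free H(w) v via a central difference of the gradient:
    H v ~ (grad(w + eps*v) - grad(w - eps*v)) / (2*eps) -- two gradient calls, O(d)."""
    return (grad(w + eps * v) - grad(w - eps * v)) / (2 * eps)

# Sanity check on a quadratic f(w) = 0.5 w^T B w, whose Hessian is exactly B.
rng = np.random.default_rng(1)
B = rng.normal(size=(5, 5))
B = (B + B.T) / 2                      # symmetrize so B is a valid Hessian
grad = lambda w: B @ w
w, v = rng.normal(size=5), rng.normal(size=5)
print(np.allclose(hessian_vector_product(grad, w, v), B @ v, atol=1e-4))  # True
```

Such products are exactly what matrix-free Lanczos or Oja-style iterations consume to approximate the most negative eigenvalue without ever forming the Hessian.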

### 3 GD Perturbed by Stochastic Gradients

In this section we derive a convergence guarantee for a combination of gradient descent and stochastic gradient steps, as presented in Algorithm 2, for the case where the stochastic gradient sequence meets the CNC assumption introduced in Eq. (6). We name this algorithm CNC-PGD since it is a modified version of the PGD method [Jin et al.(2017a)Jin, Ge, Netrapalli, Kakade, and Jordan], but it uses the intrinsic noise of SGD instead of requiring noise isotropy. Our theoretical analysis relies on the following smoothness conditions on the objective function $f$.

###### Assumption 2 (Smoothness Assumption).

We assume that the function $f$ is $L$-gradient Lipschitz and $\rho$-Hessian Lipschitz and that each function $f_z$ has an $\ell$-bounded gradient. W.l.o.g. we further assume that $L$, $\rho$, and $\ell$ are greater than one.

Note that $L$-smoothness and $\rho$-Hessian Lipschitzness are standard assumptions for convergence analysis to a second-order stationary point [Ge et al.(2015)Ge, Huang, Jin, and Yuan, Jin et al.(2017a)Jin, Ge, Netrapalli, Kakade, and Jordan, Nesterov & Polyak(2006)Nesterov and Polyak]. The boundedness of the stochastic gradient is often used in stochastic optimization [Moulines & Bach(2011)Moulines and Bach].

Algorithm 2 (CNC-PGD). Take GD steps $w_{t+1} = w_t - \eta \nabla f(w_t)$; whenever the gradient is small, i.e. $\|\nabla f(w_t)\|^2 \le g_{\mathrm{thres}}$, set $\tilde{w}_t \leftarrow w_t$ (used in the analysis) and take a single SGD step $w_{t+1} = w_t - r \nabla f_z(w_t)$ as perturbation. After $T$ steps, return an iterate chosen uniformly at random from $\{w_1, \dots, w_T\}$.

Parameters The analysis presented below relies on a particular choice of parameters. Their values are set based on the desired accuracy $\epsilon$ and presented in Table 2.

#### 3.1 PGD Convergence Result

###### Theorem 1.

Let the stochastic gradients in CNC-PGD satisfy Assumption 1 and let $f$ and $f_z$ satisfy Assumption 2. Then Algorithm 2 returns an $\epsilon$-second-order stationary point with probability at least $1-\delta$ after

$$\mathcal{O}\!\left((\ell L)^4 (\delta\gamma\epsilon)^{-2} \log\!\left(\frac{\ell L}{\eta\,\delta\gamma\,\epsilon^{2/5}}\right)\right)$$

steps, where .

Remark In plain English: CNC-PGD converges polynomially to a second-order stationary point under Assumption 1. By relying on isotropic noise, [Jin et al.(2017a)Jin, Ge, Netrapalli, Kakade, and Jordan] prove convergence to an $\epsilon$-second-order stationary point in $\tilde{\mathcal{O}}(\epsilon^{-2})$ steps. The result of Theorem 1 matches this rate in terms of first-order optimality but is worse in terms of the second-order condition. Yet, we do not know whether our rate is the best achievable rate under the CNC condition, nor whether isotropic noise is necessary to obtain a faster rate of convergence. As mentioned previously, a major benefit of employing the CNC condition is that it results in a convergence rate that does not depend on the dimension of the parameter space. We also believe that the dependency of the number of steps on $\gamma$ (see Eq. (6)) can be significantly improved.

#### 3.2 Proof sketch of Theorem 1

In order to prove Theorem 1, we consider three different scenarios depending on the magnitude of the gradient and the amount of negative curvature. Our proof scheme is mainly inspired by the analysis of perturbed gradient descent [Jin et al.(2017a)Jin, Ge, Netrapalli, Kakade, and Jordan], where a deterministic sufficient condition is established for escaping from saddle points (see Lemma 11). This condition is shown to hold in the case of isotropic noise. However, the non-isotropic noise coming from stochastic gradients is more difficult to analyze. Our contribution is to show that a less restrictive assumption on the perturbation noise still allows the iterates to escape saddle points. Detailed proofs of each lemma are provided in the Appendix.

Large gradient regime When the gradient is large enough, we can invoke existing results on the analysis of gradient descent for non-convex functions [Nesterov(2013)].

###### Lemma 1.

Consider a gradient descent step on an $L$-smooth function $f$. For $\eta \le 1/L$ this yields the following function decrease:

$$f(w_{t+1}) - f(w_t) \le -\tfrac{\eta}{2}\|\nabla f(w_t)\|^2. \tag{7}$$

Using the above result, we can guarantee the desired decrease whenever the norm of the gradient is large enough. Suppose that $\|\nabla f(w_t)\|^2 \ge g_{\mathrm{thres}}$; then Lemma 1 immediately yields

$$f(w_{t+1}) - f(w_t) \le -\tfrac{\eta}{2}\, g_{\mathrm{thres}}. \tag{8}$$

Small gradient and sharp negative curvature regime Consider the setting where the norm of the gradient is small, i.e. $\|\nabla f(\tilde{w}_t)\|^2 \le g_{\mathrm{thres}}$, but the minimum eigenvalue of the Hessian matrix is significantly less than zero, i.e. $\lambda_{\min}(\nabla^2 f(\tilde{w}_t)) \le -\sqrt{\rho}\,\epsilon^{2/5}$. In such a case, exploiting Assumption 1 (CNC) provides a guaranteed decrease in the function value after $t_{\mathrm{thres}}$ iterations, in expectation.

###### Lemma 2.

Let Assumptions 1 and 2 hold. Consider perturbed gradient steps (Algorithm 2 with parameters as in Table 2) starting from $\tilde{w}_t$ such that $\|\nabla f(\tilde{w}_t)\|^2 \le g_{\mathrm{thres}}$. Assume the Hessian matrix has a large negative eigenvalue, i.e.

$$\lambda_{\min}(\nabla^2 f(\tilde{w}_t)) \le -\sqrt{\rho}\,\epsilon^{2/5}. \tag{9}$$

Then, after $t_{\mathrm{thres}}$ iterations the function value decreases as

$$\mathbb{E}[f(w_{t+t_{\mathrm{thres}}})] - f(\tilde{w}_t) \le -f_{\mathrm{thres}}, \tag{10}$$

where the expectation is over the sequence of iterates.

Small gradient with moderate negative curvature regime Suppose that $\|\nabla f(\tilde{w}_t)\|^2 \le g_{\mathrm{thres}}$ and that the absolute value of the minimum eigenvalue of the Hessian is close to zero, i.e. we have already reached the desired first- and second-order optimality. In this case, we can guarantee that adding noise leads only to a limited increase of the function value in expectation.

###### Lemma 3.

Let Assumptions 1 and 2 hold. Consider perturbed gradient steps (Algorithm 2 with parameters as in Table 2) starting from $\tilde{w}_t$ such that $\|\nabla f(\tilde{w}_t)\|^2 \le g_{\mathrm{thres}}$. Then after $t_{\mathrm{thres}}$ iterations, the function value cannot increase by more than

$$\mathbb{E}[f(w_{t+t_{\mathrm{thres}}})] - f(\tilde{w}_t) \le \tfrac{\eta\delta f_{\mathrm{thres}}}{4}, \tag{11}$$

where the expectation is over the sequence of iterates.

Joint analysis We now combine the results of the three scenarios discussed so far. Towards this end we introduce the set $\mathcal{S}$ as

$$\mathcal{S} := \left\{ w \in \mathbb{R}^d \;\middle|\; \|\nabla f(w)\|^2 \ge g_{\mathrm{thres}} \ \text{ or }\ \lambda_{\min}(\nabla^2 f(w)) \le -\sqrt{\rho}\,\epsilon^{2/5} \right\}.$$

Each of the visited parameters $w_t$ constitutes a random variable. For each of these random variables, we define the event $A_t := \{w_t \in \mathcal{S}\}$. When $A_t$ occurs, the function value decreases in expectation. Since the numbers of steps required in the analysis of the large gradient regime and the sharp curvature regime are different, we use an amortized analysis similar to [Jin et al.(2017a)Jin, Ge, Netrapalli, Kakade, and Jordan], where we consider the per-step decrease. Indeed, when the negative curvature is sharp, Lemma 2 provides a guaranteed decrease which - when normalized per step - yields

$$\frac{\mathbb{E}[f(w_{t+t_{\mathrm{thres}}})] - f(\tilde{w}_t)}{t_{\mathrm{thres}}} \le -\frac{f_{\mathrm{thres}}}{t_{\mathrm{thres}}} = -\eta\, g_{\mathrm{thres}}. \tag{12}$$

The large gradient norm regime of Lemma 1 guarantees a decrease of the same order and hence

$$\mathbb{E}[f(w_{t+1}) - f(w_t)\,|\,A_t] \le -\tfrac{\eta}{2}\, g_{\mathrm{thres}} \tag{13}$$

follows from combining the two results. Let us now consider the case when (complement of ) occurs. Then the result of Lemma 3 allows us to bound the increase in terms of function value, i.e.

$$\mathbb{E}[f(w_{t+1}) - f(w_t)\,|\,A_t^c] \le \tfrac{\eta\delta}{4}\, g_{\mathrm{thres}}. \tag{14}$$

Probabilistic bound The results established so far have shown that in expectation the function value decreases until the iterates reach a second-order stationary point, for which Lemma 3 guarantees that the function value does not increase too much subsequently. This result guarantees visiting a second-order stationary point in $T$ steps (see Table 2). Yet, certifying the second-order condition is expensive, and it is thus not computationally viable to examine all visited parameters $w_t$. The original PGD method does not have this issue since Lemma 10 in [Jin et al.(2017a)Jin, Ge, Netrapalli, Kakade, and Jordan] provides a high-probability statement on the function value decrease. Using this, PGD certifies visiting a second-order stationary point at the cost of evaluating only the gradient norm. Since our guarantees are in expectation, we cannot rely on the function value decrease in Algorithm 2. Instead, the following analysis guarantees that we recover a second-order stationary point with high probability by returning a visited parameter uniformly at random. This approach is often used in stochastic non-convex optimization [Ghadimi & Lan(2013)Ghadimi and Lan].

The idea is simple: If the number of steps $T$ is sufficiently large, then the results of Lemmas (1)-(3) guarantee that the number of times we visit a second-order stationary point is high. Let $R$ be a random variable that determines the ratio of $\epsilon$-second-order stationary points along the optimization path $\{w_t\}_{t=1}^{T}$. Formally,

$$R := \frac{1}{T}\sum_{t=1}^{T} \mathbb{1}(A_t^c), \tag{15}$$

where $\mathbb{1}$ is the indicator function. Let $P_t$ denote the probability of event $A_t$ and $1-P_t$ the probability of its complement $A_t^c$. The probability of returning a second-order stationary point is simply

$$\mathbb{E}[R] = \frac{1}{T}\sum_{t=1}^{T}(1 - P_t). \tag{16}$$

Estimating the probabilities $P_t$ is very difficult due to the interdependence of the random variables $w_t$. However, we can upper bound the sum of the individual $P_t$'s. Using the law of total expectation and the results from Eqs. (13) and (14), we bound the expectation of the function value decrease as:

$$\mathbb{E}[f(w_{t+1}) - f(w_t)] \le \frac{\eta\, g_{\mathrm{thres}}}{2}\left(\frac{\delta}{2} - \left(1 + \frac{\delta}{2}\right)P_t\right). \tag{17}$$

Summing over iterations yields

$$\sum_{t=1}^{T}\big(\mathbb{E}[f(w_{t+1})] - \mathbb{E}[f(w_t)]\big) \le \frac{\eta\, g_{\mathrm{thres}}}{2}\left(\frac{\delta T}{2} - \left(1 + \frac{\delta}{2}\right)\sum_{t=1}^{T} P_t\right), \tag{18}$$

which, after rearranging terms, leads to the following upper-bound

$$\frac{1}{T}\sum_{t=1}^{T} P_t \le \frac{\delta}{2} + \frac{2\big(f(w_0) - f^*\big)}{T\eta\, g_{\mathrm{thres}}} \le \delta. \tag{19}$$

Therefore, the probability that $A_t^c$ occurs for an index $t$ drawn uniformly from $\{1, \dots, T\}$ is lower bounded as

$$\frac{1}{T}\sum_{t=1}^{T}(1 - P_t) \ge 1 - \delta, \tag{20}$$

which concludes the proof of Theorem 1.

### 4 SGD without Perturbation

We now turn our attention to the stochastic variant of gradient descent under the assumption that the stochastic gradients fulfill the CNC condition (Assumption 1). We name this method CNC-SGD and demonstrate that it converges to a second-order stationary point without any additional perturbation. Note that in order to provide the convergence guarantee, we periodically enlarge the step size throughout the optimization process, as outlined in Algorithm 4. This periodic step size increase amplifies the variance along eigenvectors corresponding to the minimum eigenvalue of the Hessian, allowing SGD to exploit the negative curvature in the subsequent steps (using a smaller step size). Increasing the step size therefore plays a role similar to the perturbation step used in CNC-PGD (Algorithm 2).

Algorithm 4 (CNC-SGD). Run SGD steps $w_{t+1} = w_t - \eta \nabla f_z(w_t)$ and periodically take a single step with a larger step size $r > \eta$, setting $\tilde{w} \leftarrow w_t$ beforehand (used in the analysis). After $T$ steps, return an iterate chosen uniformly at random from $\{w_1, \dots, w_T\}$.
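The periodic step-size schedule can be sketched as follows; all names and parameter values are ours, and the schedule (one large step every `period` iterations) is one plausible reading of Algorithm 4:

```python
import numpy as np

def cnc_sgd(grad_fz, w0, T, eta, r, period, rng):
    """Sketch of the CNC-SGD idea: plain SGD whose step size is periodically
    enlarged from eta to r, amplifying the stochastic-gradient variance along
    negative-curvature directions. Parameter names here are ours."""
    w, iterates = w0.copy(), []
    for t in range(T):
        step = r if t % period == 0 else eta   # periodic step size increase
        w = w - step * grad_fz(w, rng)
        iterates.append(w.copy())
    # return an iterate chosen uniformly at random, as in Algorithm 4
    return iterates[rng.integers(T)]

# Toy usage on noisy gradients of the saddle f(w) = (w_0^2 - w_1^2) / 2.
rng = np.random.default_rng(0)
grad_fz = lambda w, rng: np.array([w[0], -w[1]]) + 0.1 * rng.normal(size=2)
w_out = cnc_sgd(grad_fz, np.zeros(2), T=500, eta=0.05, r=0.5, period=50, rng=rng)
print(w_out)
```

Started at the saddle, the noise and occasional large steps push the iterate along the negative-curvature coordinate $w_1$, which the subsequent small steps then exploit.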

Parameters The analysis of CNC-SGD relies on the particular choice of parameters presented in Table 3.

###### Theorem 2.

Let the stochastic gradients in CNC-SGD satisfy Assumption 1 and let $f$ and $f_z$ satisfy Assumption 2. Then Algorithm 4 returns an $\epsilon$-second-order stationary point with probability at least $1-\delta$ after

$$\mathcal{O}\!\left(\left(\frac{\delta\gamma\epsilon}{L\ell^{5/2}}\right)^{-4} \log^2\!\left(\frac{\ell L}{\epsilon\,\delta\gamma}\right)\right)$$

steps, where .

Remarks As reported in Table 1, perturbed SGD - with isotropic noise - converges to an $\epsilon$-second-order stationary point in a number of steps that is polynomial in the dimension $d$ [Ge et al.(2015)Ge, Huang, Jin, and Yuan]. Here, we prove that under the CNC assumption, vanilla SGD - i.e. without perturbations - converges to an $\epsilon$-second-order stationary point in a dimension-free number of stochastic gradient steps. Our result matches the result of [Ge et al.(2015)Ge, Huang, Jin, and Yuan] in terms of first-order optimality and improves the dependency on $\epsilon$ in terms of second-order optimality. However, this second-order rate is still worse than the best known convergence rate for perturbed SGD, established by [Zhang et al.(2017)Zhang, Liang, and Charikar]. One can even improve the convergence guarantee of SGD by using the NEON framework [Allen-Zhu & Li(2017)Allen-Zhu and Li, Xu & Yang(2017)Xu and Yang], but a perturbation with isotropic noise is still required. The theoretical guarantees we provide in Theorem 2, however, are based on a less restrictive assumption. As we prove in the following section, this assumption actually holds for stochastic gradients when learning half-spaces. Subsequently, in Section 6, we present empirical observations that suggest its validity even for training wide and deep neural networks.

### 5 Learning Half-spaces with Correlated Negative Curvature

The analysis presented in the previous sections relies on the CNC assumption introduced in Eq. (6). As mentioned before, this assumption is weaker than the isotropic noise condition required in previous work. In this Section we confirm the validity of this condition for the problem of learning half-spaces which is a core problem in machine learning, commonly encountered when training Perceptrons, Support Vector Machines or Neural Networks [Zhang et al.(2015)Zhang, Lee, Wainwright, and Jordan]. Learning a half-space reduces to a minimization problem of the following form

$$\min_{w \in \mathbb{R}^d}\big[\, f(w) := \mathbb{E}_{z \sim P}[\varphi(w^\top z)] \,\big], \tag{21}$$

where $\varphi$ is an arbitrary loss function and the data distribution $P$ might have a finite or infinite support. There are different choices for the loss function $\varphi$, e.g. zero-one loss, sigmoid loss or piece-wise linear loss [Zhang et al.(2015)Zhang, Lee, Wainwright, and Jordan]. Here, we assume that $\varphi$ is differentiable. Generally, the objective is non-convex and might exhibit many local minima and saddle points.

Note that the stochastic gradient is unbiased and defined as

$$\nabla f(w) = \mathbb{E}_z[\nabla f_z(w)], \qquad \nabla f_z(w) = \varphi'(w^\top z)\, z, \tag{22}$$

where the samples $z$ are drawn from the distribution $P$.
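A minimal implementation of this stochastic gradient, assuming a sigmoid loss $\varphi$ for illustration, makes visible that $\nabla f_z(w)$ is always a rescaling of the sample $z$ itself:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def stochastic_gradient(w, z):
    """Stochastic gradient of Eq. (22), grad f_z(w) = phi'(w^T z) z,
    for the (assumed) sigmoid loss phi(a) = sigmoid(a)."""
    phi_prime = sigmoid(w @ z) * (1.0 - sigmoid(w @ z))  # phi' for phi = sigmoid
    return phi_prime * z

rng = np.random.default_rng(0)
w = rng.normal(size=5)
z = rng.normal(size=5)
z /= np.linalg.norm(z)                  # support on the unit sphere
g = stochastic_gradient(w, z)
print(np.allclose(g, (g @ z) * z))      # True: the gradient is collinear with z
```

This collinearity is exactly why the gradient noise inherits the geometry of the data, rather than being spread isotropically.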

Noise isotropy vs. CNC assumption. First, one can easily find a scenario where the noise isotropy condition is violated for stochastic gradients. Take for example the case where the data distribution from which $z$ is sampled lives in a low-dimensional subspace of $\mathbb{R}^d$. In this case, one can prove that there exists a vector $v$ orthogonal to all $z$ in the support of $P$. Then clearly $\langle v, \nabla f_z(w) \rangle = 0$, and thus the stochastic gradient does not have components along all directions.
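This failure of isotropy is easy to reproduce numerically; the sketch below (with `tanh` as a stand-in for $\varphi'$) draws data from a 2-dimensional subspace of $\mathbb{R}^5$:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 5, 2
basis = np.linalg.qr(rng.normal(size=(d, k)))[0]   # data subspace of dimension k < d

# Direction v orthogonal to the data subspace (project out the basis components).
v = rng.normal(size=d)
v -= basis @ (basis.T @ v)
v /= np.linalg.norm(v)

# Stochastic gradients phi'(w^T z) z always lie in the data subspace,
# so their component along v is exactly zero: isotropy fails.
w = rng.normal(size=d)
Z = (basis @ rng.normal(size=(k, 1000))).T         # 1000 samples z in span(basis)
grads = np.tanh(Z @ w)[:, None] * Z
print(np.max(np.abs(grads @ v)))                   # numerically zero
```

No matter how many samples are drawn, the gradient noise carries no variance along $v$, so any isotropy-based escape guarantee breaks down.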

However - under mild assumptions - we show that the stochastic gradients do have a significant component along directions of negative curvature. Lemma 4 makes this argument precise by establishing a lower bound on the second moment of the stochastic gradients projected onto eigenvectors corresponding to negative eigenvalues of the Hessian matrix $\nabla^2 f(w)$. To establish this lower bound we require the following structural property of the loss function $\varphi$.

###### Assumption 3.

Suppose that the magnitude of the second-order derivative of $\varphi$ is bounded by a constant factor of its first-order derivative, i.e.

$$|\varphi''(\alpha)| \le c\,|\varphi'(\alpha)| \tag{23}$$

holds for all $\alpha$ in the domain of $\varphi$, for some constant $c > 0$.

The reader might notice that this condition resembles the self-concordance assumption often used in the optimization literature [Nesterov(2013)], for which the third derivative is bounded in terms of the second derivative. One can easily check that this condition is fulfilled by commonly used activation functions in neural networks, such as the sigmoid and softplus. We now leverage this property to prove that the stochastic gradient satisfies Assumption 1 (CNC).
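For instance, for the sigmoid $\varphi(\alpha) = 1/(1+e^{-\alpha})$ one has $\varphi' = \varphi(1-\varphi)$ and $\varphi'' = \varphi'(1-2\varphi)$, so Assumption 3 holds with $c = 1$; a quick numerical check:

```python
import numpy as np

# Check |phi''(a)| <= c |phi'(a)| for the sigmoid phi(a) = 1/(1+exp(-a)) with c = 1:
# phi' = phi (1 - phi), phi'' = phi' (1 - 2 phi), and |1 - 2 phi| <= 1 everywhere.
a = np.linspace(-20.0, 20.0, 100_001)
phi = 1.0 / (1.0 + np.exp(-a))
phi_p = phi * (1.0 - phi)
phi_pp = phi_p * (1.0 - 2.0 * phi)
print(np.all(np.abs(phi_pp) <= np.abs(phi_p) + 1e-15))  # True
```

The same constant works for softplus, whose first derivative is the sigmoid itself and whose second derivative is $\varphi'(1-\varphi')$ in terms of that sigmoid.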

###### Lemma 4.

Consider the problem of learning half-spaces as stated in Eq. (21), where $\varphi$ satisfies Assumption 3. Furthermore, assume that the support of $P$ is a subset of the unit sphere. Let $v$ be a unit-length eigenvector of $\nabla^2 f(w)$ with corresponding eigenvalue $\lambda < 0$. Then

$$\mathbb{E}_z\big[(\nabla f_z(w)^\top v)^2\big] \ge (\lambda/c)^2. \tag{24}$$
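This bound can be verified numerically on an empirical distribution over the unit sphere. The sketch below (our construction) uses the sigmoid loss, for which $c = 1$, and checks Eq. (24) for the eigenvector of the smallest eigenvalue:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5000, 6
Z = rng.normal(size=(n, d))
Z /= np.linalg.norm(Z, axis=1, keepdims=True)     # empirical P on the unit sphere
w = 3.0 * rng.normal(size=d)

s = 1.0 / (1.0 + np.exp(-(Z @ w)))                # sigmoid loss, so c = 1
phi_p = s * (1 - s)                               # phi'(w^T z)
phi_pp = phi_p * (1 - 2 * s)                      # phi''(w^T z)

H = (Z * phi_pp[:, None]).T @ Z / n               # Hessian E[phi''(w^T z) z z^T]
lam, V = np.linalg.eigh(H)
v = V[:, 0]                                       # eigenvector of the smallest eigenvalue
lhs = np.mean((phi_p * (Z @ v)) ** 2)             # E[(grad f_z(w)^T v)^2]
print(lhs >= lam[0] ** 2)                         # True: the bound of Eq. (24) with c = 1
```

The chain of inequalities behind the lemma (Assumption 3, $|v^\top z| \le 1$, Cauchy-Schwarz) holds for any distribution on the sphere, so the check passes exactly, not just approximately.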

Discussion Since the result of Lemma 4 holds for any eigenvector associated with a negative eigenvalue $\lambda$, this naturally includes the eigenvector(s) corresponding to $\lambda_{\min}(\nabla^2 f(w))$. As a result, Assumption 1 (CNC) holds for stochastic gradients on learning half-spaces. Combining this result with the derived convergence guarantees in Theorem 1 implies that a mix of SGD and GD steps (Algorithm 2) obtains a second-order stationary point in polynomial time. Furthermore, according to Theorem 2, vanilla SGD obtains a second-order stationary point in polynomial time without any explicit perturbations. Notably, both established convergence guarantees are dimension-free.

Furthermore, Lemma 4 reveals an interesting relationship between stochastic gradients and eigenvectors at a given iterate $w$. Namely, the variance of stochastic gradients along these vectors scales proportionally to the magnitude of the negative eigenvalues within the spectrum of the Hessian matrix. This is in clear contrast to the case of isotropic noise, whose variance is uniformly distributed along all eigenvectors of the Hessian matrix. The difference can be important from a generalization point of view. Consider the simplified setting where $\varphi$ is the square loss. Then the eigenvectors with large eigenvalues correspond to the principal directions of the data. In this regard, having a lower variance along the non-principal directions avoids over-fitting.

In the following section we confirm the above results and furthermore show experiments on Neural Networks that suggest the validity of these results beyond the setting of learning half-spaces.

### 6 Experiments

In this Section we first show that vanilla SGD (Algorithm 4) as well as GD with a stochastic gradient step as perturbation (Algorithm 2) indeed escape saddle points. Towards this end, we initialize SGD, GD, perturbed GD with isotropic noise (ISO-PGD) [Jin et al.(2017a)Jin, Ge, Netrapalli, Kakade, and Jordan] and CNC-PGD close to a saddle point on a low-dimensional learning half-spaces problem with Gaussian input data and sigmoid loss. Figure 2 shows suboptimality over epochs, averaged over 10 runs. The results are in line with our analysis since all stochastic methods quickly find a negative curvature direction to escape the saddle point. See Appendix E for more details.

Secondly - and more importantly - we study the properties of the variance of stochastic gradients depending on the width and depth of neural networks. All of these experiments are conducted using feed-forward networks on the well-known MNIST classification task. Specifically, we draw random parameters $w_i$ in each of these networks and test Assumption 1 by estimating the second moment of the stochastic gradients projected onto the eigenvectors $v_k$ of the Hessian as follows

$$\mu_k = \frac{1}{m}\sum_{i=1}^{m}\left(\frac{1}{n}\sum_{j=1}^{n}\big(\nabla f_j(w_i)^\top v_k\big)^2\right). \tag{25}$$
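A direct implementation of the inner average in Eq. (25) can be sketched as follows; the toy quadratic model standing in for the neural network is ours, chosen so that the Hessian eigenvectors are known exactly:

```python
import numpy as np

def projected_second_moments(stoch_grads, eigvecs):
    """Inner average of Eq. (25) for a single parameter draw w_i: the mean
    squared projection of n stochastic gradients onto each eigenvector v_k.
    Averaging this over m parameter draws gives mu_k."""
    return np.mean((stoch_grads @ eigvecs) ** 2, axis=0)  # one value per eigenvector

# Toy model f_z(w) = 0.5 (w - z)^T A (w - z): gradients A (w - z), Hessian A.
rng = np.random.default_rng(0)
d = 8
A = np.diag(np.linspace(-2.0, 2.0, d))     # eigenvalues from -2 to 2, eigenvectors = I
w = rng.normal(size=d)
grads = (A @ (w[:, None] - rng.normal(size=(d, 10_000)))).T
mu = projected_second_moments(grads, np.eye(d))
print(mu)  # scales with the squared eigenvalue: largest at |lambda| = 2, smallest near 0
```

In this toy setting $\mu_k = \lambda_k^2 (w_k^2 + 1)$, reproducing in miniature the eigenvalue-dependent variance pattern reported for the networks.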

We do the same for isotropic noise vectors drawn from the unit ball around each $w_i$. Figure 1 shows this estimate for eigenvectors corresponding to the minimum eigenvalues for a 1-hidden-layer network with an increasing number of units (top) and for a 10-hidden-unit network with an increasing number of layers (bottom). Similar results on the entire negative eigenspectrum can be found in Appendix E. Figure 3 shows how $\mu_k$ varies with the magnitude of the corresponding negative eigenvalues $\lambda_k$. Again we evaluate 30 random parameter settings in neural networks with increasing depth. Three interesting conclusions can be drawn from the results:

1. As expected, the variance of isotropic noise along eigenvectors corresponding to $\lambda_{\min}$ decreases as $1/d$.

2. The stochastic gradients, however, maintain a significant component along the directions of most negative curvature independent of width and depth of the neural network (see Figure 1).

3. Finally, the stochastic gradients yield an increasing variance along eigenvectors corresponding to larger eigenvalues (see Figure 3).

These findings suggest important implications. (1) and (2) justify the use and explain the success of training wide and deep neural networks with pure SGD despite the presence of saddle points. (3) suggests that the bound established in Lemma 4 may well extend to more general settings such as training neural networks, and it illustrates the implicit regularization of optimization methods that rely on stochastic gradients, since directions of large curvature correspond to principal (more robust) components of the data for many machine learning models.

### 7 Conclusion

In this work we have analyzed the convergence of PGD and SGD for optimizing non-convex functions under a new assumption - named CNC - that requires the stochastic noise to exhibit a certain amount of variance along the directions of most negative curvature. This is a less restrictive assumption than the noise isotropy condition required by previous work, which introduces a dependency on the problem dimensionality into the convergence rate. For the problem of learning half-spaces, we have shown theoretically that stochastic gradients satisfy the CNC assumption and exhibit a variance proportional to the magnitude of the eigenvalues. Furthermore, we provided empirical evidence which suggests the validity of this assumption in the context of neural networks and thus contributed to a better understanding of training these models with stochastic gradients. Proving this observation theoretically and investigating its implications for the optimization and generalization properties of stochastic gradient methods is an interesting direction of future research.

Acknowledgments We would like to thank Kfir Levy, Gary Becigneul, Yannic Kilcher and Kevin Roth for their helpful discussions.

## Appendix

### Appendix A Preliminaries

Assumptions Recall that we assumed the function $f$ is $L$-smooth (or $L$-gradient Lipschitz) and $\rho$-Hessian Lipschitz. We define these two properties below.

###### Definition 1 (Smooth function).

A differentiable function $f$ is $L$-smooth (or $L$-gradient Lipschitz) if

$$\|\nabla f(w_1) - \nabla f(w_2)\| \le L\|w_1 - w_2\|, \quad \forall w_1, w_2 \in \mathbb{R}^d \tag{26}$$
###### Definition 2 (Hessian Lipschitz).

A twice-differentiable function $f$ is $\rho$-Hessian Lipschitz if

$$\|\nabla^2 f(w_1) - \nabla^2 f(w_2)\| \le \rho\|w_1 - w_2\|, \quad \forall w_1, w_2 \in \mathbb{R}^d \tag{27}$$

A differentiable function $f_z$ has $\ell$-bounded gradients if

$$\|\nabla f_z(w)\| \le \ell, \quad \forall w \in \mathbb{R}^d \tag{28}$$
##### Convergence of SGD on a smooth function
###### Lemma 5.

Let $w_{t+1}$ be obtained from one stochastic gradient step at $w_t$ on the $L$-smooth objective $f$, namely

$$w_{t+1} = w_t - \eta \nabla f_z(w_t),$$

where $\eta > 0$ and $f_z$ has $\ell$-bounded gradients. Then the function value decreases in expectation as

$$\mathbb{E}_z[f(w_{t+1})] - f(w_t) \le -\eta\|\nabla f(w_t)\|^2 + L\eta^2\ell^2/2. \tag{29}$$
###### Proof.

The proof is based on a straightforward application of smoothness:

$$\begin{aligned} \mathbb{E}_z[f(w_{t+1})] - f(w_t) &\le -\eta\, \nabla f(w_t)^\top \mathbb{E}[\nabla f_z(w_t)] + \tfrac{L}{2}\eta^2\, \mathbb{E}\|\nabla f_z(w_t)\|^2 &\text{(30)}\\ &= -\eta \|\nabla f(w_t)\|^2 + \tfrac{L}{2}\eta^2\, \mathbb{E}\|\nabla f_z(w_t)\|^2 &\text{(31)}\\ &\le -\eta \|\nabla f(w_t)\|^2 + L\eta^2 \ell^2 / 2 &\text{(32)} \end{aligned}$$ ∎

##### Bounded series
###### Lemma 6.

For all $\beta \in (0,1]$ and $t \geq 1$, the following series are bounded as

 $$\begin{aligned} \sum_{i=1}^{t} (1+\beta)^{t-i} &\leq 2\beta^{-1}(1+\beta)^t \quad (33) \\ \sum_{i=1}^{t} (1+\beta)^{t-i}\, i &\leq 2\beta^{-2}(1+\beta)^t \quad (34) \\ \sum_{i=1}^{t} (1+\beta)^{t-i}\, i^2 &\leq 6\beta^{-3}(1+\beta)^t \quad (35) \end{aligned}$$
###### Proof.

The proof is based on the following bounds on power series for $0 < z < 1$:

 $$\begin{aligned} \sum_{k=1}^{\infty} z^k &\leq 1/(1-z) \quad (36) \\ \sum_{k=1}^{\infty} z^k k &= z/(1-z)^2 \quad (37) \\ \sum_{k=1}^{\infty} z^k k^2 &= z(1+z)/(1-z)^3. \quad (38) \end{aligned}$$

For the sake of brevity, we omit the subsequent (straightforward) derivations needed to complete the proof. ∎
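The three bounds of Lemma 6 can also be checked numerically; the sketch below evaluates the partial sums on an arbitrary grid of $\beta \in (0,1]$ and $t$ values (the grid is an assumption of this illustration).

```python
# Numerical check of the series bounds (33)-(35) of Lemma 6
# on an arbitrary grid of beta in (0, 1] and horizon t.
for beta in (0.1, 0.5, 1.0):
    for t in (1, 5, 20, 100):
        g = (1 + beta) ** t
        s0 = sum((1 + beta) ** (t - i) for i in range(1, t + 1))
        s1 = sum((1 + beta) ** (t - i) * i for i in range(1, t + 1))
        s2 = sum((1 + beta) ** (t - i) * i * i for i in range(1, t + 1))
        assert s0 <= 2 * g / beta            # Eq. (33)
        assert s1 <= 2 * g / beta ** 2       # Eq. (34)
        assert s2 <= 6 * g / beta ** 3       # Eq. (35)
print("series bounds hold on the test grid")
```

Note that the bounds (34) and (35) are tight at $\beta = 1$, $t = 1$, which is why the range of $\beta$ matters.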

### Appendix B PGD analysis

#### b.1 Choosing the parameters

Table 4 presents the choice of parameters together with the required constraints on them. This table summarizes our approach for choosing the parameters of CNC-PGD presented in Algorithm 2.

#### b.2 Sharp negative curvature regime

###### Lemma 7 (Restated Lemma 2).

Let Assumptions 1 and 2 hold. Consider perturbed gradient steps (Algorithm 2 with parameters as in Table 2) starting from $\tilde{w}_t$ such that $\|\nabla f(\tilde{w}_t)\| \leq \epsilon$. Assume the Hessian matrix has a large negative eigenvalue, i.e.

 λmin(∇2f(~wt))≤−√ρϵ2/5. (39)

Then, after $t_{\mathrm{thres}}$ iterations the function value decreases as

 E[f(wt+tthres)]−f(~wt)≤−fthres, (40)

where the expectation is taken over the sequence of iterates.

Notation Without loss of generality, we write $\tilde{w} := \tilde{w}_t$. Let $v$ be the eigenvector corresponding to the minimum eigenvalue of the Hessian at $\tilde{w}$. We use the following compact notations:

 $f_t := f(w_t), \quad \nabla f_t := \nabla f(w_t), \quad \tilde{f} := f(\tilde{w}), \quad \nabla \tilde{f} := \nabla f(\tilde{w}), \quad H := \nabla^2 f(\tilde{w}), \quad \nabla g_t := \nabla g(w_t).$ (41)

Note that $\tilde{w}$ denotes the parameter before the perturbation and $w_t$ is obtained by $t$ GD steps after the perturbation. Recall the compact notation

 $\lambda := |\min\{\lambda_{\min}(\nabla^2 f(\tilde{w})),\, 0\}|$ (42)

Finally, we set .

Proof sketch The proof presented below proceeds by contradiction and is inspired by the analysis of accelerated gradient descent in non-convex settings in [Jin et al.(2017b)Jin, Netrapalli, and Jordan]. We first assume that the sufficient decrease condition is not met and show that this implies an upper bound on the distance moved over a given number of iterations. We then derive a lower bound on the iterate distance and show that, for the specific choice of parameters introduced earlier, this lower bound contradicts the upper bound for a large enough number of steps $t_{\mathrm{thres}}$. We therefore conclude that we get sufficient decrease after $t_{\mathrm{thres}}$ steps.

Proof of Lemma 7

Part 1: Upper bounding the distance on the iterates in terms of function decrease. We assume that PGD does not obtain the desired function decrease in $t_{\mathrm{thres}}$ iterations, i.e.

 E[f(wtthres)−f(~w)]>−fthres. (43)

The above assumption implies that the iterates stay close to $\tilde{w}$ for all $t \leq t_{\mathrm{thres}}$. We formalize this result in the following lemma.

###### Lemma 8 (Distance Bound).

Under the setting of Lemma 7, assume Eq (43) holds. Then the expected distance to the initial parameter can be bounded as

 E[∥wt−~w∥2]≤2(2ηfthres+ηL(ℓr)2)t+2(ℓr)2∀t≤tthres, (44)

as long as .

###### Proof.

Here, we use the proof strategy proposed for normalized gradient descent in [Levy(2016)]. First, we bound the effect of the noise in the first step. Recall the first update of Algorithm 2 under the above setting:

 w1=~w−rξ,ξ:=∇fz(~w) (45)

Then, by a straightforward application of Lemma 5 (with step size $r$), we have

 $\mathbb{E}[f_1 - \tilde{f}] \leq -r\|\nabla \tilde{f}\|^2 + \tfrac{L}{2}(\ell r)^2.$ (46)

We proceed using the result of Lemma 1 that relates the function decrease to the norm of the visited gradients:

 $$\begin{aligned} \mathbb{E}[f_{t_{\mathrm{thres}}} - \tilde{f}] &= \sum_{t=1}^{t_{\mathrm{thres}}} \mathbb{E}[f_t - f_{t-1}] \\ &\leq -\frac{\eta}{2} \sum_{t=1}^{t_{\mathrm{thres}}-1} \mathbb{E}\|\nabla f_t\|^2 + \mathbb{E}[f_1 - \tilde{f}] \\ &\overset{\text{Eq.~(46)}}{\leq} -\frac{\eta}{2} \sum_{t=1}^{t_{\mathrm{thres}}-1} \mathbb{E}\|\nabla f_t\|^2 + \frac{L}{2}(\ell r)^2. \end{aligned}$$ (47)

According to Eq. (43), the function value does not decrease too much. Plugging this bound into the above inequality yields an upper bound on the sum of the squared norm of the visited gradients, i.e.

 tthres−1∑t=1E∥∇ft∥2≤(2fthres+L(ℓr)2)/η. (48)

Using the above result allows us to bound the expected distance in the parameter space as:

 $$\begin{aligned} \mathbb{E}[\|w_t - w_1\|^2] &= \mathbb{E}\Big[\big\|\sum_{i=2}^{t} (w_i - w_{i-1})\big\|^2\Big] \\ &\leq \mathbb{E}\Big[\Big(\sum_{i=2}^{t} \|w_i - w_{i-1}\|\Big)^2\Big] && \text{(triangle inequality)} \quad (49) \\ &\leq \mathbb{E}\Big[t \sum_{i=2}^{t} \|w_i - w_{i-1}\|^2\Big] && \text{(Cauchy-Schwarz inequality)} \quad (50) \\ &= t\, \eta^2\, \mathbb{E}\Big[\sum_{i=1}^{t-1} \|\nabla f_i\|^2\Big] \\ &\leq (2\eta f_{\mathrm{thres}} + \eta L (\ell r)^2)\, t, \quad \forall t \leq t_{\mathrm{thres}}. && \text{(Eq.~(48))} \quad (51) \end{aligned}$$

Plugging the above inequality into the following bound completes the proof:

 E∥wt−~w∥2 ≤2E∥wt−w1∥2+2E∥w1−~w∥2 ≤2(2ηfthres+ηL(ℓr)2)t+2(ℓr)2 (52)
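The triangle and Cauchy-Schwarz steps in Eqs. (49)-(50) rest on the elementary inequality $\|\sum_i a_i\|^2 \leq t \sum_i \|a_i\|^2$ for any $t$ vectors; the following sketch checks it on random vectors (the dimension and sample sizes are arbitrary choices of this illustration).

```python
import random

# Check ||sum_i a_i||^2 <= t * sum_i ||a_i||^2 for random collections
# of t vectors in R^3 (dimension and counts chosen arbitrarily).
random.seed(1)
for _ in range(200):
    t = random.randint(1, 10)
    vecs = [[random.gauss(0, 1) for _ in range(3)] for _ in range(t)]
    total = [sum(v[k] for v in vecs) for k in range(3)]
    lhs = sum(x * x for x in total)                        # ||sum a_i||^2
    rhs = t * sum(sum(x * x for x in v) for v in vecs)     # t * sum ||a_i||^2
    assert lhs <= rhs + 1e-9
print("inequality verified")
```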

Part 2: Quadratic approximation Since the parameter vector stays close to $\tilde{w}$ under the condition in Eq. (43), we can use a "stale" Taylor approximation of the function at $\tilde{w}$:

 $g(w) = \tilde{f} + (w - \tilde{w})^\top \nabla f(\tilde{w}) + \tfrac{1}{2}(w - \tilde{w})^\top H (w - \tilde{w}).$

Using a stale Taylor approximation over all iterations is the essential part of the proof, first proposed by [Ge et al.(2015)Ge, Huang, Jin, and Yuan] for the analysis of the PSGD method. The next lemma proves that the gradient of $f$ can be approximated by the gradient of $g$ as long as $w$ is close enough to $\tilde{w}$.

###### Lemma 9 (Taylor expansion bound for the gradient [Nesterov(2013)]).

For every twice-differentiable, $\rho$-Hessian Lipschitz function $f$, the following bound holds:

 $\|\nabla f(w) - \nabla g(w)\| \leq \tfrac{\rho}{2} \|w - \tilde{w}\|^2$ (53)
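A one-dimensional illustration of Lemma 9: for the toy choice $f(w) = w^3/6$ (an assumption of this sketch), $f''(w) = w$ is $1$-Lipschitz, so $\rho = 1$, and the bound in Eq. (53) holds with equality.

```python
# Toy 1-d check of the Taylor expansion bound (53) with f(w) = w^3/6,
# for which f'(w) = w^2/2, f''(w) = w, and rho = 1 (assumptions of
# this sketch). Here grad g(w) = f'(w~) + f''(w~)(w - w~).
rho = 1.0
w_tilde = 0.7
grad_f = lambda w: w * w / 2          # f'(w)
hess_at_tilde = w_tilde               # f''(w~)
grad_g = lambda w: grad_f(w_tilde) + hess_at_tilde * (w - w_tilde)

for w in (-2.0, -0.3, 0.0, 0.7, 1.5, 3.0):
    err = abs(grad_f(w) - grad_g(w))  # equals (w - w~)^2 / 2 here
    assert err <= rho / 2 * (w - w_tilde) ** 2 + 1e-12
print("Taylor gradient bound verified")
```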

Furthermore, the guaranteed closeness to the initial parameter allows us to use the gradient of the quadratic objective in the GD steps as follows,

 $$\begin{aligned} w_{t+1} - \tilde{w} &= w_t - \eta \nabla f_t - \tilde{w} \\ &= w_t - \tilde{w} - \eta \nabla g_t + \eta(\nabla g_t - \nabla f_t) \\ &= (I - \eta H)(w_t - \tilde{w}) + \eta(\nabla g_t - \nabla f_t - \nabla f(\tilde{w})) \\ &= u_t + \eta(\delta_t + d_t), \end{aligned}$$ (54)

where the vectors $u_t$, $\delta_t$, and $d_t$ are defined in Table 5.
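The decomposition in Eq. (54) is purely algebraic, since $\nabla g_t = \nabla f(\tilde{w}) + H(w_t - \tilde{w})$. The sketch below verifies it numerically on an arbitrary two-dimensional test function (the function, step size, and points are assumptions of this illustration, and the residual $\nabla g_t - \nabla f_t - \nabla f(\tilde{w})$ is kept as a single term).

```python
# Numerical check of Eq. (54) for the toy function
# f(w) = (w1^4 + w2^4)/4 + w1*w2 (an arbitrary choice for this sketch),
# with g the stale quadratic expansion of f at w~.
eta = 0.05
w_t = [1.2, -0.4]
w_tl = [0.9, 0.1]                                        # w~ (expansion point)

grad = lambda w: [w[0] ** 3 + w[1], w[1] ** 3 + w[0]]    # grad f
H = [[3 * w_tl[0] ** 2, 1.0], [1.0, 3 * w_tl[1] ** 2]]   # Hessian at w~

g_tilde = grad(w_tl)                                     # grad f(w~)
d = [w_t[i] - w_tl[i] for i in range(2)]                 # w_t - w~
grad_g = [g_tilde[i] + sum(H[i][j] * d[j] for j in range(2)) for i in range(2)]
grad_f = grad(w_t)

# Left-hand side: w_{t+1} - w~ with w_{t+1} = w_t - eta * grad f(w_t).
lhs = [w_t[i] - eta * grad_f[i] - w_tl[i] for i in range(2)]
# Right-hand side: (I - eta H)(w_t - w~) + eta (grad g_t - grad f_t - grad f(w~)).
IH = [[(1.0 if i == j else 0.0) - eta * H[i][j] for j in range(2)] for i in range(2)]
rhs = [sum(IH[i][j] * d[j] for j in range(2))
       + eta * (grad_g[i] - grad_f[i] - g_tilde[i]) for i in range(2)]
assert all(abs(lhs[i] - rhs[i]) < 1e-12 for i in range(2))
print("decomposition verified")
```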