# On Complexity of Finding Stationary Points of Nonsmooth Nonconvex Functions

## Abstract

We provide the first non-asymptotic analysis for finding stationary points of nonsmooth, nonconvex functions. In particular, we study the class of Hadamard semi-differentiable functions, perhaps the largest class of nonsmooth functions for which the chain rule of calculus holds. This class contains important examples such as ReLU neural networks and others with non-differentiable activation functions. First, we show that finding an $\epsilon$-stationary point with first-order methods is impossible in finite time. Therefore, we introduce the notion of $(\delta,\epsilon)$-stationarity, a generalization that allows for a point to be within distance $\delta$ of an $\epsilon$-stationary point and reduces to $\epsilon$-stationarity for smooth functions. We propose a series of randomized first-order methods and analyze their complexity of finding a $(\delta,\epsilon)$-stationary point. Furthermore, we provide a lower bound and show that our stochastic algorithm has min-max optimal dependence on $\delta$. Empirically, our methods perform well for training ReLU neural networks.

## 1 Introduction

Gradient based optimization underlies most of machine learning and it has attracted tremendous research attention over the years. While non-asymptotic complexity analysis of gradient based methods is well-established for convex and smooth nonconvex problems, little is known for nonsmooth nonconvex problems. We summarize the known rates (black) in Table 1 based on the references (Nesterov, 2018; Carmon et al., 2017; Arjevani et al., 2019).

Within the nonsmooth nonconvex setting, recent research results have focused on asymptotic convergence analysis (Benaïm et al., 2005; Kiwiel, 2007; Majewski et al., 2018; Davis et al., 2018; Bolte and Pauwels, 2019). Despite their advances, these results fail to address finite-time, non-asymptotic convergence rates. Given the widespread use of nonsmooth nonconvex problems in machine learning, a canonical example being deep ReLU neural networks, obtaining a non-asymptotic convergence analysis is an important open problem of fundamental interest.

We tackle this problem for nonsmooth functions that are Lipschitz and directionally differentiable. This class is rich enough to cover common machine learning problems, including ReLU neural networks. Surprisingly, even for this seemingly restricted class, finding an $\epsilon$-stationary point, i.e., a point $x$ for which $\min\{\|g\| \mid g \in \partial f(x)\} \le \epsilon$, is intractable. In other words, no algorithm can guarantee to find an $\epsilon$-stationary point within a finite number of iterations.

This intractability suggests that, to obtain meaningful non-asymptotic results, we need to refine the notion of stationarity. We introduce such a notion and base our analysis on it, leading to the following main contributions of the paper:

• We show that a traditional $\epsilon$-stationary point cannot be obtained in finite time (Theorem 5).

• We study the notion of $(\delta,\epsilon)$-stationary points (see Definition 4). For smooth functions, this notion reduces to usual $\epsilon$-stationarity by setting $\delta = O(\epsilon/L)$. We provide an $\Omega(\delta^{-1})$ lower bound on the number of calls if algorithms are only allowed access to a generalized gradient oracle.

• We propose a normalized “gradient descent” style algorithm that achieves $\tilde{O}(\epsilon^{-3}\delta^{-1})$ complexity in finding a $(\delta,\epsilon)$-stationary point in the deterministic setting.

• We propose a momentum based algorithm that achieves $\tilde{O}(\epsilon^{-4}\delta^{-1})$ complexity in finding a $(\delta,\epsilon)$-stationary point in the stochastic finite variance setting.

As a proof of concept to validate our theoretical findings, we implement our stochastic algorithm and show that it matches the performance of the widely used SGD with momentum for training ResNets on the CIFAR10 dataset.

Our results attempt to bridge the gap from recent advances in developing a non-asymptotic theory for nonconvex optimization algorithms to settings that apply to training deep neural networks, where, due to non-differentiability of the activations, most existing theory does not directly apply.

### 1.1 Related Work

Asymptotic convergence for nonsmooth nonconvex functions. Benaïm et al. (2005) study the convergence of subgradient methods from a differential inclusion perspective; Majewski et al. (2018) extend the result to include proximal and implicit updates. Bolte and Pauwels (2019) focus on formally justifying the back propagation rule under nonsmooth conditions. In parallel, Davis et al. (2018) proved asymptotic convergence of subgradient methods assuming the objective function to be Whitney stratifiable. The class of Whitney stratifiable functions is broader than regular functions studied in (Majewski et al., 2018), and it does not assume the regularity inequality (see Lemma 6.3 and (51) in (Majewski et al., 2018)). Another line of work (Mifflin, 1977; Kiwiel, 2007; Burke et al., 2018) studies convergence of gradient sampling algorithms. These algorithms assume a deterministic generalized gradient oracle. Our methods draw intuition from these algorithms and their analysis, but are non-asymptotic in contrast.

Structured nonsmooth nonconvex problems. Another line of research in nonconvex optimization is to exploit structure: Duchi and Ruan (2018); Drusvyatskiy and Paquette (2019); Davis and Drusvyatskiy (2019) consider the composition structure of convex and smooth functions; Bolte et al. (2018); Zhang and He (2018); Beck and Hallak (2020) study composite objectives of the form $f + g$ where one function is differentiable or convex/concave. With such structure, one can apply proximal gradient algorithms if the proximal mapping can be efficiently evaluated. However, this usually requires weak convexity, i.e., adding a quadratic function makes the function convex, which is not satisfied by several simple functions, e.g., $-|x|$.

Stationary points under smoothness. When the objective function is smooth, SGD finds an $\epsilon$-stationary point in $O(\epsilon^{-4})$ gradient calls (Ghadimi and Lan, 2013), which improves to $O(\epsilon^{-2})$ for convex problems. Fast upper bounds under a variety of settings (deterministic, finite-sum, stochastic) are studied in (Carmon et al., 2018; Fang et al., 2018; Zhou et al., 2018; Nguyen et al., 2019; Allen-Zhu, 2018; Reddi et al., 2016). More recently, lower bounds have also been developed (Carmon et al., 2017; Drori and Shamir, 2019; Arjevani et al., 2019; Foster et al., 2019). When the function enjoys high-order smoothness, a stronger goal is to find an approximate second-order stationary point, which also allows escaping saddle points. Many methods focus on this goal (Ge et al., 2015; Agarwal et al., 2017; Jin et al., 2017; Daneshmand et al., 2018; Fang et al., 2019).

## 2 Preliminaries

In this section, we set up the notion of generalized directional derivatives that will play a central role in our analysis. Throughout the paper, we assume that the nonsmooth function $f$ is $L$-Lipschitz continuous (more precise assumptions on the function class are outlined in §2.3).

###### Definition 1.

Given a point $x \in \mathbb{R}^d$, and direction $d$, the generalized directional derivative of $f$ is defined as

$$f^\circ(x;d) := \limsup_{y \to x,\ t \downarrow 0} \frac{f(y+td) - f(y)}{t}.$$

###### Definition 2.

The generalized gradient of $f$ is defined as

$$\partial f(x) := \{g \mid \langle g, d\rangle \le f^\circ(x;d),\ \forall d \in \mathbb{R}^d\}.$$

We recall below some basic properties of the generalized gradient, see e.g., (Clarke, 1990) for details.

###### Proposition 1 (Properties of generalized gradients).

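1. $\partial f(x)$ is a nonempty, convex compact set. For all vectors $g \in \partial f(x)$, we have $\|g\| \le L$.

2. $f^\circ(x;d) = \max\{\langle g, d\rangle \mid g \in \partial f(x)\}$.

3. $\partial f(x)$ is an upper-semicontinuous set valued map.

4. $f$ is differentiable almost everywhere (as it is $L$-Lipschitz); let $\mathrm{conv}(\cdot)$ denote the convex hull, then we have that

$$\partial f(x) = \mathrm{conv}\big(\{g \mid g = \lim_{k\to\infty} \nabla f(x_k),\ x_k \to x\}\big).$$

5. Let $B$ denote the unit Euclidean ball. Then,

$$\partial f(x) = \bigcap_{\delta > 0} \bigcup_{y \in x + \delta B} \partial f(y).$$

6. For any $y, z$, there exist $\lambda \in (0,1)$ and $g \in \partial f(\lambda y + (1-\lambda) z)$ such that

$$f(y) - f(z) = \langle g, y - z\rangle.$$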
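To make property 4 concrete, the following sketch approximates the generalized gradient of a one-dimensional function as the interval spanned by gradients sampled near the point. The helper name `clarke_interval_1d`, the sampling radius, and the sample count are illustrative assumptions, not part of the paper.

```python
import numpy as np

def clarke_interval_1d(grad, x, radius=1e-6, n=2000):
    """Approximate the 1-D generalized gradient at x as the convex hull
    (an interval) of gradients sampled at nearby points (hypothetical helper)."""
    ys = x + np.linspace(-radius, radius, n)
    ys = ys[ys != x]  # skip x itself, where f may be non-differentiable
    gs = np.array([grad(y) for y in ys])
    return gs.min(), gs.max()

# Derivatives away from the kink at 0.
relu_grad = lambda y: 1.0 if y > 0 else 0.0  # d/dy max(y, 0)
abs_grad = lambda y: np.sign(y)              # d/dy |y|

relu_at_kink = clarke_interval_1d(relu_grad, 0.0)  # interval [0, 1]
abs_at_kink = clarke_interval_1d(abs_grad, 0.0)    # interval [-1, 1]
abs_smooth = clarke_interval_1d(abs_grad, 1.0)     # single gradient {1}
```

At a kink the interval is non-degenerate, while at a differentiable point it collapses to the ordinary derivative.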

### 2.2 Directional derivatives

Since general nonsmooth functions can have arbitrarily large variations in their “gradients,” we must restrict the function class to be able to develop a meaningful complexity theory. We show below that directionally differentiable functions match this purpose well.

###### Definition 3.

A function $f$ is called directionally differentiable in the sense of Hadamard (cf. (Sova, 1964; Shapiro, 1990)) if for any mapping $\varphi: \mathbb{R}_+ \to \mathbb{R}^d$ for which $\varphi(0) = x$ and $\lim_{t \to 0^+} \frac{\varphi(t) - \varphi(0)}{t} = d$, the following limit exists:

$$f'(x;d) = \lim_{t \to 0^+} \frac{1}{t}\big(f(\varphi(t)) - f(x)\big). \tag{1}$$

In the rest of the paper, we will say a function $f$ is directionally differentiable if it is directionally differentiable in the sense of Hadamard at all $x$.

This directional differentiability is also referred to as Hadamard semidifferentiability in (Delfour, 2019). Notably, such directional differentiability is satisfied by most problems of interest in machine learning. It includes functions such as $f(x) = -|x|$ that do not satisfy the so-called regularity inequality (equation (51) in (Majewski et al., 2018)). Moreover, it covers the class of semialgebraic functions, as well as o-minimally definable functions (see Lemma 6.1 in (Coste, 2000)) discussed in (Davis et al., 2018). Currently, we are unaware whether the notion of Whitney stratifiability (studied in some recent works on nonsmooth optimization) implies directional differentiability.

A very important property of directional differentiability is that it is preserved under composition.

###### Lemma 2 (Chain rule).

Let $\phi$ be Hadamard directionally differentiable at $x$, and $\psi$ be Hadamard directionally differentiable at $\phi(x)$. Then the composite mapping $\psi \circ \phi$ is Hadamard directionally differentiable at $x$ and

$$(\psi \circ \phi)'_x = \psi'_{\phi(x)} \circ \phi'_x.$$

A proof of this lemma can be found in (Shapiro, 1990, Proposition 3.6). As a consequence, any neural network function composed of directionally differentiable functions, including ReLU/LeakyReLU, is directionally differentiable.
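As an illustration of the chain rule, the sketch below computes the directional derivative of a tiny two-layer ReLU network by composing the directional derivative of each layer, and compares it with a one-sided finite difference at a kink. The weights `W1`, `w2` and the helper names are made up for this example.

```python
import numpy as np

def relu_dd(z, v):
    """Directional derivative of elementwise ReLU at z along v."""
    return np.where(z > 0, v, np.where(z < 0, 0.0, np.maximum(v, 0.0)))

# Hypothetical weights; chosen so that x = (1, 1) sits on a kink.
W1 = np.array([[1.0, -1.0], [2.0, 1.0]])
w2 = np.array([1.0, -1.0])

def net(x):
    return w2 @ np.maximum(W1 @ x, 0.0)

def net_dd(x, d):
    # Chain rule: the linear layer maps d to W1 d, ReLU maps that through
    # relu_dd at the pre-activation, and the output layer is linear again.
    return w2 @ relu_dd(W1 @ x, W1 @ d)

x = np.array([1.0, 1.0])          # first pre-activation is exactly 0
d = np.array([1.0, 0.0])
t = 1e-6
finite_diff = (net(x + t * d) - net(x)) / t
chain_rule = net_dd(x, d)         # agrees with finite_diff
opposite = net_dd(x, -d)          # not equal to -chain_rule at a kink
```

Note that `net_dd(x, -d)` differs from `-net_dd(x, d)` here, which is exactly what distinguishes directional differentiability from differentiability.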

Directional differentiability also implies key properties useful in the analysis of nonsmooth problems. In particular, it enables the use of (Lebesgue) path integrals as follows.

###### Lemma 3.

Given any $x, y$, let $\gamma(t) = x + t(y - x)$, $t \in [0,1]$. If $f$ is directionally differentiable and Lipschitz, then

$$f(y) - f(x) = \int_{[0,1]} f'(\gamma(t); y - x)\, dt.$$
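The identity in Lemma 3 can be checked numerically on a small nonsmooth example. The function $f(x) = |x_0| - \max(x_1, 0)$, the endpoints, and the midpoint-rule discretization below are assumptions chosen for illustration only.

```python
import numpy as np

def f(x):
    return abs(x[0]) - max(x[1], 0.0)

def f_dd(x, d):
    """Directional derivative f'(x; d) of f(x) = |x_0| - max(x_1, 0)."""
    abs_part = np.sign(x[0]) * d[0] if x[0] != 0 else abs(d[0])
    relu_part = d[1] if x[1] > 0 else (0.0 if x[1] < 0 else max(d[1], 0.0))
    return abs_part - relu_part

def path_integral(x, y, n=10_000):
    """Midpoint-rule approximation of the integral of f'(gamma(t); y - x) on [0, 1]."""
    ts = (np.arange(n) + 0.5) / n
    return np.mean([f_dd(x + t * (y - x), y - x) for t in ts])

x = np.array([-1.0, -1.0])
y = np.array([2.0, 2.0])
lhs = f(y) - f(x)           # function value difference
rhs = path_integral(x, y)   # integral of directional derivatives along the segment
```

The segment crosses both kinks, yet the two sides agree up to discretization error because the kinks occupy a measure-zero set of the path.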

The following important lemma further connects directional derivatives with generalized gradients.

###### Lemma 4.

Assume that the directional derivative exists. For any $x, d$, there exists $g \in \partial f(x)$ s.t. $\langle g, d\rangle = f'(x;d)$.

### 2.3 Nonsmooth function class of interest

Throughout the paper, we focus on the set of Lipschitz, directionally differentiable and bounded (below) functions:

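$$\mathcal{F}(\Delta, L) := \{f \mid f \text{ is } L\text{-Lipschitz};\ f \text{ is directionally differentiable};\ f(x_0) - \inf_x f(x) \le \Delta\}, \tag{2}$$

where a function $f$ is $L$-Lipschitz if

$$|f(x) - f(y)| \le L\|x - y\|, \quad \forall\, x, y \in \mathbb{R}^n.$$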

As indicated previously, ReLU neural networks with bounded weight norms are included in this function class.

## 3 Stationary points and oracles

We now formally define our notion of stationarity and discuss the intractability of the standard notion. Afterwards, we formalize the optimization oracles and define measures of complexity for algorithms that use these oracles.

### 3.1 Stationary points

With the generalized gradient in hand, commonly a point $x$ is called stationary if $0 \in \partial f(x)$ (Clarke, 1990). A natural question is, what is the necessary complexity to obtain an $\epsilon$-stationary point, i.e., a point $x$ for which

$$\min\{\|g\| \mid g \in \partial f(x)\} \le \epsilon.$$

It turns out that attaining such a point is intractable. In particular, there is no finite time algorithm that can guarantee $\epsilon$-stationarity in the nonconvex nonsmooth setting. We make this claim precise in our first main result.

###### Theorem 5.

Given any algorithm $A$ that accesses function value and generalized gradient of $f$ in each iteration, for any $\epsilon \in [0,1)$ and for any finite iteration $T$, there exists $f \in \mathcal{F}(\Delta, L)$ such that the sequence $\{x_t\}_{t \in [1,T]}$ generated by $A$ on the objective $f$ does not contain any $\epsilon$-stationary point with probability more than $\frac{1}{2}$.

A key ingredient of the proof is that an algorithm $A$ is uniquely determined by $\{f_t, g_t\}_{t \in [1,T]}$, the function values and gradients at the query points. For any two functions $f_1$ and $f_2$ that have the same function values and gradients at the same set of queried points $\{x_1, \dots, x_t\}$, the distribution of the iterate $x_{t+1}$ generated by $A$ is identical for $f_1$ and $f_2$. However, due to the richness of the class of nonsmooth functions, we can find $f_1$ and $f_2$ such that the sets of $\epsilon$-stationary points of $f_1$ and $f_2$ are disjoint. Therefore, the algorithm cannot find a stationary point with probability more than $\frac{1}{2}$ for both $f_1$ and $f_2$ simultaneously. Intuitively, such functions exist because a nonsmooth function could vary arbitrarily, e.g., a nonsmooth nonconvex function could have constant gradient norms except at the (local) extrema, as happens for a piecewise linear zigzag function. Moreover, the set of extrema could be of measure zero. Therefore, unless the algorithm lands exactly in this measure-zero set, it cannot find any $\epsilon$-stationary point.

Theorem 5 suggests the need for rethinking the definition of stationary points. Intuitively, even though we are unable to find an $\epsilon$-stationary point, one could hope to find a point that is close to an $\epsilon$-stationary point. This motivates us to adopt the following more refined notion:

###### Definition 4.

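A point $x$ is called $(\delta,\epsilon)$-stationary if

$$d(0, \partial f(x + \delta B)) \le \epsilon,$$

where $\partial f(x + \delta B) := \mathrm{conv}\big(\bigcup_{y \in x + \delta B} \partial f(y)\big)$ is the Goldstein $\delta$-subdifferential, introduced in Goldstein (1977).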
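In one dimension the Goldstein $\delta$-subdifferential is simply the interval spanned by the gradients found inside the ball, so Definition 4 can be checked directly. The sketch below does this for $f(x) = |x|$; the helper `goldstein_dist_1d` and the grid-sampling approximation are illustrative assumptions.

```python
import numpy as np

def goldstein_dist_1d(grad, x, delta, n=2000):
    """Approximate d(0, Goldstein delta-subdifferential at x) for a 1-D function
    by sampling gradients on a grid over [x - delta, x + delta] (hypothetical helper)."""
    ys = np.linspace(x - delta, x + delta, n)
    gs = np.array([grad(y) for y in ys])
    lo, hi = gs.min(), gs.max()
    # The convex hull of the sampled gradients is the interval [lo, hi].
    return 0.0 if lo <= 0.0 <= hi else min(abs(lo), abs(hi))

abs_grad = np.sign  # derivative of |x| away from 0

near_kink = goldstein_dist_1d(abs_grad, 0.05, delta=0.1)  # ball contains the kink
far_away = goldstein_dist_1d(abs_grad, 0.5, delta=0.1)    # ball misses the kink
plain = goldstein_dist_1d(abs_grad, 0.05, delta=0.0)      # delta = 0: ordinary gradient norm
```

The point $x = 0.05$ is not $\epsilon$-stationary for any $\epsilon < 1$, but it is $(0.1, \epsilon)$-stationary for every $\epsilon \ge 0$ because the ball reaches the minimizer.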

In other words, a point $x$ is $(\delta,\epsilon)$-stationary if we can find a point $y$ at most distance $\delta$ away from $x$ such that $y$ is $\epsilon$-stationary. At first glance, this appears to be a weaker notion since if $x$ is $\epsilon$-stationary, then it is also a $(\delta,\epsilon)$-stationary point for any $\delta \ge 0$, but not vice versa. We show that the converse implication indeed holds, assuming smoothness.

###### Proposition 6.

The following statements hold:

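1. $\epsilon$-stationarity implies $(\delta,\epsilon)$-stationarity for any $\delta \ge 0$.

2. If $f$ is smooth with an $L$-Lipschitz gradient and if $x$ is $(\frac{\epsilon}{3L}, \frac{\epsilon}{3})$-stationary, then $x$ is also $\epsilon$-stationary, i.e.

$$d\big(0, \partial f(x + \tfrac{\epsilon}{3L} B)\big) \le \tfrac{\epsilon}{3} \implies \|\nabla f(x)\| \le \epsilon.$$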

Consequently, the two notions of stationarity are equivalent for differentiable functions. It is then natural to ask: does $(\delta,\epsilon)$-stationarity permit a finite time analysis?

The answer is positive, as we will show later, revealing an intrinsic difference between the two notions of stationarity. Besides providing algorithms, in Theorem 11 we also prove an $\Omega(\delta^{-1})$ lower bound on the dependency of $\delta$ for algorithms that can only access a generalized gradient oracle.

We also note that $(\delta,\epsilon)$-stationarity behaves well as $\delta \downarrow 0$.

###### Lemma 7.

The set $\partial f(x + \delta B)$ converges as $\delta \downarrow 0$ as

$$\lim_{\delta \downarrow 0} \partial f(x + \delta B) = \partial f(x).$$

Lemma 7 enables a straightforward routine for transforming non-asymptotic analyses for finding $(\delta,\epsilon)$-stationary points to asymptotic results for finding $\epsilon$-stationary points. Indeed, assume that a finite time algorithm for finding $(\delta,\epsilon)$-stationary points is provided. Then, by repeating the algorithm with decreasing $\delta_k$ (e.g., $\delta_k = 1/k$), any accumulation point of the repeated algorithm is an $\epsilon$-stationary point with high probability.

###### Assumption 1.

Given $x, d$, the oracle $\mathbb{O}(x, d)$ returns a function value $f_x$, and a generalized gradient $g_x$,

$$(f_x, g_x) = \mathbb{O}(x, d),$$

such that

(a) In the deterministic setting, the oracle returns

$$f_x = f(x), \quad g_x \in \partial f(x) \ \text{ satisfying } \ \langle g_x, d\rangle = f'(x, d).$$

(b) In the stochastic finite-variance setting, the oracle only returns a stochastic gradient $g$ with $\mathbb{E}[g] = g_x$, where $g_x \in \partial f(x)$ satisfies $\langle g_x, d\rangle = f'(x, d)$. Moreover, the variance $\mathbb{E}[\|g - g_x\|^2] \le \sigma^2$ is bounded. In particular, no function value is accessible.

We remark that one cannot generally evaluate the generalized gradient $\partial f$ in practice at any point where $f$ is not differentiable. When the function is not directionally differentiable, one needs to incorporate gradient sampling to estimate $\partial f$ (Burke et al., 2002). Our oracle queries only an element of the generalized gradient and is thus weaker than querying the entire set $\partial f$. Still, finding a vector $g_x$ such that $\langle g_x, d\rangle$ equals the directional derivative $f'(x, d)$ is non-trivial in general. Yet, when the objective function is a composition of directionally differentiable functions, such as ReLU neural networks, and if a closed form directional derivative is available for each function in the composition, then we can find the desired $g_x$ by appealing to the chain rule in Lemma 2. This property justifies our choice of oracles.

### 3.3 Algorithm class and complexity measures

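An algorithm $A$ maps a function $f \in \mathcal{F}(\Delta, L)$ to a sequence of points $\{x_k\}_{k \ge 0}$ in $\mathbb{R}^n$. We denote by $A^{(k)}$ the mapping from the previous $k$ iterations to $x_{k+1}$. Each $x_k$ can potentially be a random variable, due to the stochastic oracles or algorithm design. Let $\{\mathcal{F}_k\}_{k \ge 0}$ be the filtration generated by $\{x_k\}$ such that $x_k$ is adapted to $\mathcal{F}_k$. Based on the definition of the oracle, we assume that the iterates follow the structure

$$x_{k+1} = A^{(k)}(x_1, g_1, f_1, x_2, g_2, f_2, \dots, x_k, g_k, f_k), \tag{3}$$

where $(f_k, g_k) = \mathbb{O}(y_k, d_k)$, and the point $y_k$ and direction $d_k$ are (stochastic) functions of the iterates $x_1, \dots, x_k$.

For a random process $\{x_k\}_{k \in \mathbb{N}}$, we define the complexity of $\{x_k\}_{k \in \mathbb{N}}$ for a function $f$ as the value

$$T_{\delta,\epsilon}(\{x_t\}_{t \in \mathbb{N}}, f) := \inf\Big\{t \in \mathbb{N} \ \Big|\ \mathrm{Prob}\{d(0, \partial f(x_k + \delta B)) \ge \epsilon \ \text{ for all } k \le t\} \le \tfrac{1}{3}\Big\}. \tag{4}$$

Let $A[f, x_0]$ denote the sequence of points generated by algorithm $A$ for function $f$. Then, we define the iteration complexity of an algorithm class $\mathcal{A}$ on a function class $\mathcal{F}$ as

$$\mathcal{N}(\mathcal{A}, \mathcal{F}, \epsilon, \delta) := \inf_{A \in \mathcal{A}} \sup_{f \in \mathcal{F}} T_{\delta,\epsilon}(A[f, x_0], f). \tag{5}$$

At a high level, (5) is the minimum number of oracle calls required for a fixed algorithm to find a $(\delta,\epsilon)$-stationary point with probability at least $\frac{2}{3}$ for all functions in class $\mathcal{F}$.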

## 4 Deterministic Setting

For optimizing $L$-smooth functions, a crucial inequality is

$$f\big(x - \tfrac{1}{L}\nabla f(x)\big) - f(x) \le -\frac{1}{2L}\|\nabla f(x)\|^2. \tag{6}$$

In other words, either the gradient is small or the function value decreases sufficiently along the negative gradient. However, when the objective function is nonsmooth, this descent property is no longer satisfied. Thus, defining an appropriate descent direction is non-trivial. Our key innovation is to solve this problem via randomization.

More specifically, in our algorithm, Interpolated Normalized Gradient Descent (INGD), we derive a local search strategy to find the descent direction at an iterate $x_t$. The vector $m_{t,k}$ plays the role of descent direction and we sequentially update it until the condition

$$f(x_{t,k}) - f(x_t) < -\frac{\delta \|m_{t,k}\|}{4}, \tag{descent condition}$$

is satisfied. To connect with the descent property (6), observe that when $f$ is smooth, with $m_{t,k} = \nabla f(x_t)$ and $\delta = \|m_{t,k}\|/L$, (descent condition) is the same as (6) up to a factor $2$. This connection motivates our choice of descent condition.
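The comparison with (6) can be spelled out as follows, assuming the substitution $m_{t,k} = \nabla f(x_t)$ and $\delta = \|\nabla f(x_t)\|/L$, i.e., the choice under which the normalized step $x_{t,k} = x_t - \delta\, m_{t,k}/\|m_{t,k}\|$ coincides with the usual $1/L$ gradient step.

```latex
\begin{aligned}
x_{t,k} &= x_t - \delta \frac{m_{t,k}}{\|m_{t,k}\|}
         = x_t - \frac{1}{L}\nabla f(x_t), \\
\text{(descent condition)}:\quad
f\big(x_t - \tfrac{1}{L}\nabla f(x_t)\big) - f(x_t)
  &< -\frac{\delta \|m_{t,k}\|}{4}
   = -\frac{1}{4L}\|\nabla f(x_t)\|^2, \\
\text{(6)}:\quad
f\big(x_t - \tfrac{1}{L}\nabla f(x_t)\big) - f(x_t)
  &\le -\frac{1}{2L}\|\nabla f(x_t)\|^2 .
\end{aligned}
```

The right-hand sides differ by exactly a factor of $2$, so any step satisfying (6) at a point with nonzero gradient also satisfies the descent condition.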

When the descent condition is satisfied, the next iterate $x_{t+1}$ is obtained by taking a normalized step from $x_t$ along the direction $m_{t,k}$. Otherwise, we stay at $x_t$ and continue the search for a descent direction. We draw special attention to the fact that inside the $k$-loop, the iterates $x_{t,k}$ are always obtained by taking a normalized step from $x_t$. Thus, all the inner iterates $x_{t,k}$ have distance exactly $\delta$ from $x_t$.

To update the descent direction, we incorporate a randomized strategy. We randomly sample an interpolation point $y_{t,k+1}$ on the segment $[x_t, x_{t,k}]$ and evaluate the generalized gradient $g_{t,k+1}$ at this random point. Then, we update the descent direction $m_{t,k+1}$ as a convex combination of $g_{t,k+1}$ and the previous direction $m_{t,k}$. Due to lack of smoothness, the violation of the descent condition does not directly imply that $g_{t,k+1}$ is small. Instead, the projection of the generalized gradient is small along the direction $m_{t,k}$ on average. Hence, with a proper linear combination, the random interpolation allows us to guarantee the decrease of $\|m_{t,k}\|$ in expectation. This reasoning allows us to derive the non-asymptotic convergence rate in high probability.
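The local search described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's reference implementation: the function name `ingd`, the budget `max_outer`, the momentum weight `beta` (chosen here as the minimizer of a simple quadratic upper bound), and the test objective $f(x) = \|x\|_1$ with its oracle are all hypothetical.

```python
import numpy as np

def ingd(f, oracle, x0, delta, eps, L, K, max_outer=10_000, seed=0):
    """Sketch of Interpolated Normalized Gradient Descent.

    oracle(y, d) must return g in the generalized gradient at y with
    <g, d> = f'(y; d). Returns an iterate x_t, or None if the budget runs out.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    for _ in range(max_outer):
        m = oracle(x, np.zeros_like(x))                  # initial direction m_{t,1}
        for _ in range(K):
            norm_m = np.linalg.norm(m)
            if norm_m <= eps:                            # m is a convex combination of
                return x                                 # gradients within delta of x
            x_trial = x - delta * m / norm_m             # exactly delta away from x
            if f(x_trial) - f(x) < -delta * norm_m / 4:  # descent condition
                x = x_trial
                break
            y = x + rng.uniform() * (x_trial - x)        # random interpolation point
            g = oracle(y, -m)
            # Assumed weight: minimizes an upper bound on ||beta*m + (1-beta)*g||^2.
            beta = (4 * L**2 - norm_m**2) / (4 * L**2 + 2 * norm_m**2)
            m = beta * m + (1 - beta) * g
    return None

# Illustrative objective f(x) = ||x||_1 with a directional oracle.
f = lambda x: np.abs(x).sum()
l1_oracle = lambda y, d: np.where(y != 0, np.sign(y), np.sign(d))

x_out = ingd(f, l1_oracle, [1.0], delta=0.1, eps=0.1, L=1.0, K=50)
```

On this example the iterate takes normalized steps of length $\delta$ toward the origin; once the trial point overshoots the kink, the sampled gradient on the other side cancels the current direction and the search stops at a point within $\delta$ of the minimizer.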

###### Theorem 8.

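In the deterministic setting and with Assumption 1(a), the INGD algorithm with parameters $\beta_{t,k} = \frac{4L^2 - \|m_{t,k}\|^2}{4L^2 + 2\|m_{t,k}\|^2}$ and $K = \frac{48L^2}{\epsilon^2}$ finds a $(\delta,\epsilon)$-stationary point for function class $\mathcal{F}(\Delta, L)$ with probability $1 - \gamma$ using at most

$$\frac{192 \Delta L^2}{\epsilon^3 \delta} \log\Big(\frac{4\Delta}{\gamma \delta \epsilon}\Big) \ \text{ oracle calls.}$$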

Since we introduce random sampling for choosing the interpolation point, even in the deterministic setting we can only guarantee a high probability result. The detailed proof is deferred to Appendix C.

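A sketch of the proof is as follows. Since $\|x_{t,k} - x_t\| = \delta$ for any $k$, the interpolation point $y_{t,k}$ is inside the ball $x_t + \delta B$. Hence $m_{t,k} \in \partial f(x_t + \delta B)$ for any $k$. In other words, as soon as $\|m_{t,k}\| \le \epsilon$ (line 7), the reference point $x_t$ is $(\delta,\epsilon)$-stationary. If this is not true, i.e., $\|m_{t,k}\| > \epsilon$, then we check whether (descent condition) holds, in which case

$$f(x_{t,k}) - f(x_t) < -\frac{\delta \|m_{t,k}\|}{4} < -\frac{\epsilon \delta}{4}.$$

Knowing that the function value is lower bounded, this can happen at most $T = \frac{4\Delta}{\epsilon\delta}$ times. Thus, for at least one $x_t$, the local search inside the while loop is not broken by the descent condition. Finally, given that $\|m_{t,k}\| > \epsilon$ and the descent condition is not satisfied, we show that

$$\mathbb{E}[\|m_{t,k+1}\|^2] \le \Big(1 - \frac{\mathbb{E}[\|m_{t,k}\|^2]}{3L^2}\Big)\, \mathbb{E}[\|m_{t,k}\|^2].$$

This implies that $\mathbb{E}[\|m_{t,k}\|^2]$ follows a decrease of order $O(1/k)$. Hence with $K = O(1/\epsilon^2)$, we are guaranteed to find $\|m_{t,k}\| \le \epsilon$ with high probability.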
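The $O(1/k)$ decay can be checked numerically by iterating the recursion with equality, starting from the largest possible value $a_1 = L^2$ (since $\|m_{t,1}\| \le L$). The sketch below compares the iterates against the envelope $3L^2/(k+2)$, which follows by induction from the recursion; the function name and the choice $K = 48L^2/\epsilon^2$ (picked so that $3L^2/K = \epsilon^2/16$, the bound stated in Lemma 13) are assumptions for illustration.

```python
import math

def worst_case_norms(L, K):
    """Iterate a_{k+1} = (1 - a_k / (3 L^2)) * a_k from a_1 = L^2.

    Returns [a_1, ..., a_K], an upper envelope for E||m_{t,k}||^2 under the recursion.
    """
    a = [L**2]
    for _ in range(K - 1):
        a.append((1 - a[-1] / (3 * L**2)) * a[-1])
    return a

L, eps = 1.0, 0.5
K = math.ceil(48 * L**2 / eps**2)   # assumed inner-loop length
a = worst_case_norms(L, K)

# Envelope 3 L^2 / (k + 2) for k = 1, ..., K.
envelope = [3 * L**2 / (k + 2) for k in range(1, K + 1)]
within_envelope = all(ak <= ek + 1e-12 for ak, ek in zip(a, envelope))
final_ok = a[-1] <= eps**2 / 16
```

Because the map $u \mapsto u(1 - u/(3L^2))$ is increasing on $[0, L^2]$, starting from the largest admissible value gives the slowest possible decay.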

###### Remark 9.

If the problem is smooth, the descent condition is always satisfied in one iteration. Hence the global complexity of our algorithm reduces to $T = O\big(\frac{\Delta}{\epsilon\delta}\big)$. Due to the equivalence of the notions of stationarity (Prop. 6), with $\delta = O(\epsilon/L)$, our algorithm recovers the standard $O(\epsilon^{-2})$ convergence rate for finding an $\epsilon$-stationary point. In other words, our algorithm can adapt to the smoothness condition.

## 5 Stochastic Setting

In the deterministic setting, one of the key ingredients used in INGD is to check whether the function value decreases sufficiently. However, evaluating the function value can be computationally expensive, or even infeasible in the stochastic setting. For example, when training neural networks, evaluating the entire loss function requires going through all the data, which is impractical. As a result, we do not assume access to function value in the stochastic setting and instead propose a variant of INGD that only relies on gradient information.

One of the challenges of using stochastic gradients is the noisiness of the gradient evaluation. To control the variance of the associated updates, we introduce a parameter $q$ into the normalized step size:

$$\eta_t = \frac{1}{p\|m_t\| + q}.$$

A similar strategy is used in adaptive methods like Duchi et al. (2011); Kingma and Ba (2015) to prevent instability. Here, we show that the constant $q$ allows us to control the variance of $x_{t+1} - x_t$. In particular, it implies the bound

$$\mathbb{E}[\|x_{t+1} - x_t\|^2] \le \frac{G^2}{q^2},$$

where $G^2 := L^2 + \sigma^2$, with $\sigma^2$ the variance bound of the stochastic oracle, is a trivial upper bound on the expected squared norm of any sampled gradient $g$.

Another substantial change (relative to INGD) is the removal of the explicit local search, since the stopping criterion can now no longer be tested without access to the function value. Instead, one may view $x_{t-K+1}, \dots, x_t$ as an implicit local search with respect to the reference point $x_{t-K}$. In particular, we show that when the direction $m_t$ has a small norm, then $x_{t-K}$ is a $(\delta,\epsilon)$-stationary point, but not $x_t$. This discrepancy explains why we output $x_{t-K}$ instead of $x_t$.
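A minimal sketch of the stochastic variant, following the description above: a step of size $\eta_t = 1/(p\|m_t\| + q)$, a gradient sampled at a random interpolation point, a momentum update, and an output shifted back by $K$ iterations. The function name, the noisy $\ell_1$ oracle, and all parameter values are assumptions made for illustration and are not the tuned constants of the theorem below.

```python
import numpy as np

def stochastic_ingd(stoch_oracle, x0, p, q, beta, T, K, seed=0):
    """Sketch of Stochastic INGD; returns (output point, list of iterates x_1..x_{T+1})."""
    rng = np.random.default_rng(seed)
    xs = [np.asarray(x0, dtype=float)]
    m = stoch_oracle(xs[0], np.zeros_like(xs[0]), rng)
    for _ in range(T):
        x = xs[-1]
        eta = 1.0 / (p * np.linalg.norm(m) + q)   # step length is always below 1/p
        x_next = x - eta * m
        y = x + rng.uniform() * (x_next - x)      # random point on [x_t, x_{t+1}]
        g = stoch_oracle(y, -m, rng)
        m = beta * m + (1 - beta) * g             # momentum as a convex combination
        xs.append(x_next)
    i = rng.integers(1, T + 1)                    # uniform index in {1, ..., T}
    return xs[max(i - K, 1) - 1], xs              # output x_{max(i-K, 1)}

def noisy_l1_oracle(y, d, rng, sigma=0.1):
    """Hypothetical oracle for f(x) = ||x||_1 with Gaussian noise of scale sigma."""
    g = np.where(y != 0, np.sign(y), np.sign(d))
    return g + sigma * rng.standard_normal(y.shape)

x_out, xs = stochastic_ingd(noisy_l1_oracle, [1.0, -2.0],
                            p=20.0, q=1.0, beta=0.9, T=500, K=10)
steps = [np.linalg.norm(b - a) for a, b in zip(xs[:-1], xs[1:])]
```

Every entry of `steps` is below $1/p$, so any $K \le \delta p$ consecutive iterates stay within a $\delta$-ball of the first one, which is what makes the shifted output meaningful.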

In the deterministic setting, the direction $m_{t,k}$ inside each local search is guaranteed to belong to $\partial f(x_t + \delta B)$. Hence, controlling the norm of $m_{t,k}$ implies the $(\delta,\epsilon)$-stationarity of $x_t$. In the stochastic case, however, we have two complications. First, only the expectation of the gradient evaluation satisfies the membership $\mathbb{E}[g(y_k)] \in \partial f(y_k)$. Second, the direction $m_t$ is a convex combination of all the previous gradients $g(y_1), \dots, g(y_t)$, with all coefficients being nonzero. In contrast, we use a re-initialization in the deterministic setting. We overcome these difficulties and their ensuing subtleties to finally obtain the following complexity result:

###### Theorem 10.

In the stochastic setting, with Assumption 1(b), the Stochastic-Ingd algorithm (Algorithm 2) with parameters , , , , , ensures

$$\frac{1}{T}\sum_{t=1}^{T} \mathbb{E}[\|m_t\|] \le \frac{\epsilon}{4}.$$

In other words, the number of gradient calls to achieve a $(\delta,\epsilon)$-stationary point is upper bounded by $\tilde{O}(\epsilon^{-4}\delta^{-1})$.

For readability, the constants in Theorem 10 have not been optimized. The high level idea of the proof is to relate $\mathbb{E}[\|m_t\|]$ to the function value decrease $f(x_t) - f(x_{t+1})$, and then to perform a telescopic sum.

We would like to emphasize the use of the adaptive step size $\eta_t$ and the momentum term $m_t$. These techniques arise naturally from our goal to find a $(\delta,\epsilon)$-stationary point. The step size $\eta_t$ helps us ensure that the distance moved is at most $\frac{1}{p}$, and hence we are certain that adjacent iterates are close to each other. The momentum term $m_t$ serves as a convex combination of generalized gradients, as postulated by Definition 4.

Further, even though the parameter $K$ does not directly influence the updates of our algorithm, it plays an important role in understanding our algorithm. Indeed, we show that

$$d\big(\mathbb{E}[m_t \mid x_{t-K}],\ \partial f(x_{t-K} + \delta B)\big) \le \frac{\epsilon}{16}.$$

In other words, the conditional expectation $\mathbb{E}[m_t \mid x_{t-K}]$ is approximately in the $\delta$-subdifferential $\partial f(x_{t-K} + \delta B)$ at $x_{t-K}$. This relationship is non-trivial.

On one hand, by imposing $K \le \delta p$, we ensure that $x_{t-K+1}, \dots, x_t$ are inside the $\delta$-ball of center $x_{t-K}$. On the other hand, we guarantee that the contribution of $m_{t-K}$ to $m_t$ is small, providing an appropriate upper bound on the coefficient $\beta^K$. These two requirements help balance the different parameters in our final choice. Details of the proof may be found in Appendix D.

Recall that we do not access the function value in this stochastic setting, which is a strength of the algorithm. In fact, we can show that our $\delta^{-1}$ dependence is tight, when the oracle has only access to generalized gradients.

###### Theorem 11 (Lower bound on δ dependence).

Let denote the class of algorithms defined in Section 3.2 and denote the class of functions defined in Equation (2). Assume and . Then the iteration complexity is lower bounded by if the algorithm only has access to generalized gradients.

The proof is inspired by Theorem 1.1.2 in Nesterov (2018). We show that unless more than different points are queried, we can construct two different functions in the function class that have gradient norm at all the queried points, and the stationary points of both functions are away. For more details, see Appendix E.

This theorem also implies the negative result for finite time analyses that we showed in Theorem 5. Indeed, when an algorithm finds an $\epsilon$-stationary point, the point is also $(\delta,\epsilon)$-stationary for any $\delta > 0$. Thus, the iteration complexity must exceed the lower bound of Theorem 11 for every $\delta > 0$, and this bound diverges as $\delta \downarrow 0$, i.e., no finite time algorithm can guarantee to find an $\epsilon$-stationary point.

Before moving on to the experimental section, we would like to make several comments related to different settings. First, since the stochastic setting is strictly stronger than the deterministic setting, the stochastic variant Stochastic-INGD is applicable to the deterministic setting too. Moreover, the analysis can be extended to $q = 0$, which leads to a complexity of $\tilde{O}(\epsilon^{-3}\delta^{-1})$. This is the same as the deterministic algorithm. However, the stochastic variant does not adapt to the smoothness condition. In other words, even if the function is differentiable, we will not obtain a faster convergence rate. In particular, if the function is smooth, by using the equivalence of the types of stationary points, Stochastic-INGD finds an $\epsilon$-stationary point in $\tilde{O}(\epsilon^{-5})$ while standard SGD enjoys an $O(\epsilon^{-4})$ convergence rate. We do not know whether a better convergence result is achievable, as our lower bound does not provide an explicit dependency on $\epsilon$; we leave this as a future research direction.

## 6 Experiments

In this section, we evaluate the performance of our proposed algorithm Stochastic-INGD on image classification tasks.

We train the ResNet20 (He et al., 2016) model on the CIFAR10 (Krizhevsky and Hinton, 2009) classification dataset. The dataset contains 50k training images and 10k test images in 10 classes.

We implement Stochastic-INGD in PyTorch with the built-in automatic differentiation (Paszke et al., 2017). We remark that except at the kink points, automatic differentiation matches the generalized gradient oracle, which justifies our choice. We benchmark the experiments against two popular machine learning optimizers, SGD with momentum and ADAM (Kingma and Ba, 2015). We train the model for 100 epochs with the standard hyper-parameters from the GitHub repository:

• For SGD with momentum, we initialize the learning rate as , momentum as and reduce the learning rate by 10 at epoch 50 and 75. The weight decay parameter is set to .

• For ADAM, we use constant the learning rate , betas in , and weight decay parameter and for the best performance.

• For Stochastic-Ingd, we use , , , and weight decay parameter .

The training and test accuracy for all three algorithms are plotted in Figure 1. We observe that Stochastic-INGD matches the SGD baseline and outperforms the ADAM algorithm in terms of test accuracy. The above results suggest that the experimental implications of our algorithm could be interesting, but we leave a more systematic study as a future direction.

## 7 Conclusions and Future Directions

In this paper, we investigate the complexity of finding first order stationary points of nonconvex nondifferentiable functions. We focus in particular on Hadamard semi-differentiable functions, which we suspect is perhaps the most general class of functions for which the chain rule of calculus holds; see the monograph (Delfour, 2019). We further extend the standard definition of $\epsilon$-stationary points for smooth functions into a new notion of $(\delta,\epsilon)$-stationary points. We justify our definition by showing that no algorithm can find a $(0,\epsilon)$-stationary point for any $\epsilon \in [0,1)$ in a finite number of iterations and conclude that a positive $\delta$ is necessary for a finite time analysis. Using the above definition and a more refined gradient oracle, we prove that the proposed algorithms find stationary points within $\tilde{O}(\epsilon^{-3}\delta^{-1})$ iterations in the deterministic setting and with $\tilde{O}(\epsilon^{-4}\delta^{-1})$ iterations in the stochastic setting.

Our results provide the first non-asymptotic analysis of nonconvex optimization algorithms in the general Lipschitz continuous setting. Yet, they also open further questions. The first question is whether the current dependence on $\epsilon$ in our complexity bound is optimal. A future research direction is to try to find provably faster algorithms or construct adversarial examples that close the gap between upper and lower bounds on $\epsilon$. Second, the rate we obtain in the deterministic case requires function evaluations and is randomized, leading to high probability bounds. Can similar rates be obtained by an algorithm oblivious to the function value? Another possible direction would be to obtain a deterministic convergence result. More specialized questions include whether one can remove the logarithmic factors from our bounds. Aside from the above questions on the rate, we can take a step back and ask high-level questions. Are there better alternatives to the current definition of $(\delta,\epsilon)$-stationary points? One should also investigate whether everywhere directional differentiability is necessary.

In addition to the open problems listed above, our work uncovers another very interesting observation. In the standard stochastic, nonconvex, and smooth setting, stochastic gradient descent is known to be theoretically optimal (Arjevani et al., 2019), while widely used practical techniques such as momentum-based and adaptive step size methods usually lead to worse theoretical convergence rates. In our proposed setting, momentum and adaptivity naturally show up in algorithm design, and become necessary for the convergence analysis. Hence we believe that studying optimization under more relaxed assumptions may lead to theorems that can better bridge the widening theory-practice divide in optimization for training deep neural networks, and ultimately lead to better insights for practitioners.

## 8 Acknowledgement

SS acknowledges support from an NSF-CAREER Award (Number 1846088) and an Amazon Research Award. AJ acknowledges support from an MIT-IBM-Exploratory project on adaptive, robust, and collaborative optimization.

## Appendix A Proof of Lemmas in Preliminaries

### A.1 Proof of Lemma 3

###### Proof.

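Let $g(t) = f(x + t(y - x))$ for $t \in [0,1]$, then $g$ is $L\|y - x\|$-Lipschitz, implying that $g$ is absolutely continuous. Thus from the fundamental theorem of calculus (Lebesgue), $g$ has a derivative $g'$ almost everywhere, and the derivative is Lebesgue integrable such that

$$g(t) = g(0) + \int_0^t g'(s)\, ds.$$

Moreover, if $g$ is differentiable at $t$, then

$$g'(t) = \lim_{\delta t \to 0} \frac{g(t + \delta t) - g(t)}{\delta t} = \lim_{\delta t \to 0} \frac{f(x + (t + \delta t)(y - x)) - f(x + t(y - x))}{\delta t} = f'(x + t(y - x), y - x).$$

Since this equality holds almost everywhere, we have

$$f(y) - f(x) = g(1) - g(0) = \int_0^1 g'(t)\, dt = \int_0^1 f'(x + t(y - x), y - x)\, dt.$$

∎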

### A.2 Proof of Lemma 4

###### Proof.

For any as given in Definition 3, let . Denote . By Proposition 1.6, we know that there exists such that

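$$f(x_k) - f(x) = \langle g_{k,j}, x_k - x\rangle.$$

By the existence of directional derivative, we know that

$$\lim_{k \to \infty} \langle g_{k,j}, d\rangle = \lim_{k \to \infty} \frac{\langle g_{k,j}, t_k d\rangle}{t_k} = f'(x, d).$$

Moreover, $g_{k,j}$ is in a bounded set with norm less than $L$. The lemma follows by the fact that any accumulation point of $g_{k,j}$ is in $\partial f(x)$ due to upper-semicontinuity of $\partial f$. ∎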

## Appendix B Proof of Lemmas in Algorithm Complexity

### B.1 Proof of Theorem 5

Our proof strategy is similar to Theorem 1.1.2 in Nesterov (2018), where we use the resisting strategy to prove a lower bound. Given a one dimensional function $f$, let $x_k$, $k \in [1,K]$ be the sequence of points queried in ascending order instead of query order. We assume without loss of generality that the initial point is queried and is an element of $\{x_k\}$ (otherwise, query the initial point first before proceeding with the algorithm).

Then we define the resisting strategy: always return

$$f(x) = 0, \quad \text{and} \quad \nabla f(x) = L.$$

If we can prove that for any set of points $x_k$, $k \in [1,K]$, there exist two functions such that they satisfy the resisting strategy $f(x_k) = 0$, $\nabla f(x_k) = L$, and that the two functions do not share any common stationary points, then we know no randomized/deterministic algorithm can return an $\epsilon$-stationary point with probability more than $\frac{1}{2}$ for both functions simultaneously. In other words, no algorithm that queries $K$ points can distinguish these two functions. Hence we prove the theorem following the definition of complexity in (5) with $\delta = 0$.

All we need to do is to show that such two functions exist in the Lemma below.

###### Lemma 12.

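Given a finite sequence of real numbers $\{x_k\}_{k \in [1,K]} \subset \mathbb{R}$, there is a family of functions $f_\theta \in \mathcal{F}(\Delta, L)$ such that for any $k \in [1,K]$,

$$f_\theta(x_k) = 0 \quad \text{and} \quad \nabla f_\theta(x_k) = L$$

and for $\epsilon$ sufficiently small, the sets of $\epsilon$-stationary points of $f_\theta$ are all disjoint, i.e. $\{\epsilon\text{-stationary points of } f_{\theta_1}\} \cap \{\epsilon\text{-stationary points of } f_{\theta_2}\} = \emptyset$ for any $\theta_1 \ne \theta_2$.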

###### Proof.

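Up to a permutation of the indices, we could reorder the sequence in increasing order. WLOG, we assume $x_k$ is increasing. Let $\delta = \min\{\min_{x_i \ne x_j} |x_i - x_j|,\ \Delta/L\}$. For any $\theta \in (0, 1/2)$, we define $f_\theta$ by

$$f_\theta(x) = \begin{cases} -L(x - x_1 + 2\theta\delta) & \text{for } x \in (-\infty,\ x_1 - \theta\delta], \\ L(x - x_k) & \text{for } x \in [x_k - \theta\delta,\ \tfrac{x_k + x_{k+1}}{2} - \theta\delta], \\ -L(x - x_{k+1} + 2\theta\delta) & \text{for } x \in [\tfrac{x_k + x_{k+1}}{2} - \theta\delta,\ x_{k+1} - \theta\delta], \\ L(x - x_K) & \text{for } x \in [x_K - \theta\delta,\ +\infty). \end{cases}$$

It is clear that $f_\theta$ is directionally differentiable at all points and $\nabla f_\theta(x_k) = L$. Moreover, the minimum is $\min_x f_\theta(x) = -L\theta\delta$. This implies that $f_\theta \in \mathcal{F}(\Delta, L)$. Note that $\nabla f_\theta = L$ or $-L$ except at the local extrema. Therefore, for any $\epsilon < L$ the set of $\epsilon$-stationary points of $f_\theta$ is exactly

$$\{\epsilon\text{-stationary points of } f_\theta\} = \{x_k - \theta\delta \mid k \in [1,K]\} \cup \Big\{\tfrac{x_k + x_{k+1}}{2} - \theta\delta \ \Big|\ k \in [1,K-1]\Big\},$$

which is clearly distinct for different choices of $\theta$. ∎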
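The construction can be reproduced numerically. The sketch below builds the zigzag $f_\theta$ for a given set of query points and two values of $\theta$, then checks that both functions report value $0$ and slope $L$ at every query point while their kinks (the only candidate stationary points) are disjoint. The helper `make_f_theta` is hypothetical, and for simplicity it takes $\delta$ to be the smallest gap between query points, which is an assumption of this example.

```python
import numpy as np

def make_f_theta(xs, theta, L=1.0):
    """Build the piecewise linear zigzag f_theta and return (f, sorted kink locations)."""
    xs = np.sort(np.asarray(xs, dtype=float))
    s = theta * np.min(np.diff(xs))   # shift theta * delta, assuming delta = smallest gap

    def f(x):
        if x <= xs[0] - s:
            return -L * (x - xs[0] + 2 * s)
        for k in range(len(xs) - 1):
            if x <= 0.5 * (xs[k] + xs[k + 1]) - s:   # increasing piece through x_k
                return L * (x - xs[k])
            if x <= xs[k + 1] - s:                   # decreasing piece before x_{k+1}
                return -L * (x - xs[k + 1] + 2 * s)
        return L * (x - xs[-1])                      # final increasing piece

    kinks = np.concatenate([xs - s, 0.5 * (xs[:-1] + xs[1:]) - s])
    return f, np.sort(kinks)

queries = [0.0, 1.0, 3.0]
f_a, kinks_a = make_f_theta(queries, theta=0.25)
f_b, kinks_b = make_f_theta(queries, theta=0.40)

h = 1e-3
values = [(f_a(x), f_b(x)) for x in queries]
slopes = [((f_a(x + h) - f_a(x - h)) / (2 * h),
           (f_b(x + h) - f_b(x - h)) / (2 * h)) for x in queries]
shared_kinks = np.intersect1d(kinks_a, kinks_b)
```

An algorithm that only sees the oracle answers at `queries` cannot tell `f_a` from `f_b`, yet no point is a kink of both.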

### B.2 Proof of Proposition 6

###### Proof.

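When $x$ is $(\frac{\epsilon}{3L}, \frac{\epsilon}{3})$-stationary, we have $d(0, \partial f(x + \frac{\epsilon}{3L} B)) \le \frac{\epsilon}{3}$. By definition, we could find $g \in \mathrm{conv}\big(\bigcup_{y \in x + \frac{\epsilon}{3L} B} \nabla f(y)\big)$ such that $\|g\| \le \frac{2\epsilon}{3}$. This means, there exist $x_1, \dots, x_k \in x + \frac{\epsilon}{3L} B$, and $\alpha_1, \dots, \alpha_k \in [0,1]$ such that $\alpha_1 + \dots + \alpha_k = 1$ and

$$g = \sum_{i=1}^{k} \alpha_i \nabla f(x_i).$$

Therefore

$$\|\nabla f(x)\| \le \|g\| + \|\nabla f(x) - g\| \le \frac{2\epsilon}{3} + \sum_{i=1}^{k} \alpha_i \|\nabla f(x) - \nabla f(x_i)\| \le \frac{2\epsilon}{3} + \sum_{i=1}^{k} \alpha_i L \|x - x_i\| \le \frac{2\epsilon}{3} + \sum_{i=1}^{k} \alpha_i L \frac{\epsilon}{3L} = \epsilon.$$

Therefore, $x$ is an $\epsilon$-stationary point in the standard sense. ∎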

### B.3 Proof of Lemma 7

###### Proof.

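First, we show that the limit exists. By Lipschitzness and Jensen's inequality, we know that $\partial f(x + \delta B)$ lies in a bounded ball with radius $L$. For any sequence $\{\delta_k\}$ with $\delta_k \downarrow 0$, we know that $\partial f(x + \delta_{k+1} B) \subseteq \partial f(x + \delta_k B)$. Therefore, the limit exists by the monotone convergence theorem.

Next, we show that $\lim_{\delta \downarrow 0} \partial f(x + \delta B) = \partial f(x)$. For one direction, we show that $\partial f(x) \subseteq \lim_{\delta \downarrow 0} \partial f(x + \delta B)$. This follows by Proposition 1.5 and the fact that

$$\bigcup_{y \in x + \delta B} \partial f(y) \subseteq \mathrm{conv}\Big(\bigcup_{y \in x + \delta B} \partial f(y)\Big) = \partial f(x + \delta B).$$

Next, we show the other direction $\lim_{\delta \downarrow 0} \partial f(x + \delta B) \subseteq \partial f(x)$. By upper semicontinuity, we know that for any $\epsilon > 0$, there exists $\delta > 0$ such that

$$\bigcup_{y \in x + \delta B} \partial f(y) \subseteq \partial f(x) + \epsilon B.$$

Then by convexity of $\partial f(x)$ and $\epsilon B$, we know that their Minkowski sum $\partial f(x) + \epsilon B$ is convex. Therefore, we conclude that for any $\epsilon > 0$, there exists $\delta > 0$ such that

$$\partial f(x + \delta B) = \mathrm{conv}\Big(\bigcup_{y \in x + \delta B} \partial f(y)\Big) \subseteq \partial f(x) + \epsilon B.$$

∎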

## Appendix C Proof of Theorem 8

Before we prove the theorem, we first analyze how many times the algorithm iterates in the while loop.

###### Lemma 13.

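Let $K = \frac{48L^2}{\epsilon^2}$. Given $t \in \mathbb{N}$,

$$\mathbb{E}[\|m_{t,K}\|^2] \le \frac{\epsilon^2}{16},$$

where for convenience of analysis, we define $m_{t,k} = 0$ for all $k > k_0$ if the $k$-loop breaks at $(t, k_0)$. Consequently, for any $\gamma$, with probability $1 - \gamma$, there are at most $\log(1/\gamma)$ restarts of the while loop at the $t$-th iteration.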

###### Proof.

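Let $\mathcal{F}_{t,k} = \sigma(y_{t,1}, \dots, y_{t,k+1})$, then $x_{t,k}, m_{t,k} \in \mathcal{F}_{t,k}$. We denote $D_{t,k}$ as the event that the $k$-loop does not break at $x_{t,k}$, i.e. $\|m_{t,k}\| > \epsilon$ and $f(x_{t,k}) - f(x_t) \ge -\frac{\delta \|m_{t,k}\|}{4}$. It is clear that $D_{t,k} \in \mathcal{F}_{t,k}$.

Let $\gamma(\lambda) = (1 - \lambda) x_t + \lambda x_{t,k}$, $\lambda \in [0,1]$. Note that $\gamma'(\lambda) = x_{t,k} - x_t = -\delta \frac{m_{t,k}}{\|m_{t,k}\|}$. Since $y_{t,k+1}$ is uniformly sampled from the line segment $[x_t, x_{t,k}]$, we know

$$\mathbb{E}[\langle g_{t,k+1}, x_{t,k} - x_t\rangle \mid \mathcal{F}_{t,k}] = \int_0^1 f'(\gamma(\lambda), x_{t,k} - x_t)\, d\lambda = f(x_{t,k}) - f(x_t),$$

where the second equality comes from directional differentiability. Since $x_{t,k} - x_t = -\delta \frac{m_{t,k}}{\|m_{t,k}\|}$, we know that

$$\mathbb{E}[\langle g_{t,k+1}, m_{t,k}\rangle \mid \mathcal{F}_{t,k}] = -\frac{\|m_{t,k}\|}{\delta}\big(f(x_{t,k}) - f(x_t)\big). \tag{7}$$

By construction $m_{t,k+1} = \beta m_{t,k} + (1 - \beta) g_{t,k+1}$ under $D_{t,k}$, and $m_{t,k+1} = 0$ otherwise. Therefore,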

 E[∥mt,k+1∥2|Ft,k] = E[∥βmt,k+(1−β)gt,k+1∥