# Convergence Analysis of Proximal Gradient with Momentum for Nonconvex Optimization

## Abstract

In many modern machine learning applications, the structure of the underlying mathematical model yields a nonconvex optimization problem. Because nonconvex problems are intractable in general, there is a rising need for efficient methods that solve them with performance guarantees. In this work, we investigate the accelerated proximal gradient method for nonconvex programming (APGnc) of Yao & Kwok (2016). The method compares a usual proximal gradient step with a linear extrapolation step, and accepts the one that has the lower function value to achieve a monotonic decrease. Specifically, under a general nonsmooth and nonconvex setting, we provide a rigorous argument to show that the limit points of the sequence generated by APGnc are critical points of the objective function. Then, by exploiting the Kurdyka-Łojasiewicz (KŁ) property for a broad class of functions, we establish the linear and sub-linear convergence rates of the function value sequence generated by APGnc. We further propose a stochastic variance reduced APGnc (SVRG-APGnc), and establish its linear convergence under a special case of the KŁ property. We also extend the analysis to the inexact version of these methods and develop an adaptive momentum strategy that improves the numerical performance.

## 1 Introduction

Many problems in machine learning, data mining, and signal processing can be formulated as the following composite minimization problem

$$\min_{x \in \mathbb{R}^d} F(x) = f(x) + g(x). \tag{P}$$

Typically, $f$ captures the loss of data fitting and can be written as a finite sum of component functions $f_i$, with each $f_i$ corresponding to the loss of one sample. The second term $g$ is the regularizer that promotes desired structures on the solution based on prior knowledge of the problem.

In practice, many problems of (P) are formulated, either naturally or intentionally, as a convex model to guarantee the tractability of algorithms. In particular, such convex problems can be efficiently minimized by many first-order algorithms, among which the accelerated proximal gradient (APG) method (also referred to as FISTA; Beck & Teboulle, 2009b) is proven to attain the best possible convergence rate for this class of convex functions. We present one of its basic forms in Algorithm 1.

Compared to the usual proximal gradient step, the APG algorithm takes an extra linear extrapolation step for acceleration. It has been shown in Beck & Teboulle (2009b) that the APG method reduces the function value gap at a rate of $O(1/k^2)$, where $k$ denotes the number of iterations. This convergence rate meets the theoretical lower bound of first-order gradient methods for minimizing smooth convex functions. The reader can refer to Tseng (2010) for other variants of APG.

Although convex problems are tractable and can be globally minimized, many applications naturally require solving nonconvex optimization problems of the form (P). Recently, several variants of the APG method have been proposed for nonconvex problems, and two major ones are presented in Algorithm 2 and Algorithm 3, respectively. The major difference from the original APG is that the modified methods only accept the new iterate when the corresponding function value is sufficiently decreased, which leads to a more stable convergence behavior. In particular, Li & Lin (2015) analyzed mAPG (Algorithm 2) by exploiting the Kurdyka-Łojasiewicz (KŁ) property, which is a local geometrical structure held by a large class of nonconvex objective functions, and has been successfully exploited to characterize the asymptotic convergence behavior of many first-order methods. It was shown in Li & Lin (2015) that mAPG achieves the $O(1/k^2)$ convergence rate for convex problems of (P), and converges to a critical point at sublinear and linear rates under different cases of the KŁ property for nonconvex problems.

Despite the desirable convergence rate, mAPG requires two proximal steps, which doubles the computational complexity of the original APG. In comparison, APGnc (Algorithm 3) requires only one proximal step, and hence computes faster than mAPG in each iteration. However, the analysis of APGnc in Yao & Kwok (2016) does not exploit the KŁ property, and no convergence rate of the function value is established. Hence, there is still no formal theoretical comparison of the overall performance (which depends on both computation per iteration and convergence rate) between mAPG and APGnc. It is unclear whether the computational saving per iteration in APGnc comes at the cost of a lower convergence rate.

The goal of this paper is to provide a comprehensive analysis of the APGnc algorithm under the KŁ framework, thus establishing a rigorous comparison between mAPG and APGnc and formally justifying the overall advantage of APGnc.

### 1.1 Main Contributions

This paper provides the convergence analysis of APGnc-type algorithms for nonconvex problems of (P) under the KŁ framework, including the inexact setting. We also study the stochastic variance reduced APGnc algorithm and its inexact variant. Our analysis requires novel technical treatments to exploit the KŁ property, because momentum terms, inexact errors, and stochastic variance reduced gradients appear jointly in the algorithms. Our contributions are summarized as follows.

For APGnc applied to nonconvex problems of (P), we show that the limit points of the sequences generated by APGnc are critical points of the objective function. Then, by exploiting different cases of the Kurdyka-Łojasiewicz property of the objective function, we establish the linear and sub-linear convergence rates of the function value sequence generated by APGnc. Our results formally show that APGnc (with one proximal map per iteration) achieves the same convergence properties as well as the convergence rates as mAPG (with two proximal maps per iteration) for nonconvex problems, thus establishing its overall computational advantage.

We further propose the APGnc$^+$ algorithm, an improved version of APGnc that adapts the momentum stepsize (see Algorithm 4). It shares the same theoretical convergence rate as APGnc but performs better numerically.

Furthermore, we study the inexact APGnc in which the computation of the gradient and the proximal mapping may have errors. We show that the algorithm still achieves the convergence rate at the same order as the exact case as long as the inexactness is properly controlled. We also explicitly characterize the impact of errors on the constant factors that affect the convergence rate.

To facilitate the solution to large-scale optimization problems, we study the stochastic variance reduced APGnc (SVRG-APGnc), and show that such an algorithm achieves linear convergence rate under a certain case of the KŁ property. We further analyze the inexact SVRG-APGnc and show that it also achieves the linear convergence under the same KŁ property as long as the error in the proximal mapping is bounded properly. This is the first analysis of the SVRG proximal algorithm with momentum that exploits the KŁ structure to establish linear convergence rate for nonconvex programming.

Our numerical results further corroborate the theoretical analysis. We demonstrate that APGnc and APGnc$^+$ outperform APG and mAPG for nonconvex problems in both exact and inexact cases, and in both deterministic and stochastic variants of the algorithms. Furthermore, APGnc$^+$ outperforms APGnc due to its properly chosen momentum stepsize.

### 1.2 Comparison to Related Work

APG algorithms: The original accelerated gradient method for minimizing a single smooth convex function dates back to Nesterov (1983), and was further extended as APG in the composite minimization framework in Beck & Teboulle (2009b); Tseng (2010). While these APG variants generate a sequence of function values that may oscillate, Beck & Teboulle (2009a) proposed another variant of APG that generates a non-increasing sequence of function values. Then, Li & Lin (2015) further proposed mAPG, which generates a sufficiently decreasing sequence of function values, and established the asymptotic convergence rates under the KŁ property. Recently, Yao & Kwok (2016) proposed APGnc, which is a more efficient version of APG for nonconvex problems, but the analysis only characterizes fixed points and does not exploit the KŁ property to characterize the convergence rate. Our study establishes the convergence rate analysis of APGnc under the KŁ property.

Nonconvex optimization under KŁ: The KŁ property Bolte et al. (2007) is an extension of the Łojasiewicz gradient inequality Łojasiewicz (1965) to the nonsmooth case. Many first-order descent methods, under the KŁ property, can be shown to converge to a critical point Attouch & Bolte (2009); Bolte et al. (2014); Attouch et al. (2010) with different types of asymptotic convergence rates. Li & Lin (2015) and our paper focus on first-order algorithms with momentum, and analyze mAPG and APGnc, respectively, by exploiting the KŁ property.

Inexact algorithms under KŁ: Attouch et al. (2013); Frankel et al. (2015) studied the inexact proximal algorithm under the KŁ property. This paper studies the inexact proximal algorithm with momentum (i.e., APGnc) under the KŁ property. While Yao & Kwok (2016) also studied the inexact APGnc, the analysis did not exploit the KŁ property to characterize the convergence rate.

Nonconvex SVRG: SVRG was first proposed in Johnson & Zhang (2013), to accelerate the stochastic gradient method for strongly convex objective functions, and was studied for the convex case in Zhu & Yuan (2016). Recently, SVRG was further studied for smooth nonconvex optimization in Reddi et al. (2016a). Then in Reddi et al. (2016b), the proximal SVRG was proposed and studied for nonsmooth and nonconvex optimization. Our paper further incorporates SVRG for the proximal gradient with momentum in the nonconvex case. Furthermore, we exploit a certain KŁ property in our analysis that is very different from the PL property exploited in Reddi et al. (2016a), and requires special technical treatment in convergence analysis.

## 2 Preliminaries and Assumptions

In this section, we first introduce some technical definitions that are useful later on, and then describe the assumptions on the problem (P) that we take in this paper.

Throughout this section, $h : \mathbb{R}^d \to (-\infty, +\infty]$ is an extended real-valued function that is proper, i.e., its domain $\operatorname{dom} h := \{x : h(x) < +\infty\}$ is nonempty, and is closed, i.e., its sublevel sets $\{x : h(x) \le \alpha\}$ are closed for all $\alpha \in \mathbb{R}$. Note that a proper and closed function can be nonsmooth and nonconvex, hence we consider the following generalized notion of derivative.

###### Definition 1 (Subdifferential, Rockafellar & Wets (1997)).

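The Fréchet subdifferential $\hat{\partial} h$ of $h$ at $x \in \operatorname{dom} h$ is the set of $u \in \mathbb{R}^d$ such that

$$\liminf_{z \ne x,\ z \to x} \frac{h(z) - h(x) - u^\top (z - x)}{\|z - x\|} \ge 0,$$

while the (limiting) subdifferential $\partial h$ at $x \in \operatorname{dom} h$ is the graphical closure of $\hat{\partial} h$:

$$\partial h(x) := \{u \in \mathbb{R}^d : \exists\, x_k \to x,\ h(x_k) \to h(x),\ u_k \in \hat{\partial} h(x_k),\ u_k \to u\}.$$

In particular, this generalized derivative reduces to $\nabla h$ when $h$ is continuously differentiable, and reduces to the usual subdifferential when $h$ is convex.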

###### Definition 2 (Critical point).

A point $x$ is a critical point of $h$ iff $0 \in \partial h(x)$.

###### Definition 3 (Distance).

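The distance of a point $x \in \mathbb{R}^d$ to a closed set $\Omega \subseteq \mathbb{R}^d$ is defined as:

$$\operatorname{dist}_\Omega(x) := \min_{y \in \Omega} \|y - x\|. \tag{1}$$

###### Definition 4 (Proximal map, e.g. Rockafellar & Wets (1997)).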

The proximal map of a point $x \in \mathbb{R}^d$ under a proper and closed function $h$ with parameter $\eta > 0$ is defined as:

$$\operatorname{prox}_{\eta h}(x) := \operatorname*{argmin}_{z}\ h(z) + \frac{1}{2\eta} \|z - x\|^2, \tag{2}$$

where $\|\cdot\|$ is the Euclidean norm.

We note that when $h$ is convex, the corresponding proximal map is the minimizer of a strongly convex function, i.e., a singleton. But for nonconvex $h$, the proximal map can be set-valued, in which case $\operatorname{prox}_{\eta h}(x)$ stands for an arbitrary element from that set. The proximal map is a popular tool to handle the nonsmooth part of the objective function, and is the key component of proximal-like algorithms Beck & Teboulle (2009b); Bolte et al. (2014).
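To make Definition 4 concrete, the following minimal sketch evaluates the proximal map in closed form for three common choices of the nonsmooth term. The function names and the weight `lam` are illustrative assumptions and are not taken from the paper. The $\ell_0$ case shows the set-valued situation: at a tie, the code simply returns one valid element.

```python
import numpy as np

def prox_l1(x, eta, lam=1.0):
    """prox of h(z) = lam * ||z||_1 (convex): soft-thresholding, a singleton."""
    return np.sign(x) * np.maximum(np.abs(x) - eta * lam, 0.0)

def prox_nonneg(x, eta=None):
    """prox of the indicator of {z >= 0}: Euclidean projection (eta plays no role)."""
    return np.maximum(x, 0.0)

def prox_l0(x, eta, lam=1.0):
    """One element of the prox of h(z) = lam * ||z||_0 (nonconvex): hard-thresholding.

    Coordinate-wise, keeping x_i costs lam and zeroing it costs x_i^2 / (2 * eta);
    at a tie both choices are minimizers, and this sketch picks zero.
    """
    out = np.array(x, dtype=float)
    out[out ** 2 <= 2.0 * eta * lam] = 0.0
    return out

x = np.array([-1.5, 0.2, 0.0, 3.0])
p_l1 = prox_l1(x, eta=0.5)      # shrinks every entry toward zero by 0.5
p_pos = prox_nonneg(x)          # clips negative entries to zero
p_l0 = prox_l0(x, eta=0.5)      # keeps entries with x_i^2 > 1, zeros the rest
```

Each of these maps costs $O(d)$ to evaluate, which is why proximal-type methods are attractive whenever the regularizer admits such a closed form.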

###### Definition 5 (Uniformized KŁ property, Bolte et al. (2014)).

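Function $h$ is said to satisfy the uniformized KŁ property if for every compact set $\Omega \subset \operatorname{dom} h$ on which $h$ is constant, there exist $\varepsilon, \lambda > 0$ such that for all $\bar{x} \in \Omega$ and all $x \in \{x : \operatorname{dist}_\Omega(x) < \varepsilon\} \cap \{x : h(\bar{x}) < h(x) < h(\bar{x}) + \lambda\}$, one has

$$\varphi'\big(h(x) - h(\bar{x})\big) \cdot \operatorname{dist}_{\partial h(x)}(0) \ge 1, \tag{3}$$

where the function $\varphi : [0, \lambda) \to \mathbb{R}_+$ takes the form $\varphi(t) = \frac{c}{\theta} t^{\theta}$ for some constants $c > 0$, $\theta \in (0, 1]$.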

The above definition is a modified version of the original KŁ property Kurdyka (1998); Bolte et al. (2010), and is more convenient for our analysis later. The KŁ property is a generalization of the Łojasiewicz gradient inequality to nonsmooth functions Bolte et al. (2007), and it is a powerful tool to analyze a class of first-order descent algorithms Attouch & Bolte (2009); Bolte et al. (2014); Attouch et al. (2010). In particular, the class of semi-algebraic functions satisfies the above KŁ property. This function class covers most objective functions in real applications, for instance, all $\ell_p$ (quasi-)norms with rational $p$, real polynomials, the rank function, etc. For a more detailed discussion and a list of examples of KŁ functions, see Bolte et al. (2014) and Attouch et al. (2010).

We adopt the following assumptions on the problem (P) in this paper.

###### Assumption 1.

Regarding the functions $f$, $g$ (and $F = f + g$) in (P)

1. They are proper and lower semicontinuous; $\inf_{x} F(x) > -\infty$; the sublevel set $\{x : F(x) \le \alpha\}$ is bounded for all $\alpha \in \mathbb{R}$;

2. They satisfy the uniformized KŁ property;

3. Function $f$ is continuously differentiable and the gradient $\nabla f$ is $L$-Lipschitz continuous.

Note that the sublevel set of $F$ is bounded when either $f$ or $g$ has bounded sublevel sets, i.e., tends to $+\infty$ as $\|x\| \to +\infty$. Of course, we do not assume convexity on either $f$ or $g$, and the KŁ property serves as an alternative in this general setting.

## 3 Main Results

In this section, we provide our main results on the convergence analysis of APGnc and SVRG-APGnc as well as inexact variants of these algorithms. All proofs of the theorems are provided in supplemental materials.

### 3.1 Convergence Analysis

In this subsection, we characterize the convergence of APGnc. Our first result characterizes the behavior of the limit points of the sequence generated by APGnc.

###### Theorem 1.

Let Assumption 1.{1,3} hold for the problem (P). Then with a sufficiently small stepsize $\eta > 0$, the sequence $\{x_k\}$ generated by APGnc satisfies

1. $\{x_k\}$ is a bounded sequence;

2. The set of limit points $\Omega$ of $\{x_k\}$ forms a compact set, on which the objective function $F$ is constant;

3. All elements of $\Omega$ are critical points of $F$.

Theorem 1 states that the sequence generated by APGnc eventually approaches a compact set (i.e., a closed and bounded set in $\mathbb{R}^d$) of critical points, and the objective function remains constant on it. Here, approaching critical points establishes the first step for solving general nonconvex problems. Moreover, the compact set meets the requirements of the uniformized KŁ property, and hence provides a seed to exploit the KŁ property around it. Next, we further utilize the KŁ property to establish the asymptotic convergence rate for APGnc. In the following theorem, $\theta$ is the exponent parameter of the function $\varphi$ in the uniformized KŁ property (Definition 5).

###### Theorem 2.

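Let Assumption 1.{1,2,3} hold for the problem (P). Let $F(x) \equiv F^*$ for all $x \in \Omega$ (the set of limit points), and denote $r_k := F(x_k) - F^*$. Then with a sufficiently small stepsize $\eta > 0$, the sequence $\{r_k\}$ satisfies for $k$ large enough

1. If $\theta = 1$, then $r_k$ reduces to zero in finite steps;

2. If $\theta \in [\tfrac{1}{2}, 1)$, then $r_k$ converges to zero at a linear rate, i.e., $r_k = O(\rho^k)$ for some $\rho \in (0, 1)$;

3. If $\theta \in (0, \tfrac{1}{2})$, then $r_k$ converges to zero at a sub-linear rate, i.e., $r_k = O\big(k^{-1/(1 - 2\theta)}\big)$,

where the constants in these rates depend on the stepsize $\eta$, the Lipschitz constant $L$, and the KŁ parameters $c$ and $\theta$.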

Theorem 2 characterizes three types of convergence behaviors of APGnc, depending on the exponent $\theta$ that parameterizes the KŁ property of the objective function. An illustrative example for the first kind ($\theta = 1$) can take a form similar to $F(x) = |x|$ for $x$ around the critical points. The function is 'sharp' around its critical point and thus the iterates slide down quickly onto it within finite steps. For the second kind ($\theta \in [\tfrac{1}{2}, 1)$), example functions can take a form similar to $F(x) = x^2$ around the critical points. That is, the function is strongly convex-like and hence the convergence rate is typically linear. Lastly, functions of the third kind ($\theta \in (0, \tfrac{1}{2})$) are 'flat' around their critical points and thus the convergence is slowed down to a sub-linear rate. We note that characterizing the value of $\theta$ for a given function is a highly non-trivial problem that takes much independent effort Li & Kei (2016); Kurdyka & Spodzieja (2011). Nevertheless, the KŁ property provides a general picture of the asymptotic convergence behaviors of APGnc.

### 3.2 APGnc with Adaptive Momentum

The original APGnc uses a prescribed momentum parameter $\beta_k$, which can be theoretically justified only for convex problems. We here propose an alternative choice of the momentum stepsize that is more intuitive for nonconvex problems, and refer to the resulting algorithm as APGnc$^+$ (see Algorithm 4). The idea is to enlarge the momentum to further exploit the opportunity of acceleration when the extrapolation step achieves a lower function value. Since the proofs of Theorem 1 and Theorem 2 do not depend on the exact value of the momentum stepsize, APGnc and APGnc$^+$ have the same order-level convergence rate. However, we show in Section 4 that APGnc$^+$ improves upon APGnc numerically.
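The mechanism described above (compare the extrapolated point with the current iterate, keep the one with the lower objective value, then take a single proximal gradient step) can be sketched as follows. This is an illustrative sketch only, not a reproduction of Algorithm 3 or Algorithm 4: the update order, the initial momentum `beta`, the factor `t`, the cap of the momentum at 1, and the toy objective are all assumptions made for the example.

```python
import numpy as np

def apgnc_sketch(f, grad_f, g, prox_g, x0, eta, iters=100,
                 beta=0.5, t=0.5, adaptive=True):
    """Monotone proximal gradient with momentum (illustrative sketch)."""
    F = lambda z: f(z) + g(z)
    x = prox_g(x0, eta)                  # start from a point where g is finite
    x_prev = x.copy()
    values = [F(x)]
    for _ in range(iters):
        y = x + beta * (x - x_prev)      # linear extrapolation step
        if F(y) <= F(x):                 # extrapolation lowers F: accept it
            v = y
            if adaptive:
                beta = min(beta / t, 1.0)    # assumed rule: enlarge momentum
        else:                            # otherwise fall back to the iterate
            v = x
            if adaptive:
                beta = t * beta              # assumed rule: shrink momentum
        # one proximal gradient step per iteration
        x_prev, x = x, prox_g(v - eta * grad_f(v), eta)
        values.append(F(x))
    return x, values

# Hypothetical toy instance: smooth nonconvex loss plus a nonnegativity constraint.
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
b = rng.standard_normal(20)
f = lambda x: 0.5 * np.sum(np.log1p((A @ x - b) ** 2))
grad_f = lambda x: A.T @ ((A @ x - b) / (1.0 + (A @ x - b) ** 2))
g = lambda x: 0.0 if np.all(x >= 0) else np.inf
prox_g = lambda x, eta: np.maximum(x, 0.0)
L = np.linalg.norm(A, 2) ** 2            # upper bound on the Lipschitz constant of grad_f
x_hat, values = apgnc_sketch(f, grad_f, g, prox_g, np.ones(5), eta=0.9 / L)
```

Because the proximal step is always taken from the better of the two candidate points, the recorded objective values are non-increasing whenever the stepsize is below $1/L$ and the proximal map is exact. Setting `adaptive=False` keeps the momentum fixed at its initial value.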

### 3.3 Inexact APGnc

We further consider inexact APGnc, in which computation of the proximal gradient step may be inexact, i.e.,

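$$x_k = \operatorname{prox}^{\epsilon_k}_{\eta g}\big(y_k - \eta(\nabla f(y_k) + e_k)\big),$$

where $e_k$ captures the inexactness of computation of $\nabla f(y_k)$, and $\epsilon_k$ captures the inexactness of evaluation of the proximal map as given by

$$x = \operatorname{prox}^{\epsilon}_{\eta g}(y) = \Big\{u \;\Big|\; g(u) + \frac{1}{2\eta}\|u - y\|^2 \le \epsilon + g(v) + \frac{1}{2\eta}\|v - y\|^2,\ \forall v \in \mathbb{R}^d \Big\}. \tag{4}$$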
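As a small illustration of the inexactness criterion (4), the sketch below checks whether a candidate point is an $\epsilon$-approximate proximal point, by comparing its proximal objective with the exact minimum. It assumes a regularizer whose exact proximal map is available in closed form (here the $\ell_1$ norm, chosen only for illustration); the helper names are hypothetical.

```python
import numpy as np

def prox_objective(u, y, eta, g):
    """Value of g(u) + ||u - y||^2 / (2 * eta), the function minimized by the prox."""
    return g(u) + np.sum((u - y) ** 2) / (2.0 * eta)

def is_eps_prox(u, y, eta, g, exact_prox, eps):
    """True if u satisfies criterion (4), i.e. u is eps-suboptimal for the prox problem."""
    u_star = exact_prox(y, eta)
    return prox_objective(u, y, eta, g) <= eps + prox_objective(u_star, y, eta, g)

# Illustrative choice: g = l1 norm, whose exact prox is soft-thresholding.
g = lambda u: np.sum(np.abs(u))
soft = lambda y, eta: np.sign(y) * np.maximum(np.abs(y) - eta, 0.0)

y = np.array([1.0, -2.0, 0.3])
eta = 0.5
u = soft(y, eta) + 0.01                      # slightly perturbed proximal point
gap = prox_objective(u, y, eta, g) - prox_objective(soft(y, eta), y, eta, g)
ok_loose = is_eps_prox(u, y, eta, g, soft, eps=1e-2)
ok_tight = is_eps_prox(u, y, eta, g, soft, eps=1e-3)
```

In this example the perturbed point has a suboptimality gap of about $4.3 \times 10^{-3}$, so it passes the check for $\epsilon = 10^{-2}$ and fails it for $\epsilon = 10^{-3}$.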

The inexact proximal algorithm has been studied in Attouch et al. (2013) for nonconvex functions under the KŁ property. Our study here is the first treatment of inexact proximal algorithms with momentum (i.e., APG-like algorithms). Furthermore, previous studies addressed only the inexactness of gradient computation for nonconvex problems, whereas our study also covers the inexactness of the proximal map for nonconvex problems, requiring only $g$ to be convex, as in one of the cases we specify below.

We study the following two cases.

1. $g$ is convex;

2. $g$ is nonconvex, and $\epsilon_k = 0$.

In the first case, $\partial g$ reduces to the usual subdifferential of convex functions, and the inexactness naturally induces the following $\epsilon$-subdifferential

$$\partial_\epsilon g(x) = \{u \mid g(y) \ge g(x) + \langle y - x, u \rangle - \epsilon,\ \forall y \in \mathbb{R}^d\}.$$

Moreover, since the KŁ property utilizes the information of $\partial F$, we then need to characterize the perturbation of $\partial g$ under the inexactness $\epsilon$. This leads to the following definition.

###### Definition 6.

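For any $x \in \mathbb{R}^d$, let $u' \in \partial_\epsilon g(x)$ be such that $\nabla f(x) + u'$ has the minimal norm. Then the perturbation between $\partial g(x)$ and $\partial_\epsilon g(x)$ is defined as $\xi := \operatorname{dist}_{\partial g(x)}(u')$.

The following theorem states that for nonconvex functions, as long as the inexactness parameters $e_k$, $\epsilon_k$ and $\xi_k$ are properly controlled, the inexact APGnc converges at the same order-level rate as the corresponding exact algorithm.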

###### Theorem 3.

Consider the above two cases for inexact APGnc under Assumption 1.{1,2,3}. If for all $k$

$$\|e_k\| \le \gamma \|x_k - y_k\|, \tag{5}$$

$$\epsilon_k \le \delta \|x_k - y_k\|^2, \tag{6}$$

$$\xi_k \le \lambda \|x_k - y_k\|, \tag{7}$$

then all the statements in Theorem 1 remain true and the convergence rates in Theorem 2 remain at the same order, with constants that additionally depend on the error parameters $\gamma$, $\delta$ and $\lambda$. Correspondingly, a smaller stepsize should be used.

It can be seen that, due to the inexactness, the constant factors in Theorem 2 become less favorable. Hence, the corresponding convergence rates are slower compared to the exact case, but remain at the same order.

### 3.4 Stochastic Variance Reduced APGnc

In this subsection, we study the stochastic variance reduced APGnc algorithm, referred to as SVRG-APGnc. The main steps are presented in Algorithm 5. The major difference from APGnc is that the single proximal gradient step is replaced by a loop of stochastic proximal gradient steps using variance reduced gradients.
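For readers unfamiliar with variance reduced gradients, the sketch below shows the standard SVRG estimator of Johnson & Zhang (2013): a stochastic gradient at the current point, corrected by the same sample's gradient at a snapshot point plus the full gradient at that snapshot. The per-sample loss, the problem sizes, and the function names are hypothetical choices for the example; the inner loop of Algorithm 5 is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 4                        # hypothetical number of samples and dimension
Z = rng.standard_normal((n, d))

def grad_i(x, i):
    """Gradient of the illustrative per-sample loss f_i(x) = -0.5 * (z_i^T x)^2."""
    return -(Z[i] @ x) * Z[i]

def full_grad(x):
    """Gradient of f = (1/n) * sum_i f_i."""
    return -(Z.T @ (Z @ x)) / n

def svrg_grad(x, x_snap, mu_snap, i):
    """Variance reduced estimate of the full gradient at x using sample i."""
    return grad_i(x, i) - grad_i(x_snap, i) + mu_snap

x_snap = rng.standard_normal(d)                 # snapshot point of the outer loop
mu_snap = full_grad(x_snap)                     # full gradient, computed once per snapshot
x = x_snap + 0.1 * rng.standard_normal(d)       # a nearby inner-loop iterate

# Averaging over all samples recovers the full gradient: the estimator is unbiased.
avg = np.mean([svrg_grad(x, x_snap, mu_snap, i) for i in range(n)], axis=0)
```

The estimator is unbiased for any `x`, and it coincides with the full gradient when `x` equals the snapshot, so its variance shrinks as the inner iterates stay close to the snapshot point.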

Due to the stochastic nature of the algorithm, the iterate sequence may not stably stay in the local KŁ region, and hence the standard KŁ approach fails. We then focus on the analysis of the special but important case of the global KŁ property with $\theta = 1/2$. In fact, if $g = 0$, the KŁ property in such a case reduces to the well-known Polyak-Łojasiewicz (PL) inequality studied in Karimi et al. (2016). Various nonconvex problems have been shown to satisfy this property, such as the quadratic phase retrieval loss function Zhou et al. (2016) and the neural network loss function Hardt & Ma (2016). The following theorem characterizes the convergence rate of SVRG-APGnc under the KŁ property with $\theta = 1/2$.

###### Theorem 4.

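Let the stepsize $\eta$ be chosen sufficiently small. If the problem (P) satisfies the KŁ property globally with $\theta = 1/2$, then the sequence $\{y_k\}$ generated by Algorithm 5 satisfies

$$\mathbb{E}[F(y_k) - F^*] \le \left(\frac{d}{d + 1}\right)^{k} \big(F(y_0) - F^*\big), \tag{8}$$

where $d > 0$ is a constant, and $F^*$ is the optimal function value.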

Hence, SVRG-APGnc also achieves the linear convergence rate under the KŁ property with $\theta = 1/2$. We note that Theorem 4 differs from the linear convergence result established in Reddi et al. (2016b) for the SVRG proximal gradient in two respects: (1) we analyze proximal gradient with momentum, whereas Reddi et al. (2016b) studied the proximal gradient algorithm without momentum; (2) the KŁ property with $\theta = 1/2$ here is different from the generalized PL inequality for composite functions adopted by Karimi et al. (2016). In order to exploit the KŁ property, our analysis of the convergence rate requires novel treatments of bounds, which can be seen in the proof of Theorem 4 in the supplemental materials.

### 3.5 Inexact SVRG-APGnc

We further study the inexact SVRG-APGnc algorithm, and the setting of inexactness is the same as that in Section 3.3. Here, we focus on the case where $g$ is convex. The following theorem characterizes the convergence rate under such an inexact case.

###### Theorem 5.

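Let $g$ be convex and consider only the inexactness in the proximal map. Assume the KŁ property is globally satisfied with $\theta = 1/2$, the stepsize $\eta$ is chosen sufficiently small, and the proximal error is properly bounded. Then the sequence $\{y_k\}$ satisfies

$$\mathbb{E}[F(y_k) - F^*] \le \left(\frac{d}{d + 1}\right)^{k} \big(F(y_0) - F^*\big), \tag{9}$$

where $d > 0$ is a constant that depends on the error bound, and $F^*$ is the optimal function value.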

To our knowledge, the convergence analysis for stochastic methods in the inexact case has not been addressed before. To incorporate the KŁ property in deriving the convergence rate, we use a reference sequence generated by the exact proximal mapping. Even though this sequence is not actually generated by the algorithm, we can obtain the convergence rate by analyzing the relation between the reference sequence and the actual sequence generated by the algorithm.

Compared to the exact case, the convergence rate remains at the same order, i.e., linear convergence, but the convergence is slower due to the larger constant $d$ caused by the proximal error.

## 4 Experiments

In this section, we compare the efficiency of APGnc and SVRG-APGnc with other competitive methods via numerical experiments. In particular, we focus on the non-negative principal component analysis (NN-PCA) problem, which can be formulated as

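$$\min_{x \ge 0}\ -\frac{1}{2} x^\top \Big(\sum_{i=1}^{n} z_i z_i^\top\Big) x + \gamma \|x\|^2. \tag{10}$$

It can be equivalently written as

$$\min_{x}\ -\frac{1}{2} x^\top \Big(\sum_{i=1}^{n} z_i z_i^\top\Big) x + \gamma \|x\|^2 + \mathbb{1}_{\{x \ge 0\}}(x). \tag{11}$$

Here, $f$ corresponds to the first two terms, and $g$ is the indicator of the nonnegative orthant, i.e., $\mathbb{1}_{\{x \ge 0\}}(x) = 0$ if $x \ge 0$ and $+\infty$ otherwise. This problem is nonconvex due to the negative sign and satisfies Assumption 1. In particular, it satisfies the KŁ property since it is quadratic.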
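For the NN-PCA formulation above, the two ingredients needed by any proximal gradient method are the gradient of the smooth part and the proximal map of the indicator, which is the projection onto the nonnegative orthant. The sketch below writes them out and checks the gradient against finite differences. The sample count, dimension, value of `gamma`, and stepsize are hypothetical and are not the settings used in the experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, gamma = 30, 6, 0.1                 # hypothetical sizes and regularization weight
Z = rng.standard_normal((n, d))
Z /= np.linalg.norm(Z, axis=1, keepdims=True)   # normalize samples to unit norm
S = Z.T @ Z                              # S = sum_i z_i z_i^T

def f(x):
    """Smooth part of (11): -0.5 * x^T S x + gamma * ||x||^2."""
    return -0.5 * x @ S @ x + gamma * x @ x

def grad_f(x):
    """Gradient of f: -S x + 2 * gamma * x."""
    return -S @ x + 2.0 * gamma * x

def prox_g(x, eta):
    """Prox of the nonnegativity indicator: projection onto {x >= 0}."""
    return np.maximum(x, 0.0)

x = rng.standard_normal(d)
h = 1e-6
# Central finite differences along each coordinate direction.
num_grad = np.array([(f(x + h * e) - f(x - h * e)) / (2 * h) for e in np.eye(d)])
x_next = prox_g(x - 0.01 * grad_f(x), 0.01)     # one proximal gradient step
```

These `f`, `grad_f` and `prox_g` callables are the only problem-specific pieces; any of the proximal gradient variants discussed in this paper can be built on top of them.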

For the experiment, we fix the problem size and randomly generate the samples $z_i$ from a normal distribution. All samples are then normalized to have unit norm. The initialization is randomly generated, and the same initialization is applied to all the methods. We then compare the function values versus the number of effective passes through the samples.

### 4.1 Comparison among APG variants

We first compare the deterministic APG-like methods in Algorithms 2 - 4 and the standard proximal gradient method. The original APG in Algorithm 1 is not considered since it is not a descent method and does not have a convergence guarantee in nonconvex cases. We use a fixed step size $\eta$ set relative to $L$, where $L$ is the spectral norm of the sample matrix $\sum_{i=1}^{n} z_i z_i^\top$. The adaptive momentum parameters of APGnc$^+$ are set to fixed values. The results are shown in Figures 1 and 2. In Figure 1 (a), we show the performance comparison of the methods when there is no error in the gradient or proximal calculation. One can see that APGnc and APGnc$^+$ outperform all other APG variants. In particular, APGnc$^+$ performs the best with our adaptive momentum strategy, justifying its empirical advantage. We note that mAPG requires two passes over all samples at each iteration, and is, therefore, less data efficient compared to other APG variants.

We further study the inexact case in Figure 1 (b), where we introduce a proximal error $\epsilon_k$ at the $k$-th iteration. One can see that inexact APGnc and inexact APGnc$^+$ also outperform the other two inexact algorithms. Furthermore, in Figure 2 (a) and (b), we compare exact and inexact algorithms for APGnc and APGnc$^+$, respectively. It can be seen that even with a reasonable amount of inexactness, both methods converge comparably to their corresponding exact methods. Although initially the function value drops faster in the exact algorithms, both exact and inexact algorithms converge to the optimal point almost at the same time. This demonstrates the robustness of the algorithms.

### 4.2 Comparison among SVRG-APG variants

We then compare the performance among SVRG-APGnc, SVRG-APGnc$^+$ and the original proximal SVRG method, using a fixed stepsize $\eta$ for all of them. The results are presented in Figures 3 and 4. In the error-free case in Figure 3 (a), one can see that the SVRG-APGnc$^+$ method outperforms the others due to the adaptive momentum, and the SVRG-APGnc method also performs better than the original proximal SVRG method.

For the inexact case, we introduce a proximal error $\epsilon_k$. One can see from Figure 3 (b) that the performance is degraded compared to the exact case, and the methods converge to a different local minimum. In this result, all the methods are no longer monotone due to the inexactness and the stochastic nature of SVRG. Nevertheless, SVRG-APGnc$^+$ still yields the best performance.

We also compare the results corresponding to SVRG-APGnc and SVRG-APGnc$^+$, with and without the proximal error, in Figure 4 (a) and (b), respectively. It is clear that the SVRG-based algorithms are much more sensitive to the error compared with the APG-based ones. Even though the error is set to be smaller than in the inexact case with APG-based methods, one can observe more significant performance gaps than those in Figure 2.

## 5 Conclusion

In this paper, we provided a comprehensive analysis of the convergence properties of APGnc as well as its inexact and stochastic variance reduced forms by exploiting the KŁ property. We also proposed an improved algorithm, APGnc$^+$, by adapting the momentum parameter. We showed that APGnc shares the same convergence guarantee and the same order of convergence rate as mAPG, but is computationally more efficient and more amenable to adaptive momentum. In order to exploit the KŁ property for accelerated algorithms in the situations with inexact errors and/or with stochastic variance reduced gradients, we developed novel convergence analysis techniques, which can be useful for exploring other algorithms for nonconvex problems.

### References

1. Attouch, H. and Bolte, J. On the convergence of the proximal algorithm for nonsmooth functions involving analytic features. Mathematical Programming, 116(1-2):5–16, 2009. ISSN 0025-5610.
2. Attouch, H., Bolte, J., Redont, P., and Soubeyran, A. Proximal alternating minimization and projection methods for nonconvex problems: An approach based on the Kurdyka-Łojasiewicz inequality. Mathematics of Operations Research, 35(2):438–457, 2010.
3. Attouch, H., Bolte, J., and Svaiter, B. Convergence of descent methods for semi-algebraic and tame problems: proximal algorithms, forward-backward splitting, and regularized Gauss-Seidel methods. Mathematical Programming, 137(1-2):91–129, 2013.
4. Beck, A. and Teboulle, M. Fast gradient-based algorithms for constrained total variation image denoising and deblurring problems. IEEE Transactions on Image Processing, 18(11):2419–2434, November 2009a.
5. Beck, A. and Teboulle, M. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009b.
6. Bolte, J., Daniilidis, A., and Lewis, A. The Łojasiewicz inequality for nonsmooth subanalytic functions with applications to subgradient dynamical systems. SIAM Journal on Optimization, 17:1205–1223, 2007.
7. Bolte, J., Daniilidis, A., Ley, O., and Mazet, L. Characterizations of Łojasiewicz inequalities and applications: Subgradient flows, talweg, convexity. Transactions of the American Mathematical Society, 362(6):3319–3363, 2010.
8. Bolte, J., Sabach, S., and Teboulle, M. Proximal alternating linearized minimization for nonconvex and nonsmooth problems. Mathematical Programming, 146(1-2):459–494, 2014.
9. Frankel, P., Garrigos, G., and Peypouquet, J. Splitting methods with variable metric for Kurdyka–Łojasiewicz functions and general convergence rates. Journal of Optimization Theory and Applications, 165(3):874–900, 2015.
10. Hardt, M. and Ma, T. Identity matters in deep learning. Arxiv preprint, 2016.
11. Johnson, R. and Zhang, T. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pp. 315–323, 2013.
12. Karimi, H., Nutini, J., and Schmidt, M. Linear convergence of gradient and proximal-gradient methods under the Polyak-Łojasiewicz condition. Machine Learning and Knowledge Discovery in Databases: European Conference, pp. 795–811, 2016.
13. Kurdyka, K. On gradients of functions definable in o-minimal structures. Annales de l’institut Fourier, 48(3):769–783, 1998.
14. Kurdyka, K. and Spodzieja, S. Separation of real algebraic sets and the Łojasiewicz exponent. Wydział Matematyki i Informatyki, Uniwersytet Łódzki, 2011.
15. Li, G. and Kei, T. Calculus of the exponent of Kurdyka-Łojasiewicz inequality and its applications to linear convergence of first-order methods. ArXiv preprint, February 2016.
16. Li, H. and Lin, Z. Accelerated proximal gradient methods for nonconvex programming. In Advances in Neural Information Processing Systems, pp. 379–387. 2015.
17. Łojasiewicz, S. Ensembles semi-analytiques. Institut des Hautes Etudes Scientifiques, 1965.
18. Nesterov, Y. A method of solving a convex programming problem with convergence rate $O(1/k^2)$. Soviet Mathematics Doklady, 27:372–376, 1983.
19. Reddi, S., Hefny, A., Sra, S., Poczos, B., and Smola, A. Stochastic variance reduction for nonconvex optimization. ArXiv preprint, 2016a.
20. Reddi, S., Sra, S., Poczos, B., and Smola, A. Proximal stochastic methods for nonsmooth nonconvex finite-sum optimization. In Advances in Neural Information Processing Systems 29, pp. 1145–1153. 2016b.
21. Rockafellar, R.T. and Wets, R.J.B. Variational Analysis. Springer, 1997.
22. Schmidt, M., Roux, N.L., and Bach, F.R. Convergence rates of inexact proximal-gradient methods for convex optimization. In Advances in Neural Information Processing Systems 24, pp. 1458–1466. 2011.
23. Tseng, P. Approximation accuracy, gradient methods, and error bound for structured convex optimization. Mathematical Programming, 125(2):263–295, 2010.
24. Yao, Q. and Kwok, J.T. More efficient accelerated proximal algorithm for nonconvex problems. ArXiv preprint, December 2016.
25. Zhou, Y., Zhang, H., and Liang, Y. Geometrical properties and accelerated gradient solvers of non-convex phase retrieval. The 54th Annual Allerton Conference, 2016.
26. Zhu, Z. and Yuan, Y. Improved SVRG for non-strongly-convex or sum-of-non-convex objectives. In International Conference on Machine Learning, 2016.