# On exponential convergence of SGD in non-convex over-parametrized learning

Raef Bassily (Department of Computer Science and Engineering), Mikhail Belkin (Department of Computer Science and Engineering), Siyuan Ma (Department of Computer Science and Engineering)

###### Abstract

Large over-parametrized models learned via stochastic gradient descent (SGD) methods have become a key element in modern machine learning. Although SGD methods are very effective in practice, most theoretical analyses of SGD suggest slower convergence than what is empirically observed. In our recent work [MBB17] we analyzed how interpolation, common in modern over-parametrized learning, results in exponential convergence of SGD with constant step size for convex loss functions. In this note, we extend those results to a much broader non-convex function class satisfying the Polyak-Lojasiewicz (PL) condition. A number of important non-convex problems in machine learning, including some classes of neural networks, have been recently shown to satisfy the PL condition. We argue that the PL condition provides a relevant and attractive setting for many machine learning problems, particularly in the over-parametrized regime.

## 1 Introduction

Stochastic Gradient Descent and its variants have become a staple of the algorithmic foundations of machine learning. Yet many of its properties are not fully understood, particularly in non-convex settings common in modern practice.

In this note, we study convergence of Stochastic Gradient Descent (SGD) for the class of functions satisfying the Polyak-Lojasiewicz (PL) condition. This class contains all strongly-convex functions as well as a broad range of non-convex functions including those used in machine learning applications (see the discussion below).

The primary purpose of this note is to show that in the interpolation setting (common in modern over-parametrized machine learning and studied in our previous work [MBB17]) SGD with fixed step size has exponential convergence for the functions satisfying the PL condition. To the best of our knowledge, this is the first such exponential convergence result for a class of non-convex functions.

Below, we discuss and highlight a number of aspects of the PL condition which differentiate it from the convex setting and make it more relevant to the practice and requirements of many machine learning problems. We first recall that in the interpolation setting, a minimizer $w^*$ of the empirical loss $L(w) = \frac{1}{n}\sum_{i=1}^{n} \ell_i(w)$ satisfies $\ell_i(w^*) = 0$ for all $i$. We say that $L$ satisfies the PL condition (see [KNS18]) if $\|\nabla L(w)\|^2 \geq \alpha L(w)$ for some $\alpha > 0$.

Most analyses of optimization in machine learning have concentrated on the convex or, commonly, the strongly convex setting. These settings are amenable to theoretical analysis and describe many important special cases of ML, such as linear and kernel methods. Still, a large class of modern models, notably neural networks, are non-convex. Even for kernel machines, many of the arising optimization problems are poorly conditioned and not well described by the traditional strongly convex analysis. Below we list some properties of the PL-type setting which make it particularly attractive and relevant to the requirements of machine learning, especially in the interpolated and over-parametrized setting.

• To verify the PL condition in the interpolated setting we need access to the norm of the gradient $\|\nabla L(w)\|$ and the value of the objective function $L(w)$. (In general we need to evaluate $\|\nabla L(w)\|^2 / (L(w) - L(w^*))$; since $L(w^*) = 0$, no further knowledge about $w^*$ is required.) These quantities are typically easily accessible empirically, can be accurately estimated from a sub-sample of the data, and are often tractable analytically. On the other hand, verifying convexity requires establishing positive semi-definiteness of the Hessian matrix, which in turn requires accurate estimation of its smallest eigenvalue $\lambda_{\min}$. Verifying this empirically is often difficult and cannot always be based on a sub-sample due to the required precision of the estimator when $\lambda_{\min}$ is close to zero (as is frequently the case in practice).

• The norm of the gradient is much more resilient to perturbation of the objective function than the smallest eigenvalue of the Hessian (for convexity).

• Many modern machine learning methods are over-parametrized and result in manifolds of global minima [cooper2018loss]. This is not compatible with strict convexity and, in most circumstances (unless those manifolds are convex domains in lower-dimensional affine sub-spaces), not compatible with convexity. However, manifolds of solutions are compatible with the PL condition.

• Nearly every application of machine learning employs techniques for feature extraction or feature transformation. Global minima and the property of interpolation (shared global minima for the individual loss functions) are preserved under coordinate transformations. Yet convexity is generally not, thus not allowing for a unified analysis of optimization under feature transforms. In contrast, as discussed in Section 3, the PL condition is invariant under a broad class of non-linear coordinate transformations.

• Many problems of interest in machine learning involve optimization on manifolds. While geodesic convexity allows for efficient optimization, it is a parametrization dependent notion and is generally difficult to establish, as it requires explicit knowledge of the geodesic coordinates on the manifold. In contrast, the PL condition also allows for efficient optimization, while invariant under the choice of coordinates and far easier to verify. See [weber2017frank] for some recent applications.

• Most convergence analyses in convex optimization rely on the distance to the minimizer. Yet, this distance is often difficult or impossible to bound empirically. Furthermore, the distance to minimizer can be infinite in many important settings, including optimization via logistic loss [soudry2017implicit] or inverse problems over Hilbert spaces, as in kernel methods [ma2017diving]. In contrast, PL-type analyses directly involve the value of the loss function, an empirically observable quantity of practical significance.

• As originally observed by Polyak [polyak1963gradient], the PL condition is sufficient for exponential convergence of gradient descent. As we establish in this note, it also allows for exponential convergence of stochastic gradient descent with fixed step size in the interpolated setting.
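The first point in the list above can be illustrated with a minimal sketch. It assumes a toy over-parametrized least-squares problem (not taken from this note) with $L(w) = \frac{1}{2n}\|Xw - y\|^2$ and exactly realizable labels, for which the ratio $\|\nabla L(w)\|^2 / L(w)$ is bounded below by $2\lambda_{\min}(XX^\top)/n$. The helper name `pl_ratio` and all constants are hypothetical.

```python
import numpy as np

def pl_ratio(X, y, w):
    """Return ||grad L(w)||^2 / L(w) for L(w) = (1/2n) * ||Xw - y||^2."""
    n = X.shape[0]
    residual = X @ w - y
    loss = 0.5 * np.dot(residual, residual) / n
    grad = X.T @ residual / n
    return np.dot(grad, grad) / loss

rng = np.random.default_rng(0)
n, d = 50, 200                      # over-parametrized: more parameters than samples
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d)      # realizable labels, so a zero-loss solution exists

# Analytic lower bound on the ratio for this toy problem: 2 * lambda_min(X X^T) / n.
alpha_exact = 2.0 * np.linalg.eigvalsh(X @ X.T)[0] / n

# Empirical check at a few random parameter vectors.
ratios = [pl_ratio(X, y, rng.standard_normal(d)) for _ in range(20)]
print(f"analytic alpha = {alpha_exact:.3f}, smallest observed ratio = {min(ratios):.3f}")
```

Only the gradient norm and the loss value are needed here; no Hessian eigenvalue has to be estimated.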

Technical contributions: The main technical contribution of this note is to show the exponential convergence of mini-batch SGD in the interpolated setting. The proof is simple and is reminiscent of the original observation by Polyak [polyak1963gradient] of exponential convergence of gradient descent. It also extends our previous work on the exponential convergence of mini-batch SGD [MBB17] to a non-convex setting. Interestingly, the step size arising from the PL condition in our analysis depends on the parameter $\alpha$ and is potentially much smaller than that in the strongly convex case, where no such dependence is needed. At this point it is an open question whether this dependence is necessary in the PL setting. As an additional contribution, in Section 4, we show that for a special class of PL functions obtained by a composition of a strongly convex function and a linear transformation (these functions are convex but not necessarily strictly convex), we obtain exponential convergence without such dependence on $\alpha$ in the step size. However, this result requires a different type of analysis than that for the general PL setting. In Section 3, we provide a formal statement capturing the transformation invariance property of the PL condition.

### Examples and Related Work:

The PL condition has recently become popular in optimization and machine learning, starting with the work [KNS18]. In fact, as discussed in [KNS18], several other conditions proposed for convergence analysis are special cases of the PL condition. One such condition is the restricted secant inequality (RSI) proposed in [zhang2013gradient]. Another set of conditions that are special cases of the PL condition was referred to as “one-point convexity” in [allen2017natasha]. The two variations of one-point convexity discussed there are special cases of RSI and PL, respectively, and hence are in the PL class. The same reference points out several examples of “one-point convexity” in previous works. Some notable examples satisfying RSI include two-layer neural networks [li2017convergence], matrix completion [sun2016guaranteed], dictionary learning [arora2015simple], and phase retrieval [chen2015solving]. It has also been observed empirically that neural networks satisfy the PL condition [kleinberg2018alternative]. In particular, we note the recent work [soltanolkotabi2018theoretical], which considers a class of neural networks that attain zero quadratic loss, implying interpolation. In their proof it is shown that this class of neural nets satisfies the PL condition. Hence our results imply exponential convergence of SGD for this class. To the best of our knowledge this is the first time that exponential convergence of SGD has been established for a class of multi-layer neural networks.

## 2 Exponential Convergence of SGD for PL Losses

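We start by formally stating the Polyak-Lojasiewicz (PL) condition.

###### Definition 2.1 (α-PL function).

Let $\alpha > 0$. Let $f: \mathcal{H} \to \mathbb{R}$ be a differentiable function. Assume, w.l.o.g., that $\inf_{w \in \mathcal{H}} f(w) = 0$. We say that $f$ is $\alpha$-PL if for every $w \in \mathcal{H}$, we have

$$\|\nabla f(w)\|^2 \geq \alpha f(w).$$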
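As a quick numerical illustration of the definition, the sketch below uses the one-dimensional function $f(x) = x^2 + 3\sin^2(x)$, an example chosen here for illustration and not taken from this note. It has infimum $0$ at $x = 0$, is non-convex, and on the grid used below the ratio $f'(x)^2 / f(x)$ stays bounded away from zero. The grid range and resolution are arbitrary assumptions, and a grid check is of course not a proof.

```python
import numpy as np

def f(x):
    return x**2 + 3.0 * np.sin(x)**2

def f_prime(x):
    # d/dx [3 sin^2(x)] = 3 sin(2x)
    return 2.0 * x + 3.0 * np.sin(2.0 * x)

def f_second(x):
    return 2.0 + 6.0 * np.cos(2.0 * x)

x = np.linspace(-10.0, 10.0, 200001)
x = x[np.abs(x) > 1e-6]          # drop the minimizer x = 0, where the ratio is 0/0

ratio = f_prime(x)**2 / f(x)     # the definition asks for ratio >= alpha everywhere
alpha_hat = ratio.min()
print(f"smallest ratio on the grid: {alpha_hat:.4f}")
print(f"most negative second derivative on the grid: {f_second(x).min():.2f}")
```

A negative second derivative confirms non-convexity, while a strictly positive smallest ratio is consistent with the PL inequality holding for some $\alpha > 0$.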

### ERM with smooth losses:

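We consider the ERM problem $\min_{w \in \mathcal{H}} L(w)$, where $L(w) = \frac{1}{n}\sum_{i=1}^{n} \ell_i(w)$ and, for all $i \in \{1, \ldots, n\}$, $\ell_i$ is non-negative and $\beta$-smooth. Moreover, $L$ is a $\lambda$-smooth, $\alpha$-PL function (as in Definition 2.1 above).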

We do not assume a compact parameter space; that is, a parameter vector $w \in \mathcal{H}$ can have unbounded norm, however $L$ is assumed to be bounded from below. In particular, a global minimizer may not exist; however, we assume the existence of a global infimum for $L$ (which is equal to zero w.l.o.g.).

To elaborate, we assume the existence of a sequence $\{w_k\}_{k=1}^{\infty} \subset \mathcal{H}$ such that

$$\lim_{k \to \infty} L(w_k) = \inf_{w \in \mathcal{H}} L(w) = 0. \tag{1}$$

###### Assumption 1 (Interpolation).

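For every sequence $w_1, w_2, \ldots$ such that $\lim_{k \to \infty} L(w_k) = 0$, we have, for all $i \in \{1, \ldots, n\}$, $\lim_{k \to \infty} \ell_i(w_k) = 0$.

Consider the SGD algorithm that starts at an arbitrary $w_0 \in \mathcal{H}$, and at each iteration $t$ makes an update with a constant step size $\eta$:

$$w_{t+1} = w_t - \eta \cdot \nabla \left\{ \frac{1}{m} \sum_{j=1}^{m} \ell_{i_t^{(j)}}(w_t) \right\} \tag{2}$$

where $m$ is the size of a mini-batch of data points whose indices $i_t^{(1)}, \ldots, i_t^{(m)}$ are drawn uniformly with replacement at each iteration $t$ from $\{1, \ldots, n\}$.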
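The update (2) can be sketched in a few lines. The example below is a minimal illustration under stated assumptions: the losses are squared losses on a toy over-parametrized linear model with realizable labels (so interpolation holds), and the step size `eta=0.004` is hand-picked rather than derived from the analysis. The names `sgd_step`, `grad_fn` and `loss` are hypothetical.

```python
import numpy as np

def sgd_step(w, grad_fn, n, m, eta, rng):
    """One update of (2): w <- w - eta * (1/m) * sum_j grad l_{i_j}(w)."""
    batch = rng.integers(0, n, size=m)            # uniform indices, with replacement
    g = np.mean([grad_fn(w, i) for i in batch], axis=0)
    return w - eta * g

# Toy interpolating problem: l_i(w) = 0.5 * (x_i . w - y_i)^2 with realizable labels.
rng = np.random.default_rng(0)
n, d = 20, 100
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d)

def grad_fn(w, i):
    return (X[i] @ w - y[i]) * X[i]

def loss(w):
    return 0.5 * np.mean((X @ w - y) ** 2)

w = np.zeros(d)
initial_loss = loss(w)
for t in range(2000):
    w = sgd_step(w, grad_fn, n, m=4, eta=0.004, rng=rng)
final_loss = loss(w)
print(f"loss: {initial_loss:.3e} -> {final_loss:.3e}")
```

Because every $\ell_i$ vanishes at a common solution, the stochastic gradients shrink together with the loss, which is what allows a constant step size to be used.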

The theorem below establishes the exponential convergence of mini-batch SGD for any smooth, PL loss in the interpolated regime.

###### Theorem 1.

Consider the mini-batch SGD with smooth losses as described above. Suppose that Assumption 1 holds and suppose that the empirical risk function $L$ is $\alpha$-PL for some fixed $\alpha > 0$. For any mini-batch size $m \geq 1$, the mini-batch SGD (2) with constant step size $\eta^*(m) = \frac{\alpha m}{\lambda \left( \alpha (m-1) + 2\beta \right)}$ gives the following guarantee

$$\mathbb{E}_{w_t}\left[ L(w_t) \right] \leq \left( 1 - \frac{\alpha\, \eta^*(m)}{2} \right)^{t} L(w_0), \tag{3}$$

where the expectation is taken w.r.t. the randomness in the choice of the mini-batch.
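To get a feel for how the guarantee scales with the batch size, the sketch below evaluates the step size obtained by minimizing the bound (4) in the proof over $\eta$, namely $\eta^*(m) = \alpha m / (\lambda(\alpha(m-1) + 2\beta))$, together with the per-iteration factor $1 - \alpha\eta^*(m)/2$ from (3). The numerical constants are hypothetical and chosen only for illustration.

```python
def optimal_step_size(alpha, lam, beta, m):
    """eta*(m) = alpha * m / (lam * (alpha * (m - 1) + 2 * beta))."""
    return alpha * m / (lam * (alpha * (m - 1) + 2.0 * beta))

def contraction_factor(alpha, lam, beta, m):
    """Per-iteration factor 1 - alpha * eta*(m) / 2 appearing in the bound (3)."""
    return 1.0 - 0.5 * alpha * optimal_step_size(alpha, lam, beta, m)

# Hypothetical constants, chosen only for illustration.
alpha, lam, beta = 0.5, 2.0, 10.0
for m in (1, 4, 16, 64):
    eta = optimal_step_size(alpha, lam, beta, m)
    rho = contraction_factor(alpha, lam, beta, m)
    print(f"m={m:3d}  eta*={eta:.4f}  factor={rho:.4f}")
```

With these constants the step size grows with $m$ and saturates at $1/\lambda$ as $m \to \infty$, so the per-iteration factor improves with larger batches but with diminishing returns.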

###### Proof.

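From $\lambda$-smoothness of $L$, it follows that

$$L(w_{t+1}) \leq L(w_t) + \left\langle \nabla L(w_t),\, w_{t+1} - w_t \right\rangle + \frac{\lambda}{2} \left\| w_{t+1} - w_t \right\|^2.$$

Using (2), we then have

$$L(w_t) - L(w_{t+1}) \geq \eta \left\langle \nabla L(w_t),\, \frac{1}{m} \sum_{j=1}^{m} \nabla \ell_{i_t^{(j)}}(w_t) \right\rangle - \frac{\eta^2 \lambda}{2} \left\| \frac{1}{m} \sum_{j=1}^{m} \nabla \ell_{i_t^{(j)}}(w_t) \right\|^2.$$

Fixing $w_t$ and taking expectation with respect to the randomness in the choice of the batch $i_t^{(1)}, \ldots, i_t^{(m)}$ (and using the fact that those indices are i.i.d.), we get

$$\mathbb{E}_{i_t^{(1)}, \ldots, i_t^{(m)}}\left[ L(w_t) - L(w_{t+1}) \right] \geq \eta \left\| \nabla L(w_t) \right\|^2 - \frac{\eta^2 \lambda}{2} \left( \frac{1}{m}\, \mathbb{E}_{i_t^{(1)}}\left[ \left\| \nabla \ell_{i_t^{(1)}}(w_t) \right\|^2 \right] + \frac{m-1}{m} \left\| \nabla L(w_t) \right\|^2 \right)$$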
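The second-moment term in the last display comes from a standard expansion, spelled out here for completeness. Writing $g_j = \nabla \ell_{i_t^{(j)}}(w_t)$ as shorthand (a notation introduced only for this remark), the $g_j$ are i.i.d. given $w_t$ with $\mathbb{E}[g_j] = \nabla L(w_t)$, so the $m$ diagonal terms and the $m(m-1)$ cross terms give

$$\mathbb{E}\left\| \frac{1}{m} \sum_{j=1}^{m} g_j \right\|^2 = \frac{1}{m^2} \sum_{j=1}^{m} \mathbb{E}\|g_j\|^2 + \frac{1}{m^2} \sum_{j \neq k} \left\langle \mathbb{E}[g_j],\, \mathbb{E}[g_k] \right\rangle = \frac{1}{m}\, \mathbb{E}\|g_1\|^2 + \frac{m-1}{m} \left\| \nabla L(w_t) \right\|^2.$$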

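Since each $\ell_i$ is $\beta$-smooth and non-negative, we have $\left\| \nabla \ell_{i_t^{(1)}}(w_t) \right\|^2 \leq 2\beta\, \ell_{i_t^{(1)}}(w_t)$ with probability $1$ over the choice of $i_t^{(1)}$. Thus, the last inequality reduces to

$$\mathbb{E}\left[ L(w_t) - L(w_{t+1}) \right] \geq \eta \left( 1 - \frac{\eta \lambda}{2} \cdot \frac{m-1}{m} \right) \left\| \nabla L(w_t) \right\|^2 - \frac{\eta^2 \lambda \beta}{m}\, L(w_t).$$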
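The gradient bound for smooth non-negative functions used in this step is a standard fact; a short derivation is included here for the reader's convenience. For any $\beta$-smooth function $\ell \geq 0$ and any $w$, applying the smoothness inequality at the point $w - \frac{1}{\beta} \nabla \ell(w)$ gives

$$0 \leq \ell\left( w - \tfrac{1}{\beta} \nabla \ell(w) \right) \leq \ell(w) - \frac{1}{\beta} \left\| \nabla \ell(w) \right\|^2 + \frac{1}{2\beta} \left\| \nabla \ell(w) \right\|^2 = \ell(w) - \frac{1}{2\beta} \left\| \nabla \ell(w) \right\|^2,$$

so that $\|\nabla \ell(w)\|^2 \leq 2\beta\, \ell(w)$. Taking the expectation over a uniformly drawn index then yields $\mathbb{E}_i \|\nabla \ell_i(w_t)\|^2 \leq 2\beta\, L(w_t)$.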

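By invoking the $\alpha$-PL condition of $L$ and assuming that $\eta$ is small enough that $1 - \frac{\eta \lambda}{2} \cdot \frac{m-1}{m} \geq 0$, we get

$$\begin{aligned} \mathbb{E}\left[ L(w_t) - L(w_{t+1}) \right] &\geq \alpha \eta \left( 1 - \frac{\eta \lambda}{2} \cdot \frac{m-1}{m} \right) L(w_t) - \frac{\eta^2 \lambda \beta}{m}\, L(w_t) \\ &= \eta \left( \alpha - \frac{\eta \lambda}{m} \left( \alpha\, \frac{m-1}{2} + \beta \right) \right) L(w_t) \end{aligned}$$

Hence,

$$\mathbb{E}\left[ L(w_{t+1}) \right] \leq \left( 1 - \eta \alpha + \frac{\eta^2 \lambda}{m} \left( \alpha\, \frac{m-1}{2} + \beta \right) \right) \mathbb{E}\left[ L(w_t) \right] \tag{4}$$

By optimizing the quadratic term in the upper bound (4) with respect to $\eta$, we get $\eta = \frac{\alpha m}{\lambda \left( \alpha (m-1) + 2\beta \right)}$, which is $\eta^*(m)$ in the theorem statement. Hence, (4) becomes

$$\mathbb{E}\left[ L(w_{t+1}) \right] \leq \left( 1 - \frac{\alpha\, \eta^*(m)}{2} \right) \mathbb{E}\left[ L(w_t) \right],$$

which gives the desired convergence rate. ∎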
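The optimization over $\eta$ in the last step of the proof can be made explicit. Writing $c = \frac{\lambda}{m}\left( \alpha\, \frac{m-1}{2} + \beta \right)$ as shorthand (introduced only for this remark), the factor in (4) is the quadratic $q(\eta) = 1 - \alpha \eta + c\, \eta^2$, and

$$q'(\eta) = -\alpha + 2c\,\eta = 0 \;\Longrightarrow\; \eta^* = \frac{\alpha}{2c} = \frac{\alpha m}{\lambda \left( \alpha (m-1) + 2\beta \right)}, \qquad q(\eta^*) = 1 - \alpha \eta^* + c\, \eta^* \cdot \frac{\alpha}{2c} = 1 - \frac{\alpha\, \eta^*}{2}.$$

Note also that $\eta^* \leq \frac{m}{\lambda (m-1)}$ for $m > 1$, so the factor $1 - \frac{\eta \lambda}{2} \cdot \frac{m-1}{m}$ multiplying $\|\nabla L(w_t)\|^2$ is at least $\frac{1}{2}$ at $\eta = \eta^*$, and the PL condition can indeed be applied there.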

## 3 A Transformation-Invariance Property of PL Functions and Its Implications

In this section, we formally discuss a simple observation concerning the class of PL functions that has useful implications for a wide array of problems in modern machine learning. In particular, we observe that if $f$ is a $\lambda$-smooth and $\alpha$-PL function for some $\lambda, \alpha > 0$, then for any map $\Phi$ that satisfies certain weak conditions, the composition $f \circ \Phi$ is $\lambda'$-smooth and $\alpha'$-PL for some $\lambda', \alpha'$ that depend on $\lambda, \alpha$, respectively, as well as a fairly general property of $\Phi$. This shows that the class of smooth PL objectives is closed under a fairly large family of transformations. Given our results above, this observation has direct implications for the convergence of SGD for a large class of problems that involve parameter transformation, e.g., via feature maps.

First, we formalize this closure property in the following claim. Let be any map. We can write such a map as , where for each is a scalar function over . The Jacobian of is an operator that, for each , is described by a real-valued matrix whose entries are the partial derivatives

###### Claim 1.

Let be -smooth and -PL function for some . Let be any map, where . Suppose there exist such that for all and , where and denote the minimum and maximum eigen values of , respectively. Then, the function is -smooth and -PL, where and .

Note that the condition that is necessary for to be positive. The condition on holds when is differentiable and Lipschitz-continuous. The above claim follows easily from the chain rule and the PL condition.
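The chain-rule argument behind the claim can be checked numerically on a small example. The sketch below rests on assumptions that are not taken from the statement above: the eigenvalue bounds are read as bounds on $J_\Phi(v) J_\Phi(v)^\top$, the map is the elementwise function $\Phi(v) = v + \tfrac{1}{2}\sin(v)$ (whose Jacobian is diagonal with entries in $[\tfrac{1}{2}, \tfrac{3}{2}]$), and $f(u) = \tfrac{1}{2}\|u - c\|^2$, which satisfies $\|\nabla f\|^2 = 2f$. Under these assumptions the chain rule gives $\|\nabla (f \circ \Phi)(v)\|^2 \geq a \cdot \alpha \cdot f(\Phi(v))$ with $a = \tfrac{1}{4}$ and $\alpha = 2$. All names are hypothetical.

```python
import numpy as np

def phi(v):
    """Elementwise map with Jacobian diag(1 + 0.5 * cos(v))."""
    return v + 0.5 * np.sin(v)

def f(u, c):
    """f(u) = 0.5 * ||u - c||^2, which satisfies ||grad f||^2 = 2 f (alpha = 2)."""
    return 0.5 * np.sum((u - c) ** 2)

def grad_composition(v, c):
    """Chain rule: grad (f o phi)(v) = J_phi(v)^T grad f(phi(v))."""
    jac_diag = 1.0 + 0.5 * np.cos(v)
    return jac_diag * (phi(v) - c)

rng = np.random.default_rng(0)
d = 5
c = phi(rng.standard_normal(d))    # target in the range of phi, so inf of f o phi is 0
a = 0.25                           # lower bound on eigenvalues of J J^T: (1 - 0.5)^2
alpha = 2.0

worst = np.inf
for _ in range(1000):
    v = 3.0 * rng.standard_normal(d)
    g = grad_composition(v, c)
    value = f(phi(v), c)
    worst = min(worst, np.dot(g, g) / value)
print(f"smallest observed ratio {worst:.3f} vs predicted lower bound {a * alpha:.3f}")
```

The lower bound on the eigenvalues of $J_\Phi J_\Phi^\top$ is what keeps the gradient of the composition from vanishing away from the minimizers.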

Given this property of PL functions and our result in Theorem 1, we can argue that for smooth, PL losses, the exponential convergence rate of SGD is preserved under any transformation that satisfies the conditions in the above claim. We formalize this conclusion below.

As before, we consider a set of $\beta$-smooth losses $\{\ell_1, \ldots, \ell_n\}$, where the empirical risk $L = \frac{1}{n}\sum_{i=1}^{n} \ell_i$ is $\lambda$-smooth and $\alpha$-PL.

###### Corollary 1.

Let be any map that satisfies the conditions in Claim 1. Suppose Assumption 1 holds and that there is sequence such that . Suppose we run mini-batch SGD w.r.t. the loss functions with batch size and step size , where is as defined in Theorem 1. Let denote the sequence of parameter vectors generated by mini-batch SGD over iterations. Then, we have

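$$\mathbb{E}_{v_t}\left[ L(\Phi(v_t)) \right] \leq \left( 1 - \left( \frac{a^2}{b^2} \right) \frac{\alpha\, \eta^*(m)}{2} \right)^{t} L(\Phi(v_0)).$$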

## 4 Faster Convergence for a Class of Convex Losses

We consider a special class of PL functions originally discussed in [KNS18]. This class contains all convex functions that can be expressed as a composition of a strongly convex function with a linear function. Note that this class contains losses that are convex but not necessarily strongly, or even strictly, convex.

In [KNS18, Appendix B], it was shown that if $g$ is $\alpha$-strongly convex and $A$ is a matrix whose least non-zero singular value is $\sigma$, then $f$ defined as $f(w) = g(Aw)$ is a PL function. For this special class of PL losses, we show a better bound on the convergence rate than what is directly implied by Theorem 1. The proof technique for this result is different from that of Theorem 1. Exponential convergence of SGD for strongly convex losses in the interpolation setting has been established previously in [MBB17]. In this section, we show a similar convergence rate for this larger class of convex losses.

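Let $A$ be a real matrix. Let $\sigma_{\min}$ and $\sigma_{\max}$ denote the smallest non-zero singular value and the largest singular value of $A$, respectively. Consider a collection of loss functions $\{\ell_1, \ldots, \ell_n\}$ where each $\ell_i$ can be expressed as $\ell_i(w) = \tilde{\ell}_i(Aw)$ for some $\beta$-smooth convex function $\tilde{\ell}_i$. It is easy to see that this implies that each $\ell_i$ is $\beta\sigma_{\max}^2$-smooth and convex. The empirical risk can be written as $L(w) = \tilde{L}(Aw)$, where $\tilde{L} = \frac{1}{n}\sum_{i=1}^{n} \tilde{\ell}_i$. Moreover, suppose that $\tilde{L}$ is $\lambda$-smooth and $\alpha$-strongly convex. Now, suppose we run SGD described in (2) to solve the ERM problem defined by the losses $\{\ell_1, \ldots, \ell_n\}$. The following theorem provides an exponential convergence guarantee for SGD in the interpolation setting.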

###### Theorem 2.

Consider the scenario described above and suppose Assumption 1 is true. Let $\sigma_{\min}$ and $\sigma_{\max}$ be the smallest non-zero singular value and the largest singular value of $A$, respectively. Let $w^*$ be any vector such that $Aw^*$ is the unique minimizer of $\tilde{L}$. The mini-batch SGD (2) with batch size $m$ and a suitable constant step size $\eta^*(m)$ gives the following guarantee

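$$\mathbb{E}_{w_t}\left[ L(w_t) \right] \leq \frac{\lambda \sigma_{\max}^2}{2} \left( 1 - \alpha \sigma_{\min}^2\, \eta^*(m) \right)^{t} \left\| \hat{w}_0 - \hat{w}^* \right\|^2$$

where $\hat{w}_0 = A^{\dagger} A\, w_0$ and $\hat{w}^* = A^{\dagger} A\, w^*$, and where $A^{\dagger}$ is the pseudo-inverse of $A$.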

###### Proof.

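Recall that we can express $A$ via SVD as $A = U \Sigma V^{\top}$, where $U$ is the matrix whose columns form an eigenbasis for $A A^{\top}$, $V$ is the matrix whose columns $v_1, v_2, \ldots$ form an eigenbasis for $A^{\top} A$, and $\Sigma$ is the matrix that contains the singular values of $A$; in particular, $\Sigma_{ii} = \sigma_i$ and $\Sigma_{ij} = 0$ for $i \neq j$, where $\sigma_i$ is the $i$-th singular value of $A$. Let $\sigma_1 \geq \cdots \geq \sigma_r > 0$ be the non-zero singular values of $A$, where $r = \mathrm{rank}(A)$. The following is a known fact: $\{v_1, \ldots, v_r\}$ is an orthonormal basis for $\mathrm{Null}(A)^{\perp}$ and the remaining columns of $V$ form an orthonormal basis for $\mathrm{Null}(A)$, where $\mathrm{Null}(A)^{\perp}$ is the subspace orthogonal to $\mathrm{Null}(A)$. Also, recall that the Moore-Penrose inverse (pseudo-inverse) of $A$, denoted as $A^{\dagger}$, is given by $A^{\dagger} = V \Sigma^{\dagger} U^{\top}$, where $\Sigma^{\dagger}_{ii} = 1/\sigma_i$ for $i \leq r$, and the remaining entries are all zeros. The following is also a known fact that follows easily from the definition of $A^{\dagger}$ and the facts above: $\{v_1, \ldots, v_r\}$ is an orthonormal basis for $\mathrm{Col}(A^{\dagger})$. Hence, from the above facts, it is easy to see that $\mathrm{Col}(A^{\dagger}) = \mathrm{Null}(A)^{\perp}$. Thus, by the direct sum theorem, any $w$ can be uniquely expressed as a sum of two orthogonal components $w = \hat{w} + w^{\perp}$, where $\hat{w} \in \mathrm{Null}(A)^{\perp}$ and $w^{\perp} \in \mathrm{Null}(A)$. In particular, $\hat{w} = A^{\dagger} A\, w$.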
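The decomposition described above can be checked numerically. The following is a minimal sketch, assuming a randomly generated rank-deficient matrix; the sizes `k`, `d`, `r` and the cutoff `rcond=1e-10` passed to `numpy.linalg.pinv` are arbitrary choices made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
k, d, r = 6, 10, 3
A = rng.standard_normal((k, r)) @ rng.standard_normal((r, d))   # rank-3 matrix

# Orthogonal projector onto the subspace orthogonal to the null space of A.
P = np.linalg.pinv(A, rcond=1e-10) @ A

sing = np.linalg.svd(A, compute_uv=False)   # singular values, in descending order
sigma_min = sing[r - 1]                     # smallest non-zero singular value
sigma_max = sing[0]

w = rng.standard_normal(d)
w_hat = P @ w            # component orthogonal to the null space of A
w_perp = w - w_hat       # component in the null space of A

print("A w_perp ~ 0:", np.allclose(A @ w_perp, 0.0, atol=1e-8))
print("components orthogonal:", abs(w_hat @ w_perp) < 1e-8)
print("sigma_min bound holds:",
      np.linalg.norm(A @ w_hat) >= sigma_min * np.linalg.norm(w_hat) - 1e-9)
```

The last check is the inequality $\|A z\| \geq \sigma_{\min} \|z\|$ for $z$ orthogonal to the null space of $A$, which is the property the strong convexity argument in the proof relies on.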

Using these observations, we can make the following claim.

###### Claim 2.

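$L$ is $\alpha \sigma_{\min}^2$-strongly convex over $\mathrm{Null}(A)^{\perp}$, the subspace orthogonal to the null space of $A$.

The proof of the above claim is as follows. Fix any $z_1, z_2 \in \mathrm{Null}(A)^{\perp}$. Observe that

$$\begin{aligned} L(z_1) = \tilde{L}(A z_1) &\geq \tilde{L}(A z_2) + \left\langle \nabla \tilde{L}(A z_2),\, A(z_1 - z_2) \right\rangle + \frac{\alpha}{2} \left\| A(z_1 - z_2) \right\|^2 && \text{(5)} \\ &= L(z_2) + \left\langle \nabla L(z_2),\, z_1 - z_2 \right\rangle + \frac{\alpha}{2} \left\| A(z_1 - z_2) \right\|^2 && \text{(6)} \end{aligned}$$

where (5) follows from the strong convexity of $\tilde{L}$, and (6) follows from the definition of $L$ and the fact that $\nabla L(z) = A^{\top} \nabla \tilde{L}(A z)$. Now, we note that since $z_1 - z_2 \in \mathrm{Null}(A)^{\perp}$, we have $\| A(z_1 - z_2) \|^2 \geq \sigma_{\min}^2 \| z_1 - z_2 \|^2$. Plugging this into (6) proves the claim.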

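We now proceed with the proof of Theorem 2. By $\lambda$-smoothness of $\tilde{L}$, we have

$$L(w_{t+1}) = \tilde{L}(A w_{t+1}) \leq \frac{\lambda}{2} \left\| A(w_{t+1} - w^*) \right\|^2 = \frac{\lambda}{2} \left\| A(\hat{w}_{t+1} - \hat{w}^*) \right\|^2 \leq \frac{\sigma_{\max}^2 \lambda}{2} \left\| \hat{w}_{t+1} - \hat{w}^* \right\|^2. \tag{7}$$

where, as above, $\hat{w}_{t+1}$ is the projection of $w_{t+1}$ onto $\mathrm{Null}(A)^{\perp}$, the subspace orthogonal to the null space of $A$. Similarly, $\hat{w}^*$ is the projection of $w^*$ onto $\mathrm{Null}(A)^{\perp}$. Now, consider $\| \hat{w}_{t+1} - \hat{w}^* \|^2$. From the update step (2) of the mini-batch SGD and the linearity of the projection operator $A^{\dagger} A$, we have

$$\begin{aligned} \left\| \hat{w}_{t+1} - \hat{w}^* \right\|^2 &= \left\| \hat{w}_t - \hat{w}^* \right\|^2 - 2\eta \left\langle A^{\dagger} A \cdot \frac{1}{m} \sum_{j=1}^{m} \nabla \ell_{i_t^{(j)}}(\hat{w}_t),\, \hat{w}_t - \hat{w}^* \right\rangle + \eta^2 \left\| A^{\dagger} A \cdot \frac{1}{m} \sum_{j=1}^{m} \nabla \ell_{i_t^{(j)}}(\hat{w}_t) \right\|^2 \\ &\leq \left\| \hat{w}_t - \hat{w}^* \right\|^2 - 2\eta \left\langle \frac{1}{m} \sum_{j=1}^{m} \nabla \ell_{i_t^{(j)}}(\hat{w}_t),\, \hat{w}_t - \hat{w}^* \right\rangle + \eta^2 \left\| \frac{1}{m} \sum_{j=1}^{m} \nabla \ell_{i_t^{(j)}}(\hat{w}_t) \right\|^2 \end{aligned}$$

where the first equality follows from the update step and the fact that $\nabla \ell_i(w_t) = \nabla \ell_i(\hat{w}_t)$ for all $i$. The last inequality follows from the fact that $(I - A^{\dagger} A) \cdot \frac{1}{m} \sum_{j=1}^{m} \nabla \ell_{i_t^{(j)}}(\hat{w}_t)$ is orthogonal to $\mathrm{Null}(A)^{\perp}$ (and hence orthogonal to $\hat{w}_t - \hat{w}^*$), and the fact that projection cannot increase the norm. Fixing $w_t$ and taking expectation with respect to the choice of the batch $i_t^{(1)}, \ldots, i_t^{(m)}$, we have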

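$$\mathbb{E}_{i_t^{(1)}, \ldots, i_t^{(m)}}\left[ \left\| \hat{w}_{t+1} - \hat{w}^* \right\|^2 \right] \leq \left\| \hat{w}_t - \hat{w}^* \right\|^2 - 2\eta \left\langle \nabla L(\hat{w}_t),\, \hat{w}_t - \hat{w}^* \right\rangle + \eta^2\, \mathbb{E}_{i_t^{(1)}, \ldots, i_t^{(m)}}\left[ \left\| \frac{1}{m} \sum_{j=1}^{m} \nabla \ell_{i_t^{(j)}}(\hat{w}_t) \right\|^2 \right] \tag{8}$$

By Claim 2, we have

$$\left\langle \nabla L(\hat{w}_t),\, \hat{w}_t - \hat{w}^* \right\rangle \geq L(\hat{w}_t) + \frac{\alpha \sigma_{\min}^2}{2} \left\| \hat{w}_t - \hat{w}^* \right\|^2 \tag{9}$$

Hence, from (8)-(9), we have

$$\begin{aligned} \mathbb{E}_{i_t^{(1)}, \ldots, i_t^{(m)}}\left[ \left\| \hat{w}_{t+1} - \hat{w}^* \right\|^2 \right] \leq{} & \left( 1 - \eta \alpha \sigma_{\min}^2 \right) \mathbb{E}_{i_t^{(1)}, \ldots, i_t^{(m)}}\left[ \left\| \hat{w}_t - \hat{w}^* \right\|^2 \right] \\ & - 2\eta \left( L(\hat{w}_t) - \frac{\eta}{2}\, \mathbb{E}_{i_t^{(1)}, \ldots, i_t^{(m)}}\left[ \left\| \frac{1}{m} \sum_{j=1}^{m} \nabla \ell_{i_t^{(j)}}(\hat{w}_t) \right\|^2 \right] \right) \end{aligned}$$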

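As noted earlier, each $\ell_i$ is $\beta \sigma_{\max}^2$-smooth. Also, it is easy to see that $L$ is $\lambda \sigma_{\max}^2$-smooth. From this point onward, the proof follows the same lines as the proof of [MBB17, Theorem 1]. We thus can show that by choosing $\eta = \eta^*(m)$, we get

$$\mathbb{E}_{i_t^{(1)}, \ldots, i_t^{(m)}}\left[ \left\| \hat{w}_{t+1} - \hat{w}^* \right\|^2 \right] \leq \left( 1 - \eta^*(m)\, \alpha \sigma_{\min}^2 \right) \mathbb{E}_{i_t^{(1)}, \ldots, i_t^{(m)}}\left[ \left\| \hat{w}_t - \hat{w}^* \right\|^2 \right]$$

Using the above inequality together with (7), we have

$$\begin{aligned} \mathbb{E}_{w_{t+1}}\left[ L(w_{t+1}) \right] &\leq \frac{\sigma_{\max}^2 \lambda}{2} \left( 1 - \eta^*(m)\, \alpha \sigma_{\min}^2 \right) \mathbb{E}_{w_t}\left[ \left\| \hat{w}_t - \hat{w}^* \right\|^2 \right] \\ &\leq \frac{\sigma_{\max}^2 \lambda}{2} \left( 1 - \eta^*(m)\, \alpha \sigma_{\min}^2 \right)^{t+1} \left\| \hat{w}_0 - \hat{w}^* \right\|^2. \end{aligned}$$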

## References
