How Much Over-parameterization Is Sufficient to Learn Deep ReLU Networks?

Abstract

A recent line of research on deep learning focuses on the extremely over-parameterized setting, and shows that when the network width is larger than a high-degree polynomial of the training sample size $n$ and the inverse target accuracy $\epsilon^{-1}$, deep neural networks learned by (stochastic) gradient descent enjoy nice optimization and generalization guarantees. Very recently, it has been shown that under a certain margin assumption on the training data, a polylogarithmic width condition suffices for two-layer ReLU networks to converge and generalize (Ji and Telgarsky, 2020). However, how much over-parameterization is sufficient to guarantee optimization and generalization for deep neural networks still remains an open question. In this work, we establish sharp optimization and generalization guarantees for deep ReLU networks. Under various assumptions made in previous work, our optimization and generalization guarantees hold with network width polylogarithmic in $n$ and $\epsilon^{-1}$. Our results push the study of over-parameterized deep neural networks towards more practical settings.

1 Introduction

Deep neural networks have become one of the most important and prevalent machine learning models due to their remarkable power in various real-world applications. However, the success of deep learning has not been well explained in theory. It remains mysterious why standard training algorithms tend to find a globally optimal solution, despite the highly non-convex landscape of the training loss function. Moreover, despite the extremely large number of parameters, deep neural networks rarely overfit, and can often generalize well to unseen data and achieve good test accuracy. Understanding these mysterious phenomena on the optimization and generalization of deep neural networks is one of the most fundamental goals in deep learning theory.

Recent breakthroughs have shed light on the optimization and generalization of deep neural networks under the over-parameterized setting, where the hidden layer width is extremely large. In terms of optimization, a line of work (Du et al., 2019b; Allen-Zhu et al., 2019b; Zou et al., 2018; Oymak and Soltanolkotabi, 2019b; Arora et al., 2019b; Zou and Gu, 2019) proved that (stochastic) gradient descent with random initialization can successfully find a global optimum of the training loss function regardless of the labeling of the data, as long as the width of the network is larger than $\mathrm{poly}(n, \epsilon^{-1})$, where $n$ is the training sample size and $\epsilon$ is the target error. For generalization, Allen-Zhu et al. (2019a); Arora et al. (2019a); Cao and Gu (2019b, a); Nitanda and Suzuki (2019) established generalization bounds of neural networks trained with (stochastic) gradient descent under certain data distribution assumptions, when the network width is at least $\mathrm{poly}(n, \epsilon^{-1})$. Although these results have provided important insights into the learning of extremely over-parameterized neural networks, the requirement on the network width is still far from practical settings. Very recently, Ji and Telgarsky (2020) showed that for two-layer ReLU networks, when the training data are well separated, polylogarithmic width is sufficient to guarantee good optimization and generalization performance of neural networks trained by GD/SGD. However, extending their results to multi-layer neural networks is highly nontrivial, and it remains unclear whether similar results hold for deep neural networks.

In fact, most of the aforementioned results can be categorized into the so-called neural tangent kernel (NTK) (Jacot et al., 2018; Du et al., 2019b) regime or lazy training regime (Chizat et al., 2019), where along the whole training process the neural network function behaves similarly to its first-order Taylor expansion at initialization (Jacot et al., 2018; Lee et al., 2019; Arora et al., 2019b; Cao and Gu, 2019a). It is recognized that in order to make the learning of neural networks stay in the NTK regime, a proper scaling with respect to the network width is essential. For example, Cao and Gu (2019a) introduced a scaling factor of $\sqrt{m}$ in their definition of the neural network function, where $m$ is the network width. The same scaling factor has also been applied to the initialization of the output weights in Allen-Zhu et al. (2019b); Zou et al. (2019); Cao and Gu (2019b); Zou and Gu (2019). Many other results in the NTK regime used a different type of parameterization, but essentially have the same scaling factor (Jacot et al., 2018; Du et al., 2019b, a; Arora et al., 2019a, b). In fact, without such a scaling factor, it has been shown that the training of two-layer networks falls in a different regime, namely the "mean-field" regime (Mei et al., 2018; Chizat and Bach, 2018; Chizat et al., 2019; Sirignano and Spiliopoulos, 2019; Rotskoff and Vanden-Eijnden, 2018; Wei et al., 2019; Mei et al., 2019; Fang et al., 2019a, b).

In this paper, we study the optimization and generalization of deep ReLU networks for a wider range of scaling. Specifically, for a ReLU network with $m$ hidden nodes per layer, we generalize the scaling factor introduced in Cao and Gu (2019a) to $m^{\alpha}$, where $\alpha$ is a constant. Note that similar scaling has been studied in Nitanda and Suzuki (2019). We show that for all such $\alpha$, as long as there exists a good neural network weight configuration within a certain distance to the initialization, the global convergence property as well as good generalization performance can be provably established under a mild condition on the neural network width, which is polylogarithmic in the sample size $n$ and the inverse target accuracy $\epsilon^{-1}$, as opposed to the polynomial dependency on $n$ and $\epsilon^{-1}$ proved in prior work (Cao and Gu, 2019b, a). While our results can be seen as a generalization of Ji and Telgarsky (2020) from two-layer networks to deep networks, our proof technique is different from theirs. Specifically, their proof heavily relies on the homogeneity of the ReLU network function, while our proof technique relies on the local linearity of the neural network function around initialization. We show that a moderate linear approximation error can still guarantee a small optimization error for GD/SGD, which was not discovered in prior work (Cao and Gu, 2019b, a). Our contributions are highlighted as follows:

• We establish the global convergence guarantee of GD for training deep ReLU networks for binary classification. Specifically, we prove that for any positive constant $R$, if there exists a good neural network weight configuration within distance $R\cdot m^{-\alpha}$ to the initialization, and the neural network width satisfies $m \ge m^*(\delta, R, L, \alpha)$, GD can achieve training loss at most $\cO(\epsilon)$ within $T = \cO(LR^2m^{-2\alpha}\eta^{-1}\epsilon^{-1})$ iterations, where $m$ is the neural network width, $m^{\alpha}$ is the scaling factor of the neural network, and $L$ is the neural network depth.

• We also establish generalization guarantees for both GD and SGD in the same setting. Specifically, for GD, we establish an $\widetilde\cO(\epsilon^{-2})$ sample complexity for a wide range of network widths. For SGD, we prove an $\widetilde\cO(\epsilon^{-1})$ sample complexity. For both algorithms, our results provide tighter sample complexities based on milder network width conditions compared with existing results.

• Our theoretical results can be generalized to the scenarios with different data separability assumptions studied in the literature, and therefore can cover and improve many existing results in the NTK regime. Specifically, under the data separability assumptions studied in Cao and Gu (2019a); Ji and Telgarsky (2020), our results hold with $R$ logarithmic in $n$, $\epsilon^{-1}$ and $\delta^{-1}$, where $\delta$ is the failure probability parameter. This suggests that a neural network with width polylogarithmic in $n$, $\epsilon^{-1}$ and $\delta^{-1}$ can be learned by GD/SGD with good optimization and generalization guarantees. Moreover, we also show that under a very mild data nondegeneration assumption in Zou et al. (2019), our theoretical result can lead to a sharper over-parameterization condition, which improves the existing results in Zou et al. (2019) under a mild condition on the network depth $L$.

For ease of exposition, we compare our results with the most related previous results in Table 1, in terms of the over-parameterization condition, iteration complexity, and sample complexity. We remark that some of these results are not directly comparable since they are proved based on slightly different assumptions on the training data and/or activation functions. Yet it is fair to conclude that our results are sharper than the prior results for learning deep neural networks, and match the state-of-the-art results (Ji and Telgarsky, 2020) when specialized to two-layer networks.

In terms of optimization, a line of work focuses on the optimization landscape of neural networks (Haeffele and Vidal, 2015; Kawaguchi, 2016; Freeman and Bruna, 2017; Hardt and Ma, 2017; Safran and Shamir, 2018; Xie et al., 2017; Nguyen and Hein, 2017; Soltanolkotabi et al., 2018; Zhou and Liang, 2017; Yun et al., 2018; Du and Lee, 2018; Venturi et al., 2018; Nguyen, 2019). These works study the properties of the landscape of the optimization problem in deep learning, and demonstrate that in certain settings the local minima are also globally optimal. However, most of the positive results along this line of work only hold for simplified cases like linear networks or two-layer networks under certain assumptions on the input/output dimensions and sample size.

For the generalization of neural networks, a vast amount of work has established uniform-convergence-based generalization error bounds (Neyshabur et al., 2015; Bartlett et al., 2017; Neyshabur et al., 2018; Golowich et al., 2018; Arora et al., 2018; Li et al., 2018a). While such results can be applied to the mean-field regime to establish certain generalization bounds (Wei et al., 2019), the bounds are loose when applied to the NTK regime due to the larger scaling of network parameters. For example, some case studies in Cao and Gu (2019b) showed that the resulting uniform-convergence-based generalization bounds are increasing in the network width $m$.

Another important topic on neural networks is the implicit bias of training algorithms such as GD and SGD. Overall, the study of implicit bias aims to figure out the specific properties of the solutions given by a certain training algorithm, as the solutions to the optimization problem may not be unique. Along this line of research, many prior works (Gunasekar et al., 2017; Soudry et al., 2018; Ji and Telgarsky, 2019; Gunasekar et al., 2018a, b; Nacson et al., 2019b; Li et al., 2018b) have studied the implicit regularization/bias of gradient flow, GD, SGD, or mirror descent for matrix factorization, logistic regression, and deep linear networks. However, generalizing these results to deep non-linear neural networks turns out to be much more challenging. Nacson et al. (2019a); Lyu and Li (2019) studied the implicit bias of deep homogeneous models trained by gradient flow, and proved that the convergent direction of the parameters is a KKT point of the max-margin problem. Nevertheless, they cannot handle practical optimization algorithms such as GD and SGD, and did not characterize how large the resulting margin is.

Several recent results have proved that neural networks can outperform kernel methods or behave differently than NTK-based kernel regression under certain conditions. Wei et al. (2019) studied the convergence of noisy Wasserstein flow in the mean-field regime, while Allen-Zhu and Li (2019) studied three-layer ResNets with a scaling similar to the mean-field regime. Moreover, Allen-Zhu et al. (2019a); Bai and Lee (2019) studied fully-connected three-layer or two-layer networks with a scaling similar to the NTK regime, but utilized certain randomization tricks to make the network “almost quadratic” instead of “almost linear” in its parameters, making the network behave differently from the NTK setting.

Notation. For two scalars $a$ and $b$, we use $a \vee b$ and $a \wedge b$ to denote $\max\{a, b\}$ and $\min\{a, b\}$ respectively. For a vector $\vb$, we use $\|\vb\|_2$ to denote its Euclidean norm. For a matrix $\Ab$, we use $\|\Ab\|_2$ and $\|\Ab\|_F$ to denote its spectral norm and Frobenius norm respectively, and denote by $A_{ij}$ the entry of $\Ab$ at the $i$-th row and $j$-th column. Given two matrices $\Ab$ and $\Bb$ with the same dimension, we denote $\langle\Ab, \Bb\rangle = \mathrm{tr}(\Ab^{\top}\Bb)$. Given a collection of matrices $\Wb = \{\Wb_1, \ldots, \Wb_L\}$ and a function $f(\Wb)$ mapping $\Wb$ to $\RR$, we define by $\nabla_{\Wb_l}f(\Wb)$ the partial gradient of $f$ with respect to $\Wb_l$, and denote $\nabla f(\Wb) = \{\nabla_{\Wb_1}f(\Wb), \ldots, \nabla_{\Wb_L}f(\Wb)\}$. Given two collections of matrices $\Ab = \{\Ab_1, \ldots, \Ab_L\}$ and $\Bb = \{\Bb_1, \ldots, \Bb_L\}$, we denote $\langle\Ab, \Bb\rangle = \sum_{l=1}^L\langle\Ab_l, \Bb_l\rangle$ and $\|\Ab\|_F^2 = \sum_{l=1}^L\|\Ab_l\|_F^2$. Given two sequences $\{a_n\}$ and $\{b_n\}$, we denote $a_n = \cO(b_n)$ if $a_n \le C_1 b_n$ for some absolute positive constant $C_1$, $a_n = \Omega(b_n)$ if $a_n \ge C_2 b_n$ for some absolute positive constant $C_2$, and $a_n = \Theta(b_n)$ if $C_2 b_n \le a_n \le C_1 b_n$ for some absolute constants $C_1$ and $C_2$. We also use the notations $\widetilde\cO(\cdot)$ and $\widetilde\Omega(\cdot)$ to hide logarithmic factors in $\cO(\cdot)$ and $\Omega(\cdot)$ respectively. Additionally, we denote $a_n = \mathrm{poly}(b_n)$ if $a_n = \cO(b_n^{D})$ for some positive constant $D$, and $a_n = \mathrm{polylog}(b_n)$ if $a_n = \mathrm{poly}(\log(b_n))$. Moreover, given a collection of matrices $\Wb = \{\Wb_1, \ldots, \Wb_L\}$ and a positive scalar $\tau$, we denote $\cB(\Wb, \tau) = \{\Wb' : \|\Wb'_l - \Wb_l\|_F \le \tau \text{ for all } l \in [L]\}$.

2 Preliminaries on Learning Neural Networks

In this section we introduce the problem setting studied in this paper, including definitions of the network function and loss function, and the detailed training algorithms, i.e., GD and SGD with random initialization.

Neural network function. Given an input $\xb \in \RR^d$, the output of the deep fully-connected ReLU network is defined as follows,

 f_{\Wb}(\xb) = m^{\alpha}\,\Wb_L\,\sigma(\Wb_{L-1}\cdots\sigma(\Wb_1\xb)\cdots),

where $m^{\alpha}$ is a scaling parameter, $\Wb_1 \in \RR^{m\times d}$, $\Wb_l \in \RR^{m\times m}$ for $l = 2, \ldots, L-1$, $\Wb_L \in \RR^{1\times m}$, and $\sigma(z) = \max\{0, z\}$ is the entry-wise ReLU activation function. We denote the collection of all weight matrices as $\Wb = \{\Wb_1, \ldots, \Wb_L\}$.
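As a concrete reference, the network function above can be sketched in a few lines of NumPy. The He-style initialization variances ($2/m$ for hidden layers, $1/m$ for the output layer) follow the scheme used in Cao and Gu (2019a) and are an assumption here, as is the default choice $\alpha = 1/2$:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def init_weights(d, m, L, rng):
    # Gaussian initialization; the 2/m and 1/m variances are a He-style
    # assumption, not necessarily the paper's exact choice.
    Ws = [rng.normal(0.0, np.sqrt(2.0 / m), size=(m, d))]
    for _ in range(L - 2):
        Ws.append(rng.normal(0.0, np.sqrt(2.0 / m), size=(m, m)))
    Ws.append(rng.normal(0.0, np.sqrt(1.0 / m), size=(1, m)))
    return Ws

def forward(Ws, x, alpha=0.5):
    # f_W(x) = m^alpha * W_L sigma(W_{L-1} ... sigma(W_1 x) ...)
    h = x
    for W in Ws[:-1]:
        h = relu(W @ h)
    m = Ws[0].shape[0]
    return float(m ** alpha * (Ws[-1] @ h))
```

The output is a single real number whose sign is used as the binary class prediction.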

Loss function. Given a training dataset $\{(\xb_i, y_i)\}_{i=1}^n$ with inputs $\xb_i \in \RR^d$ and outputs $y_i \in \{-1, +1\}$, we define the training loss function as

 L_S(\Wb) = \frac{1}{n}\sum_{i=1}^n L_i(\Wb),

where $L_i(\Wb) = \ell(y_i f_{\Wb}(\xb_i))$ and $\ell(z) = \log(1 + \exp(-z))$ is the cross-entropy loss.

Algorithms. We consider both GD and SGD with Gaussian random initialization. These two algorithms are displayed in Algorithms 1 and 2 respectively. Specifically, the entries in $\Wb_l$, $l \in [L-1]$, are generated independently from the univariate Gaussian distribution $N(0, 2/m)$, and the entries in $\Wb_L$ are generated independently from $N(0, 1/m)$. For GD, we use the full gradient of the training loss to update the model parameters. For SGD, we use one fresh training example in each iteration.

3 Main Theory

In this section, we present the main theoretical results about the optimization and generalization guarantees of GD and SGD for learning deep ReLU networks. We first make the following assumption on the training data points. {assumption} All training data points satisfy $\|\xb_i\|_2 = 1$, $i \in [n]$. This assumption has been widely made in many previous works (Allen-Zhu et al., 2019b, c; Du et al., 2019b, a; Zou et al., 2019) in order to simplify the theoretical analysis. We also make the following assumption regarding the loss function $L_S(\Wb)$. {assumption} There exist a positive constant $R$ and a collection of matrices $\Wb^* \in \cB(\Wb^{(0)}, R\cdot m^{-\alpha})$ such that $L_i(\Wb^*) \le \epsilon$ for all $i \in [n]$. Considering a sufficiently small $\epsilon$, Assumption 3 spells out that there exists a neural network model with parameters close to the initialization such that all training data points can be correctly classified, i.e., achieving zero training error. We claim that this is a common empirical observation, and thus Assumption 3 can be easily satisfied in practice. Moreover, note that we consider the cross-entropy loss; therefore, Assumption 3 is equivalent, up to constant factors, to requiring $y_i\cdot f_{\Wb^*}(\xb_i) \ge \log(1/\epsilon)$ for all $i \in [n]$. In Section 4, we will show that Assumption 3 can be implied by a variety of assumptions made in prior work.
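The equivalence between small cross-entropy loss and large margin can be checked numerically: $\ell(z) = \log(1 + \exp(-z))$ satisfies $\ell(\log(1/\epsilon)) = \log(1 + \epsilon) \approx \epsilon$, so loss at most $\epsilon$ on example $i$ is, up to constants, a margin $y_i f_{\Wb}(\xb_i)$ of at least $\log(1/\epsilon)$. A numerically stable sketch:

```python
import math

def cross_entropy(z):
    # l(z) = log(1 + exp(-z)) evaluated stably; z = y * f_W(x) is the margin.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))
```

For small $\epsilon$, evaluating at margin $z = \log(1/\epsilon)$ indeed returns a loss close to $\epsilon$.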

In what follows, we deliver our main theoretical results regarding the optimization and generalization guarantees for learning deep ReLU networks. Specifically, we consider two training algorithms: GD and SGD with random initialization (presented in Algorithms 1 and 2). We will thoroughly analyze these two algorithms separately.

The following theorem establishes the global convergence guarantee of GD for training deep ReLU networks for binary classification. {theorem} For any $\delta > 0$, there exists $m^*(\delta, R, L, \alpha)$ that satisfies

 m^*(\delta, R, L, \alpha) = \widetilde\cO\big([\mathrm{poly}(R, L)]^{1/\alpha}\cdot\log^{2/(2-\alpha)}(n/\delta)\big),

such that if $m \ge m^*(\delta, R, L, \alpha)$, with probability at least $1-\delta$ over the initialization, GD with an appropriately chosen step size $\eta$ can train a neural network to achieve at most $3\epsilon$ training loss within $T = \cO(LR^2m^{-2\alpha}\eta^{-1}\epsilon^{-1})$ iterations.

{remark}

Theorem 3.1 suggests that the minimum required neural network width, i.e., $m^*(\delta, R, L, \alpha)$, is polynomially large in $R$ and $L$ and has only a logarithmic dependency on the training sample size $n$ and the failure probability parameter $\delta$. As will be discussed in Section 4, if the training data can be separated by the neural tangent random feature model (Cao and Gu, 2019a) or the shallow neural tangent kernel (Nitanda and Suzuki, 2019; Ji and Telgarsky, 2020), $R$ is at most logarithmic in $n$ and $\epsilon^{-1}$. This further implies that a network width polylogarithmic in $n$ and $\epsilon^{-1}$ is sufficient to guarantee the global convergence of GD. We would also like to remark that Theorem 3.1 does not hold for a larger number of iterations than $T$, which implies that one needs to apply early stopping when running Algorithm 1.

Let $L^{0\text{-}1}_{\cD}(\Wb)$ be the expected 0-1 loss (i.e., expected error) of $f_{\Wb}$. We then characterize the generalization performance of the neural network trained by GD in the following theorem. {theorem} Under the same assumptions as Theorem 3.1, with probability at least $1-\delta$, the iterate $\Wb^{(t)}$ of Algorithm 1 satisfies

 L^{0\text{-}1}_{\cD}(\Wb^{(t)}) \le 2L_S(\Wb^{(t)}) + \widetilde\cO\Big(\min\Big\{4^L L^2 R\sqrt{m/n},\ \frac{L^{3/2}R}{\sqrt{n}} + L^{11/3}R^{4/3}m^{-\alpha/3}\Big\}\Big) + \cO\bigg(\sqrt{\frac{\log(1/\delta)}{n}}\bigg)

for all $t \le T$.

{remark}

Theorem 3.1 provides an algorithm-independent generalization bound. Note that the second term in the bound distinguishes our result from most of the previous work on algorithm-dependent generalization bounds of over-parameterized neural networks (Allen-Zhu et al., 2019a; Arora et al., 2019a; Cao and Gu, 2019b; Yehudai and Shamir, 2019; Cao and Gu, 2019a; Nitanda and Suzuki, 2019). Specifically, while these previous results mainly focus on establishing a bound that does not explode when the network width goes to infinity, our result covers a wider range of $m$, and therefore yields different bounds for small and large $m$. As will be shown in Section 4, under various assumptions made in previous work, Assumption 3 holds with $R$ at most logarithmic in $n$ and $\epsilon^{-1}$, and therefore Theorem 3.1 guarantees a sample complexity of order $\widetilde\cO(\epsilon^{-2})$ for a wide range of network widths, which has not been covered by previous results.

{remark}

A trend can be observed in Theorem 3.1: the generalization error bound first increases with the network width $m$ and then starts to decrease when $m$ becomes even larger. This, to a certain extent, bears a similarity to the "double descent" phenomenon studied in a recent line of work (Belkin et al., 2019a, b; Hastie et al., 2019; Mei and Montanari, 2019). However, since Theorem 3.1 only demonstrates a double descent curve for an upper bound on the generalization error, it is not sufficient to give any conclusive result on the double descent phenomenon. In fact, under certain data separability assumptions and over-parameterization conditions, Ji and Telgarsky (2020) have proved a generalization error bound for two-layer networks that does not depend on $m$. Therefore, it is possible that the double descent curve in our bound is an artifact of our analysis. We believe a further analysis of the generalization error and its relation to the double descent curve is an important future direction.

In this part, we aim to characterize the performance of SGD for training deep ReLU networks. Specifically, the following theorem establishes a generalization error bound in terms of the output of SGD, under a certain condition on the neural network width.

{theorem}

For any $\delta > 0$, there exists $m^*(\delta, R, L, \alpha)$ that satisfies

 m^*(\delta, R, L, \alpha) = \widetilde\cO\big([\mathrm{poly}(R, L)]^{1/\alpha}\cdot\log^{2/(2-\alpha)}(n/\delta)\big),

such that if $m \ge m^*(\delta, R, L, \alpha)$, with probability at least $1-\delta$, SGD with an appropriately chosen step size achieves

 \EE\big[L^{0\text{-}1}_{\cD}(\widehat\Wb)\big] \le \frac{8L^2R^2}{n} + \frac{8\log(1/\delta)}{n} + 24\epsilon,

where the expectation is taken over the uniform draw of $\widehat\Wb$ from $\{\Wb^{(0)}, \ldots, \Wb^{(n-1)}\}$.

{remark}

Theorem 3.2 gives an $\widetilde\cO(L^2R^2\epsilon^{-1})$ sample complexity for deep ReLU networks trained with SGD. Treating $L$ as a constant, then as long as $R = \widetilde\cO(1)$ (which we will verify in Section 4 under various conditions), this is a sample complexity of order $\widetilde\cO(\epsilon^{-1})$. Our result extends the result for two-layer networks proved in Ji and Telgarsky (2020), and improves the results given by Allen-Zhu et al. (2019a); Cao and Gu (2019a) in two aspects. First, the sample complexity is improved from $\widetilde\cO(\epsilon^{-2})$ (Allen-Zhu et al., 2019a; Cao and Gu, 2019a) to $\widetilde\cO(\epsilon^{-1})$. Moreover, while Allen-Zhu et al. (2019a); Cao and Gu (2019a) require $m = \mathrm{poly}(n, \epsilon^{-1})$, our result works for $m$ polylogarithmic in $n$ and $\epsilon^{-1}$.

4 Discussions on Data Separability

In this section, we will discuss different data separability assumptions made in existing work. Specifically, we will show that the assumptions on training data made in Cao and Gu (2019a), Ji and Telgarsky (2020) and Zou et al. (2019) can imply Assumption 3 in certain ways, and thus our theoretical results can be directly applied to these settings.

4.1 Data Separability by Neural Tangent Random Feature Model

We formally restate the definition of the neural tangent random feature (NTRF) model introduced in Cao and Gu (2019a) as follows. {definition} Let $\Wb^{(0)}$ be the initialization weights, and let $f_{\Wb^{(0)},\Wb}(\xb) = f_{\Wb^{(0)}}(\xb) + \langle\nabla f_{\Wb^{(0)}}(\xb), \Wb\rangle$ be a function with respect to the input $\xb$. Then the NTRF function class is defined as follows

 \cF(\Wb^{(0)}, R, \alpha) = \big\{f_{\Wb^{(0)},\Wb}(\cdot) : \Wb \in \cB(\mathbf{0}, R\cdot m^{-\alpha})\big\}.

The NTRF function class is closely related to the neural tangent kernel. For wide enough neural networks, it has been shown that the functions the NTRF model can learn lie in the NTK-induced reproducing kernel Hilbert space (RKHS) (Cao and Gu, 2019a).
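For intuition, the NTRF model can be written out explicitly in the one-hidden-layer case, where the gradients of $f$ at initialization have a closed form. The sketch below (our own notation, with $\alpha = 1/2$ as a default) evaluates $f_{\Wb^{(0)}}(\xb) + \langle\nabla f_{\Wb^{(0)}}(\xb), \Ub\rangle$:

```python
import numpy as np

def two_layer(W1, W2, x, alpha=0.5):
    # f_W(x) = m^alpha * W2 relu(W1 x), the one-hidden-layer special case.
    m = W1.shape[0]
    return m ** alpha * float(W2 @ np.maximum(W1 @ x, 0.0))

def ntrf(W1, W2, U1, U2, x, alpha=0.5):
    # NTRF model: f_{W0}(x) + <grad f_{W0}(x), U>, with the gradient
    # written out analytically for the two-layer case.
    m = W1.shape[0]
    h = W1 @ x
    act = np.maximum(h, 0.0)
    f0 = m ** alpha * float(W2 @ act)
    grad_W2 = m ** alpha * act                                    # shape (m,)
    grad_W1 = m ** alpha * (W2.ravel() * (h > 0))[:, None] * x[None, :]
    return f0 + float(grad_W2 @ U2.ravel()) + float(np.sum(grad_W1 * U1))
```

Because the NTRF model is the first-order Taylor expansion of the network at $\Wb^{(0)}$, for a small perturbation direction it closely matches the true network output.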

The following proposition states that if there is a function in the NTRF function class that achieves small training loss, then Assumption 3 can also be satisfied. {proposition} Suppose there is a function $f(\cdot) \in \cF(\Wb^{(0)}, R, \alpha)$ such that $\ell(y_i f(\xb_i)) \le \epsilon$ for all $i \in [n]$; then Assumption 3 can be satisfied with radius parameter $\cO(R)$.

Proposition 4.1 states that if the training data can be well classified by a function in the NTRF function class, they can also be well learned by deep ReLU networks. However, one may ask in which case there exists such a good function in the NTRF function class, and what is the corresponding value of $R$? We provide such an example by introducing the following assumption on the neural tangent random features, i.e., $\nabla f_{\Wb^{(0)}}(\xb_i)$.

{assumption}

There exists a collection of matrices $\Ub^* = \{\Ub^*_1, \ldots, \Ub^*_L\}$ satisfying a unit-norm constraint, such that for all $i \in [n]$ we have

 y_i\big\langle\nabla f_{\Wb^{(0)}}(\xb_i), \Ub^*\big\rangle \ge m^{\alpha}\gamma,

where $\gamma$ is an absolute positive constant. Based on Proposition 4.1, the following corollary shows that under Assumption 4.1, Assumption 3 can be satisfied with a certain choice of $R$.

{corollary}

Under Assumption 4.1, Assumption 3 can be satisfied with an $R$ that is logarithmic in $n$, $\epsilon^{-1}$ and $\delta^{-1}$, up to an absolute constant factor. {remark} Corollary 4.1 shows that if the NTRFs of all training data are linearly separable with a constant margin, Assumption 3 can be satisfied with a radius parameter $R$ logarithmic in $n$, $\epsilon^{-1}$ and $\delta^{-1}$. Substituting this result into Theorems 3.1, 3.1 and 3.2, it can be shown that a neural network with width polylogarithmic in $n$, $\epsilon^{-1}$ and $\delta^{-1}$ suffices to guarantee good optimization and generalization performance for both GD and SGD.

4.2 Data Separability by Shallow Neural Tangent Model

In this subsection we study the data separation assumption made in Ji and Telgarsky (2020) and show that our results cover this particular setting. We first restate the assumption as follows. {assumption} There exist $\gamma > 0$ and $\bar\ub : \RR^d \rightarrow \RR^d$ with $\|\bar\ub(\zb)\|_2 \le 1$ for any $\zb \in \RR^d$, such that for all $i \in [n]$,

 y_i\int_{\RR^d}\sigma'(\langle\zb, \xb_i\rangle)\cdot\langle\bar\ub(\zb), \xb_i\rangle\,\mathrm{d}\mu_N(\zb) \ge \gamma,

where $\mu_N$ denotes the standard normal distribution over $\RR^d$. Assumption 4.2 concerns the linear separability of the gradients with respect to the first-layer parameters at random initialization, where the randomness is replaced with an integral by taking the infinite-width limit. Note that similar assumptions have also been studied in Cao and Gu (2019b); Nitanda and Suzuki (2019); Frei et al. (2019). In particular, Nitanda and Suzuki (2019) consider the gradients with respect to the first-layer weights but use smooth activation functions, which differs from our setting. The assumptions made in Cao and Gu (2019b); Frei et al. (2019) concern gradients with respect to the second-layer weights instead of the first-layer weights. In the following, we mainly focus on Assumption 4.2; however, we remark that our result also covers the setting studied in Cao and Gu (2019b); Frei et al. (2019).
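The integral in Assumption 4.2 can be estimated by straightforward Monte Carlo sampling of $\zb \sim N(0, \Ib)$. The sketch below (function names are ours) computes the worst-case margin over the training set for a candidate $\bar\ub$:

```python
import numpy as np

def shallow_ntk_margin(X, y, u_bar, num_samples=200000, seed=0):
    # Monte Carlo estimate of
    #   min_i  y_i * E_{z ~ N(0, I)}[ sigma'(<z, x_i>) * <u_bar(z), x_i> ],
    # the margin quantity lower-bounded by gamma in Assumption 4.2.
    rng = np.random.default_rng(seed)
    Z = rng.normal(size=(num_samples, X.shape[1]))
    margins = []
    for xi, yi in zip(X, y):
        active = (Z @ xi > 0).astype(float)   # sigma'(<z, x_i>) for ReLU
        vals = active * (u_bar(Z) @ xi)       # <u_bar(z), x_i>
        margins.append(yi * float(vals.mean()))
    return min(margins)
```

For a constant unit vector $\bar\ub(\zb) \equiv \vb$, the integral factorizes as $\tfrac12\langle\vb, \xb_i\rangle$ since $\sigma'(\langle\zb, \xb_i\rangle)$ is an indicator with probability $1/2$ under the standard normal.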

In order to make a fair comparison, we reduce our results for multilayer networks to the one-hidden-layer setting:

 f_{\Wb}(\xb) = m^{\alpha}\Wb_2\,\sigma(\Wb_1\xb).

Then we provide the following proposition, which states that Assumption 4.2 can also imply Assumption 3 with a certain choice of $R$.

{proposition}

Suppose the training data satisfy Assumption 4.2. Then if the neural network width is sufficiently large, Assumption 3 can be satisfied with an $R$ that is logarithmic in $n$ and $\epsilon^{-1}$ and polynomial in $\gamma^{-1}$, up to an absolute constant factor. {remark} Proposition 4.2 suggests that for two-layer ReLU networks, under Assumption 4.2, Assumption 3 can be satisfied with $R$ logarithmic in $n$ and $\epsilon^{-1}$. Plugging this into Theorem 3.1 and setting $\alpha = 1/2$, the condition on the neural network width becomes polylogarithmic in $n$ and $\epsilon^{-1}$, which matches the condition proved in Ji and Telgarsky (2020).

4.3 Class-dependent Data Nondegeneration

In Zou et al. (2019), an assumption on the minimum distance between inputs from different classes is made to guarantee the convergence of gradient descent to a global minimum. We restate this assumption as follows. {assumption} For all $i, j \in [n]$, if $y_i \ne y_j$, then $\|\xb_i - \xb_j\|_2 \ge \phi$ for some absolute constant $\phi > 0$. In contrast to the data nondegeneration assumption (i.e., no duplicate data points) made in Allen-Zhu et al. (2019b); Du et al. (2019b, a); Oymak and Soltanolkotabi (2019a); Zou and Gu (2019), Assumption 4.3 only requires that the data points from different classes be nondegenerate; thus we call it the class-dependent data nondegeneration assumption. Assumption 4.3 is clearly milder, since it allows data points to be arbitrarily close as long as they are from the same class, while the data nondegeneration assumption requires that any two data points be separated by a constant distance.
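Assumption 4.3 is easy to check numerically. The sketch below computes the minimum distance between examples carrying different labels, the quantity lower-bounded by $\phi$ (a brute-force $O(n^2)$ check, for illustration only):

```python
import numpy as np

def class_separation(X, y):
    # Minimum Euclidean distance between examples with different labels.
    # Same-class pairs are ignored, so duplicates within a class are allowed,
    # matching the class-dependent nondegeneration assumption.
    d = np.inf
    for i in range(len(X)):
        for j in range(len(X)):
            if y[i] != y[j]:
                d = min(d, float(np.linalg.norm(X[i] - X[j])))
    return d
```

The returned value being bounded away from zero is exactly what the assumption requires; duplicate points inside one class do not affect it.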

Then we provide the following proposition, which shows that Assumption 4.3 also implies Assumption 3 for certain choices of $R$ and $m$. {proposition} Suppose the training data points satisfy Assumption 4.3. Then if

 m = \Omega\big(\big[L^2 n^{9/2}\phi^{-2}\log(n/(\delta\epsilon))\big]^{1/\alpha}\big),

Assumption 3 can be satisfied with a certain $R$ polynomial in $n$ and $\phi^{-1}$, up to a constant factor.

{remark}

Proposition 4.3 suggests that when the neural network is sufficiently wide, as long as there exist no duplicate training data from different classes, Assumption 3 can still be satisfied with an appropriate choice of $R$. We can also plug this result into Theorem 3.1, which yields an over-parameterization condition polynomial in $n$ and $\phi^{-1}$ for an appropriate choice of $\alpha$. Compared with the counterpart proved in Zou et al. (2019), our result is strictly sharper whenever the network depth $L$ satisfies a certain mild condition.

5 Experiments

In this section, we conduct simple experiments to validate our theory. Since our paper mainly focuses on binary classification, we use a subset of the original CIFAR10 dataset (Krizhevsky et al., 2009) that only contains two classes of images. We train fully-connected ReLU networks on this binary classification dataset with different sample sizes $n$, and plot the minimal neural network width that is required to achieve zero training error in Figure 1 (solid line). We also plot polylogarithmic reference curves in $n$ in dashed lines for comparison. It is evident that the required network width to achieve zero training error is polylogarithmic in the sample size $n$, which is consistent with our theory.
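The experimental protocol can be sketched as follows, with the training routine `train_error(n, m)` left abstract (its name and signature are ours):

```python
import math

def minimal_widths(train_error, sample_sizes, width_grid):
    # For each sample size n, scan an increasing grid of widths m and record
    # the smallest m whose trained network reaches zero training error.
    # train_error(n, m) is assumed to train a width-m network on n examples
    # and return its final training error.
    found = {}
    for n in sample_sizes:
        for m in width_grid:
            if train_error(n, m) == 0:
                found[n] = m
                break
    return found
```

Plotting `found` against $n$ on a log-scaled axis, together with polylogarithmic reference curves, reproduces the kind of comparison shown in Figure 1.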

6 Proof of the Main Theory

In this section we present the proofs of Theorems in Section 3. The proofs of theorems in Section 4 are deferred to Appendix A.

6.1 Proof of Theorem 3.1

We first present the following lemma, which characterizes the local linearity of the neural network function. {lemma}[Lemma 4.1 in Cao and Gu (2019a)] With high probability over the randomness of initialization, for all $i \in [n]$ and all $\Wb, \Wb' \in \cB(\Wb^{(0)}, \tau)$ with $\tau$ sufficiently small, it holds that

 \big|f_{\Wb'}(\xb_i) - f_{\Wb}(\xb_i) - \langle\nabla f_{\Wb}(\xb_i), \Wb' - \Wb\rangle\big| = \cO\big(\tau^{1/3}L^2 m^{\alpha}\sqrt{\log m}\big)\sum_{l=1}^{L-1}\|\Wb'_l - \Wb_l\|_2. (1)

Here we make a slight modification to the original version in Cao and Gu (2019a), as our neural network function encloses an additional scaling parameter $m^{\alpha}$. Unlike Cao and Gu (2019a), which requires the approximation error (i.e., the R.H.S. of (1)) to be smaller than the target test error $\epsilon$, we only require it to be upper bounded by some constant in our analysis, which consequently leads to a much milder condition on the neural network width. Based on this, we assume all iterates are close to the initialization, and establish a convergence guarantee for GD in the following lemma. {lemma} Set the step size $\eta$ and the iteration number $T$ appropriately. Then given $t' \le T$, suppose $\Wb^{(t)} \in \cB(\Wb^{(0)}, \tau)$ for all $t < t'$. With probability at least $1-\delta$ over the randomness of initialization, it holds that

 \sum_{l=1}^L\big\|\Wb^{(0)}_l - \Wb^*_l\big\|_F^2 - \sum_{l=1}^L\big\|\Wb^{(t')}_l - \Wb^*_l\big\|_F^2 \ge \eta\cdot\bigg[\sum_{t=0}^{t'-1}L_S(\Wb^{(t)}) - 2t'\epsilon\bigg],

The remaining part is to characterize under which condition on $m$ we can guarantee that all iterates stay inside the required region until convergence. Based on Lemmas 6.1 and 6.1, we complete the remaining proof as follows.

Proof of Theorem 3.1.

In the following proof we choose the step size $\eta$ and the iteration number $T$ as specified in Lemma 6.1. Note that Lemmas 6.1 and 6.1 hold with probability at least $1-\delta$ over the randomness of initialization. Therefore, if the neural network width satisfies

 m = \Omega\big(L^{1/(2-\alpha)}\log^{2/(2-\alpha)}(m)\cdot\log^{2/(2-\alpha)}(nL^2/\delta)\big), (2)

then with probability at least $1-\delta$ all results in Lemmas 6.1 and 6.1 hold.

Then we prove the theorem in two parts: 1) we show that all iterates stay inside the region $\cB(\Wb^{(0)}, \tau)$; and 2) we show that GD can find a neural network with at most $3\epsilon$ training loss within $T$ iterations.

All iterates stay inside $\cB(\Wb^{(0)}, \tau)$. We prove this part by induction. Specifically, given $t' \le T$, we assume the hypothesis $\Wb^{(t)} \in \cB(\Wb^{(0)}, \tau)$ holds for all $t < t'$ and prove $\Wb^{(t')} \in \cB(\Wb^{(0)}, \tau)$. First, it is clear that the hypothesis holds for $t' = 0$. Then by Lemma 6.1 and the fact that $L_S(\Wb^{(t)}) \ge 0$, we have

 \sum_{l=1}^L\big\|\Wb^{(t')}_l - \Wb^*_l\big\|_F^2 \le \sum_{l=1}^L\big\|\Wb^{(0)}_l - \Wb^*_l\big\|_F^2 + 2\eta t'\epsilon.

Note that $\sum_{l=1}^L\|\Wb^{(0)}_l - \Wb^*_l\|_F^2 \le LR^2m^{-2\alpha}$ and $2\eta t'\epsilon \le 2\eta T\epsilon = \cO(LR^2m^{-2\alpha})$, so we have

 \sum_{l=1}^L\big\|\Wb^{(t')}_l - \Wb^*_l\big\|_F^2 \le CLR^2m^{-2\alpha},

where $C$ is an absolute constant. Therefore, by the triangle inequality, we further have the following for all $l \in [L]$:

 \|\Wb^{(t')}_l - \Wb^{(0)}_l\|_F \le \|\Wb^{(t')}_l - \Wb^*_l\|_F + \|\Wb^{(0)}_l - \Wb^*_l\|_F \le \sqrt{CL}\,Rm^{-\alpha} + Rm^{-\alpha} \le 2\sqrt{CL}\,Rm^{-\alpha}. (3)

Therefore, in order to guarantee that $\Wb^{(t')} \in \cB(\Wb^{(0)}, \tau)$, it suffices to ensure that $2\sqrt{CL}Rm^{-\alpha} \le \tau$, which holds by our choice of $m$. Combining this with the condition on $m$ in (2), we have that if

 m \ge m^*(\delta, R, L, \alpha) = \widetilde\cO\big([R^4L^{11}]^{1/\alpha}\cdot\log^{2/(2-\alpha)}(n/\delta)\big), (4)

then the iterates stay inside the region $\cB(\Wb^{(0)}, \tau)$, which completes the proof of the first part.

Convergence of gradient descent. Lemma 6.1 implies

 \sum_{l=1}^L\big\|\Wb^{(0)}_l - \Wb^*_l\big\|_F^2 - \sum_{l=1}^L\big\|\Wb^{(T)}_l - \Wb^*_l\big\|_F^2 \ge \eta\bigg(\sum_{t=0}^{T-1}L_S(\Wb^{(t)}) - 2T\epsilon\bigg).

Dividing both sides by $\eta T$, we get

 \frac{1}{T}\sum_{t=0}^{T-1}L_S(\Wb^{(t)}) \le \frac{\sum_{l=1}^L\|\Wb^{(0)}_l - \Wb^*_l\|_F^2}{\eta T} + 2\epsilon \le \frac{LR^2m^{-2\alpha}}{\eta T} + 2\epsilon \le 3\epsilon,

where the second inequality is by the fact that $\sum_{l=1}^L\|\Wb^{(0)}_l - \Wb^*_l\|_F^2 \le LR^2m^{-2\alpha}$, and the last inequality is by our choices of $\eta$ and $T$, which ensure that $LR^2m^{-2\alpha}/(\eta T) \le \epsilon$. Notice that $\min_{t < T}L_S(\Wb^{(t)}) \le \frac{1}{T}\sum_{t=0}^{T-1}L_S(\Wb^{(t)}) \le 3\epsilon$. This completes the proof of the second part, and therefore the proof of the theorem. ∎

6.2 Proof of Theorem 3.1

Following Cao and Gu (2019b), we first introduce the definition of the surrogate loss of the network, which is defined via the derivative of the loss function. {definition} We define the empirical surrogate error $\cE_S(\Wb)$ and the population surrogate error $\cE_{\cD}(\Wb)$ as follows:

 \cE_S(\Wb) = -\frac{1}{n}\sum_{i=1}^n\ell'\big(y_i f_{\Wb}(\xb_i)\big), \qquad \cE_{\cD}(\Wb) = \EE_{(\xb, y)\sim\cD}\big[-\ell'\big(y f_{\Wb}(\xb)\big)\big].
The following lemma gives a uniform-convergence type result for the surrogate error, utilizing the fact that $-\ell'(\cdot)$ is bounded and Lipschitz continuous. {lemma} For any $\delta > 0$ and $\widetilde R > 0$, suppose that $m$ is sufficiently large. Then with probability at least $1-\delta$, it holds that

 |\cE_{\cD}(\Wb) - \cE_S(\Wb)| \le \widetilde\cO\Big(\min\Big\{4^L L^{3/2}\widetilde R\sqrt{m/n},\ \frac{L\widetilde R}{\sqrt{n}} + L^3\widetilde R^{4/3}m^{-\alpha/3}\Big\}\Big) + \cO\bigg(\sqrt{\frac{\log(1/\delta)}{n}}\bigg)

for all $\Wb \in \cB(\Wb^{(0)}, \widetilde R\cdot m^{-\alpha})$.

We are now ready to prove Theorem 3.1, which combines the trajectory distance analysis in the proof of Theorem 3.1 with Lemma 6.2.

Proof of Theorem 3.1.

With exactly the same proof as Theorem 3.1, by (3) and induction we have $\Wb^{(t)} \in \cB(\Wb^{(0)}, \widetilde R\cdot m^{-\alpha})$ with $\widetilde R = 2\sqrt{CL}R$. Therefore by Lemma 6.2, we have

 |\cE_{\cD}(\Wb^{(t)}) - \cE_S(\Wb^{(t)})| \le \widetilde\cO\Big(\min\Big\{4^L L^2 R\sqrt{m/n},\ \frac{L^{3/2}R}{\sqrt{n}} + L^{11/3}R^{4/3}m^{-\alpha/3}\Big\}\Big) + \cO\bigg(\sqrt{\frac{\log(1/\delta)}{n}}\bigg)

for all $t \le T$. Note that we have $L^{0\text{-}1}_{\cD}(\Wb) \le 2\cE_{\cD}(\Wb)$ and $\cE_S(\Wb) \le L_S(\Wb)$. Therefore,

 \EE\big[L^{0\text{-}1}_{\cD}(\widehat\Wb)\big] \le 2\EE\big[\cE_{\cD}(\widehat\Wb)\big] \le 2\EE\big[L_S(\widehat\Wb)\big] + \widetilde\cO\Big(\min\Big\{4^L L^2 R\sqrt{m/n},\ \frac{L^{3/2}R}{\sqrt{n}} + L^{11/3}R^{4/3}m^{-\alpha/3}\Big\}\Big) + \cO\bigg(\sqrt{\frac{\log(1/\delta)}{n}}\bigg).

This finishes the proof. ∎

6.3 Proof of Theorem 3.2

In this section we provide the proof of Theorem 3.2. The following result is the counterpart of Lemma 6.1 for SGD. {lemma} Set the step size $\eta$ appropriately. Then given a positive integer $n' \le n$, suppose $\Wb^{(i)} \in \cB(\Wb^{(0)}, \tau)$ for all $i < n'$. With probability at least $1-\delta$ over the randomness of initialization, it holds that

 \sum_{l=1}^L\big\|\Wb^{(0)}_l - \Wb^*_l\big\|_F^2 - \sum_{l=1}^L\big\|\Wb^{(n')}_l - \Wb^*_l\big\|_F^2 \ge \eta\bigg(\sum_{i=1}^{n'}L_i(\Wb^{(i-1)}) - 2n'\epsilon\bigg),

Our proof is based on an application of Lemma 6.3 and an online-to-batch conversion argument (Cesa-Bianchi et al., 2004), which is inspired by Cao and Gu (2019a); Ji and Telgarsky (2020). We introduce the surrogate loss $\cE_i(\Wb) = -\ell'(y_i f_{\Wb}(\xb_i))$ and its population version $\cE_{\cD}(\Wb)$, which have been used in Ji and Telgarsky (2018); Cao and Gu (2019a); Nitanda and Suzuki (2019); Ji and Telgarsky (2020). The following lemma is provided in Ji and Telgarsky (2020); its proof only relies on the boundedness of $-\ell'(\cdot)$ and is therefore applicable in our setting.

{lemma}

For any $\delta > 0$, with probability at least $1-\delta$, the iterates of Algorithm 2 satisfy

 \frac{1}{n}\sum_{i=1}^n\cE_{\cD}(\Wb^{(i-1)}) \le \frac{4}{n}\sum_{i=1}^n\cE_i(\Wb^{(i-1)}) + \frac{4\log(1/\delta)}{n}.
Proof of Theorem 3.2.

Similar to the proof of Theorem 3.1, we prove this theorem in two parts: 1) all iterates stay inside $\cB(\Wb^{(0)}, \tau)$; and 2) the convergence of SGD.

All iterates stay inside $\cB(\Wb^{(0)}, \tau)$.