How Much Overparameterization Is Sufficient to Learn Deep ReLU Networks?
Abstract
A recent line of research on deep learning focuses on the extremely overparameterized setting, and shows that when the network width is larger than a high-degree polynomial of the training sample size and the inverse of the target accuracy, deep neural networks learned by (stochastic) gradient descent enjoy nice optimization and generalization guarantees. Very recently, it was shown that under a certain margin assumption on the training data, a polylogarithmic width condition suffices for two-layer ReLU networks to converge and generalize (Ji and Telgarsky, 2020). However, how much overparameterization is sufficient to guarantee optimization and generalization for deep neural networks remains an open question. In this work, we establish sharp optimization and generalization guarantees for deep ReLU networks. Under various assumptions made in previous work, our optimization and generalization guarantees hold with network width polylogarithmic in the sample size and the inverse of the target accuracy. Our results push the study of overparameterized deep neural networks towards more practical settings.
1 Introduction
Deep neural networks have become one of the most important and prevalent machine learning models due to their remarkable power in various real-world applications. However, the success of deep learning has not been well explained in theory. It remains mysterious why standard training algorithms tend to find a globally optimal solution, despite the highly non-convex landscape of the training loss function. Moreover, despite the extremely large number of parameters, deep neural networks rarely overfit, and can often generalize well to unseen data and achieve good test accuracy. Understanding these mysterious phenomena concerning the optimization and generalization of deep neural networks is one of the most fundamental goals in deep learning theory.
Recent breakthroughs have shed light on the optimization and generalization of deep neural networks in the overparameterized setting, where the hidden layer width is extremely large. In terms of optimization, a line of work (Du et al., 2019b; Allen-Zhu et al., 2019b; Zou et al., 2018; Oymak and Soltanolkotabi, 2019b; Arora et al., 2019b; Zou and Gu, 2019) proved that (stochastic) gradient descent with random initialization can successfully find a global optimum of the training loss function regardless of the labeling of the data, as long as the width of the network is larger than a polynomial of the training sample size n and the inverse of the target error ε^{-1}. For generalization, Allen-Zhu et al. (2019a); Arora et al. (2019a); Cao and Gu (2019b, a); Nitanda and Suzuki (2019) established generalization bounds for neural networks trained with (stochastic) gradient descent under certain data distribution assumptions, again when the network width is at least polynomial in n and ε^{-1}. Although these results have provided important insights into the learning of extremely overparameterized neural networks, the requirement on the network width is still far from practical settings. Very recently, Ji and Telgarsky (2020) showed that for two-layer ReLU networks, when the training data are well separated, polylogarithmic width is sufficient to guarantee good optimization and generalization performance of neural networks trained by GD/SGD. However, extending their results to multi-layer neural networks is highly nontrivial, and it remains unclear whether similar results hold for deep neural networks.
In fact, most of the aforementioned results fall into the so-called neural tangent kernel (NTK) regime (Jacot et al., 2018; Du et al., 2019b), also known as the lazy training regime (Chizat et al., 2019), where throughout the training process the neural network function behaves similarly to its first-order Taylor expansion at initialization (Jacot et al., 2018; Lee et al., 2019; Arora et al., 2019b; Cao and Gu, 2019a). It is recognized that in order to keep the learning of neural networks in the NTK regime, a proper scaling with respect to the network width is essential. For example, Cao and Gu (2019a) introduced a width-dependent scaling factor in their definition of the neural network function, where the scaling shrinks as the network width m grows. The same scaling factor has also been applied to the initialization of the output weights in Allen-Zhu et al. (2019b); Zou et al. (2019); Cao and Gu (2019b); Zou and Gu (2019). Many other results in the NTK regime used a different type of parameterization, but essentially have the same scaling factor (Jacot et al., 2018; Du et al., 2019b, a; Arora et al., 2019a, b). In fact, without such a scaling factor, it has been shown that the training of two-layer networks falls into a different regime, namely the "mean-field" regime (Mei et al., 2018; Chizat and Bach, 2018; Chizat et al., 2019; Sirignano and Spiliopoulos, 2019; Rotskoff and Vanden-Eijnden, 2018; Wei et al., 2019; Mei et al., 2019; Fang et al., 2019a, b).
In this paper, we study the optimization and generalization of deep ReLU networks for a wider range of scalings. Specifically, for a ReLU network with m hidden nodes per layer, we generalize the scaling factor introduced in Cao and Gu (2019a) to m^{-α}, where α is a constant. Note that a similar scaling has been studied in Nitanda and Suzuki (2019). We show that for all such α, as long as there exists a good neural network weight configuration within a certain distance of the initialization, the global convergence property as well as good generalization performance can be provably established under a mild condition on the neural network width, namely polylogarithmic in the sample size n and the inverse target accuracy ε^{-1}, as opposed to the polynomial dependence proved in prior work (Cao and Gu, 2019b, a). While our results can be seen as a generalization of Ji and Telgarsky (2020) from two-layer networks to deep networks, our proof technique is different from theirs. Specifically, their proof heavily relies on the homogeneity of the ReLU network function, while ours relies on the local linearity of the neural network function around initialization. We show that a moderate linear approximation error can still guarantee a small optimization error for GD/SGD, which was not discovered in prior work (Cao and Gu, 2019b, a). Our contributions are highlighted as follows:

We establish a global convergence guarantee of GD for training deep ReLU networks for binary classification. Specifically, we prove that for any positive constant R, if there exists a good neural network weight configuration within distance R of the initialization, and the neural network width satisfies a condition polylogarithmic in the sample size and the inverse target accuracy, then GD can achieve ε training loss within a bounded number of iterations, where the bound depends on the neural network width m, the scaling factor m^{-α} of the neural network, and the neural network depth L.

We also establish generalization guarantees for both GD and SGD in the same setting. Specifically, for GD, we establish a sample complexity bound that holds for a wide range of network widths. For SGD, we prove an analogous sample complexity bound. For both algorithms, our results provide tighter sample complexities under milder network width conditions compared with existing results.

Our theoretical results can be generalized to scenarios with different data separability assumptions studied in the literature, and therefore cover and improve many existing results in the NTK regime. Specifically, under the data separability assumptions studied in Cao and Gu (2019a); Ji and Telgarsky (2020), our results hold with R polylogarithmic in the sample size n, the inverse target accuracy ε^{-1}, and the inverse failure probability δ^{-1}, where δ is the failure probability parameter. This suggests that a neural network of polylogarithmic width can be learned by GD/SGD with good optimization and generalization guarantees. Moreover, we also show that under a very mild data nondegeneration assumption from Zou et al. (2019), our theoretical result leads to a sharper overparameterization condition, which improves the existing results in Zou et al. (2019) for a certain range of network depths L.
For ease of exposition, we compare our results with the most closely related previous results in Table 1, in terms of the overparameterization condition, iteration complexity, and sample complexity. We remark that some of these results are not directly comparable, since they are proved under slightly different assumptions on the training data and/or activation functions. Yet it is fair to conclude that our results are sharper than prior results for learning deep neural networks, and match the state-of-the-art results (Ji and Telgarsky, 2020) when specialized to two-layer networks.
| Work | Algorithm | Overpara. Condition | Iter. Complexity | Sample Complexity | Network |
|---|---|---|---|---|---|
| Allen-Zhu et al. (2019a) | SGD | | | | Shallow |
| Arora et al. (2019a) | GD | | | | Shallow |
| Nitanda and Suzuki (2019) | GD | | | | Shallow |
| Ji and Telgarsky (2020) | GD | | | | Shallow |
| Ji and Telgarsky (2020) | SGD | | | | Shallow |
| Cao and Gu (2019b) | GD | | | | Deep |
| Cao and Gu (2019a) | SGD | | | | Deep |
| This paper | GD | | | | Deep |
| This paper | SGD | | | | Deep |
1.1 Additional Related Work
In terms of optimization, a line of work focuses on the optimization landscape of neural networks (Haeffele and Vidal, 2015; Kawaguchi, 2016; Freeman and Bruna, 2017; Hardt and Ma, 2017; Safran and Shamir, 2018; Xie et al., 2017; Nguyen and Hein, 2017; Soltanolkotabi et al., 2018; Zhou and Liang, 2017; Yun et al., 2018; Du and Lee, 2018; Venturi et al., 2018; Nguyen, 2019). These works study the properties of the landscape of the optimization problem in deep learning, and demonstrate that in certain situations the local minima are also globally optimal. However, most of the positive results along this line of work only hold for simplified cases such as linear networks or two-layer networks, under certain assumptions on the input/output dimensions and the sample size.
For the generalization of neural networks, a vast amount of work has established uniform-convergence based generalization error bounds (Neyshabur et al., 2015; Bartlett et al., 2017; Neyshabur et al., 2018; Golowich et al., 2018; Arora et al., 2018; Li et al., 2018a). While such results can be applied to the mean-field regime to establish certain generalization bounds (Wei et al., 2019), the bounds are loose when applied to the NTK regime due to the larger scaling of the network parameters. For example, case studies in Cao and Gu (2019b) showed that the resulting uniform-convergence based generalization bounds increase with the network width m.
Another important topic on neural networks is the implicit bias of training algorithms such as GD and SGD. Overall, the study of implicit bias aims to figure out the specific properties of the solutions given by a certain training algorithm, as the solutions to the optimization problem may not be unique. Along this line of research, many prior works (Gunasekar et al., 2017; Soudry et al., 2018; Ji and Telgarsky, 2019; Gunasekar et al., 2018a, b; Nacson et al., 2019b; Li et al., 2018b) have studied the implicit regularization/bias of gradient flow, GD, SGD, or mirror descent for matrix factorization, logistic regression, and deep linear networks. However, generalizing these results to deep nonlinear neural networks turns out to be much more challenging. Nacson et al. (2019a); Lyu and Li (2019) studied the implicit bias of deep homogeneous models trained by gradient flow, and proved that the convergent direction of the parameters is a KKT point of the max-margin problem. Nevertheless, these analyses cannot handle practical optimization algorithms such as GD and SGD, and do not characterize how large the resulting margin is.
Several recent results have proved that neural networks can outperform kernel methods or behave differently from NTK-based kernel regression under certain conditions. Wei et al. (2019) studied the convergence of noisy Wasserstein flow in the mean-field regime, while Allen-Zhu and Li (2019) studied three-layer ResNets with a scaling similar to the mean-field regime. Moreover, Allen-Zhu et al. (2019a); Bai and Lee (2019) studied fully connected three-layer or two-layer networks with a scaling similar to the NTK regime, but utilized certain randomization tricks to make the network "almost quadratic" instead of "almost linear" in its parameters, making the network behave differently from the NTK setting.
Notation. For two scalars a and b, we use a ∨ b and a ∧ b to denote max{a, b} and min{a, b}, respectively. For a vector x we use ‖x‖₂ to denote its Euclidean norm. For a matrix A, we use ‖A‖₂ and ‖A‖_F to denote its spectral norm and Frobenius norm, respectively, and denote by A_{jk} the entry of A in the j-th row and k-th column. Given two matrices A and B of the same dimension, we denote ⟨A, B⟩ = tr(A⊤B). Given a collection of matrices W = {W₁, …, W_L} and a function f(W) mapping W to a scalar, we define by ∇_{W_l} f(W) the partial gradient of f(W) with respect to W_l. Given two collections of matrices W = {W₁, …, W_L} and W′ = {W′₁, …, W′_L}, we denote ⟨W, W′⟩ = Σ_{l=1}^L ⟨W_l, W′_l⟩ and ‖W‖_F² = Σ_{l=1}^L ‖W_l‖_F². Given two sequences {a_n} and {b_n}, we denote a_n = O(b_n) if a_n ≤ C₁ b_n for some absolute positive constant C₁, a_n = Ω(b_n) if a_n ≥ C₂ b_n for some absolute positive constant C₂, and a_n = Θ(b_n) if C₂ b_n ≤ a_n ≤ C₁ b_n for some absolute constants C₁ and C₂. We also use the notations Õ(·) and Ω̃(·) to hide logarithmic factors in O(·) and Ω(·), respectively. Additionally, we denote a_n ≲ b_n if a_n = O(b_n), and a_n ≳ b_n if a_n = Ω(b_n). Moreover, given a collection of matrices W and a positive scalar τ, we denote by B(W, τ) the set of all collections of matrices within Frobenius-norm distance τ of W.
2 Preliminaries on Learning Neural Networks
In this section we introduce the problem setting studied in this paper, including definitions of the network function and loss function, and the detailed training algorithms, i.e., GD and SGD with random initialization.
Neural network function. Given an input x ∈ ℝ^d, the output of a deep fully connected ReLU network is defined as follows,
where m^{-α} is a scaling parameter, σ(z) = max{0, z} is the entrywise ReLU activation function, and W₁ ∈ ℝ^{m×d}, W₂, …, W_L ∈ ℝ^{m×m} are the weight matrices. We denote the collection of all weight matrices as W = {W₁, …, W_L}.
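To make the parameterization concrete, the following is a minimal numpy sketch of the forward pass. The m^{-α} output scaling and the He-style N(0, 2/m) initialization are our reading of the setup (α = 1/2 would recover the standard NTK-style scaling); the exact constants and the treatment of the output layer are illustrative assumptions, not the paper's precise definition.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def forward(x, weights, v, alpha=0.5):
    """Forward pass of a fully connected ReLU network with output
    scaling m**(-alpha), where m is the hidden width."""
    h = x
    for W in weights:          # hidden layers
        h = relu(W @ h)
    m = weights[0].shape[0]    # hidden width
    return (m ** (-alpha)) * (v @ h)

def init_params(d, m, L, rng):
    """Gaussian random initialization: N(0, 2/m) entries for the hidden
    layer matrices (assumed He-style variance), N(0, 1) for the output
    weight vector v."""
    weights = [rng.normal(0.0, np.sqrt(2.0 / m), size=(m, d))]
    weights += [rng.normal(0.0, np.sqrt(2.0 / m), size=(m, m)) for _ in range(L - 1)]
    v = rng.normal(0.0, 1.0, size=m)
    return weights, v
```

With this scaling, the network output at initialization stays O(m^{1/2 - α}), which is the lever the paper's generalized scaling turns.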
Loss function. Given a training dataset {(x_i, y_i)}_{i=1}^n with inputs x_i ∈ ℝ^d and labels y_i ∈ {−1, +1}, we define the training loss function as L(W) = n^{-1} Σ_{i=1}^n ℓ(y_i f_W(x_i)),
where ℓ(z) = log(1 + exp(−z)) is the cross-entropy loss.
Algorithms. We consider both GD and SGD with Gaussian random initialization. These two algorithms are displayed in Algorithms 1 and 2, respectively. Specifically, the entries of the hidden layer weight matrices W₁, …, W_L are generated independently from a univariate Gaussian distribution N(0, 2/m), and the entries of the output layer weights are generated independently from N(0, 1). For GD, we use the full gradient to update the model parameters at each iteration. For SGD, we use a single fresh training example at each iteration.
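As a concrete (two-layer) instance of the full-batch procedure in Algorithm 1, the sketch below runs GD on the cross-entropy loss with manually derived gradients. The width, scaling exponent, step size, and iteration count are illustrative choices, not the paper's constants, and training both layers is an assumption of this sketch.

```python
import numpy as np

def train_gd(X, y, m=256, alpha=0.5, eta=1.0, iters=300, seed=0):
    """Full-batch GD on a two-layer ReLU net f(x) = m**(-alpha) * v.T relu(W x),
    minimizing (1/n) sum_i ell(y_i f(x_i)) with ell(z) = log(1 + exp(-z))."""
    n, d = X.shape
    rng = np.random.default_rng(seed)
    W = rng.normal(0.0, np.sqrt(2.0 / m), size=(m, d))
    v = rng.normal(0.0, 1.0, size=m)
    scale = m ** (-alpha)
    for _ in range(iters):
        pre = X @ W.T                         # (n, m) pre-activations
        H = np.maximum(pre, 0.0)              # hidden activations
        f = scale * (H @ v)                   # network outputs
        # ell'(z) = -1 / (1 + exp(z)); c_i = (1/n) y_i ell'(y_i f_i)
        c = y * (-1.0 / (1.0 + np.exp(y * f))) / n
        grad_v = scale * (H.T @ c)
        grad_W = scale * (((pre > 0) * v[None, :] * c[:, None]).T @ X)
        v = v - eta * grad_v
        W = W - eta * grad_W
    H = np.maximum(X @ W.T, 0.0)
    loss = np.logaddexp(0.0, -y * (scale * (H @ v))).mean()
    return W, v, loss
```

On separable toy data the training loss decreases monotonically for small enough step sizes, matching the behavior the convergence analysis in Section 3 formalizes.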
3 Main Theory
In this section, we present the main theoretical results on the optimization and generalization guarantees of GD and SGD for learning deep ReLU networks. We first make the following assumption on the training data points. {assumption} All training data points satisfy ‖x_i‖₂ = 1, i = 1, …, n. This assumption has been widely made in previous work (Allen-Zhu et al., 2019b, c; Du et al., 2019b, a; Zou et al., 2019) in order to simplify the theoretical analysis. We also make the following assumption regarding the loss function. {assumption} There exist a positive constant R and a collection of weights W* within distance R of the initialization such that ℓ(y_i f_{W*}(x_i)) ≤ ε for all i ∈ [n]. Considering a sufficiently small ε, Assumption 3 spells out that there exists a neural network model with parameters close to the initialization such that all training data points can be correctly classified, i.e., achieving zero training error. This is a common empirical observation, so Assumption 3 can be easily satisfied in practice. Moreover, note that we consider the cross-entropy loss; therefore, Assumption 3 is equivalent to requiring y_i f_{W*}(x_i) ≥ log(1/(e^ε − 1)) for all i ∈ [n]. In Section 4, we will show that Assumption 3 is implied by a variety of assumptions made in prior work.
In what follows, we present our main theoretical results regarding the optimization and generalization guarantees of learning deep ReLU networks. Specifically, we consider two training algorithms, GD and SGD with random initialization (presented in Algorithms 1 and 2), and analyze them separately.
3.1 Gradient Descent
The following theorem establishes the global convergence guarantee of GD for training deep ReLU networks for binary classification. {theorem} For any ε > 0, there exists a width threshold m*(ε) that satisfies
such that if m ≥ m*(ε), then with probability at least 1 − δ over the initialization, GD with an appropriately chosen step size can train a neural network to achieve at most ε training loss within T iterations.
Theorem 3.1 suggests that the minimum required neural network width, i.e., m*(ε), is polynomially large in R and ε^{-1} and has only a logarithmic dependence on the training sample size n and the failure probability parameter δ. As will be discussed in Section 4, if the training data can be separated by the neural tangent random feature model (Cao and Gu, 2019a) or the shallow neural tangent kernel (Nitanda and Suzuki, 2019; Ji and Telgarsky, 2020), R is polylogarithmic in n and ε^{-1}. This further implies that a width polylogarithmic in n and ε^{-1} is sufficient to guarantee the global convergence of GD. We would also like to remark that Theorem 3.1 does not hold for an arbitrarily large number of iterations, which implies that one needs to apply early stopping when running Algorithm 1.
Let L^{0-1}(W) denote the expected 0-1 loss (i.e., the expected classification error) of f_W. Then we characterize the generalization performance of the neural network trained by GD in the following theorem. {theorem} Under the same assumptions as Theorem 3.1, with probability at least 1 − δ, the iterate W^{(t)} of Algorithm 1 satisfies
for all t = 0, 1, …, T.
Theorem 3.1 provides an algorithm-independent generalization bound. Note that the second term in the bound distinguishes our result from most of the previous work on algorithm-dependent generalization bounds of overparameterized neural networks (Allen-Zhu et al., 2019a; Arora et al., 2019a; Cao and Gu, 2019b; Yehudai and Shamir, 2019; Cao and Gu, 2019a; Nitanda and Suzuki, 2019). Specifically, while these previous results mainly focus on establishing a bound that does not explode when the network width goes to infinity, our result covers a wider range of network widths m, and therefore yields different bounds for small and large m. As will be shown in Section 4, under various assumptions made in previous work, Assumption 3 holds with R polylogarithmic in n and ε^{-1}, and therefore Theorem 3.1 guarantees a corresponding sample complexity for networks of polylogarithmic width, which has not been covered by previous results.
A trend can be observed in Theorem 3.1: the generalization error bound first increases with the network width m and then starts to decrease when m becomes even larger. This, to a certain extent, bears a similarity to the "double descent" phenomenon studied in a recent line of work (Belkin et al., 2019a, b; Hastie et al., 2019; Mei and Montanari, 2019). However, since Theorem 3.1 only demonstrates a double descent curve for an upper bound on the generalization error, it is not sufficient to give any conclusive result on the double descent phenomenon. In fact, under certain data separability assumptions and overparameterization conditions, Ji and Telgarsky (2020) proved a generalization error bound for two-layer networks that does not depend on the network width. Therefore, it is possible that the double descent curve in our bound is an artifact of our analysis. We believe a further analysis of the generalization error and its relation to the double descent curve is an important future direction.
3.2 Stochastic Gradient Descent
In this part, we aim to characterize the performance of SGD for training deep ReLU networks. Specifically, the following theorem establishes a generalization error bound for the output of SGD, under a certain condition on the neural network width.
For any ε > 0, there exists a width threshold m*(ε) that satisfies
such that if m ≥ m*(ε), then with probability at least 1 − δ, SGD with an appropriately chosen step size achieves
where the expectation is taken over the uniform draw of an iterate from the SGD trajectory.
Theorem 3.2 gives a sample complexity bound for deep ReLU networks trained with SGD. Treating the depth L as a constant, as long as R is polylogarithmic in the problem parameters (which we will verify in Section 4 under various conditions), this yields a sample complexity polylogarithmic in the network width. Our result extends the result for two-layer networks proved in Ji and Telgarsky (2020), and improves the results given by Allen-Zhu et al. (2019a); Cao and Gu (2019a) in two aspects. First, the sample complexity is improved over that of Allen-Zhu et al. (2019a); Cao and Gu (2019a). Moreover, while Allen-Zhu et al. (2019a); Cao and Gu (2019a) require the network width to be polynomial in n and ε^{-1}, our result works for polylogarithmic width.
4 Discussions on Data Separability
In this section, we will discuss different data separability assumptions made in existing work. Specifically, we will show that the assumptions on training data made in Cao and Gu (2019a), Ji and Telgarsky (2020) and Zou et al. (2019) can imply Assumption 3 in certain ways, and thus our theoretical results can be directly applied to these settings.
4.1 Data Separability by Neural Tangent Random Feature Model
We formally restate the definition of the Neural Tangent Random Feature (NTRF) model introduced in Cao and Gu (2019a) as follows. {definition} Let W^{(0)} be the initialization weights, and let F_{W^{(0)}, W}(x) = f_{W^{(0)}}(x) + ⟨∇_W f_{W^{(0)}}(x), W − W^{(0)}⟩ be the first-order Taylor approximation of the network at initialization. Then the NTRF function class is defined as the set of all such functions F_{W^{(0)}, W}(·) with W within a given distance of the initialization.
The NTRF function class is closely related to the neural tangent kernel. For wide enough neural networks, it has been shown that the functions the NTRF model can learn are contained in the NTK-induced reproducing kernel Hilbert space (RKHS) (Cao and Gu, 2019a).
The following proposition states that if there is a good function in the NTRF function class that achieves small training loss, then Assumption 3 can also be satisfied. {proposition} Suppose there is a function F in the NTRF function class such that ℓ(y_i F(x_i)) ≤ ε for all i ∈ [n]; then Assumption 3 can be satisfied with a radius parameter proportional to that of the NTRF class.
Proposition 4.1 states that if the training data can be well classified by a function in the NTRF function class, they can also be well learned by deep ReLU networks. However, one may ask in which case there exists such a good function in the NTRF function class, and what the corresponding value of R is. We provide such an example by introducing the following assumption on the neural tangent random features, i.e., the gradients ∇_W f_{W^{(0)}}(x_i) at initialization.
There exists a collection of matrices U = {U₁, …, U_L} satisfying ‖U‖_F ≤ 1, such that for all i ∈ [n] we have
y_i ⟨∇_W f_{W^{(0)}}(x_i), U⟩ ≥ γ, where γ is an absolute positive constant.
Under Assumption 4.1, Assumption 3 can be satisfied with R logarithmic in n, ε^{-1}, and δ^{-1}, up to an absolute constant factor. {remark} Corollary 4.1 shows that if the NTRF features of all training data are linearly separable with constant margin, Assumption 3 can be satisfied with a radius parameter logarithmic in n, ε^{-1}, and δ^{-1}. Substituting this result into Theorems 3.1, 3.1 and 3.2, it can be shown that a neural network of polylogarithmic width suffices to guarantee good optimization and generalization performance for both GD and SGD.
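The NTRF features above are just the per-example gradients of the network at initialization. The sketch below computes them for a two-layer network; which parameters are included in the feature map and the m^{-α} scaling are illustrative assumptions of this sketch, not the paper's exact definition. A useful sanity check exploits that the two-layer ReLU network is degree-2 homogeneous in its joint parameters, so the inner product of the feature vector with the initialization itself recovers twice the network output.

```python
import numpy as np

def ntrf_features(X, W0, v0, alpha=0.5):
    """Stack the gradients of f(x) = m**(-alpha) v.T relu(W x) w.r.t.
    (W, v), evaluated at initialization, into one vector per example."""
    m, d = W0.shape
    scale = m ** (-alpha)
    pre = X @ W0.T                               # (n, m) pre-activations
    H = np.maximum(pre, 0.0)
    S = (pre > 0).astype(float)                  # ReLU activation pattern
    # gradient w.r.t. W: one (m, d) block per example, then flattened
    gW = scale * (S * v0[None, :])[:, :, None] * X[:, None, :]
    gv = scale * H                               # gradient w.r.t. v
    return np.concatenate([gW.reshape(len(X), -1), gv], axis=1)
```

Assumption 4.1 then asks for a unit-norm direction in this feature space that separates the labels with constant margin.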
4.2 Data Separability by Shallow Neural Tangent Model
In this subsection we study the data separation assumption made in Ji and Telgarsky (2020) and show that our results cover this particular setting. We first restate the assumption as follows. {assumption} There exist γ > 0 and a mapping ū: ℝ^d → ℝ^d with ‖ū(z)‖₂ ≤ 1 for all z ∈ ℝ^d, such that for all i ∈ [n],
y_i E_{z∼N(0,I)}[σ′(⟨z, x_i⟩)⟨ū(z), x_i⟩] ≥ γ, where N(0, I) denotes the standard normal distribution. Assumption 4.2 concerns the linear separability of the gradients with respect to the first layer parameters at random initialization, where the randomness is replaced with an integral by taking the infinite-width limit. Note that similar assumptions have also been studied in Cao and Gu (2019b); Nitanda and Suzuki (2019); Frei et al. (2019). In particular, Nitanda and Suzuki (2019) consider the gradients with respect to the first layer weights but use smooth activation functions, which is not our setting. The assumption made in Cao and Gu (2019b); Frei et al. (2019) concerns gradients with respect to the second layer weights instead of the first layer weights. In the following, we mainly focus on Assumption 4.2; however, we remark that our result also covers the setting studied in Cao and Gu (2019b); Frei et al. (2019).
In order to make a fair comparison, we reduce our results for multi-layer networks to the one-hidden-layer setting:
Then we provide the following proposition, which states that Assumption 4.2 can also imply Assumption 3 with a certain choice of R.
Suppose the training data satisfy Assumption 4.2. Then if the neural network width m is sufficiently large, Assumption 3 can be satisfied with R depending polynomially on the inverse margin γ^{-1} and logarithmically on n, ε^{-1}, and δ^{-1}, up to an absolute constant factor.
{remark}
Proposition 4.2 suggests that for two-layer ReLU networks, under Assumption 4.2, Assumption 3 can be satisfied with R polynomial in γ^{-1} and logarithmic in n and ε^{-1}. Plugging this into Theorem 3.1, the condition on the neural network width becomes
4.3 Classdependent Data Nondegeneration
In Zou et al. (2019), an assumption on the minimum distance between inputs from different classes is made to guarantee the convergence of gradient descent to a global minimum.
We restate this assumption as follows.
{assumption}
For all i, j ∈ [n], if y_i ≠ y_j, then ‖x_i − x_j‖₂ ≥ φ for some absolute constant φ > 0.
In contrast to the data nondegeneration assumption (i.e., no duplicate data points) made in Allen-Zhu et al. (2019b); Du et al. (2019b, a); Oymak and Soltanolkotabi (2019a); Zou and Gu (2019), Assumption 4.3 only requires that training data points from different classes be separated, and is therefore milder: duplicate data points within the same class are allowed.
Then we provide the following proposition, which shows that Assumption 4.3 also implies Assumption 3 for certain choices of R and m. {proposition} Suppose the training data points satisfy Assumption 4.3. Then if the neural network width m is sufficiently large,
Assumption 3 can be satisfied with R depending on n, φ^{-1}, and ε, up to a constant factor.
Proposition 4.3 suggests that when the neural network is sufficiently wide, as long as there exist no duplicate training data points from different classes, Assumption 3 can still be satisfied with an appropriate choice of R. We can also plug this result into Theorem 3.1, which yields an overparameterization condition that improves upon Zou et al. (2019) for a certain range of network depths.
5 Experiments
In this section, we conduct simple experiments to validate our theory. Since our paper mainly focuses on binary classification, we use a subset of the original CIFAR10 dataset (Krizhevsky et al., 2009) containing only two classes of images. We train a fully connected ReLU network on this binary classification dataset with different sample sizes n, and plot in Figure 1 (solid line) the minimal neural network width required to achieve zero training error. We also plot polylogarithmic reference curves in dashed lines for comparison. It is evident that the required network width is polylogarithmic in the sample size n, which is consistent with our theory.
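The minimal-width search used in this experiment can be sketched as a simple doubling search over candidate widths. Here `fits` is a placeholder for the actual train-and-check routine on the CIFAR10 subset (train a network of width m and report whether it reaches zero training error); it is not part of the paper's code.

```python
def minimal_width(fits, m_max=4096):
    """Doubling search for the smallest power-of-two width m <= m_max
    such that fits(m) is True (i.e., training reaches zero error).

    `fits` is a hypothetical callable standing in for the experiment's
    training routine; returns None if no width up to m_max succeeds."""
    m = 1
    while m <= m_max:
        if fits(m):
            return m
        m *= 2
    return None
```

A doubling search keeps the number of (expensive) training runs logarithmic in the final width, at the cost of resolving the threshold only up to a factor of 2.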
6 Proof of the Main Theory
In this section we present the proofs of Theorems in Section 3. The proofs of theorems in Section 4 are deferred to Appendix A.
6.1 Proof of Theorem 3.1
We first present the following lemma, which characterizes the near-linearity of the neural network function around initialization. {lemma}[Lemma 4.1 in Cao and Gu (2019a)] With probability at least 1 − δ over the randomness of the initialization, for all i ∈ [n] and all weight collections W, W′ within a certain distance of the initialization, it holds that
(1) 
Here we make a slight modification to the original version in Cao and Gu (2019a), as our neural network function includes an additional scaling parameter m^{-α}. Unlike Cao and Gu (2019a), which requires the approximation error (i.e., the R.H.S. of (1)) to be smaller than the target test error, we only require it to be upper bounded by some constant in our analysis, which consequently leads to a much milder condition on the neural network width. Based on this, we assume all iterates are close to the initialization, and establish a convergence guarantee for GD in the following lemma. {lemma} Set the step size η appropriately. Then, given T and supposing all iterates stay close to the initialization for t ≤ T, with probability at least 1 − δ over the randomness of the initialization it holds that
It remains to characterize under which condition on the network width m we can guarantee that all iterates stay inside the required region until convergence. Based on Lemmas 6.1 and 6.2, we complete the proof as follows.
Proof of Theorem 3.1.
In the following proof we fix the choices of the radius and step size as above. Note that Lemmas 6.1 and 6.2 hold with probability at least 1 − δ over the randomness of the initialization. Therefore, if the neural network width satisfies
(2) 
then with probability at least 1 − δ, all results in Lemmas 6.1 and 6.2 hold.
We then prove the theorem in two parts: 1) we show that all iterates stay inside the required region around the initialization; and 2) we show that GD finds a neural network with at most ε training loss within T iterations.
All iterates stay inside the region. We prove this part by induction. Specifically, we assume the induction hypothesis holds for all iterations up to t and prove that it also holds at iteration t + 1. First, it is clear that the hypothesis holds at initialization. Then by Lemma 6.1, together with the fact stated above, we have
By the choices of the step size and the radius, we further have
where C is an absolute constant. Therefore, by the triangle inequality, we have the following for all t ≤ T,
(3) 
Therefore, in order to guarantee that the next iterate stays inside the region, it suffices to ensure that the right-hand side of (3) is small enough, which holds by our choice of parameters. Combining this with the condition on m in (2), we have that if
(4) 
then the iterates stay inside the required region, which completes the proof of the first part.
Convergence of gradient descent. Lemma 6.2 implies
Dividing both sides by the number of iterations, we get
where the second inequality follows from the bounds established above, and the last inequality is by our choices of the step size and T. This completes the proof of the second part, and therefore the proof of the theorem. ∎
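The local linearity that drives this proof can be probed numerically. The sketch below measures, for a two-layer instance, the gap between the network at a perturbed point and its first-order Taylor expansion around the reference weights; the architecture, m^{-α} scaling, and the choice of perturbing only the hidden layer are illustrative assumptions of this probe, not the lemma's exact setting.

```python
import numpy as np

def two_layer(x, W, v, alpha=0.5):
    m = W.shape[0]
    return (m ** (-alpha)) * (v @ np.maximum(W @ x, 0.0))

def linearization_gap(x, W0, v, delta, alpha=0.5):
    """|f(W0 + delta) - f(W0) - <grad_W f(W0), delta>| for a two-layer
    ReLU net; nonzero only for units whose activation pattern flips."""
    m = W0.shape[0]
    scale = m ** (-alpha)
    f0 = two_layer(x, W0, v, alpha)
    f1 = two_layer(x, W0 + delta, v, alpha)
    # gradient w.r.t. W at W0: scale * (v_j * 1{w_j . x > 0}) x^T rows
    g = scale * ((W0 @ x > 0) * v)[:, None] * x[None, :]
    return abs(f1 - f0 - np.sum(g * delta))
```

For perturbations that are small relative to the width-dependent scale, few ReLU activation patterns flip and the gap stays moderate, which is exactly the property the analysis only needs up to a constant.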
6.2 Proof of Theorem 3.1
Following Cao and Gu (2019b), we first introduce the definition of the surrogate error of the network, which is defined via the derivative of the loss function. {definition} We define the empirical surrogate error and the population surrogate error as follows:
The following lemma gives a uniform-convergence type result, utilizing the fact that the derivative of the loss function is bounded and Lipschitz continuous. {lemma} For any δ > 0, suppose the neural network width is sufficiently large. Then with probability at least 1 − δ, it holds that
for all weight configurations within the considered neighborhood of the initialization.
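Assuming the cross-entropy loss ℓ(z) = log(1 + e^{-z}) from Section 2, the empirical surrogate error admits a one-line sketch: since −ℓ′(z) = 1/(1 + e^z), each summand lies in (0, 1), and because −ℓ′(z) ≥ 1/2 whenever z ≤ 0, the 0-1 training error is at most twice the surrogate error.

```python
import numpy as np

def empirical_surrogate_error(margins):
    """-(1/n) * sum_i ell'(y_i f(x_i)) with ell(z) = log(1 + exp(-z)),
    given the margins y_i f(x_i); each term is 1 / (1 + exp(margin))."""
    margins = np.asarray(margins, dtype=float)
    return (1.0 / (1.0 + np.exp(margins))).mean()
```

This boundedness (each term in (0, 1)) and the 1-Lipschitzness of z ↦ 1/(1 + e^z) are exactly the two properties the uniform-convergence lemma relies on.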
6.3 Proof of Theorem 3.2
In this section we provide the proof of Theorem 3.2. The following result is the counterpart of Lemma 6.2 for SGD. {lemma} Set the step size η appropriately. Then, given a positive integer T and supposing all iterates stay close to the initialization for t ≤ T, with probability at least 1 − δ over the randomness of the initialization it holds that
Our proof is based on an application of Lemma 6.3 and an online-to-batch conversion argument (Cesa-Bianchi et al., 2004), inspired by Cao and Gu (2019a); Ji and Telgarsky (2020). We introduce a surrogate loss and its population version, which have been used in Ji and Telgarsky (2018); Cao and Gu (2019a); Nitanda and Suzuki (2019); Ji and Telgarsky (2020). The following lemma is provided in Ji and Telgarsky (2020); its proof only relies on the boundedness of the loss derivative and is therefore applicable in our setting.
For any δ > 0, with probability at least 1 − δ, the iterates of Algorithm 2 satisfy