A Priori Estimates of the Population Risk for Two-Layer Neural Networks
Abstract
New estimates for the population risk are established for two-layer neural networks. These estimates are nearly optimal in the sense that the error rates scale in the same way as the Monte Carlo error rates. They are equally effective in the overparametrized regime, when the network size is much larger than the size of the dataset. These new estimates are a priori in nature, in the sense that the bounds depend only on some norms of the underlying functions to be fitted, not on the parameters of the model, in contrast with most existing results, which are a posteriori in nature. Using these a priori estimates, we provide a perspective for understanding why two-layer neural networks perform better than the related kernel methods.
In memory of Professor David Shenou Cai
Two-layer neural network; Barron space; Population risk; A priori estimate; Rademacher complexity
41A46; 41A63; 62J02; 65D05
1 Introduction
One of the main challenges in theoretical machine learning is to understand the errors in neural network models [43]. To this end, it is useful to draw an analogy with classical approximation theory and finite element analysis [13]. There are two kinds of error bounds in finite element analysis, depending on whether the target solution (the ground truth) or the numerical solution enters into the bounds. Let $f^*$ and $\hat f$ be the true solution and the "numerical solution", respectively. "A priori" error estimates usually take the form
$$\|\hat f - f^*\| \lesssim \|f^*\|_{*},$$
where only norms of the true solution enter into the bounds. In "a posteriori" error estimates, the norms of the numerical solution enter into the bounds:
$$\|\hat f - f^*\| \lesssim \|\hat f\|_{*}.$$
Here $\|\cdot\|$ and $\|\cdot\|_{*}$ denote various norms. In this language, most recent theoretical results [35, 7, 24, 32, 33, 34] on estimating the generalization error of neural networks should be viewed as "a posteriori" analysis, since the bounds depend on various norms of the neural network model obtained after the training process. As was observed in [18, 4, 34], the numerical values of these norms are very large, yielding vacuous bounds. For example, [34] calculated the values of various a posteriori bounds for some real two-layer neural networks, and found that even the best bounds are vacuously large.
In this paper, we pursue a different line of attack by providing an "a priori" analysis. Specifically, we focus on two-layer networks, and we consider models with explicit regularization. We establish estimates for the population risk which are asymptotically sharp, with constants depending only on the properties of the target function. Our numerical results suggest that such regularization terms are necessary in order for the model to be "well-posed" (see Section 7 for the precise meaning).
Specifically, our main contributions are:

We establish a priori estimates of the population risk for learning two-layer neural networks with explicit regularization. These a priori estimates depend on the Barron norm of the target function. The rates with respect to the number of parameters and the number of samples are comparable to the Monte Carlo rate. In addition, our estimates hold in the high-dimensional and overparametrized regimes.

We make a comparison between neural networks and kernel methods using these a priori estimates. We show that two-layer neural networks can be understood as kernel methods with the kernel adaptively selected from the data. This understanding partially explains why neural networks perform better than kernel methods in practice.
The present paper is the first in a series of papers in which we analyze neural network models using a classical numerical analysis perspective. Subsequent papers will consider deep neural network models [19, 20], the optimization and implicit regularization problem using gradient descent dynamics [22, 20] and the general function spaces and approximation theory in high dimensions [21].
2 Related work
There are two key problems in learning two-layer neural networks: optimization and generalization. Recent progress on optimization suggests that overparametrization is the key factor leading to a nice empirical landscape [38, 23, 36], thus facilitating convergence towards global minima for gradient-based optimizers [31, 17, 12]. This makes the generalization property of two-layer neural networks more puzzling, since naive arguments would suggest that more parameters imply worse generalization ability, contradicting what is observed in practice. In what follows, we survey previous attempts to analyze the generalization properties of two-layer neural network models.
2.1 Explicit regularization
This line of work studies the generalization properties of two-layer neural networks with explicit regularization, and our work lies in this category. Let $n$ and $m$ denote the number of samples and the number of parameters, respectively. For two-layer sigmoidal networks, [6] established a risk bound combining an approximation term of order $1/m$ with an estimation term of order $(md/n)\log n$. By considering smoother activation functions, [27] proved another bound, for the case when $m$ is not too large compared to $n$. Both of these results are proved for regularized estimators. In comparison, the error rate established in this paper is sharper, in fact nearly optimal, and it is also applicable in the overparametrized regime. For a detailed comparison, please refer to Table 1.
                 rate           overparametrization
rate of [6]      (see [6])      No
rate of [27]     (see [27])     No
our rate         (Theorem 4)    Yes
More recently, [41] considered explicit regularization for classification problems. They proved that for the cross-entropy loss, the regularization path converges to the maximum-margin solution. They also proved an a priori bound on how the network size affects the margin. However, their analysis is restricted to the case where the data are well-separated. Our result does not have this restriction.
2.2 Implicit regularization
Another line of work studies how gradient descent (GD) and stochastic gradient descent (SGD) find generalizable solutions. [9] proved that SGD learns overparametrized networks that provably generalize for binary classification problems. However, it is not clear how the population risk depends on the number of samples in their compression-based generalization bound. Moreover, their proof relies heavily on the strong assumption that the data are linearly separable. The experiments in [34] suggest that increasing the network width can improve the test accuracy of the solutions found by SGD. They tried to explain this phenomenon by an initialization-dependent (a posteriori) generalization bound. However, in their experiments the largest width considered is still comparable to the number of samples, rather than much larger. Furthermore, their generalization bounds are arbitrarily loose in practice, so their result cannot tell us whether GD can find generalizable solutions for arbitrarily wide networks.
In [15] and [1], it is proved that GD with a particularly chosen initialization, learning rate and early stopping can find generalizable solutions, provided the network is sufficiently wide. These results differ from ours in several aspects. First, both of them assume that the target function lies in the reproducing kernel Hilbert space (RKHS) induced by the initialization distribution, which is much smaller than the Barron space we consider. Secondly, a careful accounting of the polynomial orders in the two papers shows that the sample complexities they provide scale worse than the one proved here. See also [3, 10] for some even more recent results.
Recent work in [22, 20] has shown clearly that for the kind of initialization schemes considered in these previous works or in the overparametrized regime, the neural network models do not perform better than the corresponding kernel method with a kernel defined by the initialization. These results do not rule out the possibility that neural network models can still outperform kernel methods in some regimes, but they do show that finding these regimes is quite nontrivial.
3 Preliminaries
We begin by recalling the basics of twolayer neural networks and their approximation properties.
The problem of interest is to learn a function from a training set of $n$ examples $S = \{(x_i, y_i)\}_{i=1}^n$, i.i.d. samples drawn from an underlying distribution, which is assumed fixed but known only through the samples. Our target function is $f^*(x) = \mathbb{E}[y \mid x]$. We assume that the values of $y$ are given through the decomposition $y = f^*(x) + \varepsilon$, where $\varepsilon$ denotes the noise. For simplicity, we assume that the data lie in $X = [-1, 1]^d$ and that $f^*$ is bounded.
The two-layer neural network is defined by
$$f_m(x; \theta) = \sum_{k=1}^{m} a_k\, \sigma(b_k^\top x), \qquad (3.1)$$
where $a_k \in \mathbb{R}$, $b_k \in \mathbb{R}^d$, and $\sigma$ is a nonlinear scale-invariant activation function, such as ReLU [30] and Leaky ReLU [25], both of which satisfy $\sigma(tz) = t\,\sigma(z)$ for any $t \ge 0$ and $z \in \mathbb{R}$. Without loss of generality, we assume $\sigma$ is 1-Lipschitz continuous. In the formula (3.1), we omit the bias terms for notational simplicity; the effect of the bias terms can be incorporated if we assume that the first component of $x$ is always 1. We say that a network is overparametrized if the network width $m$ exceeds the number of samples $n$. We also define a truncated form of $f_m$ by clipping its values to a bounded interval; by an abuse of notation, in the following we still use $f_m$ to denote it. We will use $\theta = \{(a_k, b_k)\}_{k=1}^{m}$ to denote all the parameters to be learned from the training data.
The ultimate goal is to minimize the population risk
$$L(\theta) = \mathbb{E}_{(x,y)}\big[\ell\big(f_m(x;\theta),\, y\big)\big].$$
In practice, we have to work with the empirical risk
$$\hat L_n(\theta) = \frac{1}{n}\sum_{i=1}^{n} \ell\big(f_m(x_i;\theta),\, y_i\big).$$
Here the loss function is $\ell(y, y') = \frac{1}{2}(y - y')^2$, unless it is specified otherwise.
Define the path norm [35],
$$\|\theta\|_P = \sum_{k=1}^{m} |a_k|\, \|b_k\|_1. \qquad (3.2)$$
We will consider the regularized model defined as follows: {definition} For a two-layer neural network of width $m$, we define the regularized risk as
$$J_\lambda(\theta) = \hat L_n(\theta) + \lambda \sqrt{\frac{\log(2d)}{n}}\,\big(\|\theta\|_P + 1\big).$$
The term $+1$ on the right-hand side is included only to simplify the proof. Our result also holds if we do not include this term in the regularized risk. The corresponding regularized estimator is defined as
$$\hat\theta_n \in \operatorname*{argmin}_{\theta}\, J_\lambda(\theta).$$
Here $\lambda > 0$ is a tuning parameter that controls the balance between the fitting error and the model complexity. It is worth noting that the minimizer is not necessarily unique, and $\hat\theta_n$ should be understood as any of the minimizers.
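To make the objects above concrete, here is a minimal numerical sketch (our own illustration, not from the paper): the network, the path norm, and the penalized objective follow (3.1), (3.2) and the definition above, with ReLU as the activation and the squared loss.

```python
import numpy as np

def relu(z):
    # Scale-invariant activation: relu(t*z) = t*relu(z) for t > 0
    return np.maximum(z, 0.0)

def two_layer_net(X, a, B):
    """f_m(x; theta) = sum_k a_k * sigma(b_k^T x), evaluated on the rows of X.
    a: (m,) outer weights; B: (m, d) inner weights b_k."""
    return relu(X @ B.T) @ a

def path_norm(a, B):
    # ||theta||_P = sum_k |a_k| * ||b_k||_1, the quantity being penalized
    return float(np.sum(np.abs(a) * np.sum(np.abs(B), axis=1)))

def regularized_risk(a, B, X, Y, lam):
    """J(theta) = empirical squared loss + lam * sqrt(log(2d)/n) * (||theta||_P + 1)."""
    n, d = X.shape
    fit = 0.5 * np.mean((two_layer_net(X, a, B) - Y) ** 2)
    return fit + lam * np.sqrt(np.log(2 * d) / n) * (path_norm(a, B) + 1.0)
```

By the scale invariance of ReLU, the network output is positively homogeneous in $x$, and the path norm is invariant under the rescaling $(a_k, b_k) \mapsto (t a_k, b_k / t)$ for $t > 0$.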
In the following, we will call a function with Lipschitz constant $L$ an $L$-Lipschitz continuous function. We will use $a \lesssim b$ to indicate that $a \le C b$ for some universal constant $C > 0$.
3.1 Barron space
We begin by defining the natural function space associated with twolayer neural networks, which we will refer to as the Barron space to honor the pioneering work that Barron has done on this subject [5, 28, 27, 29]. A more complete discussion can be found in [21].
Let $\Omega = \mathbb{R}^d$, let $\mathcal{B}$ be the Borel $\sigma$-algebra on $\Omega$, and let $P(\Omega)$ be the collection of all probability measures on $(\Omega, \mathcal{B})$. Let $\mathcal{F}$ be the collection of functions that admit the following integral representation:
$$f(x) = \int_{\Omega} a(b)\,\sigma(b^\top x)\, d\pi(b), \qquad (3.3)$$
where $\pi \in P(\Omega)$ and $a(\cdot)$ is a measurable function with respect to $(\Omega, \mathcal{B})$. For any $f \in \mathcal{F}$ and $p \ge 1$, we define the following norm
$$\gamma_p(f) = \inf_{(a,\pi)} \Big( \mathbb{E}_{\pi}\big[\, |a(b)|^p\, \|b\|_1^p \,\big] \Big)^{1/p}, \qquad (3.4)$$
where the infimum is taken over all representations $(a, \pi)$ of $f$ in the form (3.3).
[Barron space] We define the Barron space by
$$\mathcal{B}_p = \big\{ f \in \mathcal{F} : \gamma_p(f) < \infty \big\}.$$
Since $\pi$ is a probability distribution, by Hölder's inequality, for any $p \le q$ we have $\gamma_p(f) \le \gamma_q(f)$. Thus, we have $\mathcal{B}_q \subset \mathcal{B}_p$.
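The Hölder step can be spelled out explicitly. For a fixed representation $(a, \pi)$ of $f$ and $1 \le p \le q$:

```latex
% Jensen's inequality applied to the convex map t -> t^{q/p} gives
\Big(\mathbb{E}_\pi\big[(|a(b)|\,\|b\|_1)^p\big]\Big)^{1/p}
\;\le\;
\Big(\mathbb{E}_\pi\big[(|a(b)|\,\|b\|_1)^q\big]\Big)^{1/q}.
% Taking the infimum over representations yields
% \gamma_p(f) \le \gamma_q(f), hence \mathcal{B}_q \subset \mathcal{B}_p.
```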
Obviously the Barron space is dense in $C(X)$, since all finite two-layer neural networks belong to the Barron space (take $\pi$ to be a discrete measure supported on the inner weights), and the universal approximation theorem [14] tells us that continuous functions can be approximated by two-layer neural networks. Moreover, it is interesting to note that the Barron norm of a two-layer neural network is bounded by the path norm of its parameters.
An important result proved in [8, 27] states that if a function $f$ satisfies $\int_{\mathbb{R}^d} \|\omega\|_1^2\, |\hat f(\omega)|\, d\omega < \infty$, where $\hat f$ is the Fourier transform of an extension of $f$, then it can be expressed in the form (3.3) with the Barron norm controlled by this integral. Thus it lies in the Barron space.
Connection with reproducing kernel Hilbert space
The Barron space has a natural connection with reproducing kernel Hilbert spaces (RKHS) [2], and as we will show later, this connection leads to a precise comparison between two-layer neural networks and kernel methods. For a fixed $\pi \in P(\Omega)$, we define $\mathcal{H}_\pi$ to be the RKHS induced by the kernel
$$k_\pi(x, x') = \mathbb{E}_{b \sim \pi}\big[\sigma(b^\top x)\,\sigma(b^\top x')\big]. \qquad (3.5)$$
Recall that for a symmetric positive definite (PD) function $k$, the induced RKHS is the completion of $\mathrm{span}\{k(x, \cdot)\}$ with respect to the inner product $\langle k(x,\cdot), k(x',\cdot)\rangle = k(x, x')$.
Thus the Barron space can be viewed as the union of a family of RKHS with kernels defined by $\pi$ through Equation (3.5), i.e.
$$\mathcal{B}_2 = \bigcup_{\pi \in P(\Omega)} \mathcal{H}_{\pi}. \qquad (3.6)$$
Note that this family of kernels is determined solely by the activation function $\sigma$.
3.2 Approximation property
{theorem}For any $f \in \mathcal{B}_2$, there exists a two-layer neural network $f_m(\cdot;\tilde\theta)$ of width $m$, such that
$$\mathbb{E}_x\,\big|f(x) - f_m(x;\tilde\theta)\big|^2 \lesssim \frac{\gamma_2^2(f)}{m}, \qquad (3.7)$$
$$\|\tilde\theta\|_P \lesssim \gamma_2(f). \qquad (3.8)$$
Approximation results of this kind have been established in many papers, see for example [5, 8]. The difference is that we provide explicit control of the path norm of the constructed solution in (3.8), and this bound is independent of the network size. This observation will be useful for what follows.
The proof of Proposition 3.2 can be found in Appendix A. The basic intuition is that the integral representation (3.3) of $f$ allows us to approximate $f$ by the Monte Carlo method:
$$f_m(x) = \frac{1}{m}\sum_{k=1}^{m} a(b_k)\, \sigma(b_k^\top x),$$
where the $b_k$'s are sampled independently from the distribution $\pi$.
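The Monte Carlo intuition can be checked numerically. The following toy sketch (our own illustration; the three atoms and their coefficients are arbitrary choices) takes a Barron function whose distribution $\pi$ is supported on finitely many atoms, so the integral representation reduces to an exact finite sum, and compares it with a width-$m$ Monte Carlo network:

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy Barron function: pi puts mass p_k on atom b_k with coefficient a_k,
# so f(x) = sum_k p_k * a_k * relu(b_k^T x) exactly.
B_atoms = np.array([[1.0, 2.0], [-1.0, 0.5], [0.5, -1.0]])
a_atoms = np.array([2.0, -1.0, 3.0])
p_atoms = np.array([0.5, 0.3, 0.2])

def f_exact(x):
    return p_atoms @ (a_atoms * np.maximum(B_atoms @ x, 0.0))

def f_monte_carlo(x, m):
    # Sample b_1..b_m ~ pi and average a(b_k) * relu(b_k^T x); the error
    # decays like 1/sqrt(m) in general, matching the Monte Carlo rate.
    idx = rng.choice(len(p_atoms), size=m, p=p_atoms)
    return np.mean(a_atoms[idx] * np.maximum(B_atoms[idx] @ x, 0.0))
```

Averaging $m$ i.i.d. draws gives an unbiased estimator of $f(x)$ whose variance shrinks like $1/m$, which is the mechanism behind the rate in (3.7).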
4 Main results
For simplicity, we first discuss the case without noise, i.e. $\varepsilon = 0$; the noise is dealt with in the next section. We also assume that the target function lies in the Barron space, and let $\gamma_2(f^*)$ be its Barron norm. Here $d$ is the dimension of the input, and the definition of $\gamma_2$ is given in Equation (3.4).
[Noiseless case] Assume that the target function $f^* \in \mathcal{B}_2$ and that $\lambda$ is chosen sufficiently large (independently of the sample). Then for any $\delta \in (0,1)$, with probability at least $1-\delta$ over the choice of the training set $S$, we have
$$L(\hat\theta_n) \lesssim \frac{\gamma_2^2(f^*)}{m} + \lambda\big(\gamma_2(f^*) + 1\big)\sqrt{\frac{\log(2d)}{n}} + \sqrt{\frac{\log(2/\delta)}{n}}, \qquad (4.9)$$
$$\|\hat\theta_n\|_P \lesssim \gamma_2(f^*) + 1. \qquad (4.10)$$
The above theorem provides an a priori estimate for the population risk. The a priori nature is reflected by the dependence on the norm of the target function. The first term on the right-hand side of (4.9) controls the approximation error. The second term bounds the estimation error. Surprisingly, the bound for the estimation error is independent of the network width $m$. Hence the bound also makes sense in the overparametrized regime.
In particular, if we take $\lambda \asymp 1$ and $m$ large enough that the first term is dominated by the second, the bound becomes $O(\sqrt{\log(2d)/n})$ up to some logarithmic terms. This bound is nearly optimal in a minimax sense [42, 28].
4.1 Comparison with kernel methods
Consider $f^* \in \mathcal{B}_2$, and without loss of generality, we assume that $(a^*, \pi^*)$ is one of the best representations of $f^*$ (it is easy to prove that such a representation exists), i.e. $\gamma_2^2(f^*) = \mathbb{E}_{\pi^*}[\,|a^*(b)|^2\,\|b\|_1^2\,]$. For a fixed $\pi_0 \in P(\Omega)$, we have
$$f^*(x) = \mathbb{E}_{b \sim \pi_0}\Big[\frac{d\pi^*}{d\pi_0}(b)\, a^*(b)\,\sigma(b^\top x)\Big], \qquad (4.11)$$
as long as $\pi^*$ is absolutely continuous with respect to $\pi_0$. In this sense, we can view $f^*$ from the perspective of $\mathcal{H}_{\pi_0}$. Note that $\mathcal{H}_{\pi_0}$ is induced by the PD function $k_{\pi_0}$, and the norm of $f^*$ in $\mathcal{H}_{\pi_0}$ is controlled by
$$\|f^*\|_{\mathcal{H}_{\pi_0}}^2 \le \mathbb{E}_{\pi_0}\Big[\Big|\frac{d\pi^*}{d\pi_0}(b)\, a^*(b)\Big|^2\Big].$$
Let $\hat f_n$ be the solution of the kernel ridge regression (KRR) problem defined by:
$$\hat f_n = \operatorname*{argmin}_{f \in \mathcal{H}_{\pi_0}}\ \frac{1}{n}\sum_{i=1}^{n}\big(f(x_i) - y_i\big)^2 + \mu\,\|f\|_{\mathcal{H}_{\pi_0}}^2. \qquad (4.12)$$
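A random-feature sketch of the KRR problem (our own illustration; the Gaussian choice of $\pi_0$, the dimensions, and the ridge value are all assumptions) makes the comparison concrete. The kernel $k_{\pi_0}(x, x') = \mathbb{E}_{b\sim\pi_0}[\sigma(b^\top x)\sigma(b^\top x')]$ is approximated by an empirical average over sampled features:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_features = 3, 2000
# Random features b ~ pi0 (assumed standard Gaussian), shared by all kernel
# evaluations so that train and test use the same approximate kernel.
B = rng.standard_normal((n_features, d))

def features(X):
    # phi(x) = relu(Bx)/sqrt(n_features), so phi(x)^T phi(x') ~= k_{pi0}(x, x')
    return np.maximum(X @ B.T, 0.0) / np.sqrt(n_features)

def krr_fit(X, y, lam=1e-3):
    # Representer theorem: alpha = (K + lam*n*I)^{-1} y with K_ij = k(x_i, x_j)
    K = features(X) @ features(X).T
    return np.linalg.solve(K + lam * len(X) * np.eye(len(X)), y)

def krr_predict(X_train, alpha, X_test):
    # f(x) = sum_i alpha_i k(x, x_i)
    return features(X_test) @ features(X_train).T @ alpha
```

Note that the kernel here is fixed in advance by the choice of $\pi_0$; the adaptive selection of $\pi$ discussed below is exactly what this sketch cannot do.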
We are interested in the comparison between the two population risks $L(\hat\theta_n)$ and $L(\hat f_n)$.
If $\|f^*\|_{\mathcal{H}_{\pi_0}} < \infty$, then $f^* \in \mathcal{H}_{\pi_0}$ and KRR is applicable. In this case, it was proved in [11] that, with the optimal choice of $\mu$, the learning rate takes the form
$$L(\hat f_n) \lesssim \frac{\|f^*\|_{\mathcal{H}_{\pi_0}}^2}{\sqrt{n}}. \qquad (4.13)$$
Compared to Theorem 4, we can see that both rates have the same scaling with respect to $n$, the number of samples. The only difference appears in the two norms: $\gamma_2(f^*)$ and $\|f^*\|_{\mathcal{H}_{\pi_0}}$. From the definition (3.4), we always have $\gamma_2(f^*) \le \|f^*\|_{\mathcal{H}_{\pi_0}}$, since the Barron norm takes an infimum over all admissible distributions $\pi$. If $\pi^*$ is nearly singular with respect to $\pi_0$, then $\|f^*\|_{\mathcal{H}_{\pi_0}} \gg \gamma_2(f^*)$. In this case, the population risk for the kernel method should be much larger than the population risk for the neural network model.
Example
Take $\pi_0$ to be the uniform distribution and $f^*$ a single-neuron target, for which the best representation $\pi^*$ is a Dirac mass. In this case $\gamma_2(f^*) < \infty$, but $\|f^*\|_{\mathcal{H}_{\pi_0}} = \infty$, since $\pi^*$ is singular with respect to $\pi_0$. Thus the rate (4.13) becomes trivial. Assume that the population risk scales as $n^{-\beta}$; it is interesting to see how $\beta$ depends on the dimension $d$. We numerically estimate the $\beta$'s for the two methods, and report the results in Table 2. It does show that the higher the dimensionality, the slower the rate of the kernel method. In contrast, the rates for the two-layer neural networks are independent of the dimensionality, which confirms the prediction of Theorem 4. For this particular target function, the value of $\beta$ is bigger than the lower bound proved in Theorem 4. This is not a contradiction, since the latter holds for any target function in the Barron space.
The two-layer neural network model as an adaptive kernel method
Recall that $\mathcal{B}_2 = \cup_{\pi} \mathcal{H}_{\pi}$. The norm $\gamma_2(\cdot)$ characterizes the complexity of the target function by selecting the best kernel among a family of kernels $\{k_\pi\}$. A kernel method works with a specific RKHS, with a particular choice of the kernel or the probability distribution $\pi$. In contrast, the neural network models work with the union of all these RKHS and select the kernel, or the probability distribution, adapted to the data. From this perspective, we can view the two-layer neural network model as an adaptive kernel method.
4.2 Tackling the noise
We first make the following sub-Gaussian assumption on the noise. {assumption} We assume that the noise $\varepsilon$ satisfies
$$\mathbb{P}\{|\varepsilon| \ge t\} \le c\, e^{-t^2/\sigma^2} \quad \text{for all } t \ge 0. \qquad (4.14)$$
Here $c$ and $\sigma$ are constants.
In the presence of noise, the population risk can be decomposed into
$$L(\theta) = \frac{1}{2}\,\mathbb{E}\,\big|f_m(x;\theta) - f^*(x)\big|^2 + \frac{1}{2}\,\mathbb{E}\big[\varepsilon^2\big]. \qquad (4.15)$$
This suggests that, in spite of the noise, minimizing the population risk still amounts to minimizing the first term, which is what we really want. However, due to the noise, $y$ might be unbounded, so we cannot directly use the generalization bound in Theorem 5.1. To address this issue, we consider the truncated risk defined as follows,
$$\hat L_n^{B}(\theta) = \frac{1}{n}\sum_{i=1}^{n} \ell\big(T_B f_m(x_i;\theta),\, T_B y_i\big), \qquad T_B z := \max\big(\min(z, B), -B\big).$$
Let $B_n \asymp \sigma\sqrt{\log n}$ be the truncation level. For the noisy case, we consider the following regularized risk:
$$J_\lambda(\theta) = \hat L_n^{B_n}(\theta) + \lambda \sqrt{\frac{\log(2d)}{n}}\,\big(\|\theta\|_P + 1\big). \qquad (4.16)$$
The corresponding regularized estimator is given by $\hat\theta_n \in \operatorname*{argmin}_\theta J_\lambda(\theta)$; here, for simplicity, we slightly abuse notation.
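A minimal sketch of the truncation device (our own illustration; clipping both predictions and labels at a level `T` that stands in for the truncation level in the text):

```python
import numpy as np

def truncate(v, T):
    # T_B operator: clip values to [-T, T]
    return np.clip(v, -T, T)

def truncated_empirical_risk(preds, y, T):
    """Squared loss evaluated after truncating both the model output and the
    labels; each summand is bounded by 2*T**2, even for unbounded noise."""
    return 0.5 * np.mean((truncate(preds, T) - truncate(y, T)) ** 2)
```

Boundedness of the truncated loss is what allows the Rademacher-complexity bound of Section 5 to be applied despite unbounded sub-Gaussian noise.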
[Main result, noisy case] Assume that the target function $f^* \in \mathcal{B}_2$ and that Assumption 4.2 holds. Then for any $\delta \in (0,1)$, with probability at least $1-\delta$ over the choice of the training set $S$, the bound of Theorem 4 holds for $L(\hat\theta_n) - \frac{1}{2}\mathbb{E}[\varepsilon^2]$, up to additional logarithmic factors in $n$.
Compared to Theorem 4, the noise introduces at most several logarithmic terms. The case with no noise corresponds to the situation with $\sigma = 0$.
4.3 Extension to classification problems
Let us consider the simplest setting: the binary classification problem, where $y \in \{0, 1\}$. In this case, $f^*(x) = \mathbb{P}\{y = 1 \mid x\}$ denotes the probability of $y = 1$ given $x$. Given $f^*$ and the learned $\hat f_n := f_m(\cdot;\hat\theta_n)$, the corresponding plug-in classifiers are defined by $c^*(x) = \mathbf{1}\{f^*(x) \ge 1/2\}$ and $\hat c_n(x) = \mathbf{1}\{\hat f_n(x) \ge 1/2\}$, respectively; $c^*$ is the optimal Bayes classifier.
For a classifier $c$, we measure its performance by the 0-1 loss defined by $L^{0\text{-}1}(c) = \mathbb{P}\{c(x) \ne y\}$. {corollary} Under the same assumptions as in Theorem 4.2, and with a suitable choice of $\lambda$, for any $\delta \in (0,1)$, with probability at least $1-\delta$, the excess 0-1 risk $L^{0\text{-}1}(\hat c_n) - L^{0\text{-}1}(c^*)$ is bounded by the square root of the bound in Theorem 4.2.
According to Theorem 2.2 of [16], we have
$$L^{0\text{-}1}(\hat c_n) - L^{0\text{-}1}(c^*) \le 2\,\big(\mathbb{E}\,|\hat f_n(x) - f^*(x)|^2\big)^{1/2}. \qquad (4.17)$$
In this case, the right-hand side is controlled by the population risk, and applying Theorem 4.2 yields the result.
The above corollary suggests that our a priori estimates also hold for classification problems, although the error rate only scales as $O(n^{-1/4})$. It is possible to improve the rate with a more delicate analysis. One potential way is to develop a better estimate specifically for the 0-1 loss, as can be seen from inequality (4.17). Another way is to make a stronger assumption on the data. For example, we can assume that there exists a margin $\Delta > 0$ such that $|2f^*(x) - 1| \ge \Delta$ almost surely, for which the Bayes error is small. We leave these to future work.
5 Proofs
5.1 Bounding the generalization gap
{definition}[Rademacher complexity] Let $\mathcal{F}$ be a hypothesis space, i.e. a set of functions. The Rademacher complexity of $\mathcal{F}$ with respect to samples $S = (z_1, \ldots, z_n)$ is defined as
$$\hat{\mathcal{R}}_n(\mathcal{F}) = \frac{1}{n}\,\mathbb{E}_{\xi}\Big[\sup_{f \in \mathcal{F}} \sum_{i=1}^{n} \xi_i f(z_i)\Big],$$
where the $\xi_i$'s are i.i.d. random variables with $\mathbb{P}\{\xi_i = +1\} = \mathbb{P}\{\xi_i = -1\} = \frac{1}{2}$. The generalization gap can be estimated via the Rademacher complexity by the following theorem [39]. {theorem} Fix a hypothesis space $\mathcal{F}$. Assume that $|f(z)| \le B$ for any $f \in \mathcal{F}$ and any $z$. Then for any $\delta \in (0,1)$, with probability at least $1-\delta$ over the choice of $S$, we have,
$$\sup_{f \in \mathcal{F}}\, \Big| \mathbb{E}_z[f(z)] - \frac{1}{n}\sum_{i=1}^{n} f(z_i) \Big| \le 2\,\hat{\mathcal{R}}_n(\mathcal{F}) + 3B\sqrt{\frac{\log(2/\delta)}{2n}}.$$
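The definition can be turned into a small Monte Carlo estimator for finite hypothesis classes (our own illustration, a toy stand-in for the path-norm ball considered next):

```python
import numpy as np

rng = np.random.default_rng(0)

def empirical_rademacher(F_values, n_trials=2000):
    """Monte Carlo estimate of (1/n) E_xi[ sup_{f in F} sum_i xi_i f(z_i) ].

    F_values: array of shape (num_functions, n) holding f(z_i) for each f
    in a finite hypothesis class.
    """
    n = F_values.shape[1]
    xi = rng.choice([-1.0, 1.0], size=(n_trials, n))   # Rademacher variables
    # For each draw of xi, take the sup over the class, then average over draws
    return float(np.mean(np.max(F_values @ xi.T / n, axis=0)))
```

For a single function the estimate is near zero (the signs cancel in expectation), while richer classes can chase the signs and have larger complexity; that growth is exactly what the generalization bound penalizes.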
Let $\mathcal{F}_Q = \{f_m(\cdot;\theta) : \|\theta\|_P \le Q\}$ denote all the two-layer networks with path norm bounded by $Q$. It was proved in [35] that
$$\hat{\mathcal{R}}_n(\mathcal{F}_Q) \le 2Q\sqrt{\frac{2\log(2d)}{n}}. \qquad (5.18)$$
By combining the above result with Theorem 5.1, we obtain the following a posteriori bound on the generalization gap for two-layer neural networks. The proof is deferred to Appendix B.
[A posteriori generalization bound] Assume that the loss function $\ell$ is Lipschitz continuous and bounded. Then for any $\delta \in (0,1)$, with probability at least $1-\delta$ over the choice of the training set $S$, we have, for any two-layer network $f_m(\cdot;\theta)$,
(5.19)  
(5.20) 
We see that the generalization gap is bounded roughly by $(\|\theta\|_P + 1)\sqrt{\log(2d)/n}$ up to some logarithmic terms.
5.2 Proof for the noiseless case
The intuition is as follows. The path norm of the special solution which achieves the optimal approximation error is independent of the network width, and this norm can also be used to bound the generalization gap (Theorem 5.1). Therefore, if the path norm is suitably penalized during training, we should be able to control the generalization gap without harming the approximation accuracy.
We first estimate the regularized risk of the constructed network. {proposition} Let $\tilde\theta$ be the network constructed in Theorem 3.2. Then with probability at least $1-\delta$, we have
(5.21) 
First, the loss function is Lipschitz continuous and bounded. According to Definition 3 and the path-norm bound (3.8), the regularized risk of $\tilde\theta$ satisfies
(5.22) 
The last term can be simplified using elementary inequalities, so we have
Plugging this into Equation (5.22) completes the proof.
[Properties of regularized solutions] The regularized estimator satisfies:
The first claim follows from the definition of $\hat\theta_n$ as a minimizer. For the second claim, note that
Applying Proposition 5.2 completes the proof. {remark} The above proposition establishes the connection between the regularized solution and the special solution constructed in Proposition 3.2. In particular, with a suitable choice of $\lambda$, the generalization gap of the regularized solution is bounded, up to a constant, by a quantity of the same order as the approximation error. This suggests that our regularization term is appropriate: it forces the generalization gap to be roughly of the same order as the approximation error.
(Proof of Theorem 4) Now we are ready to prove the main result. Following the a posteriori generalization bound given in Theorem 5.1, we have, with probability at least $1-\delta$,
The inequality above is due to the choice of $\lambda$. The first term is bounded by Proposition 5.2. It remains to bound the path norm of the regularized solution,
By Proposition 5.2, we have
Thus after some simplification, we obtain
(5.23) 
5.3 Proof for the noisy case
We need the following lemma. The proof is deferred to Appendix D. {lemma} Under Assumption 4.2, we have
Therefore we have,
This suggests that as long as we can bound the truncated population risk, the original risk will be bounded accordingly.
(Proof of Theorem 4.2) The proof is almost the same as in the noiseless case. The truncated loss function is Lipschitz continuous and bounded. By analogy with the proof of Proposition 5.2, we obtain that, with probability at least $1-\delta$, the following inequality holds,
(5.24) 
Following the proof of Proposition 5.2, we similarly obtain
(5.25) 
Following the proof of Theorem 4, we have
(5.26) 
Plugging (5.24) and (5.25) into (5.26), we get
Using Lemma 5.3 and the decomposition (4.15), we complete the proof.
6 Numerical Experiments
In this section, we evaluate the regularized model using numerical experiments.
We consider two datasets, MNIST
The two-layer ReLU network is randomly initialized. We train the regularized models using the Adam optimizer [26] for a fixed number of steps, unless specified otherwise. The initial learning rate is multiplied by a decay factor twice during training. We set the tradeoff parameter $\lambda$ accordingly.
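As a stand-in for this setup, the following sketch (our own illustration; plain gradient descent instead of Adam, with illustrative width, step size and penalty strength) trains the path-norm-regularized objective:

```python
import numpy as np

rng = np.random.default_rng(0)

def train_regularized(X, Y, m=64, lam=1e-3, lr=1e-2, steps=3000):
    """Gradient descent on 0.5*mean((f_m - Y)^2) + lam * ||theta||_P.

    The path-norm penalty sum_k |a_k| * ||b_k||_1 is handled with its
    subgradient (sign terms); all hyperparameters are illustrative.
    """
    n, d = X.shape
    a = rng.standard_normal(m) / np.sqrt(m)
    B = rng.standard_normal((m, d)) / np.sqrt(d)
    for _ in range(steps):
        H = np.maximum(X @ B.T, 0.0)          # relu(b_k^T x_i), shape (n, m)
        r = (H @ a - Y) / n                   # scaled residuals
        # Gradient of the fit term plus subgradient of the path-norm penalty
        g_a = H.T @ r + lam * np.sign(a) * np.sum(np.abs(B), axis=1)
        mask = (X @ B.T > 0).astype(float)    # relu derivative
        g_B = (mask * np.outer(r, a)).T @ X + lam * np.abs(a)[:, None] * np.sign(B)
        a -= lr * g_a
        B -= lr * g_B
    return a, B
```

The penalty shrinks the path norm of the trained solution, which is the quantity entering the a posteriori bound of Theorem 5.1.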
6.1 Sharper bounds for the generalization gap
Theorem 5.1 shows that the generalization gap is bounded by the path norm of the solution, up to some logarithmic terms. Previous works [34, 18] showed that (stochastic) gradient descent tends to find solutions with huge norms, causing the a posteriori bound to be vacuous. In contrast, our theory suggests that there exist good solutions (i.e. solutions with small generalization error) with small norms, and that these solutions can be found through explicit regularization.
To see how this works in practice, we trained both the regularized models and the unregularized models ($\lambda = 0$) for a fixed network width of 10,000. To cover the overparametrized regime, we also consider the case where $n = 100$. The results are summarized in Table 3.
dataset  n  training accuracy  testing accuracy  

CIFAR10  100%  58  
100  
0.14  
100  0.43  
MNIST  58  
100  162  
0.27  
100  0.41 
As we can see, the test accuracies of the regularized and unregularized solutions are generally comparable, but the values of the norm-based bound, which serves as an upper bound for the generalization gap, are drastically different. The bounds for the unregularized models are always vacuous, as was observed in [18, 34, 4]. In contrast, the bounds for the regularized models are always several orders of magnitude smaller than those for the unregularized models. This is consistent with the theoretical prediction in Proposition 5.2.
To further explore the impact of overparametrization, we trai