
# A Priori Estimates of the Population Risk for Two-layer Neural Networks

## Abstract

New estimates for the population risk are established for two-layer neural networks. These estimates are nearly optimal in the sense that the error rates scale in the same way as the Monte Carlo error rates. They are equally effective in the over-parametrized regime when the network size is much larger than the size of the dataset. These new estimates are a priori in nature in the sense that the bounds depend only on some norms of the underlying functions to be fitted, not the parameters in the model, in contrast with most existing results which are a posteriori in nature. Using these a priori estimates, we provide a perspective for understanding why two-layer neural networks perform better than the related kernel methods.

In memory of Professor David Shenou Cai

Keywords: Two-layer neural network; Barron space; Population risk; A priori estimate; Rademacher complexity

AMS subject classifications: 41A46; 41A63; 62J02; 65D05

## 1 Introduction

One of the main challenges in theoretical machine learning is to understand the errors in neural network models [43]. To this end, it is useful to draw an analogy with classical approximation theory and finite element analysis [13]. There are two kinds of error bounds in finite element analysis, depending on whether the target solution (the ground truth) or the numerical solution enters into the bounds. Let $f^*$ and $\hat f_n$ be the true solution and the “numerical solution”, respectively. “A priori” error estimates usually take the form

$$\|\hat f_n - f^*\|_1 \le C n^{-\alpha}\,\|f^*\|_2,$$

where only norms of the true solution enter into the bounds. In “a posteriori” error estimates, the norms of the numerical solution enter into the bounds:

$$\|\hat f_n - f^*\|_1 \le C n^{-\beta}\,\|\hat f_n\|_3.$$

Here $\|\cdot\|_1$, $\|\cdot\|_2$ and $\|\cdot\|_3$ denote various norms. In this language, most recent theoretical results [35, 7, 24, 32, 33, 34] on estimating the generalization error of neural networks should be viewed as “a posteriori” analysis, since the bounds depend on various norms of the neural network model obtained after the training process. As was observed in [18, 4, 34], the numerical values of these norms are very large, yielding vacuous bounds. For example, [34] calculated the values of various a posteriori bounds for some real two-layer neural networks and found that even the best bounds are far too large to be informative.

In this paper, we pursue a different line of attack by providing “a priori” analysis. Specifically, we focus on two-layer networks, and we consider models with explicit regularization. We establish estimates for the population risk which are asymptotically sharp with constants depending only on the properties of the target function. Our numerical results suggest that such regularization terms are necessary in order for the model to be “well-posed” (see Section 7 for the precise meaning).

Specifically, our main contributions are:

• We establish a priori estimates of the population risk for learning two-layer neural networks with explicit regularization. These a priori estimates depend on the Barron norm of the target function. The rates with respect to the number of parameters and the number of samples are comparable to the Monte Carlo rate. In addition, our estimates hold in the high-dimensional and over-parametrized regimes.

• We make a comparison between the neural network and kernel methods using these a priori estimates. We show that two-layer neural networks can be understood as kernel methods with the kernel adaptively selected from the data. This understanding partially explains why neural networks perform better than kernel methods in practice.

The present paper is the first in a series of papers in which we analyze neural network models using a classical numerical analysis perspective. Subsequent papers will consider deep neural network models [19, 20], the optimization and implicit regularization problem using gradient descent dynamics [22, 20] and the general function spaces and approximation theory in high dimensions [21].

## 2 Related work

There are two key problems in learning two-layer neural networks: optimization and generalization. Recent progress on optimization suggests that over-parametrization is the key factor leading to a nice empirical landscape [38, 23, 36], thus facilitating convergence towards global minima for gradient-based optimizers [31, 17, 12]. This makes the generalization property of two-layer neural networks more puzzling, since naive arguments would suggest that more parameters imply worse generalization ability. This contradicts what is observed in practice. In what follows, we survey previous attempts at analyzing the generalization properties of two-layer neural network models.

### 2.1 Explicit regularization

This line of work studies the generalization properties of two-layer neural networks with explicit regularization, and our work lies in this category. Let $n$ and $m$ denote the number of samples and the number of parameters, respectively. For two-layer sigmoidal networks, [6] established a risk bound; by considering smoother activation functions, [27] proved another bound under additional conditions on $m$ and $n$. Both of these results are proved for regularized estimators. In comparison, the error rate established in this paper is sharper, in fact nearly optimal, and it is also applicable in the over-parametrized regime. For a detailed comparison, please refer to Table 1.

More recently, [41] considered explicit regularization for classification problems. They proved that for the specific cross-entropy loss, the regularization path converges to the maximum margin solutions. They also proved an a priori bound on how the network size affects the margin. However, their analysis is restricted to the case where the data is well-separated. Our result does not have this restriction.

### 2.2 Implicit regularization

Another line of work studies how gradient descent (GD) and stochastic gradient descent (SGD) find generalizable solutions. [9] proved that SGD learns over-parametrized networks that provably generalize for binary classification problems. However, it is not clear how the population risk depends on the number of samples in their compression-based generalization bound. Moreover, their proof relies heavily on the strong assumption that the data are linearly separable. The experiments in [34] suggest that increasing the network width can improve the test accuracy of solutions found by SGD. They tried to explain this phenomenon by an initialization-dependent (a posteriori) generalization bound. However, their experiments only cover networks of moderate width, and their generalization bounds are arbitrarily loose in practice. So their result cannot tell us whether GD can find generalizable solutions for arbitrarily wide networks.

In [15] and [1], it is proved that GD with a particularly chosen initialization, learning rate and early stopping can find generalizable solutions, provided the network is sufficiently over-parametrized. These results differ from ours in several aspects. First, both of them assume that the target function lies in the reproducing kernel Hilbert space (RKHS) induced by the initialization distribution, which is much smaller than the Barron space we consider. Secondly, by carefully tracking the polynomial orders in the two papers, we can see that the sample complexities they provide scale worse than the rate proved here. See also [3, 10] for some even more recent results.

Recent work in [22, 20] has shown clearly that for the kind of initialization schemes considered in these previous works or in the over-parametrized regime, the neural network models do not perform better than the corresponding kernel method with a kernel defined by the initialization. These results do not rule out the possibility that neural network models can still outperform kernel methods in some regimes, but they do show that finding these regimes is quite non-trivial.

## 3 Preliminaries

We begin by recalling the basics of two-layer neural networks and their approximation properties.

The problem of interest is to learn a function from a training set of $n$ examples $\{(x_i, y_i)\}_{i=1}^{n}$, i.i.d. samples drawn from an underlying distribution $\rho$, which is assumed fixed but known only through the samples. Our target function is $f^*(x) := \mathbb{E}[y\,|\,x]$. We assume that the values of $y_i$ are given through the decomposition $y_i = f^*(x_i) + \xi_i$, where $\xi_i$ denotes the noise. For simplicity, we assume that the data lie in a compact domain $X$ and that $f^*$ is bounded.

The two-layer neural network is defined by

$$f(x;\theta) = \sum_{k=1}^{m} a_k\,\sigma(w_k^{\top} x), \qquad (3.1)$$

where $a_k \in \mathbb{R}$, $w_k \in \mathbb{R}^d$, and $\sigma$ is a nonlinear scale-invariant activation function such as ReLU [30] or Leaky ReLU [25], both of which satisfy $\sigma(t x) = t\,\sigma(x)$ for any $t \ge 0$. Without loss of generality, we assume $\sigma$ is 1-Lipschitz continuous. In the formula (3.1), we omit the bias term for notational simplicity. The effect of the bias term can be incorporated if we assume that the first component of $x$ is always 1. We say that a network is over-parametrized if the network width $m$ exceeds the number of samples $n$. We also work with a truncated form of $f(x;\theta)$, obtained by clipping its values to a bounded range; by an abuse of notation, in the following we still write $f(x;\theta)$ for the truncated version. We will use $\theta := \{(a_k, w_k)\}_{k=1}^{m}$ to denote all the parameters to be learned from the training data.
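For concreteness, the parametrization (3.1) and the scale invariance of the activation can be sketched in a few lines of NumPy (an illustrative sketch; the function names are ours, not from the paper):

```python
import numpy as np

def two_layer_net(x, a, W):
    """Two-layer ReLU network f(x; theta) = sum_k a_k * sigma(w_k^T x), Eq. (3.1).

    x : (d,) input vector
    a : (m,) outer coefficients a_k
    W : (m, d) inner weights, one row per neuron
    """
    pre = W @ x                              # pre-activations w_k^T x
    return float(a @ np.maximum(pre, 0.0))   # ReLU, then weighted sum
```

Because $\sigma(tx) = t\sigma(x)$ for $t \ge 0$, the rescaling $(a_k, w_k) \to (a_k/t,\, t\,w_k)$ leaves the function unchanged, which is why scale-invariant norms pair $|a_k|$ with $\|w_k\|$.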

The ultimate goal is to minimize the population risk

$$L(\theta) = \mathbb{E}_{x,y}\left[\ell(f(x;\theta), y)\right].$$

In practice, we have to work with the empirical risk

$$\hat L_n(\theta) = \frac{1}{n}\sum_{i=1}^{n} \ell(f(x_i;\theta), y_i).$$

Here the loss function is the squared loss $\ell(y, y') = \tfrac{1}{2}(y - y')^2$, unless specified otherwise.

Define the path norm [35],

$$\|\theta\|_{\mathcal{P}} := \sum_{k=1}^{m} |a_k|\,\|w_k\|_1. \qquad (3.2)$$

We will consider the regularized model defined as follows: {definition} For a two-layer neural network of width $m$, we define the regularized risk as

$$J_\lambda(\theta) := \hat L_n(\theta) + \lambda\left(\|\theta\|_{\mathcal{P}} + 1\right).$$

The extra term $+1$ on the right-hand side is included only to simplify the proof. Our result also holds if we do not include this term in the regularized risk. The corresponding regularized estimator is defined as

$$\hat\theta_{n,\lambda} = \operatorname*{argmin}_{\theta}\, J_\lambda(\theta).$$

Here $\lambda > 0$ is a tuning parameter that controls the balance between the fitting error and the model complexity. It is worth noting that the minimizer is not necessarily unique; $\hat\theta_{n,\lambda}$ should be understood as any of the minimizers.
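In code, the path norm (3.2) and the regularized risk $J_\lambda$ amount to the following (a minimal sketch with the squared loss; names are ours):

```python
import numpy as np

def path_norm(a, W):
    """Path norm ||theta||_P = sum_k |a_k| * ||w_k||_1, Eq. (3.2)."""
    return float(np.sum(np.abs(a) * np.sum(np.abs(W), axis=1)))

def regularized_risk(a, W, X, y, lam):
    """J_lambda(theta) = hat L_n(theta) + lam * (||theta||_P + 1)."""
    preds = np.maximum(X @ W.T, 0.0) @ a          # f(x_i; theta) for each row of X
    emp_risk = 0.5 * np.mean((preds - y) ** 2)    # squared loss l(y, y') = (y - y')^2 / 2
    return emp_risk + lam * (path_norm(a, W) + 1.0)
```

Note that the penalty depends only on the parameters, not on the data, so it can be added to any empirical risk.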

In the following, we will call a Lipschitz continuous function with Lipschitz constant $L$ an $L$-Lipschitz continuous function. We will use $a \lesssim b$ to indicate that $a \le C b$ for some universal constant $C > 0$.

### 3.1 Barron space

We begin by defining the natural function space associated with two-layer neural networks, which we will refer to as the Barron space to honor the pioneering work that Barron has done on this subject  [5, 28, 27, 29]. A more complete discussion can be found in [21].

Let $S^d$ denote the unit sphere in the space of weights $w$, let $\mathcal{F}$ be the Borel $\sigma$-algebra on $S^d$, and let $\mathcal{P}(S^d)$ be the collection of all probability measures on $(S^d, \mathcal{F})$. Let $\mathcal{B}(X)$ be the collection of functions that admit the following integral representation:

$$f(x) = \int_{S^d} a(w)\,\sigma(\langle w, x\rangle)\,d\pi(w) \quad \forall x \in X, \qquad (3.3)$$

where $\pi \in \mathcal{P}(S^d)$ and $a(\cdot)$ is a function measurable with respect to $(S^d, \mathcal{F})$. For any $f \in \mathcal{B}(X)$ and $p \ge 1$, we define the following norm

$$\gamma_p(f) := \inf_{(a,\pi)\in\Theta_f}\left(\int_{S^d} |a(w)|^p\,d\pi(w)\right)^{1/p}, \qquad (3.4)$$

where

$$\Theta_f = \left\{(a,\pi)\ \Big|\ f(x) = \int_{S^d} a(w)\,\sigma(\langle w, x\rangle)\,d\pi(w)\right\}.$$
{definition}

[Barron space] We define Barron space by

$$\mathcal{B}_p(X) := \left\{f \in \mathcal{B}(X)\ \big|\ \gamma_p(f) < \infty\right\}.$$

Since $\pi$ is a probability distribution, by Hölder’s inequality, for any $1 \le p \le q$ we have $\gamma_p(f) \le \gamma_q(f)$. Thus $\mathcal{B}_q(X) \subseteq \mathcal{B}_p(X)$.

Obviously the Barron space is dense in $C(X)$: all finite two-layer neural networks belong to the Barron space, and the universal approximation theorem [14] tells us that continuous functions can be approximated by two-layer neural networks. Moreover, it is interesting to note that the $\gamma_1$ norm of a two-layer neural network is bounded by the path norm of its parameters.

An important result proved in [8, 27] states that if a function $f$ satisfies $\int_{\mathbb{R}^d} \|\omega\|_1^2\,|\hat f(\omega)|\,d\omega < \infty$, where $\hat f$ is the Fourier transform of an extension of $f$, then it can be expressed in the form (3.3) with

$$\gamma_\infty(f) := \sup_{w\in S^d} |a(w)| \lesssim \int_{\mathbb{R}^d} \|\omega\|_1^2\,|\hat f(\omega)|\,d\omega.$$

Thus it lies in $\mathcal{B}_\infty(X)$.

#### Connection with reproducing kernel Hilbert space

The Barron space has a natural connection with the reproducing kernel Hilbert space (RKHS) [2], and as we will show later, this connection leads to a precise comparison between two-layer neural networks and kernel methods. For a fixed $\pi \in \mathcal{P}(S^d)$, we define

$$\mathcal{H}_\pi(X) := \left\{f(x) = \int_{S^d} a(w)\,\sigma(\langle w, x\rangle)\,d\pi(w)\ :\ \|f\|_{\mathcal{H}_\pi} < \infty\right\},$$

where

$$\|f\|_{\mathcal{H}_\pi}^2 := \mathbb{E}_\pi\left[|a(w)|^2\right].$$

Recall that for a symmetric positive definite (PD) function $k$, the induced RKHS $\mathcal{H}_k$ is the completion of the span of $\{k(x, \cdot)\}$ with respect to the inner product $\langle k(x,\cdot), k(x',\cdot)\rangle_{\mathcal{H}_k} = k(x, x')$. It was proved in [37] that $\mathcal{H}_\pi(X) = \mathcal{H}_{k_\pi}(X)$, with the kernel $k_\pi$ defined by

$$k_\pi(x, x') = \mathbb{E}_\pi\left[\sigma(\langle w, x\rangle)\,\sigma(\langle w, x'\rangle)\right]. \qquad (3.5)$$

Thus the Barron space can be viewed as the union of a family of RKHS with kernels defined by $\pi \in \mathcal{P}(S^d)$ through Equation (3.5), i.e.

$$\mathcal{B}_2(X) = \bigcup_{\pi\in\mathcal{P}(S^d)} \mathcal{H}_\pi(X). \qquad (3.6)$$

Note that the family of kernels $\{k_\pi\}$ is determined solely by the activation function $\sigma$.
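For general $\pi$ the kernel (3.5) has no simple closed form, but it is straightforward to estimate by Monte Carlo. A small sketch (our own illustration, taking $\pi$ uniform on the unit sphere and $\sigma$ the ReLU):

```python
import numpy as np

def kernel_pi_mc(x, xp, n_w=100_000, seed=0):
    """Monte Carlo estimate of k_pi(x, x') = E_pi[relu(<w,x>) relu(<w,x'>)],
    Eq. (3.5), with pi taken as the uniform distribution on the unit sphere."""
    rng = np.random.default_rng(seed)
    d = len(x)
    w = rng.standard_normal((n_w, d))
    w /= np.linalg.norm(w, axis=1, keepdims=True)  # project samples onto the sphere
    return float(np.mean(np.maximum(w @ x, 0.0) * np.maximum(w @ xp, 0.0)))
```

As a sanity check, for $x = x' = e_1$ the integrand is $\sigma(w_1)^2$, so by symmetry the kernel value is $\tfrac{1}{2}\mathbb{E}[w_1^2] = \tfrac{1}{2d}$.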

### 3.2 Approximation property

{theorem}

For any $f \in \mathcal{B}_2(X)$ and any width $m \ge 1$, there exists a two-layer neural network $f(x;\tilde\theta)$ of width $m$ such that

$$\mathbb{E}_x\left[(f(x) - f(x;\tilde\theta))^2\right] \le \frac{3\gamma_2^2(f)}{m}, \qquad (3.7)$$

$$\|\tilde\theta\|_{\mathcal{P}} \le 2\gamma_2(f). \qquad (3.8)$$

Approximation results of this kind have been established in many papers; see for example [5, 8]. The difference is that we provide explicit control of the norm of the constructed solution in (3.8), and this bound is independent of the network size. This observation will be useful for what follows.

The proof of Proposition 3.2 can be found in Appendix A. The basic intuition is that the integral representation of $f$ allows us to approximate it by the Monte Carlo method: $f(x) \approx \frac{1}{m}\sum_{k=1}^{m} a(w_k)\,\sigma(\langle w_k, x\rangle)$, where the $w_k$ are sampled from the distribution $\pi$.
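This Monte Carlo picture is easy to check numerically: drawing $w_k \sim \pi$ and averaging $\sigma(\langle w_k, x\rangle)$ gives a squared error that decays like $1/m$, as in (3.7). A toy sketch (our own construction with $a(w) \equiv 1$ and $\pi$ uniform on the unit circle; the reference integral is itself approximated with a much larger sample):

```python
import numpy as np

def mc_approx_error(m, n_test=500, seed=0):
    """Squared L2 error of an m-neuron Monte Carlo approximation to
    f(x) = E_{w~pi}[relu(<w, x>)], with pi uniform on the unit circle."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n_test, 2))
    # Reference value of the integral, via a much larger sample of w
    W_ref = rng.standard_normal((20_000, 2))
    W_ref /= np.linalg.norm(W_ref, axis=1, keepdims=True)
    f_ref = np.maximum(X @ W_ref.T, 0.0).mean(axis=1)
    # m-neuron Monte Carlo approximation f_m(x) = (1/m) sum_k relu(<w_k, x>)
    W = rng.standard_normal((m, 2))
    W /= np.linalg.norm(W, axis=1, keepdims=True)
    f_m = np.maximum(X @ W.T, 0.0).mean(axis=1)
    return float(np.mean((f_m - f_ref) ** 2))
```

Increasing $m$ by a factor of 100 should reduce the squared error by roughly the same factor, mirroring the $\gamma_2^2(f)/m$ rate.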

## 4 Main results

For simplicity we first discuss the case without noise, i.e. $\xi = 0$. In the next section, we deal with the noise. We also assume $f^* \in \mathcal{B}_2(X)$, and let $\hat\gamma_2(f^*) := \gamma_2(f^*) + 1$. Here $d$ is the dimension of the input, and the definition of $\gamma_2$ is given in Equation (3.4).

{theorem}

[Noiseless case] Assume that the target function $f^* \in \mathcal{B}_2(X)$ and that $\lambda \gtrsim \sqrt{\ln(2d)/n}$. Then for any $\delta \in (0,1)$, with probability at least $1-\delta$ over the choice of the training set, we have

$$\mathbb{E}_x\left|f(x;\hat\theta_{n,\lambda}) - f^*(x)\right|^2 \lesssim \frac{\gamma_2^2(f^*)}{m} + \lambda\,\hat\gamma_2(f^*) + \frac{1}{\sqrt n}\left(\hat\gamma_2(f^*) + \sqrt{\ln(n/\delta)}\right). \qquad (4.9)$$

The above theorem provides an a priori estimate for the population risk. The a priori nature is reflected in the fact that the bound depends only on norms of the target function. The first term on the right-hand side controls the approximation error. The remaining terms bound the estimation error. Surprisingly, the bound for the estimation error is independent of the network width $m$. Hence the bound also makes sense in the over-parametrized regime.

In particular, if we take $m \gtrsim \sqrt n$ and $\lambda \asymp n^{-1/2}$, the bound becomes $O(n^{-1/2})$ up to some logarithmic terms. This bound is nearly optimal in a minimax sense [42, 28].

### 4.1 Comparison with kernel methods

Consider $f^* \in \mathcal{B}_2(X)$, and without loss of generality assume that $(a^*, \pi^*)$ is one of the best representations of $f^*$, i.e. $\gamma_2^2(f^*) = \int_{S^d} |a^*(w)|^2\,d\pi^*(w)$ (it is easy to prove that such a representation exists). For a fixed $\pi_0 \in \mathcal{P}(S^d)$, we have

$$f^*(x) = \int_{S^d} a^*(w)\,\sigma(\langle w, x\rangle)\,d\pi^*(w) = \int_{S^d} a^*(w)\,\frac{d\pi^*}{d\pi_0}(w)\,\sigma(\langle w, x\rangle)\,d\pi_0(w), \qquad (4.11)$$

as long as $\pi^*$ is absolutely continuous with respect to $\pi_0$. In this sense, we can view $f^*$ from the perspective of $\mathcal{H}_{\pi_0}$. Note that $\mathcal{H}_{\pi_0}$ is induced by the PD function $k_{\pi_0}$, and the norm of $f^*$ in $\mathcal{H}_{\pi_0}$ is given by

$$\|f^*\|_{\mathcal{H}_{\pi_0}}^2 = \mathbb{E}_{w\sim\pi_0}\left[\left|a^*(w)\,\frac{d\pi^*}{d\pi_0}(w)\right|^2\right].$$

Let $\hat h_{n,\lambda}$ be the solution of the kernel ridge regression (KRR) problem defined by

$$\min_{h\in\mathcal{H}_{\pi_0}}\ \frac{1}{2n}\sum_{i=1}^{n}\left(h(x_i) - y_i\right)^2 + \lambda\,\|h\|_{\mathcal{H}_{\pi_0}}. \qquad (4.12)$$

We are interested in the comparison between the two population risks $L(\hat\theta_{n,\lambda})$ and $L(\hat h_{n,\lambda})$.

If $f^* \in \mathcal{H}_{\pi_0}$, it was proved in [11] that, with an appropriate choice of $\lambda$, the optimal learning rate is

$$L(\hat h_{n,\lambda}) \sim \frac{\|f^*\|_{\mathcal{H}_{\pi_0}}}{\sqrt n}. \qquad (4.13)$$

Compared to Theorem 4, we can see that both rates have the same scaling with respect to $n$, the number of samples. The only difference appears in the two norms: $\gamma_2(f^*)$ and $\|f^*\|_{\mathcal{H}_{\pi_0}}$. From the definition (3.4), we always have $\gamma_2(f^*) \le \|f^*\|_{\mathcal{H}_{\pi_0}}$, since $\left(a^*\,\tfrac{d\pi^*}{d\pi_0},\, \pi_0\right) \in \Theta_{f^*}$. If $\pi^*$ is nearly singular with respect to $\pi_0$, then $\|f^*\|_{\mathcal{H}_{\pi_0}} \gg \gamma_2(f^*)$. In this case, the population risk for the kernel method should be much larger than the population risk for the neural network model.

#### Example

Take $\pi_0$ to be the uniform distribution over $S^d$ and take $f^*$ to be a single neuron, $f^*(x) = \sigma(\langle w^*, x\rangle)$, for which $\pi^* = \delta_{w^*}$ and $a^* \equiv 1$. In this case $\gamma_2(f^*)$ is finite, but $\|f^*\|_{\mathcal{H}_{\pi_0}} = \infty$ since $\pi^*$ is singular with respect to $\pi_0$. Thus the rate (4.13) becomes trivial. Assume that the population risk scales as $n^{-\beta}$; it is interesting to see how $\beta$ depends on the dimension $d$. We numerically estimate $\beta$ for the two methods, and report the results in Table 2. It does show that the higher the dimensionality, the slower the rate of the kernel method. In contrast, the rates for the two-layer neural networks are independent of the dimensionality, which confirms the prediction of Theorem 4. For this particular target function, the value of $\beta$ is bigger than the lower bound ($\beta = 1/2$) proved in Theorem 4. This is not a contradiction, since the latter holds for any $f^* \in \mathcal{B}_2(X)$.

#### The two-layer neural network model as an adaptive kernel method

Recall that $\gamma_2(f) = \inf_{\pi} \|f\|_{\mathcal{H}_\pi}$. The $\gamma_2$ norm characterizes the complexity of the target function by selecting the best kernel among the family of kernels $\{k_\pi\}$. A kernel method works within a specific RKHS, with a particular choice of the kernel, or equivalently of the probability distribution $\pi_0$. In contrast, the neural network models work with the union of all these RKHS and select the kernel, or the probability distribution, adapted to the data. From this perspective, we can view the two-layer neural network model as an adaptive kernel method.

### 4.2 Tackling the noise

We first make the following sub-Gaussian assumption on the noise. {assumption} We assume that the noise $\xi$ satisfies

$$\mathbb{P}\left[|\xi| > t\right] \le c_0\,e^{-t^2/\sigma^2} \qquad \forall\, t \ge \tau_0. \qquad (4.14)$$

Here $c_0$, $\sigma$ and $\tau_0$ are constants.

In the presence of noise, the population risk can be decomposed into

$$L(\theta) = \mathbb{E}_x\left(f(x;\theta) - f^*(x)\right)^2 + \mathbb{E}[\xi^2]. \qquad (4.15)$$

This suggests that, in spite of the noise, minimizing $L(\theta)$ is equivalent to minimizing $\mathbb{E}_x(f(x;\theta) - f^*(x))^2$, and the latter is what we really want to minimize. However, due to the noise, $y$ might be unbounded, so we cannot directly use the generalization bound in Theorem 5.1. To address this issue, we consider the truncated risks defined as follows:

$$L_B(\theta) = \mathbb{E}_{x,y}\left[\ell(f(x;\theta), y) \wedge \frac{B^2}{2}\right], \qquad \hat L_B(\theta) = \frac{1}{n}\sum_{i=1}^{n} \ell(f(x_i;\theta), y_i) \wedge \frac{B^2}{2}.$$

Let $B_n$ denote a truncation level that grows at most logarithmically in $n$. For the noisy case, we consider the following regularized risk:

$$J_\lambda(\theta) := \hat L_{B_n}(\theta) + \lambda B_n\left(\|\theta\|_{\mathcal{P}} + 1\right). \qquad (4.16)$$

The corresponding regularized estimator is again given by $\hat\theta_{n,\lambda} = \operatorname{argmin}_\theta J_\lambda(\theta)$; here for simplicity we have slightly abused notation.
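The truncated empirical risk is a one-liner in code (a sketch; names are ours):

```python
import numpy as np

def truncated_empirical_risk(preds, y, B):
    """hat L_B = mean of min(loss, B^2 / 2), with l(y, y') = (y - y')^2 / 2."""
    losses = 0.5 * (np.asarray(preds) - np.asarray(y)) ** 2
    return float(np.mean(np.minimum(losses, B ** 2 / 2.0)))
```

The truncation caps the contribution of any single noisy example at $B^2/2$, which restores the boundedness needed by the Rademacher-complexity argument.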

{theorem}

[Main result, noisy case] Assume that the target function $f^* \in \mathcal{B}_2(X)$ and that $\lambda \gtrsim \sqrt{\ln(2d)/n}$. Then for any $\delta \in (0,1)$, with probability at least $1-\delta$ over the choice of the training set, we have

$$\mathbb{E}_x\left|f(x;\hat\theta_{n,\lambda}) - f^*(x)\right|^2 \lesssim \frac{\gamma_2^2(f^*)}{m} + \lambda B_n\,\hat\gamma_2(f^*) + \frac{B_n^2}{\sqrt n}\left(\hat\gamma_2(f^*) + \sqrt{\ln(n/\delta)}\right) + \frac{B_n^2}{\sqrt n}\left(c_0\sigma^2 + \sqrt{\frac{\mathbb{E}[\xi^2]}{n^{1/2}\lambda}}\right).$$

Compared to Theorem 4, the noise introduces at most several logarithmic terms. The case with no noise corresponds to the situation with $\sigma = 0$.

### 4.3 Extension to classification problems

Let us consider the simplest setting, the binary classification problem, where $y \in \{0, 1\}$. In this case, $f^*(x) = \mathbb{P}[y = 1\,|\,x]$ denotes the probability of $y = 1$ given $x$. Given $f(x;\hat\theta_{n,\lambda})$ and $f^*(x)$, the corresponding plug-in classifiers are defined by $\hat\eta(x) = \mathbf{1}_{\{f(x;\hat\theta_{n,\lambda}) \ge 1/2\}}$ and $\eta^*(x) = \mathbf{1}_{\{f^*(x) \ge 1/2\}}$, respectively. $\eta^*$ is the optimal Bayes classifier.
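The plug-in construction and the 0-1 risk are straightforward in code (a sketch; names are ours):

```python
import numpy as np

def plug_in_classifier(f_vals, threshold=0.5):
    """Plug-in classifier: predict 1 where the regression estimate >= 1/2."""
    return (np.asarray(f_vals) >= threshold).astype(int)

def zero_one_risk(preds, y):
    """Empirical 0-1 risk: fraction of misclassified examples."""
    return float(np.mean(np.asarray(preds) != np.asarray(y)))
```

Thresholding the regression estimate at $1/2$ converts the population-risk bound for regression into a classification guarantee, as in the corollary below.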

For a classifier $\eta$, we measure its performance by the 0-1 risk $\mathcal{E}(\eta) := \mathbb{P}[\eta(x) \ne y]$. {corollary} Under the same assumptions as in Theorem 4.2 and taking $\lambda \asymp \sqrt{\ln(2d)/n}$, for any $\delta \in (0,1)$, with probability at least $1-\delta$, we have

$$\mathcal{E}(\hat\eta) \lesssim \mathcal{E}(\eta^*) + \frac{\gamma_2(f^*)}{\sqrt m} + \frac{\hat\gamma_2^{1/2}(f^*)\,\ln^{1/4}(d) + \ln^{1/4}(n/\delta)}{n^{1/4}}.$$
{proof}

According to Theorem 2.2 of [16], we have

$$\mathcal{E}(\hat\eta) - \mathcal{E}(\eta^*) \le 2\,\mathbb{E}\left[|f(x;\hat\theta_{n,\lambda}) - f^*(x)|\right] \le 2\sqrt{\mathbb{E}\left[|f(x;\hat\theta_{n,\lambda}) - f^*(x)|^2\right]}. \qquad (4.17)$$

In this case, $|y|$ is bounded by 1, thus $B_n \lesssim 1$. Applying Theorem 4.2 yields the result.

The above corollary suggests that our a priori estimates also hold for classification problems, although the error rate only scales as $n^{-1/4}$. It is possible to improve the rate with a more delicate analysis. One potential way is to develop a better estimate specific to the 0-1 loss, as can be seen from inequality (4.17). Another way is to make a stronger assumption on the data. For example, we can assume that there exists $c > 0$ such that $|f^*(x) - 1/2| \ge c$ almost surely, in which case the Bayes error can be controlled. We leave these to future work.

## 5 Proofs

### 5.1 Bounding the generalization gap

{definition}

[Rademacher complexity] Let $\mathcal{F}$ be a hypothesis space, i.e. a set of functions. The Rademacher complexity of $\mathcal{F}$ with respect to samples $S = (z_1, \dots, z_n)$ is defined as $\hat R_n(\mathcal{F}) := \frac{1}{n}\mathbb{E}_\varepsilon\left[\sup_{f\in\mathcal{F}}\sum_{i=1}^{n}\varepsilon_i f(z_i)\right]$, where the $\varepsilon_i$ are i.i.d. random variables with $\mathbb{P}(\varepsilon_i = +1) = \mathbb{P}(\varepsilon_i = -1) = \tfrac{1}{2}$. The generalization gap can be estimated via the Rademacher complexity by the following theorem [39]. {theorem} Fix a hypothesis space $\mathcal{F}$. Assume that $|f(z)| \le B$ for any $f \in \mathcal{F}$ and any $z$. Then for any $\delta \in (0,1)$, with probability at least $1-\delta$ over the choice of $S$, we have

$$\left|\frac{1}{n}\sum_{i=1}^{n} f(z_i) - \mathbb{E}_z[f(z)]\right| \le 2\,\mathbb{E}_S[\hat R_n(\mathcal{F})] + B\sqrt{\frac{2\ln(2/\delta)}{n}}.$$

Let $\mathcal{F}_Q$ denote the set of all two-layer networks with path norm bounded by $Q$. It was proved in [35] that

$$\hat R_n(\mathcal{F}_Q) \le 2Q\sqrt{\frac{2\ln(2d)}{n}}. \qquad (5.18)$$

By combining the above result with Theorem 5.1, we obtain the following a posteriori bound on the generalization gap for two-layer neural networks. The proof is deferred to Appendix B.

{theorem}

[A posteriori generalization bound] Assume that the loss function $\ell(\cdot, y)$ is $A$-Lipschitz continuous and bounded by $B$. Then for any $\delta \in (0,1)$, with probability at least $1-\delta$ over the choice of the training set, we have, for any two-layer network $f(x;\theta)$,

$$|L(\theta) - \hat L_n(\theta)| \le 4A\sqrt{\frac{2\ln(2d)}{n}}\left(\|\theta\|_{\mathcal{P}} + 1\right) + B\sqrt{\frac{2\ln\left(2c\,(\|\theta\|_{\mathcal{P}} + 1)^2/\delta\right)}{n}}, \qquad (5.19)$$

where $c$ is a universal constant.

We see that the generalization gap is bounded roughly by $(\|\theta\|_{\mathcal{P}} + 1)/\sqrt n$ up to some logarithmic terms.
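To see why solutions with huge path norms give vacuous bounds while small-norm solutions do not, one can evaluate the right-hand side of (5.19) numerically. A sketch (the constants $A$, $B$ and $c$ must be supplied; the default values below are purely illustrative):

```python
import numpy as np

def generalization_gap_bound(pnorm, n, d, delta, A=1.0, B=1.0, c=2.0):
    """Numeric value of the a posteriori bound (5.19) on |L - hat L_n|.

    pnorm : path norm ||theta||_P of the network
    n, d  : number of samples and input dimension
    delta : failure probability
    """
    t1 = 4 * A * np.sqrt(2 * np.log(2 * d) / n) * (pnorm + 1)
    t2 = B * np.sqrt(2 * np.log(2 * c * (pnorm + 1) ** 2 / delta) / n)
    return float(t1 + t2)
```

Since the dominant term grows linearly in the path norm, a solution whose norm is four orders of magnitude larger yields a bound that is four orders of magnitude looser, which is exactly the gap between the regularized and un-regularized models in Section 6.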

### 5.2 Proof for the noiseless case

The intuition is as follows. The path norm of the special solution which achieves the optimal approximation error is independent of the network width, and this norm can also be used to bound the generalization gap (Theorem 5.1). Therefore, if the path norm is suitably penalized during training, we should be able to control the generalization gap without harming the approximation accuracy.

We first estimate the regularized risk of the network $\tilde\theta$ constructed in Theorem 3.2. {proposition} Let $\tilde\theta$ be the network constructed in Theorem 3.2, and let $\lambda_n$ denote the coefficient of the first term in (5.19); assume $\lambda \ge \lambda_n$. Then with probability at least $1-\delta$, we have

$$J_\lambda(\tilde\theta) \le L(\tilde\theta) + 8\lambda\,\hat\gamma_2(f^*) + 2\sqrt{\frac{2\ln(2c/\delta)}{n}}. \qquad (5.21)$$
{proof}

First, the loss is Lipschitz continuous and bounded on the domain of interest. According to Definition 3 and the property that $\|\tilde\theta\|_{\mathcal{P}} \le 2\gamma_2(f^*)$, the regularized risk of $\tilde\theta$ satisfies

$$J_\lambda(\tilde\theta) = \hat L_n(\tilde\theta) + \lambda\left(\|\tilde\theta\|_{\mathcal{P}} + 1\right) \le L(\tilde\theta) + (\lambda_n + \lambda)\left(\|\tilde\theta\|_{\mathcal{P}} + 1\right) + 2\sqrt{\frac{2\ln\left(2c\,(\|\tilde\theta\|_{\mathcal{P}} + 1)^2/\delta\right)}{n}} \le L(\tilde\theta) + 6\lambda\,\hat\gamma_2(f^*) + 2\sqrt{\frac{2\ln\left(2c\,(1 + 2\gamma_2(f^*))^2/\delta\right)}{n}}. \qquad (5.22)$$

The last term can be simplified by using $\sqrt{a + b} \le \sqrt a + \sqrt b$ and $\ln(1 + t) \le t$ for $t \ge 0$. So we have

$$\sqrt{2\ln\left(2c\,(1 + 2\gamma_2(f^*))^2/\delta\right)} \le \sqrt{2\ln(2c/\delta)} + 3\,\hat\gamma_2(f^*).$$

Plugging this into Equation (5.22) completes the proof.

{proposition}

[Properties of regularized solutions] The regularized estimator $\hat\theta_{n,\lambda}$ satisfies:

$$J_\lambda(\hat\theta_{n,\lambda}) \le J_\lambda(\tilde\theta), \qquad \|\hat\theta_{n,\lambda}\|_{\mathcal{P}} \le \frac{L(\tilde\theta)}{\lambda} + 8\,\hat\gamma_2(f^*) + \frac{1}{2}\sqrt{\ln(2c/\delta)}.$$
{proof}

The first claim follows from the definition of $\hat\theta_{n,\lambda}$. For the second claim, note that

$$\lambda\left(\|\hat\theta_{n,\lambda}\|_{\mathcal{P}} + 1\right) \le J_\lambda(\hat\theta_{n,\lambda}) \le J_\lambda(\tilde\theta).$$

Applying Proposition 5.2 completes the proof. {remark} The above proposition establishes the connection between the regularized solution and the special solution constructed in Proposition 3.2. In particular, by taking $\lambda \asymp n^{-1/2}$, the generalization gap of the regularized solution is bounded by that of the special solution up to some constant. This suggests that our regularization term is appropriate, and it forces the generalization gap to be roughly of the same order as the approximation error.

{proof}

(Proof of Theorem 4) Now we are ready to prove the main result. Following the a posteriori generalization bound given in Theorem 5.1, we have, with probability at least $1-\delta$,

$$L(\hat\theta_{n,\lambda}) \le \hat L_n(\hat\theta_{n,\lambda}) + \lambda_n\left(\|\hat\theta_{n,\lambda}\|_{\mathcal{P}} + 1\right) + 3Q_n \overset{(1)}{\le} J_\lambda(\hat\theta_{n,\lambda}) + 3Q_n,$$

where $Q_n := \sqrt{2\ln\left(2c\,(\|\hat\theta_{n,\lambda}\|_{\mathcal{P}} + 1)^2/\delta\right)/n}$. The inequality (1) is due to the choice $\lambda \ge \lambda_n$. The first term can be bounded by $J_\lambda(\tilde\theta)$, which is controlled by Proposition 5.2. It remains to bound $Q_n$:

$$\sqrt n\,Q_n \le \sqrt{\ln(2nc/\delta)} + \sqrt{2\ln\left(1 + n^{-1/2}\|\hat\theta_{n,\lambda}\|_{\mathcal{P}}\right)} \le \sqrt{\ln(2nc/\delta)} + \sqrt{2\,\|\hat\theta_{n,\lambda}\|_{\mathcal{P}}/\sqrt n}.$$

By Proposition 5.2, we have

$$\sqrt{\frac{2\|\hat\theta_{n,\lambda}\|_{\mathcal{P}}}{\sqrt n}} \le \sqrt{\frac{2\left(L(\tilde\theta)/\lambda + 8\,\hat\gamma_2(f^*) + 0.5\sqrt{\ln(2c/\delta)}\right)}{\sqrt n}} \le \sqrt{\frac{2L(\tilde\theta)}{\lambda n^{1/2}}} + \frac{3\,\hat\gamma_2(f^*)}{n^{1/4}} + \left(\frac{\ln(1/\delta)}{n}\right)^{1/4}.$$

Thus after some simplification, we obtain

$$Q_n \le 2\sqrt{\frac{\ln(n/\delta)}{n}} + \sqrt{\frac{2L(\tilde\theta)}{\lambda n^{3/2}}} + \frac{3\,\hat\gamma_2(f^*)}{\sqrt n}. \qquad (5.23)$$

By combining Equation (5.21) and  (5.23), we obtain

$$L(\hat\theta_{n,\lambda}) \lesssim L(\tilde\theta) + 8\lambda\,\hat\gamma_2(f^*) + \frac{3}{\sqrt n}\left(\sqrt{\frac{L(\tilde\theta)}{n^{1/2}\lambda}} + \hat\gamma_2(f^*) + \sqrt{\ln(n/\delta)}\right).$$

Applying $L(\tilde\theta) \le 3\gamma_2^2(f^*)/m$ from Theorem 3.2 completes the proof.

### 5.3 Proof for the noisy case

We need the following lemma. The proof is deferred to Appendix D. {lemma} Under Assumption 4.2, we have

$$\sup_\theta\left|L(\theta) - L_{B_n}(\theta)\right| \le \frac{2c_0\sigma^2}{\sqrt n}.$$

Therefore we have,

$$L(\theta) = L(\theta) - L_{B_n}(\theta) + L_{B_n}(\theta) \le \frac{2c_0\sigma^2}{\sqrt n} + L_{B_n}(\theta).$$

This suggests that as long as we can bound the truncated population risk, the original risk will be bounded accordingly.

{proof}

(Proof of Theorem 4.2) The proof is almost the same as in the noiseless case. The truncated loss function is Lipschitz continuous and bounded by $B_n^2/2$. By analogy with the proof of Proposition 5.2, we obtain that with probability at least $1-\delta$ the following inequality holds:

$$J_\lambda(\tilde\theta) \le L_{B_n}(\tilde\theta) + 8B_n\lambda\,\hat\gamma_2(f^*) + B_n^2\sqrt{\frac{\ln(2c/\delta)}{n}}. \qquad (5.24)$$

Following the proof of Proposition 5.2, we similarly obtain $J_\lambda(\hat\theta_{n,\lambda}) \le J_\lambda(\tilde\theta)$ and

$$\|\hat\theta_{n,\lambda}\|_{\mathcal{P}} \le \frac{L_{B_n}(\tilde\theta)}{B_n\lambda} + 8\,\hat\gamma_2(f^*) + \frac{B_n}{2}\sqrt{\ln(2c/\delta)}. \qquad (5.25)$$

Following the proof of Theorem 4, we have

$$L_{B_n}(\hat\theta_{n,\lambda}) \le J_\lambda(\tilde\theta) + 2B_n^2\sqrt{2\ln\left(2c\,(1 + \|\hat\theta_{n,\lambda}\|_{\mathcal{P}})^2/\delta\right)/n}. \qquad (5.26)$$

Plugging (5.24) and (5.25) into (5.26), we get

$$L_{B_n}(\hat\theta_{n,\lambda}) \le L_{B_n}(\tilde\theta) + 8B_n\lambda\,\hat\gamma_2(f^*) + \frac{3B_n^2}{\sqrt n}\left(\sqrt{\frac{L_{B_n}(\tilde\theta)}{n^{1/2}\lambda}} + \hat\gamma_2(f^*) + \sqrt{\ln(n/\delta)}\right).$$

Using Lemma 5.3 and the decomposition (4.15), we complete the proof.

## 6 Numerical Experiments

In this section, we evaluate the regularized model using numerical experiments. We consider two datasets, MNIST and CIFAR-10. Each example in MNIST is a grayscale image, while each example in CIFAR-10 is a color image. To obtain binary classification problems, for MNIST we map the first five digit classes to label 0 and the remaining five to label 1, and for CIFAR-10 we select the examples from two of the ten classes to construct our new training and validation sets.

The two-layer ReLU network is randomly initialized. We train the regularized models using the Adam optimizer [26], unless specified otherwise, with an initial learning rate that is decayed twice by a constant factor during training. We set the trade-off parameter $\lambda$ to a fixed value.

### 6.1 Sharper bounds for the generalization gap

Theorem 5.1 shows that the generalization gap is bounded by $(\|\theta\|_{\mathcal{P}} + 1)/\sqrt n$ up to some logarithmic terms. Previous works [34, 18] showed that (stochastic) gradient descent tends to find solutions with huge norms, causing the a posteriori bound to be vacuous. In contrast, our theory suggests that there exist good solutions (i.e. solutions with small generalization error) with small norms, and that these solutions can be found through explicit regularization.

To see how this works in practice, we trained both the regularized models and the un-regularized models ($\lambda = 0$) for a fixed network width of 10,000. To cover the over-parametrized regime, we also consider the case where the network width exceeds the number of training examples. The results are summarized in Table 3.

As we can see, the test accuracies of the regularized and un-regularized solutions are generally comparable, but the values of the path norm, which serves as an upper bound for the generalization gap, are drastically different. The bounds for the un-regularized models are always vacuous, as was observed in [18, 34, 4]. In contrast, the bounds for the regularized models are always several orders of magnitude smaller than those for the un-regularized models. This is consistent with the theoretical prediction of Proposition 5.2.

To further explore the impact of over-parametrization, we trai