# Asymptotic Properties of Neural Network Sieve Estimators

Xiaoxi Shen (shenxia4@stt.msu.edu), Chang Jiang (cjiang@epi.msu.edu), Lyudmila Sakhanenko (luda@stt.msu.edu), and Qing Lu (qlu@epi.msu.edu)

Department of Statistics and Probability, Michigan State University, East Lansing, MI 48824, USA

Department of Epidemiology and Biostatistics, Michigan State University, East Lansing, MI 48824, USA

###### Abstract

Neural networks are among the most widely used methods in machine learning and artificial intelligence. By the universal approximation theorem (Hornik, Stinchcombe and White, 1989), a neural network with one hidden layer can approximate any continuous function on a compact support, provided the number of hidden units is sufficiently large. Statistically, a neural network can be cast as a nonlinear regression model. If it is treated parametrically, however, the unidentifiability of the parameters makes it difficult to derive its asymptotic properties. Instead, we consider the estimation problem in a nonparametric regression framework and use results from sieve estimation to establish the consistency, the rates of convergence and the asymptotic normality of the neural network estimators. We also illustrate the validity of the theory via simulations.

Keywords: Empirical Processes, Entropy Integral

Correspondence should be addressed to Qing Lu.

## 1 Introduction

With the success of machine learning and artificial intelligence in research and industry, neural networks have become widely used. Many recently developed machine learning methods are based on deep neural networks and have achieved high classification and prediction accuracy. We refer interested readers to Goodfellow et al. (2016) for more background and details. In classical statistical learning theory, the consistency and the rate of convergence of the empirical risk minimization principle are of great interest. Many upper bounds have been established for the empirical risk and the sample complexity based on the growth function and the Vapnik-Chervonenkis dimension (see, for example, Vapnik (1998); Anthony and Bartlett (2009); Devroye, Györfi and Lugosi (2013)). However, few studies have focused on the asymptotic properties of neural networks. As Thomas J. Sargent said, “artificial intelligence is actually statistics, but in a very gorgeous phrase, it is statistics.” It is therefore natural and worthwhile to explore whether a neural network model possesses nice asymptotic properties. If it does, it may be possible to conduct statistical inference based on neural networks. Throughout this paper, we focus on the asymptotic properties of neural networks with one hidden layer.

In statistics, fitting a neural network with one hidden layer can be viewed as a parametric nonlinear regression problem:

$$y_i=\alpha_0+\sum_{j=1}^{r}\alpha_j\sigma\left(\boldsymbol{\gamma}_j^T\mathbf{x}_i+\gamma_{0,j}\right)+\epsilon_i,$$

where $\epsilon_1,\ldots,\epsilon_n$ are i.i.d. random errors with $E[\epsilon]=0$ and $\mathrm{Var}[\epsilon]=\sigma^2<\infty$, and $\sigma(\cdot)$ is an activation function, for example the sigmoid $\sigma(z)=1/(1+e^{-z})$, which will be the main focus of this paper. White and Racine (2001) obtained the asymptotic distribution of the resulting estimators under the assumption that the true parameters are unique. In fact, the authors implicitly assumed that the number of hidden units is known. However, even if we assume that the number of hidden units is known, it is difficult to establish the asymptotic properties of the parameter estimators. In Section 6.1, we conduct a simulation based on a single-layer neural network with 2 hidden units. Even for such a simple model, the simulation results suggest that consistent estimators are unlikely to be obtained. Moreover, since the number of hidden units is usually unknown in practice, this assumption is easily violated. For example, as pointed out in Fukumizu (1996) and Fukumizu et al. (2003), if the true function is $f_0(\mathbf{x})=\alpha\sigma(\boldsymbol{\gamma}^T\mathbf{x})$, that is, the true number of hidden units is 1, and we fit the model using a neural network with two hidden units, then any parameter in the high-dimensional set

$$\left\{\boldsymbol{\theta}:\boldsymbol{\gamma}_1=\boldsymbol{\gamma},\ \alpha_1=\alpha,\ \gamma_{0,1}=\gamma_{0,2}=\alpha_2=\alpha_0=0\right\}\cup\left\{\boldsymbol{\theta}:\boldsymbol{\gamma}_1=\boldsymbol{\gamma}_2=\boldsymbol{\gamma},\ \gamma_{0,1}=\gamma_{0,2}=\alpha_0=0,\ \alpha_1+\alpha_2=\alpha\right\}$$

realizes the true function $f_0$. Therefore, when the number of hidden units is unknown, the parameters in this parametric nonlinear regression problem are unidentifiable. Theorem 1 in Wu (1981) showed that a necessary condition for the weak consistency of nonlinear least squares estimators is that

$$\sum_{i=1}^{n}\left[f(\mathbf{x}_i,\boldsymbol{\theta})-f(\mathbf{x}_i,\boldsymbol{\theta}')\right]^2\to\infty,\quad\text{as } n\to\infty,$$

for all $\boldsymbol{\theta}\neq\boldsymbol{\theta}'$ in the parameter space, as long as the error distribution has finite Fisher information. This condition implies that when the parameters are not identifiable, the resulting nonlinear least squares estimators are inconsistent, which hinders further exploration of the asymptotic properties of neural network estimators. Liu and Shao (2003) and Zhu and Zhang (2006) proposed techniques for hypothesis testing under loss of identifiability. However, their theoretical results are not easy to implement in the neural network setting.
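The loss of identifiability described above can be checked numerically. The following is a minimal sketch, assuming a logistic sigmoid and hypothetical values `alpha = 2.0` and `gamma = [1.5, -0.5]` for the one-unit true function; the helper names `sigmoid`, `network` and `f0` are illustrative only. It evaluates three distinct two-unit parameter vectors taken from the set above and confirms that they all reproduce the same function on a grid.

```python
import math

def sigmoid(z):
    """Logistic sigmoid activation."""
    return 1.0 / (1.0 + math.exp(-z))

def network(x, alpha0, alphas, gammas, gamma0s):
    """Evaluate alpha0 + sum_j alpha_j * sigmoid(gamma_j . x + gamma0_j)."""
    out = alpha0
    for a, g, g0 in zip(alphas, gammas, gamma0s):
        out += a * sigmoid(sum(gi * xi for gi, xi in zip(g, x)) + g0)
    return out

# Hypothetical true function with ONE hidden unit: f0(x) = alpha * sigmoid(gamma . x).
alpha, gamma = 2.0, [1.5, -0.5]

def f0(x):
    return alpha * sigmoid(sum(gi * xi for gi, xi in zip(gamma, x)))

# Three different two-unit parameter vectors from the set in the text.
candidates = [
    # First component of the set: alpha_2 = 0, so gamma_2 is completely free.
    dict(alpha0=0.0, alphas=[alpha, 0.0], gammas=[gamma, [3.0, 7.0]], gamma0s=[0.0, 0.0]),
    dict(alpha0=0.0, alphas=[alpha, 0.0], gammas=[gamma, [-9.0, 0.1]], gamma0s=[0.0, 0.0]),
    # Second component: gamma_1 = gamma_2 = gamma and alpha_1 + alpha_2 = alpha.
    dict(alpha0=0.0, alphas=[0.3, 1.7], gammas=[gamma, gamma], gamma0s=[0.0, 0.0]),
]

# Largest deviation from f0 over a 5 x 5 grid on [-1, 1]^2.
grid = [[-1.0 + 0.5 * i, -1.0 + 0.5 * j] for i in range(5) for j in range(5)]
max_gap = max(abs(network(x, **p) - f0(x)) for p in candidates for x in grid)
print(max_gap)  # zero up to floating-point rounding
```

Since all three parameter vectors give the same fitted values, the sum in Wu's condition stays at zero for these pairs rather than diverging.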

Even though a function can have different neural network parametrizations, the function itself can be considered as unique. Moreover, due to the Universal Approximation Theorem (Hornik, Stinchcombe and White, 1989), any continuous function on a compact support can be approximated arbitrarily well by a neural network with one hidden layer. So it seems natural to consider it as a nonparametric regression problem and approximate the underlying function class through a class of neural networks with one hidden layer. Specifically, suppose that the true nonparametric regression model is

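$$y_i=f_0(\mathbf{x}_i)+\epsilon_i,$$

where $\epsilon_1,\ldots,\epsilon_n$ are i.i.d. random variables defined on a complete probability space $(\Omega,\mathcal{A},P)$ with $E[\epsilon]=0$, $\mathrm{Var}[\epsilon]=\sigma^2<\infty$; $\mathbf{x}_1,\ldots,\mathbf{x}_n\in\mathcal{X}$ are vectors of covariates with $\mathcal{X}$ being a compact set in $\mathbb{R}^d$; and $f_0$ is an unknown function to be estimated. We assume that $f_0\in\mathcal{F}$, where $\mathcal{F}$ is the class of continuous functions with compact supports. Clearly, $f_0$ minimizes the population criterion function

$$Q_n(f)=E\left[\frac{1}{n}\sum_{i=1}^{n}\left(y_i-f(\mathbf{x}_i)\right)^2\right]=\frac{1}{n}\sum_{i=1}^{n}\left(f(\mathbf{x}_i)-f_0(\mathbf{x}_i)\right)^2+\sigma^2.$$

A least squares estimator of the regression function can be obtained by minimizing the empirical squared error loss $\mathbb{Q}_n(f)$:

$$\hat{f}_n=\operatorname*{arg\,min}_{f\in\mathcal{F}}\mathbb{Q}_n(f)=\operatorname*{arg\,min}_{f\in\mathcal{F}}\frac{1}{n}\sum_{i=1}^{n}\left(y_i-f(\mathbf{x}_i)\right)^2.$$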

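However, if the class of functions $\mathcal{F}$ is too rich, the resulting least squares estimator may have undesired properties such as inconsistency (van de Geer, 2000; Shen and Wong, 1994; Shen, 1997). Instead, we can optimize the squared error loss over some less complex function space $\mathcal{F}_n$, which approximates $\mathcal{F}$ with an approximation error that tends to 0 as the sample size increases. In the language of Grenander (1981), such a sequence of function classes is known as a sieve. More precisely, we consider a sequence of function classes,

$$\mathcal{F}_1\subseteq\mathcal{F}_2\subseteq\cdots\subseteq\mathcal{F}_n\subseteq\mathcal{F}_{n+1}\subseteq\cdots\subseteq\mathcal{F},$$

approximating $\mathcal{F}$ in the sense that $\bigcup_{n=1}^{\infty}\mathcal{F}_n$ is dense in $\mathcal{F}$. In other words, for each $f\in\mathcal{F}$, there exists $\pi_nf\in\mathcal{F}_n$ such that $d(f,\pi_nf)\to0$ as $n\to\infty$, where $d(\cdot,\cdot)$ is some pseudo-metric defined on $\mathcal{F}$. With some abuse of notation, an approximate sieve estimator $\hat{f}_n$ is defined to be any element of $\mathcal{F}_n$ satisfying

$$\mathbb{Q}_n(\hat{f}_n)\le\inf_{f\in\mathcal{F}_n}\mathbb{Q}_n(f)+O_p(\eta_n),\tag{1.1}$$

where $\eta_n\to0$ as $n\to\infty$.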

Throughout the rest of the paper, we focus on the sieve of neural networks with one hidden layer and sigmoid activation function. Specifically, we let

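$$\mathcal{F}_{r_n}=\left\{\alpha_0+\sum_{j=1}^{r_n}\alpha_j\sigma\left(\boldsymbol{\gamma}_j^T\mathbf{x}+\gamma_{0,j}\right):\ \boldsymbol{\gamma}_j\in\mathbb{R}^d,\ \alpha_j,\gamma_{0,j}\in\mathbb{R},\ \sum_{j=0}^{r_n}|\alpha_j|\le V_n\text{ for some }V_n>4\text{ and }\max_{1\le j\le r_n}\sum_{i=0}^{d}|\gamma_{i,j}|\le M_n\text{ for some }M_n>0\right\},$$

where $r_n,V_n,M_n\uparrow\infty$ as $n\to\infty$. This approach has been discussed in the literature (see, for example, White (1989) and White (1990)). In those papers, consistency of the neural network sieve estimators was established under a random design. However, there are few results on the asymptotic distribution of neural network sieve estimators, which is established in this paper. Moreover, throughout this paper, we focus on the fixed design. Hornik, Stinchcombe and White (1989) showed that $\bigcup_n\mathcal{F}_{r_n}$ is dense in $\mathcal{F}$ under the sup-norm. When considering the asymptotic properties of the sieve estimators, however, we use the pseudo-norm $\|f\|_n^2=\frac{1}{n}\sum_{i=1}^{n}f^2(\mathbf{x}_i)$ (see Proposition 6.1 in the Appendix) defined on $\mathcal{F}$ and $\mathcal{F}_{r_n}$.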
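To make the two constraints that define the sieve concrete, here is a minimal sketch of a membership check. The function name `in_sieve` and the parameter layout (each row of `gammas` stores the bias $\gamma_{0,j}$ followed by the $d$ weights of unit $j$) are assumptions made for illustration; they are not taken from the paper.

```python
def in_sieve(alphas, gammas, V_n, M_n):
    """Check the two l1 constraints that define the neural network sieve.

    alphas: [alpha_0, alpha_1, ..., alpha_r]            (output bias and weights)
    gammas: r rows, each [gamma_{0,j}, gamma_{1,j}, ..., gamma_{d,j}]
    """
    if V_n <= 4 or M_n <= 0:
        raise ValueError("the sieve requires V_n > 4 and M_n > 0")
    if len(alphas) != len(gammas) + 1:
        raise ValueError("need one output weight per hidden unit plus a bias")
    l1_alpha = sum(abs(a) for a in alphas)                          # sum_{j=0}^{r} |alpha_j|
    max_l1_gamma = max(sum(abs(g) for g in row) for row in gammas)  # max_j sum_{i=0}^{d} |gamma_{i,j}|
    return l1_alpha <= V_n and max_l1_gamma <= M_n

# A network with r = 2 hidden units and d = 2 inputs.
alphas = [0.5, 2.0, -1.5]                      # l1 norm 4.0
gammas = [[0.1, 1.0, -2.0], [0.0, 0.5, 0.5]]   # row l1 norms 3.1 and 1.0

small = in_sieve(alphas, gammas, V_n=5.0, M_n=3.0)   # hidden-layer constraint fails
large = in_sieve(alphas, gammas, V_n=5.0, M_n=4.0)   # both constraints hold
print(small, large)
```

Enlarging $V_n$ or $M_n$ can only admit more networks, which is the nesting behind the sieve construction.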

In section 2, we discuss the existence of neural network sieve estimators. The weak consistency and rate of convergence of the neural network sieve estimators will be established in section 3 and section 4, respectively. Section 5 focuses on the asymptotic distribution of the neural network sieve estimators. Simulation results are presented in section 6.

Notation: Throughout the rest of the paper, bold font alphabetic letters and Greek letters are vectors. $C(\mathcal{X})$ is the set of continuous functions defined on $\mathcal{X}$. The symbol $\lesssim$ means “is bounded above up to a universal constant” and means as . For a pseudo-metric space $(T,d)$, $N(\epsilon,T,d)$ is its covering number, that is, the minimum number of $\epsilon$-balls needed to cover $T$. Its natural logarithm is the entropy number and is denoted by $H(\epsilon,T,d)$.

## 2 Existence

A natural question to ask is whether the sieve estimator based on neural networks exists. Before discussing this question, we first study some properties of $\mathcal{F}_{r_n}$. Proposition 2.1 shows that the sigmoid function is a Lipschitz function with Lipschitz constant $1/4$.

###### Proposition 2.1.

The sigmoid function $\sigma(z)=1/(1+e^{-z})$ is a Lipschitz function on $\mathbb{R}$ with Lipschitz constant $1/4$.

###### Proof.

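For all $z_1,z_2\in\mathbb{R}$ with $z_1<z_2$, $\sigma$ is continuous on $[z_1,z_2]$ and differentiable on $(z_1,z_2)$. Note that

$$\sigma'(z)=\sigma(z)\left(1-\sigma(z)\right)\le\frac{1}{4}\quad\forall z\in\mathbb{R}.$$

By the Mean Value Theorem, we know that

$$\sigma(z_1)-\sigma(z_2)=\sigma'\left(\lambda z_1+(1-\lambda)z_2\right)(z_1-z_2),$$

for some $\lambda\in(0,1)$. Hence

$$|\sigma(z_1)-\sigma(z_2)|=\left|\sigma'\left(\lambda z_1+(1-\lambda)z_2\right)\right||z_1-z_2|\le\frac{1}{4}|z_1-z_2|,$$

which means that $\sigma$ is a Lipschitz function on $\mathbb{R}$ with Lipschitz constant $1/4$. ∎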
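The Lipschitz bound can be sanity-checked numerically. This is a small sketch, assuming the logistic sigmoid and an arbitrary evaluation grid on $[-10,10]$; it is an illustration of the proposition, not a proof.

```python
import math

def sigmoid(z):
    """Logistic sigmoid activation."""
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    """Derivative sigma'(z) = sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

# Grid of points on [-10, 10] with spacing 0.25 (an arbitrary choice).
grid = [-10.0 + 0.25 * k for k in range(81)]

# Largest difference quotient |sigma(z1) - sigma(z2)| / |z1 - z2| over all distinct pairs.
worst_ratio = max(
    abs(sigmoid(z1) - sigmoid(z2)) / abs(z1 - z2)
    for i, z1 in enumerate(grid)
    for z2 in grid[i + 1:]
)

print(sigmoid_prime(0.0))  # the derivative attains its maximum 1/4 at z = 0
print(worst_ratio)         # stays below 1/4
```

The largest quotient occurs for pairs that straddle $z=0$, where the derivative is maximal.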

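The second proposition provides an upper bound for the envelope function $\sup_{f\in\mathcal{F}_{r_n}}|f|$.

###### Proposition 2.2.

For each fixed $n$,

$$\sup_{f\in\mathcal{F}_{r_n}}\|f\|_\infty\le V_n.$$

###### Proof.

For any $f\in\mathcal{F}_{r_n}$ with $n$ fixed, note that for all $\mathbf{x}\in\mathcal{X}$, we have

$$|f(\mathbf{x})|=\left|\alpha_0+\sum_{j=1}^{r_n}\alpha_j\sigma\left(\boldsymbol{\gamma}_j^T\mathbf{x}+\gamma_{0,j}\right)\right|\le|\alpha_0|+\sum_{j=1}^{r_n}|\alpha_j|\,\sigma\left(\boldsymbol{\gamma}_j^T\mathbf{x}+\gamma_{0,j}\right)\le\sum_{j=0}^{r_n}|\alpha_j|\le V_n.$$

Since the right-hand side does not depend on $\mathbf{x}$ and $f$, we get

$$\sup_{f\in\mathcal{F}_{r_n}}\|f\|_\infty=\sup_{f\in\mathcal{F}_{r_n}}\sup_{\mathbf{x}\in\mathcal{X}}|f(\mathbf{x})|\le V_n.$$

∎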
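As an illustration of the envelope bound, the sketch below draws random networks on the boundary of the sieve constraints and records the largest absolute value they take. The values `V_n = 6`, `M_n = 10`, `r_n = 5`, `d = 3`, the domain $[-1,1]^d$ and the sampling scheme are all assumptions chosen for the example.

```python
import math
import random

def sigmoid(z):
    """Logistic sigmoid activation."""
    return 1.0 / (1.0 + math.exp(-z))

def network(x, alphas, gammas):
    """alphas = [alpha_0..alpha_r]; each row of gammas = [gamma_{0,j}, gamma_{1,j}..gamma_{d,j}]."""
    out = alphas[0]
    for a, row in zip(alphas[1:], gammas):
        out += a * sigmoid(row[0] + sum(g * xi for g, xi in zip(row[1:], x)))
    return out

random.seed(1)
V_n, M_n, r_n, d = 6.0, 10.0, 5, 3   # hypothetical sieve parameters

sup_abs = 0.0
for _ in range(200):
    # Output weights rescaled so that sum_j |alpha_j| equals V_n.
    raw = [random.uniform(-1.0, 1.0) for _ in range(r_n + 1)]
    scale = V_n / sum(abs(a) for a in raw)
    alphas = [a * scale for a in raw]
    # Hidden weights rescaled so that each unit has l1 norm M_n.
    gammas = []
    for _ in range(r_n):
        row = [random.uniform(-1.0, 1.0) for _ in range(d + 1)]
        s = M_n / sum(abs(g) for g in row)
        gammas.append([g * s for g in row])
    # Evaluate on random points of the compact set [-1, 1]^d.
    for _ in range(50):
        x = [random.uniform(-1.0, 1.0) for _ in range(d)]
        sup_abs = max(sup_abs, abs(network(x, alphas, gammas)))

print(sup_abs, "<=", V_n)
```

The bound holds for every network in the class, so the observed maximum cannot exceed $V_n$ however the parameters are sampled.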

Now we quote a general result from White and Wooldridge (1991). The theorem tells us that under some mild conditions, there exists a sieve approximate estimator and such an estimator is also measurable.

###### Theorem 2.1 (Theorem 2.2 in White and Wooldridge (1991)).

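Let $(\Omega,\mathcal{A},P)$ be a complete probability space and let $(\Theta,\rho)$ be a pseudo-metric space. Let $\{\Theta_n\}$ be a sequence of compact subsets of $\Theta$. Let $Q_n:\Omega\times\Theta_n\to\bar{\mathbb{R}}$ be $\mathcal{A}\otimes\mathcal{B}(\Theta_n)/\mathcal{B}(\bar{\mathbb{R}})$-measurable, and suppose that for each $\omega\in\Omega$, $Q_n(\omega,\cdot)$ is lower semicontinuous on $\Theta_n$, $n=1,2,\ldots$. Then for each $n=1,2,\ldots$, there exists $\hat{\theta}_n:\Omega\to\Theta_n$, $\mathcal{A}/\mathcal{B}(\Theta_n)$-measurable, such that for each $\omega\in\Omega$, $Q_n(\omega,\hat{\theta}_n(\omega))=\inf_{\theta\in\Theta_n}Q_n(\omega,\theta)$.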

Note that

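$$
\begin{aligned}
\mathbb{Q}_n(f)&=\frac{1}{n}\sum_{i=1}^{n}\left(y_i-f(\mathbf{x}_i)\right)^2\\
&=\frac{1}{n}\sum_{i=1}^{n}\left(f_0(\mathbf{x}_i)+\epsilon_i-f(\mathbf{x}_i)\right)^2\\
&=\frac{1}{n}\sum_{i=1}^{n}\left(f(\mathbf{x}_i)-f_0(\mathbf{x}_i)\right)^2-2\,\frac{1}{n}\sum_{i=1}^{n}\epsilon_i\left(f(\mathbf{x}_i)-f_0(\mathbf{x}_i)\right)+\frac{1}{n}\sum_{i=1}^{n}\epsilon_i^2.
\end{aligned}
$$

Since the randomness only comes from the $\epsilon_i$’s, it is clear that $\mathbb{Q}_n$ is a measurable function and, for a fixed $\omega$, $\mathbb{Q}_n$ is continuous in $f$. Therefore, to show the existence of the sieve estimator, it suffices to show that $\mathcal{F}_{r_n}$ is compact in $C(\mathcal{X})$, which is proved in the following lemma.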
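The three-term decomposition of the empirical criterion is an algebraic identity and can be verified on simulated data. The sketch below assumes a hypothetical true function `f0`, a hypothetical candidate `f`, a one-dimensional design and Gaussian errors; none of these choices comes from the paper.

```python
import math
import random

random.seed(2)
n = 200

def f0(t):
    """Hypothetical true regression function."""
    return math.sin(math.pi * t)

def f(t):
    """Hypothetical candidate function."""
    return 0.5 * t

x = [random.uniform(-1.0, 1.0) for _ in range(n)]   # fixed design points
eps = [random.gauss(0.0, 0.7) for _ in range(n)]    # i.i.d. errors
y = [f0(xi) + ei for xi, ei in zip(x, eps)]

# Left-hand side: empirical squared error loss.
lhs = sum((yi - f(xi)) ** 2 for xi, yi in zip(x, y)) / n

# Right-hand side: approximation term - 2 * cross term + noise term.
approx = sum((f(xi) - f0(xi)) ** 2 for xi in x) / n
cross = sum(ei * (f(xi) - f0(xi)) for xi, ei in zip(x, eps)) / n
noise = sum(ei ** 2 for ei in eps) / n
rhs = approx - 2.0 * cross + noise

print(abs(lhs - rhs))  # zero up to floating-point rounding
```

Only the cross term and the noise term are random, which is why the later proofs concentrate on bounding those two pieces.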

###### Lemma 2.1.

Let $\mathcal{X}$ be a compact subset of $\mathbb{R}^d$. Then for each fixed $n$, $\mathcal{F}_{r_n}$ is a compact set.

###### Proof.

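For each fixed $n$, let $\boldsymbol{\theta}_n=[\alpha_0,\ldots,\alpha_{r_n},\gamma_{0,1},\ldots,\gamma_{0,r_n},\boldsymbol{\gamma}_1^T,\ldots,\boldsymbol{\gamma}_{r_n}^T]^T$ belong to the parameter set $\Theta_n\subset\mathbb{R}^{r_n(d+2)+1}$ determined by the constraints in the definition of $\mathcal{F}_{r_n}$. For $n$ fixed, $\Theta_n$ is a bounded closed set and hence it is a compact set in $\mathbb{R}^{r_n(d+2)+1}$. Consider a map

$$
\begin{aligned}
H:(\Theta_n,\|\cdot\|_2)&\to(\mathcal{F}_{r_n},\|\cdot\|_n)\\
\boldsymbol{\theta}_n&\mapsto H(\boldsymbol{\theta}_n)=\alpha_0+\sum_{j=1}^{r_n}\alpha_j\sigma\left(\boldsymbol{\gamma}_j^T\mathbf{x}+\gamma_{0,j}\right).
\end{aligned}
$$

Note that $\mathcal{F}_{r_n}=H(\Theta_n)$. Therefore, to show that $\mathcal{F}_{r_n}$ is a compact set, it suffices to show that $H$ is a continuous map, due to the compactness of $\Theta_n$. Let $\boldsymbol{\theta}_{1,n},\boldsymbol{\theta}_{2,n}\in\Theta_n$, then

$$
\begin{aligned}
\|H(\boldsymbol{\theta}_{1,n})-H(\boldsymbol{\theta}_{2,n})\|_n^2
&=\frac{1}{n}\sum_{i=1}^{n}\left[\alpha_0^{(1)}+\sum_{j=1}^{r_n}\alpha_j^{(1)}\sigma\left(\boldsymbol{\gamma}_j^{(1)T}\mathbf{x}_i+\gamma_{0,j}^{(1)}\right)-\alpha_0^{(2)}-\sum_{j=1}^{r_n}\alpha_j^{(2)}\sigma\left(\boldsymbol{\gamma}_j^{(2)T}\mathbf{x}_i+\gamma_{0,j}^{(2)}\right)\right]^2\\
&\le\frac{1}{n}\sum_{i=1}^{n}\left[\left|\alpha_0^{(1)}-\alpha_0^{(2)}\right|+\sum_{j=1}^{r_n}\left|\alpha_j^{(1)}\sigma\left(\boldsymbol{\gamma}_j^{(1)T}\mathbf{x}_i+\gamma_{0,j}^{(1)}\right)-\alpha_j^{(2)}\sigma\left(\boldsymbol{\gamma}_j^{(2)T}\mathbf{x}_i+\gamma_{0,j}^{(2)}\right)\right|\right]^2\\
&\le\frac{1}{n}\sum_{i=1}^{n}\left[\left|\alpha_0^{(1)}-\alpha_0^{(2)}\right|+\sum_{j=1}^{r_n}\left(\left|\alpha_j^{(1)}\right|\left|\sigma\left(\boldsymbol{\gamma}_j^{(1)T}\mathbf{x}_i+\gamma_{0,j}^{(1)}\right)-\sigma\left(\boldsymbol{\gamma}_j^{(2)T}\mathbf{x}_i+\gamma_{0,j}^{(2)}\right)\right|+\left|\alpha_j^{(1)}-\alpha_j^{(2)}\right|\sigma\left(\boldsymbol{\gamma}_j^{(2)T}\mathbf{x}_i+\gamma_{0,j}^{(2)}\right)\right)\right]^2\\
&\le\frac{1}{n}\sum_{i=1}^{n}\left[\sum_{j=0}^{r_n}\left|\alpha_j^{(1)}-\alpha_j^{(2)}\right|+\frac{V_n}{4}\sum_{j=1}^{r_n}\left(\left|\left(\boldsymbol{\gamma}_j^{(1)}-\boldsymbol{\gamma}_j^{(2)}\right)^T\mathbf{x}_i\right|+\left|\gamma_{0,j}^{(1)}-\gamma_{0,j}^{(2)}\right|\right)\right]^2\\
&\le\left[\sum_{j=0}^{r_n}\left|\alpha_j^{(1)}-\alpha_j^{(2)}\right|+\frac{V_n}{4}\left(1\vee\|\mathbf{x}\|_\infty\right)\sum_{j=1}^{r_n}\left(\left\|\boldsymbol{\gamma}_j^{(1)}-\boldsymbol{\gamma}_j^{(2)}\right\|_1+\left|\gamma_{0,j}^{(1)}-\gamma_{0,j}^{(2)}\right|\right)\right]^2\\
&\le\left(\frac{V_n}{4}\left(1\vee\|\mathbf{x}\|_\infty\right)\right)^2\left[r_n(d+1)\right]\|\boldsymbol{\theta}_{1,n}-\boldsymbol{\theta}_{2,n}\|_2^2.
\end{aligned}
$$

Here the third line adds and subtracts $\alpha_j^{(1)}\sigma\left(\boldsymbol{\gamma}_j^{(2)T}\mathbf{x}_i+\gamma_{0,j}^{(2)}\right)$ and applies the triangle inequality, and the fourth line uses $|\alpha_j^{(1)}|\le V_n$, $0<\sigma<1$ and Proposition 2.1. Hence, for any $\epsilon>0$, by choosing $\delta=\epsilon\big/\left(\frac{V_n}{4}\left(1\vee\|\mathbf{x}\|_\infty\right)\sqrt{r_n(d+1)}\right)$, we observe that when $\|\boldsymbol{\theta}_{1,n}-\boldsymbol{\theta}_{2,n}\|_2<\delta$, we have

$$\|H(\boldsymbol{\theta}_{1,n})-H(\boldsymbol{\theta}_{2,n})\|_n<\epsilon,$$

which implies that $H$ is a continuous map and hence $\mathcal{F}_{r_n}$ is a compact set for each fixed $n$. ∎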

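As a corollary of Lemma 2.1 and Theorem 2.1, we can easily obtain the existence of the sieve estimator.

###### Corollary 2.1.

Under the notations above, for each $n=1,2,\ldots$, there exists $\hat{f}_n:\Omega\to\mathcal{F}_{r_n}$, $\mathcal{A}/\mathcal{B}(\mathcal{F}_{r_n})$-measurable, such that $\mathbb{Q}_n(\hat{f}_n(\omega))=\inf_{f\in\mathcal{F}_{r_n}}\mathbb{Q}_n(f)$.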

## 3 Consistency

In this section, we show the consistency of the neural network sieve estimator. The consistency result relies heavily on the following Uniform Law of Large Numbers. We start by considering a simple case with $V_n\equiv V$ for all $n$. In such a case, $\bigcup_n\mathcal{F}_{r_n}$ is not dense in $\mathcal{F}$ but rather in a subset of $\mathcal{F}$ consisting of functions satisfying a certain smoothness condition.

###### Lemma 3.1.

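Let $\epsilon_1,\ldots,\epsilon_n$ be i.i.d. sub-Gaussian random variables with sub-Gaussian parameter $\sigma_0$. Then if $[r_n(d+2)+1]\log[r_n(d+2)+1]=o(n)$, we have

$$\sup_{f\in\mathcal{F}_{r_n}}\left|\mathbb{Q}_n(f)-Q_n(f)\right|\xrightarrow{p^*}0.$$
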
###### Proof.

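For any $\delta>0$, we have

$$
\begin{aligned}
P^*\left(\sup_{f\in\mathcal{F}_{r_n}}\left|\mathbb{Q}_n(f)-Q_n(f)\right|>\delta\right)
&=P^*\left(\sup_{f\in\mathcal{F}_{r_n}}\left|\frac{1}{n}\sum_{i=1}^{n}\epsilon_i^2-\sigma^2-2\,\frac{1}{n}\sum_{i=1}^{n}\epsilon_i\left(f(\mathbf{x}_i)-f_0(\mathbf{x}_i)\right)\right|>\delta\right)\\
&\le P\left(\left|\frac{1}{n}\sum_{i=1}^{n}\epsilon_i^2-\sigma^2\right|>\frac{\delta}{2}\right)+P^*\left(\sup_{f\in\mathcal{F}_{r_n}}\left|\frac{1}{n}\sum_{i=1}^{n}\epsilon_i\left(f(\mathbf{x}_i)-f_0(\mathbf{x}_i)\right)\right|>\frac{\delta}{4}\right)\\
&=:(I)+(II).
\end{aligned}
$$

For (I), by the Weak Law of Large Numbers, there exists $N_1>0$ such that for all $n\ge N_1$ we have

$$(I)=P\left(\left|\frac{1}{n}\sum_{i=1}^{n}\epsilon_i^2-\sigma^2\right|>\frac{\delta}{2}\right)<\frac{\delta}{2}.$$

Now we evaluate (II). From the sub-Gaussianity of $\epsilon_1,\ldots,\epsilon_n$, we know that $\epsilon_i\left(f(\mathbf{x}_i)-f_0(\mathbf{x}_i)\right)$ is also sub-Gaussian with mean 0 and sub-Gaussian parameter $\sigma_0\left|f(\mathbf{x}_i)-f_0(\mathbf{x}_i)\right|$. Hence, by the Hoeffding inequality,

$$P\left(\left|\frac{1}{n}\sum_{i=1}^{n}\epsilon_i\left(f(\mathbf{x}_i)-f_0(\mathbf{x}_i)\right)\right|>\frac{\delta}{4}\right)=P\left(\left|\sum_{i=1}^{n}\epsilon_i\left(f(\mathbf{x}_i)-f_0(\mathbf{x}_i)\right)\right|>\frac{n\delta}{4}\right)\le2\exp\left\{-\frac{n^2\delta^2}{32\sigma_0^2\sum_{i=1}^{n}\left(f(\mathbf{x}_i)-f_0(\mathbf{x}_i)\right)^2}\right\}.$$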

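From Proposition 2.2, we know that $\sup_{f\in\mathcal{F}_{r_n}}\|f\|_n\le V$. Hence, based on Corollary 8.3 in van de Geer (2000), (II) will have an exponential bound if there exist some constant $C$ and $\delta>0$, $\sigma>0$ satisfying $\delta/\sigma<V$ and

$$\sqrt{n}\,\delta\ge2C\left(\int_{\delta/(8\sigma)}^{V}H^{1/2}\left(u,\mathcal{F}_{r_n},\|\cdot\|_n\right)du\vee V\right).\tag{3.1}$$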

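We now show that (3.1) holds in our case. Theorem 14.5 in Anthony and Bartlett (2009) gives an upper bound on the covering number of $\mathcal{F}_{r_n}$,

$$N\left(\epsilon,\mathcal{F}_{r_n},\|\cdot\|_\infty\right)\le\left(\frac{4e[r_n(d+2)+1]\left(\frac{1}{4}V\right)^2}{\epsilon\left(\frac{1}{4}V-1\right)}\right)^{r_n(d+2)+1}=:\tilde{A}_{r_n,d,V}\,\epsilon^{-[r_n(d+2)+1]},$$

where $\tilde{A}_{r_n,d,V}=\left(\frac{e[r_n(d+2)+1]V^2}{V-4}\right)^{r_n(d+2)+1}$. By letting

$$
\begin{aligned}
A_{r_n,d,V}&=\log\tilde{A}_{r_n,d,V}-[r_n(d+2)+1]\\
&=[r_n(d+2)+1]\left(\log\frac{e[r_n(d+2)+1]V^2}{V-4}-1\right)\\
&=[r_n(d+2)+1]\log\frac{[r_n(d+2)+1]V^2}{V-4},
\end{aligned}
$$

and noting that $\frac{V^2}{V-4}\ge16$ for all $V>4$, we have $A_{r_n,d,V}\ge r_n(d+2)+1$. Then,

$$
\begin{aligned}
H\left(\epsilon,\mathcal{F}_{r_n},\|\cdot\|_\infty\right)&=\log N\left(\epsilon,\mathcal{F}_{r_n},\|\cdot\|_\infty\right)\\
&\le\log\tilde{A}_{r_n,d,V}+[r_n(d+2)+1]\log\frac{1}{\epsilon}\\
&\le A_{r_n,d,V}+[r_n(d+2)+1]\frac{1}{\epsilon}\quad(\text{since }\log x\le x-1\text{ for all }x>0)\\
&\le A_{r_n,d,V}\left(1+\frac{1}{\epsilon}\right).
\end{aligned}
$$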

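Since

$$\|f\|_n^2=\frac{1}{n}\sum_{i=1}^{n}f^2(\mathbf{x}_i)\le\left(\sup_{\mathbf{x}}|f(\mathbf{x})|\right)^2=\|f\|_\infty^2,$$

we have $H\left(\epsilon,\mathcal{F}_{r_n},\|\cdot\|_n\right)\le H\left(\epsilon,\mathcal{F}_{r_n},\|\cdot\|_\infty\right)$. Then

$$
\begin{aligned}
\int_{\delta/(8\sigma)}^{V}H^{1/2}\left(\epsilon,\mathcal{F}_{r_n},\|\cdot\|_n\right)d\epsilon
&\le A_{r_n,d,V}^{1/2}\int_{0}^{V}\left(1+\frac{1}{\epsilon}\right)^{1/2}d\epsilon\\
&=A_{r_n,d,V}^{1/2}\left[\int_{0}^{1}\left(1+\frac{1}{\epsilon}\right)^{1/2}d\epsilon+\int_{1}^{V}\left(1+\frac{1}{\epsilon}\right)^{1/2}d\epsilon\right]\\
&\le A_{r_n,d,V}^{1/2}\left[\sqrt{2}\int_{0}^{1}\epsilon^{-\frac{1}{2}}d\epsilon+\sqrt{2}(V-1)\right]\\
&\le A_{r_n,d,V}^{1/2}\left[2\sqrt{2}+2\sqrt{2}(V-1)\right]\\
&=2\sqrt{2}\,A_{r_n,d,V}^{1/2}\,V.
\end{aligned}
$$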
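The bound on the integral of $(1+1/\epsilon)^{1/2}$ over $(0,V]$ by $2\sqrt{2}\,V$ can be checked numerically. The sketch below is an illustration under two choices that are not in the paper: the substitution $\epsilon=u^2$, which removes the singularity at 0, and the antiderivative $\sqrt{\epsilon(\epsilon+1)}+\operatorname{arcsinh}(\sqrt{\epsilon})$ used as a cross-check. The function names are hypothetical.

```python
import math

def entropy_integral(V, steps=2000):
    """Midpoint rule for int_0^V sqrt(1 + 1/eps) d eps.

    With eps = u^2 the integral becomes int_0^sqrt(V) 2 * sqrt(u^2 + 1) du,
    whose integrand is smooth on the whole range.
    """
    upper = math.sqrt(V)
    h = upper / steps
    return sum(2.0 * math.sqrt(((k + 0.5) * h) ** 2 + 1.0) for k in range(steps)) * h

def closed_form(V):
    """Exact value sqrt(V (V + 1)) + asinh(sqrt(V)) of the same integral."""
    return math.sqrt(V * (V + 1.0)) + math.asinh(math.sqrt(V))

for V in (5.0, 10.0, 100.0):   # any V > 4, as required by the sieve
    numeric, exact, bound = entropy_integral(V), closed_form(V), 2.0 * math.sqrt(2.0) * V
    print(V, numeric, exact, bound)
```

The exact value grows roughly like $V$, so the constant $2\sqrt{2}$ in the proof is conservative.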

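Clearly, $A_{r_n,d,V}$ is of the order $[r_n(d+2)+1]\log[r_n(d+2)+1]$. Under the assumption that $[r_n(d+2)+1]\log[r_n(d+2)+1]=o(n)$, we get that for any $\delta>0$ there exists $N_2>0$ such that for all $n\ge N_2$,

$$4\sqrt{2}\,V\left(\frac{1}{n}A_{r_n,d,V}\right)^{1/2}<\frac{\delta}{4},$$

i.e. (3.1) holds with $C=1$ and with $\delta/4$ in place of $\delta$. Hence, based on Corollary 8.3 in van de Geer (2000), for $n\ge N_2$,

$$P^*\left(\sup_{f\in\mathcal{F}_{r_n}}\left|\frac{1}{n}\sum_{i=1}^{n}\epsilon_i\left(f(\mathbf{x}_i)-f_0(\mathbf{x}_i)\right)\right|>\frac{\delta}{4}\ \wedge\ \frac{1}{n}\sum_{i=1}^{n}\epsilon_i^2\le\sigma^2\right)\le\exp\left\{-\frac{n\delta^2}{64V^2}\right\}.\tag{3.2}$$

Since the bound in (3.2) does not depend on $\sigma$, we can let $\sigma\to\infty$ in (3.2) to get

$$P^*\left(\sup_{f\in\mathcal{F}_{r_n}}\left|\frac{1}{n}\sum_{i=1}^{n}\epsilon_i\left(f(\mathbf{x}_i)-f_0(\mathbf{x}_i)\right)\right|>\frac{\delta}{4}\right)\le\exp\left\{-\frac{n\delta^2}{64V^2}\right\}.$$

Let $N_3=\max\left\{N_2,\left\lceil\frac{64V^2}{\delta^2}\log\frac{2}{\delta}\right\rceil\right\}$; then for $n\ge N_3$, we have

$$(II)=P^*\left(\sup_{f\in\mathcal{F}_{r_n}}\left|\frac{1}{n}\sum_{i=1}^{n}\epsilon_i\left(f(\mathbf{x}_i)-f_0(\mathbf{x}_i)\right)\right|>\frac{\delta}{4}\right)\le\frac{\delta}{2}.$$

Thus we conclude that for any $\delta>0$, by taking $n\ge\max\{N_1,N_3\}$, where $N_1$ is the threshold from the bound on (I), we have

$$P^*\left(\sup_{f\in\mathcal{F}_{r_n}}\left|\mathbb{Q}_n(f)-Q_n(f)\right|>\delta\right)<\delta,$$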

which proves the desired result. ∎

###### Remark 3.1.

Lemma 3.1 shows that if we have a fixed number of features, the desired Uniform Law of Large Numbers holds when the number of hidden units in the neural network sieve does not grow too fast.
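To give a feel for what "not too fast" means, the sketch below evaluates the ratio $A_{r_n,d,V}/n$ from the proof of Lemma 3.1, which must tend to 0, for two growth rates of $r_n$. The values `d = 3` and `V = 8` and the two rates ($r_n=\lfloor\sqrt{n}\rfloor$ and $r_n=n$) are assumptions chosen for illustration.

```python
import math

def entropy_ratio(n, r_n, d, V):
    """A_{r_n,d,V} / n, with A = W * log(W * V^2 / (V - 4)) and W = r_n (d + 2) + 1."""
    W = r_n * (d + 2) + 1
    return W * math.log(W * V ** 2 / (V - 4.0)) / n

d, V = 3, 8.0                       # hypothetical number of features and bound V > 4
ns = [10 ** 2, 10 ** 4, 10 ** 6, 10 ** 8]

slow = [entropy_ratio(n, math.isqrt(n), d, V) for n in ns]   # r_n = floor(sqrt(n))
fast = [entropy_ratio(n, n, d, V) for n in ns]               # r_n = n

for n, s, f in zip(ns, slow, fast):
    print(n, s, f)
```

With $r_n=\lfloor\sqrt{n}\rfloor$ the ratio shrinks toward 0, whereas with $r_n=n$ it keeps growing, so the second choice falls outside the scope of the lemma.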

We now extend the result to a more general case. In Lemma 3.1, we assume that the errors are i.i.d. sub-Gaussian and $V_n\equiv V$. In the following lemma, we relax both restrictions.

###### Lemma 3.2.

Under the assumption that

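$$[r_n(d+2)+1]\,V_n^2\log\left(V_n[r_n(d+2)+1]\right)=o(n),\quad\text{as } n\to\infty,$$

we have

$$\sup_{f\in\mathcal{F}_{r_n}}\left|\mathbb{Q}_n(f)-Q_n(f)\right|\xrightarrow{p^*}0,\quad\text{as } n\to\infty.$$
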
###### Proof.

As in the proof of Lemma 3.1, it suffices to show that

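$$P^*\left(\sup_{f\in\mathcal{F}_{r_n}}\left|\frac{1}{n}\sum_{i=1}^{n}\epsilon_i\left(f(\mathbf{x}_i)-f_0(\mathbf{x}_i)\right)\right|>\frac{\delta}{4}\right)\to0,\quad\text{as } n\to\infty.\tag{3.3}$$

By Markov’s inequality, (3.3) holds if we can show

$$E^*\left[\sup_{f\in\mathcal{F}_{r_n}}\left|\frac{1}{n}\sum_{i=1}^{n}\epsilon_i\left(f(\mathbf{x}_i)-f_0(\mathbf{x}_i)\right)\right|\right]\to0,\quad\text{as } n\to\infty.$$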

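Note that $f\in\mathcal{F}_{r_n}$ and each $f$ has its corresponding parametrization $\boldsymbol{\theta}_n$. Since $\boldsymbol{\theta}_n$ is in a compact set $\Theta_n$, we know that there exists a sequence $\boldsymbol{\theta}_{n,k}\to\boldsymbol{\theta}_n$ as $k\to\infty$ with $\boldsymbol{\theta}_{n,k}\in\mathbb{Q}^{r_n(d+2)+1}\cap\Theta_n$, where $\mathbb{Q}$ here denotes the rationals. Each $\boldsymbol{\theta}_{n,k}$ corresponds to a function $f_k\in\mathcal{F}_{r_n}$. By continuity, we have $f_k(\mathbf{x})\to f(\mathbf{x})$ for each $\mathbf{x}\in\mathcal{X}$. From Example 2.3.4 in van der Vaart and Wellner (1996), we know that $\mathcal{F}_{r_n}$ is $P$-measurable and, by the symmetrization inequality, we have

$$E^*\left[\sup_{f\in\mathcal{F}_{r_n}}\left|\frac{1}{n}\sum_{i=1}^{n}\epsilon_i\left(f(\mathbf{x}_i)-f_0(\mathbf{x}_i)\right)\right|\right]\le2E_\epsilon E_\xi\left[\sup_{f\in\mathcal{F}_{r_n}}\left|\frac{1}{n}\sum_{i=1}^{n}\xi_i\epsilon_i\left(f(\mathbf{x}_i)-f_0(\mathbf{x}_i)\right)\right|\right],$$

where $\xi_1,\ldots,\xi_n$ are i.i.d. Rademacher random variables independent of $\epsilon_1,\ldots,\epsilon_n$. Based on the Strong Law of Large Numbers, there exists $N_1>0$ such that for all $n\ge N_1$,

$$\frac{1}{n}\sum_{i=1}^{n}\epsilon_i^2<\sigma^2+1,\quad\text{a.s.}$$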

For fixed , is a sub-Gaussian process indexed by . Suppose that is the probability space on which are defined and let with and . As we have shown above, we have and by continuity, for any . This shows that is a separable sub-Gaussian process. Hence Corollary 2.2.8 in van der Vaart and Wellner (1996) implies that there exists a universal constant and for any with ,

 Eξ [supf∈Frn∣∣ ∣∣1nn∑i=1ξiϵi(f(xi))−f0(xi))∣∣ ∣∣] =Eξ[1√nsupf∈Frn∣∣ ∣∣1√nn∑i=1ξiϵi(f(xi)−f0(xi))∣∣ ∣∣] ≤Eξ[∣∣ ∣∣1nn∑i=1ξiϵi(f∗n(xi)−f0(xi))∣∣ ∣∣]+K∫∞0 ⎷logN(12η,Frn,d)ndη =Eξ[∣∣ ∣∣1nn∑i=1ξiϵi(f∗n(xi)−f0(xi))∣∣ ∣∣]+K∫2Vn0 ⎷logN(12η,Frn,d)ndη ≤Eξ[∣∣ ∣∣1nn∑i=1ξiϵi(f∗n(xi)−f0(xi))∣∣ ∣∣]+K∫2Vn0  ⎷