# Estimating the error distribution function in nonparametric regression

Technical report, 2004. AMS 2000 subject classification: Primary 62G05, 62G08, 62G20. Key words and phrases: local polynomial smoother, kernel estimator, under-smoothing, plug-in estimator, error variance, empirical likelihood, adaptive estimator, efficient estimator, influence function.

Ursula U. Müller, Anton Schick, Wolfgang Wefelmeyer

Summary: We construct an efficient estimator for the error distribution function of the nonparametric regression model $Y = r(Z) + \varepsilon$. Our estimator is a kernel-smoothed empirical distribution function based on residuals from an under-smoothed local quadratic smoother for the regression function.

## 1 Introduction

Consider the nonparametric regression model $Y = r(Z) + \varepsilon$, where the covariate $Z$ and the error $\varepsilon$ are independent, and $\varepsilon$ has mean zero, finite variance $\sigma^2$ and density $f$. We observe independent copies $(Y_1,Z_1),\dots,(Y_n,Z_n)$ of $(Y,Z)$ and want to estimate the distribution function $F$ of $\varepsilon$. If the regression function $r$ were known, we could use the empirical distribution function $\mathbb{F}$ based on the errors $\varepsilon_i = Y_i - r(Z_i)$, defined by

$$\mathbb{F}(t)=\frac{1}{n}\sum_{i=1}^n \mathbf{1}\{\varepsilon_i\le t\}.$$

We treat the regression function as unknown and propose a kernel-smoothed empirical distribution function $\hat F^*$ based on residuals from an under-smoothed local quadratic smoother for the regression function. We give conditions under which $\hat F^*$ is asymptotically equivalent to $\mathbb{F}$ plus a correction term:

$$\sup_{t\in\mathbb{R}} n^{1/2}\Bigl|\hat F^*(t)-\mathbb{F}(t)-f(t)\frac{1}{n}\sum_{i=1}^n\varepsilon_i\Bigr|=o_p(1). \tag{1.1}$$

Smoothing the empirical distribution function is appropriate because we assume that the error distribution has a Lipschitz density and therefore a smooth distribution function. A local quadratic smoother for the regression function is appropriate because we assume that the regression function is twice continuously differentiable.

It follows from (1.1) that $\hat F^*(t)$ has influence function

$$\mathbf{1}\{\varepsilon\le t\}-F(t)+f(t)\varepsilon.$$

Müller, Schick and Wefelmeyer (2004a) show that this is the efficient influence function for estimators of $F(t)$. Hence $\hat F^*(t)$ is efficient for $F(t)$ in the sense of being a least dispersed regular estimator, for every $t$. The influence function of our estimator coincides with the efficient influence function in the model with constant regression function; see Bickel, Klaassen, Ritov and Wellner (1998, Section 5.5, Example 1).

It follows in particular from (1.1) that $\hat F^*(t)$ has asymptotic variance

$$F(t)(1-F(t))+\sigma^2 f^2(t)-2f(t)\int_t^\infty x f(x)\,dx.$$

If $f$ is a normal density, this simplifies to

$$F(t)(1-F(t))-\sigma^2 f^2(t).$$
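To see the simplification, note that a mean-zero normal density with variance $\sigma^2$ satisfies $f'(x)=-(x/\sigma^2)f(x)$, so

```latex
\int_t^\infty x f(x)\,dx = -\sigma^2\int_t^\infty f'(x)\,dx = \sigma^2 f(t),
\qquad\text{hence}\qquad
F(t)(1-F(t)) + \sigma^2 f^2(t) - 2f(t)\,\sigma^2 f(t)
  = F(t)(1-F(t)) - \sigma^2 f^2(t).
```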

Hence, for normal errors, the asymptotic variance of $\hat F^*(t)$ is strictly smaller than the asymptotic variance of the empirical estimator $\mathbb{F}(t)$ based on the true errors. This paradox is explained by the fact that the empirical estimator is not efficient: unlike $\hat F^*(t)$, it does not make use of the information that the errors have mean zero. The efficient influence function for estimators of $F(t)$ from mean-zero observations $\varepsilon_1,\dots,\varepsilon_n$ is

$$\mathbf{1}\{\varepsilon\le t\}-F(t)-C_0(t)\varepsilon \quad\text{with}\quad C_0(t)=\sigma^{-2}\int_{-\infty}^t x f(x)\,dx;$$

see Levit (1975). Efficient estimators for $F(t)$ from the observations $\varepsilon_1,\dots,\varepsilon_n$ are

$$\mathbb{F}(t)-\hat C_0(t)\frac{1}{n}\sum_{i=1}^n\varepsilon_i \quad\text{with}\quad \hat C_0(t)=\frac{\sum_{i=1}^n\varepsilon_i\mathbf{1}\{\varepsilon_i\le t\}}{\sum_{i=1}^n\varepsilon_i^2},$$

and the empirical likelihood estimator

$$\frac{1}{n}\sum_{i=1}^n p_i\mathbf{1}\{\varepsilon_i\le t\}$$

with (random) probabilities $p_1,\dots,p_n$ maximizing $\prod_{i=1}^n p_i$ subject to $\sum_{i=1}^n p_i=1$ and $\sum_{i=1}^n p_i\varepsilon_i=0$. The empirical likelihood was introduced by Owen (1988), (1990); see also Owen (2001). The asymptotic variance of an efficient estimator for $F(t)$ from $\varepsilon_1,\dots,\varepsilon_n$ is

$$F(t)(1-F(t))-\sigma^{-2}\Bigl(\int_t^\infty x f(x)\,dx\Bigr)^2.$$
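The empirical likelihood weights above have a simple computational form: Lagrange calculus gives $p_i = 1/(n(1+\lambda\varepsilon_i))$ with $\lambda$ solving $\sum_i \varepsilon_i/(1+\lambda\varepsilon_i)=0$. A minimal numerical sketch (the function names `el_weights` and `el_cdf` are ours, not from the literature):

```python
import numpy as np

def el_weights(eps, tol=1e-10):
    """Probabilities p_i maximizing prod(p_i) subject to sum(p_i) = 1 and
    sum(p_i * eps_i) = 0.  Lagrange calculus gives
    p_i = 1 / (n * (1 + lam * eps_i)) with lam solving
    g(lam) = sum_i eps_i / (1 + lam * eps_i) = 0; g is strictly decreasing,
    so we locate lam by bisection on the feasible interval."""
    eps = np.asarray(eps, dtype=float)
    n = len(eps)
    # feasibility requires min(eps) < 0 < max(eps), so that 1 + lam*eps_i > 0
    lo = -1.0 / eps.max() + 1e-12
    hi = -1.0 / eps.min() - 1e-12
    g = lambda lam: np.sum(eps / (1.0 + lam * eps))
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if g(mid) > 0 else (lo, mid)
    lam = 0.5 * (lo + hi)
    return 1.0 / (n * (1.0 + lam * eps))

def el_cdf(eps, t):
    """Empirical likelihood estimator of F(t) from mean-zero errors."""
    p = el_weights(eps)
    return float(np.sum(p * (eps <= t)))
```

Since $g(\lambda)=\sum_i \varepsilon_i/(1+\lambda\varepsilon_i)$ is strictly decreasing on the feasible interval, bisection is reliable here.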

The variance increase of our estimator $\hat F^*(t)$ over these efficient estimators is therefore

$$\Bigl(\sigma f(t)-\sigma^{-1}\int_t^\infty x f(x)\,dx\Bigr)^2.$$

This is the price for not knowing the regression function. For normal errors this term is zero, and we lose nothing. We refer also to the introduction of Müller, Schick and Wefelmeyer (2004b).

Our proof is complicated by two features of the model: the error distribution cannot be estimated adaptively with respect to the regression function, and the regression function cannot be estimated at the parametric rate $n^{-1/2}$. Akritas and Van Keilegom (2001) encountered these problems in a related model, the heteroscedastic regression model $Y = r(Z)+s(Z)\varepsilon$. They used different techniques and stronger assumptions to obtain an expansion similar to (1.1). Their results do not cover ours in our simpler model.

Previous related results are easier because at least one of these complicating features is missing. Loynes (1980) treats parametric regression models. Koul (1969), (1970), (1987), (1992), Shorack (1984), Shorack and Wellner (1986, Section 4.6) and Bai (1996) consider linear models. Mammen (1996) studies the linear model as the dimension of the parameter increases with $n$. Klaassen and Putter (1997) and (2001) construct efficient estimators for the error distribution function in the linear regression model. Koshevnik (1996) treats the nonparametric regression model with error density symmetric about zero; an efficient estimator for $F$ is obtained by symmetrizing the empirical distribution function based on residuals. Related results exist for time series. See Boldin (1982), Koul (2002, Chapter 7) and Koul and Leventhal (1989) for linear autoregressive processes; Kreiss (1991) and Schick and Wefelmeyer (2002b) for invertible linear processes; and Koul (2002, Chapter 8), Schick and Wefelmeyer (2002a) and Müller, Schick and Wefelmeyer (2004c, Section 4) for nonlinear autoregressive processes. For invertible linear processes, Schick and Wefelmeyer (2004) show that the smoothed residual-based empirical estimator is asymptotically equivalent to the empirical estimator based on the true innovations. General considerations on empirical processes based on estimated observations are in Ghoudi and Rémillard (1998).

Our result gives efficient estimators for linear functionals $\int h\,dF$ with bounded $h$. For smooth and $F$-square-integrable functions $h$, it is easier to prove an i.i.d. representation analogous to (1.1) directly; see Müller, Schick and Wefelmeyer (2004a), who also use an under-smoothed estimator for the regression function. Müller, Schick and Wefelmeyer (2004b) compare these results with estimation in the larger model in which one assumes only the conditional constraint $E[\varepsilon\mid Z]=0$ rather than independence of $Z$ and $\varepsilon$. A particularly simple special case is the error variance $\sigma^2=E[\varepsilon^2]$, corresponding to $h(x)=x^2$. For the estimator $\frac{1}{n}\sum_{i=1}^n\hat\varepsilon_i^2$ based on residuals from a kernel estimator of the regression function, under-smoothing is not needed. The asymptotic variance of this estimator was already obtained in Hall and Marron (1990). Müller, Schick and Wefelmeyer (2003) show that a covariate-matched U-statistic is efficient for $\sigma^2$; it does not require estimating the regression function but uses a kernel density estimator for the covariate density. There is a large literature on simpler, inefficient, difference-based estimators for $\sigma^2$; reviews are Carter and Eagleson (1992) and Dette, Munk and Wagner (1998) and (1999).

We can write

$$F(t)=\int\mathbf{1}\{y-r(z)\le t\}\,Q(dy,dz),$$

where $Q$ is the joint distribution of $(Y,Z)$. Our estimator is obtained by plugging estimators for $Q$ and $r$ into this representation. For $Q$ we use essentially the empirical distribution; for $r$ we use a local quadratic smoother that is under-smoothed and hence does not have the optimal rate for estimating $r$. This means that our estimator does not obey the plug-in principle of Bickel and Ritov (2000) and (2003).

The paper is organized as follows. Section 2 introduces our estimator and states, in Theorem 2.7, the assumptions needed for expansion (1.1). Section 3 derives some consequences of exponential inequalities, and Section 4 contains properties of local polynomial smoothers. Section 5 gives the proof of Proposition 2.8.

## 2 The estimator and the main result

Let us now define our estimator. We begin by defining the residuals. This requires an estimator $\hat r$ of the regression function. We take $\hat r$ to be a local quadratic smoother. To define it we need a kernel $w$ and a bandwidth $c_n$. The local quadratic smoother is defined as $\hat r(x)=\hat\beta_0(x)$ for $x\in[0,1]$, where $\hat\beta(x)=(\hat\beta_0(x),\hat\beta_1(x),\hat\beta_2(x))^\top$ is the minimizer of

$$\sum_{j=1}^n\bigl(Y_j-\beta_0-\beta_1(Z_j-x)-\beta_2(Z_j-x)^2\bigr)^2\frac{1}{c_n}w\Bigl(\frac{Z_j-x}{c_n}\Bigr).$$

The residuals of the regression estimator are

$$\hat\varepsilon_i=Y_i-\hat r(Z_i),\quad i=1,\dots,n.$$

Let $\hat F$ denote the empirical distribution function based on these residuals:

$$\hat F(t)=\frac{1}{n}\sum_{i=1}^n\mathbf{1}\{\hat\varepsilon_i\le t\},\quad t\in\mathbb{R}.$$

Our estimator of the error distribution function will be a smoothed version of $\hat F$. To this end, let $k$ be a density and $a_n$ another bandwidth. We define our estimator $\hat F^*$ of $F$ by

$$\hat F^*(t)=\int\hat F(t-a_nx)k(x)\,dx,\quad t\in\mathbb{R}.$$

With $K$ the distribution function of $k$, we can write

$$\hat F^*(t)=\int K\Bigl(\frac{t-x}{a_n}\Bigr)\,d\hat F(x)=\frac{1}{n}\sum_{i=1}^nK\Bigl(\frac{t-\hat\varepsilon_i}{a_n}\Bigr),\quad t\in\mathbb{R}.$$

This shows that $\hat F^*$ is the convolution of the empirical distribution function of the residuals with the distribution function $K(\cdot/a_n)$. Alternatively, $\hat F^*$ is the distribution function with density $\hat f^*$ given by

$$\hat f^*(t)=\frac{1}{na_n}\sum_{i=1}^nk\Bigl(\frac{t-\hat\varepsilon_i}{a_n}\Bigr),\quad t\in\mathbb{R}.$$

This is the usual kernel density estimator of $f$ based on the residuals, with kernel $k$ and bandwidth $a_n$. We make the following assumptions.
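For concreteness, the construction above (local quadratic fit, residuals, smoothed residual-based empirical distribution function) can be sketched numerically as follows. This is a minimal illustration, not the authors' code: the Epanechnikov kernel is used for both $w$ and $k$ as one admissible choice, and all function names are ours.

```python
import numpy as np

def epan(u):
    """Epanechnikov density, an admissible compactly supported kernel."""
    return np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u**2), 0.0)

def epan_cdf(u):
    """Distribution function K of the Epanechnikov density."""
    u = np.clip(u, -1.0, 1.0)
    return 0.75 * (u - u**3 / 3.0) + 0.5

def local_quadratic(z, y, x, cn):
    """Local quadratic smoother r_hat(x): weighted least-squares fit of
    y ~ b0 + b1 (z - x) + b2 (z - x)^2 with weights w((z - x)/cn)."""
    d = z - x
    sw = np.sqrt(epan(d / cn))
    X = np.column_stack([np.ones_like(d), d, d**2])
    beta, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return beta[0]

def fstar(z, y, t, cn, an):
    """Smoothed residual-based empirical distribution function F_hat^*(t):
    average of K((t - eps_hat_i)/a_n) over the residuals."""
    res = y - np.array([local_quadratic(z, y, x, cn) for x in z])
    return float(np.mean(epan_cdf((t - res) / an)))
```

With the convolution form, $\hat F^*(t)$ is simply an average of kernel distribution functions evaluated at the residuals, so it is monotone in $t$ and takes values in $[0,1]$.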

###### Assumption 2.1

The covariate density $g$ is bounded and bounded away from zero on $[0,1]$, and its restriction to $[0,1]$ is (uniformly) continuous.

###### Assumption 2.2

The regression function $r$ is twice continuously differentiable.

###### Assumption 2.3

The error density $f$ is Lipschitz, has mean zero, and satisfies the moment condition $E[|\varepsilon|^\beta]<\infty$ for some $\beta$.

###### Assumption 2.4

The density $k$ is symmetric, twice continuously differentiable, and has compact support.

###### Assumption 2.5

The kernel $w$ used to define the local quadratic smoother is a symmetric density which has compact support $[-1,1]$ and a bounded derivative $w'$.

###### Assumption 2.6

The bandwidths satisfy $a_n^2=o(n^{-1/2})$, $c_n=o(n^{-1/5})$, and $n^{1-2/\beta}c_n/\log n\to\infty$.

Note that $c_n$ is smaller than the optimal bandwidth for estimating $r$ under Assumptions 2.1 and 2.2. Such a bandwidth would be proportional to $n^{-1/5}$. This means that our choice of bandwidth results in an under-smoothed local quadratic smoother.

We are now ready to state our main result.

###### Theorem 2.7

Suppose that Assumptions 2.1 to 2.6 hold. Then

$$\sup_{t\in\mathbb{R}}n^{1/2}\Bigl|\hat F^*(t)-\mathbb{F}(t)-f(t)\frac{1}{n}\sum_{i=1}^n\varepsilon_i\Bigr|=o_p(1).$$

In particular, $n^{1/2}(\hat F^*-F)$ converges in distribution in the space $D(\mathbb{R})$ to a centered Gaussian process.

###### Proof.

For $t\in\mathbb{R}$ and $a>0$ set

$$F_a(t)=\int F(t-ax)k(x)\,dx.$$

Since the density $k$ is symmetric and hence has mean zero by Assumption 2.4, we have

$$F_a(t)-F(t)=\int\bigl(F(t-ax)-F(t)+axf(t)\bigr)k(x)\,dx=\int(-ax)\int_0^1\bigl(f(t-axy)-f(t)\bigr)\,dy\,k(x)\,dx.$$

Thus the Lipschitz continuity of $f$ yields

$$\sup_{t\in\mathbb{R}}\bigl|F_{a_n}(t)-F(t)\bigr|=O(a_n^2)=o(n^{-1/2}).$$

It follows from standard empirical process theory that

$$G_n=n^{1/2}\sup_{x\in\mathbb{R}}\bigl|\mathbb{F}_{a_n}(x)-\mathbb{F}(x)-F_{a_n}(x)+F(x)\bigr|=o_p(1),\quad a_n\to 0. \tag{2.1}$$

Indeed, with $\mathbb{F}_a(x)=\int\mathbb{F}(x-ay)k(y)\,dy$ we have

$$\bigl|\mathbb{F}_{a_n}(x)-\mathbb{F}(x)-F_{a_n}(x)+F(x)\bigr|\le\int\bigl|\mathbb{F}(x-a_ny)-\mathbb{F}(x)-F(x-a_ny)+F(x)\bigr|k(y)\,dy,$$

and the right-hand side is $o_p(n^{-1/2})$ uniformly in $x$ by the asymptotic equicontinuity of the empirical process $n^{1/2}(\mathbb{F}-F)$. The above shows that

$$\sup_{t\in\mathbb{R}}n^{1/2}\bigl|\mathbb{F}_{a_n}(t)-\mathbb{F}(t)\bigr|=o_p(1).$$

Hence the desired result follows from Proposition 2.8 below. ∎

###### Proposition 2.8

Suppose that Assumptions 2.1 to 2.6 hold. Then

$$\sup_{t\in\mathbb{R}}n^{1/2}\Bigl|\hat F^*(t)-\mathbb{F}_{a_n}(t)-f(t)\frac{1}{n}\sum_{i=1}^n\varepsilon_i\Bigr|=o_p(1).$$

The proof of Proposition 2.8 is in Section 5. We conclude this section with a simple lemma that will be needed repeatedly in the sequel.

###### Lemma 2.9

Suppose that $E[|\varepsilon|^\beta]<\infty$ for some $\beta\ge 1$. Then

$$\max_{1\le i\le n}|\varepsilon_i|=o_p(n^{1/\beta}).$$

If $\varepsilon$ also has mean zero, then, as $A\to\infty$,

$$E\bigl[\varepsilon\mathbf{1}\{|\varepsilon|\le A\}\bigr]=o(A^{1-\beta}).$$
###### Proof.

The first conclusion follows from a sharpened version of the Markov inequality: for $a>0$,

$$P\Bigl(\max_{1\le i\le n}|\varepsilon_i|>an^{1/\beta}\Bigr)\le\sum_{i=1}^nP\bigl(|\varepsilon_i|>an^{1/\beta}\bigr)\le a^{-\beta}E\bigl[|\varepsilon|^\beta\mathbf{1}\{|\varepsilon|>an^{1/\beta}\}\bigr]\to 0.$$

The second conclusion follows from

$$\bigl|E[\varepsilon\mathbf{1}\{|\varepsilon|\le A\}]\bigr|=\bigl|E[\varepsilon\mathbf{1}\{|\varepsilon|>A\}]\bigr|\le A^{1-\beta}E\bigl[|\varepsilon|^\beta\mathbf{1}\{|\varepsilon|>A\}\bigr]=o(A^{1-\beta}).$$

In the first equality we have used that $\varepsilon$ has mean zero. ∎

## 3 Auxiliary Results

In this section we derive some results that will be used in the proof of Proposition 2.8. Let $(\Omega,\mathscr{A},P)$ be a probability space. For each positive integer $n$ let $V_1,\dots,V_n$ be independent $\mathcal{V}$-valued random variables with common distribution, and for each $x$ in $\mathbb{R}$ let $h_{nx}$ be a bounded measurable function from $\mathcal{V}$ into $\mathbb{R}$. We first study the process $H_n$ defined by

$$H_n(x)=\frac{1}{n}\sum_{j=1}^n\bigl(h_{nx}(V_j)-E[h_{nx}(V_j)]\bigr),\quad x\in\mathbb{R}.$$

###### Lemma 3.1

Let $B_n$ be a sequence of positive numbers such that $B_n=O(n^\gamma)$ for some $\gamma>0$. Suppose that

$$\sup_{|x|\le B_n}\bigl(E[h_{nx}^2(V)]+\|h_{nx}\|_\infty\bigr)=O(n/\log n) \tag{3.1}$$

and, for some positive numbers $\kappa_1$ and $\kappa_2$,

$$\|h_{ny}-h_{nx}\|_\infty\le|y-x|^{\kappa_1}O(n^{\kappa_2}),\quad|x|,|y|\le B_n,\ |y-x|\le 1. \tag{3.2}$$

Then

$$\sup_{|x|\le B_n}|H_n(x)|=O_p(1). \tag{3.3}$$

If we strengthen (3.1) to

$$\sup_{|x|\le B_n}\bigl(E[h_{nx}^2(V)]+\|h_{nx}\|_\infty\bigr)=o(n/\log n), \tag{3.4}$$

then

$$\sup_{|x|\le B_n}|H_n(x)|=o_p(1). \tag{3.5}$$
###### Proof.

To prove the lemma we use an inequality of Hoeffding (1963): if $\xi_1,\dots,\xi_n$ are independent random variables that have mean zero and variance $\sigma^2$ and are bounded in absolute value by $M$, then for $\eta>0$,

$$P\Bigl(\Bigl|\frac{1}{n}\sum_{j=1}^n\xi_j\Bigr|\ge\eta\Bigr)\le 2\exp\Bigl(-\frac{n\eta^2}{2\sigma^2+(2/3)M\eta}\Bigr).$$

Applying this inequality with $\xi_j=h_{nx}(V_j)-E[h_{nx}(V_j)]$, and using $\sigma^2\le E[h_{nx}^2(V)]$ and $M\le 2\|h_{nx}\|_\infty$, we obtain for $\eta>0$:

$$P\bigl(|H_n(x)|\ge\eta\bigr)\le 2\exp\Bigl(-\frac{n\eta^2}{2E[h_{nx}^2(V)]+2\eta\|h_{nx}\|_\infty}\Bigr).$$

Thus, by (3.1), there is a positive number $a$ such that for all $\eta>0$,

$$\sup_{|x|\le B_n}P\bigl(|H_n(x)|\ge\eta\bigr)\le 2\exp\Bigl(-\frac{\eta^2}{(1\vee\eta)a}\log n\Bigr).$$

Now let $x_{nk}=-B_n+2kB_nn^{-m}$ for $k=0,\dots,n^m$, with $m$ an integer greater than $\gamma+\kappa_2/\kappa_1$. The above yields, for $\eta$ large enough,

$$P\Bigl(\max_{k=0,\dots,n^m}|H_n(x_{nk})|>\eta\Bigr)\le\sum_{k=0}^{n^m}P\bigl(|H_n(x_{nk})|>\eta\bigr)=o(1).$$

Now, using (3.2),

$$\sup_{|x|\le B_n}|H_n(x)|\le\max_{k=0,\dots,n^m}\Bigl(|H_n(x_{nk})|+\sup_{|x-x_{nk}|\le B_nn^{-m}}|H_n(x)-H_n(x_{nk})|\Bigr)=O_p(1)+O\bigl(B_n^{\kappa_1}n^{-m\kappa_1}n^{\kappa_2}\bigr)=O_p(1).$$

This is the desired result (3.3). The second conclusion is an immediate consequence. ∎
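As a quick numerical sanity check of the exponential inequality used above (a simulation sketch of ours, not part of the proof): for $\xi_j$ uniform on $[-1,1]$ we have $\sigma^2=1/3$ and $M=1$, and the empirical tail frequency of $|\frac{1}{n}\sum_j\xi_j|$ should not exceed the bound.

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps, eta = 100, 2000, 0.2
sigma2, M = 1.0 / 3.0, 1.0                     # variance and bound for Uniform(-1, 1)
xi = rng.uniform(-1.0, 1.0, size=(reps, n))    # mean-zero bounded variables
emp = np.mean(np.abs(xi.mean(axis=1)) >= eta)  # empirical tail frequency
bound = 2.0 * np.exp(-n * eta**2 / (2.0 * sigma2 + (2.0 / 3.0) * M * eta))
```

Here `bound` is about $2e^{-5}\approx 0.013$, while the true tail probability is roughly a $3.5$-standard-deviation event, so the inequality is comfortably satisfied.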

Next we consider the degenerate U-process

$$U_n(x)=n^{-m/2}\sum_{(i_1,\dots,i_m)\in I_n^m}u_{nx}(V_{i_1},\dots,V_{i_m}),\quad x\in\mathbb{R},$$

with $I_n^m=\{(i_1,\dots,i_m):1\le i_j\le n,\ i_j\ne i_k\text{ for }j\ne k\}$, and $u_{nx}$ a bounded measurable function from $\mathcal{V}^m$ to $\mathbb{R}$ such that for all $v_1,\dots,v_m$ in $\mathcal{V}$,

$$E[u_{nx}(V_1,v_2,\dots,v_m)]=\cdots=E[u_{nx}(v_1,v_2,\dots,V_m)]=0.$$

Set $\|u_{nx}\|_2=\bigl(E[u_{nx}^2(V_1,\dots,V_m)]\bigr)^{1/2}$.

###### Lemma 3.2

Let $B_n$ be positive numbers such that $B_n=O(n^\gamma)$ for some $\gamma>0$. Suppose that

$$\sup_{|x|\le B_n}\bigl(\|u_{nx}\|_2^{2/m}+\|u_{nx}\|_\infty^{2/(m+1)}n^{-1/(m+1)}\bigr)=O\bigl((\log n)^{-1}\bigr) \tag{3.6}$$

and, for some positive $\kappa_1$ and $\kappa_2$,

$$\|u_{ny}-u_{nx}\|_\infty\le|y-x|^{\kappa_1}O(n^{\kappa_2}),\quad|x|,|y|\le B_n,\ |y-x|\le 1. \tag{3.7}$$

Then

$$\sup_{|x|\le B_n}|U_n(x)|=O_p(1). \tag{3.8}$$

If we strengthen (3.6) to

$$\sup_{|x|\le B_n}\bigl(\|u_{nx}\|_2^{2/m}+\|u_{nx}\|_\infty^{2/(m+1)}n^{-1/(m+1)}\bigr)=o\bigl((\log n)^{-1}\bigr), \tag{3.9}$$

then

$$\sup_{|x|\le B_n}|U_n(x)|=o_p(1). \tag{3.10}$$
###### Proof.

We use a similar argument as for Lemma 3.1, but rely now on the Arcones–Giné exponential inequality for degenerate U-processes (inequality (c) in Proposition 2.3 of Arcones and Giné, 1994). This inequality states that there are constants $c_1$ and $c_2$ depending only on $m$ such that, for every $x$, all $n$ and all $\eta>0$,

$$P\bigl(|U_n(x)|>\eta\bigr)\le c_1\exp\Bigl(-\frac{c_2\,\eta^{2/m}}{\|u_{nx}\|_2^{2/m}+\bigl(\|u_{nx}\|_\infty\eta^{1/m}n^{-1/2}\bigr)^{2/(m+1)}}\Bigr).$$

From this inequality one obtains, as in the proof of Lemma 3.1, that there is a positive number $b$ such that

$$\sup_{|x|\le B_n}P\bigl(|U_n(x)|>\eta\bigr)\le c_1\exp\Bigl(-\frac{\eta^{2/m}}{(1\vee\eta)^{2/(m+m^2)}\,b}\log n\Bigr),\quad\eta>0.$$

Now proceed as in the proof of Lemma 3.1. ∎

## 4 Properties of local polynomial smoothers

For an introduction to local polynomial smoothers we refer to Fan and Gijbels (1996). In this section we derive some properties of local polynomial smoothers of order $d$, defined by $\hat r(x)=\hat\beta_0(x)$ for $x\in[0,1]$, where $\hat\beta(x)=(\hat\beta_0(x),\dots,\hat\beta_d(x))^\top$ is the minimizer of

$$\sum_{j=1}^n\Bigl(Y_j-\sum_{m=0}^d\beta_m\Bigl(\frac{Z_j-x}{c_n}\Bigr)^m\Bigr)^2\frac{1}{c_n}w\Bigl(\frac{Z_j-x}{c_n}\Bigr).$$

Here we have re-scaled for convenience. The normal equations are

 Qn(x)β=1nn∑j=1wn(Zj−x)Yj,

where the vector has entries

 wnm(x)=xmcm+1nw(xcn),

and the matrix has entries , , with

 qnm(x)=1nn∑j=1wnm(Zj−x).

By the properties of the kernel $w$ and the covariate density $g$ we have, for $m=0,\dots,2d$ and all $x$,

$$|w_{nm}(x)|\le\|w\|_\infty c_n^{-1}, \tag{4.1}$$
$$|w'_{nm}(x)|\le\bigl(\|w'\|_\infty+m\|w\|_\infty\bigr)c_n^{-2}, \tag{4.2}$$
$$E[w_{nm}^2(Z-x)]\le\|w\|_\infty\|g\|_\infty c_n^{-1}. \tag{4.3}$$

Write for the first column of the inverse of , and

 An(x,y)=pn(x)⊤wn(y−x).

From the normal equations we obtain

$$\hat r(x)=\hat\beta_0(x)=\frac{1}{n}\sum_{j=1}^nA_n(x,Z_j)Y_j.$$
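The weight representation just derived can be checked numerically. The sketch below is our own illustration, with $d=2$ and the Epanechnikov kernel as one admissible choice for $w$: it computes the effective weights $A_n(x,Z_j)$ from the normal equations.

```python
import numpy as np

def An_weights(z, x, cn, d=2):
    """Effective local-polynomial weights A_n(x, Z_j) = p_n(x)^T w_n(Z_j - x),
    where p_n(x) is the first column of Q_n(x)^{-1} and
    Q_n(x)_{ik} = q_{n,i+k}(x) = (1/n) sum_j w_{n,i+k}(Z_j - x)."""
    s = (z - x) / cn
    # (1/c_n) w((z - x)/c_n) with the Epanechnikov kernel for w
    wts = np.where(np.abs(s) <= 1.0, 0.75 * (1.0 - s**2), 0.0) / cn
    # note w_{nm}(z - x) = s^m * (1/c_n) w(s) in the re-scaled parametrization
    q = np.array([np.mean(s**m * wts) for m in range(2 * d + 1)])
    Q = np.array([[q[i + k] for k in range(d + 1)] for i in range(d + 1)])
    p = np.linalg.solve(Q, np.eye(d + 1)[:, 0])   # first column of Q_n(x)^{-1}
    wn = np.vstack([s**m * wts for m in range(d + 1)])  # rows: w_{nm}(Z_j - x)
    return p @ wn
```

Because the smoother reproduces polynomials of degree at most $d$, these weights satisfy $\frac{1}{n}\sum_j A_n(x,Z_j)=1$ and $\frac{1}{n}\sum_j A_n(x,Z_j)(x-Z_j)^m=0$ for $m=1,\dots,d$, the identities used in the proof of Theorem 4.3.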

For the expectation of $q_{nm}(x)$ we write

$$\bar q_{nm}(x)=E[q_{nm}(x)]=\int g(x+c_nt)t^mw(t)\,dt.$$

We define $\bar Q_n(x)$ correspondingly, replacing $q_{nm}$ by $\bar q_{nm}$. Furthermore, $\bar p_n(x)$ and $\bar A_n(x,y)$ are defined as $p_n(x)$ and $A_n(x,y)$, with $Q_n(x)$ replaced by $\bar Q_n(x)$.

For a unit vector $v=(v_0,\dots,v_d)^\top$ and $x\in[0,1]$ we have

$$v^\top\bar Q_n(x)v=\int\Bigl(\sum_{i=0}^dv_it^i\Bigr)^2g(x+c_nt)w(t)\,dt.$$

Thus, by Assumption 2.1,

$$(d+1)\|g\|_\infty\ge v^\top\bar Q_n(x)v\ge\inf_{0\le x\le 1}g(x)\int_{-1\vee(-x/c_n)}^{1\wedge((1-x)/c_n)}\Bigl(\sum_{i=0}^dv_it^i\Bigr)^2w(t)\,dt.$$

By Assumption 2.5, there is an $\eta>0$ such that the eigenvalues of $\bar Q_n(x)$ are in the interval $[\eta,(d+1)\|g\|_\infty]$ for all $x\in[0,1]$ and all sufficiently large $n$. Thus $\bar Q_n(x)$ is invertible, and

$$\sup_{0\le x\le 1}\|\bar Q_n^{-1}(x)\|\le 1/\eta. \tag{4.4}$$
###### Lemma 4.1

Suppose Assumptions 2.1 and 2.5 hold. Let $c_n\to 0$ and $nc_n/\log n\to\infty$. Then

$$\sup_{0\le x\le 1}\bigl|q_{nm}(x)-\bar q_{nm}(x)\bigr|=O_p\bigl((nc_n/\log n)^{-1/2}\bigr),\quad m=0,\dots,2d,$$

and consequently

$$\sup_{0\le x\le 1}\bigl\|Q_n(x)-\bar Q_n(x)\bigr\|=O_p\bigl((nc_n/\log n)^{-1/2}\bigr).$$
###### Proof.

Fix $m$ and use Lemma 3.1 with $B_n=1$, $V_j=Z_j$ and

$$h_{nx}(v)=(nc_n/\log n)^{1/2}w_{nm}(v-x).$$

For these choices, the conditions (3.1) and (3.2), with $\kappa_1=1$, follow from (4.1) to (4.3). ∎

###### Lemma 4.2

Suppose Assumptions 2.1 and 2.5 hold. Assume also that $\varepsilon$ has mean zero and a finite moment of order $\beta>2$. Let $c_n\to 0$ and $n^{1-2/\beta}c_n/\log n\to\infty$. Then

$$\sup_{0\le x\le 1}\Bigl|\frac{1}{n}\sum_{j=1}^nw_{nm}(Z_j-x)\varepsilon_j\Bigr|=O_p\bigl((nc_n/\log n)^{-1/2}\bigr),\quad m=0,\dots,2d.$$
###### Proof.

Fix $m$. In view of Lemmas 2.9 and 4.1 it suffices to show that

$$\sup_{0\le x\le 1}\Bigl|\frac{1}{n}\sum_{j=1}^nw_{nm}(Z_j-x)\varepsilon_{nj}\Bigr|=O_p\bigl((nc_n/\log n)^{-1/2}\bigr), \tag{4.5}$$

where $\varepsilon_{nj}=\varepsilon_j\mathbf{1}\{|\varepsilon_j|\le n^{1/\beta}\}-E\bigl[\varepsilon\mathbf{1}\{|\varepsilon|\le n^{1/\beta}\}\bigr]$. Here we used the fact that

$$(nc_n/\log n)^{1/2}E\bigl[\varepsilon\mathbf{1}\{|\varepsilon|\le n^{1/\beta}\}\bigr]=o\bigl(n^{-1/2}c_n^{1/2}n^{1/\beta}\bigr)=o(1).$$

But (4.5) follows from an application of Lemma 3.1 with $B_n=1$, $V_j=(Z_j,\varepsilon_j)$ and

$$h_{nx}(Z_j,\varepsilon_j)=(nc_n/\log n)^{1/2}w_{nm}(Z_j-x)\varepsilon_{nj}.$$

Indeed, the left-hand side of (3.1) is of order $O\bigl(n/\log n+n^{1/2+1/\beta}c_n^{-1/2}(\log n)^{-1/2}\bigr)$, which is of order $O(n/\log n)$ by the assumptions on $c_n$. Relation (3.2), with $\kappa_1=1$, follows by the Lipschitz continuity of $w$. ∎

###### Theorem 4.3

Suppose Assumptions 2.1 and 2.5 hold. Assume also that $\varepsilon$ has mean zero and a finite moment of order $\beta>2$. Let $c_n\to 0$ and $n^{1-2/\beta}c_n/\log n\to\infty$. Then

$$\sup_{0\le x\le 1}\Bigl|\frac{1}{n}\sum_{j=1}^n\bar A_n(x,Z_j)\varepsilon_j\Bigr|=O_p\bigl((nc_n/\log n)^{-1/2}\bigr). \tag{4.6}$$

If, in addition, $r$ is $\nu$-times continuously differentiable with $1\le\nu\le d$, then

$$\sup_{0\le x\le 1}\Bigl|\hat r(x)-r(x)-\frac{1}{n}\sum_{j=1}^n\bar A_n(x,Z_j)\varepsilon_j\Bigr|=O_p\bigl(\log n/(nc_n)\bigr)+o_p(c_n^\nu). \tag{4.7}$$

If $r$ has a Lipschitz continuous $\nu$-th derivative, then $o_p(c_n^\nu)$ can be replaced by $O_p(c_n^{\nu+1})$.

###### Proof.

Since

$$\frac{1}{n}\sum_{j=1}^n\bar A_n(x,Z_j)\varepsilon_j=\bar p_n(x)^\top\frac{1}{n}\sum_{j=1}^nw_n(Z_j-x)\varepsilon_j,$$

relation (4.6) follows from (4.4) and Lemma 4.2. To prove (4.7), write

$$\hat r(x)=\tilde r(x)+p_n(x)^\top\frac{1}{n}\sum_{j=1}^nw_n(Z_j-x)\varepsilon_j$$

with

$$\tilde r(x)=\frac{1}{n}\sum_{j=1}^nA_n(x,Z_j)r(Z_j).$$

By Lemma 4.1 and relation (4.4),

$$\sup_{0\le x\le 1}\|p_n(x)-\bar p_n(x)\|=O_p\bigl((nc_n/\log n)^{-1/2}\bigr). \tag{4.8}$$

In view of this and Lemma 4.2, assertion (4.7) follows if we verify

$$\sup_{0\le x\le 1}|\tilde r(x)-r(x)|=o_p(c_n^\nu). \tag{4.9}$$

By construction,

$$\frac{1}{n}\sum_{j=1}^nA_n(x,Z_j)=1\quad\text{and}\quad\frac{1}{n}\sum_{j=1}^nA_n(x,Z_j)(x-Z_j)^m=0,\quad m=1,\dots,d.$$

Hence, if $r$ is $\nu$-times continuously differentiable with $\nu\le d$, we can write

$$\tilde r(x)-r(x)=\frac{1}{n}\sum_{j=1}^nA_n(x,Z_j)\Bigl(r(Z_j)-r(x)-\sum_{m=1}^\nu\frac{r^{(m)}(x)(Z_j-x)^m}{m!}\Bigr)$$

and obtain the bound

$$|\tilde r(x)-r(x)|\le\frac{1}{n}\sum_{j=1}^n|A_n(x,Z_j)|\,\frac{c_n^\nu}{\nu!}\sup_{z\in[0,1],\,|z-x|\le c_n}\bigl|r^{(\nu)}(z)-r^{(\nu)}(x)\bigr|.$$

By (4.8), Lemma 4.1 and (4.4),

$$\sup_{0\le x\le 1}\frac{1}{n}\sum_{j=1}^n|A_n(x,Z_j)|=O_p(1). \tag{4.10}$$

The desired (4.9) follows from this and the uniform continuity of $r^{(\nu)}$ on $[0,1]$. If the $\nu$-th derivative is Lipschitz, one readily sees that (4.9) holds with $o_p(c_n^\nu)$ replaced by $O_p(c_n^{\nu+1})$. ∎

We conclude this section b