
Estimating the error distribution function in nonparametric regression (Technical report, 2004)
AMS 2000 subject classification: Primary 62G05, 62G08, 62G20
Key words and phrases: Local polynomial smoother, kernel estimator, under-smoothing, plug-in estimator, error variance, empirical likelihood, adaptive estimator, efficient estimator, influence function

Ursula U. Müller, Anton Schick, Wolfgang Wefelmeyer

Summary: We construct an efficient estimator for the error distribution function of the nonparametric regression model $Y = r(X) + \varepsilon$. Our estimator is a kernel-smoothed empirical distribution function based on residuals from an under-smoothed local quadratic smoother for the regression function.

1 Introduction

Consider the nonparametric regression model $Y = r(X) + \varepsilon$, where the covariate $X$ and the error $\varepsilon$ are independent, and $\varepsilon$ has mean zero, finite variance $\sigma^2$ and density $f$. We observe independent copies $(X_1, Y_1), \dots, (X_n, Y_n)$ of $(X, Y)$ and want to estimate the distribution function $F$ of $\varepsilon$. If the regression function $r$ were known, we could use the empirical distribution function $\mathbb{F}$ based on the errors $\varepsilon_j = Y_j - r(X_j)$, defined by
$$\mathbb{F}(t) = \frac{1}{n} \sum_{j=1}^n \mathbf{1}[\varepsilon_j \le t], \qquad t \in \mathbb{R}.$$

We consider the regression function as unknown and propose a kernel-smoothed empirical distribution function $\hat F$ based on residuals from an under-smoothed local quadratic smoother for the regression function. We give conditions under which $\hat F$ is asymptotically equivalent to $\mathbb{F}$ plus a correction term:

$$\sup_{t \in \mathbb{R}} \Big| \hat F(t) - \mathbb{F}(t) - f(t) \frac{1}{n} \sum_{j=1}^n \varepsilon_j \Big| = o_p(n^{-1/2}). \qquad (1.1)$$

Smoothing the empirical distribution function is appropriate because we assume that the error distribution has a Lipschitz density and therefore a smooth distribution function. A local quadratic smoother for the regression function is appropriate because we assume that the regression function is twice continuously differentiable.

It follows from (1.1) that $\hat F(t)$ has influence function
$$\mathbf{1}[\varepsilon \le t] - F(t) + f(t)\,\varepsilon.$$

Müller, Schick and Wefelmeyer (2004a) show that this is the efficient influence function for estimators of $F(t)$. Hence $\hat F$ is efficient for $F$ in the sense that $\hat F(t)$ is a least dispersed regular estimator of $F(t)$ for all $t$ and all distributions permitted by the model. The influence function of our estimator coincides with the efficient influence function in the model with constant regression function; see Bickel, Klaassen, Ritov and Wellner (1998, Section 5.5, Example 1).

It follows in particular from (1.1) that $\hat F(t)$ has asymptotic variance
$$F(t)\big(1 - F(t)\big) + 2 f(t)\, E\big[\varepsilon \mathbf{1}[\varepsilon \le t]\big] + \sigma^2 f(t)^2.$$

If $f$ is a normal density, then $E[\varepsilon \mathbf{1}[\varepsilon \le t]] = -\sigma^2 f(t)$, and this simplifies to
$$F(t)\big(1 - F(t)\big) - \sigma^2 f(t)^2.$$

Hence, for normal errors, the asymptotic variance of $\hat F(t)$ is strictly smaller than the asymptotic variance $F(t)(1 - F(t))$ of the empirical estimator $\mathbb{F}(t)$ based on the true errors. This paradox is explained by the fact that the empirical estimator is not efficient: unlike $\hat F$, it does not make use of the information that the errors have mean zero. The efficient influence function for estimators of $F(t)$ from mean zero observations $\varepsilon_1, \dots, \varepsilon_n$ is
$$\mathbf{1}[\varepsilon \le t] - F(t) - \frac{E[\varepsilon \mathbf{1}[\varepsilon \le t]]}{\sigma^2}\, \varepsilon;$$
see Levit (1975). Efficient estimators for $F(t)$ from observations $\varepsilon_1, \dots, \varepsilon_n$ are the corrected empirical distribution function
$$\mathbb{F}(t) - \frac{\frac{1}{n}\sum_{j=1}^n \varepsilon_j \mathbf{1}[\varepsilon_j \le t]}{\frac{1}{n}\sum_{j=1}^n \varepsilon_j^2}\, \bar\varepsilon, \qquad \bar\varepsilon = \frac{1}{n}\sum_{j=1}^n \varepsilon_j,$$
and the empirical likelihood estimator
$$\sum_{j=1}^n \pi_j \mathbf{1}[\varepsilon_j \le t],$$
with (random) probabilities $\pi_1, \dots, \pi_n$ maximizing $\prod_{j=1}^n \pi_j$ subject to $\sum_{j=1}^n \pi_j = 1$ and $\sum_{j=1}^n \pi_j \varepsilon_j = 0$. The empirical likelihood was introduced by Owen (1988), (1990); see also Owen (2001). The asymptotic variance of an efficient estimator for $F(t)$ from $\varepsilon_1, \dots, \varepsilon_n$ is
$$F(t)\big(1 - F(t)\big) - \frac{\big(E[\varepsilon \mathbf{1}[\varepsilon \le t]]\big)^2}{\sigma^2}.$$
The variance increase of our estimator $\hat F(t)$ over such efficient estimators is therefore
$$\frac{1}{\sigma^2}\Big( E\big[\varepsilon \mathbf{1}[\varepsilon \le t]\big] + \sigma^2 f(t) \Big)^2.$$
This is the price for not knowing the regression function. For normal errors this term is zero, and we lose nothing. We refer also to the introduction of Müller, Schick and Wefelmeyer (2004b).
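To make the comparison concrete, here is a minimal Python sketch (ours, not from the paper) of the empirical likelihood estimator of $F(t)$ in the simpler model where mean-zero errors are observed directly. It uses the standard closed form of Owen's weights, $\pi_j = n^{-1}(1 + \lambda \varepsilon_j)^{-1}$ with $\lambda$ solving $\sum_j \varepsilon_j/(1 + \lambda \varepsilon_j) = 0$; the function names and the Newton solver are illustrative assumptions.

```python
import numpy as np

def el_weights(eps, tol=1e-10, max_iter=100):
    """Empirical likelihood weights maximizing prod pi_j subject to
    sum pi_j = 1 and sum pi_j * eps_j = 0 (Owen 1988, 1990).
    The weights have the form pi_j = 1 / (n * (1 + lam * eps_j)),
    where lam solves sum eps_j / (1 + lam * eps_j) = 0."""
    eps = np.asarray(eps, dtype=float)
    n = len(eps)
    lam = 0.0
    for _ in range(max_iter):
        denom = 1.0 + lam * eps
        g = np.sum(eps / denom)            # estimating equation in lam
        dg = -np.sum(eps**2 / denom**2)    # its derivative
        step = g / dg
        lam -= step
        if abs(step) < tol:
            break
    return 1.0 / (n * (1.0 + lam * eps))

def el_distribution(eps, t):
    """Empirical likelihood estimator of F(t): sum of pi_j over eps_j <= t."""
    pi = el_weights(eps)
    return np.sum(pi * (np.asarray(eps) <= t))

# Example: standard normal errors, estimate F(0.5).
rng = np.random.default_rng(0)
errors = rng.standard_normal(200)
print(el_distribution(errors, 0.5))
```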

Our proof is complicated by two features of the model: the error distribution cannot be estimated adaptively with respect to the regression function, and the regression function cannot be estimated at the rate $n^{-1/2}$. Akritas and Van Keilegom (2001) encountered these problems in a related model, the heteroscedastic regression model $Y = r(X) + s(X)\varepsilon$. They used different techniques and stronger assumptions to get an expansion similar to (1.1). Their results do not cover ours in our simpler model.

Previous related results are easier because at least one of these complicating features is missing. Loynes (1980) assumes that . Koul (1969), (1970), (1987), (1992), Shorack (1984), Shorack and Wellner (1986, Section 4.6) and Bai (1996) consider linear models . Mammen (1996) studies the linear model as the dimension of increases with . Klaassen and Putter (1997) and (2001) construct efficient estimators for the error distribution function in the linear regression model . Koshevnik (1996) treats the nonparametric regression model with error density symmetric about zero; an efficient estimator for is obtained by symmetrizing the empirical distribution function based on residuals. Related results exist for time series. See Boldin (1982), Koul (2002, Chapter 7) and Koul and Leventhal (1989) for linear autoregressive processes ; Kreiss (1991) and Schick and Wefelmeyer (2002b) for invertible linear processes ; and Koul (2002, Chapter 8), Schick and Wefelmeyer (2002a) and Müller, Schick and Wefelmeyer (2004c, Section 4) for nonlinear autoregressive processes . For invertible linear processes, Schick and Wefelmeyer (2004) show that the smoothed residual-based empirical estimator is asymptotically equivalent to the empirical estimator based on the true innovations. General considerations on empirical processes based on estimated observations are in Ghoudi and Rémillard (1998).

Our result gives efficient estimators for linear functionals $\int h \, dF$ with bounded $h$. For smooth and $F$-square-integrable functions $h$, it is easier to prove an i.i.d. representation analogous to (1.1) directly; see Müller, Schick and Wefelmeyer (2004a), who also use an under-smoothed estimator for the regression function. Müller, Schick and Wefelmeyer (2004b) compare these results with estimation in the larger model in which one assumes only $E[\varepsilon \mid X] = 0$ rather than independence of $X$ and $\varepsilon$ with $E[\varepsilon] = 0$. A particularly simple special case is the error variance $E[\varepsilon^2]$, with $h(x) = x^2$. For the estimator based on residuals from a kernel estimator of the regression function, under-smoothing is not needed. The asymptotic variance of this estimator was already obtained by Hall and Marron (1990). Müller, Schick and Wefelmeyer (2003) show that a covariate-matched U-statistic is efficient for the error variance; it does not require estimating the regression function but uses a kernel density estimator for the covariate density $g$. There is a large literature on simpler, inefficient, difference-based estimators for the error variance; reviews are Carter and Eagleson (1992) and Dette, Munk and Wagner (1998) and (1999).
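As an illustration of the difference-based estimators mentioned above, here is a minimal Python sketch (ours; not the covariate-matched U-statistic of Müller, Schick and Wefelmeyer, 2003) of a simple first-order difference estimator of the error variance: after sorting by the covariate, successive responses nearly share the same value of the smooth regression function, so half the mean squared difference estimates $\sigma^2$.

```python
import numpy as np

def difference_variance(x, y):
    """First-order difference-based estimator of the error variance in
    Y = r(X) + eps: sort by the covariate, difference consecutive responses,
    and take half the mean squared difference (the smooth regression part
    nearly cancels for neighbouring design points)."""
    order = np.argsort(x)
    dy = np.diff(np.asarray(y, dtype=float)[order])
    return 0.5 * np.mean(dy**2)

# Example with r(x) = sin(2*pi*x) and errors with variance 0.25.
rng = np.random.default_rng(1)
x = rng.uniform(size=500)
y = np.sin(2 * np.pi * x) + 0.5 * rng.standard_normal(500)
print(difference_variance(x, y))   # should be close to 0.25
```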

We can write
$$F(t) = Q\big(\{(x, y) : y - r(x) \le t\}\big),$$
where $Q$ is the joint distribution of $(X, Y)$. Our estimator is obtained by plugging in estimators for $Q$ and $r$. For $Q$ we use essentially the empirical distribution; for $r$ we use a local quadratic smoother that is under-smoothed and hence does not have the optimal rate for estimating $r$. This means that our estimator does not obey the plug-in principle of Bickel and Ritov (2000) and (2003).

The paper is organized as follows. Section 2 introduces our estimator and states, in Theorem 2.7, the assumptions needed for expansion (1.1). Section 3 derives some consequences of exponential inequalities, and Section 4 contains properties of local polynomial smoothers. Section 5 gives the proof of Proposition 2.8.

2 The estimator and the main result

Let us now define our estimator. We begin by defining the residuals. This requires an estimator $\hat r$ of the regression function. We take $\hat r$ to be a local quadratic smoother. To define it we need a kernel $w$ and a bandwidth $c$. A local quadratic smoother of $r$ is defined as $\hat r(x) = \hat\beta_0(x)$ for $x \in [0, 1]$, where $\hat\beta(x) = (\hat\beta_0(x), \hat\beta_1(x), \hat\beta_2(x))^\top$ is the minimizer of
$$\sum_{j=1}^n \Big( Y_j - \beta_0 - \beta_1 (X_j - x) - \beta_2 (X_j - x)^2 \Big)^2 w\Big( \frac{X_j - x}{c} \Big).$$

The residuals of the regression estimator are
$$\hat\varepsilon_j = Y_j - \hat r(X_j), \qquad j = 1, \dots, n.$$

Let $\hat{\mathbb{F}}$ denote the empirical distribution function based on these residuals:
$$\hat{\mathbb{F}}(t) = \frac{1}{n} \sum_{j=1}^n \mathbf{1}[\hat\varepsilon_j \le t].$$

Our estimator of the error distribution function will be a smoothed version of $\hat{\mathbb{F}}$. To this end, let $k$ be a density and $b$ another bandwidth. Then we define our estimator of $F$ by
$$\hat F(t) = \int \hat{\mathbb{F}}(t - b u)\, k(u)\, du, \qquad t \in \mathbb{R}.$$

With $\mathbb{K}(t) = \int_{-\infty}^{t} k(u)\, du$ the distribution function of $k$, we can write
$$\hat F(t) = \frac{1}{n} \sum_{j=1}^n \mathbb{K}\Big( \frac{t - \hat\varepsilon_j}{b} \Big).$$

This shows that $\hat F$ is the convolution of the empirical distribution function of the residuals with the distribution function of $b$ times a random variable with density $k$. Alternatively, $\hat F$ is the distribution function with density $\hat f$ given by
$$\hat f(t) = \frac{1}{n b} \sum_{j=1}^n k\Big( \frac{t - \hat\varepsilon_j}{b} \Big).$$

This is the usual kernel density estimator of $f$ based on the residuals, with kernel $k$ and bandwidth $b$. We make the following assumptions.

Assumption 2.1

The covariate density $g$ is bounded and bounded away from zero on $[0, 1]$, and its restriction to $[0, 1]$ is (uniformly) continuous.

Assumption 2.2

The regression function $r$ is twice continuously differentiable.

Assumption 2.3

The error density $f$ is Lipschitz, has mean zero, and satisfies the moment condition for some .

Assumption 2.4

The density $k$ is symmetric, twice continuously differentiable, and has compact support $[-1, 1]$.

Assumption 2.5

The kernel $w$ used to define the local quadratic smoother is a symmetric density which has compact support and a bounded derivative $w'$.

Assumption 2.6

The bandwidths satisfy and .

Note that $c$ is smaller than the optimal bandwidth for estimating $r$ under Assumptions 2.1 and 2.2. Such a bandwidth would be proportional to $n^{-1/5}$. This means that our choice of bandwidth results in an under-smoothed local quadratic smoother.
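To illustrate the construction, here is a minimal Python sketch (ours, not part of the paper): it fits a local quadratic smoother with a deliberately small bandwidth, forms the residuals $\hat\varepsilon_j = Y_j - \hat r(X_j)$, and smooths their empirical distribution function with an integrated kernel. The function names, the Epanechnikov and biweight kernels, and the bandwidth choices are illustrative assumptions and are not meant to satisfy Assumptions 2.4 to 2.6 exactly.

```python
import numpy as np

def local_quadratic(x_grid, X, Y, c):
    """Local quadratic smoother: at each point x, weighted least squares fit of
    Y on [1, X - x, (X - x)^2] with Epanechnikov weights w((X - x)/c);
    the fitted intercept estimates r(x)."""
    r_hat = np.empty_like(x_grid, dtype=float)
    for i, x in enumerate(x_grid):
        u = (X - x) / c
        w = np.where(np.abs(u) <= 1, 0.75 * (1 - u**2), 0.0)   # Epanechnikov kernel
        D = np.column_stack([np.ones_like(X), X - x, (X - x)**2])
        WD = D * w[:, None]
        beta = np.linalg.solve(D.T @ WD, WD.T @ Y)              # normal equations
        r_hat[i] = beta[0]
    return r_hat

def smoothed_error_cdf(t_grid, residuals, b):
    """Kernel-smoothed empirical distribution function of the residuals:
    F_hat(t) = n^{-1} sum_j K((t - eps_hat_j)/b), with K the integrated
    biweight kernel (an illustrative choice of smoothing density k)."""
    def K(s):
        s = np.clip(s, -1.0, 1.0)
        return 0.5 + (15.0 / 16.0) * (s - 2 * s**3 / 3 + s**5 / 5)
    t = np.asarray(t_grid, dtype=float)[:, None]
    return K((t - residuals[None, :]) / b).mean(axis=1)

# Simulated example: r(x) = sin(2*pi*x), centered exponential errors.
rng = np.random.default_rng(2)
n = 400
X = rng.uniform(size=n)
eps = rng.exponential(size=n) - 1.0
Y = np.sin(2 * np.pi * X) + eps

c = n ** (-1 / 3)      # deliberately smaller than the n^{-1/5} rate: under-smoothing
b = n ** (-1 / 4)      # bandwidth for smoothing the residual empirical d.f.
residuals = Y - local_quadratic(X, X, Y, c)
print(smoothed_error_cdf(np.array([-1.0, 0.0, 1.0]), residuals, b))  # estimates of F
```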

We are now ready to state our main result.

Theorem 2.7

Suppose that Assumptions 2.1 to 2.6 hold. Then
$$\sup_{t \in \mathbb{R}} \Big| \hat F(t) - \mathbb{F}(t) - f(t) \frac{1}{n} \sum_{j=1}^n \varepsilon_j \Big| = o_p(n^{-1/2}).$$
In particular, $n^{1/2}(\hat F - F)$ converges in distribution in the space $D[-\infty, \infty]$ to a centered Gaussian process.

Proof.

For and set

Since the density $k$ has mean zero by Assumption 2.4, we have

Thus the Lipschitz continuity of $f$ yields

It follows from standard empirical process theory that

(2.1)

Indeed, with we have

The above shows that

Hence the desired result follows from Proposition 2.8 below. ∎

Proposition 2.8

Suppose that Assumptions 2.1 to 2.6 hold. Then

The proof of Proposition 2.8 is in Section 5. We conclude this section with a simple lemma that will be needed repeatedly in the sequel.

Lemma 2.9

Suppose that for some . Then

If, in addition, it has mean zero, then, as ,

Proof.

The first conclusion follows by the sharper version of the Markov inequality: For ,

The second conclusion follows from

In the first equality, we have used that has mean zero. ∎

3 Auxiliary Results

In this section we derive some results that will be used in the proof of Proposition 2.8. Let be a probability space. For each positive integer let be independent -valued random variables with distribution , and for each in , let be a bounded measurable function from into . We first study the process defined by

Lemma 3.1

Let be a sequence of positive numbers such that for some . Suppose that

(3.1)

and, for positive numbers and ,

(3.2)

Then

(3.3)

If we strengthen (3.1) to

(3.4)

then

(3.5)
Proof.

To prove the lemma we use an exponential inequality of Hoeffding (1963): if $Z_1, \dots, Z_n$ are independent random variables that have mean zero and variance $\sigma^2$ and are bounded by $M$, then for $t > 0$,
$$P\Big( \Big| \sum_{i=1}^n Z_i \Big| > t \Big) \le 2 \exp\Big( - \frac{t^2}{2 n \sigma^2 + \tfrac{2}{3} M t} \Big).$$
Applying this inequality with , we obtain for :

Thus there is a positive number such that for all ,

Now let for , with an integer greater than . The above yields for large enough ,

Now, using (3.2),

This is the desired result (3.3). The second conclusion is an immediate consequence. ∎

Next we consider the degenerate U-process

with , and a bounded measurable function from to such that for all in ,

Set .

Lemma 3.2

Let be positive numbers such that for some . Suppose that

(3.6)

and, for some positive and ,

(3.7)

Then

(3.8)

If we strengthen (3.6) to

(3.9)

then

(3.10)
Proof.

We use an argument similar to the one for Lemma 3.1, but now rely on the Arcones–Giné exponential inequality for degenerate U-processes (inequality (c) in Proposition 2.3 of Arcones and Giné, 1994). This inequality states that there are constants and depending only on such that, for every , all and all ,

From this inequality one obtains as in the proof of Lemma 3.1 that there is a positive number such that

Now proceed as in the proof of Lemma 3.1. ∎

4 Properties of local polynomial smoothers

For an introduction to local polynomial smoothers we refer to Fan and Gijbels (1996). In this section we derive some properties of local polynomial smoothers of order $p$, defined by $\hat r(x) = \hat\beta_0(x)$ for $x \in [0, 1]$, where $\hat\beta(x) = (\hat\beta_0(x), \dots, \hat\beta_p(x))^\top$ is the minimizer of
$$\sum_{j=1}^n \Big( Y_j - \sum_{s=0}^{p} \beta_s \Big( \frac{X_j - x}{c} \Big)^{s} \Big)^2 w\Big( \frac{X_j - x}{c} \Big).$$
Here we have re-scaled for convenience. The normal equations are
$$W_n(x)\, \hat\beta(x) = V_n(x),$$
where the vector $V_n(x)$ has entries
$$V_{n,s}(x) = \frac{1}{n c} \sum_{j=1}^n Y_j \Big( \frac{X_j - x}{c} \Big)^{s} w\Big( \frac{X_j - x}{c} \Big), \qquad s = 0, \dots, p,$$
and the matrix $W_n(x)$ has entries $W_{n,s,t}(x) = w_{n, s+t}(x)$, $s, t = 0, \dots, p$, with
$$w_{n,q}(x) = \frac{1}{n c} \sum_{j=1}^n \Big( \frac{X_j - x}{c} \Big)^{q} w\Big( \frac{X_j - x}{c} \Big).$$
By the properties of the kernel and the covariate density we have for and all ,

(4.1)
(4.2)
(4.3)
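To make the weighted least squares structure concrete, the following Python sketch (ours; the $1/(nc)$ normalization shown above, the Epanechnikov weights and the function name are illustrative assumptions) builds the matrix $W_n(x)$ and the vector $V_n(x)$ and solves the normal equations for the local polynomial coefficients at a single point.

```python
import numpy as np

def local_poly_coefficients(x, X, Y, c, p=2):
    """Coefficients of a local polynomial smoother of order p at the point x,
    using the rescaled regressors ((X_j - x)/c)^s, s = 0,...,p.  The first
    coefficient estimates r(x); with the rescaling, the s-th coefficient
    estimates c^s r^{(s)}(x) / s!."""
    u = (X - x) / c
    w = np.where(np.abs(u) <= 1, 0.75 * (1 - u**2), 0.0)    # Epanechnikov weights
    # Normal equations W beta = V with the illustrative normalization
    #   W[s, t] = (nc)^{-1} sum_j u_j^{s+t} w(u_j),
    #   V[s]    = (nc)^{-1} sum_j Y_j u_j^s  w(u_j).
    n = len(X)
    powers = u[:, None] ** np.arange(p + 1)[None, :]        # n x (p+1) design
    W = (powers * w[:, None]).T @ powers / (n * c)
    V = (powers * w[:, None]).T @ Y / (n * c)
    return np.linalg.solve(W, V)

# Example: the intercept estimates r(0.5) = sin(pi) = 0 approximately.
rng = np.random.default_rng(3)
X = rng.uniform(size=500)
Y = np.sin(2 * np.pi * X) + 0.3 * rng.standard_normal(500)
print(local_poly_coefficients(0.5, X, Y, c=0.1)[0])
```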

Write for the first column of the inverse of , and

From the normal equations we obtain

For the expectation of we write

We define correspondingly, replacing by . Furthermore, and are defined as and , with replaced by .

For a unit vector and we have

Thus, by Assumption 2.1,

By Assumption 2.5, there is an such that the eigenvalues of are in the interval for all and . Thus is invertible, and

(4.4)
Lemma 4.1

Suppose Assumptions 2.1 and 2.5 hold. Let and . Then

and consequently

Proof.

Fix and use Lemma 3.1 with , and

For these choices, the conditions (3.1) and (3.2), with , follow from (4.1) to (4.3). ∎

Lemma 4.2

Suppose Assumptions 2.1 and 2.5 hold. Assume also that has mean zero and finite moment of order . Let and . Then

Proof.

Fix . In view of Lemmas 2.9 and 4.1 it suffices to show that

(4.5)

where . Here we used the fact that

But (4.5) follows from an application of Lemma 3.1 with , and

Indeed, the left-hand side of (3.1) is of order , which is of order by the assumptions on . Relation (3.2) follows by the Lipschitz continuity of . ∎

Theorem 4.3

Suppose Assumptions 2.1 and 2.5 hold. Assume also that has mean zero and finite moment of order . Let and . Then

(4.6)

If, in addition, is -times continuously differentiable with , then

(4.7)

If has a Lipschitz continuous -th derivative, then can be replaced by .

Proof.

Since

relation (4.6) follows from (4.4) and Lemma 4.2. To prove (4.7), write

with

By Lemma 4.1 and relation (4.4),

(4.8)

In view of this and Lemma 4.2, assertion (4.7) follows if we verify

(4.9)

By construction,

Hence, if we assume that is -times continuously differentiable with , we can write

and obtain the bound

By (4.8), Lemma 4.1 and (4.4),

(4.10)

The desired (4.9) follows from this and the uniform continuity of on . If the -th derivative is Lipschitz, one readily sees that (4.9) holds with replaced by . ∎

We conclude this section b