Estimating the error distribution function
in nonparametric regression
Technical report, 2004
AMS 2000 subject classification: Primary 62G05, 62G08, 62G20
Key words and phrases: Local polynomial smoother, kernel estimator,
under-smoothing, plug-in estimator, error variance,
empirical likelihood, adaptive estimator,
efficient estimator, influence function
Summary: We construct an efficient estimator for the error distribution function of the nonparametric regression model $Y = r(X) + \varepsilon$. Our estimator is a kernel smoothed empirical distribution function based on residuals from an under-smoothed local quadratic smoother for the regression function.
Consider the nonparametric regression model $Y = r(X) + \varepsilon$, where the covariate $X$ and the error $\varepsilon$ are independent, and $\varepsilon$ has mean zero, finite variance $\sigma^2$ and density $f$. We observe independent copies $(X_1, Y_1), \dots, (X_n, Y_n)$ of $(X, Y)$ and want to estimate the distribution function $F$ of $\varepsilon$. If the regression function $r$ were known, we could use the empirical distribution function $\mathbb{F}$ based on the errors $\varepsilon_j = Y_j - r(X_j)$, defined by

$$\mathbb{F}(t) = \frac{1}{n} \sum_{j=1}^n \mathbf{1}[\varepsilon_j \le t].$$
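As a concrete illustration of this basic object (our sketch, not part of the paper; the helper name `edf` is ours), the empirical distribution function of the errors is simply a normalized count:

```python
def edf(errors):
    """Empirical distribution function based on the errors:
    t -> (1/n) * #{j : errors[j] <= t}."""
    n = len(errors)
    return lambda t: sum(1 for e in errors if e <= t) / n
```

For a known regression function $r$, one would apply `edf` to the errors $Y_j - r(X_j)$; the point of the paper is that $r$ is unknown and must be replaced by a smoother.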
We consider the regression function as unknown and propose a kernel smoothed empirical distribution function $\hat{\mathbb{F}}$ based on residuals from an under-smoothed local quadratic smoother for the regression function. We give conditions under which $\hat{\mathbb{F}}$ is asymptotically equivalent to $\mathbb{F}$ plus some correction term:

$$\hat{\mathbb{F}}(t) = \mathbb{F}(t) + f(t)\, \frac{1}{n} \sum_{j=1}^n \varepsilon_j + o_P(n^{-1/2}) \quad \text{uniformly in } t. \tag{1.1}$$
Smoothing the empirical distribution function is appropriate because we assume that the error distribution has a Lipschitz density and therefore a smooth distribution function. A local quadratic smoother for the regression function is appropriate because we assume that the regression function is twice continuously differentiable.
It follows from (1.1) that $\hat{\mathbb{F}}(t)$ has influence function

$$\mathbf{1}[\varepsilon \le t] - F(t) + f(t)\, \varepsilon.$$
Müller, Schick and Wefelmeyer (2004a) show that this is the efficient influence function for estimators of $F(t)$. Hence $\hat{\mathbb{F}}$ is efficient for $F$ in the sense that $\hat{\mathbb{F}}(t)$ is a least dispersed regular estimator of $F(t)$ for all $t$ and all error distributions in the model. The influence function of our estimator coincides with the efficient influence function in the model with constant regression function; see Bickel, Klaassen, Ritov and Wellner (1998, Section 5.5, Example 1).
It follows in particular from (1.1) that $\hat{\mathbb{F}}(t)$ has asymptotic variance

$$F(t)\big(1 - F(t)\big) + 2 f(t)\, E\big[\varepsilon \mathbf{1}[\varepsilon \le t]\big] + \sigma^2 f(t)^2.$$

If $f$ is a normal density, then $E[\varepsilon \mathbf{1}[\varepsilon \le t]] = -\sigma^2 f(t)$, and this simplifies to

$$F(t)\big(1 - F(t)\big) - \sigma^2 f(t)^2.$$
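The normal-error case can be checked numerically. The following sketch (ours, not the paper's; standard normal errors, so $\sigma = 1$) verifies the Gaussian identity $E[\varepsilon \mathbf{1}[\varepsilon \le t]] = -\sigma^2 f(t)$ by crude numerical integration, and the resulting variance reduction:

```python
import math

def Phi(t):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def phi(t):
    """Standard normal density."""
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)

def truncated_mean(t, h=1e-4, lo=-10.0):
    """Crude left-Riemann approximation of E[eps * 1(eps <= t)]
    for standard normal eps, truncating the lower tail at lo."""
    s, x = 0.0, lo
    while x < t:
        s += x * phi(x) * h
        x += h
    return s

for t in (-1.0, 0.0, 0.7, 2.0):
    # Gaussian identity: E[eps * 1(eps <= t)] = -phi(t) (sigma = 1)
    assert abs(truncated_mean(t) + phi(t)) < 1e-3
    # hence the asymptotic variance F(t)(1 - F(t)) - phi(t)^2 of the
    # residual-based estimator is strictly below F(t)(1 - F(t))
    assert Phi(t) * (1.0 - Phi(t)) - phi(t) ** 2 < Phi(t) * (1.0 - Phi(t))
```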
Hence, for normal errors, the asymptotic variance of $\hat{\mathbb{F}}(t)$ is strictly smaller than the asymptotic variance $F(t)(1 - F(t))$ of the empirical estimator $\mathbb{F}(t)$ based on the true errors. This paradox is explained by the fact that the empirical estimator is not efficient: unlike $\hat{\mathbb{F}}$, it does not make use of the information that the errors have mean zero. The efficient influence function for estimators of $F(t)$ from mean zero observations $\varepsilon_1, \dots, \varepsilon_n$ is

$$\mathbf{1}[\varepsilon \le t] - F(t) - \frac{E\big[\varepsilon \mathbf{1}[\varepsilon \le t]\big]}{\sigma^2}\, \varepsilon;$$
see Levit (1975). Efficient estimators for $F(t)$ from observations $\varepsilon_1, \dots, \varepsilon_n$ are the correspondingly corrected empirical distribution function

$$\mathbb{F}(t) - \frac{\sum_{j=1}^n \varepsilon_j \mathbf{1}[\varepsilon_j \le t]}{\sum_{j=1}^n \varepsilon_j^2} \cdot \frac{1}{n} \sum_{j=1}^n \varepsilon_j$$
and the empirical likelihood estimator

$$\sum_{j=1}^n \hat p_j\, \mathbf{1}[\varepsilon_j \le t],$$

with (random) probabilities $\hat p_1, \dots, \hat p_n$ maximizing $\prod_{j=1}^n p_j$ subject to $\sum_{j=1}^n p_j \varepsilon_j = 0$. The empirical likelihood was introduced by Owen (1988), (1990); see also Owen (2001). The asymptotic variance of an efficient estimator for $F(t)$ from $\varepsilon_1, \dots, \varepsilon_n$ is

$$F(t)\big(1 - F(t)\big) - \frac{\big( E[\varepsilon \mathbf{1}[\varepsilon \le t]] \big)^2}{\sigma^2}.$$
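To make the empirical likelihood construction concrete, here is a sketch (ours, not the paper's; the helper names `el_weights` and `el_cdf` are hypothetical). It computes the probabilities maximizing $\prod_j p_j$ subject to $\sum_j p_j \varepsilon_j = 0$ via the usual Lagrange-multiplier representation $p_j = 1/(n(1 + \lambda \varepsilon_j))$, finding $\lambda$ by bisection:

```python
def el_weights(eps):
    """Empirical likelihood weights: maximize prod p_j subject to
    sum p_j = 1 and sum p_j * eps_j = 0.  The solution is
    p_j = 1 / (n * (1 + lam * eps_j)) with lam solving
    sum_j eps_j / (1 + lam * eps_j) = 0 (found here by bisection)."""
    n = len(eps)
    lo_e, hi_e = min(eps), max(eps)
    assert lo_e < 0.0 < hi_e, "zero must lie inside the data range"
    # lam must keep 1 + lam * eps_j > 0 for all j
    lo, hi = -1.0 / hi_e + 1e-10, -1.0 / lo_e - 1e-10
    def score(lam):  # strictly decreasing in lam on (lo, hi)
        return sum(e / (1.0 + lam * e) for e in eps)
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if score(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    lam = 0.5 * (lo + hi)
    return [1.0 / (n * (1.0 + lam * e)) for e in eps]

def el_cdf(eps, t):
    """Empirical likelihood estimator of F(t)."""
    p = el_weights(eps)
    return sum(pj for pj, e in zip(p, eps) if e <= t)
```

At the solution, the normalization $\sum_j p_j = 1$ holds automatically, since $\sum_j p_j (1 + \lambda \varepsilon_j) = 1$ and the side constraint kills the $\lambda$ term.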
The variance increase of our estimator $\hat{\mathbb{F}}(t)$ over the efficient estimators based on the true errors is therefore

$$\Big( \sigma f(t) + \frac{E\big[\varepsilon \mathbf{1}[\varepsilon \le t]\big]}{\sigma} \Big)^2.$$
This is the price for not knowing the regression function. For normal errors this term is zero, and we lose nothing. We refer also to the introduction of Müller, Schick and Wefelmeyer (2004b).
Our proof is complicated by two features of the model: the error distribution cannot be estimated adaptively with respect to the regression function, and the regression function cannot be estimated at the parametric rate $n^{-1/2}$. Akritas and Van Keilegom (2001) encountered these problems in a related model, the heteroscedastic regression model $Y = r(X) + s(X)\varepsilon$. They used different techniques and stronger assumptions to obtain an expansion similar to (1.1). Their results do not cover ours in our simpler model.
Previous related results are easier because at least one of these complicating features is missing. Loynes (1980) assumes a parametric model for the regression function. Koul (1969), (1970), (1987), (1992), Shorack (1984), Shorack and Wellner (1986, Section 4.6) and Bai (1996) consider linear models $Y = \vartheta^\top X + \varepsilon$. Mammen (1996) studies the linear model as the dimension of the covariate increases with the sample size $n$. Klaassen and Putter (1997), (2001) construct efficient estimators for the error distribution function in the linear regression model. Koshevnik (1996) treats the nonparametric regression model with error density symmetric about zero; an efficient estimator for $F$ is obtained by symmetrizing the empirical distribution function based on residuals. Related results exist for time series. See Boldin (1982), Koul (2002, Chapter 7) and Koul and Leventhal (1989) for linear autoregressive processes; Kreiss (1991) and Schick and Wefelmeyer (2002b) for invertible linear processes; and Koul (2002, Chapter 8), Schick and Wefelmeyer (2002a) and Müller, Schick and Wefelmeyer (2004c, Section 4) for nonlinear autoregressive processes. For invertible linear processes, Schick and Wefelmeyer (2004) show that the smoothed residual-based empirical estimator is asymptotically equivalent to the empirical estimator based on the true innovations. General considerations on empirical processes based on estimated observations are in Ghoudi and Rémillard (1998).
Our result gives efficient estimators for linear functionals $E[h(\varepsilon)] = \int h \, dF$ with bounded $h$. For smooth and $F$-square-integrable functions $h$, it is easier to prove an i.i.d. representation analogous to (1.1) directly; see Müller, Schick and Wefelmeyer (2004a), who also use an under-smoothed estimator for the regression function. Müller, Schick and Wefelmeyer (2004b) compare these results with estimation in the larger model in which one assumes only the conditional mean zero constraint $E[\varepsilon \mid X] = 0$ rather than independence of $X$ and $\varepsilon$. A particularly simple special case is the error variance $\sigma^2 = E[\varepsilon^2]$, with $h(x) = x^2$. For the estimator based on residuals from a kernel estimator of the regression function, under-smoothing is not needed. The asymptotic variance of this estimator was already obtained in Hall and Marron (1990). Müller, Schick and Wefelmeyer (2003) show that a covariate-matched U-statistic is efficient for $\sigma^2$; it does not require estimating the regression function but uses a kernel density estimator for the covariate density $g$. There is a large literature on simpler, inefficient, difference-based estimators for $\sigma^2$; reviews are Carter and Eagleson (1992) and Dette, Munk and Wagner (1998), (1999).
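As an illustration of the difference-based idea (our sketch, not from the paper; this is the first-order difference estimator usually attributed to Rice): after ordering the observations by the covariate, neighbouring responses differ essentially by the difference of two independent errors, so half the mean squared successive difference estimates $\sigma^2$.

```python
def rice_variance(x, y):
    """First-order difference-based estimate of the error variance:
    sort the observations by the covariate, then average the squared
    differences of neighbouring responses and divide by 2.  The
    regression function contributes only a small smoothness bias."""
    pairs = sorted(zip(x, y))          # order by covariate value
    ys = [b for _, b in pairs]
    n = len(ys)
    return sum((ys[i + 1] - ys[i]) ** 2 for i in range(n - 1)) / (2.0 * (n - 1))
```

This estimator is simple and does not require estimating $r$, but as the references above discuss, it is not efficient.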
We can write

$$F(t) = E\big[\mathbf{1}[Y - r(X) \le t]\big] = \int \mathbf{1}[y - r(x) \le t] \, Q(dx, dy),$$

where $Q$ is the distribution of $(X, Y)$. Our estimator is obtained by plugging in estimators for $Q$ and $r$. For $Q$ we use essentially the empirical distribution; for $r$ we use a local quadratic smoother that is under-smoothed and hence does not have the optimal rate for estimating $r$. This means that our estimator does not obey the plug-in principle of Bickel and Ritov (2000), (2003).
The paper is organized as follows. Section 2 introduces our estimator and states, in Theorem 2.7, the assumptions needed for expansion (1.1). Section 3 derives some consequences of exponential inequalities, and Section 4 contains properties of local polynomial smoothers. Section 5 gives the proof of Proposition 2.8.
2 The estimator and the main result
Let us now define our estimator. We begin by defining the residuals. This requires an estimator $\hat r$ of the regression function $r$. We take $\hat r$ to be a local quadratic smoother. To define it we need a kernel $w$ and a bandwidth $c_n$. A local quadratic smoother of $r$ is defined as $\hat r(x) = \hat\beta_0(x)$ for $x \in [0, 1]$, where $\hat\beta(x) = (\hat\beta_0(x), \hat\beta_1(x), \hat\beta_2(x))^\top$ is the minimizer of

$$\sum_{j=1}^n \Big( Y_j - \beta_0 - \beta_1 \frac{X_j - x}{c_n} - \beta_2 \Big( \frac{X_j - x}{c_n} \Big)^2 \Big)^2 w\Big( \frac{X_j - x}{c_n} \Big).$$
The residuals of the regression estimator are $\hat\varepsilon_j = Y_j - \hat r(X_j)$, $j = 1, \dots, n$.
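A minimal implementation sketch of the local quadratic smoother and the resulting residuals (ours, not the paper's; we use the Epanechnikov kernel for concreteness, and a tiny 3×3 linear solver to avoid external libraries):

```python
def epanechnikov(u):
    """Epanechnikov kernel, a symmetric density on [-1, 1]."""
    return 0.75 * (1.0 - u * u) if abs(u) < 1.0 else 0.0

def solve3(M, a):
    """Solve a 3x3 linear system by Gaussian elimination with
    partial pivoting (kept tiny to stay dependency-free)."""
    M = [row[:] + [ai] for row, ai in zip(M, a)]
    for c in range(3):
        piv = max(range(c, 3), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, 3):
            f = M[r][c] / M[c][c]
            for k in range(c, 4):
                M[r][k] -= f * M[c][k]
    beta = [0.0] * 3
    for r in (2, 1, 0):
        beta[r] = (M[r][3] - sum(M[r][k] * beta[k] for k in range(r + 1, 3))) / M[r][r]
    return beta

def local_quadratic(x0, xs, ys, bw, w=epanechnikov):
    """Weighted least squares fit of a quadratic in z = (x - x0)/bw;
    the fitted intercept estimates r(x0)."""
    M = [[0.0] * 3 for _ in range(3)]
    a = [0.0] * 3
    for xj, yj in zip(xs, ys):
        z = (xj - x0) / bw
        wj = w(z)
        if wj == 0.0:
            continue
        pows = [1.0, z, z * z, z ** 3, z ** 4]
        for m in range(3):
            a[m] += wj * yj * pows[m]
            for l in range(3):
                M[m][l] += wj * pows[m + l]
    return solve3(M, a)[0]

def residuals(xs, ys, bw):
    """Residuals from the local quadratic smoother."""
    return [yj - local_quadratic(xj, xs, ys, bw) for xj, yj in zip(xs, ys)]
```

Because a quadratic function is fitted exactly by weighted least squares, the smoother reproduces quadratic regression functions without error, which is a convenient correctness check.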
Let $\tilde{\mathbb{F}}$ denote the empirical distribution function based on these residuals:

$$\tilde{\mathbb{F}}(t) = \frac{1}{n} \sum_{j=1}^n \mathbf{1}[\hat\varepsilon_j \le t].$$
Our estimator of the error distribution function will be a smoothed version of $\tilde{\mathbb{F}}$. To this end, let $k$ be a density and $b_n$ another bandwidth. Then we define our estimator $\hat{\mathbb{F}}$ of $F$ by

$$\hat{\mathbb{F}}(t) = \int \tilde{\mathbb{F}}(t - b_n u)\, k(u)\, du.$$

With $K$ the distribution function of $k$, we can write

$$\hat{\mathbb{F}}(t) = \frac{1}{n} \sum_{j=1}^n K\Big( \frac{t - \hat\varepsilon_j}{b_n} \Big).$$

This shows that $\hat{\mathbb{F}}$ is the convolution of the empirical distribution function of the residuals with the distribution function $K(\cdot / b_n)$. Alternatively, $\hat{\mathbb{F}}$ is the distribution function with density $\hat f$ given by

$$\hat f(t) = \frac{1}{n b_n} \sum_{j=1}^n k\Big( \frac{t - \hat\varepsilon_j}{b_n} \Big).$$
This is the usual kernel density estimator of $f$ based on the residuals, with kernel $k$ and bandwidth $b_n$. We make the following assumptions.
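Given the residuals, the smoothed empirical distribution function is cheap to evaluate. A sketch (ours, not the paper's; Epanechnikov kernel for concreteness, whose distribution function has a closed form):

```python
def epan_cdf(u):
    """Distribution function of the Epanechnikov density
    k(u) = 0.75 * (1 - u^2) on [-1, 1]."""
    if u <= -1.0:
        return 0.0
    if u >= 1.0:
        return 1.0
    return 0.5 + 0.75 * (u - u ** 3 / 3.0)

def smoothed_edf(res, b):
    """Smoothed empirical distribution function of the residuals:
    the average of kernel distribution functions centred at the
    residuals and rescaled by the bandwidth b."""
    n = len(res)
    return lambda t: sum(epan_cdf((t - e) / b) for e in res) / n
```

The returned function is a genuine distribution function: it is nondecreasing, equals 0 far to the left of the residuals, and equals 1 far to the right.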
The covariate density $g$ is bounded and bounded away from zero on $[0, 1]$, and its restriction to $[0, 1]$ is (uniformly) continuous.
The regression function $r$ is twice continuously differentiable.
The error density $f$ is Lipschitz, has mean zero, and satisfies the moment condition $E[|\varepsilon|^s] < \infty$ for some $s$.
The density $k$ is symmetric, twice continuously differentiable, and has compact support.
The kernel $w$ used to define the local quadratic smoother is a symmetric density which has compact support and a bounded derivative.
The bandwidths $c_n$ and $b_n$ satisfy suitable rate conditions.
Note that $c_n$ is smaller than the optimal bandwidth for estimating the regression function under Assumptions 2.1 and 2.2. Such a bandwidth would be proportional to $n^{-1/5}$. This means that our choice of bandwidth results in an under-smoothed local quadratic smoother.
We are now ready to state our main result.
The proof of Proposition 2.8 is in Section 5. We conclude this section with a simple lemma that will be needed repeatedly in the sequel.
Suppose that $E[|\varepsilon|^s] < \infty$ for some $s \ge 1$. Then

$$P(|\varepsilon| > t) = o(t^{-s}), \qquad t \to \infty.$$

If $\varepsilon$ also has mean zero, then, as $t \to \infty$,

$$E\big[\varepsilon \mathbf{1}[|\varepsilon| \le t]\big] = o(t^{1-s}).$$

The first conclusion follows by the sharper version of the Markov inequality: For $t > 0$,

$$P(|\varepsilon| > t) \le t^{-s} E\big[|\varepsilon|^s \mathbf{1}[|\varepsilon| > t]\big],$$

and the expectation on the right tends to zero. The second conclusion follows from

$$\big| E[\varepsilon \mathbf{1}[|\varepsilon| \le t]] \big| = \big| E[\varepsilon \mathbf{1}[|\varepsilon| > t]] \big| \le t^{1-s} E\big[|\varepsilon|^s \mathbf{1}[|\varepsilon| > t]\big].$$

In the first equality, we have used that $\varepsilon$ has mean zero. ∎
3 Auxiliary results
In this section we derive some results that will be used in the proof of Proposition 2.8. Let $(\Omega, \mathcal{F}, P)$ be a probability space. For each positive integer $n$ let $U_{n1}, \dots, U_{nn}$ be independent $S$-valued random variables with distribution $Q_n$, and for each $t$ in an index set $T$, let $u_{n,t}$ be a bounded measurable function from $S$ into $\mathbb{R}$. We first study the process $W_n$ defined by

$$W_n(t) = \frac{1}{n} \sum_{j=1}^n \Big( u_{n,t}(U_{nj}) - E\big[u_{n,t}(U_{nj})\big] \Big), \qquad t \in T.$$
Let be a sequence of positive numbers such that for some . Suppose that
and, for positive numbers and ,
If we strengthen (3.1) to
To prove the lemma we use an inequality of Hoeffding (1963): If $Z_1, \dots, Z_n$ are independent random variables that have mean zero and variance $\sigma^2$ and are bounded by $b$, then for $t > 0$,

$$P\Big( \Big| \sum_{j=1}^n Z_j \Big| \ge t \Big) \le 2 \exp\Big( - \frac{t^2}{2 n \sigma^2 + \frac{2}{3} b t} \Big).$$
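This Bernstein-type bound can be sanity-checked by simulation. The following sketch (ours, not the paper's) compares the simulated two-sided tail probability of a sum of bounded, mean-zero uniform variables with the bound:

```python
import math
import random

def bernstein_bound(n, sigma2, b, t):
    """Two-sided Bernstein-type bound from Hoeffding (1963) for a sum
    of n independent mean-zero variables with variance sigma2 that are
    bounded in absolute value by b."""
    return 2.0 * math.exp(-t * t / (2.0 * n * sigma2 + 2.0 * b * t / 3.0))

random.seed(1)
n, b = 200, 1.0
sigma2 = 1.0 / 3.0            # variance of Uniform(-1, 1)
reps, t = 2000, 25.0
exceed = 0
for _ in range(reps):
    s = sum(random.uniform(-1.0, 1.0) for _ in range(n))
    if abs(s) >= t:
        exceed += 1
# the simulated tail frequency should not exceed the bound
assert exceed / reps <= bernstein_bound(n, sigma2, b, t)
```

Here the bound evaluates to about 0.03, while the simulated frequency is an order of magnitude smaller; the inequality is conservative but of the right exponential shape.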
Applying this inequality with , we obtain for :
Thus there is a positive number such that for all ,
Now let for , with an integer greater than . The above yields for large enough ,
Now, using (3.2),
This is the desired result (3.3). The second conclusion is an immediate consequence. ∎
Next we consider the degenerate U-process

$$V_n(t) = \frac{1}{n(n-1)} \sum_{i=1}^n \sum_{j \ne i} v_{n,t}(U_{ni}, U_{nj}),$$

with $t \in T$, and $v_{n,t}$ a bounded measurable function from $S \times S$ into $\mathbb{R}$ such that, for all $u$ in $S$,

$$E\big[v_{n,t}(U_{n1}, u)\big] = E\big[v_{n,t}(u, U_{n1})\big] = 0.$$
Let be positive numbers such that for some . Suppose that
and, for some positive and ,
If we strengthen (3.6) to
We use a similar argument as for Lemma 3.1, but rely now on the Arcones–Giné exponential inequality for degenerate U-processes (inequality (c) in Proposition 2.3 of Arcones and Giné, 1994). This inequality states that there are constants and depending only on such that, for every , all and all ,
From this inequality one obtains as in the proof of Lemma 3.1 that there is a positive number such that
Now proceed as in the proof of Lemma 3.1. ∎
4 Properties of local polynomial smoothers
For an introduction to local polynomial smoothers we refer to Fan and Gijbels (1996). In this section we derive some properties of local polynomial smoothers of order $p$, defined by $\hat r(x) = \hat\beta_0(x)$ for $x \in [0, 1]$, where $\hat\beta(x) = (\hat\beta_0(x), \dots, \hat\beta_p(x))^\top$ is the minimizer of

$$\sum_{j=1}^n \Big( Y_j - \sum_{m=0}^p \beta_m \Big( \frac{X_j - x}{c_n} \Big)^m \Big)^2 w\Big( \frac{X_j - x}{c_n} \Big).$$

Here we have re-scaled by the bandwidth for convenience. The normal equations are

$$M_n(x)\, \hat\beta(x) = A_n(x),$$

where the vector $A_n(x)$ has entries

$$A_{n,m}(x) = \frac{1}{n c_n} \sum_{j=1}^n Y_j \Big( \frac{X_j - x}{c_n} \Big)^m w\Big( \frac{X_j - x}{c_n} \Big), \qquad m = 0, \dots, p,$$

and the matrix $M_n(x)$ has entries $M_{n,m,l}(x) = \psi_{n,m+l}(x)$, $m, l = 0, \dots, p$, with

$$\psi_{n,k}(x) = \frac{1}{n c_n} \sum_{j=1}^n \Big( \frac{X_j - x}{c_n} \Big)^k w\Big( \frac{X_j - x}{c_n} \Big), \qquad k = 0, \dots, 2p.$$
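The quantities entering the normal equations are simple weighted moment sums. Here is a sketch (ours, not the paper's) that assembles the matrix and vector of the normal equations for a local polynomial smoother of order $p$, with the moment sums normalised by $1/(n c_n)$:

```python
def epanechnikov(u):
    """Epanechnikov kernel, a symmetric density on [-1, 1]."""
    return 0.75 * (1.0 - u * u) if abs(u) < 1.0 else 0.0

def normal_equations(x0, xs, ys, bw, p, w=epanechnikov):
    """Assemble M and A of the normal equations M beta = A for a
    local polynomial smoother of order p at x0, in the rescaled
    variable z = (x - x0)/bw, with moment sums normalised by
    1/(n*bw).  M[m][l] equals the (m+l)-th weighted moment psi."""
    n = len(xs)
    psi = [0.0] * (2 * p + 1)
    A = [0.0] * (p + 1)
    for xj, yj in zip(xs, ys):
        z = (xj - x0) / bw
        wj = w(z)
        if wj == 0.0:
            continue
        zk = 1.0
        for k in range(2 * p + 1):
            psi[k] += zk * wj / (n * bw)
            if k <= p:
                A[k] += yj * zk * wj / (n * bw)
            zk *= z
    M = [[psi[m + l] for l in range(p + 1)] for m in range(p + 1)]
    return M, A
```

With this normalisation, for a dense uniform design on $[0, 1]$ the entry $M[0][0]$ approximates $g(x_0) \int w(u)\, du = 1$ at interior points, which is a useful sanity check.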
By the properties of the kernel $w$ and the covariate density $g$ we have, for $k = 0, \dots, 2p$ and all $x$,
Write for the first column of the inverse of , and
From the normal equations we obtain
For the expectation of we write
We define correspondingly, replacing by . Furthermore, and are defined as and , with replaced by .
For a unit vector and we have
Thus, by Assumption 2.1,
By Assumption 2.5, there is an such that the eigenvalues of are in the interval for all and . Thus is invertible, and
where . Here we used the fact that
Hence, if we assume that is -times continuously differentiable with , we can write
and obtain the bound
We conclude this section b