Nonparametric regression with nuisance components

# A theory of nonparametric regression in the presence of complex nuisance components

Martin Wahl Institut für Angewandte Mathematik, Universität Heidelberg, Im Neuenheimer Feld 294, 69120 Heidelberg, Germany
###### Abstract.

In this paper, we consider the nonparametric random regression model $Y=f_1(X_1)+f_2(X_2)+\epsilon$ and address the problem of estimating the function $f_1$. The term $f_2(X_2)$ is regarded as a nuisance term which can be considerably more complex than $f_1(X_1)$. Under minimal assumptions, we prove several nonasymptotic $L^2$-risk bounds for our estimators of $f_1$. Our approach is geometric and based on considerations in Hilbert spaces. It shows that the performance of our estimators is closely related to geometric quantities, such as minimal angles and Hilbert-Schmidt norms. Our results establish new conditions under which the estimators of $f_1$ have up to first order the same sharp upper bound as the corresponding estimators of $f_1$ in the model $Y=f_1(X_1)+\epsilon$. As an example, we apply the results to an additive model in which the number of components is very large or in which the nuisance components are considerably less smooth than $f_1$. In particular, the results apply to an asymptotic scenario in which the number of components is allowed to increase with the sample size.

###### Key words and phrases:
Nonparametric regression, nuisance components, projection on sumspaces, minimax estimation, additive model, increasing number of components
###### 2010 Mathematics Subject Classification:
62G08, 62G20, 62H05, 62H20

## 1. Introduction

In this paper, we consider the nonparametric random regression model

$$Y = f_1(X_1) + f_2(X_2) + \epsilon. \tag{1.1}$$

We study the problem of estimating the function $f_1$, while the function $f_2$ is regarded as a nuisance parameter. We are interested in settings where the second term $f_2(X_2)$ is much more complex than the first term $f_1(X_1)$. A particular model of interest is the additive model

$$Y = f_1(X_1) + \sum_{j=1}^{q-1} f_{2j}(X_{2j}) + \epsilon \tag{1.2}$$

in which the nuisance components $f_{2j}$ are considerably less smooth than $f_1$ or in which the number of components $q$ is very large, for instance in the sense that $q$ is allowed to increase with the sample size $n$. The estimation problem is similar to the one arising in semiparametric models where the aim is to estimate a finite-dimensional parameter in the presence of a (more complex) infinite-dimensional parameter.

Estimation in nonparametric additive models is a well-studied topic, especially when considering the problem of estimating all components in the case that $q$ is fixed. One of the seminal theoretical papers is by Stone [30], who showed that each component can be estimated with the rate of convergence corresponding to the situation in which the other components are known. Since then, many estimation procedures have been proposed, many of them consisting of several steps. In the work by Linton [19] and Fan, Härdle, and Mammen [11], it is shown that there exist estimators of single components which have the same asymptotic bias and variance as the corresponding oracle estimators for which the other components are known.

Probably the most popular estimation procedures are the backfitting procedures, which are empirical versions of the orthogonal projection onto the subspace of additive functions in a Hilbert space setting (see, e.g., the book by Hastie and Tibshirani [12] and the references therein). This orthogonal projection was studied, e.g., by Breiman and Friedman [6] (see also the book by Bickel, Klaassen, Ritov, and Wellner [3, Appendix A.4]). They showed that, under certain conditions including compactness of certain conditional expectation operators, it can be computed by an iterative procedure using only bivariate conditional expectation operators. Replacing these conditional expectation operators by empirical versions leads to the backfitting procedures. Opsomer and Ruppert [24] and Opsomer [23] computed the asymptotic bias and variance of estimators based on the backfitting procedure in the case where the conditional expectation operators are estimated using local polynomial regression. Mammen, Linton, and Nielsen [20] introduced the smooth backfitting procedure and showed that their estimators of single components achieve the same asymptotic bias and variance as oracle estimators for which the other components are known. Concerning the distribution of the covariates, they make some high-level assumptions which are satisfied under some boundedness conditions on the one- and two-dimensional densities. This is still more than is required in the Hilbert space setting (see [6]). In the work by Horowitz, Klemelä, and Mammen [13], a general two-step procedure was proposed in which a preliminary undersmoothed estimator is based on the smooth backfitting procedure of [20]. They also showed that there are estimators which are asymptotically efficient (i.e., achieve the asymptotic minimax risk) with the same constant as in the case with only one component. In addition to the assumptions coming from the results in [20], they require a Lipschitz condition for all components.

The problem of estimating in cases in which is more complex than is also considered in the work by Efromovich [10] and Muro and van de Geer [22]. In [10], an estimator of is constructed which is both adaptive to the unknown smoothness and asymptotically efficient with the same constant as in the case with only one component. The assumptions include smoothness and boundedness conditions on the full-dimensional density of . The construction of the estimator is involved and starts with a blockwise-shrinkage oracle estimator. In [22], a penalized least squares estimator is analyzed in cases where the function is smoother than the function . Under certain assumptions including smoothness conditions on the design densities, it is shown that for both components, the estimator attains the rate of convergence corresponding to the situation in which the other component is known; i.e., no undersmoothing of the function is needed to estimate the function .

The previously discussed literature on additive models focuses on the asymptotic behavior of estimators as the number of observations $n$ goes to infinity in the case that $q$ is fixed. Note that one of our purposes is to generalize several results to the case that $q$ increases with $n$.

Recently, high-dimensional sparse additive models have been studied, e.g., in the work by Meier, van de Geer, and Bühlmann [21], Huang, Horowitz, and Wei [14], Koltchinskii and Yuan [16], Raskutti, Wainwright, and Yu [25], Suzuki and Sugiyama [31], and Dalalyan, Ingster, and Tsybakov [7]. These papers consider the case that the number of covariates is much larger than the sample size . The focus is on the problem of estimating all components under sparsity constraints. In [7], e.g., the authors construct an estimator achieving optimal minimax rates of convergence. These rates of convergence depend on and also on the smallest degree of smoothness of the . Hence, they may only lead to crude bounds for the rates of convergence of estimators of . Let us mention that in this paper, we do not consider a sparsity scenario. We are interested in cases in which the number of components is very large, but smaller than .

In this paper, we consider model (1.1) in the case that the functions and belong to closed subspaces and of and , respectively. We propose an estimator of which is based on the composition of two least squares criteria. Our main contribution is to derive several nonasymptotic risk bounds which show that the performance of our estimators is closely related to geometric quantities of and , such as minimal angles and Hilbert-Schmidt norms. These risk bounds lead to minimal conditions under which the function can be estimated (up to first order) just as well as in the model . Our analysis is based on geometric considerations in Hilbert spaces, and relies on the theory of projections on sumspaces in Hilbert spaces (see, e.g., [3, Appendix A.4]). Moreover, we apply recent concentration inequalities for structured random matrices (see, e.g., the work by Rauhut [26]) in order to show that several geometric properties in the Hilbert space setting carry over to the finite sample setting with high probability. As a main example we apply our results to the additive model (1.2) which corresponds to the case that has an additive structure. Using our results, we establish new conditions on and on the smoothness of the nuisance components under which our estimator of attains the same (nonasymptotic) optimal rate of convergence as the corresponding least squares estimator in the model . We also address the question of when the corresponding constants coincide.

The paper is organized as follows. In Sections 2 and 3, we present the assumptions on the model and state our main results in Theorems 1-4. In Section 4, we apply our results to several models including the additive model. The proofs of our results are given in Sections 5 and 6. Finally, some complements are given in the Appendix.

## 2. The framework

### 2.1. The model

Let be a triple of random variables satisfying (1.1), where and take values in some measurable spaces and , respectively, is a real valued random variable such that and , and the unknown regression functions satisfy the following assumption:

###### Assumption 1.

Suppose that , where

$$H_1 \subseteq \{ g_1 \in L^2(P^{X_1}) : E[g_1(X_1)] = 0 \}$$

is a closed subspace, and that , where is a closed subspace.

Structural assumptions on and (see, e.g., Section 4 where we also consider the additive model) should be incorporated into the model by making assumptions on and . From the above, we have that is a random variable taking values in (note that in Section 4.3, we consider the example , , and , where all spaces are equipped with the Borel -algebra). Moreover, we have that the spaces and are (in a canonical way) subspaces of , which implies that and are also closed subspaces of . Finally, we denote by the whole regression function given by . We assume that we observe independent copies

$$(Y^1, X^1), \dots, (Y^n, X^n)$$

of , where , . Based on this sample, we consider the problem of estimating the function .

### 2.2. The main assumption

Our approach relies strongly on the fact that the space is a Hilbert space with the inner product and the corresponding norm (see, e.g., [9, Theorem 5.2.1]). In order to state our main assumption, we give the following general definition of a minimal angle in Hilbert spaces (see [15, Definition 1] and the references therein).

###### Definition 1.

Let and be two closed subspaces of a Hilbert space with inner product and norm . The minimal angle between and is the number whose cosine is given by

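$$\rho_0 = \rho_0(H_1, H_2) = \sup\left\{ \frac{\langle h_1, h_2 \rangle}{\|h_1\| \|h_2\|} \;\middle|\; 0 \neq h_1 \in H_1,\ 0 \neq h_2 \in H_2 \right\}.$$
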
###### Assumption 2.

Suppose that the cosine of the minimal angle between and is strictly less than 1, i.e.,

$$\rho_0(H_1, H_2) < 1.$$

The next lemma states two equivalent formulations of Assumption 2. Since we will also apply it to the finite sample setting in later sections, we again give a general statement.

###### Lemma 1.

Let and be two closed subspaces of a Hilbert space with inner product and norm . Let be a constant. Then the following assertions are equivalent:

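1. For all we have

 $$\frac{|\langle h_1, h_2 \rangle|}{\|h_1\| \|h_2\|} \le \varrho.$$

2. For all we have

 $$\|h_1 + h_2\|^2 \ge (1 - \varrho)\left(\|h_1\|^2 + \|h_2\|^2\right).$$

3. For all we have

 $$\|h_1 + h_2\|^2 \ge (1 - \varrho^2)\|h_1\|^2.$$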

A proof of Lemma 1 is given in Appendix A.
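
To make Lemma 1 concrete, the following minimal sketch checks the three assertions numerically in a toy setting. The subspaces of $\mathbb{R}^3$ used here (a line $H_1$ and a plane $H_2$) are illustrative assumptions and do not appear in the paper; the sketch only samples random elements and is not a proof.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

# Toy subspaces of R^3 (illustrative assumption, not from the paper):
# H1 = span{u} and H2 = span{e1, e2}, with e1, e2 orthonormal.
u = [1.0, 1.0, 1.0]
e1 = [1.0, 0.0, 0.0]
e2 = [0.0, 1.0, 0.0]

# Since H1 is a line, rho_0 is the norm of the projection of u/|u| onto H2.
rho0 = math.sqrt(dot(u, e1) ** 2 + dot(u, e2) ** 2) / norm(u)

rng = random.Random(0)
largest_cosine = 0.0
assertion_ii = True
assertion_iii = True
for _ in range(10_000):
    a, b, c = (rng.uniform(-1.0, 1.0) for _ in range(3))
    h1 = [a * x for x in u]
    h2 = [b * x + c * y for x, y in zip(e1, e2)]
    n1, n2 = norm(h1), norm(h2)
    if n1 == 0.0 or n2 == 0.0:
        continue
    total = [x + y for x, y in zip(h1, h2)]
    sq_norm_sum = dot(total, total)
    # Assertion 1: the cosine between h1 and h2 never exceeds rho_0.
    largest_cosine = max(largest_cosine, abs(dot(h1, h2)) / (n1 * n2))
    # Assertions 2 and 3 with the constant taken equal to rho_0.
    assertion_ii = assertion_ii and sq_norm_sum >= (1 - rho0) * (n1**2 + n2**2) - 1e-12
    assertion_iii = assertion_iii and sq_norm_sum >= (1 - rho0**2) * n1**2 - 1e-12

print(rho0, largest_cosine, assertion_ii, assertion_iii)
```

In this toy case $\rho_0 = \sqrt{2/3}$, so the minimal angle is strictly positive and Assumption 2 would hold for these two subspaces.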

### 2.3. The estimation procedure

Let and be - and -dimensional linear subspaces, respectively, and let be a linear subspace. Let and . By Assumption 2, we have , which implies that is equal to the dimension of and that each can be decomposed uniquely as with and . We will make only one assumption on which relates the -norm with the -norm, and which will be needed to apply concentration of measure inequalities (compare to, e.g., [5, Section 3.1.1] and [2, Section 1.1]).

###### Assumption 3.

Suppose that there is a real number such that

$$\|g\|_\infty \le \varphi \sqrt{d}\, \|g\| \tag{2.1}$$

for all .

###### Remark 1.

In view of Assumption 2, Equation (2.1) is satisfied if there are real numbers such that for all , . Indeed, applying the Cauchy-Schwarz inequality and Lemma 1, we have

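$$\|g_1 + g_2\|_\infty \le \varphi_1 \sqrt{d_1}\, \|g_1\| + \varphi_2 \sqrt{d_2}\, \|g_2\| \le \frac{\varphi_1 \vee \varphi_2}{\sqrt{1 - \rho_0}} \sqrt{d_1 + d_2}\, \|g_1 + g_2\|.$$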

The construction of our estimator is based on two least squares criteria. First, let be the least squares estimator on the model which is given (not uniquely) by

$$\hat f_V = \operatorname*{arg\,min}_{g \in V} \frac{1}{n} \sum_{i=1}^{n} \left( Y^i - g(X^i) \right)^2. \tag{2.2}$$

By the definition of , we have with and . Next, by applying a second least squares criterion, we define the estimator by

$$\hat f_1 = \operatorname*{arg\,min}_{g_1 \in W_1} \frac{1}{n} \sum_{i=1}^{n} \left( (\hat f_V)_1(X_1^i) - g_1(X_1^i) \right)^2. \tag{2.3}$$

We will also consider the special case , in which we have . This means that the second least squares criterion can be dropped. However, we will see that choosing as a preliminary space of larger dimension leads to a smaller bias (it lowers the dependence on ). Finally, since we want to establish risk bounds, it is convenient to eliminate very large values. Therefore, we define our final estimator by

$$\hat f_1^* = \hat f_1 \ \text{ if } \|\hat f_1\|_\infty \le k_n \quad \text{and} \quad \hat f_1^* \equiv 0 \ \text{ otherwise}, \tag{2.4}$$

where is a real number to be chosen later (compare to the work by Baraud [2, Eq. (3)]). Finally, note that the estimator is not feasible since the distribution of is not known and therefore the condition cannot be checked. However, one can replace it by the condition . In Appendix B, we show how our results carry over to these modified estimators.
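
The two least squares criteria (2.2) and (2.3) can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the trigonometric bases for $V_1$, $V_2$, and $W_1 \subseteq V_1$, the simulated design, the regression functions, and the helper names (`solve`, `least_squares`, `trig`, `two_step_estimator`) are all hypothetical choices made for the example. The truncation step (2.4) is omitted.

```python
import math
import random

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def least_squares(design, y):
    """Coefficients minimising sum_i (y_i - <design_i, theta>)^2 (normal equations)."""
    p = len(design[0])
    gram = [[sum(row[j] * row[k] for row in design) for k in range(p)] for j in range(p)]
    rhs = [sum(row[j] * yi for row, yi in zip(design, y)) for j in range(p)]
    return solve(gram, rhs)

def trig(x, k):
    """Centred trigonometric functions cos(2 pi j x), sin(2 pi j x), j = 1..k."""
    out = []
    for j in range(1, k + 1):
        out += [math.cos(2 * math.pi * j * x), math.sin(2 * math.pi * j * x)]
    return out

def two_step_estimator(x1, x2, y, k_v1=2, k_v2=1, k_w1=1):
    # Step 1, cf. (2.2): least squares on V = V1 + V2 (V2 contains constants).
    design_v = [trig(a, k_v1) + [1.0] + trig(b, k_v2) for a, b in zip(x1, x2)]
    theta = least_squares(design_v, y)
    theta_v1 = theta[: 2 * k_v1]  # coefficients of the V1-part of the fit
    # Step 2, cf. (2.3): least squares fit of the V1-part on the smaller space W1.
    fitted_v1 = [sum(t * c for t, c in zip(trig(a, k_v1), theta_v1)) for a in x1]
    design_w1 = [trig(a, k_w1) for a in x1]
    return least_squares(design_w1, fitted_v1)

# Simulated sample (hypothetical design): X2 depends on X1.
rng = random.Random(2)
n = 200
x1 = [rng.random() for _ in range(n)]
x2 = [(a + 0.5 * rng.random()) % 1.0 for a in x1]
sigma = 0.0  # noise level; set > 0 for a noisy sample
y = [1.5 * math.cos(2 * math.pi * a) + 0.5 + math.sin(2 * math.pi * b)
     + sigma * rng.gauss(0.0, 1.0) for a, b in zip(x1, x2)]

coef_w1 = two_step_estimator(x1, x2, y)
print(coef_w1)  # coefficients of the estimate of f1 in the basis of W1
```

With `sigma = 0.0` and both regression functions lying in the chosen spaces, the sketch recovers the coefficients of $f_1$ up to numerical error, which serves as a basic sanity check of the two-step construction.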

In our analysis of , one important step is to carry over the geometric properties valid in the Hilbert space setting to the finite sample setting. For this, the following event

$$\mathcal{E}_\delta = \left\{ (1 - \delta)\|g\|^2 \le \|g\|_n^2 \le (1 + \delta)\|g\|^2 \ \text{ for all } g \in V \right\},$$

, will play the key role. Here, denotes the empirical norm (see, e.g., Section 5.1). A first observation is that, under Assumptions 1 and 2, the estimator is unique on the event . This can be seen as follows. If holds, then and are equivalent norms on , which in turn implies that each is uniquely determined by . Hence, the solutions of the least squares criteria in (2.2) and (2.3) are unique (since the solutions are unique when restricted to vectors in evaluated at the observations). Moreover, by Assumption 2, the decomposition is unique.

In addition, we also obtain a simple representation of our estimator. Let be the orthogonal projection from to the subspace , and let be defined analogously. If holds, then we have

$$\left( \hat f_1(X_1^1), \dots, \hat f_1(X_1^n) \right)^T = \hat\Pi_{W_1} (\hat\Pi_V Y)_1,$$

where is the unique decomposition of the least squares estimator on the model , considered as a vector in , with .
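
The norm-equivalence event displayed in this section can be checked numerically in a small example. The sketch below assumes that the empirical norm is the usual $\|g\|_n^2 = \frac{1}{n}\sum_{i=1}^n g(X^i)^2$ (the paper defers its definition to Section 5.1) and uses a hypothetical two-dimensional space $V$ spanned by $\sqrt{2}\cos(2\pi x)$ and $\sqrt{2}\sin(2\pi x)$ with a uniform design on $[0,1]$. Because this basis is orthonormal in the population norm, the event holds exactly when all eigenvalues of the empirical Gram matrix lie in $[1-\delta, 1+\delta]$.

```python
import math
import random

def empirical_gram(xs):
    """Entries of the empirical Gram matrix of (phi1, phi2) at the points xs."""
    n = len(xs)
    g11 = g12 = g22 = 0.0
    for x in xs:
        p1 = math.sqrt(2) * math.cos(2 * math.pi * x)
        p2 = math.sqrt(2) * math.sin(2 * math.pi * x)
        g11 += p1 * p1 / n
        g12 += p1 * p2 / n
        g22 += p2 * p2 / n
    return g11, g12, g22

def gram_eigenvalues(g11, g12, g22):
    """Eigenvalues of the symmetric 2x2 matrix [[g11, g12], [g12, g22]]."""
    mean = 0.5 * (g11 + g22)
    radius = math.sqrt((0.5 * (g11 - g22)) ** 2 + g12 ** 2)
    return mean - radius, mean + radius

def event_holds(xs, delta):
    """True if (1-delta)|g|^2 <= |g|_n^2 <= (1+delta)|g|^2 for all g in V."""
    lam_lo, lam_hi = gram_eigenvalues(*empirical_gram(xs))
    return 1 - delta <= lam_lo and lam_hi <= 1 + delta

rng = random.Random(1)
xs = [rng.random() for _ in range(5000)]  # uniform design on [0, 1]
lam_min, lam_max = gram_eigenvalues(*empirical_gram(xs))
print(lam_min, lam_max, event_holds(xs, 0.1))

# A degenerate design (all observations at 0) makes the event fail:
# sin(2 pi x) vanishes at every observation, so the norms are not equivalent.
print(event_holds([0.0] * 10, 0.5))
```

The degenerate design illustrates why the event matters for uniqueness: when the empirical norm vanishes on a nonzero element of $V$, the least squares solution is no longer determined by its values at the observations.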

## 3. Main results

### 3.1. A first risk bound

In this section, we present a first nonasymptotic risk bound in the case , which will be further improved (under additional assumptions) in later sections. We denote by (resp. , , and ) the orthogonal projection from to the subspace (resp. , , and ).

###### Theorem 1.

Let Assumptions 1, 2, and 3 be satisfied. Let be a real number. Let . Then

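$$E\left[ \|f_1 - \hat f_1^*\|^2 \right] \le \frac{1+\delta}{(1-\delta)^3} \frac{1}{1-\rho_0^2} \left( \left( 1 + \frac{\varphi^2 d}{n} \right) \|f - \Pi_V f\|^2 + \sigma^2 \frac{\dim V_1}{n} \right) + R_n$$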

with

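$$R_n = \frac{2(1+\delta)\varphi^2 d\, \|f_1\|^2 \left( \|f - \Pi_{V_2} f\|^2 + \sigma^2 \right)}{(1-\delta)^2 (1-\rho_0^2) k_n^2} + 2\left( \|f_1\| + k_n \right)^2 d \exp\left( -\kappa \frac{\delta^2 n}{\varphi^2 d} \right),$$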

where is the universal constant in Theorem 7.

Before we discuss the two main terms, let us give conditions under which the remainder term is small. Suppose that for some real number , we have

$$\varphi^2 d \le \frac{c \delta^2 n}{\log n},$$

and let (this is a theoretical choice of leading to a simple upper bound for ; many other choices are possible, too). Then one can show that

 Rn≤12c(1+δ)δ2(1−δ)2(1−ρ20)(∥f1∥2+∥f2−ΠV2f2∥2+σ2)n−κ2c+1.

Letting, e.g., and , we obtain the following corollary of Theorem 1.

###### Corollary 1.

Let Assumptions 1, 2, and 3 be satisfied. Suppose that

$$\varphi^2 d \le \frac{n}{(\log n)^4}. \tag{3.1}$$

Then there is a universal constant such that

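$$E\left[ \|f_1 - \hat f_1^*\|^2 \right] \le \frac{1}{1-\rho_0^2} \left( \|f_1 - \Pi_{V_1} f_1\|^2 + \sigma^2 \frac{\dim V_1}{n} \right) \left( 1 + \frac{C}{\log n} \right) + \frac{C}{1-\rho_0^2} \left( (\log n) \|f_2 - \Pi_{V_2} f_2\|^2 + \|f_1\|^2 n^{-\frac{\kappa}{2} \log n + 1} \right).$$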

The first two terms on the right hand side are (up to the factor ) equal to the bias term and the variance term of the same estimator with in the model . The third term is the approximation error of the function with respect to the space . It decreases if is chosen larger. Moreover, the choice of does not affect any of the other terms; the only restriction is given by (3.1). The question arising now is as follows: Is it possible to choose a space subject to the constraint (3.1) such that is negligible with respect to the first two terms?

### 3.2. A refined risk bound

In this section, we improve Theorem 1 such that the factor only appears in remainder terms. Since the refined upper bound for the variance term will also contain a Hilbert-Schmidt norm, we give the following general definition (see, e.g., [35]).

###### Definition 2.

Let and be Hilbert spaces. A bounded linear operator is called Hilbert-Schmidt if for some orthonormal basis of ,

$$\sum_{\alpha \in I} \|T \phi_{1\alpha}\|^2 < \infty. \tag{3.2}$$

This sum is independent of the choice of the orthonormal basis (see [35, Satz 3.18]). The square root of this sum is called the Hilbert-Schmidt norm of , denoted by .

Let be the orthogonal projection from to , and let be the restriction of to . Then is a Hilbert-Schmidt operator, since is finite-dimensional. We prove:

###### Theorem 2.

Let Assumptions 1, 2, and 3 be satisfied. Let be a real number. Then

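$$\begin{aligned} E\left[ \|f_1 - \hat f_1^*\|^2 \right] \le{} & \left( \|f_1 - \Pi_{W_1} f_1\|^2 + \frac{1}{1-\delta} \sigma^2 \frac{\dim W_1}{n} \right) \left( 1 + \frac{1+\delta}{(1-\delta)^3} \frac{1}{1-\rho_0^2} \frac{2\varphi^2 d}{n} \right) \\ & + \frac{1+\delta}{(1-\delta)^2} \frac{6}{1-\rho_0^2} \left( \|f_1 - \Pi_{V_1} f_1\|^2 + \|f_2 - \Pi_{V_2} f_2\|^2 \right) \\ & + \frac{1+\delta}{(1-\delta)^4} \frac{1}{1-\rho_0^2} \frac{\sigma^2 \left\| \Pi_{V_2}|_{W_1} \right\|_{HS}^2}{n} + R_n, \end{aligned} \tag{3.3}$$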

where is given in Theorem 1.

In order to state a corollary of Theorem 2 similar to Corollary 1, we have to discuss the quantity . If is an orthonormal basis of , then it can be bounded as follows:

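$$\left\| \Pi_{V_2}|_{W_1} \right\|_{HS}^2 = \sum_{k=1}^{\dim W_1} \left\| \Pi_{V_2} \phi_{1k} \right\|^2 \le \sum_{k=1}^{\dim W_1} \rho_0^2 \|\phi_{1k}\|^2 = \rho_0^2 \dim W_1, \tag{3.4}$$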

where the inequality can be shown as in (5.7). Using this bound, we get a variance term which coincides (up to first order) with the one in Theorem 1. However, (3.4) can be considerably improved under certain Hilbert-Schmidt assumptions. In particular, we will derive upper bounds which are dimension free. The first assumption is as follows:

###### Assumption 4.

Suppose that there are measures and on and , respectively, such that has the density with respect to the product measure . Let and be the marginal densities of and with respect to the measures and , respectively. Suppose that

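$$\|K\|_{HS}^2 = \int_{S_2} \int_{S_1} \left( \frac{p(x_1, x_2)}{p_1(x_1) p_2(x_2)} \right)^2 p_1(x_1) p_2(x_2)\, d\nu_1(x_1)\, d\nu_2(x_2) = \int_{S_2} \int_{S_1} \frac{(p(x_1, x_2))^2}{p_1(x_1) p_2(x_2)}\, d\nu_1(x_1)\, d\nu_2(x_2) < \infty.$$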

If Assumption 4 is satisfied, then we can define the integral operator by

$$(K g_1)(x_2) = \int_{S_1} g_1(x_1) \frac{p(x_1, x_2)}{p_1(x_1) p_2(x_2)} p_1(x_1)\, d\nu_1(x_1)$$

which is the orthogonal projection from to restricted to . Applying [35, Satz 3.19], we obtain that is a Hilbert-Schmidt operator with Hilbert-Schmidt norm . We conclude that

$$\left\| \Pi_{V_2}|_{W_1} \right\|_{HS} \le \|K\|_{HS}.$$

Next, we present a more sophisticated upper bound, by using the spaces and instead of and . Let be the orthogonal projection from to , and let be the restriction of to .

###### Assumption 5 (Weaker form of Assumption 4).

Suppose that is a Hilbert-Schmidt operator.

If Assumption 5 is satisfied, then

Letting now and as in Corollary 1, we obtain the following corollary of Theorem 2.

###### Corollary 2.

Let Assumptions 1, 2, 3, and 4 be satisfied. Suppose that

$$\varphi^2 d \le \frac{n}{(\log n)^4}.$$

Then there is a universal constant such that

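$$E\left[ \|f_1 - \hat f_1^*\|^2 \right] \le \left( \|f_1 - \Pi_{W_1} f_1\|^2 + \sigma^2 \frac{\dim W_1}{n} \right) \left( 1 + \frac{C'}{\log n} \right) + C' \left( \|f_1 - \Pi_{V_1} f_1\|^2 + \|f_2 - \Pi_{V_2} f_2\|^2 + \frac{\sigma^2 \|K\|_{HS}^2}{n} + \frac{\|f_1\|^2}{n^{\frac{\kappa}{2} \log n - 1}} \right),$$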

where . Moreover, if Assumption 5 holds instead of Assumption 4, then the above inequality holds if is replaced by . Finally, if Assumptions 4 and 5 are not satisfied, then the above inequality holds if is replaced by .

Now the first two terms in the brackets on the right hand side are equal to the bias term and the variance term of the same estimator with in the model . As in Corollary 1, we see that the choices of and do not affect any of the other terms; the only restriction is given by (3.1).

Finally, we give an alternative representation of the Hilbert-Schmidt norm using the operator (which we consider as a map from to ). To simplify the exposition, we suppose that is separable, which implies that each orthonormal basis of is countable (see, e.g., [27, Chapter II]). From Assumption 5, it follows that is compact (see, e.g., [18, Chapter 30.8]). Since it is also symmetric and positive, the spectral theorem (see, e.g., [18, Theorem 3 in Chapter 28]) implies that there is an orthonormal basis for consisting of eigenvectors of . These all have non-negative eigenvalues. We arrange the positive eigenvalues of in decreasing order: . We now have:

###### Lemma 2.

Under the above assumptions, we have

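$$\rho_0^2 = \alpha_1$$

and

$$\left\| \Pi_{H_2}|_{H_1} \right\|_{HS}^2 = \operatorname{tr}\left( \Pi_{H_1} \Pi_{H_2} \Pi_{H_1} \right) = \sum_{k \ge 1} \alpha_k.$$
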
###### Proof.

We only prove the second equality. Let be an orthonormal basis for consisting of eigenvectors of . Then

 ∥∥ΠH2|H1∥∥2HS

###### Example 1.

Consider the case that is a bivariate Gaussian random variable such that , , and .

First, suppose that and are the spaces of linear centered functions, i.e., and . Then it is easy to see that

$$\rho_0 = |\rho|$$

and

$$\left\| \Pi_{H_2}|_{H_1} \right\|_{HS}^2 = \rho^2.$$

Second, suppose that and . Then it follows from [17] that has eigenvalues . Hence, the above lemma implies that

$$\rho_0 = |\rho|$$

and

$$\left\| \Pi_{H_2}|_{H_1} \right\|_{HS}^2 = \sum_{k=1}^{\infty} \rho^{2k} = \frac{\rho^2}{1 - \rho^2},$$

which is an improvement over (3.4) if is large.
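
As a quick numerical illustration of Example 1, the sketch below compares the dimension-free value $\rho^2/(1-\rho^2)$ with the dimension-dependent bound $\rho_0^2 \dim W_1$ from (3.4). The value $\rho = 0.6$ and the variable names are arbitrary choices made for illustration.

```python
rho = 0.6  # illustrative correlation, chosen arbitrarily

# Closed form of the squared Hilbert-Schmidt norm in the Gaussian example.
closed_form = rho**2 / (1 - rho**2)

# Partial sum of the series sum_{k>=1} rho^(2k); the tail is negligible here.
partial_sum = sum(rho ** (2 * k) for k in range(1, 200))

def dimension_bound(rho, dim_w1):
    """Right-hand side of (3.4) with rho_0 = |rho|."""
    return rho**2 * dim_w1

# Smallest dimension of W1 for which the closed form beats the bound (3.4).
threshold = next(m for m in range(1, 100) if closed_form < dimension_bound(rho, m))

print(closed_form, partial_sum, threshold)
```

For this value of $\rho$, the dimension-free bound is already smaller than (3.4) once $\dim W_1 \ge 2$, since the comparison reduces to $\dim W_1 > 1/(1-\rho^2)$.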

### 3.3. Regularity conditions on the design densities

In this section, we present two improvements of Theorem 2 which are possible under Assumption 4 and additional regularity conditions on the design densities. In particular, we show that the dependence of the bias term on the function can decrease considerably.

By Assumption 4 and Fubini’s theorem, we have

$$\frac{p(x_1, \cdot)}{p_1(x_1) p_2(\cdot)} \in L^2(P^{X_2}) \tag{3.5}$$

for -almost all . Thus we can make the following assumption. Suppose that there is a real number and a function such that

$$\left\| (1 - \Pi_{V_2}) \frac{p(x_1, \cdot)}{p_1(x_1) p_2(\cdot)} \right\|_{L^2(P^{X_2})} \le h_1(x_1) \psi(V_2) \tag{3.6}$$

for -almost all . In analogy, we let be a real number such that . We prove:

###### Theorem 3.

Let Assumptions 1, 2, 3, and 4 be satisfied. Let be a real number. Suppose that (3.6) is satisfied. Moreover, suppose that for all , where is the constant from Assumption 3. Then (3.3) holds when

$$\frac{1+\delta}{(1-\delta)^2} \frac{6}{1-\rho_0^2} \left( \|f_1 - \Pi_{V_1} f_1\|^2 + \|f_2 - \Pi_{V_2} f_2\|^2 \right)$$

is replaced by

 (1+δ)2(1−δ)4121−ρ20(∥f1−ΠV1f1∥2+∥h1∥2(ϕ(V2)ψ(V2))21−ρ20+1n∥h1∥2∥f2−ΠV2f2∥2∞(ψ(V2))21−ρ20+(ϕ(V2))21−ρ20φ2d1n).

Theorem 3 shows that the regularity conditions on and have similar effects, which can be seen from the second term. In contrast to Theorems 1 and 2, Theorem 3 shows that the estimator can also behave well when is considerably less regular than . For instance, if we apply Theorem 3 to an asymptotic scenario, then, under suitable conditions on , the regularity conditions on can be (almost) reduced to (see, e.g., Corollary 6).

For fixed , let the function be the orthogonal projection of on . By (3.5), is defined for -almost all . Thus we can consider the following weaker version of (3.6). Suppose that there exists a real number and a function such that

$$\left\| (1 - \Pi_{V_2}) r(x_1, \cdot) \right\|_{L^2(P^{X_2})} \le h_1(x_1) \psi_\Pi(V_2) \tag{3.7}$$

for -almost all . If (3.7) holds, then we obtain the following theorem. Note that, compared to Theorem 3, the last term is not always negligible.

###### Theorem 4.

Let Assumptions 1, 2, 3, and 4 be satisfied. Let be a real number. Suppose that (3.7) is satisfied. Then Theorem 2 also holds when the term

$$\frac{1+\delta}{(1-\delta)^2} \frac{6}{1-\rho_0^2} \left( \|f_1 - \Pi_{V_1} f_1\|^2 + \|f_2 - \Pi_{V_2} f_2\|^2 \right)$$

is replaced by

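$$\frac{1+\delta}{(1-\delta)^3} \frac{6}{1-\rho_0^2} \left( \|f_1 - \Pi_{V_1} f_1\|^2 \left( 1 + \frac{2\varphi^2 d}{n} \right) + \|h_1\|^2 \frac{\left( \phi(V_2) \psi_\Pi(V_2) \right)^2}{1-\rho_0^2} + \frac{(\phi(V_2))^2}{1-\rho_0^2} \frac{2\varphi^2 d}{n} \right).$$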

## 4. Applications

### 4.1. The two-dimensional case

In this section, we want to discuss Theorems 1 and 2 in the case that and take values in and that Assumptions 1 and 2 are satisfied with

$$H_1 = \{ g_1 \in L^2(P^{X_1}) \mid E[g_1(X_1)] = 0 \}$$

and

$$H_2 = L^2(P^{X_2}).$$

The main remaining issue is to bound the approximation errors. This is possible if the belong to certain nonparametric classes of functions and if the are chosen appropriately. Here, we shall restrict our attention to (periodic) Sobolev smoothness and spaces of trigonometric polynomials. Note that we will also consider Hölder smoothness and spaces of piecewise polynomials in Sections 4.4 and 4.5. Recall that the trigonometric basis is given by , and , , where .

###### Assumption 6.

Suppose that the take values in and have densities with respect to the Lebesgue measure on , which satisfy for some constant . Moreover, suppose that the belong to the Sobolev classes

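$$\tilde W_j(\alpha_j, K_j) = \left\{ \sum_{k \in \mathbb{Z}} \theta_k \phi_k(x_j) \,:\, \sum_{k \in \mathbb{Z}} |k|^{2\alpha_j} \theta_k^2 \le K_j^2 \right\},$$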

where and (see, e.g., [32, Definition 1.12]).

For , let be the intersection of with the linear span of the