
# Adaptive estimation in the single-index model via oracle approach

## Abstract

In the framework of nonparametric multivariate function estimation we are interested in structural adaptation. We assume that the function to be estimated has the “single-index” structure, where neither the link function nor the index vector is known. We suggest a novel procedure that adapts simultaneously to the unknown index and to the smoothness of the link function. For the proposed procedure, we prove a “local” oracle inequality (described by the pointwise seminorm), which is then used to obtain an upper bound on the maximal risk of the adaptive estimator under the assumption that the link function belongs to a scale of Hölder classes. The lower bound on the minimax risk shows that, in the case of estimating at a given point, the constructed estimator is optimally rate adaptive over the considered range of classes. For the same procedure we also establish a “global” oracle inequality (under the $\mathbb L_r$-norm) and examine its performance over the Nikol’skii classes. This study shows that the proposed method can be applied to estimating functions of inhomogeneous smoothness, that is, functions whose smoothness may vary from point to point.

Running title: Adaptation in the single-index model.

Funding of the ANR-07-BLAN-0234 is acknowledged. The second author is also supported by the DFG FOR 916.


AMS subject classifications: Primary 62G05; secondary 62G20, 62M99.

Keywords: adaptive estimation, Gaussian white noise, lower bounds, minimax rate of convergence, nonparametric function estimation, oracle inequalities, single-index, structural adaptation.

## 1 Introduction

This research aims at estimating multivariate functions with the use of the oracle approach. The first step of the method consists in establishing pointwise and global oracle inequalities for the estimation procedure; the second step is deriving from them adaptive results for estimation of the function at a point and of the entire function, respectively. The obtained results show full adaptivity of the proposed estimator as well as its minimax rate optimality.

Model and set-up. Let $D$ be a bounded interval in $\mathbb R^{d}$. We observe a path $Y_\varepsilon=\{Y_\varepsilon(t),\ t\in D\}$ satisfying the stochastic differential equation

$$Y_\varepsilon(\mathrm{d}t)=F(t)\,\mathrm{d}t+\varepsilon W(\mathrm{d}t),\qquad t=(t_1,\ldots,t_d)\in D, \tag{1.1}$$

where $W$ is a Brownian sheet and $\varepsilon>0$ is the deviation parameter.

In single-index modeling the signal $F$ has a particular structure:

$$F(x)=f\big(x^\top\theta^\circ\big), \tag{1.2}$$

where $f$ is called the link function and $\theta^\circ$ is the index vector.
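To fix ideas, the observation scheme (1.1) with a signal of the single-index form (1.2) can be mimicked by a toy discretization on a grid; the grid size, the link function and the index below are arbitrary illustrative choices, not quantities from the paper.

```python
import numpy as np

# Illustrative discretization of the GWN model (1.1) for d = 2 on
# D = [-1/2, 1/2]^2: the increment of Y_eps over a cell of area delta^2
# is F(t) * delta^2 + eps * delta * N(0, 1).
rng = np.random.default_rng(0)

n, eps = 64, 0.05
delta = 1.0 / n
grid = np.linspace(-0.5 + delta / 2, 0.5 - delta / 2, n)
t1, t2 = np.meshgrid(grid, grid, indexing="ij")

theta = np.array([np.cos(0.3), np.sin(0.3)])  # unknown index, |theta| = 1
f = np.cos                                    # unknown link function
F = f(t1 * theta[0] + t2 * theta[1])          # single-index signal (1.2)

# observed cell increments of Y_eps(dt)
Y = F * delta**2 + eps * delta * rng.standard_normal((n, n))
```

The statistical problem below is to recover $F$ (or $F(x)$ at a point) from such increments without knowing either the link function or the index.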

We consider the case of completely unknown $f$ and $\theta^\circ$; the only technical assumption is that $\|f\|_\infty\le M$ for some $M>0$. However, the knowledge of $M$, as well as any information on the smoothness of the link function, is not required for the estimation procedure proposed below. The consideration is restricted to the case $d=2$, except for the second assertion of Theorem 3 concerning a lower bound for function estimation at a given point, which holds for arbitrary $d$. Also, without loss of generality, we will assume that $D=[-1/2,1/2]^{d}$ and $\theta^\circ\in\mathbb S^{d-1}$, the unit sphere in $\mathbb R^{d}$.

Let $\tilde F$ be an estimator, i.e., a measurable function of the observation $Y_\varepsilon$, and let $\mathbb E^{(\varepsilon)}_{F}$ denote the mathematical expectation with respect to $\mathbb P^{(\varepsilon)}_{F}$, the family of probability distributions generated by the observation process on the Banach space of continuous functions on $D$ when $F$ is the mean function. The estimation quality is measured either by the $\mathbb L_r$-risk, $r\in[1,\infty)$,

$$R^{(\varepsilon)}_{r}(\tilde F,F)=\Big(\mathbb E^{(\varepsilon)}_{F}\big\|\tilde F-F\big\|_{r}^{r}\Big)^{1/r}, \tag{1.3}$$

where $\|\cdot\|_{r}$ is the $\mathbb L_{r}$-norm on $D$, or by the “pointwise” risk

$$R^{(\varepsilon)}_{r,x}(\tilde F,F)=\Big(\mathbb E^{(\varepsilon)}_{F}\big|\tilde F(x)-F(x)\big|^{r}\Big)^{1/r}. \tag{1.4}$$

The aim is to estimate the entire function $F$ on $D$, or its value $F(x)$ at a given point $x$, from the observation $Y_\varepsilon$ satisfying SDE (1.1), without any prior knowledge of the nuisance parameters: the link function $f$ and the unit vector $\theta^\circ$. More precisely, we will construct an adaptive (not depending on $f$ and $\theta^\circ$) estimator at any point $x$. In what follows the notation $\hat F$ stands for an adaptive estimator, and $\tilde F$ denotes an arbitrary estimator. Our estimation procedure is a random selector from a special family of kernel estimators parametrized by a window size (bandwidth) $h$ and a direction of projection $\theta$; see Section 2.2 below. For that procedure we then establish a pointwise oracle inequality (Theorem 1) of the following type:

$$R^{(\varepsilon)}_{r,x}(\hat F,F)\le C_1\,\varepsilon\sqrt{\ln(1/\varepsilon)\big/h^{*}(x^\top\theta^\circ)}+C_2\,\varepsilon\sqrt{\ln(1/\varepsilon)}, \tag{1.5}$$

where $h^{*}$ is a bandwidth that is optimal in a certain sense (the oracle bandwidth); see the definition (2.1). Jensen’s inequality and Fubini’s theorem trivially imply

$$\big[R^{(\varepsilon)}_{r}(\hat F,F)\big]^{r}\le \mathbb E^{(\varepsilon)}_{F}\big\|\hat F(\cdot)-F(\cdot)\big\|_{r}^{r}=\big\|R^{(\varepsilon)}_{r,\cdot}(\hat F,F)\big\|_{r}^{r}.$$

Hence, we immediately obtain the “global” oracle inequality

$$R^{(\varepsilon)}_{r}(\hat F,F)\le C_1\,\varepsilon\,\big\|\sqrt{\ln(1/\varepsilon)/h^{*}}\big\|_{r}+C_2\,\varepsilon\sqrt{\ln(1/\varepsilon)}. \tag{1.6}$$

Both inequalities (1.5) and (1.6), aside from being quite informative in themselves – we will see in Section 2.1, from Proposition 1, that they show that our adaptive estimator mimics its ideal (oracle) counterpart, i.e., their risk bounds differ only by a numerical constant – are further used to derive the minimax rates of convergence under the pointwise and $\mathbb L_{r}$ losses, respectively (Theorems 3 and 4). We will see that these rates are in accordance with Stone’s dimensionality reduction principle; see pp. 692–693 in Stone (1985). Indeed, since the statistical model is effectively one-dimensional due to the structural assumption (1.2), so is the rate of convergence.

The obtained results demonstrate full adaptivity of the proposed estimator to the unknown direction of projection $\theta^\circ$ and to the smoothness of the link function $f$. Moreover, the lower bound given in the second assertion of Theorem 3 shows that, in the case of pointwise estimation over the range of classes of $d$-variate functions having the single-index structure, see definition (3.1), our estimator is even optimally rate adaptive, that is, it achieves the minimax rate of convergence. This fact is in striking contrast to the common knowledge that a payment for pointwise adaptation in terms of the convergence rate is unavoidable. Indeed, if the index were known, then the problem would boil down to pointwise adaptation over Hölder classes in the univariate GWN model. As demonstrated in Lepski (1990), an optimally adaptive estimator does not exist in this case.

Although the literature on the single-index model is rather extensive – we mention only the books Härdle et al. (2004), Horowitz (1998), Györfi et al. (2002) and Korostelev and Korosteleva (2011) – only a few works address the problem of function estimation when both the link function and the index are unknown. To the best of our knowledge the only exceptions are Golubev (1992), Gaïffas and Lecué (2007) and Goldenshluger and Lepski (2008). An adaptive projection estimator is constructed in Golubev (1992), while in Gaïffas and Lecué (2007) the aggregation method is used; both papers deal with global losses. Goldenshluger and Lepski (2008) seems to be the first work on pointwise adaptive estimation in the considered set-up; the upper bound for estimation at a point obtained therein is similar to ours, but the estimation procedure is different.

Organization of the paper. In Section 2 we motivate and explain the proposed selection rule. Then, in Section 2.3, we establish for it local and global oracle inequalities of type (1.5) and (1.6). In Section 3 we apply these results to minimax adaptive estimation. In particular, Section 3.1 is devoted to the upper bound, and the already discussed lower bound, for estimation over a range of Hölder classes. Section 3.2 addresses the “global” adaptation under the $\mathbb L_{r}$ losses and the estimator performance over the collection of classes of single-index functions with the link function in a Nikol’skii class; see Definition 2 and (3.2). That consideration allows us to incorporate into the analysis functions of inhomogeneous smoothness, that is, functions which can be very smooth on some parts of the observation domain and irregular on the others. The proofs of the main results are given in Section 4, and the proofs of technical lemmas are postponed to the Appendix.

## 2 Oracle approach

Below we define an “ideal” (oracle) estimator and describe our estimation procedure. Then we present local and global oracle inequalities demonstrating a nearly oracle performance of the proposed estimator.

Denote by $K$ any function (kernel) that integrates to one, and define for any $h\in(0,1]$ and any $z\in\mathbb R$

$$\Delta_{K,f}(h,z)=\sup_{\delta\le h}\bigg|\frac{1}{\delta}\int K\Big(\frac{u-z}{\delta}\Big)\big[f(u)-f(z)\big]\,\mathrm{d}u\bigg|,$$

a monotone approximation error of the kernel smoother. In particular, if the function $f$ is uniformly continuous, then $\Delta_{K,f}(h,z)\to 0$ as $h\to 0$.
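As an illustration, the quantity $\Delta_{K,f}(h,z)$ can be approximated numerically by restricting the supremum and the integral to finite grids; the rectangular kernel and the test function below are our own toy choices.

```python
import numpy as np

def delta_kf(f, kernel, h, z, n_delta=40, n_u=2001):
    """Approximate Delta_{K,f}(h, z) = sup_{delta <= h}
    |(1/delta) * int K((u - z)/delta) [f(u) - f(z)] du|
    on finite grids (an illustrative sketch, not the paper's object)."""
    best = 0.0
    for delta in np.linspace(h / n_delta, h, n_delta):
        # the kernel below is supported on [-1/2, 1/2]
        u = np.linspace(z - delta / 2, z + delta / 2, n_u)
        integrand = kernel((u - z) / delta) * (f(u) - f(z))
        best = max(best, abs(np.trapz(integrand, u)) / delta)
    return best

box = lambda t: np.where(np.abs(t) <= 0.5, 1.0, 0.0)  # rectangular kernel
```

For $f(u)=u^{2}$ and the rectangular kernel a direct computation gives $\Delta_{K,f}(h,0)=h^{2}/12$, which the routine reproduces and which decreases to zero with $h$, in line with the remark above.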

In what follows we assume that the kernel obeys

###### Assumption 1.

(1)  , , is symmetric;

(2)  there exists such that

### 2.1 Oracle estimator

For any $h$ and $y$ denote by

$$\overline\Delta_{K,f}(h,y)=\sup_{a>0}\frac{1}{2a}\int_{y-a}^{y+a}\Delta_{K,f}(h,z)\,\mathrm{d}z,$$

the Hardy–Littlewood maximal function of $\Delta_{K,f}(h,\cdot)$; see, for instance, Wheeden and Zygmund (1977). Put also $\Delta^{*}_{K,f}=\max\{\Delta_{K,f},\overline\Delta_{K,f}\}$, and remark that, in view of the Lebesgue differentiation theorem, $\Delta_{K,f}$ and $\overline\Delta_{K,f}$ coincide almost everywhere. Note also that if $\Delta_{K,f}(h,\cdot)$ is a continuous function, then they coincide everywhere.
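For intuition, the Hardy–Littlewood-type maximal operator used here can be mimicked numerically, with the supremum over radii restricted to a finite grid; everything below is an illustrative sketch of ours.

```python
import numpy as np

def hl_maximal(g, y, radii):
    """sup over a in `radii` of (2a)^{-1} * int_{y-a}^{y+a} g(z) dz:
    a grid approximation of the Hardy-Littlewood maximal function of g at y."""
    vals = []
    for a in radii:
        z = np.linspace(y - a, y + a, 1001)
        vals.append(np.trapz(g(z), z) / (2 * a))
    return max(vals)

indicator = lambda z: ((z >= 0) & (z <= 1)).astype(float)
m_inside = hl_maximal(indicator, 0.5, np.linspace(0.01, 2.0, 50))
```

At the interior point $y=0.5$ the maximal function equals the value of the function itself, in line with the almost-everywhere coincidence noted above.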

Define the oracle (depending on the underlying function $f$) bandwidth

$$h^{*}_{K,f}(y)=\sup\Big\{h\in[\varepsilon^{2},1]:\ \sqrt{h}\,\Delta^{*}_{K,f}(h,y)\le \|K\|_{\infty}\,\varepsilon\sqrt{\ln(1/\varepsilon)}\Big\}. \tag{2.1}$$

We see that, with the proviso that $\ln(1/\varepsilon)\ge\big(2M\|K\|_{1}/\|K\|_{\infty}\big)^{2}$, the “bias” satisfies $\sqrt{\varepsilon^{2}}\,\Delta^{*}_{K,f}(\varepsilon^{2},y)\le 2M\|K\|_{1}\,\varepsilon\le\|K\|_{\infty}\,\varepsilon\sqrt{\ln(1/\varepsilon)}$, and consequently the set in (2.1) is not empty for all $y$. Here and in what follows, $\|\cdot\|_{p}$, $p\in[1,\infty]$, denotes the $\mathbb L_{p}$-norm.
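The definition (2.1) suggests a simple numerical experiment: replace $\Delta^{*}_{K,f}$ by a toy bias bound and search a grid for the largest admissible $h$. Everything below (the bias model $Lh^{\beta}$ and the constants) is our own illustrative choice.

```python
import numpy as np

def oracle_bandwidth(delta_star, eps, K_sup=1.0, n_grid=400):
    """Grid-search version of (2.1): the largest h in [eps^2, 1] with
    sqrt(h) * Delta*(h) <= K_sup * eps * sqrt(ln(1/eps))."""
    thresh = K_sup * eps * np.sqrt(np.log(1.0 / eps))
    hs = np.geomspace(eps**2, 1.0, n_grid)
    admissible = [h for h in hs if np.sqrt(h) * delta_star(h) <= thresh]
    return max(admissible) if admissible else None

# toy bias of a beta-smooth link: Delta*(h) ~ L * h^beta
eps, beta, L = 0.01, 1.0, 1.0
h_star = oracle_bandwidth(lambda h: L * h**beta, eps)
```

For this toy bias the balance gives $h^{*}\approx\big(\varepsilon\sqrt{\ln(1/\varepsilon)}\big)^{2/(2\beta+1)}$, the bandwidth behind the rate $\psi_\varepsilon$ appearing in Section 3.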

For any $(\theta,h)\in\mathbb S^{1}\times[\varepsilon^{2},1]$ define the matrix

$$\mathcal E_{(\theta,h)}=\begin{pmatrix} h^{-1}\theta_1 & h^{-1}\theta_2\\ -\theta_2 & \theta_1 \end{pmatrix}$$

and consider the family of kernel estimators

$$\mathcal F=\Big\{\hat F_{(\theta,h)}(\cdot)=\det\big(\mathcal E_{(\theta,h)}\big)\int K\big(\mathcal E_{(\theta,h)}(t-\cdot)\big)\,Y_\varepsilon(\mathrm{d}t),\ \ (\theta,h)\in\mathbb S^{1}\times[\varepsilon^{2},1]\Big\}.$$

We use product-type kernels, $K(t)=\mathcal K(t_1)\mathcal K(t_2)$, with a one-dimensional kernel $\mathcal K$ obeying Assumption 1. Note also that $\det(\mathcal E_{(\theta,h)})=h^{-1}$ and

$$\hat F_{(\theta,h)}(\cdot)-\mathbb E^{(\varepsilon)}_{F}\big[\hat F_{(\theta,h)}(\cdot)\big]\sim\mathcal N\big(0,\ \|\mathcal K\|_{2}^{4}\,\varepsilon^{2}h^{-1}\big). \tag{2.2}$$
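A discretized sketch of one estimator $\hat F_{(\theta,h)}$ from the family $\mathcal F$ can be written down directly: the stochastic integral becomes a sum over cell increments, and the matrix $\mathcal E_{(\theta,h)}$ rescales by $1/h$ along the direction $\theta$ only. Grid size, noise level, kernel and link function below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n, eps = 128, 0.02
delta = 1.0 / n
grid = np.linspace(-0.5 + delta / 2, 0.5 - delta / 2, n)
t1, t2 = np.meshgrid(grid, grid, indexing="ij")

theta0 = np.array([1.0, 0.0])                        # true index (unknown in practice)
F = np.cos(3.0 * (t1 * theta0[0] + t2 * theta0[1]))  # single-index signal
dY = F * delta**2 + eps * delta * rng.standard_normal((n, n))

def f_hat(dY, x, theta, h):
    """det(E) * sum K(E(t - x)) dY(t) with the product box kernel."""
    E = np.array([[theta[0] / h, theta[1] / h],  # E(theta, h) as in the text
                  [-theta[1],    theta[0]]])
    u1 = E[0, 0] * (t1 - x[0]) + E[0, 1] * (t2 - x[1])
    u2 = E[1, 0] * (t1 - x[0]) + E[1, 1] * (t2 - x[1])
    K = ((np.abs(u1) <= 0.5) & (np.abs(u2) <= 0.5)).astype(float)
    return np.linalg.det(E) * np.sum(K * dY)

est = f_hat(dY, np.array([0.0, 0.0]), theta0, h=0.1)  # close to F(0) = 1
```

The averaging window is a band of width $h$ around the hyperplane $\{t:(t-x)^\top\theta=0\}$, which is exactly the dimensionality reduction exploited by the procedure.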

The choice $\theta=\theta^\circ$ and $h=h^{*}:=h^{*}_{K,f}(x^\top\theta^\circ)$ leads to the “ideal” (oracle) estimator $\hat F_{(\theta^\circ,h^{*})}$, that is, the estimator constructed as if $f$ and $\theta^\circ$ were known. Such an “estimator” is not available, but it serves as a quality benchmark, given by the following result.

###### Proposition 1.

For any $r\ge1$,

$$R^{(\varepsilon)}_{r,x}\big(\hat F_{(\theta^\circ,h^{*})},F\big)\le c_{r}\bigg[\frac{\|K\|_{\infty}^{4}\,\varepsilon^{2}\ln(1/\varepsilon)}{h^{*}_{K,f}(x^\top\theta^\circ)}\bigg]^{1/2},\qquad\forall x\in[-1/2,1/2]^{2},$$

where $c_{r}$ depends on $r$ only. The proof is straightforward and can be omitted.

The meaning of Proposition 1 is that the “oracle” knows the exact value of the index $\theta^\circ$ and realizes the optimal, up to the factor $\sqrt{\ln(1/\varepsilon)}$, bias–variance trade-off between the approximation error caused by $\Delta^{*}_{K,f}$ and the variance, see formula (2.2), of the kernel estimators from the collection $\mathcal F$.

Below we will propose an adaptive (not depending on $f$ and $\theta^\circ$) estimator and show that this estimator is as good as the oracle one, i.e., that its risk is worse than the bound of Proposition 1 by a numerical constant only.

### 2.2 Selection rule

The procedure below is based on a pairwise comparison of the estimators from $\mathcal F$ with an auxiliary estimator defined as follows. For any $(\theta,h)\in\mathbb S^{1}\times[\varepsilon^{2},1]$ and any $\nu\in\mathbb S^{1}$ introduce the matrices

$$\overline{\mathcal E}_{(\theta,h)}(\nu,h)=\begin{pmatrix} \dfrac{\theta_1+\nu_1}{2h\big(1+|\nu^\top\theta|\big)} & \dfrac{\theta_2+\nu_2}{2h\big(1+|\nu^\top\theta|\big)}\\[10pt] -\dfrac{\theta_2+\nu_2}{2\big(1+|\nu^\top\theta|\big)} & \dfrac{\theta_1+\nu_1}{2\big(1+|\nu^\top\theta|\big)} \end{pmatrix},\qquad \mathcal E_{(\theta,h)}(\nu,h)=\begin{cases}\overline{\mathcal E}_{(\theta,h)}(\nu,h), & \nu^\top\theta\ge 0;\\[4pt] \overline{\mathcal E}_{(-\theta,h)}(\nu,h), & \nu^\top\theta<0.\end{cases}$$

Then, similarly to the construction of the estimators from $\mathcal F$, we define a kernel estimator parametrized by $(\theta,h)$ and $\nu$:

$$\hat F^{(\nu,h)}_{(\theta,h)}(x)=\det\big(\mathcal E_{(\theta,h)}(\nu,h)\big)\int K\big(\mathcal E_{(\theta,h)}(\nu,h)\,(t-x)\big)\,Y_\varepsilon(\mathrm{d}t). \tag{2.3}$$

Let $\mathcal H_\varepsilon$ be a finite grid of bandwidths in $[\varepsilon^{2},1]$, and set for any $\eta\in\mathcal H_\varepsilon$

$$T_{\mathcal H}(\eta)=2\|K\|_{\infty}^{2}\Big[\Lambda(K,Q)+\sqrt{4r+2}+1\Big]\,\varepsilon\sqrt{\eta^{-1}\ln(1/\varepsilon)}.$$

Define for any $(\theta,h)\in\mathbb S^{1}\times\mathcal H_\varepsilon$ and any $x$

$$R_{(\theta,h)}(x)=\sup_{\eta\in\mathcal H_\varepsilon:\,\eta\le h}\Big\{\sup_{\nu\in\mathbb S^{1}}\big|\hat F^{(\nu,\eta)}_{(\theta,\eta)}(x)-\hat F_{(\nu,\eta)}(x)\big|-T_{\mathcal H}(\eta)\Big\}. \tag{2.4}$$

For any $x$ introduce the random set

$$\mathcal P(x)=\big\{(\theta,h)\in\mathbb S^{1}\times\mathcal H_\varepsilon:\ R_{(\theta,h)}(x)\le 0\big\},$$

and let $\hat h(\theta)=\sup\{h:(\theta,h)\in\mathcal P(x)\}$ if $\mathcal P(x)\ne\emptyset$. Note that there exists $\hat\theta\in\mathbb S^{1}$ maximizing $\hat h(\cdot)$, since the set $\mathcal H_\varepsilon$ is finite; define $\hat\theta$ as such a maximizer.

If $\hat\theta$ is not unique, let us make any measurable choice; in particular, one can choose $\hat\theta$ as a maximizing vector with the smallest first coordinate. The measurability of this choice follows from the fact that the mapping $\theta\mapsto R_{(\theta,h)}(x)$ is almost surely continuous on $\mathbb S^{1}$. This continuity, in its turn, follows from Assumption 1 (2) and the bound (5.9) for Dudley’s entropy integral proved in Lemma 2 below. Define

$$\hat h=\sup\big\{h\in\mathcal H_\varepsilon:\ \big|\hat F_{(\hat\theta,h)}(x)-\hat F_{(\hat\theta,\eta)}(x)\big|\le T_{\mathcal H}(\eta)\ \ \forall \eta\le h,\ \eta\in\mathcal H_\varepsilon\big\} \tag{2.5}$$

and take $\hat F_{(\hat\theta,\hat h)}(x)$ as the final estimator.

The proposed procedure belongs to the stream of pointwise adaptive procedures originating from Lepski (1990). Indeed, the second step, determined by (2.5) for the “frozen” $\hat\theta$, is exactly the procedure of Lepski (1990), which was originally developed in the framework of the univariate GWN model. There is a rather vast literature on that topic: Bauer et al. (2009) adapted the method of Lepski (1990) to the choice of the regularization parameter for iterated Tikhonov regularization in nonlinear inverse problems; Bertin and Rivoirard (2009) showed the maxiset optimality of that procedure for bandwidth selection under sup-norm losses; Chichignoud (2012) used it for selecting among local Bayesian estimators; Gaïffas (2007) studied the problem of pointwise estimation in random-design Gaussian regression; and Serdyukova (2012) investigated heteroscedastic Gaussian regression under noise misspecification, among many others.
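For readers unfamiliar with the method of Lepski (1990), here is a minimal univariate analogue of the second step (2.5): starting from the smallest bandwidth, keep enlarging $h$ as long as the new estimate stays within the threshold $T(\eta)$ of every estimate built with a smaller bandwidth $\eta$. All constants and the test function are illustrative choices of ours, not the calibration of the paper.

```python
import numpy as np

rng = np.random.default_rng(2)
n, eps = 2048, 0.0005
x_grid = np.linspace(-0.5, 0.5, n)
f = lambda z: 5.0 * np.abs(z)                 # Lipschitz link with a kink at 0
dY = f(x_grid) / n + eps * rng.standard_normal(n) / np.sqrt(n)

def kern_est(x0, h):
    """Box-kernel estimate of f(x0) from the discretized GWN increments."""
    w = (np.abs(x_grid - x0) <= h / 2).astype(float)
    return n * np.sum(w * dY) / max(w.sum(), 1.0)

def lepski(x0, bandwidths, c=2.5):
    """Largest admissible bandwidth in the spirit of (2.5)."""
    T = lambda eta: c * eps * np.sqrt(np.log(1.0 / eps) / eta)
    hs = sorted(bandwidths)
    h_hat = hs[0]
    for i, h in enumerate(hs):
        if all(abs(kern_est(x0, h) - kern_est(x0, eta)) <= T(eta)
               for eta in hs[:i]):
            h_hat = h
        else:
            break
    return h_hat

hs = np.geomspace(0.005, 0.2, 12)
h_smooth = lepski(0.4, hs)   # f is linear around 0.4: large bandwidth survives
h_kink = lepski(0.0, hs)     # the kink forces a smaller bandwidth
```

The selected bandwidth adapts to the local smoothness: it comes out much smaller at the kink than in the smooth region.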

The application of Lepski (1990) requires some sort of ordering on the set of estimators; for instance, in (2.5), as soon as $\hat\theta$ is fixed, such an ordering is induced by the monotonicity of the “bias” $\Delta_{K,f}(h,\cdot)$ in $h$. However, when the projection direction is unknown, no natural order on $\mathcal F$ is available. This problem is similar to the one arising in generalizations of the pointwise adaptive method to multivariate (anisotropic) settings; see, for developments in that direction, Lepski and Levit (1999), Kerkyacharian et al. (2001) and Goldenshluger and Lepski (2009). Usually the aforementioned issue requires one to introduce an auxiliary estimator and to construct a procedure carefully capturing the “incomparability” of the estimators. In the considered set-up this is realized by the first step of the procedure, with $R_{(\theta,h)}(x)$ given by (2.4).

### 2.3 Oracle inequalities

Throughout the paper we assume that

$$\varepsilon\le\exp\Big\{-\max\big[1,\big(2M\|K\|_{1}/\|K\|_{\infty}\big)^{2}\big]\Big\}.$$
###### Theorem 1.

For any $r\ge1$ and any $x\in[-1/2,1/2]^{2}$,

$$R^{(\varepsilon)}_{r,x}\big(\hat F_{(\hat\theta,\hat h)},F\big)\le C_{r,1}(Q,K)\,\sqrt{\frac{\|K\|_{\infty}^{4}\,\varepsilon^{2}\ln(1/\varepsilon)}{h^{*}_{K,f}(x^\top\theta^\circ)}}+C_{r,2}(M,Q,K)\,\|K\|_{\infty}^{2}\,\varepsilon\sqrt{\ln(1/\varepsilon)}.$$

The constants $C_{r,1}(Q,K)$ and $C_{r,2}(M,Q,K)$ are given at the beginning of the proof.

As already mentioned, the global oracle inequality is obtained by integrating the local one. For ease of notation, we write $C_{r}=C_{r,1}(Q,K)$ and $r(\varepsilon)=C_{r,2}(M,Q,K)\,\|K\|_{\infty}^{2}\,\varepsilon\sqrt{\ln(1/\varepsilon)}$. It follows from Jensen’s inequality and Fubini’s theorem that

$$R^{(\varepsilon)}_{r}(\hat F,F)\le\big\|R^{(\varepsilon)}_{r,\cdot}(\hat F,F)\big\|_{r}\le C_{r}\Bigg\{\int_{[-1/2,1/2]^{2}}\bigg[\frac{\|K\|_{\infty}^{4}\,\varepsilon^{2}\ln(1/\varepsilon)}{h^{*}_{K,f}(x^\top\theta^\circ)}\bigg]^{\frac r2}\mathrm{d}x\Bigg\}^{\frac1r}+r(\varepsilon).$$

Integration by substitution gives:

$$\int_{[-1/2,1/2]^{2}}\bigg[\frac{\|K\|_{\infty}^{4}\,\varepsilon^{2}\ln(1/\varepsilon)}{h^{*}_{K,f}(x^\top\theta^\circ)}\bigg]^{\frac r2}\mathrm{d}x\le\int_{-1/2}^{1/2}\bigg[\frac{\|K\|_{\infty}^{4}\,\varepsilon^{2}\ln(1/\varepsilon)}{h^{*}_{K,f}(z)}\bigg]^{\frac r2}\mathrm{d}z.$$

###### Theorem 2.

For any $r\ge1$,

$$R^{(\varepsilon)}_{r}\big(\hat F_{(\hat\theta,\hat h)},F\big)\le C_{r,1}(Q,K)\,\Bigg\|\sqrt{\frac{\|K\|_{\infty}^{4}\,\varepsilon^{2}\ln(1/\varepsilon)}{h^{*}_{K,f}(\cdot)}}\Bigg\|_{r}+C_{r,2}(M,Q,K)\,\|K\|_{\infty}^{2}\,\varepsilon\sqrt{\ln(1/\varepsilon)}.$$

## 3 Minimax adaptive estimation

In this section, with the use of the local oracle inequality from Theorem 1, we solve the problem of pointwise adaptive estimation over a collection of Hölder classes. Then we turn to the problem of adaptive estimation of the entire function over a collection of Nikol’skii classes, with the accuracy of an estimator measured under the $\mathbb L_{r}$ risk. That is done with the help of the global oracle inequality given in Theorem 2.

Throughout this section we assume that the kernel $K$ additionally satisfies Assumption 2 below. Introduce the following notation: for any $\beta>0$, let $m_\beta$ be the maximal integer strictly less than $\beta$.

###### Assumption 2.

There exists $\beta_{\max}>0$ such that

$$\int z^{j}K(z)\,\mathrm{d}z=0,\qquad\forall j=1,\ldots,m_{\beta_{\max}}.$$

### 3.1 Pointwise adaptive estimation

Let us first recall the definition of Hölderian functions.

###### Definition 1.

Let $\beta>0$ and $L>0$. A function $g$ belongs to the Hölder class $\mathbb H(\beta,L)$ if $g$ is $m_\beta$-times continuously differentiable and

$$\big|g^{(m_\beta)}(t+h)-g^{(m_\beta)}(t)\big|\le L\,h^{\beta-m_\beta},\qquad\forall t\in\mathbb R,\ \forall h>0.$$

The aim is to estimate the function $F$ at a given point $x$ under the additional assumption that $F\in\mathbb F_{d}(\beta,L)$, where

$$\mathbb F_{d}(\beta,L)=\Big\{F:\mathbb R^{d}\to\mathbb R\ \Big|\ F(z)=f\big(z^\top\theta\big),\ f\in\mathbb H(\beta,L),\ \theta\in\mathbb S^{d-1}\Big\}, \tag{3.1}$$

$d$ is the dimension, and $\beta\le\beta_{\max}$ with $\beta_{\max}$ the constant from Assumption 2, which can be arbitrary but must be chosen a priori.

###### Theorem 3.

Let $x\in[-1/2,1/2]^{2}$ be fixed and let Assumptions 1 and 2 hold. Then, for any $r\ge1$, $\beta\le\beta_{\max}$ and $L>0$, we have

$$\sup_{F\in\mathbb F_{2}(\beta,L)}R^{(\varepsilon)}_{r,x}\big(\hat F_{(\hat\theta,\hat h)},F\big)\le\|K\|_{\infty}^{2}\Big[C_{r,1}(Q,K)\,\psi_\varepsilon(\beta,L)+C_{r,2}(L,Q,K)\,\varepsilon\sqrt{\ln(1/\varepsilon)}\Big],$$

where $\psi_\varepsilon(\beta,L)=L^{\frac{1}{2\beta+1}}\big(\varepsilon\sqrt{\ln(1/\varepsilon)}\big)^{\frac{2\beta}{2\beta+1}}$.

Moreover, for any $d\ge2$, $r\ge1$, $\beta\le\beta_{\max}$, $L>0$ and any $\varepsilon$ small enough,

$$\inf_{\tilde F}\ \sup_{F\in\mathbb F_{d}(\beta,L)}R^{(\varepsilon)}_{r,x}(\tilde F,F)\ge\varkappa\,\psi_\varepsilon(\beta,L),$$

where the infimum is taken over all possible estimators. Here $\varkappa$ is a numerical constant independent of $\varepsilon$ and $L$.

We conclude that the estimator $\hat F_{(\hat\theta,\hat h)}$ is minimax adaptive with respect to the collection of classes $\big\{\mathbb F_{2}(\beta,L)\big\}_{\beta,L}$. As already mentioned, this result is quite surprising. Indeed, if, for example, the directional vector $\theta^\circ$ is known, then the considered estimation problem can easily be reduced to estimation of the link function at a given point in the univariate Gaussian white noise model. As shown in Lepski (1990), an optimally adaptive estimator over the collection of Hölder classes does not exist in that case.

Also, we would like to emphasize that the lower bound result given by the second assertion of the theorem is proved for arbitrary dimension. As to the proof of the first statement of the theorem, it is based on the evaluation of a uniform, over the class, lower bound for the oracle bandwidth $h^{*}_{K,f}$ and on the application of Theorem 1. We note also that the upper bound for the minimax risk given in Theorem 3 was earlier obtained in Goldenshluger and Lepski (2008), but the estimation procedure used there is completely different from our selection rule.

### 3.2 Adaptive estimation under the Lr losses

We start this section with the definition of the Nikol’skii class of functions.

###### Definition 2.

Let $\beta>0$, $p\ge1$ and $L>0$ be fixed. A function $g$ belongs to the Nikol’skii class $\mathbb N_{p}(\beta,L)$ if $g$ is $m_\beta$-times continuously differentiable and

$$\Big(\int_{\mathbb R}\big|g^{(m)}(t)\big|^{p}\,\mathrm{d}t\Big)^{\frac1p}\le L,\quad\forall m=1,\ldots,m_\beta;\qquad \Big(\int_{\mathbb R}\big|g^{(m_\beta)}(t+h)-g^{(m_\beta)}(t)\big|^{p}\,\mathrm{d}t\Big)^{\frac1p}\le L\,h^{\beta-m_\beta},\quad\forall h>0.$$

Later on we assume that if .

Here the target of estimation is the entire function $F$ under the assumption that $F\in\mathbb F_{2,p}(\beta,L)$, where

$$\mathbb F_{2,p}(\beta,L)=\Big\{F:\mathbb R^{2}\to\mathbb R\ \Big|\ F(z)=f\big(z^\top\theta\big),\ f\in\mathbb N_{p}(\beta,L),\ \theta\in\mathbb S^{1}\Big\}. \tag{3.2}$$
###### Theorem 4.

Let Assumptions 1 and 2 hold. Then, for any $r\ge1$, $p\ge1$, $\beta\le\beta_{\max}$ and $L>0$,

$$\sup_{F\in\mathbb F_{2,p}(\beta,L)}R^{(\varepsilon)}_{r}\big(\hat F_{(\hat\theta,\hat h)},F\big)\le\|K\|_{\infty}^{2}\Big[\varkappa\,C_{r,1}(Q,K)\,\varphi_\varepsilon(\beta,L,p)+C_{r,2}(L,Q,K)\,\varepsilon\sqrt{\ln(1/\varepsilon)}\Big],$$

where

$$\varphi_\varepsilon(\beta,L,p)=\begin{cases} L^{\frac{1}{2\beta+1}}\big(\varepsilon\sqrt{\ln(1/\varepsilon)}\big)^{\frac{2\beta}{2\beta+1}}, & (2\beta+1)p>r;\\[6pt] L^{\frac{1}{2\beta+1}}\big(\varepsilon\sqrt{\ln(1/\varepsilon)}\big)^{\frac{2\beta}{2\beta+1}}\big[\ln(1/\varepsilon)\big]^{\frac1r}, & (2\beta+1)p=r;\\[6pt] L^{\frac{1/2-1/r}{\beta-1/p+1/2}}\big(\varepsilon\sqrt{\ln(1/\varepsilon)}\big)^{\frac{\beta-1/p+1/r}{\beta-1/p+1/2}}, & (2\beta+1)p<r.\end{cases}$$

The constant $\varkappa$ is independent of $\varepsilon$, $L$ and $p$.

Let us make some remarks. First, note that the obtained rates cannot be much improved. Indeed, the class $\mathbb F_{2,p}(\beta,L)$ can be viewed as containing the class of functions satisfying (1.2) with a fixed index $\theta$. Then, the problem of estimating such (2-variate) functions can be reduced to the estimation of univariate functions observed in the one-dimensional GWN model. In view of this remark, the rate of convergence for the latter problem (which can be found, for example, in Donoho et al. (1995) and Delyon and Juditsky (1996)) is a lower bound for the minimax risk defined on $\mathbb F_{2,p}(\beta,L)$. This rate of convergence is given by

$$\phi_\varepsilon(\beta,L,p)=\begin{cases} L^{\frac{1}{2\beta+1}}\,\varepsilon^{\frac{2\beta}{2\beta+1}}, & (2\beta+1)p>r;\\[6pt] L^{\frac{1}{2\beta+1}}\big(\varepsilon\sqrt{\ln(1/\varepsilon)}\big)^{\frac{2\beta}{2\beta+1}}, & (2\beta+1)p=r;\\[6pt] L^{\frac{1/2-1/r}{\beta-1/p+1/2}}\big(\varepsilon\sqrt{\ln(1/\varepsilon)}\big)^{\frac{\beta-1/p+1/r}{\beta-1/p+1/2}}, & (2\beta+1)p<r.\end{cases}$$

The minimax rate of convergence in the case $(2\beta+1)p=r$ remains an open problem, and the rate presented in the middle line above is only a lower asymptotic bound for the minimax risk. Comparing $\varphi_\varepsilon$ with $\phi_\varepsilon$, we see that the proposed estimator is adaptive whenever $(2\beta+1)p\le r$. In the case $(2\beta+1)p>r$ we lose only a logarithmic factor with respect to the optimal rate and, as mentioned in the Introduction, the construction of an adaptive estimator without such a loss over a collection of classes in this case remains an open problem.
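The three zones of Theorem 4 can be tabulated mechanically; the helper below (our own, not from the paper) returns the exponent of $\varepsilon\sqrt{\ln(1/\varepsilon)}$ in $\varphi_\varepsilon$ for given $(\beta,p,r)$.

```python
from fractions import Fraction

def rate_exponent(beta, p, r):
    """Exponent of eps*sqrt(ln(1/eps)) in varphi_eps(beta, L, p),
    following the zones (2*beta+1)*p vs r of Theorem 4
    (the boundary case carries an extra [ln(1/eps)]^(1/r) factor)."""
    beta, p, r = Fraction(beta), Fraction(p), Fraction(r)
    if (2 * beta + 1) * p >= r:                    # dense zone and boundary
        return 2 * beta / (2 * beta + 1)
    # sparse zone
    return (beta - Fraction(1, 1) / p + Fraction(1, 1) / r) / \
           (beta - Fraction(1, 1) / p + Fraction(1, 2))

dense = rate_exponent(2, 2, 4)    # (2*2+1)*2 = 10 > 4
sparse = rate_exponent(1, 1, 10)  # 3 < 10
```

Note that in the sparse zone the exponent deteriorates as $p$ decreases, reflecting the inhomogeneous smoothness discussed above.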

## 4 Proofs

### 4.1 Proof of Theorem 1

This section starts with the constants used in the statement of the theorem, followed by technical lemmas whose proofs are postponed to the Appendix.

##### Constants
$$C_{r,1}(Q,K)=8\Big[\Lambda(K,Q)+\sqrt{4r+2}+1\Big]+c_{r}\Big[\big(2+\sqrt2\big)\Lambda(K,Q)+2\Big]+1;\qquad C_{r,2}(M,Q,K)=2^{1/r}\Big[2M+\Lambda(K,Q)\,c_{2r}\Big].$$

#### Auxiliary results

For any $(\theta,h)\in\mathbb S^{1}\times[\varepsilon^{2},1]$ and any $\nu\in\mathbb S^{1}$ denote

$$S^{(\nu,h)}_{(\theta,h)}(x)=\det\big(\mathcal E_{(\theta,h)}(\nu,h)\big)\int K\big(\mathcal E_{(\theta,h)}(\nu,h)\,(t-x)\big)F(t)\,\mathrm{d}t,\qquad S_{(\theta,h)}(x)=\det\big(\mathcal E_{(\theta,h)}\big)\int K\big(\mathcal E_{(\theta,h)}(t-x)\big)F(t)\,\mathrm{d}t.$$

For ease of notation, we write $h^{*}_{f}=h^{*}_{K,f}(x^\top\theta^\circ)$.

###### Lemma 1.

Grant Assumption 1. Then, for any $\nu\in\mathbb S^{1}$ and any $\eta\le h$ satisfying $h\le h^{*}_{f}$, one has

$$\big|S^{(\nu,h)}_{(\theta^\circ,h)}(x)-S_{(\nu,h)}(x)\big|\le 2\,(h^{*}_{f})^{-1/2}\,\|K\|_{\infty}^{2}\,\varepsilon\sqrt{\ln(1/\varepsilon)};\qquad \big|S_{(\nu,h)}(x)-S_{(\nu,\eta)}(x)\big|\le 2\,(h^{*}_{f})^{-1/2}\,\|K\|_{\infty}^{2}\,\varepsilon\sqrt{\ln(1/\varepsilon)};\qquad \big|S_{(\theta^\circ,h)}(x)-F(x)\big|\le(h^{*}_{f})^{-1/2}\,\|K\|_{\infty}\,\varepsilon\sqrt{\ln(1/\varepsilon)}.$$

Let $\mathcal E_{a,A}$ be a set of $2\times2$ matrices such that

$$\big|\det(E)\big|\ge a,\qquad |E|_{\infty}\le A,\qquad\forall E\in\mathcal E_{a,A}.$$

Here $|E|_{\infty}$ denotes the supremum norm, i.e., the maximum absolute value of the entries of the matrix $E$. Later on, without loss of generality, we will assume that $A\ge1$.

Assume that the function $L$ is compactly supported and satisfies the Lipschitz condition

$$\big|L(u)-L(v)\big|\le\Upsilon\,|u-v|_{2},\qquad\forall u,v\in\mathbb R^{2},$$

where $|\cdot|_{2}$ is the Euclidean norm. Let $y$ be fixed. On the parameter set $\mathcal E_{a,A}$, define a Gaussian random function by

$$\zeta_{y}(E)=\|L\|_{2}^{-1}\sqrt{\big|\det(E)\big|}\int L\big(E(u-y)\big)\,W(\mathrm{d}u).$$

Put and , where .

###### Lemma 2.

For any $z>0$,

$$\mathbb P\Big\{\sup_{E\in\mathcal E_{a,A}}\big|\zeta_{y}(E)\big|\ge c(a,A)+z\Big\}\le\mathbb P\big\{|\varsigma|\ge z\big\}\le e^{-\frac{z^{2}}{2}}.$$

Moreover, for any $q\ge1$,

$$\Big(\mathbb E\Big[\sup_{E\in\mathcal E_{a,A}}\big|\zeta_{y}(E)\big|\Big]^{q}\Big)^{1/q}\le c_{q}\,c(a,A).$$

#### Proof of Theorem 1

Let be such that . Introduce the random events

$$\mathcal A=\big\{(\theta^\circ,h^{*})\in\mathcal P(x)\big\},\qquad\mathcal B=\big\{\hat h\ge h^{*}\big\},\qquad\mathcal C=\mathcal A\cap\mathcal B,$$

and let $\bar{\mathcal C}$ denote the event complementary to $\mathcal C$. We split the proof into two steps.

Risk computation under $\mathcal C$. The triangle inequality gives

$$\big|\hat F_{(\hat\theta,\hat h)}(x)-F(x)\big|\le\big|\hat F_{(\hat\theta,\hat h)}(x)-\hat F_{(\hat\theta,h^{*})}(x)\big|+\big|\hat F^{(\hat\theta,h^{*})}_{(\theta^\circ,h^{*})}(x)-\hat F_{(\hat\theta,h^{*})}(x)\big|+\big|\hat F^{(\hat\theta,h^{*})}_{(\theta^\circ,h^{*})}(x)-\hat F_{(\theta^\circ,h^{*})}(x)\big|+\big|\hat F_{(\theta^\circ,h^{*})}(x)-F(x)\big|. \tag{4.1}$$