
# The EFM approach for single-index models

Sun Yat-sen University, Humboldt-Universität zu Berlin and National Central University, and Hong Kong Baptist University and Yunnan University of Finance and Economics

X. Cui
School of Mathematics and Computational Science
Sun Yat-sen University
Guangzhou, Guangdong Province, 510275
P.R. China

W. K. Härdle
CASE-Center for Applied Statistics and Economics
Humboldt-Universität zu Berlin
Wirtschaftswissenschaftliche Fakultät
Spandauer Str. 1, 10178 Berlin
Germany

L. Zhu
FSC1207, Fong Shu Chuen Building
Department of Mathematics
Hong Kong Baptist University
Kowloon Tong, Hong Kong
P.R. China

Received April 2010; revised December 2010.
###### Abstract

Single-index models are natural extensions of linear models and circumvent the so-called curse of dimensionality. They are becoming increasingly popular in many scientific fields including biostatistics, medicine, economics and financial econometrics. Estimating and testing the model index coefficients is one of the most important objectives in the statistical analysis. However, the commonly used assumption on the index coefficients, $\|\beta\| = 1$, represents a nonregular problem: the true index is on the boundary of the unit ball. In this paper we introduce the EFM approach, a method of estimating functions, to study the single-index model. The procedure is to first relax the equality constraint to one with $(d-1)$ components of $\beta$ lying in an open unit ball, and then to construct the associated estimating functions by projecting the score function onto the linear space spanned by the residuals, with the unknown link being estimated by kernel estimating functions. The root-$n$ consistency and asymptotic normality of the estimator obtained from solving the resulting estimating equations are achieved, and a Wilks type theorem for testing the index is demonstrated. A noticeable result we obtain is that our estimator for $\beta$ has smaller or equal limiting variance than the estimator of Carroll et al. [J. Amer. Statist. Assoc. 92 (1997) 447–489]. A fixed-point iterative scheme for computing this estimator is proposed. This algorithm only involves one-dimensional nonparametric smoothers, thereby avoiding the data sparsity problem caused by high model dimensionality. Numerical studies based on simulation and on applications suggest that this new estimating system is quite powerful and easy to implement.

Ann. Statist. 39 (2011), no. 3, 1658–1688. DOI: 10.1214/10-AOS871.


Xia Cui (cuixia@mail.sysu.edu.cn), Wolfgang Karl Härdle (haerdle@wiwi.hu-berlin.de) and Lixing Zhu (lzhu@hkbu.edu.hk).

X. Cui was supported by NNSF project (11026194) of China, RFDP (20100171120042) of China and "the Fundamental Research Funds for the Central Universities" (11lgpy26) of China. W. K. Härdle was supported by Deutsche Forschungsgemeinschaft SFB 649 "Ökonomisches Risiko." L. Zhu was supported by a Grant (HKBU2030/07P) from the Research Grants Council of Hong Kong, Hong Kong, China.

AMS subject classifications: 62G08, 62G20.

Keywords: single-index models, index coefficients, estimating equations, asymptotic properties, iteration.

## 1 Introduction

Single-index models combine flexibility of modeling with interpretability of (linear) coefficients. They circumvent the curse of dimensionality and are becoming increasingly popular in many scientific fields. The reduction of dimension is achieved by assuming the link function to be a univariate function applied to the projection of the explanatory covariate vector onto some direction. In this paper we consider an extension of single-index models where, instead of a distributional assumption, assumptions on only the mean function and variance function of the response are made. Let $(X_i, Y_i)$, $i = 1, \ldots, n$, denote the observed values, with $Y_i$ being the response variable and $X_i$ the $d$-dimensional vector of explanatory variables. The relationship of the mean and variance of $Y_i$ is specified as follows:

$$E(Y_i \mid X_i) = \mu\{g(\beta^\top X_i)\}, \qquad \operatorname{Var}(Y_i \mid X_i) = \sigma^2 V\{g(\beta^\top X_i)\}, \tag{1}$$

where $\mu$ is a known monotonic function, $V$ is a known covariance function, $g$ is an unknown univariate link function and $\beta$ is an unknown index vector which belongs to the parameter space $\Theta = \{\beta \in \mathbb{R}^d : \|\beta\| = 1, \beta_1 > 0\}$. Here we assume the parameter space is $\Theta$ rather than the entire $\mathbb{R}^d$ in order to ensure that $g$ in the representation (1) can be uniquely defined. This is a commonly used assumption on the index parameter [see Carroll et al. (1997), Zhu and Xue (2006), Lin and Kulasekera (2007)]. Another reparameterization is to fix the sign of one component for the sign identifiability and to rescale the vector to unit length for the scale identifiability. Clearly such a parameterization can also span the parameter space $\Theta$, since the rescaled vector has norm 1 and a positive first component. However, the fixed-point algorithm recommended in this paper for normalized vectors may not be suitable for such a reparameterization. Model (1) is flexible enough to cover a variety of situations. If $\mu$ is the identity function and $V$ is equal to the constant 1, (1) reduces to a single-index model [Härdle, Hall and Ichimura (1993)]. Model (1) is an extension of the generalized linear model [McCullagh and Nelder (1989)] and the single-index model. When the conditional distribution of $Y$ given $X$ is logistic (Bernoulli), then $\mu(z) = e^z/(1+e^z)$ and $V(\mu) = \mu(1-\mu)$.
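As a concrete instance of model (1), the following sketch simulates Bernoulli responses under the logistic mean function mentioned above. The link $g$, the dimension and the sample size are illustrative choices, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_single_index(n=500, d=4, g=np.sin):
    """Draw (X, Y) from model (1) with mu(z) = e^z/(1+e^z) and
    V(mu) = mu(1 - mu) (Bernoulli case); g and d are illustrative."""
    beta = np.ones(d) / np.sqrt(d)        # ||beta|| = 1, first component > 0
    X = rng.normal(size=(n, d))
    index = X @ beta                      # one-dimensional projection beta^T X
    p = 1.0 / (1.0 + np.exp(-g(index)))   # mu{g(beta^T X)}
    Y = rng.binomial(1, p)                # Var(Y|X) = mu(1 - mu), sigma^2 = 1
    return X, Y, beta

X, Y, beta = simulate_single_index()
```

The identifiability constraint is visible in the normalization of `beta`; only the direction of the index, not its length, is estimable.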

The main challenges of estimation in the semiparametric model (1) are that the support of the infinite-dimensional nuisance parameter $g$ depends on the finite-dimensional parameter $\beta$, and that $\beta$ lies on the boundary of a unit ball. The former challenge forces us to deal with the infinite-dimensional nuisance parameter $g$ when estimating $\beta$. The latter represents a nonregular problem: the classic assumptions underlying the asymptotic properties of the estimates for $\beta$ are not valid. In addition, as a model proposed for dimension reduction, the dimension $d$ may be very high, and one often meets computational problems. To attack the above problems, in this paper we develop an estimating function method (EFM) and then introduce a computational algorithm to solve the equations based on a fixed-point iterative scheme. We first choose an identifiable parameterization which transforms the boundary of a unit ball in $\mathbb{R}^d$ to the interior of a unit ball in $\mathbb{R}^{d-1}$. By eliminating $\beta_1$, the parameter space can be rearranged to the form $\{(\beta_2, \ldots, \beta_d)^\top : \sum_{k=2}^d \beta_k^2 < 1\}$. Then the derivatives of a function with respect to $\beta^{(1)} = (\beta_2, \ldots, \beta_d)^\top$ are readily obtained by the chain rule, and the classical assumptions for asymptotic normality hold after the transformation. The estimating functions (equations) for $\beta^{(1)}$ can be constructed by replacing $g$ with its estimate $\hat g$. The estimate of the nuisance parameter $g$ is obtained using kernel estimating functions, and the smoothing parameter is selected by cross-validation. For the problem of testing the index, we establish a quasi-likelihood ratio based on the proposed estimating functions and show that, under the null hypothesis, the test statistic asymptotically follows a $\chi^2$-distribution whose degrees of freedom do not depend on nuisance parameters. Then a Wilks type theorem for testing the index is demonstrated.

The proposed EFM technique is essentially a unified method of handling different types of data situations, including categorical response variables and discrete explanatory covariates. The main results of this research are as follows:

1. Efficiency. A surprising result we obtain is that our EFM estimator for $\beta$ has smaller or equal limiting variance than the estimator of Carroll et al. (1997).

2. Computation. The estimating function system only involves one-dimensional nonparametric smoothers, thereby avoiding the data sparsity problem caused by high model dimensionality. Unlike the quasi-likelihood inference [Carroll et al. (1997)], where the maximization is difficult to implement when $d$ is large, the reparameterization and the explicit formulation of the estimating functions facilitate an efficient computation algorithm. Here we use a fixed-point iterative scheme to compute the resultant estimator. The simulation results show that the algorithm adapts to higher model dimensions and richer data situations than the MAVE method of Xia et al. (2002).

It is noteworthy that the EFM approach proposed in this paper cannot be obtained from the SLS method proposed in Ichimura (1993) and investigated in Härdle, Hall and Ichimura (1993). SLS minimizes a weighted least squares criterion, and taking its derivative leads to a biased estimating equation when the weight does not contain the parameter of interest; in general this will not provide a consistent estimator [see Heyde (1997), page 4]. Chang, Xue and Zhu (2010) and Wang et al. (2010) discussed the efficient estimation of single-index models for the case of additive noise. However, their methods are based on estimating equations induced from least squares rather than from the quasi-likelihood, so their estimation does not have an optimality property. Also, their comparison is with the estimator of Härdle, Hall and Ichimura (1993) and its later developments, and it cannot be applied to the setting under study. In this paper, we investigate the efficiency and computation of the estimates for single-index models, and systematically develop and prove the asymptotic properties of the EFM.

The paper is organized as follows. In Section 2, we state the single-index model, discuss estimation of $g$ using kernel estimating functions and of $\beta$ using profile estimating functions, and investigate the problem of testing the index using a quasi-likelihood ratio. In Section 3 we provide a computation algorithm for solving the estimating functions and illustrate the method with simulation and practical studies. The proofs are deferred to the Appendix.

## 2 Estimating function method (EFM) and its large sample properties

In this section, which is concerned with inference based on the estimating function method, the model of interest is determined through specification of the mean and variance functions, up to an unknown index vector $\beta$ and an unknown link function $g$. Except for Gaussian data, model (1) need not be a full semiparametric likelihood specification. Note that the parameter space $\Theta$ means that $\beta$ is on the boundary of a unit ball, and it therefore represents a nonregular problem. So we first choose an identifiable parameterization which transforms the boundary of a unit ball in $\mathbb{R}^d$ to the interior of a unit ball in $\mathbb{R}^{d-1}$. By eliminating $\beta_1$, the parameter space can be rearranged to the form $\{(\beta_2, \ldots, \beta_d)^\top : \sum_{k=2}^d \beta_k^2 < 1\}$. Then the derivatives of a function with respect to $\beta^{(1)} = (\beta_2, \ldots, \beta_d)^\top$ are readily obtained by the chain rule, and the classic assumptions for asymptotic normality hold after the transformation. This reparameterization is the key to analyzing the asymptotic properties of the estimates for $\beta$ and to facilitating an efficient computation algorithm. We will investigate the estimation of $g$ and $\beta$ and propose a quasi-likelihood method to test the statistical significance of certain variables in the parametric component.

### 2.1 The kernel estimating functions for the nonparametric part g

If $\beta$ is known, then we estimate $g$ and $g'$ using local linear estimating functions. Let $h$ denote the bandwidth parameter, and let $K$ denote a symmetric kernel density function satisfying $\int K(t)\,dt = 1$, with $K_h(\cdot) = K(\cdot/h)/h$. The estimation method involves local linear approximation. Denote by $\tilde g(\beta^\top X_j)$ the local linear approximation of $g(\beta^\top X_j)$ in a neighborhood of $t$, that is, $\tilde g(\beta^\top X_j) = g(t) + g'(t)(\beta^\top X_j - t)$. The estimators $\hat g(t)$ and $\hat g'(t)$ are obtained by solving the kernel estimating functions with respect to $(g(t), g'(t))$:

$$
\begin{cases}
\displaystyle\sum_{j=1}^{n} K_h(\beta^\top X_j - t)\,\mu'\{\tilde g(\beta^\top X_j)\} V^{-1}\{\tilde g(\beta^\top X_j)\}\,\big[Y_j - \mu\{\tilde g(\beta^\top X_j)\}\big] = 0,\\[2ex]
\displaystyle\sum_{j=1}^{n} (\beta^\top X_j - t)\,K_h(\beta^\top X_j - t)\,\mu'\{\tilde g(\beta^\top X_j)\} V^{-1}\{\tilde g(\beta^\top X_j)\}\,\big[Y_j - \mu\{\tilde g(\beta^\top X_j)\}\big] = 0.
\end{cases}
\tag{2}
$$

Having solved (2) at each point $t$, we obtain the local linear estimators $\hat g(t)$ and $\hat g'(t)$ of $g(t)$ and $g'(t)$, respectively.
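In the special case of the identity link ($\mu(z) = z$, $V \equiv 1$), the two equations in (2) reduce to the normal equations of a kernel-weighted least squares fit, which has a closed form. The sketch below illustrates this reduction; the Gaussian kernel, bandwidth and data are hypothetical choices for illustration only.

```python
import numpy as np

def local_linear(t, u, y, h):
    """Local linear estimates (g(t), g'(t)) when mu is the identity and
    V is constant: the kernel estimating equations (2) become the normal
    equations of a weighted least squares fit around the point t."""
    w = np.exp(-0.5 * ((u - t) / h) ** 2)          # K_h(u - t); constants cancel
    Z = np.column_stack([np.ones_like(u), u - t])  # local intercept and slope
    WZ = Z * w[:, None]
    a, b = np.linalg.solve(Z.T @ WZ, WZ.T @ y)     # solves both equations in (2)
    return a, b                                    # ghat(t), ghat'(t)

# Illustration on noiseless data y = u^2: at t = 0.5, g = 0.25 and g' = 1,
# up to the usual O(h^2) smoothing bias.
u = np.linspace(0.0, 1.0, 201)
ghat, gphat = local_linear(0.5, u, u ** 2, h=0.1)
```

For a general link, (2) has no closed form and is solved by iteratively reweighted fits of the same structure.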

The key to obtaining the asymptotic normality of the estimates for $\beta$ lies in the asymptotic properties of the estimated nonparametric part. The following proposition provides some useful results. The following notation will be used. Let $\rho_\ell(z) = \{\mu'(z)\}^\ell V^{-1}(z)$, $\ell = 1, 2$, and let $J$ be the Jacobian matrix of size $d \times (d-1)$ with

$$J = \begin{pmatrix} -\beta^{(1)\top}\big/\sqrt{1 - \|\beta^{(1)}\|^2} \\ I_{d-1} \end{pmatrix}, \qquad \beta^{(1)} = (\beta_2, \ldots, \beta_d)^\top.$$

The moments of $K$ and $K^2$ are denoted, respectively, by

$$\gamma_j = \int t^j K(t)\,dt \quad\text{and}\quad \nu_j = \int t^j K^2(t)\,dt.$$
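The Jacobian $J$ above is straightforward to form numerically; the following sketch builds it and verifies it against finite differences of the map $\beta^{(1)} \mapsto \beta$. Function names are illustrative, not from the paper.

```python
import numpy as np

def jacobian_J(beta1):
    """d x (d-1) Jacobian of beta = (sqrt(1 - ||beta1||^2), beta1)
    with respect to beta1 = (beta_2, ..., beta_d)."""
    beta1 = np.asarray(beta1, dtype=float)
    top = -beta1 / np.sqrt(1.0 - beta1 @ beta1)  # chain rule for beta_1
    return np.vstack([top, np.eye(beta1.size)])

def embed(beta1):
    """The reparameterization beta^(1) -> beta on the unit sphere."""
    beta1 = np.asarray(beta1, dtype=float)
    return np.concatenate([[np.sqrt(1.0 - beta1 @ beta1)], beta1])

# Central finite-difference check of the chain rule at a test point.
b1 = np.array([0.3, 0.4])
J = jacobian_J(b1)
eps = 1e-6
J_fd = np.column_stack([
    (embed(b1 + eps * e) - embed(b1 - eps * e)) / (2 * eps)
    for e in np.eye(2)
])
```

The first row of $J$ carries the chain rule for the eliminated coordinate $\beta_1$; the remaining rows are the identity.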
###### Proposition 1

Under regularity conditions (a), (b), (d) and (e) given in the Appendix, we have:

(i) With $h \to 0$ and $nh \to \infty$, the asymptotic conditional bias and variance of $\hat g(\beta^\top x)$ are given by

$$
E\big[\{\hat g(\beta^\top x) - g(\beta^\top x)\}^2 \mid X\big] = \Big\{\tfrac{1}{2}\gamma_2 h^2 g''(\beta^\top x)\Big\}^2 + \frac{\nu_0 \sigma^2}{nh\, f_{\beta^\top x}(\beta^\top x)\,\rho_2\{g(\beta^\top x)\}} + O_P(h^4 + n^{-1}h^{-1}). \tag{3}
$$

(ii) With $h \to 0$ and $nh^3 \to \infty$, for the estimates of the derivative $g'(\beta^\top x)$, it holds that

$$
\begin{aligned}
E\big[\{\hat g'(\beta^\top x) - g'(\beta^\top x)\}^2 \mid X\big] ={}& \Big\{\tfrac{1}{6}\gamma_4\gamma_2^{-1} h^2 g'''(\beta^\top x) + \tfrac{1}{2}\big(\gamma_4\gamma_2^{-1} - \gamma_2\big) h^2 g''(\beta^\top x)\\
&\quad\times \big[\rho_2'\{g(\beta^\top x)\}/\rho_2\{g(\beta^\top x)\} + f_{\beta^\top x}'(\beta^\top x)/f_{\beta^\top x}(\beta^\top x)\big]\Big\}^2\\
&+ O_P(h^4 + n^{-1}h^{-3}).
\end{aligned}
\tag{4}
$$

(iii) With $h \to 0$ and $nh^3 \to \infty$, we have that

$$
E\Bigg[\bigg\|\frac{\partial \hat g(\beta^\top x)}{\partial \beta^{(1)}} - g'(\beta^\top x) J^\top\{x - E(x \mid \beta^\top x)\}\bigg\|^2 \,\Bigg|\, X\Bigg] = O_P(h^4 + n^{-1}h^{-3}). \tag{5}
$$

The proof of this proposition appears in the Appendix. Results (i) and (ii) in Proposition 1 are routine and similar to Carroll, Ruppert and Welsh (1998). In the situation where the variance function is constant and $\mu$ is the identity function, results (i) and (ii) coincide with those given by Fan and Gijbels (1996). From result (iii), it is seen that $\partial \hat g(\beta^\top x)/\partial \beta^{(1)}$ converges in probability to $g'(\beta^\top x) J^\top\{x - E(x \mid \beta^\top x)\}$ rather than to $g'(\beta^\top x) J^\top x$, as it would if $g$ were known. That is, the limit of the derivative differs from the derivative of the limit, which means that convergence in probability and differentiation of the sequence (as a function of $\beta^{(1)}$) cannot commute. This is primarily caused by the fact that the support of the infinite-dimensional nuisance parameter $g$ depends on the finite-dimensional projection parameter $\beta$. In contrast, a semiparametric model where the support of the nuisance parameter is independent of the finite-dimensional parameter is the partially linear regression model, having the form $Y = \theta^\top X + g(T) + \varepsilon$. It is easy to check that in that model the limit of the derivative of the estimated nuisance function with respect to $\theta$ is equal to the derivative of its limit with respect to $\theta$. Result (iii) ensures that the proposed estimator does not require undersmoothing of $\hat g$ to obtain a root-$n$ consistent estimator for $\beta$; it is also of independent interest for the inference theory of semiparametric models.

### 2.2 The asymptotic distribution for the estimates of the parametric part β

We will now proceed to the estimation of $\beta$. We need to estimate the $(d-1)$-dimensional vector $\beta^{(1)}$, the estimator of which will be defined via

$$\sum_{i=1}^{n} \big[\partial\mu\{\hat g(\beta^\top X_i)\}/\partial\beta^{(1)}\big]\, V^{-1}\{\hat g(\beta^\top X_i)\}\,\big[Y_i - \mu\{\hat g(\beta^\top X_i)\}\big] = 0. \tag{6}$$

This is the direct analogue of the "ideal" estimating equation for known $g$, in that it is calculated by replacing $g$ with $\hat g$. An asymptotically equivalent and easily computed version of this equation is

$$\hat G(\beta) \stackrel{\mathrm{def}}{=} \sum_{i=1}^{n} J^\top \hat g'(\beta^\top X_i)\{X_i - \hat h(\beta^\top X_i)\}\rho_1\{\hat g(\beta^\top X_i)\}\big[Y_i - \mu\{\hat g(\beta^\top X_i)\}\big] = 0$$

with $J$ the Jacobian mentioned above, $\hat g$ and $\hat g'$ defined by (2), and $\hat h$ the local estimate of $E(X \mid \beta^\top X = t)$,

$$\hat h(t) = \sum_{i=1}^{n} b_i(t) X_i \Big/ \sum_{i=1}^{n} b_i(t),$$

where the $b_i(t)$ are the associated local linear weights. We use (2.2) to estimate $\beta^{(1)}$ in the single-index model, and then use the fact that $\beta_1 = \sqrt{1 - \|\beta^{(1)}\|^2}$ to obtain $\hat\beta$. The use of (2.2) constitutes in our view a new approach to estimating single-index models; since (2.2) involves smooth pilot estimation of $g$, $g'$ and $E(X \mid \beta^\top X)$, we call it the estimating function method (EFM) for $\beta$.
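To make the structure of (2.2) concrete, the sketch below assembles $\hat G(\beta)$ for the identity link (so $\rho_1$ is constant and can be dropped), using per-point local linear fits for $\hat g$, $\hat g'$ and a kernel-weighted average for $\hat h$. The kernel, bandwidth and data are hypothetical, and the smoothers are simplified stand-ins for those defined by (2).

```python
import numpy as np

def Ghat(beta, X, Y, h=0.25):
    """Estimating function (2.2) sketched for mu(z) = z, V = 1:
    ghat, gphat from local linear fits on the index t_i = beta^T X_i,
    hhat from a kernel-weighted average of X on the same index."""
    n, d = X.shape
    t = X @ beta
    ghat, gphat, hhat = np.empty(n), np.empty(n), np.empty((n, d))
    for i in range(n):
        w = np.exp(-0.5 * ((t - t[i]) / h) ** 2)      # kernel weights at t_i
        Z = np.column_stack([np.ones(n), t - t[i]])
        WZ = Z * w[:, None]
        ghat[i], gphat[i] = np.linalg.solve(Z.T @ WZ, WZ.T @ Y)
        hhat[i] = (w[:, None] * X).sum(axis=0) / w.sum()
    # Jacobian J from beta_1 = sqrt(1 - ||beta^(1)||^2), i.e. beta[0]
    J = np.vstack([-beta[1:] / beta[0], np.eye(d - 1)])
    resid = Y - ghat
    return J.T @ (gphat[:, None] * (X - hhat) * resid[:, None]).sum(axis=0)

# For noiseless linear data the local linear fit is exact, so Ghat vanishes
# at the true index direction.
rng = np.random.default_rng(1)
beta0 = np.array([2.0, 1.0, 1.0])
beta0 /= np.linalg.norm(beta0)
X = rng.normal(size=(60, 3))
G = Ghat(beta0, X, X @ beta0)
```

The output has dimension $d - 1$, reflecting the reparameterization: only the free components $\beta^{(1)}$ are solved for.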

###### Remark 1

The estimating equations (2.2) can be represented as the gradient vector of the following objective function:

$$\hat Q(\beta) = \sum_{i=1}^{n} Q\big[\mu\{\hat g(\beta^\top X_i)\}, Y_i\big]$$

with $Q[\mu, y]$ the quasi-likelihood function and $F$ the inverse function of $\mu$. The existence of such a potential function makes $\hat G$ inherit properties of the ideal likelihood score function. Note that $\{\beta^{(1)} : \|\beta^{(1)}\| < 1\}$ is an open, connected subset of $\mathbb{R}^{d-1}$. By the regularity conditions assumed in the Appendix, the quasi-likelihood function $\hat Q(\beta)$ is twice continuously differentiable there, so that the global maximum of $\hat Q(\beta)$ is achieved at some point. One may ask whether the solution is unique and also consistent. Since the partial derivative $\partial \hat Q(\beta)/\partial \beta^{(1)}$ equals $\hat G(\beta)$, some elementary calculations lead to the Hessian matrix

$$
\begin{aligned}
\frac{1}{n}\frac{\partial^2 \hat Q(\beta)}{\partial\beta^{(1)}\partial\beta^{(1)\top}} ={}& \frac{1}{n}\frac{\partial \hat G(\beta)}{\partial\beta^{(1)}}\\
={}& \frac{1}{n}\sum_{i=1}^{n} \frac{\partial\big[J^\top \hat g'(\beta^\top X_i)\{X_i - \hat h(\beta^\top X_i)\}\rho_1\{\hat g(\beta^\top X_i)\}\big]}{\partial\beta^{(1)}}\big[Y_i - \mu\{\hat g(\beta^\top X_i)\}\big]\\
&- \frac{1}{n}\sum_{i=1}^{n} J^\top \hat g'(\beta^\top X_i)\{X_i - \hat h(\beta^\top X_i)\}\rho_1\{\hat g(\beta^\top X_i)\}\frac{\partial\mu\{\hat g(\beta^\top X_i)\}}{\partial\beta^{(1)}}\\
={}& \frac{1}{n}\sum_{i=1}^{n}\Bigg[-\frac{\partial\{\beta^{(1)}/\sqrt{1 - \|\beta^{(1)}\|^2}\}}{\partial\beta^{(1)}}\,\hat g'(\beta^\top X_i)\{X_{1i} - \hat h_1(\beta^\top X_i)\}\rho_1\{\hat g(\beta^\top X_i)\}\\
&\qquad\quad + J^\top\{X_i - \hat h(\beta^\top X_i)\}\frac{\partial \hat g'(\beta^\top X_i)}{\partial\beta^{(1)\top}}\rho_1\{\hat g(\beta^\top X_i)\}\\
&\qquad\quad + J^\top \hat g'(\beta^\top X_i)\{X_i - \hat h(\beta^\top X_i)\}\frac{\partial \rho_1\{\hat g(\beta^\top X_i)\}}{\partial\beta^{(1)\top}}\\
&\qquad\quad - J^\top \hat g'(\beta^\top X_i)\frac{\partial \hat h(\beta^\top X_i)}{\partial\beta^{(1)}}\rho_1\{\hat g(\beta^\top X_i)\}\Bigg]\big[Y_i - \mu\{\hat g(\beta^\top X_i)\}\big]\\
&- \frac{1}{n}\sum_{i=1}^{n} J^\top \hat g'^{\,2}(\beta^\top X_i)\{X_i - \hat h(\beta^\top X_i)\}\{X_i - \hat h(\beta^\top X_i)\}^\top \rho_2\{\hat g(\beta^\top X_i)\} J.
\end{aligned}
$$

By the regularity conditions in the Appendix, the multipliers of the residuals in the first sum above are bounded. Mimicking the proof of Proposition 1, the first sum can be shown to converge to 0 in probability as $n$ goes to infinity. The second sum converges to a negative semidefinite matrix. If the Hessian matrix is negative definite for all values of $\beta^{(1)}$, then $\hat G(\beta)$ has a unique root. At the sample level, however, estimating functions may have more than one root. For the EFM method, the quasi-likelihood $\hat Q$ exists and can be used to distinguish local maxima from minima. Thus, we suppose (2.2) has a unique solution in the following context.

###### Remark 2

It can be seen from the proof in the Appendix that the population version of $\hat G(\beta)$ is

$$G(\beta) = \sum_{i=1}^{n} J^\top g'(\beta^\top X_i)\{X_i - h(\beta^\top X_i)\}\rho_1\{g(\beta^\top X_i)\}\big[Y_i - \mu\{g(\beta^\top X_i)\}\big], \tag{8}$$

which is obtained by replacing $\hat g$, $\hat g'$ and $\hat h$ in (2.2) with their limits $g$, $g'$ and $h$. One important property of (8) is that the second Bartlett identity holds, for any $\beta$:

$$E\{G(\beta)G^\top(\beta)\} = -E\bigg\{\frac{\partial G(\beta)}{\partial\beta^{(1)}}\bigg\}.$$

This property makes the semiparametric efficiency of the EFM (2.2) possible.
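The second Bartlett identity can be checked numerically in a stripped-down case: a one-parameter Bernoulli model with the link known and the index one-dimensional, where the ordinary score plays the role of (8). The Monte Carlo settings below are hypothetical and chosen only to make the check fast.

```python
import numpy as np

# Compare E{G G^T} with -E{dG/dbeta} for a scalar logistic score
# evaluated at the true parameter.
rng = np.random.default_rng(2)
beta0, n, reps = 0.7, 2000, 400

lhs = 0.0  # Monte Carlo estimate of E{G^2}
rhs = 0.0  # Monte Carlo estimate of -E{dG/dbeta}
for _ in range(reps):
    x = rng.normal(size=n)
    p = 1.0 / (1.0 + np.exp(-beta0 * x))   # mu at the true parameter
    y = rng.binomial(1, p)
    G = np.sum(x * (y - p))                # quasi-score, the analogue of (8)
    dG = -np.sum(x ** 2 * p * (1.0 - p))   # derivative of the score in beta
    lhs += G * G / reps
    rhs += -dG / reps

ratio = lhs / rhs  # should be close to 1 by the second Bartlett identity
```

Both sides estimate the same information quantity, which is exactly what makes the sandwich variance collapse and drives the efficiency result discussed next.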

Let $\beta_0$ denote the true parameter, and let $A^{+}$ denote the Moore–Penrose inverse of any given matrix $A$. We have the following asymptotic result for the estimator $\hat\beta^{(1)}$.

###### Theorem 2.2

Assume that the estimating function (2.2) has a unique solution, and denote it by $\hat\beta^{(1)}$. If the regularity conditions (a)–(e) in the Appendix are satisfied, the following results hold:

(i) With $h \to 0$ and $nh \to \infty$, $\hat\beta^{(1)}$ converges in probability to the true parameter $\beta^{(1)}_0$.

(ii) If, in addition, $h$ satisfies the bandwidth rate conditions given in the Appendix, then

$$\sqrt{n}\big(\hat\beta^{(1)} - \beta^{(1)}_0\big) \stackrel{L}{\longrightarrow} N_{d-1}\big(0, \Sigma_{\beta^{(1)}_0}\big), \tag{9}$$

where $\Sigma_{\beta^{(1)}_0} = \{J^\top \Omega J\}^{-1}\big|_{\beta = \beta_0}$, and

$$\Omega = E\big[\{XX^\top - E(X \mid \beta^\top X)E(X^\top \mid \beta^\top X)\}\,\rho_2\{g(\beta^\top X)\}\{g'(\beta^\top X)\}^2/\sigma^2\big].$$
###### Remark 3

Note that $\Omega\beta = 0$, so the nonnegative definite matrix $\Omega$ degenerates in the direction of $\beta$. If the mean function $\mu$ is the identity function and the variance function $V$ is equal to a scale constant, that is, $\mu(z) = z$ and $V(z) \equiv 1$, the matrix $\Omega$ in Theorem 2.2 reduces to

$$\Omega = E\big[\{XX^\top - E(X \mid \beta^\top X)E(X^\top \mid \beta^\top X)\}\{g'(\beta^\top X)\}^2/\sigma^2\big].$$

Technically speaking, Theorem 2.2 shows that an undersmoothing approach is unnecessary and that root-$n$ consistency can be achieved. The asymptotic covariance can in general be estimated by replacing the terms in its expression by estimates of those terms. The asymptotic normality of $\hat\beta$ follows from Theorem 2.2 by a simple application of the multivariate delta method, since $\hat\beta = \big(\sqrt{1 - \|\hat\beta^{(1)}\|^2}, \hat\beta^{(1)\top}\big)^\top$. According to the results of Carroll et al. (1997), the asymptotic variance of their estimator is $\Omega^{+}\big|_{\beta=\beta_0}$. Define the block partition of the matrix $\Omega$ as follows:

$$\Omega = \begin{pmatrix} \Omega_{11} & \Omega_{12} \\ \Omega_{21} & \Omega_{22} \end{pmatrix}, \tag{10}$$

where $\Omega_{11}$ is a positive constant, $\Omega_{12}$ is a $(d-1)$-dimensional row vector, $\Omega_{21}$ is a $(d-1)$-dimensional column vector and $\Omega_{22}$ is a nonnegative definite $(d-1)\times(d-1)$ matrix.

###### Corollary 1

Under the conditions of Theorem 2.2, we have

$$\sqrt{n}(\hat\beta - \beta_0) \stackrel{L}{\longrightarrow} N_d(0, \Sigma_{\beta_0}) \tag{11}$$

with $\Sigma_{\beta_0} = J \Sigma_{\beta^{(1)}_0} J^\top\big|_{\beta=\beta_0}$. Further,

$$\Sigma_{\beta_0} \le \Omega^{+}\big|_{\beta=\beta_0}$$

and a strict less-than sign holds when $\Omega_{12} \ne 0$. That is, in this case the EFM is more efficient than that of Carroll et al. (1997).

The possibly smaller limiting variance derived from the EFM approach partly benefits from the reparameterization, which allows the quasi-likelihood to be adopted; as is well known, the quasi-likelihood often has an optimality property. In contrast, most existing methods treat the estimation of $\beta$ as if it were done in the framework of linear dimension reduction. The target of linear dimension reduction is to find directions that linearly transform the original vector of variables into a vector of lower dimension; ADE and SIR are two relevant methods. However, when the link function is the identity, the limiting variance derived here may not be smaller than or equal to the ones of Wang et al. (2010) and Chang, Xue and Zhu (2010) when the quasi-likelihood of (2.5) is applied.

### 2.3 Profile quasi-likelihood ratio test

In applications, it is important to test the statistical significance of added predictors in a regression model. Here we establish a quasi-likelihood ratio statistic to test the significance of certain variables in the linear index. The null hypothesis that the model is correct is tested against a full model alternative. Fan and Jiang (2007) gave a recent review about generalized likelihood ratio tests. Bootstrap tests for nonparametric regression, generalized partially linear models and single-index models have been systematically investigated [see Härdle and Mammen (1993), Härdle, Mammen and Müller (1998), Härdle, Mammen and Proenca (2001)]. Consider the testing problem:

$$H_0\colon g(\cdot) = g\Bigg(\sum_{k=1}^{r}\beta_k X_k\Bigg) \;\longleftrightarrow\; H_1\colon g(\cdot) = g\Bigg(\sum_{k=1}^{r}\beta_k X_k + \sum_{k=r+1}^{d}\beta_k X_k\Bigg). \tag{12}$$

We mainly focus on testing $H_0\colon \beta_{r+1} = \cdots = \beta_d = 0$, though the following test procedure can easily be extended to a general linear hypothesis specified by a known matrix with full row rank. The profile quasi-likelihood ratio test statistic is defined by

$$T_n = 2\Big\{\sup_{\beta\in\Theta}\hat Q(\beta) - \sup_{\beta\in\Theta,\,\tilde\beta=0}\hat Q(\beta)\Big\}, \tag{13}$$

where $\tilde\beta = (\beta_{r+1}, \ldots, \beta_d)^\top$. The following Wilks type theorem shows that the distribution of $T_n$ is asymptotically chi-squared and independent of nuisance parameters.

###### Theorem 2.3

Under the assumptions of Theorem 2.2, if $\beta_{r+1} = \cdots = \beta_d = 0$, then

$$T_n \stackrel{L}{\longrightarrow} \chi^2(d-r). \tag{14}$$

## 3 Numerical studies

### 3.1 Computation of the estimates

Solving the joint estimating equations (2) and (2.2) poses some interesting challenges, since the functions $\hat g$, $\hat g'$ and $\hat h$ depend on $\beta$ implicitly. Treating $\beta^\top X$ as a new predictor (with $\beta$ given), (2) delivers $\hat g$ and $\hat g'$ as in Fan, Heckman and Wand (1995). We therefore focus on (2.2) as estimating equations for $\beta^{(1)}$. They cannot be solved explicitly, and hence one needs to find solutions using numerical methods. The Newton–Raphson algorithm is one of the most popular and successful root-finding methods. However, the computational speed of this algorithm crucially depends on the initial value. We therefore propose a fixed-point iterative algorithm that is not very sensitive to starting values and is adaptive to larger dimensions. It is worth noting that this algorithm can be implemented even when the dimension $d$ is large, because the resultant procedure only involves one-dimensional nonparametric smoothers, thereby avoiding the data sparsity problem caused by high dimensionality.

Rewrite the estimating functions (2.2) as $\hat G(\beta) = J^\top \hat F(\beta)$ with

$$\hat F(\beta) = \big(\hat F_1(\beta), \ldots, \hat F_d(\beta)\big)^\top$$

and

$$\hat F_s(\beta) = \sum_{i=1}^{n} \hat g'(\beta^\top X_i)\{X_{si} - \hat h_s(\beta^\top X_i)\}\rho_1\{\hat g(\beta^\top X_i)\}\times\big[Y_i - \mu\{\hat g(\beta^\top X_i)\}\big], \qquad s = 1, \ldots, d.$$

Setting $\hat G(\beta) = 0$, we have that

$$
\begin{cases}
-\beta_2 \hat F_1(\beta)\big/\sqrt{1 - \|\beta^{(1)}\|^2} + \hat F_2(\beta) = 0,\\
-\beta_3 \hat F_1(\beta)\big/\sqrt{1 - \|\beta^{(1)}\|^2} + \hat F_3(\beta) = 0,\\
\qquad\vdots\\
-\beta_d \hat F_1(\beta)\big/\sqrt{1 - \|\beta^{(1)}\|^2} + \hat F_d(\beta) = 0.
\end{cases}
\tag{15}
$$

Note that ,
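System (15) says that, at a solution, $\hat F(\beta)$ is proportional to $\beta$. This suggests, as a sketch rather than the paper's exact algorithm, the normalized fixed-point update $\beta \leftarrow \hat F(\beta)/\|\hat F(\beta)\|$, with the sign fixed so that the first component stays positive. The toy below runs this update with a hypothetical linear map standing in for $\hat F$, in which case the scheme reduces to power iteration.

```python
import numpy as np

def fixed_point(F, beta, iters=100):
    """Normalized fixed-point iteration beta <- F(beta)/||F(beta)||,
    keeping the first component positive for identifiability."""
    for _ in range(iters):
        v = F(beta)
        beta = v / np.linalg.norm(v)
        if beta[0] < 0:
            beta = -beta
    return beta

# Toy stand-in for Fhat: a linear map. The iteration then converges to the
# dominant eigenvector, which for this diagonal matrix is e_1.
A = np.diag([2.0, 1.0, 0.5])
F = lambda b: A @ b
beta = fixed_point(F, np.ones(3) / np.sqrt(3.0))
```

The update keeps the iterate on the unit sphere at every step, which matches the identifiability constraint $\|\beta\| = 1$ and requires only evaluations of $\hat F$, that is, only one-dimensional smoothers.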