[

[

Jianqing Fan    Yunbei Ma    [
Abstract

The varying-coefficient model is an important nonparametric statistical model that allows us to examine how the effects of covariates vary with exposure variables. When the number of covariates is big, the issue of variable selection arrives. In this paper, we propose and investigate marginal nonparametric screening methods to screen variables in ultra-high dimensional sparse varying-coefficient models. The proposed nonparametric independence screening (NIS) selects variables by ranking a measure of the nonparametric marginal contributions of each covariate given the exposure variable. The sure independent screening property is established under some mild technical conditions when the dimensionality is of nonpolynomial order, and the dimensionality reduction of NIS is quantified. To enhance practical utility and the finite sample performance, two data-driven iterative NIS methods are proposed for selecting thresholding parameters and variables: conditional permutation and greedy methods, resulting in Conditional-INIS and Greedy-INIS. The effectiveness and flexibility of the proposed methods are further illustrated by simulation studies and real data applications.

NIS for Ultra-High Dimensional Varying Coefficient Models]Nonparametric Independence Screening in Sparse Ultra-High Dimensional Varying Coefficient Models J. Fan, Y. Ma and W. Dai]and Wei Dai

footnotetext: Address for correspondence: Yunbei Ma, School of Statistics, Southwest University of Finance and Economics, Chengdu, China.
E-mail:myb@swufe.edu.cn

1 Introduction

The development of information and technology drives big data collections in many areas of advanced scientific research ranging from genomic and health science to machine learning and economics. The collected data frequently has an ultra-high dimensionality that is allowed to diverge at nonpolynomial (NP) rate with the sample size , namely for some . For example, in biomedical research such as genomewide association studies for some mental diseases, millions of SNPs are potential covariates. Traditional statistical methods face significant challenges in dealing with such a high-dimensional problem with large sample sizes.

With the sparsity assumption, variable selection helps improve the accuracy of estimation and gain scientific insights. Many significant variable selection techniques have been developed, such as Bridge regression in Frank and Friedman (1993), Lasso in Tibshirani (1996), SCAD and folded concave penalty in Fan and Li (2001), the Elastic net in Zou and Hastie (2005), Adaptive Lasso (Zou, 2006), and the Dantzig selector in Candes and Tao (2007). Methods on the implementation of folded concave penalized least-squares include the local linear approximation algorithm in Zou and Li (2008) and the plus algorithm in Zhang (2010). However, due to the simultaneous challenges of computational expediency, statistical accuracy and algorithmic stability, these methods do not perform well in ultra-high dimensional problems.

To tackle these problems, Fan and Lv (2008) introduced a sure independence screening (SIS) method to select important variables in ultra-high dimensional linear regression models via marginal correlation learning. Hall and Miller (2009) extended the method to the generalized correlation ranking, which was further extended by Fan, Feng and Song (2011) for ultra-high dimensional nonparametric additive models, resulting in nonparametric independence screening (NIS). On a different front, Fan and Song (2010) extended the SIS idea to ultra-high dimensional generalized linear models and devised a useful technical tool for establishing the sure screening results and bounding false selection rates. Other related methods include data-tilling method (Hall, Titterington and Xue, 2009), marginal partial likelihood method MPLE (Zhao and Li, 2010), and robust screening methods by rank correlation (Li, et al. , 2012) and distance correlation (Li, Zhong and Zhu, 2012). Inspired by these previous work, our study will focus on variable screening in nonparametric varying-coefficient models with NP dimensionality.

It is well known that nonparametric models are flexible enough to reduce modeing biases. However, they suffer from the so-called “curse of dimensionality”. A remarkable simple and powerful nonparametric model for dimensionality reductions is the varying-coefficient model,

 Y=\boldmathβT(W)X+ϵ, (0)

where is the vector of covariates, is some observable exposure variables, is the response, and is the random noise with conditional mean 0 and finite conditional variance. An intercept term (i.e., ) can be introduced if necessary. This model assumes that the variables in the covariate vector X enter the model linearly, meanwhile it allows regression coefficient functions to very smoothly with the exposure variable. The model retains general nonparametric characteristics and allows the nonlinear interactions between the exposure variable and the covariates. It arises frequently from economics, finance, politics, epidemiology, medical science, ecology, among others. For an overview, see Fan and Zhang (2008).

When the dimensionality is finite, Fan, Zhang and Zhang (2001) proposed the generalized likelihood ratio (GLR) test to select variables in the varying-coefficient model (1). For the time-varying coefficient model, a special case of (1) with the exposure variable being the time , Wang, Li and Huang (2008) applied the basis function approximations and the SCAD penalty to address the problem of variable selection. In the NP dimensional setting, Lian (2011) utilized the adaptive group Lasso penalty in time-varying coefficient models. These methods still face the aforementioned three challenges.

In this paper, we consider a nonparametric screening by ranking a measure of the marginal nonparametric contributions of each covariate given the exposure variable. For each given covariate, we fit marginal regressions of the response against the covariate conditioning on :

 minaj,bjE[(Y−aj−bjXj)2|W] (0)

Let and be the solution to (1) and and be their nonparametric estimates. Then, we rank the importance of each covariate in the joint model according to a measure of marginal utility (which is equivalent to the goodness of fit) in its marginal model. Under some reasonable conditions, the magnitude of these marginal contributions provides useful probes of the importance of variables in the joint varying-coefficient model. This is an important extension of SIS (Fan and Lv, 2008) to a more flexible class of varying coefficient models.

The sure screening property of NIS can be established under certain technical conditions. In some very specific cases, NIS can even be model selection consistent. In establishing this kind of results, three factors are related to the minimum distinguishable marginal signals: the stochastic error in estimating the nonparametric components, the approximation error in modeling nonparametric components, and the tail distributions of the covariates. Following Fan and Lv (2008) and Fan, Feng and Song (2011), we propose two nonparametric independence screening approaches in an iterative framework. One is called Greedy-INIS, in which we adopt a greedy method in the variable screening step. The other is called Conditional-INIS which is built on conditional random permutation to determine a data driven screening threshold. They both serve to effectively control the false positive rate and false negative rate with enhanced performance.

This article is organized as follows. In Section 2, we fit each marginal nonparametric regression model via B-spline basis approximation and screen variables by ranking a measure of these estimators. In Section 3, we establish the sure screening property and model selection consistency under certain technical conditions. Iterative NIS procedures (namely Greedy-INIS and Conditional-INIS) are developed in Section 4. In Section 5, a set of numerical studies are conducted to evaluate the performance of our proposed methods.

2 Models and Nonparametric Marginal Screening Method

In this section we study the varying-coefficient model with the conditional linear structure as in (1). Assume that the functional coefficient vector is sparse. Let be the true sparse model with nonsparsity size . We allow to grow with and denote it by whenever necessary.

2.1 Marginal Regression

For , let and be the minimizer of the following marginal regression problem:

 minaj(W),bj(W)∈L2(P)E[(Y−aj(W)−bj(W)Xj)2|W], (0)

where denotes the joint distribution of and is the class of square integrable functions under the measure . By some algebra, we have that the minimizer of (2.1) is

 bj(W)=Cov[Xj,Y|W]Var[Xj|W], aj(W)=E[Y|W]−bj(W)E[Xj|W]. (0)

Let , we rank the marginal utility of covariates by

 uj=∥aj(W)+bj(W)Xj∥2−∥a0(W)∥2, (0)

where It can be seen that

 uj = E[b2j(W)(Xj−E[Xj|W])2]=E[(Cov[Xj,Y|W])2Var[Xj|W]]. (0)

For each , if , then has the same quantity as the measure of marginal functional coefficient . On the other hand, this marginal utility is closely related to the conditional correlation between and , as if and only if almost surely.

2.2 Marginal Regression Estimation with B-spline

To obtain an estimate of the marginal utility , , we approximate and by functions in , the space of polynomial splines of degree on , a compact set. Let denote its normalized B-spline basis with where is the sup norm. Then

 aj(W) ≈ Ln∑k=1ηjkBk(W), j=0,⋯,p, bj(W) ≈ Ln∑k=1θjkBk(W), j=1,⋯,p.

where and are scalar coefficients.

We now consider the following sample version of the marginal regression problem:

 min\scriptsize\boldmathηj,% \scriptsize\boldmathθj∈RLn1nn∑i=1(Yi−B(Wi)\boldmathηj−B(Wi)% \boldmathθjXji)2, (0)

where , and .

It is easy to show that the minimizers of (2.2) is given by

 (ˆ\boldmathηTj, ˆ\boldmathθTj)T=(QTnjQnj)−1QTnjY, (0)

where

 Qnj=(Bn, Φnj)=⎛⎜ ⎜⎝B(W1),Xj1B(W1)⋮⋮B(Wn),XjnB(Wn)⎞⎟ ⎟⎠

is an matrix. As a result, the estimates of and , are given by

 ˆanj(W)=B(W)ˆ\boldmathηj=(B(W),0TLn)(QTnjQnj)−1QTnjY, ˆbnj(W)=B(W)ˆ\boldmathθ% j=(0TLn,B(W))(QTnjQnj)−1QTnjY, (0)

where is an -dimension vector with all entries 0. Similarly, we have the estimate of the intercept function by

 ˆan0(W)=B(W)ˆ\boldmathη0=B(W)(BTnBn)−1BTnY, (0)

where

 ˆ\boldmathη0=argmin% \scriptsize\boldmathη0∈RLn1nn∑i=1(Yi−B(Wi)\boldmathη0)2. (0)

We now define an estimate of the marginal utility as

 ˆunj = ∥ˆanj(W)+ˆbnj(W)Xj∥2n−∥ˆan0(W)∥2n (0) = 1nn∑i=1(ˆanj(Wi)+ˆbnj(Wi)Xji)2−1nn∑i=1(ˆan0(Wi))2,

where . Note that throughout this paper, whenever two vectors a and b are of the same length, ab denotes the componentwise product. Given a predefined threshold value , we select a set of variables as follows:

 Mτn={1≤j≤p:ˆunj≥τn}. (0)

Alternatively, we can rank the covariates by the residual sum of squares of marginal nonparametric regressions, which is defined as

 ˆvnj=∥Y−ˆanj(W)−ˆbnj(W)Xj∥2n, (0)

and we select variables as follows,

 Mνn={1≤j≤p:ˆvnj≤νn}, (0)

where is a predefined threshold value.

It is worth noting that ranking by marginal utility is equivalent to ranking by the measure of goodness of fit . To see the equivalence, first note that

 ∥ˆanj(W)+ˆbnj(W)Xj∥2n=1nYTQnj(QTnjQnj)−1QTnjY, (0)

and

 1nn∑i=1Yi(ˆanj(Wi)+ˆbnj(Wi)Xji)=1nYTQnj(QTnjQnj)−1QTnjY. (0)

It follows from (2.2) and (2.2) that

 ˆvnj=∥Y∥2n−∥ˆan0(W)∥2n−ˆunj. (0)

Since the first two terms on the right hand side of (2.2) do not vary in , ranking by is the same as that by . Therefore, selecting variables with large marginal utility is the same as picking those that yield small marginal residual sum of squares.

To bridge and , we define the population version of the marginal regression using B-spline basis. From now on, we will omit the argument in and write B whenever the context is clear. Let and , where and are the minimizer of

 min\scriptsize\boldmathηj,% \scriptsize\boldmathθj∈RLnE[(Y−B\boldmathηj−B\boldmathθjXj)2], (0)

and , where is the minimizer of

 min\scriptsize\boldmathη0∈RLnE[(Y−B\boldmathη0)2]. (0)

It can be seen that

 (~aj(W),~bj(W))T = diag(B,B)(E[QTjQ% j])−1E[QTjY], (0) ~a0(W) = B(E[BTB])−1E[B% TY], (0)

where

 ~uj = ∥~aj(W)+~bj(W)Xj∥2−∥~a0(W)∥2 (0) = E[YQj](E[QTjQj])−1E[QTjY]−E[YB](E[BTB])−1E[BTY].

3 Sure Screening

In this section, we establish the sure screening properties of the proposed method for model (1). Recall that by (2.1) the population version of marginal utility quantifies the relationship between and as follows:

 uj=E[(Cov[Xj,Y|W])2Var[Xj|W]],j=1,⋯,p. (0)

Then the following two conditions guarantee that the marginal signal of the active components does not vanish.

(i)

Suppose for , is uniformly bounded away from 0 and infinity on , where is the compact support of . That is, there exist some positive constants and , such that .

(ii)

, for some and .

Then under conditions (i) and (ii),

 minj∈M∗uj≥c1Lnn−2κ/h2. (0)

Note that in condition (ii), the number of basis functions is not intrinsic. By the Remark 1 below, should be chosen in correspondence to the smoothness condition of the nonparametric component. Therefore, condition (ii) depends only on and smoothness parameter in condition (iii). We keep here to make the relationship more explicit.

3.1 Sure Screening Properties

The following conditions (iii)-(vii) are required for the B-spline approximation in marginal regressions and establishing the sure screening properties.

(iii)

The density function of is bounded away from zero and infinity on . That is, for some constants and .

(iv)

Functions and belong to a class of functions , whose th derivative exists and is Lipschitz of order . That is,

 B={f(⋅):|f(r)(s)−f(r)(t)|≤M|s−t|α for s,t∈W},

for some positive constant , where is a nonnegative integer and such that .

(v)

Suppose for all , there exists a positive constant and , such that

 P(|Xj|>t|W)≤exp(1−(t/K1)r1), (0)

uniformly on , for any . Furthermore, let , where . Suppose there exists some positive constants and satisfying , such that

 P(|m(X∗)|>t|W)≤exp(1−(t/K2)r2). (0)

uniformly on , for any .

(vi)

The random errors are i.i.d with conditional mean 0, and there exists some positive constants and satisfying , such that

 P(|ε|>t|W)≤exp(1−(t/K3)r3), (0)

uniformly on , for any .

(vii)

There exists some constant such that .

Proposition 1

Under conditions (i)-(v), there exists a positive constant such that

 uj−~uj≤M1L−2dn. (0)

In addition, when for some , we have

 minj∈M∗~uj≥c1ξLnn−2κ. (0)
Remark 1

It follows from Proposition 1 that the minimum signal level of is approximately the same as , provided that the approximation error is negligible. It also shows that the number of basis functions should be chosen as

 Ln≥Cn2κ/(2d+1),

for some positive constant . In other words, the smoother the underlying function is (i.e., the larger is), the smaller we can take.

The following Theorem 1 provides the sure screening properties of the nonparametric independence screening method proposed in Section 2.2.

Theorem 1

Suppose conditions (i)-(vi) hold.

1. If as , then for any , there exist some positive constants and such that

 P(max1≤j≤p|ˆunj−~uj|≥c2Lnn−2κ) (0) ≤ 12pnLn{(2+Ln)exp(−c3n1−4κL−3n)+3Lnexp(−c4L−3nn)}.
2. If condition (vii) also holds, then by taking with , there exist positive constants and such that

 P(M∗⊂ˆMτn) ≥ 1−12snLn{(2+Ln)exp(−c6n1−4κL−3n) (0) +3Lnexp(−c7L−3nn)}.
Remark 2

According to Theorem 1 , we can handle NP dimensionality

 p=o(exp{n1−4κL−3n}).

It shows that the number of spline bases also affects the order of dimensionality: the smaller is, the higher dimensionality we can handle. On the other hand, Remark 1 points out that it is required to have a good bias property. This means that the smoother the underlying function is (i.e. the larger is), the smaller we can take, and consequently higher dimensionality can be handled. The compatibility of these two requirements requires that , which implies that . We can take , which is the optimal convergence rate for nonparametric regression (Stone, 1982). In this case, the allowable dimensionality can be as high as

 p=o(exp{n2(d−1)2d+1}).

3.2 False Selection Rates

According to (1), the ideal case for vanishing false-positive rate is when

 maxj∉M∗~uj=o(Lnn−2κ)

so that there is a natural separation between important and unimportant variables. By Theorem 1(i), when (1) tends to zero, we have with probability tending to 1 that

 maxj∉M∗ˆunj≤cLnn−2κ, for any c>0.

Consequently, by choosing as in Theorem 1(ii), NIS can achieve the model selection consistency under this ideal situation, i.e.,

 P(ˆMτn=M∗)=1−o(1).

In particular, this ideal situation occurs under the partial orthogonality condition, i.e., is independent of given , which implies for

In general, the model selection consistency can not be achieved by a single step of marginal screening. The marginal probes can not separate important variables from unimportant variables. The following Theorem 2 quantifies how the size of selected models is related to the matrix of basis functions and the thresholding parameter .

Theorem 2

Under the same conditions in Theorem 1, for any , there exist positive constants and such that

 P{|ˆMτn|≤O(n2κλmax(\boldmathΣ))} ≥ 1−12pnLn{(2+Ln)exp(−c8n1−4κL−3n) (0) +3Lnexp(−c9nL−3n)},

where , and is a functional vector of dimension.

4 Iterative Nonparametric Independence Screening

As Fan and Lv (2008) points out, in practice the nonparametric independence screening (NIS) would still suffer from false negative (i.e., miss some important predictors that are marginally weakly correlated but jointly correlated with the response), and false positive (i.e., select some unimportant predictors which are highly correlated with the important ones). Therefore, we adopt an iterative framework to enhance the performance of this method. We repeatedly apply the large-scale variable screening (NIS) followed by a moderate-scale variable selection, where we use group-SCAD penalty as our selection strategy. In the NIS step, we propose two methods to determine a data-driven threshold for screening, which result in Conditional-INIS and Greedy-INIS, respectively.

4.1 Conditional-INIS Method

The conditional-INIS method builds upon conditional random permutation in determining the thresholding . Recall the random permutation used in Fan, Feng and Song (2011), which generalizes that Zhao and Li (2010). Randomly permute Y to get and compute , where is a permutation of , based on the randomly coupled data that has no relationship between covariates and response. Thus, these estimates serve as the baseline of the marginal utilities under the null model (no relationship). To control the false selection rate at under the null model, one would choose the screening threshold be , the th-ranked magnitude of . Thus, the NIS step selects variables . In practice, one frequently uses , namely, the largest marginal utility under the null model.

When the correlations among covariates are large, there will be hardly any differentiability between the marginal utilities of the true variables and the false ones. This makes the selected variable set very large to begin with and hard to proceed the rest of iterations with limited false positives. For numerical illustrations, see section 5.2. Therefore, we propose a conditional permutation method to tackle this problem. Combining the other steps, our Conditional-INIS algorithm proceeds as follows.

0.

For , compute

 ˆunj=∥ˆanj(W)+ˆbnj(W)% Xj∥2n−∥ˆan0(W)∥2n,

where the estimates are defined in (2.2) and (2.2) using . Select the top variables by ranking their marginal utilities , resulting in the index subset to condition upon.

1.

Regress Y on , and get intercept and their functional coefficients’ estimators . Conditioning on , the -dimensional partial residual is

 Y∗=Y−ˆβn0(W)−∑j∈M0Xjˆβnj(W).

For all , compute using , which measures the additional utility of each covariate conditioning on the selected set .

To determine the threshold for NIS, we apply random permutation on the partial residual , which yields . Compute based on the decoupled data . Let be the th-ranked magnitude of . Then, the active variable set of variables is chosen as

 A1={j:ˆu∗nj≥τ∗q,j∈Mc0}∪M0.

In our numerical studies, .

2.

Apply the group-SCAD penalty on to select a subset of variables . Details about the implementation of SCAD will be described later.

3.

Repeat step 1-2, where we replace in step 1 by , , and get and in step 2. Iterate until for some or , for some prescribed positive integer .

4.2 Greedy-INIS Method

Following Fan, Feng and Song (2011), we also implemented a greedy version of INIS method. We skip step 0 and start from step 1 in the algorithm above (i.e., take ), and select the top variables that have the largest marginal norms . This NIS step is followed by the same group-SCAD penalized regression as in step 2. We then iterate these steps until there are two identical subsets or the number of variables selected exceeds a prespecified . In our simulation studies, is set as .

In the group-SCAD step, variables are selected as through minimizing the following objective function:

 min\scriptsize\boldmathγ0,% \scriptsize\boldmathγj∈RLn1nn∑i=1(Yi−B(Wi)\boldmathγ0−∑j∈AlB(Wi)Xji\boldmathγj)2+∑j∈Alpλ(||\boldmathγj||B), (0)

where and is the SCAD penalty such that

 p′λ(|x|) = λI(|x|≤λ)+(aλ−|x|)+a−1I(|x|>λ),

with . We set as suggested and solve the optimization above via local quadratic approximations (Fan and Li, 2001). is chosen by BIC criteria , where is the number of covariates chosen. By Antoniadis and Fan (2001) and Yuan and Lin (2006), the norm-penalty in (4.3) encourages the group selection.

5 Numerical Studies

In this section, we carry out several simulation studies to assess the performance of our proposed methods. If not otherwise stated, the common setup for the following simulations are: cubic B-spline, sample size , the number of variables , and the number of simulations for each example.

5.1 Comparison of Minimum Model Size

In this study, as in Fan and Song (2010), we illustrate the performance of NIS method in terms of the minimum model size (MMS) needed to include all the true variables, i.e., to possess sure screening property.

Example 1

Following Fan and Song (2010), we first consider a linear model as a special case of the varying coefficient model. Let be i.i.d. standard normal random variables and

 Xk=s∑j=1(−1)j+1Xj/5+√1−s25ξk,k=951,⋯,1000,

where are standard normal random variables. We construct the following model: , where and has nonzero components. To carry out NIS, we define an exposure independently from the standard uniform distribution.

We compare NIS, Lasso and SIS (independence screening for linear models). The boxplots of minimum model size are presented in Figure 1. Note that when , the irrepresentable condition fails, and Lasso performs badly even in terms of pure screening. On the other hand, SIS performs better than NIS because the coefficients are indeed constant, and there are fewer parameters () involved in SIS than those of NIS ().

Example 2

For the second example, we illustrate that when the underlying model’s coefficients are indeed varying, we do need nonparametric independence screening. Let be i.i.d. uniform random variables on , based on which we construct X and as follows:

 Xj = Uj+t1Up+11+t1,j=1,⋯,p,W=Up+2+t2Up+11+t2,

where and controls the correlation among the covariates X and the correlation between X and , respectively. When , ’s are uncorrelated, and when the correlation is . If , ’s and are also correlated with correlation coefficient 0.5.

For the varying coefficients part, we take coefficient functions

 β1(W)=W,β2(W)=(2W−1)2,β3(W)=sin(2πW).

The true data generation model is

 Y=5β1(W)⋅X1+3β2(W)⋅X2+4β3(W)⋅X3+ϵ,

where ’s are i.i.d. standard Gaussian random variable.

Under different correlation settings, the comparison MMS between NIS and SIS methods are presented in Figure 2. When the correlation gets stronger, independence screening becomes harder.

5.2 Comparison of Permutation and Conditional Permutation

In this section, we illustrate the performance the conditional random permutation method.

Example 3

Let be i.i.d. standard normal, be i.i.d. standard uniformly distributed random variables, and the noise follows the standard normal distribution. We construct and as follows:

 Xj = Zj+t1U11+t1,j=1,⋯,p,W=U2+t2U11+t2, Y = 2X1+3W⋅X2+(W+1)2⋅X3+4sin(2πW)2−sin(2πW)⋅X4+ϵ.

We will take , resulting in uncorrelated case and and , corresponding to for all and . By taking (i.e., take the maximum value of the marginal utility of the permuted estimates), we report the average of the true positive number (TP), model size, the lower bound of the marginal signal of true variables and the upper bound of the marginal signal of false variables for different correlation settings based on simulations. Their robust standard deviations are also reported therein.

Based on Table 1, we see that when the correlation gets stronger, although sure screening properties can be achieved most of the time via unconditional () random permutation thresholding, the model size becomes very large and therefore the false selection rate is high. The reason is that there is no differentiability between the marginal signals of the true variables and the false ones. This drawback makes the original random permutation not a feasible method to determine the screening threshold in practice.