Invited Review Article: A Selective Overview of Variable Selection in High Dimensional Feature Space

Jianqing Fan and Jinchi Lv
Princeton University and University of Southern California
Jianqing Fan is Frederick L. Moore ’18 Professor of Finance, Department of Operations Research and Financial Engineering, Princeton University, Princeton, NJ 08544, USA (e-mail: jqfan@princeton.edu). Jinchi Lv is Assistant Professor of Statistics, Information and Operations Management Department, Marshall School of Business, University of Southern California, Los Angeles, CA 90089, USA (e-mail: jinchilv@marshall.usc.edu). Fan’s research was partially supported by NSF Grants DMS-0704337 and DMS-0714554 and NIH Grant R01-GM072611. Lv’s research was partially supported by NSF Grant DMS-0806030 and a 2008 Zumberge Individual Award from USC’s James H. Zumberge Faculty Research and Innovation Fund. We sincerely thank the Co-Editor, Professor Peter Hall, for his kind invitation to write this article. We are also very grateful for the helpful comments of the Co-Editor, Associate Editor, and referees, which substantially improved the presentation of the paper.
September 1, 2009
Abstract

High dimensional statistical problems arise from diverse fields of scientific research and technological development. Variable selection plays a pivotal role in contemporary statistical learning and scientific discoveries. The traditional idea of best subset selection methods, which can be regarded as a specific form of penalized likelihood, is computationally too expensive for many modern statistical applications. Other forms of penalized likelihood methods have been successfully developed over the last decade to cope with high dimensionality. They have been widely applied for simultaneously selecting important variables and estimating their effects in high dimensional statistical inference. In this article, we present a brief account of the recent developments of theory, methods, and implementations for high dimensional variable selection. Questions such as what limits of dimensionality such methods can handle, what the role of penalty functions is, and what the statistical properties of these methods are have been driving rapid advances in the field. The properties of non-concave penalized likelihood and its roles in high dimensional statistical modeling are emphasized. We also review some recent advances in ultra-high dimensional variable selection, with emphasis on independence screening and two-scale methods.

Short title: Variable Selection in High Dimensional Feature Space

AMS 2000 subject classifications: Primary 62J99; secondary 62F12, 68Q32

Key words and phrases: Variable selection, model selection, high dimensionality, penalized least squares, penalized likelihood, folded-concave penalty, oracle property, dimensionality reduction, LASSO, SCAD, sure screening, sure independence screening

1 Introduction

High dimensional data analysis has become increasingly frequent and important in diverse fields of sciences, engineering, and humanities, ranging from genomics and health sciences to economics, finance and machine learning. It characterizes many contemporary problems in statistics (Hastie, Tibshirani and Friedman (2009)). For example, in disease classification using microarray or proteomics data, tens of thousands of expressions of molecules or ions are potential predictors; in genome-wide association studies between genotypes and phenotypes, hundreds of thousands of SNPs are potential covariates for phenotypes such as cholesterol levels or heights. When interactions are considered, the dimensionality grows quickly. For example, portfolio allocation among two thousand stocks already involves over two million parameters in the covariance matrix; interactions of molecules in the above examples result in ultra-high dimensionality. To be more precise, throughout the paper ultra-high dimensionality refers to the case where the dimensionality grows at a non-polynomial rate as the sample size increases, and high dimensionality refers to the general case of growing dimensionality. Other examples of high dimensional data include high-resolution images, high-frequency financial data, e-commerce data, warehouse data, functional, and longitudinal data, among others. Donoho (2000) convincingly demonstrates the need for developments in high dimensional data analysis, and presents the curses and blessings of dimensionality. Fan and Li (2006) give a comprehensive overview of statistical challenges with high dimensionality in a broad range of topics, and in particular, demonstrate that for a host of statistical problems, the model parameters can be estimated as well as if the best model is known in advance, as long as the dimensionality is not excessively high. The challenges that are not present in smaller scale studies have been reshaping statistical thinking, methodological development, and theoretical studies.

Statistical accuracy, model interpretability, and computational complexity are three important pillars of any statistical procedure. In conventional studies, the number of observations n is much larger than the number of variables or parameters p. In such cases, none of the three aspects needs to be sacrificed for the efficiency of the others. Traditional methods, however, face significant challenges when the dimensionality p is comparable to or larger than the sample size n. These challenges include how to design statistical procedures that are more efficient in inference; how to derive the asymptotic or nonasymptotic theory; how to make the estimated models interpretable; and how to make the statistical procedures computationally efficient and robust.

Figure 1: Distributions of the maximum absolute sample correlation coefficient between a given predictor and the remaining predictors (left panel), and of the maximum absolute multiple correlation coefficient between a given predictor and 5 other predictors (right panel), the latter computed by the stepwise addition algorithm (so the actual values are larger than those presented here), when the predictors are generated as independent Gaussian random variables, for two choices of the dimensionality (solid and dashed curves), based on 1,000 simulations.

A notorious difficulty of high dimensional model selection comes from the collinearity among the predictors. The collinearity can easily be spurious in high dimensional geometry (Fan and Lv (2008)), which can make us select a wrong model. Figure 1 shows that the maximum sample correlation and the maximum multiple correlation with a given predictor can be strikingly large even though the predictors are generated from independent Gaussian random variables. As a result, any variable can be well approximated even by a couple of spurious variables, and can even be replaced by them when the dimensionality is much higher than the sample size. If that variable is a signature predictor and is replaced by spurious variables, we choose wrong variables to associate the covariates with the response and, even worse, the spurious variables can be independent of the response at the population level, leading to completely wrong scientific conclusions. Indeed, when the dimensionality is large, intuition might not be accurate. This is also exemplified by the data piling problems in high dimensional space observed in Hall, Marron and Neeman (2005). Collinearity also gives rise to issues of over-fitting and model mis-identification.

Noise accumulation in high dimensional prediction has long been recognized in statistics and computer sciences. Explicit characterization of this is well-known for high dimensional regression problems. The quantification of the impact of dimensionality on classification was not well understood until Fan and Fan (2008), who give a simple expression showing how dimensionality impacts misclassification rates. Hall, Pittelkow and Ghosh (2008) study a similar problem for distance-based classifiers and observe implicitly the adverse impact of dimensionality. As shown in Fan and Fan (2008), even for the independence classification rule described in Section 4.2, classification using all features can be as bad as a random guess due to noise accumulation in estimating the population centroids in high dimensional feature space. Therefore, variable selection is fundamentally important to high dimensional statistical modeling, including regression and classification.

What makes high dimensional statistical inference possible is the assumption that the regression function lies in a low dimensional manifold. In such cases, the p-dimensional regression parameter vector is assumed to be sparse, with many components being zero, where the nonzero components indicate the important variables. With sparsity, variable selection can improve the estimation accuracy by effectively identifying the subset of important predictors and can enhance the model interpretability with parsimonious representation. It can also help reduce the computational cost when sparsity is very high.

This notion of sparsity is in a narrow sense. It should be understood more widely in transformed or enlarged feature spaces. For instance, some prior knowledge may lead us to apply some grouping or transformation of the input variables (see, e.g., Fan and Lv (2008)). Some transformation of the variables may be appropriate if a significant portion of the pairwise correlations are high. In some cases, we may want to enlarge the feature space by adding interactions and higher order terms to reduce the bias of the model. Sparsity can also be viewed in the context of dimensionality reduction by introducing a sparse representation, i.e., by reducing the number of effective parameters in estimation. Examples include the use of a factor model for high dimensional covariance matrix estimation in Fan, Fan and Lv (2008).

Sparsity arises in many scientific endeavors. In genomic studies, it is generally believed that only a fraction of molecules are related to biological outcomes. For example, in disease classification, it is commonly believed that only tens of genes are responsible for a disease. Selecting tens of genes helps not only statisticians in constructing a more reliable classification rule, but also biologists to understand molecular mechanisms. In contrast, popular but naive methods used in microarray data analysis (Dudoit, Shaffer and Boldrick (2003); Storey and Tibshirani (2003); Fan and Ren (2006); Efron (2007)) rely on two-sample tests to pick important genes, which is truly a marginal correlation ranking (Fan and Lv (2008)) and can miss important signature genes (Fan, Samworth and Wu (2009)). The main goals of high dimensional regression and classification, according to Bickel (2008), are

  • to construct as effective a method as possible to predict future observations;

  • to gain insight into the relationship between features and response for scientific purposes, as well as, hopefully, to construct an improved prediction method.

The former appears in problems such as text and document classification or portfolio optimization, whereas the latter appears naturally in many genomic studies and other scientific endeavors.

As pointed out in Fan and Li (2006), it is helpful to differentiate two types of statistical endeavors in high dimensional statistical learning: accuracy of estimated model parameters and accuracy of the expected loss of the estimated model. The latter property is called persistence in Greenshtein and Ritov (2004) and Greenshtein (2006), and arises frequently in machine learning problems such as document classification and computer vision. The former appears in many other contexts where we want to identify the significant predictors and characterize the precise contribution of each to the response variable. Examples include health studies, where the relative importance of identified risk factors needs to be assessed for prognosis. Many of the existing results in the literature have been concerned with the study of consistency of high dimensional variable selection methods, rather than characterizing the asymptotic distributions of the estimated model parameters. However, consistency and persistence results are inadequate for understanding uncertainty in parameter estimation.

High dimensional variable selection encompasses a majority of frontiers where statistics advances rapidly today. There has been an evolving literature in the last decade devoted to understanding the performance of various variable selection techniques. The main theoretical questions include determining the limits of the dimensionality that such methods can handle and how to characterize the optimality of variable selection procedures. The answers to the first question for many existing methods were largely unknown until recently. To a large extent, the second question still remains open for many procedures. In the Gaussian linear regression model, the case of orthonormal design reduces to the problem of Gaussian mean estimation, as do the wavelet settings where the design matrices are orthogonal. In such cases, the risks of various shrinkage estimators and their optimality have been extensively studied. See, e.g., Donoho and Johnstone (1994) and Antoniadis and Fan (2001).

In this article we address the issues of variable selection for high dimensional statistical modeling in the unified framework of penalized likelihood estimation. This framework has been widely used in statistical inference and machine learning, and is basically a moderate scale learning technique. We also give an overview of techniques for ultra-high dimensional screening. Combined iteratively with large scale screening, penalized likelihood estimation can handle problems of ultra-high dimensionality (Fan, Samworth and Wu (2009)). This will be reviewed as well.

The rest of the article is organized as follows. In Section 2, we discuss the connections of penalized likelihood to classical model selection methods. Section 3 details the methods and implementation of penalized likelihood estimation. We review some recent advances in ultra-high dimensional variable selection in Section 4. In Section 5, we survey the sampling properties of penalized least squares. Section 6 presents the classical oracle property of penalized least squares and penalized likelihood methods in ultra-high dimensional space. We conclude the article with some additional remarks in Section 7.

2 Classical model selection

Suppose that the available data are (x_i, y_i), i = 1, …, n, where y_i is the i-th observation of the response variable and x_i is its associated p-dimensional covariate vector. They are usually assumed to be a random sample from the population (x, Y), in which the conditional mean of Y given x depends on the linear predictor x^T β with β = (β_1, …, β_p)^T. In sparse modeling, it is frequently assumed that most regression coefficients are zero. Variable selection aims to identify all important variables whose regression coefficients do not vanish and to provide effective estimates of those coefficients.

More generally, assume that the data are generated from a true density function indexed by a parameter vector. Often we are uncertain about the true density, but more certain about a larger family of models in which the true parameter vector is a (nonvanishing) subvector of a higher dimensional parameter vector. The problems of how to estimate the dimension of the model and how to compare models of different dimensions arise naturally in many statistical applications, including time series modeling. These are referred to as model selection in the literature.

Akaike (1973, 1974) proposes to choose a model that minimizes the Kullback-Leibler (KL) divergence of the fitted model from the true model. Akaike (1973) considers the maximum likelihood estimator (MLE) of the parameter vector and shows that, up to an additive constant, the estimated KL divergence can be asymptotically expanded as the negative maximized log-likelihood plus a penalty proportional to the dimension of the model. This leads to the AIC. Schwarz (1978) takes a Bayesian approach with prior distributions that have nonzero prior probabilities on some lower dimensional subspaces and proposes the BIC, whose penalty on the model dimension grows proportionally to log n, for model selection. Recently, Lv and Liu (2008) give a KL divergence interpretation of Bayesian model selection and derive generalizations of AIC and BIC when the model may be misspecified.

The work of AIC and BIC suggests a unified approach to model selection: choose a parameter vector β that maximizes the penalized likelihood

(1)    ℓ_n(β) − λ ‖β‖_0,

where ℓ_n(β) is the log-likelihood, the L_0-norm ‖β‖_0 counts the number of non-vanishing components of β, and λ ≥ 0 is a regularization parameter. Given the model size, the solution to (1) is the subset with the largest maximum likelihood among all subsets of that size. The model size is then chosen to maximize (1) among the best subsets of each size. Clearly, the computation of the penalized L_0 problem is a combinatorial problem with NP-complexity.

When the normal likelihood is used, (1) becomes penalized L_0 least squares. Many traditional methods can be regarded as penalized likelihood methods with different choices of λ. Let RSS_m be the residual sum of squares of the best subset with m variables. Then the C_p criterion of Mallows (1973) corresponds to a particular choice of λ determined by the mean squared error of the full model. The adjusted R², given by

R²_adj = 1 − [(n − 1)/(n − m)] RSS_m/SST,

also amounts to a penalized L_0 problem, where SST is the total sum of squares. Clearly, maximizing R²_adj is equivalent to minimizing RSS_m/(n − m). Expanding RSS_m/(n − m) around the error variance shows that the adjusted R² method is approximately equivalent to the penalized L_0 likelihood (1) with a specific choice of λ. Other examples include the generalized cross-validation (GCV), cross-validation (CV), and the RIC in Foster and George (1994). See Bickel and Li (2006) for more discussions of regularization in statistics.
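
To make the comparison of such classical criteria concrete, the following R sketch fits a few candidate submodels by ordinary least squares and reports AIC, BIC and the adjusted R² for each. It is a minimal illustration on hypothetical simulated data, not an example from the paper; the variable names and candidate sets are ours.

    # Compare candidate submodels of a linear model by classical criteria.
    set.seed(1)
    n <- 100; p <- 8
    X <- matrix(rnorm(n * p), n, p)
    y <- X[, 1] - 0.8 * X[, 3] + rnorm(n)          # two truly active variables

    candidates <- list(1, c(1, 3), c(1, 2, 3), 1:p)  # candidate index sets
    for (S in candidates) {
      fit <- lm(y ~ X[, S, drop = FALSE])
      cat(sprintf("model {%s}: AIC = %.1f, BIC = %.1f, adj R2 = %.3f\n",
                  paste(S, collapse = ","),
                  AIC(fit), BIC(fit), summary(fit)$adj.r.squared))
    }

In this toy example all three criteria typically favor the submodel {1, 3}; an exhaustive best subset search would apply the same comparison to every subset of each size, which is exactly the combinatorial burden discussed above.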

3 Penalized likelihood

As demonstrated above, L_0 regularization arises naturally in many classical model selection methods. It gives a nice interpretation of best subset selection and admits nice sampling properties (Barron, Birgé and Massart (1999)). However, its computation is infeasible in high dimensional statistical endeavors, and other penalty functions should be used. This results in the generalized form

(2)    n^{-1} ℓ_n(β) − Σ_{j=1}^p p_λ(|β_j|),

where ℓ_n(β) is the log-likelihood function and p_λ(·) is a penalty function indexed by the regularization parameter λ ≥ 0. By maximizing the penalized likelihood (2), we hope to simultaneously select variables and estimate their associated regression coefficients. In other words, those variables whose regression coefficients are estimated as zero are automatically deleted.

A natural generalization of the penalized L_0 regression is the penalized L_q regression, called bridge regression in Frank and Friedman (1993), in which p_λ(|θ|) = λ|θ|^q for 0 < q ≤ 2. This bridges best subset selection (penalized L_0) and ridge regression (penalized L_2), and includes the L_1 penalty as a specific case. The non-negative garrote was introduced in Breiman (1995) for shrinkage estimation and variable selection. Penalized L_1 regression is called the LASSO by Tibshirani (1996) in the ordinary regression setting, and the approach is now collectively referred to as penalized L_1 likelihood. Clearly, penalized L_1 regression possesses the variable selection feature, whereas penalized L_2 regression does not. What kind of penalty functions are good for model selection?

Fan and Li (2001) advocate penalty functions that give estimators with three properties:

  • Sparsity: The resulting estimator automatically sets small estimated coefficients to zero to accomplish variable selection and reduce model complexity.

  • Unbiasedness: The resulting estimator is nearly unbiased, especially when the true coefficient is large, to reduce model bias.

  • Continuity: The resulting estimator is continuous in the data to reduce instability in model prediction (Breiman (1996)).

They require the penalty function p_λ(|θ|) to be nondecreasing in |θ|, and provide insights into these properties. We first consider penalized least squares in a canonical form.

3.1 Canonical regression model

Consider the linear regression model

(3)    y = Xβ + ε,

where y = (y_1, …, y_n)^T is the n-vector of responses, X = (x_1, …, x_n)^T is the n × p design matrix, and ε is an n-dimensional noise vector. If ε ~ N(0, σ² I_n), then the penalized likelihood (2) is equivalent, up to an affine transformation of the log-likelihood, to the penalized least squares (PLS) problem

(4)    (2n)^{-1} ‖y − Xβ‖² + Σ_{j=1}^p p_λ(|β_j|),

where ‖·‖ denotes the L_2-norm. Of course, penalized least squares continues to be applicable even when the noise does not follow a normal distribution.

For the canonical linear model in which the design matrix multiplied by n^{-1/2} is orthonormal (i.e., X^T X = n I_p), (4) reduces, up to an additive constant, to the minimization of

(5)    2^{-1} ‖β̂ − β‖² + Σ_{j=1}^p p_λ(|β_j|),

where β̂ = n^{-1} X^T y is the ordinary least squares estimate. Minimizing (5) becomes a componentwise regression problem. This leads to considering the univariate PLS problem

(6)    θ̂(z) = argmin_θ { 2^{-1} (z − θ)² + p_λ(|θ|) }.

Antoniadis and Fan (2001) show that the PLS estimator θ̂(z) possesses the properties:

  • sparsity if min_{t ≥ 0} {t + p′_λ(t)} > 0;

  • approximate unbiasedness if p′_λ(t) = 0 for large t;

  • continuity if and only if argmin_{t ≥ 0} {t + p′_λ(t)} = 0,

where p_λ(t) is nondecreasing and continuously differentiable on [0, ∞), the function −t − p′_λ(t) is strictly unimodal on (0, ∞), and p′_λ(0) means p′_λ(0+) for notational simplicity. In general for the penalty function, the singularity at the origin (i.e., p′_λ(0+) > 0) is needed for generating sparsity in variable selection and the concavity is needed to reduce the estimation bias.

3.2 Penalty function

Figure 2: Some commonly used penalty functions (left panel) and their derivatives (right panel): the hard thresholding penalty, the L_1 penalty, the SCAD penalty, and the MCP. The regularization parameters are chosen to correspond to the risk functions shown in the right panel of Figure 3.

It is known that the convex L_q penalty with q > 1 does not satisfy the sparsity condition, the convex L_1 penalty does not satisfy the unbiasedness condition, and the concave L_q penalty with 0 ≤ q < 1 does not satisfy the continuity condition. In other words, none of the L_q penalties satisfies all three conditions simultaneously. For this reason, Fan (1997) and Fan and Li (2001) introduce the smoothly clipped absolute deviation (SCAD) penalty, whose derivative is given by

(7)    p′_λ(t) = λ { I(t ≤ λ) + (aλ − t)_+ / [(a − 1)λ] · I(t > λ) },   t ≥ 0,

where a > 2 and, often, a = 3.7 is used (suggested by a Bayesian argument). It satisfies the aforementioned three properties. A penalty of similar spirit is the minimax concave penalty (MCP) in Zhang (2009), whose derivative is given by

(8)    p′_λ(t) = (aλ − t)_+ / a,   t ≥ 0.

Clearly SCAD takes off at the origin like the L_1 penalty and then gradually levels off, whereas MCP translates the flat part of the derivative of SCAD toward the origin. When the hard thresholding penalty

(9)

is used, Antoniadis (1996) shows that the solution is the hard-thresholding estimator. A family of concave penalties that bridge the L_0 and L_1 penalties is studied by Lv and Fan (2009) for model selection and sparse recovery. A linear combination of the L_1 and L_2 penalties is called the elastic net by Zou and Hastie (2005), which encourages some grouping effects. Figure 2 depicts some of these commonly used penalty functions.

We now look at the PLS estimator θ̂(z) in (6) for a few penalties. Each nondecreasing penalty function gives a shrinkage rule: the estimator shrinks z toward zero and is antisymmetric in z (Antoniadis and Fan (2001)). The entropy penalty (L_0 penalty) and the hard thresholding penalty yield the hard thresholding rule (Donoho and Johnstone (1994)), while the L_1 penalty gives the soft thresholding rule θ̂(z) = sgn(z)(|z| − λ)_+ (Bickel (1983); Donoho and Johnstone (1994)). The SCAD and MCP give rise to analytical solutions to (6), each of which is a linear spline in z (Fan (1997)).
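
For concreteness, the following R sketch implements the hard and soft thresholding rules together with the closed-form SCAD thresholding rule (with a = 3.7). It is an illustrative sketch of these univariate rules, not code from the references.

    # Thresholding rules solving the univariate PLS problem (6) for scalar z.
    hard.thresh <- function(z, lambda) z * (abs(z) > lambda)

    soft.thresh <- function(z, lambda) sign(z) * pmax(abs(z) - lambda, 0)

    scad.thresh <- function(z, lambda, a = 3.7) {
      # Piecewise (linear-spline) SCAD solution.
      az <- abs(z)
      ifelse(az <= 2 * lambda,
             sign(z) * pmax(az - lambda, 0),                    # soft-threshold zone
             ifelse(az <= a * lambda,
                    ((a - 1) * z - sign(z) * a * lambda) / (a - 2),  # transition zone
                    z))                                          # no shrinkage zone
    }

    z <- seq(-4, 4, by = 0.1)
    head(cbind(z, hard = hard.thresh(z, 1), soft = soft.thresh(z, 1),
               scad = scad.thresh(z, 1)))

Printing the three rules side by side makes the bias issue visible: for large |z| the soft rule is off by about λ, while the hard and SCAD rules leave large values untouched.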

Figure 3: The risk functions of penalized least squares under the Gaussian model for the hard thresholding penalty, the L_1 penalty, SCAD, and MCP. The two panels correspond to two choices of the thresholding parameter for the hard-thresholding estimator, and the remaining parameters are chosen so that the risks of all estimators agree at a common point.

How do these thresholded-shrinkage estimators perform? To compare them, we compute their risks in the fundamental model in which z ~ N(θ, 1), with risk R(θ) = E{θ̂(z) − θ}². Figure 3 shows the risk functions for some commonly used penalty functions. To make them comparable, the thresholding parameter for the hard thresholding penalty is fixed in each panel, and for the other penalty functions the values of λ are chosen to make their risks equal at a common value of θ. Clearly, the penalized likelihood estimators improve on the ordinary least squares estimator in the region where θ is near zero, and have the same risk as the ordinary least squares estimator when θ is far away from zero (e.g., 4 standard deviations away), except for the LASSO estimator. When θ is large, the LASSO estimator has a bias approximately of size λ, and this causes a higher risk, as shown in Figure 3. For the larger choice of λ, the LASSO estimator has a higher risk than the SCAD estimator, except in a small region; for the smaller choice, the advantage of the LASSO estimator around zero is more pronounced. As a result, in model selection, when λ is automatically selected by a data-driven rule to compensate for the bias problem, the LASSO estimator ends up choosing a smaller λ in order to achieve the desired mean squared error. Yet, a smaller value of λ results in a more complex model. This explains why the LASSO estimator tends to include many false positive variables in the selected model.
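
A quick Monte Carlo version of this risk comparison can be run with the thresholding rules from the sketch above; the settings here (λ = 1, a grid of θ values) are illustrative and not those of Figure 3.

    # Monte Carlo estimate of the risk R(theta) = E{theta.hat(z) - theta}^2
    # for soft (LASSO) and SCAD thresholding, with z ~ N(theta, 1).
    risk <- function(rule, theta, lambda, nrep = 1e5) {
      z <- rnorm(nrep, mean = theta, sd = 1)
      mean((rule(z, lambda) - theta)^2)
    }
    thetas <- seq(0, 5, by = 0.5)
    soft.risk <- sapply(thetas, function(t) risk(soft.thresh, t, lambda = 1))
    scad.risk <- sapply(thetas, function(t) risk(scad.thresh, t, lambda = 1))
    round(cbind(theta = thetas, soft = soft.risk, scad = scad.risk), 3)

The simulated risks reproduce the qualitative pattern described above: both rules beat ordinary least squares near θ = 0, while for large θ the soft rule pays an extra price of roughly λ² due to its bias and the SCAD risk returns to that of the unpenalized estimator.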

3.3 Computation and implementation

It is challenging to solve the penalized likelihood problem (2) when the penalty function is nonconvex. Nevertheless, Fan and Lv (2009) are able to give conditions under which the penalized likelihood estimator exists and is unique; see also Kim and Kwon (2009) for results on penalized least squares with the SCAD penalty. When the L_1 penalty is used, the objective function (2) is concave and hence convex optimization algorithms can be applied. We show in this section that the penalized likelihood (2) can be solved by a sequence of reweighted penalized L_1 regression problems via local linear approximation (Zou and Li (2008)).

In the absence of other available algorithms at that time, Fan and Li (2001) propose a unified and effective local quadratic approximation (LQA) algorithm for optimizing nonconcave penalized likelihood. Their idea is to locally approximate the objective function by a quadratic function. Specifically, for a given initial value β^(0) = (β_1^(0), …, β_p^(0))^T, the penalty function can be locally approximated by a quadratic function as

(10)    p_λ(|β_j|) ≈ p_λ(|β_j^(0)|) + (1/2) { p′_λ(|β_j^(0)|) / |β_j^(0)| } (β_j² − (β_j^(0))²),   for β_j ≈ β_j^(0).

With this and a LQA to the log-likelihood, the penalized likelihood (2) becomes a least squares problem that admits a closed-form solution. To avoid numerical instability, the algorithm sets the estimated coefficient β̂_j = 0 if β_j^(0) is very close to 0, which amounts to deleting the j-th covariate from the final model. Clearly, zero is an absorbing state of LQA in the sense that once a coefficient is set to zero, it remains zero in subsequent iterations.
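
A minimal sketch of one LQA pass for penalized least squares: with the quadratic approximation (10), the update is a generalized ridge regression with a closed-form solution. The scaling follows the convention used in (4) above, and the SCAD derivative (7) is coded explicitly; this is an illustration of the idea, not the implementation of Fan and Li (2001).

    # One LQA update for penalized least squares with the SCAD penalty.
    scad.deriv <- function(t, lambda, a = 3.7)
      lambda * (t <= lambda) + pmax(a * lambda - t, 0) / (a - 1) * (t > lambda)

    lqa.step <- function(X, y, beta0, lambda, eps = 1e-6) {
      absb <- pmax(abs(beta0), eps)          # guard against division by zero
      w <- scad.deriv(absb, lambda) / absb   # LQA weights p'(|b_j|) / |b_j|
      n <- nrow(X)
      # argmin (2n)^{-1} ||y - X b||^2 + (1/2) sum_j w_j b_j^2
      solve(crossprod(X) / n + diag(w, nrow = length(w)), crossprod(X, y) / n)
    }

In practice coefficients whose current magnitude falls below a small tolerance would be set to zero and removed, mirroring the deletion step described above.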

The convergence property of the LQA was studied in Hunter and Li (2005), who show that LQA plays the same role as the E-step in the EM algorithm in Dempster, Laird and Rubin (1977). Therefore LQA has similar behavior to EM. Although the EM requires a full iteration for maximization after each E-step, the LQA updates the quadratic approximation at each step during the course of iteration, which speeds up the convergence of the algorithm. The convergence rate of LQA is quadratic, which is the same as that of the modified EM algorithm in Lange (1995).

Figure 4: The local linear (dashed) and local quadratic (dotted) approximations to the SCAD penalty function (solid) at a given point.

A better approximation can be achieved by using the local linear approximation (LLA)

(11)    p_λ(|β_j|) ≈ p_λ(|β_j^(0)|) + p′_λ(|β_j^(0)|) (|β_j| − |β_j^(0)|),   for β_j ≈ β_j^(0),

as in Zou and Li (2008). See Figure 4 for an illustration of the local linear and local quadratic approximations to the SCAD penalty. Clearly, both LLA and LQA are convex majorants of the concave penalty function on [0, ∞), but LLA is a better approximation since it is the minimum (tightest) convex majorant of the concave function on [0, ∞). With LLA, the penalized likelihood (2) becomes, up to a constant,

(12)    n^{-1} ℓ_n(β) − Σ_{j=1}^p w_j |β_j|,

where the weights are w_j = p′_λ(|β_j^(0)|). Problem (12) is a concave optimization problem if the log-likelihood function is concave. Different penalty functions give different weighting schemes, and the LASSO gives a constant weighting scheme. In this sense, the nonconcave penalized likelihood is an iteratively reweighted penalized L_1 regression. The weight function is chosen adaptively to reduce the biases due to penalization. For example, for SCAD and MCP, when the estimate of a particular component is large, so that it has high confidence of being non-vanishing, the component does not receive any penalty in (12), as desired.
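
The reweighted L_1 view of (12) means that one LLA step can be carried out with off-the-shelf LASSO software. The sketch below uses the penalty.factor argument of the glmnet package to impose the coefficient-specific weights p′_λ(|β_j^(0)|), reusing scad.deriv from the earlier sketch; it is an illustration, not the authors' implementation, and note that glmnet rescales the penalty factors internally, so only their relative sizes matter.

    # One LLA step: weighted L1-penalized least squares with SCAD-derivative weights.
    library(glmnet)

    lla.step <- function(X, y, beta0, lambda, a = 3.7) {
      w <- scad.deriv(abs(beta0), lambda, a)      # weights w_j = p'(|beta0_j|)
      fit <- glmnet(X, y, penalty.factor = w, standardize = FALSE)
      as.numeric(coef(fit, s = lambda))[-1]       # drop the intercept
    }

Starting from beta0 = 0 the weights all equal λ, so the first step returns a LASSO fit, as noted below; components whose current estimates are large receive zero weight and are left unpenalized.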

Zou (2006) proposes the weighting scheme w_j = 1/|β_j^(0)|^γ for some γ > 0 and calls the resulting procedure the adaptive LASSO. This weight reduces the penalty when the previous estimate is large. However, the penalty at zero is infinite, so that when the procedure is applied iteratively, zero becomes an absorbing state. On the other hand, penalty functions such as SCAD and MCP do not have this undesirable property. For example, if the initial estimate is zero, then w_j = p′_λ(0+) = λ and the resulting estimate is the LASSO estimate.

Fan and Li (2001), Zou (2006), and Zou and Li (2008) all suggest using a consistent estimate, such as the un-penalized MLE, as the initial value β^(0). This implicitly assumes that the dimensionality p is smaller than the sample size n. For dimensionality that is larger than the sample size, the above method is not applicable. Fan and Lv (2008) recommend using β^(0) = 0, which is equivalent to using the LASSO estimate as the initial estimate for the subsequent iteration. Another possible initial value is a stepwise addition fit or componentwise regression. They put forward the recommendation that only a few iterations are needed, which is in line with Zou and Li (2008).

Before we close this section, we remark that with the LLA and LQA, the resulting sequence of target values is always nondecreasing, which is a specific feature of minorization-maximization (MM) algorithms (Hunter and Lange (2000)). Let Q(β) denote the penalized likelihood objective in (2). Suppose that at the k-th iteration, Q(β) is approximated by Q_k(β) such that

(13)    Q_k(β) ≤ Q(β),   with equality at β = β^(k),

where β^(k) is the estimate at the k-th iteration. Let β^(k+1) maximize the approximated penalized likelihood Q_k(β). Then we have

Q(β^(k+1)) ≥ Q_k(β^(k+1)) ≥ Q_k(β^(k)) = Q(β^(k)).

Thus, the target values are non-decreasing. Clearly, the LLA and LQA are two specific cases of MM algorithms, satisfying condition (13); see Figure 4. Therefore, the sequence of target function values is non-decreasing and thus converges provided it is bounded. The limiting critical point is the global maximizer under the conditions in Fan and Lv (2009).

3.4 LARS and other algorithms

As demonstrated in the previous section, the penalized least squares problem (4) with an L_1 penalty is fundamental to the computation of penalized likelihood estimation. There are several additional powerful algorithms for such an endeavor. Osborne, Presnell and Turlach (2000) cast such a problem as a quadratic programming problem. Efron et al. (2004) propose a fast and efficient least angle regression (LARS) algorithm for variable selection, a simple modification of which produces the entire LASSO solution path that optimizes (4) as λ varies. The computation is based on the fact that the LASSO solution path is piecewise linear in λ. See Rosset and Zhu (2007) for a more general account of the conditions under which the solution to the penalized likelihood (2) is piecewise linear. The LARS algorithm starts from a large value of λ that selects only the covariate having the greatest correlation with the response variable, and decreases λ until a second variable is selected, at which point the selected variables have the same correlation (in magnitude) with the current working residual as the first one, and so on. See Efron et al. (2004) for details.

The idea of the LARS algorithm can be expanded to compute the solution paths of the penalized least squares (4). Zhang (2009) introduces the PLUS algorithm for efficiently computing a solution path of (4) when the penalty function is a quadratic spline, such as SCAD and MCP. In addition, Zhang (2009) shows that the solution path is piecewise linear in λ and that the proposed solution path has the desired statistical properties.

For the penalized least squares problem (4), Fu (1998), Daubechies, Defrise and De Mol (2004), and Wu and Lange (2008) propose a coordinate descent algorithm, which iteratively optimizes (4) one component at a time. This algorithm can also be applied to optimize the group LASSO (Antoniadis and Fan (2001); Yuan and Lin (2006)) as shown in Meier, van de Geer and Bühlmann (2008), penalized precision matrix estimation (Friedman, Hastie and Tibshirani (2007)), and the penalized likelihood (2) (Fan and Lv (2009); Zhang and Li (2009)).

More specifically, Fan and Lv (2009) employ a path-following coordinate optimization algorithm, called the iterative coordinate ascent (ICA) algorithm, for maximizing the nonconcave penalized likelihood. It successively maximizes the penalized likelihood (2) for regularization parameters λ in decreasing order. A similar idea is also studied in Zhang and Li (2009), who introduce the ICM algorithm. The coordinate optimization algorithm uses the Gauss-Seidel method, i.e., maximizing with respect to one coordinate at a time with successive displacements. Specifically, for each coordinate within each iteration, it uses the second order approximation of the log-likelihood at the p-vector from the previous step along that coordinate and maximizes the univariate penalized quadratic approximation

(14)

It updates a coordinate if the maximizer of the corresponding univariate penalized quadratic approximation makes the penalized likelihood (2) strictly increase. Therefore, the ICA algorithm enjoys the ascent property that the resulting sequence of values of the penalized likelihood is increasing for a fixed λ. Compared to other algorithms, the coordinate optimization algorithm is especially appealing for large scale problems with both n and p large, thanks to its low computational complexity. It is fast to implement when the univariate problem (14) admits a closed-form solution, which is the case for many commonly used penalty functions such as SCAD and MCP. In practical implementation, we pick a sufficiently large λ_max such that the maximizer of the penalized likelihood (2) with λ = λ_max is 0, together with a decreasing sequence of regularization parameters. The studies in Fan and Lv (2009) show that coordinate optimization works equally well and efficiently for producing the entire solution paths for concave penalties.
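
To illustrate the coordinate-wise idea in its simplest form, the sketch below runs cyclic coordinate descent for the L_1-penalized least squares problem (4), using the soft-threshold rule from the earlier sketch for each univariate update. It assumes the columns of X are standardized and is an illustration of the mechanism, not the ICA/ICM code.

    # Coordinate descent for (2n)^{-1} ||y - X b||^2 + lambda * sum |b_j|,
    # assuming the columns of X are standardized (mean 0, variance 1).
    cd.lasso <- function(X, y, lambda, niter = 100) {
      n <- nrow(X); p <- ncol(X)
      b <- rep(0, p)
      r <- y                                   # current residual y - X b
      for (it in 1:niter) {
        for (j in 1:p) {
          zj <- sum(X[, j] * r) / n + b[j]     # univariate least squares target
          bj.new <- soft.thresh(zj, lambda)    # soft-threshold update
          r <- r - X[, j] * (bj.new - b[j])    # keep the residual in sync
          b[j] <- bj.new
        }
      }
      b
    }

Replacing the soft-threshold call by the SCAD or MCP thresholding rule gives the analogous univariate closed-form updates for those concave penalties, with the path-following strategy running this routine along a decreasing grid of λ values and warm-starting each fit from the previous one.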

The LLA algorithm for computing penalized likelihood is now available in R at

http://cran.r-project.org/web/packages/SIS/index.html

as a function in the SIS package. So is the PLUS algorithm for computing the penalized least squares estimator with SCAD and MC+ penalties. The Matlab codes are also available for the ICA algorithm for computing the solution path of the penalized likelihood estimator and for computing SIS upon request.

3.5 Composite quasi-likelihood

The function ℓ_n in (2) does not have to be the true likelihood. It can be a quasi-likelihood or a loss function (Fan, Samworth and Wu (2009)). In most statistical applications, it is of the form

(15)    n^{-1} Σ_{i=1}^n Q(x_i^T β, y_i) − Σ_{j=1}^p p_λ(|β_j|),

where Q(x_i^T β, y_i) is the conditional quasi-likelihood of y_i given x_i. It can also be the negative of a loss function of using x_i^T β to predict y_i. In the latter case, the penalized quasi-likelihood (15) is written as the minimization of

(16)    n^{-1} Σ_{i=1}^n L(x_i^T β, y_i) + Σ_{j=1}^p p_λ(|β_j|),

where L is a loss function. For example, the loss function can be a robust loss such as the absolute deviation loss. How should we choose a quasi-likelihood to enhance the efficiency of the procedure when the error distribution possibly deviates from the normal?

To illustrate the idea, consider the linear model (3). As long as the error distribution of ε is homoscedastic, the conditional τ-quantile of Y given x equals x^T β up to an additive constant. Therefore, β can be estimated by the quantile regression

min_{b, β} Σ_{i=1}^n ρ_τ(y_i − b − x_i^T β),

where ρ_τ(u) = u{τ − I(u < 0)} is the check loss function (Koenker and Bassett (1978)). Koenker (1984) proposes solving the weighted composite quantile regression by using different quantiles to improve the efficiency, namely, minimizing, with respect to b_1, …, b_K and β,

(17)    Σ_{k=1}^K w_k Σ_{i=1}^n ρ_{τ_k}(y_i − b_k − x_i^T β),

where 0 < τ_1 < ⋯ < τ_K < 1 is a given sequence of quantiles and w_1, …, w_K is a given sequence of weights. Zou and Yuan (2008) propose the penalized composite quantile regression with equal weights to improve the efficiency of the penalized least squares.
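
As a concrete (unpenalized) illustration of the composite quantile objective (17), the R sketch below codes the check loss and fits the common slope with a general-purpose optimizer. The quantile grid and equal weights are hypothetical choices, and BFGS on the nonsmooth check loss is only a rough numerical device; in practice one would use a linear-programming based solver such as the quantreg package.

    # Weighted composite quantile regression: loss (17) and a crude numerical fit.
    check.loss <- function(u, tau) u * (tau - (u < 0))   # quantile check function

    cqr.loss <- function(par, X, y, taus, w) {
      K <- length(taus)
      b <- par[1:K]                    # one intercept per quantile level
      beta <- par[-(1:K)]              # common slope vector
      sum(sapply(1:K, function(k)
        w[k] * sum(check.loss(y - b[k] - X %*% beta, taus[k]))))
    }

    cqr.fit <- function(X, y, taus = c(0.25, 0.5, 0.75), w = rep(1, length(taus))) {
      p <- ncol(X); K <- length(taus)
      # BFGS is approximate here because the objective is not differentiable.
      optim(rep(0, K + p), cqr.loss, X = X, y = y, taus = taus, w = w,
            method = "BFGS")$par[-(1:K)]
    }

Adding the penalty term of (16) to cqr.loss turns this into the penalized composite quantile regression of Zou and Yuan (2008).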

Recently, Bradic, Fan and Wang (2009) propose the more general penalized composite quasi-likelihood

(18)

in which the loss is a weighted combination of several convex loss functions. They derive the asymptotic normality of the estimator and choose the weights to optimize the asymptotic variance. In this view, the composite quasi-likelihood always performs at least as well as a single quasi-likelihood function. In particular, they study in detail the relative efficiency of the composite L_1-L_2 loss and the optimal composite quantile loss, compared with the least squares estimator.

Note that the composite likelihood (18) can be regarded as an approximation to the log-likelihood function, with the weighted combination of loss functions approximating the negative log-density of the error distribution. Hence, the weights can also be chosen by minimizing (18) directly. If the convexity of the composite likelihood is to be enforced, we need to impose the additional constraint that all weights are non-negative.

3.6 Choice of penalty parameters

The choice of the penalty parameter λ is of paramount importance in penalized likelihood estimation. When λ = 0, all variables are selected, and the model is even unidentifiable when p > n. When λ = ∞, provided that the penalty p_λ(|θ|) tends to infinity for any θ ≠ 0, none of the variables is selected. The interesting cases lie between these two extreme choices.

The above discussion clearly indicates that λ governs the complexity of the selected model. A large value of λ tends to choose a simple model, whereas a small value of λ inclines toward a complex model. Estimation with a larger value of λ tends to have smaller variance, whereas estimation with a smaller value of λ tends to have smaller modeling bias. The trade-off between bias and variance yields an optimal choice of λ. This is frequently done by multi-fold cross-validation.
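
A minimal sketch of the data-driven choice of λ by 10-fold cross-validation, using cv.glmnet for the L_1 penalty on hypothetical simulated data; information criteria such as the BIC discussed next can be used in the same plug-in fashion.

    # Choosing lambda by 10-fold cross-validation for an L1-penalized fit.
    library(glmnet)
    set.seed(2)
    n <- 100; p <- 200
    X <- matrix(rnorm(n * p), n, p)
    y <- drop(X[, 1:3] %*% c(3, -2, 1.5)) + rnorm(n)

    cvfit <- cv.glmnet(X, y, nfolds = 10)
    cvfit$lambda.min                      # lambda minimizing the CV error
    cvfit$lambda.1se                      # a more conservative (sparser) choice
    coef(cvfit, s = "lambda.min")[1:6, ]  # first few estimated coefficients

The one-standard-error rule (lambda.1se) is a common way to lean toward sparser models, which is relevant here because, as noted above, purely prediction-driven choices of λ tend to admit false positives.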

There are relatively few studies on the choice of penalty parameters. In Wang, Li and Tsai (2007), it is shown that the model selected by generalized cross-validation with the SCAD penalty contains all important variables but, with nonzero probability, includes some unimportant variables, and that the model selected by BIC achieves model selection consistency and an oracle property. It is worth pointing out that missing some true predictors causes model misspecification, as does misspecifying the family of distributions. A semi-Bayesian information criterion (SIC) is proposed by Lv and Liu (2008) to address this issue for model selection.

4 Ultra-high dimensional variable selection

Variable selection in ultra-high dimensional feature space has become increasingly important in statistics, and calls for new or extended statistical methodologies and theory. For example, in disease classification using microarray gene expression data, the number of arrays is usually on the order of tens while the number of gene expression profiles is on the order of tens of thousands; in the study of protein-protein interactions, the number of features can be on the order of millions while the sample size can be on the order of thousands (see, e.g., Tibshirani et al. (2003) and Fan and Ren (2006)); the same order of magnitude occurs in genetic association studies between genotypes and phenotypes. In such problems, it is important to identify significant features (e.g., SNPs) contributing to the response and reliably predict certain clinical prognosis (e.g., survival time and cholesterol level). As mentioned in the introduction, three important issues arise in such high dimensional statistical endeavors: computational cost, statistical accuracy, and model interpretability. Existing variable selection techniques can become computationally intensive in ultra-high dimensions.

A natural idea is to reduce the dimensionality p from a large or huge scale (say, growing exponentially in a power of the sample size n) to a relatively large scale d (say, of the order of the sample size or smaller) by a fast, reliable, and efficient method, so that well-developed variable selection techniques can be applied to the reduced feature space. This provides a powerful tool for variable selection in ultra-high dimensional feature space. It addresses the aforementioned three issues when the variable screening procedure is capable of retaining all the important variables with asymptotic probability one, the sure screening property introduced in Fan and Lv (2008).

Figure 5: Illustration of ultra-high dimensional variable selection scheme. A large scale screening is first used to screen out unimportant variables and then a moderate-scale searching is applied to further select important variables. At both steps, one can choose a favorite method.

The above discussion suggests already a two-scale method for ultra-high dimensional variable selection problems: a crude large scale screening followed by a moderate scale selection. The idea is explicitly suggested by Fan and Lv (2008) and is illustrated by the schematic diagram in Figure 5. One can choose one of many popular screening techniques, as long as it possesses the sure screening property. In the same vein, one can also select a preferred tool for moderate scale selection. The large-scale screening and moderate-scale selection can be iteratively applied, resulting in iterative sure independence screening (ISIS) (Fan and Lv (2008)). Its amelioration and extensions are given in Fan, Samworth and Wu (2009), who also develop R and Matlab codes to facilitate the implementation in generalized linear models (McCullagh and Nelder (1989)).

4.1 Sure independence screening

Independence screening refers to ranking features according to a marginal utility, namely, each feature is used independently as a predictor to decide its usefulness for predicting the response. Sure independence screening (SIS) was introduced by Fan and Lv (2008) to reduce the computation in ultra-high dimensional variable selection: all important features are retained in the selected model with probability tending to 1 (Fan and Lv (2008)). An example of independence learning is the correlation ranking proposed in Fan and Lv (2008), which ranks features according to the magnitude of their sample correlations with the response variable. More precisely, let ω = (ω_1, …, ω_p)^T = X^T y be a p-vector obtained by componentwise regression, where we assume that each column of the design matrix X has been standardized with mean zero and variance one. For any given submodel size d, take the selected submodel to be

(19)    M̂_d = {1 ≤ j ≤ p : |ω_j| is among the d largest of all}.

This reduces the full model of size p to a submodel of size d, which can be less than the sample size n. Such correlation learning screens out those variables that have weak marginal correlations with the response. For classification problems with the response taking values ±1, the correlation ranking reduces to selecting features by using two-sample t-test statistics. See Section 4.2 for additional details.
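
A bare-bones version of this correlation screening is easy to code: standardize the columns, compute the componentwise regression coefficients, and keep the d largest in magnitude. The sketch below is illustrative only; the SIS package mentioned in Section 3.4 provides the full implementation.

    # Sure independence screening by marginal correlation ranking.
    sis.screen <- function(X, y, d) {
      Xs <- scale(X)                                          # standardize columns
      omega <- drop(crossprod(Xs, y - mean(y))) / nrow(X)     # componentwise regression
      order(abs(omega), decreasing = TRUE)[1:d]               # indices of top d features
    }

    # Example: p = 5000 features, n = 100 observations, 3 true signals.
    set.seed(3)
    n <- 100; p <- 5000
    X <- matrix(rnorm(n * p), n, p)
    y <- drop(X[, 1:3] %*% c(2, 2, 2)) + rnorm(n)
    keep <- sis.screen(X, y, d = floor(n / log(n)))
    all(1:3 %in% keep)                   # do the true variables survive screening?

In this toy example with strong, marginally visible signals the true variables typically survive the screening; the hidden-signature examples in Section 4.5 show when this simple marginal ranking fails.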

Other examples of independence learning include methods in microarray data analysis where a two-sample test is used to select significant genes between the treatment and control groups (Dudoit, Shaffer and Boldrick (2003); Storey and Tibshirani (2003); Fan and Ren (2006); Efron (2007)), feature ranking using a generalized correlation (Hall and Miller (2009a)), nonparametric learning under sparse additive models (Ravikumar et al. (2009)), and the method in Huang, Horowitz and Ma (2008) that uses the marginal bridge estimators for selecting variables in high dimensional sparse regression models. Hall, Titterington and Xue (2009) derive some independence learning rules using tilting methods and empirical likelihood, and propose a bootstrap method to assess the fidelity of feature ranking. In particular, the false discovery rate (FDR) proposed by Benjamini and Hochberg (1995) is popularly used in multiple testing for controlling the expected false positive rate. See also Efron et al. (2001), Abramovich et al. (2006), Donoho and Jin (2006), and Clarke and Hall (2009).

We now discuss the sure screening property of correlation screening. Let M_* be the true underlying sparse model, with nonsparsity size s = |M_*|; the other variables can also be correlated with the response variable via their links to the predictors in the true model. Fan and Lv (2008) consider the case where log p grows at most polynomially in n, so that p can be non-polynomially large, and the noise is Gaussian. They assume that the variance of the response is bounded, that the marginal signals of the important variables decay no faster than a polynomial rate in n,

and that the p-dimensional covariate vector x has an elliptical distribution, with the random design matrix satisfying a concentration property that holds for Gaussian distributions. For studies on the extreme eigenvalues and limiting spectral distributions of large random matrices, see, e.g., Silverstein (1985), Bai and Yin (1993), Bai (1999), Johnstone (2001), and Ledoux (2001, 2005).

Under the above regularity conditions, Fan and Lv (2008) show that, when the submodel size d is chosen to grow at an appropriate polynomial rate in n, the sure screening property

(20)    P(M_* ⊂ M̂_d) → 1   as n → ∞

holds, with the probability of missing an important variable decaying exponentially fast in a power of n. In particular, this sure screening property entails the sparsity of the true model: s ≤ d. It demonstrates that SIS can reduce exponentially high dimensionality to a relatively large scale d < n, while the reduced model still contains all the important variables with an overwhelming probability. In practice, to be conservative we can choose d close to the sample size, for example of the order n/log n or close to n. Clearly, a larger d means a larger probability of including the true underlying sparse model M_* in the final model M̂_d. See Section 4.3 for further results on sure independence screening.

When the dimensionality is reduced from a large scale to a moderate scale by applying a sure screening method such as correlation learning, well-developed variable selection techniques, such as penalized least squares methods, can be applied to the reduced feature space. This yields a powerful class of SIS-based variable selection methods, whose sampling properties can be easily obtained by combining the theory of SIS with that of the penalization methods.

4.2 Feature selection for classification

Independence learning has also been widely used for feature selection in high dimensional classification problems. In this section we look at the specific setting of classification and continue the topic of independence learning for variable selection in Section 4.3. Consider the p-dimensional classification problem between two classes. For k = 1, 2, let X_1^(k), …, X_{n_k}^(k) be i.i.d. p-dimensional observations from the k-th class. Classification aims at finding a discriminant function δ(x) that classifies new observations as accurately as possible; the classifier assigns x to class 1 if δ(x) ≥ 0 and to class 2 otherwise.

Many classification methods have been proposed in the literature. The best classifier is the Fisher discriminant when the data are from normal distributions with a common covariance matrix, i.e., the k-th class is N(μ_k, Σ) for k = 1, 2. However, this method is hard to implement when the dimensionality is high, due to the difficulty of estimating the unknown covariance matrix Σ. Hence, the independence rule, which involves estimating only the diagonal entries of the covariance matrix and has discriminant function δ(x) = (μ_1 − μ_2)^T D^{-1} {x − (μ_1 + μ_2)/2}, where D = diag(Σ), is frequently employed for classification. For a survey of recent developments, see Fan, Fan and Wu (2010).

Classical methods break down when the dimensionality is high. As demonstrated by Bickel and Levina (2004), the Fisher discrimination method no longer performs well in high dimensional settings due to the diverging spectra and singularity of the sample covariance matrix. They show that the independence rule overcomes these problems and outperforms the Fisher discriminant in high dimensional setting. However, in practical implementation such as tumor classification using microarray data, one hopes to find tens of genes that have high discriminative power. The independence rule does not possess the property of feature selection.

The noise accumulation phenomenon is well-known in the regression setup, but has never been quantified in the classification problem until Fan and Fan (2008). They show that the difficulty of high dimensional classification is intrinsically caused by the existence of many noise features that do not contribute to the reduction of classification error. For example, in linear discriminant analysis one needs to estimate the class mean vectors and covariance matrix. Although each parameter can be estimated accurately, aggregated estimation error can be very large and can significantly increase the misclassification rate.

Let R = D^{-1/2} Σ D^{-1/2} be the common correlation matrix, λ_max(R) be its largest eigenvalue, and α = μ_1 − μ_2 be the vector of mean differences. Consider a parameter space in which the largest eigenvalue of R is bounded and the overall signal strength, measured through α relative to the diagonal elements of Σ, is bounded below. Let δ̂ be the estimated discriminant function of the independence rule, obtained by plugging in the sample estimates of α and D. When the dimensionality p grows with the sample size, Fan and Fan (2008) demonstrate that the worst case classification error of δ̂ over this parameter space converges:

(21)

where the limit is expressed through Φ, the cumulative distribution function of the standard normal random variable, evaluated at a quantity that balances the overall signal strength against the dimensionality.

The misclassification rate in (21) relates to the dimensionality through a term that compares the overall signal strength with the dimensionality p, quantifying the tradeoff between the two. The overall signal always increases with dimensionality. If the useful features are located among the first m components, say, then the signal stops increasing once more than m features are used, yet the noise penalty of using all p features keeps growing with p. Clearly, using only the first m features can then perform much better than using all of them. The optimal number of features is the one that best balances the accumulated signal of the best subset of that size, defined through the corresponding sub-vector of α and sub-matrix of D, against the associated noise accumulation. The result (21) also indicates that the independence rule works no better than random guessing due to noise accumulation, unless the signal levels are extremely high relative to the dimensionality. Hall, Pittelkow and Ghosh (2008) show that if the signal strength grows fast enough with the dimensionality, the classification error goes to zero for a distance-based classifier, which is a specific case of the result of Fan and Fan (2008).

The above results reveal that dimensionality reduction is also very important for reducing the misclassification rate. A popular class of dimensionality reduction techniques is projection. See, for example, principal component analysis in Ghosh (2002) and Zou, Hastie and Tibshirani (2004); partial least squares in Huang and Pan (2003) and Boulesteix (2004); and sliced inverse regression in Chiaromonte and Martinelli (2002), Antoniadis, Lambert-Lacroix and Leblanc (2003), and Bura and Pfeiffer (2003). These projection methods attempt to find directions that result in small classification errors. In fact, the directions that they find usually put much larger weights on features with large classification power, which is indeed a type of sparsity in the projection vector. Fan and Fan (2008) formally show that linear projection methods are likely to perform poorly unless the projection vector is sparse, namely, unless the effective number of selected features is small. This is due to the aforementioned noise accumulation when estimating the unknown mean vectors and covariance matrix in high dimensional problems. For formal results, see Theorem 2 in Fan and Fan (2008). See also Tibshirani et al. (2002), Donoho and Jin (2008), Hall, Park and Samworth (2008), Hall, Pittelkow and Ghosh (2008), Hall and Chan (2009), Hall and Miller (2009b), and Jin (2009) for some recent developments in high dimensional classifications.

To select important features, the two-sample t-test is frequently employed (see, e.g., Tibshirani et al. (2003)). The two-sample t-statistic for feature j is

(22)    T_j = (X̄_j^(1) − X̄_j^(2)) / sqrt( S_{1j}²/n_1 + S_{2j}²/n_2 ),

where X̄_j^(k) and S_{kj}² are the sample mean and variance of the j-th feature in class k. This is a specific example of independence learning, which ranks the features according to |T_j|. Fan and Fan (2008) prove that, when the dimensionality grows no faster than an exponential rate of the sample size, if the lowest signal level is not too small, the two-sample t-test can select all important features with probability tending to 1. Their proof relies on deviation results for the two-sample t-statistic. See, e.g., Hall (1987, 2006), Jing, Shao and Wang (2003), and Cao (2007) for large deviation theory.
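
A vectorized R sketch of this feature ranking by two-sample t-statistics as in (22), with hypothetical class labels in {1, 2}; it is an illustration of the screening step only.

    # Componentwise two-sample t-statistics for feature ranking in classification.
    t.screen <- function(X, labels, m) {
      X1 <- X[labels == 1, , drop = FALSE]
      X2 <- X[labels == 2, , drop = FALSE]
      n1 <- nrow(X1); n2 <- nrow(X2)
      tstat <- (colMeans(X1) - colMeans(X2)) /
               sqrt(apply(X1, 2, var) / n1 + apply(X2, 2, var) / n2)
      order(abs(tstat), decreasing = TRUE)[1:m]   # indices of the m largest |t|
    }

The m retained features can then be fed to a classifier built on the diagonal covariance estimate, which is essentially the annealing idea behind FAIR described next.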

Although the t-test can correctly select all important features with probability tending to 1 under some regularity conditions, the resulting choice is not necessarily optimal, since the noise accumulation can exceed the signal accumulation for faint features. Therefore, it is necessary to further single out the most important features. To address this issue, Fan and Fan (2008) propose the Features Annealed Independence Rule (FAIR). Instead of constructing the independence rule using all features, FAIR selects the most important ones and uses them to construct an independence rule. To appreciate the idea of FAIR, first note that the relative importance of features can be measured by |α_j|/σ_j, where α_j is the j-th component of the mean difference α and σ_j² is the common variance of the j-th feature. If such oracle ranking information were available, one could construct the independence rule using only the features with the largest values of |α_j|/σ_j, with the optimal number of retained features to be determined; the oracle classifier keeps a feature only when its standardized mean difference exceeds a positive constant. It is easy to see that choosing the optimal constant is equivalent to selecting the optimal number of features. However, oracle information is usually unavailable, and one needs to learn it from the data. Observe that α_j/σ_j can be estimated by its sample counterpart, which is proportional to the two-sample t-statistic with pooled sample variance. This is indeed the same as ranking the features by the correlation between the j-th variable and the class response when the two classes have equal sample sizes (Fan and Lv (2008)); as pointed out by Hall, Titterington and Xue (2008), the equivalence always holds if the class responses are assigned suitably chosen values. Thus, to mimic the oracle, FAIR takes a slightly different form to adapt to the unknown signal strength:

(23)

It is clear from (23) that FAIR works in the same way as if we first sort the features by the absolute values of their t-statistics in descending order, and then take the top features to construct a classifier. The number of retained features is selected by minimizing an upper bound on the classification error, expressed in terms of the ordered squared t-statistics and an estimate of the largest eigenvalue of the correlation matrix of the most significant features. Fan and Fan (2008) also derive the misclassification rates of FAIR and demonstrate that it possesses an oracle property.

4.3 Sure independence screening for generalized linear models

Correlation learning cannot be directly applied to the case of discrete covariates such as genetic studies with different genotypes. The mathematical results and technical arguments in Fan and Lv (2008) rely heavily on the joint normality assumptions. The natural question is how to screen variables in a more general context, and whether the sure screening property continues to hold with a limited false positive rate.

Consider the generalized linear model (GLIM) with canonical link. That is, the conditional density of Y given x is

(24)    f(y | x; β) = exp{ y θ(x) − b(θ(x)) + c(y) },   θ(x) = x^T β,

for some known functions b(·) and c(·). As we consider only variable selection on the mean regression function, we assume without loss of generality that the dispersion parameter equals one. As before, we assume that each variable has been standardized with mean 0 and variance 1.

For GLIM (24), the penalized likelihood (2) becomes, up to a constant not depending on β,

(25)    n^{-1} Σ_{i=1}^n { y_i x_i^T β − b(x_i^T β) } − Σ_{j=1}^p p_λ(|β_j|).

The maximum marginal likelihood estimator (MMLE) for the j-th feature is defined as the minimizer of the componentwise regression

(26)    β̂_j^M = (β̂_{j,0}^M, β̂_{j,1}^M) = argmin_{β_0, β_1} n^{-1} Σ_{i=1}^n { b(β_0 + x_{ij} β_1) − y_i (β_0 + x_{ij} β_1) },

where x_{ij} is the i-th observation of the j-th variable. This can be easily computed and its implementation is robust, avoiding numerical instability in ultra-high dimensional problems. The marginal estimator estimates the wrong object of course, but its magnitude provides useful information for variable screening. Fan and Song (2009) select the set of variables whose marginal slope magnitudes exceed a predefined threshold value γ_n:

(27)    M̂_{γ_n} = {1 ≤ j ≤ p : |β̂_{j,1}^M| ≥ γ_n}.

This is equivalent to ranking the features according to the magnitudes of the MMLEs. To understand the utility of the MMLE, consider its population version, the minimizer of the population counterpart of the componentwise regression.

Fan and Song (2009) show that the population marginal slope for the j-th feature vanishes if and only if X_j and Y are marginally uncorrelated, and that, under some additional conditions, if the marginal correlations between the important variables and the response are bounded below at a polynomial rate in n, then the minimum marginal signal satisfies

(28)    min_{j ∈ M_*} |β_{j,1}^M| ≥ c n^{-κ}

for some constants c > 0 and κ ≥ 0. In words, as long as the important variables X_j are somewhat marginally correlated with Y, the marginal signal is detectable. They prove further the sure screening property

(29)    P(M_* ⊂ M̂_{γ_n}) → 1   as n → ∞

(the convergence is exponentially fast) when the threshold γ_n is taken to be a sufficiently small multiple of the minimum marginal signal level, and show that only the size of the non-sparse elements (not the dimensionality) matters for the sure screening property. For the Gaussian linear model (3) with sub-Gaussian covariate tails, the dimensionality is allowed to grow at a non-polynomial rate in the sample size; this is a weaker result than that in Fan and Lv (2008) in terms of the condition on the dimensionality, but a stronger result in terms of the conditions on the covariates. For logistic regression with bounded covariates, such as genotypes, the allowable dimensionality is likewise of non-polynomial order.
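
A sketch of MMLE screening for a GLIM, fitting one-variable GLMs and thresholding the magnitude of the marginal slope as in (27). Logistic regression is assumed for illustration, and a plain loop is used for clarity rather than speed; this is not the Fan and Song (2009) implementation.

    # Marginal MLE screening: threshold |beta_j^M| as in (27), logistic case.
    mmle.screen <- function(X, y, gamma.n, family = binomial()) {
      p <- ncol(X)
      bM <- numeric(p)
      for (j in 1:p) {
        # Slope of the GLM of y on the j-th (standardized) covariate alone.
        bM[j] <- coef(glm(y ~ scale(X[, j]), family = family))[2]
      }
      which(abs(bM) >= gamma.n)          # selected set M-hat_{gamma_n}
    }

Equivalently, one can rank the |bM[j]| and keep a prespecified number of top features, which matches the ranking interpretation given above.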

The sure screening property (29) is only part of the story. For example, if γ_n = 0 then all variables are selected and (29) holds trivially. The question is how large the size of the selected model M̂_{γ_n} in (27) is. Under some regularity conditions, Fan and Song (2009) show that, with probability tending to one exponentially fast,

(30)    |M̂_{γ_n}| = O{ n^{2κ} λ_max(Σ) },

when γ_n is taken of order n^{-κ}, where λ_max(Σ) is the largest eigenvalue of the covariance matrix of the covariates. In words, the size of the selected model depends on how large the thresholding parameter γ_n is and on how correlated the features are. This is the same as, or a somewhat stronger result than, that in Fan and Lv (2008) in terms of the selected model size, but it holds for a much more general class of models. In particular, few restrictions are imposed on the correlation structure of the covariates.

Fan and Song (2009) also study feature screening by using the marginal likelihood ratio. For each feature, define the marginal utility L_j as the reduction of the negative marginal log-likelihood achieved by including the j-th feature alone,

(31)

and rank the features according to the marginal utility L_j. Thus, select the set of variables

(32)    N̂_{ν_n} = {1 ≤ j ≤ p : L_j ≥ ν_n},

where ν_n is a predefined threshold value. Let L_j^* be the population counterpart of L_j. The minimum signal min_{j ∈ M_*} L_j^* decays at a polynomial rate in n, whereas the stochastic noise in the individual utilities is of the parametric order n^{-1/2}; when the signal decays faster than this, the noise level is larger than the signal. This is the key technical challenge. By using the fact that the ranking is invariant under monotonic transformations, Fan and Song (2009) are able to show that, with the threshold ν_n chosen to be a sufficiently small multiple of the minimum signal level, the sure screening property holds with a limited size of the selected model.

4.4 Reduction of false positive rate

A screening method is usually a crude approach that results in many false positive variables. A simple way to reduce the false positive rate is to apply a resampling technique, as proposed by Fan, Samworth and Wu (2009). Split the sample randomly into two halves and let M̂^(1) and M̂^(2) be the sets of active variables selected from, respectively, the first half and the second half of the sample. If M̂^(1) and M̂^(2) both have the sure screening property, then so does the intersection M̂^(1) ∩ M̂^(2). On the other hand, the intersection contains many fewer falsely selected variables, as an unimportant variable would have to be selected twice at random in the ultra-high dimensional space, which is very unlikely. Therefore, intersecting the two sets reduces the number of false positive variables.
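
The random-splitting device is straightforward to sketch: screen each half separately and intersect the two selected sets. The code below reuses sis.screen from Section 4.1 and illustrates the idea only; it is not the authors' procedure.

    # Reduce false positives by intersecting screened sets from two random halves.
    split.screen.sets <- function(X, y, d) {
      n <- nrow(X)
      half <- sample(n, floor(n / 2))
      A1 <- sis.screen(X[half, , drop = FALSE],  y[half],  d)
      A2 <- sis.screen(X[-half, , drop = FALSE], y[-half], d)
      intersect(A1, A2)     # keeps sure screening, drops most spurious variables
    }

Any other screening rule with the sure screening property, for example the MMLE screening sketched in Section 4.3, can be substituted for sis.screen here.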

Write A for the set of active indices, that is, the set containing those indices j for which β_j ≠ 0 in the true model. Let d be the size of the selected sets M̂^(1) and M̂^(2). Under some exchangeability conditions, Fan, Samworth and Wu (2009) demonstrate that the probability of the intersection containing at least r inactive variables obeys a bound of the form

(33)

where, for the second inequality in their bound, a mild condition relating the selected model size d to the dimensionality p is required. In other words, the probability of selecting at least r inactive variables is very small when d is small compared to p, as in the situations discussed in the previous two sections.

4.5 Iterative sure independence screening

SIS uses only the marginal information of the covariates and its sure screening property can fail when technical conditions are not satisfied. Fan and Lv (2008) point out three potential problems with SIS:

  • (False Negative) An important predictor that is marginally uncorrelated but jointly correlated with the response cannot be picked up by SIS. An example of this has the covariate vector x jointly normal with a common equi-correlation among its components, while the response depends on the covariates through a linear combination chosen so that one important predictor is exactly uncorrelated with the response at the population level.

    Clearly, that predictor is marginally independent of the response, and hence has vanishing marginal correlation, yet its regression coefficient can be much larger than those of the other variables. Such a hidden signature variable cannot be picked up by independence learning, but it has a dominant predictive power on the response.

  • (False Positive) Unimportant predictors that are highly correlated with the important predictors can have a higher priority to be selected by SIS than important predictors that are relatively weakly related to the response. An illustrative example has one important predictor that is independent of all the other variables, which share a common correlation.

    In this example, the marginal correlation of the independent important predictor with the response is the weakest, so it has the lowest priority to be selected, even though it belongs to the true model.

  • The issue of collinearity among the predictors adds difficulty to the problem of variable selection.

Translating the first problem (false negatives) to microarray data analysis, a two-sample test can never pick up a hidden signature gene. Yet, missing the hidden signature gene can result in very poor understanding of the molecular mechanism and in poor disease classification. Fan and Lv (2008) address these issues by proposing an iterative SIS (ISIS) that extends SIS and uses the joint information of the covariates more fully, while still maintaining computational expediency.

Fan, Samworth and Wu (2009) extend and improve the idea of ISIS from the multiple regression model to the more general loss function in (16); this includes, in addition to the negative log-likelihood, the hinge loss and the exponential loss in classification, in which the response takes values ±1, among others. The ψ-learning approach (Shen et al. (2003)) can also be cast in this framework. ISIS also allows variable deletion in the process of iteration. More generally, suppose that our objective is to find a sparse β that minimizes the sample average of a loss L(y_i, x_i^T β), as in (16).

The algorithm goes as follows.

  1. Apply an SIS method such as (32) to pick a set of indices of moderate size, and then employ a penalized (pseudo)-likelihood method such as (15) to select a subset of these indices.

  2. (Large-scale screening) Instead of computing residuals as in Fan and Lv (2008), compute