SPARSE COVARIANCE THRESHOLDING FOR
HIGH-DIMENSIONAL VARIABLE SELECTION
X. Jessie Jeng and Z. John Daye
Purdue University
Abstract: In high dimensions, many variable selection methods, such as the lasso, are often limited by excessive variability and rank deficiency of the sample covariance matrix. Covariance sparsity is a natural phenomenon in high-dimensional applications, such as microarray analysis and image processing, in which a large number of predictors are independent or weakly correlated. In this paper, we propose the covariance-thresholded lasso, a new class of regression methods that utilizes covariance sparsity to improve variable selection. We establish theoretical results, under the random design setting, that relate covariance sparsity to variable selection. Real-data and simulation examples indicate that our method can be useful in improving variable selection performance.
Key words and phrases: Consistency, covariance sparsity, large $p$ small $n$, random design, regression, regularization.
1. Introduction
Variable selection in high-dimensional regression is a central problem in statistics and has stimulated much interest in the past few years. Motivation for developing effective variable selection methods in high dimensions comes from a variety of applications, such as gene microarray analysis and image processing, where it is necessary to identify a parsimonious subset of predictors to improve interpretability and prediction accuracy. In this paper, we consider the following linear model for a vector of predictors $x = (x_1, \ldots, x_p)^T$ and a response variable $y$,
(1.1)    $y = x^T \beta + \epsilon,$
where $\beta = (\beta_1, \ldots, \beta_p)^T$ is a vector of regression coefficients and $\epsilon$ is a normal random error with mean 0 and variance $\sigma^2$. If $\beta_j$ is nonzero, then $x_j$ is said to be a true variable; otherwise, it is an irrelevant variable. Further, when only a few coefficients $\beta_j$ are believed to be nonzero, we refer to (1.1) as a sparse linear model. The purpose of variable selection is to separate the true variables from the irrelevant ones based upon $n$ observations of the model. In many applications, $p$ can be fairly large or even larger than $n$. The problem of large $p$ and small $n$ presents a fundamental challenge for variable selection.
Recently, various methods based upon penalized least squares have been proposed for variable selection. The lasso, introduced by Tibshirani (1996), is the forerunner and foundation for many of these methods. Suppose that $y$ is an $n \times 1$ vector of observed responses centered to have mean 0 and $X$ is an $n \times p$ data matrix with each column standardized to have mean zero and variance 1. We may reformulate the lasso as the following,
(1.2)    $\hat{\beta}^{lasso}(\lambda) = \arg\min_{\beta} \left\{ \tfrac{1}{2} \beta^T \tilde{C} \beta - \tilde{\rho}^T \beta + \lambda \|\beta\|_1 \right\},$
where $\tilde{C} = n^{-1} X^T X$ is the sample covariance or correlation matrix and $\tilde{\rho} = n^{-1} X^T y$. Consistency in variable selection for the lasso has been proved under the neighborhood stability condition in Meinshausen and Buhlmann (2006) and under the irrepresentable condition in Zhao and Yu (2006). Compared with traditional variable selection procedures, such as all-subset selection, AIC, and BIC, the lasso has continuous solution paths and can be computed efficiently using innovative algorithms, such as the LARS in Efron, Hastie, Johnstone, and Tibshirani (2004). Since its introduction, the lasso has emerged as one of the most widely used methods for variable selection.
In the lasso literature, the data matrix $X$ is often assumed to be fixed. However, this assumption may not be realistic in high-dimensional applications, where data usually come from observational rather than experimental studies. In this paper, we assume the predictors in (1.1) to be random with mean 0 and variance 1. In addition, we assume that the population covariance matrix $\Sigma = \mathrm{Var}(x)$ is sparse in the sense that the proportion of nonzero entries in $\Sigma$ is relatively small. Motivations for studying sparse covariance matrices come from a myriad of applications in high dimensions, where a large number of predictors can be independent or weakly correlated with each other. For example, in gene microarray analysis, it is often reasonable to assume that genes belonging to different pathways or systems are independent or weakly correlated (Rothman, Levina, and Zhu 2009; Wagaman and Levina 2008). In these applications, the number of nonzero covariances in $\Sigma$ can be much smaller than $p^2$, the total number of covariances.
An important component of lasso regression (1.2) is the sample covariance matrix $\tilde{C}$. We note that the sample covariance matrix is rank-deficient when $p > n$. This can cause the lasso to saturate after at most $n$ variables are selected. Moreover, the 'large $p$ and small $n$' scenario can cause excessive variability of sample covariances between the true and irrelevant variables, which deteriorates the ability of the lasso to separate true variables from irrelevant ones. More specifically, a sufficient and almost necessary condition for the lasso to be variable selection consistent is derived in Zhao and Yu (2006), which they call the irrepresentable condition. It poses a constraint on the interconnectivity between the true and irrelevant variables in the following way. Let $A = \{j : \beta_j \neq 0\}$ and $B = A^c$, such that $A$ is the collection of true variables and $B$ is the complement of $A$ that is composed of the irrelevant variables. Assume that the cardinality of $A$ is $q$; in other words, there are $q$ true variables and $p - q$ irrelevant ones. Further, let $X_A$ and $X_B$ be sub-data matrices of $X$ that contain the observations of the true and irrelevant variables, respectively. Define $\tilde{\Omega} = \tilde{C}_{BA} \tilde{C}_{AA}^{-1}$, where $\tilde{C}_{BA} = n^{-1} X_B^T X_A$ and $\tilde{C}_{AA} = n^{-1} X_A^T X_A$. We refer to $\tilde{\Omega}$ as the sample irrepresentable index. It can be interpreted as representing the amount of interconnectivity between the true and irrelevant variables. In order for the lasso to select the true variables consistently, the irrepresentable condition requires $\tilde{\Omega}$ to be bounded from above, that is, $|\tilde{\Omega}| \leq 1 - \eta$ for some $\eta > 0$, entrywise. Clearly, excessive variability of the sample covariance matrix induced by large $p$ and small $n$ can cause $\tilde{\Omega}$ to exhibit large variation that makes the irrepresentable condition less likely to hold. These inadequacies motivate us to consider alternatives to the sample covariance matrix to improve variable selection for the lasso in high dimensions.
Next, we provide some insight on how the sparsity of the population covariance matrix $\Sigma$ can influence variable selection for the lasso. Under the random design assumption on $X$, the interconnectivity between the true and irrelevant variables can be stated in terms of their population variances and covariances. Let $\Sigma_{BA}$ be the covariance matrix between the irrelevant variables and true variables and $\Sigma_{AA}$ the variance-covariance matrix of the true variables. We define the population irrepresentable index as $\Omega = \Sigma_{BA} \Sigma_{AA}^{-1}$. Intuitively, the sparser the population covariances $\Sigma_{BA}$ and $\Sigma_{AA}$ are, or the sparser $\Sigma$ is, the more likely that $|\Omega| \leq 1 - \eta$, entrywise. This property, however, does not automatically trickle down to the sample irrepresentable index $\tilde{\Omega}$, due to its excessive variability. When $\Sigma_{BA}$ and $\Sigma_{AA}$ are known a priori to be sparse and $|\Omega| \leq 1 - \eta$, entrywise, some regularization on the covariance can be used to reduce the variabilities of $\tilde{C}_{BA}$ and $\tilde{C}_{AA}$ and allow the irrepresentable condition to hold more easily for $\tilde{\Omega}$. Furthermore, the sample covariance matrix is obviously nonsparse, and imposing sparsity on $\tilde{C}$ has the benefit of sometimes increasing the rank of the sample covariance matrix.
We use an example to demonstrate how rank deficiency and excessive variability of the sample covariance matrix can compromise the performance of the lasso for large $p$ and small $n$. Suppose there are 40 variables ($p = 40$) and $\Sigma = I$ ($I$ is the identity matrix). Since all variables are independent of each other, the population irrepresentable index clearly satisfies $|\Omega| \leq 1 - \eta$, entrywise. Further, we let $\beta_j = 2$, for $1 \leq j \leq 10$, and $\beta_j = 0$, for $11 \leq j \leq 40$. The error standard deviation is set to be about 6.3 to have a signal-to-noise ratio of approximately 1. The lasso, in general, does not take into consideration the structural properties of the model, such as the sparsity or the orthogonality of $\Sigma$ in this example. One way to take advantage of the orthogonality of $\Sigma$ is to replace $\tilde{C}$ in (1.2) by the identity matrix $I$, which leads to the univariate soft thresholding (UST) estimates $\hat{\beta}_j^{UST}(\lambda) = \mathrm{sign}(\tilde{\rho}_j)(|\tilde{\rho}_j| - \lambda)_+$, where $\tilde{\rho}_j = n^{-1} x_j^T y$ for $j = 1, \ldots, p$. We compare the performances of the lasso and UST over various sample sizes $n$ using the variable selection measure $G$, defined as the geometric mean between sensitivity, the proportion of true variables selected correctly, and specificity, the proportion of irrelevant variables discarded correctly (Tibshirani, Saunders, Rosset, Zhu, and Knight 2005; Chong and Jun 2005; Kubat, Holte, and Matwin 1998). $G$ varies between 0 and 1; a larger $G$ indicates better selection, with a larger proportion of variables classified correctly.
Figure 1 plots the median $G$ based on 200 replications for the lasso and UST against sample sizes. For each replication, $G$ is determined ex post facto by the optimal $\lambda$ in order to avoid stochastic errors from tuning parameter estimation, such as by using cross-validation. It is clear from Figure 1 that, when $n$ is large, the lasso slightly outperforms UST; as $n$ decreases relative to $p$, the performance of the lasso deteriorates precipitously, whereas the performance of UST declines at a much slower pace and starts to outperform the lasso. This example suggests that, when $p$ is large and $n$ is relatively small, sparsity of $\Sigma$ can be used to enhance variable selection.
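The UST estimator and the $G$ measure above are straightforward to compute. The following Python sketch generates one replication under settings matching our reading of this example ($p = 40$, $\Sigma = I$, ten nonzero coefficients of 2, error standard deviation 6.3); the function names, sample size, and threshold level are ours, chosen for illustration only:

```python
import math
import random

def soft_threshold(z, lam):
    """Univariate soft thresholding: sign(z) * (|z| - lam)_+."""
    return math.copysign(max(abs(z) - lam, 0.0), z)

def g_measure(selected, true_vars, p):
    """G: geometric mean of sensitivity and specificity."""
    q = len(true_vars)
    tp = len(selected & true_vars)   # true variables selected
    fp = len(selected - true_vars)   # irrelevant variables selected
    sens = tp / q
    spec = ((p - q) - fp) / (p - q)
    return math.sqrt(sens * spec)

# One replication: p = 40, Sigma = I, ten nonzero coefficients,
# error standard deviation about 6.3 (signal-to-noise ratio near 1).
random.seed(1)
n, p, q = 50, 40, 10
beta = [2.0] * q + [0.0] * (p - q)
X = [[random.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
y = [sum(b * x for b, x in zip(beta, row)) + random.gauss(0.0, 6.3)
     for row in X]

# UST: soft-threshold the marginal covariances rho_j = x_j' y / n.
lam = 1.0
rho = [sum(X[i][j] * y[i] for i in range(n)) / n for j in range(p)]
selected = {j for j in range(p) if soft_threshold(rho[j], lam) != 0.0}
print(g_measure(selected, set(range(q)), p))
```

In practice one would sweep `lam` over a grid and record the best $G$, mimicking the ex post facto tuning used for Figure 1.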
The discussions above motivate us to consider improving the performance of the lasso by applying regularization to the sample covariance matrix $\tilde{C}$. A good sparse covariance-regularizing operator on $\tilde{C}$ should satisfy the following properties:

1. The operator stabilizes $\tilde{C}$.

2. The operator can increase the rank of $\tilde{C}$.

3. The operator utilizes the underlying sparsity of the covariance matrix.
The first and second properties are obviously useful and have been explored in the literature. For example, the elastic net, introduced in Zou and Hastie (2005), replaces $\tilde{C}$ by $(\tilde{C} + \nu I)/(1 + \nu)$ in (1.2), where $\nu \geq 0$ is a tuning parameter. This matrix can be more stable and have higher rank than $\tilde{C}$ but is nonsparse. Nonetheless, in many applications, utilizing the underlying sparsity may be more crucial in improving the lasso when data are scarce, such as under the large $p$ and small $n$ scenario.
Recently, various regularization methods have been proposed in the literature for estimating high-dimensional variance-covariance matrices. Some examples include tapering, proposed by Furrer and Bengtsson (2007), banding by Bickel and Levina (2008b), thresholding by Bickel and Levina (2008a) and El Karoui (2008), and generalized thresholding by Rothman, Levina, and Zhu (2009). We note that covariance-thresholding operators can satisfy all three properties outlined in the previous paragraph; in particular, they can generate sparse covariance estimates to accommodate the covariance sparsity assumption. In this paper, we propose to apply covariance thresholding to the sample covariance matrix in (1.2) to stabilize and improve the performance of the lasso. We call this procedure the covariance-thresholded lasso. We establish theoretical results that relate the sparsity of the covariance matrix to variable selection and compare them to those of the lasso. Simulation and real-data examples are reported. Our results suggest that the covariance-thresholded lasso can improve upon the lasso, adaptive lasso, and elastic net, especially when $\Sigma$ is sparse, $n$ is small, and $p$ is large. Even when the underlying covariance is nonsparse, the covariance-thresholded lasso is still useful in providing robust variable selection in high dimensions.
Witten and Tibshirani (2009) have recently proposed the scout procedure, which applies regularization to the inverse covariance, or precision, matrix. We note that this is quite different from the covariance-thresholded lasso, which regularizes the sample covariance matrix directly. Furthermore, the scout penalizes an estimate of the precision matrix $\Sigma^{-1}$ using a matrix norm, whereas the covariance-thresholded lasso regularizes individual covariances directly. In our results, we will show that the scout is potentially very similar to the elastic net and that the covariance-thresholded lasso can often outperform the scout in terms of variable selection.
The rest of the paper is organized as follows. In Section 2, we present the covariance-thresholded lasso in detail and a modified LARS algorithm for our method. We discuss a generalized class of covariance-thresholding operators and explain how covariance thresholding can stabilize the LARS algorithm for the lasso. In Section 3, we establish theoretical results on variable selection for the covariance-thresholded lasso. The effect of covariance sparsity on variable selection is especially highlighted. In Section 4, we provide simulation results for the covariance-thresholded lasso, and, in Section 5, we compare the performances of the covariance-thresholded lasso with those of the lasso, adaptive lasso, and elastic net using three real-data sets. Section 6 concludes with further discussions and implications.
2. The Covariance-Thresholded Lasso
Suppose that the response $y$ is centered and each column of the data matrix $X$ is standardized, as in the lasso (1.2). We define the covariance-thresholded lasso estimate as
(2.3)    $\hat{\beta}(\lambda, \nu) = \arg\min_{\beta} \left\{ \tfrac{1}{2} \beta^T \tilde{C}^{\nu} \beta - \tilde{\rho}^T \beta + \lambda \|\beta\|_1 \right\},$
where $\tilde{C}^{\nu}$ is the covariance-thresholded matrix with entries $\tilde{C}^{\nu}_{jj} = \tilde{C}_{jj}$ and $\tilde{C}^{\nu}_{jk} = s_{\nu}(\tilde{C}_{jk})$ for $j \neq k$, $\tilde{\rho} = n^{-1} X^T y$, and $s_{\nu}(\cdot)$ is a predefined covariance-thresholding operator with thresholding parameter $\nu \geq 0$. If the identity function is used as the covariance-thresholding operator, that is, $s_{\nu}(z) = z$ for any $z$, then $\hat{\beta}(\lambda, \nu) = \hat{\beta}^{lasso}(\lambda)$.
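For intuition, when the thresholded matrix is positive semidefinite with positive diagonal entries, the criterion in (2.3) can be minimized by simple cyclic coordinate descent, updating one coefficient at a time by univariate soft thresholding. The Python sketch below is ours and is not the modified LARS algorithm of Section 2.2; it is a minimal illustration of the objective only:

```python
import math

def soft(z, lam):
    # Soft-thresholding operator: sign(z) * (|z| - lam)_+.
    return math.copysign(max(abs(z) - lam, 0.0), z)

def ct_lasso_cd(C, rho, lam, iters=200):
    """Cyclic coordinate descent for (1/2) b'Cb - rho'b + lam * ||b||_1.
    C plays the role of the (thresholded) covariance matrix; assumed
    positive semidefinite with positive diagonal entries."""
    p = len(rho)
    b = [0.0] * p
    for _ in range(iters):
        for j in range(p):
            # Partial residual correlation with coefficient j held out.
            partial = rho[j] - sum(C[j][k] * b[k] for k in range(p) if k != j)
            b[j] = soft(partial, lam) / C[j][j]
    return b

# With C = I the solution reduces to univariate soft thresholding.
print(ct_lasso_cd([[1.0, 0.0], [0.0, 1.0]], [1.0, 0.1], 0.5))  # -> [0.5, 0.0]
```

Coordinate descent requires convexity of (2.3); as Section 2.2 notes, the thresholded matrix can have negative eigenvalues for intermediate $\nu$, in which case this sketch does not apply.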
2.1. Sparse Covariance-thresholding Operators
We consider a generalized class of covariance-thresholding operators introduced in Rothman, Levina, and Zhu (2009). These operators satisfy the following properties,

(2.4)    (i) $s_{\nu}(z) = 0$ for $|z| \leq \nu$;  (ii) $|s_{\nu}(z)| \leq |z|$;  (iii) $|s_{\nu}(z) - z| \leq \nu$.
The first property enforces sparsity for covariance estimation; the second allows shrinkage of covariances; and the third limits the amount of shrinkage. These operators satisfy the desired properties outlined in the Introduction for sparse covariance-regularizing operators and represent a wide spectrum of thresholding procedures that can induce sparsity and stabilize the sample covariance matrix. In this paper, we consider the following covariance-thresholding operators, applied to the off-diagonal entries of $\tilde{C}$.
(2.5)    hard thresholding, $s_{\nu}(z) = z \, 1(|z| > \nu)$;

(2.6)    soft thresholding, $s_{\nu}(z) = \mathrm{sign}(z)(|z| - \nu)_+$;

(2.7)    adaptive thresholding, $s_{\nu}(z) = z \, (1 - |\nu/z|^{\eta})_+$ for some $\eta > 0$.
The above operators are used in Rothman, Levina, and Zhu (2009) for estimating variance-covariance matrices, and it is easy to check that they satisfy the properties in (2.4).
Figure 2. The covariance-regularizing operators: CT Hard, CT Soft, CT Adapt, and Elastic Net.
In Figure 2, we depict the sparse covariance-thresholding operators (2.5)-(2.7) for varying $\nu$. Hard thresholding presents a discontinuous thresholding of covariances, whereas soft thresholding offers continuous shrinkage. Adaptive thresholding applies less regularization to covariances with large magnitudes than soft thresholding.
Figure 2 further includes the elastic net covariance-regularizing operator, $s_{\nu}(z) = z/(1 + \nu)$. Apparently, this operator is nonsparse and does not satisfy the first property in (2.4). In particular, we see that the elastic net penalizes covariances with large magnitudes more severely than those with small magnitudes. In some situations, this has the benefit of alleviating multicollinearity, as it shrinks covariances of highly correlated variables. However, under high-dimensionality, much of the random perturbation of the covariance matrix arises from small but numerous covariances; in attempting to control these variabilities, the elastic net may inadvertently penalize covariances with large magnitudes severely, which can introduce large bias in estimation and compromise its performance under some scenarios.
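The operators of Figure 2 are one-liners in code. The Python sketch below is ours: the exponent `eta` in the adaptive rule and the proportional form `z / (1 + nu)` for the elastic net are assumptions made for illustration. The final loop numerically checks the three generalized thresholding properties (2.4):

```python
def hard_threshold(z, nu):
    # (2.5): keep z only if its magnitude exceeds the threshold nu.
    return z if abs(z) > nu else 0.0

def soft_threshold(z, nu):
    # (2.6): sign(z) * (|z| - nu)_+, shrinking every entry toward 0.
    s = abs(z) - nu
    return (1.0 if z > 0 else -1.0) * s if s > 0 else 0.0

def adaptive_threshold(z, nu, eta=2.0):
    # (2.7): z * (1 - |nu/z|^eta)_+; large |z| is shrunk relatively less.
    if abs(z) <= nu:
        return 0.0
    return z * (1.0 - abs(nu / z) ** eta)

def elastic_net_shrink(z, nu):
    # Elastic net operator: proportional shrinkage, never exactly zero.
    return z / (1.0 + nu)

# Verify the properties (2.4) for the three sparse operators on a grid.
for z in [x / 10.0 for x in range(-20, 21)]:
    for s in (hard_threshold, soft_threshold, adaptive_threshold):
        out = s(z, 0.5)
        assert out == 0.0 or abs(z) > 0.5    # (i)  sparsity below nu
        assert abs(out) <= abs(z) + 1e-12    # (ii) shrinkage
        assert abs(out - z) <= 0.5 + 1e-12   # (iii) limited shrinkage
```

Running the loop confirms that the elastic net operator, by contrast, would fail property (i): it never maps a nonzero covariance exactly to zero.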
2.2. Computations
The lasso solution paths are shown to be piecewise linear in Efron, Hastie, Johnstone, and Tibshirani (2004) and Rosset and Zhu (2007). This property allows Efron, Hastie, Johnstone, and Tibshirani (2004) to propose the efficient LARS algorithm for the lasso. Likewise, in this section, we propose a piecewise-linear algorithm for the covariance-thresholded lasso.
We note that the loss function in (2.3) can sometimes be nonconvex, since $\tilde{C}^{\nu}$ may possess negative eigenvalues for some $\nu$. This usually occurs for intermediary values of $\nu$, as $\tilde{C}^{\nu}$ is at least positive semidefinite for $\nu$ close to 0 or 1. Furthermore, we note that the penalty $\lambda \|\beta\|_1$ is a convex function and dominates in (2.3) for $\lambda$ large. Intuitively, this means that the optimization problem for the covariance-thresholded lasso is almost convex when the solution is sparse. This is stated conservatively in the following theorem by using the second-order condition from nonlinear programming (McCormick 1976).
Theorem 2.1
Let $\nu$ be fixed. If $\tilde{C}^{\nu}$ is positive semidefinite, the covariance-thresholded lasso solutions for (2.3) are piecewise linear with respect to $\lambda$. If $\tilde{C}^{\nu}$ possesses negative eigenvalues, a set of covariance-thresholded lasso solutions, which may be local minima for (2.3) under strict complementarity, is piecewise linear with respect to $\lambda$ for $\lambda > \lambda_0$, where $\lambda_0$ is a lower bound determined by the minimum eigenvalue of $\tilde{C}^{\nu}$.
The proof for Theorem 2.1 is outlined in Appendix 7.6. Strict complementarity, described in Appendix 7.6, is a technical condition that allows the second-order condition to be more easily interpreted and usually holds with high probability. We note that, when $\tilde{C}^{\nu}$ has negative eigenvalues, the solution is global if the submatrix of $\tilde{C}^{\nu}$ on the active set is positive definite at every step of the path. Theorem 2.1 suggests that piecewise linearity of the covariance-thresholded lasso solution path sometimes may not hold for some $\nu$ when $\lambda$ is small, even if a solution may well exist. This restricts the set of tuning parameters $(\lambda, \nu)$ for which we can compute the solutions of the covariance-thresholded lasso efficiently using a LARS-type algorithm. We note that the elastic net does not suffer from a potentially nonconvex optimization. However, as we will demonstrate in Figure 3 of Section 4, the covariance-thresholded lasso with restricted sets of $(\lambda, \nu)$ is, nevertheless, rich enough to dominate the elastic net in many situations.
Theorem 2.1 establishes that a set of covariance-thresholded lasso solutions is piecewise linear. This further provides us with an efficient modified LARS algorithm for computing the covariance-thresholded lasso. Let

(2.8)    $\hat{c}_j = \tilde{\rho}_j - (\tilde{C}^{\nu} \hat{\beta})_j, \quad j = 1, \ldots, p,$

be estimates for the covariate-residual correlations $n^{-1} x_j^T (y - X \hat{\beta})$. Further, we denote the minimum eigenvalue of $\tilde{C}^{\nu}$ as $\tau$. The covariance-thresholded lasso can be computed with the following algorithm.

ALGORITHM: Covariance-thresholded LARS

1. Initialize $\hat{\beta} = 0$, so that $\hat{c}_j = \tilde{\rho}_j$ in (2.8), and let $\lambda = \max_j |\hat{c}_j|$. Let the active set be $\mathcal{A} = \{\arg\max_j |\hat{c}_j|\}$, and set the direction $w_{\mathcal{A}} = (\tilde{C}^{\nu}_{\mathcal{A}\mathcal{A}})^{-1} \mathrm{sign}(\hat{c}_{\mathcal{A}})$ with $w_j = 0$ for $j \notin \mathcal{A}$.

2. Let $\gamma_1$ be the smallest step at which an active coefficient $\hat{\beta}_j + \gamma w_j$, $j \in \mathcal{A}$, reaches 0, and $\gamma_2$ the smallest step at which an inactive variable $j \notin \mathcal{A}$ attains $|\hat{c}_j| = \lambda - \gamma$, where each minimum is taken only over positive elements.

3. Let $\gamma = \min(\gamma_1, \gamma_2)$, update $\hat{\beta} \leftarrow \hat{\beta} + \gamma w$ and $\lambda \leftarrow \lambda - \gamma$, and recompute $\hat{c}$ in (2.8).

4. If $\gamma_1 \leq \gamma_2$, remove the variable hitting 0 at $\gamma_1$ from $\mathcal{A}$. If $\gamma_2 < \gamma_1$, add the variable first attaining equality at $\gamma_2$ to $\mathcal{A}$.

5. Compute the new direction, $w_{\mathcal{A}} = (\tilde{C}^{\nu}_{\mathcal{A}\mathcal{A}})^{-1} \mathrm{sign}(\hat{c}_{\mathcal{A}})$ and $w_j = 0$ for $j \notin \mathcal{A}$.

6. Repeat steps 2-5 until $\lambda = 0$ or, when $\tau < 0$, until $\lambda$ reaches the lower bound $\lambda_0$ of Theorem 2.1.
The covariate-residual correlations $\hat{c}$ are the most crucial quantities for computing the solution paths. They determine the variable to be included at each step and relate directly to the tuning parameter $\lambda$. In the original LARS for the lasso, $\hat{c}$ is estimated as $\tilde{\rho} - \tilde{C} \hat{\beta}$, which uses the sample covariance matrix without thresholding. In covariance-thresholded LARS, $\hat{c}$ is defined using the covariance-thresholded estimate $\tilde{C}^{\nu}$, which may contain many zeros. We note that, in (2.8), zero-valued covariances have the effect of essentially removing the associated coefficients from the estimate, providing parsimonious estimates for the covariate-residual correlations. This allows covariance-thresholded LARS to estimate $\hat{c}$ in a more stable way than the LARS. It is clear that covariance-thresholded LARS presents an advantage if the population covariance is sparse. On the other hand, if the covariance is nonsparse, covariance-thresholded LARS can still outperform the LARS when the sample size is small or the data are noisy. This is because parsimonious estimates of $\hat{c}$ can be more robust against random variability of the data.
Moreover, consider computing the direction of the solution paths in Step 5, which is used for updating $\hat{\beta}$. LARS for the lasso updates new directions with $(\tilde{C}_{\mathcal{A}\mathcal{A}})^{-1} \mathrm{sign}(\hat{c}_{\mathcal{A}})$, whereas covariance-thresholded LARS uses $(\tilde{C}^{\nu}_{\mathcal{A}\mathcal{A}})^{-1} \mathrm{sign}(\hat{c}_{\mathcal{A}})$. Apparently, covariance-thresholded LARS can exploit potential covariance sparsity to improve and stabilize estimates of the directions of the solution paths. In addition, the LARS for the lasso can stop early, before all true variables can be considered, if the sample covariance submatrix is rank deficient at an early stage when the sample size is limited. Covariance thresholding can mitigate this problem by proceeding further with properly chosen values of $\nu$. For example, when $\nu$ is large, $\tilde{C}^{\nu}$ converges towards the identity matrix $I$, which is full-ranked.
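The rank claim can be checked numerically. The following self-contained Python sketch (pure standard library; the helper names are ours) builds a sample correlation matrix with more predictors than observations, hard-thresholds its off-diagonal entries at a level high enough to zero them out, and shows that thresholding raises the rank:

```python
import random

def matrix_rank(M, tol=1e-10):
    """Rank of a square matrix via Gaussian elimination with pivoting."""
    A = [row[:] for row in M]
    dim = len(A)
    rank = 0
    for col in range(dim):
        if rank == dim:
            break
        pivot = max(range(rank, dim), key=lambda i: abs(A[i][col]))
        if abs(A[pivot][col]) < tol:
            continue
        A[rank], A[pivot] = A[pivot], A[rank]
        for i in range(rank + 1, dim):
            f = A[i][col] / A[rank][col]
            for j in range(col, dim):
                A[i][j] -= f * A[rank][j]
        rank += 1
    return rank

def sample_correlation(X):
    """Sample correlation matrix of the columns of X."""
    n_obs, p_dim = len(X), len(X[0])
    mean = [sum(row[j] for row in X) / n_obs for j in range(p_dim)]
    sd = [(sum((row[j] - mean[j]) ** 2 for row in X) / n_obs) ** 0.5
          for j in range(p_dim)]
    Z = [[(row[j] - mean[j]) / sd[j] for j in range(p_dim)] for row in X]
    return [[sum(Z[i][j] * Z[i][k] for i in range(n_obs)) / n_obs
             for k in range(p_dim)] for j in range(p_dim)]

def hard_threshold_offdiag(C, nu):
    """Zero the off-diagonal entries with magnitude at most nu."""
    p_dim = len(C)
    return [[C[j][k] if j == k or abs(C[j][k]) > nu else 0.0
             for k in range(p_dim)] for j in range(p_dim)]

random.seed(0)
n, p = 5, 8  # fewer observations than predictors: C is rank deficient
X = [[random.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
C = sample_correlation(X)
print(matrix_rank(C))                                  # less than p
print(matrix_rank(hard_threshold_offdiag(C, 0.999)))   # full rank p
```

With the threshold near 1, the thresholded matrix is essentially diagonal with unit diagonal entries, so it is full-ranked, illustrating the limiting behavior described above.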
3. Theoretical Results on Variable Selection
In this section, we derive sufficient conditions for the covariance-thresholded lasso to be consistent in selecting the true variables. We relate covariance sparsity with variable selection and demonstrate the pivotal role that covariance sparsity plays in improving variable selection under high-dimensionality. Furthermore, variable selection results for the lasso under the random design are derived and compared with those of the covariance-thresholded lasso. We show that the covariance-thresholded lasso, by utilizing covariance sparsity through a properly chosen thresholding level $\nu$, can improve upon the lasso in terms of variable selection.
For simplicity, we assume that a solution for (2.3) exists and denote the covariance-thresholded lasso estimate by $\hat{\beta}$ in this section. Further, we let $\hat{A} = \{j : \hat{\beta}_j \neq 0\}$ represent the collection of indices of nonzero coefficients. We say that the covariance-thresholded lasso estimate is variable selection consistent if $P(\hat{A} = A) \to 1$. In addition, we say that $\hat{\beta}$ is sign consistent if $P(\mathrm{sign}(\hat{\beta}) = \mathrm{sign}(\beta)) \to 1$, where $\mathrm{sign}(z) = 1$, $-1$, and $0$ when $z > 0$, $z < 0$, and $z = 0$, respectively (Zhao and Yu 2006). Obviously, sign consistency is a stronger property and implies variable selection consistency.
We introduce two quantities to characterize the sparsity of $\Sigma$ that play a pivotal role in the performance of the covariance-thresholded lasso. Recall that $A$ and $B$ are the collections of the true and irrelevant variables, respectively. Define

(3.9)    $d_A = \max_{j \in A} \#\{k \in A : \Sigma_{kj} \neq 0\}$  and  $d_{AB} = \max_{j \in B} \#\{k \in A : \Sigma_{kj} \neq 0\}$.
$d_A$ ranges between 1 and $q$. When $d_A = 1$, all pairs of the true variables are orthogonal. When $d_A = q$, there is at least one true variable correlated with all the other true variables. Similarly, $d_{AB}$ is between 0 and $q$. When $d_{AB} = 0$, the true and irrelevant variables are orthogonal to each other, and, when $d_{AB} = q$, some irrelevant variables are correlated with all the true variables. The values of $d_A$ and $d_{AB}$ represent the sparsity of the covariance submatrices for the true variables and between the irrelevant and true variables, respectively. We have not specified the sparsity of the submatrix for the irrelevant variables themselves. It will be clear later that it is the structure of $\Sigma_{AA}$ and $\Sigma_{BA}$, instead of $\Sigma_{BB}$, that plays the pivotal role in variable selection. We note that $d_A$ and $d_{AB}$ are related to another notion of sparsity used in Bickel and Levina (2008a) to define a uniformity class of sparse covariance matrices. We use the specific quantities $d_A$ and $d_{AB}$ in (3.9) in order to provide an easier presentation of our results for variable selection. Our results in this section can be applied to more general characterizations of sparsity, such as in Bickel and Levina (2008a).
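In code, these sparsity quantities are simple column-wise counts of nonzero covariances. The Python sketch below reflects our reading of (3.9), counting, for each true (respectively irrelevant) variable, how many true variables it has nonzero covariance with; the function name is ours:

```python
def sparsity_quantities(Sigma, A):
    """d_A: max number of true variables any true variable is
    correlated with (including itself); d_AB: max number of true
    variables any irrelevant variable is correlated with."""
    p = len(Sigma)
    A = set(A)
    B = [j for j in range(p) if j not in A]
    d_A = max(sum(1 for k in A if Sigma[k][j] != 0.0) for j in A)
    d_AB = max(sum(1 for k in A if Sigma[k][j] != 0.0) for j in B) if B else 0
    return d_A, d_AB

# Identity covariance: true variables are orthogonal, so d_A = 1, d_AB = 0.
I4 = [[1.0 if j == k else 0.0 for k in range(4)] for j in range(4)]
print(sparsity_quantities(I4, {0, 1}))  # -> (1, 0)
```

For a block-diagonal covariance in which the true variables are correlated only among themselves, the computation returns $d_A = q$ and $d_{AB} = 0$, the favorable case noted in the text.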
In this paper, we employ two different types of matrix norms. For an arbitrary matrix $M$, the infinity norm is defined as $\|M\|_{\infty} = \max_j \sum_k |M_{jk}|$, and the spectral norm is defined as $\|M\| = \sup_{\|x\|_2 = 1} \|Mx\|_2$. We use $\Lambda_{\max}(M)$ and $\Lambda_{\min}(M)$ to represent, respectively, the largest and smallest eigenvalues of $M$.
3.1. Sign Consistency of the Covariance-thresholded Lasso
We develop sign consistency results for the covariance-thresholded lasso. Proofs for the results are presented in the Appendix.
We first provide conditions for the covariance-thresholded lasso estimate to have the same signs as the true coefficients under the fixed design assumption. Let $\tilde{\rho}_A = n^{-1} X_A^T y$ and $\tilde{\rho}_B = n^{-1} X_B^T y$.

Lemma 3.1

Suppose that the data matrix $X$ is fixed and $\lambda$ is given. Then, $\mathrm{sign}(\hat{\beta}) = \mathrm{sign}(\beta)$ if

(3.10)    $\tilde{C}^{\nu}_{AA}$ is nonsingular,

(3.11)    $\mathrm{sign}\left( (\tilde{C}^{\nu}_{AA})^{-1} (\tilde{\rho}_A - \lambda \, \mathrm{sign}(\beta_A)) \right) = \mathrm{sign}(\beta_A),$

and

(3.12)    $\left| \tilde{\rho}_B - \tilde{C}^{\nu}_{BA} (\tilde{C}^{\nu}_{AA})^{-1} (\tilde{\rho}_A - \lambda \, \mathrm{sign}(\beta_A)) \right| \leq \lambda$, entrywise.
The above (3.10), (3.11), and (3.12) are derived from the Karush-Kuhn-Tucker (KKT) conditions for the optimization problem presented in (2.3) when the solution, which may be a local minimum, exists. Following the arguments in Zhao and Yu (2006) and Wainwright (2006), these conditions are almost necessary for $\hat{\beta}$ to have the correct signs. The condition (3.10) is needed for (3.11) and (3.12) to be valid; that is, the conditions (3.11) and (3.12) are ill-defined if $\tilde{C}^{\nu}_{AA}$ is singular.
Assume the random design setting, so that $X$ is drawn from some distribution with population covariance $\Sigma$. We demonstrate how the sparsity of $\Sigma_{AA}$ and the procedure of covariance thresholding work together to ensure that condition (3.10) is satisfied. We impose the following moment conditions on the random predictors $x_j$:
(3.13) 
for some constant $K$ and all integers $k \geq 1$. Assume that
(3.14) 
and $\nu$, $n$, and $q$ satisfy
(3.15) 
We have the following lemma.

Lemma 3.2

Assume conditions (3.13), (3.14), and (3.15). Then

(3.16)    $P\left( \tilde{C}^{\nu}_{AA} \text{ is nonsingular} \right) \to 1.$

The rate of convergence for (3.16) depends on the rate of convergence for (3.15). It is clear that the smaller $d_A$ is (or the sparser $\Sigma_{AA}$ is), the faster (3.15), as well as (3.16), converges. Equivalently, for sample size $n$ fixed, the smaller $d_A$ is, the larger the probability that $\tilde{C}^{\nu}_{AA}$ is nonsingular. In other words, covariance thresholding can help to fix potential rank deficiency of $\tilde{C}_{AA}$ when $\Sigma_{AA}$ is sparse. In the special case when $\Sigma_{AA} = I$, it can be shown that $\tilde{C}^{\nu}_{AA}$ is asymptotically positive definite under much weaker conditions on $n$ and $q$.
Next, we investigate the remaining two conditions, (3.11) and (3.12), in Lemma 3.1. For (3.11) and (3.12) to hold with probability going to 1, additional assumptions, including the irrepresentable condition, need to be imposed. Since the data matrix $X$ is assumed to be random, the original irrepresentable condition needs to be stated in terms of the population covariance matrix as follows,
(3.17)    $\left\| \Sigma_{BA} \Sigma_{AA}^{-1} \right\|_{\infty} \leq 1 - \eta$
for some $0 < \eta \leq 1$. We note that the original irrepresentable condition in Zhao and Yu (2006) also involves the signs of $\beta_A$. To simplify presentation, we use the stronger condition (3.17) instead. Obviously, (3.17) does not directly imply that its sample counterpart holds. The next lemma establishes the asymptotic behaviors of $\tilde{C}^{\nu}_{BA} (\tilde{C}^{\nu}_{AA})^{-1}$ and $(\tilde{C}^{\nu}_{AA})^{-1}$. Let $d = \max(d_A, d_{AB})$. Assume
(3.18) 
(3.19) 
Lemma 3.3
The above lemma indicates that, with a properly chosen thresholding parameter $\nu$ and a sample size depending on the covariance-sparsity quantities $d_A$ and $d_{AB}$, both $\tilde{C}^{\nu}_{BA} (\tilde{C}^{\nu}_{AA})^{-1}$ and $(\tilde{C}^{\nu}_{AA})^{-1}$ behave as their population counterparts $\Sigma_{BA} \Sigma_{AA}^{-1}$ and $\Sigma_{AA}^{-1}$, asymptotically. Again, the influence of the sparsity of $\Sigma$ on these quantities is shown through $d_A$ and $d_{AB}$. Asymptotically, the smaller $d_A$ and $d_{AB}$ are, the faster (3.20) and (3.21) converge. Or, equivalently, for sample size $n$ fixed, the smaller $d_A$ and $d_{AB}$ are, the larger the probabilities in (3.20) and (3.21) are. In the special case when $\Sigma_{BA}$ is a zero matrix, condition (3.19) is always satisfied.
Finally, we are ready to state the sign consistency result for $\hat{\beta}$. With the help of Lemmas 3.1-3.3 stated above, the only issue left is to show the existence of a proper $\lambda$ such that (3.11) and (3.12) hold with probability going to 1. One more condition is needed. We assume that
(3.22) 
Theorem 3.2
We note that the assumption of a large number of irrelevant variables is natural for high-dimensional sparse models. This assumption affects the conditions (3.19) and (3.22), as well as the choices of $\lambda$ and $\nu$. When a nonsparse linear model is assumed instead, the conditions for $\hat{\beta}$ to be sign consistent need to be modified accordingly in conditions (3.19), (3.22), and (3.23).
It is possible to establish the convergence rate for the probability in (3.24) more explicitly. For simplicity of presentation, we provide such a result under a special case in the following theorem.
Theorem 3.3
The proof of Theorem 3.3, which we omit, is similar to that of Theorem 3.2. We note that the conditions on dimension parameters in Theorem 3.2 are now expressed in the convergence rate of (3.25). It is clear that the smaller $d$ is, the larger the probability in (3.25) is.
3.2. Comparison with the Lasso
We compare sign consistency results of the covariance-thresholded lasso with those of the lasso. By choosing $\nu = 0$, the covariance-thresholded lasso estimate can be reduced to the lasso estimate $\hat{\beta}^{lasso}$. Results on sign consistency of the lasso have been established in the literature (Zhao and Yu (2006); Meinshausen and Buhlmann (2006); Wainwright (2006)). To facilitate comparison, we restate sign consistency results for $\hat{\beta}^{lasso}$ in the same way that we presented results for $\hat{\beta}$ in Section 3.1. The proofs, which we omit, for sign consistency of $\hat{\beta}^{lasso}$ are similar to those for $\hat{\beta}$.
First, assuming fixed design, we have the sufficient and almost necessary conditions for $\mathrm{sign}(\hat{\beta}^{lasso}) = \mathrm{sign}(\beta)$ as in (3.10)-(3.12) with $\nu = 0$.
Next, we assume the random design. Analogous to Lemma 3.2, the sufficient conditions for the nonsingularity of $\tilde{C}_{AA}$ are (3.13), (3.14), and
(3.26) 
Compared to (3.15), (3.26) is clearly more demanding, since $d_A$ is always less than or equal to $q$. Note that a necessary condition for $\tilde{C}_{AA}$ to be nonsingular is $n \geq q$, which is not required for $\tilde{C}^{\nu}_{AA}$. Thus, the nonsingularity of the sample covariance submatrix $\tilde{C}_{AA}$ is harder to attain than that of $\tilde{C}^{\nu}_{AA}$. In other words, the covariance-thresholded lasso may increase the rank of $\tilde{C}_{AA}$ by thresholding. When $\Sigma_{AA}$ is sparse, this can be beneficial for variable selection under the large $p$ and small $n$ scenario.
To ensure that the conclusions of Lemma 3.3 hold with $\nu = 0$, we further assume the conditions (3.17) and
(3.27) 
We note that (3.27) is the main condition that guarantees that $\tilde{C}_{BA} \tilde{C}_{AA}^{-1}$ satisfies the irrepresentable condition with probability going to 1. Compared with (3.19), (3.27) is clearly more demanding, since $q$ is larger than both $d_A$ and $d_{AB}$. This implies that it is harder for $\tilde{C}_{BA} \tilde{C}_{AA}^{-1}$ than for $\tilde{C}^{\nu}_{BA} (\tilde{C}^{\nu}_{AA})^{-1}$ to satisfy the irrepresentable condition. In other words, the covariance-thresholded lasso is more likely to be variable selection consistent than the lasso when data are randomly generated from a distribution that satisfies (3.17).
Finally, with the additional condition,
(3.28) 
we arrive at the sign consistency of the lasso as follows.
Corollary 3.1
Compare Corollary 3.1 with Theorem 3.2 for the covariance-thresholded lasso. We see that the conditions (3.13), (3.14), and (3.17) on the random predictors, in particular the covariances, are the same, but the conditions on dimension parameters, such as $n$, $p$, and $q$, are different. When the population covariance matrix $\Sigma$ is sparse, condition (3.19) on dimension parameters is much weaker for the covariance-thresholded lasso than condition (3.27) for the lasso. This shows that the covariance-thresholded lasso can improve the possibility that a consistent solution exists. However, a trade-off presents itself in the selection of the tuning parameter $\lambda$. The first condition in (3.23) for the covariance-thresholded lasso is clearly more restrictive than the condition in (3.29) for the lasso. This results in a more restricted range of valid $\lambda$. We argue that, compared with the existence of a consistent solution, the range of $\lambda$ is of secondary concern.
We note that a related sign consistency result under random design for the lasso has been established in Wainwright (2006), which assumes that the predictors are normally distributed and utilizes the resulting distribution of the sample covariance matrix. The conditions used in Wainwright (2006) include (3.14) and (3.17), along with additional scaling conditions on $n$, $p$, $q$, and $\lambda$. In comparison, we assume, in this paper, that the random predictors follow the more general moment conditions (3.13), which contain the Gaussian assumption as a special case. Moreover, we use a new approach to establish sign consistency that can incorporate the sparsity of the covariance matrix.
4. Simulations
In this section, we examine the finite-sample performances of the covariance-thresholded lasso and compare them to those of the lasso, adaptive lasso with univariate estimates as initial estimates, UST, scout(1,1), scout(2,1), and the elastic net. Further, we propose a novel variant of cross-validation that allows improved variable selection when $n$ is much less than $p$. We note that the scout(1,1) procedure can be computationally expensive. Results for scout(1,1) that took longer than 5 days on an RCAC cluster are not shown.
We compare variable selection performances using the measure $G$, defined as the geometric mean between sensitivity and specificity. Sensitivity and specificity can be interpreted as the proportions of selecting the true variables correctly and discarding the irrelevant variables correctly, respectively. Sensitivity can also be defined as 1 minus the false negative rate, and specificity as 1 minus the false positive rate. A value close to 1 for $G$ indicates good selection, whereas a value close to 0 implies that few true variables or too many irrelevant variables are selected, or both. Furthermore, we compare prediction accuracy using the relative prediction error (RPE) $= (\hat{\beta} - \beta)^T \Sigma (\hat{\beta} - \beta)/\sigma^2$, where $\Sigma$ is the population covariance matrix. The RPE is obtained by rescaling the mean-squared error (ME), as in Tibshirani (1996), by $\sigma^2$.
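The RPE described above is computed directly from $\hat{\beta}$, $\beta$, $\Sigma$, and $\sigma^2$; the short Python sketch below writes the ME out as the quadratic form $(\hat{\beta} - \beta)^T \Sigma (\hat{\beta} - \beta)$ (the function name is ours):

```python
def relative_prediction_error(beta_hat, beta, Sigma, sigma2):
    """RPE = (beta_hat - beta)' Sigma (beta_hat - beta) / sigma^2."""
    d = [bh - b for bh, b in zip(beta_hat, beta)]
    p = len(d)
    me = sum(d[j] * Sigma[j][k] * d[k] for j in range(p) for k in range(p))
    return me / sigma2

# Identity covariance: RPE is the squared estimation error over sigma^2.
I2 = [[1.0, 0.0], [0.0, 1.0]]
print(relative_prediction_error([1.0, 0.0], [0.0, 0.0], I2, 2.0))  # -> 0.5
```

When $\Sigma$ is not the identity, correlated coordinates of the estimation error are weighted through the off-diagonal terms of the quadratic form.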
We first present variable selection results using best-possible selection of tuning parameters, where tuning parameters are selected ex post facto based on the best $G$. This procedure is useful in examining variable selection performances free from both inherent variabilities in estimating the tuning parameters and possible differences in the validation procedures used. Moreover, it is important as an informant of the potential of the methods examined. We present the median $G$ out of 200 replications using best-possible selection of tuning parameters. Standard errors based on 500 bootstrapped resamplings are very small, in the hundredths, for the median $G$ and are not shown.
Results from best-possible selection of tuning parameters allow us to understand the potential advantages of the methods if one chooses their tuning parameters correctly. In practice, however, errors in selecting the tuning parameters may sometimes outweigh the benefit of introducing them. Hence, we include additional results that use cross-validation to select tuning parameters.
[Figure 3. Median G using best-possible selection of tuning parameters: (a) Example 1, (b) Example 2, (c) Example 3.]
We study variable selection methods using a novel variant of the usual cross-validation to estimate the model complexity parameter λ, which allows improved variable selection when n is much less than p. Conventional cross-validation selects tuning parameters based upon the minimum validation error, obtained from the average of sum-of-squares errors over the folds. It is well known that, when the sample size n is large compared with the number of predictors p, prediction-based procedures such as cross-validation tend to overselect. This is because, when the sample size is large, regression methods tend to produce small but nonzero estimates for coefficients of irrelevant variables, and overtraining occurs. On the other hand, a different scenario occurs when n is much less than p. In this situation, prediction-based procedures, such as the usual cross-validation, tend to underselect important variables. This is because, when n is small, inclusion of even a few irrelevant variables can increase the validation error dramatically, resulting in severe instability and underrepresentation of important variables. In this paper, we propose a variant of the usual cross-validation in which we include additional variables by decreasing λ for up to 1 standard deviation of the validation error at the minimum. Through extensive empirical studies, we found that this strategy often works well to prevent underselection when n is small relative to p. When the sample size is only moderately large relative to p, we use the usual cross-validation at the minimum. We note that Hastie, Tibshirani, and Friedman (2001, p. 216) have described a related strategy, for use when n is large relative to p and overselection is severe, that discards variables up to 1 standard deviation of the minimum cross-validation error. In Tables 1-3, we present median RPE, number of true and false positives, sensitivity, specificity, and G out of 200 replications using the modified cross-validation to select tuning parameters.
The smallest 3 values of median RPE and the largest 3 values of median G are highlighted in bold. Standard errors based on 500 bootstrap resamplings are further reported in parentheses for median RPE and G. In Table 4, we provide an additional simulation study to illustrate the modified cross-validation.
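The modified cross-validation rule described above can be sketched in a few lines. The sketch below is our own illustrative rendering, not the authors' implementation: it assumes a grid of penalties sorted in decreasing order and per-penalty validation-error means and standard deviations across folds.

```python
import numpy as np

def select_lambda(lambdas, cv_mean, cv_sd, n, p):
    """One-SD variant of cross-validation for choosing the penalty.

    lambdas is assumed sorted in decreasing order, so later entries
    admit more variables; cv_mean and cv_sd are the mean and standard
    deviation of the validation error at each lambda across folds.
    """
    i_min = int(np.argmin(cv_mean))
    if n >= p:
        # sample size not small relative to p: usual rule at the minimum
        return lambdas[i_min]
    # n < p: decrease lambda (include more variables) as long as the
    # validation error stays within 1 SD of its value at the minimum
    threshold = cv_mean[i_min] + cv_sd[i_min]
    i = i_min
    for j in range(i_min + 1, len(lambdas)):
        if cv_mean[j] <= threshold:
            i = j
    return lambdas[i]
```

Note that this moves in the opposite direction from the familiar one-standard-error rule of Hastie, Tibshirani, and Friedman (2001), which increases the penalty to discard variables when overselection is the concern.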
In each example, we simulate 200 data sets from the true model y = Xβ + ε, where ε ~ N(0, σ²). The design X is generated each time from N(0, Σ), and the model parameters vary across examples to illustrate performances in a variety of situations. We choose the tuning parameter γ for both the adaptive lasso (Zou 2006) and the covariance-thresholded lasso with adaptive thresholding. The adaptive lasso seeks to improve upon the lasso by applying the weights wj = 1/|β̂j|^γ, where β̂j is an initial estimate, in order to penalize each coefficient differently in the ℓ1 norm of the lasso. The larger γ is, the less shrinkage is applied to coefficients of large magnitudes. The candidate values used for γ are those suggested in Zou (2006) and found to work well in practice.
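The adaptive lasso weighting can be illustrated with the standard column-rescaling trick, which lets any plain lasso solver compute the weighted fit. The sketch below is ours, not the authors' code: the minimal coordinate-descent solver, the small floor guarding division by zero, and all names are illustrative choices.

```python
import numpy as np

def soft(z, t):
    """Soft-thresholding operator."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Minimal coordinate descent for (1/2n)||y - Xb||^2 + lam*||b||_1."""
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    r = y.copy()                       # residual y - Xb (b starts at 0)
    for _ in range(n_iter):
        for j in range(p):
            r += X[:, j] * b[j]        # remove coordinate j from the fit
            z = X[:, j] @ r / n
            b[j] = soft(z, lam) / col_sq[j]
            r -= X[:, j] * b[j]        # put the updated coordinate back
    return b

def adaptive_lasso(X, y, lam, beta_init, gamma=1.0):
    """Adaptive lasso: weights w_j = 1/|beta_init_j|^gamma; solve a
    plain lasso on columns X_j / w_j and rescale the solution by 1/w_j."""
    w = 1.0 / np.maximum(np.abs(beta_init), 1e-8) ** gamma
    b_scaled = lasso_cd(X / w, y, lam)
    return b_scaled / w
```

Coefficients with large initial estimates thus receive small weights and little shrinkage, while coefficients with tiny initial estimates are penalized heavily, which is the mechanism behind the method's improved selection.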
Example 1. (Autocorrelated.) This example has p = 100 predictors, 10 of which have nonzero coefficients, and the signal-to-noise ratio (SNR) is relatively low. This example, similar to Example 1 in Tibshirani (1996), has an approximately sparse covariance structure, as elements away from the diagonal can be extremely small.
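The exact coefficient and correlation values of this example are not reproduced here. As a hedged illustration of an autocorrelated design of the kind described, the sketch below assumes an AR(1)-type covariance Σij = ρ^|i−j|, whose off-diagonal entries decay geometrically away from the diagonal; the decay rate rho and the dimensions are hypothetical.

```python
import numpy as np

def ar1_cov(p, rho):
    """Covariance with geometrically decaying off-diagonals: Sigma_ij = rho^|i-j|."""
    idx = np.arange(p)
    return rho ** np.abs(np.subtract.outer(idx, idx))

def autocorrelated_design(n, p, rho, rng):
    """Draw n rows of X ~ N(0, Sigma) via a Cholesky factor of Sigma."""
    L = np.linalg.cholesky(ar1_cov(p, rho))
    return rng.standard_normal((n, p)) @ L.T
```

With, say, rho = 0.5, entries four positions off the diagonal are already 0.5^4 ≈ 0.06, which is why such a matrix is well approximated by a sparse one.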
Table 1. Median RPE, true positives (TP), false positives (FP), sensitivity, specificity, and G for Example 1 using modified cross-validation; standard errors in parentheses.

n  Method  RPE  TP  FP  Sens  Spec  G
20  Lasso  1.284 (0.043)  4.0  13.0  0.40  0.86  0.577 (0.003) 
Adapt Lasso  1.301 (0.060)  4.0  12.0  0.40  0.87  0.581 (0.006)  
UST  3.001 (0.223)  7.0  28.0  0.70  0.69  0.690 (0.008)  
Scout(1,1)  1.164 (0.027)  10.0  90.0  1.00  0.00  0.000 (0.000)  
Scout(2,1)  1.474 (0.053)  6.0  39.0  0.60  0.57  0.474 (0.023)  
Elastic net  1.630 (0.097)  7.0  31.0  0.70  0.63  0.633 (0.021)  
CTLasso hard  1.713 (0.100)  5.0  22.5  0.60  0.77  0.593 (0.013)  
CTLasso soft  1.586 (0.051)  6.0  20.5  0.60  0.78  0.667 (0.007)  
CTLasso adapt  1.602 (0.055)  6.0  20.0  0.60  0.78  0.654 (0.006)  
40  Lasso  1.095 (0.052)  6.0  27.0  0.60  0.71  0.672 (0.003) 
Adapt Lasso  1.047 (0.038)  7.0  21.0  0.70  0.77  0.706 (0.007)  
UST  1.918 (0.098)  8.0  28.0  0.80  0.69  0.742 (0.006)  
Scout(1,1)  0.814 (0.016)  10.0  90.0  1.00  0.00  0.000 (0.025)  
Scout(2,1)  1.125 (0.039)  9.0  53.0  0.90  0.41  0.544 (0.029)  
Elastic net  1.490 (0.066)  8.0  32.0  0.90  0.63  0.683 (0.010)  
CTLasso hard  1.221 (0.072)  7.0  23.0  0.70  0.74  0.704 (0.008)  
CTLasso soft  1.068 (0.055)  7.0  23.0  0.80  0.77  0.739 (0.007)  
CTLasso adapt  1.063 (0.045)  7.0  23.0  0.80  0.78  0.743 (0.007)  
80  Lasso  0.379 (0.010)  8.0  19.0  0.80  0.79  0.794 (0.005) 
Adapt Lasso  0.367 (0.013)  8.0  15.0  0.80  0.82  0.800 (0.005)  
UST  0.360 (0.011)  8.0  5.0  0.80  0.94  0.851 (0.012)  
Scout(1,1)  0.245 (0.007)  8.0  8.0  0.80  0.91  0.854 (0.008)  
Scout(2,1)  0.399 (0.014)  6.5  7.0  0.65  0.92  0.762 (0.006)  
Elastic net  0.307 (0.014)  9.0  10.0  0.90  0.90  0.866 (0.006)  
CTLasso hard  0.349 (0.013)  8.0  8.0  0.80  0.94  0.795 (0.010)  
CTLasso soft  0.284 (0.011)  8.0  6.5  0.80  0.94  0.827 (0.008)  
CTLasso adapt  0.316 (0.017)  8.0  8.0  0.80  0.93  0.823 (0.009) 
Figure 3(a) depicts variable selection results using best-possible selection of tuning parameters. We see that the covariance-thresholded lasso methods dominate the lasso, adaptive lasso, and UST in terms of variable selection. The performances of the lasso and adaptive lasso deteriorate precipitously as n becomes small, whereas those of the covariance-thresholded lasso methods decrease at a relatively slow pace. Furthermore, the covariance-thresholded lasso methods dominate the elastic net and scout for n small. We also observe that the scout procedures and the elastic net perform very similarly. This is not surprising, as Witten and Tibshirani (2009) have shown in Section 2.5.1 of their paper that scout(2,1), by regularizing the inverse covariance matrix, is very similar to the elastic net.
Results from best-possible selection provide information on the potential of the methods examined. In Table 1, we present results using cross-validation to illustrate performances in practice. The covariance-thresholded lasso methods tend to dominate the lasso, adaptive lasso, scout, and elastic net in terms of variable selection for n small. The UST presents good variable selection performances but large prediction errors. We note that, due to its large bias, the UST cannot legitimately be applied with cross-validation that uses validation error to select tuning parameters, especially when the coefficients are disparate and some correlations are large. The advantage of the covariance-thresholded lasso with hard thresholding is less apparent compared with those of soft and adaptive thresholding. This suggests that continuous thresholding of covariances may achieve better performance under cross-validation than discontinuous thresholding. We note that the scout procedures perform surprisingly poorly compared with the covariance-thresholded lasso and the elastic net in terms of variable selection when n is small. As the scout and elastic net are quite similar in their potentials for variable selection, as shown in Figure 3(a), the differences seem to come from the additional rescaling step of the scout, in which the scout rescales its initial estimates by multiplying them by a scalar. This strategy can sometimes be useful in improving prediction accuracy. However, when n is small compared with p, standard deviations of validation errors for the scout can often be large, which may cause variable selection performances to suffer under cross-validation. We additionally note that, when n is small and SNR is low, as in this example, high specificity can sometimes be more important for prediction accuracy than high sensitivity. This is because, when n is small, coefficients of irrelevant variables can be given large estimates, and inclusion of but a few irrelevant variables can significantly deteriorate prediction accuracy. In Table 1, we see that the lasso and adaptive lasso have good prediction accuracy even though they select fewer than half of the true variables.
Table 2. Median RPE, true positives (TP), false positives (FP), sensitivity, specificity, and G for Example 2 using modified cross-validation; standard errors in parentheses.

n  Method  RPE  TP  FP  Sens  Spec  G
20  Lasso  0.341 (0.027)  2.0  9.0  0.10  0.89  0.302 (0.009) 
Adapt Lasso  0.352 (0.028)  2.0  9.0  0.10  0.89  0.301 (0.006)  
Elastic net  0.967 (0.137)  14.0  51.5  0.70  0.36  0.437 (0.011)  
UST  28.930 (0.836)  19.0  73.0  0.95  0.09  0.296 (0.012)  
Scout(1,1)  NA  NA  NA  NA  NA  NA  
Scout(2,1)  0.062 (0.004)  20.0  80.0  1.00  0.00  0.000 (0.000)  
CTLasso hard  0.383 (0.018)  3.0  11.0  0.15  0.86  0.370 (0.013)  
CTLasso soft  0.231 (0.015)  6.5  23.0  0.33  0.71  0.465 (0.008)  
CTLasso adapt  0.302 (0.019)  6.5  23.0  0.33  0.71  0.461 (0.012)  
40  Lasso  0.348 (0.017)  5.0  18.5  0.25  0.77  0.429 (0.014) 
Adapt Lasso  0.315 (0.024)  5.0  17.0  0.25  0.79  0.417 (0.008)  
Elastic net  0.739 (0.094)  16.0  58.0  0.80  0.28  0.426 (0.014)  
UST  26.189 (1.001)  20.0  77.0  1.00  0.04  0.194 (0.007)  
Scout(1,1)  NA  NA  NA  NA  NA  NA  
Scout(2,1)  0.043 (0.004)  20.0  80.0  1.00  0.00  0.000 (0.000)  
CTLasso hard  0.363 (0.018)  6.0  21.0  0.30  0.74  0.450 (0.008)  
CTLasso soft  0.269 (0.023)  10.0  35.0  0.50  0.56  0.485 (0.006)  
CTLasso adapt  0.306 (0.029)  8.0  31.0  0.40  0.61  0.482 (0.006)  
80  Lasso  0.123 (0.004)  5.0  14.0  0.25  0.83  0.440 (0.008) 
Adapt Lasso  0.122 (0.004)  4.0  14.0  0.20  0.83  0.423 (0.009)  
Elastic net  0.089 (0.006)  14.0  48.5  0.70  0.39  0.461 (0.012)  
UST  0.042 (0.003)  18.0  66.0  0.90  0.18  0.393 (0.006)  
Scout(1,1)  NA  NA  NA  NA  NA  NA  
Scout(2,1)  0.038 (0.002)  20.0  80.0  1.00  0.00  0.000 (0.000)  
CTLasso hard  0.159 (0.007)  6.0  17.5  0.30  0.78  0.468 (0.008)  
CTLasso soft  0.107 (0.009)  9.0  27.0  0.45  0.66  0.521 (0.007)  
CTLasso adapt  0.129 (0.014)  8.0  24.0  0.40  0.70  0.503 (0.007) 
Example 2. (Constant covariance.) This example has p = 100 predictors, 20 of which have nonzero coefficients, and the SNR is higher than in Example 1. This example, derived from Example 4 in Tibshirani (1996), presents an extreme situation in which all non-diagonal elements of the population covariance matrix are nonzero and constant.
In Figure 3(b), we see that the covariance-thresholded lasso methods dominate the lasso and adaptive lasso, especially for n small. This example shows that sparse covariance thresholding may still improve variable selection when the underlying covariance matrix is non-sparse. Furthermore, the covariance-thresholded lasso methods with soft and adaptive thresholding perform better than that with hard thresholding. Interestingly, we see that the performance of the UST decreases with increasing n and drops below that of the lasso. This example demonstrates that the UST may not be a good general procedure for variable selection and can sometimes fail unexpectedly. We note that this is a challenging example for variable selection in general; by the irrepresentable condition (Zhao and Yu 2006), the lasso is not variable selection consistent under this scenario. The median G values in Figure 3(b) increase much more slowly with increasing n than those of Example 1 in Figure 3(a), even though the SNR is higher.
Table 2 shows that the covariance-thresholded lasso methods and the elastic net dominate the