SPARSE COVARIANCE THRESHOLDING FOR
HIGH-DIMENSIONAL VARIABLE SELECTION
X. Jessie Jeng and Z. John Daye
Abstract: In high dimensions, many variable selection methods, such as the lasso, are often limited by excessive variability and rank deficiency of the sample covariance matrix. Covariance sparsity is a natural phenomenon in high-dimensional applications, such as microarray analysis and image processing, in which a large number of predictors are independent or weakly correlated. In this paper, we propose the covariance-thresholded lasso, a new class of regression methods that can utilize covariance sparsity to improve variable selection. We establish theoretical results, under the random design setting, that relate covariance sparsity to variable selection. Real-data and simulation examples indicate that our method can be useful in improving variable selection performance.
Key words and phrases: Consistency, covariance sparsity, large p small n, random design, regression, regularization.
1. Introduction
Variable selection in high-dimensional regression is a central problem in statistics and has stimulated much interest in the past few years. Motivation for developing effective variable selection methods in high dimensions comes from a variety of applications, such as gene microarray analysis and image processing, where it is necessary to identify a parsimonious subset of predictors to improve interpretability and prediction accuracy. In this paper, we consider the following linear model for a vector of predictors $X = (X_1, \ldots, X_p)$ and a response variable $Y$,
$$Y = X\beta^* + \epsilon, \qquad (1.1)$$
where $\beta^* = (\beta^*_1, \ldots, \beta^*_p)^T$ is a vector of regression coefficients and $\epsilon$ is a normal random error with mean 0 and variance $\sigma^2$. If $\beta^*_j$ is nonzero, then $X_j$ is said to be a true variable; otherwise, it is an irrelevant variable. Further, when only a few coefficients $\beta^*_j$ are believed to be nonzero, we refer to (1.1) as a sparse linear model. The purpose of variable selection is to separate the true variables from the irrelevant ones based upon some observations of the model. In many applications, $p$ can be fairly large or even larger than $n$. The problem of large $p$ and small $n$ presents a fundamental challenge for variable selection.
Recently, various methods based upon penalized least squares have been proposed for variable selection. The lasso, introduced by Tibshirani (1996), is the forerunner and foundation for many of these methods. Suppose that $y$ is an $n \times 1$ vector of observed responses centered to have mean 0 and $X = (X_1, \ldots, X_p)$ is an $n \times p$ data matrix with each column $X_j$ standardized to have mean zero and variance 1. We may reformulate the lasso as follows,
$$\hat\beta^{Lasso} = \arg\min_{\beta} \left\{ \beta^T \hat\Sigma \beta - \frac{2}{n} \beta^T X^T y + 2\lambda_n \|\beta\|_1 \right\}, \qquad (1.2)$$
where $\hat\Sigma = X^T X / n$ is the sample covariance or correlation matrix. Consistency in variable selection for the lasso has been proved under the neighborhood stability condition in Meinshausen and Buhlmann (2006) and under the irrepresentable condition in Zhao and Yu (2006). Compared with traditional variable selection procedures, such as all subset selection, AIC, and BIC, the lasso has continuous solution paths and can be computed efficiently using innovative algorithms, such as the LARS in Efron, Hastie, Johnstone, and Tibshirani (2004). Since its introduction, the lasso has emerged as one of the most widely used methods for variable selection.
In the lasso literature, the data matrix $X$ is often assumed to be fixed. However, this assumption may not be realistic in high-dimensional applications, where data usually come from observational rather than experimental studies. In this paper, we assume the predictors $X$ in (1.1) to be random with $E(X) = 0$ and $E(X^T X) = \Sigma = (\sigma_{ij})_{1 \le i, j \le p}$. In addition, we assume that the population covariance matrix $\Sigma$ is sparse in the sense that the proportion of nonzero $\sigma_{ij}$ in $\Sigma$ is relatively small. Motivations for studying sparse covariance matrices come from a myriad of applications in high dimensions, where a large number of predictors can be independent or weakly correlated with each other. For example, in gene microarray analysis, it is often reasonable to assume that genes belonging to different pathways or systems are independent or weakly correlated (Rothman, Levina, and Zhu 2009; Wagaman and Levina 2008). In these applications, the number of nonzero covariances in $\Sigma$ can be much smaller than $p(p-1)/2$, the total number of covariances.
An important component of lasso regression (1.2) is the sample covariance matrix $\hat\Sigma$. We note that the sample covariance matrix is rank-deficient when $p > n$. This can cause the lasso to saturate after at most $n$ variables are selected. Moreover, the 'large $p$ and small $n$' scenario can cause excessive variability of sample covariances between the true and irrelevant variables. This deteriorates the ability of the lasso to separate true variables from irrelevant ones. More specifically, a sufficient and almost necessary condition for the lasso to be variable selection consistent is derived in Zhao and Yu (2006), which they call the irrepresentable condition. It constrains the inter-connectivity between the true and irrelevant variables in the following way. Let $S = \{j : \beta^*_j \neq 0\}$ and $C = \{1, \ldots, p\} \setminus S$, such that $S$ is the collection of true variables and $C$ is the complement of $S$ that is composed of the irrelevant variables. Assume that the cardinality of $S$ is $s$; in other words, there are $s$ true variables and $p - s$ irrelevant ones. Further, let $X_S$ and $X_C$ be sub-data matrices of $X$ that contain the observations of the true and irrelevant variables, respectively. Define $\hat I = \hat\Sigma_{CS} (\hat\Sigma_{SS})^{-1} \mathrm{sgn}(\beta^*_S)$, where $\hat\Sigma_{SS} = X_S^T X_S / n$ and $\hat\Sigma_{CS} = X_C^T X_S / n$. We refer to $\hat I$ as the sample irrepresentable index. It can be interpreted as representing the amount of inter-connectivity between the true and irrelevant variables. In order for the lasso to select the true variables consistently, the irrepresentable condition requires $\hat I$ to be bounded from above, that is $|\hat I| < 1 - \eta$ for some $\eta \in (0, 1)$, entry-wise. Clearly, excessive variability of the sample covariance matrix induced by large $p$ and small $n$ can cause $\hat I$ to exhibit large variation that makes the irrepresentable condition less likely to hold. These inadequacies motivate us to consider alternatives to the sample covariance matrix to improve variable selection for the lasso in high dimensions.
Next, we provide some insight on how the sparsity of the population covariance matrix can influence variable selection for the lasso. Under the random design assumption on $X$, the inter-connectivity between the true and irrelevant variables can be stated in terms of their population variances and covariances. Let $\Sigma_{CS}$ be the covariance matrix between the irrelevant variables and true variables and $\Sigma_{SS}$ the variance-covariance matrix of the true variables. We define the population irrepresentable index as $I = \Sigma_{CS} (\Sigma_{SS})^{-1} \mathrm{sgn}(\beta^*_S)$. Intuitively, the sparser the population covariances $\Sigma_{CS}$ and $\Sigma_{SS}$ are, or the sparser $\Sigma$ is, the more likely it is that $|I| < 1 - \eta$ for some $\eta \in (0, 1)$, entry-wise. This property, however, does not automatically carry over to the sample irrepresentable index $\hat I$, due to its excessive variability. When $\Sigma_{CS}$ and $\Sigma_{SS}$ are known a priori to be sparse and $|I| < 1 - \eta$, entry-wise, some regularization on the covariance can be used to reduce the variabilities of $\hat\Sigma_{CS}$ and $\hat\Sigma_{SS}$ and allow the irrepresentable condition to hold more easily for $\hat I$. Furthermore, the sample covariance matrix $\hat\Sigma$ is obviously non-sparse, and imposing sparsity on $\hat\Sigma$ has the benefit of sometimes increasing the rank of the sample covariance matrix.
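As a minimal sketch of the irrepresentable index discussed above, the following computes it from a given covariance matrix. It assumes the sign-weighted form of Zhao and Yu (2006); the function name `irrepresentable_index` and the small 3-variable covariance matrix are hypothetical and chosen only for illustration.

```python
import numpy as np

def irrepresentable_index(cov, S, sign_S):
    """Sigma_CS @ inv(Sigma_SS) @ sign(beta_S), one entry per irrelevant variable.

    Assumed form (sign-weighted, as in Zhao and Yu 2006); `cov` may be a
    population or a sample covariance matrix.
    """
    p = cov.shape[0]
    S = np.asarray(S)
    C = np.setdiff1d(np.arange(p), S)          # indices of irrelevant variables
    cov_SS = cov[np.ix_(S, S)]
    cov_CS = cov[np.ix_(C, S)]
    return cov_CS @ np.linalg.solve(cov_SS, np.asarray(sign_S, dtype=float))

# Hypothetical covariance: variable 2 is irrelevant but correlated with both true variables.
cov = np.array([[1.0, 0.0, 0.5],
                [0.0, 1.0, 0.5],
                [0.5, 0.5, 1.0]])
index = irrepresentable_index(cov, S=[0, 1], sign_S=[1, 1])
index_identity = irrepresentable_index(np.eye(3), S=[0, 1], sign_S=[1, 1])
```

In this toy case the index equals 1 for the correlated covariance, so a strict bound below 1 fails, whereas it is 0 for the identity covariance.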
We use an example to demonstrate how rank deficiency and excessive variability of the sample covariance matrix can compromise the performance of the lasso for large $p$ and small $n$. Suppose there are 40 variables ($p = 40$) and $\Sigma = I_p$ ($I_p$ is the identity matrix). Since all variables are independent of each other, the population irrepresentable index clearly satisfies the irrepresentable condition, entry-wise. Further, we let , for , and , for . The error standard deviation $\sigma$ is set to be about 6.3 to have a signal-to-noise ratio of approximately 1. The lasso, in general, does not take into consideration the structural properties of the model, such as the sparsity or the orthogonality of $\Sigma$ in this example. One way to take advantage of the orthogonality of $\Sigma$ is to replace $\hat\Sigma$ in (1.2) by $I_p$, which leads to the univariate soft thresholding (UST) estimates $\hat\beta^{UST}_j = \mathrm{sgn}(r_j)(|r_j| - \lambda_n)_+$, where $r_j = X_j^T y / n$ for $j = 1, \ldots, p$. We compare the performances of the lasso and UST over various sample sizes $n$ using the variable selection measure $G$. $G$ is defined as the geometric mean between sensitivity, (no. of true positives)/$s$, and specificity, $1 -$ (no. of false positives)/$(p - s)$ (Tibshirani, Saunders, Rosset, Zhu, and Knight 2005; Chong and Jun 2005; Kubat, Holte, and Matwin 1998). $G$ varies between 0 and 1. Larger $G$ indicates better selection with a larger proportion of variables classified correctly.
Figure 1 plots the median based on 200 replications for the lasso and UST against sample sizes. For each replication, is determined ex post facto by the optimal in order to avoid stochastic errors from tuning parameter estimation, such as by using cross-validation. It is clear from Figure 1 that, when , lasso slightly outperforms UST; when , the performance of lasso starts to deteriorate precipitously, whereas the performance of UST declines at a much slower pace and starts to outperform lasso. This example suggests that when is large and is relatively small, sparsity of can be used to enhance variable selection.
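The UST estimate and the selection measure used in this example can be sketched as follows. This is an illustration only: it assumes the usual soft-thresholding form applied to the marginal covariances $X_j^T y / n$, and the function names are hypothetical.

```python
import numpy as np

def soft_threshold(z, lam):
    """Elementwise soft thresholding: sgn(z) * max(|z| - lam, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def ust_estimate(X, y, lam):
    """Univariate soft thresholding of the marginal covariances X_j^T y / n."""
    n = X.shape[0]
    r = X.T @ y / n
    return soft_threshold(r, lam)

def g_measure(beta_hat, beta_true):
    """Geometric mean of sensitivity and specificity of the selected support."""
    selected = beta_hat != 0
    truth = beta_true != 0
    sensitivity = np.sum(selected & truth) / np.sum(truth)
    specificity = np.sum(~selected & ~truth) / np.sum(~truth)
    return np.sqrt(sensitivity * specificity)
```

Because UST never forms the full sample covariance matrix, it is unaffected by the rank deficiency of that matrix, which is the behavior the example above is meant to highlight.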
The discussions above motivate us to consider improving the performance of the lasso by applying regularization to the sample covariance matrix $\hat\Sigma$. A good sparse covariance-regularizing operator on $\hat\Sigma$ should satisfy the following properties:
1. The operator stabilizes $\hat\Sigma$.
2. The operator can increase the rank of $\hat\Sigma$.
3. The operator utilizes the underlying sparsity of the covariance matrix.
The first and second properties are obviously useful and have been explored in the literature. For example, the elastic net, introduced in Zou and Hastie (2005), replaces $\hat\Sigma$ by $\hat\Sigma_{EN} = (\hat\Sigma + \lambda_2 I)/(1 + \lambda_2)$ in (1.2), where $\lambda_2$ is a tuning parameter. $\hat\Sigma_{EN}$ can be more stable and have higher rank than $\hat\Sigma$ but is non-sparse. Nonetheless, in many applications, utilizing the underlying sparsity may be more crucial in improving the lasso when data are scarce, such as under the large $p$ and small $n$ scenario.
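A short numerical check can illustrate the first two properties for the elastic net. The sketch below assumes the rescaled form $(\hat\Sigma + \lambda_2 I)/(1 + \lambda_2)$ from Zou and Hastie (2005); the sizes `n`, `p` and the value `lam2` are arbitrary illustrative choices, not settings from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, lam2 = 10, 40, 0.5                      # illustrative values only
X = rng.standard_normal((n, p))
X = (X - X.mean(axis=0)) / X.std(axis=0)      # standardize each column

sigma_hat = X.T @ X / n                       # sample correlation matrix, rank < p
sigma_en = (sigma_hat + lam2 * np.eye(p)) / (1.0 + lam2)

rank_sample = np.linalg.matrix_rank(sigma_hat)
rank_en = np.linalg.matrix_rank(sigma_en)
min_eig_en = np.linalg.eigvalsh(sigma_en).min()
nonzero_fraction = np.mean(sigma_en != 0)     # share of nonzero entries
```

The regularized matrix has full rank because its eigenvalues are bounded below by $\lambda_2/(1 + \lambda_2)$, but its off-diagonal entries are merely rescaled and so remain nonzero.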
Recently, various regularization methods have been proposed in the literature for estimating high-dimensional variance-covariance matrices. Some examples include tapering proposed by Furrer and Bengtsson (2007), banding by Bickel and Levina (2008b), thresholding by Bickel and Levina (2008a) and El Karoui (2008), and generalized thresholding by Rothman, Levina, and Zhu (2009). We note that covariance thresholding operators can satisfy all three properties outlined in the previous paragraph; in particular, they can generate sparse covariance estimates to accommodate the covariance sparsity assumption. In this paper, we propose to apply covariance-thresholding on the sample covariance matrix $\hat\Sigma$ in (1.2) to stabilize and improve the performance of the lasso. We call this procedure the covariance-thresholded lasso. We establish theoretical results that relate the sparsity of the covariance matrix with variable selection and compare them to those of the lasso. Simulation and real-data examples are reported. Our results suggest that the covariance-thresholded lasso can improve upon the lasso, adaptive lasso, and elastic net, especially when $\Sigma$ is sparse, $n$ is small, and $p$ is large. Even when the underlying covariance is non-sparse, the covariance-thresholded lasso is still useful in providing robust variable selection in high dimensions.
Witten and Tibshirani (2009) has recently proposed the scout procedure, that applies regularization to the inverse covariance or precision matrix. We note that this is quite different from the covariance-thresholded lasso that regularizes the sample covariance matrix directly. Furthermore, the scout penalizes using the matrix norm , where is an estimate of , whereas the covariance-thresholded lasso regularizes individual covariances directly. In our results, we will show that the scout is potentially very similar to the elastic net and that the covariance-thresholded lasso can often outperform the scout in terms of variable selection for .
The rest of the paper is organized as follows. In Section 2, we present covariance-thresholded lasso in detail and a modified LARS algorithm for our method. We discuss a generalized class of covariance-thresholding operators and explain how covariance-thresholding can stabilize the LARS algorithm for the lasso. In Section 3, we establish theoretical results on variable selection for the covariance-thresholded lasso. The effect of covariance sparsity on variable selection is especially highlighted. In Section 4, we provide simulation results of covariance-thresholded lasso at , and, in Section 5, we compare the performances of covariance-thresholded lasso with those of the lasso, adaptive lasso, and elastic net using 3 real-data sets. Section 6 concludes with further discussions and implications.
2. The Covariance-Thresholded Lasso
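Suppose that the response $y$ is centered and each column of the data matrix $X$ is standardized, as in the lasso (1.2). We define the covariance-thresholded lasso estimate as
$$\hat\beta^{CT\text{-}Lasso} = \arg\min_{\beta} \left\{ \beta^T \hat\Sigma_\nu \beta - \frac{2}{n} \beta^T X^T y + 2\lambda_n \|\beta\|_1 \right\}, \qquad (2.3)$$
where $\hat\Sigma_\nu = [\hat\sigma^\nu_{ij}]$, $\hat\sigma^\nu_{ij} = s_\nu(\hat\sigma_{ij})$, $\hat\sigma_{ij} = X_i^T X_j / n$, and $s_\nu(\cdot)$ is a pre-defined covariance-thresholding operator with $0 \le \nu < 1$. If the identity function is used as the covariance-thresholding operator, that is $s_\nu(x) = x$ for any $x$, then $\hat\beta^{CT\text{-}Lasso} = \hat\beta^{Lasso}$.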
2.1. Sparse Covariance-thresholding Operators
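We consider a generalized class of covariance-thresholding operators $s_\nu(\cdot)$ introduced in Rothman, Levina, and Zhu (2009). These operators satisfy the following properties,
$$\text{(i) } s_\nu(\hat\sigma_{ij}) = 0 \text{ for } |\hat\sigma_{ij}| \le \nu; \quad \text{(ii) } |s_\nu(\hat\sigma_{ij})| \le |\hat\sigma_{ij}|; \quad \text{(iii) } |s_\nu(\hat\sigma_{ij}) - \hat\sigma_{ij}| \le \nu. \qquad (2.4)$$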
The first property enforces sparsity for covariance estimation; the second allows shrinkage of covariances; and the third limits the amount of shrinkage. These operators satisfy the desired properties outlined in the Introduction for sparse covariance-regularizing operators and represent a wide spectrum of thresholding procedures that can induce sparsity and stabilize the sample covariance matrix. In this paper, we will consider the following covariance-thresholding operators for $\hat\sigma_{ij}$ when $i \neq j$: hard thresholding (2.5), soft thresholding (2.6), and adaptive thresholding (2.7).
[Figure 2. Panels: CT Hard, CT Soft, CT Adapt, Elastic Net.]
In Figure 2, we depict the sparse covariance-thresholding operators (2.5-2.7) for varying . Hard thresholding presents a discontinuous thresholding of covariances, whereas soft thresholding offers continuous shrinkage. Adaptive thresholding presents less regularization on covariances with large magnitudes than soft thresholding.
Figure 2 further includes the elastic net covariance-regularizing operator, $\hat\sigma_{ij}/(1 + \lambda_2)$ for $i \neq j$, where $\lambda_2$ is the elastic net tuning parameter. Evidently, this operator is non-sparse and does not satisfy the first property in (2.4). In particular, we see that the elastic net penalizes covariances with large magnitudes more severely than those with small magnitudes. In some situations, this has the benefit of alleviating multicollinearity as it shrinks covariances of highly correlated variables. However, under high-dimensionality and when much of the random perturbation of the covariance matrix arises from small but numerous covariances, the elastic net, in attempting to control this variability, may inadvertently penalize covariances with large magnitudes severely. This may introduce large bias in estimation and compromise the performance of the elastic net under some scenarios.
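The thresholding operators discussed in this subsection can be sketched as below. Hard and soft thresholding take their standard forms. The adaptive rule shown is an assumption borrowed from the adaptive-lasso thresholding of Rothman, Levina, and Zhu (2009), with a hypothetical default `gamma`; it may differ from the exact parameterization in (2.7). All function names are illustrative.

```python
import numpy as np

def hard_threshold(c, nu):
    """Keep entries with |c| > nu, set the rest to zero."""
    return np.where(np.abs(c) > nu, c, 0.0)

def soft_threshold(c, nu):
    """Shrink every entry toward zero by nu."""
    return np.sign(c) * np.maximum(np.abs(c) - nu, 0.0)

def adaptive_threshold(c, nu, gamma=2.0):
    """Assumed adaptive rule: sgn(c) * (|c| - nu^(gamma+1) / |c|^gamma)_+."""
    a = np.abs(c)
    safe = np.where(a > 0, a, 1.0)                 # avoid division by zero
    shrunk = a - nu ** (gamma + 1) / safe ** gamma
    return np.where(a > 0, np.sign(c) * np.maximum(shrunk, 0.0), 0.0)

def threshold_covariance(sigma_hat, nu, op=soft_threshold):
    """Apply `op` to the off-diagonal entries; the diagonal is left unchanged."""
    out = op(sigma_hat, nu)
    np.fill_diagonal(out, np.diag(sigma_hat))
    return out
```

Under this assumed form, the adaptive rule shrinks large covariances by less than soft thresholding does, which matches the qualitative description given for Figure 2.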
The lasso solution paths are shown to be piecewise linear in Efron, Hastie, Johnstone, and Tibshirani (2004) and Rosset and Zhu (2007). This property allows Efron, Hastie, Johnstone, and Tibshirani (2004) to propose the efficient LARS algorithm for the lasso. Likewise, in this section, we propose a piecewise-linear algorithm for the covariance-thresholded lasso.
We note that the loss function in (2.3) can sometimes be non-convex since $\hat\Sigma_\nu$ may possess negative eigenvalues for some $\nu$. This usually occurs for intermediate values of $\nu$, as $\hat\Sigma_\nu$ is at least semi-positive definite for $\nu$ close to 0 or 1. Furthermore, we note that the penalty $2\lambda_n \|\beta\|_1$ is a convex function and dominates in (2.3) for $\lambda_n$ large. Intuitively, this means that the optimization problem for the covariance-thresholded lasso is almost convex for $\beta$ sparse. This is stated conservatively in the following theorem by using the second-order condition from nonlinear programming (McCormick 1976).
Let be fixed. If is semi-positive definite, the covariance-thresholded lasso solutions for (2.3) are piecewise linear with respect to . If possesses negative eigenvalues, a set of covariance-thresholded lasso solutions, which may be local minima for (2.3) under strict complementarity, is piecewise linear with respect to for , where
The proof for Theorem 2.1 is outlined in Appendix 7.6. Strict complementarity, described in Appendix 7.6, is a technical condition that allows the second-order condition to be more easily interpreted and usually holds with high probability. We note that, when has negative eigenvalues, the solution is global if for all and is positive definite. Theorem 2.1 suggests that piecewise linearity of the covariance-thresholded lasso solution path sometimes may not hold for some when is small, even if a solution may well exist. This restricts the sets of tuning parameters (, ) for which we can compute the solutions of covariance-thresholded lasso efficiently using a LARS-type algorithm. We note that the elastic net does not suffer from a potentially non-convex optimization. However, as we will demonstrate in Figure 3 of Section 4, covariance-thresholded lasso with restricted sets of (, ) is, nevertheless, rich enough to dominate the elastic net in many situations.
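The possibility of negative eigenvalues after thresholding can be seen in a small example. The 3 by 3 correlation matrix and the threshold below are hypothetical values chosen only to make the effect visible; hard thresholding is used for simplicity.

```python
import numpy as np

# A valid (positive definite) correlation matrix, chosen for illustration.
sigma_hat = np.array([[1.0, 0.8, 0.8],
                      [0.8, 1.0, 0.4],
                      [0.8, 0.4, 1.0]])
nu = 0.5

# Hard thresholding zeroes the 0.4 entries; the unit diagonal exceeds nu and is kept.
sigma_nu = np.where(np.abs(sigma_hat) > nu, sigma_hat, 0.0)

min_eig_before = np.linalg.eigvalsh(sigma_hat).min()
min_eig_after = np.linalg.eigvalsh(sigma_nu).min()
```

Here the smallest eigenvalue is positive before thresholding and equals $1 - 0.8\sqrt{2} < 0$ afterwards, so the quadratic part of the loss is no longer convex at this intermediate threshold.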
Theorem 2.1 establishes that a set of covariance-thresholded lasso solutions are piecewise linear. This further provides us with an efficient modified LARS algorithm for computing the covariance-thresholded lasso. Let
be estimates for the covariate-residual correlations . Further, we denote the minimum eigenvalue of as . The covariance-thresholded lasso can be computed with the following algorithm.
ALGORITHM: Covariance-thresholded LARS
Initialize such that , , and . Let , , , , and .
Let and for any , where is taken only over positive elements.
Let , , , and .
If , remove the variable hitting at from . If , add the variable first attaining equality at to .
Compute the new direction, and , and let .
Repeat steps 2-5 until or .
The covariate-residual correlations $c_j$ are the most crucial quantities for computing the solution paths. They determine the variable to be included at each step and relate directly to the tuning parameter $\lambda_n$. In the original LARS for the lasso, $c_j$ is estimated as $X_j^T y / n - \hat\Sigma_j^T \hat\beta$, where $\hat\Sigma_j$ is the $j$th column of $\hat\Sigma$, which uses the sample covariance matrix without thresholding. In covariance-thresholded LARS, $c_j$ is defined using the covariance-thresholded estimate $\hat\Sigma_\nu$, which may contain many zeros. We note that, in (2.8), zero-valued covariances have the effect of essentially removing the associated coefficients from $\hat\beta$, providing parsimonious estimates for $c_j$. This allows covariance-thresholded LARS to estimate $c_j$ in a more stable way than the LARS. It is clear that covariance-thresholded LARS presents an advantage if the population covariance is sparse. On the other hand, if the covariance is non-sparse, covariance-thresholded LARS can still outperform the LARS when the sample size is small or the data are noisy. This is because parsimonious estimates of $c_j$ can be more robust against random variability of the data.
Moreover, consider computing the direction of the solution paths in Step 5, which is used for updating $\hat\beta$. LARS for the lasso computes new directions from the inverse of the sample covariance sub-matrix of the active variables, whereas covariance-thresholded LARS uses the inverse of the corresponding covariance-thresholded sub-matrix. Thus, covariance-thresholded LARS can exploit potential covariance sparsity to improve and stabilize estimates of the directions of the solution paths. In addition, the LARS for the lasso can stop early, before all true variables can be considered, if the sample covariance sub-matrix is rank deficient at an early stage when the sample size is limited. Covariance-thresholding can mitigate this problem by proceeding further with properly chosen values of $\nu$. For example, when $\nu \to 1$, $\hat\Sigma_\nu$ converges towards the identity matrix $I$, which is of full rank.
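For readers who want to experiment without a path algorithm, a plain coordinate-descent solver offers a simple alternative. The sketch below is not the covariance-thresholded LARS algorithm described above; it is a substitute technique that assumes the objective $\beta^T \hat\Sigma_\nu \beta - 2\beta^T r + 2\lambda \|\beta\|_1$ with $r = X^T y / n$, and it is only guaranteed to behave well when $\hat\Sigma_\nu$ is semi-positive definite with a positive diagonal. The function name is hypothetical.

```python
import numpy as np

def ct_lasso_cd(sigma_nu, r, lam, n_iter=500, tol=1e-10):
    """Coordinate descent for beta' S beta - 2 beta' r + 2 lam |beta|_1.

    sigma_nu : thresholded covariance matrix S (assumed PSD, positive diagonal)
    r        : vector of marginal covariances X^T y / n
    """
    p = len(r)
    beta = np.zeros(p)
    for _ in range(n_iter):
        max_change = 0.0
        for j in range(p):
            # Covariate-residual correlation with coordinate j's own term removed.
            partial = r[j] - sigma_nu[j] @ beta + sigma_nu[j, j] * beta[j]
            new = np.sign(partial) * max(abs(partial) - lam, 0.0) / sigma_nu[j, j]
            max_change = max(max_change, abs(new - beta[j]))
            beta[j] = new
        if max_change < tol:
            break
    return beta
```

Zero entries in `sigma_nu` drop the corresponding coefficients from the inner product in `partial`, which is the stabilizing effect attributed to thresholding in the discussion above. With the identity matrix in place of `sigma_nu`, the update reduces to univariate soft thresholding.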
3. Theoretical Results on Variable Selection
In this section, we derive sufficient conditions for the covariance-thresholded lasso to be consistent in selecting the true variables. We relate covariance sparsity with variable selection and demonstrate the pivotal role that covariance sparsity plays in improving variable selection under high-dimensionality. Furthermore, variable selection results for the lasso under the random design are derived and compared with those of the covariance-thresholded lasso. We show that the covariance-thresholded lasso, by utilizing covariance sparsity through a properly chosen thresholding level $\nu$, can improve upon the lasso in terms of variable selection.
For simplicity, we assume that a solution for (2.3) exists and denote the covariance-thresholded lasso estimate by $\hat\beta^\nu$ in this section. Further, we let $\hat S^\nu = \{j : \hat\beta^\nu_j \neq 0\}$ represent the collection of indices of nonzero coefficients. We say that the covariance-thresholded lasso estimate is variable selection consistent if $P(\hat S^\nu = S) \to 1$ as $n \to \infty$. In addition, we say that $\hat\beta^\nu$ is sign consistent if $P(\mathrm{sgn}(\hat\beta^\nu) = \mathrm{sgn}(\beta^*)) \to 1$ as $n \to \infty$, where $\mathrm{sgn}(t) = -1, 0, 1$ when $t < 0$, $t = 0$, and $t > 0$, respectively (Zhao and Yu 2006). Sign consistency is a stronger property and implies variable selection consistency.
We introduce two quantities to characterize the sparsity of $\Sigma$, which plays a pivotal role in the performance of the covariance-thresholded lasso. Recall that $S$ and $C$ are collections of the true and irrelevant variables, respectively. Define
$$d^*_{SS} = \max_{i \in S} \sum_{j \in S} 1(\sigma_{ij} \neq 0) \quad \text{and} \quad d^*_{CS} = \max_{i \in C} \sum_{j \in S} 1(\sigma_{ij} \neq 0). \qquad (3.9)$$
$d^*_{SS}$ ranges between 1 and $s$. When $d^*_{SS} = 1$, all pairs of the true variables are orthogonal. When $d^*_{SS} = s$, there is at least one true variable correlated with all other true variables. Similarly, $d^*_{CS}$ is between 0 and $s$. When $d^*_{CS} = 0$, the true and irrelevant variables are orthogonal to each other, and, when $d^*_{CS} = s$, some irrelevant variables are correlated with all the true variables. The values of $d^*_{SS}$ and $d^*_{CS}$ represent the sparsity of the covariance sub-matrices for the true variables and between the irrelevant and true variables, respectively. We have not specified the sparsity of the sub-matrix for the irrelevant variables themselves. It will be clear later that it is the structure of $\Sigma_{SS}$ and $\Sigma_{CS}$ instead of $\Sigma_{CC}$ that plays the pivotal role in variable selection. We note that $d^*_{SS}$ and $d^*_{CS}$ are related to another notion of sparsity used in Bickel and Levina (2008a) to define a class of sparse covariance matrices. We use the specific quantities $d^*_{SS}$ and $d^*_{CS}$ in (3.9) in order to provide easier presentation of our results for variable selection. Our results in this section can be applied to more general characterizations of sparsity, such as that in Bickel and Levina (2008a).
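The two sparsity quantities can be computed from a covariance matrix as in the sketch below. It assumes they are defined as maximum row-wise counts of nonzero covariances, which is consistent with the ranges and special cases described above; the function name `sparsity_indices` and the example matrix are hypothetical.

```python
import numpy as np

def sparsity_indices(sigma, S):
    """Return (d_SS, d_CS): the largest number of nonzero covariances that any
    true variable (resp. any irrelevant variable) has with the true variables.
    The diagonal is counted in d_SS, so d_SS is at least 1."""
    p = sigma.shape[0]
    S = np.asarray(S)
    C = np.setdiff1d(np.arange(p), S)          # irrelevant variables
    nonzero = sigma != 0
    d_ss = nonzero[np.ix_(S, S)].sum(axis=1).max()
    d_cs = nonzero[np.ix_(C, S)].sum(axis=1).max() if len(C) else 0
    return int(d_ss), int(d_cs)

# Hypothetical covariance: true variables 0 and 1 are correlated,
# and irrelevant variable 2 is correlated with true variable 0 only.
sigma = np.eye(5)
sigma[0, 1] = sigma[1, 0] = 0.3
sigma[0, 2] = sigma[2, 0] = 0.2
d_ss, d_cs = sparsity_indices(sigma, S=[0, 1])
```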
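In this paper, we employ two different types of matrix norms. For an arbitrary matrix $A = [A_{ij}]$, the infinity norm is defined as $\|A\|_\infty = \max_i \sum_j |A_{ij}|$, and the spectral norm is defined as $\|A\| = \max_{x : \|x\|_2 = 1} \|Ax\|_2$. We use $\Lambda_{\max}(A)$ and $\Lambda_{\min}(A)$ to represent, respectively, the largest and smallest eigenvalues of $A$.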
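As a quick numerical illustration of these norms, assuming the standard definitions (maximum absolute row sum and largest singular value), NumPy gives the following for an arbitrary example matrix:

```python
import numpy as np

A = np.array([[1.0, -2.0],
              [3.0,  4.0]])                      # arbitrary example matrix

inf_norm = np.abs(A).sum(axis=1).max()           # maximum absolute row sum
spec_norm = np.linalg.norm(A, 2)                 # largest singular value

gram = A.T @ A                                   # symmetric, so eigvalsh applies
eigs = np.linalg.eigvalsh(gram)                  # eigenvalues in ascending order
lam_min, lam_max = eigs[0], eigs[-1]
```

The squared spectral norm of `A` coincides with the largest eigenvalue of `A.T @ A`, which links the two notations used in this section.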
3.1. Sign Consistency of Covariance-thresholded Lasso
We develop sign consistency results for covariance-thresholded lasso. Proofs for the results are presented in the Appendix.
We first provide conditions for the covariance-thresholded lasso estimate to have the same signs as the true coefficients under the fixed design assumption. Let and .
Suppose that the data matrix is fixed and is given. Then, if
The above (3.10), (3.11), and (3.12) are derived from the Karush-Kuhn-Tucker (KKT) conditions for the optimization problem presented in (2.3) when the solution, which may be a local minimum, exists. Following the arguments in Zhao and Yu (2006) and Wainwright (2006), these conditions are almost necessary for to have the correct signs. The condition (3.10) is needed for (3.11) and (3.12) to be valid. That is, the conditions (3.11) and (3.12) are ill-defined if is singular.
Assume the random design setting so that is drawn from some distribution with population covariance . We demonstrate how the sparsity of and the procedure of covariance-thresholding work together to ensure that the condition is satisfied. We impose the following moment conditions on the random predictors :
for some constant and . Assume that
and , , and satisfy
We have the following lemma.
The rate of convergence for (3.16) depends on the rate of convergence for (3.15). It is clear that the smaller (or the sparser ) is, the faster (3.15), as well as (3.16), converges. Equivalently, for sample size fixed, the smaller is, the larger the probability that . In other words, covariance-thresholding can help to fix potential rank deficiency of when is sparse. In the special case when and , it can be shown that is asymptotically positive definite provided that .
Next, we investigate the remaining two conditions (3.11) and (3.12) in Lemma 3.1. For (3.11) and (3.12) to hold with probability going to 1, additional assumptions including the irrepresentable condition need to be imposed. Since the data matrix is assumed to be random, the original irrepresentable condition needs to be stated in terms of the population covariance matrix as follows,
for some . We note that the original irrepresentable condition in Zhao and Yu (2006) also involves the signs of . To simplify presentation, we use the stronger condition (3.17) instead. Obviously, (3.17) does not directly imply that . The next lemma establishes the asymptotic behaviors of and . Let . Assume
The above lemma indicates that with a properly chosen thresholding parameter and sample size depending on covariance-sparsity quantities and , both and behave as their population counterparts and , asymptotically. Again, the influence of the sparsity of on and is shown through and . Asymptotically, the smaller and are, the faster (3.20) and (3.21) converge. Or equivalently, for sample size fixed, the smaller and are, the larger the probabilities in (3.20) and (3.21) are. In the special case when or is a zero matrix, condition (3.19) is always satisfied.
Finally, we are ready to state the sign consistency result for the covariance-thresholded lasso. With the help of Lemmas 3.1-3.3 stated above, the only issue left is to show the existence of a proper $\lambda_n$ such that (3.11) and (3.12) hold with probability going to 1. One more condition is needed. We assume that
We note that the assumption is natural for high-dimensional sparse models, which usually have a large number of irrelevant variables. This assumption affects the conditions (3.19) and (3.22) as well as choices of and . When , that is a non-sparse linear model is assumed, the conditions for to be sign consistent need to be modified by choosing as and replacing by in conditions (3.19), (3.22), and (3.23).
It is possible to establish the convergence rate for the probability in (3.24) more explicitly. For simplicity of presentation, we provide such a result under a special case in the following theorem.
The proof of Theorem 3.3, which we omit, is similar to that of Theorem 3.2. We note that the conditions on dimension parameters in Theorem 3.2 are now expressed in the convergence rate of (3.25). It is clear that the smaller is, the larger the probability is in (3.25).
3.2. Comparison with the Lasso
We compare sign consistency results of the covariance-thresholded lasso with those of the lasso. By choosing $\nu = 0$, the covariance-thresholded lasso estimate reduces to the lasso estimate. Results on sign consistency of the lasso have been established in the literature (Zhao and Yu 2006; Meinshausen and Buhlmann 2006; Wainwright 2006). To facilitate comparison, we restate sign consistency results for the lasso in the same way that we presented results for the covariance-thresholded lasso in Section 3.1. The proofs for sign consistency of the lasso, which we omit, are similar to those for the covariance-thresholded lasso.
Compared to (3.15), (3.26) is clearly more demanding since is always less than or equal to . Note that a necessary condition for to be non-singular is , which is not required for . Thus, the non-singularity of the sample covariance sub-matrix is harder to attain than that of . In other words, covariance-thresholded lasso may increase the rank of by thresholding. When is sparse, this can be beneficial for variable selection under the large and small scenario.
We note that (3.27) is the main condition that guarantees that satisfies the irrepresentable condition with probability going to 1. Compared with (3.19), (3.27) is clearly more demanding since is larger than both and . This implies that it is harder for than for to satisfy the irrepresentable condition. In other words, covariance-thresholded lasso is more likely to be variable selection consistent than the lasso when data are randomly generated from a distribution that satisfies (3.17).
Finally, with the additional condition,
we arrive at the sign consistency of the lasso as the following.
Compare Corollary 3.1 with Theorem 3.2 for the covariance-thresholded lasso. We see that conditions (3.13), (3.14), and (3.17) on the random predictors, in particular the covariances, are the same, but the conditions on the dimension parameters are different. When the population covariance matrix is sparse, condition (3.19) on the dimension parameters is much weaker for the covariance-thresholded lasso than condition (3.27) for the lasso. This shows that the covariance-thresholded lasso can increase the chance that a consistent solution exists. However, a trade-off arises in the selection of the tuning parameter $\lambda_n$. The first condition in (3.23) for the covariance-thresholded lasso is clearly more restrictive than the condition in (3.29) for the lasso. This results in a narrower range of valid $\lambda_n$. We argue that, compared with the existence of a consistent solution, the range of valid $\lambda_n$ is of secondary concern.
We note that a related sign consistency result under random design for the lasso has been established in Wainwright (2006). They assume that the predictors are normally distributed and utilize the resulting distribution of the sample covariance matrix. The conditions used in Wainwright (2006) include (3.14), (3.17), , , , , and , for some constant . In comparison, we assume, in this paper, that the random predictors follow the more general moment conditions (3.13), which contain the Gaussian assumption as a special case. Moreover, we use a new approach to establish sign consistency that can incorporate the sparsity of the covariance matrix.
In this section, we examine the finite-sample performances of the covariance-thresholded lasso for and compare them to those of the lasso, adaptive lasso with univariate as initial estimates, UST, scout(1,1), scout(2,1), and elastic net. Further, we propose a novel variant of cross-validation that allows improved variable selection when is much less than . We note that the scout(1,1) procedure can be computationally expensive. Results for scout(1,1) that take longer than 5 days on an RCAC cluster were not shown.
We compare variable selection performances using the $G$-measure, $G = \sqrt{\text{sensitivity} \times \text{specificity}}$. $G$ is defined as the geometric mean between sensitivity, (no. of true positives)/$s$, and specificity, $1 -$ (no. of false positives)/$(p - s)$. Sensitivity and specificity can be interpreted as the proportion of selecting the true variables correctly and discarding the irrelevant variables correctly, respectively. Sensitivity can also be defined as 1 minus the false negative rate and specificity as 1 minus the false positive rate. A value close to 1 for $G$ indicates good selection, whereas a value close to 0 implies that few true variables or too many irrelevant variables are selected, or both. Furthermore, we compare prediction accuracy using the relative prediction error (RPE), $\mathrm{RPE} = E(\hat\beta - \beta^*)^T \Sigma (\hat\beta - \beta^*)/\sigma^2$, where $\Sigma$ is the population covariance matrix. The RPE is obtained by re-scaling the mean-squared error (ME), as in Tibshirani (1996), by $1/\sigma^2$.
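For a single fitted coefficient vector, the prediction-error measure and the selection counts can be sketched as follows. The sketch assumes the RPE is the quadratic form $(\hat\beta - \beta^*)^T \Sigma (\hat\beta - \beta^*)$ divided by the noise variance, evaluated per replication rather than in expectation; the function names are hypothetical.

```python
import numpy as np

def relative_prediction_error(beta_hat, beta_true, sigma, noise_var):
    """(beta_hat - beta_true)' Sigma (beta_hat - beta_true) / noise variance."""
    d = np.asarray(beta_hat, dtype=float) - np.asarray(beta_true, dtype=float)
    return float(d @ sigma @ d) / noise_var

def selection_counts(beta_hat, beta_true):
    """Number of true positives and false positives in the selected support."""
    selected = np.asarray(beta_hat) != 0
    truth = np.asarray(beta_true) != 0
    true_pos = int(np.sum(selected & truth))
    false_pos = int(np.sum(selected & ~truth))
    return true_pos, false_pos
```

Averaging `relative_prediction_error` over replications approximates the expectation in the definition, and the counts feed directly into sensitivity and specificity.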
We first present variable selection results using best-possible selection of tuning parameters, where tuning parameters are selected ex post facto based on the best $G$. This procedure is useful in examining variable selection performance, free from both the inherent variability in estimating the tuning parameters and possible differences in the validation procedures used. Moreover, it indicates the potential of the methods examined. We present median $G$ out of 200 replications using best-possible selection of tuning parameters. Standard errors based on 500 bootstrapped re-samplings are very small, in the hundredth decimal place, for median $G$ and are not shown.
Results from best-possible selection of tuning parameters allow us to understand the potential advantages of the methods if one chooses their tuning parameters correctly. However, in practice, possible errors due to the selection of tuning parameters may sometimes overcome the benefit of introducing them. Hence, we include additional results that use cross-validation to select tuning parameters.
[Figure 3. Panels: (a) Example 1; (b) Example 2; (c) Example 3.]
We study variable selection methods using a novel variant of the usual cross-validation to estimate the model complexity parameter, which allows improved variable selection when $n \ll p$. Conventional cross-validation selects tuning parameters based upon the minimum validation error, obtained from the average of sum-of-squares errors from each fold. It is well known that, when the sample size $n$ is large compared with the number of predictors $p$, prediction-based procedures such as cross-validation tend to over-select. This is because, when the sample size is large, regression methods tend to produce small but non-zero estimates for coefficients of irrelevant variables and over-training occurs. On the other hand, we note that a different scenario occurs when $n \ll p$. In this situation, prediction-based procedures, such as the usual cross-validation, tend to under-select important variables. This is because, when $n$ is small, inclusion of relatively few irrelevant variables can increase the validation error dramatically, resulting in severe instability and under-representation of important variables. In this paper, we propose to use a variant of the usual cross-validation, in which we include additional variables by decreasing the model complexity parameter for up to 1 standard deviation of the validation error at the minimum. Through extensive empirical studies, we found that this strategy often works well to prevent under-selection when , which corresponds to when and when . For and sample size only moderately large, we use the usual cross-validation at the minimum. We note that Hastie, Tibshirani, and Friedman (2001, p. 216) have described a related strategy that discards variables up to 1 standard deviation of the minimum cross-validation error, for use when $n$ is large relative to $p$ and over-selection is severe. In Tables 1-3, we present median RPE, number of true and false positives, sensitivity, specificity, and $G$ out of 200 replications using modified cross-validation for selecting tuning parameters.
The smallest 3 values of median RPE and largest 3 of median G are highlighted in bold. Standard errors based on 500 bootstrapped re-samplings are further reported in parentheses for median RPE and G. In Table 4, we provide an additional simulation study to illustrate the modified cross-validation.
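The modified cross-validation rule can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the function name is hypothetical, the candidate penalties are assumed to be sorted in decreasing order so that later candidates admit more variables, and the standard deviation is assumed to be that of the validation error across folds at the minimizing candidate.

```python
def select_penalty_modified_cv(penalties, cv_mean, cv_sd):
    """Choose a penalty below the CV minimizer, within 1 SD of the minimum error.

    penalties: candidate penalty values, sorted in decreasing order (assumption)
    cv_mean:   mean validation error for each candidate
    cv_sd:     standard deviation of the validation error for each candidate
    """
    # Usual cross-validation: the candidate with the smallest mean validation error.
    i_min = min(range(len(penalties)), key=lambda i: cv_mean[i])
    limit = cv_mean[i_min] + cv_sd[i_min]
    chosen = i_min
    # Move toward smaller penalties (more variables) while the error stays within the band.
    for i in range(i_min + 1, len(penalties)):
        if cv_mean[i] > limit:
            break
        chosen = i
    return penalties[chosen]

# Hypothetical validation curve.
penalties = [1.0, 0.5, 0.25, 0.125, 0.0625]
cv_mean = [3.0, 2.0, 2.2, 2.4, 3.5]
cv_sd = [0.5, 0.5, 0.5, 0.5, 0.5]
print(select_penalty_modified_cv(penalties, cv_mean, cv_sd))
```

In this hypothetical curve the usual rule would pick the penalty 0.5, whereas the modified rule moves to 0.125, the smallest penalty whose error stays within one standard deviation of the minimum.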
In each example, we simulate 200 data sets from the true model, , where . is generated each time from , and we vary , , and in each example to illustrate performances across a variety of situations. We choose the tuning parameter from for both the adaptive lasso (Zou 2006) and the covariance-thresholded lasso with adaptive thresholding. The adaptive lasso seeks to improve upon the lasso by applying the weights , where is an initial estimate, in order to penalize each coefficient differently in the $\ell_1$-norm of the lasso. The larger is the less the shrinkage applied to coefficients of large magnitudes. The candidate values used for are suggested in Zou (2006) and found to work well in practice.
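To illustrate how the adaptive lasso weights behave, the sketch below uses the weight form of Zou (2006), in which each weight is the inverse of the absolute initial estimate raised to a power. The symbol `gamma`, the small constant `eps` guarding against division by zero, and the numerical values are assumptions made for this illustration.

```python
def adaptive_weights(initial_estimates, gamma, eps=1e-8):
    """Penalty weights 1 / |b_j|^gamma computed from initial coefficient estimates."""
    # eps is a hypothetical safeguard so that a zero initial estimate gets a huge, finite weight.
    return [1.0 / (abs(b) + eps) ** gamma for b in initial_estimates]

initial = [2.0, 0.1]  # hypothetical initial estimates: one large, one small
for gamma in (0.5, 1.0, 2.0):  # illustrative values only
    w_large, w_small = adaptive_weights(initial, gamma)
    # The ratio shows how much more heavily the small coefficient is penalized.
    print(gamma, w_large, w_small, w_small / w_large)
```

As the power grows, the weight on the large initial estimate shrinks while the weight on the small one grows, so large coefficients are penalized relatively less.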
Example 1. (Autocorrelated.) This example has predictors with coefficients for , for , and otherwise. for all , and . Signal-to-noise ratio (SNR) is approximately . This example, similar to Example 1 in Tibshirani (1996), has an approximately sparse covariance structure, as elements away from the diagonal can be extremely small.
Table 1. Example 1 results using modified cross-validation: median RPE, numbers of true positives (TP) and false positives (FP), sensitivity, specificity, and median $G$ out of 200 replications. Standard errors are in parentheses.
|n||Method||RPE||TP||FP||Sensitivity||Specificity||G|
|20||Lasso||1.284 (0.043)||4.0||13.0||0.40||0.86||0.577 (0.003)|
| ||Adapt Lasso||1.301 (0.060)||4.0||12.0||0.40||0.87||0.581 (0.006)|
| ||UST||3.001 (0.223)||7.0||28.0||0.70||0.69||0.690 (0.008)|
| ||Scout(1,1)||1.164 (0.027)||10.0||90.0||1.00||0.00||0.000 (0.000)|
| ||Scout(2,1)||1.474 (0.053)||6.0||39.0||0.60||0.57||0.474 (0.023)|
| ||Elastic net||1.630 (0.097)||7.0||31.0||0.70||0.63||0.633 (0.021)|
| ||CT-Lasso hard||1.713 (0.100)||5.0||22.5||0.60||0.77||0.593 (0.013)|
| ||CT-Lasso soft||1.586 (0.051)||6.0||20.5||0.60||0.78||0.667 (0.007)|
| ||CT-Lasso adapt||1.602 (0.055)||6.0||20.0||0.60||0.78||0.654 (0.006)|
|40||Lasso||1.095 (0.052)||6.0||27.0||0.60||0.71||0.672 (0.003)|
| ||Adapt Lasso||1.047 (0.038)||7.0||21.0||0.70||0.77||0.706 (0.007)|
| ||UST||1.918 (0.098)||8.0||28.0||0.80||0.69||0.742 (0.006)|
| ||Scout(1,1)||0.814 (0.016)||10.0||90.0||1.00||0.00||0.000 (0.025)|
| ||Scout(2,1)||1.125 (0.039)||9.0||53.0||0.90||0.41||0.544 (0.029)|
| ||Elastic net||1.490 (0.066)||8.0||32.0||0.90||0.63||0.683 (0.010)|
| ||CT-Lasso hard||1.221 (0.072)||7.0||23.0||0.70||0.74||0.704 (0.008)|
| ||CT-Lasso soft||1.068 (0.055)||7.0||23.0||0.80||0.77||0.739 (0.007)|
| ||CT-Lasso adapt||1.063 (0.045)||7.0||23.0||0.80||0.78||0.743 (0.007)|
|80||Lasso||0.379 (0.010)||8.0||19.0||0.80||0.79||0.794 (0.005)|
| ||Adapt Lasso||0.367 (0.013)||8.0||15.0||0.80||0.82||0.800 (0.005)|
| ||UST||0.360 (0.011)||8.0||5.0||0.80||0.94||0.851 (0.012)|
| ||Scout(1,1)||0.245 (0.007)||8.0||8.0||0.80||0.91||0.854 (0.008)|
| ||Scout(2,1)||0.399 (0.014)||6.5||7.0||0.65||0.92||0.762 (0.006)|
| ||Elastic net||0.307 (0.014)||9.0||10.0||0.90||0.90||0.866 (0.006)|
| ||CT-Lasso hard||0.349 (0.013)||8.0||8.0||0.80||0.94||0.795 (0.010)|
| ||CT-Lasso soft||0.284 (0.011)||8.0||6.5||0.80||0.94||0.827 (0.008)|
| ||CT-Lasso adapt||0.316 (0.017)||8.0||8.0||0.80||0.93||0.823 (0.009)|
Figure 3(a) depicts variable selection results using best-possible selection of tuning parameters. We see that the covariance-thresholded lasso methods dominate the lasso, adaptive lasso, and UST in terms of variable selection for . The performances of the lasso and adaptive lasso deteriorate precipitously as $n$ becomes small, whereas those of the covariance-thresholded lasso methods decrease at a relatively slow pace. Furthermore, the covariance-thresholded lasso methods dominate the elastic net and scout for small $n$. We also observe that the scout procedures and elastic net perform very similarly. This is not surprising, as Witten and Tibshirani (2009) have shown in Section 2.5.1 of their paper that scout(2,1), by regularizing the inverse covariance matrix, is very similar to the elastic net.
Results from best-possible selection provide information on the potential of the methods examined. In Table 1, we present results using cross-validation to illustrate performances in practice. The covariance-thresholded lasso methods tend to dominate the lasso, adaptive lasso, scout, and elastic net in terms of variable selection for small $n$. The UST presents good variable selection performance but large prediction errors. We note that, due to its large bias, the UST cannot be legitimately applied with cross-validation that uses validation error to select tuning parameters, especially when the coefficients are disparate and some correlations are large. The advantages of the covariance-thresholded lasso with hard thresholding are less apparent than those with soft and adaptive thresholding. This suggests that, under cross-validation, continuous thresholding of covariances may achieve better performance than discontinuous thresholding. We note that the scout procedures perform surprisingly poorly compared with the covariance-thresholded lasso and the elastic net in terms of variable selection when $n$ is small. As the scout and elastic net are quite similar in terms of their potential for variable selection, as shown in Figure 3(a), the differences seem to come from the additional re-scaling step of the scout, in which the scout multiplies its initial estimates by a scalar. This strategy can sometimes be useful in improving prediction accuracy. However, when $n$ is small compared with $p$, standard deviations of validation errors for the scout can often be large, which may cause variable selection performance under cross-validation to suffer. We additionally note that, when and SNR is low as in this example, high specificity can sometimes be more important for prediction accuracy than high sensitivity.
This is because, when $n$ is small, coefficients of irrelevant variables can be given large estimates, and inclusion of even a few irrelevant variables can significantly degrade prediction accuracy. In Table 1, we see that the lasso and adaptive lasso have good prediction accuracy for $n = 20$ though they select fewer than half of the true variables.
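The contrast between discontinuous and continuous thresholding discussed above can be seen in a minimal sketch of the standard hard and soft thresholding operators applied to the off-diagonal entries of a covariance matrix. The function names, the choice to leave the diagonal untouched, and the example matrix are assumptions for illustration; the precise operators used by the covariance-thresholded lasso are defined elsewhere in the paper.

```python
def hard_threshold(s, nu):
    """Keep an entry unchanged if it exceeds nu in magnitude, else set it to zero."""
    return s if abs(s) > nu else 0.0

def soft_threshold(s, nu):
    """Shrink an entry toward zero by nu, setting it to zero if it does not exceed nu."""
    shrunk = max(abs(s) - nu, 0.0)
    return shrunk if s >= 0 else -shrunk

def threshold_covariance(S, nu, rule):
    """Apply a thresholding rule to the off-diagonal entries of S (diagonal kept as is)."""
    p = len(S)
    return [[S[i][j] if i == j else rule(S[i][j], nu) for j in range(p)]
            for i in range(p)]

# Hypothetical 3 x 3 sample covariance matrix.
S = [[1.0, 0.30, 0.05],
     [0.30, 1.0, -0.12],
     [0.05, -0.12, 1.0]]
print(threshold_covariance(S, 0.1, hard_threshold))
print(threshold_covariance(S, 0.1, soft_threshold))
```

Hard thresholding jumps from zero to the full entry as an entry crosses the threshold, whereas soft thresholding changes continuously, which is the distinction the cross-validation results appear to reflect.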
Table 2. Example 2 results using modified cross-validation: median RPE, numbers of true positives (TP) and false positives (FP), sensitivity, specificity, and median $G$ out of 200 replications. Standard errors are in parentheses.
|n||Method||RPE||TP||FP||Sensitivity||Specificity||G|
|20||Lasso||0.341 (0.027)||2.0||9.0||0.10||0.89||0.302 (0.009)|
| ||Adapt Lasso||0.352 (0.028)||2.0||9.0||0.10||0.89||0.301 (0.006)|
| ||Elastic net||0.967 (0.137)||14.0||51.5||0.70||0.36||0.437 (0.011)|
| ||UST||28.930 (0.836)||19.0||73.0||0.95||0.09||0.296 (0.012)|
| ||Scout(2,1)||0.062 (0.004)||20.0||80.0||1.00||0.00||0.000 (0.000)|
| ||CT-Lasso hard||0.383 (0.018)||3.0||11.0||0.15||0.86||0.370 (0.013)|
| ||CT-Lasso soft||0.231 (0.015)||6.5||23.0||0.33||0.71||0.465 (0.008)|
| ||CT-Lasso adapt||0.302 (0.019)||6.5||23.0||0.33||0.71||0.461 (0.012)|
|40||Lasso||0.348 (0.017)||5.0||18.5||0.25||0.77||0.429 (0.014)|
| ||Adapt Lasso||0.315 (0.024)||5.0||17.0||0.25||0.79||0.417 (0.008)|
| ||Elastic net||0.739 (0.094)||16.0||58.0||0.80||0.28||0.426 (0.014)|
| ||UST||26.189 (1.001)||20.0||77.0||1.00||0.04||0.194 (0.007)|
| ||Scout(2,1)||0.043 (0.004)||20.0||80.0||1.00||0.00||0.000 (0.000)|
| ||CT-Lasso hard||0.363 (0.018)||6.0||21.0||0.30||0.74||0.450 (0.008)|
| ||CT-Lasso soft||0.269 (0.023)||10.0||35.0||0.50||0.56||0.485 (0.006)|
| ||CT-Lasso adapt||0.306 (0.029)||8.0||31.0||0.40||0.61||0.482 (0.006)|
|80||Lasso||0.123 (0.004)||5.0||14.0||0.25||0.83||0.440 (0.008)|
| ||Adapt Lasso||0.122 (0.004)||4.0||14.0||0.20||0.83||0.423 (0.009)|
| ||Elastic net||0.089 (0.006)||14.0||48.5||0.70||0.39||0.461 (0.012)|
| ||UST||0.042 (0.003)||18.0||66.0||0.90||0.18||0.393 (0.006)|
| ||Scout(2,1)||0.038 (0.002)||20.0||80.0||1.00||0.00||0.000 (0.000)|
| ||CT-Lasso hard||0.159 (0.007)||6.0||17.5||0.30||0.78||0.468 (0.008)|
| ||CT-Lasso soft||0.107 (0.009)||9.0||27.0||0.45||0.66||0.521 (0.007)|
| ||CT-Lasso adapt||0.129 (0.014)||8.0||24.0||0.40||0.70||0.503 (0.007)|
Example 2. (Constant covariance.) This example has predictors with for , for , and otherwise. for all and such that , and . SNR is approximately . This example, derived from Example 4 in Tibshirani (1996), presents an extreme situation where all non-diagonal elements of the population covariance matrix are nonzero and constant.
In Figure 3(b), we see that the covariance-thresholded lasso methods dominate the lasso and adaptive lasso, especially for small $n$. This example shows that sparse covariance thresholding may still improve variable selection when the underlying covariance matrix is non-sparse. Furthermore, covariance-thresholded lasso methods with soft and adaptive thresholding perform better than that with hard thresholding. Interestingly, we see that the performance of UST decreases with increasing $n$ and drops below that of the lasso for . This example demonstrates that the UST may not be a good general procedure for variable selection and can sometimes fail unexpectedly. We note that this is a challenging example for variable selection in general. By the irrepresentable condition (Zhao and Yu 2006), the lasso is not variable selection consistent under this scenario. The median $G$ values in Figure 3(b) usually increase much more slowly with increasing $n$ than those of Example 1 in Figure 3(a), even though SNR is higher.
Table 2 shows that the covariance-thresholded lasso methods and the elastic net dominate over the