ETLasso: Efficient Tuning of Lasso for High-Dimensional Data
Abstract
The $\ell_1$ regularization (Lasso) has proven to be a versatile tool for selecting relevant features and estimating the model coefficients simultaneously. Despite its popularity, it is very challenging to guarantee the feature selection consistency of Lasso. One way to improve the feature selection consistency is to select an ideal tuning parameter. Traditional tuning criteria, such as cross-validation and BIC, mainly focus on minimizing the estimated prediction error or maximizing the posterior model probability; they may either be time-consuming or fail to control the false discovery rate (FDR) when the number of features is extremely large. The other way is to introduce pseudo-features to learn the importance of the original ones. Recently, the Knockoff filter was proposed to control the FDR when performing feature selection. However, its performance is sensitive to the choice of the expected FDR threshold. Motivated by these ideas, we propose a new method that uses pseudo-features to obtain an ideal tuning parameter. In particular, we present the Efficient Tuning of Lasso (ETLasso), which separates active and inactive features by adding permuted features as pseudo-features in linear models. The pseudo-features are constructed to be inactive by nature, so they can be used to obtain a cutoff for selecting the tuning parameter that separates active and inactive features. Experimental studies on both simulations and real-world data applications show that ETLasso can effectively and efficiently select active features under a wide range of scenarios.
1 Introduction
High dimensional data analysis is fundamental in many research areas such as genome-wide association studies, finance, tumor classification and biomedical imaging (Donoho, 2000; Fan and Li, 2006). The principle of sparsity, which assumes that only a small proportion of the features (the “active” ones) contribute to the response, is frequently adopted and proves useful when analyzing high dimensional data. Following this general rule, penalized least squares methods have been developed in recent years to select the active features and estimate their regression coefficients simultaneously. Among existing penalized least squares methods, the least absolute shrinkage and selection operator (Lasso) (Tibshirani, 1996) is one of the most popular regularization methods; it performs both variable selection and regularization, which enhances the prediction accuracy and interpretability of the statistical model it produces. Since then, many efforts have been devoted to developing algorithms for sparse learning with Lasso. Representative methods include but are not limited to Beck and Teboulle (2009), Wainwright (2009), Zhou (2009), Bach (2008), Reeves and Gastpar (2013), Nesterov (2013), Shalev-Shwartz and Tewari (2011), Boyd et al. (2011), Friedman et al. (2007).
Tuning parameter selection plays a pivotal role in identifying the true active features in Lasso. For example, Zhao and Yu (2006) showed that there exists an Irrepresentable Condition under which the Lasso selection is consistent when the tuning parameter converges to 0 at a rate slower than $n^{-1/2}$. Meinshausen et al. (2009) further established the convergence in $\ell_2$ norm under a relaxed irrepresentable condition with an appropriate choice of the tuning parameter. The tuning parameter can be computed theoretically, but the calculation can be difficult in practice, especially for high-dimensional data. In the literature, cross-validation (Stone, 1974), AIC (Akaike, 1974) and BIC (Schwarz, 1978) have been widely used for selecting tuning parameters for Lasso. Wang et al. (2007) and Wang et al. (2009) demonstrated that the tuning parameters selected by a BIC-type criterion can identify the true model consistently under some regularity conditions, whereas AIC and cross-validation may not lead to a consistent selection. These criteria focus on minimizing the estimated prediction error or maximizing the posterior model probability, which can be computationally intensive for large-scale datasets.
Recently, Barber and Candes (2015) proposed a novel feature selection method, “Knockoff”, that is able to control the false discovery rate when performing variable selection. This method operates by first constructing Knockoff variables (pseudo copies of the original variables) that mimic the correlation structure of the original variables, and then selecting features that are identified as much more important than their Knockoff copies according to some measure of feature importance. However, Knockoff requires the number of features to be less than the sample size, so it cannot be applied to high dimensional settings where the number of features is much larger than the number of samples. To fix this, Candes et al. (2018) further proposed the Model-X Knockoffs, which provide valid FDR control for variable selection in that scenario. However, this method is sensitive to the choice of the expected FDR level, and it cannot generate a consistent solution for the model coefficients. Moreover, as will be seen from the simulation studies presented in Section 4.1, the complexity of constructing the Knockoff matrix is sensitive to the covariance structure, and the construction is very time consuming when the number of features is large.
Motivated by both the literature on tuning parameter selection and pseudo-variable-based feature selection, we propose the Efficient Tuning of Lasso (ETLasso), which selects the ideal tuning parameter by using pseudo-features and accommodates high dimensional settings where the number of features $p$ is allowed to grow exponentially with the sample size $n$. The idea comes from the fact that active features tend to enter the model ahead of inactive ones on the solution path of Lasso. We validate this fact theoretically under some regularity conditions, which results in selection consistency and guarantees a clear separation between active and inactive features. We further propose a cutoff level to separate the active and inactive features by adding permuted features as pseudo-features, which are constructed to be inactive and can help rule out tuning parameters that identify them as active. The idea of adding pseudo-features is inspired by Luo, Stefanski and Boos (2006) and Wu, Boos and Stefanski (2007), who proposed to add random features in forward selection problems. In our method, the permuted features are generated by making a copy of X and then permuting its rows. In this way, the permuted features have the same marginal distribution as the original ones, and are not correlated with X and y. Unlike the Knockoff method, which selects features that are more important than their Knockoff copies, ETLasso tries to identify original features that are more important than all the permuted features. We show that the proposed method selects all the active features and simultaneously filters out all the inactive features with an overwhelming probability as $n$ goes to infinity and $p$ goes to infinity at an exponential rate of $n$. The experiments in Section 4 show that ETLasso outperforms other existing methods under different scenarios.
The rest of this paper is organized as follows. In Section 2, we introduce the motivation and the model framework of ETLasso. In Section 3, we establish its theoretical properties. Then, we illustrate the high efficiency and potential usefulness of our new method both by simulation studies and applications to a number of real-world datasets in Section 4. The paper concludes with a brief discussion in Section 5.
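To facilitate the presentation of our work, we use $\mathcal{A}$ to denote an arbitrary subset of $\{1, \ldots, p\}$, which amounts to a submodel with covariates $X_{\mathcal{A}} = \{X_j, j \in \mathcal{A}\}$ and associated coefficients $\beta_{\mathcal{A}} = \{\beta_j, j \in \mathcal{A}\}$. $\mathcal{A}^c$ is the complement of $\mathcal{A}$. We use $\|\cdot\|_0$ to denote the number of nonzero components of a vector and $|\mathcal{A}|$ to represent the cardinality of set $\mathcal{A}$. We denote the true model by $\mathcal{A}^* = \{j : \beta_j \neq 0, 1 \le j \le p\}$ with $|\mathcal{A}^*| = \|\beta\|_0 = s$.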
2 Motivation and Model Framework
2.1 Motivation
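Consider the problem of estimating the coefficients vector $\beta$ from the linear model
(2.1) $y = X\beta + \epsilon$,
where $y = (y_1, \ldots, y_n)^T$ is the response, $X = (x_1, \ldots, x_n)^T = (X_1, \ldots, X_p)$ is an $n \times p$ random design matrix with independent and identically distributed (IID) $p$-vectors $x_1, \ldots, x_n$. $X_1, \ldots, X_p$ correspond to the $p$ features. $\beta = (\beta_1, \ldots, \beta_p)^T$ is the coefficients vector and $\epsilon = (\epsilon_1, \ldots, \epsilon_n)^T$ is an $n$-vector of IID random errors following a sub-Gaussian distribution with $E(\epsilon_i) = 0$ and $\mathrm{Var}(\epsilon_i) = \sigma^2$. For high dimensional data where $p > n$, we often assume that only a handful of features contribute to the response, i.e., the number of nonzero components of $\beta$ is much smaller than $p$.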
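We consider the Lasso model that estimates $\beta$ under the sparsity assumption. The Lasso estimator is given by
(2.2) $\hat{\beta}(\lambda) = \arg\min_{\beta} \frac{1}{2n} \|y - X\beta\|_2^2 + \lambda \sum_{j=1}^{p} |\beta_j|$,
where $\lambda > 0$ is a regularization parameter that controls the model sparsity. Consider the point on the solution path of (2.2) at which feature $X_j$ first enters the model,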
(2.3) $Z_j = \sup\{\lambda : \hat{\beta}_j(\lambda) \neq 0\}$,
which is likely to be large for most active features and small for most inactive features. Note that $Z_j$ accounts for the joint effects among features and thus can be treated as a joint utility measure for ranking the importance of features. For orthogonal designs, the closed form solution of (2.2) (Tibshirani, 1996) for Lasso directly shows that
(2.4) 
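In Section 3, under more general conditions, we will show that
(2.5) $\min_{j \in \mathcal{A}^*} Z_j > \max_{j \in \mathcal{A}^{*c}} Z_j$,
where $\mathcal{A}^*$ is the set of active features and $Z_j$ is the entry point of feature $j$ defined in (2.3). Property (2.5) implies a clear separation between active and inactive features, so the next step is to find a practical way to estimate the largest entry point among the inactive features, $\max_{j \in \mathcal{A}^{*c}} Z_j$, in order to identify active features, i.e., obtain an ideal cutoff to separate the active and the inactive features.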
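To make the orthogonal-design case concrete, the following minimal sketch assumes the $\frac{1}{2n}$-scaled Lasso objective and a design with $X^T X / n = I$. Under those assumptions the solution is coordinate-wise soft thresholding of $X_j^T y / n$, so feature $j$ enters the path at $|X_j^T y| / n$. The function names and the toy data are illustrative and are not taken from the paper.

```python
def soft_threshold(z, t):
    """Soft-thresholding operator: sign(z) * max(|z| - t, 0)."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_orthogonal(columns, y, lam):
    """Lasso solution when the columns satisfy X^T X / n = I (assumed, not checked)."""
    n = len(y)
    return [soft_threshold(sum(c[i] * y[i] for i in range(n)) / n, lam)
            for c in columns]


def entry_points(columns, y):
    """Largest lambda at which each coefficient is nonzero: |X_j^T y| / n."""
    n = len(y)
    return [abs(sum(c[i] * y[i] for i in range(n))) / n for c in columns]


# Toy orthogonal design: three mutually orthogonal +/-1 columns with n = 4.
columns = [[1, 1, 1, 1], [1, -1, 1, -1], [1, 1, -1, -1]]
# y = 3 * column 0 - 0.25 * column 2 + noise orthogonal to all three columns.
y = [3.25, 2.25, 2.75, 3.75]

z = entry_points(columns, y)
beta_at_1 = lasso_orthogonal(columns, y, lam=1.0)
print(z)          # feature 0 has by far the largest entry point
print(beta_at_1)  # only feature 0 survives at lambda = 1
```

In this toy example the strong feature has a much larger entry point than the weak and null features, which is the separation that the cutoff is meant to exploit.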
2.2 Model Framework
Motivated by Property (2.5), we calculate the cutoff that separates the active and inactive features by adding pseudo-features. Since pseudo-features are known to be inactive, we can rule out tuning parameters that identify them as active. The permuted features $X^{\pi} = (x_{\pi(1)}, \ldots, x_{\pi(n)})^T$, where $x_i^T$ is the $i$th row of $X$ and $\pi = \{\pi(1), \ldots, \pi(n)\}$ is a permutation of $\{1, \ldots, n\}$, are used as the pseudo-features. In particular, matrix $X^{\pi}$ satisfies
(2.6) $X^{\pi T} X^{\pi} = X^T X$.
That is, the permuted features possess the same correlation structure as the original features, while breaking the association with $y$ due to the permutation. Suppose that the features are centered; then the design matrix $[X, X^{\pi}]$ satisfies
(2.7) $[X, X^{\pi}]^T [X, X^{\pi}] \approx \begin{bmatrix} \Sigma & 0 \\ 0 & \Sigma \end{bmatrix}$,
where $\Sigma$ is the correlation structure of $X$, and the approximately-zero off-diagonal blocks arise from the fact that $E_{\pi}[X_i^T X_j^{\pi}] = 0$ when the features are centered.
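As an illustration of the construction described above, the sketch below builds a permuted copy of a small design matrix by shuffling its rows and appends it column-wise to the original. It uses only the Python standard library; the function name `permuted_copy` and the toy matrix are hypothetical. Because whole rows are moved together, every permuted column keeps the same set of values as its original, while the pairing with the response is broken.

```python
import random


def permuted_copy(rows, seed=0):
    """Return a copy of the design matrix with its rows in a random order."""
    rng = random.Random(seed)
    order = rng.sample(range(len(rows)), len(rows))  # a permutation of 0..n-1
    return [list(rows[i]) for i in order]


# Toy design matrix: n = 5 observations (rows), p = 2 features (columns).
X = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0], [5.0, 50.0]]
X_perm = permuted_copy(X, seed=42)

# Augmented design [X, X_perm]: p original columns followed by p pseudo-features.
X_aug = [orig + perm for orig, perm in zip(X, X_perm)]

print(len(X_aug), len(X_aug[0]))
```

Keeping the rows intact, rather than shuffling each column separately, is what preserves the correlation structure among the pseudo-features.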
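Now we define the augmented design matrix $X^A = [X, X^{\pi}]$, where $X$ is the original design matrix and $X^{\pi}$ is the permuted design matrix. The augmented linear model with $X^A$ as design matrix is
(2.8) $y = X^A \beta^A + \epsilon^A$,
where $\beta^A$ is a $2p$ vector of coefficients and $\epsilon^A$ is the error term. The corresponding Lasso regression problem is
(2.9) $\hat{\beta}^A(\lambda) = \arg\min_{\beta^A} \frac{1}{2n} \|y - X^A \beta^A\|_2^2 + \lambda \sum_{j=1}^{2p} |\beta^A_j|$.
Similar to $Z_j$, we define $Z^A_j$ by
(2.10) $Z^A_j = \sup\{\lambda : \hat{\beta}^A_j(\lambda) \neq 0\}$, $j = 1, \ldots, 2p$,
which is the largest tuning parameter $\lambda$ at which $X^A_j$ enters the model (2.8). Since the permuted features $\{X^A_j : j = p+1, \ldots, 2p\}$ are truly inactive by construction, it holds in probability that $\min_{j \in \mathcal{A}^*} Z^A_j > \max_{p+1 \le j \le 2p} Z^A_j$ for the active set $\mathcal{A}^*$, by Theorem 1 in Section 3. Define $C_p = \max_{p+1 \le j \le 2p} Z^A_j$, which can be regarded as a benchmark to separate the important features from the inactive ones. This leads to a soft thresholding selection
(2.11) $\hat{\mathcal{A}} = \{j : Z^A_j > C_p, 1 \le j \le p\}$.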
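We implement a two-stage algorithm in order to reduce the false selection rate. We first generate two permuted feature matrices $X^{\pi_1}$ and $X^{\pi_2}$. In the first stage, we select the feature set $\hat{\mathcal{A}}_1$ based on the rule (2.11) using $X^{\pi_1}$. Then in the second stage, we combine $X_{\hat{\mathcal{A}}_1}$ and $X^{\pi_2}_{\hat{\mathcal{A}}_1}$ to obtain a new augmented design matrix and select the final feature set $\hat{\mathcal{A}}_2$. The procedure of ETLasso is summarized in Algorithm 1.

Step 1. Generate two different permuted predictor samples $X^{\pi_1}$ and $X^{\pi_2}$, and then combine $X^{\pi_1}$ with X to obtain the augmented design matrix $X^{A_1} = [X, X^{\pi_1}]$.

Step 2. For design matrix $X^{A_1}$, we solve the problem
(2.12) $\arg\min_{\beta} \frac{1}{2n} \|y - X^{A_1}\beta\|_2^2 + \lambda \|\beta\|_1$ over the grid $\lambda_1 > \lambda_2 > \cdots > \lambda_d$. $\lambda_1$ is the smallest tuning parameter value at which none of the features could be selected. $\lambda_d$ is the cutoff point, i.e., the grid value at which the first pseudo-feature is selected. In other words, $\lambda_d$ can be regarded as an estimator of the benchmark that separates the active features from the inactive ones. Then we use selection rule (2.11) to obtain $\hat{\mathcal{A}}_1$.

Step 3. Combine $X_{\hat{\mathcal{A}}_1}$ with $X^{\pi_2}_{\hat{\mathcal{A}}_1}$, which only includes features in $\hat{\mathcal{A}}_1$, to obtain the augmented design matrix $X^{A_2} = [X_{\hat{\mathcal{A}}_1}, X^{\pi_2}_{\hat{\mathcal{A}}_1}]$. Repeat Step 2 for the new design matrix $X^{A_2}$ over its own grid to select $\hat{\mathcal{A}}_2$.
Remark. For the path of ETLasso, we first start with $\lambda_1$, at which no feature would be selected, then we set a grid of equally spaced points decreasing from $\lambda_1$, and finally add 0 to the path. The ETLasso procedure stops at $\lambda_d$, when the first pseudo-feature is selected.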
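The two-stage procedure can be sketched end to end in plain Python. This is an illustration under several assumptions that are not confirmed by the text: the Lasso objective is scaled by $\frac{1}{2n}$, a simple coordinate descent solver stands in for whatever solver the authors used, the grid has `n_grid` equally spaced values ending at `lam_max / n_grid` before the final 0, and the second stage restricts both the original and the second permuted copy to the first-stage survivors. All function names are hypothetical.

```python
import random


def soft_threshold(z, t):
    """Soft-thresholding operator: sign(z) * max(|z| - t, 0)."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_cd(cols, y, lam, beta, max_iter=500, tol=1e-9):
    """Coordinate descent for (1/(2n)) * ||y - X b||^2 + lam * ||b||_1.

    `cols` holds the feature columns; `beta` is a warm start updated in place.
    """
    n = len(y)
    resid = [y[i] - sum(c[i] * b for c, b in zip(cols, beta)) for i in range(n)]
    scale = [sum(v * v for v in c) / n for c in cols]
    for _ in range(max_iter):
        largest_change = 0.0
        for j, c in enumerate(cols):
            if scale[j] == 0.0:
                continue
            rho = sum(c[i] * resid[i] for i in range(n)) / n + scale[j] * beta[j]
            new = soft_threshold(rho, lam) / scale[j]
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= delta * c[i]
                beta[j] = new
                largest_change = max(largest_change, abs(delta))
        if largest_change < tol:
            break
    return beta


def permute_rows(cols, rng):
    """Apply one shared row permutation to every column."""
    n = len(cols[0])
    order = rng.sample(range(n), n)
    return [[c[i] for i in order] for c in cols]


def select_stage(cols, pseudo, y, n_grid=50):
    """Original features that enter the path before any pseudo-feature."""
    n, p = len(y), len(cols)
    aug = cols + pseudo
    lam_max = max(abs(sum(c[i] * y[i] for i in range(n))) / n for c in aug)
    # Assumed grid: n_grid equally spaced values from lam_max down, then 0.
    grid = [lam_max * (n_grid - k) / n_grid for k in range(n_grid)] + [0.0]
    beta = [0.0] * len(aug)
    entered = set()
    for lam in grid:
        lasso_cd(aug, y, lam, beta)
        if any(b != 0.0 for b in beta[p:]):
            break  # cutoff: the first pseudo-feature has become active
        entered.update(j for j in range(p) if beta[j] != 0.0)
    return sorted(entered)


def et_lasso(cols, y, seed=0):
    """Two-stage selection; stage 2 re-runs on the stage-1 survivors."""
    rng = random.Random(seed)
    perm1, perm2 = permute_rows(cols, rng), permute_rows(cols, rng)
    stage1 = select_stage(cols, perm1, y)
    if not stage1:
        return []
    stage2 = select_stage([cols[j] for j in stage1], [perm2[j] for j in stage1], y)
    return [stage1[k] for k in stage2]


# Toy data: features 0 and 1 are active, the remaining four are inactive.
rng = random.Random(1)
n, p = 80, 6
X_cols = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(p)]
y = [4 * X_cols[0][i] - 4 * X_cols[1][i] + rng.gauss(0, 1) for i in range(n)]

selected = et_lasso(X_cols, y)
print(selected)
```

The path is walked from the largest tuning parameter downward with warm starts, and the loop exits as soon as a pseudo-feature has a nonzero coefficient, so the smaller tuning parameters are never fitted.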
2.3 Comparison with “Knockoff”
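The Knockoff methods have been proposed to control the false discovery rate when performing variable selection (Barber and Candes, 2015; Candes et al., 2018). Specifically, the Knockoff features $\tilde{X}$ obey
(2.13) $\tilde{X}^T \tilde{X} = \Sigma, \quad X^T \tilde{X} = \Sigma - \mathrm{diag}\{s\}$,
where $\Sigma = X^T X$ and $s$ is a $p$-dimensional nonnegative vector. $\tilde{X}$ possesses the same covariance structure as X. The authors then set
(2.14) $W_j = \max(Z_j, \tilde{Z}_j) \cdot \mathrm{sign}(Z_j - \tilde{Z}_j)$,
where $Z_j$ and $\tilde{Z}_j$ are the entry points of feature $j$ and of its Knockoff copy on the Lasso path, as the importance metric for feature $j$ and a data-dependent threshold $T$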
(2.15) 
where $\mathcal{W} = \{|W_j| : j = 1, \ldots, p\} \setminus \{0\}$ and $q$ is the expected FDR level. The Knockoff selects the feature set as $\{j : W_j \ge T\}$, which has been shown to have FDR controlled at $q$ (Barber and Candes, 2015; Candes et al., 2018).
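The importance statistic and threshold can be sketched in a few lines. The code below assumes the sign-max statistic $W_j = \max(Z_j, \tilde{Z}_j)\,\mathrm{sign}(Z_j - \tilde{Z}_j)$ and the basic Knockoff threshold from Barber and Candes (2015); whether (2.15) uses this variant or the more conservative Knockoff+ variant is an assumption here, and the entry-point values are made up for illustration.

```python
def knockoff_statistics(z, z_tilde):
    """W_j = max(Z_j, Z~_j) * sign(Z_j - Z~_j) for each feature j."""
    w = []
    for a, b in zip(z, z_tilde):
        sign = (a > b) - (a < b)
        w.append(max(a, b) * sign)
    return w


def knockoff_threshold(w, q):
    """Smallest t among the nonzero |W_j| with #{W_j <= -t} / max(#{W_j >= t}, 1) <= q."""
    candidates = sorted({abs(v) for v in w if v != 0})
    for t in candidates:
        negatives = sum(1 for v in w if v <= -t)
        positives = sum(1 for v in w if v >= t)
        if negatives / max(positives, 1) <= q:
            return t
    return float("inf")  # no threshold qualifies: nothing is selected


def knockoff_select(w, q):
    """Indices of features with W_j at or above the data-dependent threshold."""
    t = knockoff_threshold(w, q)
    return [j for j, v in enumerate(w) if v >= t]


# Hypothetical entry points for five features and their Knockoff copies.
z = [5.0, 4.0, 0.5, 0.3, 1.0]
z_tilde = [0.2, 0.1, 0.6, 0.1, 0.9]
w = knockoff_statistics(z, z_tilde)

strict = knockoff_select(w, q=0.2)
loose = knockoff_select(w, q=0.5)
print(w, strict, loose)
```

On this toy input the selected set changes with the target level `q`, which is the kind of sensitivity to the expected FDR that the comparison in this section refers to.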
The Knockoff method selects features that are clearly better than their Knockoff copies, while the ETLasso method selects the features that are more important than all the pseudo-features. Compared with Knockoff, our method of constructing the pseudo-features is much simpler than creating the Knockoff features. In particular, when the dimension of the data is extremely large, it is very time consuming to construct the Knockoff features. Moreover, the Knockoff method is not able to provide a consistent estimator for the model coefficients. In addition, the feature selection performance of Knockoff is sensitive to the choice of the expected FDR level, as shown by our experiments, whereas our method does not have such a hyperparameter that needs to be tuned carefully. A comprehensive numerical comparison between ETLasso and Knockoff is presented in Section 4.1.
3 Theoretical Properties
Essentially, property (2.5) is the key to the success of our selection procedure that applies ETLasso to select the ideal regularization parameter. Now we study (2.5) in a more general setting than orthogonal designs. We introduce the regularity conditions needed in this study.

(C1) (Mutual Incoherence Condition) There exists some such that
where for any matrix .

(C2) There exists some such that
where denotes the minimum eigenvalue of A.

(C3) , and , where , and .
Condition (C1) is called the mutual incoherence condition, and it has been considered in previous work on Lasso (Wainwright, 2009; Fuchs, 2005; Tropp, 2006). This condition resembles a regularization constraint on the regression coefficients of the inactive features on the active features. Condition (C2) indicates that the design matrix consisting of active features is full rank. Condition (C3) states some requirements for establishing the selection consistency of the proposed method. The first one assumes that $p$ diverges with $n$ up to an exponential rate, which allows the dimension of the data to be substantially larger than the sample size. The second one implies that the number of active features is allowed to grow with the sample size, but only at a restricted rate. We also require that the minimal component of the active coefficients does not degenerate too fast.
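One of the main results of this paper is that under conditions (C1)-(C3), property (2.5) holds in probability:
Theorem 1
Under conditions (C1)-(C3), assume that the design matrix X has its $n$-dimensional columns normalized such that $n^{-1/2} \max_{j} \|X_j\|_2 \le 1$; then property (2.5) holds with probability tending to one as $n \to \infty$.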
Theorem 1 justifies using the entry point $Z_j$ in (2.3) to rank the importance of features. In other words, $Z_j$ ranks an active feature above an inactive one with high probability, and thus guarantees a clear separation between the active and inactive features. The proof is given in the supplementary material.
The following result shows the upper bound on the probability of recruiting any inactive feature by the proposed method, and implies that our method excludes all the inactive features asymptotically when .
Theorem 2
Let be a positive integer. Assume that the inactive features and the permuted features are equally likely to be selected by the ETLasso procedure, then we have
(3.1) 
where as specified in condition C2.
Remark. When the features are independent of each other, the inactive features and the permuted features are equally likely to be selected by the ETLasso procedure. In reality, the inactive features may be correlated with the active features, which makes them more likely to be selected ahead of permuted features. We consider such a case in the simulation study, and find that ETLasso still outperforms the other methods.
The proof is given in the supplementary material. Theorem 2 indicates that the number of false positives can be controlled better if there are more active features in the model, and our simulation results in Section 4 support this property.
4 Experiments
4.1 Simulation Study
In this section, we compare the finite sample performance of ETLasso with Lasso+BIC (BIC), Lasso+Cross-validation (CV) and Knockoff (KF) under different settings. For the CV method, we use 5-fold cross-validation to select the tuning parameter $\lambda$. We consider three FDR thresholds for Knockoff, 0.05, 0.1 and 0.2, to assess the sensitivity of the performance of Knockoff to the choice of the FDR threshold. The response is generated from the linear regression model (2.1), where , for .

the sample size ;

the number of predictors ;

the following three covariance structures of (Fan and Lv, 2008) are included to examine the effect of covariance structure on the performance of the methods:

(i) Independent, i.e, ,

(ii) AR(1) correlation structure: ,

(iii) Compound symmetric correlation structure (CS): if and otherwise;


, for , where Bernoulli (0.5), and for ;
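The three covariance structures can be written down directly. In the sketch below, the AR(1) entries are taken to be $\rho^{|i-j|}$ and the compound symmetric entries to be 1 on the diagonal and $\rho$ elsewhere; these are the usual definitions, and the correlation parameter `rho` is a placeholder because its value is not recoverable from the text above.

```python
def independent_cov(p):
    """Identity matrix: features are uncorrelated with unit variance."""
    return [[1.0 if i == j else 0.0 for j in range(p)] for i in range(p)]


def ar1_cov(p, rho):
    """AR(1) structure: correlation decays as rho ** |i - j|."""
    return [[rho ** abs(i - j) for j in range(p)] for i in range(p)]


def cs_cov(p, rho):
    """Compound symmetric structure: 1 on the diagonal, rho elsewhere."""
    return [[1.0 if i == j else rho for j in range(p)] for i in range(p)]


# Small 3 x 3 examples with an illustrative correlation parameter.
sigma_ind = independent_cov(3)
sigma_ar1 = ar1_cov(3, rho=0.5)
sigma_cs = cs_cov(3, rho=0.5)

for row in sigma_ar1:
    print(row)
```

Under AR(1) the correlation between two features fades with their distance in the index, whereas under CS every pair of features is equally correlated, which is why the CS setting is the harder one for separating inactive features from active ones.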
The simulation results are based on replications and the following criteria are used to evaluate the performance of ETLasso:

Precision: the average precision (number of active features selected / number of features selected) over simulations;

Recall: the average recall (number of active features selected / number of active features) over simulations;

F1: the average F1 score (harmonic mean of precision and recall) over simulations;

Time: the average running time of each method over simulations.
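For a single replication, the first three criteria can be computed as in the sketch below. The function name is hypothetical, and returning `None` when no feature is selected is a convention chosen here to mirror the fact that precision and the F1 score are undefined in that case.

```python
def selection_metrics(selected, active):
    """Precision, recall and F1 score of a selected feature set."""
    selected, active = set(selected), set(active)
    true_positives = len(selected & active)
    precision = true_positives / len(selected) if selected else None
    recall = true_positives / len(active) if active else None
    if precision is None or recall is None or precision + recall == 0:
        f1 = None  # undefined when nothing is selected or nothing is correct
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Three active features; the method picks all of them plus one false positive.
precision, recall, f1 = selection_metrics(selected=[0, 1, 2, 7], active=[0, 1, 2])
print(precision, recall, f1)

# An empty selection leaves precision and F1 undefined.
empty = selection_metrics(selected=[], active=[0, 1, 2])
print(empty)
```

Averaging these per-replication values over all simulations gives the reported criteria; replications with an empty selection have to be handled separately because their precision cannot be averaged.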
The simulation results are summarized in Tables 1 and 2. ETLasso has higher precision and F1 score than the other methods under all circumstances. For the independent setting, all methods except KF(0.05) successfully recover all active features, as suggested by the recall values. The average precision values of ETLasso are all at least 0.97, while Lasso+BIC has precision values between 0.61 and 0.68, and Lasso+CV has precision values around 0.2. KF(0.05) barely selects any feature into the model due to its restrictive FDR control, resulting in very small recall values, and the number of selected features is zero in some of the replications. KF(0.1) and KF(0.2) successfully identify all active features, but their precision values and F1 scores are smaller than those of ETLasso. The results for the AR(1) covariance structure are similar to those of the independent setting. In the CS setting, KF-based methods sometimes select no features, and thus the corresponding precision and F1 scores cannot be computed (shown as # in the tables). ETLasso again outperforms the others in terms of precision and F1 score. In addition, ETLasso enjoys favorable computational efficiency compared with Lasso+CV and Knockoff. ETLasso finishes in less than 0.5 s in all settings, while Knockoffs require significantly more computing time, and their computational costs increase rapidly as the number of features increases. In addition, the performance of Knockoff relies on the choice of the expected FDR. When the correlations between features are strong, the Knockoff method needs higher FDR thresholds to select all the active variables.
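Table 1: Simulation results for the independent and AR(1) settings.
Independent  AR(1)  
Method  Precision  Recall  F1  Time (s)  Precision  Recall  F1  Time (s)  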
,  
ETLasso  0.97 (0.06)  1.0 (0.0)  0.98 (0.03)  0.27 (0.06)  0.93 (0.08)  1.0 (0.0)  0.96 (0.04)  0.27 (0.06) 
BIC  0.68 (0.14)  1.0 (0.0)  0.80 (0.10)  0.09 (0.02)  0.64 (0.13)  1.0 (0.0)  0.77 (0.10)  0.10 (0.03) 
CV  0.20 (0.08)  1.0 (0.0)  0.33 (0.11)  1.01 (0.19)  0.20 (0.07)  1.0 (0.0)  0.32 (0.10)  1.01 (0.19) 
KF(0.05)  #  0.00 (0.03)  #  348.6 (383.7)  #  1.0 (0.0)  #  427.8 (391.9) 
KF(0.1)  0.92 (0.10)  1.0 (0.0)  0.96 (0.06)  356.1 (394.5)  0.91 (0.11)  1.0 (0.0)  0.95 (0.06)  436.9 (405.4) 
KF(0.2)  0.83 (0.15)  1.0 (0.0)  0.90 (0.09)  352.5 (388.8)  0.82 (0.15)  1.0 (0.0)  0.89 (0.10)  432.3 (400.2) 
,  
ETLasso  0.97 (0.04)  1.0 (0.0)  0.99 (0.02)  0.26 (0.05)  0.94 (0.06)  1.0 (0.0)  0.97 (0.03)  0.27 (0.06) 
BIC  0.63 (0.12)  1.0 (0.0)  0.77 (0.09)  0.09 (0.01)  0.59 (0.11)  1.0 (0.0)  0.74 (0.09)  0.09 (0.02) 
CV  0.21 (0.06)  1.0 (0.0)  0.34 (0.08)  0.93 (0.12)  0.20 (0.05)  1.0 (0.0)  0.33 (0.07)  1.02 (0.18) 
KF(0.05)  #  0.45 (2.56)  #  368.1 (410.2)  #  0.04 (0.20)  0.28 (1.36)  453.9 (418.7) 
KF(0.1)  0.93 (0.09)  1.0 (0.0)  0.96 (0.06)  362.5 (406.6)  0.92 (0.10)  1.0 (0.0)  0.95 (0.06)  460.4 (430.4) 
KF(0.2)  0.82 (0.13)  1.0 (0.0)  0.89 (0.08)  351.7 (397.0)  0.80 (0.13)  1.0 (0.0)  0.88 (0.09)  455.9 (423.5) 
,  
ETLasso  0.97 (0.06)  1.0 (0.0)  0.98 (0.03)  0.47 (0.07)  0.94 (0.07)  1.0 (0.0)  0.97 (0.04)  0.47 (0.08) 
BIC  0.65 (0.13)  1.0 (0.0)  0.78 (0.10)  0.17 (0.03)  0.63 (0.14)  1.0 (0.0)  0.76 (0.11)  0.16 (0.03) 
CV  0.17 (0.07)  1.0 (0.0)  0.29 (0.10)  1.75 (0.22)  0.17 (0.06)  1.0 (0.0)  0.29 (0.09)  1.73 (0.28) 
KF(0.05)  #  0.002 (0.04)  #  1252.8 (1355.3)  #  0.0 (0.0)  #  1694.6 (1538.7) 
KF(0.1)  0.92 (0.10)  1.0 (0.0)  0.96 (0.06)  1221.9 (1321.4)  0.92 (0.10)  1.0 (0.0)  0.95 (0.06)  1660.8 (1505.1) 
KF(0.2)  0.82 (0.16)  1.0 (0.0)  0.89 (0.10)  1200.6 (1304.6)  0.82 (0.15)  1.0 (0.0)  0.89 (0.10)  1612.4 (1451.8) 
,  
ETLasso  0.98 (0.04)  1.0 (0.0)  0.99 (0.02)  0.46 ( 0.08)  0.95 (0.06)  1.0 (0.0)  0.97 (0.03)  0.46 (0.09) 
BIC  0.61 (0.12)  1.0 (0.0)  0.75 (0.09)  0.16 (0.03)  0.58 (0.11)  1.0 (0.0)  0.73 (0.09)  0.16 (0.03) 
CV  0.17 (0.05)  1.0 (0.0)  0.29 (0.08)  1.71 (0.26)  0.17 (0.05)  1.0 (0.0)  0.29 (0.07)  1.72 (0.28) 
KF(0.05)  #  0.03 (0.16)  #  1251.6 (1347.3)  #  0.03 (0.18)  #  1689.2 (1521.6) 
KF(0.1)  0.92 (0.10)  1.0 (0.0)  0.96 (0.06)  1240.5 (1319.2)  0.93 (0.10)  1.0 (0.0)  0.96 (0.06)  1658.4 (1490.3) 
KF(0.2)  0.82 (0.13)  1.0 (0.0)  0.89 (0.08)  1192.2 (1269.2)  0.82 (0.12)  1.0 (0.0)  0.89 (0.08)  1610.8 (1442.5) 
Table 2: Simulation results for the CS setting.
CS  
Method  Precision  Recall  F1  Time (s)  
,  
ETLasso  0.89 (0.17)  1.0 (0.0)  0.93 (0.12)  0.26 (0.05) 
BIC  0.57 (0.16)  1.0 (0.0)  0.71 (0.13)  0.09 (0.01) 
CV  0.20 (0.07)  1.0 (0.0)  0.32 (0.10)  0.94 (0.14) 
KF(0.05)  #  0.00 (0.03)  #  53.22 (6.59) 
KF(0.1)  #  0.94 (0.23)  #  51.01 (6.68) 
KF(0.2)  0.83 (0.15)  0.99 (0.03)  0.89 (0.10)  50.6 (5.9) 
,  
ETLasso  0.92 (0.12)  1.0 (0.01)  0.95 (0.08)  0.26 (0.05) 
BIC  0.55 (0.12)  1.0 (0.0)  0.70 (0.10)  0.09 (0.01) 
CV  0.20 (0.06)  1.0 (0.0)  0.34 (0.08)  0.93 (0.12) 
KF(0.05)  #  0.02 (0.16)  #  53.20(6.72) 
KF(0.1)  #  0.98 (0.08)  #  51.12 (6.80) 
KF(0.2)  0.82 (0.13)  0.99 (0.03)  0.89 (0.08)  50.67 (6.26) 
,  
ETLasso  0.86 (0.19)  1.0 (0.0)  0.91 (0.14)  0.46 (0.07) 
BIC  0.53 (0.16)  1.0 (0.0)  0.68 (0.13)  0.16 (0.03) 
CV  0.17 (0.06)  1.0 (0.0)  0.28 (0.09)  1.63 (0.21) 
KF(0.05)  #  0.03 (0.55)  #  119.1 (14.97) 
KF(0.1)  #  0.79 (0.40)  #  115.6 (14.68) 
KF(0.2)  #  0.97 (0.07)  #  116.1 (14.99) 
,  
ETLasso  0.90 (0.15)  1.0 (0.02)  0.94 (0.10)  0.45 (0.07) 
BIC  0.51 (0.13)  1.0 (0.0)  0.67 (0.11)  0.16 (0.03) 
CV  0.17 (0.05)  1.0 (0.0)  0.29 (0.07)  1.63 (0.23) 
KF(0.05)  #  0.02 (0.14)  #  119.8 (16.53) 
KF(0.1)  #  0.93 (0.19)  #  116.2 (15.62) 
KF(0.2)  #  0.96 (0.07)  #  115.9 (15.66) 
4.2 Stock Price Prediction
In this part, we apply the ETLasso method to stock price prediction. We select four stocks from four big companies: GOOG, IBM, AMZN and WFC. We use the stock open prices from 2010-01-04 to 2013-12-30 to train the model, and then predict the open prices in the trading year 2014. All the stock prices are normalized. Considering that the current open price of a stock might be affected by the open prices of the last 252 days, we apply the following regression model,
(4.1) 
and the regularization methods can be written as
(4.2) 
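The regression on past prices amounts to building a lagged design matrix from the price series. The sketch below assumes an autoregressive form in which each row holds the previous `n_lags` open prices (252 in this application) and the target is the current open price; the function name and the toy series are illustrative only.

```python
def lagged_design(series, n_lags):
    """Rows of the previous n_lags values (most recent first) and their targets."""
    rows, targets = [], []
    for t in range(n_lags, len(series)):
        rows.append([series[t - k] for k in range(1, n_lags + 1)])
        targets.append(series[t])
    return rows, targets


# Toy price series with 3 lags; the stock application would use n_lags=252.
prices = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
X_lag, y_lag = lagged_design(prices, n_lags=3)

for row, target in zip(X_lag, y_lag):
    print(row, "->", target)
```

The first `n_lags` observations serve only as predictors, so a series of length $T$ yields $T$ minus `n_lags` training rows.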
We compare ETLasso with Lasso+CV, Lasso+BIC and Knockoff (KF). Since Knockoff cannot estimate the coefficients directly, we implement a two-stage method for Knockoff: in the first stage we apply Knockoff for feature selection, and in the second stage we fit a linear regression model with the selected features and make predictions on the test data. Figure 1 depicts the predicted price using ETLasso and the true price of the four stocks. The black line shows the true open price and the red line is the predicted value. The ETLasso-based method tracks the trend of the stock price change closely. The mean squared error (MSE) and the median of the number of selected features (DF) are reported in Table 3. ETLasso outperforms Lasso+BIC and Lasso+CV in terms of both prediction error and model complexity. For instance, when we predict the stock price of WFC, the MSE of ETLasso is only a fraction of that of Lasso+CV and of Lasso+BIC. Knockoff methods with a controlled FDR smaller than 0.5 are over-conservative in feature selection, leading to an empty recovery set in most circumstances. KF(0.5) works well on IBM, AMZN and WFC, with resulting MSE comparable to that of ETLasso; however, it selects no features on the GOOG stock. In terms of computational cost, ETLasso is much faster than Knockoff and Lasso+CV and slower than Lasso+BIC, but it achieves much better prediction performance than Lasso+BIC.
GOOG  IBM  
MSE  DF  Time  MSE  DF  Time  
ETLasso  9  0.25  3  0.12  
CV  9  0.57  3  0.29  
BIC  2  0.06  2  0.02  
KF(0.05)  #  0  10.44  #  0  8.37 
KF(0.1)  #  0  8.26  #  0  8.48 
KF(0.2)  #  0  8.54  #  0  8.09 
KF(0.3)  #  0  8.24  #  0  7.71 
KF(0.4)  #  0  8.20  #  0  7.55 
KF(0.5)  #  0  7.57  4  8.74  
AMZN  WFC  
MSE  DF  Time  MSE  DF  Time  
ETLasso  8  0.15  11  0.16  
CV  9  0.43  11  0.64  
BIC  2  0.05  3  0.05  
KF(0.05)  #  0  8.15  #  0  8.43 
KF(0.1)  #  0  7.85  #  0  9.19 
KF(0.2)  #  0  8.74  #  0  8.02 
KF(0.3)  #  0  8.05  #  0  8.06 
KF(0.4)  #  0  7.95  6  7.76  
KF(0.5)  6  7.97  6  7.92  

4.3 Chinese Supermarket Data
In this section, the ETLasso method is applied to a Chinese supermarket dataset in Wang (2009), which records the number of customers and the sales volumes of products on each day from 2004 to 2005. The response is the number of customers and the features include the sales volumes of 6398 products. It is believed that only a small proportion of products have significant effects on the number of customers. The response and the features are standardized for confidentiality reasons. The training data include the first 300 days and the testing data contain the last 100 days. The mean squared error (MSE) and the number of selected features (DF) of the ETLasso method, cross-validation (CV), BIC and Knockoff (KF) are reported in Table 4.

Table 4: Prediction results on the Chinese supermarket data.
Method  MSE  DF  Time (s) 
ETLasso  0.1046  68  1.40 
CV  0.1410  111  5.80 
BIC  0.3268  100  0.517 
KF(0.05)  #  0  1354.574 
KF(0.1)  #  0  1449.355 
KF(0.2)  0.4005  5  1423.386 
KF(0.3)  0.1465  11  1358.877 
KF(0.4)  0.1868  15  1440.143 
KF(0.5)  #  0  1379.757 
ETLasso performs best with respect to prediction accuracy. It returns the smallest prediction MSE (0.1046) and a simpler model (68 features) than CV and BIC, whose Lasso fits return larger MSEs and models with more features. For the Knockoff method, when the FDR is controlled at 0.05 or 0.1, or as large as 0.5, it selects no features. Knockoff with an FDR of 0.2 selects only 5 features, but its prediction MSE is relatively large, which indicates underfitting. KF-based methods take more than 1,350 seconds to run, which is slow compared to ETLasso (1.4 s), Lasso+CV (5.8 s) and Lasso+BIC (0.517 s).
5 Conclusion
In this paper, we have proposed ETLasso, which is able to select the ideal tuning parameter by involving pseudo-features. The novelties of ETLasso are twofold. First, ETLasso is statistically efficient and powerful in the sense that it can select all active features with the smallest model, i.e., the one containing the fewest irrelevant features (highest precision), compared to other feature selection methods. Second, ETLasso is computationally scalable, which is essential for high-dimensional data analysis. ETLasso is efficient for tuning parameter selection of regularization methods and requires no calculation of the prediction error or the posterior model probability. Moreover, ETLasso stops once the cutoff is found, so there is no need to traverse all potential tuning parameters as cross-validation and BIC do. On the other hand, Knockoff turns out to be very computationally intensive for high dimensional data. Numerical studies have illustrated the superior performance of ETLasso over the existing methods under different situations.
References
 Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19(6):716–723.
 Bach, F. R. (2008). Bolasso: model consistent lasso estimation through the bootstrap. In Proceedings of the 25th international conference on Machine learning, pages 33–40. ACM.
 Barber, R. F. and Candes, E. J. (2015). Controlling the false discovery rate via knockoffs. Ann. Statist., 43(5):2055–2085.
 Beck, A. and Teboulle, M. (2009). Fast gradient-based algorithms for constrained total variation image denoising and deblurring problems. IEEE Transactions on Image Processing, 18(11):2419–2434.
 Boyd, S., Parikh, N., Chu, E., Peleato, B., Eckstein, J., et al. (2011). Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends® in Machine Learning, 3(1):1–122.
 Candes, E., Fan, Y., Janson, L., and Lv, J. (2018). Panning for gold: ‘model-X’ knockoffs for high dimensional controlled variable selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 80(3):551–577.
 Donoho, D. L. (2000). High-dimensional data analysis: The curses and blessings of dimensionality. AMS Math Challenges Lecture, 1:32.
 Fan, J. and Li, R. (2006). Statistical challenges with high dimensionality: Feature selection in knowledge discovery. Proceedings of the International Congress of Mathematicians, 3:595–622.
 Fan, J. and Lv, J. (2008). Sure independence screening for ultrahigh dimensional feature space. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(5):849–911.
 Friedman, J., Hastie, T., Höfling, H., and Tibshirani, R. (2007). Pathwise coordinate optimization. The Annals of Applied Statistics, 1(2):302–332.
 Fuchs, J. J. (2005). Recovery of exact sparse representations in the presence of bounded noise. IEEE Transactions on Information Theory, 51(10):3601–3608.
 Luo, X., Stefanski, L. A., and Boos, D. D. (2006). Tuning variable selection procedures by adding noise. Technometrics, 48(2):165–175.
 Meinshausen, N., Yu, B., et al. (2009). Lasso-type recovery of sparse representations for high-dimensional data. The Annals of Statistics, 37(1):246–270.
 Nesterov, Y. (2013). Gradient methods for minimizing composite functions. Mathematical Programming, 140(1):125–161.
 Reeves, G. and Gastpar, M. C. (2013). Approximate sparsity pattern recovery: Information-theoretic lower bounds. IEEE Transactions on Information Theory, 59(6):3451–3465.
 Schwarz, G. (1978). Estimating the dimension of a model. Ann. Statist., 6(2):461–464.
 Shalev-Shwartz, S. and Tewari, A. (2011). Stochastic methods for regularized loss minimization. Journal of Machine Learning Research, 12(Jun):1865–1892.
 Stone, M. (1974). Cross-validation and multinomial prediction. Biometrika, 61(3):509–515.
 Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pages 267–288.
 Tropp, J. A. (2006). Just relax: convex programming methods for identifying sparse signals in noise. IEEE Transactions on Information Theory, 52(3):1030–1051.
 Wainwright, M. J. (2009). Sharp thresholds for high-dimensional and noisy sparsity recovery using constrained quadratic programming (lasso). IEEE Transactions on Information Theory, 55(5):2183–2202.
 Wang, H. (2009). Forward regression for ultra-high dimensional variable screening. Journal of the American Statistical Association, 104(488):1512–1524.
 Wang, H., Li, B., and Leng, C. (2009). Shrinkage tuning parameter selection with a diverging number of parameters. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 71(3):671–683.
 Wang, H., Li, R., and Tsai, C.-L. (2007). Tuning parameter selectors for the smoothly clipped absolute deviation method. Biometrika, 94(3):553–568.
 Wu, Y., Boos, D. D., and Stefanski, L. A. (2007). Controlling variable selection by the addition of pseudovariables. Journal of the American Statistical Association, 102(477):235–243.
 Zhao, P. and Yu, B. (2006). On model selection consistency of Lasso. Journal of Machine Learning Research, 7(Nov):2541–2563.
 Zhou, S. (2009). Thresholding procedures for high dimensional variable selection and statistical estimation. In Advances in Neural Information Processing Systems, pages 2304–2312.