
ET-Lasso: Efficient Tuning of Lasso for High-Dimensional Data

Abstract

The ℓ1 regularization (Lasso) has proven to be a versatile tool for selecting relevant features and estimating the model coefficients simultaneously. Despite its popularity, it is very challenging to guarantee the feature selection consistency of Lasso. One way to improve feature selection consistency is to select an ideal tuning parameter. Traditional tuning criteria, such as cross-validation and BIC, mainly focus on minimizing the estimated prediction error or maximizing the posterior model probability; they may either be time-consuming or fail to control the false discovery rate (FDR) when the number of features is extremely large. The other way is to introduce pseudo-features to learn the importance of the original ones. Recently, the Knockoff filter was proposed to control the FDR when performing feature selection. However, its performance is sensitive to the choice of the expected FDR threshold. Motivated by these ideas, we propose a new method that uses pseudo-features to obtain an ideal tuning parameter. In particular, we present the Efficient Tuning of Lasso (ET-Lasso), which separates active and inactive features by adding permuted features as pseudo-features in linear models. The pseudo-features are constructed to be inactive by nature, and they yield a cutoff on the tuning parameter that separates active and inactive features. Experimental studies on both simulations and real-world data applications show that ET-Lasso can effectively and efficiently select active features under a wide range of scenarios.

1 Introduction

High-dimensional data analysis is fundamental in many research areas, such as genome-wide association studies, finance, tumor classification and biomedical imaging (Donoho, 2000; Fan and Li, 2006). The principle of sparsity is frequently adopted and proves useful when analyzing high-dimensional data; it assumes that only a small proportion of the features ("active" features) contribute to the response. Following this general rule, penalized least squares methods have been developed in recent years to select the active features and estimate their regression coefficients simultaneously. Among existing penalized least squares methods, the least absolute shrinkage and selection operator (Lasso) (Tibshirani, 1996) is one of the most popular regularization methods; it performs both variable selection and regularization, which enhances the prediction accuracy and interpretability of the statistical model it produces. Since then, many efforts have been devoted to developing algorithms for sparse learning with the Lasso. Representative methods include, but are not limited to, Beck and Teboulle (2009), Wainwright (2009), Zhou (2009), Bach (2008), Reeves and Gastpar (2013), Nesterov (2013), Shalev-Shwartz and Tewari (2011), Boyd et al. (2011), and Friedman et al. (2007).

Tuning parameter selection plays a pivotal role in identifying the true active features in Lasso. For example, Zhao and Yu (2006) showed that there exists an Irrepresentable Condition under which the Lasso selection is consistent when the tuning parameter converges to 0 more slowly than a specified rate. Meinshausen et al. (2009) further established convergence in the ℓ2-norm under a relaxed irrepresentable condition with an appropriate choice of the tuning parameter. Such a tuning parameter can be characterized theoretically, but computing it is difficult in practice, especially for high-dimensional data. In the literature, cross-validation (Stone, 1974), AIC (Akaike, 1974) and BIC (Schwarz, 1978) have been widely used for selecting tuning parameters for Lasso. Wang et al. (2007) and Wang et al. (2009) demonstrated that the tuning parameters selected by a BIC-type criterion can identify the true model consistently under some regularity conditions, whereas AIC and cross-validation may not lead to a consistent selection. These criteria focus on minimizing the estimated prediction error or maximizing the posterior model probability, which can be computationally intensive for large-scale datasets.

Recently, Barber and Candes (2015) proposed a novel feature selection method, "Knockoff", that is able to control the false discovery rate when performing variable selection. This method operates by first constructing Knockoff variables (pseudo copies of the original variables) that mimic the correlation structure of the original variables, and then selecting features that are identified as much more important than their Knockoff copies, according to some measure of feature importance. However, Knockoff requires the number of features to be less than the sample size, so it cannot be applied to high-dimensional settings where the number of features is much larger than the number of samples. To fix this, Candes et al. (2018) further proposed Model-X Knockoffs, which provide valid FDR-controlled variable selection inference when p exceeds n. However, this method is sensitive to the choice of the expected FDR level, and it does not produce a consistent estimator of the model coefficients. Moreover, as will be seen from the simulation studies presented in Section 4.1, the construction complexity of the Knockoff matrix is sensitive to the covariance structure, and the construction is also very time-consuming when p is large.

Motivated by both the literature on tuning parameter selection and pseudo-variable-based feature selection, we propose the Efficient Tuning of Lasso (ET-Lasso), which selects the ideal tuning parameter by using pseudo-features and accommodates high-dimensional settings where p is allowed to grow exponentially with n. The idea comes from the fact that active features tend to enter the model ahead of inactive ones on the solution path of Lasso. We validate this fact theoretically under some regularity conditions, which yields selection consistency and guarantees a clear separation between active and inactive features. We further propose a cutoff level to separate the active and inactive features by adding permuted features as pseudo-features, which are constructed to be inactive and can help rule out tuning parameters that identify them as active. The idea of adding pseudo-features is inspired by Luo, Stefanski and Boos (2006) and Wu, Boos and Stefanski (2007), who proposed adding random features in forward selection problems. In our method, the permuted features are generated by making a copy of X and then permuting its rows. In this way, the permuted features have the same marginal distribution as the original ones, and are not correlated with X and y. Unlike the Knockoff method, which selects features that are more important than their Knockoff copies, ET-Lasso identifies original features that are more important than all the permuted features. We show that the proposed method selects all the active features and simultaneously filters out all the inactive features with overwhelming probability as n goes to infinity, while p is allowed to grow with n at an exponential rate. The experiments in Section 4 show that ET-Lasso outperforms existing methods under different scenarios.

The rest of this paper is organized as follows. In Section 2, we introduce the motivation and the model framework of ET-Lasso. In Section 3, we establish its theoretical properties. In Section 4, we illustrate the efficiency and usefulness of the new method through both simulation studies and applications to a number of real-world datasets. The paper concludes with a brief discussion in Section 5.

To facilitate the presentation of our work, we use S ⊆ {1, …, p} to denote an arbitrary subset of features, which amounts to a submodel with covariates X_S = {x_j, j ∈ S} and associated coefficients β_S. S^c is the complement of S. We use ‖β‖_0 to denote the number of nonzero components of a vector β and |S| to represent the cardinality of a set S. We denote the true model by S_* = {j : β_j ≠ 0}, with q = |S_*|.

2 Motivation and Model Framework

2.1 Motivation

Consider the problem of estimating the coefficient vector β from the linear model

y = X\beta + \varepsilon, \qquad (2.1)

where y = (y_1, …, y_n)^⊤ is the response, X = (x_1, …, x_p) is an n × p random design matrix whose rows are independent and identically distributed (IID) p-vectors, and the columns x_1, …, x_p correspond to the p features. β = (β_1, …, β_p)^⊤ is the coefficient vector and ε is an n-vector of IID random errors following a sub-Gaussian distribution with mean 0 and finite variance. For high-dimensional data where p ≫ n, we often assume that only a handful of features contribute to the response, i.e., q = ‖β‖_0 ≪ p.

We consider the Lasso model, which estimates β under the sparsity assumption. The Lasso estimator is given by

\hat{\beta}(\lambda) = \arg\min_{\beta \in \mathbb{R}^p} \Big\{ \tfrac{1}{2} \| y - X\beta \|_2^2 + \lambda \|\beta\|_1 \Big\}, \qquad (2.2)

where λ ≥ 0 is a regularization parameter that controls the model sparsity. Consider the point λ_j on the solution path of (2.2) at which feature j first enters the model,

\lambda_j = \sup\{ \lambda : \hat{\beta}_j(\lambda) \neq 0 \}, \qquad j = 1, \ldots, p, \qquad (2.3)

which is likely to be large for most active features and small for most inactive features. Note that λ_j accounts for the joint effects among features and thus can be treated as a joint utility measure for ranking the importance of features. For orthogonal designs, the closed-form solution of (2.2) (Tibshirani, 1996) directly shows that

\lambda_j = | x_j^\top y |, \qquad j = 1, \ldots, p. \qquad (2.4)

In Section 3, under more general conditions, we will show that

P\Big( \min_{j \in S_*} \lambda_j > \max_{j \notin S_*} \lambda_j \Big) \to 1 \quad \text{as } n \to \infty. \qquad (2.5)

Property (2.5) implies a clear separation between active and inactive features, so the next step is to find a practical way to estimate max_{j∉S_*} λ_j in order to identify the active features, i.e., to obtain an ideal cutoff that separates the active from the inactive features.
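In practice, the entry points λ_j can be read off a numerically computed solution path. Below is a minimal sketch, not the authors' implementation, that approximates λ_j on a finite grid using scikit-learn's lasso_path (the helper name entry_lambdas is ours).

```python
import numpy as np
from sklearn.linear_model import lasso_path

def entry_lambdas(X, y, n_alphas=100):
    """For every column of X, return the largest penalty value on the grid
    at which its Lasso coefficient is nonzero (0 if it never enters)."""
    alphas, coefs, _ = lasso_path(X, y, n_alphas=n_alphas)  # alphas come in decreasing order
    entry = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        active = np.flatnonzero(coefs[j] != 0)
        if active.size:
            entry[j] = alphas[active[0]]  # first (largest) alpha at which feature j is active
    return entry
```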

2.2 Model Framework

Motivated by property (2.5), we calculate the cutoff that separates the active and inactive features by adding pseudo-features. Since the pseudo-features are known to be inactive, we can rule out tuning parameters that identify them as active. The permuted features X̃, obtained by applying a permutation π of the row indices {1, …, n} to the rows of X, are used as the pseudo-features. In particular, the matrix X̃ satisfies

\tilde{X}^\top \tilde{X} = X^\top X, \qquad \operatorname{cor}(\tilde{x}_j, y) \approx 0, \quad j = 1, \ldots, p. \qquad (2.6)

That is, the permuted features possess the same correlation structure as the original features, while their association with y is broken by the permutation. Suppose that the features are centered; then the augmented design matrix (X, X̃) satisfies

\frac{1}{n} (X, \tilde{X})^\top (X, \tilde{X}) \approx \begin{pmatrix} \Sigma & 0 \\ 0 & \Sigma \end{pmatrix}, \qquad (2.7)

where Σ is the correlation structure of X, and the approximately-zero off-diagonal blocks arise from the fact that X^⊤X̃/n ≈ 0 when the features are centered.
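As a concrete illustration, a permuted pseudo-feature matrix can be produced with a single row shuffle; the sketch below (the helper name permuted_copy is ours) preserves the column-wise Gram matrix X^⊤X exactly while breaking the association with y.

```python
import numpy as np

rng = np.random.default_rng(0)

def permuted_copy(X):
    """Row-permuted copy of X: same marginal distributions and same X'X,
    but (approximately) uncorrelated with the response."""
    perm = rng.permutation(X.shape[0])  # shuffle the sample indices
    return X[perm, :]
```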

Now we define the augmented design matrix X^A = (X, X̃), where X is the original design matrix and X̃ is the permuted design matrix. The augmented linear model with X^A as design matrix is

y = X^A \beta^A + \varepsilon, \qquad (2.8)

where β^A is a 2p-vector of coefficients and ε is the error term. The corresponding Lasso regression problem is

\hat{\beta}^A(\lambda) = \arg\min_{\beta \in \mathbb{R}^{2p}} \Big\{ \tfrac{1}{2} \| y - X^A \beta \|_2^2 + \lambda \|\beta\|_1 \Big\}. \qquad (2.9)

Analogously to λ_j in (2.3), for each permuted feature x̃_j we define λ_{p+j} by

\lambda_{p+j} = \sup\{ \lambda : \hat{\beta}^A_{p+j}(\lambda) \neq 0 \}, \qquad j = 1, \ldots, p, \qquad (2.10)

which is the largest tuning parameter at which x̃_j enters the model (2.8). Since the permuted features are truly inactive by construction, Theorem 1 in Section 3 implies that, with probability tending to one, every active feature enters the model before all of the permuted features. The quantity max_{1≤k≤p} λ_{p+k} can therefore be regarded as a benchmark that separates the important features from the inactive ones. This leads to the thresholding selection rule

\hat{S} = \big\{ 1 \le j \le p : \lambda_j > \max_{1 \le k \le p} \lambda_{p+k} \big\}. \qquad (2.11)

We implement a two-stage algorithm in order to reduce the false selection rate. We first generate two permuted feature samples X̃^(1) and X̃^(2). In the first stage, we select a candidate set Ŝ_1 based on rule (2.11) using the augmented design (X, X̃^(1)). Then, in the second stage, we combine X_{Ŝ_1} and X̃^(2) and apply rule (2.11) again to obtain the final feature set Ŝ. The procedure of ET-Lasso is summarized in Algorithm 1.

  • Generate two different permuted predictor samples X̃^(1) and X̃^(2), and combine X̃^(1) with X to obtain the augmented design matrix X^A_1 = (X, X̃^(1)).

  • For the design matrix X^A_1, we solve the problem

    \hat{\beta}^A(\lambda) = \arg\min_{\beta \in \mathbb{R}^{2p}} \Big\{ \tfrac{1}{2} \| y - X^A_1 \beta \|_2^2 + \lambda \|\beta\|_1 \Big\} \qquad (2.12)

    over the grid λ_max = λ_1 > λ_2 > ⋯ > λ_K = 0, where λ_max is the smallest tuning parameter value at which none of the features is selected. The largest grid value at which a permuted feature enters the model is the cutoff point; in other words, it can be regarded as an estimator of max_{j∉S_*} λ_j. We then use selection rule (2.11) to obtain Ŝ_1.

  • Combine X̃^(2) with X_{Ŝ_1}, which only includes the features in Ŝ_1, to obtain the augmented design matrix X^A_2 = (X_{Ŝ_1}, X̃^(2)). Repeat Step 2 for the new design matrix over the same grid to select the final set Ŝ.

Algorithm 1 ET-Lasso

Remark. For the tuning-parameter path of ET-Lasso, we first start with λ_max, at which no feature would be selected, then set a grid of points equally spaced from λ_max down to a small minimum value, and finally add 0 to the path. The ET-Lasso procedure stops at the first grid point at which a pseudo-feature is selected.
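The following is a minimal sketch of Algorithm 1, reusing entry_lambdas and permuted_copy from the sketches above. It is not the authors' released code; in particular, the grid size and the choice to pair the second permuted copy with the full original X are implementation assumptions made here.

```python
import numpy as np

def et_lasso(X, y, n_alphas=100):
    p = X.shape[1]
    # Stage 1: augment X with a permuted copy and keep every original feature whose
    # entry point on the Lasso path exceeds that of all pseudo-features (rule (2.11)).
    lam = entry_lambdas(np.hstack([X, permuted_copy(X)]), y, n_alphas)
    stage1 = np.flatnonzero(lam[:p] > lam[p:].max())

    # Stage 2: repeat with a second, independent permuted copy and only the features
    # surviving Stage 1, which further reduces false selections.
    X1 = X[:, stage1]
    lam2 = entry_lambdas(np.hstack([X1, permuted_copy(X)]), y, n_alphas)
    keep = np.flatnonzero(lam2[:X1.shape[1]] > lam2[X1.shape[1]:].max())
    return stage1[keep]
```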

2.3 Comparison with “Knockoff”

The Knockoff methods have been proposed to control the false discovery rate when performing variable selection (Barber and Candes, 2015; Candes et al., 2018). Specifically, the Knockoff features X̃_KF obey

\tilde{X}_{KF}^\top \tilde{X}_{KF} = X^\top X = \Sigma, \qquad X^\top \tilde{X}_{KF} = \Sigma - \operatorname{diag}(s), \qquad (2.13)

where Σ = X^⊤X and s is a p-dimensional nonnegative vector. Thus X̃_KF possesses the same covariance structure as X. The authors then set

W_j = \max( Z_j, \tilde{Z}_j ) \cdot \operatorname{sign}( Z_j - \tilde{Z}_j ) \qquad (2.14)

as the importance metric for feature j, where Z_j and Z̃_j denote the largest penalty values at which x_j and its Knockoff copy first enter the Lasso path of the augmented design (X, X̃_KF), together with a data-dependent threshold

T = \min\Big\{ t \in \mathcal{W} : \frac{\#\{ j : W_j \le -t \}}{\max\{ \#\{ j : W_j \ge t \}, 1 \}} \le \alpha \Big\}, \qquad (2.15)

where 𝒲 is the set of unique nonzero values of |W_1|, …, |W_p| and α is the expected FDR level. Knockoff then selects the feature set Ŝ_KF = {j : W_j ≥ T}, which has been shown to have FDR controlled at level α (Barber and Candes, 2015; Candes et al., 2018).
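For reference, the selection rule in (2.14)–(2.15) can be sketched as follows; here Z and Zt are assumed to hold the importance scores of the original features and of their Knockoff copies (for example, entry points on the Lasso path of the augmented design), and knockoff_select is our own helper, not part of the knockoff software.

```python
import numpy as np

def knockoff_select(Z, Zt, target_fdr=0.1):
    """Select features whose statistic W_j clears the data-dependent threshold."""
    W = np.maximum(Z, Zt) * np.sign(Z - Zt)             # statistic (2.14)
    for t in np.sort(np.unique(np.abs(W[W != 0]))):     # candidate thresholds
        fdp = np.sum(W <= -t) / max(np.sum(W >= t), 1)  # estimated false discovery proportion
        if fdp <= target_fdr:
            return np.flatnonzero(W >= t)
    return np.array([], dtype=int)                      # no threshold meets the target level
```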

The Knockoff method selects features that are clearly better than their Knockoff copies, while ET-Lasso selects the features that are more important than all the pseudo-features. Compared with Knockoff, our way of constructing the pseudo-features is much simpler than creating the Knockoff features; in particular, when the dimension of the data is extremely large, constructing the Knockoff features is very time-consuming. In addition, the Knockoff method is not able to provide a consistent estimator of the model coefficients. Finally, the feature selection performance of Knockoff is sensitive to the choice of the expected FDR level, as shown by our experiments, whereas our method has no such hyper-parameter that needs to be tuned carefully. A comprehensive numerical comparison between ET-Lasso and Knockoff is presented in Section 4.1.

3 Theoretical Properties

Essentially, property (2.5) is the key to the success of ET-Lasso in selecting the ideal regularization parameter. We now study (2.5) in a more general setting than orthogonal designs, and introduce the regularity conditions needed in this study.

  • (C1) (Mutual Incoherence Condition) There exists some γ ∈ (0, 1] such that

    \| X_{S_*^c}^\top X_{S_*} ( X_{S_*}^\top X_{S_*} )^{-1} \|_\infty \le 1 - \gamma,

    where \|A\|_\infty = \max_i \sum_j |a_{ij}| for any matrix A.

  • (C2) There exists some constant C_min > 0 such that

    \Lambda_{\min}\Big( \tfrac{1}{n} X_{S_*}^\top X_{S_*} \Big) \ge C_{\min},

    where \Lambda_{\min}(A) denotes the minimum eigenvalue of A.

  • (C3) log p = O(n^a) for some 0 ≤ a < 1; q/n → 0 as n → ∞; and the minimal signal min_{j∈S_*} |β_j| does not degenerate too fast as n grows.

Condition (C1) is the mutual incoherence condition, which has been considered in previous work on the Lasso (Wainwright, 2009; Fuchs, 2005; Tropp, 2006). It resembles a regularization constraint on the regression coefficients of the inactive features on the active features. Condition (C2) indicates that the design matrix consisting of the active features has full rank. Condition (C3) states the requirements for establishing the selection consistency of the proposed method: the first part assumes that p diverges with n up to an exponential rate, which allows the dimension of the data to be substantially larger than the sample size; the second part implies that the number of active features is allowed to grow with the sample size, but q/n → 0 as n → ∞; we also require that the minimal component of β over S_* does not degenerate too fast.

One of the main results of this paper is that, under conditions (C1)–(C3), property (2.5) holds in probability:

Theorem 1

Under conditions (C1)–(C3), assume that the design matrix X has its n-dimensional columns normalized such that ‖x_j‖₂² = n for j = 1, …, p. Then property (2.5) holds, that is, P( min_{j∈S_*} λ_j > max_{j∉S_*} λ_j ) → 1 as n → ∞.

Theorem 1 justifies using λ_j to rank the importance of features. In other words, λ_j ranks an active feature above an inactive one with high probability, and thus guarantees a clear separation between the active and inactive features. The proof is given in the supplementary material.

The following result gives an upper bound on the probability that the proposed method recruits any inactive feature, and implies that our method asymptotically excludes all the inactive features.

Theorem 2

Let k be a positive integer. Assume that the inactive features and the permuted features are equally likely to be selected by the ET-Lasso procedure; then we have

(3.1)

where C_min is as specified in condition (C2).

Remark. When the features are independent of each other, the inactive features and the permuted features are equally likely to be selected by the ET-Lasso procedure. In practice, the inactive features may be correlated with the active features, which makes them more likely to be selected ahead of the permuted features. We consider such a case in the simulation study and find that ET-Lasso still outperforms the other methods.

The proof is given in the supplementary material. Theorem 2 indicates that the number of false positives can be controlled better if there are more active features in the model, and our simulation results in Section 4 support this property.

4 Experiments

4.1 Simulation Study

In this section, we compare the finite-sample performance of ET-Lasso with Lasso+BIC (BIC), Lasso+Cross-validation (CV) and Knockoff (KF) under different settings. For the CV method, we use 5-fold cross-validation to select the tuning parameter λ. We consider three FDR thresholds for Knockoff, 0.05, 0.1 and 0.2, so as to assess how sensitive the performance of Knockoff is to the choice of the FDR threshold. The response is generated from the linear regression model (2.1) under the following settings:

  • the sample size n;

  • the number of predictors p;

  • the following three covariance structures of the features (Fan and Lv, 2008) are included to examine the effect of the covariance structure on the performance of the methods (a sketch of generating such designs follows this list):

    (i) Independent, i.e., Σ = I_p,

    (ii) AR(1) correlation structure: Σ_ij = ρ^{|i−j|},

    (iii) Compound symmetric (CS) correlation structure: Σ_ij = ρ if i ≠ j and Σ_ij = 1 otherwise;

  • the nonzero coefficients β_j, j ∈ S_*, are given random signs by independent Bernoulli(0.5) draws, and β_j = 0 for j ∉ S_*.
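As a minimal sketch of the simulation designs, the three covariance structures in (i)–(iii) can be generated as below; the values of n, p and ρ here are placeholders rather than the exact settings used in the paper.

```python
import numpy as np

def make_sigma(p, rho, kind):
    """Covariance matrix for the three settings: 'independent', 'ar1', 'cs'."""
    if kind == "independent":
        return np.eye(p)
    if kind == "ar1":
        idx = np.arange(p)
        return rho ** np.abs(idx[:, None] - idx[None, :])       # Sigma_ij = rho^|i-j|
    if kind == "cs":
        return np.full((p, p), rho) + (1.0 - rho) * np.eye(p)   # rho off-diagonal, 1 on diagonal
    raise ValueError(kind)

rng = np.random.default_rng(0)
X = rng.multivariate_normal(np.zeros(200), make_sigma(200, 0.5, "ar1"), size=100)
```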

The simulation results are based on 1000 replications, and the following criteria are used to evaluate the performance of each method (a short sketch of computing them follows this list):

  • Precision: the average precision (number of active features selected / number of features selected) over the simulations;

  • Recall: the average recall (number of active features selected / number of active features) over the simulations;

  • F1: the average F1-score (harmonic mean of precision and recall) over the simulations;

  • Time: the average running time, in seconds, of each method over the simulations.
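A minimal sketch of these criteria for a single replication (the helper name selection_metrics is ours):

```python
def selection_metrics(selected, active):
    """Precision, recall and F1 of a selected feature set against the true active set."""
    selected, active = set(selected), set(active)
    tp = len(selected & active)
    precision = tp / len(selected) if selected else float("nan")  # undefined if nothing selected
    recall = tp / len(active)
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1
```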

The simulation results are summarized in Tables 1 and 2. We observe that ET-Lasso has higher precision and F1-score than the other methods under all circumstances. In the independent setting, all methods except KF(0.05) successfully recover all active features, as suggested by the recall values. The average precision values of ET-Lasso are all above 0.95, while Lasso+BIC has precision values around 0.6–0.7, and Lasso+CV has precision values around 0.2. KF(0.05) barely selects any feature into the model due to its restrictive FDR control, resulting in very small recall values, and the number of selected features is zero in some of the replications. KF(0.1) and KF(0.2) successfully identify all active features, whereas their precision values and F1-scores are smaller than those of ET-Lasso. The results for the AR(1) covariance structure are similar to those of the independent setting. In the CS setting, the KF-based methods sometimes select zero features, and the corresponding precision and F1-scores cannot be computed; ET-Lasso again outperforms the others in terms of precision and F1-score. In addition, ET-Lasso enjoys favorable computational efficiency compared with Lasso+CV and Knockoff: ET-Lasso finishes in less than 0.5 s in all settings, while the Knockoff variants require significantly more computing time, and their computational costs increase rapidly as p increases. Finally, the performance of Knockoff relies on the choice of the expected FDR; when the correlations between features are strong, the Knockoff method needs higher FDR thresholds to select all the active variables.


Table 1: Simulation results of ET-Lasso, Lasso+BIC, Lasso+CV and Knockoff with different FDR thresholds in the independent and AR(1) covariance structure settings. Numbers in parentheses denote the corresponding standard deviations over the 1000 replicates. # indicates an invalid average precision or F1-score for methods that select zero features in some of the replications.
Independent AR(1)
Precision Recall F1 Time Precision Recall F1 Time
,
ET-Lasso 0.97 (0.06) 1.0 (0.0) 0.98 (0.03) 0.27 (0.06) 0.93 (0.08) 1.0 (0.0) 0.96 (0.04) 0.27 (0.06)
BIC 0.68 (0.14) 1.0 (0.0) 0.80 (0.10) 0.09 (0.02) 0.64 (0.13) 1.0 (0.0) 0.77 (0.10) 0.10 (0.03)
CV 0.20 (0.08) 1.0 (0.0) 0.33 (0.11) 1.01 (0.19) 0.20 (0.07) 1.0 (0.0) 0.32 (0.10) 1.01 (0.19)
KF(0.05) # 0.00 (0.03) # 348.6 (383.7) # 1.0 (0.0) # 427.8 (391.9)
KF(0.1) 0.92 (0.10) 1.0 (0.0) 0.96 (0.06) 356.1 (394.5) 0.91 (0.11) 1.0 (0.0) 0.95 (0.06) 436.9 (405.4)
KF(0.2) 0.83 (0.15) 1.0 (0.0) 0.90 (0.09) 352.5 (388.8) 0.82 (0.15) 1.0 (0.0) 0.89 (0.10) 432.3 (400.2)
,
ET-Lasso 0.97 (0.04) 1.0 (0.0) 0.99 (0.02) 0.26 (0.05) 0.94 (0.06) 1.0 (0.0) 0.97 (0.03) 0.27 (0.06)
BIC 0.63 (0.12) 1.0 (0.0) 0.77 (0.09) 0.09 (0.01) 0.59 (0.11) 1.0 (0.0) 0.74 (0.09) 0.09 (0.02)
CV 0.21 (0.06) 1.0 (0.0) 0.34 (0.08) 0.93 (0.12) 0.20 (0.05) 1.0 (0.0) 0.33 (0.07) 1.02 (0.18)
KF(0.05) # 0.45 (2.56) # 368.1 (410.2) # 0.04 (0.20) 0.28 (1.36) 453.9 (418.7)
KF(0.1) 0.93 (0.09) 1.0 (0.0) 0.96 (0.06) 362.5 (406.6) 0.92 (0.10) 1.0 (0.0) 0.95 (0.06) 460.4 (430.4)
KF(0.2) 0.82 (0.13) 1.0 (0.0) 0.89 (0.08) 351.7 (397.0) 0.80 (0.13) 1.0 (0.0) 0.88 (0.09) 455.9 (423.5)
,
ET-Lasso 0.97 (0.06) 1.0 (0.0) 0.98 (0.03) 0.47 (0.07) 0.94 (0.07) 1.0 (0.0) 0.97 (0.04) 0.47 (0.08)
BIC 0.65 (0.13) 1.0 (0.0) 0.78 (0.10) 0.17 (0.03) 0.63 (0.14) 1.0 (0.0) 0.76 (0.11) 0.16 (0.03)
CV 0.17 (0.07) 1.0 (0.0) 0.29 (0.10) 1.75 (0.22) 0.17 (0.06) 1.0 (0.0) 0.29 (0.09) 1.73 (0.28)
KF(0.05) # 0.002 (0.04) # 1252.8 (1355.3) # 0.0 (0.0) # 1694.6 (1538.7)
KF(0.1) 0.92 (0.10) 1.0 (0.0) 0.96 (0.06) 1221.9 (1321.4) 0.92 (0.10) 1.0 (0.0) 0.95 (0.06) 1660.8 (1505.1)
KF(0.2) 0.82 (0.16) 1.0 (0.0) 0.89 (0.10) 1200.6 (1304.6) 0.82 (0.15) 1.0 (0.0) 0.89 (0.10) 1612.4 (1451.8)
,
ET-Lasso 0.98 (0.04) 1.0 (0.0) 0.99 (0.02) 0.46 ( 0.08) 0.95 (0.06) 1.0 (0.0) 0.97 (0.03) 0.46 (0.09)
BIC 0.61 (0.12) 1.0 (0.0) 0.75 (0.09) 0.16 (0.03) 0.58 (0.11) 1.0 (0.0) 0.73 (0.09) 0.16 (0.03)
CV 0.17 (0.05) 1.0 (0.0) 0.29 (0.08) 1.71 (0.26) 0.17 (0.05) 1.0 (0.0) 0.29 (0.07) 1.72 (0.28)
KF(0.05) # 0.03 (0.16) # 1251.6 (1347.3) # 0.03 (0.18) # 1689.2 (1521.6)
KF(0.1) 0.92 (0.10) 1.0 (0.0) 0.96 (0.06) 1240.5 (1319.2) 0.93 (0.10) 1.0 (0.0) 0.96 (0.06) 1658.4 (1490.3)
KF(0.2) 0.82 (0.13) 1.0 (0.0) 0.89 (0.08) 1192.2 (1269.2) 0.82 (0.12) 1.0 (0.0) 0.89 (0.08) 1610.8 (1442.5)
CS
Precision Recall F1 Time
,
ET-Lasso 0.89 (0.17) 1.0 (0.0) 0.93 (0.12) 0.26 (0.05)
BIC 0.57 (0.16) 1.0 (0.0) 0.71 (0.13) 0.09 (0.01)
CV 0.20 (0.07) 1.0 (0.0) 0.32 (0.10) 0.94 (0.14)
KF(0.05) # 0.00 (0.03) # 53.22 (6.59)
KF(0.1) # 0.94 (0.23) # 51.01 (6.68)
KF(0.2) 0.83 (0.15) 0.99 (0.03) 0.89 (0.10) 50.6 (5.9)
,
ET-Lasso 0.92 (0.12) 1.0 (0.01) 0.95 (0.08) 0.26 (0.05)
BIC 0.55 (0.12) 1.0 (0.0) 0.70 (0.10) 0.09 (0.01)
CV 0.20 (0.06) 1.0 (0.0) 0.34 (0.08) 0.93 (0.12)
KF(0.05) # 0.02 (0.16) # 53.20(6.72)
KF(0.1) # 0.98 (0.08) # 51.12 (6.80)
KF(0.2) 0.82 (0.13) 0.99 (0.03) 0.89 (0.08) 50.67 (6.26)
,
ET-Lasso 0.86 (0.19) 1.0 (0.0) 0.91 (0.14) 0.46 (0.07)
BIC 0.53 (0.16) 1.0 (0.0) 0.68 (0.13) 0.16 (0.03)
CV 0.17 (0.06) 1.0 (0.0) 0.28 (0.09) 1.63 (0.21)
KF(0.05) # 0.03 (0.55) # 119.1 (14.97)
KF(0.1) # 0.79 (0.40) # 115.6 (14.68)
KF(0.2) # 0.97 (0.07) # 116.1 (14.99)
,
ET-Lasso 0.90 (0.15) 1.0 (0.02) 0.94 (0.10) 0.45 (0.07)
BIC 0.51 (0.13) 1.0 (0.0) 0.67 (0.11) 0.16 (0.03)
CV 0.17 (0.05) 1.0 (0.0) 0.29 (0.07) 1.63 (0.23)
KF(0.05) # 0.02 (0.14) # 119.8 (16.53)
KF(0.1) # 0.93 (0.19) # 116.2 (15.62)
KF(0.2) # 0.96 (0.07) # 115.9 (15.66)
Table 2: Simulation results of ET-Lasso, Lasso+BIC, Lasso+CV and Knockoff with different FDR thresholds in the CS covariance structure setting. Numbers in parentheses denote the corresponding standard deviations over the 1000 replicates.

4.2 Stock Price Prediction

In this part, we apply the ET-Lasso method to stock price prediction. We select the stocks of four large companies: GOOG, IBM, AMZN and WFC. We use the stock open prices from 2010-01-04 to 2013-12-30 to train the model, and then predict the open prices in the trading year 2014. All stock prices are normalized. Considering that the current open price of a stock might be affected by the open prices of the last 252 trading days, we apply the following regression model,

y_t = \sum_{k=1}^{252} \beta_k\, y_{t-k} + \varepsilon_t, \qquad (4.1)

and the regularized estimation problem can be written as

\hat{\beta}(\lambda) = \arg\min_{\beta} \Big\{ \tfrac{1}{2} \sum_{t} \Big( y_t - \sum_{k=1}^{252} \beta_k y_{t-k} \Big)^2 + \lambda \sum_{k=1}^{252} |\beta_k| \Big\}. \qquad (4.2)
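A minimal sketch of building the lag design for (4.1) is given below; make_lag_matrix is a hypothetical helper, and the resulting (X, y) pair can then be fed to ET-Lasso or to any of the compared tuning methods.

```python
import numpy as np

def make_lag_matrix(price, n_lags=252):
    """price: 1-D array of normalized daily open prices, in chronological order."""
    y = price[n_lags:]
    X = np.column_stack([price[n_lags - k : len(price) - k] for k in range(1, n_lags + 1)])
    return X, y  # X[t, k-1] holds the open price k days before y[t]
```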

We compare ET-Lasso with Lasso+CV, Lasso+BIC and Knockoff (KF). Since Knockoff does not estimate the regression coefficients directly, we implement a two-stage method for Knockoff: in the first stage we apply Knockoff for feature selection, and in the second stage we fit a linear regression model with the selected features and make predictions on the test data. Figure 1 depicts the price predicted by ET-Lasso and the true price of the four stocks; the black line shows the true open price and the red line the predicted value. ET-Lasso predicts the trend of the stock price changes very well. The mean squared error (MSE) and the median number of selected features (DF) are reported in Table 3. We observe that ET-Lasso outperforms Lasso+BIC and Lasso+CV in terms of both prediction error and model complexity; for instance, when predicting the stock price of WFC, the MSE of ET-Lasso is only a small fraction of those of Lasso+CV and Lasso+BIC. Knockoff methods with a controlled FDR smaller than 0.5 are over-conservative in feature selection, leading to an empty recovery set in most circumstances. KF(0.5) works well on IBM, AMZN and WFC, with MSE comparable to that of ET-Lasso; however, it selects zero features on the GOOG stock. In terms of computational cost, ET-Lasso is much faster than the KF-based methods and Lasso+CV; it is slower than Lasso+BIC, but achieves much better predictive performance.

GOOG IBM
MSE DF Time MSE DF Time
ET-Lasso 9 0.25 3 0.12
CV 9 0.57 3 0.29
BIC 2 0.06 2 0.02

KF(0.05) # 0 10.44 # 0 8.37
KF(0.1) # 0 8.26 # 0 8.48
KF(0.2) # 0 8.54 # 0 8.09
KF(0.3) # 0 8.24 # 0 7.71
KF(0.4) # 0 8.20 # 0 7.55
KF(0.5) # 0 7.57 4 8.74
AMZN WFC
MSE DF Time MSE DF Time
ET-Lasso 8 0.15 11 0.16
CV 9 0.43 11 0.64
BIC 2 0.05 3 0.05

KF(0.05) # 0 8.15 # 0 8.43
KF(0.1) # 0 7.85 # 0 9.19
KF(0.2) # 0 8.74 # 0 8.02
KF(0.3) # 0 8.05 # 0 8.06
KF(0.4) # 0 7.95 6 7.76
KF(0.5) 6 7.97 6 7.92

Table 3: Comparison of ET-Lasso, CV, BIC and Knockoff (KF) on stock price prediction (# indicates an invalid MSE when zero features are selected).
Figure 1: Comparison of stock price prediction and the true value. The black line represents the true open price and the red line is the predicted value.

4.3 Chinese Supermarket Data

In this section, the ET-Lasso method is applied to a Chinese supermarket dataset (Wang, 2009), which records the daily number of customers and the sale volumes of products from 2004 to 2005. The response is the number of customers, and the features are the sale volumes of 6398 products. It is believed that only a small proportion of products have significant effects on the number of customers. Due to confidentiality concerns, the response and the features are standardized. The training data includes the first 300 days and the testing data contains the last 100 days. The mean squared error (MSE), the number of selected features (DF) and the running time of ET-Lasso, cross-validation (CV), BIC and Knockoff (KF) are reported in Table 4.


MSE DF Time
ET-Lasso 0.1046 68 1.40
CV 0.1410 111 5.80
BIC 0.3268 100 0.517
KF(0.05) # 0 1354.574
KF(0.1) # 0 1449.355
KF(0.2) 0.4005 5 1423.386
KF(0.3) 0.1465 11 1358.877
KF(0.4) 0.1868 15 1440.143
KF(0.5) # 0 1379.757
Table 4: Comparison of ET-Lasso, CV, BIC and Knockoff (KF) on Chinese market data.

We can see that ET-Lasso performs best with respect to prediction accuracy: it returns the smallest prediction MSE (0.1046) and a simpler model (68 features) than CV and BIC. Cross-validation and BIC for Lasso return larger MSEs and models with more features. For the Knockoff method, when the FDR level is set to 0.05, 0.1 or 0.5, it fails to select any feature. Knockoff with FDR level 0.2 selects only 5 features, but its prediction MSE is relatively large (0.4005), which indicates under-fitting. The KF-based methods take more than 1300 seconds to run, which is slow compared to ET-Lasso (1.40 s), Lasso+CV (5.80 s) and Lasso+BIC (0.52 s).

Figure 2: Performance of the proposed method on Chinese market Data. The black line represents the true value and the red line is the prediction.

5 Conclusion

In this paper, we have proposed ET-Lasso, which selects the ideal tuning parameter by introducing pseudo-features. The novelty of ET-Lasso is two-fold. First, ET-Lasso is statistically efficient and powerful in the sense that it selects all active features with the smallest model, i.e., the model containing the fewest irrelevant features (highest precision), compared with other feature selection methods. Second, ET-Lasso is computationally scalable, which is essential for high-dimensional data analysis. ET-Lasso is efficient for tuning parameter selection of regularization methods and requires no computation of prediction errors or posterior model probabilities. Moreover, ET-Lasso stops once the cutoff is found, so there is no need to traverse all candidate tuning parameters as cross-validation and BIC do. In contrast, Knockoff turns out to be very computationally intensive for high-dimensional data. Numerical studies have illustrated the superior performance of ET-Lasso over existing methods under different situations.

References

  1. Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19(6):716–723.
  2. Bach, F. R. (2008). Bolasso: model consistent lasso estimation through the bootstrap. In Proceedings of the 25th international conference on Machine learning, pages 33–40. ACM.
  3. Barber, R. F. and Candes, E. J. (2015). Controlling the false discovery rate via knockoffs. Ann. Statist., 43(5):2055–2085.
  4. Beck, A. and Teboulle, M. (2009). Fast gradient-based algorithms for constrained total variation image denoising and deblurring problems. IEEE Transactions on Image Processing, 18(11):2419–2434.
  5. Boyd, S., Parikh, N., Chu, E., Peleato, B., Eckstein, J., et al. (2011). Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends® in Machine learning, 3(1):1–122.
  6. Candes, E., Fan, Y., Janson, L., and Lv, J. (2018). Panning for gold: 'Model-X' knockoffs for high dimensional controlled variable selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 80(3):551–577.
  7. Donoho, D. L. (2000). High-dimensional data analysis: The curses and blessings of dimensionality. AMS Math Challenges Lecture, 1:32.
  8. Fan, J. and Li, R. (2006). Statistical challenges with high dimensionality: Feature selection in knowledge discovery. Proceedings of the International Congress of Mathematicians, 3:595–622.
  9. Fan, J. and Lv, J. (2008). Sure independence screening for ultrahigh dimensional feature space. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(5):849–911.
  10. Friedman, J., Hastie, T., Höfling, H., and Tibshirani, R. (2007). Pathwise coordinate optimization. The Annals of Applied Statistics, 1(2):302–332.
  11. Fuchs, J. J. (2005). Recovery of exact sparse representations in the presence of bounded noise. IEEE Transactions on Information Theory, 51(10):3601–3608.
  12. Luo, X., Stefanski, L. A., and Boos, D. D. (2006). Tuning variable selection procedures by adding noise. Technometrics, 48(2):165–175.
  13. Meinshausen, N., Yu, B., et al. (2009). Lasso-type recovery of sparse representations for high-dimensional data. The Annals of Statistics, 37(1):246–270.
  14. Nesterov, Y. (2013). Gradient methods for minimizing composite functions. Mathematical Programming, 140(1):125–161.
  15. Reeves, G. and Gastpar, M. C. (2013). Approximate sparsity pattern recovery: Information-theoretic lower bounds. IEEE Transactions on Information Theory, 59(6):3451–3465.
  16. Schwarz, G. (1978). Estimating the dimension of a model. Ann. Statist., 6(2):461–464.
  17. Shalev-Shwartz, S. and Tewari, A. (2011). Stochastic methods for ℓ1-regularized loss minimization. Journal of Machine Learning Research, 12(Jun):1865–1892.
  18. Stone, M. (1974). Cross-validation and multinomial prediction. Biometrika, 61(3):509–515.
  19. Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pages 267–288.
  20. Tropp, J. A. (2006). Just relax: convex programming methods for identifying sparse signals in noise. IEEE Transactions on Information Theory, 52(3):1030–1051.
  21. Wainwright, M. J. (2009). Sharp thresholds for high-dimensional and noisy sparsity recovery using ℓ1-constrained quadratic programming (Lasso). IEEE Transactions on Information Theory, 55(5):2183–2202.
  22. Wang, H. (2009). Forward regression for ultra-high dimensional variable screening. Journal of the American Statistical Association, 104(488):1512–1524.
  23. Wang, H., Li, B., and Leng, C. (2009). Shrinkage tuning parameter selection with a diverging number of parameters. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 71(3):671–683.
  24. Wang, H., Li, R., and Tsai, C.-L. (2007). Tuning parameter selectors for the smoothly clipped absolute deviation method. Biometrika, 94(3):553–568.
  25. Wu, Y., Boos, D. D., and Stefanski, L. A. (2007). Controlling variable selection by the addition of pseudovariables. Journal of the American Statistical Association, 102(477):235–243.
  26. Zhao, P. and Yu, B. (2006). On model selection consistency of Lasso. Journal of Machine learning research, 7(Nov):2541–2563.
  27. Zhou, S. (2009). Thresholding procedures for high dimensional variable selection and statistical estimation. In Advances in Neural Information Processing Systems, pages 2304–2312.