No penalty no tears: Least squares in high-dimensional linear models
Ordinary least squares (OLS) is the default method for fitting linear models, but is not applicable for problems with dimensionality larger than the sample size. For these problems, we advocate the use of a generalized version of OLS motivated by ridge regression, and propose two novel three-step algorithms involving least squares fitting and hard thresholding. The algorithms are methodologically simple to understand intuitively, computationally easy to implement efficiently, and theoretically appealing for choosing models consistently. Numerical exercises comparing our methods with penalization-based approaches in simulations and data analyses illustrate the great potential of the proposed algorithms.
Long known for its consistency, simplicity and optimality under mild conditions, ordinary least squares (OLS) is the most widely used technique for fitting linear models. Developed originally for fitting fixed dimensional linear models, unfortunately, classical OLS fails in high dimensional linear models where the number of predictors far exceeds the number of observations . To deal with this problem, Tibshirani (1996) proposed -penalized regression, a.k.a, lasso, which triggered the recent overwhelming exploration in both theory and methodology of penalization-based methods. These methods usually assume that only a small number of coefficients are nonzero (known as the sparsity assumption), and minimize the same least squares loss function as OLS by including an additional penalty on the coefficients, with the typical choice being the norm. Such “penalization” constrains the solution space to certain directions favoring sparsity of the solution, and thus overcomes the non-unique issue with OLS. It yields a sparse solution and achieves model selection consistency and estimation consistency under certain conditions. See Zhao and Yu (2006); Fan and Li (2001); Zhang (2010); Zou and Hastie (2005).
Despite the success of the methods based on regularization, there are important issues that can not be easily neglected. On the one hand, methods using convex penalties, such as lasso, usually require strong conditions for model selection consistency (Zhao and Yu, 2006; Lounici, 2008). On the other hand, methods using non-convex penalties (Fan and Li, 2001; Zhang, 2010) that can achieve model selection consistency under mild conditions often require huge computational expense. These concerns have limited the practical use of regularized methods, motivating alternative strategies such as direct hard thresholding (Jain et al., 2014).
In this article, we aim to solve the problem of fitting high-dimensional sparse linear models by reconsidering OLS and answering the following simple question: Can ordinary least squares consistently fit these models with some suitable algorithms? Our result provides an affirmative answer to this question under fairly general settings. In particular, we give a generalized form of OLS in high dimensional linear regression, and develop two algorithms that can consistently estimate the coefficients and recover the support. These algorithms involve least squares type of fitting and hard thresholding, and are non-iterative in nature. Extensive empirical experiments are provided in Section 4 to compare the proposed estimators to many existing penalization methods. The performance of the new estimators is very competitive under various setups in terms of model selection, parameter estimation and computational time.
1.1 Related Works
The work that is most closely related to ours is Yang et al. (2014), in which the authors proposed an algorithm based on OLS and ridge regression. However, both their methodology and theory are still within the regularization framework, and their conditions (especially their C-Ridge and C-OLS conditions) are overly strong and can be easily violated in practice. Jain et al. (2014) proposed an iterative hard thresholding algorithm for sparse regression, which shares a similar spirit of hard thresholding as our algorithm. Nevertheless, their motivation is completely different, their algorithm lacks theoretical guarantees for consistent support recovery, and they require an iterative estimation procedure.
1.2 Our Contributions
We provide a generalized form of OLS for fitting high dimensional data motivated by ridge regression, and develop two algorithms that can consistently fit linear models on weakly sparse coefficients. We summarize the advantages of our new algorithms in three points.
Our algorithms can achieve consistent identify strong signals for ultra-high dimensional data () with only a bounded variance assumption on the noise , i.e., . This is remarkable as most methods (c.f. Zhang (2010); Yang et al. (2014); Cai and Wang (2011); Wainwright (2009); Zhang and Huang (2008); Wang and Leng (2015)) that work for case rely on a sub-Gaussian tail/bounded error assumption, which might fail to hold for real data. Lounici (2008) proved that lasso also achieves consistent model selection with a second-order condition similar to ours, but requires two additional assumptions.
The algorithms are simple, efficient and scale well for large . In particular, the matrix operations are fully parallelizable with very few communications for very large , while regularization methods are either hard to be computed in parallel in the feature space, or the parallelization requires a large amount of machine communications.
The remainder of this article is organized as follows. In Section 2 we generalize the ordinary least squares estimator for high dimensional problems where , and propose two three-step algorithms consisting only of least squares fitting and hard thresholding in a loose sense. Section 3 provides consistency theory for the algorithms. Section 4 evaluates the empirical performance. We conclude and discuss further implications of our algorithms in the last section. All the proofs are provided in the supplementary materials.
2 High dimensional ordinary least squares
Consider the usual linear model
where is the design matrix, is the response vector and is the coefficient. In the high dimensional literature, ’s are routinely assumed to be zero except for a small subset . In this paper, we consider a slightly more general setting, where is not exactly sparse, but consists of both strong and weak signals. In particular, we defined and
as the strong and weak signal sets and . The algorithms developed in this paper is to recover the strong signal set . The specific relationship between and will be detailed later.
To carefully tailor the low-dimensional OLS estimator in a high dimensional scenario, one needs to answer the following two questions: i) What is the correct form of OLS in the high dimensional setting? ii) How to correctly use this estimator? To answer these, we reconsider OLS from a different perspective by viewing the OLS as the limit of the ridge estimator with the ridge parameter going to zero, i.e.,
One nice property of the ridge estimator is that it exists regardless of the relationship between and . A keen observation reveals the following relationship immediately.
For any , we have
Notice that the right hand side of (1) exists when and . Consequently, we can naturally extend the classical OLS to the high dimensional scenario by letting tend to zero in (1). Denote this high dimensional version of the OLS as
The above equation indicates that is essentially an orthogonal projection of onto the row space of . Unfortunately, this (low dimensional) projection does not have good general performance in estimating sparse vectors in high-dimensional cases. Instead of directly estimating as , however, this new estimator of may be used for dimension reduction by observing . Since is stochastically small, if is close to a diagonally dominant matrix and is sparse, then the strong and weak signals can be separated by simply thresholding the small entries of . The exact meaning of this statement will be discussed in the next section. Some simple examples demonstrating the diagonal dominance of are illustrated immediately in Figure 1, where the rows of in the left two plots are drawn from with or . The sample size and data dimension are chosen as . The right plot takes the standardized design matrix directly from the real data in Section 4 with . A clear diagonal dominance pattern is visible in each plot.
This ability to separate strong and weak signals allows us to first obtain a smaller model with size such that containing . Since is below , one can directly apply the usual OLS to obtain an estimator, which will be thresholded further to obtain a more refined model. The final estimator will then be obtained by an OLS fit on the refined model. This three-stage non-iterative algorithm is termed Least-squares adaptive thresholding (LAT) and the concrete procedure is described in Algorithm 1.
The input parameter is the submodel size selected in Stage 1 and is the tuning parameter determining the threshold in Stage 2; these values will be specified in Section 4. In Stage 1, we use instead of because is rank deficient (the rank is ) after standardization. The number can be replaced by any arbitrary small number. As noted in Wang and Leng (2015), this additional ridge term is also essential when and get closer, in which case the condition number of increases dramatically, resulting in an explosion of the model noise. Our results in Section 3 mainly focus on where is assumed to be drawn from a distribution with mean zero, so no standardization or ridge adjustment is required. However, the result is easy to generalize to the case where a ridge term is included. See Wang and Leng (2015).
The Stage 1 of Algorithm 1 is very similar to variable screening methods (Fan and Lv, 2008; Wang and Leng, 2015). However, most screening methods require a sub-Gaussian condition the noise to handle the ultra-high dimensional data where . In contrast to the existing theory, we prove in the next section a better result that Stage 1 of Algorithm 1 can produce satisfactory submodel even for heavy-tailed noise.
The estimator in Stage 2 can be substituted by its ridge counterpart and by to stabilize numerical computation. Similar modification can be applied to the Stage 3 as well. The resulted variant of the algorithm is referred to as the Ridge Adaptive Thresholding (RAT) algorithm and described in Algorithm 2.
We suggest to use 10-fold cross-validation to tune the ridge parameter . Notice that the model is already small after stage 1, so using cross-validation will not significantly increase the computational burden. The computational performance is illustrated in Section 4.
In this section, we prove the consistency of Algorithm 1 in identifying strong signals and provide concrete forms for all the values needed for the algorithm to work. Recall the linear model . We consider the random design where the rows of are drawn from an elliptical distribution with a density of for some nonnegative function and positive definite . It is easy to show that admits an equivalent representation as
where is a p-variate standard Gaussian random variable and is a nonnegative random variable that is independent of . We denote this distribution by . This random design allows for various correlation structures among predictors and contains many distribution families that are widely used to illustrate methods that rely on the restricted eigenvalue conditions (Bickel et al., 2009; Raskutti et al., 2010). The noise , as mentioned earlier, is only assumed to have the second-order moment, i.e., , in contrast to the sub-Gaussian/bounded error assumption seen in most high dimension literature. See Zhang (2010); Yang et al. (2014); Cai and Wang (2011); Wainwright (2009); Zhang and Huang (2008). This relaxation is similar to Lounici (2008); however we do not require any further assumptions needed by Lounici (2008). In Algorithm 1, we also propose to use extended BIC and BIC for parameter tuning. However, the corresponding details will not be pursued here, as their consistency is straightforwardly implied by the results from this section and the existing literature on extended BIC and BIC (Chen and Chen, 2008).
As shown in (11), the variable controls the signal strength of , we thus need a lower bound on to guarantee a good signal strength. Define . We state our result in three theorems.
Assume with and is a random variable with a bounded variance . We also assume for some and . If , , and , then we can choose to be , where is some absolute constant specified in Lemma 2 and for any we have
Theorem 1 guarantees the model selection consistency of the first stage of Algorithm 1. It only requires a second-moment condition on the noise tail, relaxing the sub-Gaussian assumption seen in other literature. The probability term shows that the algorithm requires the important signals to be lower bounded by a signal strength of with a positive . In addition, a gap of is needed between the strong signals and the weak signals in order for a successful support recovery.
As is not easily computable based on data, we propose to rank the and select largest coefficients. Alternatively, we can construct a series of nested models formed by ranking the largest coefficients and adopt the extended BIC (Chen and Chen, 2008) to select the best submodel. Once the submodel is obtained, we proceed to the second stage by obtaining an estimate via ordinary least squares corresponding to . The theory for requires more stringent conditions, as we now need to estimate instead of just a correct ranking. In particular, we have to impose conditions on the magnitude of and the moments of , i.e., for we have the following result.
Assume the same conditions for and as in Theorem 1. We also assume and for some . If , , and there exists some such that , then for any , we have
i.e., if , then we can choose and
with probability tending to 1.
The moment condition on is not tight. We used this number just for simplicity. As shown in Theorem 2, the norm of is allowed to grow in the rate of , i.e., our algorithm works for weakly sparse coefficients. However, Theorem 2 imposes an upper bound on while Theorem 1 not. This is mainly due to the moment assumption on and the different structure between and , i.e., does not rely on for diminishing the unimportant signals. For ridge regression, we have the following result.
Theorem 3 (Ridge regression).
Assume all the conditions in Theorem 2. If we choose the ridge parameter satisfying
then we have
i.e., if , then we can choose and
with probability tending to 1.
Note that the ridge parameter can be chosen as a constant, bypassing the need to specify at least in theory. When both the noise and follows Gaussian distribution and , we can obtain a more explicit form of the threshold , as the following Corollary shows.
Corollary 1 (Gaussian noise).
Assume , and . For any , define , where is the estimated standard error as . For sufficiently large , if for some absolute constants , and , then with probability at least , we have
Write as in Algorithm 1. In practice, we propose to use as the threshold (see Algorithm 1), because the estimation error takes a form of . Alternatively, instead of identifying an explicit form of the threshold value (as is hard for general noise), one may also use BIC on nested models formed by ranking to search for the true model. Once the final model is obtained, as in Stage 3 of Algorithm 1, we refit it again using ordinary least squares. The final output will have the same output as if we knew the true model a priori with probability tending to 1, i.e., we have the following result.
In this section, we provide extensive numerical experiments for assessing the performance of LAT and RAT. In particular, we compare the two methods to existing penalized methods including lasso, elastic net (enet (Zou and Hastie, 2005)), adaptive lasso (Zou, 2006), scad (Fan and Li, 2001) and mc+ (Zhang, 2010). As it is well-known that the lasso estimator is biased, we also consider two variations of it by combining lasso with Stage 2 and 3 of our LAT and RAT algorithms, denoted as lasLAT (las1 in Figures) and lasRAT (las2 in Figures) respectively. We note that the lasLat algorithm is very similar to the thresholded lasso (Zhou, 2010) with an additional thresholding step. We code LAT and RAT and adaptive lasso in Matlab, use glmnet (Friedman et al., 2010) for enet and lasso, and SparseReg (Zhou et al., 2012; Zhou and Lange, 2013) for scad and mc+. Since adaptive lasso achieves a similar performance as lasLat on synthetic datasets, we only report its performance for the real data.
4.1 Synthetic Datasets
The model used in this section for comparison is the linear model , where and . To control the signal-to-noise ratio, we define , which is chosen to be for all experiments. The sample size and the data dimension are chosen to be or for all experiments. For evaluation purposes, we consider four different structures of below.
(i) Independent predictors. The support is set as . We generate from a standard multivariate normal distribution with independent components. The coefficients are specified as
(ii) Compound symmetry. All predictors are equally correlated with correlation . The coefficients are set to be for and otherwise.
(iii) Group structure. This example is Example 4 in Zou and Hastie (2005), for which we allocate the 15 true variables into three groups. Specifically, the predictors are generated as
where and are independent. The coefficients are set as
(iv) Factor models. This model is also considered in Meinshausen and Bühlmann (2010) and Cho and Fryzlewicz (2012). Let be independent standard normal variables. We set predictors as , where and are generated from independent standard normal distributions. The number of factors is chosen as in the simulation while the coefficients are specified the same as in Example (ii).
|Ex. (ii)||# FPs||0.480||0.480||1.500||0.350||0.350||10.820||2.470||2.400|
|Ex. (iii)||# FPs||0.000||0.000||0.060||0.000||0.000||0.120||0.120||0.090|
|Ex. (iv)||# FPs||0.920||0.920||21.710||0.260||0.260||37.210||6.360||6.270|
To compare the performance of all methods, we simulate synthetic datasets for and for for each example, and record i) the root mean squared error (RMSE): , ii) the false negatives (# FN), iii) the false positives (# FP) and iv) the actual runtime (in milliseconds). We use the extended BIC (Chen and Chen, 2008) to choose the parameters for any regularized algorithm. Due to the huge computation expense for scad and mc+, we only find the first predictors on the solution path (because we know ). For RAT and LAT, is set to . For RAT and larsRidge, we adopt a 10-fold cross-validation procedure to tune the ridge parameter for a better finite-sample performance, although the theory allows to be fixed as a constant. For all hard-thresholding steps, we fix . The results for are plotted in Figure 2, 3, 4 and 5 and a more comprehensive result (average values for RMSE, # FPs, # FNs, runtime) for is summarized in Table 1.
As can be seen from both the plots and the tables, LAT and RAT achieve the smallest RMSE for Example (ii), (iii) and (iv) and are on par with lasLAT for Example (i). For Example (iii), RAT and enet achieve the best performance while all the other methods fail to work. In addition, the runtime of LAT and RAT are also competitive compared to that of lasso and enet. We thus conclude that LAT and RAT achieve similar or even better performance compared to the usual regularized methods.
4.2 A Student Performance Dataset
We look at one dataset used for evaluating student achievement in Portuguese schools (Cortez and Silva, 2008). The data attributes include student grades and school related features that were collected by using school reports and questionnaires. The particular dataset used here provides the students’ performance in mathematics. The goal of the research is to predict the final grade based on all the attributes.
The original data set contains 395 students and 32 raw attributes. The raw attributes are recoded as 40 attributes and form 780 features after interaction terms are added. We then remove features that are constant for all students. This gives 767 features for us to work with. To compare the performance of all methods, we first randomly split the dataset into 10 parts. We use one of the 10 parts as a test set, fit all the methods on the other 9 parts, and then record their prediction error (root mean square error, RMSE), model size and runtime on the test set. We repeat this procedure until each of the 10 parts has been used for testing. The averaged prediction error, model size and runtime are summarized in Table 2. We also report the performance of the null model which predicts the final grade on the test set using the mean final grade in the training set.
|methods||mean error||Standard error||average model size||runtime (millisec)|
|methods||mean error||Standard error||average model size||runtime (millisec)|
It can be seen that RAT achieves the smallest cross-validation error, followed by scad and mc+. In the post-feature-selection analysis, we found that two features, the 1st and 2nd period grades of a student, were selected by all the methods. This result coincides with the common perception that these two grades usually have high impact on the final grade.
In addition, we may also be interested in what happens when no strong signals are presented. One way to do this is to remove all the features that are related to the 1st and 2nd grades before applying the aforementioned procedures. The new result without the strong signals removed are summarized in Table 3.
Table 3 shows a few interesting findings. First, under this artificial weak signal scenario, adaptive lasso achieves the smallest cross-validation error and RAT is the first runner-up. Second, in Stage 1, lasso seems to provide slightly more robust screening than OLS in that the selected features are less correlated. This might be the reason that LAT is outperformed by lasLAT. However, in both the strong and weak signal cases, RAT is consistently competitive in terms of performance.
We have proposed two novel algorithms Lat and Rat that only rely on least-squares type of fitting and hard thresholding, based on a high-dimensional generalization of OLS. The two methods are simple, easily implementable, and can consistently fit a high dimensional linear model and recover its support. The performance of the two methods are competitive compared to existing regularization methods. It is of great interest to further extend this framework to other models such as generalized linear models and models for survival analysis.
- Akritas et al. (2014) Akritas, M. G., Lahiri, S., and Politis, D. N. (2014). Topics in nonparametric statistics. In Proceedings of the First Conference of the International Society for Nonparametric Statistics. Springer.
- Bickel et al. (2009) Bickel, P. J., Ritov, Y., and Tsybakov, A. B. (2009). Simultaneous analysis of lasso and dantzig selector. The Annals of Statistics, 37(4):1705–1732.
- Cai and Wang (2011) Cai, T. T. and Wang, L. (2011). Orthogonal matching pursuit for sparse signal recovery with noise. IEEE Transactions on Information Theory, 57(7):4680–4688.
- Chen and Chen (2008) Chen, J. and Chen, Z. (2008). Extended bayesian information criteria for model selection with large model spaces. Biometrika, 95(3):759–771.
- Cho and Fryzlewicz (2012) Cho, H. and Fryzlewicz, P. (2012). High dimensional variable selection via tilting. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 74(3):593–622.
- Cortez and Silva (2008) Cortez, P. and Silva, A. M. G. (2008). Using data mining to predict secondary school student performance.
- Fan and Li (2001) Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456):1348–1360.
- Fan and Lv (2008) Fan, J. and Lv, J. (2008). Sure independence screening for ultrahigh dimensional feature space. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(5):849–911.
- Friedman et al. (2010) Friedman, J., Hastie, T., and Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1):1.
- Jain et al. (2014) Jain, P., Tewari, A., and Kar, P. (2014). On iterative hard thresholding methods for high-dimensional m-estimation. In Advances in Neural Information Processing Systems, pages 685–693.
- Lounici (2008) Lounici, K. (2008). Sup-norm convergence rate and sign concentration property of lasso and dantzig estimators. Electronic Journal of Statistics, 2:90–102.
- Meinshausen and Bühlmann (2010) Meinshausen, N. and Bühlmann, P. (2010). Stability selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(4):417–473.
- Raskutti et al. (2010) Raskutti, G., Wainwright, M. J., and Yu, B. (2010). Restricted eigenvalue properties for correlated gaussian designs. The Journal of Machine Learning Research, 11:2241–2259.
- Tibshirani (1996) Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Statistical Methodology), pages 267–288.
- Vershynin (2010) Vershynin, R. (2010). Introduction to the non-asymptotic analysis of random matrices. arXiv preprint arXiv:1011.3027.
- Wainwright (2009) Wainwright, M. J. (2009). Sharp thresholds for high-dimensional and noisy sparsity recovery using-constrained quadratic programming (lasso). IEEE Transactions on Information Theory, 55(5):2183–2202.
- Wang and Leng (2015) Wang, X. and Leng, C. (2015). High dimensional ordinary least squares projection for screening variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology).
- Wang et al. (2015) Wang, X., Leng, C., and Dunson, D. B. (2015). On the consistency theory of high dimensional variable screening. arXiv preprint arXiv:1502.06895.
- Yang et al. (2014) Yang, E., Lozano, A., and Ravikumar, P. (2014). Elementary estimators for high-dimensional linear regression. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 388–396.
- Zhang (2010) Zhang, C.-H. (2010). Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics, 38(2):894–942.
- Zhang and Huang (2008) Zhang, C.-H. and Huang, J. (2008). The sparsity and bias of the lasso selection in high-dimensional linear regression. The Annals of Statistics, 36(4):1567–1594.
- Zhao and Yu (2006) Zhao, P. and Yu, B. (2006). On model selection consistency of lasso. The Journal of Machine Learning Research, 7:2541–2563.
- Zhou et al. (2012) Zhou, H., Armagan, A., and Dunson, D. B. (2012). Path following and empirical bayes model selection for sparse regression. arXiv preprint arXiv:1201.3528.
- Zhou and Lange (2013) Zhou, H. and Lange, K. (2013). A path algorithm for constrained estimation. Journal of Computational and Graphical Statistics, 22(2):261–283.
- Zhou (2010) Zhou, S. (2010). Thresholded lasso for high dimensional variable selection and statistical estimation. arXiv preprint arXiv:1002.1583.
- Zou (2006) Zou, H. (2006). The adaptive lasso and its oracle properties. Journal of the American statistical association, 101(476):1418–1429.
- Zou and Hastie (2005) Zou, H. and Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2):301–320.
Appendix 0: Proof of Lemma 1
Applying the Sherman-Morrison-Woodbury formula
Multiplying on both sides, we get
The right hand side can be further simplified as
Therefore, we have
Appendix A: Proof of Theorem 1
Recall the estimator . The following three lemmas will be used to bound and respectively.
Let . Assume for some , then for any there exists some and such that for any and any ,
The proof can be found in the Lemma 4 and 5 in Wang and Leng (2015) for elliptical distributions. The special case of Gaussian is also proved in the Lemma 3 of Wang et al. (2015). Notice that the eigenvalue assumption in Wang and Leng (2015) is not used for proving Lemma 4 and 5.
Assume follows . If for some constant , and , then for any we have
where is defined as the minimum value for the important signals and .
To prove Lemma 3 we need the following two propositions.
As each row of can be represented as , where and is a matrix of independent Gaussian entries, i.e., . For , we have the following result.
Let , then we have the minimum eigenvalue of satisfies that
for any . Assume for and take . When , we have
The proof follows Corollary 5.35 in Vershynin (2010).
Proof of Lemma 3.
Let and . Then .
Part 1. Bounding .
Consider the standard SVD on as , where and are matrices and is a matrix. Because is a matrix of iid Gaussian variables, its distribution is invariant under both left and right orthogonal transformation. In particular, for any , we have
i.e., is uniformly distributed on conditional on and (they are in fact independent, but we don’t need such a strong condition). Therefore, we have
Because is uniformly distributed conditional on and , the distribution of is also invariant under right orthogonal transformation conditional on and , i.e., for any , we have
Our first goal is to bound the magnitude of individual entries . Let , which is a function of and (see below). From (7), we know that is uniformly distributed on the sphere if conditional on (i.e., conditional on ), which implies that
where are iid standard Gaussian variables. Thus, can be bounded easily if we can bound . Notice that for we have
Now applying the tail bound and the concentration inequality to (8) we have for any and any
Putting the pieces all together, we have for any and any that
Now according to (6), we can further bound and obtain that