# Sparse least trimmed squares regression for analyzing high-dimensional large data sets

###### Abstract

Sparse model estimation is a topic of high importance in modern data analysis due to the increasing availability of data sets with a large number of variables. Another common problem in applied statistics is the presence of outliers in the data. This paper combines robust regression and sparse model estimation. A robust and sparse estimator is introduced by adding an $L_1$ penalty on the coefficient estimates to the well-known least trimmed squares (LTS) estimator. The breakdown point of this sparse LTS estimator is derived, and a fast algorithm for its computation is proposed. In addition, the sparse LTS is applied to protein and gene expression data of the NCI-60 cancer cell panel. Both a simulation study and the real data application show that the sparse LTS has better prediction performance than its competitors in the presence of leverage points.

DOI: 10.1214/12-AOAS575. Volume 7, Issue 1 (2013), pages 226-248.


Andreas Alfons (andreas.alfons@econ.kuleuven.be), Christophe Croux (christophe.croux@econ.kuleuven.be) and Sarah Gelper (sgelper@rsm.nl)

Keywords: breakdown point, outliers, penalized regression, robust regression, trimming.

## 1 Introduction

In applied data analysis, there is an increasing availability of data sets containing a large number of variables. Linear models that include the full set of explanatory variables often have poor prediction performance as they tend to have large variance. Furthermore, large models are in general difficult to interpret. In many cases, the number of variables is even larger than the number of observations. Traditional methods such as least squares can then no longer be applied due to the rank deficiency of the design matrix. For instance, gene expression or fMRI studies typically contain tens of thousands of variables for only a small number of observations. In this paper, we present an application to the cancer cell panel of the National Cancer Institute, in which the data consist of $n = 59$ observations and a much larger number $p$ of predictors.

To improve prediction accuracy and as a remedy to computational problems with high-dimensional data, a penalty term on the regression coefficients can be added to the objective function. This approach shrinks the coefficients and reduces variance at the price of increased bias. Tibshirani (1996) introduced the least absolute shrinkage and selection operator (lasso), in which the penalty function is the $L_1$ norm. Let $y = (y_1, \ldots, y_n)'$ be the response and $X = (x_{ij})_{1 \le i \le n,\, 1 \le j \le p}$ the matrix of predictor variables, where $n$ denotes the number of observations and $p$ the number of variables. In addition, let $x_1, \ldots, x_n$ be the $p$-dimensional observations, that is, the rows of $X$. We assume a standard regression model

$$y_i = x_i' \beta + \varepsilon_i, \tag{1}$$

where the regression parameter is $\beta = (\beta_1, \ldots, \beta_p)'$, and the error terms $\varepsilon_i$ have zero expected value. With a penalty parameter $\lambda$, the lasso estimate of $\beta$ is

$$\hat{\beta}_{\mathrm{lasso}} = \operatorname*{argmin}_{\beta} \sum_{i=1}^{n} (y_i - x_i' \beta)^2 + n \lambda \sum_{j=1}^{p} |\beta_j|. \tag{2}$$

The lasso is frequently used in practice since the $L_1$ penalty allows some coefficients to be shrunk to exactly zero, that is, it produces sparse model estimates that are highly interpretable. In addition, a fast algorithm for computing the lasso is available through the framework of least angle regression [LARS; Efron et al. (2004)]. Other algorithms are available as well [e.g., Wu and Lange (2008)]. Due to the popularity of the lasso, its theoretical properties are well studied in the literature [e.g., Knight and Fu (2000), Zhao and Yu (2006), Zou, Hastie and Tibshirani (2007)] and several modifications have been proposed [e.g., Zou (2006), Yuan and Lin (2006), Gertheiss and Tutz (2010), Radchenko and James (2011), Wang et al. (2011)]. However, the lasso is not robust to outliers. In this paper we formally show that the breakdown point of the lasso is $1/n$, that is, a single outlier can make the lasso estimate completely unreliable. Therefore, robust alternatives are needed.
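The way the $L_1$ penalty produces exact zeros can be illustrated in the simplest case of a single predictor, where minimizing $\sum_i (y_i - x_i b)^2 + n\lambda |b|$ has a closed-form soft-thresholding solution. The following is a minimal sketch under that single-predictor assumption; the function names and the toy data are hypothetical and not taken from the paper.

```python
def soft_threshold(z: float, t: float) -> float:
    """Shrink z toward zero by t; values with |z| <= t become exactly zero."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_single_predictor(x: list[float], y: list[float], lam: float) -> float:
    """Minimize sum_i (y_i - x_i * b)^2 + n * lam * |b| over the scalar b."""
    n = len(x)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    # Setting the subgradient to zero gives b = S(sxy, n * lam / 2) / sxx.
    return soft_threshold(sxy, n * lam / 2.0) / sxx


# Toy data (hypothetical): a roughly linear relation with slope near 1.
x = [-1.0, 0.0, 1.0, 2.0]
y = [-0.9, 0.1, 1.2, 1.8]

b_small = lasso_single_predictor(x, y, lam=0.1)  # shrunken but nonzero
b_large = lasso_single_predictor(x, y, lam=5.0)  # shrunken to exactly zero
```

With a small penalty the coefficient is only shrunk slightly below the least squares slope, whereas a sufficiently large penalty sets it to exactly zero, which is the mechanism behind sparse model estimates.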

Outliers are observations that deviate from the model assumptions and are a common problem in the practice of data analysis. For example, for many of the predictors in the NCI data set used in Section 7, (log-transformed) responses on the cell lines showed outliers. Robust alternatives to the least squares regression estimator are well known and studied; see Maronna, Martin and Yohai (2006) for an overview. In this paper, we focus on the least trimmed squares (LTS) estimator introduced by Rousseeuw (1984). This estimator has a simple definition, is quite fast to compute, and is probably the most popular robust regression estimator. Denote the vector of squared residuals by $r^2(\beta) = (r_1^2, \ldots, r_n^2)'$ with $r_i^2 = (y_i - x_i' \beta)^2$, $i = 1, \ldots, n$. Then the LTS estimator is defined as

$$\hat{\beta}_{\mathrm{LTS}} = \operatorname*{argmin}_{\beta} \sum_{i=1}^{h} \bigl(r^2(\beta)\bigr)_{i:n}, \tag{3}$$

where $(r^2(\beta))_{1:n} \le \cdots \le (r^2(\beta))_{n:n}$ are the order statistics of the squared residuals and $h \le n$. Thus, LTS regression corresponds to finding the subset of $h$ observations whose least squares fit produces the smallest sum of squared residuals. The subset size $h$ can be seen as an initial guess of the number of good observations in the data. While the LTS is highly robust, it clearly does not produce sparse model estimates. Furthermore, if $h < p$, the LTS estimator cannot be computed. A sparse and regularized version of the LTS is obtained by adding an $L_1$ penalty with penalty parameter $\lambda$ to (3), leading to the sparse LTS estimator

$$\hat{\beta}_{\mathrm{sparseLTS}} = \operatorname*{argmin}_{\beta} \sum_{i=1}^{h} \bigl(r^2(\beta)\bigr)_{i:n} + h \lambda \sum_{j=1}^{p} |\beta_j|. \tag{4}$$
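To make the trimmed objective concrete, the sketch below evaluates the sparse LTS criterion for a given coefficient vector: the $h$ smallest squared residuals are summed and an $L_1$ penalty scaled by $h\lambda$ is added. The scaling of the penalty by $h$ is an assumption of this sketch, and the function name and toy data are hypothetical.

```python
def sparse_lts_objective(X, y, beta, h, lam):
    """Sum of the h smallest squared residuals plus h * lam * ||beta||_1."""
    sq_res = sorted(
        (yi - sum(xij * bj for xij, bj in zip(xi, beta))) ** 2
        for xi, yi in zip(X, y)
    )
    return sum(sq_res[:h]) + h * lam * sum(abs(bj) for bj in beta)


# Toy data (hypothetical): responses follow x_i' beta exactly.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0], [1.0, 2.0]]
beta = [1.0, 0.0]
y_clean = [1.0, 0.0, 1.0, 2.0, 1.0]
y_outlier = y_clean[:-1] + [100.0]  # replace the last response by an outlier

obj_clean = sparse_lts_objective(X, y_clean, beta, h=4, lam=0.1)
obj_outlier = sparse_lts_objective(X, y_outlier, beta, h=4, lam=0.1)
obj_untrimmed = sparse_lts_objective(X, y_outlier, beta, h=5, lam=0.1)
```

With $h = 4$ the outlying observation is trimmed away, so the objective at the true coefficients is the same for the clean and the contaminated data. With $h = n = 5$ no trimming takes place and the single outlier dominates the criterion.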

We prove in this paper that sparse LTS has a high breakdown point. It is resistant to multiple regression outliers, including leverage points. Besides being highly robust, and similar to the lasso estimate, sparse LTS (i) improves the prediction performance through variance reduction if the sample size is small relative to the dimension, (ii) ensures higher interpretability due to simultaneous model selection, and (iii) avoids computational problems of traditional robust regression methods in the case of high-dimensional data. For the NCI data, sparse LTS was less influenced by the outliers than competitor methods and showed better prediction performance, while the resulting model is small enough to be easily interpreted (see Section 7).

The sparse LTS (4) can also be interpreted as a trimmed version of the lasso, since the limit case $h = n$ yields the lasso solution. Other robust versions of the lasso have been considered in the literature. Most of them are penalized M-estimators, as in van de Geer (2008) and Li, Peng and Zhu (2011). Rosset and Zhu (2004) proposed a Huber-type loss function, which requires knowledge of the residual scale. A least absolute deviations (LAD) type of estimator called LAD-lasso is proposed by Wang, Li and Jiang (2007),

$$\hat{\beta}_{\mathrm{LAD\text{-}lasso}} = \operatorname*{argmin}_{\beta} \sum_{i=1}^{n} |y_i - x_i' \beta| + n \lambda \sum_{j=1}^{p} |\beta_j|. \tag{5}$$

However, none of these methods is robust with respect to leverage points, that is, outliers in the predictor space; they can handle outliers only in the response variable. The main competitor of the sparse LTS is robust least angle regression, called RLARS, proposed in Khan, Van Aelst and Zamar (2007). They develop a robust version of the LARS algorithm, essentially replacing correlations by a robust type of correlation, to sequence and select the most important predictor variables. Then a nonsparse robust regression estimator is applied to the selected predictor variables. RLARS, as will be confirmed by our simulation study, is robust with respect to leverage points. A main drawback of the RLARS algorithm of Khan, Van Aelst and Zamar (2007) is the lack of a natural definition, since it does not optimize a clearly defined objective function.

An entirely different approach is taken by She and Owen (2011), who propose an iterative procedure for outlier detection. Their method is based on imposing a sparsity criterion on the estimator of the mean-shift parameter $\gamma$ in the extended regression model

$$y_i = x_i' \beta + \gamma_i + \varepsilon_i. \tag{6}$$

They stress that this method requires a nonconvex sparsity criterion. An extension of the method to high-dimensional data is obtained by also assuming sparsity of the coefficients $\beta$. Nevertheless, their paper mainly focuses on outlier detection and much less on sparse robust estimation. Note that another procedure for simultaneous outlier identification and variable selection based on the mean-shift model is proposed by Menjoge and Welsch (2010).

The rest of the paper is organized as follows. In Section 2 the breakdown point of the sparse LTS estimator is obtained. Further, we also show that the lasso and the LAD-lasso have a breakdown point of only $1/n$. A detailed description of the proposed algorithm to compute the sparse LTS regression estimator is provided in Section 3. Section 4 introduces a reweighted version of the estimator in order to increase statistical efficiency. The choice of the penalty parameter $\lambda$ is discussed in Section 5. Simulation studies are performed in Section 6. In addition, Section 7 presents an application to protein and gene expression data of the well-known cancer cell panel of the National Cancer Institute. The results indicate that these data contain outliers such that robust methods are necessary for analysis. Moreover, sparse LTS yields a model that is easy to interpret and has excellent prediction performance. Finally, Section 8 presents some computation times and Section 9 concludes.

## 2 Breakdown point

The most popular measure for the robustness of an estimator is the replacement finite-sample breakdown point (FBP) [e.g., Maronna, Martin and Yohai (2006)]. Let $Z = (X, y)$ denote the sample. For a regression estimator $\hat{\beta}$, the breakdown point is defined as

$$\varepsilon^*(\hat{\beta}; Z) = \min \left\{ \frac{m}{n} : \sup_{\tilde{Z}_m} \bigl\| \hat{\beta}(\tilde{Z}_m) \bigr\|_2 = \infty \right\}, \tag{7}$$

where $\tilde{Z}_m$ are corrupted data obtained from $Z$ by replacing $m$ of the original $n$ data points by arbitrary values. We obtained the following result, from which the breakdown point of the sparse LTS estimator immediately follows. The proof is in the Appendix.

###### Theorem 1

Let $\rho(x)$ be a convex and symmetric loss function with $\rho(0) = 0$ and $\rho(x) > 0$ for $x \neq 0$, and define $\rho(x) := (\rho(x_1), \ldots, \rho(x_n))'$ for a vector $x = (x_1, \ldots, x_n)'$. With subset size $h \le n$, consider the regression estimator

$$\hat{\beta} = \operatorname*{argmin}_{\beta} \sum_{i=1}^{h} \bigl(\rho(y - X\beta)\bigr)_{i:n} + h \lambda \sum_{j=1}^{p} |\beta_j|, \tag{8}$$

where $(\rho(y - X\beta))_{1:n} \le \cdots \le (\rho(y - X\beta))_{n:n}$ are the order statistics of the regression loss. Then the breakdown point of the estimator $\hat{\beta}$ is given by

$$\varepsilon^*(\hat{\beta}; Z) = \frac{n - h + 1}{n}.$$

The breakdown point is the same for any loss function $\rho$ fulfilling the assumptions. In particular, the breakdown point for the sparse LTS estimator $\hat{\beta}_{\mathrm{sparseLTS}}$ with subset size $h \le n$, in which $\rho(x) = x^2$, is still $(n - h + 1)/n$. The smaller the value of $h$, the higher the breakdown point. By taking $h$ small enough, it is even possible to have a breakdown point larger than 50%. However, while this is mathematically possible, we do not advise using $h < n/2$, since robust statistics aim for models that fit the majority of the data. Thus, we do not envisage such large breakdown points. Instead, we suggest taking a value of $h$ equal to a fraction $\alpha$ of the sample size, with $\alpha \ge 0.5$, such that the final estimate is based on a sufficiently large number of observations. This guarantees a sufficiently high statistical efficiency, as will be shown in the simulations in Section 6. The resulting breakdown point is then about $1 - \alpha$. Notice that the breakdown point does not depend on the dimension $p$. Even if the number of predictor variables is larger than the sample size, a high breakdown point is guaranteed. For the nonsparse LTS, the breakdown point does depend on $p$ [see Rousseeuw and Leroy (2003)].
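As a small worked example of how the subset size drives robustness, the sketch below evaluates the breakdown point as a function of $n$ and $h$. It assumes that the breakdown point in Theorem 1 takes the form $(n - h + 1)/n$, which agrees with the statements that the breakdown point grows as $h$ shrinks and is about $1 - \alpha$ when $h$ is a fraction $\alpha$ of $n$. The function name and the sample sizes are hypothetical.

```python
def breakdown_point(n: int, h: int) -> float:
    """Finite-sample breakdown point, assuming the form (n - h + 1) / n."""
    if not 1 <= h <= n:
        raise ValueError("subset size h must satisfy 1 <= h <= n")
    return (n - h + 1) / n


n = 100
bp_no_trimming = breakdown_point(n, n)   # h = n: a single outlier suffices
bp_quarter = breakdown_point(n, 75)      # h = 75: roughly 1 - 0.75
bp_half = breakdown_point(n, 50)         # h = 50: roughly 1 - 0.5
```

Note that the dimension $p$ does not enter the computation at all, in line with the remark that the breakdown point of the sparse estimator does not depend on the number of predictors.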

Applying Theorem 1 to the lasso [corresponding to $\rho(x) = x^2$ and $h = n$] yields a finite-sample breakdown point of

$$\varepsilon^*(\hat{\beta}_{\mathrm{lasso}}; Z) = \frac{1}{n}.$$

Hence, a single outlier can already send the lasso solution to infinity, despite the fact that large values of the regression estimate are penalized in the objective function of the lasso. The nonrobustness of the lasso comes from the use of the squared residuals in the objective function (2). Using other convex loss functions, as done in the LAD-lasso or penalized M-estimators, does not solve the problem and results in a breakdown point of $1/n$ as well. The theoretical results on robustness are also reflected in the application to the NCI data in Section 7, where the lasso is much more influenced by the outliers than the sparse LTS.
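The effect of a single outlier on the lasso can be checked numerically in the one-predictor case, where the lasso estimate has a closed form. The sketch below moves one response value further and further away and records the resulting estimate; the data, the penalty value and the function name are hypothetical choices for illustration only.

```python
def lasso_single_predictor(x, y, lam):
    """Minimize sum_i (y_i - x_i * b)^2 + n * lam * |b| over the scalar b."""
    n = len(x)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    t = n * lam / 2.0
    # Soft-thresholding of sxy at t, divided by the sum of squares of x.
    shrunk = sxy - t if sxy > t else sxy + t if sxy < -t else 0.0
    return shrunk / sxx


x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.0, 2.0, 3.0, 4.0, 5.0]  # clean data with slope 1

estimates = []
for outlier in (5.0, 1e2, 1e4, 1e6):
    y_bad = y[:-1] + [outlier]  # corrupt only the last response
    estimates.append(lasso_single_predictor(x, y_bad, lam=0.5))
```

The estimate grows without bound as the single corrupted response moves away, even though the penalty is kept fixed, which is what a breakdown point of $1/n$ expresses.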

## 3 Algorithm

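We first present an equivalent formulation of the sparse LTS estimator (4). For a fixed penalty parameter $\lambda$, define the objective function

$$Q(H, \beta) = \sum_{i \in H} (y_i - x_i' \beta)^2 + h \lambda \sum_{j=1}^{p} |\beta_j|, \tag{9}$$

which is the penalized residual sum of squares based on a subsample $H \subseteq \{1, \ldots, n\}$ with $|H| = h$. With

$$\hat{\beta}_H = \operatorname*{argmin}_{\beta} Q(H, \beta), \tag{10}$$

the sparse LTS estimator is given by $\hat{\beta}_{H_{\mathrm{opt}}}$, where

$$H_{\mathrm{opt}} = \operatorname*{argmin}_{H \subseteq \{1, \ldots, n\} : |H| = h} Q(H, \hat{\beta}_H). \tag{11}$$

Hence, the sparse LTS corresponds to finding the subset of $h$ observations whose lasso fit produces the smallest penalized residual sum of squares. To find this optimal subset, we use an analogue of the FAST-LTS algorithm developed by Rousseeuw and Van Driessen (2006).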

The algorithm is based on concentration steps or C-steps. The C-step at iteration $k$ consists of computing the lasso solution based on the current subset $H_k$, with $|H_k| = h$, and constructing the next subset $H_{k+1}$ from the observations corresponding to the $h$ smallest squared residuals. Let $H_k$ denote a certain subsample derived at iteration $k$ and let $\hat{\beta}_{H_k}$ be the coefficients of the corresponding lasso fit. After computing the squared residuals $r_k^2 = (r_{k,1}^2, \ldots, r_{k,n}^2)'$ with $r_{k,i}^2 = (y_i - x_i' \hat{\beta}_{H_k})^2$, the subsample $H_{k+1}$ for iteration $k + 1$ is defined as the set of indices corresponding to the $h$ smallest squared residuals. In mathematical terms, this can be written as

$$H_{k+1} = \bigl\{ i \in \{1, \ldots, n\} : r_{k,i}^2 \in \{ (r_k^2)_{j:n} : j = 1, \ldots, h \} \bigr\},$$

where $(r_k^2)_{1:n} \le \cdots \le (r_k^2)_{n:n}$ denote the order statistics of the squared residuals. Let $\hat{\beta}_{H_{k+1}}$ denote the coefficients of the lasso fit based on $H_{k+1}$. Then

$$Q(H_{k+1}, \hat{\beta}_{H_{k+1}}) \le Q(H_{k+1}, \hat{\beta}_{H_k}) \le Q(H_k, \hat{\beta}_{H_k}), \tag{12}$$

where the first inequality follows from the definition of $\hat{\beta}_{H_{k+1}}$, and the second inequality from the definition of $H_{k+1}$. From (12) it follows that a C-step never increases the sparse LTS objective function, and that a sequence of C-steps converges to a local minimum in a finite number of steps.
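The concentration steps can be sketched in a few lines of pure Python. This is a minimal illustration, not the authors' implementation: the lasso on each subset is solved by a plain cyclic coordinate descent that is warm-started at the previous coefficients (an assumption made here so that each refit cannot increase the objective), the penalty is scaled by the subset size, and the data, seed and all function names are hypothetical.

```python
import random


def soft_threshold(z, t):
    """Shrink z toward zero by t; values with |z| <= t become exactly zero."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def predict(xi, beta):
    return sum(a * b for a, b in zip(xi, beta))


def objective(X, y, H, beta, lam):
    """Q(H, beta): squared residuals over H plus |H| * lam * ||beta||_1."""
    rss = sum((y[i] - predict(X[i], beta)) ** 2 for i in H)
    return rss + len(H) * lam * sum(abs(b) for b in beta)


def lasso_cd(X, y, H, lam, start, sweeps=100):
    """Cyclic coordinate descent for the lasso on subset H, warm-started."""
    beta = list(start)
    for _ in range(sweeps):
        for j in range(len(beta)):
            # Correlate predictor j with residuals that leave predictor j out.
            sxr = sum(
                X[i][j] * (y[i] - predict(X[i], beta) + X[i][j] * beta[j])
                for i in H
            )
            sxx = sum(X[i][j] ** 2 for i in H)
            beta[j] = soft_threshold(sxr, len(H) * lam / 2.0) / sxx
    return beta


def c_step(X, y, beta, h):
    """Indices of the h smallest squared residuals with respect to beta."""
    sq = [(yi - predict(xi, beta)) ** 2 for xi, yi in zip(X, y)]
    return sorted(sorted(range(len(y)), key=lambda i: sq[i])[:h])


# Hypothetical data: one informative predictor, five shifted responses.
rng = random.Random(1)
n, p, h, lam = 40, 3, 30, 0.05
X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y = [2.0 * xi[0] + rng.gauss(0, 0.1) for xi in X]
for i in range(5):
    y[i] += 50.0

H = sorted(rng.sample(range(n), h))
beta = lasso_cd(X, y, H, lam, [0.0] * p)
values = [objective(X, y, H, beta, lam)]
for _ in range(20):
    H_next = c_step(X, y, beta, h)
    beta = lasso_cd(X, y, H_next, lam, beta)
    values.append(objective(X, y, H_next, beta, lam))
    if H_next == H:  # the subset no longer changes: local minimum reached
        break
    H = H_next
```

The list `values` records the objective after each C-step and is nonincreasing, mirroring the chain of inequalities in (12). A single random start as used here only finds a local minimum, which is why many starting subsets are needed in practice.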

To increase the chances of arriving at the global minimum, a sufficiently large number of initial subsamples $H_0$ should be used, each of them being used as starting point for a sequence of C-steps. Rather than randomly selecting $h$ data points, any initial subset $H_0$ of size $h$ is constructed from an elemental subset of size 3 as follows. Draw three observations from the data at random, say, $z_{i_1}$, $z_{i_2}$ and $z_{i_3}$. The lasso fit for this elemental subset is then

$$\hat{\beta}_{\{i_1, i_2, i_3\}} = \operatorname*{argmin}_{\beta} Q(\{i_1, i_2, i_3\}, \beta), \tag{13}$$

and the initial subset $H_0$ is then given by the indices of the $h$ observations with the smallest squared residuals with respect to the fit in (13). The nonsparse FAST-LTS algorithm uses elemental subsets of size $p$, since any OLS regression requires at least as many observations as the dimension $p$. This would make the algorithm not applicable if $p > n$. Fortunately the lasso is already properly defined for samples of size 3, even for large values of $p$. Moreover, from a robustness point of view, using only three observations is optimal, as it ensures the highest probability of not including outliers in the elemental set. It is important to note that the elemental subsets of size 3 are only used to construct the initial subsets of size $h$ for the C-step algorithms. All C-steps are performed on subsets of size $h$.
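The robustness argument for small elemental subsets is a simple counting exercise: the probability that a random subset drawn without replacement contains no outliers falls quickly with the subset size. The sketch below computes this probability; the sample size, the number of outliers and the comparison size of 50 are hypothetical values chosen for illustration.

```python
from math import comb


def prob_clean_subset(n: int, n_outliers: int, size: int) -> float:
    """Probability that a random subset of the given size has no outliers."""
    return comb(n - n_outliers, size) / comb(n, size)


n, n_outliers = 100, 10  # hypothetical sample with 10% outliers

p_three = prob_clean_subset(n, n_outliers, 3)   # elemental subset of size 3
p_fifty = prob_clean_subset(n, n_outliers, 50)  # a much larger starting subset
```

Under these assumptions roughly 73% of the elemental subsets of size 3 are free of outliers, whereas a starting subset of size 50 is clean in fewer than one draw in a thousand.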

In this paper, we used initial subsets. Using a larger number of subsets did not lead to better prediction performance in the case of the NCI data. Following the strategy advised in Rousseeuw and Van Driessen (2006), we perform only two C-steps for all subsets and retain the subsamples with the lowest values of the objective function (9). For the reduced number of subsets , further C-steps are performed until convergence. This is a standard strategy for C-step algorithms to decrease computation time.

Estimation of an intercept: the regression model in (1) does not contain an intercept. It is indeed common to assume that the variables are mean-centered and the predictor variables are standardized before applying the lasso. However, computing the means and standard deviations over all observations does not result in a robust method, so we take a different approach. Each time the sparse LTS algorithm computes a lasso fit on a subsample of size $h$, the variables are first centered and the predictors are standardized using the means and standard deviations computed from the respective subsample. The resulting procedure then minimizes (4) with squared residuals $r_i^2 = (y_i - \beta_0 - x_i' \beta)^2$, where $\beta_0$ stands for the intercept. We verified that adding an intercept to the model has no impact on the breakdown point of the sparse LTS estimator of $\beta$.

## 4 Reweighted sparse LTS estimator

Let $\alpha$ denote the proportion of observations from the full sample to be retained in each subsample, that is, $h \approx \alpha n$. In this paper we take $\alpha = 0.75$. Then $(1 - \alpha)$ may be interpreted as an initial guess of the proportion of outliers in the data. This initial guess is typically rather conservative to ensure that outliers do not impact the results, and may therefore result in a loss of statistical efficiency. To increase efficiency, a reweighting step that downweights outliers detected by the sparse LTS estimator can be performed.

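Under the normal error model, observations with standardized residuals larger than a certain quantile of the standard normal distribution may be declared as outliers. Since the sparse LTS estimator, like the lasso, is biased, we need to center the residuals. A natural estimate for the center of the residuals is

$$\hat{\mu}_{\mathrm{raw}} = \frac{1}{h} \sum_{i \in H_{\mathrm{opt}}} r_i, \tag{14}$$

where $r_i = y_i - x_i' \hat{\beta}_{\mathrm{sparseLTS}}$ and $H_{\mathrm{opt}}$ is the optimal subset from (11). Then the residual scale estimate associated to the raw sparse LTS estimator is given by

$$\hat{\sigma}_{\mathrm{raw}} = k_\alpha \sqrt{\frac{1}{h} \sum_{i=1}^{h} (r_c^2)_{i:n}}, \tag{15}$$

with squared centered residuals $r_c^2 = \bigl((r_1 - \hat{\mu}_{\mathrm{raw}})^2, \ldots, (r_n - \hat{\mu}_{\mathrm{raw}})^2\bigr)'$, and

$$k_\alpha = \left( \frac{1}{\alpha} \int_{-\Phi^{-1}((\alpha + 1)/2)}^{\Phi^{-1}((\alpha + 1)/2)} u^2 \, d\Phi(u) \right)^{-1/2}, \tag{16}$$

a factor to ensure that $\hat{\sigma}_{\mathrm{raw}}$ is a consistent estimate of the standard deviation at the normal model. This formulation allows us to define binary weights

$$w_i = \begin{cases} 1, & \text{if } |(r_i - \hat{\mu}_{\mathrm{raw}}) / \hat{\sigma}_{\mathrm{raw}}| \le \Phi^{-1}(1 - \delta), \\ 0, & \text{if } |(r_i - \hat{\mu}_{\mathrm{raw}}) / \hat{\sigma}_{\mathrm{raw}}| > \Phi^{-1}(1 - \delta), \end{cases} \qquad i = 1, \ldots, n. \tag{17}$$

In this paper $\delta = 0.0125$ is used such that 2.5% of the observations are expected to be flagged as outliers in the normal model, which is a typical choice.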
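The outlier flagging step can be sketched with the standard library alone. The sketch assumes that the residual center is the mean residual over the optimal subset, that the scale is the root of the trimmed mean of squared centered residuals multiplied by a consistency factor, and that observations are kept when their standardized residual does not exceed the $1 - \delta$ quantile of the standard normal distribution, with $\delta = 0.0125$ so that 2.5% of observations are flagged at the normal model. The consistency factor uses the identity $\int_{-q}^{q} u^2 \, d\Phi(u) = \alpha - 2 q \varphi(q)$ for $q = \Phi^{-1}((\alpha + 1)/2)$. Function names and data are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def consistency_factor(alpha: float) -> float:
    """Consistency factor of a trimmed scale at the normal model, 0 < alpha < 1."""
    q = STD_NORMAL.inv_cdf((alpha + 1.0) / 2.0)
    # Integral of u^2 dPhi(u) over [-q, q] equals alpha - 2 * q * phi(q).
    trimmed_second_moment = alpha - 2.0 * q * STD_NORMAL.pdf(q)
    return 1.0 / sqrt(trimmed_second_moment / alpha)


def outlier_weights(residuals, subset, alpha, delta=0.0125):
    """Binary weights: 1 for observations kept, 0 for flagged outliers."""
    h = len(subset)
    center = sum(residuals[i] for i in subset) / h
    sq_centered = sorted((r - center) ** 2 for r in residuals)
    scale = consistency_factor(alpha) * sqrt(sum(sq_centered[:h]) / h)
    cutoff = STD_NORMAL.inv_cdf(1.0 - delta)
    return [1 if abs((r - center) / scale) <= cutoff else 0 for r in residuals]


# Hypothetical residuals from a raw fit; the last observation is an outlier.
residuals = [0.1, -0.2, 0.05, 0.3, -0.15, 0.0, -0.1, 8.0]
subset = [0, 1, 2, 3, 4, 5]  # hypothetical optimal subset with h = 6 of n = 8
weights = outlier_weights(residuals, subset, alpha=0.75)
```

In this toy example the observation with index 6 was not part of the subset but still receives weight 1, which is how reweighting recovers efficiency: every observation that looks regular is used in the final fit, not only the $h$ retained ones.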

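The reweighted sparse LTS estimator is given by the weighted lasso fit

$$\hat{\beta}_{\mathrm{reweighted}} = \operatorname*{argmin}_{\beta} \sum_{i=1}^{n} w_i (y_i - x_i' \beta)^2 + \lambda n_w \sum_{j=1}^{p} |\beta_j|, \tag{18}$$

with $n_w = \sum_{i=1}^{n} w_i$ the sum of weights. With the choice of weights given in (17), the reweighted sparse LTS is the lasso fit based on the observations not flagged as outliers. Of course, other weighting schemes could be considered. Using the residual center estimate

$$\hat{\mu}_{\mathrm{reweighted}} = \frac{1}{n_w} \sum_{i=1}^{n} w_i (y_i - x_i' \hat{\beta}_{\mathrm{reweighted}}), \tag{19}$$

the residual scale estimate of the reweighted sparse LTS estimator is given by

$$\hat{\sigma}_{\mathrm{reweighted}} = k_{\alpha_w} \sqrt{\frac{1}{n_w} \sum_{i=1}^{n} w_i (y_i - x_i' \hat{\beta}_{\mathrm{reweighted}} - \hat{\mu}_{\mathrm{reweighted}})^2}, \tag{20}$$

where $k_{\alpha_w}$ is the consistency factor from (16) with $\alpha_w = n_w / n$.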

Note that this reweighting step is conceptually different from the adaptive lasso by Zou (2006). While the adaptive lasso derives individual penalties on the predictors from initial coefficient estimates, the reweighted sparse LTS aims to include all nonoutlying observations into fitting the model.

## 5 Choice of the penalty parameter

In practical data analysis, a suitable value of the penalty parameter $\lambda$ is not known in advance. We propose to select $\lambda$ by optimizing the Bayes Information Criterion (BIC), or the estimated prediction performance via cross-validation. In this paper we use the BIC since it requires less computational effort. The BIC of a given model estimated with shrinkage parameter $\lambda$ is given by

$$\mathrm{BIC}(\lambda) = \log(\hat{\sigma}) + df(\lambda) \frac{\log(n)}{n}, \tag{21}$$

where $\hat{\sigma}$ denotes the corresponding residual scale estimate, (15) or (20), and $df(\lambda)$ are the degrees of freedom of the model. The degrees of freedom are given by the number of nonzero estimated parameters in $\hat{\beta}$ [see Zou, Hastie and Tibshirani (2007)].
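The trade-off expressed by the BIC can be sketched as follows, assuming the criterion has the form $\log(\hat{\sigma}) + df(\lambda) \log(n)/n$ with the degrees of freedom equal to the number of nonzero coefficients. That form is an assumption of this sketch, and the function name, coefficient vectors and scale values are hypothetical.

```python
from math import log


def bic(scale: float, beta: list[float], n: int) -> float:
    """BIC under the assumed form log(scale) + df * log(n) / n."""
    df = sum(1 for b in beta if b != 0.0)  # number of nonzero coefficients
    return log(scale) + df * log(n) / n


# Two hypothetical fits with the same residual scale but different sparsity.
bic_sparse = bic(1.0, [1.5, 0.0, 0.0, 0.5], n=100)
bic_dense = bic(1.0, [1.5, 0.1, -0.1, 0.5], n=100)
```

When two fits have the same residual scale, the one with fewer nonzero coefficients has the smaller criterion value, so a denser model is only preferred if it reduces the residual scale enough to offset the complexity term.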

As an alternative to the BIC, cross-validation can be used. To prevent outliers from affecting the choice of $\lambda$, a robust prediction loss function should be used. A natural choice is the root trimmed mean squared prediction error (RTMSPE) with the same trimming proportion as for computing the sparse LTS. In $k$-fold cross-validation, the data are split randomly in $k$ blocks of approximately equal size. Each block is left out once to fit the model, and the left-out block is used as test data. In this manner, and for a given value of $\lambda$, a prediction is obtained for each observation in the sample. Denote the vector of squared prediction errors $e^2 = (e_1^2, \ldots, e_n^2)'$. Then

$$\mathrm{RTMSPE}(\lambda) = \sqrt{\frac{1}{h} \sum_{i=1}^{h} (e^2)_{i:n}}. \tag{22}$$

To reduce variability, the RTMSPE may be averaged over a number of different random splits of the data.
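The robust prediction loss and the random split into blocks can be sketched as below. The sketch assumes that the RTMSPE is the square root of the mean of the $h$ smallest squared prediction errors; the function names, the seed and the prediction errors are hypothetical.

```python
import random
from math import sqrt


def kfold_blocks(n: int, k: int, seed: int = 0) -> list[list[int]]:
    """Randomly split the indices 0..n-1 into k blocks of near-equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[b::k] for b in range(k)]


def rtmspe(prediction_errors: list[float], h: int) -> float:
    """Root of the mean of the h smallest squared prediction errors."""
    sq = sorted(e * e for e in prediction_errors)
    return sqrt(sum(sq[:h]) / h)


# Hypothetical cross-validated prediction errors with one gross outlier.
errors = [0.5, -0.5, 0.5, -0.5, 0.5, -0.5, 0.5, 40.0]
trimmed = rtmspe(errors, h=6)     # ignores the two largest squared errors
untrimmed = rtmspe(errors, h=8)   # ordinary root mean squared error

blocks = kfold_blocks(10, 5)      # five blocks of two indices each
```

The trimmed loss stays at the level of the regular prediction errors, while the untrimmed version is dominated by the single outlying error and would distort the choice of the penalty parameter.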

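The selected $\lambda$ then minimizes $\mathrm{BIC}(\lambda)$ or $\mathrm{RTMSPE}(\lambda)$ over a grid of values in the interval $[0, \hat{\lambda}_0]$. We take an equidistant grid whose step size is a fixed fraction of $\hat{\lambda}_0$, where $\hat{\lambda}_0$ is an estimate of the shrinkage parameter $\lambda_0$ that would shrink all parameters to zero. If $p > n$, 0 is of course excluded from the grid. For the lasso solution we take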

(23) |

exactly the same as given and motivated in Efron et al. (2004). In (23), $\mathrm{Cor}(y, x_j)$ stands for the Pearson correlation between $y$ and the $j$th column of the design matrix $X$. For sparse LTS, we need a robust estimate $\hat{\lambda}_0$. We propose to replace the Pearson correlation in (23) by the robust correlation based on bivariate winsorization of the data [see Khan, Van Aelst and Zamar (2007)].

## 6 Simulation study

This section presents a simulation study for comparing the performance of various sparse estimators. The simulations are performed in R [R Development Core Team (2011)] with package simFrame [Alfons, Templ and Filzmoser (2010), Alfons (2012a)], which is a general framework for simulation studies in statistics. Sparse LTS is evaluated for the subset size $h \approx 0.75 n$. Both the raw and the reweighted version (see Section 4) are considered. We prefer to take a relatively large trimming proportion to guarantee a breakdown point of 25%. Adding the reweighting step will then increase the statistical efficiency of sparse LTS. We make a comparison with the lasso, the LAD-lasso and robust least angle regression (RLARS), discussed in the introduction. We selected the LAD-lasso estimator as a representative of the class of penalized M-estimators, since it does not need an initial residual scale estimator.

For every generated sample, an optimal value of the shrinkage parameter $\lambda$ is selected. The penalty parameters for sparse LTS and the lasso are chosen using the BIC, as described in Section 5. For the LAD-lasso, we estimate the shrinkage parameter in the same way as in Wang, Li and Jiang (2007). However, if $p > n$, we cannot use their approach and use the BIC as in (21), with the mean absolute value of residuals (multiplied by a consistency factor) as scale estimate. For RLARS, we add the sequenced variables to the model in a stepwise fashion and fit robust MM-regressions [Yohai (1987)], as advocated in Khan, Van Aelst and Zamar (2007). The optimal model when using RLARS is then again selected via BIC, now using the robust scale estimate resulting from the MM-regression.

### 6.1 Sampling schemes

The first configuration is a latent factor model taken from Khan, Van Aelst and Zamar (2007) and covers the case of . From latent independent standard normal variables and an independent normal error variable with standard deviation , the response variable is constructed as

where is chosen so that the signal-to-noise ratio is 3, that is, With independent standard normal variables , a set of candidate predictors is then constructed as

where and so that are low-noise perturbations of the latent variables, are noise covariates that are correlated with the latent variables, and are independent noise covariates. The number of observations is set to .

The second configuration covers the case of moderate high-dimensional data. We generate observations from a -dimensional normal distribution , with . The covariance matrix is given by , creating correlated predictor variables. Using the coefficient vector with , , , and for , the response variable is generated according to the regression model (1), where the error terms follow a normal distribution with .

Finally, the third configuration represents a more extreme case of high-dimensional data with observations and variables. The first predictor variables are generated from a multivariate normal distribution with . Furthermore, the remaining covariates are standard normal variables. Then the response variable is generated according to (1), where the coefficient vector is given by for and for , and the error terms follow a standard normal distribution.

For each of the three simulation settings, we apply contamination schemes taken from Khan, Van Aelst and Zamar (2007). To be more precise, we consider the following:

1. No contamination.
2. Vertical outliers: 10% of the error terms in the regression model follow a contaminating normal distribution instead of the nominal error distribution.
3. Leverage points: same as in 2, but the 10% contaminated observations also contain high-leverage values, obtained by drawing the predictor variables from independent contaminating distributions.

In addition, we investigate a fourth and more stressful outlier scenario. Keeping the contamination level at 10%, outliers in the predictor variables are drawn from independent distributions with a small standard deviation, so that the outliers form a dense cluster. Let $\tilde{x}$ denote such a leverage point. Then the values of the response variable of the contaminated observations are generated from $\tilde{x}$ through a coefficient vector $\gamma$ and a scaling parameter $\eta$. The direction of $\gamma$ is very different from the one of the true regression parameter $\beta$ in the following ways. First, $\gamma$ is not sparse. Second, all predictors have a negative effect on the response in the contaminated observations, whereas the variables with nonzero coefficients have a positive effect on the response in the good data points. Furthermore, the parameter $\eta$ controls the magnitude of the leverage effect and is varied in five equidistant steps.

This results in a total of 12 different simulation schemes, which we think to be representative for the many different simulation designs we tried out. The first scheme has $n > p$, the second setting has $p > n$, and the third setting has $p \gg n$. The choices for the contamination schemes are standard, inducing both vertical outliers and leverage points in the samples.

### 6.2 Performance measures

Since one of the aims of sparse model estimation is to improve prediction performance, the different estimators are evaluated by the root mean squared prediction error (RMSPE). For this purpose, $n$ additional observations from the respective sampling schemes (without outliers) are generated as test data in each simulation run. Then the RMSPE is given by

$$\mathrm{RMSPE}(\hat{\beta}) = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i^* - x_i^{*\prime} \hat{\beta})^2},$$

where $y_i^*$ and $x_i^*$, $i = 1, \ldots, n$, denote the observations of the response and predictor variables in the test data, respectively. The RMSPE of the oracle estimator, which uses the true coefficient values $\beta$, is computed as a benchmark for the evaluated methods. We report average RMSPE over all simulation runs.

Concerning sparsity, the estimated models are evaluated by the false positive rate (FPR) and the false negative rate (FNR). A false positive is a coefficient that is zero in the true model, but is estimated as nonzero. Analogously, a false negative is a coefficient that is nonzero in the true model, but is estimated as zero. In mathematical terms, the FPR and FNR are defined as

$$\mathrm{FPR}(\hat{\beta}) = \frac{|\{j : \hat{\beta}_j \neq 0 \wedge \beta_j = 0\}|}{|\{j : \beta_j = 0\}|}, \qquad \mathrm{FNR}(\hat{\beta}) = \frac{|\{j : \hat{\beta}_j = 0 \wedge \beta_j \neq 0\}|}{|\{j : \beta_j \neq 0\}|}.$$

Both FPR and FNR should be as small as possible for a sparse estimator and are averaged over all simulation runs. Note that false negatives in general have a stronger effect on the RMSPE than false positives. A false negative means that important information is not used for prediction, whereas a false positive merely adds a bit of variance.
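The three performance measures translate directly into code. The sketch below follows the verbal definitions given above: the FPR is the share of truly zero coefficients that are estimated as nonzero, the FNR is the share of truly nonzero coefficients that are estimated as zero, and the RMSPE is the root mean squared prediction error on test data. The function names and the toy values are hypothetical.

```python
from math import sqrt


def false_positive_rate(beta_hat, beta_true):
    """Share of truly zero coefficients that are estimated as nonzero."""
    zeros = [j for j, b in enumerate(beta_true) if b == 0.0]
    return sum(1 for j in zeros if beta_hat[j] != 0.0) / len(zeros)


def false_negative_rate(beta_hat, beta_true):
    """Share of truly nonzero coefficients that are estimated as zero."""
    nonzeros = [j for j, b in enumerate(beta_true) if b != 0.0]
    return sum(1 for j in nonzeros if beta_hat[j] == 0.0) / len(nonzeros)


def rmspe(X_test, y_test, beta_hat):
    """Root mean squared prediction error on the test data."""
    sq = [
        (yi - sum(a * b for a, b in zip(xi, beta_hat))) ** 2
        for xi, yi in zip(X_test, y_test)
    ]
    return sqrt(sum(sq) / len(sq))


beta_true = [1.0, 0.5, 0.0, 0.0, 0.0]
beta_hat = [0.9, 0.0, 0.2, 0.0, 0.0]  # one false negative, one false positive

fpr = false_positive_rate(beta_hat, beta_true)  # 1 of 3 zero coefficients
fnr = false_negative_rate(beta_hat, beta_true)  # 1 of 2 nonzero coefficients

X_test = [[1.0, 0.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0, 0.0]]
y_test = [1.0, 0.5]
error = rmspe(X_test, y_test, beta_hat)
```

In this toy example the false negative on the second coefficient contributes the larger part of the prediction error, which is in line with the remark that false negatives tend to hurt the RMSPE more than false positives.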

### 6.3 Simulation results

In this subsection the simulation results for the different data configurations are presented and discussed.

Table 1. Simulation results for the first sampling scheme.

| | No contamination | | | Vertical outliers | | | Leverage points | | |
|---|---|---|---|---|---|---|---|---|---|
| Method | RMSPE | FPR | FNR | RMSPE | FPR | FNR | RMSPE | FPR | FNR |
| Lasso | 1.18 | 0.10 | 0.00 | 2.44 | 0.54 | 0.09 | 2.20 | 0.00 | 0.16 |
| LAD-lasso | 1.13 | 0.05 | 0.00 | 1.15 | 0.07 | 0.00 | 1.27 | 0.18 | 0.00 |
| RLARS | 1.14 | 0.07 | 0.00 | 1.12 | 0.03 | 0.00 | 1.22 | 0.09 | 0.00 |
| Raw sparse LTS | 1.29 | 0.34 | 0.00 | 1.26 | 0.32 | 0.00 | 1.26 | 0.26 | 0.00 |
| Sparse LTS | 1.24 | 0.22 | 0.00 | 1.22 | 0.25 | 0.00 | 1.22 | 0.18 | 0.00 |
| Oracle | 0.82 | | | 0.82 | | | 0.82 | | |

#### 6.3.1 Results for the first sampling scheme

The simulation results for the first data configuration are displayed in Table 1. Keep in mind that this configuration is exactly the same as in Khan, Van Aelst and Zamar (2007), and that the contamination settings are a subset of the ones applied in their paper. In the scenario without contamination, LAD-lasso, RLARS and lasso show excellent performance with low RMSPE and FPR. The prediction performance of sparse LTS is good, but it has a larger FPR than the other three methods. The reweighting step clearly improves the estimates, which is reflected in the lower values for RMSPE and FPR. Furthermore, none of the methods suffer from false negatives.

In the case of vertical outliers, the nonrobust lasso is clearly influenced by the outliers, reflected in the much higher RMSPE and FPR. RLARS, LAD-lasso and sparse LTS, on the other hand, keep their excellent behavior. Sparse LTS still has a considerable tendency toward false positives, but the reweighting step is a significant improvement over the raw estimator.

When leverage points are introduced in addition to the vertical outliers, the performance of RLARS, sparse LTS and LAD-lasso is comparable. The FPR of RLARS and LAD-lasso slightly increased, whereas the FPR of sparse LTS slightly decreased. The LAD-lasso still performs well, and even the lasso performs better than in the case of only vertical outliers. This suggests that the leverage points in this example do not have a bad leverage effect.

Table 2. Simulation results for the second sampling scheme.

| | No contamination | | | Vertical outliers | | | Leverage points | | |
|---|---|---|---|---|---|---|---|---|---|
| Method | RMSPE | FPR | FNR | RMSPE | FPR | FNR | RMSPE | FPR | FNR |
| Lasso | 0.62 | 0.00 | 0.00 | 2.56 | 0.08 | 0.16 | 2.53 | 0.00 | 0.71 |
| LAD-lasso | 0.66 | 0.08 | 0.00 | 0.82 | 0.00 | 0.01 | 1.17 | 0.08 | 0.01 |
| RLARS | 0.60 | 0.01 | 0.00 | 0.73 | 0.00 | 0.10 | 0.92 | 0.02 | 0.09 |
| Raw sparse LTS | 0.81 | 0.02 | 0.00 | 0.73 | 0.02 | 0.00 | 0.73 | 0.02 | 0.00 |
| Sparse LTS | 0.74 | 0.01 | 0.00 | 0.69 | 0.01 | 0.00 | 0.71 | 0.02 | 0.00 |
| Oracle | 0.50 | | | 0.50 | | | 0.50 | | |

In Figure 1 the results for the fourth contamination setting are shown. The RMSPE is thereby plotted as a function of the parameter $\eta$. With increasing $\eta$, the RMSPE of the lasso and the LAD-lasso increases. RLARS has a considerably higher RMSPE than sparse LTS for lower values of $\eta$, but the RMSPE gradually decreases with increasing $\eta$. However, the RMSPE of sparse LTS remains the lowest; thus, it has the best overall performance.

#### 6.3.2 Results for the second sampling scheme

Table 2 contains the simulation results for the moderate high-dimensional data configuration. In the scenario without contamination, RLARS and the lasso perform best with very low RMSPE and almost perfect FPR and FNR. Also, the LAD-lasso has excellent prediction performance, followed by sparse LTS. The LAD-lasso leads to a slightly higher FPR than the other methods, though. When vertical outliers are added, RLARS still has excellent prediction performance despite some false negatives. We see that the sparse LTS performs best here. In addition, the prediction performance of the nonrobust lasso already suffers greatly from the vertical outliers. In the scenario with additional leverage points, sparse LTS remains stable and is still the best. For RLARS, sparsity behavior according to FPR and FNR does not change significantly either, but there is a small increase in the RMSPE. On the other hand, LAD-lasso already has a considerably larger RMSPE than sparse LTS, and again a slightly higher FPR than the other methods. Furthermore, the lasso is still highly influenced by the outliers, which is reflected in a very high FNR and poor prediction performance.

The results for the fourth contamination setting are presented in Figure 2. As for the previous simulation scheme, the RMSPE for the lasso and the LAD-lasso is increasing with increasing parameter $\eta$. The RMSPE for RLARS, however, is gradually decreasing. Sparse LTS shows particularly interesting behavior: the RMSPE is close to the oracle at first, then there is a kink in the curve (with the value of the RMSPE being in between those for the LAD-lasso and the lasso), after which the RMSPE returns to low values close to the oracle. In any case, for most of the investigated values of $\eta$, sparse LTS has the best performance.

#### 6.3.3 Results for the third sampling scheme

Table 3 contains the simulation results for the more extreme high-dimensional data configuration. Note that the LAD-lasso was no longer computationally feasible with such a large number of variables. In addition, the number of simulation runs was reduced from 500 to 100 to lower the computational effort.

In the case without contamination, the sparse LTS suffers from an efficiency problem, which is reflected in larger values for RMSPE and FNR than for the other methods. The lasso and RLARS have considerably better performance in this case. With vertical outliers, the RMSPE for the lasso increases greatly due to many false negatives. Also, RLARS has a larger FNR than sparse LTS, resulting in a slightly lower RMSPE for the reweighted version of the latter. When leverage points are introduced, sparse LTS clearly exhibits the lowest RMSPE and FNR. Furthermore, the lasso results in a very large FNR.

Method | RMSPE (no contamination) | FPR (no contamination) | FNR (no contamination) | RMSPE (vertical outliers) | FPR (vertical outliers) | FNR (vertical outliers) | RMSPE (leverage points) | FPR (leverage points) | FNR (leverage points) |
---|---|---|---|---|---|---|---|---|---|
Lasso | 1.43 | 0.000 | 0.00 | 5.19 | 0.004 | 0.49 | 5.57 | 0.000 | 0.83 |
RLARS | 1.54 | 0.001 | 0.00 | 2.53 | 0.000 | 0.38 | 3.34 | 0.001 | 0.45 |
Raw sparse LTS | 3.00 | 0.001 | 0.19 | 2.59 | 0.002 | 0.11 | 2.59 | 0.002 | 0.10 |
Sparse LTS | 2.88 | 0.001 | 0.16 | 2.49 | 0.002 | 0.10 | 2.57 | 0.002 | 0.09 |
Oracle | 1.00 | | | 1.00 | | | 1.00 | | |

Figure 3 shows the results for the fourth contamination setting. Most interestingly, the RMSPE of RLARS in this case keeps increasing in the beginning and even exceeds that of the lasso, before dropping continuously in the remaining steps. Sparse LTS again shows a kink in the curve for the RMSPE, but clearly performs best.

#### 6.3.4 Summary of the simulation results

Sparse LTS shows the best overall performance in this simulation study, provided that the reweighted version is used. Among the other investigated methods, RLARS also performs well, but sometimes suffers from an increased percentage of false negatives under contamination. The results also confirm that the lasso is not robust to outliers. The LAD-lasso can still withstand vertical outliers, but is not robust against bad leverage points.
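To make the two sparsity measures used throughout the simulation results concrete, the following is a minimal Python sketch. It assumes that the FPR is the fraction of truly zero coefficients that are estimated as nonzero, and that the FNR is the fraction of truly nonzero coefficients that are estimated as zero. The function name `fpr_fnr` and the toy coefficient vectors are hypothetical and are not taken from the simulation study.

```python
import numpy as np

def fpr_fnr(beta_true, beta_hat):
    """Return (FPR, FNR) for an estimated coefficient vector.

    Assumed definitions:
    FPR = share of true zero coefficients estimated as nonzero,
    FNR = share of true nonzero coefficients estimated as zero.
    """
    beta_true = np.asarray(beta_true, dtype=float)
    beta_hat = np.asarray(beta_hat, dtype=float)
    true_zero = beta_true == 0
    true_nonzero = ~true_zero
    # Guard against empty groups to avoid taking the mean of an empty array.
    fpr = float(np.mean(beta_hat[true_zero] != 0)) if true_zero.any() else 0.0
    fnr = float(np.mean(beta_hat[true_nonzero] == 0)) if true_nonzero.any() else 0.0
    return fpr, fnr

# Hypothetical example: two relevant variables, four noise variables.
beta_true = [1.0, 1.0, 0.0, 0.0, 0.0, 0.0]
beta_hat = [0.9, 0.0, 0.1, 0.0, 0.0, 0.0]
fpr, fnr = fpr_fnr(beta_true, beta_hat)
```

In this toy case one of the four noise variables is selected and one of the two relevant variables is missed.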

## 7 NCI-60 cancer cell panel

In this section the sparse LTS estimator is compared to the competing methods in an application to the cancer cell panel of the National Cancer Institute. It consists of data on 60 human cancer cell lines and can be downloaded via the web application CellMiner (http://discover.nci.nih.gov/cellminer/). We regress protein expression on gene expression data. The gene expression data were obtained with an Affymetrix HG-U133A chip and normalized with the GCRMA method, resulting in a set of predictors. The protein expressions based on 162 antibodies were acquired via reverse-phase protein lysate arrays and transformed. One observation had to be removed since all values were missing in the gene expression data, reducing the number of observations to 59. More details on how the data were obtained can be found in Shankavaram et al. (2007). Furthermore, Lee et al. (2011) also use these data for regression analysis, but consider only nonrobust methods. They obtain models that still consist of several hundred to several thousand predictors and are thus difficult to interpret.

Similar to Lee et al. (2011), we first order the protein expression variables according to their scale, but use the MAD (median absolute deviation from the median, multiplied by the consistency factor 1.4826) as a scale estimator instead of the standard deviation. We show the results for the protein expressions based on the KRT18 antibody, which is the variable with the largest MAD and serves as the dependent variable. Hence, our response variable measures the expression levels of the protein keratin 18, which is known to be persistently expressed in carcinomas [Oshima, Baribault and Caulín (1996)]. We compare raw and reweighted sparse LTS with 25% trimming, lasso and RLARS. As in the simulation study, the LAD-lasso could not be computed for such a large . The optimal models are selected via BIC as discussed in Section 5. The raw sparse LTS estimator thereby results in a model with 32 genes. In the reweighting step, one more observation is added to the best subset found by the raw estimator, yielding a model with 33 genes for reweighted sparse LTS (thus also one more gene is selected compared to the raw estimator). The lasso model is somewhat larger with 52 genes, whereas the RLARS model is somewhat smaller with 18 genes.
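The ordering of candidate response variables by their MAD can be sketched as follows. This is an illustrative Python example, not the R code used for the analysis; the function name `mad` and the toy matrix `Y` are hypothetical, and only the consistency factor 1.4826 is taken from the text.

```python
import numpy as np

def mad(x, constant=1.4826):
    """Median absolute deviation from the median, times a consistency factor."""
    x = np.asarray(x, dtype=float)
    med = np.median(x)
    return constant * np.median(np.abs(x - med))

# Hypothetical toy data: rows are observations, columns are candidate responses.
Y = np.array([[1.0, 10.0, 0.0],
              [2.0, 20.0, 0.1],
              [3.0, 30.0, 0.2],
              [4.0, 40.0, 0.3],
              [100.0, 50.0, 0.4]])

# Robust scale of each column, then column indices from largest to smallest MAD.
scales = np.array([mad(Y[:, j]) for j in range(Y.shape[1])])
order = np.argsort(-scales)
```

Note that the single extreme value 100 in the first column does not inflate its MAD, which is the reason for preferring the MAD over the standard deviation when outliers may be present.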

Sparse LTS and the lasso have three selected genes in common, one of which is KRT8. The product of this gene, the protein keratin 8, typically forms an intermediate filament with keratin 18 such that their expression levels are closely linked [e.g., Owens and Lane (2003)]. However, the larger model of the lasso is much more difficult to interpret. Two of the genes selected by the lasso are not even recorded in the Gene database [Maglott et al. (2005)] of the National Center for Biotechnology Information (NCBI). The sparse LTS model is considerably smaller and easier to interpret. For instance, the gene expression level of MSLN, whose product mesothelin is overexpressed in various forms of cancer [Hassan, Bera and Pastan (2004)], has a positive effect on the protein expression level of keratin 18.

Method | RTMSPE |
---|---|
Lasso | 1.058 |
RLARS | 0.936 |
Raw sparse LTS | 0.727 |
Sparse LTS | 0.721 |

Concerning prediction performance, the root trimmed mean squared prediction error (RTMSPE) is computed as in (22) via leave-one-out cross-validation (so ). Table 4 reports the RTMSPE for the considered methods. Sparse LTS clearly shows the smallest RTMSPE, followed by RLARS and the lasso. In addition, sparse LTS detects 13 observations as outliers, showing the need for a robust procedure. Further analysis revealed that including those 13 observations changes the correlation structure of the predictor variables with the response. Consequently, the order in which the genes are added to the model by the lasso algorithm on the full sample is completely different from the order on the best subset found by sparse LTS. Leaving out those 13 observations therefore yields more reliable results for the majority of the cancer cell lines.
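As an illustration of the trimmed prediction error, the sketch below assumes that the RTMSPE is the square root of the mean of the `h` smallest squared prediction errors, so that the largest errors (presumably from outliers) do not dominate the criterion. The function name `rtmspe`, the argument `h` and the toy error vector are hypothetical; the exact definition is the one given in (22).

```python
import numpy as np

def rtmspe(errors, h):
    """Root trimmed mean squared prediction error.

    Assumed definition: square root of the mean of the h smallest
    squared prediction errors.
    """
    errors = np.asarray(errors, dtype=float)
    if not 1 <= h <= errors.size:
        raise ValueError("h must be between 1 and the number of errors")
    squared = np.sort(errors ** 2)      # ascending order
    return float(np.sqrt(np.mean(squared[:h])))

# Hypothetical prediction errors from cross-validation; the last one is an outlier.
cv_errors = [1.0, -1.0, 2.0, -10.0]
trimmed = rtmspe(cv_errors, h=3)        # ignores the largest squared error
untrimmed = rtmspe(cv_errors, h=4)      # ordinary RMSPE
```

With `h` equal to the number of errors, the criterion reduces to the usual root mean squared prediction error.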

It is also worth noting that the models still contain a rather large number of variables given the small number of observations. The lasso is well known to select many noise variables in high dimensions, since the same penalty is applied to all variables. Meinshausen (2007) therefore proposed a relaxation of the penalty for the selected variables of an initial lasso fit. Adding such a relaxation step to the sparse LTS procedure may thus be beneficial for large and is considered for future work.

## 8 Computational details and CPU times

All computations are carried out in R version 2.14.0 [R Development Core Team (2011)] using the packages robustHD [Alfons (2012b)] for sparse LTS and RLARS, quantreg [Koenker (2011)] for the LAD-lasso and lars [Hastie and Efron (2011)] for the lasso. Most of sparse LTS is thereby implemented in C++, while RLARS is an optimized version of the R code by Khan, Van Aelst and Zamar (2007). Optimization of the RLARS code was necessary since the original code builds a matrix of robust correlations, which is not computationally feasible for very large . The optimized version only stores an matrix, where is the number of sequenced variables. Furthermore, the robust correlations are computed with C++ rather than R.

Since computation time is an important practical consideration, Figure 4 displays computation times of lasso, LAD-lasso, RLARS and sparse LTS in seconds. Note that those are average times over 10 runs based on simulated data with and varying dimension , obtained on an Intel Xeon X5670 machine. For sparse LTS and the LAD-lasso, the reported CPU times are averages over a grid of five values for . Since RLARS is a hybrid procedure, we only report the CPU times for obtaining the sequence of predictors, but not for fitting the models along the sequence.

As expected, the computation time of the nonrobust lasso remains very low for increasing . Sparse LTS is still reasonably fast up to , but computation time is a considerable factor if is much larger than that. However, sparse LTS remains faster than obtaining the RLARS sequence. A further advantage of the subsampling algorithm of sparse LTS is that it can easily be parallelized to reduce computation time on modern multicore computers, which is future work.

## 9 Conclusions and discussion

Least trimmed squares (LTS) is a robust regression method frequently used in practice. Nevertheless, it does not allow for sparse model estimates and cannot be applied to high-dimensional data with more variables than observations. This paper introduced the sparse LTS estimator, which overcomes these two issues simultaneously by adding an $L_1$ penalty to the LTS objective function. Simulation results and a real data application to protein and gene expression data of the NCI-60 cancer cell panel illustrated the excellent performance of sparse LTS and showed that it performs as well as or better than robust variable selection methods such as RLARS. In addition, an advantage of sparse LTS over algorithmic procedures such as RLARS is that the objective function allows for theoretical investigation of its statistical properties. As such, we could derive the breakdown point of the sparse LTS estimator. However, it should be noted that efficiency is an issue with sparse LTS. A reweighting step can lead to a substantial improvement in efficiency, as shown in the simulation study.
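The idea of adding a penalty to the LTS objective can be illustrated by evaluating such an objective for a given coefficient vector. The sketch below assumes the form "sum of the `h` smallest squared residuals plus `h` times `lam` times the sum of absolute coefficients", without an intercept. The function name and the toy data are hypothetical, and the code only evaluates the objective; it is not the fast algorithm proposed in the paper.

```python
import numpy as np

def sparse_lts_objective(X, y, beta, h, lam):
    """Evaluate a penalized trimmed least squares objective.

    Assumed form: sum of the h smallest squared residuals
    plus h * lam * (sum of absolute coefficients). No intercept.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    beta = np.asarray(beta, dtype=float)
    squared_residuals = (y - X @ beta) ** 2
    trimmed_sum = np.sort(squared_residuals)[:h].sum()
    penalty = h * lam * np.abs(beta).sum()
    return float(trimmed_sum + penalty)

# Hypothetical toy data: the last observation is a gross outlier.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [10.0, 10.0]])
y = np.array([1.0, 2.0, 3.0, -50.0])
beta = np.array([1.0, 2.0])

q_trimmed = sparse_lts_objective(X, y, beta, h=3, lam=0.1)  # outlier trimmed away
q_full = sparse_lts_objective(X, y, beta, h=4, lam=0.1)     # outlier included
```

With `h = 3` the outlier does not contribute to the objective, whereas with `h = 4` its squared residual dominates the value, which is why a trimmed criterion is less sensitive to contamination.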

In this paper, an $L_1$ penalization was imposed on the regression parameter, as for the lasso. Other choices for the penalty are possible. For example, an $L_2$ penalty leads to ridge regression. A robust version of ridge regression was recently proposed by Maronna (2011), using penalized MM-estimators. Even though the resulting estimates are not sparse, prediction accuracy is improved by shrinking the coefficients, and the computational issues with high-dimensional robust estimators are overcome due to the regularization. Another possible choice for the penalty function is the smoothly clipped absolute deviation penalty (SCAD) proposed by Fan and Li (2001). It satisfies the mathematical conditions for sparsity but results in a more difficult optimization problem than the lasso. Still, a robust version of SCAD can be obtained by optimizing the associated objective function over trimmed samples instead of over the full sample.
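For reference, the SCAD penalty can be written down in a few lines. The following Python sketch uses the piecewise form from Fan and Li (2001) with their suggested default `a = 3.7`; the function name `scad_penalty` is hypothetical, and the code is only meant to show the shape of the penalty, not a robust SCAD estimator.

```python
import numpy as np

def scad_penalty(theta, lam, a=3.7):
    """SCAD penalty, evaluated elementwise.

    Linear (lasso-like) for |theta| <= lam, quadratic for
    lam < |theta| <= a * lam, and constant for |theta| > a * lam.
    """
    t = np.abs(np.asarray(theta, dtype=float))
    linear = lam * t
    quadratic = (2 * a * lam * t - t ** 2 - lam ** 2) / (2 * (a - 1))
    constant = lam ** 2 * (a + 1) / 2
    return np.where(t <= lam, linear, np.where(t <= a * lam, quadratic, constant))

# Small, medium and large coefficients with lam = 1.
values = scad_penalty([0.5, 2.0, 5.0], lam=1.0)
```

Unlike the $L_1$ penalty, the SCAD penalty stays constant for large coefficients, so large effects are not shrunk further. This nonconvexity is what makes the optimization problem harder than for the lasso.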

There are several other open questions that we leave for future research. For instance, we did not provide any asymptotics for sparse LTS, as was, for example, done for penalized M-estimators in Germain and Roueff (2010). Potentially, sparse LTS could be used as an initial estimator for computing penalized M-estimators.

All in all, the results presented in this paper suggest that sparse LTS is a valuable addition to the statistics researcher’s toolbox. The sparse LTS estimator has an intuitively appealing definition and is related to the popular least trimmed squares estimator of robust regression. It performs model selection, outlier detection and robust estimation simultaneously, and is applicable if the dimension is larger than the sample size.

## Appendix: Proof of breakdown point

Proof of Theorem 1. In this proof the $L_1$ norm of a vector is denoted as and the Euclidean norm as . Since these norms are topologically equivalent, there exists a constant such that for all vectors . The proof is split into two parts.

First, we prove that . Replace the last observations, resulting in the contaminated sample . Then there are still good observations in . Let and . For the case , , the value of the objective function is given by

Now consider any with . For the value of the objective function, it holds that

Since , we conclude that , where does not depend on the outliers. This concludes the first part of the proof.

Second, we prove that . Move the last observations of to the position with , and denote the resulting contaminated sample. Assume that there exists a constant M such that

(A.1) |

that is, there is no breakdown. We will show that this leads to a contradiction.

Let with and define such that . Note that is always well defined due to the assumptions on , in particular, since . Then the objective function is given by

since the residuals with respect to the outliers are all zero. Hence,

(A.2) |

Furthermore, for with we have

since at least one outlier will be in the set of the smallest residuals. Now , so that

(A.3) |

since is nondecreasing.

## Acknowledgments

We would like to thank the Editor and two anonymous referees for their constructive remarks that led to an improvement of the paper.

## References

- Alfons, A. (2012a). simFrame: Simulation framework. R package version 0.5.0.
- Alfons, A. (2012b). robustHD: Robust methods for high-dimensional data. R package version 0.1.0.
- Alfons, A., Templ, M. and Filzmoser, P. (2010). An object-oriented framework for statistical simulation: The R package simFrame. Journal of Statistical Software 37 1–36.
- Efron, B., Hastie, T., Johnstone, I. and Tibshirani, R. (2004). Least angle regression. Ann. Statist. 32 407–499. doi:10.1214/009053604000000067
- Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. J. Amer. Statist. Assoc. 96 1348–1360. doi:10.1198/016214501753382273
- Germain, J.-F. and Roueff, F. (2010). Weak convergence of the regularization path in penalized M-estimation. Scand. J. Stat. 37 477–495. doi:10.1111/j.1467-9469.2009.00682.x
- Gertheiss, J. and Tutz, G. (2010). Sparse modeling of categorial explanatory variables. Ann. Appl. Stat. 4 2150–2180. doi:10.1214/10-AOAS355
- Hassan, R., Bera, T. and Pastan, I. (2004). Mesothelin: A new target for immunotherapy. Clin. Cancer Res. 10 3937–3942. doi:10.1158/1078-0432.CCR-03-0801
- Hastie, T. and Efron, B. (2011). lars: Least angle regression, lasso and forward stagewise. R package version 0.9-8.
- Khan, J. A., Van Aelst, S. and Zamar, R. H. (2007). Robust linear model selection based on least angle regression. J. Amer. Statist. Assoc. 102 1289–1299. doi:10.1198/016214507000000950
- Knight, K. and Fu, W. (2000). Asymptotics for lasso-type estimators. Ann. Statist. 28 1356–1378. doi:10.1214/aos/1015957397
- Koenker, R. (2011). quantreg: Quantile regression. R package version 4.67.
- Lee, D., Lee, W., Lee, Y. and Pawitan, Y. (2011). Sparse partial least-squares regression and its applications to high-throughput data analysis. Chemometrics and Intelligent Laboratory Systems 109 1–8.
- Li, G., Peng, H. and Zhu, L. (2011). Nonconcave penalized M-estimation with a diverging number of parameters. Statist. Sinica 21 391–419.
- Maglott, D., Ostell, J., Pruitt, K. D. and Tatusova, T. (2005). Entrez gene: Gene-centered information at NCBI. Nucleic Acids Res. 33 D54–D58. doi:10.1093/nar/gki031
- Maronna, R. A. (2011). Robust ridge regression for high-dimensional data. Technometrics 53 44–53. doi:10.1198/TECH.2010.09114
- Maronna, R. A., Martin, R. D. and Yohai, V. J. (2006). Robust Statistics: Theory and Methods. Wiley, Chichester. doi:10.1002/0470010940
- Meinshausen, N. (2007). Relaxed lasso. Comput. Statist. Data Anal. 52 374–393. doi:10.1016/j.csda.2006.12.019
- Menjoge, R. S. and Welsch, R. E. (2010). A diagnostic method for simultaneous feature selection and outlier identification in linear regression. Comput. Statist. Data Anal. 54 3181–3193. doi:10.1016/j.csda.2010.02.014
- Oshima, R. G., Baribault, H. and Caulín, C. (1996). Oncogenic regulation and function of keratins 8 and 18. Cancer and Metastasis Reviews 15 445–471.
- Owens, D. W. and Lane, E. B. (2003). The quest for the function of simple epithelial keratins. Bioessays 25 748–758. doi:10.1002/bies.10316
- R Development Core Team (2011). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.
- Radchenko, P. and James, G. M. (2011). Improved variable selection with forward-lasso adaptive shrinkage. Ann. Appl. Stat. 5 427–448. doi:10.1214/10-AOAS375
- Rosset, S. and Zhu, J. (2004). Discussion of “Least angle regression,” by B. Efron, T. Hastie, I. Johnstone and R. Tibshirani. Ann. Statist. 32 469–475.
- Rousseeuw, P. J. (1984). Least median of squares regression. J. Amer. Statist. Assoc. 79 871–880.
- Rousseeuw, P. J. and Leroy, A. M. (2003). Robust Regression and Outlier Detection, 2nd ed. Wiley, Hoboken.
- Rousseeuw, P. J. and Van Driessen, K. (2006). Computing LTS regression for large data sets. Data Min. Knowl. Discov. 12 29–45. doi:10.1007/s10618-005-0024-4
- Shankavaram, U. T., Reinhold, W. C., Nishizuka, S., Major, S., Morita, D., Chary, K. K., Reimers, M. A., Scherf, U., Kahn, A., Dolginow, D., Cossman, J., Kaldjian, E. P., Scudiero, D. A., Petricoin, E., Liotta, L., Lee, J. K. and Weinstein, J. N. (2007). Transcript and protein expression profiles of the NCI-60 cancer cell panel: An integromic microarray study. Molecular Cancer Therapeutics 6 820–832.
- She, Y. and Owen, A. B. (2011). Outlier detection using nonconvex penalized regression. J. Amer. Statist. Assoc. 106 626–639. doi:10.1198/jasa.2011.tm10390
- Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. Roy. Statist. Soc. Ser. B 58 267–288.
- van de Geer, S. A. (2008). High-dimensional generalized linear models and the lasso. Ann. Statist. 36 614–645. doi:10.1214/009053607000000929
- Wang, H., Li, G. and Jiang, G. (2007). Robust regression shrinkage and consistent variable selection through the LAD-lasso. J. Bus. Econom. Statist. 25 347–355. doi:10.1198/073500106000000251
- Wang, S., Nan, B., Rosset, S. and Zhu, J. (2011). Random lasso. Ann. Appl. Stat. 5 468–485. doi:10.1214/10-AOAS377
- Wu, T. T. and Lange, K. (2008). Coordinate descent algorithms for lasso penalized regression. Ann. Appl. Stat. 2 224–244. doi:10.1214/07-AOAS147
- Yohai, V. J. (1987). High breakdown-point and high efficiency robust estimates for regression. Ann. Statist. 15 642–656. doi:10.1214/aos/1176350366
- Yuan, M. and Lin, Y. (2006). Model selection and estimation in regression with grouped variables. J. R. Stat. Soc. Ser. B Stat. Methodol. 68 49–67. doi:10.1111/j.1467-9868.2005.00532.x
- Zhao, P. and Yu, B. (2006). On model selection consistency of lasso. J. Mach. Learn. Res. 7 2541–2563.
- Zou, H. (2006). The adaptive lasso and its oracle properties. J. Amer. Statist. Assoc. 101 1418–1429. doi:10.1198/016214506000000735
- Zou, H., Hastie, T. and Tibshirani, R. (2007). On the “degrees of freedom” of the lasso. Ann. Statist. 35 2173–2192. doi:10.1214/009053607000000127