Robust and sparse estimation methods for high dimensional linear and logistic regression
Abstract
Fully robust versions of the elastic net estimator are introduced for linear and logistic regression. The algorithms to compute the estimators are based on the idea of repeatedly applying the nonrobust classical estimators to data subsets only. It is shown how outlierfree subsets can be identified efficiently, and how appropriate tuning parameters for the elastic net penalties can be selected. A final reweighting step improves the efficiency of the estimators. Simulation studies compare with nonrobust and other competing robust estimators and reveal the superiority of the newly proposed methods. This is also supported by a reasonable computation time and by good performance in real data examples.
keywords:
Elastic net penalty, Least trimmed squares, Cstep algorithm, High dimensional data, Robustness, Sparse estimation1 Introduction
Let us consider the linear regression model, which assumes the linear relationship between the predictors and the predictand ,
(1) 
where are the regression coefficients and is the error term assumed to have standard normal distribution. For simplicity we assume that is centered to mean zero, and the columns of are meancentered and scaled to variance one. The ordinary least squares (OLS) regression estimator is the common choice in situations where the number of observations, , in the data set is greater than the number of predictor variables, . However, in presence of multicollinearity among predictors, the OLS estimator becomes unreliable, and if exceeds it can not even be computed. Several alternatives have been proposed in this case; here we focus on the class of shrinkage estimators which penalize the residual sumofsquares. The ridge estimator uses an penalty on the regression coefficients Hoerl70 (), while the lasso estimator takes an penalty instead Tibshirani96 (). Although this does no longer allow for a closed form solution for the estimated regression coefficients, the lasso estimator gets sparse, which means that some of the regression coefficients are shrunken to zero. This means that lasso acts like a variable selection method by returning a smaller subset of variables being relevant for the model. This is appropriate in particular for high dimensional low sample size data sets (), arising from applications in chemometrics, biometrics, econometrics, social sciences and many other fields, where the data include many uninformative variables which have no effect on the predictand or have very small contribution to the model.
There is also a limitation of the lasso estimator, since it is able to select only at most variables when . If is very small, or if the number of informative variables (variables which are relevant for the model) is expected to be greater than , the model performance can become poor. As a way out, the elastic net (enet) estimator has been introduced Zou05 (), which combines both and penalties:
(2) 
Here, , the observations form the rows of , and the penalty term is defined as
(3) 
The entire strength of the penalty is controlled by the tuning parameter . The other tuning parameter is the mixing proportion of the ridge and lasso penalties and takes value in . The elastic net estimator is able to select variables like in lasso regression, and shrink the coefficients according to ridge. For an overview of sparse methods, see Filzmoser12 ().
A further limitation of the previously mentioned estimators is their lack of robustness against data outliers. In practice, the presence of outliers in data is quite common, and thus robust statistical methods are frequently used, see, for example Liang1 (); Liang2 (). In the linear regression setting, outliers may appear in the space of the predictand (socalled vertical outliers), or in the space of the predictor variables (leverage points) Maronna06 (). The Least Trimmed Squares (LTS) estimator has been among the first proposals of a regression estimator being fully robust against both types of outliers RousseeuwL03 (). It is defined as
(4) 
where the are the ordered absolute residuals , and Rousseeuw84 (). The number is chosen between and , where refers to the largest integer , and it determines the robustness properties of the estimator Rousseeuw84 (). The LTS estimator also became popular due to the proposal of a quick algorithm for its computation, the socalled FASTLTS algorithm Rousseeuw06 (). The key feature of this algorithm is the “concentration step” or Cstep, which is an efficient way to arrive at outlierfree data subsets where the OLS estimator can be applied. This only works for , but recently the sparse LTS regression estimator has been proposed for high dimensional problems Alfons13 ():
(5) 
This estimator adds an penalty to the objective function of the LTS estimator, and it can thus be seen as a robust counterpart of the lasso estimator. The sparse LTS estimator is robust to both vertical outliers and leverage points, and also a fast algorithm has been developed for its computation AlfonsR13 ().
The contribution of this work is twofold: A new sparse and robust regression estimator is proposed with combined and penalties. This robustified elastic net regression estimator overcomes the limitations of lasso type estimators concerning the low number of variables in the models, and concerning the instability of the estimator in case of high multicollinearity among the predictors Tibshirani96 (). As a second contribution, a robust elastic net version of logistic regression is introduced for problems where the response is a binary variable, encoded with referring to the class memberships of two groups. The logistic regression model is , for , where denotes the conditional probability for the th observation,
(6) 
and is the error term assumed to have binomial distribution. The most popular way to estimate the model parameters is the maximum likelihood (ML) estimator which is based on maximizing the loglikelihood function or, equivalently, minimizing the negative loglikelihood function,
(7) 
with the deviances
(8) 
The estimation of the model parameters with this method is not reliable when there is multicollinearity among the predictors and is not feasible when . To solve these problems, Friedman et al. Friedman10 () suggested to minimize a penalized negative loglikelihood function,
(9) 
Here, is the elastic net penalty as given in Equation (3), and thus this estimator extends (2) to the logistic regression setting. Using the elastic net penalty also solves the nonexistence problem of the estimator in case of nonoverlapping groups Albert84 (); Friedman10 (); FriedmanR16 (). Robustness can be achieved by trimming the penalized loglikelihood function, and using weights as proposed in the context of robust logistic regression Croux03 (); Bianco96 (). These weights can also be applied in a reweighting step which increases the efficiency of the robust elastic net logistic regression estimator.
The outline of this paper is as follows. In Section 2, we introduce the robust and sparse linear regression estimator and provide a detailed algorithm for its computation. Section 3 presents the robust elastic net logistic regression estimator. Some important details which are different from the linear regression algorithm are mentioned here. Section 4 explains how the tuning parameters for the proposed estimators can be selected; we prefer an approach based on crossvalidation. Since LTS estimators possess a rather low statistical efficiency, a reweighting step is introduced in Section 5 to increase the efficiency. The properties of the proposed estimators are investigated in simulation studies in Section 6, and Section 7 shows the performance on real data examples. Section 8 provides some insight into the computation time of the algorithms, and the final Section 9 concludes.
2 Robust and sparse linear regression with elastic net penalty
A robust and sparse elastic net estimator in linear regression can be defined with the objective function
(10) 
where with , , and indicates the elastic net penalty with as in Equation (3). We call this estimator the enetLTS estimator, since it uses a trimmed sum of squared residuals, like the sparse LTS estimator (5). The minimum of the objective function (10) determines the optimal subset of size ,
(11) 
which is supposed to be outlierfree. The coefficient estimates depend on the subset . The enetLTS estimator is given for this subset by
(12) 
It is not trivial to identify this optimal subset, and practically one has to use an algorithm to approximate the solution. This algorithm uses Csteps: Suppose that the current subset in the th iteration of the algorithm is denoted by , and the resulting estimator by . Then the next subset is formed by the indexes of those observations which correspond to the smallest squared residuals
(13) 
If denotes the estimator based on , then by construction of the subsets it follows immediately:
(14) 
This means that the Csteps decrease the objective function (10) successively, and lead to a local optimum after convergence. The global optimum is approximated by performing the Csteps with several initial subsets. However, in order to keep the runtime of the algorithm low, it is crucial that the initial subsets are chosen carefully. As motivated in Alfons13 (), for a certain combination of the penalty parameters and , elemental subsets are created consisting of the indexes of three randomly selected observations. Using only three observations increases the possibility of having no outliers in the elemental subsets. Let us denote these elemental subsets by
(15) 
where . The resulting estimators based on the three observations are denoted by . Now the squared residuals can be computed for all observations , and two Csteps are carried out, starting with the subset defined by the indexes of the smallest squared residuals. Then only those subsets with the smallest values of the objective function (10) are kept as candidates. With these candidate subsets, the Csteps are performed until convergence (no further decrease), and the best subset is defined as that one with the smallest value of the objective function. This best subset also defines the estimator for this particular combination of and .
Basically, one can apply this procedure now for a grid of values in the interval and . Practically, this may still be quite time consuming, and therefore, for a new parameter combination, the best subset of the neighboring grid value of and/or , is taken, and the Csteps are started from this best subset until convergence. This technique, called warm starts, is repeated for each combination over the grid of and values, and thus the start based on the elemental subsets is carried out only once.
The choice of the optimal tuning parameters and is detailed in Section 4. The subset corresponding to the optimal tuning parameters is the optimal subset of size . The enetLTS estimator is then calculated on the optimal subset with and .
3 Robust and sparse logistic regression with elastic net penalty
Based on the definition (9) of the elastic net logistic regression estimator, it is straightforward to define the objective function of its robust counterpart based on trimming,
(16) 
where again with , and is the elastic net penalty as defined in Equation (3). As outlined in the last Section 2, the task is to find the optimal subset which minimizes the objective function and defines the robust sparse elastic net estimator for logistic regression. It turns out that the algorithm explained previously in the linear regression setting can be successfully used to find the approximative solution. In the following we will explain the modifications that need to be carried out.
 Csteps:

In the linear regression case, the Csteps were based on the squared residuals (13). Now the subsets are determined according to the indexes of those observations with the smallest values of the deviances . However, here it needs to be made sure that the original group sizes are in the same proportion. Denote and the number of observations in both groups, with . Then and define the group sizes in each subset. A new subset is created with the indexes of the smallest deviances and with the indexes of the smallest deviances .
 Elemental subsets:

In the linear regression case, the elemental subsets consisted of the indexes of three randomly selected observations, see (15). Now four observations are randomly selected to form the elemental subsets, two from each group. This allows to compute the estimator, and the two Csteps are based on the smallest values of the deviances. As before, this is carried out for 500 elemental subsets, and only the “best” 10 subsets are kept. Here, “best” refers to an evaluation that is borrowed from a robustified deviance measure proposed in Croux and Haesbroeck Croux03 () in the context of robust logistic regression (but not in high dimension). These authors replace the deviance function (8) used in (7) by a function to define the socalled Bianco Yohai (BY) estimator
(17) a highly robust logistic regression estimator, see also Bianco96 (). The form of the function is shown in Figure 1, see Croux03 () for details.
We use this function as follows: Positive scores of group 1, i.e. , refer to correct classification and receive the highest values for , while negative scores refer to misclassification, with small or zero values. For the scores of group 0 we have the reverse behavior, see Figure 1. When evaluating an subset, the sum over the values of for is computed, and this sum should be as large as possible. This means that we aim at identifying an subset where the groups are separated as much as possible. Points on the wrong side have almost no contribution, but also the contribution of outliers on the correct side is bounded. In this way, outliers will not dominate the sum.
With the best 10 subsets we continue the Csteps until convergence. Finally, the subset with the largest sum over all forms the best index set.
The selection of the optimal parameters and is discussed in Section 4. The subset corresponding to these optimal tuning parameters is defined as the optimal subset of size . The enetLTS logistic regression estimator is then calculated on the optimal subset with and .
Note that at the beginning of the algorithm for linear regression, the predictand is centered, and the predictor variables are centered robustly by the median and scaled by the MAD. Within the Csteps of the algorithm, we additionally meancenter the response variable and scale the predictors by their arithmetic means and standard deviations, calculated on each current subset, see also Alfons13 (). The same procedure is applied for logistic regression, except for centering the predictand. In the end, the coefficients are backtransformed to the original scale.
4 Selection of the tuning parameters
Sections 2 and 3 outlined the algorithms to arrive at a best subset for robust elastic net linear and logistic regression, for each combination of the tuning parameters and . In this section we define the strategy to select the optimal combination and , leading to the optimal subset. For this purpose we are using fold crossvalidation (CV) on those best subsets of size , with . In more detail, for fold CV, the data are randomly split into blocks of approximately equal size. In case of logistic regression, each block needs to consist of observations from both classes with approximately the same class proportions as in the complete data set. Each block is left out once, the model is fitted to the “training data” contained in the blocks, using a fixed parameter combination for and , and it is applied to the leftout block with the “test data”. In this way, fitted values are obtained from models, and they are compared to the corresponding original response by using the following evaluation criteria:

For linear regression we take the root mean squared prediction error (RMSPE)
(18) where presents the test set residuals from the models estimated on the training sets with a specific and (for simplicity we omitted here the index denoting the models where the th block was left out and the corresponding test data from this block).

For logistic regression we use the mean of the negative loglikelihoods or deviances (MNLL)
(19) where presents the test set deviances from the models estimated on the training sets with a specific and .
Note that the evaluation criteria given by (18) and (19) are robust against outliers, because they are based on the best subsets of size , which are supposed to be outlier free.
In order to obtain more stable results, we repeat the fold CV five times and take the average of the corresponding evaluation measure. Finally, the optimal parameters and are defined as that couple for which the evaluation criterion gives the minimal value. The corresponding best subset is determined as the optimal subset.
Note that the optimal couple and is searched on a grid of values and . In our experiments we used 41 equally spaced values for , and was varied in steps of size . For determining in the linear regression case we used the same approach as in Alfons et al. Alfons13 () which is based on the Pearson correlation between and the th predictor variable on winsorized data. For logistic regression we replaced the Pearson correlation by a robustified pointbiserial correlation: denote by and the group sizes of the two groups, and by and the medians of the th predictor variable for the data from the two groups, respectively. Then the robustified pointbiserial correlation between and is defined as
where is the MAD of , and .
5 Reweighting step
The LTS estimator has a low efficiency, and thus it is common to use a reweighting step RousseeuwL03 (). This idea is also used for the estimators introduced here. Generally, in a reweighting step the outliers according to the current model are identified and downweighted. For the linear regression model we will use the same reweighting scheme as proposed in Alfons et al. Alfons13 (), which is based on standardized residuals. In case of logistic regression we compute the Pearson residuals which are approximately standard normally distributed and given by
(20) 
with the conditional probabilities from (6).
For simplicity, denote the standardized residuals from the linear regression case also by . Then the weights are defined by
(21) 
where , such that of the observations are flagged as outliers in the normal model. The reweighted enetLTS estimator is defined as
(22) 
where , stands for the vector of binary weights (according to the current model), , and corresponds to squared residuals for linear regression or to the deviances in case of logistic regression. Since , and because the optimal parameters and have been derived with observations, the penalty can act (slightly) differently in (22) than for the raw estimator. For this reason, the parameter has to be updated, while the regulating the tradeoff between the and penalty is kept the same. The updated parameter is determined by fold CV, with the simplification that is already fixed.
6 Simulation studies
In this section, the performance of the new estimators is compared with different sparse estimators in different scenarios. We consider both the raw and the reweighted versions of the enetLTS estimators, and therefore aim to show how the reweighting step improves the methods. The raw and reweighted enetLTS estimators are compared with their classical, nonrobust counterparts, the linear and logistic regression estimators with elastic net penalty Friedman10 (). In case of linear regression we also compare with the reweighted sparse LTS estimator of Alfons13 (). All robust estimators are calculated taking the subset size such that their performances are directly comparable.
For each replication, we choose the optimal tuning parameters and over the grids and with 5times repeated fold CV as described in Section 4. To select the tuning parameters for the classical estimators with elastic net penalty, we first draw the same grid for , namely , with 41 equally spaced grid points. Then we use fold CV as provided by the R package glmnet, which automatically checks the model quality for a sequence of values for , taking the mean squared error as an evaluation criterion. Finally, the tuning parameters corresponding to the smallest value of the minimum crossvalidated error are determined as the optimal tuning parameters. In order to be coherent with our evaluation, the tuning parameters for the sparse LTS estimator are determined in the same way as for the enetLTS estimator. All simulations are carried out in R R ().
Note that we simulated the data sets with intercept. As described at the end of Section 3, the data are centered and scaled at the beginning of the algorithm and only in the final step the coefficients are backtransformed to the original scale, where also the estimate of the intercept is computed.
Sampling schemes for linear regression: Let us consider two different scenarios by means of generating a “low dimensional” data set with and and a “high dimensional” data set with and . We generate a data matrix where the variables are forming correlated blocks, , where , and have the dimensions , and , with . Such a block structure can be assumed in many application, and it mimics different underlying hidden processes. The observations of the blocks are generated independently from each other, from a multivariate normal distribution with , , , with , , , and with , , , respectively. While the first two blocks belong to the informative variables with sizes of and , the third block represents uninformative variables with . Furthermore, we take to allow for a high correlation among the informative variables, and to have low correlation among the uninformative variables.
To create sparsity, the true parameter vector consists of zeros for the last 90% of the entries referring to the uninformative variables, while the first 10% of the entries are assigned to one. The response variable is calculated by
(23) 
where the error term is distributed according to a standard normal distribution , for .
This is the design for the simulations with clean data. For the simulation scenarios with outliers we replace the first of the observations of the block of informative variables by values coming from independent normal distributions for each variable. Further, the error terms for these outliers are replaced by values from instead of , where represents the estimated standard deviation of the clean predictand vector. In this way, the contaminated data consist of both vertical outliers and leverage points.
Sampling schemes for logistic regression: We also consider two different scenarios for logistic regression, a “low dimensional” data set with and and a “high dimensional” data set with and . The data matrix is , where has the dimension and is of dimension , with . The data matrices are generated independently from with , , , and with , , , respectively. While the first block consists of the informative variables with , the second block represents uninformative variables with . We take for a high correlation among the informative variables, and for moderate correlation among the uninformative variables.
The coefficient vector consists of ones for the first 10% of the entries, and zeros for the remaining uninformative block. The elements of the error term are generated independently from . The grouping variable is then generated according to the model
(24) 
With this setting, both groups are of approximately the same size.
Contamination is introduced by adding outliers only to the informative variables. Denote the number of observations in class 0. Then the first observations of group 0 are replaced by values generated from . In order to create “vertical” outliers in addition to leverage points, we assign those first observations of class 0 a wrong class membership.
Performance measures: For the evaluation of the different estimators, training and test data sets are generated according to the explained sampling schemes. The models are fit to the training data and evaluated on the test data. The test data are always generated without outliers.
As performance measures we use the root mean squared prediction error (RMSPE) for linear regression,
(25) 
and the mean of the negative loglikelihoods or deviances (MNLL) for logistic regression,
(26) 
where and , , indicate the observations of the test data set, denotes the coefficient vector and stands for the estimated intercept term obtained from the training data set. In logistic regression we also calculate the misclassification rate (MCR), defined as
(27) 
where is the number of misclassified observations from the test data after fitting the model on the training data. Further, we consider the precision of the coefficient estimate as a quality criterion, defined by
(28) 
In order to compare the sparsity of the coefficient estimators, we evaluate the False Positive Rate (FPR) and the False Negative Rate (FNR), defined as
(29) 
(30) 
The FPR is the proportion of noninformative variables that are incorrectly included in the model. On the other hand, the FNR is the proportion of informative variables that are incorrectly excluded from the model. A high FNR usually has a bad effect on the prediction performance since it inflates the variance of the estimator.
These evaluation measures are calculated for the generated data in each of 100 simulation replications separately, and then summarized by boxplots. The smaller the value for these criteria, the better the performance of the method.
Results for linear regression: The outcome of the simulations for linear regression is summarized in Figures 2–5. The left plots in these figures are for the simulations with low dimensional data, and the right plots for the high dimensional configuration. Figure 2 compares the RMSPE. All methods yield similar results in the low dimensional noncontaminated case, while in the high dimensional clean data case the elastic net method is clearly better. However, in the contaminated case, elastic net leads to poor performance, which is also the case for sparse LTS. EnetLTS performs even slightly better with contaminated data, and there is also a slight improvement visible in the reweighted version of this estimator. The PRECISION in Figure 3 shows essentially the same behavior. The FPR in Figure 4, reflecting the proportion of incorrectly added noise variables to the models, shows a very low rate for sparse LTS. Here, the elastic net even improves in the contaminated setting, and the same is true for enetLTS. A quite different picture is shown in Figure 5 with the FNR. Sparse LTS and elastic net miss a high proportion of informative variables in the contaminated data scenario, which is the reason for their poor overall performance. Note that the outliers are placed in the informative variables, which seems to be particularly difficult for sparse LTS.
Results for logistic regression: Figures 6–10 summarize the simulation results for logistic regression. As before, the left plots refer to the low dimensional case, and the right plots to the high dimensional data. Within one plot, the results for uncontaminated and contaminated data are directly compared. The misclassification rate in Figure 6 is around 10% for all methods, and it is slightly higher in the high dimensional situation. In case of contamination, however, this rate increases enormously for the classical method elastic net.
The average deviances in Figure 7 show that the reweighting of the enetLTS estimator clearly improves the raw estimate in both the low and high dimensional cases. It can also be seen that elastic net is sensitive to the outliers. The precision of the parameter estimates in Figure 8 reveal a remarkable improvement for the reweighted enetLTS estimator compared to the raw version, while there is not any clear effect of the contamination on the classical elastic net estimator.
The FPR in Figure 9 shows a certain difference between uncontaminated and contaminated data for the elastic net, but otherwise the results are quite comparable. A different picture is visible from the FNR in Figure 10, where especially in the low dimensional case the elastic net is very sensitive to the outliers. Overall we conclude that the enetLTS performs very well in case of contamination even though this was not clearly visible in the precision, and it also yields reasonable results for clean data.
7 Applications to real data
In this section we focus on applications with logistic regression, and compare the nonrobust elastic net estimator with the robust enetLTS method. The model selection is conducted as described in Section 4. Model evaluation is done with leaveoneout cross validation, i.e. each observation is used as test observation once, a model is estimated on the remaining observations, and the negative loglikelihood is calculated for the test observation. In these real data examples it is unknown if outliers are present. In order to avoid an influence of potential outliers on the evaluation of a model, the 25% trimmed mean of the negative loglikelihoods is calculated to compare the models.
7.1 Analysis of meteorite data
The timeofflight secondary iron mass spectroscope COSIMA kissel2007cosima () was sent to the comet ChuryumovGerasimenko in the Rosetta space mission by the ESA to analyze the elemental composition of comet particles which were collected there schulz2015comet (). As reference measurements, samples of meteorites provided by the Natural History Museum Vienna were analyzed with the same type of spectroscope at Max Planck Institute for Solar System Research in Göttingen.
Here we apply our proposed method for logistic regression to the measurements from particles from the meteorites Ochansk and Renazzo with 160 and 110 spectra, respectively. We restrict the mass range to 1100mu, consider only mass windows where inorganic and organic ions can be expected as described in Varmuza11 () and variables with positive median absolute deviation. So we obtain variables. Further, the data is normalized to have constant row sum 100.
Table 1 summarizes the results for the comparison of the methods. The trimmed MNLL is much smaller for the enetLTS estimator than for the classical elastic net method. The reweighting step improves the quality of the model further. The selected tuning parameter is much smaller for enetLTS than for the classical elastic net method which strongly influences the number of variables in the models.
number variables  trimmed MNLL  

elastic net  136  0.00866 
enetLTS raw  294  0.00030 
enetLTS  397  0.00014 
Figure 11 compares the Pearson residuals of the elastic net model and the enetLTS model. In the classical approach no abnormal observations can be detected. With the enetLTS model several observations are identified as outliers by the 1.25% and 98.25% quantiles of the standard normal distribution, which are marked as horizontal lines in Figure 11. Closer investigation showed that these spectra lie on the outer border of the measurement area and are potentially measured on the target instead of the meteorite particle. Their multivariate structure for those variables which are included in the model is visualized in Figure 12, where we can see that in some variables they have particularly large values compared to the majority of the group.
7.2 Analysis of the glass vessels data
Archaeological glass vessels where analyzed with electronprobe Xray microanalysis to investigate the chemical concentrations of elements in order to learn more about their origin and the trade market at the time of their making in the 16 and 17 century Janssens98 (). Four different groups were identified, i.e. sodic, potassic, potassocalcic and calcic glass vessels. For demonstration of the performance of logistic regression, two groups are selected from the glass vessels data set. The first group is the potassic group with 15 spectra, the second group the potassocalcic group with 10 spectra. As in Filzmoser08 () we remove variables with MAD equal to zero, resulting in variables.
The quality of the selected models is described in Table 2. The trimmed mean of the negative log likelihoods is much smaller for enetLTS than for elastic net. The reweighting step in enetLTS hardly improves the model, but includes more variables. Again, both enetLTS models include more variables than the elastic net model. In the elastic net model the penalty gives higher emphasis on the term, i.e. ; for enetLTS it is .
number variables  trimmed MNLL  

elastic net  50  0.004290 
enetLTS raw  375  0.000345 
enetLTS  448  0.000338 
Different behavior of the coefficient estimates can be expected. Figure 13 (left) shows coefficients of the reweighted enetLTS model corresponding to variables associated with potassium and calcium. The band which is associated with potassium has positive coefficients, i.e. high values of these variables correspond to the potassic group which is coded with ones in the response. High values of the variables in the band which is associated with calcium will favor a classification to the potassocalcic group (coded with zero), since the coefficients for these variables are negative. Further, it can be observed that neighboring variables, which are correlated, have similar coefficients. This is favored by the term in the elastic net penalty. In Figure 13 (right) the coefficient estimates of the elastic net model are visualized. Fewer coefficients are nonzero than for enetLTS which was favored by the term in the elastic net penalty, but in the second block of nonzero coefficients neighboring variables receive very different coefficient estimates.
8 Computation time
For our algorithm we employ the classical elastic net estimator as it is implemented in the R package FriedmanR16 (). So, it is natural to compare the computation time of our algorithm with this method. In the linear regression case we also compare with the sparse LTS estimator implemented in the R package AlfonsR13 (). For calculating the estimators we take a grid of five values for both tuning parameters and . The data sets are simulated as in Section 6 for a fixed number of observations , but for a varying number of variables in a range from to . In Figure 14 (left: linear regression, right: logistic regression), the CPU time is reported in seconds, as an average over replications. In order to show the dependency on the number of observations , we also simulated data sets for a fixed number of variables with a varying number of observations . The results for linear and logistic regression are summarized in Figure 15. The computations have been performed on an Intel Core 2 Q9650 @ GHz4 processor.
Let us first consider the dependency of the computation time on the number of variables for linear regression, shown in the left plot of Figure 14. Sparse LTS increases strongly with the number of variables since it is based on the LARS algorithm which has a computational complexity of Efron2004 (). Also for the smallest number of considered variables, the computation time is considerably higher than for the other two methods. The reason is that for each value of and each step in the CV the best subset is determined starting with 500 elemental subsets. In this setting at least 25,000 estimations of a Lasso model are needed, because for each cross validation step at each of the 5 values of , two Csteps for 500 elemental subsets are carried out, and for the 10 subsamples with lowest value of the objective function, further Csteps are performed. In contrast, the enetLTS estimator starts with 500 elemental subsets only for one combination of and , and takes the warm start strategy for subsequent combinations. This saves computation time, and there is indeed only a slight increase with visible when compared to the elastic net estimator. In total approximately 1,700 elastic net models are estimated in this procedure, which are considerably fewer than for the sparse LTS approach. The computation time of sparse LTS also increases with due to the computational complexity of LARS, while the increase is only minor for enetLTS, see Figure 15 (left).
The results for the computation time in logistic regression are presented in Figure 14 (right) and 15 (right). Here we can only compare the classical elastic net estimator and the proposed robustified enetLTS version. The difference in computation time between elastic net and enetLTS is again due to the many calls of the glmnet function within enetLTS. The robust estimator is considerably slower in logistic regression when compared to linear regression for the same number of explanatory variables or observations. The reason is that more Csteps are necessary to identify the optimal subset for each parameter combination of and .
9 Conclusions
In this paper, robust methods for linear and logistic regression using the elastic net penalty were introduced. This penalty allows for variable selection, can deal with high multicollinearity among the variables, and is thus very appropriate in high dimensional sparse settings. Robustness has been achieved by using trimming. This usually leads to a loss in efficiency, and therefore a reweighting step was introduced. Overall, the outlined algorithms for linear and logistic regression turned out to yield good performance in different simulation settings, but also with respect to computation time. Particularly, it was shown that the idea of using “warm starts” for parameter tuning allows to save computation time, while the precision is still preserved. A competing method for robust high dimensional linear regression, the sparse LTS estimator AlfonsR13 (), does not use this idea, and is thus much less attractive concerning computation time, especially in case of many explanatory variables. We should also admit that for other simulation settings (not shown here), the difference between sparse LTS and the enetLTS estimator is not so big, or even marginal, depending on the exact setting.
For this reason, a further focus was on the robust high dimensional logistic regression case. We consider such a method as highly relevant, since in many modern applications in chemometrics or bioinformatics, one is confronted with data information from two groups, with the task to find a classification rule and to identify marker variables which support the rules. Outliers in the data are frequently a problem, and they can affect the identification of the marker variables as well as the performance of the classifier. For this reason it is desirable to treat outliers appropriately. It was shown in simulation studies as well as in data examples, that in presence of outliers the new proposal still works well, while its classical nonrobust counterpart can lead to poor performance.
Note that in Park16 () a logistic regression method with elastic net penalty is proposed using weights to reduce the influence of outliers. Their approach is to perform outlier detection in a PCA space, obtain weights based on robust Mahalanobis distances in the PCA score space and derive weights from these distances. These weights are then used to downweight the negative log likelihoods in the penalized objective function to reduce the influence of outliers. However, it is not guaranteed that outliers can be detected in the PCA score space. An increasing number of uninformative variables will disguise observations deviating from the majority only in few informative variables, but these hidden outlying observations can still distort the model. Therefore, model based outlier detection is highly recommended as proposed in our algorithm.
The algorithms to compute the proposed estimators are implemented in R functions, which are available upon request from the authors. The basis for the computation of the robust estimator is the R package FriedmanR16 (). This package also implements the case of multinomial and Poisson regression. Naturally, a further extension of the algorithms introduced here could go into these directions. Further work will be devoted to the theoretical properties of the family of enetLTS estimators.
Acknowledgments
This work is partly supported by the Austrian Science Fund (FWF), project P 26871N20 and by grant TUBITAK 2214/A from the Scientific and Technological Research Council of Turkey (TUBITAK).
The authors thank F. Brandstätter, L. Ferrière, and C. Koeberl (Natural History Museum Vienna, Austria) for providing meteorite samples, C. Engrand (Centre de Sciences Nucléaires et de Sciences de la Matière, Orsay, France) for sample preparation, and M. Hilchenbach (Max Planck Institute for Solar System Research, Göttingen, Germany) for TOFSIMS measurements. The authors are grateful to Kurt Varmuza for valuable feedback on the results of the meteorite data.
References
References
 (1) A. Hoerl, R. Kennard, Ridge regression: Biased estimation for nonorthogonal problems, Technometrics 12 (1970) 55–67.
 (2) R. Tibshirani, Regression shrinkage and selection via the lasso, Journal of the Royal Statistical Society: Series (Methodological) 58(1) (1996) 267–288.
 (3) H. Zou, T. Hastie, Regularization and variable selection via the elastic net, Journal of the Royal Statistical Society: Series B 67(2) (2005) 301–320.
 (4) P. Filzmoser, M. Gschwandtner, V. Todorov, Review of sparse methods in regression and classification with application to chemometrics, Journal of Chemometrics 26(3–4) (2012) 42–51.
 (5) Y.Z. Liang, O. Kvalheim, Robust methods for multivariate analysis – a tutorial review, Chemometrics and Intelligent Laboratory Systems 32 (1) (1996) 1–10.
 (6) K. Liang, Y.Z. anf Fang, Robust multivariate calibration algorithm based on least median of squares and sequential number theory optimization method, Analyst 121 (8) (1996) 1025–1029.
 (7) R. Maronna, R. Martin, V. Yohai, Robust Statistics: Theory and Methods, Wiley, New York, 2006.
 (8) P. Rousseeuw, A. Leroy, Robust Regression and Outlier Detection, Wiley 2nd edition: John Wiley & Sons, New York, 2003.
 (9) P. J. Rousseeuw, Least median of squares regression, Journal of the American Statistical Association 79(388) (1984) 871–880.
 (10) P. J. Rousseeuw, K. Van Driessen, Computing LTS regression for large data sets, Data Mining and Knowledge Discovery 12(1) (2006) 29–45.
 (11) A. Alfons, C. Croux, S. Gelper, Sparse least trimmed squares regression for analyzing highdimensional large data sets, The Annals of Applied Statistics 7(1) (2013) 226–248.

(12)
A. Alfons, robustHD: Robust
methods for high dimensional data, R Foundation for Statistical Computing,
Vienna, Austria, R package version 0.4.0 (2013).
URL http://CRAN.Rproject.org/package=robustHD  (13) J. Friedman, T. Hastie, R. Tibshirani, Regularization paths for generalized linear models via coordinate descent, Journal of Statistical Software 33(1) (2010) 1–22.
 (14) A. Albert, J. Anderson, On the existence of maximum likelihood estimates in logistic regression models, Biometrika 71 (1984) 1–10.

(15)
J. Friedman, T. Hastie, N. Simon, R. Tibshirani,
glmnet: Lasso and Elastic Net
Regularized Generalized Linear Models, R Foundation for Statistical
Computing, Vienna, Austria, R package version 2.05 (2016).
URL http://CRAN.Rproject.org/package=glmnet  (16) C. Croux, G. Haesbroeck, Implementing the Bianco and Yohai estimator for logistic regression, Computational Statistics and Data Analysis 44(1–2) (2003) 273–295.
 (17) V. Bianco, A.M. Yohai, Robust Estimation in Logistic Regression Model, in robust Statistics, Data Analysis, and Computer Intensive Methods, 17–34; Lecture Notes in Statistics 109, Springer Verlag, Ed. H., Rieder: New York, 1996.

(18)
R Development Core Team, R: A Language and
Environment for Statistical Computing, Vienna, Austria, R Foundation for
Statistical Computing, Vienna, Austria, ISBN 3900051070 (2017).
URL http://www.Rproject.org  (19) J. Kissel, K. Altwegg, B. Clark, L. Colangeli, H. Cottin, S. Czempiel, J. Eibl, C. Engrand, H. Fehringer, B. Feuerbacher, et al., COSIMA–high resolution timeofflight secondary ion mass spectrometer for the analysis of cometary dust particles onboard Rosetta, Space Science Reviews 128(14) (2007) 823–867.
 (20) R. Schulz, M. Hilchenbach, Y. Langevin, J. Kissel, J. Silen, C. Briois, C. Engrand, K. Hornung, D. Baklouti, A. Bardyn, et al., Comet 67P/ChuryumovGerasimenko sheds dust coat accumulated over the past four years, Nature 518(7538) (2015) 216–218.
 (21) K. Varmuza, C. Engrand, P. Filzmoser, M. Hilchenbach, J. Kissel, H. Krüger, J. Silén, M. Trieloff, Random projection for dimensionality reduction  applied to timeofflight secondary ion mass spectrometry data, Analytica Chimica Acta 705(1) (2011) 48–55.
 (22) K. Janssens, I. Deraedt, A. Freddy, J. Veekman, Composition of 1517 th century archeological glass vessels excavated in Antwerp, Belgium, Mikrochima Acta 15 (1998) 253–267.
 (23) P. Filzmoser, R. Maronna, W. M., Outlier identification in high dimensions, Computational Statistics & Data Analysis 52(3) (2008) 1694–1711.
 (24) B. Efron, T. Hastie, I. Johnstone, R. Tibshirani, Least angle regression, The Annals of Statistics 32(2) (2004) 407–499.
 (25) Robust logistic regression modelling via the elastic nettype regularization and tuning parameter selection, Journal of Statistical Computation and Simulation 86 (7) (2016) 1450–1461.