Robust and sparse estimation methods for high dimensional linear and logistic regression

# Robust and sparse estimation methods for high dimensional linear and logistic regression

Fatma Sevinç Kurnaz Irene Hoffmann Peter Filzmoser Department of Statistics, Yildiz Technical University, 34220, Istanbul, Turkey Institute of Statistics and Mathematical Method in Economics, Vienna University of Technology, Wiedner Hauptstraße 8-10, 1040, Vienna, Austria
###### Abstract

Fully robust versions of the elastic net estimator are introduced for linear and logistic regression. The algorithms to compute the estimators are based on the idea of repeatedly applying the non-robust classical estimators to data subsets only. It is shown how outlier-free subsets can be identified efficiently, and how appropriate tuning parameters for the elastic net penalties can be selected. A final reweighting step improves the efficiency of the estimators. Simulation studies compare with non-robust and other competing robust estimators and reveal the superiority of the newly proposed methods. This is also supported by a reasonable computation time and by good performance in real data examples.

###### keywords:
Elastic net penalty, Least trimmed squares, C-step algorithm, High dimensional data, Robustness, Sparse estimation
journal: Chemometrics and Intelligent Laboratory Systems

## 1 Introduction

Let us consider the linear regression model, which assumes the linear relationship between the predictors and the predictand ,

 y=Xβ+ε, (1)

where are the regression coefficients and is the error term assumed to have standard normal distribution. For simplicity we assume that is centered to mean zero, and the columns of are mean-centered and scaled to variance one. The ordinary least squares (OLS) regression estimator is the common choice in situations where the number of observations, , in the data set is greater than the number of predictor variables, . However, in presence of multicollinearity among predictors, the OLS estimator becomes unreliable, and if exceeds it can not even be computed. Several alternatives have been proposed in this case; here we focus on the class of shrinkage estimators which penalize the residual sum-of-squares. The ridge estimator uses an penalty on the regression coefficients Hoerl70 (), while the lasso estimator takes an penalty instead Tibshirani96 (). Although this does no longer allow for a closed form solution for the estimated regression coefficients, the lasso estimator gets sparse, which means that some of the regression coefficients are shrunken to zero. This means that lasso acts like a variable selection method by returning a smaller subset of variables being relevant for the model. This is appropriate in particular for high dimensional low sample size data sets (), arising from applications in chemometrics, biometrics, econometrics, social sciences and many other fields, where the data include many uninformative variables which have no effect on the predictand or have very small contribution to the model.

There is also a limitation of the lasso estimator, since it is able to select only at most variables when . If is very small, or if the number of informative variables (variables which are relevant for the model) is expected to be greater than , the model performance can become poor. As a way out, the elastic net (enet) estimator has been introduced Zou05 (), which combines both and penalties:

 ^βenet=argminβ{n∑i=1(yi−xTiβ)2+λPα(β)} (2)

Here, , the observations form the rows of , and the penalty term is defined as

 Pα(β)=(1−α)12∥β∥22+α∥β∥1=p∑j=1[(1−α)12β2j+α|βj|]. (3)

The entire strength of the penalty is controlled by the tuning parameter . The other tuning parameter is the mixing proportion of the ridge and lasso penalties and takes value in . The elastic net estimator is able to select variables like in lasso regression, and shrink the coefficients according to ridge. For an overview of sparse methods, see Filzmoser12 ().

A further limitation of the previously mentioned estimators is their lack of robustness against data outliers. In practice, the presence of outliers in data is quite common, and thus robust statistical methods are frequently used, see, for example Liang1 (); Liang2 (). In the linear regression setting, outliers may appear in the space of the predictand (so-called vertical outliers), or in the space of the predictor variables (leverage points) Maronna06 (). The Least Trimmed Squares (LTS) estimator has been among the first proposals of a regression estimator being fully robust against both types of outliers RousseeuwL03 (). It is defined as

 ^βLTS=argminβh∑i=1r2(i)(β), (4)

where the are the ordered absolute residuals , and Rousseeuw84 (). The number is chosen between and , where refers to the largest integer , and it determines the robustness properties of the estimator Rousseeuw84 (). The LTS estimator also became popular due to the proposal of a quick algorithm for its computation, the so-called FAST-LTS algorithm Rousseeuw06 (). The key feature of this algorithm is the “concentration step” or C-step, which is an efficient way to arrive at outlier-free data subsets where the OLS estimator can be applied. This only works for , but recently the sparse LTS regression estimator has been proposed for high dimensional problems Alfons13 ():

 ^βsparseLTS=argminβ{h∑i=1r2(i)(β)+hλ∥β∥1}. (5)

This estimator adds an penalty to the objective function of the LTS estimator, and it can thus be seen as a robust counterpart of the lasso estimator. The sparse LTS estimator is robust to both vertical outliers and leverage points, and also a fast algorithm has been developed for its computation AlfonsR13 ().

The contribution of this work is twofold: A new sparse and robust regression estimator is proposed with combined and penalties. This robustified elastic net regression estimator overcomes the limitations of lasso type estimators concerning the low number of variables in the models, and concerning the instability of the estimator in case of high multicollinearity among the predictors Tibshirani96 (). As a second contribution, a robust elastic net version of logistic regression is introduced for problems where the response is a binary variable, encoded with referring to the class memberships of two groups. The logistic regression model is , for , where denotes the conditional probability for the th observation,

 πi=Pr(yi=1|xi)=exTiβ1+exTiβ, (6)

and is the error term assumed to have binomial distribution. The most popular way to estimate the model parameters is the maximum likelihood (ML) estimator which is based on maximizing the log-likelihood function or, equivalently, minimizing the negative log-likelihood function,

 ^βML=argminβn∑i=1d(xTiβ,yi), (7)

with the deviances

 d(xTiβ,yi)=−yilogπi−(1−yi)log(1−πi)=−yixTiβ+log(1+exTiβ). (8)

The estimation of the model parameters with this method is not reliable when there is multicollinearity among the predictors and is not feasible when . To solve these problems, Friedman et al. Friedman10 () suggested to minimize a penalized negative log-likelihood function,

 ^βenet=argminβ{n∑i=1d(xTiβ,yi)+nλPα(β)}. (9)

Here, is the elastic net penalty as given in Equation (3), and thus this estimator extends (2) to the logistic regression setting. Using the elastic net penalty also solves the non-existence problem of the estimator in case of non-overlapping groups Albert84 (); Friedman10 (); FriedmanR16 (). Robustness can be achieved by trimming the penalized log-likelihood function, and using weights as proposed in the context of robust logistic regression Croux03 (); Bianco96 (). These weights can also be applied in a reweighting step which increases the efficiency of the robust elastic net logistic regression estimator.

The outline of this paper is as follows. In Section 2, we introduce the robust and sparse linear regression estimator and provide a detailed algorithm for its computation. Section 3 presents the robust elastic net logistic regression estimator. Some important details which are different from the linear regression algorithm are mentioned here. Section 4 explains how the tuning parameters for the proposed estimators can be selected; we prefer an approach based on cross-validation. Since LTS estimators possess a rather low statistical efficiency, a reweighting step is introduced in Section 5 to increase the efficiency. The properties of the proposed estimators are investigated in simulation studies in Section 6, and Section 7 shows the performance on real data examples. Section 8 provides some insight into the computation time of the algorithms, and the final Section 9 concludes.

## 2 Robust and sparse linear regression with elastic net penalty

A robust and sparse elastic net estimator in linear regression can be defined with the objective function

 Q(H,β)=∑i∈H(yi−xTiβ)2+hλPα(β) (10)

where with , , and indicates the elastic net penalty with as in Equation (3). We call this estimator the enet-LTS estimator, since it uses a trimmed sum of squared residuals, like the sparse LTS estimator (5). The minimum of the objective function (10) determines the optimal subset of size ,

 Hopt=argminH⊆1,2,…,n:|H|=hQ(H,^βH), (11)

which is supposed to be outlier-free. The coefficient estimates depend on the subset . The enet-LTS estimator is given for this subset by

 ^βenetLTS=argminQ(Hopt,β). (12)

It is not trivial to identify this optimal subset, and practically one has to use an algorithm to approximate the solution. This algorithm uses C-steps: Suppose that the current -subset in the th iteration of the algorithm is denoted by , and the resulting estimator by . Then the next subset is formed by the indexes of those observations which correspond to the smallest squared residuals

 r2k,i=(yi−xTi^βHk)2, for i=1,…,n. (13)

If denotes the estimator based on , then by construction of the -subsets it follows immediately:

 Q(Hk+1,^βHk+1)≤Q(Hk+1,^βHk)≤Q(Hk,^βHk) (14)

This means that the C-steps decrease the objective function (10) successively, and lead to a local optimum after convergence. The global optimum is approximated by performing the C-steps with several initial subsets. However, in order to keep the runtime of the algorithm low, it is crucial that the initial subsets are chosen carefully. As motivated in Alfons13 (), for a certain combination of the penalty parameters and , elemental subsets are created consisting of the indexes of three randomly selected observations. Using only three observations increases the possibility of having no outliers in the elemental subsets. Let us denote these elemental subsets by

 Hsel={js1,js2,js3}, (15)

where . The resulting estimators based on the three observations are denoted by . Now the squared residuals can be computed for all observations , and two C-steps are carried out, starting with the -subset defined by the indexes of the smallest squared residuals. Then only those -subsets with the smallest values of the objective function (10) are kept as candidates. With these candidate subsets, the C-steps are performed until convergence (no further decrease), and the best subset is defined as that one with the smallest value of the objective function. This best subset also defines the estimator for this particular combination of and .

Basically, one can apply this procedure now for a grid of values in the interval and . Practically, this may still be quite time consuming, and therefore, for a new parameter combination, the best subset of the neighboring grid value of and/or , is taken, and the C-steps are started from this best subset until convergence. This technique, called warm starts, is repeated for each combination over the grid of and values, and thus the start based on the elemental subsets is carried out only once.

The choice of the optimal tuning parameters and is detailed in Section 4. The subset corresponding to the optimal tuning parameters is the optimal subset of size . The enet-LTS estimator is then calculated on the optimal subset with and .

## 3 Robust and sparse logistic regression with elastic net penalty

Based on the definition (9) of the elastic net logistic regression estimator, it is straightforward to define the objective function of its robust counterpart based on trimming,

 Q(H,β)=∑i∈Hd(xTiβ,yi)+hλPα(β), (16)

where again with , and is the elastic net penalty as defined in Equation (3). As outlined in the last Section 2, the task is to find the optimal subset which minimizes the objective function and defines the robust sparse elastic net estimator for logistic regression. It turns out that the algorithm explained previously in the linear regression setting can be successfully used to find the approximative solution. In the following we will explain the modifications that need to be carried out.

C-steps:

In the linear regression case, the C-steps were based on the squared residuals (13). Now the -subsets are determined according to the indexes of those observations with the smallest values of the deviances . However, here it needs to be made sure that the original group sizes are in the same proportion. Denote and the number of observations in both groups, with . Then and define the group sizes in each -subset. A new -subset is created with the indexes of the smallest deviances and with the indexes of the smallest deviances .

Elemental subsets:

In the linear regression case, the elemental subsets consisted of the indexes of three randomly selected observations, see (15). Now four observations are randomly selected to form the elemental subsets, two from each group. This allows to compute the estimator, and the two C-steps are based on the smallest values of the deviances. As before, this is carried out for 500 elemental subsets, and only the “best” 10 -subsets are kept. Here, “best” refers to an evaluation that is borrowed from a robustified deviance measure proposed in Croux and Haesbroeck Croux03 () in the context of robust logistic regression (but not in high dimension). These authors replace the deviance function (8) used in (7) by a function to define the so-called Bianco Yohai (BY) estimator

 ^βBY=argminβn∑i=1φ(xTiβ;yi), (17)

a highly robust logistic regression estimator, see also Bianco96 (). The form of the function is shown in Figure 1, see Croux03 () for details.

We use this function as follows: Positive scores of group 1, i.e. , refer to correct classification and receive the highest values for , while negative scores refer to misclassification, with small or zero values. For the scores of group 0 we have the reverse behavior, see Figure 1. When evaluating an -subset, the sum over the values of for is computed, and this sum should be as large as possible. This means that we aim at identifying an -subset where the groups are separated as much as possible. Points on the wrong side have almost no contribution, but also the contribution of outliers on the correct side is bounded. In this way, outliers will not dominate the sum.

With the best 10 -subsets we continue the C-steps until convergence. Finally, the subset with the largest sum over all forms the best index set.

The selection of the optimal parameters and is discussed in Section 4. The subset corresponding to these optimal tuning parameters is defined as the optimal subset of size . The enet-LTS logistic regression estimator is then calculated on the optimal subset with and .

Note that at the beginning of the algorithm for linear regression, the predictand is centered, and the predictor variables are centered robustly by the median and scaled by the MAD. Within the C-steps of the algorithm, we additionally mean-center the response variable and scale the predictors by their arithmetic means and standard deviations, calculated on each current subset, see also Alfons13 (). The same procedure is applied for logistic regression, except for centering the predictand. In the end, the coefficients are back-transformed to the original scale.

## 4 Selection of the tuning parameters

Sections 2 and 3 outlined the algorithms to arrive at a best subset for robust elastic net linear and logistic regression, for each combination of the tuning parameters and . In this section we define the strategy to select the optimal combination and , leading to the optimal subset. For this purpose we are using -fold cross-validation (CV) on those best subsets of size , with . In more detail, for -fold CV, the data are randomly split into blocks of approximately equal size. In case of logistic regression, each block needs to consist of observations from both classes with approximately the same class proportions as in the complete data set. Each block is left out once, the model is fitted to the “training data” contained in the blocks, using a fixed parameter combination for and , and it is applied to the left-out block with the “test data”. In this way, fitted values are obtained from models, and they are compared to the corresponding original response by using the following evaluation criteria:

• For linear regression we take the root mean squared prediction error (RMSPE)

 RMSPE(α,λ)= ⎷1hh∑i=1r2i(^βα,λ) (18)

where presents the test set residuals from the models estimated on the training sets with a specific and (for simplicity we omitted here the index denoting the models where the -th block was left out and the corresponding test data from this block).

• For logistic regression we use the mean of the negative log-likelihoods or deviances (MNLL)

 MNLL(α,λ)=1hh∑i=1di(^βα,λ), (19)

where presents the test set deviances from the models estimated on the training sets with a specific and .

Note that the evaluation criteria given by (18) and (19) are robust against outliers, because they are based on the best subsets of size , which are supposed to be outlier free.

In order to obtain more stable results, we repeat the -fold CV five times and take the average of the corresponding evaluation measure. Finally, the optimal parameters and are defined as that couple for which the evaluation criterion gives the minimal value. The corresponding best subset is determined as the optimal subset.

Note that the optimal couple and is searched on a grid of values and . In our experiments we used 41 equally spaced values for , and was varied in steps of size . For determining in the linear regression case we used the same approach as in Alfons et al. Alfons13 () which is based on the Pearson correlation between and the th predictor variable on winsorized data. For logistic regression we replaced the Pearson correlation by a robustified point-biserial correlation: denote by and the group sizes of the two groups, and by and the medians of the th predictor variable for the data from the two groups, respectively. Then the robustified point-biserial correlation between and is defined as

where is the MAD of , and .

## 5 Reweighting step

The LTS estimator has a low efficiency, and thus it is common to use a reweighting step RousseeuwL03 (). This idea is also used for the estimators introduced here. Generally, in a reweighting step the outliers according to the current model are identified and downweighted. For the linear regression model we will use the same reweighting scheme as proposed in Alfons et al. Alfons13 (), which is based on standardized residuals. In case of logistic regression we compute the Pearson residuals which are approximately standard normally distributed and given by

 rsi=yi−πiπi(1−πi), (20)

with the conditional probabilities from (6).

For simplicity, denote the standardized residuals from the linear regression case also by . Then the weights are defined by

 wi={1, if |rsi|≤Φ−1(1−δ)0, if |rsi|>Φ−1(1−δ)i=1,2,…,n, (21)

where , such that of the observations are flagged as outliers in the normal model. The reweighted enet-LTS estimator is defined as

 ^βreweighted=argminβ{n∑i=1wif(xi;yi)+λupdnwPαopt(β)}, (22)

where , stands for the vector of binary weights (according to the current model), , and corresponds to squared residuals for linear regression or to the deviances in case of logistic regression. Since , and because the optimal parameters and have been derived with observations, the penalty can act (slightly) differently in (22) than for the raw estimator. For this reason, the parameter has to be updated, while the regulating the tradeoff between the and penalty is kept the same. The updated parameter is determined by -fold CV, with the simplification that is already fixed.

## 6 Simulation studies

In this section, the performance of the new estimators is compared with different sparse estimators in different scenarios. We consider both the raw and the reweighted versions of the enet-LTS estimators, and therefore aim to show how the reweighting step improves the methods. The raw and reweighted enet-LTS estimators are compared with their classical, non-robust counterparts, the linear and logistic regression estimators with elastic net penalty Friedman10 (). In case of linear regression we also compare with the reweighted sparse LTS estimator of Alfons13 (). All robust estimators are calculated taking the subset size such that their performances are directly comparable.

For each replication, we choose the optimal tuning parameters and over the grids and with 5-times repeated -fold CV as described in Section 4. To select the tuning parameters for the classical estimators with elastic net penalty, we first draw the same grid for , namely , with 41 equally spaced grid points. Then we use -fold CV as provided by the R package glmnet, which automatically checks the model quality for a sequence of values for , taking the mean squared error as an evaluation criterion. Finally, the tuning parameters corresponding to the smallest value of the minimum cross-validated error are determined as the optimal tuning parameters. In order to be coherent with our evaluation, the tuning parameters for the sparse LTS estimator are determined in the same way as for the enet-LTS estimator. All simulations are carried out in R R ().

Note that we simulated the data sets with intercept. As described at the end of Section 3, the data are centered and scaled at the beginning of the algorithm and only in the final step the coefficients are back-transformed to the original scale, where also the estimate of the intercept is computed.

Sampling schemes for linear regression: Let us consider two different scenarios by means of generating a “low dimensional” data set with and and a “high dimensional” data set with and . We generate a data matrix where the variables are forming correlated blocks, , where , and have the dimensions , and , with . Such a block structure can be assumed in many application, and it mimics different underlying hidden processes. The observations of the blocks are generated independently from each other, from a multivariate normal distribution with , , , with , , , and with , , , respectively. While the first two blocks belong to the informative variables with sizes of and , the third block represents uninformative variables with . Furthermore, we take to allow for a high correlation among the informative variables, and to have low correlation among the uninformative variables.

To create sparsity, the true parameter vector consists of zeros for the last 90% of the entries referring to the uninformative variables, while the first 10% of the entries are assigned to one. The response variable is calculated by

 yi=1+xTiβ+εi, (23)

where the error term is distributed according to a standard normal distribution , for .

This is the design for the simulations with clean data. For the simulation scenarios with outliers we replace the first of the observations of the block of informative variables by values coming from independent normal distributions for each variable. Further, the error terms for these outliers are replaced by values from instead of , where represents the estimated standard deviation of the clean predictand vector. In this way, the contaminated data consist of both vertical outliers and leverage points.

Sampling schemes for logistic regression: We also consider two different scenarios for logistic regression, a “low dimensional” data set with and and a “high dimensional” data set with and . The data matrix is , where has the dimension and is of dimension , with . The data matrices are generated independently from with , , , and with , , , respectively. While the first block consists of the informative variables with , the second block represents uninformative variables with . We take for a high correlation among the informative variables, and for moderate correlation among the uninformative variables.

The coefficient vector consists of ones for the first 10% of the entries, and zeros for the remaining uninformative block. The elements of the error term are generated independently from . The grouping variable is then generated according to the model

 (24)

With this setting, both groups are of approximately the same size.

Contamination is introduced by adding outliers only to the informative variables. Denote the number of observations in class 0. Then the first observations of group 0 are replaced by values generated from . In order to create “vertical” outliers in addition to leverage points, we assign those first observations of class 0 a wrong class membership.

Performance measures: For the evaluation of the different estimators, training and test data sets are generated according to the explained sampling schemes. The models are fit to the training data and evaluated on the test data. The test data are always generated without outliers.

As performance measures we use the root mean squared prediction error (RMSPE) for linear regression,

 RMSPE(^β)= ⎷1nn∑i=1(yi−^β0−xTi^β)2, (25)

and the mean of the negative log-likelihoods or deviances (MNLL) for logistic regression,

 MNLL(^β)=1nn∑i=1d(^β0+xTi^β,yi), (26)

where and , , indicate the observations of the test data set, denotes the coefficient vector and stands for the estimated intercept term obtained from the training data set. In logistic regression we also calculate the misclassification rate (MCR), defined as

 MCR=mn (27)

where is the number of misclassified observations from the test data after fitting the model on the training data. Further, we consider the precision of the coefficient estimate as a quality criterion, defined by

 PRECISION(^β)= ⎷p∑i=0(βi−^βi)2, (28)

In order to compare the sparsity of the coefficient estimators, we evaluate the False Positive Rate (FPR) and the False Negative Rate (FNR), defined as

 FPR(^β)=|{j=0,…,p:^βj≠0∧βj=0}||{j=0,…,p:βj=0}|, (29)
 FNR(^β)=|{j=0,…,p:^βj=0∧βj≠0}||{j=0,…,p:βj≠0}|. (30)

The FPR is the proportion of non-informative variables that are incorrectly included in the model. On the other hand, the FNR is the proportion of informative variables that are incorrectly excluded from the model. A high FNR usually has a bad effect on the prediction performance since it inflates the variance of the estimator.

These evaluation measures are calculated for the generated data in each of 100 simulation replications separately, and then summarized by boxplots. The smaller the value for these criteria, the better the performance of the method.

Results for linear regression: The outcome of the simulations for linear regression is summarized in Figures 25. The left plots in these figures are for the simulations with low dimensional data, and the right plots for the high dimensional configuration. Figure 2 compares the RMSPE. All methods yield similar results in the low dimensional non-contaminated case, while in the high dimensional clean data case the elastic net method is clearly better. However, in the contaminated case, elastic net leads to poor performance, which is also the case for sparse LTS. Enet-LTS performs even slightly better with contaminated data, and there is also a slight improvement visible in the reweighted version of this estimator. The PRECISION in Figure 3 shows essentially the same behavior. The FPR in Figure 4, reflecting the proportion of incorrectly added noise variables to the models, shows a very low rate for sparse LTS. Here, the elastic net even improves in the contaminated setting, and the same is true for enet-LTS. A quite different picture is shown in Figure 5 with the FNR. Sparse LTS and elastic net miss a high proportion of informative variables in the contaminated data scenario, which is the reason for their poor overall performance. Note that the outliers are placed in the informative variables, which seems to be particularly difficult for sparse LTS.

Results for logistic regression: Figures 610 summarize the simulation results for logistic regression. As before, the left plots refer to the low dimensional case, and the right plots to the high dimensional data. Within one plot, the results for uncontaminated and contaminated data are directly compared. The misclassification rate in Figure 6 is around 10% for all methods, and it is slightly higher in the high dimensional situation. In case of contamination, however, this rate increases enormously for the classical method elastic net.

The average deviances in Figure 7 show that the reweighting of the enet-LTS estimator clearly improves the raw estimate in both the low and high dimensional cases. It can also be seen that elastic net is sensitive to the outliers. The precision of the parameter estimates in Figure 8 reveal a remarkable improvement for the reweighted enet-LTS estimator compared to the raw version, while there is not any clear effect of the contamination on the classical elastic net estimator.

The FPR in Figure 9 shows a certain difference between uncontaminated and contaminated data for the elastic net, but otherwise the results are quite comparable. A different picture is visible from the FNR in Figure 10, where especially in the low dimensional case the elastic net is very sensitive to the outliers. Overall we conclude that the enet-LTS performs very well in case of contamination even though this was not clearly visible in the precision, and it also yields reasonable results for clean data.

## 7 Applications to real data

In this section we focus on applications with logistic regression, and compare the non-robust elastic net estimator with the robust enet-LTS method. The model selection is conducted as described in Section 4. Model evaluation is done with leave-one-out cross validation, i.e. each observation is used as test observation once, a model is estimated on the remaining observations, and the negative log-likelihood is calculated for the test observation. In these real data examples it is unknown if outliers are present. In order to avoid an influence of potential outliers on the evaluation of a model, the 25% trimmed mean of the negative log-likelihoods is calculated to compare the models.

### 7.1 Analysis of meteorite data

The time-of-flight secondary iron mass spectroscope COSIMA kissel2007cosima () was sent to the comet Churyumov-Gerasimenko in the Rosetta space mission by the ESA to analyze the elemental composition of comet particles which were collected there schulz2015comet (). As reference measurements, samples of meteorites provided by the Natural History Museum Vienna were analyzed with the same type of spectroscope at Max Planck Institute for Solar System Research in Göttingen.

Here we apply our proposed method for logistic regression to the measurements from particles from the meteorites Ochansk and Renazzo with 160 and 110 spectra, respectively. We restrict the mass range to 1-100mu, consider only mass windows where inorganic and organic ions can be expected as described in Varmuza11 () and variables with positive median absolute deviation. So we obtain variables. Further, the data is normalized to have constant row sum 100.

Table 1 summarizes the results for the comparison of the methods. The trimmed MNLL is much smaller for the enet-LTS estimator than for the classical elastic net method. The reweighting step improves the quality of the model further. The selected tuning parameter is much smaller for enet-LTS than for the classical elastic net method which strongly influences the number of variables in the models.

Figure 11 compares the Pearson residuals of the elastic net model and the enet-LTS model. In the classical approach no abnormal observations can be detected. With the enet-LTS model several observations are identified as outliers by the 1.25% and 98.25% quantiles of the standard normal distribution, which are marked as horizontal lines in Figure 11. Closer investigation showed that these spectra lie on the outer border of the measurement area and are potentially measured on the target instead of the meteorite particle. Their multivariate structure for those variables which are included in the model is visualized in Figure 12, where we can see that in some variables they have particularly large values compared to the majority of the group.

### 7.2 Analysis of the glass vessels data

Archaeological glass vessels where analyzed with electron-probe X-ray micro-analysis to investigate the chemical concentrations of elements in order to learn more about their origin and the trade market at the time of their making in the 16 and 17 century Janssens98 (). Four different groups were identified, i.e. sodic, potassic, potasso-calcic and calcic glass vessels. For demonstration of the performance of logistic regression, two groups are selected from the glass vessels data set. The first group is the potassic group with 15 spectra, the second group the potasso-calcic group with 10 spectra. As in Filzmoser08 () we remove variables with MAD equal to zero, resulting in variables.

The quality of the selected models is described in Table 2. The trimmed mean of the negative log likelihoods is much smaller for enet-LTS than for elastic net. The reweighting step in enet-LTS hardly improves the model, but includes more variables. Again, both enet-LTS models include more variables than the elastic net model. In the elastic net model the penalty gives higher emphasis on the term, i.e. ; for enet-LTS it is .

Different behavior of the coefficient estimates can be expected. Figure 13 (left) shows coefficients of the reweighted enet-LTS model corresponding to variables associated with potassium and calcium. The band which is associated with potassium has positive coefficients, i.e. high values of these variables correspond to the potassic group which is coded with ones in the response. High values of the variables in the band which is associated with calcium will favor a classification to the potasso-calcic group (coded with zero), since the coefficients for these variables are negative. Further, it can be observed that neighboring variables, which are correlated, have similar coefficients. This is favored by the term in the elastic net penalty. In Figure 13 (right) the coefficient estimates of the elastic net model are visualized. Fewer coefficients are non-zero than for enet-LTS which was favored by the term in the elastic net penalty, but in the second block of non-zero coefficients neighboring variables receive very different coefficient estimates.

## 8 Computation time

For our algorithm we employ the classical elastic net estimator as it is implemented in the R package FriedmanR16 (). So, it is natural to compare the computation time of our algorithm with this method. In the linear regression case we also compare with the sparse LTS estimator implemented in the R package AlfonsR13 (). For calculating the estimators we take a grid of five values for both tuning parameters and . The data sets are simulated as in Section 6 for a fixed number of observations , but for a varying number of variables in a range from to . In Figure 14 (left: linear regression, right: logistic regression), the CPU time is reported in seconds, as an average over replications. In order to show the dependency on the number of observations , we also simulated data sets for a fixed number of variables with a varying number of observations . The results for linear and logistic regression are summarized in Figure 15. The computations have been performed on an Intel Core 2 Q9650 @ GHz4 processor.

Let us first consider the dependency of the computation time on the number of variables for linear regression, shown in the left plot of Figure 14. Sparse LTS increases strongly with the number of variables since it is based on the LARS algorithm which has a computational complexity of Efron2004 (). Also for the smallest number of considered variables, the computation time is considerably higher than for the other two methods. The reason is that for each value of and each step in the CV the best subset is determined starting with 500 elemental subsets. In this setting at least 25,000 estimations of a Lasso model are needed, because for each cross validation step at each of the 5 values of , two C-steps for 500 elemental subsets are carried out, and for the 10 subsamples with lowest value of the objective function, further C-steps are performed. In contrast, the enet-LTS estimator starts with 500 elemental subsets only for one combination of and , and takes the warm start strategy for subsequent combinations. This saves computation time, and there is indeed only a slight increase with visible when compared to the elastic net estimator. In total approximately 1,700 elastic net models are estimated in this procedure, which are considerably fewer than for the sparse LTS approach. The computation time of sparse LTS also increases with due to the computational complexity of LARS, while the increase is only minor for enet-LTS, see Figure 15 (left).

The results for the computation time in logistic regression are presented in Figure 14 (right) and 15 (right). Here we can only compare the classical elastic net estimator and the proposed robustified enet-LTS version. The difference in computation time between elastic net and enet-LTS is again due to the many calls of the glmnet function within enet-LTS. The robust estimator is considerably slower in logistic regression when compared to linear regression for the same number of explanatory variables or observations. The reason is that more C-steps are necessary to identify the optimal subset for each parameter combination of and .

## 9 Conclusions

In this paper, robust methods for linear and logistic regression using the elastic net penalty were introduced. This penalty allows for variable selection, can deal with high multicollinearity among the variables, and is thus very appropriate in high dimensional sparse settings. Robustness has been achieved by using trimming. This usually leads to a loss in efficiency, and therefore a reweighting step was introduced. Overall, the outlined algorithms for linear and logistic regression turned out to yield good performance in different simulation settings, but also with respect to computation time. Particularly, it was shown that the idea of using “warm starts” for parameter tuning allows to save computation time, while the precision is still preserved. A competing method for robust high dimensional linear regression, the sparse LTS estimator AlfonsR13 (), does not use this idea, and is thus much less attractive concerning computation time, especially in case of many explanatory variables. We should also admit that for other simulation settings (not shown here), the difference between sparse LTS and the enet-LTS estimator is not so big, or even marginal, depending on the exact setting.

For this reason, a further focus was on the robust high dimensional logistic regression case. We consider such a method as highly relevant, since in many modern applications in chemometrics or bio-informatics, one is confronted with data information from two groups, with the task to find a classification rule and to identify marker variables which support the rules. Outliers in the data are frequently a problem, and they can affect the identification of the marker variables as well as the performance of the classifier. For this reason it is desirable to treat outliers appropriately. It was shown in simulation studies as well as in data examples, that in presence of outliers the new proposal still works well, while its classical non-robust counterpart can lead to poor performance.

Note that in Park16 () a logistic regression method with elastic net penalty is proposed using weights to reduce the influence of outliers. Their approach is to perform outlier detection in a PCA space, obtain weights based on robust Mahalanobis distances in the PCA score space and derive weights from these distances. These weights are then used to down-weight the negative log likelihoods in the penalized objective function to reduce the influence of outliers. However, it is not guaranteed that outliers can be detected in the PCA score space. An increasing number of uninformative variables will disguise observations deviating from the majority only in few informative variables, but these hidden outlying observations can still distort the model. Therefore, model based outlier detection is highly recommended as proposed in our algorithm.

The algorithms to compute the proposed estimators are implemented in R functions, which are available upon request from the authors. The basis for the computation of the robust estimator is the R package FriedmanR16 (). This package also implements the case of multinomial and Poisson regression. Naturally, a further extension of the algorithms introduced here could go into these directions. Further work will be devoted to the theoretical properties of the family of enet-LTS estimators.

## Acknowledgments

This work is partly supported by the Austrian Science Fund (FWF), project P 26871-N20 and by grant TUBITAK 2214/A from the Scientific and Technological Research Council of Turkey (TUBITAK).

The authors thank F. Brandstätter, L. Ferrière, and C. Koeberl (Natural History Museum Vienna, Austria) for providing meteorite samples, C. Engrand (Centre de Sciences Nucléaires et de Sciences de la Matière, Orsay, France) for sample preparation, and M. Hilchenbach (Max Planck Institute for Solar System Research, Göttingen, Germany) for TOF-SIMS measurements. The authors are grateful to Kurt Varmuza for valuable feedback on the results of the meteorite data.

## References

• (1) A. Hoerl, R. Kennard, Ridge regression: Biased estimation for nonorthogonal problems, Technometrics 12 (1970) 55–67.
• (2) R. Tibshirani, Regression shrinkage and selection via the lasso, Journal of the Royal Statistical Society: Series (Methodological) 58(1) (1996) 267–288.
• (3) H. Zou, T. Hastie, Regularization and variable selection via the elastic net, Journal of the Royal Statistical Society: Series B 67(2) (2005) 301–320.
• (4) P. Filzmoser, M. Gschwandtner, V. Todorov, Review of sparse methods in regression and classification with application to chemometrics, Journal of Chemometrics 26(3–4) (2012) 42–51.
• (5) Y.-Z. Liang, O. Kvalheim, Robust methods for multivariate analysis – a tutorial review, Chemometrics and Intelligent Laboratory Systems 32 (1) (1996) 1–10.
• (6) K. Liang, Y.Z. anf Fang, Robust multivariate calibration algorithm based on least median of squares and sequential number theory optimization method, Analyst 121 (8) (1996) 1025–1029.
• (7) R. Maronna, R. Martin, V. Yohai, Robust Statistics: Theory and Methods, Wiley, New York, 2006.
• (8) P. Rousseeuw, A. Leroy, Robust Regression and Outlier Detection, Wiley 2nd edition: John Wiley & Sons, New York, 2003.
• (9) P. J. Rousseeuw, Least median of squares regression, Journal of the American Statistical Association 79(388) (1984) 871–880.
• (10) P. J. Rousseeuw, K. Van Driessen, Computing LTS regression for large data sets, Data Mining and Knowledge Discovery 12(1) (2006) 29–45.
• (11) A. Alfons, C. Croux, S. Gelper, Sparse least trimmed squares regression for analyzing high-dimensional large data sets, The Annals of Applied Statistics 7(1) (2013) 226–248.
• (12) A. Alfons, robustHD: Robust methods for high dimensional data, R Foundation for Statistical Computing, Vienna, Austria, R package version 0.4.0 (2013).
URL http://CRAN.R-project.org/package=robustHD
• (13) J. Friedman, T. Hastie, R. Tibshirani, Regularization paths for generalized linear models via coordinate descent, Journal of Statistical Software 33(1) (2010) 1–22.
• (14) A. Albert, J. Anderson, On the existence of maximum likelihood estimates in logistic regression models, Biometrika 71 (1984) 1–10.
• (15) J. Friedman, T. Hastie, N. Simon, R. Tibshirani, glmnet: Lasso and Elastic Net Regularized Generalized Linear Models, R Foundation for Statistical Computing, Vienna, Austria, R package version 2.0-5 (2016).
URL http://CRAN.R-project.org/package=glmnet
• (16) C. Croux, G. Haesbroeck, Implementing the Bianco and Yohai estimator for logistic regression, Computational Statistics and Data Analysis 44(1–2) (2003) 273–295.
• (17) V. Bianco, A.M. Yohai, Robust Estimation in Logistic Regression Model, in robust Statistics, Data Analysis, and Computer Intensive Methods, 17–34; Lecture Notes in Statistics 109, Springer Verlag, Ed. H., Rieder: New York, 1996.
• (18) R Development Core Team, R: A Language and Environment for Statistical Computing, Vienna, Austria, R Foundation for Statistical Computing, Vienna, Austria, ISBN 3-900051-07-0 (2017).
URL http://www.R-project.org
• (19) J. Kissel, K. Altwegg, B. Clark, L. Colangeli, H. Cottin, S. Czempiel, J. Eibl, C. Engrand, H. Fehringer, B. Feuerbacher, et al., COSIMA–high resolution time-of-flight secondary ion mass spectrometer for the analysis of cometary dust particles onboard Rosetta, Space Science Reviews 128(1-4) (2007) 823–867.
• (20) R. Schulz, M. Hilchenbach, Y. Langevin, J. Kissel, J. Silen, C. Briois, C. Engrand, K. Hornung, D. Baklouti, A. Bardyn, et al., Comet 67P/Churyumov-Gerasimenko sheds dust coat accumulated over the past four years, Nature 518(7538) (2015) 216–218.
• (21) K. Varmuza, C. Engrand, P. Filzmoser, M. Hilchenbach, J. Kissel, H. Krüger, J. Silén, M. Trieloff, Random projection for dimensionality reduction - applied to time-of-flight secondary ion mass spectrometry data, Analytica Chimica Acta 705(1) (2011) 48–55.
• (22) K. Janssens, I. Deraedt, A. Freddy, J. Veekman, Composition of 15-17 th century archeological glass vessels excavated in Antwerp, Belgium, Mikrochima Acta 15 (1998) 253–267.
• (23) P. Filzmoser, R. Maronna, W. M., Outlier identification in high dimensions, Computational Statistics & Data Analysis 52(3) (2008) 1694–1711.
• (24) B. Efron, T. Hastie, I. Johnstone, R. Tibshirani, Least angle regression, The Annals of Statistics 32(2) (2004) 407–499.
• (25) Robust logistic regression modelling via the elastic net-type regularization and tuning parameter selection, Journal of Statistical Computation and Simulation 86 (7) (2016) 1450–1461.
You are adding the first comment!
How to quickly get a good reply:
• Give credit where it’s due by listing out the positive aspects of a paper before getting into which changes should be made.
• Be specific in your critique, and provide supporting evidence with appropriate references to substantiate general statements.
• Your comment should inspire ideas to flow and help the author improves the paper.

The better we are at sharing our knowledge with each other, the faster we move forward.
The feedback must be of minimum 40 characters and the title a minimum of 5 characters