[

[

[
Abstract

Motivation: The high dimensionality of genomic data calls for the development of specific classification methodologies, especially to prevent over-optimistic predictions. This challenge can be tackled by compression and variable selection, which combined constitute a powerful framework for classification, as well as data visualization and interpretation. However, current proposed combinations lead to unstable and non convergent methods due to inappropriate computational frameworks. We hereby propose a computationally stable and convergent approach for classification in high dimensional based on sparse Partial Least Squares (sparse PLS).
Results: We start by proposing a new solution for the sparse PLS problem that is based on proximal operators for the case of univariate responses. Then we develop an adaptive version of the sparse PLS for classification, called logit-SPLS, which combines iterative optimization of logistic regression and sparse PLS to ensure computational convergence and stability. Our results are confirmed on synthetic and experimental data. In particular we show how crucial convergence and stability can be when cross-validation is involved for calibration purposes. Using gene expression data we explore the prediction of breast cancer relapse. We also propose a multicategorial version of our method, used to predict cell-types based on single-cell expression data.
Availability: Our approach is implemented in the plsgenomics R-package.
Contact: ghislain.durif@inria.fr
Supplementary information: Supplementary materials are available at Bioinformatics online.

Adaptive Sparse PLS for Logistic Regression]High Dimensional Classification with combined Adaptive Sparse PLS and Logistic Regression G. Durif et al.]G. Durif , L. Modolo , J. Michaelsson , J. E. Mold , S. Lambert-Lacroix  and F. Picard 

1 Introduction

Molecular classification is at the core of many recent studies based on Next-Generation Sequencing data. For instance, the genomic characterization of diseases based on genomic signatures has been one Grail for many studies to predict patient outcome, survival or relapse (Guedj et al., 2012). Moreover, following the recent advances of sequencing technologies, it is now possible to isolate and sequence the genetic material from a single cell (Stegle et al., 2015). Single-cell data give the opportunity to characterize the genomic diversity between the individual cells of a specific population. However, in both cases, the specific context of high dimensionality constitutes a major challenge for the development of new statistical methodologies (Marimont and Shapiro, 1979; Donoho, 2000). Indeed, the number of recorded variables (as gene expression) being far larger than the sample size , classical regression or classification methods are inappropriate (Aggarwal et al., 2001; Hastie et al., 2009), due to spurious dependencies between variables, that lead to singularities in the optimization processes, with neither unique nor stable solution.

This challenge calls for the development of specific statistical tools, such as the following dimension reduction approaches: Compression methods that search for a representation of the data in lower dimensional space and ) Variable selection methods, based on a parsimony hypothesis, i.e., among all recorded variables, a lot are supposed to be uninformative and can be considered as noise to be removed from the model. For instance, the Partial Least Squares (PLS) regression (Wold, 1975; Wold et al., 1983) is a compression approach appropriate for linear regression, especially with highly correlated covariates, that constructs new components, i.e. latent directions, explaining the response. An example of sparsity-based approach is the Lasso (Tibshirani, 1996) where coefficients of less relevant variables are shrunk to zero thanks to a penalty in the optimization procedure. Eventually, sparse PLS (SPLS) regression (Lê Cao et al., 2008; Chun and Keleş, 2010) combines both compression and variable selection to reduce dimension. It introduces a selection step based on the Lasso in the PLS framework, constructing new components as sparse linear combinations of predictors. It occurs as well that combining compression and ‘‘sparse’’ approach improves the efficiency of prediction and the accuracy of selection. Such an association (compression and selection) is also relevant for data visualization, a crucial challenge when considering high dimensional data. Existing SPLS methods are based on resolutions of approximations of the associated optimization problem. In this work, we first propose a new formulation of the sparse PLS optimization problem with a simple exact resolution, derived from proximal operators (Bach et al., 2012). We also introduce an adaptive sparsity-inducing penalty, inspired from the adaptive Lasso (Zou, 2006), to improve the variable selection accuracy.

SPLS has shown excellent performance for regression with a continuous response, but its adaptation to classification is not straightforward. Chung and Keleş (2010) or Lê Cao et al. (2011) proposed to use sparse PLS as a preliminary dimension reduction step before a standard classification method, such as discriminant analysis (SPLS-DA) or logistic regression, following previous approaches using classical PLS for molecular classification (Nguyen and Rocke, 2002; Boulesteix, 2004). Their approach gives interesting results in SNPs data analysis (Lê Cao et al., 2011) or in tumor classification (Chung and Keleş, 2010).

Another method for classification consists in using logistic regression (binary or multicategorial) (McCullagh and Nelder, 1989), for which optimization is achieved via the Iteratively Reweighted Least Squares (IRLS) algorithm (Green, 1984). However its convergence is not guaranteed especially in the high dimensional case. Computational convergence is a crucial issue when estimating parameters, as non-convergent methods may lead to unstable and inconsistent estimations, impacting analysis interpretation and reproducibility, especially when tuning hyper-parameters by cross-validation.

The combination of logistic regression and (sparse) PLS could lead to a classification method processing dimension reduction based on lower space representation and variable selection. However, the combination of such iterative algorithms is not necessarily straightforward, due to convergence issues. Performing compression with SPLS on the categorical response as a first step before logistic regression remains counter-intuitive, because SPLS was designed to handle a continuous response within homoskedastic models. Based on the generalized PLS by Marx (1996) or Ding and Gentleman (2005), Chung and Keleş (2010) proposed to use sparse PLS within the IRLS iterations to solve reweighted least squares at each step, however we will see that convergence issues remain. Fort and Lambert-Lacroix (2005) proposed to use a Ridge regularization (Eilers et al., 2001) to ensure the convergence of the IRLS algorithm and to use the classical PLS to estimate predictor coefficients by using a continuous pseudo-response generated by the IRLS algorithm. We will develop a similar approach based on sparse PLS.

Our new SPLS-based approach, called logit-SPLS, combines compression and variable selection in a GLM framework. We show the accuracy, the computational stability and convergence of our method, compared with other state-of-the-art approaches on simulations. Especially, we show that compression increases variable selection accuracy, and that our method is more stable regarding the choice of hyper-parameters by cross-validation, contrary to other methods processing classification with sparse PLS. Thus, our method is the only one that correctly performs considering all criteria (prediction, selection, stability), whereas all the other approaches present a weak spot. Our simulations illustrate the interest of both selection and compression over selection or compression only. Our work was implemented in the existing R-package plsgenomics, available on the CRAN.

We will first introduce our adaptive sparse PLS approach. Then, we will develop and discuss our classification framework based on Ridge IRLS and adaptive sparse PLS for logistic regression. We will finish by a comparative study and eventually two applications of our method: binary classification to predict breast cancer relapse after 5 years based on gene expression data, with an illustration of data visualization through compression, prediction of cell types with multinomial classification based on single-cell expression profiles. To do so, we extend our approach to the multi-group case, based on a ‘‘one-class vs a reference’’ type of multi-classification. One strength of our approach is to propose a sparse PLS that admits a closed-form solution in both binary and multi-group classifications. This leads to computationally efficient procedures in both cases, contrary to sparse PLS-DA approaches for instance, that are based on a multivariate response sparse PLS algorithm in the multi-group case, for which there is no closed-form solution (c.f. Chung and Keleş, 2010; Lê Cao et al., 2011).

2 Compression and selection in the GLM framework

We first define the sparse PLS and introduce a new formulation of the associated optimization problem, based on proximal operator. Contrary to existing approaches, this formulation provides a simple resolution of the covariance maximization problem associated to sparse PLS. Then, we propose an adaptive version of the sparse PLS selection step. Eventually, we will develop our approach to combine sparse PLS and logistic regression.

2.1 Proximal sparse PLS

Let be a -sample, with a continuous response and a set of covariates, gathered in the matrix . The PLS solves a linear regression problem. We consider centered data and to neglect the intercept and the model , with the coefficients . The metric in the observation space is weighted by the matrix .

In the univariate response case, the PLS (Boulesteix and Strimmer, 2007) consists in constructing components that explain the response, and defined as linear combinations of predictors, i.e. with weight vectors (). These weights are defined to maximize the empirical covariance of the corresponding components with the response . Other PLS algorithms consider the maximization of the squared covariance, however both definitions are equivalent in the univariate response case (De Jong, 1993; Boulesteix and Strimmer, 2007). To exclude the inherent noise induced by non pertinent covariates in the model, the sparse PLS (Lê Cao et al., 2008; Chun and Keleş, 2010) introduces a variable selection step into the PLS framework. It constructs ‘‘sparse’’ weight vectors, whose coordinates are required to be null for covariates that are irrelevant to explain the response. Following the Lasso principle (Tibshirani, 1996), the shrinkage to zero is achieved with a norm penalty in the covariance maximization problem:

(1)

under the constraints and orthogonality between components, with the sparsity parameter . The empirical covariance between and is explicitly , where is the empirical covariance , depending on the metric weighted by ( is a vector because the response is univariate).

Different methodologies (Lê Cao et al., 2008; Chun and Keleş, 2010) have been proposed to solve the optimization problem (2.1). However, both approaches give an approximate solution. We propose a new approach to exactly solve this problem in the univariate response case. In the standard PLS algorithm, is proven to be the dominant singular vector of the empirical covariance . In the univariate response case (PLS1 algorithm), is univariate and . Thus, we introduce the following equivalent formulation of the penalized problem (2.1):

(2)

under the constraints and orthogonality between components (the equivalence between (2.1) and (2.1) is shown in the Supp. Mat.). We consider a range of values for so that the problem  (2.1) admits a solution.

Resolution.

Applying the method of Lagrange multipliers, the problem (2.1) becomes ():

(3)

The method of Lagrange multipliers was proposed by Witten et al. (2009) or Tenenhaus et al. (2014) for different decomposition problems. The objective is continuous and convex, thus the strong duality holds and the solutions of primal (2.1) and dual (2.1) problems are equivalent. The resolution of the dual problem is based on proximal (or proximity) operators defined as the solution of the following problem (Bach et al., 2012):

(4)

for any fixed , any function . It is denoted by . When corresponds to the Elastic Net penalty (combination of and penalty), i.e. (with and ), the closed-form solution of problem (2.1) is explicitely given by the proximal operator whose coordinates are defined by (Yu, 2013, Theo. 4):

(5)

This corresponds to the normalized soft-thresholding operator applied to the covariance vector . When choosing so that has a unitary norm, the pair given by the proximal operator (2.1) with is a candidate point and then a solution (by convexity) for the dual problem (2.1). Hence, the SPLS weights used to construct the SPLS components are given by . This new resolution of the sparse PLS problem is a general result, that also applies in the case of the standard homoskedastic linear model by replacing by the identity matrix. This is consistent with the derivation of the sparse PLS by Chun and Keleş (2010), but provides a more direct resolution framework. In addition, is renormalized to lie in (c.f. Chun and Keleş, 2010).

The resolution of the problem (2.1) allows to compute and construct the first components . At step , is computed by solving Eq. 2.1, using a ‘‘deflated’’ version of and , i.e. the residuals of the respective regression of and onto the previous components , guaranteeing the orthogonality between components. The active set of selected variables up to component is a subset of , defined as the variables with a non null weights in , and denoted by . Eventually, the estimation of in the model is given by the weighted PLS regression of onto the selected variables in the active set . The coefficient is set to zero if the predictor is not in . Indeed, following the definition of the SPLS regression, the sparse structure of the weight vectors directly induces the sparse structure of . The variables selected to construct the new components are the ones that contribute the most to the response and correspond to those with non-null entries in the true vector .

2.2 Adaptive sparse PLS

We also propose to adjust the constraint to further penalize the less significant variables, which can lead to a more accurate selection process. Such an approach is inspired by component wise penalization as adaptive Lasso (Zou, 2006). We use the weights from classical PLS (without sparsity constraint) to adapt the penalty on the weight vector . The penalty in problem (2.1) becomes , with to account for the significance of the predictor (higher weights in absolute values correspond to more important variables). The closed-form solution accounts for the adaptive penalty and remains the soft-thresholding operator applied to but with parameter for predictor (c.f. Supp. Mat.). We called this method adaptive sparse PLS.

2.3 Ridge-based logistic regression and logistic regression

We now present our approach based on sparse PLS for logistic regression.

The Logistic Regression model.

We now consider a -sample with being a label variable in , gathered in . We use the Generalized Linear Models (GLM) framework (McCullagh and Nelder, 1989) to relate the predictors to the random response variable , using the logistic link function, such that with , , and . In the sequel, . With , the log-likelihood of the model is defined by , and the coefficients are estimated by maximum likelihood (MLE).

The Ridge IRLS algorithm.

Optimization relies on a Newton-Raphson iterative procedure (McCullagh and Nelder, 1989) to construct a sequence , whose limit (if it exists) is the estimation of . The Iteratively Reweighted Least Squares (IRLS) algorithm (Green, 1984) explicitly defines as the solutions of successive weighted least squares regressions of a pseudo-response onto the predictors at each iteration . The pseudo-response is linearly generated from the predictors based on previous iterations, c.f. Eq (2.3). However, when , the matrix is singular, which leads to optimization issues. Le Cessie and Van Houwelingen (1992) proposed to optimize a Ridge penalized log-likelihood, i.e. , with the diagonal empirical variance matrix of and the Ridge parameter. A unique solution of this regularized problem always exists and is computed by the Ridge IRLS (RIRLS) algorithm (Eilers et al., 2001), where the weighted regression at each IRLS iteration is replaced by a Ridge weighted regression, hence:

(6)

with the estimated probabilities , i.e. for each , and is the diagonal empirical variance matrix of at step .

Following the definition of , the (R)IRLS algorithm produces a pseudo-response as the limit of the sequence , verifying , where is the solution of the likelihood optimization, and is a noise vector of covariance matrix , with the limit of the matrix sequence .

Sparse PLS regression.

The pseudo-response produced by Ridge IRLS depends on predictors through a linear model. Following the approach by Fort and Lambert-Lacroix (2005), we propose to use the sparse PLS regression on to process dimension reduction and estimate in the logistic model . In this case, the metric (in the observation space) is weighted by the empirical inverse covariance matrix , to account for the heteroskedasticity of noise . To neglect the intercept in the SPLS step, we consider the centered version of and , regarding the metric weighted by , denoted by and . The intercept will be estimated later.

The estimates are renormalized to correspond to the non-centered and non-scaled data, i.e. giving the estimation in the original logistic model . The intercept is estimated by where and are respectively the sample average of the pseudo-response and the sample average vector of predictors regarding the metric weighted by . Our method can be summarized as follow:

  1. Center and regarding the scalar product weighted by

  2. Renormalization of

The label of new observations (non-centered and non-scaled) is predicted through the logit function thanks to the estimates . Note that does not need to be centered nor scaled thanks to the intercept parameter and to the renormalization of the coefficient estimates in the algorithm.

Our method estimates predictor coefficients in the logistic model by sparse PLS regression of a pseudo-response, considered as continuous and therefore in accordance with the conceptual framework of PLS, while completing compression and variable selection simultaneously. An additional interest is that the iterative optimization in the RIRLS algorithm does not depend on the number of components nor on the sparsity parameter . Consequently, the convergence of our method is robust to the choice of and by definition, contrary to other approaches for logistic regression based on sparse PLS (c.f. Supp. Mat. section A.3). Our approach will be called logit-SPLS in the following while the method by Fort and Lambert-Lacroix (2005) will be called logit-PLS.

2.4 Tuning sparsity by stability selection

Our logit-SPLS approach depends on three hyper-parameters: the sparsity parameter , the Ridge parameter and the number of components . We first propose to tune all the parameters by 10-fold cross-validation (to reduce the sampling dependence). Details about the choice of the grid of candidates values for are given in Supp. Mat. (c.f. sections A.5.1 and A.6.1).

In addition, we propose to adapt the stability selection method developed by Meinshausen and Bühlmann (2010), to the sparse PLS framework. The interest of this approach is to avoid choosing a value for the sparsity parameter to find the degree of the sparsity in the model, i.e. to select the relevant predictors. In this framework, the grid of all parameter candidate values for is denoted by . The principle consists in fitting the model for all points , then estimating the probability for each covariate to be selected over 100 resamplings of size depending on , i.e. the probability for predictor to be in the set , where are the corresponding estimated coefficients. Finally, the procedure retains the predictors that are in the set of stable selected variables, defined as }, where is a threshold value. This means that predictors with high selection probability are kept and predictors with low selection probability are discarded.

The average number of selected variables over the entire grid , is denoted by , and defined as . Meinshausen and Bühlmann (2010, Theo. 1) provided a bound on the expected number of wrongly stable selected variables (equivalent to false positives) in , depending on the threshold , the expectation and the number of covariates:

(7)

where FP is the number of false positives i.e. and the unknown set of true relevant variables. This results is derived under some reasonable conditions that are discussed in Supp. Mat. (section A.2). Following the recommendation of Meinshausen and Bühlmann (2010, p. 424), we use Eq. 2.4 to determine the range of the parameter grid to avoid too many false positives (corresponding to a weak penalization). Indeed, since the number of false positives is controlled by , we automatically exclude candidate points corresponding to small (near 0) for which there is no selection and for which all variables contribute to the mode, so that we can control . Without removing these points, and the number of false positives are too high. For instance, when the threshold probability is set to 0.9, is defined as a subset of the parameter grid so that . In practice, is unknown but can be estimated by the empirical average number of selected variables over all . In this context, the expected number of false positives will be lower than (in practice, we set ). Details about the candidate values for are given in Supp. Mat. (section A.6.2).

A clear interest here is that we do not have to choose a specific value for the hyper-parameters, instead we retain the variables that are selected by most of the models when exploring the grid of candidate values for hyper-parameters (including ).

3 Simulation study

We assess the performance of our approach for prediction, compression and variable selection compared to state-of-the-art methods that were previously introduced. We also use a ‘‘baseline’’ method, called GLMNET (Friedman et al., 2010), that performs variable selection, by solving the GLM likelihood maximization with norm penalty for selection and norm penalty for regularization, also known as the Elastic Net approach (Zou and Hastie, 2005). We compare different approaches based on (sparse) PLS for classification (c.f. Tab. 1 and Supp. Mat. section A.3 and A.4 for details).

  Method Algorithm Sparse? Reference   GPLS (S)PLS inside the IRLS algorithm Ding and Gentleman (2005) SGPLS Chung and Keleş (2010)   PLS-log (S)PLS before logistic regression Wang et al. (1999), Nguyen and Rocke (2002) SPLS-log Chung and Keleş (2010)   logit-PLS (S)PLS on the pseudo-response after the RIRLS algorithm Fort and Lambert-Lacroix (2005) logit-SPLS Our algorithm  

Table 1: The different algorithms to process dimension reduction by (sparse) PLS in the framework of the logistic regression.

Simulation design.

Our simulated data are constructed to assess the interest of compression and variable selection for prediction performance. The simulations are inspired from Zou et al. (2006), Shen and Huang (2008) or Chung and Keleş (2010). The purpose is to control the redundancy within predictors, and the relevance of each predictor to explain the response. We consider a predictor matrix of dimension , with fixed, and , so that we examine different high dimensional models. The true vector coefficients is generated to be sparse, the sparsity structure is thus known. Hence, it is possible to assess whether a method selects the relevant predictors or not. The response variable for observation is a Bernoulli variable, with parameter . The pattern of data simulation and the tuning of hyper-parameters are detailed in Supp. Mat. (section A.5). Regarding other methods, we use the range of parameters recommended by their respective authors and the cross-validation procedures supplied in the corresponding packages.

Ridge penalty ensures convergence.

Convergence is crucial when combining PLS and IRLS algorithm as pointed by Fort and Lambert-Lacroix (2005). With the analysis of high dimensional data and the use of selection in the estimating process, it becomes even more essential to ensure the convergence of the optimization algorithms, otherwise the output estimates may not be relevant. Our simulations show that the Ridge regularization systematically ensures the convergence of the IRLS algorithm in our method (logit-SPLS), for any configuration of simulation: , , high or low sparsity, high or low redundancy (see Tabs. A.3 and A.2 in Supp. Mat.). On the contrary, approaches that use (sparse) PLS before or within the IRLS algorithm (resp. SPLS-log and (S)GPLS) encounter severe convergence issues.

Whereas the SPLS-log or (S)GPLS approaches were designed to overcome convergence issues, it appears that they do not, which questions the reliability of the results supplied by these methods. Then, it confirms the interest of the Ridge regularization to ensure the convergence of the IRLS algorithm. Moreover, this convergence seems to be fast (around 15 iterations even when ), which depicts an interesting outcome for computational time. For instance, the tuning of three parameters in the logit-SPLS approach is less costly thanks to the fast convergence of the algorithm. Although both SGPLS and SPLS-log methods are based on two parameters, they iterates further (until the limit set by the user) which is less computationally efficient, especially with high dimensional data. On this matter, details regarding computation times are given in Supp. Mat. (section A.5.2).

Adaptive selection improves cross-validation stability.

A cross-validation procedure would be expected to be stable under multiple runs, i.e. the chosen values must not be variable when running the procedure many times on the same sample. Otherwise, selection and prediction become uncertain and not suitable for experiment reproducibility. We quantified the standard deviation of the sparse parameter chosen by cross-validation for the three sparse PLS methods (SGPLS, SPLS-log and our logit-SPLS) when repeating the procedure on the same samples. The standard deviation (all three methods consider the same range of values for ) is smaller for our approach (c.f. Tab. 2) than for other methods. Thus, the cross-validation procedure in our adaptive method is more stable than other SPLS approaches. A similar comment can be made regarding the choice of the number of components (c.f. Fig. A.1 in Supp. Mat.). This behavior can be linked to the convergence of the different approaches. The methods with convergence issues (SGPLS and SPLS-log) present a higher cross-validation instability, whereas our method (logit-SPLS) converges efficiently and shows a better cross-validation stability. Similarly, the variable selection accuracy, defined as the proportion of rightly selected and rightly non selected variables (Chong and Jun, 2005), is also influenced by the cross-validation stability and the convergence of the method. Indeed, the standard deviation of the selection accuracy (computed across multiple runs) is higher for the less stable and less convergent methods (SGPLS and SPLS-log) compared to our logit-SPLS approach (c.f. Tab. 2).

  Method   logit-spls sgpls 0.17 0.14 0.15 0.12 spls-log 0.23 0.12 0.21 0.17  

Table 2: Comparing computational stability between sparse PLS approaches. stands for the estimated standard deviation of the tuned hyper-parameter (over repetitions on the same simulated data set), which measures the stability of the hyper-parameter tuning by cross-validation. stands for the estimated standard deviation of the accuracy in variable selection, which measures the stability of the selection steps. The results are presented for different model dimensions ().

Compression and selection increase prediction accuracy.

We now assess the importance of compression and variable selection for prediction performance. We consider the prediction accuracy, evaluated through the prediction error rate. A first interesting point is that the prediction performance of compression methods is improved by the addition of a selection step: logit-SPLS, SGPLS and SPLS-DA perform better than logit-PLS, GPLS and PLS-DA respectively (c.f. Tab. 3). In addition, sparse PLS approaches also present a lower classification error rate than the GLMNET method that performs variable selection only. These two points support our claim that in any case compression and selection should be both considered for prediction. Similar results are observed for other configurations of simulated data (c.f. Supp. Mat. section A.5.2). All different SPLS-based approaches show similar prediction performance, even methods that are not converging (SPLS-log or SGPLS) compared to our adaptive approach logit-SPLS. Thus, checking prediction accuracy only may not be a sufficient criterion to assess the relevance of a method. The GPLS method is a good example of non-convergent method (c.f. Tab. 3 and Tab. A.2 in Supp. Mat.) that presents high variability and poor performance regarding prediction.

Actually, the combination of Ridge IRLS and sparse PLS in our method ensures convergence and provides good prediction performance (prediction error rate at 10% on average) even in the most difficult configurations and , which makes it an appropriate framework for classification.

  Method Prediction Selection Selection Selection error sensitivity specificity accuracy   gpls / / / pls-da / / / logit-pls / / /   glmnet 0.74 logit-spls sgpls spls-da spls-log  

Table 3: Prediction error and selection sensitivity/specificity (if relevant) when , for non-sparse or sparse approaches (delimited by the line). Results for other values of are joined in Supp. Mat. (section A.5.2).

Compression increases selection accuracy.

A sparse model will be useful if characterized by good prediction performances but also if the selected covariates are the genuine important predictors that explain the response. To assess the selection accuracy, we compare the selected predictors returned by each sparse method to the set of relevant ones used to construct the response, i.e. with a non zero coefficient in our simulation model. We consider sensitivity and specificity (Chong and Jun, 2005), respectively the proportion of true positive and true negative regarding the selected variables.

A first striking point is that, in our simulations (see Tab. 3 and Tabs. A.4A.5A.6 in Supp. Mat.), the baseline GLMNET presents a very low sensitivity and a very high specificity (low true positive and low false positive rates), meaning that it selects a small number of predictors (that are relevant), which leads to a lower accuracy compared to SPLS-based approaches. Thus, using approaches that combine compression and variable selection such as sparse PLS has a true impact on selection accuracy, compared to ‘‘selection-only’’ approach such as GLMNET.

Then, we focus on the comparison of the different sparse PLS approaches. On the one hand, our method logit-SPLS selects less irrelevant predictors since the false positive rate is lower (higher specificity), compared to other SPLS approaches. On the other hand, SGPLS, SPLS-log and SPLS-DA select more true positives (higher sensitivity). Since all methods achieve a similar level of accuracy, this result clearly illustrates a difference of strategy regarding variable selection. The balance between sensitivity and specificity indicates that our method logit-SPLS selects predictors which are more likely to be relevant, discarding most of the non-pertinent predictors, while other approaches tend to select more predictors with higher false positive rate. With high dimensional data set (large ), we are generally interested in highly sparse model, thus it is an advantage to have a sharper control on the false positive rate, as in our method. In addition, the relative good sensitivity of other sparse PLS approaches (SGPLS and SPLS-log) is also balanced by a selection process that is less stable than ours, as the standard deviation of the accuracy is higher over simulations (as previously mentioned, see Tab. 2).

4 Classification of breast tumors using adaptive sparse PLS for logistic regression

We consider a publicly available data set on breast cancer (Guedj et al., 2012) containing the level expression of 54613 genes for 294 patients affected by breast cancer. We focus on the relapse after 5 years, considering a valued response, if the relapse occurred or not. There were 214 patients without relapse and 80 with a relapse. We reduce the number of genes by considering the top 5000 most differentially expressed genes, by using a standard t-test with a Benjamini-Hochberg correction. Computation details (resamplings, cross-validation, stability selection, training and test set definition) are joined in Supp. Mat. (c.f. section A.6).

Convergence and stability with Ridge IRLS and adaptive sparse PLS.

The Ridge IRLS algorithm confirms its usual convergence (see Tab. 4). Other approaches based on SPLS (SGPLS and SPLS-log) again encounter severe issues and almost never converge. Following a similar pattern, our adaptive selection is far more stable under the tuning of the sparsity parameter by cross-validation than any other approach using sparse PLS (Tab.4), as the precision on this hyper-parameter value is the highest for our method, illustrating less variability in the tuning over repetitions.

  Method Prediction error Conv. perc. s.d.   glmnet / / logit-pls / logit-spls logit-spls-ad sgpls spls-log  

Table 4: Averaged prediction error, convergence percentage over 100 resamplings and standard deviation of cross-validated .

Interest of adaptive selection for prediction and selection.

Regarding prediction performance, the adaptive version of our algorithm logit-SPLS gives better results (c.f. Tab. 4) which highlights the interest of adaptive selection. It can also be noted that our approach performs better on prediction than both logit-PLS (compression only) and GLMNET (selection only), which again supports the interest of using both compression and variable selection. The SGPLS method does not confirm its performance on our simulatiosn with poor and highly variable results, illustrating the potential lack of stability of non-convergent method. Only the SPLS-log method achieves a classification that is as good as our adaptive method. However this point will be counterbalanced by its assessment over the other criteria in the following.

Regarding variable selection, the stability selection analysis (see Fig. 1) shows that, when the number of false positives is bounded (on average), our approach logit-SPLS selects more genes than any other approach (SGPLS, SPLS-log and GLMNET). Hence, we discover more true positives (because the number of false positives is bounded), unraveling more relevant genes than other approaches. This again illustrates the good performance of our method for selection. More generally, approaches that use sparse PLS, i.e. performing selection and compression, select more variables than GLMNET with the same false positive rate, thus retrieving more true positives than GLMNET which performs only selection. This again supports our previously developed idea that compression and selection are both very suitable for high dimensional data analysis. We recall that the curves in Fig. 1 correspond to the number of variables that are selected by most models when exploring the grid of candidate values for hyper-parameters (including ). Additional results regarding the overlap between the genes selected by the different methods and the list of selected genes with their score (i.e. the maximum estimated probability of selection) are given in Supp. Mat. (section A.6.2).

Figure 1: Number of variables in the set of stable selected variables versus the threshold , when forcing the average number of false positives to be smaller than . Methods: glmnet (), logit-spls-adapt (), sgpls (), spls-log (). Note: here, all hyper-parameters (including ) vary across the grid of candidate values (c.f. Supp. Mat. section A.6.2).

Efficient compression to discriminate the response.

To assess the interest of our approach for data visualization, we represent the score of the observations on the first two components, i.e. the point cloud . The points are colored according to their -labels. An efficient compression technique would separate the -classes with fewer components. We fit the different compression-based approach (when the number of components is set to ). We use PCA as a reference for compression and data visualization, based on unsupervised learning contrary to other compared approaches. Fig. 2 represents the first two components computed by logit-PLS, logit-SPLS, SGPLS, SPLS-log and PCA. It appears that the first two components from our logit-SPLS are sufficient to easily separate the two -classes. On the contrary, other sparse PLS approaches do not achieve a similar efficiency in the compression process. Thus, our method turns out to be very efficient for data visualization, especially compared to principal component analysis.

Figure 2: Observations scores on the first two components for the different methods. The points are shaped according to the value of the response: 0 () and 1 (). The scores are normalized for comparison.

5 Characterization of T lymphocyte types based on single-cell data

We generalized our approach to the multicategorial case and developed a new method, called multinomial-SPLS (or MSPLS), that was applied to the prediction of cell types using single cell expression data (Stegle et al., 2015; Gawad et al., 2016). Our approach (detailed in Supp. Mat., section A.7) is based on a direct extension of the logistic model. It is specifically a ‘‘one-class vs a reference’’ type of multi-classification, in which the membership probabilities of each class (except the reference) are estimated based on linear combinations of the predictors. The membership probability of the reference class is then deduced from the rest. The resolution is derived from our logit-SPLS method. One interest is that our multi-group classification approach uses a univariate response sparse PLS algorithm (that admits a closed-form solution, c.f. section 2), contrary to sparse multigroup PLS-DA for instance (c.f. Supp. Mat. section A.3).

Understanding the mechanisms of an adaptive immune response is of great interest for the creation of new vaccines. This response is made possible thanks to antigen-specific ‘‘effector’’ T cells capable of recognizing and killing infected cells, and to the long-lasting ‘‘memory’’ T cells that will constitute a repertoire for later secondary immune responses. These two types of T cells have then been described as 4 sub-groups: CM, TSCM (‘‘Memory’’), TEMRA, EM, (‘‘Effector Memory’’). Generally speaking, CM and TSCM can be considered as ‘‘Memory’’ cells and TEMRA and EM can be considered as ‘‘Effectors’’ as CM/TSCM and EM/TEMRA share significant functional overlap with each other (Willinger et al., 2005a; Gattinoni et al., 2011). Understanding the transcriptomic diversity of T cells constitutes a new challenge to better characterize the short and long-term vaccinal responses, as T cells are increasingly recognized as being highly heterogeneous populations (Newell et al., 2012). However, these investigations have been limited by current practices that consist in defining those 4 cell types based by drawing non-overlapping gates on the 2D-space defined two surface markers only: CCR7 and CD45RA (Sallusto et al., 1999). Consequently this rule leads to the selection of a fraction of cells that only correspond to cells with the most extreme values of markers, which ignores the complexity of a T cell population sampled from real blood.

We developed a SPLS-based multi-categorical classification to better characterize the transcriptomic diversity that supports the 4 different cell types of T cells. This approach aims at classifying more cells, and at inferring the type of the non-identified cells. To do so, we considered the measurements of 11 surface markers (CCR7, CD45RA, CD27, IL7R, FAS, CD49F, PD1, CD57, CD3E, CD8A), along with the expression of the corresponding genes. All these measurements were available on the single-cell basis. We will show that even in this low dimensional case, the use of variable selection will help to improve the accuracy of the results. In the following, hyper-paramaters (including ) were tuned by cross-validation. Details about the candidate values for are given in Supp. Mat. (section A.8.1).

We developed the following two-step analysis. We started by considering the measures of the 11 surface markers and the expression of the 11 associated genes. The multinomial-SPLS was trained on a subset of cells that were tagged manually, and used to predict the types of the unknown cells (136 annotated over 943 cells). On this training set of 136 cells, including 44 CM and 28 TSCM cells (i.e. 72 ‘‘Memory’’ cells), 30 EM and 34 TEMRA cells (i.e. 64 ‘‘Effector’’ cells), a 5-fold cross-validation procedure (with 50 repetitions) is used to tune the hyper-parameters. The cross-validation prediction error over the resamplings was .Fig. A.3 in Supp. Mat. shows that the cells in the training set are well discriminated in this first step. In addition, our SPLS procedure selected the proteins CCR7 and CD45RA in 100% of the runs, which is coherent with the manual annotation of the cells based on these two markers.

In a second step we wanted to enrich the set of genes that discriminate cell types. To proceed we considered the expression of all genes of these predicted cell types, and performed a differential analysis from which we retained 61 differentially expressed genes (corresponding to a 5% FDR). By considering these 61 genes added to the first 22 markers considered for the first prediction step, we performed the MSPLS-based prediction on the complete data set annotated by our first prediction. Our method selected 8 new biologically relevant genes (more details in Supp. Mat. section A.8.2) with a cross-validation prediction error rate over re-samplings (again 5-fold cross-validation) of (on the whole data set, not only considering the most extreme phenotypes). The main interests of this two-step procedure were to be computationally efficient and to narrow the list of potential genes of interest, which was conclusive since this second prediction greatly improved the biological relevance of the predicted cell types by accounting for more information than the one contained in the classical markers like CCR7 and provided us with new insight to better understand the T cells immune response.

Fig. 3 illustrates the representation of the cells in the latent dimensional space computed by the multinomial PLS in the second step of prediction. The reference class is ‘‘CM’’. The SPLS computes latent directions discriminating each other class (‘‘EM’’, ‘‘TEMRA’’ and ‘‘TSCM’’ respectively) versus the reference class (c.f. Supp Mat.). The cells are represented on the first two components for the three different pairs: ‘‘CM versus EM’’, ‘‘CM versus TEMRA’’ and ‘‘CM versus TSCM’’. The latent components clearly discriminate the group of cells in the three different cases, which confirms the result of the second prediction based both on markers and differentially expressed genes. The different groups are clearly identified but there is no gap between them, contrary to the representation of the cells in the training set for the the first prediction (c.f. Supp. Mat.). This indicates that the multinomial-SPLS was able to predict the type of the lost common cells based on the most extreme phenotypes.

Figure 3: Cell scores on the first two PLS components in the latent space that discriminate between the reference class (‘‘CM’’) and each other class separately (‘‘EM’’, ‘‘TEMRA’’ and ‘‘TSCM’’ respectively, from left to right). T cells are identified by their predicted types after the second prediction step.

This application highlights the interest of dimension reduction by compression and variable selection, even when dealing with low dimensional data. It can also be noted that, even when using sparse approaches, a step of pre-selection is always useful, especially in the analysis of single-cell expression data, which are very noisy compared to standard RNA-seq data, because of the important inter-cellular diversity.

6 Conclusion

We have introduced a new formulation of sparse PLS and proposed an adaptive version of our algorithm to improve the selection process. Using proximal operators, we provide an explicit resolution framework with a closed-form solution based on soft-thresholding operators.

In addition, we developed a method that performs compression and variable selection suitable for classification. It combines Ridge regularized Iterative Least Square algorithm and sparse PLS in the logistic regression context. It is particularly appropriate for the case of high dimensional data, which appears to be a crucial issue nowadays, for instance in genomics. Our main consideration was to ensure the convergence of IRLS algorithm, which is a critical point in logistic regression. Another concern was to properly incorporate into the GLM framework a dimension reduction approach such as sparse PLS.

Ridge regularization ensures the convergence of the IRLS algorithm, which is confirmed in our simulations and tests on experimental data sets. Applying adaptive sparse PLS as a second step on the pseudo-response produced by IRLS respects the definition of PLS regression for continuous response. Moreover, the combination of compression and variable selection increases the prediction performance and selection accuracy of our method, which turns out to be more efficient than state-of-the-art approaches that do not use both dimension reduction techniques. Such a combination also improves the compression process, illustrated by the efficiency of our method for data visualization compared to standard supervised or unsupervised approaches. Furthermore it appears that previous procedures using sparse PLS with logistic regression encounter convergence issues linked to a lack of stability inthe cross-validation parameter tuning process, highlighting the crucial importance of convergence when dealing with iterative algorithms.

It can be noted that our approach can be used to include additional covariates in the model. For example, we used a combination of surface marker levels and gene expression levels in the single cell data analysis. On this matter, an interesting research direction would be to work on a Least Square-Partial Least Square (LS-PLS) approach, in which some part of the predictors are compressed into PLS components and some others are not. There have been recent advances regarding LS-PLS for logistic regression (see Bazzoli and Lambert-Lacroix, 2016). However, to our knowledge, there is no work on a potential LS-SPLS method, even in the regression case.

In addition, an interesting extension of our work would be to investigate theoretical properties of the sparse PLS regression (especially regarding its consistency or any oracle properties). Deriving such properties would be an opportunity to assess the underlying statistical properties of our method and remains an open question.

Funding

This work was supported by the french National Resarch Agency (ANR) as part of the ‘‘Algorithmics, Bioinformatics and Statistics for Next Generation Sequencing data analysis’’ (ABS4NGS) ANR project [grant number ANR-11-BINF-0001-06] and as part of the ‘‘MACARON’’ ANR project [grant number ANR-14-CE23-0003]. It was performed using the computing facilities of the computing center LBBE/PRABI.

References

  • Aggarwal et al. (2001) Aggarwal, C. C., Hinneburg, A., and Keim, D. A. (2001). On the surprising behavior of distance metrics in high dimensional space. In International Conference on Database Theory, pages 420–434. Springer.
  • Bach et al. (2012) Bach, F., Jenatton, R., Mairal, J., and Obozinski, G. (2012). Optimization with Sparsity-Inducing Penalties. Found. Trends Mach. Learn., 4(1), 1–106.
  • Barker and Rayens (2003) Barker, M. and Rayens, W. (2003). Partial least squares for discrimination. Journal of chemometrics, 17(3), 166–173.
  • Bazzoli and Lambert-Lacroix (2016) Bazzoli, C. and Lambert-Lacroix, S. (2016). Classification using LS-PLS with logistic regression based on both clinical and gene expression variables. Preprint.
  • Boulesteix (2004) Boulesteix, A.-L. (2004). PLS dimension reduction for classification with microarray data. Statistical applications in genetics and molecular biology, 3(1), 1–30.
  • Boulesteix and Strimmer (2007) Boulesteix, A.-L. and Strimmer, K. (2007). Partial least squares: A versatile tool for the analysis of high-dimensional genomic data. Briefings in bioinformatics, 8(1), 32–44.
  • Chong and Jun (2005) Chong, I.-G. and Jun, C.-H. (2005). Performance of some variable selection methods when multicollinearity is present. Chemometrics and Intelligent Laboratory Systems, 78(1), 103–112.
  • Chun and Keleş (2010) Chun, H. and Keleş, S. (2010). Sparse partial least squares regression for simultaneous dimension reduction and variable selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(1), 3–25.
  • Chung and Keleş (2010) Chung, D. and Keleş, S. (2010). Sparse Partial Least Squares Classification for High Dimensional Data. Statistical Applications in Genetics and Molecular Biology, 9(1).
  • De Jong (1993) De Jong, S. (1993). SIMPLS: An alternative approach to partial least squares regression. Chemometrics and intelligent laboratory systems, 18(3), 251–263.
  • Ding and Gentleman (2005) Ding, B. and Gentleman, R. (2005). Classification Using Generalized Partial Least Squares. Journal of Computational and Graphical Statistics, 14(2), 280–298.
  • Donoho (2000) Donoho, D. L. (2000). High-dimensional data analysis: The curses and blessings of dimensionality. AMS Math Challenges Lecture, pages 1–32.
  • Eilers et al. (2001) Eilers, P. H., Boer, J. M., van Ommen, G.-J., and van Houwelingen, H. C. (2001). Classification of microarray data with penalized logistic regression. In BiOS 2001 The International Symposium on Biomedical Optics, pages 187–198. International Society for Optics and Photonics.
  • Eksioglu (2011) Eksioglu, E. M. (2011). Sparsity regularised recursive least squares adaptive filtering. IET signal processing, 5(5), 480–487.
  • Firth (1993) Firth, D. (1993). Bias reduction of maximum likelihood estimates. Biometrika, 80(1), 27–38.
  • Fort and Lambert-Lacroix (2005) Fort, G. and Lambert-Lacroix, S. (2005). Classification using partial least squares with penalized logistic regression. Bioinformatics, 21(7), 1104–1111.
  • Fort et al. (2005) Fort, G., Lambert-Lacroix, S., and Peyre, J. (2005). Réduction de dimension dans les modèles linéaires généralisés: application à la classification supervisée de données issues des biopuces (in french). Journal de la société française de statistique, 146(1-2), 117–152.
  • Friedman et al. (2010) Friedman, J., Hastie, T., and Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of statistical software, 33(1), 1.
  • Gattinoni et al. (2011) Gattinoni, L., Lugli, E., Ji, Y., Pos, Z., Paulos, C. M., Quigley, M. F., Almeida, J. R., Gostick, E., Yu, Z., Carpenito, C., Wang, E., Douek, D. C., Price, D. A., June, C. H., Marincola, F. M., Roederer, M., and Restifo, N. P. (2011). A human memory T cell subset with stem cell-like properties. Nat. Med., 17(10), 1290–1297.
  • Gawad et al. (2016) Gawad, C., Koh, W., and Quake, S. R. (2016). Single-cell genome sequencing: Current state of the science. Nature Reviews Genetics, 17(3), 175–188.
  • Green (1984) Green, P. J. (1984). Iteratively reweighted least squares for maximum likelihood estimation, and some robust and resistant alternatives. Journal of the Royal Statistical Society. Series B (Methodological), pages 149–192.
  • Guedj et al. (2012) Guedj, M., Marisa, L., de Reynies, A., Orsetti, B., Schiappa, R., Bibeau, F., MacGrogan, G., Lerebours, F., Finetti, P., Longy, M., Bertheau, P., Bertrand, F., Bonnet, F., Martin, A. L., Feugeas, J. P., Bièche, I., Lehmann-Che, J., Lidereau, R., Birnbaum, D., Bertucci, F., de Thé, H., and Theillet, C. (2012). A refined molecular taxonomy of breast cancer. Oncogene, 31(9), 1196–1206.
  • Hastie et al. (2009) Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning. Springer Series in Statistics. Springer New York, New York, NY, second edition.
  • Lê Cao et al. (2008) Lê Cao, K.-A., Rossouw, D., Robert-Granié, C., and Besse, P. (2008). A sparse PLS for variable selection when integrating omics data. Statistical applications in genetics and molecular biology, 7(1).
  • Lê Cao et al. (2011) Lê Cao, K.-A., Boitard, S., and Besse, P. (2011). Sparse PLS discriminant analysis: Biologically relevant feature selection and graphical displays for multiclass problems. BMC Bioinformatics, 12, 253.
  • Le Cessie and Van Houwelingen (1992) Le Cessie, S. and Van Houwelingen, J. C. (1992). Ridge estimators in logistic regression. Applied statistics, pages 191–201.
  • Marimont and Shapiro (1979) Marimont, R. B. and Shapiro, M. B. (1979). Nearest Neighbour Searches and the Curse of Dimensionality. IMA Journal of Applied Mathematics, 24(1), 59–70.
  • Marx (1996) Marx, B. D. (1996). Iteratively reweighted partial least squares estimation for generalized linear regression. Technometrics, 38(4), 374–381.
  • McCullagh and Nelder (1989) McCullagh, P. and Nelder, J. A. (1989). Generalized Linear Models, Second Edition. CRC Press.
  • Meinshausen and Bühlmann (2010) Meinshausen, N. and Bühlmann, P. (2010). Stability selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(4), 417–473.
  • Newell et al. (2012) Newell, E. W., Sigal, N., Bendall, S. C., Nolan, G. P., and Davis, M. M. (2012). Cytometry by time-of-flight shows combinatorial cytokine expression and virus-specific cell niches within a continuum of CD8+ T cell phenotypes. Immunity, 36(1), 142–152.
  • Nguyen and Rocke (2002) Nguyen, D. V. and Rocke, D. M. (2002). Tumor classification by partial least squares using microarray gene expression data. Bioinformatics, 18(1), 39–50.
  • Sallusto et al. (1999) Sallusto, F., Lenig, D., Forster, R., Lipp, M., and Lanzavecchia, A. (1999). Two subsets of memory T lymphocytes with distinct homing potentials and effector functions. Nature, 401(6754), 708–712.
  • Shen and Huang (2008) Shen, H. and Huang, J. Z. (2008). Sparse principal component analysis via regularized low rank matrix approximation. Journal of multivariate analysis, 99(6), 1015–1034.
  • Stegle et al. (2015) Stegle, O., Teichmann, S. A., and Marioni, J. C. (2015). Computational and analytical challenges in single-cell transcriptomics. Nature Reviews Genetics, 16(3), 133–145.
  • Tenenhaus et al. (2014) Tenenhaus, A., Philippe, C., Guillemot, V., Le Cao, K.-A., Grill, J., and Frouin, V. (2014). Variable selection for generalized canonical correlation analysis. Biostatistics, 15(3), 569–583.
  • Tibshirani (1996) Tibshirani, R. (1996). Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society. Series B (Methodological), 58(1), 267–288.
  • Wang et al. (1999) Wang, C.-Y., Chen, C.-T., Chiang, C.-P., Young, S.-T., Chow, S.-N., and Chiang, H. K. (1999). A Probability-based Multivariate Statistical Algorithm for Autofluorescence Spectroscopic Identification of Oral Carcinogenesis. Photochemistry and photobiology, 69(4), 471–477.
  • Wherry et al. (2007) Wherry, E. J., Ha, S.-J., Kaech, S. M., Haining, W. N., Sarkar, S., Kalia, V., Subramaniam, S., Blattman, J. N., Barber, D. L., and Ahmed, R. (2007). Molecular Signature of CD8+ T Cell Exhaustion during Chronic Viral Infection. Immunity, 27(4), 670–684.
  • Willinger et al. (2005a) Willinger, T., Freeman, T., Hasegawa, H., McMichael, A. J., and Callan, M. F. (2005a). Molecular signatures distinguish human central memory from effector memory CD8 T cell subsets. J. Immunol., 175(9), 5895–5903.
  • Willinger et al. (2005b) Willinger, T., Freeman, T., Hasegawa, H., McMichael, A. J., and Callan, M. F. C. (2005b). Molecular Signatures Distinguish Human Central Memory from Effector Memory CD8 T Cell Subsets. The Journal of Immunology, 175(9), 5895–5903.
  • Witten et al. (2009) Witten, D. M., Tibshirani, R., and Hastie, T. (2009). A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics, 10(3), 515–534.
  • Wold (1975) Wold, H. (1975). Soft Modeling by Latent Variables; the Nonlinear Iterative Partial Least Squares Approach. Perspectives in Probability and Statistics. Papers in Honour of M. S. Bartlett.
  • Wold et al. (1983) Wold, S., Martens, H., and Wold, H. (1983). The multivariate calibration problem in chemistry solved by the PLS method. In Matrix Pencils, pages 286–293. Springer.
  • Yu (2013) Yu, Y.-L. (2013). On decomposing the proximal map. In Advances in Neural Information Processing Systems, pages 91–99.
  • Zou (2006) Zou, H. (2006). The adaptive lasso and its oracle properties. Journal of the American statistical association, 101(476), 1418–1429.
  • Zou and Hastie (2005) Zou, H. and Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2), 301–320.
  • Zou et al. (2006) Zou, H., Hastie, T., and Tibshirani, R. (2006). Sparse principal component analysis. Journal of computational and graphical statistics, 15(2), 265–286.

Supplementary Information

a.1 Optimization in sparse PLS

a.1.1 Reformulation of the sparse PLS problem

As previously introduced, the sparse PLS constructs components as sparse linear combination of the covariates. When considering the first components, i.e. , the weight vector is defined to maximize the empirical covariance between the component and the response, i.e. (centered and ) with a penalty on the -norm of to enforce sparsity in the weights. Thus, the weight vector is computed as the solution of the following optimization problem:

(A.1)

with . The problem (A.1.1) is equivalent to the following, when denoting the standard scalar product by :

because the term is constant thanks to the additional constraint. This new problem remains equivalent to the following:

since the norm of the empirical covariance is constant. Then, thanks to the Euclidean norm properties, it can be rewritten as:

(A.2)

with and when setting . Actually, in the case of a univariate response, the formulation (A.2) is natural. Indeed, in the standard (non-sparse) PLS, the optimal weight vector is the normalized dominant singular vector of the covariance matrix . However, when the response is univariate, the matrix is a vector and the solution for is the normalized vector (normalized to 1). This corresponds exactly to the solution of the problem:

(without the penalty).

The solution of the penalized problem (A.3) defines the first component () of the sparse PLS. We use deflated predictors and response to construct the following component ().

a.1.2 Resolution of the sparse PLS problem

Applying the method of Lagrange multipliers, the problem (A.2) becomes:

(A.3)

with . The objective is continuous and convex, thus the strong duality holds and the solutions of primal and dual problems are equivalent.

To solve the problem A.3, we use proximity operator (also called proximal operator) defined as the solution of the following problem (Bach et al., 2012):

(A.4)

for any fixed , any function . It is denoted by . When corresponds to the Elastic Net penalty (combination of and penalty), i.e. when considering the problem (with and ):

(A.5)

the closed-form solution is explicitly given by the proximal operator that is in particular the composition of and (Yu, 2013, Theo. 4), i.e.

Both proximal operators and are known (Bach et al., 2012), respectively being:

where is the soft-thresholding operator. Eventually, the coordinates of the solution are then:

(A.6)

which correspond to the normalized soft-thresholding operator applied to the vector .

We use the solution (A.6) of the Elastic Net problem (A.5), where and is chosen so that the solution has a unitary norm, to find a candidate point and then the solution (by convexity) of the dual problem (A.3).

Finally, we have reformulated the problem defining the sparse PLS as a least squares problem with an Elastic Net penalty and we have shown that the solution of this problem is the (normalized) soft-thresholding operator.

a.1.3 Adaptive penalty

When considering an adaptive penalty, the optimization problem associated to the sparse PLS can be similarly rewritten as:

(A.7)

with the penalty constant (c.f. main text). By a similar reasoning (continuity and convexity), it is possible to use Lagrange multiplier to resolve the problem (A.7).

In order to explicitly derive the solution, we will use the proximal operator that is solution of the following problem:

(A.8)

with .

If , it can be shown that the solution of the problem (A.4) when considering is given by:

because the subgradient of is given by (Eksioglu, 2011).

The link between subgradient and proximal operator is described in Bach et al. (2012). In particular, if and only if for any couple , where is the subdifferential of at point , i.e. the set of all subgradients of at point . If is differentiable in , then the only subgradient is the gradient .

The proximal operator corresponding to the function is known (c.f. previously):

Eventually, thanks to theorem 4 in Yu (2013), the solution of problem (A.8) is explicitly defined as the combination of and :

for any . Thus, the solution of the problem (A.8) is given by:

with .

We choose so that the norm of the solution is unitary to find a candidate point and thus the solution (by convexity) of the adaptive problem (A.7).

a.2 Conditions for stability selection

The result by Meinshausen and Bühlmann (2010) regarding the expected number of wrongly selected variables is derived for under two conditions: assuming that the indicators are exchangeable for any . The original procedure of selection is not worst than random guessing. The first assumptions assumes that the considered method does not ‘‘prefer’’ to select some covariates rather than some other in the set of the non-pertinent predictors. This hypothesis seems reasonable in our SPLS framework. The second one is verified according to the results on our simulations (c.f. section 3). Moreover, in the method we consider, the grid of hyper-parameters lies in , however the parameter that truly influences the sparsity of the estimation is the parameter . Therefore, the sparse PLS appears to be a reasonable framework to apply the concept of stability selection.

a.3 Comparison with state-of-the-art approaches

In the literature, other methodologies have been proposed to adapt (sparse) PLS for binary classification. We detail here different approaches based on (sparse) PLS and GLMs, especially regarding the potential issues raised by the combination of two optimization frameworks.

PLS and GLMs.

To overcome the convergence issue in the IRLS algorithm, Marx (1996) proposed to solve the weighted least square problem at each IRLS step with a PLS regression, i.e. is computed by weighted PLS regression of the pseudo-response onto the predictors . However, such iterative scheme does not correspond to the optimization of an objective function. Hence, the convergence of the procedure cannot be guaranteed and the potential solution is not clearly defined.

Alternatively, Wang et al. (1999) and Nguyen and Rocke (2002) proposed to achieve the dimension reduction before the logistic regression. Their algorithm use the PLS regression as a preliminary compression step. The components in the subspace of dimension are then used in the logistic regression instead of the predictors. Therefore, the IRLS algorithm does not deal with high dimensional data (as ). In this context, the PLS algorithm treats the discrete response as continuous. Such approach seems counter-intuitive as it neglects the definition of PLS to resolve a linear regression problem and it ignores the inherent heteroskedastic context. This algorithm is called PLS-log in the following. It can be noted that Nguyen and Rocke (2002) or Boulesteix (2004) also proposed to use discriminant analysis as a classifier after the PLS step. This method, known as PLS-DA, is not directly linked to the GLM framework but we cite it as an alternative for classification with PLS-based approaches. It can be noted that Barker and Rayens (2003) proposed a slightly different implementation of PLS-DA, which is however equivalent to Boulesteix’s approach in the binary response case, since they both rely on equivalent univariate response PLS algorithms (De Jong, 1993; Boulesteix and Strimmer, 2007).

Then, Ding and Gentleman (2005) proposed the GPLS method. They introduced a modification in Marx’s algorithm based on the Firth procedure (Firth, 1993), in order to avoid the non-convergence and the potential infinite parameter estimation in logistic regression. However, this approach is also characterized by the absence of an explicit optimization criterion. Eventually, as introduced previously, Fort and Lambert-Lacroix (2005) proposed to integrate the dimension reduction PLS step after a Ridge regularized IRLS algorithm. We presented the adaptation of such methodology in the context of sparse PLS in the previous section.

Sparse PLS and GLMs.

More recently, based on the SPLS algorithm by Chun and Keleş (2010), Chung and Keleş (2010) presented two different approaches. The first one, called SGPLS, is a direct extension from the GPLS algorithm by Ding and Gentleman (2005). It solves the successive weighted least square problems of IRLS using a sparse PLS regression, with the idea that variable selection reduces the model complexity and helps to overwhelm numerical singularities. Unfortunately, our simulations will show that convergence issues remain. Indeed, the use of SPLS does not resolve the issue link to the absence of an associated optimization problem. The second approach is a generalization of the PLS-log algorithm and uses sparse PLS to reduce the dimension before running the logistic regression on the SPLS components. This method will be called SPLS-log.In both case, i.e. in SGPLS and SPLS-log , the iterative optimization in the IRLS algorithm or modified IRLS algorithm does depend on the number of components and on the sparsity parameter . Thus, the convergence of the algorithm is potentially affected by the choice of the hyper-parameters.

Eventually, we cite the SPLS-DA method developed by (Chung and Keleş, 2010) or Lê Cao et al. (2011). Generalizing the approach from Boulesteix (2004), they used sparse PLS as a preliminary dimension reduction step before a discriminant analysis. In the binary response case, thanks to the equivalence between Boulesteix (2004) and Barker and Rayens (2003) works, the sparse extension of Barker and Rayens’ PLS-DA for binary classification corresponds to the work of Chung and Keleş (2010) or Lê Cao et al. (2011). A disadvantage of sparse PLS-DA approaches is that, in the multi-group classification case, they both rely on multivariate response sparse PLS algorithms, which do not admit a closed-form solution. On the contrary, our approach uses a univariate response sparse PLS algorithm (which admits a closed-form solution, c.f. main text) in both binary and multi-group classifications, being computationally efficient in both cases.

a.4 Performance evaluation

In order to assess the performance of our method, we compare it to other state-of-the-art approaches taking into account sparsity and/or performing compression. We eventually use a ‘‘reference’’ method, called GLMNET (Friedman et al., 2010), that performs variable selection, by solving the GLM likelihood maximization penalized by norm penalty for selection and norm penalty for regularization, also known as the Elastic Net approach (Zou and Hastie, 2005). Computations were performed using the software environment for statistics R. The GPLS approach used in our computation comes from the archive of the former R-package gpls, the methods logit-PLS and PLS-DA from the package plsgenomics, SGPLS, SPLS-log and SPLS-DA from the package spls, GLMNET from the package glmnet.

a.5 Complements on the simulation study

a.5.1 Simulation design

We consider a predictor matrix of dimension , with fixed, and , so that we examine low and high dimensional models. To simulate redundancy within predictors, is partitioned into blocks (10 or 50 in practice) denoted by for block . Then for each predictor , is generated depending on a latent variable as , with and some noise . The correlation between the blocks is regulated by , the higher the less dependency. In the following we consider or 1/3.

The true vector of predictor coefficients is structured according to the blocks in . Actually, blocks in are randomly chosen among the ones to be associated with non null coefficients (with or ). All coefficients within the designated blocks are constant (with value 1/2). In our model, the relevant predictors contributing to the response will be those with non zero coefficient, and our purpose will be to retrieve them via selection. The response variable for observation is sampled as a Bernoulli variable, with parameter that follows a logistic model: .

For our method, the parameter values that are tuned by cross-validation are the following: the number of components varies from 1 to 10, candidate values for the Ridge parameter in RIRLS are 31 points that are -linearly spaced in the range , candidate values for the sparse parameter are 10 points that are linearly spaced in the range . Other SPLS approaches (SGPLS and SPLS-log) only depend on hyper-parameters for which candidate values are the same as for our method. Regarding GLMNET, we let the procedure chooses by itself the grid of hyper-parameters, as recommended by the authors in the documentation.

a.5.2 Additional simulation results

Convergence.

Tab. A.3 summarized the convergence of the different methods (logit-SPLS, SGPLS and SPLS-log) during the cross-validation procedure (including the tuning of ) on the simulations, depending on the number of predictors . Our approach logit-SPLS always converges on contrary to other SPLS approaches for logistic regression.

  Method   sgpls spls-log logit-spls 100 100 100 100  

Table A.1: Averaged percentage of model that converged during cross-validation tuning of hyper-parameters for different values of .

Tab. A.2 summarized the convergence of the different methods on the simulations, when fitting the model after tuning the hyper-parameters (including ) by cross-validation, depending on the number of predictors. Again, our approach logit-SPLS always converges on contrary to other SPLS approaches for logistic regression.

  Method   gpls sgpls spls-log logit-spls 100 100 100 100  

Table A.2: Averaged percentage of model fitting that converged over 75 simulations for different values of . Hyper-parameters are tuned by cross-validation.

In addition, Tab. A.3 shows the percentage of convergence for the different SPLS approaches across cross-validation repeated runs. We see a similar pattern as in Tab. A.2, only our method logit-SPLS almost certainly converge.

  Method   sgpls spls-log logit-spls 100 100 100 100  

Table A.3: Averaged precentage of runs that converged across repeated cross-validations (tuning of all hyper-parameters, including ).

Cross-validation stability.

Fig. A.1 illustrates the stability of the cross-validation procedure for the different SPLS approaches regarding the number of components. Our approach logit-SPLS always chooses , while other SPLS approaches mostly returns . A first comment can be made on the stability of the cross-validation procedure. Our approach is also more stable regarding the choice of compared to other SPLS methods. A second comment is that, as explained in the manuscript, the stability of the cross-validation is directly linked to convergence of the method (c.f. Tab. A.2). Our method always converges on our simulations and is thus more stable regarding cross-validation than other SPLS approaches that do not converge most of the time and are less stable when tuning hyper-parameters. In addition, based on these results, we decided to set and only tune the sparsity parameter and the Ridge parameter when evaluating the performance of the different approaches (Tab. 3 in the manuscript) to save computation time.

Figure A.1: Values chosen for by cross-validation over repetitions for the different SPLS approaches for different values of (with ) on simulated data.

Prediction and selection

Tabs. A.4A.5 and A.6 collects the results regarding performance in prediction and selection (sensitivity, specificity, accuracy) for the different approaches compared in the simulation study, and for data respectively simulated with . These results are consistent with the case presented in the manuscript. In details, approaches that combines compression and variable selection (sparse PLS) achieve better prediction performance than compression only (PLS) or selection only (GLMNET) approaches. Regarding selection, sparse PLS is generally better in term of selection sensitivity (true positive rate) compared to GLMNET, which is too conservative. However, our approach logit-SPLS seems to select less false positives compared to other SPLS approaches, since the specificity is higher for a similar accuracy level.

  Method Prediction Selection Selection Selection error sensitivity specificity accuracy   gpls / / / pls-da / / / logit-pls / / /   glmnet 0.71 logit-spls sgpls spls-da spls-log  

Table A.4: Prediction error and selection sensitivity/specificity (if relevant) when , for non-sparse or sparse approaches (delimited by the line).

  Method Prediction Selection Selection Selection error sensitivity specificity accuracy   gpls / / / pls-da / / / logit-pls / / /   glmnet 0.74 logit-spls sgpls spls-da spls-log  

Table A.5: Prediction error and selection sensitivity/specificity (if relevant) when , for non-sparse or sparse approaches (delimited by the line).

  Method Prediction Selection Selection Selection error sensitivity specificity accuracy   gpls / / / pls-da / / / logit-pls / / /   glmnet 0.74 logit-spls sgpls