
# Cross validating extensions of kernel, sparse or regular partial least squares regression models to censored data.

Frédéric Bertrand,
IRMA, CNRS UMR 7501, Labex IRMIA,
Université de Strasbourg, Strasbourg, France
fbertran@math.unistra.fr
Philippe Bastien,
L’Oréal Recherche, Aulnay, France
pbastien@rd.loreal.com
Myriam Maumy-Bertrand,
IRMA, CNRS UMR 7501, Labex IRMIA,
Université de Strasbourg, Strasbourg, France
mmaumy@math.unistra.fr
###### Abstract

When cross-validating standard or extended Cox models, the commonly used criterion is the cross-validated partial loglikelihood, computed with either a naive scheme or the van Houwelingen scheme (which makes efficient use of the death times of the left-out data in relation to the death times of all the data). Quite astonishingly, we will show, using an extensive simulation study involving three different data simulation algorithms, that these two cross-validation methods fail with the extensions, whether straightforward or more involved, of partial least squares regression to the Cox model.

This is quite an interesting result for at least two reasons. Firstly, several attractive features of PLS-based models (regularization, interpretability of the components, missing data support, data visualization through biplots of individuals and variables, and even parsimony for sPLS-based models) explain why statisticians commonly use these extensions and usually select their hyperparameters by cross-validation. Secondly, these models are almost always featured in benchmarking studies that assess the performance of a new estimation technique in a high-dimensional context, where they often show poor statistical properties.

We carried out a vast simulation study to evaluate more than a dozen potential cross-validation criteria, either AUC-based or prediction-error-based. Several of them led to the selection of a reasonable number of components. Using these newly identified cross-validation criteria to fit extensions of partial least squares regression to the Cox model, we performed a benchmark reanalysis that showed enhanced performance of these techniques.

## 1 Introduction

Regular PLS regression is used to find the fundamental relations between two matrices ($X$ and $Y$), i.e. it is a latent variable approach to modeling the covariance structures in these two spaces. A PLS regression model iteratively searches for the multidimensional direction in the $X$ space that explains the maximum multidimensional variance direction in the $Y$ space. A key step in PLSR is to select the right, unknown, number of these latent variables (called components). PLS regression is particularly suited when the matrix of predictors has more variables than observations and when there is multicollinearity among the $X$ values. By contrast, standard regression will fail in these cases (unless it is regularized).

PLS has become an established tool in various experimental fields (including chemometrics, networks, and systems biology), primarily because it is often possible to interpret the extracted components in terms of the underlying physical system, that is, to derive "hard" modeling information from the soft model: chemical components for NIR spectra, gene subnetworks for gene regulatory networks, or biological functions for systems biology. As a consequence, choosing the right number of components is a major aim, not only to avoid under- or overfitting and to ensure relevant modeling or good predictive ability, but also per se.

A vast literature from the last decade is devoted to relating gene expression profiles to subject survival or time to cancer recurrence. Biomarker discovery from high-dimensional data, such as transcriptomic or SNP profiles, is a major challenge in the search for more precise diagnoses. The proportional hazards regression model suggested by Cox, 1972 to study the relationship between the time to event and a set of covariates in the presence of censoring is the most commonly used model for the analysis of survival data. However, like multivariate regression, it requires more observations than variables, complete data, and variables that are not strongly correlated. In practice, when dealing with high-dimensional data, these constraints are crippling.

In this article we deal with several PLS regression based extensions of the Cox model. These extensions share features praised by practitioners, including regularization, interpretability of the components, missing data support, biplots of individuals and variables, and even parsimony for sPLS-based models. They also make it possible to deal with highly correlated predictors or even rectangular datasets, which is especially relevant for high-dimensional data.

## 2 Models

### 2.1 Modeling censored data

#### 2.1.1 The Cox proportional hazards model

The model assumes the following hazard function for the occurrence of an event at time $t$ in the presence of censoring:

$$\lambda(t)=\lambda_0(t)\exp(\beta'X), \qquad (1)$$

where $\lambda_0(t)$ is an unspecified baseline hazard function, $\beta$ the vector of the regression coefficients and $X$ the matrix of prognostic factors, which will be the gene expressions in the following. The event could be death or cancer relapse. Based on the available data, Cox's partial likelihood can be written as:

$$PL(\beta)=\prod_{k\in D}\frac{\exp(\beta'x_k)}{\sum_{j\in R_k}\exp(\beta'x_j)}, \qquad (2)$$

where $D$ is the set of indices of the events and $R_k$ denotes the set of indices of the individuals at risk at time $t_k$.

The goal is to find the coefficients $\beta$ which maximize the log partial likelihood function

$$l(\beta)=\log PL(\beta). \qquad (3)$$

The vector $\hat\beta$ is the solution of the equation:

$$u(\beta)=\frac{\partial l}{\partial\beta}=0 \qquad (4)$$

with $u(\beta)$ the vector of efficient scores.

However, there is no explicit solution and the maximization is generally accomplished using the Newton-Raphson procedure. An estimate of the vector of $\beta$-parameters at the $(k+1)$-th cycle of the iterative procedure is:

$$\hat\beta_{k+1}=\hat\beta_k+I^{-1}(\hat\beta_k)\,u(\hat\beta_k) \qquad (5)$$

where $I(\beta)$ is the observed information matrix. The process can be started by taking $\hat\beta_0=0$ and iterated up to convergence, i.e. until the change in the log partial likelihood function is small enough. When the iterative procedure has converged, the variance-covariance matrix of the parameter estimates can be approximated by the inverse of the observed information matrix $I(\hat\beta)$.
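As an illustration of the procedure above, the following sketch implements the Newton-Raphson iteration (5) for the Cox partial likelihood with NumPy. It assumes distinct, untied event times (so no tie correction is applied) and is illustrative only; the function name and interface are ours, not those of an existing package.

```python
import numpy as np

def cox_newton_raphson(time, delta, X, n_iter=25, tol=1e-8):
    """Maximize the Cox log partial likelihood by Newton-Raphson (eq. 5).
    time: observation times, delta: event indicators (1 = event), X: (n, p)."""
    n, p = X.shape
    order = np.argsort(time)                 # after sorting, risk sets are suffixes
    time, delta, X = time[order], delta[order], X[order]
    beta = np.zeros(p)                       # start the iterations at beta_0 = 0
    for _ in range(n_iter):
        w = np.exp(X @ beta)
        U = np.zeros(p)                      # score vector u(beta)
        I = np.zeros((p, p))                 # observed information matrix
        for k in range(n):
            if delta[k] == 0:
                continue
            risk = slice(k, n)               # individuals still at risk at t_k
            sw = w[risk].sum()
            xbar = (w[risk, None] * X[risk]).sum(axis=0) / sw
            U += X[k] - xbar                 # score contribution of event k
            xc = X[risk] - xbar
            I += (w[risk, None] * xc).T @ xc / sw   # information contribution
        step = np.linalg.solve(I, U)         # beta <- beta + I^{-1} u(beta)
        beta = beta + step
        if np.abs(step).max() < tol:         # stop when the update is negligible
            break
    return beta, np.linalg.inv(I)            # var-cov approximated by I(beta)^{-1}
```

The returned inverse information matrix approximates the variance-covariance matrix of the estimates, as stated above.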

Note that when $p>n$, there is no unique $\hat\beta$ maximizing this log partial likelihood function. Even when $p\leq n$, covariates could be highly correlated and regularization may still be required in order to reduce the variances of the estimates and to improve the predictive performance.

#### 2.1.2 Deviance Residuals

For the Cox model with no time-dependent explanatory variables and at most one event per patient, the martingale residual for the $i$-th subject with observation time $t_i$ and event status $\delta_i$, where $\delta_i=0$ if $t_i$ is a censored time and $\delta_i=1$ otherwise, is:

$$\hat M_i=\delta_i-\hat E_i=\delta_i-\hat\Lambda_0(t_i)\exp(\hat\beta'x_i) \qquad (6)$$

with $\hat\Lambda_0(t_i)$ the estimated cumulative baseline hazard function at time $t_i$.

Martingale residuals are highly skewed. The deviance residuals are a normalized transform of the martingale residuals. For the Cox model, the deviance residuals (Collett, 1994) take the form:

$$d_i=\mathrm{sign}(\hat M_i)\cdot\left[2\left\{-\hat M_i-\delta_i\log\left(\frac{\delta_i-\hat M_i}{\delta_i}\right)\right\}\right]^{1/2}. \qquad (7)$$

The sign function ensures that the deviance residuals have the same sign as the martingale residuals. Martingale residuals take values between $-\infty$ and $1$. The square root shrinks large negative martingale residuals, while the logarithmic transformation expands martingale residuals that are close to $1$. As such, the deviance residuals are more symmetrically distributed around zero than the martingale residuals. The deviance residual is a measure of excess death and can therefore be interpreted as a measure of hazard. Moreover, Segal showed that the expression to be minimized in step 3 of the Cox-Lasso procedure of Tibshirani can be approximated, in a first-order Taylor-series sense, by the deviance residual sum of squares.

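The null martingale residuals (eq. 6 with $\beta=0$) and their deviance transform (eq. 7) can be sketched as follows, using the Nelson-Aalen estimator for the cumulative hazard. This is an illustrative helper assuming untied observation times; the function name is ours.

```python
import numpy as np

def null_deviance_residuals(t, delta):
    """Martingale and deviance residuals of the null Cox model (no covariates).
    Uses the Nelson-Aalen estimator for the cumulative hazard."""
    t, delta = np.asarray(t, float), np.asarray(delta, int)
    n = len(t)
    order = np.argsort(t)
    # Nelson-Aalen increments: delta_j / (number at risk at t_j)
    cumhaz_sorted = np.cumsum(delta[order] / (n - np.arange(n)))
    cumhaz = np.empty(n)
    cumhaz[order] = cumhaz_sorted
    M = delta - cumhaz                       # martingale residuals (eq. 6, beta = 0)
    # deviance transform (eq. 7); the delta*log(...) term vanishes when delta = 0
    with np.errstate(divide="ignore", invalid="ignore"):
        inner = np.where(delta == 1, delta * np.log(delta - M), 0.0)
    d = np.sign(M) * np.sqrt(2 * (-M - inner))
    return M, d
```

By construction the deviance residuals keep the sign of the martingale residuals, as stated above.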

### 2.2 PLS regression models and extensions

#### 2.2.1 Plsr

Prediction in high-dimensional, low-sample-size settings already arose in chemistry in the eighties. PLS regression, which can be viewed as a regularization method based on dimension reduction, was developed as a chemometric tool in an attempt to find reliable predictive models with spectral data (Wold et al., 1983; Tenenhaus, 1998). Nowadays, the difficulty encountered with the use of genomic or proteomic data for classification or prediction, using very large matrices, is of a comparable nature. It was thus natural to use PLS regression principles in this new context. The method starts by constructing latent components, using linear combinations of the original variables, which are then used as new descriptors in standard regression analysis. Unlike principal components analysis (PCA), this method makes use of the response variable in constructing the latent components. PLS regression can be viewed as a regularized approach searching for the solution in a subspace, named the Krylov space, giving biased regression coefficients but with lower variance. In the framework of censored genomic data, PLS regression operates a reduction of the dimensionality of the genes' space oriented towards the explanation of the hazard function. It allows transcriptomic signatures correlated with survival to be determined.

#### 2.2.2 Sparse PLSR

Recently, Chun and Keles, 2010, provided both empirical and theoretical results showing that the performance of PLS regression is ultimately affected by the large number of predictors. In particular, a higher number of irrelevant variables leads to inconsistency of the coefficient estimates in the linear regression setting. There is thus a need to filter the descriptors as a preprocessing step before the PLS fit. However, commonly used variable filtering approaches are all univariate and ignore the correlation between variables. To solve these issues, Chun and Keles proposed "sparse PLS regression" (sPLS), which promotes variable selection within the course of the PLS dimension reduction. sPLS has the ability to include the variables that variable filtering would select in the construction of the first direction vector. Moreover, it can select additional variables, i.e., variables that become significant once the response is adjusted for other variables, in the construction of the subsequent direction vectors. This is the case for "proxy genes" acting as suppressor variables, which do not predict the outcome variable directly but improve the overall prediction by enhancing the effects of prime genes despite having no direct predictive power, Magidson and Wassmann, 2010.

A direct extension of PLS regression to sPLS regression could be obtained by imposing an $L_1$ constraint on the PLS direction vector $w$:

$$\max_w\ w'Mw \quad \textup{subject to}\quad w'w=\|w\|_2^2=1,\ \|w\|_1\leqslant\lambda, \quad \textup{where } M=X'YY'X.$$

When $Y=X$, the objective function coincides with that of sPCA (Jolliffe et al., 2003). However, in that case, Jolliffe et al. pointed out that the solution tends not to be sparse enough and the problem is not convex. To solve these issues, Chun and Keles provided an efficient implementation of sPLS based on the LARS algorithm by generalizing the regression formulation of sPCA of Zou et al., 2006:

$$\min_{w,c}\ -\kappa w'Mw+(1-\kappa)(c-w)'M(c-w)+\lambda_1\|c\|_1+\lambda_2\|c\|_2^2 \quad \textup{subject to}\quad w'w=1, \quad \textup{where } M=X'YY'X.$$

This formulation promotes the exact-zero property by imposing an $L_1$ penalty on a surrogate $c$ of the direction vector instead of the original direction vector $w$, while keeping $w$ and $c$ close to each other. The $L_2$ penalty takes care of the potential singularity of $M$. Moreover, they demonstrated that for univariate PLS, $y$ regressed on $X$, the first direction vector of the sparse PLS algorithm is obtained by soft thresholding of the original PLS direction vector:

$$(|Z|-\lambda_1/2)_+\,\mathrm{sign}(Z), \quad \textup{where}\quad Z=X'y/\|X'y\|_2. \qquad (9)$$

In order to inherit the Krylov subsequence property, which is known to be crucial for the correctness of the algorithm (Krämer, 2007), the thresholding phase is followed by a PLS regression on the previously selected variables. The algorithm is then iterated with $y$ replaced by the residuals of the PLS regression based on the variables selected in the previous steps. The sPLS algorithm therefore leads to sparse solutions by keeping the Krylov subsequence structure of the direction vectors in a restricted space composed of the selected variables. The thresholding parameter and the number of hidden components are tuned by cross-validation.
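For the univariate case, the soft-thresholding step (9) yielding the first sPLS direction vector can be sketched as follows. The parametrization of the threshold `lam` and the function name are ours, for illustration only.

```python
import numpy as np

def spls_first_direction(X, y, lam):
    """First sPLS direction vector (univariate response) by soft thresholding
    of the normalized PLS direction Z = X'y / ||X'y||_2, as in eq. (9)."""
    Z = X.T @ y
    Z = Z / np.linalg.norm(Z)                          # normalized PLS direction
    w = np.sign(Z) * np.maximum(np.abs(Z) - lam, 0.0)  # soft thresholding
    return w
```

Coordinates of $Z$ whose magnitude falls below the threshold are set exactly to zero, which is how sparsity enters the first direction vector.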

sPLS has connections to other variable selection algorithms, including the elastic net method (Zou and Hastie, 2005) and the threshold gradient method (Friedman and Popescu, 2004). The elastic net algorithm deals with the collinearity issue in the variable selection problem by incorporating the ridge regression method into the LARS algorithm. In a way, sPLS handles the same issue by fusing the PLS technique into the LARS algorithm. sPLS can also be related to the threshold gradient method in that both algorithms use only the thresholded gradient and not the Hessian. However, sPLS achieves fast convergence by using conjugate gradients. Hence, LARS and sPLS use the same criterion to select active variables in the univariate case. However, the sPLS algorithm differs from LARS in that sPLS selects more than one variable at a time and utilizes the conjugate gradient method to compute the coefficients at each step. The computational cost of computing the coefficients at each step of the sPLS algorithm is less than or equal to the cost of computing the step size in LARS, since conjugate gradient methods avoid matrix inversion.

### 2.3 Extensions of PLSR models to censored data

#### 2.3.1 PLS-Cox

Garthwaite, 1994, showed that PLS regression could be obtained as a succession of simple and multiple linear regressions. Tenenhaus, 1999, proposed a fairly similar approach but one which could cope with missing data by using the principles of the Nipals algorithm (Wold, 1966). As a result, Tenenhaus suggested that PLS regression be extended to logistic regression (PLS-LR) by replacing the succession of simple and multiple regressions by a succession of simple and multiple logistic regressions in an approach much simpler than that developed by Marx, 1996. By using this alternative formulation of the PLS regression, Bastien and Tenenhaus, 2001, extended the PLS regression to any generalized linear regression model (PLS-GLR) and to the Cox model (PLS-Cox) as a special case. Further improvements have then been described (Bastien et al., 2005) in the case of categorical descriptors with variable selection using hard thresholding and model validation by bootstrap resampling. Since then many developments in the framework of PLS and Cox regressions have appeared in the literature. Nguyen and Rocke, 2002, directly applied PLS regression to survival data and used the resulting PLS components in the Cox model for predicting survival time. However such a direct application did not really generalize PLS regression to censored survival data since it did not take into account the failure time in the dimension reduction step. Based on a straightforward generalization of Garthwaite, 1994, Li and Gui, 2004, presented a solution, Partial Cox Regression, quite similar to the one proposed by Bastien and Tenenhaus, using different weights to derive the PLS components but not coping with missing data.

#### 2.3.2 (DK)(s)PLS(DR)

##### The PLSDR algorithm (Bastien, 2008)

Following Segal, 2006, who suggested initially computing the null deviance residuals and then using these as outcomes for the LARS-Lasso algorithm, Bastien, 2008, proposed PLSDR, an alternative for high-dimensional settings using deviance-residuals-based PLS regression. This approach is advantageous both for its simplicity and for its efficiency: it only requires computing the null deviance residuals, using a simple Cox model without covariates, and using these as the outcome in a standard PLS regression. The final Cox model is then fitted on the retained PLSDR components.

Moreover, following the principles of the Nipals algorithm, weights, loadings, and PLS components are computed as regression slopes. These slopes may be computed even when there are missing data: let $t_{h,i}$ be the value of the $h$-th PLS component for individual $i$, with $x_{h-1,i}$ the vector of residual descriptors and $w_h$ the vector of weights at step $h$. Then $t_{h,i}$ represents the slope of the OLS line without constant term fitted to the cloud of points $(w_h, x_{h-1,i})$. In this case, when computing the PLS component, both the numerator and the denominator are computed only on the available data.
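The slope computation in the presence of missing data can be sketched as follows: the sums in the numerator and the denominator are restricted to the observed entries, NIPALS-style. The helper name and the `NaN` encoding of missing values are our choices for illustration.

```python
import numpy as np

def pls_component_with_missing(X, w):
    """Value of a PLS component for each individual, as the slope of a
    no-intercept OLS fit of its (residual) descriptors on the weights w,
    with sums restricted to the available (non-missing) data."""
    t = np.empty(X.shape[0])
    for i in range(X.shape[0]):
        avail = ~np.isnan(X[i])                          # observed descriptors only
        t[i] = (X[i, avail] @ w[avail]) / (w[avail] @ w[avail])
    return t
```

With no missing values this reduces to the usual $t = Xw/(w'w)$, and a missing entry simply drops the corresponding term from both sums.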

##### The DKsPLSDR algorithm (Bastien et al., 2015)

In the case of very many descriptors, PLS regression being invariant under orthogonal transformations (De Jong and Ter Braak, 1994), an even faster procedure can be derived by replacing the matrix $X$ by the matrix of its principal components. This can be viewed as the simplest form of the linear kernel PLS regression algorithms which were proposed in the nineties (Lindgren et al., 1993; Rännar et al., 1994) to solve the computational problems posed by very large matrices in chemometrics. The objective of these methods was to obtain PLS components by working on a condensed matrix of a considerably smaller size than the original one. Moreover, in addition to dramatically reducing the size of the problem, non-linear patterns in the data can also be analyzed using non-linear kernels.

Rosipal and Trejo, 2002, proposed a nonlinear extension of PLS regression using kernels. Assuming a nonlinear transformation of the input variables into a feature space $F$, i.e. a mapping $\Phi: x_i \mapsto \Phi(x_i)\in F$, their goal was to construct a linear PLS regression model in $F$. They derived an algorithm named KPLS, for Kernel PLS, by performing the PLS regression on $\Phi(X)$. It amounts to replacing, in the expression of the PLS components, the product $\Phi(X)\Phi(X)'$ by the kernel matrix $K$, using the so-called kernel trick, which allows the computation of dot products in high-dimensional feature spaces using simple functions defined on pairs of input patterns: $K(x_i,x_j)=\Phi(x_i)'\Phi(x_j)$. This avoids having to explicitly calculate the coordinates in the feature space, which could be difficult for a highly dimensional feature space. By using the kernel functions corresponding to the canonical dot product in the feature space, non-linear optimization can be avoided and simple linear algebra used.

Bennett and Embrechts, 2003, proposed to perform the PLS regression directly on the kernel matrix $K$ instead of $\Phi(X)$. DKPLS corresponds to a low-rank approximation of the kernel matrix. Moreover, Tenenhaus et al., 2007, demonstrated that, for a one-dimensional output response, PLS on $\Phi(X)$ (KPLS) is equivalent to PLS on $K$ (DKPLS).

Using these previous works, it becomes straightforward to derive a non-linear kernel sPLSDR algorithm by replacing, in the sPLSDR algorithm, the matrix $X$ by a kernel matrix $K$. The main kernel functions are the linear kernel, $K(x_i,x_j)=x_i'x_j$, and the Gaussian kernel, $K(x_i,x_j)=\exp(-\|x_i-x_j\|^2/2\sigma^2)$.

However, unlike linear kernel PLS regression, non-linear kernel (sparse) PLS regression loses the explicit link with the original descriptors, which could limit the interpretation of the results.
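For concreteness, the two kernel matrices mentioned above can be computed as follows (a standard construction; the helper names are ours):

```python
import numpy as np

def linear_kernel(X, Y=None):
    """Linear kernel: K(x_i, x_j) = x_i' x_j."""
    Y = X if Y is None else Y
    return X @ Y.T

def gaussian_kernel(X, Y=None, sigma=1.0):
    """Gaussian kernel: K(x_i, x_j) = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    Y = X if Y is None else Y
    # squared pairwise distances via the expansion ||x||^2 + ||y||^2 - 2 x'y
    sq = (X**2).sum(1)[:, None] + (Y**2).sum(1)[None, :] - 2 * X @ Y.T
    return np.exp(-np.maximum(sq, 0.0) / (2 * sigma**2))
```

Both return the $n\times n$ Gram matrix that plays the role of $K$ in the kernelized algorithms above.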

## 3 Simulation studies

### 3.1 Scheme of the studies

The aim of our two in silico studies is twofold: to evaluate the accuracy of the cross-validation methods, see Section 4, and to revisit the performance of the component-based methods, see Section 5.

We performed a simulation study (Algorithm 3) in order to evaluate the methods by simulating 100 datasets with an exponential survival distribution and a 40% censoring rate (100 observations × 1000 genes) according to three different simulation types (cluster, by Bair et al. (2006); factorial, by Kaiser and Dickman (1962) and Fan et al. (2002); or eigengene, by Langfelder et al. (2013)), using either no link or a linear one between the response and the predictors.

We divided each of these 600 datasets into a training set, of 7/10 (70) of the observations, used for estimation, and a test set, of 3/10 (30) of the observations, used for evaluation or testing of the prediction capability of the estimated model. This choice was made to stay between the 2:1 scheme of Bøvelstad et al. (2007); van Wieringen et al. (2009); Lambert-Lacroix and Letué (2011) and the 9:1 scheme of Li (2006). The division between training and test sets was balanced using the caret package, Kuhn (2014), according to both the response value and the censoring rate.

### 3.2 Data generation

#### 3.2.1 Eigengene: Langfelder et al. (2013)

Given module seeds $\mathrm{seed}_I$ and desired sizes $n_I$ for the gene modules built around them, module gene expression profiles are generated such that the correlation of the $k$-th gene of module $I$ with its module seed is:

$$\mathrm{cor}(x_{k,I},\mathrm{seed}_I)=1-\frac{k}{n_I}(1-r_{\min})=r_{k,I} \qquad (10)$$

that is, the first gene has correlation $1-(1-r_{\min})/n_I$ with the seed, while the last ($n_I$-th) gene has correlation $r_{\min}$.

The required correlation (10) is achieved by calculating the $k$-th gene profile as the sum of the seed vector and a noise term:

$$x_{k,I}=\mathrm{seed}_I+a_k\varepsilon_k \quad \textup{where}\quad a_k=\sqrt{\frac{\mathrm{var}(\mathrm{seed}_I)}{\mathrm{var}(\varepsilon_k)}\left(\frac{1}{r_{k,I}^2}-1\right)} \qquad (11)$$

This technique produces modules consisting of genes distributed symmetrically around the module seed; in this sense, the simulated modules are spherical clusters whose centers coincide (on average) with the module seed.
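The module construction of eqs. (10)-(11) can be sketched as follows; the function and argument names are ours, and the seed profile is assumed to be a given numeric vector.

```python
import numpy as np

def simulate_module(seed_profile, n_genes, r_min, rng):
    """Simulate one gene module around a seed profile so that the k-th gene
    has target correlation r_k = 1 - (k / n_genes) * (1 - r_min) with the
    seed, following eqs. (10)-(11)."""
    n_samples = len(seed_profile)
    genes = np.empty((n_genes, n_samples))
    for k in range(1, n_genes + 1):
        r_k = 1 - (k / n_genes) * (1 - r_min)      # target correlation (eq. 10)
        eps = rng.standard_normal(n_samples)        # independent noise term
        # noise scale chosen so that cor(seed + a*eps, seed) = r_k (eq. 11)
        a_k = np.sqrt(seed_profile.var() / eps.var() * (1 / r_k**2 - 1))
        genes[k - 1] = seed_profile + a_k * eps
    return genes
```

Adding independent noise of variance $a_k^2\,\mathrm{var}(\varepsilon_k)$ inflates the gene's total variance to $\mathrm{var}(\mathrm{seed}_I)/r_{k,I}^2$, which is exactly what drives the correlation down to the target $r_{k,I}$.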

In the simulations the parameters have been let as follow , , with and .

Survival and censoring times are generated from exponential survival distributions, with a 40% censoring probability. When linked to survival (linear or quadratic case), only the expressions of the genes from the first two modules are related to the survival time.

Each simulated data set consists of 1000 genes and 100 samples. Only the first hundred genes are structured; the remaining 900 are random noise.

#### 3.2.2 Cluster Bair et al. (2006)

The gene expression data is distributed as:

$$X_{ij}=\begin{cases}3+\varepsilon_{ij} & \textup{if } i\leq 50,\ j\leq 50,\\ 4+\varepsilon_{ij} & \textup{if } i>50,\ j\leq 50,\\ 3.5+\varepsilon_{ij} & \textup{if } j>50,\end{cases} \qquad (12)$$

where the $\varepsilon_{ij}$ are random noise terms.

Each simulated data set consists of 1000 genes and 100 samples. Survival and censoring times are generated from exponential survival distributions, with a 40% censoring probability. When linked to survival (linear or quadratic case), only the expressions of the first group of genes are related to survival.
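A sketch of the cluster-scheme generator of eq. (12), with rows indexed by $i$ and columns by $j$, and the dataset dimensions of Section 3.1 as defaults; standard normal noise is our assumption here.

```python
import numpy as np

def simulate_cluster_data(n_genes=1000, n_samples=100, rng=None):
    """Expression matrix of the cluster scheme, eq. (12): two groups of rows
    with shifted means on the first block of columns, a third mean elsewhere."""
    rng = rng or np.random.default_rng()
    X = rng.standard_normal((n_genes, n_samples))   # noise terms eps_ij (assumed N(0,1))
    X[:50, :50] += 3.0        # i <= 50, j <= 50
    X[50:, :50] += 4.0        # i >  50, j <= 50
    X[:, 50:] += 3.5          # j >  50
    return X
```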

#### 3.2.3 Factorial Kaiser and Dickman (1962), Fan et al. (2002)

We suppose that gene expressions are related to latent variables, each associated with a specific biological function. For each group, let a specified population inter-correlation pattern matrix be given. By applying a principal component factorization (PCA) to this matrix and following Kaiser and Dickman, we can generate multivariate normally distributed sample data with a specific correlation pattern.

The procedure writes the sample data matrix as $X=FZ$, where:

- $p$ is the number of descriptors (genes),
- $n$ is the number of observations,
- $Z$ is a $p\times n$ matrix of uncorrelated random normal variables,
- $F$ is a $p\times p$ matrix containing the principal component factor pattern coefficients obtained by applying Principal Components Analysis (PCA) to the given population correlation matrix,
- $X$ is the resultant sample data matrix, as if sampled from a population with the given population correlation matrix.

We chose a compound symmetry structure for the correlation matrix, with the same correlation between two descriptors of the same group, descriptors from different groups being independent.

Moreover, the choice of the correlation coefficient makes it possible to specify the percentage of variance explained by the first factorial axes: with four groups, a given inter-gene correlation coefficient corresponds to a given share of the inertia being carried by the first principal directions.
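The Kaiser-Dickman generation step, together with the block compound-symmetry correlation matrix, can be sketched as follows (helper names are ours; the factor pattern is obtained from the eigendecomposition of the correlation matrix):

```python
import numpy as np

def kaiser_dickman_sample(R, n_obs, rng):
    """Draw samples with population correlation matrix R via X = F Z, where
    F holds the principal component factor pattern of R and Z is a matrix of
    uncorrelated standard normal variables (Kaiser-Dickman procedure)."""
    eigval, eigvec = np.linalg.eigh(R)
    F = eigvec @ np.diag(np.sqrt(np.maximum(eigval, 0)))   # factor pattern: F F' = R
    Z = rng.standard_normal((R.shape[0], n_obs))           # uncorrelated normals
    return F @ Z                                           # (p, n_obs) sample matrix

def compound_symmetry_blocks(p, group_size, rho):
    """Block-diagonal compound-symmetry correlation: rho within a group,
    independence across groups."""
    R = np.eye(p)
    for start in range(0, p, group_size):
        block = slice(start, start + group_size)
        R[block, block] = rho
    np.fill_diagonal(R, 1.0)
    return R
```

Since $FF'=R$, the rows of $X=FZ$ have (population) correlation matrix $R$, which is the property the factorial scheme relies on.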

Survival and censoring times are generated from exponential survival distributions, with a 40% censoring probability. When linked to survival (linear or quadratic case), only the expressions of the genes from the first two groups are related to the survival time.

Each simulated data set consists of 1000 genes and 100 samples. Only the first hundred genes are structured; the remaining 900 are random noise.

Figure 1 displays the correlation pattern for the first 150 descriptors, with each of the four groups of genes well defined.

### 3.3 Hyperparameters and cross-validation

In standard $K$-fold cross-validation of a dataset of size $n$, folds of size $n/K$ are created by sampling from the data without replacement, and each of the remaining data points is assigned randomly to a different fold. In stratified or balanced cross-validation (Breiman et al., 1984, p. 246), the data are first ordered by the response value or class. This list is broken up into bins, each containing $K$ points with similar response values. Any remaining points at the end of the list are assigned to an additional bin. A fold is formed by sampling one point without replacement from each of the bins. Except for the ordering of the data, this is equivalent to standard cross-validation. We used balanced cross-validation with respect to both the response value and the censoring rate. The folds were designed using the caret package, Kuhn (2014).
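A minimal sketch of this balanced fold construction (order by the response, cut into bins, then deal one point per bin to each fold); it stratifies on the response only, whereas the study also balances on the censoring rate, and it is not the caret implementation:

```python
import numpy as np

def balanced_folds(y, n_folds, rng):
    """Stratified (balanced) cross-validation folds in the spirit of
    Breiman et al. (1984): sort by response, form bins of n_folds similar
    values, then assign one point of each bin to each fold at random."""
    order = np.argsort(y)
    folds = [[] for _ in range(n_folds)]
    for start in range(0, len(y), n_folds):
        bin_idx = order[start:start + n_folds]          # one bin of similar responses
        for j, idx in enumerate(rng.permutation(bin_idx)):
            folds[j].append(int(idx))                   # deal one point per fold
    return folds
```

Each fold thus receives one observation from every stratum of the response, so the folds have comparable response distributions.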

In traditional cross-validation, i.e. with a dataset without censored events, each fold would yield a test set and a value of a prediction error measure (for instance the log partial likelihood, the integrated area under the ROC curve, or the integrated area under the prediction error curve). When dealing with censored events and using the CV partial likelihood (CVLL, Verweij and Van Houwelingen (1993)) criterion, it is possible to make more efficient use of the risk sets: van Houwelingen et al. (2006) recommended deriving the CV log partial likelihood for the $k$-th fold by subtraction, namely by subtracting the log partial likelihood evaluated on the full dataset minus the $k$-th fold from the log partial likelihood evaluated on the full dataset, both at the estimate obtained without the $k$-th fold. This yields the van Houwelingen CV partial likelihood (vHCVLL).
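The subtraction scheme can be sketched as follows, for any Cox-type estimator `fit` supplied by the user (the log partial likelihood below assumes untied times; the function names are ours):

```python
import numpy as np

def cox_log_partial_likelihood(beta, t, delta, X):
    """Cox log partial likelihood l(beta) on the dataset (t, delta, X)."""
    order = np.argsort(t)
    t, delta, X = t[order], delta[order], X[order]
    eta = X @ beta
    # log of the risk-set sums, via a reverse cumulative log-sum-exp
    log_risk = np.logaddexp.accumulate(eta[::-1])[::-1]
    return float(np.sum(delta * (eta - log_risk)))

def vhcvll(fit, t, delta, X, folds):
    """van Houwelingen CV partial likelihood: for each fold k, fit on the data
    minus fold k, then add l_full(beta_{-k}) - l_{-k}(beta_{-k})."""
    total = 0.0
    for fold in folds:
        keep = np.ones(len(t), bool)
        keep[fold] = False                               # remove the k-th fold
        beta_k = fit(t[keep], delta[keep], X[keep])      # estimate without fold k
        total += (cox_log_partial_likelihood(beta_k, t, delta, X)
                  - cox_log_partial_likelihood(beta_k, t[keep], delta[keep], X[keep]))
    return total
```

The subtraction makes the left-out observations contribute through the risk sets of the full data, rather than being evaluated in isolation.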

Hyperparameters were tuned using 7-fold cross-validation on the training set. The number of folds was chosen following the recommendations of Wold et al. (2001), Breiman and Spector (1992) and Kohavi (1995). As in Bøvelstad et al. (2007), van Wieringen et al. (2009) and Lambert-Lacroix and Letué (2011), mean values were then used to summarize these cross-validation criteria over the 7 runs, and the hyperparameters were chosen according to the best values of these measures.

## 4 Highlighting relevant cross validation criteria

### 4.1 The failure of the two usual criteria

The van Houwelingen CV partial likelihood (vHCVLL, see Figure 4.4) criterion behaves poorly for all the PLS or sPLS based methods by selecting zero components where, according to our simulation types, the PLS-Cox, autoPLS-Cox, Cox-PLS, PLSDR, sPLSDR, DKPLSDR and DKsPLSDR methods were expected to select about two components for the factor or eigengene schemes and slightly more for the cluster scheme. As for the classic CV partial likelihood (CVLL), it almost always selects at most one component and hence systematically underestimates the number of components. The simulation results for the selection of the number of components using CVLL are plotted in Figure 4.4. We confirmed this poor property by performing cross-validation on a simpler simulation scheme designed by Simon et al. (2011).

### 4.2 Proposal of new criteria

As a consequence, we had to search for other CV criteria (CVC) for the models featuring components. Li (2006) used the integrated area under the curves of time-dependent ROC curves (iAUCsurvROC, Heagerty et al. (2000)) to carry out his cross-validations, implemented in the survcomp package, (Schröder et al., 2011). Apart from that criterion (Figure 4.4) we added five other integrated AUC measures: integrated Chambless and Diao’s (2006) estimator (iAUCCD, Figure 4.4), integrated Hung and Chiang’s (2010) estimator (iAUCHC, Figure 4.4), integrated Song and Zhou’s (2008) estimator (iAUCSH, Figure 4.4), integrated Uno et al.’s (2007) estimator (iAUCUno, Figure 4.4) and integrated Heagerty and Zheng’s (2005) estimator (iAUCHZ, Figure 4.4) of cumulative/dynamic AUC for right-censored time-to-event data, implemented in the survAUC package, Potapov et al. (2012), and the risksetROC package, Heagerty and packaging by Paramita Saha-Chaudhuri (2012). We also studied two versions of two prediction error criteria, the integrated (un)weighted Brier Score (Graf et al. (1999), Gerds and Schumacher (2006), iBS(un)w, integrated (un)weighted squared deviation between predicted and observed survival, Figures 4.4 and 4.4) and the integrated (un)weighted Schmid Score (Schmid et al. (2011), iSS(un)w, integrated (un)weighted absolute deviation between predicted and observed survival, Figure 4.4 and 4.4), also implemented in the survAUC package, Potapov et al. (2012).

### 4.3 Analysis of the results

The simulation results highlighted the integrated Song and Zhou’s estimator of cumulative/dynamic AUC for right-censored time-to-event data (iAUCSH), implemented in the survAUC package, Potapov et al. (2012), as the best CV criterion for the PLS-Cox and the autoPLS-Cox methods even though it behaves poorly in all the other cases.

As for the other models featuring components, the iAUCsurvROC and iAUCUno criteria exhibited the best performances.

The two unweighted criteria iBSunw and iSSunw uniformly fail for all the models.

The iBSw criterion is too conservative: it wrongly selects null models in more than half of the cases in the linear link scheme and almost every time in the quadratic scheme.

The iSSw criterion provides very poor results for the Cox-PLS, sPLSDR and DKsPLSDR methods and average results for the PLSDR and DKPLSDR methods.

The two models sPLSDR and DKsPLSDR use an additional hyperparameter: the thresholding parameter. The same figures were produced for all the criteria (results not shown): both the iAUCUno criterion and the iAUCsurvROC criterion provided a reasonable spread for this parameter.

### 4.4 Recommendation

In a word, this simulation campaign enables us to state the following recommendations to markedly improve the selection of the right number of components: use iAUCSH to cross-validate PLS-Cox or autoPLS-Cox models, and either iAUCUno or iAUCsurvROC to cross-validate Cox-PLS, PLSDR, sPLSDR, DKPLSDR and DKsPLSDR. We implemented these recommendations (iAUCSH for PLS-Cox and autoPLS-Cox models, and iAUCsurvROC for Cox-PLS, PLSDR, sPLSDR, DKPLSDR and DKsPLSDR) as the default cross-validation techniques in the plsRcox package. We will apply them in the remainder of the article to assess the goodness of fit of the models.

## 5 Reassessing performance of (s)PLS based models

We will now provide evidence that the changes of the cross-validation criteria recommended in Section 4.4 actually lead to performance improvements for the fitted models.

### 5.1 Introduction to performance criteria analysis

We followed the methodological recommendations of van Wieringen et al. (2009) to design a simulation plan that ensures a good evaluation of the predictive performance of the models.

“The true evaluation of a predictor’s performance is to be done on independent data. In the absence of independent data (the situation considered here) the predictive accuracy can be estimated as follows Dupuy and Simon (2007). The samples are split into mutually exclusive training and test sets. The gene expression and survival data of the samples in the training set are used to build the predictor. No data from the test set are used in predictor construction (including variable selection) by any of the methods compared. This predictor is considered to be representative of the predictor built on all samples (of which the training set is a subset). The test set is used to evaluate the performance of the predictor built from the training set: for each sample in the test set, survival is predicted from gene expression data. The predicted survival is then compared to the observed survival and summarized into an evaluation measure. To avoid dependency on the choice of training and test set, this procedure is repeated for multiple splits. The average of the evaluation measures resulting from each split is our estimate of the performance of the predictor built using the data from all samples.”

As to the performance criteria themselves, Schmid et al. (2011) made several points that we will take into account to carry out our performance comparison analysis.

“Evaluating the prognostic performance of prediction rules for continuous survival outcomes is an important topic of recent methodological discussion in survival analysis. The derivation of measures of prediction accuracy for survival data is not straightforward in the presence of censored observations (Kent and O’Quigley (1988), Schemper and Stare (1996), Rosthøj and Keiding (2004)). This is mainly due to the fact that traditional performance measures for continuous outcomes (such as the mean squared error or the fraction of explained variation) lead to biased predictions if applied to censored data (Schemper and Stare (1996)).

To overcome this problem, a variety of new approaches has been suggested in the literature. These developments can be classified into three groups: “likelihood-based approaches” (Nagelkerke (1991), Xu and O’Quigley (1999), O’Quigley et al. (2005)), “ROC-based approaches” (Heagerty et al. (2000), Heagerty and Zheng (2005), Cai et al. (2006), Uno et al. (2007), Pepe et al. (2008)), and “distance-based approaches” (Korn and Simon (1990), Graf et al. (1999), Schemper and Henderson (2000), Gerds and Schumacher (2006), 2007, Schoop et al. (2008)).

When using likelihood-based approaches, the log likelihood of a prediction model is related to the corresponding log likelihood obtained from a “null model” with no covariate information. ROC-based approaches use the idea that survival outcomes can be considered as time-dependent binary variables with levels -event- and -no event,- so that time-dependent misclassification rates and ROC curves can be computed for each threshold of a predictor variable of interest. If distance-based approaches are applied, a measure of prediction error is given by the distance between predicted and observed survival functions of the observations in a sample. None of these approaches has been adopted as a standard for evaluating survival predictions so far.”

To assess the goodness of fit and prediction accuracy of all the methods, we found 23 performance measures (PM) that are commonly used (LRT, VarM, R2Nag, R2XO, R2OXS, iR2BSunw, iR2BSw, iRSSunw, iRSSw, iAUCCD, iAUCHC, iAUCSH, iAUCUno, iAUCHZ, iAUCSurvROC, C, UnoC, GHCI, SchemperV, iBSunw, iBSw, iSSunw, iSSw). We chose, on statistical grounds, 14 among them (LRT, R2XO, iR2BSw, iRSSw, iAUCCD, iAUCHC, iAUCSH, iAUCUno, iAUCHZ, iAUCSurvROC, GHCI, SchemperV, iBSw, iSSw) and reported the results of six indices of several kinds: two R2-like measures (a likelihood-based approach (LBA), R2XO, and a distance-based approach (DBA), iRSSw), a C-index (GHCI), two integrated AUC measures (ROC-based approaches (ROCBA), iAUCCD and iAUCSurvROC), and an integrated robust prediction error (a distance-based approach, iSSw), see Table 1. The results for the remaining eight indices are similar to those shown. We now explain our process of selection of the performance criteria.

### 5.2 Selection of performance criteria

The likelihood ratio test (LRT, Lehmann and Romano (2005)) evaluates the null hypothesis H0: β = 0, i.e. the built predictor has no effect on survival. The null hypothesis is evaluated using the likelihood ratio test statistic T = 2(l(^β) − l(0)), with l denoting the value of the log-likelihood function. Under the null hypothesis this test statistic has a χ2 distribution, which is used to calculate the p-value. The p-value summarizes the evidence against H0: the lower the p-value, the more probable it is that H0 is not true. The p-value of the likelihood ratio test has been used as an evaluation measure for predictive performance of gene expression based predictors of survival by many others Bair and Tibshirani (2004); Bøvelstad et al. (2007); Park et al. (2002); Segal (2006).

In the Cox model, an alternative measure of predictive performance is the variance of the martingale residuals (VarM, cf. section 2.1.2). As in van Wieringen et al. (2009), we found that this measure is not able to discriminate very well between good and poor predictors in the considered setting (data not shown). It is therefore omitted here.

To quantify the proportion of variability in survival data of the test set that can be explained by the predictor, we use the coefficient of determination (henceforth called R2). A predictor with good predictive performance explains a high proportion of variability in the survival data of the test set, and vice versa a poor predictor explains little variability in the test set. However, the traditional definition of the R2 cannot be used in the context of censored data and modified criteria have been proposed in the past. Three types of likelihood-based R2 coefficients for right-censored time-to-event data were put forward (R2Nag, R2XO and R2OXS).

• The R2 coefficient (R2Nag) proposed by Nagelkerke (1991):

 R2Nag=1−exp(−(2/n)(l(^β)−l(0))) (13)

where l denotes the log-likelihood function and n the number of observations.

• The R2 coefficient (R2XO) proposed by Xu and O’Quigley (1999), which is restricted to proportional hazards regression models because the means of squared residuals in the R2 measure for linear regression are replaced by the (weighted) sums of squared Schoenfeld residuals, denoted by J:

 R2XO=1−J(^β)/J(0)⋅ (14)
• The R2 coefficient (R2OXS) proposed by O’Quigley et al. (2005), who replaced the number of observations n by the number of events e:

 R2OXS(^β)=1−exp(−(2/e)(l(^β)−l(0)))=1−(L(^β)/L(0))^(−2/e)⋅ (15)

All three were implemented in the survAUC package (Potapov et al. (2012)). Others have also used these modified R2 statistics to assess the predictive performance of gene expression based predictors of survival Bair and Tibshirani (2004); Segal (2006).
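The two exponential-form coefficients can be sketched in a few lines. The Python functions below mirror Eqs. 13 and 15 directly; the log-likelihood values, sample size and event count are hypothetical placeholders, not results from the article.

```python
import math

def r2_nagelkerke(l_hat, l_null, n):
    """R2Nag = 1 - exp(-(2/n) * (l(beta_hat) - l(0)))  (Eq. 13)."""
    return 1.0 - math.exp(-(2.0 / n) * (l_hat - l_null))

def r2_oxs(l_hat, l_null, n_events):
    """R2OXS replaces the number of observations n by the number of
    events e in the exponent  (Eq. 15)."""
    return 1.0 - math.exp(-(2.0 / n_events) * (l_hat - l_null))

# Hypothetical values: 100 subjects, of which 60 experienced the event
r2n = r2_nagelkerke(-240.0, -250.0, n=100)
r2o = r2_oxs(-240.0, -250.0, n_events=60)
```

Since e ⩽ n, R2OXS is never smaller than R2Nag for the same pair of log-likelihoods, which is one way to see why the two coefficients react differently to censoring.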

Hielscher et al. (2010) carried out a comparison of the properties of these three coefficients. In a word, R2Nag is strongly influenced by censoring (negative correlation with censoring); R2OXS is less influenced by censoring and exhibits a positive correlation with it. Of the three, R2XO is the least influenced by censoring. As a consequence, we selected R2XO as the R2-like measure to compare the models.

The weighted Brier score (Brier (1950); Hothorn et al. (2004); Radespiel-Tröger et al. (2003)) is a distance-based measure of prediction error that is based on the squared deviation between survival functions. It is defined as a function of time by

 BSw(t)=(1/n)∑_{i=1}^{n}[^S(t∣Xi)²I(ti⩽t∧δi=1)/^G(ti)+(1−^S(t∣Xi))²I(ti>t)/^G(t)] (16)

where ^G denotes the Kaplan–Meier estimate of the censoring distribution, that is the Kaplan–Meier estimate based on the observations (ti,1−δi), and I stands for the indicator function. The expected Brier score of a prediction model which ignores all predictor variables corresponds to the Kaplan–Meier estimate. To derive the unweighted Brier score, BSunw(t), set the denominators ^G(ti) and ^G(t) to 1.
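A minimal Python sketch of Eq. 16 follows. It builds the inverse-probability-of-censoring weights from a Kaplan–Meier estimate of the censoring distribution and then evaluates BSw(t) for model-supplied survival probabilities; tie handling and the degenerate case ^G = 0 (last observation censored) are deliberately left out of this sketch, and the toy data are invented for illustration.

```python
def km_censoring(times, events):
    """Kaplan-Meier estimate G_hat of the censoring distribution, i.e.
    Kaplan-Meier applied to (t_i, 1 - delta_i). Ties and a censored last
    observation (which would give G_hat = 0) are not handled here."""
    order = sorted(range(len(times)), key=lambda i: times[i])
    n_at_risk, g, steps = len(times), 1.0, []
    for i in order:
        if events[i] == 0:            # a censoring "event"
            g *= (n_at_risk - 1) / n_at_risk
        steps.append((times[i], g))
        n_at_risk -= 1
    def G(t):
        val = 1.0
        for ti, gi in steps:          # steps are sorted by time
            if ti <= t:
                val = gi
        return val
    return G

def brier_weighted(t, times, events, surv_probs, G):
    """BSw(t) of Eq. 16; surv_probs[i] is S_hat(t | X_i), the predicted
    survival probability of individual i at the evaluation time t."""
    total = 0.0
    for ti, di, s in zip(times, events, surv_probs):
        if ti <= t and di == 1:       # died (uncensored) before t
            total += s ** 2 / G(ti)
        elif ti > t:                  # still under observation at t
            total += (1.0 - s) ** 2 / G(t)
    return total / len(times)

# Toy data without censoring, so that G_hat is identically 1
times, events = [1.0, 2.0, 3.0, 4.0], [1, 1, 1, 1]
surv_at_t = [0.9, 0.6, 0.4, 0.2]      # hypothetical S_hat(2.5 | X_i)
bs = brier_weighted(2.5, times, events, surv_at_t, km_censoring(times, events))
```

With no censoring the weights reduce to 1 and BSw(t) is simply the mean squared deviation between the survival indicator and the predicted survival probability.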

The Schmid score (Schmid et al. (2011)) is a distance-based measure of prediction error that is based on the absolute deviation between survival functions, instead of the squared one used for the Brier score. It is a robust improvement over the following empirical measure of absolute deviation between survival functions, suggested by Schemper and Henderson (2000) as a function of time:

 SH(t)=(1/n)∑_{i=1}^{n}[^S(t∣Xi)I(ti⩽t∧δi=1)/^G(ti)+(1−^S(t∣Xi))I(ti>t)/^G(t)] (17)

where ^G denotes the Kaplan–Meier estimate of the censoring distribution, based on the observations (ti,1−δi), and I stands for the indicator function. With the same notations, the Schmid score is defined as a function of time by:

 SS(t)=(1/n)∑_{i=1}^{n}|I(ti>t)−^S(t∣Xi)|[I(ti⩽t∧δi=1)/^G(ti−)+I(ti>t)/^G(t)] (18)

where ti− is a survival time that is marginally smaller than ti. To derive the unweighted Schmid score, SSunw(t), set the denominators ^G(ti−) and ^G(t) to 1.
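The Schmid score of Eq. 18 differs from the Brier score only in the deviation term (absolute instead of squared) and the left-limit ^G(ti−) in the first weight. A self-contained Python sketch, with a small eps standing in for the left limit and a censoring estimate passed as a function (here identically 1, i.e. no censoring; the toy data are invented):

```python
def schmid_weighted(t, times, events, surv_probs, G, eps=1e-8):
    """SSw(t) of Eq. 18: absolute deviation |I(t_i > t) - S_hat(t | X_i)|
    with inverse-probability-of-censoring weights. G is a function giving
    the Kaplan-Meier estimate of the censoring distribution; G(t_i - eps)
    approximates the left limit G_hat(t_i^-)."""
    total = 0.0
    for ti, di, s in zip(times, events, surv_probs):
        dev = abs((1.0 if ti > t else 0.0) - s)
        if ti <= t and di == 1:       # died (uncensored) before t
            total += dev / G(ti - eps)
        elif ti > t:                  # still under observation at t
            total += dev / G(t)
    return total / len(times)

# Toy data without censoring: all weights reduce to 1
times, events = [1.0, 2.0, 3.0, 4.0], [1, 1, 1, 1]
surv_at_t = [0.9, 0.6, 0.4, 0.2]      # hypothetical S_hat(2.5 | X_i)
ss = schmid_weighted(2.5, times, events, surv_at_t, lambda u: 1.0)
```

Because the deviation enters with power one rather than two, a single badly predicted observation pulls SS(t) up less than it would BS(t), which is the robustness argument made above.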

The values of the Brier score range between 0 and 1. Good predictions at time t result in small Brier scores. The numerator of the first summand is the squared predicted probability that individual i survives until time t if he actually died (uncensored) before t, or zero otherwise. The better the survival function is estimated, the smaller this probability is. Analogously, the numerator of the second summand is the squared probability that individual i dies before time t if he was observed at least until t, or zero otherwise. Censored observations with survival times smaller than t are weighted with 0. The Brier score as defined in Eq. 16 depends on t. It makes sense to use the integrated Brier score (IBS) given by

 IBS=(1/max(ti))∫_0^{max(ti)}BS(t)dt. (19)

as a score to assess the goodness of the predicted survival functions of all observations at every time t between 0 and max(ti). Note that the IBS is also appropriate for prediction methods that do not involve Cox regression models: it is more general than the R2 and the p-value criteria associated with the likelihood ratio test and has thus become a standard evaluation measure for survival prediction methods (Hothorn et al. (2006); Schumacher et al. (2007)).

Denoting by BS0(t) the Brier score of the Kaplan–Meier estimator based on the (ti,δi), which corresponds to a prediction without covariates, we first define for all t:

 R2BS(t)=1−BS(t)/BS0(t)⋅ (20)

Then the integrated iR2BSw (Graf et al. (1999)) is defined by:

 iR2BSw=(1/max(ti))∫_0^{max(ti)}R2BS(t)dt. (21)

This criterion has already been used in Bøvelstad et al. (2007) and Lambert-Lacroix and Letué (2011). The integrated iR2BSw is slightly influenced by censoring (Hielscher et al. (2010)) and, as a measure based on the quadratic norm, is not robust.

As a consequence, we propose and use a similar measure based on the Schmid score, the integrated R Schmid Score weighted (iRSSw), obtained by turning the traditional R2, derived from the quadratic norm, into the R coefficient of determination for least absolute deviation introduced by McKean and Sievers (1987). Denoting by SS0(t) the Schmid score which corresponds to a prediction without covariates, we first define for all t:

 RSS(t)=1−SS(t)/SS0(t)⋅

Then the integrated iRSSw is defined by:

 iRSSw=(1/max(ti))∫_0^{max(ti)}RSS(t)dt.

The C-index provides a global assessment of a fitted survival model for the continuous event time, rather than focusing on the prediction of t-year survival for a fixed time t, and is arguably the most widely used measure of predictive accuracy for censored data regression models. It is a rank-correlation measure motivated by the need to quantify the correlation between the ranked predicted and observed survival times. The C-index estimates the probability of concordance between predicted and observed responses. A value of 0.5 indicates no predictive discrimination and a value of 1.0 indicates perfect separation of patients with different outcomes. A popular nonparametric C-statistic for estimating this probability was proposed by Harrell et al. (1996). It is computed by forming all pairs of the observed data where the smaller follow-up time is a failure time, and is defined as:

 c=∑_{1⩽i<j⩽n}[I(yi<yj)I(^β′Xi>^β′Xj)I(δi=1)+I(yj<yi)I(^β′Xj>^β′Xi)I(δj=1)]/∑_{1⩽i<j⩽n}[I(yi<yj)I(δi=1)+I(yj<yi)I(δj=1)]
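The pair-counting logic of Harrell's C can be sketched in Python as follows. A pair is usable when its smaller observed time is an event; a usable pair is concordant when the earlier failure carries the higher risk score ^β′Xi. The data and risk scores below are invented for illustration, and tied risk scores are counted as half-concordant.

```python
def harrell_c(times, events, risk_scores):
    """Harrell's C-index: fraction of usable pairs (the smaller observed
    time is an uncensored event) in which the earlier failure has the
    higher risk score. Tied scores count 0.5; tied times are skipped."""
    concordant, usable = 0.0, 0
    n = len(times)
    for i in range(n):
        for j in range(i + 1, n):
            # order the pair so that a is the earlier observed time
            a, b = (i, j) if times[i] < times[j] else (j, i)
            if events[a] != 1 or times[a] == times[b]:
                continue              # pair not usable
            usable += 1
            if risk_scores[a] > risk_scores[b]:
                concordant += 1.0
            elif risk_scores[a] == risk_scores[b]:
                concordant += 0.5
    return concordant / usable        # assumes at least one usable pair

# Hypothetical risk scores beta_hat' X_i: higher score, earlier failure
c_good = harrell_c([2, 4, 6, 8], [1, 1, 0, 1], [3.0, 2.0, 1.5, 1.0])
c_bad = harrell_c([2, 4, 6, 8], [1, 1, 0, 1], [1.0, 2.0, 3.0, 4.0])
```

In this toy example the perfectly ordered scores give c = 1 and the reversed scores give c = 0, the two extremes around the no-discrimination value 0.5.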

We used the improved version (GHCI) by Gönen and Heller (2005) for Cox proportional hazards models as a performance comparison criterion. Their estimator is a function of the regression parameters and the covariate distribution only and does not use the observed event and censoring times. For this reason it is asymptotically unbiased, unlike Harrell’s C-index based on informative pairs. It focuses on the concordance probability as a measure of discriminatory power within the framework of the Cox model. The appeal of this formulation is that it provides a stable estimator of predictive accuracy that is easily computed:

 Kn(^β)=(2/(n(n−1)))∑_{1⩽i<j⩽n}{I(^β′(Xj−Xi)<0)/(1+exp(^β′(Xj−Xi)))+I(^β′(Xi−Xj)<0)/(1+exp(^β′(Xi−Xj)))}

In contrast to Harrell’s C-index, the effect of the observed times on Kn(^β) is mediated through the partial likelihood estimator ^β, and, since the effect of censoring on the bias of ^β is negligible, the measure is robust to censoring. In addition, Kn(^β) remains invariant under monotone transformations of the survival times.
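As the formula shows, Kn(^β) depends only on the linear predictors ^β′Xi, not on the observed times. A Python sketch, fed with hypothetical linear predictors (pairs with tied predictors contribute nothing in this simplified version):

```python
import math

def gonen_heller_k(lin_preds):
    """Gonen-Heller concordance probability estimator K_n(beta_hat),
    computed from the linear predictors beta_hat' X_i alone. For each
    pair, the summand is 1 / (1 + exp(-|d|)) where d is the difference
    of linear predictors; tied predictors (d = 0) contribute 0 here."""
    n = len(lin_preds)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            d = lin_preds[j] - lin_preds[i]   # beta_hat'(X_j - X_i)
            if d < 0:
                total += 1.0 / (1.0 + math.exp(d))
            elif d > 0:
                total += 1.0 / (1.0 + math.exp(-d))
    return 2.0 * total / (n * (n - 1))

# Hypothetical linear predictors beta_hat' X_i for three subjects
k = gonen_heller_k([0.0, 1.0, 2.0])
```

Every distinct pair contributes a term strictly between 0.5 and 1, so with distinct predictors Kn(^β) always lies above the no-discrimination value 0.5, approaching 1 as the linear predictors spread apart.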

### 5.3 Ranking the performance of the CV criteria

We stated several recommendations in Section 4, based on the accuracy of the selection of the number of components. Selecting the right number of components is a goal per se.

Moreover, these recommendations are also relevant from a performance criteria point of view (see Section 5.1) as the following analysis showed.

1. For all the models and simulation types, we carried out the cross-validation according to each of the 12 criteria and, for each of these criteria, we derived the value of all 14 performance measures.

2. In order to lay stress on the improvements in performance made when switching from the classic and the van Houwelingen log likelihood cross-validation techniques to the recommended ones, we computed, for every dataset and model, all the paired differences between CVLL or vHCVLL and the eleven other CV techniques.

• Paired comparison with CVLL. For every simulated dataset we evaluated: Delta = Performance Measure(with any CV criterion ≠ CVLL) − Performance Measure(with CVLL).

• Paired comparison with vHCVLL. For every simulated dataset we evaluated: Delta = Performance Measure(with any CV criterion ≠ vHCVLL) − Performance Measure(with vHCVLL).

An analysis of these results showed a steady advantage of the recommended criteria over either CVLL or vHCVLL, especially in the linear and quadratic cases.
In the case of paired comparisons with vHCVLL and for some pairs of the type (performance measure, model), namely (UnoC, PLSCox), (UnoC, sPLSDR), (iBSW, PLSDR), (iBSW, DKsPLSDR), (iRSSW, autoPLSCox), (iRSSW, PLSDR), (iAUCSurvROC, PLSDR) and (iAUCSurvROC, sPLSDR), those deltas are plotted on Figures 5.3, 5.3, 5.3, 5.3, 5.3, 5.3, 5.3 and 5.3.

### 5.4 Performance comparison revisited

#### 5.4.1 Selection of competing benchmark methods

Simon et al. (2011) introduced the coxnet procedure, an elastic net-type procedure for the Cox model, similar but not equivalent to two competing ones: coxpath (glmpath package, Park and Hastie, 2007) and penalized (penalized package, Goeman, 2010). In Section 3 of the same article, these authors extensively compared coxnet to coxpath and penalized for the lasso penalty, the only penalty relevant for these comparisons since the three procedures use different elastic net penalties. Their results show a tremendous timing advantage for coxnet over the two other procedures. The coxnet procedure was integrated into the glmnet package (Friedman et al., 2010) and is called in the R language by applying the glmnet function with the option family="cox": coxnet is glmnet for the Cox model. The timing results of Simon et al. (2011) on both simulated and real datasets show some advantage of coxpath over penalized.

As to pure lasso-type penalty algorithms, we selected two of them: “Univariate Shrinkage in the Cox Model for High Dimensional data” (uniCox, Tibshirani, 2009) and “Gradient Lasso for Cox Proportional Hazards Model” (glcoxph, Sohn et al., 2009).

The uniCox package implements “Univariate Shrinkage in the Cox Model for High Dimensional data” (Tibshirani, 2009). Being “essentially univariate”, it differs from applying a classical lasso penalty when fitting the Cox model and hence from both coxnet/glmnet and coxpath/glmpath. It can be used on highly correlated and even rectangular datasets.

In their article, Sohn et al. (2009) show that the glcoxph package is very competitive with the popular existing methods coxpath (Park and Hastie, 2007) and penalized (Goeman, 2010) in computational time, prediction and selectivity. Since it is a very competitive alternative to coxpath, which we included in our benchmarks, and since no comparison with coxnet had been carried out, we selected glcoxph as well.

Cross-validation criteria were recommended for several of our benchmark methods by their authors. We followed these recommendations -classic CV partial likelihood for coxpath, glcoxph and uniCox; van Houwelingen CV partial likelihood for coxnet with both the lambda.min criterion, the value of lambda that gives the minimum mean cross-validated error, and the lambda.1se criterion, the largest value of lambda such that the cross-validated error is within 1 standard error of that minimum- and used the same 7 folds for the training set as those described in Section 3.3 for the other models.

It seemed unfair to compare the methods using a performance measure that is recommended as a cross-validation criterion for some, but not all, of them. Hence, we decided not to use any of the three recommended cross-validation criteria iAUCSH, iAUCUno or iAUCsurvROC -even if it has already been used by Li (2006)- as a performance measure, in order to strive to perform fair comparisons with the methods that are recommended to be cross validated using partial likelihood with either the classic or van Houwelingen technique.

As a consequence, and in order to still provide results for a ROC-based performance measure on a fair basis, we selected Chambless and Diao’s (2006) estimator of the cumulative/dynamic AUC for right-censored time-to-event data, in a form restricted to Cox regression. The iAUCCD summary measure is given by the integral of AUC(t) over the follow-up period, weighted by the estimated probability density of the time-to-event outcome.

#### 5.4.2 Results

For coxnet, coxlars or ridgecox with both the lambda.min and lambda.1se CV criteria, the lambda.min criterion yields similar yet superior results to the lambda.1se one, whose main fault is to select no explanatory variable at all (a null model) too often for the linear or quadratic links. As a consequence, we only reported results for the former.

We plotted some of the performance measures when the cross-validation is done according to the vHCVLL criterion on Figures 5.4.2, 5.4.2, 5.4.2, 5.4.2, 5.4.2 and 5.4.2. The results are terrible for all the (s)PLS-like models apart from PLSCox and autoPLSCox.

We then provide, for each of the (s)PLS-like methods, the increases in terms of performance measures when switching from the vHCVLL cross-validation criterion to the one recommended in Section 4.4. Namely, for PLS-Cox and autoPLS-Cox we switch to the iAUCSH cross-validation criterion, and for the other (s)PLS based models to either iAUCUno or iAUCSurvROC.

For iAUCUno, these results are plotted on Figures 5.4.2, 5.4.2, 5.4.2, 5.4.2, 5.4.2 and 5.4.2 whereas for iAUCSurvROC they are displayed on Figures 5.4.2, 5.4.2, 5.4.2, 5.4.2, 5.4.2 and 5.4.2. These figures show a marked increase for the 6 criteria (R2XO, GHCI, iAUCCD, iAUCSurvROC, iRSSw, iSSw).

As can be seen for iAUCUno (Figures 6, 6, 6, 6, 6 and 6) and iAUCSurvROC (Figures 6, 6, 6, 6, 6 and 6), the improvement in performance due to switching to the recommended CV criteria is large enough for some (s)PLS based models, for instance sPLSDR, to even show some advantage over the other benchmark methods.

## 6 Conclusion

When cross-validating standard or extended Cox models, the commonly used criterion is the cross-validated partial loglikelihood using a naive or a van Houwelingen scheme. Quite astonishingly, these two cross-validation methods fail with all the 7 extensions of partial least squares regression to the Cox model, namely PLS-Cox, autoPLS-Cox, Cox-PLS, PLSDR, sPLSDR, DKPLSDR and DKsPLSDR.

In our simulation study, we introduced 12 cross-validation criteria based on three different kinds of model quality assessment:

• Likelihood (2): Verweij and Van Houwelingen (classic CVLL, 1993), van Houwelingen et al. (vHCVLL, 2006).

• Integrated AUC measures (6): Chambless and Diao’s (iAUCCD, 2006), Hung and Chiang’s (iAUCHC, 2010), Song and Zhou’s (iAUCSH, 2008), Uno et al.’s (iAUCUno, 2007), Heagerty and Zheng’s (iAUCHZ, 2005), Heagerty et al.’s (iAUCsurvROC, 2000).

• Prediction error criteria (4): integrated (un)weighted Brier Score (iBS(un)w, Gerds and Schumacher (2006)) or Schmid Score (iSS(un)w, Schmid et al. (2011))

Our simulation study was successful in finding good CV criteria for PLS or sPLS based extensions of the Cox model:

• iAUCSH for PLS-Cox and autoPLS-Cox.

• iAUCSurvROC and iAUCUno for Cox-PLS, (DK)PLSDR and (DK)sPLSDR.

The derivation of measures of prediction accuracy for survival data is not straightforward in the presence of censored observations. To overcome this problem, a variety of new approaches has been suggested in the literature. We spotted 23 performance measures that can be classified into three groups:

• Likelihood-based approaches (LRT, VarM, 3 R2-type measures).

• ROC-based approaches such as integrated AUCs (iAUCCD, iAUCHC, iAUCSH, iAUCUno, iAUCHZ, iAUCsurvROC) and 3 C-indices (Harrell, GHCI, UnoC).

• Distance-based approaches such as the V of Schemper and Henderson (2000) or measures derived from the Brier or Schmid Scores (iBS(un)w, iSS(un)w and 4 derived R2-type measures).

Using the newly found cross-validation criteria, and these measures of prediction accuracy, we performed a benchmark reanalysis that showed enhanced performance of these techniques and much better behaviour even against other well known competitors such as coxnet, coxpath, uniCox and glcoxph.

Hence the recommended criteria not only improve the accuracy of the choice of the number of components but also strongly raise the performance of the models, enabling some of them to outperform the other benchmark methods.

## References

• Bair and Tibshirani [2004] Eric Bair and Robert Tibshirani. Semi-supervised methods to predict patient survival from gene expression data. PLoS biology, 2(4):E108, April 2004. ISSN 1545-7885.
• Bair et al. [2006] Eric Bair, Trevor Hastie, Debashis Paul, and Robert Tibshirani. Prediction by supervised principal components. Journal of the American Statistical Association, 101:119–137, 2006.
• Bastien and Tenenhaus [2001] P. Bastien and M. Tenenhaus. PLS generalised linear regression, Application to the analysis of life time data. In PLS and Related Methods, Proceedings of the PLS’01 International Symposium, pages 131–140. CISIA-CERESTA, 2001.
• Bastien [2008] Philippe Bastien. Deviance residuals based PLS regression for censored data in high dimensional setting. Chemometrics and Intelligent Laboratory Systems, 91:78–86, 2008. ISSN 01697439.
• Bastien et al. [2005] Philippe Bastien, Vincenzo Esposito Vinzi, and Michel Tenenhaus. PLS generalised linear regression. Computational Statistics & Data Analysis, 48(1):17–46, January 2005. ISSN 01679473.
• Bastien et al. [2015] Philippe Bastien, Frederic Bertrand, Nicolas Meyer, and Myriam Maumy-Bertrand. Deviance residuals-based sparse PLS and sparse kernel PLS regression for censored data. Bioinformatics, 31(3):397–404, oct 2015. ISSN 14602059.
• Bennett and Embrecht [2003] K.P. Bennett and M.J. Embrecht. An optimization perspective on kernel partial least squares regression. In Advances in learning theory: methods, models and applications, NATO Sciences Series III: Computer & Systems Sciences, pages 227–250, 2003.
• Bertrand et al. [2014a] F. Bertrand, M. Maumy-Bertrand, and N. Meyer. Partial least squares Regression for Cox models and related techniques, 2014a. R package version 1.1.0.
• Bertrand et al. [2014b] F. Bertrand, N. Meyer, and M. Maumy-Bertrand. Partial least squares Regression for generalized linear models, 2014b. R package version 1.7.0.
• Bøvelstad et al. [2007] H. M. Bøvelstad, S. Nygård, H. L. Størvold, M. Aldrin, O. Borgan, A. Frigessi, and O. C. Lingjærde. Predicting survival from microarray data - A comparative study. Bioinformatics, 23:2080–2087, 2007.
• Breiman et al. [1984] L Breiman, J H Friedman, R A Olshen, and C J Stone. Classification and Regression Trees. CRC Press, Boca Raton, 1984. ISBN 0412048418.
• Breiman and Spector [1992] Leo Breiman and Philip Spector. Submodel selection and evaluation in regression . The X-random case. International Statistical Review, 60:291–319, 1992. doi: 10.2307/1403680.
• Brier [1950] Glenn W. Brier. Verification of Forecasts Expressed in Terms of Probability. Monthly Weather Review, 78(1):1–3, January 1950. ISSN 0027-0644.
• Cai et al. [2006] Tianxi Cai, Margaret Sullivan Pepe, Yingye Zheng, Thomas Lumley, and Nancy Swords Jenny. The sensitivity and specificity of markers for event times. Biostatistics, 7:182–197, 2006. ISSN 14654644.
• Chambless and Diao [2006] Lloyd E. Chambless and Guoqing Diao. Estimation of time-dependent area under the ROC curve for long-term risk prediction. Statistics in Medicine, 25:3474–3486, 2006. ISSN 02776715. doi: 10.1002/sim.2299.
• Chun and Keles [2010] Hyonho Chun and Sündüz Keles. Sparse partial least squares regression for simultaneous dimension reduction and variable selection. Journal of the Royal Statistical Society. Series B, Statistical methodology, 72(1):3–25, January 2010. ISSN 1369-7412.
• Chung et al. [2012] D. Chung, H. Chun, and S. Keles. spls: Sparse Partial Least Squares (SPLS) Regression and Classification, 2012. R package version 2.1-2.
• Collett [1994] D Collett. Modelling survival data in medical research, volume 46. 1994. ISBN 9780412448805.
• Cox [1972] D. R. Cox. Regression models and life tables. Journal of the Royal Statistical Society. Series B:, 34:187–220, 1972. ISSN 13697412.
• De Jong and Ter Braak [1994] Sijmen De Jong and Cajo J. F. Ter Braak. Comments on the PLS kernel algorithm. Journal of Chemometrics, 8(2):169–174, March 1994. ISSN 0886-9383.
• Dejean et al. [2013] S. Dejean, I. Gonzalez, and K.-A. Le Cao. mixOmics: Omics Data Integration Project, 2013. R package version 4.1-4.
• Dupuy and Simon [2007] Alain Dupuy and Richard M Simon. Critical review of published microarray studies for cancer outcome and guidelines on statistical analysis and reporting. Journal of the National Cancer Institute, 99(2):147–57, January 2007. ISSN 1460-2105. doi: 10.1093/jnci/djk018.
• Fan et al. [2002] X. Fan, A. Felsovalyi, S. Sivo, and S.C. Keenan. SAS for Monte Carlo Studies: A guide for quantitative Researchers. SAS publishing, Cary, NY, 2002.
• Friedman et al. [2010] Jerome Friedman, Trevor Hastie, and Rob Tibshirani. Regularization Paths for Generalized Linear Models via Coordinate Descent. Journal of statistical software, 33:1–22, 2010. ISSN 1548-7660. doi: 10.1359/JBMR.0301229.
• Friedman and Popescu [2004] J.H. Friedman and B.E. Popescu. Gradient Directed Regularization. Working Paper, Department of Statistics, Stanford University., 2004.
• Garthwaite [1994] Paul H. Garthwaite. An Interpretation of Partial Least Squares. Journal of the American Statistical Association, 89(425):122–127, February 1994.
• Gerds and Schumacher [2006] Thomas A. Gerds and Martin Schumacher. Consistent estimation of the expected brier score in general survival models with right-censored event times. Biometrical Journal, 48:1029–1040, 2006. ISSN 03233847.
• Gerds and Schumacher [2007] Thomas A. Gerds and Martin Schumacher. Efron-type measures of prediction error for survival analysis. Biometrics, 63:1283–1287, 2007. ISSN 0006341X.
• Goeman [2010] Jelle J. Goeman. L1 penalized estimation in the Cox proportional hazards model. Biometrical Journal, 52(1):70–84, 2010. ISSN 03233847.
• Gönen and Heller [2005] Mithat Gönen and Glenn Heller. Concordance probability and discriminatory power in proportional hazards regression. Biometrika, 92:965–970, 2005. ISSN 00063444.
• Graf et al. [1999] E Graf, C Schmoor, W Sauerbrei, and M Schumacher. Assessment and comparison of prognostic classification schemes for survival data. Statistics in medicine, 18(17-18):2529–2545, 1999. ISSN 0277-6715.
• Harrell et al. [1996] Frank E. Harrell, Kerry L. Lee, and Daniel B. Mark. Multivariable prognostic models: Issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors. Statistics in Medicine, 15(4):361–387, 1996. ISSN 02776715.
• Hastie and Efron [2012] T. Hastie and B. Efron. lars: Least Angle Regression, Lasso and Forward Stagewise, 2012. R package version 1.1.
• Heagerty et al. [2000] P J Heagerty, T Lumley, and M S Pepe. Time-dependent ROC curves for censored survival data and a diagnostic marker. Biometrics, 56(2):337–344, 2000. ISSN 0006-341X.
• Heagerty and Saha-Chaudhuri [2012] Patrick J. Heagerty and Paramita Saha-Chaudhuri (packaging). risksetROC: Riskset ROC curve estimation from censored survival data, 2012. R package version 1.0.4.
• Heagerty and Zheng [2005] Patrick J. Heagerty and Yingye Zheng. Survival model predictive accuracy and ROC curves. Biometrics, 61(1):92–105, 2005. ISSN 0006341X.
• Hielscher et al. [2010] Thomas Hielscher, Manuela Zucknick, Wiebke Werft, and Axel Benner. On the Prognostic Value of Gene Expression Signatures for Censored Data. In Andreas Fink, Berthold Lausen, Wilfried Seidel, and Alfred Ultsch, editors, Advances in Data Analysis, Data Handling and Business Intelligence Studies in Classification, Data Analysis, and Knowledge Organization, pages 663–673. Springer, Berlin, 2010. ISBN 978-3-642-01043-9.
• Hothorn et al. [2004] Torsten Hothorn, Berthold Lausen, Axel Benner, and Martin Radespiel-Tröger. Bagging survival trees. Statistics in medicine, 23(1):77–91, January 2004. ISSN 0277-6715. doi: 10.1002/sim.1593.
• Hothorn et al. [2006] Torsten Hothorn, Peter Bühlmann, Sandrine Dudoit, Annette Molinaro, and Mark J van der Laan. Survival ensembles. Biostatistics (Oxford, England), 7(3):355–73, July 2006. ISSN 1465-4644.
• Hung and Chiang [2010] H. Hung and C. T. Chiang. Estimation methods for time-dependent AUC models with survival data. Canadian Journal of Statistics-Revue Canadienne De Statistique, 38:8–26, 2010. ISSN 03195724. doi: 10.1002/cjs.
• Jolliffe et al. [2003] Ian T Jolliffe, Nickolay T Trendafilov, and Mudassir Uddin. A Modified Principal Component Technique Based on the LASSO. Journal of Computational and Graphical Statistics, 12(3):531–547, September 2003. ISSN 1061-8600.
• Kaiser and Dickman [1962] Henry F. Kaiser and Kern Dickman. Sample and population score matrices and sample correlation matrices from an arbitrary population correlation matrix. Psychometrika, 27:179–182, 1962. ISSN 00333123. doi: 10.1007/BF02289635.
• Karatzoglou et al. [2004] Alexandros Karatzoglou, Alex Smola, Kurt Hornik, and Achim Zeileis. kernlab – An S4 Package for Kernel Methods in R. Journal of Statistical Software, 11(9):1–20, 2004. ISSN 1548-7660.
• Kent and O’Quigley [1988] John T. Kent and John O’Quigley. Measures of dependence for censored survival data. Biometrika, 75:525–534, 1988. ISSN 00063444.
• Kim [2009] J. Kim. glcoxph: The penalized Cox PH model using the generalized lasso algorithm, 2009. R package version 0.99-13.
• Kohavi [1995] Ron Kohavi. A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection. In International Joint Conference on Artificial Intelligence, volume 14, pages 1137–1143, 1995. ISBN 1-55860-363-8.
• Korn and Simon [1990] E L Korn and R Simon. Measures of explained variation for survival data. Statistics in medicine, 9:487–503, 1990. ISSN 0277-6715.
• Krämer [2007] Nicole Krämer. An overview on the shrinkage properties of partial least squares regression. Computational Statistics, 22(2):249–273, March 2007. ISSN 0943-4062.
• Kuhn [2014] Max Kuhn. caret: Classification and Regression Training, 2014. Contributions from Jed Wing and Steve Weston and Andre Williams and Chris Keefer and Allan Engelhardt and Tony Cooper and Zachary Mayer and the R Core Team. R package version 6.0-30.
• Lambert-Lacroix and Letué [2011] S. Lambert-Lacroix and F. Letué. Partial Least Squares and Cox model with application to gene expression. Technical report, 2011.
• Langfelder et al. [2013] Peter Langfelder, Paul S. Mischel, and Steve Horvath. When Is Hub Gene Selection Better than Standard Meta-Analysis? PLoS ONE, 8(4):e61505, 2013. ISSN 19326203.
• Lehmann and Romano [2005] E. Lehmann and J. Romano. Testing Statistical Hypotheses. Springer Texts in Statistics. Springer, New York, 3rd edition, 2005. ISBN 0387988645.
• Li and Gui [2004] Hongzhe Li and Jiang Gui. Partial Cox regression analysis for high-dimensional microarray gene expression data. Bioinformatics, 20 Suppl 1:i208–i215, 2004. ISSN 1367-4811.
• Li [2006] Lexin Li. Survival prediction of diffuse large-B-cell lymphoma based on both clinical and gene expression information. Bioinformatics, 22(4):466–471, 2006. ISSN 13674803.
• Lindgren et al. [1993] Fredrik Lindgren, Paul Geladi, and Svante Wold. The kernel algorithm for PLS. Journal of Chemometrics, 7(1):45–59, January 1993. ISSN 0886-9383.
• Magidson and Wassmann [2010] J. Magidson and K. Wassmann. The Role of Proxy Genes in Predictive Models: An Application to Early Detection of Prostate Cancer. In Proceedings of the American Statistical Association., 2010.
• Marx [1996] Brian D Marx. Iteratively Reweighted Partial Least Squares Estimation for Generalized Linear Regression. Technometrics, 38:374–381, 1996. ISSN 00401706. doi: 10.2307/1271308.
• McKean and Sievers [1987] Joseph W McKean and Gerald L Sievers. Coefficients of determination for least absolute deviation analysis. Statistics & Probability Letters, 5(1):49–54, January 1987. ISSN 01677152.
• Mevik et al. [2011] B.-H. Mevik, R. Wehrens, and K.H. Liland. pls: Partial Least Squares and Principal Component regression, 2011. R package version 2.3-0.
• Nagelkerke [1991] N. J. D. Nagelkerke. A note on a general definition of the coefficient of determination. Biometrika, 78:691–692, 1991. ISSN 00063444.
• Nguyen and Rocke [2002] D V Nguyen and D M Rocke. Partial least squares proportional hazard regression for application to DNA microarray survival data. Bioinformatics, 18:1625–1632, 2002. ISSN 1367-4803.
• O’Quigley et al. [2005] John O’Quigley, Ronghui Xu, and Janez Stare. Explained randomness in proportional hazards models. Statistics in Medicine, 24:479–489, 2005. ISSN 02776715. doi: 10.1002/sim.1946.
• Park and Hastie [2007] Mee Young Park and Trevor Hastie. L1-regularization path algorithm for generalized linear models. Journal of the Royal Statistical Society. Series B: Statistical Methodology, 69:659–677, 2007. ISSN 13697412.
• Park and Hastie [2013] M.Y. Park and T. Hastie. glmpath: Regularization Path for Generalized Linear Models and Cox Proportional Hazards Model, 2013. R package version 0.97.
• Park et al. [2002] P. J. Park, L. Tian, and I. S. Kohane. Linking gene expression data with patient survival times using partial least squares. Bioinformatics, 18(Suppl 1):S120–S127, July 2002. ISSN 1367-4803.
• Pepe et al. [2008] Margaret S. Pepe, Yingye Zheng, Yuying Jin, Ying Huang, Chirag R. Parikh, and Wayne C. Levy. Evaluating the ROC performance of markers for future events. Lifetime Data Analysis, 14:86–113, 2008. ISSN 13807870.
• Potapov et al. [2012] S. Potapov, W. Adler, and M. Schmid. survAUC: Estimators of prediction accuracy for time-to-event data, 2012. R package version 1.0-5.
• Radespiel-Tröger et al. [2003] M. Radespiel-Tröger, T. Rabenstein, H.T. Schneider, and B. Lausen. Comparison of tree-based methods for prognostic stratification of survival data. Artificial Intelligence in Medicine, 28(3):323–341, July 2003. ISSN 09333657.
• Rännar et al. [1994] Stefan Rännar, Fredrik Lindgren, Paul Geladi, and Svante Wold. A PLS kernel algorithm for data sets with many variables and fewer objects. Part 1: Theory and algorithm. Journal of Chemometrics, 8(2):111–125, March 1994. ISSN 0886-9383.
• Rosipal and Trejo [2002] Roman Rosipal and Leonard J Trejo. Kernel Partial Least Squares Regression in Reproducing Kernel Hilbert Space. Journal of Machine Learning Research, 2:97–123, 2002. ISSN 15324435.
• Rosthøj and Keiding [2004] Susanne Rosthøj and Niels Keiding. Explained variation and predictive accuracy in general parametric statistical models: the role of model misspecification. Lifetime data analysis, 10(4):461–72, December 2004. ISSN 1380-7870.
• Schemper and Henderson [2000] M Schemper and R Henderson. Predictive accuracy and explained variation in Cox regression. Biometrics, 56(1):249–255, March 2000. ISSN 0006-341X.
• Schemper and Stare [1996] Michael Schemper and Janez Stare. Explained variation in survival analysis. Statistics in Medicine, 15:1999–2012, 1996. ISSN 02776715.
• Schmid et al. [2011] Matthias Schmid, Thomas Hielscher, Thomas Augustin, and Olaf Gefeller. A Robust Alternative to the Schemper-Henderson Estimator of Prediction Error. Biometrics, 67:524–535, 2011. ISSN 0006341X.
• Schoop et al. [2008] R. Schoop, E. Graf, and M. Schumacher. Quantifying the predictive performance of prognostic models for censored survival data with time-dependent covariates. Biometrics, 64:603–610, 2008. ISSN 0006341X.
• Schröder et al. [2011] Markus S. Schröder, Aedín C. Culhane, John Quackenbush, and Benjamin Haibe-Kains. survcomp: An R/BioconductoR package for performance assessment and comparison of survival models. Bioinformatics, 27(22):3206–3208, 2011. ISSN 13674803.
• Schumacher et al. [2007] Martin Schumacher, Harald Binder, and Thomas Gerds. Assessment of survival prediction models based on microarray data. Bioinformatics (Oxford, England), 23(14):1768–74, July 2007. ISSN 1367-4811.
• Segal [2006] Mark R. Segal. Microarray gene expression data with linked survival phenotypes: Diffuse large-B-cell lymphoma revisited. Biostatistics, 7:268–285, 2006. ISSN 14654644.
• Simon et al. [2011] Noah Simon, Jerome Friedman, Trevor Hastie, and Rob Tibshirani. Regularization paths for Cox’s proportional hazards model via coordinate descent. Journal of Statistical Software, 39:1–13, 2011. ISSN 15487660.
• Sohn et al. [2009] Insuk Sohn, Jinseog Kim, Sin Ho Jung, and Changyi Park. Gradient lasso for Cox proportional hazards model. Bioinformatics, 25(14):1775–1781, 2009. ISSN 13674803.
• Song and Zhou [2008] Xiao Song and Xiao-hua Zhou. A semiparametric approach for the covariate specific ROC curve with survival outcome. Statistica Sinica, 18:947–965, 2008. ISSN 10170405.
• Tenenhaus et al. [2007] Arthur Tenenhaus, Alain Giron, Emmanuel Viennet, Michel Béra, Gilbert Saporta, and Bernard Fertil. Kernel logistic PLS: A tool for supervised nonlinear dimensionality reduction and binary classification. Computational Statistics and Data Analysis, 51(9):4083–4100, 2007. ISSN 01679473.
• Tenenhaus [1998] M. Tenenhaus. La régression PLS: théorie et pratique. Éditions Technip, Paris, 1998. ISBN 9782710807353.
• Tenenhaus [1999] M. Tenenhaus. La régression logistique PLS. In Proceedings of the 32èmes journées de Statistique de la Société française de Statistique, pages 721–723. FES, 1999.
• Therneau [2013] T.M. Therneau. A Package for Survival Analysis in S, 2013. R package version 2.37-4.
• Therneau and Grambsch [2000] T.M. Therneau and P.M. Grambsch. Modeling Survival Data: Extending the Cox Model. Statistics for Biology and Health. Springer, New York, 2000. ISBN 9780387987842.
• Tibshirani [2009] R. Tibshirani. uniCox: Univariate shrinkage prediction in the Cox model, 2009. R package version 1.0.
• Uno et al. [2007] Hajime Uno, Tianxi Cai, Lu Tian, and Lee-Jen Wei. Evaluating Prediction Rules for t-Year Survivors With Censored Regression Models. Journal of the American Statistical Association, 102:527–537, 2007. ISSN 0162-1459.
• van Houwelingen et al. [2006] Hans C. van Houwelingen, Tako Bruinsma, Augustinus A M Hart, Laura J. van’t Veer, and Lodewyk F A Wessels. Cross-validated Cox regression on microarray gene expression data. Statistics in Medicine, 25(18):3201–3216, 2006. ISSN 02776715. doi: 10.1002/sim.2353.
• van Wieringen et al. [2009] Wessel N. van Wieringen, David Kun, Regina Hampel, and Anne-Laure Boulesteix. Survival prediction using gene expression data: A review and comparison. Computational Statistics & Data Analysis, 53(5):1590–1603, 2009. ISSN 01679473.
• Verweij and Van Houwelingen [1993] P J Verweij and H C Van Houwelingen. Cross-validation in survival analysis. Statistics in medicine, 12(24):2305–2314, 1993. ISSN 0277-6715.
• Wold [1966] H. Wold. Estimation of Principal Components and Related Models by Iterative Least Squares. In P.R. Krishnaiah, editor, Multivariate Analysis, pages 391–420. Academic Press, New York, 1966.
• Wold et al. [1983] S Wold, H Martens, and H Wold. The multivariate calibration problem in chemistry solved by the PLS method. In A. Ruhe and B. Kågström, editors, Proc. Conf. Matrix Pencils, Lecture Notes in Mathematics, pages 286–293, Heidelberg, 1983. Springer-Verlag.
• Wold et al. [2001] Svante Wold, Michael Sjöström, and Lennart Eriksson. PLS-regression: A basic tool of chemometrics. Chemometrics and Intelligent Laboratory Systems, 58:109–130, 2001. ISSN 01697439.
• Xu and O’Quigley [1999] Ronghui Xu and John O’Quigley. An R² type measure of dependence for proportional hazards models. Journal of Nonparametric Statistics, 12(1):83–107, 1999.
• Zou and Hastie [2005] Hui Zou and Trevor Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2):301–320, April 2005. ISSN 1369-7412.
• Zou et al. [2006] Hui Zou, Trevor Hastie, and Robert Tibshirani. Sparse Principal Component Analysis. Journal of Computational and Graphical Statistics, 15(2):265–286, June 2006. ISSN 1061-8600.

## Supplemental Information

### Insights on the implementation of the methods

We detail the implementation of the algorithms used in the article, starting with the properties they share. Whenever deviance residuals or survival models had to be derived, we used the survival R-package (Therneau and Grambsch, 2000, Therneau, 2013). As the PLS regression engine of the plsRcox R-package, we wrote three wrappers, based on either the pls function of the pls R-package (Mevik et al., 2011), the plsR function of the plsRglm R-package (Bertrand et al., 2014b) or the pls function of the mixOmics R-package (Dejean et al., 2013). The last two rely on the NIPALS algorithm and hence automatically handle missing data in the explanatory variables (Tenenhaus, 1998). In addition, the pls function of the mixOmics R-package (Dejean et al., 2013) quickly handles large datasets such as Dataset 5, with 242 rows and 44754 variables. As a consequence, we modified the spls function of the spls R-package to use this function instead of the pls function of the pls R-package (Mevik et al., 2011).
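The missing-data property of NIPALS mentioned above comes from the fact that each regression step in the algorithm can be restricted to the observed cells. The following minimal base-R sketch of a one-component NIPALS extraction illustrates the idea; it is not the code of the plsRglm or mixOmics packages, and the function name is ours.

```r
# Minimal NIPALS sketch (one component) showing how missing cells are
# skipped rather than imputed. Illustrative only; not package source code.
nipals_component <- function(X, tol = 1e-9, max_iter = 500) {
  # Start from the column with the most observed values
  t_vec <- X[, which.max(colSums(!is.na(X)))]
  t_vec[is.na(t_vec)] <- 0
  p_vec <- rep(0, ncol(X))
  for (it in seq_len(max_iter)) {
    # Loadings: regress each column of X on t, using observed cells only
    p_vec <- sapply(seq_len(ncol(X)), function(j) {
      obs <- !is.na(X[, j])
      sum(X[obs, j] * t_vec[obs]) / sum(t_vec[obs]^2)
    })
    p_vec <- p_vec / sqrt(sum(p_vec^2))
    # Scores: regress each row of X on p, again on observed cells only
    t_new <- sapply(seq_len(nrow(X)), function(i) {
      obs <- !is.na(X[i, ])
      sum(X[i, obs] * p_vec[obs]) / sum(p_vec[obs]^2)
    })
    if (sum((t_new - t_vec)^2) < tol) { t_vec <- t_new; break }
    t_vec <- t_new
  }
  list(scores = t_vec, loadings = p_vec)
}

set.seed(1)
X <- scale(matrix(rnorm(50), 10, 5))
X[sample(length(X), 5)] <- NA   # introduce missing cells
comp <- nipals_component(X)
```

The same mechanism, iterated with deflation of X, yields the full sequence of components, which is why the NIPALS-based wrappers inherit missing-data support with no extra work.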

1. The authors made a wrapper in the plsRcox R-package (Bertrand et al., 2014a) to derive a sPLSDR implementation from the spls R-package (Chung et al., 2012).

2. The authors made a wrapper in the plsRcox R-package (Bertrand et al., 2014a) to derive a DKsPLSDR implementation from the spls (Chung et al., 2012) and the kernlab (Karatzoglou et al., 2004) R-packages.

3. The coxpath implementation was the coxpath function found in the glmpath R-package (Park and Hastie, 2013).

4. The coxnet implementation was the glmnet function, with the Cox family option, in the glmnet R-package (Friedman et al., 2010, Simon et al., 2011).

5. The coxlars implementation was the glmnet function, with the Cox family option and the alpha option set to 1 (the lasso penalty), in the glmnet R-package (Friedman et al., 2010, Simon et al., 2011).

6. The coxridge implementation was the glmnet function, with the Cox family option and the alpha option set to 0 (the ridge penalty), in the glmnet R-package (Friedman et al., 2010, Simon et al., 2011).

7. PLS-Cox has not yet been implemented in R and the authors made it available as the function plsRcox of the plsRcox R-package (Bertrand et al., 2014a).

8. autoPLS-Cox is PLS-Cox with sparse PLS components and automatic selection of their optimal number. The computed components are sparse since, to receive a nonzero weight, a variable must be significant at a given level in the Cox regression of the response on this variable, adjusted for the previously derived components. The algorithm stops adding components as soon as none of the explanatory variables remains significant at that level. The authors made it available as the function plsRcox of the plsRcox R-package (Bertrand et al., 2014a). The number of components can also be determined by cross-validation, the components remaining sparse.
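The sparse-weight construction and stopping rule just described can be sketched as follows, using only the survival package. The helper build_sparse_components is hypothetical (it is not the plsRcox source code) and uses the adjusted Cox coefficient as the weight of a significant variable, which is one simple choice among several.

```r
# Hypothetical sketch of the autoPLS-Cox component construction: a variable
# enters the current component only if it is significant in a Cox regression
# adjusted for the components already built; we stop when none qualifies.
library(survival)

build_sparse_components <- function(X, time, event, alpha = 0.05, max_comp = 5) {
  comps <- NULL
  for (h in seq_len(max_comp)) {
    w <- sapply(seq_len(ncol(X)), function(j) {
      d <- data.frame(x = X[, j], comps)
      fit <- coxph(Surv(time, event) ~ ., data = d)
      s <- summary(fit)$coefficients
      pv <- s["x", "Pr(>|z|)"]
      # Sparse weight: adjusted Cox coefficient if significant, else zero
      if (!is.na(pv) && pv < alpha) s["x", "coef"] else 0
    })
    if (all(w == 0)) break            # stopping rule: no significant variable
    t_h <- as.vector(X %*% (w / sqrt(sum(w^2))))
    comps <- cbind(comps, t_h)
    colnames(comps) <- paste0("t", seq_len(ncol(comps)))
  }
  comps
}

set.seed(2)
X <- matrix(rnorm(200), 40, 5)
time <- rexp(40, exp(1.5 * X[, 1]))  # hazard driven by the first variable
event <- rbinom(40, 1, 0.9)
comps <- build_sparse_components(X, time, event)
```

In plsRcox itself the significance level and the cross-validated alternative mentioned above are exposed as options; the sketch only shows why the resulting components are sparse by construction.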

9. The authors made a wrapper in the plsRcox R-package (Bertrand et al., 2014a) to derive a LARS-LassoDR implementation from the lars R-package (Hastie and Efron, 2012).

10. The authors made a wrapper in the plsRcox R-package (Bertrand et al., 2014a) to derive a Cox-PLS implementation from either the pls or plsRglm R-packages.

11. The authors made a wrapper in the plsRcox R-package (Bertrand et al., 2014a) to derive a PLSDR implementation from either the pls or plsRglm R-packages.
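The deviance-residual (DR) family of wrappers above all follow the same two-stage pattern: compute the deviance residuals of the null Cox model with the survival package, then run a PLS-type regression on them. A minimal sketch of the first PLS component, computed directly rather than through the pls/plsRglm wrappers, assuming simulated data:

```r
# Sketch of the PLSDR idea: deviance residuals of the null Cox model serve
# as a continuous response for PLS regression. Illustrative only.
library(survival)

set.seed(3)
X <- scale(matrix(rnorm(200), 40, 5))
time <- rexp(40); event <- rbinom(40, 1, 0.7)

null_fit <- coxph(Surv(time, event) ~ 1)        # null Cox model
dr <- residuals(null_fit, type = "deviance")    # deviance residuals

w <- crossprod(X, dr)                           # PLS weight vector
w <- w / sqrt(sum(w^2))
t1 <- as.vector(X %*% w)                        # first PLS component

# The component is then used as a covariate in a final Cox model
final_fit <- coxph(Surv(time, event) ~ t1)
```

Replacing X by a kernel matrix in this sketch gives the (DK)PLSDR variants, and replacing the plain PLS weight step by a sparse one gives sPLSDR.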

12. The authors made a wrapper in the plsRcox R-package (Bertrand et al., 2014a) to derive a DKPLSDR implementation from the pls (Mevik et al., 2011) and the kernlab (Karatzoglou et al., 2004) R-packages.

13. The uniCox implementation was the uniCox function found in the uniCox R-package (Tibshirani, 2009).

14. The glcoxph implementation was the glcoxph function found in the glcoxph R-package (Kim, 2009).
