Cross validating extensions of kernel, sparse or regular partial least squares regression models to censored data.
Abstract
When crossvalidating standard or extended Cox models, the commonly used criterion is the crossvalidated partial loglikelihood using a naive or a van Houwelingen scheme to make efficient use of the death times of the left out data in relation to the death times of all the data. Quite astonishingly, we will show, using a strong simulation study involving three different data simulation algorithms, that these two crossvalidation methods fail with the extensions, either straightforward or more involved ones, of partial least squares regression to the Cox model.
This is quite an interesting result for at least two reasons. Firstly, several nice features of PLS based models, including regularization, interpretability of the components, missing data support, data visualization thanks to biplots of individuals and variables and even parsimony for SPLS based models, account for a common use of these extensions by statisticians who usually select their hyperparameters using crossvalidation. Secondly, they are almost always featured in benchmarking studies to assess the performance of a new estimation technique used in a high dimensional context and often show poor statistical properties.
We carried out a vast simulation study to evaluate more than a dozen of potential crossvalidation criteria, either AUC or prediction error based. Several of them lead to the selection of a reasonable number of components. Using these newly found crossvalidation criteria to fit extensions of partial least squares regression to the Cox model, we performed a benchmark reanalysis that showed enhanced performances of these techniques.
The Rpackage plsRcox used in this article is available on the CRAN, http://cran.rproject.org/web/packages/plsRcox/index.html.
1 Introduction
Regular PLS regression is used to find the fundamental relations between two matrices ( and ), i.e. a latent variable approach to modeling the covariance structures in these two spaces. A PLS regression model will try to find iteratively the multidimensional direction in the space that explains the maximum multidimensional variance direction in the space. A key step in PLSR, is to select the right unknown number of these latent variables (called components) to use. PLS regression is particularly suited when the matrix of predictors has more variables than observations, and when there is multicollinearity among values. By contrast, standard regression will fail in these cases (unless it is regularized).
PLS has become an established tool in various experimental including chemometric, networks, or systems biology modeling, primarily because it is often possible to interpret the extracted components in terms of the underlying physical system that is, to derive “hard” modeling information from the soft model: chemical components for NIR spectra, gene subnetwork for GRN or biological function for systems biology. As a consequence, choosing the right number of components is not only a major aim to avoid under or overfitting and ensure a relevant modeling or good predicting ability but also per se.
A vast literature from the last decade is devoted to relating gene profiles and subject survival or time to cancer recurrence. Biomarker discovery from highdimensional data, such as transcriptomic or SNP profiles, is a major challenge in the search for more precise diagnoses. The proportional hazard regression model suggested by Cox, 1972 to study the relationship between the time to event and a set of covariates in the presence of censoring is the most commonly used model for the analysis of survival data. However, like multivariate regression, it supposes that more observations than variables, complete data, and not strongly correlated variables are available. In practice when dealing with highdimensional data, these constraints are crippling.
In this article we deal with several PLS regression based extensions of the Cox model. These extensions share features praised by practionners, including regularization, interpretability of the components, missing data support, biplots of individuals and variables and even parsimony for SPLS based models, and allow to deal with highly correlated predictors or even rectangular datasets, which is especially relevant for high dimensional datasets.
2 Models
2.1 Modeling censored data
2.1.1 The Cox proportional hazards model
The model assumes the following hazard function for the occurrence of an event at time in the presence of censoring:
(1) 
where is an unspecified baseline hazard function, the vector of the regression coefficients and the matrix of prognosis factors which will be the gene expressions in the following. The event could be death or cancer relapse. Based on the available data, the Cox’s partial likelihood can be written as:
(2) 
where is the set of indices of the events and denotes the set of indices of the individuals at risk at time .
The goal is to find the coefficients which maximize the log partial likelihood function
(3) 
The vector is the solution of the equation:
(4) 
with the vector of efficient scores.
However, there is no explicit solution and the minimization is generally accomplished using the NewtonRaphson procedure. An estimate of the vector of parameters at the cycle of the iterative procedure is:
(5) 
where is the observed information matrix. The process can be started by taking and iterated up to convergence, i.e. when the change in the log likelihood function is small enough. When the iterative procedure has converged, the variancecovariance matrix of the parameter estimates can be approximated by the inverse of the observed information matrix .
Note that when , there is no unique to maximize this log partial likelihood function. Even when , covariates could be highly correlated and regularization may still be required in order to reduce the variances of the estimates and to improve the predictive performance.
2.1.2 Deviance Residuals
For the Cox model with no timedependent explanatory variables and at most one event per patient, the martingale residuals for the subject with observation time and event status , where if is a censored time, and otherwise is:
(6) 
with the estimated cumulative hazard function at time .
Martingale residuals are highly skewed. The deviance residuals are a normalized transform of the martingale residuals. For the Cox model, the deviance residuals (Collett, 1994) amount to the form:
(7) 
The function is to ensure that the deviance residuals have the same sign as the martingale residuals. Martingale residuals take values between and . The square root shrinks large negative martingale residuals, while the logarithmic transformation expands towards martingale residuals that are close to . As such, the deviance residuals are more symmetrically distributed around zero than the martingale residuals. The deviance residual is a measure of excess of death and can therefore be interpreted as a measure of hazard. Moreover, Segal showed that the expression to be minimized in step 3 of the CoxLasso procedure of Tibshirani can be approximated, in a first order Taylorseries approximation sense, by the deviance residual sum of squares:
(8) 
with , , , and .
2.2 PLS regression models and extensions
2.2.1 Plsr
Prediction in highdimensional and lowsample size settings already arose in chemistry in the eighties. PLS regression, that can be viewed as a regularization method based on dimension reduction, was developed as a chemometric tool in an attempt to find reliable predictive models with spectral data (Wold et al., 1983; Tenenhaus, 1998). Nowadays, the difficulty encountered with the use of genomic or proteomic data for classification or prediction, using very large matrices, is of comparable nature. It was thus natural to use PLS regression principles in this new context. The method starts by constructing latent components, using linear combinations of the original variables, which are then used as new descriptors in standard regression analysis. Different from the principal components analysis (PCA), this method makes use of the response variable in constructing the latent components. The PLS regression can be viewed as a regularized approach searching the solution in a subspace named Krylov space giving biased regression coefficients but with lower variance. In the framework of censored genomic data, the PLS regression operates a reduction of the dimensionality of the gene’s space oriented towards the explanation of the hazard function. It allows transcriptomic signatures correlated to survival to be determined.
2.2.2 Sparse PLSR
Recently, Chun and Keles, 2010, provided both empirical and theoretical results showing that the performance of PLS regression was ultimately affected by the large number of predictors. In particular, a higher number of irrelevant variables leads to inconsistency of coefficient estimates in linear regression setting. There is a need to filter the descriptors as a preprocessing step before PLS fit. However, commonly used variables filtering approaches are all univariate and ignore correlation between variables. To solve these issues, Chun and Keles proposed ”sparse PLS regression” which promotes variables selection within the course of PLS dimension reduction. sPLS has the ability to include variables that variable filtering would select in the construction of the first direction vector. Moreover, it can select additional variables, i.e., variables that become significant once the response is adjusted for other variables in the construction of the subsequent direction vectors. This is the case of ”proxy genes” acting as suppressor variables which do not predict the outcome variable directly but improve the overall prediction by enhancing the effects of prime genes despite having no direct predictive power, Magidson and Wassmann, 2010.
A direct extension of PLS regression to sPLS regression could be provided by imposing constraint on PLS direction vector :
When =, the objective function coincides with that of sPCA (Jolliffe et al., 2003). However in that case Jolliffe et al. pointed out that the solution tends not to be sparse enough and the problem is not convex. To solve these issues, Chun and Keles provided an efficient implementation of sPLS based on the LARS algorithm by generalizing the regression formulation of sPCA of Zou et al., 2006:
This formulation promotes exact zero property by imposing penalty onto a surrogate of the direction vector instead of the original direction while keeping and close to each other. The penalty takes care of the potential singularity of . Moreover, they demonstrated that for univariate PLS, regressed on , the first direction vector of the sparse PLS algorithm was obtained by soft thresholding of the original PLS direction vector:
(9) 
In order to inherit the property of the Krylov subsequences which is known to be crucial for the correctness of the algorithm (Krämer, 2007), the thresholding phase is followed by a PLS regression on the previously selected variables. The algorithm is then iterated with replaced by , the residuals of the PLS regression based on the variables selected from the previous steps. The sPLS algorithm leads therefore to sparse solutions by keeping the Krylov subsequence structure of the direction vectors in a restricted space which is composed of the selected variables. The thresholding parameter and the number of hidden components are tuned by cross validation.
sPLS has connections to other variable selection algorithms including the elastic net method (Zou and Hastie, 2005) and the threshold gradient method (Friedman and Popescu, 2004). The elastic net algorithm deals with the collinearity issue in variable selection problem by incorporating the ridge regression method into the LARS algorithm. In a way, sPLS handles the same issue by fusing the PLS technique into the LARS algorithm. sPLS can also be related to threshold gradient method in that both algorithms use only thresholded gradient and not the Hessian. However, sPLS achieves fast convergence by using conjugate gradient. Hence, LARS and sPLS algorithms use the same criterion to select active variables in the univariate case. However, the sPLS algorithm differs from LARS in that sPLS selects more than one variable at a time and utilizes the conjugate gradient method to compute coefficients at each step. The computational cost for computing coefficients at each step of the sPLS algorithm is less than or equal to the computational cost of computing step size in LARS since conjugate gradient methods avoid matrix inversion.
2.3 Extensions of PLSR models to censored data
2.3.1 PLSCox
Garthwaite, 1994, showed that PLS regression could be obtained as a succession of simple and multiple linear regressions. Tenenhaus, 1999, proposed a fairly similar approach but one which could cope with missing data by using the principles of the Nipals algorithm (Wold, 1966). As a result, Tenenhaus suggested that PLS regression be extended to logistic regression (PLSLR) by replacing the succession of simple and multiple regressions by a succession of simple and multiple logistic regressions in an approach much simpler than that developed by Marx, 1996. By using this alternative formulation of the PLS regression, Bastien and Tenenhaus, 2001, extended the PLS regression to any generalized linear regression model (PLSGLR) and to the Cox model (PLSCox) as a special case. Further improvements have then been described (Bastien et al., 2005) in the case of categorical descriptors with variable selection using hard thresholding and model validation by bootstrap resampling. Since then many developments in the framework of PLS and Cox regressions have appeared in the literature. Nguyen and Rocke, 2002, directly applied PLS regression to survival data and used the resulting PLS components in the Cox model for predicting survival time. However such a direct application did not really generalize PLS regression to censored survival data since it did not take into account the failure time in the dimension reduction step. Based on a straightforward generalization of Garthwaite, 1994, Li and Gui, 2004, presented a solution, Partial Cox Regression, quite similar to the one proposed by Bastien and Tenenhaus, using different weights to derive the PLS components but not coping with missing data.
2.3.2 (Dk)(s)pls(dr)
The PLSDR algorithm (Bastien, 2008)
Following Segal, 2006, who suggested initially computing the null deviance residuals and then using these as outcomes for the LARSLasso algorithm, Bastien, 2008, proposed PLSDR, an alternative in highdimensional settings using deviance residuals based PLS regression. This approach is advantageous both by its simplicity and its efficiency because it only needs to carry out null deviance residuals using a simple Cox model without covariates and use these as outcome in a standard PLS regression. The final Cox model is then carried out on the retained PLSDR components.
Moreover, following the principles of the Nipals algorithm, weights, loadings, and PLS components are computed as regression slopes. These slopes may be computed even when there are missing data: let the value of the PLS component for individual , with the vector of residual descriptors and the vector of weights at step . represents the slope of the OLS line without constant term related to the cloud of points . In such case, in computing the PLS component, the denominator is computed only on the data available also for the denominator.
The DKsPLSDR algorithm (Bastien et al., 2015)
In the case of very many descriptors, PLS regression being invariant by orthogonal transformation (De Jong and Ter Braak, 1994), an even faster procedure could be derived by replacing the matrix by the matrix of principal components . This could be viewed as the simple form of linear kernel PLS regression algorithms which have been proposed in the nineties (Lindgren et al., 1993; Rännar et al., 1994) to solve computational problems posed by very large matrices in chemometrics. The objective of these methods was to obtain PLS components by working on a condensed matrix of a considerably smaller size than the original one. Moreover, in addition to dramatically reducing the size of the problem, nonlinear pattern in the data could also be analyzed using nonlinear kernel.
Rosipal and Trejo, 2002, proposed a nonlinear extension of PLS regression using kernels. Assuming a nonlinear transformation of the input variables into a feature space , i.e. a mapping , their goal was to construct a linear PLS regression model in . They derived an algorithm named KPLS for Kernel PLS by performing the PLS regression on . It amounts to replacing, in the expression of PLS components, the product by using the socalled kernel trick which allows the computation of dot products in highdimensional feature spaces using simple functions defined on pairs of input patterns: . This avoids having to explicitly calculate the coordinates in the feature space which could be difficult for a highly dimensional feature space. By using the kernel functions corresponding to the canonical dot product in the feature space, nonlinear optimization can be avoided and simple linear algebra used.
Bennett and Embrecht, 2003, proposed to perform PLS regression directly on the kernel matrix instead of . DKPLS corresponds to a low rank approximation of the kernel matrix. Moreover, Tenenhaus et al., 2007 demonstrated that, for one dimensional output response, PLS of (KPLS) is equivalent to PLS on (DKPLS).
Using previous works, it becomes straightforward to derive a nonlinear Kernel sPLSDR algorithm by replacing in the sPLSDR algorithm the matrix by a kernel matrix . The main kernel functions are the linear kernel () and the Gaussian kernel ().
However, nonlinear kernel (sparse) PLS regression loses the explanation with the original descriptors unlike linear kernel PLS regression, which could limit the interpretation of the results.
3 Simulation studies
3.1 Scheme of the studies
The aim of our two in silico studies is twofold: evaluating the accuracy of the cross validation methods, see Section 4, and revisit the performance of the component based methods, see Section 5.
We performed a simulation study (Algorithm 3) in order to evaluate the methods by simulating 100 datasets with exponential survival distribution and 40% censored rate (100 observations 1000 genes) according to three different simulation types (cluster by Bair et al. (2006), factorial by Kaiser and Dickman (1962) and Fan et al. (2002) or eigengene by Langfelder et al. (2013)), using either no link or a linear one between the response and the predictors.
We divided each of these 600 datasets into a training set, of 7/10 (70) of the observations, used for estimation, and a test set, of 3/10 (30) of the observations, used for evaluation or testing of the prediction capability of the estimated model. This choice was made to stay between the 2:1 scheme of Bøvelstad et al. (2007); van Wieringen et al. (2009); LambertLacroix and Letué (2011) and the 9:1 scheme of Li (2006). The division between training and test sets was balanced using the caret
package, Kuhn (2014), both according to the response value and censor rate.
3.2 Data generation
3.2.1 Eigengene: Langfelder et al. (2013)
Given module seeds and a desired size for the genes modules around the seeds of genes, module genes expression profiles are generated such that the th rank correlated gene from module with its module seed is :
(10) 
that is, the first gene has correlation with the seed while the last (th) gene has correlation .
The required correlation (10) is achieved by calculating the th gene profile as the sum of the seed vector and a noise term
(11) 
This technique produces modules consisting of genes distributed symmetrically around the module seed; in this sense, the simulated modules are spherical clusters whose centers coincide (on average) with the module seed.
In the simulations the parameters have been let as follow , , with and .
Survival and censoring times, with censoring probability, are generated from exponential survival distributions. When linked to survival (linear or quadratic case), only expressions from genes from the first two modules () are related to survival time.
Each simulated data set consists of genes and samples. Only the first hundred genes are structured. The last are random noise generated from .
3.2.2 Cluster Bair et al. (2006)
The gene expression data is distributed as:
(12) 
Where the are drawn from a .
Each simulated data set consists of genes and samples. Survival and censoring times, with censoring probability, are generated from exponential survival distributions. When linked to survival (linear or quadratic case), only expressions from genes from the first genes are related to survival.
3.2.3 Factorial Kaiser and Dickman (1962), Fan et al. (2002)
We have supposed that genes expressions are related to latent variables associated each to a specific biological function. Let for each group a specified population intercorrelation pattern matrix . By applying principal component factorization (PCA) to the matrix and following Kaiser and Dickman, we can generate multivariate normally distributed sample data with a specific correlation pattern.
Where:
is the number of descriptors (genes)
is the number of observations
is a matrix of uncorrelated random normal variables
is a matrix containing principal component factor pattern coefficients obtained by applying Principal Components Analysis (PCA) to the given population correlation matrix
is the resultant sample data matrix, as if sampled from a population with the given population correlation matrix
We have chosen a compound symmetry structure for the correlation matrix with a same correlation () between two descriptors of a same group, descriptors between different groups being independent.
Moreover the choice of the correlation coefficient allows specifying the percentage of variance explained by the first factorial axes. Given four groups with intergenes correlation coefficient of corresponds to expend of the inertia in principal directions.
Survival and censoring times, with censoring probability, are generated from exponential survival distributions. When linked to survival (linear or quadratic case), only expressions from genes from the first two groups () are related to survival time.
Each simulated data set consists of genes and samples. Only the first hundred genes are structured. The last are random noise generated from .
Figure 1 displays the pattern of correlation for the first 150 descriptors with the four groups of genes each well defined.
3.3 Hyperparameters and crossvalidation
In standard fold crossvalidation of a dataset of size , folds of size are created by sampling from the data without replacement and each of the remaining data points is assigned randomly to a different fold. In stratified or balanced crossvalidation (Breiman et al., 1984, p. 246), the data are first ordered by the response value or class. This list is broken up into bins each containing points with many similar response values. Any remaining points at the end of the list are assigned to an additional bin. A fold is formed by sampling one point without replacement from each of the bins. Except for the ordering of the data, this is equivalent to standard crossvalidation. We used balanced crossvalidation with respect to the response value and censor rate. The folds were design using the caret
package, Kuhn (2014).
In traditional crossvalidation, i.e. with a dataset without censored events, each fold would yield a test set and a value of a prediction error measure (for instance log partial likelihood, integrated area under the curve, integrated area under the prediction error curve). When dealing with censored events and using the CV partial likelihood (CVLL, Verweij and Van Houwelingen (1993)) criterion, it is possible to make more efficient use of risk sets: van Houwelingen et al. (2006) recommended to derive the CV log partial likelihood for the th fold by subtraction; by subtracting the log partial likelihood evaluated on the full dataset from that evaluated on the full dataset minus the th fold, called the dataset. This yields the van Houwelingen CV partial likelihood (vHCVLL) .
Hyperparameters were tuned using 7fold crossvalidation on the training set. The number of folds was chosen following the recommandation of Wold et al. (2001), Breiman and Spector (1992) and Kohavi (1995). As in, Bøvelstad et al. (2007), van Wieringen et al. (2009) and LambertLacroix and Letué (2011), mean values were then used to summarize these cross validation criteria over the 7 runs and the hyperparameters were chosen according to the best values of these measures.
4 Highlighting relevant cross validation criteria
4.1 The failure of the two usual criteria
The van Houwelingen CV partial likelihood (vHCVLL, see Figure 4.4) criterion behave poorly for all the PLS or sPLS based methods by selecting zero components where, according to our simulation types, the PLSCox, autoPLSCox, CoxPLS, PLSDR, sPLSDR, DKPLSDR and DKsPLSDR methods were expected to select, for the factor or eigengene schemes, about two components and slightly more for the cluster scheme. As with the the classic CV partial likelihood (CVLL), it almost always selects at most one component and hence systematically underestimates the number of components. The simulations results for the selection of the number of components using CVLL are plotted on Figure 4.4. We confirmed this poor property by performing crossvalidation on a simpler simulation scheme designed by Simon et al. (2011).
4.2 Proposal of new criteria
As a consequence, we had to search for other CV criteria (CVC) for the models featuring components. Li (2006) used the integrated area under the curves of timedependent ROC curves (iAUCsurvROC, Heagerty et al. (2000)) to carry out his crossvalidations, implemented in the survcomp
package, (Schröder et al., 2011). Apart from that criterion (Figure 4.4) we added five other integrated AUC measures: integrated Chambless and Diao’s (2006) estimator (iAUCCD, Figure 4.4), integrated Hung and Chiang’s (2010) estimator (iAUCHC, Figure 4.4), integrated Song and Zhou’s (2008) estimator (iAUCSH, Figure 4.4), integrated Uno et al.’s (2007) estimator (iAUCUno, Figure 4.4) and integrated Heagerty and Zheng’s (2005) estimator (iAUCHZ, Figure 4.4) of cumulative/dynamic AUC for rightcensored timetoevent data, implemented in the survAUC
package, Potapov et al. (2012), and the risksetROC
package, Heagerty and packaging by Paramita SahaChaudhuri (2012). We also studied two versions of two prediction error criteria, the integrated (un)weighted Brier Score (Graf et al. (1999), Gerds and Schumacher (2006), iBS(un)w, integrated (un)weighted squared deviation between predicted and observed survival, Figures 4.4 and 4.4) and the integrated (un)weighted Schmid Score (Schmid et al. (2011), iSS(un)w, integrated (un)weighted absolute deviation between predicted and observed survival, Figure 4.4 and 4.4), also implemented in the survAUC
package, Potapov et al. (2012).
4.3 Analysis of the results
The simulation results highlighted the integrated Song and Zhou’s estimator of cumulative/dynamic AUC for rightcensored timetoevent data (iAUCSH), implemented in the survAUC
package, Potapov et al. (2012), as the best CV criterion for the PLSCox and the autoPLSCox methods even though it behaves poorly in all the other cases.
As for the other models featuring components, the iAUCsurvROC, iAUCUno criterion exhibited the best performances.
The two unweighted criteria iBSunw and iSSunw uniformly fail for all the models.
The iBSw criterion is too conservative and wrongly selects null models in more than half of the cases in the linear link scheme and in almost every times in the quadratic scheme.
The iSSw provides very poor results for CoxPLS, sPLSDR and DKsPLSDR and average for PLSDR and DKPLSDR methods.
The two models SPLSDR and DKSPLSDR use an additional parameter: the thresholding parameter . The same figures were produced for all the criteria (results not shown): both iAUCUno criterion and iAUCsurvROC criterion provided a reasonable spread for the parameter.
4.4 Recommendation
In a word, this simulation campaign enables us to state the following recommendations to firmly improve the selection of the right number of components: use iAUCSH to crossvalidate PLSCox or autoPLSCox models and either iAUCUno or iAUCsurvROC to crossvalidate CoxPLS, PLSDR, sPLSDR, DKPLSDR and DKsPLSDR. We implemented these recommendations (iAUCSH for PLSCox or autoPLSCox models and iAUCsurvROC for CoxPLS, PLSDR, sPLSDR, DKPLSDR and DKsPLSDR) as the default cross validation techniques in the plsRcox
package. We will apply them in the remaining of the article to assess goodness of fit of the model.
5 Reassessing performance of (s)PLS
based models
We will now provide evidence that the changes of the crossvalidation criteria recommended in Section 4.4 actually lead to performance improvements for the fitted models.
5.1 Introduction to performance criteria analysis
We followed the methodological recommendations of van Wieringen et al. (2009) to design a simulation plan that ensures a good evaluation of the predictive performance of the models.
“The true evaluation of a predictor’s performance is to be done on independent data. In the absence of independent data (the situation considered here) the predictive accuracy can be estimated as follows Dupuy and Simon (2007). The samples are split into mutually exclusive training and test sets. The gene expression and survival data of the samples in the training set are used to build the predictor. No data from the test set are used in predictor construction (including variable selection) by any of the methods compared. This predictor is considered to be representative of the predictor built on all samples (of which the training set is a subset). The test set is used to evaluate the performance of the predictor built from the training set: for each sample in the test set, survival is predicted from gene expression data. The predicted survival is then compared to the observed survival and summarized into an evaluation measure. To avoid dependency on the choice of training and test set, this procedure is repeated for multiple splits. The average of the evaluation measures resulting from each split is our estimate of the performance of the predictor built using the data from all samples.”
As to the performance criteria themselves, Schmid et al. (2011) made several points that we will take into account to carry out our performance comparison analysis.
“Evaluating the prognostic performance of prediction rules for continuous survival outcomes is an important topic of recent methodological discussion in survival analysis. The derivation of measures of prediction accuracy for survival data is not straightforward in the presence of censored observations (Kent and O’Quigley (1988), Schemper and Stare (1996), Rosthøj and Keiding (2004)). This is mainly due to the fact that traditional performance measures for continuous outcomes (such as the mean squared error or the fraction of explained variation) lead to biased predictions if applied to censored data (Schemper and Stare (1996)).
To overcome this problem, a variety of new approaches has been suggested in the literature. These developments can be classified into three groups: “likelihoodbased approaches” (Nagelkerke (1991), Xu and O’Quigley (1999), O’Quigley et al. (2005)), “ROCbased approaches” (Heagerty et al. (2000), Heagerty and Zheng (2005), Cai et al. (2006), Uno et al. (2007), Pepe et al. (2008)), and “distancebased approaches” (Korn and Simon (1990), Graf et al. (1999), Schemper and Henderson (2000), Gerds and Schumacher (2006), 2007, Schoop et al. (2008)).
When using likelihoodbased approaches, the log likelihood of a prediction model is related to the corresponding log likelihood obtained from a “null model” with no covariate information. ROCbased approaches use the idea that survival outcomes can be considered as timedependent binary variables with levels event and no event, so that timedependent misclassification rates and ROC curves can be computed for each threshold of a predictor variable of interest. If distancebased approaches are applied, a measure of prediction error is given by the distance between predicted and observed survival functions of the observations in a sample. None of these approaches has been adopted as a standard for evaluating survival predictions so far.”
To assess the goodness of fit and prediction accuracy of all the methods, we found 23 performance measures (PM) that are commonly used (LRT, VarM, R2Nag, R2XO, R2OXS, iR2BSunw, iR2BSw, iRSSunw, iRSSw, iAUCCD, iAUCHC, iAUCSH, iAUCUno, IAUCHZ, iAUCSurvROC, C, UnoC, GHCI, SchemperV, iBSunw, iBSw, iSSunw, iSSw). We chose, on statistical grounds, 14 among them (LRT, R2XO, iR2BSw, iRSSw, iAUCCD, iAUCHC, iAUCSH, iAUCUno, IAUCHZ, iAUCSurvROC, GHCI, SchemperV, iBSw, iSSw) and reported the results of six indices of several kind: two like measure (a likelihoodbased approach (LBA), R2XO, and a distancebased approach (DBA), iRSSw), a index (GHCI), two (ROCbased approaches (ROCBA), iAUCCD and iAUCSurvROC), and an integrated robust prediction error (distancebased approach, iSSw), see Table 1. The results for the remaining eight indices are similar to those shown. We now explain our process of selection of the performance criteria.
Criterion  Type  As a Cross Validation Criterion  As a Performance Measure  

Criterion  Type  Tested  Results  Recom. for  Is a  Selected on  Results 
PM ?  statistical  
grounds  
CVLL  LBA  Yes  Yes  No  No  No  
vHCVLL  LBA  Yes  Yes  No  No  No  
LRT value  LBA  No  No  Yes  Yes  No  
VarM  LBA  No  No  Yes  No  No  
R2Nag  LBA  No  No  Yes  No  No  
R2XO  LBA  No  No  Yes  Yes  Yes  
R2OXS  LBA  No  No  Yes  No  No  
iR2BSunw  DBA  No  No  Yes  No  No  
iR2BSw  DBA  No  No  Yes  Yes  No  
iRSSunw  DBA  No  No  New  No  No  
iRSSw  DBA  No  No  New  Yes  Yes  
iAUCCD  ROCBA  Yes  Yes  Yes  Yes  Yes  
iAUCHC  ROCBA  Yes  Yes  Yes  Yes  No  
iAUCSH  ROCBA  Yes  Yes  PLSCox,  Yes  Yes  No 
autoPLSCox  
iAUCUno  ROCBA  Yes  Yes  (DK)(s)PLSDR  Yes  Yes  No 
CoxPLS  
iAUCHZ  ROCBA  Yes  Yes  Yes  Yes  No  
iAUCSurvROC  ROCBA  Yes  Yes  (DK)(s)PLSDR  Yes  Yes  Yes 
CoxPLS  
C  ROCBA  No  No  Yes  No  No  
UnoC  ROCBA  No  No  Yes  No  Sup. Info.  
GHCI  ROCBA  No  No  Yes  Yes  Yes  
SchemperV  DBA  No  No  Yes  Yes  No  
iBSunw  DBA  Yes  Yes  Yes  No  No  
iBSw  DBA  Yes  Yes  Yes  Yes  Sup. Info.  
iSSunw  DBA  Yes  Yes  Yes  No  No  
iSSw  DBA  Yes  Yes  Yes  Yes  Yes  
Total Number  25  12  12  23  14  6 (+2 SI) 
5.2 Selection of performance criteria
The likelihood ratio test (LRT, Lehmann and Romano (2005)) evaluates the null hypothesis , i.e. the built predictor has no effect on survival. The null hypothesis is evaluated using the likelihood ratio test statistic , with denoting the value of the loglikelihood function. Under the null hypothesis this test statistic has a distribution, which is used to calculate the value. The value summarizes the evidence against : the lower the value the more probable that is not true.The value of the likelihood ratio test has been used as an evaluation measure for predictive performance of gene expression based predictors of survival by many others Bair and Tibshirani (2004); Bøvelstad et al. (2007); Park et al. (2002); Segal (2006).
In the Cox model, an alternative measure of predictive performance is the variance of the martingale residuals (VarM, cf. section 2.1.2). As in van Wieringen et al. (2009), we found that this measure is not able to discriminate very well between good and poor predictors in the considered setting (data not shown). It is therefore omitted here.
To quantify the proportion of variability in survival data of the test set that can be explained by the predictor, we use the coefficient of determination (henceforth called ). A predictor with good predictive performance explains a high proportion of variability in the survival data of the test set, and vice versa a poor predictor explains little variability in the test set. However, the traditional definition of the cannot be used in the context of censored data and modified criteria have been proposed in the past. Three types of likelihoodbased coefficients for rightcensored timetoevent data are were put forward (R2NAG, R2XO and R2OXS).

The coefficient (R2Nag) proposed by Nagelkerke (1991):
(13) where denotes the loglikelihood function.

The coefficient (R2XO) proposed by Xu and O’Quigley (1999) that is restricted to proportional hazards regression models, because here the means of squared residuals in the measure for linear regression are replaced by the (weighted) sums of squared Schoenfeld residuals, denoted by :
(14) 
The coefficient (R2OXS) proposed by O’Quigley et al. (2005) who replaced the number of observations by the number of events :
(15)
All three were implemented in the survAUC
package, Potapov et al. (2012). Others have also used these modified statistics to assess predictive performance of gene expression based predictors on survival Bair and Tibshirani (2004); Segal (2006).
Hielscher et al. (2010) carried out a comparison of the properties of these three coefficients. In a word, R2Nag is strongly influenced by censoring (negative correlation with censoring); R2OXS is less influenced by censoring and exhibits a positive correlation with censoring. From those three R2XO is the less influenced by censoring. As a consequence, we selected the R2XO as the like measure to compare the models.
The weighted Brier score (Brier (1950); Hothorn et al. (2004); RadespielTröger et al. (2003)) is a distancebased measure of prediction error that is based on the squared deviation between survival functions. It is defined as a function of time by
(16) 
where denotes the KaplanMeier estimate of the censoring distribution, that is the Kaplan–Meier estimate based on the observations and stands for the indicator function. The expected Brier score of a prediction model which ignores all predictor variables corresponds to the KM estimate. To derive the unweighted Brier score, , clear the value of the denominators.
The Schmid score (Schmid et al. (2011)) is a distancebased measure of prediction error that is based on the absolute deviation between survival functions, instead of the squared one for the BrierScore. It is a robust improvement over the following empirical measure of absolute deviation between survival functions that was suggested by Schemper and Henderson (2000) as a function of time by:
(17) 
where denotes the KaplanMeier estimate of the censoring distribution which is based on the observations and stands for the indicator function. With the same notations, the Schmid score is defined as a function of time by:
(18) 
where is a survival time that is marginally smaller than . To derive the unweighted Schmid score, , clear the and values of the denominators.
The values of the BrierScore range between 0 and 1. Good predictions at time result in small BrierScores. The numerator of the first summand is the squared predicted probability that individual survives until time if he actually died (uncensored) before , or zero otherwise. The better the survival function is estimated, the smaller is this probability. Analogously, the numerator of the second summand is the squared probability that individual dies before time if he was observed at least until , or zero otherwise. Censored observations with survival times smaller than are weighted with 0. The Brierscore as defined in Eq. 16 depends on . It makes sense to use the integrated BrierScore () given by
(19) 
as a score to assess the goodness of the predicted survival functions of all observations at every time between 0 and , . Note that the is also appropriate for prediction methods that do not involve Cox regression models: it is more general than the and the value criteria associated to the log likelihood test and has thus become a standard evaluation measure for survival prediction methods (Hothorn et al. (2006); Schumacher et al. (2007)).
Denoting by , the KaplanMeier estimator based on the , , which corresponds to a prediction without covariates, we first define for all :
(20) 
Then the integrated iR2BSw, Graf et al. (1999), is defined by:
(21) 
This criterion has already been used in Bøvelstad et al. (2007) and LambertLacroix and Letué (2011). The integrated iR2BSw is slightly influenced by censoring, Hielscher et al. (2010), and, as a measure based on the the quadratic norm, not robust.
As a consequence, we propose and use a similar measure based on the Schmid score, the integrated R Schmid Score weighted (iRSSw), by turning the traditional , derived from the quadratic norm, into the R coefficient of determination for least absolute deviation, introduced by McKean and Sievers (1987). Denoting by the Schmid score which corresponds to a prediction without covariates, we first define for all :
(22) 
Then the integrated iRSSw, is defined by:
(23) 
The Cindex provides a global assessment of a fitted survival model for the continuous event time rather than focuses on the prediction of tyear survival for a fixed time and is arguably the most widely used measure of predictive accuracy for censored data regression models. It is a rankcorrelation measure motivated to quantify the correlation between the ranked predicted and observed survival times. The C index estimates the probability of concordance between predicted and observed responses. A value of 0.5 indicates no predictive discrimination and a value of 1.0 indicates perfect separation of patients with different outcomes. A popular nonparametric Cstatistic for estimating was proposed by Harrell et al. (1996). It is computed by forming all pairs of the observed data, where the smaller followup time is a failure time and defined as:
(24) 
We used the improved version (GHCI) by Gönen and Heller (2005) for the Cox proportional hazards models as a performance comparison criterion. Their estimator is a function of the regression parameters and the covariate distribution only and does not use the observed event and censoring times. For this reason it is asymptotically unbiased, unlike Harrell’s Cindex based on informative pairs. It focuses on the concordance probability as a measure of discriminatory power within the framework of the Cox model. The appeal of this formulation is that it provides a stable estimator of predictive accuracy that is easily computed:
(25) 
In contrast to Harrell’s Cindex, the effect of the observed times on is mediated through the partial likelihood estimator , and, since the effect of censoring on the bias of is negligible, the measure is robust to censoring. In addition, remains invariant under monotone transformations of the survival times.
5.3 Ranking the performance of the CV criteria
We stated several recommendations, in Section 4 based of the accuracy of the selection of the number of components. Selecting the right number of components is a goal per se.
Moreover, these recommendations are also relevant from a performance criteria point of view (see Section 5.1) as the following analysis showed.

For all the models and simulation types, we carried out the crossvalidation according to all of the 12 criteria and, for each of these criteria, we derived the value of all the 14 the performance measures.

In order to lay the stress on the improvements of performance made when switching from the classic and the van Houwelingen log likelihood cross validation techniques to the recommended ones, we computed, for every datasets and models, all the paired differences between CVLL or vHCVLL and the eleven other CV techniques.

Paired comparison with CVLL. For every simulated dataset we evaluated: Delta = Performance Measure(with any CV criteria CVLL) Performance Measure(with CVLL)

Paired comparison with vHCVLL. For every simulated dataset we evaluated: Delta = Performance Measure(with any CV criteria vHCVLL) Performance Measure(with vHCVLL)
An analysis of these results showed a steady advantage of the recommended criteria versus either CVLL or vHCVLL especially in the linear and quadratic cases.
In the case of paired comparison with vHCVLL and for some couples of the type (performance measure, model), namely (UnoC, PLSCox), (UnoC, sPLSDR), (iBSW, PLSDR), (iBSW, DKsPLSDR), (iRSSW, autoPLSCox), (iRSSW, PLSDR), (iAUCSurvROC, PLSDR) and (iAUCSurvROC, sPLSDR), those deltas are plotted on Figures 5.3, 5.3, 5.3, 5.3, 5.3, 5.3, 5.3 and 5.3. 
5.4 Performance comparison revisited
5.4.1 Selection of competing benchmark methods
Simon et al. (2011), introduced the coxnet
procedure, which is an elastic nettype procedure for the Cox model, in a similar but not equivalent way than two competing ones: coxpath
(glmpath
package, Park and Hastie, 2007) and penalized
(penalized
package, Goeman, 2010). In Section 3 of the same article, these authors extensively compared coxnet
to coxpath
and to penalized
for the lasso penalty that is the only one relevant for these comparisons since the three procedures use different elastic net penalties. Their results show tremendous timing advantage for coxnet
over the two other procedures. The coxnet
procedure was integrated in the glmnet
package (Friedman et al., 2010) and is called in the R
language by applying the glmnet
function with the option family=cox
: coxnet
is glmnet
for the Cox model. The timing results of Simon et al. (2011) on both simulated and real datasets show some advantage to coxpath
over penalized
.
As to pure lassotype penalty algorithms, we selected two of them: “Univariate Shrinkage in the Cox Model for High Dimensional data” (uniCox
, Tibshirani, 2009) and ‘Gradient Lasso for Cox Proportional Hazards Model” (glcoxph
, Sohn et al., 2009).
The uniCox
package implements “Univariate Shrinkage in the Cox Model for High Dimensional data” (Tibshirani, 2009). Being “essentially univariate”, it differs from applying a classical lasso penalty when fitting the Cox model and hence from both coxnet/glmnet and coxpath/glmpath. It can be used on highly correlated and even rectangular datatsets.
In their article, Sohn et al. (2009), show that the glcoxph
package is very competitive compared with popular existing methods coxpath
by Park and Hastie (2007) and penalized
by Goeman (2010) in its computational time, prediction and selectivity. As a very competitive procedure to coxpath
, that we included in our benchmarks, and since no comparisons were carried out with coxnet
, we selected glcoxph
as well.
Cross validation criteria were recommended for several of our benchmark methods by their authors. We followed these recommendations classic CV partial likelihood for coxpath, glcoxph and uniCox; van Houwelingen CV partial likelihood for coxnet with both the , the value of that gives minimum of the mean crossvalidated error, or , the largest value of such that the crossvalidated error is within 1 standard error of the minimum of the mean crossvalidated error, criteria and used the same 7 folds fo the training set as those described in Section 3.3 for the other models.
It seemed unfair to compare the methods using a performance measure that is recommended as a crossvalidation criterion for some, but not all, of them. Hence, we decided not to use any of the three recommended crossvalidation criteria iAUCSH, iAUCUno or iAUCsurvROC even if it has already been used by Li (2006) as a performance measure, in order to strive to perform fair comparisons with the methods that are recommended to be cross validated using partial likelihood with either the classic or van Houwelingen technique.
As a consequence and in order to still provide results for a ROCbased performance measure on a fair basis, we selected the Chambless and Diao’s (2006) estimator of cumulative/dynamic AUC for rightcensored timetoevent data in a form restricted to Cox regression. The iAUCCD summary measure is given by the integral of AUC on (weighted by the estimated probability density of the timetoevent outcome).
5.4.2 Results
For coxnet, coxlars or ridgecox with both the or CV criteria, the criterion yield similar yet superior results than the one whose main default is to select too often no explanatory variable (a null model) for the linear or quadratic links. As a consequence, we only reported results for the former one.
We plotted some of the performance measures when the crossvalidation is done according to the vHCVLL criterion on Figures 5.4.2, 5.4.2, 5.4.2, 5.4.2, 5.4.2 and 5.4.2. The results are terrible for all the (s)PLSlike models apart from PLSCox and autoPLSCox.
We then provide, for each of the (s)PLSlike method, the increases in terms of performance measures when switching from the vHCVLL as a cross validation criterion to the recommended one in Section 4.4. Virtually, for PLSCox and autoPLSCox we switch to the iAUCSH crossvalidation criterion and for other (s)PLS based models to either iAUCUno or iAUCSurvROC .
For iAUCUno, these results are plotted on Figures 5.4.2, 5.4.2, 5.4.2, 5.4.2, 5.4.2 and 5.4.2 whereas for iAUCSurvROC they are displayed on Figures 5.4.2, 5.4.2, 5.4.2, 5.4.2, 5.4.2 and 5.4.2. These figures show a firm increase for the 6 criteria (R2XO, GHCI, iAUCCD, iAUCSurvROC, IRSSW, iSSW).
As can be seen for iAUCUno (Figures 6, 6, 6, 6, 6 and 6) and iAUCSurvROC (Figures 6, 6, 6, 6, 6 and 6), the improvement of the performances due to switch to the recommended CV criteria is high enough to even have some (S)PLS based models, for instance SPLSDR, show some advantage over the other benchmark methods.
6 Conclusion
When crossvalidating standard or extended Cox models, the commonly used criterion is the crossvalidated partial loglikelihood using a naive or a van Houwelingen scheme. Quite astonishingly, these two crossvalidation methods fail with all the 7 extensions of partial least squares regression to the Cox model, namely PLSCox, autoPLSCox, CoxPLS, PLSDR, sPLSDR, DKPLSDR and DKsPLSDR.
In our simulation study, we introduced 12 cross validation criteria based on three different kind of model quality assessment:

Likelihood (2): Verweij and Van Houwelingen (classic CVLL, 1993), van Houwelingen et al. (vHCVLL, 2006).

Integrated AUC measures (6): Chambless and Diao’s (iAUCCD, 2006), Hung and Chiang’s (iAUCHC, 2010), Song and Zhou’s (iAUCSH, 2008), Uno et al.’s (iAUCUno, 2007), Heagerty and Zheng’s (iAUCHZ, 2005), Heagerty et al.’s (iAUCsurvROC, 2000).

Prediction error criteria (4): integrated (un)weighted Brier Score (iBS(un)w, Gerds and Schumacher (2006)) or Schmid Score (iSS(un)w, Schmid et al. (2011))
Our simulation study was successful in finding good CV criterion for PLS or sPLS based extensions of the Cox model:

iAUCsh for PLSCox and autoPLSCox.

iAUCSurvROC and iAUCUno ones for CoxPLS, (DK)PLSDR and (DK)sPLSDR.
The derivation of measures of prediction accuracy for survival data is not straightforward in the presence of censored observations. To overcome this problem, a variety of new approaches has been suggested in the literature. We spotted 23 performance measures that can be classified into three groups:

Likelihoodbased approaches (llrt, varresmart, 3 R2type).

ROCbased approaches such as integrated AUC (iAUCCD, iAUCHC, iAUCSH, iAUCUno, iAUCHZ, iAUCsurvROC), 3 Cindex (Harrell, GHCI, UnoC).

Distancebased approaches such as the V of Schemper and Henderson (2000) or derived from Brier or Schmid Scores (iBS(un)w, iSS(un)w and 4 derived R2type measures).
Using the newly found crossvalidation, and these measures of prediction accuracy, we performed a benchmark reanalysis that showed enhanced performances of these techniques and a much better behaviour even against other well known competitors such as coxnet, coxpath, uniCox and glcoxph.
Hence the recommended criteria not only improve the accuracy of the choice of the number of components but also strongly raise the performances of the models, which enables some of them to overperform the other benchmark methods.
References
References
 Bair and Tibshirani [2004] Eric Bair and Robert Tibshirani. Semisupervised methods to predict patient survival from gene expression data. PLoS biology, 2(4):E108, April 2004. ISSN 15457885. doi: 10.1371/journal.pbio.0020108. URL http://dx.plos.org/10.1371/journal.pbio.0020108.
 Bair et al. [2006] Eric Bair, Trevor Hastie, Debashis Paul, and Robert Tibshirani. Prediction by supervised principal components. Journal of the American Statistical Association, 101:119–137, 2006.
 Bastien and Tenenhaus [2001] P. Bastien and M. Tenenhaus. PLS generalised linear regression, Application to the analysis of life time data. In PLS and Related Methods, Proceedings of the PLS’01 International Symposium, pages 131–140. CISIACERESTA, 2001.
 Bastien [2008] Philippe Bastien. Deviance residuals based PLS regression for censored data in high dimensional setting. Chemometrics and Intelligent Laboratory Systems, 91:78–86, 2008. ISSN 01697439. doi: 10.1016/j.chemolab.2007.09.009.
 Bastien et al. [2005] Philippe Bastien, Vincenzo Esposito Vinzi, and Michel Tenenhaus. PLS generalised linear regression. Computational Statistics & Data Analysis, 48(1):17–46, January 2005. ISSN 01679473. doi: 10.1016/j.csda.2004.02.005. URL http://www.sciencedirect.com/science/article/pii/S0167947304000271.
 Bastien et al. [2015] Philippe Bastien, Frederic Bertrand, Nicolas Meyer, and Myriam MaumyBertrand. Deviance residualsbased sparse PLS and sparse kernel PLS regression for censored data. Bioinformatics, 31(3):397–404, oct 2015. ISSN 14602059. doi: 10.1093/bioinformatics/btu660. URL http://bioinformatics.oxfordjournals.org/content/early/2014/10/06/bioinformatics.btu660.abstract.
 Bennett and Embrecht [2003] K.P. Bennett and M.J. Embrecht. An optimization perspective on kernel partial least squares regression. In Advances in learning theory: methods, models and applications, NATO Sciences Series III: Computer & Systems Sciences, pages 227–250, 2003.
 Bertrand et al. [2014a] F. Bertrand, M. MaumyBertrand, and N. Meyer. Partial least squares Regression for Cox models and related techniques, 2014a. URL http://CRAN.Rproject.org/package=plsRcox. R package version 1.1.0.
 Bertrand et al. [2014b] F. Bertrand, N. Meyer, and M. MaumyBertrand. Partial least squares Regression for generalized linear models, 2014b. URL http://CRAN.Rproject.org/package=plsRglm. R package version 1.7.0.
 Bøvelstad et al. [2007] H. M. Bøvelstad, S. Nygård, H. L. Størvold, M. Aldrin, O. Borgan, A. Frigessi, and O. C. Lingjærde. Predicting survival from microarray data  A comparative study. Bioinformatics, 23:2080–2087, 2007. doi: 10.1093/bioinformatics/btm305.
 Breiman et al. [1984] L Breiman, J H Friedman, R A Olshen, and C J Stone. Classification and Regression Trees. CRC Press, Boca Raton, 1984. ISBN 0412048418.
 Breiman and Spector [1992] Leo Breiman and Philip Spector. Submodel selection and evaluation in regression . The Xrandom case. International Statistical Review, 60:291–319, 1992. doi: 10.2307/1403680.
 Brier [1950] Glenn W. Brier. Verification of Forecasts Expressed in Terms of Probability. Monthly Weather Review, 78(1):1–3, January 1950. ISSN 00270644. doi: 10.1175/15200493(1950)078¡0001:VOFEIT¿2.0.CO;2. URL http://journals.ametsoc.org/doi/abs/10.1175/15200493%281950%29078%3C0001%3AVOFEIT%3E2.0.CO%3B2.
 Cai et al. [2006] Tianxi Cai, Margaret Sullivan Pepe, Yingye Zheng, Thomas Lumley, and Nancy Swords Jenny. The sensitivity and specificity of markers for event times. Biostatistics, 7:182–197, 2006. ISSN 14654644. doi: 10.1093/biostatistics/kxi047.
 Chambless and Diao [2006] Lloyd E. Chambless and Guoqing Diao. Estimation of timedependent area under the ROC curve for longterm risk prediction. Statistics in Medicine, 25:3474–3486, 2006. ISSN 02776715. doi: 10.1002/sim.2299.
 Chun and Keles [2010] Hyonho Chun and Sündüz Keles. Sparse partial least squares regression for simultaneous dimension reduction and variable selection. Journal of the Royal Statistical Society. Series B, Statistical methodology, 72(1):3–25, January 2010. ISSN 13697412. doi: 10.1111/j.14679868.2009.00723.x. URL http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2810828&tool=pmcentrez&rendertype=abstract.
 Chung et al. [2012] D. Chung, H. Chun, and S. Keles. spls: Sparse Partial Least Squares (SPLS) Regression and Classification, 2012. URL http://CRAN.Rproject.org/package=spls. R package version 2.12.
 Collett [1994] D Collett. Modelling survival data in medical research, volume 46. 1994. ISBN 9780412448805. doi: 10.1198/tech.2004.s817. URL http://books.google.com/books?id=4t3GWDKDRQC&pgis=1.
 Cox [1972] D. R. Cox. Regression models and life tables. Journal of the Royal Statistical Society. Series B:, 34:187–220, 1972. ISSN 13697412.
 De Jong and Ter Braak [1994] Sijmen De Jong and Cajo J. F. Ter Braak. Comments on the PLS kernel algorithm. Journal of Chemometrics, 8(2):169–174, March 1994. ISSN 08869383. doi: 10.1002/cem.1180080208. URL http://doi.wiley.com/10.1002/cem.1180080208.
 Dejean et al. [2013] S. Dejean, I. Gonzalez, and K.A. Le Cao. mixOmics: Omics Data Integration Project, 2013. URL http://CRAN.Rproject.org/package=mixOmics. R package version 4.14.
 Dupuy and Simon [2007] Alain Dupuy and Richard M Simon. Critical review of published microarray studies for cancer outcome and guidelines on statistical analysis and reporting. Journal of the National Cancer Institute, 99(2):147–57, January 2007. ISSN 14602105. doi: 10.1093/jnci/djk018. URL http://jnci.oxfordjournals.org/content/99/2/147.long.
 Fan et al. [2002] X. Fan, A. Felsovalyi, S. Sivo, and S.C. Keenan. SAS for Monte Carlo Studies: A guide for quantitative Researchers. SAS publishing, Cary, NY, 2002.
 Friedman et al. [2010] Jerome Friedman, Trevor Hastie, and Rob Tibshirani. Regularization Paths for Generalized Linear Models via Coordinate Descent. Journal of statistical software, 33:1–22, 2010. ISSN 15487660. doi: 10.1359/JBMR.0301229. URL http://www.jstatsoft.org/v33/i01/.
 Friedman and Popescu [2004] J.H. Friedman and B.E. Popescu. Gradient Directed Regularization. Working Paper, Department of Statistics, Stanford University., 2004.
 Garthwaite [1994] Paul H. Garthwaite. An Interpretation of Partial Least Squares. Journal of the American Statistical Association, 89(425):122–127, February 1994. doi: 10.1080/01621459.1994.10476452. URL http://amstat.tandfonline.com/doi/abs/10.1080/01621459.1994.10476452#.VBF6j0utvCY.
 Gerds and Schumacher [2006] Thomas A. Gerds and Martin Schumacher. Consistent estimation of the expected brier score in general survival models with rightcensored event times. Biometrical Journal, 48:1029–1040, 2006. ISSN 03233847. doi: 10.1002/bimj.200610301.
 Gerds and Schumacher [2007] Thomas A. Gerds and Martin Schumacher. Efrontype measures of prediction error for survival analysis. Biometrics, 63:1283–1287, 2007. ISSN 0006341X. doi: 10.1111/j.15410420.2007.00832.x.
 Goeman [2010] Jelle J. Goeman. L1 penalized estimation in the Cox proportional hazards model. Biometrical Journal, 52(1):70–84, 2010. ISSN 03233847. doi: 10.1002/bimj.200900028.
 Gönen and Heller [2005] Mithat Gönen and Glenn Heller. Concordance probability and discriminatory power in proportional hazards regression. Biometrika, 92:965–970, 2005. ISSN 00063444. doi: 10.1093/biomet/92.4.965.
 Graf et al. [1999] E Graf, C Schmoor, W Sauerbrei, and M Schumacher. Assessment and comparison of prognostic classification schemes for survival data. Statistics in medicine, 18(1718):2529–2545, 1999. ISSN 02776715. URL http://www.ncbi.nlm.nih.gov/pubmed/10474158.
 Harrell et al. [1996] Frank E. Harrell, Kerry L. Lee, and Daniel B. Mark. Multivariable prognostic models: Issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors. Statistics in Medicine, 15(4):361–387, 1996. ISSN 02776715. doi: 10.1002/(SICI)10970258(19960229)15:4¡361::AIDSIM168¿3.0.CO;24.
 Hastie and Efron [2012] T. Hastie and B. Efron. lars: Least Angle Regression, Lasso and Forward Stagewise, 2012. URL http://CRAN.Rproject.org/package=lars. R package version 1.1.
 Heagerty et al. [2000] P J Heagerty, T Lumley, and M S Pepe. Timedependent ROC curves for censored survival data and a diagnostic marker. Biometrics, 56(2):337–344, 2000. ISSN 0006341X. doi: 10.1111/j.0006341X.2000.00337.x.
 Heagerty and packaging by Paramita SahaChaudhuri [2012] Patrick J. Heagerty and packaging by Paramita SahaChaudhuri. risksetROC: Riskset ROC curve estimation from censored survival data, 2012. URL http://CRAN.Rproject.org/package=risksetROC. R package version 1.0.4.
 Heagerty and Zheng [2005] Patrick J. Heagerty and Yingye Zheng. Survival model predictive accuracy and ROC curves. Biometrics, 61(1):92–105, 2005. ISSN 0006341X. doi: 10.1111/j.0006341X.2005.030814.x.
 Hielscher et al. [2010] Thomas Hielscher, Manuela Zucknick, Wiebke Werft, and Axel Benner. On the Prognostic Value of Gene Expression Signatures for Censored Data. In Andreas Fink, Berthold Lausen, Wilfried Seidel, and Alfred Ultsch, editors, Advances in Data Analysis, Data Handling and Business Intelligence Studies in Classification, Data Analysis, and Knowledge Organization, pages 663–673. Springer, Berlin, 2010. ISBN 9783642010439. doi: 10.1007/9783642010446_61.
 Hothorn et al. [2004] Torsten Hothorn, Berthold Lausen, Axel Benner, and Martin RadespielTröger. Bagging survival trees. Statistics in medicine, 23(1):77–91, January 2004. ISSN 02776715. doi: 10.1002/sim.1593. URL http://www.ncbi.nlm.nih.gov/pubmed/14695641.
 Hothorn et al. [2006] Torsten Hothorn, Peter Bühlmann, Sandrine Dudoit, Annette Molinaro, and Mark J van der Laan. Survival ensembles. Biostatistics (Oxford, England), 7(3):355–73, July 2006. ISSN 14654644. doi: 10.1093/biostatistics/kxj011. URL http://biostatistics.oxfordjournals.org/content/7/3/355.full.
 Hung and Chiang [2010] H. Hung and C. T. Chiang. Estimation methods for timedependent AUC models with survival data. Canadian Journal of StatisticsRevue Canadienne De Statistique, 38:8–26, 2010. ISSN 03195724. doi: 10.1002/cjs. URL <GotoISI>://000276598600003.
 Jolliffe et al. [2003] Ian T Jolliffe, Nickolay T Trendafilov, and Mudassir Uddin. A Modified Principal Component Technique Based on the LASSO. Journal of Computational and Graphical Statistics, 12(3):531–547, September 2003. ISSN 10618600. doi: 10.1198/1061860032148. URL http://amstat.tandfonline.com/doi/abs/10.1198/1061860032148#.VBF12UutvCY.
 Kaiser and Dickman [1962] Henry F. Kaiser and Kern Dickman. Sample and population score matrices and sample correlation matrices from an arbitrary population correlation matrix. Psychometrika, 27:179–182, 1962. ISSN 00333123. doi: 10.1007/BF02289635.
 Karatzoglou et al. [2004] Alexandros Karatzoglou, Alex Smola, Kurt Hornik, and Achim Zeileis. kernlab ? An S4 Package for Kernel Methods in R. Journal of Statistical Software, 11(9):1–20, 2004. ISSN 15487660. doi: 10.1016/j.csda.2009.09.023. URL http://www.jstatsoft.org/v11/i09/paper.
 Kent and O’Quigley [1988] John T. Kent and John O’Quigley. Measures of dependence for censored survival data. Biometrika, 75:525–534, 1988. ISSN 00063444. doi: 10.1093/biomet/75.3.525.
 Kim [2009] J. Kim. glcoxph: The penalized Cox PH model using the generalized lasso algorithm, 2009. URL http://datamining.dongguk.ac.kr/R/glcoxph/. R package version 0.9913.
 Kohavi [1995] Ron Kohavi. A Study of CrossValidation and Bootstrap for Accuracy Estimation and Model Selection. In International Joint Conference on Artificial Intelligence, volume 14, pages 1137–1143, 1995. ISBN 1558603638. doi: 10.1067/mod.2000.109031.
 Korn and Simon [1990] E L Korn and R Simon. Measures of explained variation for survival data. Statistics in medicine, 9:487–503, 1990. ISSN 02776715. doi: 10.1002/sim.4780090503.
 Krämer [2007] Nicole Krämer. An overview on the shrinkage properties of partial least squares regression. Computational Statistics, 22(2):249–273, March 2007. ISSN 09434062. doi: 10.1007/s001800070038z. URL http://link.springer.com/10.1007/s001800070038z.
 Kuhn [2014] Max Kuhn. caret: Classification and Regression Training, 2014. URL http://CRAN.Rproject.org/package=caret. Contributions from Jed Wing and Steve Weston and Andre Williams and Chris Keefer and Allan Engelhardt and Tony Cooper and Zachary Mayer and the R Core Team. R package version 6.030.
 LambertLacroix and Letué [2011] S. LambertLacroix and F. Letué. Partial Least Squares and Cox model with application to gene expression. Technical report, 2011. URL http://hal.archivesouvertes.fr/hal00568234.
 Langfelder et al. [2013] Peter Langfelder, Paul S. Mischel, and Steve Horvath. When Is Hub Gene Selection Better than Standard MetaAnalysis? PLoS ONE, 8(4):e61505, 2013. ISSN 19326203. doi: 10.1371/journal.pone.0061505.
 Lehmann and Romano [2005] E. Lehmann and J. Romano. Testing Statistical Hypotheses. Springer Texts in Statistics. Springer, New York, 3rd edition, 2005. ISBN 0387988645. URL http://www.worldcat.org/isbn/0387988645.
 Li and Gui [2004] Hongzhe Li and Jiang Gui. Partial Cox regression analysis for highdimensional microarray gene expression data. Bioinformatics, 20 Suppl 1:i208–i215, 2004. ISSN 13674811. doi: 10.1093/bioinformatics/bth900.
 Li [2006] Lexin Li. Survival prediction of diffuse largeBcell lymphoma based on both clinical and gene expression information. Bioinformatics, 22(4):466–471, 2006. ISSN 13674803. doi: 10.1093/bioinformatics/bti824.
 Lindgren et al. [1993] Fredrik Lindgren, Paul Geladi, and Svante Wold. The kernel algorithm for PLS. Journal of Chemometrics, 7(1):45–59, January 1993. ISSN 08869383. doi: 10.1002/cem.1180070104. URL http://doi.wiley.com/10.1002/cem.1180070104.
 Magidson and Wassmann [2010] J. Magidson and K. Wassmann. The Role of Proxy Genes in Predictive Models: An Application to Early Detection of Prostate Cancer. In Proceedings of the American Statistical Association., 2010.
 Marx [1996] Brian D Marx. Iteratively Reweighted Partial Least Squares Estimation for Generalized Linear Regression. Technometrics, 38:374–381, 1996. ISSN 00401706. doi: 10.2307/1271308.
 McKean and Sievers [1987] Joseph W McKean and Gerald L Sievers. Coefficients of determination for least absolute deviation analysis. Statistics & Probability Letters, 5(1):49–54, January 1987. ISSN 01677152. doi: 10.1016/01677152(87)900265. URL http://www.sciencedirect.com/science/article/pii/0167715287900265.
 Mevik et al. [2011] B.H. Mevik, R. Wehrens, and K.H. Liland. pls: Partial Least Squares and Principal Component regression, 2011. URL http://CRAN.Rproject.org/package=pls. R package version 2.30.
 Nagelkerke [1991] N. J D Nagelkerke. A note on a general definition of the coefficient of determination. Biometrika, 78:691–692, 1991. ISSN 00063444. doi: 10.1093/biomet/78.3.691.
 Nguyen and Rocke [2002] D V Nguyen and D M Rocke. Partial least squares proportional hazard regression for application to DNA microarray survival data. Bioinformatics, 18:1625–1632, 2002. ISSN 13674803 (Print)13674803 (Linking). URL http://www.ncbi.nlm.nih.gov/pubmed/12490447?dopt=Citation.
 O’Quigley et al. [2005] John O’Quigley, Ronghui Xu, and Janez Stare. Explained randomness in proportional hazards models. Statistics in Medicine, 24:479–489, 2005. ISSN 02776715. doi: 10.1002/sim.1946.
 Park and Hastie [2007] Mee Young Park and Trevor Hastie. regularization path algorithm for generalized linear models. Journal of the Royal Statistical Society. Series B: Statistical Methodology, 69:659–677, 2007. ISSN 13697412. doi: 10.1111/j.14679868.2007.00607.x.
 Park and Hastie [2013] M.Y. Park and T. Hastie. glmpath: Regularization Path for Generalized Linear Models and Cox Proportional Hazards Model, 2013. URL http://CRAN.Rproject.org/package=glmpath. R package version 0.97.
 Park et al. [2002] P. J. Park, L. Tian, and I. S. Kohane. Linking gene expression data with patient survival times using partial least squares. Bioinformatics, 18(Suppl 1):S120–S127, July 2002. ISSN 13674803. doi: 10.1093/bioinformatics/18.suppl_1.S120. URL http://bioinformatics.oxfordjournals.org/content/18/suppl_1/S120.abstract.
 Pepe et al. [2008] Margaret S. Pepe, Yingye Zheng, Yuying Jin, Ying Huang, Chirag R. Parikh, and Wayne C. Levy. Evaluating the ROC performance of markers for future events. Lifetime Data Analysis, 14:86–113, 2008. ISSN 13807870. doi: 10.1007/s109850079073x.
 Potapov et al. [2012] S. Potapov, W. Adler, and M. Schmid. survAUC: Estimators of prediction accuracy for timetoevent data, 2012. URL http://CRAN.Rproject.org/package=glmpath. R package version 1.05.
 RadespielTröger et al. [2003] M. RadespielTröger, T. Rabenstein, H.T. Schneider, and B. Lausen. Comparison of treebased methods for prognostic stratification of survival data. Artificial Intelligence in Medicine, 28(3):323–341, July 2003. ISSN 09333657. doi: 10.1016/S09333657(03)000605. URL http://www.aiimjournal.com/article/S0933365703000605/fulltext.
 Rännar et al. [1994] Stefan Rännar, Fredrik Lindgren, Paul Geladi, and Svante Wold. A PLS kernel algorithm for data sets with many variables and fewer objects. Part 1: Theory and algorithm. Journal of Chemometrics, 8(2):111–125, March 1994. ISSN 08869383. doi: 10.1002/cem.1180080204. URL http://doi.wiley.com/10.1002/cem.1180080204.
 Rosipal and Trejo [2002] Roman Rosipal and Leonard J Trejo. Kernel Partial Least Squares Regression in Reproducing Kernel Hilbert Space. Journal of Machine Learning Research, 2:97–123, 2002. ISSN 15324435. doi: 10.1162/15324430260185556. URL http://www.crossref.org/jmlr_DOI.html.
 Rosthøj and Keiding [2004] Susanne Rosthøj and Niels Keiding. Explained variation and predictive accuracy in general parametric statistical models: the role of model misspecification. Lifetime data analysis, 10(4):461–72, December 2004. ISSN 13807870. URL http://www.ncbi.nlm.nih.gov/pubmed/15690996.
 Schemper and Henderson [2000] M Schemper and R Henderson. Predictive accuracy and explained variation in Cox regression. Biometrics, 56(1):249–255, March 2000. ISSN 0006341X. URL http://www.ncbi.nlm.nih.gov/pubmed/10783803.
 Schemper and Stare [1996] Michael Schemper and Janez Stare. Explained variation in survival analysis. Statistics in Medicine, 15:1999–2012, 1996. ISSN 02776715. doi: 10.1002/(SICI)10970258(19961015)15:19¡1999::AIDSIM353¿3.0.CO;2D.
 Schmid et al. [2011] Matthias Schmid, Thomas Hielscher, Thomas Augustin, and Olaf Gefeller. A Robust Alternative to the SchemperHenderson Estimator of Prediction Error. Biometrics, 67:524–535, 2011. ISSN 0006341X. doi: 10.1111/j.15410420.2010.01459.x.
 Schoop et al. [2008] R. Schoop, E. Graf, and M. Schumacher. Quantifying the predictive performance of prognostic models for censored survival data with timedependent covariates. Biometrics, 64:603–610, 2008. ISSN 0006341X. doi: 10.1111/j.15410420.2007.00889.x.
 Schröder et al. [2011] Markus S. Schröder, Aedín C. Culhane, John Quackenbush, and Benjamin HaibeKains. survcomp: An R/BioconductoR package for performance assessment and comparison of survival models. Bioinformatics, 27(22):3206–3208, 2011. ISSN 13674803. doi: 10.1093/bioinformatics/btr511.
 Schumacher et al. [2007] Martin Schumacher, Harald Binder, and Thomas Gerds. Assessment of survival prediction models based on microarray data. Bioinformatics (Oxford, England), 23(14):1768–74, July 2007. ISSN 13674811. doi: 10.1093/bioinformatics/btm232. URL http://bioinformatics.oxfordjournals.org/content/23/14/1768.full.
 Segal [2006] Mark R. Segal. Microarray gene expression data with linked survival phenotypes: Diffuse largeBcell lymphoma revisited. Biostatistics, 7:268–285, 2006. ISSN 14654644. doi: 10.1093/biostatistics/kxj006.
 Simon et al. [2011] Noah Simon, Jerome Friedman, Trevor Hastie, and Rob Tibshirani. Regularization paths for Cox’s proportional hazards model via coordinate descent. Journal of Statistical Software, 39:1–13, 2011. ISSN 15487660. URL http://www.jstatsoft.org/v39/i05/.
 Sohn et al. [2009] Insuk Sohn, Jinseog Kim, Sin Ho Jung, and Changyi Park. Gradient lasso for Cox proportional hazards model. Bioinformatics, 25(14):1775–1781, 2009. ISSN 13674803. doi: 10.1093/bioinformatics/btp322.
 Song and Zhou [2008] Xiao Song and Xiaohua Zhou. A semiparametric approach for the covariate specific ROC curve with survival outcome. Statistica Sinica, 18:947–965, 2008. ISSN 10170405.
 Tenenhaus et al. [2007] Arthur Tenenhaus, Alain Giron, Emmanuel Viennet, Michel Béra, Gilbert Saporta, and Bernard Fertil. Kernel logistic PLS: A tool for supervised nonlinear dimensionality reduction and binary classification. Computational Statistics and Data Analysis, 51(9):4083–4100, 2007. ISSN 01679473. doi: 10.1016/j.csda.2007.01.004.
 Tenenhaus [1998] M. Tenenhaus. La régression PLS: théorie et pratique. Éditions Technip, Paris, 1998. ISBN 9782710807353. URL http://books.google.fr/books?id=OesjK2KZhsAC.
 Tenenhaus [1999] M. Tenenhaus. La regression logistique PLS. In Proceedings of the 32èmes journées de Statistique de la Société française de Statistique, pages 721–723. FES, 1999.
 Therneau [2013] T.M. Therneau. A Package for Survival Analysis in S, 2013. URL http://CRAN.Rproject.org/package=survival. R package version 2.374.
 Therneau and Grambsch [2000] T.M. Therneau and P.M. Grambsch. Modeling Survival Data: Extending the Cox Model. Statistics for Biology and Health. Springer, New York, 2000. ISBN 9780387987842. URL http://books.google.com.my/books?id=9kY4XRuUMUsC.
 Tibshirani [2009] R. Tibshirani. uniCox: Univariate shrinkage prediction in the Cox model, 2009. URL http://CRAN.Rproject.org/package=uniCox. R package version 1.0.
 Uno et al. [2007] Hajime Uno, Tianxi Cai, Lu Tian, and LeeJen WEI. Evaluating Prediction Rules for tYear Survivors With Censored Regression Models. Journal of the American Statistical Association, 102:527–537, 2007. ISSN 01621459. doi: Doi10.1198/016214507000000149. URL papers2://publication/doi/10.1198/016214507000000149.
 van Houwelingen et al. [2006] Hans C. van Houwelingen, Tako Bruinsma, Augustinus A M Hart, Laura J. van’t Veer, and Lodewyk F A Wessels. Crossvalidated Cox regression on microarray gene expression data. Statistics in Medicine, 25(18):3201–3216, 2006. ISSN 02776715. doi: 10.1002/sim.2353.
 van Wieringen et al. [2009] Wessel N. van Wieringen, David Kun, Regina Hampel, and AnneLaure Boulesteix. Survival prediction using gene expression data: A review and comparison. Computational Statistics & Data Analysis, 53(5):1590–1603, 2009. ISSN 01679473. doi: 10.1016/j.csda.2008.05.021.
 Verweij and Van Houwelingen [1993] P J Verweij and H C Van Houwelingen. Crossvalidation in survival analysis. Statistics in medicine, 12(24):2305–2314, 1993. ISSN 02776715. doi: 10.1002/sim.4780122407.
 Wold [1966] H. Wold. Estimation of Principal Components and Related Models by Iterative Least Squares. In P.R. Krishnaiah, editor, Multivariate Analysis, pages 391–420. Academic Press, New York, 1966.
 Wold et al. [1983] S Wold, H Martens, and H Wold. The multivariate calibration problem in chemistry solved by the PLS method. In A. Ruhe and B. Kåstrøm, editors, Proc. Conf. Matrix pencils, Lecture Notes in Mathematics, pages 286–293, Heidelberg, 1983. SpringerVerlag.
 Wold et al. [2001] Svante Wold, Michael Sjöström, and Lennart Eriksson. PLSregression: A basic tool of chemometrics. Chemometrics and Intelligent Laboratory Systems, 58:109–130, 2001. ISSN 01697439. doi: 10.1016/S01697439(01)001551.
 Xu and O’Quigley [1999] Ronghui Xu and John O’Quigley. A type measure of dependence for proportional hazards models. Journal of Nonparametric Statistics, 12(1):83–107, 1999. doi: 10.1080/10485259908832799. URL http://dx.doi.org/10.1080/10485259908832799.
 Zou and Hastie [2005] Hui Zou and Trevor Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2):301–320, April 2005. ISSN 13697412. doi: 10.1111/j.14679868.2005.00503.x. URL http://doi.wiley.com/10.1111/j.14679868.2005.00503.x.
 Zou et al. [2006] Hui Zou, Trevor Hastie, and Robert Tibshirani. Sparse Principal Component Analysis. Journal of Computational and Graphical Statistics, 15(2):265–286, June 2006. ISSN 10618600. doi: 10.1198/106186006X113430. URL http://dx.doi.org/10.1198/106186006X113430.
Supplemental Information
Insights on the implementation of the methods
We detail the implementation of the algorithms that we used in the article and start with some shared properties of these. Whenever the deviance residuals and survival models were to be derived, we used the survival
Rpackage (Therneau and Grambsch, 2000, Therneau, 2013).
As a PLS regression function in the plsRcox
Rpackage, we made three wrappers using either the pls
function of the pls
Rpackage (Mevik et al., 2011), the plsR
function of the plsRglm
Rpackage (Bertrand et al., 2014b) or the pls
function of the mixOmics
Rpackage (Dejean et al., 2013). The last two are based on the NIPALS algorithm and hence automatically handle missing data (Tenenhaus, 1998) in the explanatory variables. In addition, the pls
function of the mixOmics
Rpackage (Dejean et al., 2013) can quickly handle big datasets such as Dataset 5 with 242 rows and 44754 variables. As a consequence, we had to make the spls
function of the spls
Rpackage use this function instead of the pls
function of the pls
Rpackage (Mevik et al., 2011).

The coxpath implementation was the
coxpath
function found in theglmpath
Rpackage (Park and Hastie, 2013). 
PLSCox has not yet been implemented in
R
and the authors made it available as the functionplsRcox
of theplsRcox
Rpackage (Bertrand et al., 2014a). 
autoPLSCox is PLSCox with sparse PLS components and automatic selection of their optimal number. The computed components are sparse since, to have a non zero coefficient, a variable must be significant at a given level in the cox regression of the response by this variable adjusted by the previously found components. The model stops adding a new component when there is no longer any of the explanatory variable that is significant at a given level. The authors made it available as the function
plsRcox
of theplsRcox
Rpackage (Bertrand et al., 2014a). The number of components can also be determined using crossvalidation, the components being still sparse. 
The authors made a wrapper in the
plsRcox
Rpackage (Bertrand et al., 2014a) to derive a CoxPLS implementation from either thepls
orplsRglm
Rpackages. 
The authors made a wrapper in the
plsRcox
Rpackage (Bertrand et al., 2014a) to derive a PLSDR implementation from either thepls
orplsRglm
Rpackages. 
The uniCox implementation was the
uniCox
function found in theuniCox
Rpackage (Tibshirani, 2009). 
The glcoxph implementation was the
glcoxph
function found in theglcoxph
Rpackage (Kim, 2009).