Abstract
The development of molecular signatures for the prediction of time-to-event outcomes is a methodologically challenging task in bioinformatics and biostatistics. Although there are numerous approaches for the derivation of marker combinations and their evaluation, the underlying methodology often suffers from the problem that different optimization criteria are mixed during the feature selection, estimation and evaluation steps. This might result in marker combinations that are suboptimal regarding the evaluation criterion of interest. To address this issue, we propose a unified framework to derive and evaluate biomarker combinations. Our approach is based on the concordance index for time-to-event data, which is a non-parametric measure to quantify the discriminatory power of a prediction rule. Specifically, we propose a gradient boosting algorithm that results in linear biomarker combinations that are optimal with respect to a smoothed version of the concordance index. We investigate the performance of our algorithm in a large-scale simulation study and in two molecular data sets for the prediction of survival in breast cancer patients. Our numerical results show that the new approach is not only methodologically sound but can also lead to a higher discriminatory power than traditional approaches for the derivation of gene signatures.
Boosting the concordance index for survival data – a unified framework to derive and evaluate biomarker combinations

(This is a preliminary version of an article submitted to PLOS ONE. If citing, please refer to the original article.)
Andreas Mayr, Matthias Schmid
Department of Medical Informatics, Biometry and Epidemiology, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany
Department of Statistics, Ludwig-Maximilians-Universität München, Munich, Germany
Introduction
Recent technological developments in the fields of genomics and biomedical research have led to the discovery of large numbers of gene signatures for the prediction of clinical survival outcomes. In cancer research, for example, gene expression signatures are nowadays used to predict the time to occurrence of metastases [1, 2] as well as the time to progression [3] and overall patient survival [4, 5]. While the importance of molecular data in clinical and epidemiological research is expected to grow considerably in the next years [6, 7, 8], the detection of clinically useful gene signatures remains a challenging problem for bioinformaticians and biostatisticians, especially when the outcome is a survival time.
After normalization and data preprocessing, the development of a new gene signature usually comprises three methodological tasks:

Task 1: Select a subset of genes that is associated with the clinical outcome.

Task 2: Derive a marker signature by finding the “optimal” combination of the selected genes.

Task 3: Evaluate the prediction accuracy of the optimal combination using future or external data.
Task 1, the selection of a clinically relevant subset of genes, is often addressed by calculating scores to rank the univariate association between the survival outcome and each of the genes [9, 8]. In a subsequent step, the genes with the strongest associations are selected to be included in the gene signature.
Task 2, the derivation of an optimal combination of the selected genes, is usually fulfilled by forming linear combinations of gene expression levels based on Cox regression. Due to multicollinearity problems and the high dimensionality of molecular data, a direct optimization of the Cox partial likelihood is often infeasible [8]. Consequently, marker combinations are often derived by combining coefficients of univariate Cox regression models [10], or by applying regularized Cox regression techniques (such as the lasso [11, 12] or ridge-penalized regression [13, 14]).
Task 3, the evaluation of prediction accuracy, is considered to be a challenging problem in survival analysis. This is because traditional performance measures for continuous outcomes (such as the mean squared error) are no longer applicable in the presence of censoring. Several approaches to address this problem exist in the literature (see, e.g., [15] for an overview). In this article, we focus on the concordance index for time-to-event data (C-index; [16, 17, 18]), which has become a widely used measure of the performance of biomarkers in survival studies [19, 20, 21, 22]. Put briefly, the C-index can be interpreted as the probability that a patient with a small survival time is associated with a high value of a biomarker combination (and vice versa). Consequently, it measures the concordance between the rankings of the survival times and the biomarker values, and therefore the ability of a biomarker to discriminate between patients with small survival times and patients with large survival times. This strategy is especially helpful if the aim is to subdivide patients into groups with good or poor prognosis (as applied in many articles in the medical literature, e.g., [10]). By definition, the C-index has the same scale as the classical area under the curve (AUC) in binary classification: while prediction rules based on random guessing yield a C-index of 0.5, a perfectly discriminating biomarker combination leads to a C-index of 1.
Interestingly, the derivation of new gene signatures for survival outcomes via Tasks 1–3 is often addressed by combining completely different methodological approaches and estimation techniques. For example, the estimation of biomarker combinations is usually based on Cox regression and is hence carried out via the optimization of a partial likelihood criterion. On the other hand, the resulting combinations are often evaluated by using the C-index [21, 22], which has its roots in receiver operating characteristic (ROC) methodology. This methodological inconsistency is also problematic from a practical point of view, as the marker combination that optimizes the partial log-likelihood criterion is not necessarily the one that optimizes the C-index. In other words, if the C-index and therefore the discriminatory power is the evaluation criterion of interest, it may be suboptimal to use a likelihood-based criterion to optimize the marker combination. This issue is, of course, not only problematic in survival analysis but also in regression and classification. A theoretical discussion of the differences between performance measures for binary classification can, e.g., be found in [23].
To overcome the aforementioned inconsistencies, we propose a unified framework for survival analysis that is based on the same statistical methodology for gene selection (Task 1), the derivation of an optimal biomarker combination (Task 2) and the evaluation of a new gene signature (Task 3). As will be demonstrated, all three tasks can be addressed by using the concordance index for time-to-event data as the performance criterion. While the C-index has already been proposed for gene selection (Task 1) and the evaluation of prediction accuracy (Task 3) [9, 15], the main contribution of this article is a new estimation technique that addresses the development of optimal combinations of genes (Task 2). To achieve this goal, we propose a method for finding gene combinations that is based on the gradient boosting framework [24]. As will be shown, it is possible to use boosting to derive prediction-optimized gene combinations via direct optimization of the C-index. Because this new approach allows for using the C-index to address all three tasks, the proposed method leads to a consistent framework for the derivation of gene signatures in biomarker studies where the C-index is the main performance criterion.
Methods
Notation
We first introduce some basic notation that will be used throughout the article. Let $T$ be the survival time of interest and $\mathbf{X} = (X_1, \dots, X_p)^\top$ a vector of predictor variables. In addition to the gene expression levels, $\mathbf{X}$ may contain the measurements of some clinical predictor variables. Denote the conditional survival function of $T$ given $\mathbf{X}$ by $S(t \mid \mathbf{X})$. The probability density and survival functions of $T$ are denoted by $f(t)$ and $S(t)$, respectively. Further let $T_{\mathrm{cens}}$ be a random censoring time and denote the observed survival time by $\tilde{T} := \min(T, T_{\mathrm{cens}})$. The random variable $\Delta := \mathrm{I}(T \le T_{\mathrm{cens}})$ indicates whether $\tilde{T}$ is right-censored ($\Delta = 0$) or not ($\Delta = 1$).
A prediction rule for $T$ will be formed as a linear combination of the predictor variables,

(1)  $\eta(\mathbf{X}) = \mathbf{X}^\top \boldsymbol{\beta}\,,$

where $\boldsymbol{\beta} \in \mathbb{R}^p$ is an unknown vector of coefficients. We generally assume that the estimation of $\boldsymbol{\beta}$ is based on an i.i.d. learning sample $(\tilde{T}_i, \mathbf{X}_i, \Delta_i)$, $i = 1, \dots, n$. In case of the Cox regression model, for example, $\eta$ is related to $S(t \mid \mathbf{X})$ by the equation

(2)  $S(t \mid \mathbf{X}) = \exp\!\left( -\Lambda_0(t)\, \exp(\eta(\mathbf{X})) \right),$

where $\Lambda_0(t)$ is the cumulative baseline hazard function. Because there is a one-to-one relationship between $\eta$ and the expected survival time, the linear combination $\eta$ can be used as a biomarker to predict the survival of individual patients.
Concordance index
Our proposed framework to derive and evaluate biomarker combinations is based on the concordance index (“C-index”), which is a general discrimination measure for the evaluation of prediction models [16, 17]. It can be applied to continuous, ordinal and dichotomous outcomes [25]. For time-to-event outcomes, the C-index is defined as

(3)  $C := \mathrm{P}\left( \eta_i > \eta_j \mid T_i < T_j \right),$

where $T_i$, $T_j$ and $\eta_i$, $\eta_j$ are the event times and the predicted marker values, respectively, of two observations in an i.i.d. test sample. By definition, the C-index for survival data measures whether large values of $\eta$ are associated with short survival times and vice versa. Consequently, it quantifies the concordance between the rankings of the survival times and the marker values. Moreover, it can be shown that the C-index is equivalent to the area under the time-dependent ROC curve, which is a measure of the discriminative ability of $\eta$ at each time point under consideration (see [26], p. 95 for a formal proof).
During the last decades, the C-index has gained enormous popularity in biomedical research; for example, searching for the terms “concordance index” and “c-index” in PubMed [27] resulted in 1156 articles by the time of writing this article. Generally, a value of $C$ close to 1 indicates that the marker has close to perfect discriminatory power, while a marker that does not perform better than chance results in a value of 0.5. For example, the famous Gail model [28] for the prediction of breast cancer is estimated to yield only a moderate value of the C-index [29].
Being a flexible discrimination measure, the C-index is especially useful for selecting and ranking genes from a preprocessed set of high-dimensional gene expression data (Task 1 described in the Introduction). In other words, Task 1 can be addressed by computing the C-index (and hence the marginal discriminatory power) for each individual gene or biomarker, where only the genes with the highest C-index are incorporated into the yet-to-derive optimal combination (Task 2). Although there exist various other ways to rank genes and select the most influential ones, the C-index has been demonstrated to be especially advantageous for this task [9].
An estimator of the C-index for survival data is given by

(4)  $\hat{C}_{\mathrm{Har}} = \frac{\sum_{i,j} \Delta_i \, \mathrm{I}(\tilde{T}_i < \tilde{T}_j)\, \mathrm{I}(\hat{\eta}_i > \hat{\eta}_j)}{\sum_{i,j} \Delta_i \, \mathrm{I}(\tilde{T}_i < \tilde{T}_j)}$

with $\hat{\eta}_i := \hat{\eta}(\mathbf{X}_i)$ (“Harrell’s $C$ for survival data”, [30]). Generally, $\hat{C}_{\mathrm{Har}}$ is a consistent estimator of the C-index in situations where no censoring is present. However, because pairs of observations where the smaller observed survival time is censored are ignored in formula (4), $\hat{C}_{\mathrm{Har}}$ is known to show a notable upward bias in the presence of censoring. This bias tends to be correlated with the censoring rate [30, 15].
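For illustration, Harrell's estimator can be computed in a few lines of code. The analyses in this article rely on the R package mboost; the small Python re-implementation below is only a sketch, and it additionally counts ties in the marker values as 1/2, a common convention not spelled out in formula (4):

```python
import numpy as np

def harrell_c(time, event, marker):
    """Harrell's concordance estimator for right-censored data: among
    usable pairs (the smaller observed time corresponds to an event),
    count the pairs in which a larger marker value goes along with the
    shorter survival time.  Marker ties are counted as 1/2 here."""
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=bool)
    marker = np.asarray(marker, dtype=float)
    num = den = 0.0
    n = len(time)
    for i in range(n):
        if not event[i]:
            continue  # pairs whose smaller time is censored are dropped
        for j in range(n):
            if time[i] < time[j]:
                den += 1.0
                if marker[i] > marker[j]:
                    num += 1.0
                elif marker[i] == marker[j]:
                    num += 0.5
    return num / den
```

A marker that ranks all patients in perfect agreement with their survival times yields an estimate of 1, a perfectly reversed ranking yields 0.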
To overcome the censoring bias of $\hat{C}_{\mathrm{Har}}$, Uno et al. [18] proposed a modified version, which is defined as

(5)  $\hat{C}_{\mathrm{Uno}} = \frac{\sum_{i,j} \Delta_i\, (\hat{G}(\tilde{T}_i))^{-2}\, \mathrm{I}(\tilde{T}_i < \tilde{T}_j)\, \mathrm{I}(\hat{\eta}_i > \hat{\eta}_j)}{\sum_{i,j} \Delta_i\, (\hat{G}(\tilde{T}_i))^{-2}\, \mathrm{I}(\tilde{T}_i < \tilde{T}_j)}\,,$

where $\hat{G}$ denotes the Kaplan-Meier estimator of the unconditional survival function of the censoring time $T_{\mathrm{cens}}$ (estimated from the learning data). In the following, we will assume that the censoring times are independent of $\mathbf{X}$. Under this assumption, $\hat{C}_{\mathrm{Uno}}$ is a consistent and asymptotically normal estimator of $C$ (see [18], pp. 1113–1115). Consistency is ensured by applying inverse probability weighting, using the weights $(\hat{G}(\tilde{T}_i))^{-2}$, which account for the inverse probability that an observation in the test data is censored [31]. Numerical studies suggest that $\hat{C}_{\mathrm{Uno}}$ is remarkably robust against violations of the random censoring assumption [32].
Apart from the estimators $\hat{C}_{\mathrm{Har}}$ and $\hat{C}_{\mathrm{Uno}}$, there exist various other approaches to estimate the probability in (3) (see, e.g., [15] for an overview). Most of these approaches are based on the assumptions of a Cox proportional hazards model, so that they are not valid in case these assumptions are violated. Because $\hat{C}_{\mathrm{Uno}}$ is model-free and because its consistency is guaranteed even in situations where censoring rates are high (in contrast to the estimator $\hat{C}_{\mathrm{Har}}$), we will base our boosting method on $\hat{C}_{\mathrm{Uno}}$.
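Computation of $\hat{C}_{\mathrm{Uno}}$ can be sketched as follows. The helper names (km_censoring_survival, uno_c) are our own illustrative choices, not part of any package; marker ties are ignored for brevity, and $\hat{G}$ would normally be estimated on the learning data and then applied to the test data:

```python
import numpy as np

def km_censoring_survival(time, event):
    """Kaplan-Meier estimate of G, the unconditional survival function
    of the censoring time: the usual KM estimator computed with the
    event indicator reversed (censorings act as the 'events')."""
    time = np.asarray(time, dtype=float)
    cens = 1 - np.asarray(event, dtype=int)
    def G(s):
        surv = 1.0
        for u in np.unique(time[cens == 1]):
            if u <= s:
                at_risk = np.sum(time >= u)
                d = np.sum((time == u) & (cens == 1))
                surv *= 1.0 - d / at_risk
        return surv
    return G

def uno_c(time, event, marker, G):
    """Uno's IPCW concordance estimator: each usable pair is weighted
    by 1 / G(T_i)^2, the inverse probability that the pair is observed
    uncensored."""
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=bool)
    marker = np.asarray(marker, dtype=float)
    num = den = 0.0
    for i in range(len(time)):
        if not event[i]:
            continue
        w = 1.0 / G(time[i]) ** 2
        for j in range(len(time)):
            if time[i] < time[j]:
                den += w
                num += w * (marker[i] > marker[j])
    return num / den
```

In the absence of censoring, $\hat{G} \equiv 1$ and the estimator reduces to Harrell's $C$.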
Boosting the concordance index
The core of our proposed framework to address Tasks 1–3 is the derivation of a prediction-optimized linear combination of genes that is optimal with respect to the C-index for time-to-event data. Our approach is based on a componentwise gradient boosting algorithm [24] that uses the C-index as optimization criterion.
Gradient boosting algorithms [33] are generally based on a loss function $\rho(T, \eta)$ that is assumed to be differentiable with respect to the predictor $\eta$. The aim is then to estimate the “optimal” prediction function

(6)  $\eta^* := \operatorname*{argmin}_{\eta}\ \mathrm{E}\left[ \rho\big(T, \eta(\mathbf{X})\big) \right]$

by using gradient descent techniques. As the theoretical mean in (6) is usually unknown in practice, gradient boosting algorithms minimize the empirical risk $\mathcal{R} := \sum_{i=1}^{n} \rho\big(T_i, \eta(\mathbf{X}_i)\big)$ over the learning sample instead.

When considering the C-index for survival data, the aim is to determine the optimal predictor $\eta^*$ that maximizes the concordance measure – which is essentially the solution to Task 2. Hence a natural choice for the empirical risk function would be the negative concordance index estimator

(7)  $\mathcal{R} := -\hat{C}_{\mathrm{Uno}} = -\,\frac{\sum_{i,j} \Delta_i\, (\hat{G}(\tilde{T}_i))^{-2}\, \mathrm{I}(\tilde{T}_i < \tilde{T}_j)\, \mathrm{I}(\hat{\eta}_i > \hat{\eta}_j)}{\sum_{i,j} \Delta_i\, (\hat{G}(\tilde{T}_i))^{-2}\, \mathrm{I}(\tilde{T}_i < \tilde{T}_j)}\,.$
Setting $\rho = -\hat{C}_{\mathrm{Uno}}$, however, is infeasible because $\hat{C}_{\mathrm{Uno}}$ is not differentiable with respect to $\eta$ and can therefore not be used in a gradient boosting algorithm. To solve this problem, we follow the approach of Ma and Huang [34] and approximate the indicator function $\mathrm{I}(\hat{\eta}_i > \hat{\eta}_j)$ in (7) by the sigmoid function $\big(1 + \exp(-(\hat{\eta}_i - \hat{\eta}_j)/\sigma)\big)^{-1}$. Here, $\sigma > 0$ is a tuning parameter that controls the smoothness of the approximation (details on the choice of $\sigma$ will be given in the Numerical Results section). Replacing the indicator function in (7) by its smoothed version results in the smoothed empirical risk function

(8)  $\mathcal{R}_{\mathrm{smooth}} = -\sum_{i,j} \frac{w_{i,j}}{1 + \exp\!\big( -(\hat{\eta}_i - \hat{\eta}_j)/\sigma \big)}$

with weights

(9)  $w_{i,j} = \frac{\Delta_i\, (\hat{G}(\tilde{T}_i))^{-2}\, \mathrm{I}(\tilde{T}_i < \tilde{T}_j)}{\sum_{k,l} \Delta_k\, (\hat{G}(\tilde{T}_k))^{-2}\, \mathrm{I}(\tilde{T}_k < \tilde{T}_l)}\,.$

By definition, the smoothed empirical risk $\mathcal{R}_{\mathrm{smooth}}$ is differentiable with respect to the predictor. Its derivative with respect to the $k$-th component $\hat{\eta}_k$ is given by

(10)  $\frac{\partial \mathcal{R}_{\mathrm{smooth}}}{\partial \hat{\eta}_k} = -\frac{1}{\sigma} \left( \sum_{j} w_{k,j}\, \frac{\exp\!\big(-(\hat{\eta}_k - \hat{\eta}_j)/\sigma\big)}{\big(1 + \exp(-(\hat{\eta}_k - \hat{\eta}_j)/\sigma)\big)^{2}} \;-\; \sum_{i} w_{i,k}\, \frac{\exp\!\big(-(\hat{\eta}_i - \hat{\eta}_k)/\sigma\big)}{\big(1 + \exp(-(\hat{\eta}_i - \hat{\eta}_k)/\sigma)\big)^{2}} \right).$
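To make the smoothed risk and its gradient concrete, the following sketch evaluates them for a given predictor vector eta (NumPy array) and pairwise weight matrix w. The function names are illustrative, not part of any package; the analytic gradient can be verified against finite differences:

```python
import numpy as np

def smoothed_risk(eta, w, sigma=0.1):
    """Smoothed empirical risk: the indicator I(eta_i > eta_j) is
    replaced by the sigmoid 1 / (1 + exp(-(eta_i - eta_j) / sigma))."""
    diff = (eta[:, None] - eta[None, :]) / sigma
    return -np.sum(w / (1.0 + np.exp(-diff)))

def smoothed_risk_gradient(eta, w, sigma=0.1):
    """Analytic derivative of the smoothed risk with respect to each
    component of eta (the pairwise terms of row k enter positively,
    those of column k negatively)."""
    diff = (eta[:, None] - eta[None, :]) / sigma
    s = 1.0 / (1.0 + np.exp(-diff))
    d = w * s * (1.0 - s) / sigma   # derivative of each pair term
    return -(d.sum(axis=1) - d.sum(axis=0))
```

Comparing smoothed_risk_gradient to a finite-difference quotient of smoothed_risk is a quick sanity check that the two functions are consistent.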
In the next step of the gradient boosting algorithm, the derivative in (10) is iteratively fitted to a set of base-learners. Typically, an individual base-learner (a simple regression tool, e.g., a tree or a simple linear model) is specified for each marker. To ensure that the estimate of the optimal predictor is a linear combination of the components of $\mathbf{X}$, we will apply simple linear models as base-learners (cf. [35]). In other words, each base-learner is a simple linear model with one component of $\mathbf{X}$ as input variable. Consequently, there are $p$ base-learners, which will be denoted by $b_l$, $l = 1, \dots, p$. Each base-learner refers to one component of $\mathbf{X}$ and therefore to one marker (or gene).
The componentwise gradient boosting algorithm for the optimization of the smoothed C-index is then given as follows:

(1) Initialize the estimate of the marker combination $\hat{\eta}^{[0]}$ with offset values. For example, set $\hat{\eta}^{[0]} \equiv 0$, leading to $\hat{\beta}_l^{[0]} = 0$ for all components $l = 1, \dots, p$. Choose a sufficiently large maximum number of iterations $m_{\mathrm{stop}}$ and set the iteration counter $m$ to 1.

(2) Compute the negative gradient vector $\mathbf{u}^{[m]}$ by using formula (10) and evaluate it at the marker combination of the previous iteration: $u_k^{[m]} = -\,\partial \mathcal{R}_{\mathrm{smooth}} / \partial \hat{\eta}_k \,\big|_{\hat{\eta} = \hat{\eta}^{[m-1]}}$, $k = 1, \dots, n$.

(3) Fit the negative gradient vector $\mathbf{u}^{[m]}$ separately to each of the $p$ components of $\mathbf{X}$ via the base-learners $b_l$, $l = 1, \dots, p$.

(4) Select the component $l^*$ that best fits the negative gradient vector according to the least squares criterion, i.e., select the base-learner $b_{l^*}$ defined by $l^* = \operatorname*{argmin}_{1 \le l \le p} \sum_{i=1}^{n} \big( u_i^{[m]} - \hat{b}_l(X_{i,l}) \big)^2$.

(5) Update the marker combination for this component: $\hat{\eta}^{[m]} = \hat{\eta}^{[m-1]} + \mathrm{sl} \cdot \hat{b}_{l^*}$, where sl is a small step length ($0 < \mathrm{sl} \ll 1$). For example, if $\mathrm{sl} = 0.1$, only 10% of the fit of the base-learner is added to the current marker combination. This procedure shrinks the effect estimates towards zero, effectively increasing the numerical stability of the update step [23, 25]. As only the base-learner $b_{l^*}$ was selected, only the coefficient of component $l^*$ is updated, while all other coefficients stay constant.

(6) Stop if $m = m_{\mathrm{stop}}$. Else increase $m$ by one and go back to step (2).
By construction, the proposed boosting algorithm automatically estimates the optimal linear biomarker combination that maximizes the smoothed C-index. The principle behind the algorithm is to minimize the empirical risk by using gradient descent in function space, where the function space is spanned by the base-learners $b_l$, $l = 1, \dots, p$. In other words, the algorithm iteratively descends the empirical risk by updating $\hat{\eta}$ via the best-fitting base-learner. Because the base-learners are simple linear models (each containing only one possible biomarker as predictor variable) and because the update in step (5) of the algorithm is additive, the final solution effectively becomes a linear combination of these markers.
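The full loop can be sketched compactly. This is an illustrative Python re-implementation only (the authors' implementation is the Cindex() family of the R package mboost), and for brevity it uses unweighted Harrell-type pair weights rather than the IPCW weights of formula (9):

```python
import numpy as np

def boost_cindex(X, time, event, sigma=0.1, mstop=100, sl=0.1):
    """Componentwise gradient boosting sketch for a smoothed C-index.
    Base-learners are simple linear models without intercept, one per
    column of X; in every iteration only the coefficient of the
    best-fitting base-learner is updated."""
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=bool)
    n, p = X.shape
    # pairwise weights: usable pairs (T_i an observed event, T_i < T_j)
    w = np.zeros((n, n))
    for i in range(n):
        if event[i]:
            w[i, time[i] < time] = 1.0
    w /= w.sum()

    def neg_gradient(eta):
        # negative derivative of the smoothed risk (sigmoid smoothing)
        s = 1.0 / (1.0 + np.exp(-(eta[:, None] - eta[None, :]) / sigma))
        d = w * s * (1.0 - s) / sigma
        return d.sum(axis=1) - d.sum(axis=0)

    beta = np.zeros(p)
    for _ in range(mstop):
        u = neg_gradient(X @ beta)
        # componentwise least-squares base-learner fits (no intercept)
        slopes = (X * u[:, None]).sum(axis=0) / (X ** 2).sum(axis=0)
        rss = ((u[:, None] - X * slopes) ** 2).sum(axis=0)
        k = int(np.argmin(rss))      # best-fitting base-learner
        beta[k] += sl * slopes[k]    # update only this coefficient
    return beta
```

On toy data in which one column drives the survival times, the loop concentrates its updates on that column and the fitted combination ranks the observations accordingly.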
The two main tuning parameters of gradient boosting algorithms are the stopping iteration $m_{\mathrm{stop}}$ and the step length sl. In the literature it has been argued that the choice of the step length is of minor importance for the performance of boosting algorithms [36]. Generally, a larger step length leads to faster convergence of the algorithm. However, it also increases the risk of overshooting near the minimum of the empirical risk. In the following sections we will use a fixed step length of $\mathrm{sl} = 0.1$, which is a common recommendation in the literature on gradient boosting [36, 37] (and which is also the default value in the R package mboost [38]). The stopping iteration $m_{\mathrm{stop}}$ is considered to be the most important tuning parameter of boosting algorithms [37]. The optimal value of $m_{\mathrm{stop}}$ is usually determined by using cross-validation techniques [24]. Small values of $m_{\mathrm{stop}}$ reduce the complexity of the resulting linear combination and avoid overfitting via shrinking the effect estimates. In case of boosting the C-index, however, overfitting is less problematic, as the predictive performance of $\hat{\eta}$ is not related to the actual size of the coefficients but to the concordance of the rankings between marker values and the observed survival times. As a result, the stopping iteration is less relevant in this specific case and can also be specified as a fixed large value.
Regarding the boosting algorithm for the smoothed C-index, an additional tuning parameter is given by the smoothing parameter $\sigma$. While too large values of $\sigma$ will lead to a poor approximation of the indicator functions in (7), too small values of $\sigma$ might overfit the data (and might therefore result in a decreased prediction accuracy). Details on how to best choose the value of $\sigma$ will be given in the Numerical Results section.
The boosting algorithm presented above is implemented in the add-on software package mboost of the open source statistical programming environment R [39]. The specification of the new Cindex() family and a short description of how to apply the algorithm in practice are given in the Appendix.
Evaluation
After having applied the C-index to select the most influential genes (Task 1), and after having used the proposed boosting algorithm to combine the selected genes (Task 2), a final challenge is to evaluate the prediction accuracy of the resulting gene combination (Task 3). Since the C-index was used for Tasks 1 and 2, it is also a natural criterion to evaluate the derived marker combination in Task 3. As argued before, it is advantageous from both a methodological and a practical perspective to use the same criterion for the estimation and the evaluation of a biomarker combination.
To avoid over-optimistic estimates of prediction accuracy, it is crucial to use different observations for the optimization and the evaluation of the marker signature [25, 40]. This can be achieved either by using two completely different data sets (external evaluation) or by splitting one data set into a learning sample and a test sample. The learning sample is used to optimize the marker combination, while the “external” test sample serves only for evaluation purposes. A more elaborate strategy is subsampling (related to resampling strategies such as five-fold or ten-fold cross-validation), which is based on multiple random splits of the data. In our numerical analysis, we will use subsampling techniques in combination with stratification to divide the underlying data sets into learning and test samples (for details, see the next section).
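Stratified subsampling splits of this kind can be generated, e.g., as follows. The helper is purely illustrative, and the split fraction of 2/3 is an arbitrary choice for this sketch, not a value taken from the article:

```python
import numpy as np

def stratified_splits(event, n_splits=100, train_frac=2 / 3, seed=1):
    """Generate random learning/test index splits that preserve the
    proportion of observed events (stratified subsampling)."""
    rng = np.random.default_rng(seed)
    event = np.asarray(event, dtype=bool)
    idx_event = np.where(event)[0]    # patients with an observed event
    idx_cens = np.where(~event)[0]    # censored patients
    for _ in range(n_splits):
        learn = np.concatenate([
            rng.choice(idx_event, int(round(train_frac * len(idx_event))),
                       replace=False),
            rng.choice(idx_cens, int(round(train_frac * len(idx_cens))),
                       replace=False),
        ])
        test = np.setdiff1d(np.arange(len(event)), learn)
        yield learn, test
```

Each yielded pair partitions the observations, with the event rate of the learning sample matching that of the full data up to rounding.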
When it comes to the C-index, two additional points have to be taken into consideration. First, as the task is now to obtain the most precise estimate of the discriminatory power, it is no longer necessary to use the smoothed version of the C-index (which was introduced for numerical reasons in the boosting algorithm) for evaluation. Consequently, we propose to apply the original estimator $\hat{C}_{\mathrm{Uno}}$ for evaluating biomarker combinations in Task 3. Second, when applying the estimator $\hat{C}_{\mathrm{Uno}}$ to the observations in a test sample, a natural question is how to calculate the Kaplan-Meier estimator $\hat{G}$ of the unconditional survival function of $T_{\mathrm{cens}}$. In principle, there are three possibilities for the calculation of $\hat{G}$: the Kaplan-Meier estimator can be computed from either the test or the training data, or, alternatively, from the combined data set containing all observations in the learning and test samples. Following the principle that all estimation steps should be carried out prior to Task 3, we will base the computation of the Kaplan-Meier estimator on the learning data.
Numerical results
Simulation study
We first investigated the performance of our approach based on simulated data. The aims of our simulation study were:

To analyze whether the proposed framework is able to select a small number of informative markers from a much larger set of candidate variables.

To check whether gradient boosting is able to derive the optimal combination of the selected markers, and to compare its performance to competing Cox-based estimation schemes.

To investigate the effect of the smoothing parameter $\sigma$ inside the sigmoid function, as well as potential effects of the sample size and the censoring rate on the performance of our approach.
The simulated survival times were generated via a log-logistic distribution for accelerated failure time (AFT) models [41]. They are based on the model equation $\log(T) = \mu + \sigma_{\mathrm{AFT}} \cdot \epsilon$, where $T$ is the survival time, $\mu$ and $\sigma_{\mathrm{AFT}}$ are location and scale parameters, and $\epsilon$ is a noise variable following a standard logistic distribution. As a result, the density function for realizations $t$ can be written as

(11)  $f(t) = \frac{\exp(z)}{\sigma_{\mathrm{AFT}}\, t\, \big(1 + \exp(z)\big)^2} \quad \text{with} \quad z = \frac{\log(t) - \mu}{\sigma_{\mathrm{AFT}}}\,.$

The candidate markers were drawn from a multivariate normal distribution with positive pairwise correlation, and both the location parameter $\mu$ and the scale parameter $\sigma_{\mathrm{AFT}}$ were specified as true underlying linear combinations of the predictors. Note that only four of the candidate markers have an actual effect on the survival time. Furthermore, those four markers do not only contribute to the location parameter $\mu$ but also to the scale parameter $\sigma_{\mathrm{AFT}}$ (cf. [42]) – a setting where standard survival analysis clearly becomes problematic. In addition to the survival times $T_i$, we generated an independent sample of censoring times and defined the observed survival times by $\tilde{T}_i = \min(T_i, T_{\mathrm{cens},i})$, leading to independent censoring of 50% of the observations.
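A data-generating process of this type can be sketched as follows. The coefficient values, the pairwise correlation of 0.5 and the exponential censoring distribution are hypothetical placeholders for this sketch, not the exact specification used in the study:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 400, 10

# correlated candidate markers (pairwise correlation 0.5, assumed here)
cov = 0.5 * np.ones((p, p)) + 0.5 * np.eye(p)
X = rng.multivariate_normal(np.zeros(p), cov, size=n)

# hypothetical effects: only the first four markers are informative,
# and they enter both the location and the scale parameter
beta_mu = np.zeros(p)
beta_mu[:4] = [1.0, 0.5, -0.5, -1.0]
beta_sig = np.zeros(p)
beta_sig[:4] = [0.2, 0.2, 0.2, 0.2]

mu = X @ beta_mu
sigma_aft = np.exp(X @ beta_sig)        # keeps the scale positive
eps = rng.logistic(size=n)              # standard logistic noise
T = np.exp(mu + sigma_aft * eps)        # log-logistic AFT survival times

# independent censoring (exponential scale chosen ad hoc)
C = rng.exponential(scale=np.median(T) * 1.4, size=n)
T_obs = np.minimum(T, C)
delta = (T <= C).astype(int)            # 1 = event observed
```

The censoring scale would be tuned in practice to hit a target censoring rate such as 50%.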
In order to answer the above questions, we investigated the performance of our framework in all three tasks that are necessary to develop new gene signatures in practice (Tasks 1–3 described in the Introduction). To avoid over-optimism and biased results, we simulated separate data sets for the different tasks. In each simulation run, we first simulated a data set to preselect the most influential predictors based on the C-index (Task 1). The optimal combination of those predictors was later estimated (Task 2) by our boosting algorithm based on smaller training samples. For the final external evaluation of the prediction accuracy (Task 3), we simulated a separate test data set.
For Task 1, we first preselected a subset of $p$ predictors from the available candidate markers. We ranked the predictors based on their individual values of $\hat{C}_{\mathrm{Uno}}$ and included only the best-performing markers in the boosting algorithm. The results suggest that the C-index is clearly able to identify markers that are truly related to the outcome: although all predictors had a relatively high pairwise correlation, the four informative markers had selection probabilities between 98.5% and 99.5%, depending on the number of preselected markers $p$.
Table 1: C-index estimates on the test samples for different simulation settings ($n$ = size of the training sample, $p$ = number of preselected markers, cens. = censoring rate; values in parentheses denote standard deviations).

n    p   cens.  C-index boosting  Cox lasso     Cox ridge     true C-index
100   5  50%    0.764 (0.04)      0.731 (0.06)  0.739 (0.04)  0.779
100  10  50%    0.746 (0.06)      0.709 (0.08)  0.707 (0.06)  0.779
100  30  50%    0.689 (0.07)      0.673 (0.11)  0.637 (0.07)  0.779
100   5  30%    0.820 (0.02)      0.774 (0.04)  0.724 (0.04)  0.830
100   5  70%    0.668 (0.10)      0.628 (0.10)  0.593 (0.11)  0.690
 50   5  50%    0.741 (0.07)      0.662 (0.18)  0.722 (0.09)  0.772
200   5  50%    0.774 (0.02)      0.748 (0.04)  0.752 (0.04)  0.782
500   5  50%    0.778 (0.03)      0.759 (0.03)  0.760 (0.02)  0.781
To find the optimal combination of the preselected markers (Task 2), we applied the proposed boosting approach to the training samples. The resulting coefficient estimates are presented in Figure 1. The boosting algorithm seems to be able to derive the optimal combination of the preselected markers, as the structure displayed by the coefficients is essentially the same as that of the underlying true combination. The discriminatory power of the resulting biomarker does not depend on the absolute size of the coefficients: as the C-index is based solely on the concordance between biomarker and survival time, what matters in practice is the relative size of the coefficients. As can be seen from Figure 1, the relative sizes of both the positive and the negative effects are estimated correctly, and the coefficients of falsely selected markers are on average close to zero.
In a third step, we evaluated the performance of the resulting optimized marker combinations (Task 3) on separate test samples. The resulting estimates of $\hat{C}_{\mathrm{Uno}}$ for different simulation settings are presented in Table 1. The highest discriminatory power (median 0.764, range = 0.559–0.819) can be observed for $p = 5$, which is closest to the true number of informative markers. We compared the performance of our proposed algorithm to penalized Cox regression approaches, namely the Cox lasso [11, 12] and Cox regression with ridge penalization [13, 14] – see Figure 2. The proposed boosting approach clearly outperforms the competing estimation schemes, supporting our view that applying traditional Cox regression might be suboptimal if the discriminatory power is the performance criterion of interest. We additionally computed the optimal C-index resulting from the true marker combination with known coefficients. The values of the true C-index on the test samples are on average only slightly better than those obtained by boosting the concordance index (median 0.779 – see Table 1).
To evaluate possible effects of different sample sizes and censoring rates, we modified the mean censoring time, leading to approximate censoring rates of 30% and 70%, and generated training samples of different sizes. The results are included in Table 1. As expected, the C-index resulting from our framework increases as censoring rates become small (median 0.820, range = 0.736–0.858 for 30% censoring) and decreases in settings with a large proportion of censored observations (median 0.668, range = 0.421–0.776 for 70% censoring). However, the same effect can be observed for the true C-index resulting from the true marker combination (0.830 for 30% censoring, 0.690 for 70% censoring). For larger training samples, the variance of the coefficient estimates decreases (see Figure 1). As a result, the discriminatory power resulting from our boosting algorithm improves (for $n = 500$, median 0.778, range = 0.614–0.818) and gets nearly as good as the true C-index (0.781). This finding further confirms the ability of our approach to find the best possible marker combination – see Figure 2. Note that the true C-index also differs slightly between the different sample sizes, as the training sample enters $\hat{C}_{\mathrm{Uno}}$ via the Kaplan-Meier estimator $\hat{G}$.
To investigate the effect of the smoothing parameter $\sigma$ inside the sigmoid function, we additionally applied our boosting procedure in every simulation setting with different values of $\sigma$. The resulting estimates are presented in Table 2. Compared to the effects of the sample size $n$ or the number of preselected markers $p$, the smoothing parameter $\sigma$ only seems to have a minor effect on the performance of our algorithm. In light of these empirical results, we recommend applying a fixed small value such as $\sigma = 0.1$ (the default value in the Cindex() family for the mboost package [38] – see the Appendix).
Table 2: C-index estimates on the test samples for five candidate values of the smoothing parameter $\sigma$ ($\sigma_1 < \dots < \sigma_5$; values in parentheses denote standard deviations).

n    p   cens.  σ1            σ2            σ3            σ4            σ5
100   5  50%    0.738 (0.06)  0.757 (0.05)  0.764 (0.04)  0.763 (0.04)  0.762 (0.04)
100  10  50%    0.728 (0.06)  0.744 (0.06)  0.746 (0.06)  0.746 (0.06)  0.741 (0.05)
100  30  50%    0.700 (0.06)  0.702 (0.07)  0.689 (0.07)  0.683 (0.07)  0.666 (0.07)
100   5  30%    0.802 (0.03)  0.815 (0.02)  0.820 (0.02)  0.821 (0.02)  0.822 (0.02)
100   5  70%    0.665 (0.10)  0.667 (0.10)  0.668 (0.10)  0.665 (0.10)  0.661 (0.10)
 50   5  50%    0.719 (0.07)  0.737 (0.07)  0.741 (0.07)  0.740 (0.06)  0.725 (0.06)
200   5  50%    0.743 (0.05)  0.768 (0.03)  0.774 (0.02)  0.775 (0.02)  0.778 (0.02)
500   5  50%    0.723 (0.05)  0.769 (0.02)  0.778 (0.03)  0.779 (0.03)  0.781 (0.03)
Applications to predict the time to distant metastases
In the next step, we further analyzed the performance of our gradient boosting algorithm in two applications, estimating and evaluating optimal combinations of preselected biomarkers. The markers are used to predict the time to distant metastases in breast cancer patients. As in the simulation study, we compared the results of our proposed algorithm to Cox regression with lasso and ridge penalization. Additionally, we considered four competing boosting approaches for survival analysis, which do not directly optimize the C-index. The first is classical Cox regression, estimated via gradient boosting, while the other three are parametric accelerated failure time (AFT) models assuming a Weibull, log-normal or log-logistic distribution [45, 46]. For all four competing boosting approaches we used the corresponding pre-implemented functions of the mboost package. To ensure comparability, we used the same linear base-learners as described above for all boosting approaches.
To ensure that the combined predictor did not only work on the data it was derived from but also on “external” validation data, we carried out a subsampling procedure for both data sets to generate different learning samples and test samples. In each subsampling run, we randomly split the corresponding data set into a learning sample, which was used to optimize the biomarker combination, and a test sample. To ensure an equal distribution of patients with and without an observed development of distant metastases, we applied stratified sampling. The observations not included in the learning sample were used only to evaluate the resulting predictions via the C-index. As a result, for every data set and every method we computed 100 different marker combinations and 100 corresponding values of $\hat{C}_{\mathrm{Uno}}$.
Breast cancer data by Desmedt et al.
Desmedt et al. [1] collected a data set of 196 node-negative breast cancer patients to validate a 76-gene expression signature developed by Wang et al. [10]. The signature, which is based on Affymetrix microarrays, was developed separately for estrogen-receptor (ER) positive patients (60 genes) and ER-negative patients (16 genes). In addition to the expression levels of the 76 genes, four clinical predictor variables were considered (tumor size, ER status, tumor grade and patient age). The data are publicly available on GEO (http://www.ncbi.nlm.nih.gov/geo, accession number GSE7390).
Similar to Wang et al. [10], we used the time from diagnosis to distant metastases as the primary outcome and considered the 76 genes together with the four clinical predictors. Observed metastasis-free survival ranged from 125 days to 3652 days, with 79.08% of the survival times being censored.
The main results of our analysis are presented in Figure 3. As expected, the unified framework to estimate and evaluate the optimal marker signature based on the C-index is not only methodologically more consistent than the Cox and AFT approaches, but also leads to marker signatures that show a higher discriminatory power on external or future data (range of the C-index on test data: 0.467–0.854). As discussed in the methodological section, it is crucial to evaluate the discriminatory power on external data: the estimated C-index on the training sample was more than 35% higher on average and hence extremely overoptimistic [25, 40].
Considering the interpretation of the resulting coefficient estimates for the clinical predictors, it is crucial to note that boosting methods for the C-index and the AFT models yield biomarker combinations where larger values indicate longer predicted survival times. Classical Cox regression models, on the other hand, rely on the hazard; higher values are hence associated with shorter survival times. If this is taken into account, the results from the different approaches were in fact very similar. Both patient age and tumor size had a negative effect on the time to recurrence for all seven approaches. The same holds true for poorly differentiated tumors, which showed a negative effect compared to tumors with good or intermediate differentiation. A positive ER status, in contrast, was associated with longer metastasis-free survival in all approaches. Regarding the coefficients of the 76 genes, results from our approach to boost the C-index were highly correlated with the ones from the other four boosting approaches (which rely on the same base-learners); absolute correlation coefficients computed from the 100 subsamples ranged from 0.77 to 0.99. Also the coefficients resulting from the popular ridge-penalized Cox regression were highly correlated with the ones from our approach; absolute correlation coefficients ranged from 0.47 to 0.84.
Breast cancer data by van de Vijver et al.
The second data set, collected by the Netherlands Cancer Institute [2], consists of 144 lymph-node-positive breast cancer patients. The data set, which is publicly available as part of the R add-on package penalized [43], was used by van de Vijver et al. [2] to validate a 70-gene signature for metastasis-free survival after surgery developed by van’t Veer et al. [47]. In addition to the expression levels of the 70 genes, the data set contains five clinical predictor variables (tumor diameter, number of affected lymph nodes, ER status, grade of the tumor and patient age). Observed metastasis-free survival times were recorded in months, with a substantial proportion of the survival times being censored.
Resulting values of the C-index for the new approach and the six considered competitors are presented in Figure 3. The improvement from applying the proposed unified framework compared to boosting the Cox proportional hazards model or applying ridge-penalized Cox regression was much less pronounced than in the previous data set. However, on average, boosting the C-index still led to the best combination of markers regarding the discriminatory power (range of the C-index on test data: 0.257–0.836). Interestingly, as in the previous data set, the lasso-penalized Cox regression was clearly outperformed by its ridge-penalized competitor (which had been suggested for this specific data set by van Houwelingen et al. [48]). Furthermore, the ridge-penalized approach performed at least as well as the considered boosting approaches (except the new approach to boost the C-index). As in the previous data set, we additionally evaluated the C-index on the training sample in order to assess the resulting overoptimism. As expected, the estimated C-index on the training sample was again extremely biased.
The resulting coefficients for the clinical predictors were again comparable for the seven different approaches. A positive ER status was associated with longer metastasis-free survival for all seven approaches; the same also holds true for the age of the patient. On the other hand, the size of the tumor, the number of affected lymph nodes and a poor tumor grade resulted in a negative effect on the survival time for all approaches. Regarding the coefficients of the 70 genes, the highest correlation could again be observed for the boosting algorithms: absolute correlation coefficients obtained from the 100 subsamples ranged from 0.64 to 0.95. Correlation between the coefficients resulting from our approach to boost the C-index and the ones from ridge-penalized Cox regression was slightly lower, ranging from 0.30 to 0.82.
Discussion
In this article we have proposed a framework for the development of survival prediction rules that is based on the concordance index for time-to-event data (C-index). As the C-index is an easy-to-interpret measure of the accuracy of survival predictions (based on clinical or molecular data), it has become an important tool in medical decision making. Generally, the focus of the C-index is on measuring the “discriminatory power” of a prediction rule: it quantifies how well the rankings of the survival times and of the values of a biomarker (or marker combination) in a sample agree. In particular, the C-index is methodologically different from measures that evaluate how well a prediction rule is “calibrated” (i.e., from measures that quantify “how closely the predicted probabilities agree numerically with the actual outcomes” [49]). Specifically, prediction rules that are well calibrated do not necessarily have a high discriminatory power (and vice versa).
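The ranking character of the C-index can be made concrete with a small sketch. The following Python toy function (a hypothetical helper, not part of the article's R implementation) computes Harrell's version of the C-index on a toy sample, using the classical convention that a higher score indicates a higher risk, i.e. a shorter survival time; the article's estimator of choice additionally applies Uno's inverse-probability-of-censoring weights, which are omitted here for brevity.

```python
import itertools

def harrell_c(time, event, score):
    """Harrell's C: among usable pairs (the shorter observed time is an
    event), the fraction whose score ordering agrees with the ordering
    of the survival times; ties in the score count one half."""
    conc = usable = 0.0
    for (t_i, d_i, s_i), (t_j, d_j, s_j) in itertools.permutations(
            zip(time, event, score), 2):
        # a pair is usable only if the shorter time is an observed event
        if t_i < t_j and d_i == 1:
            usable += 1
            if s_i > s_j:        # concordant: earlier event, higher risk
                conc += 1
            elif s_i == s_j:     # tied scores contribute 1/2
                conc += 0.5
    return conc / usable        # assumes at least one usable pair
```

Note that the result depends only on rankings: any strictly monotone transformation of the scores leaves the C-index unchanged, which is exactly why it measures discrimination rather than calibration.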
While several authors have proposed the use of the C-index for feature selection [9] and for the evaluation of molecular signatures [21, 22], the main contribution of this paper is a new approach for the derivation of marker combinations that is based directly on the C-index. Consequently, when using the proposed method, it is no longer necessary to rely on traditional methods such as Cox regression, which focus on the derivation of well-calibrated prediction rules instead of well-discriminating prediction rules and may therefore be suboptimal when the optimization of the discriminatory power is of main interest.
Conceptually, our approach is directly in line with recent articles by Ma and Huang [34], Wang and Chang [50] and Schmid et al. [51], who developed a set of algorithms for the optimization of discrimination measures for binary outcomes, such as the area under the curve (AUC) and the partial area under the curve (PAUC). Because the C-index is in fact a summary measure of a correspondingly defined AUC measure for time-to-event data [26], our optimization technique relies on similar methodological concepts, such as the application of boosting methods and the use of smoothed indicator functions.
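The smoothing idea can be illustrated in a few lines. In the C-index, the non-differentiable indicator of a concordant pair is replaced by a sigmoid whose sharpness is controlled by a smoothing parameter sigma; this makes the criterion differentiable and hence amenable to gradient boosting. The Python sketch below (function names are illustrative, not from the article's implementation) shows the surrogate and how it approaches the hard indicator as sigma shrinks:

```python
import math

def hard_indicator(x):
    """I(x < 0): the non-differentiable building block of the C-index."""
    return 1.0 if x < 0 else 0.0

def smooth_indicator(x, sigma=0.1):
    """Sigmoid surrogate 1 / (1 + exp(x / sigma)): equals 1/2 at x = 0,
    converges to the hard indicator as sigma -> 0, and has a
    well-defined gradient everywhere."""
    return 1.0 / (1.0 + math.exp(x / sigma))
```

Small values of sigma give a close approximation of the indicator but a steeper, more localized gradient; this trade-off is the reason sigma appears as a tuning parameter of the Cindex family in the Appendix.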
A possible future extension of our approach might be to include the task of selecting the most influential genes in the proposed boosting algorithm. While our simulation study and the breast cancer examples were based on a preselection of genes, the proposed boosting method could also be applied directly to high-dimensional molecular data, so that Tasks 1 and 2 are effectively combined. This can be accomplished by optimizing the stopping iteration so that only a (low-dimensional) subset of the candidate genes is incorporated in the resulting marker combination (“early stopping”, cf. [37]). Further research is warranted on the issues of early stopping and automated feature selection in the case of boosting the concordance index for survival data.
The results of our empirical analysis suggest that the new approach is competitive with state-of-the-art methods for the derivation of marker combinations. As demonstrated in the Numerical Results section, the resulting marker combinations are not only easy to compute and meaningful to interpret but can also lead to a higher discriminatory power of the resulting gene signatures.
Acknowledgments
The authors thank Sergej Potapov for his help with the analysis of the breast cancer data. The work of Matthias Schmid and Andreas Mayr was supported by Deutsche Forschungsgemeinschaft (DFG) (www.dfg.de), grant SCHM 2966/1-1. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
References
 Desmedt et al. [2007] C. Desmedt, F. Piette, S. Loi, Y. Wang, F. Lallemand, B. Haibe-Kains, G. Viale, M. Delorenzi, Y. Zhang, M. S. d’Assignies, J. Bergh, R. Lidereau, P. Ellis, A. L. Harris, J. G. M. Klijn, J. A. Foekens, F. Cardoso, M. J. Piccart, M. Buyse, and C. Sotiriou. Strong time dependence of the 76-gene prognostic signature for node-negative breast cancer patients in the TRANSBIG multicenter independent validation series. Clinical Cancer Research, 13:3207–3214, 2007.
 van de Vijver et al. [2002] M. J. van de Vijver, Y. D. He, L. J. van’t Veer, H. Dai, A. A. M. Hart, D. W. Voskuil, G. J. Schreiber, J. L. Peterse, C. Roberts, M. J. Marton, M. Parrish, D. Atsma, A. Witteveen, A. Glas, L. Delahaye, T. van der Velde, H. Bartelink, S. Rodenhuis, E. T. Rutgers, S. H. Friend, and R. Bernards. A gene-expression signature as a predictor of survival in breast cancer. New England Journal of Medicine, 347(25):1999–2009, 2002.
 Kok et al. [2009] M. Kok, S. C. Linn, R. K. Van Laar, M. P. H. M. Jansen, T. M. van den Berg, L. J. M. J. Delahaye, A. M. Glas, J. L. Peterse, M. Hauptmann, J. A. Foekens, J. G. M. Klijn, L. F. A. Wessels, L. J. Van’t Veer, and E. M. J. J. Berns. Comparison of gene expression profiles predicting progression in breast cancer patients treated with tamoxifen. Breast Cancer Research and Treatment, 13:275–283, 2009.
 Li and Gui [2004] H. Li and J. Gui. Partial Cox regression analysis for highdimensional microarray gene expression data. Bioinformatics, 20:i208–i215, 2004.
 Chang et al. [2004] H. Y. Chang, J. B. Sneddon, A. A. Alizadeh, R. Sood, R. B. West, K. Montgomery, J.T. Chi, M. van de Rijn, D. Botstein, and P. O. Brown. Gene expression signature of fibroblast serum response predicts human cancer progression: Similarities between tumors and wounds. PLoS Biology, 2(2), 2004. e7.
 Gilad et al. [2008] Shlomit Gilad, Eti Meiri, Yariv Yogev, Sima Benjamin, Danit Lebanony, Noga Yerushalmi, Hila Benjamin, Michal Kushnir, Hila Cholakh, Nir Melamed, Zvi Bentwich, Moshe Hod, Yaron Goren, and Ayelet Chajut. Serum microRNAs are promising novel biomarkers. PLoS ONE, 3(9):e3148, 2008. doi: 10.1371/journal.pone.0003148.
 Wistuba et al. [2011] Ignacio I. Wistuba, Juri G. Gelovani, Jörg J. Jacoby, Suzanne E. Davis, and Roy S. Herbst. Methodological and practical challenges for personalized cancer therapies. Nature Reviews Clinical Oncology, 8(3):135–141, 2011.
 Witten and Tibshirani [2010] Daniela M. Witten and Robert Tibshirani. Survival analysis with high-dimensional covariates. Statistical Methods in Medical Research, 19(1):29–51, 2010.
 Ma and Song [2011] S. Ma and X. Song. Ranking prognosis markers in cancer genomic studies. Briefings in Bioinformatics, 12(1):33–40, 2011.
 Wang et al. [2005] Y. Wang, J. G. Klijn, Y. Zhang, A. M. Sieuwerts, M. P. Look, F. Yang, D. Talantov, M. Timmermans, M. E. Meijer-van Gelder, J. Yu, T. Jatkoe, E. M. Berns, D. Atkins, and J. A. Foekens. Gene-expression profiles to predict distant metastasis of lymph-node-negative primary breast cancer. Lancet, 365(9460):671–679, 2005.
 Tibshirani et al. [1997] Robert Tibshirani et al. The lasso method for variable selection in the Cox model. Statistics in medicine, 16(4):385–395, 1997.
 Goeman [2010] Jelle J. Goeman. L1 penalized estimation in the Cox proportional hazards model. Biometrical Journal, 52(1):70–84, 2010.
 Li and Luan [2002] Hongzhe Li and Yihui Luan. Kernel Cox regression models for linking gene expression profiles to censored survival data. In Pacific Symposium on Biocomputing, volume 8, page 65, 2002.
 Gui and Li [2005] Jiang Gui and Hongzhe Li. Penalized Cox regression analysis in the high-dimensional and low-sample size settings, with applications to microarray gene expression data. Bioinformatics, 21(13):3001–3008, 2005.
 Schmid and Potapov [2012] M. Schmid and S. Potapov. A comparison of estimators to evaluate the discriminatory power of time-to-event models. Statistics in Medicine, 31(23):2588–2609, 2012.
 Harrell et al. [1982] F. E. Harrell, R. M. Califf, D. B. Pryor, K. L. Lee, and R. A. Rosati. Evaluating the yield of medical tests. Journal of the American Medical Association, 247(18):2543–2546, 1982.
 Harrell et al. [1984] F. E. Harrell, K. L. Lee, and Califf, R. M. et al. Regression modeling strategies for improved prognostic prediction. Statistics in Medicine, 3(2):143–152, 1984.
 Uno et al. [2011] H. Uno, T. Cai, M. J. Pencina, R. B. D’Agostino, and L. J. Wei. On the Cstatistics for evaluating overall adequacy of risk prediction procedures with censored survival data. Statistics in Medicine, 30(10):1105–1117, 2011.
 Kaderali et al. [2006] L. Kaderali, T. Zander, U. Faigle, J. Wolf, J. L. Schultze, and R. Schrader. CASPAR: A hierarchical Bayesian approach to predict survival times in cancer from gene expression data. Bioinformatics, 22(12):1495–1502, 2006.
 Pepe et al. [2008] M. S. Pepe, Y. Zheng, and Jin, Y. et al. Evaluating the ROC performance of markers for future events. Lifetime Data Analysis, 14(1):86–113, 2008.
 Zhang et al. [2013] Haibo Zhang, Weixiong Xia, Xing Lu, Rui Sun, Lin Wang, Lisheng Zheng, Yanfang Ye, Yingna Bao, Yanqun Xiang, and Xiang Guo. A novel statistical prognostic score model that includes serum CXCL5 levels and clinical classification predicts risk of disease progression and survival of nasopharyngeal carcinoma patients. PLOS ONE, 8, 02 2013.
 Zhao et al. [2011] Xi Zhao, Einar Andreas Rødland, Therese Sørlie, Bjørn Naume, Anita Langerød, Arnoldo Frigessi, Vessela N. Kristensen, AnneLise BørresenDale, and Ole Christian Lingjærde. Combining gene signatures improves prediction of breast cancer survival. PLoS ONE, 6(3):e17845, 03 2011.
 Friedman [1997] Jerome H. Friedman. On bias, variance, 0/1-loss, and the curse-of-dimensionality. Data Mining and Knowledge Discovery, 1(1):55–77, 1997.
 Bühlmann and Hothorn [2007] P. Bühlmann and T. Hothorn. Boosting algorithms: Regularization, prediction and model fitting (with discussion). Statistical Science, 22:477–522, 2007.
 Harrell et al. [1996] F. E. Harrell, Kerry L. Lee, and Daniel B. Mark. Tutorial in biostatistics: Multivariable prognostic models: issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors. Statistics in Medicine, 15:361–387, 1996.
 Heagerty and Zheng [2005] Patrick J. Heagerty and Yingye Zheng. Survival model predictive accuracy and ROC curves. Biometrics, 61(1):92–105, 2005.
 PubMed [2013] PubMed. US national library of medicine national institutes of health, 2013. URL http://www.ncbi.nlm.nih.gov/pubmed/.
 Gail et al. [1989] Mitchell H Gail, Louise A Brinton, David P Byar, Donald K Corle, Sylvan B Green, Catherine Schairer, and John J Mulvihill. Projecting individualized probabilities of developing breast cancer for white females who are being examined annually. Journal of the National Cancer Institute, 81(24):1879–1886, 1989.
 [2005] Mammographic breast density and the Gail model for breast cancer risk prediction in a screening population. Breast Cancer Research and Treatment, 94(2), 2005. doi: 10.1007/s10549-005-5152-4.
 Antolini et al. [2005] L. Antolini, P. Boracchi, and E. Biganzoli. A timedependent discrimination index for survival data. Stat. Med., 24(24):3927–3944, 2005.
 van der Laan and Robins [2003] M. J. van der Laan and J. M. Robins. Unified Methods for Censored Longitudinal Data and Causality. Springer, New York, 2003.
 Schmid et al. [2013] M. Schmid, H. A. Kestler, and S. Potapov. On the validity of time-dependent AUC estimators. Briefings in Bioinformatics, to appear, 2013.
 Friedman et al. [2000] J. H. Friedman, T. Hastie, and R. Tibshirani. Additive logistic regression: A statistical view of boosting (with discussion). The Annals of Statistics, 28:337–407, 2000.
 Ma and Huang [2005] S. Ma and J. Huang. Regularized ROC method for disease classification and biomarker selection with microarray data. Bioinformatics, 21(24):4356–4362, 2005.
 Buehlmann [2006] Peter Buehlmann. Boosting for highdimensional linear models. The Annals of Statistics, pages 559–583, 2006.
 Schmid and Hothorn [2008a] Matthias Schmid and Torsten Hothorn. Boosting additive models using component-wise P-splines. Computational Statistics & Data Analysis, 53:298–311, 2008a.
 Mayr et al. [2012a] A. Mayr, B. Hofner, and M. Schmid. The importance of knowing when to stop. A sequential stopping rule for component-wise gradient boosting. Methods of Information in Medicine, 51(2):178, 2012a.
 Hothorn et al. [2013] Torsten Hothorn, Peter Bühlmann, Thomas Kneib, Matthias Schmid, and Benjamin Hofner. mboost: Model-Based Boosting, 2013. URL http://CRAN.R-project.org/package=mboost. R package version 2.2-3.
 R Core Team [2013] R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2013. URL http://www.R-project.org/. ISBN 3-900051-07-0.
 Lecocke and Hess [2006] Michael Lecocke and Kenneth Hess. An empirical study of univariate and genetic algorithm-based feature selection in binary classification with microarray data. Cancer Informatics, 2:313, 2006.
 Klein and Moeschberger [2003] J. P. Klein and M. L. Moeschberger. Survival Analysis  Techniques for Censored and Truncated Data. Springer, New York, 2nd edition, 2003.
 Mayr et al. [2012b] A. Mayr, N. Fenske, B. Hofner, T. Kneib, and M. Schmid. Generalized additive models for location, scale and shape for high-dimensional data – a flexible approach based on boosting. Journal of the Royal Statistical Society: Series C (Applied Statistics), 61(3):403–427, 2012b.
 Goeman [2012] J. Goeman. penalized: L1 (Lasso) and L2 (Ridge) Penalized Estimation in GLMs and in the Cox Model, 2012. R package version 0.9-42.
 Potapov et al. [2012] S. Potapov, W. Adler, and M. Schmid. survAUC: Estimators of Prediction Accuracy for Time-to-Event Data, 2012. R package version 1.0-5.
 Hothorn et al. [2006] T. Hothorn, P. Bühlmann, S. Dudoit, A. Molinaro, and M.J. Van Der Laan. Survival ensembles. Biostatistics, 7(3):355–373, 2006.
 Schmid and Hothorn [2008b] Matthias Schmid and Torsten Hothorn. Flexible boosting of accelerated failure time models. BMC Bioinformatics, 9:269, 2008b.
 van’t Veer et al. [2002] L. J. van’t Veer, H. Y. Dai, M. J. van de Vijver, Y. D. D. He, A. A. M. Hart, M. Mao, H. L. Peterse, K. van der Kooy, M. J. Marton, A. T. Witteveen, G. J. Schreiber, R. M. Kerkhoven, C. Roberts, P. S. Linsley, R. Bernard, and S. H. Friend. Gene expression profiling predicts clinical outcome of breast cancer. Nature, 415(6871):530–536, 2002.
 van Houwelingen et al. [2006] Hans C. van Houwelingen, Tako Bruinsma, Augustinus A. M. Hart, Laura J. van’t Veer, and Lodewyk F. A. Wessels. Cross-validated Cox regression on microarray gene expression data. Statistics in Medicine, 25(18):3201–3216, 2006.
 Pencina and D’Agostino [2004] M. J. Pencina and R. B. D’Agostino. Overall C as a measure of discrimination in survival analysis: Model specific population value and confidence interval estimation. Stat. Med., 23(13):2109–2123, 2004.
 Wang and Chang [2011] Z. Wang and Y.C.I. Chang. Marker selection via maximizing the partial area under the ROC curve of linear risk scores. Biostatistics, 12(2):369–385, 2011.
 Schmid et al. [2012] M. Schmid, T. Hothorn, F. Krause, and C. Rabe. A PAUC-based estimation technique for disease classification and biomarker selection. Statistical Applications in Genetics and Molecular Biology, 11, Article 3, 2012.
 Schapire [1989] Robert E. Schapire. The Strength of Weak Learnability. CICS (Series). Massachusetts Institute of Technology. Laboratory for Computer Science Center for Intelligent Control Systems, 1989.
 Freund [1990] Yoav Freund. Boosting a weak learning algorithm by majority. In Mark A. Fulk and John Case, editors, Proceedings of the Third Annual Workshop on Computational Learning Theory, COLT 1990, University of Rochester, Rochester, NY, USA, August 6–8, 1990, pages 202–216, 1990.
 Ridgeway [1999] Greg Ridgeway. The state of boosting. Computing Science and Statistics, 31:172–181, 1999.
 Vapnik [1996] Vladimir Vapnik. The Nature of Statistical Learning Theory (Information Science and Statistics). Springer, 1996.
 Breiman [2001] L. Breiman. Random forests. Machine Learning, 45:5–32, 2001.
 Binder et al. [2009] Harald Binder, Arthur Allignol, Martin Schumacher, and Jan Beyersmann. Boosting for high-dimensional time-to-event data with competing risks. Bioinformatics, 25(7):890–896, 2009.
 Tutz and Binder [2006] G. Tutz and H. Binder. Generalized additive modeling with implicit variable selection by likelihoodbased boosting. Biometrics, 62:961–971, 2006.
 Hofner et al. [2012] Benjamin Hofner, Andreas Mayr, Nikolay Robinzonov, and Matthias Schmid. Model-based boosting in R: A hands-on tutorial using the R package mboost. Computational Statistics, pages 1–33, 2012.
Appendix
Gradient boosting of the C-index
The concept of boosting was first introduced in the field of machine learning [52, 53]. The basic idea is to boost the accuracy of a relatively weak classifier (termed “base-learner”) by iteratively applying the base-learner and aggregating its solutions into a more accurate prediction. Generally, this leads to a drastically improved prediction accuracy compared to a single solution of the base-learner [54]. The basic concept was later adapted to fit statistical regression models in a forward stagewise fashion [33, 35]. One of the main advantages of this approach is the interpretability of the final solution, which is essentially the same as in any other statistical model [24]. This cannot be achieved with competing machine learning algorithms such as Support Vector Machines [55] or Random Forests [56]. Specifically, the boosting approach can be used to develop prediction rules for survival outcomes [45, 46, 57]. Although likelihood-based approaches for boosting also exist [58], we focus here on gradient-based boosting [24], as it is the better-fitting approach for boosting the distribution-free C-index.
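The forward stagewise scheme can be made concrete with a minimal sketch of componentwise L2-boosting, i.e. gradient boosting with squared-error loss and simple linear base-learners. This Python toy version (illustrative names, written under the assumption of a numeric design matrix with no all-zero columns) only demonstrates the select-then-update mechanics; the algorithm of the article replaces the residuals below by the negative gradient of the smoothed C-index risk.

```python
def componentwise_boost(X, y, nu=0.1, mstop=100):
    """Minimal componentwise L2-boosting sketch: in each iteration, fit
    every univariate least-squares base-learner to the current residuals
    (the negative gradient of the squared-error loss), then update only
    the coefficient of the best-fitting component by a small step nu."""
    n, p = len(X), len(X[0])
    coef = [0.0] * p          # accumulated linear combination
    fit = [0.0] * n           # current additive predictor
    for _ in range(mstop):
        resid = [yi - fi for yi, fi in zip(y, fit)]
        best_j, best_b, best_rss = 0, 0.0, float("inf")
        for j in range(p):
            xj = [row[j] for row in X]
            sxx = sum(v * v for v in xj)              # assumes sxx > 0
            b = sum(v * r for v, r in zip(xj, resid)) / sxx
            rss = sum((r - b * v) ** 2 for r, v in zip(resid, xj))
            if rss < best_rss:
                best_j, best_b, best_rss = j, b, rss
        coef[best_j] += nu * best_b                   # stagewise update
        xj = [row[best_j] for row in X]
        fit = [f + nu * best_b * v for f, v in zip(fit, xj)]
    return coef
```

Because only one component is updated per iteration, stopping the algorithm early leaves irrelevant coefficients at exactly zero, which is the mechanism behind the variable selection property discussed in the main text.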
One of the most flexible implementations of gradient boosting is the mboost [38] add-on package for the open-source programming environment R [39]. The mboost package contains a large variety of pre-implemented base-learners and loss functions that can be combined by the user via different fitting functions. For a tutorial on how to apply the package in practical data analysis, see [59].
To apply gradient boosting to optimize linear biomarker combinations with respect to the C-index in the version of Uno et al. [18], it is necessary to specify the newly developed Cindex() family inside the glmboost() function.
The Cindex family object includes the sigmoid function as a smooth approximation of the indicator functions in the estimated C-index. The sigmoid function is evaluated inside the R functions approxGrad() and approxLoss(), which are part of the Cindex object. The weights
w_{jk} = \frac{\Delta_j \, \hat{G}(\tilde{t}_j)^{-2} \, I(\tilde{t}_j < \tilde{t}_k)}{\sum_{i,l} \Delta_i \, \hat{G}(\tilde{t}_i)^{-2} \, I(\tilde{t}_i < \tilde{t}_l)}   (12)
are computed via the internal function compute_weights() for both the empirical risk
\mathrm{risk}(\hat{\eta}) = - \sum_{j,k} w_{jk} \cdot \left( 1 + \exp\!\left( \frac{\hat{\eta}_j - \hat{\eta}_k}{\sigma} \right) \right)^{-1}   (13)
(implemented in the risk() function) as well as for the negative gradient
- \frac{\partial \, \mathrm{risk}}{\partial \hat{\eta}_i} = \sum_{j} w_{ji} \cdot \frac{\exp\!\left( \frac{\hat{\eta}_j - \hat{\eta}_i}{\sigma} \right)}{\sigma \left( 1 + \exp\!\left( \frac{\hat{\eta}_j - \hat{\eta}_i}{\sigma} \right) \right)^{2}} \; - \; \sum_{k} w_{ik} \cdot \frac{\exp\!\left( \frac{\hat{\eta}_i - \hat{\eta}_k}{\sigma} \right)}{\sigma \left( 1 + \exp\!\left( \frac{\hat{\eta}_i - \hat{\eta}_k}{\sigma} \right) \right)^{2}}   (14)
(implemented in the ngradient() function).
These different functions that define the optimization problem are finally plugged into the mboost-specific Family() function to build a new boosting family. Details on how to implement user-specific families in mboost are presented in the Appendix of [59]. The complete Cindex object is then given as follows:
Cindex <- function(sigma = 0.1) {
  approxGrad <- function(x) {      ## sigmoid function for gradient
    exp(x/sigma) / (sigma * (1 + exp(x/sigma))^2)
  }
  approxLoss <- function(x) {      ## sigmoid function for loss
    1 / (1 + exp(x / sigma))
  }
  compute_weights <- function(y, w = 1) {   ## compute weights
    ipcw_wow <- IPCweights(y[w != 0, ])
    ipcw <- numeric(nrow(y))
    ipcw[w != 0] <- ipcw_wow
    survtime <- y[, 1]
    n <- nrow(y)
    wweights <- matrix((ipcw)^2, nrow = n, ncol = n)
    weightsj <- matrix(survtime, nrow = n, ncol = n)
    weightsk <- matrix(survtime, nrow = n, ncol = n, byrow = TRUE)
    weightsI <- (weightsj < weightsk) + 0
    wweights <- wweights * weightsI
    Wmat <- w %o% w
    wweights <- wweights * Wmat
    wweights <- wweights / sum(wweights)
    rm(weightsI); rm(weightsk); rm(weightsj)
    return(wweights)
  }
  ngradient = function(y, f, w = 1) {       ## negative gradient
    if (!all(w %in% c(0, 1)))
      stop(sQuote("weights"), " must be either 0 or 1 for family ",
           sQuote("UnoC"))
    survtime <- y[, 1]
    event <- y[, 2]
    if (length(w) == 1) w <- rep(1, length(event))
    if (length(f) == 1) f <- rep(f, length(survtime))
    n <- length(survtime)
    etaj <- matrix(f, nrow = n, ncol = n, byrow = TRUE)
    etak <- matrix(f, nrow = n, ncol = n)
    etaMat <- etak - etaj
    rm(etaj); rm(etak)
    weights_out <- compute_weights(y, w)
    M1 <- approxGrad(etaMat) * weights_out
    ng <- colSums(M1) - rowSums(M1)
    return(ng)
  }
  risk = function(y, f, w = 1) {            ## empirical risk
    survtime <- y[, 1]
    event <- y[, 2]
    if (length(f) == 1) f <- rep(f, length(y))
    n <- length(survtime)
    etaj <- matrix(f, nrow = n, ncol = n, byrow = TRUE)
    etak <- matrix(f, nrow = n, ncol = n)
    etaMat <- etak - etaj
    rm(etaj); rm(etak)
    weights_out <- compute_weights(y, w)
    M1 <- approxLoss(etaMat) * weights_out
    return(- sum(M1))
  }
  Family(                                   ## build the family object
    ngradient = ngradient,
    risk = risk,
    weights = "zeroone",
    offset = function(y, w = 1) {0},
    check_y = function(y) {
      if (!inherits(y, "Surv"))
        stop("response is not an object of class ", sQuote("Surv"),
             " but ", sQuote("family = UnoC()"))
      y
    },
    rclass = function(f) {},
    name = paste("Concordance Probability by Uno")
  )
}
Application
We will briefly demonstrate how to apply the Cindex family in practice to derive the optimal combination of preselected biomarkers. We will use the van de Vijver et al. [2] data set of lymph-node-positive breast cancer patients that was also considered in the main article. The data set is publicly available as part of the R add-on package penalized [43]. The 70-gene signature for metastasis-free survival after surgery was originally developed by van’t Veer et al. [47].
We first split the data set into 100 training observations and 44 test observations. To keep the code readable, we do not carry out stratified subsampling but simply use the first 100 patients as training sample. Model fitting is carried out by the glmboost() function of the mboost package. As linear models are the default base-learners for glmboost(), no additional base-learner has to be specified. As the appropriate family object we specify the Cindex family described above.
For evaluating the discriminatory power of the resulting prediction rule on the test data, we use the UnoC() function of the survAUC package [44]. It implements the unbiased estimator proposed by Uno et al. [18].
## load add-on packages
library(penalized)   ## for the data set
library(mboost)      ## for boosting
library(survAUC)     ## for evaluation

data(nki70)          ## loading the data
source("Cindex.R")   ## loading the family defined above

## split the data set in training and test sample (simplified):
dtrain <- nki70[1:100, ]
dtest  <- nki70[101:144, ]

## fit a model via the glmboost() function
## formula : defines the candidate model; the response is the survival
##           object Surv(time, event); via '~ .' all remaining variables
##           in the data set serve as possible predictors
## family  : defines the optimization problem (in this case the C-index);
##           sigma is the smoothing parameter of the sigmoid function that
##           approximates the indicator functions. The default value is 0.1.
## control : defines other boosting-specific tuning parameters like the
##           stopping iteration mstop or the step-length nu; trace = TRUE is
##           only for convenience (shows the trace of the empirical risk).
## data    : defines the data set -> training sample
mod1 <- glmboost(Surv(time, event) ~ ., family = Cindex(sigma = 0.1),
                 control = boost_control(mstop = 500, trace = TRUE, nu = 0.1),
                 data = dtrain)

## The stopping iteration can be changed via simple indexing:
mod1 <- mod1[50000]  ## Long runtime: 50000 iterations take at least
                     ## a couple of minutes on a standard machine

## Now take a look at the resulting combination
coef(mod1)

## Prediction on test data
preds <- predict(mod1, newdata = dtest)

## Evaluate the discriminatory power
UnoC(Surv(dtrain$time, dtrain$event), Surv(dtest$time, dtest$event),
     lpnew = preds)