# Variable Selection with Random Survival Forest and Bayesian Additive Regression Tree for Survival Data

###### Abstract

In this paper we utilize a survival analysis methodology incorporating Bayesian additive regression trees to account for nonlinear and additive covariate effects. We compare the performance of Bayesian additive regression trees, Cox proportional hazards and random survival forests models for censored survival data, using simulation studies and a survival analysis of breast cancer with the U.S. SEER database for the year 2005. In the simulation studies, we compare the three models across varying sample sizes and censoring rates on the basis of bias and prediction accuracy. In the breast cancer survival analysis, we retrospectively analyze a subset of 1500 patients having invasive ductal carcinoma, a common form of breast cancer mostly affecting older women. The predictive potential of the three models is then compared using some widely used performance assessment measures in the survival literature.

^{†}footnotetext: Satabdi Saha is Ph.D. student in Department of Statistics and Probability, Michigan State University, East Lansing, MI 48824 (Email: sahasata@stt.msu.edu). Duchwan Ryu is Associate Professor, Division of Statistics, Northern Illinois University, Dekalb, IL 60115 (E-mail: dryu@niu.edu). Nader Ebrahimi is Professor of Statistics, Northern Illinois University, Dekalb, IL 60115 (E-mail: nebrahim@niu.edu).

Keywords: Classification and Regression Trees; Survival Ensemble Learning Algorithm; Integrated Brier Score; Receiver Operating Characteristic Curve

## 1 Introduction

The identification of appropriate covariates and the adequate handling of nonlinearity and high dimensionality are critical to investigating the effects of important covariates on event times and to accurate survival prediction. The Cox proportional hazards (CPH) model is a natural first step, since it is the most commonly used model for relating event times to covariates. It assumes a nonparametric form for the baseline hazard and allows testing for differences in event times of two or more groups of interest, while adjusting for covariates. The CPH model, however, may produce biased parameter estimates if the covariates fail to follow the proportional hazards assumption. Moreover, the assumption is often violated due to the presence of complex relationships in the data. Failure of this assumption sometimes leads to the use of stratified Cox models that stratify with respect to covariates, or time dependent Cox models that incorporate a time dependent interaction term to deal with the non-proportionality of hazards. However, several other problems, including nonlinearity, interactions between covariates and high dimensional parameter spaces, are still hard to address effectively using CPH. Various nonparametric modeling methods such as penalized regression (tibshirani1997lasso; zhang2007adaptive), survival trees (leblanc1993survival), boosting with Cox-gradient descent (ma2007clustering) and random survival forests (ishwaran2008random) have been proposed to address the various problems faced by the CPH model.

The use of classification and regression trees (CART) and other recursive partitioning methods has allowed for a more detailed study of the effects of covariates on the survival distribution. CART has proved to be an efficient tool for identifying significant covariates and their interactions. Its main advantages over other methods include effectively analyzing and interpreting complex nonlinear and high dimensional survival data while reducing dimensionality. However, the CART method often suffers from high variance in prediction. Various methods that combine a set of tree models, so-called ensemble methods, have attracted much attention as ways to decrease the variance and increase the prediction accuracy of CART. These include boosting (freund1995desicion; friedman2001greedy), bagging (breiman1996bagging) and random forests (breiman2001random), each of which uses a different technique to fit a linear combination of trees. These flexible nonparametric modeling methods have been successfully applied in the context of survival analysis: hothorn2004bagging used bagged survival trees, hothorn2006survival utilized survival ensembles, and ishwaran2008random considered random survival forests (RSF).

Some Bayesian models take the average of posteriors arising from single-tree models, as in chipman1998bayesian; Mallick1999; blanchard2004algorithme. Such model averaging uses posterior probabilities as weights for averaging the predictions from individual trees. This idea has been further developed into a Bayesian sum-of-trees model, where each tree is constrained by a regularization prior to be a weak learner. Motivated by ensemble methods, this sum-of-trees model, known as Bayesian additive regression trees (BART), provides a framework to model the association of covariates and outcomes nonparametrically and flexibly. BART models have shown excellent predictive performance, for both continuous and binary outcomes, compared to random forests and boosting (chipman2010bart). Recently the idea of BART has been extended to analyze survival data by expressing the nonparametric likelihood for the Kaplan-Meier estimator in a form suitable for BART (sparapani2016nonparametric). It has performed very well in terms of prediction error, with medium variance and medium bias, while identifying complex nonlinear relationships and interactions in the data through efficient variable selection.

Among the proposed flexible methods, in this paper, we compare the performance of RSF and BART, as well as CPH, in the prediction of survival probability, in variable selection and in the determination of marginal effects of the covariates. The models are then compared and contrasted using simulation studies and a benchmark breast cancer dataset of invasive ductal carcinoma, sometimes called infiltrating ductal carcinoma, which is the most common type of breast cancer. According to the American Cancer Society, more than 180,000 women in the United States are diagnosed with invasive breast cancer each year and most of these are cases of invasive ductal carcinoma. Although invasive ductal carcinoma can affect women at any age, it is more common among older women: about two-thirds of women are 55 or older when they are diagnosed with an invasive breast cancer. There have been several studies reporting the effects of tumor size, tumor stage and tumor grade on the survival of breast cancer patients, including d2001prognostic; rosenberg2005effect; delen2005predicting; omurlu2009comparisons; faradmal2014comparison. We utilize data extracted from the U.S. National Cancer Institute's Surveillance, Epidemiology, and End Results (SEER) database for the years 2005 to 2013, based on the November 2015 submission, which contains a large amount of data on several factors affecting breast cancer.

The remainder of this paper is organized as follows. In section 2 we review the three models chosen for analysis, along with several performance assessment measures used to compare and contrast their predictive abilities. In section 3, we conduct simulation studies to compare the effectiveness of the BART, CPH and RSF models in regression scenarios. In section 4, we use a real-life dataset to demonstrate and compare the potential of the two ensemble learning methods. The survival experience of 1500 patients having invasive ductal carcinoma is studied using the CPH, RSF and BART models and their respective performances are compared using standard methods. Finally, conclusions from our findings are given in section 5.

## 2 Methods

For the $i$th subject, $i = 1, \ldots, n$, let $(t_i, \delta_i, \boldsymbol{x}_i)$ denote the survival data with the time to event $t_i$, the censoring indicator $\delta_i$ that takes 0 if the event is right censored and takes 1 otherwise, and the vector of covariates $\boldsymbol{x}_i = (x_{i1}, \ldots, x_{ip})^\top$. In the analysis of survival time the intended purpose is to estimate the survival function $S(t)$ and the hazard function $h(t)$ at time $t$ that are defined as

$$ S(t) = \Pr(T > t) = 1 - F(t), \qquad h(t) = \frac{f(t)}{S(t)}, \qquad (1) $$

where $F(t)$ denotes the cumulative distribution function and $f(t)$ denotes the probability density function at time $t$. In the following subsections we briefly review three survival analysis methods, the Cox proportional hazards, random survival forests and Bayesian additive regression trees methods, with respect to the order statistics of the $K$ distinct event times $t_{(1)} < t_{(2)} < \cdots < t_{(K)}$ with $t_{(0)} = 0$, and review variable selection methods in survival analysis.

### 2.1 Cox Proportional Hazards Regression

The Cox proportional hazards model fits the survival data with the proportional hazard at time $t$ given the vector of covariates $\boldsymbol{x}$ such that

$$ h(t \mid \boldsymbol{x}) = h_0(t) \exp(\boldsymbol{x}^\top \boldsymbol{\beta}), \qquad (2) $$

where $\boldsymbol{\beta}$ is the vector of unknown regression coefficients and $h_0(t)$ is an arbitrary baseline hazard function. Hence, the cumulative hazard function and survival function are respectively given as

$$ H(t \mid \boldsymbol{x}) = H_0(t) \exp(\boldsymbol{x}^\top \boldsymbol{\beta}), \qquad S(t \mid \boldsymbol{x}) = \exp\!\big\{ -H_0(t) \exp(\boldsymbol{x}^\top \boldsymbol{\beta}) \big\}, \qquad (3) $$

where $H_0(t) = \int_0^t h_0(u)\, du$ is the baseline cumulative hazard function.
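As a small illustration of equations (2)-(3), the following sketch computes the survival probability for a given covariate vector, assuming the baseline cumulative hazard at time $t$ is already known (in practice $H_0$ must be estimated from data, e.g. by the Breslow estimator; the numeric values below are hypothetical):

```python
import numpy as np

def cox_survival(H0_t: float, x: np.ndarray, beta: np.ndarray) -> float:
    """S(t | x) = exp{-H0(t) * exp(x' beta)} under proportional hazards."""
    return float(np.exp(-H0_t * np.exp(x @ beta)))

# Hypothetical values: baseline cumulative hazard 0.2 at time t,
# two covariates with log hazard ratios 0.7 and -0.3.
s = cox_survival(0.2, np.array([1.0, 0.0]), np.array([0.7, -0.3]))
# s = exp(-0.2 * exp(0.7)), roughly 0.67
```

Note that $S(t \mid \boldsymbol{x}) = 1$ whenever $H_0(t) = 0$, consistent with equation (3).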

### 2.2 Random Survival Forests

Random forests are a machine learning ensemble method that combines the ideas of bootstrap aggregation and random selection of features. For each of the $B$ bootstrap samples, the method recursively splits the root node into interior and terminal nodes: (1) at each tree node a random subset of the predictor variables is selected; (2) among all binary splits on the predictor variables selected in (1), the best split is determined as the one that maximizes the survival difference between daughter nodes; and (3) steps (1) and (2) are repeated recursively until each terminal node contains no fewer than a prespecified minimum number of deaths. The method calculates the cumulative hazard function (or survival function) for each tree and then averages over the bootstrap samples to obtain an ensemble cumulative hazard function (or ensemble survival function).

Consider the $i$th individual in the $h$th terminal node, with survival time $t_{i,h}$, censoring indicator $\delta_{i,h}$ that takes 0 if the event is right censored and 1 otherwise, and vector of covariates $\boldsymbol{x}_{i,h}$. Further, consider the ordered distinct event and censoring times in the $h$th terminal node, $t_{(1),h} < t_{(2),h} < \cdots < t_{(K_h),h}$, where $t_{(k),h}$ denotes the $k$th order statistic among the $K_h$ distinct observation and censoring times, with $t_{(0),h} = 0$. Denoting by $d_{k,h}$ and $Y_{k,h}$ the number of deaths and the number of individuals at risk at time $t_{(k),h}$, respectively, the estimates of the cumulative hazard function (CHF) and the survival function for the node are given as

$$ \hat{H}_h(t) = \sum_{t_{(k),h} \le t} \frac{d_{k,h}}{Y_{k,h}}, \qquad \hat{S}_h(t) = \prod_{t_{(k),h} \le t} \left( 1 - \frac{d_{k,h}}{Y_{k,h}} \right). \qquad (4) $$

The bootstrap ensemble CHF takes the average over the $B$ survival trees such that

$$ \hat{H}_e(t \mid \boldsymbol{x}) = \frac{1}{B} \sum_{b=1}^{B} \hat{H}_b(t \mid \boldsymbol{x}), \qquad (5) $$

where $\hat{H}_b(t \mid \boldsymbol{x})$ indicates the CHF obtained from the tree grown on the $b$th bootstrap sample.

In a random forest, each of the $B$ trees is grown using an independent bootstrap sample from the set of training observations. On average, about one-third of the observations are not used to construct the tree for a particular bootstrap sample; these held-out observations are referred to as out-of-bag (OOB) observations and are used to predict the ensemble CHF (or ensemble survival function). The resulting OOB-predicted ensemble CHF (or ensemble survival function) serves as a valid test-set prediction from the random forest model.
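The node-level estimate in equation (4) and the ensemble average in equation (5) can be sketched in a few lines (plain NumPy, with hypothetical inputs; a real RSF implementation additionally handles tree growing and the OOB bookkeeping):

```python
import numpy as np

def nelson_aalen(times, events):
    """Node-level CHF estimate: H(t) = sum over t_(k) <= t of d_k / Y_k,
    returned as (distinct event time, cumulative hazard) pairs."""
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=int)
    chf, H = [], 0.0
    for tk in np.unique(times[events == 1]):
        d_k = np.sum((times == tk) & (events == 1))  # deaths at t_(k)
        Y_k = np.sum(times >= tk)                    # number at risk at t_(k)
        H += d_k / Y_k
        chf.append((tk, H))
    return chf

def ensemble_chf(tree_chf_values):
    """Ensemble CHF at a fixed (t, x): average of the B tree-wise CHFs."""
    return float(np.mean(tree_chf_values))

# Four subjects in one terminal node; the third observation is censored.
node_chf = nelson_aalen([1, 2, 3, 4], [1, 1, 0, 1])
```

Here `node_chf` steps up only at the distinct event times, and censored observations contribute only to the risk sets $Y_{k,h}$.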

### 2.3 Bayesian Additive Regression Trees

chipman2010bart proposed Bayesian additive regression trees (BART) as a nonparametric Bayesian method that uses a Bayesian sum-of-trees model. BART enables full posterior inference, including point and interval estimates of the unknown regression function. Inspired by ensemble methods, each tree in the model is constrained by a regularization prior to be a weak learner, and fitting and inference are accomplished via an iterative Bayesian back-fitting algorithm.

For the $i$th individual let $y_i$ denote an outcome and $\boldsymbol{x}_i$ denote the vector of covariates. First, regarding a single tree model, let $T$ denote a binary tree consisting of interior and terminal nodes. A branch decision rule at each interior node, based on a single component of the covariate vector, splits the covariate space into two regions $\{x \in S\}$ versus $\{x \notin S\}$, where $S$ is a subset of the range of $x$. Let $M = \{\mu_1, \ldots, \mu_b\}$ denote the set of functional values associated with the $b$ terminal nodes. For a given $T$ and $M$ we use $g(\boldsymbol{x}; T, M)$ to denote the function that assigns the terminal-node value $\mu_j \in M$ to $\boldsymbol{x}$. Thus,

$$ y_i = g(\boldsymbol{x}_i; T, M) + \epsilon_i, \qquad \epsilon_i \sim N(0, \sigma^2), \qquad (6) $$

is a single tree model. With this notation, the BART model can be expressed as

$$ y_i = \sum_{j=1}^{m} g(\boldsymbol{x}_i; T_j, M_j) + \epsilon_i, \qquad (7) $$

where $\epsilon_i \sim N(0, \sigma^2)$.

For the Bayesian specification of the sum-of-trees model we need to impose priors over the parameters of the BART model, namely $(T_1, M_1), \ldots, (T_m, M_m)$ and $\sigma$. The priors are carefully chosen to regularize the fit and to curtail strong individual tree effects. Thus the regularization prior is specified as

$$ p\big( (T_1, M_1), \ldots, (T_m, M_m), \sigma \big) = \left[ \prod_{j=1}^{m} p(M_j \mid T_j)\, p(T_j) \right] p(\sigma), \qquad (8) $$

where $p(M_j \mid T_j) = \prod_{k} p(\mu_{k,j} \mid T_j)$ for the terminal-node values $\mu_{k,j}$ of tree $j$. Following chipman2010bart, we specify $p(T_j)$ by three aspects: the probability that a node at depth $d$ is non-terminal is $\alpha (1 + d)^{-\beta}$, where $\alpha \in (0, 1)$ and $\beta \ge 0$; the choice of the splitting covariate at each interior node is uniformly distributed over the set of available covariates; and the choice of a branching rule, given a covariate at the interior node, also follows a uniform distribution over the discrete set of available splitting values. The default values as specified by chipman2010bart are $\alpha = 0.95$ and $\beta = 2$. For $p(\mu_{k,j} \mid T_j)$ we use the conjugate normal prior on the values of the terminal nodes. This prior has the effect of curbing strong individual tree effects so that every tree contributes only a small part of the entire sum-of-trees model. A prior on $\sigma$ would complete the Bayesian model in (8), but it is not required in the reformulation of the BART model for survival analysis described in subsection 2.3.1, as that model uses a probit regression with latent variables having unit variance.

#### 2.3.1 BART Model in Survival Analysis

Considering our previously described format for survival data, we now formulate event indicators $y_{ij}$ for each subject $i$ and each distinct time $t_{(j)}$, $j = 1, \ldots, n_i$, where $n_i$ constitutes the number of distinct event times observed before the follow-up of subject $i$ ends at $t_i$. Let us now define $p_{ij}$ as the probability of an event occurring at time $t_{(j)}$ for a subject at risk at that time. We then use the nonparametric probit regression structure to regress $y_{ij}$ on the time $t_{(j)}$ and the covariates $\boldsymbol{x}_i$, and then use truncated normal latent variables $z_{ij}$ to convert $y_{ij}$ to a continuous variate as required by the BART model for continuous outcomes as formulated in equation (7). Therefore,

$$ y_{ij} = \delta_i\, I(t_i = t_{(j)}), \qquad j = 1, \ldots, n_i, \qquad (9) $$

$$ p_{ij} = \Phi(\mu_{ij}), \qquad (10) $$

$$ \mu_{ij} = \mu_0 + f(t_{(j)}, \boldsymbol{x}_i), \qquad f \sim \text{BART}, \qquad (11) $$

and the truncated normal latent variables are drawn as

$$ z_{ij} \mid y_{ij} = 1 \;\sim\; N(\mu_{ij}, 1)\, I(z_{ij} > 0), \qquad (12) $$

$$ z_{ij} \mid y_{ij} = 0 \;\sim\; N(\mu_{ij}, 1)\, I(z_{ij} \le 0), \qquad (13) $$

so that, given the latent variables, the BART fit reduces to the continuous-outcome regression

$$ z_{ij} = \mu_0 + f(t_{(j)}, \boldsymbol{x}_i) + \epsilon_{ij}, \qquad \epsilon_{ij} \sim N(0, 1). \qquad (14) $$

We considered a fixed offset $\mu_0$ for centering the latent variables. The described model is estimated using the existing software for binary BART. The data pairs $(t_i, \delta_i)$ available for survival analysis are then used to create the binary response vector $y_{ij}$ for the survival BART model as described by sparapani2016nonparametric. The event times, constructed as described above, are treated as the first column of the covariate matrix, and the rest of the columns are constructed by duplicating the individual-level covariate values to match the repetition pattern of the first subscript of $y_{ij}$. Thus, using the BART model for binary data, we are able to obtain samples from the posterior distribution of $f$. Then, for any given covariate vector containing the time $t$ and the remaining covariates $\boldsymbol{x}$, we can calculate the posterior distribution of the event probability as

$$ p(t, \boldsymbol{x}) = \Phi\big( \mu_0 + f(t, \boldsymbol{x}) \big). \qquad (15) $$

Finally, our intended purpose lies in estimating the survival and hazard functions as the targets of inference. The survival and hazard functions at an event or censoring time $t_{(j)}$ can be estimated as

$$ S(t_{(j)} \mid \boldsymbol{x}) = \prod_{k \le j} \big( 1 - p(t_{(k)}, \boldsymbol{x}) \big), \qquad h(t_{(j)} \mid \boldsymbol{x}) = p(t_{(j)}, \boldsymbol{x}). \qquad (16) $$

These functions can only be calculated at the distinct survival times; however, using a constant hazard assumption, interpolation between these times can be accomplished.
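The discrete-time construction in equation (16) reduces, for a single posterior draw, to a running product over the distinct times. A minimal sketch (the event probabilities below are hypothetical placeholders; in practice they are posterior draws of $p(t_{(k)}, \boldsymbol{x})$ from the fitted binary BART model):

```python
import numpy as np

def survival_curve(event_probs):
    """S(t_(j) | x) = prod over k <= j of (1 - p(t_(k), x)):
    the discrete-time survival function from the event probabilities."""
    return np.cumprod(1.0 - np.asarray(event_probs, dtype=float))

# Hypothetical event probabilities at the first three distinct times.
S = survival_curve([0.10, 0.20, 0.05])
# S = [0.9, 0.72, 0.684]
```

Averaging such curves over the posterior draws gives the posterior mean survival curve at $\boldsymbol{x}$.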

##### Partial dependence survival function

For exploring the effects of individual covariates and their specific interactions on overall survival, we calculate marginal survival functions involving single covariates or a subset of covariates using formulas derived in chipman2010bart. For a subset of covariates $\boldsymbol{x}_s$ and the complementary covariates $\boldsymbol{x}_{ic}$ observed for the $i$th subject, the partial dependence function can be defined as

$$ f_s(\boldsymbol{x}_s) = \frac{1}{n} \sum_{i=1}^{n} f(\boldsymbol{x}_s, \boldsymbol{x}_{ic}), \qquad (17) $$

and the corresponding partial dependence survival function (sparapani2016nonparametric) can be written as

$$ S_s(t \mid \boldsymbol{x}_s) = \frac{1}{n} \sum_{i=1}^{n} S(t \mid \boldsymbol{x}_s, \boldsymbol{x}_{ic}). \qquad (18) $$

The means or medians over the posterior samples can then be treated as the required estimates.
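Equation (18) amounts to averaging predicted survival over the empirical distribution of the complementary covariates. A sketch, where `predict_survival` stands in for posterior prediction from a fitted model and the toy predictor is purely illustrative:

```python
import numpy as np

def partial_dependence_survival(predict_survival, t, xs_value, X_complement):
    """S_s(t | x_s): average of S(t | x_s, x_ic) over the n observed
    values x_ic of the complementary covariates."""
    return float(np.mean([predict_survival(t, xs_value, xic)
                          for xic in X_complement]))

# Toy stand-in predictor: survival decays with time, x_s and a scalar x_ic.
toy = lambda t, xs, xic: np.exp(-0.01 * t * (1 + xs + xic))
pd = partial_dependence_survival(toy, 60, 1.0, [0.0, 0.5, 1.0])
```

Repeating this for a grid of `xs_value` settings traces out the marginal survival curves plotted for, e.g., different tumor sizes.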

### 2.4 Variable selection in survival models

#### 2.4.1 Stepwise variable selection using Cox Regression

We select the variables using a backward stepwise variable selection approach (implemented using the selectCox function in the R package pec; mogensen2012evaluating). The variables in each step are selected using the Akaike information criterion (AIC) and then a Cox regression model is fit using the chosen predictors. The AIC is closely related to the logarithmic scoring rule, which is strictly proper (gneiting2007strictly), and thus can be used for identifying a prediction model. The estimates of the regression coefficients and the baseline hazard so obtained are then used to estimate the survival function in equation (3).

#### 2.4.2 Variable selection using Random survival forest

We start by calculating the OOB prediction error, which is obtained by subtracting the OOB c-index (subsection 2.5.3) from one, for each of the predictor variables $x$. For each $x$, this is achieved by dropping OOB cases down their in-bag survival tree and then assigning a daughter node randomly as soon as a split on that predictor variable is encountered. A variable importance measure is then defined as the difference between the new OOB prediction error and the original OOB prediction error. Predictor variables having a large variable importance measure are considered to have greater prognostic capacity, whereas variables with zero or negative variable importance can be dropped from the original model, as they add nothing to its predictive ability.

#### 2.4.3 Variable selection using BART

Variable selection in BART is carried out by selecting those variables which occur most often in the fitted sum-of-trees model. This method works better when the number of trees grown is small, as growing a large number of trees can give rise to an inappropriate mix of relevant and irrelevant predictors, leading to redundancy (chipman2010bart). Variable selection is thus accomplished by observing the individual predictor usage frequencies in a sequence of MCMC samples as the number of trees grown is made smaller and smaller. Predictors with a higher usage frequency in the MCMC samples are considered to have higher prognostic competence compared to the other predictors.

### 2.5 Performance assessment of the prediction models

An extremely important step in model building is assessing the model's prognostic competence. An important feature of this competence is discrimination, which refers to the ability of a predictive model to correctly classify subjects with respect to their actual outcomes. In order to compare the discriminating potential of the aforementioned risk prediction models, we utilize several existing methodologies: the concordance index (c-index), time-dependent Receiver Operating Characteristic (ROC) curves, the area under the ROC curve (AUC), the integrated area under the ROC curve (IAUC) and the Integrated Brier Score (IBS).

#### 2.5.1 Time dependent ROC curves

ROC curves are very popular in displaying the sensitivity and specificity of a diagnostic marker, say $x$, for a disease status variable, say $D$. The disease status variable in survival analysis is often a time dependent variable $D(u)$, where $D(u) = 1$ if an event has occurred prior to time $u$ and $D(u) = 0$ otherwise. Following the method proposed by heagerty2000time, let us consider data in the form $(t_i, \delta_i, m_i)$, where $t_i$ is the time to event, $\delta_i$ is the censoring indicator and $m_i$ forms the vector of risk set predictions. We define $D_i(u) = 1$ if $t_i \le u$ and $\delta_i = 1$, and $D_i(u) = 0$ if $t_i > u$, with $D_i(u) = 1$ indicating that the event has occurred prior to time $u$. At a given time $u$ and a cut-off value $c$, we can define

$$ \text{sensitivity}(c, u) = \Pr\big( m_i > c \mid D_i(u) = 1 \big), \qquad (19) $$

$$ \text{specificity}(c, u) = \Pr\big( m_i \le c \mid D_i(u) = 0 \big). \qquad (20) $$

The ROC curve at time $u$ plots sensitivity against one minus specificity over all cut-off values:

$$ \text{ROC}_u(p) = \text{sensitivity}(c_p, u), \quad \text{where } c_p \text{ satisfies } 1 - \text{specificity}(c_p, u) = p. \qquad (21) $$

This definition is often referred to as the "cumulative/dynamic" ROC curve in the literature. "Cumulative" means that all the events that occur before time $u$ are considered as cases. Other definitions of time dependent ROC curves can be found in heagerty2005survival.

The AUC statistic at time $u$ is defined as the area under the ROC curve at time $u$:

$$ \text{AUC}(u) = \int_0^1 \text{ROC}_u(p)\, dp. \qquad (22) $$

There are several available methods for estimating time dependent ROC curves, including Inverse Probability of Censoring Weights (IPCW) (uno2007evaluating), the conditional Kaplan-Meier estimator (heagerty2000time), the Nearest Neighbor Estimator (NNE) (heagerty2000time) and the recursive method (chambless2006estimation). For our analysis we have used the IPCW method of uno2007evaluating. Uno's estimators, based on the IPCW method, do not assume a specific working model for deriving the risk predictor. In addition to the non-informative censoring that the other methods assume, the IPCW method also assumes that censoring occurs independently of all the covariates.
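For intuition, the following toy version of equations (19)-(22) ignores censoring entirely (so no IPCW weights are needed; it is a sketch, not the estimator of uno2007evaluating): cases at time $u$ are subjects with $t_i \le u$, controls are those with $t_i > u$, and AUC$(u)$ is computed by pairwise comparison of their marker values.

```python
import numpy as np

def sens_spec(times, markers, u, c):
    """Cumulative/dynamic sensitivity and specificity at time u and
    cut-off c, assuming no censoring."""
    times = np.asarray(times, dtype=float)
    markers = np.asarray(markers, dtype=float)
    case = times <= u
    sens = np.mean(markers[case] > c)    # P(m > c | D(u) = 1)
    spec = np.mean(markers[~case] <= c)  # P(m <= c | D(u) = 0)
    return sens, spec

def auc_at_u(times, markers, u):
    """AUC(u) as the fraction of case/control pairs ranked correctly,
    counting ties as 1/2."""
    times = np.asarray(times, dtype=float)
    markers = np.asarray(markers, dtype=float)
    cases, controls = markers[times <= u], markers[times > u]
    wins = sum((m1 > m0) + 0.5 * (m1 == m0)
               for m1 in cases for m0 in controls)
    return wins / (len(cases) * len(controls))
```

With censored data, the IPCW estimator replaces these unweighted empirical proportions by proportions weighted with the inverse of the Kaplan-Meier estimate of the censoring distribution.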

#### 2.5.2 Integrated Brier Score

The Brier score (brier1950verification) is a quadratic score function calculating the squared differences between actual binary outcomes and their predictions. In the survival setting, let $I(t_i > u)$ be the observed status of subject $i$ at time $u$ and $\hat{S}(u \mid \boldsymbol{x}_i)$ be the predicted survival probability at time $u$ for subject $i$ with predictor variables $\boldsymbol{x}_i$. For censored survival outcomes, if an independent test dataset $\mathcal{D}$ is considered, then the expected Brier score (mogensen2012evaluating) is estimated by

$$ \widehat{\text{BS}}(u) = \frac{1}{M} \sum_{i \in \mathcal{D}} \left\{ \frac{\hat{S}(u \mid \boldsymbol{x}_i)^2\, I(t_i \le u,\, \delta_i = 1)}{\hat{G}(t_i)} + \frac{\big( 1 - \hat{S}(u \mid \boldsymbol{x}_i) \big)^2\, I(t_i > u)}{\hat{G}(u)} \right\}, \qquad (23) $$

where $\hat{G}$ is the Kaplan-Meier estimate of the censoring distribution, $M$ is the number of subjects in $\mathcal{D}$ and $\hat{S}$ is based on the training data. Finally, the Integrated Brier Score (mogensen2012evaluating) can be calculated as

$$ \widehat{\text{IBS}} = \frac{1}{\tau} \int_0^{\tau} \widehat{\text{BS}}(u)\, du, \qquad (24) $$

where $\tau$ can be set to any value smaller than the maximum follow-up time of the test sample.

The IBS ranges from 0 to 1; the smaller the score, the better the fit. The most important benchmark is the IBS calculated using the Kaplan-Meier estimate of survival.
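A simplified version of equations (23)-(24) for the uncensored case (all $\delta_i = 1$, so the IPCW weights $\hat{G}$ drop out; the predictions below are hypothetical) makes the structure explicit:

```python
import numpy as np

def brier_score(times, surv_pred, u):
    """BS(u) with no censoring: mean squared difference between the
    observed status I(t_i > u) and the predicted survival probability."""
    status = (np.asarray(times, dtype=float) > u).astype(float)
    return float(np.mean((status - np.asarray(surv_pred, dtype=float)) ** 2))

def integrated_brier_score(times, surv_pred_fn, tau, grid_size=101):
    """IBS = (1/tau) * integral of BS(u) over [0, tau], via the
    trapezoid rule on a regular grid; surv_pred_fn(u) returns the
    vector of predicted survival probabilities at time u."""
    grid = np.linspace(0.0, tau, grid_size)
    bs = np.array([brier_score(times, surv_pred_fn(u), u) for u in grid])
    du = grid[1] - grid[0]
    return float(np.sum((bs[:-1] + bs[1:]) * du / 2.0)) / tau
```

A perfectly calibrated, perfectly discriminating predictor attains BS$(u) = 0$, while the uninformative constant prediction 0.5 gives BS$(u) = 0.25$ at every $u$.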

#### 2.5.3 Concordance Index

The concordance index (c-index) of harrell1982evaluating and harrell1996tutorial can be interpreted as a global discrimination index for understanding the predictive competence of a survival model. In calculating the c-index, we consider all possible pairs of patients $(i, j)$ with survival times $(t_i, t_j)$ and censoring indicators $(\delta_i, \delta_j)$. We delete pairs for which $t_i < t_j$ and $\delta_i = 0$, or $t_i > t_j$ and $\delta_j = 0$, or $t_i = t_j$ and both $\delta_i = 0$ and $\delta_j = 0$. The remaining pairs are referred to as the utilizable pairs. For each utilizable pair with $t_i \ne t_j$, we count 1 if the predicted outcome is worse for the patient with the shorter observed survival time and count 0.5 if the predicted outcomes are tied. For each utilizable tied-time pair with both $\delta_i = 1$ and $\delta_j = 1$, we count 1 if the predicted outcomes are tied and 0.5 otherwise. For each utilizable tied-time pair where not both $\delta_i = 1$ and $\delta_j = 1$, we count 1 if the patient who died has the worse predicted outcome and 0.5 otherwise. We sum the counts over all utilizable pairs to arrive at the total number of concordant pairs. The c-index is then calculated as the proportion of all utilizable patient pairs that are concordant. Thus the c-index estimates the probability of agreement between the observed and predicted survival outcomes. The closer the value of the c-index is to 1, the better the fit.
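The counting rules above translate directly into code. In this sketch the predictions are risk scores, so "worse predicted outcome" means a higher score (an unoptimized $O(n^2)$ implementation, for illustration only):

```python
def c_index(times, events, risk):
    """Harrell's concordance index for risk scores: higher risk should
    pair with shorter survival.  events[i] = 1 for death, 0 for censoring."""
    n = len(times)
    concordant, utilizable = 0.0, 0
    for i in range(n):
        for j in range(i + 1, n):
            ti, tj, di, dj = times[i], times[j], events[i], events[j]
            ri, rj = risk[i], risk[j]
            if ti == tj:
                if di == 1 and dj == 1:      # tied times, both events
                    utilizable += 1
                    concordant += 1.0 if ri == rj else 0.5
                elif di == 1 or dj == 1:     # tied times, exactly one event
                    utilizable += 1
                    dead_r, alive_r = (ri, rj) if di == 1 else (rj, ri)
                    concordant += 1.0 if dead_r > alive_r else 0.5
                # tied times, both censored: not utilizable
            else:
                # order the pair so that subject `a` has the shorter time
                (ta, da, ra), (tb, db, rb) = sorted(
                    [(ti, di, ri), (tj, dj, rj)])
                if da == 0:
                    continue                 # shorter time censored: skip
                utilizable += 1
                if ra > rb:
                    concordant += 1.0
                elif ra == rb:
                    concordant += 0.5
    return concordant / utilizable
```

A perfect risk ranking gives 1.0, a perfectly reversed ranking gives 0.0, and a random score gives about 0.5.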

## 3 Simulation Studies

Semiparametric methods such as the CPH model, like parametric survival models, aim to model a particular functional relationship between the covariates and the survival outcome. BART and RSF, however, offer a more flexible approach allowing nonparametric functional relationships. In this section, we study and compare the BART, RSF and CPH models via the simulation models designed in sparapani2016nonparametric.

Two simulation settings are used, one having proportional hazards and the other having non-proportional hazards. It is presumed that the non-proportional hazards model should pose significant challenges to the semiparametric CPH model. Nine independent binary covariates are generated from the Bernoulli distribution with probability 0.5. They are then related to the Weibull event time $t$ using the survival function

$$ S(t \mid \boldsymbol{x}) = \exp\!\left\{ -\big( t / \lambda(\boldsymbol{x}) \big)^{\alpha(\boldsymbol{x})} \right\}, \qquad (25) $$

with different scale $\lambda(\boldsymbol{x})$ and shape $\alpha(\boldsymbol{x})$ parameters for the proportional and non-proportional hazards models.

For the proportional hazards model, the shape parameter is held constant in $\boldsymbol{x}$ while the scale depends on the covariates, so that the covariates act multiplicatively on the hazard; for the non-proportional hazards model, the shape parameter also depends on the covariates. The specific parameter settings follow sparapani2016nonparametric.

Censoring times were generated independently from an exponential distribution with parameters selected to induce 20% censoring. Samples of size 300 were considered and 100 datasets were generated for each of the models. Each dataset was divided in a 2:1 ratio into training and evaluation sets. Performances of the CPH, RSF and BART methods were compared based on measures of accuracy and bias, derived from the holdout test set of 100 samples, for both model settings. The test root mean square error (RMSE) and bias were calculated for the CPH, RSF and BART survival prediction estimates, and the expected value of the simulated estimates was then used as a measure of performance. Figures 1(a) and 2(a) show box plots of test set bias and RMSE for the proportional hazards model, and Figures 1(b) and 2(b) show box plots of test set bias and RMSE for the non-proportional hazards model, measured at the 25th, 50th and 75th percentiles of the overall survival distribution. It is clear from these plots that the BART method performs closely to the CPH and RSF models in the proportional hazards case; however, the nonparametric BART and RSF methods perform significantly better than the semiparametric CPH model in the non-proportional hazards scenario.
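The data-generating step can be sketched by inverse-transform sampling from the Weibull survival function in equation (25): with $U \sim \text{Uniform}(0,1)$, $t = \lambda(\boldsymbol{x})\, (-\log U)^{1/\alpha(\boldsymbol{x})}$. The specific scale/shape functions of the nine binary covariates follow sparapani2016nonparametric and are not reproduced here; `scale_fn`, `shape_fn` and the censoring rate below are illustrative placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_dataset(n, scale_fn, shape_fn, censor_rate):
    """Simulate (t, delta, X): nine Bernoulli(0.5) covariates, Weibull
    event times via inverse transform, exponential censoring times."""
    X = rng.binomial(1, 0.5, size=(n, 9))
    lam = np.array([scale_fn(x) for x in X])    # scale lambda(x)
    alpha = np.array([shape_fn(x) for x in X])  # shape alpha(x)
    u = rng.uniform(size=n)
    event_t = lam * (-np.log(u)) ** (1.0 / alpha)
    censor_t = rng.exponential(scale=1.0 / censor_rate, size=n)
    t = np.minimum(event_t, censor_t)           # observed time
    delta = (event_t <= censor_t).astype(int)   # 1 = event, 0 = censored
    return t, delta, X

# Illustrative (not the paper's) settings: a constant shape gives
# proportional hazards; a covariate-dependent shape would not.
t, delta, X = simulate_dataset(
    300, scale_fn=lambda x: np.exp(0.1 * x.sum()),
    shape_fn=lambda x: 2.0, censor_rate=0.05)
```

In practice the censoring rate parameter would be tuned until roughly 20% of the generated `delta` values are zero.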

## 4 Application to breast cancer dataset

In this section, we apply the CPH, RSF and BART models to a random sample of 1500 female patients aged 24 to 90 having invasive ductal carcinoma, obtained from the U.S. SEER database for the year 2005. Samples with missing data were not incorporated in the study, to facilitate the demonstration of the methods. Ten covariates were considered in the analysis, namely age (in years), race (White, Black, other), disease stage (in-situ, localized, regional, or distant), tumor grade (well differentiated, moderately differentiated, poorly differentiated, or undifferentiated), tumor size (in cm), estrogen receptor status (positive, negative or borderline), progesterone receptor status (positive, negative or borderline), radiotherapy (received or denied), surgery (received or denied) and the number of lymph nodes. All the covariates selected for the analysis are important in breast cancer studies. Our response variable was disease-specific survival (in months) based on the SEER cause-of-death code. Death from other causes was treated as censoring (non-informative censoring), and the censoring times were assumed to be independent of the failure times. For evaluating the performance of the three models, the dataset was split randomly in the ratio 2:1 into a training set and a validation set. Of the 1500 patient cases, 1081 patients were white, 298 black, and the rest were of other origins. A total of 1191 deaths occurred in the cohort of 1500 patients. The number of survival months (our outcome of study) ranges from 1 to 106 months. The mean follow-up time was 34.75 months and the median follow-up was 30 months. Most of the tumors were staged regional (38.9%) and graded poorly differentiated (56.5%). The mean age was 60.33 years with an SD of 14.59 years. The median tumor size was 29 mm with an IQR of 33 mm. Estrogen receptor status was positive for 65.4% of the tumors, and 43.7% of the patients received both surgery and radiation.

On applying the CPH model with the backward variable selection mechanism described in subsection 2.4.1, race, age at diagnosis, tumor grade, tumor size, tumor stage, radiation therapy, surgery and estrogen receptor status are chosen as the most important variables. However, the accuracy with which the model estimates the hazard ratios depends on the proportional hazards assumption, which was violated by our dataset, thereby indicating the need for a more general model structure. We applied RSF models using logrank splitting and logrankscore splitting, each based on 1000 trees and ranking the covariates by OOB variable importance as described in subsection 2.4.2. The five most important covariates under both RSF approaches are surgery, tumor size, tumor stage, tumor grade and ER status, with slightly different rankings. The bottom five covariates based on importance values are ranked similarly by both RSF models. The top five predictors having maximum variable importance values were also selected by the Cox model. However, the predictors race, age at diagnosis and radiation therapy, selected by the Cox model, have very low variable importance values under both RSF models and are therefore considered unimportant for prediction purposes.

The BART survival model was fit to the training dataset with 50 trees in the sum and the default prior, using a burn-in of 5000 draws and a chain of 10,000 post-burn-in draws from the posterior distribution, for estimating the survival function given the predictors. For convergence checks, we generated several chains with different initial values and found comparable results. We obtained the partial dependence survival functions using equation (18) for particular subcategories of predictors. These functions can be interpreted as marginal survival functions for a single predictor level, averaged across the distribution of the remaining predictors. In Figure 3 we plot the partial dependence functions for four different tumor sizes and five different ages. From the right plot in Figure 3 it can be seen that survival probability drops rapidly with an increase in tumor size. For example, the five-year survival probability for a patient with tumor size 20 mm is 0.93, compared to 0.71 for a patient with tumor size 120 mm. From the left plot of Figure 3 it can be seen that the five-year survival probability for a 50-year-old breast cancer patient is 0.87, compared to 0.80 for a 70-year-old patient.

The BART survival model can also be used to study the effect of interactions between covariates on the survival outcome. The left plot in Figure 4 studies the effect of the interaction between estrogen and progesterone receptor status on the survival probability. It can be seen from the plot that estrogen receptor (ER) and progesterone receptor (PR) positive breast cancer patients have a higher survival probability compared to ER and PR negative patients. Since the age and stage variables were selected by both the CPH and RSF models, we wanted to check whether there exists any interaction between them. There is no evidence of interaction, as the right plot in Figure 4 shows nearly parallel patterns, while there may be an indication of a nonlinear relationship between median survival probability and age, especially around age 70.

BART can also be conveniently used to draw inference on various aspects of the survival distribution (obtained by regressing on all or a subset of covariates) directly from the posterior samples. As another illustration of exploring significant interactions between the covariates, we examined the difference in the partial dependence survival function at five years between patients having a Stage 1 tumor and a Stage 4 tumor, separately by tumor size, age, tumor grade, surgery status, radiation status and estrogen receptor status. These variables were selected as they have been considered important by both the CPH and the RSF models. The results obtained are shown as a forest plot in Figure 5. One result indicated by the plot would be that surgery decreases the difference in five-year survival between the disease stages, although the magnitude of the effect varies slightly.

Finally, we carried out variable selection as described in subsection 2.4.3 by examining the average frequency per splitting rule for all 11 predictor variables (time post diagnosis plus the 10 predictors), with the model run using different numbers of trees. As can be seen from Figure 6, the covariate time is naturally the most selected covariate across the different numbers of trees. Besides time, the model identifies stage, tumor size, age, ER status and surgery as the five most important covariates impacting overall survival.

We repeated the cross-validation procedure 20 times, randomly splitting the dataset into training and test sets in a 2:1 ratio while keeping the censoring rates the same in both sets. We then built the predictor on the training set and applied it to the test set to assess the performance of the competing methods.
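A split that preserves the censoring rate amounts to stratifying on the censoring indicator. A minimal sketch (the function name and record layout are illustrative, not from the original analysis):

```python
import random

def split_preserving_censoring(data, seed=0):
    """Split records 2:1 into train/test, stratified on the censoring
    indicator so both sets keep the same censoring rate.
    Each record is (time, event), with event=1 for an observed death."""
    rng = random.Random(seed)
    train, test = [], []
    for flag in (0, 1):  # censored records first, then observed events
        group = [r for r in data if r[1] == flag]
        rng.shuffle(group)
        cut = round(len(group) * 2 / 3)
        train.extend(group[:cut])
        test.extend(group[cut:])
    return train, test

# Toy dataset: 6 censored and 6 observed event times (50% censoring).
data = [(t, 0) for t in range(6)] + [(t, 1) for t in range(6, 12)]
train, test = split_preserving_censoring(data)
```

Because each censoring stratum is split 2:1 separately, both the training and test sets inherit the 50% censoring rate of the toy data.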

The performance of the RSF model using logrank splitting was marginally better than that of its counterparts (Table 2). The c-index values for all the models were different from 0.5, indicating that each model tends to predict higher probabilities of survival for higher observed survival times. The time-dependent ROC curves in Figure 7 show that over the first 12 and 24 months of follow-up the BART model has the highest AUC, whereas at 36 and 48 months the RSF model with logrank splitting has the highest AUC.
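The c-index used here is Harrell's concordance index. A simplified sketch (it counts only pairs where the subject with the shorter time had an observed event, and ignores tied event times):

```python
def c_index(times, events, scores):
    """Harrell's concordance index for right-censored data (sketch).
    `scores` are model risk scores; a pair (i, j) is comparable when
    subject i has the shorter time and an observed event."""
    concordant = comparable = 0.0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if times[i] < times[j] and events[i] == 1:
                comparable += 1
                if scores[i] > scores[j]:      # higher risk, shorter time
                    concordant += 1
                elif scores[i] == scores[j]:   # tied scores count half
                    concordant += 0.5
    return concordant / comparable

# Perfect ranking: risk score decreases as survival time increases.
times = [2, 4, 6, 8]
events = [1, 1, 0, 1]
scores = [0.9, 0.7, 0.5, 0.2]
```

With this toy input every comparable pair is concordant, so the c-index is 1; a value of 0.5 corresponds to random ranking.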

The estimated AUC value for the RSF model with logrank splitting tends to decline over time to 0.785 (Table 1). The estimated AUC values thus suggest good short-term discriminatory potential of the model score. Estimates of AUC(t) also become increasingly variable over time because of the diminishing size of the risk set. Using a follow-up of 106 months yields an IAUC estimate of 0.839 for the RSF model with logrank splitting (Table 2).

Table 1: Estimated AUC(t) at selected follow-up times.

| Model | 12 months | 24 months | 36 months | 48 months |
|---|---|---|---|---|
| CPH | 0.766 | 0.773 | 0.806 | 0.810 |
| BART | 0.839 | 0.789 | 0.811 | 0.813 |
| RF | 0.807 | 0.785 | 0.821 | 0.822 |
| RF-LS | 0.782 | 0.771 | 0.813 | 0.816 |

This implies that, conditional on an event occurring within 106 months, the probability that the model score is larger for the subject with the smaller event time is 83.9%. The integrated Brier score between 0 and 106 months on the test set is lowest for the random survival forest with logrank score splitting. The CPH, BART and RSF with logrank splitting models have approximately the same IBS values (Table 2).
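The Brier score at time t is the mean squared difference between the predicted survival probability S(t) and the observed survival status, and the IBS averages it over the follow-up window. The sketch below is simplified: it drops subjects censored before t instead of applying the inverse-probability-of-censoring weighting used by the actual IBS, and it averages over a plain time grid; all data are toy values.

```python
def brier_score(t, times, events, surv_probs):
    """Mean squared error between predicted S(t) and observed status.
    Subjects censored before t are skipped in this simplified version."""
    total = n = 0
    for time, event, s in zip(times, events, surv_probs):
        if time <= t and event == 1:   # died by t: observed status is 0
            total += (0 - s) ** 2
            n += 1
        elif time > t:                 # still at risk at t: status is 1
            total += (1 - s) ** 2
            n += 1
    return total / n

def integrated_brier(grid, times, events, surv_fn):
    """Average the Brier score over a grid of time points."""
    scores = [brier_score(t, times, events, surv_fn(t)) for t in grid]
    return sum(scores) / len(scores)

# Toy survival data and a crude shared survival curve S(t).
times = [10, 30, 50, 90]
events = [1, 0, 1, 1]
surv_fn = lambda t: [0.9 - 0.005 * t] * len(times)
ibs = integrated_brier([20, 40, 60], times, events, surv_fn)
```

Lower values are better; the Kaplan-Meier IBS serves as the no-covariate baseline against which the models are judged.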

Table 2: c-index, IAUC and IBS for the training and test sets.

| Model | c-index (Train) | c-index (Test) | IAUC (Train) | IAUC (Test) | IBS (Train) | IBS (Test) |
|---|---|---|---|---|---|---|
| CPH | 0.730 | 0.722 | 0.850 | 0.839 | 0.112 | 0.119 |
| BART | 0.761 | 0.702 | 0.895 | 0.876 | 0.063 | 0.115 |
| RF | 0.856 | 0.731 | 0.893 | 0.852 | 0.065 | 0.113 |
| RF-LS | 0.825 | 0.726 | 0.889 | 0.849 | 0.065 | 0.112 |

All three models perform substantially better than the Kaplan-Meier estimator, which has an IBS estimate of 0.150. Based on these evaluation measures, the BART method improves survival prediction accuracy in some cases and performs comparably to the other two methods elsewhere. This improvement is plausibly due to BART's ability to naturally account for additive and nonlinear effects.

## 5 Conclusion

This paper focuses on the comparison of BART, CPH and RSF models for analyzing survival data. It reviews the three modeling approaches and compares them in terms of interpretability, prediction accuracy and variable selection methods. Simulation studies are performed to compare the ensemble algorithms with the CPH model in regression scenarios. The three models perform similarly when the proportional hazards assumption is met; however, the performance of the CPH model deteriorates relative to the other two models when the assumption is violated.

We then apply the three models to analyze a real-life breast cancer dataset. The covariates selected for our analysis of the breast cancer survival times fail to follow the proportional hazards assumption of the CPH model, so we apply RSF and BART to the dataset to deal with its complex structure and attain increased accuracy in predicting survival times. We chose BART because of its flexibility to accommodate high-dimensional datasets and account for nonlinearity and interactions among the covariates. Additionally, working under a Bayesian paradigm allowed for natural quantification of uncertainty, which helped in constructing credible and prediction intervals. Thus, regressing on selected predictors, we could estimate the median survival time and credible intervals for a given patient using the posterior distributions of the process parameters obtained from the BART model. Alternatively, survival curves with confidence bounds for the population could be plotted using all or a subset of covariates.

The RSF model was chosen as a competitor to the BART model because, like BART, it is a tree-structured black-box model with high prediction accuracy and an efficient variable selection mechanism. Being black-box models, RSF and BART lack interpretative capacity: they cannot directly quantify the risks presented by individual covariates to the overall hazard in the way the CPH model does through hazard ratios. However, the partial dependence survival functions do give an idea of how the covariates, individually and jointly, affect the overall survival risk. The performance of the three models was also compared using several assessment measures; BART's and RSF's comparable values of c-index, IAUC and IBS further validate BART's predictive ability. BART's lack of interpretability relative to the CPH model can thus be counterbalanced by its gains in prediction accuracy and its ability to incorporate complex interaction effects among the covariates.

Our primary motivation in using the breast cancer dataset was to identify a statistical model that predicts overall survival effectively from a set of covariates. We also wanted to understand the impact of these clinical covariates on the survival of breast cancer patients, which was carried out successfully by the variable selection methods of BART. Using the BART model, we discovered important associations between long-term survival and tumor stage, tumor size, ER status and surgery status. The RSF model additionally considered tumor grade important, while age at diagnosis was considered an important predictor by the CPH model. Many studies have estimated the importance of breast cancer risk factors using CPH and machine learning models. In line with our findings for the CPH model, rosenberg2005effect concluded that tumor size, tumor grade and race all have significant constant effects on disease-specific survival in breast cancer, while age at diagnosis and disease stage have significant effects that do not follow the proportional hazards assumption. Similar to our results for the RSF and BART models, d2001prognostic reported tumor size and tumor grade as the most informative medical factors using an RSF model, and delen2005predicting affirmed the effectiveness of tumor size and tumor stage through a sensitivity analysis of an artificial neural network. In contrast with our findings, omurlu2009comparisons reported the importance of PR status and the number of lymph nodes using CPH and RSF models.

One serious disadvantage of BART in comparison with RSF was its computational time. BART is highly computationally demanding because the model requires expanding the data over a grid of event times, a problem that is aggravated for large datasets. The authors mention parallel processing and time-scale coarsening as possible remedies. Both the BART and RSF models are available as the R packages survbart and randomForestSRC, respectively, and the function implementing the RSF algorithm is considerably faster.