# Bayesian Variable Selection in High Dimensional Survival Time Cancer Genomic Datasets using Nonlocal Priors

###### Abstract

Variable selection in high dimensional cancer genomic studies has become very popular in the past decade, due to the interest in discovering significant genes pertinent to a specific cancer type. Censored survival data is the main data structure in such studies and performing variable selection for such data type requires certain methodology. With recent developments in computational power, Bayesian methods have become more attractive in the context of variable selection. In this article we introduce a new Bayesian variable selection approach that exploits a mixture of a point mass at zero and an inverse moment prior which is a non-local prior density on the Cox proportional hazard model coefficients. Our method utilizes parallel computing structure and takes advantage of a stochastic search based method to explore the model space and to circumvent the computationally expensive MCMC procedure. It then reports highest posterior probability model, median probability model and posterior inclusion probability for each covariate in the design matrix. Bayesian model averaging is also exploited for predictive power measurements. The proposed algorithm provides improved performance in identifying true models by reducing estimation and prediction error in simulation studies as well as real genomic datasets. This algorithm is implemented in an R package named BVSNLP.

## 1 Introduction

Recent developments in gene sequencing technology have made it easier to generate massive genomic data that can be used to make new discoveries in genomics. The outcomes for most cancer studies are survival times for subjects, and the goal is to investigate the relation or any potential association between survival times and the covariates in the model; namely, genes in this context.

Survival times for each subject represent either the time to death or disease progression, or the time to study termination or the time until the subject is lost to follow up. In the latter cases, the subject’s survival time is censored. The relation between survival times and covariates is modeled through the conditional hazard function, which is the probability of death in the interval when becomes really small, given the covariates in the study. A more precise definition is

(1) |

Here, is the design matrix with observations and covariates. Proportional hazard models are of the form

(2) |

with identifiability constraint of . In this formula, denotes the baseline hazard function. One special case of such model is the one proposed by Cox (1972) where . Consequently the hazard function in Cox proportional hazard model is obtained by

(3) |

Here, is vector of coefficients. In this model, estimating does not depend on . Moreover, this term is canceled out in comparing marginal probabilities of data under different models in the Bayesian variable selection procedure. For general survival analysis, however, the baseline hazard function is necessary for predicting survival times and can be estimated non-parametrically. For more information, interested readers can consult Cox and Oakes (1984); Kalbfleisch and Prentice (1980).

Gene expression datasets are usually in ultrahigh dimensions where thousands of genes are examined for only hundreds of subjects. However, only a limited number of genes contribute significantly to the outcome. In other words, most of the elements in the vector are zero. This is the sparsity assumption imposed in variable selection problems. The primary target is then to find covariates with non-zero coefficients or, equivalently, those genes that contribute the most in determining the survival outcome.

Many of common penalized likelihood methods originally introduced for linear regression have been extended to survival data as well. These include Tibshirani et al. (1997) where the LASSO penalty is imposed on the coefficients in survival analysis, similar to the linear regression problem. Zhang and Lu (2007) utilized adaptive LASSO methodology for time to event data while Antoniadis et al. (2010) adopted the Dantzig selector for survival outcome. The extension of non-convex penalized likelihood approaches, in particular SCAD, to the Cox proportional hazard model is discussed in Fan and Li (2002). The ISIS approach is also extended for ultrahigh dimensional survival data in Fan et al. (2010) where it is used on Cox proportional hazard models and the SCAD penalty is employed for variable selection.

Some Bayesian approaches have also been proposed to address this problem. Faraggi and Simon (1998) proposed a method based on approximating the posterior distribution of the parameters in the proportional hazard model where they define a Gaussian prior on a vector of coefficients. A loss function is then defined in order to select a parsimonious model. A semi-parametric Bayesian approach is utilized by Ibrahim et al. (1999), where they employ a discrete gamma process for the baseline hazard function and a multivariate Gaussian prior for the coefficient vector. Sha et al. (2006) considered Accelerated Failure Time (AFT) models along with data augmentation to impute censored times. A mixture prior in a similar fashion to George and McCulloch (1997) is exploited for sparse selection procedure.

Due to the huge computational load of Bayesian data analysis imposed by Monte Carlo Markov Chain (MCMC) procedure, in particular for high dimensional survival data, the frequentist approaches outnumber their Bayesian counterparts in real genomic applications. Consequently, developing a fairly fast Bayesian variable selection method for high dimensional datasets that can outperform dominant frequentist algorithms seems compelling.

All of the aforementioned Bayesian methods use local priors for model coefficients. In this article, as an extension to the previous work for logistic regression (Nikooienejad et al., 2016), we propose a Bayesian method based on a mixture prior of a point mass at zero and a nonlocal prior. In particular, inverse moment prior for elements of in (3) and find the model with highest posterior probability. The computationally burdensome MCMC process is avoided by adapting a stochastic search based method, S5 (Shin et al., 2015). A general algorithm is also provided to set the tuning parameter of the nonlocal prior.

This article is structured as follows. In Section 2 we introduce notation, review preliminary points, and discuss the methodology behind our proposed method. Section 3 provides simulation and real data analyses to illustrate the performance of the proposed method and compares it to several competing methods. Finally Section 4 concludes this article.

## 2 Methods

### 2.1 Preliminaries

Let denote the survival and denote the censoring times for individual . Each element in the observed vector of survival times, , is defined as . The status for each individual is defined as . The status vector is represented by . We assume the censoring mechanism is at random, meaning that and are conditionally independent given , where are the covariates for individual , and comprise the row of . The observed data is of the form .

Model is defined as where and it is assumed that and all other elements of are 0. The design matrix corresponding to model is denoted by , and the regression vector by.

Let represent the risk set at time , the set of all individuals who are still present in the study at time and are neither dead nor censored. We also assume throughout the article that the failure times are distinct. In other words, only one individual fails at a specific failure time. With this assumption and letting , the partial likelihood (Cox, 1972) for in model can be written as

(4) |

Our method uses this partial likelihood as the sampling distribution in our Bayesian model selection scheme. This is a point of discussion as there might be some information loss in (4) with respect to . For instance, Basu (Ghosh, 1988) argues that partial likelihoods can not usually be interpreted as sampling distributions. On the other hand, Berger et al. (1999) encourage the use of partial likelihoods when the nuisance parameters are marginalized out. We chose to test this idea and use the partial likelihood in (4) as if it was the sampling distribution for observed survival times.

Sorting the observed unique survival times in ascending order and consequently re-ordering the status vector as well as the design matrix with respect to the ordered , the sampling distribution of for model can be written as

(5) |

A Bayesian hierarchical model can be defined in which in (5) can be used as the sampling distribution, is the prior of model coefficients, and is the prior for model . Using the Bayes rule, the posterior probability for model is written as

(6) |

where is the set of all possible models and the marginal probability of the data under model is computed by

(7) |

The prior density for and the prior on model space impact the overall performance of the selection procedure and the amount of sparsity imposed on candidate models. Note that the sampling distribution in (5) is continuous in , and we define an inverse moment prior (Johnson and Rossell, 2010) on each of the coefficients in model .

### 2.2 Prior on Model Space

Inclusion of covariates in model can be modeled as an exchangeable Bernoulli trials. Let be the binary vector showing which covariates are included in model . The size of the model is the number of nonzero elements in . These nonzero indices represent the index of nonzero elements in the coefficient vector, . Assume the model size is and the success probability for the Bernoulli trial is for every . As discussed in Scott et al. (2010), no fixed value for independent of adjusts for multiplicity. As a result, it is necessary to define a prior on . The resulting marginal probability for model in a fully Bayesian approach is then

(8) |

A common choice for is the beta distribution, , where in the special case of , is a uniform distribution. The marginal probability for model derived from (8) is then equal to

(9) |

where is the Beta function. In case of choosing and , the mean and variance of the selected model size, , is

(10) |

To incorporate the belief that the optimal predictive models are sparse, our suggestion is to choose and . By this structure, comparatively small prior probabilities are assigned to models that contain many covariates. However, in some situations, depending on the penalization provided by the prior on coefficients and the nature of the problem, the uniform-binomial might be the preferred choice

### 2.3 Product Inverse MOMent (piMOM) Prior

In our method, the form of the nonlocal prior densities imposed on the non-zero coefficients, , take the form of a product of independent iMOM priors, or piMOM densities (Johnson and Rossell, 2010), expressible as

(11) |

Here is a vector of coefficients of length , and . The hyperparameter represents a scale parameter that determines the dispersion of the prior around , while is similar to the shape parameter in Inverse Gamma distribution and determines the tail behavior of the density. Unlike most penalized likelihood methods, large values of regression coefficients are not penalized by this prior. As a result, it does not necessarily impose significant penalties on non-sparse models provided that the estimated coefficients in those models are not too small.

Figure 1 depicts an example of iMOM prior for and . It shows the heavy tail of iMOM and its behavior around zero.

For selecting hyperparameters of the piMOM prior, we adopt the algorithm in Nikooienejad et al. (2016), where the null distribution of the maximum likelihood estimator for (when all components of are 0), obtained from a randomly selected design matrix , is compared to the prior density on under the alternative assumption that the components are non-zero. We then choose a that makes the overlap between two densities less than a specified threshold, namely .

To generate the response vector under the null model the following procedure is performed. Under the null model, the survival times are sampled using the methodology proposed in Bender et al. (2005) under Cox-exponential distribution models, when all components of are zero. As a result, for each individual, the sampled survival time under the null is computed as

(12) |

In this formulation, is the baseline hazard function, , which we assume to be in our analysis, and is the uniform distribution between and . I define the event rate to be the proportion of subjects that have . This can be estimated from observed data by taking the average of the event status vector, . Defining censoring rate as one minus the even rate, The estimated censoring rate can be obtained by

(13) |

where and are the estimated event and censoring rate, respectively.

Let and be the vector of sampled survival times, and censoring times, respectively. The censoring times, are obtained independently by sampling from an exponential distribution with rate . The rate is computed from the assumed rate for exponential distribution in sampling survival times in (12), , and the estimated censoring rate in observations, . More precisely, is set so the censoring rate is equal to the observed censoring rate, . The details of this calculation are described by the following equation:

(14) |

Thus, letting , the rate is computed as

(15) |

which leads us to obtaining censoring times vector, . The sampled survival time and status for each observation is then computed as

(16) |

which comprise and under the null model.

Using the pair , the MLE from Cox model is computed as part of the hyperparameter selection algorithm. It should be noted that the distribution of the MLE for the Cox model under the null hypothesis is , where is the information matrix of the partial likelihood function. Thus, it is appropriate to approximate the pooled estimated coefficients in that algorithm with a normal density function. When the sample size gets really large, the variance of the MLE decreases and causes the overlap to become really small and consequently small values of are selected.

In general, we found that and are good default values if one chooses not to run the hyperparameter selection algorithm. Those values are based on various simulation results that show reasonable behavior of the prior to cover small and fairly large coefficients. When , the peaks of the iMOM prior occurs at and . By equating to the absolute value of the most common effect size for that application, it provides insight on what default value of would be for different applications. Nikooienejad et al. (2016) discuss the benefit of exploiting this algorithm for nonlocal priors compared to local ones.

### 2.4 Computing Posterior Probability of Models

Computing the posterior probability for each model requires the marginal probability of observed survival times under each model as shown in (6), (7). The marginal probability is approximated by using the Laplace approximation where the regression coefficients in are integrated out. This leads to

Here, is the maximum a posteriori (MAP) estimate of , is the Hessian of the negative of the log posterior function, , computed at and is the size of model . Finding the MAP of is equivalent to finding the minimum of function.

#### 2.4.1 Calculating the Gradient and Hessian of

Let and . For an matrix , let denote the vector corresponding to the column of and denote the vector corresponding to the row of . Also let , where is the sub-matrix of from row to the last row where all columns are included. This makes the dimension of to be . Similarly, for a vector of size , let denote the sub-vector of from element to the last one, a vector of size .

Let and . Also let denote the column vector . The logarithm of in (5) can then be expressed as

(17) |

For each design matrix and vector, define a new matrix , where its column is obtained by

(18) |

Here, and are obtained as discussed before. This is repeated for every to obtain the matrix .

As a result, the negative gradient of cn be written as

(19) |

To compute the Hessian matrix, let be the element in row and column of matrix defined in (18). In what follows, the identity matrix is denoted by and denotes a diagonal matrix with the elements of the vector on its diagonal. Finally, let , the column of .

The row of the Hessian matrix of is derived as

(20) |

The matrix itself is constructed row by row and its row is computed by

(21) |

The gradient and Hessian of the logarithm of piMOM prior is more straightforward. That is,

(22) |

while the Hessian of is a diagonal matrix, , where

(23) |

Consequently, the gradient and Hessian matrix of is obtained using equations (19) to (23). These expressions are then used in finding the MAP, as well as computing the Laplace approximation to the marginal probability of .

We use limited memory version of Broyden-Fletcher-Goldfarb-Shanno algorithm (L-BFGS) for the optimization problem of finding the MAP. The initial value for the algorithm was , the MLE for the Cox proportional hazard model in (17).

Having all the components of formula (6), it is now possible to set up a MCMC framework to sample from the posterior distribution on the model space. The same technique of birth-death scheme, similar to that used in Nikooienejad et al. (2016), can be exploited here. However, computing the Hessian matrix in (21), which has complexity of , makes MCMC iterations overwhelmingly slow and the whole process infeasible. An alternative stochastic search based approach to search the model space is discussed in the following section.

The estimated HPPM is defined as the highest posterior probability model among all visited models. In numerous applications many models are within a small margin of the HPPM. For this reason, and for predictive purposes, it is tempting to obtain the Median Probability Model (MPM) (Barbieri et al., 2004), which is a model containing covariates that have posterior inclusion probability of at least . According to Barbieri et al. (2004), the posterior inclusion probability for covariate is defined as

(24) |

That is, the sum of posterior probabilities of all models that have covariate as one of their variables. In this expression, is a binary value determining the inclusion of the covariate in model .

#### 2.4.2 Stochastic Search Algorithm

We utilize the S5 technique, proposed by Shin et al. (2015) for variable selection in linear regression problems, and adopt it for the survival data model. It is a stochastic search method that screens covariates at each step. The algorithm is scalable and its computational complexity is independent of (Shin et al., 2015).

Screening is the essential part of the S5 algorithm. In linear regression, screening is defined based on the correlation between remaining covariates and the residuals of the regression using the current model (Fan and Lv, 2008). The concept of screening covariates for survival response data is proposed in Fan et al. (2010) and is defined based on the marginal utility for each covariate.

To illustrate the screening technique, suppose the current model is . The conditional utility of covariate is basically the amount of information it contributes to the survival outcome, given model , and is defined as

(25) |

By comparing this to (17), it can be seen that the conditional utility is the maximum likelihood for covariate , while accounting for the information provided by model . Finding is a univariate optimization procedure and thus fast to compute.

The S5 algorithm for survival data works as follows. At each step, the covariates with highest conditional utility are candidates to be added to the current model and comprise the addition set, . Hence, contains models of size each. The deletion set, comprises models with the same covariates as current model except with one variable that is removed. Therefore, has models of size . From the current model, , We can potentially move to each of its neighbors in and , with a probability proportional to the marginal probabilities of those neighboring models. This is how the model space is explored.

Note that this is done in a simulated annealing fashion and the marginal probabilities are raised to the power of where is the temperature in the annealing schedule and the temperatures decrease. To increase the number of visited models, a specified number of iterations are performed at each temperature. At the end, the model with the highest posterior probability out of those visited models is picked as the HPPM.

In our algorithm, we used equally spaced temperatures varying from to and iterations within each temperature. To increase the number of visited models, we parallelize the S5 procedure so that it could be distributed to multiple CPUs. Each CPU then executes S5 algorithm independently with a different starting model. All visited models are pooled together at the end and the HPPM and MPM are found accordingly. Using posterior probabilities of the visited models, the posterior inclusion probability for each covariate can be computed using (24). In our simulations, we found CPUs were sufficient to explore the model space for design matrices with covariates.

### 2.5 Predictive Accuracy Measurement

To measure the predictive accuracy of our method, in addition to looking at the selected genes and their pathways to check how relevant they are, we also compute the time dependent AUC values obtained from time dependent ROC curves introduced by Heagerty et al. (2000) for survival time datasets. In particular, the dataset is divided into two parts of training and test sets and the AUC values are computed for the test set using the model that is obtained by performing variable selection process on the training set. This can be fit into a cross validation scheme for assessing the predictive power. There are different methods to estimate time dependent sensitivity specificity. In our algorithm, we adapt a method proposed by Uno et al. (2007), henceforth called Uno’s method. In that, sensitivity is estimated by

(26) |

and the specificity is estimated by

(27) |

Those values are estimated for the test set. Therefore, in equations above, is the number of observations in the test set, is the status of observation and is the observed time for that observation in the test set. The variable is the discrimination threshold that is changing to obtain ROC curve. The function is the Kaplan-Meier estimate of the survival function obtained by the training set. For each observation in the test set with the observed time of , is computed by basic interpolation procedure. That is,

(28) |

Here, is the set of all observed survival times in the training set. Note that in (26) and (27), is the estimated coefficient under a specific model. This is where different models can play role in determining sensitivity and specificity.

#### 2.5.1 Bayesian Model Averaging (BMA)

In analyzing real datasets, to incorporate the uncertainty in the selected models, and the fact that only limited number of models are being visited in the selection procedure, it is compelling to use Bayesian model averaging for predictive purposes. In particular, we account for different models in computing time dependent sensitivity and specificity, and not just HPPM. Following (26) and (27), let the final sensitivity and specificity using BMA is obtained by,

(29) |

and

(30) |

where, be the posterior probability of model The value of depends on what type of BMA is being used. Here, we use Occam’s window which means only models that have posterior probability of at least will be considered in the model averaging. We use in our formulations.

## 3 Results

To investigate the performance of the proposed model selection procedure, we applied our method to both simulated data sets and real cancer genomic data. For simulation data, we compared the performance of our algorithm to ISIS-SCAD (Fan et al., 2010). For the real data, we compared our algorithm to ISIS-SCAD, glmnet algorithm (Friedman et al., 2010) and a recently introduced method, named CoxHD, for classification of genes in Acute Myeloid Leukemia (AML). This algorithm has shown promising performance in selecting genes (Papaemmanuil et al., 2016). Although CoxHD is suitable for cases when , it does not work well for . Finally we applied our method to renal cell carcinoma reported in Cancer Genome Atlas Research Network (2013). We were not able to run the CoxHD algorithm for this dataset with , given the runtime limits imposed by our High Performance Computational facility. Therefore, the results for renal cell carcinoma are only available for our method, gmlnet and ISIS-SCAD. Moreover, ISIS-SCAD failed to produce results for the AML dataset. The details are discussed in the following sections.

### 3.1 Simulation Studies

We first examined the six different simulation settings described in Fan et al. (2010). These settings consider different aspects of variable selection with respect to the correlation between true covariates and the magnitude of true coefficients. Here, we report two of the hardest settings which are named as Equi-correlated covariates with correlation = 0.5 and two very hard variables to be selected. We refer to these settings as case 1 and 2, respectively.

For case 1, are multivariate Gaussian random variable with mean and marginal variance of . The correlation structure is for . The size of the true model is six with true regression coefficients and for all . The number of observations and covariates are and . The censoring rate for this simulation case is .

For case 2, are multivariate Gaussian random variables with mean and marginal variance of . The correlation structure is for all , for all and for . The size of the true model is five with true regression coefficients and for all . The corresponding censoring rate is for this case. Similar to the previous case, The number of observations and covariates are and . The censoring rate for this simulation setting is .

In both cases the survival times are simulated from an exponential distribution with mean , measuring that the baseline hazard function was for .

To measure the performance of the algorithms, we repeated each simulation setting 50 different times and at the end four different outcomes are reported. The first two of those outcomes are the median norm and the median squared norm for coefficient estimation error, denoted by and respectively. The norm is computed as , where the squared norm is computed as . Here, is the estimated and is the true coefficient vector. The third outcome that is considered is the median model size of the selected models in 50 different iterations which is denoted by MMS. The last out come is the proportion of times that the selected model contains all true variables. This parameter is denoted by .

Table 1 shows the performance comparison between our method, BVSNLP and three different versions of ISIS-SCAD algorithm. The LASSO method is not in listed in that table because it takes several days to complete a single repetition of any of these simulation cases (Fan et al., 2010).

BVSNLP | Van-ISIS | Var1-ISIS | Var2-ISIS | |

Case 1: | ||||

0.43 | 0.52 | 0.55 | 0.51 | |

0.04 | 0.07 | 0.08 | 0.07 | |

MMS | 6 | 6 | 6 | 6 |

1 | 1 | 1 | 1 | |

Case 2: | ||||

0.77 | 0.99 | 1.1 | 1.29 | |

0.16 | 0.39 | 0.44 | 1.35 | |

MMS | 5 | 5 | 5 | 5 |

1 | 1 | 1 | 0.99 |

In the S5 algorithm, 30 iterations are used within each temperature. The parameter was chosen as . Each S5 algorithm was run in parallel on 120 CPUs for both simulation cases. The Beta binomial prior was used for the model space with average model size equal to 1. The hyperparameters were selected using the algorithm discussed in section 2.

As demonstrated in Table 1, our method performed better in estimating true coefficients compared to the best variants of the ISIS-SCAD algorithm. In addition, the BVSNLP algorithm chose the correct model with zero false positives in all 50 iterations for both simulation scenarios.

### 3.2 Real Data

We have studied two major datasets. The first dataset contains leukemia patients’ survival times and was introduced in Papaemmanuil et al. (2016). For those data, the number of observations is almost the same as number of covariates . The other data set involves survival times for renal cell carcinoma in which . This dataset was previously considered in Nikooienejad et al. (2016), where survival patients were converted to binary outcomes.

#### 3.2.1 Leukemia Data

The recent work of Papaemmanuil et al. (2016) considers genomic classification in Acute Myeloid Leukemia (AML) patients. In this study, 1540 patients in three prospective trials were enrolled in order to investigate the effects of known mutated genes in AML. The censoring rate for this dataset is .

Papaemmanuil et al. (2016) implemented a sparse random effects model for the Cox proportional hazards using their proposed R package, CoxHD. In that method, the parameters of random effects model are assumed to come from a shared Normal distribution. The coefficients are then obtained by finding the maximum a posteriori (MAP) of a penalized partial likelihood function. The covariates are clustered into different groups with a shared mean and variance. The shared mean vector and covariance matrix are estimated iteratively using the expectation Maximization (EM) algorithm. This method has shown to have a dominant predictive performance compared to existing frequentist methods for datasets where and henceforth is denoted by NEJM.

To compare NEJM method to ours, we used the exact same dataset they provided when computing the predictive accuracy of the CoxHD method. That dataset contains 229 covariates in 8 different categories. Those categories are listed as follows, with the number of covariates at each category listed in the parenthesis. Clinical (11), copy number alterations (18), demographics (2), fusion genes (8), genes (58), gene:gene interactions (126), nuisances (4) and treatment (2). In this list, nuisance variables are variables such as the trial a patient was enrolled in, the year a patient entered the clinical trial and whether cytogenetic data were missing. Moreover, gene:gene interactions are between two mutated genes.

A five fold cross validation was performed to measure the predictive accuracy of both algorithms. The Area Under Curve (AUC) for right censored data is used to evaluate the predictive power for each of the methods. To estimate AUC, we exploit the methodology in section 2.5. We also applied glmnet and ISIS-SCAD algorithms to this dataset.

Figure 2 illustrates the predictive power for the mentioned methods. The BVSNLP method performs similarly to the NEJM method. The mean square difference between predictive AUC curves is only .

#### 3.2.2 Renal Cell Carcinoma Data

This dataset was generated by Cancer Genome Atlas Research Network (2013) and contains Illumina HiSeq data on mRNA expression for 467 patient samples. The survival outcomes of these patients were available. A preprocessing step using DeMix algorithm (Ahn et al., 2013) was performed on the data in order to remove stromal contamination. The resulting number of observations included in our analysis was 509, with 12,664 gene expression covariates in the design matrix. Three of those covariates are clinical ones, including age, gender and stage of cancer. The censoring rate for this dataset is .

We applied our method, BVSNLP to this dataset with uniform-beta prior on model space. Moreover, we kept the ‘stage’ as a fixed covariate, meaning that it is present in all models and does not enter the selection procedure. Figure 3 shows the average predictive AUC for 5 fold cross validation for BVSNLP, glmnet and ISIS-SCAD. The CoxHD could not be run for this dataset as and its undetermined runtime was more than the limit allowed in our local cluster.

## 4 Discussion

In this article a Bayesian method, named BVSNLP, for selecting variables in high and ultrahigh dimensional datasets with survival time outcomes was proposed. Our method imposes inverse moment nonlocal prior density on non-zero regression coefficients. This positively impacts variable selection and coefficient estimation performance, as demonstrated by simulation studies.

Various outputs can be reported as the outcomes of the algorithm. Those include highest posterior probability model, median probability model and posterior inclusion probability for each covariate in the model. For real datasets, Bayesian model averaging is used to incorporate uncertainty in selected models in computing time dependent AUC plots using Uno’s method. Finally, an R package named BVSNLP, which makes the algorithm available to interested researchers, has been implemented.

Two real datasets were considered in this article. BVSNLP found sparse models with biologically relevant genes that comply with previous findings in both cases. The proposed method showed a reliable predictive accuracy as measured by predictive AUC and outperformed competing methods. It should be noted that our algorithm complexity increases with the sample size at an rate, which slows down the processing of datasets with thousands of observations.

## References

- Ahn et al. (2013) Ahn, J., Yuan, Y., Parmigiani, G., Suraokar, M. B., Diao, L., Wistuba, I. I., and Wang, W. (2013). Demix: deconvolution for mixed cancer transcriptomes using raw measured data. Bioinformatics, 29(15), 1865–1871.
- Antoniadis et al. (2010) Antoniadis, A., Fryzlewicz, P., and Letué, F. (2010). The dantzig selector in cox’s proportional hazards model. Scandinavian Journal of Statistics, 37(4), 531–552.
- Barbieri et al. (2004) Barbieri, M. M., Berger, J. O., et al. (2004). Optimal predictive model selection. The Annals of Statistics, 32(3), 870–897.
- Bender et al. (2005) Bender, R., Augustin, T., and Blettner, M. (2005). Generating survival times to simulate cox proportional hazards models. Statistics in medicine, 24(11), 1713–1723.
- Berger et al. (1999) Berger, J. O., Liseo, B., Wolpert, R. L., et al. (1999). Integrated likelihood methods for eliminating nuisance parameters. Statistical Science, 14(1), 1–28.
- Cancer Genome Atlas Research Network (2013) Cancer Genome Atlas Research Network (2013). Comprehensive molecular characterization of clear cell renal cell carcinoma. Nature, 499(7456), 43–49.
- Cox (1972) Cox, D. R. (1972). Regression models and life-tables. Journal of the Royal Statistical Society. Series B (Methodological), 36(2), 187–220.
- Cox and Oakes (1984) Cox, D. R. and Oakes, D. (1984). Analysis of survival data, volume 21. CRC Press.
- Fan and Li (2002) Fan, J. and Li, R. (2002). Variable selection for cox’s proportional hazards model and frailty model. Annals of Statistics, pages 74–99.
- Fan and Lv (2008) Fan, J. and Lv, J. (2008). Sure independence screening for ultrahigh dimensional feature space. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(5), 849–911.
- Fan et al. (2010) Fan, J., Feng, Y., Wu, Y., et al. (2010). High-dimensional variable selection for coxâs proportional hazards model. In Borrowing Strength: Theory Powering Applications–A Festschrift for Lawrence D. Brown, pages 70–86. Institute of Mathematical Statistics.
- Faraggi and Simon (1998) Faraggi, D. and Simon, R. (1998). Bayesian variable selection method for censored survival data. Biometrics, pages 1475–1485.
- Friedman et al. (2010) Friedman, J., Hastie, T., and Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1), 1–22.
- George and McCulloch (1997) George, E. I. and McCulloch, R. E. (1997). Approaches for bayesian variable selection. Statistica sinica, 7(2), 339–373.
- Ghosh (1988) Ghosh, J. (1988). Statistical information and likelihood: A collection of critical essays by dr. d. basu. Lect. Notes in Statist, 45.
- Heagerty et al. (2000) Heagerty, P. J., Lumley, T., and Pepe, M. S. (2000). Time-dependent roc curves for censored survival data and a diagnostic marker. Biometrics, 56(2), 337–344.
- Ibrahim et al. (1999) Ibrahim, J. G., Chen, M.-H., and MacEachern, S. N. (1999). Bayesian variable selection for proportional hazards models. Canadian Journal of Statistics, 27(4), 701–717.
- Johnson and Rossell (2010) Johnson, V. E. and Rossell, D. (2010). On the use of non-local prior densities in bayesian hypothesis tests. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(2), 143–170.
- Kalbfleisch and Prentice (1980) Kalbfleisch, J. and Prentice, R. (1980). The statistical analysis of time failure data. The statistical analysis of time failure data.
- Montagnoli et al. (2010) Montagnoli, A., Moll, J., and Colotta, F. (2010). Targeting cell division cycle 7 kinase: a new approach for cancer therapy. Clinical cancer research, 16(18), 4503–4508.
- Nikooienejad et al. (2016) Nikooienejad, A., Wang, W., and Johnson, V. E. (2016). Bayesian variable selection for binary outcomes in high-dimensional genomic studies using non-local priors. Bioinformatics, 32(9), 1338–1345.
- Papaemmanuil et al. (2016) Papaemmanuil, E., Gerstung, M., Bullinger, L., Gaidzik, V. I., Paschka, P., Roberts, N. D., Potter, N. E., Heuser, M., Thol, F., Bolli, N., et al. (2016). Genomic classification and prognosis in acute myeloid leukemia. New England Journal of Medicine, 374(23), 2209–2221.
- Potapov et al. (2012) Potapov, S., Adler, W., Schmid, M., and Potapov, M. S. (2012). Package âsurvaucâ. Statistics in Medicine, 25, 3474–3486.
- Scott et al. (2010) Scott, J. G., Berger, J. O., et al. (2010). Bayes and empirical-bayes multiplicity adjustment in the variable-selection problem. The Annals of Statistics, 38(5), 2587–2619.
- Sha et al. (2006) Sha, N., Tadesse, M. G., and Vannucci, M. (2006). Bayesian variable selection for the analysis of microarray data with censored outcomes. Bioinformatics, 22(18), 2262–2268.
- Shin et al. (2015) Shin, M., Bhattacharya, A., and Johnson, V. E. (2015). Scalable bayesian variable selection using nonlocal prior densities in ultrahigh-dimensional settings. arXiv preprint arXiv:1507.07106.
- Tibshirani et al. (1997) Tibshirani, R. et al. (1997). The lasso method for variable selection in the cox model. Statistics in medicine, 16(4), 385–395.
- Uno et al. (2007) Uno, H., Cai, T., Tian, L., and Wei, L. (2007). Evaluating prediction rules for t-year survivors with censored regression models. Journal of the American Statistical Association, 102(478), 527–537.
- Zhang and Lu (2007) Zhang, H. H. and Lu, W. (2007). Adaptive lasso for cox’s proportional hazards model. Biometrika, 94(3), 691–703.