Cmix: a high-dimensional mixture model for censored durations, with applications to genetic data
Abstract
We introduce a supervised learning mixture model for censored durations (Cmix) to simultaneously detect subgroups of patients with different prognoses and order them based on their risk. Our method is applicable in a high-dimensional setting, i.e. with a large number of biomedical covariates.
Indeed, we penalize the negative log-likelihood by the Elastic-Net, which leads to a sparse parameterization of the model and automatically pinpoints the relevant covariates for the survival prediction.
Inference is achieved using an efficient Quasi-Newton Expectation Maximization (QNEM) algorithm, for which we provide convergence properties.
The statistical performance of the method is examined in an extensive Monte Carlo simulation study, and finally illustrated on three publicly available genetic cancer datasets with high-dimensional covariates.
We show that our approach outperforms the state-of-the-art survival models in this context, namely both the CURE and Cox proportional hazards models penalized by the Elastic-Net, in terms of C-index, AUC($t$) and survival prediction.
We thus propose a powerful tool for personalized medicine in oncology.
Keywords. Cox's proportional hazards model; CURE model; Elastic-Net regularization; High-dimensional estimation; Mixture duration model; Survival analysis
Simon Bussy
Theoretical and Applied Statistics Laboratory
Pierre and Marie Curie University, Paris, France
email: simon.bussy@gmail.com
Agathe Guilloux
LaMME, UEVE and UMR 8071
Paris Saclay University, Evry, France
email: agathe.guilloux@math.cnrs.fr
Stéphane Gaïffas
CMAP Ecole polytechnique, Palaiseau, France
and LPMA, UMR CNRS 7599, Paris Diderot University, Paris, France
email: stephane.gaiffas@polytechnique.edu
Anne-Sophie Jannot
Biomedical Informatics and Public Health Department
European Georges Pompidou Hospital, Assistance Publique-Hôpitaux de Paris
and INSERM UMRS 1138, Centre de Recherche des Cordeliers, Paris, France
email: annesophie.jannot@aphp.fr
1 Introduction
Predicting subgroups of patients with different prognoses is a key challenge for personalized medicine; see for instance Alizadeh et al. (2000) and Rosenwald et al. (2002), where subgroups of patients with different survival rates are identified based on gene expression data. A substantial number of techniques can be found in the literature to predict the subgroup of a given patient in a classification setting, namely when subgroups are known in advance (Golub et al., 1999; Hastie et al., 2001; Tibshirani et al., 2002). We consider in the present paper the much more difficult case where subgroups are unknown.
In this situation, a first widespread approach consists in applying unsupervised learning techniques on the covariates, for instance on the gene expression data (Bhattacharjee et al., 2001; Beer et al., 2002; Sørlie et al., 2001), to define subsets of patients, and then estimating the risks in each of them. The problem with such techniques is that there is no guarantee that the identified subgroups will have different risks. Another approach to subgroup identification is conversely based exclusively on the survival times: patients are then assigned to a "low-risk" or a "high-risk" group based on whether they were still alive (Shipp et al., 2002; Van't Veer et al., 2002). The problem here is that the resulting subgroups may not be biologically meaningful, since the method does not use the covariates, and prognosis prediction based on covariates is not possible.
The method we propose uses both the survival information of the patients and their covariates, in a supervised learning fashion. Moreover, it relies on the idea that exploiting the subgroup structure of the data, namely the fact that a portion of the population has a higher risk of early death, can improve the survival prediction for future patients (unseen during the learning phase).
We propose to consider a mixture of event time distributions in which the probabilities of belonging to each subgroup are driven by the covariates (e.g. gene expression data, patient characteristics, therapeutic strategy or omics covariates). Our Cmix model is hence part of the class of model-based clustering algorithms, as introduced in Banfield and Raftery (1993).
More precisely, to model the heterogeneity within the patient population, we introduce a latent variable $Z$, and our focus is on the conditional distribution of $Z$ given the values of the covariates $X$. Now, conditionally on the latent variable $Z$, the distribution of the duration time $T$ is different, leading to a mixture in the event time distribution.
For a patient with covariates $x$, the conditional probabilities of belonging to the $k$-th risk group can be seen as scores that can help decision-making for physicians. As a byproduct, they can also shed light on the effect of the covariates (which combination of biomedical markers is relevant to a given event of interest).
Our methodology differs from standard survival analysis approaches in various ways, which we describe in this paragraph. First, the Cox proportional hazards (PH) model (Cox, 1972), by far the most widely used in such a setting, is a regression model that describes the relation between the intensity of events and the covariates via

$\lambda(t \mid X = x) = \lambda_0(t)\, e^{x^\top \beta}, \quad (1)$

where $\lambda_0$ is a baseline intensity and $\beta$ is a vector quantifying the multiplicative impact on the hazard ratio of each covariate. As in our model, high-dimensional covariates can be handled, via e.g. Elastic-Net penalization, see Simon et al. (2011). But it does not permit the stratification of the population into groups of homogeneous risks, and hence does not deliver a simple tool for clinical practice. Moreover, we show in the numerical sections that the Cmix model can be trained very efficiently in high dimension, and outperforms the standard Cox PH model by far on the analysed datasets.
Other models consider mixtures of event time distributions. In the CURE model (see Farewell (1982) and Kuk and Chen (1992)), one fraction of the population is considered as cured (hence not subject to any risk). This can be very limiting, since for a large number of applications (e.g. rehospitalization for patients with chronic diseases or relapse for patients with metastatic cancer), all patients are at risk. We consider, in our model, that there is always an event risk, no matter how small. Other mixture models have been considered in survival analysis: see Kuo and Peng (2000) for a general study of mixture models for survival data, or De Angelis et al. (1999) in a cancer survival analysis setting, to name but a few. Unlike ours, none of these algorithms considers the high-dimensional setting.
A precise description of the model is given in Section 2. Section 3 focuses on the regularized version of the model, with an Elastic-Net penalization to exploit dimension reduction and prevent overfitting. Inference is presented under this framework, as well as the convergence properties of the developed algorithm. Section 4 describes the simulation procedure used to evaluate the performances and compares our method with state-of-the-art models. In Section 5, we apply our method to genetic datasets. Finally, we discuss the obtained results in Section 6.
2 A censored mixture model
Let us present the survival analysis framework. We assume that the conditional density of the duration $T$ given the covariates $X = x$ is a mixture

$f(t \mid X = x) = \sum_{k=0}^{K-1} \pi_k(x)\, f_k(t; \alpha_k)$

of densities $f_k(\cdot\,; \alpha_k)$, for $k = 0, \dots, K-1$ and some parameters $\alpha_k$ to estimate. The weights $\pi_k(x)$ combining these distributions depend on the patient's biomedical covariates $x$ and are such that

$\sum_{k=0}^{K-1} \pi_k(x) = 1 \quad \text{and} \quad \pi_k(x) \ge 0 \text{ for all } k. \quad (2)$
This is equivalent to saying that conditionally on a latent variable $Z \in \{0, \dots, K-1\}$, the density of $T$ at time $t$ is $f_k(t; \alpha_k)$ on the event $\{Z = k\}$, and we have

$\pi_k(x) = \mathbb{P}[Z = k \mid X = x],$

where each group is characterized by a vector of coefficients $\beta_k \in \mathbb{R}^d$ that quantifies the impact of each biomedical covariate on the probability that a patient belongs to the $k$-th group. We consider a logistic link function for these weights, given by

$\pi_k(x) = \frac{e^{x^\top \beta_k}}{\sum_{j=0}^{K-1} e^{x^\top \beta_j}}. \quad (3)$

The hidden status $Z$ therefore has a multinomial distribution $\mathcal{M}\big(\pi_0(x), \dots, \pi_{K-1}(x)\big)$. The intercept term is omitted here without loss of generality.
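As an illustration, in the two-group case the logistic link (3) reduces to a standard sigmoid once one set of coefficients is normalized to zero. The sketch below (with hypothetical covariates and coefficients, not taken from the paper) computes the resulting gating probability:

```python
import math

def gate_proba(x, beta):
    """Probability of the high-risk group under the logistic link (3),
    in the K = 2 case with the low-risk coefficients set to zero."""
    z = sum(xj * bj for xj, bj in zip(x, beta))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical patient and coefficient vector, for illustration only.
x = [0.5, -1.0]
beta = [2.0, 0.3]
p_high = gate_proba(x, beta)   # in (0, 1)
p_low = 1.0 - p_high           # the two weights sum to one, as in (2)
```

Increasing a covariate with a positive coefficient increases the predicted probability of the high-risk group, which is what makes these weights usable as patient scores.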
In practice, information loss occurs through right censoring. This is taken into account in our model by introducing the following quantities: a censoring time $C$ at which the individual "leaves" the target cohort, a right-censored duration $Y$ and a censoring indicator $\delta$, defined by

$Y = T \wedge C \quad \text{and} \quad \delta = \mathbf{1}\{T \le C\},$

where $a \wedge b$ denotes the minimum between two numbers $a$ and $b$, and $\mathbf{1}\{\cdot\}$ denotes the indicator function.
In order to write a likelihood and draw inference, we make the two following hypotheses.
Hypothesis 1
$T$ and $C$ are conditionally independent given $Z$ and $X$.
Hypothesis 2
$C$ is independent of $Z$.
Hypothesis 1 is classical in survival analysis (Klein and Moeschberger, 2005), while Hypothesis 2 is classical in survival mixture models (Kuo and Peng, 2000; De Angelis et al., 1999). Under these hypotheses, denoting $g$ the density of the censoring $C$, $\bar G$ its survival function, and $F$ (resp. $\bar F = 1 - F$) the cumulative distribution (resp. survival) function corresponding to a given density $f$, the conditional density of $(Y, \delta)$ given $X = x$ is

$\big[f(y \mid x)\, \bar G(y)\big]^{\delta}\, \big[g(y)\, \bar F(y \mid x)\big]^{1 - \delta}.$
Then, denoting $\theta = (\beta_0, \dots, \beta_{K-1}, \alpha_0, \dots, \alpha_{K-1})$ the parameters to infer and considering an independent and identically distributed (i.i.d.) cohort of $n$ patients $(x_1, y_1, \delta_1), \dots, (x_n, y_n, \delta_n)$, the log-likelihood of the Cmix model can be written

$\ell_n(\theta) = \frac{1}{n} \sum_{i=1}^{n} \Big\{ \delta_i \log \Big[ \sum_{k=0}^{K-1} \pi_k(x_i)\, f_k(y_i; \alpha_k) \Big] + (1 - \delta_i) \log \Big[ \sum_{k=0}^{K-1} \pi_k(x_i)\, \bar F_k(y_i; \alpha_k) \Big] \Big\} + c_n,$

where $\bar F_k$ denotes the survival function associated with the density $f_k$, and $c_n$ gathers the censoring terms, which do not depend on $\theta$. Note that from now on, all computations are done conditionally on the covariates $x_i$. An important fact is that we do not need to know or parametrize $g$ nor $\bar G$, namely the distribution of the censoring, for inference in this model (since all $g$ and $\bar G$ terms only appear in the additive constant $c_n$).
3 Inference of Cmix
In this section, we describe the procedure for estimating the parameters of the Cmix model. We begin by presenting the Quasi-Newton Expectation Maximization (QNEM) algorithm we use for inference. We then focus our study on the convergence properties of the algorithm.
3.1 QNEM algorithm
In order to avoid overfitting and to improve the prediction power of our model, we use Elastic-Net regularization (Zou and Hastie, 2005) by minimizing the penalized objective
$\ell_n^{\text{pen}}(\theta) = -\ell_n(\theta) + \gamma \Big( \eta \|\beta\|_1 + \frac{1 - \eta}{2} \|\beta\|_2^2 \Big), \quad (4)$

where we add a linear combination of the lasso ($\|\beta\|_1$) and ridge (squared $\|\beta\|_2$) penalties for a fixed $\eta \in [0, 1]$, a tuning parameter $\gamma > 0$, and where we denote $\|\beta\|_p$ the $\ell_p$ norm of $\beta$. One advantage of this regularization method is its ability to perform model selection (the lasso part) and pinpoint the most important covariates relative to the prediction objective. On the other hand, the ridge part makes it possible to handle potential correlation between covariates (Zou and Hastie, 2005). Note that in practice, the intercept is not regularized.
In order to derive an algorithm for this objective, we introduce the so-called Quasi-Newton Expectation Maximization (QNEM) algorithm, a combination of an EM algorithm (Dempster et al., 1977) and an L-BFGS-B algorithm (Zhu et al., 1997). For the EM part, we need to compute the negative completed log-likelihood (here scaled by $n^{-1}$), namely the negative log of the joint distribution of $(Y, \delta, Z)$. It can be written

$\ell_n^{\text{comp}}(\theta) = -\frac{1}{n} \sum_{i=1}^{n} \sum_{k=0}^{K-1} \mathbf{1}\{z_i = k\} \Big\{ \delta_i \log f_k(y_i; \alpha_k) + (1 - \delta_i) \log \bar F_k(y_i; \alpha_k) + \log \pi_k(x_i) \Big\}. \quad (5)$
Suppose that we are at step $l$ of the algorithm, with current iterate denoted $\theta^{(l)}$. For the E-step, we need to compute the expected completed log-likelihood $Q_n(\theta, \theta^{(l)})$, namely the expectation of $\ell_n^{\text{comp}}(\theta)$ given the observed data and $\theta^{(l)}$. We note that

$\mathbb{E}\big[\mathbf{1}\{Z_i = k\} \mid y_i, \delta_i, x_i; \theta^{(l)}\big] = q_{i,k}^{(l)}, \quad (6)$

with

$q_{i,k}^{(l)} = \frac{\pi_k^{(l)}(x_i)\, \big[f_k(y_i; \alpha_k^{(l)})\big]^{\delta_i} \big[\bar F_k(y_i; \alpha_k^{(l)})\big]^{1 - \delta_i}}{\sum_{j=0}^{K-1} \pi_j^{(l)}(x_i)\, \big[f_j(y_i; \alpha_j^{(l)})\big]^{\delta_i} \big[\bar F_j(y_i; \alpha_j^{(l)})\big]^{1 - \delta_i}}, \quad (7)$
so that $Q_n(\theta, \theta^{(l)})$ is obtained from (5) by replacing the $\mathbf{1}\{z_i = k\}$ occurrences with $q_{i,k}^{(l)}$. Depending on the chosen distributions $f_k$, the M-step can either be explicit for the updates of the $\alpha_k$ (see Section 3.3 below for the geometric distributions case), or obtained using a minimization algorithm otherwise.
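For concreteness, the E-step responsibilities can be sketched as follows in the two-group geometric case (the function names and toy values below are ours, for illustration only):

```python
def geom_density(t, alpha):
    """Geometric density on {1, 2, ...}: P(T = t)."""
    return alpha * (1.0 - alpha) ** (t - 1)

def geom_survival(t, alpha):
    """Geometric survival function: P(T > t)."""
    return (1.0 - alpha) ** t

def posterior_high(pi_high, y, delta, alpha_high, alpha_low):
    """E-step responsibility for the high-risk group: an observed time
    (delta = 1) contributes through the density, a censored one
    (delta = 0) through the survival function."""
    lik_high = geom_density(y, alpha_high) if delta else geom_survival(y, alpha_high)
    lik_low = geom_density(y, alpha_low) if delta else geom_survival(y, alpha_low)
    num = pi_high * lik_high
    return num / (num + (1.0 - pi_high) * lik_low)
```

An early observed event pulls the responsibility toward the high-risk component, while a long censored follow-up pulls it toward the low-risk one.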
Let us focus now on the update of $\beta$ in the M-step of the algorithm. By denoting

$R_n(\beta) = -\frac{1}{n} \sum_{i=1}^{n} \sum_{k=0}^{K-1} q_{i,k}^{(l)} \log \pi_k(x_i)$

the quantities involved in $Q_n(\theta, \theta^{(l)})$ that depend on $\beta$, the update for $\beta$ therefore requires to minimize

$R_n(\beta) + \gamma \Big( \eta \|\beta\|_1 + \frac{1 - \eta}{2} \|\beta\|_2^2 \Big). \quad (8)$
The minimization Problem (8) is a convex problem. It looks like the logistic regression objective, where labels are not fixed but softly encoded by the expectation step (computation of the $q_{i,k}^{(l)}$ above, see Equation (6)).
We minimize (8) using the well-known L-BFGS-B algorithm (Zhu et al., 1997). This algorithm belongs to the class of quasi-Newton optimization routines, which solve the given minimization problem by computing approximations of the inverse Hessian matrix of the objective function. It can deal with differentiable convex objectives with box constraints. In order to use it with the $\ell_1$ penalization, which is not differentiable, we use the trick borrowed from Andrew and Gao (2007): for each coordinate, write $\beta_j = \beta_j^+ - \beta_j^-$, where $\beta_j^+$ and $\beta_j^-$ are respectively the positive and negative parts of $\beta_j$, and add the constraints $\beta_j^+ \ge 0$ and $\beta_j^- \ge 0$. Namely, we rewrite the minimization problem (8) as the following differentiable problem with box constraints
$\min_{\beta^+ \ge 0,\ \beta^- \ge 0} \; R_n(\beta^+ - \beta^-) + \gamma \Big( \eta \sum_{j=1}^{d} (\beta_j^+ + \beta_j^-) + \frac{1 - \eta}{2} \|\beta^+ - \beta^-\|_2^2 \Big), \quad (9)$
where $\beta = \beta^+ - \beta^-$. The L-BFGS-B solver requires the exact value of the gradient, which is easily given (for instance in the $K = 2$ case, with $\beta$ the coefficients of the high-risk group) by

$\nabla_{\beta_j^\pm} = \pm \frac{1}{n} \sum_{i=1}^{n} \big( \pi_1(x_i) - q_{i,1}^{(l)} \big) x_{i,j} + \gamma \eta \pm \gamma (1 - \eta) \beta_j. \quad (10)$
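The positive/negative-part trick can be checked numerically. Below is a sketch, restricted to the soft-label logistic gating part of the objective with hypothetical data (names and values are ours), of the $(\beta^+, \beta^-)$ objective and its exact gradient; the gradient matches finite differences, which is the property the L-BFGS-B solver relies on:

```python
import math

def obj_and_grad(bp, bm, X, q, gamma, eta):
    """Soft-label logistic loss plus Elastic-Net, written in the
    beta = beta_plus - beta_minus parameterization, with its exact
    gradient (a sketch of the gating part only, not the full QNEM
    objective)."""
    n, d = len(X), len(bp)
    beta = [bp[j] - bm[j] for j in range(d)]
    loss, g = 0.0, [0.0] * d
    for x, qi in zip(X, q):
        z = sum(xj * bj for xj, bj in zip(x, beta))
        p = 1.0 / (1.0 + math.exp(-z))
        loss -= qi * z - math.log(1.0 + math.exp(z))
        for j in range(d):
            g[j] += (p - qi) * x[j]    # soft-label logistic gradient
    loss, g = loss / n, [gj / n for gj in g]
    # lasso part is linear in (beta_plus + beta_minus), ridge acts on beta
    pen = gamma * sum(eta * (bp[j] + bm[j]) + 0.5 * (1 - eta) * beta[j] ** 2
                      for j in range(d))
    grad_p = [g[j] + gamma * (eta + (1 - eta) * beta[j]) for j in range(d)]
    grad_m = [-g[j] + gamma * (eta - (1 - eta) * beta[j]) for j in range(d)]
    return loss + pen, grad_p, grad_m
```

On the box $\beta^+ \ge 0$, $\beta^- \ge 0$ this objective is differentiable, so any bound-constrained quasi-Newton solver can be applied to it directly.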
In Algorithm 1, we describe the main steps of the QNEM algorithm to minimize the function given in Equation (4).
The penalization parameters are chosen using crossvalidation, see Section A of Appendices for precise statements about this procedure and about other numerical details.
3.2 Convergence to a stationary point
We address here the convergence properties of the QNEM algorithm described in Section 3.1 for the minimization of the objective function defined in Equation (4). Let us denote

$Q_n^{\text{pen}}(\theta, \theta^{(l)}) = Q_n(\theta, \theta^{(l)}) + \gamma \Big( \eta \|\beta\|_1 + \frac{1 - \eta}{2} \|\beta\|_2^2 \Big).$

Convergence properties of the EM algorithm in a general setting are well known, see Wu (1983). In the QNEM algorithm, since we only improve $Q_n^{\text{pen}}(\theta, \theta^{(l)})$ instead of minimizing it exactly, we are not in the EM algorithm setting but in a so-called generalized EM (GEM) algorithm setting (Dempster et al., 1977). For such an algorithm, we do have the descent property, in the sense that the criterion function given in Equation (4) is reduced at each iteration, namely

$\ell_n^{\text{pen}}(\theta^{(l+1)}) \le \ell_n^{\text{pen}}(\theta^{(l)}).$
Let us make two hypotheses.
Hypothesis 3
The duration densities $f_k(\cdot\,; \alpha_k)$ are such that $t \mapsto f_k(t; \alpha_k)$ is bounded for all $\alpha_k$.
Hypothesis 4
$Q_n^{\text{pen}}(\theta, \theta')$ is continuous in $\theta$ and $\theta'$, and for any fixed $\theta'$, $\theta \mapsto Q_n^{\text{pen}}(\theta, \theta')$ is a convex function and is strictly convex in each coordinate of $\theta$.
Under Hypothesis 3, $\big(\ell_n^{\text{pen}}(\theta^{(l)})\big)_{l \ge 0}$ decreases monotonically to some finite limit. By adding Hypothesis 4, convergence of the QNEM algorithm to a stationary point can be shown. In particular, the stationary point is here a local minimum.
Theorem 1. Under Hypotheses 3 and 4, any limit point of the sequence $(\theta^{(l)})_{l \ge 0}$ generated by the QNEM algorithm is a stationary point of the objective given in Equation (4).
A proof is given in Section B of Appendices.
3.3 Parameterization
Let us discuss here the parameterization choices we made in the experimental part. First, in many applications, including the one addressed in Section 5, we are interested in identifying one subgroup of the population with a high risk of adverse event compared to the others. In the following, we hence consider $K = 2$, where $Z = 1$ means high risk of early death and $Z = 0$ means low risk. Moreover, in such a setting where $K = 2$, one can compare the groups learned by the Cmix model and the ones learned by the CURE model in terms of survival curves (see Figure 5).
To simplify notations and given the constraint formulated in Equation (2), we set $\beta_0 = 0$, and we denote $\beta = \beta_1$ and $\pi_\beta(x) = \pi_1(x) = e^{x^\top \beta} / (1 + e^{x^\top \beta})$ the conditional probability that a patient belongs to the group with high risk of death, given their covariates $x$.
In practice, we deal with discrete times in days. It turns out that the durations in the data used for applications in Section 5 are well fitted by Weibull distributions. This choice of distribution is very popular in survival analysis, see for instance Klein and Moeschberger (2005). We then first derive the QNEM algorithm with Weibull densities

$f_k(t; \alpha_k) = \frac{\nu_k}{\mu_k} \Big( \frac{t}{\mu_k} \Big)^{\nu_k - 1} e^{-(t/\mu_k)^{\nu_k}},$

with here $\alpha_k = (\mu_k, \nu_k)$, $\mu_k > 0$ being the scale parameter and $\nu_k > 0$ the shape parameter of the distribution.
As explained in the following Section 4, we select the best model using a cross-validation procedure based on the C-index metric, and the performances are evaluated according to both C-index and AUC($t$) metrics (see Section 4.3 for details). Those two metrics have the following property: if we apply any mapping on the marker vector (predicted on a test set) such that the order between all vector coefficient values is conserved, then both C-index and AUC($t$) estimates remain unchanged. In other words, by denoting $\hat{M}$ the vector of markers predicted on a test set of individuals, if $\psi$ is a function such that $\hat{M}_i < \hat{M}_j \Rightarrow \psi(\hat{M}_i) < \psi(\hat{M}_j)$ for all $(i, j)$, then both C-index and AUC($t$) estimates induced by $\hat{M}$ or by $\psi(\hat{M})$ are the same.
The order of the marker coefficients is actually paramount when the performances are evaluated according to the mentioned metrics. Furthermore, it turns out that empirically, if we add a constraint on the mixture of Weibull distributions that enforces an order-like relation between the two distributions $f_0$ and $f_1$, the performances are improved. To be more precise, the constraint to impose is that the two corresponding survival curves do not intersect. We then choose to impose the following: the two shape parameters are equal, $\nu_0 = \nu_1 = \nu$. Indeed, under this hypothesis, we do have $\bar F_1(t) \le \bar F_0(t)$ for all $t \ge 0$ as soon as $\mu_1 \le \mu_0$.
With this Weibull parameterization, updates for the $\alpha_k$ are not explicit in the QNEM algorithm, and consequently require some iterations of a minimization algorithm. Seeking explicit updates for the $\alpha_k$, we then derive the algorithm with geometric distributions instead of Weibull ones (the geometric distribution being the discrete counterpart of a Weibull with shape $\nu = 1$, that is, of the exponential distribution), namely with

$f_k(t; \alpha_k) = \alpha_k (1 - \alpha_k)^{t - 1}, \quad t \in \mathbb{N}^*, \ \alpha_k \in (0, 1).$
With this parameterization, we obtain from Equation (7)

$q_{i,k}^{(l)} = \frac{\pi_k^{(l)}(x_i)\, \big(\alpha_k^{(l)}\big)^{\delta_i} \big(1 - \alpha_k^{(l)}\big)^{y_i - \delta_i}}{\sum_{j=0}^{K-1} \pi_j^{(l)}(x_i)\, \big(\alpha_j^{(l)}\big)^{\delta_i} \big(1 - \alpha_j^{(l)}\big)^{y_i - \delta_i}},$

which leads to the following explicit M-step:

$\alpha_k^{(l+1)} = \frac{\sum_{i=1}^{n} q_{i,k}^{(l)} \delta_i}{\sum_{i=1}^{n} q_{i,k}^{(l)} y_i}.$
In this setting, implementation is hence straightforward. Note that Hypotheses 3 and 4 are immediately satisfied with this geometric parameterization.
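The explicit geometric M-step has a natural "events over exposure" form. The following sketch (our own minimal reimplementation, not the paper's code) performs one such update for a single component:

```python
def mstep_alpha(q, y, delta):
    """Explicit M-step for one geometric component with right censoring:
    posterior-weighted number of observed events divided by the
    posterior-weighted total time at risk."""
    num = sum(qi * di for qi, di in zip(q, delta))
    den = sum(qi * yi for qi, yi in zip(q, y))
    return num / den
```

With unit weights and no censoring this reduces to the usual geometric maximum likelihood estimate (number of events over total duration); censored patients contribute exposure but no event, which drags the estimate downward.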
In Section 5, we note that performances are similar for the Cmix model with Weibull or geometric distributions on all considered biomedical datasets. Since the geometric parameterization leads to more straightforward computations, it is the one used to parameterize the Cmix model in what follows, if not otherwise stated. Let us focus now on the performance evaluation of the Cmix model and its comparison with the Cox PH and CURE models, both regularized with the Elastic-Net.
4 Performance evaluation
In this section, we first briefly introduce the models we consider for performance comparisons. Then, we provide details regarding the simulation study and data generation. The chosen metrics for evaluating performances are then presented, followed by the results.
4.1 Competing models
The first model we consider is the Cox PH model penalized by the Elastic-Net, denoted Cox PH in the following. In this model, introduced in Cox (1972), the partial log-likelihood is given by

$\ell_n^{\text{cox}}(\beta) = \frac{1}{n} \sum_{i : \delta_i = 1} \Big( x_i^\top \beta - \log \sum_{i' : y_{i'} \ge y_i} e^{x_{i'}^\top \beta} \Big).$

We use respectively the R packages survival and glmnet (Simon et al., 2011) for the partial log-likelihood and the minimization of the following quantity:

$-\ell_n^{\text{cox}}(\beta) + \gamma \Big( \eta \|\beta\|_1 + \frac{1 - \eta}{2} \|\beta\|_2^2 \Big),$
where $\gamma$ is chosen by the same cross-validation procedure as for the Cmix model, for a given $\eta$ (see Section A of Appendices). Ties are handled via the Breslow approximation of the partial likelihood (Breslow, 1972).
We remark that the model introduced in this paper cannot be reduced to a Cox model. Indeed, the Cmix model intensity can be written (in the geometric case)

$\lambda(t \mid X = x) = \frac{\pi_\beta(x)\, \alpha_1 (1 - \alpha_1)^{t-1} + \big(1 - \pi_\beta(x)\big)\, \alpha_0 (1 - \alpha_0)^{t-1}}{\pi_\beta(x)\, (1 - \alpha_1)^{t-1} + \big(1 - \pi_\beta(x)\big)\, (1 - \alpha_0)^{t-1}},$
while it is given by Equation (1) in the Cox model.
Finally, we consider the CURE model (Farewell, 1982) penalized by the Elastic-Net and denoted CURE in the following, with a logistic function for the incidence part and a parametric survival model for the latency part, where $Z_i = 0$ means that patient $i$ is cured, $Z_i = 1$ means that patient $i$ is not cured, and $S$ denotes the survival function. In this model, we then have $S(t \mid Z_i = 0)$ constant and equal to 1. We add an Elastic-Net regularization term, and since we were not able to find any open source package where CURE models were implemented with a regularized objective, we used the QNEM algorithm in the particular case of the CURE model. We just add the constraint that the geometric distribution corresponding to the cured group of patients ($Z_i = 0$) has a parameter $\alpha_0 = 0$, which does not change over the algorithm iterations. The QNEM algorithm can be used in this particular case, where some terms have disappeared from the completed log-likelihood, since in the CURE model case we have $f_0 \equiv 0$ and $\bar F_0 \equiv 1$. Note that in the original introduction of the CURE model in Farewell (1982), the density of uncured patients directly depends on individual patient covariates, which is not the case here.
We also give additional simulation settings in Section C of Appendices. First, the case where $d > n$, including a comparison of the screening strategy we use in Section 5 with the iterative sure independence screening (ISIS) method (Fan et al., 2010). We also add simulations where data is generated according to the Cmix model with gamma distributions instead of geometric ones, and include the accelerated failure time (AFT) model (Wei, 1992) in the performance comparison study.
4.2 Simulation design
In order to assess the proposed method, we perform an extensive Monte Carlo simulation study. Since we want to compare the performances of the 3 models mentioned above, we consider 3 simulation cases for the time distribution: one for each competing model. We first choose a coefficient vector $\beta = (\zeta, \dots, \zeta, 0, \dots, 0) \in \mathbb{R}^d$, with $\zeta$ being the value of the $s$ active coefficients and $s$ a sparsity parameter. For a desired low-risk patients proportion $p_0$, the high-risk patients index set is given by

$\mathcal{H} = \big\{1, \dots, \lfloor n (1 - p_0) \rfloor \big\},$
where $\lfloor \cdot \rfloor$ denotes the largest integer less than or equal to its argument. For the generation of the covariates matrix, we first take $x_i \sim \mathcal{N}(0, \Sigma)$, with $\Sigma$ a Toeplitz covariance matrix (Mukherjee and Maiti, 1988) with correlation $\rho$. We then add a gap value on the covariates of patients $i \in \mathcal{H}$ and subtract it for patients $i \notin \mathcal{H}$, only on the active covariates plus a proportion $r_{cf}$ of the non-active covariates considered as confusion factors.
Note that this is equivalent to generating the covariates according to a Gaussian mixture.
Then, in the Cmix or CURE simulation cases, we generate $Z_i \sim \mathcal{B}\big(\pi_\beta(x_i)\big)$, where $\pi_\beta(x_i)$ is computed given Equation (3), with geometric distributions for the durations (see Section 3.3). We obtain $T_i \sim \mathcal{G}(\alpha_{Z_i})$ in the Cmix case, and the same with $\alpha_0 = 0$ for the cured group in the CURE case. For the Cox PH model, we take durations generated by inversion according to Equation (1), with $U_i \sim \mathcal{U}([0, 1])$, where $\mathcal{U}([a, b])$ stands for the uniform distribution on a segment $[a, b]$.
The distribution of the censoring variable $C_i$ is geometric $\mathcal{G}(\alpha_c)$, with $\alpha_c \in (0, 1)$. This parameter is tuned to maintain a desired censoring rate $r_c$, using a formula given in Section D of Appendices. The values of the chosen hyperparameters are summarized in Table 6.
Table 6: hyperparameter values used in the simulation study. n ∈ {100, 200, 500}; d ∈ {30, 100}; sparsity s = 10; gap ∈ {0.1, 0.3, 1}; censoring rate ∈ {0.2, 0.5}; remaining fixed values: 0.1, 0.3, 1, 0.5, 0.75, 0.01, 0.5.
Note that when simulating under the CURE model, the proportion of censored time events is at least equal to the proportion of cured patients: we then choose the largest censoring rate value for the CURE simulations only.
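The formula used to tune the censoring parameter is deferred to the appendix and not reproduced here, but the idea can be sketched numerically: with a geometric censoring variable on {1, 2, ...}, the censoring probability of a given duration is available in closed form, and a simple bisection recovers a parameter matching a target rate (all names below are ours):

```python
def censoring_rate(alpha_c, durations):
    """Average probability of being censored (C < T) when
    C ~ Geometric(alpha_c) on {1, 2, ...}: for a duration t,
    P(C < t) = 1 - (1 - alpha_c) ** (t - 1)."""
    return sum(1.0 - (1.0 - alpha_c) ** (t - 1) for t in durations) / len(durations)

def tune_alpha_c(target, durations, lo=1e-9, hi=1.0 - 1e-9, iters=60):
    """Bisection on alpha_c: the censoring rate is increasing in alpha_c,
    since a larger alpha_c yields stochastically earlier censoring times."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if censoring_rate(mid, durations) > target:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)
```

This numerical route is a stand-in for the closed-form expression in the appendix; both approaches target the same empirical censoring rate.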
Finally, we want to assess the stability of the Cmix model in terms of variable selection and compare it to the CURE and Cox PH models. To this end, we follow the same simulation procedure as explained above. For each simulation case, we vary the two hyperparameters that impact the stability of variable selection the most, that is, the gap and the confusion rate $r_{cf}$. All other hyperparameters are the same as in Table 6, with a few fixed exceptions. For a given hyperparameter configuration (gap, $r_{cf}$), we use the following approach to evaluate the variable selection power of the models: if we consider the amplitude of each estimated coefficient as a score predicting that the corresponding true coefficient is non-zero, then we are in a binary prediction setting and we use the resulting AUC of this problem. Explicit examples of such AUC computations are given in Section E of Appendices.
4.3 Metrics
We detail in this section the metrics considered to evaluate risk prediction performances. Let us denote by $M$ the marker under study. Note that $M = \pi_{\hat\beta}(X)$ in the Cmix and CURE model cases, and $M = X^\top \hat\beta$ in the Cox PH model case. We denote by $f_M$ the probability density function of the marker $M$, and assume that the marker is measured once, at $t = 0$.
For any threshold $\xi$, cumulative true positive rates and dynamic false positive rates are two functions of time respectively defined as $\text{TPR}(\xi, t) = \mathbb{P}[M > \xi \mid T \le t]$ and $\text{FPR}(\xi, t) = \mathbb{P}[M > \xi \mid T > t]$. Then, as introduced in Heagerty et al. (2000), the cumulative dynamic time-dependent AUC is defined as follows:

$\text{AUC}(t) = \mathbb{P}\big[M_i > M_j \mid T_i \le t, T_j > t\big]$

for two independent patients $i$ and $j$, which we simply denote AUC($t$) in the following. We use the Inverse Probability of Censoring Weighting (IPCW) estimate of this quantity, with a Kaplan-Meier estimator of the survival function of the censoring, as proposed in Blanche et al. (2013) and already implemented in the R package timeROC.
A common concordance measure that does not depend on time is the C-index (Harrell et al., 1996), defined by

$C = \mathbb{P}\big[M_i > M_j \mid T_i < T_j\big],$

with $i$ and $j$ two independent patients (which does not depend on $(i, j)$ under the i.i.d. sample hypothesis). In our case, $T$ is subject to right censoring, so one would typically consider the modified version

$C_\tau = \mathbb{P}\big[M_i > M_j \mid T_i < T_j,\ T_i < \tau\big],$

with $\tau$ corresponding to the fixed and prespecified follow-up period duration (Heagerty and Zheng, 2005). A Kaplan-Meier estimator for the censoring distribution leads to a nonparametric and consistent estimator of $C_\tau$ (Uno et al., 2011), already implemented in the R package survival.
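Ignoring the IPCW correction, a naive estimate of the truncated C-index can be sketched in a few lines; a pair is comparable when the earlier time is an observed event falling before the horizon tau (this simplified estimator is ours, the paper relies on the consistent IPCW version of Uno et al. (2011)):

```python
def c_index(y, delta, marker, tau=float("inf")):
    """Naive Harrell-type concordance: among comparable pairs
    (y_i < y_j, delta_i = 1, y_i < tau), count the pairs where the
    earlier-event patient has the larger risk marker; ties count 1/2."""
    num = den = 0.0
    n = len(y)
    for i in range(n):
        if delta[i] != 1 or y[i] >= tau:
            continue  # only observed events before tau can anchor a pair
        for j in range(n):
            if y[i] < y[j]:
                den += 1.0
                if marker[i] > marker[j]:
                    num += 1.0
                elif marker[i] == marker[j]:
                    num += 0.5
    return num / den
```

A value of 1 corresponds to a marker that perfectly ranks early events first, 0.5 to a non-informative marker.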
Hence, in the following, we consider both AUC($t$) and C-index metrics to assess performances.
4.4 Results of simulation
We now present the simulation results concerning the C-index metric in Table 2 (see Section F of Appendices for results on other configurations). Each value is obtained by computing the C-index average and standard deviation (in parenthesis) over 100 simulations. The average (bold line) and standard deviation (bands) over the same 100 simulations are then given in Figure 1. Note that the value of the gap can be viewed as a difficulty level of the problem: the higher the gap, the clearer the separation between the two populations (low-risk and high-risk patients).
The results measured both by AUC($t$) and C-index lead to the same conclusion: the Cmix model almost always leads to the best results, even under model misspecification, that is, when data is generated according to the CURE or Cox PH model. Namely, under CURE simulations, Cmix and CURE give very close results, with a strong improvement over Cox PH. Under Cox PH and Cmix simulations, Cmix outperforms both Cox PH and CURE. Surprisingly enough, this exhibits a strong generalization property of the Cmix model over both Cox PH and CURE. Note that this phenomenon is particularly strong for small gap values, while with an increasing gap (or an increasing sample size $n$), all procedures exhibit nearly the same performance. This can be explained first by the nonparametric baseline function of the Cox PH model, and second by the fact that, unlike the Cox PH model, the Cmix and CURE models directly exploit the mixture aspect.
Finally, Figure 2 gives the results concerning the stability of the variable selection aspect of the competing models. The Cmix model appears to be the best method regarding variable selection as well, even under model misspecification. We notice a general behaviour of our method, described in the following, which is shared by the CURE model only when the data is simulated according to itself, and which justifies the log scale for the gap in order to clearly distinguish the three following phases. For very small gap values (less than 0.3), the confusion rate does not impact the variable selection performances, since adding very small gap values to the covariates is almost imperceptible: the resulting AUC is the same when there are no confusion factors and when there are as many confusion factors as active covariates. For medium gap values (say between 0.3 and 1), the confusion factors are more difficult for the model to identify as their number goes up (that is, when $r_{cf}$ increases), which is precisely the confusion factor effect we expect to observe. Then, for large gap values (more than 1), the model succeeds in properly shrinking all confusion factors to zero, since the two subpopulations are more clearly separated regarding the covariates, and the problem naturally becomes easier as the gap increases.
Table 2: C-index mean (standard deviation in parenthesis) over 100 simulations; estimation columns are grouped by sample size n = 100, n = 200 and n = 500.

Simulation | gap | Cmix | CURE | Cox PH | Cmix | CURE | Cox PH | Cmix | CURE | Cox PH
Cmix | 0.1 | 0.786 (0.057) | 0.745 (0.076) | 0.701 (0.075) | 0.792 (0.040) | 0.770 (0.048) | 0.739 (0.055) | 0.806 (0.021) | 0.798 (0.023) | 0.790 (0.024)
Cmix | 0.3 | 0.796 (0.055) | 0.739 (0.094) | 0.714 (0.088) | 0.794 (0.036) | 0.760 (0.058) | 0.744 (0.055) | 0.801 (0.021) | 0.784 (0.027) | 0.783 (0.026)
Cmix | 1 | 0.768 (0.062) | 0.734 (0.084) | 0.756 (0.066) | 0.766 (0.043) | 0.736 (0.054) | 0.764 (0.042) | 0.772 (0.026) | 0.761 (0.027) | 0.772 (0.025)
CURE | 0.1 | 0.770 (0.064) | 0.772 (0.062) | 0.722 (0.073) | 0.790 (0.038) | 0.790 (0.038) | 0.758 (0.049) | 0.798 (0.025) | 0.799 (0.024) | 0.787 (0.025)
CURE | 0.3 | 0.733 (0.073) | 0.732 (0.072) | 0.686 (0.072) | 0.740 (0.053) | 0.741 (0.053) | 0.714 (0.060) | 0.751 (0.029) | 0.751 (0.029) | 0.738 (0.030)
CURE | 1 | 0.659 (0.078) | 0.658 (0.078) | 0.635 (0.070) | 0.658 (0.053) | 0.658 (0.053) | 0.647 (0.047) | 0.657 (0.031) | 0.657 (0.031) | 0.656 (0.032)
Cox PH | 0.1 | 0.940 (0.041) | 0.937 (0.044) | 0.850 (0.097) | 0.959 (0.021) | 0.958 (0.020) | 0.915 (0.042) | 0.964 (0.012) | 0.964 (0.012) | 0.950 (0.016)
Cox PH | 0.3 | 0.956 (0.030) | 0.955 (0.029) | 0.864 (0.090) | 0.966 (0.020) | 0.965 (0.020) | 0.926 (0.043) | 0.968 (0.013) | 0.969 (0.012) | 0.959 (0.016)
Cox PH | 1 | 0.983 (0.016) | 0.985 (0.015) | 0.981 (0.019) | 0.984 (0.012) | 0.985 (0.011) | 0.988 (0.010) | 0.984 (0.007) | 0.985 (0.006) | 0.990 (0.005)
5 Application to genetic data
In this section, we apply our method on three genetic datasets and compare its performance to the Cox PH and CURE models. We extracted normalized expression data and survival times in days from breast invasive carcinoma (BRCA), glioblastoma multiforme (GBM) and kidney renal clear cell carcinoma (KIRC) cohorts.
These datasets are available on The Cancer Genome Atlas (TCGA) platform, which aims at accelerating the understanding of the molecular basis of cancer through the application of genomic technologies, including large-scale genome sequencing. For each patient, 20531 covariates corresponding to the normalized gene expressions are available. We randomly split all datasets into a training set and a test set (30% for testing; cross-validation is done on the training set).
We compare the three models in terms of both C-index and AUC($t$) on the test sets. Inference of the Cox PH model fails in very high dimension on the considered data with the glmnet package. We therefore make a first variable selection (screening) among the 20531 covariates. To do so, we compute the C-index obtained by univariate Cox PH models (so as not to confer an advantage to our method), namely Cox PH models fitted on each covariate separately. We then order the obtained 20531 C-indexes in decreasing order and extract the covariates corresponding to the top $d = 100$, $d = 300$ and $d = 1000$ values. We then apply the three methods on the obtained covariates.
The results in terms of AUC($t$) curves are given in Figure 3, where we distinguish the Cmix model with geometric or Weibull distributions.
It appears that the performances in terms of AUC($t$) are very close between the Cmix model with geometric and with Weibull distributions, which is also validated when comparing the corresponding C-index for these two parameterizations in Table 3.
Table 3: C-index comparison between the two parameterizations of the Cmix model.

Cancer | Geometric | Weibull
BRCA | 0.782 | 0.780
GBM | 0.755 | 0.754
KIRC | 0.849 | 0.835
Similar conclusions in terms of C-index, AUC($t$) and computing time can be made on all considered datasets and for any choice of $d$. Hence, as already mentioned in Section 3.3, we only concentrate on the geometric parameterization for the Cmix model. The results in terms of C-index are then given in Table 4.
Table 4: C-index on the test sets, per cancer dataset and number of screened covariates d.

Cancer | BRCA | | | GBM | | | KIRC | |
d \ Model | Cmix | CURE | Cox PH | Cmix | CURE | Cox PH | Cmix | CURE | Cox PH
100 | 0.792 | 0.764 | 0.705 | 0.826 | 0.695 | 0.571 | 0.768 | 0.732 | 0.716
300 | 0.782 | 0.753 | 0.723 | 0.849 | 0.697 | 0.571 | 0.755 | 0.691 | 0.698
1000 | 0.817 | 0.613 | 0.577 | 0.775 | 0.699 | 0.592 | 0.743 | 0.690 | 0.685
A more direct approach to compare performances between models, rather than focusing only on the marker order aspect, is to predict the survival of patients in the test set within a specified short time. For the Cox PH model, the survival for patient $i$ in the test set is estimated by

$\hat S_i(t) = \hat S_0(t)^{\exp(x_i^\top \hat\beta)},$

where $\hat S_0$ is the estimated survival function of the baseline population ($x = 0$) obtained using the Breslow estimate of the cumulative baseline intensity (Breslow, 1972). For the CURE or the Cmix models, it is naturally estimated by

$\hat S_i(t) = \pi_{\hat\beta}(x_i)\, \hat S^{\text{high}}(t) + \big(1 - \pi_{\hat\beta}(x_i)\big)\, \hat S^{\text{low}}(t),$

where $\hat S^{\text{low}}$ and $\hat S^{\text{high}}$ are the Kaplan-Meier estimators (Kaplan and Meier, 1958) of the low- and high-risk subgroups respectively, learned by the Cmix or CURE models (patients with $\pi_{\hat\beta}(x_i) > 1/2$ are clustered in the high-risk subgroup, the others in the low-risk one). The corresponding estimated survival curves are given in Figure 4.
We observe that the subgroups obtained by the Cmix are more clearly separated in terms of survival than those obtained by the CURE model.
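The mixture survival prediction for the Cmix and CURE models can be sketched with a small Kaplan-Meier helper (a minimal reimplementation under the paper's setup; variable names are ours):

```python
def kaplan_meier(times, deltas):
    """Kaplan-Meier estimator, returned as a right-continuous step
    function t -> S(t)."""
    event_times = sorted({t for t, d in zip(times, deltas) if d == 1})
    steps, s = [], 1.0
    for t in event_times:
        at_risk = sum(1 for ti in times if ti >= t)
        d_t = sum(1 for ti, di in zip(times, deltas) if ti == t and di == 1)
        s *= 1.0 - d_t / at_risk
        steps.append((t, s))
    def S(t):
        out = 1.0
        for tk, sk in steps:
            if tk <= t:
                out = sk
            else:
                break
        return out
    return S

def predict_survival(pi_high, S_high, S_low, t):
    """Mixture prediction: subgroup Kaplan-Meier curves weighted by the
    estimated probability of belonging to the high-risk subgroup."""
    return pi_high * S_high(t) + (1.0 - pi_high) * S_low(t)
```

Each subgroup's curve is fitted on the patients clustered into that subgroup on the training set; the gating probability of a new patient then interpolates between the two curves.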
For a given time $t$, one can now use $\hat S_i(t)$ for each model to predict whether or not each patient of the test set survives beyond $t$, resulting in a binary classification problem that we assess using the classical AUC score. By moving $t$ within the first years of follow-up, the most interesting period for physicians in practice, one obtains the curves given in Figure 5.
Let us now focus on the runtime comparison between the models, given in Table 5. We choose the BRCA dataset to illustrate this point, since it is the largest one and consequently exhibits the clearest differences in computing time.
Table 5. Computing times in seconds on the BRCA dataset (the corresponding C-index is given in parentheses).

        C-mix           CURE             Cox PH
 100    0.025 (0.792)   1.992  (0.764)   0.446 (0.705)
 300    0.027 (0.782)   2.343  (0.753)   0.810 (0.723)
1000    0.139 (0.817)   12.067 (0.613)   2.145 (0.577)
We also notice that, despite using the same QNEM algorithm steps, our CURE model implementation is slower, since convergence takes more time to be reached, as Figure 6 shows.
In Section G of the Appendices, the top 20 selected genes for each cancer type and for all models are presented. The literature on those genes is mined to derive two simple scores quantifying how strongly they are related, first to cancer in general, and second to cancer together with the survival aspect, according to scientific publications. It turns out that some genes have been widely studied in the literature (e.g. FLT3 for the GBM cancer), while very few publications were retrieved for others (e.g. TRMT2B, again for the GBM cancer).
6 Concluding remarks
In this paper, a mixture model for censored durations (C-mix) has been introduced, and a new efficient estimation algorithm (QNEM) has been derived, which penalizes the likelihood in order to perform covariate selection and to prevent overfitting. A strong improvement is provided over the CURE and Cox PH approaches (both penalized by the Elastic-Net), which are by far the most widely used for biomedical data analysis. More importantly, our method detects relevant subgroups of patients regarding their risk in a supervised learning procedure, and takes advantage of this identification to improve survival prediction over more standard methods. An extensive Monte Carlo simulation study has been carried out to evaluate the performance of the developed estimation procedure; it showed that our approach is robust to model misspecification. The proposed methodology has then been applied to three high-dimensional datasets, on which C-mix outperforms both Cox PH and CURE in terms of AUC($t$), C-index and survival prediction. Moreover, many gene expressions pinpointed by the feature selection aspect of our regularized method are relevant for medical interpretation (e.g. NFKBIA, LEF1, SUSD3 or FAIM3 for the BRCA cancer, see Zhou et al. (2007) or Oskarsson et al. (2011)), while others call for further investigation by the genetic research community. Finally, our analysis provides, as a by-product, a new robust implementation of CURE models in high dimension.
Software
All the methodology discussed in this paper is implemented in Python. The code is available from https://github.com/SimonBussy/Cmix in the form of annotated programs, together with a notebook tutorial.
Acknowledgements
The results shown in this paper are based upon data generated by the TCGA Research Network and freely available from http://cancergenome.nih.gov/. Conflict of Interest: None declared.
Appendices
Appendix A Numerical details
Let us first give some details about the starting point of Algorithm 1. The regression coefficients are simply initialized at the zero vector, while the parameters of the duration distributions are initialized by fitting a censored parametric mixture model on the observed times with an EM algorithm.
Concerning the V-fold cross-validation procedure used to tune the penalization level $\gamma$, the cross-validation metric is the C-index. Let us specify that we choose $\gamma$ as the largest value such that the cross-validation error is within one standard error of the minimum, and that a grid search is made during the cross-validation on an interval $(0, \gamma_{\max}]$, whose upper bound $\gamma_{\max}$ is computed in the following.
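The one-standard-error selection rule described above can be sketched as follows, adapted here to the C-index (which is maximized, whereas an error would be minimized); the variable names and the score layout are hypothetical:

```python
import numpy as np

def one_standard_error_rule(gammas, fold_scores):
    """Return the largest gamma whose mean cross-validated C-index lies
    within one standard error of the best mean score.

    gammas: candidate penalization levels, sorted in increasing order.
    fold_scores: array of shape (len(gammas), V) of per-fold C-indexes.
    """
    fold_scores = np.asarray(fold_scores, dtype=float)
    means = fold_scores.mean(axis=1)
    sems = fold_scores.std(axis=1, ddof=1) / np.sqrt(fold_scores.shape[1])
    best = means.argmax()
    threshold = means[best] - sems[best]
    eligible = np.where(means >= threshold)[0]
    return gammas[eligible.max()]   # strongest penalization still "as good"
```

Choosing the largest eligible $\gamma$ favors sparser models at a negligible cost in predictive performance.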
Let us consider the following convex minimization problem resulting from Equation (8), at a given step $l$:
$$\hat\beta \in \operatorname*{argmin}_{\beta}\; F_n(\beta) + \gamma\Big(\eta \|\beta\|_1 + \frac{1-\eta}{2}\|\beta\|_2^2\Big),$$
where $F_n$ denotes the smooth part of the objective and $\eta$ the Elastic-Net mixing parameter. Regarding the grid of candidate values for $\gamma$, we consider a grid on the interval $(0, \gamma_{\max}]$. At $\gamma = \gamma_{\max}$, all coefficients of $\hat\beta$ are exactly zero. The KKT conditions (Boyd and Vandenberghe, 2004) claim that
$$\nabla_j F_n(\hat\beta) + \gamma(1-\eta)\hat\beta_j = -\gamma\eta\,\operatorname{sign}(\hat\beta_j) \quad \text{for all } j \in A(\hat\beta),$$
where $A(\hat\beta)$ is the active set of the estimator, and $|\nabla_j F_n(\hat\beta)| \le \gamma\eta$ for all $j \notin A(\hat\beta)$. Then, using (10), one obtains that $\hat\beta = 0$ satisfies these conditions as soon as $\gamma\eta \ge \max_j |\nabla_j F_n(0)|$. Hence, we choose the following upper bound for the grid-search interval during the cross-validation procedure:
$$\gamma_{\max} = \frac{1}{\eta} \max_j \big|\nabla_j F_n(0)\big|.$$
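As a concrete illustration of this bound, here is a sketch for the plain least-squares loss $F_n(\beta) = \frac{1}{2n}\|y - X\beta\|_2^2$; the actual loss in the paper is the smooth surrogate from Equation (8), so the closed-form gradient below is only illustrative, and `eta` denotes the Elastic-Net $\ell_1$ mixing weight:

```python
import numpy as np

def gamma_max(X, y, eta=0.9):
    """Smallest penalization level for which the Elastic-Net solution of
    the least-squares problem (1 / 2n) * ||y - X b||^2 is exactly zero.

    From the KKT conditions at b = 0: all coefficients stay at zero as
    soon as gamma * eta >= max_j |x_j^T y| / n.
    """
    n = X.shape[0]
    return np.max(np.abs(X.T @ y)) / (n * eta)

X = np.array([[1., 0.], [0., 2.]])
y = np.array([1., 1.])
print(gamma_max(X, y, eta=1.0))  # -> 1.0
```

A logarithmic grid such as `gamma_max(X, y) * np.logspace(-3, 0, 50)` then covers the cross-validation interval.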
Appendix B Proof of Theorem 1
Let us denote by $m$ the number of coordinates of the parameter vector $\theta$, so that one can write $\theta = (\theta_1, \ldots, \theta_m)$.

We denote by $\bar\theta$ a cluster point of the sequence $(\theta^{(l)})_{l \ge 0}$ generated by the QNEM algorithm, and by $B(\bar\theta, \epsilon)$ the $\epsilon$-neighbourhood of $\bar\theta$. We want to prove that $\bar\theta$ is a stationary point of the non-differentiable function $\ell_n^{\mathrm{pen}}$, which means (Tseng, 2001):
$$\liminf_{h \downarrow 0} \frac{\ell_n^{\mathrm{pen}}(\bar\theta + h u) - \ell_n^{\mathrm{pen}}(\bar\theta)}{h} \ge 0 \quad \text{for all directions } u. \tag{11}$$
The proof is inspired by Bertsekas (1995). The conditional density of the complete data given the observed data can be written
$$k(z \mid y; \theta) = \frac{f(y, z; \theta)}{f(y; \theta)}.$$
Then, one has
$$\ell_n(\theta) = Q_n(\theta, \theta^{(l)}) - H_n(\theta, \theta^{(l)}), \tag{12}$$
where we introduced $H_n(\theta, \theta^{(l)}) = \mathbb{E}\big[-\log k(Z \mid y; \theta) \,\big|\, y; \theta^{(l)}\big]$. The key argument relies on the following facts, which hold under Hypotheses (3) and (4):

- $Q_n^{\mathrm{pen}}$ is continuous in $\theta$ and $\theta^{(l)}$,
- for any fixed $\theta^{(l)}$ (at the $l$-th M step of the algorithm), $Q_n^{\mathrm{pen}}(\cdot\,, \theta^{(l)})$ is convex in $\theta$ and strictly convex in each coordinate of $\theta$.

Let $u$ be an arbitrary direction; then Equations (11) and (12) yield
$$\ell_n^{\mathrm{pen}\,\prime}(\bar\theta; u) = Q_n^{\mathrm{pen}\,\prime}(\bar\theta, \bar\theta; u) - H_n'(\bar\theta, \bar\theta; u),$$
the derivatives being taken with respect to the first argument. Hence, by Jensen's inequality we get
$$H_n(\theta, \bar\theta) \ge H_n(\bar\theta, \bar\theta) \quad \text{for all } \theta, \tag{13}$$
and so $H_n(\cdot\,, \bar\theta)$ is minimized for $\theta = \bar\theta$; then we have $H_n'(\bar\theta, \bar\theta; u) = 0$. It remains to prove that $Q_n^{\mathrm{pen}\,\prime}(\bar\theta, \bar\theta; u) \ge 0$. Let us focus on the proof of the following expression:
$$Q_n^{\mathrm{pen}}(\bar\theta, \bar\theta) \le Q_n^{\mathrm{pen}}(\theta, \bar\theta) \tag{14}$$
for all $\theta$ differing from $\bar\theta$ in a single coordinate. Denoting $(\theta^{(l_r)})_{r \ge 0}$ a subsequence of $(\theta^{(l)})_{l \ge 0}$, from the definition of the QNEM algorithm we first have
$$Q_n^{\mathrm{pen}}(\theta^{(l+1)}, \theta^{(l)}) \le Q_n^{\mathrm{pen}}(\theta, \theta^{(l)}) \tag{15}$$
for all $\theta$ differing from $\theta^{(l+1)}$ only in the coordinate updated at step $l$, and second $Q_n^{\mathrm{pen}}(\theta^{(l+1)}, \theta^{(l)}) \le Q_n^{\mathrm{pen}}(\theta^{(l)}, \theta^{(l)})$ for all $l$. Consequently, if $(\theta^{(l_r)})_{r \ge 0}$ converges to $\bar\theta$, one obtains (14) by continuity, taking the limit