An Efficient Training Algorithm for Kernel Survival Support Vector Machines
Survival analysis is a fundamental tool in medical research to identify predictors of adverse events and develop systems for clinical decision support. In order to leverage large amounts of patient data, efficient optimisation routines are paramount. We propose an efficient training algorithm for the kernel survival support vector machine (SSVM). We directly optimise the primal objective function and employ truncated Newton optimisation and order statistic trees to significantly lower computational costs compared to previous training algorithms, which require $O(n^4)$ space and $O(p\,n^6)$ time for datasets with $n$ samples and $p$ features. Our results demonstrate that our proposed optimisation scheme allows analysing data of a much larger scale with no loss in prediction performance. Experiments on synthetic and 5 real-world datasets show that our technique outperforms existing kernel SSVM formulations if the amount of right censoring is high ($\geq 85\%$), and performs comparably otherwise.
Keywords: survival analysis · support vector machine · optimisation · kernel-based learning
In clinical research, the primary interest is often the time until occurrence of an adverse event, such as death or reaching a specific state of disease progression. In survival analysis, the objective is to establish a connection between a set of features and the time until an event of interest. It differs from traditional machine learning because parts of the training data can only be partially observed. In a clinical study, patients are often monitored for a particular time period, and events occurring during that period are recorded. If a patient experiences an event, the exact time of the event can be recorded – the time of the event is uncensored. In contrast, if a patient remained event-free during the study period, it is unknown whether an event has or has not occurred after the study ended – the time of an event is right censored at the end of the study.
Cox’s proportional hazards model (CoxPH) is the standard for analysing time-to-event data. However, its decision function is linear in the covariates, which can lead to poor predictive performance if non-linearities and interactions are not modelled explicitly. Depending on the level of complexity, researchers might be forced to try many different model formulations, which is cumbersome. The success of kernel methods in machine learning has motivated researchers to propose kernel-based survival models, which ease analysis in the presence of non-linearities, and allow analysing complex data in the form of graphs or strings by means of appropriate kernel functions (e.g. [23, 35]). Thus, instead of merely describing patients by feature vectors, structured and more expressive forms of representation can be employed, such as gene co-expression networks [36].
A kernel-based CoxPH model was proposed in [24, 2]. The authors of [19, 28] cast survival analysis as a regression problem and adapted support vector regression, whereas the authors of [11, 31] cast it as a learning-to-rank problem by adapting the rank support vector machine (Rank SVM). Eleuteri et al. [10] formulated a model based on quantile regression. A transformation model with minimal Lipschitz smoothness for survival analysis was proposed in [33]. Finally, Van Belle et al. [34] proposed a hybrid ranking-regression model.
In this paper, we focus on improving the optimisation scheme of the non-linear ranking-based survival support vector machine (SSVM). Existing training algorithms [11, 31] perform optimisation in the dual and require $O(p\,n^6)$ time – excluding evaluations of the kernel function – and $O(n^4)$ space, where $p$ and $n$ are the number of features and samples, respectively. Recently, an efficient training algorithm for the linear SSVM with much lower time complexity and linear space complexity has been proposed [26]. We extend this optimisation scheme to the non-linear case and demonstrate its superiority on synthetic and real-world datasets. Our implementation of the proposed training algorithm is available online at https://github.com/tum-camp/survival-support-vector-machine.
Given a dataset $\mathcal{D}$ of $n$ samples, let $\mathbf{x}_i \in \mathbb{R}^d$ denote a $d$-dimensional feature vector, $t_i > 0$ the time of an event, and $c_i > 0$ the time of censoring of the $i$-th sample. Due to right censoring, it is only possible to observe $y_i = \min(t_i, c_i)$ and $\delta_i = I(t_i \leq c_i)$ for every sample, with $I(\cdot)$ being the indicator function and $c_i = \infty$ for uncensored records. Hence, training a survival model is based on a set of triplets: $\mathcal{D} = \{(\mathbf{x}_i, y_i, \delta_i)\}_{i=1}^n$. After training, a survival model ought to predict a risk score of experiencing an event based on a set of given features.
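The observed triplets can be sketched in a few lines of Python; the toy data below is illustrative (the exponential distributions and variable names are our own choices, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: latent event and censoring times drawn at random.
n, d = 5, 3
X = rng.normal(size=(n, d))            # feature vectors x_i
event_time = rng.exponential(2.0, n)   # latent event times t_i
cens_time = rng.exponential(2.0, n)    # latent censoring times c_i

# Only y_i = min(t_i, c_i) and delta_i = I(t_i <= c_i) are observable.
y = np.minimum(event_time, cens_time)
delta = (event_time <= cens_time).astype(int)

# A survival model is trained on the triplets (x_i, y_i, delta_i).
triplets = list(zip(X, y, delta))
```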
2.1 The Survival Support Vector Machine
The SSVM is an extension of the Rank SVM [13] to right censored survival data [31, 11]. Consequently, survival analysis is cast as a learning-to-rank problem: patients with a lower survival time should be ranked before patients with a longer survival time. In the absence of censoring – as is the case for the traditional Rank SVM – all pairwise comparisons of samples are used during training. However, if samples are right censored, some pairwise relationships are invalid. When comparing two censored samples $i$ and $j$ ($\delta_i = \delta_j = 0$), it is unknown whether the $i$-th sample should be ranked before the $j$-th sample or vice versa, because the exact time of an event is unknown for both samples. The same applies when comparing an uncensored sample $i$ with a censored sample $j$ ($\delta_i = 1$ and $\delta_j = 0$) if $y_j < y_i$. Therefore, a pairwise comparison is only valid if the sample with the lower observed time is uncensored. Formally, the set of valid comparable pairs is given by

$$\mathcal{P} = \{(i, j) \mid y_i > y_j \wedge \delta_j = 1\},$$

where it is assumed that all observed time points are unique. Training a SSVM based on the hinge loss amounts to solving the optimisation problem

$$\min_{\mathbf{w}}\ \frac{1}{2} \mathbf{w}^\top \mathbf{w} + \gamma \sum_{(i,j) \in \mathcal{P}} \max\big(0,\ 1 - \mathbf{w}^\top(\mathbf{x}_i - \mathbf{x}_j)\big), \tag{1}$$

where $\mathbf{w}$ are the model’s coefficients and $\gamma > 0$ controls the degree of regularisation.
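The construction of the comparable pairs and the hinge-loss sum of eq. (1) can be sketched as follows; this is a naive reference in Python, not the paper's implementation, and all names are illustrative:

```python
import numpy as np

def comparable_pairs(y, delta):
    """Valid pairs (i, j): y_i > y_j and the sample with the lower time is uncensored."""
    n = len(y)
    return [(i, j) for i in range(n) for j in range(n)
            if y[i] > y[j] and delta[j] == 1]

def hinge_objective(w, X, y, delta, gamma):
    """Hinge-loss SSVM objective for a linear decision function, summed over P."""
    loss = sum(max(0.0, 1.0 - w @ (X[i] - X[j]))
               for i, j in comparable_pairs(y, delta))
    return 0.5 * w @ w + gamma * loss

y = np.array([2.0, 5.0, 3.0, 6.0])
delta = np.array([1, 0, 0, 1])     # samples 1 and 2 are right censored
print(comparable_pairs(y, delta))  # [(1, 0), (2, 0), (3, 0)]

X = np.array([[0.0], [1.0], [2.0], [3.0]])
val = hinge_objective(np.array([0.0]), X, y, delta, gamma=1.0)
```

Note that sample 3, although uncensored, cannot be ranked against the censored samples 1 and 2, because their censoring times precede $y_3$.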
Without censoring, all possible pairs of samples have to be considered during training, hence $|\mathcal{P}| = n(n-1)/2$ and the sum in (1) consists of a quadratic number of addends with respect to the number of training samples. If part of the survival times are censored, the size of $\mathcal{P}$ depends on the amount of uncensored records and the order of observed time points – censored and uncensored. Let $q_e$ denote the percentage of uncensored time points; then $|\mathcal{P}|$ is at least $q_e n (q_e n - 1)/2$. This situation arises if all censored subjects drop out before the first event was observed, hence all uncensored records are incomparable to all censored records. If the situation is reversed and the first censored time point occurs after the last time point of an observed event, all uncensored records can be compared to all censored records, which means $|\mathcal{P}| = q_e n (q_e n - 1)/2 + (1 - q_e)\, q_e n^2$. In both cases, $|\mathcal{P}|$ is of the order of $O(n^2)$ and the number of addends in optimisation problem (1) is quadratic in the number of samples. In [31, 11], the objective function (1) is minimised by solving the corresponding Lagrange dual problem:
$$\min_{\boldsymbol{\alpha}}\ \frac{1}{2} \boldsymbol{\alpha}^\top \mathbf{A} \mathbf{K} \mathbf{A}^\top \boldsymbol{\alpha} - \mathbf{1}^\top \boldsymbol{\alpha} \quad \text{subject to}\quad 0 \leq \alpha_k \leq \gamma \ \ \forall k = 1, \dots, l, \tag{2}$$

where $l = |\mathcal{P}|$, $\mathbf{1}$ is an $l$-dimensional vector of all ones, $\boldsymbol{\alpha}$ are the Lagrangian multipliers, and $\mathbf{A} \in \{-1, 0, 1\}^{l \times n}$ is a sparse matrix with $\mathbf{A}_{k,i} = 1$ and $\mathbf{A}_{k,j} = -1$ if $(i, j)$ is the $k$-th pair in $\mathcal{P}$, and zero otherwise. It is easy to see that this approach quickly becomes intractable, because constructing the matrix $\mathbf{A} \mathbf{K} \mathbf{A}^\top$ requires $O(n^4)$ space and solving the quadratic problem (2) requires $O(n^6)$ time.
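A back-of-the-envelope calculation illustrates the blow-up: even for a few hundred uncensored samples, the dense $l \times l$ matrix of the dual no longer fits comfortably in memory.

```python
n = 100                      # training samples, fully uncensored
l = n * (n - 1) // 2         # number of comparable pairs |P|
bytes_dense = l * l * 8      # dense l x l float64 matrix A K A^T

print(l)                     # 4950
print(bytes_dense / 1e6)     # about 196 MB for just 100 samples
```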
Van Belle et al. [32] addressed this problem by reducing the size of $\mathcal{P}$ to $O(n)$ elements: they only considered pairs $(i, j)$, where $j$ is the largest uncensored sample with $y_i > y_j$. However, this approach effectively simplifies the objective function (1) and usually leads to a different solution. In [26], we proposed an efficient optimisation scheme for solving (1) by substituting the squared hinge loss for the hinge loss, and using truncated Newton optimisation and order statistic trees to avoid explicitly constructing all pairwise comparisons, resulting in a much reduced time and space complexity. However, the optimisation scheme is only applicable to training a linear model. To circumvent this problem, the data can be transformed via Kernel PCA [27] before training, which effectively results in a non-linear model in the original feature space [26, 4]. The disadvantage of this approach is that it requires $O(p\,n^2)$ operations to construct the kernel matrix – assuming evaluating the kernel function costs $O(p)$ – and $O(n^3)$ operations to perform singular value decomposition. Kuo et al. [21] proposed an alternative approach for Rank SVM that allows directly optimising the primal objective function in the non-linear case too. It is natural to adapt this approach for training a non-linear SSVM, which we will describe next.
2.2 Efficient Training of a Kernel Survival Support Vector Machine
The main idea to obtain a non-linear decision function is that the objective function (1) is reformulated with respect to finding a function $f$ from a reproducing kernel Hilbert space $\mathcal{H}_K$ with associated kernel function $k(\cdot, \cdot)$ that maps the input $\mathbf{x} \in \mathcal{X}$ to a real value (usually $\mathcal{X} \subset \mathbb{R}^d$):

$$\min_{f \in \mathcal{H}_K}\ \frac{1}{2} \|f\|_{\mathcal{H}_K}^2 + \frac{\gamma}{2} \sum_{(i,j) \in \mathcal{P}} \max\big(0,\ 1 - (f(\mathbf{x}_i) - f(\mathbf{x}_j))\big)^2,$$
where we substituted the squared hinge loss for the hinge loss, because the former is differentiable. Using the representer theorem [20, 3, 21], the function $f$ can be expressed as $f(\mathbf{x}) = \sum_{m=1}^{n} \beta_m k(\mathbf{x}, \mathbf{x}_m)$, which results in the objective function

$$\min_{\boldsymbol{\beta}}\ \frac{1}{2} \|f\|_{\mathcal{H}_K}^2 + \frac{\gamma}{2} \sum_{(i,j) \in \mathcal{P}} \max\Big(0,\ 1 - \sum_{m=1}^{n} \beta_m \big(k(\mathbf{x}_i, \mathbf{x}_m) - k(\mathbf{x}_j, \mathbf{x}_m)\big)\Big)^2,$$
where the norm $\|f\|_{\mathcal{H}_K}^2$ can be computed by using the reproducing kernel property: $\|f\|_{\mathcal{H}_K}^2 = \langle f, f \rangle_{\mathcal{H}_K} = \sum_{i=1}^{n} \sum_{j=1}^{n} \beta_i \beta_j k(\mathbf{x}_i, \mathbf{x}_j) = \boldsymbol{\beta}^\top \mathbf{K} \boldsymbol{\beta}$. The objective function can be expressed in matrix form through the symmetric positive definite kernel matrix $\mathbf{K} \in \mathbb{R}^{n \times n}$ with entries $\mathbf{K}_{i,j} = k(\mathbf{x}_i, \mathbf{x}_j)$:

$$R(\boldsymbol{\beta}) = \frac{1}{2} \boldsymbol{\beta}^\top \mathbf{K} \boldsymbol{\beta} + \frac{\gamma}{2} (\mathbf{1} - \mathbf{A}\mathbf{K}\boldsymbol{\beta})^\top \mathbf{D}_w (\mathbf{1} - \mathbf{A}\mathbf{K}\boldsymbol{\beta}), \tag{3}$$
where $\boldsymbol{\beta} \in \mathbb{R}^n$ are the coefficients and $\mathbf{D}_w \in \mathbb{R}^{l \times l}$ is a diagonal matrix that has an entry for each $(i, j) \in \mathcal{P}$ that indicates whether this pair is a support pair, i.e., $1 - (f(\mathbf{x}_i) - f(\mathbf{x}_j)) > 0$. For the $k$-th item of $\mathcal{P}$, representing the pair $(i, j)$, the corresponding entry in $\mathbf{D}_w$ is defined as

$$(\mathbf{D}_w)_{k,k} = \begin{cases} 1 & \text{if } f(\mathbf{x}_j) > f(\mathbf{x}_i) - 1, \\ 0 & \text{otherwise.} \end{cases}$$
The objective function (3) of the non-linear SSVM is similar to that of the linear model discussed in [26]; in fact, $R(\boldsymbol{\beta})$ is differentiable and convex with respect to $\boldsymbol{\beta}$, which allows employing truncated Newton optimisation [7]. The first- and second-order partial derivatives have the form

$$\frac{\partial R}{\partial \boldsymbol{\beta}} = \mathbf{K}\boldsymbol{\beta} + \gamma\, \mathbf{K} \mathbf{A}^\top \mathbf{D}_w (\mathbf{A}\mathbf{K}\boldsymbol{\beta} - \mathbf{1}), \tag{4}$$

$$\frac{\partial^2 R}{\partial \boldsymbol{\beta}\, \partial \boldsymbol{\beta}^\top} = \mathbf{K} + \gamma\, \mathbf{K} \mathbf{A}^\top \mathbf{D}_w \mathbf{A} \mathbf{K}, \tag{5}$$
where the generalised Hessian is used in the second derivative, because $R(\boldsymbol{\beta})$ is not twice differentiable at points where $f(\mathbf{x}_i) - f(\mathbf{x}_j) = 1$ for some $(i, j) \in \mathcal{P}$.
Note that the expression $\mathbf{A}\mathbf{K}\boldsymbol{\beta}$ appears in eqs. (3) to (5). Right multiplying $\mathbf{A}^\top$ by the diagonal matrix $\mathbf{D}_w$ has the effect that rows not corresponding to support pairs – pairs for which $(\mathbf{D}_w)_{k,k} = 0$ – are dropped from the matrix $\mathbf{A}$. Thus, $\mathbf{A}^\top \mathbf{D}_w \mathbf{A}$ can be simplified by expressing it in terms of a new matrix $\mathbf{A}_w \in \{-1, 0, 1\}^{l_w \times n}$ as $\mathbf{A}^\top \mathbf{D}_w \mathbf{A} = \mathbf{A}_w^\top \mathbf{A}_w$, where $l_w$ denotes the number of support pairs in $\mathcal{P}$. Hence, the objective function and its derivatives can be compactly expressed as

$$R(\boldsymbol{\beta}) = \frac{1}{2} \boldsymbol{\beta}^\top \mathbf{K} \boldsymbol{\beta} + \frac{\gamma}{2} \big( l_w - 2\,\mathbf{1}^\top \mathbf{A}_w \mathbf{K}\boldsymbol{\beta} + \boldsymbol{\beta}^\top \mathbf{K} \mathbf{A}_w^\top \mathbf{A}_w \mathbf{K} \boldsymbol{\beta} \big),$$

$$\frac{\partial R}{\partial \boldsymbol{\beta}} = \mathbf{K}\boldsymbol{\beta} + \gamma\, \mathbf{K} \big( \mathbf{A}_w^\top \mathbf{A}_w \mathbf{K} \boldsymbol{\beta} - \mathbf{A}_w^\top \mathbf{1} \big), \qquad \frac{\partial^2 R}{\partial \boldsymbol{\beta}\, \partial \boldsymbol{\beta}^\top} = \mathbf{K} + \gamma\, \mathbf{K} \mathbf{A}_w^\top \mathbf{A}_w \mathbf{K}.$$
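The squared-hinge objective restricted to support pairs can be sketched directly; the Python below is a naive reference that enumerates the pairs explicitly (it is not the efficient algorithm, and all names are illustrative):

```python
import numpy as np

def ssvm_objective(beta, K, pairs, gamma):
    """Squared-hinge SSVM objective R(beta); only support pairs contribute."""
    f = K @ beta                                   # predicted scores f(x) = K beta
    margins = np.array([f[i] - f[j] for i, j in pairs])
    support = margins < 1.0                        # diagonal of D_w
    loss = np.sum((1.0 - margins[support]) ** 2)
    return 0.5 * beta @ K @ beta + 0.5 * gamma * loss

# Toy check: with beta = 0 every comparable pair is a support pair with margin 0,
# so the loss term is |P| and the regulariser vanishes.
K = np.eye(3)
pairs = [(1, 0), (2, 0)]
print(ssvm_objective(np.zeros(3), K, pairs, gamma=1.0))  # 1.0
```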
The gradient and Hessian of the non-linear SSVM share properties with the corresponding functions of the linear model. Therefore, we can adapt the efficient training algorithm for the linear SSVM [26] with only small modifications, thereby avoiding explicitly constructing the matrix $\mathbf{A}_w$, which would require $O(n^2)$ space. Since the derivation for the non-linear case is very similar to the linear case, we only present the final result here and refer to [26] for details.
In each iteration of truncated Newton optimisation, a Hessian-vector product $\frac{\partial^2 R}{\partial \boldsymbol{\beta}\, \partial \boldsymbol{\beta}^\top}\, \mathbf{v}$ needs to be computed. The second term in this product involves $\mathbf{K} \mathbf{A}_w^\top \mathbf{A}_w \mathbf{K} \mathbf{v}$ and becomes

$$\big( \mathbf{A}_w^\top \mathbf{A}_w \mathbf{K} \mathbf{v} \big)_i = (l_i^+ + l_i^-)\, (\mathbf{K}\mathbf{v})_i - \sigma_i^+ - \sigma_i^-, \tag{6}$$

where, in analogy to the linear SSVM,

$$SV_i^+ = \{ s \mid y_s > y_i \wedge f(\mathbf{x}_s) < f(\mathbf{x}_i) + 1 \wedge \delta_i = 1 \}, \qquad SV_i^- = \{ s \mid y_s < y_i \wedge f(\mathbf{x}_s) > f(\mathbf{x}_i) - 1 \wedge \delta_s = 1 \},$$

$l_i^+ = |SV_i^+|$, $l_i^- = |SV_i^-|$, $\sigma_i^+ = \sum_{s \in SV_i^+} (\mathbf{K}\mathbf{v})_s$, and $\sigma_i^- = \sum_{s \in SV_i^-} (\mathbf{K}\mathbf{v})_s$. The values $l_i^+$, $l_i^-$, $\sigma_i^+$, and $\sigma_i^-$ can be obtained in logarithmic time by first sorting according to the predicted scores $f(\mathbf{x}_i) = (\mathbf{K}\boldsymbol{\beta})_i$ and subsequently incrementally constructing one order statistic tree to hold $f(\mathbf{x}_i)$ and $(\mathbf{K}\mathbf{v})_i$, respectively [26, 22, 21]. Finally, the risk score of experiencing an event for a new data point $\mathbf{x}_{\text{new}}$ can be estimated by $\hat{f}(\mathbf{x}_{\text{new}}) = \sum_{i=1}^{n} \beta_i k(\mathbf{x}_i, \mathbf{x}_{\text{new}})$.
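The regrouping behind eq. (6) can be verified naively: building $\mathbf{A}_w$ row by row and accumulating per-sample contributions give identical results; order statistic trees merely obtain the per-sample counts and sums in logarithmic instead of linear time. The sketch below is illustrative, with all support pairs given explicitly:

```python
import numpy as np

n = 6
rng = np.random.default_rng(1)
u = rng.normal(size=n)                                # u plays the role of K v
pairs = [(i, j) for i in range(n) for j in range(i)]  # pretend all are support pairs

# Naive: explicit A_w with one row per support pair (+1 at i, -1 at j).
A = np.zeros((len(pairs), n))
for k, (i, j) in enumerate(pairs):
    A[k, i], A[k, j] = 1.0, -1.0
naive = A.T @ A @ u

# Grouped per sample: row k contributes (u_i - u_j) to component i and its
# negation to component j -- the count/sum structure of eq. (6).
grouped = np.zeros(n)
for i, j in pairs:
    grouped[i] += u[i] - u[j]
    grouped[j] += u[j] - u[i]

print(np.allclose(naive, grouped))  # True
```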
The overall complexity of the training algorithm for a linear SSVM is

$$O\big( (n \log n + n p) \cdot \bar{N}_{\text{CG}} \cdot N_{\text{Newton}} \big),$$
where $\bar{N}_{\text{CG}}$ and $N_{\text{Newton}}$ are the average number of conjugate gradient iterations and the total number of Newton updates, respectively [26]. For the non-linear model, we first have to construct the kernel matrix $\mathbf{K}$. If $\mathbf{K}$ cannot be stored in memory, computing the product $\mathbf{K}\mathbf{v}$ requires $n^2$ evaluations of the kernel function and $O(n^2)$ operations to compute the product. If evaluating the kernel function costs $O(p)$, the overall complexity is $O(p\,n^2)$. Thus, computing the Hessian-vector product in the non-linear case consists of three steps, which have the following complexities:
- $O(p\,n^2)$ to compute $(\mathbf{K}\mathbf{v})_i$ for all $i = 1, \dots, n$,
- $O(n \log n)$ to sort samples according to the values of $\mathbf{K}\boldsymbol{\beta}$,
- $O(n \log n)$ to calculate the Hessian-vector product via (6).
This clearly shows that, in contrast to training a linear model, computing the sum over all comparable pairs is no longer the most time-consuming task in minimising $R(\boldsymbol{\beta})$ in eq. (3). Instead, computing the product $\mathbf{K}\mathbf{v}$ is the dominating factor.
If the number of samples in the training data is small, the kernel matrix $\mathbf{K}$ can be computed once and stored in memory thereafter, which results in a one-time cost of $O(p\,n^2)$. It reduces the cost of computing $\mathbf{K}\mathbf{v}$ to $O(n^2)$, while the remaining costs stay the same. Although pre-computing the kernel matrix is an improvement, computing $\mathbf{K}\mathbf{v}$ in each conjugate gradient iteration remains the bottleneck. The overall complexity of training a non-linear SSVM with truncated Newton optimisation and order statistic trees is

$$O(p\,n^2) + O\big( (n^2 + n \log n) \cdot \bar{N}_{\text{CG}} \cdot N_{\text{Newton}} \big).$$
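The effect of caching is easy to sketch: the $O(p\,n^2)$ kernel construction is paid once, after which each conjugate gradient iteration only performs an $O(n^2)$ matrix-vector product. The RBF kernel and sizes below are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 200, 5
X = rng.normal(size=(n, p))

# One-time O(p n^2) cost: precompute the RBF kernel matrix K_ij = exp(-||x_i - x_j||^2 / 2).
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
K = np.exp(-0.5 * sq_dists)

# Every subsequent Hessian-vector product only needs K @ v, i.e. O(n^2).
v = rng.normal(size=n)
Kv = K @ v
```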
Note that direct optimisation of the non-linear objective function is preferred over using Kernel PCA to transform the data before training, because it avoids the $O(n^3)$ operations corresponding to the singular value decomposition of $\mathbf{K}$.
3 Comparison of Survival Support Vector Machines
3.1.1 Synthetic Data
Synthetic survival data of varying size was generated following [1]. Each dataset consisted of one uniformly distributed feature denoting age, one binary feature denoting sex, drawn from a binomial distribution with probability 0.5, and a categorical feature with 3 equally distributed levels. In addition, 10 numeric features were sampled from a multivariate normal distribution. Survival times were drawn from a Weibull distribution with shape parameter $\nu = 1$ (constant hazard rate) and scale parameter $\lambda$ according to the formula presented in [1]: $t_i = \big( -\log(u_i) / (\lambda \exp(m(\mathbf{x}_i))) \big)^{1/\nu}$, where $u_i$ is uniformly distributed within $[0, 1]$, $m(\cdot)$ denotes a non-linear model that relates the features to the survival time (see below), and $\mathbf{x}_i$ is the feature vector of the $i$-th subject. The censoring time was drawn from a uniform distribution in the interval $[0, \tau]$, where $\tau$ was chosen such that about 20% of survival times were censored in the training data. Survival times in the test data were not subject to censoring to eliminate the effect of censoring on performance estimation. The non-linear model $m(\cdot)$ was defined as
where C1, C2, and C3 correspond to dummy codes of the categorical feature with three levels and N1 to N10 to the continuous features sampled from the multivariate normal distribution. Note that the 3rd, 9th, and 10th numeric features are associated with zero coefficients and thus do not affect the survival time. We generated 100 pairs of training and test data of 1,500 samples each by multiplying the coefficients by a uniformly drawn random scaling factor.
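The inverse-transform sampling of survival times following Bender et al. can be sketched as below; the values of $\lambda$ and $\nu$ and the null predictor are placeholder choices for illustration, not the paper's exact settings:

```python
import numpy as np

def weibull_survival_times(m_x, lam=1.0, nu=1.0, rng=None):
    """Draw t = (-log(u) / (lam * exp(m(x))))**(1/nu), per Bender et al.

    nu = 1 yields a constant baseline hazard rate lam."""
    rng = rng or np.random.default_rng()
    u = rng.uniform(size=len(m_x))
    return (-np.log(u) / (lam * np.exp(m_x))) ** (1.0 / nu)

rng = np.random.default_rng(3)
t = weibull_survival_times(np.zeros(100_000), rng=rng)  # null predictor m(x) = 0
print(round(t.mean(), 2))  # close to 1: Exp(1) survival times for lam = nu = 1
```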
3.1.2 Real Data
| Dataset | Features | Events (%) | Outcome |
| --- | --- | --- | --- |
| AIDS study | 11 | 96 (%) | AIDS defining event |
| Coronary artery disease | 56 | 149 (%) | Myocardial infarction |
| Framingham offspring | 150 | 1,166 (%) | Coronary vessel disease |
| Veteran’s Lung Cancer | 6 | 128 (%) | Death |
| Worcester Heart Attack Study | 14 | 215 (%) | Death |
In the second set of experiments, we focused on 5 real-world datasets of varying size, number of features, and amount of censoring (see table 1). The Framingham offspring [17] and the coronary artery disease [25] data contained missing values, which were imputed using multivariate imputation by chained equations with random forest models [9]. To conserve computational resources during validation, and since missing values were not the focus of this study, one imputed dataset was randomly picked and analysed. Finally, we normalised continuous variables to zero mean and unit variance.
3.2 Prediction Performance
Experiments presented in this section focus on comparing the predictive performance of the kernel SSVM, on 100 synthetically generated datasets as well as 5 real-world datasets, against the following alternative survival models:

- the simple SSVM with hinge loss, restricted to pairs $(i, j)$, where $j$ is the largest uncensored sample with $y_i > y_j$ [32],
- the Minlip survival model by Van Belle et al. [33],
- the linear SSVM based on the efficient training scheme proposed in [26],
- Cox’s proportional hazards model [5] with $\ell_2$ penalty.
The regularisation parameter $\gamma$ for the SSVM and Minlip models controls the weight of the (squared) hinge loss, whereas for Cox’s proportional hazards model, it controls the weight of the $\ell_2$ penalty. Optimal hyper-parameters were determined via grid search by evaluating each configuration using ten random 80%/20% splits of the training data. The parameters that performed best on average across these ten partitions were selected, and the model was re-trained on the full training data using the optimal hyper-parameters. We used Harrell’s concordance index ($c$-index) [12] – a generalisation of Kendall’s $\tau$ to right censored survival data – to estimate the performance of a particular hyper-parameter configuration. In the grid search, $\gamma$ was chosen from a pre-defined set of values. The maximum number of iterations of Newton’s method was 200.
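Harrell's $c$-index reduces to counting concordant comparable pairs – the same pairs $\mathcal{P}$ the SSVM trains on. A naive $O(n^2)$ Python sketch (an illustrative reference, not the implementation used in the experiments):

```python
def harrell_c(y, delta, risk):
    """Harrell's concordance index for right censored data.

    A pair with y_j < y_i is comparable only if sample j is uncensored;
    it is concordant if the shorter observed time received the higher
    predicted risk. Ties in risk count as 0.5."""
    concordant, tied, comparable = 0, 0, 0
    for i in range(len(y)):
        for j in range(len(y)):
            if y[j] < y[i] and delta[j] == 1:
                comparable += 1
                if risk[j] > risk[i]:
                    concordant += 1
                elif risk[j] == risk[i]:
                    tied += 1
    return (concordant + 0.5 * tied) / comparable

# Risk scores that exactly reverse the order of survival times are fully concordant.
print(harrell_c([1, 2, 3, 4], [1, 1, 1, 1], [4, 3, 2, 1]))  # 1.0
```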
In our cross-validation experiments on real-world data, the test data was subject to censoring too; hence, performance was measured by Harrell’s and Uno’s $c$-index [12, 29] and the integrated area under the time-dependent, cumulative-dynamic ROC curve (iAUC; [30, 15]). The latter was evaluated at time points corresponding to percentiles of the observed time points in the complete dataset. For Uno’s $c$-index, the truncation time was the 80% percentile of the observed time points in the complete dataset. For all three measures, the values 0 and 1 indicate a completely wrong and a perfectly correct prediction, respectively. Finally, we used Friedman’s test and the Nemenyi post-hoc test to determine whether the performance difference between any two methods is statistically significant at the 0.05 level [8].
Table 2 (excerpt) – Coronary artery disease, Harrell’s $c$-index: 0.739, 0.645, 0.698, 0.706, 0.768.
3.2.1 Synthetic Data
The first set of experiments on synthetic data served as a reference for how kernel-based survival models compare to each other in a controlled setup. We performed experiments using an RBF kernel and the clinical kernel [6]. Figure 1 summarises the results on 100 synthetically generated datasets, where all survival times in the test data were uncensored, which leads to unbiased and consistent estimates of the $c$-index. The experiments revealed that using the clinical kernel was advantageous in all experiments (see fig. 1). Using the clinical kernel in combination with any of the SSVM models resulted in a significant improvement over the corresponding model with RBF kernel and over the linear model, respectively. The RBF kernel improved the performance over a linear model, except for the simple SSVM, which did not perform significantly better than the linear SSVM. The simple SSVM suffers from using a simplified objective function with a restricted set of comparable pairs, despite using an RBF kernel. This clearly indicates that reducing the size of $\mathcal{P}$ to address the complexity of training a non-linear SSVM, as proposed in [32], is inadequate. Although the Minlip model is based on the same restricted set of comparable pairs, the change in loss function is able to compensate for that to some degree. Our proposed optimisation scheme and the Minlip model performed comparably (both for the clinical and the RBF kernel).
3.2.2 Real Data
In this section, we present results on 5 real-world datasets (see table 1) based on 5-fold cross-validation and the clinical kernel [6], which is preferred if feature vectors are a mix of continuous and categorical features, as demonstrated above. Table 2 summarises our results. In general, the performance measures correlated well, and the results support our conclusions from the experiments on synthetic data described above. The simplified SSVM performed poorly: it ranked last in all experiments. In particular, it was outperformed by the linear SSVM, which considers all comparable pairs in $\mathcal{P}$ – further evidence that restricting $\mathcal{P}$ is an inadequate approach to training a non-linear SSVM efficiently. The Minlip model was outperformed by the proposed SSVM on two datasets (AIDS study and coronary artery disease). It only performed better on the veteran’s lung cancer dataset and was comparable in the remaining experiments. The linear SSVM achieved performance comparable to the SSVM with clinical kernel on all datasets except the coronary artery disease data. Finally, Cox’s proportional hazards model often performed very well on the real-world datasets, although it does not model non-linearities explicitly. The performance difference between our SSVM and the Minlip model can be explained by the fact that they differ not only in the loss function, but also in the definition of the set $\mathcal{P}$. While our SSVM is able to consider all valid pairwise relationships in the training data, the Minlip model only considers a small subset of pairs. This turned out to be problematic when the amount of censoring is high, as is the case for the AIDS and coronary artery disease studies, which contained less than 15% uncensored records (see table 1). In this setting, training a Minlip model is based on a much smaller set of comparable pairs than what is available to our SSVM, which ultimately leads to a Minlip model that generalises badly.
Therefore, our proposed efficient optimisation scheme is preferred both with respect to runtime and predictive performance. When considering all experiments together, statistical analysis [8] suggests that the predictive performance of all five survival models is comparable.
We proposed an efficient method for training non-linear ranking-based survival support vector machines. Our algorithm is a straightforward extension of our previously proposed training algorithm for linear survival support vector machines. Our optimisation scheme allows analysing datasets of much larger size than previous training algorithms, mostly by reducing the space complexity from $O(n^4)$ to $O(n^2)$, and is the preferred choice when learning from survival data with a high amount of right censoring. This opens up the opportunity to build survival models that can utilise large amounts of complex, structured data, such as graphs and strings.
This work has been supported by the CRUK Centre at the Institute of Cancer Research and Royal Marsden (Grant No. C309/A18077), The Heather Beckwith Charitable Settlement, The John L Beckwith Charitable Trust, and the Leibniz Supercomputing Centre (LRZ, www.lrz.de). Data were provided by the Framingham Heart Study of the National Heart Lung and Blood Institute of the National Institutes of Health and Boston University School of Medicine (Contract No. N01-HC-25195).
-  Bender, R., Augustin, T., Blettner, M.: Generating survival times to simulate Cox proportional hazards models. Stat. Med. 24(11), 1713–1723 (2005)
-  Cai, T., Tonini, G., Lin, X.: Kernel machine approach to testing the significance of multiple genetic markers for risk prediction. Biometrics 67(3), 975–986 (2011)
-  Chapelle, O.: Training a support vector machine in the primal. Neural Comput. 19(5), 1155–1178 (2007)
-  Chapelle, O., Keerthi, S.S.: Efficient algorithms for ranking with SVMs. Information Retrieval 13(3), 201–205 (2009)
-  Cox, D.R.: Regression models and life tables. J. R. Stat. Soc. Series B Stat. Methodol. 34, 187–220 (1972)
-  Daemen, A., Timmerman, D., Van den Bosch, T., Bottomley, C., Kirk, E., Van Holsbeke, C., Valentin, L., Bourne, T., De Moor, B.: Improved modeling of clinical data with kernel methods. Artif. Intell. Med. 54, 103–114 (2012)
-  Dembo, R.S., Steihaug, T.: Truncated Newton algorithms for large-scale optimization. Math. Program. 26(2), 190–212 (1983)
-  Demšar, J.: Statistical Comparisons of Classifiers over Multiple Data Sets. J. Mach. Learn. Res. 7, 1–30 (2006)
-  Doove, L.L., van Buuren, S., Dusseldorp, E.: Recursive partitioning for missing data imputation in the presence of interaction effects. Comput. Stat. Data. Anal. 72(0), 92–104 (2014)
-  Eleuteri, A., Taktak, A.F.: Support vector machines for survival regression. In: Computational Intelligence Methods for Bioinformatics and Biostatistics. LNCS, vol. 7548, pp. 176–189. Springer (2012)
-  Evers, L., Messow, C.M.: Sparse kernel methods for high-dimensional survival data. Bioinformatics 24(14), 1632–1638 (2008)
-  Harrell, F.E., Califf, R.M., Pryor, D.B., Lee, K.L., Rosati, R.A.: Evaluating the yield of medical tests. J. Am. Med. Assoc. 247(18), 2543–2546 (1982)
-  Herbrich, R., Graepel, T., Obermayer, K.: Large Margin Rank Boundaries for Ordinal Regression, chap. 7, pp. 115–132. MIT Press, Cambridge, MA (2000)
-  Hosmer, D., Lemeshow, S., May, S.: Applied Survival Analysis: Regression Modeling of Time to Event Data. John Wiley & Sons, Inc., Hoboken, NJ (2008)
-  Hung, H., Chiang, C.T.: Estimation methods for time-dependent AUC models with survival data. Can. J. Stat. 38(1), 8–26 (2010)
-  Kalbfleisch, J.D., Prentice, R.L.: The Statistical Analysis of Failure Time Data. John Wiley & Sons, Inc. (2002)
-  Kannel, W.B., Feinleib, M., McNamara, P.M., Garrision, R.J., Castelli, W.P.: An investigation of coronary heart disease in families: The Framingham Offspring Study. Am. J. Epidemiol. 110(3), 281–290 (1979)
-  Keerthi, S.S., DeCoste, D.: A modified finite newton method for fast solution of large scale linear SVMs. J. Mach. Learn. Res. 6, 341–361 (2005)
-  Khan, F.M., Zubek, V.B.: Support vector regression for censored data (SVRc): A novel tool for survival analysis. In: 8th IEEE International Conference on Data Mining. pp. 863–868 (2008)
-  Kimeldorf, G.S., Wahba, G.: A correspondence between Bayesian estimation on stochastic processes and smoothing by splines. Ann. Math. Stat. 41(2), 495–502 (1970)
-  Kuo, T.M., Lee, C.P., Lin, C.J.: Large-scale kernel RankSVM. In: SIAM International Conference on Data Mining. pp. 812–820 (2014)
-  Lee, C.P., Lin, C.J.: Large-scale linear RankSVM. Neural Comput. 26(4), 781–817 (2014)
-  Leslie, C., Eskin, E., Noble, W.S.: The spectrum kernel: A string kernel for SVM protein classification. In: Pac. Symp. Biocomput. pp. 564–575 (2002)
-  Li, H., Luan, Y.: Kernel Cox regression models for linking gene expression profiles to censored survival data. In: Pac. Symp. Biocomput. pp. 65–76 (2003)
-  Ndrepepa, G., Braun, S., Mehilli, J., Birkmeier, K.A., Byrne, R.A., Ott, I., Hösl, K., Schulz, S., Fusaro, M., Pache, J., Hausleiter, J., Laugwitz, K.L., Massberg, S., Seyfarth, M., Schömig, A., Kastrati, A.: Prognostic value of sensitive troponin T in patients with stable and unstable angina and undetectable conventional troponin. Am. Heart J. 161(1), 68–75 (2011)
-  Pölsterl, S., Navab, N., Katouzian, A.: Fast training of support vector machines for survival analysis. In: Machine Learning and Knowledge Discovery in Databases. LNCS, vol. 9285, pp. 243–259. Springer (2015)
-  Schölkopf, B., Smola, A., Müller, K.R.: Nonlinear component analysis as a kernel eigenvalue problem. Neural Comput. 10(5), 1299–1319 (1998)
-  Shivaswamy, P.K., Chu, W., Jansche, M.: A support vector approach to censored targets. In: 7th IEEE International Conference on Data Mining. pp. 655–660 (2007)
-  Uno, H., Cai, T., Pencina, M.J., D’Agostino, R.B., Wei, L.J.: On the C-statistics for evaluating overall adequacy of risk prediction procedures with censored survival data. Stat. Med. 30(10), 1105–1117 (2011)
-  Uno, H., Cai, T., Tian, L., Wei, L.J.: Evaluating prediction rules for t-year survivors with censored regression models. J. Am. Stat. Assoc. 102, 527–537 (2007)
-  Van Belle, V., Pelckmans, K., Suykens, J.A.K., Van Huffel, S.: Support vector machines for survival analysis. In: 3rd International Conference on Computational Intelligence in Medicine and Healthcare. pp. 1–8 (2007)
-  Van Belle, V., Pelckmans, K., Suykens, J.A.K., Van Huffel, S.: Survival SVM: a practical scalable algorithm. In: 16th European Symposium on Artificial Neural Networks. pp. 89–94 (2008)
-  Van Belle, V., Pelckmans, K., Suykens, J.A.K., Van Huffel, S.: Learning transformation models for ranking and survival analysis. J. Mach. Learn. Res. 12, 819–862 (2011)
-  Van Belle, V., Pelckmans, K., Van Huffel, S., Suykens, J.A.K.: Support vector methods for survival analysis: a comparison between ranking and regression approaches. Artif. Intell. Med. 53(2), 107–118 (2011)
-  Vishwanathan, S., Schraudolph, N.N., Kondor, R., Borgwardt, K.M.: Graph kernels. J. Mach. Learn. Res. 11, 1201–1242 (2010)
-  Zhang, W., Ota, T., Shridhar, V., Chien, J., Wu, B., Kuang, R.: Network-based survival analysis reveals subnetwork signatures for predicting outcomes of ovarian cancer treatment. PLoS Comput. Biol. 9(3), e1002975 (2013)