An Efficient Training Algorithm for Kernel Survival Support Vector Machines
Abstract
Survival analysis is a fundamental tool in medical research to identify predictors of adverse events and to develop systems for clinical decision support. In order to leverage large amounts of patient data, efficient optimisation routines are paramount. We propose an efficient training algorithm for the kernel survival support vector machine (SSVM). We directly optimise the primal objective function and employ truncated Newton optimisation and order statistic trees to significantly lower computational costs compared to previous training algorithms, which require O(n^4) space and O(p n^6) time for datasets with n samples and p features. Our results demonstrate that our proposed optimisation scheme allows analysing data of a much larger scale with no loss in prediction performance. Experiments on synthetic and five real-world datasets show that our technique outperforms existing kernel SSVM formulations if the amount of right censoring is high, and performs comparably otherwise.
Keywords:
survival analysis · support vector machine · optimisation · kernel-based learning
1 Introduction
In clinical research, the primary interest is often the time until occurrence of an adverse event, such as death or reaching a specific state of disease progression. In survival analysis, the objective is to establish a connection between a set of features and the time until an event of interest. It differs from traditional machine learning, because parts of the training data can only be partially observed. In a clinical study, patients are often monitored for a particular time period, and events occurring in this particular period are recorded. If a patient experiences an event, the exact time of the event can be recorded – the time of the event is uncensored. In contrast, if a patient remained event-free during the study period, it is unknown whether an event has or has not occurred after the study ended – the time of an event is right censored at the end of the study.
Cox's proportional hazards model (CoxPH) is the standard for analysing time-to-event data. However, its decision function is linear in the covariates, which can lead to poor predictive performance if non-linearities and interactions are not modelled explicitly. Depending on the level of complexity, researchers might be forced to try many different model formulations, which is cumbersome. The success of kernel methods in machine learning has motivated researchers to propose kernel-based survival models, which ease analysis in the presence of non-linearities, and allow analysing complex data in the form of graphs or strings by means of appropriate kernel functions (e.g. [23, 35]). Thus, instead of merely describing patients by feature vectors, structured and more expressive forms of representation can be employed, such as gene co-expression networks [36].
A kernel-based CoxPH model was proposed in [24, 2]. Authors in [19, 28] cast survival analysis as a regression problem and adapted support vector regression, whereas authors in [11, 31] cast it as a learning-to-rank problem by adapting the rank support vector machine (Rank SVM). Eleuteri et al. [10] formulated a model based on quantile regression. A transformation model with minimal Lipschitz smoothness for survival analysis was proposed in [33]. Finally, Van Belle et al. [34] proposed a hybrid ranking-regression model.
In this paper, we focus on improving the optimisation scheme of the nonlinear ranking-based survival support vector machine (SSVM). Existing training algorithms [11, 31] perform optimisation in the dual and require O(n^6) time – excluding evaluations of the kernel function – and O(n^4) space, where p and n are the number of features and samples. Recently, an efficient training algorithm for linear SSVM with much lower time complexity and linear space complexity has been proposed [26]. We extend this optimisation scheme to the nonlinear case and demonstrate its superiority on synthetic and real-world datasets. Our implementation of the proposed training algorithm is available online at https://github.com/tum-camp/survival-support-vector-machine.
2 Methods
Given a dataset D of n samples, let x_i denote a p-dimensional feature vector, t_i > 0 the time of an event, and c_i > 0 the time of censoring of the i-th sample. Due to right censoring, it is only possible to observe y_i = min(t_i, c_i) and δ_i = I(t_i ≤ c_i) for every sample, with I(·) being the indicator function and c_i = ∞ for uncensored records. Hence, training a survival model is based on a set of triplets: D = {(x_i, y_i, δ_i)}_{i=1}^n. After training, a survival model ought to predict a risk score of experiencing an event based on a set of given features x.
2.1 The Survival Support Vector Machine
The SSVM is an extension of the Rank SVM [13] to right censored survival data [31, 11]. Consequently, survival analysis is cast as a learning-to-rank problem: patients with a lower survival time should be ranked before patients with a longer survival time. In the absence of censoring – as is the case for traditional Rank SVM – all pairwise comparisons of samples are used during training. However, if samples are right censored, some pairwise relationships are invalid. When comparing two censored samples i and j (δ_i = 0 and δ_j = 0), it is unknown whether the i-th sample should be ranked before the j-th sample or vice versa, because the exact time of an event is unknown for both samples. The same applies when comparing an uncensored sample i with a censored sample j (δ_i = 1 and δ_j = 0) if y_j < y_i. Therefore, a pairwise comparison (i, j) is only valid if the sample with the lower observed time is uncensored. Formally, the set of valid comparable pairs is given by

P = {(i, j) | y_i > y_j ∧ δ_j = 1}_{i,j=1}^n,
where it is assumed that all observed time points are unique [31, 11]. Training a linear SSVM (based on the hinge loss) requires solving the following optimisation problem:

min_w  (1/2)‖w‖_2^2 + γ Σ_{(i,j)∈P} max(0, 1 − w^T(x_i − x_j)),    (1)

where w ∈ R^p are the model's coefficients and γ > 0 controls the degree of regularisation.
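To make the pairwise formulation concrete, the following sketch (ours, not part of the original implementation) enumerates the set P of valid comparable pairs from observed times y and event indicators δ using the naive quadratic scan:

```python
def comparable_pairs(y, delta):
    """Enumerate valid comparable pairs (i, j): y_i > y_j and the sample
    with the smaller observed time, j, is uncensored (delta_j = 1).
    Naive O(n^2) enumeration."""
    n = len(y)
    return [(i, j)
            for i in range(n)
            for j in range(n)
            if y[i] > y[j] and delta[j] == 1]
```

Without censoring, this recovers all n(n − 1)/2 ordered pairs; with censoring, pairs whose earlier sample is censored are skipped.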
Without censoring, all possible pairs of samples have to be considered during training, hence |P| = n(n − 1)/2 and the sum in (1) consists of a quadratic number of addends with respect to the number of training samples. If part of the survival times are censored, the size of P depends on the amount of uncensored records and the order of observed time points – censored and uncensored. Let q_e denote the percentage of uncensored time points; then |P| is at least q_e n(q_e n − 1)/2. This situation arises if all censored subjects drop out before the first event was observed; hence, all uncensored records are incomparable to all censored records. If the situation is reversed and the first censored time point occurs after the last time point of an observed event, all uncensored records can be compared to all censored records, which maximises |P|. In both cases, |P| is of the order O(n^2) and the number of addends in optimisation problem (1) is quadratic in the number of samples. In [31, 11], the objective function (1) is minimised by solving the corresponding Lagrange dual problem:
max_α  1^T α − (1/2) α^T A X X^T A^T α    (2)
subject to  0 ≤ α_k ≤ γ,  ∀k = 1, …, l,

where l = |P|, X = (x_1, …, x_n)^T, 1 is an l-dimensional vector of all ones, α = (α_1, …, α_l)^T are the Lagrangian multipliers, and A ∈ {−1, 0, 1}^{l×n} is a sparse matrix with A_{k,i} = 1 and A_{k,j} = −1 if (i, j) ∈ P and zero otherwise. It is easy to see that this approach quickly becomes intractable, because constructing the matrix A X X^T A^T requires O(n^4) space and solving the quadratic problem (2) requires O(n^6) time.
Van Belle et al. [32] addressed this problem by reducing the size of P to n elements: for each sample i, they only considered the pair (i, j), where j is the largest uncensored sample with y_j < y_i. However, this approach effectively simplifies the objective function (1) and usually leads to a different solution. In [26], we proposed an efficient optimisation scheme for solving (1) by substituting the hinge loss for the squared hinge loss, and using truncated Newton optimisation and order statistic trees to avoid explicitly constructing all pairwise comparisons, resulting in a much reduced time and space complexity. However, the optimisation scheme is only applicable to training a linear model. To circumvent this problem, the data can be transformed via Kernel PCA [27] before training, which effectively results in a nonlinear model in the original feature space [26, 4]. The disadvantage of this approach is that it requires O(n^2 p) operations to construct the kernel matrix – assuming evaluating the kernel function costs O(p) – and O(n^3) operations to perform singular value decomposition. Kuo et al. [21] proposed an alternative approach for Rank SVM that allows directly optimising the primal objective function in the nonlinear case too. It is natural to adapt this approach for training a nonlinear SSVM, which we will describe next.
2.2 Efficient Training of a Kernel Survival Support Vector Machine
The main idea to obtain a nonlinear decision function is that the objective function (1) is reformulated with respect to finding a function f from a reproducing kernel Hilbert space H_k with associated kernel function k that maps the input x ∈ R^p to a real value (usually a risk score):

min_f  (1/2)‖f‖²_{H_k} + (γ/2) Σ_{(i,j)∈P} max(0, 1 − (f(x_i) − f(x_j)))²,

where we substituted the hinge loss for the squared hinge loss, because the latter is differentiable. Using the representer theorem [20, 3, 21], the function f can be expressed as f(x) = Σ_{l=1}^n β_l k(x_l, x), which results in the objective function

min_β  (1/2) Σ_{l=1}^n Σ_{m=1}^n β_l β_m k(x_l, x_m) + (γ/2) Σ_{(i,j)∈P} max(0, 1 − Σ_{l=1}^n β_l [k(x_l, x_i) − k(x_l, x_j)])²,
where the norm ‖f‖²_{H_k} can be computed by using the reproducing kernel property f(x) = ⟨f, k(x, ·)⟩_{H_k} and ⟨k(x_i, ·), k(x_j, ·)⟩_{H_k} = k(x_i, x_j). The objective function can be expressed in matrix form through the symmetric positive definite kernel matrix K ∈ R^{n×n} with entries K_{i,j} = k(x_i, x_j):

R(β) = (1/2) β^T K β + (γ/2) (A K β − 1)^T D_β (A K β − 1),    (3)

where A is the l × n pairwise comparison matrix from (2), β = (β_1, …, β_n)^T are the coefficients, and D_β is an l × l diagonal matrix that has an entry for each (i, j) ∈ P that indicates whether this pair is a support pair, i.e., whether 1 − (f(x_i) − f(x_j)) > 0. For the k-th item of P, representing the pair (i, j), the corresponding entry in D_β is defined as

(D_β)_{k,k} = 1 if (K_i − K_j)β < 1, and (D_β)_{k,k} = 0 otherwise,

where K_i denotes the i-th row of kernel matrix K. Note that in contrast to the Lagrangian multipliers α in (2), β in (3) is unconstrained and usually dense.
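As an illustration, the objective in (3) can be evaluated directly from a precomputed kernel matrix. The following sketch is our own and uses a naive loop over the comparable pairs instead of the matrix form; it assumes K and the pair set are given:

```python
import numpy as np

def ssvm_objective(beta, K, pairs, gamma):
    """Evaluate R(beta) = 0.5 * beta^T K beta
    + gamma/2 * sum of squared hinge losses over comparable pairs."""
    f = K @ beta                        # predicted scores f(x_i) = K_i beta
    reg = 0.5 * beta @ (K @ beta)       # RKHS norm of f
    loss = sum(max(0.0, 1.0 - (f[i] - f[j])) ** 2 for i, j in pairs)
    return reg + 0.5 * gamma * loss
```

Pairs with margin 1 − (f(x_i) − f(x_j)) ≤ 0 contribute nothing, which is exactly what the diagonal matrix D_β encodes in the matrix form.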
The objective function (3) of the nonlinear SSVM is similar to the linear model discussed in [26]; in fact, R(β) is differentiable and convex with respect to β, which allows employing truncated Newton optimisation [7]. The first- and second-order partial derivatives have the form

∇R(β) = K β + γ K A^T D_β (A K β − 1),    (4)
∇²R(β) = K + γ K A^T D_β A K,    (5)

where the generalised Hessian is used in the second derivative, because R(β) is not twice differentiable at points where (K_i − K_j)β = 1 [18].
Note that the expression A^T D_β A appears in eqs. (3) to (5). Right multiplying A^T by the diagonal matrix D_β has the effect that rows not corresponding to support pairs – pairs for which (D_β)_{k,k} = 0 – are dropped from the matrix A. Thus, A^T D_β A can be simplified by expressing it in terms of a new matrix A_β ∈ {−1, 0, 1}^{l_β×n} as A^T D_β A = A_β^T A_β, where l_β denotes the number of support pairs in P. Thus, the objective function and its derivatives can be compactly expressed as

R(β) = (1/2) β^T K β + (γ/2) (A_β K β − 1)^T (A_β K β − 1),
∇R(β) = K β + γ K A_β^T (A_β K β − 1),
∇²R(β) = K + γ K A_β^T A_β K.
The gradient and Hessian of the nonlinear SSVM share properties with the corresponding functions of the linear model. Therefore, we can adapt the efficient training algorithm for linear SSVM [26] with only small modifications, thereby avoiding explicitly constructing the matrix , which would require space. Since the derivation for the nonlinear case is very similar to the linear case, we only present the final result here and refer to [26] for details.
In each iteration of truncated Newton optimisation, a Hessian-vector product ∇²R(β) v needs to be computed. The second term in this product involves A^T D_β A K v and becomes

(A^T D_β A K v)_i = (l_i^+ + l_i^−)(K v)_i − σ_i^+ − σ_i^−,  i = 1, …, n,    (6)

where, in analogy to the linear SSVM,

SV_i^+ = {s | y_s > y_i ∧ f(x_s) − f(x_i) < 1 ∧ δ_i = 1},  l_i^+ = |SV_i^+|,  σ_i^+ = Σ_{s∈SV_i^+} (K v)_s,
SV_i^− = {s | y_s < y_i ∧ f(x_i) − f(x_s) < 1 ∧ δ_s = 1},  l_i^− = |SV_i^−|,  and σ_i^− = Σ_{s∈SV_i^−} (K v)_s.

The values l_i^+, σ_i^+, l_i^−, and σ_i^− can be obtained in logarithmic time by first sorting according to the predicted scores f(x_i) = (K β)_i and subsequently incrementally constructing one order statistic tree to hold SV_i^+ and one to hold SV_i^−, respectively [26, 22, 21]. Finally, the risk score of experiencing an event for a new data point x_new can be estimated by f̂(x_new) = K_new β, where K_new = (k(x_1, x_new), …, k(x_n, x_new)).
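For reference, the Hessian-vector product needed by truncated Newton optimisation can be written down naively by looping over all comparable pairs; the order statistic trees compute the same quantity in O(n log n) after sorting by predicted score. This quadratic sketch is our own illustration of the quantity being computed, not the tree-based algorithm itself:

```python
import numpy as np

def hessian_vec_naive(v, K, beta, pairs, gamma):
    """Naive Hessian-vector product (K + gamma * K A^T D A K) v.

    Loops over all comparable pairs instead of using order statistic
    trees, so it costs O(|P|) per call plus two O(n^2) matrix-vector
    products with K."""
    f = K @ beta                      # predicted scores f(x_i)
    Kv = K @ v
    u = np.zeros_like(v, dtype=float)
    for i, j in pairs:
        if f[i] - f[j] < 1.0:         # support pair: inside the margin
            u[i] += Kv[i] - Kv[j]
            u[j] += Kv[j] - Kv[i]
    return Kv + gamma * (K @ u)
```

Each support pair (i, j) contributes (Kv)_i − (Kv)_j to component i and the negated amount to component j, which aggregates to exactly the per-sample counts and sums appearing in (6).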
2.3 Complexity
The overall complexity of the training algorithm for a linear SSVM is

O(N_Newton · N_CG · (n p + n log n)),

where N_CG and N_Newton are the average number of conjugate gradient iterations and the total number of Newton updates, respectively [26]. For the nonlinear model, we first have to construct the kernel matrix K. If K cannot be stored in memory, computing the product K v requires n² evaluations of the kernel function and O(n²) operations to compute the product. If evaluating the kernel function costs O(p), the overall complexity is O(n² p). Thus, computing the Hessian-vector product in the nonlinear case consists of three steps, which have the following complexities:
1. O(n² p) to compute (K v)_i for all i = 1, …, n,
2. O(n log n) to sort samples according to the values of K β,
3. O(n log n) to calculate the Hessian-vector product via (6).
This clearly shows that, in contrast to training a linear model, computing the sum over all comparable pairs is no longer the most time-consuming task in minimising the objective in eq. (3). Instead, computing the product K v is the dominating factor.
If the number of samples in the training data is small, the kernel matrix K can be computed once and stored in memory thereafter, which results in a one-time cost of O(n² p). This reduces the cost of computing K v to O(n²), while the costs of the remaining steps are unchanged. Although precomputing the kernel matrix is an improvement, computing K v in each conjugate gradient iteration remains the bottleneck. The overall complexity of training a nonlinear SSVM with truncated Newton optimisation and order statistic trees is
O(n² p + N_Newton · N_CG · n²).    (7)
Note that direct optimisation of the nonlinear objective function is preferred over using Kernel PCA to transform the data before training, because it avoids the O(n³) operations corresponding to the singular value decomposition of K.
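Precomputing the kernel matrix, as discussed above, trades O(n²) memory for avoiding repeated kernel evaluations in every conjugate gradient iteration. A sketch of the one-time construction, using an RBF kernel as our example choice, is:

```python
import numpy as np

def rbf_kernel_matrix(X, sigma=1.0):
    """Precompute the n x n RBF kernel matrix
    K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)).

    One-time cost O(n^2 p); afterwards each product K v costs O(n^2)."""
    sq = np.sum(X ** 2, axis=1)
    # squared pairwise distances via the expansion ||a-b||^2 = a^2 + b^2 - 2ab
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma ** 2))
```

The `np.maximum` guard clamps tiny negative distances caused by floating-point cancellation before exponentiation.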
3 Comparison of Survival Support Vector Machines
3.1 Datasets
3.1.1 Synthetic Data
Synthetic survival data of varying size was generated following [1]. Each dataset consisted of one uniformly distributed feature, denoting age, one binary variable denoting sex, drawn from a binomial distribution with probability 0.5, and a categorical variable with 3 equally distributed levels. In addition, 10 numeric features were sampled from a multivariate normal distribution. Survival times t_i were drawn from a Weibull distribution with shape k = 1 (constant hazard rate) and scale λ according to the formula presented in [1]: t_i = [−log(u_i) / (λ exp(f(x_i)))]^{1/k}, where u_i is uniformly distributed within [0, 1], f denotes a nonlinear model that relates the features to the survival time (see below), and x_i is the feature vector of the i-th subject. The censoring time c_i was drawn from a uniform distribution in the interval [0, τ], where τ was chosen such that about 20% of survival times were censored in the training data. Survival times in the test data were not subject to censoring to eliminate the effect of censoring on performance estimation.
The nonlinear model f was defined as a function of the features described above, where C1, C2, and C3 correspond to dummy codes of the categorical feature with three categories and N1 to N10 to the continuous features sampled from the multivariate normal distribution. Note that the 3rd, 9th, and 10th numeric features are associated with a zero coefficient and thus do not affect the survival time. We generated 100 pairs of train and test data of 1,500 samples each by multiplying the coefficients by a random uniformly drawn scaling factor.
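The inversion formula of Bender et al. [1] used above can be sketched as follows; this is our own illustration, and λ, k, and the linear predictor f(x) are placeholders rather than the exact values used in the experiments:

```python
import numpy as np

def draw_survival_times(f_x, lam=1.0, k=1.0, rng=None):
    """Draw Weibull survival times via inverse transform sampling:
    t = (-log(u) / (lam * exp(f(x))))**(1/k), with u ~ U(0, 1).

    k = 1 yields a constant hazard rate (the exponential special case)."""
    if rng is None:
        rng = np.random.default_rng(0)
    u = rng.uniform(size=len(f_x))
    return (-np.log(u) / (lam * np.exp(f_x))) ** (1.0 / k)
```

With f(x) = 0, λ = 1, and k = 1 this reduces to sampling from a standard exponential distribution, which gives a quick sanity check on the implementation.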
3.1.2 Real Data
Table 1. Overview of the real-world datasets used in our experiments.

Dataset | Features | Events | Outcome
AIDS study [14] | 11 | 96 | AIDS-defining event
Coronary artery disease [25] | 56 | 149 | Myocardial infarction or death
Framingham offspring [17] | 150 | 1,166 | Coronary vessel disease
Veteran's Lung Cancer [16] | 6 | 128 | Death
Worcester Heart Attack Study [14] | 14 | 215 | Death
In the second set of experiments, we focused on 5 real-world datasets of varying size, number of features, and amount of censoring (see table 1). The Framingham offspring and the coronary artery disease data contained missing values, which were imputed using multivariate imputation with chained equations and random forest models [9]. To limit computational requirements for validation, and since the missing values problem was not the focus of this work, one imputed dataset was randomly picked and analysed. Finally, we normalised continuous variables to have zero mean and unit variance.
3.2 Prediction Performance
Experiments presented in this section focus on the comparison of the predictive performance of the survival SVM on 100 synthetically generated datasets as well as 5 real-world datasets against four alternative survival models:

1. the simple SSVM with hinge loss, restricted to pairs (i, j), where j is the largest uncensored sample with y_j < y_i [32],
2. the Minlip survival model by Van Belle et al. [33],
3. the linear SSVM based on the efficient training scheme proposed in [26],
4. Cox's proportional hazards model [5] with ℓ2 penalty.
The regularisation parameter γ for SSVM and Minlip controls the weight of the (squared) hinge loss, whereas for Cox's proportional hazards model, it controls the weight of the ℓ2 penalty. Optimal hyper-parameters were determined via grid search by evaluating each configuration using ten random 80%/20% splits of the training data. The parameters that performed best on average across these ten partitions were ultimately selected, and the model was retrained on the full training data using the optimal hyper-parameters. We used Harrell's concordance index (c-index) [12] – a generalisation of Kendall's τ to right censored survival data – to estimate the performance of a particular hyper-parameter configuration. In the grid search, γ was chosen from a pre-defined set. The maximum number of iterations of Newton's method was 200.
In our cross-validation experiments on real-world data, the test data was subject to censoring too; hence performance was measured by Harrell's and Uno's c-index [12, 29] and the integrated area under the time-dependent, cumulative-dynamic ROC curve (iAUC; [30, 15]). The latter was evaluated at time points corresponding to percentiles of the observed time points in the complete dataset. For Uno's c-index, the truncation time was the 80th percentile of the observed time points in the complete dataset. For all three measures, the values 0 and 1 indicate a completely wrong and a perfectly correct prediction, respectively. Finally, we used Friedman's test and the Nemenyi post-hoc test to determine whether the performance difference between any two methods is statistically significant at the 0.05 level [8].
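Harrell's concordance index counts correctly ordered comparable pairs. A minimal sketch (ours; ties in the risk score are counted as 0.5, and tied survival times are not handled) is:

```python
def harrell_c(y, delta, risk):
    """Harrell's c-index: among pairs where the sample with the smaller
    observed time is uncensored, count pairs whose predicted risk is
    higher for the sample that experiences the event earlier."""
    concordant = tied = total = 0
    n = len(y)
    for j in range(n):
        if delta[j] != 1:            # earlier sample must be uncensored
            continue
        for i in range(n):
            if y[i] > y[j]:
                total += 1
                if risk[j] > risk[i]:
                    concordant += 1  # higher risk => earlier event
                elif risk[j] == risk[i]:
                    tied += 1
    return (concordant + 0.5 * tied) / total
```

Values of 1.0 and 0.0 correspond to perfectly correct and completely wrong ordering, and a constant risk score yields 0.5, the expected value of a random prediction.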
Table 2. Performance on real-world datasets.

Dataset | Measure | SSVM (ours) | SSVM (simple) | Minlip | SSVM (linear) | Cox
AIDS study | Harrell's c | 0.759 | 0.682 | 0.729 | 0.767 | 0.770
AIDS study | Uno's c | 0.711 | 0.621 | 0.560 | 0.659 | 0.663
AIDS study | iAUC | 0.759 | 0.685 | 0.724 | 0.766 | 0.771
Coronary artery disease | Harrell's c | 0.739 | 0.645 | 0.698 | 0.706 | 0.768
Coronary artery disease | Uno's c | 0.780 | 0.751 | 0.745 | 0.730 | 0.732
Coronary artery disease | iAUC | 0.753 | 0.641 | 0.703 | 0.716 | 0.777
Framingham offspring | Harrell's c | 0.778 | 0.707 | 0.786 | 0.780 | 0.785
Framingham offspring | Uno's c | 0.732 | 0.674 | 0.724 | 0.699 | 0.742
Framingham offspring | iAUC | 0.827 | 0.742 | 0.837 | 0.829 | 0.832
Lung cancer | Harrell's c | 0.676 | 0.605 | 0.719 | 0.716 | 0.716
Lung cancer | Uno's c | 0.664 | 0.605 | 0.716 | 0.709 | 0.712
Lung cancer | iAUC | 0.740 | 0.630 | 0.790 | 0.783 | 0.780
WHAS | Harrell's c | 0.768 | 0.724 | 0.774 | 0.770 | 0.770
WHAS | Uno's c | 0.772 | 0.730 | 0.778 | 0.775 | 0.773
WHAS | iAUC | 0.799 | 0.749 | 0.801 | 0.796 | 0.796
3.2.1 Synthetic Data
The first set of experiments on synthetic data served as a reference on how kernel-based survival models compare to each other in a controlled setup. We performed experiments using an RBF kernel and the clinical kernel [6]. Figure 1 summarises the results on 100 synthetically generated datasets, where all survival times in the test data were uncensored, which leads to unbiased and consistent estimates of the c-index. The experiments revealed that using the clinical kernel was advantageous in all experiments (see fig. 1). Using the clinical kernel in combination with any of the SSVM models resulted in a significant improvement over the corresponding model with an RBF kernel and the linear model, respectively. Regarding the RBF kernel, it improved the performance over a linear model, except for the simple SSVM, which did not perform significantly better than the linear SSVM. The simple SSVM suffers from using a simplified objective function with a restricted set of comparable pairs, despite using an RBF kernel. This clearly indicates that reducing the size of P to address the complexity of training a nonlinear SSVM, as proposed in [32], is inadequate. Although the Minlip model is based on the same restricted set of comparable pairs, the change in loss function is able to compensate for that, to some degree. Our proposed optimisation scheme and the Minlip model performed comparably (both for the clinical and the RBF kernel).
3.2.2 Real Data
In this section, we present results on 5 real-world datasets (see table 1) based on 5-fold cross-validation and the clinical kernel [6], which is preferred if feature vectors are a mix of continuous and categorical features, as demonstrated above. Table 2 summarises our results. In general, the performance measures correlated well and the results support our conclusions from the experiments on synthetic data described above. The simplified SSVM performed poorly: it ranked last in all experiments. In particular, it was outperformed by the linear SSVM, which considers all comparable pairs in P; this is further evidence that restricting P is an inadequate way to train a nonlinear SSVM efficiently. The Minlip model was outperformed by the proposed SSVM on two datasets (AIDS study and coronary artery disease). It only performed better on the veteran's lung cancer dataset and was comparable in the remaining experiments. The linear SSVM achieved comparable performance to the SSVM with clinical kernel on all datasets, except the coronary artery disease data. Finally, Cox's proportional hazards model often performed very well on the real-world datasets, although it does not model non-linearities explicitly. The performance difference between our SSVM and the Minlip model can be explained when considering that they not only differ in the loss function, but also in the definition of the set of comparable pairs. While our SSVM is able to consider all valid pairwise relationships in the training data, the Minlip model only considers a small subset of pairs. This turned out to be problematic when the amount of censoring is high, as is the case for the AIDS and coronary artery disease studies, which contained less than 15% uncensored records (see table 1). In this setting, training a Minlip model is based on a much smaller set of comparable pairs than what is available to our SSVM, which ultimately leads to a Minlip model that generalises badly.
Therefore, our proposed efficient optimisation scheme is preferred both with respect to runtime and predictive performance. When considering all experiments together, statistical analysis [8] suggests that the predictive performance of all five survival models is comparable.
3.3 Conclusion
We proposed an efficient method for training nonlinear ranking-based survival support vector machines. Our algorithm is a straightforward extension of our previously proposed training algorithm for linear survival support vector machines. Our optimisation scheme allows analysing datasets of much larger size than previous training algorithms, mostly by reducing the space complexity from O(n^4) to O(n^2), and is the preferred choice when learning from survival data with high amounts of right censoring. This opens up the opportunity to build survival models that can utilise large amounts of complex, structured data, such as graphs and strings.
Acknowledgments.
This work has been supported by the CRUK Centre at the Institute of Cancer Research and Royal Marsden (Grant No. C309/A18077), The Heather Beckwith Charitable Settlement, The John L Beckwith Charitable Trust, and the Leibniz Supercomputing Centre (LRZ, www.lrz.de). Data were provided by the Framingham Heart Study of the National Heart Lung and Blood Institute of the National Institutes of Health and Boston University School of Medicine (Contract No. N01HC25195).
References
 [1] Bender, R., Augustin, T., Blettner, M.: Generating survival times to simulate Cox proportional hazards models. Stat. Med. 24(11), 1713–1723 (2005)
 [2] Cai, T., Tonini, G., Lin, X.: Kernel machine approach to testing the significance of multiple genetic markers for risk prediction. Biometrics 67(3), 975–986 (2011)
 [3] Chapelle, O.: Training a support vector machine in the primal. Neural Comput. 19(5), 1155–1178 (2007)
 [4] Chapelle, O., Keerthi, S.S.: Efficient algorithms for ranking with SVMs. Information Retrieval 13(3), 201–205 (2009)
 [5] Cox, D.R.: Regression models and life tables. J. R. Stat. Soc. Series B Stat. Methodol. 34, 187–220 (1972)
 [6] Daemen, A., Timmerman, D., Van den Bosch, T., Bottomley, C., Kirk, E., Van Holsbeke, C., Valentin, L., Bourne, T., De Moor, B.: Improved modeling of clinical data with kernel methods. Artif. Intell. Med. 54, 103–114 (2012)
 [7] Dembo, R.S., Steihaug, T.: Truncated newton algorithms for largescale optimization. Math. Program. 26(2), 190–212 (1983)
 [8] Demšar, J.: Statistical Comparisons of Classifiers over Multiple Data Sets. J. Mach. Learn. Res. 7, 1–30 (2006)
 [9] Doove, L.L., van Buuren, S., Dusseldorp, E.: Recursive partitioning for missing data imputation in the presence of interaction effects. Comput. Stat. Data. Anal. 72(0), 92–104 (2014)
 [10] Eleuteri, A., Taktak, A.F.: Support vector machines for survival regression. In: Computational Intelligence Methods for Bioinformatics and Biostatistics. LNCS, vol. 7548, pp. 176–189. Springer (2012)
 [11] Evers, L., Messow, C.M.: Sparse kernel methods for highdimensional survival data. Bioinformatics 24(14), 1632–1638 (2008)
 [12] Harrell, F.E., Califf, R.M., Pryor, D.B., Lee, K.L., Rosati, R.A.: Evaluating the yield of medical tests. J. Am. Med. Assoc. 247(18), 2543–2546 (1982)
 [13] Herbrich, R., Graepel, T., Obermayer, K.: Large Margin Rank Boundaries for Ordinal Regression, chap. 7, pp. 115–132. MIT Press, Cambridge, MA (2000)
 [14] Hosmer, D., Lemeshow, S., May, S.: Applied Survival Analysis: Regression Modeling of Time to Event Data. John Wiley & Sons, Inc., Hoboken, NJ (2008)
 [15] Hung, H., Chiang, C.T.: Estimation methods for timedependent AUC models with survival data. Can. J. Stat. 38(1), 8–26 (2010)
 [16] Kalbfleisch, J.D., Prentice, R.L.: The Statistical Analysis of Failure Time Data. John Wiley & Sons, Inc. (2002)
 [17] Kannel, W.B., Feinleib, M., McNamara, P.M., Garrision, R.J., Castelli, W.P.: An investigation of coronary heart disease in families: The Framingham Offspring Study. Am. J. Epidemiol. 110(3), 281–290 (1979)
 [18] Keerthi, S.S., DeCoste, D.: A modified finite newton method for fast solution of large scale linear SVMs. J. Mach. Learn. Res. 6, 341–361 (2005)
 [19] Khan, F.M., Zubek, V.B.: Support vector regression for censored data (SVRc): A novel tool for survival analysis. In: 8^{th} IEEE International Conference on Data Mining. pp. 863–868 (2008)
 [20] Kimeldorf, G.S., Wahba, G.: A correspondence between Bayesian estimation on stochastic processes and smoothing by splines. Ann. Math. Stat. 41(2), 495–502 (1970)
 [21] Kuo, T.M., Lee, C.P., Lin, C.J.: Largescale kernel RankSVM. In: SIAM International Conference on Data Mining. pp. 812–820 (2014)
 [22] Lee, C.P., Lin, C.J.: Largescale linear RankSVM. Neural Comput. 26(4), 781–817 (2014)
 [23] Leslie, C., Eskin, E., Noble, W.S.: The spectrum kernel: A string kernel for SVM protein classification. In: Pac. Symp. Biocomput. pp. 564–575 (2002)
 [24] Li, H., Luan, Y.: Kernel Cox regression models for linking gene expression profiles to censored survival data. In: Pac. Symp. Biocomput. pp. 65–76 (2003)
 [25] Ndrepepa, G., Braun, S., Mehilli, J., Birkmeier, K.A., Byrne, R.A., Ott, I., Hösl, K., Schulz, S., Fusaro, M., Pache, J., Hausleiter, J., Laugwitz, K.L., Massberg, S., Seyfarth, M., Schömig, A., Kastrati, A.: Prognostic value of sensitive troponin T in patients with stable and unstable angina and undetectable conventional troponin. Am. Heart J. 161(1), 68–75 (2011)
 [26] Pölsterl, S., Navab, N., Katouzian, A.: Fast training of support vector machines for survival analysis. In: Machine Learning and Knowledge Discovery in Databases. LNCS, vol. 9285, pp. 243–259. Springer (2015)
 [27] Schölkopf, B., Smola, A., Müller, K.R.: Nonlinear component analysis as a kernel eigenvalue problem. Neural Comput. 10(5), 1299–1319 (1998)
 [28] Shivaswamy, P.K., Chu, W., Jansche, M.: A support vector approach to censored targets. In: 7^{th} IEEE International Conference on Data Mining. pp. 655–660 (2007)
 [29] Uno, H., Cai, T., Pencina, M.J., D’Agostino, R.B., Wei, L.J.: On the Cstatistics for evaluating overall adequacy of risk prediction procedures with censored survival data. Stat. Med. 30(10), 1105–1117 (2011)
 [30] Uno, H., Cai, T., Tian, L., Wei, L.J.: Evaluating prediction rules for tyear survivors with censored regression models. J. Am. Stat. Assoc. 102, 527–537 (2007)
 [31] Van Belle, V., Pelckmans, K., Suykens, J.A.K., Van Huffel, S.: Support vector machines for survival analysis. In: 3^{rd} International Conference on Computational Intelligence in Medicine and Healthcare. pp. 1–8 (2007)
 [32] Van Belle, V., Pelckmans, K., Suykens, J.A.K., Van Huffel, S.: Survival SVM: a practical scalable algorithm. In: 16^{th} European Symposium on Artificial Neural Networks. pp. 89–94 (2008)
 [33] Van Belle, V., Pelckmans, K., Suykens, J.A.K., Van Huffel, S.: Learning transformation models for ranking and survival analysis. J. Mach. Learn. Res. 12, 819–862 (2011)
 [34] Van Belle, V., Pelckmans, K., Van Huffel, S., Suykens, J.A.K.: Support vector methods for survival analysis: a comparison between ranking and regression approaches. Artif. Intell. Med. 53(2), 107–118 (2011)
 [35] Vishwanathan, S., Schraudolph, N.N., Kondor, R., Borgwardt, K.M.: Graph kernels. J. Mach. Learn. Res. 11, 1201–1242 (2010)
 [36] Zhang, W., Ota, T., Shridhar, V., Chien, J., Wu, B., Kuang, R.: Networkbased survival analysis reveals subnetwork signatures for predicting outcomes of ovarian cancer treatment. PLoS Comput. Biol. 9(3), e1002975 (2013)