PIMKL: Pathway Induced Multiple Kernel Learning
Matteo Manica1,2,*, Joris Cadow 1,2,*, Roland Mathis 1,*, María Rodríguez Martínez1,
{tte, dow, lth, mrm}@zurich.ibm.com
1 IBM Research Zürich
2 ETH - Zürich
* Shared first authorship
Corresponding author



Reliable identification of molecular biomarkers is essential for accurate patient stratification. While state-of-the-art machine learning approaches for sample classification continue to push boundaries in terms of performance, most of these methods are not able to integrate different data types and lack generalization power, which limits their application in a clinical setting. Furthermore, many methods behave as black boxes, so we have very little understanding of the mechanisms that lead to a given prediction. While opaqueness concerning machine behaviour might not be a problem in deterministic domains, in health care, providing explanations about the molecular factors and phenotypes that are driving the classification is crucial to build trust in the performance of the predictive system.


We propose Pathway Induced Multiple Kernel Learning (PIMKL), a novel methodology to classify samples reliably that can, at the same time, provide a pathway-based molecular fingerprint of the signature that underlies the classification. PIMKL exploits prior knowledge in the form of molecular interaction networks and annotated gene sets, by optimizing a mixture of pathway-induced kernels using a Multiple Kernel Learning algorithm (MKL), an approach that has demonstrated excellent performance in different machine learning applications. After optimizing the combination of kernels for prediction of a specific phenotype, the model provides a stable molecular signature that can be interpreted in the light of the ingested prior knowledge and that can be used in transfer learning tasks.



Keywords: molecular networks, pathways, molecular signatures, patient stratification, kernel methods, multiple kernel learning.

1 Introduction

Designing reliable and interpretable predictive models for patient stratification and biomarker discovery is a daunting challenge in computational biology. A plethora of methods based on molecular data have been proposed throughout the years, many of them exploiting prior knowledge about the molecular processes involved in the regulation of the phenotype to be predicted. Prior knowledge is frequently encoded as a molecular interaction network, where nodes represent genes or proteins and edges represent relationships between the connected nodes. Supporting the development of such methods, the number of databases reporting protein-protein interactions has seen an unprecedented growth in recent years, and databases such as STRING [43], OmniPath [46], Reactome [11, 16], IntAct [28], MINT [33], MatrixDB [8], HPRD [29], KEGG [49, 45, 27] or Pathway Commons [7], just to name a few, provide a useful resource for designing models informed about the underlying molecular processes.

Several studies have focused on comparing prior knowledge-based classification methods. For instance, Cun and Fröhlich [12] evaluated 14 machine learning approaches to predict the survival outcome of breast cancer patients. The methods included, among others: average pathway expression [21], classification by significant hub genes [44], pathway activity classification [31]; and a series of approaches based on Support Vector Machines (SVMs), such as network-based SVMs [50], recursive feature elimination SVMs [22], and graph diffusion kernels for SVMs [40, 19]. The study concluded that, while none of the evaluated approaches significantly improved the classification accuracy, the interpretability of the gene signatures obtained was greatly enhanced by the integration of prior knowledge.

A more recent benchmarking effort was provided by a collaboration between the National Cancer Institute (NCI) and the Dialogue on Reverse Engineering Assessment and Methods (DREAM) project [10]. The NCI-DREAM challenge proposed to identify top-performing methods for predicting therapeutic response in breast cancer cell lines using genomic, proteomic, and epigenomic data profiles. A total of 44 prediction algorithms were scored against an unpublished and hidden gold-standard data set. Two interesting conclusions emerged from the challenge. First, all top-performing methods modeled nonlinear relationships and incorporated biological pathway information, and second, performance was increased by including multiple, independent data sets. Interestingly, the top-performing methodology, Bayesian Multitask Multiple Kernel Learning, exploited a multiple kernel learning (MKL) framework [20].

MKL methods aim to construct a kernel using a weighted combination of base kernels. Finding the combined kernel amounts to learning the weighting coefficients for each base kernel, rather than optimizing the parameters of a single kernel. The advantage of MKL is two-fold. First, different kernels can encode various levels of information (e.g.: different definitions of similarity, different data modalities) endowing the algorithm with a great flexibility to model heterogeneous or multi-modal datasets. Secondly, after optimizing the combination of kernels, the weights associated with each kernel can provide valuable insights about the sets of features that are most informative for the classification task at hand.

We introduce here Pathway Induced Multiple Kernel Learning (PIMKL), a supervised classification algorithm for phenotype prediction from molecular data that jointly exploits the benefits of MKL and prior knowledge ingestion. More specifically, PIMKL uses an interaction network and a set of annotated gene sets to build a mixture of pathway-induced kernels from molecular data, which are then combined with an MKL algorithm. After PIMKL is trained, the weights assigned to each kernel provide information about the importance of the corresponding pathway in the mixture. As a result, a molecular signature is derived and can be used to characterize the phenotype of interest.

This paper is structured as follows. We first describe PIMKL and validate it by predicting disease-free survival for breast cancer samples from multiple cohorts. We benchmark PIMKL by comparing it with the methods analyzed in [12]. To evaluate its generalization power, we use the obtained molecular signature to predict disease-free survival on a different dataset, the METABRIC breast cancer cohort [13]. Finally, we test PIMKL capabilities to integrate diverse types of data and its robustness to noise by simultaneously using METABRIC gene expression (mRNA) and copy number alteration (CNA) data. Our analyses suggest that PIMKL can be successfully applied to a wide range of phenotype prediction problems, since it represents an extremely robust approach for the integration of multiple types of molecular data with prior knowledge.

2 Methods

PIMKL is a methodology for phenotype prediction using multi-modal molecular measurements, based on the optimization of a mixture of pathway-induced kernels. Such kernels are generated by exploiting prior knowledge in a dual fashion. Firstly, prior knowledge is injected in PIMKL in the form of a molecular interaction network, and secondly, as a set of annotated gene sets or pathways.

A key aspect of PIMKL is pathway induction, a method to generate similarity functions using topological properties of an interaction network. In practice, we use pathway gene sets with well-defined biological functions to define sub-networks with which we generate pathway-induced kernels. This mixture of pathway-induced kernels is then optimized to classify a phenotype of interest, and in doing so, each pathway is assigned a weight representing its importance to explain the phenotype. The established link between kernels and pathways enables PIMKL to identify which molecular mechanisms are important for the prediction of the considered phenotype, as shown in Figure 1.

Figure 1: PIMKL concept. Given a network topology describing molecular interactions, relevant sub-networks can be used to generate a mixture of pathway-induced kernels. The convex combination of kernels is then optimized to predict a phenotype of interest. The weights of the mixture assign an importance to each selected pathway, thereby shedding light on the molecular mechanisms that contribute to the specific phenotype.

2.1 Pathway Induction

PIMKL encodes information from a given pathway network topology by integrating it into a kernel. The approach of integrating pathway information into interaction-aware kernel similarity functions is here termed pathway induction. Specifically, we design kernel functions by utilizing a positive semidefinite (PSD) matrix $M$ encoding topological properties of a graph. The PSD matrix, representing the pathway, induces a weighted inner product:

$$k_M(x, x') = x^T M x',$$

where $k_M$ is a valid kernel if matrix $M$ is PSD [37]. There are multiple ways to define a PSD matrix encoding graph topological information.

In this work an encoding based on the symmetric normalized Laplacian matrix is adopted. Pathways are defined as weighted undirected graphs $G = (V, E, W)$, with $N_v$ nodes, $N_e$ edges and a diagonal weight matrix $W \in \mathbb{R}^{N_e \times N_e}$ representing, respectively: the molecular entities (e.g.: genes, proteins), their interactions and the weights associated to them.

We define as pathway-induced kernel the following similarity function for any pair of sample measurements $x, x' \in \mathbb{R}^{N_v}$:

$$k_L(x, x') = x^T L x' = x^T D^{-1/2} S W S^T D^{-1/2} x' = \Pi(x)^T \Pi(x'), \qquad \Pi(x) = W^{1/2} S^T D^{-1/2} x,$$

where $L$, $D$ and $S$ are respectively: the normalized Laplacian, the diagonal degree matrix and an ordered incidence matrix for the graph $G$ associated to a pathway.

The normalized Laplacian can be interpreted as a discrete Laplace operator representing a diffusion process over graph nodes. The pathway induction based on it introduces a mapping $\Pi$ from the original space with measurements for $N_v$ molecular entities into an $N_e$-dimensional feature space, where each interaction from the pathway is a dimension and the value along the edge is the discrete diffusion potential between the respective molecular entities' measurements. A schematic illustration of the mapping introduced using pathway induction can be seen in Figure 2 (see Supplementary S.1 for a detailed explanation about the formulation and the design of the kernel function).
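As a minimal numerical sketch of pathway induction (not the authors' implementation), the snippet below builds the normalized Laplacian of a toy three-node graph from an oriented incidence matrix and checks that the kernel value computed in node space matches the inner product in edge space. The function name `pathway_induction`, the toy adjacency matrix and the sample vectors are assumptions made for illustration, and the graph is assumed to have no isolated nodes.

```python
import numpy as np

def pathway_induction(adjacency):
    """Return (L, Pi) for a weighted undirected graph without isolated nodes.

    L is the symmetric normalized Laplacian; Pi maps node measurements
    to the edge (interaction) space so that L = Pi^T Pi.
    """
    A = np.asarray(adjacency, dtype=float)
    n = A.shape[0]
    rows, cols = np.nonzero(np.triu(A, k=1))   # one entry per undirected edge
    weights = A[rows, cols]
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))  # D^{-1/2} as a vector
    # Incidence matrix S (nodes x edges): +1 at one endpoint, -1 at the other.
    S = np.zeros((n, rows.size))
    S[rows, np.arange(rows.size)] = 1.0
    S[cols, np.arange(rows.size)] = -1.0
    # Pi = W^{1/2} S^T D^{-1/2}, shape (edges x nodes).
    Pi = np.sqrt(weights)[:, None] * S.T * d_inv_sqrt[None, :]
    return Pi.T @ Pi, Pi

# Toy pathway: a path graph 0 - 1 - 2 with unit weights (illustrative only).
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
L, Pi = pathway_induction(A)

x1 = np.array([1.0, 2.0, 3.0])   # hypothetical measurements for sample 1
x2 = np.array([0.5, -1.0, 2.0])  # hypothetical measurements for sample 2
k_node_space = x1 @ L @ x2
k_edge_space = (Pi @ x1) @ (Pi @ x2)
```

Working in edge space makes the interpretation explicit: each coordinate of `Pi @ x` is a degree-normalized difference between the measurements of two interacting entities.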

Figure 2: Pathway induction. Given a pathway adjacency matrix it is possible to map sample measurements from their original space (the space of the nodes) to the space of the interactions between the molecular entities. The example above shows how the mapping using pathway induction transforms the considered samples.

2.2 Pathway Induced Multiple Kernel Learning

PIMKL makes use of the concept of pathway induction, defined in 2.1, to implement a multiple kernel learning classification framework.

Consider a network recapitulating a comprehensive set of known molecular interactions represented by a graph $G = (V, E)$ with $N_v$ nodes and $N_e$ edges, and a set of molecular measurements $X \in \mathbb{R}^{N \times N_v}$ with associated labels $y$ for a relevant phenotype.

Given a selection of pathways $\mathcal{P}$ (e.g.: gene sets from ontologies or inferred via community detection), it is possible to extract for each pathway $p \in \mathcal{P}$ its corresponding sub-graph $G_p = (V_p, E_p)$ with nodes $V_p \subseteq V$ and edges $E_p \subseteq E$, and a sub-selection of measurements $X_p$ corresponding to the genes contained in the pathway.

The Gram matrix $K_p$ representing the kernel for every pathway $p$ can be computed for each pair of samples $i$ and $j$ using the following:

$$K_p[i, j] = (x_i^{(p)})^T L_p \, x_j^{(p)},$$

where $x_i^{(p)}$ and $x_j^{(p)}$ are the rows of $X_p$ for samples $i$ and $j$, and $L_p$ is the normalized Laplacian for $G_p$. The mixture of kernels used in PIMKL is given by a convex combination of the different pathway kernels considered:

$$K = \sum_{p \in \mathcal{P}} w_p K_p, \qquad w_p \geq 0, \qquad \sum_{p \in \mathcal{P}} w_p = 1.$$

In order to weigh components of the kernel mixture to optimize the prediction of phenotype $y$, any supervised multiple kernel learning algorithm can be used. In this work a custom version of EasyMKL [1] was implemented, where the weighted combination of kernels is used for phenotype prediction through Kernel method for the Optimization of the Margin Distribution (KOMD) [2].
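The construction of per-pathway Gram matrices and their convex combination can be sketched as follows. This is an illustration under stated assumptions, not the custom EasyMKL implementation: the helper names, the fully connected toy network, the two gene sets and the hand-picked weights are all hypothetical, and in PIMKL the weights would be learned by the MKL algorithm rather than fixed.

```python
import numpy as np

def normalized_laplacian(adjacency):
    """Symmetric normalized Laplacian I - D^{-1/2} A D^{-1/2}."""
    A = np.asarray(adjacency, dtype=float)
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))  # assumes no isolated nodes
    return np.eye(A.shape[0]) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]

def pathway_gram(X, adjacency, gene_idx):
    """Gram matrix K_p[i, j] = x_i^T L_p x_j restricted to the pathway genes."""
    sub_adjacency = np.asarray(adjacency)[np.ix_(gene_idx, gene_idx)]
    X_p = X[:, gene_idx]
    return X_p @ normalized_laplacian(sub_adjacency) @ X_p.T

def kernel_mixture(grams, weights):
    """Convex combination of Gram matrices."""
    w = np.asarray(weights, dtype=float)
    if np.any(w < 0) or not np.isclose(w.sum(), 1.0):
        raise ValueError("weights must be non-negative and sum to one")
    return sum(w_p * K_p for w_p, K_p in zip(w, grams))

# Hypothetical data: 4 samples, 5 genes, fully connected interaction network.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 5))
network = np.ones((5, 5)) - np.eye(5)
pathways = [[0, 1, 2], [2, 3, 4]]          # illustrative gene sets

grams = [pathway_gram(X, network, genes) for genes in pathways]
K = kernel_mixture(grams, [0.7, 0.3])      # weights fixed by hand here
```

Because each pathway kernel is PSD and the weights are non-negative, the mixture `K` is itself a valid kernel and can be passed to any kernel classifier that accepts a precomputed Gram matrix.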

It is important to note that the PIMKL formulation enables a seamless integration of multi-omic data. Kernels from different data types can be generated and added to the mixture in a natural way. The same applies to multi-modal data integration: kernels generated from other data modalities associated with a specific sample (e.g.: histopathology images, clinical records) can be added to the mixture and weighted according to their contribution in the classification problem.

3 Results

In the following sections the application of PIMKL to different breast cancer cohorts is discussed. First, in 3.1, the methodology proposed was validated following a review [12] where different algorithms for phenotype prediction and gene selections using prior knowledge were compared. Later, PIMKL was applied on gene expression and copy number data from METABRIC [13] with two purposes: on one hand to test if transfer learning between different studies is possible and on the other hand to show how PIMKL can be used for multi-omics analysis in presence of noise or uninformative data.

3.1 PIMKL on breast cancer microarray cohorts

PIMKL was validated on gene expression data measured using microarray from six breast cancer cohorts (see Supplementary Table S1 for cohort-specific details).

The classification task consisted in stratifying breast cancer samples using gene expression data for relapse within 5 years. To ensure the fairest possible comparison the same interaction sources from the review were used, namely a merge between KEGG pathways and Pathway Commons. As access to the older release of KEGG is restricted, the most recent versions from both sources were used. A collection of hallmark gene sets from Molecular Signatures Database (MSigDB) [32] was used to define the sub-graphs used for pathway induction, generating 50 kernels.

The classification performance was evaluated by means of the Area Under the receiver operating characteristic Curve (AUC). We ensured a fair comparison by adhering to data processing procedures and the cross-validation scheme proposed in the review (for details see Supplementary Algorithm S1).
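For readers unfamiliar with the metric, the AUC equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. A small sketch of that definition is given below; the function name `roc_auc`, the labels and the scores are made up for illustration and are unrelated to the cohorts analyzed here.

```python
import numpy as np

def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[labels], scores[~labels]
    if pos.size == 0 or neg.size == 0:
        raise ValueError("both classes must be present")
    diff = pos[:, None] - neg[None, :]          # all positive/negative pairs
    correct = np.sum(diff > 0) + 0.5 * np.sum(diff == 0)
    return correct / (pos.size * neg.size)

# Illustrative labels (1 = relapse) and classifier scores.
labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.4, 0.35, 0.8, 0.7, 0.1]
auc = roc_auc(labels, scores)   # 7 of the 9 pairs are ordered correctly
```

The pairwise formulation is quadratic in the number of samples, which is acceptable for a sketch; rank-based implementations are preferable for large cohorts.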

Results for PIMKL, compared with those obtained by the 14 other algorithms considered in the study, are reported in Figure 3. Overall AUC values for the six cohorts over the cross-validation rounds of the considered methods are shown in Figure 3(a). PIMKL exhibits the highest median value since it consistently outperforms other methods or is in the top performers group on single cohorts (see Supplementary Figure S1).

PIMKL generates a molecular signature that is given by the weighted contributions of the kernels. Each weight corresponds to the hallmark pathway used for pathway induction. To evaluate the stability of the signature, the pathway weight distribution over cross-validation rounds was analyzed. The case where each kernel has the same weight (no enrichment) is considered as a baseline: $w_0 = 1/N_p$, where $N_p$ is the number of kernels (here 50). To find a significant enrichment of a given pathway, the distribution of the kernel weights with median above $w_0$ was tested using the Wilcoxon signed-rank test. $p$-values at significance level 0.001 were corrected for multiple testing using Benjamini-Hochberg (for details see Supplementary Figure S2). In Figure 3(b) the most significant pathways for all cohorts are reported.
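The multiple-testing step can be sketched as below, assuming the per-pathway p-values have already been obtained from a one-sided Wilcoxon signed-rank test of the weights against the uniform baseline (for instance with `scipy.stats.wilcoxon` and `alternative="greater"`). The function name and the p-values are illustrative placeholders, not results from this study.

```python
import numpy as np

def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    scaled = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity, starting from the largest p-value.
    adjusted_sorted = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.minimum(adjusted_sorted, 1.0)
    return adjusted

n_kernels = 50
baseline = 1.0 / n_kernels          # uniform weight, i.e. no enrichment

# Hypothetical raw p-values for four pathways (placeholders).
raw_p = [0.0001, 0.04, 0.0002, 0.5]
adjusted_p = benjamini_hochberg(raw_p)
significant = adjusted_p < 0.001    # significance level used in the text
```

With these placeholder values, the first and third pathways remain significant after correction while the other two do not.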

The agreement between the weights highlights the stability of the molecular signature found with PIMKL, suggesting that it is specific to the problem of survival prediction for breast cancer tissues, as shown in Figure 5(a).

Most notably, the heme metabolism pathway was significant for all cohorts. This pathway is involved in the metabolism of heme and erythroblast differentiation. A possible explanation is that heme metabolism might reflect an active vascularization of the samples, a phenomenon widely observed in cancer progression [23].

A more intriguing hypothesis is that there might be an association between elevated heme metabolism and cancer progression, as it has been reported in non-small-cell lung cancer cells and xenograft tumors [25].

It is also interesting to look at other pathways that are significant in at least five cohorts: KRAS signaling, myogenesis, allograft rejection, coagulation, P53 pathway and peroxisome.

All of these pathways are very relevant for breast cancer. For instance, activation of KRAS signaling has been reported to promote the mesenchymal features of basal-type breast cancer [30, 38]. Myogenesis, or the process of formation of muscular tissue, is commonly disrupted in cancer [24]. Allograft rejection might reflect an immune-mediated tumour rejection signature following administration of immunotherapeutic agents [5]. Several studies have suggested a role for blood coagulation proteins in tumour progression [34, 6, 17]. P53 is the most commonly mutated protein in cancer [47, 35]. Finally, peroxisomes are small, membrane-enclosed organelles that contain enzymes involved in a variety of metabolic reactions, including several aspects of energy metabolism. Altered peroxisome metabolism has been linked to various diseases, including cancer [14, 18].

Figure 3: PIMKL cross-validation results. (a) Box plots for AUC values over all cohorts for the methods considered. PIMKL results are reported in red, while other methods results are colored in blue. Box plots are obtained from ten (repeats of) mean AUC values over 10-fold cross-validation splits, see algorithm S1. (b) Heat map showing significant pathways selected in PIMKL molecular signature across the different cohorts considered in the study. Significant pathways are highlighted in red, while non-significant are colored in blue.

3.2 PIMKL on METABRIC cohort

To validate findings from the analysis of the cohorts in 3.1 PIMKL was applied to a larger breast cancer cohort from METABRIC where multiple omics levels are available. The proposed approach was used to analyze 1890 samples where Illumina Human v3 microarray (mRNA) and Affymetrix SNP 6.0 copy number (CNA) measurements were considered (Supplementary Table S2). First, in order to validate the generalization power of the molecular signature estimated with PIMKL, the analysis was focused on microarray data. The main assumption is that while predicting a related phenotype, disease free survival versus relapse within five years, the molecular mechanisms underlying the two processes are similar, since in both cases the proliferation of the tumor plays a major role.

After computing the pathway-induced kernels with the same procedure adopted in 3.1, a set of weights for each pathway was defined by considering the median of the weights obtained for the six cohorts previously analyzed. Figure 4 shows the results obtained by training a KOMD classifier using the weights transferred from the other cohorts, as well as by learning METABRIC-specific pathway weights (for details see Supplementary Algorithm S2). The molecular signature learned on the Cun and Fröhlich datasets performs comparably to the re-trained one. Indeed, the two signatures are highly correlated (Pearson correlation, Figure 5(b)). It is important to notice that the variance of the prediction results is also consistently reduced, probably due to the different quality of the microarray data produced.
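The transfer step described above can be illustrated with a short sketch: take the pathway-wise median of the per-cohort weight vectors and compare the result with a re-trained signature through the Pearson correlation. All numbers below are randomly generated placeholders, the function name is hypothetical, and the rescaling of the median so that the weights sum to one is an assumption of this sketch that the text does not state.

```python
import numpy as np

def transferred_signature(cohort_weights):
    """Pathway-wise median over cohorts, rescaled to sum to one (assumption)."""
    median = np.median(np.asarray(cohort_weights, dtype=float), axis=0)
    return median / median.sum()

rng = np.random.default_rng(0)
n_cohorts, n_pathways = 6, 5

# Placeholder signatures: one weight vector per benchmarking cohort.
cohort_weights = rng.dirichlet(np.ones(n_pathways), size=n_cohorts)
signature = transferred_signature(cohort_weights)

# Placeholder for weights re-trained on the new cohort.
retrained = rng.dirichlet(np.ones(n_pathways))

# Pearson correlation between the transferred and re-trained signatures.
pearson_r = np.corrcoef(signature, retrained)[0, 1]
```

The transferred `signature` can then replace learned weights in the kernel mixture, so that only the classifier, and not the weights, is fitted on the new cohort.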

To further test its potential, PIMKL was then applied to both omic data types available for the same predictive task. A set of additional kernels was generated using the copy number data and then used in two ways: first as a mixture on the CNA level alone, and then as part of a mixture together with the mRNA kernels. Figure 6 shows that the signal from CNA data is less relevant than the one from mRNA for the prediction of disease free survival. Nevertheless, it is interesting to notice that PIMKL is able to discard noisy kernels and achieve the same performance even when considering pathway-induced kernels from both levels at once. This suggests that the application of the proposed algorithm is feasible even in a context where no prior knowledge about the information content of the single omic levels is available.

Figure 4: PIMKL performance on METABRIC. Box plots of the performance of PIMKL over the six cohorts used to benchmark the method (left of the dashed vertical line) and its application on METABRIC for disease free survival prediction (right of the dashed vertical line). Optimized weights at training by EasyMKL (blue); provided weights from taking the pathway-wise median weights of the six signatures obtained during benchmarking (red).


Figure 5: Transfer of molecular signature. (a) Heat map reporting the correlation of the molecular signature estimated across multiple cohorts. Studies exhibit a positive correlation, significant in most cases, testifying the stability of the molecular signature obtained with PIMKL. Correlation values are reported in the lower triangular part of the heat map (since it is symmetric) on blue to red scale, white squares indicate non significant correlations. (b) Regression of pathway weights in the signature from training on METABRIC (median over 100 cross-validation folds) against the transferred signature obtained from the median over 6 benchmarking signatures (each median over 100 cross-validation folds) indicating high correlation of the two signatures.

Figure 6: PIMKL performance on METABRIC multi-omics. Box plots for AUC values obtained applying PIMKL on different data types and their integration. CNA only results are reported in blue, mRNA ones in green and their integration in orange.

4 Discussion

We have presented here PIMKL (Pathway Induced Multiple Kernel Learning), a novel, effective and interpretable machine learning methodology for patient stratification. PIMKL exploits prior knowledge in the form of molecular interaction networks and sets of annotated pathways with known biological functions to build a mixture of pathway-induced kernels. The kernel weights are later optimized to classify a phenotype or a clinical variable of interest.

In this work PIMKL was extensively applied to the problem of predicting disease-free survival for breast cancer samples. We have demonstrated that the resulting weighted combination of kernels represents a phenotypic molecular signature and provides direct insights into which kernels, and thereby which types of data and molecular interactions, were most relevant in the context of breast cancer patient stratification.

As a benchmark, a well-studied set of cohorts previously analyzed using a range of stratification methods has been adopted. The quality and the stability of the obtained signatures were confirmed by the results obtained, where we showed that PIMKL was able to outperform other methods and find consistent molecular signatures across different breast cancer cohorts. Given the agreement of the estimated signatures we combined them in a unique one to analyze its generalization power. We tested PIMKL signature on unseen mRNA data from METABRIC for prediction of disease-free survival. The results obtained confirmed that the algorithm can be used to effectively gain insights into disease progression and that this knowledge can be transferred to other cohorts without loss of performance.

Furthermore, PIMKL can be seamlessly applied to integrate data from different omic layers. Its intrinsic capability to discard noisy molecular features has been demonstrated by applying it on METABRIC, where it was possible to integrate multiple types of data with varying predictive power, and PIMKL was able to discard uninformative kernels while maintaining stable results.

Clearly PIMKL is not restricted to breast cancer, the specific omic data types or the sources of prior information used in this work. Its application is open to other disease types using any combination of omic data together with any suitable prior network and sets of genes.

Besides using different types of prior knowledge, the proposed approach is also highly flexible in regard to the number and the nature of the selected kernels. Indeed, PIMKL was developed by making use of an efficient implementation of EasyMKL, an extremely scalable MKL algorithm with constant memory complexity in the number of kernels. It therefore allows smaller pathways to be defined, leading to a more fine-grained characterization and understanding of the molecular mechanisms involved in disease progression with limited performance drawbacks.

Finally, possible extensions of PIMKL such as optimizing the kernel mixture using semi-supervised or unsupervised multiple kernel learning methodologies [36] may help the discovery of phenotype-independent pathway signatures and will be explored in the future. To summarize, PIMKL provides a flexible and scalable method to translate prior knowledge and molecular data into actionable insights in a clinical setting.



The project leading to this application has received funding from the European Union’s Horizon 2020 research and innovation program under grant agreement No 668858. We thank Yupeng Cun for providing results [12] for recreation of Figures S1 and 3(a).

Authors contributions

M.M., J.C., R.M. and M.R.M. conceived the study and analyses. M.M., J.C. and R.M. implemented PIMKL and performed data analysis. M.R.M. provided biological analysis and interpretation. M.M., J.C., R.M. and M.R.M. wrote the manuscript with input from all authors.

Competing financial interests

The authors declare no competing financial interest.

Availability of data and materials

Data and materials used to produce the results presented in this work can be downloaded from the following link https://ibm.box.com/s/ac2ilhyn7xjj27r0xiwtom4crccuobst. PIMKL as a service will be soon available on IBM Cloud.


  • Aiolli and Donini [2015] Fabio Aiolli and Michele Donini. EasyMKL: A scalable multiple kernel learning algorithm. Neurocomputing, 169:215–224, 2015. ISSN 18728286. doi: 10.1016/j.neucom.2014.11.078. URL http://dx.doi.org/10.1016/j.neucom.2014.11.078.
  • Aiolli et al. [2008] Fabio Aiolli, Giovanni Da San Martino, and Alessandro Sperduti. A kernel method for the optimization of the margin distribution. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 5163 LNCS(PART 1):305–314, 2008. ISSN 03029743. doi: 10.1007/978-3-540-87536-9_32.
  • Anderson and Morley [1985] William N. Anderson and Thomas D. Morley. Eigenvalues of the Laplacian of a Graph. Linear and Multilinear Algebra, 18(2):141–145, 1985. ISSN 15635139. doi: 10.1080/03081088508817681. URL http://www.math.ucsd.edu/~fan/research/cb/ch1.pdf.
  • Barrett et al. [2013] Tanya Barrett, Stephen E. Wilhite, Pierre Ledoux, Carlos Evangelista, Irene F. Kim, Maxim Tomashevsky, Kimberly A. Marshall, Katherine H. Phillippy, Patti M. Sherman, Michelle Holko, Andrey Yefanov, Hyeseung Lee, Naigong Zhang, Cynthia L. Robertson, Nadezhda Serova, Sean Davis, and Alexandra Soboleva. NCBI GEO: Archive for functional genomics data sets - Update. Nucleic Acids Research, 41(D1), 2013. ISSN 03051048. doi: 10.1093/nar/gks1193.
  • Bedognetti et al. [2015] Davide Bedognetti, Wouter Hendrickx, Francesco M. Marincola, and Lance D. Miller. Prognostic and predictive immune gene signatures in breast cancer. Current Opinion in Oncology, 27(6):433–444, nov 2015. doi: 10.1097/cco.0000000000000234. URL https://doi.org/10.1097/cco.0000000000000234.
  • Belting et al. [2005] M. Belting, J. Ahamed, and W. Ruf. Signaling of the tissue factor coagulation pathway in angiogenesis and cancer. Arterioscler. Thromb. Vasc. Biol., 25(8):1545–1550, Aug 2005.
  • Cerami et al. [2011] Ethan G. Cerami, Benjamin E. Gross, Emek Demir, Igor Rodchenkov, Özgün Babur, Nadia Anwar, Nikolaus Schultz, Gary D. Bader, and Chris Sander. Pathway Commons, a web resource for biological pathway data. Nucleic Acids Research, 39(SUPPL. 1), 2011. ISSN 03051048. doi: 10.1093/nar/gkq1039.
  • Chautard et al. [2009] Emilie Chautard, Lionel Ballut, Nicolas Thierry-Mieg, and Sylvie Ricard-Blum. Matrixdb, a database focused on extracellular protein–protein and protein–carbohydrate interactions. Bioinformatics, 25(5):690–691, 2009.
  • Chen et al. [2011] Li Chen, Jianhua Xuan, Rebecca B. Riggins, Robert Clarke, and Yue Wang. Identifying cancer biomarkers by network-constrained support vector machines. BMC Systems Biology, 5, 2011. ISSN 17520509. doi: 10.1186/1752-0509-5-161.
  • Costello et al. [2014] James C Costello, Laura M Heiser, Elisabeth Georgii, Mehmet Gönen, Michael P Menden, Nicholas J Wang, Mukesh Bansal, Petteri Hintsanen, Suleiman A Khan, John-Patrick Mpindi, et al. A community effort to assess and improve drug sensitivity prediction algorithms. Nature biotechnology, 32(12):1202, 2014.
  • Croft et al. [2014] David Croft, Antonio Fabregat Mundo, Robin Haw, Marija Milacic, Joel Weiser, Guanming Wu, Michael Caudy, Phani Garapati, Marc Gillespie, Maulik R. Kamdar, Bijay Jassal, Steven Jupe, Lisa Matthews, Bruce May, Stanislav Palatnik, Karen Rothfels, Veronica Shamovsky, Heeyeon Song, Mark Williams, Ewan Birney, Henning Hermjakob, Lincoln Stein, and Peter D’Eustachio. The reactome pathway knowledgebase. Nucleic Acids Research, 42(D1):D472–D477, 2014. doi: 10.1093/nar/gkt1102. URL http://dx.doi.org/10.1093/nar/gkt1102.
  • Cun and Fröhlich [2012] Yupeng Cun and H Fröhlich. Prognostic gene signatures for patient stratification in breast cancer-accuracy, stability and interpretability of gene selection approaches using prior knowledge. BMC bioinformatics, 2012. URL http://www.biomedcentral.com/content/pdf/1471-2105-13-69.pdf.
  • Curtis et al. [2012] Christina Curtis, Sohrab P. Shah, Suet-Feung Chin, Gulisa Turashvili, Oscar M. Rueda, Mark J. Dunning, Doug Speed, Andy G. Lynch, Shamith Samarajiwa, Yinyin Yuan, Stefan Gräf, Gavin Ha, Gholamreza Haffari, Ali Bashashati, Roslin Russell, Steven McKinney, METABRIC Group, Anita Langerød, Andrew Green, Elena Provenzano, Gordon Wishart, Sarah Pinder, Peter Watson, Florian Markowetz, Leigh Murphy, Ian Ellis, Arnie Purushotham, Anne-Lise Børresen-Dale, James D. Brenton, Simon Tavaré, Carlos Caldas, and Samuel Aparicio. The genomic and transcriptomic architecture of 2,000 breast tumours reveals novel subgroups. Nature, 486:346–352, 2012. URL http://dx.doi.org/10.1038/nature10983.
  • Delille et al. [2006] H. K. Delille, N. A. Bonekamp, and M. Schrader. Peroxisomes and disease - an overview. Int J Biomed Sci, 2(4):308–314, Dec 2006.
  • Desmedt et al. [2007] Christine Desmedt, Fanny Piette, Sherene Loi, Yixin Wang, Françoise Lallemand, Benjamin Haibe-Kains, Giuseppe Viale, Mauro Delorenzi, Yi Zhang, Mahasti Saghatchian D’Assignies, Jonas Bergh, Rosette Lidereau, Paul Ellis, Adrian L Harris, Jan G M Klijn, John A Foekens, Fatima Cardoso, Martine J Piccart, Marc Buyse, Christos Sotiriou, and TRANSBIG Consortium. Strong time dependence of the 76-gene prognostic signature for node-negative breast cancer patients in the TRANSBIG multicenter independent validation series. Clinical cancer research : an official journal of the American Association for Cancer Research, 13(11):3207–14, jun 2007. ISSN 1078-0432. doi: 10.1158/1078-0432.CCR-06-2765. URL http://www.ncbi.nlm.nih.gov/pubmed/17545524.
  • Fabregat et al. [2018] Antonio Fabregat, Steven Jupe, Lisa Matthews, Konstantinos Sidiropoulos, Marc Gillespie, Phani Garapati, Robin Haw, Bijay Jassal, Florian Korninger, Bruce May, Marija Milacic, Corina Duenas Roca, Karen Rothfels, Cristoffer Sevilla, Veronica Shamovsky, Solomon Shorser, Thawfeek Varusai, Guilherme Viteri, Joel Weiser, Guanming Wu, Lincoln Stein, Henning Hermjakob, and Peter D’Eustachio. The reactome pathway knowledgebase. Nucleic Acids Research, 46(D1):D649–D655, 2018. doi: 10.1093/nar/gkx1132. URL http://dx.doi.org/10.1093/nar/gkx1132.
  • Falanga et al. [2013] A. Falanga, M. Marchetti, and A. Vignoli. Coagulation and cancer: biological and clinical aspects. Journal of Thrombosis and Haemostasis, 11(2):223–233, feb 2013. doi: 10.1111/jth.12075. URL https://doi.org/10.1111/jth.12075.
  • Fransen et al. [2012] Marc Fransen, Marcus Nordgren, Bo Wang, and Oksana Apanasets. Role of peroxisomes in ROS/RNS-metabolism: Implications for human disease. Biochimica et Biophysica Acta (BBA) - Molecular Basis of Disease, 1822(9):1363–1373, sep 2012. doi: 10.1016/j.bbadis.2011.12.001. URL https://doi.org/10.1016/j.bbadis.2011.12.001.
  • Gao et al. [2009] Cuilan Gao, Xin Dang, Yixin Chen, and Dawn Wilkins. Graph ranking for exploratory gene data analysis. In BMC bioinformatics, volume 10, page S19. BioMed Central, 2009.
  • Gönen and Alpaydın [2011] Mehmet Gönen and Ethem Alpaydın. Multiple kernel learning algorithms. Journal of machine learning research, 12(Jul):2211–2268, 2011.
  • Guo et al. [2005] Zheng Guo, Tianwen Zhang, Xia Li, Qi Wang, Jianzhen Xu, Hui Yu, Jing Zhu, Haiyun Wang, Chenguang Wang, Eric J. Topol, Qing Wang, and Shaoqi Rao. Towards precise classification of cancers based on robust gene functional expression profiles. BMC Bioinformatics, 6, 2005. ISSN 14712105. doi: 10.1186/1471-2105-6-58.
  • Guyon et al. [2002] Isabelle Guyon, Jason Weston, Stephen Barnhill, and Vladimir Vapnik. Gene selection for cancer classification using support vector machines. Machine learning, 46(1-3):389–422, 2002.
  • Hillen and Griffioen [2007] F. Hillen and A. W. Griffioen. Tumour vascularization: sprouting angiogenesis and beyond. Cancer Metastasis Rev., 26(3-4):489–502, Dec 2007.
  • Hogan et al. [2017] K. A. Hogan, D. S. Cho, P. C. Arneson, A. Samani, P. Palines, Y. Yang, and J. D. Doles. Tumor-derived cytokines impair myogenesis and alter the skeletal muscle immune microenvironment. Cytokine, Nov 2017.
  • Hooda et al. [2015] J Hooda, MM Alam, and L Zhang. Evaluating the association of heme and heme metabolites with lung cancer bioenergetics and progression. Metabolomics, 5(3):1000150, 2015.
  • Ivshina et al. [2006] A. V. Ivshina, J. George, O. Senko, B. Mow, T. C. Putti, J. Smeds, T. Lindahl, Y. Pawitan, P. Hall, H. Nordgren, J. E.L. Wong, E. T. Liu, J. Bergh, V. A. Kuznetsov, L. D. Miller, M Buyse, MJ Van de Vijver, J Bergh, M Piccart, M Delorenzi, J Younger, U Balis, J Michaelson, A Bhan, K Habin, TM Baer, J Brugge, DA Haber, MG Erlander, and DC Sgroi. Genetic Reclassification of Histologic Grade Delineates New Clinical Subtypes of Breast Cancer. Cancer Research, 66(21):10292–10301, 2006. ISSN 0008-5472. doi: 10.1158/0008-5472.CAN-05-4414. URL http://cancerres.aacrjournals.org/cgi/doi/10.1158/0008-5472.CAN-05-4414.
  • Kanehisa and Goto [2000] M Kanehisa and S Goto. KEGG: kyoto encyclopedia of genes and genomes. Nucleic acids research, 28(1):27–30, jan 2000. ISSN 0305-1048. URL http://www.ncbi.nlm.nih.gov/pubmed/10592173 http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=PMC102409.
  • Kerrien et al. [2011] Samuel Kerrien, Bruno Aranda, Lionel Breuza, Alan Bridge, Fiona Broackes-Carter, Carol Chen, Margaret Duesbury, Marine Dumousseau, Marc Feuermann, Ursula Hinz, et al. The intact molecular interaction database in 2012. Nucleic acids research, 40(D1):D841–D846, 2011.
  • Keshava Prasad et al. [2008] TS Keshava Prasad, Renu Goel, Kumaran Kandasamy, Shivakumar Keerthikumar, Sameer Kumar, Suresh Mathivanan, Deepthi Telikicherla, Rajesh Raju, Beema Shafreen, Abhilash Venugopal, et al. Human protein reference database—2009 update. Nucleic acids research, 37(suppl_1):D767–D772, 2008.
  • Kim et al. [2015] R. K. Kim, Y. Suh, K. C. Yoo, Y. H. Cui, H. Kim, M. J. Kim, I. Gyu Kim, and S. J. Lee. Activation of KRAS promotes the mesenchymal features of basal-type breast cancer. Exp. Mol. Med., 47:e137, Jan 2015.
  • Lee et al. [2008] Eunjung Lee, Han-Yu Chuang, Jong-Won Kim, Trey Ideker, and Doheon Lee. Inferring pathway activity toward precise disease classification. PLoS computational biology, 4(11):e1000217, 2008.
  • Liberzon et al. [2015] Arthur Liberzon, Chet Birger, Helga Thorvaldsdóttir, Mahmoud Ghandi, Jill P. Mesirov, and Pablo Tamayo. The Molecular Signatures Database Hallmark Gene Set Collection. Cell Systems, 1(6):417–425, dec 2015. ISSN 24054712. doi: 10.1016/j.cels.2015.12.004. URL https://www.sciencedirect.com/science/article/pii/S2405471215002185.
  • Licata et al. [2011] Luana Licata, Leonardo Briganti, Daniele Peluso, Livia Perfetto, Marta Iannuccelli, Eugenia Galeota, Francesca Sacco, Anita Palma, Aurelio Pio Nardozza, Elena Santonico, et al. Mint, the molecular interaction database: 2012 update. Nucleic acids research, 40(D1):D857–D861, 2011.
  • Lima and Monteiro [2013] Luize G. Lima and Robson Q. Monteiro. Activation of blood coagulation in cancer: implications for tumour progression. Bioscience Reports, 33(5):701–710, sep 2013. doi: 10.1042/bsr20130057. URL https://doi.org/10.1042/bsr20130057.
  • Mandinova and Lee [2011] A. Mandinova and S. W. Lee. The p53 pathway as a target in cancer therapeutics: Obstacles and promise. Science Translational Medicine, 3(64):64rv1–64rv1, jan 2011. doi: 10.1126/scitranslmed.3001366. URL https://doi.org/10.1126/scitranslmed.3001366.
  • Mariette and Villa-Vialaneix [2017] Jérôme Mariette and Nathalie Villa-Vialaneix. Unsupervised multiple kernel learning for heterogeneous data integration. Bioinformatics, 34(2009), 2017. ISSN 1367-4803. doi: 10.1093/bioinformatics/btx682. URL http://academic.oup.com/bioinformatics/article/doi/10.1093/bioinformatics/btx682/4565592/Unsupervised-multiple-kernel-learning-for.
  • Bishop [2006] Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, page 738, 2006.
  • Najumudeen et al. [2016] A K Najumudeen, A Jaiswal, B Lectez, C Oetken-Lindholm, C Guzmán, E Siljamäki, I M D Posada, E Lacey, T Aittokallio, and D Abankwa. Cancer stem cell drugs target k-ras signaling in a stemness context. Oncogene, 35(40):5248–5262, mar 2016. doi: 10.1038/onc.2016.59. URL https://doi.org/10.1038/onc.2016.59.
  • Pawitan et al. [2005] Yudi Pawitan, Judith Bjöhle, Lukas Amler, Anna-Lena Borg, Suzanne Egyhazi, Per Hall, Xia Han, Lars Holmberg, Fei Huang, Sigrid Klaar, Edison T Liu, Lance Miller, Hans Nordgren, Alexander Ploner, Kerstin Sandelin, Peter M Shaw, Johanna Smeds, Lambert Skoog, Sara Wedrén, and Jonas Bergh. Gene expression profiling spares early breast cancer patients from adjuvant therapy: derived and validated in two population-based cohorts. Breast Cancer Research, 7(6):R953, 2005. ISSN 1465-542X. doi: 10.1186/bcr1325. URL http://breast-cancer-research.biomedcentral.com/articles/10.1186/bcr1325.
  • Rapaport et al. [2007] Franck Rapaport, Andrei Zinovyev, Marie Dutreix, Emmanuel Barillot, and Jean Philippe Vert. Classification of microarray data using gene networks. BMC Bioinformatics, 8, 2007. ISSN 14712105. doi: 10.1186/1471-2105-8-35.
  • Schmidt et al. [2008] Marcus Schmidt, Daniel Böhm, Christian Von Törne, Eric Steiner, Alexander Puhl, Henryk Pilch, Hans Anton Lehr, Jan G. Hengstler, Heinz Kölbl, and Mathias Gehrmann. The humoral immune system has a key prognostic impact in node-negative breast cancer. Cancer Research, 68(13):5405–5413, 2008. ISSN 00085472. doi: 10.1158/0008-5472.CAN-07-5206.
  • Sotiriou et al. [2006] Christos Sotiriou, Pratyaksha Wirapati, Sherene Loi, Adrian Harris, Steve Fox, Johanna Smeds, Hans Nordgren, Pierre Farmer, Viviane Praz, Benjamin Haibe-Kains, Christine Desmedt, Denis Larsimont, Fatima Cardoso, Hans Peterse, Dimitry Nuyten, Marc Buyse, Marc J. Van de Vijver, Jonas Bergh, Martine Piccart, and Mauro Delorenzi. Gene expression profiling in breast cancer: Understanding the molecular basis of histologic grade to improve prognosis. Journal of the National Cancer Institute, 98(4):262–272, 2006. ISSN 00278874. doi: 10.1093/jnci/djj052.
  • Szklarczyk et al. [2017] Damian Szklarczyk, John H Morris, Helen Cook, Michael Kuhn, Stefan Wyder, Milan Simonovic, Alberto Santos, Nadezhda T Doncheva, Alexander Roth, Peer Bork, Lars J Jensen, and Christian von Mering. The STRING database in 2017: quality-controlled protein-protein association networks, made broadly accessible. Nucleic acids research, 45(D1):D362–D368, jan 2017. ISSN 1362-4962. doi: 10.1093/nar/gkw937. URL http://www.ncbi.nlm.nih.gov/pubmed/27924014 http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=PMC5210637.
  • Taylor et al. [2009] Ian W Taylor, Rune Linding, David Warde-Farley, Yongmei Liu, Catia Pesquita, Daniel Faria, Shelley Bull, Tony Pawson, Quaid Morris, and Jeffrey L Wrana. Dynamic modularity in protein interaction networks predicts breast cancer outcome. Nature biotechnology, 27(2):199, 2009.
  • Tenenbaum D [2016] Tenenbaum D. KEGGREST: Client-side REST access to KEGG, 2016.
  • Türei et al. [2016] Dénes Türei, Tamás Korcsmáros, and Julio Saez-Rodriguez. Omnipath: guidelines and gateway for literature-curated signaling pathway resources. Nature methods, 13(12):966, 2016.
  • Vazquez et al. [2008] Alexei Vazquez, Elisabeth E. Bond, Arnold J. Levine, and Gareth L. Bond. The genetics of the p53 pathway, apoptosis and cancer therapy. Nature Reviews Drug Discovery, 7(12):979–987, dec 2008. doi: 10.1038/nrd2656. URL https://doi.org/10.1038/nrd2656.
  • Wang et al. [2005] Yixin Wang, Jan G.M. Klijn, Yi Zhang, Anieta M. Sieuwerts, Maxime P. Look, Fei Yang, Dmitri Talantov, Mieke Timmermans, Marion E. Meijer-Van Gelder, Jack Yu, Tim Jatkoe, Els M.J.J. Berns, David Atkins, and John A. Foekens. Gene-expression profiles to predict distant metastasis of lymph-node-negative primary breast cancer. Lancet, 365(9460):671–679, 2005. ISSN 01406736. doi: 10.1016/S0140-6736(05)17947-1.
  • Zhang and Wiemann [2009] Jitao David Zhang and Stefan Wiemann. KEGGgraph: A graph approach to KEGG PATHWAY in R and bioconductor. Bioinformatics, 25(11):1470–1471, 2009. ISSN 13674803. doi: 10.1093/bioinformatics/btp167.
  • Zhu et al. [2009] Yanni Zhu, Xiaotong Shen, and Wei Pan. Network-based support vector machine for classification of microarray samples. BMC Bioinformatics, 10(1):S21, jan 2009. ISSN 1471-2105. doi: 10.1186/1471-2105-10-s1-s21. URL https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-10-S1-S21.

Supplementary information

s.1 Pathway induction

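Similarity functions can be designed by making use of a PSD matrix $M \in \mathbb{R}^{N \times N}$ to induce a weighted inner product:

$$k(x, y) = x^\top M y$$

$k$ represents a valid kernel if the matrix $M$ is PSD [37]; indeed, this ensures the existence of a matrix $B$ such that $M = B B^\top$:

$$k(x, y) = x^\top B B^\top y = (B^\top x)^\top (B^\top y) = \phi(x)^\top \phi(y)$$

where $\phi(x) = B^\top x$ is a mapping describing a transformation in the feature space.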

By making use of a PSD matrix encoding topological properties of a graph representing a pathway it’s possible to design interaction-aware kernels.

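Let’s consider an undirected graph representing a pathway:

$$G = (V, E)$$

with nodes $V$ and edges $E$ representing the genes/proteins and their interactions respectively. Such a graph is defined by a symmetric adjacency matrix $A \in \{0, 1\}^{N \times N}$, with $N = |V|$:

$$A_{ij} = \begin{cases} 1 & \text{if } \{i, j\} \in E \\ 0 & \text{otherwise} \end{cases}$$

and its diagonal degree matrix $D$:

$$D_{ii} = \sum_{j} A_{ij}$$

For such a graph, the Laplacian matrix $L$ is computed using the following:

$$L = D - A$$

The Laplacian is a PSD matrix and therefore represents a suitable candidate for induction of a weighted inner product based on a pathway topology. This can be shown by defining an ordered incidence matrix $S \in \mathbb{R}^{N \times |E|}$ for $G$ that by construction satisfies the relation $L = S S^\top$. By introducing an index set for the edges, $e \in \{1, \dots, |E|\}$, $S$ can be defined as:

$$S_{ie} = \begin{cases} 1 & \text{if edge } e = \{i, j\} \text{ and } i < j \\ -1 & \text{if edge } e = \{i, j\} \text{ and } i > j \\ 0 & \text{otherwise} \end{cases}$$

This guarantees that edges are considered in an arbitrary order consistently applied, as in [3].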
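
The relation between the Laplacian and an ordered incidence matrix can be checked numerically on a toy graph. The following is a minimal sketch in plain Python, not the authors' implementation: the function names, the four-node example graph and the sign convention (+1 on the lower-indexed endpoint of each edge, -1 on the higher-indexed one) are assumptions made for illustration.

```python
def laplacian(n_nodes, edges):
    """Return L = D - A for an undirected, unweighted graph as nested lists."""
    lap = [[0] * n_nodes for _ in range(n_nodes)]
    for u, v in edges:
        lap[u][u] += 1  # degree contribution
        lap[v][v] += 1
        lap[u][v] -= 1  # adjacency contribution
        lap[v][u] -= 1
    return lap


def incidence(n_nodes, edges):
    """Return an ordered incidence matrix S (nodes x edges).

    Assumed convention: +1 on the lower-indexed endpoint, -1 on the other.
    """
    inc = [[0] * len(edges) for _ in range(n_nodes)]
    for e, (u, v) in enumerate(edges):
        inc[min(u, v)][e] = 1
        inc[max(u, v)][e] = -1
    return inc


def gram(matrix):
    """Compute matrix @ matrix^T for a nested-list matrix."""
    return [[sum(a * b for a, b in zip(row_i, row_j)) for row_j in matrix]
            for row_i in matrix]


# Hypothetical toy pathway: a triangle (0, 1, 2) with a pendant node 3.
edges = [(0, 1), (1, 2), (2, 0), (2, 3)]
L = laplacian(4, edges)
S = incidence(4, edges)
assert gram(S) == L  # L = S S^T holds for this graph
```

Flipping the sign convention of any column of `S` leaves `gram(S)` unchanged, which is why the edge orientation can be arbitrary as long as it is applied consistently.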

Moreover, the Laplacian can be interpreted as a discrete Laplace operator. Indicating with $X \in \mathbb{R}^{N \times n}$ a set of $n$ samples, a discrete diffusion process over graph nodes is described as:

$$L X = S (S^\top X) \qquad (1)$$

where $S$ is the ordered incidence matrix of the graph, the term $S^\top X$ computes the discrete diffusion potential (a difference) along the edges and Equation 1 describes how the flow of this potential is affected, aggregating incoming and outgoing flow on the nodes.

Decomposing the Laplacian using an ordered incidence matrix shows how samples are mapped from the original space with measurements for $N$ molecular entities into an $|E|$-dimensional feature space, where each interaction from the pathway is a dimension and the value along the edge is the discrete diffusion potential between the respective node measurements.

The inner product in this space is the resulting similarity function defined as:

$$k(x, y) = (S^\top x)^\top (S^\top y) = x^\top S S^\top y = x^\top L y$$

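Similar considerations can be applied in the case of weighted graphs with non-negative weights. Given a weighted undirected graph $G = (V, E, W)$, indicating by $W \in \mathbb{R}^{|E| \times |E|}$ its diagonal weights matrix and by $S$ its ordered incidence matrix, the Laplacian is defined as:

$$L = S W S^\top$$

To ensure an equal contribution from all the nodes in the considered pathway, the degree-normalized version of the Laplacian can be adopted, with $D$ the diagonal matrix of (weighted) node degrees:

$$\mathcal{L} = D^{-1/2} L D^{-1/2}$$

This pathway encoding directly leads to the definition of pathway induction used in this work. Given any two sample measurements $x, y \in \mathbb{R}^{N}$:

$$k(x, y) = x^\top \mathcal{L} y$$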
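
As an illustration of the pathway-induced kernel on a weighted graph, the sketch below evaluates $x^\top \mathcal{L} y$ and compares it with the equivalent inner product of degree-scaled differences along the edges. It is a sketch under stated assumptions rather than the PIMKL code: the helper names, the toy edge list and the sample vectors are hypothetical, edge weights are assumed strictly positive, and nodes are assumed to be indexed from 0.

```python
import math


def weighted_degrees(n_nodes, weighted_edges):
    """Sum the weights of the edges touching each node."""
    degree = [0.0] * n_nodes
    for u, v, w in weighted_edges:
        degree[u] += w
        degree[v] += w
    return degree


def normalized_laplacian(n_nodes, weighted_edges):
    """Build D^{-1/2} (D - A) D^{-1/2} as nested lists."""
    degree = weighted_degrees(n_nodes, weighted_edges)
    # Isolated nodes (degree 0) keep all-zero rows and columns.
    scale = [1.0 / math.sqrt(d) if d > 0 else 0.0 for d in degree]
    lap = [[0.0] * n_nodes for _ in range(n_nodes)]
    for u, v, w in weighted_edges:
        lap[u][u] += w * scale[u] ** 2
        lap[v][v] += w * scale[v] ** 2
        lap[u][v] -= w * scale[u] * scale[v]
        lap[v][u] -= w * scale[u] * scale[v]
    return lap


def pathway_kernel(x, y, lap):
    """Evaluate the bilinear form x^T lap y."""
    return sum(x[i] * lap[i][j] * y[j]
               for i in range(len(x)) for j in range(len(y)))


def edge_space_kernel(x, y, n_nodes, weighted_edges):
    """Same kernel, computed as an inner product of per-edge differences."""
    degree = weighted_degrees(n_nodes, weighted_edges)
    total = 0.0
    for u, v, w in weighted_edges:
        fx = x[u] / math.sqrt(degree[u]) - x[v] / math.sqrt(degree[v])
        fy = y[u] / math.sqrt(degree[u]) - y[v] / math.sqrt(degree[v])
        total += w * fx * fy
    return total


# Hypothetical three-gene pathway with weighted interactions.
edges = [(0, 1, 1.0), (1, 2, 2.0), (0, 2, 0.5)]
lap = normalized_laplacian(3, edges)
x = [1.0, 2.0, 3.0]
y = [0.5, -1.0, 2.0]
assert math.isclose(pathway_kernel(x, y, lap),
                    edge_space_kernel(x, y, 3, edges), abs_tol=1e-12)
```

The edge-space form makes the feature mapping explicit, while the Laplacian form only needs the matrix once per pathway and is the cheaper choice when many sample pairs are evaluated.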


A similar concept was proposed in [9] at the complete network level, where the normalized Laplacian was used as a regularizer to constrain the optimization problem when training an SVM. In PIMKL, we arrive at a similar formulation of the problem from the intuition of introducing a feature mapping and not from a regularization perspective. We define a kernel function, which allows easy application to any kernelized method and any kernel transformation. The decomposition of $\mathcal{L}$ can be derived from the graph but remains implicit, and the approach can be easily extended to the multiple kernel learning case, which makes it possible to work at the pathway/sub-network level.

s.2 Breast cancer microarray cohorts

| GEO id [4] | Patients | dmfs/rfs ≤ 5 years | dmfs/rfs > 5 years |
|---|---|---|---|
| GSE2034 [48] | 286 | 93 | 183 |
| GSE1456 [39] | 159 | 34 | 119 |
| GSE2990 [42] | 187 | 42 | 116 |
| GSE4922 [26] | 249 | 69 | 159 |
| GSE7390 [15] | 198 | 56 | 135 |
| GSE11121 [41] | 200 | 28 | 154 |

Table S1: Breast cancer benchmark cohorts. Brief description of sample counts in the different classes for the cohorts considered in [12] (all Affymetrix Human Genome U133A Array). In GSE4922 and GSE11121 metastasis free survival (dmfs) is considered, in the other cohorts relapse free survival (rfs).

| Data types | Patients | Recurred/Progressed | DiseaseFree |
|---|---|---|---|
| Illumina Human v3 microarray (mRNA) | 1890 | 647 | 1333 |
| Affymetrix SNP 6.0 copy number (CNA) | | | |

Table S2: Breast cancer METABRIC cohort. Brief description of sample counts in the different classes for the considered data types in the METABRIC (Molecular Taxonomy of Breast Cancer International Consortium) cohort [13].

for each of 10 repeats do
    for $(X_{train}, y_{train}), (X_{test}, y_{test})$ in stratified 10-fold split of $(X, y)$ do
        learn feature-wise normalization on $X_{train}$ and apply to $X_{train}$ and $X_{test}$
        for $(X_{fit}, y_{fit}), (X_{val}, y_{val})$ in stratified 3-fold split of $(X_{train}, y_{train})$ do    ▷ parameter grid search with 3-fold cross-validation
            for $\lambda \in \Lambda$ do
                train PIMKL($\lambda$) on $X_{fit}$ and $y_{fit}$
                record prediction accuracy on $(X_{val}, y_{val})$
        $\lambda^{*} \leftarrow$ argmax(mean prediction accuracy over cross-validation)
        train PIMKL($\lambda^{*}$) on $X_{train}$ and $y_{train}$
        report kernel weights
        report area under the curve for prediction on $(X_{test}, y_{test})$
    report mean area under the curve over 10-fold splits for figure S1 and 2(a)

Algorithm S1: PIMKL cross-validation on [12]. Cross-validation analysis of PIMKL for each of the breast cancer cohorts as suggested in [12] (with internal optimization of parameters). Given as input: gene expression measurements $X$ with related clinical labels $y$, a set of pathways $P$ and a set of hyper-parameters $\Lambda$ = {0, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0} for EasyMKL.
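
The nested structure of Algorithm S1 (an outer stratified split for evaluation and an inner stratified split for selecting the hyper-parameter) can be sketched in plain Python as follows. This is an illustrative skeleton, not the code used for the reported results: `stratified_folds`, `nested_cv` and the `fit_and_score` callback are hypothetical names. Normalization, PIMKL training and scoring are assumed to happen inside the callback, and a single score is used for both the inner and outer loops, whereas the algorithm uses accuracy for selection and AUC for reporting.

```python
import random
from collections import defaultdict
from statistics import mean


def stratified_folds(labels, n_folds, rng):
    """Yield (train_idx, test_idx) pairs that preserve class proportions."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the samples of each class round-robin over the folds.
        for position, idx in enumerate(indices):
            folds[position % n_folds].append(idx)
    for k in range(n_folds):
        test = sorted(folds[k])
        train = sorted(i for j, fold in enumerate(folds) if j != k for i in fold)
        yield train, test


def nested_cv(labels, grid, fit_and_score, n_outer=10, n_inner=3, seed=0):
    """Return the mean outer-fold score; the parameter is tuned on inner folds.

    fit_and_score(train_idx, test_idx, param) is a hypothetical callback that
    fits a model on train_idx and returns a score measured on test_idx.
    """
    rng = random.Random(seed)
    outer_scores = []
    for train, test in stratified_folds(labels, n_outer, rng):
        train_labels = [labels[i] for i in train]
        inner_scores = defaultdict(list)
        for fit_pos, val_pos in stratified_folds(train_labels, n_inner, rng):
            # Map positions within the training fold back to sample indices.
            fit = [train[p] for p in fit_pos]
            val = [train[p] for p in val_pos]
            for param in grid:
                inner_scores[param].append(fit_and_score(fit, val, param))
        best = max(grid, key=lambda p: mean(inner_scores[p]))
        # Refit on the full training fold with the selected parameter.
        outer_scores.append(fit_and_score(train, test, best))
    return mean(outer_scores)
```

Calling `nested_cv` once per repeat with a different `seed` would mirror the ten repeats of the outer loop; the test samples of each outer fold never enter the parameter selection.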

for 100 folds with 20 samples per class in $X_{train}$ do
    for $d$ in $D$ do
        learn feature-wise normalization on $X^{d}_{train}$ and apply to $X^{d}_{train}$ and $X^{d}_{test}$
    train PIMKL($\lambda$) on $X_{train}$ and $y_{train}$
    report kernel weights
    report area under the curve for prediction on $(X_{test}, y_{test})$

Algorithm S2: PIMKL cross-validation on METABRIC. Cross-validation on METABRIC single omics or multi-omics. Given as input: molecular measurements $X$ comprised of a choice of data types $D$ (CNA, mRNA or both) with related clinical labels $y$, a set of pathways $P$ with a respective pathway for each data type in $D$ and $\lambda$ = 0.2 for EasyMKL.
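
The fold generation of Algorithm S2 could be sketched as repeated class-balanced subsampling. The snippet below is a hedged illustration only: it assumes that the 20 samples per class form the training set and that all remaining samples are used for testing, and the function name `balanced_folds` is hypothetical.

```python
import random
from collections import defaultdict


def balanced_folds(labels, n_folds=100, per_class=20, seed=0):
    """Yield (train_idx, test_idx); train holds per_class samples of each class.

    Assumption: samples not drawn for training form the test set.
    Raises ValueError if a class has fewer than per_class samples.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    for _ in range(n_folds):
        train = sorted(i for indices in by_class.values()
                       for i in rng.sample(indices, per_class))
        chosen = set(train)
        test = [i for i in range(len(labels)) if i not in chosen]
        yield train, test
```

Each fold is drawn independently, so the training sets of different folds may overlap; this differs from a k-fold partition, where every sample is tested exactly once.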

Figure S1: PIMKL cross-validation AUC. Box plots of the AUC values for the methods analyzed in [12] (blue) and PIMKL (red). PIMKL clearly outperforms the other methods in four out of six data sets. For GSE1456 it performs close to the average of the other methods, while for GSE11121 it is in the top group. Results are presented as in [12], where each box is drawn from ten repeats of the mean AUC over 10-fold cross-validation splits; see Algorithm S1.

Figure S2: PIMKL cross-validation weights. The significance of the weights over 100 cross-validation folds is reported for the 50 hallmark pathways. Significant pathways are colored in red, non-significant ones in blue.