Drug Similarity Integration Through Attentive Multiview Graph AutoEncoders
Abstract
Drug similarity has been studied to support downstream clinical tasks such as inferring novel properties of drugs (e.g., side effects, indications, interactions) from known properties. The growing availability of new types of drug features brings the opportunity to learn a more comprehensive and accurate drug similarity that represents the full spectrum of underlying drug relations. However, it is challenging to integrate this heterogeneous, noisy, nonlinearly related information to learn accurate similarity measures, especially when labels are scarce. Moreover, there is a tradeoff between accuracy and interpretability. In this paper, we propose to learn accurate and interpretable similarity measures from multiple types of drug features. In particular, we model the integration using multiview graph autoencoders, and add an attention mechanism that determines the weights of each view with respect to the corresponding tasks and features for better interpretability. Our model has a flexible design for both semisupervised and unsupervised settings. Experimental results demonstrate a significant improvement in predictive accuracy. Case studies also show better model capacity (e.g., embedding node features) and interpretability.
Tengfei Ma, Cao Xiao, Jiayu Zhou, Fei Wang
IBM Research; MIT-IBM Watson AI Lab; Computer Science and Engineering, Michigan State University; Weill Cornell Medical School, Cornell University
tengfei.ma1@ibm.com, cxiao@us.ibm.com, jiayuz@msu.edu, few2001@med.cornell.edu
1 Introduction
Rapidly evolving technologies have made it easier to collect multiple types of drug data and have thus opened new opportunities for computational drug discovery research and drug safety studies. The study of drug similarity lays the foundation for this research, since similar structural, molecular, and biological properties often relate to similar drug indications or adverse effects [?]. In the literature, drug similarity has been computed using molecular structure data [?], interaction profile data [?], as well as side-effect information [?; ?].
Recently, there has been a growing interest in learning improved drug similarity from multiple types of drug features. For example, [?] proposed an inductive matrix completion method to combine multiple data sources and help predict unknown side effects. [?] proposed an integrative label propagation algorithm to infer clinical side effects from multiple sources while considering high-order similarity. Results from these pilot studies show that combined similarity measures are usually more informative and more robust to noise. These methods can be summarized into four major categories: the nearest neighbor methods, the random walk based approaches, the unsupervised methods, and the multiple kernel learning methods. Section 2 provides more details on the related literature.
Despite the potential benefits, learning from multiple biomedical data sources raises significant challenges that must be handled simultaneously: 1) different types of features have different levels of association with the target outcomes. For example, drugs’ structural similarity could have more influence on their interaction profiles than drugs’ indication similarity does; 2) the underlying relations of biomedical events (e.g., two drugs interact to cause a side effect) are often nonlinear and complex over all types of features [?]; 3) data quality (e.g., lack of labels, noise in the data) also creates challenges for similarity learning; and 4) a model that captures complex drug relations is often very complex and lacks interpretability.
To address the aforementioned challenges, we consider each type of drug feature as a view and learn an integrated drug similarity using multiview graph autoencoders (GAE). In particular, we model each drug as a node in the drug association network and extend graph convolutional networks (GraphCNN) [?] to embed multiview node features and edges. Across views, we use an attentive view selection scheme to enable nonlinear multiview fusion and make the learning more interpretable and adaptive to the data. From such an embedding, we learn drug similarity and use it to predict outcomes (e.g., drug-drug interactions). In addition, for the setting where we would like to integrate multiple drug similarity graphs without knowing any features, we propose an alternative transductive learning method based on treating labels as latent variables. The proposed models not only improve prediction performance, but also have the following benefits.

Interpretable and adaptive multiview fusion: To model the heterogeneous relevance of different views to the target tasks, we use an attentive model to fuse multiple views in our similarity integration. The attentive view selection scheme generates task-wise feature relevance, from which we can learn interpretable similarity measures. The learned similarity is also more adaptive to the underlying data and thus more accurate.

Transductive prediction using unlabeled data: Labels are expensive to acquire and often very scarce for new drugs. By developing an autoencoder structure whose reconstruction loss can be seen as a regularization term that explicitly models the graph structure, we efficiently leverage the unlabeled data for accurate predictions.

Robust to noise: The proposed methods inherit the advantage of autoencoders and can extract representations that are relatively stable and robust to noise in the data. For example, in drug-drug interaction prediction, an unobserved interaction does not necessarily mean that no interaction exists. The proposed methods effectively reduce the negative impact of these “positive unlabeled” samples.
2 Related Work
Our work addresses the problem of multiview similarity integration. To the best of our knowledge, current approaches can be summarized as follows.
The nearest neighbor methods make predictions based on the majority cases among neighbors. Examples include [?], [?], and [?]. However, as pointed out by [?], most of these existing methods only utilize first-order similarity to construct the neighborhood and do not consider the transitivity of similarities.
The random walk methods (e.g., label propagation in [?] and [?]) leverage the assumption that data points occupying the same manifold are very likely to share the same semantic label. They propagate labeling information from labeled data points to unlabeled ones according to the intrinsic manifold structure collectively revealed by a large number of data points. These methods can handle nonlinear relations and perform transductive learning with scarce labeled data. However, these models have fixed loss functions and hence lack flexibility in modeling various problem settings.
The unsupervised methods. For example, in [?] and [?], the authors construct an integrative network to fuse multiple similarity networks via an iterative scaling approach. In [?], the authors integrated feature ranking and feature variation as feature weights for weighted similarity fusion. These unsupervised methods have good flexibility; however, without any supervision, they can generate unreliable results.
The multiple kernel learning (MKL) methods, such as [?]. MKL was further extended to integrate heterogeneous data in [?]; however, most existing methods are limited to convex integration.
3 Background
Over the past few years, several graph-based convolutional network models have emerged for inducing informative latent feature representations of nodes and links. For example, [?] proposed a graph convolutional network (GraphCNN) that learns node embeddings based on node features and their connections, which can be used for node classification. Specifically, given an undirected graph with $N$ nodes and adjacency matrix $A$, a multilayer neural network is constructed on the graph with the following layer-wise propagation rule:
$$H^{(l+1)} = \sigma\left(\tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} H^{(l)} W^{(l)}\right)$$
where $\tilde{A} = A + I$ is the adjacency matrix with added self-connections, $\tilde{D}$ is a diagonal matrix such that $\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$, $W^{(l)}$ is a layer-specific parameter matrix, $H^{(l)}$ is the node representation in the $l$-th layer, and $\sigma$ is an activation function (e.g., ReLU or sigmoid). Later, [?] and [?] extended GraphCNN and proposed a graph autoencoder (GAE) using GraphCNN for both node classification and link prediction tasks. However, their model only reconstructs the edges and cannot work on unseen data. In the following, we extend [?] and [?] by reconstructing both links and node embeddings and by allowing for inductive prediction.
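The layer-wise propagation rule described above can be illustrated with a minimal NumPy sketch. The function names `normalize_adjacency` and `gcn_layer`, the toy path graph, and the random weights are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def normalize_adjacency(A):
    """Return D~^{-1/2} (A + I) D~^{-1/2} for a symmetric, non-negative A."""
    A_tilde = A + np.eye(A.shape[0])      # add self-connections
    deg = A_tilde.sum(axis=1)             # D~_ii = sum_j A~_ij (always >= 1)
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    return A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_layer(A_hat, H, W, activation=lambda x: np.maximum(x, 0.0)):
    """One propagation step: activation(A_hat @ H @ W); ReLU by default."""
    return activation(A_hat @ H @ W)

# Toy graph (assumption): a path 0 - 1 - 2 with 2-dimensional node features.
A = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
H0 = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
W0 = np.random.default_rng(0).normal(size=(2, 4))   # random layer weights
H1 = gcn_layer(normalize_adjacency(A), H0, W0)
print(H1.shape)
```

Stacking several such calls, each with its own weight matrix, gives the multilayer network referred to in the text.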
4 Method
In this paper, we consider each type of drug feature as a view. For view $u$, we construct a graph by modeling each drug as a node and the similarity between two nodes as an edge. We denote the node feature embeddings as $X^{(u)}$ and use a similarity matrix $A^{(u)}$ to represent the pairwise similarity between drugs in that view. Given $T$ different views, the task of multiview similarity integration is to derive an integrated node embedding $Z$ and similarity matrix $A$ across all views.
4.1 Similarity Integration with Attentive Multiview Graph Autoencoders
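Basic GraphCNN Structure with Multiple Views. For each view $u$, we set $\tilde{A}^{(u)} = A^{(u)} + I$ and the diagonal matrix $\tilde{D}^{(u)}$, where $\tilde{D}^{(u)}_{ii} = \sum_j \tilde{A}^{(u)}_{ij}$; then we use a two-layer GraphCNN to get the node embeddings $Z^{(u)}$ using Eq. (1).
$$Z^{(u)} = f(X^{(u)}, A^{(u)}) = \mathrm{softmax}\left(\hat{A}^{(u)} \, \mathrm{ReLU}\left(\hat{A}^{(u)} X^{(u)} W_0^{(u)}\right) W_1^{(u)}\right) \qquad (1)$$
where $\hat{A}^{(u)} = (\tilde{D}^{(u)})^{-1/2} \tilde{A}^{(u)} (\tilde{D}^{(u)})^{-1/2}$, and $W_0^{(u)}$ and $W_1^{(u)}$ are weight matrices.
Given the node embeddings $Z^{(u)}$ on each view, we concatenate the embeddings from all views to get a new representation $Z$ of the nodes. The prediction between two nodes $i$ and $j$ can then be made by a sigmoid function with a matrix parameter $W$. The structure of this method is shown in Figure 1(a).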
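The concatenation and pairwise prediction step can be sketched as follows. The text only states that a sigmoid with a matrix parameter is used; the bilinear form `sigmoid(Z @ W @ Z.T)`, the function name `predict_links`, and the toy dimensions are assumptions made for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def predict_links(view_embeddings, W):
    """Concatenate per-view embeddings and score every node pair.

    Assumed bilinear scoring: P[i, j] = sigmoid(z_i^T W z_j).
    """
    Z = np.concatenate(view_embeddings, axis=1)   # (N, sum of view dims)
    return sigmoid(Z @ W @ Z.T)                   # (N, N) pairwise scores

rng = np.random.default_rng(0)
# Two hypothetical views for 6 drugs, with 3- and 2-dimensional embeddings.
Z_views = [rng.normal(size=(6, 3)), rng.normal(size=(6, 2))]
W = rng.normal(size=(5, 5))                       # 5 = 3 + 2 concatenated dims
P = predict_links(Z_views, W)
print(P.shape)
```

Note that `P` is symmetric only if `W` is symmetric; whether symmetry is enforced is a design choice the text does not specify.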
Figure 1: Model structures, panels (a), (b), and (c).
Similarity Matrix Fusion. Instead of concatenating node embeddings from different views, we can also first compute an integrated similarity matrix and construct only one graph for all views. In this single graph, the node features are fixed for all views. The similarity fusion can be done simply as follows: we first normalize all similarity matrices to get $\hat{A}^{(u)}$, and then aggregate them into a comprehensive matrix that serves as the adjacency matrix of the graph: $\hat{A} = \sum_u g_u \hat{A}^{(u)}$, where $g_u$ are the mixing weights for the different similarity matrices. Following the structure in [?], we use a one-layer GraphCNN to encode the nodes in our graph:
(2) 
After that, we decode the embedding back to the original feature space
(3) 
If we do not have any labels for the nodes, the objective function is the loss of the autoencoder in Eq. 4.
(4) 
In this case, our framework could be regarded as an unsupervised multigraph fusion and embedding method.
The derived similarity matrix can be used for other tasks as well, such as node clustering.
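The unsupervised fusion and autoencoding pipeline described above can be sketched in a few lines. Everything specific here is an assumption for illustration: symmetric normalization with self-connections, fixed mixing weights, a ReLU encoder, a sigmoid decoder, and a mean squared reconstruction error.

```python
import numpy as np

def sym_normalize(S):
    """Symmetric normalization D^{-1/2} (S + I) D^{-1/2} of a similarity matrix."""
    S_tilde = S + np.eye(S.shape[0])
    d = S_tilde.sum(axis=1)
    return S_tilde / np.sqrt(np.outer(d, d))

def fuse(similarities, weights):
    """Weighted sum of the normalized per-view similarity matrices."""
    return sum(w * sym_normalize(S) for w, S in zip(weights, similarities))

def autoencode(A_hat, X, W_enc, W_dec):
    """One-layer graph encoder, graph decoder, and reconstruction loss."""
    Z = np.maximum(A_hat @ X @ W_enc, 0.0)                # node embeddings
    X_rec = 1.0 / (1.0 + np.exp(-(A_hat @ Z @ W_dec)))    # reconstructed features
    loss = float(np.mean((X - X_rec) ** 2))               # assumed squared error
    return Z, X_rec, loss

rng = np.random.default_rng(0)
N, F, H = 4, 3, 2                      # drugs, feature dim, hidden dim (toy sizes)
views = []
for _ in range(2):                     # two hypothetical similarity views
    S = rng.random((N, N))
    views.append((S + S.T) / 2.0)
A_hat = fuse(views, weights=[0.7, 0.3])
X = rng.integers(0, 2, size=(N, F)).astype(float)
Z, X_rec, loss = autoencode(A_hat, X, rng.normal(size=(F, H)), rng.normal(size=(H, F)))
print(Z.shape, X_rec.shape, round(loss, 4))
```

In a trained model the weight matrices would be optimized to minimize the reconstruction loss; the sketch only evaluates one forward pass.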
Attentive View Selection. In practice, the fusion of views could be nonlinear, while the weights of features in each view need to be decided by both the data and the target tasks. To allow for such flexibility, in this section we extend the mixing scheme by fusing features from different views with an attention mechanism, where the weights of features are determined by the corresponding inputs. The attentive view selection scheme is illustrated in Fig. 2.
Assume we have an adjacency matrix $A^{(u)}$ for view $u$. We assign attention weights $g^{(u)}$ to the graph edges, such that the integrated adjacency matrix becomes $\sum_u g^{(u)} \odot A^{(u)}$, where $\odot$ is element-wise multiplication. For each view, we first project the original adjacency matrix to an unnormalized matrix $g'^{(u)} = W^{(u)} A^{(u)} + b^{(u)}$, and then normalize over the views to get the attention weights $g^{(u)}$. In practice, the graph is often large, so a fully connected attention matrix would require too many parameters. To reduce the complexity, we instead employ a diagonal attention matrix. Specifically, we limit $g^{(u)}$ to be a vector and form the weighted similarity matrix as $\mathrm{diag}(g^{(u)}) A^{(u)}$. In this way, the number of parameters (i.e., $W^{(u)}$ and $b^{(u)}$) is reduced from the order of $N^2$ to the order of $N$. The attentive similarity matrix is generated as follows: $g'^{(u)} = W^{(u)} \hat{A}^{(u)} + b^{(u)}$, where $\hat{A}^{(u)}$ is the normalized similarity matrix of view $u$ and $W^{(u)}$ and $b^{(u)}$ are vectors of length $N$.
Then we normalize them to get the attention weights for each position $i$: $g^{(u)}_i = \exp(g'^{(u)}_i) / \sum_{t} \exp(g'^{(t)}_i)$, and $g^{(u)}$ is then used to induce the final similarity matrix $\hat{A} = \sum_u \mathrm{diag}(g^{(u)}) \cdot \hat{A}^{(u)}$, where $\cdot$ is matrix multiplication and $\mathrm{diag}(g^{(u)})$ is a diagonal matrix with $g^{(u)}$ as its diagonal. After we get the new attention-based similarity matrix $\hat{A}$, we can use the same framework as in Sections 4.2 and 4.3.
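A minimal sketch of the diagonal attention scheme is given below. It assumes that each view has a length-N projection vector and bias (named `w` and `b` here), that normalization over views is a softmax, and that the fused matrix is the sum of row-rescaled view matrices; these names and choices are illustrative assumptions.

```python
import numpy as np

def attentive_fusion(A_views, w_views, b_views):
    """Fuse view matrices with per-node attention weights.

    For each view u: score_u = w_u @ A_u + b_u (a length-N vector).
    Scores are softmax-normalized across views at every node position,
    and the fused matrix is sum_u diag(g_u) @ A_u.
    """
    scores = np.stack([w @ A + b for A, w, b in zip(A_views, w_views, b_views)])
    scores = scores - scores.max(axis=0, keepdims=True)   # numerical stability
    g = np.exp(scores)
    g = g / g.sum(axis=0, keepdims=True)                  # (T, N), columns sum to 1
    A_att = sum(np.diag(g_u) @ A for g_u, A in zip(g, A_views))
    return A_att, g

rng = np.random.default_rng(0)
N, T = 4, 3                                # toy sizes: 4 drugs, 3 views
A_views = []
for _ in range(T):
    S = rng.random((N, N))
    A_views.append((S + S.T) / 2.0)
w_views = [rng.normal(size=N) for _ in range(T)]
b_views = [np.zeros(N) for _ in range(T)]
A_att, g = attentive_fusion(A_views, w_views, b_views)
print(g.round(3))                          # attention weight of each view per node
```

Because `diag(g_u)` rescales rows, the fused matrix is generally not symmetric even when every view is; with all-zero parameters the scheme reduces to a plain average of the views.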
4.2 A Semisupervised Extension Given Partial Labeled Data
The graph autoencoder (GAE) structure can be further extended to a semisupervised setting when we have labels for some of the nodes in the graph (in our case, drugs). We keep the autoencoder framework unchanged and predict the labels on the training data using an additional prediction network. The prediction loss of this network is formulated by Eq. (5).
(5) 
This new model then integrates the two loss functions as its objective function
(6) 
Compared to a generic neural network, whose objective generally contains only the prediction loss, Eq. (6) can be seen as adding the autoencoder loss as a regularization term. In a graph-based semisupervised learning framework, graph Laplacian regularization is often used to capture the graph structure: $\mathcal{L} = \mathcal{L}_0 + \lambda \mathcal{L}_{reg}$, where $\mathcal{L}_0$ is the supervised loss and $\mathcal{L}_{reg} = \sum_{i,j} A_{ij} \| f(X_i) - f(X_j) \|^2$. Our objective function replaces the second term with the reconstruction loss of the GAE, which also explicitly models the graph structure information.
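For comparison with the reconstruction loss, the graph Laplacian regularizer mentioned above can be computed in two equivalent ways. The sketch below checks numerically that the pairwise form equals twice the quadratic form with the unnormalized Laplacian D - A when A is symmetric; the function names and toy data are assumptions.

```python
import numpy as np

def laplacian_regularizer(A, F):
    """Pairwise form: sum_{i,j} A_ij * ||F_i - F_j||^2."""
    diff = F[:, None, :] - F[None, :, :]          # (N, N, K) pairwise differences
    return float((A * (diff ** 2).sum(axis=2)).sum())

def laplacian_quadratic_form(A, F):
    """Matrix form: 2 * trace(F^T (D - A) F), valid for symmetric A."""
    L = np.diag(A.sum(axis=1)) - A                # unnormalized graph Laplacian
    return float(2.0 * np.trace(F.T @ L @ F))

rng = np.random.default_rng(0)
S = rng.random((5, 5))
A = (S + S.T) / 2.0                               # symmetric toy similarity graph
np.fill_diagonal(A, 0.0)
F = rng.normal(size=(5, 3))                       # toy node outputs f(X_i)
print(np.isclose(laplacian_regularizer(A, F), laplacian_quadratic_form(A, F)))
```

The regularizer is small when strongly connected nodes receive similar outputs, which is the smoothness assumption that the GAE reconstruction loss replaces in the proposed objective.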
4.3 Transductive Learning using Test Labels as Variables
Sometimes we only have the graph structure of the similarity matrix but no node features. Although we could model the nodes using one-hot representations as in [?] (for details, see Appendix A.1 in [?]), such an embedding is typically not efficient. More importantly, decoding the embedding vectors back to the one-hot vectors cannot gain much information. This motivates us to develop another scheme that extends the previously introduced graph autoencoder to improve learning when no node features are given.
Instead of using the original node features or one-hot node vectors in the GAE, we consider an alternative: we use the training labels (i.e., the DDI links of each node) as inputs and reconstruct them using the same GAE structure as in Figure 1(c). The graph autoencoder thus outputs the predicted links.
Moreover, if we consider similarities as graph edges, the labels of the test nodes would also impact the decoding of the training nodes. So we employ a transductive method to use the test labels as additional latent variables. The predicted labels are formulated as follows:
(7) 
i.e., the predicted labels are a function of the latent test labels once the other quantities are fixed. The objective function of this model is then given by:
where the additional term is a regularization term that enforces stability of the solutions. Thus, after inference from the training data, we obtain the optimal neural network parameters as well as the latent variables (the test labels).
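The idea of treating test labels as latent inputs can be sketched with a single forward pass. The encoder and decoder forms (tanh and sigmoid), the cross-entropy loss on training rows, the squared-norm regularizer on the latent test labels, and all names below are assumptions for illustration; in practice the objective would be minimized jointly over the network weights and the latent labels.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gae_predict(A_hat, Y_in, W_enc, W_dec):
    """Encode label inputs over the graph, then decode predicted labels."""
    Z = np.tanh(A_hat @ Y_in @ W_enc)
    return sigmoid(A_hat @ Z @ W_dec)

def transductive_objective(A_hat, Y_train, Y_test_latent, train_idx, test_idx,
                           W_enc, W_dec, reg=0.1):
    """Loss on training labels plus a regularizer on the latent test labels."""
    N, K = len(train_idx) + len(test_idx), Y_train.shape[1]
    Y_in = np.zeros((N, K))
    Y_in[train_idx] = Y_train            # observed labels
    Y_in[test_idx] = Y_test_latent       # latent variables, to be optimized
    Y_pred = gae_predict(A_hat, Y_in, W_enc, W_dec)
    P, eps = Y_pred[train_idx], 1e-9
    ce = -np.mean(Y_train * np.log(P + eps) + (1 - Y_train) * np.log(1 - P + eps))
    return float(ce + reg * np.mean(Y_test_latent ** 2)), Y_pred

rng = np.random.default_rng(0)
N, K, H = 5, 3, 4
S = rng.random((N, N))
S = (S + S.T) / 2.0
A_hat = S / S.sum(axis=1, keepdims=True)          # toy normalized similarity graph
train_idx, test_idx = np.array([0, 1, 2]), np.array([3, 4])
Y_train = rng.integers(0, 2, size=(3, K)).astype(float)
W_enc, W_dec = rng.normal(size=(K, H)), rng.normal(size=(H, K))

loss_zero, pred_zero = transductive_objective(
    A_hat, Y_train, np.zeros((2, K)), train_idx, test_idx, W_enc, W_dec)
loss_one, pred_one = transductive_objective(
    A_hat, Y_train, np.ones((2, K)), train_idx, test_idx, W_enc, W_dec)
print(round(loss_zero, 4), round(loss_one, 4))
```

Changing the latent test labels changes the predictions for the training nodes as well, because labels propagate along the similarity edges; this coupling is what makes the test labels useful as variables.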
5 Experiment
Detecting adverse drug-drug interactions (DDIs), i.e., modifications of the effect of a drug when it is administered with another drug, is a clinically important application, as DDIs result in a large number of fatalities per year and incur huge morbidity- and mortality-related costs, in the billions annually [?]. Making use of multiple drug characterizations in similarity computation is critical since drugs can have heterogeneous similarity in different feature dimensions; e.g., drugs that have similar chemical structures can have very different therapeutic targets and thus result in different DDI mechanisms.
5.1 Data Sources
Binary Prediction of the Occurrence of DDIs: For the first dataset, we integrate multiple similarity graphs (without node features) to predict whether there will be an interaction between a new pair of drugs.
In the data, we have the following views: 1) DDI: The known labels of DDIs are extracted from the Twosides database [?], including 645 drugs and 1318 DDI events, together with the distinct pairs of drugs associated with DDI reports. 2) Label Side Effect: Drugs’ side effects extracted from the SIDER database [?] are considered one type of feature. We call this view “Label Side Effect”, following the convention in [?]. 3) Off-Label Side Effect: Drugs’ confounder-controlled side effects from the OFFSIDES dataset are considered another type of feature. 4) Chemical Structure: Drug structure features (i.e., chemical fingerprints) are structural descriptors of drugs. In our study, we generate drug structure features with the extended-connectivity fingerprints with diameter 6 (ECFP6) using the R package “rcdk” [?]. The features are hashed binary vectors of 1,024-bit length, of which each bit encodes the presence or absence of a substructure in a drug molecule. We used the Jaccard index to compute similarities between all the fingerprints.
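The Jaccard index between binary fingerprints can be computed for all drug pairs at once. The sketch below uses tiny 4-bit toy fingerprints rather than real 1,024-bit ECFP6 vectors, and the convention that two all-zero fingerprints have similarity 0 is an assumption.

```python
import numpy as np

def jaccard_similarity_matrix(fingerprints):
    """Pairwise Jaccard index |a AND b| / |a OR b| for binary row vectors."""
    fp = np.asarray(fingerprints, dtype=bool).astype(float)
    inter = fp @ fp.T                                  # shared "on" bits
    counts = fp.sum(axis=1)
    union = counts[:, None] + counts[None, :] - inter  # bits set in either vector
    with np.errstate(divide="ignore", invalid="ignore"):
        sim = np.where(union > 0, inter / union, 0.0)  # 0 when both are empty
    return sim

# Toy fingerprints (assumption): 4 drugs, 4 bits each.
fp = [[1, 1, 0, 0],
      [1, 0, 1, 0],
      [1, 1, 0, 0],
      [0, 0, 0, 0]]
sim = jaccard_similarity_matrix(fp)
print(sim.round(3))
```

For binary vectors the Jaccard index coincides with the Tanimoto coefficient, so the same routine applies wherever the text refers to either name.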
Multilabel Prediction of Specific DDI Types: For the second dataset, we integrate multiple types of drug views to predict specific interaction types for new drug pairs. In the data, we have the following views: 1) Drug Indication: The drug indication data is downloaded from SIDER [?]. It is originally generated from the MedDRA database, which is a widely used, clinically validated international medical terminology. 2) Drug chemical protein interactome (CPI): The CPI data from [?] provides an important measure of how much power a drug needs to bind with its protein target. The similarity of CPI is calculated using the RBF kernel. 3) Protein and nucleic acid targets (TTD): For each drug, we associate its multiple protein and nucleic acid target information and generate features. These entries are extracted from the Therapeutic Target Database (TTD) [?]. 4) Chemical Structure: The chemical structure features are extracted in the same way as in dataset 1, except that we chose the “pubchem” fingerprint instead.
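The RBF kernel similarity used for the CPI view can be sketched as below. The bandwidth parameter `gamma` and the toy feature vectors are assumptions; the text does not recover the value used in the experiments.

```python
import numpy as np

def rbf_similarity_matrix(X, gamma=0.5):
    """Pairwise RBF kernel exp(-gamma * ||x_i - x_j||^2) for row vectors of X."""
    X = np.asarray(X, dtype=float)
    sq_norms = (X ** 2).sum(axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    sq_dists = np.maximum(sq_dists, 0.0)      # guard against tiny negative values
    return np.exp(-gamma * sq_dists)

# Toy CPI-like features (assumption): 3 drugs, 2 dimensions.
X = [[0.0, 0.0],
     [1.0, 0.0],
     [0.0, 2.0]]
K = rbf_similarity_matrix(X, gamma=0.5)
print(K.round(4))
```

Larger values of `gamma` make the similarity decay faster with distance, so the bandwidth controls how local the resulting similarity graph is.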
5.2 Implementation and Evaluation Strategy
Proposed Model: We implement the proposed model with TensorFlow 1.0 [?] and train it using Adam with learning rate 0.01 and early stopping with window size 30. We optimized the hyperparameters for SemiGAE on validation data and then fixed them for all GAE models: 0.5 (dropout rate), 5e-4 (L2 regularization), and 64 (number of hidden units). GCN models additionally have a second layer with its own number of hidden units.
Baseline: In addition, we implemented the following four baselines for comparison:

Nearest Neighbor (NN): We implemented the NN method in [?]. It identifies novel DDIs by using the nearest neighbor similarity to drugs involved in established DDIs.

Label Propagation (LP): We considered the LP model in [?] as a baseline. The LP method propagates the existing DDI information in the network to predict new DDIs, and could also integrate multiple similarity matrices in the network.

GraphCNN: For a single view, we use the same structure as the non-probabilistic GAE model in [?]. We consider the DDI links as edges and form the adjacency matrix. For multiple views, we linearly integrate all similarity matrices as well as the training DDI links.

Multiple Kernel Learning (MKL): For MKL, we used the Python “Mklaren” library [?] with RBF and polynomial kernels. We only applied MKL on Data 2 since Data 1 does not have features for all views.
For all models, we use the Tanimoto coefficient (TC) to calculate similarity, except for CPI, where we measure drug similarity using an RBF kernel. For all methods (except NN, which already includes a similar procedure), following the procedures in [?], after getting the predicted labels using our model, we calculate the probability that one drug interacts with another from these predictions.
Evaluation: In evaluation, we adopted the strategies in [?]: we randomly selected a fixed percentage of drugs and moved all DDIs associated with these drugs to the test set. On the data not used for testing, we train the model and perform validation and model selection on a held-out portion of the drugs. For the testing data, we repeated the holdout validation experiment 50 times with different random divisions of the data, and reported the mean and the standard deviation of the area under the receiver operating characteristic curve (ROC-AUC) as well as the area under the precision-recall curve (PR-AUC) over the 50 repetitions. In the ROC and PR analyses, we used DDIs from TWOSIDES as reference positives and the complement set as reference negatives.
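The repeated drug-wise holdout protocol can be sketched as follows. The data are synthetic, the 25% test fraction is a hypothetical value (the percentages used in the paper are not recoverable here), and the scores are simulated rather than produced by any of the models; only ROC-AUC is shown.

```python
import numpy as np

def holdout_drugs(n_drugs, test_fraction, rng):
    """Randomly hold out a fraction of drugs; returns (train_idx, test_idx)."""
    n_test = int(round(test_fraction * n_drugs))
    perm = rng.permutation(n_drugs)
    return np.sort(perm[n_test:]), np.sort(perm[:n_test])

def roc_auc(y_true, scores):
    """Probability that a random positive outranks a random negative (ties = 1/2)."""
    y_true = np.asarray(y_true, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[y_true], scores[~y_true]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

rng = np.random.default_rng(0)
n_drugs = 20
upper = np.triu(rng.integers(0, 2, size=(n_drugs, n_drugs)), k=1)
Y = upper + upper.T                      # synthetic symmetric DDI labels

aucs = []
for _ in range(50):                      # 50 random divisions, as in the text
    train_idx, test_idx = holdout_drugs(n_drugs, 0.25, rng)
    y = Y[test_idx].ravel()              # all pairs involving a held-out drug
    scores = 0.5 * y + rng.random(y.size)   # simulated, mildly informative scores
    aucs.append(roc_auc(y, scores))
print(f"ROC-AUC: {np.mean(aucs):.3f} +/- {np.std(aucs):.3f}")
```

Holding out whole drugs, rather than individual pairs, means every test interaction involves at least one drug with no training labels, which matches the new-drug scenario the evaluation targets.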
5.3 Results
Tables 1 and 2 compare the performance of the proposed models against the baselines on both datasets. From the tables we can see that, for both single view and multiview settings, the proposed models significantly outperform the baselines. Also, the multiview models generally outperform the corresponding single view models since our integration provides more comprehensive measures of drug similarity. With the attention mechanism added, the relevant types of features receive more weight in the similarity integration.
In addition, we observed that the attentive semisupervised GAE (AttSemiGAE, the model of Section 4.2) often achieves the best ROC-AUC, which is due to the embedding of node features. This advantage is more obvious on Dataset 2 than on Dataset 1, since for Dataset 2 we have node features on all views, while most views in Dataset 1 have no node features. For Dataset 1, due to the lack of node features in most views, the attentive transductive GAE (AttTransGAE, the model of Section 4.3) gains better PR-AUC thanks to transductive learning from test labels and adaptive weight learning.
Table 1: ROC-AUC and PR-AUC of all methods on Dataset 1 at the two test splits. Using a single view: baselines NN, LP, GraphCNN; proposed SemiGAE, TransGAE. Using multiple views: baselines LP, GraphCNN; proposed AttSemiGAE, AttTransGAE.
Table 2: ROC-AUC and PR-AUC of all methods on Dataset 2 at the two test splits. Using a single view: baselines NN, LP, GraphCNN; proposed SemiGAE, TransGAE. Using multiple views: baselines LP, GraphCNN, MKL; proposed AttSemiGAE, AttTransGAE.
5.4 Case Studies
Understanding the Major Source of Similarity: When two drugs cause similar DDIs, such a similarity could be induced by various mechanisms, for example, drugs that prolong the QT interval, drugs that are CYP3A4 inhibitors, or drugs that alter another drug’s metabolism via cytochrome P450 interactions or changes in protein binding [?]. A better understanding of the major DDI mechanism would help us develop actionable insights to identify proper ways to prevent DDIs. In this paper, adding the attention mechanism enhances the interpretability of the models and could potentially provide an understanding of the underlying DDI mechanism.
Table 3: Prediction AUC and attention weights of each view (chem., indi., TTDS, CPI) for selected DDI types: chest pain, insomnia, and aching muscles.
Table 3 reports several selected DDIs and the weights of each view as predicted using AttSemiGAE. For example, the DDI “chest pain” has good prediction AUC, and the views “CPI” and “indication” both have more impact on the predictions than other views. We consulted a domain expert and found this to be in line with domain knowledge. Many DDI cases of chest pain are due to overdose of particular drugs, such as Venlafaxine and Mirtazapine [?], which could be prescribed together to treat depression. However, their co-use could cause overdose and thus prolong the QT interval via the chemical protein interactome (CPI), eventually causing chest pain. For another DDI, “insomnia”, one major mechanism is the interaction between cytochrome P450 (CYP) inducers (e.g., Rifampicin) and hypnosedatives. Insomnia happens when the CYP inducers significantly induce the metabolism of the newer hypnosedatives and decrease their sedative effects [?]. Such a process is caused by the binding of chemical structures with proteins. In the results, the weights for “pubchem” and “CPI (compound-protein binding)” are much higher than the rest, in line with this knowledge.
Importance of Multiview Feature Integration: We also examined how feature integration across multiple views could help provide more accurate measures of drug similarity.
For example, Acyclovir (Pubchem ID 2022) and Ganciclovir (Pubchem ID 3454) have medium-level similarity in indication and TTDS, since Acyclovir is used for treating herpes simplex virus infections and shingles, whereas Ganciclovir is mainly used for more severe Cytomegalovirus diseases and AIDS. However, both are analogues of 2’-deoxyguanosine and have very high structural similarity (0.961 measured using the “Pubchem” fingerprint). The high structural similarity leads to many common DDIs shared by the two drugs according to the ground truth. Our proposed model adaptively gives more weight to the structural similarity and computes a higher integrated similarity score, whereas label propagation (LP) fails to capture such heterogeneous influences and yields a lower integrated score, which is an underestimate compared with the ground truth.
Similar examples include the similarity between Alprazolam (Pubchem ID 2118) and Estazolam (Pubchem ID 3261) as well as the similarity between Alprazolam (Pubchem ID 2118) and Triazolam (Pubchem ID 5556). The two pairs of drugs have quite low indication similarity; however, they all interact when used in combination with CYP3A4 inhibitors such as Cimetidine, Erythromycin, Norfluoxetine, Fluvoxamine, Itraconazole, Ketoconazole, Nefazodone, Propoxyphene, and Ritonavir. The combined use delays the hepatic clearance of Alprazolam, Estazolam, or Triazolam, which then causes accumulation and increased severity of side effects from these drugs. The proposed model accounts for the feature heterogeneity and puts more weight on the chemical structure and CPI features, leading to a higher integrated similarity, while other methods treat all views homogeneously and the resulting similarity is often low.
6 Conclusion
In this paper, we proposed a set of graph autoencoder (GAE) based models that perform multiview drug similarity integration, with an attention model performing view selection. The nonlinear and adaptive integration offers not only superior predictive performance but also interpretable results. We extended these GAE models to semisupervised and transductive settings to predict unknown DDIs. Experimental results on two real-world drug datasets demonstrated the performance and efficacy of our methods. Future work includes extensions in terms of both data and model. Datawise, we could experiment on a larger drug database to fully exploit the power of deep learning without overfitting. Modelwise, we will pursue directly computing an integrated similarity across multiple views without the need to calculate the similarity for each view first.
Acknowledgments
The work of Fei Wang is supported by NSF IIS-1750326 and IIS-1716432. The work of Jiayu Zhou is funded by NSF IIS-1749940, IIS-1615597 and by ONR under N00014-17-1-2265.
References
 [Abadi et al., 2015] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, et al. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015.
 [Angione et al., 2016] Claudio Angione, Max Conway, and Pietro Liò. Multiplex methods provide effective integration of multi-omic data in genome-scale models. BMC Bioinformatics, 17(4):83, Mar 2016.
 [Ansari, 2010] J. Ansari. Drug interaction and pharmacist. Journal of Young Pharmacists, 2(3), 2010.
 [Chen et al., 2002] X. Chen, ZL. Ji, and YZ. Chen. TTD: Therapeutic target database. Nucleic Acids Res., 30(1), 2002.
 [Cheng and Zhao, 2014] F. Cheng and Z. Zhao. Machine learning-based prediction of drug-drug interactions by integrating drug phenotypic, therapeutic, chemical, and genomic properties. JAMIA, 21, 2014.
 [Fokoue et al., 2016] A. Fokoue, O. Hassanzadeh, M. Sadoghi, and P. Zhang. Predicting drug-drug interactions through similarity-based link prediction over web data. WWW ’16 Companion, pages 175–178, 2016.
 [Giacomini et al., 2007] K. Giacomini, R. Krauss, D. Roden, M. Eichelbaum, and M. Hayden. When good drugs go bad. Nature, 446:975–977, 2007.
 [Guha, 2007] Rajarshi Guha. Chemical informatics functionality in R. Journal of Statistical Software, 18(6), 2007.
 [Hesse et al., 2003] LM. Hesse, LL. von Moltke, and DJ. Greenblatt. Clinically important drug interactions with zopiclone, zolpidem and zaleplon. CNS Drugs, 17(7), 2003.
 [Jin et al., 2017] B. Jin, H. Yang, C. Xiao, P. Zhang, X. Wei, and F. Wang. Multitask dyadic prediction and its application in prediction of adverse drug-drug interaction. In AAAI, 2017.
 [Kipf and Welling, 2016a] TN. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks. 2016.
 [Kipf and Welling, 2016b] TN. Kipf and M. Welling. Variational graph auto-encoders. 2016.
 [Kuhn et al., 2015] M. Kuhn, I. Letunic, LJ. Jensen, and P. Bork. The SIDER database of drugs and side effects. Nucleic Acids Res., 2015.
 [Li et al., 2015] R. Li, Y. Dong, Q. Kuang, Y. Wu, Y. Li, M. Zhu, and M. Li. Inductive matrix completion for predicting adverse drug reactions (ADRs) integrating drug–target interactions. Chemometrics and Intelligent Laboratory Systems, 144:71–79, 2015.
 [Liu et al., 2012] M. Liu, Y. Wu, Y. Chen, J. Sun, Z. Zhao, X. Chen, M. Matheny, and H. Xu. Large-scale prediction of adverse drug reactions using chemical, biological, and phenotypic properties of drugs. JAMIA, 19(e1):e28–e35, 2012.
 [McFee and Lanckriet, 2011] B. McFee and G. Lanckriet. Learning multi-modal similarity. J. Mach. Learn. Res., 12, 2011.
 [Nachimuthu et al., 2012] S. Nachimuthu, MD. Assar, and JM. Schussler. Drug-induced QT interval prolongation: mechanisms and clinical management. Therapeutic Advances in Drug Safety, 3(5), 2012.
 [Peter et al., 2006] I. Peter, S. Christian, and M. Achim. Drugs, their targets and the nature and number of drug targets. Nature Reviews Drug Discovery, 5:821–834, 2006.
 [rep, 2016] Drug repositioning. http://astro.temple.edu/~tua87106/drugreposition.html, 2016.
 [Schlichtkrull et al., 2017] M. Schlichtkrull, T. N. Kipf, P. Bloem, R. van den Berg, I. Titov, and M. Welling. Modeling relational data with graph convolutional networks. ArXiv e-prints, 2017.
 [Strazar and Curk, 2016] M. Strazar and T. Curk. Learning the kernel matrix via predictive low-rank approximations. CoRR, abs/1601.04366, 2016.
 [Tatonetti et al., 2012] NP. Tatonetti, P. Patrick, R. Daneshjou, and RB. Altman. Data-driven prediction of drug effects and interactions. Science Translational Medicine, 4(125):125ra31, 2012.
 [Vilar et al., 2012] S. Vilar, R. Harpaz, E. Uriarte, L. Santana, R. Rabadan, and C. Friedman. Drug-drug interaction through molecular structure similarity analysis. JAMIA, 19(6):1066–1074, 2012.
 [Wang et al., 2010] F. Wang, P. Li, and AC. Konig. Learning a bi-stochastic data similarity matrix. In ICDM, pages 551–560, 2010.
 [Wang et al., 2014] B. Wang, A. Mezlini, F. Demir, M. Fiume, Z. Tu, M. Brudno, and A. Goldenberg. Similarity network fusion for aggregating data types on a genomic scale. Nature Methods, 11:333–337, 2014.
 [Xu et al., 2016] Taosheng Xu, Thuc Duy Le, Lin Liu, Rujing Wang, Bingyu Sun, and Jiuyong Li. Identifying cancer subtypes from miRNA-TF-mRNA regulatory networks and expression data. PLOS ONE, 11(4):1–20, 2016.
 [Zhang et al., 2015] P. Zhang, F. Wang, J. Hu, and R. Sorrentino. Label propagation prediction of drug-drug interactions based on clinical side effects. Scientific Reports, 5, 2015.
 [Zhang et al., 2016] W. Zhang, H. Zou, L. Luo, Q. Liu, W. Wu, and W. Xiao. Predicting potential side effects of drugs by recommender methods and ensemble learning. Neurocomputing, 173, 2016.
 [Zhang et al., 2017] W. Zhang, Y. Chen, F. Liu, F. Luo, G. Tian, and X. Li. Predicting potential drug-drug interactions by integrating chemical, biological, phenotypic and network data. BMC Bioinformatics, 2017.
 [Zhuang et al., 2011] J. Zhuang, IW. Tsang, and S. Hoi. Two-layer multiple kernel learning. In AISTATS, volume 15, pages 909–917, 2011.