GENN: Predicting Correlated Drug-Drug Interactions with Graph Energy Neural Networks
Abstract
Gaining more comprehensive knowledge about drug-drug interactions (DDIs) is one of the most important tasks in drug development and medical practice. Recently, graph neural networks have achieved great success in this task by modeling drugs as nodes and drug-drug interactions as links, casting DDI prediction as a link prediction problem. However, correlations between link labels (e.g., DDI types) were rarely considered in existing works. We propose the graph energy neural network (GENN) to explicitly model link type correlations. We formulate the DDI prediction task as a structure prediction problem and introduce a new energy-based model where the energy function is defined by graph neural networks. Experiments on two real-world DDI datasets demonstrate that GENN is superior to many baselines that do not consider link type correlations, achieving PR-AUC improvements on both datasets. We also present a case study in which GENN captures meaningful DDI correlations better than baseline models.
1 Introduction
The use of drug combinations is common and often necessary for treating patients with complex diseases. However, it also increases the risk of drug-drug interactions (DDIs). DDIs are pharmacological interactions between drugs that can alter the action of either or both drugs and cause adverse effects. Overall, DDIs result in a large number of fatalities per year and incur billions of dollars in DDI-associated costs annually (Giacomini et al., 2007). To mitigate these risks and costs, accurate DDI prediction becomes a clinically important task.
While DDI knowledge is expensive to gather, several deep learning approaches have been proposed to mine large biomedical data for predicting potential DDIs (Zitnik et al., 2018; Ma et al., 2018; Ryu et al., 2018). Among them, graph neural networks (GNNs) that consider DDI prediction in the graph setting have achieved strong performance. In DDI graphs, drugs are represented as nodes and embedded into a low-dimensional space. Predictions can then be made by casting DDI prediction as a link prediction problem in which DDI types are represented as link types (Kipf and Welling, 2016; Zitnik et al., 2018).
However, existing works rarely exploit the correlations between these link types (e.g., DDI types) despite their importance. For example, in Fig. 1, Warfarin is a drug for treating blood clots. Its combined use with antibiotic or nonsteroidal anti-inflammatory drugs can cause multiple DDI types, including inhibition of clotting, gastrointestinal bleeding, and hemorrhage. These DDI types are correlated since they are all bleeding-related DDIs caused by the increase of Warfarin's effect. Explicitly modeling such correlations can help infer unseen DDI types.
To fill the gap, we propose GENN, a new deep architecture that predicts correlated DDIs based on graph neural networks and energy-based models. Specifically, we leverage the dependency structures among DDI types and formulate multi-type DDI detection as a structure prediction problem (Lafferty et al., 2001b; LeCun et al., 2006; Belanger and McCallum, 2016b). We use an energy-based approach to incorporate such dependency structures.
To summarize, GENN is enabled by the following technical contributions.


Modeling link type correlations. GENN bridges graph neural networks and structure prediction to directly capture link type correlations, yielding more accurate link prediction in graphs by minimizing an energy function.

A new graph-based energy function. Inspired by a family of structure prediction models called structured prediction energy networks (SPENs) (Belanger and McCallum, 2016a), we design a new graph energy function based on graph neural networks to capture the dependencies among link types in DDI graphs.

Efficient semi-supervised training. We design one cost-augmented inference network to approximate the output during training and one test inference network to approximate the output at test time. We also propose a semi-supervised joint optimization procedure that optimizes the structured hinge loss with respect to the parameters of both inference networks and the energy function.
We evaluated GENN on two real-world DDI datasets with both quantitative and qualitative studies. Results demonstrate that GENN outperforms basic graph neural networks, with PR-AUC improvements on both datasets.
2 Related Work
2.1 DDI Prediction
To predict unseen DDIs based on known ones, drug similarity has been learned via nearest neighbor approaches (Zhang et al., 2017), random walk methods including label propagation (Zhang et al., 2015; Wang et al., 2010), and unsupervised methods (Wang et al., 2014; Angione et al., 2016). Recently, deep graph neural networks have been shown to provide much improved performance in DDI prediction. Among them, Ma et al. (2018) integrate different sources of drug-related information with heterogeneous formats into a coherent and information-preserving representation using attentive multi-view graph auto-encoders (Kipf and Welling, 2016). Decagon (Zitnik et al., 2018) develops a new graph auto-encoder approach, which yields an end-to-end trainable model for link prediction on a multimodal graph. As the experimental results in Zitnik et al. (2018) show, graph neural networks achieve significantly better performance than shallow models such as tensor decomposition (Nickel et al., 2011; Papalexakis et al., 2017), random walk based methods (Perozzi et al., 2014; Zong et al., 2017), non-graph neural fingerprinting (Duvenaud et al., 2015; Jaeger et al., 2018), and DeepDDI (Ryu et al., 2018). None of the existing DDI prediction works consider the correlation among multiple DDI types.
2.2 Structure Prediction and Energy Based Models
Structure prediction is an important problem in various application domains where we want to predict structured outputs instead of independent labels, e.g., structured label prediction for object detection (Zheng et al., 2015b), semantic structure prediction (Belanger et al., 2017), or part-of-speech tagging (Ma and Hovy, 2016). For structure prediction problems, feed-forward networks are insufficient since they cannot directly model the interactions or constraints among outputs. Instead, energy-based models (LeCun et al., 2006) address this challenge by defining a much more flexible and expressive energy score; prediction is then conducted by minimizing the energy function.
One of the well-known structure prediction approaches is conditional random fields (CRFs) (Lafferty et al., 2001a), which have shown success in application areas such as named entity recognition (Sato et al., 2017) and image segmentation (Zheng et al., 2015a). But as a structured linear model, a CRF has limited representational ability. To enhance the flexibility of structure prediction models, Belanger and McCallum (2016b) proposed structured prediction energy networks (SPENs), which use arbitrary neural networks to define the energy function and optimize the energy over a continuous relaxation of the labels. SPENs can thus capture high-arity interactions among output labels and approximate the optimization of the energy function via gradient descent and repeated "loss-augmented inference". Belanger et al. (2017) further developed an "end-to-end" method that unrolls the approximate energy optimization into a differentiable computation graph. However, after learning the energy function, they still have to use gradient descent for test-time inference. Later, Tu and Gimpel (2018) replaced the gradient descent with a neural network trained to do inference directly. However, that approach separates the training of the cost-augmented inference network from the fine-tuning of another inference network for testing, which makes it inefficient. In addition, how to optimize SPENs on graphs for semi-supervised learning remains challenging.
Very recently, GMNN (Qu et al., 2019) combines the statistical relational learning (SRL) and graph neural networks. It includes a CRF to model the joint distribution of object labels conditioned on object attributes, and then it is trained with a variational EM algorithm. The model has been used for node classification and link classification. However, it cannot be easily extended for prediction of missing links, since they have to use links as nodes in a “dual graph” for link classification, but the missing links generally cannot be modeled as nodes.
3 Preliminary
In this section, we first summarize our task in Section 3.1, then a basic message passing graph neural network (MPNN) is described for embedding the DDI graph in Section 3.2.
3.1 Task Formulation
Definition 1 (DDI Graph)
Given a DDI graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, $\mathcal{V}$ is the node set which contains $N$ drug nodes with node features $X \in \mathbb{R}^{N \times d}$, and $\mathcal{E}$ is the edge set which contains all DDIs between drug pairs. For a specific drug pair $(u, v)$, the DDIs of this pair can have multiple types, i.e. $\mathbf{y}_{uv} \in \{0, 1\}^{T}$, where $d$ is the feature dimension and $T$ is the total number of DDI types. These DDI vectors form the edge features (or labels) $Y$.
Problem 1 (DDI Prediction)
We cast DDI prediction as a multi-type link prediction problem. We assume there are some missing edges in the graph, i.e. $\mathcal{E} = \mathcal{E}_{\mathrm{known}} \cup \mathcal{E}_{\mathrm{unknown}}$. Given the node features $X$ and the known DDI links $\mathcal{E}_{\mathrm{known}}$, the goal is to predict the unknown DDIs $\mathcal{E}_{\mathrm{unknown}}$.
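As a concrete illustration of this formulation (a minimal sketch with made-up sizes, not the paper's data pipeline), a DDI graph with multi-hot edge labels can be stored as node features plus a per-edge label vector over the DDI types:

```python
import numpy as np

# Toy DDI graph: 4 drugs, 5-dim node features, 3 DDI types (illustrative sizes).
N, D, T = 4, 5, 3
X = np.random.randn(N, D)                      # node (drug) features

# Known edges with multi-hot DDI-type labels: one {0,1}^T vector per drug pair.
known_edges = {
    (0, 1): np.array([1.0, 0.0, 1.0]),         # drugs 0 and 1 share DDI types 0 and 2
    (1, 2): np.array([0.0, 1.0, 0.0]),
}
# Edges whose labels we must predict (indices are known, labels are not).
unknown_edges = [(0, 3), (2, 3)]

def label_matrix(edges, n_nodes, n_types):
    """Dense edge-label tensor Y with Y[u, v] = multi-hot DDI-type vector."""
    Y = np.zeros((n_nodes, n_nodes, n_types))
    for (u, v), y in edges.items():
        Y[u, v] = Y[v, u] = y                  # DDIs are symmetric
    return Y

Y = label_matrix(known_edges, N, T)
```

The dense tensor is only for clarity; a sparse edge list works the same way.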
3.2 Message Passing Neural Networks for DDI prediction
Recently graph neural networks (GNNs) have been successfully applied to DDI prediction (Zitnik et al., 2018; Ma et al., 2018). To predict the unknown DDIs in a graph, a GNN first learns the embeddings of all the drug nodes and then makes predictions based on these embeddings. In this section, we describe a basic message passing based graph neural network (Gilmer et al., 2017) for drug node embedding. Since our graph contains different types of edges, we select a schema that includes edge information in the message passing. Given the graph with node features $X$, for each node $v$ with hidden state $h_v^{(t)}$ and neighborhood $\mathcal{N}(v)$, the message passing layer updates the node hidden state as follows:

$h_v^{(t+1)} = \sigma\big( W_1 h_v^{(t)} + \sum_{u \in \mathcal{N}(v)} W_2\, \phi(h_u^{(t)}, \mathbf{y}_{uv}) \big)$  (1)

where $\mathbf{y}_{uv}$ is the edge (DDI type vector) between nodes $u$ and $v$, $W_1$ and $W_2$ are both learnable weights, and $\phi$ is a neural network, i.e. an MLP.
The message passing layer can then be stacked to obtain the final state $h_v$ for each node, i.e. the node embedding. We then use these final states to predict the probability of DDI types for each edge:

$\hat{\mathbf{y}}_{uv} = \mathrm{Sigmoid}\big( W [h_u \,\Vert\, h_v] + b \big)$  (2)

where $\Vert$ indicates concatenation, $W$ and $b$ are parameters, and Sigmoid is the activation function commonly used for multi-label prediction.
The model is trained on the partially known DDI edges by optimizing the cross-entropy loss between the predictions and the true edge labels. After training, it can be directly used to predict the unknown DDIs.
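The edge-aware message passing update and the sigmoid edge decoder described above can be sketched in pure NumPy (weight shapes, the ReLU nonlinearity, and folding the message MLP into one matrix are our assumptions; the paper's implementation uses PyTorch MPNN layers):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, T = 4, 8, 3                       # drugs, hidden dim, DDI types
H = rng.normal(size=(N, D))             # current node hidden states
Y = rng.integers(0, 2, size=(N, N, T))  # multi-hot DDI labels per pair
W1 = rng.normal(size=(D, D)) * 0.1
W2 = rng.normal(size=(D, D)) * 0.1
Wm = rng.normal(size=(D + T, D)) * 0.1  # "message" network folded into one matrix

def relu(x):
    return np.maximum(0.0, x)

def mp_update(H, Y, neighbors):
    """One message-passing step: combine self state with edge-aware messages."""
    H_new = np.zeros_like(H)
    for v in range(len(H)):
        msg = np.zeros(H.shape[1])
        for u in neighbors[v]:          # aggregate messages conditioned on edge labels
            msg += relu(np.concatenate([H[u], Y[u, v]]) @ Wm)
        H_new[v] = relu(H[v] @ W1 + msg @ W2)
    return H_new

def decode_edge(H, u, v, W, b):
    """Sigmoid decoder over concatenated node embeddings -> DDI-type probabilities."""
    logits = np.concatenate([H[u], H[v]]) @ W + b
    return 1.0 / (1.0 + np.exp(-logits))

neighbors = {0: [1], 1: [0, 2], 2: [1], 3: []}
H1 = mp_update(H, Y, neighbors)
W = rng.normal(size=(2 * D, T)) * 0.1
b = np.zeros(T)
probs = decode_edge(H1, 0, 1, W, b)     # per-type probabilities for pair (0, 1)
```

Stacking `mp_update` twice and training the weights with cross-entropy would give the GNN baseline described here.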
4 The Genn Method
We present our GENN framework in this section (Fig. 2). We first discuss the design of the energy-based framework for DDI prediction and the design of our graph energy functions based on MPNN. Then we propose a joint learning strategy for the semi-supervised setting in Section 4.3.
The aforementioned message passing neural network does not explicitly consider DDI correlations in its predictions. In practice, however, the labels are often correlated, as stated in the introduction. We therefore reformulate our problem as a structure prediction problem and infer the labels by optimizing an energy function. To obtain a highly expressive energy model, we follow the ideas of SPENs and formulate the energy function as a neural network.
4.1 Energy Function Design
SPENs define the energy as a neural network that takes both the input features and the labels as inputs and returns the energy. Although SPENs can obtain almost arbitrarily high expressiveness, in practice there is a trade-off between using increasingly expressive energy networks and becoming more vulnerable to overfitting (Belanger and McCallum, 2016b), so the energy function needs to be properly designed.
Energy function via GNN
Recall that in the message passing neural network used above, the edge information is included in the network. If we aggregate all the nodes' and edges' information into a whole-graph representation, it is natural to derive an "energy" over the graph. Motivated by this intuition, we formulate a new and simple energy function from a graph neural network:

$E_\theta(X, Y) = \mathrm{MLP}\big( \sum_{v \in \mathcal{V}} h_v \big), \quad \{h_v\} = \mathrm{GNN}(X, Y)$  (3)

where GNN is an arbitrary graph neural network which accepts node features $X$ and edge features $Y$ (which can be initialized by a multi-hot encoding storing the labels), $h_v$ is the final node embedding output by the GNN, which contains information from neighboring nodes and edges, and MLP is a multi-layer perceptron. Clearly, Eq. (3) contains both the interaction between each node and its incident edges and the interaction among all labels $Y$. Once we have an energy function, inference of the predicted labels simply becomes minimizing the energy:

$\hat{Y} = \arg\min_{Y} E_\theta(X, Y)$  (4)
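The idea of scoring an entire labeling with a graph-level energy, and inferring labels by minimizing it, can be sketched as follows (a toy NumPy version with our own assumed shapes and a brute-force search; the paper uses a trainable GNN plus MLP and an inference network instead):

```python
import numpy as np

rng = np.random.default_rng(1)
N, D, T = 4, 6, 3
X = rng.normal(size=(N, D))
Wg = rng.normal(size=(D + T, D)) * 0.1    # toy one-layer edge-conditioned "GNN"
w_out = rng.normal(size=(D,)) * 0.1       # toy scalar "MLP" readout

def relu(x):
    return np.maximum(0.0, x)

def energy(X, Y, neighbors):
    """E(X, Y): edge-conditioned node embeddings, summed over nodes, then scored."""
    pooled = np.zeros(Wg.shape[1])
    for v in range(len(X)):
        h_v = np.zeros(Wg.shape[1])
        for u in neighbors[v]:
            h_v += relu(np.concatenate([X[u], Y[u, v]]) @ Wg)
        pooled += h_v
    return float(pooled @ w_out)

neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
Y = rng.integers(0, 2, size=(N, N, T)).astype(float)

# Brute-force inference for one unknown edge (2, 3): try every multi-hot
# labeling and keep the one with the lowest graph energy.
best, best_e = None, np.inf
for bits in range(2 ** T):
    cand = np.array([(bits >> k) & 1 for k in range(T)], dtype=float)
    Y[2, 3] = Y[3, 2] = cand
    e = energy(X, Y, neighbors)
    if e < best_e:
        best, best_e = cand, e
```

Enumerating labelings is exponential in the number of edges and types, which is exactly why the paper relaxes the labels and trains inference networks.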
4.2 Training CostAugmented Inference Network
For training, we use the structured hinge loss (Tsochantaridis et al., 2004; Belanger and McCallum, 2016b) and relax the edge labels to be continuous for easier optimization. With $\theta$ denoting the parameters of the energy function Eq. (3), our problem becomes:

$\min_\theta \; \max_{\hat{Y}} \big[ \Delta(\hat{Y}, Y) - E_\theta(X, \hat{Y}) + E_\theta(X, Y) \big]_+$  (5)

where $Y$ are the ground-truth DDI edges for training and $\hat{Y}$ are the predictions on these edges. $\Delta$ is the structured error function, which returns a non-negative value indicating the difference between ground truth and prediction, and $[x]_+$ means $\max(0, x)$. In this paper, we use the L1 loss as $\Delta$.
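The structured hinge term with an L1 error can be written directly (a minimal sketch; `energy_pred` and `energy_truth` are placeholders standing in for evaluations of the energy network on the predicted and ground-truth labelings):

```python
import numpy as np

def l1_delta(y_pred, y_true):
    """Structured error: L1 distance between relaxed labels and ground truth."""
    return float(np.abs(y_pred - y_true).sum())

def structured_hinge(y_pred, y_true, energy_pred, energy_truth):
    """max(0, Delta(y_pred, y_true) - E(x, y_pred) + E(x, y_true))."""
    return max(0.0, l1_delta(y_pred, y_true) - energy_pred + energy_truth)

y_true = np.array([1.0, 0.0, 1.0])
y_pred = np.array([0.8, 0.3, 0.9])      # relaxed (continuous) prediction
loss = structured_hinge(y_pred, y_true, energy_pred=-0.5, energy_truth=-1.0)
```

The loss is zero exactly when the energy of the ground truth is lower than that of the prediction by at least the margin given by the L1 error.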
This training loss is expensive to optimize directly because of the cost-augmented inference step. Following the idea of Tu and Gimpel (2018), we use a cost-augmented inference network $F_\Phi$ (itself a graph neural network) to approximate the output in the training phase:

$F_\Phi(X) \approx \arg\max_{\hat{Y}} \big[ \Delta(\hat{Y}, Y) - E_\theta(X, \hat{Y}) \big]$  (6)

The problem can then be seen as a minimax game and optimized by alternating between $\Phi$ and $\theta$.
(1) Fixing $\theta$, we optimize the parameters $\Phi$ of the cost-augmented inference network:

$\max_\Phi \big[ \Delta(F_\Phi(X), Y) - E_\theta(X, F_\Phi(X)) + E_\theta(X, Y) \big]_+$  (7)

(2) Fixing $\Phi$, we optimize $\theta$:

$\min_\theta \big[ \Delta(F_\Phi(X), Y) - E_\theta(X, F_\Phi(X)) + E_\theta(X, Y) \big]_+$  (8)
4.3 Semisupervised Joint Training and Inference
The trained cost-augmented inference network cannot be directly used for test-time inference because of the cost augmentation. For inference, one would first have to fine-tune the trained inference network on the training data with respect to the original inference objective without cost augmentation, and the real effect of this fine-tuning has been questioned (Tu and Gimpel, 2018).
In addition, in the DDI prediction problem, the training set and test set share the same set of nodes. In this semi-supervised setting, the energy optimization in the test phase should involve not only the known edges but the whole graph, including both the known labels $Y_{\mathrm{known}}$ and the predicted $\hat{Y}$, as shown in Eq. (4). Since the energy function is a graph neural network, this means we use both $Y_{\mathrm{known}}$ and the prediction $\hat{Y}$ as edge attributes when computing Eq. (4).
Considering these two reasons, we propose to jointly train another test inference network $F_\Psi$ to directly approximate the test output. Since the test edge features are not given (instead, we only know the indices of the edges to test), this test inference network also uses $X$ and $Y_{\mathrm{known}}$ as its inputs. To keep the training and test inference networks from deviating too much, we share the base layers of the two networks and only make the last layers different. When $\theta$ is given, the objective for the test inference network is:

$\min_\Psi \; E_\theta\big(X, \{Y_{\mathrm{known}}, F_\Psi(X)\}\big)$  (9)
Combining Eq. (9) with Eq. (7), when $\theta$ is fixed we can jointly optimize the cost-augmented training network $F_\Phi$ and the test inference network $F_\Psi$, yielding a joint training schema. We still use the minimax procedure to optimize all the parameters.
(1) Fixing $\theta$, we optimize the parameters of both the cost-augmented inference network and the test inference network:

$\max_{\Phi, \Psi} \big[ \Delta(F_\Phi(X), Y) - E_\theta(X, F_\Phi(X)) + E_\theta(X, Y) \big]_+ - \lambda\, E_\theta\big(X, \{Y_{\mathrm{known}}, F_\Psi(X)\}\big)$  (10)

To better discriminate the two inference networks, we use $F_\Phi(X)$ to denote the predictions on training edges and $F_\Psi(X)$ the predictions on the known and missing edges. $\lambda$ is a hyperparameter which is often set to 1 in practice. Notice that $F_\Phi$ and $F_\Psi$ are not independent but share some parameters.
5 Experiment
We evaluate the GENN model to answer the following questions:


Q1: Does GENN provide more accurate DDI prediction than feedforward GNNs?

Q2: Does GENN improve over the supervised inference method (Section 4.2)?

Q3: How does GENN respond to varying fractions of missing edges?

Q4: Does the model really capture meaningful label correlation?
5.1 Experimental Setup
Data. We conducted our experiments on two public datasets (with some modifications to the setting): DeepDDI and Decagon. For both datasets, we randomly select 80% of the drug-drug pairs as training data, 10% as validation data, and the remaining 10% as test data. For a given drug-drug pair, we predict its DDI types. Note that one drug-drug pair can have zero, one, or multiple DDI types, so this is a multi-label prediction problem.
DeepDDI Dataset (Ryu et al., 2018) consists of 1,861 drugs (nodes) and 222,127 drug-drug pairs (edges) from DrugBank, with 113 different DDI types as labels. 99.87% of the drug-drug pairs have only one type of DDI. An edge exists when two drugs have at least one DDI type. The input feature for each drug is generated from its structural similarity profile (SSP), whose dimension is then reduced to 50 using principal component analysis (PCA), as suggested by DeepDDI. Moreover, a chemical similarity measure is used to reduce the effect of redundant drugs on prediction accuracy, since redundant similar drugs appearing in both the training and test sets would overestimate the predictive performance, as noted in Gottlieb et al. (2012). In particular, we use the Tanimoto score (Tanimoto, 1957) and find that 90% of drug pairs have a similarity score below 0.4721, which shows there is no need to handle the effect of redundant drugs.
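The Tanimoto score over binary fingerprints is the intersection-over-union of the set bits; the redundancy check described above amounts to the following sketch (the fingerprint vectors are illustrative, and the 0.4721 cutoff is the one reported in the text):

```python
import numpy as np

def tanimoto(a, b):
    """Tanimoto (Jaccard) score between two binary fingerprint vectors."""
    a, b = np.asarray(a, dtype=bool), np.asarray(b, dtype=bool)
    union = np.logical_or(a, b).sum()
    return float(np.logical_and(a, b).sum() / union) if union else 1.0

fp1 = [1, 1, 0, 1, 0, 0]
fp2 = [1, 0, 0, 1, 1, 0]
sim = tanimoto(fp1, fp2)          # 2 shared bits / 4 set bits overall = 0.5
redundant = sim >= 0.4721         # flag pairs above the similarity cutoff
```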
BIOSNAP-sub Dataset (Leskovec, 2018) consists of 645 drugs (nodes) and 46,221 drug-drug pairs (edges) from the TWOSIDES dataset, with 200 different DDI types as labels. We extract the data from Zitnik et al. (2018) and keep only 200 medium-frequency DDI types (ranked Top-600 to Top-800), each with at least 90 drug combinations, which gives a suitable number of edges for fast evaluation. 73.27% of the drug-drug pairs have more than one type of DDI. The input feature is a 32-dimensional vector obtained by transforming a one-hot encoding with Gaussian random projection (Pedregosa et al., 2011).
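The 32-dimensional input features amount to multiplying the one-hot drug indicators by a random Gaussian matrix; a minimal sketch of the idea behind scikit-learn's projection (the 1/sqrt(k) scaling is a common convention and an assumption on our part):

```python
import numpy as np

def gaussian_random_projection(X, k, seed=0):
    """Project the rows of X down to k dimensions with a random N(0, 1/k) matrix."""
    rng = np.random.default_rng(seed)
    R = rng.normal(0.0, 1.0 / np.sqrt(k), size=(X.shape[1], k))
    return X @ R

n_drugs = 645
one_hot = np.eye(n_drugs)                      # one-hot encoding of each drug
feats = gaussian_random_projection(one_hot, k=32)
```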
Baselines We compared GENN with the following baselines.


Label Propagation (LP) (Zhu et al., 2003) is a similarity-based semi-supervised method which makes use of unlabeled data to better generalize to new samples.

MLP is the model used in the original DeepDDI work, which accepts pairwise features (e.g., structural similarity profiles (SSPs)) of two drugs to predict their DDIs.

DeepWalk learns d-dimensional neural features for nodes based on a biased random walk procedure exploring the network neighborhoods of nodes. For each drug pair, we concatenate the learned DeepWalk feature vectors with the original drug feature representations and use the same MLP as above for classification.

GNN is a basic graph neural network based on the MPNN framework introduced in Section 3.2.

GLENN is designed as an energy model with a CRF-style local energy function and the same inference algorithm as GENN. Notice that each node and all its connected edges form a clique (these edges are mutual neighbors and are also connected to each other due to the correlation), so we define a local energy over each such clique. Unlike Eq. (3), this energy does not include the neighboring node features; it is linear and has only one layer. We therefore call it a local energy model.

GENN (supervised) is an ablation model without semi-supervised joint training and inference.

GENN is our final model that incorporates the power of GNNs and energy models.
The Implementation Details can be found in Appendix B.
Metrics. To measure the prediction accuracy, we used the following metrics


ROCAUC: Area under the receiver operating characteristic curve: the area under the plot of the true positive rate against the false positive rate at various thresholds.

PRAUC: Area under the precisionrecall curve: the area under the plot of precision versus recall curve at various thresholds.

P@K: short for Precision at K, the mean percentage of correctly predicted labels among the top-K predictions over all samples. We report P@1 and P@5 to validate performance, as is common in the multi-label learning setting.
Following Decagon (Zitnik et al., 2018), we calculated the ROCAUC and PRAUC for each single label, and then use the average scores as the final values.
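Precision at K for this multi-label setting can be computed as below (a sketch consistent with the metric's description; breaking score ties by argsort order is our assumption):

```python
import numpy as np

def precision_at_k(scores, labels, k):
    """Mean fraction of true labels among each sample's top-k scored DDI types."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    topk = np.argsort(-scores, axis=1)[:, :k]        # indices of the k highest scores
    hits = np.take_along_axis(labels, topk, axis=1)  # 1 where a top-k type is true
    return float(hits.mean())

scores = np.array([[0.9, 0.1, 0.8],
                   [0.2, 0.7, 0.4]])
labels = np.array([[1, 0, 1],
                   [0, 1, 0]])
p1 = precision_at_k(scores, labels, k=1)   # both top-1 types are correct -> 1.0
```

The per-label ROC-AUC and PR-AUC averaging follows the same pattern: compute the curve area for each DDI type separately, then take the mean.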
5.2 Results
Performance Comparison
For each dataset, we use 3 different random splits and run all the models on each split. The results are averaged over the 3 runs, and we report the mean and standard deviation on the test set in Table 1. Note that the results on the DeepDDI dataset in Table 1 are based on 60% randomly sampled data, because the models would take more than 2 days to converge on the full dataset with limited performance gain.
Dataset | Method | P@1 | P@5 | PR-AUC | ROC-AUC
DeepDDI | LP | -- | -- | -- | --
DeepDDI | MLP | 0.7311 (.0026) | 0.1926 (.0005) | 0.5888 (.0362) | 0.9736 (.0032)
DeepDDI | DeepWalk | 0.7773 (.0029) | 0.1962 (.0019) | 0.6276 (.0139) | 0.9786 (.0046)
DeepDDI | GNN | 0.9002 (.0119) | 0.1986 (.0002) | 0.7606 (.0187) | 0.9861 (.0066)
DeepDDI | GLENN | 0.8928 (.0067) | 0.1986 (.0002) | 0.7590 (.0095) | 0.9891 (.0058)
DeepDDI | GENN (supervised) | 0.9020 (.0220) | 0.1987 (.0004) | 0.8389 (.0512) | 0.9871 (.0071)
DeepDDI | GENN | 0.9077 (.0293) | 0.1990 (.0003) | 0.8635 (.0286) | 0.9928 (.0052)
BIOSNAP-sub | LP | 0.1089 (.0049) | 0.0850 (.0040) | 0.0607 (.0013) | 0.6414 (.0010)
BIOSNAP-sub | MLP | 0.2120 (.0019) | 0.1508 (.0009) | 0.1675 (.0022) | 0.8041 (.0009)
BIOSNAP-sub | DeepWalk | 0.2463 (.0012) | 0.1719 (.0011) | 0.1908 (.0029) | 0.8311 (.0019)
BIOSNAP-sub | GNN | 0.3275 (.0231) | 0.2215 (.0184) | 0.2494 (.0208) | 0.8757 (.0172)
BIOSNAP-sub | GLENN | 0.3255 (.0192) | 0.2213 (.0181) | 0.2476 (.0209) | 0.8756 (.0172)
BIOSNAP-sub | GENN (supervised) | 0.3290 (.0102) | 0.2216 (.0137) | 0.2503 (.0175) | 0.8788 (.0159)
BIOSNAP-sub | GENN | 0.3396 (.0072) | 0.2326 (.0016) | 0.2602 (.0034) | 0.8855 (.0026)
From Table 1, we can see that DeepWalk and the MPNN-based methods outperform MLP and LP on both datasets by a large margin. This indicates that neighboring edge and node information indeed helps representation learning and improves performance. Compared with GNN, GLENN achieves nearly the same performance and a slightly better ROC-AUC score. GENN, in contrast, largely outperforms the others with respect to all metrics on both datasets, giving a positive answer to Q1: GENN does provide more accurate DDI prediction. In particular, the comparison with GNN and GLENN demonstrates the power of our newly designed graph neural network based global energy function. In addition, among the GENN variants, the supervised-only setting performs worse than GENN, which answers Q2.
Analysis of Robustness
We design experiments to show that, compared with baseline models, GENN achieves empirically more robust performance under high edge missingness. We train each method on 5%, 10%, and 30% of the edges of the DeepDDI dataset and predict the rest. From Figure 4, we can see that GENN is better than all the baselines at every percentage (Q3).
Correlation Analysis
In the correlation analysis experiment, we chose the models trained on 60% of the DeepDDI dataset (the same models reported in Table 1) and made DDI type predictions on the remaining edges (the 40% held-out test set). To illustrate that GENN is able to capture correlations between labels, i.e., different DDI types, by incorporating the energy function, we further calculate the Pearson correlation coefficient between the distributions of DDI types. For each of the 113 DDI types (see http://pnas.org/content/suppl/2018/04/14/1803294115.DCSupplemental), we define its distribution by counting its number of occurrences in the predictions at each drug node over all edges in the test set.
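The statistic used here is a plain Pearson coefficient between the per-drug occurrence counts of two DDI types; a sketch with synthetic counts (the counting details and numbers are illustrative, not the paper's data):

```python
import numpy as np

def type_correlation(counts_a, counts_b):
    """Pearson correlation between two DDI types' per-drug occurrence counts."""
    return float(np.corrcoef(counts_a, counts_b)[0, 1])

# Synthetic per-drug occurrence counts of two DDI types over 6 drugs:
# drugs that often trigger one type also tend to trigger the other.
type_a = np.array([5, 0, 3, 2, 0, 1])
type_b = np.array([4, 1, 3, 1, 0, 2])
r = type_correlation(type_a, type_b)
```

Comparing the coefficients computed from each model's predictions against those from the ground-truth edges gives the values discussed below.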
The PairGrid in Figure 4 shows the pairwise relations between four randomly chosen DDI types, numbered 17, 36, 75, and 102. For instance, DDI type 17 means "Drug a may increase the cardiotoxic activities of Drug b" and DDI type 36 means "Drug a may decrease the bronchodilatory activities of Drug b". In the ground truth, DDI type 17 has a medium correlation with DDI type 36, but MLP and DeepWalk overestimate the coefficient. For DDI type 102 in particular, only GLENN and GENN recover the same distribution as the ground truth. On the one hand, this is consistent with the high performance achieved by GENN. On the other hand, the consistency between the ground truth and GENN demonstrates that GENN really captures some label correlations, which answers Q4.
6 Conclusion
In this paper, we proposed GENN to cast DDI detection as a structure prediction problem with a new GNN-based energy function. Experiments on two real DDI datasets demonstrated that GENN is superior to models that do not consider link type correlations and achieved relative improvements with respect to PR-AUC. Future work includes extending correlated GNNs to heterogeneous networks and incorporating medical domain knowledge with structural information, such as drug classification ontologies, into learning.
References
Multiplex methods provide effective integration of multi-omic data in genome-scale models. BMC Bioinformatics 17 (4), pp. 83.
Structured prediction energy networks. In Proceedings of the 33rd International Conference on Machine Learning (ICML'16), pp. 983–992.
Structured prediction energy networks. In International Conference on Machine Learning, pp. 983–992.
End-to-end learning for structured prediction energy networks. In Proceedings of the 34th International Conference on Machine Learning, pp. 429–439.
Convolutional networks on graphs for learning molecular fingerprints. In Advances in Neural Information Processing Systems 28, pp. 2224–2232.
Fast graph representation learning with PyTorch Geometric. In ICLR Workshop on Representation Learning on Graphs and Manifolds.
When good drugs go bad. Nature 446, pp. 975–977.
Neural message passing for quantum chemistry. In ICML.
INDI: a computational framework for inferring drug interactions and their associated recommendations. Molecular Systems Biology 8 (1).
Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167.
Mol2vec: unsupervised machine learning approach with chemical intuition. Journal of Chemical Information and Modeling 58 (1), pp. 27–35.
Variational graph auto-encoders. arXiv preprint arXiv:1611.07308.
Conditional random fields: probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning (ICML'01), pp. 282–289.
Conditional random fields: probabilistic models for segmenting and labeling sequence data.
A tutorial on energy-based learning. Predicting Structured Data 1.
BioSNAP Datasets: Stanford biomedical network dataset collection. http://snap.stanford.edu/biodata.
Drug similarity integration through attentive multi-view graph auto-encoders. arXiv preprint arXiv:1804.10850.
End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. arXiv preprint arXiv:1603.01354.
A three-way model for collective learning on multi-relational data. In ICML, Vol. 11, pp. 809–816.
Tensors for data mining and data fusion: models, applications, and scalable algorithms. ACM Transactions on Intelligent Systems and Technology (TIST) 8 (2), pp. 16.
Automatic differentiation in PyTorch.
Scikit-learn: machine learning in Python. Journal of Machine Learning Research 12, pp. 2825–2830.
DeepWalk: online learning of social representations. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'14), pp. 701–710.
GMNN: graph Markov neural networks. In Proceedings of the 36th International Conference on Machine Learning, PMLR 97, pp. 5241–5250.
Deep learning improves prediction of drug–drug and drug–food interactions. Proceedings of the National Academy of Sciences 115 (18), pp. E4304–E4311.
Segment-level neural conditional random fields for named entity recognition. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers).
IBM internal report, 17th November.
Support vector machine learning for interdependent and structured output spaces. In Proceedings of the Twenty-First International Conference on Machine Learning, pp. 104.
Learning approximate inference networks for structured prediction. In ICLR.
Similarity network fusion for aggregating data types on a genomic scale. Nature Methods 11, pp. 333–337.
Learning a bi-stochastic data similarity matrix. In ICDM, pp. 551–560.
Label propagation prediction of drug-drug interactions based on clinical side effects. Scientific Reports 5.
Predicting potential drug-drug interactions by integrating chemical, biological, phenotypic and network data. BMC Bioinformatics.
Conditional random fields as recurrent neural networks. CoRR abs/1502.03240.
Conditional random fields as recurrent neural networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1529–1537.
Semi-supervised learning using Gaussian fields and harmonic functions. In Proceedings of the 20th International Conference on Machine Learning (ICML'03), pp. 912–919.
Modeling polypharmacy side effects with graph convolutional networks. Bioinformatics 34 (13), pp. i457–i466.
Deep mining heterogeneous networks of biomedical linked data to predict novel drug–target associations. Bioinformatics 33 (15), pp. 2337–2344.
Appendix A Notations
Notation | Definition
$\mathcal{G} = (\mathcal{V}, \mathcal{E})$ | the whole graph with node set $\mathcal{V}$ and edge set $\mathcal{E}$
$X \in \mathbb{R}^{N \times d}$ | node feature matrix with $N$ nodes and feature dimension $d$
$\mathbf{y}_{uv} \in \{0,1\}^{T}$ | DDI type vector between nodes $u$ and $v$ with $T$ different types
$Y$, $\hat{Y}$ | ground-truth labels and predicted labels of the whole graph
$h_v^{(t)}$, $h_v$ | node $v$'s hidden state at the $t$-th update and its final MPNN output
GNN | graph neural network
$W$, $b$ | learnable weight matrices and bias
$E_\theta$ | energy function parameterized by $\theta$
$F_\Phi$, $F_\Psi$ | training and test inference networks
Appendix B Implementation Details
Most methods are implemented in PyTorch v1.1.0 (Paszke et al., 2017) based on PyTorch Geometric (Fey and Lenssen, 2019) and trained on an Ubuntu 18.04 machine with 56 CPUs and 64 GB of memory.
We use the implementation from Pedregosa et al. (2011) for the Label Propagation method, with gamma set to 0.25 for the RBF kernel and a maximum of 200 iterations. For the DeepWalk method, the official implementation (https://github.com/phanein/deepwalk) is used, with the embedding dimension set to 50 and the random walk length set to 20.
We implement the training inference network, the test inference network, and the energy network with the same structure: a 2-layer message passing neural network (MPNN) encoder and a 1-hidden-layer MLP decoder. Hidden dimensions are all set to 100. To reduce the number of parameters and ensure the two inference networks do not deviate too much, we share the message passing layers between them while keeping the MLP layers separate. For the energy network, we also use message passing for node encoding, but with a different set of parameters from the inference networks in order to make the minimax training schema more effective. In the MLP decoder, we use batch normalization (Ioffe and Szegedy, 2015) with ReLU activations.
For training efficiency, we first train the basic MPNN (i.e., the GNN baseline) on the training data, and then initialize all MPNN layers used in the inference networks and energy networks with these pretrained parameters. We use a learning rate of 0.01 for training and 0.001 for fine-tuning on both datasets, with an early-stopping mechanism (i.e., training is stopped when there is no performance improvement on the validation set for 35 consecutive epochs). The threshold for prediction is simply set to 0.4. For all energy-based models, we add the cross entropy regularization described in the following section.
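The early-stopping rule above (stop after 35 epochs without validation improvement) is straightforward to implement; this sketch uses a fake validation-metric stream in place of real training epochs:

```python
def train_with_early_stopping(val_metrics, patience=35):
    """Return (best_metric, stop_epoch): stop once `patience` epochs pass
    without improvement on the validation metric."""
    best, best_epoch = float("-inf"), -1
    for epoch, metric in enumerate(val_metrics):
        if metric > best:
            best, best_epoch = metric, epoch
        elif epoch - best_epoch >= patience:
            return best, epoch            # patience exhausted: stop here
    return best, len(val_metrics) - 1     # ran out of epochs first

# Fake validation PR-AUC curve: improves for 4 epochs, then plateaus.
curve = [0.1, 0.3, 0.5, 0.55] + [0.54] * 50
best, stopped = train_with_early_stopping(curve, patience=35)
```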
Appendix C Cross Entropy Regularization
Following the experience of Tu and Gimpel (2018), adding a local cross entropy loss to the energy objective can substantially improve performance. It can be seen as multi-task training with two objectives: energy optimization and reconstruction error minimization. We add the reconstruction cross-entropy losses of both the training and test inference networks to Eq. (10) to train better inference network approximations.
$\max_{\Phi, \Psi} \big[ \Delta(F_\Phi(X), Y) - E_\theta(X, F_\Phi(X)) + E_\theta(X, Y) \big]_+ - \lambda\, E_\theta\big(X, \{Y_{\mathrm{known}}, F_\Psi(X)\}\big) - \lambda_1\, \mathrm{CE}(F_\Phi(X), Y) - \lambda_2\, \mathrm{CE}(F_\Psi(X), Y)$  (11)

where $\lambda$, $\lambda_1$ and $\lambda_2$ are hyperparameters which can all be set to 1 in practice. Note that these additional regularization terms are independent of $\theta$, so we do not need to add them when minimizing over $\theta$ (Eq. (8) is unchanged).
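The local cross-entropy regularizer here is the usual multi-label binary cross-entropy between the relaxed predictions and the ground-truth DDI vectors (a minimal sketch; the clipping for numerical safety is our addition):

```python
import numpy as np

def binary_cross_entropy(y_pred, y_true, eps=1e-12):
    """Multi-label BCE between relaxed predictions and ground-truth DDI vectors."""
    p = np.clip(y_pred, eps, 1.0 - eps)   # avoid log(0)
    return float(-(y_true * np.log(p) + (1 - y_true) * np.log(1 - p)).mean())

y_true = np.array([1.0, 0.0, 1.0])
y_pred = np.array([0.9, 0.2, 0.8])        # inference network output
reg = binary_cross_entropy(y_pred, y_true)
```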
Appendix D Correlation Analysis
Pair of DDI Types  Truth  GENN  GNN  

Metabolism  Bradycardic Activities  0.6423  0.5375  0.3176 
Risk of Hypotension  Neuromuscular Blocking Activities  0.3347  0.2954  0.0027 
Risk of Hyperkalemia  Neuromuscular Blocking Activities  0.2854  0.2822  0.3913 
As shown in Table 3, the three pairs of DDI types exhibit large (0.6423), medium (0.3347), and small (0.2854) positive correlations, as indicated by the Truth column. For the first pair, decreased metabolism has a large correlation with increased bradycardic activity in the ground truth, but GNN produced only a medium correlation coefficient, much lower than GENN's. For the last two pairs, GNN performed poorly, giving a nearly zero coefficient for the second pair and overestimating the coefficient for the last. Meanwhile, GENN agrees with the ground truth, which to some extent provides evidence that GENN has the power to capture correlations between labels.