BayesGrad: Explaining Predictions of
Graph Convolutional Networks
Abstract
Recent advances in graph convolutional networks have significantly improved the performance of chemical predictions, raising a new research question: “how do we explain the predictions of graph convolutional networks?” A possible approach to answer this question is to visualize evidence substructures responsible for the predictions. For chemical property prediction tasks, the sample size of the training data is often small and/or a label imbalance problem occurs, where a few samples belong to a single class and the majority of samples belong to the other classes. This can lead to uncertainty related to the learned parameters of the machine learning model. To address this uncertainty, we propose BayesGrad, which uses the Bayesian predictive distribution to define the importance of each node in an input graph and is computed efficiently using the dropout technique. We demonstrate that BayesGrad successfully visualizes the substructures responsible for the label prediction in an experiment on synthetic data, even when the sample size is small. Furthermore, we use a real dataset to evaluate the effectiveness of the visualization. The basic idea of BayesGrad is not limited to graph-structured data and can be applied to other data types.
Keywords:
Machine learning, Deep learning, Interpretability, Cheminformatics, Graph convolution
1 Introduction
The applications of deep neural networks are expanding rapidly in various fields, including chemistry and biology. Graph convolutional neural networks, which can handle graph-structured data (e.g., chemical compounds) as inputs, have opened the door to end-to-end learning for chemical prediction. Many variants of graph convolutional neural networks have been proposed, and they are now improving the performance of various chemical prediction tasks, including physical property prediction [1], toxicity prediction [2], solubility and drug efficacy prediction [3], and total energy prediction [4].
Deep neural networks automatically learn useful features for prediction, which sometimes outperform hand-engineered features carefully designed by domain experts, enabling these neural networks to find new knowledge about molecular properties. However, the complex nonlinear operations in deep neural networks make it prohibitively difficult to understand their behaviors.
A sensitivity map, also known as a saliency map or pixel attribution map, is a common tool for explaining the predictions of neural networks. The map assigns an importance score to each substructure of an instance, reflecting the influence of the substructure on the final prediction, and visualizes the high-scoring substructures. Gradients are commonly used to measure the importance. A naive approach uses the norm of the gradient [5] (we call this approach VanillaGrad). Sensitivity maps generated by this approach are typically noisy. SmoothGrad was proposed to address this issue by adding noise to input samples and taking the mean of the resulting gradients [6].
The existing approaches do not take into account the uncertainty in the prediction of the model. The uncertainty becomes particularly apparent in the chemical domain, because the sample size of a chemical dataset is often small and/or a label imbalance problem occurs, where only a few samples belong to a single class and the majority of the samples belong to the other classes. In such cases, it is difficult to estimate which substructures are responsible for a prediction.
In this paper, we propose BayesGrad, a novel sensitivity map algorithm that can deal with model uncertainty. Our key idea is to quantify the uncertainty of a prediction using its Bayesian predictive distribution. We implement the idea using dropout, a common regularization technique for deep neural networks, because the outputs obtained using this technique approximate the expected value with respect to the Bayesian predictive distribution [7, 8].
We conducted experiments using a synthetic compound dataset labeled according to the presence of a particular substructure, and quantitatively evaluated the validity of our importance score. BayesGrad achieved superior performance, especially when the amount of training data was small. We also used real datasets to visualize the bases of the predictions, and found that the visualized substructures are consistent with known results. Although we present the formalization of BayesGrad in the context of graph-structured data and demonstrate its effectiveness in the chemical domain, BayesGrad is a general framework and can thus be applied to other data types such as images.
Our contributions to the literature are summarized as follows:

Bayesian approximation for sensitivity map visualization: We propose a novel method that uses the dropout technique to quantify model uncertainty.

Application of gradient-based sensitivity map visualization for graphs: Most of the existing gradient-based sensitivity map algorithms are evaluated on image classification tasks.

Quantitative evaluation in the chemical domain: We quantitatively evaluated the performance of the gradient-based method to visualize the basis of a prediction.
2 Preliminaries
We begin with the problem setting for sensitivity map visualization in graph prediction, followed by a brief review of several existing sensitivity map generation methods.
2.1 Problem definition
We assume that we have an (already trained) regression or classification model $f: \mathcal{G} \to \mathbb{R}$, where $G = (V, E) \in \mathcal{G}$ is a graph consisting of a set of nodes $V$ and a set of edges $E$, and the output of the model indicates the regression result or the classification score. Note that in the case of a binary classification model, the output of $f$ is the raw score in $\mathbb{R}$, not a value transformed by the sigmoid function. In the graph neural network $f$, a node $i \in V$ is associated with a feature vector $x_i \in \mathbb{R}^d$.
Given the model $f$ and a target input graph $G$, our goal is to assign an importance score $s_i$ to each node $i \in V$.
2.2 VanillaGrad
There have been several recent attempts to interpret the predictions made by complex neural network models. Although these methods focus on images, they are easily applied to graphs. The gradient of the model output $f$ with respect to the node feature $x_i$ is often used as the importance score $s_i$ of an input (i.e., a node) [5]:
$$ s_i = \left\| \frac{\partial f}{\partial x_i} \right\| \tag{1} $$
For simplicity, we typically use the 2-norm, but another norm such as the 1-norm can also be used. We call the importance score defined in Eq. (1) VanillaGrad.
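As a minimal sketch of Eq. (1), the snippet below scores each node by the norm of the gradient of the model output with respect to that node's feature vector. The model `toy_model`, its weights `w`, and the finite-difference helper `node_gradients` are hypothetical stand-ins introduced only for illustration; a real graph network would obtain the gradients through automatic differentiation.

```python
import math

def toy_model(x, w):
    """Hypothetical stand-in for the trained model f: sum_i tanh(w . x_i)."""
    return sum(math.tanh(sum(wd * xd for wd, xd in zip(w, xi))) for xi in x)

def node_gradients(f, x, eps=1e-5):
    """Central finite-difference estimate of df/dx_i for every node i."""
    grads = []
    for i in range(len(x)):
        g_i = []
        for d in range(len(x[i])):
            x_plus = [list(row) for row in x]
            x_minus = [list(row) for row in x]
            x_plus[i][d] += eps
            x_minus[i][d] -= eps
            g_i.append((f(x_plus) - f(x_minus)) / (2 * eps))
        grads.append(g_i)
    return grads

def vanilla_grad(f, x, p=2):
    """Eq. (1): the importance of node i is the p-norm of df/dx_i (p = 1 or 2)."""
    scores = []
    for g_i in node_gradients(f, x):
        if p == 1:
            scores.append(sum(abs(g) for g in g_i))
        else:
            scores.append(math.sqrt(sum(g * g for g in g_i)))
    return scores

w = [1.0, 1.0]
x = [[0.0, 0.0], [10.0, 10.0]]  # node 0 is unsaturated, node 1 is saturated
scores = vanilla_grad(lambda feats: toy_model(feats, w), x)
```

In this toy input the second node sits in the flat region of tanh, so its score is close to zero even though it contributes to the output. This illustrates that a gradient-based score measures local sensitivity rather than total contribution.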
2.3 SmoothGrad
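Sensitivity maps generated by VanillaGrad are known to be noisy. To address the problem, SmoothGrad [6] calculates the expected value of the score in Eq. (1) over Gaussian noise $\epsilon$ added to the input $x$:
$$ s_i = \mathbb{E}_{\epsilon \sim \mathcal{N}(0, \sigma^2 I)} \left[ \left\| \frac{\partial f(\tilde{x})}{\partial \tilde{x}_i} \right\| \right], \qquad \tilde{x} = x + \epsilon \tag{2} $$
We approximate the value of Eq. (2) using sampling. SmoothGrad first generates $M$ noisy inputs $\tilde{x}^{(1)}, \dots, \tilde{x}^{(M)}$ by adding noise to the original input $x$; that is, $\tilde{x}_i^{(m)}$ is given as:
$$ \tilde{x}_i^{(m)} = x_i + \epsilon_i^{(m)} \tag{3} $$
where $\epsilon_i^{(m)}$ is a sample from a Gaussian distribution with a mean of zero and a variance of $\sigma^2$. The importance score of a noisy input is then calculated as
$$ \tilde{s}_i^{(m)} = \left\| \frac{\partial f(\tilde{x}^{(m)})}{\partial \tilde{x}_i^{(m)}} \right\| \tag{4} $$
Finally, the importance score of the original input is estimated as the average of $\tilde{s}_i^{(m)}$:
$$ s_i \approx \frac{1}{M} \sum_{m=1}^{M} \tilde{s}_i^{(m)} \tag{5} $$
which is called SmoothGrad. Note that both the variance $\sigma^2$ of the Gaussian noise and the sample size $M$ are hyperparameters to be tuned. In the original paper, $\sigma$ is tuned as a relative scale with respect to the range of the input values, $x_{\max} - x_{\min}$. However, we used a fixed value of $\sigma$ for each input, because this was more stable in our experiment.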
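The sampling procedure of Eqs. (3)-(5) can be sketched as follows. This is an illustrative sketch under stated assumptions: the toy model, its weights `W`, the noise level `sigma=0.15`, and the sample count are hypothetical choices, not the settings used in the experiments.

```python
import math
import random

W = [1.0, -2.0]  # hypothetical weights of a toy model f(x) = sum_i tanh(W . x_i)

def grad_norms(x):
    """2-norm of the analytic gradient df/dx_i = (1 - tanh(W . x_i)^2) W per node."""
    w_norm = math.sqrt(sum(w * w for w in W))
    out = []
    for xi in x:
        a = sum(w * v for w, v in zip(W, xi))
        out.append((1.0 - math.tanh(a) ** 2) * w_norm)
    return out

def smooth_grad(x, sigma=0.15, n_samples=100, seed=0):
    """Eqs. (3)-(5): average the per-node scores over Gaussian-perturbed inputs."""
    rng = random.Random(seed)
    totals = [0.0] * len(x)
    for _ in range(n_samples):
        # Eq. (3): add zero-mean Gaussian noise with standard deviation sigma.
        noisy = [[v + rng.gauss(0.0, sigma) for v in xi] for xi in x]
        # Eq. (4): score of the noisy input.
        for i, s in enumerate(grad_norms(noisy)):
            totals[i] += s
    # Eq. (5): Monte Carlo average over the noisy samples.
    return [t / n_samples for t in totals]

x = [[0.0, 0.0], [1.0, -1.0]]
smoothed = smooth_grad(x)
```

Setting `sigma=0.0` recovers the VanillaGrad scores, which is a convenient sanity check when tuning the noise level.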
2.4 Importance score calculation using signed values
In the previous discussion, the sensitivity map shows only how strongly each atom affects the prediction, not whether the effect is positive or negative. To address this, Shrikumar et al. [9] used the product of the input and the gradient instead of the norm to evaluate how the atoms affect the output:
$$ s_i = (x_i - x'_i)^\top \frac{\partial f}{\partial x_i} \tag{6} $$
where $x'_i$ denotes the baseline vector. The above formula represents the effect on the function $f$ when we change the $i$-th input from $x'_i$ to $x_i$. It can be understood as (the negative of) the first-order term of the Taylor expansion of $f$ at $x$, which is evaluated at $x'$. Note that there is freedom in choosing the baseline $x'$; it is often set to $x' = 0$, which corresponds to a black image in the image domain. As we discuss in the experimental section, this technique gives us richer information in certain cases.
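A small worked example may help make Eq. (6) concrete. The sketch below assumes a hypothetical linear model, for which the first-order Taylor term is exact, so the signed scores sum to the change in output between the baseline and the input. The names `linear_model` and `signed_scores` are introduced here for illustration only.

```python
def linear_model(x, w, b=0.0):
    """Hypothetical linear readout f(x) = b + sum_i w . x_i, for illustration only."""
    return b + sum(sum(wd * xd for wd, xd in zip(w, xi)) for xi in x)

def signed_scores(x, grads, baseline=None):
    """Eq. (6): s_i = (x_i - x'_i) . df/dx_i, keeping the sign of each contribution."""
    if baseline is None:
        baseline = [[0.0] * len(xi) for xi in x]  # x' = 0, the common default
    return [
        sum((xd - bd) * gd for xd, bd, gd in zip(xi, bi, gi))
        for xi, bi, gi in zip(x, baseline, grads)
    ]

w = [2.0, -1.0]
x = [[1.0, 0.0], [0.0, 3.0], [0.5, 0.5]]
grads = [list(w) for _ in x]  # for the linear model, df/dx_i = w for every node
scores = signed_scores(x, grads)

# For a linear model the scores add up to f(x) - f(x') exactly.
zero = [[0.0, 0.0] for _ in x]
total_change = linear_model(x, w) - linear_model(zero, w)
```

Here the first and third nodes receive positive scores and the second a negative one. For a nonlinear network the sum of the scores only approximates the output change, to first order.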
3 Proposed Method
Existing approaches do not consider the uncertainty of the prediction by the model. To address this issue, we propose BayesGrad, which quantifies the uncertainty of the sensitivity map using Bayesian inference. We first describe the formulation of BayesGrad, and then explain the practical implementation using the dropout technique.
3.1 BayesGrad
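The existing methods are formulated in the framework of maximum likelihood estimation of the neural network parameter $\theta$. However, the learned $\theta$ is not necessarily stable and can vary with the addition or deletion of a small portion of the training data.
In our formulation, we consider the uncertainty of the parameter by using the posterior $p(\theta \mid \mathcal{D})$ of the neural network parameter $\theta$ given the training data $\mathcal{D}$. We consider the expected value of the importance score with respect to $\theta$:
$$ s_i = \mathbb{E}_{\theta \sim p(\theta \mid \mathcal{D})} \left[ \left\| \frac{\partial f_{\theta}}{\partial x_i} \right\| \right] \tag{7} $$
We approximate this using sampling as
$$ s_i \approx \frac{1}{M} \sum_{m=1}^{M} s_i^{(m)} \tag{8} $$
where $s_i^{(m)}$ is the $m$-th importance score computed from the $m$-th sample ($\theta^{(m)} \sim p(\theta \mid \mathcal{D})$) as
$$ s_i^{(m)} = \left\| \frac{\partial f_{\theta^{(m)}}}{\partial x_i} \right\| \tag{9} $$
We call the importance score computed by Eq. (8) BayesGrad. BayesGrad has the sample size $M$ as a hyperparameter.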
3.2 Dropout as a Bayesian Approximation
To implement BayesGrad, we need to draw samples from the posterior distribution $p(\theta \mid \mathcal{D})$. In general, exact computation of the posterior is intractable, in which case we resort to approximation methods such as Markov chain Monte Carlo or variational Bayesian approximation. We use the dropout technique, which can be interpreted as a variational Bayesian method, because of its relatively small computational cost. Dropout was originally introduced as a regularization technique to prevent overfitting [10, 11], but recent studies show that it can be viewed as a kind of variational Bayesian inference [7, 8]. We use the “Dropout as a Bayesian Approximation (DBA)” technique to calculate the uncertainty of the model using dropout. It approximates the posterior distribution $p(\theta \mid \mathcal{D})$ using a variational distribution $q_{\phi}(\theta)$, where $\phi$ is a parameter chosen so that $q_{\phi}(\theta)$ best approximates $p(\theta \mid \mathcal{D})$. In DBA, $q_{\phi}(\theta)$ has a special form: $\theta$ is given as the Hadamard product of an adjustable constant matrix and a random mask matrix, and the stochasticity lies in the elements of the mask matrix, each of which takes the value zero or one, typically with equal probability.
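The sampling in Eqs. (8)-(9) with dropout masks can be sketched as below. Everything model-specific here is an assumption made for illustration: the toy network with parameters `W` and `U`, the mask applied to hidden units, and the omission of the usual dropout rescaling. The point is only that each random mask plays the role of one parameter sample, and that dropout stays active while the gradients are computed.

```python
import math
import random

# Hypothetical toy parameters: 3 hidden units, 2 input features per node.
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
U = [1.0, -1.0, 0.5]

def grad_wrt_nodes(x, mask):
    """Analytic df/dx_i for f(x) = sum_i sum_h mask[h] * U[h] * tanh(W[h] . x_i)."""
    grads = []
    for xi in x:
        g = [0.0] * len(xi)
        for z, u, w_h in zip(mask, U, W):
            a = sum(wd * xd for wd, xd in zip(w_h, xi))
            coef = z * u * (1.0 - math.tanh(a) ** 2)
            for d in range(len(xi)):
                g[d] += coef * w_h[d]
        grads.append(g)
    return grads

def bayes_grad(x, n_samples=100, keep_prob=0.5, seed=0):
    """Eqs. (8)-(9): average gradient norms over dropout masks kept active at inference."""
    rng = random.Random(seed)
    totals = [0.0] * len(x)
    for _ in range(n_samples):
        # One random mask corresponds to one parameter sample theta^(m).
        mask = [1.0 if rng.random() < keep_prob else 0.0 for _ in U]
        for i, g in enumerate(grad_wrt_nodes(x, mask)):
            totals[i] += math.sqrt(sum(v * v for v in g))  # Eq. (9)
    return [t / n_samples for t in totals]                 # Eq. (8)

x = [[0.0, 0.0], [3.0, 3.0]]
scores = bayes_grad(x)
```

With `keep_prob=1.0` every mask is all ones and the result reduces to the VanillaGrad score of the deterministic network.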
3.3 Comparison between SmoothGrad and BayesGrad
In contrast with SmoothGrad, which takes the expectation of the gradient over possible fluctuations in the input variable, BayesGrad smooths gradients over fluctuations in the model parameter $\theta$, which follows the (approximate) posterior distribution $p(\theta \mid \mathcal{D})$. The validity of adding Gaussian noise to the input depends on the task. In the image domain, even if some noise is added to the original image, the noisy image still looks similar to the original one and is still considered natural. However, in the chemical domain, the input is the feature vector of each atom, which is originally a discrete object; hence, the noisy input no longer corresponds to a real atom. A similar argument applies in other domains, e.g., word embeddings in natural language processing.
Another benefit of BayesGrad becomes apparent when training data are scarce. Model training tends to be unstable in such cases, and the model predictions tend to be stochastic. BayesGrad can treat this type of uncertainty by exploiting Bayesian inference.
Note that the idea behind SmoothGrad taking the expectation in the input space and that of BayesGrad taking the expectation in the model space are not mutually exclusive, and we can combine both techniques to calculate a sensitivity map (as BayesSmoothGrad).
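One way to write the combined score is to draw a parameter sample and a noise sample jointly for each of the $M$ evaluations. The text above does not give an explicit formula, so the form below is an assumption, with $q(\theta)$ denoting the dropout-induced approximate posterior and $\epsilon$ the Gaussian input noise:

```latex
s_i \;\approx\; \frac{1}{M} \sum_{m=1}^{M}
  \left\| \frac{\partial f_{\theta^{(m)}}\!\left(x + \epsilon^{(m)}\right)}{\partial x_i} \right\|,
\qquad
\theta^{(m)} \sim q(\theta), \quad
\epsilon^{(m)} \sim \mathcal{N}(0, \sigma^2 I)
```

Setting $\sigma = 0$ recovers the BayesGrad estimate, and fixing $\theta^{(m)}$ to the learned parameter recovers SmoothGrad.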
4 Experiments
We demonstrate the effectiveness of our approach in the chemical domain, where the sample size can be small and there is high demand for substructure visualization. We first validate the methods using a synthetic dataset where the ground-truth substructures correlated with the target label are known. In addition, we demonstrate the method using real datasets and discuss its effectiveness.
We used Chainer Chemistry, an open-source deep learning framework that provides major graph convolutional network algorithms [12]. We slightly modified the neural fingerprint method [3] and the gated-graph neural network (GGNN) [13] in the library by including the dropout function to perform BayesGrad. Our code used in this experiment is available at https://github.com/pfnetresearch/bayesgrad. Please refer to the code for how to reproduce our results, including the hyperparameter configuration.
4.1 Quantitative Evaluation on Tox21 Synthetic Data
Tox21 [14] is a collection of chemical compounds including training, validation, and test data samples. Each compound is associated with some of the toxicity-type labels; we used only the training and validation data in our experiment since the test dataset has no label information. Since the original Tox21 dataset does not indicate which substructures actually contribute to the toxicity labels, we first used synthetic labels to quantitatively evaluate the different evidence visualization methods. We assigned the label to compounds that contain pyridine () and to the remainder, which resulted in label compounds and label compounds in the training dataset.
We trained the GGNN with the dropout function to predict whether the input compound contains pyridine. The GGNN has a gating architecture that enables the model to set the weights for important information. After training the model, the ROC-AUC scores for both the training and the validation data were as high as 0.99, which suggests that the model was successfully trained.
The validation dataset was used for testing. There are molecules that contain a pyridine substructure in the validation dataset. We expect atoms that belong to pyridine rings to have higher importance scores than the others; hence, after calculating the importance scores with each method, we selected the atoms in descending order of score. We calculated the gradient of the output of the pre-final layer, just before the sigmoid function that gives probability values, because the sigmoid function squashes the gradient and the performance became worse when we took the gradient after the sigmoid function.
Figure 1 shows the precision-recall curve, where the precision indicates the proportion of atoms belonging to pyridine rings among the extracted atoms, and the recall indicates the proportion of pyridine-ring atoms that were extracted. We used for the SmoothGrad and BayesGrad calculations.
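The ranking-based evaluation described above can be sketched as follows. The function names, the toy scores, and the step-wise area computation are assumptions made for illustration; the exact procedure used to produce the reported PRC-AUC values (for example, whether atoms are pooled across molecules) is not specified here.

```python
def precision_recall_curve(scores, is_target):
    """Rank atoms by descending importance; return (recall, precision) after each pick.

    Assumes at least one atom is marked as a target.
    """
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    n_target = sum(is_target)
    points, hits = [], 0
    for k, i in enumerate(order, start=1):
        hits += is_target[i]
        points.append((hits / n_target, hits / k))
    return points

def pr_auc(points):
    """Area under the precision-recall curve by step-wise summation."""
    area, prev_recall = 0.0, 0.0
    for recall, precision in points:
        area += (recall - prev_recall) * precision
        prev_recall = recall
    return area

# Hypothetical importance scores for five atoms; 1 marks atoms in the target ring.
scores = [0.9, 0.1, 0.8, 0.4, 0.3]
is_target = [1, 0, 0, 1, 0]
points = precision_recall_curve(scores, is_target)
auc = pr_auc(points)
```

In this toy case the two target atoms are ranked first and third, so the precision at full recall is 2/3.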
Table 1: PRC-AUC score (mean ± standard deviation) of each algorithm.
Algorithm | PRC-AUC score
VanillaGrad | 0.506 ± 0.044
SmoothGrad | 0.514 ± 0.042
BayesGrad (Ours) | 0.544 ± 0.019
BayesSmoothGrad (Ours) | 0.536 ± 0.028
Figure 2 shows the sensitivity map visualization for each method. All of the methods successfully extracted the substructure containing the pyridine ring. This result implies that the gradient-based sensitivity map calculation is effective in extracting the substructure responsible for the target label in chemical prediction tasks.
Note that even though BayesGrad seems to outperform SmoothGrad and VanillaGrad in Fig. 1, this result is not deterministic owing to the stochastic behavior of SmoothGrad and BayesGrad. To compare the performance of the methods, we consider a slightly more difficult case with a small dataset. This reflects a practical situation where limited data are available and the model’s prediction tends to be uncertain.
To test whether BayesGrad can deal with the uncertainty of the prediction, we randomly selected 30 different subsets, each consisting of 1000 compounds, from the original training dataset and trained 30 different models. We calculated the mean and standard deviation of their PRC-AUC scores. The results are summarized in Table 1. BayesGrad records statistically higher PRC-AUC scores than both VanillaGrad and SmoothGrad. We also tested the BayesSmoothGrad method, which uses both dropout and input noise; however, it did not improve on BayesGrad in this experiment.
4.2 Visualization on Tox21 Actual Data
We also performed a toxicity prediction experiment using the Tox21 dataset, where each compound has some of 12 toxicity labels. We trained a prediction model for each label, and visualized the basis for predictions of the label SR-MMP, which had the highest prediction accuracy ( ROC-AUC in test data).
Figure 3 shows some interesting results. Tyrphostin 9 (Fig. 3 (a)) is a tyrosine kinase inhibitor and is known to be a potent uncoupler of oxidative phosphorylation, which has a strong influence on the mitochondrial membrane potential (SR-MMP). Terada et al. [15] examined the effect of the acid-dissociable group on mitochondrial function using Tyrphostin 9 and a derivative modified by methylation of its phenolic OH group. They confirmed that the acid-dissociable group is essential for uncoupling. We computed sensitivity maps for these compounds, as shown in Fig. 3 (a) and (b). Our visualization results are consistent with their experimental results. We also found similar compounds with an acid-dissociable group in the Tox21 dataset, as shown in Fig. 3 (c) and (d). These results suggest that our visualization method has the potential to detect such essential substructures accurately.
4.3 Evaluation on Solubility Dataset
Solubility is an important property in drug design because sufficient water solubility is necessary for drug absorption. It is well known that some functional groups, such as the hydroxyl group and the primary amine group, contribute to hydrophilicity, whereas other groups, such as the phenyl group and the ethyl group, contribute to hydrophobicity. In addition, molecular weight has a strong correlation with solubility. To improve solubility, medicinal chemists need to modify the chemical structure by adding charged substituents and by reducing the hydrophobic groups and the molecular weight. However, if the molecule has a complicated structure, it becomes difficult to identify which part of the structure is significant for the chemical property. Thus, our motivation is to provide a way to visualize which parts of a chemical structure are significant for solubility. We demonstrate the effectiveness of our approach using a publicly available dataset.
We used the ESOL dataset [16] to evaluate our approach. This dataset contains 1,127 compounds with measured log solubility. We used the signed importance score explained in Section 2.4 to discriminate between positive and negative contributions to solubility. The choice of the baseline $x'$ is not trivial in chemical prediction tasks, where we consider the embedded feature vector space of the atom representation as input. In our experiment, we used the baseline $x' = 0$, which corresponds to the mean of the prior distribution of the embedded feature vector. We used the neural fingerprint model [3] for this task.
Figure 4 shows the prediction result, where the model achieved good performance for the solubility prediction. Figure 5 shows examples of the visualization for solubility prediction. Our approach accurately assigns positive importance scores to the hydrophilic atoms, and negative scores to the hydrophobic atoms, even for such compounds with complicated structures.
5 Conclusion
We proposed a method to visualize a sensitivity map for chemical prediction tasks. While existing methods focus on visualization in the image domain, our quantitative evaluation with the Tox21 dataset showed that BayesGrad outperforms the existing methods on graph-structured chemical data. BayesGrad exploits Bayesian inference to handle the uncertainty in predictions, which contributes to a robust sensitivity map, especially for small datasets. Furthermore, we obtained promising experimental results on real datasets, which accord with well-known chemical properties.
Elucidating the chemical mechanism is challenging research. We believe the proposed algorithm will lead to a better understanding of the chemical mechanism. Our idea is easily applicable to other deep neural networks in other domains, which we leave to future research.
References
 [1] Justin Gilmer, Samuel S. Schoenholz, Patrick F. Riley, Oriol Vinyals, and George E. Dahl. Neural message passing for quantum chemistry. In Proceedings of the 34th International Conference on Machine Learning (ICML), pages 1263–1272, 2017.
 [2] Steven Kearnes, Kevin McCloskey, Marc Berndl, Vijay Pande, and Patrick Riley. Molecular graph convolutions: Moving beyond fingerprints. Journal of Computer-Aided Molecular Design, 30(8):595–608, 2016.
 [3] David K. Duvenaud, Dougal Maclaurin, Jorge Iparraguirre, Rafael Bombarell, Timothy Hirzel, Alan Aspuru-Guzik, and Ryan P. Adams. Convolutional networks on graphs for learning molecular fingerprints. In Advances in Neural Information Processing Systems 28, pages 2224–2232, 2015.
 [4] Kristof Schütt, Pieter-Jan Kindermans, Huziel Enoc Sauceda Felix, Stefan Chmiela, Alexandre Tkatchenko, and Klaus-Robert Müller. SchNet: A continuous-filter convolutional neural network for modeling quantum interactions. In Advances in Neural Information Processing Systems 30, pages 992–1002. Curran Associates, Inc., 2017.
 [5] Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034, 2013.
 [6] Daniel Smilkov, Nikhil Thorat, Been Kim, Fernanda Viégas, and Martin Wattenberg. SmoothGrad: removing noise by adding noise. arXiv preprint arXiv:1706.03825, 2017.
 [7] Shin-ichi Maeda. A Bayesian encourages dropout. arXiv preprint arXiv:1412.7003, 2014.
 [8] Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In Proceedings of the 33rd International Conference on Machine Learning (ICML), pages 1050–1059, 2016.
 [9] Avanti Shrikumar, Peyton Greenside, and Anshul Kundaje. Learning important features through propagating activation differences. arXiv preprint arXiv:1704.02685, 2017.
 [10] Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012.
 [11] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929–1958, 2014.
 [12] pfnet research. chainerchemistry. https://github.com/pfnetresearch/chainerchemistry.
 [13] Yujia Li, Daniel Tarlow, Marc Brockschmidt, and Richard Zemel. Gated graph sequence neural networks. Proceedings of the International Conference on Learning Representations (ICLR), 2016.
 [14] Ruili Huang, Menghang Xia, Dac-Trung Nguyen, Tongan Zhao, Srilatha Sakamuru, Jinghua Zhao, Sampada A. Shahane, Anna Rossoshek, and Anton Simeonov. Tox21Challenge to build predictive models of nuclear receptor and stress response pathways as mediated by exposure to environmental chemicals and drugs. Frontiers in Environmental Science, 3:85, 2016.
 [15] Hiroshi Terada, Yosuke Fukui, Yasuo Shinohara, and Motoharu Juichi. Unique action of a modified weakly acidic uncoupler without an acidic group, methylated SF 6847, as an inhibitor of oxidative phosphorylation with no uncoupling activity: possible identity of uncoupler binding protein. Biochimica et Biophysica Acta, 933:193–199, 1988.
 [16] John S. Delaney. ESOL: Estimating aqueous solubility directly from molecular structure. Journal of Chemical Information and Computer Sciences, 44(3):1000–1005, 2004. PMID: 15154768.