On Filter Size in Graph Convolutional Networks
Abstract
Recently, many researchers have been focusing on the definition of neural networks for graphs. The basic component for many of these approaches remains the graph convolution idea proposed almost a decade ago. In this paper, we extend this basic component, following an intuition derived from the well-known convolutional filters over multidimensional tensors. In particular, we derive a simple, efficient and effective way to introduce a hyper-parameter on graph convolutions that influences the filter size, i.e. its receptive field over the considered graph. We show with experimental results on real-world graph datasets that the proposed graph convolutional filter improves the predictive performance of Deep Graph Convolutional Networks.
I. Introduction
Graphs are a common and natural way to represent many real-world data, e.g. in chemistry a compound can be represented by its molecular graph, while in social networks the relationships between users are represented as edges in a graph whose nodes are the users. Many computational tasks involving such graphical representations require machine learning, such as the classification of active/non-active drugs or the prediction of a future link between two users in a social network. State-of-the-art machine learning techniques for classification and regression on graphs are, at the moment, kernel machines equipped with specifically designed kernels for graphs (e.g., [1, 2, 3]). Although there are examples of kernels for structures that can be designed on the basis of a training set [4, 5, 6], most of the more efficient and effective graph kernels are based on predefined structural features, i.e., feature definition is not part of the learning process.
There is a recent shift of trend from kernels to neural networks for graphs. Unlike kernels, in neural networks the features are defined by a learning process that is supervised by the graphs' labels (targets). Many approaches have addressed the problem of defining neural networks for graphs [7]. However, one of the core components, the graph convolution, has not changed much with respect to the earlier works [8, 9].
In this paper, we redesign this basic component. We propose a new formulation of the graph convolution operator that is strictly more general than the existing one. Our proposal can be applied to virtually all the techniques based on graph convolutions.
The paper is organized as follows. We start in Section II with some basic definitions and notation. In Section III, we provide an overview of the various proposals of graph convolution available in the literature. In Section IV, we detail our proposed parametric graph convolutional filter. In Section V, we discuss other related works that are not based on graph convolution, including some alternative graph neural network architectures and graph kernels. In Section VI, we report our experimental results. Finally, Section VII concludes the paper.
II. Notation and Definitions
We denote matrices with bold uppercase letters, vectors with uppercase letters, and variables with lowercase letters. Given a matrix $\mathbf{M}$, $M_i$ denotes the $i$-th row of the matrix, and $m_{ij}$ is the element in the $i$-th row and $j$-th column. Given a vector $V$, $v_i$ refers to its $i$-th element.
Let us consider a graph $G = (V^G, E^G, \mathbf{X}^G)$, where $V^G$ is the set of vertices (or nodes), $E^G \subseteq V^G \times V^G$ is the set of edges, and $\mathbf{X}^G \in \mathbb{R}^{|V^G| \times d}$ is a node label matrix, where each row $X_v$ is the label (a vector of size $d$) associated to the vertex $v \in V^G$. Note that, in this paper, we will not consider edge labels. When the reference to the graph is clear from the context, for the sake of notation we discard the superscript referring to the specific graph. We define the adjacency matrix $\mathbf{A}$ as $a_{uv} = 1$ if $(u, v) \in E$, 0 otherwise. We also define the neighborhood of a vertex $v$ as the set of vertices connected to $v$ by an edge, i.e. $\mathcal{N}(v) = \{u \mid (u, v) \in E\}$. Note that $\mathcal{N}(v)$ is also the set of nodes at shortest-path distance exactly one from $v$, i.e. $\mathcal{N}(v) = \{u \mid sp(u, v) = 1\}$, where $sp(\cdot, \cdot)$ is a function computing the shortest-path distance between two nodes in a graph.
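As a concrete illustration of this notation, the following minimal Python sketch (our own example, not part of the original formulation; the function names are ours) builds the adjacency matrix of a small undirected graph and computes the neighborhood of a vertex:

```python
import numpy as np

def adjacency_matrix(num_nodes, edges):
    """Build the symmetric adjacency matrix A of an undirected graph:
    a_uv = 1 if (u, v) is an edge, 0 otherwise."""
    A = np.zeros((num_nodes, num_nodes), dtype=np.float32)
    for u, v in edges:
        A[u, v] = 1.0
        A[v, u] = 1.0
    return A

def neighborhood(A, v):
    """N(v): the set of nodes at shortest-path distance exactly one from v."""
    return set(np.nonzero(A[v])[0].tolist())

# Toy example: a path graph 0 - 1 - 2 - 3
A = adjacency_matrix(4, [(0, 1), (1, 2), (2, 3)])
print(neighborhood(A, 1))  # {0, 2}
```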
In this paper, we consider the problem of graph classification. Given a dataset composed of pairs $\{(G_i, y_i)\}_{i=1}^{N}$, where $G_i$ is a graph and $y_i$ its target label, the task is then, given an unseen graph $G$, to predict its correct target $y$.
III. Graph convolutions
The first definition of a neural network for graphs was proposed in [10]. More recent models have been proposed in [8, 9]. Both works are based on an idea that has later been rebranded as graph convolution.
The idea is to define the neural architecture following the topology of the graph. A transformation is then performed from the neurons corresponding to a vertex and its neighborhood to a hidden representation, which is associated to the same vertex (possibly in another layer of the network). This transformation depends on some parameters, which are shared among all the nodes. In the following, for the sake of simplicity, we ignore the bias terms.
In [9], when considering non-positional graphs, i.e. the most common definition, and the one we consider in this paper, a transition function on a graph node $v$ at time $t$ is defined as:
$H_v(t) = \sum_{u \in \mathcal{N}(v)} h_{\mathbf{W}}\big(X_v, X_u, H_u(t-1)\big)$   (1)
where $h_{\mathbf{W}}$ is a parametric function whose parameters have to be learned (e.g. a neural network) and are shared among all the vertices. Note that, if edge labels are available, they can be included in eq. (1). In fact, in the original formulation, $h_{\mathbf{W}}$ also depends on the label of the edge between $v$ and $u$. This transition function is part of a recurrent system. It is defined as a contraction mapping, thus the system is guaranteed to converge to a fixed point, i.e. a representation, that does not depend on the particular initialization of the weight matrix $\mathbf{W}$. The output is computed from the representation at convergence, $H_v$, and the original node labels as follows:
$O_v = g\big(H_v, X_v\big)$   (2)
where $g$ is another neural network. [11] extends the work in [9] by removing the constraint that the recurrent system be a contraction mapping, and by replacing the recurrent units with GRUs. However, it has recently been shown in [12] that stacked graph convolutions are superior to graph recurrent architectures in terms of both accuracy and computational cost.
In [8], a model referred to as Neural Network for Graphs (NN4G) is proposed. In the first layer, a transformation over node labels is computed:
$H^{(1)}_v = \sigma\big(X_v \bar{\mathbf{W}}^{(1)}\big)$   (3)
where $\bar{\mathbf{W}}^{(1)}$ are the weights connecting the original labels to the current neuron, and $v$ is the vertex index. The graph convolution is then defined for the $i$-th layer (for $i > 1$) as:
$H^{(i)}_v = \sigma\Big(X_v \bar{\mathbf{W}}^{(i)} + \sum_{l=1}^{i-1} \sum_{u \in \mathcal{N}(v)} H^{(l)}_u \hat{\mathbf{W}}^{(i,l)}\Big)$   (4)
where $\hat{\mathbf{W}}^{(i,l)}$ are weights connecting the previous hidden layers to the current neuron (shared among all the vertices). Note that in this formulation, skip connections are present to the $i$-th layer from layers $1$ to $i-1$. There is an interesting recent work about the parallel between skip-connections (residual networks in that case) and recurrent networks [13]. However, since in the formulation in eq. (4) every layer is connected to all the subsequent layers, it is not possible to reduce it to a (vanilla) recurrent model. Let us consider the $i$-th graph convolutional layer, which comprises multiple graph convolutional filters. We can rewrite eq. (4) for the whole layer as:
$\mathbf{H}^{(i)} = \sigma\Big(\mathbf{X}\bar{\mathbf{W}}^{(i)} + \sum_{l=1}^{i-1} \mathbf{A}\,\mathbf{H}^{(l)}\hat{\mathbf{W}}^{(i,l)}\Big)$   (5)
where $1 \le i \le L$ (and $L$ is the number of layers), $\mathbf{H}^{(l)} \in \mathbb{R}^{|V| \times c_l}$, $\mathbf{X} \in \mathbb{R}^{|V| \times d}$, $\bar{\mathbf{W}}^{(i)} \in \mathbb{R}^{d \times c_i}$, $\hat{\mathbf{W}}^{(i,l)} \in \mathbb{R}^{c_l \times c_i}$, $c_i$ is the size of the hidden representation at the $i$-th layer, and $\sigma$ is applied element-wise.
An abstract representation of eq. (4) is depicted in Figure 1. The convolution in eq. (4) is part of a multilayer architecture, where each layer’s connectivity resembles the topology of the graph, and the training is layerwise. Finally, for each graph, NN4G computes the average graph node representation for each hidden layer, and concatenates them. This is the graph representation computed by NN4G, and it can be used for the final prediction of graph properties with a standard output layer.
In [14], a hierarchical approach has been proposed. This method is similar to NN4G and is inspired by circular fingerprints of chemical structures. While [8] adopts Cascade Correlation for training, [14] uses end-to-end backpropagation. ECC [15] proposes an improvement of [14], weighting the sum over the neighbors of a node with weights conditioned on the edge labels. We consider this last version as a baseline in our experiments.
Recently, [16] derived a graph convolution that closely resembles eq. (4). Let us, from now on, consider $\tilde{\mathbf{A}} = \mathbf{A} + \mathbf{I}$. Motivated by a first-order approximation of localized spectral filters on graphs, the proposed graph convolutional filter has the form:
$\mathbf{H}^{(i+1)} = \sigma\big(\tilde{\mathbf{D}}^{-\frac{1}{2}}\,\tilde{\mathbf{A}}\,\tilde{\mathbf{D}}^{-\frac{1}{2}}\,\mathbf{H}^{(i)}\,\mathbf{W}^{(i)}\big)$   (6)
where $\tilde{\mathbf{D}}$ is the diagonal degree matrix of $\tilde{\mathbf{A}}$ (i.e. $\tilde{d}_{vv} = \sum_u \tilde{a}_{vu}$), $\mathbf{H}^{(0)} = \mathbf{X}$, and $\sigma$ is any activation function applied element-wise.
If we ignore the $\tilde{\mathbf{D}}^{-\frac{1}{2}}$ terms (that in practice act as normalization), it is easy to see that eq. (6) is very similar to eq. (5), the difference being that there are no skip connections in this case, i.e. the $i$-th layer is connected just to the $(i+1)$-th layer. Consequently, we just have to learn one weight matrix per layer.
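To make the form of eq. (6) concrete, the following NumPy sketch (our own illustration; the dense-matrix implementation and the tanh default activation are assumptions, not the original code) computes one graph convolution of this kind:

```python
import numpy as np

def gcn_layer(A, H, W, activation=np.tanh):
    """One graph convolution in the style of eq. (6):
    sigma(D~^{-1/2} (A + I) D~^{-1/2} H W)."""
    A_tilde = A + np.eye(A.shape[0])          # add self-loops
    d = A_tilde.sum(axis=1)                   # degrees of A~
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))    # D~^{-1/2}
    return activation(D_inv_sqrt @ A_tilde @ D_inv_sqrt @ H @ W)

# Toy usage: 4 nodes, 3 input channels, 2 output channels
A = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], dtype=float)
H = np.random.rand(4, 3)
W = np.random.rand(3, 2)
print(gcn_layer(A, H, W).shape)  # (4, 2)
```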
In [17], a slightly more complex model than [16] is proposed. This model shows the highest predictive performance among the methods presented in this section. The first layers of the network are again stacked graph convolutional layers, defined as follows:
$\mathbf{H}^{(i+1)} = \sigma\big(\tilde{\mathbf{D}}^{-1}\,\tilde{\mathbf{A}}\,\mathbf{H}^{(i)}\,\mathbf{W}^{(i)}\big)$   (7)
where $\tilde{\mathbf{A}} = \mathbf{A} + \mathbf{I}$ and $\tilde{\mathbf{D}}$ is its diagonal degree matrix. Note that in the previous equation, we compute the representation of all the nodes in the graph at once. The difference between eq. (7) and eq. (6) is the use of a different propagation scheme for the nodes' representations: eq. (6) is based on the normalized graph Laplacian, while eq. (7) is based on the random-walk graph Laplacian. In [17], the authors state that the choice of normalization does not significantly affect the results. In fact, both equations can be seen as first-order approximations of the polynomially parameterized spectral graph convolution. In [17], three graph convolutional layers are stacked. The graph convolutions are followed by a concatenation layer that merges the representations computed by each graph convolutional layer. Then, differently from previous approaches, the paper introduces a sort-pooling layer, that selects a fixed number of node representations, and computes the output from them by stacking 1D convolutional layers and dense layers. This is the network architecture that we also consider in this paper.
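For comparison, a minimal sketch of the layer in eq. (7) (again our own illustration, assuming dense matrices) differs from the previous one only in the normalization:

```python
import numpy as np

def dgcnn_conv(A, H, W, activation=np.tanh):
    """Graph convolution in the style of eq. (7): sigma(D~^{-1} (A + I) H W),
    i.e. random-walk instead of symmetric normalization."""
    A_tilde = A + np.eye(A.shape[0])
    D_inv = np.diag(1.0 / A_tilde.sum(axis=1))
    return activation(D_inv @ A_tilde @ H @ W)
```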
III-A. SortPooling layer
After stacking several graph convolution layers, we need a mechanism to predict the target for the whole graph starting from its node encodings. Ideally, this mechanism should be applicable to graphs with a variable number of vertices. Instead of averaging the node representations, [17] proposes to solve this issue with the SortPooling layer.
Let us assume that the encoding of each node produced by the $h$-th graph convolution layer has $c_h$ channels. Let us consider the output of the last graph convolution (or concatenation) layer to be $\mathbf{Z} \in \mathbb{R}^{n \times c}$, where each row is a vertex's feature descriptor and each column is a feature channel. The output of the SortPooling layer is a $k \times c$ tensor, where $k$ is a user-defined integer.
In the SortPooling layer, the rows of $\mathbf{Z}$ are sorted lexicographically (possibly starting from the last column). We can see the output of the graph convolutional layers as continuous WL colors, and thus we are sorting all the vertices according to these colors. This way, a consistent ordering is imposed on the graph vertices, making it possible to train traditional neural networks on the sorted graph representations.
In addition to sorting vertex features in a consistent order, the other function of SortPooling is to unify the sizes of the output tensors. After sorting, we truncate or extend the output tensor in the first dimension from $n$ to $k$. The intent is to unify graph sizes, so that graphs with different numbers of vertices all obtain representations of size $k$. The unification is done by deleting the last $n - k$ rows if $n > k$, or by adding $k - n$ zero rows if $n < k$.
Note that if two vertices have the same hidden representation, it does not matter which of them is ranked first, since the output of the SortPooling layer is exactly the same.
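The following sketch (our own, assuming a NumPy matrix Z of node descriptors; the descending sort direction is one possible convention) summarizes the SortPooling operation:

```python
import numpy as np

def sort_pooling(Z, k):
    """SortPooling sketch: sort the rows of Z lexicographically, with the last
    feature channel as the primary key, then truncate or zero-pad to k rows."""
    order = np.lexsort(Z.T)[::-1]        # np.lexsort uses the last key (Z[:, -1]) first;
    Z_sorted = Z[order]                  # [::-1] yields descending order (a convention choice)
    n, c = Z_sorted.shape
    if n >= k:
        return Z_sorted[:k]              # delete the last n - k rows
    pad = np.zeros((k - n, c), dtype=Z.dtype)
    return np.vstack([Z_sorted, pad])    # add k - n zero rows
```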
IV. Parametric Graph Convolutions
A straightforward generalization of eq. (7) would be defined on the powers of the adjacency matrix, i.e. on random walks [18]. This would introduce tottering in the learned representation, which is generally not considered beneficial. We decided to follow another approach, based on shortest paths. As mentioned before, the adjacency matrix of a graph can be seen as the matrix of the shortest paths of length one, i.e.
$a_{uv} = 1 \iff sp(u, v) = 1$   (8)
Moreover, the identity matrix $\mathbf{I}$ is the matrix of the shortest paths of length zero (assuming that each node is at distance zero from itself), i.e. $i_{vv} = 1 \iff sp(v, v) = 0$. Moreover, note that $\tilde{\mathbf{A}} = \mathbf{A} + \mathbf{I}$.
By means of this new notation, we can rewrite eq. (7) as:
$\mathbf{H}^{(i+1)} = \sigma\big(\tilde{\mathbf{D}}^{-1}\,(\mathbf{I} + \mathbf{A})\,\mathbf{H}^{(i)}\,\mathbf{W}^{(i)}\big)$   (9)
Let us now define $\mathbf{SP}^{(j)}$ as the matrix of the shortest paths of length $j$, i.e. $sp^{(j)}_{uv} = 1$ if $sp(u, v) = j$, and 0 otherwise (so $\mathbf{SP}^{(0)} = \mathbf{I}$ and $\mathbf{SP}^{(1)} = \mathbf{A}$). We can now extend our reasoning and define our graph convolution layer, parameterized by $r$. In our contribution, we decided to process the information in a slightly different way with respect to eq. (9). Instead of summing the contributions of the $\mathbf{SP}^{(j)}$ matrices, we decided to keep the contributions of the nodes at different shortest-path distances separated. This is equivalent to the definition of multiple graph convolutional filters, one for each shortest-path distance. We define the Parametric Graph Convolution as:
$\mathbf{H}^{(i+1)} = \big\Vert_{j=0}^{r}\; \sigma\big(\tilde{\mathbf{D}}_j^{-1}\,\mathbf{SP}^{(j)}\,\mathbf{H}^{(i)}\,\mathbf{W}_j^{(i)}\big)$   (10)
where $\Vert$ is the vertical concatenation of vectors (for the matrices above, the concatenation along the feature dimension) and $\tilde{\mathbf{D}}_j$ is the diagonal degree matrix of $\mathbf{SP}^{(j)}$. Note that with our formulation, we have a different weight matrix $\mathbf{W}^{(i)}_j$ for each layer $i$ and for each shortest-path distance $j$. Moreover, as mentioned before, we are concatenating the information and not summing it, explicitly keeping the contributions of the different distances separated. This approach follows the network-in-network idea [19]. In our case, at each layer, we are effectively applying, at the same time, $r + 1$ convolutions (one for each shortest-path distance up to $r$) and concatenating their outputs. If we fix a parameter controlling the number of filters per distance for the layer, say $c$, and a value for the hyper-parameter $r$, then the layer outputs $c \cdot (r + 1)$ channels.
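A minimal sketch of the proposed parametric graph convolution of eq. (10) follows (our own illustration; the per-distance degree normalization shown here is an assumption made for consistency with eq. (7)):

```python
import numpy as np

def parametric_graph_conv(SP, H, Ws, activation=np.tanh):
    """Parametric graph convolution sketch (eq. (10)).

    SP : list of r + 1 matrices, SP[j][u, v] = 1 iff sp(u, v) == j
         (SP[0] is the identity matrix).
    H  : current node representations, shape (n, c_in).
    Ws : list of r + 1 weight matrices, one per shortest-path distance.
    """
    outputs = []
    for SP_j, W_j in zip(SP, Ws):
        d = SP_j.sum(axis=1)
        d[d == 0] = 1.0                      # nodes with no neighbor at distance j
        D_inv = np.diag(1.0 / d)             # per-distance normalization (assumed)
        outputs.append(activation(D_inv @ SP_j @ H @ W_j))
    # concatenate, rather than sum, the r + 1 per-distance filters
    return np.concatenate(outputs, axis=1)
```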
IV-A. Receptive field
It has been shown in [16, 17] that with the standard definition of graph convolution, e.g. the ones in eq. (6) and eq. (7), the receptive field of a graph convolutional filter at layer $l$ centered on the vertex $v$ is the set of vertices at shortest-path distance at most $l$ from $v$. This draws an interesting parallel with the Weisfeiler-Lehman graph kernel (see Section V-A), where intuitively the number of WL iterations is equivalent to the number of stacked graph convolution layers in the architecture.
In our proposed parametric graph convolution in eq. (10), the parameter $r$ directly influences the neighborhood considered by the graph convolutional filter (and the number of output channels, since we concatenate the outputs of the $r + 1$ convolutions). It is easy to see that, by definition, the receptive field of a graph convolutional filter parameterized by $r$ and applied to the vertex $v$ includes all the nodes at shortest-path distance at most $r$ from $v$. When we stack multiple layers of our parametric graph convolution, the receptive field grows in the same way: the receptive field of a parametric graph convolutional filter of size $r$ at layer $l$ applied to the vertex $v$ then includes all the vertices at shortest-path distance at most $l \cdot r$ from $v$.
IV-B. Computational complexity
Equation (10) requires computing the all-pairs shortest paths up to a fixed length $r$. While computing the unbounded all-pairs shortest paths for a graph with $n$ nodes requires $O(n^3)$ time, if the maximum length $r$ is small enough it is possible to implement the computation with one depth-limited breadth-first visit starting from each node, with an overall complexity of $O(n(n + m))$, where $m$ is the number of edges in the graph.
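A possible implementation of the bounded shortest-path computation with one depth-limited breadth-first visit per node is sketched below (our own code, assuming a dense adjacency matrix):

```python
from collections import deque
import numpy as np

def shortest_path_matrices(A, r):
    """Compute SP[0..r], where SP[j][u, v] = 1 iff sp(u, v) == j, using one
    depth-limited breadth-first visit per node."""
    n = A.shape[0]
    SP = [np.zeros((n, n), dtype=np.float32) for _ in range(r + 1)]
    for s in range(n):
        SP[0][s, s] = 1.0
        dist = {s: 0}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            if dist[u] == r:                 # depth limit reached
                continue
            for v in np.nonzero(A[u])[0]:
                if v not in dist:            # first visit = shortest-path distance
                    dist[v] = dist[u] + 1
                    SP[dist[v]][s, v] = 1.0
                    queue.append(v)
    return SP
```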
V. Related works
Besides the approaches based on graph convolutions presented in Section III, there are other methods in the literature to process graphs with neural networks.
For instance, [20] defined an attention mechanism to propagate information between the nodes in a graph. The basic idea is the definition of an external network that, given two neighboring nodes, outputs an attention weight for that specific edge. A shared attentive mechanism computes the attention coefficients
$e_{ij} = a\big(H_i\mathbf{W}, H_j\mathbf{W}\big)$   (11)
that indicate the importance of node $j$'s features to node $i$. Here, $a$ is a parametric function, which in the original paper is a single-layer feed-forward network parameterized by a weight vector. The information about the graph structure is injected into the mechanism by performing masked attention, i.e. $e_{ij}$ is only computed for nodes $j \in \mathcal{N}(i)$. To make the coefficients easily comparable across different nodes, a softmax function is used:
$\alpha_{ij} = \mathrm{softmax}_j(e_{ij}) = \frac{\exp(e_{ij})}{\sum_{k \in \mathcal{N}(i)} \exp(e_{ik})}$   (12)
Once obtained, the normalized attention coefficients are used to compute a linear combination of the corresponding features, which serves as the final output features for every node (after potentially applying a point-wise non-linearity, $\sigma$):
$H'_i = \sigma\Big(\sum_{j \in \mathcal{N}(i)} \alpha_{ij}\, H_j\mathbf{W}\Big)$   (13)
To stabilize the learning process of self-attention, the authors propose extending the mechanism to employ multi-head attention (several independent attention weights per edge). For the last layer, the authors employ averaging over the heads, and delay applying the final non-linearity (usually a softmax or logistic sigmoid for classification problems) until then.
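The following single-head sketch (our own illustration; the LeakyReLU attention non-linearity follows the original paper, while the tanh output activation is an assumption) summarizes eqs. (11)-(13):

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def gat_layer(A, H, W, a, activation=np.tanh):
    """Single-head graph attention sketch (eqs. (11)-(13)).
    A: adjacency (n x n), H: node features (n x d), W: weights (d x d'),
    a: attention vector of size 2 * d'."""
    HW = H @ W
    H_out = np.zeros_like(HW)
    for i in range(A.shape[0]):
        neigh = np.nonzero(A[i])[0]
        if len(neigh) == 0:
            continue
        # eq. (11): unnormalized scores, masked to the neighborhood of i
        e = np.array([leaky_relu(a @ np.concatenate([HW[i], HW[j]])) for j in neigh])
        alpha = softmax(e)                          # eq. (12)
        H_out[i] = activation(alpha @ HW[neigh])    # eq. (13)
    return H_out
```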
This technique has been applied to node classification only, and its complexity (due to implementation issues) is high. In principle, the same approach as in NN4G can be adopted to generate graph-level representations and predictions for this model.
[21] (PSCN) proposes another interpretation of graph convolution. Given a graph, it first selects the nodes on which the convolutional filter has to be centered. Then, it selects a fixed number of vertices from each neighborhood and infers an order on them. This ordering constraint limits the flexibility of the approach, because learning a consistent order is difficult and the number of nodes in the convolutional filter has to be fixed a priori.
Diffusion CNN (DCNN) [22] is based on the principle of heat diffusion (on graphs). The idea is to map from nodes and their labels to the result of a diffusion process that begins at that node.
TABLE I: Summary of the datasets used in the experiments.

Dataset       MUTAG   PTC     NCI1    PROTEINS   DD       COLLAB   IMDB-B   IMDB-M
Nodes (Max)   28      109     111     620        5748     492      136      89
Nodes (Avg)   17.93   25.56   29.87   39.06      284.32   74.49    19.77    13.00
Graphs        188     344     4110    1113       1178     5000     1000     1500

V-A. Graph Kernels
Kernel methods define the model as a linear classifier in a Reproducing Kernel Hilbert Space, that is, the space implicitly defined by a kernel function $k(\cdot, \cdot)$. The SVM is the most popular kernelized learning algorithm; it defines the solution as the maximum-margin hyperplane.
Kernel functions can be defined for many objects, and in particular for graphs. Many graph kernels have been defined in the literature. For instance, Random Walk kernels are based on the number of common random walks in two graphs [23, 2] and can be computed efficiently in closed form. More recent proposals focus on more complex structures and allow the feature mapping $\phi$ to be represented explicitly, with computational benefits. Among others, kernels have been defined considering graphlets [24], shortest paths [25], subtrees [26, 27] and subtree-walks [28, 29]. For instance, the Weisfeiler-Lehman subtree kernel (WL) defines its features as rooted subtree-walks, i.e., subtrees in which nodes can appear multiple times, up to a user-defined maximum height (maximum number of iterations).
Propagation kernels (PK) [30] follow a different idea, inspired by the diffusion process in graph node kernels (i.e. kernels between nodes in a single graph), of propagating the node label information through the edges in a graph. Then, for each node, a distribution over the propagated labels is computed. Finally, the kernel between two graphs compares such distributions over all the nodes in the two graphs.
While exhibiting state-of-the-art performance on many graph datasets, the main problem of graph kernels is that they define a fixed representation, which is not task-dependent and can in principle limit the predictive performance of the method. Deep graph kernels (DGK) [31] propose an approach to alleviate this problem. Let us fix a base kernel $k$ and its explicit feature representation $\phi$, i.e. $k(G, G') = \phi(G)^\top \phi(G')$. Then a deep graph kernel can be defined as:
$k_D(G, G') = \phi(G)^\top \mathbf{M}\, \phi(G')$,
where $\mathbf{M}$ is a matrix of parameters that has to be learned, possibly including target information.
VI. Experiments
In this section, we aim at evaluating the performance of the proposed method and comparing it with many existing graph kernels and deep learning approaches for graphs. We pay special attention to the comparison between our method and DGCNN, to see whether the proposed generalization helps to improve the predictive performance. To this end, various experiments are conducted in two settings, following the experimental procedure used in [17], on eight graph datasets (see Table I for a summary). The code for our experiments is available online at https://github.com/dinhinfotech/PGCDGCNN.
In the first setting, we compare the performance of our method with DGCNN and state-of-the-art graph kernels: the graphlet kernel (GK) [1], the random walk kernel (RW) [2], the propagation kernel (PK) [30], and the Weisfeiler-Lehman subtree kernel (WL) [32]. We do not include other state-of-the-art graph kernels such as NSPDK [33] and ODD [26] because their performance does not differ much from the considered ones, and it is beyond the scope of this paper to extensively compare the graph kernels in the literature. In this setting, five datasets containing biological node-labeled graphs are employed, namely MUTAG [34], PTC [35], NCI1 [36], PROTEINS, and DD [37]. In the first three datasets, each graph represents a chemical compound, where nodes are labeled with the atom type and edges represent bonds between them. MUTAG is a dataset of aromatic and heteroaromatic nitro compounds, where the task is to predict their mutagenic effect on a bacterium. In PTC, the task is to predict the carcinogenicity of chemical compounds for male and female rats. NCI1 contains anti-cancer screens for lung cancer cells. In PROTEINS and DD, each graph represents a protein. The nodes are labeled according to the amino-acid type. The proteins are classified into two classes: enzymes and non-enzymes.
In the second setting, we evaluate the performance of the proposed method and DGCNN along with other deep learning approaches for graphs: PATCHY-SAN (PSCN) [21], Diffusion CNN (DCNN) [22], ECC [15] and the Deep Graphlet Kernel (DGK) [31]. In this setting, three biological datasets (NCI1, PROTEINS and DD) and three social network datasets from [31] (COLLAB, IMDB-B and IMDB-M) are used. COLLAB is a dataset of scientific collaborations, where ego-networks are generated for researchers and are classified into three research fields. IMDB-B (binary) is a movie collaboration dataset where ego-networks of actors/actresses are classified into the action or romance genre. IMDB-M is a multi-class version of IMDB-B, containing the genres comedy, romance, and sci-fi.
In this setting, we eliminate MUTAG and PTC since they have a small number of examples which easily causes overfitting problems for deep learning approaches.
TABLE II: Comparison with graph kernels (first experimental setting): average accuracy ± standard deviation.

Dataset            MUTAG        PTC          NCI1         PROTEINS     DD
GK                 81.39±1.74   55.65±0.46   62.49±0.27   71.39±0.31   74.38±0.69
RW                 79.17±2.07   55.91±0.32   > 3 days     59.57±0.09   > 3 days
PK                 76.00±2.69   59.50±2.44   82.54±0.47   73.68±0.68   78.25±0.51
WL                 84.11±1.91   57.97±2.49   84.46±0.45   74.68±0.49   78.34±0.62
DGCNN              85.83±1.66   58.59±2.47   74.44±0.47   75.54±0.94   79.37±0.94
PGC-DGCNN (r=2)    87.22±1.43   61.06±1.83   76.13±0.73   76.45±1.02   78.93±0.91

TABLE III: Comparison with other deep learning approaches (second experimental setting): average accuracy ± standard deviation.

Dataset      NCI1         PROTEINS     DD           COLLAB       IMDB-B       IMDB-M
PSCN         76.34±1.68   75.00±2.51   76.27±2.64   72.60±2.15   71.00±2.29   45.23±2.84
DCNN         56.61±1.04   61.29±1.60   58.09±0.53   52.11±0.71   49.06±1.37   33.49±1.42
ECC          76.82        –            72.54        –            –            –
DGK          62.48±0.25   71.68±0.50   –            73.09±0.25   66.96±0.56   44.55±0.52
DGCNN        74.44±0.47   75.54±0.94   79.37±0.94   73.76±0.49   70.03±0.86   47.83±0.85
PGC-DGCNN    76.13±0.73   76.45±1.02   78.93±0.91   75.00±0.58   71.62±1.22   47.25±1.44
Evaluation method and model selection
To evaluate the different methods, a nested 10-fold cross-validation is employed, i.e., one fold for testing and 9 folds for training, of which one is used as a validation set for model selection. For each dataset, we repeated each experiment 10 times and report the average accuracy over the 100 resulting folds. To select the best model, the hyper-parameter values of the different kernels are set as follows: the height of WL and PK is selected from a set of predefined values, the bin width of PK is set to 0.001, the size of the graphlets in GK to 3, and the decay of RW to the largest power of 10 that is smaller than the reciprocal of the squared maximum node degree. Note that some of our results are reported from [17].
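As an illustration of this protocol, the following sketch (our own; the train_and_evaluate routine is a hypothetical placeholder, and scikit-learn is used only for the stratified splits) implements the nested 10-fold cross-validation with an inner validation fold for model selection:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def nested_cv_accuracy(graphs, targets, hyperparams, train_and_evaluate, repeats=10):
    """Nested 10-fold CV: one fold for testing, 9 for training, of which one
    inner fold is held out as a validation set for hyper-parameter selection.
    `train_and_evaluate(train_G, train_y, test_G, test_y, params)` is a
    hypothetical routine returning the accuracy of the trained model."""
    graphs, targets = np.asarray(graphs, dtype=object), np.asarray(targets)
    accuracies = []
    for rep in range(repeats):
        outer = StratifiedKFold(n_splits=10, shuffle=True, random_state=rep)
        for train_idx, test_idx in outer.split(graphs, targets):
            inner = StratifiedKFold(n_splits=9, shuffle=True, random_state=rep)
            fit, val = next(inner.split(graphs[train_idx], targets[train_idx]))
            fit_idx, val_idx = train_idx[fit], train_idx[val]
            # model selection on the validation fold
            best = max(hyperparams, key=lambda p: train_and_evaluate(
                graphs[fit_idx], targets[fit_idx], graphs[val_idx], targets[val_idx], p))
            # retrain on the full training split, evaluate on the test fold
            accuracies.append(train_and_evaluate(
                graphs[train_idx], targets[train_idx], graphs[test_idx], targets[test_idx], best))
    return np.mean(accuracies), np.std(accuracies)
```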
Network architecture
We employ the network architecture used in [17] to allow a fair comparison with DGCNN. The network consists of three graph convolution layers, a concatenation layer, and a SortPooling layer, followed by two 1D convolutional layers and one dense layer. The activation function for the graph convolutions is the hyperbolic tangent, while the 1D convolutions and the dense layer use rectified linear units. Note that our proposal, as presented in Section IV, is a generalization of DGCNN. In other words, DGCNN is very similar to a special case of our method where just the neighbors at shortest-path distance one are considered. On the contrary, our proposal treats the distance as a hyper-parameter $r$, allowing to flexibly capture the local structures associated to graph nodes. In this section, we set $r$ equal to 2 as a first attempt, and plan to explore neighborhoods with nodes at a higher distance as future work.
VI-A. Experimental Results
Tables II and III show the performance of the various methods in the first and second settings, respectively. Overall, DGCNN and our proposed method outperform the compared kernels and deep learning methods on most datasets.
As can be seen from Table II, DGCNN and the proposed method (PGC-DGCNN) achieve higher performance on four out of five datasets, with an improvement ranging from 1.03% to 3.11% with respect to the best-performing kernel. Compared to the RW kernel, our proposed method achieves the largest improvements on MUTAG and PROTEINS, of about 8% and 17%, respectively. Concerning PK and WL, which are similar in spirit to DGCNN and our method as shown in [17], DGCNN and PGC-DGCNN show higher performance in most cases, with a larger gap with respect to PK. It is worth noticing, when comparing with PK and WL, that their optimal models in each experiment are selected by tuning the height parameter over a range of predefined values. Instead, DGCNN and our method are evaluated with a fixed number of layers only. This indicates that the performance of DGCNN and the proposed method could be even higher if we validated the number of stacked graph convolutional layers.
Regarding the performance of the various deep learning methods reported in Table III, our method and DGCNN obtain the highest results in five out of six cases; the exception is NCI1, where they show marginally lower results. Compared to DCNN, DGCNN and our method achieve dramatically higher accuracies, with improvements ranging from around 14% up to 21%.
We now turn to the difference between the performance of DGCNN and that of our proposal. It can be seen from Tables II and III that our method performs better than DGCNN on the majority of the datasets. In particular, PGC-DGCNN outperforms DGCNN in six out of eight cases, with a consistent improvement of about 1% to 2%. On DD and IMDB-M, the accuracy of our method is slightly lower than that of DGCNN; however, these declines are only marginal. The general improvement of our method over DGCNN can be explained by the fact that our method parameterizes the graph convolutions, making it a generalization of DGCNN (we recall that we fix the neighborhood distance $r = 2$). In this case, our method captures more information about the local graph structure associated to each node compared to DGCNN, which considers just the direct neighbors, i.e. $r = 1$. It is worth noticing that (1) we use a single value of $r$ to build our model, while, in general, we could choose an optimal model by tuning $r$ over a range of values; and (2) we use the architecture proposed in [17], meaning that we have not tried to optimize the network architecture. Therefore, the performance of our method could improve further if we optimized the distance parameter $r$ and the number of graph convolutional layers, together with the rest of the architecture.
VII. Conclusions and Future Works
In this paper, we presented a new definition of graph convolutional filter. It generalizes the most commonly adopted filter, adding a hyper-parameter that controls the distance of the considered neighborhood. Experimental results show that our proposed filter improves the predictive performance of Deep Graph Convolutional Neural Networks on many real-world datasets.
In the future, we plan to analyze in more depth the impact of the filter size in graph convolutional networks. We will define 1D convolutions as special cases of graph convolutions, and we will explore fully graph-convolutional neural architectures that avoid fully-connected layers and possibly stack more graph convolution layers. Moreover, we will explore the impact of different activation functions for the graph convolutions in such a setting. Finally, we plan to enhance the input graph representation by associating to each node the explicit features extracted by graph kernels.
Acknowledgment
This project was funded, in part, by the Department of Mathematics, University of Padova, under the DEEP project and DFG project, BA 2168/33.
References
 [1] N. Shervashidze, S. Vishwanathan, T. Petri, K. Mehlhorn, and K. Borgwardt, “Efficient graphlet kernels for large graph comparison,” in Artificial Intelligence and Statistics, 2009, pp. 488–495.
 [2] S. V. N. Vishwanathan, N. N. Schraudolph, R. Kondor, and K. M. Borgwardt, “Graph kernels,” Journal of Machine Learning Research, vol. 11, no. Apr, pp. 1201–1242, 2010.
 [3] G. Da San Martino, N. Navarin, and A. Sperduti, “A treebased kernel for graphs,” in Proceedings of the Twelfth SIAM International Conference on Data Mining, Anaheim, California, USA, April 2628, 2012., 2012, pp. 975–986. [Online]. Available: https://doi.org/10.1137/1.9781611972825.84
 [4] L. van der Maaten, “Learning discriminative fisher kernels,” in Proceedings of the 28th International Conference on Machine Learning, ICML 2011, Bellevue, Washington, USA, June 28  July 2, 2011, 2011, pp. 217–224.
 [5] F. Aiolli, G. Da San Martino, M. Hagenbuchner, and A. Sperduti, “Learning nonsparse kernels by selforganizing maps for structured data,” IEEE Trans. Neural Networks, vol. 20, no. 12, pp. 1938–1949, 2009. [Online]. Available: https://doi.org/10.1109/TNN.2009.2033473
 [6] D. Bacciu, A. Micheli, and A. Sperduti, “Generative kernels for treestructured data,” IEEE Trans. Neural Netw. Learning Syst., vol. early access, 2018. [Online]. Available: https://ieeexplore.ieee.org/document/8259316/
 [7] W. L. Hamilton, R. Ying, and J. Leskovec, “Representation learning on graphs: Methods and applications,” CoRR, vol. abs/1709.05584, 2017. [Online]. Available: http://arxiv.org/abs/1709.05584
 [8] A. Micheli, “Neural network for graphs: A contextual constructive approach,” IEEE Transactions on Neural Networks, vol. 20, no. 3, pp. 498–511, 2009.
 [9] F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini, “The Graph Neural Network Model,” IEEE Transactions on Neural Networks, vol. 20, no. 1, pp. 61–80, 2009. [Online]. Available: http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=4700287
 [10] A. Sperduti and A. Starita, “Supervised neural networks for the classification of structures,” IEEE Trans. Neural Networks, vol. 8, no. 3, pp. 714–735, 1997. [Online]. Available: https://doi.org/10.1109/72.572108
 [11] Y. Li, D. Tarlow, M. Brockschmidt, and R. Zemel, “Gated Graph Sequence Neural Networks,” in ICLR, 2016. [Online]. Available: http://arxiv.org/abs/1511.05493
 [12] X. Bresson and T. Laurent, “An Experimental Study of Neural Networks for Variable Graphs,” in ICLR 2018 Workshop, 2018.
 [13] Q. Liao and T. Poggio, “Bridging the Gaps Between Residual Learning, Recurrent Neural Networks and Visual Cortex,” arXiv preprint, 2016. [Online]. Available: http://arxiv.org/abs/1604.03640
 [14] D. Duvenaud, D. Maclaurin, J. AguileraIparraguirre, R. GómezBombarelli, T. Hirzel, A. AspuruGuzik, and R. P. Adams, “Convolutional networks on graphs for learning molecular fingerprints,” in Advances in Neural Information Processing Systems, Montreal, Canada, 2015, pp. 2215–2223.
 [15] M. Simonovsky and N. Komodakis, “Dynamic edgeconditioned filters in convolutional neural networks on graphs,” in CVPR, 2017.
 [16] T. N. Kipf and M. Welling, “SemiSupervised Classification with Graph Convolutional Networks,” in ICLR, 2017, pp. 1–14. [Online]. Available: http://arxiv.org/abs/1609.02907
 [17] M. Zhang, Z. Cui, M. Neumann, and Y. Chen, “An EndtoEnd Deep Learning Architecture for Graph Classification,” in AAAI Conference on Artificial Intelligence, 2018.
 [18] S. AbuElHaija, A. Kapoor, B. Perozzi, and J. Lee, “NGCN: Multiscale Graph Convolution for Semisupervised Node Classification,” in Proceedings of the 14th International Workshop on Mining and Learning with Graphs (MLG), 2018. [Online]. Available: http://arxiv.org/abs/1802.08888
 [19] C. Szegedy, Wei Liu, Yangqing Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, jun 2015, pp. 1–9. [Online]. Available: http://ieeexplore.ieee.org/document/7298594/
 [20] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio, “Graph Attention Networks,” in ICLR, 2018. [Online]. Available: http://arxiv.org/abs/1710.10903
 [21] M. Niepert, M. Ahmed, and K. Kutzkov, “Learning convolutional neural networks for graphs,” in International conference on machine learning, 2016, pp. 2014–2023.
 [22] J. Atwood and D. Towsley, “Diffusionconvolutional neural networks,” in Advances in Neural Information Processing Systems, 2016, pp. 1993–2001.
 [23] T. Gärtner, P. Flach, and S. Wrobel, “On Graph Kernels: Hardness Results and Efficient Alternatives,” in Proceedings of the 16th Annual Conference on Computational Learning Theory and 7th Kernel Workshop, ser. Lecture Notes in Computer Science, B. Schölkopf and M. K. Warmuth, Eds., vol. 2777. Berlin, Heidelberg: Springer, 2003, pp. 129–143. [Online]. Available: http://link.springer.com/10.1007/b12006
 [24] N. Shervashidze, S. V. N. Vishwanathan, T. Petri, K. Mehlhorn, and K. M. Borgwardt, “Efficient graphlet kernels for large graph comparison,” in AISTATS, vol. 5, Clearwater Beach, Florida, USA, 2009, pp. 488–495.
 [25] K. Borgwardt and H.P. Kriegel, “ShortestPath Kernels on Graphs,” in ICDM. Los Alamitos, CA, USA: IEEE, 2005, pp. 74–81. [Online]. Available: http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=1565664
 [26] G. Da San Martino, N. Navarin, and A. Sperduti, “Ordered Decompositional DAG Kernels Enhancements,” Neurocomputing, vol. 192, pp. 92–103, 2016.
 [27] ——, “A TreeBased Kernel for Graphs,” in Proceedings of the Twelfth SIAM International Conference on Data Mining, 2012, pp. 975–986.
 [28] ——, “Graph Kernels Exploiting Weisfeiler-Lehman Graph Isomorphism Test Extensions,” in Neural Information Processing, vol. 8835, 2014, pp. 93–100. [Online]. Available: http://link.springer.com/10.1007/978-3-319-12640-1_12
 [29] N. Shervashidze, P. Schweitzer, E. J. van Leeuwen, K. Mehlhorn, and K. M. Borgwardt, “Weisfeiler-Lehman Graph Kernels,” JMLR, vol. 12, pp. 2539–2561, 2011.
 [30] M. Neumann, N. Patricia, R. Garnett, and K. Kersting, “Efficient graph kernels by randomization,” in Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, 2012, pp. 378–393.
 [31] P. Yanardag and S. Vishwanathan, “Deep graph kernels,” in Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2015, pp. 1365–1374.
 [32] N. Shervashidze, P. Schweitzer, E. J. van Leeuwen, K. Mehlhorn, and K. M. Borgwardt, “Weisfeiler-Lehman graph kernels,” Journal of Machine Learning Research, vol. 12, no. Sep, pp. 2539–2561, 2011.
 [33] F. Costa and K. De Grave, “Fast neighborhood subgraph pairwise distance kernel,” in ICML. Omnipress, 2010, pp. 255–262.
 [34] A. K. Debnath, R. L. Lopez de Compadre, G. Debnath, A. J. Shusterman, and C. Hansch, “Structureactivity relationship of mutagenic aromatic and heteroaromatic nitro compounds. Correlation with molecular orbital energies and hydrophobicity,” Journal of Medicinal Chemistry, vol. 34, no. 2, pp. 786–797, feb 1991. [Online]. Available: http://pubs.acs.org/doi/abs/10.1021/jm00106a046
 [35] H. Toivonen, A. Srinivasan, R. D. King, S. Kramer, and C. Helma, “Statistical evaluation of the predictive toxicology challenge 2000–2001,” Bioinformatics, 2003.
 [36] N. Wale, I. Watson, and G. Karypis, “Comparison of descriptor spaces for chemical compound retrieval and classification,” Knowledge and Information Systems, vol. 14, no. 3, pp. 347–375, 2008.
 [37] P. D. Dobson and A. J. Doig, “Distinguishing Enzyme Structures from Nonenzymes Without Alignments,” Journal of Molecular Biology, vol. 330, no. 4, pp. 771–783, 2003.