On Filter Size in Graph Convolutional Networks

Dinh V. Tran¹,², Nicolò Navarin¹,³, Alessandro Sperduti¹
¹ Department of Mathematics, University of Padova, Italy
{dinh, nnavarin, sperduti}@math.unipd.it
² Bioinformatics Group, Department of Computer Science, University of Freiburg, Germany
dinh@informatik.uni-freiburg.de
³ School of Computer Science, University of Nottingham, United Kingdom
nicolo.navarin@nottingham.ac.uk
Abstract

Recently, many researchers have been focusing on the definition of neural networks for graphs. The basic component for many of these approaches remains the graph convolution idea proposed almost a decade ago. In this paper, we extend this basic component, following an intuition derived from the well-known convolutional filters over multi-dimensional tensors. In particular, we derive a simple, efficient and effective way to introduce a hyper-parameter on graph convolutions that influences the filter size, i.e. its receptive field over the considered graph. We show with experimental results on real-world graph datasets that the proposed graph convolutional filter improves the predictive performance of Deep Graph Convolutional Networks.

graphs, deep learning for graphs, graph convolution, convolutional neural networks for graphs.

I Introduction

Graphs are a common and natural way to represent many real-world data: in chemistry, for example, a compound can be represented by its molecular graph, while in social networks the relationships between users are represented as edges in a graph whose nodes are the users. Many computational tasks involving such representations require machine learning, such as the classification of active/non-active drugs or the prediction of a future link between two users of a social network. State-of-the-art machine learning techniques for classification and regression on graphs are, at the moment, kernel machines equipped with specifically designed kernels for graphs (e.g. [1, 2, 3]). Although there are examples of kernels for structures that can be designed on the basis of a training set [4, 5, 6], most of the more efficient and effective graph kernels are based on predefined structural features, i.e., feature definition is not part of the learning process.

There is a recent shift of trend from kernels to neural networks for graphs. Unlike kernels, neural networks define their features through a learning process that is supervised by the graphs' labels (targets). Many approaches have addressed the problem of defining neural networks for graphs [7]. However, one of the core components, the graph convolution, has not changed much with respect to the earlier works [8, 9].

In this paper, we re-design this basic component. We propose a new formulation of the graph convolution operator that is strictly more general than the existing one. Our proposal can in principle be applied to virtually all the techniques based on graph convolutions.

The paper is organized as follows. We start in Section II with some basic definitions and notation. In Section III, we provide an overview of the various proposals of graph convolution available in the literature. In Section IV we detail our proposed parametric graph convolutional filter. In Section V we discuss other related works that are not based on graph convolution, including some alternative graph neural network architectures and graph kernels. In Section VI we report our experimental results. Finally, Section VII concludes the paper.

II Notation and Definitions

We denote matrices with bold uppercase letters, vectors with uppercase letters, and variables with lowercase letters. Given a matrix $\mathbf{M}$, $M_i$ denotes the $i$-th row of the matrix, and $\mathbf{M}_{ij}$ is the element in the $i$-th row and $j$-th column. Given the vector $V$, $V_i$ refers to its $i$-th element.

Let us consider a graph $G=(V^G, E^G, \mathbf{X}^G)$, where $V^G=\{v_1,\ldots,v_n\}$ is the set of vertices (or nodes), $E^G \subseteq V^G \times V^G$ is the set of edges, and $\mathbf{X}^G \in \mathbb{R}^{n \times d}$ is a node label matrix, where each row $\mathbf{X}^G_v$ is the label (a vector of size $d$) associated to each vertex $v \in V^G$. Note that, in this paper, we will not consider edge labels. When the reference to the graph is clear from the context, for the sake of notation we discard the superscript referring to the specific graph. We define the adjacency matrix $\mathbf{A}$ as $\mathbf{A}_{ij}=1$ if $(v_i,v_j) \in E$, 0 otherwise. We also define the neighborhood of a vertex $v$ as the set of vertices connected to $v$ by an edge, i.e. $\mathcal{N}(v)=\{u \mid (v,u) \in E\}$. Note that $\mathcal{N}(v)$ is also the set of nodes at shortest path distance exactly one from $v$, i.e. $\mathcal{N}(v)=\{u \mid sp(v,u)=1\}$, where $sp(\cdot,\cdot)$ is a function computing the shortest-path distance between two nodes in a graph.
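As an illustration of the notation above, the following minimal sketch (plain numpy, with a hypothetical toy edge list) builds the adjacency matrix $\mathbf{A}$ of a small undirected graph and recovers the neighborhood of a vertex as the set of nodes at shortest-path distance one.

```python
import numpy as np

# Toy, hypothetical graph given as an undirected edge list over n = 4 vertices.
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (1, 3)]
n = 4

# Adjacency matrix A: A[i, j] = 1 iff (v_i, v_j) is an edge, 0 otherwise.
A = np.zeros((n, n), dtype=np.float32)
for i, j in edges:
    A[i, j] = A[j, i] = 1.0

def neighborhood(A, v):
    """N(v): the vertices connected to v by an edge, i.e. at shortest-path distance one."""
    return np.flatnonzero(A[v]).tolist()

print(neighborhood(A, 1))  # -> [0, 2, 3]
```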

In this paper, we consider the problem of graph classification. Given a dataset composed of pairs $\{(G_i, y_i)\}_{1 \leq i \leq N}$, where $G_i$ is a graph and $y_i$ its target class, the task is then, given an unseen graph $G$, to predict its correct target $y$.

III Graph Convolutions

The first definition of neural network for graphs was proposed in [10]. More recent models have been proposed in [8, 9]. Both works are based on an idea that was later re-branded as graph convolution.

The idea is to define the neural architecture following the topology of the graph. Then a transformation is performed from the neurons corresponding to a vertex and its neighborhood to a hidden representation, which is associated with the same vertex (possibly in another layer of the network). This transformation depends on some parameters, which are shared among all the nodes. In the following, for the sake of simplicity we ignore the bias terms.

In [9], when considering non-positional graphs, i.e. the most common definition, and the one we are considering in this paper, a transition function on a graph node $v$ at time $t$ is defined as:

$x_v^{(t)} = \sum_{u \in \mathcal{N}(v)} h_{\mathbf{W}}\big(\mathbf{X}_v, \mathbf{X}_u, x_u^{(t-1)}\big)$   (1)

where $h_{\mathbf{W}}$ is a parametric function whose parameters $\mathbf{W}$ have to be learned (e.g. a neural network) and are shared among all the vertices. Note that, if edge labels are available, they can be included in eq. (1). In fact, in the original formulation, $h_{\mathbf{W}}$ depends also on the label of the edge between $v$ and $u$. This transition function is part of a recurrent system. It is defined as a contraction mapping, thus the system is guaranteed to converge to a fixed point, i.e. a representation $x_v$ that does not depend on the particular initialization of the weight matrix $\mathbf{W}$. The output is computed from the last representation and the original node labels as follows:

$o_v = g_{\mathbf{W}'}\big(x_v, \mathbf{X}_v\big)$   (2)

where $g_{\mathbf{W}'}$ is another neural network. [11] extends the work in [9] by removing the constraint for the recurrent system to be a contraction mapping, and by replacing the recurrent units with GRUs. However, it has recently been shown in [12] that stacked graph convolutions are superior to graph recurrent architectures in terms of both accuracy and computational cost.
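As an illustration only, the recurrent scheme above can be sketched as follows. This is not the original model of [9]: the transition function is taken to be a single linear map followed by tanh, the function and argument names are ours, and the contraction property (which the original work enforces by construction) is simply assumed to hold for the given weights.

```python
import numpy as np

def gnn_fixed_point(A, X, W_label, W_state, tol=1e-6, max_iters=200):
    """Iterate a simple transition function until the node states stop changing.
    Each state combines the node's label (X @ W_label) with its neighbors'
    previous states (A @ H @ W_state); convergence to a fixed point is assumed
    to follow from the contraction property of the transition function."""
    n, s = A.shape[0], W_state.shape[1]
    H = np.zeros((n, s))
    for _ in range(max_iters):
        H_new = np.tanh(X @ W_label + A @ H @ W_state)
        if np.linalg.norm(H_new - H) < tol:
            break
        H = H_new
    return H
```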

In [8], a model referred to as Neural Network for Graphs (NN4G) is proposed. In the first layer, a transformation over node labels is computed:

$h_v^{(1)} = f\big(\bar{\mathbf{w}}^{(1)} \cdot \mathbf{X}_v\big)$   (3)

where $\bar{\mathbf{w}}^{(1)}$ are the weights connecting the original labels to the current neuron, and $v$ is the vertex index. The graph convolution is then defined for the $i$-th layer (for $i > 1$) as:

$h_v^{(i)} = f\Big(\bar{\mathbf{w}}^{(i)} \cdot \mathbf{X}_v + \sum_{l=1}^{i-1} \sum_{u \in \mathcal{N}(v)} \hat{w}^{(l,i)}\, h_u^{(l)}\Big)$   (4)

where $\hat{w}^{(l,i)}$ are the weights connecting the previous hidden layers to the current neuron (shared among vertices). Note that in this formulation, skip connections are present, to the $i$-th layer, from layer $1$ to layer $i-1$. There is an interesting recent work about the parallel between skip connections (residual networks in that case) and recurrent networks [13]. However, since in the formulation in eq. (4) every layer is connected to all the subsequent layers, it is not possible to reconduct it to a (vanilla) recurrent model. Let us consider the $i$-th graph convolutional layer, which comprises $c_i$ graph convolutional filters. We can rewrite eq. (4) for the whole layer as:

$\mathbf{H}^{(i)} = f\Big(\mathbf{X}\bar{\mathbf{W}}^{(i)} + \sum_{l=1}^{i-1} \mathbf{A}\mathbf{H}^{(l)}\hat{\mathbf{W}}^{(l,i)}\Big)$   (5)

where $1 \leq i \leq L$ (and $L$ is the number of layers), $\mathbf{H}^{(l)} \in \mathbb{R}^{n \times c_l}$, $\bar{\mathbf{W}}^{(i)} \in \mathbb{R}^{d \times c_i}$, $\hat{\mathbf{W}}^{(l,i)} \in \mathbb{R}^{c_l \times c_i}$, $c_i$ is the size of the hidden representation at the $i$-th layer, and $f$ is applied element-wise.
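A minimal sketch of one NN4G-style layer as written in eq. (5) is given below (plain numpy, with hypothetical argument names); it makes the skip connections explicit: the layer receives the original labels and the neighborhood-aggregated outputs of all previous layers. tanh is only a placeholder for the generic activation $f$.

```python
import numpy as np

def nn4g_layer(A, X, previous_H, W_x, W_skip):
    """One NN4G-style layer as in eq. (5): the i-th layer sees the original labels X
    plus the neighborhood-aggregated outputs of ALL previous layers (skip connections).
    previous_H is the list [H^(1), ..., H^(i-1)], W_skip the matching weight matrices."""
    out = X @ W_x                      # contribution of the original node labels
    for H_l, W_l in zip(previous_H, W_skip):
        out = out + A @ H_l @ W_l      # sum over the neighbors, for each previous layer
    return np.tanh(out)                # tanh as a placeholder for f
```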

Fig. 1: Graph convolution as described in [8], and adopted with some variations by many state-of-the-art Graph Convolutional neural networks.

An abstract representation of eq. (4) is depicted in Figure 1. The convolution in eq. (4) is part of a multi-layer architecture, where each layer’s connectivity resembles the topology of the graph, and the training is layer-wise. Finally, for each graph, NN4G computes the average graph node representation for each hidden layer, and concatenates them. This is the graph representation computed by NN4G, and it can be used for the final prediction of graph properties with a standard output layer.

In [14], a hierarchical approach has been proposed. This method is similar to NN4G and is inspired by circular fingerprints of chemical structures. While [8] adopts Cascade-Correlation for training, [14] uses end-to-end back-propagation. ECC [15] proposes an improvement of [14], weighting the sum over the neighbors of a node with weights conditioned on the edge labels. We consider this last version as a baseline in our experiments.

Recently, [16] derived a graph convolution that closely resembles eq. (4). Let us, from now on, consider $\mathbf{H}^{(0)} = \mathbf{X}$. Motivated by a first-order approximation of localized spectral filters on graphs, the proposed graph convolutional filter is defined as:

$\mathbf{H}^{(i)} = f\big(\tilde{\mathbf{D}}^{-\frac{1}{2}}\tilde{\mathbf{A}}\tilde{\mathbf{D}}^{-\frac{1}{2}}\mathbf{H}^{(i-1)}\mathbf{W}^{(i)}\big)$   (6)

where $\tilde{\mathbf{A}} = \mathbf{A} + \mathbf{I}$, $\tilde{\mathbf{D}}$ is the diagonal degree matrix of $\tilde{\mathbf{A}}$ (i.e. $\tilde{\mathbf{D}}_{ii} = \sum_j \tilde{\mathbf{A}}_{ij}$), and $f$ is any activation function applied element-wise.

If we ignore the $\tilde{\mathbf{D}}^{-\frac{1}{2}}$ terms (that in practice act as a normalization), it is easy to see that eq. (6) is very similar to eq. (5), the difference being that there are no skip connections in this case, i.e. the $i$-th layer is connected just to the $(i-1)$-th layer. Consequently, we just have to learn one weight matrix per layer.

In [17], a slightly more complex model compared to [16] is proposed. This model shows the highest predictive performance among the methods presented in this section. The first layers of the network are again stacked graph convolutional layers, defined as follows:

$\mathbf{H}^{(i)} = f\big(\tilde{\mathbf{D}}^{-1}\tilde{\mathbf{A}}\mathbf{H}^{(i-1)}\mathbf{W}^{(i)}\big)$   (7)

where $\tilde{\mathbf{A}} = \mathbf{A} + \mathbf{I}$ and $\tilde{\mathbf{D}}_{ii} = \sum_j \tilde{\mathbf{A}}_{ij}$. Note that in the previous equation, we compute the representation of all the nodes in the graph at once. The difference between eq. (7) and eq. (6) is the use of a different propagation scheme for the nodes' representations: eq. (6) is based on the normalized graph Laplacian, while eq. (7) is based on the random-walk graph Laplacian. In [17], the authors state that the choice of normalization does not significantly affect the results. In fact, both equations can be seen as first-order approximations of the polynomially parameterized spectral graph convolution. In [17], three graph convolutional layers are stacked. The graph convolutions are followed by a concatenation layer that merges the representations computed by each graph convolutional layer. Then, differently from previous approaches, the paper introduces a SortPooling layer, which selects a fixed number of node representations and computes the output from them by stacking 1D convolutional layers and dense layers. This is the same network architecture that we consider in this paper.
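The two propagation schemes of eqs. (6) and (7) can be sketched in a few lines of numpy. The function below is illustrative only: tanh stands in for the generic element-wise activation $f$, and the argument names are ours.

```python
import numpy as np

def graph_conv_layer(A, H, W, normalization="random_walk"):
    """One graph convolution layer. With normalization="symmetric" this sketches
    eq. (6) (the filter of [16]); with normalization="random_walk" it sketches
    eq. (7) (the filter of [17]). Self-loops are added in both cases."""
    A_tilde = A + np.eye(A.shape[0])              # \tilde{A} = A + I
    deg = A_tilde.sum(axis=1)                     # \tilde{D}_{ii}
    if normalization == "symmetric":
        D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
        P = D_inv_sqrt @ A_tilde @ D_inv_sqrt     # \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}
    else:
        P = np.diag(1.0 / deg) @ A_tilde          # \tilde{D}^{-1} \tilde{A}
    return np.tanh(P @ H @ W)                     # tanh as a placeholder for f
```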

III-A SortPooling Layer

After stacking several graph convolution layers, we need a mechanism to predict the target of a graph starting from its node encodings. Ideally, this mechanism should be applicable to graphs with a variable number of vertices. Instead of averaging the node representations, [17] proposes to solve this issue with the SortPooling layer.

Let us assume that the encoding of each node computed by the $i$-th graph convolution layer has size $c_i$. Let us consider the output of the last graph convolution (or concatenation) layer to be $\mathbf{Z} \in \mathbb{R}^{n \times c}$, where each row is a vertex's feature descriptor and each column is a feature channel. The output of the SortPooling layer is a $k \times c$ tensor, where $k$ is a user-defined integer.

In the SortPooling layer, the rows of $\mathbf{Z}$ are sorted lexicographically (starting from the last column). We can see the output of the graph convolutional layers as continuous WL colors, and thus we are sorting all the vertices according to these colors. This way, a consistent ordering is imposed on the graph vertices, making it possible to train traditional neural networks on the sorted graph representations.

In addition to sorting vertex features in a consistent order, the other function of SortPooling is to unify the sizes of the output tensors. After sorting, we truncate or extend the output tensor in the first dimension from $n$ to $k$, so that graphs with different numbers of vertices are mapped to representations of the same size. This is done by deleting the last $n-k$ rows if $n > k$, or by appending $k-n$ zero rows if $n < k$.

Note that if two vertices have the same hidden representation, it does not matter which of the two is picked first, because the output of the SortPooling layer would be exactly the same.
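A minimal numpy sketch of the SortPooling operation described above follows; it assumes the node-feature matrix Z fits in memory and uses np.lexsort with the last feature channel as the primary sorting key.

```python
import numpy as np

def sort_pooling(Z, k):
    """SortPooling sketch: sort the n x c node-feature matrix Z lexicographically,
    with the last feature channel as the primary sorting key, then truncate or
    zero-pad the first dimension to exactly k rows."""
    # np.lexsort uses the last key as the primary one, so passing the columns of Z
    # (the rows of Z.T) sorts by the last channel first, ties broken by earlier ones.
    order = np.lexsort(Z.T)[::-1]          # descending order of the (continuous) WL colors
    Z_sorted = Z[order]
    n, c = Z_sorted.shape
    if n >= k:
        return Z_sorted[:k]                # truncate: keep the first k rows
    return np.vstack([Z_sorted, np.zeros((k - n, c))])  # pad with zero rows
```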

IV Parametric Graph Convolutions

A straightforward generalization of eq. (7) would be defined on the powers of the adjacency matrix, i.e. on random walks [18]. This would introduce tottering in the learned representation, which is generally not considered beneficial. We decided to follow another approach, based on shortest paths. As mentioned before, the adjacency matrix of a graph can be seen as the matrix of the shortest paths of length one, i.e.

$\mathbf{A} = \mathbf{S}^{(1)}, \quad \mathbf{S}^{(1)}_{ij} = \begin{cases} 1 & \text{if } sp(v_i, v_j) = 1 \\ 0 & \text{otherwise} \end{cases}$   (8)

Moreover, the identity matrix $\mathbf{I}$ is the matrix of the shortest paths of length zero (assuming that each node is at distance zero from itself), i.e. $\mathbf{S}^{(0)} = \mathbf{I}$. Moreover, note that $\tilde{\mathbf{A}} = \mathbf{S}^{(0)} + \mathbf{S}^{(1)}$.

By means of this new notation, we can rewrite eq. (7) as:

$\mathbf{H}^{(i)} = f\big(\tilde{\mathbf{D}}^{-1}(\mathbf{S}^{(0)} + \mathbf{S}^{(1)})\mathbf{H}^{(i-1)}\mathbf{W}^{(i)}\big)$   (9)

Let us now define $\mathbf{S}^{(j)}$ as the matrix of the shortest paths of length $j$, i.e. $\mathbf{S}^{(j)}_{uv}=1$ if $sp(u,v)=j$ and $0$ otherwise. We can now extend our reasoning and define our graph convolution layer, parameterized by $r$. In our contribution, we decided to process information in a slightly different way with respect to eq. (9). Instead of summing the contributions of the $\mathbf{S}^{(j)}$ matrices, we decided to keep the contributions of the nodes at different shortest-path distances separated. This is equivalent to the definition of multiple graph convolutional filters, one for each shortest-path distance. We define the Parametric Graph Convolution as:

$\mathbf{H}^{(i)} = \big\Vert_{j=0}^{r}\, f\big({\mathbf{D}^{(j)}}^{-1}\mathbf{S}^{(j)}\mathbf{H}^{(i-1)}\mathbf{W}^{(j,i)}\big)$   (10)

where $\Vert$ is the concatenation of the representations computed by the different filters, and $\mathbf{D}^{(j)}$ is the diagonal degree matrix of $\mathbf{S}^{(j)}$, i.e. $\mathbf{D}^{(j)}_{uu} = \sum_{v} \mathbf{S}^{(j)}_{uv}$. Note that with our formulation, we have a different matrix $\mathbf{W}^{(j,i)}$ for each layer $i$ and for each shortest-path distance $j$. Moreover, as mentioned before, we are concatenating the information and not summing it, explicitly keeping the contributions of the different distances separated. This approach follows the network-in-network idea [19]. In our case, at each layer, we are effectively applying, at the same time, $r+1$ convolutions (one for each shortest-path distance) and concatenating their output. Let us fix a parameter controlling the number of filters per distance at layer $i$, say $m$, and a value for the hyper-parameter $r$; then we have $\mathbf{H}^{(i)} \in \mathbb{R}^{n \times (r+1)m}$.
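The following sketch illustrates eq. (10) in plain numpy. It assumes the shortest-path matrices $\mathbf{S}^{(0)}, \ldots, \mathbf{S}^{(r)}$ have been precomputed (see Section IV-B), uses the per-distance degree normalization of eq. (10), and uses tanh as a placeholder for $f$; the function and argument names are ours.

```python
import numpy as np

def parametric_graph_conv(S_list, H, W_list):
    """Parametric graph convolution of eq. (10). S_list = [S^(0), ..., S^(r)] are the
    shortest-path indicator matrices (S^(0) = I, S^(1) = A, ...), W_list the matching
    weight matrices W^(j,i). One filter is applied per distance j and the outputs are
    concatenated along the feature dimension."""
    outputs = []
    for S_j, W_j in zip(S_list, W_list):
        deg = S_j.sum(axis=1)
        deg[deg == 0] = 1.0                                 # guard against empty rows of S^(j)
        D_j_inv = np.diag(1.0 / deg)
        outputs.append(np.tanh(D_j_inv @ S_j @ H @ W_j))    # tanh as a placeholder for f
    return np.concatenate(outputs, axis=1)                  # output size: n x (r+1)m
```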

Fig. 2: The proposed Parametric Graph Convolution. The parameter $r$ controls the maximum distance of the considered neighborhood, and the dimensionality of the output.

IV-A Receptive Field

It has been shown in [16, 17] that with the standard definition of graph convolution, e.g. the ones in eq. (6) and eq. (7), the receptive field of a graph convolutional filter at layer $i$ corresponding to the vertex $v$ is the set of vertices at shortest-path distance at most $i$ from $v$. This draws an interesting parallel with the Weisfeiler-Lehman graph kernel (see Section V-A), where intuitively the number of WL iterations is equivalent to the number of stacked graph convolution layers in the architecture.

In our proposed parametric graph convolution in eq. (10), the parameter $r$ directly influences the considered neighborhood in the graph convolutional filter (and the number of output channels, since we concatenate the output of the convolutions for all $j \leq r$). It is easy to see that, by definition, the receptive field of a graph convolutional filter parameterized by $r$ and applied to the vertex $v$ includes all the nodes at shortest-path distance at most $r$ from $v$. When we stack multiple layers of our parametric graph convolution, the receptive field grows in the same way. The receptive field of a parametric graph convolutional filter of size $r$ at layer $i$ applied to the vertex $v$ then includes all the vertices at shortest-path distance at most $r \cdot i$ from $v$.

IV-B Computational Complexity

Equation (10) requires computing the all-pairs shortest paths, up to a fixed length $r$. While computing the unbounded all-pairs shortest paths for a graph with $n$ nodes requires $O(n^3)$ time, if the maximum length $r$ is small enough, it is possible to implement the computation with one depth-limited breadth-first visit starting from each node, with an overall complexity of $O(n \cdot m)$, where $m$ is the number of edges in the graph.
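A possible implementation of this depth-limited computation is sketched below: one breadth-first visit per node, stopped at depth $r$, filling the indicator matrices $\mathbf{S}^{(j)}$ used by eq. (10).

```python
import numpy as np
from collections import deque

def shortest_path_matrices(A, r):
    """Compute S^(0), ..., S^(r), where S^(j)[u, v] = 1 iff sp(u, v) = j, with one
    depth-limited breadth-first visit per node (overall O(n * m) for bounded r)."""
    n = A.shape[0]
    S = [np.zeros((n, n)) for _ in range(r + 1)]
    for src in range(n):
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            if dist[u] == r:               # do not expand beyond depth r
                continue
            for v in np.flatnonzero(A[u]):
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        for v, d in dist.items():
            S[d][src, v] = 1.0             # S^(0) ends up being the identity matrix
    return S
```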

V Related works

Besides the approaches based on graph convolutions presented in Section III, there are some other methods in the literature to process graphs with neural networks.

For instance, [20] defined an attention mechanism to propagate information between the nodes in a graph. The basic idea is the definition of an external network that, given two neighboring nodes, outputs an attention weight for that specific edge. A shared attentive mechanism computes the attention coefficients

$e_{ij} = a\big(\mathbf{W}\mathbf{h}_i, \mathbf{W}\mathbf{h}_j\big)$   (11)

that indicate the importance of node $j$'s features to node $i$. Here, $a$ is a parametric function, which in the original paper is a single-layer feed-forward network parameterized by the vector $\mathbf{a}$. The information about the graph structure is injected into the mechanism by performing masked attention, i.e. $e_{ij}$ is only computed for nodes $j \in \mathcal{N}(i)$. To make coefficients easily comparable across different nodes, a softmax function is used:

$\alpha_{ij} = \mathrm{softmax}_j(e_{ij}) = \frac{\exp(e_{ij})}{\sum_{k \in \mathcal{N}(i)} \exp(e_{ik})}$   (12)

Once obtained, the normalized attention coefficients are used to compute a linear combination of the features corresponding to them, to serve as the final output features for every node (after potentially applying a point-wise nonlinearity $\sigma$):

$\mathbf{h}'_i = \sigma\Big(\sum_{j \in \mathcal{N}(i)} \alpha_{ij} \mathbf{W}\mathbf{h}_j\Big)$   (13)

To stabilize the learning process of self-attention, the authors propose to extend the mechanism to multi-head attention ($K$ different attention weights per edge). For the last layer, the authors employ averaging, and delay applying the final nonlinearity (usually a softmax or logistic sigmoid for classification problems) until then.
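A dense, single-head sketch of eqs. (11)-(13) is reported below for illustration (it is quadratic in the number of nodes and is not the efficient implementation of [20]); tanh is used as a placeholder for the output nonlinearity $\sigma$, and the argument names are ours.

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def gat_layer(A, H, W, a):
    """Attention logits from a single-layer network on pairs of transformed features,
    masked by the graph structure, softmax-normalized over each neighborhood, and
    used to combine the neighbors' features."""
    Wh = H @ W                                   # transformed node features, shape (n, m)
    n = A.shape[0]
    logits = np.full((n, n), -np.inf)
    mask = (A + np.eye(n)) > 0                   # masked attention: neighbors and the node itself
    for i in range(n):
        for j in np.flatnonzero(mask[i]):
            logits[i, j] = leaky_relu(a @ np.concatenate([Wh[i], Wh[j]]))   # eq. (11)
    alpha = np.exp(logits - logits.max(axis=1, keepdims=True))              # eq. (12)
    alpha = alpha / alpha.sum(axis=1, keepdims=True)
    return np.tanh(alpha @ Wh)                                              # eq. (13), tanh as sigma
```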

This technique has been applied to node classification only, and its complexity (due to implementation issues) is high. In principle, the same approach of NN4G can be adopted to generate graph-level representations and predictions for this model.

[21] (PSCN) proposes another interpretation of graph convolution. Given a graph, it first selects the nodes where the convolutional filters have to be centered. Then, it selects a fixed number of vertices from each of their neighborhoods, and infers an order on them. This ordering constraint limits the flexibility of the approach, because learning a consistent order is difficult and the number of nodes in the convolutional filter has to be fixed a priori.

Diffusion CNN (DCNN) [22] is based on the principle of heat diffusion (on graphs). The idea is to map from nodes and their labels to the result of a diffusion process that begins at each node.

Dataset       MUTAG   PTC     NCI1    PROTEINS   DD       COLLAB   IMDB-B   IMDB-M
Nodes (Max)   28      109     111     620        5748     492      136      89
Nodes (Avg)   17.93   25.56   29.87   39.06      284.32   74.49    19.77    13.00
Graphs        188     344     4110    1113       1178     5000     1000     1500

TABLE I: Summary of the employed graph datasets.

V-A Graph Kernels

Kernel methods define the model as a linear classifier in a Reproducing Kernel Hilbert Space, i.e. the space implicitly defined by a kernel function $k(\cdot, \cdot)$. The SVM is the most popular kernelized learning algorithm; it defines the solution as the maximum-margin hyperplane.

Kernel functions can be defined for many kinds of objects, and in particular for graphs. Many graph kernels have been defined in the literature. For instance, Random Walk kernels are based on the number of common random walks in two graphs [23, 2] and can be computed efficiently in closed form. More recent proposals focus on more complex structures, and allow the feature map to be represented explicitly, with computational benefits. Among others, kernels have been defined considering graphlets [24], shortest paths [25], subtrees [26, 27] and subtree-walks [28, 29]. For instance, the Weisfeiler-Lehman subtree kernel (WL) defines its features as rooted subtree-walks, i.e., subtrees whose nodes can appear multiple times, up to a user-defined maximum height (maximum number of iterations).

Propagation kernels (PK) [30] follow a different idea, inspired by the diffusion process in graph node kernels (i.e. kernels between nodes in a single graph), of propagating the node label information through the edges in a graph. Then, for each node, a distribution over the propagated labels is computed. Finally, the kernel between two graphs compares such distributions over all the nodes in the two graphs.

While exhibiting state-of-the-art performance on many graph datasets, the main problem of graph kernels is that they define a fixed representation, which is not task-dependent and can in principle limit the predictive performance of the method. Deep graph kernels (DGK) [31] propose an approach to alleviate this problem. Let us fix a base kernel $k$ and its explicit feature representation $\phi$. Then a deep graph kernel can be defined as:

$K(G, G') = \phi(G)^{\top} \mathbf{M}\, \phi(G'),$

where $\mathbf{M}$ is a matrix of parameters that has to be learned, possibly including target information.
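In code, the resulting kernel evaluation is just a re-weighted dot product between the explicit feature vectors of the two graphs; the sketch below assumes $\phi(G)$ and $\phi(G')$ are already available as numpy vectors, and M is the learned parameter matrix.

```python
import numpy as np

def deep_graph_kernel(phi_G1, phi_G2, M):
    """Deep graph kernel evaluation: the explicit feature vectors of a base graph
    kernel are re-weighted by a learned matrix M (encoding relationships between
    substructures)."""
    return phi_G1 @ M @ phi_G2
```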

VI Experiments

In this section, we evaluate the performance of the proposed method and compare it with several existing graph kernels and deep learning approaches for graphs. We pay special attention to the comparison between our method and DGCNN, to see whether the proposed generalization helps to improve the predictive performance. To this end, experiments are conducted in two settings, following the experimental procedure used in [17], on eight graph datasets (see Table I for a summary). The code for our experiments is available online at https://github.com/dinhinfotech/PGC-DGCNN.

In the first setting, we compare the performance of our method with DGCNN and state-of-the-art graph kernels: the graphlet kernel (GK) [1], the random walk kernel (RW) [2], the propagation kernel (PK) [30], and the Weisfeiler-Lehman subtree kernel (WL) [32]. We do not include other state-of-the-art graph kernels such as NSPDK [33] and ODD [26] because their performance is not much different from the considered ones, and it is beyond the scope of this paper to extensively compare the graph kernels in the literature. In this setting, five datasets containing biological node-labeled graphs are employed, namely MUTAG [34], PTC [35], NCI1 [36], PROTEINS, and DD [37]. In the first three datasets, each graph represents a chemical compound, where nodes are labeled with the atom type and edges represent bonds between them. MUTAG is a dataset of aromatic and hetero-aromatic nitro compounds, where the task is to predict their mutagenic effect on a bacterium. In PTC, the task is to predict the carcinogenicity of chemical compounds for male and female rats. NCI1 contains anti-cancer screens for lung cancer cells. In PROTEINS and DD, each graph represents a protein. The nodes are labeled according to the amino-acid type. The proteins are classified into two classes: enzymes and non-enzymes.

In the second setting, we evaluate the performance of the proposed method and DGCNN against other deep learning approaches for graphs: PATCHY-SAN (PSCN) [21], Diffusion CNN (DCNN) [22], ECC [15] and the Deep Graphlet Kernel (DGK) [31]. In this setting, three biological datasets (NCI1, PROTEINS and DD) and three social network datasets from [31] (COLLAB, IMDB-B and IMDB-M) are used. COLLAB is a dataset of scientific collaborations, where ego-networks generated for researchers are classified into three research fields. IMDB-B (binary) is a movie collaboration dataset where ego-networks of actors/actresses are classified into the action or romance genre. IMDB-M is a multi-class version of IMDB-B, containing the genres comedy, romance, and sci-fi.

In this setting, we eliminate MUTAG and PTC since they have a small number of examples which easily causes over-fitting problems for deep learning approaches.

Dataset            MUTAG        PTC          NCI1         PROTEINS     DD
GK                 81.39±1.74   55.65±0.46   62.49±0.27   71.39±0.31   74.38±0.69
RW                 79.17±2.07   55.91±0.32   >3 days      59.57±0.09   >3 days
PK                 76.00±2.69   59.50±2.44   82.54±0.47   73.68±0.68   78.25±0.51
WL                 84.11±1.91   57.97±2.49   84.46±0.45   74.68±0.49   78.34±0.62
DGCNN              85.83±1.66   58.59±2.47   74.44±0.47   75.54±0.94   79.37±0.94
PGC-DGCNN (r=2)    87.22±1.43   61.06±1.83   76.13±0.73   76.45±1.02   78.93±0.91

TABLE II: Comparison with graph kernels (average accuracy ± standard deviation). PGC-DGCNN ($r=2$): our proposed approach. DGCNN is similar to our approach with $r=1$. ">3 days" indicates that the computation did not complete within three days.
Dataset            NCI1         PROTEINS     DD           COLLAB       IMDB-B       IMDB-M
PSCN               76.34±1.68   75.00±2.51   76.27±2.64   72.60±2.15   71.00±2.29   45.23±2.84
DCNN               56.61±1.04   61.29±1.60   58.09±0.53   52.11±0.71   49.06±1.37   33.49±1.42
ECC                76.82        -            72.54        -            -            -
DGK                62.48±0.25   71.68±0.50   -            73.09±0.25   66.96±0.56   44.55±0.52
DGCNN              74.44±0.47   75.54±0.94   79.37±0.94   73.76±0.49   70.03±0.86   47.83±0.85
PGC-DGCNN (r=2)    76.13±0.73   76.45±1.02   78.93±0.91   75.00±0.58   71.62±1.22   47.25±1.44

TABLE III: Comparison with other deep learning approaches (average accuracy ± standard deviation). PGC-DGCNN ($r=2$): our proposed approach. DGCNN is similar to our approach with $r=1$. "-": result not available.

Evaluation method and model selection

To evaluate the different methods, a nested 10-fold cross-validation is employed, i.e., one fold for testing and 9 folds for training, of which one is used as validation set for model selection. For each dataset, we repeated each experiment 10 times and report the average accuracy over the 100 resulting folds. To select the best model, the hyper-parameters' values of the different kernels are set as follows: the heights of WL and PK are selected from a range of pre-defined values, the bin width of PK is set to 0.001, the size of the graphlets in GK to 3, and the decay of RW to the largest power of 10 that is smaller than the reciprocal of the squared maximum node degree. Note that some of our results are reported from [17].
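The evaluation protocol can be sketched as follows for a generic vector dataset and scikit-learn-style estimators; build_model and param_grid are hypothetical placeholders, and the actual experiments apply the same protocol to graph inputs and repeat it 10 times before averaging.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def nested_cv(X, y, build_model, param_grid, seed=0):
    """Nested 10-fold cross-validation: for each outer test fold, one of the 9
    remaining folds is used as validation set to select the hyper-parameters;
    the selected configuration is retrained on all 9 folds and tested."""
    outer = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
    scores = []
    for train_idx, test_idx in outer.split(X, y):
        X_tr, y_tr = X[train_idx], y[train_idx]
        inner = StratifiedKFold(n_splits=9, shuffle=True, random_state=seed)
        sub_idx, val_idx = next(inner.split(X_tr, y_tr))       # one validation fold
        best_params, best_acc = None, -np.inf
        for params in param_grid:                              # model selection
            model = build_model(**params).fit(X_tr[sub_idx], y_tr[sub_idx])
            acc = model.score(X_tr[val_idx], y_tr[val_idx])
            if acc > best_acc:
                best_params, best_acc = params, acc
        final = build_model(**best_params).fit(X_tr, y_tr)     # retrain on the 9 folds
        scores.append(final.score(X[test_idx], y[test_idx]))
    return np.mean(scores), np.std(scores)
```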

Network architecture

We employ the network architecture used in [17] to have a fair comparison with DGCNN. The network consists of three graph convolution layers, a concatenation layer, and a SortPooling layer, followed by two 1-D convolutional layers and one dense layer. The activation function for the graph convolutions is the hyperbolic tangent, while the 1D convolutions and the dense layer use rectified linear units. Note that our proposal, as presented in Section IV, is a generalization of DGCNN. In other words, DGCNN is very similar to a special case of our method where just the neighbors at shortest-path distance one ($r=1$) are considered. On the contrary, our proposal considers the distance as a hyper-parameter, $r$, allowing to flexibly capture the local structures associated to graph nodes. In this section, we set $r$ equal to 2 as a first attempt, and plan to explore neighborhoods with nodes at a higher distance as future work.

VI-A Experimental Results

Tables II and III show the performance of the various methods in the first and second setting, respectively. Overall, DGCNN and our proposed method outperform the compared kernels and deep learning methods on most datasets.

As can be seen from Table II, DGCNN and the proposed method (PGC-DGCNN) show higher performance on four out of five datasets, with an improvement ranging from 1.03% to 3.11% with respect to the best performing kernel. Compared to the RW kernel, our proposed method achieves the highest improvements on MUTAG and PROTEINS, of about 8% and 17%, respectively. Concerning PK and WL, which are similar in spirit to DGCNN and our method as shown in [17], DGCNN and PGC-DGCNN show higher performance in most cases, with a larger margin with respect to PK. It is worth noticing, when comparing with PK and WL, that their optimal models in each experiment are selected by tuning the height parameter $h$ over a range of pre-defined values. Instead, DGCNN and our method are evaluated with a fixed number of layers only. This suggests that the performance of DGCNN and the proposed method could be further improved by also validating the number of stacked graph convolutional layers.

Regarding the performance of the various deep learning methods reported in Table III, our method and DGCNN obtain the highest results in five out of six cases, the exception being NCI1, where they show marginally lower results. Compared to DCNN, DGCNN and our method achieve dramatically higher accuracies, with improvements ranging from around 14% up to 21%.

We now turn to the difference between the performance of DGCNN and that of our proposal. It can be seen from Tables II and III that our method performs better than DGCNN on the majority of the datasets. In particular, PGC-DGCNN outperforms DGCNN in six out of eight cases, with a consistent improvement of about 1% to 2%. On DD and IMDB-M, the accuracy of our method is slightly lower than that of DGCNN; however, these declines are only marginal. The generally improved performance of our method with respect to DGCNN can be explained by the fact that our method parameterizes the graph convolutions, making it a generalization of DGCNN (we recall that we fix the neighborhood distance $r=2$). In this case, our method captures more information about the local graph structure associated to each node compared to DGCNN, which considers just the direct neighbors, i.e. $r=1$. It is worth noticing that (1) we use a single value of $r$ to build our model, while in general an optimal model could be selected by tuning $r$ over a range of values; and (2) we utilize the architecture proposed in [17], meaning that we have not tried to optimize the network architecture. Therefore, the performance of our method can improve if we optimize the distance parameter and the number of graph convolutional layers, together with the rest of the architecture.

VII Conclusions and Future Works

In this paper, we presented a new definition of graph convolutional filter. It generalizes the most commonly adopted filter, adding a hyper-parameter controlling the distance of the considered neighborhood. Experimental results show that our proposed filter improves the predictive performance of Deep Graph Convolutional Neural Networks on many real-world datasets.

In the future, we plan to analyze in more depth the impact of the filter size in graph convolutional networks. We will define 1D convolutions as special cases of graph convolutions, and we will explore fully graph-convolutional neural architectures, which avoid fully-connected layers and possibly stack more graph convolution layers. Moreover, we will explore the impact of different activation functions for the graph convolutions in such a setting. Finally, we plan to enhance the input graph representation by associating to each node the explicit features extracted by graph kernels.

Acknowledgment

This project was funded, in part, by the Department of Mathematics, University of Padova, under the DEEP project, and by the DFG project BA 2168/3-3.

References

  • [1] N. Shervashidze, S. Vishwanathan, T. Petri, K. Mehlhorn, and K. Borgwardt, “Efficient graphlet kernels for large graph comparison,” in Artificial Intelligence and Statistics, 2009, pp. 488–495.
  • [2] S. V. N. Vishwanathan, N. N. Schraudolph, R. Kondor, and K. M. Borgwardt, “Graph kernels,” Journal of Machine Learning Research, vol. 11, no. Apr, pp. 1201–1242, 2010.
  • [3] G. Da San Martino, N. Navarin, and A. Sperduti, “A tree-based kernel for graphs,” in Proceedings of the Twelfth SIAM International Conference on Data Mining, Anaheim, California, USA, April 26-28, 2012., 2012, pp. 975–986. [Online]. Available: https://doi.org/10.1137/1.9781611972825.84
  • [4] L. van der Maaten, “Learning discriminative fisher kernels,” in Proceedings of the 28th International Conference on Machine Learning, ICML 2011, Bellevue, Washington, USA, June 28 - July 2, 2011, 2011, pp. 217–224.
  • [5] F. Aiolli, G. Da San Martino, M. Hagenbuchner, and A. Sperduti, “Learning nonsparse kernels by self-organizing maps for structured data,” IEEE Trans. Neural Networks, vol. 20, no. 12, pp. 1938–1949, 2009. [Online]. Available: https://doi.org/10.1109/TNN.2009.2033473
  • [6] D. Bacciu, A. Micheli, and A. Sperduti, “Generative kernels for tree-structured data,” IEEE Trans. Neural Netw. Learning Syst., vol. early access, 2018. [Online]. Available: https://ieeexplore.ieee.org/document/8259316/
  • [7] W. L. Hamilton, R. Ying, and J. Leskovec, “Representation learning on graphs: Methods and applications,” CoRR, vol. abs/1709.05584, 2017. [Online]. Available: http://arxiv.org/abs/1709.05584
  • [8] A. Micheli, “Neural network for graphs: A contextual constructive approach,” IEEE Transactions on Neural Networks, vol. 20, no. 3, pp. 498–511, 2009.
  • [9] F. Scarselli, M. Gori, A. C. Ah Chung Tsoi, M. Hagenbuchner, and G. Monfardini, “The Graph Neural Network Model,” IEEE Transactions on Neural Networks, vol. 20, no. 1, pp. 61–80, 2009. [Online]. Available: http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=4700287
  • [10] A. Sperduti and A. Starita, “Supervised neural networks for the classification of structures,” IEEE Trans. Neural Networks, vol. 8, no. 3, pp. 714–735, 1997. [Online]. Available: https://doi.org/10.1109/72.572108
  • [11] Y. Li, D. Tarlow, M. Brockschmidt, and R. Zemel, “Gated Graph Sequence Neural Networks,” in ICLR, 2016. [Online]. Available: http://arxiv.org/abs/1511.05493
  • [12] X. Bresson and T. Laurent, “An Experimental Study of Neural Networks for Variable Graphs,” in ICLR 2018 Workshop, 2018.
  • [13] Q. Liao and T. Poggio, “Bridging the Gaps Between Residual Learning, Recurrent Neural Networks and Visual Cortex,” arXiv preprint, 2016. [Online]. Available: http://arxiv.org/abs/1604.03640
  • [14] D. Duvenaud, D. Maclaurin, J. Aguilera-Iparraguirre, R. Gómez-Bombarelli, T. Hirzel, A. Aspuru-Guzik, and R. P. Adams, “Convolutional networks on graphs for learning molecular fingerprints,” in Advances in Neural Information Processing Systems, Montreal, Canada, 2015, pp. 2215–2223.
  • [15] M. Simonovsky and N. Komodakis, “Dynamic edge-conditioned filters in convolutional neural networks on graphs,” in CVPR, 2017.
  • [16] T. N. Kipf and M. Welling, “Semi-Supervised Classification with Graph Convolutional Networks,” in ICLR, 2017, pp. 1–14. [Online]. Available: http://arxiv.org/abs/1609.02907
  • [17] M. Zhang, Z. Cui, M. Neumann, and Y. Chen, “An End-to-End Deep Learning Architecture for Graph Classification,” in AAAI Conference on Artificial Intelligence, 2018.
  • [18] S. Abu-El-Haija, A. Kapoor, B. Perozzi, and J. Lee, “N-GCN: Multi-scale Graph Convolution for Semi-supervised Node Classification,” in Proceedings of the 14th International Workshop on Mining and Learning with Graphs (MLG), 2018. [Online]. Available: http://arxiv.org/abs/1802.08888
  • [19] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).   IEEE, jun 2015, pp. 1–9. [Online]. Available: http://ieeexplore.ieee.org/document/7298594/
  • [20] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio, “Graph Attention Networks,” in ICLR, 2018. [Online]. Available: http://arxiv.org/abs/1710.10903
  • [21] M. Niepert, M. Ahmed, and K. Kutzkov, “Learning convolutional neural networks for graphs,” in International conference on machine learning, 2016, pp. 2014–2023.
  • [22] J. Atwood and D. Towsley, “Diffusion-convolutional neural networks,” in Advances in Neural Information Processing Systems, 2016, pp. 1993–2001.
  • [23] T. Gärtner, P. Flach, and S. Wrobel, “On Graph Kernels: Hardness Results and Efficient Alternatives,” in Proceedings of the 16th Annual Conference on Computational Learning Theory and 7th Kernel Workshop, ser. Lecture Notes in Computer Science, B. Schölkopf and M. K. Warmuth, Eds., vol. 2777.   Berlin, Heidelberg: Springer Berlin Heidelberg, 2003, pp. 129–143. [Online]. Available: http://link.springer.com/10.1007/b12006
  • [24] N. Shervashidze, S. V. N. Vishwanathan, T. H. Petri, K. Mehlhorn, and K. M. Borgwardt, “Efficient graphlet kernels for large graph comparison,” in AISTATS, vol. 5.   Clearwater Beach, Florida, USA, 2009, pp. 488–495.
  • [25] K. Borgwardt and H.-P. Kriegel, “Shortest-Path Kernels on Graphs,” in ICDM.   Los Alamitos, CA, USA: IEEE, 2005, pp. 74–81. [Online]. Available: http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=1565664
  • [26] G. Da San Martino, N. Navarin, and A. Sperduti, “Ordered Decompositional DAG Kernels Enhancements,” Neurocomputing, vol. 192, pp. 92–103, 2016.
  • [27] ——, “A Tree-Based Kernel for Graphs,” in Proceedings of the Twelfth SIAM International Conference on Data Mining, 2012, pp. 975–986.
  • [28] ——, “Graph Kernels Exploiting Weisfeiler-Lehman Graph Isomorphism Test Extensions,” in Neural Information Processing, vol. 8835, 2014, pp. 93–100. [Online]. Available: http://link.springer.com/10.1007/978-3-319-12640-1_12
  • [29] N. Shervashidze, P. Schweitzer, E. J. van Leeuwen, K. Mehlhorn, and K. M. Borgwardt, “Weisfeiler-Lehman Graph Kernels,” JMLR, vol. 12, pp. 2539–2561, 2011.
  • [30] M. Neumann, N. Patricia, R. Garnett, and K. Kersting, “Efficient graph kernels by randomization,” in Joint European Conference on Machine Learning and Knowledge Discovery in Databases.   Springer, 2012, pp. 378–393.
  • [31] P. Yanardag and S. Vishwanathan, “Deep graph kernels,” in Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.   ACM, 2015, pp. 1365–1374.
  • [32] N. Shervashidze, P. Schweitzer, E. J. v. Leeuwen, K. Mehlhorn, and K. M. Borgwardt, “Weisfeiler-lehman graph kernels,” Journal of Machine Learning Research, vol. 12, no. Sep, pp. 2539–2561, 2011.
  • [33] F. Costa and K. De Grave, “Fast neighborhood subgraph pairwise distance kernel,” in ICML.   Omnipress, 2010, pp. 255–262.
  • [34] A. K. Debnath, R. L. Lopez de Compadre, G. Debnath, A. J. Shusterman, and C. Hansch, “Structure-activity relationship of mutagenic aromatic and heteroaromatic nitro compounds. Correlation with molecular orbital energies and hydrophobicity,” Journal of Medicinal Chemistry, vol. 34, no. 2, pp. 786–797, feb 1991. [Online]. Available: http://pubs.acs.org/doi/abs/10.1021/jm00106a046
  • [35] H. Toivonen, A. Srinivasan, R. D. King, S. Kramer, and C. Helma, “Statistical evaluation of the predictive toxicology challenge 2000-2001,” Bioinformatics, 2003.
  • [36] N. Wale, I. Watson, and G. Karypis, “Comparison of descriptor spaces for chemical compound retrieval and classification,” Knowledge and Information Systems, vol. 14, no. 3, pp. 347–375, 2008.
  • [37] P. D. Dobson and A. J. Doig, “Distinguishing Enzyme Structures from Non-enzymes Without Alignments,” Journal of Molecular Biology, vol. 330, no. 4, pp. 771–783, 2003.