edGNN: A Simple and Powerful GNN for Directed Labeled Graphs

Guillaume Jaume*
IBM Research, Zurich
EPFL, Lausanne
gja@zurich.ibm.com

An-phi Nguyen*
IBM Research, Zurich
ETH, Zurich
uye@zurich.ibm.com

María Rodríguez Martínez
IBM Research, Zurich
mrm@zurich.ibm.com

Jean-Philippe Thiran
EPFL, Lausanne
jean-philippe.thiran@epfl.ch

Maria Gabrani
IBM Research, Zurich
mga@zurich.ibm.com

*Equal contribution
Abstract

The ability of a graph neural network (GNN) to leverage both the graph topology and graph labels is fundamental to building discriminative node and graph embeddings. Building on previous work, we theoretically show that edGNN, our model for directed labeled graphs, is as powerful as the Weisfeiler–Lehman algorithm for graph isomorphism. Our experiments support our theoretical findings, confirming that graph neural networks can be used effectively for inference problems on directed graphs with both node and edge labels. Code available at https://github.com/guillaumejaume/edGNN.


1 Introduction

In recent years, much work has been devoted to extending deep-learning models to graphs, e.g., Scarselli et al. (2009); Bruna et al. (2014); Li et al. (2016); Defferrard et al. (2016); Kipf & Welling (2017); Hamilton et al. (2017a; b); Veličković et al. (2017); Ying et al. (2018). Gilmer et al. (2017) formulated numerous such models within their proposed Message Passing Neural Network (MPNN) framework. In this framework, the state of a vertex is represented by a feature vector, which is updated iteratively via a two-step strategy. For each vertex, (i) in the aggregation step, the feature vectors of the neighboring vertices are combined into a single vector via a differentiable operator, e.g., a sum; then, (ii) in the update step, a new state is computed by applying another differentiable operator, e.g., a one-layer perceptron, to the current state of the vertex and to the aggregate vector from the previous step.
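
To make the two-step strategy concrete, the following minimal sketch (Python/NumPy; names such as mpnn_layer, W_self and W_agg are illustrative and not taken from any specific model) applies one aggregation-and-update iteration to a graph stored as adjacency lists:

    import numpy as np

    def mpnn_layer(features, neighbors, W_self, W_agg):
        # features:  dict mapping each vertex to its current state (1-D array of size d)
        # neighbors: dict mapping each vertex to a list of neighboring vertices
        # W_self, W_agg: weight matrices (d x d') of the one-layer perceptron update
        new_features = {}
        for v, h_v in features.items():
            # (i) aggregation step: combine neighbor states into a single vector (here, a sum)
            agg = sum((features[w] for w in neighbors[v]), np.zeros_like(h_v))
            # (ii) update step: apply a differentiable operator to the current state
            # and the aggregate (here, a one-layer perceptron with a ReLU)
            new_features[v] = np.maximum(0.0, h_v @ W_self + agg @ W_agg)
        return new_features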

In two recent works, Xu et al. (2019) and Morris et al. (2018) independently proved that certain formulations of MPNNs are as powerful as the Weisfeiler–Lehman (WL) algorithm for graph isomorphism (Weisfeiler & Leman (1968)). In practice, this means that there exist MPNNs able to learn unique representations for (almost) all undirected node-labeled graphs, which is desirable for tasks such as node or graph classification. Note that previous work had already drawn a parallel between the WL test and MPNNs, e.g., Hamilton et al. (2017a); Jin et al. (2017); Lei et al. (2017).

In this paper, we extend the above-mentioned results to directed graphs with labels for both nodes and edges. In particular, by extending the theoretical framework provided by Morris et al. (2018), we show that there exist MPNNs as powerful as the (one-dimensional) WL algorithm for directed labeled graphs. Although this problem has already been addressed, e.g.,  Li et al. (2016); Niepert et al. (2016); Simonovsky & Komodakis (2017); Beck et al. (2018); Schlichtkrull et al. (2018), we present a theoretically-grounded GNN formulation for directed labeled graphs. We experimentally corroborate our theoretical results by comparing our model, edGNN, against state-of-the-art models for node and graph classification.

2 Theoretical framework

2.1 Notation and setup

As our work is an extension of Morris et al. (2018), we keep our notation consistent with theirs in order to more easily reference their results.

A graph is a pair $G = (V, E)$, where $V$ is the set of vertices, $E \subseteq V \times V$ is the set of edges, and the directed edge $(v, w) \in E$ for $v, w \in V$ is an edge starting in $v$ and ending in $w$. We will denote the vertex and edge sets of $G$ as $V(G)$ and $E(G)$, respectively.

We are interested in graphs with both node and edge labels. We therefore assume that, given a graph $G$, there exist a vertex-labeling function $l_V : V(G) \to \Sigma_V$ and an edge-labeling function $l_E : E(G) \to \Sigma_E$ that assign to each vertex and edge of $G$ a label from the countable sets $\Sigma_V$ and $\Sigma_E$, respectively. For the rest of this paper, we will refer to graphs with node and edge labels simply as labeled graphs.

For each vertex $v$, we can define the neighborhood $\mathcal{N}(v) = \{ w \in V(G) \mid (v, w) \in E(G) \text{ or } (w, v) \in E(G) \}$. As we are dealing with directed graphs, we distinguish between incoming neighbors $\mathcal{N}^-(v) = \{ w \in V(G) \mid (w, v) \in E(G) \}$ and outgoing neighbors $\mathcal{N}^+(v) = \{ w \in V(G) \mid (v, w) \in E(G) \}$. Naturally, $\mathcal{N}(v) = \mathcal{N}^-(v) \cup \mathcal{N}^+(v)$. The cardinalities $|\mathcal{N}(v)|$, $|\mathcal{N}^-(v)|$ and $|\mathcal{N}^+(v)|$ are referred to as the degree, the in-degree and the out-degree of vertex $v$, respectively.

Definition 2.1.

Two labeled directed graphs $G$ and $H$ are isomorphic if there exists a bijection $\varphi : V(G) \to V(H)$ such that $(v, w) \in E(G)$ if and only if $(\varphi(v), \varphi(w)) \in E(H)$, with $l_V(v) = l_V(\varphi(v))$, $l_V(w) = l_V(\varphi(w))$ and $l_E((v, w)) = l_E((\varphi(v), \varphi(w)))$ for all $v, w \in V(G)$.

2.1.1 The Weisfeiler–Lehman algorithm

The Weisfeiler–Lehman (WL) test (Weisfeiler & Leman (1968)) is an algorithm for testing whether two graphs are non-isomorphic. We present the test in its one-dimensional variant, also known as naive vertex refinement. We first present the WL test on node-labeled graphs, and later discuss its extension to directed labeled graphs.

At initialization, the vertices are labeled consistently with the vertex-labeling function $l_V$. We call this the initial coloring of the graph and denote it as $c^{(0)}$. The algorithm then proceeds in a recursive fashion. At iteration $t$, a new label is computed for each vertex $v$ from the current labels of the vertex itself and of its neighbors, i.e.,

$c^{(t)}(v) = \mathrm{HASH}\Big( c^{(t-1)}(v),\ \{\!\!\{\, c^{(t-1)}(w) \mid w \in \mathcal{N}(v) \,\}\!\!\} \Big), \qquad (1)$

where $\mathrm{HASH}$ is an injective hashing function, and $\{\!\!\{ \cdot \}\!\!\}$ denotes a multiset, i.e., a generalization of a set that allows elements to be repeated. Each iteration is performed in parallel for the two graphs to be tested, $G$ and $H$. If at some iteration $t$ the number of vertices assigned to a given label differs between the two graphs, the algorithm stops and concludes that the two graphs are not isomorphic. Otherwise, the algorithm stops whenever a stable coloring is achieved, i.e., whenever $c^{(t+1)}(v) = c^{(t+1)}(w)$ if and only if $c^{(t)}(v) = c^{(t)}(w)$ for any pair of vertices $v$ and $w$. This is guaranteed to happen after at most $\max(|V(G)|, |V(H)|)$ iterations. In this case, $G$ and $H$ are considered isomorphic.
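
The following sketch (Python, illustrative only; a pair of the current label and a sorted tuple of neighbor labels stands in for the injective HASH of Eq. (1)) performs one such refinement iteration and computes the label histogram used to compare the two graphs:

    from collections import Counter

    def wl_iteration(colors, neighbors):
        # colors:    dict mapping each vertex to its current label (any hashable value)
        # neighbors: dict mapping each vertex to a list of neighboring vertices
        # The pair (own label, sorted multiset of neighbor labels) is used directly
        # as the new label, playing the role of an injective hash.
        return {v: (colors[v], tuple(sorted(colors[w] for w in neighbors[v])))
                for v in colors}

    def color_histogram(colors):
        # Number of vertices per label, compared between the two graphs after each iteration.
        return Counter(colors.values())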

Despite a few corner cases, the WL test is able to distinguish a wide class of graphs (Cai et al. (1992)). Moreover, the most efficient implementation of the algorithm has a runtime complexity which is quasi-linear in the number of vertices (Grohe et al. (2017)).

The extension of the WL test to directed graphs with edge labels is straightforward (Grohe et al. (2017); Orsini et al. (2016)). During the recursive step, for each vertex $v$, we additionally need to include in the hashing function the in-degrees and out-degrees of $v$ with respect to each edge label.

Let us denote an edge label as $\sigma \in \Sigma_E$. For each vertex $v$, we then define $\#^-_\sigma(v)$ as the number of edges incoming to $v$ with label $\sigma$. Similarly, $\#^+_\sigma(v)$ is defined for outgoing edges. Then, Eq. (1) can be adapted to labeled directed graphs in the following way:

$c^{(t)}(v) = \mathrm{HASH}\Big( c^{(t-1)}(v),\ \{\!\!\{\, c^{(t-1)}(w) \mid w \in \mathcal{N}(v) \,\}\!\!\},\ \big(\#^-_\sigma(v)\big)_{\sigma \in \Sigma_E},\ \big(\#^+_\sigma(v)\big)_{\sigma \in \Sigma_E} \Big). \qquad (2)$
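
As an illustration, the refinement step sketched above can be adapted along the lines of Eq. (2); the sketch below (Python, illustrative, assuming the graph is given as per-vertex lists of labeled incoming and outgoing edges) adds the per-label in- and out-degree counts to the hashed tuple:

    from collections import Counter

    def wl_iteration_directed(colors, in_edges, out_edges):
        # colors:       dict mapping each vertex to its current label
        # in_edges[v]:  list of (source_vertex, edge_label) pairs for edges ending in v
        # out_edges[v]: list of (target_vertex, edge_label) pairs for edges starting in v
        new_colors = {}
        for v in colors:
            nbr_labels = tuple(sorted(colors[w] for w, _ in in_edges[v] + out_edges[v]))
            in_counts = tuple(sorted(Counter(lbl for _, lbl in in_edges[v]).items()))
            out_counts = tuple(sorted(Counter(lbl for _, lbl in out_edges[v]).items()))
            new_colors[v] = (colors[v], nbr_labels, in_counts, out_counts)
        return new_colors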

2.2 Graph neural networks

Graph neural network architectures implement a neighborhood aggregation strategy. Similarly to Morris et al. (2018), our work focuses on the GNN model presented by Hamilton et al. (2017a), which implements this strategy with the node update function

$f^{(t)}(v) = \sigma\Big( f^{(t-1)}(v)\, W_1^{(t)} + \sum_{w \in \mathcal{N}(v)} f^{(t-1)}(w)\, W_2^{(t)} \Big), \qquad (3)$

where $f^{(t)}(v)$ is the $d$-dimensional node representation, or node embedding, of vertex $v$ at time step $t$, $\sigma$ is a component-wise non-linearity, and $W_1^{(t)}$, $W_2^{(t)}$ are weight matrices. The initial representation $f^{(0)}$ is consistent with the vertex-labeling function $l_V$, i.e., $f^{(0)}(v) = f^{(0)}(w)$ if and only if $l_V(v) = l_V(w)$ for all $v, w \in V(G)$. Morris et al. (2018) showed that there exists a sequence of weights $\big(W_1^{(t)}, W_2^{(t)}\big)_{t \geq 0}$ such that the GNN model in Eq. (3) is as powerful as the WL test.
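
In matrix form, one layer of Eq. (3) can be sketched as follows (Python/NumPy, illustrative; a ReLU is used for the non-linearity):

    import numpy as np

    def gnn_layer(F, A, W1, W2):
        # F:  |V| x d matrix of node embeddings at step t-1
        # A:  |V| x |V| adjacency matrix of the (undirected) graph, so that
        #     (A @ F)[v] sums the embeddings of the neighbors of v
        # W1, W2: d x d' weight matrices
        return np.maximum(0.0, F @ W1 + A @ F @ W2)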

2.2.1 Extension to directed labeled graphs

The extension of Eq. (3) to directed labeled graphs follows the WL test extension. We simply need to augment the equation with embeddings for the labeled edges, with incoming and outgoing edges considered separately, i.e.,

$f^{(t)}(v) = \sigma\Big( f^{(t-1)}(v)\, W_1^{(t)} + \sum_{w \in \mathcal{N}(v)} f^{(t-1)}(w)\, W_2^{(t)} + \sum_{w \in \mathcal{N}^-(v)} g^-_{l_E((w,v))}\, W_3^{(t)} + \sum_{w \in \mathcal{N}^+(v)} g^+_{l_E((v,w))}\, W_4^{(t)} \Big), \qquad (4)$

where $g^-_\sigma$ (resp. $g^+_\sigma$) is the embedding of an incoming (resp. outgoing) edge with label $\sigma \in \Sigma_E$. The embeddings should be defined such that $g^-_{\sigma_1} = g^-_{\sigma_2}$ if and only if $\sigma_1 = \sigma_2$. The same should hold for the outgoing-edge embeddings. For practical applications, this can be achieved by using one-hot encodings of the edge labels.
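
A per-vertex sketch of this update is given below (Python/NumPy, illustrative; the weight naming W1–W4 follows the reconstruction of Eq. (4) above, and the released implementation at https://github.com/guillaumejaume/edGNN may organize the computation differently):

    import numpy as np

    def edgnn_layer(features, in_edges, out_edges, g_in, g_out, W1, W2, W3, W4):
        # features:     dict mapping each vertex to its node embedding (1-D array)
        # in_edges[v]:  list of (source_vertex, edge_label) pairs for edges ending in v
        # out_edges[v]: list of (target_vertex, edge_label) pairs for edges starting in v
        # g_in, g_out:  dicts mapping an edge label to its incoming / outgoing embedding
        #               (e.g., one-hot encodings of the edge labels)
        new_features = {}
        for v, h_v in features.items():
            out = h_v @ W1
            # neighbor node embeddings over the full neighborhood N(v)
            for w in {u for u, _ in in_edges[v]} | {u for u, _ in out_edges[v]}:
                out = out + features[w] @ W2
            # embeddings of incoming and outgoing edge labels, weighted separately
            for _, lbl in in_edges[v]:
                out = out + g_in[lbl] @ W3
            for _, lbl in out_edges[v]:
                out = out + g_out[lbl] @ W4
            new_features[v] = np.maximum(0.0, out)  # ReLU as the non-linearity
        return new_features
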
We can now directly extend the theorems presented in Morris et al. (2018).

Theorem 2.1 (Theorem 1 in Morris et al. (2018)).

Let $G$ be a directed labeled graph. Then for all $t \geq 0$, for all choices of initial colorings $f^{(0)}$ consistent with $l_V$ and of edge embeddings consistent with $l_E$, and for all weights $\mathbf{W}^{(t)}$,

$c^{(t)} \sqsubseteq f^{(t)}, \qquad (5)$

with $c^{(t)}$ and $f^{(t)}$ defined in Eqs. (2) and (4), respectively, and where $\sqsubseteq$ denotes coloring refinement as in Morris et al. (2018), i.e., the WL coloring always refines the coloring induced by the GNN node embeddings.

Morris et al. (2018) prove this theorem by induction. The proof is essentially the same for our extended case. In fact, as neither the labels nor the embeddings change over the iterations, there is no need to include them in the induction step.

Theorem 2.2 (Theorem 2 in Morris et al. (2018)).

Let $G$ be a directed labeled graph with finite vertex degree. Then there exists a sequence of weights $\mathbf{W}^{(t)}$, with $t \geq 0$, such that

$c^{(t)} \equiv f^{(t)}. \qquad (6)$

The proof is provided in Appendix A.1. Note that we specifically require the graph to have finite vertex degree. However, this is not a strong assumption in real-world applications.

2.2.2 Graph classification

For graph classification tasks, we need a representation of the entire graph. We build it from the node representations following the formulation of Xu et al. (2019):

$f(G) = \underset{t = 0, \dots, T}{\mathrm{CONCAT}}\Big( \mathrm{READOUT}\big( \{\!\!\{\, f^{(t)}(v) \mid v \in V(G) \,\}\!\!\} \big) \Big), \qquad (7)$

where READOUT is a permutation-invariant aggregator over the node embeddings (e.g., a sum) and $T$ is the number of GNN layers. Note that, although $T$ should theoretically be at least $|V(G)|$ (Section 2.1.1), only a few layers (i.e., iterations) are used in practice to update the node representations. Finally, a linear classifier is applied to the graph representation to perform the classification.
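
With a sum readout, Eq. (7) amounts to the following (Python/NumPy, illustrative):

    import numpy as np

    def graph_representation(layer_outputs):
        # layer_outputs: list of |V| x d node-embedding matrices, one per iteration t = 0..T
        # Sum-readout per iteration, then concatenation across iterations.
        return np.concatenate([F.sum(axis=0) for F in layer_outputs])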

3 Experiments

3.1 Datasets and baselines

We benchmark our algorithm, edGNN, on graph and node classification tasks. For graph classification, we compare our model against the Subgraph Matching Kernel (CSM) (Kriege & Mutzel (2012)), the Weisfeiler–Lehman Shortest Path Kernel (WLSP) (Shervashidze et al. (2011)) and R-GCN (Schlichtkrull et al. (2018)). As R-GCN only defines how to build node embeddings, we reuse the formulation of Xu et al. (2019) to build a graph-level representation. For node classification, we compare our model against R-GCN (Schlichtkrull et al. (2018)), RDF2Vec (Ristoski & Paulheim (2016)) and WL (Shervashidze et al. (2011); De Vries & De Rooij (2015)) on the AIFB and MUTAG datasets (Ristoski et al. (2016)). Dataset statistics as well as training details are reported in the Appendix.

3.2 Results and discussion

The left-hand table in Table 1 reports the results for the graph classification tasks. Our provably powerful model, edGNN, reaches state-of-the-art performance on each dataset, showing its discriminative power over other approaches. In the case of graph-kernel-based methods, we conjecture that the lower performance is due to the use of less powerful models, which could lead to underfitting. Both CSM and WLSP focus on specific subgraph features (e.g., shortest paths), which may not be sufficient to fully characterize the graphs in the datasets we analyzed. However, as we have no information about the training accuracies of these models, we cannot confirm this hypothesis. In the case of R-GCN, as there is no theoretical result proving that it can or cannot be equivalent to the WL test, the consistently lower performance might merely be because the R-GCN formulation is not appropriate for the particular learning setting of our experiments (e.g., relatively small datasets).

For node classification (right-hand table in Table 1), edGNN achieves performance comparable to the state of the art without outperforming it. This does not contradict our theoretical findings: the power of a learnable model guarantees neither its generalization nor that the best model can be learned. It is true, however, that in the best-case scenario a more powerful model should perform better than a less powerful one, as shown by the results for the best-learned edGNN model. An interesting future research direction would be to study all proposed models fitting the MPNN framework in order to understand whether they can be as powerful as the WL test or whether they instead introduce a particular bias.

Model MUTAG PTC FM PTC FR PTC MM PTC MR
CSM
WLSP
R-GCN
edGNN (avg)
edGNN (max)
Model AIFB MUTAG
WL
RDF2Vec
R-GCN
edGNN (avg)
edGNN (max)
Table 1: Graph (left) and node (right) classification results in accuracy averaged over ten runs. Results are expressed as percentages. For graph classification, following prior art, we performed 10-fold cross validation. For node classification, we used the split provided by Ristoski & Paulheim (2016).

References

  • Beck et al. (2018) Daniel Beck, Gholamreza Haffari, and Trevor Cohn. Graph-to-Sequence Learning using Gated Graph Neural Networks. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, pp. 273–283, 2018.
  • Bruna et al. (2014) Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann Lecun. Spectral Networks and Deep Locally Connected Networks on Graphs. In International Conference on Learning Representations (ICLR), 2014.
  • Cai et al. (1992) Jin Yi Cai, Martin Fürer, and Neil Immerman. An optimal lower bound on the number of variables for graph identification. Combinatorica, 1992.
  • De Vries & De Rooij (2015) Gerben Klaas Dirk De Vries and Steven De Rooij. Substructure counting graph kernels for machine learning from RDF data. Journal of Web Semantics, 2015.
  • Debnath et al. (1991) Asim Kumar Debnath, Rosa L. Lopez de Compadre, Gargi Debnath, Alan J. Shusterman, and Corwin Hansch. Structure-Activity Relationship of Mutagenic Aromatic and Heteroaromatic Nitro Compounds. Correlation with Molecular Orbital Energies and Hydrophobicity. J. Med. Chem., 34:786–797, 1991.
  • Defferrard et al. (2016) Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional Neural Networks on Graphs with Fast Localized Spectral Filtering. In Advances in neural information processing systems (NIPS), 2016.
  • Gilmer et al. (2017) Justin Gilmer, Samuel S. Schoenholz, Patrick F. Riley, Oriol Vinyals, and George E. Dahl. Neural Message Passing for Quantum Chemistry. In International Conference on Machine Learning (ICML), pp. 1263–1272, 2017.
  • Grohe et al. (2017) Martin Grohe, Kristian Kersting, Martin Mladenov, and Pascal Schweitzer. Color Refinement and its Applications. 2017.
  • Hamilton et al. (2017a) William L. Hamilton, Rex Ying, and Jure Leskovec. Inductive Representation Learning on Large Graphs. In Advances in neural information processing systems (NIPS), 2017a.
  • Hamilton et al. (2017b) William L. Hamilton, Rex Ying, and Jure Leskovec. Representation Learning on Graphs: Methods and Applications. IEEE Data Engineering Bulletin, 2017b.
  • Helma et al. (2003) C Helma, R D King, S Kramer, and A Srinivasan. The Predictive Toxicology Challenge 2000-2001. Bioinformatics (Oxford, England), 19(1):1179–82, 2003.
  • Jin et al. (2017) Wengong Jin, Connor W. Coley, Regina Barzilay, and Tommi Jaakkola. Predicting Organic Reaction Outcomes with Weisfeiler-Lehman Network. In Advances in neural information processing systems (NIPS), 2017.
  • Kipf & Welling (2017) Thomas N. Kipf and Max Welling. Semi-Supervised Classification with Graph Convolutional Networks. In International Conference on Learning Representations (ICLR), 2017.
  • Kriege & Mutzel (2012) Nils Kriege and Petra Mutzel. Subgraph Matching Kernels for Attributed Graphs. In International Conference on Machine Learning (ICML), 2012.
  • Lei et al. (2017) Tao Lei, Wengong Jin, Regina Barzilay, and Tommi Jaakkola. Deriving Neural Architectures from Sequence and Graph Kernels. International Conference on Machine Learning (ICML), 2017.
  • Li et al. (2016) Yujia Li, Daniel Tarlow, Marc Brockschmidt, and Richard Zemel. Gated Graph Sequence Neural Networks. In International Conference on Learning Representations (ICLR), 2016.
  • Morris et al. (2018) Christopher Morris, Martin Ritzert, Matthias Fey, William L. Hamilton, Jan Eric Lenssen, Gaurav Rattan, and Martin Grohe. Weisfeiler and Leman Go Neural: Higher-order Graph Neural Networks. In 33rd AAAI Conference on Artificial Intelligence (AAAI), 2018.
  • Niepert et al. (2016) Mathias Niepert, Mohamed Ahmed, and Konstantin Kutzkov. Learning Convolutional Neural Networks for Graphs. In International Conference on Machine Learning (ICML), pp. 2014–2023, 2016.
  • Orsini et al. (2016) Francesco Orsini, Paolo Frasconi, and Luc De Raedt. Graph invariant kernels. In International Joint Conference on Artificial Intelligence (IJCAI), 2016.
  • Patterson & Warner (1967) E. M. Patterson and Seth Warner. Modern Algebra (Prentice-Hall, Inc., 1965), two volumes, 806 pp., volume 15. Cambridge University Press, Dec. 1967. URL http://www.journals.cambridge.org/abstract_S0013091500012098.
  • Ristoski & Paulheim (2016) Petar Ristoski and Heiko Paulheim. RDF2Vec: RDF graph embeddings for data mining. In Lecture Notes in Computer Science, 2016.
  • Ristoski et al. (2016) Petar Ristoski, Gerben Klaas Dirk De Vries, and Heiko Paulheim. A collection of benchmark datasets for systematic evaluations of machine learning on the semantic web. In Lecture Notes in Computer Science, 2016.
  • Scarselli et al. (2009) Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini. The graph neural network model. IEEE Transactions on Neural Networks, 2009.
  • Schlichtkrull et al. (2018) Michael Schlichtkrull, Thomas N. Kipf, Peter Bloem, Rianne van den Berg, Ivan Titov, and Max Welling. Modeling Relational Data with Graph Convolutional Networks. In Extended Semantic Web Conference, 2018.
  • Shervashidze et al. (2011) Nino Shervashidze, Pascal Schweitzer, Erik Jan van Leeuwen, Kurt Mehlhorn, and Karsten M Borgwardt. Weisfeiler-Lehman Graph Kernels. Journal of Machine Learning Research, 12:2539–2561, 2011.
  • Simonovsky & Komodakis (2017) Martin Simonovsky and Nikos Komodakis. Dynamic edge-conditioned filters in convolutional neural networks on graphs. In Proceedings - 30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • Veličković et al. (2017) Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. Graph Attention Networks. International Conference on Learning Representations (ICLR), 2017.
  • Weisfeiler & Leman (1968) B. Yu. Weisfeiler and A. A. Leman. A reduction of a graph to a canonical form and an algebra arising during this reduction. Nauchno-Technicheskaya Informatsia, 2(9), pp. 2–16, 1968.
  • Xu et al. (2019) Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How Powerful are Graph Neural Networks? In International Conference on Learning Representations (ICLR), 2019.
  • Ying et al. (2018) Rex Ying, Jiaxuan You, Christopher Morris, Xiang Ren, William L. Hamilton, and Jure Leskovec. Hierarchical Graph Representation Learning with Differentiable Pooling. In Neural Information Processing Systems (NIPS), 2018.

Appendix A

A.1 Proof of Theorem 2.2

For a given vertex $v$, we define $\Sigma_E^-(v) \subseteq \Sigma_E$ as the set of labels of edges incoming into the vertex $v$. We then define $s^-(v) = \big( (\sigma, \#^-_\sigma(v)) \big)_{\sigma \in \Sigma_E^-(v)}$. That is, for each vertex $v$, we create a label $s^-(v)$ by concatenating all the labels of the incoming edges together with their multiplicities $\#^-_\sigma(v)$. Similarly, we define $\Sigma_E^+(v)$ and $s^+(v)$ for the outgoing edges. Note that the pairs $(\sigma, \#^-_\sigma(v))$ and $(\sigma, \#^+_\sigma(v))$ take values in $\Sigma_E \times \mathbb{N}$. Therefore, $s^-(v)$ (or $s^+(v)$, respectively) can take values in $(\Sigma_E \times \mathbb{N})^{|\Sigma_E^-(v)|}$ (or $(\Sigma_E \times \mathbb{N})^{|\Sigma_E^+(v)|}$, respectively), i.e., the Cartesian product of $|\Sigma_E^-(v)|$ (or $|\Sigma_E^+(v)|$, respectively) copies of the set $\Sigma_E \times \mathbb{N}$.

For all vertices $v$, we can construct a function $\iota$ that bijectively maps a tuple $\big( l_V(v), s^-(v), s^+(v) \big)$ to a label $l^*(v)$ in an augmented label set $\Sigma^*$.

Note that, as we are considering graphs with finite vertex degree, $\Sigma_E^-(v)$ and $\Sigma_E^+(v)$ (and, consequently, $s^-(v)$ and $s^+(v)$) are finite. Therefore, $(\Sigma_E \times \mathbb{N})^{|\Sigma_E^-(v)|}$ is a countable set because the finite Cartesian product of countable sets is itself countable. Thus, as we built the function $\iota$ to be bijective, the sets of augmented labels, and their countable union $\Sigma^*$, are also countable (for results on countable sets, refer for example to Patterson & Warner (1967)).

We can then construct an injective hash function $\mathrm{HASH}^*$ such that

$\mathrm{HASH}^*\Big( c^{*(t-1)}(v),\ \{\!\!\{\, c^{*(t-1)}(w) \mid w \in \mathcal{N}(v) \,\}\!\!\} \Big) = \mathrm{HASH}\Big( c^{(t-1)}(v),\ \{\!\!\{\, c^{(t-1)}(w) \mid w \in \mathcal{N}(v) \,\}\!\!\},\ \big(\#^-_\sigma(v)\big)_{\sigma \in \Sigma_E},\ \big(\#^+_\sigma(v)\big)_{\sigma \in \Sigma_E} \Big), \qquad (8)$

where $c^{*(t)}$ denotes the coloring over the augmented label set $\Sigma^*$, initialized as $c^{*(0)}(v) = l^*(v)$, and the right-hand side is the relabeling function defined in Eq. (2).

These constructions highlight the fact that an iteration of the WL algorithm on a directed labeled graph is the same as performing an iteration of the WL algorithm on an undirected node-only-labeled graph, where node labels take values in the appropriately augmented label set $\Sigma^*$.

The same equivalence can be highlighted between the GNN update functions in Eqs. (3) and (4). In fact, Eq. (4) can be rewritten as

$f^{(t)}(v) = \sigma\Big( f^{(t-1)}(v)\, W_1^{(t)} + \sum_{w \in \mathcal{N}(v)} \tilde{f}^{(t-1)}(w)\, \widetilde{W}^{(t)} \Big), \qquad (9)$

where $\tilde{f}^{(t-1)}(w)$ is the embedding resulting from the (horizontal) concatenation of $f^{(t-1)}(w)$, $g^-_{l_E((w,v))}$ and $g^+_{l_E((v,w))}$ (taken to be zero vectors when the corresponding edge does not exist), whereas $\widetilde{W}^{(t)}$ is the (vertical) concatenation of $W_2^{(t)}$, $W_3^{(t)}$ and $W_4^{(t)}$.

The reformulations presented in Eqs. (8) and (9) allow us to treat our problem as one of undirected graphs with labels only for nodes. We can therefore prove this theorem by directly using the proof of Theorem 2 in Morris et al. (2018).

A.2 Details of training and more results

A.2.1 Initialization

We initialize the node and edge features with a one-hot encoding of their input label. For node classification, we use the in-degree as the input label of the nodes. We also report results with learned embeddings (emb) instead of a one-hot encoding. To model the outgoing edges of the node classification graphs, we create new artificial relations by reversing each directed edge, as sketched below. We also report results without reversing the edges (reg).
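
A minimal sketch of this preprocessing (Python/NumPy; the function names are illustrative and not taken from the released code) is:

    import numpy as np

    def one_hot_features(labels, label_set):
        # labels:    list with one label per node (or per edge)
        # label_set: set of all possible labels
        index = {lbl: i for i, lbl in enumerate(sorted(label_set))}
        feats = np.zeros((len(labels), len(index)))
        for row, lbl in enumerate(labels):
            feats[row, index[lbl]] = 1.0
        return feats

    def add_reverse_edges(edges):
        # edges: list of (source, target, relation) triples
        # Each edge is mirrored with an artificial reversed relation so that
        # information can also flow against the original edge direction.
        return edges + [(dst, src, rel + "_rev") for (src, dst, rel) in edges]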

A.2.2 Graph classification

The tasks consist of predicting the mutagenicity of chemical compounds for MUTAG (Debnath et al. (1991); Kriege & Mutzel (2012)) and their toxicity for PTC (Helma et al. (2003)).

We ran our graph classification experiments with a batch size of and a learning rate of with a weight decay of . We then performed a parameter search over the number of layers and node embedding size. The best performance was reached by using layers and hidden units. When the system was trained with learned embeddings, the initial embedding size was set to the number of hidden units (i.e., ). We used a ReLU activation at each layer without dropout. The system was trained for at most epochs with early stopping with respect to the validation set cross-entropy loss.
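
The early-stopping logic can be sketched as follows (Python, illustrative; the patience value is a placeholder, as it is not reported here):

    def train_with_early_stopping(run_epoch, validation_loss, max_epochs, patience):
        # run_epoch:       callable performing one training epoch
        # validation_loss: callable returning the current validation cross-entropy loss
        # patience:        number of epochs without improvement before stopping (placeholder)
        best_loss, best_epoch = float("inf"), -1
        for epoch in range(max_epochs):
            run_epoch()
            loss = validation_loss()
            if loss < best_loss:
                best_loss, best_epoch = loss, epoch
            elif epoch - best_epoch >= patience:
                break  # stop early w.r.t. the validation cross-entropy loss
        return best_loss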

R-GCN for graph classification was also trained with a batch size of , a learning rate of and a weight decay of . We used layers and hidden units with learned initial embeddings for the nodes (instead of the one-hot encoding used for edGNN). We used a basis decomposition with the number of bases set to the number of relations. Results for CSM (Kriege & Mutzel (2012)) and WLSP (Shervashidze et al. (2011)) are based on the re-implementation of Kriege & Mutzel (2012). All our experiments were performed with 10-fold cross validation as in Kriege & Mutzel (2012).

Dataset statistics for the graph classification tasks are shown in Table 2, whereas Table 3 shows the accuracy results together with the standard deviation.

Dataset Graphs Classes Avg nodes Avg edges Node labels Edge labels
MUTAG 188 2 17.9 19.8 6 3
PTC FM 349 2 14.1 14.5 18 4
PTC FR 351 2 14.6 15.0 19 4
PTC MM 336 2 14.0 14.3 20 4
PTC MR 344 2 14.3 14.7 18 4
Table 2: Graph classification dataset statistics with the number of graphs (Graphs), the number of classes (Classes), the average number of nodes per graph (Avg nodes), the average number of edges per graph (Avg edges), the number of node labels (Node labels) and the number of edge labels (Edge labels).
Model MUTAG PTC FM PTC FR PTC MM PTC MR
CSM
WLSP
R-GCN
edGNN (avg)
edGNN (max)
edGNN (emb)
Table 3: Graph classification results in accuracy obtained with 10-fold cross validation. Results are expressed as percentages. edGNN is compared with the Subgraph Matching Kernel (CSM) (Kriege & Mutzel (2012)), Weisfeiler–Lehman Shortest Path Kernel (Shervashidze et al. (2011)) and R-GCN (Schlichtkrull et al. (2018)).

A.2.3 Node classification

The node classification experiments were run with a learning rate of without weight decay. We used dropout on each layer with a ReLU activation. The best performance was achieved by using layers and hidden units. The maximum number of epochs was set to with early stopping with respect to the validation set cross-entropy loss. Results with R-GCN (Schlichtkrull et al. (2018)), RDF2Vec (Ristoski & Paulheim (2016)) and WL (Shervashidze et al. (2011)) are based on the re-implementation of Schlichtkrull et al. (2018).

Dataset Classes Nodes Edges Edge labels
AIFB 4 8,285 29,043 45
MUTAG 2 23,644 74,227 23
Table 4: Node classification dataset statistics with the number of classes (Classes), the total number of nodes (Nodes), the total number of edges (Edges) and the number of edge labels (Edge labels).
Model AIFB MUTAG
WL
RDF2Vec
R-GCN
edGNN (avg)
edGNN (max)
edGNN (emb)
edGNN (reg)
Table 5: Node classification results in accuracy averaged over ten runs. Results are expressed as percentages. edGNN is compared with WL (De Vries & De Rooij (2015)), RDF2Vec (Ristoski & Paulheim (2016)) and R-GCN (Schlichtkrull et al. (2018)).