A Hierarchy of Graph Neural Networks Based on Learnable Local Features

Michael Lingzhi Li
Operations Research Center
Massachusetts Institute of Technology
Cambridge, MA 02139
mlli@mit.edu
&Meng Dong
Institute for Applied Computational Science
Harvard University
Cambridge, MA 02138
mengdong@g.harvard.edu
&Jiawei Zhou
Harvard School of Engineering and Applied Sciences
Harvard University
Cambridge, MA 02138
jzhou02@g.harvard.edu
&Alexander M. Rush
Harvard School of Engineering and Applied Sciences
Harvard University
Cambridge, MA 02138
srush@seas.harvard.edu
Website: https://mlli.mit.edu
Abstract

Graph neural networks (GNNs) are a powerful tool for learning representations on graphs by iteratively aggregating features from node neighbourhoods. Many variant models have been proposed, but there is limited understanding of both how to compare different architectures and how to construct GNNs systematically. Here, we propose a hierarchy of GNNs based on their aggregation regions. We derive theoretical results about the discriminative power and feature representation capabilities of each class. Then, we show how this framework can be utilized to systematically construct arbitrarily powerful GNNs. As an example, we construct a simple architecture that exceeds the expressiveness of the Weisfeiler-Lehman graph isomorphism test. We empirically validate our theory on both synthetic and real-world benchmarks, and demonstrate that our example's theoretical power translates to strong results on node classification, graph classification, and graph regression tasks.

1 Introduction

Graphs arise naturally in the world and are key to applications in chemistry, social media, finance, and many other areas. Understanding graphs is important and learning graph representations is a key step. Recently, there has been an explosion of interest in utilizing graph neural networks (GNNs), which have shown outstanding performance across tasks (e.g. kipf2016semi, velivckovic2017graph). Generally, we consider node-feature GNNs which operate recursively to aggregate representations from a neighbouring region (gilmer2017neural).

In this work, we propose a representational hierarchy of GNNs, and derive the discriminative power and feature representation capabilities of each class. Importantly, while most previous work has focused on GNNs aggregating over vertices in the immediate neighbourhood, we consider GNNs aggregating over arbitrary connected subgraphs containing the node. We show that, under mild conditions, there is in fact only a small class of subgraphs that are valid aggregation regions. These subgraphs provide a systematic way of defining a hierarchy for GNNs.

Using this hierarchy, we can derive theoretical results which provide insight into GNNs. For example, we show that no matter how many layers are added, networks which only aggregate over immediate neighbors cannot learn the number of triangles in a node's neighbourhood. We demonstrate that many popular frameworks, including GCN (kipf2016semi; throughout the paper, GCN refers specifically to the model proposed in that work), GAT (velivckovic2017graph), and N-GCN (abu2018n), are unified under our framework. We also compare each class using the Weisfeiler-Lehman (WL) isomorphism test (weisfeiler1968reduction), and conclude that our hierarchy is able to generate arbitrarily powerful GNNs. We then utilize it to systematically generate GNNs exceeding the discriminating power of the 1-WL test.

Experiments utilize both synthetic datasets and standard GNN benchmarks. We show that the method is able to learn difficult graph properties where standard GCNs fail, even with multiple layers. On benchmark datasets, our proposed GNNs are able to achieve strong results on multiple datasets covering node classification, graph classification, and graph regression.

2 Related Work

Numerous works (see li2015gated, atwood2016diffusion, defferrard2016convolutional, kipf2016semi, niepert2016learning, santoro2017simple, velivckovic2017graph, verma2018graph, zhang2018end, ivanov2018anonymous, wu2019simplifying for examples) have constructed different architectures to learn graph representations. Collectively, GNNs have pushed the state of the art on many different graph tasks, including node classification and graph classification/regression. However, there are relatively few works that attempt to understand or categorize GNNs theoretically.

scarselli2009computational presented one of the first works investigating the capabilities of GNNs, showing that GNNs are able to approximate a large class of functions on graphs (those satisfying preservation of the unfolding equivalence) arbitrarily well. A recent work by xu2018powerful also explored the theoretical properties of GNNs. Its definition of GNNs is limited to those that aggregate features over the immediate neighbourhood, and thus is a special case of our general framework. We also show that the paper's conclusion that GNNs are at most as powerful as the Weisfeiler-Lehman test fails to hold under a simple extension.

Survey works including zhou2018graph and wu2019comprehensive give an overview of the current field of research in GNNs, and provide structural classifications of GNNs. We differ in our motivation to categorize GNNs from a computational perspective. We also note that our classification only covers static node feature graphs, though extensions to more general settings are possible.

The disadvantages of GNNs that use localized filters to propagate information are analyzed in li2018insights. One major problem is their inability to explore global graph structures. To alleviate this, there have been two important works on expanding neighbourhoods: N-GCN (abu2018n) feeds higher-degree polynomials of the adjacency matrix to multiple instantiations of GCNs, and morris2018weisfeiler generalizes GNNs to k-GNNs by constructing a set-based k-WL to consider higher-order neighbourhoods and capture information beyond the node level. We compare architectures constructed using our hierarchy to these previous works in the experiments, and show that systematic construction of higher-order neighbourhoods brings an advantage across different tasks.

3 Background

Let G = (V, E) denote an undirected and unweighted graph, where V is the set of nodes and E is the set of edges. Unless otherwise specified, we include a self-loop for every node v ∈ V. Let A be the graph's adjacency matrix. Denote by d(u, v) the distance between two nodes u and v on the graph, defined as the minimum length of a walk between u and v. We further write deg(v) for the degree of node v, and N(v) for the set of nodes in the direct neighborhood of v (including v itself).

Graph Neural Networks (GNNs) utilize the structure of a graph and node features to learn a refined representation of each node. With d denoting the input feature size, each node v has a feature vector x_v ∈ R^d.

A GNN is a function that, for every layer k and at every node v, aggregates features over a connected subgraph S(v) containing the node v, and updates a hidden representation h_v^(k). Formally, we can define the k-th layer of a GNN (with h_v^(0) = x_v) as

h_v^(k) = COMBINE^(k)( h_v^(k-1), AGGREGATE^(k)( { h_u^(k-1) }_{u ∈ S(v)}, G|_{S(v)} ) ),

where ·|_{S(v)} is the restriction symbol over the domain S(v), the aggregation subgraph. The aggregation function is invariant with respect to the labeling of the nodes. The aggregation function AGGREGATE^(k) summarizes information from the neighbouring region S(v), while the combination function COMBINE^(k) joins this information with the previous hidden features to produce a new representation.
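To make this concrete, the following is a minimal sketch of one such layer, assuming a sum AGGREGATE over the immediate neighbourhood and a single linear COMBINE followed by a tanh nonlinearity; S(v) here stands in for any valid aggregation region.

```python
# A minimal sketch of one GNN layer in the notation above, assuming a sum
# AGGREGATE over the immediate neighbourhood (S(v) = N_D^1) and a linear
# COMBINE followed by tanh. Shapes: A is (n, n) with self-loops, H is (n, d),
# W_self and W_agg are (d, d').
import numpy as np

def gnn_layer(A, H, W_self, W_agg):
    n = A.shape[0]
    H_new = np.zeros((n, W_self.shape[1]))
    for v in range(n):
        S_v = np.nonzero(A[v])[0]        # aggregation region: neighbours of v (self-loop included)
        a_v = H[S_v].sum(axis=0)         # permutation-invariant AGGREGATE over S(v)
        H_new[v] = np.tanh(H[v] @ W_self + a_v @ W_agg)   # COMBINE with previous hidden state
    return H_new
```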

For different tasks, these GNNs are combined with an output layer to coerce the final output into an appropriate shape. Examples include fully-connected layers (xu2018powerful), convolutional layers (zhang2018end), and simple summation (verma2018graph). These output layers are task-dependent and not graph-dependent, so we omit them from our framework and consider the node-level output of the final layer as the output of the GNN.

We consider three representative GNN variants in terms of this notation, where W^(k) is a learnable weight matrix at layer k (for simplicity we present the versions without feature normalization using node degrees):

  • Graph Convolutional Networks (GCN) (kipf2016semi): h_v^(k) = σ( W^(k) Σ_{u ∈ N(v)} h_u^(k-1) ).

  • Graph Attention Networks (GAT) (velivckovic2017graph): h_v^(k) = σ( Σ_{u ∈ N(v)} α_{vu}^(k) W^(k) h_u^(k-1) ), where the attention coefficients α_{vu}^(k) are learned from the features of u and v.

  • N-GCN (abu2018n) (2-layer case): separate GCN instantiations are fed with A and A^2, so the layer aggregates over all nodes reachable by walks of length at most 2, and their outputs are combined.
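The matrix forms of the first and last variants can be sketched as follows (a hedged illustration without degree normalization, per the note above; GAT replaces the uniform neighbour sum with learned attention weights):

```python
# Hedged matrix-form sketches of the aggregations listed above, without degree
# normalization. W, W1, W2 stand in for the learnable weight matrices; the
# cited papers differ in normalization and implementation details.
import numpy as np

def relu(X):
    return np.maximum(X, 0)

def gcn_layer(A, H, W):
    # sum of features over the immediate neighbourhood N(v): region N_D^1
    return relu(A @ H @ W)

def ngcn_two_hop_layer(A, H, W1, W2):
    # 2-layer N-GCN case: GCN instantiations fed with A and A^2, concatenated,
    # so the aggregation region grows to N_D^2
    return np.concatenate([relu(A @ H @ W1), relu(A @ A @ H @ W2)], axis=1)
```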

4 Hierarchical Framework for Constructing GNNs

Our proposed framework uses walks to specify a hierarchy of aggregation ranges. The aggregation function over a node v is a permutation-invariant function over a connected subgraph S(v). Consider the simplest case, using the neighbouring vertices N(v), as utilized by many popular architectures (e.g. GCN, GAT). In this case S(v) is a star-shaped subgraph, as illustrated in Figure 1. We refer to it as N_D^1(v), which in terms of walks is the union of all edges and nodes in length-2 walks that start and end at v.

To build a hierarchy, we consider the benefits of longer walks. The next simplest graph feature is the triangles in the neighbourhood of v. Knowledge of the connections between the neighbouring nodes of v is necessary for considering triangles. A natural formulation using walks would be length-3 walks that start and end at v. A length-3 returning walk outlines a triangle, and the union of all length-3 returning walks induces a subgraph, formed by all nodes and edges included in those walks. This is illustrated in Figure 1 as N_L^1(v).

Definition 1.

Define the set of all walks of length l returning to v as W_l(v). For k ∈ N, we define N_D^k(v) as the subgraph formed by all the edges and nodes in W_{2k}(v), while N_L^k(v) is defined as the subgraph formed by all the nodes and edges in W_{2k+1}(v).

Intuitively, N_D^k(v) is a subgraph of G consisting of all nodes and edges in the k-hop neighbourhood of node v, and it differs from N_L^k(v) only by excluding the edges between the distance-k neighbors of v. We explore this further in Section 5. An example illustration of the neighbourhoods defined above is shown in Figure 1.
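A small sketch that materializes the two regions directly from shortest-path distances, using the characterization just given (it assumes the self-loop convention of Section 3):

```python
# Sketch: build N_D^k(v) and N_L^k(v) from shortest-path distances. N_L^k keeps
# the full induced subgraph on nodes within k hops of v; N_D^k drops the edges
# joining two distance-k neighbours.
import networkx as nx

def aggregation_region(G, v, k, kind="D"):
    dist = nx.single_source_shortest_path_length(G, v, cutoff=k)
    nodes = set(dist)
    H = nx.Graph()
    H.add_nodes_from(nodes)
    for u, w in G.subgraph(nodes).edges():
        if kind == "D" and dist[u] == k and dist[w] == k:
            continue                     # such edges are excluded from N_D^k
        H.add_edge(u, w)
    return H
```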

This set of subgraphs naturally induces a hierarchy with increasing aggregation region:

Definition 2.

The D-L hierarchy of aggregation regions for a node v in a graph G is, in increasing order:

(1)    N_D^1(v) ⊆ N_L^1(v) ⊆ N_D^2(v) ⊆ N_L^2(v) ⊆ ⋯ ⊆ N_D^k(v) ⊆ N_L^k(v) ⊆ ⋯

where k ∈ N.

Next, we consider the properties of this hierarchy. One important property is completeness - that the hierarchy can classify every possible GNN. Note that there is no meaningful complete hierarchy if the aggregation region S(v) is arbitrary. Therefore, we limit our focus to regions that can be defined as a function of the distance from v. Absent specific graph structures, distance is a canonical metric between vertices, and this definition includes all examples listed in Section 3. Under this assumption, we can show that the D-L hierarchy is complete:

Theorem 1.

Consider a GNN defined by its action at each layer:

(2)    h_v^(k) = COMBINE^(k)( h_v^(k-1), AGGREGATE^(k)( { h_u^(k-1) }_{u ∈ S(v)}, G|_{S(v)} ) ).

Assume the aggregation region S(v) can be defined as a univariate function of the distance from v. Then both of the following statements are true for all k ∈ N:

  • If N_D^k(v) ⊆ S(v) ⊆ N_L^k(v), then S(v) = N_D^k(v) or S(v) = N_L^k(v).

  • If N_L^k(v) ⊆ S(v) ⊆ N_D^(k+1)(v), then S(v) = N_L^k(v) or S(v) = N_D^(k+1)(v).

This theorem shows that one cannot create an aggregation region based on node distance that is "in between" the hierarchy defined. With Theorem 1, we can use the D-L aggregation hierarchy to create a hierarchy of GNNs based on their aggregation regions.

Figure 1: Illustration of D-L aggregation regions. Dashed circles represent neighborhoods of different hops. From left to right: N_D^1, N_L^1, N_D^2, and N_L^2. Both N_D^k and N_L^k include the nodes within the k-hop neighborhood, but N_D^k does not include edges between nodes on the outermost ring whereas N_L^k does.
Definition 3.

For k ∈ N, GNN_D^k is the set of all graph neural networks with aggregation region N_D^k that are not members of a lower class in the hierarchy. GNN_L^k is the set of all graph neural networks with aggregation region N_L^k that are not members of a lower class.

We explicitly exclude those belonging to a lower aggregation region in order to make the hierarchy well-defined (otherwise a GNN of a given order would trivially also be one of every higher order). The lowest class, GNN_D^1, has no such exclusion.

4.1 Constructing D-L GNNs

The D-L hierarchy can be used both to classify existing GNNs and to construct new models. We first note that all GNNs which aggregate over immediate neighbouring nodes fall in the class GNN_D^1. For example, the Graph Convolutional Network (GCN) defined in Section 3 is in GNN_D^1 since its aggregation region is N_D^1, and similarly the N-GCN example is in GNN_D^2. Note that these classes are defined by the subgraph used by the aggregation function; this does not imply that the networks reach the maximum discriminatory power of their class (defined in the next section).

We can use basic building blocks to implement different levels of GNNs. These examples are not meant to be exhaustive and only serve as a glimpse of what could be achieved with this framework.

Examples.

For every k ∈ N:

  • Any GNN whose aggregation is A^k X (in matrix form), i.e. a sum of features over all length-k walks from each node, is a GNN of class GNN_D^k.

  • Any GNN whose aggregation additionally uses one further length-1 step from the nodes in the k-hop neighbourhood is a GNN of class GNN_L^k.

  • Any GNN whose aggregation is diag(A^(2k+1)), i.e. the per-node count of length-(2k+1) closed walks, is a GNN of class GNN_L^k.

Intuitively, [A^k]_{vu} counts all length-k walks from v to u, which covers all nodes in the k-hop neighbourhood. The difference between the first and the second example above is that the second allows one extra length-1 step from the nodes in the k-hop neighbourhood, which promotes it to the class GNN_L^k. Note the simplicity of the first and the last examples: in matrix form the first is A^k X while the last is diag(A^(2k+1)).
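A hedged sketch of these matrix-form building blocks: walk_aggregates stacks A^j X features for j = 1, ..., k, and closed_walk_counts computes diag(A^l) per node.

```python
# Hedged sketch of the matrix-form building blocks. walk_aggregates(A, X, k)
# stacks A^1 X, ..., A^k X (features summed over all length-j walks from each
# node); closed_walk_counts(A, l) returns diag(A^l), the per-node number of
# length-l closed walks.
import numpy as np

def walk_aggregates(A, X, k_max):
    feats, Ak = [], np.eye(A.shape[0])
    for _ in range(k_max):
        Ak = Ak @ A
        feats.append(Ak @ X)
    return np.concatenate(feats, axis=1)

def closed_walk_counts(A, length):
    return np.diagonal(np.linalg.matrix_power(A, length)).copy()
```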

The building blocks can be gradually added to the original aggregation function. This is particularly useful if an experimenter knows there are higher-level properties that are necessary to compute; for instance, to incorporate knowledge of triangles, one can design the following network (see Section 6 for more details):

(3)    h_v^(k) = σ( W_1^(k) Σ_{u ∈ N(v)} h_u^(k-1) + ε W_2^(k) [A^3]_{vv} ),

where W_1^(k), W_2^(k), and ε are learnable weights.
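Since Equation (3) is given here only schematically, the following PyTorch module is one hypothetical instantiation: it augments the GCN neighbour aggregation with a per-node diag(A^3) channel (cf. Section 8.3), mixed in via a sigmoid-bounded learnable weight as described in Section 6.1. The exact layer used in the experiments may differ.

```python
# One hypothetical instantiation of Eq. (3): GCN-style neighbour aggregation
# plus a triangle channel built from diag(A^3), mixed in through a
# sigmoid-bounded learnable weight. This is a sketch, not necessarily the
# exact layer used in the experiments.
import torch
import torch.nn as nn

class TriangleAwareLayer(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.lin_nbr = nn.Linear(in_dim, out_dim, bias=False)   # weights on the neighbour sum
        self.lin_tri = nn.Linear(1, out_dim, bias=False)        # weights on the triangle channel
        self.eps = nn.Parameter(torch.zeros(1))                 # learnable mixing weight

    def forward(self, A, H):
        A_ns = A - torch.diag(torch.diagonal(A))                 # drop self-loops before counting
        tri = torch.diagonal(A_ns @ A_ns @ A_ns).unsqueeze(1) / 2.0  # triangles through each node
        return torch.relu(self.lin_nbr(A @ H) + torch.sigmoid(self.eps) * self.lin_tri(tri))
```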

5 Theoretical Properties

We can prove interesting theoretical properties for each class of graph neural networks in this hierarchy. To do this, we utilize the Weisfeiler-Lehman test, a powerful classical algorithm used to discriminate between potentially isomorphic graphs. In the interest of brevity, its introduction is included in Appendix Section 8.1.

We define the terminology of "discriminating graphs" formally below:

Definition 4.

The discriminative power of a function f over graphs is the set of graphs D(f) such that for every pair of graphs G_1, G_2 ∈ D(f), we have f(G_1) = f(G_2) if and only if G_1 and G_2 are isomorphic. We say f decides G_1 and G_2 as isomorphic if f(G_1) = f(G_2), and vice versa.

Essentially, D(f) is the set of graphs for which f can decide correctly whether any two of them are isomorphic or not. We say f has a greater discriminative power than g if D(g) ⊆ D(f). We first recall a theorem proven by xu2018powerful:

Theorem 2.

The maximum discriminative power of the set of GNNs in GNN_D^1 is less than or equal to that of the 1-dimensional WL test.

Their framework only included GNNs in GNN_D^1, and they upper bounded the discriminative power of such GNNs. With our generalized framework, we are able to prove a slightly surprising result:

Theorem 3.

The maximum discriminative power of the set of GNNs in GNN_L^1 is strictly greater than that of the 1-dimensional WL test.

This result is central to understanding GNNs. Even though the discriminative power of GNN_D^1 is at most that of the 1-WL test, Theorem 3 shows that just by adding the connections between the immediate neighbors of each node (moving from N_D^1 to N_L^1), we can achieve theoretically greater discriminative power.

One particular implication is that GNNs with maximal discriminative power in GNN_L^1 can count the number of triangles in a graph, while those in GNN_D^1 cannot, no matter how many layers are added. This goes against the intuition that more layers allow GNNs to aggregate information from farther nodes: GNN_D^1 is unable to aggregate the triangle information present in the N_L^1 region, which is important in many applications (see frank1986markov, tsourakakis2011spectral, becchetti2008efficient, eckmann2002curvature).

Unfortunately, this is the only positive result we are able to establish regarding the WL test, as the k-dimensional WL test is not a local method for k ≥ 2. Nevertheless, we are able to prove that our hierarchy admits arbitrarily powerful GNNs through the following theorem:

Theorem 4.

For every n, there exists a GNN at some level of the D-L hierarchy that is able to discriminate all graphs with n nodes.

This shows that as we move up the hierarchy, we are able to discriminate all graphs. We record the full set of results proven in Table 1.

GNN Class | Computational Complexity | Maximum Discriminatory Power | Possible Learned Features
GNN_D^1 |  | At most 1-WL | Node degree
GNN_L^1 |  | Beyond 1-WL; all graphs of bounded size | All cliques; length-3 cycles (triangles)
GNN_D^2 |  | Beyond 1-WL; all graphs of bounded size | Length-2 walks; length-4 cycles
GNN_D^k |  | Beyond 1-WL; all graphs of bounded size | Length-k walks; length-2k cycles
GNN_L^k |  | Beyond 1-WL; all graphs of bounded size | Length-(2k+1) cycles
Table 1: Properties of different GNN classes. The complexity column gives the upper-bound computational complexity when the maximum discriminatory power is obtained; here we assume the hidden size is the same as the feature input size. The final column contains examples of features that can be learned by each class.

The key ingredients for proving these results are contained in Appendices 8.3 and 8.4. Here we see that at the GNN_L^1 class, we are theoretically able to learn all cliques (as cliques are by definition fully connected). As we gradually move upward in the hierarchy, we are able to learn more far-reaching features such as longer walks and cycles, while the discriminatory power improves. We also note that the theoretical complexity increases as we move up the hierarchy.

6 Experiments

We consider the capability of two specific GNN instantiations motivated by this framework, which we call GCN-L1 and GCN-D2. These can be seen as extensions of the GCN introduced in kipf2016semi. The first, GCN-L1, equips the GNN with the ability to count triangles. The second, GCN-D2, can further count the number of 4-cycles. We note their theoretical power below (the proof follows from Theorem 3):

Corollary 1.

The maximum discriminative power of GCN-L1 and GCN-D2 is strictly greater than that of the 1-dimensional WL test.

We compare the performance of GCN-L1, GCN-D2, and other state-of-the-art GNN variants on both synthetic and real-world tasks. (The code is available in a public GitHub repository; reviewers have anonymized access through the supplementary materials, and the synthetic datasets are included in the codebase as well.) For the combine function of GCN, GCN-L1, and GCN-D2, we use a multi-layer perceptron (MLP) with LeakyReLU activations, similar to xu2018powerful.
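A hedged sketch of such a combine function, assuming a GIN-style update MLP((1 + eps) h_v + a_v) with LeakyReLU activations in the spirit of xu2018powerful:

```python
# Hedged sketch of the combine function, assuming a GIN-style update
# MLP((1 + eps) * h_v + a_v) with LeakyReLU activations (cf. xu2018powerful).
import torch
import torch.nn as nn

class CombineMLP(nn.Module):
    def __init__(self, dim, hidden):
        super().__init__()
        self.eps = nn.Parameter(torch.zeros(1))
        self.mlp = nn.Sequential(
            nn.Linear(dim, hidden), nn.LeakyReLU(),
            nn.Linear(hidden, dim), nn.LeakyReLU(),
        )

    def forward(self, h, a):            # h: previous hidden state, a: aggregated message
        return self.mlp((1 + self.eps) * h + a)
```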

All of our experiments are run with PyTorch 1.2.0 and PyTorch-Geometric 1.2.1 on NVIDIA Tesla P100 GPUs with 16GB memory.

6.1 Synthetic Experiments

To verify our previous claim that, in our proposed hierarchy, GNNs from certain classes are able to learn specific features more effectively, we created two tasks: predicting the number of triangles and the number of 4-cycles in a graph. For each task, the dataset contains 1000 graphs and is generated as follows: we fix the number of nodes in each graph to be 100 and use the Erdős–Rényi random graph model to generate random graphs with edge probability 0.07. We then count the number of patterns of interest. In the 4-cycle dataset, the average number of 4-cycles per graph is 1350, and in the triangle dataset there are 54 triangles on average per graph.
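A sketch of this generation procedure using standard closed-walk identities for the two counts (the counting convention for 4-cycles in our experiments may differ from the one below, which counts each unlabeled 4-cycle once):

```python
# Sketch of the synthetic data generation: Erdos-Renyi graphs with 100 nodes
# and edge probability 0.07, labelled with triangle and 4-cycle counts via
# closed-walk identities (trace(A^3)/6 for triangles; the 4-cycle formula
# counts each unlabeled 4-cycle once and may differ from the paper's convention).
import networkx as nx
import numpy as np

def graph_with_counts(n=100, p=0.07, seed=0):
    G = nx.erdos_renyi_graph(n, p, seed=seed)
    A = nx.to_numpy_array(G)
    deg = A.sum(axis=1)
    A2 = A @ A
    n_triangles = np.trace(A2 @ A) / 6.0
    n_4cycles = (np.trace(A2 @ A2) - 2 * (deg ** 2).sum() + deg.sum()) / 8.0
    return G, n_triangles, n_4cycles

dataset = [graph_with_counts(seed=i) for i in range(1000)]
```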

We perform 10-fold cross-validation and record the average and standard deviation of the evaluation metrics across the 10 folds. We used 16 hidden features and trained the networks using the Adam optimizer with an initial learning rate of 0.001 and L2 regularization. We further apply early stopping on the validation loss with a delay window size of 10. The dropout rate is 0.1. The learning rate is scheduled to reduce if the validation accuracy stops increasing for 10 epochs. We utilize a two-layer MLP in our combine function for GCN, GCN-L1, and GCN-D2, similar to the implementation in xu2018powerful. For training stability, we limited the learnable mixing weights in our models using the sigmoid function.
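For reference, this optimization setup corresponds roughly to the following PyTorch configuration (the model is a stand-in, and the L2 regularization strength is not restated here, so the value below is only a placeholder):

```python
# Rough PyTorch configuration matching the description above. The model is a
# stand-in; the weight_decay value is a placeholder since the exact L2
# strength is not restated here.
import torch
import torch.nn as nn

model = nn.Linear(16, 16)                                  # stand-in for any of the GNN sketches
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3,
                             weight_decay=5e-4)            # placeholder L2 regularization
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="max", patience=10)                    # reduce LR when val. accuracy plateaus
dropout = nn.Dropout(p=0.1)
early_stopping_patience = 10                               # window on validation loss
```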

Results

In our testing, we limited GCN-L1 and GCN-D2 to 1-layer networks. This prevents GCN-L1 and GCN-D2 from utilizing higher-order features to indirectly predict the triangle and 4-cycle counts. Simultaneously, we ensured GCN had the same receptive field as these networks by using 2-layer and 3-layer GCNs, which provided GCN with additional feature representational capability. The baseline is a model that predicts the training mean on the testing set. The results are in Table 2. GCN completely fails to learn the features (worse than a simple mean prediction). However, we see that GCN-L1 effectively learns the triangle counts and greatly outperforms the mean baseline, while GCN-D2 is similarly able to provide a good approximation of the count of 4-cycles, without losing the ability to count triangles. This validates the possible learned features listed in Table 1.

MSE # Triangles    MSE # 4-Cycles
Predict Mean (Baseline)
GCN (2-layer)
GCN (3-layer)
GCN-L1 (1-layer)
GCN-D2 (1-layer)
Table 2: Results of experiments on the synthetic datasets: (i) counting the number of triangles in the graph and (ii) counting the number of 4-cycles in the graph. The reported metric is MSE over the testing set.

6.2 Real-World Benchmarks

We next consider standard benchmark datasets for (i) node classification, (ii) graph classification, and (iii) graph regression tasks. The details of these datasets are presented in Table 3.

Dataset | Category | # Graphs | # Classes | Avg. # Nodes | Avg. # Edges | Task
Cora (yang2016revisiting) | Citation | 1 | 7 | 2,708 | 5,429 | NC
Citeseer (yang2016revisiting) | Citation | 1 | 6 | 3,327 | 4,732 | NC
PubMed (yang2016revisiting) | Citation | 1 | 3 | 19,717 | 44,338 | NC
NCI1 (shervashidze2011weisfeiler) | Bio | 4,110 | 2 | 29.87 | 32.30 | GC
Proteins (KKMMN2016) | Bio | 1,113 | 2 | 39.06 | 72.82 | GC
PTC-MR (KKMMN2016) | Bio | 344 | 2 | 14.29 | 14.69 | GC
MUTAG (borgwardt2005protein) | Bio | 188 | 2 | 17.93 | 19.79 | GC
QM7b (wu2018moleculenet) | Bio | 7,210 | 14 | 16.42 | 244.95 | GR
QM9 (wu2018moleculenet) | Bio | 133,246 | 12 | 18.26 | 37.73 | GR
Table 3: Details of the benchmark datasets used. Task types: NC for node classification, GC for graph classification, GR for graph regression.

The setup of the learning rate scheduling and regularization is the same as in the synthetic tasks. For the citation tasks we used 16 hidden features, while we used 64 for the biological datasets. Since our main modification is the expansion of the aggregation region, our main comparison benchmarks are k-GNN (morris2018weisfeiler) and N-GCN (abu2018n), two previous best attempts at incorporating aggregation regions beyond the immediate neighbours. Note that a k-th order N-GCN can be viewed as aggregating over N_D^k.

We further include results for GAT (velivckovic2017graph), GIN (xu2018powerful), RetGK (zhang2018retgk), GNTK (du2019graph), WL-subtree (shervashidze2011weisfeiler), SHT-PATH (borgwardt2005shortest), and PATCHYSAN (niepert2016learning) to represent some of the best-performing architectures.

Baseline neural network models use a 1-layer perceptron combine function, with the exception of k-GNN, which uses a 2-layer perceptron combine function. Thus, to illustrate the effectiveness of the framework, we only utilize a 1-layer perceptron combine function for our GCN models on all tasks, with the exception of NCI1. Two-layer perceptrons appeared necessary for good performance on NCI1, so we implemented all neural networks with 2-layer perceptrons for this task to ensure a fair comparison. We tuned the learning rates and dropout rates. For numerical stability, we normalize the aggregation function using the degree of the center node only. For the node classification tasks we directly utilize the final-layer output, while for the graph-level tasks we sum over the node representations.

Results

Experimental results on real-world data are summarized in Table 4. According to our experiments, GCN-L1 and GCN-D2 noticeably improve upon GCN across all datasets, due to their ability to combine node features in more complex and nonlinear ways. The improvement is statistically significant for all datasets except Proteins. The results of GCN-L1 and GCN-D2 match the best-performing architectures on most datasets, and lead in numerical averages for Cora, Proteins, QM7b, and QM9 (though not statistically significantly in all cases).

Importantly, the results also show a significant improvement over the two main comparison architectures, k-GNN and N-GCN. We see that further expanding aggregation regions yields diminishing returns on these datasets, and the majority of the benefit is gained with the first-order extension N_L^1. This is in contrast to N-GCN, which skips directly to N_D^k-type aggregation regions, an incomplete hierarchy of aggregation regions. The differential in results illustrates the power of the complete hierarchy, as proven in Theorem 1.

We would especially like to stress the outsized improvement of GCN-L1 on the biological datasets. As described in Table 1, GCN-L1 is able to capture information about triangles, which are highly relevant to the properties of biological molecules. The experimental results verify this intuition, and show how knowledge about the task can lead to targeted GNN design using our framework.

Dataset Cora Citeseer PubMed NCI1 Proteins PTC-MR MUTAG QM7b QM9
GAT
GIN
WL-OA
P.SAN
RetGK
GNTK
S-PATH
N-GCN
k-GNN
GCN
GCN-L1
GCN-D2
Table 4: Results of experiments on real-world datasets. The reported metrics are accuracy for classification tasks and MSE for regression tasks. Figures for comparative methods are taken from the literature, except for those marked with *, which come from our own implementation. The best-performing architectures are highlighted in bold.

7 Conclusion

We propose a theoretical framework to classify GNNs by their aggregation region and discriminative power, proving that the presented framework defines a complete hierarchy for GNNs. We also provide methods to construct powerful GNN models of any class from simple building blocks. Our experimental results show that example models constructed in the proposed way can effectively learn the corresponding graph features, exceeding the capability of the 1-WL algorithm. Aligning with our theoretical analysis, the experiments show that these stronger GNNs can better represent the complex properties of a number of real-world graphs.

References

8 Appendices

8.1 Introduction to Weisfeiler-Lehman Test

The 1-dimensional Weisfeiler-Lehman (WL) test is an iterative vertex classification method widely used in checking graph isomorphism. In the first iteration, the vertices are labeled by their valences. Then at each following step, the label of each vertex is updated according to the multiset of the labels of itself and its neighbors. The algorithm terminates when a stable set of labels is reached. The details of the 1-dimensional WL test are shown in Algorithm 1. Regarding its limitations, cai1992optimal described families of non-isomorphic graphs which the 1-dimensional WL test cannot distinguish.

1: Input: graph G = (V, E), initial labels l_v^(0) for all v ∈ V
2: Output: stabilized labels l_v for all v ∈ V
3: while the labels l_v have not converged do    ▹ until the labels reach stabilization
4:     for v ∈ V do
5:         s_v ← multiset { l_u : u ∈ N(v) }
6:     end for
7:     Sort each s_v and concatenate its elements into a string str_v
8:     l_v ← h(str_v), where h is any injective function, i.e. h(x) = h(y) iff x = y
9: end while
Algorithm 1 1-dimensional Weisfeiler-Lehman (WL)
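A compact Python rendering of Algorithm 1, assuming hashable initial labels (defaulting to node degrees). To compare two graphs, the refinement is run on their disjoint union so that the resulting colours remain comparable; differing colour multisets certify non-isomorphism, while identical multisets are inconclusive.

```python
# Compact rendering of Algorithm 1 (1-dimensional WL colour refinement).
import networkx as nx

def wl_refine(G, labels):
    """One WL pass: relabel each node by (own label, sorted neighbour labels)."""
    sig = {v: (labels[v], tuple(sorted(labels[u] for u in G.neighbors(v)))) for v in G}
    palette = {s: i for i, s in enumerate(sorted(set(sig.values())))}
    return {v: palette[sig[v]] for v in G}

def wl_distinguishes(G1, G2):
    """Run 1-WL on the disjoint union; True means certified non-isomorphic."""
    U = nx.disjoint_union(G1, G2)          # nodes of G1 come first, then G2
    n1 = U.number_of_nodes() - G2.number_of_nodes()
    labels = {v: U.degree(v) for v in U}
    for _ in range(U.number_of_nodes()):
        new_labels = wl_refine(U, labels)
        if len(set(new_labels.values())) == len(set(labels.values())):
            labels = new_labels            # partition stopped refining
            break
        labels = new_labels
    left = sorted(labels[v] for v in range(n1))
    right = sorted(labels[v] for v in range(n1, U.number_of_nodes()))
    return left != right
```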

The k-dimensional Weisfeiler-Lehman (WL) algorithm extends the above procedure from operations on single nodes to operations on k-tuples of nodes.

8.2 Proof of Theorem 1

Theorem.

Consider a GNN defined by its action at each layer:

(4)    h_v^(k) = COMBINE^(k)( h_v^(k-1), AGGREGATE^(k)( { h_u^(k-1) }_{u ∈ S(v)}, G|_{S(v)} ) ).

Assume the aggregation region S(v) can be defined as a univariate function of the distance from v. Then both of the following statements are true for all k ∈ N:

  • If N_D^k(v) ⊆ S(v) ⊆ N_L^k(v), then S(v) = N_D^k(v) or S(v) = N_L^k(v).

  • If N_L^k(v) ⊆ S(v) ⊆ N_D^(k+1)(v), then S(v) = N_L^k(v) or S(v) = N_D^(k+1)(v).

Proof.

We prove by contradiction. Assume, on the contrary, that one of the statements in Theorem 1 is false, and let S(v) be an aggregation region violating it. We separate the two cases below.

Case 1: assume N_D^k(v) ⊊ S(v) ⊊ N_L^k(v). Since N_L^k(v) and N_D^k(v) only differ by the set of edges between the distance-k neighbors of v, S(v) can only contain this set partially. Let E' be the non-empty maximal subset of these edges that is contained in S(v). Since S(v) ⊊ N_L^k(v), there exists an edge (a, b) with d(v, a) = d(v, b) = k such that (a, b) ∈ N_L^k(v) but (a, b) ∉ S(v). Consider a non-identity permutation of the vertices fixing v. Since S(v) is defined only using the distance function and must be permutation invariant, every edge (a', b') with d(v, a') = d(v, b') = k must then lie outside S(v). But then E' is empty, a contradiction.

Case 2: assume N_L^k(v) ⊊ S(v) ⊊ N_D^(k+1)(v). Consider the set difference between N_D^(k+1)(v) and N_L^k(v), denoted as a subgraph D:

(5)    D = N_D^(k+1)(v) \ N_L^k(v), consisting of the nodes at distance k+1 from v together with the edges joining distance-k and distance-(k+1) nodes.

Then S(v) can only contain this set partially. Let D' be the maximal subset of D that is contained in S(v). Since S(v) ⊊ N_D^(k+1)(v), at least one of the following must be true:

  • There exists a node u with d(v, u) = k+1 such that u ∈ N_D^(k+1)(v) but u ∉ S(v).

  • There exists an edge (a, b) with d(v, a) = k and d(v, b) = k+1 such that (a, b) ∈ N_D^(k+1)(v) but (a, b) ∉ S(v).

In the first case, consider a non-identity permutation of the vertices fixing v. Since S(v) is permutation invariant and defined only using the distance, all vertices u with d(v, u) = k+1 must be in N_D^(k+1)(v) but not in S(v). This implies the node part of D' is empty.

Using the same logic in the second case, one concludes that all edges (a, b) with d(v, a) = k and d(v, b) = k+1 must be in N_D^(k+1)(v) but not in S(v). That means the edge part of D' is empty.

Therefore, at least one of the node part and the edge part of D' is empty. Since both cannot be empty (as that would mean S(v) = N_L^k(v)), S(v) must consist of N_L^k(v) together with either only the new edges or only the new nodes. In the former case S(v) is not a valid subgraph (some edges to nodes at distance k+1 from v are in the set, but the nodes themselves are not), and in the latter case it is not connected (the nodes at distance k+1 from v are in the set but none of the edges to them are), so neither is a valid aggregation region in our framework (our definition of a GNN requires the region to be a connected subgraph). Thus, we reach a contradiction. ∎

8.3 Proof of Theorem 3

Theorem.

The maximum discriminative power of the set of GNNs in GNN_L^1 is strictly greater than that of the 1-dimensional WL test.

Proof.

We first note that a GNN with the aggregation function (in matrix notation)

(6)    a^(k) = diag(A^3),

i.e. with per-node aggregate [A^3]_{vv}, is a GNN of class GNN_L^1. Then note that [A^3]_{vv} is the number of length-3 walks that start and end at v. These walks must traverse simple 3-cycles, and thus [A^3]_{vv} is twice the number of triangles that contain v (since for a triangle v-a-b, there are two such walks, v→a→b→v and v→b→a→v). furer2017combinatorial showed that the 1-WL test cannot measure the number of triangles in a graph (while the 2-WL test can), so there exist graphs G_1 and G_2 that the 1-WL test cannot differentiate but that the GNN above can, due to the different numbers of triangles in the two graphs (an example is a pair of regular graphs with the same degree and the same number of nodes).
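This fact is easy to check numerically for a simple graph without self-loops:

```python
# Numerical check that diag(A^3) equals twice the per-node triangle count for
# a simple graph without self-loops.
import networkx as nx
import numpy as np

G = nx.erdos_renyi_graph(30, 0.2, seed=1)
A = nx.to_numpy_array(G)
closed_3_walks = np.diagonal(A @ A @ A)
triangles = np.array([nx.triangles(G, v) for v in G.nodes()])
assert np.allclose(closed_3_walks, 2 * triangles)
```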

Now, xu2018powerful proved that GNN_D^1 has a maximum discriminatory power of 1-WL, and since N_D^1 ⊆ N_L^1, the maximum discriminatory power of GNN_L^1 is at least as great as that of GNN_D^1, which is 1-WL.

Thus, combining these two results gives the required theorem. ∎

We note here that computing A^3 with naive matrix multiplication requires O(n^3) multiplications. However, by exploiting the sparsity and binary nature of A, there exist algorithms that compute the required counts using only additions (razzaque2008fast), and we thus obtain a more favorable bound.

8.4 Proof of Theorem 4

Theorem.

For every n, there exists a GNN at some level of the D-L hierarchy that is able to discriminate all graphs with n nodes.

Proof.

We prove the theorem by induction on the number of nodes n. The base case is simple.

Assume the statement is true for all graphs with fewer than n nodes. We now prove the case of n nodes.

We separate the proof into three cases:

  • G_1 and G_2 are both disconnected.

  • Exactly one of G_1, G_2 is disconnected.

  • G_1 and G_2 are both connected.

If we can create appropriate GNNs discriminating graphs in each of the three cases (say f_1, f_2, f_3), then the concatenated function (f_1, f_2, f_3) can discriminate all graphs. Therefore, we prove the induction step separately for the three cases below.

8.4.1 G_1 and G_2 both disconnected

We use the Graph Reconstruction Conjecture, which is proven in the disconnected case (harary1974survey):

Lemma 1.

For all n, two disconnected graphs with n nodes are isomorphic if and only if the multisets of their (n-1)-node subgraphs are isomorphic.

Let G_1 and G_2 be any two disconnected graphs with n nodes. By the induction assumption, there exists a GNN f that discriminates all graphs with n-1 nodes.

Then, by Lemma 1 above, we know that G_1 and G_2 are isomorphic iff

(7)    { f(S_1^1), …, f(S_n^1) } = { f(S_1^2), …, f(S_n^2) } (as multisets),

where S_1^1, …, S_n^1 are the subgraphs of G_1 of size n-1, and similarly for G_2. We then define g as an injective, permutation-invariant function of this multiset of values.

By Theorem 4.3 in wagstaff2019limitations, such an injective, permutation-invariant function on multisets exists. Therefore, g is a GNN, within the required class, that can discriminate G_1 and G_2. Thus, the induction step is proven.

8.4.2 G_1 and G_2 both connected

In the case where both G_1 and G_2 are connected, let v_1 and v_2 be two vertices, one from each of the two graphs. Note that since G_1 and G_2 have n nodes, by the definition of the aggregation region at this level of the hierarchy, we have S(v_1) = G_1 and S(v_2) = G_2. Since v_1 and v_2 are arbitrary, every node in the two graphs has an aggregation region equal to its entire graph. We then define the aggregation function as

AGGREGATE(v) = LEX( A|_{S(v)} ),

where A|_{S(v)} is the adjacency matrix restricted to the subgraph S(v), and LEX returns the lexicographically smallest ordering of the adjacency matrix as a row-major vector among all isomorphic permutations (where nodes are relabeled). (Strictly speaking, for consistency of the output length, we would require padding to a fixed length if the adjacency matrix restricted to the subgraph is not the full adjacency matrix. However, since we only care about the behavior of the function when G_1 and G_2 are both connected, this scenario never arises, and it is not material to the proof below.) For example, take the path graph on three nodes a-b-c with the labeling (a, b, c):

Its adjacency matrix under this labeling, and under the relabeling that places the middle vertex last, are respectively

[[0, 1, 0], [1, 0, 1], [0, 1, 0]]   and   [[0, 0, 1], [0, 0, 1], [1, 1, 0]].

The row-major vectors of these two adjacency matrices are (0, 1, 0, 1, 0, 1, 0, 1, 0) and (0, 0, 1, 0, 0, 1, 1, 1, 0), of which the latter is lexicographically smaller; checking the remaining relabelings shows it is in fact the smallest, so LEX returns it. Note that this function is permutation invariant.
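A brute-force sketch of this canonicalization step (exponential in the number of nodes, which is immaterial here since the construction is purely existential):

```python
# Brute-force computation of the lexicographically smallest row-major ordering
# of an adjacency matrix over all node relabelings.
from itertools import permutations
import numpy as np

def lex_smallest(A):
    n = A.shape[0]
    best = None
    for perm in permutations(range(n)):
        idx = list(perm)
        flat = tuple(A[np.ix_(idx, idx)].astype(int).ravel())
        if best is None or flat < best:
            best = flat
    return best

# Path graph a-b-c from the example above, middle vertex labelled second:
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]])
print(lex_smallest(A))   # (0, 0, 1, 0, 0, 1, 1, 1, 0)
```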

Then, for G_1 and G_2 connected, A|_{S(v)} is always the adjacency matrix of the full graph. Therefore, if G_1 and G_2 are connected and isomorphic, their adjacency matrices are permutations of each other, and thus the lexicographically smallest orderings of their row-major vector forms are identical. The converse is also clearly true, as the adjacency matrix determines the graph.

Therefore, this function discriminates all connected graphs of n nodes, and the induction step is proven.

8.4.3 G_1 disconnected and G_2 connected

In the case where G_1 is disconnected and G_2 is connected, we define the aggregation function at a node v as the number of vertices in its aggregation region S(v), i.e. AGGREGATE(v) = |S(v)|.

This is a permutation-invariant function. Note that for the connected graph G_2 and any vertex v in G_2, this function returns n, since S(v) = G_2. Therefore, every node of G_2 has the same embedding n.

On the other hand, for the disconnected graph G_1, let C_1, …, C_m be its connected components. Then for a vertex v ∈ C_i, it is clear that S(v) ⊆ C_i, and thus |S(v)| ≤ |C_i| for all v ∈ C_i. Since |C_i| < n by construction for all i, the embeddings of nodes in G_1 and G_2 are never equal when G_2 is connected and G_1 is disconnected.

Therefore, this function discriminates all pairs of graphs with n nodes in which one is connected and one is disconnected, so the induction step is proven.
