A Hierarchy of Graph Neural Networks
Based on Learnable Local Features
Abstract
Graph neural networks (GNNs) are a powerful tool for learning representations on graphs by iteratively aggregating features from node neighbourhoods. Many variant models have been proposed, but there is limited understanding of how to compare different architectures or how to construct GNNs systematically. Here, we propose a hierarchy of GNNs based on their aggregation regions. We derive theoretical results about the discriminative power and feature representation capabilities of each class. We then show how this framework can be used to systematically construct arbitrarily powerful GNNs. As an example, we construct a simple architecture that exceeds the expressiveness of the Weisfeiler-Lehman graph isomorphism test. We empirically validate our theory on both synthetic and real-world benchmarks, and demonstrate that the example's theoretical power translates to strong results on node classification, graph classification, and graph regression tasks.
1 Introduction
Graphs arise naturally in the world and are key to applications in chemistry, social media, finance, and many other areas. Understanding graphs is important, and learning graph representations is a key step. Recently, there has been an explosion of interest in graph neural networks (GNNs), which have shown outstanding performance across tasks (e.g. kipf2016semi, velivckovic2017graph). Generally, we consider node-feature GNNs, which operate recursively to aggregate representations from a neighbouring region (gilmer2017neural).
In this work, we propose a representational hierarchy of GNNs, and derive the discriminative power and feature representation capabilities of each class. Importantly, while most previous work has focused on GNNs aggregating over vertices in the immediate neighbourhood, we consider GNNs aggregating over arbitrary subgraphs containing the node. We show that, under mild conditions, there is in fact only a small class of subgraphs that are valid aggregation regions. These subgraphs provide a systematic way of defining a hierarchy for GNNs.
Using this hierarchy, we derive theoretical results that provide insight into GNNs. For example, we show that no matter how many layers are added, networks that only aggregate over immediate neighbours cannot learn the number of triangles in a node's neighbourhood. We demonstrate that many popular frameworks, including GCN (kipf2016semi) (throughout the paper, we use GCN to refer specifically to the model proposed in kipf2016semi), GAT (velivckovic2017graph), and N-GCN (abu2018n), are unified under our framework. We also compare the classes using the Weisfeiler-Lehman (WL) isomorphism test (weisfeiler1968reduction), and conclude that our hierarchy is able to generate arbitrarily powerful GNNs. We then use it to systematically generate GNNs exceeding the discriminating power of the 1-WL test.
Our experiments use both synthetic datasets and standard GNN benchmarks. We show that our method is able to learn difficult graph properties where standard GCNs fail, even with multiple layers. On benchmark datasets, our proposed GNNs achieve strong results across node classification, graph classification, and graph regression.
2 Related Work
Numerous works (see li2015gated, atwood2016diffusion, defferrard2016convolutional, kipf2016semi, niepert2016learning, santoro2017simple, velivckovic2017graph, verma2018graph, zhang2018end, ivanov2018anonymous, wu2019simplifying for examples) have constructed different architectures to learn graph representations. Collectively, GNNs have pushed the state of the art on many different graph tasks, including node classification and graph classification/regression. However, relatively few works attempt to understand or categorize GNNs theoretically.
scarselli2009computational presented one of the first works investigating the capabilities of GNNs, showing that GNNs can approximate a large class of functions on graphs (those satisfying preservation of the unfolding equivalence) arbitrarily well. A recent work by xu2018powerful also explored the theoretical properties of GNNs. Its definition of GNNs is limited to those that aggregate features in the immediate neighbourhood, and is thus a special case of our general framework. We also show that the paper's conclusion that GNNs are at most as powerful as the Weisfeiler-Lehman test fails to hold under a simple extension.
Survey works, including zhou2018graph and wu2019comprehensive, give an overview of the current field of GNN research and provide structural classifications of GNNs. We differ in that our motivation is to categorize GNNs from a computational perspective. We also note that our classification covers only static node-feature graphs, though extensions to more general settings are possible.
The disadvantages of GNNs that use localized filters to propagate information are analyzed in li2018insights; one major problem is their inability to explore global graph structures. To alleviate this, there have been two important works on expanding neighbourhoods: N-GCN (abu2018n) feeds higher-degree polynomials of the adjacency matrix to multiple instantiations of GCNs, and morris2018weisfeiler generalizes GNNs to $k$-GNNs by constructing a set-based WL test that considers higher-order neighbourhoods and captures information beyond the node level. We compare architectures constructed using our hierarchy to these previous works in the experiments, and show that systematic construction of higher-order neighbourhoods brings an advantage across different tasks.
3 Background
Let $G = (V, E)$ denote an undirected and unweighted graph with $n = |V|$ nodes. Unless otherwise specified, we include self-loops for every node $v \in V$. Let $A$ be the graph's adjacency matrix. Denote by $d(u, v)$ the distance between two nodes $u$ and $v$ on a graph, defined as the minimum length of a walk between $u$ and $v$. We further write $\deg(v)$ for the degree of node $v$, and $N(v)$ for the set of nodes in the direct neighbourhood of $v$ (including $v$ itself).
Graph Neural Networks (GNNs) utilize the structure of a graph $G$ and node features $X \in \mathbb{R}^{n \times d}$, where $d$ is the input feature size, to learn a refined representation of each node; i.e. for each node $v$, we have features $x_v \in \mathbb{R}^{d}$.
A GNN is a function that, for every layer $t$ at every node $v$, aggregates features over a connected subgraph $\Gamma(v)$ containing the node $v$, and updates a hidden representation $h_v^{(t)}$. Formally, we can define the $t$-th layer of a GNN (with $h_v^{(0)} = x_v$):
$h_v^{(t)} = \operatorname{Combine}^{(t)}\big(h_v^{(t-1)},\ \operatorname{Agg}^{(t)}((X^{(t-1)}, A)\,|_{\Gamma(v)})\big),$
where $\cdot\,|_{\Gamma(v)}$ is the restriction symbol over the domain $\Gamma(v)$, the aggregation subgraph, and $X^{(t-1)}$ stacks the hidden features $h_u^{(t-1)}$. The aggregation function $\operatorname{Agg}^{(t)}$ is invariant with respect to the labeling of the nodes; it summarizes information from the neighbouring region $\Gamma(v)$, while the combination function $\operatorname{Combine}^{(t)}$ joins this information with the previous hidden features to produce a new representation.
For different tasks, these GNNs are combined with an output layer to coerce the final output into an appropriate shape. Examples include fully-connected layers (xu2018powerful), convolutional layers (zhang2018end), and simple summation (verma2018graph). These output layers are task-dependent rather than graph-dependent, so we omit them in our framework and treat the node-level output of the final $T$-th layer as the output of the GNN.
We express three representative GNN variants in this notation, where $W^{(t)}$ is a learnable weight matrix at layer $t$ (for simplicity, we present the versions without feature normalization using node degrees):

Graph Convolutional Networks (GCN) (kipf2016semi):
$h_v^{(t)} = \sigma\big(W^{(t)} \sum_{u \in N(v)} h_u^{(t-1)}\big)$

Graph Attention Networks (GAT) (velivckovic2017graph), with learned attention coefficients $\alpha_{vu}^{(t)}$:
$h_v^{(t)} = \sigma\big(W^{(t)} \sum_{u \in N(v)} \alpha_{vu}^{(t)} h_u^{(t-1)}\big)$

N-GCN (abu2018n) (2-layer case):
$h_v^{(t)} = \sigma\big(W^{(t)} \sum_{u \in V} (A^2)_{vu}\, h_u^{(t-1)}\big)$
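As a concrete illustration, the aggregation steps above can be written as matrix products (a minimal NumPy sketch with our own helper names and a toy graph; degree normalization is omitted, as in the variants above):

```python
import numpy as np

def gcn_agg(A, H):
    # GCN-style aggregation: sum features over the immediate neighbourhood
    # (A is assumed to include self-loops, as in Section 3)
    return A @ H

def ngcn_agg(A, H):
    # N-GCN (2-layer case): aggregate over length-2 walk counts via A^2
    return (A @ A) @ H

# toy graph: triangle 0-1-2 with a pendant node 3 attached to node 2,
# self-loops included
A = np.array([[1, 1, 1, 0],
              [1, 1, 1, 0],
              [1, 1, 1, 1],
              [0, 0, 1, 1]], dtype=float)
H = np.eye(4)  # one-hot node features

agg1 = gcn_agg(A, H)   # row v sums the features of N(v)
agg2 = ngcn_agg(A, H)  # row v counts length-2 walks to each node
```

With one-hot features, the GCN aggregation simply recovers the rows of $A$, while the N-GCN aggregation recovers the rows of $A^2$, which makes the difference in aggregation region directly visible.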
4 Hierarchical Framework for Constructing GNNs
Our proposed framework uses walks to specify a hierarchy of aggregation ranges. The aggregation function over a node $v$ is a permutation-invariant function over a connected subgraph $\Gamma(v)$. Consider the simplest case, using the neighbouring vertices $N(v)$, as in many popular architectures (e.g. GCN, GAT). In this case $\Gamma(v)$ is a star-shaped subgraph, as illustrated below in Figure 1. We refer to it as $L_1(v)$, which, in terms of walks, is the union of all edges and nodes in length-2 walks that start and end at $v$.
To build a hierarchy, we consider the benefits of longer walks. The next simplest graph feature is the set of triangles in the neighbourhood of $v$. Knowledge of the connections between the neighbouring nodes of $v$ is necessary for considering triangles. A natural formulation using walks is length-3 walks that start and end at $v$: a length-3 returning walk outlines a triangle, and the union of all length-3 returning walks induces a subgraph, formed by all nodes and edges included in those walks. This is illustrated in Figure 1 as $D_1(v)$.
Definition 1.
Define the set of all walks of length $l$ returning to $v$ as $W_l(v)$. For $k \geq 1$, we define $L_k(v)$ as the subgraph formed by all the edges and nodes in $W_{2k}(v)$, while $D_k(v)$ is defined as the subgraph formed by all the nodes and edges in $W_{2k+1}(v)$.
Intuitively, $L_k(v)$ is a subgraph of $D_k(v)$ consisting of all nodes and edges in the $k$-hop neighbourhood of node $v$, and differs from $D_k(v)$ only by excluding the edges between the distance-$k$ neighbours of $v$. We explore this further in Section 5. An example illustration of the neighbourhoods defined above is shown in Figure 1.
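To make Definition 1 concrete, the subgraph induced by returning walks can be enumerated directly (a brute-force sketch with our own helper names; for clarity the toy adjacency matrix here omits self-loops so that returning walks trace only genuine edges):

```python
import numpy as np
from itertools import product

def returning_walk_subgraph(A, v, length):
    """Nodes and edges covered by walks of the given length that start
    and end at v. Exponential in `length`; intended only for illustration."""
    n = A.shape[0]
    nodes, edges = set(), set()
    # enumerate all candidate walks v -> u_1 -> ... -> u_{length-1} -> v
    for mids in product(range(n), repeat=length - 1):
        walk = (v,) + mids + (v,)
        if all(A[walk[i], walk[i + 1]] for i in range(length)):
            nodes.update(walk)
            edges.update(frozenset((walk[i], walk[i + 1])) for i in range(length))
    return nodes, edges

# triangle 0-1-2 with a pendant node 3 attached to node 2 (no self-loops)
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]])

L1 = returning_walk_subgraph(A, 0, 2)  # length-2 walks: the star around node 0
D1 = returning_walk_subgraph(A, 0, 3)  # length-3 walks: additionally the edge (1, 2)
```

On this toy graph, $L_1(0)$ contains only the edges incident to node 0, while $D_1(0)$ additionally picks up the edge between neighbours 1 and 2, exactly the "triangle-closing" edge that distinguishes the two regions.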
This set of subgraphs naturally induces a hierarchy with increasing aggregation region:
Definition 2.
The D-L hierarchy of aggregation regions for a node $v$ in a graph $G$ is, in increasing order:
(1) $L_1(v) \subseteq D_1(v) \subseteq L_2(v) \subseteq D_2(v) \subseteq \cdots \subseteq L_\infty(v)$
where $L_\infty(v) := \lim_{k \to \infty} L_k(v)$ is the connected component of $G$ containing $v$.
Next, we consider the properties of this hierarchy. One important property is completeness: that the hierarchy can classify every possible GNN. Note that there is no meaningful complete hierarchy if $\Gamma(v)$ is arbitrary. We therefore limit our focus to aggregation regions that can be defined as a function of the distance from $v$. Absent specific graph structures, distance is a canonical metric between vertices, and this definition includes all examples listed in Section 3. Under this assumption, we can show that the D-L hierarchy is complete:
Theorem 1.
Consider a GNN defined by its action at each layer:
(2) $h_v^{(t)} = \operatorname{Combine}^{(t)}\big(h_v^{(t-1)},\ \operatorname{Agg}^{(t)}((X^{(t-1)}, A)\,|_{\Gamma(v)})\big)$
Assume $\Gamma(v)$ can be defined as a univariate function of the distance from $v$. Then both of the following statements are true for all $k \geq 1$:

If $L_k(v) \subseteq \Gamma(v) \subseteq D_k(v)$, then $\Gamma(v) = L_k(v)$ or $\Gamma(v) = D_k(v)$.

If $D_k(v) \subseteq \Gamma(v) \subseteq L_{k+1}(v)$, then $\Gamma(v) = D_k(v)$ or $\Gamma(v) = L_{k+1}(v)$.
This theorem shows that one cannot create an aggregation region based on node distance that is "in between" the levels of the hierarchy. With Theorem 1, we can use the D-L aggregation hierarchy to create a hierarchy of GNNs based on their aggregation regions.
Definition 3.
For $k \geq 1$, $\mathcal{N}_{L_k}$ is the set of all graph neural networks with aggregation region $L_k(v)$ that are not members of $\mathcal{N}_{D_{k-1}}$, and $\mathcal{N}_{D_k}$ is the set of all graph neural networks with aggregation region $D_k(v)$ that are not members of $\mathcal{N}_{L_k}$.
We explicitly exclude networks belonging to a lower aggregation region in order to make the hierarchy well-defined (otherwise a GNN of a lower order would trivially also be one of every higher order). We also implicitly define $\mathcal{N}_{D_0} := \varnothing$.
4.1 Constructing D-L GNNs
The D-L hierarchy can be used both to classify existing GNNs and to construct new models. We first note that all GNNs that aggregate over the immediate neighbouring nodes fall in the class $\mathcal{N}_{L_1}$. For example, the Graph Convolutional Network (GCN) defined in Section 3 is in $\mathcal{N}_{L_1}$ since its aggregation region is $L_1(v)$, and similarly the N-GCN example is in $\mathcal{N}_{L_2}$. Note that these classes are defined by the subgraph used by the aggregation function; membership does not imply that a network reaches the maximum discriminatory power of its class (defined in the next section).
We can use basic building blocks to implement different levels of GNNs. These examples are not meant to be exhaustive and only serve as a glimpse of what could be achieved with this framework.
Examples.
For every $k \geq 1$:

Any GNN with $\operatorname{Agg}^{(t)}(v) = \sum_{u} (A^{k})_{vu}\, h_u^{(t-1)}$ is a GNN of class $\mathcal{N}_{L_k}$.

Any GNN with $\operatorname{Agg}^{(t)}(v) = \sum_{u} (A^{k})_{vu}\, h_u^{(t-1)} + (A^{2k+1})_{vv}\, h_v^{(t-1)}$ is a GNN of class $\mathcal{N}_{D_k}$.

Any GNN with $\operatorname{Agg}^{(t)}(v) = (A^{2k+1})_{vv}\, h_v^{(t-1)}$ is a GNN of class $\mathcal{N}_{D_k}$.
Intuitively, $(A^{k})_{vu}$ counts all length-$k$ walks from $u$ to $v$, which cover all nodes in the $k$-hop neighbourhood. The difference between the first and the second example above is that the second additionally uses length-$(2k+1)$ returning walks, which traverse edges between the nodes in the $k$-hop neighbourhood; this promotes it to the class $\mathcal{N}_{D_k}$. Note the simplicity of the first and the last examples: in matrix form, the first is $A^{k} H^{(t-1)}$ while the last is $\operatorname{diag}(A^{2k+1})\, H^{(t-1)}$.
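In code, the two "simple" building blocks above are one-liners (a sketch with our own function names, under our reading of the matrix forms; the toy adjacency matrix omits self-loops so the diagonal of $A^3$ counts triangle walks only):

```python
import numpy as np

def L_block(A, H, k):
    # L_k-style building block: aggregate with length-k walk counts, A^k H
    return np.linalg.matrix_power(A, k) @ H

def D_block(A, H, k):
    # D_k-style building block: weight each node's own features by its
    # number of length-(2k+1) returning walks, diag(A^(2k+1)) H
    walks = np.linalg.matrix_power(A, 2 * k + 1)
    return np.diag(np.diag(walks)) @ H

# triangle 0-1-2 with a pendant node 3 attached to node 2 (no self-loops)
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
H = np.eye(4)
```

For $k = 1$, `D_block` scales each node by twice the number of triangles it lies on: nodes 0, 1, 2 of the toy graph get weight 2, while the pendant node 3 gets weight 0.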
These building blocks can be gradually added to the original aggregation function. This is particularly useful if an experimenter knows that higher-level properties are necessary for the task. For instance, to incorporate knowledge of triangles, one can design the following network (see Section 6 for more details):
(3) $H^{(t)} = \sigma\big(A\, H^{(t-1)} W_1^{(t)} + \operatorname{diag}(A^{3})\, H^{(t-1)} W_2^{(t)}\big)$
where $W_1^{(t)}, W_2^{(t)}$ are learnable weights.
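A minimal forward pass of such a triangle-aware layer might look like the following (a sketch under our reading of Eq. (3); the weight shapes, random initialization, and ReLU nonlinearity are our own assumptions, and $A$ here has no self-loops):

```python
import numpy as np

rng = np.random.default_rng(0)

def triangle_aware_layer(A, H, W1, W2):
    # (A^3)_vv counts length-3 returning walks at v, i.e. twice the number
    # of triangles containing v (for A without self-loops)
    tri = np.diag(np.diag(np.linalg.matrix_power(A, 3)))
    return np.maximum(A @ H @ W1 + tri @ H @ W2, 0.0)  # ReLU

# triangle 0-1-2 with a pendant node 3 attached to node 2 (no self-loops)
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
H = np.eye(4)
W1 = rng.normal(size=(4, 8))
W2 = rng.normal(size=(4, 8))

out = triangle_aware_layer(A, H, W1, W2)  # shape (4, 8)
```

The second term contributes only at nodes that lie on at least one triangle, which is what lets this layer separate triangle-rich from triangle-free neighbourhoods even with a single layer.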
5 Theoretical Properties
We can prove interesting theoretical properties for each class of graph neural networks in this hierarchy. To do so, we utilize the Weisfeiler-Lehman (WL) test, a powerful classical algorithm used to discriminate between potentially isomorphic graphs. In the interest of brevity, its introduction is deferred to Appendix 8.1.
We define the terminology of "discriminating graphs" formally below:
Definition 4.
The discriminative power of a function $f$ over graphs is the maximal set of graphs $\mathcal{G}(f)$ such that for every pair of graphs $G_1, G_2 \in \mathcal{G}(f)$, we have $f(G_1) = f(G_2)$ iff $G_1 \simeq G_2$, and $f(G_1) \neq f(G_2)$ iff $G_1 \not\simeq G_2$. We say $f$ decides $G_1, G_2$ as isomorphic if $f(G_1) = f(G_2)$, and vice versa.
Essentially, $\mathcal{G}(f)$ is the set of graphs on which $f$ correctly decides whether any two of them are isomorphic. We say $f$ has greater discriminative power than $g$ if $\mathcal{G}(g) \subsetneq \mathcal{G}(f)$. We first recall a theorem proven by xu2018powerful:
Theorem 2.
The maximum discriminative power of the set of GNNs in $\mathcal{N}_{L_1}$ is at most that of the 1-dimensional WL test.
Their framework only included $\mathcal{N}_{L_1}$ GNNs, and they upper-bounded the discriminative power of such GNNs. With our generalized framework, we are able to prove a slightly surprising result:
Theorem 3.
The maximum discriminative power of the set of GNNs in $\mathcal{N}_{D_1}$ is strictly greater than that of the 1-dimensional WL test.
This result is central to understanding GNNs. Even though the discriminative power of $\mathcal{N}_{L_1}$ is at most that of the 1-WL test, Theorem 3 shows that just by adding the connections between the immediate neighbours of each node ($D_1$), we can achieve theoretically greater discriminative power.
One particular implication is that GNNs with maximal discriminative power in $\mathcal{N}_{D_1}$ can count the number of triangles in a graph, while those in $\mathcal{N}_{L_1}$ cannot, no matter how many layers are added. This goes against the intuition that more layers allow GNNs to aggregate information from further nodes: a network restricted to $L_1$ regions is unable to aggregate information about triangles, which is important in many applications (see frank1986markov, tsourakakis2011spectral, becchetti2008efficient, eckmann2002curvature).
Unfortunately, this is the only positive result we are able to establish with respect to the WL hierarchy, as the $k$-dimensional WL test is not a local method for $k \geq 2$. Nevertheless, we are able to prove that our hierarchy admits arbitrarily powerful GNNs through the following theorem:
Theorem 4.
For all $n \geq 1$, there exists a GNN within the class $\mathcal{N}_{D_n}$ that is able to discriminate all graphs with at most $n$ nodes.
This shows that as $n \to \infty$, we are able to discriminate all graphs. We record the full set of results proven in Table 1.
GNN Class  Computational Complexity  Maximum Discriminatory Power  Possible Learned Features
$\mathcal{N}_{L_1}$  —  1-WL  Node degree
$\mathcal{N}_{D_1}$  —  > 1-WL  Triangles, cliques
$\mathcal{N}_{L_k}$  —  > 1-WL  Length-$2k$ walks
$\mathcal{N}_{D_k}$  —  > 1-WL  Length-$(2k+1)$ cycles
The key ingredients for proving these results are contained in Appendices 8.3 and 8.4. We see that at the $\mathcal{N}_{D_1}$ class, we are theoretically able to learn all cliques (as cliques are, by definition, fully connected). As we gradually move upward in the hierarchy, we are able to learn more far-reaching features such as longer walks and cycles, while the discriminatory power improves. We also note that the theoretical complexity increases as $k$ increases.
6 Experiments
We consider the capability of two specific GNN instantiations motivated by this framework: GCN-L1 and GCN-D2. These can be seen as extensions of the GCN introduced in kipf2016semi. The first, GCN-L1, equips the GNN with the ability to count triangles; the second, GCN-D2, can further count the number of 4-cycles. We note their theoretical power below (the proof follows from Theorem 3):
Corollary 1.
The maximum discriminative power of GCN-L1 and GCN-D2 is strictly greater than that of the 1-dimensional WL test.
We compare the performance of GCN-L1, GCN-D2, and other state-of-the-art GNN variants on both synthetic and real-world tasks (the code is available in a public GitHub repository; reviewers have anonymized access through the supplementary materials, and the synthetic datasets are included in the codebase as well). For the combine function of GCN, GCN-L1, and GCN-D2, we use a multilayer perceptron (MLP) with LeakyReLU activation, similar to xu2018powerful.
All of our experiments are run with PyTorch 1.2.0, PyTorchGeometric 1.2.1, and we use NVIDIA Tesla P100 GPUs with 16GB memory.
6.1 Synthetic Experiments
To verify our claim that GNNs from certain classes in the proposed hierarchy can learn specific features more effectively, we created two tasks: predicting the number of triangles and the number of 4-cycles in a graph. For each task, the dataset contains 1000 graphs and is generated as follows: we fix the number of nodes in each graph to 100 and use the Erdős–Rényi random graph model to generate random graphs with edge probability 0.07. We then count the number of patterns of interest. In the 4-cycle dataset, the average number of 4-cycles per graph is 1350, and in the triangle dataset there are 54 triangles per graph on average.
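A dataset of this kind can be generated with standard closed-walk identities (a sketch with our own function names; the Erdős–Rényi parameters follow the text, and the counting formulas assume a simple graph without self-loops):

```python
import numpy as np

def er_adjacency(n, p, rng):
    # Erdős–Rényi G(n, p): undirected simple graph, no self-loops
    upper = np.triu(rng.random((n, n)) < p, k=1).astype(float)
    return upper + upper.T

def count_triangles(A):
    # each triangle is traversed 6 times among closed walks of length 3
    return int(round(np.trace(np.linalg.matrix_power(A, 3)) / 6))

def count_4cycles(A):
    # tr(A^4) = 8*C4 + 2*sum(deg^2) - 2m, hence
    # C4 = (tr(A^4) - 2*sum(deg^2) + 2m) / 8
    deg = A.sum(axis=1)
    m = deg.sum() / 2
    t4 = np.trace(np.linalg.matrix_power(A, 4))
    return int(round((t4 - 2 * (deg ** 2).sum() + 2 * m) / 8))

rng = np.random.default_rng(0)
graphs = [er_adjacency(100, 0.07, rng) for _ in range(10)]
labels = [(count_triangles(A), count_4cycles(A)) for A in graphs]
```

Both targets are global pattern counts, which is what makes the tasks hard for purely $L_1$-style aggregation: the labels depend on edges between neighbours, not just on degrees.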
We perform 10-fold cross-validation and record the average and standard deviation of the evaluation metrics across the 10 folds. We used 16 hidden features and trained the networks using the Adam optimizer with an initial learning rate of 0.001 and weight-decay regularization. We further apply early stopping on the validation loss with a delay window of size 10. The dropout rate is 0.1. The learning rate is scheduled to decrease if the validation accuracy stops increasing for 10 epochs. We used a two-layer MLP in the combine function for GCN, GCN-L1, and GCN-D2, similar to the implementation in xu2018powerful. For training stability, we bounded the (potentially large) walk-count terms in our models using the sigmoid function.
Results
In our testing, we limited GCN-L1 and GCN-D2 to 1-layer networks. This prevents them from utilizing higher-order features to indirectly predict the triangle and 4-cycle counts. At the same time, we ensured GCN had the same receptive field as these networks by using 2-layer and 3-layer GCNs, which also provides GCN with additional feature representation capability. The baseline is a model that predicts the training mean on the test set. The results are shown in Table 2. GCN completely fails to learn the features (performing worse than a simple mean prediction). In contrast, GCN-L1 effectively learns the triangle counts and greatly outperforms the mean baseline, while GCN-D2 similarly provides a good approximation of the 4-cycle count without losing the ability to count triangles. This validates the "possible learned features" column of Table 1.
MSE # Triangles  MSE # 4-Cycles
Predict Mean (Baseline)
GCN (2-layer)
GCN (3-layer)
GCN-L1 (1-layer)
GCN-D2 (1-layer)
6.2 RealWorld Benchmarks
We next consider standard benchmark datasets for (i) node classification, (ii) graph classification, (iii) graph regression tasks. The details of these datasets are presented in Table 3.
Dataset  Category  # Graphs  # Classes  # Nodes Avg.  # Edges Avg.  Task
Cora (yang2016revisiting)  Citation  1  7  2,708  5,429  NC 
Citeseer (yang2016revisiting)  Citation  1  6  3,327  4,732  NC 
PubMed (yang2016revisiting)  Citation  1  3  19,717  44,338  NC 
NCI1 (shervashidze2011weisfeiler)  Bio  4,110  2  29.87  32.30  GC 
Proteins (KKMMN2016)  Bio  1,113  2  39.06  72.82  GC 
PTC-MR (KKMMN2016)  Bio  344  2  14.29  14.69  GC
MUTAG (borgwardt2005protein)  Bio  188  2  17.93  19.79  GC 
QM7b (wu2018moleculenet)  Bio  7,210  14  16.42  244.95  GR 
QM9 (wu2018moleculenet)  Bio  133,246  12  18.26  37.73  GR 
The learning-rate scheduling and regularization settings are the same as in the synthetic tasks. For the citation tasks we used 16 hidden features, while for the biological datasets we used 64. Since our main modification is the expansion of the aggregation region, our main comparison benchmarks are $k$-GNN (morris2018weisfeiler) and N-GCN (abu2018n), the two previous best attempts at incorporating aggregation regions beyond the immediate neighbourhood. Note that we can view a $k$-th order N-GCN as aggregating over $L_k(v)$.
We further include results for GAT (velivckovic2017graph), GIN (xu2018powerful), RetGK (zhang2018retgk), GNTK (du2019graph), WL-subtree (shervashidze2011weisfeiler), SHT-PATH (borgwardt2005shortest), and PATCHY-SAN (niepert2016learning) to represent some of the best-performing architectures.
Baseline neural network models use a 1-layer perceptron combine function, with the exception of $k$-GNN, which uses a 2-layer perceptron. Thus, to illustrate the effectiveness of the framework, we only utilize a 1-layer perceptron combine function for our GCN models on all tasks except NCI1. Two-layer perceptrons seemed necessary for good performance on NCI1, so we implemented all neural networks with 2-layer perceptrons for this task to ensure a fair comparison. We tuned the learning rates and dropout rates. For numerical stability, we normalize the aggregation function using the degree of $v$ only. For the node classification tasks, we directly utilized the final-layer output, while for the graph-level tasks we summed over the node representations.
Results
Experimental results on real-world data are summarized in Table 4. In our experiments, GCN-L1 and GCN-D2 noticeably improve upon GCN across all datasets, due to their ability to combine node features in more complex and nonlinear ways. The improvement is statistically significant for all datasets except Proteins. The results of GCN-L1 and GCN-D2 match the best-performing architectures on most datasets, and lead in numerical averages on Cora, Proteins, QM7b, and QM9 (though not always statistically significantly).
Importantly, the results also show a significant improvement over the two main comparison architectures, $k$-GNN and N-GCN. We see that further expanding aggregation regions generates diminishing returns on these datasets, with the majority of the benefit gained in the first-order extension $D_1$. This is in contrast to N-GCN, which used only $L_k$-type aggregation regions (skipping each $D_k$), an incomplete hierarchy of aggregation regions. The differential in results illustrates the power of the complete hierarchy as proven in Theorem 1.
We especially stress the outsized improvement of GCN-L1 on the biological datasets. As described above, GCN-L1 is able to capture information about triangles, which are highly relevant to the properties of biological molecules. The experimental results verify this intuition, and show how knowledge about the task can lead to targeted GNN design using our framework.
Dataset  Cora  Citeseer  PubMed  NCI1  Proteins  PTC-MR  MUTAG  QM7b  QM9
GAT  
GIN  
WLOA  
P.SAN  
RetGK  
GNTK  
SPATH  
N-GCN  
k-GNN  
GCN  
GCN-L1  
GCN-D2 
7 Conclusion
We propose a theoretical framework that classifies GNNs by their aggregation region and discriminative power, proving that the framework defines a complete hierarchy of GNNs. We also provide methods to construct powerful GNN models of any class from various building blocks. Our experimental results show that models constructed in the proposed way can effectively learn the corresponding graph features, exceeding the capability of the 1-WL algorithm. Aligning with our theoretical analysis, the experiments show that these stronger GNNs can better represent the complex properties of a number of real-world graphs.
References
8 Appendixes
8.1 Introduction to WeisfeilerLehman Test
The 1-dimensional Weisfeiler-Lehman (1-WL) test is an iterative vertex classification method widely used for checking graph isomorphism. In the first iteration, the vertices are labeled by their valences. At each subsequent step, the label of each vertex is updated with the multiset of the labels of itself and its neighbours. The algorithm terminates when a stable set of labels is reached. The details of the 1-WL test are shown in Algorithm 1. Regarding its limitations, cai1992optimal described families of non-isomorphic graphs that the 1-WL test cannot distinguish.
The $k$-dimensional Weisfeiler-Lehman ($k$-WL) algorithm extends the above procedure from operations on nodes to operations on $k$-tuples of nodes.
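For reference, 1-WL colour refinement can be implemented in a few lines (a sketch with our own function names; since each iteration only refines the colour partition, an unchanged number of colour classes signals stability):

```python
def wl_colors(adj, max_iters=None):
    """1-WL colour refinement. `adj` maps each node to a list of neighbours."""
    colors = {v: len(adj[v]) for v in adj}  # initial colour: valence
    for _ in range(max_iters or len(adj)):
        # signature: own colour plus the multiset of neighbour colours
        sigs = {v: (colors[v], tuple(sorted(colors[u] for u in adj[v])))
                for v in adj}
        palette = {s: i for i, s in enumerate(sorted(set(sigs.values())))}
        new = {v: palette[sigs[v]] for v in adj}
        # refinement with the same number of classes => stable partition
        if len(set(new.values())) == len(set(colors.values())):
            return new
        colors = new
    return colors

def wl_histogram(adj):
    # graph-level summary: the sorted multiset of stable colours
    return tuple(sorted(wl_colors(adj).values()))
```

A classic failure case makes the limitation concrete: the 6-cycle and two disjoint triangles are both 2-regular, so 1-WL assigns every node the same colour and cannot separate them, even though they are clearly non-isomorphic.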
8.2 Proof of Theorem 1
Theorem.
Consider a GNN defined by its action at each layer:
(4) $h_v^{(t)} = \operatorname{Combine}^{(t)}\big(h_v^{(t-1)},\ \operatorname{Agg}^{(t)}((X^{(t-1)}, A)\,|_{\Gamma(v)})\big)$
Assume $\Gamma(v)$ can be defined as a univariate function of the distance from $v$. Then both of the following statements are true for all $k \geq 1$:

If $L_k(v) \subseteq \Gamma(v) \subseteq D_k(v)$, then $\Gamma(v) = L_k(v)$ or $\Gamma(v) = D_k(v)$.

If $D_k(v) \subseteq \Gamma(v) \subseteq L_{k+1}(v)$, then $\Gamma(v) = D_k(v)$ or $\Gamma(v) = L_{k+1}(v)$.
Proof.
We prove the theorem by contradiction. Assume, on the contrary, that one of the statements in Theorem 1 is false, and let $\Gamma(v)$ be an aggregation region witnessing the failure. We separate the two cases below:
Case 1: $L_k(v) \subsetneq \Gamma(v) \subsetneq D_k(v)$. Since $D_k(v)$ and $L_k(v)$ differ only by the set $S$ of edges between the distance-$k$ neighbours of $v$, $\Gamma(v)$ can only contain this set partially. Let $S'$ be the nonempty maximal subset of $S$ contained in $\Gamma(v)$. Since $\Gamma(v) \neq D_k(v)$, there exists an edge $(u, w)$ with $d(u, v) = d(w, v) = k$ such that $(u, w) \in S$ but $(u, w) \notin \Gamma(v)$. Consider a non-identity permutation of the vertices fixing $v$. Since $\Gamma(v)$ is defined only using the distance function and must be permutation invariant, all edges $(u', w')$ with $d(u', v) = d(w', v) = k$ must be in $S$ and not in $\Gamma(v)$. But then $S'$ is empty, a contradiction.
Case 2: $D_k(v) \subsetneq \Gamma(v) \subsetneq L_{k+1}(v)$. Consider the set difference between $L_{k+1}(v)$ and $D_k(v)$, denoted as a subgraph $T$:
(5) $T = L_{k+1}(v) \setminus D_k(v)$
Then $\Gamma(v)$ can only contain this set partially. Let $T'$ be the maximal subset of $T$ contained in $\Gamma(v)$. Since $\Gamma(v) \neq L_{k+1}(v)$, at least one of the following must be true:

There exists a node $u$ with $d(u, v) = k + 1$ such that $u \in T$ but $u \notin \Gamma(v)$.

There exists an edge $(u, w)$ with $d(u, v) = k$ and $d(w, v) = k + 1$ such that $(u, w) \in T$ but $(u, w) \notin \Gamma(v)$.
In the first case, consider a non-identity permutation of the vertices fixing $v$. Since $\Gamma(v)$ is permutation invariant and defined only using distance, all vertices $u'$ with $d(u', v) = k + 1$ are in $T$ but not in $\Gamma(v)$. This implies the node part $T'_N$ of $T'$ is empty.
Using the same logic in the second case, one can conclude that all edges $(u', w')$ with $d(u', v) = k$ and $d(w', v) = k + 1$ must be in $T$ but not in $\Gamma(v)$, which means the edge part $T'_E$ of $T'$ is empty.
Therefore, at least one of $T'_N$ and $T'_E$ is empty. Both cannot be empty (as that would mean $\Gamma(v) = D_k(v)$), so we must have either $\Gamma(v) = D_k(v) \cup T'_E$ or $\Gamma(v) = D_k(v) \cup T'_N$. In the former case, $\Gamma(v)$ is not a valid subgraph (some edges to nodes at distance $k + 1$ from $v$ are in the set, but the nodes themselves are not), and in the latter case it is not connected (the nodes at distance $k + 1$ from $v$ are in the set but none of their incident edges are). Neither is therefore a valid aggregation region in our framework (our definition of a GNN requires the region to be a connected subgraph). Thus we reach a contradiction. ∎
8.3 Proof of Theorem 3
Theorem.
The maximum discriminative power of the set of GNNs in $\mathcal{N}_{D_1}$ is strictly greater than that of the 1-dimensional WL test.
Proof.
We first note that a GNN with the aggregation function (in matrix notation)
(6) $\operatorname{Agg}(H) = \operatorname{diag}(A^{3})\, H$
is a class-$\mathcal{N}_{D_1}$ GNN. Note that $(A^3)_{vv}$ is the number of length-3 walks that start and end at $v$. These walks must be simple 3-cycles, and thus $(A^3)_{vv}$ is twice the number of triangles containing $v$ (since for a triangle $(v, u, w)$, there are two walks $v \to u \to w \to v$ and $v \to w \to u \to v$). furer2017combinatorial showed that the 1-WL test cannot measure the number of triangles in a graph (while the 2-WL test can), so there exist graphs $G_1$ and $G_2$ that the 1-WL test cannot differentiate but the GNN above can, due to the different numbers of triangles in the two graphs (an example is a pair of regular graphs with the same degree and the same number of nodes).
Now, xu2018powerful proved that $\mathcal{N}_{L_1}$ has a maximum discriminatory power of 1-WL, and since $L_1(v) \subseteq D_1(v)$, the maximum discriminatory power of $\mathcal{N}_{D_1}$ is at least as great as that of $\mathcal{N}_{L_1}$, which is 1-WL.
Combining these two results gives the required theorem. ∎
We note that computing $A^3$ with naive matrix multiplication requires $O(n^3)$ multiplications per product. However, by exploiting the sparsity and binary nature of $A$, there exist algorithms that compute the required walk counts using only additions (razzaque2008fast), and we thus derive a more favorable bound.
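The counting identity at the heart of this proof is easy to check numerically on a classic 1-WL-hard pair (a sketch; the adjacency matrices here have no self-loops): $K_{3,3}$ and the triangular prism are both 3-regular on 6 nodes, so 1-WL cannot separate them, yet their closed 3-walk counts differ.

```python
import numpy as np

def closed_3walks(A):
    # (A^3)_vv = number of length-3 returning walks at v
    # = twice the number of triangles containing v (no self-loops)
    return np.diag(np.linalg.matrix_power(A, 3))

# K_{3,3}: complete bipartite, 3-regular, triangle-free
K33 = np.zeros((6, 6))
K33[:3, 3:] = 1
K33 = K33 + K33.T

# triangular prism: 3-regular, every node lies on exactly one triangle
prism = np.zeros((6, 6))
for u, v in [(0, 1), (1, 2), (2, 0), (3, 4), (4, 5), (5, 3),
             (0, 3), (1, 4), (2, 5)]:
    prism[u, v] = prism[v, u] = 1

# K33 has all-zero closed 3-walk counts; the prism has all-two counts,
# so the diag(A^3) aggregation separates the pair while 1-WL cannot
```

This is exactly the separation the proof relies on: both graphs are regular with the same degree and node count, but the $\operatorname{diag}(A^3)$ feature distinguishes them.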
8.4 Proof of Theorem 4
Theorem.
For all $n \geq 1$, there exists a GNN within the class $\mathcal{N}_{D_n}$ that is able to discriminate all graphs with at most $n$ nodes.
Proof.
We prove the theorem by induction on $n$. The base case $n = 1$ is simple. Assume the statement is true for all values up to $n - 1$; we prove the case for $n$.
We separate the proof into three cases:

$G_1$ and $G_2$ are both disconnected.

Exactly one of $G_1$, $G_2$ is disconnected.

$G_1$ and $G_2$ are both connected.
If we can create appropriate GNNs to discriminate graphs in each of the three cases (say $\phi_1$, $\phi_2$, $\phi_3$), then the concatenated function $(\phi_1, \phi_2, \phi_3)$ can discriminate all graphs. We therefore prove the induction step separately for the three cases below.
8.4.1 $G_1$ and $G_2$ disconnected
We use the Graph Reconstruction Conjecture, which has been proved in the disconnected case (harary1974survey):
Lemma 1.
For all $n \geq 3$, two disconnected graphs with $n$ nodes are isomorphic if and only if their multisets of vertex-deleted subgraphs (on $n - 1$ nodes) are isomorphic.
Let $G_1$ and $G_2$ be any two disconnected graphs with $n$ nodes. By the induction assumption, there exists a GNN that discriminates all graphs with at most $n - 1$ nodes; denote it by $\phi$.
Then by Lemma 1 above, we know that $G_1$ and $G_2$ are isomorphic iff
(7) $\{\phi(G_1^i)\}_{i=1}^{n} = \{\phi(G_2^i)\}_{i=1}^{n}$ (as multisets),
where the $G_1^i$ are the vertex-deleted subgraphs of $G_1$ with $n - 1$ nodes, and similarly for the $G_2^i$. We then define $\Phi(G)$ as a function of the multiset $\{\phi(G^i)\}_{i=1}^{n}$; by Theorem 4.3 in wagstaff2019limitations, $\Phi$ can be chosen to be an injective, permutation-invariant function on this multiset. Therefore, $\Phi$ is a GNN with an aggregation region of $D_n(v)$ that can discriminate $G_1$ and $G_2$. Thus, the induction step is proven.
8.4.2 $G_1$ and $G_2$ connected
In the case where both $G_1$ and $G_2$ are connected, let $v_1, v_2$ be two vertices, one from each of the two graphs. Since $G_1$ and $G_2$ have at most $n$ nodes, by the definition of $D_n$ we have $D_n(v_1) = G_1$ and $D_n(v_2) = G_2$. Since $v_1, v_2$ are arbitrary, every node in the two graphs has its entire graph as its aggregation region. We then define the aggregation function as
$\operatorname{Agg}(v) = \operatorname{lexmin}\big(A|_{D_n(v)}\big),$
where $A|_{D_n(v)}$ is the adjacency matrix restricted to the subgraph $D_n(v)$, and $\operatorname{lexmin}$ returns the lexicographically smallest row-major vector of the adjacency matrix among all isomorphic permutations (i.e. over all relabelings of the nodes). (Strictly speaking, for consistency of the output length, we would require padding if the adjacency matrix restricted to the subgraph is not the full adjacency matrix; however, since we only care about the behaviour of the function when $G_1$ and $G_2$ are both connected, this scenario never arises, so it is not material to the proof below.) For example, for the path graph on three nodes, the three possible labelings give three distinct adjacency matrices; each is an isomorphic permutation of the others, and $\operatorname{lexmin}$ selects the one whose row-major vector is lexicographically smallest, regardless of which labeling we start from. Note that this function is permutation invariant.
For $G_1$ and $G_2$ connected, $A|_{D_n(v)}$ is always the adjacency matrix of the full graph. Therefore, if $G_1$ and $G_2$ are connected and isomorphic, their adjacency matrices are permutations of each other, and thus the lexicographically smallest orderings of their row-major vector forms are identical. The converse is also clearly true, as the adjacency matrix determines the graph.
Therefore, this function discriminates all connected graphs on at most $n$ nodes, and the induction step is proven.
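The $\operatorname{lexmin}$ canonical form can be realised by brute force over all relabelings (exponential in the number of nodes, which is fine here since the proof is purely existential; a sketch with our own function name):

```python
import numpy as np
from itertools import permutations

def lexmin_rowmajor(A):
    """Lexicographically smallest row-major vector of A over all
    simultaneous row/column permutations (node relabelings)."""
    n = A.shape[0]
    best = None
    for perm in permutations(range(n)):
        p = list(perm)
        # relabel nodes by perm: new[i, j] = A[p[i], p[j]]
        vec = tuple(int(x) for x in A[np.ix_(p, p)].reshape(-1))
        if best is None or vec < best:
            best = vec
    return best

# two labelings of the same 3-node path yield the same canonical form
P1 = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]])  # path with centre node 1
P2 = np.array([[0, 1, 1], [1, 0, 0], [1, 0, 0]])  # path with centre node 0
```

Because the minimum is taken over every relabeling, the output depends only on the isomorphism class of the input, which is exactly the permutation invariance the proof requires.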
8.4.3 $G_1$ disconnected and $G_2$ connected
In the case where $G_1$ is disconnected and $G_2$ is connected, we define the aggregation function as the number of vertices in $D_n(v)$, denoted $|D_n(v)|$:
$\operatorname{Agg}(v) = |D_n(v)|.$
This is a permutation-invariant function. Note that for the connected graph $G_2$ and any vertex $v$ in $G_2$, this function returns $|G_2|$, as $D_n(v) = G_2$. Therefore, every node has the same embedding $|G_2|$.
On the other hand, for the disconnected graph $G_1$, let $C_1, \dots, C_m$ be the connected components of $G_1$. For a vertex $u \in C_i$, it is clear that $D_n(u) = C_i$, and thus $\operatorname{Agg}(u) = |C_i|$ for all such $u$. Since by construction $|C_i| < |G_1| = |G_2|$ for all $i$, the embeddings of nodes in $G_1$ and nodes in $G_2$ are never equal when $G_2$ is connected and $G_1$ is disconnected.
Therefore, this function discriminates all pairs of graphs on $n$ nodes in which one is connected and the other is disconnected, so the induction step is proven.
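The component-size aggregation used in this last case is just a breadth-first search (a sketch with our own names):

```python
from collections import deque

def component_size(adj, v):
    # number of vertices reachable from v, i.e. |C_i| for the component
    # containing v (equivalently |D_n(v)| for n at least the graph size)
    seen, queue = {v}, deque([v])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return len(seen)

# connected triangle: every node sees all 3 vertices
tri = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
# triangle plus an isolated vertex: component sizes are 3 and 1, both < 4
tri_plus = {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: []}
```

On the 4-node disconnected graph, every node's embedding is strictly smaller than 4, whereas every node of a connected 4-node graph would embed to exactly 4, which is the separation the proof exploits.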
∎