On Graph Neural Networks versus Graph-Augmented MLPs
Abstract
From the perspective of expressive power, this work compares multi-layer Graph Neural Networks (GNNs) with a simplified alternative that we call Graph-Augmented Multi-Layer Perceptrons (GAMLPs), which first augments node features with certain multi-hop operators on the graph and then applies an MLP in a node-wise fashion. From the perspective of graph isomorphism testing, we show both theoretically and numerically that GAMLPs with suitable operators can distinguish almost all non-isomorphic graphs, just like the Weisfeiler-Lehman (WL) test. However, by viewing them as node-level functions and examining the equivalence classes they induce on rooted graphs, we prove a separation in expressive power between GAMLPs and GNNs that grows exponentially in depth. In particular, unlike GNNs, GAMLPs are unable to count the number of attributed walks. We also demonstrate via community detection experiments that GAMLPs can be limited by their choice of operator family, as compared to GNNs with higher flexibility in learning.
*: Equal contributions.
1 Introduction
While multi-layer Graph Neural Networks (GNNs) have gained popularity for their applications in various fields, authors have recently started to investigate what their true advantages over baselines are, and whether they can be simplified. On one hand, GNNs based on neighborhood aggregation allow the combination of information present at different nodes, and by increasing the depth of such GNNs, we increase the size of the receptive field. On the other hand, it has been pointed out that deep GNNs can suffer from issues including over-smoothing, exploding or vanishing gradients in training, as well as bottleneck effects (Kipf and Welling, 2016; Li et al., 2018; Luan et al., 2019; Oono and Suzuki, 2019; Rossi et al., 2020; Alon and Yahav, 2020).
Recently, a series of models have attempted to relieve these issues of deep GNNs while retaining their benefit of combining information across nodes, by firstly augmenting the node features through propagating the original node features through powers of graph operators such as the (normalized) adjacency matrix, and secondly applying a node-wise function to the augmented node features, usually realized by a Multi-Layer Perceptron (MLP) (Wu et al., 2019; NT and Maehara, 2019; Chen et al., 2019a; Rossi et al., 2020). Because of the use of graph operators for augmenting the node features, we will refer to such models as Graph-Augmented MLPs (GAMLPs). These models have achieved competitive performance on various tasks, and moreover enjoy better scalability, since the augmented node features can be computed during preprocessing (Rossi et al., 2020). Thus, it becomes natural to ask what advantages GNNs have over GAMLPs.
In this work, we ask whether GAMLPs sacrifice expressive power compared to GNNs while gaining these advantages. A popular measure of the expressive power of GNNs is their ability to distinguish non-isomorphic graphs (Hamilton et al., 2017; Xu et al., 2018a; Morris et al., 2019). Besides studying the expressive power of GAMLPs from the viewpoint of graph isomorphism tests, we propose a new perspective that better suits the setting of node-prediction tasks: we analyze the expressive power of models including GNNs and GAMLPs as node-level functions, or equivalently, as functions on rooted graphs. Under this perspective, we prove an exponential-in-depth gap between the expressive power of GNNs and that of GAMLPs. We illustrate this gap by finding a broad family of simple functions, based on counting attributed walks on the graph, that can provably be approximated by GNNs but not by GAMLPs. Moreover, via the task of community detection, we show that GAMLPs lack the flexibility of GNNs in learning the best operators to use.
In summary, our main contributions are:

- Finding graph pairs that several GAMLPs cannot distinguish while GNNs can, but also proving that there exist simple GAMLPs that distinguish almost all non-isomorphic graphs.

- From the perspective of approximating node-level functions, proving an exponential gap between the expressive power of GNNs and GAMLPs in terms of the equivalence classes on rooted graphs that they induce.

- Showing, both theoretically and numerically, that the functions counting a particular type of attributed walk among nodes can be approximated by GNNs but not by GAMLPs.

- Demonstrating, through community detection tasks, the limitations of GAMLPs due to the choice of the operator family.
2 Related Works
Depth in GNNs
Kipf and Welling (2016) observe that the performance of Graph Convolutional Networks (GCNs) degrades as the depth grows too large, and the best performance is achieved with two or three layers. Along the spectral perspective on GNNs (Bruna et al., 2013; Defferrard et al., 2016; Bronstein et al., 2017; NT and Maehara, 2019), Li et al. (2018) and Wu et al. (2019) explain the failure of deep GCNs by the over-smoothing of the node features. Oono and Suzuki (2019) show an exponential loss of expressive power as the depth of GCNs increases, in the sense that the hidden node states tend to converge to certain invariant subspaces of the graph Laplacian as the depth increases to infinity. Alon and Yahav (2020) show an over-squashing effect of deep GNNs, in the sense that the width of the hidden states needs to grow exponentially in the depth in order to retain all information about long-range interactions. In comparison, our work focuses on more general GNNs based on neighborhood aggregation that are not limited in the widths of their hidden states, and demonstrates their advantage in expressive power compared to GAMLP models at finite depth, in terms of distinguishing rooted graphs for node-prediction tasks. On the other hand, there exist examples of useful deep GNNs. Chen et al. (2019b) apply deep GNNs to community detection problems, using a family of multi-scale operators as well as normalization steps (Ioffe and Szegedy, 2015; Ulyanov et al., 2016). Recently, Li et al. (2019, 2020a) and Chen et al. (2020a) build deeper GCN architectures with the help of various residual connections (He et al., 2016) and normalization techniques to achieve impressive results on standard datasets, which further highlights the need to study the role of depth in GNNs.
Existing GAMLPtype models
Motivated by better understanding GNNs as well as by enhancing computational efficiency, several models of the GAMLP type have been proposed, achieving competitive performance on various datasets. Wu et al. (2019) propose the Simple Graph Convolution (SGC), which removes the intermediary weights and nonlinearities in GCNs. Chen et al. (2019a) propose the Graph Feature Network (GFN), which further adds intermediary powers of the normalized adjacency matrix to the operator family and is applied to graph-prediction tasks. NT and Maehara (2019) propose the Graph Filter Neural Network (gfNN), which enhances the SGC in the final MLP step. Rossi et al. (2020) propose Scalable Inception Graph Neural Networks (SIGNs), which augment the operator family with Personalized-PageRank-based (Klicpera et al., 2018, 2019) and triangle-based (Monti et al., 2018; Chen et al., 2019b) adjacency matrices.
Expressive Power of GNNs
Xu et al. (2018a) and Morris et al. (2019) show that GNNs based on neighborhood aggregation are no more powerful than the Weisfeiler-Lehman (WL) test for graph isomorphism (Weisfeiler and Leman, 1968), in the sense that these GNNs cannot distinguish any pair of non-isomorphic graphs that the WL test cannot distinguish. They also propose models that match the expressive power of the WL test. Since then, many attempts have been made to build GNN models whose expressive power is not limited by WL (Morris et al., 2019; Maron et al., 2019; Chen et al., 2019c; Morris and Mutzel, 2019; You et al., 2019; Bouritsas et al., 2020; Li et al., 2020b; Flam-Shepherd et al., 2020; Sato et al., 2019, 2020). Other perspectives for understanding the expressive power of GNNs include function approximation (Maron et al., 2019; Chen et al., 2019c; Keriven and Peyré, 2019), substructure counting (Chen et al., 2020b), Turing universality (Loukas, 2019) and the determination of graph properties (Garg et al., 2020). Sato (2020) provides a survey on these topics. In this paper, besides studying the expressive power of GAMLPs along the line of graph isomorphism tests, we propose a new perspective of approximating functions on rooted graphs, motivated by node-prediction tasks, and show a gap between GAMLPs and GNNs, in terms of the equivalence classes they induce on rooted graphs, that grows exponentially in the size of the receptive field.
3 Background
3.1 Notations
Let G = (V, E) denote a graph, with V being the vertex set and E being the edge set. Let n = |V| denote the number of nodes in G, let A ∈ {0, 1}^{n×n} denote the adjacency matrix, let D denote the diagonal degree matrix with D_{ii} = d_i being the degree of node i, and let Â = D^{-1/2} A D^{-1/2} denote the symmetrically normalized adjacency matrix. Let X ∈ R^{n×d} denote the matrix of node features, where x_i denotes the d-dimensional feature that node i possesses. For a node i, let N(i) denote the set of neighbors of i. We assume that the edges do not possess features. In a node-prediction task, the labels are given by a vector y of length n.
For a positive integer K, we let [K] = {1, …, K}. We use {{ · }} to denote a multiset, which allows repeated elements. We say a function g(K) is doubly-exponential in K if log log g(K) grows linearly in K, and poly-exponential in K if log g(K) grows polynomially in K, as K tends to infinity.
3.2 Graph Neural Networks (GNNs)
Following the notations in Xu et al. (2018a), we consider K-layer GNNs defined generically as follows. For k ∈ [K], we compute the hidden node states iteratively as

h_v^{(k)} = COMBINE^{(k)} ( h_v^{(k-1)}, AGGREGATE^{(k)} ( {{ h_u^{(k-1)} : u ∈ N(v) }} ) ),    (1)

where we set h_v^{(0)} = x_v to be the node features. If a graph-level output is desired, we finally let

h_G = READOUT ( {{ h_v^{(K)} : v ∈ V }} ).    (2)

Different choices of the trainable COMBINE, AGGREGATE and READOUT functions result in different GNN models, though usually AGGREGATE and READOUT are chosen to be permutation-invariant. As graph-level functions, it is shown in Xu et al. (2018a) and Morris et al. (2019) that the maximal expressive power of models of this type coincides with running K iterations of the WL test for graph isomorphism, in the sense that any two non-isomorphic graphs that cannot be distinguished by the latter cannot be distinguished by K-layer GNNs, either. For this reason, we will not distinguish between GNNs and the WL test in discussions of expressive power.
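As a concrete instance of the template above, the following minimal sketch implements one sum-aggregation layer and a sum readout in NumPy; the ReLU combine and the weight matrices W_self, W_neigh are illustrative placeholders, not any specific published model.

```python
import numpy as np

def gnn_layer(A, H, W_self, W_neigh):
    """One neighborhood-aggregation step, Eq. (1), with sum-AGGREGATE and a
    one-layer ReLU COMBINE (illustrative choices for the trainable functions)."""
    aggregated = A @ H                                   # sum the neighbors' hidden states
    return np.maximum(0.0, H @ W_self + aggregated @ W_neigh)

def readout(H):
    """Permutation-invariant READOUT, Eq. (2): sum over all nodes."""
    return H.sum(axis=0)
```

Because AGGREGATE and READOUT are permutation-invariant, permuting the nodes permutes the hidden states accordingly and leaves the graph-level output unchanged.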
3.3 Graph-Augmented Multi-Layer Perceptrons (GAMLPs)
Let Ω = { ω_1(A), …, ω_K(A) } be a set of (usually multi-hop) linear operators that are functions of the adjacency matrix A. Common choices of the operators are powers of the (normalized) adjacency matrix, and several particular choices of Ω that give rise to existing GAMLP models are listed in Appendix A. In its most general form, a GAMLP first computes a series of augmented features via

X̃_k = ω_k(A) · φ(X),    (3)

with φ being a learnable function acting as a feature transformation applied to each node separately. It can be realized by an MLP, e.g. φ(X) = σ(X W_1) W_2, where σ is a nonlinear activation function and W_1, W_2 are trainable weight matrices of suitable dimensions. Next, the model concatenates X̃_1, …, X̃_K into X̃ = [X̃_1, …, X̃_K], and computes

Z = ρ(X̃),    (4)

where ρ is also a learnable node-wise function, again usually realized by an MLP. If a graph-level output is desired, we can also add a READOUT function as in (2).

A simplified version of the model sets φ to be the identity function, in which case (3) and (4) can be written together as

Z = ρ( [ ω_1(A) X, …, ω_K(A) X ] ).    (5)

Such a simplification improves computational efficiency, since the matrix products ω_k(A) X can be precomputed before training (Rossi et al., 2020). Since we are mostly interested in an upper bound on the expressive power of GAMLPs, we will work with the more general update rule (3) in this paper, but the lower-bound result in Proposition 2 remains valid even when we restrict to the subset of models where φ is taken to be the identity function.
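The precomputation behind (5) can be sketched as follows; the choice of the symmetrically normalized adjacency as the operator here is one assumption among the families listed in Appendix A.

```python
import numpy as np

def precompute_augmented_features(A, X, K):
    """Augmented features [X, A_hat X, ..., A_hat^K X] of Eq. (5) (phi = identity),
    computed once before training; A_hat is the symmetrically normalized
    adjacency matrix."""
    d = A.sum(axis=1)
    A_hat = A / np.sqrt(np.outer(d, d))      # A_hat_ij = A_ij / sqrt(d_i d_j)
    feats = [X]
    for _ in range(K):
        feats.append(A_hat @ feats[-1])
    return np.concatenate(feats, axis=1)     # one row per node, fed to the MLP rho
```

Since this matrix never changes during training, only the node-wise MLP ρ touches the augmented features at training time, which is the source of the scalability advantage.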
4 Expressive Power as Graph Isomorphism Tests
We first study the expressive power of GAMLPs via their ability to distinguish non-isomorphic graphs. It is not hard to see that when Ω = { Ã_γ^k }_{k ∈ [K]}, where Ã_γ = (D + γI)^{-1/2} (A + γI) (D + γI)^{-1/2} for any γ ≥ 0 generalizes the normalized adjacency matrix, this is upper-bounded by the power of K iterations of the WL test. We next ask whether it can fall strictly below. Indeed, for two common choices of Ω, we can find concrete examples: 1) If Ω consists of integer powers of a normalized adjacency matrix of the form Ã_γ for some γ ≥ 0, then it is apparent that the GAMLP cannot distinguish any pair of regular graphs of the same size but different node degrees; 2) If Ω consists of integer powers of the adjacency matrix A, then the model cannot distinguish between the pair of graphs shown in Figure 1, which can be distinguished by the WL test. The proof of the latter result is given in Appendix J. Together, we summarize these results as:
Proposition 1.
If Ω = { ω(A)^k }_{k ∈ [K]}, with either ω(A) = A or ω(A) = (D + γI)^{-1/2} (A + γI) (D + γI)^{-1/2} for some γ ≥ 0, then there exists a pair of graphs which can be distinguished by GNNs but not by this GAMLP.
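The first case can be checked directly: with the normalized operator and constant input features, every regular graph produces identical all-ones augmented features, whatever its degree. A minimal sketch (taking γ = 0, i.e. no self-loops):

```python
import numpy as np

def normalized_power_features(A, K):
    """[1, A_hat 1, ..., A_hat^K 1] with constant input features. For any regular
    graph, A_hat 1 = 1, so the augmented features are all ones regardless of
    the degree - the first counterexample in Proposition 1."""
    d = A.sum(axis=1)
    A_hat = A / np.sqrt(np.outer(d, d))
    feats = [np.ones(A.shape[0])]
    for _ in range(K):
        feats.append(A_hat @ feats[-1])
    return np.stack(feats, axis=1)

# 6-cycle (2-regular) and K_{3,3} (3-regular): non-isomorphic and immediately
# distinguished by WL via the degrees, yet identical under these features.
C6 = np.zeros((6, 6))
for i in range(6):
    C6[i, (i + 1) % 6] = C6[(i + 1) % 6, i] = 1.0
K33 = np.block([[np.zeros((3, 3)), np.ones((3, 3))],
                [np.ones((3, 3)), np.zeros((3, 3))]])
```

The same computation with γ > 0 gives the identical conclusion, since (d + γ)/(d + γ) = 1 on any d-regular graph.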
Nonetheless, if we focus not on particular counterexamples but rather on the average performance in distinguishing random graphs, it is not hard for GAMLPs to reach the same level as the WL test, which is known to distinguish almost all pairs of graphs under the uniform distribution (Babai et al., 1980). Specifically, building on the results of Babai et al. (1980), we have:
Proposition 2.
For every n, there exists a constant-size operator family, such as Ω = { A, A (n+1)^{-D} } constructed in Appendix E (where (n+1)^{-D} denotes the diagonal matrix with entries (n+1)^{-d_v}), such that any GAMLP whose operator family contains Ω can distinguish almost all pairs of non-isomorphic graphs of at most n nodes, in the sense that the fraction of graphs on which such a GAMLP fails to test isomorphism vanishes as n → ∞.
The hypothesis that distinguishing non-isomorphic graphs is not difficult on average for either GNNs or GAMLPs is further supported by the numerical results provided in Appendix B, in which we count the number of equivalence classes that each of them induces on graphs occurring in real-world datasets. This further raises the question of whether graph isomorphism tests alone suffice as a criterion for comparing the expressive power of models on graphs, which leads us to the explorations in the next section.
Lastly, we remark that with suitable choices of operators in Ω, it is possible for GAMLPs to go beyond the power of WL. For example, if Ω contains the power graph adjacency matrix introduced in Chen et al. (2019b), then the GAMLP can distinguish between a hexagon and a pair of triangles, which the WL test cannot distinguish.
5 Expressive Power as Functions on Rooted Graphs
To study expressive power beyond graph isomorphism tests, we consider the setting of node-wise prediction tasks, for which the final readout step (2) is dropped in both GNNs and GAMLPs. Whether the learning setup is transductive or inductive, we can consider the models as functions on rooted graphs, or egonets (Preciado and Jadbabaie, 2010), which are graphs with one node designated as the root.
To do so, we introduce a notion of induced equivalence relations on rooted graphs, analogous to the equivalence relations among graphs considered in Section 4. Given a family F of functions on rooted graphs, we define an equivalence relation under which two rooted graphs G[v] and G'[v'] are equivalent if and only if f(G[v]) = f(G'[v']) for every f ∈ F. By examining the number and sizes of the induced equivalence classes of rooted graphs, we can evaluate the relative expressive power of different families of functions on rooted graphs in a quantitative way.
In the rest of this section, we assume that the node features belong to a finite alphabet and that all nodes have degree at most m. Firstly, GNNs are known to distinguish neighborhoods up to the rooted aggregation tree, which can be obtained by unrolling the neighborhood-aggregation steps in GNNs as well as in the WL test (Xu et al., 2018a; Morris et al., 2019; Garg et al., 2020). The depth-K rooted aggregation tree of a rooted graph G[v] is a depth-K rooted tree together with a (possibly many-to-one) mapping from every node in the tree to some node in G, where (i) the root of the tree is mapped to the node v, and (ii) the children of every node s in the tree are mapped to the neighbors of the node in G to which s is mapped. An illustration of rooted graphs and rooted aggregation trees is given in Figure 4. Hence, each equivalence class of rooted graphs induced by the family of all depth-K GNNs consists of all rooted graphs that share the same rooted aggregation tree of depth K. Thus, to estimate the number of equivalence classes of rooted graphs induced by GNNs, we need to estimate the number of possible rooted aggregation trees, which is given by Lemma 3 in Appendix F. We thereby derive the following lower bound on the number of equivalence classes of rooted graphs that depth-K GNNs induce:
Proposition 3.
Assume that the node-feature alphabet contains at least two elements and that the degree bound satisfies m ≥ 3. The total number of equivalence classes of rooted graphs induced by GNNs of depth K grows at least doubly-exponentially in K.
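In practice, the node-level equivalence classes induced by depth-K GNNs can be enumerated by iterated neighborhood hashing: after K rounds, two nodes receive the same label exactly when their depth-K rooted aggregation trees coincide (up to hash collisions). A sketch, not the paper's code:

```python
def wl_node_classes(adj_list, features, depth):
    """Iterated neighborhood hashing, the node-level analogue of the WL test.
    Returns one label per node; equal labels after `depth` rounds correspond
    to equal depth-`depth` rooted aggregation trees."""
    colors = [hash(x) for x in features]
    for _ in range(depth):
        colors = [hash((colors[v], tuple(sorted(colors[u] for u in adj_list[v]))))
                  for v in range(len(adj_list))]
    return colors
```

On a path with five nodes and identical features, depth 1 separates endpoints from interior nodes (two classes), and depth 2 further separates the center from its two neighbors (three classes), matching the tree-unrolling picture above.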
In comparison, we next demonstrate that the equivalence classes induced by GAMLPs are coarser. To see this, let us first consider the example where we take Ω = { Â^k }_{k ∈ [K]}, with Â = D^{-1/2} A D^{-1/2} (the variants Ã_γ with γ > 0 behave analogously, with walks taken on the graph with self-loops added). From formula (3), by expanding the matrix product, we have

(Â^k φ(X))_v = Σ_{(u_0, …, u_k) ∈ W_k(v)} φ(x)_{u_k} / ( √(d_{u_0} d_{u_k}) · Π_{j=1}^{k-1} d_{u_j} ),    (6)

where we define W_k(v) to be the set of all walks of length k in the rooted graph G[v] starting from the root node v (an illustration is given in Figure 2). Thus, the k-th augmented feature of node v, (Â^k φ(X))_v, is completely determined by the number of each "type" of walk of length k in G[v], where the type of a walk (u_0, …, u_k) is determined jointly by the multiset of intermediate degrees {{ d_{u_1}, …, d_{u_{k-1}} }}, as well as the degree d_{u_k} and the node feature x_{u_k} of the end node. Hence, to prove an upper bound on the total number of equivalence classes of rooted graphs induced by such a GAMLP, it is sufficient to upper-bound the total number of possibilities of assigning the counts of all types of walks in a rooted graph. This allows us to derive the following result, which we prove in Appendix G.
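For the unnormalized choice ω_k(A) = A^k, the expansion underlying (6) reduces to a plain sum of end-node features over walks, which can be verified by brute force on small graphs:

```python
import numpy as np

def walk_feature(A, X, v, k):
    """(A^k X)_v computed as an explicit sum of x_{u_k} over all walks
    (v = u_0, ..., u_k) of length k, i.e. the unnormalized analogue of Eq. (6)."""
    def walks_from(node, remaining):
        if remaining == 0:
            yield node
            return
        for u in range(A.shape[0]):
            if A[node, u] > 0:
                yield from walks_from(u, remaining - 1)
    return sum(X[end] for end in walks_from(v, k))
```

The enumeration agrees entry-wise with the matrix power A^k applied to X, which is exactly why a GAMLP's augmented features see only aggregate walk counts rather than the walks themselves.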
Proposition 4.
Fix Ω = { Ã_γ^k }_{k ∈ [K]}, where Ã_γ = (D + γI)^{-1/2} (A + γI) (D + γI)^{-1/2} for some γ ≥ 0. Then the total number of equivalence classes of rooted graphs induced by such GAMLPs is poly-exponential in K.
Compared with Proposition 3, this shows that the number of equivalence classes of rooted graphs induced by such GAMLPs is exponentially smaller than that induced by GNNs. In addition, as the other side of the same coin, these results also indicate the complexity of these hypothesis classes. Building on the results of Chen et al. (2019c, 2020b) on the equivalence between distinguishing non-isomorphic graphs and approximating arbitrary permutation-invariant functions on graphs by neural networks, and by the definition of the VC dimension (Vapnik and Chervonenkis, 1971; Mohri et al., 2018), we conclude:
Corollary 1.
The VC dimension of all GNNs of K layers as functions on rooted graphs grows at least doubly-exponentially in K. Fixing Ω as in Proposition 4, the VC dimension of all GAMLPs with this Ω as functions on rooted graphs is at most poly-exponential in K.
Meanwhile, for more general operators, we can show that the equivalence classes induced by GAMLPs are coarser than those induced by GNNs, at least in certain quantifiable senses. For instance, the pair of rooted graphs in Figure 2 belongs to the same equivalence class induced by any GAMLP (as we prove in Appendix H) but to different equivalence classes induced by GNNs. Rigorously, we characterize such a gap in expressive power by finding certain equivalence classes of rooted graphs induced by GAMLPs that intersect many equivalence classes induced by GNNs. In particular, we have the following general result, which we prove in Appendix H:
Proposition 5.
If Ω is any family of equivariant linear operators on the graph that only depend on the graph topology within at most K hops, then there exist exponentially-in-K many equivalence classes of rooted graphs induced by the GAMLPs with Ω, each of which intersects doubly-exponentially-in-K many equivalence classes induced by depth-K GNNs, assuming that the node-feature alphabet contains at least two elements and that the degree bound satisfies m ≥ 3. Conversely, if Ω = { Ã_γ^k }_{k ∈ [K]}, where Ã_γ = (D + γI)^{-1/2} (A + γI) (D + γI)^{-1/2} with any γ ≥ 0, then each equivalence class induced by depth-K GNNs is contained in one equivalence class induced by the GAMLPs with Ω.
In essence, this result establishes that GAMLPs can express exponentially fewer functions than GNNs with an equivalent receptive field. Taking a step further, we can find explicit functions on rooted graphs that can be approximated by GNNs but not by GAMLPs. In the framework developed so far, this occurs when the image of each equivalence class induced by GNNs under such a function contains a single value, whereas the image of some equivalence class induced by GAMLPs contains multiple values. Inspired by the proofs of the results above, a natural candidate is the family of functions that count the number of walks of a particular type in the rooted graph. We establish the following result, which we prove in Appendix I:
Proposition 6.
For any sequence of node features (a_0, a_1, a_2, …), consider the sequence of functions f_K on rooted graphs, where f_K(G[v]) counts the walks (v = u_0, u_1, …, u_K) in G[v] whose node features satisfy x_{u_j} = a_j for all j. For all K, the image under f_K of every equivalence class induced by depth-K GNNs contains a single value, while for any GAMLP using equivariant linear operators that only depend on the graph topology, there exist exponentially-in-K many equivalence classes induced by this GAMLP whose image under f_K contains exponentially-in-K many values.
In words, there exist graph instances on which the attributed-walk-counting function takes different values, yet no GAMLP model can tell them apart, and there are exponentially many such instances as the number of hops increases. This suggests the possibility of lower-bounding the average approximation error of certain functions by GAMLPs under various random graph families, which we leave for future work.
6 Experiments
The baseline GAMLP model we consider has operator family Ω = { Â^k }_{k ∈ [K]} for a certain K, and we refer to it simply as GAMLP. For the experiments in Section 6.3, due to the large K as well as the analogy with spectral methods (Chen et al., 2019b), we use instance normalization (Ulyanov et al., 2016). Further details are described in Appendix K.
6.1 Number of equivalence classes of rooted graphs
Motivated by Propositions 3 and 4, we numerically count the number of equivalence classes induced by GNNs and GAMLPs among the rooted graphs found in actual graphs with node features removed. For depth-K GNNs, we implement a WL-like process with hash functions to map the depth-K egonet associated with each node to a string before comparing across nodes. For GAMLPs, we compare the augmented features of each egonet computed via (3). From the results in Table 2, we see that, indeed, the number of equivalence classes induced by GAMLPs is noticeably smaller than that induced by GNNs. The contrast is much more visible than their difference in the number of graph equivalence classes given in Appendix B.
6.2 Counting attributed walks
Motivated by Proposition 6, we test the ability of GNNs and GAMLPs to count the number of walks of a particular type in synthetic data. We take graphs from the Cora dataset (with node features removed) as well as generate a random regular graph (RRG) of fixed size and node degree. We assign the node feature blue to all nodes with even index and the node feature red to all nodes with odd index, encoded as two-dimensional one-hot vectors. On the Cora graph, a node v's label is given by the number of walks of the type blue-blue-blue starting from v. On the RRG, the label is given by the number of walks of the type blue-blue-blue-blue starting from v. The nodes of each graph are split between training and testing. We test two GAMLP models, one with as many powers of the operator as the walk length and the other with twice as many operators, and compare their performance against that of the Graph Isomorphism Network (GIN), a GNN model shown to achieve the expressive power of the WL test (Xu et al., 2018a). From the results in Table 2, we see that GIN significantly outperforms both GAMLP variants in both training and testing on both graphs, which is consistent with the theoretical result in Proposition 6 that GNNs can count attributed walks while GAMLPs cannot. This points to an intuitive task that lies in the gap of expressive power between GNNs and GAMLPs.
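Labels of this kind can be generated with products of the degree-masked adjacency matrix; the following sketch is one way to construct them, not the code used in the experiments:

```python
import numpy as np

def attributed_walk_counts(A, is_blue, length):
    """For each node v, count the walks (v = u_0, ..., u_{length-1}) in which
    every node on the walk is blue - the labels of Section 6.2. The count is
    built backwards from the walk's end via the masked adjacency."""
    blue = np.asarray(is_blue, dtype=float)
    counts = blue.copy()                 # walks of 1 node: the node itself, if blue
    for _ in range(length - 1):
        counts = blue * (A @ counts)     # prepend one blue node to every walk
    return counts
```

On a triangle with nodes (blue, blue, red), the only blue-blue-blue walks are the back-and-forth walks between the two blue nodes, so the label vector is (1, 1, 0).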
6.3 Community detection on Stochastic Block Models (SBM)
We use the task of community detection to illustrate another limitation of GAMLP models: a lack of flexibility to learn the family of operators. The SBM is a random graph model in which nodes are partitioned into underlying communities and each edge is drawn independently with a probability that depends only on whether the pair of nodes belongs to the same community. The task of community detection is then to recover the community assignments from the connectivity pattern. We focus on the binary (that is, two-community) SBM in the sparse regime, where it is known that the hardness of detecting communities is characterized by a signal-to-noise ratio (SNR) that is a function of the in-group and out-group connectivities (Abbe, 2017). We select several pairs of in-group and out-group connectivities, resulting in different hardness levels of the task.
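A minimal sampler for the binary SBM, together with the standard SNR of the sparse regime (with in-group edge probability a/n and out-group edge probability b/n, detection is information-theoretically possible iff (a - b)^2 > 2(a + b); see Abbe, 2017):

```python
import numpy as np

def sample_binary_sbm(n, p_in, p_out, rng):
    """Sample a two-community SBM: the first half of the nodes form community 0,
    the second half community 1; each edge appears independently with
    probability p_in within a community and p_out across communities."""
    labels = np.array([0] * (n // 2) + [1] * (n - n // 2))
    same = (labels[:, None] == labels[None, :]).astype(float)
    P = p_in * same + p_out * (1.0 - same)
    upper = np.triu(rng.random((n, n)) < P, k=1).astype(float)
    return upper + upper.T, labels

def snr(a, b):
    """SNR of the sparse binary SBM with p_in = a/n and p_out = b/n."""
    return (a - b) ** 2 / (2.0 * (a + b))
```

Sweeping (a, b) pairs with the SNR moving through 1 reproduces the easy-to-hard range of task difficulties described above.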
Among the many approaches to community detection, spectral methods are particularly worth mentioning here: they typically aim at finding an eigenvector of a suitable operator that is correlated with the community assignment, such as the eigenvector associated with the second largest eigenvalue of the adjacency matrix, or with the second smallest eigenvalue of the Laplacian or of the Bethe Hessian matrix (Krzakala et al., 2013). In particular, the Bethe Hessian matrix is known to be asymptotically optimal in the hard regime, provided that a data-dependent parameter is known. Note that spectral methods bear a close resemblance to GAMLPs and GNNs. In particular, Chen et al. (2019b) propose a spectral GNN (sGNN) model for community detection that can be viewed as a learnable generalization of power iterations on a collection of operators. Further details on the Bethe Hessian and sGNN are provided in Appendix K.
We first compare two variants of GAMLP models: GAMLP with Ω = { Â^k }_{k ∈ [K]}, and GAMLP with Ω generated from the Bethe Hessian matrix (with the data-dependent parameter given by an oracle) up to the same K. From Figure 3, we see that the latter consistently outperforms the former, indicating the importance of the choice of operators for GAMLPs. Meanwhile, we also test a variant of sGNN that is based only on powers of the same adjacency operator and has the same receptive field as GAMLP (further details are given in Appendix K). We see that its performance is comparable to that of GAMLP. This demonstrates a scenario in which GAMLPs with common choices of Ω do not work well, while there exists some a priori unknown choice of Ω with which a GAMLP can achieve good performance. In contrast, a GNN model does not need to rely on the knowledge of such an oracle set of operators, demonstrating its flexibility in learning.
7 Conclusions
We have studied the separation in representation power between GNNs and a popular alternative that we have coined GAMLPs. This latter family is appealing due to its computational scalability and its conceptual simplicity, whereby the role of the topology is reduced to creating 'augmented' node features that are then fed into a generic MLP. Our results show that while GAMLPs can distinguish almost all non-isomorphic graphs, in terms of approximating node-level functions there exists a gap, growing exponentially in depth, between GAMLPs and GNNs in terms of the number of equivalence classes of nodes (or rooted graphs) that they induce. Furthermore, we find a concrete class of functions that lies in this gap, given by the counting of attributed walks. Moreover, through community detection, we demonstrate GAMLPs' inability to go beyond their fixed family of operators, in contrast to GNNs. In other words, GNNs possess an inherent ability to discover topological features through learnt diffusion operators, while GAMLPs are limited to a fixed, hard-wired family of diffusions.
While we do not attempt to provide a decisive answer as to whether GNNs or GAMLPs should be preferred in practice, our theoretical framework and concrete examples help to understand their differences in expressive power and indicate the types of tasks in which a gap is more likely to be seen: those exploiting stronger structure among nodes, such as counting attributed walks, or those involving the learning of graph operators. That said, our results are purely on the representation side and disregard optimization considerations; integrating the corresponding optimization questions is an important direction for future work. Finally, another open question is to better understand the links between GAMLPs and spectral methods, and how these links can help in learning diffusion operators.
Acknowledgements
We are grateful to Jiaxuan You for initiating the discussion on GAMLPtype models, as well as Mufei Li, Minjie Wang, Xiang Song and Lingfan Yu for helpful conversations. This work is partially supported by the Alfred P. Sloan Foundation, NSF RI1816753, NSF CAREER CIF 1845360, NSF CHS1901091, Samsung Electronics, and the Institute for Advanced Study.
Appendix A Examples of existing GAMLP models
For γ ≥ 0, let A_γ = A + γI, let D_γ be the diagonal matrix with (D_γ)_{ii} = d_i + γ, and let Â_γ = D_γ^{-1/2} A_γ D_γ^{-1/2}.

- Simple Graph Convolution (Wu et al., 2019): Ω = { Â_1^K } for some K. In addition, φ is the identity function and ρ(X̃) = softmax(X̃ W) for some trainable weight matrix W.

- Graph Feature Network (Chen et al., 2019a): Ω = { I, Â_γ, Â_γ^2, …, Â_γ^K } for some K and γ ≥ 0. In addition, φ is the identity function and ρ is an MLP.

- Scalable Inception Graph Networks (Rossi et al., 2020): Ω = Ω_s ∪ Ω_p ∪ Ω_t, where Ω_s is a family of simple / normalized adjacency matrices, Ω_p is a family of Personalized-PageRank-based adjacency matrices, and Ω_t is a family of triangle-based adjacency matrices. In addition, writing X̃ = [X̃_1, …, X̃_K], there is ρ(X̃) = σ'( [ σ(X̃_1 W_1), …, σ(X̃_K W_K) ] W' ), with σ and σ' being nonlinear activation functions and { W_k } and W' being trainable weight matrices of suitable dimensions.
Appendix B Equivalence classes induced by GNNs and GAMLPs among real graphs
K             IMDB-BINARY   IMDB-MULTI    REDDIT-BINARY   REDDIT-MULTI-5K   COLLAB
# Graphs      1000          1500          2000            5000              5000
              GNN / GAMLP   GNN / GAMLP   GNN / GAMLP     GNN / GAMLP       GNN / GAMLP
1             51 / 51       49 / 49       781 / 781       1365 / 1365       294 / 294
2             537 / 537     387 / 387     1998 / 1998     4999 / 4999       4080 / 4080
3             537 / 537     387 / 387     1998 / 1998     4999 / 4999       4080 / 4080
ground truth  537           387           1998            4999              4080
Given a space of graphs G and a family F of functions on G, F induces an equivalence relation, which we denote by ∼_F, among graphs in G, such that for G, G' ∈ G, G ∼_F G' if and only if f(G) = f(G') for every f ∈ F. For example, if F is powerful enough to distinguish all pairs of non-isomorphic graphs, then each equivalence class under ∼_F contains exactly one graph up to isomorphism. Thus, by examining the number or sizes of the equivalence classes induced by different families of functions on G, we can evaluate their relative expressive power in a quantitative way.
Hence, we supplement the theoretical result of Proposition 2 with the following numerical results on five real-world graph-prediction datasets. For the graphs in each dataset, we remove the node features and count the total number of equivalence classes among them induced by depth-K GNNs (equivalent to K iterations of the WL test, as discussed in Section 3.2) as well as by GAMLPs with Ω = { Â^k }_{k ∈ [K]} for different values of K. We see from the results in Table 3 that as soon as K ≥ 2, the numbers of equivalence classes induced by GNNs and by the GAMLPs are both close to the total number of graphs up to isomorphism, implying that both are indeed able to distinguish almost all pairs of non-isomorphic graphs among the ones occurring in these datasets.
Appendix C Additional notations
For any k and any rooted graph G[v], define

W_k(G[v]) = { (u_0, u_1, …, u_k) : u_0 = v, and u_{j-1} is adjacent to u_j for all j ∈ [k] },    (7)

W_k^nb(G[v]) = { (u_0, u_1, …, u_k) ∈ W_k(G[v]) : u_{j+1} ≠ u_{j-1} for all j ∈ [k-1] },    (8)

as the sets of walks and non-backtracking walks of length k in G[v] starting from the root node, respectively. Note that when G[v] is a rooted tree, a non-backtracking walk of length k is a path from the root node to a node at depth k. In addition, for a node feature value a and a degree d, define the following subsets of W_k(G[v]) according to the end node of the walk:

W_k^{(a)}(G[v]) = { (u_0, …, u_k) ∈ W_k(G[v]) : x_{u_k} = a },    (9)

W_k^{(d)}(G[v]) = { (u_0, …, u_k) ∈ W_k(G[v]) : d_{u_k} = d },    (10)

W_k^{(a,d)}(G[v]) = { (u_0, …, u_k) ∈ W_k(G[v]) : x_{u_k} = a and d_{u_k} = d }.    (11)

We also define the analogous subsets W_k^{nb,(a)}, W_k^{nb,(d)} and W_k^{nb,(a,d)} of W_k^nb(G[v]) similarly.
Appendix D Illustration of rooted graphs and rooted aggregation trees
Appendix E Proof of Proposition 2
With node features being identical in the random graphs, we take X to be the all-ones vector, X = 1. Consider the operator family Ω = { A, A (n+1)^{-D} }, where (n+1)^{-D} denotes the diagonal matrix with entries (n+1)^{-d_v}. Then

(A 1)_v = d_v,    (12)

and

( A (n+1)^{-D} 1 )_v = Σ_{u ∈ N(v)} (n+1)^{-d_u}.    (13)

Since (4) and (2) together can approximate arbitrary permutation-invariant functions on multisets (Zaheer et al., 2017), if two graphs G and G' cannot be distinguished by the GAMLP with an operator family that includes Ω under any choice of its parameters, then the two multisets of node-wise augmented features coincide, and therefore both of the following hold:

{{ d_v : v ∈ V_G }} = {{ d_v : v ∈ V_{G'} }},    (14)

{{ ( d_v, Σ_{u ∈ N(v)} (n+1)^{-d_u} ) : v ∈ V_G }} = {{ ( d_v, Σ_{u ∈ N(v)} (n+1)^{-d_u} ) : v ∈ V_{G'} }}.    (15)
To see what this means, we need the following two lemmas.
Lemma 1.
Let be the set of all multisets consisting of at most elements, all of which are integers between and . Consider the function defined for multisets . If , then is injective on .
Proof of Lemma 1:
For to be injective on , it suffices to require that , there is , for which it is sufficient to require that , or .
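The exact form of the function is elided in this rendering. One standard injective encoding that fits the lemma's setting, assuming multisets of at most m integers in {0, ..., d}, interprets the multiplicity vector as digits in base m + 1 (the names below are ours):

```python
def encode_multiset(X, m, d):
    """Injectively encode a multiset X of at most m integers in {0, ..., d}.

    The multiplicity vector of X is read as the digits of a base-(m + 1)
    number: since each value occurs at most m times, every digit is below
    the base, so the resulting integer determines the multiset uniquely.
    """
    assert len(X) <= m and all(0 <= x <= d for x in X)
    counts = [0] * (d + 1)
    for x in X:
        counts[x] += 1
    return sum(c * (m + 1) ** v for v, c in enumerate(counts))
```

Injectivity follows exactly as in the proof: two different multisets differ in some multiplicity, so their encodings differ in some base-(m + 1) digit.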
Lemma 2 (Babai et al. (1980), Theorem 1).
Consider the space of graphs with vertices, . There is a subset that contains almost all such graphs (i.e. the fraction converges to as ) such that the following algorithm yields a unique identifier for every graph :
Algorithm 1:
Set , and let be the degree of the node in with the th largest degree. For each node in , define the multiset . Finally, define the multiset associated with , , which is the output of the algorithm. In other words, , and are isomorphic if and only if as multisets. In particular, we can choose such that the top node degrees of every graph in are distinct.

Based on these lemmas, we will show that when and for , (14) and (15) together imply that is isomorphic to . To see this, suppose that (14) and (15) hold. Because of (14), we know that and share the same degree sequence, and hence . Because of (15), we know that there is a bijective map from to such that ,
(16) 
which, by Lemma 1, implies that . We then have , and therefore , which implies that and are isomorphic by Lemma 2. Therefore, by contraposition, if the two graphs are not isomorphic, then (14) and (15) cannot both hold, and hence there exists a choice of parameters for the GAMLP with that makes it return different outputs when applied to the two graphs. This proves Proposition 2.
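A minimal sketch of an identifier in the spirit of Algorithm 1 follows, with the parameter `r` (the number of top-degree nodes used) left as an input since its exact choice is elided above. Degree ties are broken arbitrarily here, which is harmless on the almost-all graphs where the top r degrees are distinct:

```python
from collections import Counter

def bes_identifier(adj, r):
    """Identifier in the style of Algorithm 1 for a graph given as an
    adjacency list; isomorphic graphs get equal identifiers."""
    degs = [len(nbrs) for nbrs in adj]
    # Nodes sorted by decreasing degree; only the top r receive a rank.
    order = sorted(range(len(adj)), key=lambda v: -degs[v])
    rank = {v: i for i, v in enumerate(order[:r])}
    # Each node is summarized by the ranks of the top-degree nodes it is
    # adjacent to; the multiset of summaries identifies the graph.
    summaries = [tuple(sorted(rank[u] for u in adj[v] if u in rank))
                 for v in range(len(adj))]
    return tuple(sorted(Counter(summaries).items()))
```

The identifier is invariant under relabeling because only degree ranks and adjacency to high-degree nodes are used, never the node labels themselves.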
Appendix F Proof of Proposition 3
As argued in the main text, to estimate the number of equivalence classes on induced by GNNs, we need to estimate the number of possible rooted aggregation trees. In particular, to lower-bound the number of equivalence classes on induced by GNNs, we only need to focus on a subset of all possible rooted aggregation trees, namely those in which every node has exactly children. Letting denote the set of all rooted aggregation trees of depth in which each non-leaf node has degree exactly and the node features belong to , we will first prove the following lemma:
Lemma 3.
If , then .
Note that a rooted aggregation tree needs to satisfy the constraint that each of its nodes must have its parent's feature equal to one of its children's features, and so this lower bound is not as straightforward to prove as lower-bounding the total number of rooted subtrees. As argued above, this will allow us to derive Proposition 3.
Proof of Lemma 3: Define . Since , we assume without loss of generality that . To prove a lower bound on the cardinality of , it suffices to restrict our attention to its subset, , in which all nodes have feature either or . Furthermore, it is sufficient to restrict our attention to the subset of consisting of trees that contain all possible types of paths of length from the root to the leaves. Formally, with defined as in Appendix C, we let
(17) 
and it is sufficient to prove a lower bound on the cardinality of . Define to be the set of all binary tuples. By the definition (17), we know that . This means that for every , there exists at least one leaf node in such that the path from the root node to this node consists of a sequence of nodes with features exactly as given by . We call any such node a node under .
We show such a lower bound on the cardinality of inductively. For the base case, we know that consists of all binary-featured depth- rooted trees with at least one leaf node of feature and one leaf node of feature , and hence . Next, we consider the inductive step. For every and every , we can generate rooted aggregation trees belonging to by assigning children of feature or to the leaf nodes of . First, note that from two non-isomorphic rooted aggregation trees and , we obtain non-isomorphic rooted aggregation trees in in this way. Moreover, as we will show next, for every , we can lower-bound the number of distinct rooted aggregation trees belonging to obtained from in this way.
There are many ways to assign the children. To get a lower bound on the cardinality of , we only need to consider a subset of these assignments, namely, those that assign the same number of children with feature to every node under the same . Thus, we let denote the number of children of feature assigned to every node in . Due to the constraint that each node in the rooted aggregation tree must have its parent's feature equal to one of its children's features, not all choices of lead to legitimate rooted aggregation trees. Nonetheless, when restricting to the choices where , we see that every leaf node of gets assigned at least one child of feature and another child of feature , thereby satisfying the constraint above whether its parent has feature or . Moreover, for such choices, the rooted aggregation tree of depth obtained in this way contains all possible paths of length , and therefore belongs to . Hence, it remains to show a lower bound on how many distinct trees in can be obtained in this way from each . Since for such that , a node under is distinguishable from a node under , every legitimate choice of the tuple of integers, , leads to a distinct rooted aggregation tree of depth , and there are such choices. Hence, we have derived that , and therefore .
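The parent-feature constraint invoked in the proof can be checked concretely by unrolling a graph into its rooted aggregation tree; it holds by construction, because each tree node's children are copies of all of its graph neighbors, including the one it came from. The helper names below are ours:

```python
def aggregation_tree(adj, feat, root, depth):
    """Unroll graph `adj` from `root` into its rooted aggregation tree.

    Each tree node is a pair (feature, children); a node's children are
    copies of all its graph neighbors, including its tree parent.
    """
    def unroll(v, d):
        if d == 0:
            return (feat[v], [])
        return (feat[v], [unroll(u, d - 1) for u in adj[v]])
    return unroll(root, depth)

def constraint_holds(tree, parent_feat=None):
    """Check that every non-root, non-leaf node has its parent's feature
    among its children's features, as required of aggregation trees."""
    f, children = tree
    if parent_feat is not None and children:
        if parent_feat not in [c[0] for c in children]:
            return False
    return all(constraint_holds(c, f) for c in children)
```

Running the check on any graph unrolling returns true, which is exactly why the counting argument must restrict to child assignments that keep this property.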
Appendix G Proof of Proposition 4
According to the formula (3), by expanding the matrix product, we have
(18) 
with defined in Appendix C. Hence, for two different nodes in and in ( and can be the same graph), the node-wise outputs of the GAMLP at and will be identical if the rooted graphs and satisfy for every combination of choices of the multiset , the integer and the node feature , under the constraints and . Note that there are at most possible choices of the multiset , choices of and choices of , thereby allowing at most possible combinations. Because of the constraint
(19) 
we see that the total number of equivalence classes on induced by such a GAMLP is upper-bounded by , which is on the order of with growing and bounded. Finally, since the total number of equivalence classes induced by multiple operators can be upper-bounded by the product of the numbers of equivalence classes induced by each operator separately, we derive the proposition as desired.
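The expansion (18) rests on the standard fact that the (u, v) entry of the k-th power of the adjacency matrix counts the walks of length k from u to v; a dependency-free sketch (the function name is ours):

```python
def walk_counts(adj, k):
    """k-th power of the adjacency matrix of graph `adj` (adjacency list),
    computed by repeated multiplication; entry (u, v) counts the walks of
    length k from u to v."""
    n = len(adj)
    A = [[1 if v in adj[u] else 0 for v in range(n)] for u in range(n)]
    P = [[1 if u == v else 0 for v in range(n)] for u in range(n)]  # identity
    for _ in range(k):
        # One multiplication extends every counted walk by a single edge.
        P = [[sum(P[u][w] * A[w][v] for w in range(n)) for v in range(n)]
             for u in range(n)]
    return P
```

On a triangle, for instance, there are two closed walks of length 2 at each node (out and back along either edge), matching the diagonal of the squared adjacency matrix.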
Appendix H Proof of Proposition 5
Consider the set of full -ary rooted trees of depth , , that is, all rooted trees of depth in which the nodes have features belonging to the discrete set and all non-leaf nodes have children. is a subset of , the space of all rooted graphs. If is a function represented by a GAMLP using operators of at most hops, then for , we can write
(20) 
where we denote the node set of by , and the vectors 's depend only on the topological relationship between and the root node. Let denote the set of nodes at depth in . By the assumption that the operators depend only on the graph topology, and thanks to the topological symmetry of such full -ary trees among all nodes at the same depth, we have that and , there is . Thus, we can write
(21) 
for some other set of coefficients ’s, and where is defined in Appendix C. In other words, for two trees and , if , they satisfy