On Graph Neural Networks versus Graph-Augmented MLPs

# On Graph Neural Networks versus Graph-Augmented MLPs

## Abstract

From the perspective of expressive power, this work compares multi-layer Graph Neural Networks (GNNs) with a simplified alternative that we call Graph-Augmented Multi-Layer Perceptrons (GA-MLPs), which first augments node features with certain multi-hop operators on the graph and then applies an MLP in a node-wise fashion. From the perspective of graph isomorphism testing, we show both theoretically and numerically that GA-MLPs with suitable operators can distinguish almost all non-isomorphic graphs, just like the Weisfeiler-Lehman (WL) test. However, by viewing them as node-level functions and examining the equivalence classes they induce on rooted graphs, we prove a separation in expressive power between GA-MLPs and GNNs that grows exponentially in depth. In particular, unlike GNNs, GA-MLPs are unable to count the number of attributed walks. We also demonstrate via community detection experiments that GA-MLPs can be limited by their choice of operator family, as compared to GNNs with higher flexibility in learning.

\* Equal contributions.

## 1 Introduction

While multi-layer Graph Neural Networks (GNNs) have gained popularity for their applications in various fields, authors have recently started to investigate what their true advantages over baselines are, and whether they can be simplified. On one hand, GNNs based on neighborhood-aggregation allow the combination of information present at different nodes, and by increasing the depth of such GNNs, we increase the size of the receptive field. On the other hand, it has been pointed out that deep GNNs can suffer from issues including over-smoothing, exploding or vanishing gradients in training, as well as bottleneck effects (Kipf and Welling, 2016; Li et al., 2018; Luan et al., 2019; Oono and Suzuki, 2019; Rossi et al., 2020; Alon and Yahav, 2020).

Recently, a series of models have attempted to relieve these issues of deep GNNs while retaining their benefit of combining information across nodes, using the approach of first augmenting the node features by propagating the original node features through powers of graph operators such as the (normalized) adjacency matrix, and then applying a node-wise function to the augmented node features, usually realized by a Multi-Layer Perceptron (MLP) (Wu et al., 2019; NT and Maehara, 2019; Chen et al., 2019a; Rossi et al., 2020). Because of the usage of graph operators for augmenting the node features, we will refer to such models as Graph-Augmented MLPs (GA-MLPs). These models have achieved competitive performance on various tasks, and moreover enjoy better scalability, since the augmented node features can be computed during preprocessing (Rossi et al., 2020). Thus, it becomes natural to ask what advantages GNNs have over GA-MLPs.

In this work, we ask whether GA-MLPs sacrifice expressive power compared to GNNs while gaining these advantages. A popular measure of the expressive power of GNNs is their ability to distinguish non-isomorphic graphs (Hamilton et al., 2017; Xu et al., 2018a; Morris et al., 2019). In our work, besides studying the expressive power of GA-MLPs from the viewpoint of graph isomorphism tests, we propose a new perspective that better suits the setting of node-prediction tasks: we analyze the expressive power of models including GNNs and GA-MLPs as node-level functions, or equivalently, as functions on rooted graphs. Under this perspective, we prove an exponential-in-depth gap between the expressive powers of GNNs and GA-MLPs. We illustrate this gap by finding a broad family of intuitive functions, based on counting attributed walks on the graph, that can provably be approximated by GNNs but not GA-MLPs. Moreover, via the task of community detection, we show a lack of flexibility of GA-MLPs compared to GNNs in learning the best operators to use.

In summary, our main contributions are:

• Finding graph pairs that several GA-MLPs cannot distinguish while GNNs can, but also proving there exist simple GA-MLPs that distinguish almost all non-isomorphic graphs.

• From the perspective of approximating node-level functions, proving an exponential gap between the expressive power of GNNs and GA-MLPs in terms of the equivalence classes on rooted graphs that they induce.

• Showing that the functions that count a particular type of attributed walk among nodes can be approximated by GNNs but not GA-MLPs both in theory and numerically.

• Through community detection tasks, demonstrating the limitations of GA-MLPs due to the choice of the operator family.

## 2 Related Works

#### Depth in GNNs

Kipf and Welling (2016) observe that the performance of Graph Convolutional Networks (GCNs) degrades as the depth grows too large, and the best performance is achieved with two or three layers. Along the spectral perspective on GNNs (Bruna et al., 2013; Defferrard et al., 2016; Bronstein et al., 2017; NT and Maehara, 2019), Li et al. (2018) and Wu et al. (2019) explain the failure of deep GCNs by the over-smoothing of the node features. Oono and Suzuki (2019) show an exponential loss of expressive power as the depth in GCNs increases, in the sense that the hidden node states tend to converge to Laplacian sub-eigenspaces as the depth increases to infinity. Alon and Yahav (2020) show an over-squashing effect of deep GNNs, in the sense that the width of the hidden states needs to grow exponentially in the depth in order to retain all information about long-range interactions. In comparison, our work focuses on more general GNNs based on neighborhood-aggregation that are not limited in the hidden state widths, and demonstrates their advantage in expressive power compared to GA-MLP models at finite depth, in terms of distinguishing rooted graphs for node-prediction tasks. On the other hand, there exist examples of useful deep GNNs. Chen et al. (2019b) apply multi-layer GNNs to community detection problems, using a family of multi-scale operators as well as normalization steps (Ioffe and Szegedy, 2015; Ulyanov et al., 2016). Recently, Li et al. (2019, 2020a) and Chen et al. (2020a) build deeper GCN architectures with the help of various residual connections (He et al., 2016) and normalization techniques to achieve impressive results on standard datasets, which further highlights the need to study the role of depth in GNNs.

#### Existing GA-MLP-type models

Motivated by better understanding GNNs as well as enhancing computational efficiency, several models of the GA-MLP type have been proposed and they achieve competitive performances on various datasets. Wu et al. (2019) propose the Simple Graph Convolution (SGC), which removes the intermediary weights and nonlinearities in GCNs. Chen et al. (2019a) propose the Graph Feature Network (GFN), which further adds intermediary powers of the normalized adjacency matrix to the operator family and is applied to graph-prediction tasks. NT and Maehara (2019) propose the Graph Filter Neural Networks (gfNN), which enhances the SGC in the final MLP step. Rossi et al. (2020) propose Scalable Inception Graph Neural Networks (SIGNs), which augments the operator family with Personalized-PageRank-based (Klicpera et al., 2018, 2019) and triangle-based (Monti et al., 2018; Chen et al., 2019b) adjacency matrices.

#### Expressive Power of GNNs

Xu et al. (2018a) and Morris et al. (2019) show that GNNs based on neighborhood-aggregation are no more powerful than the Weisfeiler-Lehman (WL) test for graph isomorphism (Weisfeiler and Leman, 1968), in the sense that these GNNs cannot distinguish between any pair of non-isomorphic graphs that the WL test cannot distinguish. They also propose models that match the expressive power of the WL test. Since then, many attempts have been made to build GNN models whose expressive power is not limited by WL (Morris et al., 2019; Maron et al., 2019; Chen et al., 2019c; Morris and Mutzel, 2019; You et al., 2019; Bouritsas et al., 2020; Li et al., 2020b; Flam-Shepherd et al., 2020; Sato et al., 2019, 2020). Other perspectives for understanding the expressive power of GNNs include function approximation (Maron et al., 2019; Chen et al., 2019c; Keriven and Peyré, 2019), substructure counting (Chen et al., 2020b), Turing universality (Loukas, 2019) and the determination of graph properties (Garg et al., 2020). Sato (2020) provides a survey on these topics. In this paper, besides studying the expressive power of GA-MLPs along the line of graph isomorphism tests, we propose a new perspective of approximating functions on rooted graphs, which is motivated by node-prediction tasks, and show a gap between GA-MLPs and GNNs that grows exponentially in the size of the receptive field in terms of the equivalence classes that they induce on rooted graphs.

## 3 Background

### 3.1 Notations

Let $G = (V, E)$ denote a graph, with $V$ being the vertex set and $E$ being the edge set. Let $n$ denote the number of nodes in $G$, $A$ denote the adjacency matrix, $D$ denote the diagonal degree matrix with $D_{ii} = d_i$ being the degree of node $i$, and $\tilde{A} = D^{-1/2} A D^{-1/2}$ denote the symmetrically normalized adjacency matrix. Let $X$ denote the matrix of node features, where $X_i$ denotes the $d$-dimensional feature that node $i$ possesses. For a node $i \in V$, let $N(i)$ denote the set of neighbors of $i$. We assume that the edges do not possess features. In a node prediction task, each node $i$ has a label $y_i$.

For a positive integer $k$, we let $[k] = \{1, \ldots, k\}$. We use $\{\cdot\}_m$ to denote a multiset, which allows repeated elements. We say a function $g(k)$ is doubly-exponential in $k$ if $\log \log g(k)$ grows linearly in $k$, and poly-exponential in $k$ if $\log g(k)$ grows polynomially in $k$, as $k$ tends to infinity.

### 3.2 Graph Neural Networks (GNNs)

Following the notations in Xu et al. (2018a), we consider $K$-layer GNNs defined generically as follows. For $k \in [K]$, we compute the hidden node states iteratively as

$$M_i^{(k)} = \mathrm{AGGREGATE}^{(k)}\big(\{H_j^{(k-1)} : j \in N(i)\}_m\big), \qquad H_i^{(k)} = \mathrm{COMBINE}^{(k)}\big(H_i^{(k-1)}, M_i^{(k)}\big), \tag{1}$$

where we set $H^{(0)} = X$ to be the node features. If a graph-level output is desired, we finally let

$$\hat{y} = \mathrm{READOUT}\big(\{H_i^{(K)} : i \in V\}_m\big). \tag{2}$$

Different choices of the trainable COMBINE, AGGREGATE and READOUT functions result in different GNN models, though usually AGGREGATE and READOUT are chosen to be permutation-invariant. As graph-level functions, it is shown in Xu et al. (2018a) and Morris et al. (2019) that the maximal expressive power of models of this type coincides with running $K$ iterations of the WL test for graph isomorphism, in the sense that any two non-isomorphic graphs that cannot be distinguished by the latter cannot be distinguished by $K$-layer GNNs, either. For this reason, we will not distinguish between GNNs and WL in discussions on expressive power.
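The neighborhood-aggregation update in Eq. (1) can be sketched in a few lines. This is a minimal illustration with fixed (not learned) choices, AGGREGATE as an elementwise sum and COMBINE as a weighted sum followed by a ReLU; all function and variable names here are ours, not from the paper.

```python
def gnn_layer(adj, feats):
    """One round of H_i <- COMBINE(H_i, AGGREGATE({H_j : j in N(i)}))."""
    n = len(adj)
    out = []
    for i in range(n):
        # AGGREGATE: sum the hidden states of i's neighbors (a multiset operation).
        msg = [0.0] * len(feats[0])
        for j in range(n):
            if adj[i][j]:
                msg = [m + f for m, f in zip(msg, feats[j])]
        # COMBINE: a fixed weighted sum of the self state and the message, plus ReLU.
        combined = [max(0.0, 0.5 * h + 1.0 * m) for h, m in zip(feats[i], msg)]
        out.append(combined)
    return out

# Toy graph: a path 0 - 1 - 2, with scalar node features.
adj = [[0, 1, 0],
       [1, 0, 1],
       [0, 1, 0]]
H0 = [[1.0], [0.0], [1.0]]
H1 = gnn_layer(adj, H0)  # the middle node receives messages from both endpoints
```

Stacking $K$ such layers gives each node a depth-$K$ receptive field, which is the quantity compared against GA-MLPs below.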

### 3.3 Graph-Augmented Multi-Layer Perceptrons (GA-MLPs)

Let $\Omega = \{\omega_1(A), \ldots, \omega_K(A)\}$ be a set of (usually multi-hop) linear operators that are functions of the adjacency matrix $A$. Common choices of the operators are powers of the (normalized) adjacency matrix, and several particular choices of $\Omega$ that give rise to existing GA-MLP models are listed in Appendix A. In its most general form, a GA-MLP first computes a series of augmented features via

$$\tilde{X}_k = \omega_k(A) \cdot \varphi(X), \tag{3}$$

with $\varphi$ being a learnable function acting as a feature transformation applied to each node separately. It can be realized by an MLP, e.g. $\varphi(X_i) = W_2\,\sigma(W_1 X_i)$, where $\sigma$ is a nonlinear activation function and $W_1, W_2$ are trainable weight matrices of suitable dimensions. Next, the model concatenates $\tilde{X}_1, \ldots, \tilde{X}_K$ into $\tilde{X} = [\tilde{X}_1, \ldots, \tilde{X}_K]$, and computes

$$Z = \rho(\tilde{X}), \tag{4}$$

where $\rho$ is also a learnable node-wise function, again usually realized by an MLP. If a graph-level output is desired, we can also add a READOUT function as in (2).

A simplified version of the model sets $\varphi$ to be the identity function, in which case (3) and (4) can be written together as

$$Z = \rho\big([\omega_1(A) \cdot X, \ldots, \omega_K(A) \cdot X]\big). \tag{5}$$

Such a simplification improves computational efficiency, since the matrix products can be pre-computed before training (Rossi et al., 2020). Since we are mostly interested in an upper bound on the expressive power of GA-MLPs, we will work with the more general update rule (3) in this paper, but the lower-bound result in Proposition 2 remains valid even when we restrict to the subset of models where $\varphi$ is taken to be the identity function.
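The simplified form of Eq. (5) can be sketched as follows, with $\omega_k(A) = A^k$ as an illustrative operator family and $\rho$ left as a downstream node-wise function. Names such as `ga_mlp_features` and `matmul` are ours.

```python
def matmul(M, N):
    """Plain dense matrix product on nested lists."""
    return [[sum(M[i][k] * N[k][j] for k in range(len(N)))
             for j in range(len(N[0]))] for i in range(len(M))]

def ga_mlp_features(A, X, K):
    """Return the per-node concatenation [X_i, (A X)_i, ..., (A^K X)_i]."""
    feats = [list(row) for row in X]  # the k = 0 term
    AkX = X
    for _ in range(K):
        AkX = matmul(A, AkX)          # precomputable once, before training
        for i, row in enumerate(AkX):
            feats[i] = feats[i] + list(row)
    return feats

# Toy graph: node 0 connected to nodes 1 and 2.
A = [[0, 1, 1],
     [1, 0, 0],
     [1, 0, 0]]
X = [[1.0], [2.0], [3.0]]
Z = ga_mlp_features(A, X, 2)  # each node gets a 3-dimensional augmented feature
```

Because the augmented features depend on the graph only through fixed matrix products, the graph can be discarded once they are computed, which is the source of the scalability advantage discussed above.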

## 4 Expressive Power as Graph Isomorphism Tests

We first study the expressive power of GA-MLPs via their ability to distinguish non-isomorphic graphs. It is not hard to see that when $\Omega = \{I, \tilde{A}, \ldots, \tilde{A}^K\}$, where $\tilde{A} = D^{-\alpha} A D^{-\beta}$ for any $\alpha, \beta \geq 0$ generalizes the normalized adjacency matrix, this is upper-bounded by the power of $K$ iterations of WL. We next ask whether it can fall strictly below. Indeed, for two common choices of $\Omega$, we can find concrete examples: 1) If $\Omega$ consists of integer powers of any normalized adjacency matrix of the form $D^{-\alpha} A D^{-(1-\alpha)}$ for some $\alpha \in [0, 1]$, then it is apparent that the GA-MLP cannot distinguish any pair of regular graphs of the same size but with different node degrees; 2) If $\Omega$ consists of integer powers of the adjacency matrix $A$, then the model cannot distinguish between the pair of graphs shown in Figure 1, which can be distinguished by the WL test. The proof of the latter result is given in Appendix J. Together, we summarize the results as:

###### Proposition 1.

If $\Omega = \{\omega(A)^k\}_{k \in \mathbb{N}}$, with either $\omega(A) = A$ or $\omega(A) = D^{-\alpha} A D^{-(1-\alpha)}$ for some $\alpha \in [0, 1]$, there exists a pair of graphs which can be distinguished by GNNs but not by this GA-MLP.
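The first counterexample above can be checked numerically: with constant node features, every regular graph is a fixed point of the degree-normalized adjacency operator, so the augmented features of a 2-regular and a 3-regular graph on six nodes coincide, while their degrees (and hence one WL iteration) differ. The specific graph choices ($C_6$ and $K_{3,3}$) and all names below are ours.

```python
def normalized_adj(A):
    """Symmetric normalization D^{-1/2} A D^{-1/2} (alpha = 1/2 case)."""
    deg = [sum(row) for row in A]
    return [[A[i][j] / (deg[i] ** 0.5 * deg[j] ** 0.5) if A[i][j] else 0.0
             for j in range(len(A))] for i in range(len(A))]

def apply_op(M, x):
    return [sum(M[i][j] * x[j] for j in range(len(x))) for i in range(len(M))]

def augmented(A, K):
    """Multiset of augmented features [(Ãx)_i, ..., (Ã^K x)_i] for x = all-ones."""
    M, x, cols = normalized_adj(A), [1.0] * len(A), []
    for _ in range(K):
        x = apply_op(M, x)
        cols.append([round(v, 9) for v in x])
    return sorted(zip(*cols))

C6 = [[1 if j in ((i - 1) % 6, (i + 1) % 6) else 0 for j in range(6)]
      for i in range(6)]                        # 2-regular: a hexagon
K33 = [[1 if (i < 3) != (j < 3) else 0 for j in range(6)]
       for i in range(6)]                       # 3-regular: complete bipartite

print(augmented(C6, 3) == augmented(K33, 3))    # identical feature multisets
```

Since $\tilde{A} \mathbf{1} = \mathbf{1}$ on any regular graph, all augmented features equal one for both graphs, whereas their degree sequences already differ.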

Nonetheless, if we focus not on particular counterexamples but rather on the average performance in distinguishing random graphs, it is not hard for GA-MLPs to reach the same level as WL, which is known to distinguish almost all pairs of random graphs under a uniform distribution (Babai et al., 1980). Specifically, building on the results in Babai et al. (1980), we have

###### Proposition 2.

For all $n$, there exists a choice of operators such that any GA-MLP with $\{D, D^{-\beta} A D^{-\alpha}\} \subseteq \Omega$, for $\alpha$ sufficiently large, can distinguish almost all pairs of non-isomorphic graphs of at most $n$ nodes, in the sense that the fraction of graphs on which such a GA-MLP fails to test isomorphism tends to $0$ as $n \to \infty$.

The hypothesis that distinguishing non-isomorphic graphs is not difficult on average for either GNNs or GA-MLPs is further supported by the numerical results provided in Appendix B, in which we count the number of equivalence classes that either of them induces on graphs that occur in real-world datasets. This further raises the question of whether graph isomorphism tests alone suffice as a criterion for comparing the expressive power of models on graphs, which leads us to the explorations in the next section.

Lastly, we remark that with suitable choices of operators in $\Omega$, it is possible for GA-MLPs to go beyond the power of WL. For example, if $\Omega$ contains the power graph adjacency matrix introduced in Chen et al. (2019b), then the GA-MLP can distinguish between a hexagon and a pair of triangles, which WL cannot distinguish.

## 5 Expressive Power as Functions on Rooted Graphs

To study the expressive power beyond graph isomorphism tests, we consider the setting of node-wise prediction tasks, for which the final readout step (2) is dropped in both GNNs and GA-MLPs. Whether the learning setup is transductive or inductive, we can consider the models as functions on rooted graphs, or egonets (Preciado and Jadbabaie, 2010), which are graphs with one node designated as the root. For example, if $i_1, \ldots, i_m$ are nodes in graphs $G_1, \ldots, G_m$ (not necessarily distinct) with node-level labels $y_1, \ldots, y_m$ known during training, respectively, then the goal is to fit a function to the input-output pairs $\{(G_j[i_j], y_j)\}_{j \in [m]}$, where we use $G[i]$ to denote the rooted graph with $G$ being the graph and $i$ the node in $G$ designated as the root. Thus, we can evaluate the expressive power of GNNs and GA-MLPs by their ability to approximate functions on the space of rooted graphs.

To do so, we introduce a notion of induced equivalence relations on rooted graphs, analogous to the equivalence relations on graphs used in Section 4. Given a family $\mathcal{F}$ of functions on rooted graphs, we can define an equivalence relation among all rooted graphs such that $G[i] \sim G'[i']$ if and only if $f(G[i]) = f(G'[i'])$ for all $f \in \mathcal{F}$. By examining the number and sizes of the induced equivalence classes of rooted graphs, we can evaluate the relative expressive power of different families of functions in a quantitative way.

In the rest of this section, we assume that the node features belong to a finite alphabet and that all node degrees are bounded by a constant. Firstly, GNNs are known to distinguish neighborhoods up to the rooted aggregation tree, which can be obtained by unrolling the neighborhood aggregation steps in the GNNs as well as the WL test (Xu et al., 2018a; Morris et al., 2019; Garg et al., 2020). The depth-$K$ rooted aggregation tree of a rooted graph $G[i]$ is a depth-$K$ rooted tree with a (possibly many-to-one) mapping from every node in the tree to some node in $G$, where (i) the root of the tree is mapped to node $i$, and (ii) the children of every node $s$ in the tree are mapped to the neighbors of the node in $G$ to which $s$ is mapped. An illustration of rooted graphs and rooted aggregation trees is given in Figure 4. Hence, each equivalence class induced by the family of all depth-$K$ GNNs consists of all rooted graphs that share the same depth-$K$ rooted aggregation tree. Therefore, to estimate the number of equivalence classes induced by GNNs, we need to estimate the number of possible rooted aggregation trees, which is given by Lemma 3 in Appendix F. We thus derive the following lower bound on the number of equivalence classes of rooted graphs that depth-$K$ GNNs induce:
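For featureless graphs, the rooted aggregation tree described above can be built by a short recursion, representing each tree as a nested sorted tuple so that equal multisets of subtrees compare equal. The example graphs (a 4-cycle and the center of a 5-path) and all names are ours; it illustrates how two non-isomorphic rooted graphs can share the depth-2 tree yet be separated at depth 3.

```python
def aggregation_tree(adj, root, K):
    """Depth-K rooted aggregation tree as a canonical nested tuple."""
    if K == 0:
        return ()
    # Children are the depth-(K-1) trees of the root's neighbors, as a multiset.
    return tuple(sorted(aggregation_tree(adj, j, K - 1)
                        for j in range(len(adj)) if adj[root][j]))

C4 = [[1 if (abs(i - j) % 4) in (1, 3) else 0 for j in range(4)]
      for i in range(4)]                        # 4-cycle
P5 = [[1 if abs(i - j) == 1 else 0 for j in range(5)]
      for i in range(5)]                        # path on 5 nodes

same_at_2 = aggregation_tree(C4, 0, 2) == aggregation_tree(P5, 2, 2)
same_at_3 = aggregation_tree(C4, 0, 3) == aggregation_tree(P5, 2, 3)
```

Here `same_at_2` holds while `same_at_3` does not: deeper unrolling refines the equivalence classes, which is exactly the growth quantified in Proposition 3.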

###### Proposition 3.

Assume that the node features take at least two values and that the maximum node degree is at least three. The total number of equivalence classes of rooted graphs induced by GNNs of depth $K$ grows at least doubly-exponentially in $K$.

In comparison, we next demonstrate that the equivalence classes induced by GA-MLPs are much coarser. To see this, let us first consider the example where we take $\Omega = \{\tilde{A}^k\}_{k \in \mathbb{N}}$, in which $\tilde{A} = D^{-\alpha} A D^{-\beta}$ with any $\alpha, \beta \geq 0$ is a generalization of the normalized adjacency matrix. From formula (3), by expanding the matrix product, we have

$$(\tilde{A}^k \varphi(X))_i = \sum_{(i_1, \ldots, i_k) \in W_k(G[i])} d_i^{-\alpha}\, d_{i_1}^{-(\alpha+\beta)} \cdots d_{i_{k-1}}^{-(\alpha+\beta)}\, d_{i_k}^{-\beta}\, \varphi(X_{i_k}), \tag{6}$$

where we define $W_k(G[i])$ to be the set of all walks of length $k$ in the rooted graph $G[i]$ starting from the root node $i$ (an illustration is given in Figure 2). Thus, the $k$th augmented feature of node $i$ is completely determined by the number of each "type" of walk of length $k$ in $G[i]$, where the type of a walk $(i_1, \ldots, i_k)$ is determined jointly by the multiset of degrees along the walk, as well as the degree and the node feature of the end node, $d_{i_k}$ and $X_{i_k}$. Hence, to prove an upper bound on the total number of equivalence classes of rooted graphs induced by such a GA-MLP, it is sufficient to upper-bound the total number of possibilities of assigning the counts of all types of walks in a rooted graph. This allows us to derive the following result, which we prove in Appendix G.
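The walk expansion in Eq. (6) can be verified numerically: the sketch below compares $k$ applications of $\tilde{A} = D^{-\alpha} A D^{-\beta}$ (with the illustrative choice $\alpha = \beta = 1/2$ and $\varphi$ the identity) against a brute-force sum over length-$k$ walks. The toy graph and helper names (`tilde`, `walk_sum`) are ours.

```python
from itertools import product

A = [[0, 1, 1, 0],
     [1, 0, 1, 0],
     [1, 1, 0, 1],
     [0, 0, 1, 0]]
x = [1.0, 2.0, 3.0, 4.0]
deg = [sum(row) for row in A]
alpha = beta = 0.5
n, k = len(A), 3

def tilde(v):
    """One application of Ã = D^{-alpha} A D^{-beta}."""
    return [deg[i] ** -alpha * sum(A[i][j] * deg[j] ** -beta * v[j]
                                   for j in range(n)) for i in range(n)]

def walk_sum(i):
    """Right-hand side of Eq. (6): sum of per-walk weights from node i."""
    total = 0.0
    for w in product(range(n), repeat=k):       # candidate walks (i1, ..., ik)
        steps = (i,) + w
        if all(A[steps[t]][steps[t + 1]] for t in range(k)):
            weight = deg[i] ** -alpha * deg[w[-1]] ** -beta
            for j in w[:-1]:
                weight *= deg[j] ** -(alpha + beta)
            total += weight * x[w[-1]]
    return total

lhs = x
for _ in range(k):
    lhs = tilde(lhs)                            # left-hand side: (Ã^k x)_i
assert all(abs(lhs[i] - walk_sum(i)) < 1e-9 for i in range(n))
```

The agreement confirms that the augmented feature of a node sees the graph only through weighted walk counts, which is what the upper bound in Proposition 4 exploits.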

###### Proposition 4.

Fix $\Omega = \{\tilde{A}^k\}_{k \leq K}$, where $\tilde{A} = D^{-\alpha} A D^{-\beta}$ for some $\alpha, \beta \geq 0$. Then the total number of equivalence classes of rooted graphs induced by such GA-MLPs is poly-exponential in $K$.

Compared with Proposition 3, this shows that the number of equivalence classes of rooted graphs induced by such GA-MLPs is exponentially smaller than the number induced by GNNs. In addition, as the other side of the same coin, these results also indicate the complexity of these hypothesis classes. Building on the results in Chen et al. (2019c, 2020b) on the equivalence between distinguishing non-isomorphic graphs and approximating arbitrary permutation-invariant functions on graphs by neural networks, and by the definition of VC dimension (Vapnik and Chervonenkis, 1971; Mohri et al., 2018), we conclude that

###### Corollary 1.

The VC dimension of GNNs with $K$ layers as functions on rooted graphs grows at least doubly-exponentially in $K$; fixing $\Omega = \{\tilde{A}^k\}_{k \leq K}$, the VC dimension of GA-MLPs with this $\Omega$ as functions on rooted graphs is at most poly-exponential in $K$.

Meanwhile, for more general operators, we can show that the equivalence classes induced by GA-MLPs are coarser than those induced by GNNs, at least under certain measures. For instance, the pair of rooted graphs in Figure 2 belongs to the same equivalence class induced by any GA-MLP (as we prove in Appendix H) but to different equivalence classes induced by GNNs. Rigorously, we characterize such a gap in expressive power by finding certain equivalence classes induced by GA-MLPs that each intersect many equivalence classes induced by GNNs. In particular, we have the following general result, which we prove in Appendix H:

###### Proposition 5.

If $\Omega$ is any family of equivariant linear operators on the graph that only depend on the graph topology within $K$ hops, then there exist exponentially-in-$K$ many equivalence classes of rooted graphs induced by the GA-MLPs with this $\Omega$, each of which intersects doubly-exponentially-in-$K$ many equivalence classes induced by depth-$K$ GNNs, under the same assumptions on the feature alphabet and maximum degree as in Proposition 3. Conversely, in contrast, if $\Omega = \{\tilde{A}^k\}_{k \leq K}$, in which $\tilde{A} = D^{-\alpha} A D^{-\beta}$ with any $\alpha, \beta \geq 0$, then each equivalence class induced by depth-$K$ GNNs is contained in a single equivalence class induced by the GA-MLPs with this $\Omega$.

In essence, this result establishes that GA-MLPs can express exponentially fewer functions than GNNs with an equivalent receptive field. Taking a step further, we can find explicit functions on rooted graphs that can be approximated by GNNs but not GA-MLPs. In the framework that we have developed so far, this occurs when the image of each equivalence class induced by GNNs under this function contains a single value, whereas the image of some equivalence class induced by GA-MLPs contains multiple values. Inspired by the proofs of the results above, a natural candidate is the family of functions that count the number of walks of a particular type in the rooted graph. We can establish the following result, which we prove in Appendix I:

###### Proposition 6.

For any sequence of node features $x_1, x_2, \ldots$, consider the sequence of functions $f_K(G[i]) = |W_K(G[i]; (x_1, \ldots, x_K))|$ counting attributed walks, with the notation of Appendix C. For all $K$, the image under $f_K$ of every equivalence class induced by depth-$K$ GNNs contains a single value, while for any GA-MLP using equivariant linear operators that only depend on the graph topology, there exist exponentially-in-$K$ many equivalence classes induced by this GA-MLP whose image under $f_K$ contains exponentially-in-$K$ many values.

In words, there exist graph instances where the attributed walk counting function takes different values, yet no GA-MLP model can tell them apart, and there are exponentially many of these instances as the number of hops increases. This suggests the possibility of lower-bounding the average approximation error for certain functions by GA-MLPs under various random graph families, which we leave for future work.

## 6 Experiments

The baseline GA-MLP models we consider have operator families consisting of powers of the normalized adjacency matrix up to a certain order $K$. For the experiments in Section 6.3, due to the large depth as well as the analogy with spectral methods (Chen et al., 2019b), we use instance normalization (Ulyanov et al., 2016). Further details are described in Appendix K.

### 6.1 Number of equivalence classes of rooted graphs

Motivated by Propositions 3 and 4, we numerically count the number of equivalence classes induced by GNNs and GA-MLPs among the rooted graphs found in actual graphs with node features removed. For depth-$K$ GNNs, we implement a WL-like process with hash functions to map the depth-$K$ egonet associated with each node to a string before comparing across nodes. For the baseline GA-MLPs, we compare the augmented features of each egonet computed via (3). From the results in Table 2, we see that, indeed, the number of equivalence classes induced by the GA-MLPs is smaller than that induced by GNNs. The contrast is much more visible than the difference in the number of graph-level equivalence classes given in Appendix B.
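The WL-like hashing procedure described above can be sketched as follows: each node's signature is iteratively replaced by a hash of its own signature together with the multiset of its neighbors' signatures, and the number of distinct signatures after $K$ rounds estimates the number of depth-$K$ equivalence classes. Function names are ours.

```python
def wl_signatures(adj, K):
    """K rounds of WL-style refinement on a featureless graph."""
    n = len(adj)
    sig = [0] * n  # identical starting signatures (features removed)
    for _ in range(K):
        sig = [hash((sig[i], tuple(sorted(sig[j] for j in range(n)
                                          if adj[i][j]))))
               for i in range(n)]
    return sig

# On a path 0-1-2-3, refinement separates endpoints from interior nodes.
path = [[1 if abs(i - j) == 1 else 0 for j in range(4)] for i in range(4)]
classes = len(set(wl_signatures(path, 2)))
```

By symmetry of the path, `classes` is 2: the two endpoints fall in one class and the two interior nodes in another, at any depth.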

### 6.2 Counting attributed walks

Motivated by Proposition 6, we test the ability of GNNs and GA-MLPs to count the number of walks of a particular type in synthetic data. We take graphs from the Cora dataset (with node features removed) as well as generate a random regular graph (RRG). We assign the node feature blue to all nodes with even index and the node feature red to all nodes with odd index, encoded as one-hot vectors. On the Cora graph, a node $i$'s label is given by the number of walks of the type blue→blue→blue starting from $i$. On the RRG, the label is given by the number of walks of the type blue→blue→blue→blue starting from $i$. The nodes of each graph are split between training and testing. We test two GA-MLP models, one with as many powers of the operator as the walk length and the other with twice as many operators, and compare their performance against that of the Graph Isomorphism Network (GIN), a GNN model shown to achieve the expressive power of the WL test (Xu et al., 2018a). From the results in Table 2, we see that GIN significantly outperforms the GA-MLPs in both training and testing on both graphs, which is consistent with the theoretical result in Proposition 6 that GNNs can count attributed walks while GA-MLPs cannot. Thus, this points to an intuitive task that lies in the gap of expressive power between GNNs and GA-MLPs.
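Ground-truth labels of the kind used above can be generated exactly by a short dynamic program over the color pattern, processed from the last step backwards. The toy graph, colors, and names below are ours, not the experimental setup of the paper.

```python
def count_attributed_walks(adj, colors, pattern):
    """label[v] = number of walks v -> i1 -> ... -> ik with
    (colors[i1], ..., colors[ik]) == pattern."""
    n = len(adj)
    f = [1] * n  # base case: nothing left to match
    for col in reversed(pattern):
        # f[i] becomes the number of ways to continue from i matching the suffix.
        f = [sum(f[j] for j in range(n) if adj[i][j] and colors[j] == col)
             for i in range(n)]
    return f

# Toy graph: triangle 0-1-2 plus a pendant node 3 attached to node 0.
adj = [[0, 1, 1, 1],
       [1, 0, 1, 0],
       [1, 1, 0, 0],
       [1, 0, 0, 0]]
colors = ["blue", "blue", "red", "blue"]
labels = count_attributed_walks(adj, colors, ("blue", "blue"))
```

For example, node 2 has label 3 here, via the walks 2→0→1, 2→0→3, and 2→1→0 (backtracking is allowed, matching the walk definition in Appendix C).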

### 6.3 Community detection on Stochastic Block Models (SBM)

We use the task of community detection to illustrate another limitation of GA-MLP models: a lack of flexibility in learning the family of operators. The SBM is a random graph model in which nodes are partitioned into underlying communities, and each edge is drawn independently with a probability that depends only on whether the pair of nodes belong to the same community. The task of community detection is then to recover the community assignments from the connectivity pattern. We focus on the binary SBM (that is, with two underlying communities) in the sparse regime, where it is known that the hardness of detecting communities is characterized by a signal-to-noise ratio (SNR) that is a function of the in-group and out-group connectivities (Abbe, 2017). We select several pairs of in-group and out-group connectivities, resulting in different hardness levels of the task.
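A minimal sampler for the binary SBM just described takes a within-community edge probability `p` and an across-community probability `q`; in the sparse regime these scale like $a/n$ and $b/n$ so that average degrees stay bounded. The parameter values and names below are illustrative, not the paper's experimental settings.

```python
import random

def sample_sbm(n, p, q, seed=0):
    """Sample a binary SBM adjacency matrix and its community labels."""
    rng = random.Random(seed)
    label = [0 if i < n // 2 else 1 for i in range(n)]
    A = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            prob = p if label[i] == label[j] else q
            if rng.random() < prob:
                A[i][j] = A[j][i] = 1
    return A, label

# Sparse regime: p = a/n, q = b/n with constants a > b.
A, y = sample_sbm(200, 10 / 200, 2 / 200)
```

Detection is then the problem of recovering `y` (up to a global flip) from `A` alone.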

Among the many different approaches to community detection, spectral methods are particularly worth mentioning here: they usually aim at finding an eigenvector of a certain operator that is correlated with the community assignment, such as the eigenvector associated with the second-largest eigenvalue of the adjacency matrix, or with the second-smallest eigenvalue of the Laplacian matrix or of the Bethe Hessian matrix (Krzakala et al., 2013). In particular, the Bethe Hessian matrix is known to be asymptotically optimal in the hard regime, provided that a data-dependent parameter is known. Note that spectral methods bear close resemblance to GA-MLPs and GNNs. In particular, Chen et al. (2019b) propose a spectral GNN (sGNN) model for community detection that can be viewed as a learnable generalization of power iterations on a collection of operators. Further details on the Bethe Hessian and sGNN are provided in Appendix K.

We first compare two variants of GA-MLP models: the baseline GA-MLP with powers of the normalized adjacency matrix, and a variant whose operator family is generated from the Bethe Hessian matrix, with the data-dependent parameter given by an oracle, up to the same order. From Figure 3, we see that the latter consistently outperforms the former, indicating the importance of the choice of the operators for GA-MLPs. Meanwhile, we also test a variant of sGNN that is only based on powers of the normalized adjacency matrix and has the same receptive field as the baseline GA-MLP (further details given in Appendix K). We see that its performance is comparable to that of the Bethe-Hessian-based GA-MLP. Thus, this demonstrates a scenario in which GA-MLPs with common choices of operators do not work well, but there exists some a priori unknown choice of operators with which a GA-MLP can achieve good performance. In contrast, a GNN model does not need to rely on the knowledge of such an oracle set of operators, demonstrating its flexibility in learning.

## 7 Conclusions

We have studied the separation in terms of representation power between GNNs and a popular alternative that we coined GA-MLPs. This latter family is appealing due to its computational scalability and its conceptual simplicity, whereby the role of topology is reduced to creating 'augmented' node features that are then fed into a generic MLP. Our results show that while GA-MLPs can distinguish almost all non-isomorphic graphs, in terms of approximating node-level functions there exists a gap, growing exponentially in depth, between GA-MLPs and GNNs in terms of the number of equivalence classes of nodes (or rooted graphs) they induce. Furthermore, we find a concrete class of functions that lies in this gap, given by the counting of attributed walks. Moreover, through community detection, we demonstrate GA-MLPs' inability to go beyond a fixed family of operators, as compared to GNNs. In other words, GNNs possess an inherent ability to discover topological features through learnt diffusion operators, while GA-MLPs are limited to a fixed, hard-wired family of diffusions.

While we do not attempt to provide a decisive answer to whether GNNs or GA-MLPs should be preferred in practice, our theoretical framework and concrete examples help to understand their differences in expressive power and indicate the types of tasks in which a gap is more likely to be seen: those exploiting stronger structures among nodes, like counting attributed walks, or those involving the learning of graph operators. That said, our results are purely on the representation side and disregard optimization considerations; integrating the corresponding optimization questions is an important direction of future work. Finally, another open question is to better understand the links between GA-MLPs and spectral methods, and how this can help in learning diffusion operators.

## Acknowledgements

We are grateful to Jiaxuan You for initiating the discussion on GA-MLP-type models, as well as Mufei Li, Minjie Wang, Xiang Song and Lingfan Yu for helpful conversations. This work is partially supported by the Alfred P. Sloan Foundation, NSF RI-1816753, NSF CAREER CIF 1845360, NSF CHS-1901091, Samsung Electronics, and the Institute for Advanced Study.

## Appendix A Examples of existing GA-MLP models

In this appendix, let $\hat{A} = A + I$, let $\hat{D}$ be the diagonal matrix with $\hat{D}_{ii} = 1 + d_i$, and let $\tilde{A} = \hat{D}^{-1/2} \hat{A} \hat{D}^{-1/2}$.

• Simple Graph Convolution (Wu et al., 2019):
$\Omega = \{\tilde{A}^K\}$ for some $K$. In addition, $\varphi$ is the identity function and $\rho(\tilde{X}) = \mathrm{softmax}(\tilde{X} W)$ for some trainable weight matrix $W$.

• Graph Feature Network (Chen et al., 2019a):
$\Omega$ consists of the degree matrix together with intermediary powers of the normalized adjacency matrix up to some order $K$. In addition, $\varphi$ is the identity function and $\rho$ is an MLP.

• Scalable Inception Graph Networks (Rossi et al., 2020):
$\Omega = \Omega_1 \cup \Omega_2 \cup \Omega_3$, where $\Omega_1$ is a family of simple / normalized adjacency matrices, $\Omega_2$ is a family of Personalized-PageRank-based adjacency matrices, and $\Omega_3$ is a family of triangle-based adjacency matrices. In addition, writing $\tilde{X} = [\tilde{X}_1, \ldots, \tilde{X}_K]$, the output is $Z = \sigma_2\big([\sigma_1(\tilde{X}_1 W_1), \ldots, \sigma_1(\tilde{X}_K W_K)]\, W'\big)$, with $\sigma_1$ and $\sigma_2$ being nonlinear activation functions and $W_1, \ldots, W_K, W'$ being trainable weight matrices of suitable dimensions.
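The first of these models, SGC, reduces to feature propagation followed by a single linear classifier, and can be sketched as below. This is our reading of the model with self-loop normalization; the final logistic-regression layer is omitted, and all names are ours.

```python
import math

def sgc_features(A, X, K):
    """Propagate X through the K-th power of the self-loop-normalized adjacency."""
    n = len(A)
    # Add self-loops, then symmetrically normalize.
    Ahat = [[A[i][j] + (1 if i == j else 0) for j in range(n)] for i in range(n)]
    dhat = [sum(row) for row in Ahat]
    Atil = [[Ahat[i][j] / math.sqrt(dhat[i] * dhat[j]) for j in range(n)]
            for i in range(n)]
    H = [list(row) for row in X]
    for _ in range(K):  # K sparse matrix-vector products, done once as preprocessing
        H = [[sum(Atil[i][j] * H[j][c] for j in range(n))
              for c in range(len(X[0]))] for i in range(n)]
    return H  # a softmax/logistic layer on H completes the model

H = sgc_features([[0, 1], [1, 0]], [[1.0], [0.0]], 1)
```

Since `sgc_features` contains no trainable parameters, all graph-dependent computation can be cached before training, which is the efficiency argument made in Section 3.3.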

## Appendix B Equivalence classes induced by GNNs and GA-MLPs among real graphs

Given a space of graphs $\mathcal{G}$ and a family $\mathcal{F}$ of functions on $\mathcal{G}$, $\mathcal{F}$ induces an equivalence relation, which we denote by $\sim_{\mathcal{F}}$, among graphs in $\mathcal{G}$ such that for $G, G' \in \mathcal{G}$, $G \sim_{\mathcal{F}} G'$ if and only if $f(G) = f(G')$ for all $f \in \mathcal{F}$. For example, if $\mathcal{F}$ is powerful enough to distinguish all pairs of non-isomorphic graphs, then each equivalence class under $\sim_{\mathcal{F}}$ contains exactly one graph up to isomorphism. Thus, by examining the number or sizes of the equivalence classes induced by different families of functions on $\mathcal{G}$, we can evaluate their relative expressive power in a quantitative way.

Hence, we supplement the theoretical result of Proposition 2 with the following numerical results on five real-world datasets for graph prediction. For the graphs in each dataset, we remove their node features and count the total number of equivalence classes among them induced by depth-$K$ GNNs (equivalent to $K$ iterations of the WL test, as discussed in Section 3.2) as well as by GA-MLPs with powers of the normalized adjacency matrix up to order $K$, for different values of $K$. We see from the results in Table 3 that as $K$ grows, the numbers of equivalence classes induced by GNNs and by the GA-MLPs both quickly approach the total number of graphs up to isomorphism, implying that they are indeed both able to distinguish almost all pairs of non-isomorphic graphs among the ones occurring in these datasets.

## Appendix C Additional notations

For any $k \in \mathbb{N}$ and any rooted graph $G[i]$, define

$$W_k(G[i]) = \{(i_1, \ldots, i_k) \in V^k : A_{i, i_1}, A_{i_1, i_2}, \ldots, A_{i_{k-1}, i_k} > 0\} \tag{7}$$
$$\overline{W}_k(G[i]) = \{(i_1, \ldots, i_k) \in W_k(G[i]) : i \neq i_2,\ i_1 \neq i_3,\ \ldots,\ i_{k-2} \neq i_k\} \tag{8}$$

as the sets of walks and non-backtracking walks of length $k$ in $G^{[i]}$ starting from the root node $i$, respectively. Note that when $G^{[i]}$ is a rooted tree, a non-backtracking walk of length $k$ is a path from the root node to a node at depth $k$. In addition, for $d_1, \dots, d_k \in \mathbb{N}$, $x_k \in \mathcal{X}$ and $(x_1, \dots, x_k) \in \mathcal{X}^k$, define the following subsets of $W_k(G^{[i]})$:

 $W_k(G^{[i]}; \{d_1, \dots, d_{k-1}\}_m, d_k, x_k) = \{(i_1, \dots, i_k) \in W_k(G^{[i]}) : \{d_{i_1}, \dots, d_{i_{k-1}}\}_m = \{d_1, \dots, d_{k-1}\}_m,\ d_{i_k} = d_k,\ X_{i_k} = x_k\}$ (9)
 $W_k(G^{[i]}; (x_1, \dots, x_k)) = \{(i_1, \dots, i_k) \in W_k(G^{[i]}) : (X_{i_1}, \dots, X_{i_k}) = (x_1, \dots, x_k)\}$ (10)
 $W_k(G^{[i]}; x_k) = \{(i_1, \dots, i_k) \in W_k(G^{[i]}) : X_{i_k} = x_k\}$ (11)

We also define $\overline{W}_k(G^{[i]}; \{d_1, \dots, d_{k-1}\}_m, d_k, x_k)$, $\overline{W}_k(G^{[i]}; (x_1, \dots, x_k))$ and $\overline{W}_k(G^{[i]}; x_k)$ similarly, with $W_k(G^{[i]})$ replaced by $\overline{W}_k(G^{[i]})$.
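A direct, illustrative implementation of these definitions (our own sketch; graphs are represented as neighbor dictionaries) enumerates walks step by step and then filters out the backtracking ones:

```python
def walks(adj, root, k):
    """All walks (i_1, ..., i_k) of length k starting from the root,
    i.e. root -> i_1 -> ... -> i_k along edges of the graph."""
    out = [(root,)]
    for _ in range(k):
        out = [w + (j,) for w in out for j in adj[w[-1]]]
    return [w[1:] for w in out]

def nb_walks(adj, root, k):
    """Non-backtracking walks: no step may return to the node visited
    two steps earlier (i != i_2, i_1 != i_3, ...)."""
    res = []
    for w in walks(adj, root, k):
        full = (root,) + w
        if all(full[t] != full[t - 2] for t in range(2, len(full))):
            res.append(w)
    return res

# On a triangle rooted at node 0, length-2 walks may backtrack to the root,
# while non-backtracking ones may not.
triangle = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
```

On a rooted tree, `nb_walks` returns exactly the root-to-depth-$k$ paths, matching the remark above.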

## Appendix E Proof of Proposition 2

With node features being identical across nodes in the random graphs, we take $X$ to be the all-$1$ vector. Thus,

 $(DX)_i = d_i$ , (12)

and

 $(A D^{-\alpha} X)_i = \sum_{j \in N(i)} d_j^{-\alpha}$ . (13)

Since (4) and (2) together can approximate arbitrary permutation-invariant functions on multisets (Zaheer et al., 2017), if two graphs $G = (V, E)$ and $G' = (V', E')$ cannot be distinguished by the GA-MLP with an operator family that includes $\{D, A D^{-\alpha}\}$ under any choice of its parameters, it means that the two multisets $\{((DX)_i, (A D^{-\alpha} X)_i)\}_{i \in V}$ and $\{((DX)_{i'}, (A D^{-\alpha} X)_{i'})\}_{i' \in V'}$ coincide, and therefore both of the following hold:

 $\{d_i : i \in V\}_m = \{d_{i'} : i' \in V'\}_m$ (14)
 $\{\sum_{j \in N(i)} d_j^{-\alpha} : i \in V\}_m = \{\sum_{j' \in N(i')} d_{j'}^{-\alpha} : i' \in V'\}_m$ (15)

To see what this means, we need the two following lemmas.

###### Lemma 1.

Let $\mathcal{M}_{n,\Delta}$ be the set of all multisets consisting of at most $n$ elements, all of which are integers between $1$ and $\Delta$. Consider the function $g(S) = \sum_{d \in S} d^{-\alpha}$ defined for multisets $S \in \mathcal{M}_{n,\Delta}$. If $\alpha > \log n / \log(\Delta / (\Delta - 1))$, then $g$ is an injective function on $\mathcal{M}_{n,\Delta}$.

Proof of Lemma 1: For $g$ to be injective on $\mathcal{M}_{n,\Delta}$, it suffices to require that $\forall d \in \{1, \dots, \Delta - 1\}$, there is $d^{-\alpha} > n (d+1)^{-\alpha}$, so that the contribution of a single element $d$ cannot be matched by any multiset of at most $n$ elements all larger than $d$; for this, it is sufficient to require that $(\Delta / (\Delta - 1))^{\alpha} > n$, or $\alpha > \log n / \log(\Delta / (\Delta - 1))$.
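The threshold in Lemma 1 can be checked numerically by brute force. The following sketch (with illustrative parameters $n = 3$, $\Delta = 4$ chosen by us) enumerates every multiset in $\mathcal{M}_{n,\Delta}$ and verifies that no two of them share the same value of $g$:

```python
from itertools import combinations_with_replacement
from math import log

n, Delta = 3, 4
alpha = log(n) / log(Delta / (Delta - 1)) + 0.1  # just above the threshold

# all multisets of at most n elements from {1, ..., Delta}
multisets = [s for size in range(n + 1)
             for s in combinations_with_replacement(range(1, Delta + 1), size)]
values = [sum(d ** -alpha for d in s) for s in multisets]
```

Here `len(multisets)` is $\sum_{s=0}^{3} \binom{4+s-1}{s} = 35$, and all 35 values of $g$ come out pairwise distinct, as the lemma predicts.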

###### Lemma 2 (Babai et al. (1980), Theorem 1).

Consider the space $\mathcal{G}_n$ of graphs with $n$ vertices. There is a subset $\tilde{\mathcal{G}}_n \subseteq \mathcal{G}_n$ that contains almost all such graphs (i.e., the fraction $|\tilde{\mathcal{G}}_n| / |\mathcal{G}_n|$ converges to $1$ as $n \to \infty$) such that the following algorithm yields a unique identifier for every graph $G \in \tilde{\mathcal{G}}_n$:

#### Algorithm 1:

Set $r = \lceil 3 \log n / \log 2 \rceil$, and let $d_{(j)}$ be the degree of the node in $G$ with the $j$th largest degree. For each node $i$ in $G$, define the multiset $S_i = \{j \leq r : i \text{ is adjacent to the node with the } j\text{th largest degree}\}_m$. Finally, define a multiset associated with $G$, $S(G) = \{S_i : i \in V\}_m$, which is the output of the algorithm. In other words, $\forall G_1, G_2 \in \tilde{\mathcal{G}}_n$, $G_1$ and $G_2$ are isomorphic if and only if $S(G_1) = S(G_2)$ as multisets. In particular, we can choose $\tilde{\mathcal{G}}_n$ such that the top $r$ node degrees of every graph in $\tilde{\mathcal{G}}_n$ are distinct.
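For illustration, here is a simplified sketch of such an identifier (our own code, not the original algorithm verbatim: to stay well-defined even with tied degrees, we label each node by the degrees of its high-degree neighbors rather than by their ranks; when the top $r$ degrees are distinct, the two labelings carry the same information):

```python
from math import ceil, log2

def degree_identifier(adj):
    """Label each node by the sorted degrees of its neighbors among the r
    highest-degree vertices; the identifier is this sorted multiset of labels
    together with the degree sequence."""
    deg = {v: len(adj[v]) for v in adj}
    r = min(len(adj), ceil(3 * log2(len(adj))))
    top = set(sorted(adj, key=lambda v: -deg[v])[:r])
    labels = sorted(tuple(sorted(deg[u] for u in adj[v] if u in top))
                    for v in adj)
    return (tuple(sorted(deg.values())), tuple(labels))
```

The identifier is invariant under relabeling of the nodes, and Lemma 2 asserts that for almost all graphs it is also a complete isomorphism invariant.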

Based on these lemmas, we will show that when $\alpha > \log n / \log(\Delta / (\Delta - 1))$ with $\Delta = n - 1$, and $G, G' \in \tilde{\mathcal{G}}_n$, (14) and (15) together imply that $G$ is isomorphic to $G'$. To see this, suppose that (14) and (15) hold. Because of (14), we know that $G$ and $G'$ share the same degree sequence, and hence the same top degrees $d_{(1)}, \dots, d_{(r)}$. Because of (15), we know that there is a bijective map $\pi$ from $V$ to $V'$ such that $\forall i \in V$,

 $\sum_{j \in N(i)} d_j^{-\alpha} = \sum_{j' \in N(\pi(i))} d_{j'}^{-\alpha}$ , (16)

which, by Lemma 1, implies that $\{d_j : j \in N(i)\}_m = \{d_{j'} : j' \in N(\pi(i))\}_m$. Since the top $r$ degrees of graphs in $\tilde{\mathcal{G}}_n$ are distinct, this yields $S_i = S_{\pi(i)}$, and therefore $S(G) = S(G')$, which implies that $G$ and $G'$ are isomorphic by Lemma 2. Therefore, if $G$ and $G'$ are not isomorphic, then it cannot be the case that both (14) and (15) hold, and hence there exists a choice of parameters for the GA-MLP with $\mathcal{F} \supseteq \{D, A D^{-\alpha}\}$ that makes it return different outputs when applied to $G$ and $G'$. This proves Proposition 2.
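As a concrete sanity check of this mechanism (a toy example of our own choosing): the graphs $C_4 \cup K_2$ and $P_6$ have the same degree sequence, so (14) alone cannot separate them, but the degree-weighted neighbor sums in (15) do:

```python
def node_stats(adj, alpha=1.0):
    """Per-node pair (d_i, sum_{j in N(i)} d_j^{-alpha}), collected into a
    sorted graph-level multiset -- the information carried by D*1 and
    (A D^{-alpha})*1."""
    deg = {v: len(adj[v]) for v in adj}
    return sorted((deg[v], sum(deg[u] ** -alpha for u in adj[v]))
                  for v in adj)

# C4 plus a disjoint edge, and the path P6: same degree sequence (1,1,2,2,2,2)
c4_k2 = {0: {1, 3}, 1: {0, 2}, 2: {1, 3}, 3: {0, 2}, 4: {5}, 5: {4}}
p6 = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3, 5}, 5: {4}}
```

Both graphs yield the degree multiset $\{1,1,2,2,2,2\}_m$, yet their multisets of neighbor sums differ, so a GA-MLP with these operators can tell them apart.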

## Appendix F Proof of Proposition 3

As argued in the main text, to estimate the number of equivalence classes on rooted graphs induced by GNNs, we need to estimate the number of possible rooted aggregation trees. In particular, to lower-bound the number of equivalence classes induced by depth-$K$ GNNs, we only need to focus on a subset of all possible rooted aggregation trees, namely those in which every non-leaf node has exactly $m$ children. Letting $\mathcal{T}^{\mathcal{X}}_{m,K}$ denote the set of all rooted aggregation trees of depth $K$ in which each non-leaf node has exactly $m$ children and the node features belong to $\mathcal{X}$, we will first prove the following lemma:

###### Lemma 3.

If $m \geq 3$ and $|\mathcal{X}| \geq 2$, then $|\mathcal{T}^{\mathcal{X}}_{m,K}| \geq (m-1)^{2^K - 1}$.

Note that a rooted aggregation tree needs to satisfy the constraint that each of its nodes must have its parent's feature equal to one of its children's features, and so this lower bound is not as straightforward to prove as lower-bounding the total number of rooted subtrees. As argued above, this will allow us to derive Proposition 3.

Proof of Lemma 3: Since $|\mathcal{X}| \geq 2$, we may pick two distinct features and assume without loss of generality that $\{x_1, x_2\} \subseteq \mathcal{X}$; define $B = \{x_1, x_2\}$. To prove a lower bound on the cardinality of $\mathcal{T}^{\mathcal{X}}_{m,K}$, it suffices to restrict our attention to its subset $\mathcal{T}^{B}_{m,K}$, where all nodes have feature either $x_1$ or $x_2$. Furthermore, it is sufficient to restrict our attention to the subset of $\mathcal{T}^{B}_{m,K}$ consisting of trees that contain all possible types of paths of length $K$ from the root to the leaves. Formally, with $\overline{W}_K(T; (x_1, \dots, x_K))$ defined as in Appendix C, we let

 $\tilde{\mathcal{T}}^{B}_{m,K} = \{T \in \mathcal{T}^{B}_{m,K} : \forall (x_1, \dots, x_K) \in B^K,\ |\overline{W}_K(T; (x_1, \dots, x_K))| \geq 1\}$ , (17)

and it is sufficient to prove a lower bound on the cardinality of $\tilde{\mathcal{T}}^{B}_{m,K}$. Define $B_K = B^K$ to be the set of all binary $K$-tuples of features. By the definition (17), we know that $\forall T \in \tilde{\mathcal{T}}^{B}_{m,K}$ and $\forall b \in B_K$, there exists at least one leaf node in $T$ such that the path from the root node to this node consists of a sequence of nodes with features exactly as given by $b$. We call any such node a node under $b$.

We show such a lower bound on the cardinality of $\tilde{\mathcal{T}}^{B}_{m,K}$ inductively. For the base case, we know that $\tilde{\mathcal{T}}^{B}_{m,1}$ consists of all binary-featured depth-$1$ rooted trees with at least one leaf node of feature $x_1$ and one leaf node of feature $x_2$, and hence $|\tilde{\mathcal{T}}^{B}_{m,1}| \geq m - 1$. Next, we consider the inductive step. For every $K$ and every $T \in \tilde{\mathcal{T}}^{B}_{m,K}$, we can generate rooted aggregation trees belonging to $\tilde{\mathcal{T}}^{B}_{m,K+1}$ by assigning children of feature $x_1$ or $x_2$ to the leaf nodes of $T$. First note that, from two non-isomorphic rooted aggregation trees $T$ and $T'$ in $\tilde{\mathcal{T}}^{B}_{m,K}$, we obtain non-isomorphic rooted aggregation trees in $\tilde{\mathcal{T}}^{B}_{m,K+1}$ in this way. Moreover, as we will show next, for every $T \in \tilde{\mathcal{T}}^{B}_{m,K}$, we can lower-bound the number of distinct rooted aggregation trees belonging to $\tilde{\mathcal{T}}^{B}_{m,K+1}$ obtained from $T$ in this way.

There are many choices to assign the children. To get a lower bound on the cardinality of $\tilde{\mathcal{T}}^{B}_{m,K+1}$, we only need to consider a subset of these choices of assignments, namely, those that assign the same number of children with feature $x_1$ to every node under the same $b$. Thus, we let $c_b$ denote the number of children of feature $x_1$ assigned to every node under $b$. Due to the constraint that each node in the rooted aggregation tree must have its parent's feature equal to one of its children's features, not all choices of $(c_b)_{b \in B_K}$ lead to legitimate rooted aggregation trees. Nonetheless, when restricting to the choices where $1 \leq c_b \leq m - 1$, we see that every leaf node of $T$ gets assigned at least one child of feature $x_1$ and another child of feature $x_2$, thereby satisfying the constraint above whether its parent has feature $x_1$ or $x_2$. Moreover, for such choices, the rooted aggregation tree of depth $K + 1$ obtained in this way contains all possible paths of length $K + 1$, and therefore belongs to $\tilde{\mathcal{T}}^{B}_{m,K+1}$. Hence, it remains to show a lower bound on how many distinct trees in $\tilde{\mathcal{T}}^{B}_{m,K+1}$ can be obtained in this way from each $T$. Since for $b \neq b' \in B_K$, a node under $b$ is distinguishable from a node under $b'$, we see that every legitimate choice of the tuple of integers $(c_b)_{b \in B_K}$ leads to a distinct rooted aggregation tree of depth $K + 1$, and there are $(m-1)^{2^K}$ of these choices. Hence, we have derived that $|\tilde{\mathcal{T}}^{B}_{m,K+1}| \geq (m-1)^{2^K} |\tilde{\mathcal{T}}^{B}_{m,K}|$, and therefore $|\mathcal{T}^{\mathcal{X}}_{m,K}| \geq |\tilde{\mathcal{T}}^{B}_{m,K}| \geq (m-1)^{2^K - 1}$.
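Unrolling the recursion $|\tilde{\mathcal{T}}^{B}_{m,K+1}| \geq (m-1)^{2^K} |\tilde{\mathcal{T}}^{B}_{m,K}|$ with the base case $|\tilde{\mathcal{T}}^{B}_{m,1}| \geq m-1$ makes the final bound explicit (a worked step under the assignment scheme just described):

```latex
\bigl|\tilde{\mathcal{T}}^{B}_{m,K}\bigr|
  \;\geq\; \bigl|\tilde{\mathcal{T}}^{B}_{m,1}\bigr| \prod_{k=1}^{K-1} (m-1)^{2^k}
  \;\geq\; (m-1)^{\,1 + (2 + 4 + \cdots + 2^{K-1})}
  \;=\; (m-1)^{2^K - 1},
```

a quantity that is doubly exponential in the depth $K$.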

## Appendix G Proof of Proposition 4

According to the formula (3), by expanding the matrix product with $\tilde{A} = D^{-\alpha} A D^{-\beta}$, we have

 $(\tilde{A}^k \phi(X))_i = \sum_{(i_1, \dots, i_k) \in W_k(G^{[i]})} d_i^{-\alpha}\, d_{i_1}^{-(\alpha+\beta)} \cdots d_{i_{k-1}}^{-(\alpha+\beta)}\, d_{i_k}^{-\beta}\, \phi(X_{i_k})$
 $\; = d_i^{-\alpha} \sum_{\{\bar{d}_1, \dots, \bar{d}_{k-1}\}_m,\, \bar{d}_k,\, x}\ \sum_{(i_1, \dots, i_k) \in W_k(G^{[i]};\, \{\bar{d}_1, \dots, \bar{d}_{k-1}\}_m,\, \bar{d}_k,\, x)} (\bar{d}_1 \cdots \bar{d}_{k-1})^{-(\alpha+\beta)}\, \bar{d}_k^{-\beta}\, \phi(x)$
 $\; = d_i^{-\alpha} \sum_{\{\bar{d}_1, \dots, \bar{d}_{k-1}\}_m,\, \bar{d}_k,\, x} (\bar{d}_1 \cdots \bar{d}_{k-1})^{-(\alpha+\beta)}\, \bar{d}_k^{-\beta}\, \phi(x)\, \bigl|W_k(G^{[i]};\, \{\bar{d}_1, \dots, \bar{d}_{k-1}\}_m,\, \bar{d}_k,\, x)\bigr|$ , (18)

with $W_k(G^{[i]}; \{\bar{d}_1, \dots, \bar{d}_{k-1}\}_m, \bar{d}_k, x)$ defined in Appendix C. Hence, for two nodes $i$ in $G$ and $i'$ in $G'$ ($G$ and $G'$ can be the same graph), the node-wise outputs of the GA-MLP at $i$ and $i'$ will be identical if $d_i = d_{i'}$ and the rooted graphs $G^{[i]}$ and $G'^{[i']}$ satisfy $|W_k(G^{[i]}; \{\bar{d}_1, \dots, \bar{d}_{k-1}\}_m, \bar{d}_k, x)| = |W_k(G'^{[i']}; \{\bar{d}_1, \dots, \bar{d}_{k-1}\}_m, \bar{d}_k, x)|$ for every combination of choices of the multiset $\{\bar{d}_1, \dots, \bar{d}_{k-1}\}_m$, the integer $\bar{d}_k$ and the node feature $x$. Since all degrees are at most $m$, there are at most $\binom{k+m-2}{m-1}$ possible choices of the multiset, $m$ choices of $\bar{d}_k$ and $|\mathcal{X}|$ choices of $x$, thereby allowing at most $m |\mathcal{X}| \binom{k+m-2}{m-1}$ possible combinations. Because of the constraint

 $\sum_{\{\bar{d}_1, \dots, \bar{d}_{k-1}\}_m,\, \bar{d}_k,\, x} \bigl|W_k(G^{[i]};\, \{\bar{d}_1, \dots, \bar{d}_{k-1}\}_m,\, \bar{d}_k,\, x)\bigr| = |W_k(G^{[i]})| \leq m^k$ , (19)

we see that the total number of equivalence classes on rooted graphs of maximum degree $m$ induced by such a GA-MLP is upper-bounded by $m \cdot (m^k + 1)^{m |\mathcal{X}| \binom{k+m-2}{m-1}}$, which is on the order of $2^{O(k^m)}$ with $k$ growing and $m$ and $|\mathcal{X}|$ bounded. Finally, since the total number of equivalence classes induced by multiple operators can be upper-bounded by the product of the numbers of equivalence classes induced by each operator separately, we derive the proposition as desired.
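To visualize the resulting separation, one can compare the two counts on a log scale. The sketch below uses plausible closed forms for the two bounds, $(m-1)^{2^K - 1}$ for depth-$K$ GNNs and $(m^K + 1)^{m |\mathcal{X}| \binom{K+m-2}{m-1}}$ for a GA-MLP with the single operator $\tilde{A}^K$; the specific values $m = 5$, $|\mathcal{X}| = 2$ are illustrative choices of ours:

```python
from math import comb, log2

def gnn_lower_log2(m, K):
    """log2 of (m-1)^(2^K - 1): a lower bound on the number of equivalence
    classes induced by depth-K GNNs (doubly exponential in K)."""
    return (2 ** K - 1) * log2(m - 1)

def ga_mlp_upper_log2(m, K, n_feats):
    """log2 of (m^K + 1)^(m * |X| * C(K+m-2, m-1)): an upper bound on the
    classes induced by a GA-MLP with one operator (2^{O(K^m)})."""
    return m * n_feats * comb(K + m - 2, m - 1) * log2(m ** K + 1)

for K in (5, 10, 20, 30):
    print(K, round(gnn_lower_log2(5, K)), round(ga_mlp_upper_log2(5, K, 2)))
```

For small depths the polynomial-exponent GA-MLP bound can exceed the GNN bound, but the $2^K$ term in the GNN exponent eventually dominates any fixed power of $K$, which is the separation growing exponentially in depth.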

## Appendix H Proof of Proposition 5

Consider the set $\mathcal{T}^{\mathrm{full}}_{m,K}$ of full $m$-ary rooted trees of depth $K$, that is, all rooted trees of depth $K$ in which the nodes have features belonging to the discrete set $\mathcal{X}$ and all non-leaf nodes have exactly $m$ children. $\mathcal{T}^{\mathrm{full}}_{m,K}$ is a subset of the space of all rooted graphs. If $f$ is a function represented by a GA-MLP using operators of at most $K$ hops, then for $T \in \mathcal{T}^{\mathrm{full}}_{m,K}$, we can write

 $f(T) = \rho\bigl(\sum_{j \in V} a_j \phi(X_j)\bigr)$ , (20)

where we denote the node set of $T$ by $V$, and the vectors $a_j$ depend only on the topological relationship between $j$ and the root node. Let $V_k$ denote the set of nodes at depth $k$ of $T$. By the assumption that the operators depend only on the graph topology, and thanks to the topological symmetry of such full $m$-ary trees among all nodes at the same depth, we have that $\forall k \leq K$ and $\forall j, j' \in V_k$, there is $a_j = a_{j'} =: a^{[k]}$. Thus, we can write

 $f(T) = \rho\bigl(\sum_{0 \leq k \leq K} \sum_{j \in V_k} a^{[k]} \phi(X_j)\bigr) = \rho\bigl(\sum_{0 \leq k \leq K} \sum_{x \in \mathcal{X}} \bar{a}^{[k],x}\, \bigl|\overline{W}_k(T; x)\bigr|\bigr)$ (21)

for some other set of coefficients $\bar{a}^{[k],x}$, where $\overline{W}_k(T; x)$ is defined in Appendix C. In other words, for two trees $T$ and $T'$, if $f(T) \neq f(T')$, they satisfy