ASAP: Adaptive Structure Aware Pooling for Learning Hierarchical Graph Representations
Abstract
Graph Neural Networks (GNNs) have been shown to work effectively for modeling graph-structured data to solve tasks such as node classification, link prediction and graph classification. There has been some recent progress in defining the notion of pooling in graphs, whereby the model tries to generate a graph-level representation by downsampling and summarizing the information present in the nodes. Existing pooling methods either fail to effectively capture the graph substructure or do not easily scale to large graphs. In this work, we propose ASAP (Adaptive Structure Aware Pooling), a sparse and differentiable pooling method that addresses the limitations of previous graph pooling architectures. ASAP utilizes a novel self-attention network along with a modified GNN formulation to capture the importance of each node in a given graph. It also learns a sparse soft cluster assignment for nodes at each layer to effectively pool the subgraphs to form the pooled graph. Through extensive experiments on multiple datasets and theoretical analysis, we motivate our choice of the components used in ASAP. Our experimental results show that combining existing GNN architectures with ASAP leads to state-of-the-art results on multiple graph classification benchmarks. ASAP has an average improvement of 4% over the current sparse hierarchical state-of-the-art method.
1 Introduction
In recent years, there has been an increasing interest in developing Graph Neural Networks (GNNs) for graph-structured data. CNNs have been shown to be successful in tasks involving images [22, 16] and text [19]. Unlike such regular grid data, arbitrarily shaped graphs carry rich information in their graph structure. By inherently capturing such information through message propagation along the edges of the graph, GNNs have proved to be more effective for graphs [14, 15]. GNNs have been successfully applied in tasks such as semantic role labeling [26], relation extraction [37], neural machine translation [2], document dating [36], and molecular feature extraction [18]. While some works focus on learning node-level representations for tasks such as node classification [21, 40] and link prediction [32, 38], others focus on learning graph-level representations for tasks like graph classification [5, 17, 47, 13, 23] and graph regression [45, 30]. In this paper, we focus on graph-level representation learning for the task of graph classification.
Briefly, the task of graph classification involves predicting the label of an input graph by utilizing the given graph structure and the initial node-level representations. For example, given a molecule, the task could be to predict if it is toxic. Current GNNs are inherently flat and lack the capability of aggregating node information in a hierarchical manner. Such architectures rely on learning node representations through some GNN, followed by aggregation of the node information to generate the graph representation [41, 24, 48]. But learning graph representations in a hierarchical manner is important for capturing the local substructures present in graphs. For example, in an organic molecule, a set of atoms together can act as a functional group and play a vital role in determining the class of the graph.
To address this limitation, new pooling architectures have been proposed where sets of nodes are recursively aggregated to form a cluster that represents a node in the pooled graph, thus enabling hierarchical learning. DiffPool [47] is a differentiable pooling operator that learns a soft assignment matrix mapping each node to a set of clusters. Since this assignment matrix is dense, it does not easily scale to large graphs [6]. Following that, TopK [13] was proposed, which learns a scalar projection score for each node and selects the top nodes. It addresses the sparsity concerns of DiffPool but is unable to capture the rich graph structure effectively. Recently, SAGPool [23], a TopK-based architecture, has been proposed which leverages a self-attention network to learn the node scores. Although the local graph structure is used for scoring nodes, it is still not used effectively in determining the connectivity of the pooled graph. Pooling methods that leverage the graph structure effectively while maintaining sparsity do not currently exist; we address this gap in this paper.
In this work, we propose a new sparse pooling operator called Adaptive Structure Aware Pooling (ASAP) which overcomes the limitations in current pooling methods. Our contributions can be summarized as follows:

We introduce ASAP, a sparse pooling operator capable of capturing local subgraph information hierarchically to learn global features with better edge connectivity in the pooled graph.

We propose Master2Token (M2T), a new self-attention framework which is better suited for global tasks like pooling.

We introduce a new convolution operator, LEConv, that can adaptively learn functions of local extrema in a graph substructure.
2 Related Work
2.1 Graph Neural Networks
Various formulations of GNNs have been proposed, using both spectral and non-spectral approaches. Spectral methods [5, 17] aim at defining the convolution operation using the Fourier transform and the graph Laplacian. These methods do not directly generalize to graphs with a different structure [4]. Non-spectral methods [8, 21, 46, 27, 28] define convolution through a local neighborhood around nodes in the graph. They are faster than spectral methods and easily generalize to other graphs. GNNs can also be viewed as a message passing algorithm where nodes iteratively aggregate messages from neighboring nodes through edges [14].
2.2 Pooling
Pooling layers overcome GNNs' inability to aggregate nodes hierarchically. Earlier pooling methods focused on deterministic graph clustering algorithms [8, 11, 35]. Ying et al. [47] introduced the first differentiable pooling operator, which outperformed the previous deterministic methods. Since then, new data-driven pooling methods have been proposed, both spectral [25, 9] and non-spectral [47, 13]. Spectral methods aim at capturing the graph topology using eigendecomposition algorithms. However, due to the higher computational requirements of spectral graph techniques, they are not easily scalable to large graphs. Hence, we focus on non-spectral methods.
Pooling methods can further be divided into global and hierarchical pooling layers. Global pooling summarizes the entire graph in just one step. Set2Set [41] finds the importance of each node in the graph through iterative content-based attention. GlobalAttention [24] uses an attention mechanism to aggregate nodes in the graph. SortPool [48] summarizes the graph by concatenating a few nodes after sorting them based on their features. Hierarchical pooling is used to capture the topological information of graphs. DiffPool forms a fixed number of clusters by aggregating nodes. It uses a GNN to compute a dense soft assignment matrix, making it infeasible for large graphs.

| Property | DiffPool | TopK | SAGPool | ASAP |
| --- | --- | --- | --- | --- |
| Sparse | | ✓ | ✓ | ✓ |
| Node aggregation | ✓ | | | ✓ |
| Soft edge weights | ✓ | | | ✓ |
| Variable number of clusters | | ✓ | ✓ | ✓ |

TopK scores nodes based on a learnable projection vector and samples a fraction of the high-scoring nodes. It avoids node aggregation and computation of a soft assignment matrix in order to maintain sparsity in graph operations. SAGPool improves upon TopK by using a GNN to consider the graph structure while scoring nodes. Since TopK and SAGPool neither aggregate nodes nor compute soft edge weights, they are unable to preserve node and edge information effectively.
To address these limitations, we propose ASAP, which has all the desirable properties of hierarchical pooling without compromising on sparsity in graph operations. Please see Table 1 for an overall comparison of hierarchical pooling methods. Further comparison between hierarchical architectures is presented in Sec. 8.1.
3 Preliminaries
3.1 Problem Statement
Consider a graph $G(V, E)$ with $N$ nodes and $|E|$ edges. Each node $v_i$ has a $d$-dimensional feature representation denoted by $x_i$. $X \in \mathbb{R}^{N \times d}$ denotes the node feature matrix and $A \in \mathbb{R}^{N \times N}$ represents the weighted adjacency matrix. The graph also has a label $y$ associated with it. Given a dataset $\mathcal{D} = \{(G_1, y_1), (G_2, y_2), \ldots\}$, the task of graph classification is to learn a mapping $f: \mathcal{G} \rightarrow \mathcal{Y}$, where $\mathcal{G}$ is the set of input graphs and $\mathcal{Y}$ is the set of labels associated with each graph. A pooled graph is denoted by $G^p(V^p, E^p)$ with node embedding matrix $X^p$ and its adjacency matrix as $A^p$.
3.2 Graph Convolution Networks
We use Graph Convolution Networks (GCN) [21] for extracting discriminative features for graph classification. GCN is defined as:

$$X^{(l+1)} = \sigma\left(\hat{D}^{-\frac{1}{2}}\hat{A}\hat{D}^{-\frac{1}{2}} X^{(l)} W^{(l)}\right) \qquad (1)$$

where $\hat{A} = A + I$ for self-loops, $\hat{D}_{ii} = \sum_j \hat{A}_{ij}$, and $W^{(l)}$ is a learnable matrix for any layer $l$. We use the initial node feature matrix wherever provided, i.e., $X^{(0)} = X$.
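As a concrete illustration, Eq. (1) amounts to a symmetrically normalized neighborhood average followed by a linear map and a non-linearity. A minimal NumPy sketch (the graph, features, and shapes below are illustrative assumptions, not the paper's setup):

```python
import numpy as np

def gcn_layer(X, A, W, act=np.tanh):
    """One GCN layer (Eq. 1): X' = act(D^{-1/2} (A + I) D^{-1/2} X W)."""
    A_hat = A + np.eye(A.shape[0])           # add self-loops
    d = A_hat.sum(axis=1)                    # degrees including self-loops
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return act(D_inv_sqrt @ A_hat @ D_inv_sqrt @ X @ W)

# toy 3-node path graph with 2-d features and a 2x2 weight matrix
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
X = np.array([[1., 0.],
              [0., 1.],
              [1., 1.]])
W = np.array([[0.5, -0.5],
              [0.5,  0.5]])
H = gcn_layer(X, A, W)
assert H.shape == (3, 2)
```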
3.3 SelfAttention
Self-attention is used to find the dependency of an input on itself [7, 39]. An alignment score $\alpha_{i,j}$ is computed to map the importance of candidates $x_j$ onto a target query $u_i$. In self-attention, the target query $u_i$ and the candidates $x_j$ are obtained from the input entities $X$. Self-attention can be categorized as Token2Token or Source2Token based on the choice of target query [33].
Token2Token (T2T)
selects both the target and the candidates from the input set $X$. In the context of additive attention [1], $\alpha_{i,j}$ is computed as:

$$\alpha_{i,j} = \mathrm{softmax}\left(\vec{w}^{\top}\,\sigma(W\,[x_i \,\|\, x_j])\right) \qquad (2)$$

where $\|$ is the concatenation operator.
Source2Token (S2T)
finds the importance of each candidate with respect to a specific global task which cannot be represented by any single entity. $\alpha_j$ is computed by dropping the target query term, so Eq. (2) changes to the following:

$$\alpha_{j} = \mathrm{softmax}\left(\vec{w}^{\top}\,\sigma(W\,x_j)\right) \qquad (3)$$
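The two variants can be contrasted in a few lines of NumPy. The weight shapes and toy inputs below are illustrative assumptions, not the paper's configuration:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def t2t_scores(target, members, w, W, act=np.tanh):
    """Token2Token (Eq. 2): the query is one input entity (e.g. a medoid)."""
    return softmax(np.array([w @ act(W @ np.concatenate([target, xj]))
                             for xj in members]))

def s2t_scores(members, w, W, act=np.tanh):
    """Source2Token (Eq. 3): the target-query term is dropped entirely."""
    return softmax(np.array([w @ act(W @ xj) for xj in members]))

members = [np.array([1., 0.]), np.array([0., 1.]), np.array([1., 1.])]
w = np.array([1., -1.])
a_t2t = t2t_scores(members[0], members, w, np.ones((2, 4)) * 0.5)
a_s2t = s2t_scores(members, w, np.ones((2, 2)) * 0.5)
assert np.isclose(a_t2t.sum(), 1.0) and np.isclose(a_s2t.sum(), 1.0)
```

Note that `s2t_scores` assigns each entity a single score regardless of which cluster it is viewed from, which is the limitation M2T addresses below.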
3.4 Receptive Field
We extend the concept of receptive field from pooling operations in CNNs to GNNs.³ We define the node receptive field $RF^{node}$ of a pooling operator as the number of hops needed to cover all the nodes in the neighborhood that influence the representation of a particular output node. Similarly, the edge receptive field $RF^{edge}$ of a pooling operator is defined as the number of hops needed to cover all the edges in the neighborhood that affect the representation of an edge in the pooled graph $G^p$.

³ Please refer to Appendix Sec. D for more details on the similarity between pooling methods in CNNs and ASAP.
4 ASAP: Proposed Method
In this section we describe the components of our proposed method, ASAP. As shown in Fig. 1(b), ASAP initially considers all possible local clusters with a fixed receptive field for a given input graph. It then computes the cluster membership of the nodes using an attention mechanism. These clusters are then scored using a GNN, as depicted in Fig. 1(c). Further, a fraction of the top scoring clusters are selected as nodes in the pooled graph, and new edge weights are computed between neighboring clusters, as shown in Fig. 1(d). Below, we discuss the working of ASAP in detail. Please refer to Appendix Sec. I for pseudocode of ASAP.
4.1 Cluster Assignment
Initially, we consider each node in the graph as the medoid of a cluster such that each cluster can represent only the local neighbors within a fixed radius of $h$ hops, i.e., the cluster $c_h(v_i)$ centered at node $v_i$. This effectively means that $RF^{node} = h$ for ASAP. This helps the clusters effectively capture the information present in the graph substructure.
Let $x^c_i$ be the feature representation of the cluster centered at $v_i$. We define $G^c$ as the graph with node feature matrix $X^c \in \mathbb{R}^{N \times d}$ and adjacency matrix $A^c$. We denote the cluster assignment matrix by $S \in \mathbb{R}^{N \times N}$, where $S_{i,j}$ represents the membership of node $v_i$ in cluster $c_h(v_j)$. By employing such local clustering [31], we can maintain the sparsity of the cluster assignment matrix similar to the original graph adjacency matrix, i.e., the space complexity of both $S$ and $A$ is $O(|E|)$.
4.2 Cluster Formation using Master2Token
Given a cluster $c_h(v_i)$, we learn the cluster assignment matrix $S$ through a self-attention mechanism. The task here is to learn the overall representation of the cluster by attending to the relevant nodes in it. We observe that both the T2T and S2T attention mechanisms described in Sec. 3.3 do not utilize any intra-cluster information. Hence, we propose a new variant of self-attention called Master2Token (M2T). We further motivate the need for the M2T framework later in Sec. 8.2. In the M2T framework, we first create a master query $m_i$ which is representative of all the nodes within the cluster:
$$m_i = f_m\big(x'_j \mid v_j \in c_h(v_i)\big) \qquad (4)$$
where $x'_j$ is obtained after passing $x_j$ through a separate GCN to capture structural information in the cluster.⁴ $f_m$ is a master function which combines and transforms the feature representations of $v_j \in c_h(v_i)$ to find $m_i$. In this work we experiment with a max-pooling master function defined as:

⁴ If $x_j$ is used as is, then interchanging any two nodes in a cluster will not affect the final output, which is undesirable.
$$m_i = \max_{v_j \in c_h(v_i)} x'_j \qquad (5)$$
This master query $m_i$ attends to all the constituent nodes $v_j \in c_h(v_i)$ using additive attention:
$$\alpha_{i,j} = \mathrm{softmax}\left(\vec{w}^{\top}\,\sigma(W\,[m_i \,\|\, x_j])\right) \qquad (6)$$
where $\vec{w}$ and $W$ are a learnable vector and matrix respectively. The calculated attention scores $\alpha_{i,j}$ signify the membership strength of node $v_j$ in cluster $c_h(v_i)$. Hence, we use this score to define the cluster assignment matrix discussed above, i.e., $S_{j,i} = \alpha_{i,j}$. The cluster representation $x^c_i$ for $c_h(v_i)$ is computed as follows:
$$x^c_i = \sum_{j=1}^{|c_h(v_i)|} \alpha_{i,j}\, x_j \qquad (7)$$
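Putting Eqs. (4)-(7) together, M2T cluster formation can be sketched in NumPy. This is a simplified illustration with assumed shapes; for brevity the same vectors stand in for both the GCN-transformed features $x'_j$ (used for the master query) and the raw features $x_j$ that are attended over:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def m2t_cluster(members, w, W, act=np.tanh):
    """Master2Token cluster formation (Eqs. 4-7) for one cluster c_h(v_i)."""
    Xc = np.stack(members)                       # (n, d) member features
    m = Xc.max(axis=0)                           # master query, Eq. (5)
    alpha = softmax(np.array(                    # memberships, Eq. (6)
        [w @ act(W @ np.concatenate([m, xj])) for xj in Xc]))
    xc = alpha @ Xc                              # cluster repr., Eq. (7)
    return alpha, xc

members = [np.array([1., 0.]), np.array([0., 2.])]
alpha, xc = m2t_cluster(members, w=np.ones(3), W=np.ones((3, 4)) * 0.1)
assert np.isclose(alpha.sum(), 1.0) and xc.shape == (2,)
```

Because the master query is a function of every member, removing or changing any node in the cluster shifts all membership scores, unlike T2T or S2T.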
4.3 Cluster Selection using LEConv
Similar to TopK [13], we sample clusters based on a cluster fitness score $\phi_i$ calculated for each cluster in the graph $G^c$ using a fitness function $f_{\phi}$. For a given pooling ratio $k \in (0, 1]$, the top $\lceil kN \rceil$ clusters are selected and included in the pooled graph $G^p$. To compute the fitness scores, we introduce Local Extrema Convolution (LEConv), a graph convolution method which can capture local extremum information. In Sec. 5.1 we motivate the choice of LEConv's formulation and contrast it with the standard GCN formulation. LEConv is used to compute $\phi_i$ as follows:
$$\phi_i = \sigma\Big(x^c_i W_1 + \sum_{j \in \mathcal{N}(i)} A^c_{i,j}\big(x^c_i W_2 - x^c_j W_3\big)\Big) \qquad (8)$$
where $\mathcal{N}(i)$ denotes the neighborhood of node $v_i$ in $G^c$, $W_1, W_2, W_3$ are learnable parameters and $\sigma$ is some activation function. The fitness vector $\Phi = [\phi_1, \ldots, \phi_N]^{\top}$ is multiplied with the cluster feature matrix $X^c$ to make $\phi$ learnable, i.e.:

$$\hat{X}^c = \Phi \odot X^c$$

where $\odot$ is the broadcasted Hadamard product. The function $\mathrm{TOP}_k$ ranks the fitness scores and gives the indices $\hat{i}$ of the top $\lceil kN \rceil$ selected clusters in $G^c$ as follows:

$$\hat{i} = \mathrm{TOP}_k\big(\Phi, \lceil kN \rceil\big)$$
The pooled graph $G^p$ is formed by selecting these top clusters. The pruned cluster assignment matrix $\hat{S}$ and the node feature matrix $X^p$ are given by:

$$\hat{S} = S(:, \hat{i}), \qquad X^p = \hat{X}^c(\hat{i}, :) \qquad (9)$$

where $\hat{i}$ is used for index slicing.
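The scoring-and-selection pipeline of Eqs. (8)-(9) can be sketched as follows. This is a dense, loop-based NumPy illustration with scalar fitness scores and assumed toy inputs; an actual implementation would use sparse operations:

```python
import numpy as np

def leconv_fitness(Xc, A, W1, W2, W3, act=np.tanh):
    """LEConv fitness (Eq. 8) with vector weights so each cluster gets a scalar."""
    N = Xc.shape[0]
    phi = np.zeros(N)
    for i in range(N):
        s = Xc[i] @ W1                            # self term x_i W1
        for j in range(N):
            if A[i, j] != 0:                      # neighbours of v_i
                s += A[i, j] * (Xc[i] @ W2 - Xc[j] @ W3)
        phi[i] = act(s)
    return phi

def select_clusters(Xc, phi, k):
    """TOP_k selection and index slicing (Eq. 9)."""
    n_keep = int(np.ceil(k * Xc.shape[0]))
    idx = np.argsort(-phi)[:n_keep]               # indices of top clusters
    Xp = phi[idx, None] * Xc[idx]                 # X^p = (Phi ⊙ X^c)(i, :)
    return idx, Xp

Xc = np.array([[1., 0.], [0., 1.], [2., 2.], [0., 0.]])
A = np.array([[0., 1., 0., 0.],
              [1., 0., 1., 0.],
              [0., 1., 0., 1.],
              [0., 0., 1., 0.]])
W = np.ones(2)
phi = leconv_fitness(Xc, A, W, W, W)
idx, Xp = select_clusters(Xc, phi, k=0.5)
assert len(idx) == 2 and Xp.shape == (2, 2)
```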
| Method | D&D | PROTEINS | NCI1 | NCI109 | FRANKENSTEIN |
| --- | --- | --- | --- | --- | --- |
| Set2Set [41] | 71.60 ± 0.87 | 72.16 ± 0.43 | 66.97 ± 0.74 | 61.04 ± 2.69 | 61.46 ± 0.47 |
| GlobalAttention [24] | 71.38 ± 0.78 | 71.87 ± 0.60 | 69.00 ± 0.49 | 67.87 ± 0.40 | 61.31 ± 0.41 |
| SortPool [48] | 71.87 ± 0.96 | 73.91 ± 0.72 | 68.74 ± 1.07 | 68.59 ± 0.67 | 63.44 ± 0.65 |
| DiffPool [47] | 66.95 ± 2.41 | 68.20 ± 2.02 | 62.32 ± 1.90 | 61.98 ± 1.98 | 60.60 ± 1.62 |
| TopK [13] | 75.01 ± 0.86 | 71.10 ± 0.90 | 67.02 ± 2.25 | 66.12 ± 1.60 | 61.46 ± 0.84 |
| SAGPool [23] | 76.45 ± 0.97 | 71.86 ± 0.97 | 67.45 ± 1.11 | 67.86 ± 1.41 | 61.73 ± 0.76 |
| ASAP (Ours) | | | | | |
4.4 Maintaining Graph Connectivity
Following [47], once the clusters have been sampled, we find the new adjacency matrix $A^p$ for the pooled graph $G^p$ using $A^c$ and $\hat{S}$ in the following manner:

$$A^p = \hat{S}^{\top} \hat{A}^c \hat{S} \qquad (10)$$
where $\hat{A}^c = A^c + I$. Equivalently, we can see that $A^p_{m,n} = \sum_{i,j} \hat{S}_{i,m}\, \hat{A}^c_{i,j}\, \hat{S}_{j,n}$. This formulation ensures that any two clusters $m$ and $n$ in $G^p$ are connected if there is any common node in the clusters $c_h(v_m)$ and $c_h(v_n)$, or if any of the constituent nodes in the clusters are neighbors in the original graph (Fig. 1(d)). Hence, the strength of the connection between clusters is determined both by the membership of the constituent nodes through $\hat{S}$ and by the edge weights $A^c$. Note that $\hat{S}$ is a sparse matrix by formulation, and hence the above operation can be implemented efficiently.
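A minimal sketch of Eq. (10), using a toy hard assignment matrix for clarity (ASAP's actual $\hat{S}$ is soft and sparse):

```python
import numpy as np

def pooled_adjacency(S_hat, A):
    """Eq. (10): A^p = S_hat^T (A + I) S_hat."""
    return S_hat.T @ (A + np.eye(A.shape[0])) @ S_hat

# 4-node path graph 0-1-2-3 pooled into two (hard) clusters {0,1} and {2,3}
A = np.zeros((4, 4))
for i in range(3):
    A[i, i + 1] = A[i + 1, i] = 1.0
S_hat = np.array([[1., 0.],
                  [1., 0.],
                  [0., 1.],
                  [0., 1.]])
Ap = pooled_adjacency(S_hat, A)
# the original edge 1-2 connects the two clusters in the pooled graph
assert Ap[0, 1] > 0
```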
5 Theoretical Analysis
5.1 Limitations of using GCN for scoring clusters
GCNs cannot learn to assign a fitness score to a cluster that is a function of the local extrema of its constituent nodes. Scoring the clusters based on local extrema would potentially allow us to sample representative clusters from all parts of the graph. GCN from Eq. (1) can be viewed as an operator which first computes a pre-score $x_j W$ for each node, followed by a weighted average over neighbors and a non-linearity. If the pre-score of some node is very high, it can increase the scores of its neighbors, inherently biasing the pooling operator to select nodes in the local neighborhood instead of sampling clusters which represent the whole graph.
Theorem 1.
Let $G$ be a graph with positive adjacency matrix $A$, i.e., $A_{i,j} \ge 0$. Consider any function $f$ which depends on the difference between a node and its neighbors after a linear transformation $W$, e.g.:

$$f(X, A)_i = \sum_{j \in \mathcal{N}(i)} A_{i,j}\big(x_i W - x_j W\big)$$

where $X \in \mathbb{R}^{N \times d}$ and $W \in \mathbb{R}^{d \times 1}$.

a) If the fitness value $\phi = \mathrm{GCN}(X, A)$, then $\phi$ cannot learn $f$.

b) If the fitness value $\phi = \mathrm{LEConv}(X, A)$, then $\phi$ can learn $f$.
Proof.
See Appendix Sec. F for proof. ∎
Motivated by the above analysis, we propose to use LEConv (Eq. 8) for scoring clusters. LEConv can learn to score clusters by considering both their global and local importance through the use of self-loops and the ability to learn functions of local extrema.
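The limitation can be checked numerically on a toy graph: with a single "peak" feature, a GCN-style score assigns the extremum and its neighbor identical values, while a difference-based (LEConv-style) score isolates the extremum. This is an illustrative sketch, not the paper's experiment:

```python
import numpy as np

# 5-node path graph with a single "peak" feature at node 2
A = np.zeros((5, 5))
for i in range(4):
    A[i, i + 1] = A[i + 1, i] = 1.0
x = np.array([0., 0., 10., 0., 0.])

# GCN-style score (Eq. 1 with identity activation and W = 1)
A_hat = A + np.eye(5)
d = A_hat.sum(axis=1)
gcn = (A_hat / np.sqrt(np.outer(d, d))) @ x

# difference-based score: phi_i = sum_j A_ij (x_i - x_j)
le = np.array([sum(A[i, j] * (x[i] - x[j]) for j in range(5))
               for i in range(5)])

# GCN gives the peak and its neighbour identical scores, so a top-k
# selector cannot prefer the extremum; the difference form can.
assert np.isclose(gcn[1], gcn[2])
assert le[2] == le.max()
```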
5.2 Graph Connectivity
Here, we analyze ASAP from the aspect of edge connectivity in the pooled graph. When considering an $h$-hop neighborhood for clustering, both ASAP and DiffPool have $RF^{edge} = 2h + 1$ because they use Eq. (10) to define the edge connectivity. On the other hand, TopK and SAGPool have a much smaller edge receptive field since they sample edges directly from the original graph. A larger edge receptive field implies that the pooled graph has better connectivity, which is important for the flow of information in the subsequent GCN layers.
Theorem 2.
Let the input graph $G$ be a tree of any possible structure with $N$ nodes. Let $k_{min}$ be the lower bound on the sampling ratio $k$ that ensures the existence of at least one edge in the pooled graph, irrespective of the structure of $G$ and the location of the selected nodes. For TopK or SAGPool, $k_{min} \rightarrow 1$, whereas for ASAP, $k_{min} \rightarrow 0.5$ as $N \rightarrow \infty$.
Proof.
See Appendix Sec. G for proof. ∎
Theorem 2 suggests that ASAP can achieve a similar degree of connectivity as SAGPool or TopK for a much smaller sampling ratio $k$. For a tree with no prior information about its structure, ASAP would need to sample only half of the clusters, whereas TopK and SAGPool would need to sample almost all the nodes, making TopK and SAGPool inefficient for such graphs. In general, independent of the combination of nodes selected, ASAP will have better connectivity due to its larger edge receptive field. Please refer to Appendix Sec. G for a similar analysis on path graphs and more details.
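The intuition behind Theorem 2 can be seen on a small path graph: selecting alternate nodes at $k = 0.5$ disconnects a TopK/SAGPool-style pooled graph, while ASAP's cluster overlap (Eq. 10) keeps it connected. A toy sketch with hard 1-hop cluster memberships (illustrative, not the appendix proof):

```python
import numpy as np

# path graph 0-1-2-3; suppose pooling keeps nodes/clusters {0, 2} (k = 0.5)
A = np.zeros((4, 4))
for i in range(3):
    A[i, i + 1] = A[i + 1, i] = 1.0
keep = [0, 2]

# TopK/SAGPool: edges are sampled from the original graph, so two kept
# nodes remain connected only if they were already adjacent
A_topk = A[np.ix_(keep, keep)]

# ASAP: each kept cluster covers its 1-hop neighbourhood; Eq. (10)
# connects clusters through shared members and original edges
S = np.zeros((4, 2))
S[[0, 1], 0] = 1.0          # cluster around node 0 = {0, 1}
S[[1, 2, 3], 1] = 1.0       # cluster around node 2 = {1, 2, 3}
A_asap = S.T @ (A + np.eye(4)) @ S

assert A_topk[0, 1] == 0    # pooled graph disconnected under TopK
assert A_asap[0, 1] > 0     # pooled graph connected under ASAP
```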
5.3 Graph Permutation Equivariance
Proposition 1.
ASAP is a graph permutation equivariant pooling operator.
Proof.
See Appendix Sec. H for proof. ∎
6 Experimental Setup
In our experiments, we use five graph classification benchmarks and compare ASAP with multiple pooling methods. Below, we describe the statistics of the datasets, the baselines used for comparison, and our evaluation setup in detail.
6.1 Datasets
We demonstrate the effectiveness of our approach on five graph classification datasets. D&D [34, 10] and PROTEINS [10, 3] are datasets containing proteins as graphs. NCI1 [42] and NCI109 are datasets for anticancer activity classification. FRANKENSTEIN [29] contains molecules as graphs for mutagen classification. Please refer to Table 3 for the dataset statistics.
| Dataset | #Graphs | #Classes | Avg. #Nodes | Avg. #Edges |
| --- | --- | --- | --- | --- |
| D&D | 1178 | 2 | 284.32 | 715.66 |
| PROTEINS | 1113 | 2 | 39.06 | 72.82 |
| NCI1 | 4110 | 2 | 29.87 | 32.30 |
| NCI109 | 4127 | 2 | 29.68 | 32.13 |
| FRANKENSTEIN | 4337 | 2 | 16.90 | 17.88 |
6.2 Baselines
6.3 Training & Evaluation Setup
We use a similar architecture as defined in [6, 23], which is depicted in Fig. 1(f). For ASAP, we choose pooling ratio $k = 0.5$ and $h = 1$ to be consistent with the baselines.⁵ Following SAGPool [23], we conduct our experiments using 10-fold cross-validation and report the average accuracy over 20 random seeds.

⁵ Please refer to Appendix Sec. A for further details on hyperparameter tuning and Appendix Sec. E for an ablation on $h$.
| Aggregation type | FITNESS | CLUSTER |
| --- | --- | --- |
| None | | |
| Only cluster | | ✓ |
| Both | ✓ | ✓ |
7 Results
In this section, we attempt to answer the following questions:

Q1: How does ASAP perform compared to other pooling methods at the task of graph classification? (Sec. 7.1)

Q2: Is cluster formation by M2T attention-based node aggregation beneficial during pooling? (Sec. 7.3)

Q3: Is LEConv better suited as a cluster fitness scoring function compared to vanilla GCN? (Sec. 7.4)

Q4: How helpful is the computation of inter-cluster soft edge weights instead of sampling edges from the input graph? (Sec. 7.5)
7.1 Performance Comparison
We compare the performance of ASAP with baseline methods on graph classification tasks. The results are shown in Table 2. All the numbers for hierarchical pooling (DiffPool, TopK and SAGPool) are taken from [23]. For global pooling (Set2Set, GlobalAttention and SortPool), we modify the architectural setup to make them comparable with the hierarchical variants.⁶ We observe that ASAP consistently outperforms all the baselines on all datasets, with an average improvement of 4% over the previous state-of-the-art hierarchical pooling method (SAGPool); it also improves over the best global pooling method (SortPool). We also observe that, compared to other hierarchical methods, ASAP has a smaller variance in performance, which suggests that the training of ASAP is more stable.

⁶ Please refer to Appendix Sec. B for more details.
7.2 Effect of Node Aggregation
Here, we evaluate the improvement in performance due to our proposed technique of aggregating nodes to form a cluster. There are two aspects involved during the creation of clusters for a pooled graph:

FITNESS: calculating fitness scores for individual nodes. Scores can be calculated either by using only the medoid or by aggregating neighborhood information.

CLUSTER: generating a representation for the new cluster node. Cluster representation can either be the medoid’s representation or some feature aggregation of the neighborhood around the medoid.
We test three types of aggregation methods: ’None’, ’Only cluster’ and ’Both’ as described in Table 4. As shown in Table 5, we observe that our proposed node aggregation helps improve the performance of ASAP.
| Aggregation | FRANKENSTEIN | NCI1 |
| --- | --- | --- |
| None | 67.4 ± 0.6 | 69.9 ± 2.5 |
| Only cluster | 67.5 ± 0.5 | 70.6 ± 1.8 |
| Both | | |
| Attention | FRANKENSTEIN | NCI1 |
| --- | --- | --- |
| T2T | 67.6 ± 0.5 | 70.3 ± 2.0 |
| S2T | 67.7 ± 0.5 | 69.9 ± 2.0 |
| M2T | | |
7.3 Effect of M2T Attention
We compare our M2T attention framework with the previously proposed S2T and T2T attention techniques. The results are shown in Table 6. We find that M2T attention is indeed better than the rest on NCI1 and comparable on FRANKENSTEIN.
| Fitness function | FRANKENSTEIN | NCI1 |
| --- | --- | --- |
| GCN | 62.7 ± 0.3 | 65.4 ± 2.5 |
| BasicLEConv | 63.1 ± 0.7 | 69.8 ± 1.9 |
| LEConv | 67.8 ± 0.6 | 70.7 ± 2.3 |
7.4 Effect of LEConv as a fitness scoring function
In this section, we analyze the impact of LEConv as a fitness scoring function in ASAP. We use two baselines: GCN (Eq. 1) and BasicLEConv, which computes $\phi_i = \sigma\big(\sum_{j \in \mathcal{N}(i) \cup \{i\}} A_{i,j}(x_i W - x_j W)\big)$. In Table 7 we can see that BasicLEConv and LEConv perform significantly better than GCN because of their ability to model functions of local extrema. Further, we observe that LEConv performs better than BasicLEConv as it has three different linear transformations compared to only one in the latter. This allows LEConv to potentially learn more complicated scoring functions which are better suited for the final task. Hence, our analysis in Theorem 1 is empirically validated.
7.5 Effect of computing Soft edge weights
We evaluate the importance of calculating edge weights for the pooled graph as defined in Eq. 10. We use the best model configuration found in the above ablation analysis and then add the computation of soft edge weights between clusters. We observe a significant drop in performance when the edge weights are not computed. This demonstrates the necessity of capturing the edge information while pooling graphs.
| Soft edge weights | FRANKENSTEIN | NCI1 |
| --- | --- | --- |
| Absent | 67.8 ± 0.6 | 70.7 ± 2.3 |
| Present | | |
8 Discussion
8.1 Comparison with other pooling methods
DiffPool
DiffPool and ASAP both aggregate nodes to form clusters. While ASAP only considers nodes within an $h$-hop neighborhood of a node (medoid) as a cluster, DiffPool considers the entire graph. As a result, in DiffPool, two nodes that are disconnected or far away in the graph can be assigned to similar clusters if the nodes and their neighbors have similar features. Since this type of cluster formation is undesirable for a pooling operator [47], DiffPool utilizes an auxiliary link prediction objective during training specifically to prevent far away nodes from being clustered together. ASAP needs no such additional regularization because it ensures localness while clustering. DiffPool's soft cluster assignment matrix is calculated from all the nodes to all the clusters, making it a dense matrix. Calculating and storing this matrix does not scale easily to large graphs. ASAP, due to its local clustering over $h$-hop neighborhoods, generates a sparse assignment matrix while retaining the hierarchical clustering properties of DiffPool. Further, for each pooling layer, DiffPool has to predetermine the number of clusters it needs to pick, which is fixed irrespective of the input graph size. Since ASAP selects the top $k$ fraction of clusters in the current graph, it inherently takes the size of the input graph into consideration.
TopK & SAGPool
While TopK completely ignores the graph structure during pooling, SAGPool modifies the TopK formulation by incorporating the graph structure through the use of a GCN for computing node scores. To enforce sparsity, both TopK and SAGPool avoid computing the cluster assignment matrix that DiffPool proposed. Instead of grouping multiple nodes to form a cluster in the pooled graph, they drop nodes from the original graph based on a score [6], which can potentially lead to a loss of node and edge information. Thus, they fail to leverage the overall graph structure while creating the clusters. In contrast to TopK and SAGPool, ASAP can capture the rich graph structure while aggregating nodes to form clusters in the pooled graph. TopK and SAGPool sample edges from the original graph to define the edge connectivity in the pooled graph. Therefore, they need to sample nodes from a local neighborhood to avoid isolated nodes in the pooled graph. Maintaining graph connectivity prevents these pooling operations from sampling representative nodes from the entire graph. The pooled graph in ASAP has better edge connectivity compared to TopK and SAGPool because soft edge weights are computed between clusters using up to three-hop connections in the original graph. Also, the use of LEConv instead of GCN for computing fitness values further allows ASAP to sample representative clusters from local neighborhoods over the entire graph.
8.2 Comparison of SelfAttention variants
Source2Token & Token2Token
T2T models the membership of a node by generating a query based only on the medoid of the cluster. Graph Attention Network (GAT) [40] is an example of T2T attention in graphs. S2T finds the importance of each node for a global task. As shown in Eq. 3, since a query vector is not used for calculating the attention scores, S2T inherently assigns the same membership score to a node for all the possible clusters that node can belong to. Hence, both the S2T and T2T mechanisms fail to effectively utilize intra-cluster information while calculating a node's cluster membership. On the other hand, M2T uses a master function to generate a query vector which depends on all the entities within the cluster and hence is a more representative formulation. To understand this, consider the following scenario: if a non-medoid node is removed from a given cluster, then the unnormalized membership scores of the remaining nodes will remain unaffected in the S2T and T2T frameworks, whereas the change will be reflected in the scores calculated using the M2T mechanism. Also, from Table 6, we find that M2T performs better than S2T and T2T attention, showing that M2T is better suited for global tasks like pooling.
9 Conclusion
In this paper, we introduce ASAP, a sparse and differentiable pooling method for graph-structured data. ASAP clusters local subgraphs hierarchically, which helps it effectively learn the rich information present in the graph structure. We propose the Master2Token self-attention framework, which enables our model to better capture the membership of each node in a cluster. We also propose LEConv, a novel GNN formulation that scores clusters based on their local and global importance. ASAP leverages LEConv to compute cluster fitness scores and samples the clusters based on them. This ensures the selection of representative clusters throughout the graph. ASAP also calculates sparse edge weights for the selected clusters and is able to capture the edge connectivity information efficiently while being scalable to large graphs. We validate the effectiveness of the components of ASAP both theoretically and empirically. Through extensive experiments, we demonstrate that ASAP achieves state-of-the-art performance on multiple graph classification datasets.
10 Acknowledgements
We would like to thank the developers of PyTorch Geometric [12], which allows quick implementation of geometric deep learning models. We would like to thank Matthias Fey for actively maintaining the library and quickly responding to our queries on GitHub.
References
 [1] (2014) Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473. Cited by: §3.3.
 [2] (2017) Graph convolutional encoders for syntax-aware neural machine translation. Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. External Links: Link, Document Cited by: §1.
 [3] (2005) Protein function prediction via graph kernels. Bioinformatics. Cited by: §6.1.
 [4] (2017) Geometric deep learning: going beyond euclidean data. IEEE Signal Processing Magazine. Cited by: §2.1.
 [5] (2013) Spectral networks and locally connected networks on graphs. arXiv preprint arXiv:1312.6203. Cited by: §1, §2.1.
 [6] (2018) Towards sparse hierarchical graph classifiers. arXiv preprint arXiv:1811.01287. Cited by: §1, §6.3, §8.1.
 [7] (2016) Long short-term memory-networks for machine reading. Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. External Links: Link, Document Cited by: §3.3.
 [8] (2016) Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in neural information processing systems, Cited by: §2.1, §2.2.
 [9] (2007) Weighted graph cuts without eigenvectors a multilevel approach. IEEE transactions on pattern analysis and machine intelligence. Cited by: §2.2.
 [10] (2003) Distinguishing enzyme structures from non-enzymes without alignments. Journal of molecular biology. Cited by: §6.1.
 [11] (2018) SplineCNN: fast geometric deep learning with continuous B-spline kernels. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.2.
 [12] (2019) Fast graph representation learning with pytorch geometric. arXiv preprint arXiv:1903.02428. Cited by: Appendix A, §10.
 [13] (2019) Graph U-Nets. arXiv preprint arXiv:1905.05178. Cited by: §1, §1, §2.2, §4.3, Table 2, §6.2.
 [14] (2017) Neural message passing for quantum chemistry. In Proceedings of the 34th International Conference on Machine Learning  Volume 70, External Links: Link Cited by: §1, §2.1.
 [15] (2017) Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, Cited by: §1.
 [16] (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, Cited by: §1.
 [17] (2015) Deep convolutional networks on graphstructured data. arXiv preprint arXiv:1506.05163. Cited by: §1, §2.1.
 [18] (201608) Molecular graph convolutions: moving beyond fingerprints. Journal of ComputerAided Molecular Design 30 (8), pp. 595–608. External Links: ISSN 15734951, Link, Document Cited by: §1.
 [19] (2014) Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882. Cited by: §1.
 [20] (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: Appendix A.
 [21] (2017) Semisupervised classification with graph convolutional networks. In International Conference on Learning Representations (ICLR), Cited by: §1, §2.1, §3.2.
 [22] (2012) ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger (Eds.), External Links: Link Cited by: §1.
 [23] (201909–15 Jun) Selfattention graph pooling. In Proceedings of the 36th International Conference on Machine Learning, Cited by: Appendix B, §1, §1, Table 2, §6.2, §6.3, §7.1.
 [24] (2016) Gated graph sequence neural networks. CoRR abs/1511.05493. Cited by: §1, §2.2, Table 2, §6.2.
 [25] (2019) Graph convolutional networks with eigenpooling. arXiv preprint arXiv:1904.13107. Cited by: §2.2.
 [26] (2017) Encoding sentences with graph convolutional networks for semantic role labeling. Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. External Links: Link, Document Cited by: §1.
 [27] (2017) Geometric deep learning on graphs and manifolds using mixture model cnns. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.1.
 [28] (2018) Weisfeiler and leman go neural: higherorder graph neural networks. External Links: 1810.02244 Cited by: §2.1.
 [29] (2015) Graph invariant kernels. In TwentyFourth International Joint Conference on Artificial Intelligence, Cited by: §6.1.
 [30] (2018) MTCGCNN: integrating crystal graph convolutional neural network with multitask learning for material property prediction. arXiv preprint arXiv:1811.05660. Cited by: §1.
 [31] (2007) Graph clustering. Computer science review. Cited by: §4.1.
 [32] (2017) Modeling relational data with graph convolutional networks. arXiv preprint arXiv:1703.06103. Cited by: §1.
 [33] (2018) Disan: directional selfattention network for rnn/cnnfree language understanding. In ThirtySecond AAAI Conference on Artificial Intelligence, Cited by: §3.3.
 [34] (2011) Weisfeilerlehman graph kernels. Journal of Machine Learning Research. Cited by: §6.1.
 [35] (2017) Dynamic edgeconditioned filters in convolutional neural networks on graphs. In Proceedings of the IEEE conference on computer vision and pattern recognition, Cited by: §2.2.
 [36] (2018) Dating documents using graph convolution networks. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). External Links: Link, Document Cited by: §1.
 [37] (2018) Reside: improving distantlysupervised neural relation extraction using side information. arXiv preprint arXiv:1812.04361. Cited by: §1.
 [38] (2019) Compositionbased multirelational graph convolutional networks. arXiv preprint arXiv:1911.03082. Cited by: §1.
 [39] (2017) Attention is all you need. In Advances in neural information processing systems, Cited by: §3.3.
 [40] (2017) Graph attention networks. arXiv preprint arXiv:1710.10903. Cited by: §1, §8.2.
 [41] (2016) Order matters: sequence to sequence for sets. In International Conference on Learning Representations (ICLR), External Links: Link Cited by: §1, §2.2, Table 2, §6.2.
 [42] (2008) Comparison of descriptor spaces for chemical compound retrieval and classification. Knowledge and Information Systems. Cited by: §6.1.
 [43] (2017) Starlike tree — Wikipedia, the free encyclopedia. Note: https://en.wikipedia.org/w/index.php?title=Starlike_tree&oldid=791882487[Online; accessed 17November2019] Cited by: Definition 3.
 [44] (2019) Path graph — Wikipedia, the free encyclopedia. Note: [Online; accessed 17November2019] External Links: Link Cited by: Definition 4.
 [45] (201804) Crystal graph convolutional neural networks for an accurate and interpretable prediction of material properties. Phys. Rev. Lett. 120, pp. 145301. External Links: Document, Link Cited by: §1.
 [46] (2018) How powerful are graph neural networks?. External Links: 1810.00826 Cited by: §2.1.
 [47] (2018) Hierarchical graph representation learning with differentiable pooling. In Proceedings of the 32Nd International Conference on Neural Information Processing Systems, External Links: Link Cited by: §1, §1, §2.2, §4.4, Table 2, §6.2, §8.1.
 [48] (2018) An endtoend deep learning architecture for graph classification. In ThirtySecond AAAI Conference on Artificial Intelligence, Cited by: §1, §2.2, Table 2, §6.2.
Appendix
Appendix A Hyperparameter Tuning
For all our experiments, the Adam [20] optimizer is used. We use n-fold cross-validation, with one fold each held out for validation and testing and the remaining folds used for training. Models were trained for a fixed number of epochs, with the learning rate decayed at regular epoch intervals. The ranges of the hyperparameter search are provided in Table 9. The model with the best validation accuracy was selected for testing. Our code is based on the PyTorch Geometric library [12].
Table 9: Hyperparameter search ranges.
Hyperparameter  Range
Hidden dimension
Learning rate
Dropout
L2 regularization
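The model-selection procedure described above (grid search, keep the configuration with the best validation accuracy) can be sketched as follows; the grid values and the `toy_validate` stand-in below are placeholders of ours, not the paper's actual ranges from Table 9:

```python
from itertools import product

# Hypothetical search grid; the paper's actual ranges are listed in Table 9.
grid = {
    "hidden_dim": [32, 64, 128],
    "lr": [0.01, 0.001],
    "dropout": [0.0, 0.5],
}

def select_best(validate, grid):
    """Return the configuration with the highest validation accuracy."""
    best_cfg, best_acc = None, float("-inf")
    for values in product(*grid.values()):
        cfg = dict(zip(grid.keys(), values))
        acc = validate(cfg)          # train + evaluate one configuration
        if acc > best_acc:
            best_cfg, best_acc = cfg, acc
    return best_cfg, best_acc

# Toy stand-in for training and validating one configuration.
toy_validate = lambda cfg: cfg["hidden_dim"] / 128 - cfg["dropout"] * 0.1
cfg, acc = select_best(toy_validate, grid)
assert cfg["hidden_dim"] == 128 and cfg["dropout"] == 0.0
```

The selected configuration would then be retrained and evaluated on the held-out test fold.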
Appendix B Details of Hierarchical Pooling Setup
For hierarchical pooling, we follow SAGPool [23] and use three layers of GCN, each followed by a pooling layer. After each pooling step, the graph is summarized using a readout function which is the concatenation of the mean and max of the node representations (similar to SAGPool). The summaries are then added and passed through a network of fully-connected layers separated by dropout layers to predict the class.
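A minimal numpy sketch of this readout and summary aggregation, assuming the concatenated statistics are the mean and max of the node features (as in SAGPool); the array shapes here are illustrative only:

```python
import numpy as np

def readout(x):
    """Graph summary for one level: concatenation of mean and max over nodes.

    x: (num_nodes, d) node feature matrix after one pooling step.
    Returns a (2 * d,) graph-level summary vector.
    """
    return np.concatenate([x.mean(axis=0), x.max(axis=0)])

def hierarchical_summary(levels):
    """Add the readouts produced after each pooling step."""
    return np.sum([readout(x) for x in levels], axis=0)

# Toy example: node features after three (hypothetical) pooling steps,
# with the graph shrinking from 6 to 3 to 2 nodes.
levels = [np.ones((6, 4)), 2 * np.ones((3, 4)), 3 * np.ones((2, 4))]
summary = hierarchical_summary(levels)  # shape (8,), all entries 1 + 2 + 3 = 6
```

The resulting summary vector would then be fed to the fully-connected classifier described above.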
Appendix C Details of Global Pooling Setup
The global pooling architecture is the same as the hierarchical architecture, with the only difference that pooling is done once, after all the GCN layers. We do not use a readout function for global pooling, as it is not required. To be comparable with other models, we restrict the feature dimension of the pooling output. For the global pooling layers, the search ranges for hidden dimension and learning rate were the same as for ASAP.
Method  Range
Set2Set  processing-step
GlobalAttention  transform
SortPool  k is chosen such that the output of pooling matches the restricted feature dimension
Appendix D Similarities between pooling in CNN and ASAP
In CNNs, pooling methods (e.g., mean pooling and max pooling) have two hyperparameters: kernel size and stride. The kernel size decides the number of pixels considered when computing each new pixel value in the next layer. The stride decides the fraction of new pixels being sampled, thereby controlling the size of the image in the next layer. In ASAP, h determines the neighborhood radius of clusters and k decides the sampling ratio, which makes h and k analogous to the kernel size and stride of CNN pooling, respectively. There are, however, some key differences. In a CNN, a given kernel size corresponds to a fixed number of pixels around a central pixel, whereas in ASAP the number of nodes being considered is variable although the neighborhood radius is constant. In a CNN, the stride samples uniformly from the new pixels, whereas in ASAP the model has the flexibility to attend to different parts of the graph and sample accordingly.
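The CNN side of this analogy can be illustrated with a toy 1-D max pooling, where `kernel_size` plays the role of ASAP's neighborhood radius and `stride` the role of the sampling ratio; this is an illustrative sketch of ours, not code from the paper:

```python
import numpy as np

def max_pool_1d(x, kernel_size, stride):
    """1-D max pooling: kernel_size fixes how many inputs feed each output,
    stride fixes how densely outputs are sampled (and hence the output size)."""
    out = []
    for start in range(0, len(x) - kernel_size + 1, stride):
        out.append(max(x[start:start + kernel_size]))
    return np.array(out)

x = np.array([1, 3, 2, 5, 4, 6, 0, 7])
max_pool_1d(x, kernel_size=2, stride=2)  # -> [3, 5, 6, 7]
```

Increasing the stride shrinks the output faster, just as a smaller sampling ratio k yields a smaller pooled graph.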
Appendix E Ablation on pooling ratio
Intuitively, a higher pooling ratio k will lead to more information retention; hence, we expect performance to increase with increasing k. This is empirically observed in Fig. 2. However, as k increases, the computational resources required by the model also increase, because a relatively larger pooled graph gets propagated to the later layers. Hence, there is a trade-off between performance and computational requirements while deciding on the pooling ratio k.
Appendix F Proof of Theorem 1
Theorem 1. Let $G$ be a graph with positive adjacency matrix $A$, i.e., $A_{i,j} \ge 0$. Consider any function $f$ which depends on the difference between a node and its neighbors after a linear transformation $\Theta$, e.g.:

$f(X, A)_{i} = \sum_{j \in \mathcal{N}(i)} A_{i,j}\,(x_{i}\Theta - x_{j}\Theta)$

where $\Theta \in \mathbb{R}^{d \times 1}$ and $f(X, A) \in \mathbb{R}^{|V|}$.

a) If the fitness value $\phi = \mathrm{GCN}(X, A)$, then $\phi$ cannot learn $f$.

b) If the fitness value $\phi = \mathrm{LEConv}(X, A)$, then $\phi$ can learn $f$.
Proof.
For GCN, $\phi_{i} = \sigma\big(\sum_{j} \hat{A}_{i,j}\, x_{j}\Theta\big)$, where $\Theta$ is a learnable matrix and $\hat{A}$ is the normalized adjacency matrix. Since $\hat{A}_{i,j} \ge 0$, $\phi_{i}$ cannot have a term of the form $-A_{i,j}\,x_{j}\Theta$, which proves the first part of the theorem. We prove the second part by showing that LEConv can learn the following function $f$:

$f(X, A)_{i} = \sum_{j \in \mathcal{N}(i)} A_{i,j}\,(x_{i}\Theta - x_{j}\Theta) \quad (11)$

The LEConv formulation is defined as:

$\phi_{i} = \sigma\Big(x_{i}\Theta_{1} + \sum_{j \in \mathcal{N}(i)} A_{i,j}\,(x_{i}\Theta_{2} - x_{j}\Theta_{3})\Big) \quad (12)$

where $\Theta_{1}$, $\Theta_{2}$ and $\Theta_{3}$ are learnable matrices. For $\Theta_{1} = 0$, $\Theta_{2} = \Theta_{3} = \Theta$ and $\sigma$ taken as the identity, we find that Eq. (12) is equal to Eq. (11). ∎
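The reduction used in the second part can be checked numerically; the function names and the exact parametrization below are our own stand-ins for the LEConv formulation above, with the nonlinearity taken as the identity:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 3
A = rng.random((n, n))            # positive adjacency matrix, A[i, j] >= 0
np.fill_diagonal(A, 0)
X = rng.standard_normal((n, d))
theta = rng.standard_normal((d, 1))

# Target function: f(X, A)_i = sum_j A_ij * (x_i @ theta - x_j @ theta)
v = X @ theta                     # (n, 1) transformed node scores
f = (A * (v - v.T)).sum(axis=1)

def leconv(X, A, t1, t2, t3, sigma=lambda z: z):
    """LEConv-style update: x_i -> sigma(x_i t1 + sum_j A_ij (x_i t2 - x_j t3))."""
    deg = A.sum(axis=1, keepdims=True)              # sum_j A_ij per node
    return sigma(X @ t1 + deg * (X @ t2) - A @ (X @ t3))

# With t1 = 0, t2 = t3 = theta and identity activation, LEConv equals f.
phi = leconv(X, A, np.zeros((d, 1)), theta, theta).ravel()
assert np.allclose(phi, f)
```

A plain GCN-style update `A @ X @ theta` has only non-negative coefficients on the neighbor terms, so it cannot reproduce the signed difference in `f`, which is the point of the first part.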
Appendix G Graph Connectivity
Proof of Theorem 2
Definition 1.
For a graph $G$ and an integer $h$, we define optimum-nodes $\eta(G)$ as the maximum number of nodes that can be selected such that every pair of selected nodes is at least $h+1$ hops away from each other.
Definition 2.
For a given number of nodes $N$, we define the optimum-tree $T_{N}$ as the tree which has the maximum optimum-nodes $\eta$ among all possible trees with $N$ nodes.
Lemma 1.
Let $T_{N}$ be an optimum-tree with $N$ vertices and $T_{N+1}$ an optimum-tree with $N+1$ vertices. The optimum-nodes of $T_{N}$ and $T_{N+1}$ differ by at most one, i.e., $\eta(T_{N+1}) \le \eta(T_{N}) + 1$.
Proof.
Consider $T_{N+1}$, which has $N+1$ nodes. We can remove any one of the leaf nodes of $T_{N+1}$ to obtain a tree $T'$ with $N$ nodes. If the removed leaf was one of the selected nodes of $T_{N+1}$, then $\eta(T_{N+1})$ would become $\eta(T_{N+1}) - 1$. If any other node was removed, then, being a leaf, it does not constitute the shortest path between any of the selected nodes. This implies that the optimum-nodes of $T'$ is at least $\eta(T_{N+1}) - 1$, i.e.,

$\eta(T') \ge \eta(T_{N+1}) - 1 \quad (13)$

Since $T_{N}$ is the optimum-tree, we know that:

$\eta(T_{N}) \ge \eta(T') \quad (14)$

Using Eq. (13) and (14) we can write $\eta(T_{N}) \ge \eta(T_{N+1}) - 1$, which proves our lemma. ∎
Lemma 2.
Let $T_{N}$ be an optimum-tree with $N$ vertices and $T_{N+1}$ an optimum-tree with $N+1$ vertices. Then $T_{N}$ is an induced subgraph of $T_{N+1}$.
Proof.
Let us choose a node to be removed from $T_{N+1}$ so as to obtain a tree $T'$ with $N$ nodes, with the objective of ensuring a maximum $\eta(T')$. To do so, we can only remove a leaf node from $T_{N+1}$. This is because removing a non-leaf node can reduce the shortest path between multiple pairs of nodes, whereas removing a leaf node will reduce only the shortest paths to the new leaf at that position. This ensures the least reduction in optimum-nodes for $T'$. Removing a leaf node implies that $\eta(T')$ cannot be less than $\eta(T_{N+1}) - 1$, as it affects only the paths involving that particular leaf node. Using Lemma 1, we see that $\eta(T')$ is equal to $\eta(T_{N})$, i.e., $T'$ is one of the possible optimum-trees with $N$ nodes. Since $T'$ was formed by removing a leaf node from $T_{N+1}$, we find that $T_{N}$ is indeed an induced subgraph of $T_{N+1}$.
∎
Definition 3.
A starlike tree is a tree having at most one node (the root) with degree greater than two [43]. We consider a starlike tree with height $r$ to be balanced if there is at most one leaf at a height less than $r$, while the rest are all at a height $r$ from the root. Figure 3(a) depicts an example of a balanced starlike tree.
Definition 4.
A path graph is a tree in which two nodes have degree one and all remaining nodes have degree two [44].
Lemma 3.
For a balanced starlike tree $T$ with $N$ nodes and height $\frac{h+1}{2}$, where $h+1$ is even, $\eta(T) = \frac{2(N-1)}{h+1}$, i.e., the bound is attained when the leaves are selected.
Lemma 4.
Among all the possible trees which have $N$ vertices, the maximum achievable $\eta$ is $\frac{2(N-1)}{h+1}$, which is obtained if the tree is a balanced starlike tree with height $\frac{h+1}{2}$ when $h+1$ is even.
Proof.
To prove the lemma, we use induction. Here, the base case corresponds to a path graph with $h+2$ nodes, a trivial case of a starlike tree, as it has only $2$ nodes which are $h+1$ hops away from each other. From the formula $\frac{2(N-1)}{h+1}$, we get $\frac{2(h+1)}{h+1} = 2$, which verifies the base case.
For some $N$, let us assume that the lemma is true, i.e., a balanced starlike tree $T_{N}$ with height $\frac{h+1}{2}$ achieves the maximum $\eta$ among all trees with $N$ vertices. Consider $T_{N+1}$ to be the optimum-tree for $N+1$ nodes. From Lemma (2), we know that $T_{N}$ is an induced subgraph of $T_{N+1}$. This means that $T_{N+1}$ can be obtained by adding a node to $T_{N}$. Since we are constructing $T_{N+1}$, we need to add a node to $T_{N}$ such that the maximum number of nodes can be selected which are at least $h+1$ hops away from each other. There are three possible structures for the tree, depending on the minimum height among all its branches: (a) the minimum height among all the branches is less than $\frac{h+1}{2} - 1$, (b) the minimum height among all the branches is equal to $\frac{h+1}{2} - 1$, and (c) the minimum height among all the branches is equal to $\frac{h+1}{2}$. Although case (a) is not possible, as we assumed $T_{N}$ to be a balanced starlike tree, we consider it for the sake of completeness. For case (a), no matter where we add the node, $\eta$ will not increase. However, we should add the node to the leaf of the branch with the least height, as this will allow the new leaf of that branch to be chosen in case the number of nodes in the tree is increased to some $N' > N+1$ such that the height of that branch becomes $\frac{h+1}{2}$. For case (b), we should add the node to the leaf of the branch with the least height so that its height becomes $\frac{h+1}{2}$ and the new leaf of that branch gets selected. For case (c), no matter where we add the node, $\eta$ will not increase. Unlike case (a), we should add the new node to the root so as to start a new branch whose leaf could be selected if that branch grows to a height of $\frac{h+1}{2}$ for some larger $N'$. In all three cases, $T_{N+1}$ is a balanced starlike tree, as the new node is either added to the leaf of a branch whose height is less than $\frac{h+1}{2}$, or to the root when all branches already have height $\frac{h+1}{2}$. Hence, by induction, the lemma is proved.
∎
Theorem 2. Let the input graph be a tree of any possible structure with $N$ nodes. Let $k_{min}$ be the lower bound on the sampling ratio that ensures the existence of at least one edge in the pooled graph, irrespective of the structure of the tree and the location of the selected nodes. For TopK or SAGPool, $k_{min} \to 1$, whereas for ASAP, $k_{min} \to 0.5$, as $N \to \infty$.
Proof.
From Lemma (4) and (3), we know that among all the possible trees which have $N$ vertices, the maximum achievable optimum-nodes is $\frac{2(N-1)}{h+1}$. Using the pigeonhole principle, we can show that for a pooling method which keeps selected clusters connected up to $h$ hops, if the number of sampled clusters is greater than $\frac{2(N-1)}{h+1}$, then there will always be an edge in the pooled graph irrespective of the position of the selected clusters:

$\lceil kN \rceil > \frac{2(N-1)}{h+1} \quad (15)$

Let us consider a 1-hop neighborhood for pooling. TopK and SAGPool retain an edge only between selected nodes that were adjacent in the original graph, i.e., $h = 1$. Substituting in Eq. (15) we get $k > \frac{N-1}{N}$, and as $N \to \infty$ we obtain $k_{min} \to 1$. ASAP with 1-hop clusters keeps two selected clusters connected if they are at most $2 \cdot 1 + 1 = 3$ hops apart, i.e., $h = 3$. Substituting in Eq. (15) we get $k > \frac{N-1}{2N}$, and as $N \to \infty$ we obtain $k_{min} \to 0.5$. ∎
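The counting behind these bounds can be checked by brute force on small trees; the helper functions below are our own, and assume our reading of the setup: selected nodes must be at least 2 hops apart to isolate TopK/SAGPool, and at least 4 hops apart to isolate ASAP with 1-hop clusters:

```python
from itertools import combinations

def balanced_starlike(branches, height):
    """Adjacency list of a starlike tree: root 0 with `branches` paths of `height`."""
    adj = {0: []}
    nid = 1
    for _ in range(branches):
        prev = 0
        for _ in range(height):
            adj.setdefault(nid, []).append(prev)
            adj[prev].append(nid)
            prev = nid
            nid += 1
    return adj

def dist(adj, src):
    """BFS hop distances from src."""
    d = {src: 0}
    frontier = [src]
    while frontier:
        nxt = []
        for u in frontier:
            for v in adj[u]:
                if v not in d:
                    d[v] = d[u] + 1
                    nxt.append(v)
        frontier = nxt
    return d

def max_separated(adj, min_dist):
    """Largest set of nodes that are pairwise at least `min_dist` hops apart."""
    nodes = list(adj)
    D = {u: dist(adj, u) for u in nodes}
    for r in range(len(nodes), 0, -1):
        for sub in combinations(nodes, r):
            if all(D[u][v] >= min_dist for u, v in combinations(sub, 2)):
                return r
    return 0

adj = balanced_starlike(branches=3, height=2)   # N = 7 nodes
# Leaves are pairwise 4 hops apart: the max set with separation >= 4 is (N-1)/2 = 3.
assert max_separated(adj, 4) == 3
# With separation >= 2 (the TopK setting), a star's N - 1 leaves all qualify.
star = balanced_starlike(branches=6, height=1)  # N = 7, star graph
assert max_separated(star, 2) == 6
```

The star case is what forces TopK's sampling ratio toward 1, while the halved count on the height-2 starlike tree is what yields 0.5 for ASAP.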
Similar Analysis for Path Graph
Lemma 5.
For a path graph with $N$ nodes, $\eta = \left\lceil \frac{N}{h+1} \right\rceil$.
Lemma 6.
Consider the input graph to be a path graph with $N$ nodes. To ensure that a pooling operator which keeps selected nodes connected up to $h$ hops, with sampling ratio $k$, has at least one edge in the pooled graph, irrespective of the location of the selected clusters, we have the following inequality on $k$: $k > \frac{1}{N}\left\lceil \frac{N}{h+1} \right\rceil$.
Proof.
From Lemma (5), we know that $\eta = \left\lceil \frac{N}{h+1} \right\rceil$. Using the pigeonhole principle, we can show that for a pooling method which keeps selected nodes connected up to $h$ hops, if the number of sampled clusters is greater than $\left\lceil \frac{N}{h+1} \right\rceil$, then there will always be an edge in the pooled graph irrespective of the position of the selected clusters:

$\lceil kN \rceil > \left\lceil \frac{N}{h+1} \right\rceil \quad (16)$

From Eq. (16), we get $k > \frac{1}{N}\left\lceil \frac{N}{h+1} \right\rceil$, which completes the proof. ∎
Theorem 3. Consider the input graph to be a path graph with $N$ nodes. Let $k_{min}$ be the lower bound on the sampling ratio that ensures the existence of at least one edge in the pooled graph. For TopK or SAGPool, $k_{min} \to 0.5$ as $N \to \infty$, whereas for ASAP, $k_{min} \to 0.25$ as $N \to \infty$.
Proof.
From Lemma (6), we get $k > \frac{1}{N}\left\lceil \frac{N}{h+1} \right\rceil$. Using $h = 1$ for TopK and SAGPool, as $N \to \infty$ we get $k_{min} \to 0.5$. Using $h = 3$ for ASAP, as $N \to \infty$ we get $k_{min} \to 0.25$. ∎
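The path-graph count behind this bound can likewise be verified by brute force for small $N$; the helper below is our own illustration of the $\left\lceil \frac{N}{h+1} \right\rceil$ formula:

```python
from itertools import combinations
import math

def max_separated_on_path(n, min_dist):
    """Brute force: largest subset of {0..n-1} pairwise at least min_dist apart.

    On a path graph, hop distance equals the difference of node indices, and
    combinations() yields sorted tuples, so checking consecutive gaps suffices.
    """
    for r in range(n, 0, -1):
        for sub in combinations(range(n), r):
            if all(b - a >= min_dist for a, b in zip(sub, sub[1:])):
                return r
    return 0

for n in range(1, 13):
    for h in (1, 3):  # h = 1 for TopK/SAGPool, h = 3 for ASAP with 1-hop clusters
        assert max_separated_on_path(n, h + 1) == math.ceil(n / (h + 1))
```

With separation 2 roughly half the path qualifies, and with separation 4 roughly a quarter does, matching the limits 0.5 and 0.25 above.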
Graph Connectivity via Graph Power.
To minimize the possibility of nodes getting isolated in the pooled graph, TopK employs graph power $p$, i.e., it uses $A^{p}$ instead of $A$. This helps in increasing the density of the graph before pooling. While using graph power, TopK can connect two nodes which are at most $p$ hops away, whereas ASAP in this setting can connect nodes up to $2h + p$ hops away in the original graph. As $2h + p > p$, ASAP will always have better connectivity given graph power.
Appendix H Graph Permutation Equivariance
Given a permutation matrix $P$ and a function $f(X, A)$ depending on a graph with node feature matrix $X$ and adjacency matrix $A$, graph permutation is defined as $f(X, A) \mapsto f(PX, PAP^{T})$, node permutation as $X \mapsto PX$, and edge permutation as $A \mapsto PAP^{T}$.
Graph pooling operations should produce pooled graphs which are isomorphic after graph permutation, i.e., they need to be graph permutation equivariant or invariant. We show that ASAP is graph permutation equivariant.
Proposition 1. ASAP is a graph permutation equivariant pooling operator.
Proof.
Since the cluster assignment matrix $S$ is computed by an attention mechanism which attends to all the edges in the graph, permuting the graph permutes the assignments accordingly:

$S(PX, PAP^{T}) = PSP^{T} \quad (17)$

Because the fitness scores are themselves permutation-equivariant, the same top $\lceil kN \rceil$ clusters, denoted by the index set $i$, are selected after permutation, and $\hat{S}$ changes as:

$\hat{S}(PX, PAP^{T}) = (PSP^{T})(:, i) = P\hat{S} \quad (18)$

up to the ordering of the selected clusters. Using Eq. (18) and $X^{p} = \hat{S}^{T}X$, we can write:

$X^{p}(PX, PAP^{T}) = (P\hat{S})^{T}PX = \hat{S}^{T}P^{T}PX = X^{p}(X, A) \quad (19)$

Since $A^{p} = \hat{S}^{T}A\hat{S}$ and $P^{T}P = I$, we get:

$A^{p}(PX, PAP^{T}) = (P\hat{S})^{T}(PAP^{T})(P\hat{S}) = \hat{S}^{T}A\hat{S} = A^{p}(X, A) \quad (20)$

From Eq. (19) and Eq. (20), we see that graph permutation does not change the output features; it only changes the order in which they are computed, and the result is hence isomorphic to the pooled graph. ∎
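The permutation conventions above can be sanity-checked on any permutation-equivariant function, e.g., one round of plain message passing $f(X, A) = AX$; this toy check is ours, not part of the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 6, 4
X = rng.standard_normal((n, d))
A = rng.random((n, n))
A = (A + A.T) / 2                       # symmetric adjacency matrix
P = np.eye(n)[rng.permutation(n)]       # random permutation matrix

# One round of message passing: f(X, A) = A @ X.
# Permuting the graph permutes the output rows the same way:
#   f(PX, PAP^T) = (PAP^T)(PX) = P(AX) = P f(X, A), using P^T P = I.
lhs = (P @ A @ P.T) @ (P @ X)
rhs = P @ (A @ X)
assert np.allclose(lhs, rhs)
```

The cancellation `P.T @ P = I` used here is exactly the step that makes Eq. (19) and (20) go through.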
Appendix I Pseudo Code
Algorithm 1 gives the pseudo code of ASAP. The working of the Master2Token attention mechanism is explained in Algorithm 2.
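Since the pseudo code itself is not reproduced here, the following is a rough, self-contained numpy sketch of one ASAP pooling step: 1-hop clusters, a simplified Master2Token-style attention (a max-pooled master query scoring each cluster member), a LEConv-style fitness score, and top-k selection. All weight matrices are random stand-ins for learned parameters, and the details deviate from the paper's exact formulation:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def asap_pool(X, A, k=0.5, seed=0):
    """Rough sketch of one ASAP pooling step on a dense graph.

    X: (n, d) node features; A: (n, n) adjacency; k: pooling ratio.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w_att = rng.standard_normal(2 * d) * 0.1        # attention vector (stand-in)
    t1, t2, t3 = (rng.standard_normal((d, 1)) * 0.1 for _ in range(3))

    hop1 = (A > 0) | np.eye(n, dtype=bool)          # 1-hop cluster membership
    S = np.zeros((n, n))                            # soft cluster assignment
    for i in range(n):
        members = np.where(hop1[i])[0]
        m = X[members].max(axis=0)                  # master query (Master2Token)
        scores = np.array([w_att @ np.concatenate([m, X[j]]) for j in members])
        S[members, i] = softmax(scores)
    Xc = S.T @ X                                    # cluster representations

    # Fitness of each cluster via a LEConv-style local-extremum score.
    deg = A.sum(axis=1, keepdims=True)
    fitness = (Xc @ t1 + deg * (Xc @ t2) - A @ (Xc @ t3)).ravel()

    top = np.argsort(-fitness)[: int(np.ceil(k * n))]
    S_hat = S[:, top]
    Xp = Xc[top] * fitness[top, None]               # gate features by fitness
    Ap = S_hat.T @ A @ S_hat                        # pooled adjacency
    return Xp, Ap

# Toy run on a 6-node path graph.
X = np.ones((6, 3))
A = np.array([[0, 1, 0, 0, 0, 0], [1, 0, 1, 0, 0, 0], [0, 1, 0, 1, 0, 0],
              [0, 0, 1, 0, 1, 0], [0, 0, 0, 1, 0, 1], [0, 0, 0, 0, 1, 0]], float)
Xp, Ap = asap_pool(X, A, k=0.5)
assert Xp.shape == (3, 3) and Ap.shape == (3, 3)
```

Note how the final adjacency is formed as $\hat{S}^{T} A \hat{S}$, which is what gives ASAP its improved connectivity over plain node selection.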