ASAP: Adaptive Structure Aware Pooling for Learning Hierarchical Graph Representations



Graph Neural Networks (GNNs) have been shown to work effectively for modeling graph structured data to solve tasks such as node classification, link prediction and graph classification. There has been some recent progress in defining the notion of pooling in graphs whereby the model tries to generate a graph level representation by downsampling and summarizing the information present in the nodes. Existing pooling methods either fail to effectively capture the graph substructure or do not easily scale to large graphs. In this work, we propose ASAP (Adaptive Structure Aware Pooling), a sparse and differentiable pooling method that addresses the limitations of previous graph pooling architectures. ASAP utilizes a novel self-attention network along with a modified GNN formulation to capture the importance of each node in a given graph. It also learns a sparse soft cluster assignment for nodes at each layer to effectively pool the subgraphs to form the pooled graph. Through extensive experiments on multiple datasets and theoretical analysis, we motivate our choice of the components used in ASAP. Our experimental results show that combining existing GNN architectures with ASAP leads to state-of-the-art results on multiple graph classification benchmarks. ASAP has an average improvement of 4% over the current sparse hierarchical state-of-the-art method.

Figure 1: Overview of ASAP: (a) Input graph to ASAP. (b) ASAP initially clusters the h-hop neighborhood of every node, considering each node as a medoid. For brevity, we only show cluster formations with nodes 2 & 6 as medoids. Cluster membership is computed using M2T attention (refer Sec. 4.2). (c) Clusters are scored using LEConv (refer Sec. 4.3). Darker shade denotes higher score. (d) A fraction of the top scoring clusters are selected in the pooled graph. The adjacency matrix is recomputed using edge weights between the member nodes of the selected clusters. (e) Output of ASAP. (f) Overview of the hierarchical graph classification architecture.

1 Introduction

In recent years, there has been an increasing interest in developing Graph Neural Networks (GNNs) for graph structured data. CNNs have been shown to be successful in tasks involving images [22, 16] and text [19]. Unlike such regular grid data, arbitrary shaped graphs have rich information present in their graph structure. By inherently capturing such information through message propagation along the edges of the graph, GNNs have proved to be more effective for graphs [14, 15]. GNNs have been successfully applied in tasks such as semantic role labeling [26], relation extraction [37], neural machine translation [2], document dating [36], and molecular feature extraction [18]. While some of the works focus on learning node-level representations to perform tasks such as node classification [21, 40] and link prediction [32, 38], others focus on learning graph-level representations for tasks like graph classification [5, 17, 47, 13, 23] and graph regression [45, 30]. In this paper, we focus on graph-level representation learning for the task of graph classification.

Briefly, the task of graph classification involves predicting the label of an input graph by utilizing the given graph structure and initial node-level representations. For example, given a molecule, the task could be to predict if it is toxic. Current GNNs are inherently flat and lack the capability of aggregating node information in a hierarchical manner. Such architectures rely on learning node representations through some GNN followed by aggregation of the node information to generate the graph representation [41, 24, 48]. But learning graph representations in a hierarchical manner is important to capture local substructures that are present in graphs. For example, in an organic molecule, a set of atoms together can act as a functional group and play a vital role in determining the class of the graph.

To address this limitation, new pooling architectures have been proposed where sets of nodes are recursively aggregated to form a cluster that represents a node in the pooled graph, thus enabling hierarchical learning. DiffPool [47] is a differentiable pooling operator that learns a soft assignment matrix mapping each node to a set of clusters. Since this assignment matrix is dense, it does not scale easily to large graphs [6]. Following that, TopK [13] was proposed, which learns a scalar projection score for each node and selects the top nodes. It addresses the sparsity concerns of DiffPool but is unable to capture the rich graph structure effectively. Recently, SAGPool [23], a TopK based architecture, has been proposed which leverages a self-attention network to learn the node scores. Although the local graph structure is used for scoring nodes, it is still not used effectively in determining the connectivity of the pooled graph. Pooling methods that leverage the graph structure effectively while maintaining sparsity do not currently exist. We address this gap in this paper.

In this work, we propose a new sparse pooling operator called Adaptive Structure Aware Pooling (ASAP) which overcomes the limitations in current pooling methods. Our contributions can be summarized as follows:

  • We introduce ASAP, a sparse pooling operator capable of capturing local subgraph information hierarchically to learn global features with better edge connectivity in the pooled graph.

  • We propose Master2Token (M2T), a new self-attention framework which is better suited for global tasks like pooling.

  • We introduce a new convolution operator LEConv, that can adaptively learn functions of local extremas in a graph substructure.

2 Related Work

2.1 Graph Neural Networks

Various formulations of GNNs have been proposed which use both spectral and non-spectral approaches. Spectral methods [5, 17] aim at defining the convolution operation using the Fourier transform and the graph Laplacian. These methods do not directly generalize to graphs with different structure [4]. Non-spectral methods [8, 21, 46, 27, 28] define convolution through a local neighborhood around nodes in the graph. They are faster than spectral methods and easily generalize to other graphs. GNNs can also be viewed as a message passing algorithm where nodes iteratively aggregate messages from neighboring nodes through edges [14].

2.2 Pooling

Pooling layers overcome GNN's inability to aggregate nodes hierarchically. Earlier pooling methods focused on deterministic graph clustering algorithms [8, 11, 35]. The authors of DiffPool [47] introduced the first differentiable pooling operator, which outperformed the previous deterministic methods. Since then, new data-driven pooling methods have been proposed; both spectral [25, 9] and non-spectral [47, 13]. Spectral methods aim at capturing the graph topology using eigen-decomposition algorithms. However, due to the higher computational requirements of spectral graph techniques, they are not easily scalable to large graphs. Hence, we focus on non-spectral methods.

Pooling methods can further be divided into global and hierarchical pooling layers. Global pooling methods summarize the entire graph in just one step. Set2Set [41] finds the importance of each node in the graph through iterative content-based attention. Global-Attention [24] uses an attention mechanism to aggregate nodes in the graph. SortPool [48] summarizes the graph by concatenating a few nodes after sorting them based on their features. Hierarchical pooling is used to capture the topological information of graphs. DiffPool forms a fixed number of clusters by aggregating nodes. It uses a GNN to compute a dense soft assignment matrix, making it infeasible for large graphs. TopK scores nodes based on a learnable projection vector and samples a fraction of high scoring nodes. It avoids node aggregation and computing the soft assignment matrix to maintain sparsity in graph operations. SAGPool improves upon TopK by using a GNN to consider the graph structure while scoring nodes. Since TopK and SAGPool neither aggregate nodes nor compute soft edge weights, they are unable to preserve node and edge information effectively.

Property                      DiffPool   TopK   SAGPool   ASAP
Node Aggregation              ✓          ✗      ✗         ✓
Soft Edge Weights             ✓          ✗      ✗         ✓
Variable number of clusters   ✗          ✓      ✓         ✓
Table 1: Properties desired in hierarchical pooling methods.

To address these limitations, we propose ASAP, which has all the desirable properties of hierarchical pooling without compromising on sparsity in graph operations. Please see Table 1 for an overall comparison of hierarchical pooling methods. Further comparisons between hierarchical architectures are presented in Sec. 8.1.

3 Preliminaries

3.1 Problem Statement

Consider a graph $G(V, E)$ with $N$ nodes and $M$ edges. Each node $v_i \in V$ has a $d$-dimensional feature representation denoted by $x_i$. $X \in \mathbb{R}^{N \times d}$ denotes the node feature matrix and $A \in \mathbb{R}^{N \times N}$ represents the weighted adjacency matrix. The graph also has a label $Y$ associated with it. Given a dataset $D = \{(G_1, Y_1), (G_2, Y_2), \ldots\}$, the task of graph classification is to learn a mapping $f: \mathcal{G} \to \mathcal{Y}$, where $\mathcal{G}$ is the set of input graphs and $\mathcal{Y}$ is the set of labels associated with each graph. A pooled graph is denoted by $G^p(V^p, E^p)$ with node feature matrix $X^p$ and adjacency matrix $A^p$.

3.2 Graph Convolution Networks

We use Graph Convolution Network (GCN) [21] for extracting discriminative features for graph classification. GCN is defined as:

$$X^{(l+1)} = \sigma\big(\hat{D}^{-1/2}\hat{A}\hat{D}^{-1/2} X^{(l)} W^{(l)}\big), \qquad (1)$$

where $\hat{A} = A + I$ is the adjacency matrix with added self-loops, $\hat{D}$ is the diagonal degree matrix of $\hat{A}$, and $W^{(l)}$ is a learnable matrix for any layer $l$. We use the initial node feature matrix wherever provided, i.e., $X^{(0)} = X$.
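The propagation rule above can be sketched in pure Python. This is an illustrative sketch only, with dense nested lists and an identity activation; the function name and shapes are ours, not part of the paper:

```python
import math

def gcn_layer(A, X, W):
    """One GCN layer (Eq. 1): X' = D^-1/2 (A+I) D^-1/2 X W, identity activation.

    A: N x N adjacency (list of lists), X: N x d features, W: d x d' weights.
    A pure-Python sketch; real implementations use sparse tensors.
    """
    n = len(A)
    # add self-loops: A_hat = A + I
    A_hat = [[A[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in A_hat]
    # symmetric normalization: A_norm[i][j] = A_hat[i][j] / sqrt(d_i * d_j)
    A_norm = [[A_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
              for i in range(n)]
    # aggregate neighbor features, then apply the linear transform W
    AX = [[sum(A_norm[i][k] * X[k][f] for k in range(n)) for f in range(len(X[0]))]
          for i in range(n)]
    return [[sum(AX[i][f] * W[f][o] for f in range(len(W))) for o in range(len(W[0]))]
            for i in range(n)]
```

On a two-node graph with unit features and identity weights, each node simply averages itself with its neighbor.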

3.3 Self-Attention

Self-attention is used to find the dependency of an input on itself [7, 39]. An alignment score $\alpha_{i,j}$ is computed to map the importance of candidates $c_j$ on target query $q_i$. In self-attention, target queries and candidates are obtained from the input entities $h$. Self-attention can be categorized as Token2Token and Source2Token based on the choice of the target query [33].

Token2Token (T2T)

selects both the target query and the candidates from the input set $h$. In the context of additive attention [1], $\alpha_{i,j}$ is computed as:

$$\alpha_{i,j} = \mathrm{softmax}\big(\vec{w}^{\top} \sigma(W\, [q_i \,\|\, c_j])\big), \qquad (2)$$

where $\|$ is the concatenation operator and $\vec{w}$, $W$ are learnable parameters.

Source2Token (S2T)

finds the importance of each candidate to a specific global task which cannot be represented by any single entity. $\alpha_j$ is computed by dropping the target query term. Eq. (2) changes to the following:

$$\alpha_j = \mathrm{softmax}\big(\vec{w}^{\top} \sigma(W\, c_j)\big). \qquad (3)$$
3.4 Receptive Field

We extend the concept of receptive field from pooling operations in CNN to GNN. We define the node receptive field $RF^{node}$ of a pooling operator as the number of hops needed to cover all the nodes in the neighborhood that influence the representation of a particular output node. Similarly, the edge receptive field $RF^{edge}$ of a pooling operator is defined as the number of hops needed to cover all the edges in the neighborhood that affect the representation of an edge in the pooled graph $G^p$.

4 ASAP: Proposed Method

In this section, we describe the components of our proposed method ASAP. As shown in Fig. 1(b), ASAP initially considers all possible local clusters with a fixed receptive field for a given input graph. It then computes the cluster membership of the nodes using an attention mechanism. These clusters are then scored using a GNN, as depicted in Fig. 1(c). Further, a fraction of the top scoring clusters are selected as nodes in the pooled graph, and new edge weights are computed between neighboring clusters, as shown in Fig. 1(d). Below, we discuss the working of ASAP in detail. Please refer to Appendix Sec. I for pseudo code of the working of ASAP.

4.1 Cluster Assignment

Initially, we consider each node $v_i$ in the graph as a medoid of a cluster $c(v_i)$ such that each cluster can represent only the local neighbors $\mathcal{N}_h(v_i)$ within a fixed radius of $h$ hops, i.e., $c(v_i) = \mathcal{N}_h(v_i)$. This effectively means that $RF^{node} = h$ for ASAP. This helps the clusters to effectively capture the information present in the graph sub-structure.

Let $x^c_i$ be the feature representation of the cluster centered at $v_i$. We define $G^c$ as the graph with node feature matrix $X^c \in \mathbb{R}^{N \times d}$ and adjacency matrix $A^c$. We denote the cluster assignment matrix by $S \in \mathbb{R}^{N \times N}$, where $S_{i,j}$ represents the membership of node $v_i$ in cluster $c(v_j)$. By employing such local clustering [31], we can maintain sparsity of the cluster assignment matrix similar to the original graph adjacency matrix, i.e., the space complexity of both $S$ and $A$ is $O(|E|)$.
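Enumerating one $h$-hop cluster per medoid can be sketched as a breadth-first expansion. This is an illustrative helper of our own (names and signature are not from the paper); it shows why storing only local memberships keeps the assignment sparse:

```python
def local_clusters(adj, h=1):
    """Enumerate the h-hop cluster around every node (each node as medoid).

    adj: dict mapping node -> set of neighbors.
    Returns dict mapping medoid -> set of member nodes.
    Each cluster only touches nodes within h hops, so storing all
    memberships costs O(|E|) for h = 1 rather than the O(N^2) of a
    dense assignment matrix.
    """
    clusters = {}
    for v in adj:
        members = {v}
        frontier = {v}
        for _ in range(h):
            # expand the frontier by one hop, excluding already-seen nodes
            frontier = {u for x in frontier for u in adj[x]} - members
            members |= frontier
        clusters[v] = members
    return clusters
```

For a path 0-1-2, the cluster with medoid 1 contains all three nodes, while the end nodes form two-node clusters.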

4.2 Cluster Formation using Master2Token

Given a cluster $c(v_i)$, we learn the cluster assignment matrix $S$ through a self-attention mechanism. The task here is to learn the overall representation of the cluster by attending to the relevant nodes in it. We observe that both T2T and S2T attention mechanisms described in Sec. 3.3 do not utilize any intra-cluster information. Hence, we propose a new variant of self-attention called Master2Token (M2T). We further motivate the need for the M2T framework later in Sec. 8.2. In the M2T framework, we first create a master query $m_i \in \mathbb{R}^d$ which is representative of all the nodes within a cluster:

$$m_i = f_m\big(x'_j \mid v_j \in c(v_i)\big), \qquad (4)$$

where $x'_j$ is obtained after passing $x_j$ through a separate GCN to capture structural information in the cluster $c(v_i)$. $f_m$ is a master function which combines and transforms the feature representations of the member nodes to find $m_i$. In this work we experiment with the max master function, defined as:

$$m_i = \max_{v_j \in c(v_i)} x'_j. \qquad (5)$$

This master query attends to all the constituent nodes $v_j \in c(v_i)$ using additive attention:

$$\alpha_{i,j} = \mathrm{softmax}\big(\vec{w}^{\top} \sigma(W\, m_i \,\|\, x'_j)\big), \qquad (6)$$

where $\vec{w}$ and $W$ are a learnable vector and matrix respectively. The calculated attention score $\alpha_{i,j}$ signifies the membership strength of node $v_j$ in cluster $c(v_i)$. Hence, we use this score to define the cluster assignment matrix discussed above, i.e., $S_{i,j} = \alpha_{i,j}$. The cluster representation $x^c_i$ for $c(v_i)$ is computed as follows:

$$x^c_i = \sum_{j=1}^{|c(v_i)|} \alpha_{i,j}\, x_j. \qquad (7)$$
4.3 Cluster Selection using LEConv

Similar to TopK [13], we sample clusters based on a cluster fitness score $\phi_i$ calculated for each cluster in the graph $G^c$ using a fitness function $f_{\phi}$. For a given pooling ratio $k \in (0, 1]$, the top $\lceil kN \rceil$ clusters are selected and included in the pooled graph $G^p$. To compute the fitness scores, we introduce Local Extrema Convolution (LEConv), a graph convolution method which can capture local extremum information. In Sec. 5.1 we motivate the choice of LEConv's formulation and contrast it with the standard GCN formulation. LEConv is used to compute $\phi_i$ as follows:

$$\phi_i = \sigma\Big(x^c_i W_1 + \sum_{j \in \mathcal{N}(i)} A^c_{i,j}\,\big(x^c_i W_2 - x^c_j W_3\big)\Big), \qquad (8)$$

where $\mathcal{N}(i)$ denotes the neighborhood of node $i$ in $G^c$, $W_1$, $W_2$, $W_3$ are learnable parameters and $\sigma$ is some activation function. The fitness vector $\Phi$ is multiplied with the cluster feature matrix $X^c$ to make $\Phi$ learnable, i.e.:

$$\hat{X}^c = \Phi \odot X^c, \qquad (9)$$

where $\odot$ is broadcasted hadamard product. The $\mathrm{TOP}_k$ function ranks the fitness scores and gives the indices $\hat{i}$ of the top $\lceil kN \rceil$ selected clusters in $G^c$ as follows:

$$\hat{i} = \mathrm{TOP}_k\big(\Phi, \lceil kN \rceil\big).$$

The pooled graph $G^p$ is formed by selecting these top clusters. The pruned cluster assignment matrix $\hat{S} \in \mathbb{R}^{N \times \lceil kN \rceil}$ and the node feature matrix $X^p$ are given by:

$$\hat{S} = S(:, \hat{i}), \qquad X^p = \hat{X}^c(\hat{i}, :),$$

where $\hat{i}$ is used for index slicing.
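The ranking-and-slicing step can be sketched as follows, where `k` denotes the pooling ratio and `fitness` the per-cluster scores (the helper name is ours):

```python
import math

def select_top_clusters(fitness, k):
    """TOP-k cluster selection: keep the ceil(k*N) highest-fitness clusters.

    fitness: list of per-cluster fitness scores; k: pooling ratio in (0, 1].
    Returns the sorted indices of the selected clusters, i.e., the index
    slice used to prune the assignment and feature matrices.
    """
    n_keep = math.ceil(k * len(fitness))
    # rank clusters by fitness, highest first
    order = sorted(range(len(fitness)), key=lambda i: fitness[i], reverse=True)
    return sorted(order[:n_keep])
```

With four clusters and k = 0.5, the two highest-scoring clusters survive into the pooled graph.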

Method D&D PROTEINS NCI1 NCI109 FRANKENSTEIN
Set2Set [41] 71.60 ± 0.87 72.16 ± 0.43 66.97 ± 0.74 61.04 ± 2.69 61.46 ± 0.47
Global-Attention [24] 71.38 ± 0.78 71.87 ± 0.60 69.00 ± 0.49 67.87 ± 0.40 61.31 ± 0.41
SortPool [48] 71.87 ± 0.96 73.91 ± 0.72 68.74 ± 1.07 68.59 ± 0.67 63.44 ± 0.65
DiffPool [47] 66.95 ± 2.41 68.20 ± 2.02 62.32 ± 1.90 61.98 ± 1.98 60.60 ± 1.62
TopK [13] 75.01 ± 0.86 71.10 ± 0.90 67.02 ± 2.25 66.12 ± 1.60 61.46 ± 0.84
SAGPool [23] 76.45 ± 0.97 71.86 ± 0.97 67.45 ± 1.11 67.86 ± 1.41 61.73 ± 0.76
ASAP (Ours)
Table 2: Comparison of ASAP with previous global and hierarchical pooling methods. Average accuracy and standard deviation are reported for 20 random seeds. We observe that ASAP consistently outperforms all the baselines on all the datasets. Please refer to Sec. 7.1 for more details.

4.4 Maintaining Graph Connectivity

Following [47], once the clusters have been sampled, we find the new adjacency matrix $A^p$ for the pooled graph $G^p$ using $A^c$ and $\hat{S}$ in the following manner:

$$A^p = \hat{S}^{\top} A^c \hat{S}, \qquad (10)$$

where $A^c = A + I$ (self-loops ensure that clusters sharing a node become connected). Equivalently, we can see that $A^p_{u,v} = \sum_{k,l} \hat{S}_{k,u}\, A^c_{k,l}\, \hat{S}_{l,v}$. This formulation ensures that any two clusters $i$ and $j$ in $G^p$ are connected if there is any common node in the clusters $c(v_i)$ and $c(v_j)$, or if any of the constituent nodes in the clusters are neighbors in the original graph (Fig. 1(d)). Hence, the strength of the connection between clusters is determined by both the membership of the constituent nodes through $\hat{S}$ and the edge weights $A^c$. Note that $\hat{S}$ is a sparse matrix by formulation and hence the above operation can be implemented efficiently.
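The triple product of Eq. (10) can be sketched densely for clarity (in ASAP the assignment matrix is sparse, so this dense version is illustrative only):

```python
def pooled_adjacency(S_hat, A):
    """Recompute edges for the pooled graph: A^p = S_hat^T A S_hat (Eq. 10).

    S_hat: N x K pruned cluster-assignment matrix (dense lists here for
    clarity), A: N x N adjacency of the clustered graph. Two clusters get
    connected whenever their member nodes share an edge in A (or a node,
    if A includes self-loops).
    """
    n, k = len(S_hat), len(S_hat[0])
    Ap = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(k):
            Ap[i][j] = sum(S_hat[u][i] * A[u][v] * S_hat[v][j]
                           for u in range(n) for v in range(n))
    return Ap
```

With an identity assignment the original adjacency is recovered; two clusters that both contain the endpoints of an edge accumulate that edge's weight.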

5 Theoretical Analysis

5.1 Limitations of using GCN for scoring clusters

GCN from Eq. (1) can be viewed as an operator which first computes a pre-score for each node, i.e., $\phi'_i = x_i W$, followed by a weighted average over neighbors and a non-linearity. If for some node the pre-score is very high, it can increase the scores of its neighbors, which inherently biases the pooling operator to select clusters in a local neighborhood instead of sampling clusters which represent the whole graph. Thus, selecting the clusters which correspond to local extremas of the pre-score function would potentially allow us to sample representative clusters from all parts of the graph.

Theorem 1.

Let $G$ be a graph with positive adjacency matrix $A$, i.e., $A_{i,j} \ge 0$. Consider any function $f$ which depends on the difference between a node and its neighbors after a linear transformation $W$, e.g.:

$$f(X)_i = \alpha_i \sum_{j \in \mathcal{N}(i)} \beta_{i,j}\,(x_i W - x_j W),$$

where $\alpha_i, \beta_{i,j} \in \mathbb{R}$ and $W \in \mathbb{R}^{d \times 1}$.

  1. If fitness value $\phi = \mathrm{GCN}(X, A)$, then $\phi$ cannot learn $f$.

  2. If fitness value $\phi = \mathrm{LEConv}(X, A)$, then $\phi$ can learn $f$.


See Appendix Sec. F for proof. ∎

Motivated by the above analysis, we propose to use LEConv (Eq. 8) for scoring clusters. LEConv can learn to score clusters by considering both its global and local importance through the use of self-loops and ability to learn functions of local extremas.

5.2 Graph Connectivity

Here, we analyze ASAP from the aspect of edge connectivity in the pooled graph. When considering an $h$-hop neighborhood for clustering, both ASAP and DiffPool have $RF^{edge} = 2h + 1$ because they use Eq. (10) to define the edge connectivity. On the other hand, both TopK and SAGPool have $RF^{edge} = 1$, since they sample edges directly from the original graph. A larger edge receptive field implies that the pooled graph has better connectivity, which is important for the flow of information in the subsequent GCN layers.

Theorem 2.

Let the input graph $G$ be a tree of any possible structure with $N$ nodes. Let $k_{min}$ be the lower bound on the sampling ratio $k$ to ensure the existence of at least one edge in the pooled graph irrespective of the structure of $G$ and the location of the selected nodes. For TopK or SAGPool, $k_{min} \to 1$, whereas for ASAP, $k_{min} \to 0.5$, as $N \to \infty$.


See Appendix Sec. G for proof. ∎

Theorem 2 suggests that ASAP can achieve a similar degree of connectivity as SAGPool or TopK for a much smaller sampling ratio $k$. For a tree with no prior information about its structure, ASAP would need to sample only half of the clusters, whereas TopK and SAGPool would need to sample almost all the nodes, making them inefficient for such graphs. In general, independent of the combination of nodes selected, ASAP will have better connectivity due to its larger edge receptive field. Please refer to Appendix Sec. G for a similar analysis on the path graph and more details.
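The intuition behind the theorem can be checked on a small example. The two helpers below (our own illustrative names) contrast edge-sampling connectivity, where a pooled edge survives only if both endpoints are selected, with ASAP-style connectivity for 1-hop clusters, where two selected clusters are connected if their member sets overlap or share an original edge:

```python
def topk_connected(edges, selected):
    """TopK/SAGPool edge sampling: a pooled edge exists only if both
    endpoints of some original edge are selected."""
    return any(u in selected and v in selected for (u, v) in edges)

def asap_connected(edges, adj, selected):
    """ASAP-style sketch (h = 1): each selected medoid carries its 1-hop
    cluster; clusters connect if members overlap or are adjacent."""
    cluster = {v: {v} | adj[v] for v in selected}
    sel = sorted(selected)
    for a in range(len(sel)):
        for b in range(a + 1, len(sel)):
            ca, cb = cluster[sel[a]], cluster[sel[b]]
            if ca & cb or any((u in ca and v in cb) or (u in cb and v in ca)
                              for (u, v) in edges):
                return True
    return False
```

On the path 0-1-2-3-4 with nodes 0 and 2 selected, edge sampling yields an edgeless pooled graph while the cluster formulation keeps the two clusters connected through their shared node 1.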

5.3 Graph Permutation Equivariance

Proposition 1.

ASAP is a graph permutation equivariant pooling operator.


See Appendix Sec. H for proof. ∎

6 Experimental Setup

In our experiments, we use graph classification benchmarks and compare ASAP with multiple pooling methods. Below, we describe the statistics of the dataset, the baselines used for comparisons and our evaluation setup in detail.

6.1 Datasets

We demonstrate the effectiveness of our approach on 5 graph classification datasets. D&D [34, 10] and PROTEINS [10, 3] are datasets containing proteins as graphs. NCI1 [42] and NCI109 are datasets for anticancer activity classification. FRANKENSTEIN [29] contains molecules as graphs for mutagen classification. Please refer to Table 3 for the dataset statistics.

Dataset #Graphs #Classes Avg #Nodes Avg #Edges
D&D 1178 2 284.32 715.66
PROTEINS 1113 2 39.06 72.82
NCI1 4110 2 29.87 32.30
NCI109 4127 2 29.68 32.13
FRANKENSTEIN 4337 2 16.90 17.88
Table 3: Statistics of the graph datasets. The columns denote the number of graphs, the number of classes, the average number of nodes and the average number of edges respectively.

6.2 Baselines

We compare ASAP with previous state-of-the-art hierarchical pooling operators DiffPool [47], TopK [13] and SAGPool [23]. For comparison with global pooling, we choose Set2Set [41], Global-Attention [24] and SortPool [48].

6.3 Training & Evaluation Setup

We use a similar architecture as defined in [6, 23], which is depicted in Fig. 1(f). For ASAP, we choose $k = 0.5$ and $h = 1$ to be consistent with the baselines. Following SAGPool [23], we conduct our experiments using 10-fold cross-validation and report the average accuracy over 20 random seeds.

Aggregation type FITNESS CLUSTER
None – –
Only cluster – ✓
Both ✓ ✓
Table 4: Different aggregation types as mentioned in Sec 7.2.

7 Results

In this section, we attempt to answer the following questions:


How does ASAP perform compared to other pooling methods at the task of graph classification? (Sec. 7.1)


Is cluster formation by M2T attention based node aggregation beneficial during pooling? (Sec. 7.3)


Is LEConv better suited as cluster fitness scoring function compared to vanilla GCN? (Sec. 7.4)


How helpful is the computation of inter-cluster soft edge weights instead of sampling edges from the input graph? (Sec. 7.5)

7.1 Performance Comparison

We compare the performance of ASAP with baseline methods on the graph classification tasks. The results are shown in Table 2. All the numbers for hierarchical pooling (DiffPool, TopK and SAGPool) are taken from [23]. For global pooling (Set2Set, Global-Attention and SortPool), we modify the architectural setup to make them comparable with the hierarchical variants. We observe that ASAP consistently outperforms all the baselines on all datasets. We note that ASAP has an average improvement of 4% over the previous state-of-the-art hierarchical method (SAGPool) and also consistently improves over the best global pooling method (SortPool). We also observe that, compared to other hierarchical methods, ASAP has a smaller variance in performance, which suggests that the training of ASAP is more stable.

7.2 Effect of Node Aggregation

Here, we evaluate the improvement in performance due to our proposed technique of aggregating nodes to form a cluster. There are two aspects involved during the creation of clusters for a pooled graph:

  • FITNESS: calculating fitness scores for individual nodes. Scores can be calculated either by using only the medoid or by aggregating neighborhood information.

  • CLUSTER: generating a representation for the new cluster node. Cluster representation can either be the medoid’s representation or some feature aggregation of the neighborhood around the medoid.

We test three types of aggregation methods: ’None’, ’Only cluster’ and ’Both’ as described in Table 4. As shown in Table 5, we observe that our proposed node aggregation helps improve the performance of ASAP.

Aggregation type FRANKENSTEIN NCI1
None 67.4 ± 0.6 69.9 ± 2.5
Only cluster 67.5 ± 0.5 70.6 ± 1.8
Table 5: Performance comparison of different aggregation methods on validation data of FRANKENSTEIN and NCI1.
Attention framework FRANKENSTEIN NCI1
T2T 67.6 ± 0.5 70.3 ± 2.0
S2T 67.7 ± 0.5 69.9 ± 2.0
Table 6: Effect of different attention frameworks on pooling, evaluated on validation data of FRANKENSTEIN and NCI1. Please refer to Sec. 7.3 for more details.

7.3 Effect of M2T Attention

We compare our M2T attention framework with the previously proposed S2T and T2T attention techniques. The results are shown in Table 6. We find that M2T attention is indeed better than the rest on NCI1 and comparable on FRANKENSTEIN.

Fitness function FRANKENSTEIN NCI1
GCN 62.7 ± 0.3 65.4 ± 2.5
Basic-LEConv 63.1 ± 0.7 69.8 ± 1.9
LEConv 67.8 ± 0.6 70.7 ± 2.3
Table 7: Performance comparison of different fitness scoring functions on validation data of FRANKENSTEIN and NCI1. Refer to Sec. 7.4 for details.

7.4 Effect of LEConv as a fitness scoring function

In this section, we analyze the impact of LEConv as the fitness scoring function in ASAP. We use two baselines: GCN (Eq. (1)) and Basic-LEConv, which computes $\phi_i = \sigma\big(x_i W + \sum_{j \in \mathcal{N}(i)} A_{i,j}(x_i W - x_j W)\big)$ with a single learnable matrix $W$. In Table 7 we can see that Basic-LEConv and LEConv perform significantly better than GCN because of their ability to model functions of local extremas. Further, we observe that LEConv performs better than Basic-LEConv as it has three different linear transformations compared to only one in the latter. This allows LEConv to potentially learn more complicated scoring functions which are better suited for the final task. Hence, our analysis in Theorem 1 is empirically validated.
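The extremum-seeking behavior of the Basic-LEConv baseline, a single shared transformation applied to a node and its neighbor differences, can be sketched with scalar features, unit edge weights, and an identity activation (an illustrative sketch, not the trained layer):

```python
def basic_leconv_scores(adj, x, W=1.0):
    """Basic-LEConv sketch: phi_i = x_i*W + sum_{j in N(i)} (x_i*W - x_j*W).

    adj: dict node -> set of neighbors (unit edge weights), x: dict node ->
    scalar feature, identity activation. Local maxima get boosted and local
    minima get suppressed, so a subsequent top-k selection can pick out
    extrema spread across the whole graph instead of one hot neighborhood.
    """
    return {i: x[i] * W + sum(x[i] * W - x[j] * W for j in adj[i]) for i in adj}
```

On the path 0-1-2 with features (0, 1, 0), the local maximum at node 1 scores 3 while its neighbors score -1, whereas a positive-coefficient average (as in GCN) could never push a node's score below its neighborhood in this way.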

7.5 Effect of computing Soft edge weights

We evaluate the importance of calculating edge weights for the pooled graph as defined in Eq. 10. We use the best model configuration found from the above ablation analysis and then add the feature of computing soft edge weights for clusters. We observe a significant drop in performance when the edge weights are not computed. This demonstrates the necessity of capturing the edge information while pooling graphs.

Soft edge weights FRANKENSTEIN NCI1
Present 67.8 ± 0.6 70.7 ± 2.3
Table 8: Effect of calculating soft edge weights on pooling for validation data of FRANKENSTEIN and NCI1. Please refer to Sec. 7.5 for more details.

8 Discussion

8.1 Comparison with other pooling methods


DiffPool

DiffPool and ASAP both aggregate nodes to form a cluster. While ASAP only considers nodes which are within the $h$-hop neighborhood of a node (medoid) as a cluster, DiffPool considers the entire graph. As a result, in DiffPool, two nodes that are disconnected or far away in the graph can be assigned similar clusters if the nodes and their neighbors have similar features. Since this type of cluster formation is undesirable for a pooling operator [47], DiffPool utilizes an auxiliary link prediction objective during training to specifically prevent far away nodes from being clustered together. ASAP needs no such additional regularization because it ensures locality while clustering. DiffPool's soft cluster assignment matrix is calculated for all the nodes to all the clusters, making it a dense matrix. Calculating and storing this does not scale easily for large graphs. ASAP, due to the local clustering over the $h$-hop neighborhood, generates a sparse assignment matrix while retaining the hierarchical clustering properties of DiffPool. Further, for each pooling layer, DiffPool has to predetermine the number of clusters it needs to pick, which is fixed irrespective of the input graph size. Since ASAP selects the top $k$ fraction of clusters in the current graph, it inherently takes the size of the input graph into consideration.

TopK & SAGPool

While TopK completely ignores the graph structure during pooling, SAGPool modifies the TopK formulation by incorporating the graph structure through the use of a GCN network for computing node scores. To enforce sparsity, both TopK and SAGPool avoid computing the cluster assignment matrix that DiffPool proposed. Instead of grouping multiple nodes to form a cluster in the pooled graph, they drop nodes from the original graph based on a score [6], which might potentially lead to loss of node and edge information. Thus, they fail to leverage the overall graph structure while creating the clusters. In contrast to TopK and SAGPool, ASAP can capture the rich graph structure while aggregating nodes to form clusters in the pooled graph. TopK and SAGPool sample edges from the original graph to define the edge connectivity in the pooled graph. Therefore, they need to sample nodes from a local neighborhood to avoid isolated nodes in the pooled graph. Maintaining graph connectivity prevents these pooling operations from sampling representative nodes from the entire graph. The pooled graph in ASAP has better edge connectivity compared to TopK and SAGPool because soft edge weights are computed between clusters using up to three-hop connections in the original graph. Also, the use of LEConv instead of GCN for finding fitness values further allows ASAP to sample representative clusters from local neighborhoods over the entire graph.

8.2 Comparison of Self-Attention variants

Source2Token & Token2Token

T2T models the membership of a node by generating a query based only on the medoid of the cluster. Graph Attention Network (GAT) [40] is an example of T2T attention in graphs. S2T finds the importance of each node for a global task. As shown in Eq. 3, since a query vector is not used for calculating the attention scores, S2T inherently assigns the same membership score to a node for all the possible clusters that node can belong to. Hence, both S2T and T2T mechanisms fail to effectively utilize the intra-cluster information while calculating a node’s cluster membership. On the other hand, M2T uses a master function to generate a query vector which depends on all the entities within the cluster and hence is a more representative formulation. To understand this, consider the following scenario. If in a given cluster, a non-medoid node is removed, then the un-normalized membership scores for the rest of the nodes will remain unaffected in S2T and T2T framework whereas the change will reflect in the scores calculated using M2T mechanism. Also, from Table 6, we find that M2T performs better than S2T and T2T attention showing that M2T is better suited for global tasks like pooling.

9 Conclusion

In this paper, we introduce ASAP, a sparse and differentiable pooling method for graph structured data. ASAP clusters local subgraphs hierarchically, which helps it effectively learn the rich information present in the graph structure. We propose the Master2Token self-attention framework, which enables our model to better capture the membership of each node in a cluster. We also propose LEConv, a novel GNN formulation that scores the clusters based on their local and global importance. ASAP leverages LEConv to compute cluster fitness scores and samples the clusters based on them. This ensures the selection of representative clusters throughout the graph. ASAP also calculates sparse edge weights for the selected clusters and is able to capture the edge connectivity information efficiently while being scalable to large graphs. We validate the effectiveness of the components of ASAP both theoretically and empirically. Through extensive experiments, we demonstrate that ASAP achieves state-of-the-art performance on multiple graph classification datasets.

10 Acknowledgements

We would like to thank the developers of Pytorch_Geometric [12], which allows quick implementation of geometric deep learning models. We would like to thank Matthias Fey again for actively maintaining the library and quickly responding to our queries on GitHub.


Appendix A Hyperparameter Tuning

For all our experiments, the Adam [20] optimizer is used. 10-fold cross-validation is used, with 80% of the data for training and 10% each for validation and testing. Models were trained with learning rate decay applied at regular intervals. The ranges of the hyperparameter search are provided in Table 9. The model with the best validation accuracy was selected for testing. Our code is based on the Pytorch Geometric library [12].

Hyperparameter Range
Hidden dimension
Learning rate
L2 regularization
Table 9: Hyperparameter tuning Summary.

Appendix B Details of Hierarchical Pooling Setup

For hierarchical pooling, we follow SAGPool [23] and use three layers of GCN, each followed by a pooling layer. After each pooling step, the graph is summarized using a readout function which is a concatenation of the mean and max of the node representations (similar to SAGPool). The summaries are then added and passed through a network of fully-connected layers separated by dropout layers to predict the class.

Appendix C Details of Global Pooling Setup

The global pooling architecture is the same as the hierarchical architecture, with the only difference that pooling is done only after all GCN layers. We do not use a readout function for global pooling since it is not required. To be comparable with other models, we restrict the feature dimension of the pooling output to match that of the hierarchical models. For global pooling layers, the search ranges for the hidden dimension and learning rate were the same as for ASAP.

Method Range
Set2Set processing-step
Global-Attention transform
SortPool is chosen such that output of pooling
Table 10: Global Pooling Hyperparameter Tuning Summary.

Appendix D Similarities between pooling in CNN and ASAP

In CNN, pooling methods (e.g., mean pool and max pool) have two hyperparameters: kernel size and stride. Kernel size decides the number of pixels being considered for computing each new pixel value in the next layer. Stride decides the fraction of new pixels being sampled, thereby controlling the size of the image in the next layer. In ASAP, $h$ determines the neighborhood radius of clusters and $k$ decides the sampling ratio. This makes $h$ and $k$ analogous to the kernel size and stride of CNN pooling, respectively. There are, however, some key differences. In CNN, a given kernel size corresponds to a fixed number of pixels around a central pixel, whereas in ASAP, the number of nodes being considered is variable, although the neighborhood radius is constant. In CNN, stride samples uniformly from the new pixels, whereas in ASAP, the model has the flexibility to attend to different parts of the graph and sample accordingly.

Appendix E Ablation on pooling ratio k

Figure 2: Validation accuracy vs sampling ratio k on NCI1 dataset.

Intuitively, a higher k leads to more information retention, so we expect performance to increase with k. This is empirically observed in Fig. 2. However, as k increases, the computational resources required by the model also increase, because a relatively larger pooled graph gets propagated to the later layers. Hence, there is a trade-off between performance and computational requirements when deciding on the pooling ratio k.

Appendix F Proof of Theorem 1

Theorem 1. Let G be a graph with a positive adjacency matrix A, i.e., A_{i,j} ≥ 0. Consider any function f which depends on the difference between a node and its neighbors after a linear transformation Φ(x) = xW, e.g.:

f(X, A)_i = Σ_{j ∈ N(i)} A_{i,j} (Φ(x_i) − Φ(x_j)),

where W ∈ R^{d×1} and x_i is the feature vector of node i.

  1. If the fitness value is computed as φ = σ(GCN(X, A)), then φ cannot learn f.

  2. If the fitness value is computed as φ = LEConv(X, A), then φ can learn f.


For GCN, x_i' = σ( Σ_j Â_{i,j} x_j W ), where W is a learnable matrix and Â is the normalized adjacency matrix. Since Â_{i,j} ≥ 0, the output of GCN cannot contain a term of the form (x_i − x_j)W, which proves the first part of the theorem. We prove the second part by showing that LEConv can learn the following function f:

f(X, A)_i = Σ_{j ∈ N(i)} A_{i,j} (x_i W − x_j W).  (11)

The LEConv formulation is defined as:

x_i' = σ( x_i W_1 + Σ_{j ∈ N(i)} A_{i,j} (x_i W_2 − x_j W_3) ),  (12)

where W_1, W_2 and W_3 are learnable matrices. For W_1 = 0, W_2 = W_3 = W and σ the identity, Eq. (12) is equal to Eq. (11). ∎
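The reduction can be checked numerically (our own check, with scalar features for brevity; the function names are illustrative):

```python
# Verify that LEConv with W1 = 0, W2 = W3 = W and identity sigma
# reduces to f(X, A)_i = sum_j A_ij (x_i - x_j) W.

def leconv(x, A, W1, W2, W3):
    n = len(x)
    return [x[i] * W1 + sum(A[i][j] * (x[i] * W2 - x[j] * W3)
                            for j in range(n)) for i in range(n)]

def f(x, A, W):
    n = len(x)
    return [sum(A[i][j] * (x[i] - x[j]) * W for j in range(n))
            for i in range(n)]

A = [[0.0, 2.0, 0.0], [2.0, 0.0, 1.0], [0.0, 1.0, 0.0]]  # positive, weighted
x = [1.0, -0.5, 3.0]
W = 0.7
out_leconv, out_f = leconv(x, A, W1=0.0, W2=W, W3=W), f(x, A, W)
assert all(abs(a - b) < 1e-9 for a, b in zip(out_leconv, out_f))
```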

Appendix G Graph Connectivity

Proof of Theorem 2

Definition 1.

For a graph G, we define optimum-nodes N_h as the maximum number of nodes that can be selected which are at least h + 1 hops away from each other.

Definition 2.

For a given number of nodes N, we define optimum-tree T_N as the tree which has the maximum optimum-nodes N_h among all possible trees with N nodes.

Lemma 1.

Let T_N be an optimum-tree with N vertices and T_{N+1} be an optimum-tree with N + 1 vertices. The optimum-nodes of T_N and T_{N+1} differ by at most one, i.e., N_h(T_{N+1}) ≤ N_h(T_N) + 1.


Consider T_{N+1}, which has N + 1 nodes. We can remove any one of the leaf nodes in T_{N+1} to obtain a tree T' with N nodes. If one of the N_h(T_{N+1}) selected nodes was removed, then the count would become N_h(T_{N+1}) − 1. If any other node was removed then, being a leaf, it does not constitute the shortest path between any of the selected nodes. This implies that the optimum-nodes for T' is at least N_h(T_{N+1}) − 1, i.e.,

N_h(T') ≥ N_h(T_{N+1}) − 1.  (13)

Since T_N is the optimum-tree with N nodes, we know that:

N_h(T_N) ≥ N_h(T').  (14)

Using Eq. (13) and (14) we can write:

N_h(T_{N+1}) ≤ N_h(T_N) + 1,

which proves our lemma. ∎

Lemma 2.

Let T_N be an optimum-tree with N vertices and T_{N+1} be an optimum-tree with N + 1 vertices. Then T_N is an induced subgraph of T_{N+1}.


Let us choose a node to be removed from T_{N+1}, joining its neighboring nodes if necessary, to obtain a tree T' with N nodes, with the objective of ensuring a maximum N_h(T'). To do so, we can only remove a leaf node from T_{N+1}. This is because removing a non-leaf node can reduce the shortest path between multiple pairs of nodes, whereas removing a leaf node only shortens the paths involving that particular leaf. This ensures the least reduction in optimum-nodes for T'. Removing a leaf node implies that N_h(T') cannot be less than N_h(T_{N+1}) − 1, as it affects only the paths involving that leaf. Using Lemma 1, we see that N_h(T') is equal to N_h(T_N), i.e., T' is one of the possible optimum-trees with N nodes. Since T' was formed by removing a leaf node from T_{N+1}, we find that T' is indeed an induced subgraph of T_{N+1}. ∎

Figure 3: (a) Balanced Starlike tree with height 2. (b) Path Graph
Definition 3.

A starlike tree is a tree having at most one node (the root) with degree greater than two [43]. We consider a starlike tree of height H to be balanced if there is at most one leaf at a height less than H, while the rest are all at height H from the root. Figure 3(a) depicts an example of a balanced starlike tree with H = 2.

Definition 4.

A path graph is a graph whose nodes can be placed on a straight line. Exactly two nodes in a path graph have degree one (the endpoints), while the rest have degree two. Figure 3(b) shows an example of a path graph [44].

Lemma 3.

For a balanced starlike tree with height (h + 1)/2, where h + 1 is even, N_h equals the number of leaves, i.e., N_h is achieved when the leaves are selected.

Lemma 4.

Among all the possible trees which have N vertices, the maximum achievable N_h is ⌈2(N − 1)/(h + 1)⌉, which is obtained when the tree is a balanced starlike tree with height (h + 1)/2, where h + 1 is even.


To prove the lemma, we use induction on N. Here, the base case corresponds to a path graph with N = h + 2 nodes, a trivial case of a starlike tree, as it has only 2 nodes (the two endpoints) which are at least h + 1 hops apart. From the formula ⌈2(N − 1)/(h + 1)⌉, we get ⌈2(h + 1)/(h + 1)⌉ = 2, which verifies the base case.

For any N ≥ h + 2, let us assume that the lemma is true, i.e., a balanced starlike tree T_N with height (h + 1)/2 achieves the maximum N_h among all trees with N vertices. Consider T_{N+1} to be the optimum-tree for N + 1 nodes. From Lemma (2), we know that T_N is an induced subgraph of T_{N+1}. This means that T_{N+1} can be obtained by adding a node to T_N. Since we are constructing T_{N+1}, we need to add a node to T_N such that the maximum number of nodes which are at least h + 1 hops apart can be selected. There are three possible structures for the tree depending on the minimum height among all its branches: (a) the minimum height among all the branches is less than (h + 1)/2 − 1, (b) the minimum height among all the branches is equal to (h + 1)/2 − 1, and (c) the minimum height among all the branches is equal to (h + 1)/2. Although case (a) is not possible, as we assumed T_N to be a balanced starlike tree, we consider it for the sake of completeness. For case (a), no matter where we add the node, N_h will not increase. However, we should add the node to the leaf of the branch with the least height, as it will allow the new leaf of that branch to be chosen in case the number of nodes in the tree is increased to some N' such that the height of that branch becomes (h + 1)/2. For case (b), we should add the node to the leaf of the branch with the least height so that its height becomes (h + 1)/2 and the new leaf of that branch gets selected. For case (c), no matter where we add the node, N_h will not increase. Unlike case (a), we should add the new node to the root so as to start a new branch, whose leaf could be selected if that branch grows to a height of (h + 1)/2 in some larger tree. For all three cases, T_{N+1} is a balanced starlike tree, as the new node is either added to the leaf of a branch, if the minimum height of a leaf is less than (h + 1)/2, or to the root, if the minimum height of the branches is (h + 1)/2. Hence, by induction, the lemma is proved. ∎

Figure 4: Minimum sampling ratio k_min vs N for Path-Graph

Theorem 2. Let the input graph be a tree of any possible structure with N nodes. Let k_min be the lower bound on the sampling ratio that ensures the existence of at least one edge in the pooled graph, irrespective of the structure of the tree and the location of the selected nodes. For TopK or SAGPool, k_min → 1, whereas for ASAP, k_min → 0.5, as N → ∞.


From Lemmas (4) and (3), we know that among all possible trees with N vertices the maximum achievable N_h is ⌈2(N − 1)/(h + 1)⌉, where h is the largest hop distance at which two selected clusters still share an edge in the pooled graph. Using the pigeon-hole principle, we can show that if the number of sampled clusters is greater than ⌈2(N − 1)/(h + 1)⌉, then there will always be an edge in the pooled graph irrespective of the position of the selected clusters:

k > ⌈2(N − 1)/(h + 1)⌉ / N.  (15)

Let us consider a 1-hop neighborhood for pooling. TopK and SAGPool select individual nodes, so two selections share an edge only if they are at most h = 1 hop apart. Substituting h = 1 in Eq. (15) for TopK and SAGPool we get:

k > (N − 1)/N,

and as N → ∞ we obtain k_min = 1. For ASAP, two selected 1-hop clusters share an edge whenever their medoids are at most h = 3 hops apart. Substituting h = 3 in Eq. (15) for ASAP we get:

k > ⌈(N − 1)/2⌉ / N,

and as N → ∞ we obtain k_min = 0.5. ∎

Similar Analysis for Path Graph

Lemma 5.

For a path graph with N nodes, N_h = ⌈N/(h + 1)⌉.

Lemma 6.

Consider the input graph to be a path graph with N nodes. To ensure that a pooling operator with connectivity threshold h and sampling ratio k has at least one edge in the pooled graph, irrespective of the location of the selected clusters, we have the following inequality on k: k > ⌈N/(h + 1)⌉ / N.


From Lemma (5), we know that N_h = ⌈N/(h + 1)⌉. Using the pigeon-hole principle, we can show that for a pooling method with connectivity threshold h, if the number of sampled clusters is greater than N_h, then there will always be an edge in the pooled graph irrespective of the position of the selected clusters:

k · N > ⌈N/(h + 1)⌉.  (16)

From Eq. (16), we get k > ⌈N/(h + 1)⌉ / N, which completes the proof. ∎
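The path-graph count in Lemma 5 is small enough to verify exhaustively. The brute-force check below is our own (function names are illustrative): the largest set of nodes on a path of N nodes that are pairwise at least h + 1 hops apart has size ⌈N/(h + 1)⌉.

```python
# Brute-force verification of N_h = ceil(N / (h + 1)) on path graphs:
# selecting more clusters than this forces two of them within h hops,
# hence an edge in the pooled graph.
from itertools import combinations
from math import ceil

def max_spread_set(N, h):
    """Largest subset of {0..N-1} with pairwise distance > h on a path."""
    for r in range(N, 0, -1):
        for subset in combinations(range(N), r):
            if all(b - a > h for a, b in zip(subset, subset[1:])):
                return r
    return 0

for N in range(2, 12):
    for h in (1, 3):  # h = 1 (TopK/SAGPool) and h = 3 (ASAP, 1-hop clusters)
        assert max_spread_set(N, h) == ceil(N / (h + 1))
```

With h = 1 the bound tends to 1/2 of the nodes, with h = 3 to 1/4, matching the k_min limits of Theorem 3.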

Figure 5: Minimum sampling ratio k_min vs N for Path-Graph

Theorem 3. Consider the input graph to be a path graph with N nodes. Let k_min be the lower bound on the sampling ratio that ensures the existence of at least one edge in the pooled graph. For TopK or SAGPool, k_min → 0.5 as N → ∞, whereas for ASAP, k_min → 0.25 as N → ∞.


From Lemma (6), we get k_min = ⌈N/(h + 1)⌉ / N. Using h = 1 for TopK and SAGPool, as N → ∞ we get k_min = 0.5. Using h = 3 for ASAP, as N → ∞ we get k_min = 0.25. ∎

Graph Connectivity via Graph Power.

To minimize the possibility of nodes getting isolated in the pooled graph, TopK employs graph power, i.e., it pools using A^p instead of A. This helps in increasing the density of the graph before pooling. While using graph power, TopK can connect two selected nodes which are at most p hops away, whereas ASAP in this setting can connect clusters whose medoids are up to p + 2h hops apart in the original graph. Since p + 2h > p, ASAP will always have better connectivity for a given graph power.
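A small illustration of graph power (our own sketch, using boolean matrix products): squaring the adjacency matrix adds edges between nodes that are 2 hops apart, densifying the graph before pooling so selected nodes are less likely to end up isolated.

```python
# Graph power on a path 0-1-2-3: A + A^2 directly connects 2-hop pairs.

def matmul_bool(A, B):
    n = len(A)
    return [[any(A[i][k] and B[k][j] for k in range(n)) and i != j
             for j in range(n)] for i in range(n)]

A = [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 0]]
A2 = matmul_bool(A, A)                                   # exactly-2-hop pairs
A_pow = [[bool(A[i][j]) or A2[i][j] for j in range(4)] for i in range(4)]
print(A_pow[0][2])  # True: nodes 0 and 2 are now directly connected
print(A_pow[0][3])  # False: node 3 is still 3 hops away from node 0
```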

Appendix H Graph Permutation Equivariance

Given a permutation matrix P and a function f defined on a graph G with node feature matrix X and adjacency matrix A, graph permutation is defined as f(PX, PAP^T), node permutation as f(PX, A), and edge permutation as f(X, PAP^T). Graph pooling operations should produce pooled graphs which are isomorphic after graph permutation, i.e., they need to be graph permutation equivariant or invariant. We show that ASAP is graph permutation equivariant.

Proposition 1. ASAP is a graph permutation equivariant pooling operator.


Since the cluster assignment matrix S is computed by an attention mechanism which attends to all edges in the graph, under a graph permutation P we have:

S → P S P^T.  (17)

Selecting the top ⌈kN⌉ clusters, denoted by the index set i, changes the selection as:

i → P(i).  (18)

Using Eq. (18) and A^c = S^T A S, we can write:

A^p → (P A^c P^T)(i, i).  (19)

Since X^c → P X^c and φ → P φ, we get:

X^p → (P X̂^c)(i).  (20)

From Eq. (19) and Eq. (20), we see that a graph permutation does not change the output features; it only changes the order in which they are computed. The resulting pooled graph is therefore isomorphic to the pooled graph of the unpermuted input. ∎
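The key step, that permuting the input permutes the fitness scores identically, can be checked empirically. The sketch below is our own and uses a simplified LEConv-like score rather than the full ASAP operator:

```python
# Permuting (X, A) permutes the scores, so the same clusters are
# selected and the pooled graph is isomorphic to the unpermuted one.

def fitness(x, A):
    """Toy fitness: weighted sum of differences with neighbors (W = 1)."""
    n = len(x)
    return [sum(A[i][j] * (x[i] - x[j]) for j in range(n)) for i in range(n)]

def permute(x, A, perm):
    """perm[i] = original index placed at position i."""
    n = len(x)
    Px = [x[perm[i]] for i in range(n)]
    PA = [[A[perm[i]][perm[j]] for j in range(n)] for i in range(n)]
    return Px, PA

x = [1.0, 4.0, 2.0, 3.0]
A = [[0, 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 0]]
perm = [2, 0, 3, 1]
Px, PA = permute(x, A, perm)
# Scores of the permuted graph are exactly the permuted scores.
assert fitness(Px, PA) == [fitness(x, A)[perm[i]] for i in range(4)]
```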

Appendix I Pseudo Code

Algorithm 1 gives the pseudo code of ASAP. The working of Master2Token is explained in Algorithm 2.

Input :  Graph G(V, E); node features X; weighted adjacency matrix A; neighborhood size h; Master2Token attention function Master2Token; Local Extrema Convolution operator LEConv; pooling ratio k; Top-k selection operator TopK; non-linearity σ
Intermediate :  Clustered graph G^c with node features X^c and weighted adjacency matrix A^c; cluster assignment matrix S; cluster fitness vector φ
Output :  Pooled graph G^p with node features X^p and weighted adjacency matrix A^p
1 S, X^c ← Master2Token(X, A, h) ;
2 φ ← LEConv(X^c, A) ;
3 X̂^c ← φ ⊙ X^c ;
4 i ← TopK(φ, ⌈k|V|⌉) ;
5 X^p ← X̂^c(i) ;
6 A^c ← S^T A S ;
7 A^p ← A^c(i, i) ;
Algorithm 1 ASAP algorithm
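The pipeline of Algorithm 1 can be sketched end to end in plain Python. This is our own illustrative sketch, not the reference ASAP implementation: it uses scalar features, neighborhood averaging as a stand-in for Master2Token attention, and the LEConv-like score from the proof of Theorem 1.

```python
# Simplified ASAP-style pooling: cluster 1-hop neighborhoods, score
# clusters, keep the top ceil(k*n), and rebuild adjacency among them.
from math import ceil

def asap_pool(x, A, k):
    n = len(x)
    nbr = [[j for j in range(n) if A[i][j] or i == j] for i in range(n)]
    # 1. Cluster features: aggregate each medoid's 1-hop neighborhood.
    xc = [sum(x[j] for j in nbr[i]) / len(nbr[i]) for i in range(n)]
    # 2. Cluster fitness via a LEConv-like score.
    phi = [sum(A[i][j] * (xc[i] - xc[j]) for j in range(n)) for i in range(n)]
    # 3-4. Scale by fitness and keep the top ceil(k*n) clusters.
    idx = sorted(range(n), key=lambda i: phi[i], reverse=True)[:max(1, ceil(k * n))]
    xp = [phi[i] * xc[i] for i in idx]
    # 5-7. Pooled adjacency: kept clusters share an edge whenever any of
    #      their member nodes were connected in the original graph.
    Ap = [[1 if a != b and any(A[u][v] for u in nbr[a] for v in nbr[b]) else 0
           for b in idx] for a in idx]
    return xp, Ap

xp, Ap = asap_pool([1.0, 2.0, 3.0, 4.0],
                   [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]],
                   k=0.5)
print(len(xp), Ap[0][1])  # 2 clusters kept, connected in the pooled graph
```

Because edges are recomputed between cluster members rather than between the selected medoids alone, the pooled graph stays connected here even though the two kept medoids are not adjacent in the usual TopK sense.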


  1. Medoids are representatives of a cluster. They are similar to centroids but are strictly members of the cluster.
  2. Please refer to Appendix Sec. D for more details on similarity between pooling methods in CNN and ASAP.
  3. If the cluster representation is used as it is, then interchanging any two nodes in a cluster will not affect the final output, which is undesirable.
  4. Please refer to Appendix Sec. A for further details on hyperparameter tuning and Appendix Sec. E for ablation on .
  5. Source code for ASAP can be found at:
  6. Please refer to Appendix Sec. B for more details


  1. D. Bahdanau, K. Cho and Y. Bengio (2014) Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473. Cited by: §3.3.
  2. J. Bastings, I. Titov, W. Aziz, D. Marcheggiani and K. Simaan (2017) Graph convolutional encoders for syntax-aware neural machine translation. Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. External Links: Link, Document Cited by: §1.
  3. K. M. Borgwardt, C. S. Ong, S. Schönauer, S. Vishwanathan, A. J. Smola and H. Kriegel (2005) Protein function prediction via graph kernels. Bioinformatics. Cited by: §6.1.
  4. M. M. Bronstein, J. Bruna, Y. LeCun, A. Szlam and P. Vandergheynst (2017) Geometric deep learning: going beyond euclidean data. IEEE Signal Processing Magazine. Cited by: §2.1.
  5. J. Bruna, W. Zaremba, A. Szlam and Y. LeCun (2013) Spectral networks and locally connected networks on graphs. arXiv preprint arXiv:1312.6203. Cited by: §1, §2.1.
  6. C. Cangea, P. Veličković, N. Jovanović, T. Kipf and P. Liò (2018) Towards sparse hierarchical graph classifiers. arXiv preprint arXiv:1811.01287. Cited by: §1, §6.3, §8.1.
  7. J. Cheng, L. Dong and M. Lapata (2016) Long short-term memory-networks for machine reading. Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. External Links: Link, Document Cited by: §3.3.
  8. M. Defferrard, X. Bresson and P. Vandergheynst (2016) Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in neural information processing systems, Cited by: §2.1, §2.2.
  9. I. S. Dhillon, Y. Guan and B. Kulis (2007) Weighted graph cuts without eigenvectors a multilevel approach. IEEE transactions on pattern analysis and machine intelligence. Cited by: §2.2.
  10. P. D. Dobson and A. J. Doig (2003) Distinguishing enzyme structures from non-enzymes without alignments. Journal of molecular biology. Cited by: §6.1.
  11. M. Fey, J. Eric Lenssen, F. Weichert and H. Müller (2018) SplineCNN: fast geometric deep learning with continuous b-spline kernels. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.2.
  12. M. Fey and J. E. Lenssen (2019) Fast graph representation learning with pytorch geometric. arXiv preprint arXiv:1903.02428. Cited by: Appendix A, §10.
  13. H. Gao and S. Ji (2019) Graph u-nets. arXiv preprint arXiv:1905.05178. Cited by: §1, §1, §2.2, §4.3, Table 2, §6.2.
  14. J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals and G. E. Dahl (2017) Neural message passing for quantum chemistry. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, External Links: Link Cited by: §1, §2.1.
  15. W. Hamilton, Z. Ying and J. Leskovec (2017) Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, Cited by: §1.
  16. K. He, X. Zhang, S. Ren and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, Cited by: §1.
  17. M. Henaff, J. Bruna and Y. LeCun (2015) Deep convolutional networks on graph-structured data. arXiv preprint arXiv:1506.05163. Cited by: §1, §2.1.
  18. S. Kearnes, K. McCloskey, M. Berndl, V. Pande and P. Riley (2016-08) Molecular graph convolutions: moving beyond fingerprints. Journal of Computer-Aided Molecular Design 30 (8), pp. 595–608. External Links: ISSN 1573-4951, Link, Document Cited by: §1.
  19. Y. Kim (2014) Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882. Cited by: §1.
  20. D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: Appendix A.
  21. T. N. Kipf and M. Welling (2017) Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations (ICLR), Cited by: §1, §2.1, §3.2.
  22. A. Krizhevsky, I. Sutskever and G. E. Hinton (2012) ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, F. Pereira, C. J. C. Burges, L. Bottou and K. Q. Weinberger (Eds.), External Links: Link Cited by: §1.
  23. J. Lee, I. Lee and J. Kang (2019-09–15 Jun) Self-attention graph pooling. In Proceedings of the 36th International Conference on Machine Learning, Cited by: Appendix B, §1, §1, Table 2, §6.2, §6.3, §7.1.
  24. Y. Li, D. Tarlow, M. Brockschmidt and R. S. Zemel (2016) Gated graph sequence neural networks. CoRR abs/1511.05493. Cited by: §1, §2.2, Table 2, §6.2.
  25. Y. Ma, S. Wang, C. C. Aggarwal and J. Tang (2019) Graph convolutional networks with eigenpooling. arXiv preprint arXiv:1904.13107. Cited by: §2.2.
  26. D. Marcheggiani and I. Titov (2017) Encoding sentences with graph convolutional networks for semantic role labeling. Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. External Links: Link, Document Cited by: §1.
  27. F. Monti, D. Boscaini, J. Masci, E. Rodola, J. Svoboda and M. M. Bronstein (2017) Geometric deep learning on graphs and manifolds using mixture model cnns. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.1.
  28. C. Morris, M. Ritzert, M. Fey, W. L. Hamilton, J. E. Lenssen, G. Rattan and M. Grohe (2018) Weisfeiler and leman go neural: higher-order graph neural networks. External Links: 1810.02244 Cited by: §2.1.
  29. F. Orsini, P. Frasconi and L. De Raedt (2015) Graph invariant kernels. In Twenty-Fourth International Joint Conference on Artificial Intelligence, Cited by: §6.1.
  30. S. Sanyal, J. Balachandran, N. Yadati, A. Kumar, P. Rajagopalan, S. Sanyal and P. Talukdar (2018) MT-CGCNN: integrating crystal graph convolutional neural network with multitask learning for material property prediction. arXiv preprint arXiv:1811.05660. Cited by: §1.
  31. S. E. Schaeffer (2007) Graph clustering. Computer science review. Cited by: §4.1.
  32. M. Schlichtkrull, T. N. Kipf, P. Bloem, R. v. d. Berg, I. Titov and M. Welling (2017) Modeling relational data with graph convolutional networks. arXiv preprint arXiv:1703.06103. Cited by: §1.
  33. T. Shen, T. Zhou, G. Long, J. Jiang, S. Pan and C. Zhang (2018) Disan: directional self-attention network for rnn/cnn-free language understanding. In Thirty-Second AAAI Conference on Artificial Intelligence, Cited by: §3.3.
  34. N. Shervashidze, P. Schweitzer, E. J. v. Leeuwen, K. Mehlhorn and K. M. Borgwardt (2011) Weisfeiler-lehman graph kernels. Journal of Machine Learning Research. Cited by: §6.1.
  35. M. Simonovsky and N. Komodakis (2017) Dynamic edge-conditioned filters in convolutional neural networks on graphs. In Proceedings of the IEEE conference on computer vision and pattern recognition, Cited by: §2.2.
  36. S. Vashishth, S. S. Dasgupta, S. N. Ray and P. Talukdar (2018) Dating documents using graph convolution networks. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). External Links: Link, Document Cited by: §1.
  37. S. Vashishth, R. Joshi, S. S. Prayaga, C. Bhattacharyya and P. Talukdar (2018) Reside: improving distantly-supervised neural relation extraction using side information. arXiv preprint arXiv:1812.04361. Cited by: §1.
  38. S. Vashishth, S. Sanyal, V. Nitin and P. Talukdar (2019) Composition-based multi-relational graph convolutional networks. arXiv preprint arXiv:1911.03082. Cited by: §1.
  39. A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser and I. Polosukhin (2017) Attention is all you need. In Advances in neural information processing systems, Cited by: §3.3.
  40. P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lio and Y. Bengio (2017) Graph attention networks. arXiv preprint arXiv:1710.10903. Cited by: §1, §8.2.
  41. O. Vinyals, S. Bengio and M. Kudlur (2016) Order matters: sequence to sequence for sets. In International Conference on Learning Representations (ICLR), External Links: Link Cited by: §1, §2.2, Table 2, §6.2.
  42. N. Wale, I. A. Watson and G. Karypis (2008) Comparison of descriptor spaces for chemical compound retrieval and classification. Knowledge and Information Systems. Cited by: §6.1.
  43. Wikipedia contributors (2017) Starlike tree — Wikipedia, the free encyclopedia. Note: [Online; accessed 17-November-2019] Cited by: Definition 3.
  44. Wikipedia contributors (2019) Path graph — Wikipedia, the free encyclopedia. Note: [Online; accessed 17-November-2019] External Links: Link Cited by: Definition 4.
  45. T. Xie and J. C. Grossman (2018-04) Crystal graph convolutional neural networks for an accurate and interpretable prediction of material properties. Phys. Rev. Lett. 120, pp. 145301. External Links: Document, Link Cited by: §1.
  46. K. Xu, W. Hu, J. Leskovec and S. Jegelka (2018) How powerful are graph neural networks?. External Links: 1810.00826 Cited by: §2.1.
  47. R. Ying, J. You, C. Morris, X. Ren, W. L. Hamilton and J. Leskovec (2018) Hierarchical graph representation learning with differentiable pooling. In Proceedings of the 32Nd International Conference on Neural Information Processing Systems, External Links: Link Cited by: §1, §1, §2.2, §4.4, Table 2, §6.2, §8.1.
  48. M. Zhang, Z. Cui, M. Neumann and Y. Chen (2018) An end-to-end deep learning architecture for graph classification. In Thirty-Second AAAI Conference on Artificial Intelligence, Cited by: §1, §2.2, Table 2, §6.2.