Hierarchical Graph Pooling with Structure Learning
Abstract
Graph Neural Networks (GNNs), which generalize deep neural networks to graph-structured data, have drawn considerable attention and achieved state-of-the-art performance in numerous graph-related tasks. However, existing GNN models mainly focus on designing graph convolution operations. The graph pooling (or downsampling) operations, which play an important role in learning hierarchical representations, are usually overlooked. In this paper, we propose a novel graph pooling operator, called Hierarchical Graph Pooling with Structure Learning (HGPSL), which can be integrated into various graph neural network architectures. HGPSL incorporates graph pooling and structure learning into a unified module to generate hierarchical representations of graphs. More specifically, the graph pooling operation adaptively selects a subset of nodes to form an induced subgraph for the subsequent layers. To preserve the integrity of the graph’s topological information, we further introduce a structure learning mechanism to learn a refined graph structure for the pooled graph at each layer. By combining the HGPSL operator with graph neural networks, we perform graph-level representation learning with a focus on the graph classification task. Experimental results on six widely used benchmarks demonstrate the effectiveness of our proposed model.
Introduction
Deep neural networks with convolution and pooling layers have achieved great success in various challenging tasks, ranging from computer vision [14] and natural language understanding [1] to video processing [17]. The data in these tasks are typically represented in Euclidean space (i.e., modeled as 2D or 3D tensors), and thus usually contain locality and order information for the convolution operations [5]. However, in many real-world problems, a large amount of data, such as social networks, chemical molecules and biological networks, lies on non-Euclidean domains that can be naturally represented as graphs. Given the powerful capabilities of neural networks, it is quite appealing to generalize the convolution and pooling operations to graph-structured data.
Recently, there have been a myriad of attempts to generalize the convolution operations to arbitrary graphs, referred to as graph neural networks (GNNs for short). In general, these algorithms can be classified into two broad categories: spectral and spatial approaches. The spectral methods typically define the graph convolution operations based on the graph Fourier transform [4, 5, 19]. The spatial methods devise the graph convolution operations by aggregating node representations directly from each node’s neighborhood [13, 24, 30, 25]. The majority of the aforementioned methods mainly involve transforming, propagating and aggregating node features across the graph, which fits the message passing scheme [12]. GNNs have been applied to different types of graphs [30, 6], and have obtained outstanding performance in numerous graph-related tasks, including node classification [19], link prediction [27, 37] and recommendation [34].
Nevertheless, the pooling operations in graphs have not been extensively studied yet, though they play a pivotal part in learning hierarchical representations for the task of graph classification [35]. The goal of graph classification is to predict the label associated with the entire graph by utilizing its node features and graph structure information, i.e., a graph-level representation is needed. GNNs were originally designed to learn meaningful node-level representations, thus a commonly adopted approach to generate a graph-level representation is to globally summarize all the node representations in the graph. Although workable, the graph-level representation generated in this way is inherently “flat”, since the entire graph structure information is neglected during this process. Furthermore, GNNs can only pass messages between nodes through edges, but cannot aggregate node information in a hierarchical way. Meanwhile, graphs often have different substructures and nodes play different roles, therefore they should contribute differently to the graph-level representation. For example, in protein-protein interaction graphs, certain substructures may represent specific functionalities, which are of great significance for predicting the characteristics of the whole graph. To capture both the graph’s local and global structure information, a hierarchical pooling process is needed.
There exists some very recent work that focuses on the hierarchical pooling procedure in GNNs [35, 11, 8, 10]. These models usually coarsen the graphs through grouping or sampling nodes into subgraphs level by level, thus the entire graph information is gradually reduced to the hierarchical induced subgraphs. However, the graph pooling operations still have room for improvement. In node grouping approaches, the hierarchical pooling methods [35, 8] suffer from high computational complexity, which require additional neural networks to downsize the nodes. In node sampling approaches, the generated induced subgraph [11, 20] might fail to preserve the key substructures and eventually lose the completeness of graph topological information. For instance, two nodes that are not directly connected but sharing many common neighbors in the original graph might become unreachable from each other in the induced subgraph, even if intuitively they ought to be “close” in the subgraph. Therefore, the distorted graph structure will hinder the message passing in subsequent layers.
To address the aforementioned limitations, we propose a novel graph pooling operator HGPSL to learn hierarchical graph-level representations. Specifically, HGPSL first adaptively selects a subset of nodes according to our defined node information score, which fully utilizes both the node features and the graph topological information. In addition, the proposed graph pooling operation is a nonparametric step, therefore no additional parameters need to be optimized during this procedure. Then, we apply a structure learning mechanism with sparse attention [23] to the pooled graph, aiming to learn a refined graph structure that preserves the key substructures in the original graph. We integrate the pooling operator into a graph convolutional neural network to perform graph classification, and the whole procedure can be optimized in an end-to-end manner. To summarize, the main contributions of this paper are as follows:

We introduce a novel graph pooling operator HGPSL that can be integrated into various graph neural network architectures. Similarly to the pooling operations in convolutional neural networks, our proposed graph pooling operation is nonparametric and very easy to implement. (Note that the pooling process itself is nonparametric; however, the structure learning mechanism does have an attention parameter. Thus, the overall HGPSL operator is not nonparametric.)

To the best of our knowledge, we are the first to design a structure learning mechanism for the pooled graph, which has the advantage of learning a refined graph structure to preserve the graph’s key substructures.

We conduct extensive experiments on six public datasets to demonstrate HGPSL’s effectiveness as well as its superiority compared to a range of state-of-the-art methods.
Related Work
Graph Neural Networks
GNNs can be generally categorized into two branches: spectral and spatial approaches. The spectral methods typically define parameterized filters according to graph spectral theory. [4] first proposed to define convolution operations for graphs in the Fourier transform domain. Due to its heavy computation cost, this approach has difficulty scaling to large graphs. Later on, [5] improved its efficiency by approximating the K-polynomial filters through Chebyshev expansion. GCN [19] further simplified ChebNet by truncating the Chebyshev polynomial to the first-order approximation of the localized spectral filters.
The spatial approaches design convolution operations by directly aggregating each node’s neighborhood information. Among them, GraphSAGE [13] proposed an inductive algorithm that can generalize to unseen nodes by aggregating their neighborhood content information. GAT [30] utilized an attention mechanism to aggregate nodes’ neighborhood representations with different weights. JKNet [33] leveraged flexible neighborhood ranges to enable better node representations. More details can be found in several comprehensive surveys on graph neural networks [39, 38, 32]. Nevertheless, the above-mentioned two branches of GNNs are mainly designed for learning meaningful node representations, and are unable to generate hierarchical graph representations due to the lack of pooling operations.
Graph Pooling
Pooling operations in GNNs can scale down the size of inputs and enlarge the receptive fields, thus giving rise to better generalization and performance. DiffPool [35] proposed to softly assign nodes to a set of clusters using neural networks, which forms a dense cluster assignment matrix and is computationally expensive. gPool [11] and SAGPool [20] devised a top-K node selection procedure to form an induced subgraph for the next input layer. Though efficient, this procedure might lose the completeness of the graph structure information and result in isolated subgraphs, which will hamper the message passing process in subsequent layers. EdgePool [8] designed a pooling operation that contracts the edges in the graph, but its flexibility is poor because it always pools roughly half of the total nodes. EigenPool [22] introduced a pooling operator based on the graph Fourier transform, which controls the pooling ratio through spectral clustering and is also very time-consuming.
In addition, there are also some approaches that perform global pooling. For instance, Set2Set [31] implemented the global pooling operation by aggregating information through LSTMs [15]. DGCNN [36] pooled the graph according to the last channel of the feature map values, which are sorted in descending order. Graph topology-based pooling operations are proposed in [5] and [26] as well, where the Graclus method [7] is employed as a pooling module.
The Proposed Model
Notations and Problem Formulation
Given a set of graph data $\mathcal{G} = \{G_1, G_2, \dots, G_n\}$, the number of nodes and edges in each graph might be quite different. For an arbitrary graph $G_i = (\mathcal{V}_i, \mathcal{E}_i, X_i)$, let $n_i$ and $e_i$ denote the number of nodes and edges, respectively. Let $A_i \in \mathbb{R}^{n_i \times n_i}$ be the adjacency matrix describing its edge connection information, and let $X_i \in \mathbb{R}^{n_i \times f}$ represent the node feature matrix, where $f$ is the dimension of node attributes. The label matrix $Y \in \mathbb{R}^{n \times c}$, with $c$ the number of classes, indicates the associated labels for each graph, i.e., if $G_i$ belongs to class $j$, then $Y_{ij} = 1$, otherwise $Y_{ij} = 0$. Since the graph structure and node numbers change between layers due to the graph pooling operation, we further represent the $i$-th graph fed into the $k$-th layer as $G_i^k$ with $n_i^k$ nodes. The adjacency matrix and hidden representation matrix are then denoted as $A_i^k \in \mathbb{R}^{n_i^k \times n_i^k}$ and $H_i^k \in \mathbb{R}^{n_i^k \times d}$. With the above notations, we formally define our problem as follows:
Input: Given a set of graphs $\mathcal{G}_L$ with its label information $Y_L$, the number of graph neural network layers $K$, pooling ratio $r$, and representation dimension $d$ in each layer.
Output: Our goal is to predict the unknown graph labels of $\mathcal{G} \setminus \mathcal{G}_L$ with a graph neural network in an end-to-end way.
Graph Convolutional Neural Network
Graph convolutional neural network (or GCN) [19] has been shown to be very efficient and has achieved promising performance in various challenging tasks. Thus, we choose GCN as our model’s building block and briefly review its mechanism in this subsection. Please note that our proposed HGPSL operator can also be integrated into other graph neural network architectures like GraphSAGE [13] and GAT [30]. We will discuss this in the experiment section. The $k$-th layer in GCN takes graph $G_i$’s adjacency matrix $A_i^k$ and hidden representation matrix $H_i^k$ as input, and the next layer’s output is generated as follows:
$$H_i^{k+1} = \sigma\left(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} H_i^k W^k\right) \tag{1}$$
where $\sigma(\cdot)$ is the nonlinear activation function and $H_i^0 = X_i$ (the node feature matrix), $\tilde{A} = A_i^k + I$ is the adjacency matrix with self-connections. $\tilde{D}$ is the diagonal degree matrix of $\tilde{A}$, and $W^k \in \mathbb{R}^{d_k \times d_{k+1}}$ is a trainable weight matrix. For the ease of parameter tuning, we set the output dimension $d_{k+1} = d$ for all layers.
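The propagation rule reviewed above can be sketched in a few lines. This is only an illustrative, dependency-free version: it assumes the standard GCN rule (symmetric normalization of the adjacency matrix with self-connections) and ReLU as the activation, and the helper names `matmul` and `gcn_layer` are hypothetical rather than taken from the released code.

```python
import math

def matmul(a, b):
    """Multiply two dense matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def gcn_layer(adj, h, w):
    """One GCN step: ReLU(D~^-1/2 (A + I) D~^-1/2 H W). ReLU is an assumption."""
    n = len(adj)
    # Add self-connections: A~ = A + I.
    a_tilde = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    # Degree of each node in A~ (at least 1 because of the self-loop).
    deg = [sum(row) for row in a_tilde]
    # Symmetric normalization: entry (i, j) is divided by sqrt(d_i * d_j).
    a_hat = [[a_tilde[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]
    out = matmul(matmul(a_hat, h), w)
    return [[max(0.0, v) for v in row] for row in out]

# Toy path graph 0 - 1 - 2 with 2-dimensional node features.
adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
h = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w = [[1.0, 0.0], [0.0, 1.0]]  # identity weights, so the output is the smoothed features
h_next = gcn_layer(adj, h, w)
```

With identity weights the layer only mixes each node's features with those of its neighbors, which makes the effect of the normalization easy to inspect by hand.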
The Overall Neural Network Architecture
Figure 1 provides an overview of our proposed Hierarchical Graph Pooling with Structure Learning (HGPSL) combined with a graph neural network, where graph pooling operations are added between graph convolution operations. The proposed HGPSL operator is composed of two major components: 1) graph pooling, which preserves a subset of informative nodes and forms a smaller induced subgraph; and 2) structure learning, which learns a refined graph structure for the pooled subgraph. The advantage of our proposed structure learning lies in its capability to preserve the essential graph structure information, which will facilitate the message passing procedure. As in this illustrative example, the pooled subgraph might contain isolated nodes that intuitively ought to be connected, which would hinder the information propagation in subsequent layers, especially when aggregating information from neighboring nodes. The whole architecture is a stack of convolution and pooling operations, thus making it possible to learn graph representations in a hierarchical way. Then, a readout function is utilized to summarize node representations in each level, and the final graph-level representation is the sum of the different levels’ summarizations. At last, the graph-level representation is fed into a Multi-Layer Perceptron (MLP) with a softmax layer to perform the graph classification task. In what follows, we give the details of the graph pooling and structure learning layers.
Graph Pooling Operation
In this subsection, we introduce our proposed graph pooling operation to enable downsampling on graph data. Inspired by [11, 20], the pooling operation identifies a subset of informative nodes to form a new but smaller graph. Here, we design a nonparametric pooling operation, which can fully utilize both the node features and graph structure information.
The key of our proposed graph pooling operation is to define a criterion that guides the node selection procedure. We therefore introduce a criterion named node information score to evaluate the information that each node contains given its neighborhood. In general, if a node’s representation can be well reconstructed by its neighborhood, it means this node can probably be removed in the pooled graph with negligible information loss. Here, we formally define the node information score as the Manhattan distance between the node representation itself and the one constructed from its neighbors:
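$$\mathbf{p} = \gamma(G_i^k) = \left\| \left(I_i^k - (D_i^k)^{-1} A_i^k\right) H_i^k \right\|_1 \tag{2}$$
where $A_i^k$ and $H_i^k$ are the adjacency and node representation matrices, respectively. $\|\cdot\|_1$ denotes the $\ell_1$ norm, taken row-wise. $D_i^k$ represents the diagonal degree matrix of $A_i^k$, and $I_i^k$ is the identity matrix. Therefore, $\mathbf{p} \in \mathbb{R}^{n_i^k}$ encodes the information score of each node in the graph.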
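As a minimal sketch of the node information score, the snippet below measures, for each node, the Manhattan distance between its own representation and the degree-normalized average of its neighbors' representations. It assumes the score has the form of a row-wise $\ell_1$ norm of $(I - D^{-1}A)H$; the function name `node_information_score` and the treatment of isolated nodes (reconstruction set to the zero vector) are assumptions made for illustration.

```python
def node_information_score(adj, h):
    """Row-wise L1 norm of (I - D^-1 A) H: how poorly neighbors reconstruct each node."""
    n = len(adj)
    scores = []
    for i, row in enumerate(adj):
        deg = sum(row)
        dim = len(h[i])
        if deg > 0:
            # Degree-normalized average of the neighbors' representations.
            recon = [sum(row[j] * h[j][c] for j in range(n)) / deg for c in range(dim)]
        else:
            # Assumption: an isolated node has an all-zero reconstruction.
            recon = [0.0] * dim
        scores.append(sum(abs(h[i][c] - recon[c]) for c in range(dim)))
    return scores

# Star graph: node 0 is the hub, nodes 1-3 are leaves.
adj = [[0, 1, 1, 1], [1, 0, 0, 0], [1, 0, 0, 0], [1, 0, 0, 0]]
h = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0], [4.0, 1.0]]
scores = node_information_score(adj, h)
```

In this toy graph, leaves 1 and 2 are reconstructed exactly by the hub and receive a score of zero, whereas leaf 3 differs most from its neighborhood and receives the largest score.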
After having obtained the node information score, we can now select the nodes that should be preserved by the pooling operator. To approximate the graph information, we choose to preserve the nodes that cannot be well represented by their neighbors, i.e., the nodes with relatively larger node information scores will be preserved in the construction of the pooled graph, because they can provide more information. Specifically, the graph nodes are first reordered based on the value of their node information scores, then we can simply select a subset of top-ranked nodes as follows:
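$$\mathrm{idx} = \text{top-rank}\left(\mathbf{p}, \lceil r \cdot n_i^k \rceil\right), \quad H_i^{k+1} = H_i^k(\mathrm{idx}, :), \quad A_i^{k+1} = A_i^k(\mathrm{idx}, \mathrm{idx}) \tag{3}$$
where $r$ is the pooling ratio and top-rank denotes the function that returns the indices of the top $n_i^{k+1} = \lceil r \cdot n_i^k \rceil$ values. $H_i^k(\mathrm{idx}, :)$ and $A_i^k(\mathrm{idx}, \mathrm{idx})$ perform the row or (and) column extraction to form the node representation matrix and adjacency matrix for the induced subgraph. Thus, $H_i^{k+1} \in \mathbb{R}^{n_i^{k+1} \times d}$ and $A_i^{k+1} \in \mathbb{R}^{n_i^{k+1} \times n_i^{k+1}}$ represent the node feature and graph structure information of the next layer $k+1$.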
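The selection step can be illustrated as follows. The sketch assumes that the number of kept nodes is the pooling ratio times the node count, rounded up, and that ties between equal scores are broken by node index; both choices, as well as the name `top_rank_pool`, are assumptions for illustration.

```python
import math

def top_rank_pool(scores, adj, h, ratio):
    """Keep the ceil(ratio * n) highest-scoring nodes and extract the induced subgraph."""
    n = len(scores)
    keep = math.ceil(ratio * n)
    # Indices of the `keep` largest scores; ties broken by node index (an assumption).
    idx = sorted(range(n), key=lambda i: (-scores[i], i))[:keep]
    h_pooled = [h[i] for i in idx]                        # row extraction
    adj_pooled = [[adj[i][j] for j in idx] for i in idx]  # row and column extraction
    return idx, h_pooled, adj_pooled

# Star graph with hub 0; the scores are example values for its four nodes.
adj = [[0, 1, 1, 1], [1, 0, 0, 0], [1, 0, 0, 0], [1, 0, 0, 0]]
h = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0], [4.0, 1.0]]
scores = [1.0, 0.0, 0.0, 3.0]
idx, h_pooled, adj_pooled = top_rank_pool(scores, adj, h, ratio=0.5)
```

Here the two most informative nodes (3 and 0) survive, and because they were adjacent in the original graph the induced subgraph keeps their edge.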
Structure Learning Mechanism
In this subsection, we present how our proposed structure learning mechanism learns a refined graph structure in the pooled graph. As illustrated in Figure 1, the pooling operation might result in highly related nodes being disconnected in the induced subgraph, which loses the completeness of the graph structure information and further hinders the message passing procedure. Meanwhile, a graph structure obtained from domain knowledge (e.g., a social network) or established by humans (e.g., a KNN graph) is usually non-optimal for the learning task in graph neural networks, due to lost or noisy information. To overcome this problem, [21] proposed to adaptively estimate the graph Laplacian using an approximate distance metric learning algorithm, which might lead to a locally optimal solution. [16] proposed learning the constructed graph structure for node label estimation; however, it generates a densely connected graph and is not applicable in our hierarchical graph-level representation learning scenario.
Here, we develop a novel structure learning layer, which learns a sparse graph structure through a sparse attention mechanism [23]. For graph $G_i$’s pooled subgraph $G_i^k$ at its $k$-th layer, we take its structure information $A_i^k \in \mathbb{R}^{n_i^k \times n_i^k}$ and hidden representations $H_i^k \in \mathbb{R}^{n_i^k \times d}$ as input. Our target is to learn a refined graph structure $S_i^k$ that encodes the underlying pairwise relationship between each pair of nodes. Formally, we utilize a single-layer neural network parameterized by a weight vector $\vec{a} \in \mathbb{R}^{1 \times 2d}$. Then, the similarity score between nodes $v_p$ and $v_q$ calculated by the attention mechanism can be expressed as:
$$E_i^k(p, q) = \sigma\left(\vec{a} \left[H_i^k(p, :) \,\|\, H_i^k(q, :)\right]^\top\right) + \lambda \cdot A_i^k(p, q) \tag{4}$$
where $\sigma(\cdot)$ is the activation function and $\|$ represents the concatenation operation. $H_i^k(p, :)$ and $H_i^k(q, :)$ indicate the $p$-th and $q$-th rows of matrix $H_i^k$, which denote the representations of nodes $v_p$ and $v_q$, respectively. Specifically, $A_i^k$ encodes the induced subgraph structure information, where $A_i^k(p, q) = 0$ if nodes $v_p$ and $v_q$ are not directly connected. We incorporate $A_i^k$ into our structure learning layer to bias the attention mechanism to give a relatively larger similarity score between directly connected nodes, and at the same time try to learn the underlying pairwise relationships between disconnected nodes. $\lambda$ is a trade-off parameter between them.
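To make the attention-based similarity concrete, the following sketch scores every node pair by applying a weight vector to the concatenated pair of representations and then adding a bias for directly connected nodes. It assumes ReLU as the activation; the weight values, the trade-off value `lam`, and the name `similarity_scores` are hypothetical and chosen only so the numbers are easy to verify by hand.

```python
def similarity_scores(h, adj, a, lam):
    """E[p][q] = ReLU(a . [h_p || h_q]) + lam * adj[p][q] for every node pair."""
    n = len(h)
    scores = [[0.0] * n for _ in range(n)]
    for p in range(n):
        for q in range(n):
            concat = h[p] + h[q]  # list concatenation acts as feature concatenation
            attn = max(0.0, sum(w * x for w, x in zip(a, concat)))  # ReLU is an assumption
            scores[p][q] = attn + lam * adj[p][q]
    return scores

h = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
adj = [[0, 1, 0], [1, 0, 0], [0, 0, 0]]  # node 2 became isolated after pooling
a = [0.5, -0.5, 0.5, 0.5]                # hypothetical attention weights (length 2d)
E = similarity_scores(h, adj, a, lam=1.0)
```

Note that the isolated node 2 still receives a nonzero score with node 0 from the attention term alone, which is how a missing edge could be recovered by the learned structure.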
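To make the similarity scores $E_i^k(p, q)$ easily comparable across different nodes, we could normalize them across nodes using the softmax function:
$$S_i^k(p, q) = \frac{\exp\left(E_i^k(p, q)\right)}{\sum_{m=1}^{n_i^k} \exp\left(E_i^k(p, m)\right)} \tag{5}$$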
However, the softmax transformation always has nonzero values and thus results in a dense, fully connected graph, which may introduce lots of noise into the learned structure. Hence, we propose to utilize the sparsemax function [23], which retains most of the important properties of the softmax function and in addition has the ability to produce sparse distributions. The $\mathrm{sparsemax}(\cdot)$ function aims to return the Euclidean projection of the input onto the probability simplex and can be formulated as follows:
$$S_i^k(p, q) = \mathrm{sparsemax}\left(E_i^k(p, q)\right) = \left[E_i^k(p, q) - \tau\left(E_i^k(p, :)\right)\right]_+ \tag{6}$$
where $[x]_+ = \max\{0, x\}$, and $\tau(\cdot)$ is the threshold function that returns a threshold according to the procedure shown in Algorithm 1. Thus, $\mathrm{sparsemax}(\cdot)$ preserves the values above the threshold and truncates the other values to zero, which yields a sparse graph structure. Similarly to the softmax function, $\mathrm{sparsemax}(\cdot)$ is also non-negative and sums to one, that is to say, $S_i^k(p, q) \geq 0$ and $\sum_{q=1}^{n_i^k} S_i^k(p, q) = 1$. The proof procedure is available in the supplemental material.
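The thresholding behaviour of sparsemax can be sketched with the commonly used sort-based computation of the threshold. Whether this matches Algorithm 1 step for step is an assumption, and the function below operates on a single row of similarity scores.

```python
def sparsemax(z):
    """Euclidean projection of the score vector z onto the probability simplex."""
    z_sorted = sorted(z, reverse=True)
    cumsum = 0.0
    k, support_sum = 0, 0.0
    for j, v in enumerate(z_sorted, start=1):
        cumsum += v
        # The j-th largest entry belongs to the support if it stays positive
        # after subtracting the threshold implied by the top-j entries.
        if 1.0 + j * v > cumsum:
            k, support_sum = j, cumsum
    tau = (support_sum - 1.0) / k  # threshold; k >= 1 always holds
    return [max(v - tau, 0.0) for v in z]

row = sparsemax([2.0, 1.5, 0.0])
```

For the scores `[2.0, 1.5, 0.0]` the threshold is 1.25, so the smallest score is truncated to exactly zero while the remaining two entries still sum to one. Softmax on the same input would keep all three entries positive.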
Improving Structure Learning Efficiency
For large-scale graphs, it is computationally expensive to calculate the similarities between each pair of nodes during the learning of the structure $S_i^k$. If we further take the graph’s localization and smoothness properties into consideration, it is reasonable to limit the calculation to each node’s $h$-hop neighborhood for a small $h$. Therefore, the computation cost of $S_i^k$ can be greatly reduced.
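One way to restrict the pairwise computation is to enumerate, for each node, only the nodes reachable within a fixed number of hops and to score those pairs alone. The breadth-first search below is a minimal sketch of that idea; the function name `h_hop_neighbors` is hypothetical, and which graph the neighborhood is taken from is left open here.

```python
from collections import deque

def h_hop_neighbors(adj, source, h):
    """Nodes reachable from `source` in at most h edges (breadth-first search)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if dist[u] == h:
            continue  # do not expand beyond h hops
        for v, connected in enumerate(adj[u]):
            if connected and v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return sorted(v for v in dist if v != source)

# Path graph 0 - 1 - 2 - 3 - 4.
adj = [
    [0, 1, 0, 0, 0],
    [1, 0, 1, 0, 0],
    [0, 1, 0, 1, 0],
    [0, 0, 1, 0, 1],
    [0, 0, 0, 1, 0],
]
candidates = h_hop_neighbors(adj, source=0, h=2)
```

Only the returned candidate nodes would then be paired with the source node when computing similarity scores, instead of all nodes in the graph.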
GCN and Graph Pooling Revisiting
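After having obtained the refined graph structure $S_i^k$, we conduct graph convolution and pooling operations in the following layers based on $H_i^k$ and $S_i^k$ (instead of $A_i^k$). Thus, Equation (1) can be simplified as follows:
$$H_i^{k+1} = \sigma\left(S_i^k H_i^k W^k\right) \tag{7}$$
Since the learned $S_i^k$ satisfies $\sum_{q=1}^{n_i^k} S_i^k(p, q) = 1$, the diagonal degree matrix $D_i^k$ has entries $D_i^k(p, p) = \sum_{q=1}^{n_i^k} S_i^k(p, q) = 1$ and therefore degenerates to the identity matrix $I_i^k$. Similarly, the calculation of the node information score in Equation (2) can also be simplified as below:
$$\mathbf{p} = \gamma(G_i^k) = \left\| \left(I_i^k - S_i^k\right) H_i^k \right\|_1 \tag{8}$$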
which makes our model very easy to implement.
The Readout Function and Output Layer
As we have demonstrated in Figure 1, the neural network architecture repeats the graph convolution and pooling operations several times, thus we would observe multiple subgraphs with different sizes in each level: $H_i^1, H_i^2, \dots, H_i^K$. To generate a fixed-size graph-level representation, we devise a readout function that aggregates all the node representations in the subgraph. Here, we simply use the concatenation of mean-pooling and max-pooling in each subgraph as follows:
$$r_i^k = \mathcal{R}(H_i^k) = \sigma\left(\frac{1}{n_i^k} \sum_{p=1}^{n_i^k} H_i^k(p, :) \,\Big\|\, \max_{p=1}^{n_i^k} H_i^k(p, :)\right) \tag{9}$$
where $\sigma(\cdot)$ is a nonlinear activation function, the maximum is taken element-wise over nodes, and $r_i^k \in \mathbb{R}^{2d}$. We then add the readout outputs of different levels to form our final graph-level representation (in our experiment, we use a fixed-size node representation across all layers, i.e., dimension $d$ at every level):
$$z_i = r_i^1 + r_i^2 + \cdots + r_i^K \tag{10}$$
which summarizes different levels’ graph representations.
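The readout and its summation across levels can be sketched as below. The example assumes ReLU as the nonlinearity and an element-wise maximum over nodes; the names `readout` and `graph_representation` are hypothetical.

```python
def readout(h):
    """Concatenate mean-pooling and max-pooling over the nodes of one pooled graph."""
    n, dim = len(h), len(h[0])
    mean = [sum(row[c] for row in h) / n for c in range(dim)]
    peak = [max(row[c] for row in h) for c in range(dim)]
    return [max(0.0, v) for v in mean + peak]  # ReLU as the nonlinearity (assumption)

def graph_representation(levels):
    """Add the fixed-size readouts of all levels element-wise."""
    readouts = [readout(h) for h in levels]
    return [sum(vals) for vals in zip(*readouts)]

levels = [
    [[1.0, 2.0], [3.0, 0.0], [2.0, 4.0], [2.0, 2.0]],  # level 1: 4 nodes
    [[3.0, 0.0], [2.0, 4.0]],                          # level 2: 2 nodes after pooling
]
z = graph_representation(levels)
```

Because every level yields a vector of the same length (twice the node representation dimension), the levels can be added even though they contain different numbers of nodes.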
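Finally, we feed the graph-level representation into an MLP layer with a softmax classifier, and the loss function is defined as the cross-entropy of predictions over the labels:
$$\hat{Y} = \mathrm{softmax}\left(\mathrm{MLP}(Z)\right), \qquad \mathcal{L} = -\sum_{i \in L} \sum_{j=1}^{c} Y_{ij} \log \hat{Y}_{ij} \tag{11}$$
where $Z$ stacks the graph-level representations $z_i$, $\hat{Y}_{ij}$ represents the predicted probability that graph $G_i$ belongs to class $j$ (of $c$ classes), and $Y_{ij}$ is the ground truth. $L$ denotes the training set of graphs that have labels.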
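A small numeric sketch of the loss follows. It assumes one-hot labels, so the inner sum over classes reduces to the negative log-probability of the true class, and it sums (rather than averages) over the labeled graphs; the inputs stand in for the MLP outputs and are made up for illustration.

```python
import math

def softmax(logits):
    """Convert raw class scores into probabilities."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits_per_graph, labels):
    """Sum of -log p(true class) over the labeled graphs."""
    loss = 0.0
    for logits, label in zip(logits_per_graph, labels):
        probs = softmax(logits)
        loss -= math.log(probs[label])
    return loss

# Two labeled graphs, two classes; equal scores give each class probability 0.5.
loss = cross_entropy([[0.0, 0.0], [0.0, 0.0]], labels=[0, 1])
```

With uniform predictions over two classes, each graph contributes $\log 2$ to the loss.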
Experiments and Analysis
Table 1: Statistics of the datasets.
Datasets      # Graphs  # Nodes  Avg. nodes  Avg. edges  # Classes
ENZYMES       600       19,580   32.63       62.14       6
PROTEINS      1,113     43,471   39.06       72.82       2
D&D           1,178     334,925  284.32      715.66      2
NCI1          4,110     122,747  29.87       32.30       2
NCI109        4,127     122,494  29.68       32.13       2
Mutagenicity  4,337     131,488  30.32       30.77       2
Table 2: Graph classification performance (mean ± standard deviation).
Categories  Baselines  ENZYMES       PROTEINS      D&D           NCI1          NCI109        Mutagenicity
Kernels     GRAPHLET
            SP
            WL
GNNs        GCN
            GraphSAGE
            GAT
Pooling     Set2Set
            DGCNN
            DiffPool
            EigenPool
            gPool
            SAGPool
            EdgePool
Proposed    HGPSL      68.79 ± 2.11  84.91 ± 1.62  80.96 ± 1.26  78.45 ± 0.77  80.67 ± 1.16  82.15 ± 0.58
Datasets
We adopt six commonly used public benchmarks for empirical studies (the benchmarks are publicly available at https://ls11www.cs.tudortmund.de/staff/morris/graphkerneldatasets). Statistics of the six datasets are summarized in Table 1, with more descriptions as follows: ENZYMES [3] is a dataset of protein tertiary structures, and each enzyme belongs to one of the 6 EC top-level classes. PROTEINS and D&D [9] are two protein graph datasets, where nodes represent the amino acids and two nodes are connected by an edge if they are less than 6 Angstroms apart. The label indicates whether or not a protein is a non-enzyme. NCI1 and NCI109 [28] are two biological datasets screened for activity against non-small cell lung cancer and ovarian cancer cell lines, where each graph is a chemical compound with nodes and edges representing atoms and chemical bonds, respectively. Mutagenicity [18] is a chemical compound dataset of drugs, which can be categorized into two classes: mutagen and non-mutagen.
Baselines
Graph Kernel Methods.
Graph Neural Networks.
Graph Pooling Models.
In this group, we further consider numerous models that combine GNNs with pooling operator for graph level representation learning. Set2Set [31] and DGCNN [36] are two novel global graph pooling algorithms. Another five hierarchical graph pooling models including DiffPool [35], gPool [11], SAGPool [20], EdgePool [8] and EigenPool [22] are also compared as baselines.
HGPSL Variants.
To further analyze the effectiveness of our proposed HGPSL operator, we consider four variants here: $\text{HGPSL}_{\text{NSL}}$ (No Structure Learning), which discards the structure learning layer to verify the effectiveness of our proposed structure learning module; $\text{HGPSL}_{\text{HOP}}$, which removes the structure learning layer and connects the nodes within $h$ hops; $\text{HGPSL}_{\text{DEN}}$ (DENse), which employs the structure learning layer to learn a dense graph structure with the softmax function defined in Equation (5); and HGPSL, which utilizes the sparsemax function defined in Equation (6) to learn a sparse graph structure. Both $\text{HGPSL}_{\text{DEN}}$ and HGPSL use the efficiency-improved structure learning strategy.
Experiment and Parameter Settings.
Following many previous works [35, 22], we randomly split each dataset into three parts: one as the training set, one as the validation set and the remainder as the test set. We repeat this random splitting process 10 times, and the average performance with standard deviation is reported. For baseline algorithms, we use the source code released by the authors, and their hyperparameters are tuned to be optimal based on the validation set. In order to ensure a fair comparison, the same neural network architectures are used for the existing pooling baselines and our proposed model. The dimension of node representations is set to 128 for all methods and datasets. We implement our proposed HGPSL with PyTorch, and the Adam optimizer is utilized to optimize the model. The learning rate, weight decay, pooling ratio $r$ and number of layers $K$ are selected by search over candidate values. The MLP consists of three fully connected layers with 256, 128 and 64 neurons, respectively, followed by a softmax classifier. An early stopping criterion is employed in the training process, i.e., we stop training if the validation loss does not decrease for 100 consecutive epochs. The source code is publicly available at https://github.com/cszhangzhen/HGPSL.
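The early stopping rule described above can be sketched as a simple patience counter. The sketch assumes that "does not decrease" is judged against the best validation loss seen so far (rather than against the previous epoch), and the function name `early_stopping_epoch` is hypothetical; the experiments use a patience of 100 epochs, while the toy call below uses 3 to keep the trace short.

```python
def early_stopping_epoch(val_losses, patience):
    """Return the 0-based epoch at which training stops, or None if it never triggers."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            # New best validation loss: reset the patience counter.
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None

val_losses = [1.0, 0.8, 0.9, 0.85, 0.81, 0.7]
stop_epoch = early_stopping_epoch(val_losses, patience=3)
```

With a patience of 3, training stops at epoch 4, before the later improvement at epoch 5 is ever observed, which is one reason a generous patience is a reasonable choice.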
Performance on Graph Classification
The classification performance is reported in Table 2. To summarize, we have the following observations:

First of all, a general observation we can draw from the results is that our proposed HGPSL consistently outperforms other state-of-the-art baselines on all datasets. For instance, our method achieves about a 3.08% improvement over the best baseline on the PROTEINS dataset, which is a 12.97% improvement over GCN with no hierarchical pooling mechanism. This verifies the necessity of adding a graph pooling module.

It is worth noting that the traditional graph kernel based methods demonstrate competitive performance. However, carefully designed graph kernels typically involve substantial human domain knowledge, and have difficulty generalizing to graphs with arbitrary structures. Furthermore, the two-stage procedure of extracting graph features and performing graph classification might result in suboptimal performance.

In particular, the global pooling approaches Set2Set and DGCNN are surpassed by most of the hierarchical pooling methods with a few exceptions. This is because their learned graph representations are still “flat”, and the hierarchical structure information or functional units in the graph are ignored, which play an important role in predicting the entire graph labels.

We note that the hierarchical pooling models achieve relatively better performance than most baselines, which further shows the effectiveness of the hierarchical pooling mechanism. Among them, gPool and SAGPool perform poorly on the ENZYMES dataset. This may be due to the limited training samples per class, which results in the neural network overfitting. EdgePool, which scales down the size of graphs by contracting each pair of nodes in the graph, achieves superior performance in this group of competitors. Our proposed HGPSL outperforms EdgePool, with different gains, in all settings.

Finally, HGPSL and its dense variant $\text{HGPSL}_{\text{DEN}}$ obtain better performance than the variants without structure learning, $\text{HGPSL}_{\text{NSL}}$ and $\text{HGPSL}_{\text{HOP}}$, which justifies the effectiveness of our proposed structure learning layer. Moreover, $\text{HGPSL}_{\text{HOP}}$ performs worse than HGPSL. This is because the disconnected nodes are still unreachable within its $h$ hops. HGPSL further outperforms $\text{HGPSL}_{\text{DEN}}$, which indicates that the learned dense graph structure might introduce additional noisy information and degrade the performance. Furthermore, in real-world scenarios, graphs usually have sparse topologies, thus our proposed HGPSL can learn more reasonable graph structures than $\text{HGPSL}_{\text{DEN}}$.
Ablation Study and Visualization
HGPSL Convolutional Neural Network Architectures.
As mentioned in previous sections, our proposed HGPSL can be integrated into various graph neural network architectures. We consider the three most widely used graph convolutional architectures as our model’s building block to investigate the effect of different convolution operations: GCN [19], GraphSAGE [13] and GAT [30]. We evaluate them on three datasets, which cover both small and large datasets. Their results are shown in Table 3. Similar results can also be found on the remaining datasets, thus we omit them due to limited space. As demonstrated in Table 3, the performance on graph classification varies depending on the dataset and the type of GNN chosen in HGPSL. In addition, we also combine the top-K selection procedure proposed in gPool and SAGPool with our proposed structure learning. We name them gPoolSL and SAGPoolSL for short. From the results, we observe that gPoolSL and SAGPoolSL outperform gPool and SAGPool by incorporating the structure learning mechanism, which verifies the effectiveness of our proposed structure learning.
Hyperparameter Analysis.
We further study the sensitivity of several key hyperparameters by varying them at different scales. Specifically, we investigate how the number of neural network layers $K$, the graph representation dimension $d$ and the pooling ratio $r$ affect the graph classification performance. As we can see in Figure 2, HGPSL achieves nearly its best performance across different datasets at a common setting of $K$, $d$ and $r$. The pooling ratio cannot be too small, otherwise most of the graph structure information will be lost during the pooling process.
Visualization.
We utilize networkx (https://networkx.github.io/) to visualize the pooling results of HGPSL and its variants. In detail, we randomly sample a graph from the PROTEINS dataset, which contains 154 nodes. We build a three-layer graph neural network with the pooling ratio set to 0.5, which then generates three pooled graphs with 77, 39 and 20 nodes, respectively. We plot the third pooled graph in Figure 3. It shows that the variants without structure learning, $\text{HGPSL}_{\text{NSL}}$ and $\text{HGPSL}_{\text{HOP}}$, fail to preserve meaningful graph topologies, while HGPSL is able to preserve a relatively reasonable topology of the original protein graph after pooling.
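The node counts quoted above follow from applying the pooling ratio repeatedly and rounding up at each layer. The short check below reproduces them; rounding up is an assumption about the implementation, though it is the choice consistent with the reported sizes (rounding down would give 77, 38 and 19), and the name `pooled_sizes` is hypothetical.

```python
import math

def pooled_sizes(num_nodes, ratio, num_layers):
    """Number of nodes kept after each pooling layer, rounding up at every step."""
    sizes = []
    n = num_nodes
    for _ in range(num_layers):
        n = math.ceil(ratio * n)
        sizes.append(n)
    return sizes

sizes = pooled_sizes(154, 0.5, 3)  # 154 -> 77 -> 39 -> 20
```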
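Table 3: Graph classification performance with different architectures (mean ± standard deviation).
Architectures      PROTEINS      NCI109        Mutagenicity
HGPSL (GCN)        84.91 ± 1.62  80.67 ± 1.16  82.15 ± 0.58
HGPSL (GraphSAGE)  85.04 ± 1.01  79.82 ± 1.06  82.02 ± 0.81
HGPSL (GAT)        84.99 ± 0.82  80.11 ± 0.96  81.96 ± 0.97
gPoolSL            81.25 ± 1.27  77.71 ± 1.22  80.42 ± 1.08
SAGPoolSL          82.67 ± 1.42  78.01 ± 1.50  80.00 ± 1.22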
Conclusion
In this paper, we investigate graph-level representation learning for the task of graph classification. We propose a novel graph pooling operator HGPSL, which empowers GNNs to learn hierarchical graph representations. It can also be conveniently integrated into various GNN architectures. Specifically, the graph pooling operation is a nonparametric step, which utilizes node features and graph structure information to perform downsampling on graphs. Then, a structure learning layer is stacked on the pooling operation, which aims to learn a refined graph structure that can best preserve the essential topological information. We combine the proposed HGPSL operator with graph convolutional neural networks to conduct the graph classification task. Comprehensive experiments on six widely used benchmarks demonstrate its superiority to a range of state-of-the-art methods.
Acknowledgments
This work is supported by the Alibaba-Zhejiang University Joint Institute of Frontier Technologies, the National Key R&D Program of China (No. 2018YFC2002603), the Zhejiang Provincial Natural Science Foundation of China (No. LZ13F020001), the National Natural Science Foundation of China (No. 61972349, 61173185, 61173186), the National Key Technology R&D Program of China (No. 2012BAI34B01, 2014BAK15B02), the National Natural Science Foundation of China (Grant No. U1866602) and the National Key Research and Development Project (Grant No. 2018AAA0101503).
References
[1] (2015) Neural machine translation by jointly learning to align and translate. International Conference on Learning Representations.
[2] (2005) Shortest-path kernels on graphs. In ICDM.
[3] (2005) Protein function prediction via graph kernels. Bioinformatics, pp. 47–56.
[4] (2013) Spectral networks and locally connected networks on graphs. arXiv preprint arXiv:1312.6203.
[5] (2016) Convolutional neural networks on graphs with fast localized spectral filtering. In NIPS, pp. 3844–3852.
[6] (2018) Signed graph convolutional networks. In ICDM, pp. 929–934.
[7] (2007) Weighted graph cuts without eigenvectors: a multilevel approach. TPAMI 29 (11), pp. 1944–1957.
[8] (2019) Edge contraction pooling for graph neural networks. arXiv preprint arXiv:1905.10990.
[9] (2003) Distinguishing enzyme structures from non-enzymes without alignments. Journal of Molecular Biology 330 (4), pp. 771–783.
[10] (2019) Learning graph pooling and hybrid convolutional operations for text representations. In The World Wide Web Conference, pp. 2743–2749.
[11] (2019) Graph U-Nets. In International Conference on Machine Learning, pp. 2083–2092.
[12] (2017) Neural message passing for quantum chemistry. In ICML, pp. 1263–1272.
[13] (2017) Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, pp. 1024–1034.
[14] (2016) Deep residual learning for image recognition. In CVPR, pp. 770–778.
[15] (1997) Long short-term memory. Neural Computation 9 (8), pp. 1735–1780.
[16] (2019) Semi-supervised learning with graph learning-convolutional networks. In CVPR, pp. 11313–11320.
[17] (2014) Large-scale video classification with convolutional neural networks. In CVPR, pp. 1725–1732.
[18] (2005) Derivation and validation of toxicophores for mutagenicity prediction. Journal of Medicinal Chemistry 48 (1), pp. 312–320.
[19] (2017) Semi-supervised classification with graph convolutional networks. ICLR.
[20] (2019) Self-attention graph pooling. In ICML, pp. 3734–3743.
[21] (2018) Adaptive graph convolutional neural networks. In AAAI.
[22] (2019) Graph convolutional networks with EigenPooling. In SIGKDD.
[23] (2016) From softmax to sparsemax: a sparse model of attention and multi-label classification. In ICML, pp. 1614–1623.
[24] (2017) Geometric deep learning on graphs and manifolds using mixture model CNNs. In CVPR, pp. 5115–5124.
[25] (2019) Weisfeiler and Leman go neural: higher-order graph neural networks. In AAAI, pp. 4602–4609.
[26] (2017) Hybrid approach of relation network and localized graph convolutional filtering for breast cancer subtype classification. arXiv preprint arXiv:1711.05859.
[27] (2018) Modeling relational data with graph convolutional networks. In European Semantic Web Conference, pp. 593–607.
[28] (2011) Weisfeiler-Lehman graph kernels. JMLR 12 (Sep), pp. 2539–2561.
[29] (2009) Efficient graphlet kernels for large graph comparison. In AISTATS, pp. 488–495.
[30] (2018) Graph attention networks. ICLR.
[31] (2015) Order matters: sequence to sequence for sets. arXiv preprint arXiv:1511.06391.
[32] (2019) A comprehensive survey on graph neural networks. arXiv preprint arXiv:1901.00596.
[33] (2018) Representation learning on graphs with jumping knowledge networks. ICML.
[34] (2018) Graph convolutional neural networks for web-scale recommender systems. In SIGKDD, pp. 974–983.
[35] (2018) Hierarchical graph representation learning with differentiable pooling. In NIPS, pp. 4800–4810.
[36] (2018) An end-to-end deep learning architecture for graph classification. In AAAI.
[37] (2018) ANRL: attributed network representation learning via deep neural networks. In IJCAI, pp. 3155–3161.
[38] (2018) Deep learning on graphs: a survey. arXiv preprint arXiv:1812.04202.
[39] (2018) Graph neural networks: a review of methods and applications. arXiv preprint arXiv:1812.08434.
Appendix
Proof for Algorithm 1
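To summarize, $\mathrm{sparsemax}(\cdot)$ computes the Euclidean projection of the input vector $\mathbf{z} \in \mathbb{R}^{K}$ onto the probability simplex $\Delta^{K-1}$, which can be defined as the following optimization problem:
(12)  $\mathrm{sparsemax}(\mathbf{z}) = \arg\min_{\mathbf{p} \in \Delta^{K-1}} \|\mathbf{p} - \mathbf{z}\|^{2}$
Then, the Lagrangian of the optimization problem in Equation (12) is:
(13)  $\mathcal{L}(\mathbf{p}, \boldsymbol{\mu}, \tau) = \frac{1}{2}\|\mathbf{p} - \mathbf{z}\|^{2} - \boldsymbol{\mu}^{\top}\mathbf{p} + \tau(\mathbf{1}^{\top}\mathbf{p} - 1)$
The optimal $(\mathbf{p}^{*}, \boldsymbol{\mu}^{*}, \tau^{*})$ must satisfy the following Karush-Kuhn-Tucker conditions:
(14)  $\mathbf{p}^{*} - \mathbf{z} - \boldsymbol{\mu}^{*} + \tau^{*}\mathbf{1} = \mathbf{0}$
(15)  $\mathbf{1}^{\top}\mathbf{p}^{*} = 1, \quad \mathbf{p}^{*} \geq \mathbf{0}, \quad \boldsymbol{\mu}^{*} \geq \mathbf{0}$
(16)  $\mu_{i}^{*} p_{i}^{*} = 0, \quad \forall i \in [K]$
If for $i \in [K]$ we have $p_{i}^{*} > 0$, then from Equation (16) we must have $\mu_{i}^{*} = 0$. Thus, from Equation (14) we get $p_{i}^{*} = z_{i} - \tau^{*}$. Let $S(\mathbf{z}) = \{ j \in [K] \mid p_{j}^{*} > 0 \}$. From Equation (15) we obtain $\sum_{j \in S(\mathbf{z})} (z_{j} - \tau^{*}) = 1$, which yields Line 3 in Algorithm 1, i.e., $\tau^{*} = \frac{\sum_{j \in S(\mathbf{z})} z_{j} - 1}{|S(\mathbf{z})|}$. Again from Equation (16), $\mu_{i}^{*} > 0$ implies $p_{i}^{*} = 0$, which from Equation (14) implies $\mu_{i}^{*} = \tau^{*} - z_{i} \geq 0$, i.e., $z_{i} \leq \tau^{*}$ for $i \notin S(\mathbf{z})$. Thus, we have the procedure in Algorithm 1.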
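The procedure in Algorithm 1 can be illustrated with a short Python sketch. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `sparsemax`, the list-based interface, and the returned threshold are hypothetical choices, and the support is located with the standard sort-and-scan rule for projecting onto the simplex.

```python
def sparsemax(z):
    """Project the score vector z onto the probability simplex.

    Returns (p, tau), where p_i = max(z_i - tau, 0) and tau is the
    threshold that makes the entries of p sum to one.
    NOTE: illustrative sketch; names and interface are assumptions.
    """
    if not z:
        raise ValueError("z must be non-empty")

    # Sort scores in decreasing order: z_(1) >= z_(2) >= ... >= z_(K).
    z_sorted = sorted(z, reverse=True)

    # Find the support size k(z): the largest k such that
    # 1 + k * z_(k) > sum_{j <= k} z_(j). The condition always holds for k = 1.
    cumsum = 0.0
    support_size = 0
    support_sum = 0.0
    for k, z_k in enumerate(z_sorted, start=1):
        cumsum += z_k
        if 1.0 + k * z_k > cumsum:
            support_size = k
            support_sum = cumsum

    # Threshold: tau = (sum of supported scores - 1) / |S(z)|.
    tau = (support_sum - 1.0) / support_size

    # Entries at or below the threshold are clipped to exactly zero.
    p = [max(z_i - tau, 0.0) for z_i in z]
    return p, tau


if __name__ == "__main__":
    p, tau = sparsemax([0.5, 0.3, -0.1])
    print(p, tau)
```

Unlike softmax, the output can contain exact zeros: every entry with $z_i \leq \tau$ is dropped from the support, which is the case analysed in the last step of the proof.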