Multi-Channel Graph Convolutional Networks

Graph neural networks (GNNs) have been demonstrated to be effective in classifying graph-structured data. To further improve graph representation learning, hierarchical GNNs have been explored. They leverage differentiable pooling to cluster nodes into groups, generating a coarse-grained structure as the original graph shrinks. However, such clustering discards graph information and achieves suboptimal results: nodes inherently have different characteristics or roles, and two non-isomorphic graphs may share the same coarse-grained structure that cannot be distinguished after pooling. To compensate for the loss caused by coarse-grained clustering and further advance GNNs, we propose multi-channel graph convolutional networks (MuchGCN). It is motivated by convolutional neural networks, in which a series of channels is encoded to preserve the comprehensive characteristics of the input image. We therefore define specific graph convolutions to learn a series of graph channels at each layer, and pool graphs iteratively to encode the hierarchical structures. Experiments have been carefully carried out to demonstrate the superiority of MuchGCN over state-of-the-art graph classification algorithms.

1 Introduction

Classifying graph-structured data has become an important problem in various domains, such as biological graph analysis [3]. Among the numerous graph classification techniques, graph neural networks (GNNs) [5, 23, 22] tend to achieve superior performance. The core idea is to update each node's embedding iteratively by aggregating the representations of its neighbors and itself. The graph representation is then generated through a global pooling layer over all the nodes [19, 9, 8], which encodes the input graph flatly.

To further improve performance, hierarchical GNNs have been proposed to encode both the local and coarse-grained structures of the input graph [26, 11, 6]. Taking a social network as an example, the local structure is represented by the individual nodes and direct links; after clustering, the coarse-grained structure is constructed by the communities and their correlations. The motivation is the inherent hierarchy of graph-structured data, analogous to the different resolutions of an image. Accordingly, pooling modules are leveraged to cluster nodes and generate a coarse-grained graph at the next layer, and GNNs are stacked to encode the hierarchical graphs. While the pooling module obtains the hierarchical representations of the input graph, the accompanying information loss is problematic for graph classification. First, two non-isomorphic graphs may be pooled into the same graph at the higher layers of the model; similar graph representations would then be learned, making them indistinguishable. Second, only one coarse-grained graph is generated at each layer, which ignores the multi-view poolings of the input graph. For example, individual social nodes generally have multiple characteristics, and they could be clustered into communities in different ways.

Figure 1: An illustration of the MuchGCN framework, which consists of multiple layers. In each layer, the graph convolutions update the node embeddings, and the feature learning then prepares a series of graph channels encoded with different node characteristics. Between two successive layers, the pooling module is applied to obtain coarse-grained versions of the graphs at the previous layer. The graph embeddings learned at each layer are concatenated to represent the entire graph, which is fed into a differentiable classifier to predict the corresponding label.

Recently, multi-graph GNNs have been proposed to learn the various node characteristics [1, 12, 28]. They duplicate a series of instances from the input graph, and implement GNNs on them independently to encode the specific characteristics.

To tackle the above problems, we propose a hierarchical framework able to encode a series of coarse-grained graphs layer by layer. It is comparable with convolutional neural networks (CNN), where the pooling layers and convolutional filters work together to operate on the image channels hierarchically. A grid-like image can be regarded as a special type of graph-structured data, in which each pixel is represented by a node. A pixel has a fixed-size neighborhood patch, e.g., the directly adjacent pixels, and these neighbors have a determined order from the upper left to the lower right. However, there are two challenges in building up such a deep neural network for general graph-structured data. First, across a graph, nodes have varying numbers and no canonical ordering of their neighbors. The local convolutions in CNN cannot be directly applied to learn the nodes' characteristics, since they are predefined with the shape and order of the local neighbors. Second, the coarse-grained graphs at one layer, as shown in Figure 1, have different adjacency structures. The nodes and edges of one channel cannot be mapped to those of the others. This prevents us from using the channel-wise convolutional filter to add one graph on top of the others to aggregate their features; such a filter can only sum up images of the same grid-like shape.

To address the above challenges, we develop the multi-channel graph convolutional networks (MuchGCN) shown in Figure 1. Specifically, our work can be separated into the following two research questions: (i) how to define the convolutional filters that learn the various nodes' characteristics for graph-structured data; (ii) how to define the graph convolutions that combine the distinct coarse-grained graphs. In summary, our major contributions are described below.

  • We propose a new graph representation learning architecture, in which a series of coarse-grained graphs is encoded hierarchically.

  • We design the convolutional filter to learn a series of graph channels, without relying on the shape or order of a node's neighbors.

  • We define the inter-channel graph convolutions to aggregate the graph channels via message passing.

  • The experiments show that the graph classification accuracy of MuchGCN is superior to that of the state-of-the-art baselines.

2 Preliminaries

The goal of graph classification is to map graphs into a set of labels. Let G = (A, X) denote a directed or undirected graph consisting of n nodes, where A denotes the n × n adjacency matrix, and X denotes the n × d feature matrix in which each row represents the d-dimensional feature vector of a node. Given a set of graphs {G_1, ..., G_N} and the corresponding labels {y_1, ..., y_N}, the challenge is to extract informative graph representations that facilitate the downstream graph classification f : G → y.

2.1 Graph Neural Networks.

GNN uses the adjacency structure and node features to learn the node embeddings. A general "message-passing" based GNN can be expressed by [24]:

H^(k) = ReLU( (A + I) H^(k-1) W^(k) ),    (1)

where H^(k) denotes the hidden node embeddings after k steps of graph convolutions, W^(k) denotes the linear transformation matrix, and ReLU denotes the activation function. We have H^(0) = X. A node embedding is updated by aggregating the representations of its neighbors and itself, which is the same as in the graph convolutional networks (GCN) except for the normalization of the adjacency matrix [16]. After K steps of message passing, we can reach the neighbors that are at most K hops away from the central node. For simplicity of exposition, we denote a GNN with K steps of message passing as GNN(A, X). To tackle the task of graph classification, the graph representation is generated by globally pooling the node embeddings in matrix H^(K).
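To make Equation (1) concrete, here is a minimal numerical sketch of one message-passing step, assuming the un-normalized (A + I) aggregation described above; the toy path graph, feature matrix, and identity weight matrix are illustrative choices, not values from the paper.

```python
import numpy as np

def gnn_layer(A, H, W):
    """One step of 'message passing' as in Equation (1): each node sums the
    embeddings of its neighbors and itself, then applies a linear map and ReLU."""
    agg = (A + np.eye(A.shape[0])) @ H   # aggregate neighbors + self
    return np.maximum(agg @ W, 0.0)      # linear transform + ReLU activation

# Toy example: a 3-node path graph with 2-dimensional all-ones features.
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
X = np.ones((3, 2))
W = np.eye(2)                            # identity weights for readability
H1 = gnn_layer(A, X, W)                  # the middle node aggregates 3 nodes
```

The middle node of the path has two neighbors plus itself, so its embedding becomes three times the original, while the end nodes aggregate only two embeddings each.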

2.2 Differentiable Pooling.

A major limitation of Equation (1) is that it encodes only the flat structure of the input graph. Recently, differentiable pooling [26] (DIFFPOOL) was proposed to cluster nodes of the input graph gradually, and generate hierarchical coarse-grained graphs layer by layer. Formally, let n_l and n_{l+1} denote the node numbers of the coarse-grained graphs at layers l and l+1, respectively. Generally we have n_{l+1} < n_l, in order to obtain more abstract graphs at the higher layers of the model. Let S^(l) and Z^(l) denote the cluster matrix and node embedding learned at layer l, respectively. The DIFFPOOL module clusters the graph from layer l and generates a coarser one at layer l+1 as follows:

A^(l+1) = S^(l)^T A^(l) S^(l),    X^(l+1) = S^(l)^T Z^(l),    (2)

where A^(l+1) and X^(l+1) denote the adjacency matrix and node features of the graph at layer l+1, respectively. To prepare the cluster matrix and node embedding at layer l, two GNN modules are used: S^(l) = softmax( GNN_pool(A^(l), X^(l)) ) and Z^(l) = GNN_embed(A^(l), X^(l)). The softmax function is applied row-wise to determine the assignment probability of each node at layer l to the clusters at layer l+1. Given the coarse-grained graphs of all layers, GNNs are stacked to encode the inherently hierarchical structures of the input graph.
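The coarsening step of Equation (2) can be sketched as follows; the toy two-node graph, the zero assignment logits, and pooling everything into a single cluster are illustrative assumptions for the example only.

```python
import numpy as np

def softmax_rows(Z):
    """Row-wise softmax, as used to normalize the cluster assignment."""
    e = np.exp(Z - Z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def diffpool(A, Z, S_logits):
    """Differentiable pooling (Equation (2)): soft-assign the n_l nodes to
    n_{l+1} clusters, then coarsen both the features and the adjacency."""
    S = softmax_rows(S_logits)   # assignment matrix, shape n_l x n_{l+1}
    X_next = S.T @ Z             # pooled node features
    A_next = S.T @ A @ S         # pooled adjacency
    return A_next, X_next

A = np.array([[0., 1.], [1., 0.]])   # a single edge between two nodes
Z = np.array([[1., 0.], [0., 1.]])   # embeddings of the two nodes
S_logits = np.zeros((2, 1))          # pool both nodes into one cluster
A1, X1 = diffpool(A, Z, S_logits)
```

Pooling both nodes into one cluster sums their embeddings and collapses the edge into a self-weight of the single coarse node.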

2.3 Multi-Graph Learning.

In real-world graph-structured data, nodes have various roles or characteristics, and they have different types of correlations. Under this prior knowledge, the multi-graph GNN learns the multiple characteristics of nodes that can be informative for representation learning. Formally, given the input graph associated with feature X, the graph is first duplicated to generate a series of graph instances. For graph instance i, a specific adjacency matrix A_i is formulated to capture one class of the node characteristics and correlations. Based on Equation (1), the graph convolutions learn the node embedding of each graph instance independently: H_i = GNN(A_i, X). The final node representation is obtained by globally pooling the set of node embeddings learned on the different graph instances.
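A minimal sketch of this instance-wise scheme, assuming one message-passing step per instance and mean pooling across instances (the combination operator in the cited works may differ):

```python
import numpy as np

def multi_graph_gnn(adjacencies, X, weights):
    """Multi-graph learning sketch: run an independent one-step GNN on each
    graph instance (its own adjacency matrix A_i), then pool the per-instance
    embeddings. Mean pooling across instances is an illustrative choice."""
    H_list = [np.maximum((A + np.eye(A.shape[0])) @ X @ W, 0.0)
              for A, W in zip(adjacencies, weights)]
    return np.mean(H_list, axis=0)   # combine instance-wise embeddings

A1 = np.array([[0., 1.], [1., 0.]])  # one wiring of the node correlations
A2 = np.array([[0., 0.], [0., 0.]])  # a second, differently wired instance
X = np.ones((2, 2))
W = np.eye(2)
H = multi_graph_gnn([A1, A2], X, [W, W])
```

Each instance sees the same node features but its own adjacency, so the pooled embedding blends the characteristics encoded by the different wirings.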

3 Multi-channel Graph Convolutional Networks

The differentiable pooling has its own bottlenecks in graph representation learning. On the one hand, the input graph is distorted gradually after each layer of pooling, which may make it difficult to distinguish heterogeneous graphs at the higher layers of the model. On the other hand, the pooling discards the inherently diverse graph information that would be informative for graph classification. One promising solution is to learn the various nodes' characteristics and preserve as many coarse-grained structures as possible, in order to compensate for the information loss of pooling. However, multi-graph learning directly duplicates the input graph and ignores its hierarchical structures; the adjacency matrix of each graph instance is formulated manually to encode one of the nodes' characteristics, which prevents us from learning the hidden representation adaptively for a given task.

To further improve graph representation learning, we propose the new framework named MuchGCN, shown in Figure 1. It mimics the neural architecture of CNN for graph-structured data, where a series of graph channels is learned hierarchically. Before elaborating our framework, we first define two key concepts for consistency of presentation:

Definition 1. Layer: A layer is composed of the graph convolutions and feature learning operations shown in Figure 1. Let l denote the index of a layer. The input to layer l is a set of graphs, while the output is a series of graphs associated with the learned node embeddings.

Definition 2. Channel: Given a specific layer, a channel represents one of its input graphs, indexed by i. As shown in Figure 1, the first layer consists of a single channel, while the next layer consists of two channels.

3.1 Proposed Method.

Compared with CNN, the challenges of building up MuchGCN lie in the following two facts. First, nodes across a graph have different numbers of direct neighbors; as shown in Figure 1, the upper-left node of the input graph has only one neighbor, while the others have at least two. There is also no determined order among the neighbors of a node. For graph-structured data, a grid-like local filter therefore cannot learn a node's characteristics directly from its neighborhood shape. Second, the coarse-grained graph channels at a layer have adjacency structures of different shapes, and it is hard to map the nodes and edges of one channel to those of another. In consequence, the series of graph channels cannot be stacked and pooled together in a node-wise or edge-wise fashion. This precludes the straightforward way of aggregating features from the various graph channels to generate a new channel at the next layer, as done in CNN.

We address the above challenges by carefully designing two key components in MuchGCN: (i) the convolutional filter, defined based on the steps of message passing instead of the direct neighbors; (ii) the inter-channel graph convolutions, which pass messages among the graph channels to aggregate their features. As shown in Figure 1, we first describe how MuchGCN learns the node's characteristics in the single channel at the first layer, and then how it operates graph convolutions on the multiple channels at the higher layers.

Single-channel Learning Process.

We consider the graph at channel i of the first layer, given by the input graph with A_i = A and X_i = X. The graph convolutions stage applies GNNs to generate embeddings iteratively. The node embedding after k steps of message passing at channel i is given by:

H^(k) = ReLU( (A_i + I) H^(k-1) W^(k) ),    (3)

where W^(k) denotes the trainable parameter for the k-th message passing at this layer. Note that embedding H^(k) represents the neighborhood structure of height k. Based on the Weisfeiler-Leman (WL) algorithm [24], two non-isomorphic graphs can be distinguished if their node embeddings differ at any step k. In the context of graph representation learning, we consider the embedding multiset {H^(0), H^(1), ..., H^(K)} that consists of the input feature and all intermediate node embeddings, where H^(0) = X_i. The feature learning stage learns the node's characteristics via a set of trainable filters, and at the same time generates a series of graphs to improve the graph classification ability. Formally, filter j learns the new graph associated with embedding X_(i,j) as follows:

X_(i,j) = MLP( sum_{k=0}^{K} eps_k^(i,j) * H^(k) ),    (4)

where the index tuple (i,j) denotes the j-th graph newly generated from channel i, and eps_k^(i,j) is a trainable scalar. MLP, sum and * denote the non-linear multilayer perceptron, the summation over the embedding multiset, and element-wise multiplication, respectively. Following the same process of graph convolution and feature learning, we can also obtain the cluster matrices for the new graphs at the next layer.
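A sketch of one such filter, assuming the feature learning reduces to a scalar-weighted summation of the embedding multiset followed by an MLP; the identity stand-in for the trained MLP and all numeric values are illustrative assumptions.

```python
import numpy as np

def feature_filter(H_multiset, eps, mlp):
    """One convolutional filter of the feature-learning stage (Equation (4)):
    weight each embedding H^(k) in the multiset by a trainable scalar eps_k,
    sum them element-wise, and pass the result through an MLP."""
    mixed = sum(e * H for e, H in zip(eps, H_multiset))
    return mlp(mixed)

# Toy multiset: input features plus one round of message passing.
H0 = np.ones((3, 2))                 # H^(0) = X
H1 = 2 * np.ones((3, 2))             # H^(1) after one convolution step
identity_mlp = lambda Z: Z           # stand-in for a trained MLP
X_new = feature_filter([H0, H1], eps=[0.5, 0.5], mlp=identity_mlp)
```

Different scalar vectors eps produce different mixtures of the multiset, which is how a set of filters yields a series of new graphs from one channel.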

Based on the graph pooling defined in Equation (2), the channels at the next layer are generated from the learned embeddings. As shown in Figure 1, the channels at the second layer encode different coarse-grained structures of the input graph. In addition, there exist adjacency connections between these two channels. Formally, the inter-channel adjacency matrix between channels i and j is given by A_(i,j) = S_i^T A S_j, where S_i and S_j denote the cluster matrices used to pool the two channels. The rows and columns of A_(i,j) represent the nodes of channels i and j, respectively.

Multi-channel Learning Process.

Unlike the first layer, the input to a higher layer is given by a series of channels. They represent different characteristics of the input graph, and have various coarse-grained structures. It is required to aggregate features from this series of channels to generate a more abstract representation at the higher layers of the model. In this section, we introduce the novel graph convolutions, which update the node embedding at one channel by additionally exploiting information from the others.

Considering channel i, the graph convolutions include both intra-channel and inter-channel ones, which receive information from the current channel and from a neighboring channel j, respectively. The intra-channel graph convolution is given by Equation (3), with the channel and layer indices replaced accordingly. Based on Equation (3), the inter-channel graph convolution is defined as follows:

H_(i,j)^(k) = ReLU( A_(i,j) H_j^(k-1) W^(k) ),    (5)

where H_(i,j)^(k) denotes the node embedding of channel i after k steps of feature aggregation from the neighboring channel j, and A_(i,j) denotes the inter-channel adjacency matrix between channels i and j. Compared with the intra-channel convolution, we replace the adjacency matrix and neighbor embeddings by A_(i,j) and H_j^(k-1), respectively. In this way, the messages of the neighboring channels are passed to update the embedding at channel i, although the channels have adjacency structures of different shapes.
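A dimension-level sketch of the inter-channel aggregation: the rectangular inter-channel adjacency matrix lets a 2-node channel aggregate embeddings from a 3-node channel, even though the two channels cannot be mapped node-to-node. The matrices are illustrative assumptions.

```python
import numpy as np

def inter_channel_conv(A_inter, H_neighbor, W):
    """Inter-channel message passing (Equation (5)): nodes of channel i
    aggregate embeddings from channel j through the rectangular
    inter-channel adjacency matrix, then apply a linear map and ReLU."""
    return np.maximum(A_inter @ H_neighbor @ W, 0.0)

# Channel i has 2 nodes, channel j has 3 nodes.
A_inter = np.array([[1., 0., 1.],
                    [0., 1., 0.]])   # 2 x 3 cross-channel connections
H_j = np.ones((3, 4))                # neighbor-channel node embeddings
W = np.eye(4)                        # identity weights for readability
H_i_agg = inter_channel_conv(A_inter, H_j, W)
```

The first node of channel i is connected to two nodes of channel j and therefore aggregates twice the feature mass of the second node.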

Following the graph convolutions, the feature learning stage learns the node's characteristics based on Equation (4). Different from the embedding multiset at the first layer, the one at a higher layer additionally includes the embeddings H_(i,j)^(k) that aggregate features from all neighboring channels j. Given the set of filters, Equation (4) encodes the various characteristics and obtains a series of graphs from channel i, as shown in Figure 1. By repeating the previous process for every channel, we obtain the graphs associated with the learned embeddings.

Multi-channel Graph Convolutional Networks.

We stack multiple layers of graph convolutions and feature learning in MuchGCN. Let n_l and c_l denote the node number of a graph and the channel number at layer l, respectively. We define the assign ratio n_{l+1}/n_l and the channel expansion c_{l+1}/c_l. Generally, an assign ratio smaller than one is used to generate more coarse-grained graphs, and a channel expansion larger than one to learn the various characteristics of nodes. For each layer l, we generate a series of graphs whose node embeddings are given by H_i^(l), i = 1, ..., c_l. The graph representation learned at layer l is obtained by combining the generated graphs as follows:

z^(l) = READOUT( { H_i^(l) : i = 1, ..., c_l } ),

where READOUT denotes the global pooling function that reads out the graph representation. The entire graph representation is generated by concatenating z^(l) from all layers: z = CONCAT( z^(1), ..., z^(L) ). It encodes both the local and the hierarchically coarse-grained structures of the input graph. Given z as input, a downstream differentiable classifier, such as an MLP, predicts the corresponding graph label.
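The readout-and-concatenate scheme can be sketched as follows, assuming max pooling per channel (matching the implementation details in Section 4) and concatenation of the channel summaries within a layer; the shapes and random embeddings are illustrative only.

```python
import numpy as np

def readout(channel_embeddings):
    """Per-layer readout: max-pool the nodes of every graph channel and
    concatenate the channel summaries into one layer-level vector."""
    return np.concatenate([H.max(axis=0) for H in channel_embeddings])

rng = np.random.default_rng(0)
layer1 = [rng.random((5, 4))]                      # one channel at layer 1
layer2 = [rng.random((3, 4)), rng.random((3, 4))]  # two channels at layer 2

# Final graph representation: concatenate the readouts of all layers.
z = np.concatenate([readout(layer1), readout(layer2)])
```

With 4-dimensional embeddings, one channel at the first layer and two at the second, the final representation has 4 + 8 = 12 dimensions.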

3.2 Theoretical Analysis.

We analyze how the various coarse-grained structures are produced by learning the different characteristics of nodes. Considering channel i, we prepare the embedding multiset based on the intra-channel and inter-channel graph convolutions. Then, Equation (4) encodes a specific characteristic via filter j, and generates the graph associated with embedding X_(i,j). Note that the representation of node v is given by the v-th row of the embedding matrix; correspondingly, both the embedding multiset of node v and its embedding in the generated graph are given by the v-th rows of the associated matrices.

Proposition 1: Assume the set constructed by all d-dimensional node embeddings is countable. There exist an MLP function and infinitely many scalars in Equation (4) such that node v has a unique embedding if the following conditions are satisfied:

a) Node v has a unique multiset.

b) The trainable filter .

A proof is provided in Section C of the Appendix. A node characteristic is computationally represented by the convolution between the embedding multiset and the filter. Diverse nodes are assigned different embeddings based on the above proposition. In the pooling module, nodes are clustered together only if they are similar with respect to a specific characteristic. Using a set of filters, we can therefore learn the various characteristics of nodes, and obtain a series of coarse-grained graphs.

Complexity Analysis. Considering channel i at layer l, we analyze the time complexity of learning the node's characteristics based on Equation (4). First, the node embeddings within the multiset need to be prepared according to Equations (3) and (5). Since the intra-channel and inter-channel adjacency matrices are usually sparse, each step of Equations (3) and (5) costs time linear in the corresponding number of edges. With K steps of message passing in the graph convolutions and multiple neighboring channels to aggregate, the cost of obtaining the embedding multiset scales with both. Second, the element-wise multiplication in Equation (4) is linear in the size of the embedding matrices. Based on the above two components, the total time of feature learning increases linearly with the product of the step number K and the channel number. We provide a running time analysis in the Appendix.

4 Experiments

We evaluate our MuchGCN on the task of graph classification to answer the following three questions:

  • Q1: How does MuchGCN perform when it is compared with other state-of-the-art models?

  • Q2: How do the multiple channels in MuchGCN help improve the graph representation learning ability and the classification accuracy?

  • Q3: How do the important hyperparameters in MuchGCN affect the network performance?

4.1 Experiment Settings


We use the graph classification benchmarks suggested in [25, 15]: the bioinformatic datasets (PTC, D&D, PROTEINS [4, 10]) and the social network datasets (COLLAB, IMDB-BINARY, IMDB-MULTI, REDDIT-MULTI-12K [7]). The detailed statistics of these seven datasets are summarized in Table 4 in the Appendix.


We compare MuchGCN with three classes of state-of-the-art baselines: (1) kernel methods, including the WL subtree kernel [17] and GRAPHLET [20]; (2) flat GNNs, including GCN [16], GRAPHSAGE [13], PATCHYSAN [18], DCNN [2], DGCNN [27] and ECC [21]; (3) the hierarchical GNN DIFFPOOL. In the flat GNNs, the graph representation is produced via global pooling or with 1-D convolutions over the ordered nodes. DIFFPOOL stacks GRAPHSAGE hierarchically to learn the coarse-grained structures [26]. We additionally implement a DIFFPOOL framework built on GCN, and compare with both variants. The graph classification performances of GCN and of DIFFPOOL based on GCN are obtained by running the models under the same environment as MuchGCN; those of the other baselines are reported directly from their publications.

Implementation Details.

MuchGCN is built upon the intra-channel and inter-channel graph convolutions shown in Equations (3) and (5). The same message passing step and hidden dimension are used across datasets, and the assign ratio is set separately for the shallower and deeper architectures. The READOUT function is given by max pooling to read out the graph representation. Batch normalization [14] and embedding normalization are applied after each step of graph convolutions to make the training more stable. We regularize the objective function by the entropy of the cluster matrix to make the cluster pooling more sparse [26]. The Adam optimizer is adopted to train MuchGCN, and the gradient is clipped when its norm exceeds a threshold. We evaluate MuchGCN with k-fold cross validation, reporting the average classification accuracy and standard deviation, and train the model for the same number of epochs on each fold. Three variants of MuchGCN are considered here:

  • MuchGCN-M: the tailored MuchGCN framework that only learns the multiple characteristics of nodes. The channel expansion is larger than one, so that a set of convolutional filters learns a series of new graphs based on Equation (4). In addition, a single layer is used, which removes the pooling module; this variant encodes the input graph like a multi-graph GNN.

  • MuchGCN-H: the tailored framework that only learns the hierarchical architecture. The channel expansion is one, so that MuchGCN encodes a single coarse-grained structure at each layer, like DIFFPOOL. A deeper architecture is used for the PROTEINS dataset, while the other datasets show similar performance with fewer layers.

  • MuchGCN-MH: the complete MuchGCN framework, which learns the multiple characteristics and the hierarchical architecture simultaneously. It has the same channel expansion as MuchGCN-M and the same layer number as MuchGCN-H.

4.2 Graph Classification Results

Model Comparison.

Table 1 compares the graph classification accuracy of MuchGCN-MH with those of all the baselines, and provides a positive answer to Q1. We observe that MuchGCN-MH achieves state-of-the-art classification performance on most of the benchmarks. Considering the baseline methods WL subtree, GCN, DIFFPOOL-GCN and the variants MuchGCN-H and MuchGCN-M, MuchGCN-MH obtains consistent average improvements over each of them; in particular, it outperforms DIFFPOOL-GCN significantly on the REDDIT-MULTI-12K dataset. This is expected, because the baseline methods fall short of the graph classification task in the following ways. First, the kernel methods predefine some substructure features to measure the input graph manually, failing to learn representative hidden embeddings. Second, the flat GNNs obtain the graph representation with a simple global pooling layer, which is problematic for classifying graph-structured data that is inherently multi-characteristic and hierarchical. Third, the hierarchical networks DIFFPOOL and MuchGCN-H pool the input graph gradually and generate a single coarse-grained structure at each layer; the pooling module loses detailed graph information at the higher layers of the model, making it hard to distinguish heterogeneous graphs. Finally, although the multi-graph GNN variant MuchGCN-M exploits the various characteristics of nodes, it is a shallow model that cannot reach an abstract expression of the input graph.

MuchGCN-MH successfully encodes both the multiple characteristics and the hierarchical structures of the input graph, bridging the gap between the hierarchical and multi-graph frameworks. On the one hand, the pooling modules are stacked to build up a hierarchical model, and the graph convolutions operate on the coarse-grained graphs to learn abstract representations. On the other hand, the convolutional filters learn the various characteristics of nodes; by pooling the nodes in different ways, we generate a series of coarse-grained graphs at the next layer. This helps preserve the information of the input graph to a large extent.

[Table 1 values are not recoverable from the extracted text; its rows include WL subtree, DCNN, ECC, DP-GSAGE and the MuchGCN variants, over the seven benchmark datasets.]

Table 1: Classification accuracy and standard deviation in percent. The best results are highlighted in boldface. DP-GSAGE and DP-GCN denote the baseline DIFFPOOL built upon GRAPHSAGE and GCN, respectively. The symbol '-' indicates that no classification result is available in the corresponding publication.

Effectiveness Validation of Multiple Channels.

There is a series of channels learned at each layer of MuchGCN; they can be concatenated and regarded as a super graph. When the channel number is larger than one, MuchGCN keeps many more nodes per layer than DIFFPOOL. One might therefore suspect that the performance advantage of MuchGCN relies on simply preserving more nodes, rather than on the channels encoded with different characteristics. In this subsection, we validate how the multiple channels improve the graph representation learning ability, to answer Q2. The channel expansion and cluster ratio of MuchGCN are fixed to control the related variables. For DIFFPOOL, we gradually increase the node number in the coarse-grained graphs by considering a sequence of growing assign ratios, the largest of which yields the same node number as MuchGCN at each layer, in order to provide a fair comparison. We compare the two models comprehensively under different depths of the hierarchical neural networks, and show their graph classification accuracies in Table 2.

[Table 2 values are not recoverable from the extracted text; its columns report results for layer numbers 2, 3 and 4, plus the variance.]

Table 2: Classification accuracy in percent on the PROTEINS dataset. MuchGCN and DIFFPOOL built upon GCN are compared under different layer numbers, ranging from 2 to 4.

The following observations support the effectiveness of multiple channels in learning the graph representation. First, comparing the DIFFPOOL frameworks with different assign ratios, the larger ratio leads to a smaller classification accuracy as the layer number grows. A larger ratio preserves more node clusters and structural information in the pooled graphs, which is expected to help distinguish the graphs. However, in the deeper hierarchical frameworks, these extra node clusters may introduce noise in the coarse-grained structure of the input graph, because the optimal number of node clusters is determined under the supervision of the given task. Second, MuchGCN outperforms DIFFPOOL consistently even when they have the same node number per layer. In particular, while the classification accuracy of DIFFPOOL decreases significantly with depth, that of MuchGCN remains stable with a small variance. Instead of directly retaining more nodes, we learn the various characteristics of nodes and pool them in different ways to obtain a series of channels. This is in line with real-world graph-structured data, which is intrinsically multi-view.

Performance improvement via Increasing Channels.

Moving a step forward, we study how the graph classification performance varies with the channel number, to further answer research question Q2. We reuse two of the well-performing MuchGCN frameworks from the previous experiments, both with the same cluster ratio. Increasing the channel expansion step by step, we show the classification accuracy of MuchGCN in Table 3.

It is clear that, in general, the larger the channel expansion is, the better the classification accuracy that can be achieved. The reason is intuitive: the multiple channels help encode more graph characteristics. By preserving more graph information at the higher layers of the model, it becomes easier for the downstream classifier to distinguish the non-isomorphic graphs.

[Table 3 values are not recoverable from the extracted text; its columns report channel expansions 1 through 4 for each layer/ratio configuration.]

Table 3: Classification accuracy in percent on the PROTEINS dataset. A series of channel expansions is evaluated to measure their contributions to the graph representation learning of MuchGCN.

Hyperparameter Studies.

We investigate the effects of some important hyperparameters on MuchGCN to answer research question Q3. Both the cluster ratio and the message passing step K are evaluated in this section. A pooling module equipped with a large cluster ratio generates more complex coarse-grained graphs, while a large step K aggregates neighborhood structure from more hops away in the graph convolutions. We use a fixed basic configuration of MuchGCN, and show the effects of the two hyperparameters on this underlying framework in Figure 2.

We observe that the different cluster ratios achieve similar best classification accuracies, because the pooling module can adaptively learn the appropriate number of node clusters. Compared with a small ratio, some node clusters in the coarse-grained graph may be empty or even introduce noise when the ratio is large; this phenomenon is also explained in the previous experiments and in related work [26]. Considering the message passing step K, a larger step provides more accurate classification when the cluster ratio is small, while a smaller step tends to achieve better accuracy when the ratio is large. On the one hand, more convolutional steps update the node embedding globally with distant neighbors; the improved node embeddings help the graph representation learning, and hence the classification performance, when the ratio is small. On the other hand, the node embeddings across a graph become close to each other in Euclidean space as the message passing step increases, so unrelated nodes may be assigned together into noisy and redundant clusters when the ratio is large.

Figure 2: Classification accuracy in percent on the PROTEINS dataset. The hyperparameters of cluster ratio and message passing step are evaluated to measure their effects on MuchGCN.

5 Conclusion

Motivated by the CNN architecture, we propose a framework named MuchGCN to learn graph representations. As in CNN, a series of coarse-grained graph channels is encoded layer by layer for the graph-structured data. In detail, we design graph convolutional filters to learn the various characteristics of nodes in the series of graph channels, and define inter-channel graph convolutions to aggregate the graph channels and generate those at the next layer. Experimental results show that we achieve state-of-the-art performance on the task of graph classification and greatly improve model robustness. In future work, we will apply MuchGCN to other tasks, such as node classification and link prediction.

Appendix A Dataset Statistics

The statistics of all the datasets are summarized in Table 4. Each dataset consists of a series of graphs accompanied with graph labels. In Table 4, # Graphs denotes the total number of graphs in the corresponding dataset, and # Classes denotes the number of classes of the graph label. The fourth and fifth columns denote the average numbers of nodes and edges per graph. The Node Label column denotes whether node attributes exist in the dataset.

Datasets # Graphs # Classes Avg.# Nodes per Graph Avg.# Edges per Graph Node Label
Table 4: Dataset Statistics.

Appendix B Implementation Details

B.1 Running Environment.

The baseline methods GCN and DIFFPOOL-GCN and our proposed MuchGCN are implemented in PyTorch, and tested on a machine with 24 Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz processors, 4 GeForce GTX-1080 Ti GPUs with 12 GB of memory each, and 128 GB of main memory. The random seed for the numpy and torch packages is fixed.
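The seeding can be sketched as follows; the seed value is an arbitrary placeholder (the paper's value is not reproduced here), and the analogous `torch.manual_seed` call is noted in a comment so the sketch runs without PyTorch installed:

```python
import random
import numpy as np

def set_seed(seed: int) -> None:
    """Fix the Python and numpy RNGs so runs are repeatable.
    In the full PyTorch setup, torch.manual_seed(seed) and
    torch.cuda.manual_seed_all(seed) would be called here as well."""
    random.seed(seed)
    np.random.seed(seed)

set_seed(42)  # 42 is an arbitrary placeholder, not the seed used in the paper
```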

B.2 Features of Input Graph.

Importantly, our goal is to learn hierarchical graph representations from the graph structure rather than relying on the input features. We do not choose dataset-specific input features to further improve classification accuracy; instead, we follow the experimental settings of the state-of-the-art frameworks. The input features in the bioinformatic datasets include the categorical node label, degree, and clustering coefficient, while in the social network datasets they contain only the degree information. The maximum node number of the input graphs is chosen to cover all graphs in PTC, IMDB-B and IMDB-M; a different value is used for D&D, PROTEINS, COLLAB and RDT-M12K.
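The structural features above can be computed directly from the graph, e.g. with networkx; this is a sketch under our own naming (the one-hot categorical label used for the bioinformatic datasets is omitted):

```python
import numpy as np
import networkx as nx

def structural_features(G: nx.Graph) -> np.ndarray:
    """Per-node features built only from graph structure:
    [degree, clustering coefficient]."""
    deg = dict(G.degree())
    clust = nx.clustering(G)
    return np.array([[deg[v], clust[v]] for v in G.nodes()], dtype=float)

G = nx.cycle_graph(5)          # toy graph: a 5-node cycle
X = structural_features(G)     # every node has degree 2 and clustering 0
```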

B.3 Implementation Details of MuchGCN.

Our proposed MuchGCN is built upon the intra-channel and inter-channel graph convolutions defined in the paper. The message passing step and the hidden dimension are fixed across datasets. A deeper architecture is used for the PROTEINS dataset, while the others achieve similar performance with fewer layers; the assign ratio is set separately for the shallow and the deep architectures. The non-linear functions are given by ReLU and MLP, respectively, and the readout function is realized by max pooling to produce the graph representation. Batch normalization and embedding normalization are applied after each graph convolution operation to make training more stable. We regularize the objective function by the entropy of the assignment matrix to make the cluster assignment sparse. Cross validation is applied to evaluate the performance of MuchGCN, whose average classification accuracy and standard deviation are reported. The Adam optimizer is adopted to train MuchGCN, and the gradient is clipped when its norm exceeds a threshold.
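As an illustration of the entropy regularizer and gradient clipping mentioned above, the following sketch (not the paper's code; shapes, learning rate, and clipping threshold are arbitrary placeholders) minimizes the mean row-wise entropy of a soft assignment matrix with Adam:

```python
import torch

torch.manual_seed(0)

def assignment_entropy(S: torch.Tensor) -> torch.Tensor:
    """Mean row-wise entropy of a soft cluster-assignment matrix S
    (rows sum to 1); adding it to the loss pushes each row toward a
    sparse, near one-hot assignment."""
    return (-S * torch.log(S + 1e-12)).sum(dim=-1).mean()

# Hypothetical shapes: 20 nodes softly assigned to 5 clusters.
logits = torch.randn(20, 5, requires_grad=True)
optimizer = torch.optim.Adam([logits], lr=0.01)

for _ in range(5):
    S = torch.softmax(logits, dim=-1)
    loss = assignment_entropy(S)  # the classification loss would be added here
    optimizer.zero_grad()
    loss.backward()
    # Clip the gradient when its norm exceeds a threshold (2.0 is a placeholder).
    torch.nn.utils.clip_grad_norm_([logits], max_norm=2.0)
    optimizer.step()
```

In the full model the entropy term is added to the classification loss with a weighting coefficient, so the assignment matrix stays differentiable while being encouraged toward hard cluster memberships.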

B.4 Implementation Details of Baselines.

For the baselines GCN and DIFFPOOL-GCN, we use the source code provided by the authors of DIFFPOOL [26]. The running environment follows the suggestions in that publication. For the other baseline methods, we directly cite the results from the corresponding publications.

Appendix C Proof for Proposition 1

Proof. In this section, we provide a proof analyzing how the embedding equation in the paper assigns a unique embedding to each node. To facilitate the analysis, we omit the layer notation used in the paper. The embedding multiset of a node collects the input feature together with the embedding features aggregated from the current and the neighboring channels, respectively. The size of this multiset is determined by the message passing step and the channel number.

According to the embedding equation in the paper, the final embedding of a node is given by:


where each filter element is a scalar weighting the corresponding vector of the multiset, and the outer non-linear function is realized by an MLP.

We need to prove that the embedding of a node is unique if its multiset is unique. Assume the set constructed by all node embeddings is countable. According to the corollary in [24], there exists a function whose sum over a multiset takes a unique value for each unique multiset. Suppose that the filter is normalized so that its elements sum to one. We can then reformulate Equation (7) as a sum over the scaled elements of the multiset. Thanks to the universal approximation theorem, we can model and learn this function via the non-linear function implemented by an MLP. Since an MLP can represent the composition of two consecutive MLPs, we have the following equivalence in generating the embedding:
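As a reading aid, the reformulation above can be written out in explicit notation; the symbols below are hypothetical placeholders reconstructed from context, since the extracted text dropped the original notation:

```latex
% Hypothetical notation: X_v = \{x_1,\dots,x_K\} is the embedding multiset
% of node v, w = (w_1,\dots,w_K) the normalized filter with \sum_k w_k = 1,
% b a bias term, and \phi the non-linear map realized by an MLP.
h_v \;=\; \phi\Big(\sum_{k=1}^{K} w_k \,(x_k + b)\Big)
    \;=\; \phi\Big(\sum_{k=1}^{K} \tilde{x}_k\Big),
\qquad \tilde{x}_k := w_k\,(x_k + b),
```

so the uniqueness argument reduces to showing that the scaled multiset of the \(\tilde{x}_k\) remains unique whenever the original multiset is.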


Based on the corollary in [24], to prove that the embedding is unique for each unique multiset, we first need to show that the new set composed of the scaled embeddings is countable. It is obvious that the set obtained by adding the bias to each element of the original set is still countable, and the set obtained by scaling each element with a filter weight is also countable. In consequence, the new set is countable, since it is constructed by the following union:


Following the above analysis, the embedding is unique only if the scaled multiset is still unique. Given different multisets of two nodes, we need to show that the corresponding scaled multisets are still different. In the following, we provide a proof by contradiction.

Suppose that the two scaled multisets are the same. Then there exist matched pairs of elements that satisfy the element-wise equality condition; for each element in one scaled multiset, there exists a matched element in the other, so the condition needs to hold for all matched pairs. However, there exist infinitely many filters for which this condition cannot hold. Consider the following three cases for a matched pair. In the first case, the condition reduces to an equality that is obviously not satisfied by any choice of filter. In the second case, the condition requires that some embeddings during the message passing steps equal a fixed scalar, which is generally not satisfied since node embeddings change after aggregating neighbor features at each step. In the third case, the condition requires all elements of an embedding vector to take the same value, which is also generally impossible. The two scaled multisets can therefore be the same only if these conditions hold for all matched pairs, which contradicts the assumption that the original multisets are different. In consequence, we reach the result that the scaled multisets are still different.

Given the node embedding generated by Equation (8), we conclude that the embedding of a node is unique if its multiset is unique.

Figure 3: Average running time of MuchGCN for each epoch.

Appendix D Running Time Analysis

Given the time complexity analysis in the paper, we evaluate the running time of MuchGCN in the environment described above. We study how the running time varies with the message passing step and the channel expansion; the neural architecture of MuchGCN follows the settings described in Appendix B. The average running time of each epoch is shown in Figure 3.

The running time of MuchGCN increases almost linearly with the message passing step and the channel expansion. This experimental result is consistent with our analysis in the paper.


  1. S. Abu-El-Haija, A. Kapoor, B. Perozzi and J. Lee (2018) N-gcn: multi-scale graph convolution for semi-supervised node classification. arXiv preprint arXiv:1802.08888. Cited by: §1.
  2. J. Atwood and D. Towsley (2016) Diffusion-convolutional neural networks. In NeurIPS, pp. 1993–2001. Cited by: §4.1.2.
  3. A. Taheri and T. Berger-Wolf (2018) Learning graph representations with recurrent neural network autoencoders. In KDD’18 Deep Learning Day, Cited by: §1.
  4. K. M. Borgwardt, C. S. Ong, S. Schönauer, S. Vishwanathan, A. J. Smola and H. Kriegel (2005) Protein function prediction via graph kernels. Bioinformatics 21 (suppl_1), pp. i47–i56. Cited by: §4.1.1.
  5. J. Bruna, W. Zaremba, A. Szlam and Y. LeCun (2013) Spectral networks and locally connected networks on graphs. arXiv preprint arXiv:1312.6203. Cited by: §1.
  6. C. Cangea, P. Veličković, N. Jovanović, T. Kipf and P. Liò (2018) Towards sparse hierarchical graph classifiers. arXiv preprint arXiv:1811.01287. Cited by: §1.
  7. P. D. Dobson and A. J. Doig (2003-08) Distinguishing enzyme structures from non-enzymes without alignments. Journal of Molecular Biology 330, pp. 771–783. External Links: Document Cited by: §4.1.1.
  8. H. Dai, B. Dai and L. Song (2016) Discriminative embeddings of latent variable models for structured data. In ICML, pp. 2702–2711. Cited by: §1.
  9. D. K. Duvenaud, D. Maclaurin, J. Iparraguirre, R. Bombarell, T. Hirzel, A. Aspuru-Guzik and R. P. Adams (2015) Convolutional networks on graphs for learning molecular fingerprints. In Advances in neural information processing systems, pp. 2224–2232. Cited by: §1.
  10. A. Feragen, N. Kasenburg, J. Petersen, M. de Bruijne and K. M. Borgwardt (2013) Scalable kernels for graphs with continuous attributes. In NeurIPS, Cited by: §4.1.1.
  11. H. Gao and S. Ji (2019) Graph u-net. External Links: Link Cited by: §1.
  12. X. Geng, Y. Li, L. Wang, L. Zhang, Q. Yang, J. Ye and Y. Liu (2019) Spatiotemporal multi-graph convolution network for ride-hailing demand forecasting. In 2019 AAAI Conference on Artificial Intelligence (AAAI’19), Cited by: §1.
  13. W. Hamilton, Z. Ying and J. Leskovec (2017) Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, pp. 1024–1034. Cited by: §4.1.2.
  14. S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. International Conference on Machine Learning. Cited by: §4.1.3.
  15. K. Kersting, N. M. Kriege, C. Morris, P. Mutzel and M. Neumann (2016) Benchmark data sets for graph kernels. External Links: Link Cited by: §4.1.1.
  16. T. N. Kipf and M. Welling (2017) Semi-supervised classification with graph convolutional networks. ICLR. Cited by: §2.1, §4.1.2.
  17. C. Morris, M. Ritzert, M. Fey, W. L. Hamilton, J. E. Lenssen, G. Rattan and M. Grohe (2018) Weisfeiler and leman go neural: higher-order graph neural networks. arXiv preprint arXiv:1810.02244. Cited by: §4.1.2.
  18. M. Niepert, M. Ahmed and K. Kutzkov (2016) Learning convolutional neural networks for graphs. In International conference on machine learning, pp. 2014–2023. Cited by: §4.1.2.
  19. N. Shervashidze, P. Schweitzer, E. J. v. Leeuwen, K. Mehlhorn and K. M. Borgwardt (2011) Weisfeiler-lehman graph kernels. Journal of Machine Learning Research 12 (Sep), pp. 2539–2561. Cited by: §1.
  20. N. Shervashidze, S. Vishwanathan, T. Petri, K. Mehlhorn and K. Borgwardt (2009) Efficient graphlet kernels for large graph comparison. In Artificial Intelligence and Statistics, pp. 488–495. Cited by: §4.1.2.
  21. M. Simonovsky and N. Komodakis (2017) Dynamic edge-conditioned filters in convolutional neural networks on graphs. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3693–3702. Cited by: §4.1.2.
  22. A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser and I. Polosukhin (2017) Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008. Cited by: §1.
  23. P. Velickovic, G. Cucurull, A. Casanova, A. Romero, P. Lio and Y. Bengio (2017) Graph attention networks. arXiv preprint arXiv:1710.10903 1 (2). Cited by: §1.
  24. K. Xu, W. Hu, J. Leskovec and S. Jegelka (2018) How powerful are graph neural networks?. CoRR abs/1810.00826. External Links: 1810.00826 Cited by: Appendix C, Appendix C, §2.1, §3.1.1.
  25. P. Yanardag and S. V. N. Vishwanathan (2015) A structural smoothing framework for robust graph comparison. In Advances in Neural Information Processing Systems, Cited by: §4.1.1.
  26. Z. Ying, J. You, C. Morris, X. Ren, W. Hamilton and J. Leskovec (2018) Hierarchical graph representation learning with differentiable pooling. In NeurIPS, pp. 4805–4815. Cited by: §B.4, §1, §2.2, §4.1.2, §4.1.3, §4.2.4.
  27. M. Zhang, Z. Cui, M. Neumann and Y. Chen (2018) An end-to-end deep learning architecture for graph classification. In Association for the Advancement of Artificial Intelligence, Cited by: §4.1.2.
  28. X. Zhang, L. He, K. Chen, Y. Luo, J. Zhou and F. Wang (2018) Multi-view graph convolutional network and its applications on neuroimage analysis for parkinson’s disease. In AMIA Annual Symposium Proceedings, Vol. 2018, pp. 1147. Cited by: §1.