Multi-Channel Graph Convolutional Networks
Abstract
Graph neural networks (GNNs) have been demonstrated to be effective in classifying graph structures. To further improve graph representation learning, hierarchical GNNs have been explored. They leverage differentiable pooling to cluster nodes into fixed groups, generating a coarse-grained structure alongside the shrinking of the original graph. However, such clustering discards some graph information and achieves suboptimal results, because each node inherently has different characteristics or roles, and two non-isomorphic graphs may share the same coarse-grained structure and become indistinguishable after pooling. To compensate for the loss caused by coarse-grained clustering and further advance GNNs, we propose multi-channel graph convolutional networks (MuchGCN). It is motivated by convolutional neural networks, in which a series of channels is encoded to preserve the comprehensive characteristics of the input image. Accordingly, we define specific graph convolutions to learn a series of graph channels at each layer, and pool graphs iteratively to encode the hierarchical structures. Experiments have been carefully carried out to demonstrate the superiority of MuchGCN over state-of-the-art graph classification algorithms.
1 Introduction
Classifying graph-structured data has become an important problem in various domains, such as biological graph analysis [3]. Among the numerous graph classification techniques, graph neural networks (GNNs) [5, 23, 22] tend to achieve superior performance. The core idea is to update each node's embedding iteratively by aggregating the representations of its neighbors and itself. The graph representation is then generated through a global pooling layer over all the nodes [19, 9, 8], which encodes the input graph flatly.
To further improve performance, hierarchical GNNs have been proposed to encode both the local and coarse-grained structures of the input graph [26, 11, 6]. Taking a social network as an example, the local structure is represented by individual nodes and direct links; after clustering, the coarse-grained structure is constructed from the communities and their correlations. The motivation is the inherent hierarchy of graph-structured data, analogous to the different resolutions of an image. Accordingly, pooling modules are leveraged to cluster nodes and generate a coarse-grained graph at the next layer, and GNNs are stacked to encode the hierarchical graphs. While the pooling module obtains hierarchical representations of the input graph, the accompanying information loss is problematic for the task of classifying graphs. First, two non-isomorphic graphs may be pooled into the same graph at a higher layer of the model; similar graph representations would then be learned, making them indistinguishable. Second, only one coarse-grained graph is generated at each layer, which ignores multi-view poolings of the input graph. For example, individual social nodes generally have multiple characteristics, and they could be clustered into communities in different ways.
Recently, multi-graph GNNs have been proposed to learn the various node characteristics [1, 12, 28]. They duplicate a series of instances of the input graph and run GNNs on them independently to encode instance-specific characteristics.
To tackle the above problems, we propose a hierarchical framework able to encode a series of coarse-grained graphs layer by layer. It is comparable to convolutional neural networks (CNNs), where the pooling layers and convolutional filters work together to operate on the image channels hierarchically. A grid-like image can be regarded as a special type of graph-structured data in which each pixel is represented by a node. A pixel has a fixed-size neighborhood patch, e.g., its directly adjacent pixels, and these neighbors have a determined order from the upper left to the lower right. However, there are two challenges in building such a deep neural network for graph-structured data. First, across a graph, nodes have varying numbers of neighbors with no canonical order. The local convolutions of CNNs cannot be directly applied to learn node characteristics, since they are predefined with the shape and order of local neighbors. Second, the coarse-grained graphs shown in Figure 1 have different adjacency structures: the nodes and edges of one channel cannot be mapped to those of the others. This prevents us from using a channel-wise convolutional filter to add one graph on top of the others to aggregate their features; such a filter can only sum up images of the same grid-like shape.
To address the above challenges, we develop the multi-channel graph convolutional networks (MuchGCN) shown in Figure 1. Specifically, the design can be broken down into the following two research questions: (i) how to define convolutional filters that learn the various node characteristics of graph-structured data; (ii) how to define graph convolutions that combine distinct coarse-grained graphs. In summary, our major contributions are described below.

We propose a new graph representation learning architecture in which a series of coarse-grained graphs is encoded hierarchically.

We design a convolutional filter that learns a series of graph channels without relying on the shape or order of a node's neighbors.

We define inter-channel graph convolutions that aggregate the graph channels via message passing.

Experiments show that the graph classification accuracy of MuchGCN is superior to that of state-of-the-art baselines.
2 Preliminaries
The goal of graph classification is to map graphs into a set of labels. Let $G = (A, X)$ denote a directed or undirected graph consisting of $n$ nodes, where $A \in \{0, 1\}^{n \times n}$ denotes the adjacency matrix, and $X \in \mathbb{R}^{n \times d}$ denotes the feature matrix in which each row represents the $d$-dimensional feature vector of a node. Given a set of graphs $\{G_1, \ldots, G_N\}$ and the corresponding labels $\{y_1, \ldots, y_N\}$, the challenge is to extract informative graph representations that facilitate the graph classification mapping $f: G \rightarrow y$.
2.1 Graph Neural Networks.
GNNs use the adjacency structure and node features to learn node embeddings. A general message-passing GNN can be expressed as [24]:

$$H^{(k)} = \mathrm{ReLU}\big( (A + I)\, H^{(k-1)}\, W^{(k)} \big) \qquad (1)$$

where $H^{(k)} \in \mathbb{R}^{n \times d}$ denotes the hidden node embeddings after $k$ steps of graph convolutions, $W^{(k)}$ denotes the trainable linear transformation matrix, and ReLU is the activation function. We have $H^{(0)} = X$. Each node embedding is updated by aggregating the representations of the node's neighbors and itself, which matches the graph convolutional network (GCN) except for the normalization of the adjacency matrix [16]. After $K$ steps of message passing, we reach the neighbors that are at most $K$ hops away from the central node. For simplicity of expression, we denote a GNN with $K$ steps of message passing as $\mathrm{GNN}(A, X)$. For the task of graph classification, the graph representation is generated by globally pooling the node embeddings in matrix $H^{(K)}$.
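As a concrete illustration, one message-passing step of Equation (1) can be sketched with NumPy. The function and variable names here are our own, and the unnormalized aggregation $(A + I)$ follows the description above:

```python
import numpy as np

def gnn_layer(A, H, W):
    """One message-passing step of Equation (1): aggregate each node's
    neighbors and itself via (A + I), transform linearly, apply ReLU."""
    A_hat = A + np.eye(A.shape[0])          # add self-loops
    return np.maximum(A_hat @ H @ W, 0.0)   # ReLU activation

# Toy 3-node path graph with 2-dimensional node features.
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
X = np.array([[1., 0.],
              [0., 1.],
              [0., 0.]])                    # H^(0) = X
W = np.full((2, 2), 0.5)                    # trainable transformation
H1 = gnn_layer(A, X, W)                     # H^(1), shape (3, 2)
```

Stacking $K$ such calls, each with its own weight matrix, reproduces $K$ steps of message passing.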
2.2 Differentiable Pooling.
A major limitation of Equation (1) is that it encodes only the flat structure of the input graph. Recently, differentiable pooling (DIFFPOOL) [26] was proposed to cluster the nodes of the input graph gradually and generate hierarchical coarse-grained graphs layer by layer. Formally, let $n_l$ and $n_{l+1}$ denote the numbers of nodes of the coarse-grained graphs at layers $l$ and $l+1$, respectively. Generally $n_{l+1} < n_l$, in order to obtain more abstract graphs at the higher layers of the model. Let $S^{(l)} \in \mathbb{R}^{n_l \times n_{l+1}}$ and $Z^{(l)} \in \mathbb{R}^{n_l \times d}$ denote the cluster matrix and the node embeddings learned at layer $l$, respectively. The DIFFPOOL module clusters the graph from layer $l$ and generates a coarser one at layer $l+1$ as follows:

$$A^{(l+1)} = S^{(l)\top} A^{(l)} S^{(l)}, \qquad X^{(l+1)} = S^{(l)\top} Z^{(l)} \qquad (2)$$

where $A^{(l)}$ and $X^{(l)}$ denote the adjacency matrix and the node features of the graph at layer $l$, respectively. To prepare the cluster matrix and the node embeddings at layer $l$, two GNN modules are used: $S^{(l)} = \mathrm{softmax}\big(\mathrm{GNN}_{pool}(A^{(l)}, X^{(l)})\big)$ and $Z^{(l)} = \mathrm{GNN}_{embed}(A^{(l)}, X^{(l)})$. The softmax function is applied in row-wise fashion to determine the probability of assigning each node at layer $l$ to each cluster at layer $l+1$. Given the coarse-grained graphs of all layers, GNNs are stacked to encode the inherently hierarchical structures of the input graph.
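A minimal sketch of the DIFFPOOL coarsening step, assuming the standard formulation from [26] with a row-wise softmax assignment; the names are illustrative:

```python
import numpy as np

np.random.seed(0)

def softmax_rows(M):
    """Row-wise softmax, so each node's cluster assignments sum to 1."""
    e = np.exp(M - M.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def diffpool(A, Z, S_logits):
    """One DIFFPOOL coarsening step (Equation (2)):
    A' = S^T A S gives the coarse adjacency, X' = S^T Z the pooled features."""
    S = softmax_rows(S_logits)
    A_next = S.T @ A @ S
    X_next = S.T @ Z
    return A_next, X_next, S

A = np.array([[0., 1., 0., 0.],
              [1., 0., 1., 0.],
              [0., 1., 0., 1.],
              [0., 0., 1., 0.]])   # 4-node path graph
Z = np.random.rand(4, 3)           # node embeddings from an embedding GNN
S_logits = np.random.rand(4, 2)    # assignment scores: 4 nodes -> 2 clusters
A2, X2, S = diffpool(A, Z, S_logits)
```

In the full model the logits `S_logits` come from a dedicated pooling GNN rather than being free parameters.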
2.3 MultiGraph Learning.
In real-world graph-structured data, nodes have various roles or characteristics, and they have different types of correlations. Under this prior knowledge, multi-graph GNNs learn the multiple characteristics of nodes that could be informative for representation learning. Formally, the input graph with feature matrix $X$ is first duplicated into a series of graph instances. For graph instance $i$, a specific adjacency matrix $A_i$ is formulated to capture one class of node characteristics and correlations. Based on Equation (1), the graph convolutions learn the node embeddings of each graph instance independently: $H_i^{(k)} = \mathrm{ReLU}\big( (A_i + I)\, H_i^{(k-1)}\, W^{(k)} \big)$. The final node representations are obtained by globally pooling the set of node embeddings learned at the different graph instances.
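The multi-graph scheme can be sketched as follows, under the assumption that the per-instance embeddings are combined by a simple mean pooling (the exact pooling may differ across methods):

```python
import numpy as np

np.random.seed(0)

def multi_graph_gnn(A_list, X, W_list):
    """Multi-graph learning sketch: run an independent GNN step over each
    manually constructed graph instance A_i, then pool the results."""
    H_list = [np.maximum((A + np.eye(len(A))) @ X @ W, 0.0)
              for A, W in zip(A_list, W_list)]
    return np.mean(H_list, axis=0)     # pool over graph instances

n, d = 4, 3
# Two graph instances; here both are path graphs for simplicity.
A_list = [np.eye(n, k=1) + np.eye(n, k=-1) for _ in range(2)]
X = np.random.rand(n, d)
W_list = [np.random.rand(d, d) for _ in range(2)]
H = multi_graph_gnn(A_list, X, W_list)   # final node representations (4, 3)
```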
3 Multichannel Graph Convolutional Networks
Differentiable pooling has its own bottlenecks in graph representation learning. On the one hand, the input graph is distorted gradually after each layer of pooling, and it may become difficult to distinguish heterogeneous graphs at the higher layers of the model. On the other hand, the pooling discards the inherently diverse graph information that could be informative for graph classification. One promising solution is to learn the various node characteristics and preserve as many coarse-grained structures as possible, in order to compensate for the information loss of pooling. However, multi-graph learning directly duplicates the input graph and ignores its hierarchical structures. The adjacency matrix of each graph instance is formulated manually to encode one of the node characteristics, which prevents us from learning the hidden representations adaptively for a specific task.
To further improve graph representation learning, we propose the new framework named MuchGCN, shown in Figure 1. It mimics the neural architecture of CNNs for graph-structured data, where a series of graph channels is learned hierarchically. Before elaborating our framework, we first define two key concepts for consistency of presentation:
Definition 1. Layer: A layer is composed of the operations of graph convolutions and feature learning, as shown in Figure 1. Let $l$ denote the index of a layer. The input to layer $l$ is a set of graphs, while the output is a series of graphs associated with the learned node embeddings.
Definition 2. Channel: Given a specific layer, a channel represents one input graph, indexed by $i$. As shown in Figure 1, layer $0$ consists of one channel, and layer $1$ consists of two channels.
3.1 Proposed Method.
Compared with CNNs, the challenges of building up MuchGCN lie in the following two facts. First, nodes across a graph have different numbers of direct neighbors. For example, the upper-left node of the input graph in Figure 1 has only one neighbor, while the others have at least two. There is also no determined order among the neighbors of a node. In graph-structured data, a grid-like local filter therefore cannot learn a node's characteristics directly from its neighborhood shape. Second, the coarse-grained graph channels at a layer have differently shaped adjacency structures, and it is hard to map the nodes and edges of one channel to those of another. In consequence, the series of graph channels cannot be stacked and pooled together in node-wise and edge-wise fashion. This precludes the straightforward way of aggregating the features of the various graph channels to generate a new channel at the next layer, as in CNNs.
We address the above challenges by carefully designing two key components in MuchGCN: (i) convolutional filters defined on the steps of message passing, instead of the direct neighbors; (ii) inter-channel graph convolutions that pass messages among the graph channels to aggregate their features. As shown in Figure 1, we first describe how MuchGCN learns node characteristics in the single channel at layer $0$, and then how it operates graph convolutions on the multiple channels at the higher layers.
Single-channel Learning Process.
We consider the graph at the single channel of layer $0$, given by the input graph with $A^{(0)} = A$ and $X^{(0)} = X$. The graph convolution stage applies GNNs to generate embeddings iteratively. The node embeddings after $k$ steps of message passing at this channel are given by:
$$H^{(k)} = \mathrm{ReLU}\big( (A^{(0)} + I)\, H^{(k-1)}\, W^{(k)} \big) \qquad (3)$$
where $W^{(k)}$ denotes the trainable parameters of the $k$-th message passing at layer $0$. Note that embedding $H^{(k)}$ represents the neighborhood structure of height $k$. Based on the Weisfeiler-Lehman (WL) algorithm [24], two non-isomorphic graphs can be distinguished if their node embeddings differ at any step $k$. In the context of graph representation learning, we consider the embedding multiset $\{X, H^{(1)}, \ldots, H^{(K)}\}$ that consists of the input features and all intermediate node embeddings. The feature learning stage learns the node characteristics via a set of trainable filters, and at the same time generates a series of graphs to improve the graph classification ability. Formally, filter $i$ learns a new graph associated with a new embedding matrix as follows:
$$X_{(0,i)} = \mathrm{MLP}\Big( \textstyle\sum_{k=0}^{K} w_i^{(k)} \odot H^{(k)} \Big) \qquad (4)$$
where the index tuple $(0, i)$ denotes the $i$-th graph newly generated from channel $0$, and $w_i^{(k)}$ is a trainable scalar. $\mathrm{MLP}$, $\sum$ and $\odot$ denote the nonlinear multilayer perceptron, the summation function and element-wise multiplication, respectively. Following the same process of graph convolution and feature learning, we also obtain the cluster matrices for the two new graphs shown at layer $0$ in Figure 1.
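The feature learning stage can be sketched as follows, assuming each filter is a vector of trainable scalars, one per element of the embedding multiset, followed by a single-hidden-layer MLP; the names are our own:

```python
import numpy as np

np.random.seed(0)

def relu(x):
    return np.maximum(x, 0.0)

def feature_learning(H_multiset, w, W_mlp):
    """One convolutional filter (a sketch of Equation (4)): scale each
    embedding H^(k) in the multiset by a trainable scalar w_k, sum them,
    and pass the result through an MLP to get a new graph embedding."""
    mixed = sum(w_k * H_k for w_k, H_k in zip(w, H_multiset))
    return relu(mixed @ W_mlp)   # one-layer MLP stands in for the full MLP

n, d = 5, 4
H_multiset = [np.random.rand(n, d) for _ in range(3)]  # {X, H^(1), H^(2)}
w = np.array([0.2, 0.5, 0.3])                          # filter scalars
W_mlp = np.random.rand(d, d)
X_new = feature_learning(H_multiset, w, W_mlp)         # new channel embedding
```

A set of such filters, each with its own scalars, yields the series of new graphs described above.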
Based on the graph pooling defined in Equation (2), the channels at the next layer are generated from the learned embeddings. As shown in Figure 1, the two channels at layer $1$ encode different coarse-grained structures of the input graph. In addition, there exist adjacency connections between these two channels. Formally, the inter-channel adjacency matrix between channels $i$ and $j$ is given by $A_{ij} = S_i^{\top} A\, S_j$, where $S_i$ and $S_j$ are the cluster matrices of the two channels. The rows and columns of $A_{ij}$ represent the nodes at channels $i$ and $j$, respectively.
Multi-channel Learning Process.
Unlike layer $0$, the input to a higher layer is given by a series of channels. They represent the different characteristics of the input graph and have various coarse-grained structures. Features from this series of channels must be aggregated to generate more abstract representations at the higher layers of the model. In this section, we introduce the novel graph convolutions, which update the node embeddings at one channel by additionally exploiting information from the others.
Considering channel $i$, the graph convolutions include both intra-channel and inter-channel ones, receiving information from the current channel and a neighboring channel $j$, respectively. The intra-channel graph convolution is given by Equation (3), with the channel and layer indexes replaced accordingly. Based on Equation (3), the inter-channel graph convolution is defined as follows:
$$\tilde{H}_{ij}^{(k)} = \mathrm{ReLU}\big( (A_{ij}\, H_j^{(k-1)} + \tilde{H}_{ij}^{(k-1)})\, W^{(k)} \big) \qquad (5)$$
Here $\tilde{H}_{ij}^{(k)}$ denotes the node embeddings of channel $i$ after $k$ steps of feature aggregation from neighboring channel $j$, and we define $\tilde{H}_{ij}^{(0)} = X_i$. $A_{ij}$ denotes the inter-channel adjacency matrix between channels $i$ and $j$ at the current layer. Compared with the intra-channel convolutions, we replace the adjacency matrix and the neighbor embeddings by $A_{ij}$ and $H_j^{(k-1)}$, respectively. In this way, the messages of neighboring channels are passed to update the embeddings at channel $i$, even though the channels have differently shaped adjacency structures.
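The inter-channel adjacency construction and one inter-channel message-passing step can be sketched as follows; this is a simplified reading of Equation (5), and the names are our own:

```python
import numpy as np

np.random.seed(0)

def inter_channel_adjacency(S_i, S_j, A):
    """Connect two coarse-grained channels pooled from the same parent
    graph: rows index clusters of channel i, columns clusters of channel j."""
    return S_i.T @ A @ S_j

def inter_channel_conv(A_ij, H_j, H_prev, W):
    """One inter-channel message-passing step: channel i aggregates the
    embeddings of its neighbors in channel j via A_ij."""
    return np.maximum((A_ij @ H_j + H_prev) @ W, 0.0)

n, d = 6, 4
A = (np.random.rand(n, n) > 0.5).astype(float)
A = np.triu(A, 1); A = A + A.T                # symmetric parent adjacency
S_i = np.random.rand(n, 3)                    # pooling into 3 clusters
S_j = np.random.rand(n, 2)                    # pooling into 2 clusters
A_ij = inter_channel_adjacency(S_i, S_j, A)   # shape (3, 2)
H_i = np.random.rand(3, d)                    # embeddings of channel i
H_j = np.random.rand(2, d)                    # embeddings of channel j
W = np.random.rand(d, d)
H_i_new = inter_channel_conv(A_ij, H_j, H_i, W)
```

Note that the update works even though the two channels have different node counts, which is exactly the point of Equation (5).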
Following the graph convolutions, the feature learning stage learns the node characteristics based on Equation (4). Different from the embedding multiset at layer $0$, the multiset at channel $i$ of a higher layer additionally includes the embeddings $\tilde{H}_{ij}^{(k)}$ that aggregate features from all the neighboring channels $j \neq i$. Given the set of filters, Equation (4) encodes the various characteristics and obtains a series of new graphs from channel $i$. By repeating this process for every channel, we obtain all the graphs and their associated embeddings at the next layer, as shown in Figure 1.
Multi-channel Graph Convolutional Networks.
We stack $L$ layers of graph convolutions and feature learning in MuchGCN. Let $n_l$ and $c_l$ denote the number of nodes per graph and the number of channels at layer $l$, respectively. We define the assign ratio $r = n_{l+1}/n_l$ and the channel expansion $e = c_{l+1}/c_l$. Generally, $r < 1$ is satisfied to generate more coarse-grained graphs, and $e \geq 1$ to learn the various characteristics of nodes. At each layer $l$, we generate a series of graphs whose node embeddings are given by $H_i^{(l)}$, $i = 1, \ldots, c_l$. The graph representation learned at layer $l$ can be obtained by combining the generated graphs as follows:
$$h^{(l)} = \sum_{i=1}^{c_l} \mathrm{Readout}\big( H_i^{(l)} \big) \qquad (6)$$
where $\mathrm{Readout}$ denotes the global pooling function that reads out a graph representation. The entire graph representation is generated by concatenating $h^{(l)}$ from all the layers: $h_G = [h^{(0)}; \ldots; h^{(L-1)}]$. It encodes both the local and the hierarchically coarse-grained structures of the input graph. Given $h_G$ as input, a downstream differentiable classifier, such as an MLP, is applied to predict the corresponding graph label.
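The readout and concatenation can be sketched as follows, assuming max pooling within each graph and a summation across the channels of a layer (the exact combination used in the model may differ):

```python
import numpy as np

np.random.seed(0)

def layer_readout(channel_embeddings):
    """Layer-wise readout sketch: max-pool the nodes of every graph
    generated at a layer, then sum the per-channel summaries."""
    return sum(H.max(axis=0) for H in channel_embeddings)

# Two layers with different node and channel counts but a shared
# embedding dimension d, so the per-layer readouts can be concatenated.
d = 4
layer1 = [np.random.rand(8, d) for _ in range(2)]   # 2 channels, 8 nodes each
layer2 = [np.random.rand(4, d) for _ in range(4)]   # 4 channels, 4 nodes each
graph_repr = np.concatenate([layer_readout(layer1), layer_readout(layer2)])
```

The concatenated vector `graph_repr` plays the role of $h_G$, the input to the downstream classifier.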
3.2 Theoretical Analysis.
We analyze how the various coarse-grained structures are produced by learning the different characteristics of nodes. Considering a channel at a given layer, we prepare the embedding multiset based on the intra-channel and inter-channel graph convolutions. Equation (4) then encodes a specific characteristic via a filter and generates a graph with associated embeddings. Note that the representation of a node is given by the corresponding row of the embedding matrix; each node therefore has its own embedding multiset, along with its embedding in the generated graph.
Proposition 1: Assume the set constructed from all $d$-dimensional node embeddings is countable. There exist an MLP function and infinitely many scalars in Equation (4) such that a node receives a unique embedding if the following conditions are satisfied:
a) The node has a unique embedding multiset.
b) The trainable filter .
A proof is provided in Section C of the Appendix. A node characteristic is computationally represented by the convolution between the embedding multiset and a filter. Diverse nodes are assigned different embeddings based on the above proposition. In the pooling module, nodes are clustered together only if they are similar with respect to the specific characteristic. Using a set of filters, we can learn the various characteristics of nodes and obtain a series of coarse-grained graphs.
Complexity Analysis. Considering one channel at a layer, we analyze the time complexity of learning the node characteristics based on Equation (4). First, the node embeddings within the multiset need to be prepared according to Equations (3) and (5). Let $m$ denote the maximum number of edges within a channel or between two channels. Since the adjacency matrices are usually sparse, the complexities of Equations (3) and (5) are linear in $m$. With $K$ steps of message passing in the graph convolutions and the neighboring channels to be aggregated, the complexity of obtaining the embedding multiset scales with the product of $K$ and the channel number $c$. Second, the element-wise multiplication in Equation (4) adds a cost linear in the number of nodes. Overall, the time complexity of feature learning increases linearly with the product of the step number $K$ and the channel number $c$. We provide a running-time analysis in the Appendix.
4 Experiments
We evaluate our MuchGCN on the task of graph classification to answer the following three questions:

Q1: How does MuchGCN perform compared with other state-of-the-art models?

Q2: How do the multiple channels in MuchGCN help improve the graph representation learning ability and the classification accuracy?

Q3: How do the important hyperparameters in MuchGCN affect network performance?
4.1 Experiment Settings
Datasets.
Baselines.
We compare MuchGCN with three classes of state-of-the-art baselines: (1) kernel methods, including WL subtree [17] and GRAPHLET [20]; (2) flat GNNs, including GCN [16], GRAPHSAGE [13], PATCHY-SAN [18], DCNN [2], DGCNN [27] and ECC [21]; (3) the hierarchical GNN DIFFPOOL. In the flat GNNs, the graph representation is produced via global pooling or via 1D convolutions over ordered nodes. DIFFPOOL stacks GRAPHSAGE hierarchically to learn the coarse-grained structures [26]. We additionally implement a DIFFPOOL framework built on GCN and compare with both variants. The graph classification performances of GCN and the GCN-based DIFFPOOL are obtained by running the models in the same environment as MuchGCN; the others are reported directly from the corresponding publications.
Implementation Details.
MuchGCN is built upon the intra-channel and inter-channel graph convolutions shown in Equations (3) and (5), with fixed settings for the message passing step and the hidden dimension, and with assign ratios set separately for the shallower and deeper architectures. The readout function is given by max pooling over the nodes. Batch normalization [14] and $\ell_2$ normalization are applied after each step of graph convolutions to make training more stable. We regularize the objective function with the entropy of the cluster matrix to make the cluster assignments sparser [26]. The Adam optimizer is adopted to train MuchGCN, and the gradient is clipped when its norm exceeds a threshold. We evaluate MuchGCN with cross validation, reporting the average classification accuracy and standard deviation over the folds; the model is trained for a fixed number of epochs on each fold. Three variants of MuchGCN are considered:

MuchGCN-M: a tailored MuchGCN framework that only learns the multiple characteristics of nodes. The channel expansion is greater than one, so that a set of convolutional filters learns a series of new graphs based on Equation (4), while the layer number is set so that the pooling module is removed. It encodes the input graph like a multi-graph GNN.

MuchGCN-H: a tailored framework that only learns the hierarchical architecture, with the channel expansion fixed to one so that MuchGCN encodes a single coarse-grained structure at each layer, like DIFFPOOL. A deeper architecture is used for the PROTEINS dataset, while the other datasets show similar performance at smaller depths.

MuchGCN-MH: the complete MuchGCN framework, which learns the multiple characteristics and the hierarchical architecture simultaneously. It has the same channel expansion as MuchGCN-M and the same layer number as MuchGCN-H.
4.2 Graph Classification Results
Model Comparison.
Table 1 compares the graph classification accuracy of MuchGCN-MH with those of all the baselines, providing a positive answer to Q1. We observe that MuchGCN-MH achieves state-of-the-art classification performance on most of the benchmarks, with consistent average improvements over WL subtree, GCN, DIFFPOOL-GCN, and the variants MuchGCN-H and MuchGCN-M. In particular, it outperforms DIFFPOOL-GCN significantly on the REDDIT-MULTI-12K dataset. This is expected, because the baseline methods fall short of the graph classification task in the following respects. First, the kernel methods predefine substructure features to measure the input graph manually, failing to learn representative hidden embeddings. Second, flat GNNs obtain the graph representation with a simple global pooling layer, which is problematic for graph-structured data that is inherently multi-characteristic and hierarchical. Third, the hierarchical networks DIFFPOOL and MuchGCN-H pool the input graph gradually and generate a single coarse-grained structure at each layer; the pooling module loses detailed graph information at the higher layers and makes it hard to distinguish heterogeneous graphs. Finally, although the multi-graph variant MuchGCN-M exploits the various characteristics of nodes, it is a shallow model that cannot reach an abstract expression of the input graph.
MuchGCN-MH successfully encodes both the multiple characteristics and the hierarchical structures of the input graph, bridging the gap between the hierarchical and multi-graph frameworks. On the one hand, pooling modules are stacked to build up a hierarchical model, and the graph convolutions operate on the coarse-grained graphs to learn abstract representations. On the other hand, the convolutional filters learn the various characteristics of nodes; by pooling the nodes in different ways, we generate a series of coarse-grained graphs at the next layer. This helps preserve the information of the input graph to a large extent.
Methods  Datasets  

PTC  DD  PROTEINS  COLLAB  IMDBB  IMDBM  RDTM12K  
WL subtree  
GRAPHLET  
GCN  
GRAPHSAGE  
PATCHYSAN  
DCNN      
DGCNN  
ECC        
DPGSAGE        
DPGCN  
MuchGCNM  
MuchGCNH  
MuchGCNMH 
Effectiveness Validation of Multiple Channels.
A series of channels is learned at each layer of MuchGCN; they can be concatenated and regarded as one super-graph. When the channel number exceeds one, MuchGCN retains many more nodes per layer than DIFFPOOL. Without controlling for this, it would be hard to claim that the performance advantage of MuchGCN relies on the channels encoding different characteristics, rather than on simply retaining more nodes. In this subsection, we validate how the multiple channels improve the graph representation learning ability, to answer Q2. The channel expansion and cluster ratio of MuchGCN are fixed to control the related variables. For DIFFPOOL, we gradually increase the number of nodes in the coarse-grained graphs by considering a range of assign ratios, the largest of which matches the total node number of MuchGCN at each layer in order to provide a fair comparison. We compare the two models across different depths of the hierarchical neural networks and show their graph classification accuracies in Table 2.
Methods  Layer number  Variance  

2  3  4  
DPGCN  
DPGCN  
DPGCN  
MuchGCN 
The following observations support the effectiveness of multiple channels in learning the graph representation. First, comparing the DIFFPOOL frameworks with different assign ratios, the larger ratio leads to a smaller classification accuracy as the layer number grows. A larger ratio preserves more node clusters and structural information in the pooled graphs, which would be expected to help distinguish the graphs; in the deeper hierarchical frameworks, however, these extra node clusters may introduce noise into the coarse-grained structure of the input graph, because the optimal number of node clusters is decided under the supervision of the given task. Second, MuchGCN outperforms DIFFPOOL consistently even when they have the same node number. In particular, while the classification accuracy of DIFFPOOL decreases significantly with depth, that of MuchGCN remains stable with a small variance. Instead of directly retaining more nodes, we learn the various characteristics of nodes and pool them in different ways to obtain a series of channels. This is in line with real-world graph-structured data, which is intrinsically multi-view.
Performance improvement via Increasing Channels.
Moving a step forward, we study how the graph classification performance varies with the channel number, further answering research question Q2. We reuse two of the well-performing MuchGCN frameworks from the previous experiments, both with the same cluster ratio. Increasing the channel expansion step by step, we show the classification accuracy of MuchGCN in Table 3.
In general, the larger the channel expansion is, the better the classification accuracy that can be achieved. The reason is intuitive: the multiple channels help encode more graph characteristics. By preserving more graph information at the higher layers of the model, the downstream classifier can more easily distinguish the non-isomorphic graphs.
Layer  Ratio  Channel expansion  

1  2  3  4  
2  
3 
Hyperparameter Studies.
We investigate the effects of some important hyperparameters on MuchGCN to answer research question Q3. Both the cluster ratio and the message passing step $K$ are evaluated in this section. A pooling module equipped with a large cluster ratio generates more complex coarse-grained graphs, while a large step $K$ incorporates neighborhood structure from many more hops away in the graph convolutions. We use a basic configuration of MuchGCN and show the effects of these two hyperparameters on this underlying framework in Figure 2.
We observe that different cluster ratios achieve similar best classification accuracies, because the pooling module can adaptively learn the appropriate number of node clusters. Compared with a small cluster ratio, some node clusters in the coarse-grained graph may be empty or even introduce noise when the ratio is large. This phenomenon is also explained in the previous experiments and in related work [26]. Considering the message passing step $K$, a larger value provides more accurate classification when the cluster ratio is small; when the ratio is large, a smaller $K$ tends to achieve better accuracy. On the one hand, more convolutional steps update the node embeddings globally with distant neighbors, which improves the graph representation learning and hence the classification performance when the ratio is small. On the other hand, the node embeddings across a graph become close to each other in Euclidean space as $K$ increases, so unrelated nodes may be assigned together into noisy and redundant clusters when the ratio is large.
5 Conclusion
Motivated by the CNN architecture, we propose the framework named MuchGCN to learn graph representations. Comparably to CNNs, a series of coarse-grained graph channels is encoded layer by layer for the graph-structured data. In detail, we design graph convolutional filters to learn the various characteristics of nodes in the series of graph channels, and define inter-channel graph convolutions to aggregate the graph channels and generate those at the next layer. Experimental results show that we achieve state-of-the-art performance on the task of graph classification and greatly improve model robustness. In future work, we will apply MuchGCN to other tasks, such as node classification and link prediction.
Appendix A Dataset Statistics
The statistics of all the datasets are summarized in Table 4. Each dataset consists of a series of graphs accompanied by graph labels. In Table 4, # Graphs denotes the total number of graphs in the corresponding dataset, and # Classes denotes the number of graph-label classes. The fourth and fifth columns denote the average numbers of nodes and edges per graph. The Node Label column denotes whether node attributes exist in the dataset.
Datasets  # Graphs  # Classes  Avg.# Nodes per Graph  Avg.# Edges per Graph  Node Label 

PTC  Y  
D&D  Y  
PROTEINS  Y  
COLLAB  Y  
IMDBB  N  
IMDBM  N  
RDTM12K  N 
Appendix B Implementation Details
b.1 Running Environment.
The baseline methods GCN and DIFFPOOL-GCN and our proposed MuchGCN are implemented in PyTorch, and tested on a machine with 24 Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz processors, 4 GeForce GTX 1080 Ti 12 GB GPUs, and 128 GB of memory. The random seed for the numpy and torch packages is fixed.
b.2 Features of Input Graph.
Importantly, our goal is to learn hierarchical graph representations from the graph structure rather than relying on the input features. We do not choose dataset-specific input features to further improve classification accuracy; instead, we follow the experimental settings of the state-of-the-art frameworks. The input features in the bioinformatic datasets include the categorical label, degree, and clustering coefficient, while the social network datasets contain only degree information. The maximum node number of the input graph is set to cover all graphs in PTC, IMDBB, and IMDBM, and to a larger value for D&D, PROTEINS, COLLAB, and RDTM12K.
B.3 Implementation Details of MuchGCN.
Our proposed MuchGCN is built upon the intra-channel and inter-channel graph convolutions shown in Equations () and () in the paper. We use for the message passing step, and for the hidden dimension. A total of layers are used for the PROTEINS dataset, while the other datasets show similar performance when . The assign ratio is set to and for the -layer and -layer architectures, respectively. The nonlinear functions and are given by ReLU and an MLP, respectively. The function is realized by max pooling to read out the graph representation. Batch normalization and normalization are applied after each graph convolution operation to stabilize training. We regularize the objective function by the entropy of the assignment matrix to make the cluster assignment sparse. -fold cross-validation is applied to evaluate the performance of MuchGCN, whose average classification accuracy and standard deviation are reported. A total of epochs are trained. The Adam optimizer is adopted to train MuchGCN, and the gradient is clipped when its norm exceeds .
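The gradient clipping step mentioned above can be sketched in plain Python. This is a toy, dependency-free version of what `torch.nn.utils.clip_grad_norm_` does in the actual PyTorch implementation; the threshold value here is a placeholder, since the paper's setting is elided.

```python
import math

def clip_by_norm(grads, max_norm):
    """Scale the gradient vector down so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return grads  # within budget: leave unchanged
    scale = max_norm / norm
    return [g * scale for g in grads]

grads = [3.0, 4.0]                    # L2 norm = 5.0
clipped = clip_by_norm(grads, 2.0)    # ~ [1.2, 1.6], norm scaled down to 2.0
```

Clipping by norm (rather than clipping each component independently) preserves the gradient direction, which matters when deep stacks of pooling layers make gradient magnitudes unstable.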
B.4 Implementation Details of Baselines.
For the baselines GCN and DIFFPOOL-GCN, we use the source code provided by the authors of DIFFPOOL [26]. The running environment is configured as suggested in that publication. For the other baseline methods, we directly cite the results from the corresponding publications.
Appendix C Proof for Proposition 1
Proof. In this section, we provide a proof analyzing how Equation () assigns a unique embedding to node . To simplify the following analysis, we omit the notation of in Equation () in the paper. The embedding multiset of node is represented as follows: , . We use vectors to denote the input feature and the embedding features and aggregated from the current and neighboring channels, respectively. The size of multiset is denoted by . Here , where denotes the message passing step and denotes the channel number.
According to Equation () in the paper, the final embedding of node is given by:
(7) 
where is the th element of filter , and is realized by an MLP. Note that is a scalar and is a -dimensional vector.
We need to prove that the embedding of a node is unique if its multiset is unique. Assume the set constructed by all -dimensional node embeddings is countable. According to Corollary in [24], there exists a function such that the value of is unique for each unique multiset . Suppose that filter is normalized so that . We then reformulate Equation (7) as . By the universal approximation theorem, we can model and learn function via the nonlinear function implemented by an MLP. Since an MLP can represent the composition of two consecutive MLPs, we have the following equivalence in generating embedding :
(8)  
Based on Corollary in [24], to prove that is unique for each unique multiset , we first need to show that the new set composed of the scaled embeddings is countable. It is obvious that the set obtained by adding bias to each element of is still countable. The set obtained by scaling each element of by is also countable. Consequently, the new set is countable, since it is constructed by the following union:
(9) 
Following the above analysis, embedding is unique only if the scaled multiset is still unique. Given different multisets and of nodes and , we need to show that the scaled multisets and are still different. In the following, we give a proof by contradiction.
Suppose that and are the same multiset. Then there exist matched pairs of and that satisfy the condition . For each th element in , there exists a matched th element in , which means the condition must hold for all matched pairs. However, there exist infinitely many for which this condition fails. Consider the following three cases for index : (1) but , (2) but , and (3) and . In the first case, the condition reduces to , which is not satisfied by any choice of since . In the second case, the condition reduces to , which would require some embeddings during the message passing steps to equal the scalar . This is generally hard to satisfy, since the node embedding changes after aggregating neighbor features at each step. In the third case, the condition reduces to , which is also generally impossible, because it is hard to force all elements of the -dimensional vector to take the same value . Multisets and are the same only if the condition for is satisfied. But this contradicts the assumption that multisets and are different. Consequently, we conclude that multisets and remain different.
Given the node embedding generated by Equation (8), we conclude that embedding is unique if multiset is unique.
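As a concrete numerical illustration of the argument above (with toy values chosen by us, not taken from the paper): scaling two different multisets element-wise by the same set of distinct, normalized weights generically leaves them different.

```python
# Toy check: two multisets that differ in one element stay different
# after element-wise scaling by shared, distinct weights w_i.
w = [0.2, 0.3, 0.5]            # normalized: the weights sum to 1
H_u = [1.0, 2.0, 3.0]
H_v = [1.0, 2.0, 4.0]          # differs from H_u only in the last element

scaled_u = sorted(wi * h for wi, h in zip(w, H_u))
scaled_v = sorted(wi * h for wi, h in zip(w, H_v))
print(scaled_u == scaled_v)    # False: the scaled multisets still differ
```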
Appendix D Running Time Analysis
Given the time complexity analysis in the paper, we evaluate the running time of MuchGCN in the above-mentioned environment. We study how the running time varies with the message passing step and the channel expansion . The underlying neural architecture of MuchGCN is as follows: and . The average running time of each epoch is shown in Figure 3.
The running time of MuchGCN increases almost linearly with step and channel expansion . This experimental result is consistent with our analysis in the paper.
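Per-epoch timings like those in Figure 3 can be collected with a simple harness such as the following. This is a generic sketch, not the script used for the paper's measurements; `train_one_epoch` stands in for the actual training step.

```python
import time

def average_epoch_time(train_one_epoch, n_epochs=5):
    """Run n_epochs epochs and return the mean wall-clock time per epoch."""
    start = time.perf_counter()
    for _ in range(n_epochs):
        train_one_epoch()
    return (time.perf_counter() - start) / n_epochs

# Example with a dummy "epoch" that just burns a little CPU.
avg = average_epoch_time(lambda: sum(i * i for i in range(10_000)), n_epochs=3)
print(f"avg epoch time: {avg:.6f} s")
```

To reproduce a linearity plot, one would sweep the message passing step and channel expansion, call this harness once per configuration, and plot the averages.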
References
 (2018) N-GCN: multi-scale graph convolution for semi-supervised node classification. arXiv preprint arXiv:1802.08888.
 (2016) Diffusion-convolutional neural networks. In NeurIPS, pp. 1993–2001.
 (2018) Learning graph representations with recurrent neural network autoencoders. In KDD'18 Deep Learning Day.
 (2005) Protein function prediction via graph kernels. Bioinformatics 21 (suppl_1), pp. i47–i56.
 (2013) Spectral networks and locally connected networks on graphs. arXiv preprint arXiv:1312.6203.
 (2018) Towards sparse hierarchical graph classifiers. arXiv preprint arXiv:1811.01287.
 (2003) Distinguishing enzyme structures from non-enzymes without alignments. Journal of Molecular Biology 330, pp. 771–783.
 (2016) Discriminative embeddings of latent variable models for structured data. In ICML, pp. 2702–2711.
 (2015) Convolutional networks on graphs for learning molecular fingerprints. In NeurIPS, pp. 2224–2232.
 (2013) Scalable kernels for graphs with continuous attributes. In NeurIPS.
 (2019) Graph U-Net.
 (2019) Spatiotemporal multi-graph convolution network for ride-hailing demand forecasting. In AAAI'19.
 (2017) Inductive representation learning on large graphs. In NeurIPS, pp. 1024–1034.
 (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In ICML.
 (2016) Benchmark data sets for graph kernels.
 (2017) Semi-supervised classification with graph convolutional networks. In ICLR.
 (2018) Weisfeiler and Leman go neural: higher-order graph neural networks. arXiv preprint arXiv:1810.02244.
 (2016) Learning convolutional neural networks for graphs. In ICML, pp. 2014–2023.
 (2011) Weisfeiler-Lehman graph kernels. Journal of Machine Learning Research 12 (Sep), pp. 2539–2561.
 (2009) Efficient graphlet kernels for large graph comparison. In AISTATS, pp. 488–495.
 (2017) Dynamic edge-conditioned filters in convolutional neural networks on graphs. In CVPR, pp. 3693–3702.
 (2017) Attention is all you need. In NeurIPS, pp. 5998–6008.
 (2017) Graph attention networks. arXiv preprint arXiv:1710.10903.
 (2018) How powerful are graph neural networks? arXiv preprint arXiv:1810.00826.
 (2015) A structural smoothing framework for robust graph comparison. In NeurIPS.
 (2018) Hierarchical graph representation learning with differentiable pooling. In NeurIPS, pp. 4805–4815.
 (2018) An end-to-end deep learning architecture for graph classification. In AAAI.
 (2018) Multi-view graph convolutional network and its applications on neuroimage analysis for Parkinson's disease. In AMIA Annual Symposium Proceedings, Vol. 2018, pp. 1147.