Graph Capsule Convolutional Neural Networks
Abstract
Graph Convolutional Neural Networks (GCNNs) are among the most exciting recent advances in deep learning, and their applications are quickly spreading across multiple domains including bioinformatics, chemoinformatics, social networks, natural language processing and computer vision. In this paper, we expose and tackle some of the basic weaknesses of a GCNN model using the capsule idea presented in (Hinton et al., 2011) and propose our Graph Capsule Network (GCAPSCNN) model. In addition, we design our GCAPSCNN model specifically to solve the graph classification problem, which current GCNN models find challenging. Through extensive experiments, we show that our proposed Graph Capsule Network significantly outperforms both the existing state-of-the-art deep learning methods and graph kernels on graph classification benchmark datasets.
1 Introduction
Graphs are among the most fundamental structures and have for decades been the basis for representing many types of data, particularly molecules or atoms. Problems concerning learning on graphs, such as graph semi-supervised learning, graph classification or graph evolution, have found wide application in bioinformatics, chemoinformatics, social networks, natural language processing and computer vision. Recently, deep learning approaches have shown remarkable success in solving such graph learning problems, especially since the advent of graph convolutional neural networks (GCNNs).
In just a few years, there has been a surge in generalizing convolutional neural networks (CNNs) to structures beyond regular grid domains, i.e., from images to arbitrary structures like graphs (Bruna et al., 2013; Henaff et al., 2015; Defferrard et al., 2016; Kipf & Welling, 2016). These convolutional networks are now commonly known as Graph Convolutional Neural Networks (GCNNs). The principal idea behind graph convolution is derived from the graph signal processing domain (Shuman et al., 2013) and has since been extended in different ways for various purposes (Duvenaud et al., 2015; Gilmer et al., 2017; Kondor et al., 2018).
In this paper, we expose three major limitations of a general GCNN model, two of which are specific to solving the graph classification problem, and aim to address each of them. The first limitation is due to the basic graph convolution operation, which in its purest form is defined as the aggregation of local neighborhood node values corresponding to each feature (or channel). As a result, there is a potential loss of information associated with the basic graph convolution operation. Hence, we seek a way to retain more information than pure aggregation does.
This particular problem has been noted before (Hinton et al., 2011) but did not receive much attention until recently (Sabour et al., 2017). In (Hinton et al., 2011) the authors propose a new type of neuron called a capsule, which encapsulates more information in a local pooling operation by computing a small vector of highly informative outputs rather than a single scalar output. In our case, the local pool is the neighborhood aggregation performed in the graph convolution operation.
Another source of inspiration that reinforces the same idea comes from one of the most successful graph kernels, the Weisfeiler-Lehman (WL) kernel (Shervashidze et al., 2011), designed for solving the graph classification problem. In the WL-subtree graph kernel, node (feature) labels are collected from the local neighbors of each node and compressed injectively to form a new node label in each iteration. The histograms of these new node labels are concatenated across iterations to serve as the graph invariant feature vector. The important point here is that, because the compression is injective, one can recover the exact node labels of the local neighbors in any iteration without ever losing track of them. In contrast, this is not possible in a GCNN, as the input feature values of the node neighbors are lost after the graph convolution operation.
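The relabeling step described above can be sketched in a few lines. This is a minimal illustration rather than the kernel implementation used in the cited work: the function name `wl_histogram`, the shared `table` dictionary and the label format `c0, c1, ...` are assumptions made for the example.

```python
from collections import Counter

def wl_histogram(adj, labels, table, iterations=2):
    """Run WL relabeling on one graph and return the label histogram.

    adj:    dict mapping each node to a list of its neighbors
    labels: dict mapping each node to its initial label
    table:  dict shared across graphs; maps a (label, sorted neighbor labels)
            signature to a fresh compressed label, so the map is injective
    """
    current = dict(labels)
    histogram = Counter(current.values())
    for _ in range(iterations):
        updated = {}
        for node, neighbors in adj.items():
            signature = (current[node], tuple(sorted(current[u] for u in neighbors)))
            if signature not in table:
                table[signature] = "c%d" % len(table)
            updated[node] = table[signature]
        current = updated
        histogram.update(current.values())
    return histogram

table = {}
# Path graph A - B - A and the same graph with its nodes renamed.
g1 = wl_histogram({0: [1], 1: [0, 2], 2: [1]}, {0: "A", 1: "B", 2: "A"}, table)
g2 = wl_histogram({"y": ["x", "z"], "x": ["y"], "z": ["y"]},
                  {"x": "A", "y": "B", "z": "A"}, table)

# Because the compression is injective, each new label can be decoded back
# into the node's previous label and the labels of its neighbors.
decode = {new: signature for signature, new in table.items()}
```

Both graphs yield the same histogram, and `decode` recovers the neighbor labels behind every compressed label, which is the property that plain aggregation in a GCNN lacks.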
To address this limitation, we propose to improve the basic graph convolution operation by encapsulating more information about the local neighbors. This is achieved by replacing the scalar output of the graph convolution operation with a small vector containing higher-order statistical information per feature. We call our model Graph Capsule Convolution Neural Networks (GCAPSCNN), inspired by the original capsule idea. In addition, our graph capsule idea is quite general and can be employed in any version of a GCNN model, whether designed for solving graph semi-supervised problems or for sequence learning on graphs via Graph Convolution Recurrent Neural Network models (GCRNNs).
We next deal with another limitation of GCNNs that is specific to the graph classification problem. For this purpose, GCNN models cannot be applied directly because they are equivariant models with respect to the node order in a graph. To be precise, consider a graph $G$ with Laplacian $L \in \mathbb{R}^{N \times N}$ and node feature matrix $X \in \mathbb{R}^{N \times d}$. Let $f(X, L) \in \mathbb{R}^{N \times h}$ be the output function of a GCNN model, where $N$, $d$ and $h$ are the number of nodes, the input dimension and the hidden dimension of the node features respectively. Then,
$f(X, L)$ is a permutation equivariant function, i.e., $f(PX, PLP^{T}) = P f(X, L)$ for any permutation matrix $P \in \mathbb{R}^{N \times N}$.
This permutation equivariance property prevents us from directly applying a GCNN to a graph classification problem, since it cannot guarantee that the outputs of any two isomorphic graphs are the same. As a result, a GCNN architecture needs an additional permutation invariant layer for performing graph classification. This invariant layer also needs to be differentiable to allow end-to-end learning.
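The equivariance property can be checked numerically on a toy graph. The sketch below assumes a deliberately simplified one-hop layer $f(X, L) = \mathrm{ReLU}(LXW)$; the helper names, the weights and the three-node path graph are illustrative choices, and plain Python lists stand in for a tensor library.

```python
def matmul(a, b):
    """Plain-Python matrix product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def gcnn_layer(x, lap, w):
    """Illustrative one-hop layer f(X, L) = ReLU(L X W)."""
    return [[max(0.0, v) for v in row] for row in matmul(matmul(lap, x), w)]

# Path graph 0 - 1 - 2 with Laplacian L = D - A.
L = [[1, -1, 0], [-1, 2, -1], [0, -1, 1]]
X = [[1.0, 0.0], [0.0, 2.0], [3.0, 1.0]]
W = [[1.0, -1.0], [0.5, 2.0]]
P = [[0, 1, 0], [0, 0, 1], [1, 0, 0]]  # a relabeling of the nodes

out = gcnn_layer(X, L, W)
out_perm = gcnn_layer(matmul(P, X), matmul(matmul(P, L), transpose(P)), W)

# Equivariance: relabeling the nodes permutes the output rows the same way.
equivariant = out_perm == matmul(P, out)

# Summing over nodes removes the dependence on the node order.
pooled = [sum(col) for col in zip(*out)]
pooled_perm = [sum(col) for col in zip(*out_perm)]
```

The raw outputs `out` and `out_perm` differ row by row even though the two inputs describe the same graph, which is why a classifier cannot consume them directly; only after an order-independent reduction such as the column sum do they agree.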
Very little work has gone into carefully designing such an invariant GCNN model for graph classification. Currently the most common method for achieving permutation invariance is to aggregate (or sum) over all graph node values (Atwood & Towsley, 2016; Dai et al., 2016; Zhao et al., 2018; Simonovsky & Komodakis, 2017). Though it is a simple and fast method, it can again incur a significant loss of information. Similar arguments also hold for a max-pooling layer.
A few attempts have been made in (Zhang et al., 2018; Kondor et al., 2018) to go beyond aggregation when designing a permutation invariant GCNN. In (Zhang et al., 2018) the authors propose a global ordering of nodes obtained by sorting them according to their values in the last hidden layer. This type of invariance is based on creating an order among nodes and has also been explored before in (Niepert et al., 2016). However, we show that there are some issues with this type of approach, as discussed in Section 4.1. A more tangential approach has been adopted in (Kondor et al., 2018), based on group theory, to design transformation operations and tensor aggregation rules that result in permutation invariant outputs. But their model relies on computing high-order tensors, which are computationally expensive in many cases.
To that end, we propose a novel permutation invariant layer based on computing the covariance of the data, whose output does not depend on the order of nodes in the graph. It is also fast to compute since it requires only a single dense matrix multiplication.
Our last concern with GCNN models is their limited ability to exploit global information for graph classification. The filters employed in graph convolutions are local in nature and hence can only provide an average view of the local data. This is a concern for graphs where node labels are not present, since initializing features with quantities such as the node degree is then not very helpful. To remedy this problem, we propose to utilize global features (features that account for the full graph structure), in particular those based on the family of graph spectral distances (Verma & Zhang, 2017).
In summary, the major contributions of our paper are:

We propose a Graph Capsule Network model based on the capsule idea to capture highly informative output in a small vector, as opposed to the scalar output currently employed in GCNN models.

We also propose a novel permutation invariant layer based on computing the covariance of data to solve graph classification problem. We further show that it is a better choice than performing node aggregation or doing maxsort pooling and can be computed in a fast manner.

Lastly, we propose to explicitly include global features at each graph node to enhance the global information exploited by GCAPSCNN model.
The rest of the paper is organized as follows. We start with the related work on graph kernels and GCNNs in Section 2. In Section 3, we discuss the core idea behind graph capsules. In Section 4, we focus on building a permutation invariant layer for solving the graph classification problem. In Section 5, we propose to equip our GCAPSCNN model with enhanced global features to exploit the full graph structure. Lastly, in the experiments and results of Section 6, we show the superior performance of our GCAPSCNN model.
2 Related Work
Currently there exist three main approaches to solving graph classification problems. The most common approach deals with building graph kernels. In graph kernels, a graph $G$ is decomposed into (possibly different) substructures $\{G_s\}$. The graph kernel $K(G_1, G_2)$ is defined based on the frequency of each substructure appearing in $G_1$ and $G_2$ respectively, i.e., $K(G_1, G_2) = \langle \phi(G_1), \phi(G_2) \rangle$, where $\phi(G)$ is the vector containing the frequencies of the substructures in $G$. Much work has gone into deciding which substructures are more suitable than others. The literature surrounding graph kernels is vast and substantial progress has been made in this area. Among the existing graph kernels, strong ones are graphlets (Pržulj, 2007; Shervashidze et al., 2009), random walks or shortest paths (Kashima et al., 2003; Borgwardt & Kriegel, 2005), and the Weisfeiler-Lehman subtree kernel (Shervashidze et al., 2011). Deep graph kernels (Yanardag & Vishwanathan, 2015), graph invariant kernels (Orsini et al., 2015), optimal assignment graph kernels (Kriege et al., 2016) and the multiscale Laplacian graph kernel (Kondor & Pan, 2016) instead focus on redefining kernel functions to appropriately measure substructural similarity at different levels. Another part of this work goes into computing these kernels efficiently, either by exploiting some structure dependency, or by approximation or randomization (Feragen et al., 2013; de Vries, 2013; Neumann et al., 2012). A few other works, such as (Montavon et al., 2012), take the 3D space coordinates of atoms into account rather than operating on the graph structure when constructing features, and also fall under this category.
The second category involves constructing explicit graph features, such as the Fgsd features in (Verma & Zhang, 2017), which are based on a family of graph spectral distances and come with certain theoretical guarantees. The Skew Spectrum of Graphs (Kondor & Borgwardt, 2008), based on group-theoretic approaches, is another example of this category. Its successor, the Graphlet Spectrum (Kondor et al., 2009), was introduced later to include label information in the spectrum and to account for the relative position of subgraphs within the graph. However, the main concern with the graphlet spectrum or the skew spectrum is its computational complexity.
The third, more recent and exciting line of work develops convolutional neural networks (CNNs) for graphs. The original idea of defining a graph convolution operation comes from the graph signal processing domain (Shuman et al., 2013) and has since been recognized as the problem of learning the filter parameters that appear in the graph Fourier transform given via a graph Laplacian (Bruna et al., 2013; Henaff et al., 2015). In the following years, different forms of GCNN models were considered, such as in (Kipf & Welling, 2016; Atwood & Towsley, 2016; Duvenaud et al., 2015), where traditional graph filters were replaced by a self-loop graph adjacency matrix and each neural network layer output was computed using a propagation rule while updating the network weights. Defferrard et al. (2016) extend the GCNN model by utilizing fast localized spectral filters and efficient pooling operations. A very different approach was proposed in (Niepert et al., 2016), where a set of local nodes is converted into a sequence in order to create receptive fields, which are then fed into a 1D convolutional neural network.
Another popular name for GCNNs is message passing neural networks (MPNNs) (Lei et al., 2017; Gilmer et al., 2017; Dai et al., 2016; García-Durán & Niepert, 2017). Though (Gilmer et al., 2017) suggests that GCNNs are a particular instance of MPNNs, we believe that both are equivalent models in a certain sense and that it is just a matter of how the graph convolution operation is defined. In MPNNs, the hidden state of each node is updated in each iteration based on the messages received from its neighbors as well as its previous hidden state. This is made possible by replacing the traditional neural networks in a GCNN with a small recurrent neural network (RNN) whose weight parameters are shared across all nodes in the graph. Note that the number of iterations in MPNNs can be related to the depth of a GCNN model. In (Simonovsky & Komodakis, 2017) the authors propose to condition the learning parameters of filters on edges rather than on nodes. This approach is similar to some instances of MPNNs, such as (Gilmer et al., 2017), where learning parameters are also associated with edges. All of the above MPNN models utilize aggregation as the permutation invariant layer for solving the graph classification problem, while in (Zhang et al., 2018; Kondor et al., 2018) the authors propose to utilize a max-sort pooling layer and group theory, respectively, to deal with graph invariance.
3 Graph Capsule CNN Model
Basic Setup and Notations: Consider a graph $G = (V, E, A)$ of size $N = |V|$, where $V$ is the vertex set, $E$ the edge set (with no self-loops) and $A = [a_{ij}]$ the weighted adjacency matrix. The standard graph Laplacian is defined as $L = D - A \in \mathbb{R}^{N \times N}$, where $D$ is the degree matrix. Let $X \in \mathbb{R}^{N \times d}$ be the node feature matrix with $d$ input dimensions; $h$ (when used) always denotes the number of hidden dimensions.
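General GCNN Model: We start by describing a general GCNN model before presenting our Graph Capsule CNN model. Let $G$ be a graph with graph Laplacian $L$ and $X \in \mathbb{R}^{N \times d}$ be a node feature matrix. Then the most general form of a GCNN layer output function $f(X, L) \in \mathbb{R}^{N \times h}$ equipped with polynomial filters is given by Equation 1,
$$f(X, L) = \sigma\big(g(X, L)\, W\big) = \sigma\Big(\sum_{k=0}^{K} L^{k} X W_{k}\Big) \tag{1}$$
In Equation 1, $g(X, L) = [X, LX, \ldots, L^{K}X] \in \mathbb{R}^{N \times (K+1)d}$ is defined as a graph convolution filter of polynomial form with degree $K$, $\sigma$ is a nonlinear activation, and $W = [W_0; W_1; \ldots; W_K]$ stacks the learning weight parameters, where each $W_k \in \mathbb{R}^{d \times h}$.
Note that $g(X, L)$ can be seen as a new node feature matrix with extended dimension $(K+1)d$ (also referred to as the breadth of a GCNN layer). Furthermore, $L$ can be replaced by any other suitable filter matrix, as mentioned in (Levie et al., 2017; Kipf & Welling, 2016).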
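A GCNN model with a depth of $\ell$ layers can recursively be written as,
$$f^{(\ell)}(X, L) = \sigma\Big(g\big(f^{(\ell-1)}(X, L), L\big)\, W^{(\ell)}\Big) \tag{2}$$
where $W^{(\ell)}$ is the weight parameter in the $\ell$-th layer and $f^{(0)}(X, L) = X$.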
One can notice that in any layer the basic computation involved is $[L f^{(\ell-1)}(X, L)]_{ij} = \sum_{k} L_{ik}\,[f^{(\ell-1)}(X, L)]_{kj}$. This expression shows that the new feature value of the $i$-th node (associated with the $i$-th row) is obtained as a single (scalar) aggregated value based on its local neighbors. This operation can incur a significant loss of information. We aim to remedy this issue by introducing our novel GCAPSCNN model based on the fundamental capsule idea.
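As a concrete reading of the polynomial-filter layer, the following sketch evaluates $\mathrm{ReLU}\big(\sum_{k=0}^{K} L^{k} X W_{k}\big)$ on a three-node path graph. It is only an illustration under assumed choices: ReLU as the activation, hand-picked weights, and plain Python lists in place of a tensor library.

```python
def matmul(a, b):
    """Plain-Python matrix product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def gcnn_poly_layer(x, lap, weights):
    """ReLU(sum_k L^k X W_k) for k = 0..K, where K = len(weights) - 1."""
    total = [[0.0] * len(weights[0][0]) for _ in x]
    power = x  # holds L^k X, starting with k = 0
    for w_k in weights:
        term = matmul(power, w_k)
        total = [[t + s for t, s in zip(t_row, s_row)]
                 for t_row, s_row in zip(total, term)]
        power = matmul(lap, power)
    return [[max(0.0, v) for v in row] for row in total]

# Path graph 0 - 1 - 2, one input feature, filter degree K = 1.
L = [[1, -1, 0], [-1, 2, -1], [0, -1, 1]]
X = [[1], [2], [4]]
weights = [[[1.0]], [[1.0]]]  # W_0 and W_1, each of shape d x h = 1 x 1

out = gcnn_poly_layer(X, L, weights)
```

Each output row is a single number per hidden unit: node 1, for instance, mixes the values 1, 2 and 4 of its neighborhood into one scalar, and the individual neighbor values can no longer be told apart afterwards.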
3.1 Graph Capsule Networks
The core idea behind graph capsule network is to capture more information in local pool beyond aggregation (which is the graph convolution operation in our case). This new information is encapsulated in so called instantiation parameters described in (Hinton et al., 2011) which forms a capsule vector of highly informative outputs.
The quality of these parameters is determined by their ability to encode as well as decode (i.e., reconstruct) the feature values of a node's local neighbors from the capsule vector. For instance, one can take the histogram of neighborhood feature values as the capsule vector. If the histogram bin width is sufficiently small, we are guaranteed to recover all the original input node values. This strategy has been widely used in constructing successful graph kernels. However, a histogram is not a continuous differentiable function and hence cannot be employed in end-to-end deep learning.
Besides seeking representative instantiation parameters, we impose two more constraints on a graph capsule function. First, we want our graph capsule function to be permutation invariant (rather than equivariant, as discussed in (Hinton et al., 2011)) with respect to the input node values, since we are interested in a model that produces the same output for isomorphic graphs. Second, we would like these parameters to be fast to compute.
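Graph Capsule Function: To describe a general graph capsule function, consider the $i$-th node with value $x_0$ and the set of its neighborhood node values $\mathcal{N}_i = \{x_0, x_1, \ldots, x_k\}$, including itself. In a graph convolution operation, the output is a scalar function $f_i: \mathbb{R}^{k+1} \to \mathbb{R}$ which takes the neighborhood values at the $i$-th node as input and yields the output,
$$f_i(x_0, x_1, \ldots, x_k) = \frac{1}{|\mathcal{N}_i|} \sum_{j \in \mathcal{N}_i} a_{ij}\, x_j \tag{3}$$
where $a_{ij}$ represents the edge weight between nodes $i$ and $j$.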
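In a graph capsule network, we propose to replace $f_i$ with a vector-valued capsule function $f_i: \mathbb{R}^{k+1} \to \mathbb{R}^{p}$. For example, consider a capsule function that captures higher-order statistical moments as follows (for simplicity we omit centering by the mean and scaling by the standard deviation),
$$f_i(x_0, x_1, \ldots, x_k) = \frac{1}{|\mathcal{N}_i|} \Big[\sum_{j \in \mathcal{N}_i} a_{ij}\, x_j,\; \sum_{j \in \mathcal{N}_i} a_{ij}\, x_j^{2},\; \ldots,\; \sum_{j \in \mathcal{N}_i} a_{ij}\, x_j^{p}\Big] \tag{4}$$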
Figure 1 shows an instance of applying our graph capsule function to a specific node. As a result, for an input feature matrix $X \in \mathbb{R}^{N \times d}$, our graph capsule network produces an output $f(X, L) \in \mathbb{R}^{N \times h \times p}$, where $p$ is the number of instantiation parameters.
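A small sketch makes the difference between scalar aggregation and a moment-based capsule concrete. The function name `capsule_moments` and the choice of raw (uncentered) moments up to order $p = 3$ are assumptions made for illustration.

```python
def capsule_moments(values, edge_weights, p=3):
    """Capsule vector [m_1, ..., m_p] for one node and one feature, with
    m_q = (1 / n) * sum_j a_j * x_j ** q over the n neighborhood values."""
    n = len(values)
    return [sum(a * x ** q for a, x in zip(edge_weights, values)) / n
            for q in range(1, p + 1)]

ones = [1, 1, 1]
spread = capsule_moments([1, 2, 3], ones)   # neighborhood with varied values
flat = capsule_moments([2, 2, 2], ones)     # neighborhood with equal values
shuffled = capsule_moments([3, 1, 2], ones)
```

The first entry, which is what plain aggregation returns, is identical for the two neighborhoods, while the second and third entries separate them. Reordering the neighbors leaves the whole capsule vector unchanged, so the function stays permutation invariant.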
Managing Graph Capsule Vector Dimension: In the first layer, our graph capsule network receives an input $X \in \mathbb{R}^{N \times d}$ and produces a nonlinear output $f^{(1)}(X, L) \in \mathbb{R}^{N \times h_1 \times p}$. Since our graph capsule function produces a vector of dimension $p$ (for each input dimension), the feature dimension of the output in subsequent layers can quickly blow up to an unmanageable value. To keep it in check, we restrict the output $f^{(\ell)}(X, L)$ to always lie in $\mathbb{R}^{N \times h_\ell \times p}$ at any middle layer $\ell$ of GCAPSCNN (here $h_\ell$ represents the hidden dimension of that layer). This can be accomplished in two ways: 1) by flattening the last two dimensions of $f^{(\ell-1)}(X, L)$ and carrying out graph convolution in the usual way (see Equation 5 for an example), or 2) by taking a weighted combination of the $p$-dimensional capsule vectors at each node (similar to an attention mechanism), as performed in (Sabour et al., 2017). We leave the second approach for future work. In a nutshell, our graph capsule network in the $\ell$-th layer ($\ell > 1$) receives an input $f^{(\ell-1)}(X, L) \in \mathbb{R}^{N \times h_{\ell-1} \times p}$ and produces an output $f^{(\ell)}(X, L) \in \mathbb{R}^{N \times h_\ell \times p}$.
Graph Capsule Function with Statistical Moments: In this paper, we consider higher-order statistical moments as instantiation parameters because they are permutation invariant and can be computed quickly through matrix multiplication operations. To see exactly how, let $f_p^{(\ell)}(X, L) \in \mathbb{R}^{N \times h_\ell}$ be the output matrix corresponding to the $p$-th capsule dimension. Then we can compute $f_p^{(\ell)}(X, L)$, containing statistical moments as instantiation parameters, as follows,
$$f_p^{(\ell)}(X, L) = \sigma\Big(\sum_{k=0}^{K} L^{k}\, \big(\underbrace{f^{(\ell-1)}(X, L) \circ \cdots \circ f^{(\ell-1)}(X, L)}_{p \text{ times}}\big)\, W_{pk}^{(\ell)}\Big) \tag{5}$$
where $\circ$ is the Hadamard product. Here, to keep the feature dimensions from growing, we flatten the last two dimensions of the input so that $f^{(\ell-1)}(X, L) \in \mathbb{R}^{N \times h_{\ell-1} p}$ and perform the usual graph convolution operation followed by a linear transformation with $W_{pk}^{(\ell)} \in \mathbb{R}^{h_{\ell-1} p \times h_\ell}$ as the learning weight parameter. Note that $p$ here denotes both the capsule dimension and the order of the statistical moment.
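The matrix form of the moment computation can be sketched as follows. To stay short, the example keeps only a single filter term, uses the unnormalized matrix $A + I$ as the filter (the text allows replacing $L$ with another suitable filter matrix) and ReLU as the activation; all helper names and weights are illustrative assumptions.

```python
def matmul(a, b):
    """Plain-Python matrix product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def hadamard_power(x, p):
    """Element-wise p-th power, i.e. the p-fold Hadamard product of x."""
    return [[v ** p for v in row] for row in x]

def capsule_layer(x, filt, weights):
    """Return one N x h matrix per moment order p = 1..len(weights)."""
    outputs = []
    for p, w_p in enumerate(weights, start=1):
        moment = matmul(filt, hadamard_power(x, p))  # aggregated p-th moments
        outputs.append([[max(0.0, v) for v in row] for row in matmul(moment, w_p)])
    return outputs

def flatten_capsules(outputs):
    """Merge the capsule axis into the feature axis: N x h x p -> N x (h * p)."""
    n = len(outputs[0])
    return [[v for block in outputs for v in block[i]] for i in range(n)]

# Path graph 0 - 1 - 2 with self-loops as the filter matrix.
FILT = [[1, 1, 0], [1, 1, 1], [0, 1, 1]]
X = [[1], [2], [3]]
weights = [[[1.0]], [[1.0]]]  # one d x h weight matrix per moment order

capsules = capsule_layer(X, FILT, weights)
next_input = flatten_capsules(capsules)
```

`next_input` has one row per node and $h \cdot p$ columns, which is the flattened form a subsequent layer would consume.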
Graph Capsule Function with Polynomial Coefficients: As mentioned earlier, the quality of the instantiation parameters depends on their capability to encode and decode the input values. Therefore, we seek capsule functions which are bijective in nature, i.e., guaranteed to preserve everything about the local neighborhood. For instance, one can consider the coefficients of a polynomial as instantiation parameters by taking the set of local node feature values as its roots,
$$f_i(x_0, x_1, \ldots, x_k) = \Big[\sum_{j} x_j,\; \sum_{j_1 < j_2} x_{j_1} x_{j_2},\; \sum_{j_1 < j_2 < j_3} x_{j_1} x_{j_2} x_{j_3},\; \ldots,\; x_0 x_1 \cdots x_k\Big] \tag{6}$$
One can show that, given the full set of polynomial coefficients, we are guaranteed to recover all the original node values (up to permutation). However, the first issue with this approach is that the coefficients are expensive to compute at each node. Specifically, a combinatorial algorithm without the fast Fourier transform takes $O(n^2)$ time, where $n$ is the number of roots. There are also numerical instability issues associated with computing polynomial coefficients. There are ways to deal with these kinds of issues, but we leave pursuing this direction for future work.
In short, our graph capsule idea is powerful and can be employed in any type of GCNN model, whether for solving graph semi-supervised learning problems, performing sequence learning on graphs using Graph Convolution Recurrent Neural Network models (GCRNNs), doing link prediction via Graph Autoencoders (GAEs), or generating synthetic graphs through Graph Generative Adversarial models (GGANs).
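The quadratic-time expansion mentioned above can be sketched by multiplying in one root at a time. The function names are illustrative, and the coefficients are returned with their natural alternating signs, which differ from the unsigned symmetric sums only by sign.

```python
def poly_coefficients(roots):
    """Coefficients of prod_j (t - r_j), highest degree first, in O(n^2) time."""
    coeffs = [1]
    for r in roots:
        # Multiply the current polynomial by (t - r).
        shifted = coeffs + [0]                      # current polynomial times t
        scaled = [0] + [-r * c for c in coeffs]     # current polynomial times -r
        coeffs = [s + t for s, t in zip(shifted, scaled)]
    return coeffs

def evaluate(coeffs, t):
    """Horner evaluation of the polynomial at t."""
    acc = 0
    for c in coeffs:
        acc = acc * t + c
    return acc

spread = poly_coefficients([1, 2, 3])    # [1, -6, 11, -6]
flat = poly_coefficients([2, 2, 2])      # [1, -6, 12, -8]
shuffled = poly_coefficients([3, 1, 2])
```

The two neighborhoods share the same sum (the coefficient 6 up to sign) but are separated by the remaining coefficients, and every original value is a root of its polynomial, so the multiset of values is recoverable. Reordering the values does not change the coefficients.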
4 Designing Graph Permutation Invariant Layer
In this section, we focus on the second limitation of GCNN model regarding achieving permutation invariance for graph classification purpose. Before presenting our novel invariant layer in GCAPSCNN model, we first discuss the shortcomings of MaxSort Pooling Layer which is the next popular choice after aggregation for achieving invariance.
4.1 Problems with MaxSort Pooling Layer
We design a test to determine whether the invariant graph feature constructed by a model has any degree of certainty to produce the same output for subgraph isomers or not.
SubGraph Isomorphism Feature Test: Consider two graphs $G_1$ and $G_2$ such that $G_1$ is isomorphic to a subgraph of $G_2$. Let $f_1, f_2$ be the invariant feature vectors (with respect to graph isomorphism) of $G_1, G_2$ respectively. Then we define the subgraph isomorphism feature test as a criterion guaranteeing that the corresponding elements of $f_1$ and $f_2$ are comparable under a certain notion, i.e., $f_1^{(i)} \preceq f_2^{(i)}$ for any $i$. Here $\preceq$ represents a comparison operator defined in a sensible way. Satisfying this test is very desirable for graph classification, since it is quite likely that subgraph isomers of a graph share the same class label. This property helps the model appropriately learn the weight parameter that is shared across the same input position, i.e., across $f_1^{(i)}$ and $f_2^{(i)}$.
Proposition 1
Let $f_1, f_2$ be the feature vectors containing the top $k$ node values in sorted order for graphs $G_1, G_2$ respectively, where $G_1$ is subgraph isomorphic to $G_2$. Then the MaxSort Pooling Layer fails the SubGraph Isomorphism Feature Test, owing to the comparison being done with respect to node ordering.
Remarks: The MaxSort Pooling layer fails the test because it does not guarantee that $f_1^{(i)}$ and $f_2^{(i)}$ are comparable for any $i$: the nodes corresponding to the values $f_1^{(i)}$ and $f_2^{(i)}$ may not be the same in subgraph isomers. Even including a single node (value) in $f_2$ which is not present in $f_1$ can upset the whole comparison order of the elements of $f_1$ and $f_2$. As a result, the comparison in a MaxSort Pooling layer is not always guaranteed to be sensible, which makes the problem of learning weight parameters harder. In general, any invariant graph feature vector that relies on node ordering will fail this test.
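The misalignment can be seen on a toy example. The sketch assumes, purely for illustration, that the nodes shared by the two graphs keep the same hidden values and that one extra node `d` appears in the larger graph; the function name `sort_pool` and the values are hypothetical.

```python
def sort_pool(node_values, k):
    """Keep the k nodes with the largest values, in descending order."""
    ranked = sorted(node_values.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]

g1 = {"a": 0.9, "b": 0.5, "c": 0.2}            # smaller graph
g2 = {"a": 0.9, "b": 0.5, "c": 0.2, "d": 0.7}  # contains g1 plus node d

slots_g1 = [name for name, _ in sort_pool(g1, 3)]
slots_g2 = [name for name, _ in sort_pool(g2, 3)]
```

After pooling, the second slot holds node `b` for the first graph and node `d` for the second, and every later slot is shifted as well, so a weight tied to a slot position ends up comparing unrelated nodes.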
4.2 Covariance as Permutation Invariant Layer
Our novel idea for permutation invariant features in the GCAPSCNN model is to compute the covariance of the layer output, given as follows,
$$C\big(f^{(\ell)}(X, L)\big) = \frac{1}{N}\big(f^{(\ell)}(X, L) - \mu\big)^{T}\big(f^{(\ell)}(X, L) - \mu\big) \tag{7}$$
Here $\mu$ is the mean of the output $f^{(\ell)}(X, L)$ over the nodes and $C(\cdot)$ is the covariance function. Since the covariance function is differentiable and does not depend on the order of the rows, it can serve as a permutation invariant layer in the GCAPSCNN model. It is also fast to compute, requiring a single matrix multiplication. Note that we flatten the last two dimensions of the GCAPSCNN layer output $f^{(\ell)}(X, L) \in \mathbb{R}^{N \times h_\ell \times p}$ in order to compute the covariance.
Moreover, the covariance provides much richer information about the data than the mean alone, since it includes information about shapes, norms and angles (between node hidden features). In fact, in a multivariate normal distribution it is used as a statistical parameter to approximate the normal density, and thus it also reflects information about the data distribution. This property, along with invariance, has been exploited before in (Kondor & Jebara, 2003) for computing the similarity between two sets of vectors. One can also think about fitting a multivariate normal distribution to $f^{(\ell)}(X, L)$, but this involves computing the inverse of the covariance matrix, which is computationally expensive.
Since each element of the covariance matrix is invariant to the node order, we can flatten the symmetric covariance matrix to construct the graph invariant feature vector $f \in \mathbb{R}^{(h_\ell p + 1)\, h_\ell p / 2}$. On another positive note, the output dimension of $f$ does not depend on the number of nodes $N$ and can be adjusted according to computational constraints.
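A minimal sketch of such a covariance layer is shown below, assuming the layer output has already been flattened to an $N \times m$ matrix; the function name and the sample values are illustrative.

```python
def covariance_features(h):
    """Upper triangle of (1/N) (H - mu)^T (H - mu) for an N x m matrix H."""
    n, m = len(h), len(h[0])
    mu = [sum(row[j] for row in h) / n for j in range(m)]
    centered = [[row[j] - mu[j] for j in range(m)] for row in h]
    cov = [[sum(r[a] * r[b] for r in centered) / n for b in range(m)]
           for a in range(m)]
    # The matrix is symmetric, so the upper triangle holds all distinct entries.
    return [cov[a][b] for a in range(m) for b in range(a, m)]

H = [[1.0, 2.0], [3.0, 6.0], [5.0, 1.0]]
H_reordered = [H[2], H[0], H[1]]                 # same nodes, different order
H_larger = H + [[2.0, 2.0], [4.0, 0.0]]          # a graph with more nodes

feat = covariance_features(H)
feat_reordered = covariance_features(H_reordered)
feat_larger = covariance_features(H_larger)
```

The feature vector has $m(m+1)/2$ entries regardless of the number of nodes, and reordering the rows leaves it unchanged, so graphs of different sizes map to vectors that a fully connected layer can consume directly.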
Proposition 2
Let $f_1, f_2$ be the feature vectors containing the covariance elements of the node feature matrices for graphs $G_1, G_2$ respectively, where $G_1$ is subgraph isomorphic to $G_2$. Then the covariance invariant layer passes the SubGraph Isomorphism Feature Test, owing to the comparison being done with respect to feature dimensions.
Remarks: It is straightforward to see that the order of the feature dimensions of a node does not depend on the graph node ordering, and hence this order is the same across all graphs. As a result, the corresponding elements of $f_1$ and $f_2$ are always comparable. To be more specific, the covariance output compares both the norms and the angles between the corresponding pairs of feature dimension vectors in the two graphs.
5 Designing GCAPSCNN with Global Features
Besides guaranteeing permutation invariance in the GCAPSCNN model, another important desired characteristic of a graph classification model is the ability to capture the global structure (or features) of a graph. For instance, the node degree (as a node feature) is local information and not very helpful for solving the graph classification problem. On the other hand, a spectral embedding used as a node feature takes global information into account and has proven successful as a node vector for problems dealing with graph semi-supervised learning. We define global features as those that take the full graph structure into account during their computation, while local features depend only on the node neighbors within (at most) $k$ hops.
Unfortunately, the basic design of GCNN model can only capture local structure information of the graph at each node. We make this loose statement more concrete with the following theorem.
Theorem 1
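Let $G$ be a graph with graph Laplacian $L$ and node feature matrix $X$. Let $f^{(\ell)}(X, L)$ be the output function of the $\ell$-th GCNN layer equipped with polynomial filters of degree $K$. Then the output $[f^{(\ell)}(X, L)]_i$ at the $i$-th node (i.e., the $i$-th row of $f^{(\ell)}(X, L)$) depends only on the input values of neighbors at most $K\ell$ hops away.
Proof: We prove this statement by mathematical induction on $\ell$. It is easy to see that the base case $\ell = 1$ holds, since $[L^{k}]_{ij}$ is nonzero only if node $j$ is within $k$ hops of node $i$. Let us assume it also holds for $f^{(\ell-1)}(X, L)$, i.e., the $i$-th node output depends on neighbors up to $K(\ell-1)$ hops away. Then in $f^{(\ell)}(X, L)$ we focus on the term,
$$g\big(f^{(\ell-1)}(X, L), L\big) = \big[f^{(\ell-1)}(X, L),\; L f^{(\ell-1)}(X, L),\; \ldots,\; L^{K} f^{(\ell-1)}(X, L)\big] \tag{8}$$
particularly the last term involving $L^{K} f^{(\ell-1)}(X, L)$. Matrix multiplication of $L^{K}$ with $f^{(\ell-1)}(X, L)$ results in the $i$-th node including information from all nodes that are at most $K$ hops away. But since a node in $f^{(\ell-1)}(X, L)$ at a distance of $K$ hops (from the $i$-th node) can contain information from up to $K(\ell-1)$ hops, the $i$-th node contains information from at most $K + K(\ell-1) = K\ell$ hops away.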
Remarks: Theorem 1 establishes that a GCNN model with $\ell$ layers and filters of degree $K$ can capture only $K\ell$-hop local structure information at each node. Thus, employing a GCNN for graph classification with, say, an aggregation layer can capture only the average variation of $K\ell$-hop local information over the whole graph. To include more global information about the graph, one can either increase $K$ (i.e., choose higher-order graph convolution filters) or $\ell$ (i.e., the depth of the GCNN model). Both choices increase the model complexity and thus require more data samples to reach satisfactory results. However, of the two, we prefer increasing the depth of the GCNN model, because the first choice increases the breadth of the GCNN layer (see the note on the breadth of a GCNN layer in Section 3) and, based on the current understanding of deep learning theory, increasing the depth is favored over increasing the breadth.
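The locality bound can be observed numerically. The sketch below stacks two layers with filters of degree $K = 1$ on a five-node path graph, so each output should see at most two hops; the tanh activation, the weights and the helper names are assumptions chosen for the illustration.

```python
import math

def matmul(a, b):
    """Plain-Python matrix product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def poly_layer(x, lap, weights):
    """tanh(sum_k L^k X W_k); the filter degree is K = len(weights) - 1."""
    total = [[0.0] * len(weights[0][0]) for _ in x]
    power = x
    for w_k in weights:
        term = matmul(power, w_k)
        total = [[t + s for t, s in zip(t_row, s_row)]
                 for t_row, s_row in zip(total, term)]
        power = matmul(lap, power)
    return [[math.tanh(v) for v in row] for row in total]

def path_laplacian(n):
    """Laplacian L = D - A of the path graph 0 - 1 - ... - (n - 1)."""
    lap = [[0] * n for _ in range(n)]
    for i in range(n - 1):
        lap[i][i] += 1
        lap[i + 1][i + 1] += 1
        lap[i][i + 1] -= 1
        lap[i + 1][i] -= 1
    return lap

def two_layer_model(x, lap, weights):
    return poly_layer(poly_layer(x, lap, weights), lap, weights)

LAP = path_laplacian(5)
W = [[[0.5]], [[0.5]]]                          # W_0 and W_1, degree K = 1
x_a = [[0.1], [0.2], [0.3], [0.4], [0.5]]
x_b = [[0.1], [0.2], [0.3], [0.4], [0.9]]       # only node 4 is changed

out_a = two_layer_model(x_a, LAP, W)
out_b = two_layer_model(x_b, LAP, W)
```

Nodes 0 and 1 are more than two hops from node 4 and their outputs are identical in both runs, while node 2, exactly two hops away, already reacts to the change.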
For cases where graph node features are missing, it is common practice to take the node degree as a node feature. Such practices can work for problems like graph semi-supervised learning, where local structure information drives the node output labels (or classes). But in graph classification, global features govern the output labels, and hence taking the node degree is not sufficient. Of course, we can use a very deep GCNN model that allows us to exploit more global information, but this requires higher sample complexity to achieve satisfactory results.
To balance the two (model complexity with depth vs. required sample complexity), we propose to incorporate Fgsd features, computed at each node, in our GCAPSCNN model. As shown in (Verma & Zhang, 2017), Fgsd features capture global information about the graph and can also be computed quickly. Specifically, at each node the Fgsd features are computed as the histogram of the multiset formed by taking the harmonic distance between that node and all nodes. The harmonic distance is given by,
$$S(x, y) = \sum_{k=1}^{N-1} \frac{1}{\lambda_k}\big(\phi_k(x) - \phi_k(y)\big)^{2} \tag{9}$$
where $S(x, y)$ is the harmonic distance, $x, y$ are any graph nodes, and $\lambda_k$, $\phi_k$ are the $k$-th eigenvalue and eigenvector of the graph Laplacian respectively (the zero eigenvalue is excluded from the sum).
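A dependency-free sketch of these node features follows. Instead of an explicit eigendecomposition it uses the equivalent form $S(x, y) = M_{xx} + M_{yy} - 2M_{xy}$ with $M = (L + \frac{1}{N}\mathbf{1}\mathbf{1}^{T})^{-1}$, which agrees with the spectral sum for connected graphs because the constant shift cancels in the difference. This substitution, the small Gauss-Jordan inverse, and the `bin_width` and `num_bins` parameters are assumptions of the example rather than details taken from the cited work.

```python
def invert(m):
    """Gauss-Jordan inverse of a small dense matrix (list of lists)."""
    n = len(m)
    aug = [[float(v) for v in row] + [float(i == j) for j in range(n)]
           for i, row in enumerate(m)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def harmonic_distances(lap):
    """All-pairs harmonic distances of a connected graph with Laplacian lap."""
    n = len(lap)
    inv = invert([[lap[i][j] + 1.0 / n for j in range(n)] for i in range(n)])
    return [[inv[x][x] + inv[y][y] - 2.0 * inv[x][y] for y in range(n)]
            for x in range(n)]

def node_histograms(dist, bin_width, num_bins):
    """One histogram per node over its distances to all nodes."""
    features = []
    for row in dist:
        hist = [0] * num_bins
        for d in row:
            hist[min(int(d / bin_width), num_bins - 1)] += 1
        features.append(hist)
    return features

# Path graph 0 - 1 - 2 with unit edge weights.
LAP = [[1, -1, 0], [-1, 2, -1], [0, -1, 1]]
dist = harmonic_distances(LAP)
features = node_histograms(dist, bin_width=0.8, num_bins=3)
```

Every entry of `dist` depends on the whole Laplacian, so the resulting node features carry global structure, and the end nodes of the path receive a different histogram from the middle node even though no node labels were supplied.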
In our experiments, we employ these features only for datasets where node features are missing (specifically, the social network datasets in our case), although this strategy can always be used by concatenating Fgsd features with the original node feature values to capture more global information. Further inspired by the Weisfeiler-Lehman graph kernel (Shervashidze et al., 2011), which also concatenates features in each labeling iteration, we propose to pass the concatenated outputs from intermediate layers to our covariance and fully connected layers. Finally, our whole end-to-end GCAPSCNN learning model is guaranteed to produce the same output for isomorphic graphs.
6 Experiment and Results
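Table 1: Graph classification results on the bioinformatics datasets PTC, PROTEINS, NCI1, NCI109, D&D and ENZYMES. The header rows give the number of graphs and the maximum and average graph size of each dataset. The deep learning block compares DCNN (2016), PSCN (2016), ECC (2017) and DGCNN (2018) with GCAPSCNN; the graph kernel block compares RW (2003), SP (2005), GK (2009), WL (2011), DGK (2015) and MLG (2016) with GCAPSCNN. Dashes mark results that are not available.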
GCAPSCNN Model Configuration: We build layer GCAPSCNN with the following configuration: . Here represents a Graph Capsule CNN layer with hidden dimensions and instantiation parameters. As mentioned earlier, we take the intermediate output of each layer and form a concatenated tensor, which is subsequently passed through layer, which computes the mean and covariance of the input. The output of layer is then passed to two fully connected layers with again output dimensions and finally connects to a softmax layer for computing class probabilities. Between intermediate layers, we use batch normalization and dropout to prevent overfitting, along with norm regularization. We set depending upon the dataset size (towards higher for larger datasets) and for setting the hidden dimension. We restrict for computing higher-order statistical moments due to computational constraints. Further, we employ the ADAM optimization technique with an initial learning rate chosen from the set with a decaying factor of after every few epochs. The batch size is set according to the given dataset size and memory requirements. The number of epochs is chosen from the set . All the above-mentioned hyperparameters are tuned based on the training loss. The average classification accuracy based on fold cross validation error is reported for each dataset. Our GCAPSCNN code and data will be made available on GitHub (https://github.com/vermaMachineLearning/GraphCapsuleCNNNetworks/).
Datasets: To evaluate our GCAPSCNN model, we perform graph classification tasks on a variety of benchmark datasets. In the first round, we used the bioinformatics datasets PTC, PROTEINS, NCI1, NCI109, D&D, and ENZYMES. In the second round, we used the social network datasets COLLAB, IMDB-BINARY, IMDB-MULTI, REDDIT-BINARY and REDDIT-MULTI-5K. The D&D dataset contains enzyme and non-enzyme protein structures. Details of the other datasets can be found in (Yanardag & Vishwanathan, 2015). For each dataset, the number of graphs and the maximum and average number of nodes are shown in Table 1 and Table 2.
Experimental Setup: All experiments were performed on a single machine equipped with recently launched NVIDIA TITAN VOLTA GPUs and GB RAM. We compare our method with both deep learning models and graph kernels.
Deep Learning Baselines: For deep learning approaches, we adopted recently proposed state-of-the-art graph convolutional neural networks, namely PATCHY-SAN (PSCN) (Niepert et al., 2016), Diffusion CNNs (DCNN) (Atwood & Towsley, 2016), Dynamic Edge CNN (ECC) (Simonovsky & Komodakis, 2017), and Deep Graph CNN (DGCNN) (Zhang et al., 2018).
Graph Kernel Baselines: We adopted state-of-the-art graph kernels for comparison, namely Random Walk (RW) (Gärtner et al., 2003), Shortest Path Kernel (SP) (Borgwardt & Kriegel, 2005), Graphlet Kernel (GK) (Shervashidze et al., 2009), Weisfeiler-Lehman Subtree Kernel (WL) (Shervashidze et al., 2011), Deep Graph Kernels (DGK) (Yanardag & Vishwanathan, 2015), and Multiscale Laplacian Graph Kernels (MLG) (Kondor & Pan, 2016).
Baselines Settings: We adopted the same procedure as previous works (Niepert et al., 2016; Yanardag & Vishwanathan, 2015; Zhang et al., 2018) to make a fair comparison and used fold cross-validation with the LIBSVM (Chang & Lin, 2011) library to report the classification performance of graph kernels. The SVM parameters are tuned independently using the training-fold data, and the best average classification accuracy is reported for each method. For the Random Walk (RW) kernel, the decay factor is chosen from . For the Weisfeiler-Lehman (WL) kernel, we chose the height of the subtree kernel from . For the graphlet kernel (GK), we chose graphlets size , and for deep graph kernels (DGK), we report the best classification accuracy obtained among the deep graphlet kernel, the deep shortest path kernel, and the deep Weisfeiler-Lehman kernel. For the Multiscale Laplacian Graph (MLG) kernel, we chose and parameter of the algorithm from , radius size from , and level number from . For diffusion-convolutional neural networks (DCNN), we chose the number of hops from . For the rest, the best reported results were taken from the papers for PATCHY-SAN () (Niepert et al., 2016), ECC (Simonovsky & Komodakis, 2017) (without edge labels, since all other methods also rely only on node labels), and DGCNN (with sorting layer) (Zhang et al., 2018), since the experimental setup was the same and a fair comparison can be made. In short, we follow the same procedure as described in previous papers. Note that some results are missing because either they were not previously reported or the source code was not available to run them.
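The evaluation protocol above (parameters tuned on training folds only, accuracy averaged over held-out folds) can be sketched as follows. This is an illustrative outline rather than the scripts used for the reported numbers: the helper names, the `score` callback, the fold count, and the parameter grid are all hypothetical, and the callback stands in for fitting an SVM on a precomputed graph kernel.

```python
import random


def kfold_splits(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) for k shuffled, disjoint folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def tuned_cv_accuracy(n_samples, k, param_grid, score):
    """Average test accuracy over k folds, tuning on training folds only.

    `score(train, test, param)` is a user-supplied (hypothetical) callback
    that fits a classifier with `param` on `train` and returns its accuracy
    on `test`; both arguments are lists of sample indices.
    """
    fold_accuracies = []
    for train, test in kfold_splits(n_samples, k):
        # Inner loop: pick the parameter using the training data alone.
        def inner_score(param):
            inner = list(kfold_splits(len(train), k, seed=1))
            scores = [score([train[i] for i in tr],
                            [train[i] for i in va], param)
                      for tr, va in inner]
            return sum(scores) / len(scores)

        best_param = max(param_grid, key=inner_score)
        # Outer evaluation: the held-out fold is used exactly once.
        fold_accuracies.append(score(train, test, best_param))
    return sum(fold_accuracies) / len(fold_accuracies)


def toy_score(train, test, param):
    """Stand-in scorer whose accuracy depends only on the parameter."""
    return 1.0 - abs(param - 0.5)


accuracy = tuned_cv_accuracy(n_samples=20, k=5,
                             param_grid=[0.1, 0.5, 0.9], score=toy_score)
```

Keeping the parameter search inside each training fold means the held-out fold never influences model selection, which is what makes accuracies comparable across methods evaluated under the same splits.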
Graph Classification Results: From Table 1, it is clear that our GCAPSCNN model consistently outperforms most of the considered deep learning methods on the bioinformatics datasets (except on the D&D dataset) with a significant margin of classification accuracy gain (highest being on the NCI1 dataset).
This trend continues on the social network datasets, as shown in Table 2. Here, we were able to achieve up to accuracy gain on the COLLAB dataset, and the rest were consistently around gain when compared against other deep learning approaches.
Our GCAPSCNN is also very competitive with state-of-the-art graph kernel methods. It again shows a consistent performance gain of accuracy (highest being on the PTC dataset) on many bioinformatics datasets when compared against strong graph kernels, whereas the other considered deep learning methods do not come close to beating graph kernels on many of these datasets. It is worth mentioning that most deep learning models (like ours) are also scalable, while graph kernels are more fine-tuned towards handling small graphs.
For social network datasets, we have a significant gain of at least accuracy (highest being on the REDDIT-MULTI dataset) against graph kernels, as observed in Table 2. This is expected, as deep learning methods tend to do better with the large amount of training data available in the social network datasets. Altogether, our GCAPSCNN model shows very promising results against both the current state-of-the-art deep learning methods and graph kernels.
7 Conclusion & Future Work
In this paper, we present a novel Graph Capsule Network (GCAPSCNN) model based on the fundamental capsule idea to address some of the basic weaknesses of existing GCNN models. By design, our graph capsule network captures more local structure information than a traditional GCNN and can provide a much richer representation of individual graph nodes or of the whole graph. For our purpose, we employ a capsule function that preserves statistical moment information, since moments are fast to compute.
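To make the idea of a moment-preserving capsule function concrete, here is a minimal sketch. It assumes one scalar feature per node, an unweighted graph, and raw (uncentered) moments taken over each node's closed neighborhood; the function name `capsule_moments`, these normalization choices, and the toy graph are illustrative assumptions and may differ from the exact formulation used in the model.

```python
def capsule_moments(adjacency, features, num_moments):
    """For each node, return the first `num_moments` raw moments of the
    scalar feature over its closed neighborhood (the node plus its neighbors).

    adjacency: dict mapping node -> list of neighbor nodes
    features:  dict mapping node -> float
    """
    capsules = {}
    for node, neighbors in adjacency.items():
        pool = [features[node]] + [features[v] for v in neighbors]
        # One output per moment order p: the mean of x**p over the pool.
        capsules[node] = [
            sum(x ** p for x in pool) / len(pool)
            for p in range(1, num_moments + 1)
        ]
    return capsules


# Hypothetical path graph 0 - 1 - 2 with one scalar feature per node.
adjacency = {0: [1], 1: [0, 2], 2: [1]}
features = {0: 1.0, 1: 2.0, 2: 4.0}

capsules = capsule_moments(adjacency, features, num_moments=3)
```

Each node ends up with a small vector instead of a single aggregated scalar, so two neighborhoods with the same mean but a different spread of feature values produce different capsules.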
Furthermore, we propose a novel permutation-invariant layer based on computing covariance in our GCAPSCNN architecture to deal with the graph classification problem, which most GCNN models find challenging. This covariance can again be computed quickly and has been shown to perform better than an aggregation or max-sort pooling layer. On top of this, we also propose to equip our GCAPSCNN model with FGSD features explicitly to capture more global information in the absence of node features. This is essential to consider, since non-deep GCNN models are not capable of exploiting global information implicitly. Finally, we show the superior performance of GCAPSCNN on many bioinformatics and social network datasets in comparison with existing deep learning methods as well as strong graph kernels, setting the current state of the art.
Our general idea of the graph capsule is quite rich and can be taken further by designing more sophisticated capsule functions that are capable of preserving more information in a local pool. In future work, we will investigate other capsule functions, such as polynomial coefficients (as instantiation parameters), which come with theoretical guarantees. Another direction we will investigate is performing kernel density estimation within an end-to-end deep learning framework and understanding its theoretical significance. Lastly, we will also explore the alternative approach to managing the graph capsule vector dimension discussed in (Sabour et al., 2017).
References
 Atwood & Towsley (2016) Atwood, James and Towsley, Don. Diffusion-convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1993–2001, 2016.
 Borgwardt & Kriegel (2005) Borgwardt, Karsten M and Kriegel, Hans-Peter. Shortest-path kernels on graphs. In Data Mining, Fifth IEEE International Conference on, pp. 8–pp. IEEE, 2005.
 Bruna et al. (2013) Bruna, Joan, Zaremba, Wojciech, Szlam, Arthur, and LeCun, Yann. Spectral networks and locally connected networks on graphs. arXiv preprint arXiv:1312.6203, 2013.
 Chang & Lin (2011) Chang, Chih-Chung and Lin, Chih-Jen. LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3):27, 2011.
 Dai et al. (2016) Dai, Hanjun, Dai, Bo, and Song, Le. Discriminative embeddings of latent variable models for structured data. In International Conference on Machine Learning, pp. 2702–2711, 2016.
 de Vries (2013) de Vries, Gerben KD. A fast approximation of the Weisfeiler-Lehman graph kernel for RDF data. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 606–621. Springer, 2013.
 Defferrard et al. (2016) Defferrard, Michaël, Bresson, Xavier, and Vandergheynst, Pierre. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems, pp. 3837–3845, 2016.
 Duvenaud et al. (2015) Duvenaud, David K, Maclaurin, Dougal, Iparraguirre, Jorge, Bombarell, Rafael, Hirzel, Timothy, Aspuru-Guzik, Alán, and Adams, Ryan P. Convolutional networks on graphs for learning molecular fingerprints. In Advances in Neural Information Processing Systems, pp. 2224–2232, 2015.
 Feragen et al. (2013) Feragen, Aasa, Kasenburg, Niklas, Petersen, Jens, de Bruijne, Marleen, and Borgwardt, Karsten. Scalable kernels for graphs with continuous attributes. In Advances in Neural Information Processing Systems, pp. 216–224, 2013.
 García-Durán & Niepert (2017) García-Durán, Alberto and Niepert, Mathias. Learning graph representations with embedding propagation. arXiv preprint arXiv:1710.03059, 2017.
 Gärtner et al. (2003) Gärtner, Thomas, Flach, Peter, and Wrobel, Stefan. On graph kernels: Hardness results and efficient alternatives. In Learning Theory and Kernel Machines, pp. 129–143. Springer, 2003.
 Gilmer et al. (2017) Gilmer, Justin, Schoenholz, Samuel S, Riley, Patrick F, Vinyals, Oriol, and Dahl, George E. Neural message passing for quantum chemistry. arXiv preprint arXiv:1704.01212, 2017.
 Henaff et al. (2015) Henaff, Mikael, Bruna, Joan, and LeCun, Yann. Deep convolutional networks on graph-structured data. arXiv preprint arXiv:1506.05163, 2015.
 Hinton et al. (2011) Hinton, Geoffrey E, Krizhevsky, Alex, and Wang, Sida D. Transforming autoencoders. In International Conference on Artificial Neural Networks, pp. 44–51. Springer, 2011.
 Kashima et al. (2003) Kashima, Hisashi, Tsuda, Koji, and Inokuchi, Akihiro. Marginalized kernels between labeled graphs. In ICML, volume 3, pp. 321–328, 2003.
 Kipf & Welling (2016) Kipf, Thomas N and Welling, Max. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.
 Kondor & Borgwardt (2008) Kondor, Risi and Borgwardt, Karsten M. The skew spectrum of graphs. In Proceedings of the 25th international conference on Machine learning, pp. 496–503. ACM, 2008.
 Kondor & Jebara (2003) Kondor, Risi and Jebara, Tony. A kernel between sets of vectors. In Proceedings of the 20th International Conference on Machine Learning (ICML03), pp. 361–368, 2003.
 Kondor & Pan (2016) Kondor, Risi and Pan, Horace. The multiscale laplacian graph kernel. In Advances in Neural Information Processing Systems, pp. 2982–2990, 2016.
 Kondor et al. (2009) Kondor, Risi, Shervashidze, Nino, and Borgwardt, Karsten M. The graphlet spectrum. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 529–536. ACM, 2009.
 Kondor et al. (2018) Kondor, Risi, Son, Hy Truong, Pan, Horace, Anderson, Brandon, and Trivedi, Shubhendu. Covariant compositional networks for learning graphs. arXiv preprint arXiv:1801.02144, 2018.
 Kriege et al. (2016) Kriege, Nils M, Giscard, Pierre-Louis, and Wilson, Richard. On valid optimal assignment kernels and applications to graph classification. In Advances in Neural Information Processing Systems, pp. 1623–1631, 2016.
 Lei et al. (2017) Lei, Tao, Jin, Wengong, Barzilay, Regina, and Jaakkola, Tommi. Deriving neural architectures from sequence and graph kernels. arXiv preprint arXiv:1705.09037, 2017.
 Levie et al. (2017) Levie, Ron, Monti, Federico, Bresson, Xavier, and Bronstein, Michael M. CayleyNets: Graph convolutional neural networks with complex rational spectral filters. arXiv preprint arXiv:1705.07664, 2017.
 Montavon et al. (2012) Montavon, Grégoire, Hansen, Katja, Fazli, Siamac, Rupp, Matthias, Biegler, Franziska, Ziehe, Andreas, Tkatchenko, Alexandre, Lilienfeld, Anatole V, and Müller, Klaus-Robert. Learning invariant representations of molecules for atomization energy prediction. In Advances in Neural Information Processing Systems, pp. 440–448, 2012.
 Neumann et al. (2012) Neumann, Marion, Patricia, Novi, Garnett, Roman, and Kersting, Kristian. Efficient graph kernels by randomization. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 378–393. Springer, 2012.
 Niepert et al. (2016) Niepert, Mathias, Ahmed, Mohamed, and Kutzkov, Konstantin. Learning convolutional neural networks for graphs. In Proceedings of the 33rd annual international conference on machine learning. ACM, 2016.
 Orsini et al. (2015) Orsini, Francesco, Frasconi, Paolo, and De Raedt, Luc. Graph invariant kernels. In IJCAI, pp. 3756–3762, 2015.
 Pržulj (2007) Pržulj, Nataša. Biological network comparison using graphlet degree distribution. Bioinformatics, 23(2):e177–e183, 2007.
 Sabour et al. (2017) Sabour, Sara, Frosst, Nicholas, and Hinton, Geoffrey E. Dynamic routing between capsules. In Advances in Neural Information Processing Systems, pp. 3859–3869, 2017.
 Shervashidze et al. (2009) Shervashidze, Nino, Vishwanathan, SVN, Petri, Tobias, Mehlhorn, Kurt, and Borgwardt, Karsten M. Efficient graphlet kernels for large graph comparison. In AISTATS, volume 5, pp. 488–495, 2009.
 Shervashidze et al. (2011) Shervashidze, Nino, Schweitzer, Pascal, Leeuwen, Erik Jan van, Mehlhorn, Kurt, and Borgwardt, Karsten M. Weisfeiler-Lehman graph kernels. Journal of Machine Learning Research, 12(Sep):2539–2561, 2011.
 Shuman et al. (2013) Shuman, David I, Narang, Sunil K, Frossard, Pascal, Ortega, Antonio, and Vandergheynst, Pierre. The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains. IEEE Signal Processing Magazine, 30(3):83–98, 2013.
 Simonovsky & Komodakis (2017) Simonovsky, Martin and Komodakis, Nikos. Dynamic edge-conditioned filters in convolutional neural networks on graphs. In Proc. CVPR, 2017.
 Verma & Zhang (2017) Verma, Saurabh and Zhang, Zhi-Li. Hunt for the unique, stable, sparse and fast feature learning on graphs. In Advances in Neural Information Processing Systems, pp. 87–97, 2017.
 Yanardag & Vishwanathan (2015) Yanardag, Pinar and Vishwanathan, SVN. Deep graph kernels. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1365–1374. ACM, 2015.
 Zhang et al. (2018) Zhang, Muhan, Cui, Zhicheng, Neumann, Marion, and Chen, Yixin. An end-to-end deep learning architecture for graph classification. In AAAI, pp. 4438–4445, 2018.
 Zhao et al. (2018) Zhao, Xiaohan, Zong, Bo, Guan, Ziyu, Zhang, Kai, and Zhao, Wei. Substructure assembling network for graph classification. 2018.