###### Abstract

We address the problem of graph classification based only on structural information. Most standard methods require either pairwise comparisons of all graphs in the dataset or the extraction of ad-hoc features to perform classification. These methods respectively raise scalability issues when the number of samples in the dataset is large, and flexibility issues when the discriminative information is characterized by unusual features. Recent advances in neural network architectures offer new possibilities for graph analysis in terms of scalability and feature learning. In this paper, we propose a new sequential approach using recurrent neural networks (RNN). Our model sequentially embeds graph information, from which it models the final class membership probabilities. We also propose a regularization based on variational node prediction, which leads to better learning and generalization. We experimentally show that our model reaches state-of-the-art classification results on several common molecular datasets. Finally, we perform a qualitative analysis and give some insights into how the joint node prediction helps the model to better classify graphs.

Graph Classification with Recurrent Variational Neural Networks

Edouard Pineau Nathan de Lara

Telecom ParisTech - Safran Telecom ParisTech

## 1 Introduction

Many natural or synthetic systems have a natural graph representation where entities are described through their mutual connections: chemical compounds, social or biological networks for example. Therefore, automatic mining of such structures is useful in a variety of applications.

Graphs can be studied either individually, considering the nodes as samples, or collectively, each sample of the dataset being a graph object. Here we consider the latter case, applied to a classification task. This setting makes it difficult to leverage standard machine learning algorithms. Indeed, most of these algorithms take vectors of fixed size as inputs. In the case of graphs, usual representations such as the edge list or the adjacency matrix do not match this constraint. The size of the representation is graph dependent (number of edges in the first case, number of nodes squared in the second) and these representations are index dependent, i.e., up to the indexing of its nodes, the same graph admits several equivalent representations. Since in a classification task the label of a graph is independent of the indices of its nodes, the model used for prediction should be invariant to node ordering as well. On the other hand, the problem of variable-size inputs is well known in the field of natural language processing, as sentences have variable lengths. We draw inspiration from this field to overcome this difficulty.

In this paper, we propose a method to sequentially embed graph information to perform classification. By construction, our model overcomes the common difficulties listed above. In particular, the sequential modelling handles the graph-dependent size of the input. Although recurrent neural networks have the capacity to deal with large datasets, they remain time-consuming to train. We propose a regularization that leads to more efficient learning and better generalization, offering additional scalability to our model.

To address the problem, we use neither node attributes nor edge attributes. This way, we aim to reveal the intrinsic capacity of recurrent neural embedding for pattern recognition. Adding node information would obscure the origin of our model's performance.

## 2 Related work

Graph classification methods can schematically be divided into three categories: graph kernels, sequential methods and embedding methods. In this section, we briefly present these different approaches.

### 2.1 Kernel methods

Kernel methods (nikolentzos2017kernel; nikolentzos2017matching; nikolentzos2018degeneracy; neumann2016propagation) perform pairwise comparisons between the graphs of the dataset and apply a classifier, usually a support vector machine (SVM), to the similarity matrix. To keep the number of comparisons tractable when the number of graphs is large, they often use the Nyström algorithm (williams2001using) to compute a low-rank approximation of the similarity matrix. The key is to construct an efficient kernel that can be applied to graphs of varying sizes and captures useful features for the downstream classification. The Weisfeiler-Lehman subtree kernel (shervashidze2011weisfeiler) has proven to be very efficient for such tasks (yanardag2015deep), but it requires graphs with labeled nodes and is therefore not applicable to our unlabeled graph setting.

### 2.2 Sequential methods

Some random walk models are used for node classification (callut2008classification) or graph classification (xu2012protein) problems. The idea is to walk sequentially on a graph, one node at a time in a random fashion, and aggregate information. The graph is represented by a discrete-time Markov chain where each node is associated with a state of the chain, and each transition probability is proportional to the corresponding adjacency entry. More recently, jin2018learning and you2018graphrnn transform a graph into a sequence of fixed-size vectors. Each of these vectors is an embedding of one node of the graph. The sequence of embeddings is then fed to a recurrent neural network (RNN). The two main challenges in this kind of approach are the design of the embedding function for the nodes and the order in which the embeddings are given to the recurrent neural network.
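To make the Markov chain view of a random walk concrete, the sketch below builds transition probabilities proportional to adjacency and samples a short walk. It is only an illustration: the function names `transition_matrix` and `random_walk` are ours and do not come from the cited works.

```python
import random


def transition_matrix(adj):
    """Row-normalize an adjacency matrix into Markov transition probabilities."""
    rows = []
    for row in adj:
        degree = sum(row)
        # An isolated node has no outgoing transition; keep its row at zero.
        rows.append([a / degree if degree else 0.0 for a in row])
    return rows


def random_walk(adj, start, length, rng=random):
    """Sample a walk of at most `length` steps, moving to a uniformly chosen neighbor."""
    walk = [start]
    for _ in range(length):
        neighbors = [v for v, a in enumerate(adj[walk[-1]]) if a]
        if not neighbors:
            break  # dead end: the walk stops early
        walk.append(rng.choice(neighbors))
    return walk
```

For an unweighted graph, choosing a neighbor uniformly at random is the same as sampling from the corresponding row of the transition matrix.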

### 2.3 Embedding methods

Embedding methods (barnett2016feature; DBLP:journals/corr/NarayananCVCLJ17) derive a fixed number of features for each graph, which are used as a vector representation for classification. Some algorithms consider features based on the dynamics of random walks on the graph (gomez2017dynamics) while others are graphlet based (dutta2017high). Even though deriving a good set of features is often a difficult task, this approach has the benefit of being compatible with any standard classifier (SVM, random forest, multilayer perceptron) in a plug-and-play fashion.

## 3 Model

We propose to use a sequential approach to embed graphs with a variable number of nodes and edges into a vector space of a chosen dimension. This latent representation is then used for classification. Node index invariance is approximated through specific pre-processing and aggregation.

Let $G = (V, E)$ be an undirected and unweighted graph with $V$ a set of nodes and $E$ a set of edges. The graph $G$ can be represented, modulo any permutation $\pi$ over its nodes, by its boolean adjacency matrix $A^{\pi}$ such that $A^{\pi}_{ij} = 1$ if the nodes indexed by $i$ and $j$ are connected in the graph and $A^{\pi}_{ij} = 0$ otherwise. We use this adjacency matrix as a raw representation of the graph.
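The dependence of the adjacency matrix on node indexing can be seen on a three-node path. The sketch below is illustrative only; the helper names `adjacency_matrix` and `permute` are ours.

```python
def adjacency_matrix(n, edges):
    """Boolean adjacency matrix of an undirected, unweighted graph on n nodes."""
    a = [[0] * n for _ in range(n)]
    for i, j in edges:
        a[i][j] = 1
        a[j][i] = 1
    return a


def permute(adj, perm):
    """Re-index the nodes: new node i is old node perm[i]."""
    n = len(adj)
    return [[adj[perm[i]][perm[j]] for j in range(n)] for i in range(n)]


path = adjacency_matrix(3, [(0, 1), (1, 2)])   # path graph 0 - 1 - 2
relabeled = permute(path, [1, 0, 2])           # put the middle node first

print(path)       # [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
print(relabeled)  # [[0, 1, 1], [1, 0, 0], [1, 0, 0]]
```

Both matrices describe the same graph, yet they differ entry by entry, which is why a classifier fed with raw adjacency matrices has to be made robust to node ordering.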

Our model is a recurrent variational neural network classifier (RVNC), composed of three main parts: node ordering and embedding, classification and regularization with variational auto-regression (VAR), see figure 1 for an illustration. Each of these parts will be respectively detailed in subsections 3.1, 3.2 and 3.3.

### 3.1 Node ordering and embedding

Before being processed by the neural network, the adjacency matrix of a graph is transformed on-the-fly as in (you2018graphrnn). First, a node is selected at random and used as the root for a breadth-first search (BFS) over the graph. The rows and columns of the adjacency matrix are then reordered according to the sequence of nodes returned by the BFS. Next, each row $i$ (corresponding to the $i$th node in the BFS ordering) is truncated to keep only the connections of node $i$ with the $d$ nodes that preceded it in the BFS. This way, each node is $d$-dimensional, and each truncated matrix is zero padded in order to have dimensions $|V| \times d$. Throughout the rest of the paper, we use the notation for .
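The pre-processing described above can be sketched in a few lines of plain Python. This is a minimal illustration under our own assumptions, not the implementation used in the experiments: the names `bfs_order` and `truncated_bfs_sequence` are hypothetical, the truncation size is passed as `d`, the graph is assumed connected (nodes unreachable from the root are dropped), and the most recent predecessor is stored in the first column of each row.

```python
from collections import deque
import random


def bfs_order(adj, root):
    """Return the nodes reachable from `root` in breadth-first order."""
    n = len(adj)
    seen = [False] * n
    seen[root] = True
    order, queue = [], deque([root])
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in range(n):
            if adj[u][v] and not seen[v]:
                seen[v] = True
                queue.append(v)
    return order


def truncated_bfs_sequence(adj, d, root=None, rng=random):
    """Reorder `adj` by BFS and keep, for each node, its links to the d previous nodes."""
    if root is None:
        root = rng.randrange(len(adj))  # random root, as in the text
    order = bfs_order(adj, root)
    rows = []
    for i, u in enumerate(order):
        # Column k tells whether node i is linked to node i-1-k of the BFS order.
        # Columns with no corresponding predecessor stay at zero (zero padding).
        row = [0] * d
        for k in range(d):
            j = i - 1 - k
            if j >= 0 and adj[u][order[j]]:
                row[k] = 1
        rows.append(row)
    return order, rows
```

On a 4-cycle rooted at node 0 with `d = 2`, the BFS order is `[0, 1, 3, 2]` and the rows are `[0, 0]`, `[1, 0]`, `[0, 1]`, `[1, 1]`: the last visited node is linked to both of its two predecessors.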

After node ordering and pre-embedding, each graph is processed as a sequence of fixed-dimensional nodes by a gated recurrent unit (GRU) neural network (cho2014learning). The GRU is a special RNN able to learn long-term dependencies by mitigating the vanishing gradient effect. (The choice of GRU over Long Short-Term Memory networks is arbitrary, as they have equivalent long-term modeling power (chung2014empirical).)

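The GRU sequentially embeds each node $x_i$ by using the information contained in $x_i$ and in the memory cell $h_{i-1}$ with the recurrent process

$$
\begin{aligned}
z_i &= \sigma(W_z x_i + U_z h_{i-1}) \\
r_i &= \sigma(W_r x_i + U_r h_{i-1}) \\
\tilde{h}_i &= \tanh\big(W x_i + U (r_i \odot h_{i-1})\big) \\
h_i &= z_i \odot h_{i-1} + (1 - z_i) \odot \tilde{h}_i
\end{aligned}
$$

where $W_z$, $U_z$, $W_r$, $U_r$, $W$ and $U$ are GRU parameters, $\sigma$ is the logistic sigmoid and $\odot$ denotes element-wise vector multiplication. Classically, $h_0 = 0$.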
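A single step of a bias-free GRU, in the form given by cho2014learning, can be written with plain Python lists as follows. This is an illustrative sketch: the function `gru_step` and the parameter names are ours, and some references swap the roles of the update gate and its complement in the last line.

```python
import math


def _sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def _matvec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def gru_step(x, h_prev, params):
    """One GRU update: returns the new memory cell given input x and previous cell."""
    w_z, u_z, w_r, u_r, w, u = params
    # Update gate z and reset gate r, both in (0, 1).
    z = [_sigmoid(a + b) for a, b in zip(_matvec(w_z, x), _matvec(u_z, h_prev))]
    r = [_sigmoid(a + b) for a, b in zip(_matvec(w_r, x), _matvec(u_r, h_prev))]
    # Candidate state computed from the input and the reset-gated memory.
    gated = [ri * hi for ri, hi in zip(r, h_prev)]
    h_tilde = [math.tanh(a + b) for a, b in zip(_matvec(w, x), _matvec(u, gated))]
    # Convex combination of the previous memory and the candidate.
    return [zi * hp + (1.0 - zi) * ht for zi, hp, ht in zip(z, h_prev, h_tilde)]
```

Embedding a whole graph then amounts to starting from a zero memory cell and applying `gru_step` to each node vector in BFS order, keeping every intermediate cell.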

The embedded node sequence feeds both the VAR and the classifier as discussed in later sections.

The embedding part is illustrated in the top row of figure 2.

### 3.2 Classification

After the embedding step, we use an additional GRU dedicated to classification that takes the sequence of embedded nodes as input. Its last memory cell feeds a softmax multilayer perceptron (MLP) which performs class prediction.

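Formally, let $c$ be the class index and $y$ the true class of graph $G$. The classifier is trained by minimizing the cross-entropy loss

$$
\mathcal{L}_{\mathrm{clf}}(G, r) = -\sum_{c} \mathbb{1}[y = c] \, \log p_c(G, r)
$$

where $p(G, r)$ is the softmax class membership probability vector for a given graph $G$ that has been sorted by a BFS rooted at node $r$.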
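As a small numerical illustration of this loss, the sketch below computes softmax probabilities from raw class scores and the cross-entropy of the true class. The names `softmax` and `cross_entropy` and the use of raw scores ("logits") as input are our own choices for the example.

```python
import math


def softmax(logits):
    """Class membership probabilities from raw scores (shifted for stability)."""
    m = max(logits)
    exps = [math.exp(a - m) for a in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """Negative log-probability assigned to the true class `label`."""
    return -math.log(softmax(logits)[label])
```

With two equal scores the model is undecided and the loss equals $\log 2$; the loss tends to zero as the score of the true class dominates the others.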

As discriminating patterns might be spread within the whole graph, the network is required to model long-term dependencies. By construction, GRUs have such ability.

The classification part is illustrated in the middle row of figure 2.

### 3.3 Regularization with variational auto-regression

As the structure of a graph is the concatenation of the interactions between all nodes and their respective neighbors, learning a good representation without using node attributes requires that the model capture the structure of the graph while classifying. Moreover, we want the model to learn node representations as an exchangeable set to induce permutation invariance. To do so, we add an auto-regression block to our model: at each node, the network is asked to predict the adjacency of the next node. This task is handled separately from the recurrent classifier.

To this end, we use a variational auto-encoder (VAE) (kingma2013auto) to learn the representation of each node given the embedding of the nodes that precede it. This constitutes the variational auto-regressor (VAR). Such a representation for sequence classification has already been used for sentiment analysis (latif2017variational; xu2017variational). It is the natural language processing equivalent of predicting the next word of a sentence, given an aggregated representation of this sentence up to the current word.

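For each graph $G$ with embedded nodes $h_1, \dots, h_n$ (see 3.1), the variational auto-encoder takes each $h_i$ as input and is asked to predict $x_{i+1}$, the truncated adjacency vector of the next node. Let $z_i$ be the latent random variables for the model.

Training is done by minimizing the loss:

$$
\mathcal{L}_{\mathrm{VAR}} = \mathbb{E}_{\hat{p}(h_i)}\Big[ \mathbb{E}_{q_\phi(z_i \mid h_i)}\big[-\log p_\theta(x_{i+1} \mid z_i)\big] + \mathrm{KL}\big(q_\phi(z_i \mid h_i) \,\|\, p(z_i)\big) \Big] \qquad (1)
$$

which is an upper bound of the negative marginal log-likelihood $\mathbb{E}[-\log p_\theta(x_{i+1})]$. $p_\theta$ and $q_\phi$ are the respective densities of $x_{i+1} \mid z_i$ and $z_i \mid h_i$, whose distributions are parametrized by $\theta$ and $\phi$ respectively. KL denotes the Kullback-Leibler divergence, $\hat{p}$ is the empirical distribution of $h_i$ and $p$ is the density of the prior distribution of the latent variables $z_i$. We use the standard VAE prior distribution for $z_i$, namely the standard normal distribution $\mathcal{N}(0, I)$.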
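For intuition, a one-sample estimate of a VAE loss of this kind can be computed as below. The sketch assumes a diagonal Gaussian encoder (mean `mu`, log-variance `log_var`), a standard normal prior and a Bernoulli predictor over the binary adjacency vector; these modelling choices and all function names are assumptions made for the example.

```python
import math


def gaussian_kl(mu, log_var):
    """KL divergence between N(mu, diag(exp(log_var))) and N(0, I), in closed form."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var))


def bernoulli_nll(x, probs, eps=1e-12):
    """Negative log-likelihood of a binary vector x under independent Bernoullis."""
    total = 0.0
    for xi, p in zip(x, probs):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= xi * math.log(p) + (1 - xi) * math.log(1.0 - p)
    return total


def var_loss(x_next, probs, mu, log_var):
    """Reconstruction term plus KL term for one node.

    `probs` are the predictor outputs obtained from one latent sample.
    """
    return bernoulli_nll(x_next, probs) + gaussian_kl(mu, log_var)
```

The KL term vanishes when the encoder outputs exactly the prior (`mu = 0`, `log_var = 0`), so any information about the next node carried by the latent variable is paid for by this term.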

We note that when the first term in equation 1 is minimized, sampling from is almost sampling from plus residual global graph structural information backed by . Therefore, we obtain the following approximation:

resulting in

The distribution of is factorial with respect to the embedding of the nodes, conditionally on . More specifically, we found that the nodes of the graph are i.i.d. conditionally on a couple of variables . By de Finetti's representation theorem, nicely reviewed in serafino2016finetti, this is a sufficient condition for the sequence of nodes learned by our model to be exchangeable. This result is illustrated in figure 3.

In practice, the encoder density $q_\phi$ and the predictor density $p_\theta$ are modeled by neural networks parametrized by $\phi$ and $\theta$, which require differentiable functions for training. However, $p_\theta$ models a binary adjacency vector representing the connections between a node and the previously visited nodes. Therefore, we use a continuous relaxation of discrete sampling, the Gumbel trick (jang2016categorical), to train our neural-network-based model.
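For binary variables, a common form of this relaxation adds logistic noise (the difference of two Gumbel variables) to each logit and passes the result through a tempered sigmoid. The sketch below illustrates that idea; the function name `gumbel_sigmoid` and the `temperature` parameter are our assumptions and are not taken from the implementation used in the experiments.

```python
import math
import random


def _sigmoid(a):
    """Numerically stable logistic function."""
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)


def gumbel_sigmoid(logits, temperature=1.0, rng=random):
    """Relaxed Bernoulli samples in (0, 1), one per logit."""
    samples = []
    for a in logits:
        # Keep u strictly inside (0, 1) so that both logarithms are finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        noise = math.log(u) - math.log(1.0 - u)  # logistic noise
        samples.append(_sigmoid((a + noise) / temperature))
    return samples
```

As the temperature decreases, the samples concentrate near 0 and 1 and behave like hard binary draws, while remaining a differentiable function of the logits for a fixed noise value.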

The regularization part is illustrated in the bottom row of figure 2.

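In the end, the model is trained by minimizing the total loss

$$
\mathcal{L} = \mathcal{L}_{\mathrm{clf}} + \lambda \, \mathcal{L}_{\mathrm{VAR}}
$$

where $\mathcal{L}_{\mathrm{clf}}$ is the cross-entropy loss of section 3.2, $\mathcal{L}_{\mathrm{VAR}}$ is the loss of equation 1 and $\lambda$ is a hyper-parameter.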

### 3.4 Aggregation of the results for testing

The node ordering step introduces randomness in our model. On the one hand, it helps to learn more general graph representations during the training phase, but on the other hand, it might produce different outputs for the same graph during the testing phase, depending on the root of the BFS. In order to compensate for this side effect, we add the following aggregation step for the testing phase. Each graph $G$ is presented $N$ times to the model with different roots $r_1, \dots, r_N$ for BFS ordering. The class membership probability vectors are extracted and averaged. The average score vector is denoted $\bar{p}(G)$ and computed as follows:

$$
\bar{p}(G) = \frac{1}{N} \sum_{k=1}^{N} p(G, r_k)
$$

with $p(G, r_k)$ the class membership probability vector obtained with root $r_k$ and $\sum$ an element-wise sum.

This soft vote is repeated $M$ times, resulting in $M$ probability vectors for each graph $G$. The final class attributed to a graph corresponds to the highest probability among the $M$ vectors.

This second, hard vote selects the batch of votes for which the model is the most confident.
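The two-stage vote can be sketched as follows, assuming the model outputs are already available as lists of class probabilities. The names `soft_vote` and `aggregate_prediction` are hypothetical, and each inner list of `batches` stands for the outputs of one batch of BFS roots.

```python
def soft_vote(prob_vectors):
    """Element-wise average of several class membership probability vectors."""
    n = len(prob_vectors)
    return [sum(column) / n for column in zip(*prob_vectors)]


def aggregate_prediction(batches):
    """Average within each batch, then pick the class with the highest averaged probability."""
    averaged = [soft_vote(batch) for batch in batches]
    most_confident = max(averaged, key=max)  # hard vote over the soft votes
    return most_confident.index(max(most_confident))
```

For instance, with two batches whose averages are `[0.5, 0.5]` and `[0.2, 0.8]`, the second batch is the most confident and the graph is assigned to class 1.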

In the end, figure 2 provides a detailed illustration of our model.

## 4 Experiments

### 4.1 Datasets

We evaluated our model on four standard datasets from biology: Mutag (MT), Enzymes (EZ), Proteins Full (PF) and National Cancer Institute (NCI1) (KKMMN2016). All graphs represent chemical compounds: nodes are molecular substructures (typically atoms) and edges represent connections between these substructures (chemical bond or spatial proximity). In MT, the compounds are either mutagenic or not mutagenic. EZ contains tertiary structures of proteins from the 6 Enzyme Commission top-level classes; it is the only multiclass dataset of this paper. PF is a subset of the Dobson and Doig dataset representing secondary structures of proteins that are either enzymes or not. In NCI1, compounds either have an anti-cancer activity or not. Statistics about the graphs are presented in table 1.

Statistic | MT | EZ | PF | NCI1 |
---|---|---|---|---|
graphs | 188 | 600 | 1113 | 4110 |
classes | 2 | 6 | 2 | 2 |
bias | 0.66 | 0.17 | 0.60 | 0.5 |
avg. \|V\| | 18 | 33 | 39 | 29.9 |
avg. \|E\| | 39 | 124 | 146 | 64.6 |

### 4.2 Experimental setup

We divided MT, EZ, PF and NCI1 into respectively 3, 10, 10 and 10 folds such that the class proportions are preserved in each fold for all datasets (MT counts only 188 graphs, so a 10-fold cross-validation would not guarantee representative samples at test time). These folds are then used for cross-validation, i.e., one fold serves as the testing set while the other ones compose the training set. Results are averaged over all testing sets.
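One simple way to build folds that preserve class proportions is to shuffle the samples of each class and deal them to the folds in turn. The sketch below illustrates this; the function `stratified_folds` and its `seed` argument are our own and do not describe the exact splitting code used for the experiments.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k, seed=0):
    """Split sample indices into k folds with near-identical class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the samples of this class to the folds one at a time.
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds
```

Each fold is then used once as the testing set, the remaining folds forming the training set.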

Our model was implemented in PyTorch (paszke2017pytorch) and trained with the Adam stochastic optimization method (kingma2014adam) on an NVIDIA TitanXp GPU. The architecture of the model is summarized in table 2.

Step | Architecture |
---|---|
BFS | 1-layer FC. |
Embedding | 2-layer GRU. |
VAR (encoder) | 1-layer FC. Gaussian sampling. |
VAR (predictor) | 2-layer ReLU FC. Gumbel sigmoid sampling. |
Classifier | 2-layer GRU + DP(0.25). 2-layer ReLU FC + SF. |

The input size of the recurrent neural network is chosen for each dataset according to the algorithm described in (you2018graphrnn), namely 11 for MT, 25 for EZ, 80 for PF and 11 for NCI1. is set to . For training, batch size is set to 5, and the learning rate to , decayed by at iterations and . We shared the hyper-parameters of our networks between different datasets in order to avoid over-fitting and unveil the model capacities.

### 4.3 Results

We compare our results to those obtained by Earth Mover’s Distance (nikolentzos2017matching) (EMD), Pyramid Match (nikolentzos2017matching) (PM), Feature-Based (barnett2016feature) (FB), Dynamic-Based Features (gomez2017dynamics) (DyF) and Stochastic Graphlet Embedding (dutta2017high) (SGE). All values are taken directly from the aforementioned papers, as they used a setup similar to ours, with the exception of the number of folds for MT. For algorithms presenting results with and without node features, we report the results without node features. For algorithms presenting results with several sets of hyper-parameters, we report the results for the set of parameters that gave the best performance on the largest number of datasets. Results are reported in table 3. We obtain state-of-the-art results on three out of the four datasets used for the paper and the second best result on the last one.

Method | MT | EZ | PF | NCI1 |
---|---|---|---|---|
EMD | 86.1 | 36.8 | - | 72.7 |
PM | 85.6 | 28.2 | - | 69.7 |
FB | 84.7 | 29.0 | 70.0 | 62.9 |
DyF | 86.3 | 26.6 | 73.1 | 66.6 |
SGE | 87.3 | 40.7 | 71.9 | 83.2 |
RVNC | 88.3 | 48.4 | 74.8 | 80.7 |

### 4.4 Analysis

##### Node indexing invariance

A major requirement for our model is invariance to the node ordering of the graph with respect to different BFS roots. Inputs representing the same graph (up to node ordering) should be close in the latent embedding space. As the preprocessing is performed on each graph at each epoch, the same graph is seen many times by the model during training, each time with a different embedding. This creates a natural regularization for the network. Figure 3 illustrates this ability of our network, as the projections corresponding to the same graph form a cluster in the low-dimensional representation of the latent space.

##### Prediction contribution to classification

In order to demonstrate the benefit of training our model to both auto-regress and classify each sample, we ran some experiments removing the VAR regularization (weight $\lambda = 0$ in the total loss) on one 90/10 train-test split of the EZ dataset (the same proportions as in the experiments). We observe in figure 4 a more efficient training procedure and a faster generalization when $\lambda$ is positive. This allows the model to converge in less than one day on all the datasets we used.

##### Auto-regression quality

In order to provide some intuition about the node prediction capacities of our network, we show in figure 5 some graphs and their auto-regressed counterparts.

## 5 Conclusion

In this work, we introduced a recurrent neural network based embedding method for graphs. We applied our model to graph classification without node or edge attributes. As each graph can be processed individually, there is no scalability issue with respect to the number of samples in the dataset. Features are neither ad-hoc nor handcrafted but learned from the data. In the end, through joint training of the classification and prediction objectives, we obtained state-of-the-art results on standard benchmark datasets.

Besides, this model can easily be adapted to use exogenous information such as node or edge attributes. This could be addressed in future work.

#### Acknowledgments

We would like to thank NVIDIA and its GPU Grant Program for providing the hardware we used in our experiments. This work is supported by the company Safran through the CIFRE convention 2017/1317.