# Residual Gated Graph ConvNets

## Abstract

Graph-structured data such as functional brain networks, social networks, gene regulatory networks, and communication networks have generated interest in generalizing neural networks to graph domains. In this paper, we are interested in designing efficient neural network architectures for graphs with variable length. Several existing works such as [28] have focused on recurrent neural networks (RNNs) to solve this task. A more recent and different approach was proposed in [29], where a vanilla graph convolutional neural network (ConvNet) was introduced. We believe the latter approach to be a better paradigm for solving graph learning problems because ConvNets are better suited to deep networks than RNNs. For this reason, we propose the most generic class of residual multi-layer graph ConvNets that make use of an edge gating mechanism, as proposed in [24]. Gated edges appear to be a natural property in the context of graph learning tasks, as the system has the ability to learn which edges are important or not for the task to solve. We apply several graph neural models to two basic network science tasks: subgraph matching and semi-supervised clustering for graphs with variable length. Numerical results show the performance of the new model.

## 1 Introduction

Convolutional neural networks of [20] and recurrent neural networks of [17] are deep learning architectures that have been applied with great success to computer vision (CV) and natural language processing (NLP) tasks. Such models require the data domain to be regular, such as 2D or 3D Euclidean grids for CV and 1D lines for NLP. Beyond CV and NLP, data does not usually lie on regular domains but on heterogeneous graph domains. Users on social networks, functional time series on brain structures, gene DNA on regulatory networks, and IP packets on telecommunication networks are a few examples that motivate the development of new neural network techniques that can be applied to graphs. One possible classification of these techniques is to distinguish neural network architectures for fixed length graphs from those for variable length graphs.

In the case of graphs with fixed length, a family of convolutional neural networks has been developed based on spectral graph theory [7]. The early work of [6] proposed to formulate graph convolutional operations in the spectral domain with the graph Laplacian, as an analogy of the Euclidean Fourier transform as proposed in [14]. This work was extended in [16] to smooth spectral filters for spatial localization. [10] used Chebyshev polynomials to achieve linear complexity for sparse graphs, [21] applied Cayley polynomials to focus on narrow-band frequencies, and [26] dealt with multiple (fixed) graphs. Finally, [19] simplified the spectral ConvNet architecture using 1-hop filters to solve the semi-supervised clustering task. For related works, see also [4], [3] and references therein.

For graphs with variable length, a generic formulation was proposed in [13] based on recurrent neural networks. The authors used a vanilla RNN whose update is defined by a multilayer perceptron. This work was extended in [22] using a GRU architecture and a hidden state that captures the average information in local neighborhoods of the graph. The work of [29] introduced a vanilla graph ConvNet and used this new architecture to solve learning communication tasks. [24] introduced an edge gating mechanism in graph ConvNets for semantic role labeling. Finally, [5] designed a network to learn non-linear approximations of the power of graph Laplacian operators, and applied it to the unsupervised graph clustering problem. Other works for drug design, computer graphics and vision are presented in [11].

In this work, we investigate new neural networks for graphs with arbitrary length. Section 2 introduces the related literature, Section 3 presents our new models, and Section 4 reports the numerical experiments.

## 2 Neural networks for graphs with arbitrary length

### 2.1 Recurrent neural networks

**Generic formulation.** Consider a standard RNN for word prediction in natural language processing. Let $h_i$ be the feature vector associated with word $i$ in the sequence. In a regular vanilla RNN, $h_i$ is computed with the feature vector $h_j$ from the previous step and the current word $x_i$, so we have:

$$h_i = f_{\text{VRNN}}\big(x_i, \{h_j : j = i-1\}\big).$$

The notion of neighborhood for regular RNNs is the previous step in the sequence. For graphs, the notion of neighborhood is given by the graph structure. If $h_i$ stands for the feature vector of vertex $i$, then the most generic version of a feature vector for a graph RNN is

$$h_i = f_{\text{G-RNN}}\big(x_i, \{h_j : j \to i\}\big),$$

where $x_i$ refers to a data vector and $\{h_j : j \to i\}$ denotes the set of feature vectors of the neighboring vertices. Observe that the set $\{h_j\}$ is unordered, meaning that $h_i$ is intrinsic, i.e. invariant by vertex re-indexing (no vertex matching between graphs is required). Other properties of $f_{\text{G-RNN}}$ are locality, as only neighbors of vertex $i$ are considered, weight sharing, and independence of the graph length. In summary, to define a feature vector in a graph RNN, one needs a mapping $f$ that takes as input an unordered set of vectors $\{h_j\}$, i.e. the feature vectors of all neighboring vertices, and a data vector $x_i$, Figure ?.

We refer to the mapping $f_{\text{G-RNN}}$ as the neighborhood transfer function in graph RNNs. In a regular RNN, each neighbor has a distinct position relative to the current word (1 position left from the center). In a graph, if the edges are not weighted or annotated, neighbors are not distinguishable. The only vertex which is special is the center vertex around which the neighborhood is built. This explains the generic formulation $h_i = f_{\text{G-RNN}}(x_i, \{h_j : j \to i\})$. This type of formalism for deep learning for graphs with variable length is described in [28] with slightly different terminology and notations.

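**Graph Neural Networks of [28].** The earliest work of graph RNNs for arbitrary graphs was introduced in [13]. The authors proposed to use a vanilla RNN with a multilayer perceptron to define the feature vector $h_i$:

$$h_i = f_{\text{G-VRNN}}\big(x_i, \{h_j : j \to i\}\big) = \sum_{j \to i} \mathcal{C}_{\text{G-VRNN}}(x_i, h_j),$$

with

$$\mathcal{C}_{\text{G-VRNN}}(x_i, h_j) = A\,\sigma\big(B\,\sigma(U x_i + V h_j)\big),$$

where $\sigma$ is the sigmoid function and $A, B, U, V$ are the weight parameters to learn.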

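As the equation defining $h_i$ does not have a closed-form solution, [28] used a fixed-point iterative scheme: for $t = 0, 1, 2, \ldots$

$$h_i^{t+1} = \sum_{j \to i} \mathcal{C}_{\text{G-VRNN}}(x_i, h_j^{t}), \qquad h_i^{t=0} = 0.$$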

The iterative scheme is guaranteed to converge as long as the mapping is contractive, which can be a strong assumption. Besides, a large number of iterations can be computationally expensive.
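The fixed-point scheme can be sketched in a few lines of NumPy. This is only an illustration, assuming the pairwise message has the form $A\,\sigma(B\,\sigma(U x_i + V h_j))$, a dense adjacency matrix with `adj[i, j] = 1` when $j \to i$, a zero initial state, and small random weights chosen so that the map is contractive. The function name `gnn_fixed_point`, the tolerance and all sizes are hypothetical choices, not taken from [28].

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gnn_fixed_point(x, adj, A, B, U, V, max_iters=100, tol=1e-8):
    """Iterate h_i <- sum_{j->i} A sigma(B sigma(U x_i + V h_j)) until it stalls.

    x   : (n, d_x) data vectors, one row per vertex
    adj : (n, n) array, adj[i, j] = 1 if j is a neighbor of i
    Returns the final features and the number of iterations performed.
    """
    n = x.shape[0]
    h = np.zeros((n, A.shape[0]))       # assumed initial state h_i = 0
    xu = x @ U.T                        # U x_i for every vertex i
    for t in range(max_iters):
        # inner[i, j] = sigma(U x_i + V h_j), shape (n, n, d_hidden)
        inner = sigmoid(xu[:, None, :] + (h @ V.T)[None, :, :])
        # msg[i, j] = A sigma(B inner[i, j])
        msg = sigmoid(inner @ B.T) @ A.T
        # keep only messages along existing edges and sum over neighbors j
        h_new = (adj[:, :, None] * msg).sum(axis=1)
        if np.max(np.abs(h_new - h)) < tol:
            return h_new, t + 1
        h = h_new
    return h, max_iters

# Illustrative setup: a ring of 5 vertices and small weights (hypothetical values).
rng = np.random.default_rng(0)
n, d_x, d_h = 5, 3, 4
adj = np.zeros((n, n))
for i in range(n):
    adj[i, (i + 1) % n] = 1.0
    adj[i, (i - 1) % n] = 1.0
x = rng.standard_normal((n, d_x))
A = 0.1 * rng.standard_normal((d_h, d_h))
B = 0.1 * rng.standard_normal((d_h, d_h))
U = 0.1 * rng.standard_normal((d_h, d_x))
V = 0.1 * rng.standard_normal((d_h, d_h))
h, iters = gnn_fixed_point(x, adj, A, B, U, V)
```

With larger weights the map may stop being contractive, in which case the loop simply runs until `max_iters`; this is the practical face of the convergence caveat above.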

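**Gated Graph Neural Networks of [22].** In this work, the authors use gated recurrent units (GRU) of [8]:

$$h_i = f_{\text{G-GRU}}\big(x_i, \{h_j : j \to i\}\big) = \mathcal{C}_{\text{G-GRU}}\Big(x_i, \sum_{j \to i} h_j\Big).$$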

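As previously, the equation defining $h_i$ does not have an analytical solution, and [22] designed the following iterative scheme:

$$h_i^{t+1} = \mathcal{C}_{\text{G-GRU}}(h_i^{t}, \bar{h}_i^{t}), \qquad h_i^{t=0} = x_i,$$

where $\bar{h}_i^{t} = \sum_{j \to i} h_j^{t}$, and $\mathcal{C}_{\text{G-GRU}}(h_i^{t}, \bar{h}_i^{t})$ is equal to

$$\begin{aligned}
z_i^{t+1} &= \sigma(U_z h_i^{t} + V_z \bar{h}_i^{t}), \\
r_i^{t+1} &= \sigma(U_r h_i^{t} + V_r \bar{h}_i^{t}), \\
\tilde{h}_i^{t+1} &= \tanh\big(U_h (h_i^{t} \odot r_i^{t+1}) + V_h \bar{h}_i^{t}\big), \\
h_i^{t+1} &= (1 - z_i^{t+1}) \odot h_i^{t} + z_i^{t+1} \odot \tilde{h}_i^{t+1},
\end{aligned}$$

where $\odot$ is the Hadamard point-wise multiplication operator. This model was used for NLP tasks in [22] and also in quantum chemistry by [12] for fast estimation of organic molecule properties, for which standard techniques (DFT) are computationally expensive.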
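As a rough illustration of this iterative scheme, the sketch below runs a standard GRU cell on each vertex, with the sum of the neighbors' features as the second input. It assumes a dense adjacency matrix with `adj[i, j] = 1` when $j \to i$, an initial state equal to the data vectors, and square weight matrices; the dictionary keys and the number of steps `T=3` are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def graph_gru(x, adj, W, T=3):
    """Run T GRU-style updates on every vertex of the graph.

    x   : (n, d) data vectors, used as the initial state
    adj : (n, n) array, adj[i, j] = 1 if j is a neighbor of i
    W   : dict of (d, d) weight matrices with keys Uz, Vz, Ur, Vr, Uh, Vh
    """
    h = x.copy()
    for _ in range(T):
        h_bar = adj @ h                                   # sum of neighbor features
        z = sigmoid(h @ W["Uz"].T + h_bar @ W["Vz"].T)    # update gate
        r = sigmoid(h @ W["Ur"].T + h_bar @ W["Vr"].T)    # reset gate
        h_tilde = np.tanh((h * r) @ W["Uh"].T + h_bar @ W["Vh"].T)
        h = (1.0 - z) * h + z * h_tilde                   # Hadamard-gated mix
    return h

# Illustrative setup: a path graph on 3 vertices with random weights.
rng = np.random.default_rng(0)
adj = np.array([[0.0, 1.0, 0.0],
                [1.0, 0.0, 1.0],
                [0.0, 1.0, 0.0]])
x = rng.standard_normal((3, 2))
names = ("Uz", "Vz", "Ur", "Vr", "Uh", "Vh")
W = {name: 0.5 * rng.standard_normal((2, 2)) for name in names}
h = graph_gru(x, adj, W, T=3)
```

A quick sanity check on the gating: with all weights set to zero, both gates equal 0.5 and the candidate state is zero, so each step halves the state.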

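**Tree-Structured LSTM of [30].** The authors extended the original LSTM model of [17] to a tree-graph structure:

$$h_i = f_{\text{T-LSTM}}\big(x_i, \{h_j : j \in C(i)\}\big),$$

where $C(i)$ refers to the set of children of node $i$. $f_{\text{T-LSTM}}(\cdot)$ is equal to

$$\begin{aligned}
\bar{h}_i &= \sum_{j \in C(i)} h_j, \\
i_i &= \sigma(U_i x_i + V_i \bar{h}_i), \\
o_i &= \sigma(U_o x_i + V_o \bar{h}_i), \\
\tilde{c}_i &= \tanh(U_c x_i + V_c \bar{h}_i), \\
f_{ij} &= \sigma(U_f x_i + V_f h_j), \\
c_i &= i_i \odot \tilde{c}_i + \sum_{j \in C(i)} f_{ij} \odot c_j, \\
h_i &= o_i \odot \tanh(c_i).
\end{aligned}$$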

Unlike [28], Tree-LSTM does not require an iterative process to update its feature vector $h_i$, as the tree structure is also a directed acyclic graph (DAG), like the original LSTM. Consequently, the feature representation can be updated with a recurrent formula. Nevertheless, a tree is a special case of graphs, and such a recurrence formula cannot be directly applied to an arbitrary graph structure. A key property of this model is the function $f_{ij}$, which acts as a gate on the edge from neighbor $j$ to vertex $i$. Given the task, the gate will open to let the information flow from neighbor $j$ to vertex $i$, or it will close to stop it. It seems to be an essential property for learning systems on graphs, as some neighbors can be irrelevant. For example, for the community detection task, the graph neural network should learn which neighbors to communicate with (same community) and which neighbors to ignore (different community). In different contexts, [9] added a gated mechanism inside regular ConvNets in order to improve language modeling for translation tasks, and [31] considered a gated unit with the convolutional layers after activation, and used it for image generation.

### 2.2 Convolutional neural networks

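**Generic formulation.** Consider now a classical ConvNet for computer vision. Let $h_{ij}^{\ell}$ denote the feature vector at layer $\ell$ associated with pixel $(i,j)$. In a regular ConvNet, $h_{ij}^{\ell+1}$ is obtained by applying a non-linear transformation to the feature vectors $h_{i'j'}^{\ell}$ for all pixels $(i',j')$ in a neighborhood of pixel $(i,j)$. For example, with $3 \times 3$ filters, we would have:

$$h_{ij}^{\ell+1} = f^{\ell}_{\text{CNN}}\big(\{h_{i'j'}^{\ell} : |i - i'| \le 1 \text{ and } |j - j'| \le 1\}\big).$$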

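In the above, the notation $\{h_{i'j'}^{\ell} : |i - i'| \le 1 \text{ and } |j - j'| \le 1\}$ denotes the concatenation of all feature vectors $h_{i'j'}^{\ell}$ belonging to the $3 \times 3$ neighborhood of vertex $(i,j)$. In ConvNets, the notion of neighborhood is given by the Euclidean distance. As previously noticed, for graphs, the notion of neighborhood is given by the graph structure. Thus, the most generic version of a feature vector $h_i$ at vertex $i$ for a graph ConvNet is

$$h_i^{\ell+1} = f^{\ell}_{\text{G-CNN}}\big(h_i^{\ell}, \{h_j^{\ell} : j \to i\}\big),$$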

where $\{h_j^{\ell} : j \to i\}$ denotes the set of feature vectors of the neighboring vertices. In other words, to define a graph ConvNet, one needs a mapping $f^{\ell}_{\text{G-CNN}}$ taking as input a vector $h_i^{\ell}$ (the feature vector of the center vertex) as well as an unordered set of vectors $\{h_j^{\ell}\}$ (the feature vectors of all neighboring vertices), Figure ?. We also refer to the mapping $f^{\ell}_{\text{G-CNN}}$ as the neighborhood transfer function. In a regular ConvNet, each neighbor has a distinct position relative to the center pixel (for example 1 pixel up and 1 pixel left from the center). As for graph RNNs, the only vertex which is special for graph ConvNets is the center vertex around which the neighborhood is built.

**CommNets of [29].** The authors introduced one of the simplest instantiations of a graph ConvNet with the following neighborhood transfer function:

$$h_i^{\ell+1} = f^{\ell}_{\text{G-VCNN}}\big(h_i^{\ell}, \{h_j^{\ell} : j \to i\}\big) = \text{ReLU}\Big(U^{\ell} h_i^{\ell} + V^{\ell} \sum_{j \to i} h_j^{\ell}\Big),$$

where $\ell$ denotes the layer level, and ReLU is the rectified linear unit. We will refer to this architecture as the vanilla graph ConvNet. [29] used this graph neural network to learn the communication between multiple agents to solve multiple tasks like traffic control.
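A minimal NumPy sketch of one vanilla graph ConvNet layer follows. It assumes the update is a ReLU applied to a linear map of the center vertex plus a linear map of the summed neighbor features, with a dense adjacency matrix where `adj[i, j] = 1` when $j \to i$. The function name and the toy sizes are hypothetical. The example also checks the intrinsic property discussed earlier: re-indexing the vertices only re-orders the output rows.

```python
import numpy as np

def vanilla_graph_conv(h, adj, U, V):
    """One layer: ReLU(U h_i + V * sum of neighbor features) for every vertex i.

    h   : (n, d_in) features at the current layer
    adj : (n, n) array, adj[i, j] = 1 if j is a neighbor of i
    U, V: (d_out, d_in) weight matrices shared by all vertices
    """
    neighbor_sum = adj @ h
    return np.maximum(0.0, h @ U.T + neighbor_sum @ V.T)

# Illustrative check of invariance to vertex re-indexing (random toy graph).
rng = np.random.default_rng(0)
n, d_in, d_out = 6, 3, 4
upper = np.triu(rng.random((n, n)) < 0.5, k=1)
adj = (upper | upper.T).astype(float)
h = rng.standard_normal((n, d_in))
U = rng.standard_normal((d_out, d_in))
V = rng.standard_normal((d_out, d_in))

out = vanilla_graph_conv(h, adj, U, V)
perm = rng.permutation(n)
out_perm = vanilla_graph_conv(h[perm], adj[np.ix_(perm, perm)], U, V)
```

Because the neighbors are aggregated with a sum, the same weights apply to graphs of any size, which is what makes the layer usable on graphs with variable length.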

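**Syntactic Graph Convolutional Networks of [24].** The authors proposed the following transfer function:

$$h_i^{\ell+1} = f^{\ell}_{\text{S-GCN}}\big(\{h_j^{\ell} : j \to i\}\big) = \text{ReLU}\Big(\sum_{j \to i} \eta_{ij} \odot V^{\ell} h_j^{\ell}\Big),$$

where $\eta_{ij}$ act as edge gates, and are computed by:

$$\eta_{ij} = \sigma\big(A^{\ell} h_i^{\ell} + B^{\ell} h_j^{\ell}\big).$$

These gated edges are very similar in spirit to the Tree-LSTM proposed in [30]. We believe this mechanism to be important for graphs, as the network will be able to learn which edges are important for the graph learning task to be solved.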
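The edge gating mechanism can be illustrated with a short NumPy sketch. It assumes one gate vector per directed edge, computed as a sigmoid of a linear function of the two endpoint features, which multiplies the transformed neighbor feature element-wise before the sum. The dense `(n, n, d)` gate tensor is a simplification for readability (it costs quadratic memory in the number of vertices), and the function name and sizes are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def edge_gated_layer(h, adj, A, B, V):
    """ReLU( sum_{j->i} eta_ij * (V h_j) ) with eta_ij = sigmoid(A h_i + B h_j).

    h      : (n, d_in) vertex features
    adj    : (n, n) array, adj[i, j] = 1 if j is a neighbor of i
    A, B, V: (d_out, d_in) weight matrices
    """
    # eta[i, j] is the gate on the edge from neighbor j to vertex i
    eta = sigmoid((h @ A.T)[:, None, :] + (h @ B.T)[None, :, :])
    messages = eta * (h @ V.T)[None, :, :]            # gated V h_j, indexed [i, j]
    aggregated = (adj[:, :, None] * messages).sum(axis=1)
    return np.maximum(0.0, aggregated)

# Illustrative setup: a triangle plus one pendant vertex.
rng = np.random.default_rng(0)
adj = np.array([[0, 1, 1, 0],
                [1, 0, 1, 0],
                [1, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)
h = rng.standard_normal((4, 3))
A = rng.standard_normal((5, 3))
B = rng.standard_normal((5, 3))
V = rng.standard_normal((5, 3))
out = edge_gated_layer(h, adj, A, B, V)
```

When the gate weights are zero, every gate equals 0.5 and the layer reduces to a plain neighbor sum scaled by one half, which makes the role of the gates easy to isolate.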

## 3 Models

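**Proposed Graph LSTM.** First, we propose to extend the Tree-LSTM of [30] to arbitrary graphs and multiple layers:

$$h_i^{\ell+1} = f^{\ell}_{\text{G-LSTM}}\big(x_i^{\ell}, \{h_j^{\ell} : j \to i\}\big).$$

As there is no recurrent formula in the general case of graphs, we proceed as in [28] and use an iterative process to solve this equation: at layer $\ell$, for $t = 0, 1, \ldots, T$,

$$\begin{aligned}
\bar{h}_i^{\ell,t} &= \sum_{j \to i} h_j^{\ell,t}, \\
i_i^{\ell,t+1} &= \sigma(U_i^{\ell} x_i^{\ell} + V_i^{\ell} \bar{h}_i^{\ell,t}), \\
o_i^{\ell,t+1} &= \sigma(U_o^{\ell} x_i^{\ell} + V_o^{\ell} \bar{h}_i^{\ell,t}), \\
\tilde{c}_i^{\ell,t+1} &= \tanh(U_c^{\ell} x_i^{\ell} + V_c^{\ell} \bar{h}_i^{\ell,t}), \\
f_{ij}^{\ell,t+1} &= \sigma(U_f^{\ell} x_i^{\ell} + V_f^{\ell} h_j^{\ell,t}), \\
c_i^{\ell,t+1} &= i_i^{\ell,t+1} \odot \tilde{c}_i^{\ell,t+1} + \sum_{j \to i} f_{ij}^{\ell,t+1} \odot c_j^{\ell,t}, \\
h_i^{\ell,t+1} &= o_i^{\ell,t+1} \odot \tanh(c_i^{\ell,t+1}),
\end{aligned}$$

with initial conditions $h_i^{\ell,t=0} = c_i^{\ell,t=0} = 0$, and inputs $x_i^{\ell} = h_i^{\ell-1,T}$, $x_i^{\ell=0} = x_i$.

In other words, the vector $h_i^{\ell+1}$ is computed by running the model from $t = 0, \ldots, T$ at layer $\ell$. It produces the vector $h_i^{\ell,t=T}$, which becomes $h_i^{\ell+1}$ and also the input $x_i^{\ell+1}$ for the next layer. The proposed Graph LSTM model differs from [23] mostly because the cell $\mathcal{C}_{\text{G-LSTM}}$ in these previous models is not iterated multiple times $T$, which reduces the performance of Graph LSTM (see numerical experiments on Figure 2).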

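**Proposed Gated Graph ConvNets.** We leverage the vanilla graph ConvNet architecture of [29] and the edge gating mechanism of [24] by considering the following model:

$$h_i^{\ell+1} = f^{\ell}_{\text{G-GCNN}}\big(h_i^{\ell}, \{h_j^{\ell} : j \to i\}\big) = \text{ReLU}\Big(U^{\ell} h_i^{\ell} + \sum_{j \to i} \eta_{ij} \odot V^{\ell} h_j^{\ell}\Big),$$

where $h_i^{\ell=0} = x_i$, and the edge gates $\eta_{ij} = \sigma(A^{\ell} h_i^{\ell} + B^{\ell} h_j^{\ell})$ are defined as in [24]. This model is the most generic formulation of a graph ConvNet (because it uses both the feature vector of the center vertex and the feature vectors of neighboring vertices) with the edge gating property.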

**Residual Gated Graph ConvNets.** In addition, we formulate a multi-layer gated graph ConvNet using residual networks (ResNets) introduced in [15]. This boils down to adding the identity operator between successive convolutional layers:

$$h_i^{\ell+1} = f^{\ell}\big(h_i^{\ell}, \{h_j^{\ell} : j \to i\}\big) + h_i^{\ell}.$$

As we will see, such a multi-layer strategy works very well for graph neural networks.
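Putting the pieces together, a residual gated graph ConvNet can be sketched as a stack of gated layers with an identity skip connection. The sketch assumes each layer computes a ReLU of a linear map of the center vertex plus the sum of edge-gated, linearly transformed neighbor features, and that all weight matrices are square so the skip connection is well defined. Batch normalization and training are left out, and the names and sizes are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_graph_conv(h, adj, U, V, A, B):
    """ReLU( U h_i + sum_{j->i} eta_ij * (V h_j) ), eta_ij = sigmoid(A h_i + B h_j)."""
    eta = sigmoid((h @ A.T)[:, None, :] + (h @ B.T)[None, :, :])   # (n, n, d)
    gated = adj[:, :, None] * eta * (h @ V.T)[None, :, :]          # zero off-edge
    return np.maximum(0.0, h @ U.T + gated.sum(axis=1))

def residual_gated_graph_convnet(x, adj, layers):
    """Stack gated layers, adding the identity between successive layers."""
    h = x
    for U, V, A, B in layers:
        h = gated_graph_conv(h, adj, U, V, A, B) + h   # residual connection
    return h

# Illustrative setup: 5 vertices on a ring, 3 layers of width 4.
rng = np.random.default_rng(0)
n, d, num_layers = 5, 4, 3
adj = np.zeros((n, n))
for i in range(n):
    adj[i, (i + 1) % n] = 1.0
    adj[i, (i - 1) % n] = 1.0
x = rng.standard_normal((n, d))
layers = [tuple(0.3 * rng.standard_normal((d, d)) for _ in range(4))
          for _ in range(num_layers)]
out = residual_gated_graph_convnet(x, adj, layers)
```

One consequence of the skip connection is visible directly: if a layer's weights are all zero, its gated output vanishes and the layer passes its input through unchanged, so extra layers cannot erase the signal at initialization.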

## 4 Experiments

### 4.1 Subgraph matching

We consider the subgraph matching problem presented in [28], Figure ?. The goal is to find the vertices of a given subgraph in larger graphs with variable sizes. Identifying similar localized patterns in different graphs is one of the most basic tasks for graph neural networks. The subgraph and larger graph are generated with the stochastic block model (SBM), see e.g. [1]. An SBM is a random graph which assigns communities to each node as follows: any two vertices are connected with probability $p$ if they belong to the same community, or with probability $q$ if they belong to different communities. For all experiments, we generate a subgraph of nodes with a SBM , and the signal on is generated with a uniform random distribution with a vocabulary of size , i.e. . Larger graphs are composed of communities with sizes randomly generated between 15 and 25. The SBM of each community is . The value of $q$, which acts as the noise level, is , unless otherwise specified. Finally, the signal on is also randomly generated between .
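For concreteness, sampling a graph from a stochastic block model can be sketched as follows. The function name `sample_sbm`, the number of communities and the value of the inter-community probability are illustrative assumptions only; the community sizes are drawn between 15 and 25 as described above.

```python
import numpy as np

def sample_sbm(sizes, p, q, rng):
    """Sample an undirected SBM graph.

    sizes : number of vertices in each community
    p     : connection probability inside a community
    q     : connection probability across communities
    Returns a symmetric 0/1 adjacency matrix and the community label of each vertex.
    """
    labels = np.repeat(np.arange(len(sizes)), sizes)
    n = labels.size
    same_community = labels[:, None] == labels[None, :]
    probs = np.where(same_community, p, q)
    # draw each unordered pair once (strict upper triangle), then symmetrize
    upper = np.triu(rng.random((n, n)) < probs, k=1)
    adj = (upper | upper.T).astype(int)
    return adj, labels

# Illustrative values: 4 communities, p = 0.5 and q = 0.1 are assumptions.
rng = np.random.default_rng(0)
sizes = rng.integers(15, 26, size=4)      # sizes between 15 and 25 inclusive
adj, labels = sample_sbm(sizes, p=0.5, q=0.1, rng=rng)
```

Raising `q` toward `p` blurs the community structure, which is why it acts as a noise level in the experiments.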

All reported results are averaged over 5 trials. We run 5 algorithms: Gated Graph Neural Networks of [22], CommNets of [29], SyntacticNets of [24], and the proposed Graph LSTM and Gated ConvNets from Section 3. We upgrade the existing models of [22], [29] and [24] with a multilayer version for [22] and with ResNets for all three architectures. We also use the batch normalization technique of [18] to speed up learning convergence for our algorithms, and also for [22]. The learning schedule is as follows: the maximum number of iterations, or equivalently the number of randomly generated graphs with the attached subgraph, is 5,000 and the learning rate is decreased by a factor if the loss averaged over 100 iterations does not decrease. The loss is the cross-entropy with 2 classes (the subgraph class and the class of the larger graph) respectively weighted by their sizes. The accuracy is the average of the diagonal of the normalized confusion matrix w.r.t. the cluster sizes (the confusion matrix measures the number of nodes correctly and badly classified for each class). We also report the time for a batch of 100 generated graphs. The choice of the architectures will be given for each experiment.

All algorithms are optimized as follows. We fix a budget of parameters of and a number of layers . The number of hidden neurons for each layer is automatically computed. Then we manually select the optimizer and learning rate for each architecture that best minimize the loss. For this task, [22] and our gated ConvNets work well with Adam and learning rate . Graph LSTM uses SGD with learning rate . Besides, the value of inner iterative steps for graph LSTM and [22] is .
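The accuracy measure described above, the average of the diagonal of the confusion matrix after normalizing each row by its class size, can be written as a small helper. The function name is hypothetical, and classes that do not appear in the ground truth are skipped, which is an assumption of this sketch.

```python
def class_balanced_accuracy(y_true, y_pred, num_classes):
    """Mean over classes of (correctly classified nodes) / (class size)."""
    # confusion[t][p] counts nodes of true class t predicted as class p
    confusion = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        confusion[t][p] += 1

    per_class = []
    for c in range(num_classes):
        class_size = sum(confusion[c])
        if class_size > 0:                      # skip classes with no nodes
            per_class.append(confusion[c][c] / class_size)
    return sum(per_class) / len(per_class)

# Toy example: 4 nodes of class 0 (one misclassified) and 1 node of class 1.
y_true = [0, 0, 0, 0, 1]
y_pred = [0, 0, 0, 1, 1]
acc = class_balanced_accuracy(y_true, y_pred, num_classes=2)   # (3/4 + 1/1) / 2
```

Compared with plain accuracy, this measure does not reward a model for labeling every node with the majority class, which matters here because the subgraph is much smaller than the graph it is attached to.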

The first experiment focuses on shallow graph neural networks, i.e. with a single layer. We also vary the level of noise, that is the probability $q$ in the SBM that connects two vertices in two different communities (the higher $q$, the more mixed the communities). The hyper-parameters are selected as follows. Besides , the budget is and the number of hidden neurons is automatically computed for each architecture to satisfy the budget. The first row of Figure 1 reports the accuracy and time for the five algorithms and for different levels of noise . RNN architectures are plotted in dashed lines and ConvNet architectures in solid lines. For shallow networks, all RNN architectures (graph LSTM and [22]) perform much better, but they also take more time than the graph ConvNet architectures we propose, as well as [29]. As expected, the performance of all algorithms decreases when the noise increases.

The second experiment demonstrates the importance of having multiple layers compared to shallow networks. We vary the number of layers and we fix the number of hidden neurons to . Notice that the budget is not the same for all architectures. The second row of Figure 1 reports the accuracy and time w.r.t. (the middle figure is a zoom of the left figure). All models clearly benefit from more layers, but RNN-based architectures see their performance decrease for a large number of layers. The ConvNet architectures benefit from large values, with the proposed graph ConvNet performing slightly better than [29]. Besides, all ConvNet models are faster than RNN models.

In the third experiment, we evaluate the algorithms for different budgets of parameters . For this experiment, we fix the number of layers and the number of neurons is automatically computed given the budget . The results are reported in the third row of Figure 1. For this task, the proposed graph ConvNet performs best for a large budget, while being faster than RNNs.

We also show the influence of the hyper-parameter $T$, the number of inner iterative steps, for [22] and the proposed graph LSTM. We fix , and . Figure 2 reports the results for . The value of $T$ has an undesirable impact on the performance of graph LSTM. Multi-layer [22] is not really influenced by $T$. The computational time naturally increases with larger values of $T$.

### 4.2 Semi-supervised clustering

Next, we consider the semi-supervised clustering problem, Figure ?. This is also a standard task in network science. For this work, it consists in finding communities on a graph given a single label for each community. This problem is more discriminative w.r.t. the architectures than the previous single pattern matching problem, where there were only 2 clusters to find (i.e. 50% random chance). For clustering, we have 10 clusters (around 10% random chance). As in the previous section, we use SBM to generate graphs of communities with variable length. The size for each community is randomly generated between and , and the label is randomly selected in each community. Probability $p$ is 0.5, and $q$ depends on the experiment. For this task, [22] and the proposed gated ConvNets work well with Adam and learning rate . Graph LSTM uses SGD with learning rate . The value of for graph LSTM and [22] is .

The same set of experiments as in the previous task is reported in Figure 3. ConvNet architectures become clearly better than RNNs when the number of layers increases (middle row), with the proposed Gated ConvNet outperforming the other architectures. For a fixed number of layers , our graph ConvNets and [24] perform best for all budgets, while paying a reasonable computational cost.

For the last experiment, we report the learning speed of the models. We fix , with being automatically computed to satisfy the budget. Figure 4 reports the accuracy w.r.t. time. The ConvNet architectures converge faster than RNNs, in particular for the semi-supervised task.

## 5 Conclusion

The first neural networks for graphs with arbitrary length focused on RNN-based architectures. But RNNs are less well suited to fully exploiting multi-layer architectures. On the other hand, simpler ConvNet-like architectures appear to be more suitable for building deep graph neural networks. This led us to develop the most generic residual formulation of gated graph ConvNets that has the potential to work with a large number of layers. Numerical experiments demonstrated different properties of these networks for two basic graph learning tasks. Future work will focus on solving more specific problems.

### References

[1] **Community Detection and Stochastic Block Models: Recent Developments.** E. Abbe. arXiv preprint arXiv:1703.10146.

[2] **Learning shape correspondence with anisotropic convolutional neural networks.** D. Boscaini, J. Masci, E. Rodolà, and M. M. Bronstein. In *Proc. NIPS*, 2016.

[3] **Tutorial on Geometric Deep Learning on Graphs and Manifolds.** M. Bronstein, X. Bresson, A. Szlam, J. Bruna, and Y. LeCun. Conference on Computer Vision and Pattern Recognition (CVPR).

[4] **Geometric deep learning: going beyond euclidean data.** M. M. Bronstein, J. Bruna, Y. LeCun, A. Szlam, and P. Vandergheynst. IEEE Signal Processing Magazine.

[5] **Community Detection with Graph Neural Networks.** J. Bruna and X. Li. arXiv preprint arXiv:1705.08415.

[6] **Spectral Networks and Deep Locally Connected Networks on Graphs.** J. Bruna, W. Zaremba, A. Szlam, and Y. LeCun. arXiv:1312.6203.

[7] **Spectral Graph Theory.** F. R. K. Chung. Volume 92.

[8] **Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling.** J. Chung, C. Gulcehre, K. Cho, and Y. Bengio. arXiv preprint arXiv:1412.3555.

[9] **Language Modeling with Gated Convolutional Networks.** Y. Dauphin, A. Fan, M. Auli, and D. Grangier. International Conference on Machine Learning (ICML).

[10] **Convolutional Neural Networks on Graphs with Fast Localized Spectral Filtering.** M. Defferrard, X. Bresson, and P. Vandergheynst. Advances in Neural Information Processing Systems (NIPS).

[11] **Convolutional networks on graphs for learning molecular fingerprints.** D. K. Duvenaud, D. Maclaurin, J. Iparraguirre, R. Bombarell, T. Hirzel, A. Aspuru-Guzik, and R. P. Adams. In *Proc. NIPS*, 2015.

[12] **Neural Message Passing for Quantum Chemistry.** J. Gilmer, S. Schoenholz, P. Riley, O. Vinyals, and G. Dahl. arXiv preprint arXiv:1704.01212.

[13] **A New Model for Learning in Graph Domains.** M. Gori, G. Monfardini, and F. Scarselli. IEEE Transactions on Neural Networks.

[14] **Wavelets on Graphs via Spectral Graph Theory.** D. Hammond, P. Vandergheynst, and R. Gribonval. Applied and Computational Harmonic Analysis.

[15] **Deep Residual Learning for Image Recognition.** K. He, X. Zhang, S. Ren, and J. Sun. Computer Vision and Pattern Recognition.

[16] **Deep Convolutional Networks on Graph-Structured Data.** M. Henaff, J. Bruna, and Y. LeCun. arXiv:1506.05163.

[17] **Long Short-Term Memory.** S. Hochreiter and J. Schmidhuber. Neural Computation.

[18] **Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift.** S. Ioffe and C. Szegedy. International Conference on Machine Learning (ICML).

[19] **Semi-Supervised Classification with Graph Convolutional Networks.** T. Kipf and M. Welling. International Conference on Learning Representations (ICLR).

[20] **Gradient-Based Learning Applied to Document Recognition.** Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. In *Proceedings of the IEEE, 86(11)*, pp. 2278–2324, 1998.

[21] **CayleyNets: Graph Convolutional Neural Networks with Complex Rational Spectral Filters.** R. Levie, F. Monti, X. Bresson, and M. M. Bronstein. arXiv preprint arXiv:1705.07664.

[22] **Gated Graph Sequence Neural Networks.** Y. Li, D. Tarlow, M. Brockschmidt, and R. Zemel. International Conference on Learning Representations (ICLR).

[23] **Semantic Object Parsing with graph LSTM.** X. Liang, X. Shen, J. Feng, L. Lin, and S. Yan. In *European Conference on Computer Vision (ECCV)*, pp. 125–143, 2016.

[24] **Encoding Sentences with Graph Convolutional Networks for Semantic Role Labeling.** D. Marcheggiani and I. Titov. arXiv preprint arXiv:1703.04826.

[25] **Geometric deep learning on graphs and manifolds using mixture model CNNs.** F. Monti, D. Boscaini, J. Masci, E. Rodolà, J. Svoboda, and M. M. Bronstein. In *Proc. CVPR*, 2017a.

[26] **Geometric Matrix Completion with Recurrent Multi-Graph Neural Networks.** F. Monti, M. M. Bronstein, and X. Bresson. Advances in Neural Information Processing Systems (NIPS).

[27] **Cross-Sentence N-ary Relation Extraction with Graph LSTMs.** N. Peng, H. Poon, C. Quirk, K. Toutanova, and W. T. Yih. Transactions of the Association for Computational Linguistics.

[28] **The Graph Neural Network Model.** F. Scarselli, M. Gori, A. Tsoi, M. Hagenbuchner, and G. Monfardini. IEEE Transactions on Neural Networks.

[29] **Learning Multiagent Communication with Backpropagation.** S. Sukhbaatar, A. Szlam, and R. Fergus. Advances in Neural Information Processing Systems (NIPS).

[30] **Improved Semantic Representations from Tree-Structured Long Short-Term Memory Networks.** K. Tai, R. Socher, and C. Manning. Association for Computational Linguistics (ACL).

[31] **Conditional Image Generation with PixelCNN Decoders.** van den Oord, N. Kalchbrenner, L. Espeholt, O. Vinyals, A. Graves, and K. Kavukcuoglu. In *Advances in Neural Information Processing Systems (NIPS)*, pp. 4790–4798, 2016.