On the choice of graph neural network architectures
Seminal works on graph neural networks have primarily targeted semi-supervised node classification problems with few observed labels and high-dimensional signals. With the development of graph networks, this setup has become a de facto benchmark for a significant body of research. Interestingly, several works have recently shown that graph neural networks do not perform much better than predefined low-pass filters followed by a linear classifier in these particular settings. However, when learning with little data in a high-dimensional space, it is not surprising that simple and heavily regularized learning methods are near-optimal. In this paper, we show empirically that in settings with fewer features and more training data, more complex graph networks significantly outperform simpler architectures, and we propose a few insights towards the proper choice of graph neural network architectures. We finally outline the importance of using sufficiently diverse benchmarks (including lower-dimensional signals as well) when designing and studying new types of graph neural networks.
École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland
Graph neural networks, Semi-supervised learning, High-dimensional classification, Node classification
1 Introduction
Machine learning on graphs has found many applications in domains such as recommender systems [vdberg2017graph], quantum chemistry [schutt2018schnet] or reinforcement learning [bapst2019structured]. Recently, this research field has undergone a new technological revolution with the introduction of graph convolutional networks, which made it possible to extend the success of deep learning methods to irregular domains [bronstein2017gdl]. However, there is no single definition of graph convolution, and different methods have been proposed for different applications.
Convolutional networks for graphs were first introduced in the spectral domain using the graph Fourier transform [henaff2015deep]. This required an eigendecomposition of the graph Laplacian, which is not feasible for large graphs. In [defferrard2016convolutional, khasanova:ICML2017], reparameterizations of the filters by Laplacian polynomials were proposed. This allowed efficient computations and opened the way to a large variety of applications. In order to benchmark their method, the authors of [defferrard2016convolutional] tackled graph classification, a task that requires graph convolutions but also graph coarsening. This significantly complicates the method, thus making it more difficult to compare different types of convolutions. This methodological problem was simplified by [kipf2016semi], who designed a simpler version of Laplacian polynomials called graph convolutional networks (GCN), and proposed to evaluate instead on node classification, which does not require coarsening. Since node classification allows for simpler architectures, it became the reference task to compare graph networks. Since then, neural networks of increasing complexity have been designed [monti2018motifnet, monti2018dual, velivckovic2017graph] and studied in a similar context.
However, the specific choice of dataset has strong implications for the evaluation of different methods. For example, the Planetoid citation dataset in [kipf2016semi] consists of a citation network between scientific papers, where each node has a signal corresponding to a bag-of-words representation of the paper. Indeed, in standard settings, simple graph networks perform on par with sophisticated networks when hyperparameters are tuned fairly [shchur2018pitfalls]. Hence, some works [wu2019simplifying, maehara2019revisiting, klicpera2018predict] have advocated for simplifying GCNs even further, claiming that propagation and learning can actually be decoupled and that linear classification methods are powerful enough to solve this task.
In this work, we hypothesize that the success of simplified graph neural network architectures can actually be explained by the statistics of the specific benchmark datasets. Indeed, semi-supervised learning on graphs with $n$ observed nodes and $d$-dimensional graph signals can be seen, in a first approximation, as a classification problem with $n$ training examples of dimension $d$. When $n \ll d$, it is known that heavily regularized linear models are optimal, and $n/d$ is typically of the order of $0.1$ in standard benchmark datasets. By varying the number of training points or features in the dataset, we show experimentally that more complex models become more effective when the ratio $n/d$ grows. This confirms that different benchmarks are necessary to evaluate the performance of graph network architectures and develop general insights.
Finally, we provide some insights towards the proper design of graph neural networks. In particular, we confirm that decoupling propagation and learning can indeed achieve good results even when more data is available. This provides a computational gain, since propagation can be treated as a preprocessing step, and also better interpretability: when propagation and learning are decoupled, propagation can be interpreted as filtering in the graph spectral domain [ortega2018gsp].
2 Problem statement
We consider a graph $\mathcal{G} = (\mathcal{V}, A)$, where $\mathcal{V}$ represents a set of $N$ nodes and $A \in \mathbb{R}^{N \times N}$ is a weighted adjacency matrix defining their connections. We denote by $D = \operatorname{diag}(A \mathbf{1})$, with $\mathbf{1}$ being the ones vector, the matrix containing the degrees of all nodes in its main diagonal. We assume that each node $i$ carries a signal or feature vector $x_i \in \mathbb{R}^{d}$ and possibly a corresponding label $y_i$. These feature vectors are aggregated in $X \in \mathbb{R}^{N \times d}$.
The goal of node classification is the following: given $A$, $X$ and a small subset of labels, one tries to predict the value of the remaining labels of the graph. Node classification can be viewed as an example of semi-supervised learning [chapelle2009semi], where unlabeled data observed during training can be leveraged to achieve better performance. Graph neural networks have recently emerged as the most popular framework to address node classification. However, there is now a plethora of different graph neural network architectures [wu2019comprehensive], and it is important to develop insights about the proper design choice for a given dataset.
In supervised learning, the choice of the model is typically guided by complexity measures such as the VC dimension. In particular, two crucial parameters are the number of training points $n$ and the dimension of the space $d$: more complex classes of functions can be used when $n$ is large and $d$ is small. In semi-supervised learning, however, we currently do not have such complexity measures. As a first approximation, we can consider only the labeled nodes and view the problem as the classification of points in dimension $d$ given $n$ training examples. The ratio $n/d$ is however extremely small for common benchmark datasets (cf. Table 1). Using insights from supervised learning, we would expect heavily regularized linear methods to be optimal, or close to it, for this classification problem. This has recently been verified in [wu2019simplifying]. However, the very specific properties of the standard benchmarks do not allow us to confidently extend these insights to other settings. In particular, we believe that the good performance of simple graph neural networks is a consequence of the current standard benchmarks focusing on the small $n/d$ regime. Therefore, we make the following hypothesis:
Depending on the data structure, especially on the ratio $n/d$, graph neural networks of different complexity are appropriate, with no model being universally superior to all the others.
We will verify this hypothesis experimentally and show in particular that, in settings where $n/d$ is high, more complex methods perform significantly better.
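To give a feel for how small the ratio of labeled nodes to feature dimension can be, the following sketch computes it for three citation datasets. The numbers are assumptions of this example and are not taken from Table 1: it uses the commonly reported Planetoid statistics (classes and feature dimension) and assumes 20 labeled nodes per class. The names `datasets`, `LABELS_PER_CLASS` and `label_to_feature_ratio` are hypothetical.

```python
# Commonly reported Planetoid statistics (an assumption of this sketch).
datasets = {
    # name: (number of classes, feature dimension d)
    "Cora": (7, 1433),
    "Citeseer": (6, 3703),
    "Pubmed": (3, 500),
}

# Assumed labeling budget: 20 labeled training nodes per class.
LABELS_PER_CLASS = 20


def label_to_feature_ratio(num_classes, d, labels_per_class=LABELS_PER_CLASS):
    """Return (n, n/d), where n is the number of labeled training nodes."""
    n = num_classes * labels_per_class
    return n, n / d


for name, (num_classes, d) in datasets.items():
    n, ratio = label_to_feature_ratio(num_classes, d)
    print(f"{name}: n={n}, d={d}, n/d={ratio:.3f}")
```

Under these assumptions, every ratio is at most about 0.1, which is the regime where heavily regularized linear models are expected to do well.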
3 Graph neural networks
Graph neural networks define a general methodology to address node classification. They define a parameterized function $f_{\theta}(X, A)$ that can be efficiently trained by minimizing a relevant loss, e.g. cross-entropy or mean squared error, using some form of gradient descent. To simplify learning and introduce some form of domain prior, virtually all graph neural networks are formed using a composition of i) propagation steps $\mathcal{P}$, which are applied column-wise to $X$ using the structure of $A$, and ii) feature extractors $\mathcal{F}$, which act row-wise on $X$. Compared to other approaches, graph neural networks thus use not only the information contained in the node features, but also the graph structure encoded in the adjacency.
In this sense, different types of graph neural networks only differ in the way they define $\mathcal{P}$ and $\mathcal{F}$, and in the way these are composed. We briefly review the most relevant graph neural networks for this work.
Graph convolutional networks (GCN)
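Without any doubt, the de facto standard graph neural networks are GCNs [kipf2016semi]. They consist of the consecutive application of one-hop aggregation steps $\mathcal{P}(X) = \hat{A} X$, based on the normalized adjacency matrix with self-loops $\hat{A} = (D + I)^{-1/2} (A + I) (D + I)^{-1/2}$, and simple non-linear feature extractors $\mathcal{F}_{\Theta}(X) = \sigma(X \Theta)$, where $\Theta$ represents a set of coefficients to be learned, and $\sigma$ is some point-wise non-linearity, e.g., a Rectified Linear Unit (ReLU). In practice, these two steps are applied sequentially, leading to
$$H^{(k)} = \sigma\big(\hat{A}\, H^{(k-1)}\, \Theta^{(k)}\big), \quad k = 1, \dots, K, \quad H^{(0)} = X. \tag{1}$$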
Although GCNs were among the earliest methods to be introduced, more complex extensions did not lead to significant improvements in classification accuracy on standard datasets [shchur2018pitfalls]. For this reason, a recent trend has been to try to simplify GCNs.
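The alternation of propagation and feature extraction in a GCN can be sketched in a few lines of NumPy. This is a minimal illustration on a dense toy graph, not the implementation used in the experiments: it assumes ReLU for the hidden non-linearity and a softmax output layer (as in [kipf2016semi]), and the helper names `normalized_adjacency`, `softmax` and `gcn_forward` are hypothetical.

```python
import numpy as np


def normalized_adjacency(A):
    """Return (D + I)^{-1/2} (A + I) (D + I)^{-1/2} for a dense adjacency A."""
    A_tilde = A + np.eye(A.shape[0])      # add self-loops
    deg = A_tilde.sum(axis=1)             # degrees including the self-loop
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    return d_inv_sqrt[:, None] * A_tilde * d_inv_sqrt[None, :]


def softmax(H):
    """Row-wise softmax, shifted for numerical stability."""
    E = np.exp(H - H.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)


def gcn_forward(A, X, weights):
    """Alternate propagation (A_hat @ H) and feature extraction (H @ Theta)."""
    A_hat = normalized_adjacency(A)
    H = X
    for Theta in weights[:-1]:
        H = np.maximum(A_hat @ H @ Theta, 0.0)   # ReLU hidden layers
    return softmax(A_hat @ H @ weights[-1])      # class probabilities


# Toy example: a path graph with 4 nodes, 5 features, 3 classes.
rng = np.random.default_rng(0)
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
X = rng.normal(size=(4, 5))
weights = [rng.normal(size=(5, 8)), rng.normal(size=(8, 3))]
probs = gcn_forward(A, X, weights)
print(probs.shape)
```

Note that each layer needs the full graph, since the propagation step mixes information across nodes before the next feature extraction is applied.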
Simplified graph convolutions (SGC)
Following this line of thought, the authors of [wu2019simplifying] propose to simplify the structure in (1) by removing the intermediate non-linearities, i.e.,
$$\hat{Y} = \operatorname{softmax}\big(\hat{A}^{K} X \Theta\big). \tag{2}$$
Hence, they reduce the propagation to $\mathcal{P}(X) = \hat{A}^{K} X$ and propose using a single layer of $\mathcal{F}_{\Theta}(X) = \operatorname{softmax}(X \Theta)$. Note that, by doing this, they effectively simplify the learning task to a semi-supervised logistic regression trained on low-pass filtered features (it can be shown that the operation $\hat{A}^{K} X$ is equivalent to a low-pass graph filter of the normalized Laplacian [wu2019simplifying, ortega2018gsp]). Despite this simplicity, SGC results on the standard benchmarks are still on par with those obtained by GCNs and more complex methods.
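The collapse from a multi-layer GCN to SGC can be checked numerically: once the intermediate non-linearities are removed, two propagation steps with two weight matrices are equivalent to propagating twice and applying a single weight matrix. The sketch below is only an illustration on a toy ring graph; the function name `sgc_features` and all variable names are hypothetical.

```python
import numpy as np


def sgc_features(A, X, K=2):
    """Return A_hat^K X, the parameter-free propagated features used by SGC."""
    A_tilde = A + np.eye(A.shape[0])                  # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
    A_hat = d_inv_sqrt[:, None] * A_tilde * d_inv_sqrt[None, :]
    H = X
    for _ in range(K):                                # K one-hop aggregations
        H = A_hat @ H
    return H


# Toy ring graph with 5 nodes and 4 features per node.
rng = np.random.default_rng(0)
P = np.roll(np.eye(5), 1, axis=1)
A = P + P.T
X = rng.normal(size=(5, 4))
Theta1 = rng.normal(size=(4, 3))
Theta2 = rng.normal(size=(3, 2))

# Two-layer GCN with the non-linearity removed (pre-softmax logits).
A_hat = sgc_features(A, np.eye(5), K=1)               # recovers A_hat itself
linear_gcn_logits = A_hat @ (A_hat @ X @ Theta1) @ Theta2

# SGC: propagate twice, then apply a single merged weight matrix.
sgc_logits = sgc_features(A, X, K=2) @ (Theta1 @ Theta2)

print(np.allclose(linear_gcn_logits, sgc_logits))
```

Because the propagated features do not depend on any learned parameter, they can be computed once and reused for every training step of the logistic regression.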
Approximate personalized propagation of neural predictions (APPNP)
In theory, GCNs of the form (1) can be constructed using an arbitrary number of layers $K$. Similarly, the exponent parameter $K$ in SGCs can take an arbitrary value. However, in practice, high values of $K$ are never used. The reason is that multiple applications of the adjacency operator on the features tend to produce a strong smoothing effect on the signal used for learning, which harms classification accuracy. To circumvent this issue, and inspired by the success of the personalized PageRank algorithm, the authors of [klicpera2018predict] propose to modify the propagation strategy to
$$\mathcal{P}(X) = \alpha \big(I - (1 - \alpha) \hat{A}\big)^{-1} X, \tag{3}$$
where $\alpha \in (0, 1]$ is the teleport probability. They approximate this operation using fixed point iterations. Besides, they propose to replace $\mathcal{F}$ by a simple neural network and to reverse the order of $\mathcal{P}$ and $\mathcal{F}$ such that $\hat{Y} = \operatorname{softmax}\big(\mathcal{P}(\mathcal{F}_{\theta}(X))\big)$. With these modifications, APPNP became state-of-the-art for semi-supervised node classification on standard datasets.
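The fixed point approximation of the personalized PageRank propagation can be illustrated as follows. This is a sketch under assumptions: it uses the iteration $Z^{(k+1)} = (1 - \alpha) \hat{A} Z^{(k)} + \alpha H$ with $Z^{(0)} = H$ described in [klicpera2018predict], a dense toy graph, and arbitrary values for the teleport probability `alpha` and the number of iterations. The helper names are hypothetical.

```python
import numpy as np


def normalized_adjacency(A):
    """Return (D + I)^{-1/2} (A + I) (D + I)^{-1/2} for a dense adjacency A."""
    A_tilde = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
    return d_inv_sqrt[:, None] * A_tilde * d_inv_sqrt[None, :]


def ppr_exact(A_hat, H, alpha):
    """Closed form alpha * (I - (1 - alpha) A_hat)^{-1} H via a linear solve."""
    n = A_hat.shape[0]
    return alpha * np.linalg.solve(np.eye(n) - (1.0 - alpha) * A_hat, H)


def appnp_propagate(A_hat, H, alpha=0.1, num_iters=10):
    """Fixed point iteration Z <- (1 - alpha) A_hat Z + alpha H, from Z = H."""
    Z = H
    for _ in range(num_iters):
        Z = (1.0 - alpha) * (A_hat @ Z) + alpha * H
    return Z


# Toy ring graph with 6 nodes; H stands for the output of the feature extractor.
rng = np.random.default_rng(0)
P = np.roll(np.eye(6), 1, axis=1)
A_hat = normalized_adjacency(P + P.T)
H = rng.normal(size=(6, 3))

exact = ppr_exact(A_hat, H, alpha=0.2)
for k in (1, 10, 100):
    approx = appnp_propagate(A_hat, H, alpha=0.2, num_iters=k)
    print(k, np.abs(approx - exact).max())
```

The iteration only needs matrix products with $\hat{A}$, so it avoids forming the dense inverse, and the teleport term keeps part of the original signal at every step, which limits over-smoothing.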
Nevertheless, most performance reviews for graph networks [shchur2018pitfalls] have only been conducted on datasets with very similar properties. Hence, it is important that benchmarking results are considered beyond this specific context. By doing this, we will show that there is actually no universally superior method, and that, depending on the characteristics of the data at hand, various graph neural network architectures should be considered. In particular, we claim that the complexity of the propagation function and of the feature extractor should be tuned separately depending on the availability of data, the graph structure and the complexity of the features.
4 Experimental results
All our experiments are based on the experimental platform introduced in [shchur2018pitfalls], in which multiple graph neural networks can be tested against the standard benchmark datasets without the risk of overfitting to a particular training scenario. In particular, for each configuration of parameters we repeat experiments with different train-test splits (selected uniformly at random).
The main hypothesis we want to validate is whether, depending on the ratio of observed nodes to feature dimensionality $n/d$, the different methods introduced earlier show different behaviours. We are especially interested in the scenario in which $n/d$ is high, since this is different from the standard evaluation regime for graph neural networks. To tune $n/d$, we control independently the proportion of observed nodes and the dimensionality of the features. To avoid any bias introduced by feature selection, before every experiment we reduce the dimensionality of our feature matrix using a random sketching matrix, that is,
$$\tilde{X} = X S,$$
where $S \in \mathbb{R}^{d \times \tilde{d}}$ is a random matrix with entries drawn from a normal distribution, and $\tilde{d}$ is the target dimensionality we want to use. Having obtained $\tilde{X}$, we use the randomly scrambled features as inputs to the different methods.
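The random sketching step can be written in a couple of lines. The sketch below assumes standard normal entries without any rescaling (the text only states that the entries are drawn from a normal distribution) and uses hypothetical names (`random_sketch`, `d_target`, `seed`).

```python
import numpy as np


def random_sketch(X, d_target, seed=0):
    """Project the features X (N x d) to d_target dimensions with a Gaussian matrix."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    S = rng.normal(size=(d, d_target))   # assumed: i.i.d. standard normal entries
    return X @ S                          # each new feature mixes all original ones


# Toy example: 10 nodes with 50 features reduced to 8 scrambled features.
X = np.random.default_rng(1).normal(size=(10, 50))
X_sketched = random_sketch(X, d_target=8)
print(X_sketched.shape)
```

Because every output feature is a random combination of all input features, no subset of the original features is favoured, which is the point of using a sketch rather than selecting columns.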
4.1 Changing the dataset statistics
We compare the performance of a GCN and an SGC on different ratios $n/d$. Both networks are built using and we use their standard hyperparameters [shchur2018pitfalls]. We recall that in the standard setting $n/d$ is of the order of 0.1 for all common datasets, and that both methods perform equally well.
We test separately the effect of reducing $d$ (cf. Figure 1(a)) and increasing $n$ (cf. Figure 1(b)). Clearly, the two methods perform on par only in the small-data, high-dimensional regime. When we start to increase $n/d$, the model based on a GCN tends to perform significantly better than the SGC, and the previously reported similarity [wu2019simplifying] is only retained for the original configuration. These differences are rooted in the complexity of the two classifiers. Indeed, the GCN has the potential to fit more complex (e.g., non-linear) functions, and the bias in this estimation decreases when we increase $n/d$.
4.2 Decoupling feature extraction and propagation
In this experiment we test the importance of using non-linear classification methods and whether propagation and feature extraction need to be intertwined, i.e., whether multiple layers are needed. To this end, we compare the performance of the GCN and the SGC from the previous experiments. We also study the APPNP architecture, as well as an APPNP model (APPNP-MLP) where $\mathcal{P}$ and $\mathcal{F}$ are reversed, and where $\mathcal{F}$ is a two-layer neural network (MLP), and an SGC model in which we substitute the logistic regression by another two-layer MLP (SGC-MLP). Our configuration uses of the nodes to train and random features, in order to explore settings that are different from those of the standard benchmarks.
Table 2 summarizes the results. From this, it is clear that non-linear classification methods (e.g., GCN, APPNP-MLP) can indeed outperform linear ones (e.g., SGC). However, it seems that using multilayer architectures that intertwine propagation and feature extraction is not needed. Indeed, GCN and SGC-MLP actually perform similarly, while the only difference between the two methods is that $\mathcal{P}$ and $\mathcal{F}$ are alternated in GCN, while they are not in SGC-MLP. Finally, we see that having a complex propagation step, as in APPNP, is not enough to guarantee state-of-the-art performance, and that in high $n/d$ scenarios, non-linear modifications of it, i.e., APPNP-MLP, can boost accuracy.
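The practical consequence of decoupling can be illustrated with an SGC-MLP style model: propagation is computed once as a preprocessing step, and the feature extractor then acts on each node independently, so it can be evaluated on the labeled rows alone. The sketch below is an illustration under assumptions (dense toy graph, ReLU MLP, two propagation steps), and the names `propagate`, `mlp` and `labeled` are hypothetical.

```python
import numpy as np


def propagate(A, X, K=2):
    """Parameter-free propagation A_hat^K X, computed once before training."""
    A_tilde = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
    A_hat = d_inv_sqrt[:, None] * A_tilde * d_inv_sqrt[None, :]
    H = X
    for _ in range(K):
        H = A_hat @ H
    return H


def mlp(H, W1, W2):
    """Two-layer MLP applied independently to every row (node) of H."""
    return np.maximum(H @ W1, 0.0) @ W2


# Toy ring graph with 6 nodes, 4 features, 3 classes.
rng = np.random.default_rng(0)
P = np.roll(np.eye(6), 1, axis=1)
A = P + P.T
X = rng.normal(size=(6, 4))
W1 = rng.normal(size=(4, 8))
W2 = rng.normal(size=(8, 3))
labeled = np.array([0, 3])                 # indices of the labeled nodes

X_prop = propagate(A, X)                   # preprocessing, done a single time
logits_all = mlp(X_prop, W1, W2)           # predictions for every node
logits_labeled = mlp(X_prop[labeled], W1, W2)  # only rows needed by the loss

print(np.allclose(logits_all[labeled], logits_labeled))
```

In a GCN, by contrast, the second propagation is applied to hidden features that depend on the weights, so the graph has to be revisited at every training step.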
Overall, these results support our claim that there indeed exist non-trivial setups in which graph neural networks of different complexity perform best. Furthermore, it is clear that having the possibility to tune the complexity of $\mathcal{P}$ and $\mathcal{F}$ independently gives rise to a rich set of classification behaviours that can be optimized to better fit the structure of the data at hand. Our study further outlines that the choice of the benchmark dataset is critical when studying the performance of graph neural network architectures.
5 Conclusion
We have empirically demonstrated that the surprisingly good performance of simple graph neural networks reported in the recent literature is essentially driven by the characteristics of the benchmark datasets. In particular, we argue that the high dimensionality of the current benchmarks, combined with the scarcity of labels in the standard setup, renders simple methods based on feature smoothing and linear classification nearly optimal. However, for richer datasets (in terms of $n/d$), complex GNN models do outperform simpler ones. Moreover, tuning the complexity of the propagation and feature extraction individually makes it possible to optimize the inductive prior of graph networks with enough granularity to achieve state-of-the-art performance on a wide range of $n/d$. Overall, it is very important to also use benchmarks with lower-dimensional features and more observed nodes when designing new types of graph neural networks, and to properly adapt the combination of propagation steps and feature extractors to the properties of the target data.