Geometric Graph Convolutional Neural Networks
Abstract
Graph Convolutional Networks (GCNs) have recently become the primary choice for learning from graph-structured data, superseding hash fingerprints in representing chemical compounds. However, GCNs lack the ability to take into account the ordering of node neighbors, even when there is a geometric interpretation of the graph vertices that provides an order based on their spatial positions. To remedy this issue, we propose the Geometric Graph Convolutional Network (geoGCN), which uses spatial features to efficiently learn from graphs that can be naturally located in space. Our contribution is threefold: we propose a GCN-inspired architecture which (i) leverages node positions, (ii) is a proper generalisation of both GCNs and Convolutional Neural Networks (CNNs), and (iii) benefits from augmentation, which further improves the performance and assures invariance with respect to the desired properties. Empirically, geoGCN outperforms state-of-the-art graph-based methods on image classification and chemical tasks.
Introduction
Convolutional Neural Networks (CNNs) outperform humans on visual learning tasks, such as image classification [krizhevsky2012imagenet], object detection [seferbekov2018feature] or image captioning [yang2017dense]. They have also been successfully applied to text processing [kim2014convolutional] and time series analysis [yang2015deep]. Nevertheless, CNNs cannot be easily adapted to irregular entities, such as graphs, where data representation is not organised in a grid-like structure.
Graph Convolutional Networks (GCNs) attempt to mimic CNNs by operating on spatially close neighbors. Motivated by spectral graph theory, Kipf and Welling [kipf2016semi] use fixed weights determined by the adjacency matrix of a graph to aggregate labels of the neighbors. Velickovic et al. [velivckovic2017graph] use an attention mechanism to learn the strength of these weights. In most cases, the design of new GCNs is based on empirical intuition and there has been little investigation regarding their theoretical properties [xu2018powerful]. In particular, there is no evident correspondence between classical CNNs and GCNs.
In many cases, graphs are coupled with a geometric structure. In medicinal chemistry, the three-dimensional structure of a chemical compound, called a molecular conformation, is essential in determining the activity of a drug towards a target protein (Figure 1). Similarly, in image processing tasks, pixels of an image are organised in a two-dimensional grid, which constitutes their geometric interpretation. However, standard GCNs do not take spatial positions of the nodes into account, which is a considerable difference between GCNs and CNNs. Moreover, in the case of images, geometric features make it possible to augment data with translation or rotation and significantly enlarge a given dataset, which is crucial when the number of examples is limited.
In this paper, we propose Geometric Graph Convolutional Networks (geoGCN), a variant of GCNs, which is a proper generalisation of CNNs to the case of graphs. In contrast to existing GCNs, geoGCN uses spatial features of nodes to aggregate information from the neighbors. On the one hand, this geometric interpretation is useful to model many real examples of graphs, such as graphs of chemical compounds. In this case, we are able to perform data augmentation by rotating a given graph in a spatial domain and, in consequence, improve network generalisation when the amount of data is limited. On the other hand, a single layer of geoGCN can be parametrised so that it returns a result identical to a standard convolutional layer on grid-like objects, such as images (see Theorem 1).
The proposed method was evaluated on various datasets and compared with state-of-the-art methods. We applied geoGCN to classify images and incomplete images represented as graphs. We also tested the proposed method on chemical benchmark datasets. Experiments demonstrate that combining spatial information with data augmentation leads to more accurate predictions.
Our contributions can be summarised as follows:

We show how to use geometric features (spatial coordinates) in GCNs.

We prove that geoGCN is a proper generalisation of GCNs and CNNs.

In contrast to existing approaches, geoGCN supports graph augmentation, which further improves the performance of the model.
Related work
The first instances of Graph Neural Networks were proposed in [gori2005new] and [scarselli2008graph]. These authors designed recursive neural networks which iteratively propagate node labels until reaching a stable fixed point. Recursive graph neural networks were further developed by [li2015gated], who used gated recurrent units for learning graph representations.
Recent approaches for graph processing rely on adapting convolutional neural networks to the graph domain (graph convolutional networks, GCNs). The first class of methods is based on the spectral representation of graphs. The authors of [bruna2013spectral], [henaff2015deep] and [defferrard2016convolutional] defined spectral filters which operate on the graph spectrum. Kipf and Welling [kipf2016semi] significantly simplified this process by restricting the neighborhood to only first-order neighbors. In this approach, convolutional layers followed by nonlinear activation functions were stacked to process graph structure sequentially. This work was also extended to higher-order neighborhoods [zhou20179higher]. In [li2018adaptive], the notion of neighborhood in GCNs was generalized and a distance metric for graphs was learned by spectral methods. The authors of [wu2019simplifying] showed that removing nonlinearities from GCNs further reduces their complexity without heavily affecting the performance. Unfortunately, spectral methods are domain-dependent, which means that GCNs trained on one graph cannot be trivially transferred to another graph with a different spectral structure.
The second variant of GCNs does not use the Laplacian basis to aggregate node neighbors but attempts to train a convolutional filter for this purpose. To deal with varied-sized neighborhoods and to preserve the parameter sharing property of CNNs, [duvenaud2015convolutional] used a specific weight matrix for each node degree. Subsequent work [hamilton2017inductive] used a sampling strategy to extract a fixed-size neighborhood. The authors of [monti2017geometric] used spatial features to construct convolutional filters. In contrast to our approach, they transform geometric features using predefined Gaussian kernels and do not focus on generalizing classical CNNs. In [velivckovic2017graph], multi-head self-attention was used to train individual weights for each pair of nodes. To account for edge similarities, which appears natural in the chemical domain, the authors of [shang2018edge] applied an attention mechanism for edges. In [gilmer2017neural], graph and distance information were integrated in a single model, which achieved strong performance on molecular property prediction benchmarks. Moreover, not only graph distances but also three-dimensional atom coordinates are useful in molecular predictions, as emphasized by [cho2018three], who introduced the 3DGCN architecture. They integrated a matrix of relative atom positions into the GCN architecture. However, 3DGCN is a chemistry-inspired model which does not aim to generalize CNNs.
While many variants of graph neural networks achieve impressive performance, their design is mostly based on empirical intuition and evaluation. The work of [xu2018powerful] investigates theoretical properties of neural networks operating on graphs. Based on the graph isomorphism test, they formally analyze the discriminative power of popular GNN variants [kipf2016semi], [hamilton2017inductive] and show that they cannot learn to distinguish certain simple graph structures. In a similar spirit, our geoGCN is a theoretically justified generalization of classical CNNs to the case of graphs.
Geometric graph convolutions
In this section, we introduce geoGCN. First, we recall a basic construction of standard GCNs. Next, we present the intuition behind our approach and formally introduce geoGCN. Finally, we discuss practical advantages of geometric graph convolutions.
Let $G = (V, A)$ be a graph, where $V = \{v_1, \ldots, v_n\}$ denotes a set of nodes (vertices) and $A = [a_{ij}]_{i,j=1}^{n}$ represents edges. We put $a_{ij} = 1$ if $v_i$ and $v_j$ are connected by a directed edge and $a_{ij} = 0$ if the edge is missing. Each node $v_i$ is represented by a $d$-dimensional feature vector $x_i \in \mathbb{R}^d$. Typically, graph convolutional neural networks transform these feature vectors over multiple subsequent layers to produce the final prediction.
Graph convolutions.
Let $H = [h_1, \ldots, h_n]$ denote the matrix of node features being an input to a convolutional layer, where $h_i \in \mathbb{R}^{d_{in}}$ are column vectors. The dimension of $h_i$ is determined by the number of filters used in the previous layer. Clearly, $X = [x_1, \ldots, x_n]$ is the input representation to the first layer.
A typical graph convolution is defined by combining two operations. For each node $v_i$, feature vectors of its neighbors $N_i = \{j : a_{ij} = 1\}$ are first aggregated:
(1) $\bar{h}_i = \sum_{j \in N_i} u_{ij} h_j.$
The weights $u_{ij} \in \mathbb{R}$ are either trainable ([velivckovic2017graph] applied an attention mechanism) or determined by $A$ ([kipf2016semi] motivated their selection using spectral graph theory).
Next, a standard MLP is applied to transform the intermediate representation $\bar{H} = [\bar{h}_1, \ldots, \bar{h}_n]$ into the final output of a given layer:
(2) $\mathrm{MLP}(\bar{H}; W) = \mathrm{ReLU}(W^T \bar{H} + b),$
where a trainable weight matrix $W = [w_1, \ldots, w_{d_{out}}]$ is defined by column vectors $w_j \in \mathbb{R}^{d_{in}}$. The number of columns $d_{out}$ of $W$ determines the dimension of the output feature vectors.
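The two steps (1) and (2) can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation: the function name `graph_conv`, the row-per-node layout of the feature matrix, and the default of using the adjacency matrix itself as aggregation weights are assumptions made for the example.

```python
import numpy as np

def graph_conv(H, A, W, b, U=None):
    """One graph convolution: neighbor aggregation (1), then ReLU(W^T h + b) (2).

    H: (n, d_in) node features, one row per node (assumed layout).
    A: (n, n) adjacency matrix, A[i, j] = 1 if j is a neighbor of i.
    W: (d_in, d_out) weight matrix, b: (d_out,) bias.
    U: optional (n, n) aggregation weights u_ij; defaults to A (plain sum).
    """
    U = A if U is None else U * A          # weights only count on existing edges
    H_bar = U @ H                          # (1): h_bar_i = sum_j u_ij h_j
    return np.maximum(H_bar @ W + b, 0.0)  # (2): ReLU applied row by row

# Toy path graph 0 - 1 - 2 with scalar node features.
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
H = np.array([[1.0], [2.0], [3.0]])
out = graph_conv(H, A, W=np.array([[1.0, -1.0]]), b=np.zeros(2))
```

With fixed weights taken from the adjacency matrix the aggregation has no trainable parameters; passing a learned `U` corresponds to the trainable-weight variant.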

Intuition behind geometric graph convolutions.
Classical GCNs operate on the neighborhood given by the adjacency matrix. In some applications, nodes are additionally described by spatial coordinates. For example, the position of each pixel can be expressed as a pair of integers. Analogously, every conformation of a chemical compound is a 3-dimensional geometrical graph, where each atom is located in space. The adjacency matrix is not able to preserve all information about the graph geometry. In particular, it is not possible to construct an analogue of classical convolution only from the adjacency matrix and feature vectors. In our approach, we show how to include this spatial information in graph convolutions to construct a proper generalization of classical convolutions.
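To proceed further, we need to introduce notation concerning convolutions (in the case of images). For simplicity we consider only convolutions without pooling. In general, given a mask $M = (m_{i'j'})_{i',j' \in \{-k,\ldots,k\}}$, its result on the image $H = (h_{ij})$ is given by
$M * H = G = (g_{ij}),$
where
$g_{ij} = \sum_{i'=-k}^{k} \sum_{j'=-k}^{k} m_{i'j'} \, h_{i+i',\, j+j'},$
with the sums restricted to indices that fall inside the image.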
To present an intuition behind our approach, let us show how to mimic a classical linear convolution based on graph representation of the image.
Example 1.
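For simplicity, let us consider a linear convolution given by the mask
$M = \begin{bmatrix} 0 & 0 & 1 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix}.$
Observe that as the result of this convolution on the image $H$, every pixel is replaced by its right upper neighbor, see Figure 2. Now we understand the image as a graph, where the neighborhood $N_p$ of the pixel with coordinates $p$ is given by the pixels with coordinates $q$ such that $q - p \in \{-1, 0, 1\}^2$.
Given a vector $u \in \mathbb{R}^2$ and a bias $b \in \mathbb{R}$ we can define the (intermediate) graph operation by
$\bar{H}(u, b)_p = \sum_{q \in N_p} \mathrm{ReLU}(u^T (q - p) + b) \, h_q.$
Consider now the case when $u = (1, 1)^T$ and $b = -1$. One can easily observe that
$\mathrm{ReLU}(u^T (q - p) + b) = 1$ if $q - p = (1, 1)^T$ (one step right and one step up), and $0$ otherwise,
where $q \in N_p$.
Consequently, we obtain that $\bar{H}(u, b)_p = h_{p + (1,1)^T}$, which equals the result of the considered linear convolution.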
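Example 1 can be checked numerically. The sketch below is an illustration under stated assumptions rather than the authors' code: it uses $u = (1, 1)^T$ and $b = -1$, one parameter choice for which the ReLU weight equals 1 on the right upper neighbor and 0 on every other neighbor, and it treats positions as Cartesian coordinates (x to the right, y upward). The helper name `geo_aggregate_image` is hypothetical.

```python
import numpy as np

def geo_aggregate_image(img, u, b):
    """Apply sum_q ReLU(u^T (q - p) + b) * h_q over the 3x3 neighborhood of each pixel p.

    Positions are Cartesian: x grows to the right (columns), y grows upward,
    so moving up by one decreases the row index. Out-of-image neighbors are skipped.
    """
    rows, cols = img.shape
    out = np.zeros_like(img, dtype=float)
    for r in range(rows):
        for c in range(cols):
            for dx in (-1, 0, 1):
                for dy in (-1, 0, 1):
                    rr, cc = r - dy, c + dx
                    if 0 <= rr < rows and 0 <= cc < cols:
                        weight = max(u[0] * dx + u[1] * dy + b, 0.0)
                        out[r, c] += weight * img[rr, cc]
    return out

img = np.arange(16, dtype=float).reshape(4, 4)
shifted = geo_aggregate_image(img, u=(1.0, 1.0), b=-1.0)

# Reference: every pixel replaced by its right upper neighbor (zero outside the image).
expected = np.zeros_like(img)
expected[1:, :-1] = img[:-1, 1:]
```

The graph operation and the reference shift agree on every pixel, including the border, where the missing neighbor simply contributes nothing.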
Example 2.
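Now, let us consider the mask, see Figure 3:
$M = \begin{bmatrix} 0 & 0 & 1 \\ 0 & 0 & 0 \\ 1 & 0 & 0 \end{bmatrix}.$
This convolution cannot be obtained from the graph representation using a single transformation as in the previous example.
To formulate this convolution, we define two intermediate operations for $i = 1, 2$:
$\bar{H}(u_i, b_i)_p = \sum_{q \in N_p} \mathrm{ReLU}(u_i^T (q - p) + b_i) \, h_q,$
where $u_1 = (1, 1)^T$, $b_1 = -1$ and $u_2 = (-1, -1)^T$, $b_2 = -1$. The first operation extracts the right upper corner, while the second one extracts the left bottom corner, i.e.
$\bar{H}(u_1, b_1)_p = h_{p + (1,1)^T}, \qquad \bar{H}(u_2, b_2)_p = h_{p - (1,1)^T},$
so that $M * H = \bar{H}(u_1, b_1) + \bar{H}(u_2, b_2)$.
As demonstrated in the above examples, classical linear convolutions can be obtained from graphs by an appropriate adaptation of (1) using spatial features. Based on this intuition, the precise formulation of the geometric graph convolution is presented in the following paragraph. The complete proof that every linear convolution can be rewritten using geoGCN is given in the next section.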
Geometric graph convolutions.
To formalize the above intuition, we define our geometric graph convolutions. We assume that each node $v_i$ is additionally identified with its coordinates $p_i \in \mathbb{R}^t$. In contrast to standard features $x_i$, we will not change $p_i$ across layers, but only use them to construct a better graph representation. For this purpose, we replace (1) by:
(3) $\bar{h}_i(u, b) = \sum_{j \in N_i} \mathrm{ReLU}(u^T (p_j - p_i) + b) \, h_j,$
where the sum runs over the neighbors $N_i$ of $v_i$, and $u \in \mathbb{R}^t$, $b \in \mathbb{R}$ are trainable. The pair $(u, b)$ plays the role of a convolutional filter which operates on the neighborhood of $v_i$. The relative positions in the neighborhood are transformed using a linear operation combined with the nonlinear ReLU function. The resulting scalar is used to weigh the feature vectors in the neighborhood.
By analogy with classical convolution, this transformation can be extended to multiple filters (as in Example 2). Let $U = [u_1, \ldots, u_k]$ and $b = [b_1, \ldots, b_k]$ define $k$ filters. The intermediate representation $\bar{h}_i$ is a vector defined by:
$\bar{h}_i(U, b) = \bar{h}_i(u_1, b_1) \oplus \cdots \oplus \bar{h}_i(u_k, b_k),$
where $\oplus$ denotes concatenation.
Finally, we apply the MLP transformation in the same manner as in (2) to transform these feature vectors.
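A compact NumPy sketch of one such layer follows. It is a hedged illustration, not the reference implementation: it assumes that the outputs of the filters are concatenated before the final ReLU layer, stores one node per row, and uses the hypothetical names `geo_graph_conv` and `c` (for the bias of the final layer).

```python
import numpy as np

def geo_graph_conv(H, P, A, U, b, W, c):
    """Sketch of one geometric graph convolution layer.

    H: (n, d_in) node features      P: (n, t) node coordinates (fixed across layers)
    A: (n, n) adjacency matrix      U: (t, k) filter directions, b: (k,) filter biases
    W: (k * d_in, d_out), c: (d_out,) parameters of the final ReLU layer
    """
    rel = P[None, :, :] - P[:, None, :]         # rel[i, j] = p_j - p_i, shape (n, n, t)
    gate = np.maximum(rel @ U + b, 0.0)         # ReLU(u_f^T (p_j - p_i) + b_f), (n, n, k)
    gate = gate * A[:, :, None]                 # keep only actual neighbors
    H_bar = np.einsum("ijf,jd->ifd", gate, H)   # per-filter weighted sum over neighbors
    H_bar = H_bar.reshape(H.shape[0], -1)       # concatenate the k filter outputs
    return np.maximum(H_bar @ W + c, 0.0)

rng = np.random.default_rng(0)
n, t, d_in, k, d_out = 5, 3, 4, 2, 6
H = rng.normal(size=(n, d_in))
P = rng.normal(size=(n, t))
A = (rng.random((n, n)) < 0.5).astype(float)
U = rng.normal(size=(t, k))
b = rng.normal(size=k)
W = rng.normal(size=(k * d_in, d_out))
c = np.zeros(d_out)
out = geo_graph_conv(H, P, A, U, b, W, c)
```

Because only the differences $p_j - p_i$ enter the layer, shifting all coordinates by the same vector leaves `out` unchanged, and setting all coordinates to zero turns every gate into the constant $\mathrm{ReLU}(b_f)$.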
Practical consequences.
In practice, the amount of training data is usually too small to provide sufficient generalization. To overcome this problem, one can perform data augmentation to produce more representative examples. In computer vision, data augmentation is straightforward and relies on rotating or translating the image. Nevertheless, in the case of classical graph structures, an analogous procedure is difficult to apply. This is a serious problem in medicinal chemistry, where the goal is to predict biological activity based only on a small number of verified compounds. The introduction of spatial features and our geometric graph convolutions allow us to perform data augmentation in a natural way, which is not possible using only the adjacency matrix.
Formula (3) is invariant to translation of the spatial features, but its value depends on the rotation of the graph, so rotating a geometrical graph leads to different values of (3). Since in most domains rotation does not affect the interpretation of the object described by such a graph (e.g. rotation does not change the chemical compound, although one particular orientation may be useful when considering binding affinity, i.e. how well a given compound binds to the target protein), we can use this property to produce more instances of the same graph. This reasoning is exactly the same as in the classical view of image processing.
In addition, chemical compounds can be represented in many conformations. In a molecule, single bonds can rotate freely. Each molecule seeks to reach minimum energy, and thus some conformations are more likely to be found in nature than others. Because there are multiple stable conformations, augmentation helps to learn only meaningful spatial relations. In some tasks, conformations may be included in the dataset, e.g. in binding affinity prediction active conformations are those formed inside the binding pocket of a protein (see Figure 1(b)). Such a conformation can be discovered experimentally, e.g., through crystallization.
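Rotation augmentation of node positions can be sketched as follows. This is a minimal example under assumptions: positions are 3-dimensional, the rotation is drawn from a QR decomposition of a Gaussian matrix, and the graph is rotated around its centroid. The function names are hypothetical and the paper does not specify how rotations were sampled.

```python
import numpy as np

def random_rotation_matrix(rng):
    """Draw a random 3D rotation via QR decomposition of a Gaussian matrix."""
    Q, R = np.linalg.qr(rng.normal(size=(3, 3)))
    Q = Q * np.sign(np.diag(R))      # standard sign fix for the QR factor
    if np.linalg.det(Q) < 0:         # flip one axis to exclude reflections
        Q[:, 0] = -Q[:, 0]
    return Q

def augment_positions(P, rng):
    """Rotate all node coordinates (rows of P) around their centroid."""
    center = P.mean(axis=0)
    return (P - center) @ random_rotation_matrix(rng).T + center

rng = np.random.default_rng(0)
P = rng.normal(size=(6, 3))          # toy coordinates of a 6-node graph
P_rot = augment_positions(P, rng)
```

The adjacency matrix and node features stay untouched; only the coordinates change, so each call yields a new training instance of the same graph with identical pairwise distances.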
Theoretical Analysis
As shown above, introducing geometric features makes the processing of graphs similar to the way of image processing. In this part, we make this statement even more evident. Namely, we formally prove that our geometric graph convolutions generalise classical convolutions used in the case of images. In other words, we show that the appropriate parametrisation of geometric graph convolutions leads to the classical convolutions.
Theorem 1.
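Let $M = (m_{i'j'})_{i',j' \in \{-k,\ldots,k\}}$ be a given convolutional mask, and let $n = (2k+1)^2$ (number of elements of $M$). Then there exist $u \in \mathbb{R}^2$, $b_1, \ldots, b_n \in \mathbb{R}$ and $w_1, \ldots, w_n \in \mathbb{R}$ such that
$M * H = \sum_{i=1}^{n} w_i \, \bar{H}(u, b_i),$
where $\bar{H}(u, b)$ denotes the image obtained by applying (3) with the filter $(u, b)$ at every pixel $p$, with the neighborhood of $p$ given by the pixels covered by the mask centered at $p$.
Proof.
Let $P$ denote all possible positions in the mask $M$, i.e. $P = \{(i', j')^T : i', j' \in \{-k,\ldots,k\}\}$.
Let $u \in \mathbb{R}^2$ denote an arbitrary vector which is not orthogonal to any element from $\{p - q : p, q \in P,\ p \neq q\}$. Then
$u^T p \neq u^T q \quad \text{for all } p, q \in P,\ p \neq q.$
Consequently, we may order the elements of $P$ so that $u^T p_1 > u^T p_2 > \cdots > u^T p_n$. Let $M_i$ denote the convolutional mask, which has value one at the position $p_i$, and zero otherwise.
Now we can choose arbitrary $b_1, \ldots, b_n$ such that
$u^T p_i + b_i > 0 \geq u^T p_{i+1} + b_i \ \text{ for } i < n, \qquad u^T p_n + b_n > 0,$
for example one may take
$b_i = -\tfrac{1}{2}\left(u^T p_i + u^T p_{i+1}\right) \ \text{ for } i < n, \qquad b_n = 1 - u^T p_n.$
Then observe that
$\bar{H}(u, b_1) = (u^T p_1 + b_1) \, M_1 * H,$
and generally for every $j = 1, \ldots, n$ we get
$\bar{H}(u, b_j) = \sum_{i=1}^{j} (u^T p_i + b_j) \, M_i * H,$
where all the coefficients in the above sum are strictly positive.
Consequently,
$M_1 * H = \frac{1}{u^T p_1 + b_1} \, \bar{H}(u, b_1),$
and we obtain recursively that
$M_j * H = \frac{1}{u^T p_j + b_j} \left( \bar{H}(u, b_j) - \sum_{i=1}^{j-1} (u^T p_i + b_j) \, M_i * H \right),$
which trivially implies that every convolution $M_j * H$ can be obtained as a linear combination of $\bar{H}(u, b_1), \ldots, \bar{H}(u, b_n)$.
Since an arbitrary convolution is given by $M * H = \sum_{i=1}^{n} m_{p_i} \, M_i * H$, where $m_{p_i}$ is the entry of $M$ at position $p_i$, we obtain the assertion of the theorem. ∎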
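The construction in the proof can be verified numerically for a 3x3 mask. The sketch below makes several assumptions for illustration: it picks $u = (1, 3)^T$, which gives nine distinct values of $u^T p$ on the mask positions, sets $b_i = 0.5 - u^T p_i$, and solves a triangular linear system for the combination weights instead of unrolling the recursion. The helper names are hypothetical.

```python
import numpy as np

def shifted(img, dx, dy):
    """Image whose pixel (r, c) holds img[r + dy, c + dx], zero outside the image."""
    rows, cols = img.shape
    out = np.zeros_like(img)
    for r in range(rows):
        for c in range(cols):
            if 0 <= r + dy < rows and 0 <= c + dx < cols:
                out[r, c] = img[r + dy, c + dx]
    return out

def geo_aggregate(img, offsets, u, b):
    """H_bar(u, b): sum over neighbor offsets of ReLU(u^T offset + b) * shifted image."""
    return sum(max(u @ p + b, 0.0) * shifted(img, *p) for p in offsets)

rng = np.random.default_rng(0)
img = rng.normal(size=(6, 6))
mask = rng.normal(size=(3, 3))             # arbitrary mask, entry (dy + 1, dx + 1)

offsets = np.array([(dx, dy) for dy in (-1, 0, 1) for dx in (-1, 0, 1)])
u = np.array([1.0, 3.0])                   # u^T p is distinct for all 9 offsets
offsets = offsets[np.argsort(-(offsets @ u))]   # order: u^T p_1 > ... > u^T p_n
s = offsets @ u
b = 0.5 - s                                # u^T p_i + b_j > 0 exactly when i <= j

# C[j, i] = ReLU(u^T p_i + b_j): lower triangular with positive diagonal.
C = np.maximum(s[None, :] + b[:, None], 0.0)
m = np.array([mask[dy + 1, dx + 1] for dx, dy in offsets])
w = np.linalg.solve(C.T, m)                # weights of the linear combination

direct = sum(m_i * shifted(img, *p) for m_i, p in zip(m, offsets))
via_graph = sum(w_j * geo_aggregate(img, offsets, u, b_j) for w_j, b_j in zip(w, b))
```

The matrix `C` is triangular with a positive diagonal, which is exactly why the recursion in the proof always terminates with a valid set of weights.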
On the other hand, if we put all spatial features $p_i$ to 0, then (3) reduces to:
$\bar{h}_i = \sum_{j \in N_i} \mathrm{ReLU}(b) \, h_j.$
This gives a vanilla graph convolution, where the aggregation over neighbors does not contain parameters. We can also use a different $b_{ij}$ for each pair of neighbors, which makes it possible to mimic many types of graph convolutions.
Experiments
We verified our model on graphs with a natural geometric interpretation. We took into account graphs constructed from images as well as graphs of chemical compounds.
Image graph classification
In the first experiment, we consider the well-known MNIST dataset. We represent the images as graphs in two ways following [monti2017geometric]. In the first case, each node corresponds to a pixel from the original image, making a regular grid with connections between adjacent pixels. The node has a 2-dimensional location, and it is characterized by a 1-dimensional pixel intensity. In the second variant, nodes are constructed from an irregular grid consisting of 75 superpixels. In the latter case, the edges are determined by spatial relations between nodes using k-nearest neighbors.
We tune the hyperparameters of geoGCN using a random search with a fixed budget of 100 trials, see supplementary material for details. We compare our method with the results reported in the literature by state-of-the-art methods used to process geometrical shapes: ChebNet [defferrard2016convolutional], MoNet [monti2017geometric], and SplineCNN [fey2018splinecnn].
The results presented in Table 1 show that geoGCN outperforms comparable methods on both variants of the MNIST dataset. Its performance is slightly better than that of SplineCNN, which reports state-of-the-art results on this task.
Method  Grid  Superpixels 

ChebNet  99.14%  75.62% 
MoNet  99.19%  91.11% 
SplineCNN  99.22%  95.22% 
geoGCN  99.36%  95.95% 
Incomplete image classification
Graph representation of images can be useful to describe images with missing regions. In this case, each visible pixel represents a node which is connected with its visible neighbors. Unobserved pixels are not represented in this graph.
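Building such a graph from a partially observed image can be sketched as below. The sketch assumes an 8-neighborhood between visible pixels and a uniformly placed square patch of hidden pixels; the function name `visible_pixel_graph` and the random toy image are hypothetical and only serve the illustration.

```python
import numpy as np

def visible_pixel_graph(img, visible):
    """Build a graph over visible pixels only.

    img: (rows, cols) intensities; visible: boolean mask of the same shape.
    Returns node features (n, 1), positions (n, 2) and an (n, n) adjacency matrix
    linking each visible pixel to its visible 8-neighbors (assumed connectivity).
    """
    coords = np.argwhere(visible)                 # (n, 2) array of (row, col)
    index = {tuple(p): i for i, p in enumerate(coords)}
    A = np.zeros((len(coords), len(coords)))
    for i, (r, c) in enumerate(coords):
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                j = index.get((r + dr, c + dc))
                if j is not None and j != i:
                    A[i, j] = 1.0
    features = img[visible].reshape(-1, 1)        # same row-major order as coords
    return features, coords.astype(float), A

rng = np.random.default_rng(0)
img = rng.random((28, 28))                        # stand-in for an MNIST digit
visible = np.ones((28, 28), dtype=bool)
top, left = rng.integers(0, 28 - 13 + 1, size=2)  # uniformly sampled patch corner
visible[top:top + 13, left:left + 13] = False     # hide a 13x13 square
X, P, A = visible_pixel_graph(img, visible)
```

No value is ever invented for the hidden pixels: they are simply absent from the node set, so the model sees only what was observed.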
For the evaluation, we considered the MNIST dataset, where a square patch of size 13x13 was removed from each image. The location of the patch was sampled uniformly for each image. For comparison, we used imputation methods, which fill missing regions at the preprocessing stage. Imputations were created using:

mean: Missing features were replaced with mean values of those features computed for all (incomplete) training samples.

knn: Missing attributes were filled with mean values of those features computed from the K nearest training samples (we used K = 5). Neighborhood was measured using Euclidean distance in the subspace of observed features.

mice: This method fills absent pixels in an iterative process using Multiple Imputation by Chained Equations (mice), where several imputations are drawn from the conditional distribution of the data by Markov chain Monte Carlo techniques.
Completed MNIST images were processed by fully connected and convolutional neural networks. For complete MNIST images (no missing data), these networks obtained classification accuracies of 98.79% and 99.34%, respectively.
Method  Accuracy 

FCNet + mean  87.59% 
FCNet + kNN  87.10% 
FCNet + mice  88.59% 
ConvNet + mean  90.95% 
ConvNet + kNN  90.67% 
ConvNet + mice  92.10% 
geoGCN  92.40% 
The results presented in Table 2 show that geoGCN gives better accuracy than all imputation methods combined with either type of neural network. This result is notable because geoGCN does not use any additional information concerning missing regions. This suggests that it can be better to leave unobserved features missing than to complete them with inappropriate values, which is common practice.
Learning from molecules
In the next experiment, we use chemical tasks to evaluate our model. We chose 3 datasets from MoleculeNet [molnet], which is a benchmark for molecule-related tasks. Blood-Brain Barrier Permeability (BBBP) is a binary classification task of predicting whether or not a given compound is able to pass through the barrier between blood and the brain, allowing the drug to impact the central nervous system. The ability of a molecule to penetrate this border depends on many different properties such as lipophilicity, molecule size, and its flexibility. The other 2 datasets, ESOL and FreeSolv, are solubility prediction tasks with continuous targets.
None of the three datasets contains atom positions, so only the graph representation of a compound can be obtained. However, the three-dimensional shape of a molecule can be predicted using energy minimization, which is fairly easy to do, especially for small compounds. We run the universal force field (UFF) method from the RDKit package to predict atom positions. Because our method uses absolute positions, and chemical compounds do not have one canonical orientation, the positional data can be augmented with random rotations. We also run UFF a few times (up to 30) to augment the data, as this procedure is not deterministic.
To evaluate our model against methods proposed by MoleculeNet, we split the datasets into train, validation, and test subsets. The splits follow the MoleculeNet recommendation: the ESOL and FreeSolv datasets are split at random, and the BBBP data is split with a scaffold split that prevents similar structures from being put into different sets. This way an algorithm cannot memorize the structures highly correlated with labels, but needs to learn more general compound features. We run a random search for all models, testing 100 hyperparameter sets for each of them. All runs are repeated 3 times. The tuned hyperparameters of all tested methods are shown in the supplementary materials.
We benchmark our approach against popular chemistry models: graph-based models (Graph Convolution [duvenaud2015convolutional], Weave Model [weave], and Message Passing Neural Network [gilmer2017neural]) as well as classical methods such as random forest (RF) and SVM, which often perform very well in chemical tasks where datasets tend to be small (e.g. FreeSolv has only 513 compounds in its training set). Neither RF nor SVM operates on graphs; instead, they use calculated feature vectors which describe a molecule. In our comparison, ECFP [ecfp] was used for this purpose. In addition, EAGCN [shang2018edge] is included in the experiment as a method that utilizes edge attributes together with the graph structure. As for our method, we show results with train- and test-time augmentation of the data carried out in the manner described above. For all datasets, we observe slight improvements with the augmented data. In order to investigate the impact of positional features, we also enrich the atom representation of the classical graph convolutional network with our predicted atom positions and apply the same augmentation procedure. We name this enriched architecture posGCN and include it in the comparison.
Method  BBBP  ESOL  FreeSolv 

SVM  0.603 ± 0.000  0.493 ± 0.000  0.391 ± 0.000 
RF  0.551 ± 0.005  0.533 ± 0.003  0.550 ± 0.004 
GC  0.690 ± 0.015  0.334 ± 0.017  0.336 ± 0.043 
Weave  0.703 ± 0.012  0.389 ± 0.045  0.403 ± 0.035 
MPNN  0.700 ± 0.019  0.303 ± 0.012  0.299 ± 0.038 
EAGCN  0.664 ± 0.007  0.459 ± 0.019  0.410 ± 0.014 
posGCN  0.696 ± 0.008  0.301 ± 0.011  0.278 ± 0.024 
geoGCN  0.743 ± 0.004  0.270 ± 0.005  0.299 ± 0.033 
The results presented in Table 3 show that for the FreeSolv dataset our method matches the result of MPNN, which is the best performing baseline from the literature for this task. For the two other datasets, geoGCN outperforms all tested models by a significant margin. Based on the posGCN scores, we notice that including positional features consistently improves the performance of the model across all tasks, and for the smallest dataset, FreeSolv, posGCN even surpasses the score of MPNN. Nevertheless, learning from bigger datasets requires a better way of managing positional data, as can be seen for the ESOL and BBBP datasets, for which posGCN performs significantly worse than geoGCN but still better than vanilla GC.
Ablation study of the data augmentation
We also studied the effect of data augmentation on the geoGCN performance. First, we examined how removing predicted positions, and thus setting all positional vectors to zero in Equation 3, affects the scores achieved by our model on chemical tasks. The results are depicted in Figure 4. They clearly show that even predicted node coordinates improve the performance of the method. On the same plot we also show the outcome of augmenting the data with random rotations and 30 predicted molecule conformations, which were calculated as described in the previous subsection. As expected, the best performing model uses all types of position augmentation.
Finally, the impact of various levels of augmentation was studied. For this purpose we precalculated 20 molecular conformations on the BBBP dataset using the universal force field method and used these predictions to augment the dataset. To test the importance of conformation variety, in each run we increased the number of available conformations to sample from. The results are presented in Figure 5. One can see that including a larger number of conformations helps the model to achieve better results. Also, the curve flattens out after a few conformations, which may be caused by the limited flexibility of small compounds and the high similarity of the predicted shapes.
Conclusion
We proposed geoGCN, which is a general model for processing graph-structured data with spatial features. Node positions are integrated into our convolution operation to create a layer which generalizes both GCNs and CNNs. In contrast to the majority of other approaches, our method can effectively use added information about location to construct self-taught feature masking, which can be augmented to achieve invariance of desired properties. Furthermore, we provide a theoretical analysis of our geometric graph convolutions. Experiments confirm strong performance of our method.
References
Appendix A Experimental details
In the following section we list all hyperparameter ranges used during the random search in our experiments.
Table 4 presents the geoGCN hyperparameter ranges that were used in all our experiments.
parameters  

batch size  16, 32, 64, 128 
learning rate  0.01, 0.005, 0.001, 0.0005, 0.0001 
model dropout  0.0, 0.1, 0.2, 0.3 
layers number  1, 2, 4, 6, 8 
model dim  16, 32, 64, 128, 256, 512 
model dim  8, 16, 32, 64 
use cluster pooling  True, False 
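A random search over ranges like those above can be sketched with the Python standard library alone. This is an illustrative sketch under assumptions: the dictionary keys are hypothetical names for the table rows, the second "model dim" row is omitted because its meaning is not recoverable from the table, and `evaluate` stands for whatever routine trains a model and returns a validation score (higher is better).

```python
import random

# Hypothetical key names mirroring the rows of the table above.
SEARCH_SPACE = {
    "batch_size": [16, 32, 64, 128],
    "learning_rate": [0.01, 0.005, 0.001, 0.0005, 0.0001],
    "dropout": [0.0, 0.1, 0.2, 0.3],
    "num_layers": [1, 2, 4, 6, 8],
    "model_dim": [16, 32, 64, 128, 256, 512],
    "use_cluster_pooling": [True, False],
}

def random_search(evaluate, space, budget=100, seed=0):
    """Sample `budget` configurations uniformly and keep the best-scoring one."""
    rng = random.Random(seed)
    best_config, best_score = None, float("-inf")
    for _ in range(budget):
        config = {name: rng.choice(values) for name, values in space.items()}
        score = evaluate(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score

# Toy objective standing in for a real training run (assumption for the demo).
def toy_score(cfg):
    return -abs(cfg["learning_rate"] - 0.001)

best_config, best_score = random_search(toy_score, SEARCH_SPACE, budget=100)
```

With a real objective, `evaluate` would train on the training split and score on the validation split; the budget of 100 matches the number of trials stated in the experiments.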
Chemistry experiments
Below we list the hyperparameter ranges used in the chemistry experiment.
parameters  

C  0.25, 0.4375, 0.625, 0.8125, 1., 1.1875, 1.375, 1.5625, 1.75, 1.9375, 2.125, 2.3125, 2.5, 2.6875, 2.875, 3.0625, 3.25, 3.4375, 3.625, 3.8125, 4. 
gamma  0.0125, 0.021875, 0.03125, 0.040625, 0.05, 0.059375, 0.06875, 0.078125, 0.0875, 0.096875, 0.10625, 0.115625, 0.125, 0.134375, 0.14375, 0.153125, 0.1625, 0.171875, 0.18125, 0.190625, 0.2 
parameters  
estimators number  125, 218, 312, 406, 500, 593, 687, 781, 875, 968, 1062, 1156, 1250, 1343, 1437, 1531, 1625, 1718, 1812, 1906, 2000 
parameters  

batch size  64, 128, 256 
learning rate  0.002, 0.001, 0.0005 
filters number  64, 128, 192, 256 
fully connected nodes number  128, 256, 512 
parameters  
batch size  16, 32, 64, 128 
epochs number  20, 40, 60, 80, 100 
learning rate  0.002, 0.001, 0.00075, 0.0005 
graph features number  32, 64, 96, 128, 256 
pair features number  14 
parameters  

batch size  8, 16, 32, 64 
epochs number  25, 50, 75, 100 
learning rate  0.002, 0.001, 0.00075, 0.0005 
T  1, 2, 3, 4, 5 
M  2, 3, 4, 5, 6 
parameters  

batch size  16, 32, 64, 128, 256, 512 
EAGCN structure  ’concate’, ’weighted’ 
epochs number  100, 500, 1000 
learning rate  0.01, 0.005, 0.001, 0.0005, 0.0001 
dropout  0.0, 0.1, 0.3 
weight decay  0.0, 0.001, 0.01, 0.0001 
sgc1 1  30, 60 
sgc1 2  5, 10, 15, 20, 30 
sgc1 3  5, 10, 15, 20, 30 
sgc1 4  5, 10, 15, 20, 30 
sgc1 5  5, 10, 15, 20, 30 
sgc2 1  30, 60 
sgc2 2  5, 10, 15, 20, 30 
sgc2 3  5, 10, 15, 20, 30 
sgc2 4  5, 10, 15, 20, 30 
sgc2 5  5, 10, 15, 20, 30 
den1  12, 32, 64 
den2  12, 32, 64 
Missing data experiments
Below we list the hyperparameter ranges used in the missing data experiment.
parameters  

batch size  16, 32, 64, 128 
learning rate  0.0001, 0.0005, 0.001, 0.005 
layers dimensionality  64, 128, 256, 512 
layers number  2, 3, 4, 5 