Geometric Graph Convolutional Neural Networks
Graph Convolutional Networks (GCNs) have recently become the primary choice for learning from graph-structured data, superseding hash fingerprints in representing chemical compounds. However, GCNs lack the ability to take into account the ordering of node neighbors, even when there is a geometric interpretation of the graph vertices that provides an order based on their spatial positions. To remedy this issue, we propose Geometric Graph Convolutional Network (geo-GCN) which uses spatial features to efficiently learn from graphs that can be naturally located in space. Our contribution is threefold: we propose a GCN-inspired architecture which (i) leverages node positions, (ii) is a proper generalisation of both GCNs and Convolutional Neural Networks (CNNs), (iii) benefits from augmentation which further improves the performance and assures invariance with respect to the desired properties. Empirically, geo-GCN outperforms state-of-the-art graph-based methods on image classification and chemical tasks.
Convolutional Neural Networks (CNNs) outperform humans on visual learning tasks, such as image classification [krizhevsky2012imagenet], object detection [seferbekov2018feature] or image captioning [yang2017dense]. They have also been successfully applied to text processing [kim2014convolutional] and time series analysis [yang2015deep]. Nevertheless, CNNs cannot be easily adapted to irregular entities, such as graphs, where data representation is not organised in a grid-like structure.
Graph Convolutional Networks (GCNs) attempt to mimic CNNs by operating on spatially close neighbors. Motivated by spectral graph theory, Kipf and Welling [kipf2016semi] use fixed weights determined by the adjacency matrix of a graph to aggregate labels of the neighbors. Velickovic et al. [velivckovic2017graph] use an attention mechanism to learn the strength of these weights. In most cases, the design of new GCNs is based on empirical intuition and there has been little investigation of their theoretical properties [xu2018powerful]. In particular, there is no evident correspondence between classical CNNs and GCNs.
In many cases, graphs are coupled with a geometric structure. In medicinal chemistry, the three-dimensional structure of a chemical compound, called a molecular conformation, is essential in determining the activity of a drug towards a target protein (Figure 1). Similarly, in image processing tasks, pixels of an image are organised in a two-dimensional grid, which constitutes their geometric interpretation. However, standard GCNs do not take spatial positions of the nodes into account, which is a considerable difference between GCNs and CNNs. Moreover, in the case of images, geometric features make it possible to augment data with translations or rotations and thereby significantly enlarge a given dataset, which is crucial when the number of examples is limited.
In this paper, we propose Geometric Graph Convolutional Networks (geo-GCN), a variant of GCNs, which is a proper generalisation of CNNs to the case of graphs. In contrast to existing GCNs, geo-GCN uses spatial features of nodes to aggregate information from the neighbors. On one hand, this geometric interpretation is useful to model many real examples of graphs such as graphs of chemical compounds. In this case, we are able to perform data augmentation by rotating a given graph in a spatial domain and, in consequence, improve network generalisation when the amount of data is limited. On the other hand, a single layer of geo-GCN can be parametrised so that it returns a result identical to a standard convolutional layer on grid-like objects, such as images (see Theorem 1).
The proposed method was evaluated on various datasets and compared with the state-of-the-art methods. We applied geo-GCN to classify images and incomplete images represented as graphs. We also tested the proposed method on chemical benchmark datasets. Experiments demonstrate that combining spatial information with data augmentation leads to more accurate predictions.
Our contributions can be summarised as follows:
We show how to use geometric features (spatial coordinates) in GCNs.
We prove that geo-GCN is a proper generalisation of GCNs and CNNs.
In contrast to existing approaches, geo-GCN allows graph augmentation to be performed, which further improves the performance of the model.
The first instances of Graph Neural Networks were proposed in [gori2005new] and [scarselli2008graph]. These authors designed recursive neural networks which iteratively propagate node labels until reaching a stable fixed point. Recursive graph neural networks were further developed by [li2015gated], who used gated recurrent units for learning graph representations.
Recent approaches for graph processing rely on adapting convolutional neural networks to the graph domain (graph convolutional networks, GCNs). The first class of methods is based on the spectral representation of graphs. The authors of [bruna2013spectral], [henaff2015deep] and [defferrard2016convolutional] defined spectral filters which operate on the graph spectrum. Kipf and Welling [kipf2016semi] significantly simplified this process by restricting the neighborhood to first-order neighbors only. In this approach, convolutional layers followed by non-linear activation functions were stacked to process graph structure sequentially. This work was also extended to higher-order neighborhoods [zhou20179higher]. In [li2018adaptive], the notion of neighborhood in GCNs was generalized and a distance metric for graphs was learned by spectral methods. The authors of [wu2019simplifying] showed that removing nonlinearities from GCNs further reduces their complexity without heavily affecting performance. Unfortunately, spectral methods are domain-dependent, which means that GCNs trained on one graph cannot be trivially transferred to another graph with a different spectral structure.
The second variant of GCNs does not use the Laplacian basis to aggregate node neighbors but attempts to train a convolutional filter for this purpose. To deal with variable-sized neighborhoods and to preserve the parameter sharing property of CNNs, [duvenaud2015convolutional] used a specific weight matrix for each node degree. Subsequent work [hamilton2017inductive] used a sampling strategy to extract a fixed-size neighborhood. The authors of [monti2017geometric] used spatial features to construct convolutional filters. In contrast to our approach, they transform geometric features using predefined Gaussian kernels and do not focus on generalizing classical CNNs. In [velivckovic2017graph], multi-head self-attention was used to train individual weights for each pair of nodes. To account for edge similarities, which appear natural in the chemical domain, the authors of [shang2018edge] applied an attention mechanism to edges. In [gilmer2017neural], graph and distance information were integrated in a single model, which achieved strong performance on molecular property prediction benchmarks. Moreover, not only graph distances but also three-dimensional atom coordinates are useful in molecular predictions, as emphasized by [cho2018three], who introduced the 3DGCN architecture. They integrated a matrix of relative atom positions into the GCN architecture. However, 3DGCN is a chemistry-inspired model which does not aim to generalize CNNs.
While many variants of graph neural networks achieve impressive performance, their design is mostly based on empirical intuition and evaluation. The work of [xu2018powerful] investigates theoretical properties of neural networks operating on graphs. Based on graph isomorphism test, they formally analyze discriminative power of popular GNN variants [kipf2016semi], [hamilton2017inductive] and show that they cannot learn to distinguish certain simple graph structures. In a similar spirit, our geo-GCN is a theoretically justified generalization of classical CNNs to the case of graphs.
Geometric graph convolutions
In this section, we introduce geo-GCN. First, we recall a basic construction of standard GCNs. Next, we present the intuition behind our approach and formally introduce geo-GCN. Finally, we discuss practical advantages of geometric graph convolutions.
Let $G = (V, A)$ be a graph, where $V = \{v_1, \dots, v_n\}$ denotes a set of nodes (vertices) and $A = [a_{ij}]_{i,j=1}^{n}$ represents edges. We put $a_{ij} = 1$ if $v_i$ and $v_j$ are connected by a directed edge and $a_{ij} = 0$ if the edge is missing. Each node $v_i$ is represented by a $d$-dimensional feature vector $x_i \in \mathbb{R}^{d}$. Typically, graph convolutional neural networks transform these feature vectors over multiple subsequent layers to produce the final prediction.
Let $H = [h_1, \dots, h_n]$ denote the matrix of node features being an input to a convolutional layer, where $h_i \in \mathbb{R}^{d_{in}}$ are column vectors. The dimension of $h_i$ is determined by the number of filters used in the previous layer. Clearly, $X = [x_1, \dots, x_n]$ is the input representation to the first layer.
A typical graph convolution is defined by combining two operations. For each node $v_i$, feature vectors of its neighbors $N_i = \{j : a_{ij} = 1\}$ are first aggregated:
$$\bar{h}_i = \sum_{j \in N_i} u_{ij} h_j. \tag{1}$$
The weights $u_{ij}$ are either trainable ([velivckovic2017graph] applied an attention mechanism) or determined by $A$ ([kipf2016semi] motivated their selection using spectral graph theory).
Next, a standard MLP is applied to transform the intermediate representation $\bar{H} = [\bar{h}_1, \dots, \bar{h}_n]$ into the final output of a given layer:
$$\mathrm{MLP}(\bar{H}; W) = \mathrm{ReLU}(W^{\top} \bar{H} + b), \tag{2}$$
where a trainable weight matrix $W = [w_1, \dots, w_{d_{out}}]$ is defined by column vectors $w_j \in \mathbb{R}^{d_{in}}$ and $b$ is a trainable bias. The dimension of $W$ determines the dimension of the output feature vectors.
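The two steps of a standard graph convolution can be sketched in a few lines of NumPy. This is a minimal illustration rather than any particular published layer: it assumes node features are stored as rows (not columns), uses plain neighbor averaging as the fixed aggregation weights, and takes ReLU as the non-linearity. The function name `graph_conv` and the toy path graph are hypothetical.

```python
import numpy as np

def graph_conv(H, A, W, b):
    """One vanilla graph convolution: aggregate neighbors, then ReLU(H_bar W + b).

    H: (n, d_in) node features, one row per node.
    A: (n, n) adjacency matrix, A[i, j] = 1 if j is a neighbor of i.
    W: (d_in, d_out) trainable weights, b: (d_out,) bias.
    """
    # Assumed fixed aggregation weights: average over each node's neighbors.
    deg = A.sum(axis=1, keepdims=True)
    U = A / np.maximum(deg, 1)            # guard against isolated nodes
    H_bar = U @ H                         # aggregation step
    return np.maximum(H_bar @ W + b, 0)   # transformation step

# Toy path graph 0 - 1 - 2 with scalar node features.
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
H = np.array([[1.0], [2.0], [4.0]])
out = graph_conv(H, A, W=np.array([[1.0]]), b=np.array([0.0]))
```

Note that the aggregation only sees which nodes are connected; swapping the positions of two neighbors leaves the output unchanged, which is the limitation the geometric variant addresses.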
Intuition behind geometric graph convolutions.
Classical GCNs operate on the neighborhood given by the adjacency matrix. In some applications, nodes are additionally described by spatial coordinates. For example, the position of each pixel can be expressed as a pair of integers. Analogously, every conformation of a chemical compound is a 3-dimensional geometric graph, where each atom is located in space. The adjacency matrix cannot preserve all the information about the graph geometry. In particular, it is not possible to construct an analogue of classical convolution from the adjacency matrix and feature vectors alone. In our approach, we show how to include this spatial information in graph convolutions to construct a proper generalization of classical convolutions.
To proceed further, we need to introduce notation concerning convolutions (in the case of images). For simplicity we consider only convolutions without pooling. In general, given a mask its result on the image is given by
To present an intuition behind our approach, let us show how to mimic a classical linear convolution based on graph representation of the image.
For simplicity, let us consider a linear convolution given by the mask
Observe that as the result of this convolution on the image , every pixel is exchanged by its right upper neighbor, see Figure 2. Now we understand the image as a graph, where the neighborhood of the pixel with coordinates is given by the pixels with coordinates such that .
Given a vector and a bias we can define the (intermediate) graph operation by
Consider now the case when . One can easily observe that
Consequently, we obtain that , which equals the result of the considered linear convolution.
Now, let us consider the mask, see Figure 3:
This convolution cannot be obtained from graph representation using a single transformation as in previous example.
To formulate this convolution, we define two intermediate operations for :
where and . The first operation extracts the right upper corner, while the second one extracts the left bottom corner, i.e.
Finally, we put
Making an additional linear transformation (analogical to (2) with ), we obtain:
As demonstrated in the above examples classical linear convolutions can be obtained from graphs by appropriate adaptation of (1) using spatial features. Based on this intuition, the precise formulation of geometric graph convolution is presented in the following paragraph. The complete proof that every linear convolution can be rewritten using geo-GCN is given in the next section.
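The two examples can be checked numerically. The sketch below is an illustration under stated assumptions, not the authors' code: pixel positions are taken as (x, y) with y pointing up, the neighborhood is the 3x3 window around a pixel, pixels outside the image are ignored, and the filter values u = (1, 1), b = -1 (and u = (-1, -1), b = -1 for the second corner) are one choice that produces the described behaviour. The helper names `geo_aggregate` and `shift` are hypothetical.

```python
import numpy as np

def geo_aggregate(img, u, b):
    """Sum over the 3x3 neighborhood of ReLU(u^T (q - p) + b) * img[q]."""
    rows, cols = img.shape
    out = np.zeros_like(img, dtype=float)
    for r in range(rows):
        for c in range(cols):
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < rows and 0 <= cc < cols:
                        # Offset q - p in (x, y) coordinates with y pointing up.
                        offset = np.array([dc, -dr])
                        weight = max(float(u @ offset) + b, 0.0)
                        out[r, c] += weight * img[rr, cc]
    return out

def shift(img, dr, dc):
    """Reference result: pixel (r, c) takes the value at (r + dr, c + dc), zero outside."""
    rows, cols = img.shape
    out = np.zeros_like(img, dtype=float)
    for r in range(rows):
        for c in range(cols):
            rr, cc = r + dr, c + dc
            if 0 <= rr < rows and 0 <= cc < cols:
                out[r, c] = img[rr, cc]
    return out

img = np.arange(16, dtype=float).reshape(4, 4)

# Example 1: a single filter selects the right upper neighbor (row - 1, col + 1).
upper_right = geo_aggregate(img, np.array([1.0, 1.0]), -1.0)

# Example 2: a second filter selects the left bottom neighbor; adding the two
# outputs plays the role of the final linear map with weights (1, 1).
lower_left = geo_aggregate(img, np.array([-1.0, -1.0]), -1.0)
two_corner = upper_right + lower_left
```

With these values the ReLU gate equals one for exactly one offset in the window and zero for the other eight, which is why a single filter reproduces a one-hot mask while a two-hot mask needs two filters.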
Geometric graph convolutions.
To formalize the above intuition, we define our geometric graph convolutions. We assume that each node $v_i$ is additionally identified with its coordinates $p_i \in \mathbb{R}^{t}$. In contrast to standard features $x_i$, we will not change $p_i$ across layers, but only use them to construct a better graph representation. For this purpose, we replace (1) by:
$$\bar{h}_i(u, b) = \sum_{j \in N_i} \mathrm{ReLU}\big(u^{\top}(p_j - p_i) + b\big)\, h_j, \tag{3}$$
where $u \in \mathbb{R}^{t}$ and $b \in \mathbb{R}$ are trainable. The pair $(u, b)$ plays the role of a convolutional filter which operates on the neighborhood of $v_i$. The relative positions in the neighborhood are transformed using a linear operation combined with the non-linear ReLU function. The resulting scalar is used to weigh the feature vectors in a neighborhood.
By analogy with classical convolution, this transformation can be extended to multiple filters (as in Example 2). Let $U = [u_1, \dots, u_k]$ and $b = [b_1, \dots, b_k]$ define $k$ filters. The intermediate representation $\bar{h}_i$ is a vector defined by:
$$\bar{h}_i = \bar{h}_i(u_1, b_1) \oplus \dots \oplus \bar{h}_i(u_k, b_k),$$
where $\oplus$ denotes concatenation.
Finally, we apply MLP transformation in the same manner as in (2) to transform these feature vectors.
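A complete layer with several filters might look as follows. This is a minimal NumPy sketch under assumptions: features and positions are stored as rows, the filter outputs are concatenated before a single linear map with ReLU, and all names (`geo_gcn_layer`, `U`, `b`, `W`, `c`) are illustrative rather than taken from a reference implementation.

```python
import numpy as np

def geo_gcn_layer(H, P, A, U, b, W, c):
    """Geometric graph convolution with k filters, followed by ReLU(H_bar W + c).

    H: (n, d_in) node features      P: (n, t) node positions
    A: (n, n) adjacency matrix      U: (t, k) filter directions
    b: (k,) filter biases           W: (k * d_in, d_out), c: (d_out,)
    """
    n, d_in = H.shape
    k = U.shape[1]
    rel = P[None, :, :] - P[:, None, :]         # rel[i, j] = p_j - p_i, shape (n, n, t)
    gate = np.maximum(rel @ U + b, 0.0)         # ReLU(u_f^T (p_j - p_i) + b_f), shape (n, n, k)
    gate = gate * A[:, :, None]                 # keep only actual neighbors
    # H_bar[i, f, :] = sum_j gate[i, j, f] * H[j, :]
    H_bar = np.einsum("ijf,jd->ifd", gate, H)
    H_bar = H_bar.reshape(n, k * d_in)          # concatenate the k filter outputs
    return np.maximum(H_bar @ W + c, 0.0)

# Three nodes in the plane, fully connected, scalar features.
P = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
A = np.ones((3, 3)) - np.eye(3)
H = np.array([[1.0], [2.0], [3.0]])
U = np.eye(2)                                   # one filter per coordinate axis
out = geo_gcn_layer(H, P, A, U, np.zeros(2), np.eye(2), np.zeros(2))
```

In this toy setting the first filter responds to neighbors lying to the right of a node and the second to neighbors lying above it, so nodes with identical neighbor sets but different geometry receive different representations.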
In practice, the amount of training data is usually too small to provide sufficient generalization. To overcome this problem, one can perform data augmentation to produce more representative examples. In computer vision, data augmentation is straightforward and relies on rotating or translating the image. Nevertheless, in the case of classical graph structures, an analogous procedure is difficult to apply. This is a serious problem in medicinal chemistry, where the goal is to predict biological activity based on only a small number of verified compounds. The introduction of spatial features and our geometric graph convolutions allows us to perform data augmentation in a natural way, which is not possible using only the adjacency matrix.
The formula (3) is invariant to the translation of spatial features, but its value depends on rotation of graph. In consequence, the rotation of the geometrical graph leads to different values of (3). Since in most domains the rotation does not affect the interpretation of object described by such graph (e.g. rotation does not change the chemical compound although one particular orientation may be useful when considering binding affinity, i.e. how well a given compound binds to the target protein), we can use this property to produce more instances of the same graph. This reasoning is exactly the same as in the classical view of image processing.
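Rotation-based augmentation of a geometric graph only touches the node positions. The sketch below shows one common recipe for sampling a random rotation (QR decomposition of a Gaussian matrix with a sign correction); the sampling scheme, the choice to rotate about the centroid, and the names `random_rotation` and `augment` are assumptions made for illustration.

```python
import numpy as np

def random_rotation(dim, rng):
    """Sample a random rotation matrix (orthogonal, determinant +1) via QR."""
    Q, R = np.linalg.qr(rng.normal(size=(dim, dim)))
    Q = Q * np.sign(np.diag(R))     # fix column signs of the factorisation
    if np.linalg.det(Q) < 0:
        Q[:, 0] = -Q[:, 0]          # flip one axis to turn a reflection into a rotation
    return Q

def augment(P, rng):
    """Rotate all node positions about their centroid; features and edges stay untouched."""
    center = P.mean(axis=0)
    return (P - center) @ random_rotation(P.shape[1], rng).T + center

rng = np.random.default_rng(0)
P = rng.normal(size=(5, 3))         # hypothetical 3D atom coordinates
P_rot = augment(P, rng)
```

Pairwise distances between nodes are preserved, so the rotated graph describes the same object, while the relative offsets between neighbors (and hence the filter responses) change.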
In addition, chemical compounds can be represented in many conformations. In a molecule, single bonds can rotate freely. Each molecule seeks to reach minimum energy, and thus some conformations are more likely to be found in nature than others. Because there are multiple stable conformations, augmentation helps to learn only meaningful spatial relations. In some tasks, conformations may be included in the dataset, e.g. in binding affinity prediction active conformations are those formed inside the binding pocket of a protein (see Figure 1(b)). Such a conformation can be discovered experimentally, e.g., through crystallization.
As shown above, introducing geometric features makes the processing of graphs similar to the way of image processing. In this part, we make this statement even more evident. Namely, we formally prove that our geometric graph convolutions generalise classical convolutions used in the case of images. In other words, we show that the appropriate parametrisation of geometric graph convolutions leads to the classical convolutions.
Let be a given convolutional mask, and let (number of elements of ). Then there exist , and such that
Let denote all possible positions in the mask , i.e. .
Let denote an arbitrary vector which is not orthogonal to any element from . Then
Consequently, we may order the elements of so that . Let denote the convolutional mask, which has value one at the position , and zero otherwise.
Now we can choose arbitrary such that
for example one may take
Then observe that
and generally for every we get
where all the coefficients in the above sum are strictly positive.
and we obtain recursively that
which trivially implies that every convolution can be obtained as a linear combination of .
Since an arbitrary convolution is given by , we obtain the assertion of the theorem. ∎
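The recursive step of the argument can be written out explicitly. The notation below is an assumption introduced for illustration and may differ from the original symbols: $p_1, \dots, p_n$ are the mask positions (offsets) ordered so that $v^{\top} p_1 > \dots > v^{\top} p_n$, $M_i$ is the mask with a single one at $p_i$, the thresholds satisfy $v^{\top} p_k > b_k > v^{\top} p_{k+1}$ (with any $b_n < v^{\top} p_n$ for the last one), and $\bar{H}(v, -b_k)$ is the geometric aggregation with filter $(v, -b_k)$.

```latex
\begin{align*}
\bar{H}(v, -b_k)
  &= \sum_{i=1}^{n} \mathrm{ReLU}\!\left(v^{\top} p_i - b_k\right) M_i * H
   = \sum_{i=1}^{k} \left(v^{\top} p_i - b_k\right) M_i * H, \\
M_1 * H
  &= \frac{\bar{H}(v, -b_1)}{v^{\top} p_1 - b_1}, \\
M_k * H
  &= \frac{1}{v^{\top} p_k - b_k}
     \left( \bar{H}(v, -b_k) - \sum_{i=1}^{k-1} \left(v^{\top} p_i - b_k\right) M_i * H \right),
     \qquad k = 2, \dots, n, \\
M * H
  &= \sum_{i=1}^{n} M(p_i)\, M_i * H .
\end{align*}
```

The first line uses the ordering: for $i \le k$ the argument of the ReLU is strictly positive, and for $i > k$ it is negative, so those terms vanish. The system is therefore triangular with a non-zero diagonal, and each one-hot convolution $M_k * H$ is a linear combination of $\bar{H}(v, -b_1), \dots, \bar{H}(v, -b_k)$.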
On the other hand, if we set all spatial features to 0, then (3) reduces to:
$$\bar{h}_i = \sum_{j \in N_i} \mathrm{ReLU}(b)\, h_j.$$
This gives a vanilla graph convolution, where the aggregation over neighbors does not contain parameters. We can also use a different bias $b_{ij}$ for each pair of neighbors, which makes it possible to mimic many types of graph convolutions.
We verified our model on graphs with a natural geometric interpretation. We took into account graphs constructed from images as well as graphs of chemical compounds.
Image graph classification
In the first experiment, we consider the well-known MNIST dataset. We represent the images as graphs in two ways following [monti2017geometric]. In the first case, each node corresponds to a pixel from the original image, making a regular grid with connections between adjacent pixels. The node has 2-dimensional location, and it is characterized by a 1-dimensional pixel intensity. In the second variant, nodes are constructed from an irregular grid consisting of 75 superpixels. In the latter case, the edges are determined by spatial relations between nodes using k-nearest neighbors.
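The regular-grid representation can be built directly from the pixel array. The sketch below assumes 4-connectivity (each pixel linked to its horizontal and vertical neighbors), which is one reading of "adjacent pixels", and uses (row, column) indices as node positions; the function name `image_to_grid_graph` is hypothetical.

```python
import numpy as np

def image_to_grid_graph(img):
    """Turn a 2D image into node features, node positions, and an adjacency matrix."""
    rows, cols = img.shape
    idx = np.arange(rows * cols).reshape(rows, cols)     # node id of each pixel
    features = img.reshape(-1, 1).astype(float)          # 1-dimensional pixel intensity
    rr, cc = np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij")
    positions = np.stack([rr.ravel(), cc.ravel()], axis=1).astype(float)
    A = np.zeros((rows * cols, rows * cols))
    A[idx[:, :-1].ravel(), idx[:, 1:].ravel()] = 1       # horizontal neighbors
    A[idx[:-1, :].ravel(), idx[1:, :].ravel()] = 1       # vertical neighbors
    A = np.maximum(A, A.T)                               # make edges undirected
    return features, positions, A

img = np.zeros((28, 28))                                 # placeholder for an MNIST digit
features, positions, A = image_to_grid_graph(img)
```

A dense adjacency matrix is used here only for readability; for 784 nodes a sparse edge list would be the more economical choice.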
We tune the hyperparameters of geo-GCN using a random search with a fixed budget of 100 trials, see supplementary material for details. We compare our method with the results reported in the literature by state-of-the-art methods used to process geometrical shapes: ChebNet [defferrard2016convolutional], MoNet [monti2017geometric], and SplineCNN [fey2018splinecnn].
The results presented in Table 1 show that geo-GCN outperforms comparable methods on both variants of the MNIST dataset. Its performance is slightly better than that of SplineCNN, which reports state-of-the-art results on this task.
Incomplete image classification
Graph representation of images can be useful to describe images with missing regions. In this case, each visible pixel represents a node which is connected with its visible neighbors. Unobserved pixels are not represented in this graph.
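Building such a graph amounts to keeping only the visible pixels and the edges between them. The following sketch assumes a boolean visibility mask and 4-connectivity between pixels; the toy 4x4 image with a hidden 2x2 centre and the function name `visible_pixel_graph` are hypothetical.

```python
import numpy as np

def visible_pixel_graph(img, visible):
    """Build a graph whose nodes are the visible pixels only.

    Returns node features, node positions (row, col), and a directed edge list
    connecting each visible pixel with its visible horizontal/vertical neighbors.
    """
    coords = np.argwhere(visible)                        # row-major order
    features = img[visible].reshape(-1, 1).astype(float) # same row-major order
    index = {(int(r), int(c)): i for i, (r, c) in enumerate(coords)}
    edges = []
    for (r, c), i in index.items():
        for dr, dc in ((0, 1), (1, 0)):                  # right and down neighbors
            j = index.get((r + dr, c + dc))
            if j is not None:
                edges += [(i, j), (j, i)]                # store both directions
    return features, coords.astype(float), edges

img = np.arange(16, dtype=float).reshape(4, 4)
visible = np.ones((4, 4), dtype=bool)
visible[1:3, 1:3] = False                                # hide the 2x2 centre
features, positions, edges = visible_pixel_graph(img, visible)
```

No value is ever invented for a hidden pixel: it simply does not appear as a node, and its neighbors have a smaller neighborhood.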
For the evaluation, we considered MNIST dataset, where a square patch of the size 13x13 was removed from each image. The location of the patch was uniformly sampled for each image. For a comparison, we used imputation methods, which fill missing regions at preprocessing stage. Imputations were created using:
mean: Missing features were replaced with mean values of those features computed for all (incomplete) training samples.
k-nn: Missing attributes were filled with mean values of those features computed from the nearest training samples (we used K = 5). Neighborhood was measured using Euclidean distance in the subspace of observed features.
mice: This method fills absent pixels in an iterative process using Multiple Imputation by Chained Equations (mice), where several imputations are drawn from the conditional distribution of the data by Markov chain Monte Carlo techniques.
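The two simpler baselines can be sketched as follows. This is an illustrative reading of the descriptions above, not the code used in the experiments: missing entries are marked by a boolean mask, the k-NN distance is computed on the coordinates observed in both the query and the training sample (without any normalisation), and the names `mean_impute` and `knn_impute` are hypothetical.

```python
import numpy as np

def mean_impute(X, observed):
    """Fill missing entries with per-feature means over the observed training values."""
    counts = observed.sum(axis=0)
    means = np.where(observed, X, 0.0).sum(axis=0) / np.maximum(counts, 1)
    return np.where(observed, X, means)

def knn_impute(x, obs, X_train, obs_train, k=5):
    """Fill missing entries of x with the mean over its k nearest training samples."""
    shared = obs_train & obs                         # coordinates usable for the distance
    diff = np.where(shared, X_train - x, 0.0)
    dist = np.sqrt((diff ** 2).sum(axis=1))
    nearest = np.argsort(dist)[:k]
    filled = x.copy()
    for j in np.flatnonzero(~obs):
        vals = X_train[nearest, j][obs_train[nearest, j]]
        if vals.size:                                # skip if no neighbor observed feature j
            filled[j] = vals.mean()
    return filled

# Mean imputation on a tiny incomplete training set.
X = np.array([[1.0, np.nan], [3.0, 4.0]])
observed = np.array([[True, False], [True, True]])
X_mean = mean_impute(X, observed)

# k-NN imputation of one sample whose last feature is missing.
X_train = np.array([[0.0, 0.0, 1.0], [0.0, 0.0, 3.0], [10.0, 10.0, 100.0]])
obs_train = np.ones_like(X_train, dtype=bool)
x = np.array([0.0, 0.0, np.nan])
x_knn = knn_impute(x, np.array([True, True, False]), X_train, obs_train, k=2)
```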
Completed MNIST images were processed by fully connected and convolutional neural networks. For complete MNIST images (no missing data), these networks obtained 98.79% and 99.34% of classification accuracy, respectively.
|Method||Accuracy|
|FCNet + mean||87.59%|
|FCNet + k-NN||87.10%|
|FCNet + mice||88.59%|
|ConvNet + mean||90.95%|
|ConvNet + k-NN||90.67%|
|ConvNet + mice||92.10%|
The results presented in Table 2 show that geo-GCN gives better accuracy than all imputation methods on both versions of neural networks. The overall performance of geo-GCN is impressive, because geo-GCN does not use any additional information concerning missing regions. This suggests that it is better to leave unobserved features missing than to complete them with inappropriate values, which is a common practice.
Learning from molecules
In the next experiment, we use chemical tasks to evaluate our model. We chose 3 datasets from MoleculeNet [molnet] which is a benchmark for molecule-related tasks. Blood-Brain Barrier Permeability (BBBP) is a binary classification task of predicting whether or not a given compound is able to pass through the barrier between blood and the brain, allowing the drug to impact the central nervous system. The ability of a molecule to penetrate this border depends on many different properties such as lipophilicity, molecule size, and its flexibility. Another 2 datasets, ESOL and FreeSolv, are solubility prediction tasks with continuous targets.
None of the three datasets contain atom positions, so only the graph representation of a compound can be obtained. However, the three-dimensional shape of a molecule can be predicted using energy minimization, which is fairly easy to do especially for small compounds. We run universal force field (UFF) method from RDKit package to predict atom positions. Because in our method we use absolute positions, and chemical compounds do not have one canonical orientation, the positional data can be augmented with random rotations. We also run UFF a few times (up to 30) to augment the data as this procedure is not deterministic.
To evaluate our model against methods proposed by MoleculeNet, we split the datasets into train, validation, and test subsets. The splits follow the MoleculeNet recommendation: the ESOL and FreeSolv datasets are split at random, while the BBBP data is split with a scaffold split, which prevents similar structures from being put into different sets. This way an algorithm cannot memorize the structures highly correlated with labels, but needs to learn more general compound features. We run a random search for all models, testing 100 hyperparameter sets for each of them. All runs are repeated 3 times. The tuned hyperparameters of all tested methods are shown in the supplementary materials.
We benchmark our approach against popular chemistry models: graph-based models (Graph Convolution [duvenaud2015convolutional], Weave Model [weave], and Message Passing Neural Network [gilmer2017neural]) as well as classical methods such as random forest and SVM, which often perform superbly in chemical tasks where datasets tend to be small (e.g. FreeSolv has only 513 compounds in its training set). Neither RF nor SVM operates on graphs, but rather they use calculated feature vectors which describe a molecule. In our comparison, ECFP [ecfp] was used for this purpose. In addition, EAGCN [shang2018edge] is included in the experiment as the method that utilizes edge attributes together with the graph structure. As for our method, we show results with train- and test-time augmentation of the data carried out in the manner described above. For all datasets, we observe slight improvements with the augmented data. In order to investigate the impact of positional features, we also enrich the atom representation of the classical graph convolutional network with our predicted atom positions and apply the same procedure of augmentation. We name this enriched architecture pos-GCN and include it in the comparison.
|Method||BBBP||ESOL||FreeSolv|
|SVM||0.603 ± 0.000||0.493 ± 0.000||0.391 ± 0.000|
|RF||0.551 ± 0.005||0.533 ± 0.003||0.550 ± 0.004|
|GC||0.690 ± 0.015||0.334 ± 0.017||0.336 ± 0.043|
|Weave||0.703 ± 0.012||0.389 ± 0.045||0.403 ± 0.035|
|MPNN||0.700 ± 0.019||0.303 ± 0.012||0.299 ± 0.038|
|EAGCN||0.664 ± 0.007||0.459 ± 0.019||0.410 ± 0.014|
|pos-GCN||0.696 ± 0.008||0.301 ± 0.011||0.278 ± 0.024|
|geo-GCN||0.743 ± 0.004||0.270 ± 0.005||0.299 ± 0.033|
The results presented in Table 3 show that for FreeSolv dataset our method matches the result of MPNN, which is the best performing model for this task. For the two other datasets, geo-GCN outperforms all tested models by a significant margin. Based on pos-GCN scores, we notice that including positional features consistently improves the performance of the model across all tasks, and for the smallest dataset, FreeSolv, pos-GCN even surpasses the score of MPNN. Nevertheless, learning from bigger datasets requires a better way of managing positional data, which can be noted for ESOL and BBBP datasets for which pos-GCN performs significantly worse than geo-GCN but still better than vanilla GC.
Ablation study of the data augmentation
We also studied the effect of data augmentation on the geo-GCN performance. First, we examined how removing predicted positions, and thus setting all positional vectors to zero in Equation 3, affects the scores achieved by our model on chemical tasks. The results are depicted in Figure 4. It clearly shows that even predicted node coordinates improve the performance of the method. On the same plot we also show the outcome of augmenting the data with random rotations and 30 predicted molecule conformations, which were calculated as described in the previous subsection. As expected, the best performing model uses all types of position augmentation.
Finally, the impact of various levels of augmentation was studied. For this purpose we precalculated 20 molecular conformations on the BBBP dataset using the universal force field method and used these predictions to augment the dataset. To test the importance of conformation variety, in each run we increased the number of available conformations to sample from. The results are presented in Figure 5. One can see that including a larger number of conformations helps the model to achieve better results. Also, the curve flattens out after a few conformations, which may be caused by the limited flexibility of small compounds and the high similarity of the predicted shapes.
We proposed geo-GCN which is a general model for processing graph-structured data with spatial features. Node positions are integrated into our convolution operation to create a layer which generalizes both GCNs and CNNs. In contrast to the majority of other approaches, our method can effectively use added information about location to construct self-taught feature masking, which can be augmented to achieve invariance of desired properties. Furthermore, we provide a theoretical analysis of our geometric graph convolutions. Experiments confirm strong performance of our method.
Appendix A Experimental details
In the following section we list all hyperparameter ranges used during the random search in our experiments.
In Table 4 we present the geo-GCN hyperparameter ranges that were used in all our experiments.
|batch size||16, 32, 64, 128|
|learning rate||0.01, 0.005, 0.001, 0.0005, 0.0001|
|model dropout||0.0, 0.1, 0.2, 0.3|
|layers number||1, 2, 4, 6, 8|
|model dim||16, 32, 64, 128, 256, 512|
|model dim||8, 16, 32, 64|
|use cluster pooling||True, False|
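A random search over these ranges can be sketched with the standard library alone. The snippet assumes independent uniform sampling of each hyperparameter, which the text does not specify, and omits the two "model dim" rows because their distinct meanings are not recoverable from the table; the names `SEARCH_SPACE` and `sample_configs` are hypothetical.

```python
import random

# Ranges copied from the table above (the two "model dim" rows are left out).
SEARCH_SPACE = {
    "batch_size": [16, 32, 64, 128],
    "learning_rate": [0.01, 0.005, 0.001, 0.0005, 0.0001],
    "model_dropout": [0.0, 0.1, 0.2, 0.3],
    "layers_number": [1, 2, 4, 6, 8],
    "use_cluster_pooling": [True, False],
}

def sample_configs(space, n_trials, seed=0):
    """Draw n_trials configurations, picking each value uniformly at random."""
    rng = random.Random(seed)
    return [
        {name: rng.choice(values) for name, values in space.items()}
        for _ in range(n_trials)
    ]

configs = sample_configs(SEARCH_SPACE, n_trials=100)   # fixed budget of 100 trials
```

Each sampled dictionary would then be passed to a training run, and the configuration with the best validation score kept.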
Below we list the hyperparameters ranges used in the chemistry experiment.
|C||0.25, 0.4375, 0.625, 0.8125, 1., 1.1875, 1.375, 1.5625, 1.75, 1.9375, 2.125, 2.3125, 2.5, 2.6875, 2.875, 3.0625, 3.25, 3.4375, 3.625, 3.8125, 4.|
|gamma||0.0125, 0.021875, 0.03125, 0.040625, 0.05, 0.059375, 0.06875, 0.078125, 0.0875, 0.096875, 0.10625, 0.115625, 0.125, 0.134375, 0.14375, 0.153125, 0.1625, 0.171875, 0.18125, 0.190625, 0.2|
|estimators number||125, 218, 312, 406, 500, 593, 687, 781, 875, 968, 1062, 1156, 1250, 1343, 1437, 1531, 1625, 1718, 1812, 1906, 2000|
|batch size||64, 128, 256|
|learning rate||0.002, 0.001, 0.0005|
|filters number||64, 128, 192, 256|
|fully connected nodes number||128, 256, 512|
|batch size||16, 32, 64, 128|
|epochs number||20, 40, 60, 80, 100|
|learning rate||0.002, 0.001, 0.00075, 0.0005|
|graph features number||32, 64, 96, 128, 256|
|pair features number||14|
|batch size||8, 16, 32, 64|
|epochs number||25, 50, 75, 100|
|learning rate||0.002, 0.001, 0.00075, 0.0005|
|T||1, 2, 3, 4, 5|
|M||2, 3, 4, 5, 6|
|batch size||16, 32, 64, 128, 256, 512|
|EAGCN structure||’concate’, ’weighted’|
|epochs number||100, 500, 1000|
|learning rate||0.01, 0.005, 0.001, 0.0005, 0.0001|
|dropout||0.0, 0.1, 0.3|
|weight decay||0.0, 0.001, 0.01, 0.0001|
|sgc1 1||30, 60|
|sgc1 2||5, 10, 15, 20, 30|
|sgc1 3||5, 10, 15, 20, 30|
|sgc1 4||5, 10, 15, 20, 30|
|sgc1 5||5, 10, 15, 20, 30|
|sgc2 1||30, 60|
|sgc2 2||5, 10, 15, 20, 30|
|sgc2 3||5, 10, 15, 20, 30|
|sgc2 4||5, 10, 15, 20, 30|
|sgc2 5||5, 10, 15, 20, 30|
|den1||12, 32, 64|
|den2||12, 32, 64|
Missing data experiments
Below we list the hyperparameters ranges used in the missing data experiment.
|batch size||16, 32, 64, 128|
|learning rate||0.0001, 0.0005, 0.001, 0.005|
|layers dimensionality||64, 128, 256, 512|
|layers number||2, 3, 4, 5|