A Convolutional Neural Network into graph space
Abstract
Convolutional neural networks (CNNs) have, within a few decades, outperformed the existing state-of-the-art methods in classification tasks. However, as originally formalised, CNNs are bound to operate on Euclidean spaces. Indeed, convolution is a signal operation that is defined on Euclidean spaces. This has restricted the main use of deep learning to Euclidean-defined data such as sound or images.
And yet, numerous computer application fields (among which network analysis, computational social science, chemoinformatics and computer graphics) produce non-Euclidean data such as graphs, networks or manifolds.
In this paper we propose a new convolutional neural network architecture, defined directly in graph space. Convolution and pooling operators are defined in the graph domain. We show its usability in a backpropagation context.
Experimental results show that our model's performance is at state-of-the-art level on simple tasks. It shows robustness with respect to graph domain changes and improvement with respect to other Euclidean and non-Euclidean convolutional architectures.
1 Introduction
Graphs are frequently used in various fields of computer science, since they constitute a universal modeling tool which allows the description of structured data. The handled objects and their relations are described in a single and human-readable formalism. Hence, tools for supervised graph classification and graph mining are required in many applications such as pattern recognition [22], chemical component analysis [9] and structured data retrieval [20].
1.1 Graph Classification
Graph classifiers fall into two categories, depending on whether the classifier operates in a graph space or in a vector space.
Graph space
Graph space classification consists of finding a metric (within the graph space) to evaluate the dissimilarity between two graphs. This metric can later be used in a k-Nearest Neighbor context, where the distances between the object to be classified and the elements of the learning database are used as a basis for classification. Evaluating the similarity or dissimilarity between two graphs requires computing and evaluating the "best" matching between them. Since exact isomorphism rarely occurs in pattern analysis applications, the matching process must be error-tolerant, i.e., it must tolerate differences in the topology and/or the labeling. For instance, in the Graph Edit Distance (GED) problem [22], the graph matching process and the dissimilarity computation are linked through the introduction of a set of graph edit operations. Each edit operation is characterized by a cost, and the dissimilarity measure is the total cost of the least expensive set of operations that transforms one graph into the other.
In [22, 3], the GED is shown to be equivalent to a Quadratic Assignment Problem (QAP). Since error-tolerant graph matching problems are NP-hard, most research has long focused on developing accurate and efficient approximate algorithms. In [3], with this quadratic formulation, two well-known graph matching methods, the Integer Projected Fixed Point method [14] and the Graduated Non-Convexity and Concavity Procedure [15], are applied to GED. The heuristic of [14] improves an initial solution by solving a linear sum assignment problem (LSAP) and a relaxed QAP where binary constraints are relaxed to the continuous domain. The algorithm iterates through gradient descent, using the Hungarian algorithm to solve the LSAP and a line search. In [15], a path-following algorithm approximates the solution of a QAP by considering a convex-concave relaxation through a modified quadratic function.
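To make the k-Nearest Neighbor use of a graph dissimilarity concrete, here is a minimal Python sketch. It is only an illustration: the function names are hypothetical, and the toy dissimilarity (an L1 distance on vertex and edge counts) is a stand-in for a real measure such as an approximate GED.

```python
from collections import Counter

def knn_classify(query, training_set, dissimilarity, k=3):
    """Classify `query` by majority vote among its k nearest training items.

    `training_set` is a list of (item, label) pairs and `dissimilarity`
    is any function returning a non-negative number for a pair of items.
    """
    # Rank training items by their dissimilarity to the query.
    ranked = sorted(training_set, key=lambda pair: dissimilarity(query, pair[0]))
    # Majority vote among the k closest items.
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy stand-in for a graph dissimilarity (assumption, not a GED):
# graphs are summarised as (number of vertices, number of edges).
def toy_dissimilarity(g1, g2):
    return abs(g1[0] - g2[0]) + abs(g1[1] - g2[1])

training = [((3, 2), "chain"), ((4, 3), "chain"), ((6, 5), "chain"),
            ((3, 3), "cycle"), ((4, 4), "cycle"), ((6, 6), "cycle")]
prediction = knn_classify((5, 4), training, toy_dissimilarity, k=3)
```

Any dissimilarity function can be plugged in without changing the classifier, which is what makes this scheme attractive in graph space.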
Vector space
Vector space graph classification consists of representing graphs as vectors in order to classify them.
A first approach transforms the initial structural problem into a common statistical pattern recognition problem by describing the graphs with vectors in a Euclidean space [16]. In such a context, some features (vertex degrees, label occurrence histograms, etc.) are extracted from the graph. Hence, the graph is projected into a Euclidean space and classical machine learning algorithms can be applied. Such approaches suffer from a main drawback: to obtain a satisfactory description of the topological structure and graph content, the number of such features has to be very large, and dimensionality issues occur.
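As an illustration of such an explicit embedding, the following Python sketch maps a labelled graph to a fixed-length vector made of a degree histogram and a label occurrence histogram. The function name, the clipping of large degrees and the toy graph are assumptions made for the example, not part of any cited method.

```python
from collections import Counter

def embed_graph(adjacency, labels, label_alphabet, max_degree):
    """Map a labelled graph to a fixed-length feature vector.

    adjacency: dict vertex -> set of neighbouring vertices (undirected).
    labels: dict vertex -> discrete label.
    The vector concatenates a degree histogram (degrees 0..max_degree,
    larger degrees are clipped) and a label occurrence histogram.
    """
    degree_hist = [0] * (max_degree + 1)
    for neighbours in adjacency.values():
        degree_hist[min(len(neighbours), max_degree)] += 1
    counts = Counter(labels.values())
    label_hist = [counts.get(symbol, 0) for symbol in label_alphabet]
    return degree_hist + label_hist

# A 3-vertex path a - b - c with labels C, O, C.
adjacency = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
labels = {"a": "C", "b": "O", "c": "C"}
vector = embed_graph(adjacency, labels, label_alphabet=["C", "N", "O"], max_degree=3)
```

The vector length grows with the number of features retained, which is the dimensionality issue mentioned above.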
Another possible approach also consists of projecting the graphs into a Euclidean space of a given dimension, but using a distance matrix between each pair of graphs. In such cases, a dissimilarity measure between graphs has to be designed [5]. Kernels can be derived from the distance matrix. This is the case for the multidimensional scaling methods proposed in [23].
Alternatively, graph embedding can be implemented implicitly through kernel-based machine learning algorithms. In kernel approaches, an explicit data representation is of secondary interest. That is, rather than defining individual representations for each pattern or object, the data at hand is represented by pairwise comparisons only. The graphs are projected into a Euclidean space implicitly, without the embedding function being defined explicitly. More formally, under given conditions, a similarity function can be replaced by a graph kernel function. Most kernel methods can only process kernel values which are established by symmetric and positive definite kernel functions. Many kernels have been proposed in the literature [18, 9]. In most cases, the graph is embedded in a feature space composed of label sequences through a graph traversal. According to this traversal, the kernel value is then computed by measuring the similarity between label sequences. Even if such approaches have proven to achieve high performance, they suffer from a lack of interpretability. In fact, it is very difficult to go back from the kernel space to the graph space. This is known as the "pre-image" problem.
1.2 Euclidean and geometric deep learning
Deep learning has achieved a remarkable performance breakthrough in several fields, most notably in speech recognition, natural language processing, and computer vision. In particular, convolutional neural network (CNN) architectures currently produce state-of-the-art performance on a variety of image analysis tasks such as object detection and recognition. Most deep learning research has so far focused on 1D, 2D, or 3D Euclidean structured data such as acoustic signals, images, or videos.
Recently, there has been an increasing interest in geometric deep learning, which attempts to generalize deep learning methods to non-Euclidean structured data such as graphs and manifolds, with a variety of applications from the domains of network analysis, computational social science, or computer graphics. Graph neural networks are one possible way to implement explicit graph embedding: the neural network takes a graph as input and outputs a vector, which can be used for classification. Moreover, graph neural networks learn the explicit embedding according to a given learning criterion.
1.3 Graph Neural Networks
These neural networks often try to apply convolution to graphs so that it mimics classical convolutional neural networks. Defining convolution on graph space is a difficult theoretical task. There is indeed no straightforward definition. However, one can identify two families of definitions in the existing literature. The first family (spectral approaches) relies on the convolution theorem. This theorem states that the convolution operator on the spatial domain is equivalent to the product operator on the frequency domain. Although this theorem was only proven on Euclidean spaces, a group of approaches in the literature postulates its validity on the graph space. A graph frequency domain is accessed through diagonalization of its Laplacian $L = D - A$ ($D$ and $A$ respectively being the degree and adjacency matrices of the graph). Such approaches have two main limitations. The first one is their sensitivity to topological variations: a slight deformation of the graph structure changes the resulting convolution signal drastically. The second is that there is no Fast Fourier Transform on the graph space: as previously stated, accessing the graph frequency domain relies on matrix diagonalization and therefore inversion. Inverting a matrix is a costly operation.
These drawbacks exist because convolution is applied implicitly to the graph through its frequency domain. A simple way to avoid them is to apply convolution directly on the spatial domain. The second family of approaches (the spatial ones) tries to come up with analogies of the original convolution definition. However, existing approaches often degrade graphs and therefore do not fully exploit their structural information.
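For reference, the Laplacian used by spectral approaches can be built directly from the adjacency matrix. The sketch below assumes the usual combinatorial definition (degree matrix minus adjacency matrix) and a plain list-of-lists representation; the function name is illustrative.

```python
def graph_laplacian(adjacency):
    """Return the combinatorial Laplacian (degree matrix minus adjacency
    matrix) of an undirected graph.

    adjacency: square list of lists, adjacency[i][j] = 1 if i and j are linked.
    """
    n = len(adjacency)
    degrees = [sum(row) for row in adjacency]
    return [[(degrees[i] if i == j else 0) - adjacency[i][j] for j in range(n)]
            for i in range(n)]

# Path graph 0 - 1 - 2.
A = [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]]
L = graph_laplacian(A)
```

Spectral methods then diagonalize this matrix, a step whose cost grows cubically with the number of vertices for dense solvers, which is the computational limitation discussed above.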
In this paper, we propose a graph convolution operator which operates solely on graph space. This is made possible by using graph matching to define the local convolution operation. By doing so, we try to establish a link between two scientific communities that respectively work on graphs and deep learning. More specifically, we define graph-based computations using operators from the graph matching literature in a deep learning (neural network) framework.
2 State of the Art
This section offers a review of existing graph neural network definitions. Every graph neural network layer can be written as a nonlinear function of the node features $H^{(l)}$ of layer $l$ and of the adjacency matrix $A$:
$$H^{(l+1)} = f(H^{(l)}, A)$$
As an example, let's consider the following very simple form of a layer-wise propagation rule:
$$f(H^{(l)}, A) = \sigma(A H^{(l)} W^{(l)})$$
where $W^{(l)}$ is the weight matrix of layer $l$ and $\sigma$ is a nonlinear activation function like the ReLU.
Multiplying the input with $D^{-1} A$ now corresponds to taking the average of neighboring node features from the layer $l$. It is also called "average neighbor messages" in the literature, and it acts like passing average node features from one layer to another. In [11], a better (symmetric) normalization of the adjacency matrix is proposed, i.e. $D^{-1/2} A D^{-1/2}$. A per-neighbor normalization is performed instead of a simple average: the normalization varies across neighbors.
$$f(H^{(l)}, A) = \sigma(\hat{D}^{-1/2} \hat{A} \hat{D}^{-1/2} H^{(l)} W^{(l)})$$
with $\hat{A} = A + I$, where $I$ is the identity matrix and $\hat{D}$ is the diagonal node degree matrix of $\hat{A}$. The time complexity of this model is $\mathcal{O}(|E|)$ overall ($E$ being the set of edges).
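The symmetric normalization of [11] is commonly written as a ReLU applied to $\hat{D}^{-1/2}(A + I)\hat{D}^{-1/2} H W$. The following pure-Python sketch implements one such propagation step on small dense matrices. It is an illustration under that assumption, with hypothetical helper names, and it ignores the sparse implementation a real model would use.

```python
import math

def matmul(X, Y):
    """Dense matrix product on lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def gcn_layer(A, H, W):
    """One propagation step with self-loops and symmetric normalisation."""
    n = len(A)
    # Add self-loops so that each node keeps its own features.
    A_hat = [[A[i][j] + (1 if i == j else 0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in A_hat]
    # Symmetric normalisation: entry (i, j) is divided by sqrt(d_i * d_j).
    A_norm = [[A_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
              for i in range(n)]
    Z = matmul(matmul(A_norm, H), W)
    # ReLU activation.
    return [[max(0.0, z) for z in row] for row in Z]

A = [[0, 1], [1, 0]]   # two connected nodes
H = [[1.0], [3.0]]     # one feature per node
W = [[1.0]]            # 1x1 weight matrix
out = gcn_layer(A, H, W)
```

With two connected nodes and self-loops, every normalised coefficient equals 0.5, so each node receives the average of both feature values.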
More operations have been investigated in the literature [19]. A complete family of operators can be used:
$I$: this identity operator does not consider the structure of the graph and does not provide any aggregation. Used alone, this operator makes the GNN a composition of completely independent MLPs, one for each node feature vector.
$A$: the adjacency operator gathers information on the node neighborhood (1 hop).
$D$: $D = \mathrm{diag}(A\mathbf{1})$. This degree operator gathers information on the node degree. $D$ is the node degree matrix (a diagonal matrix).
$A_j$: $A_j = \min(1, A^{2^j})$. It encodes $2^j$-hop neighborhoods of each node, and allows us to aggregate local information at different scales, which is useful in regular graphs.
$U$: $U$ is a matrix filled with ones. This average operator broadcasts information globally at each layer, thus giving the GNN the ability to recover average degrees, or more generally moments of local graph properties.
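Let us denote by $\mathcal{F} = \{I, A, D, A_j, U\}$ this family of operators. A GNN layer is defined as:
$$H^{(l+1)} = \sigma\Big(\sum_{B \in \mathcal{F}} B\, H^{(l)}\, W_B^{(l)}\Big)$$
where $H^{(l)}$ denotes the node features at layer $l$, $\sigma$ is a nonlinear activation function and the matrices $W_B^{(l)}$ are trainable parameters.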
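Key distinctions lie in how different approaches aggregate messages. So far, the proposals discussed above aggregate the neighbor messages by taking their (weighted) average, but is it possible to do better? In [10], a GNN called GraphSAGE is proposed, in which the aggregation of neighbor information is more general. The general aggregation scheme can be written thanks to an aggregation function $\mathrm{AGG}$:
$$h_v^{(l+1)} = \sigma\big(W^{(l)} \cdot [\,h_v^{(l)},\ \mathrm{AGG}(\{h_u^{(l)}, \forall u \in \mathcal{N}(v)\})\,]\big)$$
where $h_v^{(l)}$ is the embedding of node $v$ at layer $l$, $[\cdot,\cdot]$ denotes concatenation and $\mathcal{N}(v)$ is the set of nodes in the 1-hop neighborhood of node $v$. Several choices of $\mathrm{AGG}$ are possible:
mean: $\mathrm{AGG} = \frac{1}{|\mathcal{N}(v)|} \sum_{u \in \mathcal{N}(v)} h_u^{(l)}$.
max: $\mathrm{AGG} = \max(\{h_u^{(l)}, \forall u \in \mathcal{N}(v)\})$. The neighbor vectors are stacked into a matrix and an element-wise max pooling is applied.
LSTM: $\mathrm{AGG} = \mathrm{LSTM}([h_u^{(l)}, \forall u \in \pi(\mathcal{N}(v))])$, where $\pi$ is a random permutation. The idea is to provide the LSTM with a sequence composed of neighbor embeddings, so the input sequence is composed of $|\mathcal{N}(v)|$ vectors. The sequence is randomly permuted by the function $\pi$.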
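The mean and max aggregators can be sketched in a few lines of Python. This is a simplified illustration: the function names and the toy embeddings are assumptions, and any learned transformation applied before or after aggregation is left out.

```python
def mean_aggregate(neighbour_vectors):
    """Element-wise mean of the neighbour embeddings."""
    n = len(neighbour_vectors)
    return [sum(column) / n for column in zip(*neighbour_vectors)]

def max_aggregate(neighbour_vectors):
    """Element-wise max pooling over the stacked neighbour embeddings."""
    return [max(column) for column in zip(*neighbour_vectors)]

# Hypothetical embeddings of the three neighbours of some node v.
embeddings = {"u1": [1.0, 4.0], "u2": [3.0, 0.0], "u3": [2.0, 2.0]}
vectors = [embeddings[u] for u in ("u1", "u2", "u3")]

mean_msg = mean_aggregate(vectors)
max_msg = max_aggregate(vectors)

# Both aggregators are invariant to the order of the neighbours.
same_mean = mean_aggregate(vectors[::-1]) == mean_msg
same_max = max_aggregate(vectors[::-1]) == max_msg
```

Order invariance is what distinguishes these two aggregators from the LSTM variant, which depends on the sequence order and therefore relies on a random permutation of the neighbors.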
In [17], the graph structure is locally embedded into a vector space. The distribution of local structures in the local space is estimated by a Gaussian Mixture Model. The aggregation function is then expressed as a mixture of Gaussians. The Gaussian parameters (covariance matrices and mean vectors) are learnt during the training of the neural network.
A notable variant of GNN is the graph attention network (GAT), first proposed in [25]. This model includes a self-attention mechanism to evaluate the individual importance of the adjacent nodes. It can therefore be applied to graph nodes having different degrees by assigning arbitrary weights to the neighbors [25].
Deadlocks, contributions and motivations
From the literature, two main deadlocks can be identified. First, in many of the related works [11, 19, 25], edge features are not well considered. However, edge information is of prime interest to boost the structural knowledge in the computation of the node embedding. Second, most of the aforementioned approaches do not take full advantage of the graph topology [17, 11]: the graph structure is locally embedded into a vector space (i.e. the tangent space at a given point of a Riemannian manifold). In this paper, we propose CNN architectures that remain in the graph domain. In particular, we design a convolution operator on graph space through the solution of a graph matching problem. The problem of graph matching under node and pairwise constraints is fundamental to capture topological information. It takes into account the node and edge features along with their neighborhood structure. Consequently, graph matching-based convolution can release the deadlocks related to edge information integration, sensitivity to domain changes and Euclidean space projection. Graph matching can be seen as adding local constraints to the machine learning problem. We promote a novel class of neural network architecture where layers contain a combinatorial optimization scheme that plays a fundamental role in the construction of the entire neural network architecture. Consequently, we highlight the interplay between machine learning and combinatorial optimization.
3 Graph Convolutional Neural Network
3.1 Notation
Frequently used notations are summarized in Table 1.
Notation  Description 
An input graph  
A filter graph  
Neighbourhood subgraph rooted at vertex in  
,  Vertices in graph 
An edge in graph between and  
A vertex in  
An edge in between and  
Labelling function for vertices  
Labelling function for edges  
A filter graph and its associated weights  
Vertex label of  
Vertex label of parametrized by  
Cardinality of  
Kronecker delta of and 
3.2 Graph matching
To define our convolution operator, we must first define the graph matching function that will be applied pointwise.
Graph matching problem
Let and be attributed graphs: and  
(1a)  
subject to  (1b)  
(1c)  
(1d)  
(1e) 
The similarity function is defined as follows:
(2a)  
(2b)  
(2c)  
(2d)  
(2e) 
Let denote an assignment of element (edge or vertex) to some element in :  
(3a)  
(3b)  
(3c)  
(3d) 
The similarity function can be rewritten as follows:
(4a)  
(4b) 
3.3 Graph convolution based on graph matching
Now that our matching operator is formulated, we can apply it over an input graph to compute the result of a convolution.
Let and be attributed graphs: and . and are respectively referred to as the input graph and the filter graph.
Graph convolution operator
The graph convolution operator is a function and is defined as follows:
(5a)  
with  (5b)  
(5c) 
where and are defined as follows.
Vertex neighbourhood graph (hops)
is defining the neighbourhood (which is a subgraph) for vertex in :
(6a)  
with  (6b)  
and  (6c) 
Edge attribute in convolved graph
score is a function mapping an edge to its matching score in the found GMS. The problem is that it might be assigned multiple times:
(7) 
potentially contains more than one element. Therefore, score can be defined as follows:
(8a)  
with  (8b) 
3.4 Convolution layer
Now that the convolution operator is defined, it is possible to use it as a base to build a convolution layer. This layer can be included in a graph neural network.
Graph convolution filter: the filter graph
A graph convolution filter is an attributed graph . Its role is analogous to that of a vanilla CNN kernel: it modifies the output and gets modified through backpropagation. Every attribute function is parametrized with respect to a weight vector .
(9a)  
with  (9b)  
(9c) 
Graph convolution layer
A convolution layer is a set of convolution filters applied to the same input graph. The output of the layer consists of the results of all filters (analogous to Euclidean convolution feature maps) stacked together.
Let be the output function of the layer s.t.:
(10a)  
with  (10b)  
(10c)  
(10d)  
(10e)  
function keeps only a single graph structure and concatenates each vertex/edge attribute. The output function of the layer is a graph with same topology as but with attributes as vectors composed by attributes of every filters outputs. 
Graph convolution computation can be seen as a step-by-step process (shown in Figure 1). The first step is neighbourhood extraction: for each vertex of the input graph, its neighbourhood graph is extracted. It is composed of every neighbour of the vertex within a given range (it can be 1 hop away but also n hops away). The neighbourhood graph and the filter graph are then matched. The matching score becomes the output of the convolution at that vertex.
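The neighbourhood extraction step can be sketched with a breadth-first search limited to a given number of hops. In this illustration the neighbourhood graph is taken to be the subgraph induced by the reachable vertices, which is an assumption of the example, and the function names are hypothetical.

```python
from collections import deque

def k_hop_neighbourhood(adjacency, root, k):
    """Return the set of vertices reachable from `root` in at most k hops."""
    distance = {root: 0}
    queue = deque([root])
    while queue:
        v = queue.popleft()
        if distance[v] == k:
            continue  # do not expand beyond k hops
        for u in adjacency[v]:
            if u not in distance:
                distance[u] = distance[v] + 1
                queue.append(u)
    return set(distance)

def induced_subgraph(adjacency, vertices):
    """Keep only the given vertices and the edges between them."""
    return {v: adjacency[v] & vertices for v in vertices}

# Path graph 0 - 1 - 2 - 3 - 4.
path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
one_hop = k_hop_neighbourhood(path, 2, 1)
two_hop = k_hop_neighbourhood(path, 2, 2)
sub = induced_subgraph(path, one_hop)
```

Each extracted subgraph is then the input of one matching problem against the filter graph, so the number of matching problems equals the number of vertices of the input graph.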
Definition 1
Graph convolution differentiation Let the input graph and be the output of a convolution layer (called Conv) s.t. .
(11a)  
(11b) 
To simplify notations, let’s consider the output of the Conv layer to be the vertex and edge labelling functions and , as neither the vertices or edges sets change during convolution. The output will be noted as in Equation 12b
Let be a loss function (for example meansquared error or categorical crossentropy). Let’s suppose Conv is involved in the calculation of such that:
and respectively being the processing before and after Conv. In order to minimize , its gradient must be calculated with respect to . This gradient will then be used to modify itself. For calculus needs, let Conv be the output function of the Conv layer. This output function is defined w.r.t. the graph labeling functions:
(12a)  
(12b)  
(12c)  
(12d)  
(12e) 
The error gradient for is calculated using chain derivative:
(13a)  
(13b)  
(13c) 
Let’s assume exists and is known. Therefore, only is to be calculated.
First of all, we need to expand . Let’s expand first for a given vertex :  
(14a)  
(14b)  
(14c)  
(14d)  
(14e)  
Then let’s expand :  
(14f)  
(14g)  
If is , the same rewriting as in Equation 14c applies:  
(14h)  
(14i)  
(14j)  
If is avg:  
(14k) 
Let’s differentiate with respect to . :
(15a)  
(15b) 
Now let’s differentiate . and :
If is :
(16a) 
If is avg:
(17a) 
In any case:
(18a) 
Finally, let’s suppose is parameterized with vector . In this case, is to be calculated:
(21a)  
(21b)  
(21c) 
Let’s assume exists and is known. has already been evaluated. Therefore, only is to be calculated. :
(22a)  
(22b) 
If is , :
(23a) 
If is avg, :
(24a) 
In any case:
(25) 
3.5 About graph matching differentiation
In Definition 1, a differentiation of the convolution operator is proposed. This differentiation does not take into account the dependencies between the optimal graph matching and the variables and . As these variables are used to calculate the possible matchings, it is trivial to conclude such dependencies exist. Nevertheless, the matching solver in use (see Subsection 3.9) is not differentiable, at least a priori. We therefore assumed as a constant in the gradient calculus with respect to these variables by means of change of variable in Eq. 14c.
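The effect of treating the matching as a constant can be illustrated on a toy vertex-only case. The sketch below assumes, purely for illustration, that the matching score is the sum of products between filter vertex attributes and the attributes of the vertices they are assigned to. The brute-force solver and all names are hypothetical and do not correspond to the solver used in the model.

```python
from itertools import permutations

def best_assignment(weights, values):
    """Brute-force search of the injective assignment (filter vertex ->
    neighbourhood vertex) maximising the sum of products.
    Only practical for tiny neighbourhoods."""
    return max(permutations(range(len(values)), len(weights)),
               key=lambda a: sum(w * values[j] for w, j in zip(weights, a)))

def score(weights, values, assignment):
    return sum(w * values[j] for w, j in zip(weights, assignment))

weights = [0.5, -1.0]      # hypothetical filter vertex attributes
values = [0.2, 0.9, 0.4]   # hypothetical neighbourhood vertex attributes
assignment = best_assignment(weights, values)

# With the assignment frozen, the derivative of the score with respect
# to a filter attribute is simply the attribute of the matched vertex.
analytic_grad = [values[j] for j in assignment]

# Finite-difference check using the same frozen assignment.
eps = 1e-6
numeric_grad = []
for i in range(len(weights)):
    bumped = list(weights)
    bumped[i] += eps
    numeric_grad.append(
        (score(bumped, values, assignment) - score(weights, values, assignment)) / eps)
```

The approximation ignores the fact that a large enough change of the weights could change the optimal assignment itself, which is exactly the dependency neglected in the gradient calculus.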
3.6 A ”no edge matching” version of the graph convolution layer
This section presents a degraded model. It ignores topology at a local level by not matching edges. It therefore reduces the graph matching problem to a node assignment problem inside a given neighborhood. One concern about this simplification could be that we do not take advantage of the graph topology. However, topology information is used when computing vertex neighbourhoods. Additionally, this model has a lower time complexity, as edge information is not taken into account (see details in Subsection 3.9).
The graphs used are 3-tuples and the similarity function is simplified as follows:
(26a)  
(26b) 
As a consequence of the deletion of edge attributes in the filter graph, its parameter becomes a vector (with as many parameters as vertices). The filter is defined as follows:
(27a)  
with  (27b) 
The output function of the filter is defined as follows:
(28a)  
(28b) 
3.7 Graph pooling
As in Euclidean convolutional neural networks, we want to implement not only convolutional layers but also pooling/downsampling layers. In the existing literature, downsampling is viewed as graph coarsening [4]. A recurrent choice of graph coarsening algorithm seems to be Graclus [8] (used in [17, 7]).
We propose to use a community detection algorithm (the Louvain method [2]) as the basis of our graph pooling layer. The Louvain method deals with weighted graphs. In our case, the weight of an edge is computed as the scalar product of the attributes of its two end vertices. This choice is motivated by the following intuition: the higher the scalar product of two vertices' attributes, the more likely these vertices are to fall in the same cluster (because a higher scalar product implies vector similarity).
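The edge weighting step can be sketched as follows. Only the weight computation is shown; the function name and the toy attributes are illustrative, and the resulting weighted graph would then be handed to a Louvain implementation, which is not reproduced here.

```python
def scalar_product_edge_weights(edges, attributes):
    """Weight each edge (u, v) by the dot product of its endpoint attributes."""
    return {(u, v): sum(a * b for a, b in zip(attributes[u], attributes[v]))
            for u, v in edges}

# Vertices 0 and 1 carry identical attributes, vertex 2 an orthogonal one.
attributes = {0: [1.0, 0.0], 1: [1.0, 0.0], 2: [0.0, 1.0]}
edges = [(0, 1), (1, 2)]
weights = scalar_product_edge_weights(edges, attributes)
```

In this toy case the edge between the two similar vertices gets the larger weight, so a modularity-based method is more likely to group them in the same community.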
3.8 Hyperparameterization
As in any neural network, graph neural networks have hyperparameters, i.e. parameters that are not optimized by gradient descent.
The first one is the graph filter (its number of nodes and adjacency matrix). The number of nodes in the graph filter is analogous to the size of a classic convolution kernel. A $3 \times 3$ kernel filter is equivalent to a 9-node filter graph with grid-like adjacency. The second hyperparameter is the size of the extracted neighbourhood graphs, which is the maximum node distance in a given node neighbourhood. A 2-hop neighbourhood will include nodes that can be reached from the origin node in two hops or less.
These hyperparameters could be optimized through grid or random search. However, to restrict our study, we will consider the following postulate: a graph filter should be congruent with the extracted neighbourhoods. In other words, the two should have equal sizes and, as much as possible, identical topologies. This postulate comes from classic image convolution, where each kernel coefficient is matched with one and only one image coefficient.
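The congruence between a filter graph and a 1-hop neighbourhood can be checked on a grid. The sketch below assumes an 8-connected grid (horizontal, vertical and diagonal links), which is one possible reading of "grid-like adjacency" and is consistent with 9-node neighbourhoods. The function name is hypothetical.

```python
def grid_graph(height, width):
    """8-connected grid: each pixel is linked to its horizontal, vertical
    and diagonal neighbours (an assumption of this sketch)."""
    adjacency = {}
    for r in range(height):
        for c in range(width):
            adjacency[(r, c)] = {
                (r + dr, c + dc)
                for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                if (dr, dc) != (0, 0)
                and 0 <= r + dr < height and 0 <= c + dc < width
            }
    return adjacency

filter_graph = grid_graph(3, 3)   # 9-node filter graph
image_graph = grid_graph(5, 5)

# 1-hop neighbourhoods (the root vertex plus its direct neighbours).
centre_neighbourhood = {(2, 2)} | image_graph[(2, 2)]
corner_neighbourhood = {(0, 0)} | image_graph[(0, 0)]
```

Interior vertices yield neighbourhoods with as many nodes as the filter graph, while border vertices yield smaller ones, which is one reason why the congruence can only hold "as much as possible".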
3.9 Choosing the graph matching solver
The algorithm for solving the graph matching problem is a critical element of the model. The first reason is that it is potentially the most complex step, since graph matching problems are up to NP-hard. Additionally, graph matching is solved as many times as there are vertices in the input graph (the size of each problem being that of the corresponding vertex neighbourhood).
We opted for a bipartite (BP) graph matching algorithm [21]. The complexity of such an algorithm is among the lowest (polynomial time) for solving error-tolerant graph matching problems suboptimally.
The bipartite graph matching algorithm reduces graph matching to vertex matching by embedding an estimation of the edge costs in the vertex costs. This edge cost estimation is computed by solving an edge-assignment problem for every node-matching possibility. Therefore, BP has to solve as many matching problems as there are edge costs.
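Once edge costs are folded into the vertex costs, the remaining problem is a linear sum assignment problem on a cost matrix. The sketch below solves a tiny instance by exhaustive search, only to show what is being optimised. It is not the BP algorithm itself, and in practice a polynomial-time solver such as the Hungarian algorithm (available for instance as `scipy.optimize.linear_sum_assignment`) would be used instead.

```python
from itertools import permutations

def solve_lsap(cost):
    """Exhaustively solve a square linear sum assignment problem.

    cost[i][j] is the cost of assigning row i to column j. Returns the
    optimal column for each row and the total cost. The search is
    factorial in the matrix size, so it is limited to toy examples.
    """
    n = len(cost)
    best = min(permutations(range(n)),
               key=lambda cols: sum(cost[i][cols[i]] for i in range(n)))
    return list(best), sum(cost[i][best[i]] for i in range(n))

# Hypothetical vertex-to-vertex cost matrix.
cost = [[4, 1, 3],
        [2, 0, 5],
        [3, 2, 2]]
assignment, total = solve_lsap(cost)
```

Here row 0 is assigned to column 1, row 1 to column 0 and row 2 to column 2, for a total cost of 5.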
We used a variant of BP called Square Fast BP [24] where the cost matrix for vertex matching is of size with and being number of vertices in filter graph and neighbourhood graph . Assuming both neighbourhood and filter graphs are complete, a matching problem complexity is .
As a consequence, worst case complexity with fast bipartite matching is the following:
Some preliminary experiments showed impractical computation times for the full model. As a first workaround, the experimental part of this paper focuses on the "no edge matching" model. This workaround kept processing time at an acceptable level (that is, suitable for small classification experiments). Edge cost estimation by edge matching is no longer required. The simplified model has the following pointwise complexity:
4 Experimental work
In this section, we test the model according to several parameters. We want to test our model with a simple classification task on MNIST digit images.
4.1 Baselines
Our approach was compared with two other approaches:
Vanilla CNN layer
Mixture model graph CNN (MoNet) [17]
The same network topology was used for all approaches. It consists of classical Conv-Pool blocks linearly connected. Figure 2 shows the exact network structure in use. In the case of graph convolution, the equivalents of convolution filters are filter graphs, and pooling becomes 4-node pooling. The number of nodes of the filter graphs is set depending on the average graph connectivity in a given dataset: if the average number of neighbours in a given dataset is 9, 9-node filters are used.
The last layer is a global pooling layer. As in the Euclidean case, it consists of aggregating each filter's feature map into one scalar value. In our case, each feature map is aggregated by taking its average value.
4.2 Data
Quantitative experiments in this section are carried out on digit images of the MNIST dataset [13]. We chose this dataset as it is in use in the graph convolution literature. MNIST is a good "hello world" machine learning (ML) dataset: it helps to iterate quickly on the learning model. Performance information gathered from experiments on MNIST can help to judge how the model might perform on much harder and larger datasets like ImageNet.
In addition to the original MNIST dataset, a rotated version was used [12]. To compare results with MNIST-rotated, MNIST-original has to be modified as follows. MNIST-rotated proportions are unusual: 10000, 2000 and 50000 images respectively for train, validation and test, whereas MNIST-original has 60000 and 10000 images respectively for train/validation and test. We used MNIST-reduced, a resampled version of MNIST-original that fits the MNIST-rotated ratio between subset cardinalities: MNIST-reduced and MNIST-rotated both have 10000, 2000 and 50000 images respectively for train, validation and test. All the set cardinalities are summed up in Table 2. Note that the test set of MNIST-reduced is larger than the training set by a factor of 5; consequently, the generalization ability is better assessed.
Dataset  Training set  Validation set  Testing set 
MNIST-original  48 000  12 000  10 000 
MNIST-rotated  10 000  2 000  50 000 
MNIST-reduced  10 000  2 000  50 000 
MNIST-mixed  10 000  2 000  50 000 
Lastly, to test rotation invariance, a third MNIST-based dataset was added: MNIST-mixed. It was generated by combining the MNIST-reduced train and validation sets with the MNIST-rotated test set. It is designed so that the models are trained on rotation-free images but tested on rotated images.
As MNIST is an image dataset, a graph-based representation of images has to be chosen. The representations used in [17] are superpixel graphs and grid graphs. We used grids (built from resized images) and generated 75-superpixel Region Adjacency Graphs (RAG) using the SLIC algorithm [1], with superpixel adjacency as edges (see Table 3). Sample graphs are depicted in Figure 3.
Representation  Nb nodes  Vertex attributes  Edge attributes 
grid  Pixel intensities  Relative polar coordinates  
75 superpixels  75 (average)  Average superpixel intensities 
4.3 Parameterization
Following hyperparameters were set after preliminary tests were conducted: Models are trained during 50 epochs using Adaptive Moment (Adam) gradient descent (learning rate ). Neighbourhood reach in use is 1hop and filter size was set in accordance with average neighbourhood size (9 nodes).
4.4 Protocol
The following experiments were conducted:
Experiment 1: Models are tested on the MNIST digit image classification task.
Experiment 2: Several neighbourhood connectivities are tested on our model (1 and 2 hops).
Experiment 3: Rotation invariance is investigated. Spatial information for our datasets is conveyed by edge attributes. In such a frame, as our "no edges" model ignores edge attributes, it is theoretically rotation-invariant. Experiment 3 aims at experimentally validating this claim. This is done by training models on unrotated images and testing on rotated ones. The MNIST-mixed set is used to this end.
Experiment 4: A sample filter is visualized on some MNIST example images.
Experiment 5: Graph-based methods are tested on regular grids and on irregular graphs (75-superpixel RAG) to test sensitivity to domain changes.
As stated before, and because of technical limitations, experiments involving MNIST datasets focus on the first two MNIST classes (referred to as MNIST-2class).
4.5 Results
Results on MNIST-2class are listed in Table 4. Results include classification from both grid graphs and SLIC 75-superpixel graphs. This table shows results for each dataset using the classic CNN, MoNet [17] and our method.
Experiment 1: MNIST
On MNIST-2class, our model is within a 3% margin of the baselines.
Experiment 2: Neighbourhood size
Extending the neighbourhood size did not have any significant effect on performance (see Table 5).
Experiment 3: Rotation invariance
On MNIST-mixed, little to no performance loss was observed at test time for our method. This is especially visible in the grid graph results, where only the classic CNN and MoNet show a loss of about 10 percentage points. A simple explanation of how this invariance is obtained is that our graph convolution filters are non-oriented because edge attributes are ignored.
Representation  Dataset  CNN (Valid)  CNN (Test)  MoNet (Valid)  MoNet (Test)  Ours (Valid)  Ours (Test) 
grid  MNIST reduced  100 %  99.88 %  97.56 %  99.40 %  99.51 %  97.76 % 
grid  MNIST mixed  100 %  89.87 %  97.76 %  88.90 %  99.27 %  95.63 % 
75 superpixels  MNIST reduced  NA  NA  94.13 %  92.70 %  94.13 %  89.53 % 
75 superpixels  MNIST mixed  NA  NA  94.13 %  92.90 %  94.62 %  94.17 % 
Representation  1 hop (Valid)  1 hop (Test)  2 hops (Valid)  2 hops (Test) 
grid  99.02%  97.55%  98.04%  96.47% 
75 superpixels  97.55%  93.74%  96.82%  93.62% 
Experiment 4: Visualizing graph convolution on images
As additional experimental material, we tried to visualize the result of a handcrafted filter on images. As for Euclidean convolution, the most straightforward filter operation is edge detection. This is usually done with the Sobel operator, which computes the intensity gradient at each spatial point of the image.
A potential equivalent graph convolution filter is a 2-node graph whose two node attributes are chosen so that its nodes are matched respectively to the lowest and to the highest intensities of a neighbourhood. As a consequence, this filter finds the highest node attribute difference in every node neighbourhood, making it a sort of eager edge detection filter.
We applied this filter on grid graphs in order to visualize the output graph as an image (as the graph-to-image transformation is trivial). Figure 4 shows example applications of this filter on both original and rotated examples. This figure suggests rotation invariance.
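The behaviour of such a two-node filter can be reproduced with a small sketch. It rests on several assumptions that are not taken from the text: the matching score is a sum of products between filter and vertex attributes, the two filter attributes are -1 and +1, neighbourhoods are 1-hop on an 8-connected grid, and the assignment is found by brute force rather than by the bipartite solver.

```python
from itertools import permutations

def match_score(filter_attrs, neighbourhood_attrs):
    """Best sum of products over injective assignments of filter nodes
    to neighbourhood nodes (brute force, edges ignored)."""
    return max(sum(w * neighbourhood_attrs[j] for w, j in zip(filter_attrs, a))
               for a in permutations(range(len(neighbourhood_attrs)),
                                     len(filter_attrs)))

def convolve_grid(image, filter_attrs):
    """Apply the vertex-only graph convolution at every pixel of a grid."""
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            # 1-hop neighbourhood on an 8-connected grid, root included.
            patch = [image[rr][cc]
                     for rr in range(max(0, r - 1), min(h, r + 2))
                     for cc in range(max(0, c - 1), min(w, c + 2))]
            out[r][c] = match_score(filter_attrs, patch)
    return out

# Toy image with a vertical intensity step between columns 1 and 2.
image = [[0.0, 0.0, 1.0, 1.0],
         [0.0, 0.0, 1.0, 1.0],
         [0.0, 0.0, 1.0, 1.0]]
edge_filter = [-1.0, 1.0]   # hypothetical attributes of the 2-node filter
edges = convolve_grid(image, edge_filter)

# Rotating the input by 90 degrees rotates the response accordingly.
rotated = [list(col) for col in zip(*image[::-1])]
rotated_edges = convolve_grid(rotated, edge_filter)
```

Under these assumptions the response at each pixel is the difference between the highest and the lowest intensity of its neighbourhood, so it is non-zero only along the intensity step, whatever its orientation.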
Experiment 5: Testing graph convolution across domain
A particular concern with graph convolution operators is sensitivity to domain changes, i.e. the capacity to identify similarities on irregular graphs. Both tested graph convolutions show little performance loss between regular (grid) and irregular (75-superpixel RAG) results.
Training duration
As mentioned in Subsection 3.9, the complexity of the model makes experiments tedious to run. Epoch durations are given in Table 6.
Representation  CNN  MoNet  Ours 
grid  1s  1s  17min 29s 
75 superpixels  NA  1s  2min 42s 
5 Conclusion and perspectives
In this paper, a graph convolutional neural network layer is proposed and tested in a simplified form.
Our model performance is at state of the art level on simple tasks. It shows robustness with respect to graph domain changes.
Following improvements could highly benefit to performances and computational costs. The bipartite solver is not the most suitable choice for our use. Complexity seems to be too high for an efficient application. Using a less complex solver would allow the full model to be used in practice and applied to larger graphs. Using the edge information would probably enhance performances significantly. Moreover, it will probably help with solving more complex problems.
Another possible improvement concerns differentiation: the solver operator is not differentiable, so the gradient must be approximated by neglecting the contribution of the solver's intermediary states. Finding a differentiable solver would improve the trainability of the model.
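The idea of differentiating around the solver can be sketched as follows. This is a minimal illustration under assumptions that are not taken from the paper: filter nodes and neighbourhood nodes carry scalar attributes, the matching cost is the squared attribute difference, the assignment is found by brute force over permutations (standing in for the bipartite solver), and the layer output is the total matching cost. The function names are hypothetical. The gradient with respect to the filter attributes is computed with the assignment held fixed, so the dependence of the assignment on the filter is ignored.

```python
from itertools import permutations


def best_assignment(filter_attrs, node_attrs):
    """Brute-force stand-in for the solver (assumption, not the paper's solver).

    Maps each filter node to a distinct graph node so that the sum of squared
    attribute differences is minimal.
    """
    best, best_cost = None, float("inf")
    for perm in permutations(range(len(node_attrs)), len(filter_attrs)):
        cost = sum((w - node_attrs[j]) ** 2 for w, j in zip(filter_attrs, perm))
        if cost < best_cost:
            best, best_cost = perm, cost
    return best, best_cost


def output_and_approx_grad(filter_attrs, node_attrs):
    """Return the matching cost, its approximate gradient and the assignment.

    The assignment is treated as a constant during differentiation:
    d(cost)/d(w_i) = 2 * (w_i - x_match(i)).
    """
    assignment, cost = best_assignment(filter_attrs, node_attrs)
    grad = [2 * (w - node_attrs[j]) for w, j in zip(filter_attrs, assignment)]
    return cost, grad, assignment


# Two-node filter matched against a three-node neighbourhood.
cost, grad, assignment = output_and_approx_grad([0.0, 1.0], [0.8, 0.1, 0.4])
```

In this toy setting the assignment is piecewise constant in the filter attributes, so holding it fixed ignores only the points at which the assignment switches. A real solver with intermediary states may lose more information than this sketch suggests.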
Addressing these issues would not only improve the current degraded version of the model but also make it possible to implement the full model in a usable form. This model has the peculiarity of learning edge attributes as well as vertex attributes. It is, to our knowledge, the only graph convolution formulation that proposes to modify the spatiality of edge attributes.
Finally, investigating our downsampling layer would justify a whole study in itself. It would be interesting to study the quality of the downsampled graphs, as well as the effect of weighting edges according to vertex similarity.
6 Code
Code for running the model can be found at https://github.com/prafiny/graphconv
References
[1] (2012) SLIC superpixels compared to state-of-the-art superpixel methods. IEEE Transactions on Pattern Analysis and Machine Intelligence 34 (11), pp. 2274–2282.
[2] (2008) Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment 2008 (10), pp. 10008.
[3] (2017) Graph edit distance as a quadratic assignment problem. Pattern Recognition Letters 87, pp. 38–46.
[4] (2016) Geometric deep learning: going beyond euclidean data. CoRR abs/1611.08097.
[5] (2008) Graph classification on dissimilarity space embedding. In [6], p. 2.
[6] N. da Vitoria Lobo, T. Kasparis, F. Roli, J. T. Kwok, M. Georgiopoulos, G. C. Anagnostopoulos and M. Loog (Eds.) (2008) Structural, Syntactic, and Statistical Pattern Recognition, Joint IAPR International Workshop, SSPR & SPR 2008, Orlando, USA, December 4-6, 2008, Proceedings. Lecture Notes in Computer Science, Vol. 5342, Springer. ISBN 9783540896883.
[7] (2016) Convolutional neural networks on graphs with fast localized spectral filtering. CoRR abs/1606.09375.
[8] (2007) Weighted graph cuts without eigenvectors: a multilevel approach. IEEE Transactions on Pattern Analysis and Machine Intelligence 29 (11), pp. 1944–1957.
[9] (2012) Two new graphs kernels in chemoinformatics. Pattern Recogn. Lett. 33 (15), pp. 2038–2047.
[10] (2017) Inductive representation learning on large graphs. CoRR abs/1706.02216.
[11] (2016) Semi-supervised classification with graph convolutional networks. CoRR abs/1609.02907.
[12] (2007) An empirical evaluation of deep architectures on problems with many factors of variation. In Proceedings of the 24th International Conference on Machine Learning, pp. 473–480.
[13] (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324.
[14] (2009) An integer projected fixed point method for graph matching and map inference. In Proceedings Neural Information Processing Systems, pp. 1114–1122.
[15] (2014) GNCCP - graduated nonconvexity and concavity procedure. IEEE Trans. Pattern Anal. Mach. Intell. 36, pp. 1258–1267.
[16] (2013) Fuzzy multilevel graph embedding. Pattern Recognition 46 (2), pp. 551–565.
[17] (2016) Geometric deep learning on graphs and manifolds using mixture model CNNs. CoRR abs/1611.08402.
[18] (2007) Bridging the gap between graph edit distance and kernel machines. Series in Machine Perception and Artificial Intelligence, Vol. 68, World Scientific. ISBN 9789812708175.
[19] (2017) A note on learning algorithms for quadratic assignment with graph neural networks. CoRR abs/1706.07450.
[20] (2013) Structured representations in a content based image retrieval context. J. Visual Communication and Image Representation 24 (8), pp. 1252–1268.
[21] (2009) Approximate graph edit distance computation by means of bipartite graph matching. Image Vision Comput. 27 (7), pp. 950–959.
[22] (2015) Structural pattern recognition with graph edit distance - approximation algorithms and applications. Advances in Computer Vision and Pattern Recognition, Springer.
[23] (2003) Optimal cluster preserving embedding of nonmetric proximity data. IEEE Trans. Pattern Anal. Mach. Intell. 25 (12), pp. 1540–1551.
[24] (2015) Speeding up fast bipartite graph matching through a new cost matrix. International Journal of Pattern Recognition and Artificial Intelligence 29 (02), pp. 1550010.
[25] (2017) Graph Attention Networks. arXiv e-prints, arXiv:1710.10903.
[26] (2019) A comprehensive survey on graph neural networks. CoRR abs/1901.00596.
[27] (2018) Deep learning on graphs: A survey. CoRR abs/1812.04202.
[28] (2018) Graph neural networks: A review of methods and applications. CoRR abs/1812.08434.