Generalizing the Convolution Operator to Extend CNNs to Irregular Domains
Convolutional Neural Networks (CNNs) have become the state of the art in supervised vision tasks. Their convolutional filters are of paramount importance because they allow learning patterns regardless of their locations in the input images. For highly irregular domains, generalized convolution operators based on an underlying graph structure have been proposed. However, these operators do not exactly match standard ones on grid graphs, and they introduce unwanted additional invariance (e.g. with regard to rotations). We propose a novel approach to generalizing CNNs to irregular domains using weight sharing and graph-based operators. Through experiments, we show that these models behave like CNNs on regular domains and offer better performance than multilayer perceptrons on distorted ones.
CNNs are the state of the art for supervised learning on data defined on lattices. Contrary to classical multilayer perceptrons (MLPs), CNNs take advantage of the underlying structure of the inputs. When facing data defined on irregular domains, CNNs cannot always be directly applied, even when the data presents an exploitable underlying structure.
An example of such data is the spatio-temporal time series generated by a set of Internet of Things devices. These devices are typically datapoints irregularly spaced out in a Euclidean space. As a result, a graph can be defined in which the vertices are the devices and the edges connect neighbouring ones. Other examples include signals on graphs such as brain imaging, transport networks and bags of words. In all these examples, a signal can generally be seen as a real vector whose dimension is the order of the graph. As such, each vertex is associated with a coordinate of the signal, and the edge weights represent some association between the corresponding vertices.
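As an illustrative sketch (not from the paper), such a graph can be built by connecting devices that lie within a given radius of one another, with edge weights decaying with distance; the radius `r`, the kernel width `sigma` and the function name are assumptions made for this example:

```python
import numpy as np

def radius_graph(points, r=1.5, sigma=1.0):
    """Connect every pair of points closer than r; weight the edges with a
    Gaussian kernel of the distance (an assumed, common choice)."""
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    w = np.exp(-d ** 2 / (2 * sigma ** 2))
    # Zero out self-loops (d == 0) and pairs beyond the radius.
    return np.where((d > 0) & (d < r), w, 0.0)

# Three devices: two neighbours and one far-away outlier.
pts = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]])
A = radius_graph(pts)
```

A k-nearest-neighbour rule is an equally common alternative when the device density varies widely across the space.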
Disregarding the graph structure, MLPs can be applied to these datasets. We are interested in defining convolutional operators that are able to exploit the graph structure as well as a possible embedding Euclidean space. Our motivation is to reproduce, on irregular domains, the gain in performance that CNNs provide over MLPs.
In graph signal processing, extended convolution operators based on spectral graph theory have been proposed. These operators have been applied to deep learning and obtain performance similar to that of CNNs on regular domains, despite the fact that they differ from classical convolutions. However, for slightly distorted domains, the obtained operators become non-localized, thus failing to take the underlying structure into account.
We propose another approach that generalizes convolution operators by taking the underlying graph structure directly into account. Namely, we make sure that the proposed solution has properties inherent to convolutions: linearity, locality and kernel weight sharing. We use it as a substitute for a convolutional layer in a CNN and evaluate the resulting technique on comparative benchmarks, showing a significant improvement over an MLP. The obtained operator happens to be exactly the classical convolution when applied on regular domains.
The outline of the paper is as follows. Section 2 discusses related work. Section 3 introduces the proposed operator. Section 4 explains how to apply it to CNN-like structures. Section 5 contains the experiments, and Section 6 concludes the paper.
For graph-structured data, Bruna et al. have proposed an extension of CNNs using graph signal processing theory. Their convolution is defined in the spectral domain associated with the Laplacian matrix of the graph. As such, when the graph is a lattice, the construction is analogous to the regular convolution defined in the Fourier domain. The operation is defined through spectral multipliers obtained by smooth interpolation of a weight kernel, which the authors show ensures the localization property. In the same paper, they also define a construction that creates a multi-resolution structure of a graph, allowing it to support a deep learning architecture. Henaff et al. have extended Bruna's spectral network to large-scale classification tasks and have proposed both supervised and unsupervised methods to find an underlying graph structure when it is not already given.
However, when the graph is irregular, the localization property of their convolution is partially lost. As the spectral domain carries no notion of direction with respect to the graph domain, some rotation invariance is also introduced. Hence, the results in the graph domain do not resemble those of a convolution. In our case, we want to define a convolution that is supported locally. Moreover, we want every input to be definable on a different underlying graph structure, so that the learnt filters can be applied to data embedded in the same space regardless of which structure supported the convolution during training.
These properties are also retained by the convolution defined in the ShapeNet paper, which targets data living on non-Euclidean manifolds. That is, their construction does maintain locality and does allow reusing the learnt filters on other manifolds. However, it requires at least a manifold embedding of the data and a geodesic polar coordinate system. Although less specific, our proposed method is a strict generalization of CNNs, in the sense that CNNs are a special case of it.
In the case where the data is sparse, Graham has proposed a framework to implement a spatially sparse CNN efficiently. If the underlying graph structure is embedded into a Euclidean space, the convolution we propose in this paper can be seen as spatially sparse too, in the sense that the data has non-zero coordinates only on the vertices of the graph it is defined on. In our case, we want to define a convolution whose inputs can have values at any point of the embedding space, whereas in the regular case inputs can only have values on the vertices of an underlying grid.
3 Proposed convolution operator
Let us consider a set of points, each of which defines a set of neighbourhoods.
An entry of a dataset is a column vector that can be of any size, each dimension of which represents the value taken by the entry at a certain point. Such a point is said to be activated by the entry. A point can be associated with at most one dimension of the entry. The entry is said to be embedded in the set of points.
We say that two entries are homogeneous if they have the same size and if their dimensions are always associated with the same points.
3.2 Formal description of the generalized convolution
Let us denote the generalized convolution operator we want to define. We want it to satisfy the following conditions:

- Linearity
- Locality
- Kernel weight sharing
As the operator must be linear, for any entry there is a matrix such that applying the operator amounts to multiplying the entry by this matrix. Unless the entries are all homogeneous, this matrix depends on the entry. For example, in the case of regular convolutions on images, it is a Toeplitz matrix and does not depend on the entry.
In order to meet the locality condition, we first want the coordinates of the output to have a local meaning. To this end, we impose that the output lives in the same space as the entry, and that the two are homogeneous. Secondly, for each activated point, we want the corresponding output coordinate to be a function only of the values taken by the entry at points contained in a certain neighbourhood of that point. It follows that the rows of the matrix are generally sparse.
Let us attribute to the operator a kernel of weights in the form of a row vector, and let us define the set of allocation matrices as the set of binary matrices that have at most one non-zero coordinate per column. As the operator must share its weights across the activated points, each row of its matrix is the product of the weight kernel with some allocation matrix. To maintain locality, a column of such an allocation matrix may have a non-zero coordinate if and only if the corresponding pair of activated points lies in a same neighbourhood.
Let us gather these allocation matrices into a block column vector. The generalized convolution is then defined by a weight kernel and an allocation map that maintains locality: applying the operator amounts to multiplying the entry by the tensor product of the weight kernel with this block of allocation matrices.
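As a hedged sketch of this construction (the names `w`, `alloc` and the neighbourhood rule are illustrative, not the paper's notation), each output coordinate is obtained by applying the shared kernel to the entry through one allocation matrix per activated point:

```python
import numpy as np

def generalized_conv(x, w, allocations):
    """x: entry vector (n,); w: shared kernel weights (k,);
    allocations: one (k, n) binary allocation matrix per output coordinate.
    Output coordinate i is computed as  w @ S_i @ x."""
    return np.array([w @ S @ x for S in allocations])

def alloc(i, n=4, k=2):
    """Allocation matrix routing kernel weight 0 to point i and weight 1 to
    its right neighbour; the row stays zero when the neighbour does not
    exist, which keeps the operator local."""
    S = np.zeros((k, n))
    S[0, i] = 1.0
    if i + 1 < n:
        S[1, i + 1] = 1.0
    return S

x = np.array([1.0, 2.0, 3.0, 4.0])   # entry on a 4-point chain
w = np.array([0.5, -1.0])            # shared kernel of weights
y = generalized_conv(x, w, [alloc(i) for i in range(4)])
```

On this 1-D chain, the zero column for the missing right neighbour of the last point acts like zero-padding at the border.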
3.3 The underlying graph that supports the generalized convolution
Taken as vertices, the set of activated points of an entry defines a complete oriented graph. If all the entries are homogeneous, this graph is said to be static. Otherwise, there can be a different graph per entry, so that the kernel of weights can be re-used for new entries defined on different graphs.
Suppose we are given a map that assigns a neighbourhood to each activated point. We then define the subgraph of the complete graph that contains an edge if and only if its destination vertex lies in the neighbourhood of its origin vertex. Let the allocation map be such that a column of an allocation matrix has a non-zero coordinate if and only if the corresponding pair of vertices is an edge of this subgraph.
Then the generalized convolution defined by this kernel and allocation map is supported by the underlying graph, in the sense that the matrix of the operator is a weighted adjacency matrix of this graph. Note that the underlying graph of a regular convolution is a lattice.
Also note that the family of neighbourhoods can be seen as the local receptive fields of the generalized convolution, and that the allocation map can be seen as distributing the kernel weights into each of them.
The generalized convolution has been defined here as an operation between a weight kernel and a vector. With respect to a third-rank tensor encoding the graph, such an operator could also have been defined as a bilinear operator between two graph signals.
Note that the underlying graph depends on the entry, so the learnt filter is reusable regardless of the entry's underlying graph structure. This is not the case for convolutions defined on fixed graphs.
3.4 Example of a generalized convolution shaped by a rectangular window
Let the embedding space be a two-dimensional Euclidean space. Let us consider a generalized convolution shaped by a rectangular window. We suppose that its weight kernel is rectangular, and that the width and height of the window are multiples of a given unit scale.
Let the underlying subgraph be such that it contains an edge between two vertices if and only if the second falls inside the window centred on the first. In other terms, the subgraph connects each vertex to every vertex contained in the window when it is centred on that vertex.
Then, we define the generalized convolution as being supported by this subgraph. As such, its adjacency matrix acts as the convolution operator. At this point, we still need to affect the kernel weights to its non-zero coordinates, via the edges of the subgraph. This amounts to defining the allocation map explicitly. To this end, let us consider the grid of the same size as the window, which breaks it down into squares whose side length is the unit scale, and let us associate a different weight with each square. Then, to each edge, we affect the weight associated with the square within which the destination vertex falls when the grid is centred on the origin vertex. This procedure allows the weights to be shared across the edges. It is illustrated in Figure ?.
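A minimal sketch of this allocation procedure, assuming a window of `kw` by `kh` unit squares of side `r` (all names are illustrative): each edge is mapped to the index of the square its destination falls in, relative to its origin.

```python
import numpy as np

def allocate_weights(points, kw=3, kh=3, r=1.0):
    """Map each edge (i, j) to a kernel-weight index in [0, kw*kh): the
    index of the unit square the offset of point j falls in when a
    (kw*r) x (kh*r) grid is centred on point i."""
    n = len(points)
    half_w, half_h = kw * r / 2.0, kh * r / 2.0
    edges = {}
    for i in range(n):
        for j in range(n):
            dx, dy = points[j] - points[i]
            if abs(dx) < half_w and abs(dy) < half_h:
                col = int((dx + half_w) // r)  # column of the grid square
                row = int((dy + half_h) // r)  # row of the grid square
                edges[(i, j)] = row * kw + col
    return edges

# Three points on a unit grid: self-loops get the central weight (index 4).
pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
alloc_map = allocate_weights(pts)
```

Edges whose destination falls outside the window are simply absent from the map, which is what keeps the operator local.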
Note that if the entries are homogeneous and the activated points are vertices of a regular grid, then the matrix of the operator, independent of the entry, is a Toeplitz matrix which acts as a regular convolution operator on the entries. In this case, the generalized convolution is just a regular convolution. For example, this is the case in Section 3.5.
3.5 Link with the standard convolution on image datasets
Let us consider an image dataset. Its entries are homogeneous and their dimensions represent the value at each pixel. In this case, we can choose a two-dimensional embedding space such that each pixel is located at integer coordinates. Hence, the pixels lie on a regular grid and thus are spaced out by a constant unit distance.
Let us consider the static underlying graph and the generalized convolution shaped by a rectangular window, as defined in the previous section. Applying the same weight allocation strategy then leads to affecting every weight of the kernel within the moving window. Except on the borders, one and only one pixel falls into each square of the moving grid at each position, as depicted in Figure ?. Indeed, the window behaves exactly like the moving window of a standard convolution, except that it considers the images to be padded with zeroes on the borders.
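On a regular pixel grid this construction reduces to an ordinary zero-padded cross-correlation, which the following self-contained sketch (illustrative names, fixed 3x3 kernel) makes explicit:

```python
import numpy as np

def grid_generalized_conv(img, w):
    """Generalized convolution on a regular pixel grid with a 3x3 rectangular
    window. Out-of-image neighbours contribute zero, matching the
    zero-padding behaviour of the moving window on the borders."""
    H, W = img.shape
    out = np.zeros_like(img, dtype=float)
    for y in range(H):
        for x in range(W):
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < H and 0 <= nx < W:
                        out[y, x] += w[dy + 1, dx + 1] * img[ny, nx]
    return out

img = np.array([[1.0, 2.0], [3.0, 4.0]])
out = grid_generalized_conv(img, np.ones((3, 3)))
```

With an all-ones kernel on this 2x2 image, every output pixel sums the whole image, since the window covers it entirely from every position.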
4 Application to CNNs
4.1 Neural network interpretation
Let us consider two layers of neurons, such that forward propagation is defined from the first to the second. Let us define such layers as sets of neurons located in the embedding space. These layers must contain as many neurons as there are points that can be activated, so that each neuron corresponds to one point. As such, we will abusively use the term neuron instead of point.
The generalized convolution between these two layers can be interpreted as follows. An entry activates the same neurons in each layer. Then, a convolution shape takes positions on the input layer, each position being associated with one of its activated neurons. At each position, connections are drawn from the activated neurons located inside the convolution shape toward the associated neuron of the output layer. A subset of the kernel weights is affected to these connections, according to a weight sharing strategy defined by an allocation map. Figure 1 illustrates a convolution shaped by a rectangular window.
The forward and backward propagations between the two layers are applied using the described neurons and connections. After a generalized convolution operation, an activation function is applied on the output neurons. Then a pooling operation is done spatially: the input layer is divided into patches of the same size, and all activated neurons included in a patch are pooled together. Unlike in a standard pooling operation, the number of activated neurons in a patch may vary.
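The spatial pooling over patches of varying population can be sketched as follows (function and parameter names are assumptions): positions are binned into squares of a fixed side length, and all activated neurons that land in the same square are max-pooled together.

```python
import numpy as np

def irregular_max_pool(values, positions, patch=2.0):
    """Spatial max pooling for irregular layers: neurons whose positions
    fall in the same (patch x patch) square are pooled together, however
    many there are (the count per patch may vary, unlike standard pooling)."""
    keys = np.floor(np.asarray(positions) / patch).astype(int)
    pooled = {}
    for v, k in zip(values, map(tuple, keys)):
        pooled[k] = max(pooled.get(k, -np.inf), float(v))
    return pooled

# Two neurons share the (0, 0) patch; the third sits alone in (1, 0).
vals = [1.0, 5.0, 2.0]
pos = [[0.2, 0.3], [1.5, 0.8], [3.1, 0.4]]
pooled = irregular_max_pool(vals, pos)
```

An empty patch simply produces no output key, which is one natural way to handle regions with no activated neuron.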
Generalized convolution layers can be vectorized. They can have multiple input channels and multiple feature maps, and they can naturally be placed into the same kind of deep neural network structure as in a CNN. Thus, they are for irregular input spaces what standard convolution layers are for regular input spaces.
There are two main strategies to implement the propagations. The first one is to start from (1), derive it and vectorize it. It implies handling semi-sparse representations to minimize memory consumption and using adequate semi-sparse tensor products.
Instead, we decide to use the neural network interpretation and the underlying graph structure, whose edges account for the neuron connections. By this means, the sparse part of the computations is handled through the graph. Moreover, computations on each edge can be parallelized.
4.3 Forward and back propagation formulae
Let’s first recall the propagation formulae from a neural network point of view.
Let us denote the value of a neuron of the input layer, the value of a neuron of the output layer, and its value after the activation function is applied. We also consider, for each neuron, the set of neurons from the previous layer connected to it and the set of neurons from the next layer connected to it. A weight is affected to the connection between two connected neurons, and a bias term is associated with each output neuron.
After the forward propagation, the value of each neuron of the output layer is determined by those of the input layer: it is the bias term plus the weighted sum of the values of the connected neurons from the previous layer.
Thanks to the chain rule, we can express the derivatives with respect to a layer in terms of those with respect to the next layer.
For each kernel weight, we call its edge set the set of edges to which that weight is affected. For an edge of this set, we consider the value of its destination neuron and that of its origin neuron.
The back propagation then expresses the derivative with respect to any kernel weight as a sum over its edge set, each term combining the derivative at the destination neuron with the value at the origin neuron.
The sets of connected neurons and the edge sets are determined by the graph structure, which in turn is determined beforehand by a procedure like the one described in Section 3.4. The particularization of the propagation formulae with these sets is the main difference with the standard formulae.
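A minimal sketch of these propagation rules, written directly on the edge list of the underlying graph (the names `edges`, `forward` and `weight_grad` are illustrative): the forward pass accumulates weighted contributions along each edge, and the weight gradient sums, over the edge set sharing a kernel weight, the output derivative at the destination times the value at the origin.

```python
import numpy as np

def forward(x, w, b, edges, n_out):
    """x: values of the previous layer; w: kernel weights; b: bias;
    edges: {(origin, dest): kernel-weight index}. Each output neuron gets
    the bias plus the weighted sum over its incoming edges."""
    a = np.full(n_out, b, dtype=float)
    for (i, j), k in edges.items():
        a[j] += w[k] * x[i]
    return a

def weight_grad(x, delta, w, edges):
    """delta: derivative of the loss at each output neuron. The gradient of
    a shared kernel weight sums delta[dest] * x[origin] over its edge set."""
    g = np.zeros_like(w)
    for (i, j), k in edges.items():
        g[k] += delta[j] * x[i]
    return g

x = np.array([1.0, 2.0])
w = np.array([3.0, 4.0])
edges = {(0, 0): 0, (1, 0): 1, (1, 1): 0}  # weight 0 is shared by two edges
a = forward(x, w, 0.0, edges, n_out=2)
g = weight_grad(x, np.array([1.0, 1.0]), w, edges)
```

Since the forward pass is linear in the weights, the gradient can be checked against a finite difference of the summed outputs.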
Computations are done per batch of entries. Hence, the graph structure used for the computations must contain the weighted edges of all the entries of the batch. If necessary, the entries of the batch are made homogeneous: if a neuron is not activated by an entry but is activated by another entry of the batch, then its value for the former entry is defined and set to zero.
The neuron values thus have third-rank tensor counterparts, whose third dimension indexes the channels (input channels or feature maps). Their submatrix along the neuron located at a given point has rows indexing entries and columns indexing channels. The weight kernel and the bias have counterparts as well, the first being a third-rank tensor and the second a vector with one value per feature map; the bias is broadcast along the two other dimensions. The submatrix of the kernel tensor along a given kernel weight has rows indexing the feature maps and columns indexing the input channels.
With these notations, the convolution formula (1) can be rewritten using two tensor products: one contracted along the dimensions that index the kernel weights, and one contracted along the dimensions that index the points present in the entries.
The vectorized counterparts of the formulae from Section 4.3 can be obtained in the same way; they involve the Hadamard product and matrix transposition.
5 Experiments
In order to measure the gain in performance allowed by the generalized CNN over an MLP on irregular domains, we ran a series of benchmarks on distorted versions of the MNIST dataset, consisting of images of 28x28 pixels. To distort the input domain, we embedded the images into a two-dimensional Euclidean space by giving integer coordinates to the pixels. Then, we applied a Gaussian displacement to each pixel, thus making the data irregular and unsuitable for regular convolutions. For multiple values of the standard deviation of the displacement, we trained a generalized CNN and compared it with an MLP having the same number of parameters. We chose a simple yet standard architecture in order to better see the impact of the generalized layers.
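The domain distortion can be sketched as follows (a hypothetical reconstruction, with an assumed seed and sigma): the 28x28 pixels receive integer coordinates, then an i.i.d. Gaussian displacement is added to each coordinate.

```python
import numpy as np

def distort_grid(h=28, w=28, sigma=0.3, seed=0):
    """Give each pixel integer coordinates, then displace every pixel by an
    i.i.d. Gaussian of standard deviation sigma (seed assumed for the demo).
    Returns the regular grid and its distorted version."""
    rng = np.random.default_rng(seed)
    ys, xs = np.mgrid[0:h, 0:w]
    base = np.stack([xs, ys], axis=-1).reshape(-1, 2).astype(float)
    return base, base + rng.normal(0.0, sigma, size=base.shape)

base, pts = distort_grid()
```

Increasing `sigma` moves the point set further from the lattice on which a standard convolution is defined, which is the variable swept in the benchmark.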
The architecture used is the following: a generalized convolution layer with ReLU and max pooling, made of 20 feature maps, followed by a dense layer and a softmax output layer. The generalized convolution is shaped by a rectangular window whose unit scale is chosen to be equal to the original distance between two pixels. The max pooling is done with square patches. The dense layer is composed of 500 hidden units and is followed by a ReLU activation as well. In order to have the same number of parameters, the compared MLP has two dense layers of 500 hidden units each, followed by the same output layer. For training, we used stochastic gradient descent with Nesterov momentum and a bit of L2 regularization.
The plot in Figure 2 illustrates the gain in performance of a generalized convolution layer over a dense layer with an equal number of parameters. After 15 epochs for both, the generalized CNN on a distorted domain performs better than the MLP. Indeed, in the case of no domain distortion, the score is the same as that of a CNN with zero-padding. The error rate increases slightly with the distortion, but even at the largest standard deviation tested, the generalized CNN still performs better than the MLP.
6 Conclusion and future work
In this paper, we have defined a generalized convolution operator. This operator makes it possible to transport the CNN paradigm to irregular domains. It retains the properties of a regular convolution operator: namely, it is linear, supported locally, and uses the same kernel of weights for each local operation. The generalized convolution operator can then naturally be used instead of convolutional layers in a deep learning framework. Typically, the created model is well suited for input data that has an underlying graph structure.
The definition of this operator is flexible, as it allows adapting its weight-allocation map to any input domain, so that depending on the case, the distribution of the kernel weights can be done in a way that is natural for this domain. However, in some cases there is no single natural way but multiple acceptable methods to define the weight allocation. In future work, we plan to study these methods. We also plan to apply the generalized operator to unsupervised learning tasks.
I would like to thank my academic mentors, Vincent Gripon and Grégoire Mercier who helped me in this work, as well as my industrial mentor, Mathias Herberts who gave me insights in view of applying the designed model to industrial datasets.
This work was partly funded by Cityzen Data, the company behind the Warp10 platform, and by the ANRT (Agence Nationale de la Recherche et de la Technologie) through a CIFRE (Convention Industrielle de Formation par la REcherche), and also by the European Research Council under the European Union’s Seventh Framework Program (FP7/2007-2013) / ERC grant agreement number 290901.
- Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
- A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, pp. 1097–1105, 2012.
- K. Hornik, M. Stinchcombe, and H. White, “Multilayer feedforward networks are universal approximators,” Neural networks, vol. 2, no. 5, pp. 359–366, 1989.
- F. R. K. Chung, Spectral Graph Theory (CBMS Regional Conference Series in Mathematics, No. 92), American Mathematical Society, 1996.
- D. I. Shuman, S. K. Narang, P. Frossard, A. Ortega, and P. Vandergheynst, “The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains,” IEEE Signal Processing Magazine, vol. 30, no. EPFL-ARTICLE-189192, pp. 83–98, 2013.
- J. Bruna, W. Zaremba, A. Szlam, and Y. LeCun, “Spectral networks and locally connected networks on graphs,” arXiv preprint arXiv:1312.6203, 2013.
- M. Henaff, J. Bruna, and Y. LeCun, “Deep convolutional networks on graph-structured data,” arXiv preprint arXiv:1506.05163, 2015.
- J. Masci, D. Boscaini, M. Bronstein, and P. Vandergheynst, “Shapenet: Convolutional neural networks on non-euclidean manifolds,” tech. rep., 2015.
- B. Graham, “Spatially-sparse convolutional neural networks,” arXiv preprint arXiv:1409.6070, 2014.
- Y. LeCun, C. Cortes, and C. J. Burges, “The mnist database of handwritten digits,” 1998.
- X. Glorot, A. Bordes, and Y. Bengio, “Deep sparse rectifier neural networks,” in International Conference on Artificial Intelligence and Statistics, pp. 315–323, 2011.
- L. Bottou, “Large-scale machine learning with stochastic gradient descent,” in Proceedings of COMPSTAT’2010, pp. 177–186, Springer, 2010.
- I. Sutskever, J. Martens, G. Dahl, and G. Hinton, “On the importance of initialization and momentum in deep learning,” in Proceedings of the 30th international conference on machine learning (ICML-13), pp. 1139–1147, 2013.
- A. Y. Ng, “Feature selection, l 1 vs. l 2 regularization, and rotational invariance,” in Proceedings of the twenty-first international conference on Machine learning, p. 78, ACM, 2004.