Dynamic Edge-Conditioned Filters in Convolutional Neural Networks on Graphs
Abstract
A number of problems can be formulated as prediction on graph-structured data. In this work, we generalize the convolution operator from regular grids to arbitrary graphs while avoiding the spectral domain, which allows us to handle graphs of varying size and connectivity. To move beyond a simple diffusion, filter weights are conditioned on the specific edge labels in the neighborhood of a vertex. Together with the proper choice of graph coarsening, we explore constructing deep neural networks for graph classification. In particular, we demonstrate the generality of our formulation in point cloud classification, where we set a new state of the art, and on a graph classification dataset, where we outperform other deep learning approaches. The source code is available at https://github.com/mys007/ecc.
1 Introduction
Convolutional Neural Networks (CNNs) have gained massive popularity in tasks where the underlying data representation has a grid structure, such as in speech processing and natural language understanding (1D, temporal convolutions), in image classification and segmentation (2D, spatial convolutions), or in video parsing (3D, volumetric convolutions) [22].
On the other hand, in many other tasks the data naturally lie on irregular or generally non-Euclidean domains, which can be structured as graphs in many cases. These include problems in 3D modeling, computational chemistry and biology, geospatial analysis, social networks, or natural language semantics and knowledge bases, to name a few. Assuming that the locality, stationarity, and compositionality principles of representation hold to at least some level in the data, it is meaningful to consider a hierarchical CNN-like architecture for processing it.
However, a generalization of CNNs from grids to general graphs is not straightforward and has recently become a topic of increased interest. We identify that the current formulations of graph convolution do not exploit edge labels, which results in an overly homogeneous view of local graph neighborhoods, with an effect similar to enforcing rotational invariance of filters in regular convolutions on images. Hence, in this work we propose a convolution operation which can make use of this information channel and show that it leads to an improved graph classification performance.
This novel formulation also opens up a broader range of applications; we concentrate here on point clouds specifically. Point clouds have been mostly ignored by deep learning so far, their voxelization being the only trend to the best of our knowledge [26, 19]. To offer a competitive alternative with a different set of advantages and disadvantages, we construct graphs in Euclidean space from point clouds in this work and demonstrate state-of-the-art performance on the Sydney dataset of LiDAR scans [9].
Our contributions are as follows:

We formulate a convolution-like operation on graph signals performed in the spatial domain where filter weights are conditioned on edge labels (discrete or continuous) and dynamically generated for each specific input sample. Our networks work on graphs with arbitrary varying structure throughout a dataset.

We are the first to apply graph convolutions to point cloud classification. Our method outperforms volumetric approaches and attains a new state-of-the-art performance on the Sydney dataset, with the benefit of preserving sparsity and presumably fine details.

We reach a competitive level of performance on the graph classification benchmark NCI1 [39], outperforming other approaches based on deep learning there.
2 Related Work
The first formulation of a convolutional network analogy for irregular domains modeled with graphs has been introduced by Bruna et al. [6], who looked into both the spatial and the spectral domain of representation for performing localized filtering.
Spectral Methods.
A mathematically sound definition of the convolution operator makes use of the spectral analysis theory, where it corresponds to multiplication of the signal on vertices transformed into the spectral domain by graph Fourier transform. The spatial locality of filters is then given by smoothness of the spectral filters, in case of [6] modeled as B-splines. The transform involves very expensive multiplications with the eigenvector matrix. However, by a parameterization of filters as Chebyshev polynomials of eigenvalues and their approximate evaluation, computationally efficient and localized filtering has been recently achieved by Defferrard et al. [11]. Nevertheless, the filters are still learned in the context of the spectrum of graph Laplacian, which therefore has to be the same for all graphs in a dataset. This means that the graph structure is fixed and only the signal defined on the vertices may differ. This precludes applications on problems where the graph structure varies in the dataset, such as meshes, point clouds, or diverse biochemical datasets.
To cover these important cases, we formulate our filtering approach in the spatial domain, where the limited complexity of evaluation and the localization property is provided by construction. The main challenge here is dealing with weight sharing among local neighborhoods [6], as the number of vertices adjacent to a particular vertex varies and their ordering is often not well definable.
Spatial Methods.
Bruna et al. [6] assumed fixed graph structure and did not share any weights among neighborhoods. Several works have independently dealt with this problem. Duvenaud et al. [14] sum the signal over neighboring vertices followed by a weight matrix multiplication, effectively sharing the same weights among all edges. Atwood and Towsley [2] share weights based on the number of hops between two vertices. Kipf and Welling [21] further approximate the spectral method of [11] and weaken the dependency on the Laplacian, but ultimately arrive at center-surround weighting of neighborhoods. None of these methods captures finer structure of the neighborhood and thus does not generalize the standard convolution on grids. In contrast, our method can make use of possible edge labels and is shown to generalize regular convolution (Section 3.2).
The approach of Niepert et al. [28] introduces a heuristic for linearizing selected graph neighborhoods so that a conventional 1D CNN can be used. We share their goal of capturing structure in neighborhoods but approach it in a different way. Finally, Graph neural networks [34, 24] propagate features across a graph until (near) convergence and exploit edge labels as one of the sources of information as we do. However, their system is quite different from the current multilayer feedforward architectures, making the reuse of today’s common building blocks not straightforward.
CNNs on Point Clouds and Meshes.
There has been little work on deep learning on point clouds or meshes. Masci et al. [25] define convolution over patch descriptors around every vertex of a 3D mesh using geodesic distances, formulated in a deep learning architecture. The only way of processing point clouds using deep learning has been to first voxelize them before feeding them to a 3D CNN, be it for classification [26] or segmentation [19] purposes. Instead, we regard point clouds as graphs in Euclidean space in this work.
3 Method
We propose a method for performing convolutions over local graph neighborhoods exploiting edge labels (Section 3.1) and show it to generalize regular convolutions (Section 3.2). Afterwards, we present deep networks with our convolution operator (Section 3.3) in the case of point clouds (Section 3.4) and general graphs (Section 3.5).
3.1 Edge-Conditioned Convolution
Let us consider a directed or undirected graph $G = (V, E)$, where $V$ is a finite set of vertices with $|V| = n$ and $E \subseteq V \times V$ is a set of edges with $|E| = m$. Let $l \in \{0, \dots, l_{max}\}$ be the layer index in a feed-forward neural network. We assume the graph is both vertex- and edge-labeled, i.e. there exists a function $X^l: V \mapsto \mathbb{R}^{d_l}$ assigning labels (also called signals or features) to each vertex and $L: E \mapsto \mathbb{R}^{s}$ assigning labels (also called attributes) to each edge. These functions can be regarded as matrices $X^l \in \mathbb{R}^{n \times d_l}$ and $L \in \mathbb{R}^{m \times s}$, $X^0$ then being the input signal. A neighborhood $N(i) = \{j; (j, i) \in E\} \cup \{i\}$ of vertex $i$ is defined to contain all adjacent vertices (predecessors in directed graphs) including $i$ itself (self-loop).
Our approach computes the filtered signal $X^l(i) \in \mathbb{R}^{d_l}$ at vertex $i$ as a weighted sum of signals $X^{l-1}(j) \in \mathbb{R}^{d_{l-1}}$ in its neighborhood, $j \in N(i)$. While such a commutative aggregation solves the problem of undefined vertex ordering and varying neighborhood sizes, it also smooths out any structural information. To retain it, we propose to condition each filtering weight on the respective edge label. To this end, we borrow the idea from Dynamic filter networks [5] and define a filter-generating network $F^l: \mathbb{R}^{s} \mapsto \mathbb{R}^{d_l \times d_{l-1}}$ which given edge label $L(j, i)$ outputs edge-specific weight matrix $\Theta^l_{ji} \in \mathbb{R}^{d_l \times d_{l-1}}$, see Figure 1.
The convolution operation, coined EdgeConditioned Convolution (ECC), is formalized as follows:
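$$X^l(i) = \frac{1}{|N(i)|} \sum_{j \in N(i)} F^l(L(j, i); w^l)\, X^{l-1}(j) + b^l = \frac{1}{|N(i)|} \sum_{j \in N(i)} \Theta^l_{ji}\, X^{l-1}(j) + b^l \tag{1}$$
where $b^l \in \mathbb{R}^{d_l}$ is a learnable bias and $F^l$ is parameterized by learnable network weights $w^l$. For clarity, $w^l$ and $b^l$ are model parameters updated only during training and $\Theta^l_{ji}$ are dynamically generated parameters for an edge label $L(j, i)$ in a particular input graph. The filter-generating network $F^l$ can be implemented with any differentiable architecture; we use multi-layer perceptrons in our applications.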
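The operation can be illustrated with a minimal pure-Python sketch of a single ECC layer. It is not the released implementation: the helper names (`matvec`, `ecc_layer`, `toy_filter_net`) and the data layout (edges as `(j, i)` pairs, labels in a dictionary) are assumptions made for illustration, and the toy filter network is a hypothetical stand-in for a learned multi-layer perceptron.

```python
def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def ecc_layer(X_prev, edges, labels, filter_net, bias):
    """One edge-conditioned convolution over all vertices.

    X_prev     : list of input feature vectors, one per vertex
    edges      : directed edges (j, i), self-loops (i, i) included
    labels     : dict mapping (j, i) to the edge label vector
    filter_net : callable mapping a label to a d_out x d_in weight matrix
    bias       : list of length d_out
    """
    n, d_out = len(X_prev), len(bias)
    sums = [[0.0] * d_out for _ in range(n)]
    counts = [0] * n
    for (j, i) in edges:
        theta = filter_net(labels[(j, i)])   # edge-specific weights
        message = matvec(theta, X_prev[j])
        sums[i] = [a + b for a, b in zip(sums[i], message)]
        counts[i] += 1
    # Normalize by the neighborhood size and add the bias.
    # max(..., 1) only guards against a missing self-loop (assumption).
    return [[s / max(counts[i], 1) + b for s, b in zip(sums[i], bias)]
            for i in range(n)]


# Toy usage: scalar features and a hypothetical one-parameter filter network.
toy_filter_net = lambda label: [[label[0]]]
edges = [(0, 0), (1, 1), (1, 0)]
labels = {(0, 0): [1.0], (1, 1): [1.0], (1, 0): [2.0]}
out = ecc_layer([[1.0], [3.0]], edges, labels, toy_filter_net, bias=[0.5])
```

Vertex 0 aggregates its self-loop and the edge from vertex 1 and divides by a neighborhood size of two, while vertex 1 only sees its self-loop.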
Complexity.
Computing $X^l$ for all vertices requires at most $m$ evaluations of $F^l$ and $m$ or $2m$ matrix-vector multiplications for directed, resp. undirected graphs. (Footnote 1: If edge labels are represented by $s$ discrete values in a particular graph and $s < m$, $F^l$ can be evaluated only $s$ times.) Both operations can be carried out efficiently on the GPU in batch-mode.
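The saving available for discrete edge labels can be sketched as a simple cache keyed by label. The function `generate_filters` and the counting helper are hypothetical names used only for this illustration.

```python
def generate_filters(labels, filter_net):
    """Evaluate filter_net once per distinct edge label and share the result."""
    cache = {}
    thetas = {}
    for edge, label in labels.items():
        key = tuple(label)
        if key not in cache:
            cache[key] = filter_net(key)
        thetas[edge] = cache[key]
    return thetas, len(cache)


calls = []


def counting_net(label):
    """Hypothetical filter network that records how often it is evaluated."""
    calls.append(label)
    return [[float(sum(label))]]


# Four directed edges but only two distinct (one-hot) labels.
labels = {(0, 1): (1, 0), (1, 0): (1, 0), (1, 2): (0, 1), (2, 1): (0, 1)}
thetas, n_evals = generate_filters(labels, counting_net)
```

Here the network is evaluated twice rather than four times, and edges with equal labels share one weight matrix.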
3.2 Relationship to Existing Formulations
Our formulation of convolution on graph neighborhoods retains the key properties of the standard convolution on regular grids that are useful in the context of CNNs: weight sharing and locality.
The weights in ECC are tied by edge label, which is in contrast to tying them by hop distance from a vertex [2], according to a neighborhood linearization heuristic [28], by being the central vertex or not [21], indiscriminately [14], or not at all [6].
In fact, our definition reduces to that of Duvenaud et al. [14] (up to scaling) in the case of uninformative edge labels: $\sum_{j \in N(i)} \Theta^l_{ji} X^{l-1}(j) = \Theta^l \sum_{j \in N(i)} X^{l-1}(j)$ if $\Theta^l_{ji} = \Theta^l \; \forall (j, i) \in E$.
More importantly, the standard discrete convolution on grids is a special case of ECC, which we demonstrate in 1D for clarity. Consider an ordered set of vertices $V$ forming a path graph (chain). To obtain convolution with a centered kernel of size $s$, we form $E$ so that each vertex is connected to its $s$ spatially nearest neighbors including self by a directed edge labeled with one-hot encoding of the neighbor's discrete offset $\delta$, see Figure 2. Taking $F^l$ as a single-layer perceptron without bias, we have $F^l(L(j, i); w^l) = w^l(\delta)$, where $w^l(\delta)$ denotes the respective reshaped column of the parameter matrix $w^l \in \mathbb{R}^{(d_l \times d_{l-1}) \times s}$. With a slight abuse of notation, we arrive at the equivalence to the standard convolution: $X^l(i) = \sum_{j \in N(i)} F^l(L(j, i); w^l) X^{l-1}(j) = \sum_{\delta} w^l(\delta) X^{l-1}(i - \delta)$, ignoring the normalization factor of $1/|N(i)|$ playing a role only at grid boundaries.
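The 1D equivalence can be checked numerically with a small sketch. It assumes scalar features, omits the normalization factor as in the text, and uses hypothetical helper names (`conv1d`, `ecc_1d`); the kernel is a dictionary from offset to weight.

```python
def conv1d(x, w):
    """Standard convolution y[i] = sum_d w[d] * x[i - d], skipping out-of-range terms."""
    n = len(x)
    return [sum(w[d] * x[i - d] for d in w if 0 <= i - d < n) for i in range(n)]


def ecc_1d(x, w):
    """The same result obtained through a path graph with one-hot edge labels."""
    offsets = sorted(w)
    n = len(x)
    # Directed edge (j, i) is labeled with the one-hot code of offset d = i - j.
    edges = {}
    for i in range(n):
        for d in offsets:
            j = i - d
            if 0 <= j < n:
                edges[(j, i)] = [1.0 if d == o else 0.0 for o in offsets]

    # Single-layer filter network without bias: the one-hot label selects a weight.
    def filter_net(label):
        return sum(l * w[o] for l, o in zip(label, offsets))

    y = [0.0] * n
    for (j, i), label in edges.items():
        y[i] += filter_net(label) * x[j]
    return y


x = [1.0, 2.0, 3.0, 4.0]
w = {-1: 0.5, 0: 1.0, 1: -1.0}
y_conv = conv1d(x, w)
y_ecc = ecc_1d(x, w)
```

Both functions return the same sequence, since the one-hot label simply indexes the kernel entry for the corresponding offset.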
3.3 Deep Networks with ECC
While ECC is in principle applicable to both vertex classification and graph classification tasks, in this paper we restrict ourselves only to the latter one, i.e. predicting a class for the whole input graph. Hence, we follow the common architectural pattern for feed-forward networks of interlaced convolutions and poolings topped by global pooling and fully-connected layers, see Figure 3 for an illustration. This way, information from the local neighborhoods gets combined over successive layers to gain context (enlarge receptive field). While edge labels are fixed for a particular graph, their (learned) interpretation by the means of filter-generating networks may change from layer to layer (weights of $F^l$ are not shared among layers). Therefore, the restriction of ECC to 1-hop neighborhoods is not a constraint, akin to using small 3×3 filters in normal CNNs in exchange for deeper networks, which is known to be beneficial [17].
We use batch normalization [20] after each convolution, which was necessary for the learning to converge. Interestingly, we had no success with other feature normalization techniques such as data-dependent initialization [27] or layer normalization [3].
Pooling.
While (non-strided) convolutional layers and all point-wise layers do not change the underlying graph and only evolve the signal on vertices, pooling layers are defined to output aggregated signal on the vertices of a new, coarsened graph. Therefore, a pyramid of $h_{max}$ progressively coarser graphs has to be constructed for each input graph. Let us extend here our notation with an additional superscript $h \in \{0, \dots, h_{max}\}$ to distinguish among different graphs $G^{(h)} = (V^{(h)}, E^{(h)})$ in the pyramid when necessary. Each $G^{(h)}$ has also its associated labels $L^{(h)}$ and signal $X^{(h),l}$. A coarsening typically consists of three steps: subsampling or merging vertices, creating the new edge structure $E^{(h)}$ and labeling $L^{(h)}$ (so-called reduction), and mapping the vertices in the original graph to those in the coarsened one with $M^{(h)}: V^{(h-1)} \mapsto V^{(h)}$. We use a different algorithm depending on whether we work with general graphs or graphs in Euclidean space, therefore we postpone discussing the details to the applications. Finally, the pooling layer with index $l_h$ aggregates $X^{(h-1), l_h - 1}$ into a lower dimensional $X^{(h), l_h}$ based on $M^{(h)}$. See Figure 3 for an example of using the introduced notation.
During coarsening, a small graph may be reduced to several disconnected vertices in its lower resolutions without problems as self-edges are always present. Since the architecture is designed to process graphs with variable $n$, $m$, we deal with varying vertex count in the lowest graph resolution by global average/max pooling.
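Pooling through a vertex map and the final global pooling can be sketched as follows. The representation of the map as a list of coarse-vertex indices and the function names are assumptions for illustration; the sketch further assumes that every coarse vertex receives at least one fine vertex.

```python
def max_pool(X, M, n_coarse):
    """Max-pool vertex signals X onto a coarser graph.

    M[i] is the index of the coarse vertex that fine vertex i maps to.
    """
    d = len(X[0])
    out = [[float("-inf")] * d for _ in range(n_coarse)]
    for i, feat in enumerate(X):
        out[M[i]] = [max(a, b) for a, b in zip(out[M[i]], feat)]
    return out


def global_avg_pool(X):
    """Average each feature channel over all vertices, whatever their count."""
    n = len(X)
    return [sum(col) / n for col in zip(*X)]


X_fine = [[1.0, 5.0], [3.0, 2.0], [0.0, 4.0]]
X_coarse = max_pool(X_fine, M=[0, 0, 1], n_coarse=2)
graph_feature = global_avg_pool(X_coarse)
```

Because the global pooling reduces over the vertex axis, the fully-connected layers that follow see a fixed-size vector regardless of how many vertices remain.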
3.4 Application in Point Clouds
Point clouds are an important 3D data modality arising from many acquisition techniques, such as laser scanning (LiDAR) or multi-view reconstruction. Due to their natural irregularity and sparsity, so far the only way of processing point clouds using deep learning has been to first voxelize them before feeding them to a 3D CNN, be it for classification [26] or segmentation [19] purposes. Such a dense representation is very hardware friendly and simple to handle with the current deep learning frameworks.
On the other hand, there are several disadvantages too. First, voxel representation tends to be much more expensive in terms of memory than usually sparse point clouds (we are not aware of any GPU implementation of convolutions on sparse tensors). Second, the necessity to fit them into a fixed size 3D grid brings about discretization artifacts and the loss of metric scale and possibly of details. With this work, we would like to offer a competitive alternative to the mainstream by performing deep learning on point clouds directly. As far as we know, we are the first to demonstrate such a result.
Graph Construction.
Given a point cloud $P$ with its point features $X_P$ (such as laser return intensity or color) we build a directed graph $G = (V, E)$ and set up its labels $X^0$ and $L$ as follows. First, we create vertex $i \in V$ for every point $p \in P$ and assign the respective signal to it by $X^0(i) = X_P(p)$ (or 0 if there are no features $X_P$). Then we connect each vertex $i$ to all vertices $j$ in its spatial neighborhood by a directed edge $(j, i)$. In our experiments with neighborhoods, fixed metric radius $\rho$ worked better than a fixed number of neighbors. The offset $\delta = p_j - p_i$ between the points corresponding to vertices $j$, $i$ is represented in Cartesian and spherical coordinates as 6D edge label vector $L(j, i) = (\delta_x, \delta_y, \delta_z, \|\delta\|, \arccos \delta_z / \|\delta\|, \arctan \delta_y / \delta_x)$.
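The construction can be sketched with a brute-force radius search. This is only an illustration under stated assumptions: the function names are hypothetical, `math.atan2` is swapped in for the plain arctangent to avoid division by zero, and the angles of the zero offset on self-loops are set to zero by convention.

```python
import math


def edge_label(p_j, p_i):
    """6D label of edge (j, i): Cartesian offset plus spherical coordinates."""
    dx, dy, dz = (a - b for a, b in zip(p_j, p_i))
    norm = math.sqrt(dx * dx + dy * dy + dz * dz)
    polar = math.acos(dz / norm) if norm > 0 else 0.0  # convention for self-loops
    azimuth = math.atan2(dy, dx)
    return (dx, dy, dz, norm, polar, azimuth)


def build_graph(points, rho):
    """Connect every vertex i to all vertices j within radius rho (O(n^2) search)."""
    edges = {}
    for i, p_i in enumerate(points):
        for j, p_j in enumerate(points):
            if math.dist(p_i, p_j) <= rho:
                edges[(j, i)] = edge_label(p_j, p_i)
    return edges


points = [(0, 0, 0), (1, 0, 0), (0, 0, 3)]
edges = build_graph(points, rho=1.5)
```

With the radius of 1.5 chosen here, the third point is connected only to itself, which is unproblematic because self-loops are always present. A spatial index such as a k-d tree would replace the quadratic search for realistic cloud sizes.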
Graph Coarsening.
For a single input point cloud $P$, a pyramid of downsampled point clouds $P^{(h)}$ is obtained by the VoxelGrid algorithm [31], which overlays a grid of resolution $r^{(h)}$ over the point cloud and replaces all points within a voxel with their centroid (and thus maintains subvoxel accuracy). Each of the resulting point clouds $P^{(h)}$ is then independently converted into a graph $G^{(h)}$ and labeling $L^{(h)}$ with neighborhood radius $\rho^{(h)}$ as described above. The pooling map $M^{(h)}$ is defined so that each point in $P^{(h-1)}$ is assigned to its spatially nearest point in the subsampled point cloud $P^{(h)}$.
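A minimal sketch of one coarsening step is given below. It mimics the centroid-per-voxel behavior described above rather than calling the actual VoxelGrid implementation, and the function names are hypothetical.

```python
import math
from collections import defaultdict


def voxel_grid(points, r):
    """Replace all points falling into the same voxel of size r by their centroid."""
    cells = defaultdict(list)
    for p in points:
        cells[tuple(math.floor(c / r) for c in p)].append(p)
    return [tuple(sum(c) / len(ps) for c in zip(*ps)) for ps in cells.values()]


def pooling_map(fine, coarse):
    """Assign each fine point to the index of its spatially nearest coarse point."""
    return [min(range(len(coarse)), key=lambda k: math.dist(p, coarse[k]))
            for p in fine]


points = [(0.25, 0.25, 0.0), (0.75, 0.25, 0.0), (1.5, 0.0, 0.0)]
coarse = voxel_grid(points, r=1.0)
M = pooling_map(points, coarse)
```

The first two points share a voxel and are merged into their centroid, and the resulting map sends both of them to the same coarse vertex.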
Data Augmentation.
In order to reduce overfitting on small datasets, we perform online data augmentation. In particular, we randomly rotate point clouds about their up-axis, jitter their scale, perform mirroring, or delete random points.
3.5 Application in General Graphs
Many problems can be modeled directly as graphs. In such cases the graph dataset is already given and only the appropriate graph coarsening scheme needs to be chosen. This is by no means trivial and there exists a large body of literature on this problem [32]. Without any concept of spatial localization of vertices, we resort to established graph coarsening algorithms and utilize the multiresolution framework of Shuman et al. [36, 29], which works by repeated downsampling and graph reduction of the input graph. The downsampling step is based on splitting the graph into two components by the sign of the largest eigenvector of the Laplacian. This is followed by Kron reduction [13], which also defines the new edge labeling, enhanced with spectral sparsification of edges [37]. Note that the algorithm regards graphs as unweighted for the purpose of coarsening.
This method is attractive for us for two reasons. Each downsampling step removes approximately half of the vertices, guaranteeing a certain level of pooling strength, and the sparsification step is randomized. The latter property is exploited as a useful data augmentation technique since several different graph pyramids can be generated from a single input graph. This is in spirit similar to the effect of fractional max-pooling [16]. We do not perform any other data augmentation.
4 Experiments
The proposed method is evaluated in point cloud classification (real-world data in Section 4.1 and synthetic in 4.2) and on a standard graph classification benchmark (Section 4.3). In addition, we validate our method and study its properties on MNIST (Section 4.4).
4.1 Sydney Urban Objects
This point cloud dataset [9] consists of 588 objects in 14 categories (vehicles, pedestrians, signs, and trees) manually extracted from 360° LiDAR scans, see Figure 4. It demonstrates non-ideal sensing conditions with occlusions (holes) and a large variability in viewpoint (single viewpoint). This makes object classification a challenging task.
Following the protocol employed by the dataset authors, we report the mean F1 score weighted by class frequency, as the dataset is imbalanced. This score is further aggregated over four standard training/testing splits.
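For reference, a class-frequency-weighted mean F1 score can be computed as in the sketch below. This is our reading of the metric, assuming per-class F1 scores weighted by the number of true samples of each class; the exact evaluation script of the dataset authors may differ in details.

```python
def weighted_f1(y_true, y_pred):
    """Mean of per-class F1 scores, weighted by the true class frequencies."""
    total = len(y_true)
    score = 0.0
    for c in sorted(set(y_true)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        score += f1 * (tp + fn) / total   # tp + fn is the class support
    return score


# Imbalanced toy example: three samples of class 0, one of class 1.
score = weighted_f1([0, 0, 0, 1], [0, 0, 1, 1])
```

In this example class 0 reaches an F1 of 0.8 and class 1 an F1 of 2/3, which the 3:1 class frequencies combine into roughly 0.767.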
Network Configuration.
Our ECC-network has 7 parametric layers and 4 levels of graph resolution. Its configuration can be described as C(16)-C(32)-MP(0.25,0.5)-C(32)-C(32)-MP(0.75,1.5)-C(64)-MP(1.5,1.5)-GAP-FC(64)-D(0.2)-FC(14), where C($c$) denotes ECC with $c$ output channels followed by affine batch normalization and ReLU activation, MP($r$,$\rho$) stands for max-pooling down to grid resolution of $r$ meters and neighborhood radius of $\rho$ meters, GAP is global average pooling, FC($c$) is fully-connected layer with $c$ output channels, and D($p$) is dropout with probability $p$. The filter-generating networks $F^l$ have configuration FC(16)-FC(32)-FC($d_l d_{l-1}$) with orthogonal weight initialization [33] and ReLUs in between. Input graphs are created with a grid resolution $r^{(0)}$ and neighborhood radius $\rho^{(0)}$ specified in meters to break overly dense point clusters. Networks are trained with SGD and cross-entropy loss for 250 epochs with batch size 32 and learning rate 0.1 step-wise decreasing after 200 and 245 epochs. Vertex signal $X^0$ is scalar laser return intensity (0-255), representing depth.
Results.
Table 1 compares our result (ECC, 78.4) against two methods based on volumetric CNNs evaluated on voxelized occupancy grids of size 32×32×32 (VoxNet [26] 73.0 and ORION [1] 77.8). We outperform both by a small margin and set a new state-of-the-art result on this dataset.
In the same table, we also study the dependence on convolution radii $\rho$: increasing them in all convolutional layers leads to a drop in performance, which would correspond to a preference of using smaller filters in regular CNNs. The average neighborhood size is roughly 10 vertices for our best-performing network. We hypothesize that larger radii smooth out the information in the central vertex. To investigate this, we increased the importance of the self-loop by adding an identity skip-connection (see Appendix E) and retrained the networks. We achieved 77.0, 79.5 (the new state of the art), and 77.4 mean F1 for ECC and its two enlarged-radius variants, respectively. Stronger identity connection allowed for successful integration of a larger context, up to some limit, which indeed suggests that information should be aggregated neither too much nor too little.
4.2 ModelNet
ModelNet [40] is a large scale collection of object meshes. We evaluate classification performance on its subsets ModelNet10 (3991/908 train/test examples in 10 categories) and ModelNet40 (9843/2468 train/test examples in 40 categories). Synthetic point clouds are created from meshes by uniformly sampling 1000 points on mesh faces according to face area (a simulation of acquisition from multiple viewpoints) and rescaled into a unit sphere.
Network Configuration.
Our ECC-network for ModelNet10 has 7 parametric layers and 3 levels of graph resolution with configuration C(16)-C(32)-MP(2.5/32,7.5/32)-C(32)-C(32)-MP(7.5/32,22.5/32)-C(64)-GMP-FC(64)-D(0.2)-FC(10), GMP being global max pooling. Other definitions and filter-generating networks are as in Section 4.1. Input graphs are created with a grid resolution $r^{(0)}$ and neighborhood radius $\rho^{(0)}$ expressed in units that mimic the typical grid resolution of $32^3$ in voxel-based methods. The network is trained with SGD and cross-entropy loss for 175 epochs with batch size 64 and learning rate 0.1 step-wise decreasing after every 50 epochs. There is no vertex signal, i.e. $X^0$ are zero. For ModelNet40, the network is wider (C(24), C(48), C(48), C(48), C(96), FC(64), FC(40)) and is trained for 100 epochs with learning rate decreasing after each 30 epochs.
Results.
Table 2 compares our result to several recent works, based either on volumetric [40, 26, 1, 30] or rendered image representation [38]. Test sets were expanded to include 12 orientations (ECC). We also evaluate voting over orientations (ECC 12 votes), which slightly improves the results likely due to the rotational variance of VoxelGrid algorithm. While not fully reaching the state of the art, we believe our method remains very competitive (90.8%, resp. 87.4% mean instance accuracy). For a fairer comparison, a leading volumetric method should be retrained on voxelized synthetic point clouds.
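Table 2: Classification accuracy (%) on ModelNet10 and ModelNet40; values in parentheses are mean instance accuracy.

| Model | ModelNet10 | ModelNet40 |
|---|---|---|
| 3DShapeNets [40] | 83.5 | 77.3 |
| MVCNN [38] | - | 90.1 |
| VoxNet [26] | 92 | 83 |
| ORION [1] | 93.8 | - |
| SubvolumeSup [30] | - | 86.0 (89.2) |
| ECC | 89.3 (90.0) | 82.4 (87.0) |
| ECC (12 votes) | 90.0 (90.8) | 83.2 (87.4) |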
4.3 Graph Classification
We evaluate on a graph classification benchmark frequently used in the community, consisting of five datasets: NCI1, NCI109, MUTAG, ENZYMES, and D&D. Their properties can be found in Table 3, indicating the variability in dataset sizes, in graph sizes, and in the availability of labels. Following [35], we perform 10-fold cross-validation with 9 folds for training and 1 for testing and report the average prediction accuracy.
NCI1 and NCI109 [39] consist of graph representations of chemical compounds screened for activity against non-small cell lung cancer and ovarian cancer cell lines, respectively. MUTAG [10] is a dataset of nitro compounds labeled according to whether or not they have a mutagenic effect on a bacterium. ENZYMES [4] contains representations of tertiary structure of 6 classes of enzymes. D&D [12] is a database of protein structures (vertices are amino acids, edges indicate spatial closeness) classified as enzymes and non-enzymes.
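The evaluation protocol can be sketched as follows. The deterministic, unshuffled fold assignment is an assumption made for brevity, the function names are hypothetical, and the actual splits follow [35].

```python
def k_fold_splits(n_samples, k=10):
    """Return k (train, test) index pairs: k - 1 folds for training, 1 for testing."""
    folds = [list(range(f, n_samples, k)) for f in range(k)]
    splits = []
    for f in range(k):
        train = [i for g in range(k) if g != f for i in folds[g]]
        splits.append((train, folds[f]))
    return splits


def mean_accuracy(fold_accuracies):
    """Average prediction accuracy over the folds."""
    return sum(fold_accuracies) / len(fold_accuracies)


splits = k_fold_splits(25)
```

Every sample appears in exactly one test fold, so the reported average covers the whole dataset once.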
Network Configuration.
Our ECC-network for NCI1 has 8 parametric layers and 3 levels of graph resolution. Its configuration can be described as C(48)-C(48)-C(48)-MP-C(48)-C(64)-MP-C(64)-GAP-FC(64)-D(0.1)-FC(2), where C($c$) denotes ECC with $c$ output channels followed by affine batch normalization, ReLU activation and dropout (probability 0.05), MP stands for max-pooling onto a coarser graph, GAP is global average pooling, FC($c$) is fully-connected layer with $c$ output channels, and D($p$) is dropout with probability $p$. The filter-generating networks $F^l$ have configuration FC(64)-FC($d_l d_{l-1}$) with orthogonal weight initialization [33] and ReLU in between. Labels are encoded as one-hot vectors ($d_0 = 37$ and $s = 4$ due to an extra label for self-connections). Networks are trained with SGD and cross-entropy loss for 50 epochs with batch size 64 and learning rate 0.1 step-wise decreasing after 25, 35, and 45 epochs. The dataset is expanded five times by randomized sparsification (Section 3.5). Small deviations from this description for the other four datasets are mentioned in the supplementary.
Baselines.
We compare our method (ECC) to the state-of-the-art Weisfeiler-Lehman graph kernel [35] and to four approaches using deep learning as at least one of their components [2, 28, 41, 8]. Randomized sparsification used during training time can also be exploited at test time, when the network prediction scores (ECC-5-scores) or votes (ECC-5-votes) are averaged over 5 runs. To judge the influence of edge labels, we run our method with uniform labels and $F^l$ being a single layer FC($d_l d_{l-1}$) without bias (ECC no edge labels). (Footnote 2: This is also possible for unlabeled ENZYMES and D&D, since our method uses labels from Kron reduction for all coarsened graphs by default.)
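The two test-time aggregation schemes can be contrasted with a short sketch. The function names and the toy scores are hypothetical; each run stands for one forward pass over a differently sparsified graph pyramid.

```python
def predict_by_scores(runs):
    """Average the class scores over runs, then take the best class."""
    n_classes = len(runs[0])
    mean = [sum(r[c] for r in runs) / len(runs) for c in range(n_classes)]
    return max(range(n_classes), key=mean.__getitem__)


def predict_by_votes(runs):
    """Let each run vote for its best class, then take the majority."""
    votes = [0] * len(runs[0])
    for r in runs:
        votes[max(range(len(r)), key=r.__getitem__)] += 1
    return max(range(len(votes)), key=votes.__getitem__)


# Three runs over two classes: one confident run, two marginal ones.
runs = [[0.9, 0.1], [0.4, 0.6], [0.45, 0.55]]
by_scores = predict_by_scores(runs)
by_votes = predict_by_votes(runs)
```

The two schemes can disagree: here a single confident run dominates the averaged scores, while the majority vote follows the two marginal runs.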
Results.
Table 4 conveys that while there is no clear winning algorithm, our method performs at the level of state of the art for edge-labeled datasets (NCI1, NCI109, MUTAG). The results demonstrate the importance of exploiting edge labels for convolution-based methods, as the performance of DCNN [2] and ECC without edge labels is distinctly worse, justifying the motivation behind this paper. Averaging over random sparsifications at test time improves accuracy by a small amount. Our results on datasets without edge labels (ENZYMES, D&D) are somewhat below the state of the art but still at a reasonable level, though improvement in this case was not the aim of this work. This indicates that further research is needed into the adaptation of CNNs to general graphs. A more detailed discussion for each dataset is available in the supplementary.
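Table 3: Properties of the graph classification datasets.

| | NCI1 | NCI109 | MUTAG | ENZYMES | D&D |
|---|---|---|---|---|---|
| # graphs | 4110 | 4127 | 188 | 600 | 1178 |
| mean $|V|$ | 29.87 | 29.68 | 17.93 | 32.63 | 284.32 |
| mean $|E|$ | 32.3 | 32.13 | 19.79 | 62.14 | 715.66 |
| # classes | 2 | 2 | 2 | 6 | 2 |
| # vertex labels | 37 | 38 | 7 | 3 | 82 |
| # edge labels | 3 | 3 | 11 | - | - |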
4.4 MNIST
To further validate our method, we applied it to the MNIST classification problem [23], a dataset of 70k greyscale images of handwritten digits represented on a 2D grid of size 28×28. We regard each image $I$ as point cloud $P$ with points $p_i = (x, y, 0)$ and signal $X^0(i) = I(x, y)$ representing each pixel, $x, y \in \{0, \dots, 27\}$. Edge labeling and graph coarsening is performed as explained in Section 3.4. We are mainly interested in two questions: Is ECC able to reach the standard performance on this classic baseline? What kind of representation does it learn?
Network Configuration.
Our ECC-network has 5 parametric layers with configuration C(16)-MP(2,3.4)-C(32)-MP(4,6.8)-C(64)-MP(8,30)-C(128)-D(0.5)-FC(10); the notation and filter-generating network being as in Section 4.1. The last convolution has a stride of 30 and thus maps all points to only a single point. Input graphs are created with $r^{(0)}$ and $\rho^{(0)}$ such that the first convolution covers a 5×5 pixel neighborhood. This model exactly corresponds to a regular CNN with three convolutions with filters of size 5×5, 3×3, and 3×3 interlaced with max-poolings of size 2×2, finished with two fully connected layers. Networks are trained with SGD and cross-entropy loss for 20 epochs with batch size 64 and learning rate 0.01 step-wise decreasing after 10 and 15 epochs.
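The correspondence between neighborhood radii and filter sizes can be verified by counting lattice points, as in the sketch below. It assumes interior points of a dense image lying exactly on a square lattice, and the function name is hypothetical; the radius of 2.9 on the unit grid is merely one illustrative value between $2\sqrt{2}$ and 3 that yields a 5×5 window.

```python
def neighbors_within(radius, spacing):
    """Count lattice points (center included) within `radius` on a 2D grid with step `spacing`."""
    k = int(radius // spacing)
    return sum(
        1
        for a in range(-k, k + 1)
        for b in range(-k, k + 1)
        if (a * spacing) ** 2 + (b * spacing) ** 2 <= radius ** 2
    )


after_first_pooling = neighbors_within(3.4, 2)    # MP(2,3.4)
after_second_pooling = neighbors_within(6.8, 4)   # MP(4,6.8)
unit_grid_example = neighbors_within(2.9, 1)      # illustrative radius only
```

Both pooled resolutions give 9 neighbors, i.e. a 3×3 filter, since the radius reaches the diagonal neighbor but not the next lattice point along an axis.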
Results.
Table 5 shows that our ECC network can achieve the level of quality comparable to the good standard in the community (99.14). This is exactly the same accuracy as reported by Defferrard et al. [11] and better than what is offered by other spectral-based approaches (98.2 [6], 94.96 [15]). Note that we are not aiming at becoming the state of the art on MNIST by this work.
Next, we investigate the effect of regular grid and irregular mesh. To this end, we discard all black points ($X^0(i) = 0$) from the point clouds, corresponding to 80.9% of data, and retrain the network (ECC sparse input). Exactly the same test performance is obtained (99.14), indicating that our method is very stable with respect to graph structure changing from sample to sample.
Furthermore, we check the quality of the learned filter-generating networks $F^l$. We compare with ECC configured to mimic regular convolution using single-layer filter networks and one-hot encoding of offsets (ECC one-hot), as described in Section 3.2. This configuration reaches 99.37 accuracy, or 0.23 more than ECC, implying that $F^l$ are not perfect but still perform very well in learning the proper partitioning of edge labels.
Last, we explore the generated filters visually for the case of the sparse input ECC. As filters are a continuous function of an edge label, we can visualize the change of values in each dimension in 16 images by sampling labels over grids of two resolutions. The coarser one in Figure 5 has integer steps corresponding to the offsets $\delta_x, \delta_y \in \{-2, \dots, 2\}$. It shows filters exhibiting the structured patterns typically found in the first layer of CNNs. The finer resolution in Figure 5 (sub-pixel steps of 0.1) reveals that the filters are in fact smooth and do not contain any discontinuities apart from the angular artifact due to the periodicity of azimuth. Interestingly, the artifact is not distinct in all filters, suggesting the network may learn to overcome it if necessary.
Table 5: Accuracy (%) on MNIST.

| Model | Train accuracy | Test accuracy |
|---|---|---|
| ECC | 99.12 | 99.14 |
| ECC (sparse input) | 99.36 | 99.14 |
| ECC (one-hot) | 99.53 | 99.37 |
5 Conclusion
We have introduced edge-conditioned convolution (ECC), an operation on graph signals performed in the spatial domain where filter weights are conditioned on edge labels and dynamically generated for each specific input sample. We have shown that our formulation generalizes the standard convolution on graphs if edge labels are chosen properly and experimentally validated this assertion on MNIST. We applied our approach to point cloud classification in a novel way, setting a new state-of-the-art performance on the Sydney dataset. Furthermore, we have outperformed other deep learning-based approaches on the graph classification dataset NCI1. The source code is available at https://github.com/mys007/ecc.
In future work we would like to treat meshes as graphs rather than point clouds. Moreover, we plan to address the currently higher level of GPU memory consumption in case of large graphs with continuous edge labels, for example by randomized clustering, which could also serve as additional regularization through data augmentation.
Acknowledgments.
We gratefully acknowledge NVIDIA Corporation for the donated GPU used in this research. We are thankful to anonymous reviewers for their feedback.
References
 [1] N. S. Alvar, M. Zolfaghari, and T. Brox. Orientation-boosted voxel nets for 3d object recognition. CoRR, abs/1604.03351, 2016.
 [2] J. Atwood and D. Towsley. Diffusion-convolutional neural networks. In NIPS, 2016.
 [3] L. J. Ba, R. Kiros, and G. E. Hinton. Layer normalization. CoRR, abs/1607.06450, 2016.
 [4] K. M. Borgwardt and H. Kriegel. Shortest-path kernels on graphs. In Proceedings of the 5th IEEE International Conference on Data Mining (ICDM 2005), 27-30 November 2005, Houston, Texas, USA, pages 74–81, 2005.
 [5] B. D. Brabandere, X. Jia, T. Tuytelaars, and L. V. Gool. Dynamic filter networks. In NIPS, 2016.
 [6] J. Bruna, W. Zaremba, A. Szlam, and Y. LeCun. Spectral networks and locally connected networks on graphs. CoRR, abs/1312.6203, 2013.
 [7] T. Chen, B. Dai, D. Liu, and J. Song. Performance of global descriptors for Velodyne-based urban object recognition. In 2014 IEEE Intelligent Vehicles Symposium Proceedings, Dearborn, MI, USA, June 8-11, 2014, pages 667–673, 2014.
 [8] H. Dai, B. Dai, and L. Song. Discriminative embeddings of latent variable models for structured data. In ICML, 2016.
 [9] M. De Deuge, A. Quadros, C. Hung, and B. Douillard. Unsupervised feature learning for classification of outdoor 3d scans. In Australasian Conference on Robotics and Automation, volume 2, 2013.
 [10] A. K. Debnath, R. L. Lopez de Compadre, G. Debnath, A. J. Shusterman, and C. Hansch. Structure-activity relationship of mutagenic aromatic and heteroaromatic nitro compounds. correlation with molecular orbital energies and hydrophobicity. Journal of medicinal chemistry, 34(2):786–797, 1991.
 [11] M. Defferrard, X. Bresson, and P. Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In NIPS, 2016.
 [12] P. D. Dobson and A. J. Doig. Distinguishing enzyme structures from non-enzymes without alignments. Journal of molecular biology, 330(4):771–783, 2003.
 [13] F. Dörfler and F. Bullo. Kron reduction of graphs with applications to electrical networks. IEEE Trans. on Circuits and Systems, 60-I(1):150–163, 2013.
 [14] D. K. Duvenaud, D. Maclaurin, J. Aguilera-Iparraguirre, R. Bombarell, T. Hirzel, A. Aspuru-Guzik, and R. P. Adams. Convolutional networks on graphs for learning molecular fingerprints. In NIPS, 2015.
 [15] M. Edwards and X. Xie. Graph based convolutional neural network. In BMVC, 2016.
 [16] B. Graham. Fractional max-pooling. CoRR, abs/1412.6071, 2014.
 [17] K. He and J. Sun. Convolutional neural networks at constrained time cost. In CVPR, 2015.
 [18] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
 [19] J. Huang and S. You. Point cloud labeling using 3d convolutional neural network. In ICPR, 2016.
 [20] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
 [21] T. N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks. CoRR, abs/1609.02907, 2016.
 [22] Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
 [23] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
 [24] Y. Li, D. Tarlow, M. Brockschmidt, and R. S. Zemel. Gated graph sequence neural networks. In ICLR, 2016.
 [25] J. Masci, D. Boscaini, M. M. Bronstein, and P. Vandergheynst. Geodesic convolutional neural networks on riemannian manifolds. pages 37–45, 2015.
 [26] D. Maturana and S. Scherer. Voxnet: A 3d convolutional neural network for real-time object recognition. In IROS, 2015.
 [27] D. Mishkin and J. Matas. All you need is a good init. In ICLR, 2016.
 [28] M. Niepert, M. Ahmed, and K. Kutzkov. Learning convolutional neural networks for graphs. In ICML, 2016.
 [29] N. Perraudin, J. Paratte, D. I. Shuman, V. Kalofolias, P. Vandergheynst, and D. K. Hammond. GSPBOX: A toolbox for signal processing on graphs. CoRR, abs/1408.5781, 2014.
 [30] C. R. Qi, H. Su, M. Nießner, A. Dai, M. Yan, and L. J. Guibas. Volumetric and multi-view cnns for object classification on 3d data. In CVPR, 2016.
 [31] R. B. Rusu and S. Cousins. 3d is here: Point cloud library (pcl). In Robotics and Automation (ICRA), 2011 IEEE International Conference on, pages 1–4. IEEE, 2011.
 [32] I. Safro, P. Sanders, and C. Schulz. Advanced coarsening schemes for graph partitioning. ACM Journal of Experimental Algorithmics, 19(1), 2014.
 [33] A. M. Saxe, J. L. McClelland, and S. Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. In ICLR, 2014.
 [34] F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini. The graph neural network model. IEEE Trans. Neural Networks, 20(1):61–80, 2009.
 [35] N. Shervashidze, P. Schweitzer, E. J. van Leeuwen, K. Mehlhorn, and K. M. Borgwardt. Weisfeiler-Lehman graph kernels. Journal of Machine Learning Research, 12:2539–2561, 2011.
 [36] D. I. Shuman, M. J. Faraji, and P. Vandergheynst. A multiscale pyramid transform for graph signals. IEEE Trans. Signal Processing, 64(8):2119–2134, 2016.
 [37] D. A. Spielman and N. Srivastava. Graph sparsification by effective resistances. SIAM Journal on Computing, 40(6):1913–1926, 2011.
 [38] H. Su, S. Maji, E. Kalogerakis, and E. G. Learned-Miller. Multi-view convolutional neural networks for 3d shape recognition. In ICCV, 2015.
 [39] N. Wale, I. A. Watson, and G. Karypis. Comparison of descriptor spaces for chemical compound retrieval and classification. Knowledge and Information Systems, 14(3):347–375, 2008.
 [40] Z. Wu, S. Song, A. Khosla, X. Tang, and J. Xiao. 3d shapenets for 2.5d object recognition and next-best-view prediction. In CVPR, 2015.
 [41] P. Yanardag and S. V. N. Vishwanathan. Deep graph kernels. In SIGKDD, 2015.
Appendix
Appendix A Overview
In the first part, the appendix provides further discussion of the graph classification results (Section B) and investigates robustness of point cloud classification to noise (Section C). In the second part, we explore several extensions of our ECC formulation, specifically with different edge labeling for point clouds (Section D), with identity connections (Section E), with degree labels (Section F), and with a learned normalization factor (Section G).
Appendix B Details on Graph Classification Benchmark
In this section we describe how our network architectures differ from the one introduced for NCI1 in the main paper and discuss the evaluation results for each dataset in detail.
NCI1.
ECC (83.80%) performs distinctly better than convolution methods that are not able to use edge labels (DCNN [2] 62.61%, PSCN [28] 78.59%). Methods that do not approach the problem as convolutions on graphs but rather combine deep learning with other techniques are stronger (Deep WL [41] 80.31%, structure2vec [8] 83.72%) but are still outperformed by ECC. While the Weisfeiler-Lehman graph kernel remains the strongest method (WL [35] 84.55%), it is fair to conclude that ECC, structure2vec, and WL perform at the same level.
NCI109.
We use the same ECC-network configuration and training details as described in Section 4.3 for NCI1, since both datasets are similar. ECC (82.14%) performs distinctly better than DCNN [2] (62.86%), which is not able to use edge labels, and is on par with non-convolutional approaches (Deep WL [41] 80.32%, structure2vec [8] 82.16%, WL [35] 84.49%).
MUTAG.
As MUTAG is a tiny dataset of small graphs, we trained a downsized ECC-network to combat overfitting. Using the notation from Section 4.3, its configuration is C(16)-C(32)-C(48)-MP-C(64)-MP-GAP-FC(64)-D(0.2)-FC(2); all other details are as with NCI1. While by numbers ECC (89.44%) outperforms all other approaches except for PSCN [28] (92.63%), we note that all four leading methods (Deep WL [41] 87.44%, structure2vec [8] 88.28%, ECC, PSCN) can be seen to perform equally well due to fluctuations caused by the dataset size. We attribute the slight decrease in performance with test-time randomization (88.33%) to the same reason.
ENZYMES.
Due to the higher complexity of this task we use a wider ECC-network configured as C(64)-C(64)-C(96)-MP-C(96)-C(128)-MP-C(128)-C(160)-MP-C(160)-GAP-FC(192)-D(0.2)-FC(6), using the notation and other details from Section 4.3. As this dataset is not edge-labeled, we do not expect to obtain the best performance. Indeed, our method (53.50%) performs at the level of Deep WL [41] (53.43%) and is outperformed by WL [35] (59.05%) and structure2vec [8] (61.10%). Note that the gap to the other convolution-based method DCNN [2] (18.10%) is huge and that there is an improvement of more than 4 percentage points due to edge labels in coarser graph resolutions from Kron reduction.
D&D.
Due to the large graphs in this dataset we designed an ECC-network with more pooling, configured as C(48)-C(48)-C(48)-MP-C(48)-MP-C(64)-MP-C(64)-MP-C(64)-MP-C(64)-MP-GAP-FC(64)-D(0.2)-FC(2), using the notation and other details from Section 4.3. As this dataset is not edge-labeled, we do not expect to obtain the best performance. Our method (74.10%) is outperformed by the other methods evaluated on this dataset (PSCN [28] 77.12%, WL [35] 79.78%, structure2vec [8] 82.22%), though the margin is not very large.
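The configuration strings used throughout this section can be unpacked mechanically. The following is a minimal sketch, not part of the released code: the function name `parse_config` is hypothetical, and the token meanings (C for an ECC layer with the given number of output channels, MP for max-pooling onto a coarser graph, GAP for global average pooling, FC for a fully connected layer, D for dropout with the given probability) are assumed from the notation referenced in Section 4.3.

```python
import re

# Assumed token meanings (see the notation referenced in Section 4.3):
# C = ECC layer, MP = max-pooling, GAP = global average pooling,
# FC = fully connected layer, D = dropout.
TOKEN = re.compile(r"(GAP|MP|FC|C|D)(?:\(([\d.]+)\))?")

def parse_config(spec):
    """Split a configuration string into (layer_type, argument) pairs."""
    layers = []
    for name, arg in TOKEN.findall(spec):
        if name == "D":
            layers.append((name, float(arg)))  # dropout probability
        elif arg:
            layers.append((name, int(arg)))    # number of output channels
        else:
            layers.append((name, None))        # layer without an argument
    return layers

# Configurations quoted in this section for MUTAG and D&D.
mutag = parse_config("C(16)-C(32)-C(48)-MP-C(64)-MP-GAP-FC(64)-D(0.2)-FC(2)")
dd = parse_config(
    "C(48)-C(48)-C(48)-MP-C(48)-MP-C(64)-MP-C(64)-MP-C(64)-MP-C(64)-MP-"
    "GAP-FC(64)-D(0.2)-FC(2)"
)
num_poolings_dd = sum(1 for name, _ in dd if name == "MP")
```

Counting the MP tokens makes the design difference explicit: the D&D network pools more often than the MUTAG network, in line with the larger graphs it has to process.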
Appendix C Robustness to Noise
Real-world point clouds contain several kinds of artifacts, such as holes due to occlusions and Gaussian noise due to measurement uncertainty. Figure 6 shows that ECC is highly robust to point removal and can be made robust to additive Gaussian noise by proper training data augmentation.
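The two perturbations can be simulated along the following lines. This is a hedged sketch rather than the protocol behind Figure 6: the function names, the removal probability, and the noise level are illustrative assumptions, and uniformly random point removal only approximates the spatially coherent holes caused by real occlusions.

```python
import numpy as np

def remove_points(points, drop_prob, rng):
    """Discard each point independently with probability drop_prob."""
    keep = rng.random(len(points)) >= drop_prob
    if not keep.any():
        # Keep at least one point so that the cloud never becomes empty.
        keep[rng.integers(len(points))] = True
    return points[keep]

def add_gaussian_noise(points, sigma, rng):
    """Add zero-mean Gaussian noise with standard deviation sigma."""
    return points + rng.normal(0.0, sigma, size=points.shape)

rng = np.random.default_rng(0)
cloud = rng.uniform(-1.0, 1.0, size=(1000, 3))  # synthetic stand-in cloud
augmented = add_gaussian_noise(remove_points(cloud, 0.3, rng), 0.01, rng)
```

Applying `add_gaussian_noise` to training samples with freshly drawn noise in every epoch is one way to realize the data augmentation mentioned above.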
Appendix D Edge Labels for Point Clouds
In Section 3.4 we defined edge labels as the offset between the two points of an edge, expressed in both Cartesian and spherical coordinates. Here, we explore the importance of individual elements in the proposed edge labeling and further evaluate labels invariant to rotation about the objects' vertical axis (IRz). Table 6 shows that models with isotropic labels (60.7) or no labels (38.9) perform poorly, as expected, while either of the coordinate systems is important. IRz labeling performs comparably or even slightly better than our proposed one. However, we believe this is a property of the specific dataset and may not necessarily generalize, an example being MNIST, where IRz is equivalent to full isotropy and decreases accuracy to 89.9%.
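Table 6: Mean F1 score for different edge labelings.
Label             Mean F1
[label missing]   78.4
[label missing]   76.1
[label missing]   77.3
[label missing]   75.8
[label missing]   78.2
[label missing]   78.7
isotropic         60.7
no labels         38.9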
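To make the compared labelings concrete, the sketch below computes a few of them for a single edge. It rests on several assumptions: the full label is taken to be the Cartesian offset concatenated with its spherical representation (length, polar angle, azimuth), `arctan2` is used for the azimuth for numerical convenience, and the `irz` mode (vertical offset, length, polar angle) is only one hypothetical way of obtaining invariance to rotation about the vertical axis. The function name `edge_label` is illustrative as well.

```python
import numpy as np

def edge_label(p_i, p_j, mode="full"):
    """Label of the edge from point p_j to point p_i (3D coordinates)."""
    delta = np.asarray(p_j, dtype=float) - np.asarray(p_i, dtype=float)
    r = np.linalg.norm(delta)
    # Polar angle measured from the vertical (z) axis; zero for coinciding points.
    polar = np.arccos(np.clip(delta[2] / r, -1.0, 1.0)) if r > 0 else 0.0
    azimuth = np.arctan2(delta[1], delta[0])
    spherical = np.array([r, polar, azimuth])
    if mode == "full":        # Cartesian and spherical offsets together
        return np.concatenate([delta, spherical])
    if mode == "cartesian":
        return delta
    if mode == "spherical":
        return spherical
    if mode == "irz":         # assumed variant: unchanged by rotation about z
        return np.array([delta[2], r, polar])
    if mode == "isotropic":   # distance only, no directional information
        return np.array([r])
    raise ValueError("unknown mode: " + mode)

full = edge_label([0.0, 0.0, 0.0], [1.0, 0.0, 0.0])
```

Rotating both points by the same angle about the z axis changes the `full` label (through the x and y offsets and the azimuth) but leaves the `irz` and `isotropic` labels untouched, which is the distinction the comparison above probes.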
Appendix E Identity Connections
The formulation of ECC in Equation 1 does not treat self-loop edges in a special way. However, the success of residual networks [18] is a strong motivation to consider adding identity skip-connections to our model and encouraging ECC in residual learning. We thus formulate ECC-resnet as follows:
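$$X^l(i) = \mathrm{ID}\big(X^{l-1}(i)\big) + \frac{1}{|N(i)|} \sum_{j \in N(i)} F^l\big(L(j,i); w^l\big)\, X^{l-1}(j) + b^l \qquad (2)$$
where $\mathrm{ID}$ is an identity mapping if $d_{l-1} = d_l$ (equal input and output feature dimensionality) and a linear mapping otherwise.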
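A residual variant of this kind can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions rather than the released implementation: the filter-generating network is reduced to a single hypothetical linear map `W_gen` from edge labels to filter weights, the aggregation is the average over incoming edges as in the normalization by neighborhood size, and all names are illustrative.

```python
import numpy as np

def ecc_resnet_layer(X, edges, labels, W_gen, bias, W_skip=None):
    """One edge-conditioned layer with a skip connection.

    X:      (n, d_in) vertex features
    edges:  list of directed pairs (j, i), meaning j is a neighbor of i
    labels: (num_edges, s) edge labels, aligned with edges
    W_gen:  (s, d_out * d_in) assumed linear filter-generating network
    bias:   (d_out,) bias vector
    W_skip: (d_out, d_in) linear skip mapping, or None for identity
    """
    n, d_in = X.shape
    d_out = bias.shape[0]
    agg = np.zeros((n, d_out))
    count = np.zeros(n)
    for (j, i), label in zip(edges, labels):
        theta = (label @ W_gen).reshape(d_out, d_in)  # edge-specific filter
        agg[i] += theta @ X[j]
        count[i] += 1
    agg /= np.maximum(count, 1)[:, None]  # normalize by neighborhood size
    if W_skip is None:
        if d_in != d_out:
            raise ValueError("identity skip needs d_in == d_out")
        skip = X                           # identity mapping
    else:
        skip = X @ W_skip.T                # linear mapping
    return skip + agg + bias

X = np.array([[1.0, 2.0], [3.0, 4.0]])
edges = [(1, 0), (0, 0)]                   # vertex 0 has neighbors 1 and itself
labels = np.array([[2.0], [0.0]])
W_gen = np.eye(2).reshape(1, 4)            # filter = label * identity matrix
out = ecc_resnet_layer(X, edges, labels, W_gen, np.zeros(2))
```

With all generated filters equal to zero the layer reduces to the skip path alone, which is the property that makes residual learning attractive as a starting point for optimization.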
The results listed in Table 7 show that with two exceptions (NCI109 and ENZYMES) ECC does not benefit from identity connections in the specific network configurations. The trend may be different for other configurations, e.g. ECC improved from 76.9 to 79.5 mean F1 score on Sydney due to identity connections as mentioned in Section 4.1.
Table 7: Performance of ECC with (ECC-resnet) and without identity connections.
Model       NCI1   NCI109  MUTAG  ENZYMES  D&D    Sydney  ModelNet10
ECC-resnet  83.24  81.97   85.56  51.83    70.48  77.0    88.5 (89.3)
ECC         83.80  81.87   89.44  50.00    73.65  78.4    89.3 (90.0)
Appendix F Vertex Degrees in Edge Labels
In the task of graph classification, we used categorical labels (if present) encoded as one-hot vectors for edges in the input graph and scalars computed by Kron reduction for edges in all coarsened graphs.
Here we investigate making the edge labels more informative by including the degrees of the pair of vertices forming an edge. The degree information is implicitly used by spectral convolution methods, as the degree information is contained in the graph Laplacian, and also appears in the explicit propagation rules [21, 2].
Our model can be easily extended to make use of this information by simply appending it to the existing edge label vectors. We consider four variants of additional degree labels for a directed edge, each a function of the degrees of the two vertices it connects. We use these additional labels in all graph resolutions.
Table 8 reveals that degree information can improve the results considerably, especially for datasets without given edge labels (by up to 5 percentage points for ENZYMES and up to 2.14 percentage points for D&D). However, no single variant guarantees a consistent improvement over all datasets.
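Table 8: Performance with additional degree labels appended to the edge labels.
Degree labels              NCI1   NCI109  MUTAG  ENZYMES  D&D
[label missing]            82.99  81.94   87.78  53.67    73.65
[label missing]            83.60  82.40   88.89  52.67    71.77
[label missing]            83.58  82.28   86.67  55.00    75.79
[label missing]            83.16  83.03   86.67  52.83    73.74
ECC without degree labels  83.80  81.87   89.44  50.00    73.65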
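Appending degree information to edge labels amounts to a simple concatenation, as the following sketch illustrates. The three variants shown (raw degrees, inverse degrees, and the inverse square root of the degree product) are illustrative assumptions and are not claimed to coincide with the four variants evaluated above; the function names are hypothetical, and the degree of a vertex is taken to be its number of incoming edges.

```python
import numpy as np

def one_hot(categories, num_classes):
    """Encode categorical edge labels as one-hot row vectors."""
    return np.eye(num_classes)[np.asarray(categories)]

def append_degree_labels(edges, labels, num_vertices, variant="raw"):
    """Append degree-based features to the label of every directed edge (j, i)."""
    edges = np.asarray(edges)
    # Degree of a vertex = number of incoming edges (its neighborhood size).
    deg = np.bincount(edges[:, 1], minlength=num_vertices).astype(float)
    deg = np.maximum(deg, 1.0)  # guard against division by zero
    d_j, d_i = deg[edges[:, 0]], deg[edges[:, 1]]
    if variant == "raw":                 # assumed variant
        extra = np.stack([d_i, d_j], axis=1)
    elif variant == "inverse":           # assumed variant
        extra = np.stack([1.0 / d_i, 1.0 / d_j], axis=1)
    elif variant == "inv_sqrt_product":  # assumed variant
        extra = (1.0 / np.sqrt(d_i * d_j))[:, None]
    else:
        raise ValueError("unknown variant: " + variant)
    return np.concatenate([labels, extra], axis=1)

# Path graph 0 - 1 - 2, stored as directed edges in both directions.
edges = [(0, 1), (1, 0), (1, 2), (2, 1)]
labels = one_hot([0, 1, 1, 0], num_classes=2)
with_degrees = append_degree_labels(edges, labels, num_vertices=3)
```

Since the filter-generating network consumes the edge label vector as a whole, no other part of the model needs to change when the label dimensionality grows.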
Appendix G Vertex Degrees in Normalization
The formulation of ECC in Equation 1 performs normalization by the neighborhood size. Here we explore learning an additional multiplicative factor conditioned on the neighborhood size. To this end, we again make use of Dynamic filter networks [5] and design a factor-generating network which, given the vertex degree, outputs a vertex-specific normalization factor. We formulate ECC-Z as follows:
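$$X^l(i) = \frac{Z^l\big(|N(i)|\big)}{|N(i)|} \sum_{j \in N(i)} F^l\big(L(j,i); w^l\big)\, X^{l-1}(j) + b^l \qquad (3)$$
where $Z^l$ denotes the factor-generating network of layer $l$.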
In our experiments, the factor-generating networks have configuration FC(32)-FC(1) with orthogonal weight initialization [33] and ReLUs in between.
The results in Table 9 show that while being helpful on some datasets (NCI109, ENZYMES, ModelNet10), ECC-Z harms the performance on the other ones. Embedding vertex information in labels instead seems to achieve higher performance (Section F).
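A factor-generating network with this configuration might look as follows. The sketch assumes a QR-based orthogonal initialization, zero biases, the raw vertex degree as the only input, and that the factor scales the normalized neighborhood sum before the bias is added; all function names are illustrative.

```python
import numpy as np

def orthogonal(rows, cols, rng):
    """Matrix of shape (rows, cols) with orthonormal columns or rows."""
    a = rng.normal(size=(max(rows, cols), min(rows, cols)))
    q, r = np.linalg.qr(a)       # q has orthonormal columns
    q = q * np.sign(np.diag(r))  # remove the sign ambiguity of QR
    return q if rows >= cols else q.T

def init_factor_network(rng, hidden=32):
    """Parameters of an FC(32)-FC(1) network with zero biases (assumed)."""
    return {"W1": orthogonal(hidden, 1, rng), "b1": np.zeros(hidden),
            "W2": orthogonal(1, hidden, rng), "b2": np.zeros(1)}

def normalization_factor(degrees, params):
    """Map vertex degrees of shape (n,) to one scalar factor per vertex."""
    d = np.asarray(degrees, dtype=float).reshape(-1, 1)
    h = np.maximum(d @ params["W1"].T + params["b1"], 0.0)  # FC(32) + ReLU
    return (h @ params["W2"].T + params["b2"]).ravel()      # FC(1)

def ecc_z_output(neighbor_mean, degrees, bias, params):
    """Scale the normalized neighborhood sum by the learned factor."""
    z = normalization_factor(degrees, params)
    return z[:, None] * neighbor_mean + bias

rng = np.random.default_rng(0)
params = init_factor_network(rng)
factors = normalization_factor([1, 2, 3], params)
```

Because the network sees only the degree, vertices with equal neighborhood size always receive the same factor within a layer.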
Table 9: Performance of ECC with (ECC-Z) and without the learned normalization factor.
Model  NCI1   NCI109  MUTAG  ENZYMES  D&D    Sydney  ModelNet10
ECC-Z  83.48  82.57   86.67  52.50    72.03  75.5    89.9 (90.6)
ECC    83.80  81.87   89.44  50.00    73.65  78.4    89.3 (90.0)