# Constant Curvature Graph Convolutional Networks

## Abstract

Interest has been rising lately towards methods representing data in non-Euclidean spaces, e.g. hyperbolic or spherical, that provide specific inductive biases useful for certain real-world data properties, e.g. scale-free, hierarchical or cyclical. However, the popular graph neural networks are currently limited to modeling data via Euclidean geometry and associated vector space operations. Here, we bridge this gap by proposing mathematically grounded generalizations of graph convolutional networks (GCN) to (products of) constant curvature spaces. We do this by i) introducing a unified formalism that can interpolate smoothly between all geometries of constant curvature, ii) leveraging gyro-barycentric coordinates that generalize the classic Euclidean concept of the center of mass. Our class of models smoothly recovers its Euclidean counterparts when the curvature goes to zero from either side. Empirically, we outperform Euclidean GCNs in the tasks of node classification and distortion minimization for symbolic data exhibiting non-Euclidean behavior, according to their discrete curvature.

## 1 Introduction

Graph Convolutional Networks. The success of convolutional networks and deep learning for image data has inspired generalizations for graphs for which sharing parameters is consistent with the graph geometry. Bruna et al. (2014) and Henaff et al. (2015) pioneered spectral graph convolutional neural networks in the graph Fourier space, using localized spectral filters on graphs. However, in order to reduce the graph-dependency on the Laplacian eigenmodes, Defferrard et al. (2016) approximate the convolutional filters using Chebyshev polynomials, leveraging a result of Hammond et al. (2011). The resulting method (discussed in Appendix A) is computationally efficient and superior in terms of accuracy and complexity. Further, Kipf and Welling (2017) simplify this approach by considering first-order approximations, obtaining high scalability. The proposed graph convolutional network (GCN) interpolates node embeddings via a symmetrically normalized adjacency matrix, and this weight sharing can be understood as an efficient diffusion-like regularizer. Recent works extend GCNs to achieve state-of-the-art results for link prediction (Zhang and Chen, 2018), graph classification (Hamilton et al., 2017; Xu et al., 2018) and node classification (Klicpera et al., 2019; Veličković et al., 2018).

Euclidean geometry in ML. In machine learning (ML), data is most often represented in a Euclidean space for various reasons. First, some data is intrinsically Euclidean, such as positions in 3D space in classical mechanics. Second, intuition is easier in such spaces, as they possess an appealing vectorial structure allowing basic arithmetic and a rich theory of linear algebra. Finally, many quantities of interest such as distances and inner products are available in closed form and can be computed very efficiently on existing hardware. These operations are the basic building blocks for most of today’s popular machine learning models. Thus, the powerful simplicity and efficiency of Euclidean geometry has led to numerous methods achieving state-of-the-art results on tasks as diverse as machine translation (Bahdanau et al., 2015; Vaswani et al., 2017), speech recognition (Graves et al., 2013), image classification (He et al., 2016) or recommender systems (He et al., 2017).

Riemannian ML. In spite of this success, certain types of data (e.g. hierarchical, scale-free or spherical data) have been shown to be better represented by non-Euclidean geometries (Defferrard et al., 2019; Bronstein et al., 2017; Nickel and Kiela, 2017; Gu et al., 2019), leading in particular to the rich theories of manifold learning (Roweis and Saul, 2000; Tenenbaum et al., 2000) and information geometry (Amari and Nagaoka, 2007). The mathematical framework commonly used to manipulate non-Euclidean geometries is known as Riemannian geometry (Spivak, 1979). Although its theory leads to many strong and elegant results, some of its basic quantities such as the distance function are in general not available in closed form, which can be prohibitive for many computational methods.

Representational Advantages of Geometries of Constant Curvature. An interesting trade-off between general Riemannian manifolds and the Euclidean space is given by manifolds of constant sectional curvature. They define together what are called hyperbolic (negative curvature), elliptic (positive curvature) and Euclidean (zero curvature) geometries. As discussed below and in Appendix B, Euclidean spaces have limitations and suffer from large distortion when embedding certain types of data such as trees. In these cases, the hyperbolic and spherical spaces have representational advantages providing a better inductive bias for the respective data.

The hyperbolic space can be intuitively understood as a continuous tree: the volume of a ball grows exponentially with its radius, much like the number of nodes in a binary tree grows exponentially with its depth (see fig. 1). Its tree-likeness properties have long been studied mathematically (Gromov, 1987; Hamann, 2017; Ungar, 2008) and it was proven to better embed complex networks (Krioukov et al., 2010), scale-free graphs and hierarchical data compared to the Euclidean geometry (Cho et al., 2019; Sala et al., 2018; Ganea et al., 2018b; Gu et al., 2019; Nickel and Kiela, 2018, 2017; Tifrea et al., 2019). Several important tools or methods found their hyperbolic counterparts, such as variational autoencoders (Mathieu et al., 2019; Ovinnikov, 2019), attention mechanisms (Gulcehre et al., 2018), matrix multiplications, recurrent units and multinomial logistic regression (Ganea et al., 2018a).

Similarly, spherical geometry provides benefits for modeling spherical or cyclical data (Defferrard et al., 2019; Matousek, 2013; Davidson et al., 2018; Xu and Durrett, 2018; Gu et al., 2019; Grattarola et al., 2018; Wilson et al., 2014).

Computational Efficiency of Constant Curvature Spaces (CCS). CCS are some of the few Riemannian manifolds to possess closed-form formulae for geometric quantities of interest in computational methods, i.e. distance, geodesics, exponential map, parallel transport and their gradients. We also leverage here the closed expressions for weighted centroids.

“Linear Algebra” of CCS: Gyrovector Spaces. In order to study the geometry of constant negative curvature in analogy with the Euclidean geometry, Ungar (1999, 2005, 2008, 2016) proposed the elegant non-associative algebraic formalism of gyrovector spaces. Recently, Ganea et al. (2018a) have linked this framework to the Riemannian geometry of the space, also generalizing the building blocks for non-Euclidean deep learning models operating with hyperbolic data representations.

However, it remains unclear how to extend in a principled manner the connection between Riemannian geometry and gyrovector space operations for spaces of constant positive curvature (spherical). By leveraging Euler’s formula and complex analysis, we present to our knowledge the first unified gyro framework that smoothly interpolates between geometries of constant curvatures irrespective of their signs. This is possible when working with the Poincaré ball and stereographic spherical projection models of respectively hyperbolic and spherical spaces.

GCNs in Constant Curvature Spaces. In this work, we introduce an extension of graph convolutional networks that allows learning representations residing in (products of) constant curvature spaces with any curvature sign. We achieve this by combining the derived unified gyro framework with the effectiveness of GCNs (Kipf and Welling, 2017). Concurrent to our work, Chami et al. (2019) and Liu et al. (2019) consider graph neural networks that learn embeddings in hyperbolic space via tangent space aggregation. Their approach will be analyzed more closely in section 3.4. Our model is more general as it produces representations in a strict super-set containing the hyperbolic space.

## 2 The Geometry of Constant Curvature Spaces

Riemannian Geometry. A manifold $\mathcal{M}$ of dimension $d$ is a generalization to higher dimensions of the notion of surface, and is a space that locally looks like $\mathbb{R}^d$. At each point $x \in \mathcal{M}$, $\mathcal{M}$ can be associated a tangent space $T_x\mathcal{M}$, which is a vector space of dimension $d$ that can be understood as a first order approximation of $\mathcal{M}$ around $x$. A Riemannian metric $g$ is given by an inner-product $g_x(\cdot,\cdot)$ at each tangent space $T_x\mathcal{M}$, varying smoothly with $x$. A given $g$ defines the geometry of $\mathcal{M}$, because it can be used to define the distance between $x$ and $y$ as the infimum of the lengths of smooth paths $\gamma : [0,1] \to \mathcal{M}$ from $x$ to $y$, where the length is defined as $\ell(\gamma) := \int_0^1 \sqrt{g_{\gamma(t)}(\dot\gamma(t), \dot\gamma(t))}\,dt$. Under certain assumptions, a given $g$ also defines a curvature at each point.

Unifying all curvatures $\kappa$. There exist several models of constant positive and of constant negative curvature. For positive curvature, we choose the stereographic projection of the sphere, while for negative curvature we choose the Poincaré model, which is the stereographic projection of the Lorentz model. As explained below, this choice allows us to generalize the gyrovector space framework and unify spaces of both positive and negative curvature into a single model which we call the $\kappa$-stereographic model.

The $\kappa$-stereographic model. For a curvature $\kappa \in \mathbb{R}$ and a dimension $d \geq 2$, we study the model $\mathfrak{st}^d_\kappa$ defined as $\mathfrak{st}^d_\kappa = \{x \in \mathbb{R}^d \mid -\kappa\|x\|_2^2 < 1\}$ equipped with its Riemannian metric $g^\kappa_x = \frac{4}{(1+\kappa\|x\|^2)^2} I =: (\lambda^\kappa_x)^2 I$. Note in particular that when $\kappa \geq 0$, $\mathfrak{st}^d_\kappa$ is $\mathbb{R}^d$, while when $\kappa < 0$ it is the open ball of radius $1/\sqrt{-\kappa}$.

Gyrovector spaces & Riemannian geometry. As discussed in section 1, the gyrovector space formalism is used to generalize vector spaces to the Poincaré model of hyperbolic geometry (Ungar, 2005, 2008). In addition, important quantities from Riemannian geometry can be rewritten in terms of the Möbius vector addition and scalar-vector multiplication (Ganea et al., 2018a). We here extend gyrovector spaces to the $\kappa$-stereographic model, i.e. allowing positive curvature.

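For $\kappa > 0$ and any point $x \in \mathfrak{st}^d_\kappa$, we will denote by $\tilde{x}$ the unique point of the sphere of radius $\kappa^{-1/2}$ in $\mathbb{R}^{d+1}$ whose stereographic projection is $x$. As detailed in Appendix C, it is given by

$$\tilde{x} := \left(\lambda^\kappa_x x,\ \kappa^{-1/2}(\lambda^\kappa_x - 1)\right). \tag{1}$$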
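As a quick sanity check of Eq. (1), the following minimal NumPy sketch lifts a point to $\mathbb{R}^{d+1}$ and verifies numerically that the result lies on the sphere of radius $\kappa^{-1/2}$. The function names are hypothetical, and the inverse map `project_from_sphere` is not taken from the text: it is obtained here simply by solving Eq. (1) for $x$.

```python
import numpy as np


def conformal_factor(x, kappa):
    """lambda_x^kappa = 2 / (1 + kappa * ||x||^2)."""
    return 2.0 / (1.0 + kappa * np.dot(x, x))


def lift_to_sphere(x, kappa):
    """Eq. (1): send x in R^d to a point of R^(d+1); needs kappa > 0."""
    lam = conformal_factor(x, kappa)
    return np.append(lam * x, (lam - 1.0) / np.sqrt(kappa))


def project_from_sphere(x_tilde, kappa):
    """Inverse map, derived by solving Eq. (1) for x (hypothetical helper)."""
    return x_tilde[:-1] / (1.0 + np.sqrt(kappa) * x_tilde[-1])


kappa = 0.5
x = np.array([0.3, -1.2, 0.7])
x_tilde = lift_to_sphere(x, kappa)
# Squared Euclidean norm of the lifted point; it should equal 1 / kappa.
squared_radius = float(np.dot(x_tilde, x_tilde))
```

The identity behind the check is $4s + (1-s)^2 = (1+s)^2$ with $s = \kappa\|x\|^2$, which makes the squared norm of $\tilde{x}$ collapse to $1/\kappa$.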

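For $x, y \in \mathfrak{st}^d_\kappa$, we define the $\kappa$-addition in the $\kappa$-stereographic model by:

$$x \oplus_\kappa y = \frac{(1 - 2\kappa x^T y - \kappa\|y\|^2)\,x + (1 + \kappa\|x\|^2)\,y}{1 - 2\kappa x^T y + \kappa^2\|x\|^2\|y\|^2} \in \mathfrak{st}^d_\kappa. \tag{2}$$

The $\kappa$-addition is defined in all cases except for spherical geometry with $x = y/(\kappa\|y\|^2)$, as stated by the following theorem, proved in Appendix C.2.1.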

###### Theorem 1 (Definiteness of κ-addition).

We have $1 - 2\kappa x^T y + \kappa^2\|x\|^2\|y\|^2 = 0$ if and only if $\kappa > 0$ and $x = y/(\kappa\|y\|^2)$.
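The $\kappa$-addition of Eq. (2) translates almost literally into code. The sketch below is illustrative only (function names are hypothetical); it also evaluates the denominator at the singular configuration of Theorem 1, taken here as $x = y/(\kappa\|y\|^2)$ with $\kappa > 0$, which is where a direct computation makes the denominator vanish.

```python
import numpy as np


def kappa_add_denominator(x, y, kappa):
    """Denominator of Eq. (2)."""
    return 1.0 - 2.0 * kappa * np.dot(x, y) + kappa**2 * np.dot(x, x) * np.dot(y, y)


def kappa_add(x, y, kappa):
    """kappa-addition of Eq. (2); undefined when the denominator vanishes."""
    xy, x2, y2 = np.dot(x, y), np.dot(x, x), np.dot(y, y)
    numerator = (1.0 - 2.0 * kappa * xy - kappa * y2) * x + (1.0 + kappa * x2) * y
    return numerator / kappa_add_denominator(x, y, kappa)


x = np.array([0.6, 0.2])
y = np.array([-0.3, 0.7])

euclidean_sum = kappa_add(x, y, 0.0)      # kappa = 0 reduces to x + y
hyperbolic_sum = kappa_add(x, y, -1.0)    # stays inside the unit ball
reversed_sum = kappa_add(y, x, -1.0)      # the operation is not commutative

# Singular spherical configuration: x = y / (kappa * ||y||^2) with kappa > 0.
kappa_pos = 1.0
y_s = np.array([0.5, 0.0])
x_s = y_s / (kappa_pos * np.dot(y_s, y_s))
singular_denominator = kappa_add_denominator(x_s, y_s, kappa_pos)
```

Note that `hyperbolic_sum` and `reversed_sum` have the same norm but differ as vectors, which reflects the non-commutative (gyrocommutative) nature of the operation.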

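For $s \in \mathbb{R}$ and $x \in \mathfrak{st}^d_\kappa$ (subject to an additional bound on $s$ when $\kappa > 0$, so that the tangent below stays finite), the $\kappa$-scaling in the $\kappa$-stereographic model is given by:

$$s \otimes_\kappa x = \tan_\kappa\left(s \cdot \tan_\kappa^{-1}\|x\|\right)\frac{x}{\|x\|} \in \mathfrak{st}^d_\kappa, \tag{3}$$

where $\tan_\kappa$ equals $\kappa^{-1/2}\tan$ if $\kappa > 0$ and $(-\kappa)^{-1/2}\tanh$ if $\kappa < 0$. This formalism yields simple closed forms for various quantities including the distance function (see fig. 3) inherited from the Riemannian manifold ($\mathfrak{st}^d_\kappa$, $g^\kappa$), the exp and log maps, and geodesics (see fig. 2), as shown by the following theorem.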

###### Theorem 2 (Extending gyrovector spaces to positive curvature).

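For $x, y \in \mathfrak{st}^d_\kappa$, $x \neq y$, $v \neq 0$ (and $x \neq -y/(\kappa\|y\|^2)$ if $\kappa > 0$), the distance function is given by:

$$d_\kappa(x,y) = 2|\kappa|^{-1/2}\tan_\kappa^{-1}\|-x \oplus_\kappa y\|, \tag{4}$$

the unit-speed geodesic from $x$ to $y$ is unique and given by

$$\gamma_{x\to y}(t) = x \oplus_\kappa \left(t \otimes_\kappa (-x \oplus_\kappa y)\right), \tag{5}$$

and finally the exponential and logarithmic maps are described as:

$$\exp^\kappa_x(v) = x \oplus_\kappa \left(\tan_\kappa\left(|\kappa|^{1/2}\frac{\lambda^\kappa_x\|v\|}{2}\right)\frac{v}{\|v\|}\right), \tag{6}$$

$$\log^\kappa_x(y) = \frac{2|\kappa|^{-1/2}}{\lambda^\kappa_x}\tan_\kappa^{-1}\|-x \oplus_\kappa y\|\,\frac{-x \oplus_\kappa y}{\|-x \oplus_\kappa y\|}. \tag{7}$$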

Proof sketch:
The case $\kappa \leq 0$ was already taken care of by Ganea et al. (2018a). For $\kappa > 0$, we provide a detailed proof in Appendix C.2.2. The exponential map and unit-speed geodesics are obtained using the Egregium theorem and the known formulas in the standard spherical model. The distance then follows from the formula $d_\kappa(x,y) = \|\log^\kappa_x(y)\|_x$, which holds in any Riemannian manifold.
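The closed forms of Theorem 2 can be exercised numerically. The sketch below is a minimal illustration, not the authors' implementation. As a convention of this sketch only, the helper `tan_k(u) = tan(sqrt(kappa) u) / sqrt(kappa)` (and its hyperbolic analogue) absorbs the explicit $|\kappa|^{\pm 1/2}$ factors of Eqs. (4)-(7), so that $\kappa = 0$ becomes the identity; the resulting maps agree with the formulas above for $\kappa \neq 0$.

```python
import numpy as np


def tan_k(u, kappa):
    """Curvature-scaled tangent (convention of this sketch)."""
    if kappa > 0:
        return np.tan(np.sqrt(kappa) * u) / np.sqrt(kappa)
    if kappa < 0:
        return np.tanh(np.sqrt(-kappa) * u) / np.sqrt(-kappa)
    return u


def arctan_k(u, kappa):
    """Inverse of tan_k on its principal branch."""
    if kappa > 0:
        return np.arctan(np.sqrt(kappa) * u) / np.sqrt(kappa)
    if kappa < 0:
        return np.arctanh(np.sqrt(-kappa) * u) / np.sqrt(-kappa)
    return u


def kappa_add(x, y, kappa):
    xy, x2, y2 = np.dot(x, y), np.dot(x, x), np.dot(y, y)
    num = (1 - 2 * kappa * xy - kappa * y2) * x + (1 + kappa * x2) * y
    return num / (1 - 2 * kappa * xy + kappa**2 * x2 * y2)


def conformal_factor(x, kappa):
    return 2.0 / (1.0 + kappa * np.dot(x, x))


def dist(x, y, kappa):
    """Eq. (4)."""
    return 2.0 * arctan_k(np.linalg.norm(kappa_add(-x, y, kappa)), kappa)


def geodesic(x, y, t, kappa):
    """Eq. (5): x (+) (t (x) (-x (+) y)); assumes x != y."""
    w = kappa_add(-x, y, kappa)
    n = np.linalg.norm(w)
    return kappa_add(x, tan_k(t * arctan_k(n, kappa), kappa) * w / n, kappa)


def expmap(x, v, kappa):
    """Eq. (6); assumes v != 0."""
    n = np.linalg.norm(v)
    step = tan_k(conformal_factor(x, kappa) * n / 2.0, kappa) * v / n
    return kappa_add(x, step, kappa)


def logmap(x, y, kappa):
    """Eq. (7); assumes x != y."""
    w = kappa_add(-x, y, kappa)
    n = np.linalg.norm(w)
    return 2.0 / conformal_factor(x, kappa) * arctan_k(n, kappa) * w / n


x = np.array([0.3, -0.2])
y = np.array([-0.1, 0.4])
v = np.array([0.1, 0.25])
curvatures = (-1.0, 0.0, 0.8)
distances = {k: dist(x, y, k) for k in curvatures}
```

With these helpers one can check, for each curvature, that `logmap` inverts `expmap`, that the geodesic starts at `x` and ends at `y`, and that the distance equals the Riemannian norm $\lambda^\kappa_x\|\log^\kappa_x(y)\|$ of the logarithm.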

Around $\kappa = 0$. One notably observes that choosing $\kappa = 0$ yields all corresponding Euclidean quantities, which guarantees a continuous interpolation between $\kappa$-stereographic models of different curvatures, via Euler’s formula $\tan(x) = -i\tanh(ix)$ where $i := \sqrt{-1}$. But is this interpolation differentiable with respect to $\kappa$? It is, as shown by the following theorem, proved in Appendix C.2.3.

###### Theorem 3 (Smoothness of $\mathfrak{st}^d_\kappa$ w.r.t. $\kappa$ around 0).

Let $v \neq 0$ and $x, y \in \mathbb{R}^d$, such that $x \neq y$ (and $x \neq -y/(\kappa\|y\|^2)$ if $\kappa > 0$). Quantities in Eqs. (4, 5, 6, 7) are well-defined for $|\kappa|$ small enough. Their first order derivatives at $0^-$ and $0^+$ exist and are equal. Moreover, for the distance we have up to quadratic terms in $\kappa$:

$$d_\kappa(x,y) \approx 2\|x-y\| - 2\kappa\left(\|x-y\|^3/3 + (x^T y)\|x-y\|\right) \tag{8}$$

Note that for $x^T y \geq 0$, this tells us that an infinitesimal change of curvature from zero to small negative, i.e. towards $0^-$, while keeping $x, y$ fixed, has the effect of increasing their distance.
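The first-order behaviour around $\kappa = 0$ can be checked numerically. The sketch below compares the exact distance of Eq. (4) with a first-order expansion in $\kappa$. It assumes the last term of the expansion is read as $(x^T y)\|x-y\|$, which is what a direct Taylor expansion of Eq. (4) gives (using $\|-x\oplus_\kappa y\|^2 = \|x-y\|^2 / (1 + 2\kappa x^Ty + \kappa^2\|x\|^2\|y\|^2)$); function names are hypothetical.

```python
import numpy as np


def kappa_add(x, y, kappa):
    xy, x2, y2 = np.dot(x, y), np.dot(x, x), np.dot(y, y)
    num = (1 - 2 * kappa * xy - kappa * y2) * x + (1 + kappa * x2) * y
    return num / (1 - 2 * kappa * xy + kappa**2 * x2 * y2)


def dist(x, y, kappa):
    """Exact distance of Eq. (4), with the kappa = 0 limit handled explicitly."""
    u = np.linalg.norm(kappa_add(-x, y, kappa))
    if kappa > 0:
        return 2.0 * np.arctan(np.sqrt(kappa) * u) / np.sqrt(kappa)
    if kappa < 0:
        return 2.0 * np.arctanh(np.sqrt(-kappa) * u) / np.sqrt(-kappa)
    return 2.0 * u


def dist_first_order(x, y, kappa):
    """First-order expansion in kappa (assumed reading of Eq. (8))."""
    diff = np.linalg.norm(x - y)
    return 2.0 * diff - 2.0 * kappa * (diff**3 / 3.0 + np.dot(x, y) * diff)


x = np.array([0.5, 0.1])
y = np.array([0.1, 0.3])   # x^T y = 0.08 >= 0

h = 1e-4
slope_right = (dist(x, y, h) - dist(x, y, 0.0)) / h     # derivative at 0+
slope_left = (dist(x, y, 0.0) - dist(x, y, -h)) / h     # derivative at 0-
diff = np.linalg.norm(x - y)
slope_predicted = -2.0 * (diff**3 / 3.0 + np.dot(x, y) * diff)
```

The two one-sided finite-difference slopes agree with each other and with the predicted derivative, and a slightly negative curvature increases the distance, as the note above states for $x^T y \geq 0$.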

As a consequence, we have a unified formalism that interpolates smoothly between all three geometries of constant curvature.

## 3 κ-GCNs

We start by introducing the methods upon which we build. We present our models for spaces of constant sectional curvature, in the $\kappa$-stereographic model. However, the generalization to Cartesian products of such spaces (Gu et al., 2019) follows naturally from these tools.

### 3.1 Graph Convolutional Networks

The problem of node classification on a graph has long been tackled with explicit regularization using the graph Laplacian (Weston et al., 2012). Namely, for a directed graph with adjacency matrix $A$, by adding the following term to the loss: $\sum_{i,j=1}^n A_{ij}\|f(x_i) - f(x_j)\|^2$, which is (up to a constant factor) the quadratic form $f(X)^T L f(X)$, where $L = D - A$ is the (unnormalized) graph Laplacian, $D_{ii} = \sum_k A_{ik}$ defines the (diagonal) degree matrix, $f$ contains the trainable parameters of the model and $X$ the node features of the model. Such a regularization is expected to improve generalization if connected nodes in the graph tend to share labels; node $i$ with feature vector $x_i$ is represented as $f(x_i)$ in a Euclidean space.

With the aim to obtain more scalable models, Defferrard et al. (2016); Kipf and Welling (2017) propose to make this regularization implicit by incorporating it into what they call graph convolutional networks (GCN), which they motivate as a first order approximation of spectral graph convolutions, yielding the following scalable layer architecture (detailed in Appendix A):

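$$H^{(t+1)} = \sigma\left(\tilde{D}^{-1/2}\tilde{A}\tilde{D}^{-1/2}H^{(t)}W^{(t)}\right) \tag{9}$$

where $\tilde{A} = A + I_n$ has added self-connections, $\tilde{D}_{ii} = \sum_k \tilde{A}_{ik}$ defines its diagonal degree matrix, $\sigma$ is a non-linearity such as sigmoid, $\tanh$ or $\mathrm{ReLU} = \max(0, \cdot)$, and $W^{(t)}$ and $H^{(t)}$ are the parameter and activation matrices of layer $t$ respectively, with $H^{(0)} = X$ the input feature matrix.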
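For concreteness, the layer of Eq. (9) can be sketched in a few lines of NumPy. This is an illustrative forward pass only (no training), with hypothetical function names, a toy path graph, one-hot features and random weights chosen purely for the example.

```python
import numpy as np


def normalized_adjacency(A):
    """D~^{-1/2} (A + I) D~^{-1/2}, the propagation matrix of Eq. (9)."""
    A_tilde = A + np.eye(A.shape[0])          # add self-connections
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
    return d_inv_sqrt[:, None] * A_tilde * d_inv_sqrt[None, :]


def relu(Z):
    return np.maximum(Z, 0.0)


def gcn_layer(A_hat, H, W, activation=relu):
    """One Euclidean GCN layer: sigma(A_hat H W)."""
    return activation(A_hat @ H @ W)


# Toy example: path graph 0 - 1 - 2 with one-hot input features.
A = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
A_hat = normalized_adjacency(A)
X = np.eye(3)
rng = np.random.default_rng(0)
W0 = rng.normal(size=(3, 4))
H1 = gcn_layer(A_hat, X, W0)   # activations of the first layer, shape (3, 4)
```

With self-loops the degrees of the path graph become 2, 3, 2, so for instance the entry of `A_hat` linking nodes 0 and 1 is $1/\sqrt{2\cdot 3}$.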

### 3.2 Tools for a κ-GCN

Learning a parametrized function that respects hyperbolic geometry has been studied in Ganea et al. (2018a): neural layers and hyperbolic softmax. We generalize their definitions into the $\kappa$-stereographic model, unifying operations in positive and negative curvature. We explain how curvature introduces a fundamental difference between left and right matrix multiplications, depicting the Möbius matrix multiplication of Ganea et al. (2018a) as a right multiplication, independent for each embedding. We then introduce a left multiplication by extension of gyromidpoints which ties the embeddings, which is essential for graph neural networks.

### 3.3 κ-Right-Matrix-Multiplication

Let $X \in \mathbb{R}^{n\times d}$ denote a matrix whose $n$ rows are $d$-dimensional embeddings in $\mathfrak{st}^d_\kappa$, and let $W \in \mathbb{R}^{d\times e}$ denote a weight matrix. Let us first understand what a right matrix multiplication is in Euclidean space: the Euclidean right multiplication can be written row-wise as $(XW)_{i\bullet} = X_{i\bullet}W$. Hence each $d$-dimensional Euclidean embedding is modified independently by a right matrix multiplication. A natural adaptation of this operation to the $\kappa$-stereographic model yields the following definition.

###### Definition 1.

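Given a matrix $X \in \mathbb{R}^{n\times d}$ holding $\kappa$-stereographic embeddings in its rows and weights $W \in \mathbb{R}^{d\times e}$, the $\kappa$-right-matrix-multiplication is defined row-wise as

$$(X \otimes_\kappa W)_{i\bullet} = \exp^\kappa_0\left((\log^\kappa_0(X)W)_{i\bullet}\right) = \tan_\kappa\left(\alpha_i \tan_\kappa^{-1}(\|X_{i\bullet}\|)\right)\frac{(XW)_{i\bullet}}{\|(XW)_{i\bullet}\|} \tag{10}$$

where $\alpha_i = \frac{\|(XW)_{i\bullet}\|}{\|X_{i\bullet}\|}$ and $\exp^\kappa_0$ and $\log^\kappa_0$ denote the exponential and logarithmic map in the $\kappa$-stereographic model.

This definition is in perfect agreement with the hyperbolic scalar multiplication for $\kappa < 0$, which can also be written as $r \otimes_\kappa x = \exp^\kappa_0(r\log^\kappa_0(x))$. This operation is known to have desirable properties such as associativity (Ganea et al., 2018a).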
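Both forms of Eq. (10) can be compared numerically. The sketch below is illustrative, with hypothetical function names. As a convention of this sketch only, `tan_k(u) = tan(sqrt(kappa) u) / sqrt(kappa)` (with `tanh` for negative curvature and the identity at zero) absorbs the $|\kappa|^{\pm 1/2}$ factors of the exp and log maps at the origin; rows of `X` and of `X @ W` are assumed to be nonzero.

```python
import numpy as np


def tan_k(u, kappa):
    if kappa > 0:
        return np.tan(np.sqrt(kappa) * u) / np.sqrt(kappa)
    if kappa < 0:
        return np.tanh(np.sqrt(-kappa) * u) / np.sqrt(-kappa)
    return u


def arctan_k(u, kappa):
    if kappa > 0:
        return np.arctan(np.sqrt(kappa) * u) / np.sqrt(kappa)
    if kappa < 0:
        return np.arctanh(np.sqrt(-kappa) * u) / np.sqrt(-kappa)
    return u


def row_norms(M):
    return np.linalg.norm(M, axis=1, keepdims=True)


def exp0(V, kappa):
    """Row-wise exponential map at the origin."""
    n = row_norms(V)
    return tan_k(n, kappa) * V / n


def log0(X, kappa):
    """Row-wise logarithmic map at the origin."""
    n = row_norms(X)
    return arctan_k(n, kappa) * X / n


def kappa_right_matmul(X, W, kappa):
    """First form of Eq. (10): exp_0(log_0(X) W)."""
    return exp0(log0(X, kappa) @ W, kappa)


def kappa_right_matmul_closed_form(X, W, kappa):
    """Second form of Eq. (10), using alpha_i = ||(XW)_i|| / ||X_i||."""
    XW = X @ W
    alpha = row_norms(XW) / row_norms(X)
    return tan_k(alpha * arctan_k(row_norms(X), kappa), kappa) * XW / row_norms(XW)


X = np.array([[0.1, 0.2, -0.1],
              [0.3, -0.2, 0.05]])
W = np.array([[0.5, -1.0],
              [1.5, 0.3],
              [-0.7, 0.8]])
results = {k: kappa_right_matmul(X, W, k) for k in (-1.0, 0.0, 1.0)}
```

At $\kappa = 0$ both maps at the origin are the identity, so the operation reduces to the ordinary product `X @ W`.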

### 3.4 κ-Left-Matrix-Multiplication as a Midpoint Extension

For graph neural networks we also need the notion of message passing among neighboring nodes, i.e. an operation that combines / aggregates the respective embeddings together. In Euclidean space such an operation is given by the left multiplication $\hat{A}X$ of the embeddings matrix $X$ with the (preprocessed) adjacency $\hat{A}$. Let us consider this left multiplication. For $A \in \mathbb{R}^{n\times n}$, the matrix product is given row-wise by:

$$(AX)_{i\bullet} = A_{i1}X_{1\bullet} + \dots + A_{in}X_{n\bullet}$$

This means that the new representation of node $i$ is obtained by calculating the linear combination of all the other node embeddings, weighted by the $i$-th row of $A$. An adaptation to the $\kappa$-stereographic model hence requires a notion of weighted linear combination.

We propose such an operation in $\mathfrak{st}^d_\kappa$ by performing a $\kappa$-scaling of a gyromidpoint, whose definition is recalled below. Indeed, in Euclidean space, the weighted linear combination $\alpha x + \beta y$ can be re-written as $(\alpha+\beta)\, m_E(x, y; \alpha, \beta)$ with Euclidean midpoint $m_E(x, y; \alpha, \beta) := \frac{\alpha}{\alpha+\beta}x + \frac{\beta}{\alpha+\beta}y$. See fig. 5 for a geometric illustration. This motivates generalizing the above operation to $\mathfrak{st}^d_\kappa$ as follows.

###### Definition 2.

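Given a matrix $X \in \mathbb{R}^{n\times d}$ holding $\kappa$-stereographic embeddings in its rows and weights $A \in \mathbb{R}^{n\times n}$, the $\kappa$-left-matrix-multiplication is defined row-wise as

$$(A \boxtimes_\kappa X)_{i\bullet} := \Big(\sum_j A_{ij}\Big) \otimes_\kappa m_\kappa(X_{1\bullet}, \cdots, X_{n\bullet}; A_{i\bullet}). \tag{11}$$

The $\kappa$-scaling is motivated by the fact that $d_\kappa(0, r \otimes_\kappa x) = |r|\, d_\kappa(0, x)$ for all $r \in \mathbb{R}$, $x \in \mathfrak{st}^d_\kappa$. We recall that the gyromidpoint is defined when $\kappa \leq 0$ in the $\kappa$-stereographic model as (Ungar, 2010):

$$m_\kappa(x_1, \cdots, x_n; \alpha) = \frac{1}{2} \otimes_\kappa \left(\sum_{i=1}^n \frac{\alpha_i \lambda^\kappa_{x_i}}{\sum_{j=1}^n \alpha_j(\lambda^\kappa_{x_j} - 1)}\, x_i\right), \tag{12}$$

with $\lambda^\kappa_x = 2/(1+\kappa\|x\|^2)$. Whenever $\kappa > 0$, we have to further require the following condition:

$$\sum_j \alpha_j(\lambda^\kappa_{x_j} - 1) \neq 0. \tag{13}$$

For two points, one can calculate that $(\lambda^\kappa_x - 1) + (\lambda^\kappa_y - 1) = 0$ is equivalent to $\kappa\|x\|\|y\| = 1$, which holds in particular whenever $x = -y/(\kappa\|y\|^2)$. See fig. 4 for illustrations of gyromidpoints.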

Our operation satisfies interesting properties, proved in Appendix C.2.4:

###### Theorem 4 (Neuter element & κ-scalar-associativity).

We have $I_n \boxtimes_\kappa X = X$, and for a scalar $r$,

$$r \otimes_\kappa (A \boxtimes_\kappa X) = (rA) \boxtimes_\kappa X.$$
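The gyromidpoint of Eq. (12) and the left multiplication of Eq. (11) can be sketched row-wise as follows. This is an illustrative NumPy version with hypothetical names, restricted to nonnegative weights and to points for which condition (13) holds. As a convention of this sketch only, `tan_k(u) = tan(sqrt(kappa) u) / sqrt(kappa)` (with `tanh` for $\kappa < 0$ and the identity at $\kappa = 0$) replaces the $\tan_\kappa$ of Eq. (3) so that the Euclidean case needs no special treatment.

```python
import numpy as np


def tan_k(u, kappa):
    if kappa > 0:
        return np.tan(np.sqrt(kappa) * u) / np.sqrt(kappa)
    if kappa < 0:
        return np.tanh(np.sqrt(-kappa) * u) / np.sqrt(-kappa)
    return u


def arctan_k(u, kappa):
    if kappa > 0:
        return np.arctan(np.sqrt(kappa) * u) / np.sqrt(kappa)
    if kappa < 0:
        return np.arctanh(np.sqrt(-kappa) * u) / np.sqrt(-kappa)
    return u


def kappa_scale(s, X, kappa):
    """Row-wise kappa-scaling; s is a scalar or an (n, 1) column of scalars."""
    n = np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1e-15)
    return tan_k(s * arctan_k(n, kappa), kappa) / n * X


def gyromidpoints(A, X, kappa):
    """Row i holds m_kappa(X_1, ..., X_n; A_i) as in Eq. (12)."""
    lam = 2.0 / (1.0 + kappa * np.sum(X**2, axis=1))   # conformal factors
    combo = ((A * lam) @ X) / (A @ (lam - 1.0))[:, None]
    return kappa_scale(0.5, combo, kappa)


def kappa_left_matmul(A, X, kappa):
    """Eq. (11): scale each gyromidpoint by the corresponding row sum of A."""
    row_sums = A.sum(axis=1, keepdims=True)
    return kappa_scale(row_sums, gyromidpoints(A, X, kappa), kappa)


X = np.array([[0.1, 0.2],
              [-0.3, 0.1],
              [0.2, -0.25]])
A = np.array([[0.5, 0.3, 0.4],
              [0.1, 0.6, 0.3],
              [0.9, 0.2, 0.6]])
r = 0.7
euclidean_case = kappa_left_matmul(A, X, 0.0)   # should coincide with A @ X
```

The same helpers allow a numerical check of Theorem 4 for both curvature signs: multiplying by the identity returns `X`, and scaling the output by `r` matches multiplying `A` by `r` first.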

The matrix $A$. In most graph neural networks, the matrix $A$ is intended to be a preprocessed adjacency matrix, i.e. renormalized by the diagonal degree matrix $D_{ii} = \sum_k A_{ik}$. This normalization is often taken either (i) to the left: $D^{-1}A$, (ii) symmetric: $D^{-1/2}AD^{-1/2}$ or (iii) to the right: $AD^{-1}$. Note that the left normalization (i) makes the matrix right-stochastic, i.e. each of its rows sums to one, which is a property that is preserved by matrix product and exponentiation. For this case, we prove the following result in Appendix C.2.5:

###### Theorem 5 (κ-left-multiplication by right-stochastic matrices is intrinsic).

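If $A, B$ are right-stochastic, $\phi$ is an isometry of $\mathfrak{st}^d_\kappa$ and $X$, $Y$ are two matrices holding $\kappa$-stereographic embeddings:

$$\forall i,\quad d_\phi = d_\kappa\left((A \boxtimes_\kappa \phi(X))_{i\bullet}, (B \boxtimes_\kappa \phi(Y))_{i\bullet}\right) = d_\kappa\left((A \boxtimes_\kappa X)_{i\bullet}, (B \boxtimes_\kappa Y)_{i\bullet}\right). \tag{14}$$

The above result means that $A$ can easily be preprocessed so as to make its $\kappa$-left-multiplication intrinsic to the metric space ($\mathfrak{st}^d_\kappa$, $d_\kappa$). At this point, one could wonder: do there exist other ways to take weighted centroids on a Riemannian manifold? We comment on two plausible alternatives.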

Fréchet/Karcher means. They are obtained as $\arg\min_x \sum_i \alpha_i d_\kappa(x, x_i)^2$; note that although they are also intrinsic, they usually require solving an optimization problem which can be prohibitively expensive, especially when one requires gradients to flow through the solution. Moreover, for the space $\mathfrak{st}^d_\kappa$, it is known that the minimizer is unique if and only if $\kappa \leq 0$.

Tangential aggregations. The linear combination is here lifted to the tangent space by means of the exponential and logarithmic maps; such aggregations were in particular used in the recent works of Chami et al. (2019) and Liu et al. (2019).

###### Definition 3.

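The tangential aggregation of $x_1, \dots, x_n \in \mathfrak{st}^d_\kappa$ w.r.t. weights $\{\alpha_i\}_{1\leq i\leq n}$, at point $x \in \mathfrak{st}^d_\kappa$ (for $x_i \neq -x/(\kappa\|x\|^2)$ if $\kappa > 0$) is defined by:

$$\mathfrak{tg}^\kappa_x(x_1, \dots, x_n; \alpha_1, \dots, \alpha_n) := \exp^\kappa_x\left(\sum_{i=1}^n \alpha_i \log^\kappa_x(x_i)\right). \tag{15}$$

The theorem below states that for the $\kappa$-stereographic model, this operation is also intrinsic. We prove it in Appendix C.2.6.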

###### Theorem 6 (Tangential aggregation is intrinsic).

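For any isometry $\phi$ of $\mathfrak{st}^d_\kappa$, we have

$$\mathfrak{tg}_{\phi(x)}(\{\phi(x_i)\}; \{\alpha_i\}) = \phi\left(\mathfrak{tg}_x(\{x_i\}; \{\alpha_i\})\right). \tag{16}$$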
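The tangential aggregation of Eq. (15) is easy to prototype, and Theorem 6 can be illustrated with the simplest isometries of the model, namely rotations about the origin. The sketch below is illustrative only: function names are hypothetical, and `tan_k(u) = tan(sqrt(kappa) u) / sqrt(kappa)` is a convention of this sketch that absorbs the $|\kappa|^{\pm 1/2}$ factors of the exp and log maps.

```python
import numpy as np


def tan_k(u, kappa):
    if kappa > 0:
        return np.tan(np.sqrt(kappa) * u) / np.sqrt(kappa)
    if kappa < 0:
        return np.tanh(np.sqrt(-kappa) * u) / np.sqrt(-kappa)
    return u


def arctan_k(u, kappa):
    if kappa > 0:
        return np.arctan(np.sqrt(kappa) * u) / np.sqrt(kappa)
    if kappa < 0:
        return np.arctanh(np.sqrt(-kappa) * u) / np.sqrt(-kappa)
    return u


def kappa_add(x, y, kappa):
    xy, x2, y2 = np.dot(x, y), np.dot(x, x), np.dot(y, y)
    num = (1 - 2 * kappa * xy - kappa * y2) * x + (1 + kappa * x2) * y
    return num / (1 - 2 * kappa * xy + kappa**2 * x2 * y2)


def conformal_factor(x, kappa):
    return 2.0 / (1.0 + kappa * np.dot(x, x))


def expmap(x, v, kappa):
    n = np.linalg.norm(v)
    if n == 0.0:
        return x.copy()
    return kappa_add(x, tan_k(conformal_factor(x, kappa) * n / 2.0, kappa) * v / n, kappa)


def logmap(x, y, kappa):
    w = kappa_add(-x, y, kappa)
    n = np.linalg.norm(w)
    if n == 0.0:
        return np.zeros_like(x)
    return 2.0 / conformal_factor(x, kappa) * arctan_k(n, kappa) * w / n


def tangential_aggregation(x, points, alphas, kappa):
    """Eq. (15): exp_x(sum_i alpha_i log_x(x_i))."""
    v = sum(a * logmap(x, p, kappa) for a, p in zip(alphas, points))
    return expmap(x, v, kappa)


x = np.array([0.1, -0.2])
points = [np.array([0.3, 0.1]), np.array([-0.2, 0.25]), np.array([0.05, -0.4])]
alphas = [0.5, 0.3, 0.2]

theta = 0.9
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta), np.cos(theta)]])   # rotation: an isometry fixing 0
flat = tangential_aggregation(x, points, alphas, 0.0)
```

At $\kappa = 0$ the aggregation is simply $x + \sum_i \alpha_i (x_i - x)$, and for nonzero curvature, aggregating rotated inputs gives the rotated aggregate.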

### 3.5 Logits

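Finally, we need the logit and softmax layer, a necessity for any classification task. We here use the model of Ganea et al. (2018a), which was obtained in a principled manner for the case of negative curvature. Their derivation rests upon the closed-form formula for distance to a hyperbolic hyperplane. We naturally extend this formula to $\mathfrak{st}^d_\kappa$, hence also allowing for $\kappa > 0$, but leave for future work the adaptation of their theoretical analysis.

$$p(y = k \mid x) = \mathrm{S}\left(\frac{\|a_k\|_{p_k}}{\sqrt{|\kappa|}}\sin_\kappa^{-1}\left(\frac{2\sqrt{|\kappa|}\langle z_k, a_k\rangle}{(1+\kappa\|z_k\|^2)\|a_k\|}\right)\right), \tag{17}$$

where $a_k \in \mathcal{T}_0\mathfrak{st}^d_\kappa \cong \mathbb{R}^d$, $x \in \mathfrak{st}^d_\kappa$, $p_k \in \mathfrak{st}^d_\kappa$, $z_k = -p_k \oplus_\kappa x$ and $\mathrm{S}(\cdot)$ is the softmax function.
We refer the reader to Appendix D for further details and to fig. 6 for an illustration of eq. 17.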
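A possible reading of eq. 17 is sketched below. Several ingredients are assumptions of this sketch rather than statements of the text: $\sin_\kappa^{-1}$ is taken to be $\arcsin$ for $\kappa > 0$ and $\operatorname{arcsinh}$ for $\kappa < 0$, $z_k$ is taken to be $-p_k \oplus_\kappa x$ as in the hyperbolic model of Ganea et al. (2018a), $\|a_k\|_{p_k}$ is taken to be the Riemannian norm $\lambda^\kappa_{p_k}\|a_k\|$, and the $\kappa = 0$ branch returns the limit $4\langle x - p_k, a_k\rangle$ of the expression. All names are hypothetical.

```python
import numpy as np


def kappa_add(x, y, kappa):
    xy, x2, y2 = np.dot(x, y), np.dot(x, x), np.dot(y, y)
    num = (1 - 2 * kappa * xy - kappa * y2) * x + (1 + kappa * x2) * y
    return num / (1 - 2 * kappa * xy + kappa**2 * x2 * y2)


def kappa_logits(x, offsets, normals, kappa):
    """Softmax arguments of eq. 17, one per class (assumed reading, see lead-in)."""
    logits = []
    for p, a in zip(offsets, normals):
        z = kappa_add(-p, x, kappa)                    # z_k = -p_k (+) x
        norm_a = np.linalg.norm(a)
        lam_p = 2.0 / (1.0 + kappa * np.dot(p, p))     # conformal factor at p_k
        if kappa == 0.0:
            logits.append(4.0 * np.dot(z, a))          # limit of the expression
            continue
        root = np.sqrt(abs(kappa))
        arg = 2.0 * root * np.dot(z, a) / ((1.0 + kappa * np.dot(z, z)) * norm_a)
        inv_sin = np.arcsin(arg) if kappa > 0 else np.arcsinh(arg)
        logits.append(lam_p * norm_a / root * inv_sin)
    return np.array(logits)


def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()


x = np.array([0.2, -0.1])
offsets = [np.array([0.1, 0.3]), np.array([-0.2, 0.05]), np.array([0.0, -0.25])]
normals = [np.array([1.0, 0.5]), np.array([-0.3, 0.8]), np.array([0.6, -0.6])]

probs_hyperbolic = softmax(kappa_logits(x, offsets, normals, -1.0))
probs_spherical = softmax(kappa_logits(x, offsets, normals, 1.0))
flat_logits = kappa_logits(x, offsets, normals, 0.0)
```

Under these assumptions the logits for small $|\kappa|$ approach a scaled Euclidean logit $\langle x - p_k, a_k\rangle$, i.e. the usual affine classifier.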

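### 3.6 κ-GCN

We are now ready to introduce our $\kappa$-stereographic GCN (Kipf and Welling, 2017), denoted by $\kappa$-GCN. Assume we are given a graph with node level features $G = (V, A, X)$ where $X \in \mathbb{R}^{n\times d}$ with each row $X_{i\bullet} \in \mathbb{R}^d$ and adjacency $A \in \mathbb{R}^{n\times n}$. We first perform a preprocessing step by mapping the Euclidean features to $\mathfrak{st}^d_\kappa$ via a projection that rescales $X$ using $\|X\|_{\max}$, where $\|X\|_{\max}$ denotes the maximal Euclidean norm among all stereographic embeddings in $X$. For $l \in \{0, \dots, L-2\}$, the $(l+1)$-th layer of $\kappa$-GCN is given by:

$$H^{(l+1)} = \sigma^{\otimes_\kappa}\left(\hat{A} \boxtimes_\kappa \left(H^{(l)} \otimes_\kappa W^{(l)}\right)\right), \tag{18}$$

where $H^{(0)} = X$, $\sigma^{\otimes_\kappa}$ is the Möbius version (Ganea et al., 2018a) of a pointwise non-linearity $\sigma$ and $\hat{A} = \tilde{D}^{-1/2}\tilde{A}\tilde{D}^{-1/2}$. The final layer is a $\kappa$-logit layer (Appendix D):

$$H^{(L)} = \mathrm{softmax}\left(\hat{A}\,\mathrm{logit}_\kappa\left(H^{(L-1)}, W^{(L-1)}\right)\right), \tag{19}$$

where $W^{(L-1)}$ contains the parameters $a_k$ and $p_k$ of the $\kappa$-logits layer. A very important property of $\kappa$-GCN is that its architecture recovers the Euclidean GCN when we let curvature go to zero:

$$\kappa\text{-GCN} \xrightarrow{\ \kappa \to 0\ } \text{GCN}.$$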
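To make the limit statement concrete, the sketch below assembles one hidden $\kappa$-GCN layer from row-wise versions of the right and left multiplications and compares it with a Euclidean GCN layer at $\kappa = 0$. It is a forward-pass illustration under assumptions: the Möbius non-linearity is taken as $\exp^\kappa_0 \circ \sigma \circ \log^\kappa_0$, `tan_k(u) = tan(sqrt(kappa) u) / sqrt(kappa)` is a convention of this sketch, the feature rescaling is a simple stand-in for the preprocessing step, and all names are hypothetical.

```python
import numpy as np


def tan_k(u, kappa):
    if kappa > 0:
        return np.tan(np.sqrt(kappa) * u) / np.sqrt(kappa)
    if kappa < 0:
        return np.tanh(np.sqrt(-kappa) * u) / np.sqrt(-kappa)
    return u


def arctan_k(u, kappa):
    if kappa > 0:
        return np.arctan(np.sqrt(kappa) * u) / np.sqrt(kappa)
    if kappa < 0:
        return np.arctanh(np.sqrt(-kappa) * u) / np.sqrt(-kappa)
    return u


def radial(X, profile):
    """Rescale each row x to profile(||x||) * x / ||x||; zero rows stay zero."""
    n = np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1e-15)
    return profile(n) / n * X


def exp0(V, kappa):
    return radial(V, lambda n: tan_k(n, kappa))


def log0(X, kappa):
    return radial(X, lambda n: arctan_k(n, kappa))


def kappa_scale(s, X, kappa):
    return radial(X, lambda n: tan_k(s * arctan_k(n, kappa), kappa))


def right_matmul(X, W, kappa):
    return exp0(log0(X, kappa) @ W, kappa)


def left_matmul(A, X, kappa):
    lam = 2.0 / (1.0 + kappa * np.sum(X**2, axis=1))
    combo = ((A * lam) @ X) / (A @ (lam - 1.0))[:, None]
    midpoints = kappa_scale(0.5, combo, kappa)
    return kappa_scale(A.sum(axis=1, keepdims=True), midpoints, kappa)


def relu(Z):
    return np.maximum(Z, 0.0)


def kappa_gcn_layer(A_hat, H, W, kappa, sigma=relu):
    """Hidden layer of Eq. (18) with sigma applied in the tangent space at 0."""
    Z = left_matmul(A_hat, right_matmul(H, W, kappa), kappa)
    return exp0(sigma(log0(Z, kappa)), kappa)


def euclidean_gcn_layer(A_hat, H, W, sigma=relu):
    return sigma(A_hat @ H @ W)


A = np.array([[0, 1, 1, 0], [1, 0, 1, 0], [1, 1, 0, 1], [0, 0, 1, 0]], dtype=float)
A_tilde = A + np.eye(4)
d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
A_hat = d_inv_sqrt[:, None] * A_tilde * d_inv_sqrt[None, :]

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))
H0 = X / (2.0 * np.linalg.norm(X, axis=1).max())   # keep all rows well inside the ball
W0 = rng.normal(size=(3, 2))

flat = kappa_gcn_layer(A_hat, H0, W0, 0.0)
nearly_flat = kappa_gcn_layer(A_hat, H0, W0, -1e-6)
hyperbolic = kappa_gcn_layer(A_hat, H0, W0, -1.0)
```

At exactly zero curvature every operation collapses to its Euclidean counterpart, and for a tiny negative curvature the output only moves by a correspondingly tiny amount.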

## 4 Experiments

We evaluate the architectures introduced in the previous sections on the tasks of node classification and minimizing embedding distortion for several synthetic as well as real datasets. We defer the training setup and model architecture choices to Appendix E.

Minimizing Distortion. Our first goal is to evaluate the graph embeddings learned by our GCN models on the representation task of fitting the graph metric in the embedding space. We aim to minimize the average distortion, defined similarly as in Gu et al. (2019) in terms of $d(x_i, x_j)$, the distance between the embeddings of nodes $i$ and $j$, and $d_G(i,j)$, their graph distance (shortest path length).

We create three synthetic datasets that best reflect the different geometries of interest: i) “Tree”: a balanced tree of depth 5 and branching factor 4 consisting of 1365 nodes and 1364 edges. ii) “Torus”: we sample points (nodes) from the (planar) torus, i.e. from the unit connected square; two nodes are connected by an edge if their toroidal distance (the warped distance) is smaller than a fixed threshold; this gives 1000 nodes and 30626 edges. iii) “Spherical Graph”: we sample points (nodes) from the sphere, connecting nodes if their distance is smaller than 0.2, leading to 1000 nodes and 17640 edges.
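The node and edge counts of the “Tree” dataset follow directly from its depth and branching factor, which the short sketch below reproduces. The construction and the `toroidal_distance` helper are illustrative assumptions (hypothetical names; the wrap-around distance on the unit square is one common way to define a toroidal distance), not the authors' data-generation code.

```python
import math


def balanced_tree_edges(branching, depth):
    """Return (number of nodes, edge list) of a balanced tree rooted at node 0."""
    edges = []
    frontier = [0]
    next_id = 1
    for _ in range(depth):
        new_frontier = []
        for parent in frontier:
            for _ in range(branching):
                edges.append((parent, next_id))
                new_frontier.append(next_id)
                next_id += 1
        frontier = new_frontier
    return next_id, edges


def toroidal_distance(p, q):
    """Euclidean distance on the unit square with opposite sides identified."""
    total = 0.0
    for a, b in zip(p, q):
        delta = abs(a - b)
        total += min(delta, 1.0 - delta) ** 2
    return math.sqrt(total)


num_nodes, tree_edges = balanced_tree_edges(branching=4, depth=5)
wrapped = toroidal_distance((0.05, 0.5), (0.95, 0.5))   # short path crosses the border
```

The node count is the geometric sum $1 + 4 + \dots + 4^5 = (4^6 - 1)/3$, and a tree always has one edge fewer than nodes.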

For the GCN models, we use 1-hot initial node features. We use two GCN layers with dimensions 16 and 10. The non-Euclidean models do not use additional non-linearities between layers. All Euclidean parameters are updated using the ADAM optimizer with learning rate 0.01. Curvatures are learned using (stochastic) gradient descent and learning rate of 0.0001. All models are trained for 10000 epochs and we report the minimal achieved distortion.

Distortion results. The obtained distortion scores shown in table 1 reveal the benefit of our models. The best performing architecture is the one that matches the underlying geometry of the graph.

### 4.1 Node Classification

We consider the popular node classification datasets Citeseer (Sen et al., 2008), Cora-ML (McCallum et al., 2000) and Pubmed (Namata et al., 2012). Node labels correspond to the particular subfield the published document is associated with. Dataset statistics and splitting details are deferred to Appendix E due to the lack of space. We compare against the Euclidean model (Kipf and Welling, 2017) and the recently proposed hyperbolic variant (Chami et al., 2019).

Curvature Estimations of Datasets. To understand how far the real graphs of the above datasets are from the Euclidean geometry, we first estimate the graph curvature of the four studied datasets using the deviation from the Parallelogram Law (Gu et al., 2019) as detailed in Appendix F. Curvature histograms are shown in fig. 7. It can be noticed that the datasets are mostly non-Euclidean, thus offering a good motivation to apply our constant-curvature GCN architectures.

Training Details. We trained the baseline models in the same setting as done in Chami et al. (2019). Namely, for GCN we use one hidden layer of size 16, dropout on the embeddings and the adjacency, as well as $L^2$-regularization for the weights of the first layer. We used ReLU as the non-linear activation function.

For the non-Euclidean architectures, we used a combination of dropout and dropconnect as reported in Chami et al. (2019), as well as $L^2$-regularization for the first layer. All models have the same number of parameters and for fairness are compared in the same setting, without attention. We use one hidden layer of dimension 16. For the product models we consider two-component spaces (e.g. $\mathbb{H}^8 \times \mathbb{S}^8$) and we split the embedding space into equal dimensions of size 8. We also distribute the input features equally among the components. Non-Euclidean models use the Möbius version of ReLU as activation function. Euclidean parameters use a learning rate of 0.01 for all models using ADAM. The curvatures are learned using gradient descent with a learning rate of 0.01. We show the values of the learned curvatures in Appendix E. We use early stopping: we first train for a maximum of 2000 epochs, then we check every 200 epochs for improvement in the validation cross entropy loss; if none is observed, we stop.

Node classification results. These are shown in table 2. It can be seen that our models are competitive with the Euclidean GCN considered and outperform Chami et al. (2019) on Citeseer and Cora, showcasing the benefit of our proposed architecture.

## 5 Conclusion

In this paper, we introduced a natural extension of graph convolutional networks to the stereographic models of both positive and negative curvatures in a unified manner. We show how this choice of models permits a smooth interpolation between positive and negative curvature, allowing the curvature of the model to be trained independently of an initial sign choice. We hope that our models will open new exciting directions into non-Euclidean graph neural networks.

## 6 Acknowledgements

We thank Prof. Thomas Hofmann, Andreas Bloch, Calin Cruceru and Ondrej Skopek for useful discussions and anonymous reviewers for suggestions.
Gary Bécigneul is funded by the Max Planck ETH Center for Learning Systems.

## Appendix A GCN - A Brief Survey

### A.1 Convolutional Neural Networks on Graphs

One of the pioneering works on neural networks in non-Euclidean domains was done by Defferrard et al. (2016). Their idea was to extend convolutional neural networks to graphs using tools from graph signal processing.

Given a graph $G = (V, A)$, where $A$ is the adjacency matrix and $V$ is a set of $n$ nodes, we define a signal on the nodes of a graph to be a vector $x \in \mathbb{R}^n$ where $x_i$ is the value of the signal at node $i$. Consider the diagonalization of the symmetrized graph Laplacian $\tilde{L} = U\Lambda U^T$, where $\Lambda = \mathrm{diag}(\lambda_1, \dots, \lambda_n)$. The eigenbasis $U$ allows us to define the graph Fourier transform $\hat{x} = U^T x \in \mathbb{R}^n$.
In order to define a convolution for graphs, we shift from the vertex domain to the Fourier domain:

$$x \star_G y = U\left((U^T x) \odot (U^T y)\right)$$

Note that $\hat{x} = U^T x$ and $\hat{y} = U^T y$ are the graph Fourier representations and we use the element-wise product $\odot$ since convolutions become products in the Fourier domain. The left multiplication with $U$ maps the Fourier representation back to a vertex representation.
As a consequence, a signal $x$ filtered by $g_\theta$ becomes $y = U g_\theta(\Lambda) U^T x$, where $g_\theta(\Lambda) = \mathrm{diag}(\theta)$ with $\theta \in \mathbb{R}^n$ constitutes a filter with all parameters free to vary. In order to avoid the resulting complexity $\mathcal{O}(n)$, Defferrard et al. (2016) replace the non-parametric filter by a polynomial filter:

$$g_\theta(\Lambda) = \sum_{k=0}^{K-1} \theta_k \Lambda^k$$

where $\theta \in \mathbb{R}^K$, resulting in a complexity $\mathcal{O}(K)$. Filtering a signal is unfortunately still expensive since $y = U g_\theta(\Lambda) U^T x$ requires the multiplication with the Fourier basis $U$, thus resulting in complexity $\mathcal{O}(n^2)$. As a consequence, Defferrard et al. (2016) circumvent this problem by choosing the Chebyshev polynomials $T_k$ as a polynomial basis, $g_\theta(\Lambda) = \sum_{k=0}^{K} \theta_k T_k(\tilde{\Lambda})$ where $\tilde{\Lambda} = \frac{2\Lambda}{\lambda_{\max}} - I$. As a consequence, the filter operation becomes $y = \sum_{k=0}^{K} \theta_k T_k(\hat{L})x$ where $\hat{L} = \frac{2\tilde{L}}{\lambda_{\max}} - I$. This led to a $K$-localized filter since it depended on the $K$-th power of the Laplacian. The recursive nature of these polynomials allows for an efficient filtering of complexity $\mathcal{O}(K|E|)$, thus leading to a computationally appealing definition of convolution for graphs. The model can also be built in an analogous way to CNNs, by stacking multiple convolutional layers, each layer followed by a non-linearity.
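The equivalence between the spectral definition and the recursive Chebyshev evaluation can be checked on a small graph. The sketch below is illustrative: it assumes the symmetrized Laplacian is $I - D^{-1/2}AD^{-1/2}$, uses dense matrices for simplicity (a sparse implementation is what yields the edge-linear cost), and all function names are hypothetical.

```python
import numpy as np
from numpy.polynomial import chebyshev


def normalized_laplacian(A):
    """I - D^{-1/2} A D^{-1/2} (assumed form of the symmetrized Laplacian)."""
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
    return np.eye(A.shape[0]) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]


def chebyshev_filter(L_hat, x, theta):
    """y = sum_k theta_k T_k(L_hat) x via T_k = 2 L_hat T_{k-1} - T_{k-2}."""
    t_prev = x
    y = theta[0] * t_prev
    if len(theta) == 1:
        return y
    t_curr = L_hat @ x
    y = y + theta[1] * t_curr
    for k in range(2, len(theta)):
        t_next = 2.0 * (L_hat @ t_curr) - t_prev
        y = y + theta[k] * t_next
        t_prev, t_curr = t_curr, t_next
    return y


def spectral_filter(L, x, theta):
    """Same filter computed through the eigendecomposition of L."""
    lam, U = np.linalg.eigh(L)
    lam_hat = 2.0 * lam / lam.max() - 1.0
    return U @ (chebyshev.chebval(lam_hat, theta) * (U.T @ x))


# Small graph: a 5-cycle with one chord.
A = np.zeros((5, 5))
for i, j in [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0), (0, 2)]:
    A[i, j] = A[j, i] = 1.0

L = normalized_laplacian(A)
lam_max = np.linalg.eigvalsh(L).max()
L_hat = 2.0 * L / lam_max - np.eye(5)

x = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
theta = np.array([0.3, -0.7, 0.2, 0.1])   # coefficients theta_0 .. theta_3
y_recursive = chebyshev_filter(L_hat, x, theta)
y_spectral = spectral_filter(L, x, theta)
```

The recursive version only needs matrix-vector products with the rescaled Laplacian, which is why no eigendecomposition is required in practice.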

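### A.2 Graph Convolutional Networks

Kipf and Welling (2017) extended the work of Defferrard et al. (2016) and inspired many follow-up architectures (Chen et al., 2018; Hamilton et al., 2017; Abu-El-Haija et al., 2018; Wu et al., 2019). The core idea of Kipf and Welling (2017) is to limit each filter to 1-hop neighbours by setting $K = 1$, leading to a convolution that is linear in the Laplacian $\hat{L}$:

$$g_\theta \star x = \theta_0 x + \theta_1 \hat{L}x$$

They further assume $\lambda_{\max} \approx 2$, resulting in the expression

$$g_\theta \star x = \theta_0 x - \theta_1 D^{-1/2}AD^{-1/2}x$$

To additionally alleviate overfitting, Kipf and Welling (2017) constrain the parameters as $\theta = \theta_0 = -\theta_1$, leading to the convolution formula

$$g_\theta \star x = \theta\left(I + D^{-1/2}AD^{-1/2}\right)x$$

Since $I + D^{-1/2}AD^{-1/2}$ has its eigenvalues in the range $[0, 2]$, they further employ a reparametrization trick to stop their model from suffering from numerical instabilities:

$$g_\theta \star x = \theta\tilde{D}^{-1/2}\tilde{A}\tilde{D}^{-1/2}x$$

where $\tilde{A} = A + I$ and $\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$.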
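The effect of this reparametrization on the spectrum can be observed on a toy graph. The sketch below is an illustration with hypothetical names: it compares the eigenvalues of the operator before the trick, assumed here to be $I + D^{-1/2}AD^{-1/2}$, with those of the renormalized operator built from $A + I$, and applies each operator ten times to a constant signal.

```python
import numpy as np


def sym_normalize(M):
    """D^{-1/2} M D^{-1/2} with D the diagonal matrix of row sums of M."""
    d_inv_sqrt = 1.0 / np.sqrt(M.sum(axis=1))
    return d_inv_sqrt[:, None] * M * d_inv_sqrt[None, :]


# Path graph on 4 nodes (bipartite, so the extreme eigenvalues are attained).
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
n = A.shape[0]

before = np.eye(n) + sym_normalize(A)       # operator before the trick
after = sym_normalize(A + np.eye(n))        # renormalized operator

eig_before = np.linalg.eigvalsh(before)
eig_after = np.linalg.eigvalsh(after)

# Repeated application, as happens when stacking layers without non-linearities.
x = np.ones(n)
growth_before = np.linalg.norm(np.linalg.matrix_power(before, 10) @ x)
growth_after = np.linalg.norm(np.linalg.matrix_power(after, 10) @ x)
```

An eigenvalue of 2 means that stacking layers can scale some signal components by $2^k$, whereas the renormalized operator has spectral radius 1 and keeps the signal norm bounded.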

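Rewriting the architecture for multiple features $X \in \mathbb{R}^{n\times d_1}$ and parameters $\Theta \in \mathbb{R}^{d_1\times d_2}$ instead of $x \in \mathbb{R}^n$ and $\theta \in \mathbb{R}$, gives

$$Z = \tilde{D}^{-1/2}\tilde{A}\tilde{D}^{-1/2}X\Theta \in \mathbb{R}^{n\times d_2}$$

The final model consists of multiple stacks of convolutions, interleaved by a non-linearity $\sigma$:

$$H^{(k+1)} = \sigma\left(\tilde{D}^{-1/2}\tilde{A}\tilde{D}^{-1/2}H^{(k)}\Theta^{(k)}\right)$$

where $H^{(0)} = X$ and $\Theta^{(k)} \in \mathbb{R}^{d_k\times d_{k+1}}$.

The final output $H^{(K)} \in \mathbb{R}^{n\times d_K}$ represents the embedding of each node $i$ as $h_i = H^{(K)}_{i\bullet} \in \mathbb{R}^{d_K}$ and can be used to perform node classification:

$$\hat{Y} = \mathrm{softmax}\left(\tilde{D}^{-1/2}\tilde{A}\tilde{D}^{-1/2}H^{(K)}W\right) \in \mathbb{R}^{n\times L}$$

where $W \in \mathbb{R}^{d_K\times L}$, with $L$ denoting the number of classes.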

In order to illustrate how embeddings of neighbouring nodes interact, it is easier to view the architecture on the node level. Denote by $\mathcal{N}(i)$ the neighbours of node $i$. One can write the embedding of node $i$ at layer $k+1$ as follows:

$$h^{(k+1)}_i = \sigma\left(\Theta^{(k)}\sum_{j\in\mathcal{N}(i)\cup\{i\}}\frac{h^{(k)}_j}{\sqrt{|\mathcal{N}(j)||\mathcal{N}(i)|}}\right)$$

Notice that there is no dependence of the weight matrices $\Theta^{(k)}$ on the node $i$; in fact the same parameters are shared across all nodes.
In order to obtain the new embedding $h^{(k+1)}_i$ of node $i$, we average over all embeddings of the neighbouring nodes. This Message Passing mechanism gives rise to a very broad class of graph neural networks (Kipf and Welling, 2017; Veličković et al., 2018; Hamilton et al., 2017; Gilmer et al., 2017; Chen et al., 2018; Klicpera et al., 2019; Abu-El-Haija et al., 2018).

To be more precise, GCN falls into the more general category of models of the form

$$\begin{aligned} z^{(k+1)}_i &= \mathrm{AGGREGATE}^{(k)}\left(\{h^{(k)}_j : j \in \mathcal{N}(i)\}; W^{(k)}\right) \\ h^{(k+1)}_i &= \mathrm{COMBINE}^{(k)}\left(h^{(k)}_i, z^{(k+1)}_i; V^{(k)}\right) \end{aligned}$$

Models of the above form are deemed Message Passing Graph Neural Networks and many choices for AGGREGATE and COMBINE have been suggested in the literature (Kipf and Welling, 2017; Hamilton et al., 2017; Chen et al., 2018).

## Appendix B Graph Embeddings in Non-Euclidean Geometries

In this section we will motivate non-Euclidean embeddings of graphs and show why the underlying geometry of the embedding space can be very beneficial for its representation. We first introduce a measure of how well a graph is represented by some embedding $f : V \to \mathcal{X}$:

###### Definition 4.

Given an embedding $f : V \to \mathcal{X}$ of a graph $G = (V, A)$ in some metric space $\mathcal{X}$, we call $f$ a $D$-embedding for $D \geq 1$ if there exists $r > 0$ such that

$$r \cdot d_G(i,j) \leq d_{\mathcal{X}}(f(i), f(j)) \leq D \cdot r \cdot d_G(i,j)$$

The infimum over all such $D$ is called the distortion of $f$.

The $r$ in the definition of distortion allows for scaling of all distances. Note further that a perfect embedding is achieved when $D = 1$.

### B.1 Trees and Hyperbolic Space

Trees are graphs that do not allow for a cycle, in other words there is no node $v \in V$ for which there exists a path starting from $v$ and returning back to $v$ without passing through any node twice. The number of nodes increases exponentially with the depth of the tree. This is a property that prohibits Euclidean space from representing a tree accurately. What intuitively happens is that “we run out of space”. Consider the trees depicted in fig. 1. Here the yellow nodes represent the roots of each tree. Notice how rapidly we struggle to find appropriate places for nodes in the embedding space because their number increases just too fast.

Moreover, graph distances get extremely distorted towards the leaves of the tree. Take for instance the green and the pink node. In graph distance they are very far apart as one has to travel up all the way to the root node and back to the border. In Euclidean space however, they are very closely embedded in an $\ell_2$ sense, hence introducing a big error in the embedding.

This problem can be very nicely illustrated by the following theorem:

###### Theorem 7.

Consider the tree $K_{1,3}$ (also called 3-star) consisting of a root node with three children. Then every embedding $\{x_1, \dots, x_4\}$ with $x_i \in \mathbb{R}^k$ achieves at least distortion $\frac{2}{\sqrt{3}}$ for any $k \in \mathbb{N}$.

###### Proof.

We will prove this statement by using a special case of the so-called Poincaré-type inequalities (Deza and Laurent, 1996):

For any $b_1, \dots, b_k \in \mathbb{R}$ with $\sum_{i=1}^{k} b_i = 0$ and points $x_1, \dots, x_k \in \mathbb{R}^n$ it holds that

$$\sum_{i,j=1}^{k} b_i b_j \|x_i - x_j\|^2 \leq 0$$
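A short derivation of why this inequality holds may be helpful. It assumes, as in the standard statement of the inequality, that the coefficients $b_i$ sum to zero; expanding the squared norms then gives:

```latex
\begin{aligned}
\sum_{i,j=1}^{k} b_i b_j \|x_i - x_j\|^2
&= \sum_{i,j=1}^{k} b_i b_j \left(\|x_i\|^2 + \|x_j\|^2 - 2\langle x_i, x_j\rangle\right) \\
&= 2\Big(\sum_{j=1}^{k} b_j\Big)\Big(\sum_{i=1}^{k} b_i \|x_i\|^2\Big) - 2\Big\|\sum_{i=1}^{k} b_i x_i\Big\|^2 \\
&= -2\Big\|\sum_{i=1}^{k} b_i x_i\Big\|^2 \leq 0
\end{aligned}
```

The first term vanishes because the coefficients sum to zero, which leaves a non-positive quantity.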

Consider now an embedding $x_1, \dots, x_4$ of the tree, where $x_1$ represents the root node. Choosing $b_1 = -3$ and $b_i = 1$ for $i \neq 1$ leads to the inequality

$$\|x_2 - x_3\|^2 + \|x_2 - x_4\|^2 + \|x_3 - x_4\|^2 \leq 3\|x_1 - x_2\|^2 + 3\|x_1 - x_3\|^2 + 3\|x_1 - x_4\|^2$$

The left-hand side of this inequality in terms of the graph distance is

$$d_G(2,3)^2 + d_G(2,4)^2 + d_G(3,4)^2 = 2^2 + 2^2 + 2^2 = 12$$

and the right-hand side is

$$3 \cdot d_G(1,2)^2 + 3 \cdot d_G(1,3)^2 + 3 \cdot d_G(1,4)^2 = 3 + 3 + 3 = 9$$

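For a $D$-embedding with scaling factor $r$, the left-hand side of the inequality is at least $12 r^2$ and the right-hand side is at most $9 D^2 r^2$, so $12 r^2 \leq 9 D^2 r^2$. As a result, the distortion is always lower-bounded by

$$D \geq \sqrt{\frac{12}{9}} = \frac{2}{\sqrt{3}}$$

∎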
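As a numerical sanity check, one can compute the distortion of a concrete Euclidean embedding of the 3-star. The sketch below places the root at the origin and the leaves on the unit circle at angles 120 degrees apart; this symmetric placement is an illustrative choice, and the helper `distortion` is a hypothetical name. Nodes are 0-indexed in the code, with node 0 as the root.

```python
import numpy as np

def distortion(d_graph, d_embed):
    """Largest over smallest ratio of embedded distance to graph distance."""
    iu = np.triu_indices_from(d_graph, k=1)
    ratios = d_embed[iu] / d_graph[iu]
    return ratios.max() / ratios.min()

# Shortest-path distances of the 3-star: root-leaf = 1, leaf-leaf = 2.
d_graph = np.array([[0, 1, 1, 1],
                    [1, 0, 2, 2],
                    [1, 2, 0, 2],
                    [1, 2, 2, 0]], dtype=float)

# Root at the origin, leaves equally spaced on the unit circle.
angles = 2 * np.pi * np.arange(3) / 3
leaves = np.stack([np.cos(angles), np.sin(angles)], axis=1)
points = np.vstack([np.zeros((1, 2)), leaves])

d_embed = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
star_distortion = distortion(d_graph, d_embed)
lower_bound = 2 / np.sqrt(3)
```

In this placement the leaf-to-leaf Euclidean distance is $\sqrt{3}$ while the graph distance is 2, so the computed distortion coincides with the value $2/\sqrt{3} \approx 1.155$ obtained from $12 \leq 9 D^2$.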

Euclidean space thus already fails to capture the geometric structure of a very simple tree. This problem can be remedied by replacing the underlying Euclidean space by hyperbolic space.

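Consider again the distance function in the Poincaré model, for simplicity with curvature $-1$:

$$d_P(x, y) = \cosh^{-1}\left(1 + 2\frac{\|x - y\|^2}{(1 - \|x\|^2)(1 - \|y\|^2)}\right)$$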

Assume that the tree is embedded in the same way as in fig. 1, just restricted to lie in the disk of radius $1$. Notice that as soon as points move closer to the boundary ($\|x\| \to 1$), the fraction explodes and the resulting distance goes to infinity. As a result, the further points are moved towards the border, the more their distance increases, exactly as nodes on different branches become more distant from each other the further down they are in the tree. We can express this geometric advantage in terms of distortion:

###### Theorem 8.

There exists an embedding $x_1, \dots, x_4$ in the Poincaré disk for $K_{1,3}$ achieving distortion $1 + \epsilon$ for $\epsilon > 0$ arbitrarily small.

###### Proof.

Since the Poincaré distance is invariant under Möbius translations, we can assume that $x_1 = 0$. Let us place the other nodes on a circle of radius $r$. Their distance to the root is now given as

$$d_P(x_i, 0) = \cosh^{-1}\left(1 + 2\frac{\|x_i\|^2}{1 - \|x_i\|^2}\right) = \cosh^{-1}\left(1 + 2\frac{r^2}{1 - r^2}\right) \tag{20}$$

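By invariance of the distance under centered rotations we can assume w.l.o.g. $x_2 = (r, 0)$. We further embed

• $x_3 = \left(r\cos\left(\tfrac{2\pi}{3}\right), r\sin\left(\tfrac{2\pi}{3}\right)\right)$

• $x_4 = \left(r\cos\left(\tfrac{4\pi}{3}\right), r\sin\left(\tfrac{4\pi}{3}\right)\right)$.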

This procedure gives:

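$$d_P(x_2, x_3) = \cosh^{-1}\left(1 + 2\frac{\left\|\left(\tfrac{3r}{2}, -\tfrac{\sqrt{3}}{2} r\right)\right\|^2}{(1 - r^2)^2}\right) = \cosh^{-1}\left(1 + 2\frac{3r^2}{(1 - r^2)^2}\right) \tag{21}$$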

If we let the points now move to the border of the disk we observe that

$$\frac{\cosh^{-1}\left(1 + 2\frac{3r^2}{(1 - r^2)^2}\right)}{\cosh^{-1}\left(1 + 2\frac{r^2}{1 - r^2}\right)} \xrightarrow{r \to 1} 2$$

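Since the corresponding ratio of graph distances is $d_G(2,3) / d_G(1,2) = 2$, this means in turn that we can achieve distortion $1 + \epsilon$ for $\epsilon > 0$ arbitrarily small. ∎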
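The limiting behaviour in the proof can be checked numerically. The sketch below embeds the 3-star in the Poincaré disk of curvature $-1$ with the root at the origin and the leaves on a circle of radius $r$, then evaluates the distortion for radii approaching 1. The function names and the particular radii are illustrative assumptions.

```python
import numpy as np

def poincare_distance(x, y):
    """Distance in the Poincare disk model with curvature -1."""
    sq_dist = np.sum((x - y) ** 2)
    denom = (1.0 - np.sum(x ** 2)) * (1.0 - np.sum(y ** 2))
    return np.arccosh(1.0 + 2.0 * sq_dist / denom)

def star_distortion(r):
    """Distortion of the 3-star: root at the origin, leaves on a circle of radius r."""
    root = np.zeros(2)
    angles = 2 * np.pi * np.arange(3) / 3
    leaves = r * np.stack([np.cos(angles), np.sin(angles)], axis=1)
    # By symmetry there are only two distinct distances to consider.
    d_root_leaf = poincare_distance(leaves[0], root)       # graph distance 1
    d_leaf_leaf = poincare_distance(leaves[0], leaves[1])  # graph distance 2
    ratios = np.array([d_root_leaf / 1.0, d_leaf_leaf / 2.0])
    return ratios.max() / ratios.min()

radii = [0.5, 0.9, 0.999, 0.999999]
distortions = [star_distortion(r) for r in radii]
```

The distortion decreases towards 1 as the leaves approach the boundary, although only logarithmically in $1 - r$, so radii very close to 1 are needed for a small $\epsilon$.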

The tree-likeness of hyperbolic space has been investigated on a deeper mathematical level. Sarkar (2011) shows that a statement similar to theorem 8 holds for all weighted or unweighted trees. The interested reader is referred to Hamann (2017) and Sarkar (2011) for a more in-depth treatment of the subject.

Cycles are the subclass of graphs that are not allowed in a tree. They consist of one path that reconnects the first and the last node. Again there is a very simple example of a cycle that hints at the limits Euclidean space incurs when trying to preserve the geometry of these objects (Matousek, 2013).

###### Theorem 9.

Consider the cycle $C_4$ of length four. Then any embedding $\{x_1, \dots, x_4\}$ where $x_i \in \mathbb{R}^k$ achieves at least distortion $\sqrt{2}$.

###### Proof.

Denote by $x_1, x_2, x_3, x_4$ the embeddings in Euclidean space, where $x_1, x_3$ and $x_2, x_4$ are the pairs without an edge. Again using the Poincaré-type inequality with $b_1 = b_3 = 1$ and $b_2 = b_4 = -1$ leads to the short diagonal theorem (Matousek, 2013):

 ||x1−x3||2+||