# Constant Curvature Graph Convolutional Networks

Gregor Bachmann & Gary Bécigneul
Department of Computer Science
ETH Zürich, Switzerland
{gregor.bachmann,gary.becigneul}@inf.ethz.ch
Octavian-Eugen Ganea
MIT Computer Science & Artificial Intelligence Lab
oct@mit.edu
Authors contributed equally.
###### Abstract

Interest has recently been rising in methods that represent data in non-Euclidean spaces, e.g. hyperbolic or spherical, which provide specific inductive biases useful for certain real-world data properties, e.g. scale-free, hierarchical or cyclical. However, the popular graph neural networks are currently limited to modeling data via Euclidean geometry and the associated vector space operations. Here, we bridge this gap by proposing mathematically grounded generalizations of graph convolutional networks (GCN) to (products of) constant curvature spaces. We do this by i) introducing a unified formalism that can interpolate smoothly between all geometries of constant curvature, and ii) leveraging gyro-barycentric coordinates that generalize the classic Euclidean concept of the center of mass. Our class of models smoothly recovers its Euclidean counterparts when the curvature goes to zero from either side. Empirically, we outperform Euclidean GCNs in the tasks of node classification and distortion minimization for symbolic data exhibiting non-Euclidean behavior, according to their discrete curvature.

## 1 Introduction

##### Graph Convolutional Networks.

The success of convolutional networks and deep learning for image data has inspired generalizations to graphs for which sharing parameters is consistent with the graph geometry. Bruna et al. (2014) and Henaff et al. (2015) pioneered spectral graph convolutional neural networks in the graph Fourier space, using localized spectral filters on graphs. However, in order to reduce the graph-dependency on the Laplacian eigenmodes, Defferrard et al. (2016) approximate the convolutional filters using Chebyshev polynomials, leveraging a result of Hammond et al. (2011). The resulting method (discussed in appendix A) is computationally efficient and superior in terms of accuracy and complexity. Further, Kipf and Welling (2017) simplify this approach by considering first-order approximations, obtaining high scalability. The proposed graph convolutional network (GCN) interpolates node embeddings via a symmetrically normalized adjacency matrix, and this weight sharing can be understood as an efficient diffusion-like regularizer. Recent works extend GCNs to achieve state-of-the-art results for link prediction (Zhang and Chen, 2018), graph classification (Hamilton et al., 2017; Xu et al., 2018) and node classification (Klicpera et al., 2019; Veličković et al., 2018).

##### Euclidean geometry in ML.

In machine learning (ML), data is most often represented in a Euclidean space for various reasons. First, some data is intrinsically Euclidean, such as positions in 3D space in classical mechanics. Second, intuition is easier in such spaces, as they possess an appealing vectorial structure allowing basic arithmetic and a rich theory of linear algebra. Finally, a lot of quantities of interest such as distances and inner-products are known in closed-form formulae and can be computed very efficiently on the existing hardware. These operations are the basic building blocks for most of today’s popular machine learning models. Thus, the powerful simplicity and efficiency of Euclidean geometry has led to numerous methods achieving state-of-the-art on tasks as diverse as machine translation (Bahdanau et al., 2014; Vaswani et al., 2017), speech recognition (Graves et al., 2013), image classification (He et al., 2016) or recommender systems (He et al., 2017).

##### Riemannian ML.

In spite of this success, certain types of data (e.g. hierarchical, scale-free or spherical data) have been shown to be better represented by non-Euclidean geometries (Defferrard et al., 2019; Bronstein et al., 2017; Nickel and Kiela, 2017; Gu et al., 2019), leading in particular to the rich theories of manifold learning (Roweis and Saul, 2000; Tenenbaum et al., 2000) and information geometry (Amari and Nagaoka, 2007). The prevailing mathematical framework for manipulating non-Euclidean geometries is Riemannian geometry (Spivak, 1979). Although its theory leads to many strong and elegant results, some of its basic quantities such as the distance function are in general not available in closed form, which can be prohibitive for many computational methods.

##### Representational Advantages of Geometries of Constant Curvature.

An interesting trade-off between general Riemannian manifolds and the Euclidean space is given by manifolds of constant sectional curvature. They define together what are called hyperbolic (negative curvature), elliptic (positive curvature) and Euclidean (zero curvature) geometries. As discussed below and in appendix B, Euclidean spaces have limitations and suffer from large distortion when embedding certain types of data such as trees, e.g. fig. 1. In these cases, the hyperbolic and spherical spaces have representational advantages providing a better inductive bias for the respective data.

The hyperbolic space can be intuitively understood as a continuous tree: the volume of a ball grows exponentially with its radius, similarly as how the number of nodes in a binary tree grows exponentially with its depth. Its tree-likeness properties have long been studied mathematically (Gromov, 1987; Hamann, 2017; Ungar, 2008) and it was proven to better embed complex networks (Krioukov et al., 2010), scale-free graphs and hierarchical data compared to the Euclidean geometry (Cho et al., 2019; Sala et al., 2018; Ganea et al., 2018b; Gu et al., 2019; Nickel and Kiela, 2018, 2017; Tifrea et al., 2019). Several important tools or methods found their hyperbolic counterparts, such as variational autoencoders (Mathieu et al., 2019; Ovinnikov, 2019), attention mechanisms (Gulcehre et al., 2018), matrix multiplications, recurrent units and multinomial logistic regression (Ganea et al., 2018a).

Similarly, spherical geometry provides benefits for modeling spherical or cyclical data (Defferrard et al., 2019; Matousek, 2013; Davidson et al., 2018; Xu and Durrett, 2018; Gu et al., 2019; Grattarola et al., 2018; Wilson et al., 2014).

##### Computational Efficiency of Constant Curvature Spaces (CCS).

CCS are some of the few Riemannian manifolds to possess closed-form formulae for geometric quantities of interest in computational methods, i.e. distance, geodesics, exponential map, parallel transport and their gradients. We also leverage here the closed expressions for weighted centroids.

##### “Linear Algebra” of CCS: Gyrovector Spaces.

In order to study the geometry of constant negative curvature in analogy with the Euclidean geometry, Ungar (1999, 2005, 2008, 2016) proposed the elegant non-associative algebraic formalism of gyrovector spaces. Recently, Ganea et al. (2018a) have linked this framework to the Riemannian geometry of the space, also generalizing the building blocks for non-Euclidean deep learning models operating with hyperbolic data representations.

However, it remains unclear how to extend in a principled manner the connection between Riemannian geometry and gyrovector space operations for spaces of constant positive curvature (spherical). By leveraging Euler’s formula and complex analysis, we present to our knowledge the first unified gyro framework that smoothly interpolates between geometries of constant curvatures irrespective of their signs. This is possible when working with the Poincaré ball and stereographic spherical projection models of respectively hyperbolic and spherical spaces.

How should one adapt graph neural networks to non-flat geometries of constant curvature?

In this work, we propose constant curvature GCNs to model non-Euclidean data. Node embeddings lie in spaces of constant curvature or product of those instead of a Euclidean space, thus leveraging both the representational power of these geometries and the effectiveness of GCNs.

Concurrent to our work, Chami et al. (2019); Liu et al. (2019) propose hyperbolic graph neural networks using tangent space aggregation.

## 2 The Geometry of Constant Curvature Spaces

##### Riemannian Geometry.

A manifold $\mathcal{M}$ of dimension $d$ is a generalization to higher dimensions of the notion of surface, and is a space that locally looks like $\mathbb{R}^d$. To each point $x \in \mathcal{M}$ can be associated a tangent space $T_x\mathcal{M}$, which is a vector space of dimension $d$ that can be understood as a first order approximation of $\mathcal{M}$ around $x$. A Riemannian metric $g$ is given by an inner-product $g_x(\cdot,\cdot)$ at each tangent space $T_x\mathcal{M}$, with $g_x$ varying smoothly with $x$. A given $g$ defines the geometry of $\mathcal{M}$, because it can be used to define the distance between $x$ and $y$ as the infimum of the lengths of smooth paths $\gamma : [0,1] \to \mathcal{M}$ from $x$ to $y$, where the length is defined as $\ell(\gamma) := \int_0^1 \sqrt{g_{\gamma(t)}(\dot{\gamma}(t), \dot{\gamma}(t))}\, dt$. Under certain assumptions, a given $g$ also defines a curvature at each point.

##### Unifying all curvatures κ.

There exist several models of constant positive and of constant negative curvature. For positive curvature, we choose the stereographic projection of the sphere, while for negative curvature we choose the Poincaré model, which is the stereographic projection of the Lorentz model. As explained below, this choice allows us to generalize the gyrovector space framework and unify spaces of both positive and negative curvature into a single model which we call the $\kappa$-stereographic model.

##### The κ-stereographic model.

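For a curvature $\kappa \in \mathbb{R}$ and a dimension $d \geq 2$, it is defined as $\mathfrak{st}^d_\kappa = \{x \in \mathbb{R}^d \mid -\kappa\|x\|^2 < 1\}$ equipped with its Riemannian metric $g^\kappa_x = \frac{4}{(1+\kappa\|x\|^2)^2} I =: (\lambda^\kappa_x)^2 I$. Note in particular that when $\kappa \geq 0$, $\mathfrak{st}^d_\kappa$ is $\mathbb{R}^d$, while when $\kappa < 0$ it is the open ball of radius $1/\sqrt{-\kappa}$.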

##### Gyrovector spaces & Riemannian geometry.

As discussed in section 1, the gyrovector space formalism is used to generalize vector spaces to the Poincaré model of hyperbolic geometry (Ungar, 2005, 2008). In addition, important quantities from Riemannian geometry can be rewritten in terms of the Möbius vector addition and scalar-vector multiplication (Ganea et al., 2018a). We here extend gyrovector spaces to the $\kappa$-stereographic model, i.e. allowing positive curvature.

For $\kappa > 0$ and any point $x \in \mathfrak{st}^d_\kappa$, we will denote by $\tilde{x}$ the unique point of the sphere of radius $\kappa^{-1/2}$ in $\mathbb{R}^{d+1}$ whose stereographic projection is $x$. As detailed in appendix C.2.2, it is given by

$$\tilde{x} := \left(\lambda^\kappa_x x,\; \kappa^{-1/2}(\lambda^\kappa_x - 1)\right). \tag{1}$$
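As a sanity check on Eq. (1), the following NumPy sketch lifts a point of the $\kappa$-stereographic model to $\mathbb{R}^{d+1}$ and verifies numerically that the result lies on the sphere of radius $\kappa^{-1/2}$. The function names are illustrative and not taken from the paper, and the back-projection `project_from_sphere` is an assumption obtained by inverting Eq. (1) algebraically.

```python
import numpy as np

def conformal_factor(x, kappa):
    """lambda_x^kappa = 2 / (1 + kappa * ||x||^2)."""
    return 2.0 / (1.0 + kappa * np.dot(x, x))

def lift_to_sphere(x, kappa):
    """Eq. (1): point of the radius kappa^{-1/2} sphere that projects to x (kappa > 0)."""
    lam = conformal_factor(x, kappa)
    return np.append(lam * x, (lam - 1.0) / np.sqrt(kappa))

def project_from_sphere(x_tilde, kappa):
    """Assumed inverse of Eq. (1): divide the first d coordinates by 1 + sqrt(kappa) * last."""
    return x_tilde[:-1] / (1.0 + np.sqrt(kappa) * x_tilde[-1])

kappa = 0.5
x = np.array([0.3, -1.2, 0.7])
x_tilde = lift_to_sphere(x, kappa)
radius = np.linalg.norm(x_tilde)          # should equal kappa ** -0.5
x_back = project_from_sphere(x_tilde, kappa)
```

The check works because $1 + \sqrt{\kappa}\,\kappa^{-1/2}(\lambda^\kappa_x - 1) = \lambda^\kappa_x$, so dividing the first $d$ coordinates by this quantity returns $x$.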

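For $x, y \in \mathfrak{st}^d_\kappa$, we define the $\kappa$-addition in the $\kappa$-stereographic model by:

$$x \oplus_\kappa y = \frac{(1 - 2\kappa x^T y - \kappa\|y\|^2)\,x + (1 + \kappa\|x\|^2)\,y}{1 - 2\kappa x^T y + \kappa^2\|x\|^2\|y\|^2} \in \mathfrak{st}^d_\kappa. \tag{2}$$

The $\kappa$-addition is defined in all cases except for spherical geometry with $x = y/(\kappa\|y\|^2)$, as stated by the following theorem, proved in Appendix C.2.1.

###### Theorem 1 (Definiteness of κ-addition).

We have $1 - 2\kappa x^T y + \kappa^2\|x\|^2\|y\|^2 = 0$ if and only if $\kappa > 0$ and $x = y/(\kappa\|y\|^2)$.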
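The $\kappa$-addition of Eq. (2) is a short rational expression, so it can be sketched directly in NumPy. The snippet below is an illustrative sketch (the function names are ours, not the paper's): it checks that $\kappa = 0$ gives ordinary vector addition, that the left-cancellation law $(-x) \oplus_\kappa (x \oplus_\kappa y) = y$ holds numerically for one negative and one positive curvature, and that the denominator vanishes in the degenerate case of Theorem 1.

```python
import numpy as np

def kappa_add_denominator(x, y, kappa):
    """Denominator of Eq. (2)."""
    return 1 - 2 * kappa * np.dot(x, y) + kappa**2 * np.dot(x, x) * np.dot(y, y)

def kappa_add(x, y, kappa):
    """kappa-addition of Eq. (2); undefined when the denominator is zero."""
    num = (1 - 2 * kappa * np.dot(x, y) - kappa * np.dot(y, y)) * x \
        + (1 + kappa * np.dot(x, x)) * y
    return num / kappa_add_denominator(x, y, kappa)

x = np.array([0.3, -0.2])
y = np.array([0.1, 0.4])

euclidean_sum = kappa_add(x, y, 0.0)                      # plain x + y
recovered = {k: kappa_add(-x, kappa_add(x, y, k), k)      # left cancellation
             for k in (-1.0, 0.5)}

# Degenerate pair of Theorem 1 (kappa > 0): x = y / (kappa * ||y||^2).
kappa = 0.5
x_bad = y / (kappa * np.dot(y, y))
degenerate_denominator = kappa_add_denominator(x_bad, y, kappa)
```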

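For $s \in \mathbb{R}$ and $x \in \mathfrak{st}^d_\kappa$ (and $|s \tan_\kappa^{-1}\|x\|| < \pi/2$ if $\kappa > 0$), the $\kappa$-scaling in the $\kappa$-stereographic model is given by:

$$s \otimes_\kappa x = \tan_\kappa\left(s \cdot \tan_\kappa^{-1}\|x\|\right)\frac{x}{\|x\|} \in \mathfrak{st}^d_\kappa, \tag{3}$$

where $\tan_\kappa$ equals $\kappa^{-1/2}\tan$ if $\kappa > 0$ and $(-\kappa)^{-1/2}\tanh$ if $\kappa < 0$. This formalism yields simple closed forms for various quantities including the distance function inherited from the Riemannian manifold ($\mathfrak{st}^d_\kappa$, $g^\kappa$), the exp and log maps, and geodesics, as shown by the following theorem.

###### Theorem 2 (Extending gyrovector spaces to positive curvature).

For $x, y \in \mathfrak{st}^d_\kappa$, $x \neq y$, $v \neq 0$ (and $x \neq -y/(\kappa\|y\|^2)$ if $\kappa > 0$), the distance function is given by (we write $-x \oplus y$ for $(-x) \oplus y$ and not $-(x \oplus y)$):

$$d_\kappa(x, y) = 2|\kappa|^{-1/2}\tan_\kappa^{-1}\|{-x} \oplus_\kappa y\|, \tag{4}$$

the unit-speed geodesic from $x$ to $y$ is unique and given by

$$\gamma_{x \to y}(t) = x \oplus_\kappa \left(t \otimes_\kappa (-x \oplus_\kappa y)\right), \tag{5}$$

and finally the exponential and logarithmic maps are described as:

$$\exp^\kappa_x(v) = x \oplus_\kappa \left(\tan_\kappa\left(|\kappa|^{1/2}\frac{\lambda^\kappa_x\|v\|}{2}\right)\frac{v}{\|v\|}\right); \qquad \log^\kappa_x(y) = \frac{2|\kappa|^{-1/2}}{\lambda^\kappa_x}\tan_\kappa^{-1}\|{-x} \oplus_\kappa y\|\,\frac{-x \oplus_\kappa y}{\|{-x} \oplus_\kappa y\|}. \tag{6}$$

Proof sketch:
The case $\kappa \leq 0$ was already taken care of by Ganea et al. (2018a). For $\kappa > 0$, we provide a detailed proof in Appendix C.2.2. The exponential map and unit-speed geodesics are obtained using the Theorema Egregium and the known formulas in the standard spherical model. The distance then follows from the formula $d_\kappa(x, y) = \|\log^\kappa_x(y)\|_x$, which holds in any Riemannian manifold.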
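The closed forms of Eqs. (3) to (6) translate almost line by line into code. The sketch below is a minimal NumPy illustration for $\kappa \neq 0$, not the authors' implementation; all function names are ours. It checks three consequences of the formulas: $\exp^\kappa_x(\log^\kappa_x(y)) = y$, the point $\gamma_{x \to y}(1/2)$ lies at half the distance from $x$, and $d_\kappa(x, y) = \lambda^\kappa_x \|\log^\kappa_x(y)\|$.

```python
import numpy as np

def kappa_add(x, y, k):
    """kappa-addition of Eq. (2)."""
    xy, x2, y2 = x @ y, x @ x, y @ y
    num = (1 - 2 * k * xy - k * y2) * x + (1 + k * x2) * y
    return num / (1 - 2 * k * xy + k**2 * x2 * y2)

def tan_k(u, k):
    """kappa^{-1/2} tan(u) if k > 0, (-kappa)^{-1/2} tanh(u) if k < 0."""
    return np.tan(u) / np.sqrt(k) if k > 0 else np.tanh(u) / np.sqrt(-k)

def arctan_k(r, k):
    """Inverse function of tan_k."""
    return np.arctan(np.sqrt(k) * r) if k > 0 else np.arctanh(np.sqrt(-k) * r)

def lam(x, k):
    """Conformal factor lambda_x^kappa."""
    return 2.0 / (1.0 + k * (x @ x))

def kappa_scale(s, x, k):
    """kappa-scaling of Eq. (3), for x != 0."""
    n = np.linalg.norm(x)
    return tan_k(s * arctan_k(n, k), k) * x / n

def dist(x, y, k):
    """Distance of Eq. (4)."""
    return 2 / np.sqrt(abs(k)) * arctan_k(np.linalg.norm(kappa_add(-x, y, k)), k)

def geodesic(x, y, t, k):
    """Geodesic of Eq. (5), with geodesic(x, y, 0) = x and geodesic(x, y, 1) = y."""
    return kappa_add(x, kappa_scale(t, kappa_add(-x, y, k), k), k)

def exp_map(x, v, k):
    """Exponential map of Eq. (6), for v != 0."""
    n = np.linalg.norm(v)
    return kappa_add(x, tan_k(np.sqrt(abs(k)) * lam(x, k) * n / 2, k) * v / n, k)

def log_map(x, y, k):
    """Logarithmic map of Eq. (6), for y != x."""
    w = kappa_add(-x, y, k)
    n = np.linalg.norm(w)
    return 2 / (np.sqrt(abs(k)) * lam(x, k)) * arctan_k(n, k) * w / n

x, y = np.array([0.3, -0.2]), np.array([0.1, 0.4])
for k in (-1.0, 0.7):
    midpoint = geodesic(x, y, 0.5, k)
    print(k, dist(x, y, k), dist(x, midpoint, k), exp_map(x, log_map(x, y, k), k))
```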

##### Around κ=0.

One notably observes that choosing $\kappa = 0$ yields all corresponding Euclidean quantities, which guarantees a continuous interpolation between $\kappa$-stereographic models of different curvatures, via Euler's formula $\tan(x) = -i\tanh(ix)$ where $i := \sqrt{-1}$. But is this interpolation differentiable with respect to $\kappa$? It is, as shown by the following theorem, proved in Appendix C.2.3.

###### Theorem 3 (Smoothness of $\mathfrak{st}^d_\kappa$ w.r.t. κ around 0).

Let $v \neq 0$ and $x, y \in \mathbb{R}^d$, such that $x \neq y$ (and $x \neq -y/(\kappa\|y\|^2)$ if $\kappa > 0$). Quantities in Eqs. (4,5,6) are well-defined for $|\kappa| < 1/\max(\|x\|^2, \|y\|^2)$, i.e. for $\kappa$ small enough. Their first order derivatives at $0^-$ and $0^+$ exist and are equal. Moreover, for the distance we have:

$$d_\kappa(x, y) = 2\|x - y\| - 2\kappa\left(\|x - y\|^3/3 + (x^T y)\|x - y\|\right) + O(\kappa^2). \tag{7}$$

Note that for $x^T y \geq 0$, this tells us that an infinitesimal change of curvature from zero to small negative, i.e. towards $0^-$, while keeping $x, y$ fixed, has the effect of increasing their distance.
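The first-order behaviour around $\kappa = 0$ can be checked numerically. The sketch below (illustrative names, NumPy only) estimates $\partial d_\kappa(x, y)/\partial\kappa$ at $\kappa = 0$ with a central finite difference across the two signs of curvature, and compares it with the coefficient $-2\left(\|x - y\|^3/3 + (x^T y)\|x - y\|\right)$ obtained by Taylor-expanding Eqs. (2) and (4) directly.

```python
import numpy as np

def kappa_add(x, y, k):
    """kappa-addition of Eq. (2)."""
    xy, x2, y2 = x @ y, x @ x, y @ y
    num = (1 - 2 * k * xy - k * y2) * x + (1 + k * x2) * y
    return num / (1 - 2 * k * xy + k**2 * x2 * y2)

def dist(x, y, k):
    """Distance of Eq. (4), with the Euclidean value 2 * ||x - y|| at k = 0."""
    r = np.linalg.norm(kappa_add(-x, y, k))
    if k > 0:
        return 2 * np.arctan(np.sqrt(k) * r) / np.sqrt(k)
    if k < 0:
        return 2 * np.arctanh(np.sqrt(-k) * r) / np.sqrt(-k)
    return 2 * r

x, y = np.array([0.3, -0.2]), np.array([0.1, 0.4])
eps = 1e-5

# Central difference straddling kappa = 0 (one hyperbolic, one spherical evaluation).
numeric_slope = (dist(x, y, eps) - dist(x, y, -eps)) / (2 * eps)

r0 = np.linalg.norm(x - y)
taylor_slope = -2 * (r0**3 / 3 + (x @ y) * r0)
print(numeric_slope, taylor_slope)
```

Agreement of the two numbers also illustrates the statement that the one-sided derivatives at $0^-$ and $0^+$ coincide, since the finite difference uses one point on each side.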

As a consequence, we have a unified formalism that interpolates smoothly between all three geometries of constant curvature.

## 3 κ-GCNs

We start by introducing the methods upon which we build. We present our models for spaces of constant sectional curvature, in the $\kappa$-stereographic model. However, the generalization to cartesian products of such spaces (Gu et al., 2019) follows naturally from these tools.

### 3.1 Graph Convolutional Networks

The problem of node classification on a graph has long been tackled with explicit regularization using the graph Laplacian (Weston et al., 2012). Namely, for a directed graph with adjacency matrix $A$, by adding the following term to the loss: $\sum_{i,j=1}^n A_{ij}\|f(x_i) - f(x_j)\|^2 = f(X)^T L f(X)$, where $L = D - A$ is the (unnormalized) graph Laplacian, $D_{ii} := \sum_k A_{ik}$ defines the (diagonal) degree matrix, $f$ contains the trainable parameters of the model and $X = (x_i^j)_{ij}$ the node features of the model. Such a regularization is expected to improve generalization if connected nodes in the graph tend to share labels; node $i$ with feature vector $x_i$ is represented as $f(x_i)$ in a Euclidean space.

With the aim to obtain more scalable models, Defferrard et al. (2016); Kipf and Welling (2017) propose to make this regularization implicit by incorporating it into what they call graph convolutional networks (GCN), which they motivate as a first order approximation of spectral graph convolutions, yielding the following scalable layer architecture (detailed in appendix A):

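$$H^{(t+1)} = \sigma\left(\tilde{D}^{-\frac{1}{2}}\tilde{A}\tilde{D}^{-\frac{1}{2}} H^{(t)} W^{(t)}\right), \tag{8}$$

where $\tilde{A} = A + I_n$ has added self-connections, $\tilde{D}_{ii} = \sum_k \tilde{A}_{ik}$ defines its diagonal degree matrix, $\sigma$ is a non-linearity such as sigmoid, $\tanh$ or $\mathrm{ReLU} = \max(0, \cdot)$, and $W^{(t)}$ and $H^{(t)}$ are the parameter and activation matrices of layer $t$ respectively, with $H^{(0)} = X$ the input feature matrix.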
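For concreteness, one Euclidean GCN layer of Eq. (8) can be sketched in a few lines of NumPy. This is a dense, illustrative version (the function name `gcn_layer` and the toy graph are ours); practical implementations typically use sparse adjacency matrices.

```python
import numpy as np

def gcn_layer(A, H, W, activation=np.tanh):
    """One layer of Eq. (8): activation(D~^{-1/2} A~ D~^{-1/2} H W)."""
    A_tilde = A + np.eye(A.shape[0])            # add self-connections
    degrees = A_tilde.sum(axis=1)               # diagonal of D~
    d_inv_sqrt = 1.0 / np.sqrt(degrees)
    A_hat = d_inv_sqrt[:, None] * A_tilde * d_inv_sqrt[None, :]
    return activation(A_hat @ H @ W)

# Toy example: two connected nodes, one-hot features, identity weights.
A = np.array([[0.0, 1.0],
              [1.0, 0.0]])
H0 = np.eye(2)
W0 = np.eye(2)
out = gcn_layer(A, H0, W0, activation=lambda z: z)   # identity activation
```

With the identity activation, each node of this two-node graph ends up with the average of both feature vectors, which makes the diffusion-like effect of the normalized adjacency visible.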

### 3.2 Tools for a κ-GCN

Learning a parametrized function that respects hyperbolic geometry has been studied in (Ganea et al., 2018a): neural layers and hyperbolic softmax. We generalize their definitions into the $\kappa$-stereographic model, unifying operations in positive and negative curvature. We explain how curvature introduces a fundamental difference between left and right matrix multiplications, depicting the Möbius matrix multiplication of (Ganea et al., 2018a) as a right multiplication, independent for each embedding. We then introduce a left multiplication by extension of gyromidpoints which ties the embeddings, which is essential for graph neural networks.

### 3.3 κ-Right-Matrix-Multiplication

Let $X \in \mathbb{R}^{n \times d}$ denote a matrix whose $n$ rows are $d$-dimensional embeddings in $\mathfrak{st}^d_\kappa$, and let $W \in \mathbb{R}^{d \times e}$ denote a weight matrix. Let us first understand what a right matrix multiplication is in Euclidean space: the Euclidean right multiplication can be written row-wise as $(XW)_{i\bullet} = X_{i\bullet}W$. Hence each $d$-dimensional Euclidean embedding is modified independently by a right matrix multiplication. A natural adaptation of this operation to the $\kappa$-stereographic model yields the following definition.

###### Definition 1.

Given a matrix $X \in \mathbb{R}^{n \times d}$ holding $\kappa$-stereographic embeddings in its rows and weights $W \in \mathbb{R}^{d \times e}$, the $\kappa$-right-matrix-multiplication is defined row-wise as

$$(X \otimes_\kappa W)_{i\bullet} = \exp^\kappa_0\left((\log^\kappa_0(X)W)_{i\bullet}\right) = \tan_\kappa\left(\frac{\|(XW)_{i\bullet}\|}{\|X_{i\bullet}\|}\tan_\kappa^{-1}(\|X_{i\bullet}\|)\right)\frac{(XW)_{i\bullet}}{\|(XW)_{i\bullet}\|}$$

where $\exp^\kappa_0$ and $\log^\kappa_0$ denote the exponential and logarithmic map in the $\kappa$-stereographic model.

This definition is in perfect agreement with the hyperbolic scalar multiplication for $\kappa < 0$, which can also be written as $r \otimes_\kappa x = \exp^\kappa_0(r \log^\kappa_0(x))$. This operation is known to have desirable properties such as associativity (Ganea et al., 2018a).
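Definition 1 gives two equivalent expressions for the $\kappa$-right-matrix-multiplication. The NumPy sketch below (illustrative names, $\kappa \neq 0$) implements both, using the specializations of Eq. (6) at the origin where $\lambda^\kappa_0 = 2$, and checks that they agree on a small example for one negative and one positive curvature.

```python
import numpy as np

def tan_k(u, k):
    """kappa^{-1/2} tan(u) if k > 0, (-kappa)^{-1/2} tanh(u) if k < 0."""
    return np.tan(u) / np.sqrt(k) if k > 0 else np.tanh(u) / np.sqrt(-k)

def arctan_k(r, k):
    """Inverse function of tan_k."""
    return np.arctan(np.sqrt(k) * r) if k > 0 else np.arctanh(np.sqrt(-k) * r)

def row_norms(X):
    """Row-wise Euclidean norms, floored to avoid division by zero."""
    return np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1e-15)

def exp0(V, k):
    """Row-wise exponential map at the origin (Eq. (6) with x = 0)."""
    n = row_norms(V)
    return tan_k(np.sqrt(abs(k)) * n, k) * V / n

def log0(X, k):
    """Row-wise logarithmic map at the origin (Eq. (6) with x = 0)."""
    n = row_norms(X)
    return arctan_k(n, k) / np.sqrt(abs(k)) * X / n

def right_matmul(X, W, k):
    """First expression of Definition 1: exp_0(log_0(X) W)."""
    return exp0(log0(X, k) @ W, k)

def right_matmul_closed_form(X, W, k):
    """Second expression of Definition 1."""
    XW = X @ W
    nX, nXW = row_norms(X), row_norms(XW)
    return tan_k(nXW / nX * arctan_k(nX, k), k) * XW / nXW

X = np.array([[0.3, -0.2],
              [0.1, 0.4]])
W = np.array([[0.5, -1.0, 0.2],
              [0.3, 0.8, -0.6]])
results = {k: right_matmul(X, W, k) for k in (-1.0, 0.7)}
```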

### 3.4 κ-Left-Matrix-Multiplication as a Midpoint Extension

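For graph neural networks we also need the notion of message passing among neighboring nodes, i.e. an operation that combines / aggregates the respective embeddings together. In Euclidean space such an operation is given by the left multiplication of the embeddings matrix with the (preprocessed) adjacency $\hat{A}$: $H^{(l+1)} = \sigma(\hat{A}Z^{(l)})$ where $Z^{(l)} = H^{(l)}W^{(l)}$. Let us consider this left multiplication. For $A \in \mathbb{R}^{n \times n}$, the matrix product is given row-wise by:

$$(AX)_{i\bullet} = A_{i1}X_{1\bullet} + \cdots + A_{in}X_{n\bullet}$$

This means that the new representation of node $i$ is obtained by calculating the linear combination of all the other node embeddings, weighted by the $i$-th row of $A$. An adaptation to the $\kappa$-stereographic model hence requires a notion of weighted linear combination. We propose such an operation in $\mathfrak{st}^d_\kappa$ by performing a $\kappa$-scaling of a gyromidpoint, whose definition is recalled below. Indeed, in Euclidean space, the weighted linear combination $\sum_i \alpha_i x_i$ can be re-written as $\left(\sum_i \alpha_i\right) m_{\mathbb{E}}(x_1, \cdots, x_n; \alpha_1, \cdots, \alpha_n)$ with Euclidean midpoint $m_{\mathbb{E}}(x_1, \cdots, x_n; \alpha_1, \cdots, \alpha_n) := \sum_i \frac{\alpha_i}{\sum_j \alpha_j} x_i$. This motivates generalizing the above operation to $\mathfrak{st}^d_\kappa$ as follows.

###### Definition 2.

Given a matrix $X \in \mathbb{R}^{n \times d}$ holding $\kappa$-stereographic embeddings in its rows and weights $A \in \mathbb{R}^{n \times n}$, the $\kappa$-left-matrix-multiplication is defined row-wise as

$$(A \boxtimes_\kappa X)_{i\bullet} := \left(\sum_j A_{ij}\right) \otimes_\kappa m_\kappa(X_{1\bullet}, \cdots, X_{n\bullet}; A_{i1}, \cdots, A_{in}). \tag{9}$$

The $\kappa$-scaling is motivated by the fact that $d_\kappa(0, r \otimes_\kappa x) = |r|\, d_\kappa(0, x)$ for all $r \in \mathbb{R}$, $x \in \mathfrak{st}^d_\kappa$. We recall that the gyromidpoint is defined when $\kappa \leq 0$ in the $\kappa$-stereographic model as (Ungar, 2010):

$$m_\kappa(x_1, \cdots, x_n; \alpha_1, \cdots, \alpha_n) = \frac{1}{2} \otimes_\kappa \left(\sum_{i=1}^n \frac{\alpha_i \lambda^\kappa_{x_i}}{\sum_{j=1}^n \alpha_j(\lambda^\kappa_{x_j} - 1)}\, x_i\right), \tag{10}$$

with $\lambda^\kappa_x = 2/(1 + \kappa\|x\|^2)$. Whenever $\kappa > 0$, we have to further require the following condition:

$$\sum_j \alpha_j(\lambda^\kappa_{x_j} - 1) \neq 0. \tag{11}$$

For two points, one can calculate that $(\lambda^\kappa_x - 1) + (\lambda^\kappa_y - 1) = 0$ is equivalent to $\kappa\|x\|\|y\| = 1$, which holds in particular whenever $x = -y/(\kappa\|y\|^2)$. See fig. 3 for illustrations of gyromidpoints.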
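Equations (9) and (10) can be vectorized over all rows of $A$ at once. The following NumPy sketch is one possible dense implementation for $\kappa \neq 0$, with illustrative function names; it assumes that condition (11) holds for every row and that the arguments of $\tan_\kappa$ stay in its domain.

```python
import numpy as np

def tan_k(u, k):
    """kappa^{-1/2} tan(u) if k > 0, (-kappa)^{-1/2} tanh(u) if k < 0."""
    return np.tan(u) / np.sqrt(k) if k > 0 else np.tanh(u) / np.sqrt(-k)

def arctan_k(r, k):
    """Inverse function of tan_k."""
    return np.arctan(np.sqrt(k) * r) if k > 0 else np.arctanh(np.sqrt(-k) * r)

def scale_rows(s, X, k):
    """Row-wise kappa-scaling of Eq. (3); s is a scalar or an array of shape (n, 1)."""
    n = np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1e-15)
    return tan_k(s * arctan_k(n, k), k) * X / n

def left_matmul(A, X, k):
    """kappa-left-matrix-multiplication of Eq. (9), using the gyromidpoint of Eq. (10)."""
    lam = 2.0 / (1.0 + k * np.sum(X**2, axis=1))   # conformal factors, shape (n,)
    denom = A @ (lam - 1.0)                        # sum_j A_ij (lam_j - 1), cf. Eq. (11)
    U = (A * lam) @ X / denom[:, None]             # column j of A is scaled by lam_j
    midpoints = scale_rows(0.5, U, k)              # gyromidpoints, one per row of A
    return scale_rows(A.sum(axis=1, keepdims=True), midpoints, k)

X = np.array([[0.3, -0.2],
              [0.1, 0.4]])
A = np.array([[0.5, 0.5],
              [0.2, 0.8]])
Y_hyperbolic = left_matmul(A, X, -1.0)
```

Two quick consistency checks on this sketch: multiplying by the identity matrix returns $X$ unchanged, and for $|\kappa|$ close to zero the result approaches the Euclidean product $AX$.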

Our operation satisfies interesting properties, proved in Appendix C.2.4:

###### Theorem 4 (Neuter element & κ-scalar-associativity).

We have $I_n \boxtimes_\kappa X = X$, and for $r \in \mathbb{R}$,

$$r \otimes_\kappa (A \boxtimes_\kappa X) = (rA) \boxtimes_\kappa X.$$

##### The matrix A.

In most graph neural networks, the matrix $A$ is intended to be a preprocessed adjacency matrix, i.e. renormalized by the diagonal degree matrix $D_{ii} = \sum_k A_{ik}$. This normalization is often taken either (i) to the left: $D^{-1}A$, (ii) symmetric: $D^{-\frac{1}{2}}AD^{-\frac{1}{2}}$ or (iii) to the right: $AD^{-1}$. Note that case (i) makes the matrix right-stochastic ($M$ is right-stochastic if for all $i$, $\sum_j M_{ij} = 1$), which is a property that is preserved by matrix product and exponentiation. For this case, we prove the following result in Appendix C.2.5:

###### Theorem 5 (κ-left-multiplication by right-stochastic matrices is intrinsic).

If $A, B$ are right-stochastic, $\phi$ is an isometry of $\mathfrak{st}^d_\kappa$ and $X$, $Y$ are two matrices holding $\kappa$-stereographic embeddings:

$$\forall i, \quad d_\kappa\left((A \boxtimes_\kappa \phi(X))_{i\bullet}, (B \boxtimes_\kappa \phi(Y))_{i\bullet}\right) = d_\kappa\left((A \boxtimes_\kappa X)_{i\bullet}, (B \boxtimes_\kappa Y)_{i\bullet}\right). \tag{12}$$

The above result means that $A$ can easily be preprocessed so as to make its $\kappa$-left-multiplication intrinsic to the metric space ($\mathfrak{st}^d_\kappa$, $d_\kappa$). At this point, one could wonder: do there exist other ways to take weighted centroids on a Riemannian manifold? We comment on two plausible alternatives.

##### Fréchet/Karcher means.

They are obtained as $\arg\min_x \sum_i \alpha_i d_\kappa(x, x_i)^2$; note that although they are also intrinsic, they usually require solving an optimization problem which can be prohibitively expensive, especially when one requires gradients to flow through the solution. Moreover, for the space $\mathfrak{st}^d_\kappa$, it is known that the minimizer is unique if and only if $\kappa \leq 0$.

##### Tangential aggregations.

They are defined by lifting the points to a chosen tangent space via the logarithmic map, performing a linear combination and then projecting back via the exponential map, and were in particular used in the recent works of Chami et al. (2019) and Liu et al. (2019). The theorem below states that for the $\kappa$-stereographic model, this operation is also intrinsic, i.e. commutes with isometries. We prove it in Appendix C.2.6.

###### Theorem 6 (Tangential aggregation is intrinsic).

Define the tangential aggregation of $x_1, \dots, x_n \in \mathfrak{st}^d_\kappa$ w.r.t. weights $\{\alpha_i\}_{1 \leq i \leq n}$, at point $x \in \mathfrak{st}^d_\kappa$ (for $x_i \neq -x/(\kappa\|x\|^2)$ if $\kappa > 0$) by:

$$\mathrm{tg}^\kappa_x(x_1, \dots, x_n; \alpha_1, \dots, \alpha_n) := \exp^\kappa_x\left(\sum_{i=1}^n \alpha_i \log^\kappa_x(x_i)\right). \tag{13}$$

For any isometry $\phi$ of $\mathfrak{st}^d_\kappa$, we have

$$\mathrm{tg}_{\phi(x)}(\{\phi(x_i)\}; \{\alpha_i\}) = \phi\left(\mathrm{tg}_x(\{x_i\}; \{\alpha_i\})\right). \tag{14}$$
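The tangential aggregation of Eq. (13) composes the exp and log maps of Eq. (6). The sketch below (illustrative names, $\kappa \neq 0$) implements it in NumPy and checks Eq. (14) numerically for one simple family of isometries, rotations about the origin; this is only a spot check on an easy case, not a substitute for the proof.

```python
import numpy as np

def kappa_add(x, y, k):
    """kappa-addition of Eq. (2)."""
    xy, x2, y2 = x @ y, x @ x, y @ y
    num = (1 - 2 * k * xy - k * y2) * x + (1 + k * x2) * y
    return num / (1 - 2 * k * xy + k**2 * x2 * y2)

def tan_k(u, k):
    return np.tan(u) / np.sqrt(k) if k > 0 else np.tanh(u) / np.sqrt(-k)

def arctan_k(r, k):
    return np.arctan(np.sqrt(k) * r) if k > 0 else np.arctanh(np.sqrt(-k) * r)

def lam(x, k):
    return 2.0 / (1.0 + k * (x @ x))

def exp_map(x, v, k):
    """Exponential map of Eq. (6), for v != 0."""
    n = np.linalg.norm(v)
    return kappa_add(x, tan_k(np.sqrt(abs(k)) * lam(x, k) * n / 2, k) * v / n, k)

def log_map(x, y, k):
    """Logarithmic map of Eq. (6), for y != x."""
    w = kappa_add(-x, y, k)
    n = np.linalg.norm(w)
    return 2 / (np.sqrt(abs(k)) * lam(x, k)) * arctan_k(n, k) * w / n

def tangential_aggregation(x, points, alphas, k):
    """Eq. (13): exp_x of the weighted sum of log_x(x_i)."""
    v = sum(a * log_map(x, p, k) for a, p in zip(alphas, points))
    return exp_map(x, v, k)

theta = 0.8
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])     # rotation: an isometry fixing 0
x = np.array([0.2, 0.1])
points = [np.array([0.3, -0.2]), np.array([0.1, 0.4]), np.array([-0.5, 0.2])]
alphas = [0.2, 0.5, 0.3]

aggregated = {k: tangential_aggregation(x, points, alphas, k) for k in (-1.0, 0.7)}
```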

### 3.5 Logits

Finally, we need the logit and softmax layer, a necessity for any classification task. We here use the model of (Ganea et al., 2018a), which was obtained in a principled manner for the case of negative curvature. We leave for future work the adaptation of their analysis to positive curvature, and in our experiments we use the formula straightforwardly adapted to positive curvature, which we detail in appendix D.

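### 3.6 κ-GCN

We are now ready to introduce our $\kappa$-stereographic GCN (Kipf and Welling, 2017), denoted by $\kappa$-GCN (to be pronounced “kappa” GCN; the Greek letter $\kappa$ being commonly used to denote sectional curvature). Assume we are given a graph with node level features $G = (V, A, X)$ where $X \in \mathbb{R}^{n \times d}$ with each row $X_{i\bullet} \in \mathfrak{st}^d_\kappa$ and adjacency $A \in \mathbb{R}^{n \times n}$. We first perform a preprocessing step by mapping the Euclidean features to $\mathfrak{st}^d_\kappa$ via the projection $X \mapsto X/(2\sqrt{|\kappa|}\,\|X\|_{\max})$, where $\|X\|_{\max}$ denotes the maximal Euclidean norm among all stereographic embeddings in $X$. For $l \in \{0, \dots, L-2\}$, the $(l+1)$-th layer of $\kappa$-GCN is given by:

$$H^{(l+1)} = \sigma^{\otimes_\kappa}\left(\hat{A} \boxtimes_\kappa \left(H^{(l)} \otimes_\kappa W^{(l)}\right)\right), \tag{15}$$

where $H^{(0)} = X$, $\sigma^{\otimes_\kappa}(x) := \exp^\kappa_0(\sigma(\log^\kappa_0(x)))$ is the Möbius version (Ganea et al., 2018a) of a pointwise non-linearity $\sigma$ and $\hat{A} = \tilde{D}^{-\frac{1}{2}}\tilde{A}\tilde{D}^{-\frac{1}{2}}$. The final layer is a $\kappa$-logit layer (appendix D):

$$H^{(L)} = \mathrm{softmax}\left(\hat{A}\, \mathrm{logit}_\kappa\left(H^{(L-1)}, W^{(L-1)}\right)\right), \tag{16}$$

where $W^{(L-1)}$ contains the parameters $a_k$ and $p_k$ of the $\kappa$-logits layer. A very important property of $\kappa$-GCN is that its architecture recovers the Euclidean GCN when we let curvature go to zero:

$$\kappa\text{-GCN} \xrightarrow{\;\kappa \to 0\;} \text{GCN}.$$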
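The hidden layer of Eq. (15) and its Euclidean limit can be illustrated with a small dense NumPy sketch. Everything below is an illustrative assumption rather than the authors' code: the function names are ours, $\kappa \neq 0$ is required, the features are taken to lie in $\mathfrak{st}^d_\kappa$ already, and the logit layer of Eq. (16) is omitted. The last lines compare the output for tiny $|\kappa|$ with the Euclidean layer $\sigma(\hat{A}HW)$.

```python
import numpy as np

def tan_k(u, k):
    return np.tan(u) / np.sqrt(k) if k > 0 else np.tanh(u) / np.sqrt(-k)

def arctan_k(r, k):
    return np.arctan(np.sqrt(k) * r) if k > 0 else np.arctanh(np.sqrt(-k) * r)

def row_norms(X):
    return np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1e-15)

def exp0(V, k):
    """Row-wise exponential map at the origin."""
    n = row_norms(V)
    return tan_k(np.sqrt(abs(k)) * n, k) * V / n

def log0(X, k):
    """Row-wise logarithmic map at the origin."""
    n = row_norms(X)
    return arctan_k(n, k) / np.sqrt(abs(k)) * X / n

def scale_rows(s, X, k):
    """Row-wise kappa-scaling of Eq. (3)."""
    n = row_norms(X)
    return tan_k(s * arctan_k(n, k), k) * X / n

def right_matmul(X, W, k):
    """kappa-right-matrix-multiplication (Definition 1)."""
    return exp0(log0(X, k) @ W, k)

def left_matmul(A, X, k):
    """kappa-left-matrix-multiplication (Eqs. (9) and (10))."""
    lam = 2.0 / (1.0 + k * np.sum(X**2, axis=1))
    U = (A * lam) @ X / (A @ (lam - 1.0))[:, None]
    return scale_rows(A.sum(axis=1, keepdims=True), scale_rows(0.5, U, k), k)

def normalized_adjacency(A):
    """A_hat = D~^{-1/2} (A + I) D~^{-1/2}."""
    A_tilde = A + np.eye(len(A))
    d = A_tilde.sum(axis=1)
    return A_tilde / np.sqrt(np.outer(d, d))

def kappa_gcn_layer(A_hat, H, W, k, sigma=np.tanh):
    """Hidden layer of Eq. (15) with the Moebius version of sigma."""
    Z = left_matmul(A_hat, right_matmul(H, W, k), k)
    return exp0(sigma(log0(Z, k)), k)

A = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])           # path graph on three nodes
H = np.array([[0.3, -0.2], [0.1, 0.4], [-0.2, 0.1]])
W = np.array([[0.5, -1.0], [0.3, 0.8]])
A_hat = normalized_adjacency(A)

euclidean = np.tanh(A_hat @ H @ W)
near_flat = {k: kappa_gcn_layer(A_hat, H, W, k) for k in (-1e-6, 1e-6)}
```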

## 4 Experiments

We evaluate the architectures introduced in the previous sections on the tasks of node classification and minimizing embedding distortion for several synthetic as well as real datasets. We defer the training setup and model architecture choices to appendix E.

##### Minimizing Distortion

Our first goal is to evaluate the graph embeddings learned by our GCN models on the representation task of fitting the graph metric in the embedding space. We desire to minimize the average distortion, defined similarly as in (Gu et al., 2019): $\frac{1}{n^2}\sum_{i,j}\left(\left(\frac{d(x_i, x_j)}{d_G(i,j)}\right)^2 - 1\right)^2$, where $d(x_i, x_j)$ is the distance between the embeddings of nodes $i$ and $j$, while $d_G(i,j)$ is their graph distance (shortest path length).

We create three synthetic datasets that best reflect the different geometries of interest: i) “Tree”: a balanced tree of depth 5 and branching factor 4 consisting of 1365 nodes and 1364 edges. ii) “Torus”: we sample points (nodes) from the (planar) torus, i.e. from the unit connected square; two nodes are connected by an edge iff their toroidal distance (the warped distance) is smaller than a fixed R = 0.01; this gives 1000 nodes and 30626 edges. iii) “Spherical Graph”: we sample points (nodes) from $\mathbb{S}^2$, connecting nodes iff their distance is smaller than 0.2, leading to 1000 nodes and 17640 edges.
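The size of the “Tree” dataset follows from its definition: a balanced tree of depth 5 and branching factor 4 has $1 + 4 + 16 + 64 + 256 + 1024 = 1365$ nodes and one edge fewer. The short Python sketch below builds such a tree and computes graph distances by breadth-first search; the function names and the node numbering are illustrative choices, not part of the paper.

```python
from collections import deque

def balanced_tree(branching, depth):
    """Return (number of nodes, edge list) of a balanced tree rooted at node 0."""
    edges, frontier, next_id = [], [0], 1
    for _ in range(depth):
        new_frontier = []
        for parent in frontier:
            for _ in range(branching):
                edges.append((parent, next_id))
                new_frontier.append(next_id)
                next_id += 1
        frontier = new_frontier
    return next_id, edges

def graph_distances_from(source, n_nodes, edges):
    """Shortest-path lengths d_G(source, .) via breadth-first search."""
    neighbours = [[] for _ in range(n_nodes)]
    for u, v in edges:
        neighbours[u].append(v)
        neighbours[v].append(u)
    dist = [-1] * n_nodes
    dist[source] = 0
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in neighbours[u]:
            if dist[v] < 0:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

n_nodes, edges = balanced_tree(branching=4, depth=5)
dist_from_root = graph_distances_from(0, n_nodes, edges)
dist_from_leaf = graph_distances_from(n_nodes - 1, n_nodes, edges)
```

The largest graph distance in this tree is 10, attained between leaves in different subtrees of the root.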

For the GCN models, we use 1-hot initial node features. We use two GCN layers with dimensions 16 and 10. The non-Euclidean models do not use additional non-linearities between layers. All Euclidean parameters are updated using the ADAM optimizer with learning rate 0.01. Curvatures are learned using (stochastic) gradient descent and learning rate of 0.0001. All models are trained for 10000 epochs and we report the minimal achieved distortion. The results shown in table 1 reveal the benefit of our models. One can notice that estimated curvatures correspond to our geometric knowledge about these specific datasets.

### 4.1 Node Classification

We consider the popular node classification datasets Citeseer (Sen et al., 2008), Cora-ML (McCallum et al., 2000) and Pubmed (Namata et al., 2012). Node labels correspond to the particular subfield the published document is associated with. Dataset statistics and splitting details are deferred to the appendix E due to the lack of space.

##### Curvature Estimations of Datasets

To understand how far the real graphs of the above datasets are from Euclidean geometry, we first estimate the graph curvature of the four studied datasets using the deviation from the Parallelogram Law (Gu et al., 2019), as detailed in appendix F. Curvature histograms are shown in fig. 4. The datasets are mostly non-Euclidean, which offers a good motivation to apply our constant-curvature GCN architectures.

##### Training Details

We trained the Euclidean models with the hyperparameters chosen as reported in (Klicpera et al., 2019). Namely, for GCN we use one hidden layer of size , dropout on the embeddings and the adjacency of rate as well as -regularization for the weights of the first layer with . Only for Cora-ML we had to adjust the regularization factor to to ensure similar scores as achieved in (Klicpera et al., 2019).

All Non-Euclidean models use biased-L2 regularization for their weights defined as with and . Euclidean models used L2 regularization with the same parameter . We used a combination of dropout and dropconnect for the non-Euclidean models. All models have the same number of parameters. We use 2 GCN layers, hidden dimension 64. Product models split hidden dimension into [32, 32] and also input features equally. Non-Euclidean models do not use additional non-linearities. Euclidean parameters use a learning rate of 0.01 for all models using ADAM. The curvatures are learned using gradient descent with a learning rate of 0.01. We show the values of the learned curvatures in appendix E. We use early stopping: we first train for a maximum of 2000 epochs, then we check every 200 epochs for improvement in the validation cross entropy loss; if that is not observed, we stop.

##### Node classification results.

These are shown in table 2. It can be seen that our models are competitive with the two Euclidean GCNs considered (with or without non-linearities), showcasing the benefit of our proposed architecture.

## 5 Conclusion

In this paper, we introduced a natural extension of graph convolutional networks to the stereographic models of both positive and negative curvatures in a unified manner. We showed how this choice of models permits smooth interpolation between positive and negative curvature, allowing the curvature of the model to be trained independently of an initial sign choice. We hope that our models will open new exciting directions into non-Euclidean graph neural networks.

## Acknowledgements

We thank Andreas Bloch, Calin Cruceru and Ondrej Skopek for useful discussions and anonymous reviewers for suggestions.

Gary Bécigneul is funded by the Max Planck ETH Center for Learning Systems.

## References

• S. Abu-El-Haija, A. Kapoor, B. Perozzi, and J. Lee (2018) N-GCN: Multi-scale Graph Convolution for Semi-supervised Node Classification. International Workshop on Mining and Learning with Graphs (MLG). Cited by: §A.2, §A.2.
• S. Amari and H. Nagaoka (2007) Methods of information geometry. Vol. 191, American Mathematical Soc.. Cited by: §1.
• D. Bahdanau, K. Cho, and Y. Bengio (2014) Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473. Cited by: §1.
• M. M. Bronstein, J. Bruna, Y. LeCun, A. Szlam, and P. Vandergheynst (2017) Geometric deep learning: going beyond euclidean data. IEEE Signal Processing Magazine 34 (4), pp. 18–42. Cited by: §1.
• J. Bruna, W. Zaremba, A. Szlam, and Y. Lecun (2014) Spectral networks and locally connected networks on graphs. In International Conference on Learning Representations (ICLR2014), CBLS, April 2014, pp. http–openreview. Cited by: §1.
• I. Chami, R. Ying, C. Ré, and J. Leskovec (2019) Hyperbolic graph convolutional neural networks. arXiv preprint arXiv:1910.12933. Cited by: §1, §3.4.
• J. Chen, T. Ma, and C. Xiao (2018) Fastgcn: fast learning with graph convolutional networks via importance sampling. ICLR. Cited by: §A.2, §A.2, §A.2.
• H. Cho, B. DeMeo, J. Peng, and B. Berger (2019) Large-margin classification in hyperbolic space. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 1832–1840. Cited by: §1.
• T. R. Davidson, L. Falorsi, N. De Cao, T. Kipf, and J. M. Tomczak (2018) Hyperspherical Variational Auto-Encoders. Uncertainty in Artificial Intelligence (UAI), 856- 865. Cited by: §1.
• M. Defferrard, X. Bresson, and P. Vandergheynst (2016) Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in neural information processing systems, pp. 3844–3852. Cited by: §A.1, §A.1, §A.2, §1, §3.1.
• M. Defferrard, N. Perraudin, T. Kacprzak, and R. Sgier (2019) DeepSphere: towards an equivariant graph-based spherical cnn. In ICLR Workshop on Representation Learning on Graphs and Manifolds, External Links: 1904.05146, Link Cited by: §1, §1.
• M. Deza and M. Laurent (1996) Geometry of Cuts and Metrics. Springer, Vol. 15. Cited by: §B.1.
• O. Ganea, G. Bécigneul, and T. Hofmann (2018a) Hyperbolic neural networks. In Advances in neural information processing systems, pp. 5345–5355. Cited by: Appendix D, §1, §1, §2, §2, §3.2, §3.3, §3.5, §3.6.
• O. Ganea, G. Becigneul, and T. Hofmann (2018b) Hyperbolic entailment cones for learning hierarchical embeddings. In International Conference on Machine Learning, pp. 1632–1641. Cited by: §1.
• J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl (2017) Neural Message Passing for Quantum Chemistry. Proceedings of the International Conference on Machine Learning. Cited by: §A.2.
• D. Grattarola, D. Zambon, C. Alippi, and L. Livi (2018) Learning graph embeddings on constant-curvature manifolds for change detection in graph streams. stat 1050, pp. 16. Cited by: §1.
• A. Graves, A. Mohamed, and G. Hinton (2013) Speech recognition with deep recurrent neural networks. In 2013 IEEE international conference on acoustics, speech and signal processing, pp. 6645–6649. Cited by: §1.
• M. Gromov (1987) Hyperbolic groups. In Essays in group theory, pp. 75–263. Cited by: §1.
• A. Gu, F. Sala, B. Gunel, and C. Ré (2019) Learning mixed-curvature representations in product spaces. Cited by: Appendix F, §1, §1, §1, §3, §4, §4.1.
• C. Gulcehre, M. Denil, M. Malinowski, A. Razavi, R. Pascanu, K. M. Hermann, P. Battaglia, V. Bapst, D. Raposo, A. Santoro, et al. (2018) Hyperbolic attention networks. Proceedings of the International Conference on Learning Representations. Cited by: §1.
• M. Hamann (2017) On the tree-likeness of hyperbolic spaces. Mathematical Proceedings of the Cambridge Philosophical Society, pp. 1–17. External Links: Document Cited by: §1.
• Hamann,Matthias (2017) On the tree-likeness of hyperbolic spaces. Mathematical Proceedings of the Cambridge Philo- sophical Society, pp. 117. Cited by: §B.1.
• W. L. Hamilton, R. Ying, and J. Leskovec (2017) Inductive Representation Learning on Large Graphs. In Advances in Neural Information Processing Systems. Cited by: §A.2, §A.2, §A.2, §1.
• D. K. Hammond, P. Vandergheynst, and R. Gribonval (2011) Wavelets on graphs via spectral graph theory. Applied and Computational Harmonic Analysis 30 (2), pp. 129–150. Cited by: §1.
• K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §1.
• X. He, L. Liao, H. Zhang, L. Nie, X. Hu, and T. Chua (2017) Neural collaborative filtering. In Proceedings of the 26th international conference on world wide web, pp. 173–182. Cited by: §1.
• M. Henaff, J. Bruna, and Y. LeCun (2015) Deep convolutional networks on graph-structured data. arXiv preprint arXiv:1506.05163. Cited by: §1.
• D. P. Kingma and J. Ba (2015) ADAM: A method for stochastic optimization. ICLR. Cited by: Appendix D.
• T. N. Kipf and M. Welling (2017) Semi-supervised classification with graph convolutional networks. International Conference on Learning Representations. Cited by: §A.2, §A.2, §A.2, §1, §3.1, §3.6.
• J. Klicpera, A. Bojchevski, and S. Günnemann (2019) Predict then propagate: graph neural networks meet personalized pagerank. International Conference on Learning Representations. Cited by: §A.2, Appendix E, §1, §4.1.
• D. Krioukov, F. Papadopoulos, M. Kitsak, A. Vahdat, and M. Boguñá (2010) Hyperbolic geometry of complex networks. Physical Review E 82 (3), pp. 036106. Cited by: §1.
• Q. Liu, M. Nickel, and D. Kiela (2019) Hyperbolic graph neural networks. arXiv preprint arXiv:1910.12892. Cited by: §1, §3.4.
• E. Mathieu, C. L. Lan, C. J. Maddison, R. Tomioka, and Y. W. Teh (2019) Hierarchical representations with Poincaré variational auto-encoders. arXiv preprint arXiv:1901.06033. Cited by: §1.
• J. Matousek (2013) Lecture notes on metric embeddings. Cited by: §B.1, §B.1, §1.
• A. McCallum, K. Nigam, J. Rennie, and K. Seymore (2000) Automating the construction of internet portals with machine learning. Information Retrieval 3 (2), pp. 127–163. Cited by: §4.1.
• G. Namata, B. London, L. Getoor, and B. Huang (2012) Query-driven Active Surveying for Collective Classification. International Workshop on Mining and Learning with Graphs (MLG). Cited by: §4.1.
• M. Nickel and D. Kiela (2018) Learning continuous hierarchies in the Lorentz model of hyperbolic geometry. In International Conference on Machine Learning. Cited by: §1.
• M. Nickel and D. Kiela (2017) Poincaré embeddings for learning hierarchical representations. In Advances in Neural Information Processing Systems, pp. 6341–6350. Cited by: §1, §1.
• I. Ovinnikov (2019) Poincaré Wasserstein autoencoder. arXiv preprint arXiv:1901.01427. Cited by: §1.
• S. T. Roweis and L. K. Saul (2000) Nonlinear dimensionality reduction by locally linear embedding. science 290 (5500), pp. 2323–2326. Cited by: §1.
• F. Sala, C. De Sa, A. Gu, and C. Re (2018) Representation tradeoffs for hyperbolic embeddings. In International Conference on Machine Learning, pp. 4457–4466. Cited by: §1.
• R. Sarkar (2011) Low distortion Delaunay embedding of trees in hyperbolic plane. In International Symposium on Graph Drawing, pp. 355–366. Springer. Cited by: §B.1.
• P. Sen, G. Namata, M. Bilgic, L. Getoor, B. Gallagher, and T. Eliassi-Rad (2008) Collective classification in network data. AI Magazine 29 (3), pp. 93–106. Cited by: §4.1.
• M. Spivak (1979) A comprehensive introduction to differential geometry. volume four. Cited by: §1.
• J. B. Tenenbaum, V. De Silva, and J. C. Langford (2000) A global geometric framework for nonlinear dimensionality reduction. science 290 (5500), pp. 2319–2323. Cited by: §1.
• A. Tifrea, G. Bécigneul, and O. Ganea (2019) Poincaré glove: hyperbolic word embeddings. Cited by: §1.
• A. A. Ungar (1999) The hyperbolic pythagorean theorem in the poincaré disc model of hyperbolic geometry. The American mathematical monthly 106 (8), pp. 759–763. Cited by: §1.
• A. A. Ungar (2005) Analytic hyperbolic geometry: mathematical foundations and applications. World Scientific. Cited by: §C.2.5, §1, §2.
• A. A. Ungar (2008) A gyrovector space approach to hyperbolic geometry. Synthesis Lectures on Mathematics and Statistics 1 (1), pp. 1–194. Cited by: §C.2.6, §C.2.6, §1, §1, §2, Figure 3.
• A. A. Ungar (2014) Analytic hyperbolic geometry in n dimensions: an introduction. CRC Press. Cited by: §C.2.6, §C.2.6.
• A. A. Ungar (2016) Novel tools to determine hyperbolic triangle centers. In Essays in Mathematics and its Applications, pp. 563–663. Cited by: §1.
• A. Ungar (2010) Barycentric Calculus in Euclidean and Hyperbolic Geometry. World Scientific, ISBN 9789814304931. Cited by: §3.4.
• A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008. Cited by: §1.
• P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio (2018) Graph attention networks. International Conference on Learning Representations. Cited by: §A.2, §1.
• J. Weston, F. Ratle, H. Mobahi, and R. Collobert (2012) Deep learning via semi-supervised embedding. In Neural Networks: Tricks of the Trade, pp. 639–655. Cited by: §3.1.
• R. C. Wilson, E. R. Hancock, E. Pekalska, and R. P. Duin (2014) Spherical and hyperbolic embeddings of data. IEEE transactions on pattern analysis and machine intelligence 36 (11), pp. 2255–2269. Cited by: §1.
• Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, and P. S. Yu (2019) A comprehensive survey on graph neural networks. arXiv preprint arXiv:1901.00596. Cited by: §A.2.
• J. Xu and G. Durrett (2018) Spherical latent spaces for stable variational autoencoders. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 4503–4513. Cited by: §1.
• K. Xu, W. Hu, J. Leskovec, and S. Jegelka (2018) How powerful are graph neural networks?. International Conference on Learning Representations. Cited by: §1.
• M. Zhang and Y. Chen (2018) Link prediction based on graph neural networks. In Advances in Neural Information Processing Systems. Cited by: §1.

## Appendix A GCN - A Brief Survey

### A.1 Convolutional Neural Networks on Graphs

One of the pioneering works on neural networks in non-Euclidean domains is due to Defferrard et al. (2016). Their idea was to extend convolutional neural networks to graphs using tools from graph signal processing.

Given a graph $G = (V, A)$, where $A$ is the adjacency matrix and $V = \{1, \dots, n\}$ is a set of nodes, we define a signal on the nodes of a graph to be a vector $x \in \mathbb{R}^n$ where $x_i$ is the value of the signal at node $i$. Consider the diagonalization of the symmetrized graph Laplacian $L = U \Lambda U^T$, where $\Lambda = \mathrm{diag}(\lambda_1, \dots, \lambda_n)$. The eigenbasis $U$ allows us to define the graph Fourier transform $\hat{x} = U^T x \in \mathbb{R}^n$.
In order to define a convolution for graphs, we shift from the vertex domain to the Fourier domain:

$$x \star_G y = U\left((U^T x) \odot (U^T y)\right)$$

Note that $U^T x$ and $U^T y$ are the graph Fourier representations, and we use the element-wise product $\odot$ since convolutions become products in the Fourier domain. The left multiplication with $U$ maps the Fourier representation back to a vertex representation.
As a consequence, a signal $x$ filtered by $g_\theta$ becomes $y = U g_\theta(\Lambda) U^T x$, where $g_\theta(\Lambda) = \mathrm{diag}(\theta)$ with $\theta \in \mathbb{R}^n$ constitutes a filter with all parameters free to vary. In order to avoid the resulting complexity $\mathcal{O}(n)$, Defferrard et al. (2016) replace the non-parametric filter by a polynomial filter:

$$g_\theta(\Lambda) = \sum_{k=0}^{K-1} \theta_k \Lambda^k$$

where $\theta \in \mathbb{R}^K$, resulting in a complexity $\mathcal{O}(K)$. Filtering a signal is unfortunately still expensive since $y = U g_\theta(\Lambda) U^T x$ requires the multiplication with the Fourier basis $U$, thus resulting in complexity $\mathcal{O}(n^2)$. Defferrard et al. (2016) circumvent this problem by choosing the Chebyshev polynomials $T_k$ as a polynomial basis, $g_\theta(\Lambda) = \sum_{k=0}^{K} \theta_k T_k(\hat{\Lambda})$, where $\hat{\Lambda} = \frac{2\Lambda}{\lambda_{\max}} - I$. The filter operation then becomes $y = \sum_{k=0}^{K} \theta_k T_k(\hat{L}) x$, where $\hat{L} = \frac{2L}{\lambda_{\max}} - I$. This yields a $K$-localized filter since it depends on at most the $K$-th power of the Laplacian. The recursive nature of these polynomials allows for an efficient filtering of complexity $\mathcal{O}(K|E|)$, with $|E|$ the number of edges, thus leading to a computationally appealing definition of convolution for graphs. The model can also be built in an analogous way to CNNs, by stacking multiple convolutional layers, each layer followed by a non-linearity.
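The Chebyshev recursion $T_k(z) = 2 z T_{k-1}(z) - T_{k-2}(z)$, with $T_0(z) = 1$ and $T_1(z) = z$, is what lets the filter be applied with matrix-vector products only. The following is a minimal sketch of that idea, not the implementation of Defferrard et al. (2016). It assumes the normalized Laplacian $L = I - D^{-1/2} A D^{-1/2}$ and the bound $\lambda_{\max} = 2$, and the helper names (`matvec`, `scaled_laplacian`, `chebyshev_filter`) are illustrative. The dense products used here cost $\mathcal{O}(n^2)$ each; a sparse representation would be needed to reach the $\mathcal{O}(K|E|)$ cost.

```python
import math

def matvec(M, v):
    """Dense matrix-vector product for a list-of-lists matrix."""
    return [sum(m * c for m, c in zip(row, v)) for row in M]

def scaled_laplacian(A, lambda_max=2.0):
    """L_hat = 2 L / lambda_max - I with L = I - D^{-1/2} A D^{-1/2} (assumed form)."""
    n = len(A)
    deg = [sum(row) for row in A]
    L_hat = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            eye = 1.0 if i == j else 0.0
            norm_adj = A[i][j] / math.sqrt(deg[i] * deg[j]) if A[i][j] else 0.0
            L_hat[i][j] = 2.0 * (eye - norm_adj) / lambda_max - eye
    return L_hat

def chebyshev_filter(L_hat, x, theta):
    """Return sum_k theta[k] * T_k(L_hat) x using the three-term recursion."""
    t_prev = list(x)                      # T_0(L_hat) x
    y = [theta[0] * a for a in t_prev]
    if len(theta) == 1:
        return y
    t_curr = matvec(L_hat, x)             # T_1(L_hat) x
    y = [yi + theta[1] * a for yi, a in zip(y, t_curr)]
    for k in range(2, len(theta)):
        lt = matvec(L_hat, t_curr)
        t_next = [2.0 * a - b for a, b in zip(lt, t_prev)]
        y = [yi + theta[k] * a for yi, a in zip(y, t_next)]
        t_prev, t_curr = t_curr, t_next
    return y

# Path graph 0 - 1 - 2 - 3 with an impulse signal on node 0.
A = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]]
x = [1.0, 0.0, 0.0, 0.0]
theta = [0.5, -0.3, 0.2]                  # K = 2
L_hat = scaled_laplacian(A)
y = chebyshev_filter(L_hat, x, theta)

# Direct evaluation using T_2(z) = 2 z^2 - 1 as a cross-check.
Lx = matvec(L_hat, x)
LLx = matvec(L_hat, Lx)
y_direct = [theta[0] * a + theta[1] * b + theta[2] * (2.0 * c - a)
            for a, b, c in zip(x, Lx, LLx)]
```

With $K = 2$ the impulse on node 0 cannot reach node 3, which is three hops away, so the last entry of `y` is zero. This is the $K$-localization property in a small instance.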

### A.2 Graph Convolutional Networks

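Kipf and Welling (2017) extended the work of Defferrard et al. (2016) and inspired many follow-up architectures (Chen et al., 2018; Hamilton et al., 2017; Abu-El-Haija et al., 2018; Wu et al., 2019). The core idea of Kipf and Welling (2017) is to limit each filter to 1-hop neighbours by setting $K = 1$, leading to a convolution that is linear in the Laplacian $\hat{L}$:

$$g_\theta \star x = \theta_0 x + \theta_1 \hat{L} x$$

They further assume $\lambda_{\max} \approx 2$, so that for the normalized Laplacian $L = I - D^{-1/2} A D^{-1/2}$ with degree matrix $D$ we get $\hat{L} \approx L - I = -D^{-1/2} A D^{-1/2}$, resulting in the expression

$$g_\theta \star x = \theta_0 x - \theta_1 D^{-1/2} A D^{-1/2} x$$

To additionally alleviate overfitting, Kipf and Welling (2017) constrain the parameters as $\theta = \theta_0 = -\theta_1$, leading to the convolution formula

$$g_\theta \star x = \theta \left(I + D^{-1/2} A D^{-1/2}\right) x$$

Since $I + D^{-1/2} A D^{-1/2}$ has its eigenvalues in the range $[0, 2]$, they further employ a renormalization trick to stop their model from suffering from numerical instabilities:

$$g_\theta \star x = \theta \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} x$$

where $\tilde{A} = A + I$ and $\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$.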

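Rewriting the architecture for multiple features $X \in \mathbb{R}^{n \times d_1}$ and parameters $\Theta \in \mathbb{R}^{d_1 \times d_2}$ instead of $x \in \mathbb{R}^n$ and $\theta \in \mathbb{R}$ gives

$$Z = \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} X \Theta \in \mathbb{R}^{n \times d_2}$$

The final model consists of multiple stacks of convolutions, interleaved by a non-linearity $\sigma$:

$$H^{(k+1)} = \sigma\left(\tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} H^{(k)} \Theta^{(k)}\right)$$

where $H^{(0)} = X$ and $k = 0, \dots, K-1$.

The final output $H^{(K)} \in \mathbb{R}^{n \times d}$ represents the embedding of each node $i$ as the row $h_i = H^{(K)}_{i \cdot} \in \mathbb{R}^{d}$ and can be used to perform node classification:

$$\hat{Y} = \mathrm{softmax}\left(\tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} H^{(K)} W\right) \in \mathbb{R}^{n \times L}$$

where $W \in \mathbb{R}^{d \times L}$, with $L$ denoting the number of classes.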
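The layer-wise rule can be made concrete with a small forward pass. The sketch below assumes one hidden layer ($K = 1$), ReLU as the non-linearity $\sigma$, and a row-wise softmax. The toy graph, the weights, and the function names are hypothetical and chosen only to illustrate the shapes involved.

```python
import math

def matmul(A, B):
    """Dense matrix product for list-of-lists matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def normalized_adjacency(A):
    """Return S = D~^{-1/2} (A + I) D~^{-1/2} with D~ the degree matrix of A + I."""
    n = len(A)
    A_tilde = [[A[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in A_tilde]
    return [[A_tilde[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]

def relu(M):
    return [[max(0.0, v) for v in row] for row in M]

def softmax_rows(M):
    out = []
    for row in M:
        m = max(row)                       # subtract the max for numerical stability
        e = [math.exp(v - m) for v in row]
        s = sum(e)
        out.append([v / s for v in e])
    return out

def gcn_forward(A, X, Theta0, W):
    """H1 = ReLU(S X Theta0), Y_hat = softmax(S H1 W)."""
    S = normalized_adjacency(A)
    H1 = relu(matmul(matmul(S, X), Theta0))
    return softmax_rows(matmul(matmul(S, H1), W))

# Toy graph: a triangle (0, 1, 2) with a pendant node 3 attached to node 2.
A = [[0, 1, 1, 0], [1, 0, 1, 0], [1, 1, 0, 1], [0, 0, 1, 0]]
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]        # n = 4, d_1 = 2
Theta0 = [[0.2, -0.1, 0.4], [0.3, 0.5, -0.2]]                # d_1 x d_2 = 2 x 3
W = [[1.0, -1.0], [0.5, 0.5], [-0.3, 0.8]]                   # d_2 x L = 3 x 2
S = normalized_adjacency(A)
Y_hat = gcn_forward(A, X, Theta0, W)
```

Each row of `Y_hat` is a probability distribution over the $L = 2$ classes for one node. The same matrices `Theta0` and `W` are applied to every node, and only `S` carries the graph structure.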

In order to illustrate how embeddings of neighbouring nodes interact, it is easier to view the architecture on the node level. Denote by $\mathcal{N}(i)$ the neighbours of node $i$. One can write the embedding of node $i$ at layer $k+1$ as follows:

$$h_i^{(k+1)} = \sigma\left(\Theta^{(k)} \sum_{j \in \mathcal{N}(i) \cup \{i\}} \frac{h_j^{(k)}}{\sqrt{|\mathcal{N}(j)|\,|\mathcal{N}(i)|}}\right)$$

Notice that there is no dependence of the weight matrices $\Theta^{(k)}$ on the node $i$; in fact, the same parameters are shared across all nodes.
In order to obtain the new embedding $h_i^{(k+1)}$ of node $i$, we average over all embeddings of the neighbouring nodes. This message passing mechanism gives rise to a very broad class of graph neural networks (Kipf and Welling, 2017; Veličković et al., 2018; Hamilton et al., 2017; Gilmer et al., 2017; Chen et al., 2018; Klicpera et al., 2019; Abu-El-Haija et al., 2018).

To be more precise, GCN falls into the more general category of models of the form

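$$\begin{aligned}
z_i^{(k+1)} &= \mathrm{AGGREGATE}^{(k)}\left(\{h_j^{(k)} : j \in \mathcal{N}(i)\};\, W^{(k)}\right) \\
h_i^{(k+1)} &= \mathrm{COMBINE}^{(k)}\left(h_i^{(k)}, z_i^{(k+1)};\, V^{(k)}\right)
\end{aligned}$$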

Models of the above form are deemed Message Passing Graph Neural Networks and many choices for AGGREGATE and COMBINE have been suggested in the literature (Kipf and Welling, 2017; Hamilton et al., 2017; Chen et al., 2018).
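As an illustration of this template, the sketch below instantiates AGGREGATE as a neighbour mean followed by a linear map $W$, and COMBINE as $\mathrm{ReLU}(V h_i + z_i)$. These two choices are assumptions made for the example only; they are one of many options in the literature and not the specific rule of any cited model.

```python
def matvec(M, v):
    """Dense matrix-vector product for a list-of-lists matrix."""
    return [sum(m * c for m, c in zip(row, v)) for row in M]

def aggregate(neighbour_feats, W):
    """Hypothetical AGGREGATE: mean of the neighbour features, then the linear map W."""
    dim = len(neighbour_feats[0])
    mean = [sum(h[d] for h in neighbour_feats) / len(neighbour_feats) for d in range(dim)]
    return matvec(W, mean)

def combine(h_i, z_i, V):
    """Hypothetical COMBINE: ReLU(V h_i + z_i)."""
    return [max(0.0, a + b) for a, b in zip(matvec(V, h_i), z_i)]

def message_passing_layer(neighbours, H, W, V):
    """Apply one AGGREGATE / COMBINE step to every node using the old embeddings H."""
    return {
        i: combine(H[i], aggregate([H[j] for j in neighbours[i]], W), V)
        for i in neighbours
    }

# Path graph 0 - 1 - 2 with two-dimensional node features.
neighbours = {0: [1], 1: [0, 2], 2: [1]}
H = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}
identity = [[1.0, 0.0], [0.0, 1.0]]
H_next = message_passing_layer(neighbours, H, identity, identity)
```

With identity weights, node 1 receives the mean `[1.0, 0.5]` of its two neighbours and adds its own embedding `[0.0, 1.0]`, giving `[1.0, 1.5]`.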

## Appendix B Graph Embeddings in Non-Euclidean Geometries

In this section we motivate non-Euclidean embeddings of graphs and show why the underlying geometry of the embedding space can be very beneficial for its representation. We first introduce a measure of how well a graph is represented by some embedding $f : V \to X$, $i \mapsto f(i)$:

###### Definition 3.

Given an embedding $f : V \to X$ of a graph $G = (V, A)$ in some metric space $(X, d_X)$, we call $f$ a $D$-embedding for $D \geq 1$ if there exists $r > 0$ such that, for all $i, j \in V$,

$$r \cdot d_G(i,j) \leq d_X(f(i), f(j)) \leq D \cdot r \cdot d_G(i,j)$$

The infimum over all such $D$ is called the distortion of $f$.

The $r$ in the definition of distortion allows for scaling of all distances. Note further that a perfect embedding is achieved when $D = 1$.
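For a finite graph the distortion can be computed directly: the largest admissible $r$ is the smallest ratio $d_X(f(i), f(j)) / d_G(i,j)$ over all pairs, and the smallest admissible $D$ is then the largest ratio divided by the smallest. The sketch below implements this observation. The function name `distortion` and the 4-cycle example are illustrative, not taken from the paper.

```python
import math
from itertools import combinations

def distortion(graph_dist, points, metric):
    """Distortion of an embedding: max over pairs of d_X / d_G divided by the min."""
    ratios = [
        metric(points[i], points[j]) / graph_dist[i][j]
        for i, j in combinations(range(len(points)), 2)
    ]
    return max(ratios) / min(ratios)

# 4-cycle 0 - 1 - 2 - 3 - 0 embedded as the corners of the unit square.
cycle_dist = [
    [0, 1, 2, 1],
    [1, 0, 1, 2],
    [2, 1, 0, 1],
    [1, 2, 1, 0],
]
square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
d_square = distortion(cycle_dist, square, math.dist)

# Path 0 - 1 - 2 embedded on a line, with all distances scaled by 5.
path_dist = [[0, 1, 2], [1, 0, 1], [2, 1, 0]]
line = [(0.0,), (5.0,), (10.0,)]
d_line = distortion(path_dist, line, math.dist)
```

The square embedding has distortion $\sqrt{2}$, because opposite corners are at Euclidean distance $\sqrt{2}$ but graph distance 2. The scaled path has distortion 1, which shows that the factor $r$ absorbs a global rescaling.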

### B.1 Trees and Hyperbolic Space

Trees are graphs that do not allow for a cycle; in other words, there is no node $i \in V$ for which there exists a path starting from $i$ and returning back to $i$ without passing through any node twice. The number of nodes typically increases exponentially with the depth of the tree. This is a property that prohibits Euclidean space from representing a tree accurately. What intuitively happens is that "we run out of space". Consider the trees depicted in fig. 5. Here the yellow nodes represent the roots of each tree. Notice how quickly we struggle to find appropriate places for nodes in the embedding space because their number increases too fast.

Moreover, graph distances get extremely distorted towards the leaves of the tree. Take for instance the green and the pink node. In graph distance they are very far apart, as one has to travel all the way up to the root node and back to the border. In Euclidean space, however, they are embedded very closely in an $\ell_2$-sense, hence introducing a big error in the embedding.

This problem can be very nicely illustrated by the following theorem:

###### Theorem 7.

Consider the tree $K_{1,3}$ (also called 3-star) consisting of a root node with three children. Then every embedding $\{x_1, \dots, x_4\}$ with $x_i \in \mathbb{R}^n$ achieves at least distortion $\frac{2}{\sqrt{3}}$ for any $n \in \mathbb{N}$.

###### Proof.

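We will prove this statement by using a special case of the so-called Poincaré-type inequalities (Deza and Laurent, 1996):

For any $b_1, \dots, b_k \in \mathbb{R}$ with $\sum_{i=1}^{k} b_i = 0$ and points $x_1, \dots, x_k \in \mathbb{R}^n$ it holds that

$$\sum_{i,j=1}^{k} b_i b_j \|x_i - x_j\|^2 \leq 0$$

Consider now an embedding $x_1, \dots, x_4$ of the tree where $x_1$ represents the root node. Choosing $b_1 = -3$ and $b_i = 1$ for $i \neq 1$ leads to the inequality

$$\|x_2 - x_3\|^2 + \|x_2 - x_4\|^2 + \|x_3 - x_4\|^2 \leq 3\|x_1 - x_2\|^2 + 3\|x_1 - x_3\|^2 + 3\|x_1 - x_4\|^2$$

The left-hand side of this inequality in terms of the graph distance is

$$d_G(2,3)^2 + d_G(2,4)^2 + d_G(3,4)^2 = 2^2 + 2^2 + 2^2 = 12$$

and the right-hand side is

$$3 \cdot d_G(1,2)^2 + 3 \cdot d_G(1,3)^2 + 3 \cdot d_G(1,4)^2 = 3 + 3 + 3 = 9$$

For a $D$-embedding with scaling $r$, the embedded left-hand side is at least $12 r^2$ and the embedded right-hand side is at most $9 D^2 r^2$, so $12 r^2 \leq 9 D^2 r^2$. As a result, we always have that the distortion is lower-bounded by

$$D \geq \sqrt{\frac{12}{9}} = \frac{2}{\sqrt{3}} > 1$$

∎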

Euclidean space thus already fails to capture the geometric structure of a very simple tree. This problem can be remedied by replacing the underlying Euclidean space by hyperbolic space.
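The Euclidean lower bound can be checked numerically. The sketch below samples random four-point configurations (the dimension 5 and the seed are arbitrary choices for the example) to confirm the inequality used in the proof. It then evaluates the symmetric planar embedding, with the root at the origin and the three leaves on the unit circle at angles 120° apart, which attains the bound $2/\sqrt{3}$.

```python
import math
import random

def sq_dist(x, y):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def inequality_sides(x1, x2, x3, x4):
    """Both sides of the Poincaré-type inequality for b = (-3, 1, 1, 1); x1 is the root."""
    lhs = sq_dist(x2, x3) + sq_dist(x2, x4) + sq_dist(x3, x4)
    rhs = 3.0 * (sq_dist(x1, x2) + sq_dist(x1, x3) + sq_dist(x1, x4))
    return lhs, rhs

# Random configurations never violate the inequality (up to rounding).
rng = random.Random(0)
violations = 0
for _ in range(1000):
    pts = [[rng.uniform(-1.0, 1.0) for _ in range(5)] for _ in range(4)]
    lhs, rhs = inequality_sides(*pts)
    if lhs > rhs + 1e-9:
        violations += 1

# Symmetric planar embedding of the 3-star.
root = (0.0, 0.0)
leaves = [(math.cos(2.0 * math.pi * k / 3.0), math.sin(2.0 * math.pi * k / 3.0))
          for k in range(3)]
root_leaf_ratio = math.dist(root, leaves[0]) / 1.0        # graph distance 1
leaf_leaf_ratio = math.dist(leaves[0], leaves[1]) / 2.0   # graph distance 2
euclidean_star_distortion = (max(root_leaf_ratio, leaf_leaf_ratio)
                             / min(root_leaf_ratio, leaf_leaf_ratio))
sym_lhs, sym_rhs = inequality_sides(root, *leaves)
```

For the symmetric embedding both sides of the inequality equal 9, so the bound is tight: no Euclidean embedding of the 3-star does better than distortion $2/\sqrt{3} \approx 1.155$.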

Consider again the distance function in the Poincaré model, for simplicity with curvature $\kappa = -1$:

$$d_P(x,y) = \cosh^{-1}\left(1 + 2\,\frac{\|x - y\|^2}{(1 - \|x\|^2)(1 - \|y\|^2)}\right)$$

Assume that the tree is embedded in the same way as in fig. 5, just restricted to lie in the disk of radius $1$. Notice that as soon as points move closer to the boundary ($\|x\| \to 1$), the fraction explodes and the resulting distance goes to infinity. As a result, the further points are moved towards the border, the more their distance increases, exactly as nodes on different branches are more distant from each other the further down they are in the tree. We can express this geometric advantage in terms of distortion:

###### Theorem 8.

There exists an embedding $x_1, \dots, x_4$ of the 3-star in the Poincaré disk achieving distortion $1 + \epsilon$ for $\epsilon > 0$ arbitrarily small.

###### Proof.

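Since the Poincaré distance is invariant under Möbius translations we can again assume that $x_1 = 0$. Let us place the other nodes on a circle of radius $r$. Their distance to the root is now given as

$$d_P(x_i, 0) = \cosh^{-1}\left(1 + 2\,\frac{\|x_i\|^2}{1 - \|x_i\|^2}\right) = \cosh^{-1}\left(1 + 2\,\frac{r^2}{1 - r^2}\right)$$

By invariance of the distance under centered rotations we can assume w.l.o.g. $x_2 = (r, 0)$. We further embed

• $x_3 = \left(-\tfrac{r}{2}, \tfrac{\sqrt{3}}{2} r\right)$
• $x_4 = \left(-\tfrac{r}{2}, -\tfrac{\sqrt{3}}{2} r\right)$.

This procedure gives:

$$d_P(x_2, x_3) = \cosh^{-1}\left(1 + 2\,\frac{\left\|\left(\tfrac{3r}{2}, -\tfrac{\sqrt{3}}{2} r\right)\right\|^2}{(1 - r^2)^2}\right) = \cosh^{-1}\left(1 + 2\,\frac{3r^2}{(1 - r^2)^2}\right)$$

If we now let the points move to the border of the disk we observe that

$$\frac{\cosh^{-1}\left(1 + 2\,\frac{3r^2}{(1 - r^2)^2}\right)}{\cosh^{-1}\left(1 + 2\,\frac{r^2}{1 - r^2}\right)} \xrightarrow{\; r \to 1 \;} 2$$

Since leaves are at graph distance 2 from each other and at graph distance 1 from the root, this means in turn that we can achieve distortion $1 + \epsilon$ for $\epsilon > 0$ arbitrarily small. ∎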
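The limit in the proof can be observed numerically. The sketch below places the root of the 3-star at the origin of the Poincaré disk and the leaves on a circle of radius $r$ at angles 120° apart (the same construction as in the proof), then evaluates the distortion for increasing $r$. The function names are illustrative, and the curvature is assumed to be $-1$ as in the distance formula above.

```python
import math
from itertools import combinations

def sq_norm(v):
    return sum(c * c for c in v)

def poincare_dist(x, y):
    """Distance in the Poincaré disk of curvature -1."""
    diff = [a - b for a, b in zip(x, y)]
    return math.acosh(1.0 + 2.0 * sq_norm(diff) / ((1.0 - sq_norm(x)) * (1.0 - sq_norm(y))))

def star_embedding(r):
    """Root at the origin, three leaves on the circle of radius r."""
    s = math.sqrt(3.0) / 2.0
    return [(0.0, 0.0), (r, 0.0), (-r / 2.0, s * r), (-r / 2.0, -s * r)]

# Graph distances of the 3-star; node 0 is the root.
STAR_DIST = [
    [0, 1, 1, 1],
    [1, 0, 2, 2],
    [1, 2, 0, 2],
    [1, 2, 2, 0],
]

def hyperbolic_star_distortion(r):
    """Largest ratio d_P / d_G over all node pairs divided by the smallest ratio."""
    pts = star_embedding(r)
    ratios = [poincare_dist(pts[i], pts[j]) / STAR_DIST[i][j]
              for i, j in combinations(range(4), 2)]
    return max(ratios) / min(ratios)

values = {r: hyperbolic_star_distortion(r) for r in (0.5, 0.9, 0.999)}
```

The distortion decreases towards 1 as $r$ approaches 1, and for these radii it is already below the Euclidean lower bound $2/\sqrt{3}$. The convergence is slow, however, since the gap to 1 shrinks only logarithmically in $1 - r^2$.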

The tree-likeness of hyperbolic space has been investigated on a deeper mathematical level. Sarkar (2011) shows that a statement similar to theorem 8 holds for all weighted or unweighted trees. The interested reader is referred to (Hamann, 2017; Sarkar, 2011) for a more in-depth treatment of the subject.

Cycles are the subgraphs that are not allowed in a tree. A cycle consists of a single path that reconnects the last node to the first: