# Neural Embeddings of Graphs in Hyperbolic Space

## Abstract.

Neural embeddings have been used with great success in Natural Language Processing (NLP). They provide compact representations that encapsulate word similarity and attain state-of-the-art performance in a range of linguistic tasks. The success of neural embeddings has prompted significant amounts of research into applications in domains other than language. One such domain is graph-structured data, where embeddings of vertices can be learned that encapsulate vertex similarity and improve performance on tasks including edge prediction and vertex labelling. For both NLP and graph-based tasks, embeddings have been learned in high-dimensional Euclidean spaces. However, recent work has shown that the appropriate isometric space for embedding complex networks is not the flat Euclidean space, but negatively curved, hyperbolic space. We present a new concept that exploits these recent insights and propose learning neural embeddings of graphs in hyperbolic space. We provide experimental evidence that embedding graphs in their natural geometry significantly improves performance on downstream tasks for several real-world public datasets.

neural networks, graph embeddings, complex networks, geometry

## 1. Introduction

Embedding (or vector space) methods find a lower-dimensional continuous space in which to represent high-dimensional complex data (Roweis2000; Belkin2001). The distance between objects in the lower-dimensional space gives a measure of their similarity. This is usually achieved by first postulating a low-dimensional vector space and then optimising an objective function of the vectors in that space. Vector space representations provide three principal benefits over sparse schemes: (1) they encapsulate similarity, (2) they are compact, and (3) they perform better as inputs to machine learning models (Salton1975). This is true of graph-structured data, where the native data format is the adjacency matrix, a typically large, sparse matrix of connection weights.

Neural embedding models are a flavour of embedding scheme where the vector space corresponds to a subset of the network weights, which are learned through backpropagation. Neural embedding models have been shown to improve performance in a large number of downstream tasks across multiple domains. These include word analogies (Mikolov2013; Mnih2013), machine translation (Sutskever2014), document comparison (Kusner2015), missing edge prediction (Grover), vertex attribution (Perozzi2014), product recommendations (Grbovic2015; Baeza-yates2015), customer value prediction (Kooti2017; Chamberlain2017) and item categorisation (Barkan2016). In all cases the embeddings are learned without labels (unsupervised) from a sequence of entities.

To the best of our knowledge, all previous work on neural embedding models either explicitly or implicitly (by using the Euclidean dot product) assumes that the vector space is Euclidean. Recent work from the field of complex networks has found that many interesting networks, such as the Internet (Boguna2010) or academic citations (Clough2015a; Clough2016), can be well described by a framework with an underlying non-Euclidean hyperbolic geometry. Hyperbolic geometry provides a continuous analogue of tree-like graphs, and even infinite trees have nearly isometric embeddings in hyperbolic space (Gromov). Additionally, the defining features of complex networks, such as power-law degree distributions, strong clustering and hierarchical community structure, emerge naturally when random graphs are embedded in hyperbolic space (Krioukov).

The starting point for our model is the celebrated word2vec Skipgram architecture, which is shown in Figure 3 (Mikolov2013; Mikolov2013a). Skipgram is a shallow neural network with three layers: (1) an input projection layer that maps from a one-hot-encoded to a distributed representation, (2) a hidden layer, and (3) an output softmax layer. The network is necessarily simple for tractability, as there are a very large number of output states (every word in a language). Skipgram is trained on a sequence of words that is decomposed into (input word, context word) pairs. The model employs two separate vector representations, one for the input words and another for the context words, with the input representation comprising the learned embedding. The word pairs are generated by taking a sequence of words and running a sliding window (the context) over them. As an example, the word sequence “chance favours the prepared mind” with a context window of size three would generate the following training data: {(chance, favours), (chance, the), (favours, chance), …}. Words are initially randomly allocated to vectors within the two vector spaces. Then, for each training pair, the vector representations of the observed input and context words are pushed towards each other and away from all other words (see Figure 2).
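The pair generation described above can be sketched in a few lines of Python. This is an illustrative reading rather than the authors' code: it assumes that a "context window of size three" means three consecutive tokens, with every ordered pair of distinct positions inside the window emitted once. The function name `skipgram_pairs` and its `window` parameter are hypothetical.

```python
from itertools import permutations


def skipgram_pairs(tokens, window=3):
    """Slide a window of `window` consecutive tokens over the sequence and
    emit each ordered (input, context) pair of distinct positions once."""
    pairs = []
    seen = set()  # position pairs already emitted by an earlier window
    for start in range(max(1, len(tokens) - window + 1)):
        span = range(start, min(start + window, len(tokens)))
        for i, j in permutations(span, 2):
            if (i, j) not in seen:
                seen.add((i, j))
                pairs.append((tokens[i], tokens[j]))
    return pairs


sentence = "chance favours the prepared mind".split()
pairs = skipgram_pairs(sentence)
print(pairs[:3])
```

Under this reading the first three pairs match the ones listed in the text. Other window conventions (for example, a fixed number of tokens either side of the input word) would give a slightly different set of pairs.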

The concept can be extended from words to network-structured data using random walks to create sequences of vertices. The vertices are then treated exactly analogously to words in the NLP formulation. This was originally proposed as DeepWalk (Perozzi2014). Extensions varying the nature of the random walks have been explored in LINE (Tang2015) and Node2vec (Grover).

#### Contribution

In this paper, we introduce the new concept of neural embeddings in hyperbolic space. We formulate backpropagation in hyperbolic space and show that using the natural geometry of complex networks improves performance in vertex classification tasks across multiple networks.

## 2. Hyperbolic Geometry

Hyperbolic geometry emerges from relaxing Euclid’s fifth postulate (the parallel postulate) of geometry. In hyperbolic space, there is not just one but an infinite number of lines through a given point that are parallel to a given line. This is illustrated in Figure 1(b), where every line is parallel to the bold, blue line and all pass through the same point. Hyperbolic space is one of only three types of isotropic spaces that can be defined entirely by their curvature. The most familiar is Euclidean, which is flat, having zero curvature. Space with uniform positive curvature has an elliptic geometry (e.g. the surface of a sphere), and space with uniform negative curvature is called hyperbolic, which is analogous to a saddle-like surface. Unlike in Euclidean space, even infinite trees have nearly isometric embeddings in hyperbolic space, so it has been successfully used to model complex networks with hierarchical structure, power-law degree distributions and high clustering (Krioukov).

One of the defining characteristics of hyperbolic space is that it is in some sense larger than the more familiar Euclidean space; the area of a circle or volume of a sphere grows exponentially with its radius, rather than polynomially. This suggests that low-dimensional hyperbolic spaces may provide effective representations of data in ways that low-dimensional Euclidean spaces cannot. However, this makes hyperbolic space hard to visualise, as even the 2D hyperbolic plane cannot be isometrically embedded as a smooth surface in 3D Euclidean space (unlike elliptic geometry, where a 2-sphere can be embedded into 3D Euclidean space). For this reason there are many different ways of representing hyperbolic space, with each representation conserving some geometric properties, but distorting others. In the remainder of the paper we use the Poincaré disk model of hyperbolic space.

### 2.1. Poincaré Disk Model

The Poincaré disk models two-dimensional hyperbolic space where the infinite plane is represented as a unit disk. We work with the two-dimensional disk, but it is easily generalised to the $d$-dimensional Poincaré ball, where hyperbolic space is represented as a unit $d$-ball.

In this model hyperbolic distances grow exponentially towards the edge of the disk. The circle’s boundary represents infinitely distant points as the infinite hyperbolic plane is squashed inside the finite disk. This property is illustrated in Figure 1(a), where each tile is of constant area in hyperbolic space, but the tiles rapidly shrink to zero area in Euclidean space. Although volumes and distances are warped, the Poincaré disk model is conformal, meaning that Euclidean and hyperbolic angles between lines are equal. Straight lines in hyperbolic space intersect the boundary of the disk orthogonally and appear either as diameters of the disk, or arcs of a circle. Figure 1(b) shows a collection of straight hyperbolic lines in the Poincaré disk. Just as in spherical geometry, the shortest path from one place to another is a straight line, but appears as a curve on a flat map. Similarly, these straight lines show the shortest path (in terms of distance in the underlying hyperbolic space) from one point on the disk to another, but they appear curved. This is because it is quicker to move close to the centre of the disk, where distances are shorter, than nearer the edge. In our proposed approach, we will exploit both the conformal property and the circular symmetry of the Poincaré disk.

Overall, the geometric intuition motivating our approach is that vertices embedded near the middle of the disk can have more close neighbours than they could in Euclidean space, whilst vertices nearer the edge of the disk can still be very far from each other.

### 2.2. Inner Product, Angles, and Distances

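The mathematics is considerably simplified if we exploit the symmetries of the model and describe points in the Poincaré disk using polar coordinates, $x = (r_e, \theta)$, with $r_e \in [0, 1)$ and $\theta \in [0, 2\pi)$. To define similarities and distances, we require an inner product. In the Poincaré disk, the inner product of two vectors $x = (r_x, \theta_x)$ and $y = (r_y, \theta_y)$ is given by

$$\langle x, y \rangle = \|x\| \, \|y\| \cos(\theta_x - \theta_y) \tag{1}$$

$$\langle x, y \rangle = 4 \operatorname{arctanh}(r_x) \operatorname{arctanh}(r_y) \cos(\theta_x - \theta_y) \tag{2}$$

The distance of $x = (r_e, \theta)$ from the origin of the hyperbolic co-ordinate system is given by $r_h = 2 \operatorname{arctanh}(r_e)$, and the circumference of a circle of hyperbolic radius $R$ is $C = 2\pi \sinh R$.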
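As a small numerical illustration of these quantities, the following sketch evaluates the inner product of Equations (1) and (2), the hyperbolic distance from the origin and the circumference of a hyperbolic circle. It assumes that the norm of a point at Euclidean radius $r_e$ is $2 \operatorname{arctanh}(r_e)$, which is what Equation (2) implies, and that the circumference is $2\pi \sinh R$ (the standard result for curvature $-1$). The function names are hypothetical.

```python
import math


def hyperbolic_radius(r_e):
    """Hyperbolic distance from the origin of a point at Euclidean radius r_e < 1."""
    return 2.0 * math.atanh(r_e)


def poincare_inner(x, y):
    """Inner product of Eqs. (1)-(2); x and y are (r_e, theta) pairs on the disk."""
    (r_x, theta_x), (r_y, theta_y) = x, y
    return hyperbolic_radius(r_x) * hyperbolic_radius(r_y) * math.cos(theta_x - theta_y)


def circumference(R):
    """Circumference of a circle of hyperbolic radius R (curvature -1)."""
    return 2.0 * math.pi * math.sinh(R)


x = (0.5, 0.0)
y = (0.5, math.pi / 3)
print(poincare_inner(x, y))
# Points close to the rim of the disk are far from the origin.
print(hyperbolic_radius(0.9), hyperbolic_radius(0.99), hyperbolic_radius(0.999))
# Ratio of hyperbolic to Euclidean circumference at the same radius.
print(circumference(5.0) / (2.0 * math.pi * 5.0))
```

The last line shows how much more room a hyperbolic circle has than a Euclidean circle of the same radius, which is the sense in which hyperbolic space is larger.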

## 3. Neural Embedding in Hyperbolic Space

We adopt the original notation of (Mikolov2013) whereby the input vertex is $w_I$ and the output is $w_O$. Their corresponding vector representations are $v_{w_I}$ and $v'_{w_O}$, which are elements of the two vector spaces shown in Figure 3, $\mathcal{W}$ and $\mathcal{W}'$ respectively. Skipgram has a geometric interpretation, which we visualise in Figure 2 for vectors in $\mathcal{W}'$. Updates to $v'_{w_j}$ are performed by simply adding (if $w_j$ is the observed output vertex) or subtracting (otherwise) an error-weighted portion of the input vector. Similar, though slightly more complicated, update rules apply to the vectors in $\mathcal{W}$. Given this interpretation, it is natural to look for alternative geometries in which to perform these updates.

To embed a graph in hyperbolic space we replace Skipgram’s two Euclidean vector spaces ($\mathcal{W}$ and $\mathcal{W}'$ in Figure 3) with two Poincaré disks. We learn embeddings by optimising an objective function that predicts output/context vertices from an input vertex, but we replace the Euclidean dot products used in Skipgram with hyperbolic inner products. A softmax function is used for the conditional predictive distribution

$$p(w_O \mid w_I) = \frac{\exp(\langle v'_{w_O}, v_{w_I} \rangle)}{\sum_{i=1}^{V} \exp(\langle v'_{w_i}, v_{w_I} \rangle)}, \tag{3}$$

where $v_{w_i}$ is the vector representation of the $i^{th}$ vertex, primed indicates members of the output vector space (see Figure 3), $V$ is the number of vertices, and $\langle \cdot, \cdot \rangle$ is the hyperbolic inner product. Directly optimising (3) is computationally demanding as the sum in the denominator extends over every vertex in the graph. Two commonly used techniques to make word2vec more efficient are (a) replacing the softmax with a hierarchical softmax (Mnih2008; Mikolov2013) and (b) negative sampling (Mnih2012; Mnih2013). We use negative sampling as it is faster.
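To make the cost of the full softmax concrete, here is a minimal sketch of Equation (3) for a toy set of output embeddings. It assumes points are stored as `(r, theta)` pairs in which `r` is already the hyperbolic radius, so the inner product reduces to `r * r' * cos(theta - theta')`. The names `inner` and `softmax_probs` and the toy values are illustrative only.

```python
import math


def inner(a, b):
    """Hyperbolic inner product for points given as (r, theta), with r the
    hyperbolic distance from the origin (an assumption of this sketch)."""
    return a[0] * b[0] * math.cos(a[1] - b[1])


def softmax_probs(v_in, outputs):
    """Eq. (3): predictive distribution over all output vertices for one input."""
    scores = [inner(v_out, v_in) for v_out in outputs]
    top = max(scores)                       # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)                       # one term per vertex in the graph
    return [e / total for e in exps]


v_in = (1.0, 0.1)
outputs = [(1.2, 0.0), (0.8, 2.0), (1.5, 3.1)]   # toy output embeddings
probs = softmax_probs(v_in, outputs)
print(probs)
```

The normalising sum touches every output vertex for every training pair, which is the cost that negative sampling avoids.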

### 3.1. Negative Sampling

Negative sampling is a form of Noise Contrastive Estimation (NCE) (Gutmann2012). NCE is an estimation technique that is based on the assumption that a good model should be able to separate signal from noise using only logistic regression.

As we only care about generating good embeddings, the objective function does not need to produce a well-specified probability distribution. The negative log likelihood using negative sampling is

$$E = -\log \sigma(u_O) - \sum_{j=1}^{K} \mathbb{E}_{w_j \sim P_n}\left[\log \sigma(-u_j)\right] \tag{5}$$

where $v_{w_I}$, $v'_{w_O}$ are the vector representations of the input and output vertices, $u_j = \langle v'_{w_j}, v_{w_I} \rangle$, $W_{neg}$ is a set of samples drawn from the noise distribution $P_n$, $K$ is the number of samples and $\sigma$ is the sigmoid function. The first term represents the observed data and the second term the negative samples. To draw $W_{neg}$, we specify the noise distribution $P_n$ to be unigrams raised to the power $3/4$ as in (Mikolov2013).
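A minimal sketch of the negative-sampling objective follows. It replaces the expectation in Equation (5) with a sum over $K$ drawn samples, which is the usual Monte Carlo estimate. The exponent `power=0.75` is the value used by word2vec and should be treated as an assumption here, as should the helper names and the use of vertex counts from the random walks as unigram frequencies.

```python
import math
import random


def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))


def noise_distribution(counts, power=0.75):
    """Unigram counts raised to `power`, then normalised to sum to one."""
    weights = {v: c ** power for v, c in counts.items()}
    z = sum(weights.values())
    return {v: w / z for v, w in weights.items()}


def draw_negatives(noise, k, rng):
    """Draw k negative samples (with replacement) from the noise distribution."""
    vertices = list(noise)
    return rng.choices(vertices, weights=[noise[v] for v in vertices], k=k)


def negative_sampling_loss(u_pos, u_negs):
    """Sampled version of Eq. (5): one observed score and K noise scores."""
    return -math.log(sigmoid(u_pos)) - sum(math.log(sigmoid(-u)) for u in u_negs)


counts = {"a": 16, "b": 1, "c": 4}          # hypothetical vertex frequencies
noise = noise_distribution(counts)
negatives = draw_negatives(noise, k=5, rng=random.Random(0))
print(noise)
print(negative_sampling_loss(u_pos=2.0, u_negs=[-1.0, 0.5]))
```

Raising counts to a power below one flattens the distribution, so rare vertices are sampled as negatives more often than their raw frequency would suggest.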

### 3.2. Model Learning

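We learn the model using backpropagation. To perform backpropagation it is easiest to work in natural hyperbolic co-ordinates on the disk and map back to Euclidean co-ordinates only at the end. In natural co-ordinates $r \in (0, \infty)$, $\theta \in [0, 2\pi)$ and $u_j = r'_j r_I \cos(\theta_I - \theta'_j)$. The major drawback of this co-ordinate system is that it introduces a singularity at the origin. To address the complexities that result from radii that are less than or equal to zero, we initialise all vectors to be in a patch of space that is small relative to its distance from the origin.

The gradient of the negative log-likelihood in (5) w.r.t. $u_j$ is given by

$$\frac{\partial E}{\partial u_j} = \begin{cases} \sigma(u_j) - 1, & \text{if } w_j = w_O \\ \sigma(u_j), & \text{if } w_j \in W_{neg} \\ 0, & \text{otherwise} \end{cases} \tag{6}$$

Taking the derivatives w.r.t. the components of vectors in $\mathcal{W}'$ (in natural polar hyperbolic co-ordinates) yields

$$\frac{\partial E}{\partial (r'_j)_k} = \frac{\partial E}{\partial u_j} \frac{\partial u_j}{\partial (r'_j)_k} = \frac{\partial E}{\partial u_j} \, r_I \cos(\theta_I - \theta'_j) \tag{7}$$

$$\frac{\partial E}{\partial (\theta'_j)_k} = \frac{\partial E}{\partial u_j} \, r'_j r_I \sin(\theta_I - \theta'_j). \tag{8}$$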

The Jacobian is then

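$$\nabla_r E = \frac{\partial E}{\partial r} \hat{r} + \frac{1}{\sinh r} \frac{\partial E}{\partial \theta} \hat{\theta}, \tag{9}$$

which leads to the update equations

$${r'_j}^{\text{new}} = \begin{cases} {r'_j}^{\text{old}} - \eta \, \epsilon_j \, r_I \cos(\theta_I - \theta'_j), & \text{if } w_j \in \{w_O\} \cup W_{neg} \\ {r'_j}^{\text{old}}, & \text{otherwise} \end{cases} \tag{10}$$

$${\theta'_j}^{\text{new}} = \begin{cases} {\theta'_j}^{\text{old}} - \eta \, \epsilon_j \, \dfrac{r_I r'_j}{\sinh r'_j} \sin(\theta_I - \theta'_j), & \text{if } w_j \in \{w_O\} \cup W_{neg} \\ {\theta'_j}^{\text{old}}, & \text{otherwise} \end{cases} \tag{11}$$

where $\eta$ is the learning rate and $\epsilon_j$ is the prediction error defined in Equation (6). Calculating the derivatives w.r.t. the input embedding follows the same pattern, and we obtain

$$\frac{\partial E}{\partial r_I} = \sum_{j:\, w_j \in \{w_O\} \cup W_{neg}} \frac{\partial E}{\partial u_j} \frac{\partial u_j}{\partial r_I} \tag{12}$$

$$\frac{\partial E}{\partial r_I} = \sum_{j:\, w_j \in \{w_O\} \cup W_{neg}} \frac{\partial E}{\partial u_j} \, r'_j \cos(\theta_I - \theta'_j), \tag{13}$$

$$\frac{\partial E}{\partial \theta_I} = \sum_{j:\, w_j \in \{w_O\} \cup W_{neg}} \frac{\partial E}{\partial u_j} \frac{\partial u_j}{\partial \theta_I} \tag{14}$$

$$\frac{\partial E}{\partial \theta_I} = \sum_{j:\, w_j \in \{w_O\} \cup W_{neg}} -\frac{\partial E}{\partial u_j} \, r_I r'_j \sin(\theta_I - \theta'_j). \tag{15}$$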

The corresponding update equations are

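$$r_I^{\text{new}} = r_I^{\text{old}} - \eta \sum_{j:\, w_j \in \{w_O\} \cup W_{neg}} \epsilon_j \, r'_j \cos(\theta_I - \theta'_j), \tag{16}$$

$$\theta_I^{\text{new}} = \theta_I^{\text{old}} + \eta \sum_{j:\, w_j \in \{w_O\} \cup W_{neg}} \epsilon_j \, \frac{r_I r'_j}{\sinh r_I} \sin(\theta_I - \theta'_j), \tag{17}$$

where $\epsilon_j = \sigma(u_j) - t_j$ and $t_j$ is an indicator variable s.t. $t_j = 1$ if and only if $w_j = w_O$, and $t_j = 0$ otherwise. The sign in (17) follows from stepping against the gradient in (15). On completion of backpropagation, the vectors are mapped back to Euclidean co-ordinates on the Poincaré disk through $\theta_h \rightarrow \theta_e$ and $r_h \rightarrow \tanh(r_h / 2)$.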
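The update rules can be sketched as a single training step in plain Python. This is an illustrative reading of Equations (6) to (16), not the authors' TensorFlow implementation. It assumes $u_j = r'_j r_I \cos(\theta_I - \theta'_j)$ in natural co-ordinates, takes the prediction error to be $\sigma(u_j) - t_j$, steps the input angle against the gradient in Equation (15) scaled by $1/\sinh r_I$, and does nothing to keep radii positive. All function names are hypothetical.

```python
import math


def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))


def loss(v_in, outs, targets):
    """Sampled negative log-likelihood for one input vertex."""
    total = 0.0
    for (r_j, th_j), t_j in zip(outs, targets):
        u_j = r_j * v_in[0] * math.cos(v_in[1] - th_j)
        total -= math.log(sigmoid(u_j if t_j else -u_j))
    return total


def update_step(v_in, outs, targets, eta=0.1):
    """One gradient step; v_in is [r, theta], outs is a list of [r', theta'],
    targets holds t_j (1 for the observed output vertex, 0 for negatives)."""
    r_i, th_i = v_in
    grad_r, grad_th = 0.0, 0.0
    new_outs = []
    for (r_j, th_j), t_j in zip(outs, targets):
        u_j = r_j * r_i * math.cos(th_i - th_j)
        eps = sigmoid(u_j) - t_j                              # Eq. (6)
        grad_r += eps * r_j * math.cos(th_i - th_j)           # Eq. (13)
        grad_th += -eps * r_i * r_j * math.sin(th_i - th_j)   # Eq. (15)
        new_r = r_j - eta * eps * r_i * math.cos(th_i - th_j)                # Eq. (10)
        new_th = th_j - eta * eps * r_i * r_j / math.sinh(r_j) * math.sin(th_i - th_j)  # Eq. (11)
        new_outs.append([new_r, new_th])
    new_in = [r_i - eta * grad_r, th_i - eta * grad_th / math.sinh(r_i)]
    return new_in, new_outs


def to_poincare(r_h, theta):
    """Map natural co-ordinates back to Euclidean (x, y) on the unit disk."""
    r_e = math.tanh(r_h / 2.0)
    return r_e * math.cos(theta), r_e * math.sin(theta)


v_in = [2.0, 0.5]
outs = [[2.0, 1.0], [2.0, 0.6]]      # observed output vertex, then one negative
targets = [1, 0]
new_in, new_outs = update_step(v_in, outs, targets)
print(loss(v_in, outs, targets), loss(new_in, new_outs, targets))
print(to_poincare(*new_in))
```

In this toy example the negative sample sits at a similar angle to the input vertex, so the step rotates the two apart and shrinks both radii, and the loss printed after the step is lower than the loss before it.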

## 4. Experimental Evaluation

In this section, we assess the quality of hyperbolic embeddings and compare them to embeddings in Euclidean spaces on a number of public benchmark networks.

### 4.1. Datasets

We report results on five publicly available network datasets for the problem of vertex attribution.

1. Karate: Zachary’s karate club contains 34 vertices divided into two factions (Zachary1977).

2. Polbooks: A network of books about US politics published around the time of the 2004 presidential election and sold by the online bookseller Amazon.com. Edges between books represent frequent co-purchasing of books by the same buyers.

3. Football: A network of American football games between Division IA colleges during regular season Fall 2000 (Girvan2002).

4. Adjnoun: Adjacency network of common adjectives and nouns in the novel David Copperfield by Charles Dickens (Newman2006).

5. Polblogs: A network of hyperlinks between weblogs on US politics, recorded in 2005 (Adamic2005).

Statistics for these datasets are recorded in Table 1.

### 4.2. Visualising Embeddings

To illustrate the utility of hyperbolic embeddings we compare embeddings in the Poincaré disk to the two-dimensional DeepWalk embeddings for the 34-vertex karate network with two factions. The results are shown in Figure 4. Both embeddings were generated by running for five epochs on an intermediate dataset of 34 ten-step random walks, one originating at each vertex.

The figure clearly shows that the hyperbolic embedding is able to capture the community structure of the underlying network. When embedded in hyperbolic space, the two factions (black and white discs) of the underlying graph are linearly separable, while the DeepWalk embedding does not exhibit such an obvious structure.

### 4.3. Vertex Attribute Prediction

We evaluate the success of neural embeddings in hyperbolic space by using the learned embeddings to predict held-out labels of vertices in networks. In our experiments, we compare our embedding to DeepWalk (Perozzi2014) embeddings of dimensions 2, 4, 8, 16, 32, 64 and 128. To generate embeddings we first create an intermediate dataset by taking a series of random walks over the networks. For each network we use a ten-step random walk originating at each vertex.
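The intermediate dataset can be sketched as follows. The code assumes an unweighted graph stored as an adjacency list and a uniform choice among neighbours at every step; neither detail is specified in the text, and the name `random_walks` and the toy graph are illustrative.

```python
import random


def random_walks(adjacency, walk_length=10, seed=0):
    """Generate one walk of `walk_length` steps starting at every vertex."""
    rng = random.Random(seed)
    walks = []
    for start in adjacency:
        walk = [start]
        for _ in range(walk_length):
            neighbours = adjacency[walk[-1]]
            if not neighbours:           # isolated vertex: stop this walk early
                break
            walk.append(rng.choice(neighbours))
        walks.append(walk)
    return walks


# Toy undirected graph as an adjacency list (hypothetical data).
toy_graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
walks = random_walks(toy_graph)
for walk in walks:
    print(walk)
```

Each walk is then treated like a sentence, and (input, context) pairs are extracted from it with a sliding window.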

The embedding models are all trained using the same parameters and intermediate random walk dataset. For DeepWalk, we use the gensim (Rehurek2010) Python package, while our hyperbolic embeddings are implemented in custom TensorFlow code. In both cases, we use five training epochs, a window size of five and do not prune any vertices.

The results of our experiments are shown in Figure 5. The graphs show macro F1 scores against the percentage of labelled data used to train a logistic regression classifier. Here we follow the method for generating F1 scores when each test case can have multiple labels that is described in (Liu2006). The error bars show one standard error from the mean over ten repetitions. The blue lines show hyperbolic embeddings while the red lines depict DeepWalk embeddings at various dimensions. It is apparent that in all datasets hyperbolic embeddings significantly outperform DeepWalk.
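For readers unfamiliar with the metric, the following sketch computes a macro F1 score in the simpler single-label setting: the F1 score is computed per class and then averaged with equal weight. It is not the multi-label procedure of (Liu2006) used for the reported results, and the function name and toy labels are hypothetical.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (single-label case)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


y_true = [0, 0, 1, 1]
y_pred = [0, 1, 1, 1]
print(macro_f1(y_true, y_pred))
```

Because every class contributes equally, macro F1 penalises a classifier that ignores small classes, which matters for networks whose label groups differ in size.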

## 5. Conclusion

We have introduced the concept of neural embeddings in hyperbolic space. To the best of our knowledge, all previous embedding models have assumed a flat Euclidean geometry. However, a flat geometry is not the natural geometry of all data structures. Hyperbolic space has the property that power-law degree distributions, strong clustering and hierarchical community structure emerge naturally when random graphs are embedded in it. It is therefore logical to exploit the structure of hyperbolic space for useful embeddings of complex networks. We have demonstrated that when applied to the task of classifying vertices of complex networks, hyperbolic space embeddings significantly outperform embeddings in Euclidean space.

### Footnotes

2. journalyear: 2017

### References

1. Lada A. Adamic and Natalie Glance. The political blogosphere and the 2004 U.S. election. Proceedings of the 3rd international workshop on Link discovery - LinkKDD ’05, pages 36–43, 2005.
2. Ricardo Baeza-Yates and Diego Saez-Trumper. Wisdom of the Crowd or Wisdom of a Few? Proceedings of the 26th ACM Conference on Hypertext & Social Media - HT ’15, pages 69–74, 2015.
3. Oren Barkan and Noam Koenigstein. Item2Vec: Neural Item Embedding for Collaborative Filtering. arXiv, pages 1–8, 2016.
4. Mikhail Belkin and Partha Niyogi. Laplacian Eigenmaps and Spectral Techniques for Embedding and Clustering. Advances in neural information processing systems, 14:585–591, 2001.
5. Marian Boguna, Fragkiskos Papadopoulos, and Dmitri Krioukov. Sustaining the Internet with Hyperbolic Mapping. Nature Communications, 1(62):62, 2010.
6. Benjamin P. Chamberlain, Angelo Cardoso, Chak H. Liu, Roberto Pagliari, and Marc P. Deisenroth. Customer Life Time Value Prediction Using Embeddings. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2017.
7. James R. Clough and Tim S. Evans. What is the dimension of citation space? Physica A: Statistical Mechanics and its Applications, 448:235–247, 2016.
8. James R. Clough, Jamie Gollings, Tamar V. Loach, and Tim S. Evans. Transitive reduction of citation networks. Journal of Complex Networks, 3(2):189–203, 2015.
9. Michelle Girvan and Mark E. J. Newman. Community structure in social and biological networks. In Proceedings of the national academy of sciences, 99:7821–7826, 2002.
10. Mihajlo Grbovic, Vladan Radosavljevic, Nemanja Djuric, Narayan Bhamidipati, Jaikit Savla, Varun Bhagwan, and Doug Sharp. E-commerce in Your Inbox: Product Recommendations at Scale. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1809–1818, 2015.
11. Mikhail Gromov. Metric Structures for Riemannian and Non-riemannian Spaces. Springer Science and Business Media, 2007.
12. Aditya Grover and Jure Leskovec. node2vec: Scalable Feature Learning for Networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 855–864, 2016.
13. Michael U Gutmann. Noise-Contrastive Estimation of Unnormalized Statistical Models, with Applications to Natural Image Statistics. Journal of Machine Learning Research, 13:307–361, 2012.
14. Farshad Kooti, Mihajlo Grbovic, Luca Maria Aiello, Eric Bax, and Kristina Lerman. iPhone’s Digital Marketplace: Characterizing the Big Spenders. In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining. ACM, pages 13–21, 2017.
15. Dmitri Krioukov, Fragkiskos Papadopoulos, Maksim Kitsak, and Amin Vahdat. Hyperbolic Geometry of Complex Networks. Physical Review E 82.3:036106, 2010.
16. Matt J. Kusner, Yu Sun, Nicholas I. Kolkin, and Kilian Q. Weinberger. From Word Embeddings To Document Distances. In Proceedings of The 32nd International Conference on Machine Learning, 37:957–966, 2015.
17. Yi Liu, Rong Jin, and Liu Yang. Semi-supervised multi-label learning by constrained non-negative matrix factorization. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 6, pages 421–426, 2006.
18. Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Distributed Representations of Words and Phrases and their Compositionality. Advances in neural information processing systems, pages 3111-3119, 2013.
19. Tomas Mikolov, Greg Corrado, Kai Chen, and Jeffrey Dean. Efficient Estimation of Word Representations in Vector Space. Proceedings of the International Conference on Learning Representations, pages 1–12, 2013.
20. Andriy Mnih and Koray Kavukcuoglu. Learning word embeddings efficiently with noise-contrastive estimation. Advances in neural information processing systems, pages 2265-2273, 2013.
21. Andriy Mnih and Geoffrey E. Hinton. A Scalable Hierarchical Distributed Language Model. Advances in Neural Information Processing Systems, pages 1081-1088, 2008.
22. Andriy Mnih and Yee Whye Teh. A Fast and Simple Algorithm for Training Neural Probabilistic Language Models. In Proceedings of the 29th International Conference on Machine Learning, pages 1751–1758, 2012.
23. Mark E. J. Newman. Finding community structure in networks using the eigenvectors of matrices. Physical Review E - Statistical, Nonlinear, and Soft Matter Physics, 74(3):1–19, 2006.
24. Bryan Perozzi and Steven Skiena. DeepWalk: Online Learning of Social Representations. Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, pages 701–710, 2014.
25. Radim Rehurek and Petr Sojka. Software Framework for Topic Modelling with Large Corpora. Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pages 45–50, 2010.
26. Sam T. Roweis and Lawrence K. Saul. Nonlinear Dimensionality Reduction by Locally Linear Embedding. Science, New Series, 290(5500):2323–2326, 2000.
27. Gerard Salton, Anita Wong, and Chung-Shu Yang. A vector space model for automatic indexing. Communications of the ACM, 18(11):613–620, 1975.
28. Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to Sequence Learning with Neural Networks. Advances in neural information processing systems, pages 3104–3112, 2014.
29. Jian Tang, Meng Qu, Mingzhe Wang, Ming Zhang, Jun Yan, and Qiaozhu Mei. LINE: Large-scale Information Network Embedding. Proceedings of the 24th International Conference on World Wide Web. ACM, pages 1067–1077, 2015.
30. Wayne W. Zachary. An information flow model for conflict and fission in small groups. Journal of anthropological research, 33:452–473, 1977.