Beyond Vector Spaces: Compact Data Representation as Differentiable Weighted Graphs
Abstract
Learning useful representations is a key ingredient to the success of modern machine learning. Currently, representation learning mostly relies on embedding data into Euclidean space. However, recent work has shown that data in some domains is better modeled by noneuclidean metric spaces, and inappropriate geometry can result in inferior performance. In this paper, we aim to eliminate the inductive bias imposed by the embedding space geometry. Namely, we propose to map data into more general nonvector metric spaces: a weighted graph with a shortest path distance. By design, such graphs can model arbitrary geometry with a proper configuration of edges and weights. Our main contribution is PRODIGE: a method that learns a weighted graph representation of data endtoend by gradient descent. Greater generality and fewer model assumptions make PRODIGE more powerful than existing embeddingbased approaches. We confirm the superiority of our method via extensive experiments on a wide range of tasks, including classification, compression, and collaborative filtering.
1 Introduction
Nowadays, representation learning is a major component of most data analysis systems; this component aims to capture the essential information in the data and represent it in a form that is useful for the task at hand. Typical examples include word embeddings Mikolov et al. (2013); Pennington et al. (2014); Bojanowski et al. (2017), image embeddings Donahue et al. (2014); Chopra et al. (2005), user/item representations in recommender systems He et al. (2017) and others. To be useful in practice, representation learning methods should meet two main requirements. Firstly, they should be effective, i.e., they should not lose the information needed to achieve high performance in the specific task. Secondly, the constructed representations should be efficient, e.g., have small dimensionality, sparsity, or other structural constraints, imposed by the particular machine learning pipeline. In this paper, we focus on compact data representations, i.e., we aim to achieve the highest performance with the smallest memory consumption.
Most existing methods represent data items as points in some vector space, with the Euclidean space being a "default" choice. However, several recent works Nickel and Kiela (2017); Gu et al. (2018); De Sa et al. (2018) have demonstrated that the quality of representation is heavily influenced by the geometry of the embedding space. In particular, different space curvature can be more appropriate for different types of data Gu et al. (2018). While some prior works aim to learn the geometrical properties of the embedding space from dataGu et al. (2018), all of them assume a vectorial representation of the data, which may be an unnecessary inductive bias.
In contrast, we investigate a more general paradigm by proposing to represent finite datasets as weighted graphs equipped with a shortest path distance. It can be shown that such graphs can represent any geometry for a finite datasetHopkins (2015) and can be naturally treated as finite metric spaces. Specifically, we introduce Probabilistic Differentiable Graph Embeddings (PRODIGE), a method that learns a weighted graph from data, minimizing the problemspecific objective function via gradient descent. Unlike existing methods, PRODIGE does not learn vectorial embeddings of the data points, instead information from the data is effectively stored in the graph structure. Via extensive experiments on several different tasks, we confirm that, in terms of memory consumption, PRODIGE is more efficient than its vectorial counterparts.
To sum up, the contributions of this paper are as follows:

We propose a new paradigm for constructing representations of finite datasets. Instead of constructing pointwise vector representations, our proposed PRODIGE method represents data as a weighted graph equipped with a shortest path distance.

Applied to several tasks, PRODIGE is shown to outperform vectorial embeddings when given the same memory budget; this confirms our claim that PRODIGE can capture information from the data more effectively.

The PyTorch source code of PRODIGE is available online^{1}^{1}1https://github.com/stanismorozov/prodige.
The rest of the paper is organized as follows. We discuss the relevant prior work in Section 2 and describe the general design of PRODIGE in Section 3. In Section 4 we consider several practical tasks and demonstrate how they can benefit from the usage of PRODIGE as a dropin replacement for existing vectorial representation methods. Section 5 concludes the paper.
2 Related work
In this section, we briefly review relevant prior work and describe how the proposed approach relates to the existing machine learning concepts.
Embeddings. Vectorial representations of data have been used in machine learning systems for decades. The first representations were mostly handengineered and did not involve any learning, e.g. SIFTLowe (2004) and GISTOliva and Torralba (2001) representations for images, and ngram frequencies for texts. The recent success of machine learning is largely due to the transition from the handcrafted to learnable data representations in domains, such as NLPMikolov et al. (2013); Pennington et al. (2014); Bojanowski et al. (2017), visionDonahue et al. (2014), speechChung et al. (2018). Most applications now use learnable representations, which are typically vectorial embeddings in Euclidean space.
Embedding space geometry. It has recently been shown that Euclidean space is a suboptimal model for several data domains Nickel and Kiela (2017); Gu et al. (2018); De Sa et al. (2018). In particular, spaces with a hyperbolic geometry are more appropriate to represent data with a latent hierarchical structure, and several papers investigate hyperbolic embeddings for different applications Tifrea et al. (2018); Vinh et al. (2018); Khrulkov et al. (2019). The current consensus appears to be that different geometries are appropriate for solving different problems and there have been attempts to learn these geometrical properties (e.g. curvature) from data Gu et al. (2018). Instead, our method aims to represent data as a weighted graph with a shortest path distance; this by design can express an arbitrary finite metric space.
Connections to graph embeddings. Representing the vertices of a given graph as vectors is a longstanding problem in machine learning and complex networks communities. Modern approaches to this problem rely on graph theory and/or graph neural networks Wu et al. (2019); Ivanov and Burnaev (2018), which are both areas of active research. In some sense, we aim to solve the inverse problem; given data entities, our goal is to learn a graph approximating the distances that satisfy the requirements of the task at hand.
Existing works on graph learning. We are not aware of prior work that proposes a general method to represent data as a weighted graph for any differentiable objective function. Probably, the closest work to ours is Escolano and Hancock (2011); they solve the problem of distancepreserving compression via a dual optimization problem. Their proposed approach lacks endtoend learning and does not generalize to arbitrary loss functions. There are also several approaches that learn graphs from data for specific problems. Some studiesKarasuyama and Mamitsuka (2016); Kang et al. (2019) learn specialized graphs that perform clustering or semisupervised learning. OthersYu et al. (2019); Zheng et al. (2018) focus specifically on learning probabilistic graphical model structure. Most of the proposed approaches are highly problemspecific and do not scale well to large graphs.
3 Method
In this section we describe the general design of PRODIGE and its training procedure.
3.1 Learnable Graph Metric Spaces
Consider an undirected weighted graph , where is a set of vertices, corresponding to data items, and is a set of edges, . We use to denote nonnegative edge weights, which are learnable parameters of our model.
Our model includes another set of learnable parameters that correspond to the probabilities of edges in the graph. Specifically, for each edge , we define a Bernoulli random variable that indicates whether an edge is present in the graph . For simplicity, we assume that all random variables in our model are independent. In this case the joint probability of all edges in the graph can be written as .
The expected distance between any two vertices and can now be expressed as a sum of edge weights along the shortest path:
(1) 
Here denotes the set of all paths from to over the edges of , or more formally, . For a given , the shortest path can be found exactly using Dijkstra’s algorithm. Generally speaking, a shortest path is not guaranteed to exist, e.g., if is disconnected; in this case we define to be equal to a sufficiently large constant.
The parameters of our model must satisfy two constraints: for weights and for probabilities. We avoid constrained optimization by defining as Zheng et al. (2015) and as .
This model can be trained by minimizing an arbitrary differentiable objective function with respect to parameters directly by gradient descent. We explore several problemspecific objective functions in Section 4.
3.2 Sparsity
In order to learn a compact representation, we encourage the algorithm to learn sparse graphs by imposing additional regularization on . Namely, we employ a recent sparsification technique proposed in Louizos et al. (2018). Denoting the taskspecific loss function as , the regularized training objective can be written as:
(2) 
Intuitively, the term penalizes the number of edges being used, on average. It effectively encourages sparsity by forcing the edge probabilities to decay over time. For instance, if a certain edge has no effect on the main objective (e.g., the edge never occurs on any shortest path), the optimal probability of that edge being present is exactly zero. The regularization coefficient affects the “zeal” of edge pruning, with larger values of corresponding to greater sparsity of the learned graph.
We minimize this regularized objective function (2) by stochastic gradient descent. The gradients are straightforward to compute using existing autograd packages, such as TensorFlow or PyTorch. The gradients are, however, more tricky and require the logderivative trickGlynn (1990):
(3) 
In practice, we can use a MonteCarlo estimate of gradient (3). However, the variance of this estimate can be too large to be used in practical optimization. To reduce the variance, we use the fact that the optimal path usually contains only a tiny subset of all possible edges. More formally, if the objective function only depends on a subgraph , then we can integrate out all edges from :
(4) 
The expression (4) allows for an efficient training procedure that only samples edges that are required by Dijkstra’s algorithm. More specifically, on every iteration the pathfinding algorithm selects a vertex and explores its neighbors by sampling corresponding to the edges that are potentially connected to . In this case, the size of is proportional to the number of iterations of Dijkstra’s algorithm, which is typically lower than the number of vertices in the original graph .
Finally, once the training procedure converges, most edges in the obtained graph are nearly deterministic: . We make this graph exactly deterministic by keeping only the edges with probability greater or equal to .
3.3 Scalability
As the total number of edges in a complete graph grows quadratically with the number of vertices , memory consumption during PRODIGE training on large datasets can be infeasible. To reduce memory requirements, we explicitly restrict a subset of possible edges and learn probabilities only for edges from this subset. The subset of possible edges is constructed by a simple heuristic described below.
First, we add an edge from each data item to most similar items in terms of problemspecific similarity in the original data space. Second, we also add random edges between uniformly chosen source and destination vertices. Overall, the size of the constructed subset scales linearly with the number of vertices , which makes training feasible for largescale datasets. In our experiments, we observe that increasing the number of possible edges improves the model performance until some saturation point, typically with edges per vertex.
Finally, we observed that the choice of an optimization algorithm is crucial for the convergence speed of our method. In particular, we use SparseAdamKingma and Ba (2015) as it significantly outperformed other sparse SGD alternatives in our experiments.
4 Applications
In this section, we experimentally evaluate the proposed PRODIGE model on several practical tasks and compare it to taskspecific baselines.
4.1 Distancepreserving compression
Distancepreserving compression learns a compact representation of highdimensional data that preserves the pairwise distances from the original highdimensional space. The learned representations can then be used in practical applications, e.g., data visualization.
Objective. We optimize the squared error between pairwise distances in the original and compressed spaces. Since PRODIGE graphs are stochastic, we minimize this objective in expectation over edge probabilities:
(5) 
In the formula above, are objects in the original highdimensional Euclidean space and are the corresponding graph vertices. Note that the objective (5) resembles the wellknown Multidimensional Scaling algorithmKruskal (1964).
However, the objective (5) does not account for the graph size in the learned model. This can likely lead to a trivial solution since the PRODIGE graph can simply memorize . Therefore, we use the sparsification technique described earlier in Section 3.2. We also employ the initialization heuristic from Section 3.3 to speed up training. Namely, we start with 64 edges per vertex, half of which are links to the nearest neighbors and the other half are random edges.
Experimental setup. We experiment with three publicly available datasets:

MNIST10k: images from the test set of the MNIST dataset, represented as dimensional vectors;

GLOVE10k: top most frequent words, represented as dimensional pretrained^{1}^{1}1We used a pretrained gensim model "glovewikigigaword300"; see https://radimrehurek.com/gensim/downloader.html for details GloVePennington et al. (2014) vectors;

CelebA10K: dimensional embeddings of random face photographs from the CelebA dataset, produced by deep CNN^{2}^{2}2The dataset was obtained by running face_recognition package on CelebA images and uniformly sampling face embeddings, see https://github.com/ageitgey/face_recognition.
In these experiments, we aim to preserve the Euclidean distances (5) for all datasets. Note, however, that any distance function can be used in PRODIGE.
Method  #parameters  #parameters  MNIST10k  GLOVE10k  CelebA10k 

per instance  total  
4 parameters per instance  
PRODIGE  39.2k  0.00938  0.03289  0.00374  
MDS  4  40k  0.05414  0.13142  0.01678 
Poincare MDS  4  40k  0.04683  0.11386  0.01649 
PCA  4  40k  0.30418  0.84912  0.09078 
8 parameters per instance  
PRODIGE  76.5k  0.00886  0.02856  0.00367  
MDS  8  80k  0.01857  0.05584  0.00621 
Poincare MDS  8  80k  0.01503  0.04839  0.00619 
PCA  8  80k  0.16237  0.62424  0.05298 
We compare our method with three baselines, which construct vectorial embeddings:

Poincare MDS is a version of MDS that maps data into Poincare Ball. This method approximates the original distance with a hyperbolic distance between learned vector embeddings:

PCA. Principal Component Analysis is the most popular techinique for data compression. We include this method for sanity check.
For all the methods, we compare the performance given the same memory budget. Namely, we investigate two operating points, corresponding to four and eight 32bit numbers per data item. For embeddingbased baselines, this corresponds to 4dimensional and 8dimensional embeddings, respectively. As for PRODIGE, it requires a total of parameters where is a number of objects and is the number of edges with . The learned graph is represented as a sequence of edges ordered by the source vertex, and each edge is stored as a tuple of target vertex (int32) and weight (float32). We tune the regularization coefficient to achieve the overall memory consumption close to the considered operating points. The distance approximation errors for all methods are reported in Table 1, which illustrates that PRODIGE outperforms the embeddingbased counterparts by a large margin. These results confirm that the proposed graph representations are more efficient in capturing the underlying data geometry compared to vectorial embeddings. We also verify the robustness of our training procedure by running several experiments with different random initializations and different initial numbers of neighbors. Figure 4.1 shows the learning curves of PRODIGE under various conditions for GLOVE10K dataset and four numbers/vertex budget. While these results exhibit some variability due to optimization stochasticity, the overall training procedure is stable and robust.
Qualitative results. To illustrate the graphs obtained by learning the PRODIGE model, we train it on a toy dataset containing 100 randomly sampled MNIST images of five classes. We start the training from a full graph, which contains edges, and increase till the moment when only of edges are preserved. The tSNEMaaten and Hinton (2008) plot of the obtained graph, based on distances, is shown on Figure 1.
Figure 1 reveals several properties of the PRODIGE graphs. First, the number of edges per vertex is very uneven, with a large fraction of edges belonging to a few high degree vertices. We assume that these "popular" vertices play the role of "traffic hubs" for pathfinding in the graph. The shortest path between distant vertices is likely to begin by reaching the nearest "hub", then travel over the "long" edges to the hub that "contains" the target vertex, after which it would follow the short edges to reach the target itself.
Another important observation is that nonhub vertices tend to have only a few edges. We conjecture that this is the key to the memoryefficiency of our method. Effectively, the PRODIGE graph represents most of the data items by their relation to one or several "landmark" vertices, corresponding to graph hubs. Interestingly, this representation resembles the topology of humanmade transport networks with few dedicated hubs and a large number of peripheral nodes. We plot the distribution of vertex degrees on Figure 4.1, which resembles powerlaw distribution, demonstrating "scalefree" property of PRODIGE graphs.
Sanity check. In this experiment we also verify that PRODIGE is able to reconstruct graphs from datapoints, which mutual distances are obtained as shortest paths in some graph. Namely, we generate connected Erdős–Rényi graphs with 1025 vertices with edge probability and edge weights sampled from uniform distribution. We then train PRODIGE to reconstruct these graphs given pairwise distances between vertices. Out of random graphs, in cases PRODIGE was able to reconstruct all edges that affected shortest paths and all 100 runs the resulting graph had distances approximation error below .
4.2 Collaborative Filtering
In the next series of experiments, we investigate the usage of PRODIGE in the collaborative filtering task, which is a key component of modern recommendation systems.
Consider a sparse binary useritem matrix that represents the preferences of users over items. means that the th item is relevant for the th user. means that the th item is not relevant for the th user or the relevance information is absent. The recommendation task is to extract the most relevant items for the particular user. A common performance measure for this task is HitRatio@k. This metric measures how frequently there is at least one relevant item among topk recommendations suggested by the algorithm. Note that in these experiments we consider the simplest setting where user preferences are binary. All experiments are performed on the Pinterest datasetGeng et al. (2015).
Objective. Intuitively, we want our model to rank the relevant items above the nonrelevant ones. We achieve this via maximizing the following objective:
(6) 
For each user , this objective enforces the positive items to be closer in terms of than the negative items . We sample positive items uniformly from the training items that are relevant to . In turn, are sampled uniformly from all items. In practice, we only need to run Dijkstra’s algorithm once per each user to calculate the distances to all positive and negative items.
Similarly to the previous section, we speed up the training stage by considering only a small subset of edges. Namely, we build an initial graph using as adjacency matrix and add extra edges that connect the nearest users (useruser edges) and the nearest items (itemitem edges) in terms of cosine distance between the corresponding rows/columns of .
For this task, we compare the following methods:

PRODIGEnormal: PRODIGE method as described above; we restrict a set of possible edges to include 16 useruser and itemitem edges and all relevant useritem edges available in the training data;

PRODIGEbipartite: a version of PRODIGE that is allowed to learn only edges that connect users to items, counting approximately 30 edges per item. The useruser and the itemitem edges are prohibited;

PRODIGErandom: a version of our model that is initialized with completely random edges. All useruser, itemitem, and useritem edges are sampled uniformly. In total, we sample 50 edges per each user and item;

SVD: truncated Singular Value Decomposition of useritem matrix^{3}^{3}3we use the Implicit package for both ALS and SVD baselines https://github.com/benfred/implicit;

ALS: Alternating Least Squares method for implicit feedbackHu et al. (2008);

Metric Learning: metric learning approach that learns the user/item embeddings in the Euclidean space and optimizes the same objective function (6) with Euclidean distance between user and item embeddings; all other training parameters are shared with PRODIGE.
The comparison results for two memory budgets are presented in Table 2. It illustrates that our PRODIGE model achieves better overall recommendation quality compared to embeddingbased methods. Interestingly, the bipartite graph performs nearly as well as the taskspecific heuristic with nearest edges. In contrast, starting from a random set of edges results in poor performance, which indicates the importance of initial edges choice.
Method  PRODIGE  SVD  ALS  Metric Learning  
normal  bipartite  random  
4 parameters per user/item  
HR@5  0.50213  0.48533  0.35587  0.33365  0.37005  0.45898 
HR@10  0.66192  0.64250  0.57492  0.49619  0.51815  0.60079 
8 parameters per user/item  
HR@5  0.59921  0.57659  0.50113  0.5107  0.48617  0.54728 
HR@10  0.79021  0.77980  0.73485  0.70489  0.70075  0.74891 
4.3 Sentiment classification
As another application, we consider a simple text classification problem: the algorithm takes a sequence of tokens (words) as input and predicts a single class label . This problem arises in a wide range of tasks, such as sentiment classification or spam detection
In this experiment, we explore the potential applications of learnable weighted graphs as an intermediate data representation within a multilayer model. Our goal here is to learn graph edges endtoend using the gradients from the subsequent layers. To make the whole pipeline fully differentiable, we design a projection of data items, represented as graph vertices, into feature vectors digestible by subsequent convolutional or dense layers.
Namely, a data item is represented as a vector of distances to predefined "anchor" vertices:
(7) 
In practice, we add the "anchor" vertices to a graph before training and connect them to random vertices. Note that the "anchor" vertices do not correspond to any data object. The architecture used in this experiment is schematically presented on Figure 2.
Intuitively, the usage of PRODIGE in architecture from Figure 2 can be treated as a generalization of vectorial embedding layer. For instance, if a graph contains only the edges between vertices and anchors, this is equivalent to embedding with trainable parameters. However, in practice, our regularized PRODIGE model learns a more compact graph by using vertexvertex edges to encode words via their relation to other words.
Model and objective. We train a simple sentiment classification model with four consecutive layers: embedding layer, onedimensional convolutional layer with 32 output filters, followed by a global max pooling layer, a ReLU nonlinearity and a final dense layer that predicts class logits. Indeed, this model is smaller than the stateoftheart models for sentiment classification and should be considered only as a proof of concept. We minimize the crossentropy objective by backpropagation, computing gradients w.r.t all trainable parameters including graph edges .
Experimental setup. We evaluate our model on the IMDB benchmark Maas et al. (2011), a popular dataset for text sentiment binary classification. The data is split into training and test sets, each containing text instances. For simplicity, we only consider most frequent tokens.
We compare our model with embeddingbased baselines, which follow the same architecture from Figure 2, but with the standard embedding layer instead of PRODIGE. All embeddings are initialized with pretrained word representations and then finetuned with the subsequent layers by backpropagation. As pretrained embeddings, we use GloVe vectors^{4}^{4}4We used https://pypi.org/project/gensim/ to train GloVe model with recommended parameters trained on texts from the training set and select vectors corresponding to the most frequent tokens. In PRODIGE, the model graph is pretrained by distancepreserving compression of the GloVe embeddings, as described in Section 4.1. In order to encode the "anchor" objects, we explicitly add synthetic objects to the data by running Kmeans clustering and compress the resulting objects by minimizing the objective (5).
Representation  PRODIGE  GloVe  
100d  100d  18d  
Accuracy  0.8443  0.8483  0.8028 
Model size  2.16 MB  12.24 MB  2.20 MB 
For each model, we report test accuracy and the total model size. The results in Table 3 illustrate that PRODIGE learns a model with nearly the same quality as full dimensional embeddings with much smaller size. On the other hand, PRODIGE significantly outperforms its vectorial counterpart of the same model size.
5 Discussion
In this work, we introduce PRODIGE, a new method constructing representations of finite datasets. The method represents the dataset as a weighted graph with a shortest distance path metric, which is able to capture geometry of any finite metric space. Due to minimal inductive bias, PRODIGE captures the essential information from data more effectively compared to embeddingbased representation learning methods. The graphs are learned endtoend via minimizing any differentiable objective function, defined by the target task. We empirically confirm the advantage of PRODIGE in several machine learning problems and publish its source code online.
Acknowledgements: Authors would like to thank David Talbot for his numerous suggestions on paper readability. We would also like to thank anonymous metareviewer for his insightful feedback. Vage Egiazaryan was supported by the Russian Science Foundation under Grant 194104109.
References
 [1] (2017) Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics. Cited by: §1, §2.
 [2] (2005) Learning a similarity metric discriminatively, with application to face verification. In CVPR (1), Cited by: §1.
 [3] (2018) VoxCeleb2: deep speaker recognition. arXiv preprint arXiv:1806.05622. Cited by: §2.
 [4] (2018) Representation tradeoffs for hyperbolic embeddings. arXiv preprint arXiv:1804.03329. Cited by: §1, §2.
 [5] (2014) Decaf: a deep convolutional activation feature for generic visual recognition. In International conference on machine learning, Cited by: §1, §2.
 [6] (2011) From points to nodes: inverse graph embedding through a lagrangian formulation. In CAIP, Cited by: §2.
 [7] (2015) Learning image and user features for recommendation in social networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4274–4282. Cited by: §4.2.
 [8] (1990) Likelihood ratio gradient estimation for stochastic systems. Communications of the ACM. Cited by: §3.2.
 [9] (2018) Learning mixedcurvature representations in product spaces. Cited by: §1, §2.
 [10] (2017) Neural collaborative filtering. In Proceedings of the 26th International Conference on World Wide Web, Cited by: §1.
 [11] (2015) Finite metric spaces and their embedding into lebesgue spaces. Cited by: §1.
 [12] (2008) Collaborative filtering for implicit feedback datasets. 2008 Eighth IEEE International Conference on Data Mining, pp. 263–272. Cited by: 5th item.
 [13] (2018) Anonymous walk embeddings. ArXiv abs/1805.11921. Cited by: §2.
 [14] (2019) Robust graph learning from noisy data. IEEE transactions on cybernetics. Cited by: §2.
 [15] (2016) Adaptive edge weighting for graphbased learning algorithms. Machine Learning 106, pp. 307–335. Cited by: §2.
 [16] (2019) Hyperbolic image embeddings. arXiv preprint arXiv:1904.02239. Cited by: §2.
 [17] (2015) Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 79, 2015, Conference Track Proceedings, Cited by: §3.3.
 [18] (1964) Nonmetric multidimensional scaling: a numerical method. Psychometrika 29 (2), pp. 115–129. Cited by: 1st item, §4.1.
 [19] (2018) Learning sparse neural networks through l_0 regularization. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30  May 3, 2018, Conference Track Proceedings, Cited by: §3.2.
 [20] (2004) Distinctive image features from scaleinvariant keypoints. International journal of computer vision 60 (2), pp. 91–110. Cited by: §2.
 [21] (201106) Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Cited by: §4.3.
 [22] (2008) Visualizing data using tsne. Journal of machine learning research 9 (Nov), pp. 2579–2605. Cited by: Figure 1, §4.1.
 [23] (2013) Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, Cited by: §1, §2.
 [24] (2017) Poincaré embeddings for learning hierarchical representations. In Advances in neural information processing systems, Cited by: §1, §2.
 [25] (2001) Modeling the shape of the scene: a holistic representation of the spatial envelope. International journal of computer vision 42 (3), pp. 145–175. Cited by: §2.
 [26] (2014) Glove: global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), Cited by: §1, §2, 2nd item.
 [27] (2018) Poincar’e glove: hyperbolic word embeddings. arXiv preprint arXiv:1810.06546. Cited by: §2.
 [28] (2018) Hyperbolic recommender systems. arXiv preprint arXiv:1809.01703. Cited by: §2.
 [29] (2019) A comprehensive survey on graph neural networks. arXiv preprint arXiv:1901.00596. Cited by: §2.
 [30] (2019) DAGgnn: dag structure learning with graph neural networks. CoRR abs/1904.10098. Cited by: §2.
 [31] (2015) Improving deep neural networks using softplus units. In 2015 International Joint Conference on Neural Networks, IJCNN 2015, Killarney, Ireland, July 1217, 2015, pp. 1–4. Cited by: §3.1.
 [32] (2018) DAGs with no tears: continuous optimization for structure learning. In NeurIPS, Cited by: §2.