Beyond Vector Spaces: Compact Data Representation as Differentiable Weighted Graphs
Learning useful representations is a key ingredient to the success of modern machine learning. Currently, representation learning mostly relies on embedding data into Euclidean space. However, recent work has shown that data in some domains is better modeled by non-euclidean metric spaces, and inappropriate geometry can result in inferior performance. In this paper, we aim to eliminate the inductive bias imposed by the embedding space geometry. Namely, we propose to map data into more general non-vector metric spaces: a weighted graph with a shortest path distance. By design, such graphs can model arbitrary geometry with a proper configuration of edges and weights. Our main contribution is PRODIGE: a method that learns a weighted graph representation of data end-to-end by gradient descent. Greater generality and fewer model assumptions make PRODIGE more powerful than existing embedding-based approaches. We confirm the superiority of our method via extensive experiments on a wide range of tasks, including classification, compression, and collaborative filtering.
Nowadays, representation learning is a major component of most data analysis systems; this component aims to capture the essential information in the data and represent it in a form that is useful for the task at hand. Typical examples include word embeddings Mikolov et al. (2013); Pennington et al. (2014); Bojanowski et al. (2017), image embeddings Donahue et al. (2014); Chopra et al. (2005), user/item representations in recommender systems He et al. (2017) and others. To be useful in practice, representation learning methods should meet two main requirements. Firstly, they should be effective, i.e., they should not lose the information needed to achieve high performance in the specific task. Secondly, the constructed representations should be efficient, e.g., have small dimensionality, sparsity, or other structural constraints, imposed by the particular machine learning pipeline. In this paper, we focus on compact data representations, i.e., we aim to achieve the highest performance with the smallest memory consumption.
Most existing methods represent data items as points in some vector space, with the Euclidean space being a "default" choice. However, several recent works Nickel and Kiela (2017); Gu et al. (2018); De Sa et al. (2018) have demonstrated that the quality of representation is heavily influenced by the geometry of the embedding space. In particular, different space curvature can be more appropriate for different types of data Gu et al. (2018). While some prior works aim to learn the geometrical properties of the embedding space from dataGu et al. (2018), all of them assume a vectorial representation of the data, which may be an unnecessary inductive bias.
In contrast, we investigate a more general paradigm by proposing to represent finite datasets as weighted graphs equipped with a shortest path distance. It can be shown that such graphs can represent any geometry for a finite datasetHopkins (2015) and can be naturally treated as finite metric spaces. Specifically, we introduce Probabilistic Differentiable Graph Embeddings (PRODIGE), a method that learns a weighted graph from data, minimizing the problem-specific objective function via gradient descent. Unlike existing methods, PRODIGE does not learn vectorial embeddings of the data points, instead information from the data is effectively stored in the graph structure. Via extensive experiments on several different tasks, we confirm that, in terms of memory consumption, PRODIGE is more efficient than its vectorial counterparts.
To sum up, the contributions of this paper are as follows:
We propose a new paradigm for constructing representations of finite datasets. Instead of constructing pointwise vector representations, our proposed PRODIGE method represents data as a weighted graph equipped with a shortest path distance.
Applied to several tasks, PRODIGE is shown to outperform vectorial embeddings when given the same memory budget; this confirms our claim that PRODIGE can capture information from the data more effectively.
The PyTorch source code of PRODIGE is available online111https://github.com/stanis-morozov/prodige.
The rest of the paper is organized as follows. We discuss the relevant prior work in Section 2 and describe the general design of PRODIGE in Section 3. In Section 4 we consider several practical tasks and demonstrate how they can benefit from the usage of PRODIGE as a drop-in replacement for existing vectorial representation methods. Section 5 concludes the paper.
2 Related work
In this section, we briefly review relevant prior work and describe how the proposed approach relates to the existing machine learning concepts.
Embeddings. Vectorial representations of data have been used in machine learning systems for decades. The first representations were mostly hand-engineered and did not involve any learning, e.g. SIFTLowe (2004) and GISTOliva and Torralba (2001) representations for images, and n-gram frequencies for texts. The recent success of machine learning is largely due to the transition from the handcrafted to learnable data representations in domains, such as NLPMikolov et al. (2013); Pennington et al. (2014); Bojanowski et al. (2017), visionDonahue et al. (2014), speechChung et al. (2018). Most applications now use learnable representations, which are typically vectorial embeddings in Euclidean space.
Embedding space geometry. It has recently been shown that Euclidean space is a suboptimal model for several data domains Nickel and Kiela (2017); Gu et al. (2018); De Sa et al. (2018). In particular, spaces with a hyperbolic geometry are more appropriate to represent data with a latent hierarchical structure, and several papers investigate hyperbolic embeddings for different applications Tifrea et al. (2018); Vinh et al. (2018); Khrulkov et al. (2019). The current consensus appears to be that different geometries are appropriate for solving different problems and there have been attempts to learn these geometrical properties (e.g. curvature) from data Gu et al. (2018). Instead, our method aims to represent data as a weighted graph with a shortest path distance; this by design can express an arbitrary finite metric space.
Connections to graph embeddings. Representing the vertices of a given graph as vectors is a long-standing problem in machine learning and complex networks communities. Modern approaches to this problem rely on graph theory and/or graph neural networks Wu et al. (2019); Ivanov and Burnaev (2018), which are both areas of active research. In some sense, we aim to solve the inverse problem; given data entities, our goal is to learn a graph approximating the distances that satisfy the requirements of the task at hand.
Existing works on graph learning. We are not aware of prior work that proposes a general method to represent data as a weighted graph for any differentiable objective function. Probably, the closest work to ours is Escolano and Hancock (2011); they solve the problem of distance-preserving compression via a dual optimization problem. Their proposed approach lacks end-to-end learning and does not generalize to arbitrary loss functions. There are also several approaches that learn graphs from data for specific problems. Some studiesKarasuyama and Mamitsuka (2016); Kang et al. (2019) learn specialized graphs that perform clustering or semi-supervised learning. OthersYu et al. (2019); Zheng et al. (2018) focus specifically on learning probabilistic graphical model structure. Most of the proposed approaches are highly problem-specific and do not scale well to large graphs.
In this section we describe the general design of PRODIGE and its training procedure.
3.1 Learnable Graph Metric Spaces
Consider an undirected weighted graph , where is a set of vertices, corresponding to data items, and is a set of edges, . We use to denote non-negative edge weights, which are learnable parameters of our model.
Our model includes another set of learnable parameters that correspond to the probabilities of edges in the graph. Specifically, for each edge , we define a Bernoulli random variable that indicates whether an edge is present in the graph . For simplicity, we assume that all random variables in our model are independent. In this case the joint probability of all edges in the graph can be written as .
The expected distance between any two vertices and can now be expressed as a sum of edge weights along the shortest path:
Here denotes the set of all paths from to over the edges of , or more formally, . For a given , the shortest path can be found exactly using Dijkstra’s algorithm. Generally speaking, a shortest path is not guaranteed to exist, e.g., if is disconnected; in this case we define to be equal to a sufficiently large constant.
The parameters of our model must satisfy two constraints: for weights and for probabilities. We avoid constrained optimization by defining as Zheng et al. (2015) and as .
This model can be trained by minimizing an arbitrary differentiable objective function with respect to parameters directly by gradient descent. We explore several problem-specific objective functions in Section 4.
In order to learn a compact representation, we encourage the algorithm to learn sparse graphs by imposing additional regularization on . Namely, we employ a recent sparsification technique proposed in Louizos et al. (2018). Denoting the task-specific loss function as , the regularized training objective can be written as:
Intuitively, the term penalizes the number of edges being used, on average. It effectively encourages sparsity by forcing the edge probabilities to decay over time. For instance, if a certain edge has no effect on the main objective (e.g., the edge never occurs on any shortest path), the optimal probability of that edge being present is exactly zero. The regularization coefficient affects the “zeal” of edge pruning, with larger values of corresponding to greater sparsity of the learned graph.
We minimize this regularized objective function (2) by stochastic gradient descent. The gradients are straightforward to compute using existing autograd packages, such as TensorFlow or PyTorch. The gradients are, however, more tricky and require the log-derivative trickGlynn (1990):
In practice, we can use a Monte-Carlo estimate of gradient (3). However, the variance of this estimate can be too large to be used in practical optimization. To reduce the variance, we use the fact that the optimal path usually contains only a tiny subset of all possible edges. More formally, if the objective function only depends on a subgraph , then we can integrate out all edges from :
The expression (4) allows for an efficient training procedure that only samples edges that are required by Dijkstra’s algorithm. More specifically, on every iteration the path-finding algorithm selects a vertex and explores its neighbors by sampling corresponding to the edges that are potentially connected to . In this case, the size of is proportional to the number of iterations of Dijkstra’s algorithm, which is typically lower than the number of vertices in the original graph .
Finally, once the training procedure converges, most edges in the obtained graph are nearly deterministic: . We make this graph exactly deterministic by keeping only the edges with probability greater or equal to .
As the total number of edges in a complete graph grows quadratically with the number of vertices , memory consumption during PRODIGE training on large datasets can be infeasible. To reduce memory requirements, we explicitly restrict a subset of possible edges and learn probabilities only for edges from this subset. The subset of possible edges is constructed by a simple heuristic described below.
First, we add an edge from each data item to most similar items in terms of problem-specific similarity in the original data space. Second, we also add random edges between uniformly chosen source and destination vertices. Overall, the size of the constructed subset scales linearly with the number of vertices , which makes training feasible for large-scale datasets. In our experiments, we observe that increasing the number of possible edges improves the model performance until some saturation point, typically with edges per vertex.
Finally, we observed that the choice of an optimization algorithm is crucial for the convergence speed of our method. In particular, we use SparseAdamKingma and Ba (2015) as it significantly outperformed other sparse SGD alternatives in our experiments.
In this section, we experimentally evaluate the proposed PRODIGE model on several practical tasks and compare it to task-specific baselines.
4.1 Distance-preserving compression
Distance-preserving compression learns a compact representation of high-dimensional data that preserves the pairwise distances from the original high-dimensional space. The learned representations can then be used in practical applications, e.g., data visualization.
Objective. We optimize the squared error between pairwise distances in the original and compressed spaces. Since PRODIGE graphs are stochastic, we minimize this objective in expectation over edge probabilities:
In the formula above, are objects in the original high-dimensional Euclidean space and are the corresponding graph vertices. Note that the objective (5) resembles the well-known Multidimensional Scaling algorithmKruskal (1964).
However, the objective (5) does not account for the graph size in the learned model. This can likely lead to a trivial solution since the PRODIGE graph can simply memorize . Therefore, we use the sparsification technique described earlier in Section 3.2. We also employ the initialization heuristic from Section 3.3 to speed up training. Namely, we start with 64 edges per vertex, half of which are links to the nearest neighbors and the other half are random edges.
Experimental setup. We experiment with three publicly available datasets:
MNIST10k: images from the test set of the MNIST dataset, represented as -dimensional vectors;
CelebA10K: -dimensional embeddings of random face photographs from the CelebA dataset, produced by deep CNN222The dataset was obtained by running face_recognition package on CelebA images and uniformly sampling face embeddings, see https://github.com/ageitgey/face_recognition.
In these experiments, we aim to preserve the Euclidean distances (5) for all datasets. Note, however, that any distance function can be used in PRODIGE.
|4 parameters per instance|
|8 parameters per instance|
We compare our method with three baselines, which construct vectorial embeddings:
Poincare MDS is a version of MDS that maps data into Poincare Ball. This method approximates the original distance with a hyperbolic distance between learned vector embeddings:
PCA. Principal Component Analysis is the most popular techinique for data compression. We include this method for sanity check.
For all the methods, we compare the performance given the same memory budget. Namely, we investigate two operating points, corresponding to four and eight 32-bit numbers per data item. For embedding-based baselines, this corresponds to 4-dimensional and 8-dimensional embeddings, respectively. As for PRODIGE, it requires a total of parameters where is a number of objects and is the number of edges with . The learned graph is represented as a sequence of edges ordered by the source vertex, and each edge is stored as a tuple of target vertex (int32) and weight (float32). We tune the regularization coefficient to achieve the overall memory consumption close to the considered operating points. The distance approximation errors for all methods are reported in Table 1, which illustrates that PRODIGE outperforms the embedding-based counterparts by a large margin. These results confirm that the proposed graph representations are more efficient in capturing the underlying data geometry compared to vectorial embeddings. We also verify the robustness of our training procedure by running several experiments with different random initializations and different initial numbers of neighbors. Figure 4.1 shows the learning curves of PRODIGE under various conditions for GLOVE10K dataset and four numbers/vertex budget. While these results exhibit some variability due to optimization stochasticity, the overall training procedure is stable and robust.
Qualitative results. To illustrate the graphs obtained by learning the PRODIGE model, we train it on a toy dataset containing 100 randomly sampled MNIST images of five classes. We start the training from a full graph, which contains edges, and increase till the moment when only of edges are preserved. The tSNEMaaten and Hinton (2008) plot of the obtained graph, based on distances, is shown on Figure 1.
Figure 1 reveals several properties of the PRODIGE graphs. First, the number of edges per vertex is very uneven, with a large fraction of edges belonging to a few high degree vertices. We assume that these "popular" vertices play the role of "traffic hubs" for pathfinding in the graph. The shortest path between distant vertices is likely to begin by reaching the nearest "hub", then travel over the "long" edges to the hub that "contains" the target vertex, after which it would follow the short edges to reach the target itself.
Another important observation is that non-hub vertices tend to have only a few edges. We conjecture that this is the key to the memory-efficiency of our method. Effectively, the PRODIGE graph represents most of the data items by their relation to one or several "landmark" vertices, corresponding to graph hubs. Interestingly, this representation resembles the topology of human-made transport networks with few dedicated hubs and a large number of peripheral nodes. We plot the distribution of vertex degrees on Figure 4.1, which resembles power-law distribution, demonstrating "scale-free" property of PRODIGE graphs.
Sanity check. In this experiment we also verify that PRODIGE is able to reconstruct graphs from datapoints, which mutual distances are obtained as shortest paths in some graph. Namely, we generate connected Erdős–Rényi graphs with 10-25 vertices with edge probability and edge weights sampled from uniform distribution. We then train PRODIGE to reconstruct these graphs given pairwise distances between vertices. Out of random graphs, in cases PRODIGE was able to reconstruct all edges that affected shortest paths and all 100 runs the resulting graph had distances approximation error below .
4.2 Collaborative Filtering
In the next series of experiments, we investigate the usage of PRODIGE in the collaborative filtering task, which is a key component of modern recommendation systems.
Consider a sparse binary user-item matrix that represents the preferences of users over items. means that the -th item is relevant for the -th user. means that the -th item is not relevant for the -th user or the relevance information is absent. The recommendation task is to extract the most relevant items for the particular user. A common performance measure for this task is HitRatio@k. This metric measures how frequently there is at least one relevant item among top-k recommendations suggested by the algorithm. Note that in these experiments we consider the simplest setting where user preferences are binary. All experiments are performed on the Pinterest datasetGeng et al. (2015).
Objective. Intuitively, we want our model to rank the relevant items above the non-relevant ones. We achieve this via maximizing the following objective:
For each user , this objective enforces the positive items to be closer in terms of than the negative items . We sample positive items uniformly from the training items that are relevant to . In turn, are sampled uniformly from all items. In practice, we only need to run Dijkstra’s algorithm once per each user to calculate the distances to all positive and negative items.
Similarly to the previous section, we speed up the training stage by considering only a small subset of edges. Namely, we build an initial graph using as adjacency matrix and add extra edges that connect the nearest users (user-user edges) and the nearest items (item-item edges) in terms of cosine distance between the corresponding rows/columns of .
For this task, we compare the following methods:
PRODIGE-normal: PRODIGE method as described above; we restrict a set of possible edges to include 16 user-user and item-item edges and all relevant user-item edges available in the training data;
PRODIGE-bipartite: a version of PRODIGE that is allowed to learn only edges that connect users to items, counting approximately 30 edges per item. The user-user and the item-item edges are prohibited;
PRODIGE-random: a version of our model that is initialized with completely random edges. All user-user, item-item, and user-item edges are sampled uniformly. In total, we sample 50 edges per each user and item;
SVD: truncated Singular Value Decomposition of user-item matrix333we use the Implicit package for both ALS and SVD baselines https://github.com/benfred/implicit;
ALS: Alternating Least Squares method for implicit feedbackHu et al. (2008);
Metric Learning: metric learning approach that learns the user/item embeddings in the Euclidean space and optimizes the same objective function (6) with Euclidean distance between user and item embeddings; all other training parameters are shared with PRODIGE.
The comparison results for two memory budgets are presented in Table 2. It illustrates that our PRODIGE model achieves better overall recommendation quality compared to embedding-based methods. Interestingly, the bipartite graph performs nearly as well as the task-specific heuristic with nearest edges. In contrast, starting from a random set of edges results in poor performance, which indicates the importance of initial edges choice.
|4 parameters per user/item|
|8 parameters per user/item|
4.3 Sentiment classification
As another application, we consider a simple text classification problem: the algorithm takes a sequence of tokens (words) as input and predicts a single class label . This problem arises in a wide range of tasks, such as sentiment classification or spam detection
In this experiment, we explore the potential applications of learnable weighted graphs as an intermediate data representation within a multi-layer model. Our goal here is to learn graph edges end-to-end using the gradients from the subsequent layers. To make the whole pipeline fully differentiable, we design a projection of data items, represented as graph vertices, into feature vectors digestible by subsequent convolutional or dense layers.
Namely, a data item is represented as a vector of distances to predefined "anchor" vertices:
In practice, we add the "anchor" vertices to a graph before training and connect them to random vertices. Note that the "anchor" vertices do not correspond to any data object. The architecture used in this experiment is schematically presented on Figure 2.
Intuitively, the usage of PRODIGE in architecture from Figure 2 can be treated as a generalization of vectorial embedding layer. For instance, if a graph contains only the edges between vertices and anchors, this is equivalent to embedding with trainable parameters. However, in practice, our regularized PRODIGE model learns a more compact graph by using vertex-vertex edges to encode words via their relation to other words.
Model and objective. We train a simple sentiment classification model with four consecutive layers: embedding layer, one-dimensional convolutional layer with 32 output filters, followed by a global max pooling layer, a ReLU nonlinearity and a final dense layer that predicts class logits. Indeed, this model is smaller than the state-of-the-art models for sentiment classification and should be considered only as a proof of concept. We minimize the cross-entropy objective by backpropagation, computing gradients w.r.t all trainable parameters including graph edges .
Experimental setup. We evaluate our model on the IMDB benchmark Maas et al. (2011), a popular dataset for text sentiment binary classification. The data is split into training and test sets, each containing text instances. For simplicity, we only consider most frequent tokens.
We compare our model with embedding-based baselines, which follow the same architecture from Figure 2, but with the standard embedding layer instead of PRODIGE. All embeddings are initialized with pre-trained word representations and then fine-tuned with the subsequent layers by backpropagation. As pretrained embeddings, we use GloVe vectors444We used https://pypi.org/project/gensim/ to train GloVe model with recommended parameters trained on texts from the training set and select vectors corresponding to the most frequent tokens. In PRODIGE, the model graph is pre-trained by distance-preserving compression of the GloVe embeddings, as described in Section 4.1. In order to encode the "anchor" objects, we explicitly add synthetic objects to the data by running K-means clustering and compress the resulting objects by minimizing the objective (5).
|Model size||2.16 MB||12.24 MB||2.20 MB|
For each model, we report test accuracy and the total model size. The results in Table 3 illustrate that PRODIGE learns a model with nearly the same quality as full -dimensional embeddings with much smaller size. On the other hand, PRODIGE significantly outperforms its vectorial counterpart of the same model size.
In this work, we introduce PRODIGE, a new method constructing representations of finite datasets. The method represents the dataset as a weighted graph with a shortest distance path metric, which is able to capture geometry of any finite metric space. Due to minimal inductive bias, PRODIGE captures the essential information from data more effectively compared to embedding-based representation learning methods. The graphs are learned end-to-end via minimizing any differentiable objective function, defined by the target task. We empirically confirm the advantage of PRODIGE in several machine learning problems and publish its source code online.
Acknowledgements: Authors would like to thank David Talbot for his numerous suggestions on paper readability. We would also like to thank anonymous meta-reviewer for his insightful feedback. Vage Egiazaryan was supported by the Russian Science Foundation under Grant 19-41-04109.
-  (2017) Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics. Cited by: §1, §2.
-  (2005) Learning a similarity metric discriminatively, with application to face verification. In CVPR (1), Cited by: §1.
-  (2018) VoxCeleb2: deep speaker recognition. arXiv preprint arXiv:1806.05622. Cited by: §2.
-  (2018) Representation tradeoffs for hyperbolic embeddings. arXiv preprint arXiv:1804.03329. Cited by: §1, §2.
-  (2014) Decaf: a deep convolutional activation feature for generic visual recognition. In International conference on machine learning, Cited by: §1, §2.
-  (2011) From points to nodes: inverse graph embedding through a lagrangian formulation. In CAIP, Cited by: §2.
-  (2015) Learning image and user features for recommendation in social networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4274–4282. Cited by: §4.2.
-  (1990) Likelihood ratio gradient estimation for stochastic systems. Communications of the ACM. Cited by: §3.2.
-  (2018) Learning mixed-curvature representations in product spaces. Cited by: §1, §2.
-  (2017) Neural collaborative filtering. In Proceedings of the 26th International Conference on World Wide Web, Cited by: §1.
-  (2015) Finite metric spaces and their embedding into lebesgue spaces. Cited by: §1.
-  (2008) Collaborative filtering for implicit feedback datasets. 2008 Eighth IEEE International Conference on Data Mining, pp. 263–272. Cited by: 5th item.
-  (2018) Anonymous walk embeddings. ArXiv abs/1805.11921. Cited by: §2.
-  (2019) Robust graph learning from noisy data. IEEE transactions on cybernetics. Cited by: §2.
-  (2016) Adaptive edge weighting for graph-based learning algorithms. Machine Learning 106, pp. 307–335. Cited by: §2.
-  (2019) Hyperbolic image embeddings. arXiv preprint arXiv:1904.02239. Cited by: §2.
-  (2015) Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, Cited by: §3.3.
-  (1964) Nonmetric multidimensional scaling: a numerical method. Psychometrika 29 (2), pp. 115–129. Cited by: 1st item, §4.1.
-  (2018) Learning sparse neural networks through l_0 regularization. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings, Cited by: §3.2.
-  (2004) Distinctive image features from scale-invariant keypoints. International journal of computer vision 60 (2), pp. 91–110. Cited by: §2.
-  (2011-06) Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Cited by: §4.3.
-  (2008) Visualizing data using t-sne. Journal of machine learning research 9 (Nov), pp. 2579–2605. Cited by: Figure 1, §4.1.
-  (2013) Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, Cited by: §1, §2.
-  (2017) Poincaré embeddings for learning hierarchical representations. In Advances in neural information processing systems, Cited by: §1, §2.
-  (2001) Modeling the shape of the scene: a holistic representation of the spatial envelope. International journal of computer vision 42 (3), pp. 145–175. Cited by: §2.
-  (2014) Glove: global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), Cited by: §1, §2, 2nd item.
-  (2018) Poincar’e glove: hyperbolic word embeddings. arXiv preprint arXiv:1810.06546. Cited by: §2.
-  (2018) Hyperbolic recommender systems. arXiv preprint arXiv:1809.01703. Cited by: §2.
-  (2019) A comprehensive survey on graph neural networks. arXiv preprint arXiv:1901.00596. Cited by: §2.
-  (2019) DAG-gnn: dag structure learning with graph neural networks. CoRR abs/1904.10098. Cited by: §2.
-  (2015) Improving deep neural networks using softplus units. In 2015 International Joint Conference on Neural Networks, IJCNN 2015, Killarney, Ireland, July 12-17, 2015, pp. 1–4. Cited by: §3.1.
-  (2018) DAGs with no tears: continuous optimization for structure learning. In NeurIPS, Cited by: §2.