On the Interpretability and Evaluation of Graph Representation Learning


Antonia Gogoglou, C. Bayan Bruss, Keegan E. Hines
Capital One, McLean, VA 22102

With rising interest in graph representation learning, a variety of approaches have been proposed to effectively capture a graph’s properties. While these approaches have improved performance in graph machine learning tasks compared to traditional graph mining techniques, they are still perceived as black-box techniques with limited insights into the information encoded in these representations. In this work, we explore methods to interpret node embeddings and propose the creation of a robust evaluation framework for comparing graph representation learning algorithms and hyperparameters. We test our methods on graphs with different properties and investigate the relationship between embedding training parameters and the ability of the produced embedding to recover the structure of the original graph in a downstream task.

1 Introduction

Graphs play a key role in many machine learning tasks, providing the structured information needed to learn meaningful patterns and generate predictive models. However, it is challenging to represent complex structures like graphs in a way that is both expressive and efficient enough to feed into machine learning applications. Advances in the field of Graph Representation Learning hamilton2017representation ; goyal2018graph provide a mapping that embeds nodes, or entire graphs, as dense low-dimensional vectors. Recently proposed approaches such as DeepWalk perozzi2014deepwalk , LINE tang2015line , node2vec grover2016node2vec , GCNs kipf2016semi and GraphSage hamilton2017inductive treat this mapping as a machine learning task in itself and aim to optimize it so that relationships in the embedding space accurately reflect the topology of the original graph.

A common categorization distinguishes between shallow and deep node embeddings. Shallow embeddings rely on first- or higher-order proximity derived from the original graph, often via random walks, to provide the context of a node and inform its representation. Deep learning approaches include Graph Convolutional Networks (GCNs) and Message Passing Neural Networks (MPNNs), which extend the concept of convolution to describe a node as a function of its neighborhood. Regarding the objective optimized during training of the embeddings, unsupervised approaches optimize for link reconstruction, supervised approaches optimize for an externally assigned node label, and semi-supervised approaches operate on a subset of labeled nodes.

Given that different embedding approaches optimize for different objectives and operate on different inputs, it is expected that there is no single "one-size-fits-all" node embedding technique. Recent work has focused on evaluating graph representation learning techniques with regard to their ability to distinguish graph properties dalmia2018towards ; xu2018powerful . In this direction, we investigate the interpretability of node embeddings and propose an evaluation framework that answers the following questions:

  • What information do node embeddings express and can we derive metrics to quantify their properties?

  • How can we evaluate node embeddings with or without external labels and is there a single approach that maximizes performance across all tasks?

  • Can complicated structures of the original graph be captured in embeddings trained on the local context around a node?

The rest of the paper is organized as follows: Section 1.1 describes our proposed methodology, while Section 2 shows the results of our experiments and concludes the article.

1.1 Methods

1.1.1 Interpretability

In graph representation learning, nodes are typically embedded into a fixed D-dimensional vector space (where D is a hyperparameter). In theory, this space is the most condensed representation attainable without loss of information. This suggests that an interpretable embedding dimension would be highly associated with a particular feature of the original graph, a so-called disentangled representation higgins2017beta ; bouchacourt2018multi ; locatello2018challenging . In NLP these features are often expressed in the form of semantic categories of words csenel2018semantic ; park2017rotated . In the case of graphs, such categories can be derived from extrinsic or intrinsic sources: the former are categories or labels assigned externally to nodes, while the latter refers to groups found in the decomposition of the original graph (e.g. communities or partitions).

We define an Interpretability Score, adapted from csenel2018semantic , for each dimension and each group of nodes:

IS_{i,j} = \frac{|S_j \cap V_i(\lambda)|}{|S_j|} \times 100

where S_j is the group of nodes, i is the embedding dimension, V_i(\lambda) is the set of \lambda nodes with the largest values along dimension i, and \lambda is a hyperparameter set equal to the cardinality of S_j for our experiments. Interpretability scores are produced for both the top and bottom items of each embedding dimension (ranking nodes by the largest or smallest values along the dimension, respectively), and they can be aggregated by taking the maximum or average. Thereafter, scores are aggregated either across multiple groups, to obtain the score of a single embedding dimension, or across embedding dimensions, to obtain per-group scores.


If the top nodes in the positive or negative direction of an embedding dimension are highly associated with a particular node category and at the same time have lower overlap with the rest of the categories, then the interpretability of this dimension is strong.
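As a minimal sketch, the per-dimension, per-group score can be computed as follows. The function name and interface here are illustrative, not taken from the paper; it assumes λ defaults to |S_j| and aggregates the two directions by taking the maximum, as described above.

```python
def interpretability_score(embeddings, node_ids, group, dim, lam=None):
    """Interpretability Score of one embedding dimension for one node group,
    adapted from the definition of Senel et al. (2018).

    embeddings : list of D-dimensional vectors, one per node
    node_ids   : node identifiers aligned with `embeddings`
    group      : set of node ids forming the category S_j
    dim        : index i of the embedding dimension under inspection
    lam        : how many top/bottom nodes to inspect; the paper sets it
                 to |S_j|, which we use as the default
    """
    if lam is None:
        lam = len(group)
    # rank nodes by their value along the chosen dimension
    order = sorted(range(len(node_ids)), key=lambda k: embeddings[k][dim])
    bottom = {node_ids[k] for k in order[:lam]}   # most negative direction
    top = {node_ids[k] for k in order[-lam:]}     # most positive direction
    is_top = 100.0 * len(top & group) / len(group)
    is_bottom = 100.0 * len(bottom & group) / len(group)
    return max(is_top, is_bottom)  # aggregate over the two directions
```

Aggregating these scores across groups (per dimension) or across dimensions (per group) then reduces to averaging or taking maxima over calls to this function.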

1.1.2 Embedding approaches and Datasets

Random-walk based embedding models have two general components: a system for generating long random walks (with some variants depending on the model), and a shallow, one-layer skip-gram neural network. Each component has its own set of hyperparameters, of which the most commonly reported is the embedding dimensionality.
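As a concrete illustration of the first component, a uniform random walk generator in the style of DeepWalk might look like the following sketch (function name and defaults are ours, not from any specific implementation):

```python
import random

def generate_walks(adj, num_walks=10, walk_length=40, seed=0):
    """Uniform random walks in the style of DeepWalk.

    adj : dict mapping each node to a list of its neighbours
    Returns a list of walks (lists of nodes) that can be passed as
    "sentences" to a skip-gram implementation such as gensim's Word2Vec.
    """
    rng = random.Random(seed)
    walks = []
    for _ in range(num_walks):
        nodes = list(adj)
        rng.shuffle(nodes)            # start one walk from every node
        for start in nodes:
            walk = [start]
            while len(walk) < walk_length:
                neighbours = adj[walk[-1]]
                if not neighbours:    # dead end: truncate the walk
                    break
                walk.append(rng.choice(neighbours))
            walks.append(walk)
    return walks
```

The second component then treats each walk as a sentence and each node as a word, so the skip-gram objective pushes nodes that co-occur within a context window toward nearby points in the embedding space.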

To investigate the proposed evaluation methods we use three datasets: one from the financial sector (Brand Level Merchants, BLM) bruss2019deeptrax and two from the social networks sphere, BlogCatalog and Flickr tang2009relational . The BlogCatalog dataset contains friendship connections between bloggers, along with labels assigning each node to one or more of 39 categories the bloggers can be affiliated with. Similarly, the Flickr data contains links between users of the Flickr platform and 195 categories users can be associated with. The Brand Level Merchant dataset is constructed from credit card transaction logs: any two transactions that share an account within a specified time window generate a merchant pair, i.e. a walk of length 2. For all datasets we generate embeddings using the GENSIM implementation of word2vec rehurek_lrec with the same hyperparameters proposed in bruss2019deeptrax and perozzi2014deepwalk .
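The pair-generation step for the Brand Level Merchant graph can be sketched as below. This is illustrative only; the record layout and the exact windowing logic are assumptions, not the published pipeline.

```python
from itertools import combinations

def merchant_pairs(transactions, window):
    """Generate length-2 'walks' (merchant pairs) from transaction logs:
    any two transactions sharing an account within `window` time units
    of each other yield a pair of merchants.

    transactions : iterable of (account_id, merchant_id, timestamp)
    """
    by_account = {}
    for account, merchant, ts in transactions:
        by_account.setdefault(account, []).append((ts, merchant))
    pairs = []
    for txns in by_account.values():
        txns.sort()                                # order by timestamp
        for (t1, m1), (t2, m2) in combinations(txns, 2):
            if t2 - t1 <= window and m1 != m2:
                pairs.append((m1, m2))
    return pairs
```

The resulting pairs play the role of the random-walk "sentences" fed to the skip-gram model, each sentence having length 2.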

1.1.3 External and Internal Evaluation

In this work, we evaluate embeddings both internally, meaning their ability to capture graph structure, and externally, meaning their discriminative power with respect to node labels. In the embedding space, similar nodes are expected to be placed closer together, but the notion of similarity can be defined in various ways based on node features, neighborhoods or connectivity patterns. Communities are a broadly used form of graph partitioning and can capture complex structural similarity; consequently, they make a good test case for evaluating how graph structural properties are represented in the embedding space. Two learning problems are generated from this: pairwise community detection, a binary classification task of whether a pair of nodes belongs to the same community, and node-level community prediction, which we treat as a multi-class classification problem of predicting the community a node belongs to given its embedding representation. For graphs that contain node labels, like BlogCatalog and Flickr, we construct the analogous tasks over the labels. The goal in both tasks is to test the embeddings' ability to separate nodes. For community detection we use the Louvain algorithm for modularity optimization blondel2008fast .
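A minimal sketch of how labelled examples for the pairwise task can be constructed (the sampling scheme and featurisation mentioned in the comments are assumptions, not a prescription from the paper):

```python
import random

def pairwise_examples(communities, num_pairs, seed=0):
    """Sample labelled node pairs for the pairwise community detection
    task: label 1 if the two nodes share a community, 0 otherwise. A
    binary classifier would then be trained on features derived from
    the two node embeddings (e.g. concatenation or element-wise product).

    communities : dict mapping node -> community id (e.g. from Louvain)
    """
    rng = random.Random(seed)
    nodes = list(communities)
    examples = []
    for _ in range(num_pairs):
        u, v = rng.sample(nodes, 2)   # two distinct nodes
        examples.append((u, v, int(communities[u] == communities[v])))
    return examples
```

The node-level community prediction task needs no pairing step: each node's embedding is the feature vector and its community id is the multi-class target.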

| Dataset | Number of Nodes | Number of Edges | Number of Communities |
|---|---|---|---|
| Brand Level Merchants | over 100,000 | over | (80 for 95% of nodes) |
| BlogCatalog | 10,312 | 333,983 | 6 |
| Flickr | 80,513 | 5,899,882 | 17 |

Table 1: Dataset Statistics

2 Results

For our experiments we produced embeddings for all datasets with embedding dimensionalities of 10, 64 and 128. First, we examine Interpretability Scores (IS) aggregated over different axes to explore the association of the embedding space with both external and internal node categorizations. Figure 1 shows the distribution of IS values over node communities for Brand Level Merchants and over node groups for the BlogCatalog and Flickr graphs. We observe that some node categories are highly associated with multiple dimensions of the embedding space (e.g. community 0 in Brand Level Merchants). These are the most highly populated categories and contain a larger variety of patterns expressed in multiple dimensions. Each embedding dimension also appears to be individually correlated with a particular subset of node categories. For instance, one dimension for BlogCatalog is mostly correlated with groups 1 and 15, while another carries information for groups 1, 14 and 3.

Figure 1: Interpretability scores over node categorizations for selected embedding dimensions. (a) IS for two embedding dimensions across node groups for BlogCatalog data (D = 128); (b) IS for two embedding dimensions across communities for Brand Level Merchants (D = 10); (c) IS for two embedding dimensions across communities for Flickr (D = 128).
| D | BLM Community (Binary) | BLM Community (Multi-class) | BLM LPAUC | BlogCatalog Group (Binary) | BlogCatalog Group (Multi-label) | BlogCatalog Community (Binary) | BlogCatalog Community (Multi-class) | BlogCatalog LPAUC |
|---|---|---|---|---|---|---|---|---|
| 10 | 0.78 | 0.84 | 0.98 | 0.55 | 0.35 | 0.71 | 0.86 | 0.87 |
| 64 | 0.71 | 0.86 | 0.95 | 0.75 | 0.42 | 0.68 | 0.80 | 0.90 |
| 128 | 0.71 | 0.85 | 0.94 | 0.78 | 0.40 | 0.72 | 0.83 | 0.93 |

| D | Flickr Group (Binary) | Flickr Group (Multi-label) | Flickr Community (Binary) | Flickr Community (Multi-class) | Flickr LPAUC |
|---|---|---|---|---|---|
| 10 | 0.70 | 0.37 | 0.80 | 0.85 | 0.95 |
| 64 | 0.70 | 0.40 | 0.70 | 0.88 | 0.96 |
| 128 | 0.67 | 0.40 | 0.77 | 0.94 | 0.96 |

Table 2: Performance for different classification tasks with various embedding dimensionality values. Binary classification values are F1-scores, multi-class values are micro-averaged F1, and LPAUC is Link Prediction AUC.

Next, we report in Table 2 the performance of embeddings of different dimensionality on the set of prediction tasks described in Section 1.1.3. We observe that increasing the number of embedding dimensions does not improve the ability to predict community membership, with an edge given to denser representations in BlogCatalog and Brand Level Merchants. Interestingly, performance in all the node classification tasks we undertook is closely linked with the distribution of Interpretability Scores (see Figure 1), with the highest values achieved for community prediction rather than node classification. Performance in external node classification increases with the number of dimensions for the BlogCatalog data, while for the Flickr data medium-sized embeddings outperform the rest on this task. We conclude that hyperparameter tuning can be based on two axes: the graph properties of the dataset and the structures of the original graph we need the embeddings to capture.

Link prediction accuracy, for which random-walk based approaches optimize, appears to be correlated with external node classification. This is not always the case with community prediction, which favors smaller embeddings in BlogCatalog and Brand Level Merchants, while link prediction improves with a higher number of dimensions in the same datasets. In the Flickr graph, embeddings of different dimensionality achieve almost identical link prediction AUC scores but show large deviations in community prediction performance. Our findings imply that optimizing for link occurrence or external labels alone is not always sufficient to evaluate the embedding space as a whole, and graph structure based tasks can shed light on the quality of latent representations. This is only the first step in an effort to design a generalizable evaluation framework for different graph representation approaches across graphs with varying properties.

