Network Representation Learning: Consolidation and Renewed Bearing
Abstract
Graphs are a natural abstraction for many problems where nodes represent entities and edges represent a relationship across entities. The abstraction can be explicit (e.g., transportation networks, social networks, foreign key relationships) or implicit (e.g., nearest neighbor problems). An important area of research that has emerged over the last decade is the use of graphs as a vehicle for nonlinear dimensionality reduction in a manner akin to previous efforts based on manifold learning with uses for downstream database processing (e.g., entity resolution and link prediction, outlier analysis), machine learning and visualization. In this systematic yet comprehensive experimental survey, we benchmark several popular network representation learning methods operating on two key tasks: link prediction and node classification.
We examine the performance of 12 unsupervised embedding methods on 15 datasets. To the best of our knowledge, the scale of our study, both in the number of methods and the number of datasets, is the largest to date. As far as possible, our benchmarking study uses the original code provided by the original authors.
Our results reveal several key insights about work to date in this space. First, we find that certain baseline methods (task-specific heuristics, as well as classic manifold methods) that have often been dismissed or not considered by previous efforts can compete on certain types of datasets if they are tuned appropriately. Second, we find that recent methods based on matrix factorization offer a small but relatively consistent advantage over alternative methods (e.g., random-walk based methods) from a qualitative standpoint. Specifically, we find that MNMF, a community preserving embedding method, is the most competitive method for the link prediction task, while NetMF is the most competitive baseline for node classification. Third, no single method completely outperforms other embedding methods on both node classification and link prediction tasks. We also present several drill-down analyses that reveal settings under which certain algorithms perform well (e.g., the role of neighborhood context on performance; dataset characteristics that influence performance), guiding the end user.
Graphs are effective in multiple disparate domains to model, query and mine relational data. Examples abound, ranging from the use of nearest neighbor graphs in database systems [?, ?] and machine learning [?, ?] to the analysis of biological networks [?, ?], and from social network analysis [?, ?] to the analysis of transportation networks [?]. ML-enhanced data structures and algorithms such as learned indexes [?] have recently shown promising results in database systems. An active area in ML research, network representation learning, has potential in multiple applications related to downstream database processing tasks such as outlier analysis [?, ?], entity resolution [?, ?], link prediction [?, ?] and visualization [?, ?]. However, a plethora of new network representation learning methods has been proposed recently [?, ?]. Given the wide range of methods proposed, it is often hard for a practitioner to determine which of these methods they should consider adopting for a particular task on a particular dataset. Part of the challenge is the lack of a standard evaluation benchmark and of a thorough, independent understanding of the strengths and weaknesses of each method for the particular task at hand. The challenges are daunting and can be summarized as follows:
Lack of Standard Assessment Protocol: First, there is no standard way to evaluate the quality of generated embeddings. The efficacy of embedding methods is often evaluated on downstream machine learning tasks, so the superiority of one embedding method over another hinges on its performance in such a task. With no standard evaluation protocol for these downstream tasks, the results reported by different research articles are often inconsistent. As a specific example, the Node2Vec paper [?] reports the node classification performance of DeepWalk on the Blogcatalog dataset (multi-label classification with a 50:50 train-test split) as 21.1% Macro-F1, whereas the DeepWalk paper [?] reports DeepWalk's performance as 27.3% Macro-F1.
Tuning Comparative Strawman: Second, a new method almost always compares its performance against a subset of other methods, and on a subset of the tasks and datasets previously evaluated. In many cases, while great care is taken to tune the new method (via careful hyperparameter tuning), the same care is often not taken when evaluating baselines. For example, in our experiments on Blogcatalog with a 50:50 train-test split, the Laplacian Eigenmaps method [?] without grid search achieves a Macro-F1 score of 3.9% (similar to what was reported in [?, ?]). However, after tuning the hyperparameters of logistic regression, the Laplacian Eigenmaps method achieves a Macro-F1 of 29.2%. Importantly, while logistic regression is commonly used to evaluate the quality of node embeddings in such methods, a grid search over logistic regression parameters is rarely conducted or reported. Additionally, reported results are rarely averaged over multiple shuffles to reduce any bias or patterns in the training data (this is our observation based on the evaluation scripts publicly shared by multiple authors). In short, a lack of consistency in evaluation inhibits our understanding of the scientific advances in this arena.
Standard Benchmark: Third, there is no agreed list of datasets that is used consistently in the literature. Each new embedding method is evaluated on selected datasets with a suitable node classification or link prediction setup. For instance, some methods report node classification performance for the baselines with a train-test split of 10:90, while others report it with a train-test split of 50:50. As a result, the comparison across embedding methods is often unclear. Additionally, there are no clear guidelines on whether a proposed embedding methodology favors a certain type of dataset characteristic (e.g., sparsity).
Task Specific Baselines: Fourth, for many tasks such as node classification and link prediction there is a rich pre-existing literature [?, ?] focused on such tasks (that does not explicitly rely on node embedding methodology as a preprocessing step). Little, if any, of the prior art in network representation learning considers such baselines; such methods often compare performance on downstream ML tasks only against other node embedding methods. In our experiments, we find that a curated feature vector based on heuristics can achieve a competitive AUROC score on many of the datasets for the link prediction task.
To summarize, there is a clear and pressing need for a comprehensive and careful benchmarking of such methods, which is the focus of this study. To address the aforementioned issues in the network embedding literature, we perform an experimental study of 12 promising network embedding methods on 15 diverse datasets. The selected embedding methods are unsupervised techniques that generate the node embeddings of a graph. Our goal is to perform a uniform, principled comparison of these methods on a variety of different datasets and across two key tasks: link prediction and node classification.
Specific findings of our experimental survey that we wish to highlight are that:

For the link prediction task, we find that MNMF [?], a community preserving embedding method, offers a compelling advantage when it scales to a particular dataset. Other more scalable alternatives such as Verse and LINE also perform well on most datasets. The heuristic approach we present for link prediction competes exceptionally well on all types of datasets surveyed for this task.

For the node classification task, NetMF [?], when it scales to the particular dataset, offers a small but consistent performance advantage. We find that for the node classification task, the task-specific heuristic methodology we compare with works well on datasets with fewer labels; in such scenarios, it competes well with a majority of the methods surveyed, whereas some recently proposed methods fare much worse.

We also drill down to study the impact of context embeddings on the link prediction and node classification tasks, and find some methods impervious to the use of context, whereas for others it helps significantly. We also examine two common ways in which link prediction strategies are evaluated (explicitly through a classifier, or implicitly through vector dot-product ranking). We find that there is a clear separation in performance when using these alternative strategies.
We denote the input graph as $G = (V, E)$, where $V$ and $E$ denote the sets of nodes and edges of the graph $G$. The notations used in this work are listed in Table 1. In this study, we consider both directed and undirected graphs, as well as both weighted and unweighted graphs. We evaluate the embedding methods on non-attributed, homogeneous graphs.
Symbol  Meaning

$G$  Input graph
$V$  Nodes
$E$  Edges
$n$  Number of nodes, $|V|$
$A$  Adjacency matrix
$D$  Degree matrix of the graph
and where  
$I$  Identity matrix
$\Phi_u$  Node embedding of node $u$
$\Phi'_u$  Context embedding of node $u$
$\Phi$, $\Phi'$  Node and context embedding matrices of size $n \times d$
$\operatorname{vol}(G)$  Sum of weights of all edges
$S$  Graph similarity matrix
$\sigma$  A nonlinear function such as the sigmoid function
$b$  Number of negative samples
Definition 2.1
Network Embedding: Given a graph $G$ and an embedding dimension $d$, where $d \ll |V|$, the goal of a network embedding method is to learn a $d$-dimensional representation of the graph $G$ such that similarity in graph space approximates closeness in the $d$-dimensional space.
In this section, we give a summary of the network embedding methods evaluated in our work. For each model, along with its description, we also provide additional experimental details for reproducibility.

Laplacian Eigenmaps [?]: Laplacian Eigenmaps generates a $d$-dimensional embedding of the graph using the eigenvectors corresponding to the $d$ smallest eigenvalues of the Laplacian matrix $L = D - A$:
$$\min_{\Phi} \operatorname{tr}(\Phi^\top L \Phi) \quad \text{subject to} \quad \Phi^\top D \Phi = I,$$
where $\Phi$ is the generated embedding matrix. The above equation can be reduced to a simple minimization of the L2 distance between adjacent nodes, $\frac{1}{2}\sum_{u,v} A_{uv} \lVert \Phi_u - \Phi_v \rVert_2^2$. Laplacian Eigenmaps leverages first-order information for generating the embeddings.
Reproducibility notes: We search over the following hyperparameters: Embedding dimension = [64, 128, 256]. On the datasets with more than 1M nodes, Laplacian Eigenmaps did not scale for embedding dimensions 128 and 256.
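The constrained minimization above can be illustrated with a small dense sketch. It assumes an undirected graph with no isolated nodes, solves the generalized eigenproblem through the symmetric normalized Laplacian, and skips the trivial constant eigenvector; the function name and toy graph are hypothetical and this is not the implementation benchmarked in the study.

```python
import numpy as np

def laplacian_eigenmaps(A, d):
    """Return an n x d embedding from the d smallest non-trivial
    generalized eigenvectors of L y = lambda D y (dense sketch)."""
    A = np.asarray(A, dtype=float)
    deg = A.sum(axis=1)
    if np.any(deg <= 0):
        raise ValueError("every node needs at least one edge")
    L = np.diag(deg) - A
    inv_sqrt = 1.0 / np.sqrt(deg)
    # Symmetric normalized Laplacian D^{-1/2} L D^{-1/2}.
    L_sym = inv_sqrt[:, None] * L * inv_sqrt[None, :]
    _, eigvecs = np.linalg.eigh(L_sym)  # eigenvalues in ascending order
    # Drop the trivial first eigenvector and map back with D^{-1/2}.
    return inv_sqrt[:, None] * eigvecs[:, 1:d + 1]

# Toy graph (hypothetical): two triangles joined by a single edge.
A = np.zeros((6, 6))
for u, v in [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (4, 5), (3, 5)]:
    A[u, v] = A[v, u] = 1.0

emb = laplacian_eigenmaps(A, d=2)
D = np.diag(A.sum(axis=1))
```

On this toy graph the first embedding coordinate places the two triangles on opposite sides of zero, and the columns satisfy the constraint $\Phi^\top D \Phi = I$ up to numerical error.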

DeepWalk [?]: DeepWalk is a random walk based network embedding method which uses truncated random walks and leverages local information from these generated walks to learn similar latent representations. DeepWalk draws inspiration from the Skip-gram model in Word2vec [?] by treating random walks as sequences and optimizing the following objective function:
$$\min_{\Phi} \; -\log \Pr\big(\{v_{i-w}, \dots, v_{i+w}\} \setminus v_i \mid \Phi_{v_i}\big) \qquad (1)$$
where $v_i$ is the target node while $\{v_{i-w}, \dots, v_{i+w}\} \setminus v_i$ are the context nodes within a window of size $w$, and $\Phi_{v_i}$ denotes the embedding of the node $v_i$. Since the objective function is expensive to compute for large graphs, it is approximated using Hierarchical Softmax [?].
Reproducibility notes: We search over the following hyperparameters: Walk length = [5, 20, 40], Number of walks = [20, 40, 80], Window size = [2, 4, 10], Embedding dimension = [64, 128, 256]. In the case of directed graphs, we observe lower performance in the node classification and link prediction tasks. In order to have a fair comparison with other methods, we treat directed graphs as undirected for DeepWalk.
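To make the roles of the walk length, number of walks and window size concrete, the following is a minimal sketch of how truncated random walks and (target, context) pairs could be generated. The function names and the toy adjacency list are hypothetical; the actual DeepWalk code then feeds such walks to a Skip-gram model, which is not shown here.

```python
import random

def truncated_random_walks(adj, num_walks, walk_length, seed=0):
    """Start `num_walks` uniform random walks of `walk_length` nodes from every node."""
    rng = random.Random(seed)
    nodes = sorted(adj)
    walks = []
    for _ in range(num_walks):
        rng.shuffle(nodes)  # visit start nodes in a fresh random order
        for start in nodes:
            walk = [start]
            while len(walk) < walk_length and adj[walk[-1]]:
                walk.append(rng.choice(adj[walk[-1]]))
            walks.append(walk)
    return walks

def skipgram_pairs(walk, window):
    """List (target, context) pairs for every position within the window."""
    pairs = []
    for i, target in enumerate(walk):
        lo, hi = max(0, i - window), min(len(walk), i + window + 1)
        pairs.extend((target, walk[j]) for j in range(lo, hi) if j != i)
    return pairs

# Toy undirected graph as adjacency lists (hypothetical).
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
walks = truncated_random_walks(adj, num_walks=2, walk_length=5)
pairs = skipgram_pairs(walks[0], window=2)
```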

Node2Vec [?]: Node2Vec is a biased random walk based network embedding method which allows the random walk to be more flexible in exploring the graph neighborhoods. The flexibility of the random walk is achieved by interpolating between breadth-first traversal and depth-first traversal. The objective function is again based on the Skip-gram model of Word2vec [?], and since the objective function is expensive to compute for large graphs, it is approximated by negative sampling [?].
Reproducibility notes: We search over the following hyperparameters: Walk length = [10, 20, 40], Number of walks = 80, Window size = 10, $p$ and $q$ = [0.25, 1, 2, 4], Embedding dimension = [64, 128, 256]. In the case of directed graphs, we observe lower performance in the node classification and link prediction tasks. In order to have a fair comparison with other methods, we treat directed graphs as undirected for Node2Vec.

GraRep [?]: GraRep is a matrix factorization based network embedding method which captures the global structural information of the graph while learning node embeddings. The authors observe that existing Skip-gram based models project all the $k$-step relational information into a common subspace, and argue for the importance of preserving the different $k$-step relational information in separate subspaces. The loss function to preserve the $k$-step relationship between nodes $u$ and $v$ is proposed as:
$$L_k(u, v) = \hat{A}^k_{u,v} \log \sigma(\Phi_u^\top \Phi'_v) + b \, \mathbb{E}_{v' \sim p_k(V)}\big[\log \sigma(-\Phi_u^\top \Phi'_{v'})\big] \qquad (2)$$
where $\hat{A} = D^{-1}A$ is the one-step transition matrix, $\Phi_u$ and $\Phi'_v$ are node and context embeddings, $b$ is the number of negative samples, $p_k(V)$ is the distribution over nodes at step $k$, and $v'$ refers to the negative node at the $k$-th step for node $u$ (see Table 1 for additional notation, e.g. $\sigma$). The above loss function in closed form results in a log-transformed, probabilistic adjacency matrix, which is factorized with SVD to generate each $k$-step representation. The final node representation is generated by concatenating all the $k$-step representations.
Reproducibility notes: We search over the following hyperparameters: $k$ from 1 to 6, Embedding dimension = [64, 128, 256]. On the datasets with more than 2M edges, due to scalability issues, we searched over $k$ from 1 to 2 and Embedding dimension = [64, 128].

NetMF [?]: NetMF is a matrix factorization based network embedding method. NetMF presents theoretical proofs for the claim that Skip-gram models with negative sampling implicitly approximate and factorize appropriate matrices constructed with the help of graph Laplacians. The objective matrix for NetMF with a small context window $T$ is given by (see Table 1 for notation):
$$\log\left(\frac{\operatorname{vol}(G)}{bT}\left(\sum_{r=1}^{T} (D^{-1}A)^r\right) D^{-1}\right) \qquad (3)$$
where $\operatorname{vol}(G)$ refers to the sum of all edge weights and $b$ corresponds to the number of negative samples in the Skip-gram model. NetMF factorizes the above closed-form DeepWalk matrix with SVD in order to generate node embeddings, and provides two algorithms, one for a small context window and one for a large context window.
Reproducibility notes: We search over the following hyperparameters: $T$ = [1, 10], Negative samples = [1, 2, 3], Rank for large context window = [128, 256, 512], Embedding dimension = [64, 128, 256].
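The small-window matrix in Equation (3) can be written out directly for a small dense graph. The sketch below is illustrative only: it assumes an undirected graph without isolated nodes, takes $\operatorname{vol}(G)$ as the sum of all entries of $A$, clips the matrix at 1 before the logarithm (a truncation we believe the NetMF authors use so that the logarithm stays finite), and uses a hypothetical function name rather than the authors' code.

```python
import numpy as np

def netmf_small_window(A, T=2, b=1, d=2):
    """Dense sketch of the small-window NetMF matrix and its SVD embedding."""
    A = np.asarray(A, dtype=float)
    deg = A.sum(axis=1)
    vol = A.sum()                      # assumption: vol(G) = sum of entries of A
    P = A / deg[:, None]               # transition matrix D^{-1} A
    power_sum = np.zeros_like(A)
    P_r = np.eye(len(A))
    for _ in range(T):                 # accumulate P^1 + ... + P^T
        P_r = P_r @ P
        power_sum += P_r
    M = (vol / (b * T)) * power_sum / deg[None, :]   # right-multiply by D^{-1}
    log_M = np.log(np.maximum(M, 1.0))               # assumed truncation at 1
    U, sigma, _ = np.linalg.svd(log_M)
    return log_M, U[:, :d] * np.sqrt(sigma[:d])

# Toy graph (hypothetical): a 5-node cycle with one chord.
A = np.zeros((5, 5))
for u, v in [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0), (0, 2)]:
    A[u, v] = A[v, u] = 1.0

log_M, emb = netmf_small_window(A, T=2, b=1, d=2)
```

For large graphs the dense powers above are not practical, which is why the text distinguishes a separate algorithm for large context windows.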

MNMF [?]: MNMF is a matrix factorization based network embedding method which generates node embeddings that preserve microscopic information, in the form of first-order and second-order proximities among nodes, as well as mesoscopic information, in the form of community structure. The objective function for MNMF is given as
$$\min_{M, \Phi, H, C \ge 0} \; \lVert S - M\Phi^\top \rVert_F^2 + \alpha \lVert H - \Phi C^\top \rVert_F^2 - \beta \operatorname{tr}(H^\top B H) \quad \text{s.t.} \quad \operatorname{tr}(H^\top H) = n \qquad (4)$$
where $M$ is a nonnegative basis matrix, $H$ is the binary community membership matrix, $C$ is the latent representation of communities and $B$ is the modularity matrix obtained from the adjacency matrix $A$ (see Table 1 for the rest of the notation). Overall, MNMF discovers communities through modularity constraints. The node embeddings generated with the help of microscopic information and the community embeddings are then jointly optimized by assuming a consensus relationship between node and community embeddings.
Reproducibility notes: We search over the following hyperparameters: $\alpha$ = [0.1, 1.0, 10.0], $\beta$ = [0.1, 1.0, 10.0], Embedding dimension = [64, 128, 256].

HOPE [?]: HOPE is a matrix factorization based network embedding method which generates node embeddings that preserve the asymmetric transitivity of nodes in directed graphs. If there exists a directed edge from node $u$ to $v$ and from $v$ to $w$, then, due to the asymmetric transitivity property, an edge from $u$ to $w$ is more likely to form than an edge from $w$ to $u$. The objective function of HOPE is given as follows
$$\min \; \lVert S - \Phi^{s} {\Phi^{t}}^\top \rVert_F^2 \qquad (5)$$
where $\Phi^{s}$ and $\Phi^{t}$ are the source and target embeddings. In order to preserve the asymmetric transitivity of nodes, the proximity matrix $S$ is constructed using a similarity metric which respects the directionality of edges. The node embeddings are generated by factorizing the proximity matrix with generalized SVD [?].
Reproducibility notes: We search over the following hyperparameters: the decay parameter $\beta$ = 0.5/$\rho$, where $\rho$ is the spectral radius of the graph, and Embedding dimension = [64, 128, 256].

LINE [?]: LINE is an optimization-based network embedding method which optimizes an objective function that preserves both first-order and second-order proximity among nodes in the embedding space. The objective function for first-order proximity is given as:
$$O_1 = -\sum_{(u,v) \in E} A_{uv} \log \sigma(\Phi_u^\top \Phi_v) \qquad (6)$$
The objective function to preserve the second-order proximity is given as:
$$O_2 = -\sum_{(u,v) \in E} A_{uv} \log \frac{\exp({\Phi'_v}^\top \Phi_u)}{\sum_{k \in V} \exp({\Phi'_k}^\top \Phi_u)} \qquad (7)$$
where $\Phi'_v$ represents the context embedding of node $v$ (see Table 1 for the rest of the notation). The first-order proximity corresponds to local proximity between nodes based on the presence of edges in the graph, while the second-order proximity corresponds to global proximity between nodes based on the shared neighborhoods of those nodes in the graph. Since the objective function is expensive to compute for large graphs, it is approximated by negative sampling [?].
Reproducibility notes: We search over the following hyperparameters: Number of samples = 10 billion, Embedding dimension = [64, 128, 256]. In the case of directed graphs, as suggested by the authors of LINE, we evaluate only second-order proximity.

Verse [?]: Verse is an optimization-based network embedding method which optimizes an objective function that minimizes the Kullback-Leibler (KL) divergence from the given similarity distribution in graph space ($\mathrm{sim}_G$) to the similarity distribution in embedding space ($\mathrm{sim}_E$). The objective function is given as follows:
$$\sum_{v \in V} \mathrm{KL}\big(\mathrm{sim}_G(v, \cdot) \,\Vert\, \mathrm{sim}_E(v, \cdot)\big) \qquad (8)$$
The similarity distribution in graph space can be constructed with the help of Personalized PageRank [?], SimRank [?], or the adjacency matrix [?]. Since the objective function is expensive to compute for large graphs, it is approximated by Noise Contrastive Estimation [?].
Reproducibility notes: We search over the following hyperparameters: PageRank damping factor = [0.7, 0.85, 0.9], Negative samples = [3, 10], Embedding dimension = [64, 128, 256].

SDNE [?]: SDNE is a deep autoencoder based network embedding method which optimizes an objective function that preserves both first-order and second-order proximity among nodes in the embedding space. The objective function of SDNE is given below
$$\mathcal{L}_{mix} = \mathcal{L}_{2nd} + \alpha \mathcal{L}_{1st} + \nu \mathcal{L}_{reg} \qquad (9)$$
where $\mathcal{L}_{1st}$ and $\mathcal{L}_{2nd}$ are loss functions to preserve first-order and second-order proximities respectively, while $\mathcal{L}_{reg}$ is the regularizer term. The authors propose a semi-supervised deep model to minimize the mentioned objective function. The deep model consists of two components: supervised and unsupervised. The supervised component attempts to preserve the first-order proximity, while the unsupervised component attempts to preserve the second-order proximity by minimizing the reconstruction loss of nodes.
Reproducibility notes: We search over the following hyperparameters: $\alpha$ = [1e-05, 0.2], Penalty coefficient = [5, 10], Embedding dimension = [64, 128, 256].

VAG [?]: VAG is a graph autoencoder based network embedding method which minimizes the reconstruction loss of the adjacency matrix. The reconstruction matrix is generated as $\hat{A} = \sigma(ZZ^\top)$, where $Z$ denotes the node embeddings generated with Graph Convolutional Networks (GCN) [?] as $Z = \mathrm{GCN}(X, A)$, with $X$ as node features (see Table 1 for additional notation). In the case of unattributed graphs, the node feature matrix is the identity matrix.
Reproducibility notes: We search over the following hyperparameters: Epochs = [50, 100], Embedding dimension = [64, 128, 256].

Watch Your Step [?]: Watch Your Step (WYS) addresses the sensitivity of random walk based embedding methods to their hyperparameters. WYS addresses this sensitivity with an attention mechanism on the expected random walk matrix. The attention mechanism guides the random walk to focus on short-term or long-term dependencies pertinent to the input graph. The objective function of WYS is given as
$$\min_{L, R, q} \; \beta \lVert q \rVert_2^2 + \big\lVert -\mathbb{E}[\mathbf{D}; q] \circ \log \sigma(L R^\top) - \mathbb{1}[A = 0] \circ \log\big(1 - \sigma(L R^\top)\big) \big\rVert_1 \qquad (10)$$
where $q$ is the attention parameter vector, $L$ and $R$ are node embeddings, $\beta$ is a regularization coefficient, and $\mathbb{E}[\mathbf{D}; q]$ is the expectation on the random walk matrix $\mathbf{D}$ (see Table 1 for the rest of the notation).
Reproducibility notes: We search over the following hyperparameters: Learning rate = [0.05, 0.1, 0.2, 0.5, 1.0], Number of hops = 5, Embedding dimension = [64, 128, 256].
We select datasets from multiple domains; Table 2 describes the empirical properties of the datasets. The selected datasets support both multi-label and multi-class classification. Directed as well as undirected datasets were selected in order to evaluate the embedding methods on the link prediction task efficiently. Further, datasets with and without edge weights are also included, thereby providing us with a comprehensive set of possibilities to evaluate the methods. We summarize the datasets below:

Web: The WebKB datasets (http://linqs.cs.umd.edu/projects/projects/lbc/) [?] consist of classified webpages (nodes) and hyperlinks between them (edges). Here, labels are the categories of the webpages.

Medical: The PPI dataset [?] represents a subgraph of protein interactions in humans. Labels represent biological states corresponding to hallmark gene sets.

Natural Language: The Wikipedia dataset [?] is a dump of Wikipedia with nodes as words, edges corresponding to the co-occurrence matrix and labels corresponding to Part-of-Speech (POS) tags.

Social: The Blogcatalog dataset and Flickr dataset [?] represent social networks. Blogcatalog and Flickr both represent bloggers and their friendships. The YouTube dataset [?] represents users and their friendships. Labels for Blogcatalog, Flickr, and YouTube correspond to the groups to which each user belongs. The Epinions dataset [?] represents user-annotated trust relationships, where users annotate which other users they trust. These are used to determine the reviews shown to a user.

Citation: The DBLP, CoCit, and Pubmed datasets represent citation networks. DBLP (CoAuthor) represents a subset of papers in DBLP (https://dblp.uni-trier.de/) from closely related fields. CoCit (Microsoft) [?] corresponds to a co-citation subgraph of the Microsoft Academic Graph. Finally, Pubmed corresponds to a subset of diabetes-related publications on Pubmed (https://www.ncbi.nlm.nih.gov/pubmed/). Labels in DBLP correspond to the subfield of the paper. In CoCit, they correspond to the conference of the paper, and in Pubmed they correspond to the types of diabetes.

Digital: The p2pGnutella dataset [?] represents connections between hosts on a peer-to-peer file sharing network. This dataset has no node labels.

Voting: The WikiVote dataset [?] is constructed from voting data in multiple elections for Wikipedia administratorship. Users are nodes, and a (directed) edge ($u$, $v$) represents a vote from user $u$ to user $v$. This dataset also has no node labels.
Dataset              #Nodes     #Edges     #Labels  C/L  D  W

WebKB (Texas)        186        464        4        C    F  T
WebKB (Cornell)      195        478        5        C    F  T
WebKB (Washington)   230        596        5        C    F  T
WebKB (Wisconsin)    265        724        5        C    F  T
PPI                  3,890      38,739     50       L    F  F
Wikipedia            4,777      92,517     40       L    F  T
Blogcatalog          10,312     333,983    39       L    F  F
DBLP (CoAuthor)      18,721     122,245    3        C    F  T
CoCit (Microsoft)    44,034     195,361    15       C    F  F
WikiVote             7,115      103,689    -        -    T  F
Pubmed               19,717     44,338     3        C    T  F
p2pGnutella          62,586     147,892    -        -    T  F
Flickr               80,513     5,899,882  195      L    F  F
Epinions             75,879     508,837    -        -    T  F
YouTube              1,134,890  2,987,624  47       C    F  F

C/L: multi-class (C) or multi-label (L) classification; D: directed; W: weighted (T: true, F: false); "-": no node labels.
In this section, we elaborate on the experimental setup for the link prediction and node classification tasks employed to evaluate the quality of embeddings generated by different methods. We present heuristic baselines for both tasks and define the metrics used for comparing the embedding methods.
Prediction of ties is an essential task in multiple domains where relational information is costlier to obtain, such as drug-target interactions [?] and protein-protein interactions [?], or when the environment is partially observable. The problem of predicting a tie/link between two nodes $u$ and $v$ is often evaluated in one of two ways. The first is to treat the problem as a binary classification problem. The second is to use the dot product on the embedding space as a scoring function to evaluate the strength of the tie.
The edge features for binary classification consist of the node embeddings of nodes $u$ and $v$, where the two node embeddings are aggregated with a binary function. In our study, we experimented with three binary functions on node embeddings: Concatenation, Hadamard, and L2 distance. We used logistic regression as our base classifier for the prediction of the link. The parameters of the logistic regression are tuned using GridSearchCV with 5-fold cross validation and ‘roc_auc’ as the scoring metric. We evaluate the link prediction performance with two metrics: Area Under the Receiver Operating Characteristic curve (AUROC) and Area Under the Precision-Recall curve (AUPR). An alternative evaluation strategy is to predict the presence of a link based on the dot product of the node embeddings of nodes $u$ and $v$. We study the impact of both evaluation strategies later in the paper.
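The three binary functions and the dot-product alternative can be sketched as follows. This is an illustration under stated assumptions: the "L2 distance" operator is taken to be the element-wise squared difference (as in the Node2Vec paper), and the function names and toy embeddings are hypothetical.

```python
import numpy as np

def edge_features(emb, edges, op="hadamard"):
    """Aggregate the embeddings of both endpoints into one feature vector per edge."""
    edges = np.asarray(edges)
    src, dst = emb[edges[:, 0]], emb[edges[:, 1]]
    if op == "concat":
        return np.hstack([src, dst])
    if op == "hadamard":
        return src * dst
    if op == "l2":
        return (src - dst) ** 2   # assumed element-wise form of the L2 operator
    raise ValueError(f"unknown operator: {op}")

def dot_product_scores(emb, edges):
    """Classifier-free alternative: score each candidate edge by the dot product."""
    edges = np.asarray(edges)
    return np.sum(emb[edges[:, 0]] * emb[edges[:, 1]], axis=1)

# Toy embeddings for three nodes and two candidate edges (hypothetical).
emb = np.array([[1.0, 2.0], [3.0, 4.0], [0.0, 1.0]])
edges = [(0, 1), (1, 2)]

hadamard = edge_features(emb, edges, "hadamard")   # [[3, 8], [0, 4]]
concat = edge_features(emb, edges, "concat")       # shape (2, 4)
l2 = edge_features(emb, edges, "l2")               # [[4, 4], [9, 9]]
scores = dot_product_scores(emb, edges)            # [11, 4]
```

The feature matrices would then be passed to the logistic regression classifier, whereas the dot-product scores are ranked directly.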
Construction of the train and test sets: The method of construction of the train and test sets for the link prediction task is crucial for the comparison of embedding methods. The train and test splits consist of 80% and 20% of the edges respectively and are constructed in the following order:

Self-loops are removed.

We randomly select 20% of all edges as positive test edges and add them to the test set.

Positive test edges are removed from the graph. We find the largest weakly connected component formed with the non-removed edges. The edges of the connected component form the positive train edges.

We sample negative edges from the largest weakly connected component and add the sampled negative edges to both the training set and the test set. The number of negative edges is equal to the number of positive edges in both the training and test sets.

For directed graphs, we form “directed negative test edges” $(v, u)$ which satisfy the following constraint: $(u, v) \in E_c$ but $(v, u) \notin E_c$, where $E_c$ refers to the edges in the largest weakly connected component. We add the directed negative test edges to our test set. The number of “directed negative test edges” is around 10% of the negative test edges in the test set.

Nodes present in the test set, but not present in the training set, are deleted from the test set.
In the case of large datasets (more than 5M edges), we reduce our training set: we consider 10% of the randomly selected positive and negative train edges for learning the binary classifier. The learned model is evaluated on the test set. The above steps are repeated for 5 folds of 80:20 train:test splits, and we report the average AUROC and AUPR scores across the 5 folds.
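The construction steps above can be sketched in plain Python as follows. This is a simplified illustration, not the released evaluation script: it treats sampled negatives as unordered non-edges, omits the extra “directed negative test edges” step, and uses hypothetical function names and a toy graph.

```python
import random
from collections import defaultdict, deque

def largest_weak_component(edges):
    """Node set of the largest weakly connected component (BFS, ignoring direction)."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, best = set(), set()
    for start in sorted(adj):
        if start in seen:
            continue
        comp, queue = {start}, deque([start])
        while queue:
            x = queue.popleft()
            for y in adj[x] - comp:
                comp.add(y)
                queue.append(y)
        seen |= comp
        if len(comp) > len(best):
            best = comp
    return best

def sample_non_edges(nodes, forbidden, k, rng):
    """Sample k node pairs that are absent from `forbidden` in either orientation."""
    nodes, chosen = sorted(nodes), set()
    while len(chosen) < k:
        u, v = rng.sample(nodes, 2)
        pair_used = {(u, v), (v, u)} & (forbidden | chosen)
        if not pair_used:
            chosen.add((u, v))
    return sorted(chosen)

def split_edges(edge_list, test_frac=0.2, seed=0):
    rng = random.Random(seed)
    edge_list = sorted({(u, v) for u, v in edge_list if u != v})  # remove self-loops
    rng.shuffle(edge_list)
    n_test = int(test_frac * len(edge_list))
    test_pos, kept = edge_list[:n_test], edge_list[n_test:]       # hold out test edges
    comp = largest_weak_component(kept)                           # largest component
    train_pos = [e for e in kept if e[0] in comp]
    # Drop test edges whose endpoints never appear in the training set.
    test_pos = [e for e in test_pos if e[0] in comp and e[1] in comp]
    forbidden = set(edge_list)
    train_neg = sample_non_edges(comp, forbidden, len(train_pos), rng)
    test_neg = sample_non_edges(comp, forbidden | set(train_neg), len(test_pos), rng)
    return train_pos, train_neg, test_pos, test_neg

# Toy graph (hypothetical): a 30-node ring with a few chords.
edges = [(i, (i + 1) % 30) for i in range(30)] + [(i, (i + 7) % 30) for i in range(0, 30, 3)]
train_pos, train_neg, test_pos, test_neg = split_edges(edges)
```

Repeating the call with different seeds gives the multiple folds over which AUROC and AUPR are averaged.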
In the network embedding literature, node classification is the most popular way of comparing the quality of embeddings generated by different embedding methods. The generated node embeddings are treated as node features, and node labels are treated as ground truth. The classification task performed in our experiments is either multi-label or multi-class classification. The details of the classification task performed on each dataset are provided in Table 2. We select logistic regression as our classifier. The hyperparameters of the logistic regression are tuned using GridSearchCV with 5-fold cross validation and ‘f1_micro’ as the scoring metric. We split the dataset with a 50:50 train-test split. The learned model is evaluated on the test set, and we report the results averaged over 10 shuffles of the train-test sets. The model does not have access to test instances while training.
We note that a majority of the efforts in the literature do not tune the hyperparameters of logistic regression. Default hyperparameters are not always the best hyperparameters for logistic regression. For instance, with the default hyperparameters of the LR classifier, the Macro-F1 performance of Laplacian Eigenmaps on the Blogcatalog dataset is 3.9% for a 50:50 train-test split. However, tuning the hyperparameters results in a significant improvement of the Macro-F1 score to 29.2%.
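The tuning protocol described above can be sketched with scikit-learn as follows. The synthetic features stand in for node embeddings, the grid over `C` is an assumed example (the exact grid is not stated in the text), and only 3 shuffles are run to keep the sketch fast, whereas the study averages over 10.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for node embeddings (X) and multi-class node labels (y).
X, y = make_classification(n_samples=300, n_features=16, n_informative=8,
                           n_classes=3, random_state=0)

param_grid = {"C": [0.01, 0.1, 1.0, 10.0, 100.0]}  # assumed grid, for illustration
micro_scores, macro_scores = [], []

for shuffle in range(3):
    # 50:50 train-test split; the test half is never seen during tuning.
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.5, random_state=shuffle, stratify=y)
    search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid,
                          cv=5, scoring="f1_micro")
    search.fit(X_tr, y_tr)
    pred = search.predict(X_te)
    micro_scores.append(f1_score(y_te, pred, average="micro"))
    macro_scores.append(f1_score(y_te, pred, average="macro"))

mean_micro, mean_macro = np.mean(micro_scores), np.mean(macro_scores)
```

For multi-label datasets the same search would be wrapped around a one-vs-rest classifier, which is not shown here.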
The choice of a “linear” classifier to evaluate the quality of embeddings is not a hard constraint in the node classification task. In this work, we also test the idea of leveraging a “non-linear” classifier for the node classification task and use the EigenPro [?] classifier for this purpose. On large datasets, EigenPro provides a significant performance boost over state-of-the-art kernel methods, with faster convergence rates [?]. In the experiments, we see a benefit to this approach: up to a 15% improvement in Micro-F1 scores with the non-linear classifier compared to the linear classifier.
Next, we present heuristic baselines for both the link prediction and node classification tasks. The purpose of defining heuristic baselines is to assess the difficulty of performing a particular task on a particular dataset, and also to compare the performance of sophisticated network embedding methods against simple heuristics.
In the link prediction literature, there exist multiple similarity-based metrics [?] which can predict a score for link formation between two nodes. Examples of such metrics include the Jaccard Index [?, ?] and Adamic Adar [?]. These similarity-based metrics often base their predictions on the neighborhood overlap between the nodes. We combine the similarity-based metrics to form a curated feature vector of an edge [?]. The binary classifier in the link prediction task is then trained on the generated edge embeddings. Our selected similarity-based metrics are Common Neighbors (CN), Adamic Adar (AA) [?], Jaccard Index (JA) [?], Resource Allocation Index (RA) [?] and Preferential Attachment Index (PA) [?]. The similarity-based metrics CN, JA, and PA capture first-order proximity between nodes, while the metrics AA and RA capture second-order proximity between nodes. We found this heuristic-based model to be highly competitive with the embedding methods on multiple datasets.
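As an illustration of the curated edge feature vector, the five metrics can be computed from neighbor sets using their standard textbook definitions. The sketch assumes an undirected, unweighted graph stored as a dictionary of neighbor sets; the function name and toy graph are hypothetical.

```python
import math

def heuristic_edge_features(adj, u, v):
    """Return [CN, AA, JA, RA, PA] for the node pair (u, v)."""
    nu, nv = adj[u], adj[v]
    common = nu & nv
    union = nu | nv
    cn = len(common)                                   # Common Neighbors
    # Adamic Adar: common neighbors weighted by inverse log degree.
    aa = sum(1.0 / math.log(len(adj[z])) for z in common if len(adj[z]) > 1)
    ja = cn / len(union) if union else 0.0             # Jaccard Index
    ra = sum(1.0 / len(adj[z]) for z in common)        # Resource Allocation
    pa = len(nu) * len(nv)                             # Preferential Attachment
    return [cn, aa, ja, ra, pa]

# Toy undirected graph (hypothetical): a 4-cycle 0-1-2-3 with the chord 0-2.
adj = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1, 3}, 3: {0, 2}}
features = heuristic_edge_features(adj, 1, 3)
```

For the pair (1, 3), both common neighbors have degree 3, so the vector is [2, 2/ln 3, 1.0, 2/3, 4]. Stacking such vectors for all train and test pairs yields the input to the binary classifier.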
Nodes in the graph can be characterized by their properties. We combine the node properties to form a feature vector (embedding) of a node. The classifier in the node classification task is then trained on the generated node embeddings. The node properties capture information such as a node's neighborhood, its influence on other nodes, and its structural properties. We select the following node properties: Degree, PageRank [?], Clustering Coefficient, Hub and Authority scores [?], Average Neighbor Degree, and Eccentricity [?]. We treat the graph as undirected while computing the node properties. As the magnitudes of the node properties differ from one another, we perform column-wise normalization with RobustScaler from scikit-learn. We will show in the experiments section that the node classification heuristic baseline is competitive with most of the embedding methods on datasets with fewer labels.
In this section, we present two measures for comparing the performance of embedding methods in the downstream machine learning task.
Mean Rank: We compute the rank of every embedding method on each dataset based on the selected performance metric and report the average rank of an embedding method across all datasets as its Mean Rank. Let $r_{m,d}$ be the rank of embedding method $m$ on dataset $d$, with $D$ being the set of datasets. The Mean Rank of embedding method $m$ is then given by
$$\mathrm{MeanRank}(m) = \frac{1}{|D|} \sum_{d \in D} r_{m,d} \tag{11}$$
Mean Penalty [?]: We define the penalty of an embedding method $m$ on a dataset $d$ as the difference between the best score achieved by any embedding method on $d$ and the score achieved by $m$ on the same dataset. The score is the selected performance metric for a particular downstream ML task. Let $M$ be the set of embedding methods and $s_{m,d}$ be the score achieved by embedding method $m$ on dataset $d$. The Mean Penalty of $m$ is then given by
$$\mathrm{MeanPenalty}(m) = \frac{1}{|D|} \sum_{d \in D} \Big( \max_{m' \in M} s_{m',d} - s_{m,d} \Big) \tag{12}$$
For a model, lower values of Mean Rank and Mean Penalty suggest better performance. We compare the embedding methods using the Mean Rank and Mean Penalty measures on the datasets where all the embedding methods complete execution. Though the measures do not consider the dataset size or missing values, they are simple and intuitive.
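Both measures reduce to simple arithmetic over a method-by-dataset score matrix. The sketch below is illustrative only: the function name, the method and dataset labels, and the scores are hypothetical, and ties are resolved by giving tied methods the same (best) rank, which is an assumption since the text does not specify tie handling.

```python
def mean_rank_and_penalty(scores):
    """scores[method][dataset] -> score, where higher is better.

    Returns two dicts: method -> Mean Rank and method -> Mean Penalty.
    """
    methods = list(scores)
    datasets = list(next(iter(scores.values())))
    mean_rank, mean_penalty = {}, {}
    for m in methods:
        ranks, penalties = [], []
        for d in datasets:
            column = [scores[other][d] for other in methods]
            # Rank 1 is best; rank = 1 + number of strictly better methods.
            ranks.append(1 + sum(s > scores[m][d] for s in column))
            # Penalty = gap to the best score on this dataset.
            penalties.append(max(column) - scores[m][d])
        mean_rank[m] = sum(ranks) / len(datasets)
        mean_penalty[m] = sum(penalties) / len(datasets)
    return mean_rank, mean_penalty


# Hypothetical scores for three methods on two datasets.
scores = {
    "A": {"d1": 90.0, "d2": 80.0},
    "B": {"d1": 85.0, "d2": 88.0},
    "C": {"d1": 70.0, "d2": 84.0},
}
mean_rank, mean_penalty = mean_rank_and_penalty(scores)
```

In this toy example method B has the lowest Mean Rank (1.5) and the lowest Mean Penalty (2.5) even though it is best on only one of the two datasets, which shows how the measures reward consistency.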
In this section, we report the performance of network embedding methods on the link prediction and node classification tasks. We tune both the parameters of the embedding methods and the parameters of the classifiers used in the two tasks. Whenever possible, we rely on the authors' implementation of the embedding method. All the methods which do not complete execution on large datasets are executed on a modern machine with 500 GB RAM and 28 cores. All the evaluation scripts are executed in the same Python virtual environment. (The evaluation scripts and datasets are available at https://github.com/PriyeshV/NRL_Benchmark.)
The link prediction performance of the 12 embedding methods, measured in terms of AUROC and AUPR on 15 datasets, is shown in the corresponding AUROC and AUPR figures. The Overall (or aggregate) performance of an embedding method on all the datasets is shown at the end of the horizontal bar of each method in those figures. We define the Overall score as the sum of the scores (AUROC or AUPR) of the method on all the datasets. The Mean Rank and Mean Penalty of the embedding methods, on the datasets for which all methods run to completion on our system, are shown in a separate figure. We also provide the tabulated AUROC and AUPR results in two accompanying tables. As mentioned in Section 2.1, we tune the hyperparameters of each embedding method and report the best average AUROC and average AUPR scores across 5 folds. In the case of the WebKB datasets, we evaluate the methods with embedding dimensions 64 and 128. We perform the link prediction task with both normalized and unnormalized embeddings and report the best performance.
We make the following observations:

Effectiveness of MNMF for Link Prediction: We observe that MNMF achieves the highest overall link prediction performance in terms of best average AUROC and AUPR scores. The competitive performance of MNMF on link prediction could be credited to the community information incorporated into the generation of node embeddings. The Mean Rank and Mean Penalty are lowest for MNMF, which also suggests MNMF as a competitive baseline for link prediction. MNMF achieves the first rank on 7 out of 15 datasets. The small value of Mean Penalty suggests that even when MNMF is not the top-ranked method for a particular dataset, its performance is closest to that of the top-ranked method on that dataset. However, MNMF does not outperform the other methods on every dataset. For instance, on the WikiVote and Pubmed datasets, WYS achieves the best average AUROC scores, while on the Microsoft dataset, GraRep achieves the best average AUROC score. In panel (b) of the AUROC and AUPR figures, we see that among the more scalable methods, LINE achieves the highest overall link prediction performance, followed by DeepWalk, in terms of both AUROC and AUPR scores. Note that MNMF did not scale to the datasets with 5M edges on a modern machine with 500 GB RAM and 28 cores. However, the scalability issue of non-negative matrix factorization based methods can be addressed by adopting modern ideas [?, ?] (outside the scope of this study).

Performance of Heuristic Baseline: We observe that the Link Prediction Heuristics baseline described earlier is both efficient and effective. Its overall performance is better than that of Laplacian Eigenmaps and SDNE and competitive with that of Node2vec, HOPE, Verse, and LINE. The Mean Penalty of Link Prediction Heuristics is also close to that of other embedding methods. On the largest dataset, YouTube, Link Prediction Heuristics achieves an AUROC of 96.2%, which is close to the best-performing method, Verse, with an AUROC of 97.6%. Compared with the most competitive baseline, MNMF, the heuristics baseline performs better on the Wikipedia and Blogcatalog datasets. We also observe that the heuristics baseline is competitive against several methods on the directed datasets, even though the chosen similarity-based metrics treat the underlying graph as undirected.
Feature study on the Heuristic Baseline: We study the importance of each individual feature in the heuristics by analyzing the impact of its removal on link prediction. The results are reported in the feature study figure, where the blue line at the top of the columns corresponds to the AUROC scores achieved with the full link prediction heuristic. We see that the removal of the preferential attachment (PA) feature results in a consistent drop in AUROC scores. This drop is statistically significant at a significance level of 0.05 with a paired t-test. The removal of any of the remaining features from the link prediction heuristics did not result in a significant drop in downstream performance.
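To make the significance test concrete, the sketch below computes a paired t statistic from per-dataset AUROC scores with and without a feature. All numbers are hypothetical and the function name is illustrative. The critical value 2.776 is the two-sided 0.05 threshold for 4 degrees of freedom; whether the study used a one-sided or two-sided test is not stated, so the two-sided choice is an assumption. In practice a library routine such as `scipy.stats.ttest_rel` would return the p-value directly.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t_statistic(full, ablated):
    """t statistic of the paired differences full[i] - ablated[i]."""
    diffs = [a - b for a, b in zip(full, ablated)]
    # Mean difference divided by its standard error.
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))


# Hypothetical AUROC scores on five datasets (not values from this study).
full_auroc = [0.95, 0.91, 0.93, 0.96, 0.90]
without_pa = [0.91, 0.88, 0.90, 0.91, 0.87]

t = paired_t_statistic(full_auroc, without_pa)

# Two-sided critical value for alpha = 0.05 and 5 - 1 = 4 degrees of freedom.
T_CRITICAL = 2.776
is_significant = abs(t) > T_CRITICAL
```

Pairing by dataset matters here: AUROC levels vary widely across datasets, and the paired test removes that between-dataset variation before judging the effect of the removed feature.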

Impact of Evaluation Strategy: As described earlier, the presence of a link between two nodes can be predicted with either a Logistic Regression classifier (treating the embeddings as features) or the dot product between the node embeddings. We compare the two evaluation strategies for each embedding method over all datasets using the differences in the average AUROC scores. A positive difference implies that link prediction performance with the classifier is better than that with the dot product. The results are presented as a boxplot in the accompanying figure. A paired t-test suggests that the positive difference is statistically significant for all methods except Verse and WYS at a significance level of 0.05. Hence, using the classifier instead of the dot product provides a significant gain in predictive performance on the link prediction task. We also investigate the changes in the ranking of embedding methods, based on overall average AUROC scores, when predictions are performed with the classifier rather than the dot product. For this ranking, we considered only those datasets on which all methods complete execution. We observed that NetMF ranked 10th with the dot product, while its rank improved to 3rd with the classifier. Since the best link prediction performance for the majority of the embedding methods was achieved with the classifier, we believe the superiority of embedding methods on the link prediction task should be asserted using the classifier.
As mentioned earlier, we leverage the binary functions Hadamard, Concatenation, and L2 to generate the edge embedding. In the accompanying figure, we show which binary function achieved the best average AUROC score for each embedding method on each dataset. We see that Hadamard achieved the largest number of best average AUROC scores. However, there is no single winner among the binary functions.
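The three binary functions map a pair of node embeddings to one edge embedding. The sketch below shows one common reading of them; the L2 operator is assumed to be the element-wise squared difference, as popularized by node2vec, and the exact definition used in the benchmark code may differ. The vectors are hypothetical.

```python
def hadamard(u, v):
    """Element-wise product; output has the same dimension as the inputs."""
    return [a * b for a, b in zip(u, v)]


def concatenation(u, v):
    """Source embedding followed by target embedding; doubles the dimension."""
    return list(u) + list(v)


def l2(u, v):
    """Element-wise squared difference (assumed definition of the L2 operator)."""
    return [(a - b) ** 2 for a, b in zip(u, v)]


# Hypothetical 3-dimensional node embeddings.
emb_u = [1.0, 2.0, 3.0]
emb_v = [0.5, -1.0, 3.0]

edge_hadamard = hadamard(emb_u, emb_v)
edge_concat = concatenation(emb_u, emb_v)
edge_l2 = l2(emb_u, emb_v)
```

Hadamard and L2 are symmetric in the two nodes, whereas Concatenation preserves the order of source and target, which may matter on directed graphs.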

Impact of context embeddings: We study the impact of context embeddings on directed datasets for the link prediction task. For this study, we consider only those embedding methods that generate both node and context embeddings. We compare using node + context embeddings against using only node embeddings through the differences in AUROC scores. The results are detailed in the accompanying figure. A positive difference implies that the use of context embeddings helps link prediction. We see that leveraging node + context embeddings improves the link prediction performance of LINE, HOPE, and WYS. For MNMF, the use of context embeddings does not improve link prediction performance because the community information, which is crucial for link prediction, is already incorporated in the node embeddings. In the case of GraRep, we find that the node embeddings encapsulate high-order information and, hence, leveraging context does not improve performance. The results of DeepWalk and Node2Vec on directed datasets are significantly lower, so to have a fair comparison with other embedding methods, we treated the directed datasets as undirected for DeepWalk and Node2Vec. As a result of this treatment, the median of the box plot for DeepWalk and Node2Vec is close to zero.

Robustness of embedding methods: In link prediction, we compute the average AUROC score and its standard error for an embedding method over the 5 folds of a selected dataset. The standard error reflects the robustness of that embedding method on the selected dataset, as larger values of standard error correspond to larger variance in AUROC scores across the 5 folds. In the accompanying figure, we report the distribution of the AUROC standard error of each embedding method over all datasets. We observed large variance in the standard error on the WebKB datasets and therefore show the results for the WebKB datasets in panel (a) and the results for the other datasets in panel (b). Interestingly, even on the WebKB datasets, the variance in the standard error is low for MNMF. From panel (b), we observe that the median of the boxplots for the majority of the methods is close to zero.
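The robustness measure is the standard error of the mean AUROC across folds. A minimal sketch follows, assuming the usual definition (sample standard deviation divided by the square root of the number of folds); the fold scores are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def mean_and_standard_error(fold_scores):
    """Return (mean, standard error of the mean) for per-fold scores."""
    n = len(fold_scores)
    return mean(fold_scores), stdev(fold_scores) / sqrt(n)


# Hypothetical AUROC scores of one method on the 5 folds of one dataset.
fold_auroc = [0.90, 0.92, 0.91, 0.93, 0.89]
avg, se = mean_and_standard_error(fold_auroc)
```

A method whose standard error stays near zero across datasets produces similar AUROC on every fold, which is the sense of robustness used in this observation.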

Impact of embedding dimension: We study the impact of embedding size for all the embedding methods on link prediction. Specifically, we compare the performance of 64-dimensional embeddings with that of 128-dimensional embeddings. The improvement obtained with 128-dimensional embeddings over 64-dimensional embeddings, quantified as a performance difference, is reported in the accompanying figure. The boxplot represents the distribution of differences in AUROC scores between 128-dimensional and 64-dimensional embeddings for each method on all datasets. In link prediction, we observe a statistically significant improvement at significance level 0.05 with the 128-dimensional embeddings for Laplacian Eigenmaps, GraRep, HOPE, NetMF and MNMF.

Impact of embedding normalization: We study the impact of L2 normalization of the embeddings on link prediction performance. The comparison results are shown in the accompanying figure, where panel (a) and panel (b) show the results when link prediction is performed through the classifier and the dot product, respectively. The box plot represents the distribution of differences in AUROC between normalized and unnormalized embeddings on the link prediction task. A positive difference implies that L2 normalization results in better downstream performance. When link prediction is performed through the classifier, the negative difference is statistically significant for Verse and GraRep at a significance level of 0.05 with a paired t-test. Surprisingly, however, the difference in performance due to normalization is not statistically significant for the rest of the methods. When link prediction is performed through the dot product, normalizing the embeddings results in a statistically significant improvement for Node2vec and Verse, while leaving the embeddings unnormalized results in a statistically significant improvement for HOPE, NetMF and WYS.
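One way to see why normalization interacts with the dot product strategy: after L2 normalization, the dot product of two embeddings equals their cosine similarity, so any information carried by the vector norms is discarded. The sketch below illustrates this with hypothetical vectors; the helper names are illustrative.

```python
from math import sqrt


def l2_normalize(vec):
    """Scale a vector to unit Euclidean norm (zero vectors are returned as-is)."""
    norm = sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def dot(u, v):
    """Dot product used as the link score between two node embeddings."""
    return sum(a * b for a, b in zip(u, v))


# Hypothetical embeddings pointing in the same direction with different norms.
emb_u = [3.0, 4.0]
emb_v = [6.0, 8.0]

raw_score = dot(emb_u, emb_v)
normalized_score = dot(l2_normalize(emb_u), l2_normalize(emb_v))
```

Here the raw score is 50.0 while the normalized score is approximately 1.0, so methods that encode useful signal in the embedding norm could plausibly lose performance under normalization, whereas methods with noisy norms could gain.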
The node classification performance of the 12 embedding methods, measured in terms of Micro-f1 scores on 15 datasets with a train-test split of 50:50, is reported in the corresponding figures for the linear and nonlinear classifiers. The overall performance of an embedding method on all the datasets is shown at the end of the horizontal bar of each method in those figures and represents the sum of the scores (Micro-f1) of the method on the datasets. The Mean Rank and Mean Penalty of the embedding methods, on the datasets for which all methods run to completion on our system, are shown in a separate figure. We also report the Mean Rank and Mean Penalty of the embedding methods on datasets with few labels in a further figure. We tune the hyperparameters of each embedding method, as mentioned in Section 2.1, and report the best Micro-f1 score. In the case of the WebKB datasets, we evaluate the methods with embedding dimensions 64 and 128. We perform node classification with both normalized and unnormalized embeddings and report the best performance. We also provide tabulated results in two accompanying tables. We make the following observations.

Effectiveness of NetMF for node classification: We observe that NetMF achieves the highest overall node classification performance in terms of best Micro-f1 scores using both linear and nonlinear classifiers. From the Mean Rank and Mean Penalty figure, we see that both measures are lowest for NetMF, which suggests NetMF as the strongest overall method for node classification. The low Mean Rank indicates that NetMF is among the top-ranked methods on the evaluated datasets. The smallest value of Mean Penalty suggests that even when NetMF is not the top-ranked method for a particular dataset, its performance is closest to that of the top-ranked method for that dataset. However, NetMF does not outperform the other methods on every dataset. LINE, DeepWalk, and Node2Vec are also competitive baselines for node classification, as their overall performance is closest to that of NetMF. The performance of GraRep on datasets with more labels is comparable with other methods when we exclude the Flickr dataset. Note, however, that the reported results for GraRep on the Flickr dataset use embeddings of dimension 64, because embedding dimensions 128 and 256 resulted in memory errors on a modern machine with 500 GB RAM and 28 cores. Note also that NetMF did not scale to the YouTube dataset. While scalability is currently outside the scope of our study, the scalability of such methods is under active development (we refer the interested reader elsewhere [?, ?]).

Laplacian Eigenmaps Performance: We observe that the Laplacian Eigenmaps method achieves competitive Micro-f1 scores on several datasets. For instance, on the Blogcatalog dataset with 39 labels, Laplacian Eigenmaps achieves the best Micro-f1 score of 42.1%, while on the Pubmed dataset, it outperforms all other embedding methods with 81.7% Micro-f1. With a nonlinear classifier, Laplacian Eigenmaps achieves the second-best performance on the PPI dataset with 23.8% Micro-f1. We observe from the Mean Rank and Mean Penalty figure that the Mean Penalty of Laplacian Eigenmaps is also close to that of other embedding methods, namely Verse, MNMF, and VAG. On the PPI and Flickr datasets, the Micro-f1 of the Laplacian Eigenmaps baseline is close to the best Micro-f1. The observed results for Laplacian Eigenmaps on the evaluated datasets are better than the previously reported results [?, ?] for both node classification and link prediction tasks. This improvement in the performance of Laplacian Eigenmaps is due to tuning the hyperparameters of the logistic regression classifier.

Node Classification Heuristic: We observe from panel (a) of the two node classification figures that the node classification heuristic baseline is competitive against other embedding methods on datasets with fewer labels (up to 5 labels), as its overall score is better than that of many of the methods. This observation can also be verified from the few-label Mean Rank and Mean Penalty figure, as both the Mean Rank and the Mean Penalty of the node classification heuristics baseline are better than those of many of the methods. However, as the number of labels in the datasets increases (more than 5 labels), we observe that the Micro-f1 scores of the node heuristics baseline decrease drastically. The decrease in overall performance reflects that the node heuristics features lack the discriminative power to classify multiple labels.
Feature study on the Heuristics baseline: We study the importance of each individual feature in the node classification heuristics by analyzing the impact of its removal on node classification performance. The results for the node classification heuristic with logistic regression and EigenPro are reported in panel (a) and panel (b) of the accompanying figure, respectively. The blue line at the top of the columns in the figures corresponds to the Micro-f1 scores achieved with the full node classification heuristics. The removal of any individual feature did not result in a significant drop in downstream performance. However, with both Logistic Regression and EigenPro, the full node classification heuristics performs better than the variants with an individual feature removed on most of the datasets.

Context embeddings can improve performance: We see from the accompanying figure that leveraging both node and context embeddings of Skip-gram based models results in a significant improvement (up to 25%) for most of the methods. On the Pubmed dataset, we observe that the node classification performance of embedding methods like LINE (2nd order), HOPE, and WYS was significantly lower than that of other methods. The Micro-f1 scores of the embedding methods are shown in the same figure. We found that the Pubmed dataset consists of around 80% sink nodes. As a result, when the embedding methods based on the Skip-gram model generate the node embeddings, the sink nodes are always considered as "context" nodes and are never considered as "source" nodes. Hence, the node embeddings of sink nodes are of lower quality. In order to have a fair comparison, we concatenate both the node and context embeddings of the methods (whenever possible) and evaluate the performance on the concatenated embeddings.
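The sink-node effect and the concatenation remedy can be sketched as follows. The helper names, the toy citation edges, and the embedding values are hypothetical; a sink node is taken to be a node with incoming but no outgoing edges.

```python
def sink_nodes(edges):
    """Nodes that appear as an edge target but never as an edge source."""
    sources = {u for u, _ in edges}
    targets = {v for _, v in edges}
    return targets - sources


def concat_node_and_context(node_emb, context_emb):
    """Concatenate each node's source-role and context-role embeddings."""
    return {n: list(node_emb[n]) + list(context_emb[n]) for n in node_emb}


# Hypothetical directed citation edges (citing paper -> cited paper).
edges = [("p1", "p3"), ("p2", "p3"), ("p1", "p4")]
sinks = sink_nodes(edges)

# Hypothetical 2-dimensional embeddings for two of the nodes.
node_emb = {"p1": [0.2, 0.7], "p3": [0.0, 0.0]}
context_emb = {"p1": [0.1, 0.1], "p3": [0.9, 0.4]}
combined = concat_node_and_context(node_emb, context_emb)
```

In this toy setting the sink node `p3` has an uninformative node embedding, so the signal useful for classification sits in its context embedding, and concatenation makes that signal available to the classifier.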

Impact of nonlinear classifier: We study the impact of the nonlinear classifier on node classification performance. The comparison results are shown in the accompanying box plot, which represents the distribution of differences in Micro-f1 scores between the nonlinear classifier (EigenPro [?]) and the linear classifier (Logistic Regression). A positive difference implies that the results with the nonlinear classifier are better than those with the linear classifier. For Verse, we see a 15% absolute increase with the use of the nonlinear classifier on the PubMed dataset. The positive difference is statistically significant (with a paired t-test) for DeepWalk, Verse, SDNE, GraRep and MNMF at significance level 0.05. It is worth pointing out that on the smaller datasets this gain is less evident, while on the larger datasets (more training data) the benefits of using a nonlinear classifier are much clearer.

Impact of embedding dimension: We study the impact of embedding size for all the embedding methods on the node classification task. Specifically, we compare the performance of 64-dimensional embeddings with that of 128-dimensional embeddings. The improvement obtained with 128-dimensional embeddings over 64-dimensional embeddings, quantified as a performance difference, is reported in the accompanying figure. The boxplot represents the distribution of differences in Micro-f1 scores between 128-dimensional and 64-dimensional embeddings for each method on all datasets. In node classification with the linear classifier, none of the evaluated methods obtained a statistically significant difference at significance level 0.05. In node classification with the nonlinear classifier, HOPE obtained a statistically significant positive difference at significance level 0.05 with 128-dimensional embeddings.

Impact of embedding normalization: We study the impact of L2 normalization of the embeddings on the node classification task. The comparison results are shown in the accompanying figure. The box plot represents the distribution of differences in Micro-f1 scores between normalized and unnormalized embeddings on the node classification task. A positive difference implies that L2 normalization results in better downstream performance. In node classification with the linear classifier, the positive difference is statistically significant for NetMF, while the negative difference is statistically significant for DeepWalk, at significance level 0.05 with a paired t-test. In node classification with the nonlinear classifier, the positive difference is statistically significant for NetMF at significance level 0.05 with a paired t-test.

Node classification performance on 10:90 train:test split: We report the node classification performance of all methods on all the evaluated datasets with a 10:90 train:test split, with the logistic regression classifier in one figure and the nonlinear classifier EigenPro in another. The Mean Rank and Mean Penalty of the embedding methods, on the datasets for which all methods run to completion on our system, are shown in panel (a) of the corresponding figure. We also report the Mean Rank and Mean Penalty of the embedding methods on datasets with few labels in panel (b). The observations we reported with the 50:50 train:test split also seem to hold with the 10:90 split. Specifically, we observe that NetMF is the most competitive method for node classification, while Laplacian Eigenmaps outperforms multiple existing methods on multiple datasets (Blogcatalog, Co-author datasets). Embedding methods such as DeepWalk and LINE also perform well on most datasets.
Network representation learning has attracted a lot of attention in the past few years. An interested reader can refer to the surveys of network embedding methods [?, ?, ?]. These surveys focus on categorizing the embedding methods based on either an encoder-decoder framework [?] or a novel taxonomy [?, ?], but do not provide an experimental comparison of the embedding methods. There does exist one other experimental survey of network embedding methods [?]. However, there are key differences. First, we present a systematic study on a larger set of embedding methods, including several more recent ideas, and on many more datasets (15 vs 7). Specifically, we evaluate 12 embedding methods + 2 efficient heuristics on 15 datasets. Second, there are several key differences in terms of results reported and reproducibility. In our work, we carefully tune all hyperparameters of each method as well as the logistic classifier (and include this information in our reproducibility notes). As a concrete example of where such careful tuning can make a difference, consider that on Blogcatalog with a train-test split of 50:50, Goyal et al. achieve a Macro-f1 score of 3.9%, while by tuning the hyperparameters of logistic regression we achieve a Macro-f1 score of 29.2%. Third, our analysis reveals several important insights on the role of context, the role of different link prediction evaluation strategies (dot product vs classifier), the impact of nonlinear classifiers, and many others. All of these provide useful insights for end users as well as guidance for future research and evaluation in network representation learning and downstream applications. Fourth, we also provide a comparison against simple but effective task-specific baseline heuristics, which will serve as useful strawman methods for future work in these areas.
To conclude, we identify several issues in the current literature: lack of a standard assessment protocol, use of default parameters for baselines, lack of a standard benchmark, and neglect of task-specific baselines. Additionally, we make the following observations:

MNMF and NetMF are the most effective baselines for the link prediction and node classification tasks, respectively.

No single method completely outperforms the other methods on both link prediction and node classification tasks.

If one considers Laplacian Eigenmaps as a baseline, the classifier parameters should be tuned appropriately.

The Link Prediction Heuristic we present is simple, efficient to compute, and offers competitive performance. The Node Classification Heuristic is also simple and efficient to compute and is effective on datasets with fewer labels.

For both tasks, some methods are impervious to the use of context whereas for other methods context helps significantly.

When comparing embedding methods through the link prediction task, the superiority of an embedding method should be asserted using the classifier.
We hope the insights put forward in this study are helpful to the community and encourage the comparison of novel embedding methods with the task-specific competitive methods and the proposed task-specific heuristics.
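Table: Link prediction performance, AUROC (%). A "-" marks a cell with no reported value.

| Datasets | Heuristics | LapEig | DeepWalk | Node2Vec | Verse | LINE | GraRep | HOPE | SDNE | NetMF | MNMF | VAG | WYS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| W-Texas | 77.7 | 73.1 | 79.7 | 83 | 81.7 | 78.2 | 78.7 | 78.7 | 82.4 | 80.9 | 96.0 | 78.6 | 83.0 |
| W-Cornell | 81.5 | 77.3 | 79.2 | 82 | 87 | 77.5 | 84.4 | 79.9 | 80.2 | 81.2 | 96.7 | 74.8 | 84.4 |
| W-Washington | 75.3 | 70.1 | 75.3 | 75 | 82.2 | 72.9 | 78.1 | 72.8 | 76.7 | 75.5 | 97.5 | 73.9 | 79.2 |
| W-Wisconsin | 79.4 | 71.4 | 80.7 | 78 | 88.2 | 72.5 | 84.3 | 75.5 | 76.9 | 82.3 | 98.9 | 73.9 | 84.5 |
| PPI | 90.9 | 78.2 | 89.1 | 88.3 | 89.6 | 87.8 | 90 | 88.4 | 89.3 | 87.3 | 96.9 | 87.4 | 91.5 |
| Wikipedia | 91.6 | 77.9 | 90.9 | 90.9 | 91.3 | 91.2 | 92.3 | 90.4 | 50 | 91.4 | 88.4 | 89.5 | 92.3 |
| WikiVote | 91.5 | 83.5 | 97.4 | 97.6 | 94.9 | 96.6 | 88.4 | 97.8 | 96.6 | 95.5 | 92.2 | 94.3 | 98.2 |
| BlogCatalog | 95.2 | 77.4 | 94.3 | 95 | 97.3 | 95.2 | 96.2 | 95.3 | 95.6 | 95.1 | 94 | 94.8 | 96.0 |
| DBLP (CoAuthor) | 95.6 | 93.3 | 96 | 95.4 | 97.9 | 94.3 | 97.1 | 89.6 | 50 | 95.9 | 99.4 | 94.1 | 96.8 |
| Pubmed | 87.7 | 89.6 | 89.1 | 89.3 | 96.6 | 92.7 | 77.7 | 90.1 | 88.9 | 89.8 | 94.3 | 93.6 | 97.0 |
| CoCit (Microsoft) | 89.5 | 95.6 | 97.6 | 97.3 | 83.7 | 97.2 | 97.9 | 94.5 | 91.9 | 96.9 | 96.6 | 96.2 | - |
| P2P | 83.8 | 69.9 | 88.2 | 88.3 | 77.6 | 91.2 | 71.8 | 88.6 | 83.9 | 87.5 | 92.3 | - | - |
| Flickr | 92.4 | 93 | 95.8 | 94.7 | 72.6 | 95.2 | 95.5 | 96.5 | 93 | 97.2 | - | - | - |
| Epinions | 92.2 | 90.9 | 93.3 | 93.4 | 91.9 | 91.6 | 93.7 | 92.7 | 92.7 | 92.8 | - | - | - |
| Youtube | 96.2 | 96 | 93.6 | 91.4 | 97.6 | 96.5 | 91.4 | 92.4 | - | - | - | - | - |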
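
Table: Link prediction performance, AUPR (%). A "-" marks a cell with no reported value.

| Datasets | Heuristics | LapEig | DeepWalk | Node2Vec | Verse | LINE | GraRep | HOPE | SDNE | NetMF | MNMF | VAG | WYS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| W-Texas | 81.8 | 78 | 81.9 | 85 | 85 | 82.2 | 82.1 | 82.9 | 85.5 | 83.6 | 96.0 | 80.6 | 85.2 |
| W-Cornell | 81.9 | 78.9 | 79.8 | 81 | 87.8 | 76.7 | 86.2 | 79.8 | 79.5 | 82 | 96.7 | 75.6 | 86.6 |
| W-Washington | 80.3 | 75 | 76.5 | 78 | 86.5 | 75.5 | 82.9 | 77.2 | 81.5 | 80.4 | 97.7 | 78.7 | 83.4 |
| W-Wisconsin | 82.3 | 74.8 | 81.6 | 79 | 90.7 | 76.4 | 87.4 | 78.8 | 80.7 | 85.2 | 98.5 | 78.1 | 86.7 |
| PPI | 91.4 | 80.7 | 90.4 | 89.5 | 90.7 | 88.1 | 90.8 | 89.2 | 90.2 | 87.9 | 96.5 | 88.1 | 92.2 |
| Wikipedia | 93 | 76 | 92.5 | 92.3 | 92.8 | 92.8 | 93.1 | 91.8 | 75 | 92.8 | 89.9 | 91.4 | 93.5 |
| WikiVote | 87.9 | 82.1 | 96.9 | 97.2 | 94.8 | 95.3 | 84 | 96.8 | 96.3 | 93.8 | 87.6 | 94.3 | 97.4 |
| BlogCatalog | 95.1 | 77.5 | 94.3 | 94.8 | 97.9 | 95.1 | 96 | 95 | 95.5 | 94.8 | 93.6 | 94.6 | 96.1 |
| DBLP (CoAuthor) | 96.7 | 93.8 | 96.8 | 96.1 | 98.2 | 95.6 | 97.4 | 90.9 | 75 | 96.7 | 99.2 | 95.2 | 97.3 |
| Pubmed | 85 | 85 | 81.5 | 82.3 | 96.8 | 90.3 | 74.1 | 91.4 | 90.5 | 86.1 | 90.3 | 95.2 | 96.9 |
| CoCit (Microsoft) | 91.9 | 95.5 | 97.9 | 97.5 | 76.4 | 97.7 | 97.9 | 95.3 | 93.5 | 97.1 | 95.2 | 96.4 | - |
| P2P | 79.3 | 68.1 | 84.3 | 84.5 | 68.3 | 88.6 | 71.3 | 85.8 | 80.8 | 84 | 89.6 | - | - |
| Flickr | 92.5 | 95 | 96.1 | 95 | 70.9 | 95.5 | 95.7 | 96.7 | 94 | 97.6 | - | - | - |
| Epinions | 89.2 | 89.5 | 91.7 | 91.9 | 88.6 | 88.8 | 93.0 | 91.5 | 91.6 | 91.8 | - | - | - |
| Youtube | 96.7 | 96.7 | 95 | 93 | 98.2 | 97 | 92.2 | 94 | - | - | - | - | - |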
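
Table: Node classification performance, Micro-f1 (%). A "-" marks a cell with no reported value.

| Datasets | Heuristics | LapEig | DeepWalk | Node2Vec | Verse | LINE | GraRep | HOPE | SDNE | NetMF | MNMF | VAG | WYS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| W-Texas | 61.8 | 54.6 | 55.1 | 57.2 | 54.5 | 61.8 | 56.1 | 59.1 | 58.0 | 67.1 | 58.0 | 54.8 | 60.6 |
| W-Cornell | 42.1 | 30.9 | 40.5 | 34.3 | 35.4 | 44.1 | 40.9 | 41.9 | 48.5 | 48.1 | 36.1 | 40.3 | 41.8 |
| W-Washington | 65.0 | 43.3 | 56.0 | 58.4 | 51.5 | 65.3 | 46.7 | 62.8 | 60.5 | 61.1 | 60.4 | 59.0 | 65.3 |
| W-Wisconsin | 51.7 | 41.5 | 52.3 | 45.6 | 41.7 | 52.9 | 50.8 | 51.5 | 51.3 | 56.7 | 53.9 | 48.1 | 53.2 |
| PPI | 10.8 | 22.3 | 21.4 | 21.0 | 19.7 | 19.9 | 20.4 | 18.8 | 17.4 | 21.3 | 18.6 | 19.2 | 22.6 |
| Wikipedia | 41.9 | 46.3 | 50.0 | 51.4 | 43.8 | 56.3 | 58.8 | 57.9 | 52.4 | 58.4 | 48.1 | 41.1 | 44.4 |
| BlogCatalog | 17.1 | 42.1 | 41.5 | 41.7 | 35.5 | 38.6 | 41.3 | 34.4 | 29.5 | 41.7 | 21.6 | 17.1 | 38.9 |
| DBLP (CoAuthor) | 37.3 | 37.1 | 35.9 | 35.6 | 37.2 | 37.0 | 35.7 | 36.0 | 37.4 | 36.6 | 36.2 | 36.2 | - |
| Pubmed | 57.8 | 81.7 | 81.5 | 81.1 | 63.0 | 64.4 | 79.1 | 74.7 | 67.7 | 80.0 | 77.1 | 63.5 | 73.6 |
| CoCit (Microsoft) | 25.0 | 43.0 | 46.3 | 46.7 | 46.0 | 46.5 | 46.6 | 44.6 | 38.1 | 43.5 | 44.7 | 43.5 | - |
| Flickr | 19.1 | 34.0 | 35.6 | 35.1 | 30.1 | 33.4 | 10.5 | 28.7 | 30.8 | 34.2 | - | - | - |
| Youtube | 24.4 | - | 40.7 | 40.3 | 38.5 | 40.3 | 38.0 | 38.7 | - | - | - | - | - |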
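
Table: Node classification performance, Macro-f1 (%). A "-" marks a cell with no reported value.

| Datasets | Heuristics | LapEig | DeepWalk | Node2Vec | Verse | LINE | GraRep | HOPE | SDNE | NetMF | MNMF | VAG | WYS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| W-Texas | 42.1 | 18.1 | 26.9 | 22.7 | 18.1 | 36.4 | 25.0 | 34.4 | 39.5 | 49.9 | 23.8 | 27.8 | 40.3 |
| W-Cornell | 22.2 | 21.3 | 28.1 | 23.2 | 22.4 | 25.0 | 27.7 | 28.7 | 13.1 | 32.8 | 23.5 | 20.6 | 26.5 |
| W-Washington | 32.2 | 22.2 | 24.3 | 27.3 | 23.1 | 30.6 | 28.7 | 30.2 | 29.1 | 29.7 | 28.6 | 26.3 | 31.1 |
| W-Wisconsin | 24.7 | 29.0 | 31.9 | 23.9 | 21.9 | 27.8 | 34.7 | 28.3 | 26.1 | 34.8 | 33.8 | 25.5 | 33.9 |
| PPI | 6.0 | 17.9 | 18.1 | 18.0 | 16.5 | 16.9 | 17.4 | 15.9 | 15.2 | 17.5 | 15.9 | 13.1 | 17.9 |
| Wikipedia | 5.5 | 10.4 | 11.9 | 12.9 | 8.2 | 18.2 | 18.3 | 20.1 | 14.1 | 18.4 | 11.0 | 3.8 | 10.1 |
| BlogCatalog | 3.1 | 29.2 | 27.3 | 27.9 | 22.1 | 23.6 | 28.9 | 20.8 | 14.8 | 28.8 | 8.2 | 3.1 | 26.3 |
| DBLP (CoAuthor) | 18.1 | 20.1 | 30.0 | 29.4 | 20.6 | 19.2 | 30.5 | 28.6 | 21.1 | 30.0 | 26.6 | 27.8 | - |
| Pubmed | 48.9 | 80.2 | 80.1 | 79.8 | 58.0 | 61.3 | 77.6 | 73.0 | 63.3 | 78.4 | 75.4 | 60.6 | 71.6 |
| CoCit (Microsoft) | 12.6 | 27.3 | 34.3 | 34.2 | 33.3 | 33.8 | 34.8 | 32.8 | 27.8 | 34.0 | 30.4 | 29.2 | - |
| Flickr | 1.7 | 20.4 | 21.2 | 20.7 | 17.6 | 18.2 | 0.9 | 11.4 | 14.9 | 20.2 | - | - | - |
| Youtube | 9.3 | - | 34.7 | 34.0 | 32.1 | 33.1 | 30.0 | 30.8 | - | - | - | - | - |
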
 1 Microsoft Academic Graph - KDD Cup, 2016. https://kddcup2016.azurewebsites.net/Data.
 2 S. Abu-El-Haija, B. Perozzi, R. Al-Rfou, and A. Alemi. Watch your step: Learning node embeddings via graph attention. In Neural Information Processing Systems, 2018.
 3 L. A. Adamic and E. Adar. Friends and neighbors on the web. Social Networks, 25(3):211–230, 2003.
 4 E. M. Airoldi, D. M. Blei, S. E. Fienberg, E. P. Xing, and T. Jaakkola. Mixed membership stochastic block models for relational data with application to protein-protein interactions. In Proceedings of the International Biometrics Society Annual Meeting, 2006.
 5 A.-L. Barabási and R. Albert. Emergence of scaling in random networks. Science, 286(5439):509–512, 1999.
 6 M. Belkin and P. Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6):1373–1396, 2003.
 7 A. R. Benson, D. F. Gleich, and J. Leskovec. Higher-order organization of complex networks. Science, 353(6295):163–166, 2016.
 8 S. Bhagat, G. Cormode, and S. Muthukrishnan. Node classification in social networks. In Social Network Data Analytics, pages 115–148. Springer, 2011.
 9 D. K. Bhattacharyya and J. K. Kalita. Network anomaly detection: A machine learning perspective. Chapman and Hall/CRC, 2013.
 10 B.-J. Breitkreutz, C. Stark, T. Reguly, L. Boucher, A. Breitkreutz, M. Livstone, R. Oughtred, D. H. Lackner, J. Bähler, V. Wood, et al. The BioGRID interaction database. Nucleic Acids Research, 36(suppl_1):D637–D640, 2007.
 11 H. Cai, V. W. Zheng, and K. C.-C. Chang. A comprehensive survey of graph embedding: Problems, techniques, and applications. IEEE Transactions on Knowledge and Data Engineering, 30(9):1616–1637, 2018.
 12 S. Cao, W. Lu, and Q. Xu. GraRep: Learning graph representations with global structural information. In Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, pages 891–900. ACM, 2015.
 13 E. K. Cetinkaya, M. J. Alenazi, A. M. Peck, J. P. Rohrer, and J. P. Sterbenz. Multilevel resilience analysis of transportation and communication networks. Telecommunication Systems, 60(4):515–537, 2015.
 14 S. Chakrabarti, B. Dom, and P. Indyk. Enhanced hypertext categorization using hyperlinks. In ACM SIGMOD Record, pages 307–318. ACM, 1998.
 15 W. W. Cohen and J. Richman. Learning to match and cluster large high-dimensional data sets for data integration. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 475–480. ACM, 2002.
 16 G. Crichton, Y. Guo, S. Pyysalo, and A. Korhonen. Neural networks for link prediction in realistic biomedical graphs: a multi-dimensional evaluation of graph embedding-based approaches. BMC Bioinformatics, 19(1):176, May 2018.
 17 R. W. Eckardt III, R. G. Wolf Jr, A. Shapiro, K. G. Rivette, and M. F. Blaxill. Method and apparatus for selecting, analyzing, and visualizing related database records as a network, Mar. 2 2010. US Patent 7,672,950.
 18 D. Eppstein, M. S. Paterson, and F. F. Yao. On nearestneighbor graphs. Discrete & Computational Geometry, 17(3):263–282, 1997.
 19 L. Getoor and A. Machanavajjhala. Entity resolution: theory, practice & open challenges. Proceedings of the VLDB Endowment, 5(12):2018–2019, 2012.
 20 P. Goyal and E. Ferrara. Graph embedding techniques, applications, and performance: A survey. KnowledgeBased Systems, 151:78–94, 2018.
 21 A. Grover and J. Leskovec. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining, pages 855–864. ACM, 2016.
 22 Y. Gu, Y. Sun, and J. Gao. The coevolution model for social network evolving and opinion migration. In Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining, pages 175–184. ACM, 2017.
 23 M. Gutmann and A. Hyvärinen. Noisecontrastive estimation: A new estimation principle for unnormalized statistical models. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 297–304, 2010.
 24 W. Hamilton, Z. Ying, and J. Leskovec. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, pages 1024–1034, 2017.
 25 W. L. Hamilton, R. Ying, and J. Leskovec. Representation learning on graphs: Methods and applications. arXiv preprint arXiv:1709.05584, 2017.
 26 P. Jaccard. Étude comparative de la distribution florale dans une portion des alpes et des jura. Bull Soc Vaudoise Sci Nat, 37:547–579, 1901.
 27 G. Jeh and J. Widom. Simrank: a measure of structuralcontext similarity. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 538–543. ACM, 2002.
 28 T. N. Kipf and M. Welling. Variational graph autoencoders. NIPS Workshop on Bayesian Deep Learning, 2016.
 29 T. N. Kipf and M. Welling. Semisupervised classification with graph convolutional networks. In International Conference on Learning Representations (ICLR), 2017.
 30 J. M. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM (JACM), 46(5):604–632, 1999.
 31 T. Kraska, A. Beutel, E. H. Chi, J. Dean, and N. Polyzotis. The case for learned index structures. In Proceedings of the 2018 International Conference on Management of Data, pages 489–504. ACM, 2018.
 32 J. Leskovec, D. Huttenlocher, and J. Kleinberg. Signed networks in social media. In Proceedings of the SIGCHI conference on human factors in computing systems, pages 1361–1370. ACM, 2010.
 33 J. Liang, S. Gurukar, and S. Parthasarathy. Mile: A multilevel framework for scalable graph embedding. arXiv preprint arXiv:1802.09612, 2018.
 34 J. Liang, P. Jacobs, J. Sun, and S. Parthasarathy. Semisupervised embedding in attributed networks with outliers. In Proceedings of the 2018 SIAM International Conference on Data Mining, pages 153–161. SIAM, 2018.
 35 D. LibenNowell and J. Kleinberg. The linkprediction problem for social networks. Journal of the American society for information science and technology, 58(7):1019–1031, 2007.
 36 L. Lü and T. Zhou. Link prediction in complex networks: A survey. Physica A: statistical mechanics and its applications, 390(6):1150–1170, 2011.
 37 S. Ma and M. Belkin. Diving into the shallows: a computational perspective on largescale shallow learning. In Advances in Neural Information Processing Systems, pages 3778–3787, 2017.
 38 M. Mahoney. Large text compression benchmark. URL: http://www. mattmahoney. net/text/text. html, 2011.
 39 T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In Proceedings of the 26th International Conference on Neural Information Processing Systems  Volume 2, NIPS’13, pages 3111–3119, USA, 2013. Curran Associates Inc.
 40 A. Mnih and G. E. Hinton. A scalable hierarchical distributed language model. In Advances in neural information processing systems, pages 1081–1088, 2009.
 41 G. E. Moon, A. SukumaranRajam, S. Parthasarathy, and P. Sadayappan. Plnmf: Parallel localityoptimized nonnegative matrix factorization. arXiv preprint arXiv:1904.07935, 2019.
 42 M. Newman. Networks: an introduction. Oxford university press, 2010.
 43 M. Nickel, K. Murphy, V. Tresp, and E. Gabrilovich. A review of relational machine learning for knowledge graphs. Proceedings of the IEEE, 104(1):11–33, 2016.
 44 M. Ou, P. Cui, J. Pei, Z. Zhang, and W. Zhu. Asymmetric transitivity preserving graph embedding. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining, pages 1105–1114. ACM, 2016.
 45 L. Page, S. Brin, R. Motwani, and T. Winograd. The pagerank citation ranking: Bringing order to the web. Technical report, Stanford InfoLab, 1999.
 46 C. C. Paige and M. A. Saunders. Towards a generalized singular value decomposition. SIAM Journal on Numerical Analysis, 18(3):398–405, 1981.
 47 B. Perozzi, R. AlRfou, and S. Skiena. Deepwalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 701–710. ACM, 2014.
 48 J. Qiu, Y. Dong, H. Ma, J. Li, C. Wang, and K. Wang. Netsmf: Largescale network embedding as sparse matrix factorization. Proceedings of the 2019 World Wide Web Conference on World Wide Web, 2019.
 49 J. Qiu, Y. Dong, H. Ma, J. Li, K. Wang, and J. Tang. Network embedding as matrix factorization: Unifying deepwalk, line, pte, and node2vec. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, pages 459–467. ACM, 2018.
 50 L. F. Ribeiro, P. H. Saverese, and D. R. Figueiredo. struc2vec: Learning node representations from structural identity. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 385–394. ACM, 2017.
 51 M. Richardson, R. Agrawal, and P. Domingos. Trust management for the semantic web. In International semantic Web conference, pages 351–368. Springer, 2003.
 52 M. Ripeanu and I. Foster. Mapping the gnutella network: Macroscopic properties of largescale peertopeer systems. In international workshop on peertopeer systems, pages 85–93. Springer, 2002.
 53 A. Sinha, R. Cazabet, and R. Vaudaine. Systematic biases in link prediction: Comparing heuristic and graph embedding based methods. In L. M. Aiello, C. Cherifi, H. Cherifi, R. Lambiotte, P. Lió, and L. M. Rocha, editors, Complex Networks and Their Applications VII, pages 81–93, Cham, 2019. Springer International Publishing.
 54 J. Tang, M. Qu, M. Wang, M. Zhang, J. Yan, and Q. Mei. Line: Largescale information network embedding. In Proceedings of the 24th International Conference on World Wide Web, pages 1067–1077. International World Wide Web Conferences Steering Committee, 2015.
 55 A. Tsitsulin, D. Mottin, P. Karras, and E. Müller. Verse: Versatile graph embeddings from similarity measures. In Proceedings of the 2018 World Wide Web Conference on World Wide Web, pages 539–548. International World Wide Web Conferences Steering Committee, 2018.
 56 P. Vijayan, Y. Chandak, M. M. Khapra, and B. Ravindran. Fusion graph convolutional networks. Mining and Learning with Graphs (MLG), KDD, 2018.
 57 C. Wang, V. Satuluri, and S. Parthasarathy. Local probabilistic models for link prediction. In Proceedings of the 2007 Seventh IEEE International Conference on Data Mining, ICDM ’07, pages 322–331, Washington, DC, USA, 2007. IEEE Computer Society.
 58 D. Wang, P. Cui, and W. Zhu. Structural deep network embedding. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining, pages 1225–1234. ACM, 2016.
 59 X. Wang, P. Cui, J. Wang, J. Pei, W. Zhu, and S. Yang. Community preserving network embedding. In AAAI, pages 203–209, 2017.
 60 R. Zafarani and H. Liu. Social computing data repository at asu, 2009.
 61 D. Zhang, J. Yin, X. Zhu, and C. Zhang. Network representation learning: A survey. IEEE transactions on Big Data, 2018.
 62 F. Zhang, W. Zhang, Y. Zhang, L. Qin, and X. Lin. Olak: an efficient algorithm to prevent unraveling in social networks. Proceedings of the VLDB Endowment, 10(6):649–660, 2017.
 63 M. Zhao and V. Saligrama. Anomaly detection with score functions based on nearest neighbor graphs. In Advances in neural information processing systems, pages 2250–2258, 2009.
 64 T. Zhou, L. Lü, and Y.C. Zhang. Predicting missing links via local information. The European Physical Journal B, 71(4):623–630, 2009.