Network Representation Learning: Consolidation and Renewed Bearing
Abstract
Graphs are a natural abstraction for many problems where nodes represent entities and edges represent a relationship across entities. The abstraction can be explicit (e.g., transportation networks, social networks, foreign key relationships) or implicit (e.g., nearest neighbor problems). An important area of research that has emerged over the last decade is the use of graphs as a vehicle for nonlinear dimensionality reduction in a manner akin to previous efforts based on manifold learning with uses for downstream database processing (e.g., entity resolution and link prediction, outlier analysis), machine learning and visualization. In this systematic yet comprehensive experimental survey, we benchmark several popular network representation learning methods operating on two key tasks: link prediction and node classification.
We examine the performance of 12 unsupervised embedding methods on 15 datasets. To the best of our knowledge, the scale of our study – both in terms of the number of methods and number of datasets – is the largest to date. As far as possible, our benchmarking study uses the original code provided by the authors of each method.
Our results reveal several key insights about work to date in this space. First, we find that certain baseline methods (task-specific heuristics, as well as classic manifold methods) that have often been dismissed or not considered by previous efforts can compete on certain types of datasets if they are tuned appropriately. Second, we find that recent methods based on matrix factorization offer a small but relatively consistent advantage over alternative methods (e.g., random-walk based methods) from a qualitative standpoint. Specifically, we find that MNMF, a community preserving embedding method, is the most competitive method for the link prediction task, while NetMF is the most competitive method for node classification. Third, no single method completely outperforms the other embedding methods on both node classification and link prediction tasks. We also present several drill-down analyses that reveal settings under which certain algorithms perform well (e.g., the role of neighborhood context on performance; dataset characteristics that influence performance), guiding the end user.
Graphs are effective in multiple disparate domains to model, query and mine relational data. Examples abound, ranging from the use of nearest neighbor graphs in database systems [?, ?] and machine learning [?, ?] to the analysis of biological networks [?, ?], and from social network analysis [?, ?] to the analysis of transportation networks [?]. ML-enhanced data structures and algorithms such as learned indexes [?] have recently shown promising results in database systems. An active area in ML research – network representation learning – has potential in multiple applications related to downstream database processing tasks such as outlier analysis [?, ?], entity resolution [?, ?], link prediction [?, ?] and visualization [?, ?]. However, a plethora of new network representation learning methods has been proposed recently [?, ?]. Given the wide range of methods proposed, it is often hard for a practitioner to determine which of these methods they should consider adopting for a particular task on a particular dataset. Part of the challenge is the lack of a standard evaluation benchmark and of a thorough, independent understanding of the strengths and weaknesses of each method for the task at hand. The challenges are daunting and can be summarized as follows:
Lack of Standard Assessment Protocol: First, there is no standard for evaluating the quality of generated embeddings. The efficacy of embedding methods is often evaluated based on downstream machine learning tasks. As a result, the superiority of one embedding method over another hinges on its performance in a downstream machine learning task. Without a standard evaluation protocol for these downstream tasks, the results reported by different research articles are incomparable and, in certain cases, inconsistent. As a specific example, the Node2vec paper [?] reports the node classification performance of DeepWalk on the Blogcatalog dataset – for multi-label classification with a train-test split of 50:50 – as 21.1% Macro-F1, whereas the DeepWalk paper [?] reports DeepWalk's performance as 27.3% Macro-F1.
Tuning Comparative Strawman: Second, a new method almost always compares its performance against a subset of other methods, and on a subset of the tasks and datasets previously evaluated. In many cases, while great care is taken to tune the new method (via careful hyperparameter tuning), the same care is often not taken when evaluating baselines. For example, in our experiments on Blogcatalog, we find that with a train-test split of 50:50 the Laplacian Eigenmaps method [?] without GridSearch achieves a Macro-F1 score of 3.9% (similar to what was reported in [?]). However, after tuning the hyperparameters of logistic regression, we find that the Laplacian Eigenmaps method achieves a Macro-F1 of 29.2%. Similarly, while logistic regression is commonly used to evaluate the quality of node embeddings in such methods, GridSearch over logistic regression parameters is rarely conducted or reported. Additionally, reported results are rarely averaged over multiple shuffles to reduce any bias or patterns in the training data (this is our observation based on the evaluation scripts publicly shared by multiple authors). In short, a lack of consistency in evaluation inhibits our understanding of the scientific advances in this arena.
Standard Benchmark: Third, there is no agreed list of datasets that is used consistently in the literature. Each new embedding method is evaluated on selected datasets with a node classification or link prediction setup chosen by its authors. For instance, some methods report node classification performance for the baselines with a train-test split of 10:90, while others report the same with a train-test split of 50:50. As a result, the comparison across embedding methods is often unclear. Additionally, there are no clear guidelines on whether a proposed embedding methodology favors a certain type of dataset characteristic (e.g., sparsity).
Task Specific Baselines: Fourth, for many tasks such as node classification and link prediction there is a rich pre-existing literature [?, ?] focused on such tasks (that does not explicitly rely on node embedding methodology as a preprocessing step). Few, if any, of the prior works in network representation learning consider such baselines; they typically compare performance on downstream ML tasks only against other node embedding methods. In our experiments, we find that a curated feature vector based on heuristics can achieve a similarly competitive AUROC score on many of the datasets for the link prediction task.
To summarize, there is a clear and pressing need for a comprehensive and careful benchmarking of such methods, which is the focus of this study. To address the aforementioned issues in the network embedding literature, we perform an experimental study of 12 promising network embedding methods on 15 diverse datasets. The selected embedding methods are unsupervised techniques that generate the node embeddings of a graph. Our goal is to perform a uniform, principled comparison of these methods on a variety of different datasets and across two key tasks – link prediction and node classification.
Specific findings of our experimental survey that we wish to highlight are that:

For the link prediction task, we find that MNMF [?], a community preserving embedding method, offers a compelling advantage when it scales to a particular dataset. Other more scalable alternatives such as Verse and LINE also perform well on most datasets. The heuristic approach we present for link prediction competes exceptionally well on all types of datasets surveyed for this task.

For the node classification task, NetMF [?], when it scales to the particular dataset, offers a small but consistent performance advantage. We find that, for the node classification task, the task-specific heuristic methodology we compare with works well on datasets with fewer labels. In such scenarios, it competes well with a majority of the methods surveyed, whereas some recently proposed methods fare much worse.

We also drill down to study the impact of context embeddings on the link prediction and node classification tasks (and find some methods impervious to the use of context, whereas for others it helps significantly). We also examine two common ways in which link prediction strategies are evaluated (explicitly through a classifier, or implicitly through vector dot-product ranking). We find that there is a clear separation in performance between these alternative strategies.
We denote the input graph as $G = (V, E)$, where $V$ and $E$ denote the set of nodes and the set of edges of the graph $G$. The notations used in this work are listed in the notation table below. In this study, we consider both directed and undirected graphs, as well as both weighted and unweighted graphs. We evaluate the embedding methods on non-attributed, homogeneous graphs.
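Symbol  Meaning

$G$  Input graph
$V$  Nodes
$E$  Edges
$N$  Number of nodes, $|V|$
$A$  Adjacency matrix
$D$  Degree matrix of the graph, with $D_{ii} = \sum_j A_{ij}$ and $D_{ij} = 0$ where $i \neq j$
$I$  Identity matrix
$\Phi_u$  Node embedding of node $u$
$\Phi'_u$  Context embedding of node $u$
$\Phi, \Phi'$  Node and context embedding matrices of size $N \times d$
$|E|$  Number of edges
$S$  Graph similarity matrix
$\sigma$  Sigmoid function
$b$  Number of negative samples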
Definition 2.1
Network Embedding: Given a graph $G$ and an embedding dimension $d$, where $d \ll N$, the goal of a network embedding method is to learn a $d$-dimensional representation of the graph $G$ such that similarity in graph space approximates closeness in the $d$-dimensional space.
In this section, we give a summary of the network embedding methods evaluated in our work. For each model, along with its description, we also provide additional experimental details for reproducibility.

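Laplacian Eigenmaps [?]: Laplacian Eigenmaps generates a $d$-dimensional embedding of the graph using the eigenvectors associated with the $d$ smallest eigenvalues of the Laplacian matrix $L = D - A$:
$$\min_{Y}\ \operatorname{tr}(Y^\top L Y) \quad \text{subject to } Y^\top D Y = I,$$
where $Y \in \mathbb{R}^{N \times d}$ is the generated embedding matrix. The above objective reduces to a simple minimization of the squared L2 distance between adjacent nodes, $\frac{1}{2}\sum_{i,j} A_{ij}\,\lVert Y_i - Y_j \rVert_2^2$. It has been shown that Laplacian Eigenmaps minimizes the graph cuts. Laplacian Eigenmaps leverages first-order information to generate the embeddings.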
Reproducibility notes: We search over the following hyperparameters: Embedding dimension = [64, 128, 256]. On datasets with more than 1M nodes, Laplacian Eigenmaps did not scale for embedding dimensions 128 and 256.
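The following is a minimal sketch of the Laplacian Eigenmaps computation described above, assuming a small, connected, undirected graph stored as a dense NumPy adjacency matrix. It solves the generalized eigenproblem $L y = \lambda D y$ through the symmetrically normalized Laplacian; the function name and the toy graph are illustrative and are not taken from the implementation benchmarked in this study.

```python
import numpy as np

def laplacian_eigenmaps(adj: np.ndarray, dim: int) -> np.ndarray:
    """Embed a connected, undirected graph given its dense adjacency matrix."""
    degrees = adj.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(degrees)           # assumes no isolated nodes
    laplacian = np.diag(degrees) - adj            # L = D - A
    # D^{-1/2} L D^{-1/2} turns L y = lambda D y into a standard symmetric problem.
    l_sym = d_inv_sqrt[:, None] * laplacian * d_inv_sqrt[None, :]
    _, eigvecs = np.linalg.eigh(l_sym)            # eigenvalues in ascending order
    # Skip the trivial eigenvector (eigenvalue 0) and keep the next `dim`.
    selected = eigvecs[:, 1:dim + 1]
    return d_inv_sqrt[:, None] * selected         # map back: y = D^{-1/2} v

# Toy graph (illustrative): two triangles joined by the edge (2, 3).
A = np.zeros((6, 6))
for u, v in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]:
    A[u, v] = A[v, u] = 1.0
Y = laplacian_eigenmaps(A, dim=2)
```

On this toy graph the first embedding coordinate takes opposite signs on the two triangles, which is the graph-cut behaviour mentioned above. For graphs of the size used in this study, a sparse eigensolver would be needed instead of the dense decomposition shown here.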

DeepWalk [?]: DeepWalk is a random-walk-based network embedding method which uses truncated random walks and leverages the local information in these walks to learn latent representations in which similar nodes receive similar embeddings. DeepWalk draws inspiration from the Skip-gram model in Word2vec [?] by treating random walks as sequences and optimizing the following objective function:
$$\min_{\Phi}\ -\log \Pr\big(\{v_{i-w}, \dots, v_{i+w}\} \setminus v_i \mid \Phi(v_i)\big) \tag{1}$$
where $\Phi(v_i)$ denotes the embedding of the node $v_i$ and $w$ is the window size. Since the objective function is expensive to compute for large graphs, it is approximated using Hierarchical Softmax.
Reproducibility notes: We search over the following hyperparameters: Walk length = [5, 20, 40], Number of walks = [20, 40, 80], Window size = [2, 4, 10], Embedding dimension = [64, 128, 256]. In the case of directed graphs, we observe lower performance in the node classification and link prediction tasks. In order to have a fair comparison with other methods, we treat directed graphs as undirected for DeepWalk.
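To illustrate how the walk length, number of walks and window size interact, the sketch below generates truncated uniform random walks and the (center, context) pairs that a Skip-gram model would be trained on. It assumes an unweighted graph stored as an adjacency-list dictionary; the helper names are hypothetical, and the training of the embeddings themselves (e.g., with Hierarchical Softmax) is omitted.

```python
import random

def truncated_random_walks(adj, num_walks, walk_length, seed=0):
    """Start `num_walks` uniform random walks of up to `walk_length` nodes at every node."""
    rng = random.Random(seed)
    nodes = list(adj)
    walks = []
    for _ in range(num_walks):
        rng.shuffle(nodes)                    # visit start nodes in a new order each pass
        for start in nodes:
            walk = [start]
            while len(walk) < walk_length:
                neighbors = adj[walk[-1]]
                if not neighbors:             # dead end: the walk is truncated early
                    break
                walk.append(rng.choice(neighbors))
            walks.append(walk)
    return walks

def skipgram_pairs(walk, window):
    """Yield (center, context) pairs at most `window` positions apart in the walk."""
    for i, center in enumerate(walk):
        lo, hi = max(0, i - window), min(len(walk), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield center, walk[j]

# Toy graph (illustrative): a triangle with one pendant node.
graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
walks = truncated_random_walks(graph, num_walks=2, walk_length=5)
pairs = list(skipgram_pairs(walks[0], window=2))
```

Longer walks and larger windows produce more pairs per node, which is why these hyperparameters are searched jointly.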

Node2Vec [?]: Node2Vec is a biased random-walk-based network embedding method which allows the random walk to be more flexible in exploring graph neighborhoods. The flexibility of the random walk is achieved by interpolating between breadth-first traversal and depth-first traversal. The objective function is based on the Skip-gram model in Word2vec [?], and since the objective function is expensive to compute for large graphs, it is approximated by negative sampling [?].
Reproducibility notes: We search over the following hyperparameters: Walk length = [10, 20, 40], Number of walks = 80, Window size = 10, $p$ and $q$ = [0.25, 1, 2, 4], Embedding dimension = [64, 128, 256]. In the case of directed graphs, we observe lower performance in the node classification and link prediction tasks. In order to have a fair comparison with other methods, we treat directed graphs as undirected for Node2Vec.
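The interpolation between breadth-first and depth-first exploration can be sketched as a second-order random walk. The code below follows our understanding of the return parameter $p$ and in-out parameter $q$ from the Node2Vec paper, assuming an unweighted, undirected graph stored as a dictionary of neighbor sets; the function names are illustrative, not the authors' API.

```python
import random

def transition_weights(adj, prev, cur, p, q):
    """Unnormalized weights for leaving `cur` after arriving from `prev`."""
    weights = {}
    for nxt in sorted(adj[cur]):
        if nxt == prev:
            weights[nxt] = 1.0 / p      # step back to the previous node
        elif nxt in adj[prev]:
            weights[nxt] = 1.0          # stay near the previous node (BFS-like)
        else:
            weights[nxt] = 1.0 / q      # move away from the previous node (DFS-like)
    return weights

def biased_walk(adj, start, walk_length, p, q, rng):
    """Sample one walk; the first step is uniform because no previous node exists."""
    walk = [start]
    while len(walk) < walk_length:
        cur = walk[-1]
        if not adj[cur]:
            break
        if len(walk) == 1:
            walk.append(rng.choice(sorted(adj[cur])))
        else:
            weights = transition_weights(adj, walk[-2], cur, p, q)
            candidates = list(weights)
            walk.append(rng.choices(candidates,
                                    weights=[weights[c] for c in candidates], k=1)[0])
    return walk

# Toy graph (illustrative): a triangle 0-1-2 with node 3 attached to node 1.
graph = {0: {1, 2}, 1: {0, 2, 3}, 2: {0, 1}, 3: {1}}
example_weights = transition_weights(graph, prev=0, cur=1, p=0.25, q=4)
walk = biased_walk(graph, start=0, walk_length=6, p=0.25, q=4, rng=random.Random(0))
```

With a small $p$ and a large $q$, as in this example, the walk strongly prefers returning to or staying near the previous node, which corresponds to local, breadth-first-like exploration.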

GraRep [?]: GraRep is a matrix-factorization-based network embedding method which captures the global structural information of the graph while learning node embeddings. The authors observe that existing Skip-gram based models project all the $k$-step relational information into a common subspace, and argue for the importance of preserving different $k$-step relational information in separate subspaces. The loss function to preserve the $k$-step relationship between node $w$ and node $c$ is proposed as:
$$L_k(w, c) = p_k(c \mid w)\,\log \sigma(\vec{w}^\top \vec{c}) + \lambda\, \mathbb{E}_{c' \sim p_k(V)}\big[\log \sigma(-\vec{w}^\top \vec{c}\,')\big] \tag{2}$$
where $p_k(c \mid w)$ is the $k$-step transition probability from $w$ to $c$, $c'$ refers to the negative node at the $k$-th step for node $w$, and $\lambda$ refers to the number of negative samples. In closed form, the above loss function results in a log-transformed, probabilistic adjacency matrix which is factorized with SVD to generate each $k$-step representation. The final node representation is generated by concatenating all the $k$-step representations.
Reproducibility notes: We search over the following hyperparameters: $k$ from 1 to 6, Embedding dimension = [64, 128, 256]. On datasets with more than 2M edges, due to scalability issues, we searched over $k$ from 1 to 2 and Embedding dimension = [64, 128].

NetMF [?]: NetMF is a matrix-factorization-based network embedding method. The NetMF authors prove that Skip-gram models with negative sampling implicitly approximate and factorize appropriate matrices constructed with the help of graph Laplacians. For a small context window $T$, the matrix factorized by NetMF is given by
$$\log\left(\frac{\operatorname{vol}(G)}{bT}\left(\sum_{r=1}^{T} (D^{-1}A)^{r}\right) D^{-1}\right) \tag{3}$$
where $\operatorname{vol}(G) = \sum_{i,j} A_{ij}$ and $b$ is the number of negative samples. NetMF factorizes the above closed-form DeepWalk matrix with SVD in order to generate node embeddings, and provides two algorithms, one for small context windows and one for large context windows.
Reproducibility notes: We search over the following hyperparameters: Context window $T$ = [1, 10], Negative samples = [1, 2, 3], Rank for large context window = [128, 256, 512], Embedding dimension = [64, 128, 256].
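As a rough illustration of the small-window variant, the sketch below builds the closed-form DeepWalk matrix $\log\big(\max\big(\tfrac{\operatorname{vol}(G)}{bT}\big(\sum_{r=1}^{T}(D^{-1}A)^{r}\big)D^{-1},\, 1\big)\big)$ and factorizes it with a dense SVD. The element-wise truncation at 1 before the logarithm and the scaling of the singular vectors by the square root of the singular values follow our reading of the NetMF paper; the function names are hypothetical and the dense computation is only suitable for small graphs.

```python
import numpy as np

def netmf_log_matrix(adj, window, negative):
    """Truncated-log DeepWalk matrix for a small context window (dense sketch)."""
    vol = adj.sum()
    d_inv = 1.0 / adj.sum(axis=1)              # assumes no isolated nodes
    transition = d_inv[:, None] * adj          # P = D^{-1} A
    power = np.eye(adj.shape[0])
    walk_sum = np.zeros(adj.shape)
    for _ in range(window):
        power = power @ transition
        walk_sum += power                      # accumulates P^1 + ... + P^T
    # Scaling column j by 1/d_j is the right-multiplication by D^{-1}.
    m = (vol / (negative * window)) * walk_sum * d_inv[None, :]
    return np.log(np.maximum(m, 1.0))          # truncate at 1 so the log is non-negative

def netmf_embedding(adj, window, negative, dim):
    """Rank-`dim` factorization of the matrix above."""
    u, s, _ = np.linalg.svd(netmf_log_matrix(adj, window, negative))
    return u[:, :dim] * np.sqrt(s[:dim])

# Toy graph (illustrative): a triangle with one pendant node.
A = np.zeros((4, 4))
for a, b in [(0, 1), (1, 2), (0, 2), (2, 3)]:
    A[a, b] = A[b, a] = 1.0
log_m = netmf_log_matrix(A, window=1, negative=1)
emb = netmf_embedding(A, window=1, negative=1, dim=2)
```

For $T = 1$ and one negative sample, the entry for an edge $(i, j)$ is $\log(\operatorname{vol}(G) / (d_i d_j))$, so edges between low-degree nodes receive the largest values.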

MNMF [?]: MNMF is a matrix-factorization-based network embedding method which generates node embeddings that preserve microscopic information, in the form of first-order and second-order proximities among nodes, as well as mesoscopic information, in the form of community structure. The objective function for MNMF is given as
$$\min_{M, U, H, C}\ \lVert S - M U^\top \rVert_F^2 + \alpha \lVert H - U C^\top \rVert_F^2 - \beta\, \operatorname{tr}(H^\top B H) \quad \text{s.t. } M, U, H, C \ge 0,\ \operatorname{tr}(H^\top H) = N \tag{4}$$
where $S$ is the similarity matrix, $U$ is the node embedding matrix, $M$ is the basis matrix, $H$ is the binary community membership matrix, $C$ is the latent representation of communities and $B$ is the modularity matrix obtained from the adjacency matrix $A$. Overall, MNMF discovers communities through modularity constraints. The node embeddings generated with the help of microscopic information and the community embeddings are then jointly optimized by assuming a consensus relationship between node and community embeddings.
Reproducibility notes: We search over the following hyperparameters: $\alpha$ = [0.1, 1.0, 10.0], $\beta$ = [0.1, 1.0, 10.0], Embedding dimension = [64, 128, 256].

HOPE [?]: HOPE is a matrix-factorization-based network embedding method which generates node embeddings that preserve the asymmetric transitivity of nodes in directed graphs. If there exists a directed edge from node $u$ to $w$ and from $w$ to $v$, then – due to the asymmetric transitivity property – an edge from $u$ to $v$ is more likely to form than an edge from $v$ to $u$. The objective function of HOPE is given as follows
$$\min\ \lVert S - U^{s} {U^{t}}^\top \rVert_F^2 \tag{5}$$
where $S$ is the proximity matrix, and $U^{s}$ and $U^{t}$ are the source and target embeddings. In order to preserve the asymmetric transitivity of nodes, the proximity matrix $S$ is constructed using a similarity metric which respects the directionality of edges. The node embeddings are generated by factorizing the proximity matrix with generalized SVD [?].
Reproducibility notes: We search over the following hyperparameters: The decay parameter $\beta = 0.5/\rho$, where $\rho$ is the spectral radius of the graph. Embedding dimension = [64, 128, 256].

LINE [?]: LINE is an optimization-based network embedding method which optimizes an objective function that preserves both first-order and second-order proximity among nodes in the embedding space. The objective function for first-order proximity is given as:
$$O_1 = -\sum_{(i, j) \in E} w_{ij}\, \log \sigma(\vec{u}_i^\top \vec{u}_j) \tag{6}$$
The objective function to preserve the second-order proximity is given as:
$$O_2 = -\sum_{(i, j) \in E} w_{ij}\, \log \frac{\exp(\vec{u}_j'^\top \vec{u}_i)}{\sum_{k=1}^{N} \exp(\vec{u}_k'^\top \vec{u}_i)} \tag{7}$$
where $w_{ij}$ is the weight of edge $(i, j)$, $\vec{u}_i$ represents the node embedding of node $i$, and $\vec{u}_j'$ represents the context embedding of node $j$. The first-order proximity corresponds to local proximity between nodes based on the presence of edges in the graph, while the second-order proximity corresponds to global proximity between nodes based on the shared neighborhoods of those nodes in the graph. Since the objective function is expensive to compute for large graphs, it is approximated by negative sampling [?].
Reproducibility notes: We search over the following hyperparameters: Number of samples = 10 billion, Embedding dimension = [64, 128, 256]. In the case of directed graphs, as suggested by the authors of LINE, we evaluate only second-order proximity.

Verse [?]: Verse is an optimization-based network embedding method which minimizes the Kullback-Leibler (KL) divergence from the given similarity distribution in graph space to the similarity distribution in embedding space. The objective function is given as follows:
$$\sum_{v \in V} \operatorname{KL}\big(\operatorname{sim}_G(v, \cdot)\ \Vert\ \operatorname{sim}_E(v, \cdot)\big) \tag{8}$$
where $\operatorname{sim}_G$ and $\operatorname{sim}_E$ denote the similarity distributions in graph space and embedding space, respectively. The similarity distribution in graph space can be constructed with the help of Personalized PageRank, SimRank, or the adjacency matrix. Since the objective function is expensive to compute for large graphs, it is approximated by Noise Contrastive Estimation [?].
Reproducibility notes: We search over the following hyperparameters: PageRank damping factor = [0.7, 0.85, 0.9], Negative samples = [3, 10], Embedding dimension = [64, 128, 256].

SDNE [?]: SDNE is a deep-autoencoder-based network embedding method which optimizes an objective function that preserves both first-order and second-order proximity among nodes in the embedding space. The objective function of SDNE is given below
$$\mathcal{L}_{mix} = \mathcal{L}_{2nd} + \alpha\, \mathcal{L}_{1st} + \nu\, \mathcal{L}_{reg} \tag{9}$$
where $\mathcal{L}_{1st}$ and $\mathcal{L}_{2nd}$ are loss functions to preserve first-order and second-order proximities respectively, while $\mathcal{L}_{reg}$ is the regularizer term. The authors propose a semi-supervised deep model to minimize this objective function. The deep model consists of two components: supervised and unsupervised. The supervised component attempts to preserve the first-order proximity, while the unsupervised component attempts to preserve the second-order proximity by minimizing the reconstruction loss of nodes.
Reproducibility notes: We search over the following hyperparameters: $\alpha$ = [1e-05, 0.2], Penalty coefficient $\beta$ = [5, 10], Embedding dimension = [64, 128, 256].

VAG [?]: VAG is a graph-autoencoder-based network embedding method which minimizes the reconstruction loss of the adjacency matrix. The reconstruction matrix is generated as $\hat{A} = \sigma(Z Z^\top)$, where $Z$ is the matrix of node embeddings generated with Graph Convolutional Networks (GCN) [?] as $Z = \operatorname{GCN}(X, A)$, with $X$ as node features. In the case of unattributed graphs, the node feature matrix $X$ is an identity matrix.
Reproducibility notes: We search over the following hyperparameters: Epochs = [50, 100], Embedding dimension = [64, 128, 256].

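Watch Your Step [?]: Watch Your Step (WYS) addresses the hyperparameter sensitivity of random-walk-based embedding methods. WYS does so with an attention mechanism on the expected random walk matrix. The attention mechanism guides the random walk to focus on the short-term or long-term dependencies pertinent to the input graph. The objective function of WYS is given as
$$\min_{L, R, q}\ \beta \lVert q \rVert_2^2 + \Big\lVert -\mathbb{E}[\mathbf{D}; q] \circ \log \sigma(L R^\top) - \mathbb{1}[A = 0] \circ \log\big(1 - \sigma(L R^\top)\big) \Big\rVert_1 \tag{10}$$
where $q$ is the attention parameter vector, $L$ and $R$ are node embeddings, $\mathbb{E}[\mathbf{D}; q]$ is the expectation on the random walk matrix, and $\beta$ is a regularization coefficient.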
Reproducibility notes: We search over the following hyperparameters: Learning rate = [0.05, 0.1, 0.2, 0.5, 1.0], Number of hops = 5, Embedding dimension = [64, 128, 256].
We select datasets from multiple domains. The dataset table that follows the summaries below describes the empirical properties of the datasets. The selected datasets support both multi-label and multi-class classification. Directed as well as undirected datasets were selected in order to evaluate the embedding methods on the link prediction task effectively. Further, datasets with and without edge weights are also included, thereby providing us with a comprehensive set of possibilities to evaluate the methods. We summarize the datasets below:

Web: The WebKB datasets (http://linqs.cs.umd.edu/projects/projects/lbc/) [?] consist of classified webpages (nodes) and hyperlinks between them (edges). Here, labels are the categories of the webpages.

Medical: The PPI dataset [?] represents a subgraph of protein interactions in humans. Labels represent biological states corresponding to hallmark gene sets.

Natural Language: The Wikipedia dataset [?] is a dump of Wikipedia with nodes as words, edges corresponding to the co-occurrence matrix and labels corresponding to Part-of-Speech (POS) tags.

Social: The Blogcatalog dataset and Flickr dataset [?] represent social networks. Blogcatalog represents bloggers and their friendships, and Flickr represents users of a photo-sharing site and their contacts. The YouTube dataset [?] represents users and their friendships. Labels for Blogcatalog, Flickr, and YouTube correspond to the groups to which each user belongs. The Epinions dataset [?] represents user-annotated trust relationships, where users annotate which other users they trust. These are used to determine the reviews shown to a user.

Citation: The DBLP, CoCit, and Pubmed datasets represent citation networks. DBLP (CoAuthor) represents a subset of papers in DBLP (https://dblp.uni-trier.de/) from closely related fields. CoCit (Microsoft) [?] corresponds to a co-citation subgraph of the Microsoft Academic Graph. Finally, Pubmed corresponds to a subset of diabetes-related publications on Pubmed (https://www.ncbi.nlm.nih.gov/pubmed/). Labels in DBLP correspond to the subfield of the paper. In CoCit, they correspond to the conference of the paper, and in Pubmed they correspond to the types of diabetes.

Digital: The p2pGnutella dataset [?] represents connections between hosts on a peer-to-peer file sharing network. This dataset has no node labels.

Voting: The WikiVote dataset [?] is constructed from voting data in multiple elections for Wikipedia administratorship. Users are nodes, and a (directed) edge $(u, v)$ represents a vote from user $u$ to user $v$. This dataset also has no node labels.
Dataset  #Nodes  #Edges  #Labels  Classification (C = multi-class, L = multi-label)  Directed (T/F)  Weighted (T/F)

WebKB (Cornell)  195  478  5  C  F  T
WebKB (Washington)  230  596  5  C  F  T
WebKB (Wisconsin)  265  724  5  C  F  T
PPI  3,890  38,739  50  L  F  F
Wikipedia  4,777  92,517  40  L  F  T
Blogcatalog  10,312  333,983  39  L  F  F
DBLP (CoAuthor)  18,721  122,245  3  C  F  T
CoCit (Microsoft)  44,034  195,361  15  C  F  F
WikiVote  7,115  103,689  -  -  T  F
Pubmed  19,717  44,338  3  C  T  F
p2pGnutella  62,586  147,892  -  -  T  F
Flickr  80,513  5,899,882  195  L  F  F
Epinions  75,879  508,837  -  -  T  F
YouTube  1,134,890  2,987,624  47  C  F  F
In this section, we elaborate on the experimental setup for the link prediction and node classification tasks employed to evaluate the quality of the embeddings generated by different methods. We present heuristic baselines for both tasks and define the metrics used for comparing the embedding methods.
Prediction of ties is an essential task in multiple domains where relational information is costly to obtain, such as drug-target interactions [?] and protein-protein interactions [?], or when the environment is partially observable. The problem of predicting a tie/link between two nodes $u$ and $v$ is often evaluated in one of two ways. The first is to treat the problem as a binary classification problem. The second is to use the dot product in the embedding space as a scoring function to evaluate the strength of the tie.
The edge features for binary classification consist of the node embeddings of nodes $u$ and $v$, where the two node embeddings are aggregated with a binary function. In our study, we experimented with three binary functions on node embeddings: Concatenation, Hadamard, and L2 distance. We used logistic regression as our base classifier for the prediction of the link. The parameters of the logistic regression are tuned using GridSearchCV with 5-fold cross validation and ‘roc_auc’ as the scoring metric. We evaluate the link prediction performance with two metrics: Area Under the Receiver Operating Characteristics curve (AUROC) and Area Under the Precision-Recall curve (AUPR). An alternative evaluation strategy is to predict the presence of a link based on the dot product of the node embeddings of nodes $u$ and $v$. We study the impact of both evaluation strategies later in the paper.
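The two evaluation strategies can be sketched in a few lines. The code below shows the three binary operators used to build edge features and the classifier-free alternative that ranks candidate edges by the dot product, scored with a rank-based AUROC. We assume here that the L2 operator is the element-wise squared difference (the convention used in the Node2Vec paper); the embeddings and candidate edges are made-up toy values, and the tuned logistic regression that would consume the edge features is omitted.

```python
def concatenation(x, y):
    """Edge feature of size 2d: the two node embeddings side by side."""
    return list(x) + list(y)

def hadamard(x, y):
    """Edge feature of size d: element-wise product."""
    return [a * b for a, b in zip(x, y)]

def l2_distance(x, y):
    """Edge feature of size d: element-wise squared difference (assumed convention)."""
    return [(a - b) ** 2 for a, b in zip(x, y)]

def dot(x, y):
    """Classifier-free link score."""
    return sum(a * b for a, b in zip(x, y))

def auroc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy 2-dimensional embeddings (illustrative values only).
emb = {0: [1.0, 0.0], 1: [0.9, 0.1], 2: [0.0, 1.0], 3: [0.1, 0.9]}
candidates = [(0, 1), (2, 3), (0, 2), (1, 3)]   # first two are true edges
labels = [1, 1, 0, 0]
scores = [dot(emb[u], emb[v]) for u, v in candidates]
dot_product_auroc = auroc(scores, labels)
features = [hadamard(emb[u], emb[v]) for u, v in candidates]
```

The dot product needs no training but can only express one fixed notion of closeness, whereas the classifier can reweight individual dimensions of the edge feature; this difference is one plausible source of the performance gap between the two strategies.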
Construction of the train and test sets: The method of construction of train and test sets for link prediction task is crucial for comparison of embedding methods. The train and test split consists of 80% and 20% of the edges respectively and are constructed in the following order:

Self-loops are removed.

We randomly select 20% of all edges as positive test edges and add them in the test set.

Positive test edges are removed from the graph. We find the largest weakly connected component formed with the remaining edges. The edges of this connected component form the positive train edges.

We sample negative edges from the largest weakly connected component and add the sampled negative edges to both the training set and test set. The number of negative edges is equal to the number of positive edges in both training and test sets.

For directed graphs, we form “directed negative test edges” $(u, v)$ which satisfy the following constraint: $(v, u) \in E_c$ but $(u, v) \notin E_c$, where $E_c$ refers to the edges in the largest weakly connected component. We add the directed negative test edges to our test set. The number of “directed negative test edges” is around 10% of the negative test edges in the test set.

Nodes present in the test set, but not present in the training set, are deleted from the test set.
In the case of large datasets (more than 500K edges), we reduce our training set: we consider 10% of both the randomly selected positive and negative train edges for learning the binary classifier. The learned model is evaluated on the test set. The above steps are repeated for 5 folds with a train:test split of 80:20, and we report the average AUROC and AUPR scores across the 5 folds.
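The construction steps above can be sketched for the undirected case as follows. This is a simplified illustration, not the evaluation script used in the study: it assumes an undirected, unweighted edge list, skips the directed negative test edges, and uses hypothetical function names.

```python
import random
from collections import defaultdict, deque

def largest_component(edges):
    """Node set of the largest connected component of an undirected edge list."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, best = set(), set()
    for start in adj:
        if start in seen:
            continue
        comp, queue = {start}, deque([start])
        while queue:
            node = queue.popleft()
            for nb in adj[node]:
                if nb not in comp:
                    comp.add(nb)
                    queue.append(nb)
        seen |= comp
        if len(comp) > len(best):
            best = comp
    return best

def sample_negatives(nodes, existing, count, rng):
    """Sample `count` distinct node pairs that are not edges of the original graph."""
    nodes, negatives = sorted(nodes), set()
    while len(negatives) < count:          # assumes enough non-edges exist
        u, v = rng.sample(nodes, 2)
        pair = (min(u, v), max(u, v))
        if pair not in existing:
            negatives.add(pair)
    return sorted(negatives)

def link_prediction_split(edges, test_fraction=0.2, seed=0):
    rng = random.Random(seed)
    # Remove self-loops and store each undirected edge once.
    edges = sorted({(min(u, v), max(u, v)) for u, v in edges if u != v})
    # Hold out a random fraction of the edges as positive test edges.
    rng.shuffle(edges)
    n_test = int(len(edges) * test_fraction)
    test_pos, remaining = edges[:n_test], edges[n_test:]
    # Positive train edges: remaining edges inside the largest component.
    component = largest_component(remaining)
    train_pos = [e for e in remaining if e[0] in component]
    # Drop test edges that touch nodes absent from the training graph.
    test_pos = [e for e in test_pos if e[0] in component and e[1] in component]
    # One negative per positive, sampled among component nodes, then split.
    negatives = sample_negatives(component, set(edges),
                                 len(train_pos) + len(test_pos), rng)
    rng.shuffle(negatives)
    train_neg, test_neg = negatives[:len(train_pos)], negatives[len(train_pos):]
    return train_pos, train_neg, test_pos, test_neg

# Toy graph (illustrative): a 20-node ring with chords, plus one self-loop.
raw_edges = ([(i, (i + 1) % 20) for i in range(20)]
             + [(i, (i + 5) % 20) for i in range(20)] + [(3, 3)])
train_pos, train_neg, test_pos, test_neg = link_prediction_split(raw_edges)
```

Embeddings would then be learned only on the graph induced by `train_pos`, so that the held-out test edges are never seen during training.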
In the network embedding literature, node classification is the most popular way of comparing the quality of the embeddings generated by different embedding methods. The generated node embeddings are treated as node features, and node labels are treated as ground truth. The classification task performed in our experiments is either multi-label or multi-class classification. The details of the classification task performed on each dataset are provided in the dataset table. We select Logistic Regression as our classifier. The hyperparameters of the logistic regression are tuned using GridSearchCV with 5-fold cross validation and ‘f1_micro’ as the scoring metric. We split the dataset with a 50:50 train-test split. The learned model is evaluated on the test set, and we report the results averaged over 10 shuffles of the train-test sets. The model does not have access to test instances while training.
We note that a majority of the efforts in the literature do not tune the hyperparameters of Logistic Regression. Default hyperparameters are not always the best hyperparameters for Logistic Regression. For instance, with default hyperparameters, the Macro-F1 performance of Laplacian Eigenmaps on the Blogcatalog dataset is 3.9% for a train-test split of 50:50. However, tuning the hyperparameters results in a significant improvement of the Macro-F1 score to 29.2%.
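Because both Micro-F1 and Macro-F1 appear in this evaluation, a small worked example may help show why the two can diverge sharply. The sketch below computes both for multi-label predictions represented as sets of labels; the helper name and the toy predictions are illustrative and are not results from this study.

```python
def f1_scores(y_true, y_pred, labels):
    """Return (micro_f1, macro_f1) for multi-label data given as sets of labels."""
    tp = {l: 0 for l in labels}
    fp = {l: 0 for l in labels}
    fn = {l: 0 for l in labels}
    for truth, pred in zip(y_true, y_pred):
        for l in labels:
            if l in pred and l in truth:
                tp[l] += 1
            elif l in pred:
                fp[l] += 1
            elif l in truth:
                fn[l] += 1

    def f1(t, p, n):
        denom = 2 * t + p + n
        return 2 * t / denom if denom else 0.0

    # Micro-F1 pools the counts over labels; Macro-F1 averages per-label F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    macro = sum(f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels)
    return micro, macro

# A classifier that always predicts the frequent label "a".
y_true = [{"a"}, {"a"}, {"a"}, {"b"}]
y_pred = [{"a"}, {"a"}, {"a"}, {"a"}]
micro_f1, macro_f1 = f1_scores(y_true, y_pred, labels=["a", "b"])
```

Here Micro-F1 is 0.75 while Macro-F1 is only about 0.43, because the rare label contributes an F1 of zero to the macro average. A poorly regularized classifier that ignores rare labels is therefore penalized far more heavily under Macro-F1, which may partly explain why that metric is so sensitive to tuning.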
The choice of a “linear” classifier to evaluate the quality of embeddings is not a hard constraint in the node classification task. In this work, we also test the idea of leveraging a “non-linear” classifier for the node classification task and use the EigenPro [?] classifier for this purpose. On large datasets, EigenPro provides a significant performance boost over state-of-the-art kernel methods, with faster convergence rates [?]. In the experiments, we see a benefit to this approach: up to 15% improvement in Micro-F1 scores with the non-linear classifier compared to the linear classifier.
Next, we present heuristic baselines for both the link prediction and node classification tasks. The purpose of defining heuristic baselines is to assess the difficulty of performing a particular task on a particular dataset, and also to compare the performance of sophisticated network embedding methods against simple heuristics.
In the link prediction literature, there exist multiple similarity-based metrics [?] which can predict a score for link formation between two nodes. Examples of such metrics include the Jaccard Index [?, ?] and Adamic Adar [?]. These similarity-based metrics often base their predictions on the neighborhood overlap between the nodes. We combine the similarity-based metrics to form a curated feature vector of an edge [?]. The binary classifier in the link prediction task is then trained on the generated edge embeddings. Our selected similarity-based metrics are Common Neighbors (CN), Adamic Adar (AA) [?], Jaccard Index (JA) [?], Resource Allocation Index (RA) [?] and Preferential Attachment Index (PA) [?]. The similarity-based metrics CN, JA, and PA capture first-order proximity between nodes, while the metrics AA and RA capture second-order proximity between nodes. We found this heuristic-based model to be highly competitive with embedding methods on multiple datasets.
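A minimal sketch of the curated edge feature vector is given below, using the standard textbook definitions of the five metrics on an undirected, unweighted graph stored as a dictionary of neighbor sets. The function name and feature order are our own choices for illustration; the feature vector would then be fed to the same binary classifier used for embedding-based edge features.

```python
import math

def heuristic_edge_features(adj, u, v):
    """Return [CN, AA, JA, RA, PA] for the candidate edge (u, v)."""
    common = adj[u] & adj[v]
    union = adj[u] | adj[v]
    cn = len(common)                                        # Common Neighbors
    # A common neighbor is adjacent to both u and v, so its degree is at least 2
    # and log(degree) is strictly positive.
    aa = sum(1.0 / math.log(len(adj[w])) for w in common)   # Adamic Adar
    ja = cn / len(union) if union else 0.0                  # Jaccard Index
    ra = sum(1.0 / len(adj[w]) for w in common)             # Resource Allocation
    pa = len(adj[u]) * len(adj[v])                          # Preferential Attachment
    return [cn, aa, ja, ra, pa]

# Toy graph (illustrative): a 4-cycle 0-1-2-3 with the chord (0, 2).
graph = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1, 3}, 3: {0, 2}}
features_1_3 = heuristic_edge_features(graph, 1, 3)
```

For the non-adjacent pair (1, 3), both neighbors are shared, so the Jaccard Index is 1 and the pair would be scored as a likely link.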
Nodes in the graph can be characterized by their properties. We combine the node properties to form a feature vector (embedding) of a node. The classifier in the node classification task is then trained on the generated node embeddings. The node properties capture information such as a node's neighborhood, its influence on other nodes, and its structural properties. We select the following node properties: Degree, PageRank, Clustering Coefficient, Hub and Authority scores, Eigenvector centrality, Average Neighbor Degree, Eccentricity, Betweenness centrality, Closeness centrality, and Fairness centrality [?, ?]. We treat the graph as undirected while computing the node properties. As the magnitudes of the node properties differ from one another, we perform column-wise normalization with the RobustScaler available in scikit-learn. We will show in the experiments section that the node classification heuristic baseline is competitive with most of the embedding methods on datasets with fewer labels.
In this section, we present two measures for comparing the performance of embedding methods in the downstream machine learning task.
Mean Rank: We compute the rank of all the embedding methods on each dataset based on the selected performance metric, and report the average rank of an embedding method across all datasets as the Mean Rank of the embedding method. Let $R_{a,d}$ be the rank of embedding method $a$ on dataset $d$, with $\mathcal{D}$ being the set of datasets; then the mean rank of embedding method $a$ is given by
$$\operatorname{MeanRank}(a) = \frac{1}{|\mathcal{D}|} \sum_{d \in \mathcal{D}} R_{a,d} \tag{11}$$
Mean Penalty [?]: We define the penalty of an embedding method $a$ on a dataset $d$ as the difference between the best score achieved by any embedding method on dataset $d$ and the score achieved by embedding method $a$ on the same dataset $d$. The score is the selected performance metric for a particular downstream ML task. Let $\mathcal{A}$ be the set of embedding methods and $S_{a,d}$ be the score achieved by embedding method $a$ on dataset $d$; then the Mean Penalty is given by
$$\operatorname{MeanPenalty}(a) = \frac{1}{|\mathcal{D}|} \sum_{d \in \mathcal{D}} \Big( \max_{a' \in \mathcal{A}} S_{a',d} - S_{a,d} \Big) \tag{12}$$
For a model, lower values of Mean Rank and Mean Penalty suggest better performance. We compare the embedding methods with the Mean Rank and Mean Penalty measures on the datasets where all the embedding methods are scalable. Though the measures do not consider the dataset size or missing values, they are simple and intuitive.
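Both measures follow directly from a method-by-dataset score table, as the sketch below illustrates. The scores are made-up numbers chosen for readability, not results from this study, and ties are broken by the order of the methods, which is an assumption on our part.

```python
def mean_rank_and_penalty(scores):
    """scores[method][dataset] holds a higher-is-better metric.

    Returns (mean_rank, mean_penalty), two dictionaries keyed by method.
    """
    methods = list(scores)
    datasets = list(next(iter(scores.values())))
    rank_sum = {m: 0.0 for m in methods}
    penalty_sum = {m: 0.0 for m in methods}
    for d in datasets:
        ordered = sorted(methods, key=lambda m: scores[m][d], reverse=True)
        best = scores[ordered[0]][d]
        for position, m in enumerate(ordered, start=1):
            rank_sum[m] += position                 # rank 1 is the best method
            penalty_sum[m] += best - scores[m][d]   # gap to the best score
    n = len(datasets)
    mean_rank = {m: rank_sum[m] / n for m in methods}
    mean_penalty = {m: penalty_sum[m] / n for m in methods}
    return mean_rank, mean_penalty

# Hypothetical scores for three methods on two datasets.
toy_scores = {
    "A": {"d1": 0.9, "d2": 0.7},
    "B": {"d1": 0.8, "d2": 0.8},
    "C": {"d1": 0.6, "d2": 0.5},
}
mean_rank, mean_penalty = mean_rank_and_penalty(toy_scores)
```

In this toy example methods A and B share a mean rank of 1.5 and a mean penalty of 0.05, while C has a mean rank of 3 and a mean penalty of 0.3. Mean Rank records only the ordering, whereas Mean Penalty also reflects how far a method falls behind the best one.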
In this section, we report the performance of network embedding methods on the link prediction and node classification tasks. We tune both the parameters of the embedding methods and the parameters of the classifiers used in link prediction and node classification. Whenever possible, we rely on the authors’ implementation of the embedding method. All the methods that do not scale to large datasets are executed on a modern machine with 500 GB RAM and 28 cores. All the evaluation scripts are executed in the same Python virtual environment.
The link prediction performance of 12 embedding methods, measured in terms of AUROC and AUPR on 15 datasets, is shown in the corresponding AUROC and AUPR figures. The Overall (or aggregate) performance of an embedding method on all the datasets is also shown at the end of the horizontal bar of each method in those figures. We represent the Overall score as the sum of the scores (AUROC or AUPR) of the method on all the datasets. The Mean Rank and Mean Penalty of the embedding methods, on datasets for which all methods run to completion on our system, are shown in a separate figure. We also provide the tabulated results in the AUROC and AUPR tables. As mentioned in Section 2.1, we tune the hyperparameters of each embedding method and report the best average AUROC and AUPR scores across 5 folds. In the case of the WebKB datasets, we evaluate the methods on embedding dimensions 64 and 128. We perform the link prediction task with both normalized and unnormalized embeddings and report the best performance.
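For readers less familiar with the metric, the sketch below computes AUROC directly from its pairwise definition (the probability that a randomly chosen true edge receives a higher score than a randomly chosen non-edge, with ties counted as one half) and averages it over folds. It is a plain-Python illustration under stated assumptions: the function names and the fold scores are hypothetical, and the quadratic loop is meant for clarity rather than for large edge sets.

```python
from statistics import mean


def auroc(pos_scores, neg_scores):
    """Pairwise AUROC: P(score of a true edge > score of a non-edge)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos_scores) * len(neg_scores))


def average_auroc(folds):
    """Average AUROC over folds; each fold is (pos_scores, neg_scores)."""
    return mean(auroc(pos, neg) for pos, neg in folds)


# Hypothetical scores for two folds (the study itself uses 5 folds).
folds = [
    ([0.9, 0.8, 0.4], [0.5, 0.3]),
    ([0.7, 0.6], [0.2, 0.1, 0.65]),
]
avg = average_auroc(folds)
```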
We make the following observations:

Effectiveness of MNMF for Link Prediction: We observe that MNMF achieves the highest overall link prediction performance in terms of best average AUROC and AUPR scores as compared to other methods. The competitive performance of MNMF on link prediction could be credited to the community information incorporated into the generated node embeddings. The Mean Rank and Mean Penalty are lowest for MNMF, which also suggests MNMF as a competitive baseline for link prediction. MNMF achieves the first rank on 7 out of 15 datasets. The small value of Mean Penalty suggests that even when MNMF is not the top-ranked method for a particular dataset, its performance is close to that of the top-ranked method on that dataset. However, MNMF does not outperform other methods on all the datasets. For instance, on the WikiVote and Pubmed datasets, WYS achieves the best average AUROC scores, while on the Microsoft dataset, GraRep achieves the best average AUROC score. In panel (b) of the AUROC and AUPR figures, we see that among the more scalable methods, LINE achieves the highest overall link prediction performance, followed by DeepWalk, in terms of both AUROC and AUPR scores. Note that MNMF did not scale for the datasets with 5M edges on a modern machine with 500 GB RAM and 28 cores. However, the scalability issue of nonnegative matrix factorization based methods can be addressed by adopting modern ideas [?, ?] (outside the scope of this study).

Performance of Heuristic Baseline: We observe that the Link Prediction Heuristics baseline, described earlier, is both efficient and effective. The overall performance of the Link Prediction Heuristics baseline is better than that of Laplacian Eigenmaps and SDNE and competitive with that of Node2Vec, HOPE, Verse, and LINE. The Mean Penalty of Link Prediction Heuristics is also close to that of other embedding methods. On the largest dataset, YouTube, Link Prediction Heuristics achieves an AUROC of 96.2%, which is close to the 97.6% of the best-performing method, Verse. Compared with the most competitive baseline, MNMF, the heuristics baseline performs better on the Wikipedia and Blogcatalog datasets. We also observe that the heuristics baseline is competitive with several methods on directed datasets, even though the similarity-based metrics chosen for the heuristics baseline treat the underlying graph as undirected.

Impact of Evaluation Strategy: As described earlier, the presence of a link between two nodes can be predicted with either a Logistic Regression classifier (treating the embeddings as features) or the dot product between the node embeddings. We compare the two evaluation strategies for each embedding method over all datasets using the differences in the average AUROC scores. A positive difference implies that link prediction performance with the classifier is better than that with the dot product. The results are presented as a box plot in the corresponding figure. A paired t-test suggests that the positive difference is statistically significant at a significance level of 0.05 for all methods except Verse and WYS. Hence, for most methods, the use of a classifier over the dot product provides a significant gain in predictive performance on the link prediction task. We also investigate the changes in the ranking of embedding methods based on overall average AUROC scores when predictions are performed with the classifier rather than the dot product. The methods were ranked based on overall average AUROC scores, and we considered only those datasets on which all methods were scalable. We observed that NetMF was ranked 10th with the dot product, while its rank improved to 3rd with the classifier. Since the best link prediction performance for the majority of the embedding methods was achieved with the classifier, we believe that claims of superiority of an embedding method on the link prediction task should be based on evaluation with the classifier.
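The paired comparison above can be sketched as follows. The example computes the paired t statistic for one method from per-dataset AUROC scores obtained with the two strategies and compares it against the tabulated two-sided critical value for 4 degrees of freedom at the 0.05 level (2.776). The AUROC values are hypothetical, the function name is an assumption, and treating the test as two-sided is also an assumption, since the text does not state the sidedness.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t_statistic(a, b):
    """Paired t statistic and degrees of freedom for matched samples a, b."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    # Standard error uses the sample standard deviation of the differences.
    t = mean(diffs) / (stdev(diffs) / sqrt(n))
    return t, n - 1


# Hypothetical AUROC of one embedding method on five datasets.
auroc_classifier = [0.91, 0.88, 0.93, 0.90, 0.89]
auroc_dot_product = [0.85, 0.86, 0.88, 0.84, 0.86]

t_stat, dof = paired_t_statistic(auroc_classifier, auroc_dot_product)

# Two-sided critical value of Student's t for dof = 4 at alpha = 0.05.
T_CRITICAL_DOF4 = 2.776
significant = abs(t_stat) > T_CRITICAL_DOF4
```

With these made-up scores the classifier is better on every dataset and the difference exceeds the critical value, so the sketch would report the gain as significant.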
As mentioned earlier, we leverage three binary functions, Hadamard, Concatenation, and L2, to generate the edge embedding. In the corresponding figure, we show which binary function achieved the best average AUROC score for each embedding method on each dataset. We see that the Hadamard function achieves the largest number of best average AUROC scores. However, there is no single winner among the binary functions.
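A minimal sketch of the three binary functions is given below. The Hadamard and Concatenation operators follow directly from their names; for L2 we assume the element-wise squared difference commonly used for edge features in the node2vec line of work, which may differ in detail from the definition used in our scripts. All function names and the toy vectors are illustrative.

```python
def hadamard(u, v):
    """Element-wise product of the two node embeddings."""
    return [a * b for a, b in zip(u, v)]


def concatenation(u, v):
    """Node embeddings placed side by side (doubles the dimension)."""
    return list(u) + list(v)


def l2(u, v):
    """Element-wise squared difference (assumed definition of the L2 operator)."""
    return [(a - b) ** 2 for a, b in zip(u, v)]


# Hypothetical 2-dimensional embeddings of the two endpoints of an edge.
emb_u = [1.0, 2.0]
emb_v = [3.0, 5.0]

edge_features = {
    "hadamard": hadamard(emb_u, emb_v),
    "concatenation": concatenation(emb_u, emb_v),
    "l2": l2(emb_u, emb_v),
}
```

Each resulting vector serves as the feature vector of the candidate edge that is passed to the Logistic Regression classifier.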

Impact of context embeddings: We study the impact of context embeddings on directed datasets for the link prediction task. For this study, we consider only those embedding methods that generate both node and context embeddings. We compare using node + context embeddings against using only node embeddings in terms of the differences in AUROC scores. The results are detailed in the corresponding figure. A positive difference implies that the use of context embeddings helps link prediction. We see that leveraging node + context embeddings improves the link prediction performance of LINE, HOPE, and WYS. For MNMF, the use of context embeddings does not improve link prediction performance because the community information, which is crucial for link prediction, is already incorporated in the node embeddings. In the case of GraRep, we find that the node embeddings encapsulate high-order information and, hence, leveraging context does not improve performance. We find that the results of DeepWalk and Node2Vec on directed datasets are significantly lower, so to have a fair comparison with other embedding methods, we treated the directed datasets as undirected for DeepWalk and Node2Vec. The median of the box plot for DeepWalk and Node2Vec is close to zero due to this treatment.
The node classification performance of 12 embedding methods, measured in terms of Micro-F1 scores on 15 datasets with a train-test split of 50:50, is reported in the corresponding figures. The overall performance of an embedding method on all the datasets is shown at the end of the horizontal bar of each method in those figures and represents the sum of the scores (Micro-F1) of the method on the datasets. The Mean Rank and Mean Penalty of the embedding methods, on datasets where all methods are scalable, are shown in a separate figure. We tune the hyperparameters of each embedding method, mentioned in Section 2.1, and report the best Micro-F1 score. In the case of the WebKB datasets, we evaluate the methods on generated embeddings with dimensions 64 and 128. We perform node classification with both normalized and unnormalized embeddings and report the best performance. We also provide tabulated results in the accompanying tables.
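Since both Micro-F1 and Macro-F1 appear in our results, the following sketch recalls how the two differ. It implements the standard definitions for (possibly multi-label) predictions in plain Python: Micro-F1 pools true positives, false positives, and false negatives over all labels, whereas Macro-F1 averages the per-label F1 scores. The function names, the convention of scoring an empty label as 0, and the toy labels are assumptions made for this example.

```python
def f1(tp, fp, fn):
    """F1 from counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(true_labels, pred_labels):
    """true_labels and pred_labels are lists of label sets, one set per node."""
    labels = sorted(set().union(*true_labels, *pred_labels))
    counts = {label: [0, 0, 0] for label in labels}  # [tp, fp, fn]
    for truth, pred in zip(true_labels, pred_labels):
        for label in labels:
            if label in pred and label in truth:
                counts[label][0] += 1
            elif label in pred:
                counts[label][1] += 1
            elif label in truth:
                counts[label][2] += 1
    # Micro: pool the counts over labels before computing F1.
    micro = f1(*(sum(c[i] for c in counts.values()) for i in range(3)))
    # Macro: compute F1 per label, then take the unweighted mean.
    macro = sum(f1(*c) for c in counts.values()) / len(labels)
    return micro, macro


# Hypothetical predictions: the rare label "b" is never predicted.
truth = [{"a"}, {"a"}, {"a"}, {"b"}]
pred = [{"a"}, {"a"}, {"a"}, {"a"}]
micro, macro = micro_macro_f1(truth, pred)
```

In this toy case Micro-F1 is 0.75 while Macro-F1 drops to about 0.43, which illustrates why a method that misses infrequent labels can have a much lower Macro-F1 than Micro-F1.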
We make the following observations.

Effectiveness of NetMF for node classification: We observe that NetMF achieves the highest overall node classification performance in terms of best Micro-F1 scores using both linear and nonlinear classifiers. The Mean Rank and Mean Penalty are lowest for NetMF, which suggests that NetMF is the strongest overall method for node classification. The low Mean Rank indicates that NetMF is among the top-ranked methods on the evaluated datasets. The smallest value of Mean Penalty suggests that even when NetMF is not the top-ranked method for a particular dataset, its performance is close to that of the top-ranked method for that dataset. However, it does not outperform other methods on all the datasets. LINE, DeepWalk, and Node2Vec are also competitive baselines for node classification, as their overall performance is closest to that of NetMF. The performance of GraRep on datasets with more labels is comparable with that of other methods when we exclude the Flickr dataset. Note, however, that the reported results for GraRep on the Flickr dataset are with embeddings of dimension 64; embedding dimensions 128 and 256 for GraRep resulted in memory errors on a modern machine with 500 GB RAM and 28 cores. Note also that NetMF did not scale to the YouTube dataset. While scalability is currently outside the scope of our study, the scalability of such methods is under active development (we refer the interested reader elsewhere [?, ?]).

Laplacian Eigenmaps Performance: We observe that the Laplacian Eigenmaps method achieves competitive Micro-F1 scores on several datasets. For instance, on the Blogcatalog dataset with 39 labels, Laplacian Eigenmaps achieves the best Micro-F1 score of 42.1%, while on the Pubmed dataset, it outperforms all other embedding methods with 81.7% Micro-F1. With a nonlinear classifier, Laplacian Eigenmaps achieves the second-best performance on the PPI dataset with 23.8% Micro-F1. The Mean Penalty of Laplacian Eigenmaps is also close to that of other embedding methods, namely Verse, MNMF, and VAG. On the PPI and Flickr datasets, the Micro-F1 of the Laplacian Eigenmaps baseline is close to the best Micro-F1. The results we observe for Laplacian Eigenmaps on the evaluated datasets are better than previously reported results [?, ?] for both node classification and link prediction tasks. This improvement in the performance of Laplacian Eigenmaps can be traced directly to tuning the hyperparameters of the logistic regression classifier.

Node Classification Heuristic: We observe from panel (a) of the corresponding figures that the node classification heuristic baseline is competitive with other embedding methods on datasets with fewer labels (up to 5 labels), as its overall score is better than that of many of the methods. Except on the Pubmed dataset, the performance of the node heuristics baseline is comparable to that of NetMF. However, as the number of labels in the datasets increases (more than 5 labels), we observe that the Micro-F1 scores of the node heuristics baseline decrease drastically. This decrease in overall performance suggests that the node heuristics features lack the discriminative power to separate many labels.

Context embeddings can improve performance: As we can see from the corresponding figure, leveraging both the node and context embeddings of Skip-gram based models results in significant improvement (up to 25%) for most of the methods. On the directed Pubmed dataset, we observed that the node classification performance of embedding methods like LINE (2nd order), HOPE, and WYS was significantly lower than that of other methods. The Micro-F1 scores of the embedding methods are shown in the accompanying figure. We found that around 80% of the nodes in the Pubmed dataset are sink nodes. As a result, when the embedding methods based on the Skip-gram model generate the node embeddings, the sink nodes are always treated as “context” nodes and never as “source” nodes. Hence, the node embeddings of sink nodes are of lower quality. To have a fair comparison, we concatenate the node and context embeddings of the methods (whenever possible) and evaluate the performance on the concatenated embeddings.
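The two steps discussed above, measuring how many nodes are sinks and concatenating node and context embeddings, can be sketched as follows. A sink is taken here to be a node with no outgoing edge. The function names, the toy edge list, and the toy embeddings are hypothetical and serve only to illustrate the idea.

```python
def sink_fraction(nodes, directed_edges):
    """Fraction of nodes that never appear as the source of an edge."""
    sources = {u for u, _ in directed_edges}
    return sum(1 for n in nodes if n not in sources) / len(nodes)


def concat_node_and_context(node_emb, context_emb):
    """Concatenate each node's embedding with its context embedding."""
    return {n: list(node_emb[n]) + list(context_emb[n]) for n in node_emb}


# Hypothetical directed graph in which nodes 1, 2 and 4 only receive edges.
nodes = [0, 1, 2, 3, 4]
edges = [(0, 1), (0, 2), (3, 1), (3, 4)]
frac = sink_fraction(nodes, edges)

# Hypothetical 2-dimensional node and context embeddings.
node_emb = {n: [0.1 * n, 0.0] for n in nodes}
context_emb = {n: [0.0, 0.2 * n] for n in nodes}
combined = concat_node_and_context(node_emb, context_emb)
```

A sink node is trained only in the context role, so its context vector carries most of its signal; concatenation keeps that information available to the downstream classifier.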

Impact of nonlinear classifier: We study the impact of the nonlinear classifier on node classification performance. The comparison results are shown in the corresponding box plot. The box plot represents the distribution of differences between the Micro-F1 scores computed with the nonlinear classifier (EigenPro [?]) and the linear classifier (Logistic Regression). A positive difference implies that the results with the nonlinear classifier are better than those with the linear classifier. For Verse, we see a 15% absolute increase with the use of the nonlinear classifier on the Pubmed dataset. The positive difference is statistically significant (paired t-test, significance level 0.05) for DeepWalk, Verse, SDNE, GraRep, and MNMF. It is worth pointing out that on the smaller datasets this gain is less evident, while on the larger datasets (more training data) the benefits of using a nonlinear classifier are much clearer.
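Table: Link prediction performance (AUROC, %). A "-" marks an entry with no reported result.

| Datasets | Heuristics | LapEig | DeepWalk | Node2Vec | Verse | LINE | SDNE | NetMF | GraRep | HOPE | MNMF | VAG | WYS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| WTexas | 77.7 | 73.1 | 79.7 | 83.0 | 81.7 | 78.2 | 82.4 | 80.9 | 78.7 | 78.7 | 96.0 | 78.6 | 83.0 |
| WCornell | 81.5 | 77.3 | 79.2 | 82.0 | 87.0 | 77.5 | 80.2 | 81.2 | 84.4 | 79.9 | 96.7 | 74.8 | 84.4 |
| WWashington | 75.3 | 70.1 | 75.3 | 75.0 | 82.2 | 72.9 | 76.7 | 75.5 | 78.1 | 72.8 | 97.5 | 73.9 | 79.2 |
| WWisconsin | 79.4 | 71.4 | 80.7 | 78.0 | 88.2 | 72.5 | 76.9 | 82.3 | 84.3 | 75.5 | 98.9 | 73.9 | 84.5 |
| PPI | 90.9 | 78.2 | 89.1 | 88.3 | 89.6 | 87.8 | 89.3 | 87.3 | 90.0 | 88.4 | 96.9 | 87.4 | 91.5 |
| Wikipedia | 91.6 | 77.9 | 90.9 | 90.9 | 91.3 | 91.2 | 50.0 | 91.4 | 92.3 | 90.4 | 88.4 | 89.5 | 92.3 |
| WikiVote | 91.5 | 83.5 | 97.4 | 94.6 | 94.9 | 96.6 | 96.6 | 95.5 | 88.4 | 97.8 | 92.2 | 94.3 | 98.2 |
| Blogcatalog | 95.2 | 77.4 | 94.3 | 95.0 | 97.3 | 95.2 | 95.6 | 95.1 | 96.2 | 95.3 | 94.0 | 94.8 | 96.0 |
| DBLP (CoAuthor) | 95.6 | 93.3 | 96.0 | 95.4 | 97.9 | 94.3 | 50.0 | 95.9 | 97.1 | 89.6 | 99.4 | 94.1 | 96.8 |
| Pubmed | 87.7 | 89.6 | 89.1 | 87.6 | 96.6 | 92.7 | 88.9 | 89.8 | 77.7 | 90.1 | 94.3 | 93.6 | 97.0 |
| CoCit (Microsoft) | 89.5 | 95.6 | 97.6 | 93.6 | 83.7 | 97.2 | 91.9 | 96.9 | 97.9 | 94.5 | 96.6 | 96.2 | - |
| P2P | 83.8 | 69.9 | 88.2 | 83.3 | 77.6 | 91.2 | 83.9 | 87.5 | 71.8 | 88.6 | 92.3 | - | - |
| Flickr | 92.4 | 93.0 | 95.8 | 90.6 | 72.6 | 95.2 | 93.0 | 97.2 | 95.5 | 96.5 | - | - | - |
| Epinions | 92.2 | 90.9 | 93.3 | 90.2 | 91.9 | 91.6 | 92.7 | 92.8 | 93.7 | 92.7 | - | - | - |
| YouTube | 96.2 | 96.0 | 93.6 | 90.8 | 97.6 | 96.5 | - | - | 91.4 | 92.4 | - | - | - |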
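Table: Link prediction performance (AUPR, %). A "-" marks an entry with no reported result.

| Datasets | Heuristics | LapEig | DeepWalk | Node2Vec | Verse | LINE | SDNE | NetMF | GraRep | HOPE | MNMF | VAG | WYS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| WTexas | 81.8 | 78.0 | 81.9 | 85.0 | 85.0 | 82.2 | 85.5 | 83.6 | 82.1 | 82.9 | 96.0 | 80.6 | 85.2 |
| WCornell | 81.9 | 78.9 | 79.8 | 81.0 | 87.8 | 76.7 | 79.5 | 82.0 | 86.2 | 79.8 | 96.7 | 75.6 | 86.6 |
| WWashington | 80.3 | 75.0 | 76.5 | 78.0 | 86.5 | 75.5 | 81.5 | 80.4 | 82.9 | 77.2 | 97.7 | 78.7 | 83.4 |
| WWisconsin | 82.3 | 74.8 | 81.6 | 79.0 | 90.7 | 76.4 | 80.7 | 85.2 | 87.4 | 78.8 | 98.5 | 78.1 | 86.7 |
| PPI | 91.4 | 80.7 | 90.4 | 89.5 | 90.7 | 88.1 | 90.2 | 87.9 | 90.8 | 89.2 | 96.5 | 88.1 | 92.2 |
| Wikipedia | 93.0 | 76.0 | 92.5 | 92.3 | 92.8 | 92.8 | 75.0 | 92.8 | 93.1 | 91.8 | 89.9 | 91.4 | 93.5 |
| WikiVote | 87.9 | 82.1 | 96.9 | 93.4 | 94.8 | 95.3 | 96.3 | 93.8 | 84.0 | 96.8 | 87.6 | 94.3 | 97.4 |
| Blogcatalog | 95.1 | 77.5 | 94.3 | 94.8 | 97.9 | 95.1 | 95.5 | 94.8 | 96.0 | 95.0 | 93.6 | 94.6 | 96.1 |
| DBLP (CoAuthor) | 96.7 | 93.8 | 96.8 | 96.1 | 98.2 | 95.6 | 75.0 | 96.7 | 97.4 | 90.9 | 99.2 | 95.2 | 97.3 |
| Pubmed | 85.0 | 85.0 | 81.5 | 75.9 | 96.8 | 90.3 | 90.5 | 86.1 | 74.1 | 91.4 | 90.3 | 95.2 | 96.9 |
| CoCit (Microsoft) | 91.9 | 95.5 | 97.9 | 95.0 | 76.4 | 97.7 | 93.5 | 97.1 | 97.9 | 95.3 | 95.2 | 96.4 | - |
| P2P | 79.3 | 68.1 | 84.3 | 78.9 | 68.3 | 88.6 | 80.8 | 84.0 | 71.3 | 85.8 | 89.6 | - | - |
| Flickr | 92.5 | 95.0 | 96.1 | 91.4 | 70.9 | 95.5 | 94.0 | 97.6 | 95.7 | 96.7 | - | - | - |
| Epinions | 89.2 | 89.5 | 91.7 | 87.1 | 88.6 | 88.8 | 91.6 | 91.8 | 93.0 | 91.5 | - | - | - |
| YouTube | 96.7 | 96.7 | 95.0 | 90.8 | 98.2 | 97.0 | - | - | 92.2 | 94.0 | - | - | - |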
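Table: Node classification performance (Micro-F1, %). A "-" marks an entry with no reported result.

| Datasets | Heuristics | LapEig | DeepWalk | Node2Vec | Verse | LINE | SDNE | NetMF | GraRep | HOPE | MNMF | VAG | WYS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| WTexas | 63.8 | 54.6 | 55.1 | 57.2 | 54.5 | 61.8 | 58.0 | 67.1 | 56.1 | 59.1 | 58.0 | 54.8 | 60.6 |
| WCornell | 43.4 | 30.9 | 40.5 | 34.3 | 35.4 | 44.1 | 48.5 | 48.1 | 40.9 | 41.9 | 36.1 | 40.3 | 41.8 |
| WWashington | 63.3 | 43.3 | 56.0 | 58.4 | 51.5 | 65.3 | 60.5 | 61.1 | 46.7 | 62.8 | 60.4 | 59.0 | 65.3 |
| WWisconsin | 53.3 | 41.5 | 52.3 | 45.6 | 41.7 | 52.9 | 51.3 | 56.7 | 50.8 | 51.5 | 53.9 | 48.1 | 53.2 |
| PPI | 5.3 | 22.3 | 21.4 | 21.0 | 19.7 | 19.9 | 17.4 | 21.3 | 20.4 | 18.8 | 18.6 | 19.2 | 22.6 |
| Wikipedia | 41.1 | 46.3 | 50.0 | 51.4 | 43.8 | 56.3 | 52.4 | 58.4 | 58.8 | 57.9 | 48.1 | 41.1 | 44.4 |
| Blogcatalog | 16.9 | 42.1 | 41.5 | 41.7 | 35.5 | 38.6 | 29.5 | 41.7 | 41.3 | 34.4 | 21.6 | 17.1 | 38.9 |
| DBLP (CoAuthor) | 37.4 | 37.1 | 35.9 | 35.6 | 37.2 | 37.0 | 37.4 | 36.6 | 35.7 | 36.0 | 36.2 | 36.2 | - |
| Pubmed | 59.8 | 81.7 | 81.5 | 81.1 | 63.0 | 64.4 | 67.7 | 80.0 | 79.1 | 74.7 | 77.1 | 63.5 | 73.6 |
| CoCit (Microsoft) | 28.9 | 43.0 | 46.3 | 46.7 | 46.0 | 46.5 | 38.1 | 43.5 | 46.6 | 44.6 | 44.7 | 43.5 | - |
| Flickr | 19.1 | 34.0 | 35.6 | 35.1 | 30.1 | 33.4 | 30.8 | 34.2 | 10.5 | 28.7 | - | - | - |
| YouTube | 24.4 | - | 40.7 | 40.3 | 38.5 | 40.3 | - | - | 38.0 | 38.7 | - | - | - |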
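Table: Node classification performance (Macro-F1, %). A "-" marks an entry with no reported result.

| Datasets | Heuristics | LapEig | DeepWalk | Node2Vec | Verse | LINE | SDNE | NetMF | GraRep | HOPE | MNMF | VAG | WYS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| WTexas | 43.8 | 18.1 | 26.9 | 22.7 | 18.1 | 36.4 | 39.5 | 49.9 | 25.0 | 34.4 | 23.8 | 27.8 | 40.3 |
| WCornell | 23.1 | 21.3 | 28.1 | 23.2 | 22.4 | 25.0 | 13.1 | 32.8 | 27.7 | 28.7 | 23.5 | 20.6 | 26.5 |
| WWashington | 32.7 | 22.2 | 24.3 | 27.3 | 23.1 | 30.6 | 29.1 | 29.7 | 28.7 | 30.2 | 28.6 | 26.3 | 31.1 |
| WWisconsin | 26.6 | 29.0 | 31.9 | 23.9 | 21.9 | 27.8 | 26.1 | 34.8 | 34.7 | 28.3 | 33.8 | 25.5 | 33.9 |
| PPI | 2.2 | 17.9 | 18.1 | 18.0 | 16.5 | 16.9 | 15.2 | 17.5 | 17.4 | 15.9 | 15.9 | 13.1 | 17.9 |
| Wikipedia | 5.3 | 10.4 | 11.9 | 12.9 | 8.2 | 18.2 | 14.1 | 18.4 | 18.3 | 20.1 | 11.0 | 3.8 | 10.1 |
| Blogcatalog | 3.4 | 29.2 | 27.3 | 27.9 | 22.1 | 23.6 | 14.8 | 28.8 | 28.9 | 20.8 | 8.2 | 3.1 | 26.3 |
| DBLP (CoAuthor) | 19.6 | 20.1 | 30.0 | 29.4 | 20.6 | 19.2 | 21.1 | 30.0 | 30.5 | 28.6 | 26.6 | 27.8 | - |
| Pubmed | 52.5 | 80.2 | 80.1 | 79.8 | 58.0 | 61.3 | 63.3 | 78.4 | 77.6 | 73.0 | 75.4 | 60.6 | 71.6 |
| CoCit (Microsoft) | 16.7 | 27.3 | 34.3 | 34.2 | 33.3 | 33.8 | 27.8 | 34.0 | 34.8 | 32.8 | 30.4 | 29.2 | - |
| Flickr | 1.8 | 20.4 | 21.2 | 20.7 | 17.6 | 18.2 | 14.9 | 20.2 | 0.9 | 11.4 | - | - | - |
| YouTube | 9.3 | - | 34.7 | 34.0 | 32.1 | 33.1 | - | - | 30.0 | 30.8 | - | - | - |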
Network representation learning has attracted a lot of attention in the past few years. An interested reader can refer to the surveys of network embedding methods [?, ?, ?]. These surveys focus on categorizing the embedding methods based on either an encoder-decoder framework [?] or a novel taxonomy [?, ?] but do not provide an experimental comparison of the embedding methods. There does exist one other experimental survey of network embedding methods [?]. However, there are key differences. First, we present a systematic study on a larger set of embedding methods, including several more recent ideas, and on many more datasets (15 vs 7). Specifically, we evaluate 12 embedding methods + 2 efficient heuristics on 15 datasets. Second, there are several key differences in terms of the results reported and reproducibility. In our work, we carefully tune all hyperparameters of each method as well as the logistic classifier (and include this information in our reproducibility notes). As a concrete example of where such careful tuning can make a difference, consider that on Blogcatalog with a train-test split of 50:50, Goyal et al. achieve a Macro-F1 score of 3.9%, while by tuning the hyperparameters of logistic regression we achieve a Macro-F1 score of 29.2%. Third, our analysis reveals several important insights on the role of context, the role of different link prediction evaluation strategies (dot product vs classifier), the impact of nonlinear classifiers, and many others. All of these provide useful insights for end users as well as guidance for future research and evaluation in network representation learning and downstream applications. Fourth, we also provide a comparison against simple but effective task-specific baseline heuristics, which will serve as useful strawman methods for future work in these areas.
To conclude, we identify several issues in the current literature: the lack of a standard assessment protocol, the use of default parameters for baselines, the lack of a standard benchmark, and the neglect of task-specific baselines. We then addressed these issues and made the following observations:

MNMF and NetMF are the most competitive baselines for the link prediction and node classification tasks, respectively.

No single method completely outperforms the other methods on both link prediction and node classification tasks.

If one considers Laplacian Eigenmaps as a baseline, the classifier parameters should be tuned appropriately.

The Link Prediction Heuristics baseline is efficient in general, while the Node Heuristics baseline is efficient on datasets with fewer labels.

For both tasks, some methods are impervious to the use of context, whereas for others context helps significantly.

In the link prediction task, the evaluation strategy with a classifier provides a statistically significant predictive gain over that with the dot product for most of the embedding methods.
In addition to the reported experiments, we conducted experiments varying the train:test splits and studied the impact of embedding dimension and of embedding normalization on the downstream machine learning tasks. A detailed analysis of the additional experiments, along with a link to our evaluation scripts, will soon be made available on arXiv. We hope the insights put forward in this study are helpful to the community and encourage the comparison of novel embedding methods with the task-specific competitive methods and the proposed task-specific heuristics.
 1 Microsoft academic graph  kdd cup 2016. https://kddcup2016.azurewebsites.net/Data, 2016.
 2 S. Abu-El-Haija, B. Perozzi, R. Al-Rfou, and A. Alemi. Watch your step: Learning node embeddings via graph attention. In Neural Information Processing Systems, 2018.
 3 L. A. Adamic and E. Adar. Friends and neighbors on the web. Social networks, 25(3):211–230, 2003.
 4 E. M. Airoldi, D. M. Blei, S. E. Fienberg, E. P. Xing, and T. Jaakkola. Mixed membership stochastic block models for relational data with application to protein-protein interactions. In Proceedings of the International Biometrics Society Annual Meeting, 2006.
 5 A.-L. Barabási and R. Albert. Emergence of scaling in random networks. Science, 286(5439):509–512, 1999.
 6 M. Belkin and P. Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural computation, 15(6):1373–1396, 2003.
 7 A. R. Benson, D. F. Gleich, and J. Leskovec. Higher-order organization of complex networks. Science, 353(6295):163–166, 2016.
 8 S. Bhagat, G. Cormode, and S. Muthukrishnan. Node classification in social networks. In Social network data analytics, pages 115–148. Springer, 2011.
 9 D. K. Bhattacharyya and J. K. Kalita. Network anomaly detection: A machine learning perspective. Chapman and Hall/CRC, 2013.
 10 B.-J. Breitkreutz, C. Stark, T. Reguly, L. Boucher, A. Breitkreutz, M. Livstone, R. Oughtred, D. H. Lackner, J. Bähler, V. Wood, et al. The biogrid interaction database. Nucleic acids research, 36(suppl_1):D637–D640, 2007.
 11 H. Cai, V. W. Zheng, and K. C.-C. Chang. A comprehensive survey of graph embedding: Problems, techniques, and applications. IEEE Transactions on Knowledge and Data Engineering, 30(9):1616–1637, 2018.
 12 S. Cao, W. Lu, and Q. Xu. Grarep: Learning graph representations with global structural information. In Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, pages 891–900. ACM, 2015.
 13 E. K. Cetinkaya, M. J. Alenazi, A. M. Peck, J. P. Rohrer, and J. P. Sterbenz. Multilevel resilience analysis of transportation and communication networks. Telecommunication Systems, 60(4):515–537, 2015.
 14 S. Chakrabarti, B. Dom, and P. Indyk. Enhanced hypertext categorization using hyperlinks. In ACM SIGMOD Record, volume 27, pages 307–318. ACM, 1998.
 15 W. W. Cohen and J. Richman. Learning to match and cluster large high-dimensional data sets for data integration. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 475–480. ACM, 2002.
 16 G. Crichton, Y. Guo, S. Pyysalo, and A. Korhonen. Neural networks for link prediction in realistic biomedical graphs: a multi-dimensional evaluation of graph embedding-based approaches. BMC Bioinformatics, 19(1):176, May 2018.
 17 R. W. Eckardt III, R. G. Wolf Jr, A. Shapiro, K. G. Rivette, and M. F. Blaxill. Method and apparatus for selecting, analyzing, and visualizing related database records as a network, Mar. 2 2010. US Patent 7,672,950.
 18 D. Eppstein, M. S. Paterson, and F. F. Yao. On nearest-neighbor graphs. Discrete & Computational Geometry, 17(3):263–282, 1997.
 19 L. Getoor and A. Machanavajjhala. Entity resolution: theory, practice & open challenges. Proceedings of the VLDB Endowment, 5(12):2018–2019, 2012.
 20 P. Goyal and E. Ferrara. Graph embedding techniques, applications, and performance: A survey. Knowledge-Based Systems, 151:78–94, 2018.
 21 A. Grover and J. Leskovec. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining, pages 855–864. ACM, 2016.
 22 Y. Gu, Y. Sun, and J. Gao. The coevolution model for social network evolving and opinion migration. In Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining, pages 175–184. ACM, 2017.
 23 M. Gutmann and A. Hyvärinen. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 297–304, 2010.
 24 W. Hamilton, Z. Ying, and J. Leskovec. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, pages 1024–1034, 2017.
 25 W. L. Hamilton, R. Ying, and J. Leskovec. Representation learning on graphs: Methods and applications. arXiv preprint arXiv:1709.05584, 2017.
 26 P. Jaccard. Étude comparative de la distribution florale dans une portion des alpes et des jura. Bull Soc Vaudoise Sci Nat, 37:547–579, 1901.
 27 T. N. Kipf and M. Welling. Variational graph auto-encoders. NIPS Workshop on Bayesian Deep Learning, 2016.
 28 T. N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations (ICLR), 2017.
 29 T. Kraska, A. Beutel, E. H. Chi, J. Dean, and N. Polyzotis. The case for learned index structures. In Proceedings of the 2018 International Conference on Management of Data, pages 489–504. ACM, 2018.
 30 J. Leskovec, D. Huttenlocher, and J. Kleinberg. Signed networks in social media. In Proceedings of the SIGCHI conference on human factors in computing systems, pages 1361–1370. ACM, 2010.
 31 J. Liang, S. Gurukar, and S. Parthasarathy. Mile: A multilevel framework for scalable graph embedding. arXiv preprint arXiv:1802.09612, 2018.
 32 J. Liang, P. Jacobs, J. Sun, and S. Parthasarathy. Semi-supervised embedding in attributed networks with outliers. In Proceedings of the 2018 SIAM International Conference on Data Mining, pages 153–161. SIAM, 2018.
 33 D. Liben-Nowell and J. Kleinberg. The link-prediction problem for social networks. Journal of the American society for information science and technology, 58(7):1019–1031, 2007.
 34 L. Lü and T. Zhou. Link prediction in complex networks: A survey. Physica A: statistical mechanics and its applications, 390(6):1150–1170, 2011.
 35 S. Ma and M. Belkin. Diving into the shallows: a computational perspective on large-scale shallow learning. In Advances in Neural Information Processing Systems, pages 3778–3787, 2017.
 36 M. Mahoney. Large text compression benchmark. URL: http://www.mattmahoney.net/text/text.html, 2011.
 37 T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In Proceedings of the 26th International Conference on Neural Information Processing Systems  Volume 2, NIPS’13, pages 3111–3119, USA, 2013. Curran Associates Inc.
 38 G. E. Moon, A. Sukumaran-Rajam, S. Parthasarathy, and P. Sadayappan. Plnmf: Parallel locality-optimized nonnegative matrix factorization. arXiv preprint arXiv:1904.07935, 2019.
 39 M. Nickel, K. Murphy, V. Tresp, and E. Gabrilovich. A review of relational machine learning for knowledge graphs. Proceedings of the IEEE, 104(1):11–33, 2016.
 40 M. Ou, P. Cui, J. Pei, Z. Zhang, and W. Zhu. Asymmetric transitivity preserving graph embedding. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining, pages 1105–1114. ACM, 2016.
 41 C. C. Paige and M. A. Saunders. Towards a generalized singular value decomposition. SIAM Journal on Numerical Analysis, 18(3):398–405, 1981.
 42 B. Perozzi, R. Al-Rfou, and S. Skiena. Deepwalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 701–710. ACM, 2014.
 43 J. Qiu, Y. Dong, H. Ma, J. Li, C. Wang, and K. Wang. Netsmf: Large-scale network embedding as sparse matrix factorization. Proceedings of the 2019 World Wide Web Conference on World Wide Web, 2019.
 44 J. Qiu, Y. Dong, H. Ma, J. Li, K. Wang, and J. Tang. Network embedding as matrix factorization: Unifying deepwalk, line, pte, and node2vec. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, pages 459–467. ACM, 2018.
 45 L. F. Ribeiro, P. H. Saverese, and D. R. Figueiredo. struc2vec: Learning node representations from structural identity. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 385–394. ACM, 2017.
 46 M. Richardson, R. Agrawal, and P. Domingos. Trust management for the semantic web. In International semantic Web conference, pages 351–368. Springer, 2003.
 47 M. Ripeanu and I. Foster. Mapping the gnutella network: Macroscopic properties of large-scale peer-to-peer systems. In international workshop on peer-to-peer systems, pages 85–93. Springer, 2002.
 48 A. Sinha, R. Cazabet, and R. Vaudaine. Systematic biases in link prediction: Comparing heuristic and graph embedding based methods. In L. M. Aiello, C. Cherifi, H. Cherifi, R. Lambiotte, P. Lió, and L. M. Rocha, editors, Complex Networks and Their Applications VII, pages 81–93, Cham, 2019. Springer International Publishing.
 49 J. Tang, M. Qu, M. Wang, M. Zhang, J. Yan, and Q. Mei. Line: Large-scale information network embedding. In Proceedings of the 24th International Conference on World Wide Web, pages 1067–1077. International World Wide Web Conferences Steering Committee, 2015.
 50 A. Tsitsulin, D. Mottin, P. Karras, and E. Müller. Verse: Versatile graph embeddings from similarity measures. In Proceedings of the 2018 World Wide Web Conference on World Wide Web, pages 539–548. International World Wide Web Conferences Steering Committee, 2018.
 51 P. Vijayan, Y. Chandak, M. M. Khapra, and B. Ravindran. Fusion graph convolutional networks. Mining and Learning with Graphs (MLG), KDD, 2018.
 52 D. Wang, P. Cui, and W. Zhu. Structural deep network embedding. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining, pages 1225–1234. ACM, 2016.
 53 X. Wang, P. Cui, J. Wang, J. Pei, W. Zhu, and S. Yang. Community preserving network embedding. In AAAI, pages 203–209, 2017.
 54 R. Zafarani and H. Liu. Social computing data repository at asu, 2009.
 55 D. Zhang, J. Yin, X. Zhu, and C. Zhang. Network representation learning: A survey. IEEE transactions on Big Data, 2018.
 56 F. Zhang, W. Zhang, Y. Zhang, L. Qin, and X. Lin. Olak: an efficient algorithm to prevent unraveling in social networks. Proceedings of the VLDB Endowment, 10(6):649–660, 2017.
 57 M. Zhao and V. Saligrama. Anomaly detection with score functions based on nearest neighbor graphs. In Advances in neural information processing systems, pages 2250–2258, 2009.
 58 T. Zhou, L. Lü, and Y.-C. Zhang. Predicting missing links via local information. The European Physical Journal B, 71(4):623–630, 2009.