Node Representation Learning for Directed Graphs
Abstract
We propose a novel approach for learning node representations in directed graphs, which maintains separate views or embedding spaces for the two distinct node roles induced by the directionality of the edges. To achieve this, we propose a novel alternating random walk strategy to generate training samples from the directed graph while preserving the role information. These samples are then used to train a SkipGram with Negative Sampling (SGNS) model in which nodes retain their source/target semantics. We conduct an extensive experimental evaluation on several real-world datasets for link prediction, multi-label classification and graph reconstruction tasks. We show that the embeddings from our approach are robust, generalizable and perform well across multiple kinds of tasks and networks. We show that we consistently outperform all random-walk based neural embedding methods for link prediction and graph reconstruction tasks. In addition to providing a theoretical interpretation of our method, we also show that we are considerably more robust than the other directed graph approaches.
1 Introduction
Valuable knowledge is encoded in directed graph representations of real-world phenomena, where an edge not only indicates a relationship between entities but its direction often carries important asymmetric semantic information. Prime examples are follower networks, interaction networks, web graphs, and citation networks, among others. Using unsupervised learning techniques for learning graph representations is fundamentally challenging because graphs are discrete and structured, while much of machine learning works on continuous and unstructured data. Recently, there has been a large amount of work [24, 8, 19] on embedding nodes of graphs with the goal of preserving the neighborhood structure of nodes when embedding one space into another, thereby fully exploring the homophily assumption in node classification tasks.
Even though node embeddings have experienced a surge of interest and research papers in general, the existing works are still limited in the following ways. First, most node embedding techniques operate on a single embedding space, and distances in this space are considered to be symmetric. Consequently, even though some of the approaches claim to be applicable to directed graphs, they do not respect the asymmetric roles of the vertices in the directed graph. For example, in predicting links in an incomplete web graph or an evolving social network graph, it is more likely that a directed link exists from a less popular node, say Max Smith, to a more popular node, say an authoritative node Elon Musk, than the other way around. Algorithms employing a single representation per node might be able to predict a link between Elon Musk and Max Smith but cannot predict its direction.
Secondly, approaches like APP [27] (see Section 4 for more details) overcome the first limitation by using two embedding spaces but fail to account for the difference between neighborhoods in undirected and directed graphs. APP uses a directed walk as in [8, 19] to sample node neighborhoods as training data. Now consider a directed graph with a prominent hub and authority structure, for example a large graph in which only a few vertices have positive outdegree. In such a case any directed random walk from a source node will halt after a small number of steps, irrespective of the stopping criteria. Consequently, the global structure of the directed graph remains unexplored by such walks.
Thirdly, works like HOPE [17] (see also Section 4) rely on stricter definitions of neighborhoods dictated by proximity measures like Katz [9], Rooted PageRank, etc. and cannot be generalized to a variety of tasks. In addition, their reliance on matrix decomposition techniques to extract the node embeddings renders them non-scalable for very large graphs. Moreover, the accuracy guarantees of HOPE only hold for low-rank data.
Present Work. We propose a novel scalable approach for learning Node Embeddings Respecting Directionality (NERD) for directed and (un)weighted graphs. NERD aims at learning representations that maximize the likelihood of preserving node neighborhoods. Unlike the previous methods, however, it recognizes that a node has two different types of neighborhoods: one in its source role and the other in its target role. We propose a novel alternating random walk strategy to sample such node neighborhoods while preserving their respective role information. Our alternating walk strategy is inspired by the classical HITS [10] algorithm, which also identifies two types of important nodes in a directed network: hubs and authorities. Roughly speaking, the paths generated with our alternating random walks alternate between hubs (source nodes) and authorities (target nodes), thereby sampling both neighboring hubs and authorities with respect to an input node.
From a theoretical perspective we derive an equivalence for NERD’s optimization in a matrix factorization framework. In particular, we map the directed network to an undirected bipartite network with a symmetric adjacency matrix and show that NERD’s optimization is equivalent to factorizing the following matrix into source and target representations.
Theorem 1.
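Let $\mathcal{A}$ denote the adjacency matrix of the bipartite network $G'$ obtained by mapping the given directed network $G$ to $G'$. Let $D$ be the degree (diagonal) matrix of $\mathcal{A}$, let $n$ be the number of training pairs sampled in one alternating walk, and let $k$ be the number of negative samples. Let $\mathrm{vol}(G)$ denote the total weight of all edges in $G$. NERD implicitly factorizes the following matrix
$$\log\left(\frac{\mathrm{vol}(G)}{n}\sum_{r\in\{1,3,\ldots,2n-1\}}\left(D^{-1}\mathcal{A}\right)^{r}D^{-1}\right)-\log k .$$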
In addition to proposing a new framework, we also unearth the limitations of earlier works in the evaluation of models on directed graphs and propose new evaluation strategies for Link Prediction and Graph Reconstruction tasks in directed graphs.
Our Contributions. To summarize, we make the following contributions:

We propose a novel yet simple network embedding model for directed graphs by maintaining separate embeddings for separate roles.

We propose an alternating random walk model for sampling node neighborhoods in their source as well as the target roles. Moreover, the role information of neighborhood nodes is also preserved.

We highlight some of the limitations of previous experimental designs in evaluating asymmetric properties in directed graphs and conduct extensive experimental evaluation to show the effectiveness of our approach.

We provide theoretical interpretation and analysis of NERD’s optimization.
In the next section we describe our new approach for unsupervised learning in directed graphs in detail.
2 The NERD Model
Our approach consists of the following two main components: (i) a novel random walk strategy to generate sentence-like structures from the directed graph while preserving the node-role information, followed by (ii) appropriate sampling of input node-neighbor pairs, while preserving their role semantics, to be trained using negative sampling. Before we describe our learning framework using NERD, we first introduce the alternating walk model on directed graphs.
The Alternating Walk Model. We propose two alternating walks which alternate between source and target vertices and are referred to as source and target walks respectively. To understand the intuition behind these walks, consider a directed graph $G=(V,E)$ with $N$ nodes. Now construct a copy of each of these nodes and call this set $V'$. Construct an undirected bipartite graph $G'=(V\cup V',E')$ such that for vertices $u\in V$ and $v'\in V'$, where $v'$ is a copy of vertex $v\in V$, there is an edge $\{u,v'\}\in E'$ if and only if $(u,v)\in E$. In directed graphs the adjacency matrix $A$ is generally asymmetric; with our construction, however, we obtain a symmetric adjacency matrix $\mathcal{A}$ for the bipartite graph $G'$.
(1) $\mathcal{A}=\begin{pmatrix}0 & A\\ A^{T} & 0\end{pmatrix}$
A walk on this undirected bipartite graph starting from a vertex in $V$ will now encounter source nodes at the odd time steps and target nodes at the even time steps. We call such a walk an alternating walk. In the following we formally define the corresponding source and target alternating walks, which will help us to appropriately sample node neighborhoods while preserving the role semantics.
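The bipartite construction can be sketched in a few lines of Python. This is a minimal illustration only: the toy graph, the dense list-of-lists matrix format and the function name `bipartite_adjacency` are assumptions made here, not part of the original work.

```python
def bipartite_adjacency(A):
    """Return the 2N x 2N block matrix [[0, A], [A^T, 0]] for an N x N matrix A."""
    n = len(A)
    B = [[0] * (2 * n) for _ in range(2 * n)]
    for u in range(n):
        for v in range(n):
            if A[u][v]:
                B[u][n + v] = A[u][v]  # source copy of u -- target copy of v
                B[n + v][u] = A[u][v]  # mirrored entry makes the matrix symmetric
    return B

# Hypothetical toy graph with edges 0 -> 1, 0 -> 2, 1 -> 2 (asymmetric adjacency).
A = [[0, 1, 1],
     [0, 0, 1],
     [0, 0, 0]]
B = bipartite_adjacency(A)
is_symmetric = all(B[i][j] == B[j][i] for i in range(6) for j in range(6))
```

Rows and columns 0..N-1 index the source copies and N..2N-1 the target copies, so an edge never connects two nodes on the same side.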
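Definition 2.
The Source Walk. Given a directed graph, we define a source walk of length $\ell$ as a list of nodes $v_1, v_2, \ldots, v_{\ell+1}$ such that there exists an edge $(v_i, v_{i+1})$ if $i$ is odd and an edge $(v_{i+1}, v_i)$ if $i$ is even: $v_1 \rightarrow v_2 \leftarrow v_3 \rightarrow \cdots$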
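Definition 3.
The Target Walk. A target walk of length $\ell$, starting with an in-edge, from node $v_1$ to node $v_{\ell+1}$ in a directed network is a list of nodes $v_1, v_2, \ldots, v_{\ell+1}$ such that there exists an edge $(v_{i+1}, v_i)$ if $i$ is odd and an edge $(v_i, v_{i+1})$ if $i$ is even: $v_1 \leftarrow v_2 \rightarrow v_3 \leftarrow \cdots$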
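The two definitions can be checked mechanically. The sketch below is only an illustration of the alternation rule; the helper name `is_alternating_walk`, the edge-set representation and the toy graph are assumptions introduced here.

```python
def is_alternating_walk(nodes, edges, start_with_out_edge):
    """Return True if `nodes` is a source walk (start_with_out_edge=True)
    or a target walk (start_with_out_edge=False) over the directed `edges`."""
    for i in range(len(nodes) - 1):
        a, b = nodes[i], nodes[i + 1]
        # Steps 1, 3, 5, ... (i = 0, 2, 4, ...) use the starting direction;
        # the remaining steps use the opposite direction.
        follows_out_edge = (i % 2 == 0) == start_with_out_edge
        required = (a, b) if follows_out_edge else (b, a)
        if required not in edges:
            return False
    return True

# Hypothetical toy graph: 0 -> 1, 0 -> 2, 1 -> 2.
edges = {(0, 1), (0, 2), (1, 2)}
source_ok = is_alternating_walk([0, 1, 0, 2], edges, True)    # 0 -> 1 <- 0 -> 2
target_ok = is_alternating_walk([2, 1, 2], edges, False)      # 2 <- 1 -> 2
directed_path = is_alternating_walk([0, 1, 2], edges, True)   # 0 -> 1 -> 2
```

Note that the ordinary directed path 0 -> 1 -> 2 is rejected as a source walk, because its second step would have to traverse an edge backwards.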
As already mentioned, the alternating walk model is inspired by the HITS algorithm. More specifically, good hubs and authorities of the HITS algorithm are in fact the nodes which appear many times in the alternating walks described above. Due to space limitations, we discuss the similarities and dissimilarities between our alternating walks and the HITS algorithm in the supplementary material. In the next section, we describe the learning framework which uses these alternating walks to generate node neighborhood pairs, which are then used as training examples for optimizing an SGNS (SkipGram with Negative Sampling) based objective function.
2.1 The Learning Framework
We first introduce the notation that will be followed in the rest of the paper unless stated otherwise. Let $G=(V,E)$ be a directed weighted graph with $N$ nodes and $M$ edges. Let $w(e)$ denote the weight of edge $e$ and $\mathrm{vol}(G)=\sum_{e\in E} w(e)$. For any vertex $v\in V$ let $d^{out}(v)$ denote the total outdegree of $v$, i.e. the sum of weights of the outgoing edges from $v$. Similarly $d^{in}(v)$ denotes the total indegree of $v$. For unweighted graphs, we assume the weight of each edge to be $1$. Let $\Phi_s(v)$ and $\Phi_t(v)$ represent the respective embedding vectors for any node $v\in V$ in its role as source and target respectively. Let $P^{in}$ and $P^{out}$ denote the input and output degree distributions of $G$ respectively. Specifically $P^{in}(v)=d^{in}(v)/\mathrm{vol}(G)$ and $P^{out}(v)=d^{out}(v)/\mathrm{vol}(G)$. We use alternating random walks as explained below to sample training examples.
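Alternating Random Walks. To generate an alternating random walk we first sample the input node: from the outdegree distribution of $G$ for a source walk and from the indegree distribution of $G$ for a target walk. We then simulate source/target random walks of length $\ell$. Let $c_i^u$ denote the $i$-th node in the alternating random walk starting with node $u$. Then
$$\Pr(c_i^u = v' \mid c_{i-1}^u = v)=\begin{cases}\dfrac{w(v,v')}{d^{out}(v)} & \text{if } (v,v')\in E \text{ and step } i \text{ follows an out-edge},\\[2mm] \dfrac{w(v',v)}{d^{in}(v)} & \text{if } (v',v)\in E \text{ and step } i \text{ follows an in-edge},\\[2mm] 0 & \text{otherwise.}\end{cases}$$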
All nodes in a source/target walk in their respective roles constitute a neighborhood set for the input (the walk starts at this node) node.
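A minimal sketch of how such walks could be sampled is given below. It assumes a weighted graph stored as a dictionary of out-neighbors and assumes that source walks start at a node drawn proportionally to its outdegree (target walks would use the in-neighbor dictionary instead); all function names are hypothetical and this is not the authors' implementation.

```python
import random

def build_in_neighbors(out_nbrs):
    """Invert {u: {v: weight}} into {v: {u: weight}}."""
    in_nbrs = {}
    for u, nbrs in out_nbrs.items():
        for v, w in nbrs.items():
            in_nbrs.setdefault(v, {})[u] = w
    return in_nbrs

def sample_start(nbrs, rng):
    """Draw a start node with probability proportional to its weighted degree in `nbrs`."""
    nodes = [u for u in nbrs if nbrs[u]]
    degrees = [sum(nbrs[u].values()) for u in nodes]
    return rng.choices(nodes, weights=degrees, k=1)[0]

def alternating_random_walk(out_nbrs, in_nbrs, start, length, source_walk, rng):
    """Alternate between out-edges and in-edges, choosing edges by weight."""
    walk, follow_out = [start], source_walk
    for _ in range(length):
        candidates = (out_nbrs if follow_out else in_nbrs).get(walk[-1], {})
        if not candidates:
            break
        nodes = list(candidates)
        weights = [candidates[v] for v in nodes]
        walk.append(rng.choices(nodes, weights=weights, k=1)[0])
        follow_out = not follow_out
    return walk

# Hypothetical toy graph; node 2 has zero outdegree.
out_nbrs = {0: {1: 1.0, 2: 2.0}, 1: {2: 1.0}, 2: {}}
in_nbrs = build_in_neighbors(out_nbrs)
rng = random.Random(0)
start = sample_start(out_nbrs, rng)
source_walk = alternating_random_walk(out_nbrs, in_nbrs, start, 5, True, rng)
```

Because every node reached over an out-edge has at least one in-edge (and vice versa), an alternating walk started at a node of positive degree does not get stuck at sink nodes, in contrast to a purely directed walk.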
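To model and describe our objective function we assume that the input node $u$ is a source node and its neighborhood is sampled via a source walk. We seek to optimize the following objective function, which maximizes the log-probability of observing a network neighborhood for a node $u$ conditioned on its feature representation in its source role, $\Phi_s(u)$:
$$\max_{\Phi}\sum_{u\in V}\log\Pr\left(N_s(u)\mid\Phi_s(u)\right),$$
where $N_s(u)$ denotes the nodes in the source walk starting from $u$. We further assume (for tractability) that the likelihood of observing a neighborhood node is independent of observing any other neighborhood node given the feature representation of the input node $u$, i.e.
(2) $\Pr\left(N_s(u)\mid\Phi_s(u)\right)=\prod_{v\in N_s(u)}\Pr\left(v\mid\Phi_s(u)\right)$
The corresponding objective for a target input node $u$ (in this case the neighborhood $N_t(u)$ consists of the nodes in the target walk starting from $u$) is given as
$$\max_{\Phi}\sum_{u\in V}\log\Pr\left(N_t(u)\mid\Phi_t(u)\right).$$
We model the conditional probability of observing an input-neighbor pair $(u,v)$ in a source or target walk using a softmax over the dot product of their feature representations in their respective roles, i.e.
$$\Pr\left(v\mid\Phi_r(u)\right)=\frac{\exp\left(\Phi_r(u)\cdot\Phi_{r'}(v)\right)}{\sum_{w\in V}\exp\left(\Phi_r(u)\cdot\Phi_{r'}(w)\right)},$$
where $r$ and $r'$ are the roles of the input and the neighbor node depending on whether the sampled walk is a source or target walk. Using the neighborhood independence condition as in (2) and substituting $Z_u=\sum_{w\in V}\exp\left(\Phi_r(u)\cdot\Phi_{r'}(w)\right)$, we obtain
(3) $\max_{\Phi}\sum_{u\in V}\left[-|N_r(u)|\log Z_u+\sum_{v\in N_r(u)}\Phi_r(u)\cdot\Phi_{r'}(v)\right]$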
Optimizing (3) directly would require computing the partition function, which is computationally expensive for large graphs. Similar to other works, we instead use negative sampling as proposed in [15], which draws negative samples from a noise distribution to compensate for the normalization factor. In particular, for an input-neighbor pair $(u,v)$ in roles $r$ and $r'$ respectively, we specify the objective function as
(4) $\mathcal{O}(u,v)=\log\sigma\left(\Phi_r(u)\cdot\Phi_{r'}(v)\right)+k\,\mathbb{E}_{w\sim P^{n}_{r'}}\left[\log\sigma\left(-\Phi_r(u)\cdot\Phi_{r'}(w)\right)\right]$
where $\sigma(x)=1/(1+\exp(-x))$ and $P^{n}_{r'}$ is the indegree or outdegree noise distribution. We set $P^{n}_{r'}(w)\propto d(w)^{3/4}$, where $d(w)$ is the indegree (if $r'$ is the target role) or outdegree (if $r'$ is the source role) of vertex $w$.
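To make the negative-sampling objective concrete, the following sketch evaluates the standard SGNS objective for one input-neighbor pair, with the expectation over the noise distribution replaced by a sum over already drawn negative samples. The vectors and the function name `pair_objective` are illustrative assumptions, and no gradient step is shown.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def pair_objective(phi_input, phi_neighbor, phi_negatives):
    """log sigma(input . neighbor) + sum over negatives of log sigma(-input . negative)."""
    positive = math.log(sigmoid(dot(phi_input, phi_neighbor)))
    negative = sum(math.log(sigmoid(-dot(phi_input, w))) for w in phi_negatives)
    return positive + negative

# Hypothetical 2-dimensional vectors: a source-role vector for the input node,
# target-role vectors for the neighbor and for two negative samples.
phi_s_u = [1.0, 0.0]
phi_t_v = [2.0, 0.0]
negatives = [[-1.0, 0.5], [0.0, 1.0]]
aligned = pair_objective(phi_s_u, phi_t_v, negatives)
opposed = pair_objective(phi_s_u, [-2.0, 0.0], negatives)
```

The objective is larger when the input vector is aligned with the true neighbor and points away from the negatives, which is what stochastic gradient ascent on this quantity encourages.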
We optimize Equation (4) using Asynchronous Stochastic Gradient Descent [22]. In our experiments, we observed that it suffices to consider only neighbors of differing roles, i.e. it suffices to sample source-target and target-source pairs as positive training examples in source and target walks respectively. Moreover, for the datasets in question, very long walks did not yield appreciable gains on the various tasks. The input vertex is always chosen to be the starting vertex of the walk. The pseudocode for NERD as used in this work, which distinguishes source walks from target walks, is given in the accompanying algorithm listing.
2.2 Theoretical Analysis
We further support our new approach by (i) observing its intuitive motivation from the classical HITS [10] algorithm (details in supplementary material) and (ii) deriving a closed form expression for NERD’s optimization in the matrix framework.
Following the work in [13], for a given training sample of a word $w$ and a context $c$, SGNS implicitly factorizes
(5) $\log\left(\dfrac{P(w,c)}{P(w)\,P(c)}\right)-\log k$
where $P(w,c)$ is the joint probability distribution of $(w,c)$ pairs (occurring in a contextual window), $P(w)$ and $P(c)$ are the probability distributions of sampled words and contexts respectively, and $k$ is the number of negative samples. As we also employ the SGNS objective, we use the main result from [13], which implies that for a training pair $(u,v)$ NERD implicitly factorizes Equation (5) with word and context replaced by the node pair. In order to prove Theorem 1 we need to compute the distributions for sampling a training pair and the marginalized node distributions, which we accomplish in the following proof.
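The shifted pointwise mutual information that SGNS implicitly factorizes can be computed directly from a list of sampled pairs. The sketch below uses empirical frequencies as the probabilities; the pair list and the function name `shifted_pmi` are made up for illustration.

```python
import math
from collections import Counter

def shifted_pmi(pairs, k):
    """Return {(w, c): log(P(w, c) / (P(w) P(c))) - log k} from empirical pair counts."""
    total = len(pairs)
    joint = Counter(pairs)
    w_count = Counter(w for w, _ in pairs)
    c_count = Counter(c for _, c in pairs)
    return {
        (w, c): math.log(n * total / (w_count[w] * c_count[c])) - math.log(k)
        for (w, c), n in joint.items()
    }

# Hypothetical sample: "a" always co-occurs with "x", "b" always with "y".
pairs = [("a", "x"), ("a", "x"), ("b", "y"), ("b", "y")]
m1 = shifted_pmi(pairs, k=1)
m2 = shifted_pmi(pairs, k=2)
```

With $k=1$ the entry for ("a", "x") is $\log 2$, since $P(a,x)=P(a)=P(x)=1/2$; with $k=2$ the shift $\log k$ cancels it exactly.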
Proof of Theorem 1.
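We recall that $D$ is the degree matrix of $\mathcal{A}$. Set $P=D^{-1}\mathcal{A}$. First we note that the initial vertex is chosen with probability $d^{out}(u)/\mathrm{vol}(G)$ for a source walk starting at $u$, or $d^{in}(v)/\mathrm{vol}(G)$ for a target walk starting at $v$. The probability that a given source-target pair $(u,v)$ will be sampled in a walk of length $2n-1$, where $n$ is the number of sampled pairs, is given by
(6) $P(u,v)=\dfrac{1}{2n}\sum_{r\in\{1,3,\ldots,2n-1\}}\left(\dfrac{D_{uu}}{\mathrm{vol}(G)}\left(P^{r}\right)_{u,v}+\dfrac{D_{vv}}{\mathrm{vol}(G)}\left(P^{r}\right)_{v,u}\right)$
The first term corresponds to the sampling of $(u,v)$ in a source walk starting from the source node $u$ and the second term corresponds to the sampling of $(u,v)$ in a target walk starting from the target node $v$. Note that $D_{uu}$ corresponds to the outdegree of $u$ in the original graph $G$, $D_{vv}$ corresponds to the indegree of $v$ in the original graph $G$, and $D_{uu}\left(P^{r}\right)_{u,v}=D_{vv}\left(P^{r}\right)_{v,u}$ because $\mathcal{A}$ is symmetric. Also note that for NERD the input vertex is always the first vertex in the walk. Further marginalization of Equation (6) gives us $P(u)=d^{out}(u)/\mathrm{vol}(G)$ and $P(v)=d^{in}(v)/\mathrm{vol}(G)$. Substituting the above terms in Equation (5), we obtain
(7) $\log\left(\dfrac{P(u,v)}{P(u)\,P(v)}\right)-\log k=\log\left(\dfrac{\mathrm{vol}(G)}{n}\sum_{r\in\{1,3,\ldots,2n-1\}}\left(P^{r}\right)_{u,v}\dfrac{1}{D_{vv}}\right)-\log k$
In matrix form Equation (7) is equivalent to
$$\log\left(\frac{\mathrm{vol}(G)}{n}\sum_{r\in\{1,3,\ldots,2n-1\}}\left(D^{-1}\mathcal{A}\right)^{r}D^{-1}\right)-\log k,$$
hence completing the proof. ∎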
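The marginalization step of the proof can be sanity-checked numerically on a small graph. The sketch below assumes that a source-target pair $(u,v)$ is obtained by drawing $u$ proportionally to its outdegree and then taking the node reached after an odd number $r\in\{1,3,\ldots,2n-1\}$ of alternating steps, each $r$ being equally likely; this sampling model, the toy graph and all names are assumptions made for illustration.

```python
def matmul(X, Y):
    return [[sum(X[i][t] * Y[t][j] for t in range(len(Y))) for j in range(len(Y[0]))]
            for i in range(len(X))]

def pair_distribution(A, n):
    """Joint distribution over (source u, target v) under the assumed sampling model."""
    N = len(A)
    vol = float(sum(map(sum, A)))
    # Bipartite adjacency: indices 0..N-1 are source copies, N..2N-1 target copies.
    B = [[0.0] * (2 * N) for _ in range(2 * N)]
    for u in range(N):
        for v in range(N):
            B[u][N + v] = B[N + v][u] = float(A[u][v])
    deg = [sum(row) for row in B]
    P = [[B[i][j] / deg[i] if deg[i] else 0.0 for j in range(2 * N)] for i in range(2 * N)]
    P2 = matmul(P, P)
    power = P  # holds P^r for r = 1, 3, ..., 2n - 1
    joint = [[0.0] * N for _ in range(N)]
    for _ in range(n):
        for u in range(N):
            for v in range(N):
                joint[u][v] += deg[u] / vol * power[u][N + v] / n
        power = matmul(power, P2)
    return joint, deg, vol

# Hypothetical toy graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0.
A = [[0, 1, 1],
     [0, 0, 1],
     [1, 0, 0]]
joint, deg, vol = pair_distribution(A, n=2)
N = len(A)
source_marginal = [sum(joint[u]) for u in range(N)]
target_marginal = [sum(joint[u][v] for u in range(N)) for v in range(N)]
```

On this example the source marginal equals the outdegree distribution and the target marginal equals the indegree distribution, which is what allows the ratio in Equation (5) to be written in closed form.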
The reader might have noticed that, unlike the equivalence proofs for DeepWalk and other methods in [21], we do not need the assumptions about infinitely long walks and stationary distributions over undirected and non-bipartite graphs, because of the following facts. Firstly, the initial distribution for the first vertex is the outdegree distribution for a source walk and the indegree distribution for a target walk, unlike the uniform distribution in DeepWalk. Secondly, we use the first vertex of the walk as the input vertex and we know the exact distribution from which it is drawn. As a result, the distribution for training pairs can be computed analytically.
2.3 Complexity Analysis of NERD
Sampling a vertex based on indegree or outdegree distribution requires constant amortized time by building an alias sampling table upfront. At any time only neighbors are stored which is typically a small number as we observed in our experiments. In our experiments we set the total number of walks to be . In each optimization step we use negative samples, therefore, complexity of each optimization step is . As it requires time to read the graph initially, the run time complexity for NERD can be given as . As we suggest to increase with the number of nodes, the run time complexity of NERD is given as . The space complexity of NERD is . As our method is linear (with respect to space and time complexity) in the input size, it is scalable for large graphs.
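The alias sampling table mentioned above can be built as in the following sketch of the standard alias method (Vose's variant). This is a generic construction for illustration and not the authors' code; function names and the example weights are assumptions.

```python
import random

def build_alias_table(weights):
    """Preprocess weights in linear time so that each later draw takes constant time."""
    n = len(weights)
    total = float(sum(weights))
    prob = [w * n / total for w in weights]
    alias = [0] * n
    small = [i for i, p in enumerate(prob) if p < 1.0]
    large = [i for i, p in enumerate(prob) if p >= 1.0]
    while small and large:
        s, l = small.pop(), large.pop()
        alias[s] = l                      # the unused part of bucket s is given to l
        prob[l] = prob[l] + prob[s] - 1.0
        (small if prob[l] < 1.0 else large).append(l)
    for i in small + large:               # leftovers are full buckets (up to rounding)
        prob[i] = 1.0
    return prob, alias

def alias_draw(prob, alias, rng):
    """Pick a bucket uniformly, then keep it or jump to its alias."""
    i = rng.randrange(len(prob))
    return i if rng.random() < prob[i] else alias[i]

# Hypothetical degrees of four vertices.
degrees = [1, 2, 3, 4]
prob, alias = build_alias_table(degrees)
sample = alias_draw(prob, alias, random.Random(0))

# Distribution implied by the table, for comparison with degrees / sum(degrees).
implied = [prob[j] / 4 for j in range(4)]
for i in range(4):
    if alias[i] != i:
        implied[alias[i]] += (1.0 - prob[i]) / 4
```

Building the table costs time linear in the number of vertices, after which each vertex draw needs only one uniform index and one uniform real number.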
3 Experiments
In this section we present how we evaluate NERD against several state-of-the-art graph embedding algorithms.
The materials used in the experiments, including code and datasets, will be released for reproducibility purposes at the time of publication.
We use the original implementations of the authors for all the baselines (if available) and tune their parameters whenever possible. Otherwise we use the best parameters reported by the authors in their respective papers. We perform comparisons on three tasks, namely Link Prediction, Graph Reconstruction and Multi-label Classification. As our datasets, we consider (i) a citation network in which directed edges represent the citation relationship, (ii) a social network where directed edges represent the follower relationship, and (iii) a hyperlink network. In the next section we explain the datasets used in our evaluations.
Table 1: Dataset characteristics.
            Size                            Statistics
dataset     nodes    edges      labels      Diameter  nodes with positive outdegree
CORA        23,166   91,500     79          20        21,201
Twitter     465,017  834,797    -           8         2,502
Google Web  875,713  5,105,039  -           24        739,453
3.1 Datasets
We now describe the datasets used for our evaluation. A brief summary of the characteristics of these graphs is presented in Table 1. We will also describe in our experimental results how certain network statistics, like the number of nodes with a positive outdegree, influence the performance of various algorithms in a given task.

Cora [23]: This is a citation network of academic papers. The nodes are academic papers and the directed edges are the citation relationship between papers. An edge between two nodes indicates that the left node cites the right node. The nodes are also labeled with the paper categories. Each paper has one or more labels. The labels are extracted from the paper categories. For example, if the paper category is /A/B/, we consider that the paper has two labels, A and B.

Twitter Social Network (Twitter (ICWSM)) [7]: This is the directed network containing information about who follows whom on Twitter. Nodes represent users and an edge shows that the left user follows the right one.

Web-Google [12]: This is a network of web pages connected by hyperlinks. The data was released in 2002 by Google as a part of the Google Programming Contest.
All the above datasets have been collected from [11].
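The label extraction described for Cora (a category such as /A/B/ yields the labels A and B) amounts to splitting the category path. A minimal sketch, with a hypothetical function name and made-up category strings:

```python
def labels_from_category(category):
    """Split a category path such as '/A/B/' into its non-empty components."""
    return [part for part in category.split("/") if part]

two_labels = labels_from_category("/A/B/")
one_label = labels_from_category("/A/")
```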
3.2 Baselines
We compare the NERD model with several existing graph embedding models for link prediction, graph reconstruction and multi-label classification tasks. The parameter settings for NERD and all baselines are described in the supplementary material.

APP [27]: It uses an approximate version of Rooted PageRank wherein several paths are sampled from the starting vertex using a restart probability. The first and the last vertex of such paths form the training pair.

DeepWalk [19]: DeepWalk trains the SkipGram model using hierarchical softmax, akin to the word2vec-based training procedure. The training set is prepared by sampling vertex-context pairs over a sliding window in a given random walk.

LINE [24]: This has two variants: LINE1 and LINE2. LINE1 optimizes for the first-order proximity, i.e. it aims to embed nodes which are connected by an edge close together. LINE2, on the other hand, optimizes for second-order proximity by embedding nodes that share neighborhoods closer together.

HOPE [17]: It learns two embeddings corresponding to the two roles of the nodes and is based on an SVD operation on sparse similarity matrices constructed using Katz, (rooted) PageRank similarity etc. We used Katz similarity as provided in the authors’ implementation. The embeddings based on Katz measure were also emphasized in the paper as the one with the best performance across various tasks.

Node2vec [8]: Node2vec is a variant of DeepWalk. It uses biased random walks performing breadth-first or depth-first search, or a mixture of both, to control the random walk sampling of DeepWalk. Note that while Node2vec employs negative sampling, DeepWalk uses hierarchical softmax. As we used the authors' original implementation of DeepWalk with hierarchical softmax, as opposed to other papers that use the SGNS version, we find that DeepWalk sometimes performs better than Node2vec, especially for multi-label classification, even for the parameters ($p=q=1$) for which Node2vec is claimed to be equivalent to DeepWalk.

VERSE [25]: This is a very recent embedding method which optimizes for three similarity measures, namely Personalized PageRank (PPR), direct neighborhood and SimRank. As the paper emphasizes the general applicability of embeddings created using PPR similarity, we report the results corresponding to the same for the best parameter.
3.3 Link Prediction
The aim of the link prediction task is to predict missing edges given a network with a fraction of removed edges. In the literature there have been slightly different yet similar experimental settings. A fraction of edges is removed randomly to serve as the test split while the residual network can be utilized for training. The test split is balanced with negative edges sampled from random vertex pairs that have no edges between them. We refer to this setting as the undirected setting. While removing edges randomly, we make sure that no node is isolated, since otherwise the representations corresponding to these nodes cannot be learned. Tables 2 and 3 present the ROC-AUC (Area Under the Receiver Operating Characteristic Curve) scores for this classical undirected setting. More specifically, given an embedding, the inner product of two node representations normalized by the sigmoid function is employed as the similarity/link-probability measure for all the algorithms.
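The scoring rule and the evaluation metric can be illustrated as follows. The sketch computes the link probability as the sigmoid of an inner product and the ROC-AUC as the fraction of (positive, negative) score pairs that are ranked correctly; the embedding values are made up, and the function names are assumptions rather than the evaluation code used in the experiments.

```python
import math

def link_probability(vec_from, vec_to):
    """Sigmoid of the inner product of two embedding vectors."""
    return 1.0 / (1.0 + math.exp(-sum(a * b for a, b in zip(vec_from, vec_to))))

def roc_auc(pos_scores, neg_scores):
    """Probability that a positive edge scores higher than a negative one (ties count 0.5)."""
    wins = sum((p > q) + 0.5 * (p == q) for p in pos_scores for q in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))

# One embedding per node: the score is the same in both directions.
single = {"u": [1.0, 0.5], "v": [0.2, 2.0]}
sym_uv = link_probability(single["u"], single["v"])
sym_vu = link_probability(single["v"], single["u"])

# Separate source and target embeddings: the score can depend on the direction.
source = {"u": [1.0, 0.0], "v": [0.0, -1.0]}
target = {"u": [0.0, 1.0], "v": [2.0, 0.0]}
p_uv = link_probability(source["u"], target["v"])
p_vu = link_probability(source["v"], target["u"])

auc = roc_auc([0.9, 0.8], [0.3, 0.85])
```

With a single vector per node the inner product is symmetric, so such a model cannot rank the edge (u, v) above its reverse (v, u); with two role-specific vectors the two scores can differ.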
For the best parameter search we use the training data with edges for Cora and with edges for Twitter (because of the sparsity of Twitter dataset).
Directed link prediction. Since we are interested in not only the existence of the edges between nodes but also the directions of these edges, we consider a slight modification in the test split setting. In a directed network the algorithm should also be able to decide the direction of the predicted edge. To achieve this, we allow for negative edges that are complements of the true edges which exist already in the test split.
We experiment by varying the number of such complement edges created by inverting a fraction of the true edges in the test split. A value of $0$ corresponds to the classical undirected graph setting, while a value in $(0,1]$ determines the maximum fraction of positive edges from the test split that are inverted to create negative examples. It can also happen that an inverted edge is actually an edge in the network, in which case we discard it and pick a random pair which corresponds to a negative edge. Such a construction of test data is essential to check whether the algorithm also predicts the correct direction of an edge along with its existence. Please note that we always make sure that in the test set the number of negative examples is equal to the number of positive examples. Embedding dimensions are set to for all models for both settings.
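The construction of the negative test edges can be sketched as below. The function name, the choice to invert the first edges of the test list, and the toy graph are assumptions made for illustration; the description above leaves these details open.

```python
import random

def build_negative_test_edges(test_edges, all_edges, nodes, invert_fraction, rng):
    """Return as many negative edges as there are positive test edges."""
    negatives = set()
    n_invert = int(invert_fraction * len(test_edges))
    for u, v in test_edges[:n_invert]:
        if (v, u) not in all_edges:  # an inverted edge that really exists is discarded
            negatives.add((v, u))
    while len(negatives) < len(test_edges):
        u, v = rng.choice(nodes), rng.choice(nodes)
        # Random negatives: vertex pairs with no edge between them in either direction.
        if u != v and (u, v) not in all_edges and (v, u) not in all_edges:
            negatives.add((u, v))
    return sorted(negatives)

# Hypothetical toy graph; the edge (0, 1) is reciprocated by (1, 0).
nodes = [0, 1, 2, 3, 4, 5]
all_edges = {(0, 1), (1, 0), (1, 2), (2, 3), (3, 4)}
test_edges = [(0, 1), (2, 3), (3, 4)]
directed_negatives = build_negative_test_edges(test_edges, all_edges, nodes, 1.0, random.Random(0))
undirected_negatives = build_negative_test_edges(test_edges, all_edges, nodes, 0.0, random.Random(0))
```

In the toy example the inversion of (0, 1) is a true edge, so it is dropped and replaced by a random non-edge, which keeps the test set balanced.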
Results on Cora and Twitter. APP mostly outperforms the others for the undirected setting on the Cora dataset. The scores for HOPE, NERD and DeepWalk are also very close to the best scores. Figure 4 plot (a) shows the degrading performance of DeepWalk and other random walk based methods in the directed setting, where the algorithm is forced to assign a direction to the edge. The plots correspond to training-test split. NERD stays stable with AUC scores in . HOPE and APP perform better than DeepWalk, Node2vec, VERSE and LINE, but they are still not as stable as NERD in the directed setting and show a decrease in performance with an increasing fraction of inverted edges.
For the Twitter dataset NERD outperforms all other algorithms except for the split.
Using non-alternating directed walks on Twitter, which has a prominent hub-authority structure, prevents APP and other similar random walk methods from exploring the network structure as fully as they could for Cora. For Twitter the walks become much shorter because with more than vertices with zero outdegree, it is highly likely that the walk reaches such a vertex in a small number of steps. Figure 4 plot (b) shows the performance of all methods on a train-test split in the directed setting. NERD and HOPE showcase a stable performance. We witness an increase in APP's score as more and more of the positive edges are inverted to create negative test edges. This implies that APP is able to distinguish the correct direction of a given edge (unlike other random walk methods) but is unable to correctly distinguish positive edges from random negative edges.
Table 2: Link prediction ROC-AUC scores on Cora in the undirected setting.
          Training Data, % Edges
method    20%    50%    70%    90%
APP       0.473  0.893  0.939  0.961
DeepWalk  0.559  0.879  0.919  0.938
LINE2     0.503  0.502  0.502  0.502
HOPE      0.594  0.875  0.941  0.893
Node2vec  0.719  0.761  0.788  0.808
VERSE     0.536  0.613  0.613  0.612
NERD      0.699  0.797  0.895  0.923
Table 3: Link prediction ROC-AUC scores on Twitter in the undirected setting.
          Training Data, % Edges
method    60%     70%     80%    90%
APP       0.467   0.446   0.433  0.425
DeepWalk  0.536   0.531   0.521  0.532
LINE2     0.512   0.512   0.510  0.510
HOPE      0.984   0.898   0.835  0.785
Node2vec  0.5003  0.4996  0.499  0.5005
VERSE     0.520   0.607   0.663  0.702
NERD      0.956   0.954   0.957  0.958

Results on Web-Google. We already showed that models using a single representation for nodes cannot perform better than a random guess (on the link prediction task) as they cannot correctly predict the direction. We therefore compare only the asymmetry-preserving methods on the link prediction task for the Web-Google dataset. In particular, we investigate the stability of the models' scores in the directed setting where we invert positive edges in the test set to create negative examples. The results are shown in Figure 5. We observe that, even though HOPE scores a bit higher at , i.e. when no positive edges are inverted, its performance decreases by about when (for training data). On the other hand, the corresponding drop in NERD's performance is close to . APP performs the worst of the three approaches.
3.4 Graph Reconstruction
In the graph reconstruction task we evaluate how well the embeddings preserve neighborhood information. There are two separate evaluation regimes for graph reconstruction in previous works. One line of work [17], which we refer to as edge-centric evaluation, relies on sampling random pairs of nodes from the original graphs into their test set. These candidate edges are then ordered according to their similarity in the embedding space. Precision is then computed at different rank depths, where the relevant edges are the ones present in the original graph. On the other hand, [25] perform a node-centric evaluation where precision is computed on a per-node basis. For a given node $v$ with an outdegree $k$, embeddings are used to perform a $k$-nearest neighbor search for $v$, and precision is computed based on how many actual neighbors the $k$-NN procedure is able to extract.
Directed Graph Reconstruction. We believe that the edge-centric evaluation suffers from the sparsity issues typical of real-world networks: even if a large number of node pairs are sampled, the fraction of relevant edges retrieved tends to remain low. More importantly, such an approach does not model the neighborhood reconstruction aspect of graph reconstruction and is rather close to predicting links. We adopt the node-centric evaluation approach, where we intend to also compute precision on directed networks with a slight modification. In particular, we propose to compute precision for both outgoing and incoming edges for a given node.
As in Link Prediction, the similarity or the probability of an edge $(u,v)$ is computed as the sigmoid over the dot product of their respective embedding vectors. For HOPE and NERD we use the corresponding source and target vectors respectively. We do not assume prior knowledge of the indegree or outdegree; rather, we compute the precision for a range of values of $k$. For a given $k$ we obtain the $k$ nearest neighbors ranked by sigmoid similarity for each embedding approach. If a node has an outdegree or indegree of zero, we set the precision to be $1$ if the sigmoid corresponding to the nearest neighbor is less than $0.5$ (recall that $\sigma(x)<0.5$ for $x<0$), otherwise we set it to $0$. In other cases, for a given node $v$ and a specific $k$ we compute $P^{k}_{out}(v)$ and $P^{k}_{in}(v)$ corresponding to the outgoing and incoming edges as
$$P^{k}_{out}(v)=\frac{\left|N^{k}_{out}(v)\cap N_{out}(v)\right|}{k},\qquad P^{k}_{in}(v)=\frac{\left|N^{k}_{in}(v)\cap N_{in}(v)\right|}{k},$$
where $N^{k}_{out}(v)$ and $N^{k}_{in}(v)$ are the $k$ nearest outgoing (to whom $v$ has outgoing edges) and incoming (neighbors pointing to $v$) neighbors retrieved from the embeddings and $N_{out}(v)$ and $N_{in}(v)$ are the actual outgoing and incoming neighbors of $v$. We then compute the Micro-F1 score as the harmonic mean of $P^{k}_{in}(v)$ and $P^{k}_{out}(v)$. To avoid any zeros in the denominator, we add a very small $\epsilon$ to each precision value before computing the harmonic mean. We finally report the final precision as the average of these harmonic means over the nodes in the test set.
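For a single node, the combined score can be sketched as follows. The sketch assumes that precision at $k$ is the fraction of the $k$ retrieved neighbors that are true neighbors; the ranked lists, the value of the small constant and the function name are illustrative assumptions, and the special case for nodes with zero indegree or outdegree is not covered.

```python
def directed_precision_at_k(retrieved_out, retrieved_in, true_out, true_in, k, eps=1e-12):
    """Harmonic mean of the outgoing and incoming precision at k for one node."""
    p_out = len(set(retrieved_out[:k]) & set(true_out)) / k + eps
    p_in = len(set(retrieved_in[:k]) & set(true_in)) / k + eps
    return 2.0 * p_out * p_in / (p_out + p_in)

# Hypothetical ranked neighbor lists for one node, most similar first.
retrieved_out, true_out = [1, 2, 3], {1, 2}
retrieved_in, true_in = [4, 5, 6], {4}
score = directed_precision_at_k(retrieved_out, retrieved_in, true_out, true_in, k=2)
no_hits = directed_precision_at_k([7, 8], [9, 10], true_out, true_in, k=2)
```

Here the outgoing precision is 1.0 and the incoming precision is 0.5, giving a harmonic mean of about 2/3; when neither list contains a true neighbor, the small constant keeps the harmonic mean well defined and the score stays close to zero.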
Results. We perform the graph reconstruction task on the Cora and Twitter datasets. In order to create the test set we randomly sample and of the nodes for Cora and Twitter respectively. We plot the final averaged precision corresponding to different values of in Figure 6. For the Cora dataset, NERD clearly outperforms all the other models including HOPE. In particular for , NERD shows an improvement of over HOPE which in some sense is fine tuned for this task. The trend between NERD and HOPE is reversed for Twitter dataset, where HOPE behaves like an almost exact algorithm. This can be attributed to the low rank of the associated Katz similarity matrix. Note that only out of more than nodes have nonzero outdegree which causes a tremendous drop in the rank of the associated Katz matrix. We recall that HOPE’s approximation guarantee relies on the low rank assumption of the associated similarity matrix which seems to be fulfilled quite well in this dataset. The performance of other models in our directed setting clearly shows their inadequacy to reconstruct neighborhoods in directed graphs.
3.5 Multi-label Classification
We run experiments for predicting labels in the Cora dataset. There are a total of labels. Each node has one or more labels. We report the Micro-F1 and Macro-F1 scores after a fold multi-label classification using one-vs-rest logistic regression. The main aim of this experiment is to show that NERD is generalizable across tasks. Unlike HOPE, NERD performs comparably to the best approach in this task.
We recall that for HOPE the similarity between nodes $u$ and $v$ is determined by the effective distance between them, which is computed using the Katz measure, penalizing longer distances by an attenuation factor $\beta$. The advantage of such a decaying distance measure is that it conserves the adjacency similarity of the graph, which is reflected in our experiments on Graph Reconstruction. NERD, on the other hand, also takes into account how likely $u$ is to influence $v$ by considering the likelihood of the traversal of various alternating paths between $u$ and $v$. In other words, NERD constructs the local neighborhood based on how likely this neighborhood is to influence the node, which helps the classifier to learn better labels on NERD-trained embeddings. For HOPE, LINE and APP, the two embedding vectors are concatenated. For NERD, concatenating or averaging the embedding vectors gives essentially the same results. Note that DeepWalk performs the best, but it is inefficient because of an extra factor in the running time of its hierarchical softmax optimization. We would also like to point out to the reader that the argument that Node2vec is equivalent to DeepWalk for certain parameters only holds for the SGNS implementation of DeepWalk. For fair comparisons, we used the actual hierarchical softmax implementation by the authors. Nevertheless, the extra factor in the run time of DeepWalk renders it practically inefficient for larger datasets.

Micro and Macro F1 scores (in %)  

method  MicroF1  MacroF1 
APP  67.65  53.85 
DeepWalk  73.90  61.72 
LINE1 + LINE2  65.62  53.68 
HOPE  26.42  1.25 
Node2vec  48.71  21.08 
VERSE  61.81  46.95 
NERD  68.48  53.74 
4 Related Work
Directed Graph Embeddings. Traditionally, undirected graphs have been the main use case for graph embedding methods. Manifold learning techniques [2], for instance, embed nodes of the graph while preserving the local affinity reflected by the edges. Chen et al. [6] explore the directed links of the graph using random walks, and propose an embedding that preserves the local affinity defined by directed edges. Perrault-Joncas et al. [20] and Mousazadeh et al. [16] learn the embedding vectors based on Laplacian-type operators and preserve the asymmetry property of edges in a vector field.
Neural Node Embeddings. Recent advances in language modeling and unsupervised feature learning for text have inspired adaptations [8, 19, 4, 24] for learning graph embeddings. Though the text-based learning methods [15, 18] inherently model neighborhood relations, a conceptual adaptation to graphs was required. DeepWalk [19], for instance, samples truncated random walks from the graph, thus treating walks as the equivalent of sentences, and then samples node-context pairs from a sliding window to train a SkipGram model [14, 15]. The main idea is to relate nodes which can reach other similar nodes via random walks. Node2vec [8], on the other hand, uses a biased random walk procedure to explore diverse neighborhoods and can be thought of as preserving structural similarity among nodes in the low-dimensional representation.
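The walk-and-window sampling described for DeepWalk can be sketched as follows. This is a simplified illustration under our own assumptions, not the reference implementation: the graph is a hypothetical adjacency dictionary of out-neighbors, walks follow edge direction and stop early at nodes without out-edges, and the function names are ours.

```python
import random

def truncated_walk(graph, start, walk_length, rng):
    """Uniform random walk along out-edges, truncated at walk_length nodes."""
    walk = [start]
    while len(walk) < walk_length:
        successors = graph.get(walk[-1], [])
        if not successors:  # dead end: the node has no out-edges
            break
        walk.append(rng.choice(successors))
    return walk

def window_pairs(walk, window_size):
    """All (node, context) pairs at most window_size positions apart."""
    pairs = []
    for i, node in enumerate(walk):
        lo = max(0, i - window_size)
        hi = min(len(walk), i + window_size + 1)
        pairs.extend((node, walk[j]) for j in range(lo, hi) if j != i)
    return pairs

# Hypothetical directed toy graph: node -> list of out-neighbors.
graph = {0: [1, 2], 1: [2], 2: [0, 3], 3: []}
rng = random.Random(0)
walks = [truncated_walk(graph, v, walk_length=5, rng=rng)
         for v in graph for _ in range(2)]
pairs = [p for w in walks for p in window_pairs(w, window_size=2)]
print(len(walks), len(pairs))
```

Note that the sliding window emits each pair in both orders, so the direction of the underlying edges is no longer visible in the training samples, which illustrates why a single-embedding SkipGram model trained on such pairs cannot distinguish source and target roles.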
LINE [24] proposes objectives preserving the local pairwise proximity among the nodes (first-order proximity) and preserving the second-order proximity among nodes which share many neighbors. VERSE [25] attempts to incorporate all the above ideas in one framework and proposes methods to preserve three similarity measures among nodes, namely Personalized PageRank, adjacency similarity and SimRank.
Deep Learning based Methods. Works such as [5, 26] investigate deep learning approaches for learning node representations. [5] first constructs a probabilistic co-occurrence matrix for node pairs using a random surfing model. Instead of using SVD to obtain low-dimensional projections of this matrix, a stacked denoising autoencoder is introduced to extract complex features and model nonlinearities. SDNE [26] uses a multilayer autoencoder model to capture nonlinear structures based on direct first- and second-order proximities. [1] employs a deep neural network to learn asymmetric edge representations from trainable node embeddings. Like most of the other methods, it learns a single representation for a node, hence ignoring the asymmetric node roles. The downsides of these deep learning approaches, as compared to other approaches, are the computationally expensive optimization and the elaborate parameter tuning, which result in very complex models. Moreover, none of the above methods can preserve the asymmetry property in the embedding vector space of nodes: by assigning a single representation to a node, they fail to preserve the role information of nodes and hence suffer from the limitations described earlier.
Asymmetry (for nodes) preserving approaches. To the best of our knowledge, there are only two recent works [17, 27] which learn two embedding spaces for nodes, one representing a node in its source role and the other in its target role. Note that [1] does not preserve asymmetry for the nodes, which is the main theme of this work (more comparisons and discussions on this method can be found in the supplementary material). HOPE [17] preserves the asymmetric role information of the nodes by approximating high-order proximity measures such as the Katz measure and Rooted PageRank. In essence, they propose to decompose the similarity matrices given by these measures and use the two factors of the decomposition as representations of the nodes. To avoid computing the similarity matrices and the high computational cost of SVD, they exploit a general formulation of these similarity measures and compute a low-rank factorization without actually creating the full similarity matrix. HOPE cannot be easily generalized as it is tied to a particular measure. Asymmetric proximity preserving (APP) [27] proposes a random walk based method to encode rooted PageRank proximity. It uses directed random walks with restarts to generate training pairs. Unlike other DeepWalk-type random walk based methods, APP does not discard the learnt context matrix; instead, it uses it as a second (target) representation of the node. In principle, all other random walk based methods can also be modified to use the learned context matrix as the second representation of the node. In essence, the neighborhood structure of a node explored by APP is similar to that of other random walk based methods.
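To make the idea of directed training pairs concrete, the sketch below samples (source, target) pairs by walking along out-edges and stopping with a fixed probability after each step, a standard Monte Carlo scheme for rooted PageRank style proximities. It is our own illustration and is not claimed to match the APP implementation: the stopping rule, the parameter stop_prob, the toy graph and the function name are all assumptions.

```python
import random

def sample_pair(graph, source, stop_prob, rng, max_steps=50):
    """Walk along out-edges from source; return (source, end_node) or None."""
    node = source
    for _ in range(max_steps):
        successors = graph.get(node, [])
        if not successors:  # no out-edges: the walk cannot continue
            break
        node = rng.choice(successors)
        if rng.random() < stop_prob:  # stop here with probability stop_prob
            break
    return (source, node) if node != source else None

# Hypothetical directed path graph 0 -> 1 -> 2.
graph = {0: [1], 1: [2], 2: []}
rng = random.Random(0)
pairs = []
for source in graph:
    for _ in range(20):
        pair = sample_pair(graph, source, stop_prob=0.5, rng=rng)
        if pair is not None:
            pairs.append(pair)
print(sorted(set(pairs)))
```

Because the walk only follows out-edges, a pair (u, v) is sampled only if v is reachable from u. Training the source embedding of u against the target embedding of v on such pairs is what lets a two-matrix model retain this asymmetry.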
5 Conclusion and Future Work
We presented a novel approach, NERD, for embedding directed graphs while preserving the role semantics of the nodes. We propose an alternating random walk strategy to sample node neighborhoods from a directed graph. The runtime and space complexities of NERD are both linear in the input size, which makes it suitable for large scale directed graphs.
In addition to demonstrating the advantages of using two embedding representations of nodes in a directed graph, we revisit the evaluation strategies used in previous works on directed networks. To this end, we chart out a clear evaluation strategy for link prediction and graph reconstruction tasks. We show that the embeddings from NERD are robust, generalizable and well-performing across multiple types of tasks and networks. As future work, we will investigate NERD on multilabel classification tasks on graphs where the labels have a stricter correlation with the direction of the edges. We would also like to work towards scalable training approaches for NERD in a distributed setting for truly massive networks.
Appendix A Supplementary Material
a.1 Parameter Settings.
For NERD, the minibatch size of stochastic gradient descent is set to walk sample, i.e. inputneighbor pairs. The learning rate is set with the starting value and where is the total number of walk samples. The number of walk samples is fixed to Million. Other parameters, i.e. number of neighborhood nodes to be sampled and number of negative samples , vary over datasets and across tasks and were fine tuned using a small amount of training data. For the Cora dataset, the reported results are according to the following parameter settings: for the link prediction task, for the multilabel classification and graph reconstruction tasks. For Twitter, we set for link prediction and for Graph Reconstruction tasks. For WebGoogle, we set for link prediction task. For fair comparisons, the embedding dimensions were set to for all approaches.
For HOPE, the attenuation factor was set to across all datasets and tasks. The range of possible values for is large, and the only rough guiding principle is that it should be less than the reciprocal of the spectral radius of the corresponding adjacency matrix (so that the Katz series converges), which is clearly insufficient. In the original paper, the best results for the Cora dataset were reported at . We investigated several values close to , in particular using a small amount of training data and found to be the best performing.
For Node2vec, we run experiments with walk length , number of walks per node , and window size , as described in the paper. The results are reported for the best in-out and return hyperparameters selected from the range . In particular, the reported results correspond to the following in-out and return parameters: for link prediction, for multilabel classification and for graph reconstruction tasks in the Cora dataset; for link prediction and graph reconstruction tasks in Twitter.
For VERSE, the best results were obtained at for all tasks across all datasets.
For DeepWalk, the parameters described in the paper are used for the experiments, i.e. walk length , number of walks per node , and window size .
For LINE, we run experiments with total billion samples and negative samples, as described in the paper. For the multilabel classification task, the two halves (the embeddings from LINE 1 and LINE 2) are normalized and concatenated.
For APP we used the restart probability as and adjusted the total number of sampled paths to match that of NERD, i.e., Million as no exact numbers were provided in the original paper.
a.2 Parameter Sensitivity
We next investigate the performance of NERD with respect to the embedding dimensions and its converging performance with respect to the number of walk samples on the multilabel classification task in the Cora dataset. It is interesting to note that the performance of NERD is mostly stable for dimensions greater than with a slight fall at . The performance of NERD converges quite fast with the number of walk samples.
a.3 HITS and NERD
The classical HITS algorithm is based on the idea that in all types of directed networks, there are two types of important nodes: hubs and authorities. Good hubs are those which point to many good authorities and good authorities are those pointed to by many good hubs. Technically, given a directed graph $G$ with adjacency matrix $A$, HITS is an iterative power method to compute the dominant eigenvector for $A^{T}A$ and for $AA^{T}$. The authority scores are determined by the entries of the dominant eigenvector of the matrix $A^{T}A$, which is called the authority matrix, and the hub scores are determined by the entries of the dominant eigenvector of $AA^{T}$, called the hub matrix.
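The mutual reinforcement of hubs and authorities can be sketched as a short power iteration. This is a minimal illustration of classical HITS on a hypothetical edge list, not part of NERD; the function names and the fixed iteration count are our own choices, and convergence is assumed rather than tested for.

```python
import math

def normalize(x):
    """Scale a vector to unit 2-norm (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(v * v for v in x))
    return [v / norm for v in x] if norm > 0 else x

def hits(edges, n, iterations=50):
    """Return (hub_scores, authority_scores) for a directed edge list."""
    hubs = [1.0] * n
    auth = [1.0] * n
    for _ in range(iterations):
        # Authority update: sum the hub scores of in-neighbors (auth = A^T hubs).
        auth = [0.0] * n
        for u, v in edges:
            auth[v] += hubs[u]
        # Hub update: sum the authority scores of out-neighbors (hubs = A auth).
        new_hubs = [0.0] * n
        for u, v in edges:
            new_hubs[u] += auth[v]
        hubs, auth = normalize(new_hubs), normalize(auth)
    return hubs, auth

# Hypothetical toy graph: nodes 0, 1 and 4 point to nodes 2 and 3.
edges = [(0, 2), (1, 2), (0, 3), (1, 3), (4, 3)]
hubs, auth = hits(edges, n=5)
print([round(h, 3) for h in hubs])
print([round(a, 3) for a in auth])
```

Composing the two updates shows that the hub vector is repeatedly multiplied by the hub matrix and the authority vector by the authority matrix, which is the power-method view used in the text.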
Consider the undirected bipartite graph with adjacency matrix constructed from in Section 2. Applying HITS to the adjacency matrix from (1), , so and introducing the vector for we obtain
(8) 
followed by normalization of the two vector components of so that each has a norm equal to . Now, if is an matrix, is and vector is in . The first entries of (after convergence under suitable conditions as ) correspond to the hub rankings of the nodes, while the last entries give the authority rankings. We point the interested reader to [3] and the references therein for a detailed discussion on HITS related matrix functions. Now (8) can be written as
(9) 
where (where there are matrices being multiplied) counts the number of alternating walks of length , starting with an out-edge, from node to node , and (where there are matrices being multiplied) counts the number of alternating walks of length , starting with an in-edge, from node to node . A good hub or a good authority will show up many times in these two sets of walks. We base our NERD model on a similar intuition: we aim to embed nodes co-occurring in such alternating walks closer in their respective source and target vector spaces. Moreover, each iterative step of HITS requires hub/authority scores to be updated based on the authority/hub scores of neighboring nodes. NERD attempts to extend this by exploring slightly bigger neighborhoods, embedding source-target pairs closer if they co-occur in walks of some small size .
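The counting interpretation above can be verified numerically on a small graph. The sketch below assumes, following the construction in [3], that the bipartite adjacency matrix has the block form with $A$ in the upper-right block, $A^{T}$ in the lower-left block and zeros elsewhere; the symbol r for the walk length, the toy graph and all function names are our own. It checks that the square of the bipartite matrix is block diagonal with blocks $AA^{T}$ and $A^{T}A$, and that an alternating product of r factors counts alternating walks of length r.

```python
from itertools import product

def transpose(X):
    return [list(row) for row in zip(*X)]

def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def bipartite(A):
    """Assumed block form [[0, A], [A^T, 0]] of the bipartite adjacency matrix."""
    n, At = len(A), transpose(A)
    return ([[0] * n + A[i] for i in range(n)] +
            [At[i] + [0] * n for i in range(n)])

def alternating_product(A, r, start_with_out=True):
    """A A^T A ... (r factors), or A^T A A^T ... if start_with_out is False."""
    At = transpose(A)
    factors = [A if (k % 2 == 0) == start_with_out else At for k in range(r)]
    P = factors[0]
    for F in factors[1:]:
        P = matmul(P, F)
    return P

def count_alternating_walks(A, i, j, r, start_with_out=True):
    """Brute-force count of alternating walks of length r from i to j."""
    count = 0
    for middle in product(range(len(A)), repeat=r - 1):
        seq = (i,) + middle + (j,)
        ok = True
        for k in range(r):
            a, b = seq[k], seq[k + 1]
            forward = (k % 2 == 0) == start_with_out
            # Forward step follows an edge a -> b, backward step an edge b -> a.
            if not (A[a][b] if forward else A[b][a]):
                ok = False
                break
        count += ok
    return count

# Hypothetical toy graph with edges 0->1, 0->2, 1->2, 3->2, 3->0.
n = 4
A = [[0] * n for _ in range(n)]
for u, v in [(0, 1), (0, 2), (1, 2), (3, 2), (3, 0)]:
    A[u][v] = 1

B2 = matmul(bipartite(A), bipartite(A))
hub_matrix = matmul(A, transpose(A))        # A A^T
authority_matrix = matmul(transpose(A), A)  # A^T A
print(alternating_product(A, 3)[0][2], count_alternating_walks(A, 0, 2, 3))
```

The brute-force enumeration is exponential in r and is meant only as a check on small graphs; the matrix products give the same counts directly.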
a.4 Learning Asymmetrical Edge Representations
The approach presented in [1] uses a deep neural network (DNN) to obtain edge representations from trainable node embeddings as inputs. This method also uses a simple embedding space for representing nodes. Specifically, the DNN learns to output representations that maximize the Graph Likelihood, which is defined as the overall probability of correctly estimating the presence or absence of edges in the original graph, using (trainable) node embeddings as inputs.
Training Data, % Edges  

method  20%  50%  70%  90% 
EdgeDNN  0.719  0.722  0.716  0.753 
We run the authors' implementation of the approach on the Cora dataset using multiple train-test splits created by their method ( to provide them the advantage). The AUC scores resulting from the link prediction evaluation are presented in Table 5. Training on the Twitter dataset did not finish after running for day. The results show that the method performs worse than NERD. Moreover, it uses a much more complex architecture than NERD, which is difficult to fine-tune for a variety of tasks. For example, no advantage over undirected methods can be achieved in the node classification task, as this method encodes the asymmetry information of edges and not of nodes.
References
[1] S. Abu-El-Haija, B. Perozzi, and R. Al-Rfou. Learning edge representations via low-rank asymmetric projections. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, CIKM ’17, pages 1787–1796, 2017.
[2] M. Belkin and P. Niyogi. Laplacian eigenmaps and spectral techniques for embedding and clustering. In Advances in Neural Information Processing Systems 14, pages 585–591, 2002.
[3] M. Benzi, E. Estrada, and C. Klymko. Ranking hubs and authorities using matrix functions. Linear Algebra and its Applications, 438(5):2447–2474, 2013.
[4] S. Cao, W. Lu, and Q. Xu. GraRep: Learning graph representations with global structural information. In Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, CIKM ’15, pages 891–900, 2015.
[5] S. Cao, W. Lu, and Q. Xu. Deep neural networks for learning graph representations. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, AAAI’16, pages 1145–1152, 2016.
[6] M. Chen, Q. Yang, and X. Tang. Directed graph embedding. In IJCAI, pages 2707–2712, 2007.
[7] M. De Choudhury, Y.-R. Lin, H. Sundaram, K. S. Candan, L. Xie, and A. Kelliher. How does the data sampling strategy impact the discovery of information diffusion in social media? In ICWSM, pages 34–41, 2010.
[8] A. Grover and J. Leskovec. Node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’16, pages 855–864, 2016.
[9] L. Katz. A new status index derived from sociometric analysis. Psychometrika, 18(1):39–43, 1953.
[10] J. M. Kleinberg. Authoritative sources in a hyperlinked environment. J. ACM, 46(5):604–632, 1999.
[11] J. Kunegis. Konect datasets: Koblenz network collection. http://konect.unikoblenz.de, 2015.
[12] J. Leskovec, K. J. Lang, A. Dasgupta, and M. W. Mahoney. Statistical properties of community structure in large social and information networks. In Proc. Int. World Wide Web Conf., pages 695–704, 2008.
[13] O. Levy and Y. Goldberg. Neural word embedding as implicit matrix factorization. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS’14, pages 2177–2185, 2014.
[14] T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. CoRR, abs/1301.3781, 2013.
[15] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In Proceedings of the 27th Annual Conference on Neural Information Processing Systems 2013, pages 3111–3119, 2013.
[16] S. Mousazadeh and I. Cohen. Embedding and function extension on directed graph. Signal Process., 111(C):137–149, 2015.
[17] M. Ou, P. Cui, J. Pei, Z. Zhang, and W. Zhu. Asymmetric transitivity preserving graph embedding. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1105–1114, 2016.
[18] J. Pennington, R. Socher, and C. Manning. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, 2014.
[19] B. Perozzi, R. Al-Rfou, and S. Skiena. DeepWalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’14, pages 701–710, 2014.
[20] D. C. Perrault-Joncas and M. Meila. Directed graph embedding: an algorithm based on continuous limits of Laplacian-type operators. In Advances in Neural Information Processing Systems, pages 990–998, 2011.
[21] J. Qiu, Y. Dong, H. Ma, J. Li, K. Wang, and J. Tang. Network embedding as matrix factorization: Unifying DeepWalk, LINE, PTE, and node2vec. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, WSDM ’18, pages 459–467, 2018.
[22] B. Recht, C. Re, S. Wright, and F. Niu. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Advances in Neural Information Processing Systems 24, pages 693–701, 2011.
[23] L. Šubelj and M. Bajec. Model of complex networks based on citation dynamics. In Proceedings of the WWW Workshop on Large Scale Network Analysis, pages 527–530, 2013.
[24] J. Tang, M. Qu, M. Wang, M. Zhang, J. Yan, and Q. Mei. LINE: Large-scale information network embedding. In Proceedings of the 24th International Conference on World Wide Web, WWW ’15, pages 1067–1077, 2015.
[25] A. Tsitsulin, D. Mottin, P. Karras, and E. Müller. VERSE: Versatile graph embeddings from similarity measures. In Proceedings of the 2018 World Wide Web Conference on World Wide Web, pages 539–548, 2018.
[26] D. Wang, P. Cui, and W. Zhu. Structural deep network embedding. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’16, pages 1225–1234, 2016.
[27] C. Zhou, Y. Liu, X. Liu, Z. Liu, and J. Gao. Scalable graph embedding for asymmetric proximity. In AAAI Conference on Artificial Intelligence (AAAI’17), 2017.