The Hidden Flow Structure and Metric Space of Network Embedding Algorithms Based on Random Walks

Weiwei Gu School of Systems Science, Beijing Normal University, Beijing 100875, P.R.China Li Gong School of Systems Science, Beijing Normal University, Beijing 100875, P.R.China Xiaodan Lou School of Systems Science, Beijing Normal University, Beijing 100875, P.R.China Jiang Zhang
Abstract

Network embedding, which encodes all vertices in a network as a set of numerical vectors in accordance with its local and global structures, has drawn widespread attention. Network embedding not only learns significant features of a network, such as clustering and link prediction, but also learns the latent vector representation of the nodes, which provides theoretical support for a variety of applications, such as visualization, node classification, and recommendation. As the latest progress in this research, several algorithms based on random walks have been devised. Although their high learning efficiency and accuracy have drawn much attention, there is still a lack of theoretical explanation, and the transparency of the algorithms has been doubted. Here, we propose an approach based on the open-flow network model to reveal the underlying flow structure and its hidden metric space of different random walk strategies on networks. We show that the essence of embedding based on random walks is the latent metric structure defined on the open-flow network. This not only deepens our understanding of random-walk-based embedding algorithms but also helps in finding new potential applications in embedding.

Introduction

Complex networks, as high-level abstractions of complex systems, have been widely applied in different areas, such as biology, sociology, economics and technology[RevModPhys.74, Barab2004Network, Deville2016Scaling, Marc2010Spatial, Wang2016The, Lv2009Effective]. Recent progress has revealed a hidden geometric structure in networks[Brockmann2013The, Garc2016The] that not only deepens our understanding of the multiscale nature and intrinsic heterogeneity of networks but also provides a useful tool to unravel the regularity of some dynamic processes on networks [Brockmann2013The, Kleinberg2000Navigation, Higham2008Fitting, Kleinberg2007Geographic, Shi2015A, Lou2016The, Serrano2012Uncovering]. At the same time, researchers in the machine learning community have developed several techniques to embed a whole network in a high-dimensional space[Perozzi2014DeepWalk, Grover2016node2vec, Tang2015LINE, Cao2015GraRep, Arora2015RAND, Wang2016Structural] such that the vector of each node can be used as abstract features fed to neural networks to perform tasks. It has been demonstrated that such a form of network embedding has wide applications, such as community detection, node classification and link prediction[Grover2016node2vec, Leskovec2010Empirical]. Various methods have been proposed in the network embedding field, such as Principal Component Analysis, Multi-Dimensional Scaling, IsoMap and their extensions[Belkin2002Laplacian, Roweis2000Nonlinear, Tenenbaum2001A, Yan2005Graph, Shavitt2002Big]. Those embedding methods give good performance when the network is small, but most of them cannot be effectively applied to networks containing millions of nodes and billions of edges.

Recently, there has been a surge of works proposing alternative ways to embed networks by training neural networks[Perozzi2014DeepWalk, Grover2016node2vec] in various approaches inspired by natural language processing techniques[Pennington2014Glove, Sridhar2015Unsupervised, Mikolov2013Efficient, Mikolov2013Distributed]. To build a connection between a language and a network, a random walk needs to be implemented on the network such that the node sequences generated by random walks are treated as sentences in which nodes resemble words. After the sequences are generated, skip-gram in word2vec[Mikolov2013Distributed], which is one of the most famous algorithms for word embedding developed in the deep learning community, can be efficiently applied to the sequences. Among these random-walk-based approaches, deepwalk[Perozzi2014DeepWalk] and node2vec[Grover2016node2vec] have drawn wide attention for their high training speed and high classification accuracy. Both algorithms regard random walks as a paradigmatic dynamic process on a network that can reveal both the local and global network structures. Several extended works unravel the connection between the fundamental co-occurrence matrix of contexts and words in skip-gram-based embedding algorithms and the multiple-step transition matrix. Levy et al. [Levy2014Neura] prove that skip-gram models implicitly factorize a word-context matrix. Tang et al. [Tang2015LINE] take 1-step and 2-step local relational co-occurrence into consideration, and Cao et al. [Cao2015GraRep] argue that skip-gram is an equally weighted linear combination of k-step relational information. Those works were proposed soon after word2vec and deepwalk were presented.

Although word2vec and network embedding have been successfully applied to some real problems, several drawbacks still exist. First, explicit and fundamental explanations are needed for why neural-based algorithms work so well, since these algorithms are fundamentally black boxes. Second, how the values of the hyper-parameters should be set is still poorly understood. Third, explicit and intuitive explanations of the embedding vector of each node and of the inner structure of the embedding space are needed. We should find an explanation that provides a general framework to unify deepwalk, node2vec and other new algorithms based on random walks.

In this paper, we put forward a novel framework based on an open-flow network to deepen the understanding of network embedding algorithms based on random walks. We first use the so-called open-flow network model to characterize the different random walk strategies on a single background network. Then, we note that there is a natural metric called the flow distance that is defined on these flow networks. Finally, the hidden metric space framed by the flow distances can be derived and, interestingly, this metric space is similar to the embedding space from the deepwalk and node2vec algorithms. We find that the embedding algorithms based on neural networks are only attempting to realize the hidden metric based on flow networks, and the correlation between the flow distance and the node2vec distance is up to 0.91. With this understanding, we propose a new method called Flow-based Geometric Embedding (FGE), which has no free parameters and performs excellently in some applications, such as clustering and node centrality ranking.

Methods

Both deepwalk and node2vec aim to learn continuous feature representations of nodes by sampling truncated random walk sequences from the graph and feeding them, as mimic sentences, to the skip-gram algorithm in word2vec. The difference lies in the random walk strategy: deepwalk implements a common unbiased random walk on a graph, such that all the edges are visited in accordance with the relative weights on the local node, while node2vec employs a biased random walk in which the probability of visiting is adjusted by two parameters, p and q. Node2vec can uncover much richer structures of a network, and it reduces to deepwalk when p = 1 and q = 1. Thus, we discuss only node2vec in the rest of this paper. Please refer to Algorithm 3 for more concrete details about node2vec.
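
As an illustration of the biased walk, the following is a minimal sketch that assumes an unweighted graph stored as a dictionary of neighbor sets. The function name `biased_walk` and its arguments are hypothetical; the weights 1/p, 1 and 1/q follow the usual description of the node2vec search bias, and the sketch is not the authors' implementation.

```python
import random


def biased_walk(adj, start, length, p=1.0, q=1.0, rng=random):
    """Sample one second-order walk of at most `length` nodes.

    adj maps each node to the set of its neighbors (unweighted graph, an
    assumption of this sketch). With p = q = 1 every neighbor is equally
    likely, which is the unbiased deepwalk-style walk.
    """
    walk = [start]
    while len(walk) < length:
        cur = walk[-1]
        nbrs = sorted(adj[cur])
        if not nbrs:
            break  # dead end: truncate the walk early
        if len(walk) == 1:
            # The first step has no previous node, so it is unbiased.
            walk.append(rng.choice(nbrs))
            continue
        prev = walk[-2]
        weights = []
        for nxt in nbrs:
            if nxt == prev:
                weights.append(1.0 / p)  # step back to the previous node
            elif nxt in adj[prev]:
                weights.append(1.0)      # stay at distance 1 from prev
            else:
                weights.append(1.0 / q)  # move away from prev
        walk.append(rng.choices(nbrs, weights=weights, k=1)[0])
    return walk
```

A small p makes the walker return more often and stay local, whereas a small q pushes it outward.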

Constructing Open-flow Networks

To reveal the flow structure behind a random walk strategy (for a given p and q), we constructed an open-flow network model[Guo2015Flow] in accordance with the random walk strategy. An open-flow network is a special directed weighted network in which the nodes are identical to those of the original network, and the weighted edges represent the actual fluxes realized by a large number of random walks. There are two special nodes, the source and the sink, which represent the environment; this is why the network is called an open network. When a random walker is generated at a given node, a unit of flux is injected into the flow network from the source to the given node, and this particle contributes one unit of flux to every edge it visits. When the random walk is truncated, a unit of flux is added from the last node to the sink. A large number of random walkers jumping on the network according to the specific strategy form a flow structure that can be characterized by the open-flow network model, in which the weight on an edge is the number of particles that traversed it. Figure 1 illustrates how the different open-flow networks are constructed for a single background binary network, with deepwalk in the upper panel and node2vec in the lower panel.
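
The construction described above can be sketched as follows. The function name `build_flow_matrix` is hypothetical; the index layout (source at index 0, sink at the last index) follows the convention used in the definition of the flow distance.

```python
def build_flow_matrix(walks, nodes):
    """Accumulate the flux matrix F of an open-flow network from walks.

    Index 0 is the source, indices 1..N are the nodes in the order given
    by `nodes`, and index N + 1 is the sink.
    """
    index = {v: k + 1 for k, v in enumerate(nodes)}
    size = len(nodes) + 2
    sink = size - 1
    F = [[0] * size for _ in range(size)]
    for walk in walks:
        if not walk:
            continue
        F[0][index[walk[0]]] += 1          # source -> starting node
        for a, b in zip(walk, walk[1:]):
            F[index[a]][index[b]] += 1     # one unit per visited edge
        F[index[walk[-1]]][sink] += 1      # last node -> sink
    return F
```

By construction the inflow of every ordinary node equals its outflow, which is a quick sanity check on the resulting matrix.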

Calculating Flow Distance

For a given flow network F with entries $f_{ij}$, where the value at the $i$-th row and the $j$-th column represents the flux from $i$ to $j$, node 0 represents the source, and the sink is represented as the last node, the flow distance $c_{ij}$ between any pair of nodes $i$ and $j$ is defined as the average number of steps needed for a random walker to jump from $i$ to $j$ and finally return back to $i$ along the network. It can be expressed as:

(1)

where $m_{ij}$ is the transition probability from $i$ to $j$, which is defined as $m_{ij} = f_{ij} / \sum_{k} f_{ik}$, where $f_{ij}$ is the total flow from node $i$ to node $j$. The pseudo probability matrix $U$ is defined as[Guo2015Flow]:

$$U = I + M + M^{2} + \cdots = (I - M)^{-1} \qquad (2)$$

where $I$ is the identity matrix of the same size as $M$, and $u_{ij}$ is the pseudo probability that a random walker jumps from $i$ to $j$ along all possible paths. Figure 2 is a sample flow network constructed under condition C1 in Figure 1. Algorithm 1 shows the concrete details of how to calculate the flow distance based on the F matrix.
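
The quantities above can be computed with a few matrix operations. The sketch below is written in plain Python under stated assumptions and is not necessarily identical to Algorithm 1. It uses the mean first-passage step count $l_{ij} = (MU^{2})_{ij}/u_{ij} - (MU^{2})_{jj}/u_{jj}$, which follows from $MU^{2} = \sum_{k} k M^{k}$, and it symmetrizes by adding the two directions, which is one reading of "jump from $i$ to $j$ and finally return back to $i$". The function names are hypothetical.

```python
import math


def mat_mul(A, B):
    """Product of two square matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


def mat_inv(A):
    """Gauss-Jordan inverse with partial pivoting."""
    n = len(A)
    aug = [list(map(float, row)) + [float(i == j) for j in range(n)]
           for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("I - M is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pv = aug[col][col]
        aug[col] = [x / pv for x in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0.0:
                factor = aug[r][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def flow_distance(F):
    """Return (L, C): first-passage distances and their symmetrized sum."""
    n = len(F)
    # Transition matrix: m_ij = f_ij / sum_k f_ik (the sink row stays zero).
    M = []
    for row in F:
        total = sum(row)
        M.append([x / total if total else 0.0 for x in row])
    # Pseudo probability matrix U = (I - M)^{-1}.
    U = mat_inv([[(1.0 if i == j else 0.0) - M[i][j] for j in range(n)]
                 for i in range(n)])
    G = mat_mul(M, mat_mul(U, U))  # equals sum_k k * M^k
    L = [[math.inf] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if U[i][j] > 1e-12:  # j is reachable from i
                L[i][j] = G[i][j] / U[i][j] - G[j][j] / U[j][j]
    # Assumed symmetrization: there-and-back distance.
    C = [[L[i][j] + L[j][i] for j in range(n)] for i in range(n)]
    return L, C
```

Unreachable pairs are reported as infinity, in line with the convention of Figure 2.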

Embedding Networks

To display the hidden information in an open-flow network and visualize the node relationships, we embed the flow distance matrix C into a high-dimensional Euclidean space. We use the SMACOF algorithm[Williams2002On, Borg2009Modern, Deville2016Scaling] to perform the embedding. This algorithm takes the distance matrix and the number of dimensions as input and tries to place each node in a space of that dimension such that the between-node distances are preserved as well as possible. Through network embedding, we find a proper vector representation of the nodes in the network. Please refer to Algorithm 2 for more concrete details about this embedding method.
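
For illustration, a bare-bones SMACOF iteration with unit weights can be written as below. This is a sketch of the standard Guttman-transform update, not the authors' code; the function names, the random initialization and the stopping rule are assumptions, and the input distances are assumed to be finite and symmetric.

```python
import math
import random


def pairwise(X):
    """Euclidean distance matrix of the rows of X."""
    n = len(X)
    return [[math.dist(X[i], X[j]) for j in range(n)] for i in range(n)]


def stress(D, delta):
    """Raw stress: squared mismatch over all unordered pairs."""
    n = len(D)
    return sum((D[i][j] - delta[i][j]) ** 2
               for i in range(n) for j in range(i + 1, n))


def smacof(delta, dim=2, max_iter=300, eps=1e-9, seed=0):
    """Embed the distance matrix `delta` into `dim` dimensions.

    Returns the coordinates and the stress value after every step.
    """
    rng = random.Random(seed)
    n = len(delta)
    X = [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(n)]
    D = pairwise(X)
    history = [stress(D, delta)]
    for _ in range(max_iter):
        # Build B(X): b_ij = -delta_ij / d_ij off the diagonal (0 if
        # d_ij = 0), and b_ii = -sum of the other entries in row i.
        B = [[0.0] * n for _ in range(n)]
        for i in range(n):
            for j in range(n):
                if i != j and D[i][j] > 0.0:
                    B[i][j] = -delta[i][j] / D[i][j]
            B[i][i] = -sum(B[i])
        # Guttman transform with unit weights: X <- (1/n) B(X) X.
        X = [[sum(B[i][k] * X[k][c] for k in range(n)) / n
              for c in range(dim)] for i in range(n)]
        D = pairwise(X)
        history.append(stress(D, delta))
        if history[-2] - history[-1] < eps:
            break
    return X, history
```

Each Guttman step cannot increase the stress, so the returned history is non-increasing. Infinite flow distances would need to be replaced or removed before calling such a routine.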

Based on Algorithms 1 and 2, we proposed a new network embedding algorithm named Flow-based Geometric Embedding (FGE). We then discovered the hidden metric space of random-walk-based network embedding algorithms, such as node2vec, word2vec, GraRep and so on. The nodes’ training vectors obtained from the node2vec algorithm are highly correlated with the Euclidean distance embedding vectors derived from the flow network. The strong correlation is shown in the “Results” section.

Results

In this section, we present our results on several empirical networks. First, we applied the FGE algorithm to the Karate network and plotted the open-flow network models behind different random walk strategies (with different p and q). Next, we compared the FGE algorithm with node2vec by embedding networks into two-dimensional planes. After that, we correlated the two distances for any given node pair, namely the flow distance and the Euclidean distance obtained from the node2vec algorithm, to show that the node2vec embedding algorithm is attempting to realize the metric of the flow distances. Then, we compared FGE and node2vec on clustering and centrality measuring tasks. Finally, we studied how the parameters of embedding algorithms based on random walks affected the flow structure and the correlations between the two distances. An overview of the networks we consider in our experiments is given in Table 1.

Flow Structure and Representation

This section describes experiments on the Karate Graph. Figure 3 shows the different flow structures of node2vec with different p and q, where the thickness of a line indicates the amount of flow between nodes. To capture the hidden metric on the flow structures, we fed random walk sequences into node2vec and FGE with fixed settings for the number of walks per node, the walk length and the embedding size. After this training process, each node acquired two vector representations, one from FGE and one from node2vec. We then visualized the vector representations using t-SNE[Laurens2008Visualizing], which provided both qualitative and quantitative results for the learned vector representations. Figure 3(b) shows the flow structure generated by the unbiased random walk strategy (p = 1, q = 1). Figure 4 presents the visualization of the FGE and node2vec vectors for a given p and q. Intuitively, we observed that the nodes embedded by these two methods almost overlapped each other. This indicates that the flow distances of the embedding captured the essence of node2vec. Additionally, the latent relationship between nodes was well expressed. For example, we found that nodes 4, 5, 10 and 16 were all close to each other and belonged to the same community in both algorithms. By analysing the network structure, we also discovered that nodes 14, 15, 20 and 22 were much closer to each other in the node2vec embedding than in the FGE embedding. That is because node2vec considers only n-step connections between nodes, whereas the relationship changes when connections of infinitely many steps are considered. This change can be captured by the FGE algorithm since it considers all pathways.

Correlations between Distances

To confirm our conclusion that the skip-gram algorithm only tries to realize the hidden metric of the flow distance defined by random walks, we compared, for any given node pair on the same background network and the same node sequences, the flow distance of the flow network generated by the random walks in the FGE algorithm with the Euclidean distance in the embedding space given by the node2vec algorithm. The results showed strong correlations between the two distances. Figure 5 is a heat map, where the X-axis represents the flow distance between nodes i and j, and the Y-axis is the node2vec distance between the same nodes. The Pearson correlation between the two distances was 0.90 with a p-value = 0.001 in Figure 5(a) and 0.83 with a p-value = 0 in Figure 5(b). The correlation indicates a highly linear relationship between the paired data.
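
The comparison can be sketched in a few lines, assuming a flow distance matrix C and a list of embedding vectors indexed in the same node order. The helper names are hypothetical, and skipping pairs with infinite flow distance is an assumption of this sketch.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0.0 or sy == 0.0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sx * sy)


def distance_correlation(C, vectors):
    """Correlate flow distances with Euclidean embedding distances."""
    flow, eucl = [], []
    n = len(C)
    for i in range(n):
        for j in range(i + 1, n):
            if math.isfinite(C[i][j]):  # skip disconnected pairs
                flow.append(C[i][j])
                eucl.append(math.dist(vectors[i], vectors[j]))
    return pearson(flow, eucl)
```

Only unordered pairs are used, since both distance matrices are symmetric.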

To show the generality of our results, we performed the same experiments over different datasets, and the accuracy of the experiments was enhanced by averaging the correlation value for each dataset. The results in Table 2 show that there is a strong connection between the flow distance and node2vec’s distance. We also found that the correlation is not sensitive to the different walking strategies. Different walking strategies generate different neighbor nodes and therefore different metric distances, but all those walking strategies can be captured by the flow distances. Hence the flow distance can reveal the latent space in random-walk-based network embedding algorithms such as node2vec, deepwalk and so on, and the FGE algorithm can reveal the latent relationship between nodes in graph embedding.

Node Clustering

To further show the similarity between our method and node2vec, we compared the two approaches in performing node clustering. In complex network studies, node clustering amounts to community structure detection, which is of importance in various backgrounds[Newman2003A, Freeman1980Centrality, Rka1999Hawoong]. We performed the k-means clustering method on the node vectors obtained from FGE and node2vec. The number of clusters can be determined using the average silhouette coefficient as the criterion. According to the silhouette value, we aggregated the graph into 4 clusters, each of which is regarded as a community. Here, our method was applied to the karate club graph, as shown in Figure 4. In this graph, different vertex colors represent different communities of the input graph. The clusters obtained using the two methods overlapped completely (100%). We also performed clustering experiments on other datasets, such as China Click Websites and Airline Network, and the clustering results were identical.

Centrality Measurement

We showed that understanding network embedding from the angle of the flow structure provides not only new insights but also new applications, such as centrality measurement. The centrality measure of nodes is a key issue in network analysis [Freeman1978Centrality, Bonacich1987Power, Borgatti2005Centrality], and a variety of centrality measures have been developed. Here, we showed that the average distance from the focus node to all other nodes can be treated as a new type of node centrality measure. Formally, we defined a new metric to measure the centrality of nodes based on FGE as:

(3)

The reason for the usefulness of this definition is that the nodes close to other nodes always have tight connections and high traffic. Because the flow distances are highly correlated with the Euclidean distances in node2vec embedding, this definition also works for the node2vec algorithm. That is, we can measure each node’s centrality through its distances to all other nodes in the embedding Euclidean space. Furthermore, we can read the centrality information directly from the embedding graph because the nodes with high centrality (small average distances) are always concentrated on the central area of the embedding graph.
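
A minimal sketch of this ranking is given below. It assumes the centrality score of a node is the plain mean of its distances to the other nodes; the exact normalization of equation (3) is an assumption here, and the function name is hypothetical.

```python
def average_distance_centrality(C, names):
    """Rank nodes by their mean distance to all other nodes.

    C is a symmetric distance matrix (flow distances, or Euclidean
    distances in an embedding) and names[i] labels row i. The list is
    sorted so that the most central node (smallest mean) comes first.
    """
    n = len(C)
    scores = {}
    for i, name in enumerate(names):
        others = [C[i][j] for j in range(n) if j != i]
        scores[name] = sum(others) / len(others)
    return sorted(scores.items(), key=lambda kv: kv[1])
```

The same function applies to node2vec vectors once their pairwise Euclidean distances have been computed.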

We tested the node centrality on the dataset of China Click Websites, which contained approximately 5 years of browsing data from more than 30000 online volunteers. We calculated each website’s centrality based on its flow distance matrix and node2vec distance. We found that popular websites always had a small distance because they usually had more travelling paths to other websites. Therefore, the smaller the average flow distance, the more central the position of the website. We ranked the websites in accordance with their centrality and then compared those two methods with other methods such as PageRank and total traffic (the number of clicks for each website). The ranking results for the top 10 websites are listed in Table 3. We found that the ranking orders of the flow distance and node2vec were nearly the same. We also discovered that high-traffic websites, such as Tmall (a popular shopping website) and 163.com, have lower ranks, but baidu.com and qq.com have high ranking orders even though their total traffic was not heavy. That is because baidu.com and qq.com are bridges between the real and virtual worlds.

Parameter Sensitivity

Random-walk-based embedding algorithms involve a number of sensitive parameters. To evaluate how the parameters affect the correlation between the two distances, we conducted several experiments on the Karate club network. We examined how the embedding size, the number of walks started per node, the window size, and the walk length influenced the correlation between the two distances. As shown in Figure 6, the correlation grew as the number of walks increased and tended to saturate when the number reached 512. This indicates that the node2vec embedding algorithm merely tries to realize the hidden metric of the flow structure of the random walk, and that the performance increases as more samples are drawn. The neural network of the skip-gram algorithm behind node2vec is over-fitted when the number of walks is small, because a higher embedding size leads to more parameters in the neural network that need to be trained; accordingly, the correlation decreased with the embedding size (Figure 6). However, there was a slight decreasing trend in the correlation coefficient when the number of walks was larger than 512. We speculate that this decrease is due to errors in substituting the open-flow network for the large sample of random walks. The FGE algorithm assumes that the random walks can be represented as a Markovian process on the network, which means that each jump is exclusively determined by the previous-step position. However, the random walk of node2vec does not satisfy this condition. Even though this difference exists, as seen in Figure 6, we believe that the hidden metric of flows is more essential to reflect the structural properties of the network. We also evaluated how changes to the window size and walk length affected the correlation. We fixed the embedding size and the number of walks to sensible values and varied the window size and walk length for each node. The performance differences were not large as the window size changed. Beyond a certain walk length, the correlation declined rapidly with further increases in the walk length.

Conclusions and Discussions

In this paper, we reveal the hidden flow structure and metric space of random-walk-based network embedding algorithms by introducing the FGE algorithm. This algorithm takes the flow from node to node as an input. After calculating the flow distance, it learns node representations that encode both structural and local regularities. The high Pearson correlation between the node2vec representations and the FGE vectors indicates that there is a hidden metric behind random-walk-based network embedding algorithms. The FGE algorithm not only helps in finding the hidden metric space but also works as a novel approach to learn the latent relations between vertices. Experiments on a variety of different networks illustrate the effectiveness of this method in revealing the hidden metric space of random-walk-based network embedding algorithms. This finding is of great importance because it not only provides a novel perspective for understanding the essence of network embedding based on random walks but also reveals that skip-gram (the main algorithm in node2vec) is trying to find the proper node representation to match this metric between nodes. With this finding, we first applied node2vec to a centrality measuring task, in which we used the Euclidean distance instead of the cosine distance between nodes to measure the importance of nodes. We then validated the Euclidean distance of the nodes’ vectors in FGE and node2vec in a clustering task. The outcome shows that the two algorithms give similar clustering and centrality measuring results. The FGE algorithm has no free parameters, so it can work as a criterion for parameter setting for node2vec. The PPMI analysis [Levy2014Neura] shows that the skip-gram in word2vec implicitly factorizes a word-context matrix. In the future, we would like to explore the hidden relationship between the flow distance and pointwise mutual information. Both node2vec and FGE regard the random walk as a paradigmatic dynamic process to reveal network structures. This sampling strategy consumes a large amount of computing resources to reach a stationary state for each node. Further extensions of FGE could involve calculating the nodes’ flow distances without sampling.

References

Acknowledgements

We acknowledge the financial support for this work from the National Science Foundation of China (grant number 61673070), the Fundamental Research Funds for the Central Universities (grant number 310421103), and the Beijing Normal University Interdisciplinary Project.

Author contributions statement

JZ and WG conceived the experiments and wrote the manuscript. WG collected and analysed the empirical data. WG, LG and XL plotted the graphs. All authors reviewed the manuscript.

Additional information

Competing financial interests: The authors declare no competing financial interests.

Figure 1: Illustration of the construction of different open-flow networks from the same background network with different random walk strategies. (A) represents the adjacency matrix of a network; (B) shows the random walks implemented from (A) by the deepwalk algorithm with p = 1, q = 1 (C1), and by the node2vec algorithm with other values of p and q (C2); (C) shows several sequences of nodes generated by the corresponding random walk algorithms; (D) shows the open-flow networks constructed from the sequences. Algorithm 1 shows how to build an open-flow network matrix based on the total flow from node to node.
(a) matrix F built from random walk sequences
(b) matrix C calculated based on F
Figure 2: An example flow network including 7 nodes. (a) is the flux matrix of the sampled network under condition C1 (p = 1, q = 1) in Figure 1. (b) shows the flow distances among all nodes, where infinity means that there is no connected path from i to j. Algorithm 1 shows how to compute the flow distance based on the matrix F.
| Name | Nodes | Edges | Type |
|---|---|---|---|
| Les Misérables network | 77 | 254 | directed |
| Airline Network | 3,425 | 67,333 | directed |
| Karate Graph | 34 | 156 | undirected |
| Blog Catalog | 10,312 | 333,983 | undirected |
| Wikipedia | 9,488 | 832,408 | directed, weighted |
| China Click Websites | 20,746 | 135,770 | directed, weighted |

Table 1: An overview of the basic information of the datasets. Les Misérables is a co-occurrence network with 77 characters and 254 relationships in the novel Les Misérables. Airline Network contains 59036 routes between 3209 airports on 531 airlines spanning the globe; the routes are directional. Karate Graph is a social network of friendships between 34 members of a karate club at a US university. Blog Catalog is a network of 333,983 social interactions of 10,312 bloggers listed on the Blog Catalogue website. The Wikipedia dataset comes from the latest web page texts of the Chinese Wikipedia, with 9488 nodes and 832,408 edges. China Click Websites contains 120 million records of all the clicking behaviours of 1000 users, with 20,746 websites and 135,770 click streams within one month.
Figure 3: Visualization of the flow network structure. (a) is the undirected, unweighted network of the Karate Graph. (b) is the result of unbiased random walks on this graph. (c) indicates that the random walks mainly explored within a community to uncover the local structures, while in (d) the vast flows between different communities show that the random walks were trying to find the global structures in this network.
Figure 4: The embedding of Karate Graph. The visualization results were generated by node2vec and FGE algorithms with label colors reflecting clustering results and node shapes indicating different embedding methods.
Figure 5: Heat maps of flow distance and node2vec distance under p=1, q=1. (a) shows the correlation between the flow distance and the node2vec distance for the Karate Graph. (b) shows the correlation for the Airline Network dataset.
| Parameter | Karate Graph | Blog Catalog | Les Misérables Network | Airline Network | China Click Websites | Wikipedia |
|---|---|---|---|---|---|---|
| p=1.0, q=1.0 | 0.91 | 0.78 | 0.81 | 0.82 | 0.86 | 0.58 |
| p=0.5, q=2.0 | 0.90 | 0.86 | 0.74 | 0.85 | 0.60 | 0.62 |
| p=2.0, q=0.5 | 0.87 | 0.79 | 0.86 | 0.77 | 0.75 | 0.65 |

Table 2: The correlation coefficients of different datasets. This table shows the coefficients between the flow distance and node2vec’s Euclidean distance in the embedding space, in which node2vec is run with different “inward” and “outward” parameters (p and q).
| Rank | Website | Flow distance | node2vec distance | PageRank | Total flow |
|---|---|---|---|---|---|
| 1 | baidu | (1) 26.332 | (1) 26.437 | (1) 0.0221 | (1) 105560 |
| 2 | qq | (2) 30.087 | (2) 30.208 | (2) 0.0189 | (2) 57209 |
| 3 | sogou | (3) 33.035 | (4) 33.168 | (3) 0.0131 | (4) 25979 |
| 4 | taobao | (4) 33.272 | (3) 33.106 | (4) 0.0120 | (3) 35311 |
| 5 | hao123 | (5) 33.626 | (5) 34.190 | (6) 0.0122 | (5) 23295 |
| 6 | sina | (6) 33.818 | (7) 33.954 | (5) 0.0098 | (7) 21711 |
| 7 | weibo | (7) 34.054 | (6) 33.953 | (9) 0.0070 | (6) 21815 |
| 8 | 163 | (8) 34.949 | (12) 36.41 | (7) 0.0062 | (12) 13890 |
| 9 | sohu | (9) 35.706 | (8) 33.239 | (8) 0.0071 | (8) 15512 |
| 10 | 360 | (10) 35.01 | (9) 35.155 | (10) 0.006 | (9) 14711 |

Table 3: Centrality ranking of the top 10 websites according to flow distance and node2vec distance, compared with other ranking methods. The number in parentheses is the rank of the website under the corresponding measure.
Figure 6: Parameter sensitivity study. (a) The correlation coefficient over the embedding size and the number of walks per node. (b) The correlation coefficient over the walk length and the number of walks per node.
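Input: $F$, matrix of total flow from node to node; $N$, number of nodes
Output: $C$, flow distance matrix
1:  Build the transition matrix $M$ based on $F$:
2:  for $i = 1$ to $N$ do
3:     for $j = 1$ to $N$ do
4:        $m_{ij} = f_{ij} / \sum_{k} f_{ik}$
5:     end for
6:  end for
7:  Calculate the fundamental matrix $U$ based on $M$ and the identity matrix $I$:
8:  $U = (I - M)^{-1}$
9:  for $i = 1$ to $N$ do
10:     for $j = 1$ to $N$ do
11:        compute the flow distance from $i$ to $j$ using $M$ and $U$
12:        Symmetrize flow distance matrix:
13:        $c_{ij}$ = symmetrized flow distance between $i$ and $j$
14:     end for
15:  end for
Algorithm 1 Flow Distance($F$, $N$)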
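Input: $C$, flow distance matrix; $d$, embedding size; $k_{\max}$, maximum number of iterations; $\epsilon$, tolerance error to declare convergence; $N$, number of nodes for embedding
Output: $X$, matrix of node representations
1:  Initialization:
2:  Set $Z = X^{(0)}$, $k = 0$, where $X^{(0)}$ is a random start configuration and $k$ is the iteration counter.
3:  Compute $\sigma(X^{(0)}) = \sum_{i<j} (d_{ij}(X^{(0)}) - c_{ij})^{2}$, where $d_{ij}(X)$ is the Euclidean distance between nodes $i$ and $j$ in $X$.
4:  while $k = 0$ or ($\sigma(X^{(k-1)}) - \sigma(X^{(k)}) > \epsilon$ and $k \le k_{\max}$) do
5:     $k = k + 1$
6:     Compute the Guttman transform: $X^{(k)} = N^{-1} B(Z) Z$
7:     where:
       $b_{ij} = -c_{ij} / d_{ij}(Z)$ if $i \ne j$ and $d_{ij}(Z) \ne 0$,
       $b_{ij} = 0$ if $i \ne j$ and $d_{ij}(Z) = 0$,
       $b_{ii} = -\sum_{j \ne i} b_{ij}$ if $i = j$
8:     Update stress function:
9:     $\sigma(X^{(k)}) = \sum_{i<j} (d_{ij}(X^{(k)}) - c_{ij})^{2}$
10:     $Z = X^{(k)}$
11:  end while
Algorithm 2 Network Embedding($C$, $d$, $k_{\max}$, $\epsilon$, $N$)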
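Input: graph $G = (V, E, W)$; window size $w$; embedding size $d$; walks per vertex $\gamma$; walk length $t$; in and out parameters $p$, $q$
Output: matrix of node representations $\Phi$
1:  Modify Graph Weight:
2:  $\pi$ = PreprocessModifiedWeights($G$, $p$, $q$)
3:  $G'$ = ($V$, $E$, $\pi$)
4:  Generate Walking Sequences:
5:  for $i = 1$ to $\gamma$ do
6:     $O$ = Shuffle($V$)
7:     for each $v_i \in O$ do
8:        $W_{v_i}$ = RandomWalk($G'$, $v_i$, $t$)
9:     end for
10:  end for
11:  Learn Features by SkipGram:
12:  for each $v_j \in W_{v_i}$ do
13:     for each $u_k \in W_{v_i}[j - w : j + w]$ do
14:        $J(\Phi) = -\log \Pr(u_k \mid \Phi(v_j))$
15:        $\Phi = \Phi - \alpha \, \partial J / \partial \Phi$
16:     end for
17:  end for
Algorithm 3 Node2vec Embedding($G$, $w$, $d$, $\gamma$, $t$, $p$, $q$)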