The Hidden Flow Structure and Metric Space of Network Embedding Algorithms Based on Random Walks
Network embedding, which encodes all vertices in a network as a set of numerical vectors in accordance with its local and global structures, has drawn widespread attention. Network embedding not only learns significant features of a network, such as clustering and link prediction, but also learns the latent vector representation of the nodes, which provides theoretical support for a variety of applications, such as visualization, node classification, and recommendation. As the latest progress in this research, several algorithms based on random walks have been devised. Although their high scores for learning efficiency and accuracy have drawn much attention, there is still a lack of theoretical explanation, and the transparency of the algorithms has been doubted. Here, we propose an approach based on the open-flow network model to reveal the underlying flow structure and the hidden metric space of different random walk strategies on networks. We show that the essence of embedding based on random walks is the latent metric structure defined on the open-flow network. This not only deepens our understanding of random-walk-based embedding algorithms but also helps in finding new potential applications in embedding.
Complex networks, as high-level abstractions of complex systems, have been widely applied in different areas, such as biology, sociology, economics and technology[RevModPhys.74, Barab2004Network, Deville2016Scaling, Marc2010Spatial, Wang2016The, Lv2009Effective]. Recent progress has revealed a hidden geometric structure in networks[Brockmann2013The, Garc2016The] that not only deepens our understanding of the multiscale nature and intrinsic heterogeneity of networks but also provides a useful tool to unravel the regularity of some dynamic processes on networks [Brockmann2013The, Kleinberg2000Navigation, Higham2008Fitting, Kleinberg2007Geographic, Shi2015A, Lou2016The, Serrano2012Uncovering]. At the same time, researchers in the machine learning community have developed several techniques to embed a whole network in a high-dimensional space[Perozzi2014DeepWalk, Grover2016node2vec, Tang2015LINE, Cao2015GraRep, Arora2015RAND, Wang2016Structural] such that the vector of each node can be used as an abstract feature fed to neural networks to perform tasks. It has been demonstrated that such a form of network embedding has wide applications, such as community detection, node classification and link prediction[Grover2016node2vec, Leskovec2010Empirical]. Various methods have been proposed in the network embedding field, such as Principal Component Analysis, Multi-Dimensional Scaling, IsoMap and their extensions[Belkin2002Laplacian, Roweis2000Nonlinear, Tenenbaum2001A, Yan2005Graph, Shavitt2002Big]. Those embedding methods perform well when the network is small, but most of them cannot be effectively applied to networks containing millions of nodes and billions of edges.
Recently, there has been a surge of works proposing alternative ways to embed networks by training neural networks[Perozzi2014DeepWalk, Grover2016node2vec] in various approaches inspired by natural language processing techniques[Pennington2014Glove, Sridhar2015Unsupervised, Mikolov2013Efficient, Mikolov2013Distributed]. To build a connection between a language and a network, a random walk is implemented on the network such that the node sequences generated by the random walks are treated as sentences in which nodes resemble words. After the sequences are generated, skip-gram in word2vec[Mikolov2013Distributed], one of the most famous algorithms for word embedding developed in the deep learning community, can be efficiently applied to the sequences. Among these random-walk-based approaches, deepwalk[Perozzi2014DeepWalk] and node2vec[Grover2016node2vec] have drawn wide attention for their high training speed and high classification accuracy. Both algorithms regard random walks as a paradigmatic dynamic process on a network that can reveal both the local and global network structures. Several extended works unravel the fundamental co-occurrence matrix between the context and words in skip-gram-based embedding algorithms and the multiple-step transition matrix. Levy et al. [Levy2014Neura] prove that skip-gram models implicitly factorize a word-context matrix. Tang et al.[Tang2015LINE] take 1-step and 2-step local relational co-occurrence into consideration, and Cao et al.[Cao2015GraRep] argue that skip-gram is an equally weighted linear combination of k-step relational information. Those works were proposed soon after word2vec and deepwalk were presented.
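The link between walks and sentences can be illustrated with a minimal sketch that turns a graph into a corpus of node sequences. The adjacency-list format, the toy graph, and the function names `random_walk` and `build_corpus` are assumptions made for illustration; they are not taken from the deepwalk or node2vec implementations.

```python
import random

def random_walk(adj, start, walk_length, rng):
    """Return one truncated unbiased random walk as a list of node ids."""
    walk = [start]
    while len(walk) < walk_length:
        neighbors = adj[walk[-1]]
        if not neighbors:              # dead end: truncate the walk early
            break
        walk.append(rng.choice(neighbors))
    return walk

def build_corpus(adj, walks_per_node, walk_length, seed=0):
    """Start `walks_per_node` walks from every node; each walk is one 'sentence'."""
    rng = random.Random(seed)
    corpus = []
    for _ in range(walks_per_node):
        nodes = list(adj)
        rng.shuffle(nodes)             # visit start nodes in random order
        for node in nodes:
            corpus.append(random_walk(adj, node, walk_length, rng))
    return corpus

# Hypothetical toy graph (undirected, unweighted) as an adjacency list.
toy_graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
corpus = build_corpus(toy_graph, walks_per_node=2, walk_length=5)
```

Each list in `corpus` plays the role of a sentence whose words are node ids, which is the form of input a skip-gram trainer expects.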
Although word2vec and network embedding have been successfully applied to some real problems, several drawbacks still exist. First, explicit and fundamental explanations are needed for why neural-based algorithms work so well, since these algorithms are fundamentally black boxes. Second, how to set the values of the hyper-parameters is still poorly understood. Third, explicit and intuitive explanations of the embedding vector of each node and of the inner structure of the embedding space are needed. An explanation should also provide a general framework that unifies deepwalk, node2vec and other new algorithms based on random walks.
In this paper, we put forward a novel framework based on an open-flow network to deepen the understanding of network embedding algorithms based on random walks. We first use the so-called open-flow network model to characterize the different random walk strategies on a single background network. Then, we note that there is a natural metric, called the flow distance, defined on these flow networks. Finally, the hidden metric space framed by the flow distances can be derived and, interestingly, this metric space is similar to the embedding space from the deepwalk and node2vec algorithms. We find that the embedding algorithms based on neural networks are only attempting to realize the hidden metric based on flow networks, and the correlation between the flow distance and the node2vec distance reaches up to 0.91. With this understanding, we propose a new method called Flow-based Geometric Embedding (FGE), which has no free parameters and performs excellently in some applications, such as clustering and node centrality ranking.
Both deepwalk and node2vec aim to learn continuous feature representations of nodes by sampling truncated random walk sequences from the graph and feeding them, as mimic sentences, to the skip-gram algorithm in word2vec. The difference lies in the random walk strategy: deepwalk implements a common unbiased random walk on a graph such that all the edges are visited in accordance with their relative weights at the local node, while node2vec employs a biased random walk in which the probability of visiting is adjusted by two parameters, $p$ and $q$. Node2vec can uncover much richer structures of a network, and it reduces to deepwalk when $p = 1$ and $q = 1$. Thus, we discuss only node2vec in the rest of this paper. Please refer to Algorithm 3 for more concrete details about node2vec.
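The biased walk can be sketched as follows, using the second-order weighting described in [Grover2016node2vec]: a step back to the previous node is weighted by the reciprocal of the return parameter (called `p` here), a step to a common neighbor of the previous node is weighted 1, and a step further away is weighted by the reciprocal of the in-out parameter (called `q` here). The sketch assumes an unweighted graph stored as an adjacency list; the function names are hypothetical and this is not the paper's Algorithm 3.

```python
import random

def biased_step(adj, prev, cur, p, q, rng):
    """Pick the next node of a node2vec-style second-order walk."""
    neighbors = adj[cur]
    weights = []
    for nxt in neighbors:
        if nxt == prev:                # return to the previous node
            weights.append(1.0 / p)
        elif nxt in adj[prev]:         # stays at distance 1 from prev
            weights.append(1.0)
        else:                          # moves to distance 2 from prev
            weights.append(1.0 / q)
    return rng.choices(neighbors, weights=weights, k=1)[0]

def node2vec_walk(adj, start, walk_length, p, q, rng):
    """Generate one truncated biased walk starting at `start`."""
    walk = [start]
    while len(walk) < walk_length:
        cur = walk[-1]
        if not adj[cur]:               # dead end: truncate the walk early
            break
        if len(walk) == 1:
            walk.append(rng.choice(adj[cur]))   # first step has no history
        else:
            walk.append(biased_step(adj, walk[-2], cur, p, q, rng))
    return walk

# Hypothetical toy path graph 0 - 1 - 2.
path_graph = {0: [1], 1: [0, 2], 2: [1]}
walk = node2vec_walk(path_graph, 0, 6, p=1.0, q=1.0, rng=random.Random(0))
```

With both parameters set to 1 all three weights coincide, so each step is a uniform choice among neighbors, which is the unbiased walk used by deepwalk on an unweighted graph.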
Constructing Open-flow Networks
To reveal the flow structure behind a random walk strategy (for a given $p$ and $q$), we constructed an open-flow network model[Guo2015Flow] in accordance with the random walk strategy. An open-flow network is a special directed weighted network in which the nodes are identical to those of the original network, and the weighted edges represent the actual fluxes realized by a large number of random walks. There are two special nodes, the source and the sink, which represent the environment; this is why the network is called an open network. When a random walker is generated at a given node, a unit of flux is injected into the flow network from the source to the given node, and this particle contributes one unit of flux to every edge it visits. When the random walk is truncated, a unit of flux is added from the last node to the sink. A large number of random walkers jumping on the network according to the specific strategy form a flow structure that can be characterized by the open-flow network model, in which the weight on an edge is the number of particles that traversed it. Figure 1 illustrates how the different open-flow networks are constructed for a single background binary network, with deepwalk in the upper panel and node2vec in the lower panel.
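The construction can be sketched as a simple accumulation over sampled walks. The sketch assumes that original nodes are labeled 0 to N-1, that the source is stored at index 0 and the sink at the last index of an (N+2) x (N+2) matrix, and that the function name `build_flow_matrix` is hypothetical.

```python
def build_flow_matrix(walks, num_nodes):
    """Accumulate walks into an (N+2) x (N+2) flux matrix.

    Index 0 is the source, original node v is stored at index v + 1,
    and index N + 1 is the sink (an indexing convention assumed here).
    """
    size = num_nodes + 2
    sink = num_nodes + 1
    flux = [[0] * size for _ in range(size)]
    for walk in walks:
        if not walk:
            continue
        flux[0][walk[0] + 1] += 1              # source -> first node
        for a, b in zip(walk, walk[1:]):
            flux[a + 1][b + 1] += 1            # one unit per traversed edge
        flux[walk[-1] + 1][sink] += 1          # last node -> sink
    return flux

# Three hypothetical walks on a 3-node network.
walks = [[0, 1, 2], [2, 1, 0], [1, 2]]
F = build_flow_matrix(walks, num_nodes=3)
```

Because every walk enters through the source and leaves through the sink, the inflow of each original node equals its outflow in the resulting matrix.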
Calculating Flow Distance
For a given flow network with flux matrix $F$ of size $(N+2)\times(N+2)$, where the value $f_{ij}$ at the $i$-th row and the $j$-th column represents the flux from $i$ to $j$, node 0 represents the source, and the sink is represented as the last node, the flow distance between any pair of nodes $i$ and $j$ is defined as the average number of steps needed for a random walker to jump from $i$ to $j$ and finally return back to $i$ along the network. It can be expressed in terms of the transition matrix $M$ and the pseudo probability matrix $U$ [Guo2015Flow].
Here, $m_{ij}$ is the transition probability from $i$ to $j$, which is defined as $m_{ij} = f_{ij} / \sum_{k} f_{ik}$, where $f_{ij}$ is the total flow from node $i$ to node $j$. The pseudo probability matrix is defined as[Guo2015Flow]: $U = I + M + M^2 + \cdots = (I - M)^{-1}$,
where $I$ is the identity matrix of the same size as $M$, and $u_{ij}$ is the pseudo probability that a random walker jumps from $i$ to $j$ along all possible paths. Figure 2 is a sample flow network constructed under condition 1 in Figure 1. Algorithm 1 shows the concrete details of how to calculate the flow distance based on the $F$ matrix.
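A minimal sketch of the flow-distance calculation is given below. With $M$ the row-normalized flux matrix and $U = (I - M)^{-1}$, it assumes the first-passage form $l_{ij} = (MU^2)_{ij}/u_{ij} - (MU^2)_{jj}/u_{jj}$, which is the mean number of steps of a first passage from $i$ to $j$ conditioned on reaching $j$, and it assumes that the round-trip distance is $l_{ij} + l_{ji}$, in line with the verbal definition above. Both forms, and all function names, are assumptions made for illustration and may differ from the paper's Algorithm 1.

```python
def mat_mul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def mat_inv(a):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [[float(x) for x in row] + [float(i == j) for j in range(n)]
           for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [x / scale for x in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0.0:
                factor = aug[r][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def flow_distance(flux):
    """Round-trip flow distance matrix under the assumed forms stated above."""
    n = len(flux)
    # Transition matrix m_ij = f_ij / sum_k f_ik; the sink row stays zero.
    M = []
    for row in flux:
        total = sum(row)
        M.append([x / total if total else 0.0 for x in row])
    # Pseudo probability matrix U = (I - M)^{-1} = I + M + M^2 + ...
    U = mat_inv([[float(i == j) - M[i][j] for j in range(n)] for i in range(n)])
    MU2 = mat_mul(M, mat_mul(U, U))            # equals sum_k k * M^k
    inf = float("inf")
    # Assumed first-passage distance; unreachable pairs get infinity.
    L = [[MU2[i][j] / U[i][j] - MU2[j][j] / U[j][j] if U[i][j] > 1e-12 else inf
          for j in range(n)] for i in range(n)]
    return [[L[i][j] + L[j][i] for j in range(n)] for i in range(n)]

# Hypothetical balanced flow: source 0 -> 1, 1 <-> 2, 2 -> sink 3.
F = [[0, 10, 0, 0],
     [0, 0, 20, 0],
     [0, 10, 0, 10],
     [0, 0, 0, 0]]
C = flow_distance(F)
```

In this toy network a walker at node 1 always steps to node 2, and a walker at node 2 that ever reaches node 1 does so in a single step, so the round trip between nodes 1 and 2 takes two steps. Pairs involving the source or the sink have an infinite round-trip distance because no flux returns to the source or leaves the sink.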
To display the hidden information in an open-flow network and visualize the node relationships, we embed the flow distance matrix into a high-dimensional Euclidean space. We use the SMACOF algorithm[Williams2002On, Borg2009Modern, Deville2016Scaling] to perform the embedding. This algorithm takes the distance matrix and the number of dimensions $d$ as input and tries to place each node in $d$-dimensional space such that the between-node distances are preserved as well as possible. Through this embedding, we find a proper vector representation of the nodes in the network. Please refer to Algorithm 2 for more concrete details about this embedding method.
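The core of SMACOF is the repeated Guttman transform, which can be sketched in a few lines. The version below is a minimal unweighted variant with a random initial layout and a fixed number of iterations; the function names, the defaults, and the toy distance matrix are assumptions for illustration and do not reproduce the paper's Algorithm 2.

```python
import math
import random

def pairwise(X):
    """Euclidean distance matrix of a list of points."""
    return [[math.dist(a, b) for b in X] for a in X]

def stress(D, X):
    """Raw stress: sum over pairs of (target distance - embedded distance)^2."""
    E = pairwise(X)
    n = len(D)
    return sum((D[i][j] - E[i][j]) ** 2
               for i in range(n) for j in range(i + 1, n))

def smacof(D, dim=2, iters=300, seed=0):
    """Minimal unweighted SMACOF: repeat the Guttman transform X <- B(X) X / n."""
    rng = random.Random(seed)
    n = len(D)
    X = [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(n)]
    history = [stress(D, X)]
    for _ in range(iters):
        E = pairwise(X)
        B = [[0.0] * n for _ in range(n)]
        for i in range(n):
            for j in range(n):
                if i != j and E[i][j] > 1e-12:
                    B[i][j] = -D[i][j] / E[i][j]
            B[i][i] = -sum(B[i])       # diagonal balances the off-diagonal row sum
        X = [[sum(B[i][k] * X[k][c] for k in range(n)) / n for c in range(dim)]
             for i in range(n)]
        history.append(stress(D, X))
    return X, history

# Hypothetical target distances: the four corners of a unit square.
s = math.sqrt(2.0)
D = [[0, 1, s, 1], [1, 0, 1, s], [s, 1, 0, 1], [1, s, 1, 0]]
X, history = smacof(D, dim=2, iters=100)
```

Each Guttman transform minimizes a majorizing function of the stress, so the values in `history` never increase; a production run would typically add a convergence threshold and several random restarts.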
Based on Algorithms 1 and 2, we proposed a new network embedding algorithm named Flow-based Geometric Embedding (FGE). We then discovered the hidden metric space of the random-walk-based network embedding algorithms, such as node2vec, word2vec, GraRep and so on. The node vectors obtained by training the node2vec algorithm are highly correlated with the Euclidean embedding vectors derived from the flow network. The strong correlation is shown in the “Results” section.
In this section, we present our results on several empirical networks. First, we applied the FGE algorithm to the Karate network and plotted the open-flow network models behind two different random walk strategies (with different $p$ and $q$). Next, we compared the FGE algorithm with node2vec by embedding networks into two-dimensional planes. After that, we correlated two distances for every node pair, the flow distance and the Euclidean distance in the node2vec embedding, to show that the node2vec embedding algorithm is attempting to realize the metric of the flow distances. Then, we compared FGE and node2vec on clustering and centrality measuring tasks. Finally, we studied how the parameters of embedding algorithms based on random walks affected the flow structure and the correlations between the two distances. An overview of the networks we consider in our experiments is given in Table 1.
Flow Structure and Representation
Correlations between Distances
To confirm our conclusion that the skip-gram algorithm only tries to realize the hidden metric of the flow distance defined by random walks, we compared, for every node pair on the same background network, the flow distance of the flow network generated by the random walks in the FGE algorithm with the Euclidean distance in the embedding space given by the node2vec algorithm trained on the same node sequences. The results showed strong correlations between the two distances. Figure 5 is a heat map, where the X-axis represents the flow distance between nodes $i$ and $j$, and the Y-axis is the node2vec distance between the same nodes. The Pearson correlation between the two distances was 0.90 (p-value = 0.001) and 0.83 (p-value = 0) in the two cases shown in Figure 5. The correlations indicate a highly linear relationship between the paired data.
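The comparison amounts to a Pearson correlation over all unordered node pairs of two distance matrices, which can be sketched as follows. The two small matrices are made-up illustrative values, not data from the paper, and the function names are hypothetical.

```python
import math

def upper_triangle(D):
    """Flatten the distances of all unordered node pairs (i < j)."""
    n = len(D)
    return [D[i][j] for i in range(n) for j in range(i + 1, n)]

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Made-up symmetric distance matrices for three nodes (illustration only).
flow_D = [[0, 2, 4], [2, 0, 6], [4, 6, 0]]
emb_D = [[0, 1.1, 2.0], [1.1, 0, 3.1], [2.0, 3.1, 0]]
r = pearson(upper_triangle(flow_D), upper_triangle(emb_D))
```

Only the upper triangle is used because both matrices are symmetric with a zero diagonal, so including the remaining entries would duplicate pairs and add trivial zeros.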
To show the generality of our results, we performed the same experiments on different datasets and averaged the correlation values for each dataset to make the estimates more reliable. The results in Table 2 show that there is a strong connection between the flow distance and the node2vec distance. We also found that the correlation is not sensitive to the walking strategy. Although different walking strategies generate different neighbor nodes and therefore different metric distances, all of them are captured by the flow distances, so the flow distance can reveal the latent space in random-walk-based network embedding algorithms such as node2vec, deepwalk and so on. The FGE algorithm can thus reveal the latent relationships between nodes in graph embedding.
To further show the similarity between our method and node2vec, we compared the two approaches in performing node clustering. In complex network studies, node clustering amounts to community structure detection, which is of importance in various backgrounds[Newman2003A, Freeman1980Centrality, Rka1999Hawoong]. We applied k-means clustering to the node vectors produced by the two methods. The number of clusters can be determined using the average silhouette coefficient as the criterion. According to the silhouette value, we aggregated the graph into 4 clusters, each of which is regarded as a community. Here, our method was applied to the karate club graph, as shown in Figure 4. In this graph, different vertex colors represent different communities of the input graph. The clusters obtained using the two methods overlapped completely (100%). We also performed clustering experiments on other datasets, such as China Click Websites and the Airline Network, and the clustering results of the two methods were again identical.
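The silhouette criterion used to choose the number of clusters can be sketched as below. The sketch only scores a given labeling; producing the labelings with k-means is not reproduced here. The function name, the score of 0 for singleton clusters, and the toy points are assumptions for illustration, and at least two clusters are required.

```python
import math

def silhouette(points, labels):
    """Average silhouette coefficient of a labeling (Euclidean distance)."""
    n = len(points)
    clusters = sorted(set(labels))
    scores = []
    for i in range(n):
        mean_dist = {}
        for c in clusters:
            members = [j for j in range(n) if labels[j] == c and j != i]
            if members:
                mean_dist[c] = sum(math.dist(points[i], points[j])
                                   for j in members) / len(members)
        own = labels[i]
        if own not in mean_dist:       # singleton cluster: score 0 by convention
            scores.append(0.0)
            continue
        a = mean_dist[own]             # mean distance inside the own cluster
        b = min(d for c, d in mean_dist.items() if c != own)   # nearest other cluster
        scores.append((b - a) / max(a, b))
    return sum(scores) / n

# Hypothetical one-dimensional embedding with two well-separated groups.
points = [(0.0,), (1.0,), (10.0,), (11.0,)]
good = silhouette(points, [0, 0, 1, 1])
bad = silhouette(points, [0, 1, 0, 1])
```

Running the scoring for labelings with different numbers of clusters and keeping the one with the highest average value is the selection rule described above.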
We showed that understanding network embedding from the angle of the flow structure provides not only new insights but also new applications, such as centrality measuring. The centrality measure of nodes is a key issue in network analysis [Freeman1978Centrality, Bonacich1987Power, Borgatti2005Centrality], and a variety of centrality measures have been developed. Here, we showed that the average flow distance from the focus node to all other nodes can be treated as a new type of node centrality measure, and we defined a new FGE-based centrality metric accordingly.
The reason for the usefulness of this definition is that the nodes close to other nodes always have tight connections and high traffic. Because the flow distances are highly correlated with the Euclidean distances in node2vec embedding, this definition also works for the node2vec algorithm. That is, we can measure each node’s centrality through its distances to all other nodes in the embedding Euclidean space. Furthermore, we can read the centrality information directly from the embedding graph because the nodes with high centrality (small average distances) are always concentrated on the central area of the embedding graph.
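A minimal sketch of this distance-based centrality is shown below. It assumes that the score of a node is the plain average of its distances to all other nodes, with a smaller average meaning a more central node; the exact normalization used by the authors may differ. The function names and the toy matrix are hypothetical.

```python
def average_distance(D):
    """Mean distance from each node to all other nodes."""
    n = len(D)
    return [sum(D[i][j] for j in range(n) if j != i) / (n - 1)
            for i in range(n)]

def rank_by_centrality(D):
    """Node indices ordered from most central (smallest mean distance) to least."""
    avg = average_distance(D)
    return sorted(range(len(D)), key=lambda i: avg[i])

# Hypothetical star-like distance matrix: node 0 is close to everyone.
D = [[0, 1, 1, 1],
     [1, 0, 2, 2],
     [1, 2, 0, 2],
     [1, 2, 2, 0]]
ranking = rank_by_centrality(D)
```

The same functions apply unchanged to a flow distance matrix or to a matrix of Euclidean distances between embedding vectors, which is what allows the two rankings to be compared.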
We tested the node centrality on the China Click Websites dataset, which contains approximately 5 years of browsing data from more than 30000 online volunteers. We calculated each website’s centrality based on its flow distance matrix and on its node2vec distance. We found that popular websites always have a small average distance because they usually have more paths leading to other websites. Therefore, the smaller the average flow distance, the more central the position of the website. We ranked the websites in accordance with their centrality and then compared those two methods with other methods, such as PageRank and total traffic (the number of clicks for each website). The ranking results for the top 10 websites are listed in Table 3. We found that the ranking orders of the flow distance and node2vec were nearly the same. We also discovered that high-traffic websites, such as Tmall (a popular shopping website) and 163.com, have lower ranks, whereas baidu.com and qq.com rank high even though their total traffic is not heavy. That is because baidu.com and qq.com are bridges between the real and virtual worlds.
Random-walk-based embedding algorithms involve a number of sensitive parameters. To evaluate how the parameters affect the correlation between the two distances, we conducted several experiments on the Karate club network. We examined how the embedding size, the number of walks started per node, the window size, and the walk length influenced the correlation between the two distances. As shown in Figure 6, the correlation grew as the number of walks increased and tended to saturate when the number reached 512. This indicated that the node2vec embedding algorithm merely tried to realize the hidden metric of the flow structure of the random walk, and that the performance increased as more samples were drawn. The neural network of the skip-gram algorithm behind node2vec is over-fitted when the number of walks is small, because a higher embedding size leads to more parameters in the neural network that need to be trained; accordingly, the correlation decreased with the embedding size (Figure 6). However, there was a slight decreasing trend in the correlation coefficient when the number of walks was larger than 512. We speculated that this decrease is due to errors introduced when the large sample of random walks is substituted by the open-flow network. The FGE algorithm assumes that the random walks can be represented as a Markovian process on the network, which means that each jump is exclusively determined by the position at the previous step. However, the random walk of node2vec does not satisfy this condition. Even though this difference exists, as seen in Figure 6, we believe that the hidden metric of flows is more essential in reflecting the structural properties of the network. We also evaluated how changes to the window size and walk length affected the correlation. We fixed the embedding size and the number of walks to sensible values and varied the window size and walk length for each node. The differences in performance were small as the window size changed. Once the walk length passed a threshold, the correlation declined rapidly with further increases in the walk length.
Conclusions and Discussions
In this paper, we reveal the hidden flow structure and metric space of random-walk-based network embedding algorithms by introducing the FGE algorithm. This algorithm takes the flow from node to node as input. After calculating the flow distance, it learns node representations that encode both structural and local regularities. The high Pearson correlation between the node2vec representations and the FGE vectors indicates that there is a hidden metric behind random-walk-based network embedding algorithms. The FGE algorithm not only helps in finding the hidden metric space but also works as a novel approach to learning the latent relations between vertices. Experiments on a variety of networks illustrate the effectiveness of this method in revealing the hidden metric space of random-walk-based network embedding algorithms. This finding is of great importance because it not only provides a novel perspective for understanding the essence of network embedding based on random walks but also reveals that skip-gram (the main algorithm in node2vec) is trying to find the proper node representation to match this metric between nodes. With this finding, we first applied node2vec to a centrality measuring task, using the Euclidean distance instead of the cosine distance between nodes to measure the importance of nodes. We then validated the Euclidean distance between the node vectors of FGE and node2vec in a clustering task. The outcome shows that the two algorithms give similar clustering and centrality measuring results. The FGE algorithm has no free parameters, so it can serve as a criterion for setting the parameters of node2vec. The PPMI analysis [Levy2014Neura] shows that skip-gram in word2vec implicitly factorizes a word-context matrix. In the future, we would like to explore the hidden relationship between the flow distance and pointwise mutual information. Both node2vec and FGE regard the random walk as a paradigmatic dynamic process to reveal network structures. This sampling strategy consumes a large amount of computing resources to reach a stationary state for each node. Further extensions of FGE could involve calculating the nodes’ flow distances without sampling.
We acknowledge the financial support for this work from the National Science Foundation of China with grant number 61673070, the Fundamental Research Funds for the Central Universities with grant number 310421103, and the Beijing Normal University Interdisciplinary Project.
Author contributions statement
JZ and WG conceived the experiments and wrote the manuscript. WG collected and analysed the empirical data. WG, LG and XL plotted the graphs. All authors reviewed the manuscript.
Competing financial interests: The authors declare no competing financial interests.
|Network||Nodes||Edges||Type|
|Les Misérables network||77||254||directed|
|China Click Websites||20,746||135,770||directed weighted|
|Parameter||Karate Graph||Blog Catalog||Les Misérables Network||Airline Network||China Click Websites||Wikipedia|
|rank||web name||flow distance||node2vec distance||PageRank||total flow|