Link prediction based on path entropy
Information theory is regarded as a promising tool for quantifying the complexity of complex networks. In this paper, we first study the information entropy, or uncertainty, of a path using information theory. We then apply path entropy to the link prediction problem in real-world networks. Specifically, we propose a new similarity index, the Path Entropy (PE) index, which considers the information entropies of shortest paths between node pairs and penalizes long paths. Empirical experiments demonstrate that the PE index outperforms the mainstream link predictors.
Keywords: link prediction, complex networks, information entropy
1 Introduction
Fundamental principles underlying various complex systems, such as social, biological, and technological systems, have attracted considerable attention from the network science community over the past two decades Dorog4 (); Barrat3 (); Newman1 (); Baraba2 (). It has been demonstrated that many real-world networks have scale-free degree distributions Baraba5 (); Baraba8 (); Cohen6 (); Albert7 (), small-world effects Travers9 (); Watts10 (); Newman11 (); Hawick12 (), and high clustering Klemm16 (); Bocca15 (). Generally, social and collaboration networks exhibit assortative mixing Newman17 (); Catan18 (), while biological and technological networks exhibit disassortative mixing Newman20 (); Zhou19 (). Scale-free networks are very robust to random attacks but fragile to targeted attacks Albert21 (); Holme22 (); Pu23 (); Pu24 (). Scale-free networks also facilitate epidemic spreading, since their epidemic threshold approaches zero Pastor25 (); Gang26 (); Pastor27 (); Yang28 (); Pu29 (). A deep understanding of network structures and dynamics helps us make critical predictions about complex networks wenxu14 (). Link prediction lv30 (); Wang31 (); Al34 (); Barzel32 () aims to estimate the likelihood that a link exists between two unconnected nodes, based on the network structure, node attributes, and other information. Generally, there are two kinds of target links: missing links in the current network, and future links that emerge as the network evolves. Link prediction has both scientific significance and broad applications. On the one hand, prediction methods usually echo the fundamental organization rules of complex networks, and prediction performance in some sense indicates the predictability of complex networks Lv35 (). For example, common neighborhood (CN) based indices M37 (); G36 (); Liben38 () are based on the high clustering property of complex networks. 
High prediction accuracy of CN-based indices indicates that the network has a strong clustering property and high predictability. The preferential attachment (PA) index Baraba5 () reflects the rich-get-richer mechanism of social networks. In addition, link prediction provides a natural standard for comparing various network models lv30 (). On the other hand, link prediction is widely used in applications such as discovering potential interactions in protein-protein interaction networks LeiC39 (), recommending goods and friends in social networks Al34 (); Sher41 (), exploring coauthor relationships in collaboration networks NEWMAN01 (), and even revealing hidden relations in terrorist networks Knoke43 ().
Link prediction has long been discussed in computer science, and has recently boomed in network science lv30 (). One reason is that structural similarity indices are generally simpler and computationally cheaper than machine learning based prediction methods. Specifically, structural similarity based methods can be classified into three groups: local indices M37 (); G36 (); Liben38 (); Baraba5 (); L45 (); T46 (), global indices Katz48 (); Leicht49 (), and quasi-local indices L50 (); Liu51 (). Local indices are usually defined using knowledge of common neighbors and node degrees, and include CN, PA, Adamic-Adar (AA) L45 (), resource allocation (RA) T46 (), etc. Global indices are defined based on the topological information of the whole network, such as the Katz index Katz48 () and the Leicht-Holme-Newman (LHN) index Leicht49 (). Quasi-local indices lie between local and global indices, since they use more topological information than local indices but less than global indices. Quasi-local indices include the local path (LP) index L50 (), the local random walk (LRW) index Liu51 (), the superposed random walk (SRW) index Liu51 (), etc. Generally, local indices have the lowest prediction accuracy of the three groups, but also the smallest computational cost. Global indices are the opposite, while quasi-local indices fall in between. In addition, information on hierarchical and community structures Claus52 (); Soun53 () has been incorporated into link prediction, which further improves the prediction accuracy at additional computational cost.
Recently, information theory has been employed to quantify the complexity of complex network structures at various scales Anand54 (); so55 (). Both the von Neumann entropy Passe56 () and the Shannon entropy Anand54 () of a network have been defined. Bauer et al. Baue57 () used the maximum entropy principle to construct random graphs with arbitrary degree distributions. Bianconi Bianco58 () studied the entropy of randomized network ensembles and found that network ensembles with a fixed scale-free degree distribution have smaller entropy than those with a homogeneous degree distribution. She Bianco59 () further provided the expression for the entropy of multiplex network ensembles. Halu et al. Halu60 () studied the maximal entropy ensembles of spatial multiplex and spatial interacting networks. The entropy of network dynamics, such as diffusion processes Go61 () and random walks Sina62 (), has also been discussed. Network entropy measures have been applied to community detection Ros63 (), the characterization of aging and cancer progression Meni64 (), and very recently link prediction Tan65 ().
So far, the information entropy or uncertainty embodied in a path has not been explored specifically. In complex networks, the heterogeneity of paths can be quantified by the path entropy or uncertainty. With path entropy, we can study how path heterogeneity affects network properties and dynamics. In this paper, we first study the path entropy and obtain an approximate expression for it based on the entropies of the links in the path. Then we apply path entropy to the link prediction problem and propose a new similarity index based on it. The outline of the article is as follows. Section 2 provides a detailed derivation of the entropy of a path. Section 3 gives the new similarity index. Section 4 presents the experimental results, and Section 5 provides the conclusion. An appendix introduces the basic link prediction framework and the traditional similarity indices used in our experiments for comparison.
2 Information entropy of a path
In information theory, the uncertainty of an event depends on the probability of its occurrence. Given an event $X$ with occurrence probability $P(X)$, its information entropy or uncertainty is defined as Woodw66 ():
$$I(X) = -\log P(X), \tag{1}$$
where the base of the logarithm is 2, here and in the following. Clearly, the larger the occurrence probability of event $X$, the smaller its entropy. For a node pair (a, b) in a network, let us denote by $L^1_{ab}$ ($L^0_{ab}$) the event that there is (not) a link between a and b. Assuming there is no degree correlation among nodes in the network, the probability of $L^1_{ab}$ is calculated as follows:
$$P(L^1_{ab}) = 1 - P(L^0_{ab}) = 1 - \frac{C^{k_b}_{M-k_a}}{C^{k_b}_{M}}, \tag{2}$$
where $k_a$ and $k_b$ are the degrees of $a$ and $b$, $C^{k}_{n}$ denotes the binomial coefficient ($n$ choose $k$), and $M$ is the number of edges in the network. Combining Eq. 1 and Eq. 2, we get the entropy of $L^1_{ab}$ as:
$$I(L^1_{ab}) = -\log P(L^1_{ab}) = -\log\left(1 - \frac{C^{k_b}_{M-k_a}}{C^{k_b}_{M}}\right). \tag{3}$$
Through the above derivation, we infer that $I(L^1_{ab}) = I(L^1_{ba})$. Assuming the network is sparse, we have $k_{\max} \ll M$, where $k_{\max}$ is the maximum node degree. Then, let us consider a simple path $D = v_0 v_1 \cdots v_\xi$ of length $\xi$. The occurrence probability of path $D$ is calculated as follows:
$$P(D) \approx \prod_{i=0}^{\xi-1} P(L^1_{v_i v_{i+1}}). \tag{4}$$
Eq. 4 means that the occurrence probability of a simple path is approximately the product of its links' occurrence probabilities. Then, the entropy of path $D$ is calculated as follows:
$$I(D) = -\log P(D) \approx \sum_{i=0}^{\xi-1} I(L^1_{v_i v_{i+1}}). \tag{5}$$
Eq. 5 indicates that the entropy of a path is approximately the sum of its links' entropies.
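The derivation in this section can be illustrated with a minimal Python sketch. It assumes the link probability takes the form 1 - C(M - k_a, k_b) / C(M, k_b), the form used in the mutual-information approach of Tan65 (); that formula, the function names, and the toy degrees are assumptions for illustration, not the authors' implementation. The path entropy is then the sum of the link entropies along the path.

```python
from math import comb, log2


def link_probability(k_a: int, k_b: int, m: int) -> float:
    """Probability that nodes with degrees k_a and k_b are linked.

    ASSUMPTION: 1 - C(m - k_a, k_b) / C(m, k_b), i.e. one minus the chance
    that none of b's k_b links coincides with one of a's k_a links,
    with no degree correlation. Degrees are assumed to be at least 1.
    """
    return 1.0 - comb(m - k_a, k_b) / comb(m, k_b)


def link_entropy(k_a: int, k_b: int, m: int) -> float:
    """Information entropy (in bits) of the event 'a and b are linked'."""
    return -log2(link_probability(k_a, k_b, m))


def path_entropy(path, degree, m: int) -> float:
    """Approximate entropy of a simple path as the sum of its link entropies."""
    return sum(
        link_entropy(degree[u], degree[v], m)
        for u, v in zip(path, path[1:])
    )


# Toy network (hypothetical): a triangle 0-1-2 plus a pendant node 3 on node 2.
degree = {0: 2, 1: 2, 2: 3, 3: 1}
m = 4  # number of edges
print(link_entropy(degree[0], degree[1], m))
print(path_entropy([0, 2, 3], degree, m))
```

Because the binomial ratio is symmetric in the two degrees, `link_entropy(k_a, k_b, m)` and `link_entropy(k_b, k_a, m)` agree up to floating-point error, which matches the symmetry noted in the text.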
3 Similarity index based on path entropy
In link prediction, the link probability of an unconnected node pair is positively correlated with their topological similarity. Various similarity indices have been proposed to quantify the topological similarity of node pairs lv30 (). However, it is hard to find a universally applicable similarity index, since the essential organization rules of various complex networks are usually unknown.
Information theory is a promising tool to measure the complexity of complex networks and has been applied in link prediction Tan65 (). From the perspective of information theory, the link likelihood of a node pair is indicated by the link entropy. Large link entropy means that the node pair has a small probability of being connected by a link. In link prediction problems, we are interested in the conditional entropy Tan65 ():
$$I(L^1_{ab} \mid \Omega) = I(L^1_{ab}) - I(L^1_{ab}; \Omega), \tag{6}$$
where $\Omega$ is the part of the whole topological structure used in link prediction. Generally, the conditional entropy decreases as the amount of topological information used in link prediction increases. Here, we consider all the simple paths, and thus $\Omega = \bigcup_{i=2}^{l_{\max}} D^i_{ab}$, where $D^i_{ab}$ is the set of all simple paths of length $i$ between $a$ and $b$, and $l_{\max}$ is the maximum length of simple paths we consider in the network. Most path-based indices, such as Katz and LP, ignore the heterogeneity of paths. However, different paths may make different contributions to the existence of a link between the two end nodes. Paths with large entropies are critical substructures for the network from the perspective of information theory, and the existence of these large-entropy paths greatly reduces the link entropy of the end nodes, as indicated by Eq. 6. The contributions of all the simple paths are represented by $I(L^1_{ab}; \Omega)$ in Eq. 6, which is approximately calculated as follows:
$$I(L^1_{ab}; \Omega) \approx \sum_{i=2}^{l_{\max}} w_i \sum_{D \in D^i_{ab}} I(D), \tag{7}$$
where $w_i$ is the weight of simple paths with length $i$, with which the contributions of long paths are penalized. This is based on the common assumption that the longer a path is, the less important it is in link prediction. Note that we have also checked other weight forms and find that the chosen $w_i$ corresponds to the best prediction performance. For two unconnected nodes, large link entropy means small link probability, or in some sense small similarity. Our path-entropy (PE) based similarity index is defined as the negative of the conditional entropy as follows:
$$S^{PE}_{ab} = -I(L^1_{ab} \mid \Omega) \approx \sum_{i=2}^{l_{\max}} w_i \sum_{D \in D^i_{ab}} I(D) - I(L^1_{ab}). \tag{8}$$
We illustrate the calculation of PE with a simple network as shown in Figure 1. We set as 3. Considering node pair , we have and . Based on Eq. 3, we have , and . Then, we get and using Eq. 5. Finally, we obtain based on Eq. 8. Similarly, we get , , , , and . Thus, we have . Note that node pair have no common neighbors, but based on PE index their possibility of being connected is larger than node pair which have one common neighbor. This indicates that the contribution of path is greater than that of path , given that node pairs and have the same link entropy().
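To make the construction of the PE index concrete, the following is a minimal Python sketch under stated assumptions. It enumerates simple paths by depth-first search, sums their entropies with a length-dependent weight, and subtracts the link entropy of the end nodes. The link probability formula 1 - C(M - k_a, k_b) / C(M, k_b), the weight 1/(i - 1) for paths of length i, the function names, and the toy edge list are all assumptions chosen for illustration; the toy network is not the network of Figure 1.

```python
from collections import defaultdict
from math import comb, log2


def link_entropy(k_a, k_b, m):
    """Entropy (bits) of a link between nodes of degree k_a and k_b.
    ASSUMPTION: link probability 1 - C(m - k_a, k_b) / C(m, k_b)."""
    return -log2(1.0 - comb(m - k_a, k_b) / comb(m, k_b))


def simple_paths(adj, src, dst, max_len):
    """Yield every simple path from src to dst with at most max_len links."""
    stack = [[src]]
    while stack:
        path = stack.pop()
        for nxt in adj[path[-1]]:
            if nxt == dst:
                yield path + [nxt]
            elif nxt not in path and len(path) < max_len:
                stack.append(path + [nxt])


def pe_score(adj, a, b, max_len=3):
    """Path-entropy similarity: weighted path entropies minus link entropy."""
    m = sum(len(nbrs) for nbrs in adj.values()) // 2  # number of edges
    deg = {v: len(nbrs) for v, nbrs in adj.items()}
    contribution = 0.0
    for path in simple_paths(adj, a, b, max_len):
        length = len(path) - 1
        if length < 2:  # skip a direct link; only paths of length >= 2 count
            continue
        entropy = sum(link_entropy(deg[u], deg[v], m)
                      for u, v in zip(path, path[1:]))
        contribution += entropy / (length - 1)  # ASSUMED weight 1/(i - 1)
    return contribution - link_entropy(deg[a], deg[b], m)


# Hypothetical toy network.
edges = [(1, 2), (1, 3), (2, 3), (2, 4), (3, 5), (4, 5), (4, 6)]
adj = defaultdict(set)
for u, v in edges:
    adj[u].add(v)
    adj[v].add(u)

print(pe_score(adj, 1, 4, max_len=3))
```

Exhaustive enumeration of simple paths grows quickly with the maximum length, which is consistent with restricting attention to short paths in practice.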
4 Experimental results
We compare our similarity index with six mainstream indices (see Appendix A) on eleven real-world networks Watts10 (); FWFW68 (); SmaGri69 (); Poliblogs70 (); Router71 (); Yeast72 (); email73 (), the statistics of which are summarized in Table 1. In these networks, directed links are treated as undirected by ignoring their directions, and self-connections are removed. For disconnected networks, we use their largest connected components. The networks come from disparate fields. C.elegans Watts10 (): The neural network of the nematode Caenorhabditis elegans. FWFW FWFW68 (): The food web of the Florida ecosystem. SmaGri SmaGri69 (): The network of citations to Small & Griffith and Descendants. Poliblogs Poliblogs70 (): A network of US political blogs. Power SmaGri69 (): The electrical power grid of the western US. Router Router71 (): The router-level topology of the Internet. Yeast Yeast72 (): The protein-protein interaction network of yeast. Email email73 (): The network of e-mail interchanges between members of the University Rovira i Virgili (Tarragona). Kohonen SmaGri69 (): A network of articles on the topic of self-organizing maps or with references to Kohonen T. EPA SmaGri69 (): A network of web pages linking to the website www.epa.gov. SciMet SmaGri69 (): A network of articles from or citing Scientometrics.
Introductions of AUC and Precision are shown in Appendix A. The division of the training and test sets is also provided in Appendix A. The results of AUC and Precision for various similarity indices are provided in Table 2 and 3 respectively. For our PE index, and are considered in the experiment. Note that all the results are the average over independent runs.
Table 2 shows that for AUC, PE with already achieves better performance than the other mainstream similarity indices except FWFW. When , PE gets better performance than , and AUC for FWFW is greatly improved. This is generally because for PE index the more topological information used in link prediction, the better the prediction performance is. However, considering the contributions of long simple paths, which is relatively small, and the large computational cost they cause, it is reasonable to just consider short simple paths in link prediction. Note that for Poliblogs and Yeast, AUC reach more than , so it is difficult to improve their prediction accuracy. Also, the small average node degree and large average distance for Power limit the prediction accuracy.
Table 3 shows that for precision for PE is not enough, since the corresponding precision values are not better than the mainstream similarity indices. When , the precision values of PE are generally larger than the other indices. The exceptions are Power and Email(bold in Table 3) for which the Precision values of PE with are even worse than PE with .
5 Conclusion
In summary, we quantitatively study the influence of paths in link prediction by using information theory. We find that the information entropy of a path is approximately equal to the sum of its links' information entropies. Path entropy is a natural metric for quantifying the structural importance of a path in the network. We apply path entropy to link prediction problems and propose a new similarity index. Our similarity index considers the contributions of all simple paths in link prediction, measured by path entropies with a penalty on long paths. Experimental results on real-world networks demonstrate that our index generally outperforms the other mainstream similarity indices, with higher prediction accuracy measured by AUC and Precision. The reason is that most of the other similarity indices consider the number of common neighbors, node degrees, path lengths, etc. However, these metrics are relatively coarse compared with those in the information theory framework. With path entropy, we better quantify the role of paths in link prediction, and thus can design more efficient link predictors. We also believe that path entropy can be applied to other network problems such as epidemic spreading, network attacks and so on.
Appendix A Problem description and standard metrics
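Consider an undirected and unweighted network $G(V, E)$, where $V$ and $E$ are the sets of nodes and links respectively. Clearly, $G$ has $|V|(|V|-1)/2$ node pairs in total, which constitute the universal set $U$. To measure the performance of similarity indices in link prediction, $E$ is randomly divided into two parts: a training set $E^T$ and a test set $E^P$. In our experiment, $E^T$ and $E^P$ are generated with the 90/10 rule lv30 (), that is, 90% of the links form the training set and the remaining 10% form the test set. Obviously, $E = E^T \cup E^P$ and $E^T \cap E^P = \emptyset$.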
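The random division into training and test sets can be sketched as follows. This is a minimal illustration of a 90/10 split with a seeded shuffle; the function name and parameters are hypothetical, and the sketch does not check whether the training network stays connected, which the text does not specify.

```python
import random


def split_links(edges, test_fraction=0.1, seed=0):
    """Randomly split the observed links into a training set and a test set.

    Returns (training_links, test_links); the two lists are disjoint and
    together contain every input link exactly once.
    """
    rng = random.Random(seed)      # seeded for reproducible runs
    shuffled = list(edges)
    rng.shuffle(shuffled)
    n_test = int(round(test_fraction * len(shuffled)))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical example: a chain of 20 links.
edges = [(i, i + 1) for i in range(20)]
training_links, test_links = split_links(edges)
print(len(training_links), len(test_links))
```

Averaging over several independent runs corresponds to calling `split_links` with different `seed` values and averaging the resulting metrics.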
Two standard metrics, AUC and Precision, are often used in link prediction. AUC is the area under the receiver operating characteristic (ROC) curve. When calculating AUC, each node pair in $U - E^T$ is given a similarity score based on a given similarity index. Then, each time we randomly pick a link from $E^P$ and a nonexistent link from $U - E$ and compare their scores. If among $n$ independent comparisons, there are $n'$ times that the score of the link from $E^P$ is higher than that of the link from $U - E$, and $n''$ times that they have the same score, then AUC is calculated as:
$$\mathrm{AUC} = \frac{n' + 0.5\,n''}{n}.$$
Clearly, AUC should be close to 0.5 if the scores are drawn from an independent and identical distribution. Therefore, an AUC larger than 0.5 means the link prediction method is better than pure chance, and a similarity index with a larger AUC is preferable. Precision measures the prediction accuracy of the top-ranked links. If among the top $L$ links ranked by similarity scores, there are $L_r$ links belonging to $E^P$, then Precision is calculated as:
$$\mathrm{Precision} = \frac{L_r}{L}.$$
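The two metrics can be sketched in a few lines of Python. The sketch assumes that `score` is a dictionary mapping every node pair outside the training set to its similarity score, and that the test links and nonexistent links are given as lists of such pairs; these names, the sampling count, and the toy scores are illustrative assumptions.

```python
import random


def auc(score, test_links, nonexistent_links, n=10000, seed=0):
    """Sampled AUC: (n' + 0.5 * n'') / n over n random comparisons."""
    rng = random.Random(seed)
    higher = same = 0
    for _ in range(n):
        s_test = score[rng.choice(test_links)]
        s_none = score[rng.choice(nonexistent_links)]
        if s_test > s_none:
            higher += 1      # counts toward n'
        elif s_test == s_none:
            same += 1        # counts toward n''
    return (higher + 0.5 * same) / n


def precision(score, test_links, top_l):
    """Fraction of the top_l highest-scored pairs that are test links."""
    ranked = sorted(score, key=score.get, reverse=True)[:top_l]
    test_set = set(test_links)
    return sum(1 for pair in ranked if pair in test_set) / top_l


# Hypothetical scores for four candidate node pairs.
score = {("a", "b"): 0.9, ("a", "c"): 0.8, ("b", "d"): 0.1, ("c", "d"): 0.2}
test_links = [("a", "b"), ("a", "c")]
nonexistent_links = [("b", "d"), ("c", "d")]
print(auc(score, test_links, nonexistent_links))
print(precision(score, test_links, top_l=2))
```

Ties among scores near the cutoff make the top-ranked list depend on the sort order, so a real evaluation would need an explicit tie-breaking rule.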
We here introduce the six mainstream similarity indices used for comparison in our experiment.
(1) Common neighbors (CN) M37 (). This index defines the similarity score of two nodes as the number of their common neighbors, which is:
$$S^{CN}_{ab} = |\Gamma(a) \cap \Gamma(b)| = |O_{ab}|,$$
where $\Gamma(a)$ is the set of neighbors of $a$, and $O_{ab}$ is the set of common neighbors of $a$ and $b$.
(2) Resource Allocation (RA) T46 (). This index considers the degree of common neighbors with a penalty on large-degree nodes, which is:
$$S^{RA}_{ab} = \sum_{z \in O_{ab}} \frac{1}{k_z},$$
where $k_z$ is the degree of node $z$.
(3) Adamic-Adar Index (AA) L45 (). This index is similar to RA, but considers the logarithm of node degree, which is:
$$S^{AA}_{ab} = \sum_{z \in O_{ab}} \frac{1}{\log k_z}.$$
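(4) Local Naïve Bayes form of CN (LNB-CN) Liu74 (). This index weights the contributions of common neighbors by using the Naïve Bayes model, which is defined as follows:
$$S^{LNB\text{-}CN}_{ab} = \sum_{w \in O_{ab}} \left(\log s + \log R_w\right),$$
where $O_{ab}$ is the set of common neighbors of $a$ and $b$, $s = \frac{|U|}{|E^T|} - 1$ with $U$ the universal set of node pairs and $E^T$ the training set, and $R_w = \frac{N_{\triangle w} + 1}{N_{\wedge w} + 1}$. $N_{\triangle w}$ and $N_{\wedge w}$ are respectively the numbers of connected and disconnected node pairs whose common neighbors include $w$.
(5) Local Naïve Bayes form of RA (LNB-RA) Liu74 (). Similar to LNB-CN, this index combines the RA index with the Naïve Bayes model, defined as:
$$S^{LNB\text{-}RA}_{ab} = \sum_{w \in O_{ab}} \frac{1}{k_w}\left(\log s + \log R_w\right),$$
where $k_w$ is the degree of node $w$.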
(6)Mutual Information index (MI) Tan65 (). This index quantifies the contributions of common neighbors with the mutual information theory, defined as:
where is calculated with Eq. 3. is calculated as follows:
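Returning to the three purely local indices in items (1) to (3), CN, RA, and AA, a minimal Python sketch is given here. It assumes the network is stored as a dictionary of neighbor sets; the function names and the toy edge list are hypothetical. The logarithm is taken in base 2 to match the convention stated in Section 2; a different base only rescales the AA scores.

```python
from collections import defaultdict
from math import log2


def common_neighbors(adj, a, b):
    """CN: number of common neighbors of a and b."""
    return len(adj[a] & adj[b])


def resource_allocation(adj, a, b):
    """RA: sum of 1/degree over the common neighbors of a and b."""
    return sum(1.0 / len(adj[z]) for z in adj[a] & adj[b])


def adamic_adar(adj, a, b):
    """AA: sum of 1/log(degree) over the common neighbors of a and b.

    A common neighbor has degree >= 2, so the logarithm is positive.
    """
    return sum(1.0 / log2(len(adj[z])) for z in adj[a] & adj[b])


# Hypothetical toy network stored as neighbor sets.
edges = [(1, 2), (1, 3), (2, 3), (2, 4), (3, 5), (4, 5), (4, 6)]
adj = defaultdict(set)
for u, v in edges:
    adj[u].add(v)
    adj[v].add(u)

for index in (common_neighbors, resource_allocation, adamic_adar):
    print(index.__name__, index(adj, 1, 4))
```

All three indices only inspect the neighborhoods of the two end nodes, which is why local indices are the cheapest of the three groups to compute.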
- (1) S. N. Dorogovtsev, A. V. Goltsev, J. F. F. Mendes, Rev. Mod. Phys., 80 (2008) 1275.
- (2) A. Barrat, M. Barthelemy, A. Vespignani, Dynamical processes on complex networks, Cambridge University Press, 2008.
- (3) M. Newman, Networks: an introduction, Oxford University Press, 2010.
- (4) A. L. Barabási, J. Frangos, Linked: the new science of networks, Basic Books, 2014.
- (5) A. L. Barabási, R. Albert, Science 286 (1999) 509.
- (6) A. L. Barabási, E. Ravasz, T. Vicsek, Physica A 299 (2001) 559.
- (7) R. Cohen, S. Havlin, Phys. Rev. Lett. 90 (2003) 058701.
- (8) R. Albert, J. Cell Sci. 118 (2005) 4947.
- (9) J. Travers, S. Milgram, Sociometry (1969) 425.
- (10) D. J. Watts, S. H. Strogatz, Nature 393 (1998) 440.
- (11) M. E. J. Newman, J. Stat. Phys. 101 (2000) 819.
- (12) K. A. Hawick, H. A. James, Int. J. Wireless Mobile Comput. 4 (2010) 155.
- (13) K. Klemm, V. M. Eguiluz, Phys. Rev. E 65 (2002) 036123.
- (14) S. Boccaletti, V. Latora, Y. Moreno, et al, Phys. Rep. 424 (2006) 175.
- (15) M. E. J. Newman, Phys. Rev. Lett. 89 (2002) 208701.
- (16) M. Catanzaro, G. Caldarelli, L. Pietronero, Physica A 338 (2004) 119.
- (17) M. E. J. Newman, Phys. Rev. E 67 (2003) 026126.
- (18) S. Zhou, Phys. Rev. E 74 (2006) 016124.
- (19) R. Albert, H. Jeong, A. L. Barabási, Nature 406 (2000) 378.
- (20) P. Holme, B. J. Kim, C. N. Yoon, et al, Phys. Rev. E 65 (2002) 056109.
- (21) C. L. Pu, W. Cui, Physica A 419 (2015) 622.
- (22) C. Pu, S. Li, A. Michaelson, et al, Phys. Lett. A 379 (2015) 1633.
- (23) R. Pastor-Satorras, A. Vespignani, Phys. Rev. Lett. 86 (2001) 3200.
- (24) Y. Gang, Z. Tao, W. Jie, et al, Chinese Phys. Lett. 22 (2005) 510.
- (25) R. Pastor-Satorras, C. Castellano, P. Van Mieghem, et al, arXiv preprint arXiv:1408.2701, 2014.
- (26) H. X. Yang, M. Tang, Y. C. Lai, Phys. Rev. E 91 (2015) 062817.
- (27) C. Pu, S. Li, J. Yang, Physica A 432 (2015) 230.
- (28) Z. S. Shen, W. X. Wang, Y. Fan, et al, Nat. Commun. 5 (2014) 4323.
- (29) L. Lü, T. Zhou, Physica A 390 (2011) 1150.
- (30) D. Wang, D. Pedreschi, C. Song, et al, Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining, ACM, 2011, 1100.
- (31) M. Al Hasan, M. J. Zaki, A survey of link prediction in social networks, Social network data analytics, Springer US, 2011, 243.
- (32) B. Barzel, A. L. Barabási, Nat. Biotechnol. 31 (2013) 720.
- (33) L. Lü, L. Pan, T. Zhou, et al, PNAS 112 (2015) 2325.
- (34) M. E. J. Newman, Phys. Rev. E 64 (2001) 025102.
- (35) G. Kossinets, Soc. Networks 28 (2006) 247.
- (36) D. Liben-Nowell, J. Kleinberg, J. Am. Soc. Inf. Sci. Tec. 58 (2007) 1019.
- (37) C. Lei, J. Ruan, Bioinformatics 29 (2013) 355.
- (38) E. Sherkat, M. Rahgozar, M. Asadpour, Physica A 419 (2015) 80.
- (39) M. E. J. Newman, PNAS 98 (2001) 404.
- (40) D. Knoke, Emerging Trends in Social Network Analysis of Terrorism and Counterterrorism, Emerging Trends in the Social and Behavioral Sciences: An Interdisciplinary, Searchable, and Linkable Resource, 2015.
- (41) L. A. Adamic, E. Adar, Soc. Networks 25 (2003) 211.
- (42) T. Zhou, L. Lü, Y. C. Zhang, Eur. Phys. J. B 71 (2009) 623.
- (43) L. Katz, Psychometrika 18 (1953) 39.
- (44) E. A. Leicht, P. Holme, M. E. J. Newman, Phys. Rev. E 73 (2006) 026120.
- (45) L. Lü, C. H. Jin, T. Zhou, Phys. Rev. E 80 (2009) 046122.
- (46) W. Liu, L. Lü, Europhys. Lett. 89 (2010) 58007.
- (47) A. Clauset, C. Moore, M. E. J. Newman, Nature 453 (2008) 98.
- (48) S. Soundarajan, J. Hopcroft, Proceedings of the 21st international conference companion on World Wide Web, ACM, 2012, 607.
- (49) K. Anand, G. Bianconi, Phys. Rev. E 80 (2009) 045102.
- (50) R. V. Solé, S. Valverde, Information theory of complex networks: On evolution and architectural constraints, Complex networks, Springer Berlin Heidelberg, 2004, 189.
- (51) F. Passerini, S. Severini, International Journal of Agent Technologies and Systems 1 (2009) 58.
- (52) M. Bauer, D. Bernard, arXiv preprint cond-mat/0206150, 2002.
- (53) G. Bianconi, EPL 81 (2008) 28005.
- (54) G. Bianconi, Phys. Rev. E 87 (2013) 062806.
- (55) A. Halu, S. Mukherjee, G. Bianconi, Phys. Rev. E 89 (2014) 012806.
- (56) J. Gómez-Gardeñes, V. Latora, Phys. Rev. E 78 (2008) 065102.
- (57) R. Sinatra, J. Gómez-Gardeñes, R. Lambiotte, et al, Phys. Rev. E 83 (2011) 030103.
- (58) M. Rosvall, C. T. Bergstrom, PNAS 104 (2007) 7327.
- (59) G. Menichetti, G. Bianconi, G. Castellani, et al, Multiscale characterization of ageing and cancer progression by a novel network entropy measure, Molecular BioSystems, 2015.
- (60) F. Tan, Y. Xia, B. Zhu, Plos One 9 (2014) e107056.
- (61) P. M. Woodward, Probability and Information Theory, with Applications to Radar: International Series of Monographs on Electronics and Instrumentation, Elsevier, 2014.
- (62) R. E. Ulanowicz, C. Bondavalli, and M. S. Egnotovich, (1998) Network Analysis of Trophic Dynamics in South Florida Ecosystem, FY 97: The Florida Bay Ecosystem. Ref. No. [UMCES]CBL 98-123. Chesapeake Biological Laboratory, Solomons, MD 20688-0038 USA.
- (63) V. Batagelj, A. Mrvar, Pajek datasets website.
- (64) L. A. Adamic, N. Glance, in Proceedings of the WWW-2005 Workshop on the Weblogging Ecosystem, 2005.
- (65) N. Spring, R. Mahajan, D. Wetherall, ACM SIGCOMM Computer Communication Review 32 (2002) 133.
- (66) C. Von Mering, R. Krause, B. Snel, et al, Nature 417 (2002) 399.
- (67) R. Guimera, L. Danon, A. Diaz-Guilera, et al. Phys. Rev. E 68 (2003) 065103(R).
- (68) Z. Liu, Q. M. Zhang, L. Lü, et al, EPL 96 (2011) 48007.