Effective spreading from multiple leaders identified by percolation in social networks
Social networks constitute a new platform for information propagation, but its success depends crucially on the choice of spreaders who initiate the spreading of information. In this paper, we remove edges in a network at random so that the network segments into isolated clusters. The most important nodes in each cluster then form a group of influential spreaders, such that news propagating from them leads to extensive coverage and minimal redundancy. The method exploits the similarities between the pre-percolated state and the coverage of information propagation in each social cluster to obtain a set of distributed and coordinated spreaders. Our tests on the Facebook networks show that this method outperforms conventional methods based on centrality. The suggested way of identifying influential spreaders thus sheds light on a new paradigm of information propagation on social networks.
The development of social networks has had a great impact on our lifestyle, from making friends to dating, from working to shopping. They become more essential as we increasingly depend on them to gather information. Compared with search engines, which are based on isolated queries, collecting information by leveraging individual specialties in social networks leads us to useful websites from experts in disparate fields, and thus increases both the quality and the diversity of the acquired information. By the same token, influential individuals can also be used to spread information. The key to success is to identify the most influential spreaders in the network. This is difficult, however, as there are usually just a few users capable of propagating a piece of news to a large number of users Albert (). For example, while socially significant users are rare in the Twitter network, their messages and blogs can spread quickly throughout the whole network Wen2010 (); hu2014conditions ().
Although social networks are powerful for propagating information, their application for this purpose is limited, partially because a way to identify the optimal spreaders is absent. Nevertheless, simple methods have been proposed. For instance, “degree centrality” suggests that nodes with higher degree are more influential than the others Wasserman (). On the other hand, the location of a node in a network and the influence of its neighbors are also considered important. For instance, a node with a small number of highly influential neighbors located at the center of the network may be more influential than a node having a larger number of less influential neighbors. Kitsak et al. Kitsak () thus proposed a coarse-grained method that uses the k-core decomposition to quantify the influence of a node, based on the assumption that news initiated at nodes in higher shells is likely to spread more extensively. Some distance-based global metrics such as betweenness Freeman1977 () and closeness Sabidussi () have also been suggested and can lead to extensive propagation, but due to their high computational complexity they are not practical for large-scale social networks. Other centralities such as LocalRank were also suggested ChenDB ().
The above simple but sub-optimal protocols have been applied to social media such as QQ, BBS and blogs to find the key spreaders who can trigger the “tipping point” in social marketing to promote commercial products. Specifically, if one can convince a set of influential users to adopt a new product, one may induce a large cascade of purchases as these initial buyers propagate their praise of the product along the network. Unlike the aforementioned methods, which identify a set of independent spreaders according to their centralities, our goal is to find a set of coordinated individuals such that their combined impact is greatest, leading to much more extensive propagation of information. Identifying the optimal group of spreaders is, however, a computationally difficult task Kleinberg ().
In this paper, we utilize the similarities PastorSatorras (); Newman2002 () between percolation and information propagation to identify a group of influential spreaders. By removing edges at random until percolation ceases, individual isolated clusters are formed. Due to the correspondence between percolation and information transmission, the emergence of such clusters implies that news can be effectively propagated within the clusters but not across them. Initiating a piece of news at the most influential user in each cluster is thus an effective way to distribute the news within the cluster. Since this process is static and requires much less computational power than simulating the dynamic spreading of news, many percolated states can be generated to give a more accurate result on the segmentation of social clusters as well as their corresponding influential spreaders.
By testing our protocol on the Facebook and Enron email networks, we show that in addition to a lower computational complexity, our protocol outperforms other simple heuristics based on local and global centrality in terms of propagation coverage and coverage redundancy of the selected spreaders. This is consistent with the old saying that the power of a typical group exceeds that of a single most competent individual. Moreover, we find that the average degree of the users selected by our method is lower, which implies a lower cost in identifying the spreaders when compared to the other methods. We also identify the different characteristics of the spreaders who are most effective at promoting niche or popular items in order to maximize the coverage. All these results lead to insights into the design of viral marketing strategies and a new paradigm for information propagation.
Spreading dynamics involving humans can be mainly classified into two classes: one is the spreading of infectious diseases, which requires physical contact, and the other is the spreading of information, including opinions and rumors, where physical contact is not required LuNJP2011 (). Due to the similarity between epidemic and information spreading, well-established epidemic models are widely used to describe the propagation of information Sudbury1985 (); Zanette2001 (); Zanette2002 (); LiuZ2003 (); Moreno2004 ().
In particular, the susceptible-infected-recovered (SIR) model is one of the representatives. Specifically, a susceptible person (S) in the model is analogous to an individual who is not aware of the information. An infected person (I) is analogous to an individual who is aware of the information and will pass it to his/her neighbors. A recovered person (R) is analogous to an individual who loses his/her interest and will never pass the information again. Newman Newman2002 () studied in detail the relation between the static properties of the SIR model and bond percolation phenomenon on networks and remarked that the SIR model with transmissibility is equivalent to a bond percolation model with bond occupation probability on the network (see Method and Materials 4.5).
Our method is then devised in relation to the bond percolation model as follows. Given an undirected network G(V, E), where V represents the set of nodes (i.e., users in social networks) and E represents the set of edges (i.e., connections in terms of communication, friendship or other kinds of interactions), all edges are first removed and each individual edge is then recovered with a probability p, i.e. all links are removed when p = 0. As p increases from 0, more links are recovered and clusters start to form and merge with each other. We will call this state the pre-percolated state. For a network containing N nodes, a giant component of size O(N) emerges only when p is larger than a critical threshold p_c, a transition known as percolation. In the context of information propagation, since an edge between two nodes appears with probability p, the value of p can be considered as the transmissibility of a piece of information from one node to another.
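One realization of the edge-recovery step described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the implementation used in this work: the network is assumed to be given as an edge list, the recovery probability is written `p`, and the function names `percolate` and `clusters` are hypothetical. Clusters are found with a simple union-find structure.

```python
import random

def percolate(edges, p, rng):
    """Keep each edge independently with probability p (one realization)."""
    return [e for e in edges if rng.random() < p]

def clusters(nodes, kept_edges):
    """Group nodes into connected clusters using union-find, largest first."""
    parent = {v: v for v in nodes}

    def find(v):
        # Follow parent pointers to the root, shortening the path on the way
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for u, v in kept_edges:
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv

    groups = {}
    for v in nodes:
        groups.setdefault(find(v), []).append(v)
    return sorted(groups.values(), key=len, reverse=True)

if __name__ == "__main__":
    nodes = list(range(6))
    edges = [(0, 1), (1, 2), (2, 0), (3, 4), (4, 5)]
    rng = random.Random(0)
    # A toy network: with p = 0 every node is isolated, with p = 1 the
    # original two connected groups are recovered.
    print(clusters(nodes, percolate(edges, 0.5, rng)))
```

Generating many such realizations is cheap compared with simulating the spreading dynamics, since each one needs only a single pass over the edges.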
To find the influential group, we have to find the most influential spreaders with a given value of . Assume that there are percolation clusters after one realization of link recovery, and denote by the size of cluster , . We introduce a tunable parameter , which is usually equal to or larger than . If , we choose the top- largest clusters and assign one score to the largest degree node in each cluster. If there are many nodes with the largest degree, we assign the score to a random one among them. If , we first choose the top- node in each cluster, and the rest nodes are chosen to be those with the second largest degree respectively from the top- largest clusters. If , we will choose the next largest degree nodes in each cluster following the same selection rules. After times of different trials of link recovery, all nodes are ranked according to their scores in a descending order and those nodes with the highest scores are suggested to be the set of initial spreaders. For the sake of simplicity, we set and have tested and found that the results are not sensitive to . The dependence of are shown in the Supplementary Information (SI) Sec. SV Fig. S6.
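The selection and scoring rule above can be sketched in Python as follows. This is a hedged illustration, not the authors' code: the names `rank_spreaders`, `n_pick` (the tunable number of nodes scored in each realization) and `n_trials` (the number of realizations) are hypothetical, the edge-recovery probability is written `p`, and we assume that "degree" refers to the degree in the percolated network. Clusters are visited from largest to smallest, first taking the highest-degree node of each, then the second-highest, and so on, until `n_pick` nodes have been scored.

```python
import random
from collections import defaultdict

def components(nodes, adj):
    """Connected components of the percolated graph, largest first."""
    seen, comps = set(), []
    for s in nodes:
        if s in seen:
            continue
        seen.add(s)
        stack, comp = [s], []
        while stack:
            v = stack.pop()
            comp.append(v)
            for w in adj[v]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        comps.append(comp)
    return sorted(comps, key=len, reverse=True)

def rank_spreaders(nodes, edges, p, n_pick, n_trials, seed=0):
    """Score nodes over many percolation realizations and rank them."""
    rng = random.Random(seed)
    score = defaultdict(int)
    for _ in range(n_trials):
        # Recover each edge with probability p
        adj = {v: [] for v in nodes}
        for u, v in edges:
            if rng.random() < p:
                adj[u].append(v)
                adj[v].append(u)
        # Order every cluster by degree; ties are broken at random
        ordered = [sorted(c, key=lambda v: (-len(adj[v]), rng.random()))
                   for c in components(nodes, adj)]
        picked, rank = 0, 0
        while picked < n_pick:
            progressed = False
            for c in ordered:  # largest clusters first
                if picked == n_pick:
                    break
                if rank < len(c):
                    score[c[rank]] += 1
                    picked += 1
                    progressed = True
            if not progressed:  # fewer than n_pick nodes in the network
                break
            rank += 1
    ranking = sorted(nodes, key=lambda v: -score[v])
    return ranking, dict(score)

if __name__ == "__main__":
    nodes = list(range(8))
    edges = [(0, 1), (0, 2), (0, 3), (4, 5), (4, 6), (4, 7)]  # two stars
    ranking, score = rank_spreaders(nodes, edges, p=0.8, n_pick=2, n_trials=100)
    print(ranking[:2])
```

The highest-ranked nodes in `ranking` are then the suggested initial spreaders. Because the realizations are independent, the outer loop could be distributed across workers.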
In other words, our suggested method draws an analogy with percolation to identify individual social clusters in the network where news can be effectively propagated within the clusters but not across them. These isolated clusters in the pre-percolated state thus have a direct correspondence to the propagation coverage when one spreads the news from an initial spreader in each of the clusters. Our rationale is different from most other methods, which usually identify a group of influential spreaders for the network as a whole. In addition, such a set of well-distributed spreaders also enjoys a reduced redundancy when compared to a set of un-coordinated spreaders. These differences make our method unique compared to the other methods.
II.1 Spreadability and coverage redundancy
To quantify the performance of our method, we examine the spreadability, i.e. the propagation coverage of a piece of news from a set of selected spreaders, for our method as well as for other methods. We use the SIR model to mimic the spreading of news, and the spreadability is defined as the ratio of recovered nodes to the total number of nodes (i.e., the outbreak size divided by the network size). We remark that the transmissibility adopted in the SIR model is the same as the probability used to recover edges to identify the clusters in the percolated states. As a result, for a single spreader, the ultimate size of the SIR outbreak triggered by this spreader is precisely the size of the percolation cluster that it belongs to. Likewise, the ultimate size of the SIR outbreak triggered by a group of spreaders in distinct percolation clusters is the sum of the sizes of the clusters that these nodes belong to. For example, suppose we measure the coverage of three selected nodes on the network with nodes, where the first two nodes belong to the cluster , and the third one is in the cluster . The coverage of nodes 1, 2 and 3 individually is then , and respectively, while the spreadability of the three nodes as a group is .
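The counting in this example can be made concrete with a short Python sketch. The cluster labels and sizes below are hypothetical numbers chosen purely for illustration, and the function name `spreadability` and the `cluster_of` mapping are our own naming: two spreaders share a cluster of 4 nodes and a third sits in a cluster of 3 nodes, in a network of 10 nodes.

```python
from collections import Counter

def spreadability(cluster_of, spreaders):
    """Fraction of nodes reached: each covered cluster is counted once."""
    size = Counter(cluster_of.values())
    reached = {cluster_of[s] for s in spreaders}
    return sum(size[c] for c in reached) / len(cluster_of)

# Hypothetical percolated state of a 10-node network:
# cluster "A" has 4 nodes, cluster "B" has 3 nodes, the rest are isolated.
cluster_of = {0: "A", 1: "A", 2: "A", 3: "A",
              4: "B", 5: "B", 6: "B",
              7: "C", 8: "D", 9: "E"}

print(spreadability(cluster_of, [0]))        # 0.4
print(spreadability(cluster_of, [1]))        # 0.4
print(spreadability(cluster_of, [4]))        # 0.3
print(spreadability(cluster_of, [0, 1, 4]))  # 0.7
```

The two spreaders in the same cluster add nothing to each other, so the group coverage is 0.7 rather than the naive sum 1.1. This is the overlap that a coordinated choice of spreaders tries to avoid.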
We first apply our method on the Facebook network with 59691 nodes. Figure 1a shows the coverage obtained from 4000 initial spreaders chosen by our percolation method, compared with a set of 4000 spreaders identified by three other methods, namely the degree centrality, the k-shell decomposition and the betweenness centrality (see Methods and Materials for the definition of each of these methods; comparisons with other centrality measures can be found in Sec. SIII Fig. S2 of the SI). The percolation method yields the highest spreadability for an arbitrary transmissibility . Figure 1b shows the degree distribution of the 4000 spreaders identified by the percolation method.
When , the percolation method yields isolated clusters Newman2001 () (see Fig. 1d) of similar size, and since the selected spreaders come from different clusters, a wide range of degrees is found among the spreaders (see Fig. 1b). In this case, the percolation method is more likely to choose high-degree nodes (see Fig. S5b in the SI, where the red stars represent the degree distribution of the 4000 selected nodes when ). When , the distribution becomes narrower as increases (see the blue squares in Fig. S5b of the SI). In this case, the percolation method prefers low-degree spreaders. The average original degree (i.e. degree in the original network before edge removal) of the 4000 spreaders selected by the percolation method when is higher than that of the nodes selected when . This implies that to promote and advertise a new niche product which is difficult to get accepted, one can draw an analogy with the case of small transmissibility, where high-degree initial spreaders are preferred. On the other hand, for popular items which are easily accepted, one can draw an analogy with the case of large transmissibility, where low-degree initial spreaders are preferred.
We then examine the cost of identifying the initial spreaders. By assuming that the direct influence of a user is equal to the number of its nearest neighbors (i.e., its degree), while the difficulties of finding a user with degree is proportional to , the cost to find a spreader is assumed to be . Figure 1c shows the dependence of the average cost to find the 4000 spreaders under the parameter , i.e. . The cost decreases abruptly at the critical point , indicating a phenomenon resembling phase transition. It means that when increases just beyond , the cost can be reduced substantially.
Besides the spreadability and the cost, we also examine the redundancy in coverage, which quantifies the efficiency of the propagation. Specifically, the redundancy of a node is defined as the number of initial spreaders who have the potential to infect node . A method is inefficient if the initially chosen spreaders pass the same information to the same group more than once. Averaging the redundancy over all the infected nodes, we obtain the redundancy of the set of initial spreaders. Figure 1e compares the spreading redundancy of our method with the three other methods (comparisons with other centrality measures can be found in Sec. SIII Fig. S3 of the SI). The highest redundancy is found in the k-shell and degree centrality methods, followed by betweenness centrality. Our percolation method has the lowest redundancy among the four methods, since the spreaders identified by this method are usually located in different regions of the original network. We also checked the Enron e-mail network and obtained results similar to those for the Facebook network (see Supplementary Sec. II Fig. S1).
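One possible reading of this redundancy measure for a single percolation realization is sketched below in Python. It assumes that a spreader can potentially infect exactly the nodes of its own percolation cluster and that all nodes of a covered cluster (spreaders included) count as infected; both are assumptions of this sketch, as are the names `coverage_redundancy` and `cluster_of`. The set of spreaders is assumed to be non-empty.

```python
from collections import Counter

def coverage_redundancy(cluster_of, spreaders):
    """Average number of initial spreaders able to reach each covered node."""
    size = Counter(cluster_of.values())
    seeds_in = Counter(cluster_of[s] for s in spreaders)
    covered = sum(size[c] for c in seeds_in)
    # Every node of a cluster holding k spreaders has redundancy k
    return sum(size[c] * k for c, k in seeds_in.items()) / covered

# Hypothetical percolated state: cluster "A" (4 nodes), "B" (3 nodes), "C" (1 node)
cluster_of = {0: "A", 1: "A", 2: "A", 3: "A",
              4: "B", 5: "B", 6: "B", 7: "C"}

print(coverage_redundancy(cluster_of, [0, 4]))     # one spreader per cluster
print(coverage_redundancy(cluster_of, [0, 1, 4]))  # two spreaders share "A"
```

With one spreader per cluster the redundancy takes its minimum value of 1. Placing two spreaders in cluster "A" raises it to (4·2 + 3·1)/7 = 11/7 without reaching any additional node.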
To further examine the spreadability, we applied the four methods to identify four initial spreaders on a generated network with four clear communities. As shown in Fig. 2, the four spreaders identified by the percolation method are very likely to be found in different communities, with one spreader in each community. For the other methods, there are high probabilities that all or some of the initial spreaders are in the same communities. These results are easy to understand as our method relies on the segmentation of the network into isolated clusters to identify the spreaders. In the present case, the network is likely to separate into the four communities and thus one spreader is found in each community.
Most of the other methods always lead to the same set of spreaders. In our percolation method, different sets of spreaders may be generated from different realizations, especially for large . Figure 2 (a) shows the number of initial spreaders which are common among different realizations of the percolation method applied on the Facebook network. It is clear that when increases, the number of common spreaders decreases, indicating that the solutions become more and more diverse. This result has practical significance: when some of the initial spreaders are offline, we can use the next best candidate as a back-up spreader without losing spreadability. On the other hand, Fig. 2 (b) shows the entropy of the obtained solution, i.e. the logarithm of the number of different identified spreaders. Compared with the other three methods, the percolation method provides a higher flexibility in the choice of spreaders.
In order to further examine the difference between our method and the other methods, Fig. 3 shows the number of identified spreaders which are common between the percolation method and the other methods (comparisons with some other methods are found in Sec. SIII Fig. S4 of the SI). The overlap between the percolation method and the degree centrality method reaches the highest value () at the critical point and then sharply decreases to less than when . This is because when increases, most of the high-degree nodes are replaced by nodes with lower degree, and many different sets of identified spreaders are generated from the different realizations, as we have discussed in Fig. 2. What is more impressive is that by increasing the value of the cost can sharply decrease without losing spreadability or substantially increasing coverage redundancy (see Fig. 1(c) and 1(e)).
We show in Fig. 5 the relation between the spreadability and the cost. The percolation method is the most cost-effective method in terms of spreadability. Four cases are presented, namely , , and . Clearly, with the same cost, the percolation method leads to a higher spreadability than the methods of k-shell and degree centrality. Although the cost of using betweenness is low, its spreadability is very limited and becomes saturated at small cost. The percolation method has the highest saturated value of spreadability.
As we can see, social networks constitute a new platform to propagate information. Unlike the usual practice where the networks are used by uncoordinated individuals to share their own messages, intended spreading of information can indeed be implemented via the networks. To assess its performance, one can measure the coverage, the redundancy in propagation, and the cost of identifying appropriate initial spreaders. Yet these measures of performance are largely dependent on the choice of users who start the propagation, and there is no single protocol which achieves optimality in all these dimensions. These difficulties in identifying influential spreaders keep information propagation via social networks at an immature stage.
To tackle the challenge, we draw an analogy between the percolation process and information propagation to develop a protocol which gives rise to a low-cost, minimally redundant set of initial spreaders leading to a large coverage. Our protocol was tested on the Facebook network, where favorable results over all the tested centrality-based methods were obtained. When compared to these conventional methods which identify a set of un-coordinated spreaders, the spreaders identified by our protocol are evenly distributed within the network which greatly increases the propagation coverage and reduces its redundancy. Such coordination of spreaders is essential and can only be obtained using the suggested percolation procedures.
The success of this method is not just a coincidence, but it makes the best use of the similarities between percolation and the process of information propagation. By removing edges at random until percolation ceases, we identify individual isolated clusters where news can be effectively propagated within the clusters but not across the clusters. Specific spreaders at the center of these clusters are then identified to be the influential initial spreaders in the original network. By initiating news propagating from this set of spreaders, coverage is increased and redundancy is reduced compared to the conventional centrality methods. Percolation is thus at the center of our propagation protocol instead of a mere analogy.
The remaining question is practicality. As we have discussed, the computational complexity of our protocol scales linearly with the system size, which is a favorable characteristic for applications on practical systems. Once the set of important initial spreaders is identified, a coordinator just has to connect to these users and pass the news to them, and information will then propagate quickly throughout the network. Of course, a lot of details and practical difficulties are omitted in this simple description, but our results have led to insights into a completely new paradigm of information propagation. Further research along this line may revolutionize our way of spreading and gathering information in the near future.
IV Methods and Materials
IV.1 Baseline methods
To identify the most influential spreaders, various centrality measures have been proposed. The first method with which we compare our results is degree centrality. Degree centrality is a straightforward and efficient metric. It assumes that a node with more nearest neighbors has a higher influence. However, the degree of a node reflects only its direct influence and not the indirect influence triggered by its nearest neighbors. For example, a node of small degree but with a few highly influential neighbors may be more influential than a node having a larger number of less influential neighbors.
The second method we used for comparison is the k-shell decomposition. Recent research shows that the location of a node in a network may play a more important role than its degree. A node located in the center of the network can be more influential than a node having a larger number of less influential neighbors. Following this rationale, Kitsak et al. Kitsak () proposed a coarse-grained method that uses the k-core decomposition to quantify the influence of a node, based on the assumption that nodes in the same shell have similar influence, and nodes in higher-level shells are likely to infect more nodes.
In the last method, we employ global information to identify the influential spreaders. Specifically, betweenness is one of the most popular geodesic-path-based ranking measures. It is defined as the fraction of shortest paths between all node pairs that pass through the node of interest. Betweenness is, in some sense, a measure of the influence of a node in terms of its role in spreading information Guimera2002 (); Yan2006 (). For a network with $N$ nodes and $M$ edges, the betweenness centrality of node $i$, denoted by $B(i)$, is Freeman1977 (); Freeman1979 ()

$$B(i) = \sum_{s \neq i \neq t} \frac{\sigma_{st}(i)}{\sigma_{st}},$$

where $\sigma_{st}$ is the number of shortest paths between nodes $s$ and $t$, and $\sigma_{st}(i)$ denotes the number of shortest paths between nodes $s$ and $t$ which pass through node $i$.
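For a small unweighted, undirected network, betweenness as defined above can be computed with the following Python sketch. It uses the accumulation scheme of Brandes' algorithm, which is a standard technique swapped in here for illustration and is not taken from this paper; the function name `betweenness` and the adjacency-dictionary input are assumptions of the sketch. Each unordered node pair is counted once, which differs from a sum over ordered pairs by a constant factor of 2 and does not affect the ranking of nodes.

```python
from collections import deque

def betweenness(adj):
    """Shortest-path betweenness of every node of an undirected graph."""
    bc = {v: 0.0 for v in adj}
    for s in adj:
        # Breadth-first search from s, counting shortest paths (sigma)
        stack = []
        preds = {v: [] for v in adj}
        sigma = {v: 0 for v in adj}
        dist = {v: -1 for v in adj}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Back-propagate pair dependencies in order of decreasing distance
        delta = {v: 0.0 for v in adj}
        while stack:
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # Each unordered pair (s, t) was visited from both endpoints
    return {v: b / 2 for v, b in bc.items()}

if __name__ == "__main__":
    star = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
    print(betweenness(star))  # the hub lies on all 3 leaf-to-leaf paths
```

The cost of this computation grows with the product of the number of nodes and the number of edges, which illustrates why betweenness is impractical for large-scale social networks.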
IV.2 Computational complexity
Given a network , there are four steps to find the influential spreaders by the percolation method. Firstly, all the edges are first removed and then recovered with a probability ; we then obtain a new network . The required computational complexity is . Secondly, we find the strongly connected components of using Tarjan’s algorithm Tarjan () which has a complexity of . Thirdly, we select one node with the highest degree in each of the largest components and assign one score to the selected nodes. This complexity for the procedures is . Repeating the above three steps for different realizations, we rank the nodes according to their scores in descending order, and the top- nodes are chosen to be the most influential spreaders. The different realizations of the percolation process can be computed in parallel and the complexity of each implementation is . Considering , then the complexity is . Since in real networks, then we have , i.e. the complexity of our method grows linearly with system size.
IV.3 Model networks with community structures
There are three steps to generate a network with community structures. In our experiment, we consider a network with 2000 nodes consisting of four communities, each of which contains 500 nodes. First, we generate a random network of size 500 with node degrees distributed as a power law with exponent 2.2 using the configuration model catanzaro05 (). The minimum degree is 1 and the maximum degree is Dorogovtsev2101 (). Second, we repeat the above procedure to independently generate the other three networks. Finally, for each pair of sub-networks we randomly select a fraction of node pairs to connect them.
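The three steps can be sketched in Python as follows. This is an illustrative sketch only. The maximum degree `k_max=22` and the inter-community fraction `frac=0.001` are placeholder values chosen for the example, not the values used in this work, and the function names are hypothetical. Self-loops and repeated edges produced by the random stub matching are simply discarded, so the realized degrees can be slightly lower than the sampled ones.

```python
import random

def powerlaw_config_graph(n, gamma, k_min, k_max, rng, offset=0):
    """Configuration-model graph with degrees drawn from P(k) ~ k^(-gamma)."""
    ks = range(k_min, k_max + 1)
    weights = [k ** -gamma for k in ks]
    degrees = rng.choices(ks, weights=weights, k=n)
    if sum(degrees) % 2:  # the number of stubs must be even
        degrees[0] += 1
    stubs = [offset + v for v, d in enumerate(degrees) for _ in range(d)]
    rng.shuffle(stubs)
    edges = set()
    for u, v in zip(stubs[::2], stubs[1::2]):
        if u != v:  # drop self-loops; the set drops repeated edges
            edges.add((min(u, v), max(u, v)))
    return edges

def community_network(n_comm=4, size=500, gamma=2.2, k_min=1,
                      k_max=22, frac=0.001, seed=0):
    """Independent scale-free communities joined by random cross links."""
    rng = random.Random(seed)
    edges = set()
    for c in range(n_comm):
        edges |= powerlaw_config_graph(size, gamma, k_min, k_max, rng,
                                       offset=c * size)
    n_cross = round(frac * size * size)  # node pairs linked per community pair
    for a in range(n_comm):
        for b in range(a + 1, n_comm):
            cross = set()
            while len(cross) < n_cross:
                cross.add((a * size + rng.randrange(size),
                           b * size + rng.randrange(size)))
            edges |= cross
    return n_comm * size, edges

if __name__ == "__main__":
    n, edges = community_network()
    print(n, len(edges))
```

Node labels encode the community (`label // size`), which makes it easy to check in which community each identified spreader lies.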
Datasets we used are described in Sec. I of the SI and the statistical features of the real networks are summarised in Table S1.
IV.5 SIR model and bond percolation
The susceptible-infected-recovered (SIR) model Anderson1992 () is commonly used to mimic the spreading of diseases. Individuals in this model are classified into three states: susceptible (S, does not carry the disease and will not infect others but can be infected), infected (I, carries the disease and can infect others), and recovered (R, either dead or recovered from the disease and immune to further infection). The simulation runs in discrete time steps. At each time step, infected individuals transmit the disease to their neighbors with probability and recover with probability . Then the SIR transmissibility is . The process stops when no infected node remains.
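A minimal discrete-time simulation of the SIR dynamics described above might look as follows in Python. The parameter names `beta` (infection probability per contact and time step) and `mu` (recovery probability per time step, assumed positive so that the process terminates) are assumptions of this sketch, as is the function name `sir_outbreak`. All infection attempts in a time step are resolved before recoveries are applied.

```python
import random

def sir_outbreak(adj, seeds, beta, mu, rng):
    """Run SIR until no infected node remains; return the recovered fraction."""
    state = {v: "S" for v in adj}
    for s in seeds:
        state[s] = "I"
    infected = set(seeds)
    while infected:
        # Each infected node tries to infect each susceptible neighbor
        newly = set()
        for v in infected:
            for w in adj[v]:
                if state[w] == "S" and rng.random() < beta:
                    newly.add(w)
        # Infected nodes recover with probability mu
        recovered = {v for v in infected if rng.random() < mu}
        for v in recovered:
            state[v] = "R"
        for w in newly:
            state[w] = "I"
        infected = (infected - recovered) | newly
    return sum(1 for s in state.values() if s == "R") / len(adj)

if __name__ == "__main__":
    rng = random.Random(0)
    ring = {i: [(i - 1) % 20, (i + 1) % 20] for i in range(20)}
    runs = [sir_outbreak(ring, [0], beta=0.5, mu=1.0, rng=rng)
            for _ in range(1000)]
    print(sum(runs) / len(runs))  # average spreadability from a single seed
```

Averaging the returned fraction over many runs gives the spreadability of a set of initial spreaders `seeds`.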
The SIR model can be mapped to a bond percolation model where each link exists with a probability equal to the SIR transmissibility Newman2002 (). After removing the other edges, a number of clusters are formed. The ultimate size of the SIR epidemic outbreak triggered by a single initially infected node is precisely the size of the percolation cluster that the initial node belongs to. Consequently, nodes in the same cluster are expected to have the same coverage. A review article on epidemic processes in complex networks can be found in Ref. PastorSatorras ().
This work is partially supported by the NSFC Grant Nos. 61203156, 11205042, and the Fundamental Research Funds for the Central Universities Grant No. 2682014RC17. LL acknowledges the research start-up fund of Hangzhou Normal University under Grant PE13002004039 and the EU FP7 Grant 611272 (project GROWTHCOM). CHY acknowledges the Internal Research Grant RG 71/2013-2014R of the Hong Kong Institute of Education.
-  Réka Albert, Hawoong Jeong, and Albert-László Barabási. Error and attack tolerance of complex networks. Nature, 406(6794):378–382, 2000.
-  Jianshu Weng, Ee-Peng Lim, Jing Jiang, and Qi He. Twitterrank: finding topic-sensitive influential twitterers. In Proceedings of the third ACM international conference on Web search and data mining, pages 261–270. ACM, 2010.
-  Yanqing Hu, Shlomo Havlin, and Hernán A Makse. Conditions for viral influence spreading through multiplex correlated social networks. Physical Review X, 4(2):021031, 2014.
-  Stanley Wasserman. Social network analysis: Methods and applications, volume 8. Cambridge University Press, 1994.
-  Maksim Kitsak, Lazaros K. Gallos, Shlomo Havlin, Fredrik Liljeros, Lev Muchnik, H. Eugene Stanley, and Hernan A. Makse. Identification of influential spreaders in complex networks. Nature Physics, 6(11):888–893, 2010.
-  Linton C Freeman. A set of measures of centrality based on betweenness. Sociometry, pages 35–41, 1977.
-  Gert Sabidussi. The centrality index of a graph. Psychometrika, 31(4):581–603, 1966.
-  Duanbing Chen, Linyuan Lü, Ming-Sheng Shang, Yi-Cheng Zhang, and Tao Zhou. Identifying influential nodes in complex networks. Physica A: Statistical Mechanics and its Applications, 391(4):1777–1787, 2012.
-  David Kempe, Jon Kleinberg, and Éva Tardos. Maximizing the spread of influence through a social network. In Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 137–146. ACM, 2003.
-  Romualdo Pastor-Satorras, Claudio Castellano, Piet Van Mieghem, and Alessandro Vespignani. Epidemic processes in complex networks. arXiv preprint arXiv:1408.2701, 2014.
-  Mark EJ Newman. Spread of epidemic disease on networks. Physical review E, 66(1):016128, 2002.
-  Linyuan Lü, Duan-Bing Chen, and Tao Zhou. The small world yields the most effective information spreading. New Journal of Physics, 13(12):123005, 2011.
-  Aidan Sudbury. The proportion of the population never hearing a rumour. Journal of applied probability, pages 443–446, 1985.
-  Damian H Zanette. Critical behavior of propagation on small-world networks. Physical Review E, 64(5):050901, 2001.
-  Damian H Zanette. Dynamics of rumor propagation on small-world networks. Physical review E, 65(4):041908, 2002.
-  Zonghua Liu, Ying-Cheng Lai, and Nong Ye. Propagation and immunization of infection on general networks with both homogeneous and heterogeneous components. Physical Review E, 67(3):031911, 2003.
-  Yamir Moreno, Maziar Nekovee, and Amalio F Pacheco. Dynamics of rumor spreading in complex networks. Physical Review E, 69(6):066130, 2004.
-  Mark EJ Newman, Steven H Strogatz, and Duncan J Watts. Random graphs with arbitrary degree distributions and their applications. Physical review E, 64(2):026118, 2001.
-  Roni Parshani, Sergey V Buldyrev, and Shlomo Havlin. Critical effect of dependency groups on the function of networks. Proceedings of the National Academy of Sciences, 108(3):1007–1010, 2011.
-  Roger Guimerà, Albert Diaz-Guilera, Fernando Vega-Redondo, Antonio Cabrales, and Alex Arenas. Optimal network topologies for local search with congestion. Physical Review Letters, 89(24):248701, 2002.
-  Gang Yan, Tao Zhou, Bo Hu, Zhong-Qian Fu, and Bing-Hong Wang. Efficient routing on complex networks. Physical Review E, 73(4):046108, 2006.
-  Linton C Freeman. Centrality in social networks conceptual clarification. Social networks, 1(3):215–239, 1979.
-  Robert Tarjan. Depth-first search and linear graph algorithms. SIAM journal on computing, 1(2):146–160, 1972.
-  Michele Catanzaro, Marián Boguñá, and Romualdo Pastor-Satorras. Generation of uncorrelated random scale-free networks. Physical Review E, 71(2):027103, 2005.
-  S. N. Dorogovtsev, J. F. F. Mendes, and A. N. Samukhin. Size-dependent degree distribution of a scale-free growing network. Phys. Rev. E, 63:062101, May 2001.
-  Roy M Anderson, RM May, and B Anderson. Infectious diseases of humans: dynamics and control. Australian Journal of Public Health, 16:208–212, 1992.