An Efficient Randomized Algorithm for Rumor Blocking in Online Social Networks
Social networks allow rapid spread of ideas and innovations, but negative information can also propagate widely. When a user receives two opposing opinions, they tend to believe the one that arrives first. Therefore, once misinformation or a rumor is detected, one containment method is to introduce a positive cascade competing against the rumor. Given a budget $k$, the rumor blocking problem asks for $k$ seed users to trigger the spread of a positive cascade such that the number of users who are not influenced by the rumor is maximized. Prior works have shown that the rumor blocking problem can be approximated within a factor of by a classic greedy algorithm combined with Monte Carlo simulation. Unfortunately, Monte Carlo simulation based methods are time consuming, and the existing algorithms either trade performance guarantees for practical efficiency or vice versa. In this paper, we present a randomized approximation algorithm which is provably superior to the state-of-the-art methods with respect to running time. The superiority of the proposed algorithm is demonstrated by experiments on both real-world and synthetic social networks.
The tremendous advance of the Internet of Things (IoT) is making online social networks the most common platform for communication. There are 44.5 million users on Twitter and 1.4 million monthly active users on Facebook. Although online social networks are greatly beneficial, they also lead to the wide spread of negative information. Such negative influence, namely misinformation and rumor, has been a cause of concern as it renders the network unreliable and may cause panic in the population. For example, misinformation about swine flu on Twitter threw people in Texas and Kansas into panic in 2009, and the endless reports of Ebola in 2014 caused unnecessary worldwide terror. Therefore, effective strategies for rumor containment are crucial for social networks, and the topic has attracted much attention in recent decades.
In a social network, information and innovations diffuse from user to user via influence cascades, where each cascade starts to spread from a set of seed users. When two cascades holding opposing views reach a certain user, the user is likely to trust the one arriving first. For example, if international institutions such as the WHO had posted clarifications about swine flu, the users who read such posts would not have been misled by the misinformation. Therefore, a common method for rumor blocking is to generate a corresponding positive cascade that competes against the rumor. Due to the expense of deploying seed nodes, there is a budget for the positive cascade, and naturally one selects the positive seed nodes that limit the spread of rumor as much as possible, which is referred to as the least cost rumor blocking problem.
The recent study of influence diffusion in social networks can be traced back to D. Kempe, where the well-known influence maximization problem is formulated. In that seminal work, two fundamental probabilistic operational models, the independent cascade (IC) model and the linear threshold (LT) model, are developed. Based on such models, many influence-related problems have since been proposed and studied. The problem of rumor blocking is also considered in such models or in their variants. Most existing approaches utilize the submodularity of the objective function. A set function $f$ over a ground set $U$ is said to be submodular if
$$f(A \cup \{v\}) - f(A) \ge f(B \cup \{v\}) - f(B)$$
for any $A \subseteq B \subseteq U$ and $v \in U \setminus B$. It turns out that the number of non-rumor-activated users is a monotone increasing submodular function and consequently the classic hill-climbing algorithm provides a $(1-1/e)$-approximation. For example, X. He et al. formulate the influence blocking maximization problem and present a $(1-1/e)$-approximation algorithm for the competitive linear threshold model, Budak et al. propose several competitive models and show a greedy algorithm with the same approximation ratio under the campaign-oblivious independent cascade model, and Fan et al. provide a $(1-1/e)$-approximation algorithm for the rumor blocking problem under the opportunistic one-active-one model.
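To make the diminishing-returns property concrete, the following minimal sketch brute-forces the inequality $f(A \cup \{v\}) - f(A) \ge f(B \cup \{v\}) - f(B) \ge 0$ on a small coverage function. The function names and the toy instance are illustrative assumptions, not part of the paper.

```python
from itertools import combinations

def coverage(sets, chosen):
    """Number of elements covered by the chosen subsets (a classic submodular function)."""
    covered = set()
    for i in chosen:
        covered |= sets[i]
    return len(covered)

def is_monotone_submodular(f, ground):
    """Brute-force check of f(A+v)-f(A) >= f(B+v)-f(B) >= 0 for all A subset of B, v not in B."""
    ground = list(ground)
    subsets = [frozenset(c) for r in range(len(ground) + 1)
               for c in combinations(ground, r)]
    for B in subsets:
        for A in subsets:
            if not A <= B:
                continue
            for v in ground:
                if v in B:
                    continue
                gain_a = f(A | {v}) - f(A)
                gain_b = f(B | {v}) - f(B)
                if gain_b < 0 or gain_a < gain_b:
                    return False
    return True

# Hypothetical toy instance: four subsets of the universe {1, ..., 6}.
toy_sets = {0: {1, 2, 3}, 1: {3, 4}, 2: {4, 5, 6}, 3: {1, 6}}
cover = lambda chosen: coverage(toy_sets, chosen)
squared_size = lambda chosen: len(chosen) ** 2  # monotone but not submodular

print(is_monotone_submodular(cover, toy_sets.keys()))
print(is_monotone_submodular(squared_size, toy_sets.keys()))
```

The check is exponential in the size of the ground set and is only meant for building intuition on tiny instances.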
Assuming that the objective function can be efficiently calculated for any input (efficiency here usually refers to polynomial-time computability), the greedy algorithm is simple and effective for most submodular maximization problems. Unfortunately, for influence-related optimization problems, the objective functions are often very complicated to compute due to the randomness of the probabilistic diffusion model. Such a scenario was first observed by W. Chen, where it is shown that computing the exact value of the expected influence is #P-hard. To circumvent this difficulty, prior works employ Monte Carlo simulation to estimate the objective value for any input. However, such a method is computationally expensive. It turns out that the greedy algorithm with Monte Carlo simulation has the time complexity to achieve a approximation ratio, and it takes several hours even on very small networks. With the recent analysis of influence diffusion [8, 9, 10], the difficulty in solving such problems has shifted from the node selection strategy to the calculation of the objective function. Fundamentally, it asks for a better sampling method to estimate the expected influence. To the best of our knowledge, there is no rumor blocking algorithm that achieves practical efficiency without sacrificing the performance guarantee.
In this paper, we present an efficient randomized algorithm for rumor blocking, which is termed the reverse-tuple (R-tuple) based randomized (RBR) algorithm. The RBR algorithm runs in and returns a approximation with a provable probability. The proposed algorithm utilizes the R-tuple based sampling method, which is more effective than the Monte Carlo simulation used in prior works. The reverse sampling technique was first designed by C. Borgs for the influence maximization problem. In this paper, we develop a new type of sampling based on the concept of R-tuple, and show how such a sampling method can be applied to the rumor blocking problem. Although both sampling methods give an unbiased estimate, one set of Monte Carlo simulations can only provide an estimate for a specified seed set, while the samples produced by the R-tuple based sampling can be applied to any seed set. The RBR algorithm can be implemented with tunable parameters and is flexible in balancing the running time and the error probability. We experimentally evaluate the proposed algorithm on both real-world social networks and synthetic power-law networks. The experimental results show that the RBR algorithm not only produces a high quality positive seed set but also takes much less time than the greedy algorithm with Monte Carlo simulation does. In particular, when and the error probability is set as less than where is the number of users, the RBR algorithm is at least 1,000 times faster than the state-of-the-art approach. The contributions of this paper are summarized as follows:
We develop the reverse-tuple based sampling method which can be used to obtain an unbiased estimate for the objective function of rumor blocking problem.
Based on the new sampling technique, we design the RBR approximation algorithm which is effective and efficient for blocking rumors under IC model.
We evaluate the proposed algorithm via experiments and show that the RBR algorithm outperforms the existing methods by a significant magnitude in terms of the running time.
2 Related Work
In this section, we survey the prior works regarding rumor controlling.
C. Budak et al. are among the first to study the misinformation containment problem. In particular, they consider the multi-campaign independent cascade model and investigate the problem of identifying a subset of individuals that need to be convinced to adopt the “good” campaign so as to minimize the number of people that adopt the rumor. X. He et al. and L. Fan et al. further study this problem under the competitive linear threshold model and the OPOAO model, respectively. S. Li et al. later formulate the rumor restriction problem and show a -approximation. As mentioned earlier, the existing approaches are time consuming and thus cannot handle large social networks. Recently, several heuristic methods have been proposed, such as [12, 13], but they cannot provide a performance guarantee. In this paper, we aim to design a rumor blocking algorithm which is provably effective and also efficient.
Rumor source detection is another important problem for rumor controlling. The prior works primarily focus on the susceptible-infected-recovered (SIR) model where the nodes can be infected by rumor and may recover later. Shah et al.  provide a systematic study and design a rumor source estimator based upon the concept of rumor centrality. Z. Wang et al.  propose a unified inference framework based on the union rumor centrality.
Rumor detection aims to distinguish rumor from genuine news. Leskovec et al.  develop a framework for tracking the spread of misinformation and observe a set of persistent temporal patterns in the news cycle. Ratkiewicz et al.  build a machine learning framework to detect the early stages of viral spreading of political misinformation. In , Qazvinian et al. address the rumor detection problem by exploring the effectiveness of three categories of features: content-based, network-based, and microblog-specific memes. Takahashi et al.  study the characteristics of rumor and design a system to detect rumor on Twitter.
3 System Model
In what follows we provide the preliminaries to the rest of this paper. The important notations are listed in Table I.
3.1 Influence Model
A social network is represented by a directed graph where denotes the user set and user is a friend of iff . Let and be the number of nodes and edges, respectively. We denote by the set of in-neighbors of . For a network , let and be the edge-set and node-set of , respectively. In order to spread an idea or to advertise a new product in a social network, some seed nodes are chosen to be activated to trigger the spread of influence. The diffusion process terminates when no user can be further activated. In this paper, we adopt the following model.
Independent Cascade (IC) Model. Associated with each edge there is a propagation probability . When node becomes active at time , it attempts to activate each inactive neighbor at time step with a success probability of . For each pair of nodes and , has only one chance to activate .
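As a minimal sketch of the single-cascade IC process described above, the following function simulates one diffusion run. The adjacency format (`out_edges[u]` as a list of `(v, p)` pairs) and the function name are assumptions made for illustration. The sketch processes activations through a queue rather than explicit time steps, which yields the same distribution of the final active set for a single cascade.

```python
import random
from collections import deque

def simulate_ic(out_edges, seeds, rng):
    """One run of the IC process; out_edges[u] is a list of (v, p_uv) pairs."""
    active = set(seeds)
    frontier = deque(seeds)
    while frontier:
        u = frontier.popleft()
        for v, p in out_edges.get(u, []):
            # u gets exactly one chance to activate each inactive out-neighbor.
            if v not in active and rng.random() < p:
                active.add(v)
                frontier.append(v)
    return active

# Hypothetical toy network: edge probabilities of 1.0 and 0.0 make the run deterministic.
toy_out_edges = {"a": [("b", 1.0), ("c", 0.0)], "b": [("d", 1.0)]}
print(sorted(simulate_ic(toy_out_edges, ["a"], random.Random(0))))
```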
|instance of IC network.|
|and||number of nodes and edges.|
|the set of in-neighbors of|
|propagation probability of edge .|
|the probability that can be generated.|
|the seed set of rumor.|
|the seed set of positive cascade.|
|objective function of rumor blocking.|
|the budget of the seed set of positive cascade.|
|a random R-tuple of .|
|the set of all possible random R-tuples of .|
|a concrete R-tuple of in .|
|the probability that can be generated by Alg. 1.|
|a random R-tuple.|
3.2 Rumor and Competing Cascade
Note that the IC model is originally designed for single cascade diffusion. Suppose there are multiple cascades, each of which is generated by its own seed set. In the network, each node is initially inactive and never changes its state once activated by one cascade. Therefore, the cascade arriving first will dominate the node. In order to limit the spread of rumor, we introduce a competing cascade denoted as the positive cascade. At each time step, if a node is successfully activated by two or more neighbors belonging to different cascades, it will select the one with the highest priority. We assume that rumor has the higher priority, because rumor always polishes itself to be convincing. We denote by and the seed sets of rumor and the competing positive cascade, respectively. The diffusion process unfolds in discrete time steps, as follows.
Initially all the nodes are inactive.
At time 0, nodes in and are activated by rumor and positive cascade, respectively. (Since our goal is to limit the spread of rumor and rumor has the higher priority, we can assume that without loss of generality.)
At time , each node which is activated at attempts to activate each of its inactive neighbors with a success probability of . If node is successfully activated by the two cascades simultaneously at time , will be activated by rumor.
The diffusion process terminates when no node can be further activated.
An example is shown in Fig. 1, where there are five nodes and the propagation probability of each edge is 1. In this example, and are selected as the seed nodes of rumor and positive cascade, respectively. At time step 1, and activate simultaneously and is finally activated by rumor as rumor has the higher priority.
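The two-cascade process above can be sketched as a synchronous simulation in which rumor wins any tie within a time step. The function name, the data layout, and the five-node toy graph below are hypothetical and are not claimed to match the graph of Fig. 1. As an additional assumption, a node listed in both seed sets is treated as a rumor seed.

```python
import random

RUMOR, POSITIVE = "rumor", "positive"

def simulate_competition(out_edges, rumor_seeds, positive_seeds, rng):
    """Synchronous two-cascade IC diffusion; returns {node: cascade}."""
    state = {v: POSITIVE for v in positive_seeds}
    state.update({v: RUMOR for v in rumor_seeds})  # assumption: rumor wins overlapping seeds
    frontier = list(state)
    while frontier:
        newly = {}
        for u in frontier:
            for v, p in out_edges.get(u, []):
                if v in state:
                    continue  # nodes never change state once activated
                if rng.random() < p:
                    # Rumor has priority when both cascades succeed in the same step.
                    if state[u] == RUMOR or v not in newly:
                        newly[v] = state[u]
        state.update(newly)
        frontier = list(newly)
    return state

# Hypothetical five-node example with all probabilities equal to 1.
toy_graph = {"r": [("a", 1.0)], "p": [("a", 1.0), ("c", 1.0)], "a": [("b", 1.0)]}
final_state = simulate_competition(toy_graph, ["r"], ["p"], random.Random(0))
print(final_state)
```

In this toy run, node `a` is reached by both cascades at time step 1 and therefore ends up with rumor, which in turn passes rumor on to `b`.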
3.3 The problem
Given an IC network and the seed set of rumor, let be the expected number of nodes that are not activated by rumor when is selected as the seed set of the positive cascade. Given a budget , the rumor blocking problem considered in this paper is given as follows.
Find a seed set with at most nodes such that is maximized.
It is well-known that this problem is NP-hard.
For , let and . Since no polynomial-time exact algorithm exists unless P = NP, we aim to design approximation algorithms.
In this section, we introduce the concept of realization which provides a fundamental understanding of IC model.
Given an IC network , a realization of is an IC network where and is a subset of where each edge in has the propagation probability of . The edge set is constructed at random. For each edge in , we generate a random number uniformly from 0 to 1. Edge appears in if and only if . Let be the set of all possible realizations of . One can see that there are realizations in .
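Let $\Pr[g]$ be the probability that realization $g$ can be generated. By Def. 1, since the edges are tested independently,
$$\Pr[g] = \prod_{e \in E(g)} p_e \prod_{e \in E \setminus E(g)} (1 - p_e),$$
where $E$ is the edge set of the original network, $E(g)$ is the edge set of $g$, and $p_e$ is the propagation probability of edge $e$.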
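The sampling of a realization and the probability of obtaining a particular one can be sketched as follows. The helper names and the edge dictionary format are assumptions. The probability is computed as the product of $p_e$ over live edges and $1 - p_e$ over the remaining edges, which is what independent edge tests imply.

```python
import random
from itertools import combinations

def sample_realization(edges, rng):
    """edges maps (u, v) -> p_uv. Returns the set of live edges of one realization."""
    # rng.random() lies in [0, 1), so p = 1 is always live and p = 0 never is.
    return {e for e, p in edges.items() if rng.random() < p}

def realization_probability(edges, live):
    """Probability that exactly the edges in `live` survive the independent tests."""
    prob = 1.0
    for e, p in edges.items():
        prob *= p if e in live else (1.0 - p)
    return prob

def all_realizations(edges):
    """Enumerate every subset of the edge set (exponential; for tiny graphs only)."""
    keys = list(edges)
    for r in range(len(keys) + 1):
        for live in combinations(keys, r):
            yield set(live)

# Hypothetical three-edge network.
toy_edges = {("a", "b"): 0.3, ("b", "c"): 0.5, ("a", "c"): 0.9}
print(sample_realization(toy_edges, random.Random(0)))
print(sum(realization_probability(toy_edges, g) for g in all_realizations(toy_edges)))
```

Summing the probabilities over all subsets of edges gives 1 up to floating-point error, which is a quick sanity check on the product form.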
Intuitively, a realization is a deterministic IC network. Given the seed sets and , the following two diffusion processes are equivalent with respect to .
Execute the stochastic diffusion process on with and .
Randomly generate a realization of and execute the deterministic diffusion process on with and .
Therefore, can be expressed as
In order to maximize , one naturally asks in which realizations is equal to 1. For a realization , let be the length of the shortest path from node to node in , and, for a node set , define . A key lemma is as follows.
For a realization and , a node will be activated by rumor in under (i.e. ) if and only if and .
See Appendix A.1. ∎
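The lemma can be checked empirically on small deterministic graphs. The sketch below assumes, in line with the tie-breaking rule and with the two claims used in the appendix proof, that the lemma's condition reads: the rumor seeds can reach the node, and the distance from the rumor seeds is no larger than the distance from the positive seeds. All function names are illustrative.

```python
import random
from collections import deque

INF = float("inf")

def bfs_dist(out_adj, sources):
    """Hop distance from the nearest source to every reachable node."""
    dist = {s: 0 for s in sources}
    queue = deque(sources)
    while queue:
        u = queue.popleft()
        for v in out_adj.get(u, []):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def rumor_activated_by_diffusion(out_adj, rumor_seeds, positive_seeds):
    """Deterministic synchronous diffusion on a realization; rumor wins ties."""
    owner = {v: "P" for v in positive_seeds}
    owner.update({v: "R" for v in rumor_seeds})
    frontier = list(owner)
    while frontier:
        newly = {}
        for u in frontier:
            for v in out_adj.get(u, []):
                if v not in owner and (owner[u] == "R" or v not in newly):
                    newly[v] = owner[u]
        owner.update(newly)
        frontier = list(newly)
    return {v for v, c in owner.items() if c == "R"}

def rumor_activated_by_distance(out_adj, rumor_seeds, positive_seeds):
    """Assumed lemma condition: finite rumor distance that is <= positive distance."""
    d_r = bfs_dist(out_adj, rumor_seeds)
    d_p = bfs_dist(out_adj, positive_seeds)
    return {v for v in d_r if d_r[v] <= d_p.get(v, INF)}

def agree_on_random_graphs(trials, seed=0):
    """Compare both characterizations on small random realizations."""
    rng = random.Random(seed)
    nodes = list(range(10))
    for _ in range(trials):
        adj = {u: [v for v in nodes if v != u and rng.random() < 0.2] for u in nodes}
        seeds = rng.sample(nodes, 4)
        r, p = seeds[:2], seeds[2:]
        if rumor_activated_by_diffusion(adj, r, p) != rumor_activated_by_distance(adj, r, p):
            return False
    return True

print(agree_on_random_graphs(200))
```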
As a corollary of Lemma 1, our objective function is monotone submodular.
is monotone submodular with respect to .
According to Eq. (2), it suffices to prove that is monotone submodular for each and . It is clearly monotone as does not increase when more nodes are added into . Now we prove that
holds for any and . Since is either 0 or 1, we only need to prove that
when is equal to 1. When , and , which means, by Lemma 1, , and . Therefore, and consequently . Furthermore, because is a subset of and . As a result, is also equal to 1. ∎
According to Corollary 1, it seems that we can use the greedy algorithm to maximize according to Eq. (2). Unfortunately, there is an exponential number of realizations in to consider and therefore the greedy algorithm does not run in polynomial time. Alternatively, we utilize the reverse sampling technique to obtain an estimate of and then maximize the estimate.
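For contrast with reverse sampling, the forward Monte Carlo estimate used by prior greedy approaches can be sketched as follows: run the stochastic two-cascade diffusion many times and average the number of nodes not activated by rumor. The function names, the edge dictionary format, and the default seed are assumptions. Note that each estimate is tied to one specific positive seed set.

```python
import random

def rumor_free_count(edges, nodes, rumor_seeds, positive_seeds, rng):
    """One stochastic run; returns the number of nodes not activated by rumor."""
    out = {}
    for (u, v), p in edges.items():
        out.setdefault(u, []).append((v, p))
    owner = {v: "P" for v in positive_seeds}
    owner.update({v: "R" for v in rumor_seeds})
    frontier = list(owner)
    while frontier:
        newly = {}
        for u in frontier:
            for v, p in out.get(u, []):
                if v not in owner and rng.random() < p:
                    # Rumor wins when both cascades succeed in the same step.
                    if owner[u] == "R" or v not in newly:
                        newly[v] = owner[u]
        owner.update(newly)
        frontier = list(newly)
    rumor = sum(1 for c in owner.values() if c == "R")
    return len(nodes) - rumor

def monte_carlo_estimate(edges, nodes, rumor_seeds, positive_seeds, runs, seed=0):
    """Average of `runs` independent simulations for one fixed positive seed set."""
    rng = random.Random(seed)
    total = sum(rumor_free_count(edges, nodes, rumor_seeds, positive_seeds, rng)
                for _ in range(runs))
    return total / runs

# Hypothetical chain r -> a -> b with deterministic edges.
chain_nodes = ["r", "a", "b"]
chain_edges = {("r", "a"): 1.0, ("a", "b"): 1.0}
print(monte_carlo_estimate(chain_edges, chain_nodes, ["r"], ["a"], runs=10))
```

A greedy algorithm built on this estimator has to repeat the simulations for every candidate seed set it evaluates, which is the cost that reverse sampling avoids.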
3.5 A Sampling Method
Our sampling method is designed based on the following objects.
(Random R-tuple of ) Generated by Algorithm 1, a random reverse tuple of node is a four-tuple where is a node-set, and are edge-sets, and is a boolean variable. As shown in Alg. 1, we start from and successively test whether the current in-neighbor of the nodes in can be added to in a breadth-first manner until one of the rumor seeds is reached or no node can be further reached. and are generated in line 18 and line 20, respectively. is the set of nodes that can reach . is set as if and only if some rumor seeds are encountered. We denote by , , and the four attributes of .
(Random R-tuple) Generated by Algorithm 2, a random R-tuple is a random R-tuple of generated by Alg. 1 where is selected from uniformly at random. For a node set , let be a random variable over 0 and 1, where
The following lemma is critical for the rest of the analysis in this section.
for any .
See Appendix A.2. ∎
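The reverse sampling idea can be illustrated with the simplified sketch below. It is not the authors' Algorithm 1: it is a reconstruction under the assumptions that the reverse breadth-first search stops at the first level containing a rumor seed, that the node-set collects the nodes strictly closer to the target than any rumor seed, and that the indicator equals 1 when no rumor seed was reached or when the positive seed set intersects that node-set. The estimator further assumes that the lemma takes the usual form in which the number of nodes times the expected indicator equals the objective value. All names are hypothetical.

```python
import random

def random_r_tuple(v, in_edges, rumor_seeds, rng):
    """in_edges[w] is a list of (u, p_uw). Returns (nodes, live, dead, hit_rumor)."""
    rumor_seeds = set(rumor_seeds)
    if v in rumor_seeds:
        return set(), set(), set(), True  # a rumor seed can never be protected
    nodes, live, dead = {v}, set(), set()
    seen, level = {v}, [v]
    while level:
        next_level, hit = [], False
        for w in level:
            for u, p in in_edges.get(w, []):
                if u in seen:
                    continue  # edge outcome cannot change any distance; left untested
                if rng.random() < p:
                    live.add((u, w))
                    seen.add(u)
                    next_level.append(u)
                    hit = hit or u in rumor_seeds
                else:
                    dead.add((u, w))
        if hit:
            # Nodes on the rumor seed's level would only tie, and rumor wins ties.
            return nodes, live, dead, True
        nodes.update(next_level)
        level = next_level
    return nodes, live, dead, False

def indicator(r_tuple, positive_seeds):
    """1 if the target of the R-tuple is not activated by rumor, else 0."""
    nodes, _, _, hit = r_tuple
    return 1 if (not hit or nodes & set(positive_seeds)) else 0

def estimate_objective(all_nodes, in_edges, rumor_seeds, positive_seeds, samples, seed=0):
    """n times the fraction of random R-tuples whose indicator is 1."""
    rng = random.Random(seed)
    total = 0
    for _ in range(samples):
        v = rng.choice(all_nodes)
        total += indicator(random_r_tuple(v, in_edges, rumor_seeds, rng), positive_seeds)
    return len(all_nodes) * total / samples

# Hypothetical chain r -> a -> b -> v with deterministic edges.
chain_in = {"a": [("r", 1.0)], "b": [("a", 1.0)], "v": [("b", 1.0)]}
print(random_r_tuple("v", chain_in, ["r"], random.Random(0)))
print(estimate_objective(["r", "a", "b", "v"], chain_in, ["r"], ["a"], samples=4000))
```

Unlike the forward simulation, the sampled tuples do not depend on the positive seed set, so the same collection can be reused to score any candidate set.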
Suppose there is a set of random R-tuples each of which is obtained by Alg. 2 . For a set and , let . Now let us consider the following problem
Finding a node set with at most nodes such that is maximized.
Because is always 1 when , we can take as the ground set such that is equal to 1 for any non-empty set . Now Problem 2 becomes the classic maximum coverage problem and therefore the greedy algorithm shown in Algorithm 3 produces a $(1-1/e)$-approximation. For a given , let be the set produced by Alg. 3. Then
for any .
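The greedy step can be sketched as a standard maximum coverage routine over the node-sets of the sampled R-tuples: repeatedly pick the node that covers the most still-uncovered tuples. This is an illustrative sketch rather than the authors' Algorithm 3; the function name and input format are assumptions, and the simple rescanning loop below is not the linear-time implementation.

```python
def greedy_max_coverage(tuple_node_sets, k):
    """Pick up to k nodes covering as many of the given node-sets as possible."""
    uncovered = set(range(len(tuple_node_sets)))
    # Inverted index: node -> indices of the node-sets that contain it.
    index = {}
    for i, nodes in enumerate(tuple_node_sets):
        for u in nodes:
            index.setdefault(u, set()).add(i)
    chosen = []
    for _ in range(k):
        best, best_gain = None, 0
        for u, hits in index.items():
            gain = len(hits & uncovered)
            if gain > best_gain:
                best, best_gain = u, gain
        if best is None:
            break  # every remaining node has zero marginal gain
        chosen.append(best)
        uncovered -= index[best]
    return chosen, len(tuple_node_sets) - len(uncovered)

# Hypothetical node-sets of four R-tuples that reached a rumor seed.
toy_tuples = [{1, 2}, {2, 3}, {3}, {4}]
picked, covered = greedy_max_coverage(toy_tuples, 2)
print(picked, covered)
```

Only tuples that reached a rumor seed need to be passed in, since the remaining tuples count as covered regardless of the chosen set.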
3.6 Chernoff Bound
In this paper, we use the Chernoff bound to analyze the estimation error. Let be i.i.d. random variables where . The Chernoff bound states that
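For reference, one commonly used multiplicative form of the Chernoff bound is given below. The symbols $X_i$, $l$, $\mu$ and $\delta$ are chosen here for illustration, and the constants may differ from the form the analysis relies on. For i.i.d. random variables $X_1, \dots, X_l$ taking values in $[0, 1]$ with mean $\mu$:

```latex
\Pr\Big[\sum_{i=1}^{l} X_i - l\mu \ge \delta \cdot l\mu\Big]
  \le \exp\Big(-\frac{\delta^{2}}{2+\delta}\, l\mu\Big)
  \quad \text{for any } \delta > 0,
\qquad
\Pr\Big[\sum_{i=1}^{l} X_i - l\mu \le -\delta \cdot l\mu\Big]
  \le \exp\Big(-\frac{\delta^{2}}{2}\, l\mu\Big)
  \quad \text{for any } 0 < \delta < 1.
```

Both tails decay exponentially in $l\mu$, which is why the number of samples needed for a fixed relative error grows as the mean becomes smaller.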
4 The algorithm
In this section, we first discuss how to estimate the optimal value and then present the algorithm together with its analysis.
Estimating the optimal value of is an important part of our algorithm. Suppose we have a set of random R-tuples. Intuitively, should be a good choice because, by Eq. (3), it is close to with a guaranteed factor, and, according to Lemma 2, is an unbiased estimate of . Because (note that is always no less than 1 because and ), we design a statistical test which compares with and terminates when they are sufficiently close to each other. The estimation process is shown in Algorithm 4 with tunable parameters and . Let be the estimate produced by Alg. 4.
First, we need to guarantee that is smaller than . The following result shows that the termination condition (i.e., line 9) ensures that is smaller than with high probability.
With probability at least , Algorithm 4 produces an that is less than .
See Appendix A.3 ∎
Second, it can be shown that is not too much less than .
With a probability at least , Algorithm 4 produces an such that .
See Appendix A.4 ∎
The above results are summarized as follows.
With a probability at least , Algorithm 4 returns an , such that
4.2 The Algorithm
Now we are ready to show the algorithm of rumor blocking. Let be a set of random R-tuples. According to Lemma 2, is an unbiased estimate of and therefore the that can maximize should be able to maximize as long as is sufficiently large. The whole algorithm is given in Alg. 5. Let , , and be some adjustable parameters. We first obtain an estimate of by Alg. 4 with input and set
Next, we generate random R-tuples by Alg. 2. Finally, Alg. 5 returns the set obtained by running Alg. 3 with input . Let be the node set produced by Alg. 5. As mentioned earlier, should be a -approximation to Problem 1 if the estimate is sufficiently accurate. In particular, we require the following accuracy of and .
Let . By Eq. (11),
See Appendix A.5. ∎
See Appendix A.6. ∎
The above analysis is summarized as follows.
With probability at least , .
4.3 Running Time
See Appendix A.7. ∎
Alg. 5 runs in .
Alg. 3 can be implemented to run in time linear in the total size of its input . Alg. 4 invokes Alg. 3 with input size from to where is the index of the last iteration and the input size is doubled in each iteration. By Lemma 3, the total number of R-tuples generated by Alg. 4 is and, by Lemma 8, line 2 of Alg. 5 runs in . The running time of lines 6 and 7 is dominated by that of line 2. Therefore, the running time of Alg. 5 is by taking , , as constants and assuming . ∎
As shown in Theorem 3, and control the success probability and the approximation ratio, respectively. When is fixed, we select the such that can be minimized to reduce the running time. As shown in Alg. 4, decides the value of . When is getting larger, Alg. 4 takes less time while Alg. 5 takes more time because, by Theorem 2, becomes smaller and consequently becomes larger. In experiments, we simply set and .
In this section, we evaluate the performance of RBR algorithm with respect to the state-of-the-art method and other heuristics. Besides, we also discuss the running time of the considered algorithms.
In our experiments, we employ four datasets, Power2500, Wiki, Epinions and Youtube, scaling from small to large. Power2500 is a synthetic power-law network generated by DIGG. It has been shown that the power-law distribution is one of the most important characteristics of social networks. Wiki is a who-votes-on-whom network extracted from the vote history data of Wikipedia (https://www.wikipedia.org/). Epinions is a who-trusts-whom online social network extracted from the consumer review site Epinions.com. Youtube is a social network of a video-sharing website. Wiki, Epinions and Youtube are provided by SNAP. The basic statistics of the above datasets are shown in Table II. The probability on the edges is either uniformly set as 0.1 or is set as . These two settings are denoted as the constant probability (CP) model and the weighted cascade (WC) model. The above datasets together with the probability settings are widely used in prior works.
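The two probability settings can be sketched as follows. The CP value of 0.1 comes from the text; for the WC model the sketch assumes the usual definition from the influence maximization literature, in which the probability of edge $(u, v)$ is the reciprocal of the in-degree of $v$. The function name and the edge-list format are illustrative, and the edge list is assumed to contain no duplicates.

```python
def assign_probabilities(edge_list, model="WC", constant=0.1):
    """Return {(u, v): p_uv} under the CP or the (assumed) WC setting."""
    if model == "CP":
        return {(u, v): constant for u, v in edge_list}
    if model == "WC":
        in_degree = {}
        for _, v in edge_list:
            in_degree[v] = in_degree.get(v, 0) + 1
        # Assumed WC rule: each in-edge of v gets probability 1 / in-degree(v).
        return {(u, v): 1.0 / in_degree[v] for u, v in edge_list}
    raise ValueError("model must be 'CP' or 'WC'")

# Hypothetical edge list: node 3 has two in-neighbors, node 4 has one.
toy_edge_list = [(1, 3), (2, 3), (3, 4)]
print(assign_probabilities(toy_edge_list, "CP"))
print(assign_probabilities(toy_edge_list, "WC"))
```

Under the WC rule the incoming probabilities of every node sum to 1, so high in-degree nodes are harder to activate through any single neighbor.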
Table III: Running time (time) and number of generated R-tuples (# R) for each dataset and probability setting.
We consider the following four rumor blocking algorithms, together with a base case without blocking:
RBR algorithm. This is the algorithm proposed in this paper. We set and by default.
Greedy. This is the state-of-the-art rumor blocking algorithm using the Monte Carlo simulation. 2,000 simulations are used for each estimation. Greedy is only tested on small graphs, Power2500 and Wiki.
Proximity. This is a popular heuristic algorithm which selects the out-neighbors of the rumor seed nodes as the positive seed nodes. In particular, we give an index to each node and select the neighbors with the highest index.
Random. This is a baseline method where the positive seed nodes are randomly selected.
Unblocking. This is the base case when there is no positive cascade.
In our experiments, the rumor seed nodes are selected from the nodes with the highest degree. The number of rumor seed nodes is set as 20 and the budget of positive seed set is selected from to . The function value of the seed set produced by each algorithm is finally evaluated by with where the R-tuples are separately generated.
We conduct two series of experiments. In the first experiment, we evaluate the performance of the RBR algorithm. In the second experiment, we investigate how many R-tuples the RBR algorithm needs to produce a high quality seed set.
The analysis of the experimental results of the two series of experiments is given in the following two subsections, respectively.
5.1.1 Experiment I
The results on graph Power2500 are shown in Figs. (a) and (e). One can see that the RBR algorithm and Greedy have the same performance with respect to . Nevertheless, the RBR algorithm is more efficient than Greedy with respect to running time, as shown in Table III. For example, under the CP model on Power2500 with , RBR takes 0.31 seconds while Greedy takes about 1.3 hours.
The results on the Wiki dataset are shown in Figs. (b) and (f). Under the CP model, the RBR algorithm is able to reach at least 97.98% of the blocking effect of the Greedy algorithm with respect to . Under the WC model, Greedy performs better than RBR does until is larger than 16. Recall that 2,000 simulations are used by Greedy for each estimation. Such a phenomenon suggests that, when is larger than 16, more simulations are required to maintain the accuracy of the estimates so that Greedy is able to achieve the (1-1/e)-approximation. However, as shown in Table III, Greedy is already very time consuming on Wiki with 2,000 simulations, and therefore using more simulations is not a good choice even though it may increase the quality of the produced seed set. Although Wiki is larger than Power2500, comparing Figs. (e) and (f), one can see that when there is no positive cascade, 20 rumor seed nodes result in 410 and 650 rumor-activated nodes on Wiki and Power2500, respectively, which indicates that the density of the network has more impact on the influence diffusion than the network scale does.
The results on the Epinions dataset are shown in Figs. (c) and (g). One can see that on the large network the RBR algorithm is superior to the other heuristics by a significant margin. Under the CP model when , the RBR algorithm can protect about 2,000 users while Proximity protects 500 nodes. On Youtube, as shown in Figs. (d) and (h), RBR is still effective but the other heuristics can hardly protect any node.
5.1.2 Experiment II
As shown in Sec. 4, the main part of the analysis focuses on determining a threshold for the number of R-tuples used in Alg. 5. In this section, we experimentally test how the quality of the seed sets varies as the number of used R-tuples increases. In particular, we are interested in whether or not the number of R-tuples used by the RBR algorithm is proper. To this end, instead of calculating as shown from line 2 to line 5 in Alg. 5, we explicitly set and then run the rest of Alg. 5 from line 6 to 8. For each dataset, we increase until the quality of the produced seed set tends to converge. The results are given in Fig. 3.
According to Fig. 3 and Table III, RBR generates a sufficient number of R-tuples in practice for most of the considered datasets. For example, on graph Power2500 under the CP model, RBR generates 220K R-tuples in total as shown in the second column of Table III, and, as shown in Fig. (a), the quality of the seed set does not markedly increase when more than 200K R-tuples have been used. For this case, 220K R-tuples are sufficient as spending more R-tuples cannot help improve the quality. One reaches the same conclusion on the other three datasets. The only exception occurs on the Youtube graph under the WC model. For this case, RBR utilizes 336K R-tuples in total, while as shown in Fig. (h) the standard deviation of is about 400 when the X axis is equal to 336K, and it completely converges after 1,000K R-tuples are used. Such a case may suggest that, on large graphs, and should be set as smaller than 0.1 to raise the number of R-tuples used by Alg. 5 such that the quality of the produced seed set can be more stable. In fact, learning the best parameter setting is an interesting problem and we leave this part as future work.
6 Conclusion and Future Work
In this paper, we have studied the rumor blocking problem for online social networks. We first design the R-tuple based sampling method and then present a randomized rumor blocking algorithm. The proposed RBR algorithm theoretically dominates the existing rumor blocking algorithms, and as shown in the experiments it is very efficient without sacrificing the blocking effect.
One promising direction for future work is to investigate the rumor blocking problem under other models, such as the LT model. It is worth noting that the rumor blocking problem under the LT model is significantly different from that under the IC model. Another direction of future work, as mentioned in Sec. 5, is to study the parameter setting of the RBR algorithm. Finally, an exact algorithm based on the R-tuple sampling method is possibly obtainable for special graph structures like trees and regular graphs.
Appendix A Proofs
A.1 Proof of Lemma 1
We first prove a useful property.
Suppose that . Let and
be a shortest path from to in . If , then all the nodes in will be activated by rumor in under and in particular will be activated at time step .
We prove this claim by induction from to along the path. First, is obviously activated by rumor at time step as it is a seed node of rumor. Now we prove that if is activated by rumor at time step then node will be activated by rumor at time step . There are two cases to consider.
Case 1. If is activated by then clearly is activated by rumor at time step .
Case 2. Otherwise, is activated by a neighbor other than , which implies that is activated no later than . However, because and is the shortest path from to , for any seed node , , which means is activated no earlier than . Therefore, is activated at time step and and will attempt to activate simultaneously at time step . Since rumor has the higher priority, will be activated by rumor regardless of whether or not belongs to rumor. ∎
Lemma 1 follows from the following two claims.
If and , then will be activated by rumor.
This claim follows directly from Claim 1. ∎
If or , then will not be activated by rumor.
If , is clearly not activated by rumor. Otherwise, let . Similar to the proof of Claim 1, we can prove by induction that the nodes on the shortest path from to will be activated by the positive cascade. The only difference is that the positive cascade always reaches those nodes earlier than rumor does by at least one time step. ∎
A.2 Proof of Lemma 2
To prove Lemma 2, we introduce the following definitions.
We use to denote a concrete R-tuple of . Let be the set of all possible and be the probability that can be generated by Alg. 1. Given a node set , let be a variable over 0 and 1, where
Different from defined in Def. 4, is not a random variable.
A pair of ordered edge-sets (, ) is valid, if , and . Note that there is a bijection between the realizations and all the valid pairs (, ) such that . We say is compatible to (, ) if and . Let be the set of the realizations compatible to . For a valid pair (, ) and a realization compatible to , define that
One can easily check that
Intuitively, if a realization is compatible to a valid pair , can be taken as an intermediate state while generating . For a R-tuple of , it follows that
Note that is always valid for each . Now let us consider the realizations compatible to for different R-tuples of .
For each node , the sets for form a partition of .
First, it is obvious that for each there exists a such that is compatible to . Thus, it suffices to show that
for . Since and are different, there must be an edge such that , , and . By Def. 5, a realization cannot be compatible to both and