Cumulative Activation in Social Networks


Abstract

Most studies on influence maximization focus on one-shot propagation, i.e., the influence is propagated from the seed users only once following a probabilistic diffusion model, and users' activation is determined by this single cascade. In reality it is often the case that a user needs to be cumulatively impacted by receiving enough pieces of information propagated to her before she makes the final purchase decision. In this paper we model such cumulative activation as the following process: first, multiple pieces of information are propagated independently in the social network following the classical independent cascade model; then a user is activated (and adopts the product) if the cumulative amount of information she receives reaches her cumulative activation threshold. Two optimization problems are investigated under this framework: seed minimization with cumulative activation (SM-CA), which asks how to select a seed set of minimum size such that the number of cumulatively active nodes reaches a given requirement $\eta$; and influence maximization with cumulative activation (IM-CA), which asks how to choose a seed set of a fixed budget $k$ to maximize the number of cumulatively active nodes. For the SM-CA problem, we design a greedy algorithm that yields a bicriteria approximation when $\eta = n$, where $n$ is the number of nodes in the network. For both the SM-CA problem with $\eta < n$ and the IM-CA problem, we prove strong inapproximability results. Despite the hardness results, we propose two efficient heuristic algorithms for SM-CA and IM-CA, respectively, based on the reverse reachable set approach. Experimental results on different real-world social networks show that our algorithms significantly outperform baseline algorithms.

social networks; independent cascade model; cumulative activation; influence maximization; seed minimization

1 Introduction

With the wide popularity of social media and social network sites such as Facebook, Twitter, WeChat, etc., social networks have become a powerful platform for spreading information, ideas and products among individuals. In particular, product marketing through social networks can attract a large number of customers.

Motivated by this background, influence diffusion in social networks has been extensively studied (cf. [11, 20, 7]). However, most previous works only consider the influence after one-shot propagation — influence propagates from the seed users only once, and user activation or adoption is fully determined after this single cascade. In contrast, in the real world, people often make decisions after they have accumulated many pieces of information about a new technology, new product, etc., and these different pieces of information are propagated in the network independently as different information cascades.

Consider the following scenario: A company is going to launch a new version (named V11 for convenience) of their product with many new features, but most people are not familiar with these new features. Thus, it is often beneficial for the company to conduct a series of advertisement and marketing campaigns covering different features of the product. An effective way of marketing in a social network is to select influential users as seeds to initiate the information cascades of these campaigns. From potential customers' perspective, when they receive the first piece of information about V11 from their friends, they may find it interesting and forward it to their friends, but this may not necessarily lead to a purchase. Later they may receive and be impacted by further information about V11, and once they have been impacted by enough information cascades, they may finally decide to buy the new product.

We model the above behavior by an integrated process consisting of two phases: (a) repeated information cascades, and (b) threshold-based user adoption. First, there are multiple information cascades about multiple pieces of product information in the network. We model information cascades by the classical independent cascade (IC) model [20]: A social network is modeled as a weighted directed graph, with an influence probability as the weight on every edge. Initially, some nodes are selected as seeds and become active, and all other nodes are inactive. At each step, newly activated nodes have one chance to influence each of their inactive out-neighbors with the success probability given on the edge. The independent cascade model is suitable for modeling simple contagions [5, 7] such as virus and information propagation, and thus we adopt it to model information cascades in the first phase. We consider multiple pieces of product information propagating independently, each following the IC model. For the second phase, we assume that there is a threshold for each user, who adopts the product if the amount of information she receives in the first phase exceeds her threshold. We measure the amount of information a user receives as the fraction of information cascades that reach the user, which is equivalent to the probability of the user being activated in an information cascade. A node is cumulatively activated if this probability exceeds the threshold. We refer to this model as the cumulative activation (CA) model.

Given the above cumulative activation model, the company may face one of the following two objectives: either the company has a fixed budget for seed nodes and wants to maximize the number of cumulatively active nodes, or the company needs to reach a predetermined number of cumulatively active nodes and wants to minimize the number of seeds.

We formulate the above scenarios as the following two optimization problems: seed minimization with cumulative activation (SM-CA) and influence maximization with cumulative activation (IM-CA). Given a directed graph with a probability on each edge, a threshold for each node, an activation requirement $\eta$ and a budget $k$, the SM-CA problem is to find a seed set of minimum size such that the number of cumulatively activated nodes is at least $\eta$. The IM-CA problem is to find a seed set of at most $k$ nodes such that the number of cumulatively activated nodes is maximized.

Let $f(S)$ denote the number of cumulatively activated nodes given a seed set $S$. We first show that the set function $f$ is not submodular, which means that, unlike in most existing studies, we cannot guarantee an approximation ratio by using the greedy algorithm directly.

For the SM-CA problem, we consider the cases $\eta = n$ and $\eta < n$ separately, where $n$ is the number of nodes in the network and $\eta$ is the activation requirement. The complexity results of these two cases are quite different. When $\eta = n$, we show that while it is NP-hard to approximate the SM-CA problem within a factor of $(1-\epsilon)\ln n$ for any $\epsilon > 0$, we can achieve a bicriteria approximation. Our technique is to replace the nonsubmodular $f$ with a submodular surrogate function $g$, and show that the set of feasible solutions to the original SM-CA problem with constraint $f(S) \ge n$ is exactly the set of seed sets at which $g$ attains its maximum value; we can then apply the greedy algorithm to the surrogate $g$ instead of $f$ to obtain the theoretical guarantee. When $\eta < n$, we construct a reduction from the densest $k$-subgraph problem to the SM-CA problem and show that the SM-CA problem cannot be approximated within a polynomial factor if the densest $k$-subgraph problem cannot be approximated within a (correspondingly larger) polynomial factor, which is commonly believed to be the case.

For the IM-CA problem, we construct a reduction from the Set Cover problem and prove that it is NP-hard to approximate the IM-CA problem within a factor of $n^{1-\epsilon}$ for any $\epsilon > 0$.

Despite the approximation hardness of the SM-CA problem with $\eta < n$ and the IM-CA problem, we still need practical solutions for them. For this purpose, we propose heuristic algorithms that utilize the state-of-the-art approach in influence maximization, namely the reverse reachable set (RR set) approach [3, 26, 30, 29], to improve efficiency compared to greedy algorithms based on naive Monte Carlo simulations.

Finally, we conduct experiments on three real-world social networks to test the performance of our algorithms. Our results demonstrate that one of the proposed heuristic algorithms consistently outperforms all other algorithms under comparison in all test cases and clearly stands out as the winning choice for both the SM-CA and IM-CA problems.

To summarize, our contributions include: (a) we propose the seed minimization and influence maximization problems under cumulative activation (the SM-CA and IM-CA problems, respectively), which is a reasonable model for the purchasing behavior of customers exposed to repeated information cascades; (b) we design an approximation algorithm for the SM-CA problem when $\eta = n$; (c) we show strong hardness results for the SM-CA problem with $\eta < n$ and the IM-CA problem; (d) we propose efficient heuristic algorithms and validate them through extensive experiments on real-world datasets, concluding that one heuristic is the best choice for both the SM-CA and IM-CA problems.

1.1 Related Work

The classical influence maximization problem is to find a seed set of at most $k$ nodes to maximize the expected number of active nodes. It was first studied as an algorithmic problem by Domingos and Richardson [11] and Richardson and Domingos [27]. Kempe et al. [20] first formulated the problem as a discrete optimization problem. They formalize the independent cascade model and the linear threshold model, and obtain approximation algorithms for influence maximization by applying submodular function maximization. Extensive studies follow their approach and provide more efficient algorithms [10, 9, 23]. Leskovec et al. [23] present a "lazy-forward" optimization method for selecting new seeds, which greatly reduces the number of influence spread evaluations. Chen et al. [10, 9] propose scalable algorithms that are faster than the greedy algorithms proposed in [21]. Recently, Borgs et al. [3], Tang et al. [29, 30] and Nguyen et al. [26] proposed a series of more effective algorithms for influence maximization in large social networks that have both theoretical guarantees and practical efficiency. The approach is based on the "Reverse Reachable Set" idea first proposed in [3].

Another aspect of the influence problem is seed set minimization. Chen [6] studies the seed minimization problem under the fixed threshold model and shows strong negative results for this model. Long et al. [24] also study the independent cascade model and the linear threshold model from a minimization perspective. In [15], Goyal et al. study the problem of finding a minimum-size seed set such that the expected number of active nodes reaches a given threshold, and they provide a bicriteria approximation algorithm for this problem. Zhang et al. [32] study the seed set minimization problem with probabilistic coverage guarantee, and design an approximation algorithm for it. He et al. [19] study a positive influence model under single-step activation and propose an approximation algorithm. Note that the setting in [19] is a special case of our work.

Beyond influence maximization and seed minimization, another interesting direction is learning social influence from real online social network data, e.g., influence learning in blogspace [17] and in academic collaboration networks [28].

Most early studies on influence maximization and influence learning are summarized in the monograph [7]. However, almost all existing studies consider only node activation after a single information or influence cascade. Our work differs from all these studies in this important aspect, as discussed in the introduction.

Paper organization. We formally define the diffusion model and the optimization problems SM-CA and IM-CA in Section 2. The approximation algorithms and hardness results for these two problems are presented in Section 3, including a greedy algorithm for the SM-CA problem with $\eta = n$ in Section 3.1.1, the hardness result for the SM-CA problem with $\eta < n$ in Section 3.1.2, and the inapproximability result for the IM-CA problem in Section 3.2. In Section 4, we present two heuristic algorithms for the SM-CA problem and two heuristic algorithms for the IM-CA problem. Section 5 shows our experimental results on real-world datasets. We conclude the paper with some further directions in Section 6.

2 Model and Problem Definitions

Our social network is defined on a directed graph $G = (V, E)$, where $V$ is the set of nodes representing individuals and $E$ is the set of directed edges representing social ties between pairs of individuals. Each edge $(u, v) \in E$ is associated with an influence probability $p_{uv} \in [0,1]$, which represents the probability that $u$ influences $v$.

The entire activation process consists of an information diffusion process and node activation. The information diffusion process follows the independent cascade (IC) model proposed by Kempe et al. [20]. In the IC model, discrete time steps are used to model the diffusion process. Each node in $V$ has two states: inactive or active. At step 0, a subset $S \subseteq V$ is selected as the seed set and nodes in $S$ are activated directly, while nodes not in $S$ are inactive. At any step $t \ge 1$, if a node $u$ was newly activated at step $t-1$, then $u$ has a single chance to influence each of its inactive out-neighbors $v$ with independent probability $p_{uv}$ to make $v$ active. Once a node becomes active, it never returns to the inactive state. The diffusion process stops when there are no newly activated nodes at a time step.

The above basic IC model describes the diffusion of one piece of information, but in reality there could be many pieces of information about a product being propagated in the network, all following the same IC model. A user's final product adoption is based on the cumulative information collected, which we refer to as cumulative activation (CA) and describe below; it is different from the user becoming active for one piece of information in the IC model specified above. Let $ap_S(v)$ be the probability that $v$ becomes active after an information cascade starting from the seed set $S$. Since $ap_S(v)$ also represents the fraction of information cascades accepted by $v$ when multiple cascades propagate, we use it to define cumulative activation: suppose that each node $v$ has an activation threshold $\theta_v \in [0,1]$; then $v$ becomes cumulatively active if $ap_S(v) \ge \theta_v$. Given a target set $U \subseteq V$ and a seed set $S$, let $f_U(S)$ be the number of cumulatively active nodes in $U$ from seed set $S$. When $U = V$, we omit the subscript and use $f(S)$ directly.
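To make the two-phase process concrete, the following Python sketch (our own illustration, not code from the paper; the graph representation and function names are assumptions) estimates $ap_S(v)$ by repeated IC simulations and then applies the threshold rule to count cumulatively active nodes.

import random
from collections import defaultdict

def simulate_ic(graph, seeds):
    """Run one independent cascade; graph[u] is a list of (v, p_uv) pairs.
    Returns the set of nodes activated in this single cascade."""
    active = set(seeds)
    frontier = list(seeds)
    while frontier:
        new_frontier = []
        for u in frontier:
            for v, p in graph.get(u, []):
                if v not in active and random.random() < p:
                    active.add(v)
                    new_frontier.append(v)
        frontier = new_frontier
    return active

def cumulative_active_count(graph, seeds, theta, target, runs=10000):
    """Estimate ap_S(v) over `runs` cascades, then count nodes v in `target`
    with ap_S(v) >= theta[v] (the cumulative activation rule)."""
    hits = defaultdict(int)
    for _ in range(runs):
        for v in simulate_ic(graph, seeds):
            hits[v] += 1
    return sum(1 for v in target if hits[v] / runs >= theta[v])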

We consider two optimization problems under cumulative activation: seed minimization with cumulative activation (SM-CA) and influence maximization with cumulative activation (IM-CA). SM-CA aims at finding a seed set of minimum size such that at least $\eta$ nodes in the target set become cumulatively active. IM-CA is the problem of finding a seed set of size at most $k$ that maximizes the number of cumulatively active nodes in the target set. The formal definitions are as follows.

Definition 1 (Seed minimization with cumulative activation)

In the seed minimization with cumulative activation (SM-CA) problem, the input includes a directed graph $G = (V, E)$ with $|V| = n$, an influence probability vector $\{p_{uv}\}_{(u,v) \in E}$, a target set $U \subseteq V$, an activation threshold $\theta_v$ for each node $v$, and a coverage requirement $\eta \le |U|$. Our goal is to find a minimum-size seed set $S^*$ such that at least $\eta$ nodes in $U$ can be cumulatively activated, that is,

$S^* \in \arg\min\{|S| : S \subseteq V,\ f_U(S) \ge \eta\}$.

Definition 2 (Influence maximization with cumulative activation)

In the influence maximization with cumulative activation (IM-CA) problem, the input includes a directed graph $G = (V, E)$ with $|V| = n$, an influence probability vector $\{p_{uv}\}_{(u,v) \in E}$, a target set $U \subseteq V$, an activation threshold $\theta_v$ for each node $v$, and a size budget $k$. Our goal is to find a seed set $S^*$ of size at most $k$ such that the number of cumulatively active nodes in $U$ is maximized, that is,

$S^* \in \arg\max\{f_U(S) : S \subseteq V,\ |S| \le k\}$.

2.1 Equivalence to Frequency-based Definition

Suppose there are $T$ independent diffusions, which together lead to the final cumulative activation. Intuitively, a node $v$ becomes cumulatively activated when the number of times that $v$ is influenced in these diffusions is large enough. Formally, given a seed set $S$ and a node $v$, let $X_i$ be a random variable defined as follows: $X_i = 1$ if $v$ is influenced in the $i$-th diffusion and $X_i = 0$ otherwise. Thus, $N_v = \sum_{i=1}^{T} X_i$ denotes the number of times that $v$ is influenced after $T$ diffusions, and $N_v / T$ is the influence frequency of $v$. By Hoeffding's inequality, we show the relationship between $ap_S(v)$ and $N_v / T$ in Lemma 1.

Lemma 1

Given a seed set $S$, a node $v$ and a large enough number of diffusions $T$: (a) if $ap_S(v) > \theta_v$, then $N_v / T \ge \theta_v$ with high probability; and (b) if $ap_S(v) < \theta_v$, then $N_v / T < \theta_v$ with high probability, where the probability tends to 1 as the number of diffusions $T$ grows.

Proof.

It is obvious that the expectation of $N_v / T$ is $ap_S(v)$. When $ap_S(v) < \theta_v$, by Hoeffding's inequality, we have:

$\Pr[N_v / T \ge \theta_v] \le \exp\!\big(-2T(\theta_v - ap_S(v))^2\big).$

Thus, when $T$ is large enough, $N_v / T < \theta_v$ is a high-probability event if $ap_S(v) < \theta_v$. Similarly, $N_v / T \ge \theta_v$ is a high-probability event if $ap_S(v) > \theta_v$.

Based on Lemma 1, the formal definition of cumulative activation is consistent with our motivation.

2.2 Comparison with IC and LT Models

We first explain the differences between the CA model and the IC model. The CA model uses the IC model for the information cascades in its first phase, so the main difference lies in determining which nodes are finally activated, i.e., in the objective function. This is illustrated by the simple example in Figure 1(a), which shows a five-node graph with edge probabilities marked next to the edges. In the IC model, the influence spreads of the two candidate seed nodes are the same. In the CA model, when every node has a low activation threshold, the candidate that reaches several nodes with moderate probability cumulatively activates more nodes; when the threshold is raised, the candidate whose influence is concentrated on fewer nodes with high probability activates more. Therefore, if we want to select one seed for influence maximization, either candidate is fine under the IC model, but under the CA model the choice depends on the threshold values. This means that the influence maximization task under the CA model differs from the task under the IC model. The example also provides the intuition that influence maximization under the IC model focuses on the average effect of the influence, while under the CA model one may need to select either nodes with wide but diffuse influence or nodes with concentrated influence, depending on the threshold setting.


Figure 1: Example graphs for understanding the model: (a) comparison of CA and IC; (b) non-submodularity of the objective; (c) an example for SM-CA with $\eta < n$.

We next distinguish our CA model from the popular linear threshold (LT) model proposed in [20]. In the LT model, each edge $(u, v)$ has a weight $w_{uv}$ with $\sum_{u} w_{uv} \le 1$ ($w_{uv} = 0$ if $(u, v)$ is not an edge). Each node $v$ has a threshold $\theta_v$, which is drawn from $[0, 1]$ uniformly at random before the propagation starts. Then, starting from the seed set $S$, an inactive node $v$ becomes active at time $t$ if and only if the total weight from its active in-neighbors exceeds $v$'s threshold: $\sum_{u \in A_{t-1}} w_{uv} \ge \theta_v$, where $A_{t-1}$ is the set of active nodes at time $t-1$ and $A_0 = S$.

Despite the superficial similarity of using thresholds to model user adoption behavior, the two models are quite different. One key difference is that in the LT model what is being propagated is the user adoption behavior itself, while in the CA model what is being propagated are multiple pieces of information about a product, and a user's adoption in the end is based on the information received. This is actually the difference between CA and most other models of influence diffusion, as discussed in the introduction. This further leads to a specific difference between LT and CA: the threshold in the LT model is on the (weighted) number of friends who have already adopted the product, while the threshold in the CA model is on the fraction of information cascades that reach a user. Finally, in the LT model the threshold is a random number in $[0, 1]$, making the influence spread objective submodular, while in the CA model the threshold is a fixed input, causing the objective function to be non-submodular, as discussed in the next section.

3 Algorithms and hardness results

In this section, we provide algorithmic as well as hardness results for the SM-CA and IM-CA problems.

A set function $h: 2^V \to \mathbb{R}$ is monotone if $h(S) \le h(T)$ for all $S \subseteq T$, and submodular if $h(S \cup \{v\}) - h(S) \ge h(T \cup \{v\}) - h(T)$ for all $S \subseteq T \subseteq V$ and $v \in V \setminus T$. It is well known that maximizing a monotone submodular function with the greedy algorithm yields a good approximation ratio [25], and indeed most existing work on social influence takes advantage of this property (e.g., [3, 8, 15, 20]).

Unfortunately, our objective function $f$ is monotone but not submodular in general, as the example below shows, which makes our problems much harder.

Example 1

(Figure 1(b)) Suppose the graph is bipartite, with the same influence probability on every edge and suitable activation thresholds. One can then find nested seed sets $S \subset T$ and a node $v \notin T$ such that some target node is pushed over its threshold only when $v$ is added on top of $T$, but not when $v$ is added on top of $S$. Consequently, $f(T \cup \{v\}) - f(T) > f(S \cup \{v\}) - f(S)$, implying that $f$ is not submodular. We further remark that the example remains non-submodular for other natural choices of the thresholds.
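The following self-contained Python check illustrates such a violation on a hypothetical three-node instance of our own (it is not the instance in Figure 1(b)): adding the same node to a larger seed set yields a strictly larger marginal gain in $f$.

# Hypothetical instance: seeds a, b each point to c with probability 0.5;
# the activation threshold of every node is 0.6.
EDGES = {("a", "c"): 0.5, ("b", "c"): 0.5}
THETA = {"a": 0.6, "b": 0.6, "c": 0.6}

def ap(seeds, v):
    """Exact activation probability of v in one IC cascade from `seeds`."""
    if v in seeds:
        return 1.0
    miss = 1.0
    for u in seeds:
        miss *= 1.0 - EDGES.get((u, v), 0.0)
    return 1.0 - miss

def f(seeds):
    """Number of cumulatively active nodes, i.e., nodes with ap_S(v) >= theta_v."""
    return sum(1 for v in THETA if ap(seeds, v) >= THETA[v])

S, T, v = set(), {"a"}, "b"
gain_S = f(S | {v}) - f(S)   # = 1: only b itself crosses its threshold
gain_T = f(T | {v}) - f(T)   # = 2: b crosses, and c reaches 0.75 >= 0.6
print(gain_S, gain_T)        # prints "1 2": the marginal gain grows, so f is not submodular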

In the rest of this section, we consider how to design approximation algorithms for the SM-CA and IM-CA problems, and we study their computational complexity.

3.1 Seed minimization with cumulative activation (SM-CA) problem

In this section, we study the SM-CA problem. We first show a hardness result for the SM-CA problem in Theorem 1.

Theorem 1

The SM-CA problem is NP-hard. Moreover, the SM-CA problem cannot be approximated within a factor of $(1-\epsilon)\ln n$ in polynomial time for any $\epsilon > 0$, unless $\mathrm{NP} \subseteq \mathrm{DTIME}(n^{O(\log\log n)})$.

Proof.

We construct a reduction from the Partial Set Cover (PSC) problem. An instance of the PSC problem consists of a ground set $X$, a family of subsets $\mathcal{C} = \{C_1, \dots, C_q\}$ with $C_i \subseteq X$, and a coverage requirement $\eta'$. The objective is to find a subcollection $\mathcal{C}' \subseteq \mathcal{C}$ such that $|\bigcup_{C \in \mathcal{C}'} C| \ge \eta'$ and $|\mathcal{C}'|$ is minimized.

Given any instance of the PSC problem, we construct an instance of the SM-CA problem as follows. The PSC instance is reduced to a bipartite graph in which each node on one side corresponds one-to-one to a subset $C_i$, each node on the other side corresponds one-to-one to an element of $X$, and there is a directed edge from a subset node to an element node if and only if the element belongs to the subset. The target set $U$ is the set of element nodes, the influence probability on each edge is 1, and the activation threshold of each node in $U$ is 1. The activation requirement is $\eta = \eta'$, the same as the coverage requirement in the PSC instance.

Based on the above construction, it is easy to check that, for any given seed set $S$ of subset nodes, the objective values in the two instances are always the same, which means the two problems have the same approximability. For the PSC problem, Feige showed that it cannot be approximated within a factor of $(1-\epsilon)\ln n$ in polynomial time unless $\mathrm{NP} \subseteq \mathrm{DTIME}(n^{O(\log\log n)})$ [13]. Therefore, the SM-CA problem has the same hardness.

Based on the hardness result of the SM-CA problem, our next goal is to design an algorithm with an approximation ratio close to this logarithmic lower bound. Surprisingly, it turns out that the results are quite different between "activating all nodes" ($\eta = n$) and "partial activation" ($\eta < n$), as we discuss separately below.

3.1.1 SM-CA problem with $\eta = n$

When $\eta = n$, we can design an algorithm with a bicriteria approximation guarantee, even though the objective function $f$ is not submodular. The key idea of our solution is to find a submodular function $g$ as a surrogate for the original nonsubmodular $f$, as the following lemma specifies.

Lemma 2

When $\eta = n$, a seed set $S$ is a feasible solution to the SM-CA problem if and only if $g(S) = \sum_{v \in V} \theta_v$, where $g$ is a surrogate function defined as:

$g(S) = \sum_{v \in V} \min\{ap_S(v), \theta_v\}$.

Proof.

If $S$ is a feasible solution to SM-CA with $\eta = n$, that is, $f(S) = n$, then every node $v$ satisfies $ap_S(v) \ge \theta_v$, and thus $g(S) = \sum_{v \in V} \theta_v$. The only-if part is also straightforward.

The above lemma implies that minimizing the seed set size under the constraint $f(S) = n$ is the same as minimizing the seed set size under the constraint $g(S) = \sum_{v \in V} \theta_v$. The reason we switch the minimization problem to the surrogate function $g$ is that $g$ is submodular, as pointed out by Lemma 3. We remark that in [12], Farajtabar et al. study an objective function of a form similar to $g$ in a continuous-time influence model, but there the truncation level is interpreted as a cap on user activity intensity rather than an activation threshold.

Lemma 3

The surrogate function $g$ is monotone and submodular.

Proof.

[sketch] It is obvious that $g$ is monotone. For submodularity, following [20] we know that $ap_S(v)$, as a function of $S$, is submodular. It is then easy to check that the minimum of a submodular function and a constant is still submodular, and that a sum of submodular functions is also submodular.

Having the submodularity, we can design a greedy algorithm guided by $g$. But as in most work on the IC model, we cannot avoid the problem of computing $g(S)$. It has been shown that exactly computing the influence spread $\sigma(S)$ in the IC model is #P-hard [9], where $\sigma(S)$ is the expected number of active nodes given the seed set $S$. Thus, computing $g(S)$ is also #P-hard, since $g(S) = \sigma(S)$ if we set $\theta_v = 1$ for all $v$. In this section, we use Monte Carlo simulation to estimate $g(S)$. A more efficient method will be discussed in Section 4.2.

Input: graph $G$, seed set $S$, thresholds $\{\theta_v\}_{v \in V}$, number of simulations $R$
Output: $\hat{g}(S)$: the estimation of $g(S)$
1:  $\hat{g}(S) \leftarrow 0$;
2:  $c_v \leftarrow 0$ for all $v \in V$
3:  for $i = 1$ to $R$ do
4:     simulate one IC diffusion from seed set $S$
5:     if a node $v$ is activated in this simulation then
6:        $c_v \leftarrow c_v + 1$
7:     end if
8:  end for
9:  for $v \in V$ do
10:     $\hat{ap}_S(v) \leftarrow c_v / R$
11:     if $\hat{ap}_S(v) \ge \theta_v$ then
12:        $\hat{g}(S) \leftarrow \hat{g}(S) + \theta_v$
13:     else
14:        $\hat{g}(S) \leftarrow \hat{g}(S) + \hat{ap}_S(v)$
15:     end if
16:  end for
17:  return $\hat{g}(S)$
Algorithm 1: Estimate $g(S)$ by Monte Carlo
Input: graph $G$, thresholds $\{\theta_v\}_{v \in V}$, slack $\varepsilon > 0$
Output: seed set $S$
1:  $S \leftarrow \emptyset$
2:  while $\hat{g}(S) < \sum_{v \in V} \theta_v - \varepsilon$ do
3:     choose $u \in \arg\max_{w \in V \setminus S} \big(\hat{g}(S \cup \{w\}) - \hat{g}(S)\big)$
4:     $S \leftarrow S \cup \{u\}$
5:  end while
6:  return $S$
Algorithm 2: Greedy algorithm for SM-CA with $\eta = n$

Algorithm 1 shows the procedure of the Monte Carlo method. Given a seed set $S$, Algorithm 1 simulates the diffusion process from $S$ for $R$ runs, and uses the frequency with which each node $v$ has been influenced as the estimate of $ap_S(v)$. Then we obtain the estimate of $g(S)$ directly by a truncation operation. The estimates of $ap_S(v)$ and $g(S)$ are denoted by $\hat{ap}_S(v)$ and $\hat{g}(S)$, respectively.
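For concreteness, the following short Python sketch (our own rendering, with assumed names) performs the truncation step of Algorithm 1, given activation-probability estimates such as those produced by the Monte Carlo simulation sketched in Section 2.

def g_hat_from_ap(ap_hat, theta):
    """Surrogate estimate g_hat(S) = sum_v min(ap_hat_S(v), theta_v), where
    `ap_hat` maps each node to its estimated activation probability under S."""
    return sum(min(ap_hat.get(v, 0.0), theta[v]) for v in theta)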

The accuracy of the estimate depends on the number of simulation runs $R$, as specified by the following lemma.

Lemma 4

For any seed set $S$, suppose $\hat{g}(S)$ is the estimate of $g(S)$ output by Algorithm 1. Then, for any $\epsilon, \delta > 0$, $|\hat{g}(S) - g(S)| \le \epsilon n$ holds with probability at least $1 - \delta$ if the number of simulations satisfies $R = \Omega\!\big(\frac{1}{\epsilon^2} \log \frac{n}{\delta}\big)$.

Proof.

For each node $v$, let $N_v = \sum_{i=1}^{R} X_i$, where $X_i$ is a random variable defined as $X_i = 1$ if $v$ is influenced in the $i$-th simulation and $X_i = 0$ otherwise. Then $N_v$ is the number of times that $v$ is active after $R$ simulations, so $\hat{ap}_S(v) = N_v / R$ and $\mathbb{E}[\hat{ap}_S(v)] = ap_S(v)$. By Hoeffding's inequality and the condition on $R$, for each node $v$,

$\Pr\big[\,|\hat{ap}_S(v) - ap_S(v)| \ge \epsilon\,\big] \le 2\exp(-2R\epsilon^2) \le \delta / n.$

We next observe that $|\min\{\hat{ap}_S(v), \theta_v\} - \min\{ap_S(v), \theta_v\}| \le |\hat{ap}_S(v) - ap_S(v)|$ always holds.

Then, by a union bound over all $n$ nodes, with probability at least $1 - \delta$ we have $|\hat{g}(S) - g(S)| \le \sum_{v \in V} |\hat{ap}_S(v) - ap_S(v)| \le \epsilon n$.

Having the estimation algorithm for $g$, we present our greedy algorithm for the SM-CA problem with $\eta = n$ in Algorithm 2.

Algorithm 2 starts from an empty seed set $S = \emptyset$. At each iteration, it adds into $S$ the node providing the largest marginal increment of $\hat{g}$, i.e., $u \in \arg\max_{w \in V \setminus S} (\hat{g}(S \cup \{w\}) - \hat{g}(S))$. The algorithm ends when the stopping criterion on $\hat{g}(S)$ is met and outputs $S$ as the selected seed set. Goyal et al. [15] proved a performance guarantee for this greedy scheme when the objective is monotone and submodular.
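A compact sketch of this greedy loop (our own illustration; `estimate_g` stands for any estimator of $g$, e.g., Algorithm 1, and the stopping slack `eps` is an assumption):

def greedy_sm_ca(nodes, theta, estimate_g, eps=1e-3):
    """Greedy seed selection for SM-CA with eta = n, guided by the surrogate g.
    `estimate_g(seeds)` returns an estimate of g(seeds)."""
    target = sum(theta.values())      # g(S) reaches this value iff every node is covered
    seeds, g_cur = set(), 0.0
    while g_cur < target - eps and len(seeds) < len(nodes):
        best, best_gain = None, 0.0
        for w in nodes:
            if w in seeds:
                continue
            gain = estimate_g(seeds | {w}) - g_cur
            if gain > best_gain:
                best, best_gain = w, gain
        if best is None:              # no candidate improves the estimate; stop
            break
        seeds.add(best)
        g_cur = estimate_g(seeds)
    return seeds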

Theorem 2 ([15])

Let $h$ be a nonnegative, monotone and submodular set function defined on subsets of $V$. Given a requirement $\eta$, let $S^*$ be a minimum-size set such that $h(S^*) \ge \eta$, and let $S^g$ be the greedy solution obtained using sufficiently accurate approximate evaluations of $h$ with the stopping criterion $h(S^g) \ge \eta - \varepsilon$. Then $|S^g| \le |S^*|\big(1 + \ln\frac{\eta}{\varepsilon}\big)$ holds with high probability.

Now we can conclude the approximation ratio of Algorithm 2 based on Lemmas 2–4 and Theorem 2.

Theorem 3

When $\eta = n$, for any $\varepsilon > 0$, Algorithm 2 ends when $\hat{g}(S) \ge \sum_{v \in V} \theta_v - \varepsilon$, and it approximates the SM-CA problem within a factor of $1 + \ln\frac{\sum_{v \in V} \theta_v}{\varepsilon}$ with high probability.

3.1.2 SM-CA problem with $\eta < n$

When $\eta < n$, the surrogate function $g$ no longer enjoys the property in Lemma 2, and thus the problem becomes more difficult. We use the following example to explain this phenomenon.

Example 2

(Figure 1(c)) Suppose the influence probability on each edge leaving one candidate seed is 0.5 and on each edge leaving the other candidate seed is 1, the activation threshold of each node is 1, and $\eta < n$. The first candidate can make the surrogate $g$ large by giving many nodes activation probability 0.5, yet none of these nodes becomes cumulatively active, so the corresponding seed set is not a feasible solution even though $g$ is large. This simple example shows that too many "small activation probability" nodes may mislead $g$, causing it to diverge significantly from $f$.

Now we show the hardness result for the SM-CA problem with $\eta < n$. Our analysis is based on the hardness of the densest $k$-subgraph (D$k$S) problem [14]. An instance of the D$k$S problem consists of an undirected graph $H = (V_H, E_H)$ and a parameter $k \le |V_H|$. The objective is to find a subset of $V_H$ of cardinality $k$ such that the number of edges with both endpoints in the subset is maximized.

The first polynomial-time approximation algorithm for the D$k$S problem was given by Feige et al. in 2001 [14], with an approximation ratio of roughly $n^{1/3}$. This result was improved to $O(n^{1/4+\epsilon})$ (for any $\epsilon > 0$) by Bhaskara et al. [2] in 2010, which is currently the best known guarantee. For the hardness of the D$k$S problem, Khot [22] proved that it does not admit a PTAS under the assumption that NP problems do not have sub-exponential time randomized algorithms. The exact complexity of approximating the D$k$S problem is still open, but it is widely believed that it can only be approximated within a polynomial factor.

Partially borrowing the idea in [18], we can prove a hardness result for the SM-CA problem with $\eta < n$ based on the hardness of the D$k$S problem.

Theorem 4

When $\eta < n$, the SM-CA problem cannot be approximated within a factor of $\alpha$ if the D$k$S problem cannot be approximated within a factor of $O(\alpha^2)$, for any $\alpha \ge 1$.

Proof.

Suppose there is a polynomial-time approximation algorithm $\mathcal{A}$ with performance ratio $\alpha$ for SM-CA with $\eta < n$. We design an algorithm for the D$k$S problem based on $\mathcal{A}$ whose approximation ratio is $O(\alpha^2)$, and hence the theorem follows.

Given any instance of the D$k$S problem on graph $H = (V_H, E_H)$, construct an instance (denoted SM-CA-I) of the SM-CA problem as follows. It is defined on a one-way bipartite graph that contains a vertex node for each vertex of $H$ and an edge node for each edge of $H$; there is a directed edge from a vertex node to an edge node whenever the vertex is one of the endpoints of that edge in $H$. The probability on each edge is $1/2$. The target set is the set of all nodes, the activation threshold of each vertex node is 1, and the activation threshold of each edge node is $3/4$. Let $\eta^*$ be the maximum activation requirement for which $\mathcal{A}$ outputs a solution for SM-CA with at most $\alpha k$ nodes; that is, $\mathcal{A}$ outputs a seed set with at most $\alpha k$ nodes if the requirement is $\eta^*$, and more than $\alpha k$ nodes if the requirement is larger than $\eta^*$.

It is clear that, in SM-CA-I, edge nodes are no better than vertex nodes as seed candidates: since the target set is the set of all nodes, selecting an edge node can only activate itself, while a vertex node may help activate edge nodes. So here we assume that all seeds selected by algorithm $\mathcal{A}$ are vertex nodes. Since the probability on each edge is $1/2$ and the threshold of each edge node is $3/4$, an easy probability calculation implies that an edge node can be cumulatively activated if and only if both endpoints of the corresponding edge of $H$ are selected as seeds.

Suppose the seed set of SM-CA-I with requirement $\eta^*$ computed by algorithm $\mathcal{A}$ is $S_{\mathcal{A}}$; then we can use the corresponding vertex set in graph $H$ as an approximate solution of the D$k$S problem. Indeed, $S_{\mathcal{A}}$ cumulatively activates at least $\eta^*$ nodes in SM-CA-I, of which at most $|S_{\mathcal{A}}|$ are vertex nodes, so at least $\eta^* - |S_{\mathcal{A}}|$ cumulatively activated nodes are edge nodes. Therefore, in graph $H$ the number of edges induced by the corresponding vertex set is at least $\eta^* - |S_{\mathcal{A}}|$.

Without loss of generality, we can assume that $\eta^*$ is substantially larger than $|S_{\mathcal{A}}|$, because any set of vertex nodes chosen as seeds already cumulatively activates at least the seeds themselves, and the interesting regime is when many edge nodes are activated in addition.

Suppose the optimal solution of the D$k$S problem contains $D^*$ edges. It then suffices to show that $D^*$ is at most $O(\alpha^2)$ times the number of edges recovered above; indeed, if we can prove this, then the vertex set derived from $S_{\mathcal{A}}$ yields an $O(\alpha^2)$-approximate algorithm for the D$k$S problem.

In SM-CA-I, based on the choice of $\eta^*$ and the fact that $\mathcal{A}$ is an $\alpha$-approximate algorithm, any seed set of size $k$ can cumulatively activate at most $\eta^*$ nodes; otherwise, the optimal seed set for a larger requirement would have size at most $k$, and $\mathcal{A}$ would return at most $\alpha k$ seeds for it, contradicting the maximality of $\eta^*$. In particular, any $k$ vertex nodes can cumulatively activate at most $\eta^*$ edge nodes, which is equivalent to saying that any $k$ vertices of $H$ induce at most $\eta^*$ edges. A counting argument over $k$-subsets, in which every induced edge is counted the same number of times, relates the number of edges induced by a vertex set of size $\alpha k$ to the number induced by its best $k$-subset, losing at most an $O(\alpha^2)$ factor. Combining this with the bounds above yields the claimed relation between the approximation ratios of the two problems, and since $\alpha$ was chosen arbitrarily, this completes the proof.

We remark that the case $\eta = n$ corresponds to the case $k = |V_H|$ in the D$k$S problem, which has a trivial solution and would make the theorem statement vacuous. We state the condition $\eta < n$ just to emphasize that the theorem is only meaningful in that case.

3.2 Influence maximization with cumulative activation (IM-CA) problem

For the IM-CA problem, we prove a strong inapproximability result, even when the base graph is a bipartite graph.

Theorem 5

For any $\epsilon > 0$, it is NP-hard to approximate the IM-CA problem within a factor of $n^{1-\epsilon}$, where $n$ is the input size.

Proof.

Similar to the proof of the inapproximability result in [20], we construct a reduction from the SET COVER problem. The input of the SET COVER problem includes a ground set $X = \{x_1, \dots, x_m\}$, a collection of subsets $\mathcal{C} = \{C_1, \dots, C_q\}$, and a positive integer $\ell$. The question is whether there exist $\ell$ subsets in $\mathcal{C}$ whose union is $X$.

Given an instance of the set cover problem, we construct an instance of the IM-CA problem as follows. There are three types of nodes: set nodes, element nodes, and dummy nodes. There is a set node corresponding to each set, an element node corresponding to each element, and a directed edge with activation probability 1 from a set node to an element node if the element belongs to the set (and no edge otherwise). There are $N$ dummy nodes (where $N$ is polynomially large in $m + q$), and there is a directed edge from every element node to every dummy node, each with the same activation probability $p \in (0, 1)$. The activation thresholds of set nodes, element nodes and dummy nodes are 1, 1 and $1 - (1-p)^m$, respectively, so that a dummy node becomes cumulatively active if and only if all $m$ element nodes are activated with probability 1. The budget on the seed set size is $k = \ell$ and the target set is the set of all nodes. Notice that the input size of our IM-CA problem is dominated by $N$.

Under our construction, if there exists a collection of $\ell$ sets covering all elements of $X$, then in the IM-CA problem the corresponding set of $\ell$ set nodes cumulatively activates all element nodes and all dummy nodes; in total, $\ell + m + N$ nodes become cumulatively active. On the other hand, consider the case where there is no set cover of size $\ell$. Again we can assume all seeds are selected from the set nodes, since as seed candidates set nodes are at least as useful as element nodes and dummy nodes. If there is no set cover of size $\ell$, then no $\ell$ seeds can activate all the element nodes with probability 1, hence none of the dummy nodes is cumulatively activated. Therefore, the total number of cumulatively activated nodes is at most $\ell + m$. It follows that if a polynomial-time algorithm could approximate the IM-CA problem within a factor of $n^{1-\epsilon}$, then, with $N$ chosen large enough, we could distinguish the two cases and answer the decision version of the SET COVER problem in polynomial time, which is impossible under the assumption P $\neq$ NP.

4 Efficient Heuristic Algorithms

In Section 3, we proved that both SM-CA with $\eta < n$ and IM-CA are hard to approximate. Despite this difficulty, in this section we present efficient heuristic algorithms based on the greedy strategy, in order to tackle the problems in practice. We first outline our greedy strategies in Section 4.1. In Section 4.2, we adopt an efficient estimation method to design scalable greedy algorithms.

4.1 Greedy Strategies

In this section, we introduce two greedy strategies for the SM-CA and IM-CA problems.

From Section 3.1.1, we know that greedy selection guided by the surrogate function $g$ guarantees a good approximation ratio for the SM-CA problem with $\eta = n$. Thus, intuitively we could adopt $g$ as our surrogate objective even when $\eta < n$ and apply the greedy strategy based on $g$. However, our initial experiments demonstrate that directly adopting $g$ is less effective, especially when the seed set size is relatively small. We believe this is because greedy on $g$ prefers a large increment of $ap_S(v)$ for nodes far below their thresholds over a small increment for nodes close to their thresholds, while the latter actually produces new cumulative activations. To guide seed selection towards the latter case, we generalize $g$ to $g_\lambda(S) = \sum_{v \in U} \min\{ap_S(v), \lambda \theta_v\}$ by introducing an additional parameter $\lambda \ge 1$.

A large $\lambda$ reduces the difficulty of lifting $ap_S(v)$ over the threshold when it is getting close to $\theta_v$, because $g_\lambda$ continues to reward increases of $ap_S(v)$ above $\theta_v$, while a $\lambda$ close to 1 has the reverse effect. Essentially, $\lambda$ balances between the truncated surrogate (when $\lambda = 1$) and the expected influence function (when $\lambda$ is large). Thus, our first greedy strategy is to use $g_\lambda$ with a properly tuned $\lambda$ as the greedy objective, and we call it the balanced truncation greedy (BTG) strategy.

The second strategy is to apply greedy selection on the objective function $f$ directly. That is, we select the node with the largest increment to $f$ in each step. However, since $f$ is a discrete counting function, there could be many nodes having the same effect (or no effect at all) in any step. For tie-breaking, we select nodes according to $g_\lambda$, which is equivalent to using $g$ in this situation. In summary, the second strategy preferentially selects the node promoting $f$ most, and among nodes with the same promotion to $f$ it chooses the one contributing most to $g_\lambda$. In this strategy, the objective function $f$ plays the dominant role in selecting seeds. We call it the activation dominance greedy (ADG) strategy.
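A small sketch of the two selection criteria, assuming a helper `ap_hat(seeds)` that returns a dictionary of estimated activation probabilities (all names here are our own illustration; `lam` is the BTG parameter $\lambda$):

def btg_score(ap_est, theta, lam):
    """Balanced truncation objective g_lambda(S) = sum_v min(ap_S(v), lam * theta_v)."""
    return sum(min(ap_est[v], lam * theta[v]) for v in theta)

def adg_key(ap_est, theta):
    """Activation dominance key: primarily the number of cumulatively active
    nodes f(S); the surrogate g(S) breaks ties among equal f values."""
    f_val = sum(1 for v in theta if ap_est[v] >= theta[v])
    g_val = sum(min(ap_est[v], theta[v]) for v in theta)
    return (f_val, g_val)

# In each greedy step, one would pick the candidate w maximizing
#   btg_score(ap_hat(seeds | {w}), theta, lam)    # BTG strategy
# or, lexicographically,
#   adg_key(ap_hat(seeds | {w}), theta)           # ADG strategy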

During the execution of the greedy algorithms, we need to estimate $ap_S(v)$ for each node $v$. It would be very expensive to do this estimation by Monte Carlo simulations: by Lemma 4, on the order of $\frac{1}{\epsilon^2}\log\frac{n}{\delta}$ simulations are needed to guarantee the accuracy, and each simulation takes $O(|E|)$ time in the worst case, so every candidate evaluation is costly. To improve the efficiency, we adopt the reverse reachable set (RR set) approach, as described in the next section.

4.2 Greedy Algorithms Based on RR Set

In this section, we present our efficient algorithms based on RR sets. We first introduce the background of RR sets. The RR set was first proposed by Borgs et al. in 2014 [3] to provide the first near-linear-time algorithm for the classical influence maximization problem of [20]. The approach was further optimized in a series of follow-up works [29, 30, 26]. The definition of an RR set is as follows:

Definition 3 (Reverse Reachable Set)

Let $v$ be a node in $V$, and let $G'$ be a random graph obtained by independently removing each edge $(u, w)$ of $G$ with probability $1 - p_{uw}$. The reverse reachable set (RR set) for $v$ is the set of nodes in $G'$ that can reach $v$.

Borgs et al. established a crucial connection between RR sets and the influence propagation process on $G$. We restate it in Lemma 5.

Lemma 5 ([3])

Let $S$ be a seed set and $v$ be a fixed node. Suppose $RR(v)$ is an RR set for $v$ generated from a random graph $G'$; then $ap_S(v)$ equals the probability that $S$ overlaps with $RR(v)$, that is,

$ap_S(v) = \Pr[S \cap RR(v) \neq \emptyset]$.

Now we introduce our new method to estimate $ap_S(v)$ for each node $v$. We first generate $R$ RR sets for $v$ independently. Let $\mathcal{R}_v$ be the collection of all generated RR sets for $v$. For any node set $S$, let $F_{\mathcal{R}_v}(S)$ be the fraction of RR sets in $\mathcal{R}_v$ overlapping with $S$. Then, for any $S$, we use $F_{\mathcal{R}_v}(S)$ as the estimate of $ap_S(v)$. We can bound the estimation error if $R$ is large enough.
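A minimal sketch of RR set generation and the resulting estimate (our own illustration; `in_edges[w]` lists the incoming neighbors of $w$ together with edge probabilities, and all names are assumptions):

import random
from collections import deque

def generate_rr_set(in_edges, v):
    """One RR set for v: reverse BFS in which each incoming edge (u, p_uv)
    is kept independently with probability p_uv (live-edge sampling)."""
    rr, queue = {v}, deque([v])
    while queue:
        w = queue.popleft()
        for u, p in in_edges.get(w, []):
            if u not in rr and random.random() < p:
                rr.add(u)
                queue.append(u)
    return rr

def estimate_ap(in_edges, v, seeds, num_rr=10000):
    """Estimate ap_S(v) as the fraction of RR sets for v that the seed set hits."""
    hit = sum(1 for _ in range(num_rr) if seeds & generate_rr_set(in_edges, v))
    return hit / num_rr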

Lemma 6

For any $\epsilon, \delta > 0$, if $R$ satisfies $R \ge \frac{1}{2\epsilon^2}\ln\frac{2}{\delta}$, then for each node $v$ and any seed set $S$,

$\Pr\big[\,|F_{\mathcal{R}_v}(S) - ap_S(v)| \ge \epsilon\,\big] \le \delta$.

Proof.

Let $Y = R \cdot F_{\mathcal{R}_v}(S)$; then $Y$ is the number of RR sets in $\mathcal{R}_v$ overlapping with $S$. Moreover, $Y$ can be regarded as a sum of Bernoulli variables. Specifically, $Y = \sum_{i=1}^{R} Y_i$, where $Y_i = 1$ if $S$ overlaps with the $i$-th RR set in $\mathcal{R}_v$ and $Y_i = 0$ otherwise. Based on Lemma 5, $\mathbb{E}[Y_i] = ap_S(v)$. By Hoeffding's inequality and the condition on $R$, we have $\Pr\big[\,|F_{\mathcal{R}_v}(S) - ap_S(v)| \ge \epsilon\,\big] \le 2\exp(-2R\epsilon^2) \le \delta$.

We now present our greedy algorithms. Recall that we use two greedy functions: $g_\lambda(S) = \sum_{v \in U} \min\{ap_S(v), \lambda \theta_v\}$ and $f(S) = \sum_{v \in U} \mathbb{I}\{ap_S(v) \ge \theta_v\}$, where $\mathbb{I}\{\cdot\}$ is the indicator function.

To make the algorithms easier to follow, we describe the seed selection steps as subprograms. We first present the framework of the whole greedy algorithm for the IM-CA problem in Algorithm 3.

Input: graph $G$, target set $U$, thresholds $\{\theta_v\}$, budget $k$, number of RR sets $R$
Output: seed set $S$
1:  set $S \leftarrow \emptyset$
2:  generate $R$ RR sets $\mathcal{R}_v$ for each node $v \in U$
3:  set $r_v \leftarrow \lceil \theta_v \cdot R \rceil$ for each node $v \in U$
4:  for $i = 1$ to $k$ do
5:     $u \leftarrow$ SS($\cdot$)
6:     /* SS is a general term for SSBT and SSAD */
7:     $S \leftarrow S \cup \{u\}$
8:     remove all RR sets containing $u$
9:     for each $v$ in $U$ do
10:        $d_v$: the number of RR sets removed from $\mathcal{R}_v$
11:        $r_v \leftarrow r_v - d_v$
12:     end for
13:  end for
14:  return $S$
Algorithm 3: Framework of the greedy algorithm for the IM-CA problem

In Algorithm 3, we first initialize the seed set $S = \emptyset$ (line 1). Then we generate $R$ RR sets for each node $v$ in $U$; let $\mathcal{R}_v$ be the collection of RR sets for $v$. In line 3, $r_v$ is the requirement of node $v$, which is the number of RR sets in $\mathcal{R}_v$ that need to be hit by a seed set so that $v$ can become cumulatively active. We say a set $S$ hits an RR set $RR$ if $S \cap RR \neq \emptyset$. Based on Lemma 6, $v$ is cumulatively active (up to estimation error) only if at least $\lceil \theta_v \cdot R \rceil$ RR sets in $\mathcal{R}_v$ are hit by the seed set. Thus, we set $r_v = \lceil \theta_v \cdot R \rceil$ for each node $v$ in $U$.

At each step, we add a new node $u$ into the current seed set $S$ (line 5). After $u$ is selected, we remove all RR sets containing $u$ and update the requirements $r_v$ for all nodes. The algorithm ends when $|S| = k$.

Note that Algorithm 3 needs to call the seed selection procedures (line 5). Here, SS($\cdot$) is a general term for our two subprograms SSBT (Selecting Seeds via Balanced Truncation strategy) and SSAD (Selecting Seeds via Activation Dominance strategy). Specifically, SSBT (Procedure 4) is the subprogram that selects the node with the largest marginal increment of $g_\lambda$ and adds it to the current seed set $S$. SSAD (Procedure 5) is the subprogram that selects the node with the largest marginal increment of $f$, with tie-breaking on $g_\lambda$. The algorithm calling SSBT is named BTG-IM-CA (Balanced Truncation Greedy algorithm for the IM-CA problem) and the algorithm calling SSAD is named ADG-IM-CA (Activation Dominance Greedy algorithm for the IM-CA problem).
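The following compact sketch (our own, with assumed helper names) mirrors the framework of Algorithm 3: it keeps, for every target node, its remaining RR sets and requirement, and repeatedly asks a pluggable `select_seed` routine (standing in for SSBT or SSAD) for the next seed.

import math

def im_ca_framework(nodes, rr_sets, theta, k, select_seed):
    """rr_sets[v] is a list of RR sets (Python sets) for node v; theta[v] is its
    threshold; select_seed(rr_sets, req, seeds) returns the next seed node."""
    req = {v: math.ceil(theta[v] * len(rr_sets[v])) for v in nodes}  # r_v = ceil(theta_v * R)
    seeds = set()
    for _ in range(k):
        u = select_seed(rr_sets, req, seeds)
        seeds.add(u)
        for v in nodes:                          # remove RR sets hit by u, update r_v
            remaining = [rr for rr in rr_sets[v] if u not in rr]
            req[v] -= len(rr_sets[v]) - len(remaining)
            rr_sets[v] = remaining
    return seeds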

Input: $\{\mathcal{R}_v\}_{v \in U}$, requirements $\{r_v\}_{v \in U}$, current seed set $S$, parameter $\lambda$
Output: a new seed $u^*$
1:  set $M(u) \leftarrow 0$ for all $u \in V \setminus S$
2:  for each node $v \in U$ with $r_v > 0$ do
3:     for each node $u \in V \setminus S$ do
4:        /* compute the marginal increment of $u$ to $v$ */
5:        $M(u) \leftarrow M(u) + \min\{c_v(u), \lambda \, r_v\}$
6:     end for
7:  end for
8:  select $u^* \in \arg\max_{u \in V \setminus S} M(u)$
9:  return $u^*$
Procedure 4: SSBT: Selecting Seeds via Balanced Truncation strategy

Now we describe our two subprograms SSBT and SSAD. We first introduce SSBT in Procedure 4. Let $M(u)$ be the value of the marginal increment generated by a candidate node $u$, and let $c_v(u)$ be the number of RR sets in $\mathcal{R}_v$ containing $u$. In the main loop of SSBT, we select the node providing the largest marginal increment to $g_\lambda$. To this end, for each node $u$, we compute the marginal increment of $u$ with respect to every node $v$ that is not cumulatively active yet. Based on Lemma 5, the marginal increment of a node $u$ to a node $v$ can be measured by $\min\{c_v(u), \lambda \, r_v\}$. Summing up the increments of $u$ over all not-yet cumulatively active nodes, we obtain $M(u)$ (see details in lines 2–7). Then, we choose the node with the maximum