A N-FAMILY approach for IM Algorithms in IC model

Fast Influence Maximization in Dynamic Graphs: A Local Updating Approach

Abstract

We propose a generalized framework for influence maximization in large-scale, time-evolving networks. Many real-life influence graphs, such as social networks, telephone networks, and IP traffic data, exhibit dynamic characteristics: the underlying structure and communication patterns evolve with time. Correspondingly, we develop a dynamic framework for the influence maximization problem, in which we perform effective local updates to quickly adjust the top-$k$ influencers as the structure and communication patterns in the network change. We design a novel N-Family approach (N = 1, 2, 3, ...) based on the maximum influence arborescence (MIA) propagation model, with an approximation guarantee of $(1 - 1/e)$. We then develop heuristic algorithms by extending the N-Family approach to other information propagation models (e.g., independent cascade, linear threshold) and influence maximization algorithms (e.g., CELF, reverse reachable sketch). Based on a detailed empirical analysis over several real-world, dynamic, and large-scale networks, we find that our proposed solution, N-Family, improves the updating time of the top-$k$ influencers by orders of magnitude compared to state-of-the-art algorithms, while ensuring similar memory usage and influence spreads.


1 Introduction

The problem of influence analysis [7, 11] has been widely studied in the context of social networks, because of its many applications in viral marketing and targeted recommendations. Influence analysis is also closely related to information diffusion and outbreak detection. The general assumption in the bulk of the literature on this problem is that a static network has already been provided, and the objective is to identify the top-$k$ seed users in the network such that the expected number of influenced users, starting from those seed users and following an influence diffusion model, is maximized.

In recent years, however, it has been recognized that there is an inherent usefulness in studying the dynamic network setting [2, 18, 25], and influence analysis is no exception to this general trend [19, 17], because many real-world social networks evolve over time. In a time-evolving graph, new edges (interactions) and nodes (users) are continuously added, while old edges and nodes become dormant, or are even deleted. In addition, the communication pattern and frequency may also change, e.g., certain regions of the network can suddenly become more active. Some real-world examples are as follows:

  • In a social network (e.g., Twitter), users may continuously message one another, or they may post public messages referencing each other. Such interactions can be viewed as active links between the users.

  • In a message board (e.g., Youtube), users may repeatedly exchange messages with one another on a thread. This represents an interaction between a pair of users.

  • In a telecommunication network, calls between participants represent signs of activity between them.

  • In an academic collaboration network (e.g., DBLP), users may co-author papers with one another, which represents a pattern of influence between them. New collaborations are formed, and past collaborations become stale.

Figure 1: Running example: an influence graph
Figure 2: Influence graph after update operation: edge deletion

From an influence analysis perspective, even modest changes in the underlying network structure (e.g., addition/ deletion of nodes and edges) and communication patterns (e.g., update in influence probabilities over time) may lead to changes in the top- influential nodes. As an example, let us consider the influence graph in Figure 1 with 12 nodes, out of which the top- seed nodes are and (marked in bold), following the Maximum Influence Arborescence (MIA) model and [11] (we shall introduce the details of the MIA model later). The influence spread obtained from this seed set, according to the MIA model is: . Now, assume an update operation in the form of an edge removal (marked in red). The new influence spread obtained from the old seed nodes would be: , whereas if we recompute the top- seed nodes, they are and , as shown in Figure 2. The influence spread from these new seed nodes is: . It can be observed that there is a significant difference in the influence spread obtained with the old seed set vs. the new ones (even for such a small example graph), which motivates us to efficiently update the seed nodes when the influence graph evolves.

However, computing the seed set from ground, after every update, is prohibitively expensive [19, 17] — this inspires us to develop dynamic influence maximization algorithms. By carefully observing, we realize that among the initial two seed nodes, only one seed node, namely is replaced by , whereas still continues to be a seed node. It is because is in the affected region of the update operation, whereas is not affected by it. Therefore, if we can identify that can no longer continue as a seed node, then we can remove it from the seed set; and next, the aim would be to find one new seed node instead of two. Hence, we save almost of the computation in updating the seed set.

To this end, the two following questions are critical for identifying the top- seed nodes in a dynamic environment.

  • What are the regions affected when the network evolves?

  • How to efficiently update the seed nodes with respect to such affected regions?

Affected region. The foremost query that we address is identifying the affected region, i.e., the set of nodes potentially affected due to the update. They could be: (1) the nodes (including some old seed nodes) whose influence spreads are significantly changed due to the update operation, and also (2) those nodes whose marginal gains might change due to an affected seed node, discovered in the previous step(s). Given a seed set , the marginal gain of a node is computed as the additional influence that can introduce when it is added to the seed set.

Given the influence graph and dynamic updates, we design an iterative algorithm to quickly identify the nodes in the affected region. We call our method the N-Family approach, N = 1, 2, 3, ... (iterating until a base condition is satisfied), which we shall discuss in Section 3.

Updating the seed nodes. Once the affected region is identified, updating the top-$k$ seed set with respect to that affected region is also a challenging problem. In this work, we develop an approximate algorithm under the MIA model of information diffusion, with a theoretical performance guarantee of $(1 - 1/e)$.

Moreover, it should be understood that our primary aim is to maximize the influence spread as much as possible with the new seed nodes, instead of searching for the exact seed nodes (in fact, finding the exact top- seed nodes is -hard [11]). To illustrate this fact, assume that the ideal seed set generates an influence spread , and it takes units of time to identify them. Another set of nodes , whose influence spread is (of course, ), requires only units of time to find them, and . Then, one might prefer finding over , especially in the presence of dynamic graph updates. Therefore, we also show how to design more efficient heuristic algorithms, by carefully tuning the parameters (e.g., by limiting ) of our N-Family approach.

Our proposed framework for updating the top-$k$ seed nodes is a generic one, and we develop heuristics by using it on top of a series of information diffusion models (e.g., independent cascade (IC) [11], linear threshold (LT) [11]) and several influence maximization algorithms (e.g., Greedy [11], CELF [12], reverse reachable (RR) sketch [4, 22, 23]). In particular, we first find the affected region, and then update the seed nodes by adding only a few sub-routines to the existing static influence maximization algorithms, so that they can easily adapt to dynamic changes.

Our contributions. The contributions of our work can be summarized as follows.

  • We propose an iterative technique, N-Family, that systematically identifies the nodes (including old seed nodes) affected by dynamic updates, and we develop an incremental method that replaces the affected seed nodes with new ones, so as to maximize the influence spread in the updated graph. We derive time complexities and theoretical performance guarantees of our algorithm under the MIA model.

  • We show how to develop efficient heuristics by extending the proposed algorithm to other information propagation models and influence maximization algorithms for updating the seed nodes in an evolving network.

  • We conduct a thorough experimental evaluation using several real-world, dynamic, and large graph datasets. The empirical results with our heuristics attest to orders-of-magnitude efficiency improvements compared to state-of-the-art approaches [19, 17]. A snippet of our empirical results is presented in Table 1.

Dataset (#nodes, #edges) | UBI+ [19] | Family-CELF [our method] | DIA [17] | Family-RRS [our method]
Digg (30K, 85K) | 3.36 sec | 0.008 sec | 5.60 sec | 0.20 sec
Slashdot (51K, 130K) | 11.3 sec | 0.05 sec | 35.16 sec | 2.96 sec
Epinions (0.1M, 0.8M) | 1111.21 sec | 24.58 sec | 134.68 sec | 5.31 sec
Flickr (2.3M, 33M) | 45108.09 sec | 1939.40 sec | 770.41 sec | 273.50 sec
Table 1: Average seed-set-updating time (sec) per node addition in the influence graph; the seed set consists of the top-30 seed nodes; IC model for influence cascade. For more details, we refer to Section 4.

2 Preliminaries

An influence network can be modeled as an uncertain graph $G = (V, E, P)$, where $V$ and $E \subseteq V \times V$ denote the sets of nodes (users) and directed edges (links between users) in the network, respectively. $P: E \to (0, 1)$ is a function that assigns a probability to every edge $(u, v) \in E$, such that $P_{uv}$ is the strength at which an active user $u$ influences her neighbor $v$. The edge probabilities can be learnt (from past propagation traces), or inferred (following various models), as in [9, 3]. In this work, we shall assume that $G = (V, E, P)$ is given as an input to our problem.

2.1 Influence Maximization in Static Graphs

Whenever a social network user $u$ buys a product, or endorses an action (e.g., re-tweets a post, or re-shares a picture), she is viewed as being influenced or activated. When $u$ is active, she automatically becomes eligible to influence her neighbors who are not active yet. While our framework can be applied on top of a variety of influence diffusion models, for brevity we introduce only the maximum influence arborescence (MIA) [6] and independent cascade (IC) [11] models, because we develop the approximate algorithm with a theoretical guarantee on the former, and an efficient heuristic on the latter. We shall, however, elaborate later on how our techniques can be employed over other influence propagation models, such as the linear threshold (LT) [11] model.

Algorithm 1: Greedy for IM in static networks
Require: Graph $G = (V, E, P)$, seed set $S$ (initially empty), positive integer $k$
Ensure: Seed set $S$ having the top-$k$ seed nodes
1:  while $|S| < k$ do
2:     $u^* = \arg\max_{u \in V \setminus S} \{\sigma(S \cup \{u\}) - \sigma(S)\}$
3:     $S = S \cup \{u^*\}$
4:  end while
5:  Output $S$

MIA model. We start with an already active set of nodes $S$, called the seed set, and the influence from the seed nodes propagates only via the maximum influence paths. A path from a source node $u$ to a destination node $v$ is called the maximum influence path, $MIP(u, v)$, if it has the highest probability among all paths between the same pair of nodes. Ties are broken in a predetermined and consistent way, such that the maximum influence path between a pair of nodes is always unique. Formally,

$$MIP(u, v) = \arg\max_{\pi \in \mathcal{P}(u, v)} \prod_{e \in \pi} P_e \tag{1}$$

Here, $\mathcal{P}(u, v)$ denotes the set of all paths from $u$ to $v$, and $P_e$ is the probability of edge $e$. In addition, an influence threshold $\theta$ (which is an input parameter to trade off between efficiency and accuracy [6]) is used to eliminate maximum influence paths that have smaller propagation probabilities than $\theta$.
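As an illustration of the maximum influence path and its threshold, the following is a minimal Python sketch, not the authors' implementation. It assumes the graph is stored as a dictionary of out-neighbor probabilities; the names `max_influence_paths` and `path_to` are hypothetical. Because every edge probability is at most 1, the path probability never increases along a path, so a Dijkstra-style search that always expands the currently most probable node finds the maximum-probability paths, and any prefix that falls below the threshold can be pruned.

```python
import heapq
from itertools import count


def max_influence_paths(graph, source, theta):
    """Return (best, parent) for maximum influence paths from `source`.

    graph  : dict mapping node -> {out_neighbor: edge_probability}
             (an assumed representation, not taken from the paper)
    theta  : influence threshold; paths with probability < theta are dropped
    best   : dict node -> probability of its maximum influence path
    parent : dict node -> predecessor on that path (None for the source)
    """
    best = {source: 1.0}
    parent = {source: None}
    tie = count()  # tie-breaker so the heap never compares node objects
    heap = [(-1.0, next(tie), source)]
    settled = set()
    while heap:
        neg_prob, _, u = heapq.heappop(heap)
        if u in settled:
            continue  # stale heap entry
        settled.add(u)
        for v, p in graph.get(u, {}).items():
            prob = -neg_prob * p
            if v in settled or prob < theta:
                continue  # below the threshold: prune this path
            if prob > best.get(v, 0.0):
                best[v] = prob
                parent[v] = u
                heapq.heappush(heap, (-prob, next(tie), v))
    return best, parent


def path_to(parent, target):
    """Rebuild the maximum influence path ending at `target`."""
    path = []
    while target is not None:
        path.append(target)
        target = parent[target]
    return path[::-1]
```

For example, with edges a→b (0.5), a→c (0.9), c→b (0.8), and b→d (0.1), the maximum influence path from a to b goes through c, and with a threshold of 0.1 the node d is pruned because its path probability falls below the threshold.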

IC model. This model assumes that the diffusion process from the seed nodes continues in discrete time steps. When some node $u$ first becomes active at step $t$, it gets a single chance to activate each of its currently inactive out-neighbors $v$; it succeeds with probability $P_{uv}$. If $u$ succeeds, then $v$ will become active at step $t + 1$. Whether or not $u$ succeeds at step $t$, it cannot make any further attempts in the subsequent rounds. If a node has incoming edges from multiple newly activated nodes, their attempts are sequenced in an arbitrary order. Also, each node can be activated only once and it stays active until the end. The campaigning process runs until no more activations are possible.
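The IC process described above can be sketched as a short simulation. This is an illustrative sketch under assumptions: the graph is a dictionary of out-neighbor probabilities, and the function names `simulate_ic` and `estimate_spread` are hypothetical. Averaging many simulated cascades is the usual Monte Carlo way to approximate the expected number of activated nodes.

```python
import random
from collections import deque


def simulate_ic(graph, seeds, rng):
    """Run one independent-cascade diffusion and return the active set.

    graph : dict mapping node -> {out_neighbor: activation_probability}
    seeds : iterable of initially active nodes
    rng   : random.Random instance, passed in for reproducibility
    """
    active = set(seeds)
    frontier = deque(active)
    while frontier:
        u = frontier.popleft()
        # u gets exactly one chance on each currently inactive out-neighbor.
        for v, p in graph.get(u, {}).items():
            if v not in active and rng.random() < p:
                active.add(v)
                frontier.append(v)
    return active


def estimate_spread(graph, seeds, runs=1000, seed=0):
    """Monte Carlo estimate of the expected number of activated nodes."""
    rng = random.Random(seed)
    total = sum(len(simulate_ic(graph, seeds, rng)) for _ in range(runs))
    return total / runs
```

The first-in, first-out frontier processes nodes in the order they became active, which matches the step-by-step description; the order of attempts among nodes activated in the same step is arbitrary, as the model allows.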

Influence estimation problem. All nodes that are active at the end of a diffusion process are considered to be influenced by the seed set $S$. Because of the stochastic nature of diffusion, we consider the expected influence spread (i.e., the expected number of influenced users at the end of the diffusion process) of $S$, denoted as $\sigma(S)$.

In an uncertain graph $G = (V, E, P)$, influence estimation is the problem of identifying the expected influence spread $\sigma(S)$ of $S \subseteq V$. Let us denote by $pp(S, v)$ the probability that $v$ gets activated by $S$ during the diffusion process. Then, the expected influence spread of $S$ is given below.

$$\sigma(S) = \sum_{v \in V} pp(S, v) \tag{2}$$
Example 1

In Figure 1, and are the nodes that are influenced by . We have, , whereas for , is . Next, ; . Similarly, . This holds for both MIA and IC model, as there is only one path from to each of , , , and . Additionally, the MIA model uses a threshold , so to eliminate maximum influence paths that have smaller propagation probabilities than . For example, with , we shall not consider the influence of on according to the MIA model. Hence, by Equation 2, (IC model), and (MIA model, ).

Clearly, the computation under the IC model gets more complex with larger graphs and multiple seed nodes. In fact, it has been proved that the exact estimation of influence spread is a #P-hard problem under the IC model [6]. However, influence spread can be computed in polynomial time under the MIA model.

Marginal influence gain. Given a seed set $S$, the marginal gain $MG(S, u)$ of a node $u \notin S$ is computed as the additional influence that $u$ can introduce when it is added to the seed set.

$$MG(S, u) = \sigma(S \cup \{u\}) - \sigma(S) \tag{3}$$

Influence maximization (IM) problem. Influence maximization is the problem of identifying the seed set $S^*$ of cardinality $k$ that has the maximum expected influence spread in the network. IM can be formally defined as an optimization problem as follows.

$$S^* = \arg\max_{S \subseteq V,\ |S| = k} \sigma(S) \tag{4}$$

Influence maximization is an NP-hard problem under both the MIA and IC models [6, 11].

Symbol | Meaning
$G = (V, E, P)$ | uncertain graph
$P_e$ | probability of edge $e$
$\pi$ | a path
$\mathcal{P}(u, v)$ | set of all paths from $u$ to $v$
$MIP(u, v)$ | the highest probability path from $u$ to $v$
$S$ | seed set
$S^i$ | seed set formed after $i$ iterations of Greedy algorithm
$s_i$ | seed node added at the $i$-th iteration of Greedy algorithm
$\sigma(S)$ | expected influence spread from $S$
$pp(S, v)$ | probability that $v$ gets activated by $S$
$MG(S, u)$ | marginal influence gain of $u$ w.r.t. seed set $S$
$Q$ | priority queue that sorts non-seed nodes in descending order of marginal gains (w.r.t. seed set)
Table 2: Notations used and their meanings

2.2 Greedy Algorithm for IM in Static Graphs

In spite of the aforementioned computational challenges of influence estimation and maximization (stated in Section 2.1), the following properties of the influence function assist us in developing a Greedy algorithm (presented in Algorithm 1) with an approximation guarantee of $(1 - 1/e)$ [16].

Lemma 1 (Influence function is sub-modular [11, 6])

A function $f$ is sub-modular if $f(S \cup \{x\}) - f(S) \geq f(T \cup \{x\}) - f(T)$ for any $x$, when $S \subseteq T$.

Lemma 2 (Influence function is monotone [11, 6])

A function $f$ is monotone if $f(S \cup \{x\}) \geq f(S)$ for any $x$.

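The Greedy algorithm repeatedly selects the node with the maximum marginal influence gain (line 2), and adds it to the current seed set (line 3) until $k$ nodes are identified.

As given in Table 2, we denote by $S^i$ the seed set formed at the end of the $i$-th iteration of Greedy, whereas $s_i$ is the seed node added at the $i$-th iteration. Clearly, $S^i = S^{i-1} \cup \{s_i\}$. One can verify that the following inequality holds for all $i, j$ with $1 \leq i < j \leq k$; it follows from the greedy choice at iteration $i$ and sub-modularity, since $MG(S^{i-1}, s_i) \geq MG(S^{i-1}, s_j) \geq MG(S^{j-1}, s_j)$.

$$MG(S^{i-1}, s_i) \geq MG(S^{j-1}, s_j) \tag{5}$$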
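The greedy selection can be sketched in a few lines of Python. This is a minimal illustration rather than the paper's implementation: the spread function is passed in as a callable, and `greedy_im` and `coverage_spread` are hypothetical names. The toy `coverage_spread` simply counts the nodes covered by fixed reach sets, which is monotone and sub-modular, so it stands in for an influence function.

```python
def greedy_im(nodes, spread, k):
    """Pick up to k seeds greedily by maximum marginal gain.

    nodes  : list of candidate nodes
    spread : callable mapping a list of seeds to its influence spread
    Returns (seeds, gains), where gains[i] is the marginal gain of seeds[i]
    at the time it was added.
    """
    seeds, gains = [], []
    current = spread(seeds)
    while len(seeds) < k:
        best_node, best_gain = None, float("-inf")
        for u in nodes:
            if u in seeds:
                continue
            gain = spread(seeds + [u]) - current  # marginal gain of u
            if gain > best_gain:  # strict '>' keeps the first node on ties
                best_node, best_gain = u, gain
        if best_node is None:
            break  # fewer than k candidates available
        seeds.append(best_node)
        gains.append(best_gain)
        current += best_gain
    return seeds, gains


def coverage_spread(reach):
    """Toy spread function: number of nodes covered by the seeds' reach sets."""
    return lambda seeds: len(set().union(*(reach[s] for s in seeds)))
```

On a monotone, sub-modular spread function, the returned gains are non-increasing, which is the ordering property that the incremental update later relies on.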

2.3 IM in Dynamic Graphs

Classical influence maximization techniques are developed for static graphs. Real-world influence graphs, however, are seldom static and evolve over time: nodes or edges are added or deleted, and the propagation probabilities may change.

Graph update categories. We recognize six update operations in dynamic graphs, of which four are edge operations and two are node operations: 1. increase in edge probability, 2. adding a new edge, 3. adding a new node, 4. decrease in edge probability, 5. deleting an existing edge, and 6. deleting an existing node. We refer to the first three update operations as additive updates, because the size of the graph and its parameters increase with these operations, and to the remaining three as reductive updates. Hereafter, we use the general term update for any of the above operations, unless otherwise specified, and we denote an update operation by $o$.
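The six update operations and their additive/reductive split can be captured in a small Python sketch. The names `UpdateKind`, `is_additive`, and `classify_edge_update` are hypothetical and only illustrate the categorization; an absent edge is represented by `None`, which is an assumption of this sketch.

```python
from enum import Enum


class UpdateKind(Enum):
    EDGE_PROB_INCREASE = "increase in edge probability"
    EDGE_ADD = "adding a new edge"
    NODE_ADD = "adding a new node"
    EDGE_PROB_DECREASE = "decrease in edge probability"
    EDGE_DELETE = "deleting an existing edge"
    NODE_DELETE = "deleting an existing node"


# The first three operations are additive; the remaining three are reductive.
ADDITIVE = frozenset({
    UpdateKind.EDGE_PROB_INCREASE,
    UpdateKind.EDGE_ADD,
    UpdateKind.NODE_ADD,
})


def is_additive(kind):
    """True for additive updates, False for reductive ones."""
    return kind in ADDITIVE


def classify_edge_update(old_prob, new_prob):
    """Classify an edge change; None means the edge is absent.

    Returns an UpdateKind, or None when nothing changed.
    """
    if old_prob is None and new_prob is None:
        return None
    if old_prob is None:
        return UpdateKind.EDGE_ADD
    if new_prob is None:
        return UpdateKind.EDGE_DELETE
    if new_prob > old_prob:
        return UpdateKind.EDGE_PROB_INCREASE
    if new_prob < old_prob:
        return UpdateKind.EDGE_PROB_DECREASE
    return None
```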

Dynamic influence maximization problem.

Problem 1

Given an initial uncertain graph $G = (V, E, P)$, the old set of top-$k$ seed nodes, and a series of consecutive graph updates $\{o_1, o_2, \ldots, o_t\}$, find the new set of top-$k$ seed nodes for the updated graph.

The baseline method to solve the dynamic influence maximization problem is to construct the updated graph after every update, and then execute an IM algorithm on it, which returns the new top-$k$ seed nodes. However, computing all seed nodes from scratch at every snapshot is prohibitively expensive, even for moderate-size graphs [19, 17]. Hence, our work aims at incrementally updating the seed set, without explicitly running the complete IM algorithm at every snapshot of the evolving graph.

3 Proposed Solution

We propose a novel N-Family framework for dynamic influence maximization, which can be adapted to many influence maximization algorithms and several influence diffusion models. We first introduce our framework, which illustrates how an update affects the nodes in the graph (Section 3.1), and how to re-adjust the top-$k$ seed nodes with a theoretical performance guarantee under the MIA model (Section 3.2). Initially, we explain our technique for a single dynamic update, and later we show how it can be extended to batch updates (Section 3.2.3). In Sections 3.3 and 3.4, we show how to extend our algorithm to the IC model and the LT model, respectively, for developing efficient heuristics.

3.1 Finding Affected Regions

Given an update, the influence spread of several nodes in the graph could be affected. However, the nearby nodes would be impacted heavily, compared to a distant node. We, therefore, design a threshold-based approach to find the affected regions, and our method is consistent with the notion of the MIA model.

Problem 2

Given an update operation $o$ in an uncertain graph $G = (V, E, P)$, find all nodes $v \in V$ for which the expected influence spread $\sigma(\{v\})$ is changed by at least $\theta$.

In the MIA model, the affected nodes could be computed exactly in polynomial time (e.g., by exactly finding the expected influence spread of each node before and after the update, with the MIA model). In this work, however, we consider a more efficient upper bounding technique, as discussed next.

Definitions

We start with a few definitions.

Definition 1 (Maximum Influence In-Arborescence)

The Maximum Influence In-Arborescence (MIIA) [6] of a node $v$ is the union of all the maximum influence paths to $v$ in which every node on the path reaches $v$ with a minimum propagation probability of $\theta$; it is denoted as $MIIA(v, \theta)$. Formally,

$$MIIA(v, \theta) = \bigcup_{u \in V,\ \prod_{e \in MIP(u, v)} P_e \geq \theta} MIP(u, v) \tag{6}$$

Definition 2 (Maximum Influence Out-Arborescence)

The Maximum Influence Out-Arborescence (MIOA) [6] of a node $u$ is the union of all the maximum influence paths from $u$ in which $u$ can reach every node on the path with a minimum propagation probability of $\theta$; it is denoted as $MIOA(u, \theta)$.

$$MIOA(u, \theta) = \bigcup_{v \in V,\ \prod_{e \in MIP(u, v)} P_e \geq \theta} MIP(u, v) \tag{7}$$

Definition 3 (1-Family)

For every node $u \in V$, the 1-Family of $u$, denoted as $F_1(u)$, is the set of nodes that influence $u$, or get influenced by $u$, with minimum probability $\theta$ through the maximum influence paths, i.e.,

$$F_1(u) = MIIA(u, \theta) \cup MIOA(u, \theta) \tag{8}$$

Definition 4 (2-Family)

For every node $u \in V$, the 2-Family of $u$, denoted as $F_2(u)$, is the union of the 1-Families of every node in $F_1(u)$, i.e.,

$$F_2(u) = \bigcup_{w \in F_1(u)} F_1(w) \tag{9}$$

Note that the 2-Family of a node is always a superset of its 1-Family.

Example 2

In Figure 1, let us consider . Then, , and . For any other node in the graph, its influence on is 0. Hence, . Similarly, . will contain . Analogously will contain . Since the context is clear, for brevity we omit from the notation of family.

We note that Dijkstra’s shortest path algorithm, with time complexity [8, 6], can be used to identify the , , and 1-Family of a node. The time complexity for computing 2-Family is . For simplicity, we refer to 1-Family of a node as its family.
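The family computations can be sketched with a Dijkstra-style search that maximizes path probability, run once on the graph (for the out-arborescence) and once on the reversed graph (for the in-arborescence). This is a hedged sketch, not the authors' code: the dictionary graph representation and the names `reachable`, `reverse`, `family1`, and `family2` are assumptions, and only the node sets of the arborescences are returned, not the paths themselves.

```python
import heapq
from itertools import count


def reachable(graph, source, theta):
    """Nodes whose maximum influence path from `source` has probability >= theta."""
    best = {source: 1.0}
    settled = set()
    tie = count()  # tie-breaker so the heap never compares node objects
    heap = [(-1.0, next(tie), source)]
    while heap:
        neg_prob, _, u = heapq.heappop(heap)
        if u in settled:
            continue
        settled.add(u)
        for v, p in graph.get(u, {}).items():
            prob = -neg_prob * p
            if v not in settled and prob >= theta and prob > best.get(v, 0.0):
                best[v] = prob
                heapq.heappush(heap, (-prob, next(tie), v))
    return settled


def reverse(graph):
    """Reverse every edge, keeping its probability."""
    rev = {}
    for u, nbrs in graph.items():
        for v, p in nbrs.items():
            rev.setdefault(v, {})[u] = p
    return rev


def family1(graph, rgraph, u, theta):
    """1-Family: nodes of the out-arborescence and in-arborescence of u."""
    return reachable(graph, u, theta) | reachable(rgraph, u, theta)


def family2(graph, rgraph, u, theta):
    """2-Family: union of the 1-Families of all nodes in the 1-Family of u."""
    result = set()
    for w in family1(graph, rgraph, u, theta):
        result |= family1(graph, rgraph, w, theta)
    return result
```

Since a node always belongs to its own 1-Family in this sketch, the 2-Family returned by `family2` contains the 1-Family, in line with the superset property noted earlier.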

The 2-Family of a seed node satisfies an interesting property (given in Lemma 3) in terms of marginal gains. It follows from the fact that a node influences, or gets influenced by the nodes that are present only in its family, based on the MIA model.

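Lemma 3

Consider a seed node $s \in S$; then removing $s$ from the seed set $S$ does not change the marginal gain of any node that is not in $F_2(s)$. Formally, $MG(S, v) = MG(S \setminus \{s\}, v)$ for all $v \in V \setminus F_2(s)$, according to the MIA model.

Proof. Let $pp(S, w)$ denote the probability that $w$ gets activated by $S$. According to Eqs. 2 and 3, the marginal gain of $v$ with respect to $S$ is given as:

$$MG(S, v) = \sum_{w \in V} \left[ pp(S \cup \{v\}, w) - pp(S, w) \right] \tag{10}$$

As $v \notin F_2(s)$, the influence of $v$ on any node in $F_1(s)$ is $0$. Hence, Equation 10 can be written as:

$$MG(S, v) = \sum_{w \in V \setminus F_1(s)} \left[ pp(S \cup \{v\}, w) - pp(S, w) \right] \tag{11}$$

Now, the removed seed node $s$ cannot influence any node outside $F_1(s)$. Hence, Equation 11 can be written as:

$$MG(S, v) = \sum_{w \in V \setminus F_1(s)} \left[ pp((S \setminus \{s\}) \cup \{v\}, w) - pp(S \setminus \{s\}, w) \right] \tag{12}$$

As the influence of $v$ on any node in $F_1(s)$ is $0$, Equation 12 can be written as:

$$MG(S, v) = \sum_{w \in V} \left[ pp((S \setminus \{s\}) \cup \{v\}, w) - pp(S \setminus \{s\}, w) \right] = MG(S \setminus \{s\}, v) \tag{13}$$

Hence, the lemma.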

Change in family after an update. During the additive update, e.g., an edge addition, the size of the family of a node nearby the update may increase. A new edge would help in more influence spread, as demonstrated below.

Example 3

Consider Figure 2 as the initial graph. When , . Let us assume that a new edge with probability is added, that is, the updated graph is now Figure 1. If we recompute in Figure 1, then we get .

Analogously, during the reductive update, e.g., an edge deletion, the size of family of a node surrounding the update may decrease. Deleting the edge eliminates paths for influence spread, as follows.

Example 4

Consider Figure 1 as the initial graph. . Now, the edge with probability is deleted. If we recompute after modifying the graph (i.e., Figure 2), we get .

Thus, for soundness, in case of an additive update, we compute MIIA, MIOA, and family on the updated graph. On the contrary, for a reductive update, we compute them on the old graph, i.e., before the update. Next, we show in Lemma 4 that $MIIA(u, \theta)$ provides a safe bound on the affected region for any update originating at node $u$, according to the MIA model.

Lemma 4

In an influence graph $G = (V, E, P)$, adding a new edge $(u, v)$ does not change the influence spread of any node outside $MIIA(u, \theta)$ by more than $\theta$, according to the MIA model.

Proof. Consider a node $w$ outside $MIIA(u, \theta)$ in the original graph $G$, which means that $w$ cannot activate $u$ with a minimum strength of $\theta$ through $MIP(w, u)$, i.e., $\prod_{e \in MIP(w, u)} P_e < \theta$. Then, the strength at which $w$ activates $v$ through $u$ in the updated graph is $P_{uv} \cdot \prod_{e \in MIP(w, u)} P_e$. Since $P_{uv} \leq 1$, we have $P_{uv} \cdot \prod_{e \in MIP(w, u)} P_e < \theta$. Thus, adding the edge $(u, v)$ does not change the expected influence spread of $w$, based on the MIA model. Hence, the lemma follows.

From the above lemma, in fact, it can be understood that adding an edge $(u, v)$ does not change the influence spread (at all) of any node outside $MIIA(u, \theta)$ for the MIA model, and this observation extends to edge deletion, edge probability increase, and edge probability decrease. Moreover, for a node update (both addition and deletion) at node $u$, $MIIA(u, \theta)$ gives a safe upper bound of the affected region. We omit the proof for brevity. Therefore, $MIIA(u, \theta)$ is a safe upper bound for the affected region, and it is efficient to compute, since it requires only a Dijkstra-style search as noted above.

Infected Regions

Due to an update in the graph, we find that a node may get affected in two ways: (1) the nodes (including a few old seed nodes) whose influence spreads are significantly affected due to the update operation, and also (2) those nodes whose marginal gains might change due to an affected seed node, discovered in the previous step(s). This gives rise to a recursive definition, and multiple levels of infected regions, as introduced next.

First infected region (1-IR). Whenever an update operation takes place, the influence spread of the nodes surrounding it will change. Hence, we consider the first infected region as the set of nodes whose influence spreads change by at least $\theta$. Similar to the threshold in the MIA model, our threshold can be decided empirically, so as to ignore negligible changes in the influence spread of a node due to an update operation.

Definition 5 (First infected region (1-IR))

In an influence graph $G = (V, E, P)$ and given a probability threshold $\theta$, for an update operation $o$, 1-IR is the set of nodes whose influence spread changes by at least $\theta$. Formally,

$$\text{1-IR} = \{ v \in V : |\sigma_G(v) - \sigma_{G'}(v)| \geq \theta \} \tag{14}$$

In the above equation, $\sigma_G(v)$ denotes the expected influence spread of $v$ in $G$, whereas $\sigma_{G'}(v)$ is the expected influence spread of $v$ in the updated graph $G'$. Following our earlier discussion, we consider $MIIA(u, \theta)$ as a proxy for 1-IR, where $u$ is the starting node for the update operation $o$.

Example 5

In Figure 1, consider the removal of edge . Assuming , 1-IR=.

Second infected region (2-IR). We next demonstrate how infection propagates from the first infected region to other parts of the graph through the family of affected seed nodes.

First, consider a seed node , a non-seed node , and . If the influence spread of has increased due to an update, then to ensure that continues as a seed node, we have to remove from the seed set, and recompute the marginal gain of every node in . The node, which has the maximum gain, will be the new seed node. Second, if a seed node gets removed from the seed set in this process, the marginal gains of all nodes present in will change. We are now ready to define the second infected region.

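Definition 6 (Second infected region (2-IR))

For an additive update $o$, the influence spread of every node present in 1-IR increases, which gives any node in 1-IR the possibility of becoming a seed node. Hence, the union of the 2-Families of all the nodes present in 1-IR is called the second infected region, 2-IR. On the contrary, in a reductive update operation $o$, there is no increase in the influence spread of any node in 1-IR. Hence, the union of the 2-Families of the old seed nodes present in 1-IR is considered as the second infected region, 2-IR. With $F_2(\cdot)$ denoting the 2-Family and $S$ the old seed set:

$$\text{2-IR} = \bigcup_{u \in \text{1-IR}} F_2(u) \quad \text{(additive update)} \tag{15}$$
$$\text{2-IR} = \bigcup_{s \in \text{1-IR} \cap S} F_2(s) \quad \text{(reductive update)} \tag{16}$$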

The time complexity to identify 2-IR is , where is the number of nodes in 1-IR.

Example 6

In Figure 1, consider the removal of edge . Assuming , 2-IR=. This is because is an old seed node present in 1-IR for this reductive update. Furthermore, because this is a reductive update, the family of needs to be computed before the update. Therefore, 2-IR=.

Iterative infection propagation. Whenever there is an update, the infection propagates through the 2-Family of the nodes whose marginal gain changes, as discussed above. For $N \geq 3$, the infection propagates from the $(N-1)$-th infected region to the $N$-th infected region through the old seed nodes that are present in the 2-Family of nodes in (N-1)-IR.

Definition 7 (N-th infected region (N-IR))

The 2-Families of the seed nodes that are in the 2-Family of infected nodes in (N-1)-IR constitute the $N$-th infected region. With $S$ denoting the old seed set:

$$\text{N-IR} = \bigcup \{ F_2(s) : s \in F_2(u) \cap S,\ u \in (N-1)\text{-IR} \} \tag{17}$$
Figure 3: Iterative infection propagation: is an additive update operation originating at node . and are two old seed nodes. , , , are nodes, not necessarily old seed nodes.

We demonstrate the iterative computation of infected regions, up to 4-IR for an additive update, in Figure 3. We begin with node which is the starting node of the update, and is the 1-IR. The update being an additive one, union of 2-Family of all the nodes is considered as the 2-IR. For all nodes , we compute . Now, union of 2-Family of all seed nodes is considered as 3-IR. Similarly, 4-IR can be deduced, and as there is no seed node present in the 2-Family of all nodes , we terminate the infection propagation.

Termination of infection propagation. The infection propagation stops when no further old seed node is identified in the 2-Family of any node in the infected region. Due to this, there shall be no infected node present in 2-Family of any uninfected seed node. Assume we have a budget on the number of seed nodes. Then, it can be verified that for a reductive update, the maximum value of can be between and . For an additive update, the maximum value of is between to .

Total infected region (TIR). The union of all infected regions is referred to as the total infected region (TIR).

$$TIR = \text{1-IR} \cup \text{2-IR} \cup \text{3-IR} \cup \cdots \tag{18}$$
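The iterative infection propagation and its termination rule can be sketched as follows. This is an illustrative sketch under assumptions: the 2-Family of every node is taken as a precomputed mapping, the function name `total_infected_region` is hypothetical, and seeds that have already been expanded are tracked explicitly so the loop stops as soon as no further old seed node is found.

```python
def total_infected_region(first_ir, old_seeds, family2, additive):
    """Union of all infected regions for one update.

    first_ir  : iterable of nodes in 1-IR
    old_seeds : iterable of old seed nodes
    family2   : dict mapping node -> set of nodes in its 2-Family
                (assumed precomputed on the appropriate graph)
    additive  : True for an additive update, False for a reductive one
    """
    first_ir = set(first_ir)
    old_seeds = set(old_seeds)
    tir = set(first_ir)
    # 2-IR: all of 1-IR for additive updates, only old seeds for reductive.
    starts = first_ir if additive else first_ir & old_seeds
    expanded = starts & old_seeds  # seeds whose 2-Family is already included
    region = set()
    for u in starts:
        region |= family2[u]
    while region:
        tir |= region
        # Old seeds lying in the 2-Family of some node of the current region.
        found = set()
        for u in region:
            found |= family2[u] & old_seeds
        found -= expanded
        if not found:
            break  # no further old seed identified: propagation terminates
        expanded |= found
        region = set()
        for s in found:
            region |= family2[s]
    return tir
```

For a reductive update whose 1-IR contains no old seed node, the sketch returns 1-IR itself, since there is no seed through which the infection could spread.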

Our recursive definition of TIR ensures the following properties.

Lemma 5

The marginal gain of every node outside TIR does not change, according to the MIA model. Formally, let be the old seed set, and we denote by the remaining old seed nodes outside TIR, i.e., . Then, the following holds: , for all nodes .

Lemma 6

Any old seed node outside TIR has no influence on the nodes inside TIR, following the MIA model. Formally, , for all nodes .

Proof.

A seed node can influence only the nodes present in its family according to the MIA model. There is no node present in TIR which belongs to the family of any seed node outside TIR. This is because any uninfected seed node is more than 2-Family away from any node present in TIR (This is how we terminate infection propagation). Hence, the lemma.

The old seed nodes inside TIR may no longer continue as seeds, therefore we need to discard them from the seed set, and the same number of new seed nodes have to be identified. We discuss the updating procedure of seed nodes in the following section.

3.2 Updating the Seed Nodes

We now describe our seed updating method over the Greedy IM algorithm, following the MIA model of influence cascade. Later we prove that the new seed nodes reported by our technique (Algorithm 2) are the same as the top-$k$ seed nodes found by Greedy on the updated graph with the MIA model, thereby maintaining the $(1 - 1/e)$ approximation guarantee to the optimal solution [6].

Approximation Algorithm

We present our proposed algorithm for updating the seed set in Algorithm 2. Consider Greedy (Algorithm 1) over the MIA model on the initial graph, and assume that we obtained the seed set , having cardinality . Since Greedy works in an iterative manner, let us denote by the seed set formed at the end of the -th iteration, whereas is the seed node added at the -th iteration. Clearly, , , and . Additionally, as given in Table 2, we use a priority queue , where its top node has the maximum marginal gain among all the non seed nodes.

After the update , we first compute the total infected region, TIR using Equation 18. Consider , of size , as the set of old seed nodes outside TIR, i.e., . Then, we remove old seed nodes inside TIR, and our next objective is to identify new seed nodes from the updated graph.

Note that inside , the seed nodes are still sorted in descending order of their marginal gains, computed at the time of insertion in the old seed set following the Greedy algorithm. In particular, we denote by the -th seed node in descending order inside , where . Due to Lemma 5, , for all nodes . Thus, for all , the following inequalities hold.

(19)
(20)

Now, after removing the old seed nodes present in TIR from the seed set, we compute the influence spread of every node and, we update these nodes in the priority queue , based on their new marginal gains . It can be verified that , for all , due to Lemma 6.

0:  Graph , total infected region TIR, old seed set , , old priority queue
0:  Compute the new seed set of size
1:  
2:  for all  do
3:     
4:  end for
5:  while TRUE do
6:      Greedy /* Starting with seed set , add new seed nodes via Greedy */ 
7:      Sort nodes in in Greedy inclusion order
8:     [top]
9:     if  then
10:        for all  do
11:           
12:           
13:        end for
14:     else
15:        Output
16:     end if
17:  end while
Algorithm 2    N-Family seeds updating method on top of Greedy

After updating the marginal gains of all the nodes in the priority queue (lines 1-4) as explained above, we proceed with greedy algorithm and find the new seed nodes, where . Let us denote by the new seed set (of size ) found in this manner (line 6). Now, we sort the seed nodes in in their appropriate inclusion order according to the Greedy algorithm over the updated graph (line 7). This can be efficiently achieved by running Greedy only over the seed nodes in , while computing their influence spreads and marginal gains in the updated graph. The sorted seed set is denoted by . Let us denote by the last (i.e., -th) seed node in , whereas represents the set of top- seed nodes in . We denote by the top-most seed node in the priority queue . If , we terminate our updating algorithm (line 15).

Iterative seed replacement. On the other hand, if , we remove the last seed node from . For every node in the 2-Family , we compute its marginal gain and update the priority queue (lines 10-11). Next, we compute one new seed node using Greedy and add it to , thereby updating the seed set . We also keep the nodes in sorted after every update to it. Now, we again verify the condition: if , where is the new top-most node in the priority queue , then we repeat the above steps, each time replacing the last seed node from with the top-most node from the updated priority queue . This iterative seed replacement phase terminates when . Clearly, this seed replacement can run for at most rounds, because in the worst case all old seed nodes in could get replaced by new seed nodes from TIR. Finally, we report as the new seed set.
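The re-sorting step (line 7 of Algorithm 2, repeated after each replacement) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `greedy_inclusion_order` and the toy coverage-based spread function are assumptions introduced here, and any monotone sub-modular spread oracle could be substituted.

```python
from typing import Callable, FrozenSet, Iterable, List, Tuple

# Hypothetical toy data: each node "reaches" a fixed set of users, so the
# spread of a seed set is the size of the union (a sub-modular function).
REACH = {"a": {1, 2, 3}, "b": {3, 4}, "c": {4, 5, 6, 7}}


def coverage_spread(nodes: FrozenSet[str]) -> float:
    """Toy stand-in for an influence-spread oracle."""
    covered = set()
    for node in nodes:
        covered |= REACH[node]
    return float(len(covered))


def greedy_inclusion_order(
    seeds: Iterable[str], spread: Callable[[FrozenSet[str]], float]
) -> List[Tuple[str, float]]:
    """Run Greedy restricted to `seeds`; return (seed, marginal gain) pairs
    in the order in which Greedy would include them."""
    remaining = set(seeds)
    chosen: set = set()
    current = spread(frozenset())
    ordered: List[Tuple[str, float]] = []
    while remaining:
        best, best_gain = None, float("-inf")
        for node in sorted(remaining):  # sorted for deterministic tie-breaking
            gain = spread(frozenset(chosen | {node})) - current
            if gain > best_gain:
                best, best_gain = node, gain
        ordered.append((best, best_gain))
        chosen.add(best)
        remaining.remove(best)
        current += best_gain
    return ordered


ORDER = greedy_inclusion_order(["a", "b", "c"], coverage_spread)
```

In this toy instance the last seed in the order has the smallest marginal gain, so it is the one that would be compared against the top of the priority queue.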

Theoretical Performance Guarantee

We show in the Appendix that the top-k seed nodes reported by our N-Family method are the same as the top-k seed nodes obtained by running Greedy on the updated graph under the MIA model. Since the Greedy algorithm provides an approximation guarantee of (1 - 1/e) under the MIA model [6], our N-Family method provides the same guarantee. All proofs are given in the Appendix.

Extending to Batch Updates

We now describe how our proposed N-Family algorithm can be extended to batch updates. We treat the difference in nodes and edges between two snapshots of the evolving network, taken at different times, as a set of batch updates. We consider only the net changes visible in the second snapshot and ignore intermediate ones. For example, if an edge is added and then deleted between the two snapshots, we do not count it as an update, because the graph is unchanged with respect to that edge after the final update.

One straightforward approach would be to apply our algorithm for every update sequentially. However, we develop a more efficient technique as follows. For a batch update consisting of individual updates, every update has its own TIR. The TIR of the batch update is the union of TIR, for all .

(21)

Once the TIR corresponding to a batch update is computed, we update the seed set using Algorithm 2. Processing all the updates in one batch is more efficient than processing them sequentially. For example, if a seed node is affected multiple times during sequential updates, we have to check every time whether it remains a seed node, whereas in a batch update we need to verify this only once.
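The batch procedure above can be illustrated with a short sketch. It assumes that each individual update already comes with its own TIR (computed elsewhere); the helper names `net_edge_updates`, `batch_tir`, and `seeds_to_recheck` are hypothetical and only show the bookkeeping, not the TIR computation itself.

```python
from typing import Iterable, List, Set, Tuple

Edge = Tuple[str, str]


def net_edge_updates(old_edges: Set[Edge], new_edges: Set[Edge]) -> Tuple[Set[Edge], Set[Edge]]:
    """Net difference between two snapshots. An edge that was added and then
    deleted in between appears in neither result set."""
    added = new_edges - old_edges
    deleted = old_edges - new_edges
    return added, deleted


def batch_tir(per_update_tirs: Iterable[Set[str]]) -> Set[str]:
    """TIR of a batch: the union of the TIRs of its individual updates."""
    region: Set[str] = set()
    for tir in per_update_tirs:
        region |= tir
    return region


def seeds_to_recheck(old_seeds: List[str], region: Set[str]) -> List[str]:
    """Old seeds inside the batch TIR; each is examined once per batch."""
    return [seed for seed in old_seeds if seed in region]


# Toy example: seed "c" lies in the TIR of both updates but is listed once.
ADDED, DELETED = net_edge_updates({("a", "b"), ("b", "c")}, {("a", "b"), ("c", "d")})
REGION = batch_tir([{"b", "c"}, {"c", "d"}])
AFFECTED = seeds_to_recheck(["a", "c"], REGION)
```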

3.3 Implementation with the IC model

Here we will show how we can develop efficient heuristics by extending the proposed N-Family approach to the IC model.

Computing TIR. For the IC model, one generally does not use any probability threshold to discard smaller influences; perhaps more importantly, finding the nodes whose influence spread changes by at least a given threshold (due to an update operation) is a #P-hard problem. Hence, computing TIR under the IC model is hard as well, and one can no longer ensure the theoretical performance guarantee of (1 - 1/e) obtained earlier. Instead, we estimate TIR in the same way as for the MIA model (discussed in Section 3.1.2). This generates high-quality results, as verified in our detailed empirical evaluation, since the maximum influence paths considered by the MIA model play a crucial role in influence cascades over real-world networks [6]. Note that, although we find TIR using the MIA model, we compute the influence spreads of the nodes using the IC model only.
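Since the TIR estimate relies on maximum influence paths, it may help to see how such paths are typically found. The sketch below is an illustration under stated assumptions, not the paper's TIR routine (Equation 18 is not reproduced here): it runs Dijkstra's algorithm on edge weights -log p, which is the standard way to obtain the most probable path, and keeps only nodes whose path probability is at least a threshold `theta`. The graph encoding and the function name `max_influence_reach` are hypothetical.

```python
import heapq
import math
from typing import Dict, List, Tuple

Graph = Dict[str, List[Tuple[str, float]]]  # node -> [(out-neighbour, probability)]


def max_influence_reach(graph: Graph, source: str, theta: float) -> Dict[str, float]:
    """Return {node: probability of the maximum influence path from source},
    restricted to nodes whose path probability is at least theta."""
    best: Dict[str, float] = {source: 1.0}
    heap: List[Tuple[float, str]] = [(0.0, source)]  # (-log probability, node)
    done = set()
    while heap:
        _, u = heapq.heappop(heap)
        if u in done:
            continue  # stale heap entry
        done.add(u)
        for v, p in graph.get(u, []):
            if p <= 0.0:
                continue
            prob = best[u] * p
            # Pruning is safe: probabilities only shrink along a path.
            if prob >= theta and prob > best.get(v, 0.0):
                best[v] = prob
                heapq.heappush(heap, (-math.log(prob), v))
    return best


TOY_GRAPH: Graph = {
    "a": [("b", 0.5), ("c", 0.1)],
    "b": [("c", 0.5)],
    "c": [("d", 0.1)],
}
REACHED = max_influence_reach(TOY_GRAPH, "a", theta=0.05)
```

Running the same routine on the reversed graph would give the nodes that influence the source, rather than those influenced by it.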

Updating the seed set. Our method in the IC model follows the same outline as Algorithm 2 for updating the seed set, with two major differences. In lines 3 and 11 of Algorithm 2, we compute the marginal gains and update the priority queue, but now we employ more efficient techniques based on the IM algorithm used for the purpose. Moreover, in the Appendix, we derive two efficient heuristic algorithms, namely Family-CELF (or F-CELF) and Family-RRS (or F-RRS), by employing our N-Family approach on top of two efficient IM algorithms, CELF [12] and RR sketch [4], respectively.

3.4 Implementation with the LT Model

The N-Family algorithm can be implemented on top of both Greedy and CELF, which also work with the linear threshold (LT) model of influence cascade [11, 12]. Hence, our algorithm can also be used with the LT model. We omit the details for brevity.

3.5 Heuristics to Improve Efficiency

We propose a more efficient heuristic method by carefully tuning the parameters of our N-Family algorithm (e.g., by limiting N in the TIR computation). Based on our experimental analysis with several evolving networks, we find that the influence spread changes significantly only for those nodes that are close to the update operation. A seed node that is far away from the update operation almost always remains a seed node in the updated graph, even though its influence spread (and its marginal gain) may change slightly. Hence, we further improve the efficiency of our N-Family algorithm by limiting N in the TIR computation. Indeed, the major difference in influence spreads between the new seed set and the old one comes from the seed nodes in the first two infected regions (i.e., 1-IR and 2-IR), as our experimental results confirm (Section 4.4).

4 Experimental Results

4.1 Experimental Setup

Datasets

We download four real-world, time-evolving graphs (Table 3) from the Koblenz Network Collection (http://konect.uni-koblenz.de/networks/).

(1) Digg. Digg is a communication network (http://digg.com/), and the dataset covers the period from 10-05-2002, 12:19 to 11-23-2015, 23:30. Every node is a user, and each directed edge represents a reply from the source user to the target user.

(2) Slashdot. Slashdot is also a communication network, on the technology website http://slashdot.org/, and the dataset was collected from 11-30-2005, 19:11 to 08-15-2006, 14:06. Each node is a user, and every directed edge is a response given by the source node to the target node.

(3) Epinions. This is a trust network of the online product rating site http://www.epinions.com/. Every node is a user, and a directed edge represents that one user indicates trust in the other. The dataset consists of responses from users from 01-09-2001, 23:00 to 08-11-2003, 22:00.

(4) Flickr. The Flickr dataset belongs to a social network on the website https://www.flickr.com/, and covers the period from 11-01-2006, 11:00 to 05-07-2007, 10:00. Every node denotes a user, and an edge between two users represents that they are friends.

All these graphs have directed edges with time-stamps; hence, we consider them as evolving networks. If an edge appears multiple times, we consider only its first appearance as its insertion time in the graph. The edge counts in Table 3 count distinct edges only.
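The de-duplication rule above (keep only the first appearance of each edge) can be sketched in a few lines. The stream format `(source, target, timestamp)` and the function name `first_appearances` are assumptions made for illustration; the actual dataset files may be laid out differently.

```python
from typing import Dict, Iterable, List, Tuple

TimedEdge = Tuple[str, str, int]  # (source, target, timestamp)


def first_appearances(stream: Iterable[TimedEdge]) -> List[TimedEdge]:
    """Keep each distinct directed edge once, with its earliest timestamp,
    and return the edges ordered by insertion time."""
    first: Dict[Tuple[str, str], int] = {}
    for u, v, t in stream:
        key = (u, v)
        if key not in first or t < first[key]:
            first[key] = t
    return sorted(((u, v, t) for (u, v), t in first.items()), key=lambda e: e[2])


# Toy stream: the edge a->b appears twice; b->a is a different directed edge.
TOY_STREAM = [("a", "b", 5), ("b", "c", 7), ("a", "b", 9), ("b", "a", 6)]
INSERTIONS = first_appearances(TOY_STREAM)
```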

Influence strength models

In the datasets downloaded from the Koblenz Network Collection, edges are not provided with their corresponding influence strengths. Following the bulk of the literature on influence maximization [11, 19, 17, 3], we adopt two widely-used edge probability models for our experiments. These are exactly the same settings used by our two competitors: UBI+ [19] and DIA [17].

Degree Weighted Activation (DWA) Model. In this model [11, 3, 17], the influence strength of an edge (u, v) is equal to 1/d_in(v), where d_in(v) is the in-degree of the target node v. This is also known as the weighted cascade model.

Trivalency (TV) Model. In this model [11, 17] (also known as the uniform activation model [3], or a slight variation of it), each edge is assigned a probability chosen uniformly at random from {0.1, 0.01, 0.001}.
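The two edge probability models can be sketched as follows. This is a minimal illustration assuming a list of distinct directed edges; the function names `dwa_probabilities` and `tv_probabilities` are hypothetical.

```python
import random
import statistics
from typing import Dict, List, Tuple

Edge = Tuple[str, str]
TV_VALUES = (0.1, 0.01, 0.001)


def dwa_probabilities(edges: List[Edge]) -> Dict[Edge, float]:
    """Weighted cascade: p(u, v) = 1 / in-degree(v)."""
    in_degree: Dict[str, int] = {}
    for _, v in edges:
        in_degree[v] = in_degree.get(v, 0) + 1
    return {(u, v): 1.0 / in_degree[v] for u, v in edges}


def tv_probabilities(edges: List[Edge], rng: random.Random) -> Dict[Edge, float]:
    """Trivalency: each edge draws uniformly from {0.1, 0.01, 0.001}."""
    return {edge: rng.choice(TV_VALUES) for edge in edges}


TOY_EDGES = [("a", "c"), ("b", "c"), ("a", "b")]
DWA = dwa_probabilities(TOY_EDGES)
TV = tv_probabilities(TOY_EDGES, random.Random(0))

# Mean and (population) standard deviation of the three TV values.
TV_MEAN = round(statistics.mean(TV_VALUES), 3)
TV_SD = round(statistics.pstdev(TV_VALUES), 3)
```

The mean and standard deviation of the three trivalency values come out as 0.037 and 0.045, which agrees with the TV entries reported in Table 3.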

Competing Algorithms

We compare the efficiency, memory usage, and the influence spread of the following methods.

FAMILY-CELF (F-CELF). This is an implementation of our proposed N-FAMILY framework, on top of the CELF influence maximization algorithm.

FAMILY-RR-Sketch (F-RRS). This is an implementation of our proposed N-FAMILY framework, on top of the RR-Sketch influence maximization algorithm.

DIA. The DIA algorithm was proposed in [17] on top of the RR-Sketch. The method generates all RR-sketches only once and, after every update, quickly modifies the existing sketches. After that, all seed nodes are identified from scratch using the modified sketches. This is the key difference from our algorithm F-RRS, since we generally need to identify only a limited number of new seed nodes, based on the region affected by the update.

UBI+. The UBI+ algorithm [19] performs greedy exchanges multiple times: each time, an old seed node is replaced with the best possible non-seed node. If one continues such exchanges until there is no improvement, the method guarantees a 0.5-approximation. However, for efficiency reasons, [19] limits the number of exchanges to k rounds, where k is the cardinality of the seed set. An upper bounding method [26] is used to find the best possible non-seed node at every round.

Dataset     #Nodes       #Edges        Edge prob., TV model: mean ± SD {quartiles}     Edge prob., DWA model: mean ± SD {quartiles}
Digg        30 398       85 247        0.037 ± 0.045   {0.001, 0.010, 0.100}           0.197 ± 0.252   {0.041, 0.100, 0.250}
Slashdot    51 083       130 370       0.037 ± 0.045   {0.001, 0.010, 0.100}           0.172 ± 0.276   {0.015, 0.048, 0.167}
Epinions    131 828      840 799       0.037 ± 0.045   {0.001, 0.010, 0.100}           0.100 ± 0.225   {0.004, 0.013, 0.063}
Flickr      2 302 925    33 140 017    0.037 ± 0.045   {0.001, 0.010, 0.100}           0.067 ± 0.200   {0.001, 0.003, 0.02}
Table 3: Properties of datasets

Among these methods, F-CELF and UBI+ are direct competitors, because both are built on MC-simulation based techniques. Likewise, F-RRS and DIA are direct competitors, because both are designed with sketches.

Figure 4: Run time to adjust seed set, IC model; seed sets are adjusted after every update. Panels: (a) edge addition, Digg (DWA); (b) edge deletion, Slashdot (TV); (c) node addition, Epinions (TV); (d) node deletion, Flickr (DWA).

Figure 5: Run time to adjust seed set, MIA model; seed sets are adjusted after every update. Panels: (a) node addition, Slashdot (DWA); (b) node deletion, Epinions (DWA).

Parameters setup

We vary the following parameters.

#Seed nodes. We vary the number of seed nodes from 10 to 100, while most of the experiments are performed considering 30 seeds.

#RR-Sketches. To vary the number of sketches, a parameter was introduced in [17]. The number of sketches is decided as , where |V| and |E| are the number of nodes and edges, respectively, in the influence graph. We vary from to ; while in most experiments, we consider . The influence spread saturates around , which was also observed in [17].

Size of family. The family size of a node is decided by the parameter , that is, the set of nodes that influence , or get influenced by with minimum probability through the maximum influence paths. We vary from to ; while in most experiments, we set , since it provides a good trade-off between accuracy and efficiency.

#IR to compute TIR. We vary the number of infected regions from 1-IR to 3-IR (in order to compute the total infected region, TIR). However, in most experiments, we consider up to 2-IR, since it provides a good trade-off between accuracy and efficiency.

Influence diffusion models. We employ the IC [11] and MIA [6] models for influence cascade. The bulk of our empirical results are provided with the IC model, since it is widely used in the literature.

#MC samples. We run 10 000 MC simulations to compute the influence spread of a seed set in the IC model [11].
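The MC estimate of influence spread under the IC model can be sketched as below. This is a generic illustration, assuming an adjacency-list graph with per-edge activation probabilities; the names `simulate_ic` and `estimate_spread` are hypothetical and the default of 10 000 runs simply mirrors the setting stated above.

```python
import random
from typing import Dict, Iterable, List, Tuple

Graph = Dict[str, List[Tuple[str, float]]]  # node -> [(out-neighbour, probability)]


def simulate_ic(graph: Graph, seeds: Iterable[str], rng: random.Random) -> int:
    """One independent-cascade run; returns the number of activated nodes."""
    active = set(seeds)
    frontier = list(active)
    while frontier:
        next_frontier = []
        for u in frontier:
            for v, p in graph.get(u, []):
                # Each newly active node gets a single chance per out-edge.
                if v not in active and rng.random() < p:
                    active.add(v)
                    next_frontier.append(v)
        frontier = next_frontier
    return len(active)


def estimate_spread(graph: Graph, seeds: Iterable[str], runs: int = 10_000, seed: int = 0) -> float:
    """Average cascade size over `runs` simulations."""
    rng = random.Random(seed)
    seeds = list(seeds)
    return sum(simulate_ic(graph, seeds, rng) for _ in range(runs)) / runs


# Deterministic toy graph: a->b always fires, b->c never fires.
SURE_GRAPH: Graph = {"a": [("b", 1.0)], "b": [("c", 0.0)]}
# Probabilistic toy graph: expected spread from "a" is 1 + 0.5.
COIN_GRAPH: Graph = {"a": [("b", 0.5)]}
```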

Figure 6: Influence spread; seed sets are adjusted after every update. Panels: (a) influence spread, edge addition, Digg (DWA), IC model; (b) influence spread, node deletion, Epinions (DWA), MIA model.

The code is implemented in Python, and the experiments are performed on a single core of a 256GB, 2.40GHz Xeon server. All results are averaged over 10 runs.

4.2 Single Update Results

First, we show results for single update queries related to edge addition, edge deletion, node addition, and node deletion. We note that adding an edge can also be considered as an increase in the edge probability from zero to a positive value. Analogously, deleting an edge can be regarded as a decrease in edge probability. Moreover, for the DWA edge influence model, when an edge is added or deleted, the probabilities of multiple adjacent edges are updated (since they are inversely proportional to the in-degree of the target node). Nevertheless, we show separate results for changes in edge probabilities in Section 4.3.

Our experiment settings are as follows.

Edge addition. We start with the initial 40% of the edges in the graph data, and then add all the remaining edges as dynamic updates. We demonstrate our results with the Digg dataset and the DWA edge influence model (Figure 4(a)).

Edge deletion. We delete the last 60% of the edges from the graph as update operations. We use the Slashdot dataset with the TV model to show our results (Figure 4(b)).

Node addition. We start with the first % of nodes and all their edges in the dataset. We then add the remaining nodes sequentially, along with their associated edges. We present our results over Epinions with the TV model (Figure 4(c)).

Node deletion. We delete the last % of nodes, with all their edges, from the graph. We use our largest dataset, Flickr, and the DWA model for demonstration (Figure 4(d)).

For the aforementioned update operations, we adjust the seed set after every update, since one does not know a priori when the seed set actually changes; this can be learnt only after updating the seed set.

Efficiency. In Figure 4, we present the running time to dynamically adjust the top-k seed nodes under the IC influence cascade model. We find that F-CELF and F-RRS are always faster than UBI+ and DIA, respectively, by 1 to 2 orders of magnitude. As an example, for node addition over Epinions in Figure 4(c), the time taken by F-CELF is only sec for about K node additions (i.e., 24.58 sec/node add). In comparison, UBI+ takes around sec (i.e., 1111.21 sec/node add). Our F-RRS algorithm requires about sec (i.e., 5.31 sec/node add), and DIA takes sec (i.e., 134.68 sec/node add). These results clearly demonstrate the efficiency improvements of our methods.

We also note that sketch-based methods are slower than their MC-simulation based counterparts (i.e., F-RRS vs. F-CELF, and DIA vs. UBI+) in smaller graphs (e.g., Digg and Slashdot). This is due to the overhead of updating sketches after graph updates. On the contrary, in our larger datasets, Epinions and Flickr, the benefit of sketches over MC-simulation based techniques is more evident. In fact, both F-CELF and UBI+ are very slow for our largest dataset, Flickr (see Table 1); hence, we only show F-RRS and DIA for Flickr in Figure 4(d).

Additionally, in Figure 5, we show the efficiency of our method under the MIA model of influence spread. Since it is non-trivial to adapt UBI+ and DIA to the MIA model, we compare our algorithm F-CELF with CELF [12] in these experiments. For demonstration, we consider Slashdot and Epinions, together with node addition and deletion, respectively. It can be observed from Figure 5 that F-CELF is about 2 orders of magnitude faster than CELF. These results illustrate the generality and effectiveness of our approach under different influence cascade models.

Influence spread. We report the influence spread with the updated seed set for both the IC (Figure 6(a)) and MIA (Figure 6(b)) models. It can be observed that the competing algorithms, i.e., F-CELF, F-RRS, UBI+, and DIA, achieve similar influence spreads with their updated seed nodes. Furthermore, we show by INITIAL the influence spread obtained by the old seed set in the modified graph. We find that INITIAL achieves significantly less influence spread, especially with more graph updates. These results demonstrate the usefulness of dynamic IM techniques in general, and also the effectiveness of our algorithm in terms of influence coverage.

Memory usage. We show the memory used by all algorithms in Table 4. We find that the MC-sampling based algorithms (i.e., F-CELF and UBI+) require similar amounts of memory, and the two sketch-based techniques (i.e., F-RRS and DIA) also have comparable memory usage. Our results illustrate that the proposed methods, F-CELF and F-RRS, improve the updating time of the top-k influencers by 1 to 2 orders of magnitude compared to state-of-the-art algorithms, while ensuring similar memory usage and influence spreads.

Algorithms Digg Slashdot Epinions Flickr
F-CELF 0.223 GB 0.316 GB 1.032 GB 31.548 GB
UBI+ 0.241 GB 0.353 GB 1.212 GB 35.202 GB
F-RRS 3.829 GB 5.885 GB 25.873 GB 142.893 GB
DIA 3.822 GB 5.872 GB 25.839 GB 142.327 GB
Table 4: Memory consumed by different algorithms

4.3 Batch Update Results

Real batch updates. We demonstrate real batch updates with a sliding window model. In this model, initially we consider the edges present in between to units of time (length of window). We compute the seed set with the edges present in that window. Next, we slide the window to units of time. The edges present in between and are considered as the updated data, and our goal is to find the seed set based on the updated data. We delete the edges from to and add the edges from to . We continue sliding the window until we have processed the whole dataset.
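The sliding-window bookkeeping can be sketched as follows. The parameter names `start`, `length`, and `step`, the half-open interval convention, and the function names are assumptions made for this illustration; they only show which edges leave and enter the graph when the window slides.

```python
from typing import Iterable, Set, Tuple

TimedEdge = Tuple[str, str, int]  # (source, target, timestamp)
Edge = Tuple[str, str]


def window_edges(stream: Iterable[TimedEdge], start: int, length: int) -> Set[Edge]:
    """Distinct edges with a timestamp in [start, start + length)."""
    return {(u, v) for u, v, t in stream if start <= t < start + length}


def slide(stream: Iterable[TimedEdge], start: int, length: int, step: int) -> Tuple[Set[Edge], Set[Edge]]:
    """Return (edges to add, edges to delete) when the window moves by `step`."""
    stream = list(stream)
    old = window_edges(stream, start, length)
    new = window_edges(stream, start + step, length)
    return new - old, old - new


# Toy stream: a->b expires at t=0 but reappears at t=6, so it is not deleted.
TOY_STREAM = [("a", "b", 0), ("b", "c", 3), ("c", "d", 5), ("a", "b", 6), ("d", "e", 11)]
TO_ADD, TO_DELETE = slide(TOY_STREAM, start=0, length=10, step=5)
```

The resulting pair of edge sets forms one batch update, which could then be handed to the batch procedure of Section 3.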

We conducted this experiment using the Twitter dataset downloaded from https://snap.stanford.edu/data/. The dataset is extracted from the tweets posted between 01-JUL-2012 and 07-JUL-2012, during the announcement of the Higgs boson particle. This dataset contains nodes and edges. Probability of an edge is given by the formula , where is the total number of edges appeared in the window, and is a constant. We present our experimental results by varying from 30 mins to 6 hrs and from 1 sec to 2 mins. We set the value of as . On average, updates appear per second. Since the number of edges in a window is small and F-CELF performs much better on smaller datasets, we do not show results with F-RRS. From the experimental results in Figure 7, we find that F-CELF is faster than both UBI+ and DIA by up to three orders of magnitude.

Figure 7: Impacts of varying batch sizes, sliding window model, Twitter, IC model; seed sets are adjusted after every slide. Panels: (a) run time to adjust seed set, varying slide length, window length = 1 hour; (b) run time to adjust seed set, varying window length, slide length = 60 secs.

Figure 8: Impacts of batch edge probability updates, Flickr (DWA), IC model; seed sets are adjusted after batch updates. Panels: (a) run time to adjust seed set with increasing edge probability; (b) run time to adjust seed set with decreasing edge probability.

Figure 9: Impacts of the family-size parameter, node deletion, Epinions (DWA), IC model. Panels: (a) influence spread; (b) run time to adjust seed set.

Figure 10: Impacts of #IRs, node deletion, Epinions (DWA), IC model. Panels: (a) influence spread; (b) run time to adjust seed set.

Figure 11: Impacts of varying #seeds and the #RR-sketches parameter, IC model. Panels: (a) run time to adjust seed set, node addition, Epinions; (b) influence spread with varying #RR-sketches parameter, Digg.

Synthetic batch updates. We present the efficiency of our algorithms for varying batch sizes with changes in edge probabilities in Figure 8, by making synthetic local updates. We use our largest dataset, Flickr, for demonstration. We select nodes uniformly at random and, for every node, change the probabilities of % of the edges within its -hops. We vary the number of hops from 1 to 4 (batch size = 15, 1.8K, 20K, 454K edges for 1, 2, 3, 4 hops, resp.). For an increase in edge probability (Figure 8(a)), we increase the probability of the selected edges by %; for a decrease in edge probability (Figure 8(b)), we reduce them by %. Due to the larger size of Flickr, as earlier, we compare the run times of F-RRS and DIA in these experiments. It can be observed that F-RRS is about times faster than DIA for batch updates within -hop, and about times faster for batch updates within -hops. For batch updates beyond -hops, F-RRS is still faster, but as more seed nodes get affected, more time is necessary for re-adjusting the seed set. These results show that our algorithms are very efficient in handling localized batch updates.

4.4 Sensitivity Analysis

In these experiments, we vary the parameters of our algorithms. For demonstration, we update the last 40 nodes in a dataset, and report the average time taken to re-adjust the seed set per update operation, with the F-RRS algorithm.

Varying . Since the family size increases with smaller , we vary it from to . The average family sizes for = , , and are around , , and nodes, respectively (Epinions). We observe that by selecting , influence spread increases by around % compared to that of , and there is no significant increase in influence spread for even smaller . However, the efficiency of the algorithm decreases almost linearly with decrease in (Figure 9(b)), because more seed nodes fall in TIR as the family size increases. Hence, we select as a good trade-off between quality and efficiency.

Varying IRs. We vary the number of IRs from 1 to 3 to compute TIR, and analyze the performance in terms of efficiency and influence spread in Figure 10. We find that the run time to adjust the seed set increases with the number of IRs, although the influence spread almost saturates at IR=2. With more IRs, the number of seed nodes that fall in TIR increases. However, as reasoned in Section 3.5, a seed node that is far away from the update operation almost always remains a seed node in the updated graph, even though its influence spread (and its marginal gain) may change slightly. Hence, considering the trade-off between efficiency and influence coverage, we select 2-IR to compute TIR.

Varying seed set size. In Figure 11(a), we show the efficiency with seed set sizes varying from 10 to 100. It can be observed that even for the largest seed set, F-RRS is faster than DIA by more than an order of magnitude. This demonstrates that our technique is scalable for large seed set sizes.

Varying . For sketch-based methods, choosing the optimal is very important. In Figure 11(b), we show the influence coverage of F-RRS with varying from to . We compare the influence spread with that of CELF. We find that with increase in , influence coverage initially increases, and gets saturated at . Hence, we set in our experiments, which is also the same value observed in DIA [17].

5 Related Work

Influence Maximization in Static Networks. Kempe et al. [11] addressed the problem of influence maximization in a social network as a discrete optimization problem: identify the set of seed nodes, of cardinality k, that maximizes the expected influence spread in the graph. They proved that the problem is NP-hard, and proposed a hill-climbing greedy algorithm with an accuracy guarantee of (1 - 1/e), due to sub-modularity. They used Monte Carlo (MC) simulation for computing the expected influence spread from a seed set. Later, it was proved in [6] that the exact computation of influence spread is #P-hard.

Since the introduction of the influence maximization problem, many algorithms (see [5] for details), both heuristic and approximate, have been developed to improve the efficiency of the original greedy method. Below, we survey the methods that provide theoretical performance guarantees. Leskovec et al. [12] and Goyal et al. [10] exploited the sub-modularity property of the influence function, and proposed the more efficient CELF and CELF++ algorithms, respectively. Chen et al. [6] avoided MC simulations, and developed the maximum influence arborescence (MIA) model using maximum probable paths for the influence spread computation. Addressing the inefficiency of MC simulations, Borgs et al. [4] introduced a reverse reachable sketching technique (RRS) without sacrificing the accuracy guarantee. Tang et al. [22, 23] proposed the TIM/TIM+ and IMM algorithms, and Li et al. [13] designed indexing methods, all based on the RRS technique, to further improve its efficiency. However, the aforementioned algorithms aim at identifying the top-k seed nodes in a static network. With every update in the social influence graph, it is costly to apply these methods and find the new top-k seed nodes from scratch [19, 17].

Influence Maximization in Dynamic Networks. In recent years, there has been interest in performing influence analysis in dynamic graphs [1, 19, 27, 14, 15, 17, 24]. The work in [1] was the first to propose methods that maximize the influence over a specific interval in time; however, it was not designed for the online setting. The work in [27] probed a subset of the nodes for detecting the underlying changes. Liu et al. [14] considered an evolving network model (e.g., preferential attachment) for influence maximization. A probabilistic edge decay model was considered for analyzing various graph properties in [25]. Subbian et al. [21, 20] also discussed the problem of finding influencers in social streams, although these works employed frequent pattern mining techniques over the underlying social stream of content. This is a different modeling assumption from the dynamic graph setting considered in this work. Recently, Wang et al. [24] considered a sliding window model to find influencers based on the most recent interactions. Once again, their framework is philosophically different from the classical influence maximization setting [11], as they do not consider any edge probabilities; hence, their work is not directly comparable to ours.

With regard to problem formulation, the recent works [19, 17] are the closest to ours. UBI+ [19] updates the influencers over different snapshots of graphs, whereas DIA [17] adjusts the structure of the reverse reachable sketch (RRS) index with every node and edge modification. However, unlike our effective local updates, these methods could be inefficient in an online setting. In particular, UBI+ performs greedy exchanges multiple times; each time, an old seed node is replaced with the best possible non-seed node. The method is generally two orders of magnitude slower than ours. DIA is more efficient than UBI+ in larger graphs, but that is due to the usage of faster RR sketches. The RR sketches can be updated incrementally with graph changes; however, DIA still needs to find all new top-k seeds from scratch after the RR sketches are modified. In contrast, we may not need to compute all seed nodes, even with updated RR sketches, thereby improving the efficiency by an order of magnitude compared to DIA.

Moreover, UBI+, along with its upper bounding method, was designed for MC-simulation based algorithms and IC model. DIA works only with RR sketches and IC model. It is non-trivial to adapt them for other influence models and algorithms. A drawback of this is as follows. Sketch based methods (e.g., DIA) consume higher memory for storing multiple sketches. In contrast, MC-simulation based methods (e.g., UBI+) are slower over large graphs. On the other hand, our proposed N-Family approach can be employed over many IM models and algorithms, and due to the local updating principle, it significantly improves the efficiency under all scenarios. Therefore, one can select the underlying IM models and algorithms for the N-Family approach based on system specifications and application requirements. This demonstrates the generality of our solution.

6 Conclusions

We developed a generalized, local updating framework for efficiently adjusting the top-k influencers in an evolving network. Our method iteratively identifies only the seed nodes affected by dynamic updates in the influence graph, and then replaces them with more suitable ones. Our solution can be applied to a variety of information propagation models and influence maximization techniques. Our algorithm, N-Family, ensures a (1 - 1/e) approximation guarantee with the MIA influence cascade model, and works well for localized batch updates. Based on a detailed empirical analysis over several real-world, dynamic, and large-scale networks, N-Family improves the updating time of the top-k influencers by 1 to 2 orders of magnitude compared to state-of-the-art algorithms, while ensuring similar memory usage and influence spreads.

Appendix A N-FAMILY approach for IM Algorithms in IC model

We discuss how to adapt the N-Family approach to efficient static IM algorithms in the IC model, e.g., CELF and Reverse Reachable Sketch. First, we explain static IM algorithms briefly, and then we introduce the methods to adapt them to a dynamic setting.

A.1 CELF

In the Greedy algorithm discussed in Section 2.2, the marginal influence gains of all remaining nodes need to be recalculated at every round, which makes it very inefficient (see Line 3, Algorithm 1). Utilizing the lazy forward optimization technique, Leskovec et al. [12] proposed the CELF algorithm. Due to the sub-modularity property of the influence function, the marginal gain of a node in the present iteration cannot be more than that of the previous iteration. Therefore, CELF maintains a priority queue containing the nodes and their marginal gains in descending order. It associates a flag variable with every node, which stores the iteration number in which the marginal gain for that node was last computed. In the beginning, the (individual) influence spreads of all nodes are calculated and added to the priority queue, and the flag values of all nodes are initialized to zero. In the first iteration, the top node in the priority queue is removed, since it has the maximum influence spread, and is added to the seed set. In each subsequent iteration, the algorithm takes the first element from the priority queue and checks its flag. If the marginal gain of the node was calculated in the current iteration, then it is selected as the next seed node; otherwise, the algorithm computes the marginal gain of the node, updates its flag, and re-inserts the node in the priority queue. This process repeats until k seed nodes are identified.
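The lazy forward procedure described above can be sketched with a binary heap. This is a generic illustration of CELF under stated assumptions, not the code used in the experiments: the spread oracle is a toy coverage function, and the names `celf`, `coverage_spread`, and `REACH` are hypothetical.

```python
import heapq
from typing import Callable, FrozenSet, Iterable, List

# Hypothetical toy data: the spread of a seed set is the size of the union
# of the users reached by its members (a sub-modular function).
REACH = {"a": {1, 2, 3}, "b": {3, 4}, "c": {4, 5, 6, 7}, "d": {1, 2}}


def coverage_spread(nodes: FrozenSet[str]) -> float:
    covered = set()
    for node in nodes:
        covered |= REACH[node]
    return float(len(covered))


def celf(nodes: Iterable[str], spread: Callable[[FrozenSet[str]], float], k: int) -> List[str]:
    """Lazy greedy selection of k seeds."""
    base = spread(frozenset())
    # Heap entries: (-gain, node, flag); flag is the iteration (number of
    # seeds already chosen) in which the gain was last computed.
    heap = [(-(spread(frozenset([v])) - base), v, 0) for v in nodes]
    heapq.heapify(heap)
    seeds: List[str] = []
    current = base
    while heap and len(seeds) < k:
        neg_gain, v, flag = heapq.heappop(heap)
        if flag == len(seeds):
            # Gain is up to date for the current seed set: select the node.
            seeds.append(v)
            current += -neg_gain
        else:
            # Stale gain: recompute, refresh the flag, and re-insert.
            gain = spread(frozenset(seeds + [v])) - current
            heapq.heappush(heap, (-gain, v, len(seeds)))
    return seeds


SEEDS = celf(REACH, coverage_spread, k=2)
```

In this toy run only one node needs its gain recomputed in the second iteration, which is the saving that lazy evaluation is meant to provide.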

FAMILY-CELF. We refer to the N-Family algorithm over CELF as FAMILY-CELF (or, F-CELF). In particular, we employ MC-sampling to compute marginal gains in lines 3 and 11 of Algorithm 2, and then update the priority queue. Given a node and the current seed set , the corresponding marginal gain can be derived with two influence spread computations, i.e., . However, thanks to the lazy forward optimization technique in CELF, one may insert any upper bound of the marginal gain in the priority queue. The actual marginal gain needs to be computed only when that node is at the top of the priority queue at a later time. Therefore, we only compute the influence spread of , i.e., , which is an upper bound to its marginal gain, and insert this upper bound in the priority queue.

A.2 Reverse Reachable (RR) Sketch

In this method, first proposed by Borgs et al. [4] and later improved by Tang et al. [22, 23], subgraphs are repeatedly constructed and stored as sketches in index . For each subgraph , an arbitrary node , selected uniformly at random, is considered as the target node. Using a reverse Breadth First Search (BFS) traversal, it finds all nodes that influence through active edges. An activation function is selected uniformly at random, and for each edge , if , then it is considered active. The subgraph consists of all nodes that can influence via these active edges. Each sketch is a tuple containing . This process halts when the total number of edges examined exceeds a pre-defined threshold , where is an error function associated with the desired quality guarantee . The intuition is that if a node appears in a large number of subgraphs, then it should have a high probability to activate many nodes, and therefore, it would be a good candidate for a seed node. Once a sufficient number of sketches is created as above, a greedy algorithm repeatedly identifies the node present in the largest number of sketches, adds it to the seed set, and removes the sketches containing it. This process continues until k seed nodes are found.
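The sketch construction and greedy selection described above can be illustrated as follows. This is a simplified sketch under stated assumptions: it generates a fixed number of sketches instead of using the edge-count threshold described in the text, samples edge activations lazily during the reverse BFS, and stores only the node set of each sketch. The names `rr_set`, `build_sketches`, and `greedy_seeds` are hypothetical.

```python
import random
from collections import deque
from typing import Dict, List, Set, Tuple

WeightedEdge = Tuple[str, str, float]  # (source, target, probability)


def reverse_adjacency(edges: List[WeightedEdge]) -> Dict[str, List[Tuple[str, float]]]:
    rev: Dict[str, List[Tuple[str, float]]] = {}
    for u, v, p in edges:
        rev.setdefault(v, []).append((u, p))
    return rev


def rr_set(rev: Dict[str, List[Tuple[str, float]]], target: str, rng: random.Random) -> Set[str]:
    """Reverse BFS from `target`, following each in-edge with its probability."""
    visited = {target}
    queue = deque([target])
    while queue:
        v = queue.popleft()
        for u, p in rev.get(v, []):
            if u not in visited and rng.random() < p:
                visited.add(u)
                queue.append(u)
    return visited


def build_sketches(edges: List[WeightedEdge], nodes: List[str], count: int, rng: random.Random) -> List[Set[str]]:
    # Simplification: a fixed number of sketches rather than an edge budget.
    rev = reverse_adjacency(edges)
    return [rr_set(rev, rng.choice(nodes), rng) for _ in range(count)]


def greedy_seeds(sketches: List[Set[str]], k: int) -> List[str]:
    """Repeatedly pick the node covering the most remaining sketches."""
    remaining = list(sketches)
    seeds: List[str] = []
    for _ in range(k):
        counts: Dict[str, int] = {}
        for sketch in remaining:
            for node in sketch:
                counts[node] = counts.get(node, 0) + 1
        if not counts:
            break
        best = max(sorted(counts), key=counts.get)  # sorted: stable tie-breaking
        seeds.append(best)
        remaining = [s for s in remaining if best not in s]
    return seeds


# Deterministic toy graph: a->b->c always fire, d->c never fires.
TOY_EDGES = [("a", "b", 1.0), ("b", "c", 1.0), ("d", "c", 0.0)]
SKETCHES = build_sketches(TOY_EDGES, ["a", "b", "c", "d"], count=200, rng=random.Random(7))
```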

FAMILY-RRS. We denote the N-Family algorithm over RR-Sketch as FAMILY-RRS (or, F-RRS). The RRS technique greedily identifies the node present in the largest number of sketches, adds it to the seed set, and deletes the sketches containing it. This process continues until k seed nodes are identified. In our F-RRS algorithm, instead of deleting sketches as above, we remove them from , and store them in another index , since these removed sketches could be used later in our seeds updating procedure.

Let be the set of sketches with . Similarly, represents the set of all sketches with . Furthermore, (similarly ) denotes (similarly ) after the seed set is identified. Clearly, the sketches in will not have any seed node in their subgraphs. Also note that is proportional to , by following the RRS technique.

After an update operation, we need to modify the sketches (both in and ), and possibly swap some sketches between these two indexes, as discussed next.

Modifying sketches after dynamic updates. In the following, we only discuss sketch updating techniques corresponding to an edge addition. Sketch updating methods for other updates (e.g., node addition, edge deletion, etc.) are similar [17], and we omit them for brevity. To this end, we present three operations:

Expanding sketches: Assume that we added a new edge . We examine every sketch both in and , and add every new node that can reach through active edges in . We compute these new nodes using a reverse breadth first search from . In this process, the initial subgraph is extended to .

Next, we need to update and in such a way that sketches in do not have a seed node in their (extended) subgraphs. For every sketch , if , we then remove from , and add it to .

Deleting sketches: If the combined weight of the indexes, excluding the last sketch, exceeds the threshold (), we delete the last sketch from the index to which it belongs (i.e., either from or ).

Adding sketches: If the combined weight of the indexes is less than the threshold , we select a target node uniformly at random, and construct a new sketch . If , we add the new sketch to , otherwise to .
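The expansion and re-indexing steps for an edge addition can be sketched as below. The block covers only those two steps and leaves out the weight-based deletion and addition of sketches. All names here (`Sketch`, `coins`, `expand`, `on_edge_added`, `uncovered`, `covered`) are hypothetical, and the activation function is assumed to be stored as lazily sampled uniform draws per edge, with an edge counted as active when its draw is below its probability.

```python
import random
from collections import deque
from dataclasses import dataclass, field

@dataclass
class Sketch:
    target: int
    nodes: set
    coins: dict = field(default_factory=dict)  # edge -> uniform draw, sampled lazily

def is_active(sketch, edge, prob, rng):
    """An edge is active in a sketch if its stored uniform draw is below its probability."""
    if edge not in sketch.coins:
        sketch.coins[edge] = rng.random()
    return sketch.coins[edge] < prob

def expand(sketch, new_edge, in_edges, rng):
    """Extend a sketch after edge (u, v) is added; in_edges must already contain it."""
    u, v = new_edge
    prob = dict(in_edges.get(v, []))[u]
    if v not in sketch.nodes or u in sketch.nodes:
        return
    if not is_active(sketch, new_edge, prob, rng):
        return
    # Reverse BFS from u collects every new node that reaches u via active edges.
    sketch.nodes.add(u)
    queue = deque([u])
    while queue:
        y = queue.popleft()
        for x, p in in_edges.get(y, []):
            if x not in sketch.nodes and is_active(sketch, (x, y), p, rng):
                sketch.nodes.add(x)
                queue.append(x)

def on_edge_added(new_edge, in_edges, uncovered, covered, seeds, rng):
    """Expand all sketches, then move sketches that now contain a seed out of `uncovered`."""
    for sketch in uncovered + covered:
        expand(sketch, new_edge, in_edges, rng)
    still_uncovered = []
    for sketch in uncovered:
        (covered if sketch.nodes & seeds else still_uncovered).append(sketch)
    uncovered[:] = still_uncovered
```

Only sketches that already contain the head of the new edge can change, so the reverse BFS touches a small part of the graph in most cases.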

Sketch swapping for computing marginal gains. Assume that we computed TIR, , and . For every infected old seed node , we identify all sketches with , that are present in . Then, we perform the following sketch swapping to ensure that all infected seed nodes are removed from the old seed set.

  • If there is no uninfected seed node in (i.e., ), where , we move from to .

  • If there is an uninfected seed node in , where , we keep in .

Finally, we identify new seed nodes using updated . Marginal gain computation at line 11 (Algorithm 2) follows a similar sketch replacement method, and we omit the details for brevity.
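The swapping rule above can be sketched as follows. The sketch reads the rule as moving a sketch from the index of removed (seed-covered) sketches back to the index of seed-free sketches once none of its seeds is kept; this direction is an interpretation, and the names `uncovered`, `covered`, `old_seeds`, and `infected` are hypothetical. Sketches are reduced to node sets.

```python
def swap_sketches(uncovered, covered, old_seeds, infected):
    """Return kept seeds after moving sketches whose seeds are all infected back to `uncovered`."""
    kept_seeds = old_seeds - infected
    still_covered = []
    for nodes in covered:
        if nodes & kept_seeds:
            # At least one uninfected seed remains in the sketch: keep it where it is.
            still_covered.append(nodes)
        else:
            # Every seed in this sketch is infected: it counts as seed-free again.
            uncovered.append(nodes)
    covered[:] = still_covered
    return kept_seeds
```

Because every sketch in `covered` contains at least one old seed, a sketch with no kept seed necessarily contains an infected one, so a single pass over `covered` suffices. New seeds can then be selected greedily from `uncovered`.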

Appendix B Proof of Performance Guarantee

We show that the top-$k$ seed nodes reported by our N-Family method (Algorithm 2) are the same as the top-$k$ seed nodes obtained by running the Greedy algorithm on the updated graph under the MIA model. Since the Greedy algorithm provides an approximation guarantee of $1 - 1/e$, our N-Family method also provides the same approximation guarantee. The proof is as follows.

As described in Section 3.2.1, after identifying the TIR using Equation 18, we compute (=), influence spreads of all nodes , and update the priority queue.

Now, we continue with computing the new seed nodes over the updated graph, and is the new seed set (of size $k$) found in this manner. Note that before we begin computing new seed nodes, contains the seed nodes present in , and then new nodes are added in an iterative manner. Clearly, is the same as . We consider as the seed node computed by Greedy in the iteration, where . By the Greedy algorithm,

(22)

Next, we sort all seeds in according to the greedy inclusion order, and the sorted seed set is denoted as . Note that the seed nodes present in and are the same, but their order could be different. At this stage, the important observations are as follows.

After computing , and assuming the top-most node in the priority queue, we will have two mutually exclusive cases:

Case 1:
Case 2:

If we end up with Case-1, we terminate our algorithm and report as the set of new seed nodes, which would be the same as the ones computed by the Greedy algorithm on the updated graph (we shall prove this soon). However, if we arrive at Case-2, we do iterative seed replacements until we achieve Case-1 (we prove that by iterative seed replacements for at most times