IMRank: Influence Maximization via Finding Self-Consistent Ranking

Abstract

Influence maximization, fundamental for word-of-mouth marketing and viral marketing, aims to find a set of seed nodes that maximizes influence spread on a social network. Early methods mainly fall into two paradigms, each with benefits and drawbacks: (1) greedy algorithms, which select seed nodes one by one, give guaranteed accuracy but rely on an accurate approximation of influence spread that has high computational cost; (2) heuristic algorithms, which estimate influence spread using efficient heuristics, have low computational cost but unstable accuracy.

We first point out that greedy algorithms essentially find a self-consistent ranking, where nodes’ ranks are consistent with their ranking-based marginal influence spread. This insight motivates us to develop an iterative ranking framework, IMRank, to efficiently solve the influence maximization problem under the independent cascade model. Starting from an initial ranking, e.g., one obtained from an efficient heuristic algorithm, IMRank finds a self-consistent ranking by iteratively reordering nodes according to their ranking-based marginal influence spread computed with respect to the current ranking. We also prove that IMRank converges to a self-consistent ranking from any initial ranking. Furthermore, within this framework, a last-to-first allocating strategy and a generalization of this strategy are proposed to improve the efficiency of estimating ranking-based marginal influence spread for a given ranking. In this way, IMRank achieves both remarkable efficiency and high accuracy by simultaneously leveraging the benefits of greedy algorithms and heuristic algorithms. As demonstrated by extensive experiments on large-scale real-world social networks, IMRank always achieves high accuracy comparable to greedy algorithms with dramatically reduced computational cost, running even faster than other scalable heuristics.

influence maximization, social network analysis, viral marketing, iterative method

Categories and Subject Descriptors: F.2.2 [Analysis of Algorithms and Problem Complexity]: Non-numerical Algorithms and Problems; D.2.8 [Software Engineering]: Metrics (complexity measures, performance measures)

General Terms: Algorithms, Experiments, Performance

1 Introduction

The prosperity of online social networks and social media has prompted a new wave of research on social influence analysis [20, 9]. Finding influential individuals is important for many applications such as expert finding, online advertising and marketing. Influence maximization is therefore identified as a fundamental problem for word-of-mouth marketing and viral marketing in the area of online marketing. It aims to find a fixed-size set of seed nodes in a social network that maximizes their influence spread, i.e., the expected number of activated nodes triggered by the seed nodes. Ever since being formalized by Kempe et al. [12], the influence maximization problem has attracted much research attention from various fields, including social network analysis, data mining and marketing.

Early methods for influence maximization mainly use the greedy framework, selecting one by one the node with the largest marginal influence spread. When influence spread is calculated accurately, the greedy framework is proved to provide a $(1-1/e)$ approximation to the optimal solution of influence maximization [12], guaranteed by the submodularity and monotonicity of influence spread as a function of the seed node set. Existing methods roughly fall into two paradigms: greedy algorithms [12, 14, 4, 8, 6] and heuristic algorithms [13, 3, 22, 11]. Greedy algorithms provide a $(1-1/e-\epsilon)$ approximation by approximating influence spread through Monte Carlo simulation. However, they have high computational cost because calculating marginal influence spread requires estimating the influence spread of nodes from scratch, using time-consuming Monte Carlo simulation. Heuristic algorithms, in contrast, estimate influence spread via efficient heuristic methods. Their scalability generally exceeds that of greedy algorithms by several orders of magnitude. Yet this scalability comes at the price of unguaranteed accuracy and unreliable performance across scenarios. To the best of our knowledge, an efficient and accurate algorithm for influence maximization on large-scale real-world social networks is still lacking.

In this paper, we propose an efficient and accurate algorithm to solve the influence maximization problem under the independent cascade model. This algorithm is motivated by the key insight that greedy algorithms essentially find a self-consistent ranking, where nodes’ ranks are consistent with their ranking-based marginal influence spread. We prove that such a self-consistent ranking can be obtained directly using an iterative ranking framework, IMRank, proposed in this paper. Starting from an initial ranking, e.g., one obtained from an efficient heuristic algorithm, IMRank efficiently finds a self-consistent ranking by iteratively reordering nodes according to their ranking-based marginal influence spread computed with respect to the current ranking. Unlike greedy algorithms, which compute ranking-based marginal influence spread from scratch, IMRank computes it via an efficient last-to-first allocating strategy. As a result, IMRank achieves both high efficiency and high accuracy by simultaneously leveraging the benefits of greedy algorithms and heuristic algorithms.

To evaluate the performance of IMRank, we conduct extensive experiments on large-scale social networks with hundreds of thousands of edges to millions of edges. Experimental results demonstrate that IMRank achieves high accuracy comparable to greedy algorithms with computational cost reduced dramatically.

Our main contributions are summarized as follows:

  • We propose a novel framework IMRank, which unifies the estimation of marginal influence spread and the selection of seed nodes. IMRank achieves both remarkable efficiency and high accuracy by exploiting the interplay between the calculation of ranking-based marginal influence spread and the ranking of nodes.

  • We prove that IMRank, starting from any initial ranking, definitely converges to a self-consistent ranking in a finite number of steps. This indicates that IMRank is efficient at solving the influence maximization problem via finding the final self-consistent ranking.

  • We design an efficient last-to-first allocating strategy to approximately estimate the ranking-based marginal influence spread of nodes for a given ranking, further improving the efficiency of IMRank.

  • We conduct extensive experiments on several real-world networks under different types of the independent cascade model. By comparing two instances of IMRank with both the greedy algorithm and existing state-of-the-art heuristics, we show that IMRank always achieves accuracy comparable to the greedy algorithm while running faster than other heuristics and with better accuracy.

2 Related Work

Notation Description
a node with index
the index of node with rank with respect to a given ranking
a set of nodes
expected number of nodes eventually activated by set
marginal influence spread by adding node into a seed set
short for , where is the set of nodes ranked higher than in a given ranking
probability that is activated given that a collection of nodes are already activated
influence score that node sends to node with respect to a given ranking
a simple path starting from and ending at , i.e.,
influence path, which is a simple path where is the only node ranked higher than on the path
probability that is activated by through all influence paths, with respect to a given ranking
maximal length of all influence paths to account into
Table 1: Notations.

The influence maximization problem was first studied from an algorithmic perspective by Domingos and Richardson [7, 18]. Kempe et al. then formulated it as a combinatorial optimization problem of finding a set of seed nodes with maximum influence spread [12]. They proved that this problem is NP-hard and proposed a greedy algorithm that guarantees a $(1-1/e-\epsilon)$ approximation ratio. Here, $\epsilon$ is caused by the inaccurate estimation of influence spread [3, 5]. The biggest problem of Kempe’s greedy algorithm is its low scalability, which limits it to social networks of small or moderate size.

Many efforts have been made to improve the scalability of Kempe’s greedy algorithm for influence maximization. The “cost-effective lazy forward” (CELF) optimization strategy [14] and CELF++ [8] were proposed to reduce the number of influence spread estimations in Kempe’s greedy algorithm by exploiting the submodularity of the influence spread function. To reduce the number of Monte Carlo simulations, Chen et al. proposed the NewGreedy and MixedGreedy algorithms in [4]. The NewGreedy algorithm reuses the results of Monte Carlo simulations in the same iteration to calculate marginal influence spread for all candidate nodes. Yet it increases the computational cost of a single Monte Carlo simulation because the simulation is now conducted globally rather than locally as in Kempe’s greedy algorithm. As a remedy, the MixedGreedy algorithm was developed, integrating the CELF strategy into the NewGreedy algorithm. Sheldon et al. [19] proposed a sample average approximation approach from stochastic optimization for maximizing the spread of cascades under a budget restriction. Cheng et al. proposed a static greedy algorithm [6], reducing the number of Monte Carlo simulations by strictly guaranteeing the submodularity and monotonicity of the influence spread function. Although these improvements can speed up the original greedy algorithm by several orders of magnitude, scalability is still a big challenge for greedy algorithms because their guaranteed accuracy relies on a huge number of Monte Carlo simulations.

Heuristic algorithms, in contrast, mainly reduce the complexity of Kempe’s greedy algorithm by computing influence spread heuristically. DegreeDiscount, designed for the uniform independent cascade model, only computes direct influence [4]. The community-based greedy algorithm conducts Monte Carlo simulation within each community rather than on the whole network [22]. The SPM/SP1M algorithms [13] estimate influence spread according to shortest paths, while the PMIA algorithm [3] uses maximum influence paths. The SPIN algorithm employs the concept of the Shapley value from cooperative game theory [17]. The IRIE algorithm efficiently estimates marginal influence spread through an iterative method. Besides the above heuristics using the greedy approach, Jiang et al. proposed a simulated annealing approach with several heuristics [10], and Mathioudakis et al. suggested speeding up influence maximization using a simplified influence network [15]. However, these heuristics cannot guarantee accuracy, and their performance is unstable across networks and diffusion models.

Taken together, in existing algorithms for influence maximization, the estimation of influence spread and the ranking of nodes are studied separately. On one hand, without leveraging the ranking of nodes, greedy algorithms estimate the influence spread of nodes from scratch, causing high computational cost. On the other hand, lacking a reliable estimation of influence spread, heuristic algorithms have no guaranteed accuracy. Hence, in this paper, we improve the state-of-the-art solution of influence maximization problem by exploiting the interplay between marginal influence spread and the ranking of nodes.

3 Self-consistent ranking

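For influence maximization on a social network $G=(V,E)$, the influence spread function $I(S)$ of a node set $S \subseteq V$ is defined as the expected number of nodes in $V$ eventually activated by $S$ under a certain diffusion model. The function $I$ is nonnegative, monotone, and submodular, satisfying

  • Nonnegative: $I(S) \ge 0$;

  • Monotone: $I(S) \le I(T)$, if $S \subseteq T$;

  • Submodular: $I(S \cup \{v\}) - I(S) \ge I(T \cup \{v\}) - I(T)$, for all $S \subseteq T \subseteq V$ and $v \in V$.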

These properties guarantee that a fair approximation to the optimal solution of influence maximization can be obtained by greedy algorithms, iteratively selecting the node with maximum marginal influence spread as seed node.

Definition 1

Marginal influence spread: Given a node set $S$ and a node $v$, the marginal influence spread of $v$ upon $S$ is defined as $M(v \mid S) = I(S \cup \{v\}) - I(S)$.

However, the influence spread function is not extensive, i.e., in general $I(S \cup T) \neq I(S) + I(T)$ even if $S \cap T = \emptyset$, since the nodes activated by $S$ may overlap with the nodes activated by $T$. Therefore, one has to compute the marginal influence spread $M(v \mid S)$ by computing both $I(S \cup \{v\})$ and $I(S)$ from scratch, resulting in huge computational cost. To remedy this problem, we further analyze the properties of the set of seed nodes obtained by greedy algorithms. Indeed, greedy algorithms implicitly give a ranking of nodes, where nodes are ranked in decreasing order of their marginal influence spread. Meanwhile, their marginal influence spread is computed based on their ranks in this implicit ranking. Hence, greedy algorithms obtain a self-consistent ranking of nodes.

Before formally defining self-consistent ranking, we first introduce several related notations for clarity. Without loss of generality, we index all the nodes as $V = \{v_1, v_2, \dots, v_n\}$ where $n = |V|$. A ranking of nodes, determined by a permutation $(r_1, r_2, \dots, r_n)$ of $(1, 2, \dots, n)$ with $r_i$ denoting the index of the node with rank $i$, is denoted as $r$. With these notations, we define the ranking-based marginal influence spread of node $v_{r_i}$ with respect to a ranking $r$ as $M_r(v_{r_i}) = M(v_{r_i} \mid \{v_{r_1}, \dots, v_{r_{i-1}}\})$. In addition, Table 1 lists all the important notations used in this paper.

Definition 2

Self-consistent ranking: A ranking $r$ is a self-consistent ranking iff $M_r(v_{r_i}) \ge M_r(v_{r_j})$ for all $1 \le i < j \le n$.
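
Definition 2 can be checked mechanically once the ranking-based marginal influence spread of every node is available. The following is a minimal sketch, assuming the scores have already been computed with respect to the same ranking; the function name `is_self_consistent` and the tolerance `tol` are illustrative choices, not part of the original method.

```python
from typing import Dict, Hashable, Sequence


def is_self_consistent(ranking: Sequence[Hashable],
                       marginal: Dict[Hashable, float],
                       tol: float = 1e-12) -> bool:
    """Return True if the scores are non-increasing along the ranking.

    `marginal[v]` is assumed to hold the ranking-based marginal influence
    spread of v, computed with respect to this same `ranking`.
    """
    scores = [marginal[v] for v in ranking]
    # Comparing adjacent pairs is enough, because ">=" is transitive.
    return all(a >= b - tol for a, b in zip(scores, scores[1:]))


if __name__ == "__main__":
    order = ["a", "b", "c"]
    print(is_self_consistent(order, {"a": 2.0, "b": 1.5, "c": 1.0}))
    print(is_self_consistent(order, {"a": 1.0, "b": 1.5, "c": 1.0}))
```

The tolerance only guards against floating-point noise when the scores come from a numerical estimator.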

For the set of seed nodes obtained by greedy algorithms, there exists an interplay between the ranks of nodes and their marginal influence spread. On one hand, these nodes are ranked in descending order of their marginal influence spread. On the other hand, the marginal influence spread of nodes is calculated with respect to the ranks of nodes. Indeed, the set of seed nodes obtained by greedy algorithms forms a self-consistent ranking.

Theorem 1

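Greedy algorithms for influence maximization give a self-consistent ranking.

Proof.

Greedy algorithms iteratively select the node with maximum marginal influence spread as the next seed node. With a ranking $r$ denoting the order in which seed nodes are selected, and $S_i = \{v_{r_1}, \dots, v_{r_i}\}$ denoting the first $i$ selected nodes, we have $M(v_{r_i} \mid S_{i-1}) \ge M(v_{r_j} \mid S_{i-1})$ for $1 \le i < j \le n$. In addition, since $S_{i-1} \subseteq S_{j-1}$, the submodularity of the influence spread function implies that $M(v_{r_j} \mid S_{i-1}) \ge M(v_{r_j} \mid S_{j-1})$. Using transitivity, we complete the proof with $M_r(v_{r_i}) = M(v_{r_i} \mid S_{i-1}) \ge M(v_{r_j} \mid S_{j-1}) = M_r(v_{r_j})$.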

For a given social network, however, there are multiple self-consistent rankings besides the one obtained by greedy algorithms. Hence it is critical to develop effective algorithms to achieve a desired self-consistent ranking which is either the very ranking obtained by greedy algorithms or comparable to it from the point of influence maximization.

4 IMRank

In this section, we develop an efficient iterative framework, IMRank, to solve the influence maximization problem by finding a desired self-consistent ranking. IMRank distinguishes itself from greedy algorithms in one key point: in each iteration, IMRank efficiently estimates the marginal influence spread of all nodes based on the current ranking, while greedy algorithms compute the marginal influence spread from scratch with high computational cost.

4.1 IMRank: iterative framework

IMRank aims to find a self-consistent ranking from any initial ranking. It achieves this goal by iteratively adjusting the current ranking as follows:

  • Compute the ranking-based marginal influence spread $M_r$ of all nodes with respect to the current ranking $r$;

  • Obtain a new ranking by sorting all nodes according to $M_r$.

This iterative process is formally described in Algorithm 1. It converges to a self-consistent ranking from any initial ranking (see Section 4.3 for the proof). Intuitively, IMRank iteratively promotes influential nodes to top positions in the ranking, increasing the influence spread of the top-$k$ nodes during the process until it converges to a self-consistent ranking. Different initial rankings could make IMRank converge to different self-consistent rankings. We leave the discussion of the initial ranking to Section 4.4.

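$r^{(0)} \leftarrow$ initial ranking
$t \leftarrow 0$
repeat
   $t \leftarrow t + 1$
   Calculate $M_r$ with respect to the ranking $r^{(t-1)}$
   Generate a new ranking $r^{(t)}$ by sorting nodes in decreasing order according to $M_r$
until $r^{(t)} = r^{(t-1)}$
output the self-consistent ranking $r^{(t)}$
Algorithm 1 IMRank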
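
The loop of Algorithm 1 can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the estimator is passed in as a callable (any routine returning a score per node for a given ranking), ties keep their current relative order because Python's sort is stable, and `max_iter` is a hypothetical safety cap added here.

```python
from typing import Callable, Dict, Hashable, List, Sequence

# An estimator maps a ranking to a score per node (assumed interface).
Estimator = Callable[[Sequence[Hashable]], Dict[Hashable, float]]


def imrank(initial_ranking: Sequence[Hashable],
           estimate: Estimator,
           max_iter: int = 100) -> List[Hashable]:
    """Reorder nodes by their estimated scores until the ranking is stable."""
    ranking = list(initial_ranking)
    for _ in range(max_iter):
        scores = estimate(ranking)
        # Stable sort: nodes with equal scores keep their current order.
        new_ranking = sorted(ranking, key=lambda v: -scores[v])
        if new_ranking == ranking:
            break  # the ranking reproduces itself: self-consistent
        ranking = new_ranking
    return ranking


if __name__ == "__main__":
    # Toy estimator whose scores ignore the ranking (for illustration only).
    fixed = {"a": 3.0, "b": 2.0, "c": 1.0}
    print(imrank(["c", "a", "b"], lambda r: fixed))
```

Any estimator of ranking-based marginal influence spread can be plugged in as `estimate`, for instance the LFA strategy of Section 4.2.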

4.2 Calculate ranking-based marginal influence spread

The core step in IMRank is the calculation of ranking-based marginal influence spread. One straightforward way is to directly compute $M_r$ using Monte Carlo simulation, as done by greedy algorithms. However, its prohibitively high computational cost makes it impractical for IMRank. To address this problem, we propose a Last-to-First Allocating (LFA) strategy to efficiently estimate $M_r$, leveraging the intrinsic interdependence between a ranking and the ranking-based marginal influence spread. We develop the LFA strategy under the widely adopted independent cascade model [12]. In the independent cascade model, when a node $u$ is activated, it has one chance to independently activate each neighboring node $v$ with a propagation probability $p(u, v)$ if $v$ has not been activated yet. Each node can be activated only once.
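
To make the cost of the straightforward approach concrete, the following is a minimal sketch of Monte Carlo estimation under the independent cascade model as described above. The adjacency format (`graph[u]` as a list of `(v, p)` pairs), the function names, and the default of 10,000 runs are assumptions made for illustration; every marginal estimate requires two full batches of simulations.

```python
import random
from typing import Dict, Hashable, Iterable, List, Tuple

# graph[u] lists (v, p) pairs: u activates v with probability p (assumed format).
Graph = Dict[Hashable, List[Tuple[Hashable, float]]]


def simulate_ic(graph: Graph, seeds: Iterable[Hashable],
                rng: random.Random) -> int:
    """Run one independent-cascade simulation; return the number of active nodes."""
    active = set(seeds)
    frontier = list(active)
    while frontier:
        next_frontier = []
        for u in frontier:
            for v, p in graph.get(u, []):
                # u gets a single chance to activate each inactive neighbor v.
                if v not in active and rng.random() < p:
                    active.add(v)
                    next_frontier.append(v)
        frontier = next_frontier
    return len(active)


def influence_spread(graph: Graph, seeds: Iterable[Hashable],
                     runs: int = 10000, seed: int = 0) -> float:
    """Average the cascade size over `runs` simulations."""
    rng = random.Random(seed)
    seed_list = list(seeds)
    return sum(simulate_ic(graph, seed_list, rng) for _ in range(runs)) / runs


def marginal_spread(graph: Graph, node: Hashable, seed_set: Iterable[Hashable],
                    runs: int = 10000, seed: int = 0) -> float:
    """Estimate the marginal spread of `node` as a difference of two estimates."""
    base = list(seed_set)
    return (influence_spread(graph, base + [node], runs, seed)
            - influence_spread(graph, base, runs, seed))
```

Because the marginal value is a difference of two noisy estimates, many runs are needed for a stable result, which is the cost the LFA strategy is designed to avoid.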

The LFA strategy is based on the following fact: by definition, the ranking-based marginal influence spread $M_r(v_{r_i})$ is equal to the expected number of nodes activated by $v_{r_i}$, given that all nodes ranked higher than it have finished propagating their influence. This implies two basic rules for the calculation of $M_r$:

  1. Each node can only be activated by nodes ranked higher than it in the given ranking;

  2. When a node could be activated by multiple nodes, higher-ranked node has higher priority to activate it.

Following the two basic rules, the LFA strategy is described as follows:

  • Given a ranking $r$, the initial value of $M_r(v)$ of each node $v$ is set to be $1$, satisfying the fact that the sum of $M_r(v)$ over all nodes is equal to the number of nodes $n$, since each node can only be activated once.

  • Scanning the ranking from the last node to the top one, a fraction of $M_r(v_{r_i})$ is delivered to the nodes ranked higher than $v_{r_i}$, reflecting the first rule;

  • The delivered influence score of $v_{r_i}$ is allocated among the nodes in terms of their ranks, reflecting the second rule.

    Specifically, the fraction of the influence score of node $v_{r_i}$ delivered to a higher-ranked node $v_{r_j}$ ($j < i$) is

    $$p(v_{r_j}, v_{r_i}) \prod_{k=1}^{j-1} \bigl(1 - p(v_{r_k}, v_{r_i})\bigr) \qquad (1)$$

    where $p(v_{r_j}, v_{r_i})$ is the propagation probability that node $v_{r_j}$ directly activates node $v_{r_i}$, known a priori for the independent cascade model.

The calculation of the ranking-based marginal influence spread is completed after all nodes are scanned. The LFA strategy is formally depicted in Algorithm 2.

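for $i = 1$ to $n$ do
   $M_r(v_i) \leftarrow 1$
end for
for $i = n$ to $2$ do
   for $j = 1$ to $i - 1$ do
      $M_r(v_{r_j}) \leftarrow M_r(v_{r_j}) + p(v_{r_j}, v_{r_i}) \, M_r(v_{r_i})$
      $M_r(v_{r_i}) \leftarrow \bigl(1 - p(v_{r_j}, v_{r_i})\bigr) \, M_r(v_{r_i})$
   end for
end for
output $M_r$
Algorithm 2 Calculate $M_r$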
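
The last-to-first scan can be sketched as follows. This is a minimal illustration, assuming the propagation probabilities are given as a dictionary `prob[(u, v)]` (probability that `u` directly activates `v`); it visits only in-neighbors, which is equivalent to looping over all higher-ranked nodes when absent edges have probability zero. The five-node graph in the usage example is hypothetical: its edges are an assumption made here, not taken from Figure 1, chosen so that with probability 0.2 on every edge the output coincides with the LFA row of Table 2.

```python
from typing import Dict, Hashable, List, Sequence, Tuple


def lfa(ranking: Sequence[Hashable],
        prob: Dict[Tuple[Hashable, Hashable], float]) -> Dict[Hashable, float]:
    """Estimate ranking-based marginal influence spread by last-to-first allocation."""
    rank = {v: i for i, v in enumerate(ranking)}
    in_neighbors: Dict[Hashable, List[Hashable]] = {v: [] for v in ranking}
    for (u, v) in prob:
        in_neighbors[v].append(u)

    score = {v: 1.0 for v in ranking}  # every node starts with one unit
    for v in reversed(ranking):        # scan from the last node to the top
        # Rule 1: only higher-ranked nodes can receive influence from v.
        higher = [u for u in in_neighbors[v] if rank[u] < rank[v]]
        # Rule 2: higher-ranked nodes are served first.
        for u in sorted(higher, key=lambda x: rank[x]):
            p = prob[(u, v)]
            score[u] += p * score[v]   # deliver a share of v's current score
            score[v] *= (1.0 - p)      # v keeps the remainder
    return score


if __name__ == "__main__":
    # Hypothetical graph (assumption), nodes labeled by rank.
    edges = [(1, 3), (2, 3), (2, 4), (3, 5), (4, 5)]
    prob = {e: 0.2 for e in edges}
    result = lfa([1, 2, 3, 4, 5], prob)
    print({v: round(s, 3) for v, s in result.items()})
```

Each node is scanned once and each edge is touched once, apart from the sorting of in-neighbors by rank; the scores always sum to the number of nodes, since influence is only moved between nodes, never created.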

Now we use an example to illustrate the LFA strategy. In Figure 1, denotes the node with rank for convenience, and is the propagation probability along edge . Here, the ranking is simply . Solid lines represent the edges where influence could propagate, while dashed lines depict the edges where influence score is delivered when nodes are scanned. The lack of dashed line from node to node reflects that node is ranked higher than node . For this case, the LFA strategy computes the ranking-based marginal influence spread as follows:

  1. Initially, .

  2. Node is then scanned as the last node in the ranking. According to Equation( 1), delivers to and to respectively. Accordingly, becomes .

  3. Then node is scanned. Since is now , delivers to . Note that the second item characterizes the influence of to through the path , reflecting that the LFA strategy could effectively capture the indirect influence among nodes. After is scanned, the final value of is .

  4. When node is scanned, it delivers to node , with remaining.

  5. Finally, node is scanned. After is scanned, the final scores of and are and respectively. The term in captures the indirect influence from to through the path , indicating that the LFA strategy does collect influence with multiple intermediate nodes on the path. Note that it is not necessary to scan node since it does not deliver influence to other nodes.

The above illustration shows that the LFA strategy efficiently calculates the ranking-based marginal influence spread for all nodes, scanning each node only once. Meanwhile, since indirect influence propagation is effectively captured, the LFA strategy provides a good proxy for ranking-based marginal influence spread. We compare the numerical results of the LFA strategy with those of 20,000 Monte Carlo simulations, using the same propagation probability on all edges as in the uniform independent cascade model. As shown in Table 2, our strategy gives results very close to those of the time-consuming Monte Carlo simulations.

Finally, we summarize the LFA strategy by explaining why it works well. First, it achieves high efficiency by exploiting the interdependence between a ranking and the ranking-based marginal influence spread, avoiding the Monte Carlo simulations used in greedy algorithms. Second, it employs intermediate nodes as delegates, in a last-to-first manner, to capture both direct and indirect influence propagation among nodes. In this way, ranking-based marginal influence spread can be calculated efficiently by scanning all nodes only once. We also stress that the LFA strategy offers an effective approximation rather than an exact calculation of influence spread. This is partly caused by the restriction that influence can only propagate from higher-ranked nodes to lower-ranked nodes. In Section 5, we further improve the LFA strategy by relaxing this restriction.

Figure 1: Illustration of the LFA strategy.
MC    1.29846   1.38800   0.77941   0.89406   0.64007
LFA   1.24000   1.42400   0.76800   0.92800   0.64000
Table 2: Estimation of ranking-based marginal influence spread. MC indicates Monte Carlo simulation, and LFA indicates the LFA strategy.

4.3 Convergence of IMRank

In this section, we first theoretically prove the convergence of IMRank. Then we illustrate the convergence empirically using a real-world network as an example.

Theorem 2

Starting from any initial ranking of nodes, IMRank converges to a self-consistent ranking after a finite number of iterations.

Proof.

We first prove that, for any $k$ ($1 \le k \le n$), the influence spread of the set of top-$k$ nodes, denoted as $I_k(r) = I(\{v_{r_1}, \dots, v_{r_k}\})$ for convenience, is nondecreasing in the iterative process of IMRank. By the definition of ranking-based marginal influence spread and $I(\emptyset) = 0$, we have $I_k(r) = \sum_{i=1}^{k} M_r(v_{r_i})$.

After each iteration of IMRank, a ranking $r$ is adjusted to another ranking $r'$. Since IMRank sorts all nodes in decreasing order of their current ranking-based marginal influence spread $M_r$, the values of $M_r(v_{r'_i})$ ($1 \le i \le k$) are the $k$ largest values among all the $M_r(v)$. Hence, $\sum_{i=1}^{k} M_r(v_{r'_i}) \ge \sum_{i=1}^{k} M_r(v_{r_i}) = I_k(r)$. Moreover, equality holds iff the sets of top-$k$ nodes in rankings $r$ and $r'$ are the same; otherwise the inequality is strict.

Now consider a new ranking $r''$ obtained by reordering only the top-$k$ nodes of ranking $r'$ according to their ranks in ranking $r$, keeping the ranks of the other nodes unchanged. The sets of top-$k$ nodes are the same in rankings $r'$ and $r''$, thus $I_k(r'') = I_k(r')$. Then, for each node $v_{r''_i}$ ($1 \le i \le k$), the set of nodes ranked higher than it in ranking $r''$ is a subset of the set of nodes ranked higher than it in ranking $r$. According to the submodularity of the influence spread function, $M_{r''}(v_{r''_i}) \ge M_r(v_{r''_i})$ for each such node. Thus, $I_k(r'') = \sum_{i=1}^{k} M_{r''}(v_{r''_i}) \ge \sum_{i=1}^{k} M_r(v_{r''_i}) = \sum_{i=1}^{k} M_r(v_{r'_i})$.

Note that we have proved $I_k(r') = I_k(r'')$ and $\sum_{i=1}^{k} M_r(v_{r'_i}) \ge I_k(r)$. Taken together, we obtain $I_k(r') \ge I_k(r)$, where equality holds iff the sets of top-$k$ nodes in rankings $r$ and $r'$ are the same; otherwise $I_k(r') > I_k(r)$.

Based on the above conclusion, as long as the current ranking is not a self-consistent ranking, in each iteration all the values of $I_k$ ($1 \le k \le n$) are nondecreasing, and at least one increases. Since $n$ is finite and $I_k$ for each $k$ has an upper bound (i.e., $n$), IMRank eventually converges to a self-consistent ranking within a finite number of iterations, starting from any initial ranking.

In fact, the above proof also explains the effectiveness of IMRank: it consistently improves the influence spread of the top-$k$ nodes for any $k$, resulting in a convergence that is much faster than that of greedy algorithms.

We now empirically illustrate the convergence of IMRank, using a scientific collaboration network, namely HEPT, extracted from the “High Energy Physics-Theory” section of the e-print arXiv website arXiv.org. This network is composed of nodes and edges. We run IMRank to select 50 seed nodes. Figure 2a shows the percentage of top-$k$ nodes that differ between two successive iterations. For two widely used models, the weighted independent cascade (WIC) model [12] and the trivalency independent cascade (TIC) model [3], the set of top-$k$ nodes becomes unchanged after and iterations respectively. Clearly, IMRank converges significantly quicker than greedy algorithms, which require $k$ iterations for selecting $k$ seed nodes. Figure 2b depicts the influence spread of the top-$k$ nodes. For convenience, we employ the relative influence spread, i.e., the ratio of the influence spread of the top-$k$ nodes in each iteration to the final influence spread obtained when IMRank converges. IMRank only takes and iterations to achieve a stable and high influence spread under the two models respectively. The influence spread of the top-$k$ nodes always converges in fewer iterations than the set of top-$k$ nodes itself. Therefore, one can safely stop IMRank in practice by checking the change of the top-$k$ nodes between two successive iterations.
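
The practical stopping rule mentioned above, checking how much the top-$k$ set changes between two successive iterations, can be sketched as follows. The function name `topk_change` is hypothetical; the sketch assumes $k \ge 1$ and that both rankings contain at least $k$ nodes.

```python
from typing import Hashable, Sequence


def topk_change(prev_ranking: Sequence[Hashable],
                new_ranking: Sequence[Hashable],
                k: int) -> float:
    """Fraction of the new top-k nodes that were not in the previous top-k."""
    prev_top = set(prev_ranking[:k])
    new_top = set(new_ranking[:k])
    return len(new_top - prev_top) / k


if __name__ == "__main__":
    before = [1, 2, 3, 4]
    after = [1, 3, 2, 4]
    # Stop iterating once this value reaches zero.
    print(topk_change(before, after, k=2))
```

Only set membership is compared, so reorderings inside the top-$k$ nodes do not delay the stop; this is a design choice of the sketch, consistent with selecting the top-$k$ nodes as an unordered seed set.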

In sum, we have theoretically and empirically demonstrated the convergence of IMRank. The convergence of IMRank could, however, be affected by the estimation of marginal influence spread. Extensive experiments in Section 6 further show that IMRank with the LFA strategy always converges quickly.

(a) Top-$k$ nodes
(b) Influence spread
Figure 2: Convergence of IMRank

4.4 Analysis of initial ranking

Since IMRank is guaranteed to converge to a self-consistent ranking from any initial ranking, it is necessary to discuss its dependence on the initial ranking: does an arbitrary initial ranking result in a unique convergence point? If not, what initial ranking leads to a better result? We explore these questions by empirically running IMRank with five typical initial rankings as follows,

  • Random: Nodes are initially ranked randomly;

  • Degree: Nodes are initially ranked in descending order of degrees (undirected networks) or out-degrees (directed networks);

  • InversedDegree: Nodes are initially ranked in ascending order of degrees (undirected networks) or out-degrees (directed networks);

  • Strength: Nodes are initially ranked in descending order of node strengths (undirected networks) or node out-strengths (directed networks). The node strength is the sum of all weights on its edges. The node out-strength is the sum of all weights on its out-edges;

  • PageRank: Nodes are initially ranked in descending order of PageRank scores [2], with the default value for the damping factor parameter.
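
Four of the five initial rankings above can be built directly from an edge list, as sketched below for a directed network. The input format (`edges[(u, v)]` holding the weight of edge `u -> v`), the function name, and the tie-breaking by input order are assumptions; PageRank is left out to keep the sketch short, since it needs an iterative computation of its own.

```python
import random
from typing import Dict, Hashable, List, Sequence, Tuple


def initial_rankings(nodes: Sequence[Hashable],
                     edges: Dict[Tuple[Hashable, Hashable], float],
                     seed: int = 0) -> Dict[str, List[Hashable]]:
    """Build Random, Degree, InversedDegree and Strength rankings (directed case)."""
    out_degree = {v: 0 for v in nodes}
    out_strength = {v: 0.0 for v in nodes}
    for (u, _v), weight in edges.items():
        out_degree[u] += 1
        out_strength[u] += weight

    shuffled = list(nodes)
    random.Random(seed).shuffle(shuffled)

    order = list(nodes)
    return {
        "Random": shuffled,
        # sorted() is stable, so ties keep the input order of `nodes`.
        "Degree": sorted(order, key=lambda v: -out_degree[v]),
        "InversedDegree": sorted(order, key=lambda v: out_degree[v]),
        "Strength": sorted(order, key=lambda v: -out_strength[v]),
    }


if __name__ == "__main__":
    toy_edges = {("a", "b"): 0.5, ("a", "c"): 0.1, ("b", "c"): 0.9}
    for name, ranking in initial_rankings(["a", "b", "c"], toy_edges).items():
        print(name, ranking)
```

For an undirected network, each edge would be counted for both endpoints instead of only for its source.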

Empirical results on the HEPT dataset under the WIC model are reported in Figure 3, comparing the performance of IMRank with different initial rankings, as well as the performance of those rankings alone. We also report the performance of the classic greedy algorithm for comparison, implemented with the CELF optimization [14]. The performance of IMRank with the Random initial ranking, and that of the Random ranking alone, are averaged over repeated trials.

With the empirical results we conclude:

  • With different initial rankings, IMRank could converge to different self-consistent rankings. However, IMRank consistently improves the initial rankings in terms of obtained influence spread.

  • IMRank with a “good” initial ranking such as Degree, Strength, or PageRank performs comparably to the greedy algorithm. The three variants are indistinguishable and are shown as a single curve in Figure 3a. A good initial ranking prefers nodes with high influence;

  • IMRank with a “neutral” initial ranking such as Random also shows fair performance, slightly poorer than the greedy algorithm and IMRank with a good initial ranking. A neutral initial ranking prefers no nodes;

  • IMRank with a “bad” initial ranking such as InversedDegree shows remarkable improvement upon the initial ranking alone but is dominated by the greedy algorithm. A bad initial ranking prefers nodes with low influence.

Therefore, IMRank is robust to the selection of the initial ranking, and it works well with an initial ranking that prefers nodes with high influence, which can be obtained efficiently in practice. A possible explanation is the a priori bias that a high-ranked node earns more allocated influence than a low-ranked node, even under the same topological circumstances. Hence, ranking nodes with high influence high initially helps IMRank converge to a good ranking.

Among the three “good” initial rankings with indistinguishable performance, Degree is a good candidate: computing the initial ranking consumes a large part of the total running time of IMRank, as shown in Figure 3b, and Degree is cheap to compute.

(a) Influence spread
(b) Running time when
Figure 3: Comparison between IMRank with different initial rankings under the WIC model.

5 Advanced IMRank

In the LFA strategy, a node is only allowed to allocate its influence to a higher ranked neighboring node , implying the assumption that a node can only be activated by higher ranked neighbors. The assumption ignores the possibility that a lower ranked neighbor activates a higher ranked node by playing the role of an intermediate agent of another node with . Take the path in Figure 1 for example. After is selected as a seed, it activates and then as an intermediate agent activates .

To address the above problem, we propose a generalized LFA strategy that trades a slight increase in running time for better accuracy in estimating $M_r$, by exploring more paths that potentially propagate influence. This generalized LFA strategy can further improve the influence spread achieved by IMRank. To avoid double counting when a path is contained in a longer path, we introduce influence paths as corrections.

Definition 3

Influence path: Given a ranking $r$, a simple path from a node $u$ to a node $v$ is called an influence path if $u$ is the only node along the path that is ranked higher than $v$.

Lemma 1

A directed edge $(u, v)$ is an influence path if $u$ is ranked higher than $v$.

Lemma 2

A node $v$ allocates influence score to another node $u$ only along an influence path from $u$ to $v$, if any exists.

Proof.

Consider a path from $u$ to $v$. If $u$ is ranked lower than $v$, $u$ has no chance to trigger a cascade that activates $v$, immediately or eventually. Therefore a path is not negligible only when $u$ is ranked higher than $v$. Furthermore, if there is an intermediate node $w$ ranked higher than $u$, there is no chance that $u$ activates $v$ along this path since $w$ is triggered earlier, thus such a path can be neglected. If there exists an intermediate node $w$ ranked lower than $u$ but higher than $v$, the influence allocated from $v$ to $w$ already contains the fraction that $u$ activates $v$, as discussed in Section 4.2. Thus such a path should not be counted, to avoid double counting.

We denote to the probability that is activated by through any influence path. is equal to the probability that any influence path from to has all its nodes activated, discounted by the probability that is already activated before attempts. can be obtained as follows,

(2)

where is the joint probability that activates all nodes on an influence path , and denotes the set of all the influence paths starting from and ending with .

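To summarize, the generalized LFA strategy calculates marginal influence spread by changing the allocation method: a node delivers a fraction of its influence to each higher-ranked node that reaches it through an influence path, instead of only to each adjacent higher-ranked node, with the propagation probability in Equation (1) replaced by the influence-path probability of Equation (2).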

Although searching all influence paths is computationally exhaustive, we can safely limit the higher-order correction to second or third order. Specifically, we prune paths longer than $l$ hops, which are expensive to count but propagate influence with low probability. The allocation of marginal influence spread is therefore restricted to a local region, avoiding exploration of the whole network. Obviously, $l = 1$ makes the generalized LFA strategy collapse into the LFA strategy.
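
The pruned search for influence paths can be sketched as a bounded backward traversal from the target node. This is an illustration under stated assumptions rather than the authors' procedure: `prob[(u, v)]` is the probability that `u` directly activates `v`, a path's probability is taken as the product of its edge probabilities, and the helper `combined_probability` treats distinct paths as independent and leaves out the discount term of Equation (2). All function names are hypothetical.

```python
from typing import Dict, FrozenSet, Hashable, List, Sequence, Tuple


def influence_path_probs(target: Hashable,
                         ranking: Sequence[Hashable],
                         prob: Dict[Tuple[Hashable, Hashable], float],
                         max_len: int) -> Dict[Hashable, List[float]]:
    """Collect, per higher-ranked source, the probabilities of influence paths
    of at most `max_len` edges that end at `target`."""
    rank = {v: i for i, v in enumerate(ranking)}
    in_edges: Dict[Hashable, List[Tuple[Hashable, float]]] = {v: [] for v in ranking}
    for (u, v), p in prob.items():
        in_edges[v].append((u, p))

    found: Dict[Hashable, List[float]] = {}

    def extend(node: Hashable, path_prob: float,
               visited: FrozenSet[Hashable], length: int) -> None:
        for u, p in in_edges[node]:
            if u in visited:
                continue  # keep the path simple
            q = path_prob * p
            if rank[u] < rank[target]:
                # u is ranked higher than the target: the path stops here.
                found.setdefault(u, []).append(q)
            elif length + 1 < max_len:
                # u is a lower-ranked intermediate node: keep walking backward.
                extend(u, q, visited | {u}, length + 1)

    extend(target, 1.0, frozenset([target]), 0)
    return found


def combined_probability(path_probs: Sequence[float]) -> float:
    """Probability that at least one path succeeds, assuming independent paths."""
    miss = 1.0
    for q in path_probs:
        miss *= (1.0 - q)
    return 1.0 - miss


if __name__ == "__main__":
    order = ["a", "b", "c"]  # "a" has the highest rank
    toy = {("a", "c"): 0.5, ("c", "b"): 0.5, ("a", "b"): 0.1}
    print(influence_path_probs("b", order, toy, max_len=1))
    print(influence_path_probs("b", order, toy, max_len=2))
```

With `max_len=1` only direct edges from higher-ranked nodes are found, matching the plain LFA strategy; with `max_len=2` the path through the lower-ranked intermediate node `"c"` is found as well.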

The time and space complexity of IMRank with the generalized LFA strategy mainly depends on . Let denote the largest number of paths of length that end at an arbitrary node. The time required for scanning any node is , which covers searching candidate nodes, sorting candidate nodes by their ranks, and allocating influence. Hence the total time complexity of IMRank is , where is the number of iterations needed for IMRank to converge. Our experimental results show that IMRank always converges with a fairly small , significantly smaller than , e.g., when . Since is usually much smaller than (e.g., is just the largest indegree among all nodes when ), the time complexity of IMRank is low. As for space complexity, IMRank only needs to store the value of for each node, which takes space in memory. Hence the space complexity is also low.
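The per-node scan described above (search candidates, sort them by rank, allocate influence) can be sketched for the simplest case in which only direct edges are used. This is a minimal Python sketch, not the authors' implementation. The update rule (a higher-ranked in-neighbour receives a fraction p of the scanned node's current score, and the scanned node keeps the remaining 1 - p) is an assumption based on the discounting described earlier; Section 4.2 is the authoritative definition. The names `lfa`, `in_edges` and `score` are hypothetical.

```python
from typing import Dict, Hashable, List

Node = Hashable


def lfa(in_edges: Dict[Node, Dict[Node, float]], ranking: List[Node]) -> Dict[Node, float]:
    """Estimate ranking-based marginal influence spread by scanning the
    ranking from the last node to the first (direct edges only).

    in_edges[v][u] is the propagation probability of the edge u -> v.
    """
    rank = {v: i for i, v in enumerate(ranking)}  # 0 is the top position
    score = {v: 1.0 for v in ranking}  # one stored value per node
    for v in reversed(ranking):
        # Candidates: in-neighbours ranked higher than v, highest rank first.
        candidates = sorted(
            (u for u in in_edges.get(v, {}) if rank[u] < rank[v]),
            key=rank.__getitem__,
        )
        for u in candidates:
            p = in_edges[v][u]
            score[u] += p * score[v]  # assumed rule: u receives a fraction p
            score[v] *= 1.0 - p       # v keeps what was not allocated
    return score
```

Only one floating-point value per node is kept, which is consistent with the low memory footprint stated above. In this sketch the scores always sum to the number of nodes, because every allocation moves score from one node to another without creating or destroying any.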

Figure 4 shows the impact of on the performance of IMRank, measured on the NEPT network with the WIC model and as an example. We report the results of IMRank with the Degree and Random initial rankings, since the results for other initial rankings are similar. When increases from to , there is a visible increase in the performance of IMRank, measured by influence spread. This indicates that a larger indeed makes the estimation of marginal influence spread more accurate, which in turn lets IMRank obtain a better ranking. When increases beyond , the performance of IMRank converges quickly, because the propagation probabilities of long paths decrease exponentially with their length. Hence, long influence paths have little impact on the final estimation. As shown in the inset of Figure 4, the running time of IMRank increases rapidly as increases, since many more paths need to be searched. A suitable can therefore be selected to balance influence spread against running time, based on the practical accuracy requirement and the affordable computational resources.

Figure 4: Impact of on the performance of IMRank.

6 Experiments

In this section, we evaluate IMRank on real-world networks by comparing IMRank with state-of-the-art influence maximization algorithms.

Figure 5: Influence spread and running time on the PHY dataset. Panels: (a) WIC model, (b) TIC model, (c) running time.

Figure 6: Influence spread and running time on the DBLP dataset. Panels: (a) WIC model, (b) TIC model, (c) running time.

6.1 Experimental Setup

Diffusion models

Experiments are conducted under two widely-used independent cascade models:

  • Weighted independent cascade (WIC) model [12]: Each edge (u, v) is assigned a propagation probability p(u, v) = 1/d_v, where d_v is the indegree of node v.

  • Trivalency independent cascade (TIC) model [3]: Each edge is assigned a propagation probability selected uniformly at random from {0.1, 0.01, 0.001}, indicating high, medium and low levels of influence.
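The two probability assignments can be sketched in Python as follows. The WIC function uses the standard definition from [12], in which an edge into a node receives probability one over that node's indegree. The function names, the edge-list representation and the `seed` parameter are hypothetical and serve only as an illustration.

```python
import random
from collections import Counter
from typing import Dict, Hashable, List, Optional, Tuple

Edge = Tuple[Hashable, Hashable]  # (u, v) means a directed edge u -> v


def wic_probabilities(edges: List[Edge]) -> Dict[Edge, float]:
    """Weighted IC model: p(u, v) = 1 / indegree(v)."""
    indegree = Counter(v for _, v in edges)
    return {(u, v): 1.0 / indegree[v] for u, v in edges}


def tic_probabilities(edges: List[Edge], seed: Optional[int] = None) -> Dict[Edge, float]:
    """Trivalency IC model: each edge draws its probability uniformly at
    random from {0.1, 0.01, 0.001}."""
    rng = random.Random(seed)
    return {(u, v): rng.choice((0.1, 0.01, 0.001)) for u, v in edges}
```

Under the WIC model the incoming probabilities of every node with at least one in-edge sum to one, whereas under the TIC model they are independent of the node's degree.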

Baseline algorithms

The compared algorithms include two implementations of IMRank and two state-of-the-art heuristic algorithms, i.e., PMIA and IRIE. Details are as follows:

  • IMRank1: IMRank with Degree as the initial ranking method and . Following the analysis in Section 4.3, it stops when the sets of top- nodes are identical in two successive iterations or after 10 iterations.

  • IMRank2: IMRank with Degree as the initial ranking method and , with the same stopping criterion as IMRank1.

  • PMIA: This heuristic algorithm estimates influence spread based on maximum influence paths [3]. We use the recommended parameter setting .

  • IRIE: This heuristic algorithm integrates influence ranking with influence estimation [11]. The parameters and are set to and , and the maximum numbers of iterations for the initial round and the subsequent rounds are 20 and 5 respectively, as recommended.
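The stopping criterion used for IMRank1 and IMRank2 can be sketched as the following loop. This is a minimal Python sketch under stated assumptions: `score_fn` is a hypothetical callable that returns an estimated marginal influence spread for every node given the current ranking, `k` is the number of top nodes compared between iterations, and ties are broken by keeping the current order.

```python
from typing import Callable, Dict, Hashable, List, Tuple

Node = Hashable


def run_imrank(
    initial_ranking: List[Node],
    score_fn: Callable[[List[Node]], Dict[Node, float]],
    k: int,
    max_rounds: int = 10,
) -> Tuple[List[Node], int]:
    """Re-rank nodes by score until the top-k set is unchanged between two
    successive iterations, or until `max_rounds` iterations have run."""
    ranking = list(initial_ranking)
    rounds = 0
    for rounds in range(1, max_rounds + 1):
        scores = score_fn(ranking)
        # sorted() is stable, so nodes with equal scores keep their order.
        new_ranking = sorted(ranking, key=lambda v: scores[v], reverse=True)
        same_top = set(new_ranking[:k]) == set(ranking[:k])
        ranking = new_ranking
        if same_top:
            break
    return ranking, rounds
```

Comparing the top-k sets rather than the full rankings lets the loop stop as soon as the seed set is stable, even if lower-ranked nodes are still being reordered.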

Dataset       #Nodes   #Edges   Directed?
PHY           37K      231K     undirected
DBLP          655K     2M       undirected
EPINIONS      76K      509K     directed
DOUBAN        552K     22M      directed
LIVEJOURNAL   4M       69M      directed

Table 3: Statistics of test networks

Datasets

Experiments are conducted on five real-world networks: two undirected scientific collaboration networks and three directed online social networks. Table 3 gives basic statistics of these networks. One of the two scientific collaboration networks, denoted PHY, is obtained from the complete list of papers in the Physics section of the e-print arXiv website. The other, denoted DBLP, is extracted from the DBLP Computer Science Bibliography 1. The three online social networks are EPINIONS, DOUBAN, and LIVEJOURNAL 2, extracted from epinions.com, douban.com and livejournal.com respectively. In the EPINIONS dataset, an edge between two users and , denoted as , represents that user trusts user . In the DOUBAN dataset [9], an edge between two users and represents that user follows user . In the LIVEJOURNAL network [1], an edge between two users and represents that user declares user as his/her friend. We choose these five networks because they cover various kinds of relationships and different sizes, ranging from hundreds of thousands of edges to tens of millions of edges. We also tested our method on many other networks; limited by space, those results are not included in this paper.

All experiments are conducted on a server with 1.9GHz Quad-Core AMD Opteron(tm) Processor 8347HEx4 and 64 GB of memory.

Figure 7: Influence spread and running time on the EPINIONS dataset. Panels: (a) WIC model, (b) TIC model, (c) running time.

Figure 8: Influence spread and running time on the DOUBAN and LIVEJOURNAL datasets. Panels: (a) DOUBAN, (b) LIVEJOURNAL, (c) running time.

6.2 Experimental Results

We evaluate IMRank on real-world networks by comparing it with state-of-the-art algorithms. Evaluation metrics include influence spread and running time. For the comparison of influence spread, we test the cases of . For the comparison of running time, we focus on the typical case . Figures 5-8 show the results on the individual networks. In Figures 5-7, the first two subfigures give the influence spread under the WIC model and the TIC model respectively, and the last one gives the running time.
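Influence spread under an independent cascade model is commonly estimated by Monte Carlo simulation. The following Python sketch shows one way to do so. The number of runs, the function names and the adjacency representation are assumptions made for illustration; this section does not state how many simulations were used in the experiments.

```python
import random
from typing import Dict, Hashable, Iterable

Node = Hashable
# out_edges[u][v] = propagation probability of the edge u -> v
OutEdges = Dict[Node, Dict[Node, float]]


def simulate_ic(out_edges: OutEdges, seeds: Iterable[Node], rng: random.Random) -> int:
    """Run one independent cascade and return the number of activated nodes."""
    active = set(seeds)
    frontier = list(active)
    while frontier:
        next_frontier = []
        for u in frontier:
            # Each newly activated node gets a single chance per out-edge.
            for v, p in out_edges.get(u, {}).items():
                if v not in active and rng.random() < p:
                    active.add(v)
                    next_frontier.append(v)
        frontier = next_frontier
    return len(active)


def estimate_spread(
    out_edges: OutEdges, seeds: Iterable[Node], runs: int = 1000, seed: int = 0
) -> float:
    """Average the cascade size over `runs` independent simulations."""
    rng = random.Random(seed)
    seeds = list(seeds)
    return sum(simulate_ic(out_edges, seeds, rng) for _ in range(runs)) / runs
```

The estimate becomes more precise as `runs` grows, at a cost that is linear in the number of simulations.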

Figure 5 shows the experimental results on the PHY dataset. Under the WIC model, IMRank2 achieves the best influence spread, followed by IMRank1, and both outperform PMIA and IRIE. The superior accuracy of IMRank2 is attributable to the fact that it explores more influence paths to estimate ranking-based marginal influence spread accurately. PMIA exhibits the worst performance, with 6.3% lower influence spread than IMRank2 when . Under the TIC model, as shown in Figure 5(b), similar results are obtained and the gaps between the algorithms become more visible. For influence spread, IMRank2 and IMRank1 are the top two algorithms, while PMIA slightly outperforms IRIE. The influence spread obtained by IMRank2 is 13.8% and 12.7% higher than that obtained by IRIE and PMIA respectively. Moreover, as shown in Figure 5(c), IMRank1 and IMRank2 run faster than the competing algorithms under both the WIC model and the TIC model. IMRank1 is the fastest, followed by IMRank2, which achieves higher influence spread at the cost of longer running time; PMIA takes third place and IRIE is the slowest. In particular, the running times of IRIE and PMIA are 30 times and 10 times longer than that of IMRank1 under the WIC model respectively, and 18 times and 9 times longer than that of IMRank1 under the TIC model. With the running time dramatically reduced, IMRank1 still achieves better influence spread, about 5.5% and 4.5% higher than that of IRIE and PMIA respectively. The consistent performance of IMRank1 and IMRank2 demonstrates the effectiveness of IMRank. The inconsistent performance of PMIA and IRIE under the two diffusion models illustrates that both PMIA and IRIE are unstable.

Figure 6 shows the results on DBLP, a network with two million edges. The performance of the four algorithms on this network is similar to their performance on the PHY dataset. For the WIC model, IMRank2 achieves the highest influence spread and IMRank1 is the fastest. In particular, when , the highest influence spread is achieved by IMRank2, and its running time is less than that of PMIA and IRIE. IMRank1 obtains influence spread similar to that of PMIA, and its running time is one order of magnitude smaller than that of PMIA. For the TIC model, IMRank1, IMRank2 and PMIA achieve very similar influence spread, which is significantly higher than the influence spread achieved by IRIE. Moreover, IMRank1 runs nearly 8 times and 13 times faster than PMIA and IRIE respectively.

Figure 7 gives the results on EPINIONS, a social network with more than half a million edges. For the WIC model, IMRank1 and IMRank2 run faster than PMIA and IRIE. In particular, compared with PMIA, IMRank1 reduces the running time by more than two orders of magnitude and IMRank2 reduces it by more than one order of magnitude. For the TIC model, IMRank2 achieves the best influence spread and IMRank1 takes second place. Both IMRank1 and IMRank2 significantly outperform PMIA and IRIE. Moreover, the running time of IMRank1 is only 0.1% of the running time of PMIA and 5% of that of IRIE. With similar running time, IMRank2 achieves significantly higher influence spread than PMIA and IRIE.

Figure 8 shows the results on the DOUBAN and LIVEJOURNAL datasets. DOUBAN and LIVEJOURNAL have 22 million and 69 million edges respectively. Here we only give the results under the WIC model. On the DOUBAN network, the four algorithms achieve comparable influence spread. However, IMRank1 runs more than two orders of magnitude faster than PMIA and more than one order of magnitude faster than IRIE. On the LIVEJOURNAL network, IMRank2 and IRIE have similar influence spread, while IMRank1 follows and PMIA achieves the lowest influence spread. Note that IMRank2 runs faster than IRIE, and IMRank1 runs much faster than PMIA. We do not show the results under the TIC model since no visible difference is observed among the four tested algorithms. This is because selecting one influential node always achieves a very large influence spread on the DOUBAN and LIVEJOURNAL networks, and no increase in influence spread can be gained by adding a new seed. This phenomenon has been observed and discussed in [12] and [3]. A possible reason is that the influence networks generated by the TIC model on the two networks have a relatively large strongly connected component. In addition, IRIE runs faster than PMIA on EPINIONS, while PMIA runs faster than IRIE on the two scientific collaboration networks, PHY and DBLP. This demonstrates that both PMIA and IRIE perform unstably across different networks.
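One way to probe the conjecture about a large strongly connected component is to sample an influence network (keep each edge with its propagation probability) and measure its largest strongly connected component. The Python sketch below does this with an iterative version of Kosaraju's algorithm. It is an illustration only: the function names and data layout are hypothetical, and the paper does not describe performing this check.

```python
import random
from typing import Dict, Hashable, Iterable, List, Tuple

Node = Hashable


def sample_live_edges(
    edge_probs: Dict[Tuple[Node, Node], float], rng: random.Random
) -> Dict[Node, List[Node]]:
    """Keep each directed edge independently with its propagation probability."""
    live: Dict[Node, List[Node]] = {}
    for (u, v), p in edge_probs.items():
        if rng.random() < p:
            live.setdefault(u, []).append(v)
    return live


def largest_scc_size(nodes: Iterable[Node], adj: Dict[Node, List[Node]]) -> int:
    """Size of the largest strongly connected component (Kosaraju, iterative)."""
    order, seen = [], set()
    for start in nodes:  # pass 1: record nodes in order of DFS finish time
        if start in seen:
            continue
        seen.add(start)
        stack = [(start, iter(adj.get(start, ())))]
        while stack:
            node, neighbours = stack[-1]
            for w in neighbours:
                if w not in seen:
                    seen.add(w)
                    stack.append((w, iter(adj.get(w, ()))))
                    break
            else:  # all neighbours explored: the node is finished
                order.append(node)
                stack.pop()
    reverse: Dict[Node, List[Node]] = {}
    for u, targets in adj.items():
        for v in targets:
            reverse.setdefault(v, []).append(u)
    best, assigned = 0, set()
    for start in reversed(order):  # pass 2: DFS on the reversed graph
        if start in assigned:
            continue
        assigned.add(start)
        size, stack2 = 0, [start]
        while stack2:
            node = stack2.pop()
            size += 1
            for w in reverse.get(node, ()):
                if w not in assigned:
                    assigned.add(w)
                    stack2.append(w)
        best = max(best, size)
    return best
```

Averaging `largest_scc_size` over several sampled networks would indicate how much of the graph a single well-placed seed can be expected to reach.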

These experiments clearly show that PMIA and IRIE perform unstably across different scenarios, while IMRank consistently shows good performance. According to these experiments, PMIA always runs the slowest among the four tested algorithms on denser networks, such as EPINIONS, DOUBAN and LIVEJOURNAL. This is mainly because such networks involve many influence paths to calculate and store. In contrast, IRIE always performs the worst on the sparser and smaller networks, PHY and DBLP. This is probably because IRIE strictly follows iterative ranking and iterative estimation, resulting in relatively long running time on sparser and smaller networks. Unlike these two algorithms, IMRank performs efficiently and stably across the tested cases. IMRank1 always runs more than one order of magnitude faster than PMIA and IRIE when they achieve similar influence spread. IMRank2 consistently provides better influence spread than PMIA and IRIE, yet runs faster than them.

7 Conclusions

In this paper, we investigated influence maximization from a novel ranking perspective. We proposed an efficient iterative framework, IMRank, to combine the benefits of accurate greedy algorithms and efficient heuristic estimation of influence spread. This framework tunes any initial ranking into a self-consistent ranking in an iterative manner by fully leveraging the interplay between the ranking of nodes and their ranking-based marginal influence spread. A last-to-first allocating strategy is further proposed to efficiently estimate the ranking-based marginal influence spread. This strategy is designed according to the characteristics of the independent cascade model and the ranking-based marginal influence spread. We further generalize the last-to-first allocating strategy in order to achieve more accurate estimation. We also prove the convergence of IMRank and analyze the impact of the initial ranking. Moreover, IMRank always works well with simple heuristic rankings, such as degree and strength. Extensive experiments on large-scale real-world social networks demonstrate the efficiency of IMRank. Its scalability outperforms the state-of-the-art heuristics while its accuracy is comparable to the greedy algorithms.

For future work, we will try to analyze the accuracy of IMRank theoretically. Moreover, we believe the proposed iterative framework is general enough to apply to other cases for which greedy algorithms are suitable. We will try to extend it to problems beyond influence maximization, such as the diversity problem in retrieval.

Footnotes

  1. http://www.informatik.uni-trier.de/ley/db/
  2. EPINIONS and LIVEJOURNAL can be downloaded from http://snap.stanford.edu/data/. DOUBAN can be obtained on demand via email to the authors.

References

  1. L. Backstrom, D. Huttenlocher, J. Kleinberg, and X. Lan. Group formation in large social networks: membership, growth, and evolution. In KDD’06, pages 44–54, 2006.
  2. S. Brin and L. Page. The anatomy of a large-scale hypertextual web search engine. In WWW’98, pages 107–117, 1998.
  3. W. Chen, C. Wang, and Y. Wang. Scalable influence maximization for prevalent viral marketing in large-scale social networks. In KDD’10, pages 1029–1038, 2010.
  4. W. Chen, Y. Wang, and S. Yang. Efficient influence maximization in social networks. In KDD’09, pages 199–208, 2009.
  5. W. Chen, Y. Yuan, and L. Zhang. Scalable influence maximization in social networks under the linear threshold model. In ICDM’10, pages 88–97, 2010.
  6. S. Cheng, H. Shen, J. Huang, G. Zhang, and X. Cheng. StaticGreedy: Solving the Scalability-Accuracy Dilemma in Influence Maximization. In CIKM’13, 2013.
  7. P. Domingos and M. Richardson. Mining the network value of customers. In KDD’01, pages 57–66, 2001.
  8. A. Goyal, W. Lu, and L. V. Lakshmanan. CELF++: optimizing the greedy algorithm for influence maximization in social networks. In WWW’11, pages 47–48, 2011.
  9. J. Huang, X.-Q. Cheng, H.-W. Shen, T. Zhou, and X. Jin. Exploring social influence via posterior effect of word-of-mouth recommendations. In WSDM’12, pages 573–582, 2012.
  10. Q. Jiang, G. Song, C. Gao, Y. Wang, W. Si, and K. Xie. Simulated annealing based influence maximization in social networks. In AAAI’11, 2011.
  11. K. Jung, W. Heo, and W. Chen. IRIE: Scalable and robust influence maximization in social networks. In ICDM’12, pages 918–923, 2012.
  12. D. Kempe, J. Kleinberg, and E. Tardos. Maximizing the spread of influence through a social network. In KDD’03, pages 137–146, 2003.
  13. M. Kimura, K. Saito, R. Nakano, and H. Motoda. Extracting influential nodes on a social network for information diffusion. Data Mining and Knowledge Discovery, 20(1):70–97, 2010.
  14. J. Leskovec, A. Krause, C. Guestrin, et al. Cost-effective outbreak detection in networks. In KDD’07, pages 420–429, 2007.
  15. M. Mathioudakis, F. Bonchi, C. Castillo, A. Gionis, and A. Ukkonen. Sparsification of influence networks. In KDD’11, pages 529–537, 2011.
  16. E. Mossel and S. Roch. On the submodularity of influence in social networks. In STOC’07, pages 128–134, 2007.
  17. R. Narayanam and Y. Narahari. A shapley value-based approach to discover influential nodes in social networks. IEEE Transactions on Automation Science and Engineering, 8(1):130–147, 2011.
  18. M. Richardson and P. Domingos. Mining knowledge-sharing sites for viral marketing. In KDD’02, pages 61–70, 2002.
  19. D. Sheldon, B. Dilkina, A. Elmachtoub, et al. Maximizing the spread of cascades using network design. In UAI’10, pages 517–526, 2010.
  20. J. Tang, J. Sun, C. Wang, and Z. Yang. Social influence analysis in large-scale networks. In KDD’09, pages 807–816, 2009.
  21. C. Wang, W. Chen, and Y. Wang. Scalable influence maximization for independent cascade model in large-scale social networks. Data Mining and Knowledge Discovery, 25(3):545–576, 2012.
  22. Y. Wang, G. Cong, G. Song, and K. Xie. Community-based greedy algorithm for mining top-k influential nodes in mobile social networks. In KDD’10, pages 1039–1048, 2010.