
Outward Influence and Cascade Size Estimation in Billion-scale Networks

Abstract

Estimating cascade size and nodes’ influence is a fundamental task in social, technological, and biological networks. Yet this task is extremely challenging due to the sheer size and the structural heterogeneity of networks. We investigate a new influence measure, termed outward influence (OI), defined as the expected number of nodes that a subset of nodes $S$ will activate, excluding the nodes in $S$ itself. Thus, OI equals the influence spread of $S$, the de facto standard measure, minus $|S|$. OI is not only more informative for nodes with small influence, but also critical in designing new effective sampling and statistical estimation methods.

Based on OI, we propose SIEA/SOIEA, novel methods to estimate influence spread/outward influence at scale and with rigorous theoretical guarantees. The proposed methods are built on two novel components: 1) IICP, an importance sampling method for outward influence; and 2) RSA, a robust mean estimation method that minimizes the number of samples by analyzing the variance and range of the random variables. Compared to the state-of-the-art for influence estimation, SIEA is $\Omega(\log^4 n)$ times faster in theory and up to several orders of magnitude faster in practice. For the first time, the influence of nodes in networks of billions of edges can be estimated with high accuracy within a few minutes. Our comprehensive experiments on real-world networks also give evidence against the popular practice of using a fixed number of samples, e.g. 10K or 20K, to compute the “ground truth” for influence spread.

Keywords: Outward influence; FPRAS; Approximation Algorithm

ORCID: 0000-0003-4847-8356

1 Introduction

In the past decade, a massive amount of data on human interactions has shed light on various cascading processes, from the propagation of information and influence [Kempe et al. (2003)] to the outbreak of diseases [Leskovec et al. (2007)]. These cascading processes can be modeled in graph theory through the abstraction of the network as a graph and a diffusion model that describes how the cascade proceeds through the network from a prescribed subset of nodes. A fundamental task in analyzing those cascades is to estimate the cascade size, also known as influence spread in social networks. This task is the foundation of solutions for many applications, including viral marketing [Kempe et al. (2003), Tang et al. (2014), Tang et al. (2015), Nguyen et al. (2016b)], estimating users’ influence [Du et al. (2013), Lucier et al. (2015)], optimal vaccine allocation [Preciado et al. (2013)], identifying critical nodes in the network [Dinh and Thai (2015)], and many others. Yet this task becomes computationally challenging in the face of today’s social networks, which may consist of billions of nodes and edges.

Most of the existing work in network cascades uses stochastic diffusion models and estimates the influence spread through sampling [Kempe et al. (2003), Cohen et al. (2014), Dinh and Thai (2015), Tang et al. (2015), Lucier et al. (2015), Ohsaka et al. (2016)]. The common practice is to use a fixed number of samples, e.g. 10K or 20K [Kempe et al. (2003), Tang et al. (2015), Cohen et al. (2014), Ohsaka et al. (2016)], to estimate the expected size of the cascade, aka influence spread. Not only is there no single sample size that works well for all networks of different sizes and topologies, but those approaches also do not provide any accuracy guarantees. Recently, Lucier et al. [Lucier et al. (2015)] introduced INFEST, the first estimation method that comes with accuracy guarantees. Unfortunately, our experiments suggest that INFEST does not perform well in practice, taking hours on networks with only a few thousand nodes. Is there a rigorous method to estimate the cascade size in billion-scale networks?

Figure 1: Left: the influence spread of nodes under the IC model. The influence of all nodes is roughly the same, even though some nodes are much less influential than others. Right: outward influence better reflects the relative influence of the nodes; the least influential node has the smallest outward influence, while the most influential node’s outward influence is nearly twice that of the others.

In this paper, we investigate efficient estimation methods for nodes’ influence under stochastic cascade models [Daley et al. (2001), Kempe et al. (2003), Du et al. (2013)]. First, we introduce a new influence measure, called outward influence and defined as $\mathbb{I}_{out}(S) = \mathbb{I}(S) - |S|$, where $\mathbb{I}(S)$ denotes the influence spread of a seed set $S$. The new measure excludes the self-influence artifact in influence spread, making it more effective for comparing the relative influence of nodes. As shown in Fig. 1, the influence spreads of the nodes are roughly the same. In contrast, their outward influences differ markedly and correctly reflect the intuition that one node is the least influential while another is nearly twice as influential as the rest.

More importantly, the outward influence measure inspires novel methods, termed SIEA/SOIEA, to estimate influence spread/outward influence at scale and with rigorous theoretical guarantees. Both SOIEA and SIEA guarantee arbitrarily small relative error with high probability within a bounded amount of observed influence. The proposed methods are built on two novel components: 1) IICP, an importance sampling method for outward influence; and 2) RSA, a robust mean estimation method that minimizes the number of samples by analyzing the variance and range of the random variables. IICP focuses only on non-trivial cascades in which at least one node outside the seed set must be activated. As each IICP sample generates a cascade of size at least two and outward influence at least one, it leads to smaller variance and much faster convergence to the mean value. Under the well-known independent cascade model [Kempe et al. (2003)], SOIEA is $\Omega(\log^4 n)$ times faster than the state-of-the-art INFEST [Lucier et al. (2015)] in theory and is four to five orders of magnitude faster than both INFEST and naive Monte-Carlo sampling. For other stochastic models, such as the continuous-time diffusion model [Du et al. (2013)], the LT model [Kempe et al. (2003)], SI, SIR, and variations [Daley et al. (2001)], RSA can be applied directly to estimate the influence spread, given a Monte-Carlo sampling procedure, or, better, with an extension of IICP to the model.

Our contributions are summarized as follows.

  • We introduce a new influence measure, called Outward Influence, which is more effective in differentiating nodes’ influence. We investigate the characteristics of this new measure, including non-monotonicity, submodularity, and #P-hardness of computation.

  • Two fully polynomial time randomized approximation schemes (FPRAS), SIEA and SOIEA, that provide $(\epsilon, \delta)$-approximations of influence spread and outward influence. Particularly, SOIEA, our algorithm to estimate influence spread, is $\Omega(\log^4 n)$ times faster than the state-of-the-art INFEST [Lucier et al. (2015)] in theory and is four to five orders of magnitude faster than both INFEST and naive Monte-Carlo sampling.

  • The robust mean estimation algorithm, termed RSA, a building block of SIEA, can be used to estimate influence spread under other stochastic diffusion models, or, in general, the mean of bounded random variables of unknown distribution. RSA will be our favorite statistical algorithm moving forward.

  • We perform comprehensive experiments on both real-world and synthetic networks with sizes up to 65 million nodes and 1.8 billion edges. Our experiments indicate the superiority of our algorithms in terms of both accuracy and running time in comparison to naive Monte-Carlo and the state-of-the-art methods. The results also give evidence against the practice of using a fixed number of samples to estimate the cascade size. For example, using 10000 samples to estimate the influence can deviate up to 240% from the ground truth in a Twitter subnetwork. In contrast, our algorithm can provide a (pseudo) ground truth with a guaranteed small relative error (e.g. 0.5%). Thus it is a more concrete benchmark tool for research on network cascades.

Organization. The rest of the paper is organized as follows: In Section 2, we introduce the diffusion model and the definition of outward influence with its properties. We propose an FPRAS for outward influence estimation in Section 3. Applications in influence estimation are presented in Section 5 which is followed by the experimental results in Section 6 and conclusion in Section 8. We cover the most recent related work in Section 7.

2 Definitions and Properties

In this section, we will introduce stochastic diffusion models, the new measure of Outward Influence, and showcase its properties under the popular Independent Cascade (IC) model [Kempe et al. (2003)].

Diffusion model. Consider a network abstracted as a graph $G = (V, E)$, where $V$ and $E$ are the sets of nodes and edges, respectively. For example, in a social network, $V$ and $E$ correspond to the set of users and their social relationships, respectively. Assume that there is a cascade starting from a subset of nodes $S \subseteq V$, called the seed set. How the cascade progresses is described by a diffusion model (aka cascade model) that dictates how nodes get activated/influenced. In a stochastic diffusion model, the cascade is dictated by a random vector $\Omega$ in a sample space. Describing the diffusion model is then equivalent to specifying the distribution of $\Omega$.

Let $\mathcal{C}(S)$ be the size of the cascade, i.e., the number of activated nodes in the end. The influence spread of $S$, denoted by $\mathbb{I}(S)$, under diffusion model $\Omega$ is the expected size of the cascade, i.e.,

$$\mathbb{I}(S) = \mathbb{E}_{\Omega}[\mathcal{C}(S)] \qquad (1)$$

For example, we describe below the random vector $\Omega$ and its distribution for the most popular diffusion models.

  • Information diffusion models, e.g. Independent Cascade (IC), Linear Threshold (LT), and the general triggering model [Kempe et al. (2003)]: $\Omega = (X_{uv})_{(u,v) \in E}$, where $X_{uv}$ is a Bernoulli random variable that indicates whether $u$ activates/influences $v$. That is, for a given edge $(u,v)$, $X_{uv} = 1$ if $u$ activates $v$, which happens with probability $w(u,v)$, and 0 otherwise.

  • Epidemic cascading models, e.g., Susceptible-Infected (SI) [Daley et al. (2001), Nguyen et al. (2016)] and its variations: $\Omega = (T_{uv})_{(u,v) \in E}$, where $T_{uv}$ is a random variable following a geometric distribution. $T_{uv}$ indicates how long it takes $u$ to activate $v$ after $u$ is activated.

  • Continuous-time models [Du et al. (2013)]: $\Omega = (T_{uv})_{(u,v) \in E}$, where $T_{uv}$ is a continuous random variable with density function $f_{uv}$. $T_{uv}$ also indicates the transmission time (time until $u$ activates $v$) as in the SI model; however, the transmission times on different edges follow different distributions.

Outward Influence. We introduce the notion of Outward Influence which captures the influence of a subset of nodes towards the rest of the network. Outward influence excludes the self-influence of the seed nodes from the measure.

Definition 1 (Outward Influence)

Given a graph $G$, a set $S \subseteq V$, and a diffusion model $\Omega$, the Outward Influence of $S$, denoted by $\mathbb{I}_{out}(S)$, is

$$\mathbb{I}_{out}(S) = \mathbb{I}(S) - |S| \qquad (2)$$

Thus, the influence and outward influence of a seed set differ exactly by the number of nodes in $S$.

Influence Spread/Outward Influence Estimations. A fundamental task in network science is to estimate the influence of a given seed set $S$. Since the exact computation is #P-hard (Subsection 2.2), we aim for estimation with bounded error.

Definition 2 (Influence Spread Estimation)

Given a graph $G$ and a set $S \subseteq V$, the problem asks for an $(\epsilon, \delta)$-estimate $\hat{\mathbb{I}}(S)$ of the influence spread $\mathbb{I}(S)$, i.e.,

$$\Pr\left[(1-\epsilon)\,\mathbb{I}(S) \le \hat{\mathbb{I}}(S) \le (1+\epsilon)\,\mathbb{I}(S)\right] \ge 1 - \delta \qquad (3)$$

The outward influence estimation problem is stated similarly:

Definition 3 (Outward Influence Estimation)

Given a graph $G$ and a set $S \subseteq V$, the problem asks for an $(\epsilon, \delta)$-estimate $\hat{\mathbb{I}}_{out}(S)$ of the outward influence $\mathbb{I}_{out}(S)$, i.e.,

$$\Pr\left[(1-\epsilon)\,\mathbb{I}_{out}(S) \le \hat{\mathbb{I}}_{out}(S) \le (1+\epsilon)\,\mathbb{I}_{out}(S)\right] \ge 1 - \delta \qquad (4)$$

A common approach for estimation is through generating independent Monte-Carlo samples and taking the average. However, one faces two major challenges:

  • How to achieve the minimum number of samples needed for an $(\epsilon, \delta)$-approximation?

  • How to effectively generate samples with small variance, and, thus, reduce the number of samples?

For simplicity, we focus on the well-known Independent Cascade (IC) model and provide the extension of our approaches to other cascade models in Subsection 5.3.

2.1 Independent Cascade (IC) Model

Given a probabilistic graph $G = (V, E, w)$ in which each edge $(u, v) \in E$ is associated with a number $w(u,v) \in (0, 1]$. $w(u,v)$ indicates the probability that node $u$ will successfully activate $v$ once $u$ is activated. In practice, the probability can be mined from interaction frequency [Kempe et al. (2003), Tang et al. (2014)] or learned from action logs [Goyal et al. (2010)].

Cascading Process. The cascade starts from a subset of nodes $S$, called the seed set, and happens in discrete rounds $t = 0, 1, 2, \ldots$. At round $t = 0$, only the nodes in $S$ are active and the others are inactive. When a node $u$ becomes active, it has a single chance to activate (aka influence) each neighbor $v$ of $u$ with probability $w(u,v)$. An active node remains active until the end of the cascade process, which stops when no more nodes get activated.

Sample Graph. Associate with each edge $(u,v)$ a biased coin that lands heads with probability $w(u,v)$ and tails with probability $1 - w(u,v)$. Deciding the outcome when $u$ attempts to activate $v$ is then equivalent to flipping that coin. If the coin lands heads, the activation attempt succeeds and we call $(u,v)$ a live edge. Since all activations on the edges are independent in the IC model, it does not matter when we flip the coins. That is, we can flip all the coins associated with the edges at the same time instead of waiting until node $u$ becomes active. We call the graph that contains the nodes and all the live edges a sample graph $g$ of $G$.
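The cascading process above can be sketched in a few lines of Python. The function below is an illustration, not the paper’s implementation: `graph` is a hypothetical adjacency map from each node to a list of `(neighbor, probability)` pairs, and coins are flipped lazily during the BFS, which by the independence argument is equivalent to the live-edge (sample graph) view.

```python
import random
from collections import deque

def ic_cascade_size(graph, seeds, rng=random.Random()):
    """Simulate one IC cascade from `seeds` and return the cascade size.

    `graph` maps each node u to a list of (v, p) pairs, where p = w(u, v)
    is the probability that u activates v.  Coins are flipped lazily,
    which is equivalent to flipping all edge coins up front.
    """
    active = set(seeds)
    queue = deque(seeds)
    while queue:
        u = queue.popleft()
        for v, p in graph.get(u, []):
            # u gets a single chance to activate each inactive neighbor v
            if v not in active and rng.random() < p:
                active.add(v)
                queue.append(v)
    return len(active)
```

Averaging this size over many runs is exactly the naive Monte-Carlo estimator of influence spread discussed later.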

Note that the model parameter $\Omega$ for the IC model is a random vector indicating the states of the edges, i.e. live or not. In other words, the sample space of $\Omega$ corresponds to the space of all possible sample graphs of $G$, denoted by $\mathcal{G}$.

Probabilistic Space. The graph $G$ can be seen as a generative model. The set of all sample graphs generated from $G$, together with their probabilities, defines a probabilistic space $\mathcal{G}$. Recall that each sample graph $g$ can be generated by flipping coins on all the edges to determine whether or not each edge is live, i.e., appears in $g$. Each edge $(u,v)$ is present in a sample graph with probability $w(u,v)$. Thus, the probability that a sample graph $g = (V, E_g)$ is generated from $G$ is

$$\Pr[g] = \prod_{(u,v) \in E_g} w(u,v) \prod_{(u,v) \in E \setminus E_g} \left(1 - w(u,v)\right) \qquad (5)$$

Influence Spread and Outward Influence. In a sample graph $g$, let $R_g(S)$ be the set of nodes reachable from $S$ (including the nodes in $S$). The influence spread in Eq. 1 can be rewritten as

$$\mathbb{I}(S) = \sum_{g \in \mathcal{G}} \Pr[g] \cdot |R_g(S)| \qquad (6)$$

and, according to Eq. 2, the outward influence is

$$\mathbb{I}_{out}(S) = \sum_{g \in \mathcal{G}} \Pr[g] \cdot \left(|R_g(S)| - |S|\right) \qquad (7)$$
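For intuition, the two formulas above can be evaluated exactly on a tiny graph by enumerating all $2^m$ sample graphs. The Python sketch below does this; the function name `exact_influence` and its `(u, v, p)` edge-list format are illustrative (not from the paper), and the approach is only feasible for toy instances.

```python
from itertools import product

def exact_influence(edges, seeds):
    """Compute I(S) and I_out(S) = I(S) - |S| exactly by enumerating
    all 2^m sample graphs (Eqs. 5-7).  `edges` is a list of (u, v, p)
    triples; feasible only for tiny graphs.
    """
    seeds = set(seeds)

    def reachable(live):
        # nodes reachable from the seed set over the live edges
        adj = {}
        for u, v in live:
            adj.setdefault(u, []).append(v)
        seen, stack = set(seeds), list(seeds)
        while stack:
            u = stack.pop()
            for v in adj.get(u, []):
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        return seen

    influence = 0.0
    for outcome in product([True, False], repeat=len(edges)):
        prob, live = 1.0, []
        for (u, v, p), is_live in zip(edges, outcome):
            prob *= p if is_live else (1.0 - p)
            if is_live:
                live.append((u, v))
        influence += prob * len(reachable(live))   # Pr[g] * |R_g(S)|
    return influence, influence - len(seeds)
```

For a single edge $(a, b)$ with $w(a,b) = 0.5$ and seed $\{a\}$, this gives $\mathbb{I}(S) = 1.5$ and $\mathbb{I}_{out}(S) = 0.5$.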

2.2 Outward Influence under IC

We show the properties of outward influence under the IC model.

Better Influence Discrepancy. As illustrated in Fig. 1, eliminating the nominal constant $|S|$ helps to differentiate the “actual influence” of the seed nodes on the other nodes in the network. In the extreme case, the ratio between the influence spreads of two nodes can approach one, suggesting that they have the same influence, while the outward influence correctly captures the fact that one node can influence roughly twice as many nodes as the other.

Non-monotonicity. Outward influence as a function of the seed set is non-monotone, in contrast to influence spread. In Figure 1, adding a node to the seed set decreases the outward influence. In general, adding nodes to the seed set may increase or decrease the outward influence.

Submodularity. A submodular function expresses the diminishing-returns behavior of set functions and is suitable for many applications, including approximation algorithms and machine learning. If $V$ is a finite set, a submodular function is a set function $f: 2^V \rightarrow \mathbb{R}$, where $2^V$ denotes the power set of $V$, which satisfies that for every $X \subseteq Y \subseteq V$ and every $x \in V \setminus Y$,

$$f(X \cup \{x\}) - f(X) \ge f(Y \cup \{x\}) - f(Y) \qquad (8)$$

Similar to influence spread, outward influence, as a function of the seed set $S$, is also submodular.

Lemma 1

Given a network $G$, the outward influence function $\mathbb{I}_{out}(S)$, for $S \subseteq V$, is a submodular function.

2.3 Hardness of Computation

If we can compute the outward influence of $S$, the influence spread of $S$ can be obtained by adding $|S|$ to it. Since computing influence spread is #P-hard [Chen et al. (2010)], it is no surprise that computing outward influence is also #P-hard.

Lemma 2

Given a probabilistic graph $G$ and a seed set $S$, it is #P-hard to compute $\mathbb{I}_{out}(S)$.

However, while influence spread is lower-bounded by one, the outward influence of a set can be arbitrarily small (or even zero). In the example in Figure 1, a node’s outward influence can be made exponentially small by choosing sufficiently small edge probabilities, even though its influence spread stays at least one. This makes estimating outward influence challenging, as the number of samples needed to estimate the mean of random variables is inversely proportional to the mean.

Monte-Carlo estimation. A typical approach to obtain an $(\epsilon, \delta)$-approximation of the mean of a random variable is Monte-Carlo estimation: taking the average over many samples of that random variable. Through Bernstein’s inequality [Dagum et al. (2000)], we have the following lemma:

Lemma 3

Given a set of i.i.d. random variables $X_1, X_2, \ldots$ having a common mean $\mu$, there exists a Monte-Carlo estimation which gives an $(\epsilon, \delta)$-approximate of the mean and uses $O\left(\frac{b \ln(2/\delta)}{\epsilon^2 \mu}\right)$ random variables, where $b$ is an upper bound on the variables, i.e. $X_i \le b$.

To estimate the influence spread $\mathbb{I}(S)$, existing work often simulates the cascade process using a BFS-like procedure and takes the average of the cascades’ sizes as the influence spread. By Lemma 3 with $b = n$, the number of samples needed to obtain an $(\epsilon, \delta)$-approximation is $O\left(\frac{n \ln(2/\delta)}{\epsilon^2 \mathbb{I}(S)}\right)$. Since $\mathbb{I}(S) \ge 1$, in the worst case we need only a polynomial number of samples, $O\left(\frac{n \ln(2/\delta)}{\epsilon^2}\right)$.
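As a concrete reading of this bound, the sufficient sample size can be computed from the range upper bound $b$, a lower bound on the mean, and the precision parameters. The helper below is illustrative only: the leading constant $(2+\epsilon)$ follows a standard Chernoff-style bound and is not necessarily the exact constant behind Lemma 3.

```python
import math

def mc_samples_needed(b, mu_lower, eps, delta):
    """Number of i.i.d. samples in [0, b] sufficient for an
    (eps, delta)-estimate of a mean known to be at least mu_lower.
    Chernoff-style form; the constant (2 + eps) is illustrative.
    """
    return math.ceil((2 + eps) * b * math.log(2.0 / delta)
                     / (eps ** 2 * mu_lower))
```

For influence spread we can take $b = n$ and $\mu \ge 1$, which gives the polynomial bound above; for outward influence no such lower bound on the mean exists, which is exactly the difficulty discussed next.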

Unfortunately, the same argument does not apply to $\mathbb{I}_{out}(S)$, since it can be arbitrarily close to zero. For the same reason, the recent advances in influence estimation in [Borgs et al. (2014), Lucier et al. (2015)] cannot be adapted to obtain a polynomial-time algorithm that computes an $(\epsilon, \delta)$-approximation (aka an FPRAS) for outward influence. We shall address this challenging task in the next section.

We summarize the frequently used notations in Table 1.

Notation    Description
$n$, $m$    #nodes, #edges of graph $G$.
$\mathbb{I}(S)$    Influence spread of seed set $S$.
$\mathbb{I}_{out}(S)$    Outward influence of seed set $S$.
$N^{out}(S)$    The set of out-neighbors of $S$.
$\beta_u$    The probability that $u \in N^{out}(S)$ is activated directly by $S$.
$\mathcal{E}_i$    The event that $u_i$ is active and $u_1, \ldots, u_{i-1}$ are not active after round 1.
$\Gamma$    The probability of a non-trivial cascade, $\Gamma = 1 - \prod_{u \in N^{out}(S)} (1 - \beta_u)$.
Table 1: Table of notations

3 Outward Influence Estimation via Importance Sampling

We propose a Fully Polynomial Randomized Approximation Scheme (FPRAS) to estimate the outward influence of a given set $S$. Given two precision parameters $\epsilon, \delta \in (0, 1)$, our FPRAS algorithm guarantees to return an $(\epsilon, \delta)$-approximate $\hat{\mathbb{I}}_{out}(S)$ of the outward influence $\mathbb{I}_{out}(S)$,

$$\Pr\left[(1-\epsilon)\,\mathbb{I}_{out}(S) \le \hat{\mathbb{I}}_{out}(S) \le (1+\epsilon)\,\mathbb{I}_{out}(S)\right] \ge 1 - \delta \qquad (9)$$

General idea. Our starting point is the observation that a cascade triggered by a seed set with small influence spread often stops right at round 1. The probability of such cascades, termed trivial cascades, can be computed exactly. Thus, if we sample only the non-trivial cascades, we obtain a better sampling method to estimate the outward influence: the “outward influence” associated with a non-trivial cascade is lower-bounded by one, so we can again apply the argument from the previous section to obtain a polynomial number of samples.

Given a graph and a seed set , we introduce our importance sampling strategy to generate such non-trivial cascades. It consists of two stages:

  1. Guarantee that at least one neighbor of $S$ will be activated through a biased selection towards the cascades with at least one active node outside of $S$, and

  2. Continue to simulate the cascade using the standard procedure following the diffusion model.

This importance sampling strategy is general and applies to different diffusion models. In the following, we illustrate our importance sampling under the IC model.

3.1 Importance IC Polling

We propose Importance IC Polling (IICP) to sample non-trivial cascades in Algorithm 2.

Figure 2: Neighbors of nodes in $S$

First, we “merge” all the nodes in $S$ and define a “unified neighborhood” of $S$. Specifically, let $N^{out}(S)$ be the set of out-neighbors of $S$ excluding the nodes in $S$ itself. For each $u \in N^{out}(S)$, let

$$\beta_u = 1 - \prod_{v \in S,\, (v,u) \in E} \left(1 - w(v,u)\right) \qquad (10)$$

be the probability that $u$ is activated directly by one (or more) node(s) in $S$. Without loss of generality, assume that $\beta_u < 1$ (otherwise, we simply add $u$ into $S$).

Assume an order on the neighborhood of $S$, that is

$$N^{out}(S) = \{u_1, u_2, \ldots, u_d\},$$

where $d = |N^{out}(S)|$. For each $i$, let $\mathcal{E}_i$ be the event that $u_i$ is the “first” node that gets activated directly by $S$, i.e., $u_i$ is active and $u_1, \ldots, u_{i-1}$ are not active after round 1.

The probability of $\mathcal{E}_i$ is

$$\Pr[\mathcal{E}_i] = \beta_{u_i} \prod_{j=1}^{i-1} \left(1 - \beta_{u_j}\right) \qquad (11)$$

For consistency, we also denote by $\mathcal{E}_0$ the event that none of the neighbors is activated, i.e.,

$$\Pr[\mathcal{E}_0] = \prod_{j=1}^{d} \left(1 - \beta_{u_j}\right) \qquad (12)$$

Note that $\mathcal{E}_0$ is also the event that the cascade stops right at round 1. Such a cascade is termed a trivial cascade. As we can compute the probability of trivial cascades exactly, we do not need to sample those cascades but can focus only on the non-trivial ones.

Denote by $\Gamma$ the probability of having at least one node among $u_1, \ldots, u_d$ activated by $S$, i.e.,

$$\Gamma = 1 - \Pr[\mathcal{E}_0] = 1 - \prod_{j=1}^{d} \left(1 - \beta_{u_j}\right) \qquad (13)$$
Algorithm 2: IICP - Importance IC Polling
Input: a graph $G$ and a seed set $S$
Output: $Y$ - the outer size of a random non-trivial cascade from $S$
Stage 1: sample non-trivial neighbors of $S$
  Precompute $\Pr[\mathcal{E}_1], \ldots, \Pr[\mathcal{E}_d]$ using Eq. 11 and Eq. 12
  Select one neighbor $u_i$ among $u_1, \ldots, u_d$, the probability of selecting $u_i$ being $\Pr[\mathcal{E}_i] / \Gamma$
  Queue $Q \leftarrow \{u_i\}$; $Y \leftarrow 1$; mark $u_i$ and all nodes in $S$ visited
  for $j = i+1, \ldots, d$ do
    with probability $\beta_{u_j}$: add $u_j$ into $Q$; $Y \leftarrow Y + 1$; mark $u_j$ visited
Stage 2: sample from newly influenced nodes
  while $Q$ is non-empty do
    $u \leftarrow Q$.pop()
    for each unvisited out-neighbor $v$ of $u$ do
      with probability $w(u,v)$: add $v$ to $Q$; $Y \leftarrow Y + 1$; mark $v$ visited
return $Y$
We now explain the details of the Importance IC Polling algorithm (IICP), summarized in Alg. 2. The algorithm outputs the size of the cascade minus the seed set size; we term this output the outer size of the cascade. The algorithm consists of two stages.

Stage 1. By definition, the events $\mathcal{E}_0, \mathcal{E}_1, \ldots, \mathcal{E}_d$ are disjoint and form a partition of the sample space. To generate a non-trivial cascade, we first select the “first” activated neighbor in round 1 with probability $\Pr[\mathcal{E}_i] / \Gamma$ (excluding $\mathcal{E}_0$). This guarantees that at least one of the neighbors of $S$ will be activated. Let $u_i$ be the selected node; after the first round, $u_i$ becomes active and $u_1, \ldots, u_{i-1}$ remain inactive. The nodes among $u_{i+1}, \ldots, u_d$ are then activated independently with probabilities $\beta_{u_j}$ (Eq. 10). Stage 2. After the first stage of sampling neighbors of $S$, we get a non-trivial set of nodes directly influenced by $S$. For each of those nodes and each later influenced node, we sample a set of its neighbors by the naive BFS-like IC polling scheme [Kempe et al. (2003)]: when sampling the neighbors of a newly influenced node $u$, each unvisited neighbor $v$ is influenced by $u$ with probability $w(u,v)$. The neighbors of those influenced nodes are next to be sampled in the same fashion. In addition, we keep track of the newly influenced nodes using a queue $Q$ and count the number of active nodes outside $S$ using $Y$.
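The two stages above can be sketched in Python as follows. This is an illustrative sketch, not the authors’ implementation: `graph` is a hypothetical adjacency map `u -> [(v, p)]`, the neighbor order $u_1, \ldots, u_d$ is fixed arbitrarily by sorting, and the function also returns $\Gamma$ so that $\Gamma \cdot \bar{Y}$ estimates the outward influence.

```python
import random
from collections import deque

def iicp(graph, seeds, rng=random.Random()):
    """Importance IC Polling (IICP) sketch.  Returns (Y, gamma): the
    outer size Y of one random non-trivial cascade, and the probability
    gamma that a cascade is non-trivial (Eq. 13)."""
    seeds = set(seeds)
    # beta[u]: probability u is activated directly by at least one seed
    beta = {}
    for s in seeds:
        for v, p in graph.get(s, []):
            if v not in seeds:
                beta[v] = 1.0 - (1.0 - beta.get(v, 0.0)) * (1.0 - p)
    order = sorted(beta)                    # fix an order u_1, ..., u_d
    gamma = 1.0
    for v in order:
        gamma *= 1.0 - beta[v]
    gamma = 1.0 - gamma
    if gamma == 0.0:                        # S has no out-neighbors
        return 0, 0.0

    # Stage 1: pick the "first" activated neighbor u_i with probability
    # Pr[E_i] / gamma = beta_i * prod_{j<i} (1 - beta_j) / gamma  (Eq. 11)
    r = rng.random() * gamma
    acc, prefix, first = 0.0, 1.0, len(order) - 1
    for i, v in enumerate(order):
        acc += beta[v] * prefix
        prefix *= 1.0 - beta[v]
        if r < acc:
            first = i
            break
    active = set(seeds) | {order[first]}
    queue = deque([order[first]])
    for v in order[first + 1:]:             # later neighbors: fair coins
        if rng.random() < beta[v]:
            active.add(v)
            queue.append(v)

    # Stage 2: standard BFS-like IC polling from newly activated nodes
    while queue:
        u = queue.popleft()
        for v, p in graph.get(u, []):
            if v not in active and rng.random() < p:
                active.add(v)
                queue.append(v)
    return len(active) - len(seeds), gamma
```

Per Lemma 4 below, a naive Monte-Carlo estimate of the outward influence is then `gamma * (sum of returned sizes) / (number of samples)`.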

The following lemma shows how to estimate the (expected) cascade size through the (expected) outer size of non-trivial cascades.

Lemma 4

Given a seed set $S$, let $Y$ be the random variable associated with the output of the IICP algorithm. The following property holds:

$$\mathbb{I}_{out}(S) = \Gamma \cdot \mathbb{E}[Y]$$

Further, IICP samples from the probability space of non-trivial cascades: the probability of each non-trivial cascade (and hence of each outer size) equals its original probability conditioned on the cascade being non-trivial, i.e., divided by $\Gamma$.

3.2 FPRAS for Outward Influence Estimation

From Lemma 4, we can obtain an estimate of $\mathbb{I}_{out}(S)$ from an estimate $\hat{\mathbb{E}}[Y]$ of $\mathbb{E}[Y]$ by

$$\hat{\mathbb{I}}_{out}(S) = \Gamma \cdot \hat{\mathbb{E}}[Y] \qquad (14)$$

Thus, finding an $(\epsilon, \delta)$-approximation of $\mathbb{I}_{out}(S)$ is equivalent to finding an $(\epsilon, \delta)$-approximation of $\mathbb{E}[Y]$.

The advantage of this approach is that estimating $\mathbb{E}[Y]$, where the random variable $Y$ has value at least 1, requires only a polynomial number of samples; the same argument on the number of samples to estimate influence spread in Subsection 2.3 applies. Let $Y_1, Y_2, \ldots$ be the random variables denoting the outputs of IICP. We can apply Lemma 3 on this set of random variables, which satisfy $1 \le Y_i \le n - |S|$. Since each random variable is at least 1, $\mathbb{E}[Y] \ge 1$, and we need at most a polynomial number of random variables for the Monte-Carlo estimation. Since IICP has a worst-case time complexity of $O(n + m)$, the Monte-Carlo estimation using IICP is an FPRAS for estimating outward influence.

Theorem 1. Given arbitrary $\epsilon, \delta \in (0, 1)$ and a set $S$, the Monte-Carlo estimation using IICP returns an $(\epsilon, \delta)$-approximation of $\mathbb{I}_{out}(S)$ using $O\left(\frac{(n - |S|) \ln(2/\delta)}{\epsilon^2\, \mathbb{E}[Y]}\right)$ samples.

In Section 5, we will show that both outward influence and influence spread can be estimated by a more powerful algorithm that saves a substantial factor in the number of random variables compared to this FPRAS estimation. The algorithm is built upon our mean estimation algorithms for bounded random variables, proposed in the following section.

4 Efficient Mean Estimation for Bounded Random Variables

In this section, we propose an efficient mean estimation algorithm for bounded random variables. This is the core of our algorithms for accurately and efficiently estimating the outward influence and influence spread in Section 5.

We first propose an ‘intermediate’ algorithm: the Generalized Stopping Rule Estimation (GSRA), which relies on a simple stopping rule and returns an $(\epsilon, \delta)$-approximate of the mean of lower-bounded random variables. GSRA simultaneously generalizes and fixes the error of the Stopping Rule Algorithm [Dagum et al. (2000)], which only aims to estimate the mean of i.i.d. random variables in $[0, 1]$ and has a technical error in its proof.

The main mean estimation algorithm, namely Robust Sampling Algorithm (RSA) presented in Alg. 5, effectively takes into account both mean and variance of the random variables. It uses GSRA as a subroutine to estimate the mean value and variance at different granularity levels.

4.1 Generalized Stopping Rule Algorithm

We aim at obtaining an $(\epsilon, \delta)$-approximate of the common mean $\mu$ of random variables $X_1, X_2, \ldots$. Specifically, the random variables are required to satisfy the following conditions:

  • $a \le X_i \le b$ and $\mathbb{E}[X_i] = \mu$ for all $i$,

where $a, b$ are fixed constants and the common mean $\mu > 0$ is unknown.

Our algorithm generalizes the stopping rule estimation in [Dagum et al. (2000)], which provides an estimation of the mean of i.i.d. random variables in $[0, 1]$. The notable differences are the following:

  • We discover and amend an error in the stopping rule algorithm in [Dagum et al. (2000)]: the number of samples drawn by that algorithm may not be sufficient to guarantee the $(\epsilon, \delta)$-approximation.

  • We allow estimating the mean of random variables that are possibly dependent and/or with different distributions. Our algorithm works as long as the random variables have the same means. In contrast, the algorithm in [Dagum et al. (2000)] can only be applied for i.i.d random variables.

  • Our proposed algorithm obtains an unbiased estimator of the mean, i.e. $\mathbb{E}[\hat{\mu}] = \mu$, while [Dagum et al. (2000)] returns a biased one.

  • Our algorithm is faster than the one in [Dagum et al. (2000)] whenever the lower bound $a$ of the random variables is strictly positive.

Algorithm 3: Generalized Stopping Rule Algorithm (GSRA)
Input: random variables $X_1, X_2, \ldots$ in $[a, b]$ and $\epsilon, \delta$
Output: an $(\epsilon, \delta)$-approximate $\hat{\mu}$ of $\mu$
  Compute the stopping threshold $\Upsilon$ from $\epsilon$, $\delta$, and the range $[a, b]$
  Initialize $S \leftarrow 0$; $N \leftarrow 0$
  while $S < \Upsilon$ do
    $N \leftarrow N + 1$; $S \leftarrow S + X_N$
  return $\hat{\mu} = S / N$

Our Generalized Stopping Rule Algorithm (GSRA) is described in detail in Alg. 4.1. The algorithm contains two main steps: 1) compute the stopping threshold $\Upsilon$ (Line 2), which relies on the value computed from the given precision parameters and the range of the random variables; 2) consecutively acquire the random variables until the sum of their outcomes exceeds $\Upsilon$ (Lines 4-5). Finally, it returns the average of the outcomes (Line 6) as an estimate $\hat{\mu}$ of the mean $\mu$. Notice that $\Upsilon$ in GSRA depends on the range $[a, b]$, and thus getting tighter bounds on the range of the random variables holds a key to the efficiency of GSRA from an application perspective. The approximation guarantee and the number of necessary samples are stated in the following theorem.

Theorem 2. The Generalized Stopping Rule Algorithm (GSRA) returns an $(\epsilon, \delta)$-approximate $\hat{\mu}$ of $\mu$, i.e.,

$$\Pr\left[(1-\epsilon)\mu \le \hat{\mu} \le (1+\epsilon)\mu\right] \ge 1 - \delta \qquad (15)$$

and the number $N$ of samples satisfies $\mathbb{E}[N] = O(\Upsilon / \mu)$. (16)
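A minimal Python sketch of the stopping-rule idea follows. The threshold below uses the classic Dagum et al. constant for variables in $[0, b]$; the paper’s GSRA tightens this constant (to repair the original proof) and additionally exploits the lower bound $a$, both of which are omitted here for simplicity, so treat this as an illustration rather than the exact algorithm.

```python
import math

def stopping_rule_mean(sample, eps, delta, b=1.0):
    """Stopping-rule mean estimator sketch: draw i.i.d. samples X_i in
    [0, b] via `sample()` until their sum crosses a threshold, then
    return the sample mean.  Threshold constant follows the classic
    Dagum et al. stopping rule and is illustrative.
    """
    upsilon = b * (1 + (1 + eps) * 4 * (math.e - 2)
                   * math.log(2.0 / delta) / eps ** 2)
    total, n = 0.0, 0
    while total < upsilon:
        total += sample()
        n += 1
    return total / n        # unbiased-style average of the outcomes
```

Note how the number of drawn samples adapts to the unknown mean: the smaller the mean, the longer it takes the running sum to cross the threshold.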

The hole in the Stopping Rule Algorithm in [Dagum et al. (2000)]. The estimation algorithm in [Dagum et al. (2000)] for estimating the mean of random variables in range $[0, 1]$ is also based on a stopping rule condition, like our GSRA. It computes a threshold

$$\Upsilon_1 = 1 + (1+\epsilon) \frac{4(e-2)\ln(2/\delta)}{\epsilon^2} \qquad (17)$$

where $e$ is the base of the natural logarithm, and generates samples until $\sum_i X_i \ge \Upsilon_1$. The algorithm returns $\Upsilon_1 / N$ as a biased estimate of $\mu$.

Unfortunately, the threshold to determine the stopping time does not completely account for the fact that the necessary number of samples should go over the expected one in order to provide high solution guarantees. This actually causes a flaw in their later proof of the correctness.

To amend the algorithm, we slightly strengthen the stopping condition by using a slightly smaller error parameter in the formula of the threshold (Line 2, Alg. 4.1). The number of samples, in comparison to that of the stopping rule algorithm in [Dagum et al. (2000)], thereby increases by at most a constant factor.

Benefit of considering the lower bound $a$. By dividing the random variables by $b$, one can apply the stopping rule algorithm in [Dagum et al. (2000)] to the normalized random variables in $[0, 1]$; the corresponding threshold then becomes

$$b \cdot \Upsilon_1 \qquad (18)$$

The threshold in our proposed algorithm is smaller by a multiplicative factor that grows with the lower bound $a$, and thus GSRA is faster than the algorithm in [Dagum et al. (2000)] by that factor on average. Note that in the case of estimating influence, the cascade size is lower-bounded by $|S| \ge 1$. Compared to applying [Dagum et al. (2000)] directly, our GSRA algorithm saves a corresponding factor of generated samples.

Martingale theory to cope with weakly-dependent random variables. To prove Theorem 2, we need a stronger Chernoff-like bound to deal with general random variables in range $[a, b]$, presented in the following.

Define the random variables $Z_j = \sum_{i=1}^{j} (X_i - \mu)$, with $Z_0 = 0$. The random variables $Z_0, Z_1, \ldots$ form a martingale [Mitzenmacher and Upfal (2005)], since

$$\mathbb{E}[Z_j \mid Z_1, \ldots, Z_{j-1}] = Z_{j-1}$$

Then, we can apply the following lemma from [Chung and Lu (2006)]:

Lemma 5

Let $Z_0, Z_1, \ldots, Z_n$ be a martingale such that $|Z_i - Z_{i-1}| \le M$ for all $i$, and

$$\mathrm{Var}[Z_i \mid Z_1, \ldots, Z_{i-1}] \le \sigma_i^2 \qquad (19)$$

Then, for any $\lambda > 0$,

$$\Pr[Z_n - Z_0 \ge \lambda] \le \exp\left(-\frac{\lambda^2}{2\left(\sum_{i=1}^{n} \sigma_i^2 + M\lambda/3\right)}\right) \qquad (20)$$

In our case, we have $Z_i - Z_{i-1} = X_i - \mu$, so $|Z_i - Z_{i-1}| \le b - a$ and $\mathrm{Var}[Z_i \mid Z_1, \ldots, Z_{i-1}] \le \sigma^2$, the common variance of the $X_i$. Applying Lemma 5 with $M = b - a$ and $\lambda = \epsilon \mu n$, we have

$$\Pr\left[\sum_{i=1}^{n} X_i - n\mu \ge \epsilon \mu n\right] \le \exp\left(-\frac{\epsilon^2 \mu^2 n^2}{2\left(n\sigma^2 + (b-a)\epsilon\mu n / 3\right)}\right) \qquad (21)$$

Then, since $\sigma^2 \le \mu(b - \mu)$ (Bernoulli-like random variables with the same mean have the maximum variance), we also obtain

$$\Pr\left[\sum_{i=1}^{n} X_i - n\mu \ge \epsilon \mu n\right] \le \exp\left(-\frac{\epsilon^2 \mu n}{2\left(b - \mu + (b-a)\epsilon / 3\right)}\right) \qquad (22)$$

Similarly, $-Z_0, -Z_1, \ldots$ also form a martingale, and applying Lemma 5 to it gives the analogous lower-tail inequality,

$$\Pr\left[\sum_{i=1}^{n} X_i - n\mu \le -\epsilon \mu n\right] \le \exp\left(-\frac{\epsilon^2 \mu n}{2\left(b - \mu + (b-a)\epsilon / 3\right)}\right) \qquad (23)$$
Algorithm 5: Robust Sampling Algorithm (RSA)
Input: two streams of i.i.d. random variables $X_1, X_2, \ldots$ and $X'_1, X'_2, \ldots$, and $\epsilon, \delta$
Output: an $(\epsilon, \delta)$-approximate $\hat{\mu}$ of $\mu$
Step 1: obtain a rough estimate of the mean
  if $\epsilon$ is at least a constant then return GSRA($X_1, X_2, \ldots; \epsilon, \delta$)
  $\hat{\mu}_1 \leftarrow$ GSRA($X_1, X_2, \ldots; \sqrt{\epsilon}, \delta$)
Step 2: estimate the variance  ($\Upsilon$ is defined as in Alg. 4.1)
  $N_2 \leftarrow$ number of samples computed from $\hat{\mu}_1$ and $\Upsilon$
  $\hat{\rho} \leftarrow \max\left\{\frac{1}{N_2}\sum_{i=1}^{N_2} (X'_{2i-1} - X'_{2i})^2 / 2,\; \epsilon \hat{\mu}_1\right\}$
Step 3: estimate the mean
  $N_3 \leftarrow$ number of samples computed from $\hat{\mu}_1$ and $\hat{\rho}$
  return $\hat{\mu} \leftarrow \frac{1}{N_3}\sum_{i=1}^{N_3} X_i$

4.2 Robust Sampling Algorithm

Our previously proposed GSRA algorithm may be inefficient when estimating the means of random variables with small variances. An important tool that we rely on to prove the approximation guarantee in GSRA is the Chernoff-like bound in Eq. 22 and Eq. 23. However, from the inequality in Eq. 21, we can also derive the following stronger inequality,

$$\Pr\left[\sum_{i=1}^{n} X_i - n\mu \ge \epsilon \mu n\right] \le \exp\left(-\frac{\epsilon^2 \mu^2 n}{2\left(\sigma^2 + (b-a)\epsilon\mu / 3\right)}\right) \qquad (24)$$

In many cases, the random variables have small variances, and hence $\sigma^2 \ll \mu(b - \mu)$. Thus, Eq. 24 is much stronger than Eq. 22 and can save a large factor in the required observed influence, which translates into the sample requirement. However, neither the mean nor the variance is available in advance.

To achieve a robust sampling algorithm in terms of sample complexity, we adopt and improve the algorithm in [Dagum et al. (2000)] for general classes of random variables. The Robust Sampling Algorithm (RSA) estimates both the mean and the variance in three steps: 1) roughly estimate the mean value with a larger error ($\sqrt{\epsilon}$ or a constant); 2) use the estimated mean value to compute the number of samples necessary for estimating the variance; 3) use both the estimated mean and variance to refine the number of samples required to estimate the mean value with the desired error $\epsilon$.

Let $X_1, X_2, \ldots$ and $X'_1, X'_2, \ldots$ be two streams of i.i.d. random variables. Our Robust Sampling Algorithm (RSA) is described in Alg. 5. It consists of three main steps:

  • If $\epsilon$ is large (at least a constant), run GSRA with parameters $\epsilon, \delta$ and return the result (Lines 1-2). Otherwise, use the Generalized Stopping Rule Algorithm (Alg. 4.1) to obtain a rough estimate $\hat{\mu}_1$ with the larger error parameter $\sqrt{\epsilon}$ (Line 3).

  • Use the estimate $\hat{\mu}_1$ from step 1 to compute the number of samples, $N_2$, necessary to estimate the variance $\hat{\rho}$ of the random variables. Note that this estimation uses the second stream of samples, $X'_1, X'_2, \ldots$

  • Use both $\hat{\mu}_1$ from step 1 and $\hat{\rho}$ from step 2 to compute the final number of samples, $N_3$, needed to approximate the mean $\mu$. Note that this step uses the same stream of samples as the first step.

The numbers of samples used in the first two steps are always within a constant factor of the minimum number of samples achievable using the variance. This is because the first step takes the error parameter $\sqrt{\epsilon}$, which is larger than $\epsilon$, and the second step uses only $N_2$ samples.

At the end, the algorithm returns the estimate $\hat{\mu}$, which is the average over the $N_3$ samples. The estimation guarantees are stated in the following theorem.
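The three steps above can be sketched as follows. This is an illustration under stated assumptions: the constants are the classic ones from Dagum et al.’s AA-style analysis for variables in $[0, 1]$, not the paper’s exact values, and step 1 inlines a simple stopping rule in place of the full GSRA subroutine.

```python
import math

def rsa(sample1, sample2, eps, delta):
    """Robust Sampling Algorithm (RSA) sketch: (1) roughly estimate the
    mean, (2) estimate the variance, (3) use both to pick the final
    sample size.  `sample1`/`sample2` draw from two independent streams
    of i.i.d. variables in [0, 1].  Constants are illustrative.
    """
    ups = 4 * (math.e - 2) * math.log(2.0 / delta) / eps ** 2
    ups2 = 2 * (1 + math.sqrt(eps)) * (1 + 2 * math.sqrt(eps)) \
        * (1 + math.log(1.5) / math.log(2.0 / delta)) * ups

    # Step 1: rough mean estimate with the larger error sqrt(eps),
    # via a simple stopping rule (GSRA stand-in)
    e1 = min(0.5, math.sqrt(eps))
    threshold = 1 + (1 + e1) * 4 * (math.e - 2) \
        * math.log(2.0 / delta) / e1 ** 2
    total, n = 0.0, 0
    while total < threshold:
        total += sample1()
        n += 1
    mu1 = total / n

    # Step 2: estimate the variance from paired differences on stream 2
    n2 = int(math.ceil(ups2 * eps / mu1))
    rho = max(sum((sample2() - sample2()) ** 2 / 2 for _ in range(n2)) / n2,
              eps * mu1)

    # Step 3: final sample size driven by both the mean and the variance
    n3 = int(math.ceil(ups2 * rho / mu1 ** 2))
    return sum(sample1() for _ in range(n3)) / n3
```

When the variance is small, `rho` bottoms out at `eps * mu1` and the final sample size `n3` shrinks accordingly, which is exactly the saving promised by Eq. 24.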

Theorem 3. Let $Z$ be the probability distribution that $X_1, X_2, \ldots$ and $X'_1, X'_2, \ldots$ are drawn from. Let $\hat{\mu}$ be the estimate of $\mu$ returned by Alg. 5 and $N$ be the number of samples drawn by Alg. 5 with respect to $Z$. We have:

  • $\Pr\left[(1-\epsilon)\mu \le \hat{\mu} \le (1+\epsilon)\mu\right] \ge 1 - \delta$,

  • There is a universal constant $c$ such that

    $$\Pr[N \ge c \cdot \Lambda] \le \delta \qquad (25)$$

    where $\Lambda = \max\{\sigma^2, \epsilon\mu\} \cdot \Upsilon / \mu^2$.

Compared to the algorithm in [Dagum et al. (2000)], first of all, we replace their stopping rule algorithm with GSRA; we also change the computation of the sample sizes, which are always smaller than those of [Dagum et al. (2000)] by a factor depending on the lower bound $a$ whenever $a > 0$.

5 Influence Estimation at Scale

This section applies our RSA algorithm to estimate both the outward influence and the traditional influence spread.

5.1 Outward Influence Estimation

We directly apply the RSA algorithm to two streams of i.i.d. random variables $Y_1, Y_2, \ldots$ and $Y'_1, Y'_2, \ldots$, which are generated by the IICP sampling algorithm, with the precision parameters $\epsilon, \delta$.

The algorithm, called the Scalable Outward Influence Estimation Algorithm (SOIEA) and presented in Alg. 5.1, generates two streams of random variables by IICP (Line 1) and applies the RSA algorithm to these two streams (Line 2). Note that the outward influence estimate is achieved by scaling the RSA estimate by $\Gamma$ (Lemma 4).

Algorithm 6: SOIEA - estimating outward influence
Input: a probabilistic graph $G$, a set $S$, and $\epsilon, \delta$
Output: $\hat{\mathbb{I}}_{out}(S)$ - an $(\epsilon, \delta)$-estimate of $\mathbb{I}_{out}(S)$
  Generate two streams of i.i.d. random variables $Y_1, Y_2, \ldots$ and $Y'_1, Y'_2, \ldots$ by the IICP algorithm
  return $\hat{\mathbb{I}}_{out}(S) = \Gamma \cdot$ RSA($Y_1, Y_2, \ldots; Y'_1, Y'_2, \ldots; \epsilon, \delta$)

We obtain the following theoretical result by combining Theorem 3 for RSA with the properties of IICP samples.

Theorem 4. The SOIEA algorithm gives an $(\epsilon, \delta)$-approximate outward influence estimation. The observed outward influence (the sum of the $Y_i$) and the number of generated random variables are bounded in terms of the mean and variance of $Y$, as in Theorem 3.


5.2 Influence Spread Estimation

Not only is the concept of outward influence helpful in discriminating the relative influence of nodes, but its sampling technique, IICP, also helps scale up influence spread estimation (IE) to billion-scale networks.

Naive approach. A naive approach is to 1) obtain an $(\epsilon, \delta)$-approximation of