Information diffusion epidemics in social networks

José Luis Iribarren IBM Corporation, ibm.com e-Relationship Marketing Europe, E-28002, Madrid, Spain    Esteban Moro Grupo Interdisciplinar de Sistemas Complejos (GISC) and Departamento de Matemáticas, Universidad Carlos III de Madrid, E-28911, Leganés (Madrid), Spain
July 8, 2019

Abstract: The dynamics of information dissemination in social networks is of paramount importance in processes such as rumor or fad propagation yamir (), the spread of product innovations valente () or “word-of-mouth” communications womma (); buzzbuzz (). Because it is difficult to track a specific piece of information as it is transmitted by people, most understanding of information spreading in social networks comes from models golden () or indirect measurements motion (). Here we present an integrated experimental and theoretical framework to understand and quantitatively predict how and when information spreads over social networks. Using data collected in Viral Marketing campaigns jurvetson () that reached over 31,000 individuals in eleven European markets, we show the large degree of variability of the participants’ actions, despite them being confronted with the common task of receiving and forwarding the same piece of information. Specifically, we observe large heterogeneity in both the number of recommendations made by individuals and the time they take to transmit the information. Both have a profound effect on information diffusion. Firstly, most of the transmission takes place due to super-spreading events which would be considered extraordinary in population-average models. Secondly, due to the different ways individuals schedule information transmission barabasinature (); telephone (); blogs (), we observe a slowing down of the spreading of information in social networks that happens in logarithmic time. A quantitative description of the experiments is possible through a stochastic branching process branching () which corroborates the importance of heterogeneity.
The fact that both the intensity and frequency of human responses also show large degrees of heterogeneity in many other activities pitkow (); tipping (); sexual () suggests that our findings are pertinent to many other human-driven diffusion processes like rumors, fads, innovations or news, which has important consequences for organizational management, communications, marketing or electronic social communities.

Each day, millions of conversations, e-mails, SMS, blog comments, instant messages or web pages containing various types of information are exchanged between people. Humans behave in a viral fashion, having a natural inclination to share information so as to gain reputation, trustworthiness or money. This “word-of-mouth” (WOM) dissemination of information through social networks is of paramount importance in our everyday life. For example, WOM is known to influence purchasing decisions to the extent that 2/3 of the economy of the United States is driven by WOM recommendations buzzbuzz (). WOM is also important for understanding communication inside organizations, opinion formation in societies or rumor spreading. Despite its importance, detailed empirical data about how humans disseminate information are scarce or indirect golden (); kleinberg (). Most understanding comes from implementing models and ideas borrowed from epidemiology on empirical or synthetic social networks yamir (); motion (). However, unlike virus spreading, information diffusion depends on the voluntary nature of humans, has a perceived transmission cost and is only passed by its host to individuals who may be interested in it huberman (); flow (). Here we present a large scale experiment designed to measure and understand the influence of human behavior on the diffusion of information.

We analyzed a series of controlled viral marketing jurvetson () campaigns in which subscribers to an on-line newsletter were offered incentives for promoting new subscriptions among friends and colleagues. This offering was virally spread through recommendation e-mails sent by participants. This “recommend-a-friend” mechanism was fully conducted electronically and thus could be monitored at every step. Spurred by exogenous online advertising, a total of 7,153 individuals started recommendation cascades subsequently fueled through viral propagation carried out by 2,112 secondary spreaders. This resulted in another 21,918 individuals touched by the message which they did not pass along further. All in all, 31,183 individuals were “infected” by the viral message. Of those, 9,265 were spreaders. Thus, 77% of the participants were reached by the endogenous WOM viral mechanism. We call seed nodes the individuals spontaneously initiating recommendation cascades and viral nodes the individuals who pass e-mail invitations along after having received them from other participants. The topology of the resulting viral recommendations graph (designated as the Viral Network) is a directed network formed by 7,188 isolated components, or viral cascades, where nodes representing participants are connected by arcs representing recommendation e-mails (see Fig. 1).
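The aggregate figures quoted above follow directly from the three reported counts. The short sketch below recomputes them; the variable names are illustrative choices, and the last quantity (the fraction of recipients who forwarded the message) is the ratio the text later calls the Viral Transmissibility.

```python
# Counts reported in the text for the aggregated campaigns.
seeds = 7_153            # individuals who started cascades spontaneously
viral_spreaders = 2_112  # recipients who forwarded the message
passive = 21_918         # recipients who did not pass it along

total = seeds + viral_spreaders + passive
spreaders = seeds + viral_spreaders
reached_virally = viral_spreaders + passive

# Share of all participants reached through word-of-mouth rather than advertising.
viral_share = reached_virally / total

# Fraction of recipients who became secondary spreaders.
transmissibility = viral_spreaders / reached_virally

print(f"total participants: {total}")
print(f"spreaders:          {spreaders}")
print(f"reached virally:    {viral_share:.0%}")
print(f"transmissibility:   {transmissibility:.3f}")
```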

Figure 1: The viral network detected in the campaigns consists of a large number of disconnected clusters such as this one, found in Spain. It has 122 nodes and its diameter (longest undirected path) is 13. The structure starts from a seed participant in the center (black) and grows through secondary viral propagation by viral nodes (gray) until it reaches this large size. The probability of finding a similar occurrence in homogeneous random network models (see Figure 3) is negligible.
Figure 2: Upper panel: Fanout cumulative probability distribution function for viral campaigns in all countries (circles). Solid lines show maximum likelihood fits for power-law (black circles) with a normalization constant, and and and Poisson probability distribution functions with mean (see appendix A). Lower panel: Fanout Coefficient for viral (circles) and seed (squares) participants as a function of the Viral Transmissibility for different groups of countries. For a given campaign, both parameters are linearly dependent as because the participants viral decisions stem from evaluating the same utility function. For the campaigns analyzed the linear fit results in and . Variation between countries is due to a different acceptance of the offering by customers in those markets.
Group      Nodes    Cascades  Seed fanout  Viral fanout  Transmissibility  Avg. size (model)  Avg. size (measured)
ALL        31,183   7,188     2.51         2.96          0.088             4.39               4.34
SP+IT      6,862    1,162     3.14         3.38          0.11              5.99               5.91
France     11,754   3,244     2.20         2.50          0.070             3.67               3.62
AT+DE      7,938    1,743     2.55         3.07          0.095             4.59               4.55
UK+Nordic  4,629    1,039     2.69         2.79          0.084             4.51               4.45
Table 1: The eleven participating countries have been distributed in four culturally homogeneous groups for statistical relevance. Network parameters of their corresponding viral networks, shown above, include the theoretical average cascade size predicted by the model through equation (1), and the real value measured in the campaigns.

The spreading of information or diseases in a population is often described by average quantities andersonmay (). Although infection and propagation can be quite involved processes, population-level analyses describe viral propagation as a function of the probability of a virally informed person becoming a secondary spreader ($\lambda$), and of the average number of people contacted by secondary spreaders ($\bar r$). Thus, in this simple approach, two parameters fully characterize the mean-field description of information diffusion: Viral Transmissibility ($\lambda$) and Fanout coefficient ($\bar r$). In the viral campaigns we found that only 8.8% of the participants receiving a recommendation e-mail engaged in spreading (2,112 out of 24,030), and thus $\lambda = 0.088$. The Fanout coefficient $\bar r$ is the average number of recommendation e-mails sent by spreading nodes. Its value is noticeably higher for viral nodes ($\bar r_1 = 2.96$) than for seed nodes ($\bar r_0 = 2.51$), showing a stronger involvement in viral behavior when the invitation to pass messages along is received from a trusted source. As a result, the average number of secondary cases generated by each informed individual is given by the basic reproductive number $R_0 = \lambda \bar r_1 \approx 0.26$. Both $\lambda$ and $\bar r$ also depend on the specific country in which the campaign was run (see figure 2) but in all cases we found $R_0 < 1$, i.e. the viral campaigns did not reach the “tipping-point”. Since the campaign execution was identical in all countries, we conclude that differences observed in the propagation parameters are due to the varying appeal of the viral offering to customers in different markets. However, the data suggest a strong linear correlation between the Transmissibility and the Fanout coefficient. This peculiarity of information diffusion processes, not observed in traditional epidemics, stems from the fact that the decision to become a spreader and the decision on how many viral messages to send are taken by the same individual and thus are, on average, correlated. 
As a result, the basic reproductive number scales at least quadratically with the probability of a touched individual becoming a spreader, i.e. being convinced to propagate the message. Thus, increasing the perceived value of the viral campaign offer would have a quadratic effect instead of a linear one and the tipping-point would be reached for lower than expected values.

However, average quantities like can hide the heterogeneous nature of information diffusion. In fact we find in our experiments that most of the transmission we observe takes place due to extraordinary events. In particular, we get that the number of recommendations sent by spreaders is distributed as a power-law as seen in figure 2, indicating the high probability to find large number of recommendations in the viral cascades. This large demographic stochasticity has been observed in a number of other human activities like the number of e-mails sent by individuals per day barabasinature (), the number of telephone calls placed by users telephone (), the number of weblogs posts by a single user blogs (), the number of web page clicks per user pitkow (), and the number of a person’s social relationships tipping () or sexual contacts sexual (). All these examples suggest that the response of humans to a particular task cannot be described by close-to-average models in which they behave in a similar fashion probably with some small degree of demographic stochasticity. For example we find that 2% of the population has , suggesting the existence of super-spreading individuals in sharp contrast with homogeneous models of information spreading bass (). Super-spreading individuals have also been found in non-sexual disease spreading ssdisease () where they have a profound effect. As in that case, we find that super-spreading individuals are responsible for making large viral cascades rarer but more explosive (see figure 3). For example, if we neglect the existence of super-spreading individuals but still consider some degree of stochasticity in the number of recommendations by making a Poisson process with average , a viral cascade like the one in figure 1 would have a probability of appearance of approximately once every seeds, a number much larger than the total world population (see figure 3).

Figure 3: a) Cumulative distribution function of the viral cascades size in all countries (circles). The solid black line represents the prediction of the branching model (see text) while the red solid line is the Poisson prediction. b) Average size of the viral cascades as a function of the Viral Transmissibility for different groups of countries (circles). The solid line is the prediction of the branching model (Eq. 1) which diverges at the tipping point estimated using the linear fits of figure 2 for and . The red line and symbols shows as a function of . Note that at the tipping point the average number of viral e-mails sent is just .

An important question is whether the observed demographic stochasticity in the number of recommendations is directly related to the heterogeneity of social contacts NM (). Recent available data about social networks has revealed that humans show also large variability in their number of social contacts. In particular, it has been found that social connectivity is distributed as a power-law, much like the number of recommendations in our viral campaigns ebelemail (). Moreover, large variability in the numbers of social contacts have a profound effect in information or disease spreading epidemic (); satorras (). Specifically, simulations of information or disease spreading models on networks show that if information or disease flows through every social contact, the topological properties of social networks can significantly lower the “tipping-point”. While this might be the case of computer virus spreading or any other kind of automatic propagation through social networks, information transmission is voluntary and participants who engage in the spreading consider the cost and benefits of doing so. Thus, the number of recommendations sent by each participant (including not sending any) results from a trade-off between the information forwarding cost and the perceived value of doing it. When the value is low, the average number of recommendations can be very low, a small fraction of the sender’s social contacts which makes the social network topology largely irrelevant in the decision making problem. In fact, our data suggest that this is the case; specifically, most of the viral cascades have a tree-like structure while social networks are characterized by the large density of local loops why (). To illustrate this observation quantitatively, we have measured the clustering coefficient , i.e., the fraction of an individual contacts who are in contact between themselves. Email social networks have large values of clustering () NM () while in our case we find . 
Of course, these numbers are not independent: as shown in the appendix C and under fairly general assumptions we should expect that where is the average number of social contacts of the neighbors of an individual. In social networks is a large number, and then viral cascades have a very small clustering coefficient even when close to the tipping-point . Thus, we have found that reach of information diffusion can be very large without sampling the topological properties of the social network of individuals. This implies that the large heterogeneity observed in the number of recommendations is a characteristic of human decision making tasks rather than a reflection of the social network.

Given the above results, we have modeled the recommendation cascades of the viral campaigns through a branching process in which the recommendation heterogeneity is considered but the social network topology is neglected. Each cascade starts from an initial seed that initiates viral propagation with a random number of recommendations distributed by $P(r)$, whose average is $\bar r_0$. Touched individuals become secondary spreaders with probability $\lambda$, thereby giving birth to a new generation of viral nodes which, in turn, propagate the message further with recommendations distributed by $P(r)$ with average $\bar r_1$. (Footnote 1: Actually, the distributions for seed and viral nodes are different, but we use the same letter for clarity. See appendix A for more information.) The propagation continues through successive generations until none of the last touched individuals decide to become secondary spreaders. This process corresponds to the well known Bellman-Harris branching model branching (). On average, the infinite time limit cascade size $\bar s$ can be estimated as

$$ \bar s = 1 + \frac{\bar r_0}{1 - \lambda \bar r_1} \tag{1} $$

which agrees strikingly well, to within about 1.4%, with the experimental values found in the viral campaigns (see Table 1). Not only are average cascade sizes well predicted, but their distribution is properly replicated when the heterogeneity in the number of recommendations is implemented (see figure 3). Both results show how accurate the model can be in predicting the extent of a viral marketing campaign: since the transmissibility and the fanout coefficient can be roughly estimated during the early stages of the campaign, we could have predicted the final reach of a viral campaign at its very beginning. Moreover, given the knowledge of how the transmissibility and the fanout coefficient are connected and using equation (1), we could give estimations of the critical viral transmissibility which makes the viral message percolate through a fraction of the entire network. (Footnote 2: Since e-mail Networks carrying viral propagation are semidirected NM (), some portions of them are unreachable due to lack of connecting paths. So, we define percolation as the state where messages reach a large fraction of the e-mail Network Giant Connected Component (GCC).) We found that which correspond to . Of course this is an upper limit to the real “tipping-point” since it is based on the assumption that each seed originates one isolated viral cascade, which is only valid far from the “tipping-point”. The low number of recommendations needed to reach the “tipping point” illustrates the limited effect of the social network topology on the efficiency of viral campaigns. Thus, it is not necessary to send the message to each of a participant’s social contacts in order to reach a significant fraction of the target population.
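The agreement between model and measurement can be checked from the values in Table 1. The sketch below rests on two assumptions. The first is that the average cascade size takes the standard subcritical branching form, 1 + (seed fanout) / (1 - transmissibility × viral fanout). The second is that the five numeric columns of Table 1 are, in order, seed fanout, viral fanout, transmissibility, predicted size and measured size. The function and dictionary names are hypothetical.

```python
def mean_cascade_size(seed_fanout, viral_fanout, transmissibility):
    """Average cascade size below the tipping point (assumed form of the model)."""
    r0 = transmissibility * viral_fanout  # basic reproductive number
    if r0 >= 1:
        raise ValueError("formula only holds below the tipping point (R0 < 1)")
    return 1 + seed_fanout / (1 - r0)


# group: (seed fanout, viral fanout, transmissibility, model size, measured size)
TABLE_1 = {
    "ALL":       (2.51, 2.96, 0.088, 4.39, 4.34),
    "SP+IT":     (3.14, 3.38, 0.110, 5.99, 5.91),
    "France":    (2.20, 2.50, 0.070, 3.67, 3.62),
    "AT+DE":     (2.55, 3.07, 0.095, 4.59, 4.55),
    "UK+Nordic": (2.69, 2.79, 0.084, 4.51, 4.45),
}

for group, (r_seed, r_viral, lam, model, measured) in TABLE_1.items():
    predicted = mean_cascade_size(r_seed, r_viral, lam)
    error = abs(predicted - measured) / measured
    print(f"{group:10s} predicted={predicted:.2f} table={model:.2f} "
          f"measured={measured:.2f} relative error={error:.1%}")
```

Under these assumptions the recomputed sizes match the tabulated model column to within rounding, which supports this reading of the table columns.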

Information diffusion dynamics is also affected by the different way individuals program the execution of their tasks. The time it takes for participants to pass the message along since it was received, or “waiting-time” , shows also a large degree of variability: participants forward the message after days on average, but with a very large standard deviation of days, with some participants responding as late as days after receiving the invitation email (see figure 4). The large variability of the distribution for waiting times observed in our data is consistent with recent measures of how humans organize their time when working on specific tasks, such as email answering, market trading or web pages visits. barabasinature (); vazquez (). Traditional Poissonian models for cannot match the observed data and several long-tailed models like power laws vazquez () or log-normal amaralemail () distributions for have been proposed to incorporate the large waiting-times between actions observed. Our data is fully consistent with a log-normal distribution and, moreover, the data shows no statistical correlation with the number of recommendations made by the participant (see figure 4).

Figure 4: a) Cumulative probability distribution of time elapsed between the reception and forwarding of the viral information (circles) for participants in all countries. The solid line shows MLE fit to a log-normal distribution with and . Only viral nodes are considered, since reception time for seed nodes is undefined. Inset shows absence of statistical correlation between the number of recommendations made and the time elapsed until each participant forwards the message. b) Average number of touched participants as a function of the cascades start time in our campaigns (circles) compared with the prediction of the Bellman-Harris model (solid line), with the fitted log-normal distribution (black), and with an exponential distribution of the same mean (red). The dashed line is the analytical approximation to a Bellman-Harris process with log-normal waiting times given by , where is the cumulative distribution function of the log-normal distribution in a). Inset: Remarkable agreement between the average size of the viral cascades as function of total campaign time in log scale (circles) with the Bellman-Harris model prediction with G(t) log-normal. Also shown, in red, the prediction with G(t) exponential.

This means that the delay in passing along a message and the number of recommendations made by individuals are largely independent decisions. Within this approximation, our simulations of the Bellman-Harris process with waiting times distributed by log-normal and number of recommendations by the power-law show a remarkable agreement with our data from the campaigns (see figure 4). On the other hand, population-average models predict that the average number of infected individuals passing along the message at time is described by the growth equation
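A minimal simulation sketch of such a process is given below, using only the Python standard library. It is not the authors' simulation code. The fanout law (integer part of a Pareto draw with tail index 2.5), the log-normal parameters `MU` and `SIGMA`, the horizon and all function names are illustrative assumptions. Only the transmissibility 0.088 is taken from the campaigns. The sketch compares how many individuals are touched late when waiting times are log-normal rather than exponential with the same mean.

```python
import math
import random


def draw_fanout(rng, tail_index=2.5):
    """Heavy-tailed number of recommendations (>= 1): integer part of a Pareto draw."""
    return int(rng.paretovariate(tail_index))


def simulate_cascade(rng, lam, draw_wait, max_size=100_000):
    """Simulate one cascade and return the time at which each individual is touched."""
    touch_times = [0.0]                   # the seed is informed at time 0
    pending = [(0.0, draw_fanout(rng))]   # (sending time, number of e-mails sent)
    while pending and len(touch_times) < max_size:
        send_time, n_mails = pending.pop()
        for _ in range(n_mails):
            touch_times.append(send_time)  # a recipient is touched when the mail is sent
            if rng.random() < lam:         # the recipient decides to become a spreader
                pending.append((send_time + draw_wait(rng), draw_fanout(rng)))
    return touch_times


def run_campaign(seed, lam, draw_wait, n_cascades=20_000):
    """Run many independent cascades; return all touch times and the mean cascade size."""
    rng = random.Random(seed)
    times = []
    for _ in range(n_cascades):
        times.extend(simulate_cascade(rng, lam, draw_wait))
    return times, len(times) / n_cascades


LAM = 0.088              # transmissibility measured in the campaigns
MU, SIGMA = 0.0, 1.5     # hypothetical log-normal parameters (not the fitted ones)
mean_wait = math.exp(MU + SIGMA ** 2 / 2)

lognormal_times, lognormal_mean = run_campaign(
    1, LAM, lambda rng: rng.lognormvariate(MU, SIGMA))
exponential_times, exponential_mean = run_campaign(
    2, LAM, lambda rng: rng.expovariate(1 / mean_wait))

horizon = 10 * mean_wait  # "late" means later than ten mean waiting times
late_lognormal = sum(t > horizon for t in lognormal_times)
late_exponential = sum(t > horizon for t in exponential_times)

print(f"mean cascade size: log-normal {lognormal_mean:.2f}, "
      f"exponential {exponential_mean:.2f}")
print(f"touched after {horizon:.1f} time units: "
      f"log-normal {late_lognormal}, exponential {late_exponential}")
```

Because the waiting-time law does not affect who is eventually reached, both runs should give similar mean cascade sizes. The difference lies in the timing: the log-normal run leaves a visible tail of late transmissions that the exponential run essentially lacks.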

(2)

where is the Malthusian rate parameter of the population. The number of people aware of the information until time is the cumulative sum of infected individuals, . Equation (2) is the starting point of many different deterministic models to describe the evolution of epidemics, information or innovations in a population. It also describes the asymptotic dynamics of those situations in the models with some mild degree of heterogeneity in (Footnote 3: If is Poissonian, the average number of infected people in a Bellman-Harris process is given exactly by equation (2).) The situation changes drastically when has a large degree of variability. Specifically, if belongs to the so-called class of subexponential distributions, i.e. distributions that decay slower than exponentially when , equation (2) is not valid. This class contains important instances such as the power-law (or Pareto) distribution, the Weibull distribution or, as in our case, the log-normal distribution. In the latter we obtain that for , is given in the long run by

(3)

with a constant independent of (see appendix B). Equation (3) demonstrates the deep impact of large degree of heterogeneity in our population: the very functional form of the time dependence is changed and the dynamics of the system depends on a logarithmic time scale, thus slowing down the propagation of information in a drastic way. The situation is the opposite for moderate values of where with given by the solutions of but with and thus information spreads much faster than expected. The different behavior both above and below the “tipping-point” is due to the different importance that individuals with small or large values of have in the dynamics: while below the number of infected individuals decay in time up to the point where a sole individual can halt the dynamics of a viral cascade, above the dynamics is governed by individuals with small number of which are more abundant than those with and thus speed up the diffusion. Since subexponential distributions are found in other human tasks barabasinature (); vazquez (); amaralemail (), our findings have the important consequence that the high variability in the response of humans to a particular task can slow down or speed up the dynamics of processes taking place on social networks when compared to the traditional population-average models.

Figure 5: Prevalence time as a function of number of initially infected people (i.e. number of seeds ) for the Bellman-Harris branching process with values of and obtained in our campaigns for all countries (see table 1). Prevalence time is calculated by solving equation . Solid lines correspond to different distributions : log-normal (black) and Poisson (red).

Our study does not explain why the frequency and number of recommendations made by people in our experiments are so heterogeneous even though the decision they faced was the same. Rational expectations suggest that individuals should have made their decisions based on similar utility functions and then the answers would have been closer to each other. The fact that the same degree of heterogeneity has been found for so many different tasks in humans barabasinature (); vazquez (); amaralemail () suggests that it is an intrinsic feature of human nature to be so wildly heterogeneous. As we have shown, the main consequence of the large variability of human behavior is that population-level average quantities do not explain the dynamics of social network processes. Important consequences of this large variability of behavior are the slowing down or speeding up of information diffusion and the fact that most of the diffusion takes place through events that would otherwise be considered extraordinary. The corrections to population-averaged predictions go beyond a different set of values for the dynamics parameters: they can even change the time scale or functional form of the predictions. In particular, we have seen that we are forced to revisit the way we model spreading processes mediated by humans by using differential equations like (2). On the other hand, the slowing down of information diffusion implies that viral cascades or outbreaks last much longer than expected, which could explain the prevalence of some information, rumors or computer viruses. For example, if we assume that initially seeds are infected, we could take as the end of information diffusion the point when the fraction of infected individuals decays to . While Poissonian approximations yield , in our case we find that where is independent of . When is large enough there is a huge difference between both estimations. 
For example, if (a large but moderate value), then days (with ) for Poissonian models while year if is described by a log-normal distribution. As suggested in barabasivirus (), the high variability of response times can be the origin of the prevalence of computer viruses. In fact, our viral cascades span a longer time than initially expected, which may render viral campaigns impractical for information diffusion. Companies, organizations or individuals implementing such marketing tactics to disseminate information over social networks face the following dichotomy: if the tactic is successful and information spread reaches the “tipping-point”, it does so very quickly; however, if it fails to reach the “tipping-point”, the situation is even worse because information travels slowly, in logarithmic time. We hope that our experiments and the fact that they can be accurately explained by simple models will trigger more research to understand human behavior quantitatively.

Acknowledgments: J.L.I. acknowledges IBM Corporation support for the collection of anonymous data of its Viral Marketing campaigns propagation. E.M. acknowledges partial support from MEC (Spain) through grant FIS2004-01001 and a Ramón y Cajal contract. We thank Alex Arenas for sharing with us the e-mail Network data used in our simulations.

Appendix A Model Selection

a.1 Candidate Models for the recommendation distribution

The recommendation distribution is the probability distribution of the number of recommendations made by each participant in the campaign. As shown in figure 1b, there is a large degree of heterogeneity in the way the participants engaged in the campaign. The number of recommendations per participant varies from one to more than one hundred and thus any modeling of the distribution of recommendations has to incorporate those extreme events.

We consider two distinct treatments of the number of recommendations:

  1. In order to incorporate demographic stochasticity inherent to the transmission process, many classical epidemiological models assume that the offspring distribution is represented by a Poisson process, and thus .

  2. However, there is increasing evidence that humans tend to respond in an untamed way in different activities. Most people behave close to the average behavior, but a non-negligible portion of humans show bursts of activity, as seen in the number of e-mails sent per day ebelemail (), the number of telephone calls placed by users telephone (), the number of weblog posts by a single user blogs (), the time spent between receiving and replying to an e-mail barabasinature () or the number of web page clicks per user pitkow (). To account for those extreme events, power-law distributions of activity have been proposed and observed statistically. Here we propose a model for the number of recommendations based on a power-law distribution which has the following pdf

    (4)

    which asymptotically decreases like a power law and shows a cutoff at small numbers of recommendations . Here, is a normalization constant so that .

a.2 Parameter estimation

We estimate the model parameters by the method of moments to ensure that all models have the same mean value (and ) observed in the campaigns, so that the difference between models is due to the different way they handle heterogeneity. Note that the Poisson distribution has only one parameter and then only can be fitted. In the other case, the , there are two parameters and data can be fitted to the first and second moment of as shown in table 2. We model independently the pdf of the number of recommendations made by seeds and viral nodes to account for the different values observed. It is interesting to note that both pdfs seem to decay as a power law with the same exponent .

Group
Seeds 3.48 29.66
Viral 3.50 60.07
Table 2: Parameters of the different probability distribution models for the observed number of recommendations made by seed nodes and viral nodes. Parameters and refer to

Appendix B Viral Marketing propagation dynamics

b.1 The Galton-Watson branching process

Branching processes describe the evolution of systems where an initial set of objects, called the 0-th generation, reproduce themselves into a set of children of the same kind, called the first generation, and so on through successive generations. The Galton-Watson process is the simplest mathematical description of such a situation and only keeps track of the sizes of the successive generations, not the times at which individual objects are born or their individual family relationships. We can define two sets of random variables with being the number of individuals in generation and with . Since the probability law governing each generation does not depend on the sizes of the preceding generation, both form a Markov Chain.

The probability distribution of the variable is given by and we can define its probability generating function (pgf) as

(5)

whose derivative evaluated at is the expected value of as follows

(6)

It was demonstrated by Watson harris () that the generating function of is , the n-th iterate of the generating function , as follows

(7)

This important property leads to the following result for the average size of the n-th generation:

(8)
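For reference, the standard Galton-Watson relations that equations (5) to (8) refer to can be written as follows. The notation is the usual textbook one and is an assumption here, not necessarily the authors' own symbols: $Z_n$ is the size of the $n$-th generation with $Z_0 = 1$, $p_k$ is the probability that one individual has $k$ children, and $m$ is the mean number of children.

```latex
% Probability generating function of the offspring number (cf. equation 5)
f(s) = \sum_{k=0}^{\infty} p_k \, s^k , \qquad |s| \le 1

% Its derivative at s = 1 is the mean offspring number (cf. equation 6)
f'(1) = \sum_{k=0}^{\infty} k \, p_k = E[Z_1] = m

% The generating function of Z_n is the n-th iterate of f (cf. equation 7)
f_0(s) = s , \qquad f_n(s) = f\bigl(f_{n-1}(s)\bigr)

% Applying the chain rule n times gives the mean size of generation n (cf. equation 8)
E[Z_n] = f_n'(1) = \bigl[f'(1)\bigr]^n = m^n
```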

b.2 Model for Viral Marketing propagation

Applying the Galton-Watson formalism to the viral propagation dynamics, we consider a single propagation tree starting from one node () whose components are all nodes touched by the message. Its total size at generation is and the nodes can be divided in Active () and Passive () depending on whether they have passed the viral message along or not. Now, we define the Viral Transmissibility, or the probability of any one node being Active, as and the Fanout Coefficient, or average number of email referrals sent by Active nodes, as where is the number of email referrals sent by node n. Now the average number of email referrals sent by all nodes (Active or Passive) is

since the summation over Passive nodes is zero. In our mean-field approach, this value will be considered constant throughout all generations.

Now, the probability function of the Galton-Watson process is given by , where is the power-law distribution in (4) with , and . The corresponding generating function is

(10)

and applying the Galton-Watson process results in (6) and (8) we write the average size of each of the generations in the propagation tree as

(11)

and

(12)

hence, the average size of a branch in the mean-field approach at the infinite time limit is given by

(13)

since the summation converges because the system is below the percolation threshold and . Now, the total number of nodes in the Viral Network graph in the infinite time limit results from adding the nodes in the trees generated by each seed node and multiplying by the total number of seed nodes. Thus we have, seed nodes included, that

(14)

where the validity condition of being far from the percolation threshold is necessary to ensure that outbreaks (or clusters) originating from different seed nodes do not merge with one another.
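The chain of results above can be sketched as a short worked derivation. The symbols are assumptions introduced for illustration: $\bar r_0$ and $\bar r_1$ are the average fanouts of seed and viral nodes, $\lambda$ is the Viral Transmissibility, $N_s$ is the number of seed nodes and $N$ the total number of nodes. The closed form is consistent with the model column of Table 1.

```latex
% Mean number of nodes touched in generation n >= 1 of a tree started by one seed:
% the seed sends \bar r_0 messages, and each touched node forwards
% \lambda \bar r_1 messages on average.
E[Z_n] = \bar r_0 \, (\lambda \bar r_1)^{\,n-1} , \qquad n \ge 1

% Summing the geometric series, which converges because \lambda \bar r_1 < 1:
\sum_{n=1}^{\infty} E[Z_n]
  = \bar r_0 \sum_{j=0}^{\infty} (\lambda \bar r_1)^{\,j}
  = \frac{\bar r_0}{1 - \lambda \bar r_1}

% Average cascade size, seed included, and total network size for N_s seeds:
\bar s = 1 + \frac{\bar r_0}{1 - \lambda \bar r_1} ,
\qquad
N = N_s \left( 1 + \frac{\bar r_0}{1 - \lambda \bar r_1} \right)
```

For the aggregated campaigns ($\bar r_0 = 2.51$, $\bar r_1 = 2.96$, $\lambda = 0.088$) this gives $\bar s \approx 4.39$, the value listed in Table 1.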

b.3 Age-dependent dynamics: Bellman-Harris process

The description of viral marketing dynamics based on the Galton-Watson process does not consider the “waiting time” ($\tau$) elapsed between the reception of a message and the moment it is passed along; it implicitly assumes that both actions take place at the same instant. However, viral propagation does not occur instantaneously, and our experiments show that waiting times follow a log-normal distribution, much like those observed in other human activities.

To describe this behavior we use the Bellman-Harris process, a continuous-time generalization of the Galton-Watson process, in which both the number of descendants at each generation and their lifetimes are represented by non-negative, independent random variables harris (). It is described as follows: a single ancestor originates at $t = 0$ and lives for a time $\tau$, which is a random variable with cumulative distribution function $G(\tau)$ and mean $\bar{\tau}$. At the moment of its disappearance the particle generates a number of progeny according to a probability distribution whose pgf is denoted as $f(s)$. The process continues with descendants behaving independently and in the same fashion as their ancestors did. Thus, the branching process is described by the random variable $Z(t)$ representing the number of active particles at time $t$. In our case, $Z(t)$ represents the number of active participants at time $t$, i.e. the number of people who have received the information before time $t$ and who will send it at a future time.

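Analytically, we use the generating function $F(s,t)$ for calculating the probability of having $k$ particles active at time $t$. It is defined as

$$F(s,t) = \sum_{k=0}^{\infty} P[Z(t) = k]\, s^{k} \tag{15}$$

It can be proved harris () that, in the asymptotic limit, $F(s,t)$ satisfies a renewal equation of the form

$$F(s,t) = s\,[1 - G(t)] + \int_{0}^{t} f\big(F(s,t-u)\big)\, dG(u) \tag{16}$$

where $G$ is the lifetime distribution and $f$ the progeny generating function. As a result $m(t)$, the expected value of $Z(t)$, verifies that

$$m(t) = 1 - G(t) + R_0 \int_{0}^{t} m(t-u)\, dG(u) \tag{17}$$

where we have used that

$$f'(1) = R_0 \tag{18}$$

with $R_0$ the average number of secondary cases.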

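General explicit solutions of the integral equation (17) do not exist, although the asymptotic behavior is known in the case in which the Malthusian parameter $\alpha$ of the population exists. In terms of the lifetime distribution $G$ and the average number of secondary cases $R_0$, this parameter is defined by

$$R_0 \int_{0}^{\infty} e^{-\alpha u}\, dG(u) = 1 \tag{19}$$

If a solution of this equation exists, then harris ()

$$m(t) \sim C\, e^{\alpha t} \quad \text{as } t \to \infty \tag{20}$$

where $C$ is a positive constant. The normalization of $G$ implies that, if $\alpha$ exists, $\alpha > 0$ for $R_0 > 1$ and $\alpha < 0$ for $R_0 < 1$, thus recovering the exponential growth or decay above and below the “tipping-point”. Important instances of this case are: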

  1. Galton-Watson process. For , where is the unit step function at 0 (i.e., lifespan of all particles is identical and equal to ), we recover a Galton-Watson process with progeny generating function and mean

    (21)

    which yields equation (14) since .

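  2. Markov age-dependent branching process. Traditional modeling of the lifespan or “waiting time” of human activities assumes that the lifespan distribution $G(t)$ is of the Poissonian type, $G(t) = 1 - e^{-t/\bar{\tau}}$. One important reason is that this exponential distribution has the lack-of-memory property, which makes it suitable for modeling the dynamics with Markovian processes. This is exemplified in our case by the fact that, if the lifespan is exponentially distributed with mean $\bar{\tau}$ and $R_0$ denotes the average number of secondary cases, then the solution of (17) for the expected number of active particles $m(t)$ is exactly given by

    $$m(t) = e^{(R_0 - 1)\,t/\bar{\tau}} \tag{22}$$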

Note that both cases correspond to the basic Markovian growth models of epidemic transmission in which the average number of infected people grows or decays exponentially within a time scale proportional to the average lifespan of infected individuals.
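The exponential case offers a convenient check for any numerical treatment of the renewal equation for the mean number of active particles. The sketch below assumes the equation has the standard Bellman-Harris form, m(t) = 1 - G(t) + R0 * integral of m(t-u) dG(u), and uses a simple first-order discretisation; the function name, grid step and parameter values are illustrative assumptions, not the procedure used for the campaigns.

```python
import math


def solve_renewal(cdf, r0, t_max, h):
    """Explicit first-order discretisation of
        m(t) = 1 - G(t) + r0 * int_0^t m(t - u) dG(u)
    on the grid t_i = i * h. Returns the list [m(t_0), ..., m(t_n)].
    """
    n = int(round(t_max / h))
    G = [cdf(i * h) for i in range(n + 1)]
    # dG[j - 1] is the probability mass of the lifetime in (t_{j-1}, t_j]
    dG = [G[j] - G[j - 1] for j in range(1, n + 1)]
    m = [1.0 - G[0]]
    for i in range(1, n + 1):
        conv = sum(m[i - j] * dG[j - 1] for j in range(1, i + 1))
        m.append(1.0 - G[i] + r0 * conv)
    return m


# Hypothetical parameters: mean lifespan 1 (arbitrary units), R0 below 1
tau_mean = 1.0
r0 = 0.5
m = solve_renewal(lambda t: 1.0 - math.exp(-t / tau_mean), r0, t_max=5.0, h=0.01)
exact = math.exp((r0 - 1.0) * 5.0 / tau_mean)
print(m[-1], exact)
```

Because `cdf` is passed as a function, the same routine can be reused with a heavy-tailed lifetime distribution, where no closed-form solution is available.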

However, the Malthusian parameter of the population does not exist when $R_0 < 1$ for a broad and important class of distributions called subexponential distributions: a probability distribution with cdf $G(t)$ defined on $(0, \infty)$ is said to be subexponential if $\overline{G^{*n}}(t) \sim n\,\overline{G}(t)$ as $t \to \infty$, where $\overline{G}(t) = 1 - G(t)$ and $G^{*n}$ denotes the n-fold convolution of the function $G$ with itself. As a consequence of this asymptotic behavior, the integral in (19) does not exist for $\alpha < 0$, which means that the pdf of this class of distributions decays slower than any exponential when $t \to \infty$. Important instances like the Pareto, log-normal and Weibull distributions belong to this category. In this case, the solution of (17) is non-Markovian and the usual modeling of epidemics in terms of growth equations or differential equations fails: in particular, the knowledge of how information has diffused until time $t$ does not determine the dynamics for longer times. The general asymptotic behavior of equation (17) is known to be of the form athreya ()

$$m(t) \sim \frac{1 - G(t)}{1 - R_0} \quad \text{as } t \to \infty \tag{23}$$

and thus the number of infected people decays like the tail of the distribution.

We have analyzed the evolution of viral campaigns and found that the average cascade size as a function of time can be modeled with remarkable precision by a Bellman-Harris process as in (23) with a lognormal waiting-time distribution. Thus, instead of the usual exponential decay of active people, the active viral population evolves as

(24)
(25)

for large $t$. The asymptotic behavior then depends on a different time scale (logarithmic in time $t$) rather than the normal time scale , a result that highlights the failure of typical modeling to explain observed behavior when the variability of humans is so large that it is described by a subexponential distribution.

Note that the influence of the log-normal distributions of waiting times occurs even at the population average level and not only on fluctuations around the average value , i.e., it changes the dynamics not just quantitatively but also qualitatively. Finally, the dynamics is slowed down by the high probability of finding an individual with large response times, as the logarithmic time scale in our case shows.
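The qualitative difference between the two decay laws can be illustrated numerically. The sketch below assumes the subcritical asymptotic form in which the mean number of active participants follows the lifetime tail divided by (1 - R0), and compares it with the Markovian exponential decay for a lognormal and an exponential lifetime of equal mean. The parameter values (`sigma = 2`, mean 1 in arbitrary units, `r0 = 0.5`) are hypothetical and are not the fitted campaign values.

```python
import math


def lognormal_survival(t, mu, sigma):
    """P(T > t) for a lognormal lifetime with log-mean mu and log-std sigma."""
    return 0.5 * math.erfc((math.log(t) - mu) / (sigma * math.sqrt(2.0)))


def subexponential_mean_active(t, r0, mu, sigma):
    """Assumed large-t form for r0 < 1: tail of the lifetime over (1 - r0)."""
    return lognormal_survival(t, mu, sigma) / (1.0 - r0)


def markovian_mean_active(t, r0, tau_mean):
    """Exponential decay obtained when lifetimes are exponentially distributed."""
    return math.exp((r0 - 1.0) * t / tau_mean)


# Hypothetical parameters chosen so that both lifetimes have mean 1
r0 = 0.5
sigma = 2.0
tau_mean = 1.0
mu = math.log(tau_mean) - 0.5 * sigma**2

for t in (5.0, 10.0, 20.0, 50.0):
    print(t, subexponential_mean_active(t, r0, mu, sigma),
          markovian_mean_active(t, r0, tau_mean))
```

At large times the lognormal column decays far more slowly than the exponential one, which is the slowing down on a logarithmic time scale described in the text.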

Figure 6: Malthusian parameter of the population above the “tipping-point” as a function of the average number of secondary cases for different distributions of .

For the Malthusian parameter exists for the class of subexponential distributions and then grows exponentially like . But, even in this case, there is a large quantitative difference between the solutions of equation (19) and the values expected by assuming exponential distributions. As shown in figure 6 the difference in our case can be of one order of magnitude which implies that if the campaign reaches the tipping-point the information spreads much faster than expected. For example, if and using the values of days obtained in our campaigns we should have expected an exponential growth with time scale days, while in the case of a log-normal distribution we get hours. This large quantitative difference is due to the fact that subexponential distributions are more skewed than the Poisson ones and thus there is a higher probability of finding participants with small “waiting-times” (compared to the mean) in subexponential distributions. Those fast responders are responsible for this exponential growth with shorter time scale.
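The comparison between exponential and log-normal waiting times above the tipping point can be sketched numerically. The code below assumes the Malthusian parameter is the positive root of R0 * E[exp(-alpha * T)] = 1, solves it by bisection, and evaluates the lognormal Laplace transform with a plain trapezoid rule. All names and parameter values are hypothetical (lifetimes with mean 1 in arbitrary units, `sigma = 2`, `r0 = 2`), not the values measured in the campaigns.

```python
import math


def laplace_exponential(alpha, tau_mean):
    """E[exp(-alpha * T)] for an exponential lifetime with mean tau_mean."""
    return 1.0 / (1.0 + alpha * tau_mean)


def laplace_lognormal(alpha, mu, sigma, z_max=10.0, steps=4000):
    """E[exp(-alpha * T)] for T = exp(mu + sigma * Z), Z standard normal.

    Uses the trapezoid rule over z in [-z_max, z_max].
    """
    h = 2.0 * z_max / steps
    total = 0.0
    for i in range(steps + 1):
        z = -z_max + i * h
        weight = 0.5 if i in (0, steps) else 1.0
        density = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
        total += weight * density * math.exp(-alpha * math.exp(mu + sigma * z))
    return total * h


def malthusian_parameter(laplace, r0, tol=1e-10):
    """Solve r0 * laplace(alpha) = 1 for alpha > 0 by bisection (needs r0 > 1)."""
    if r0 <= 1.0:
        raise ValueError("this sketch only handles r0 > 1")
    lo, hi = 0.0, 1.0
    while r0 * laplace(hi) > 1.0:  # expand until the root is bracketed
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if r0 * laplace(mid) > 1.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


r0 = 2.0
tau_mean = 1.0
sigma = 2.0
mu = math.log(tau_mean) - 0.5 * sigma**2  # same mean as the exponential case

alpha_exp = malthusian_parameter(lambda a: laplace_exponential(a, tau_mean), r0)
alpha_ln = malthusian_parameter(lambda a: laplace_lognormal(a, mu, sigma), r0)
print(alpha_exp, alpha_ln)
```

With these illustrative values the lognormal growth rate comes out larger than the exponential one, consistent with the role of fast responders; for a narrow lognormal (small `sigma`) the ordering can reverse, so the effect depends on how skewed the waiting times are.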

Appendix C Inferences on the substrate e-mail Network

The e-mail Network serving as substrate for the propagation of viral messages is formed by individuals (nodes) and by their e-mail connections (links between nodes) as determined by the addresses listed in their e-mail address books. In their propagation, viral messages can only travel through the links in the e-mail Network, and the Viral Network is thus a subset of it. We have observed, however, that even when viral propagation has fully percolated, the substrate e-mail Network is not readily perceived through observation of the Viral Network.

Nevertheless, because both networks are related, some parameters in the e-mail Network can be gleaned through measures on the viral network. We prove here that in a viral propagation process the clustering coefficients of the substrate network (the e-mail Network) and of its virally percolated subset (the Viral Network) are correlated and derive, based on a mean-field approximation, an expression of such correlation. The clustering coefficient, according to Watts and Strogatz clustering (), is defined as

$$C = \frac{3 \times \text{number of triangles}}{\text{number of connected triples}} \tag{26}$$

where “triple” means a single node with edges running to an unordered pair of others. If such a pair is also connected, it forms a triangle or “transitive triad”. Now we can write, in a mean-field approximation, the clustering coefficients of the e-mail and Viral networks respectively as

(27)
(28)
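The triangle-and-triple counting behind the clustering coefficient can be sketched as follows. This assumes the standard transitivity definition (three times the number of triangles divided by the number of connected triples) on an undirected graph stored as a dictionary of neighbor sets; the function name and the toy graphs are hypothetical.

```python
from itertools import combinations


def clustering_coefficient(adj):
    """Return 3 * triangles / connected triples for an undirected graph.

    adj maps each node to the set of its neighbors. A triple is a node
    together with an unordered pair of its neighbors; it is closed when
    the pair is itself linked. Each triangle closes three triples, so the
    ratio closed / triples equals 3 * triangles / triples.
    """
    triples = 0
    closed = 0
    for node, neighbors in adj.items():
        for a, b in combinations(neighbors, 2):
            triples += 1
            if b in adj[a]:
                closed += 1
    return closed / triples if triples else 0.0


triangle = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}}
path = {0: {1}, 1: {0, 2}, 2: {1}}
# Triangle 0-1-2 with an extra node 3 attached to node 0
kite = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}}

print(clustering_coefficient(triangle))
print(clustering_coefficient(path))
print(clustering_coefficient(kite))
```

Applying the same function to a substrate network and to the subgraph reached by a cascade gives the two coefficients that the mean-field argument relates.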

Considering an e-mail Network node connected to triangles and triples, we can watch the bond percolation progress of a viral message planted on it. The probability of a triangle on such a node being fully percolated by e-mails is the joint probability of percolation of each of the edges in the triple and of the link between the two neighbors at their ends, which forms the third side of the triangle

(29)

As a result, we can estimate as follows the average number of triangles and triples in the Viral Network with the mean-field approximation

(30)
(31)

Combining (27), (28), (30) and (31) we obtain

(32)

Considering that the clustering coefficient is calculated for non-directed networks (i.e. arcs in the e-mail Network are assimilated to undirected edges), that nodes reached by the viral message become active with probability (the Transmissibility) and that, after becoming active they send messages with Fanout each, we conclude that the probability for the third side of the triple being percolated by a viral message, so as to close a triangle, is given by

(33)

where is the average over the email network of the nearest neighbors' average degree. It has to be decreased by 1 because the propagation rules do not allow messages to be sent back to ancestor nodes. The factor 2 results from the fact that either of the two nodes at the open end of a triple can send the message that closes the corresponding triangle. Substituting (33) and (27) in (32) we arrive at the relationship between the clustering coefficient of an e-mail Network and that of its virally percolated subset

(34)

This expression has been tested through simulations of the viral propagation model on a real email network gathered from the email server logs of a Spanish university alex () (see figure 7). In the model, any node becomes a secondary spreader with probability and transmits the message among of his/her email connections (if possible) with average number of recommendations. While the real network has a rather large clustering coefficient , the resulting viral cascades have a very small clustering coefficient even for large probabilities of getting infected. These low values of justify the assumption made in our model that the social network is largely irrelevant to understanding the dynamics of information propagation below or even close to the tipping point.
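A simulation of this kind can be sketched in a few lines. The code below is an illustrative assumption, not the simulation used for figure 7: the seed is assumed to always forward the message, every active node sends a fixed number `fanout` of recommendations (the model in the text uses an average number instead), messages are never sent back to the node they came from, and a toy complete graph stands in for the real email network.

```python
import random


def simulate_cascade(adj, seed, transmissibility, fanout, rng):
    """Spread one message over the substrate graph adj (node -> set of neighbors).

    Returns (reached, links): the set of nodes touched by the message and
    the set of undirected substrate links used by at least one message.
    """
    reached = {seed}
    links = set()
    queue = [(seed, None, True)]  # (node, parent, is_active)
    while queue:
        node, parent, active = queue.pop(0)
        if not active:
            continue
        # Candidates are all contacts except the node the message came from
        candidates = [v for v in sorted(adj[node]) if v != parent]
        targets = rng.sample(candidates, min(fanout, len(candidates)))
        for v in targets:
            links.add(frozenset((node, v)))
            if v not in reached:
                reached.add(v)
                # A newly reached node becomes a spreader with the given probability
                queue.append((v, node, rng.random() < transmissibility))
    return reached, links


# Toy substrate: complete graph on 5 nodes (hypothetical, for illustration only)
k5 = {i: {j for j in range(5) if j != i} for i in range(5)}

full_reached, full_links = simulate_cascade(k5, 0, 1.0, 4, random.Random(1))
small_reached, small_links = simulate_cascade(k5, 0, 0.0, 2, random.Random(1))
print(len(full_reached), len(full_links))
print(len(small_reached), len(small_links))
```

Messages sent to nodes that were already reached are still recorded in `links`; these are the links that can close triangles in the viral cascade, so the returned link set is what a clustering measurement would be computed on.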

Figure 7: Clustering coefficient for the viral cascades obtained through simulations of the viral propagation model on a real email network (symbols) compared with the linear relationship given by equation (34). The email network has and

Appendix D Viral campaigns general description

The following describes in some detail the technical and marketing aspects involved in the execution of the Viral Marketing campaigns used as the source of the viral propagation data in our studies. It covers 16 different campaigns executed in 11 European countries, all of them with the same structure, strategy, user interfaces, data flow and participant conditions.

The primary marketing objective of the viral campaign was to increase the number of subscriptions to the company's on-line newsletter. The offering consisted of a free subscription to the newsletter, which subscribers could customize by choosing from a list of available generic topics represented by interest codes. The subscription was formalized by filling in a form located on the main campaign web page (a.k.a. the registration page). A series of drive-to-web tactics, variable by country, was put in place to attract visitors to the registration page. These included e-mail campaigns, banner advertising, search engine placement, promotion on the company web site and other web-based promotional activities.

Additionally, a viral propagation tool consisting of a button located on the registration page was established to trigger the message propagation. The caption on that button invited visitors to recommend the page to friends and colleagues and offered, as an additional incentive for people to forward the page, tickets for a prize draw to win a laptop computer. Two situations caused participants to become eligible to receive prize draw tickets:

  • One ticket was assigned to participants sending any number of recommendations to friends or colleagues

  • Additional tickets, unlimited in number, were given to the sender for each recommended friend who, as a result of the recommendation, subscribed to the newsletter

The ticket eligibility rules above were designed to discourage spam-like behavior, in which recommendations are sent indiscriminately to individuals not interested in the offering, while encouraging participants to send the highest possible number of recommendations to individuals presumed to be interested in the newsletter. Additionally, the participation rules guaranteed that the incentive was a direct consequence of the viral message propagation and not of registration to the newsletter.

References, Notes and Acknowledgements

  • (1) Moreno, Y., Nekovee, M., & Pacheco, A.F., Dynamics of rumor spreading in complex networks, Phys. Rev. E 69, 066103, (2004).
  • (2) Valente, T.W., Network Models of the Diffusion of Innovations, Hampton Press, Cresskill, NJ, (1995).
  • (3) Sernovitz, A. et al., Word of Mouth 101, Word of Mouth Marketing Association, New York, (2005).
  • (4) Dye, R., The Buzz on Buzz. Harvard Business Rev., vol. 78, No. 6, pp. 139-146 (2000).
  • (5) Goldenberg, J., Libai, B. & Solomon, S., Marketing Percolation, Phys A 284, (1-4), 335-347, (2000).
  • (6) Hidalgo, C.A., Castro, A., & Rodriguez-Sickert, C., The effect of social interactions in the primary consumption life cycle of motion pictures, New J. Phys. 8 52 (2006).
  • (7) Jurvetson, S. & Draper, R., Viral Marketing. Netscape M-Files, (1997).
  • (8) Barabási, A.-L., The origin of bursts and heavy tails in human dynamics, Nature 435, 207, (2005).
  • (9) Aiello, W., Chung, F. & Lu, L., A random graph model for power law graphs. In Proceedings of the 32nd Annual ACM Symposium of Theory of Computing, pp. 171-180, Association of Computing Machinery, New York, (2000).
  • (10) Gruhl, D., Guha, R., Liben-Nowell, D. & Tomkins, A., Information Diffusion Through Blogspace, In Proceedings of the 13th international conference on World Wide Web, pp. 491-501, (Association of Computing Machinery, New York, 2004).
  • (11) Harris, T.E., The Theory of Branching Processes, Springer-Verlag, Berlin, (2002).
  • (12) Pitkow, J.E., Summary of WWW Characterizations. In Proceedings of the Seventh World Wide Web Conference (WWW7), (1997).
  • (13) Gladwell, M., The Tipping Point, Little, Brown and Company, New York, (2000).
  • (14) Liljeros, F., Edling, C.R., Nunes Amaral, L.A., Stanley, H.E. & Aberg, Y., The web of human sexual contacts, Nature, 411, pp. 907-908 (2001).
  • (15) Kempe, D., Kleinberg, J. & Tardos, E., Maximizing the Spread of Influence through a Social Network, SIGKDD, (2003).
  • (16) Leskovec, J., Adamic, L. & Huberman, B., The Dynamics of Viral Marketing, Preprint at http://www.hpl.hp.com/idl/papers/viral/viral.pdf (2005).
  • (17) Wu, F., Huberman, B.A., Adamic, L.A. & Tyler, J.R., Information flow in social groups, Preprint at (http://www.hpl.hp.com/shl/papers/flow/flow.pdf) (2003).
  • (18) Anderson, R. M. & May, R., Infectious diseases of humans: dynamics and control, Oxford University Press, (1991).
  • (19) Bass, F.M., A New Product Growth Model for Consumer Durables, Management Science 15, pp. 215-227 (1969).
  • (20) Lloyd-Smith, J.O., Schreiber, S.J., Kopp, P.E. & Getz, W.W., Superspreading and the effect of individual variation on disease emergence, Nature 438, 355, (2005).
  • (21) Newman, M.E.J., Forrest, S. & Balthrop, J., Email networks and the spread of computer viruses, Phys. Rev. E 66, 035101 (R), (2002).
  • (22) Ebel, H., Mielsch, L.-I., & Bornholdt, S., Scale-free topology of e-mail networks, Phys. Rev. E 66, 035103, (2002).
  • (23) Newman, M.E.J., The spread of epidemic disease on networks, Phys. Rev. E 66, 016128, (2002).
  • (24) Pastor-Satorras, R. & Vespignani, A., Epidemic spreading in scale-free networks, Phys. Rev. Lett. 86, pp. 3200-3203 (2001).
  • (25) Newman, M.E.J. & Park, J., Why social networks are different from other types of networks, Phys. Rev. E 68, 036112, (2003).
  • (26) Vázquez, A., Gama-Oliveira, J., Dezsö, Z., Goh, K. & Barabási, A.-L., Modeling bursts and heavy tails in human dynamics Phys. Rev. E 73, 036127 (2006).
  • (27) Stouffer, D.B., Malmgren, R.D. & Amaral, L.A.N., Comments on “The origin of bursts and heavy tails in human dynamics”, arXiv:physics/0510216 (2005).
  • (28) Vázquez, A., Rácz, B., Lukács, A. & Barabási, A.-L., Impact of non-Poisson activity patterns on spreading processes, Phys. Rev. Lett. 98, 158702 (2007).
  • (29) Guimerá, R., Danon, L., Díaz-Guilera, A., Giralt, F. & Arenas, A., Self-similar community structure in organisations, Physical Review E 68, 065103 (2003).
  • (30) Newman, M.E.J., The Structure and Function of Complex Networks. SIAM Review, Vol. 45, No.2, 167-256, (2003).
  • (31) Athreya, K.B. & Ney, P.E., Branching Processes, Springer-Verlag, Berlin, (1972).
  • (32) Harris, T.E., The Theory of Branching Processes, Springer Verlag, Berlin, (1963).