Learning Asynchronous-Time Information Diffusion Models and its Application to Behavioral Data Analysis over Social Networks
Abstract
One of the interesting and important problems in information diffusion over a large social network is identifying an appropriate model from a limited amount of diffusion information. There are two contrasting approaches to modeling information diffusion. One is a push-type model, known as the Independent Cascade (IC) model, and the other is a pull-type model, known as the Linear Threshold (LT) model. We extend these two models (called AsIC and AsLT in this paper) to incorporate asynchronous time delay and investigate 1) how they differ from or resemble each other in terms of information diffusion, 2) whether the model itself is learnable from the observed information diffusion data, and 3) which model better explains how a particular topic (information) diffuses/propagates. We first show that there can be variations in how the time delay is modeled, and derive the likelihood of the observed data being generated for each model. Using one particular time delay model, we show that the model parameters are learnable from a limited amount of observation. We then propose a method based on predictive accuracy for selecting the model that better explains the observed data. Extensive evaluations were performed using both synthetic data and real data. Using synthetic data with network structures taken from four real networks, we first show that there are considerable behavioral differences between the AsIC and the AsLT models, and that the proposed methods accurately and stably learn the model parameters and identify the correct diffusion model from a limited amount of observation data. We next apply these methods to behavioral analysis of topic propagation using real blog propagation data, and show that there is a clear indication as to which topic better follows which model, although the results are rather insensitive to the model selected at the level of discussing how far and fast each topic propagates from the learned parameter values.
The correspondence between the topic and the model selected is well interpretable considering such factors as urgency, popularity and people’s habit.
Saito, Kimura, Ohara, & Motoda
1 Introduction
The growth of the Internet has enabled the formation of various kinds of large-scale social networks, through which a variety of information, including innovation, hot topics and even malicious rumors, can be propagated in the form of so-called “word-of-mouth” communications. Social networks are now recognized as an important medium for the spread of information, and a considerable number of studies have been made [Newman et al., 2002; Newman, 2003; Gruhl et al., 2004; Domingos, 2005; Leskovec et al., 2006; Romero et al., 2011; Bakshy et al., 2011; Mathioudakis et al., 2011].
Widely used information diffusion models in these studies are the independent cascade (IC) model [Goldenberg et al., 2001; Kempe et al., 2003; Kimura et al., 2009] and the linear threshold (LT) model [Watts, 2002; Watts & Dodds, 2007]. They have been used to solve such problems as the influence maximization problem [Kempe et al., 2003; Chen et al., 2009; Kimura et al., 2010] and the contamination minimization problem [Kimura et al., 2009]. These two models assume different mechanisms for information diffusion, which are based on two opposite views. In the IC model, each active node independently influences its inactive neighbors with given diffusion probabilities (information push style model). In the LT model, a node is influenced by its active neighbors if their total weight exceeds the threshold for the node (information pull style model). Which model is more appropriate depends on the situation, and selecting the appropriate one for a particular problem is an interesting and important problem. To answer this question, we first have to understand the behavioral difference between these two models.
Both models have parameters that need to be specified in advance: diffusion probabilities for the IC model, and weights for the LT model. However, their true values are not known in practice, which poses the challenging problem of estimating them from a limited amount of information diffusion data that are observed as time-sequences of influenced (activated) nodes. Fortunately, this falls into a well-defined parameter estimation problem in a machine learning setting. Given a generative model with its parameters and the independent observed data, we can calculate the likelihood that the data are generated and can estimate the parameters by maximizing the likelihood. This approach has a thorough theoretical background. The way the parameters are estimated depends on how the generative model is given. To the best of our knowledge, we were the first to follow this line of research. We addressed this problem first for the basic IC model [Saito et al., 2008; Kimura et al., 2009] and then for its variant that incorporates asynchronous time delay (referred to as the AsIC model) [Saito et al., 2009]. We further applied this to a variant of the LT model that also incorporates asynchronous time delay (referred to as the AsLT model) [Saito et al., 2010a; Saito et al., 2010c].
\citeA{gruhl:sigkdd} also tackled the same problem of estimating the parameters and proposed an EM-like algorithm, but they did not formalize the likelihood and it is not clear what is being optimized in deriving the parameter update formulas. \citeA{goyal:wsdm} attacked this problem from a different angle. They employed a variant of the LT model and estimated the parameter values by four different methods, all of which are directly computed from the frequency of the events in the observed data. Their approach is efficient, but it is somewhat ad hoc and lacks theoretical justification. \citeA{bakshy:ec} addressed the problem of diffusion of user-created content (asset) and used the maximum likelihood method to estimate the rate of asset adoption. However, they only modeled the rate of adoption and did not consider the diffusion model itself. Their focus was data analysis. \citeA{rodriguez:kdd} proposed an efficient method of inferring a network from the observed diffusion sequences based on the continuous-time version of the IC model, assuming that the probability that a node affects its child node is a function of the difference of the activation times between the two nodes. Their focus is inferring the structure of the network rather than inferring the best predictive model for a known network. They fixed a model and approximated the likelihood function in such a way that the simplified likelihood function can be maximized by adding a link in each iteration. Recent work of \citeA{myers:nips10} is close to ours. They used a model similar to, but different in details from, the AsIC model and showed that the likelihood maximization problem can effectively be transformed into a convex programming problem for which a global solution is guaranteed\footnote{We discuss the difference between their model and our model in Section 7.}. Their focus was also inferring the structure of the network.
In this paper, we first detail the Asynchronous Independent Cascade Model and the Asynchronous Linear Threshold Model as two contrasting information diffusion models. Both are extensions of the basic Independent Cascade Model and Linear Threshold Model that incorporate time delay in an asynchronous way. In particular, we focus on the likelihood derivation of these models. We show that there are a few variations of time delay and that different time delay models result in different likelihood formulations. We then show, for a particular time delay model, how to obtain the parameter values that maximize the respective likelihood by deriving an EM-like iterative approach using the observed sequence data. Indeed, being able to cope with asynchronous time delay is indispensable for realistic analysis of information diffusion because, in the real world, information propagates along the continuous time axis, and time-delays can occur during the propagation asynchronously. In fact, the time stamps of the observed data are not equally spaced. This means that the proposed learning method has to estimate not only the diffusion parameters (diffusion probabilities for the AsIC model and weights for the AsLT model) but also the time-delay parameters from the observed data. We identified that there are basically two types of delay: link delay and node delay. The former corresponds to the delay associated with information propagation, and the latter corresponds to the delay associated with human action, which is further divided into two types: non-override and override. We choose link delay to explain the learning algorithms and perform the experiments on this model. For the other time delay models we only derive the likelihood functions that are required for the learning algorithms.
Incorporating time-delay makes the time-sequence observation data structural, which makes the analysis of the diffusion process difficult because there is no way of knowing which node has activated which other node from the observation data sequence.
Knowing the optimal parameter values does not mean that the observation follows the model well. We have to decide which model better explains the observation and select the right (or more appropriate) model. We solve this problem by comparing the predictive accuracy of each model. We use a variant of the hold-out method applied to a set of sequential data, which is similar to the leave-one-out method applied to multiple time-sequence data, i.e., we use a part of the data to train the model, predict the activation probability one step later, and compare it with the observation. We repeat this by changing the size of the training data.
In summary, we want to 1) clarify how the AsIC model and the AsLT model differ from or resemble each other in terms of information diffusion, 2) propose a method to learn the model parameters from a limited amount of observed data and show that the method is effective, and 3) show that how information diffuses depends on the topic and that the proposed method can identify which model better explains how a particular topic (information) diffuses/propagates.
We have performed extensive experiments to verify the proposed approaches using both synthetic data and real data. Experiments using synthetic data generated by the models (AsIC and AsLT) with network structures taken from four real networks revealed that there are considerable behavioral differences between the AsIC and the AsLT models, and that the differences can be explained qualitatively by the diffusion mechanism. It is also shown that the proposed likelihood maximization methods accurately and stably learn the model parameters, and identify the correct diffusion model from a limited amount of observation data. Experiments on behavioral analysis of topic propagation using the real blog data show that the results are rather insensitive to the model selected at an abstract level of discussing how relatively far and fast each topic propagates from the learned parameter values, but that there is still a clear indication as to which topic better follows which model. The correspondence between the topic and the model selected is well interpretable considering such factors as urgency, popularity and people’s habit.
The paper is organized as follows. In Section 2, we introduce the two contrasting information diffusion models (AsIC and AsLT) used in this paper. In Section 3, we detail how the likelihood functions can be formulated for the different time delay models, and in the Appendix how the parameters can be obtained using one particular model of time delay (link delay). In Section 4, we show the detailed analysis results of the behavioral difference between AsIC and AsLT obtained by using four real network structures. In Section 5, we detail the learning performance (accuracy of parameter learning and influential node ranking) using synthetic data generated with the same four real network structures. In Section 6, we focus on model selection using both synthetic data and real blog network data. In Section 7, we discuss some of the important issues regarding the related work and those for future work. We end the paper by summarizing what has been achieved in Section 8.
2 Information Diffusion Models
2.1 Two Contrasting Diffusion Models
It is quite natural to bring in the notion of information sender and receiver. The IC model is sender-centered. It is motivated by epidemic spread in which the disease carrier is the information sender. If a person gets infected, his or her neighbors may also get infected, i.e., the information sender tries to push information to its neighbors. The LT model is receiver-centered. It is based on the view that the receiver has control over the information flow. This models the way innovation propagates. For example, a person is tempted to buy a new tablet PC if many of his or her neighbors have purchased it and said that it is good, i.e., the information receiver tries to pull information.
Both models have respective reasons for their working mechanisms, but they contrast sharply with each other. We are interested in 1) how they differ from or resemble each other in terms of information diffusion, 2) whether the model itself is learnable from the observed information diffusion data, and 3) which model better explains how a particular topic (information) diffuses/propagates. Both models have parameters, i.e., a diffusion probability attached to each directional link in the IC model and a weight attached to each directional link in the LT model. As shown later in Section 3.2, the weight is equivalent to a probability. Thus, intuitively, both models appear to be comparable in terms of the average influence degree if the parameter values are comparable. The simulation results, however, show that these two models behave quite differently. We will explain why they are different in Section 4.2.
In the following two subsections we will describe the two diffusion models that we use in this paper: the asynchronous independent cascade (AsIC) model, first introduced by \citeA{saito:acml09}, and the asynchronous linear threshold (AsLT) model, first introduced by \citeA{saito:sbp10}. They differ from the basic IC and LT models in that they explicitly handle the time delay. The diffusion process evolves with time. The basic models deal with time by allowing nodes to change their states in a synchronous way at each discrete time step, i.e., no time delay is considered, or one can say that every state change is uniformly delayed by exactly one discrete time step. Their asynchronous time delay versions explicitly treat the time delay of each node independently. We discuss the notion of time delay in more depth in Section 3.3.1.
The models we explain in the following two subsections and the learning algorithms we describe in Section 3 are based on a particular time-delay model, which we call link delay. In this model, the time delay is caused by the communication channel, e.g., network traffic and/or some malfunction, and as soon as the information arrives at the destination, the node responds without delay.
Before we explain the models, we give the definition of a graph and of the children and parents of a node. The graph we use is a directed graph $G = (V, E)$ without self-links, where $V$ and $E$ ($\subset V \times V$) stand for the sets of all the nodes and links, respectively. For each node $v$ in the network $G$, we denote by $F(v)$ the set of child nodes of $v$, i.e.,
$$F(v) = \{ w ;\ (v, w) \in E \}.$$
Similarly, we denote by $B(v)$ the set of parent nodes of $v$, i.e.,
$$B(v) = \{ u ;\ (u, v) \in E \}.$$
We call nodes active if they have been influenced with the information. In the following models, we assume that nodes can switch their states only from inactive to active, but not the other way around, and that, given an initial active node set $S$, only the nodes in $S$ are active at the initial time.
2.2 Asynchronous Independent Cascade Model
We first recall the definition of the IC model according to the work of \citeA{kempe:kdd}, and then introduce the AsIC model. In the IC model, we specify a real value $\kappa_{u,v}$ with $0 < \kappa_{u,v} < 1$ for each link $(u, v)$ in advance. Here $\kappa_{u,v}$ is referred to as the diffusion probability through link $(u, v)$. The diffusion process unfolds in discrete time-steps $t \ge 0$, and proceeds from a given initial active set $S$ in the following way. When a node $u$ becomes active at time-step $t$, it is given a single chance to activate each currently inactive child node $v$, and succeeds with probability $\kappa_{u,v}$. If $u$ succeeds, then $v$ will become active at time-step $t + 1$. If multiple parent nodes of $v$ become active at time-step $t$, then their activation attempts are sequenced in an arbitrary order, but all performed at time-step $t$. Whether or not $u$ succeeds, it cannot make any further attempts to activate $v$ in subsequent rounds. The process terminates if no more activations are possible.
In the AsIC model, we specify real values $r_{u,v}$ with $r_{u,v} > 0$ in advance for each link $(u, v)$ in addition to $\kappa_{u,v}$, where $r_{u,v}$ is referred to as the time-delay parameter through link $(u, v)$. The diffusion process unfolds in continuous time $t$, and proceeds from a given initial active set $S$ in the following way. Suppose that a node $u$ becomes active at time $t$. Then, $u$ is given a single chance to activate each currently inactive child node $v$. We choose a delay-time $\delta$ from the exponential distribution\footnote{A similar formulation can be derived for other distributions such as power-law and Weibull.} with parameter $r_{u,v}$. If $v$ has not been activated before time $t + \delta$, then $u$ attempts to activate $v$, and succeeds with probability $\kappa_{u,v}$. If $u$ succeeds, then $v$ will become active at time $t + \delta$. Said differently, among the parents that succeed in satisfying the activation condition, the one whose activation attempt arrives earliest, considering the time delay associated with each link, is the one that actually activates the node. Under the continuous time framework, it is unlikely that $v$ is activated simultaneously by multiple parent nodes at exactly time $t + \delta$, so we do not consider this possibility. Whether or not $u$ succeeds, it cannot make any further attempts to activate $v$ in subsequent rounds. The process terminates if no more activations are possible.
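The AsIC process under link delay can be sketched as a small event-driven simulation. This is only an illustrative sketch, not the authors' implementation: the function name `simulate_asic`, the dictionary-based graph representation, and the choice to start all seeds at time 0 are assumptions made for the example. Because the success draw and the delay draw are independent, the sketch draws success first and schedules an arrival only for successful attempts; the earliest arrival at an inactive node activates it.

```python
import heapq
import random


def simulate_asic(children, kappa, r, seeds, rng=None):
    """Simulate one AsIC diffusion result under link delay (illustrative sketch).

    children: dict mapping node -> list of child nodes (hypothetical format)
    kappa, r: dicts keyed by link (u, v): diffusion probability, delay rate
    seeds:    initially active nodes, assumed here to be active at time 0
    Returns a dict mapping each activated node to its activation time.
    """
    rng = rng or random.Random()
    activation = {}
    # Each queue entry (time, node) is a successful attempt arriving at `node`.
    queue = [(0.0, s) for s in seeds]
    heapq.heapify(queue)
    while queue:
        t, v = heapq.heappop(queue)
        if v in activation:
            continue  # an earlier arrival already activated v
        activation[v] = t
        for w in children.get(v, []):
            if w in activation:
                continue
            # v gets a single chance per link, succeeding with prob. kappa.
            if rng.random() < kappa[(v, w)]:
                delay = rng.expovariate(r[(v, w)])  # exponential link delay
                heapq.heappush(queue, (t + delay, w))
    return activation
```

Running it many times with different random seeds gives a Monte Carlo estimate of how far and how fast information spreads for given parameter values.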
2.3 Asynchronous Linear Threshold Model
As above, we first recall the LT model. In this model, for every node $v \in V$, we specify a weight $\omega_{u,v}$ ($> 0$) from its parent node $u$ in advance such that
$$\sum_{u \in B(v)} \omega_{u,v} \le 1.$$
The diffusion process from a given initial active set $S$ proceeds according to the following randomized rule. First, for any node $v \in V$, a threshold $\theta_v$ is chosen uniformly at random from the interval $[0, 1]$. At time-step $t$, an inactive node $v$ is influenced by each of its active parent nodes, $u$, according to weight $\omega_{u,v}$. If the total weight from active parent nodes of $v$ is no less than $\theta_v$, that is,
$$\sum_{u \in B_t(v)} \omega_{u,v} \ge \theta_v,$$
then $v$ will become active at time-step $t + 1$. Here, $B_t(v)$ stands for the set of all the parent nodes of $v$ that are active at time-step $t$. The process terminates if no more activations are possible.
The AsLT model is defined in a similar way to the AsIC model. In the AsLT model, in addition to the weight set $\{\omega_{u,v}\}$, we specify real values $r_{u,v}$ with $r_{u,v} > 0$ in advance for each link $(u, v)$. As for AsIC, we refer to $r_{u,v}$ as the time-delay parameter through link $(u, v)$. The diffusion process unfolds in continuous time $t$, and proceeds from a given initial active set $S$ in the following way. Each active parent $u$ of the node $v$ exerts its effect on $v$ with a time delay $\delta$ drawn from the exponential distribution with the delay parameter $r_{u,v}$. Suppose that the accumulated weight from the active parents of node $v$ has become no less than $\theta_v$ at time $t$ for the first time. Then, the node $v$ becomes active at $t$ without any delay and exerts its effect on each child with a delay associated with the corresponding link. This process is repeated until no more activations are possible.
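The AsLT process under link delay can be sketched in the same event-driven style. As before, this is a hedged illustration rather than the authors' code: `simulate_aslt`, the dictionary-based inputs, and starting seeds at time 0 are assumptions of the example. Each active parent's weight arrives at the child after an exponential delay, and the child activates as soon as the accumulated weight reaches its random threshold.

```python
import heapq
import random


def simulate_aslt(children, omega, r, seeds, rng=None):
    """Simulate one AsLT diffusion result under link delay (illustrative sketch).

    children: dict mapping node -> list of child nodes (hypothetical format)
    omega, r: dicts keyed by link (u, v): weight and delay rate
    seeds:    initially active nodes, assumed here to be active at time 0
    Returns a dict mapping each activated node to its activation time.
    """
    rng = rng or random.Random()
    seeds = list(seeds)
    nodes = set(children) | {w for ws in children.values() for w in ws} | set(seeds)
    # One threshold per node, drawn uniformly at random.
    theta = {v: rng.random() for v in sorted(nodes)}
    received = {v: 0.0 for v in nodes}  # weight accumulated so far
    activation = {s: 0.0 for s in seeds}
    queue = []  # entries (arrival time, parent, child)

    def emit(u, t):
        # Node u, active at time t, exerts its effect on each child after
        # an exponentially distributed link delay.
        for w in children.get(u, []):
            delay = rng.expovariate(r[(u, w)])
            heapq.heappush(queue, (t + delay, u, w))

    for s in seeds:
        emit(s, 0.0)
    while queue:
        t, u, w = heapq.heappop(queue)
        if w in activation:
            continue
        received[w] += omega[(u, w)]
        if received[w] >= theta[w]:
            activation[w] = t  # threshold reached for the first time
            emit(w, t)
    return activation
```

Unlike the AsIC sketch, a parent here never "fails": its weight always arrives, and whether the child activates depends only on the child's threshold.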
3 Learning Algorithms
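We define the diffusion parameter vector $\boldsymbol{\kappa}$ and the time-delay parameter vector $\boldsymbol{r}$ by
$$\boldsymbol{\kappa} = (\kappa_{u,v})_{(u,v) \in E}, \qquad \boldsymbol{r} = (r_{u,v})_{(u,v) \in E}$$
for the AsIC model, and the weight parameter vector $\boldsymbol{\omega}$ and the time-delay parameter vector $\boldsymbol{r}$ by
$$\boldsymbol{\omega} = (\omega_{u,v})_{(u,v) \in E}, \qquad \boldsymbol{r} = (r_{u,v})_{(u,v) \in E}$$
for the AsLT model. We next consider an observed data set of $M$ independent information diffusion results,
$$\{ D_m ;\ m = 1, \dots, M \}.$$
Here, each $D_m$ is a set of pairs of active node and its activation time in the $m$th diffusion result,
$$D_m = \{ (u, t_{m,u}),\ (v, t_{m,v}),\ \dots \}.$$
We denote by $t_{m,v}$ the activation time of node $v$ for the $m$th diffusion result. For each $D_m$, we denote the observed initial time by
$$t_m = \min \{ t_{m,v} ;\ (v, t_{m,v}) \in D_m \},$$
and the observed final time by
$$T_m \ge \max \{ t_{m,v} ;\ (v, t_{m,v}) \in D_m \}.$$
Note that $T_m$ is not necessarily equal to the final activation time. Hereafter, we express our observation data by
$$\mathcal{D}_M = \{ (D_m, T_m) ;\ m = 1, \dots, M \}.$$
For any $t \in [t_m, T_m]$, we set
$$C_m(t) = \{ v ;\ (v, t_{m,v}) \in D_m,\ t_{m,v} < t \}.$$
Namely, $C_m(t)$ is the set of active nodes before time $t$ in the $m$th diffusion result. For convenience, we use $C_m$ to refer to the set of all the active nodes in the $m$th diffusion result, i.e.,
$$C_m = \{ v ;\ (v, t_{m,v}) \in D_m \}.$$
Moreover, we define a set of non-active nodes with at least one active parent node for each $D_m$ by
$$\partial C_m = \{ v ;\ (u, v) \in E,\ u \in C_m,\ v \notin C_m \}.$$
For each node $v \in C_m \cup \partial C_m$, we define the following subset of parent nodes, each of which had a chance to activate $v$:
$$B_{m,v} = \begin{cases} B(v) \cap C_m(t_{m,v}) & \text{if } v \in C_m, \\ B(v) \cap C_m & \text{if } v \in \partial C_m. \end{cases}$$
Note that the underlying model behind the observed data is not available in reality. Thus, we investigate how the model affects the information diffusion results, and consider selecting a model which better explains the given observed data from the candidates, i.e., the AsIC and AsLT models. To this end, we first have to estimate the values of $\boldsymbol{r}$ and $\boldsymbol{\kappa}$ for the AsIC model, and the values of $\boldsymbol{r}$ and $\boldsymbol{\omega}$ for the AsLT model for the given $\mathcal{D}_M$.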
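The bookkeeping sets used in the likelihood derivations (the active nodes, the non-active nodes with at least one active parent, and, for each such node, the parents that had a chance to activate it) can be computed directly from one observed diffusion result. The sketch below is illustrative only; the function name `observation_sets` and the dictionary-based inputs are assumptions of the example, not part of the original paper.

```python
def observation_sets(parents, children, times):
    """Derive the bookkeeping sets for one diffusion result (illustrative sketch).

    parents, children: dicts mapping node -> list of parent / child nodes
    times: dict mapping each observed active node to its activation time
    Returns (active, boundary, chance) where
      active   = all observed active nodes,
      boundary = non-active nodes with at least one active parent,
      chance   = dict node -> set of parents that had a chance to activate it.
    """
    active = set(times)
    boundary = {
        w for v in active for w in children.get(v, []) if w not in active
    }
    chance = {}
    for v in active:
        # Only parents activated strictly before v could have activated v.
        chance[v] = {
            u for u in parents.get(v, []) if u in active and times[u] < times[v]
        }
    for v in boundary:
        # Every active parent had a chance (and did not activate v in time).
        chance[v] = {u for u in parents.get(v, []) if u in active}
    return active, boundary, chance
```

For example, with links a→b, a→c, b→c and an observed result in which only a (time 0) and b (time 1) are active, the boundary is {c} and both a and b had a chance to activate c.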
3.1 Learning Parameters of AsIC Model
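First, we propose a method of learning the model parameters from the observed data for the AsIC model. To estimate the values of $\boldsymbol{r}$ and $\boldsymbol{\kappa}$ from $\mathcal{D}_M$ for the AsIC model, we derive the likelihood function $\mathcal{L}(\boldsymbol{r}, \boldsymbol{\kappa}; \mathcal{D}_M)$ to use as the objective function.
First, for the $m$th information diffusion result, we consider any node $v \in C_m$ with $t_{m,v} > t_m$, and derive the probability density $h_{m,v}$ that the node $v$ is activated at time $t_{m,v}$. Note that $h_{m,v} = 1$ if $t_{m,v} = t_m$. Let $\mathcal{X}_{m,u,v}$ denote the probability density that a node $u \in B_{m,v}$ activates the node $v$ at time $t_{m,v}$, that is,
$$\mathcal{X}_{m,u,v} = \kappa_{u,v}\, r_{u,v} \exp\bigl(-r_{u,v}(t_{m,v} - t_{m,u})\bigr). \tag{1}$$
Let $\mathcal{Y}_{m,u,v}$ denote the probability that the node $v$ is not activated by a node $u \in B_{m,v}$ within the time-period $[t_{m,u}, t_{m,v}]$, that is,
$$\mathcal{Y}_{m,u,v} = 1 - \kappa_{u,v} \int_{t_{m,u}}^{t_{m,v}} r_{u,v} \exp\bigl(-r_{u,v}(t - t_{m,u})\bigr)\, dt = \kappa_{u,v} \exp\bigl(-r_{u,v}(t_{m,v} - t_{m,u})\bigr) + (1 - \kappa_{u,v}). \tag{2}$$
If there exist multiple active parents for the node $v$, i.e., $|B_{m,v}| > 1$, we need to consider the possibilities that each parent node succeeds in activating $v$ at time $t_{m,v}$. However, in the case of the continuous time delay model, we do not have to consider simultaneous activations by multiple active parents, owing to the continuity of time. Here, for any $u \in B_{m,v}$, let $h_{m,u,v}$ be the probability density that the node $u$ activates $v$ at time $t_{m,v}$ but all the other nodes in $B_{m,v}$ have failed in activating $v$ within the time-period $[t_m, t_{m,v}]$ for the $m$th information diffusion result. Then, we have
$$h_{m,u,v} = \mathcal{X}_{m,u,v} \prod_{z \in B_{m,v} \setminus \{u\}} \mathcal{Y}_{m,z,v}.$$
Since the probability density $h_{m,v}$ is given by $\sum_{u \in B_{m,v}} h_{m,u,v}$, we have
$$h_{m,v} = \sum_{u \in B_{m,v}} \mathcal{X}_{m,u,v} \prod_{z \in B_{m,v} \setminus \{u\}} \mathcal{Y}_{m,z,v} = \prod_{z \in B_{m,v}} \mathcal{Y}_{m,z,v} \sum_{u \in B_{m,v}} \mathcal{X}_{m,u,v} (\mathcal{Y}_{m,u,v})^{-1}. \tag{3}$$
Note that we are not able to know which node $u$ actually activated the node $v$. This can be regarded as a hidden structure.
Next, for the $m$th information diffusion result, we consider any link $(v, w) \in E$ such that $v \in C_m$ and $w \notin C_m$, and derive the probability $g_{m,v}$ that the node $v$ fails to activate its child nodes. Note that $g_{m,v} = 1$ if $F(v) \setminus C_m = \emptyset$. Let $g_{m,v,w}$ denote the probability that the node $w$ is not activated by the node $v$ within the observed time period $[t_m, T_m]$. We can easily derive the following equation:
$$g_{m,v,w} = \kappa_{v,w} \exp\bigl(-r_{v,w}(T_m - t_{m,v})\bigr) + (1 - \kappa_{v,w}). \tag{4}$$
Here we can naturally assume that each information diffusion process finished sufficiently earlier than the observed final time, i.e., $T_m \gg \max \{ t_{m,v} ;\ (v, t_{m,v}) \in D_m \}$. Thus, as $T_m \to \infty$ in Equation (4), we can assume
$$g_{m,v,w} = 1 - \kappa_{v,w}. \tag{5}$$
Therefore, the probability $g_{m,v}$ is given by
$$g_{m,v} = \prod_{w \in F(v) \setminus C_m} g_{m,v,w}. \tag{6}$$
By using Equations (3) and (6), and the independence properties, we can define the likelihood function $\mathcal{L}(\boldsymbol{r}, \boldsymbol{\kappa}; \mathcal{D}_M)$ with respect to $\boldsymbol{r}$ and $\boldsymbol{\kappa}$ by
$$\mathcal{L}(\boldsymbol{r}, \boldsymbol{\kappa}; \mathcal{D}_M) = \sum_{m=1}^{M} \sum_{v \in C_m} \bigl( \log h_{m,v} + \log g_{m,v} \bigr). \tag{7}$$
In this paper, we focus on Equation (5) for simplicity, but we can easily modify our method to cope with the general one (i.e., Equation (4)). Thus, our problem is to obtain the values of $\boldsymbol{r}$ and $\boldsymbol{\kappa}$ which maximize Equation (7). For this estimation problem, we derive a method based on an iterative algorithm in order to stably obtain its solution. The details of the parameter update algorithm are given in Appendix A.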
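As a hedged illustration of how the objective in Equation (7) might be evaluated for given parameter values, the sketch below assumes the link-delay form with exponential delays: each candidate parent $u$ contributes a density $\kappa_{u,v} r_{u,v} e^{-r_{u,v}(t_v - t_u)}$, each non-activating parent a survival factor $\kappa_{u,v} e^{-r_{u,v}(t_v - t_u)} + 1 - \kappa_{u,v}$, and each failed link to a never-activated child a factor $1 - \kappa_{v,w}$ (the simplification of Equation (5)). The function name `asic_log_likelihood` and the input format are assumptions of the example; nodes observed at the initial time, or without any earlier active parent, are treated as seeds and contribute no density term. Diffusion probabilities are assumed to lie strictly between 0 and 1.

```python
import math


def asic_log_likelihood(episodes, parents, children, kappa, r):
    """Evaluate an AsIC log-likelihood under link delay (illustrative sketch).

    episodes: list of dicts, each mapping an active node to its activation time
    parents, children: dicts mapping node -> list of parent / child nodes
    kappa, r: dicts keyed by link (u, v): diffusion probability, delay rate
    """
    total = 0.0
    for times in episodes:
        t0 = min(times.values())  # observed initial time of this episode
        for v, tv in times.items():
            # Parents that were active strictly before v had a chance.
            cands = [u for u in parents.get(v, [])
                     if u in times and times[u] < tv]
            if tv > t0 and cands:
                ratio_sum = 0.0
                log_prod_y = 0.0
                for u in cands:
                    decay = math.exp(-r[(u, v)] * (tv - times[u]))
                    x = kappa[(u, v)] * r[(u, v)] * decay      # density term
                    y = kappa[(u, v)] * decay + 1.0 - kappa[(u, v)]  # survival
                    ratio_sum += x / y
                    log_prod_y += math.log(y)
                # log of (prod of y) * (sum of x / y)
                total += log_prod_y + math.log(ratio_sum)
            # Failed attempts on children that never became active.
            for w in children.get(v, []):
                if w not in times:
                    total += math.log(1.0 - kappa[(v, w)])
    return total
```

A function like this can be used to check an iterative parameter update numerically: the returned value should not decrease from one iteration to the next if the update is a valid likelihood-ascent step.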
3.2 Learning Parameters of AsLT Model
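Next, we propose a method of learning the model parameters from the observed data for the AsLT model. Similarly to the AsIC model, we first derive the likelihood function $\mathcal{L}(\boldsymbol{r}, \boldsymbol{\omega}; \mathcal{D}_M)$ with respect to $\boldsymbol{r}$ and $\boldsymbol{\omega}$. For the sake of technical convenience, we introduce a slack weight $\omega_{v,v}$ for each node $v \in V$ such that
$$\omega_{v,v} + \sum_{u \in B(v)} \omega_{u,v} = 1.$$
Here note that we can regard each weight $\omega_{\ast,v}$ as a multinomial probability since a threshold $\theta_v$ is chosen uniformly at random from the interval $[0, 1]$ for each node $v$.
First, for the $m$th information diffusion result, we fix any node $v \in C_m$ with $t_{m,v} > t_m$, and derive the probability density $h_{m,v}$ that the node $v$ is activated at time $t_{m,v}$. Note that $h_{m,v} = 1$ if $t_{m,v} = t_m$. Suppose any parent node $z \in B_{m,v}$ exerts its effect on $v$ with a delay $\delta_{z,v}$. Further suppose that the threshold $\theta_v$ is first exceeded when the effect of $u \in B_{m,v}$ reaches $v$ after the delay $\delta_{u,v}$. We define the subset $B_{m,v}[u]$ of $B_{m,v}$ by
$$B_{m,v}[u] = \{ z \in B_{m,v} ;\ t_{m,z} + \delta_{z,v} < t_{m,u} + \delta_{u,v} \}.$$
Then, we have
$$\sum_{z \in B_{m,v}[u]} \omega_{z,v} < \theta_v \le \omega_{u,v} + \sum_{z \in B_{m,v}[u]} \omega_{z,v}.$$
This implies that the probability that $\theta_v$ is chosen from this range is $\omega_{u,v}$. Let $\mathcal{X}_{m,u,v}$ denote the probability density that node $u$ activates node $v$ at time $t_{m,v}$. Then, we have
$$\mathcal{X}_{m,u,v} = \omega_{u,v}\, r_{u,v} \exp\bigl(-r_{u,v}(t_{m,v} - t_{m,u})\bigr). \tag{8}$$
Since the probability density $h_{m,v}$ is given by $\sum_{u \in B_{m,v}} \mathcal{X}_{m,u,v}$, we have
$$h_{m,v} = \sum_{u \in B_{m,v}} \omega_{u,v}\, r_{u,v} \exp\bigl(-r_{u,v}(t_{m,v} - t_{m,u})\bigr). \tag{9}$$
Next, for the $m$th information diffusion result, we consider any node $v \in \partial C_m$, and derive the probability $g_{m,v}$ that node $v$ is not activated within the observed time period $[t_m, T_m]$. We can calculate $g_{m,v}$ as
$$g_{m,v} = 1 - \sum_{u \in B_{m,v}} \omega_{u,v} \int_{t_{m,u}}^{T_m} r_{u,v} \exp\bigl(-r_{u,v}(t - t_{m,u})\bigr)\, dt = 1 - \sum_{u \in B_{m,v}} \omega_{u,v} \Bigl( 1 - \exp\bigl(-r_{u,v}(T_m - t_{m,u})\bigr) \Bigr). \tag{10}$$
Therefore, by using Equations (9) and (10), and the independence properties, we can define the likelihood function $\mathcal{L}(\boldsymbol{r}, \boldsymbol{\omega}; \mathcal{D}_M)$ with respect to $\boldsymbol{r}$ and $\boldsymbol{\omega}$ by
$$\mathcal{L}(\boldsymbol{r}, \boldsymbol{\omega}; \mathcal{D}_M) = \sum_{m=1}^{M} \Bigl( \sum_{v \in C_m} \log h_{m,v} + \sum_{v \in \partial C_m} \log g_{m,v} \Bigr). \tag{11}$$
Thus, our problem is to obtain the time-delay parameter vector $\boldsymbol{r}$ and the weight parameter vector $\boldsymbol{\omega}$, which together maximize Equation (11). The details of the parameter update algorithm are given in Appendix B.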
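The objective in Equation (11) can be evaluated in a similar way. The sketch below is a hedged illustration under the link-delay assumption with exponential delays: an activated node contributes the sum over its candidate parents of $\omega_{u,v} r_{u,v} e^{-r_{u,v}(t_v - t_u)}$, and a non-activated node with active parents contributes one minus the total weight that has already arrived by the observed final time. The function name `aslt_log_likelihood` and the `(times, T)` episode format are assumptions of the example; nodes observed at the initial time, or without any earlier active parent, are treated as seeds.

```python
import math


def aslt_log_likelihood(episodes, parents, children, omega, r):
    """Evaluate an AsLT log-likelihood under link delay (illustrative sketch).

    episodes: list of (times, T) pairs; `times` maps each active node to its
              activation time and T is the observed final time
    parents, children: dicts mapping node -> list of parent / child nodes
    omega, r: dicts keyed by link (u, v): weight and delay rate
    """
    total = 0.0
    for times, T in episodes:
        t0 = min(times.values())  # observed initial time of this episode
        # Activated nodes: density of being activated at the observed time.
        for v, tv in times.items():
            cands = [u for u in parents.get(v, [])
                     if u in times and times[u] < tv]
            if tv > t0 and cands:
                h = sum(
                    omega[(u, v)] * r[(u, v)]
                    * math.exp(-r[(u, v)] * (tv - times[u]))
                    for u in cands
                )
                total += math.log(h)
        # Non-activated nodes with at least one active parent.
        boundary = {
            w for v in times for w in children.get(v, []) if w not in times
        }
        for v in boundary:
            arrived = sum(
                omega[(u, v)] * (1.0 - math.exp(-r[(u, v)] * (T - times[u])))
                for u in parents.get(v, []) if u in times
            )
            total += math.log(1.0 - arrived)
    return total
```

Note the structural contrast with the AsIC case: here the activation density is a plain sum over parents with no product of survival factors, which reflects the fact that each weight acts as a multinomial probability.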
3.3 Alternative Time-delay Models
In Section 2 we introduced one instance of time delay, i.e., link delay. In this subsection we discuss time delay phenomena in more depth for both the AsIC and the AsLT models.
3.3.1 Notion of Time-delay
Each parent of a node can be activated independently of the other parents and because the associated time delay from a parent to its child is different for every single pair, which parent actually affects the node in which order is more or less opportunistic.
To explicate the information diffusion process in a more realistic setting, we consider two examples, one associated with blog posting and the other associated with electronic mailing. In the case of blog posting, assume that some blogger $u$ posts an article. Then it is natural to think that it takes some time before another blogger $v$ comes to notice the posting. It is also natural to think that if the blogger $v$ reads the article, he or she takes an action to respond (is activated) because the act of reading the article is an active behavior. In this case, we can think that there is a delay in information diffusion from $u$ to $v$ (from $u$’s posting to $v$’s reading) but there is no delay in $v$ taking an action (from $v$’s reading to $v$’s posting). In the case of electronic mailing, assume that someone $u$ sends a mail to someone else $v$. It is natural to think that the mail is delivered to the receiver $v$ instantaneously. However, this does not necessarily mean that $v$ reads the mail as soon as it has been received because the act of receiving a mail is a passive behavior. In this case, we can think that there is no delay in information diffusion from $u$ to $v$ ($u$’s sending and $v$’s receiving) but there is a delay in $v$ taking an action (from $v$’s receiving to $v$’s sending). Further, when $v$ notices the mail, $v$ may decide to respond to it later. But before $v$ responds, a new mail may arrive which needs a prompt response and $v$ sends a mail immediately. We can think of this as an update of acting time.\footnote{Note that there are two actions here, reading and sending, but the activation time in the observed sequence data corresponds to the time $v$ sends a mail.} These are just two examples, but it appears worth distinguishing these two kinds of time delay and the update scheme (override of decision) in a more general setting.
In view of the discussion above, we define two types of delay: link delay and node delay. It is easiest to think that link delay corresponds to propagation delay and node delay corresponds to action delay. We further assume that they are mutually exclusive. This is a strong restriction as well as a strong simplification made by necessity, because the activation time of a node that we can observe is the sum of the activation time of its parent node and the two delays, and we cannot distinguish between these two delays. Thus we have to choose one of the two as occurring exclusively for the likelihood maximization to be feasible. In addition, in the case of node delay there are two types of activation: non-override and override. The former sticks to the initial decision on when to activate, and the latter can update (override) the time of activation multiple times to the earliest possible time each time one of the parents gets newly activated. In summary, node delay can go with either override or non-override, and link delay can only go with non-override.
Since we have already derived the likelihood function for link delay, here we consider the likelihood function for node delay. In this case, the time delay parameter vector $\boldsymbol{r}$ is expressed as $\boldsymbol{r} = (r_v)_{v \in V}$. The likelihood function for the AsIC model in the case of node delay is given by Equation (7), where $h_{m,v}$ is the probability density that node $v$ is activated at time $t_{m,v}$ for the $m$th information diffusion result, and $g_{m,v}$ is the probability that node $v$ does not activate its child nodes within the observed time period $[t_m, T_m]$ for the $m$th information diffusion result. Note that $g_{m,v}$ remains the same as in the case of link delay (see Equations (5) and (6)). The likelihood function for the AsLT model in the case of node delay is given by Equation (11), where the definition of $h_{m,v}$ is the same as above, and $g_{m,v}$ is the probability that the node $v$ is not activated within the observed time period $[t_m, T_m]$ for the $m$th information diffusion result. Note also that $g_{m,v}$ remains the same as in the case of link delay (see Equation (10)). Therefore, our task now is the following: we fix any node $v \in C_m$ with $t_{m,v} > t_m$, and present the probability density $h_{m,v}$ that node $v$ is activated at time $t_{m,v}$ for the $m$th information diffusion result in the case of node delay. Here, for simplicity, we order the active parent nodes $u \in B_{m,v}$ of node $v$ according to the time they were activated, and set
$$B_{m,v} = \{ u_1, u_2, \dots, u_K \}, \qquad t_{m,u_1} < t_{m,u_2} < \dots < t_{m,u_K}.$$
3.3.2 Alternative Asynchronous Independent Cascade Model
First, we derive $h_{m,v}$ for node delay with non-override and for node delay with override in the case of the AsIC model.
Node delay with non-override
There is no delay in propagating the information to the node $v$ from a parent node $u$, but there is a delay before the node $v$ actually gets activated. Assume that it is the $k$th parent $u_k$ in the activation-time ordering that first succeeded in activating the node $v$ (more precisely, in satisfying the activation condition). Since there is no link delay and no override, all the other parents that had become active before $u_k$ must have failed in activating $v$ (more precisely, in satisfying the activation condition). Since the node $v$ decides when to actually activate itself at the time the node $u_k$ succeeded in satisfying the activation condition and would not change its mind, other parents that were activated after $u_k$ could have no effect on the node $v$. Thus, the probability density $h_{m,v}$ is given by
Node delay with override
In this case the actual activation time is allowed to be updated. For example, suppose that the node first succeeded in satisfying the activation condition of the node and the node decided to activate itself at time . At some time later but before , other parent also succeeded in satisfying the activation condition of the node . Then the node is allowed to change its actual activation time to time if it is before . Thus, the probability density is given by
Here, is the probability density that node activates node at time , and is obtained by Equation (12). Also, is the probability that node is not activated by node within the timeperiod , and is obtained by
(see Equation (2)). Note that this formula is equivalent to Equation (3) except that the parameter is replaced by .
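The difference between the two node-delay variants of the AsIC model can be illustrated with a small simulation of a single node. This is only a sketch under stated assumptions: the function name, the uniform diffusion probability `p`, and the exponential delay with rate `r` are illustrative choices, not definitions taken from this section.

```python
import random

def asic_node_delay_time(parent_times, p, r, override, rng):
    """Sample the activation time of one node under AsIC with node delay.

    parent_times: activation times of the node's active parents.
    p: diffusion probability (assumed uniform over links).
    r: rate of the delay distribution (assumed exponential).
    Returns None if no parent satisfies the activation condition.
    """
    scheduled = None
    for t_u in sorted(parent_times):
        if scheduled is not None and t_u >= scheduled:
            break  # the node is already active; later parents have no effect
        if rng.random() >= p:
            continue  # this parent failed to satisfy the activation condition
        candidate = t_u + rng.expovariate(r)
        if scheduled is None:
            scheduled = candidate  # first success fixes an activation time
            if not override:
                break  # non-override: the time is never revised
        else:
            scheduled = min(scheduled, candidate)  # override: may move earlier
    return scheduled
```

With the same parents, the override variant can only move the activation time earlier, so on average it activates the node sooner than the non-override variant.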
3.3.3 Alternative Asynchronous Linear Threshold Model
Next, we derive for node delay with non-override and for node delay with override in the case of the AsLT model.
Node delay with non-override
As soon as the parent node is activated, its effect is immediately exerted on its child . The delay depends on the node ’s choice. Suppose the node first became activated for the th parent according to the time ordering. Then by the same reasoning as in Section 3.2, the threshold is between and , and the probability density can be expressed as
where is the probability density that node activates node at time , and is obtained by Equation (8). Note that this formula is equivalent to Equation (9) except that the parameter is replaced by .
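The threshold argument for the non-override case can be sketched in code. The sketch assumes the standard Linear Threshold convention that the threshold is drawn uniformly from [0, 1], and an exponential delay with rate `r`; the function names and these distributional choices are assumptions made for illustration.

```python
import random

def aslt_trigger_parent(weights, threshold):
    """Return the index of the parent whose activation first pushes the
    accumulated weight up to the threshold, or None if it is never reached.

    weights: link weights of the active parents, ordered by activation time.
    """
    total = 0.0
    for k, w in enumerate(weights):
        total += w
        if total >= threshold:
            return k
    return None

def aslt_node_delay_time(parent_times, weights, r, rng):
    """Sample the activation time under AsLT with node delay, non-override."""
    order = sorted(range(len(parent_times)), key=lambda i: parent_times[i])
    threshold = rng.random()  # assumed uniform threshold
    k = aslt_trigger_parent([weights[i] for i in order], threshold)
    if k is None:
        return None
    # The delay starts when the triggering parent becomes active.
    return parent_times[order[k]] + rng.expovariate(r)
```

Under the uniform-threshold assumption, the probability that the k-th parent in time order is the triggering one equals that parent's link weight, since the threshold must fall between the accumulated weights before and after it.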
Node delay with override
Here, multiple updates of the activation time of the node are allowed. Suppose that the node ’s threshold is first exceeded by receiving the effect of the parent . All the parents that have become activated after that can still influence the updates. Among these parents, let be the one which succeeded in activating the node and let be the other parents that failed. Then, the probability density that the node is activated at time by the node , which gets activated later than , for which the threshold is first exceeded, is given by
Thus, we obtain
Note that this formula is substantially different from Equation (9).
3.3.4 Summary of Different Time Delay Models
We note that for link delay and node delay with override is identical for the AsIC model and that for link delay and node delay with non-override is identical for the AsLT model, except for a minor notational difference in the time delay parameter in both. Thus, there are basically two cases for each model. We do not show here how the different time delay models affect diffusion phenomena. There are indeed some differences in the transient period (the first 10 to 30 time units, measured in units of the average time delay). (Footnote 4: The difference between the time delay models vanishes once an equilibrium is reached.) As expected, the difference becomes larger as the values of the diffusion parameters become larger. For more details, see the work of \citeA{saito:acml10}.
We showed the parameter learning algorithms only for the case of link delay for both the AsIC and the AsLT models in the Appendix. It is straightforward to derive similar algorithms for the other time delay models.
3.4 Assumptions Introduced in Parameter Setting
The formulations so far assumed that the parameters ( and ) that appear in both the AsIC and the AsLT models depend on the individual link . (Footnote 5: To be more precise, we assumed that in the case of node delay. A simplification can be made accordingly in this case as well.) The number of parameters is thus equal to the number of links, which is huge for any realistic social network. This means that we would need a prohibitively large amount of observation data, passing through each link at least several times, to obtain accurate estimates of these parameters that do not overfit the data. This is not realistic, so we introduce a few alternative simplifying assumptions to avoid this overfitting problem.
The simplest one would be to assume that each of the parameters and is represented by a single variable for the whole network. For a diffusion probability, we assume a uniform value for all links. For a weight we assume a uniform coefficient such that , i.e., the weight is proportional to the reciprocal of the number of ’s parents. This is the simplest realization that satisfies the constraint . As shown later in Section 6.3.2, this is a reasonable approximation for discussing information diffusion for a specific topic. The next simplification would be to divide (or ) into subsets (or ) and assign the same value to each parameter within each subset. For example, we may divide the nodes into two groups: those that strongly influence others and those that do not, or we may divide the nodes into another two groups: those that are easily influenced by others and those that are not. Links connecting these nodes can accordingly be divided into subsets. If there is some background knowledge about the node grouping, our method can make the best use of it. Obtaining such background knowledge is also an important research topic in knowledge discovery from social networks. Yet another simplification, which looks more realistic, would be to focus on the attributes of each node, assume that there is a generic dependency between the parameter values of a link and the attribute values of the connected nodes, and learn this dependency rather than learning the parameter values directly from the data. In \citeA{saito:ismis11} we adopted this approach assuming a particular class of attribute dependency, and confirmed that the dependency can be correctly learned even if the number of parameters is several tens of thousands. Learning a function is much more realistic and does not require such a huge amount of data. In this way the parameters can still take different values for each link (or node).
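As a minimal sketch of the simplest setting for the link weights, the following assigns to every link a weight equal to a single coefficient divided by the number of parents of the target node. The function name `uniform_weights`, the link representation as `(u, v)` pairs without duplicates, and the coefficient name `q` are assumptions made for illustration.

```python
def uniform_weights(links, q):
    """Assign weight q / (number of parents of v) to each directed link (u, v)."""
    n_parents = {}
    for _, v in links:
        n_parents[v] = n_parents.get(v, 0) + 1
    return {(u, v): q / n_parents[v] for u, v in links}

links = [("a", "c"), ("b", "c"), ("a", "b")]
weights = uniform_weights(links, 0.9)
# The weights of the links entering any node sum to q, so the sum stays
# at or below 1 whenever q <= 1.
incoming_c = sum(w for (u, v), w in weights.items() if v == "c")
```

Because the incoming weights of each node sum to the coefficient itself, a single value no greater than 1 keeps the incoming weight sum within the usual Linear Threshold bound.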
4 Behavioral Difference between the AsIC and the AsLT Models
4.1 Data Sets and Parameter Setting
We employed four datasets of large real networks (all bidirectionally connected). The first one is a trackback network of Japanese blogs used by \citeA{kimura:tkdd} and has nodes and directed links (the blog network). The second one is a network of people derived from the “list of people” within Japanese Wikipedia, also used by \citeA{kimura:tkdd}, and has nodes and directed links (the Wikipedia network). The third one is a network derived from the Enron Email Dataset (Klimt & Yang, 2004) by extracting the senders and the recipients and linking those that had bidirectional communications. It has nodes and directed links (the Enron network). The fourth one is a coauthorship network used by \citeA{palla} and has nodes and directed links (the coauthorship network). These networks are confirmed to satisfy the typical characteristics of social networks, e.g., a power-law degree distribution, a high clustering coefficient, etc.
In these experiments, to make a fair comparison, we set the value of the diffusion probability (AsIC) and the value of the link weight (AsLT) such that they are consistent in the following sense under the simplest assumption: . Thus, and for any , where is the average outdegree of the network. Accordingly, the value of () is given as 0.15, 0.04, 0.1, and 0.32 for the Blog, the Wikipedia, the Enron, and the Coauthorship networks, respectively.
We compare the influence degrees obtained by the AsIC and the AsLT models from various angles. Here, the influence degree of a node is defined to be the expected number of active nodes at the end of the information diffusion process that starts from a single initial active node . Since the time-delay parameter vector does not affect the influence degree (because it is defined at the end of the diffusion process), that is, is invariant with respect to the value of , we can evaluate the value of by the influence degree of the corresponding basic IC or LT model. We estimated the influence degree by the bond percolation based method (Kimura et al., 2010), in which we used bond percolation processes according to \citeA{kempe:kdd}, meaning that the expectation is approximated by the empirical mean of independent simulations.
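The idea of approximating the influence degree by an empirical mean can be sketched for the basic IC model as follows. This is a plain Monte Carlo simulation for a single seed node, not the bond percolation based method cited above (which is designed to be more efficient); the function name, the adjacency-dictionary representation, and the uniform diffusion probability `p` are assumptions.

```python
import random
from collections import deque

def influence_degree_ic(adj, seed_node, p, n_runs, rng):
    """Estimate the expected final number of active nodes (seed included)
    under the basic IC model with a uniform diffusion probability p."""
    total = 0
    for _ in range(n_runs):
        active = {seed_node}
        queue = deque([seed_node])
        while queue:
            u = queue.popleft()
            for v in adj.get(u, ()):
                # Each link is tried at most once, when u becomes active.
                if v not in active and rng.random() < p:
                    active.add(v)
                    queue.append(v)
        total += len(active)
    return total / n_runs

chain = {"a": ["b"], "b": ["c"], "c": []}
estimate = influence_degree_ic(chain, "a", 0.5, 20000, random.Random(0))
```

On the three-node chain with `p = 0.5` the exact value is 1 + 0.5 + 0.25 = 1.75, which gives a quick sanity check for the estimator.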
4.2 Experimental Results
First, we investigated which of the AsIC and AsLT models can spread information more widely. Figure 1 shows the cumulative probability of influence degree, , for the AsIC and the AsLT models. At a glance we can see that the AsIC model has far more nodes of high influence degree than the AsLT model. Further, we examined the difference in influence degree between the two models for the respective influential nodes of both the AsIC and the AsLT models. We ranked nodes according to the influence degree of AsIC and AsLT, respectively, and extracted the top influential nodes for each. Figures 2 and 3 display the respective influence degree of rank node of AsIC and AsLT (). Here, the red line indicates the influence degree of AsIC, and the blue line indicates the influence degree of AsLT. We can see that the difference in influence degree between the two models is quite large for these influential nodes. This clearly indicates that information can diffuse more widely under the AsIC model than under the AsLT model. This can be attributed to the scale-free nature (having power-law degree distributions) of the four real networks used in the experiments. It is known (Albert et al., 2000) that hub nodes, defined as those having many outgoing links, play an important role in spreading information widely in a scale-free network. By the information diffusion mechanisms of the AsIC and AsLT models, it is more difficult for the AsLT model to transmit information to hub nodes than for the AsIC model in a scale-free network. Therefore, the result is understandable.
Next, we compared the influential nodes of the AsIC and the AsLT models. The results are shown in Figures 4 and 5. For both figures the horizontal axes are node ranking (), and the actual ranking depends on which model we are considering, e.g., the rank node for AsIC is different from the same rank node for AsLT. The vertical axes show influence degree in both figures, but it is the influence degree for AsIC in Figure 4 and that for AsLT in Figure 5. The red line corresponds to nodes for AsIC and the blue line corresponds to nodes for AsLT. Thus, by definition of node ranking, the influence degree of AsIC (red thick line) is non-increasing in Figure 4 and the influence degree of AsLT (blue thick line) is non-increasing in Figure 5. However, the corresponding line for AsLT (blue line) in Figure 4 and that for AsIC (red line) in Figure 5 are very irregular. This means that almost all the nodes that are influential for the AsIC model are different from the nodes that are influential for the AsLT model, and vice versa. There is a small number of influential nodes that overlap between the two models, but how similar the influential nodes are (the degree of overlap) depends on the characteristics of the network structure, and no general tendency can be extracted.
5 Learning Performance Evaluation
5.1 Data Sets and Parameter Setting
We used the same four datasets as in Section 4, and again employed the simplest approximation for the parameter setting, but with a slight difference, following the work of \citeA{saito:acml09}.
We set , for AsIC and , for AsLT. Under this assumption there is no need for the observation sequence data to pass through every link or node at least once, and desirably several times. This drastically reduces the amount of data we have to generate as training data to learn the parameters. Our task is then to estimate the values of these parameters from the training data. Following the work of \citeA{kempe:kdd}, we set to a value slightly smaller than . Thus, the true value of was set to for the coauthorship network, for the blog and Enron networks, and for the Wikipedia network. The true value of was set to for every network to achieve reasonably long diffusion results, and the true value of was set to . (Footnote 6: A different value of corresponds to a different scaling of the time axis under the assumption of uniform value.)
Using these parameter values, we generated a diffusion sequence from a randomly selected initial active node for each of the AsIC and the AsLT models in the four networks. We then constructed a training dataset such that each diffusion sequence has at least 10 nodes. Parameter updating is terminated when either the iteration number reaches its maximum (set to 100) or the following condition is first satisfied: for AsIC and for AsLT, where the superscript indicates the value for the th iteration. In most cases, the above inequality is satisfied in fewer than 100 iterations. The converged values are rather insensitive to the initial parameter values, and we confirmed that the parameter updating algorithm stably converges to the correct values, i.e., the values we assumed to be the true values.
5.2 Parameter Estimation
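Table 1: Parameter estimation errors for the AsIC model.
Network  Number of active nodes  Estimation errors (one column per parameter)
Blog  1,163  0.019  0.026
Blog  5,151  0.018  0.014
Blog  10,322  0.011  0.011
Wikipedia  1,275  0.060  0.032
Wikipedia  5,386  0.013  0.009
Wikipedia  10,543  0.006  0.007
Enron  1,456  0.031  0.030
Enron  5,946  0.011  0.011
Enron  10,468  0.005  0.006
Coauthorship  1,203  0.028  0.022
Coauthorship  5,193  0.009  0.007
Coauthorship  10,132  0.006  0.006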
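Table 2: Parameter estimation errors for the AsLT model.
Network  Number of active nodes  Estimation errors (one column per parameter)
Blog  1,023  0.020  0.020
Blog  5,018  0.012  0.020
Blog  10,037  0.012  0.020
Wikipedia  1,018  0.032  0.024
Wikipedia  5,038  0.015  0.020
Wikipedia  10,025  0.006  0.017
Enron  1,017  0.023  0.014
Enron  5,054  0.013  0.011
Enron  10,024  0.007  0.010
Coauthorship  1,014  0.017  0.034
Coauthorship  5,023  0.017  0.029
Coauthorship  10,023  0.006  0.027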
To evaluate the proposed learning methods as a function of the number of observed active nodes, i.e., the amount of training data, we generated the training set for each of the AsIC and the AsLT models as follows. First, we specified the target number of active nodes , and then generated the training set by adding sequences one by one, each starting from a randomly chosen initial active node, until the total number of active nodes reached , skipping very short sequences (those with fewer than 10 nodes). In the experiments, we investigated the cases of . Let , and denote the true values of , and , respectively, and , and the estimated values of , and , respectively. We define the parameter estimation errors , and by
Tables 1 and 2 show the parameter estimation errors of the proposed learning methods for the AsIC model and the AsLT model, respectively, in the four networks as a function of the number of observed active nodes. Here, the results are averaged over five independent experiments. As can be expected, the error is progressively reduced as the number of active nodes becomes larger. The algorithm is guaranteed to converge, but not necessarily to the global optimum. In most cases, the number of iterations is less than 100. These results indicate that the algorithm converges to the correct solution in practice for all the parameters and for all the networks, which demonstrates the effectiveness of the proposed methods.
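The training set construction described above can be sketched as follows. The callable `generate_sequence` is a hypothetical stand-in for one simulated diffusion from a random initial node, and `relative_error` is an assumed form of the estimation error (the errors are later quoted as percentages, which suggests a relative measure, but the exact definition is not restated here).

```python
def build_training_set(generate_sequence, target_active, min_len=10):
    """Add diffusion sequences one by one until the total number of active
    nodes reaches target_active, skipping sequences shorter than min_len.

    generate_sequence: hypothetical callable returning one diffusion
    sequence (a list of activated nodes). It must eventually return
    sequences of at least min_len nodes, or this loop never terminates.
    """
    sequences, n_active = [], 0
    while n_active < target_active:
        seq = generate_sequence()
        if len(seq) < min_len:
            continue  # very short sequences are skipped
        sequences.append(seq)
        n_active += len(seq)
    return sequences

def relative_error(true_value, estimate):
    """Assumed error measure: absolute deviation relative to the true value."""
    return abs(true_value - estimate) / true_value
```

Note that the total number of active nodes usually overshoots the target slightly, which is consistent with the node counts listed in Tables 1 and 2 being a little above round numbers.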
Table 3: Parameter estimation errors when the training set is a single diffusion sequence (one row per parameter).
Network  Blog  Wikipedia  Enron  Coauthorship
AsIC  0.091 (0.121)  0.088 (0.132)  0.029 (0.020)  0.119 (0.173)
AsIC  0.064 (0.085)  0.043 (0.056)  0.022 (0.019)  0.121 (0.255)
AsLT  0.188 (0.219)  0.192 (0.272)  0.143 (0.140)  0.214 (0.194)
AsLT  0.078 (0.049)  0.069 (0.043)  0.077 (0.053)  0.086 (0.054)
Next, we investigated the performance of the proposed learning method when the training set is a single diffusion sequence. Table 3 shows the results for the four networks, where the results are averaged over independent experiments. Compared with Tables 1 and 2, the errors become larger. The average error of and for AsIC is 6% and 8%, and the average error of and for AsLT is 8% and 18%, respectively. The best result for AsIC is on the Enron network (2% for and 3% for ), and the best results for AsLT are on the Wikipedia network (7% for ) and the Enron network (14% for ). The worst result for AsIC is on the Coauthorship network (12% for and 11% for ), and the worst result for AsLT is also on the Coauthorship network (9% for and 21% for ). In general, the accuracy is better for AsIC than for AsLT. This is because the sequences are longer for AsIC. Further, is more difficult to estimate correctly than and . To see the difference in the learning result for each sequence in more depth, we plotted the number of active nodes as a function of time (the influence curve), and the values of the parameters learned, for AsIC and for AsLT, in Figures 6 and 7. (Footnote 7: The influence curve is different from the influence degree described in Section 4.1, which is the expected number of active nodes at the final time.) The length of each sequence varies considerably: some sequences are short and others are long. The color of the dots for the learned parameters goes from true blue to true red in proportion to the sequence length, i.e., the shortest sequence is true blue and the longest sequence is true red. From these results we can see that the algorithm learns the parameter values within 10% of the correct values if the sequence is reasonably long. For example, the Enron network generates long sequences from all the randomly chosen initial active nodes in the case of AsIC, and the learning accuracy is very good.
We conclude that, although it is not desirable, the parameter values can still be estimated from a single observation sequence if that is all that is available.
5.3 Node Ranking
We measure the influence of node by the influence degree for the diffusion model that has generated . We compared the high-ranked influential nodes for the true model, which uses the assumed true parameter values, with those obtained by 1) the proposed method that uses the learned parameter values, 2) four heuristics widely used in social network analysis (all computed from the network topology alone), and 3) the same proposed method in which an incorrect diffusion model is assumed, i.e., data generated by AsIC but learned assuming AsLT, and vice versa. Here again the influence degree is estimated by the bond percolation method (Kimura et al., 2007; Kimura et al., 2010), where we used bond percolation processes according to \citeA{kimura:tkdd} and \citeA{kimura:dmkd}.
We call the proposed method the model based method. We call it the AsIC model based method if it employs the AsIC model as the information diffusion model. We then learn the parameters of the AsIC model from the observed data , and rank nodes according to the influence degrees based on the learned model. The AsLT model based method is defined in the same way. Among the four heuristics we used, the first three are “degree centrality”, “closeness centrality”, and “betweenness centrality”. These are commonly used as influence measures in sociology (Wasserman & Faust, 1994), where the outdegree of node is defined as the number of links going out from , the closeness of node is defined as the reciprocal of the average distance between and other nodes in the network, and the betweenness of node is defined as the total number of shortest paths between pairs of nodes that pass through . The fourth is “authoritativeness” obtained by the “PageRank” method (Brin & Page, 1998). We considered this measure as one alternative since it is a well-known method for identifying authoritative or influential pages in a hyperlink network of web pages. This method has a parameter ; when we view it as a model of a random web surfer, corresponds to the probability with which a surfer jumps to a page picked uniformly at random (Ng et al., 2001). In our experiments, we used a typical setting of .
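As an illustration of the fourth heuristic, the random-surfer view of PageRank can be sketched with a simple power iteration. This is a generic textbook formulation, not necessarily the exact variant used in the experiments; the function name, the default jump probability `eps=0.15`, the fixed iteration count, and the uniform treatment of nodes without outgoing links are all assumptions.

```python
def pagerank(adj, nodes, eps=0.15, n_iter=100):
    """Power iteration for PageRank.

    eps: probability that the surfer jumps to a node picked uniformly
    at random (assumed value). Nodes without outgoing links spread their
    rank uniformly over all nodes (assumed convention).
    """
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(n_iter):
        new = {v: eps / n for v in nodes}
        for u in nodes:
            out = adj.get(u, [])
            targets = out if out else nodes
            share = (1.0 - eps) * rank[u] / len(targets)
            for v in targets:
                new[v] += share
        rank = new
    return rank

nodes = ["a", "b", "c"]
scores = pagerank({"a": ["c"], "b": ["c"], "c": ["a"]}, nodes)
ranking = sorted(nodes, key=scores.get, reverse=True)
```

Sorting the nodes by score gives the ranking that is compared against the model based rankings.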
In terms of extracting influential nodes from the network , we evaluated the performance of the ranking methods mentioned above by the ranking similarity within the rank , where and are the true set of top nodes and the set of top nodes for a given ranking method, respectively. We focused on the performance for high-ranked nodes since we are interested in extracting influential nodes. Figures 8 and 9 show the results for the AsIC and the AsLT models, respectively. For the diffusion model based methods, we plotted the average value of at for five independent experimental results. We see that the proposed method gives better results than the other methods for these networks, demonstrating the effectiveness of our proposed learning method. It is interesting to note that the model based method in which an incorrect diffusion model is used is as bad as, and in general worse than, the heuristic methods. The results imply that it is important to consider the information diffusion process explicitly in discussing influential nodes and also to identify the correct model of information diffusion for the task at hand, which is the same observation as in Section 4.
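A ranking similarity of this kind can be computed as the fraction of the true top-k nodes that also appear in a method's top-k set. The overlap-fraction form below is an assumption (a natural reading of the description above), and the function and argument names are illustrative.

```python
def ranking_similarity(true_scores, method_scores, k):
    """Fraction of overlap between the true top-k nodes and a method's top-k.

    Both arguments map each node to a score; higher means more influential.
    """
    top_true = set(sorted(true_scores, key=true_scores.get, reverse=True)[:k])
    top_method = set(sorted(method_scores, key=method_scores.get, reverse=True)[:k])
    return len(top_true & top_method) / k

true_scores = {"a": 5, "b": 4, "c": 3, "d": 1}
method_scores = {"a": 9, "b": 1, "c": 8, "d": 2}
similarity_at_2 = ranking_similarity(true_scores, method_scores, 2)
```

The value is 1 when the two top-k sets coincide and 0 when they are disjoint, regardless of the order within each set.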
6 Model Selection
Now we have a method to estimate the parameter values from the observation for each of the assumed models. In this section we discuss whether the proposed learning method can correctly identify which of the two models, AsIC or AsLT, the observed data come from, i.e., the model selection problem. We assume that the topic is the decisive factor in determining the parameter values and place a constraint that the parameters depend only on topics but not on nodes and links of the network , and differentiate different topics by assigning an index to topic .
Therefore, we set and for any link in the case of the AsIC model and and for any node and link in the case of the AsLT model. Note that and . Since we normally have only a very small number of observations for each , often only one, there is no way to learn the parameters without this constraint.
6.1 Model Selection based on Predictive Accuracy
We have to select the model that is more appropriate for the observed diffusion sequence. We decided to use predictive accuracy as the selection criterion. We cannot use an information theoretic criterion such as AIC (Akaike Information Criterion) (Akaike, 1978) or MDL (Minimum Description Length) (Rissanen, 1978) because we need to select one from models with completely different probability distributions. Moreover, for both models, it is quite difficult to efficiently calculate the exact activation probability of each node more than two information diffusion cascading steps ahead. To avoid these difficulties, we propose a method based on a holdout strategy, which attempts to predict the activation probabilities one step ahead and repeats this multiple times.
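The general shape of such a predictive-accuracy criterion can be sketched as follows. The sketch assumes that, for each held-out step, each candidate model has already produced the probability it assigned to the activation that was actually observed next; the score is then the average negative log of these probabilities, and the model with the smaller score is selected. The function names, the input format, and the probability floor are assumptions and do not reproduce the paper's exact criterion.

```python
import math

def mean_neg_log_prob(predicted_probs, floor=1e-12):
    """Average negative log probability assigned to the observed events.

    floor guards against log(0) and is an implementation convenience.
    """
    return sum(-math.log(max(p, floor)) for p in predicted_probs) / len(predicted_probs)

def select_model(probs_by_model):
    """Return the name of the model with the smallest score, plus all scores."""
    scores = {name: mean_neg_log_prob(ps) for name, ps in probs_by_model.items()}
    return min(scores, key=scores.get), scores

best, scores = select_model({
    "AsIC": [0.5, 0.4, 0.6],  # hypothetical one-step-ahead predictions
    "AsLT": [0.1, 0.2, 0.3],
})
```

Scoring only one step ahead avoids the multi-step activation probabilities that are hard to compute exactly for either model.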
We now group the observed data sequences into topics. Assume that each topic has sequences of observation, i.e., , where each is a set of pairs of active node and its activation time in the th diffusion result in the th topic. Accordingly we add a subscript to other variables, e.g., we denote to indicate the time that a node is activated in the th sequence of the th topic.
We learn the model parameters for each topic separately. This does not exclude treating each sequence in a topic separately and learning from each, i.e., , which would help in investigating whether the same topic propagates similarly or not. For simplicity, we assume that for each , the initial observation time is zero, i.e., for . Then, we introduce a set of observation periods
where is the number of observation data we want to predict sequentially and each has the following property: There exists some such that . Let denote the observation data in the period for the th diffusion result in the th topic, i.e.,
We also set ; , , . Let denote the set of parameters for either the AsIC or the AsLT models, i.e., or . We can estimate the values of from the observation data by using the learning algorithms in Sections 3.1 (Appendix A) and 3.2 (Appendix B). Let denote the estimated values of . Then, we can calculate the activation probability of node at time () using .
For each , we select the node and the time by
Note that is the first active node in . We evaluate the predictive performance for the node at time . Approximating the empirical distribution by
with respect to , we employ the Kullback-Leibler (KL) divergence
where and stand for Kronecker’s delta and Dirac’s delta function, respectively. Then, we can easily show
(13) 
By averaging the above KL divergence with respect to , we propose the following model selection criterion (see Equation (13)):
(14) 
where expresses the information diffusion model (i.e., the AsIC or the AsLT models). In our experiments, we adopted