Neural Diffusion Model for Microscopic Cascade Prediction
The prediction of information diffusion, or cascades, has attracted much attention over the last decade. Most cascade prediction work targets cascade-level macroscopic properties such as the final size of a cascade. Existing microscopic cascade prediction models, which focus on user-level modeling, either make strong assumptions about how a user gets infected by a cascade or limit themselves to a specific scenario where “who infected whom” information is explicitly labeled. The strong assumptions oversimplify the complex diffusion mechanism and prevent these models from fitting real-world cascade data well. Moreover, the methods that focus on specific scenarios cannot be generalized to the general setting where the diffusion graph is unobserved.
To overcome the drawbacks of previous works, we propose a Neural Diffusion Model (NDM) for general microscopic cascade prediction. NDM makes relaxed assumptions and employs deep learning techniques, including an attention mechanism and a convolutional network, for cascade modeling. Both advantages enable our model to go beyond the limitations of previous methods, better fit the diffusion data and generalize to unseen cascades. Experimental results on the diffusion prediction task over four realistic cascade datasets show that our model can achieve a relative improvement up to against the best performing baseline in terms of F1 score.
1 Introduction
Information diffusion is a ubiquitous and fundamental event in our daily lives, such as the spread of rumors, the contagion of viruses and the propagation of new ideas and technologies. The diffusion process, also called a cascade, has been studied over a broad range of domains. Though some works believe that even the eventual size of a cascade cannot be predicted, recent works [2, 3, 4] have shown the ability to predict the size, growth and many other key properties of a cascade. Nowadays the modeling and prediction of cascades play an important role in many real-world applications, e.g. product recommendation [5, 6, 7, 8, 9], epidemiology [10, 11], social networks [12, 13, 14] and the spread of news and opinions [15, 16, 17].
Most previous works on cascade prediction focus on the prediction of macroscopic properties such as the total number of users who share a specific photo and the growth curve of the popularity of a blog. However, macroscopic cascade prediction is a rough estimate of cascades and cannot be adapted for microscopic questions as shown in Fig. 1. Microscopic cascade prediction, which pays more attention to user-level modeling and prediction instead of cascade-level, is much more powerful than macroscopic prediction and allows us to apply user-specific strategies for real-world applications. For example, during the adoption of a new product, microscopic cascade prediction can help us deliver advertisements to those users that are most likely to buy the product at each stage. In this paper, we focus on the study of microscopic cascade prediction.
Though useful and powerful, the microscopic prediction of cascades faces great challenges because the real-world diffusion process could be rather complex and is usually only partially observed [11, 19]:
Complex mechanism. Since the mechanism of how a specific user gets infected (we use “infected” and “activated” interchangeably to indicate that a user is influenced by a cascade) is sophisticated, traditional cascade models based on strong assumptions and simple formulas may not be the best choice for microscopic cascade prediction. Existing cascade models [20, 21, 22, 23] which could be adopted for microscopic prediction are mostly grounded in the Independent Cascade (IC) model. The IC model assigns a static probability $p_{u,v}$ to each user pair $(u, v)$ with pairwise independence assumptions, where the probability $p_{u,v}$ indicates how likely user $v$ is to get infected by user $u$ when $u$ is infected. Other diffusion models [24, 25] make even stronger assumptions that the infected users are only determined by the source user. Though intuitive and easy to understand, these cascade models are based on strong assumptions and oversimplified probability estimation formulas, both of which limit their expressivity and ability to fit complex real-world cascade data. The complex mechanism of real-world diffusion encourages us to explore more sophisticated models, e.g. deep learning techniques, for cascade modeling.
Incomplete observation. On the other hand, cascade data is usually only partially observed, which means that we can only observe which users get infected without knowing who infected them. However, to the best of our knowledge, existing deep-learning based microscopic cascade models [27, 28] assume that the diffusion graph, where a user can only infect and get infected by its neighbors, is already known. For example, when we study the retweeting behavior on the Twitter network, “who infected whom” information is explicitly labeled in the retweet chain and the candidates for the next infected user are restricted to the neighboring users rather than the whole user set. In most diffusion processes, however, such as the adoption of a product or the contamination of a virus, the diffusion graph is unobserved [11, 19, 29]. Therefore, these methods consider a much simpler problem and cannot be generalized to the general setting where the diffusion graph is unknown.
To fill the gap in general microscopic cascade prediction and address the limitations of traditional cascade models, we propose a neural diffusion model based on relaxed assumptions and employ up-to-date deep learning techniques, i.e. attention mechanism and convolutional neural network, for cascade modeling. The relaxed assumptions enable our model to be more flexible and less constrained, and deep learning tools are good at capturing the complex and intrinsic relationships that are hard to characterize with hand-crafted features. Both advantages allow our model to go beyond the limitations of traditional methods based on strong assumptions and oversimplified formulas and better fit the complicated cascade data. Following the experimental settings of Embedded IC, we conduct experiments on the diffusion prediction task over four realistic cascade datasets to evaluate the performance of our proposed model and other state-of-the-art baseline methods. Experimental results show that our model can achieve a relative improvement up to against the best performing baseline in terms of F1 score.
To conclude, our contributions are three-fold:
To the best of our knowledge, our work is the first attempt to employ deep learning techniques for the general microscopic cascade prediction problem where the diffusion graph is unknown.
We design a neural diffusion model based on assumptions that are relaxed compared with the pairwise independence assumption in traditional cascade models, which allows our model to better fit real-world cascades and generalize to unseen data.
Experimental results on diffusion prediction task over four realistic datasets demonstrate the effectiveness and robustness of our proposed model. Compared with the best performing baseline, our model can achieve a relative improvement up to on F1 score.
2 Related Works
We organize related works into macroscopic and microscopic cascade prediction methods. In terms of methodology, our work is also related to network representation learning methods.
2.1 Macroscopic Cascade Prediction
Most previous works on cascade prediction focused on macroscopic level prediction such as the eventual size of a cascade and the growth curve of popularity. Macroscopic cascade prediction methods can further be classified into feature-based approaches, generative approaches, and deep-learning based approaches. Feature-based approaches formalized the task as a classification problem [30, 2] or a regression problem [31, 32] by applying SVM, logistic regression and other machine learning algorithms on hand-crafted features including temporal and structural features. Generative approaches considered the growth of cascade size as an arrival process of infected users and employed stochastic processes, such as Hawkes self-exciting point process [34, 4], for modeling. With the success of deep learning techniques in various applications, deep-learning based approaches, e.g. DeepCas and DeepHawkes, were proposed to employ Recurrent Neural Network (RNN) for encoding cascade sequences into feature vectors instead of hand-crafted features. Compared with hand-crafted feature engineering, deep-learning based approaches have better generalization ability across different platforms and give better performance on the macroscopic prediction task.
2.2 Microscopic Cascade Prediction
Our work is more related to microscopic cascade prediction which focuses on user-level modeling and predictions. We classify related works into three categories: IC-based approaches, embedding-based approaches, and deep-learning based approaches.
The IC model [36, 12, 15, 37] is one of the most popular diffusion models and assumes an independent diffusion probability through each link. Extensions of the IC model further considered time delay information by incorporating a predefined time-decay weighting function, such as continuous time IC, CONNIE, NetInf and Netrate. Infopath was proposed to infer dynamic diffusion probabilities based on information diffusion data and study the temporal evolution of information pathways. MMRate inferred multi-aspect transmission rates by incorporating aspect-level user interactions and various diffusion patterns. All the above methods learned the probabilities from cascade sequences. Once a model is trained, it can be used for microscopic cascade prediction by simulating the generative process using Monte Carlo simulation.
Embedding-based approaches encoded each user into a parameterized real-valued vector and trained the parameters by maximizing an objective function. Embedded IC followed the pairwise independence assumption in the IC model and modeled the diffusion probability between two users by a function of their user embeddings. Other embedding-based diffusion models [24, 25] made even stronger assumptions that infected users are determined only by the source user and the content of the information item. As shown in previous work, such models with strong assumptions oversimplify the reality and generally show poor performance on real prediction tasks.
Existing deep-learning based microscopic cascade prediction approaches [27, 28] focused on the retweeting and sharing behaviors in a social network where “who infected whom” information is explicitly labeled in retweet chain. The next infected user candidates are also restricted to the neighboring users when the diffusion graph is known. However, the diffusion graph is usually unknown for most diffusion processes [11, 19]. For example, during the contamination of a virus, by whom a patient gets infected is unobserved. Existing deep-learning based methods considered a much simpler problem and cannot be generalized to a general setting where the diffusion graph is unobserved. To the best of our knowledge, our work is the first attempt to employ deep learning techniques for general microscopic cascade prediction problem where the diffusion graph is unknown.
2.3 Network Representation Learning
Researchers have explored many algorithms to represent nodes in a network by real-valued vectors. By projecting topology structure into vectors, we can apply machine learning techniques for many network applications, e.g. classification. Most network representation learning works focus on task-unspecific learning where the downstream task is unknown. Early-stage works use eigenvector computation to learn node embeddings. With the success of neural networks, people also employ simple neural networks for representation learning [41, 42]. For task-specific learning, a certain task such as classification or recommendation is specified and the network embeddings serve as the bottom layer of the model, as we will do in this paper. For the diffusion prediction task, Embedded IC was proposed and will be used as our baseline method.
3 Data Observation
In this section, we will conduct data observation on real-world datasets and investigate the intrinsic relationships between activated users in a diffusion sequence. Specifically, we will try to figure out whether consecutively activated users are more likely to be relevant and thus appear in more diffusion sequences together. We will first introduce the datasets.
3.1 Datasets
We collect four real-world cascade datasets that cover a variety of applications for evaluation. A cascade is an item or some kind of information that spreads through a set of users. Each cascade consists of a list of $(u, t)$ pairs, where each pair indicates that user $u$ gets infected at timestamp $t$.
Lastfm is a music streaming website. We collect the dataset from . The dataset contains the full history of nearly users and the songs they listened to over one year. We treat each song as an item spreading through users and remove the users who listen to no more than songs.
Irvine is an online community for students at University of California, Irvine collected from . Students can participate in and write posts on different forums. We regard each forum as an information item and remove the users who participate in no more than forums.
Memetracker (http://www.memetracker.org) collects a million news stories and blog posts and tracks the most frequent quotes and phrases, i.e. memes, for studying the migration of memes across a group of people. Each meme is considered to be an information item and each URL of a website or blog is regarded as a user. Following the settings of previous works, we filter the URLs to only keep the most active ones to alleviate the effect of noise.
Twitter dataset concerns tweets containing URLs posted on Twitter during October 2010. The complete tweeting history of each URL is collected. We consider each distinct URL as a spreading item over Twitter users. We filter out the users with no more than tweets. Note that the scale of the Twitter dataset is comparable to or even larger than the datasets used in previous neural-based cascade modeling algorithms [23, 28].
Note that none of the above datasets contains explicit evidence about by whom a user gets infected. Although follower relationships are available in the Twitter dataset, we still cannot trace by whom a user was encouraged to tweet a specific URL unless the user directly retweets.
TABLE I: Statistics of datasets.
| Dataset | # Users | # Links | # Cascades | Avg. Length |
We list the statistics of datasets in Table I. Since we have no interaction graph information between users, we assume that there exists a link between two users if they appear in the same cascade sequence. Each virtual “link” will be assigned a parameterized probability in traditional IC model and thus the space complexity of traditional methods is relatively high especially for large datasets. We also calculate the average cascade length of each dataset in the last column.
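The virtual-link construction and the average cascade length described above can be sketched in a few lines. This is only an illustrative sketch: the function name `build_virtual_links` and the toy cascades are assumptions made for the example and are not taken from the released code.

```python
from itertools import combinations


def build_virtual_links(cascades):
    """Return the set of unordered user pairs that co-occur in at least one cascade."""
    links = set()
    for cascade in cascades:
        # Sort the distinct users so that each pair has one canonical ordering.
        for u, v in combinations(sorted(set(cascade)), 2):
            links.add((u, v))
    return links


# Toy data (hypothetical): each cascade is a list of user ids ordered by infection time.
toy_cascades = [["a", "b", "c"], ["b", "c", "d"], ["a", "d"]]
links = build_virtual_links(toy_cascades)
avg_length = sum(len(c) for c in toy_cascades) / len(toy_cascades)
```

Because every co-occurring pair becomes a link, the number of links can grow quadratically with the number of users, which is the source of the space cost mentioned above for IC-style models.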
3.2 Statistical Analysis
Now we will try to reveal the correlation patterns between users by statistical results. By intuition, two consecutively infected users in a cascade sequence are more likely to have connections, e.g. one infects another, and thus participate in many other diffusion sequences together.
To demonstrate this statement, we consider the following statistic: given that users $u_i$ and $u_j$ are infected in a cascade sequence with $k$ users infected between them in this sequence, what is the expected number of cascade sequences in which users $u_i$ and $u_j$ both participate? Here $k = 0$ indicates that users $u_i$ and $u_j$ are consecutively activated. If the intuition is true, then the expectation should decrease as $k$ increases.
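One way to compute this statistic is sketched below. It is a minimal sketch under stated assumptions: each user appears at most once per cascade, the co-occurrence count includes the cascade in which the pair is observed, and the names `cooccurrence_by_gap` and `max_gap` are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations


def cooccurrence_by_gap(cascades, max_gap):
    """Average number of cascades shared by two users, grouped by the gap between them."""
    # shared[(u, v)] = number of cascades that contain both u and v.
    shared = Counter()
    for cascade in cascades:
        for u, v in combinations(sorted(set(cascade)), 2):
            shared[(u, v)] += 1

    totals = defaultdict(float)
    counts = defaultdict(int)
    for cascade in cascades:
        for i in range(len(cascade)):
            for j in range(i + 1, min(len(cascade), i + max_gap + 2)):
                gap = j - i - 1  # number of users infected between positions i and j
                pair = tuple(sorted((cascade[i], cascade[j])))
                totals[gap] += shared[pair]
                counts[gap] += 1
    return {gap: totals[gap] / counts[gap] for gap in sorted(counts)}


# Toy data (hypothetical).
toy = [["a", "b", "c"], ["a", "b", "d"], ["c", "a", "b"]]
result = cooccurrence_by_gap(toy, max_gap=1)
```

On this toy input the average for consecutively infected pairs (gap 0) is larger than for pairs with one user in between (gap 1), which is the pattern the intuition predicts.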
Fig. 2 presents the statistical results of all four datasets. Here we list the results for and the average for and . The statistics show that the expectations of co-occurrence times for are consistently larger than those for . Note that the gap is not very large for some datasets due to the long-tail effect. Therefore, we further present the results only for the top 5% user pairs in terms of co-occurrence times for each in Fig. 3. We can see the differences more clearly in this setting.
These statistical results demonstrate that consecutively infected users in a cascade sequence are more likely to be relevant. By “relevant” we mean that there could be a direct diffusion path between the two users, or that they are both likely to be infected by a third one. Also, we find that it is not only the most recently infected user that is relevant to the next infected one: as shown in Fig. 2 and 3, all recently infected users () could be relevant with minor differences (more relevant when fewer users are infected in between). We will build our model based on these findings in the next section.
4 Method
In this section, we will start by formalizing the problem and introducing the notations. Then we propose two heuristic assumptions according to the data observations as our basis and design a Neural Diffusion Model (NDM) using deep learning techniques. Finally, we will introduce the overall optimization function and other details of our model.
4.1 Problem Formalization
A cascade dataset records to whom and when an item spreads during its diffusion. For example, the item could be a product and the cascade records who bought the product at what moment. However, in most cases, there exists no explicit interaction graph between the users [37, 23]. Therefore, we have no explicit information about how a user was infected by other users.
Formally, given a user set $U$ and an observed cascade sequence set $C$, each cascade $c \in C$ consists of a list of users $(u_0^c, u_1^c, \dots, u_{m_c-1}^c)$ ranked by their infection time, where $m_c$ is the length of sequence $c$ and $u_j^c \in U$ is the $j$-th user in the sequence $c$. Note that we only consider the order of users getting infected and ignore the exact timestamps of infections in this paper as previous works did [23, 29, 28].
In this paper, our goal is to learn a cascade prediction model which can predict the next infected user $u_{j+1}$ given a partially observed cascade sequence $(u_0, u_1, \dots, u_j)$. The learned model is able to predict the entire infected user sequence based on the first few observed infected users and can thus be used for the microscopic cascade prediction illustrated in Figure 1. In our model, we add a virtual user called “Terminate” to the user set $U$. In the training phase, we append “Terminate” to the end of each cascade sequence and allow the model to predict the next infected user as “Terminate” to indicate that no more users will be infected in this cascade.
Further, we represent each user by a parameterized real-valued vector to project users into vector space. The real-valued vectors are also called embeddings. We denote the embedding of user $u$ as $emb(u) \in \mathbb{R}^d$ where $d$ is the dimension of embeddings. In our model, a larger inner product between the embeddings of two users indicates a stronger correlation between the users.
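The next-user prediction setup formalized above, including the virtual “Terminate” user, can be illustrated by how one cascade is expanded into training examples. This is a sketch only: the token string `"<Terminate>"` and the function name `next_user_examples` are hypothetical choices for the illustration.

```python
TERMINATE = "<Terminate>"  # hypothetical name for the virtual user


def next_user_examples(cascade):
    """Expand one cascade into (observed prefix, next infected user) pairs.

    The final pair has the virtual Terminate user as its target, which signals
    that no further user is infected in this cascade.
    """
    extended = list(cascade) + [TERMINATE]
    return [(extended[: j + 1], extended[j + 1]) for j in range(len(cascade))]


examples = next_user_examples(["u3", "u7", "u1"])
```

Each cascade of length m therefore yields m training examples, one per observed prefix.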
4.2 Model Assumptions
In the traditional Independent Cascade (IC) model setting, all previously infected users can activate a new user independently and equally regardless of the order in which they were infected. Many extensions of the IC model further considered time delay information, such as continuous time IC (CTIC) and Netrate. However, none of these models tried to find out which users are actually active and more likely to activate other users at the moment. To address this issue, we propose the following assumption.
Assumption 1. Given a recently infected user $u$, users that are strongly correlated to user $u$, including user $u$ itself, are more likely to be active.
This assumption is intuitive and straightforward. As a newly activated user, $u$ should be active and may infect other users. The users strongly correlated to user $u$ are probably the reason why user $u$ was activated recently and are thus more likely to be active than other users at the moment. We further propose the concept of “active user embedding” to characterize all such active users.
Definition 1. For each recently infected user $u$, we aim to learn an active user embedding $act(u)$ which represents the embedding of all active users related to user $u$, and can be used for predicting the next infected user in the next step.
The active user embedding $act(u)$ characterizes the potential active users related to the fact that user $u$ gets infected. From the data observations, we can see that all recently infected users could be relevant to the next infected one. Therefore, the active user embeddings of all recently infected users should contribute to the prediction of the next infected user, which leads to the following assumption.
Assumption 2. All recently infected users should contribute to the prediction of next infected user and be processed differently according to the order of getting infected.
Compared with the strong assumptions made by the IC-based and embedding-based methods introduced in related works, our heuristic assumptions allow our model to be more flexible and better fit cascade data. Now we will introduce how to build our model based on these two assumptions, i.e. extracting active users and unifying these embeddings for prediction.
4.3 Extracting Active Users with Attention Mechanism
For the purpose of computing active user embeddings, we propose to use an attention mechanism [48, 49] to extract the most likely active users by giving them more weight than other users. As shown in Figure 4, the active embedding $act(u_j)$ of user $u_j$ is computed as a weighted sum of the embeddings of previously infected users:
$$act(u_j) = \sum_{i=0}^{j} w_{ij}\, emb(u_i), \tag{1}$$
where the weight of $u_i$ is
$$w_{ij} = \frac{\exp\left(emb(u_i)\, emb(u_j)^{T}\right)}{\sum_{k=0}^{j} \exp\left(emb(u_k)\, emb(u_j)^{T}\right)}. \tag{2}$$
Note that $w_{ij} \in (0, 1)$ for every $i, j$ and $\sum_{i=0}^{j} w_{ij} = 1$. $w_{ij}$ is the normalized inner product between the embeddings of $u_i$ and $u_j$, which indicates the strength of correlation between them.
From the definition of active user embedding in Eq. 1, we can see that the user embeddings $emb(u_i)$ which have a larger inner product with $emb(u_j)$ will be allocated a larger weight $w_{ij}$. This formula naturally follows our assumption that users strongly correlated to user $u_j$, including user $u_j$ itself, should be paid more attention.
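The weighted-sum attention described above can be sketched in plain Python. This is a minimal sketch that assumes the normalization of inner products is a softmax over the users infected so far; the function names and toy embeddings are hypothetical, and the released PyTorch code may differ.

```python
import math


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def softmax(scores):
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def active_embedding(embs, j):
    """Active embedding of the j-th infected user, built from embs[0..j].

    Assumption: the attention weight of user i is the softmax-normalized inner
    product between embs[i] and embs[j].
    """
    weights = softmax([dot(embs[i], embs[j]) for i in range(j + 1)])
    dim = len(embs[j])
    act = [sum(w * embs[i][t] for i, w in enumerate(weights)) for t in range(dim)]
    return act, weights


# Toy embeddings (hypothetical): user 0 is more similar to user 2 than user 1 is.
embs = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
act, weights = active_embedding(embs, 2)
```

In the toy example the first user receives the largest weight because its embedding has the largest inner product with the embedding of the most recent user.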
To fully utilize the advantages of a neural model, we further employ multi-head attention to improve the expressiveness. Multi-head attention projects the user embeddings into multiple subspaces with different linear projections. Then multi-head attention performs the attention mechanism on each subspace independently. Finally, multi-head attention concatenates the attention embeddings in all subspaces and feeds the result into a linear projection again.
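Formally, in a multi-head attention with $H$ heads, the embedding of the $h$-th head is computed as
$$head_h(u_j) = \sum_{i=0}^{j} w_{ij}^{h}\, emb(u_i) W_h^{V}, \tag{3}$$
where
$$w_{ij}^{h} = \frac{\exp\left(emb(u_i) W_h^{K} \left(emb(u_j) W_h^{Q}\right)^{T}\right)}{\sum_{k=0}^{j} \exp\left(emb(u_k) W_h^{K} \left(emb(u_j) W_h^{Q}\right)^{T}\right)} \tag{4}$$
and $W_h^{V}, W_h^{Q}, W_h^{K}$ are head-specific linear projection matrices. In particular, $W_h^{Q}$ and $W_h^{K}$ can be seen to project user embeddings into receiver space and sender space respectively for asymmetric modeling.
Then we have the active user embedding
$$act(u_j) = [head_1(u_j), head_2(u_j), \dots, head_H(u_j)]\, W^{O}, \tag{5}$$
where $[\cdot]$ indicates the concatenation operation and $W^{O}$ projects the concatenated results into the $d$-dimensional vector space.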
Multi-head attention allows the model to “divide and conquer” information from different perspectives (i.e. subspaces) independently and thus is more powerful than the traditional attention mechanism.
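The three steps of multi-head attention (project, attend per subspace, concatenate and project back) can be sketched as follows. This is an illustrative sketch only: the query/key/value naming follows the common Transformer convention and is an assumption here, as are the function names and the toy matrices.

```python
import math


def matvec(x, W):
    """Row vector x (length n) times matrix W (n rows, m columns)."""
    return [sum(x[r] * W[r][c] for r in range(len(x))) for c in range(len(W[0]))]


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def multi_head_active_embedding(embs, j, heads, W_out):
    """heads: list of (W_q, W_k, W_v) triples; W_out maps the concatenation back to d dims."""
    concat = []
    for W_q, W_k, W_v in heads:
        query = matvec(embs[j], W_q)                          # receiver-side view of user j
        keys = [matvec(embs[i], W_k) for i in range(j + 1)]    # sender-side views
        values = [matvec(embs[i], W_v) for i in range(j + 1)]
        scores = [sum(a * b for a, b in zip(query, key)) for key in keys]
        weights = softmax(scores)
        head = [sum(w * v[t] for w, v in zip(weights, values)) for t in range(len(values[0]))]
        concat.extend(head)
    return matvec(concat, W_out)


# Toy setup (hypothetical): two identical heads with identity projections,
# averaged by W_out, which reduces to single-head attention.
identity = [[1.0, 0.0], [0.0, 1.0]]
heads = [(identity, identity, identity), (identity, identity, identity)]
W_out = [[0.5, 0.0], [0.0, 0.5], [0.5, 0.0], [0.0, 0.5]]
out = multi_head_active_embedding([[1.0, 0.0], [0.0, 1.0]], 1, heads, W_out)
```

Using separate query and key projections makes the score between two users asymmetric, which matches the receiver/sender interpretation given in the text.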
4.4 Unifying Active User Embeddings for Prediction with Convolutional Network
Different from previous works [50, 21] which directly give a time-decay weight that assumes larger weights for the most recently infected users, we propose to use a parameterized neural network to handle the active user embeddings at different positions. Compared with a predefined exponential-decay weighting function, a parameterized neural network can be learned automatically to fit the real-world dataset and capture the intrinsic relationship between the active user embedding at each position and the next infected user prediction. In this paper, we consider a Convolutional Neural Network (CNN) for this purpose.
CNN has been widely used in image recognition, recommender systems and natural language processing. CNN is a shift-invariant neural network and allows us to assign position-specific linear projections to the embeddings.
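Figure 4 illustrates an example where the window size of our convolutional layer . The convolutional layer first converts each active user embedding $act(u_{j-n})$ into a $|U|$-dimensional vector by a position-specific linear projection matrix $W_n^{C}$ for $n = 0, 1, \dots, win - 1$, where $win$ denotes the window size. Then the convolutional layer sums up the projected vectors and normalizes the summation by the softmax function.
Formally, given the partially observed cascade sequence $(u_0, u_1, \dots, u_j)$, the predicted probability distribution $pre_{j+1}$ is
$$pre_{j+1} = softmax\left(\sum_{n=0}^{win-1} act(u_{j-n})\, W_n^{C}\right), \tag{6}$$
where $softmax(x)[i] = \exp(x[i]) / \sum_{p} \exp(x[p])$ and $x[i]$ denotes the $i$-th entry of a vector $x$. Each entry of $pre_{j+1}$ represents the probability that the corresponding user gets infected at the next step.
Since the initial user $u_0$ plays an important role in the whole diffusion process, we further take $u_0$ into consideration:
$$pre_{j+1} = softmax\left(\sum_{n=0}^{win-1} act(u_{j-n})\, W_n^{C} + act(u_0)\, W_{init}^{C} \cdot F_{init}\right), \tag{7}$$
where $W_{init}^{C}$ is the projection matrix for the initial user and $F_{init} \in \{0, 1\}$ is a hyperparameter which controls whether to incorporate the initial user for prediction or not.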
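The prediction step described above (position-specific projections, summation, softmax, and the optional initial-user term) can be sketched as follows. This is a sketch under stated assumptions: users are integer ids, already infected users are masked before the softmax as mentioned in the implementation details, and all names and toy matrices are hypothetical.

```python
import math


def matvec(x, W):
    """Row vector x (length n) times matrix W (n rows, m columns)."""
    return [sum(x[r] * W[r][c] for r in range(len(x))) for c in range(len(W[0]))]


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


def predict_next(acts, conv_mats, infected_ids, W_init=None):
    """Distribution over users for the next infection.

    acts: active embeddings of the infected users, in infection order.
    conv_mats[n]: projection applied to the n-th most recent active embedding.
    W_init: optional projection for the initial user (the flag is "on" when given).
    """
    num_users = len(conv_mats[0][0])
    logits = [0.0] * num_users
    # Early in a cascade there may be fewer infected users than the window size.
    for n, W in enumerate(conv_mats[: len(acts)]):
        logits = [a + b for a, b in zip(logits, matvec(acts[-1 - n], W))]
    if W_init is not None:
        logits = [a + b for a, b in zip(logits, matvec(acts[0], W_init))]
    # Mask users that are already infected so they cannot be predicted again.
    masked = [-math.inf if uid in infected_ids else z for uid, z in enumerate(logits)]
    return softmax(masked)


# Toy setup (hypothetical): 3 users, embedding dimension 2, window size 2.
acts = [[1.0, 0.0]]
conv_mats = [
    [[0.0, 1.0, 2.0], [0.0, 0.0, 0.0]],
    [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]],
]
probs = predict_next(acts, conv_mats, infected_ids={0})
```

The sketch assumes at least one user remains unmasked, which holds in the described setting because the virtual “Terminate” user is never infected.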
4.5 Overall Architecture, Model Details and Learning Algorithms
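We naturally maximize the log-likelihood of all observed cascade sequences to build the overall optimization function:
$$\mathcal{L}(\Theta) = \sum_{c \in C} \sum_{j=0}^{m_c - 1} \log pre_{j+1}^{c}[u_{j+1}^{c}], \tag{8}$$
where $pre_{j+1}^{c}[u_{j+1}^{c}]$ is the predicted probability of the ground truth next infected user at position $j+1$ in cascade $c$ (with “Terminate” appended as $u_{m_c}^{c}$), and $\Theta$ is the set of all parameters that need to be learned, including the user embeddings $emb(u)$ for each $u \in U$, the projection matrices in multi-head attention $W_h^{V}, W_h^{Q}, W_h^{K}$ for $h = 1, \dots, H$ and $W^{O}$, and the projection matrices in the convolutional layer $W_{init}^{C}$ and $W_n^{C}$ for $n = 0, \dots, win - 1$.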
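As a small worked illustration of this objective, the log-likelihood of one cascade is the sum of the log-probabilities that the model assigns to the true next user at every step. The function name and the toy distributions below are hypothetical.

```python
import math


def cascade_log_likelihood(step_distributions, targets):
    """Sum of log-probabilities assigned to the true next user at each step.

    step_distributions[j] is the predicted distribution over user ids after
    observing the first j + 1 users; targets[j] is the id of the true next user.
    """
    return sum(math.log(dist[t]) for dist, t in zip(step_distributions, targets))


# Toy example (hypothetical): two prediction steps over three users.
dists = [[0.1, 0.7, 0.2], [0.25, 0.25, 0.5]]
targets = [1, 2]
ll = cascade_log_likelihood(dists, targets)  # log(0.7) + log(0.5)
```

Summing this quantity over all training cascades gives the value that training maximizes; in practice one would minimize its negative with a gradient-based optimizer.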
Implementation Details. We implement our model using PyTorch (http://pytorch.org) and optimize the parameters by gradient descent with the Adam optimizer. We further apply layer normalization and a residual connection to the active user embedding to avoid the exploding or vanishing gradient problems that may occur in deep neural networks. In other words, the active user embedding $act(u_j)$ is replaced by $LayerNorm(act(u_j) + emb(u_j))$, where the function $LayerNorm(\cdot)$ encourages the output to have zero mean and unit variance. We also apply dropout to the attention mechanism to prevent our model from overfitting and the dropout rate is set to . Since the same user will not be infected twice, we mask the users that are already infected in Eq. 7 so that they won’t be predicted. We release our source code on GitHub (https://github.com/albertyang33/NeuralDiffusionModel), where all the details are listed. Hyperparameter settings will be introduced in the next section.
Complexity. The space complexity of our model is $O(d|U|)$ where $d$ is the embedding dimension, which is much less than the size of the user set. Note that the space complexity of training a traditional IC model will go up to $O(|U|^2)$ because we need to assign an infection probability between each pair of potentially linked users. Therefore, the space complexity of our neural model is less than that of traditional IC methods.
The computation of a single active embedding takes time where is the length of corresponding cascade and the next infected user prediction in Eq. 7 step takes time. Hence the time complexity of training a single cascade is which is competitive with previous neural-based models such as embedded IC model . But as we will show in the experiments, our model converges much faster than embedded IC model and is capable of handling large-scale dataset.
5 Experiments
We conduct experiments on the diffusion prediction task as previous works did to evaluate the performance of our model and various baseline methods. We will first introduce the baseline methods, evaluation metrics and hyperparameter settings. Then we will present the experimental results and give further analysis about the evaluation.
5.1 Baselines
We consider a number of state-of-the-art baselines to demonstrate the effectiveness of our algorithm. Most of the baseline methods will learn a transition probability matrix $M$ from cascade sequences, where each entry $M_{ij}$ represents the probability that user $u_j$ gets infected by $u_i$ when $u_i$ is activated.
Netrate considers the time-varying dynamics of diffusion probability through each link and defines three transmission probability models, i.e. exponential, power-law and Rayleigh, which encourage the diffusion probability to decrease as the time interval increases. In our experiments, we only report the results of the exponential model since the other two models give similar results.
Infopath also targets inferring dynamic diffusion probabilities based on information diffusion data. Infopath employs stochastic gradient to estimate the temporal dynamics and studies the temporal evolution of information pathways.
Embedded IC explores representation learning techniques and models the diffusion probability between two users by a function of their user embeddings instead of a static value. The Embedded IC model is trained by stochastic gradient descent.
LSTM is a widely used neural network framework for sequential data modeling and has been used for cascade modeling recently. Previous works employ LSTM for some simpler tasks such as popularity prediction and cascade prediction with a known diffusion graph [28, 27]. Since none of these works are directly comparable to ours, we adapt the LSTM network for comparison by adding a softmax classifier to the hidden state of LSTM at each step for next infected user prediction.
5.2 Hyperparameter Settings for Neural Models
Though the parameter space of neural network based methods is much smaller than that of traditional IC models, we have to set several hyperparameters to train neural models. To tune the hyperparameters, we randomly select of training cascade sequences as validation set. Note that all training cascade sequences including the validation set will be used to train the final model for testing.
For Embedded IC model, the dimension of user embeddings is selected from as the original paper did . For LSTM model, the dimensions of user embeddings and hidden states are set to the best choice from . For our model NDM, the number of heads used in multi-head attention is set to , the window size of convolutional network is set to and the dimension of user embeddings is set to . Note that we use the same set of for all the datasets. The flag in Eq. 7 which determines whether the initial user is used for prediction is set to for Twitter dataset and for the other three datasets. We will show the robustness of our model in parameter sensitivity subsection.
Note that neural models, i.e. Embedded IC, LSTM and NDM, are based on matrix multiplication operations and thus naturally benefit from the GPU acceleration. Therefore, we train these three methods on a GPU device (GeForce GTX TITAN X) instead of a CPU device (Intel Xeon E5-2620 @ 2.0GHz).
5.3 Diffusion Prediction
To compare the ability of cascade modeling, we evaluate our model and all baseline methods on the diffusion prediction task. We follow the experimental settings in Embedded IC . We randomly select cascade sequences as training set and the rest as test set. For each cascade sequence in the test set, only the initial user is given and all successively infected users need to be predicted.
All baseline methods and our model are required to predict a set of users, and the results will be compared with the ground truth infected user set $G$. For baseline methods that are grounded in the IC model, i.e. Netrate, Infopath and Embedded IC, we simulate the infection process according to the learned pairwise diffusion probabilities and their corresponding generation process. For LSTM and our model, we sequentially sample a user according to the probability distribution of the softmax classifier at each step.
Note that the ground truth infected user set could also be partially observed because the datasets are crawled within a short time window. Therefore, for each test sequence with $|G|$ ground truth infected users, all the algorithms are only required to predict the first $|G|$ infected users in a single simulation. Also note that the simulation may terminate and stop infecting new users before activating $|G|$ users.
We conduct times Monte Carlo simulations for each test cascade sequence for all algorithms and compute the infection probability of each user . We evaluate the prediction results using two classic evaluation metrics: Macro-F1 and Micro-F1.
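The sampling-based evaluation described above can be sketched as follows. It is a minimal sketch, assuming a hypothetical `step_fn` that maps the currently infected sequence to a dictionary of next-user probabilities (standing in for any trained sequential model); the function names, the token `"<Terminate>"` and the toy model are illustrative only.

```python
import random
from collections import Counter


def simulate_once(step_fn, initial_user, max_new_users, terminate, rng):
    """Sample one cascade; stop at Terminate or after max_new_users infections."""
    infected = [initial_user]
    while len(infected) - 1 < max_new_users:
        probs = step_fn(infected)  # dict: candidate user -> probability
        candidates = list(probs)
        nxt = rng.choices(candidates, weights=[probs[u] for u in candidates], k=1)[0]
        if nxt == terminate:
            break
        infected.append(nxt)
    return infected[1:]  # newly infected users only


def infection_probabilities(step_fn, initial_user, max_new_users, terminate, runs, seed=0):
    """Estimate per-user infection probability as the fraction of runs infecting that user."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(runs):
        counts.update(simulate_once(step_fn, initial_user, max_new_users, terminate, rng))
    return {user: n / runs for user, n in counts.items()}


def toy_step(infected):
    # Hypothetical deterministic model: "a" always infects "b", then the cascade ends.
    if infected == ["a"]:
        return {"b": 1.0}
    return {"<Terminate>": 1.0}


probs = infection_probabilities(toy_step, "a", max_new_users=3,
                                terminate="<Terminate>", runs=10)
```

Setting `max_new_users` to the number of ground truth infected users reproduces the cap on predictions per simulation described in the text.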
Macro-F1. Macro-averaged F1 first computes the precision, recall and F1 score locally for each test cascade sequence in the test set. Then the macro-averaged F-measure takes the average over all test cascade sequences.
Micro-F1. Micro-averaged F1 computes precision and recall globally by pooling all predictions, and serves as a complementary view by giving larger weights to longer cascades.
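The two metrics can be sketched in code as follows. This is a sketch under the standard definitions of macro- and micro-averaged F1, which we assume match the description above; the function names and the toy cascades are illustrative only, and predictions are taken to be plain Python sets of user identifiers.

```python
def precision_recall_f1(n_correct, n_predicted, n_truth):
    """Precision, recall and F1 from raw counts (0.0 when undefined)."""
    precision = n_correct / n_predicted if n_predicted else 0.0
    recall = n_correct / n_truth if n_truth else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def macro_f1(predicted_sets, truth_sets):
    """Average of the per-cascade F1 scores."""
    scores = [
        precision_recall_f1(len(p & t), len(p), len(t))[2]
        for p, t in zip(predicted_sets, truth_sets)
    ]
    return sum(scores) / len(scores)


def micro_f1(predicted_sets, truth_sets):
    """F1 computed from counts pooled over all cascades."""
    pairs = list(zip(predicted_sets, truth_sets))
    n_correct = sum(len(p & t) for p, t in pairs)
    n_predicted = sum(len(p) for p, _ in pairs)
    n_truth = sum(len(t) for _, t in pairs)
    return precision_recall_f1(n_correct, n_predicted, n_truth)[2]


# Two toy test cascades (made-up data).
predicted = [{"b", "c"}, {"x"}]
truth = [{"b"}, {"x", "y", "z"}]
print(macro_f1(predicted, truth))  # per-cascade F1 of 2/3 and 1/2, mean 7/12
print(micro_f1(predicted, truth))  # pooled P = 2/3, R = 1/2, F1 = 4/7
```

Because the micro-averaged variant pools counts before computing the score, a cascade with many ground-truth users contributes more to the result than a short one, which is the weighting effect noted above.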
To further evaluate the performance of cascade prediction at an early stage, we conduct additional experiments in which only the first five infected users in each test cascade are predicted. We present the experimental results in Tables II and III. Here “-” indicates that the algorithm fails to converge in hours. The last column represents the relative improvement of NDM over the best-performing baseline method. We make the following observations:
(1) NDM consistently and significantly outperforms all the baseline methods. As shown in Table II, the relative improvement against the best performing baseline is at least in terms of Macro-F1 score. The improvement on Micro-F1 score further demonstrates the effectiveness and robustness of our proposed model. The results also indicate that well-designed neural network models are able to surpass traditional cascade methods on cascade modeling.
(2) NDM achieves even larger improvements on the cascade prediction task at an early stage. As shown in Table III, NDM outperforms all baselines by a large margin on both Macro-F1 and Micro-F1 scores. Predicting the first wave of infected users accurately is very important for real-world applications, because a wrong prediction leads to error propagation in the following stages. A precise prediction of infected users at an early stage enables us to better control the spread of information items among users. For example, we can hinder the spread of a rumor by warning the most vulnerable users in advance, and promote the spread of a product by paying more attention to the most promising potential customers. This experiment demonstrates that NDM is applicable to real-world applications.
(3) NDM is capable of handling large-scale cascade datasets. Previous neural-based method, Embedded IC, fails to converge in hours on Twitter dataset with around thousand users and million of potential links. In contrast, NDM converges in hours on this dataset with the same GPU device, which is at least times faster than Embedded IC. This observation demonstrates the scalability of NDM.
5.4 Social Link Prediction
Sometimes the underlying social network of users is available, e.g. the Twitter dataset used in our experiments. In the Twitter dataset, a network of Twitter followers is observed though the information diffusion is not necessarily passed through the edges of the social network. Though the diffusion network and the social network do not strictly align with each other, we still expect that the most closely related users in information diffusion should also be socially connected in the social network. Therefore, we conduct social link prediction experiments to verify this statement.
Firstly, we need to specify “the most closely related users”. For the Infopath algorithm, the model output directly contains the diffusion probability of each inferred edge, so we can rank users by diffusion probability to obtain the most closely related ones. For LSTM and our model NDM, the output contains user embeddings as real-valued vectors, and we can simply use the inner product between user embeddings, or the multi-head attention weight in Eq. 4, to measure the closeness of two users.
Secondly, we use the following experimental setting for evaluation. Note that this setting is a reasonable choice but not the only one. For each user in the dataset, we select the most closely related user according to the first step and check whether the selected user is a follower of that user. We then use accuracy as the evaluation metric. We present the experimental results in Fig. 5.
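The evaluation protocol above can be illustrated with a short sketch for the embedding-based models. It assumes the inner product is used as the closeness measure (the attention-weight variant is not shown), and the layout of `followers` as a list of sets indexed by user is a hypothetical choice made for this example.

```python
import numpy as np


def most_related_user(embeddings):
    """Index of the closest other user for each user, by inner product."""
    emb = np.asarray(embeddings, dtype=float)
    scores = emb @ emb.T
    np.fill_diagonal(scores, -np.inf)  # a user is never its own match
    return scores.argmax(axis=1)


def link_prediction_accuracy(embeddings, followers):
    """Fraction of users whose closest user is one of their followers.

    followers[u] is assumed to be the set of users following user u.
    """
    nearest = most_related_user(embeddings)
    hits = sum(int(nearest[u]) in followers[u] for u in range(len(nearest)))
    return hits / len(nearest)


# Made-up embeddings for three users and a made-up follower relation.
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
toy_followers = [{1}, {2}, {1}]
accuracy = link_prediction_accuracy(toy_embeddings, toy_followers)
print(accuracy)  # two of the three users are matched to a follower
```

For Infopath, the same accuracy computation would apply with the learned diffusion probabilities in place of the inner-product scores.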
From the results in Fig. 5 we can see that all three algorithms, i.e. LSTM, Infopath and NDM, are able to predict the social links in Twitter dataset to some extent. Even the accuracy of LSTM is much higher than a random guess (around ). This indicates that information spreads through some social links frequently and thus these links can be inferred successfully. Moreover, our NDM model performs best on this task. This fact indicates that NDM can better capture the intrinsic relationship between users. Also, the absolute value of social link prediction accuracy is still not high enough (less than ). One possible reason is that the overlap between a diffusion network and a social network is small compared with the entire network.
5.5 Benefits from Social Network
Conversely, we also expect that the diffusion prediction process can benefit from the observed social network structure. We apply a simple modification to our NDM model to take advantage of the social network, and describe this modification in detail below.
Firstly, we embed the topological structure of the social network into real-valued user features with DeepWalk, a widely used network representation learning algorithm. The dimension of the network embeddings learned by DeepWalk is set to half of the representation size of our model. Secondly, we use the learned network embeddings to initialize the first half of the dimensions of the user representations in our model and keep them fixed during training, without changing any other module. In other words, each user representation consists of a fixed network embedding learned by DeepWalk from the social network structure and an equally sized, randomly initialized trainable embedding. We name the modified model, which takes the Social Network into account, NDM+SN for short. This is a simple but useful implementation, and we will explore a more sophisticated model that incorporates the social network directly in future work. Figs. 6 and 7 show the comparison between NDM and NDM+SN.
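The half-fixed, half-trainable user representation described above can be sketched as follows. This is an illustrative sketch only: the class name, the plain gradient-descent update and the initialization scale are assumptions, and the network embeddings are taken as a precomputed array rather than produced by an actual DeepWalk run.

```python
import numpy as np


class HalfFixedEmbedding:
    """User representations whose first half is frozen, second half trained."""

    def __init__(self, network_embeddings, rng):
        # Precomputed network embeddings (e.g. from DeepWalk); never updated.
        self.fixed = np.asarray(network_embeddings, dtype=float)
        # Randomly initialized trainable part of the same shape.
        self.trainable = rng.normal(scale=0.1, size=self.fixed.shape)

    def lookup(self, user_ids):
        """Concatenate the fixed and trainable halves for the given users."""
        return np.concatenate(
            [self.fixed[user_ids], self.trainable[user_ids]], axis=-1
        )

    def apply_gradient(self, user_ids, grad, learning_rate):
        """Gradient step on the trainable half only.

        grad has the same shape as lookup(user_ids); the part belonging to
        the fixed half is simply discarded.
        """
        half = self.fixed.shape[1]
        np.subtract.at(self.trainable, user_ids, learning_rate * grad[:, half:])


rng = np.random.default_rng(0)
table = HalfFixedEmbedding(np.ones((3, 2)), rng)
ids = np.array([0, 2])
print(table.lookup(ids).shape)  # (2, 4)
```

In a deep learning framework the same effect would typically be obtained by excluding the fixed half from the set of optimized parameters; the explicit slicing here only makes the idea visible.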
Experimental results show that NDM+SN slightly improves performance on the diffusion prediction task by incorporating the social network structure as prior knowledge. The relative improvement of Micro-F1 is around . The results demonstrate that our neural model is very flexible and can be easily extended to take advantage of external features. Note that these results are also consistent with those in the previous subsection: the diffusion network and the social network overlap, but the overlapping part is relatively small compared to the whole network.
5.6 Parameter Sensitivity
In this subsection, we will take Lastfm dataset as an illustrative example to present how hyperparameter settings affect the performance of our model. We use the best set of hyperparameter settings as our basis, i.e. number of heads , window size of convolutional network , dimension of user embeddings and flag of using initial user for prediction . Then we vary each hyperparameter while keeping others fixed. Figure 8 shows the performance on diffusion prediction under different hyperparameter settings.
We can see that the performance of NDM is stable when we vary the hyperparameters within a reasonable range. NDM does not encounter serious overfitting problem when we double the dimension of embeddings to . This experiment demonstrates the robustness of our model.
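The one-at-a-time variation used in this subsection can be sketched as below. The hyperparameter names and all values in the example are hypothetical placeholders chosen for illustration; they are not the settings used in the experiments.

```python
def one_at_a_time_configs(base, grid):
    """Configurations that differ from the base in exactly one hyperparameter."""
    configs = []
    for name, values in grid.items():
        for value in values:
            if value == base[name]:
                continue  # the base setting itself is evaluated separately
            config = dict(base)
            config[name] = value
            configs.append(config)
    return configs


# Hypothetical base setting and search grid.
base_setting = {"heads": 2, "window": 3}
search_grid = {"heads": [1, 2, 4], "window": [2, 3]}
sweep = one_at_a_time_configs(base_setting, search_grid)
print(len(sweep))  # 3
```

Each resulting configuration would then be trained and evaluated separately, so that a change in performance can be attributed to a single hyperparameter.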
Admittedly, interpretability is usually a weak point of neural network models. Compared with feature engineering methods, neural models encode a user into a real-valued vector space, and the individual dimensions of a user embedding have no explicit meaning. In our proposed model, each user embedding is projected into several subspaces by a multi-head attention mechanism. Intuitively, the user embedding in each subspace represents a specific role of the user. However, it is quite hard to link the embeddings to interpretable hand-crafted features. We will consider aligning user embeddings with interpretable features through a joint model in future work.
Fortunately, we still have some findings in the convolutional layer. Recall that the convolutional layer uses position-specific linear projection matrices, one for each position in the window, together with a projection matrix for the initial user. All four matrices are randomly initialized before training. In a learned model, if the scale of one of these matrices is much larger than that of the others, then the prediction vector is more likely to be dominated by the corresponding position. For example, if the scale of the matrix for the most recent position is much larger than that of the others, then we can infer that the most recently infected user contributes most to the prediction of the next infected user.
Following the notation in Eq. 7, we set for all datasets in this experiment and compute the squared Frobenius norm of the learned projection matrices, as shown in Table IV. We make the following observations:
(1) For all four datasets, the scales of the three position-specific matrices are comparable, and the scale of the matrix for the most recent position is always slightly larger than that of the other two. This observation indicates that the active embeddings of all three most recently infected users contribute to the prediction of the next infected user, and that the most recently infected user is the most important of the three. This finding matches our intuition and supports Assumption 2 proposed in the method section.
(2) The scale of the projection matrix for the initial user is the largest on the Twitter dataset. This indicates that the initial user is very important in the diffusion process on Twitter. This is partly because the Twitter dataset contains the complete history of the spread of a URL, so the initial user is actually the first one to tweet the URL, whereas in the other three datasets the initial user is only the first one within the time window of the crawled data. Note that we enable the flag for using the initial user only for the Twitter dataset in the diffusion prediction task, because we find that performance on the other three datasets is comparable or even worse when the flag is enabled.
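The scale comparison behind these observations amounts to computing a squared Frobenius norm per learned matrix and ranking the results. A minimal sketch is given below; the matrix names and values are hypothetical placeholders and do not correspond to the learned parameters reported in Table IV.

```python
import numpy as np


def squared_frobenius_norm(matrix):
    """Sum of squared entries, i.e. the squared Frobenius norm."""
    return float(np.sum(np.square(matrix)))


def rank_by_scale(named_matrices):
    """Sort (name, squared norm) pairs from largest to smallest scale."""
    scales = {
        name: squared_frobenius_norm(m) for name, m in named_matrices.items()
    }
    return sorted(scales.items(), key=lambda item: item[1], reverse=True)


# Made-up matrices standing in for learned projection matrices.
toy_matrices = {
    "initial_user": 3.0 * np.eye(2),
    "most_recent_position": np.eye(2),
}
ranking = rank_by_scale(toy_matrices)
print(ranking)  # [('initial_user', 18.0), ('most_recent_position', 2.0)]
```

A matrix that ranks first by this measure is the one most likely to dominate the prediction vector, which is how the scales are interpreted above.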
In this paper, we propose a Neural Diffusion Model (NDM) for microscopic cascade modeling. To go beyond the limitations of traditional cascade models based on strong assumptions and oversimplified formulas, we build our model on two heuristic assumptions and employ deep learning techniques, including a convolutional neural network and an attention mechanism, to implement them. Experimental results on the diffusion prediction task demonstrate the effectiveness and robustness of our proposed model. In addition, NDM greatly outperforms baseline methods on diffusion prediction at an early stage, which shows the applicability and feasibility of NDM for real-world applications.
For future work, we will consider linking neural models with hand-crafted features and statistics to improve the interpretability of the learned models. An intelligible model is always welcome and can help us better understand the motivations and behaviors of users in a diffusion process.
The incorporation of extra information for cascade modeling is also an intriguing direction. For example, the timestamp information and the description of information items can be used for more accurate cascade modeling.
This work was supported by the 973 Program (No. 2014CB340501), the Major Project of the National Social Science Foundation of China (13&ZD190) and the National Natural Science Foundation of China (No. 61772302). This work is also part of the NExT++ project, supported by the National Research Foundation, Prime Minister’s Office, Singapore under its IRC@Singapore Funding Initiative.
-  M. J. Salganik, P. S. Dodds, and D. J. Watts, “Experimental study of inequality and unpredictability in an artificial cultural market,” Science, 2006.
-  J. Cheng, L. Adamic, P. A. Dow, J. M. Kleinberg, and J. Leskovec, “Can cascades be predicted?” in Proceedings of WWW. ACM, 2014.
-  L. Yu, P. Cui, F. Wang, C. Song, and S. Yang, “From micro to macro: Uncovering and predicting information cascading process with behavioral dynamics,” in Data mining (ICDM). IEEE, 2015, pp. 559–568.
-  Q. Zhao, M. A. Erdogdu, H. Y. He, A. Rajaraman, and J. Leskovec, “Seismic: A self-exciting point process model for predicting tweet popularity,” in Proceedings of the 21st ACM SIGKDD. ACM, 2015, pp. 1513–1522.
-  P. Domingos and M. Richardson, “Mining the network value of customers,” in Proceedings of SIGKDD. ACM, 2001.
-  J. Leskovec, A. Singh, and J. Kleinberg, “Patterns of influence in a recommendation network,” in Pacific-Asia Conference on Knowledge Discovery and Data Mining. Springer, 2006, pp. 380–389.
-  J. Leskovec, L. A. Adamic, and B. A. Huberman, “The dynamics of viral marketing,” ACM Transactions on the Web (TWEB), vol. 1, no. 1, p. 5, 2007.
-  D. J. Watts and P. S. Dodds, “Influentials, networks, and public opinion formation,” Journal of consumer research, vol. 34, no. 4, pp. 441–458, 2007.
-  S. Aral and D. Walker, “Identifying influential and susceptible members of social networks,” Science, 2012.
-  H. W. Hethcote, “The mathematics of infectious diseases,” SIAM review, vol. 42, no. 4, pp. 599–653, 2000.
-  J. Wallinga and P. Teunis, “Different epidemic curves for severe acute respiratory syndrome reveal similar impacts of control measures,” American Journal of epidemiology, 2004.
-  D. Kempe, J. Kleinberg, and É. Tardos, “Maximizing the spread of influence through a social network,” in Proceedings of SIGKDD. ACM, 2003, pp. 137–146.
-  T. Lappas, E. Terzi, D. Gunopulos, and H. Mannila, “Finding effectors in social networks,” in Proceedings of SIGKDD. ACM, 2010, pp. 1059–1068.
-  P. A. Dow, L. A. Adamic, and A. Friggeri, “The anatomy of large facebook cascades.” ICWSM, 2013.
-  D. Gruhl, R. Guha, D. Liben-Nowell, and A. Tomkins, “Information diffusion through blogspace,” in Proceedings of WWW. ACM, 2004.
-  D. Liben-Nowell and J. Kleinberg, “Tracing information flow on a global scale using internet chain-letter data,” Proceedings of the national academy of sciences, vol. 105, no. 12, pp. 4633–4638, 2008.
-  J. Leskovec, L. Backstrom, and J. Kleinberg, “Meme-tracking and the dynamics of the news cycle,” in Proceedings of SIGKDD. ACM, 2009, pp. 497–506.
-  D. M. Romero, B. Meeder, and J. Kleinberg, “Differences in the mechanics of information diffusion across topics: idioms, political hashtags, and complex contagion on twitter,” in Proceedings of WWW. ACM, 2011, pp. 695–704.
-  S. Myers and J. Leskovec, “On the convexity of latent social network inference,” in Advances in neural information processing systems, 2010, pp. 1741–1749.
-  M. Gomez Rodriguez, J. Leskovec, and A. Krause, “Inferring networks of diffusion and influence,” in Proceedings of SIGKDD. ACM, 2010.
-  M. G. Rodriguez, J. Leskovec, D. Balduzzi, and B. Schölkopf, “Uncovering the structure and temporal dynamics of information propagation,” Network Science, vol. 2, no. 1, pp. 26–65, 2014.
-  M. Gomez Rodriguez, J. Leskovec, and B. Schölkopf, “Structure and dynamics of information pathways in online media,” in Proceedings of WSDM. ACM, 2013.
-  S. Bourigault, S. Lamprier, and P. Gallinari, “Representation learning for information diffusion through social networks: an embedded cascade model,” in Proceedings of WSDM. ACM, 2016.
-  S. Bourigault, C. Lagnier, S. Lamprier, L. Denoyer, and P. Gallinari, “Learning social network embeddings for predicting information diffusion,” in Proceedings of WSDM. ACM, 2014.
-  S. Gao, H. Pang, P. Gallinari, J. Guo, and N. Kato, “A novel embedding method for information diffusion prediction in social network big data,” IEEE Transactions on Industrial Informatics, 2017.
-  C. Li, J. Ma, X. Guo, and Q. Mei, “Deepcas: An end-to-end predictor of information cascades,” in Proceedings of WWW. International World Wide Web Conferences Steering Committee, 2017, pp. 577–586.
-  W. Hu, K. K. Singh, F. Xiao, J. Han, C.-N. Chuah, and Y. J. Lee, “Who will share my image?: Predicting the content diffusion path in online social networks,” in Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining. ACM, 2018, pp. 252–260.
-  J. Wang, V. W. Zheng, Z. Liu, and K. C.-C. Chang, “Topological recurrent neural network for diffusion prediction,” in ICDM. IEEE, 2017, pp. 475–484.
-  Z. T. Kefato, N. Sheikh, and A. Montresor, “Di: Diffusion network inference through representation learning,” 2017.
-  P. Cui, S. Jin, L. Yu, F. Wang, W. Zhu, and S. Yang, “Cascading outbreak prediction in networks: a data-driven approach,” in Proceedings of SIGKDD. ACM, 2013.
-  O. Tsur and A. Rappoport, “What’s in a hashtag?: content based prediction of the spread of ideas in microblogging communities,” in Proceedings of WSDM. ACM, 2012, pp. 643–652.
-  L. Weng, F. Menczer, and Y.-Y. Ahn, “Predicting successful memes using network and community structure.” in ICWSM, 2014.
-  H. Pinto, J. M. Almeida, and M. A. Gonçalves, “Using early view patterns to predict the popularity of youtube videos,” in Proceedings of WSDM. ACM, 2013, pp. 365–374.
-  S. Gao, J. Ma, and Z. Chen, “Modeling and predicting retweeting dynamics on microblogging platforms,” in Proceedings of WSDM. ACM, 2015.
-  Q. Cao, H. Shen, K. Cen, W. Ouyang, and X. Cheng, “Deephawkes: Bridging the gap between prediction and understanding of information cascades,” in Proceedings of CIKM. ACM, 2017.
-  J. Goldenberg, B. Libai, and E. Muller, “Talk of the network: A complex systems look at the underlying process of word-of-mouth,” Marketing letters, 2001.
-  K. Saito, R. Nakano, and M. Kimura, “Prediction of information diffusion probabilities for independent cascade model,” in Knowledge-based intelligent information and engineering systems. Springer, 2008, pp. 67–75.
-  K. Saito, M. Kimura, K. Ohara, and H. Motoda, “Learning continuous-time information diffusion model for social behavioral data analysis,” in Asian Conference on Machine Learning. Springer, 2009, pp. 322–337.
-  S. Wang, X. Hu, P. S. Yu, and Z. Li, “Mmrate: inferring multi-aspect diffusion networks with multi-pattern cascades,” in Proceedings of SIGKDD. ACM, 2014, pp. 1246–1255.
-  L. Tang and H. Liu, “Relational learning via latent social dimensions,” in Proceedings of SIGKDD. ACM, 2009, pp. 817–826.
-  B. Perozzi, R. Al-Rfou, and S. Skiena, “Deepwalk: Online learning of social representations,” in Proceedings of SIGKDD. ACM, 2014, pp. 701–710.
-  J. Tang, M. Qu, M. Wang, M. Zhang, J. Yan, and Q. Mei, “Line: Large-scale information network embedding,” in Proceedings of WWW. International World Wide Web Conferences Steering Committee, 2015, pp. 1067–1077.
-  T. N. Kipf and M. Welling, “Semi-supervised classification with graph convolutional networks,” arXiv preprint arXiv:1609.02907, 2016.
-  C. Yang, M. Sun, W. X. Zhao, Z. Liu, and E. Y. Chang, “A neural network approach to jointly modeling social networks and mobile trajectories,” TOIS, vol. 35, no. 4, p. 36, 2017.
-  Ò. Celma Herrada, “Music recommendation and discovery in the long tail,” 2009.
-  T. Opsahl and P. Panzarasa, “Clustering in weighted networks,” Social networks, vol. 31, no. 2, pp. 155–163, 2009.
-  N. O. Hodas and K. Lerman, “The simple rules of social contagion,” Scientific reports, vol. 4, p. 4343, 2014.
-  D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” arXiv preprint arXiv:1409.0473, 2014.
-  A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems, 2017, pp. 5998–6008.
-  M. G. Rodriguez, D. Balduzzi, and B. Schölkopf, “Uncovering the temporal dynamics of diffusion networks,” arXiv preprint arXiv:1105.0697, 2011.
-  Y. LeCun et al., “Lenet-5, convolutional neural networks,” URL: http://yann.lecun.com/exdb/lenet, 2015.
-  A. Van den Oord, S. Dieleman, and B. Schrauwen, “Deep content-based music recommendation,” in Advances in neural information processing systems, 2013, pp. 2643–2651.
-  R. Collobert and J. Weston, “A unified architecture for natural language processing: Deep neural networks with multitask learning,” in Proceedings of ICML. ACM, 2008.
-  D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
-  J. L. Ba, J. R. Kiros, and G. E. Hinton, “Layer normalization,” arXiv preprint arXiv:1607.06450, 2016.
-  K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of CVPR, 2016, pp. 770–778.
-  N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,” The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
-  S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.
Cheng Yang is a fourth-year PhD student in the Department of Computer Science and Technology, Tsinghua University. He received his B.E. degree from Tsinghua University in 2014. His research interests include natural language processing and network representation learning. He has published several papers in top international journals and conferences including ACM TOIS, IJCAI and AAAI.
Maosong Sun is a professor of the Department of Computer Science and Technology, Tsinghua University. He got his BEng degree in 1986 and MEng degree in 1988 from Department of Computer Science and Technology, Tsinghua University, and got his Ph.D. degree in 2004 from Department of Chinese, Translation, and Linguistics, City University of Hong Kong. His research interests include natural language processing, Chinese computing, Web intelligence, and computational social sciences. He has published over 150 papers in academic journals and international conferences in the above fields. He serves as a vice president of the Chinese Information Processing Society, the council member of China Computer Federation, the director of Massive Online Education Research Center of Tsinghua University, and the Editor-in-Chief of the Journal of Chinese Information Processing.
Haoran Liu is a fourth-year undergraduate student in the Department of Electric Engineering, Tsinghua University. His research interests include network representation learning and machine learning.
Shiyi Han is a first-year master's student in the Computer Science department at Brown University. He received his B.E. degree from Beihang University in 2018. His research interests include natural language processing and machine learning.
Zhiyuan Liu is an associate professor of the Department of Computer Science and Technology, Tsinghua University. He got his BEng degree in 2006 and his Ph.D. in 2011 from the Department of Computer Science and Technology, Tsinghua University. His research interests are natural language processing and social computation. He has published over 40 papers in international journals and conferences including ACM Transactions, IJCAI, AAAI, ACL and EMNLP.
Huanbo Luan is the deputy director of NExT++ Research Center at both Tsinghua University and National University of Singapore. He received his B.S. degree in computer science from Shandong University in 2003 and Ph.D degree in computer science from Institute of Computing Technology, Chinese Academy of Sciences in 2008. His research interests include natural language processing, multimedia information retrieval, social media and big data analysis.