Neural Diffusion Model for
Microscopic Cascade Prediction
Abstract
The prediction of information diffusion, or cascades, has attracted much attention over the last decade. Most cascade prediction works target cascade-level macroscopic properties such as the final size of a cascade. Existing microscopic cascade prediction models, which focus on user-level modeling, either make strong assumptions on how a user gets infected by a cascade or limit themselves to a specific scenario where “who infected whom” information is explicitly labeled. The strong assumptions oversimplify the complex diffusion mechanism and prevent these models from better fitting real-world cascade data. Also, the methods which focus on specific scenarios cannot be generalized to a general setting where the diffusion graph is unobserved.
To overcome the drawbacks of previous works, we propose a Neural Diffusion Model (NDM) for general microscopic cascade prediction. NDM makes relaxed assumptions and employs deep learning techniques, including an attention mechanism and a convolutional network, for cascade modeling. Both advantages enable our model to go beyond the limitations of previous methods, better fit the diffusion data and generalize to unseen cascades. Experimental results on the diffusion prediction task over four realistic cascade datasets show that our model can achieve a relative improvement up to against the best performing baseline in terms of F1 score.
1 Introduction
Information diffusion is a ubiquitous and fundamental event in our daily lives, such as the spread of rumors, the contagion of viruses and the propagation of new ideas and technologies. The diffusion process, also called a cascade, has been studied over a broad range of domains. Though some works believe that even the eventual size of a cascade cannot be predicted [1], recent works [2, 3, 4] have shown the ability to predict the size, growth and many other key properties of a cascade. Nowadays the modeling and prediction of cascades play an important role in many real-world applications, e.g. product recommendation [5, 6, 7, 8, 9], epidemiology [10, 11], social networks [12, 13, 14] and the spread of news and opinions [15, 16, 17].
Most previous works on cascade prediction focus on the prediction of macroscopic properties such as the total number of users who share a specific photo [2] and the growth curve of the popularity of a blog [3]. However, macroscopic cascade prediction is only a rough estimate of cascades and cannot be adapted for microscopic questions as shown in Fig. 1. Microscopic cascade prediction, which pays more attention to user-level modeling and prediction instead of cascade-level, is much more powerful than macroscopic prediction and allows us to apply user-specific strategies for real-world applications. For example, during the adoption of a new product, microscopic cascade prediction can help us deliver advertisements to those users that are most likely to buy the product at each stage. In this paper, we focus on the study of microscopic cascade prediction.
Though useful and powerful, the microscopic prediction of cascades faces great challenges because the real-world diffusion process can be rather complex [18] and is usually only partially observed [11, 19]:
Complex mechanism. Since the mechanism of how a specific user gets infected (we use “infected” and “activated” interchangeably to indicate that a user is influenced by a cascade) is sophisticated, traditional cascade models based on strong assumptions and simple formulas may not be the best choice for microscopic cascade prediction. Existing cascade models [20, 21, 22, 23] which could be adopted for microscopic prediction are mostly grounded in the Independent Cascade (IC) model [12]. The IC model assigns a static probability $p_{u,v}$ to each user pair $(u, v)$ under a pairwise independence assumption, where $p_{u,v}$ indicates how likely user $v$ is to get infected by user $u$ when $u$ is infected. Other diffusion models [24, 25] make even stronger assumptions that the infected users are only determined by the source user. Though intuitive and easy to understand, these cascade models are based on strong assumptions and oversimplified probability estimation formulas, both of which limit their expressivity and ability to fit complex real-world cascade data [26]. The complex mechanism of real-world diffusions encourages us to explore more sophisticated models, e.g. deep learning techniques, for cascade modeling.
Incomplete observation. On the other hand, cascade data is usually only partially observed: we can only observe which users get infected, without knowing who infected them. However, to the best of our knowledge, existing deep-learning-based microscopic cascade models [27, 28] assume that the diffusion graph, in which a user can only infect and get infected by its neighbors, is already known. For example, when we study retweeting behavior on the Twitter network, “who infected whom” information is explicitly labeled in the retweet chain and the candidates for the next infected user are restricted to neighboring users rather than the whole user set. In most diffusion processes, however, such as the adoption of a product or the contamination of a virus, the diffusion graph is unobserved [11, 19, 29]. Therefore, these methods consider a much simpler problem and cannot be generalized to a general setting where the diffusion graph is unknown.
To fill the gap in general microscopic cascade prediction and address the limitations of traditional cascade models, we propose a neural diffusion model based on relaxed assumptions and employ up-to-date deep learning techniques, i.e. an attention mechanism and a convolutional neural network, for cascade modeling. The relaxed assumptions enable our model to be more flexible and less constrained, and deep learning tools are good at capturing the complex and intrinsic relationships that are hard to characterize with hand-crafted features. Both advantages allow our model to go beyond the limitations of traditional methods based on strong assumptions and oversimplified formulas and better fit the complicated cascade data. Following the experimental settings in [23], we conduct experiments on the diffusion prediction task over four realistic cascade datasets to evaluate the performance of our proposed model and other state-of-the-art baseline methods. Experimental results show that our model can achieve a relative improvement up to against the best performing baseline in terms of F1 score.
To conclude, our contributions are threefold:

To the best of our knowledge, our work is the first attempt to employ deep learning techniques for the general microscopic cascade prediction problem where the diffusion graph is unknown.

We design a neural diffusion model based on assumptions that are relaxed compared with the pairwise independence assumption in traditional cascade models, which allows our model to better fit real-world cascades and generalize to unseen data.

Experimental results on diffusion prediction task over four realistic datasets demonstrate the effectiveness and robustness of our proposed model. Compared with the best performing baseline, our model can achieve a relative improvement up to on F1 score.
2 Related Works
We organize related works into macroscopic and microscopic cascade prediction methods. In terms of methodology, our work is also related to network representation learning methods.
2.1 Macroscopic Cascade Prediction
Most previous works on cascade prediction focused on macroscopic-level prediction such as the eventual size of a cascade [4] and the growth curve of popularity [3]. Macroscopic cascade prediction methods can further be classified into feature-based approaches, generative approaches, and deep-learning-based approaches. Feature-based approaches formalized the task as a classification problem [30, 2] or a regression problem [31, 32] by applying SVM, logistic regression and other machine learning algorithms on hand-crafted features including temporal [33] and structural [2] features. Generative approaches considered the growth of cascade size as an arrival process of infected users and employed stochastic processes, such as the Hawkes self-exciting point process [34, 4], for modeling. With the success of deep learning techniques in various applications, deep-learning-based approaches, e.g. DeepCas [26] and DeepHawkes [35], were proposed to employ a Recurrent Neural Network (RNN) for encoding cascade sequences into feature vectors instead of using hand-crafted features. Compared with hand-crafted feature engineering, deep-learning-based approaches have better generalization ability across different platforms and give better performance on the macroscopic prediction task.
2.2 Microscopic Cascade Prediction
Our work is more related to microscopic cascade prediction, which focuses on user-level modeling and prediction. We classify related works into three categories: IC-based approaches, embedding-based approaches, and deep-learning-based approaches.
The IC model [36, 12, 15, 37] is one of the most popular diffusion models and assumes an independent diffusion probability through each link. Extensions of the IC model, such as continuous time IC [38], CONNIE [19], NetInf [20] and Netrate [21], further considered time delay information by incorporating a predefined time-decay weighting function. Infopath [22] was proposed to infer dynamic diffusion probabilities based on information diffusion data and study the temporal evolution of information pathways. MMRate [39] inferred multi-aspect transmission rates by incorporating aspect-level user interactions and various diffusion patterns. All of the above methods learned the probabilities from cascade sequences. Once a model is trained, it can be used for microscopic cascade prediction by simulating the generative process using Monte Carlo simulation.
Embedding-based approaches encoded each user into a parameterized real-valued vector and trained the parameters by maximizing an objective function. Embedded IC [23] followed the pairwise independence assumption in the IC model and modeled the diffusion probability between two users by a function of their user embeddings. Other embedding-based diffusion models [24, 25] made even stronger assumptions that infected users are determined only by the source user and the content of the information item. As shown in previous work [26], such models with strong assumptions oversimplify the reality and generally show poor performance on real prediction tasks.
Existing deep-learning-based microscopic cascade prediction approaches [27, 28] focused on the retweeting and sharing behaviors in a social network where “who infected whom” information is explicitly labeled in the retweet chain. The candidates for the next infected user are also restricted to the neighboring users when the diffusion graph is known. However, the diffusion graph is usually unknown for most diffusion processes [11, 19]. For example, during the contamination of a virus, by whom a patient gets infected is unobserved. Existing deep-learning-based methods considered a much simpler problem and cannot be generalized to a general setting where the diffusion graph is unobserved. To the best of our knowledge, our work is the first attempt to employ deep learning techniques for the general microscopic cascade prediction problem where the diffusion graph is unknown.
2.3 Network Representation Learning
Researchers have explored many algorithms to represent nodes in a network by real-valued vectors. By projecting the topology structure into vectors, we can apply machine learning techniques to many network applications, e.g. classification. Most network representation learning works focus on task-agnostic learning where the downstream task is unknown. Early-stage works [40] use eigenvector computation to learn node embeddings. With the success of neural networks, simple neural networks have also been employed for representation learning [41, 42]. For task-specific learning, a certain task such as classification [43] or recommendation [44] is specified and the network embeddings serve as the bottom layer of the model, as we do in this paper. For the diffusion prediction task, Embedded IC [23] was proposed and will be used as one of our baseline methods.
3 Data Observation
In this section, we conduct data observation on real-world datasets and investigate the intrinsic relationships between activated users in a diffusion sequence. Specifically, we try to figure out whether consecutively activated users are more likely to be relevant and thus appear in more diffusion sequences together. We first introduce the datasets.
3.1 Datasets
We collect four real-world cascade datasets that cover a variety of applications for evaluation. A cascade is an item or some kind of information that spreads through a set of users. Each cascade consists of a list of (user, timestamp) pairs where each pair indicates that the user gets infected at the timestamp.
Lastfm is a music streaming website. We collect the dataset from [45]. The dataset contains the full history of nearly users and the songs they listened to over one year. We treat each song as an item spreading through users and remove the users who listen to no more than songs.
Irvine is an online community for students at University of California, Irvine collected from [46]. Students can participate in and write posts on different forums. We regard each forum as an information item and remove the users who participate in no more than forums.
Memetracker (http://www.memetracker.org) collects a million news stories and blog posts and tracks the most frequent quotes and phrases, i.e. memes, for studying the migration of memes across a group of people. Each meme is considered to be an information item and each URL of websites or blogs is regarded as a user. Following the settings of previous works [23], we filter the URLs to only keep the most active ones to alleviate the effect of noise.
Twitter dataset [47] concerns tweets containing URLs posted on Twitter during October 2010. The complete tweeting history of each URL is collected. We consider each distinct URL as a spreading item over Twitter users. We filter out the users with no more than tweets. Note that the scale of the Twitter dataset is competitive with, and even larger than, the datasets used in previous neural-based cascade modeling algorithms [23, 28].
Note that all the above datasets have no explicit evidence about by whom a user gets infected. Though we have the following relationship in Twitter dataset, we still cannot trace the source of by whom a user is encouraged to tweet a specific URL unless the user directly retweets.
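Table I: Statistics of datasets.

| Dataset | # Users | # Links | # Cascades | Avg. Length |
|---|---|---|---|---|
| Lastfm | 982 | 506,582 | 23,802 | 7.66 |
| Irvine | 540 | 62,605 | 471 | 13.63 |
| Memetracker | 498 | 158,194 | 8,304 | 8.43 |
| Twitter | 19,546 | 18,687,423 | 6,158 | 36.74 |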
We list the statistics of datasets in Table I. Since we have no interaction graph information between users, we assume that there exists a link between two users if they appear in the same cascade sequence. Each virtual “link” will be assigned a parameterized probability in traditional IC model and thus the space complexity of traditional methods is relatively high especially for large datasets. We also calculate the average cascade length of each dataset in the last column.
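The virtual-link construction described above can be sketched in a few lines. This is an illustrative sketch only: the function name `dataset_statistics` and the toy cascades are hypothetical, and treating links as unordered user pairs is an assumption, since the text does not say whether links are directed.

```python
from itertools import combinations


def dataset_statistics(cascades):
    """cascades: list of user-id lists, each ordered by infection time."""
    users = set()
    links = set()
    for cascade in cascades:
        users.update(cascade)
        # Assumption: one undirected virtual link per pair of users that
        # appear together in at least one cascade.
        for pair in combinations(sorted(set(cascade)), 2):
            links.add(pair)
    avg_length = sum(len(c) for c in cascades) / len(cascades)
    return {
        "users": len(users),
        "links": len(links),
        "cascades": len(cascades),
        "avg_length": avg_length,
    }


# Toy data, not taken from any of the four datasets.
toy_cascades = [["a", "b", "c"], ["b", "c"], ["d", "a"]]
stats = dataset_statistics(toy_cascades)
```

Because the number of virtual links can grow quadratically in the number of users, a model that stores one parameter per link becomes expensive on large datasets, which is the point made in the paragraph above.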
3.2 Statistical Analysis
Now we will try to reveal the correlation patterns between users by statistical results. By intuition, two consecutively infected users in a cascade sequence are more likely to have connections, e.g. one infects another, and thus participate in many other diffusion sequences together.
To demonstrate this statement, we consider the following statistic: given that users $u_i$ and $u_j$ are both infected in a cascade sequence with $k$ users infected between them in this sequence, what is the expected number of cascade sequences that $u_i$ and $u_j$ both participate in? Here $k = 0$ indicates that users $u_i$ and $u_j$ are consecutively activated. If the intuition is true, then the expectation should decrease as $k$ increases.
Fig. 2 presents the statistical results of all four datasets. Here we list the results for and the average for and . The statistics show that the expectations of co-occurrence times for are consistently larger than those for . Note that the gap is not very large for some datasets due to the long-tail effect. Therefore, we further present the results only for the top 5% user pairs in terms of co-occurrence times for each $k$ in Fig. 3. We can see the differences more clearly in this setting.
These statistical results demonstrate that consecutively infected users in a cascade sequence are more likely to be relevant. By saying two users are “relevant”, we mean that there could be a direct diffusion path between them or that they are both likely to be infected by a third user. Also, we find that it is not only the most recently infected user that is relevant to the next infected one: as shown in Fig. 2 and 3, all recently infected users () could be relevant with minor differences (more relevant for smaller $k$). We will build our model based on these findings in the next section.
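One way to compute such a statistic is sketched below. It is a hedged illustration rather than the procedure used for Fig. 2 and 3: the function name `cooccurrence_by_gap` is hypothetical, the average is taken over all pair occurrences, and the co-occurrence count of a pair includes the cascade in which the pair is observed. Both of these choices are assumptions.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence_by_gap(cascades):
    """Average co-occurrence count of user pairs, grouped by the number of
    users infected between them in a cascade."""
    # co[pair] = number of cascades in which both users of the pair appear.
    co = defaultdict(int)
    for cascade in cascades:
        for pair in combinations(sorted(set(cascade)), 2):
            co[pair] += 1

    totals = defaultdict(int)
    counts = defaultdict(int)
    for cascade in cascades:
        for i in range(len(cascade)):
            for j in range(i + 1, len(cascade)):
                gap = j - i - 1  # users infected between positions i and j
                pair = tuple(sorted((cascade[i], cascade[j])))
                totals[gap] += co[pair]
                counts[gap] += 1
    return {gap: totals[gap] / counts[gap] for gap in sorted(counts)}


# Toy data for illustration only.
toy = [["a", "b", "c"], ["a", "b"], ["c", "a", "b"]]
by_gap = cooccurrence_by_gap(toy)
```

On the toy data, consecutively infected pairs (gap 0) co-occur more often on average than pairs separated by one user, which is the pattern the section reports for the real datasets.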
4 Method
In this section, we will start by formalizing the problem and introducing the notations. Then we propose two heuristic assumptions according to the data observations as our basis and design a Neural Diffusion Model (NDM) using deep learning techniques. Finally, we will introduce the overall optimization function and other details of our model.
4.1 Problem Formalization
A cascade dataset records to whom and when an item spreads during its diffusion. For example, the item could be a product and the cascade records who bought the product at what moment. However, in most cases, there exists no explicit interaction graph between the users [37, 23]. Therefore, we have no explicit information about how a user was infected by other users.
Formally, given user set $U$ and observed cascade sequence set $C$, each cascade $c_i \in C$ consists of a list of users $\{u^i_0, u^i_1, \ldots, u^i_{m_i-1}\}$ ranked by their infection time, where $m_i$ is the length of sequence $c_i$ and $u^i_j \in U$ is the $j$-th user in the sequence $c_i$. Note that we only consider the order of users getting infected and ignore the exact timestamps of infections in this paper, as previous works did [23, 29, 28].
In this paper, our goal is to learn a cascade prediction model which can predict the next infected user $u_{j+1}$ given a partially observed cascade sequence $\{u_0, u_1, \ldots, u_j\}$. The learned model is able to predict the entire infected user sequence based on the first few observed infected users and can thus be used for the microscopic cascade prediction illustrated in Figure 1. In our model, we add a virtual user called “Terminate” to the user set $U$. In the training phase, we append “Terminate” to the end of each cascade sequence and allow the model to predict the next infected user as “Terminate” to indicate that no more users will be infected in this cascade.
Further, we represent each user by a parameterized real-valued vector to project users into a vector space. The real-valued vectors are also called embeddings. We denote the embedding of user $u$ as $\mathrm{emb}(u) \in \mathbb{R}^{d}$ where $d$ is the dimension of embeddings. In our model, a larger inner product between the embeddings of two users indicates a stronger correlation between the users.
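The next-user prediction setup, including the virtual “Terminate” user, can be illustrated by turning one cascade into (observed prefix, next user) training pairs. The token string `"<terminate>"` and the function name `training_pairs` are hypothetical names chosen for this sketch.

```python
TERMINATE = "<terminate>"  # hypothetical id for the virtual "Terminate" user


def training_pairs(cascade):
    """Return (observed prefix, next infected user) pairs for one cascade.

    "Terminate" is appended so that the last prefix is trained to predict
    that no more users will be infected.
    """
    seq = list(cascade) + [TERMINATE]
    return [(seq[: j + 1], seq[j + 1]) for j in range(len(seq) - 1)]


pairs = training_pairs(["u0", "u1", "u2"])
# pairs[0] is (["u0"], "u1"); the final pair predicts TERMINATE.
```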
4.2 Model Assumptions
In traditional Independent Cascade (IC) model [12] settings, all previously infected users can activate a new user independently and equally regardless of their orders of getting infected. Many extensions of IC model further considered time delay information such as continuous time IC (CTIC) [38] and Netrate [21]. However, none of these models tried to find out which users are actually active and more likely to activate other users at the moment. To address this issue, we propose the following assumption.
Assumption 1. Given a recently infected user $u_j$, users that are strongly correlated to user $u_j$, including user $u_j$ itself, are more likely to be active.
This assumption is intuitive and straightforward. As a newly activated user, $u_j$ should be active and may infect other users. The users strongly correlated to user $u_j$ are probably the reason why user $u_j$ was activated recently and are thus more likely to be active than other users at the moment. We further propose the concept of “active user embedding” to characterize all such active users.
Definition 1. For each recently infected user $u_j$, we aim to learn an active user embedding $\mathrm{act}(u_j) \in \mathbb{R}^{d}$ which represents the embedding of all active users related to user $u_j$ and can be used for predicting the next infected user in the next step.
The active user embedding $\mathrm{act}(u_j)$ characterizes the potential active users related to the fact that user $u_j$ gets infected. From the data observations, we can see that all recently infected users could be relevant to the next infected one. Therefore, the active user embeddings of all recently infected users should contribute to the prediction of the next infected user, which leads to the following assumption.
Assumption 2. All recently infected users should contribute to the prediction of next infected user and be processed differently according to the order of getting infected.
Compared with the strong assumptions made by the IC-based and embedding-based methods introduced in related works, our heuristic assumptions allow our model to be more flexible and better fit cascade data. Now we will introduce how to build our model based on these two assumptions, i.e. extracting active users and unifying these embeddings for prediction.
4.3 Extracting Active Users with Attention Mechanism
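For the purpose of computing active user embeddings, we propose to use an attention mechanism [48, 49] to extract the most likely active users by giving them more weight than other users. As shown in Figure 4, the active user embedding $\mathrm{act}(u_j)$ of user $u_j$ is computed as a weighted sum of the embeddings $\mathrm{emb}(u_k)$ of previously infected users:

$$\mathrm{act}(u_j) = \sum_{k=0}^{j} w_{jk}\,\mathrm{emb}(u_k), \tag{1}$$

where the weight of $u_k$ is

$$w_{jk} = \frac{\exp\left(\mathrm{emb}(u_j)\,\mathrm{emb}(u_k)^{\top}\right)}{\sum_{m=0}^{j}\exp\left(\mathrm{emb}(u_j)\,\mathrm{emb}(u_m)^{\top}\right)}. \tag{2}$$

Note that $w_{jk} \in (0, 1)$ for every $j, k$ and $\sum_{m=0}^{j} w_{jm} = 1$. The weight $w_{jk}$ is the normalized inner product between the embeddings of $u_j$ and $u_k$, which indicates the strength of correlation between them.
From the definition of the active user embedding in Eq. 1, we can see that the user embeddings $\mathrm{emb}(u_k)$ which have a larger inner product with $\mathrm{emb}(u_j)$ will be allocated a larger weight $w_{jk}$. This formula naturally follows our assumption that users strongly correlated to user $u_j$, including user $u_j$ itself, should be paid more attention.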
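A minimal pure-Python sketch of this single-head attention step follows. It assumes, as the prose describes, that the weights are inner products between the most recent user's embedding and each earlier embedding, normalized with a softmax. The function names and the toy embeddings are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def active_embedding(embs):
    """embs: embeddings of the users infected so far, oldest first.

    Returns the active user embedding of the most recent user together
    with the attention weights over all infected users.
    """
    query = embs[-1]
    scores = [dot(query, e) for e in embs]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(query)
    act = [sum(w * e[t] for w, e in zip(weights, embs)) for t in range(dim)]
    return act, weights


# Toy 2-dimensional embeddings for three infected users.
toy_embs = [[1.0, 0.0], [0.0, 1.0], [0.0, 2.0]]
act, weights = active_embedding(toy_embs)
```

In the toy example the most recent user has the largest inner product with itself, so it receives the largest weight, and the second user, which points in the same direction, receives more weight than the first.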
To fully utilize the advantages of a neural model, we further employ multi-head attention [49] to improve the expressiveness. Multi-head attention projects the user embeddings into multiple subspaces with different linear projections. Then multi-head attention performs the attention mechanism on each subspace independently. Finally, multi-head attention concatenates the attention embeddings in all subspaces and feeds the result into another linear projection.
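Formally, in a multi-head attention with $h$ heads, the embedding of the $i$-th head is computed as

$$\mathrm{head}_i = \sum_{k=0}^{j} w^{i}_{jk}\,\mathrm{emb}(u_k)\,W_i^{V}, \tag{3}$$

where

$$w^{i}_{jk} = \frac{\exp\left(\mathrm{emb}(u_j)W_i^{Q}\left(\mathrm{emb}(u_k)W_i^{K}\right)^{\top}\right)}{\sum_{m=0}^{j}\exp\left(\mathrm{emb}(u_j)W_i^{Q}\left(\mathrm{emb}(u_m)W_i^{K}\right)^{\top}\right)}, \tag{4}$$

and $W_i^{V}, W_i^{Q}, W_i^{K}$ are head-specific linear projection matrices. In particular, $W_i^{Q}$ and $W_i^{K}$ can be seen as projecting user embeddings into receiver space and sender space respectively for asymmetric modeling.
Then we have the active user embedding

$$\mathrm{act}(u_j) = [\mathrm{head}_1, \mathrm{head}_2, \ldots, \mathrm{head}_h]\,W^{O}, \tag{5}$$

where $[\cdot]$ indicates the concatenation operation and $W^{O}$ projects the concatenated results into the $d$-dimensional vector space.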
Multi-head attention allows the model to “divide and conquer” information from different perspectives (i.e. subspaces) independently and is thus more powerful than the traditional attention mechanism.
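The three steps of multi-head attention (project, attend per head, concatenate and project again) can be sketched as follows. This is a generic illustration in the style of [49], not the released implementation: the dictionary keys `"q"`, `"k"`, `"v"`, the function names, and the identity matrices used in the example are all assumptions made for the sketch.

```python
import math


def vecmat(v, m):
    """Row vector (length r) times matrix (r x c)."""
    return [sum(v[r] * m[r][c] for r in range(len(v))) for c in range(len(m[0]))]


def softmax(xs):
    top = max(xs)
    exps = [math.exp(x - top) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def multi_head_active_embedding(embs, heads, w_out):
    """embs: embeddings of infected users, oldest first.
    heads: list of dicts holding hypothetical 'q', 'k', 'v' projection matrices.
    w_out: matrix applied to the concatenation of all head outputs.
    """
    concat = []
    for head in heads:
        query = vecmat(embs[-1], head["q"])          # receiver-side projection
        keys = [vecmat(e, head["k"]) for e in embs]  # sender-side projection
        vals = [vecmat(e, head["v"]) for e in embs]
        weights = softmax([sum(a * b for a, b in zip(query, k)) for k in keys])
        concat += [
            sum(w * v[t] for w, v in zip(weights, vals))
            for t in range(len(vals[0]))
        ]
    return vecmat(concat, w_out)


identity = [[1.0, 0.0], [0.0, 1.0]]
toy_embs = [[1.0, 0.0], [0.0, 1.0]]
one_head = multi_head_active_embedding(
    toy_embs, [{"q": identity, "k": identity, "v": identity}], identity
)
two_heads = multi_head_active_embedding(
    toy_embs,
    [{"q": identity, "k": identity, "v": identity}] * 2,
    [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]],
)
```

With a single head and identity projections the result reduces to plain softmax attention, which is a convenient sanity check when experimenting with the projections.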
4.4 Unifying Active User Embeddings for Prediction with Convolutional Network
Different from previous works [50, 21] which directly apply a time-decay weight that assumes larger weights for the most recently infected users, we propose to use a parameterized neural network to handle the active user embeddings at different positions. Compared with a predefined exponential-decay weighting function [21], a parameterized neural network can be learned automatically to fit the real-world dataset and capture the intrinsic relationship between the active user embedding at each position and the next infected user prediction. In this paper, we use a Convolutional Neural Network (CNN) for this purpose.
CNNs have been widely used in image recognition [51], recommender systems [52] and natural language processing [53]. A CNN is a shift-invariant neural network and allows us to assign position-specific linear projections to the embeddings.
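Figure 4 illustrates an example of our convolutional layer with window size $win$. The convolutional layer first converts each active user embedding $\mathrm{act}(u_{j-n})$ into a $|U|$-dimensional vector by a position-specific linear projection matrix $W_n^{C} \in \mathbb{R}^{d \times |U|}$ for $n = 0, 1, \ldots, win-1$. Then the convolutional layer sums up the projected vectors and normalizes the summation by the softmax function.
Formally, given the partially observed cascade sequence $\{u_0, u_1, \ldots, u_j\}$, the predicted probability distribution $\mathrm{pre}_j$ is

$$\mathrm{pre}_j = \mathrm{softmax}\left(\sum_{n=0}^{win-1} \mathrm{act}(u_{j-n})\,W_n^{C}\right), \tag{6}$$

where $\mathrm{softmax}(x)[i] = \exp(x[i]) / \sum_{p} \exp(x[p])$ and $x[i]$ denotes the $i$-th entry of a vector $x$. Each entry of $\mathrm{pre}_j$ represents the probability that the corresponding user gets infected at the next step.
Since the initial user $u_0$ plays an important role in the whole diffusion process, we further take $u_0$ into consideration:

$$\mathrm{pre}_j = \mathrm{softmax}\left(\sum_{n=0}^{win-1} \mathrm{act}(u_{j-n})\,W_n^{C} + F_{init}\cdot\mathrm{emb}(u_0)\,W_{init}^{C}\right), \tag{7}$$

where $W_{init}^{C} \in \mathbb{R}^{d \times |U|}$ is the projection matrix for the initial user $u_0$ and $F_{init} \in \{0, 1\}$ is a hyperparameter which controls whether the initial user is incorporated for prediction or not.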
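The prediction step described above (position-specific projections of the most recent active user embeddings, summed, optionally combined with a projection of the initial user, then normalized by softmax) can be sketched as below. All names are hypothetical, and skipping window positions that fall before the start of the cascade is an assumption of this sketch, since the text does not say how short prefixes are handled.

```python
import math


def vecmat(v, m):
    """Row vector (length r) times matrix (r x c)."""
    return [sum(v[r] * m[r][c] for r in range(len(v))) for c in range(len(m[0]))]


def softmax(xs):
    top = max(xs)
    exps = [math.exp(x - top) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def predict_next(acts, conv_mats, init_emb=None, init_mat=None, use_init=0):
    """acts: active user embeddings, oldest first.
    conv_mats[n]: projection applied to the embedding n steps back.
    use_init: 0/1 flag deciding whether the initial user contributes.
    """
    num_users = len(conv_mats[0][0])
    logits = [0.0] * num_users
    for n, mat in enumerate(conv_mats):
        pos = len(acts) - 1 - n
        if pos < 0:
            break  # assumption: positions before the cascade start add nothing
        logits = [a + b for a, b in zip(logits, vecmat(acts[pos], mat))]
    if use_init:
        logits = [a + b for a, b in zip(logits, vecmat(init_emb, init_mat))]
    return softmax(logits)


# Toy setting: embedding dimension 2, three candidate users, window size 2.
toy_acts = [[1.0, 0.0], [0.0, 1.0]]
toy_conv = [
    [[0.0, 0.0, 0.0], [0.0, 0.0, 2.0]],  # projection for the most recent position
    [[1.0, 0.0, 0.0], [0.0, 0.0, 0.0]],  # projection for one step back
]
probs = predict_next(toy_acts, toy_conv)
probs_init = predict_next(
    toy_acts, toy_conv,
    init_emb=[1.0, 1.0],
    init_mat=[[0.0, 5.0, 0.0], [0.0, 0.0, 0.0]],
    use_init=1,
)
```

In the toy example, switching the flag on shifts the most probable candidate from the third user to the second, which shows how the initial user can change the prediction.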
4.5 Overall Architecture, Model Details and Learning Algorithms
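We maximize the log-likelihood of all observed cascade sequences to build the overall optimization function:

$$\mathcal{L}(\Theta) = \sum_{c_i \in C}\;\sum_{j=0}^{m_i-2} \log \mathrm{pre}^{i}_{j}\left[u^{i}_{j+1}\right], \tag{8}$$

where $\mathrm{pre}^{i}_{j}[u^{i}_{j+1}]$ is the predicted probability of the ground truth next infected user $u^{i}_{j+1}$ at position $j$ in cascade $c_i$, and $\Theta$ is the set of all parameters that need to be learned, including user embeddings $\mathrm{emb}(u)$ for each $u \in U$, projection matrices in multi-head attention $W_i^{V}, W_i^{Q}, W_i^{K}$ for $i = 1, 2, \ldots, h$ and $W^{O}$, and projection matrices in the convolutional layer $W_{init}^{C}$ and $W_n^{C}$ for $n = 0, 1, \ldots, win-1$.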
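For one cascade, the objective reduces to summing the log of the probability that the model assigns to each ground truth next user. The sketch below assumes the per-step distributions have already been computed and are indexed by integer user ids; the function name is hypothetical.

```python
import math


def cascade_log_likelihood(step_distributions, targets):
    """step_distributions[j]: predicted distribution over users at step j.
    targets[j]: index of the ground truth next infected user at step j.
    """
    return sum(math.log(dist[t]) for dist, t in zip(step_distributions, targets))


# Two prediction steps over two candidate users (toy numbers).
ll = cascade_log_likelihood([[0.5, 0.5], [0.25, 0.75]], [0, 1])
```

Training maximizes the sum of this quantity over all cascades, which in practice is usually implemented as minimizing its negative with a gradient-based optimizer.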
Implementation Details. We implement our model using PyTorch (http://pytorch.org) and optimize the parameters by gradient descent with the Adam optimizer [54]. We further apply layer normalization [55] and a residual connection [56] to the active user embedding to avoid the exploding or vanishing gradient problems that may occur in deep neural networks. In other words, the active user embedding $\mathrm{act}(u_j)$ is replaced by $\mathrm{LayerNorm}(\mathrm{act}(u_j) + \mathrm{emb}(u_j))$, where the $\mathrm{LayerNorm}$ function encourages the output to have zero mean and unit variance. We also apply dropout [57] to the attention mechanism to prevent our model from overfitting and the dropout rate is set to . Since the same user will not be infected twice, we mask the users that are already infected in Eq. 7 so that they will not be predicted. We release our source code on GitHub (https://github.com/albertyang33/NeuralDiffusionModel), where all the details are listed. Hyperparameter settings will be introduced in the next section.
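Two of these implementation details, the normalized residual connection and the masking of already infected users, can be illustrated with a short standard-library sketch. It is not the released PyTorch code: the layer normalization here has no learnable gain or bias, the residual is assumed to add the user's own embedding to the attention output, and all function names are hypothetical.

```python
import math


def layer_norm(x, eps=1e-6):
    """Normalize a vector to roughly zero mean and unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def residual_layer_norm(act, emb):
    # Assumption: the residual connection adds the user's own embedding
    # to the attention output before normalization.
    return layer_norm([a + e for a, e in zip(act, emb)])


def masked_softmax(logits, infected):
    """Softmax that gives zero probability to already infected user indices.

    Assumes at least one index is not masked (for example the virtual
    "Terminate" user, which is never infected).
    """
    masked = [float("-inf") if i in infected else v for i, v in enumerate(logits)]
    top = max(masked)
    exps = [math.exp(v - top) for v in masked]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


normed = residual_layer_norm([0.5, 1.0, 1.5], [0.5, 1.0, 1.5])
masked_probs = masked_softmax([1.0, 2.0, 3.0], infected={2})
```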
Complexity. The space complexity of our model is $O(d|U|)$ where $d$ is the embedding dimension, which is much smaller than the size of the user set. Note that the space complexity of training a traditional IC model can go up to $O(|U|^{2})$ because we need to assign an infection probability to each pair of potentially linked users. Therefore, the space complexity of our neural model is lower than that of traditional IC methods.
The computation of a single active embedding takes time where is the length of corresponding cascade and the next infected user prediction in Eq. 7 step takes time. Hence the time complexity of training a single cascade is which is competitive with previous neural-based models such as the Embedded IC model [23]. But as we will show in the experiments, our model converges much faster than the Embedded IC model and is capable of handling large-scale datasets.
5 Experiments
We conduct experiments on diffusion prediction task as previous works did [23] to evaluate the performance of our model and various baseline methods. We will first introduce the baseline methods, evaluation metrics and hyperparameter settings. Then we will present the experimental results and give further analysis about the evaluation.
5.1 Baselines
We consider a number of state-of-the-art baselines to demonstrate the effectiveness of our algorithm. Most of the baseline methods learn a transition probability matrix from cascade sequences, where each entry $p_{u,v}$ represents the probability that user $v$ gets infected by user $u$ when $u$ is activated.
Netrate [21] considers the time-varying dynamics of the diffusion probability through each link and defines three transmission probability models, i.e. exponential, power-law and Rayleigh, which encourage the diffusion probability to decrease as the time interval increases. In our experiments, we only report the results of the exponential model since the other two models give similar results.
Infopath [22] also targets inferring dynamic diffusion probabilities based on information diffusion data. Infopath employs stochastic gradient to estimate the temporal dynamics and studies the temporal evolution of information pathways.
Embedded IC [23] explores representation learning technique and models the diffusion probability between two users by a function of their user embeddings instead of a static value. Embedded IC model is trained by stochastic gradient descent method.
LSTM is a widely used neural network framework [58] for sequential data modeling and has been used for cascade modeling recently. Previous works employ LSTM for some simpler tasks such as popularity prediction [26] and cascade prediction with known diffusion graph [28, 27]. Since none of these works are directly comparable to ours, we adopt LSTM network for comparison by adding a softmax classifier to the hidden state of LSTM at each step for next infected user prediction.
5.2 Hyperparameter Settings for Neural Models
Though the parameter space of neural network based methods is much smaller than that of traditional IC models, we have to set several hyperparameters to train neural models. To tune the hyperparameters, we randomly select of training cascade sequences as validation set. Note that all training cascade sequences including the validation set will be used to train the final model for testing.
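Table II: Experimental results on diffusion prediction.

| Metric | Dataset | Netrate | Infopath | Embedded IC | LSTM | NDM | Improvement |
|---|---|---|---|---|---|---|---|
| Macro-F1 | Lastfm | 0.017 | 0.030 | 0.020 | 0.026 | 0.056 | +87% |
| Macro-F1 | Memetracker | 0.068 | 0.110 | 0.060 | 0.102 | 0.139 | +26% |
| Macro-F1 | Irvine | 0.032 | 0.052 | 0.054 | 0.041 | 0.076 | +41% |
| Macro-F1 | Twitter | - | 0.044 | - | 0.103 | 0.139 | +35% |
| Micro-F1 | Lastfm | 0.007 | 0.046 | 0.085 | 0.072 | 0.095 | +12% |
| Micro-F1 | Memetracker | 0.050 | 0.142 | 0.115 | 0.137 | 0.171 | +20% |
| Micro-F1 | Irvine | 0.029 | 0.073 | 0.102 | 0.080 | 0.108 | +6% |
| Micro-F1 | Twitter | - | 0.010 | - | 0.052 | 0.087 | +67% |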
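Table III: Experimental results on diffusion prediction at early stage (only the first five infected users are predicted).

| Metric | Dataset | Netrate | Infopath | Embedded IC | LSTM | NDM | Improvement |
|---|---|---|---|---|---|---|---|
| Macro-F1 | Lastfm | 0.018 | 0.028 | 0.010 | 0.018 | 0.048 | +71% |
| Macro-F1 | Memetracker | 0.071 | 0.094 | 0.042 | 0.091 | 0.122 | +30% |
| Macro-F1 | Irvine | 0.031 | 0.030 | 0.027 | 0.018 | 0.064 | +106% |
| Macro-F1 | Twitter | - | 0.040 | - | 0.097 | 0.123 | +27% |
| Micro-F1 | Lastfm | 0.016 | 0.035 | 0.013 | 0.019 | 0.045 | +29% |
| Micro-F1 | Memetracker | 0.076 | 0.106 | 0.040 | 0.094 | 0.126 | +19% |
| Micro-F1 | Irvine | 0.028 | 0.030 | 0.029 | 0.020 | 0.065 | +117% |
| Micro-F1 | Twitter | - | 0.050 | - | 0.093 | 0.118 | +27% |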
For Embedded IC model, the dimension of user embeddings is selected from as the original paper did [23]. For LSTM model, the dimensions of user embeddings and hidden states are set to the best choice from . For our model NDM, the number of heads used in multi-head attention is set to , the window size of convolutional network is set to and the dimension of user embeddings is set to . Note that we use the same set of for all the datasets. The flag in Eq. 7 which determines whether the initial user is used for prediction is set to for Twitter dataset and for the other three datasets. We will show the robustness of our model in the parameter sensitivity subsection.
Note that neural models, i.e. Embedded IC, LSTM and NDM, are based on matrix multiplication operations and thus naturally benefit from GPU acceleration. Therefore, we train these three methods on a GPU device (GeForce GTX TITAN X) instead of a CPU device (Intel Xeon E5-2620 @ 2.0GHz).
5.3 Diffusion Prediction
To compare the ability of cascade modeling, we evaluate our model and all baseline methods on the diffusion prediction task. We follow the experimental settings in Embedded IC [23]. We randomly select cascade sequences as the training set and use the rest as the test set. For each cascade sequence in the test set, only the initial user is given and all subsequently infected users need to be predicted.
All baseline methods and our model are required to predict a set of users, and the results are compared with the ground truth infected user set. For baseline methods that are grounded in the IC model, i.e. Netrate, Infopath and Embedded IC, we simulate the infection process according to the learned pairwise diffusion probabilities and their corresponding generation process. For LSTM and our model, we sequentially sample a user according to the probability distribution of the softmax classifier at each step.
Note that the ground truth infected user set could also be partially observed because the datasets are crawled within a short time window. Therefore, for each test sequence with ground truth infected users, all the algorithms are only required to predict the first infected users in a single simulation. Also note that the simulation may terminate and stop infecting new users before activating users.
We conduct times Monte Carlo simulations for each test cascade sequence for all algorithms and compute the infection probability of each user. We evaluate the prediction results using two classic evaluation metrics: Macro-F1 and Micro-F1.
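The sampling and Monte Carlo procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the `next_user_probs` callback (standing in for the softmax output of a trained sequential model) and the use of `None` as a stop symbol are all hypothetical.

```python
import random
from collections import Counter


def simulate_cascade(initial_user, next_user_probs, max_infected, rng,
                     stop_token=None):
    """Sample one cascade by drawing the next infected user step by step.

    next_user_probs is a hypothetical callback mapping the list of users
    infected so far to a dict {candidate: probability}.
    """
    infected = [initial_user]
    while len(infected) - 1 < max_infected:
        probs = next_user_probs(infected)
        candidates = list(probs)
        weights = [probs[c] for c in candidates]
        choice = rng.choices(candidates, weights=weights, k=1)[0]
        if choice == stop_token:  # the simulation may terminate early
            break
        infected.append(choice)
    return infected[1:]  # predicted users, excluding the given initial user


def estimate_infection_probability(initial_user, next_user_probs,
                                   max_infected, n_runs, seed=0):
    """Estimate each user's infection probability as the fraction of
    simulation runs in which that user was infected."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_runs):
        run = simulate_cascade(initial_user, next_user_probs, max_infected, rng)
        counts.update(set(run))
    return {user: hits / n_runs for user, hits in counts.items()}
```

Users that never appear in any run are simply absent from the returned dictionary, which corresponds to an estimated probability of zero.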
Macro-F1. Macro-averaged F1 first computes the precision, recall and F1 score locally for each test cascade sequence in the test set. Then the macro-averaged F-measure takes the average over all test cascade sequences.
Micro-F1. Micro-averaged F1 computes precision and recall globally by averaging over all predictions, and serves as a complementary view by giving larger weights to longer cascades.
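The two metrics can be illustrated with a short sketch. It assumes that each test cascade comes with a predicted user set and a ground truth user set (how the predicted set is derived from the estimated infection probabilities is not covered here), and that the global precision and recall are obtained by pooling counts over all cascades. The function names are hypothetical.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_micro_f1(predicted_sets, truth_sets):
    """Return (macro_f1, micro_f1) over paired predicted/ground-truth sets."""
    per_cascade_f1 = []
    tp_total = pred_total = truth_total = 0
    for predicted, truth in zip(predicted_sets, truth_sets):
        tp = len(predicted & truth)  # correctly predicted users
        precision = tp / len(predicted) if predicted else 0.0
        recall = tp / len(truth) if truth else 0.0
        per_cascade_f1.append(f1(precision, recall))
        # Pool counts so that longer cascades carry more weight in Micro-F1.
        tp_total += tp
        pred_total += len(predicted)
        truth_total += len(truth)
    macro = sum(per_cascade_f1) / len(per_cascade_f1)
    micro_precision = tp_total / pred_total if pred_total else 0.0
    micro_recall = tp_total / truth_total if truth_total else 0.0
    return macro, f1(micro_precision, micro_recall)
```

With one short and one long cascade, the two averages generally differ, because the long cascade dominates the pooled counts used by Micro-F1.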
To further evaluate the performance of cascade prediction at an early stage, we conduct additional experiments by only predicting the first five infected users in each test cascade. We present the experimental results in Tables II and III. Here “” indicates that the algorithm fails to converge in hours. The last column represents the relative improvement of NDM against the best performing baseline method. We have the following observations:
(1) NDM consistently and significantly outperforms all the baseline methods. As shown in Table II, the relative improvement against the best performing baseline is at least in terms of Macro-F1 score. The improvement on Micro-F1 score further demonstrates the effectiveness and robustness of our proposed model. The results also indicate that well-designed neural network models are able to surpass traditional cascade methods on cascade modeling.
(2) NDM has even more significant improvements on the cascade prediction task at an early stage. As shown in Table III, NDM outperforms all baselines by a large margin on both Macro-F1 and Micro-F1 scores. Note that it is very important to predict the first wave of infected users accurately for real-world applications, because a wrong prediction will lead to error propagation in the following stages. A precise prediction of infected users at an early stage enables us to better control the spread of information items through users. For example, we can prevent the spread of a rumor by warning the most vulnerable users in advance, and promote the spread of a product by paying more attention to the most promising potential customers. This experiment suggests that NDM can be used in real-world applications.
(3) NDM is capable of handling large-scale cascade datasets. The previous neural-based method, Embedded IC, fails to converge in hours on the Twitter dataset with around thousand users and million of potential links. In contrast, NDM converges in hours on this dataset with the same GPU device, which is at least times faster than Embedded IC. This observation demonstrates the scalability of NDM.
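The relative improvement reported in the last column of the result tables can be recomputed from the other columns. The helper below is a small sketch with a hypothetical function name; it compares NDM with the best baseline score that is available for a given row.

```python
def relative_improvement(model_score, baseline_scores):
    """Relative gain of the model over the best-performing baseline."""
    best_baseline = max(baseline_scores)
    return (model_score - best_baseline) / best_baseline


# Lastfm, Macro-F1, early-stage prediction: Netrate, Infopath, Embedded IC, LSTM
lastfm_baselines = [0.018, 0.028, 0.010, 0.018]
print(f"{relative_improvement(0.048, lastfm_baselines):+.0%}")  # +71%
```

Baselines that fail to converge have no score and are simply left out of the list.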
5.4 Social Link Prediction
Sometimes the underlying social network of users is available, e.g. the Twitter dataset used in our experiments. In the Twitter dataset, a network of Twitter followers is observed though the information diffusion is not necessarily passed through the edges of the social network. Though the diffusion network and the social network do not strictly align with each other, we still expect that the most closely related users in information diffusion should also be socially connected in the social network. Therefore, we conduct social link prediction experiments to verify this statement.
Firstly, we need to specify “the most closely related users”. For the Infopath algorithm, the model output directly contains the diffusion probability of each inferred edge. Thus we can rank the users by their diffusion probability to get the most closely related users. For LSTM and our model NDM, the output contains users’ embeddings as real-valued vectors, and we can simply use the inner product or the multi-head attention weight in Eq. 4 between user embeddings to measure the closeness of two users.
Secondly, we use the following experimental settings for evaluation. Note that this setting is a reasonable choice but not the only one. For each user in the dataset, we select the most closely related user according to the first step and check whether this most closely related user is a follower of that user. We then use accuracy as the evaluation metric. We present the experimental results in Fig. 5.
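The evaluation protocol above can be sketched for the inner-product variant as follows. This is an illustrative sketch only: the function names, the dictionary-based data layout and the toy vectors are assumptions, and the attention-weight variant from Eq. 4 is not shown.

```python
def most_related_user(user, embeddings):
    """Return the other user whose embedding has the largest inner
    product with the embedding of `user`."""
    best_user, best_score = None, float("-inf")
    for other, vector in embeddings.items():
        if other == user:
            continue
        score = sum(a * b for a, b in zip(embeddings[user], vector))
        if score > best_score:
            best_user, best_score = other, score
    return best_user


def link_prediction_accuracy(embeddings, followers):
    """Fraction of users whose most related user is one of their followers.

    followers maps each user to the set of users following them.
    """
    hits = sum(
        1 for user in embeddings
        if most_related_user(user, embeddings) in followers.get(user, set())
    )
    return hits / len(embeddings)


# Toy data, purely illustrative.
toy_embeddings = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0]}
toy_followers = {"a": {"b"}, "b": {"c"}, "c": {"b"}}
accuracy = link_prediction_accuracy(toy_embeddings, toy_followers)
```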
From the results in Fig. 5 we can see that all three algorithms, i.e. LSTM, Infopath and NDM, are able to predict the social links in the Twitter dataset to some extent. Even the accuracy of LSTM is much higher than a random guess (around ). This indicates that information frequently spreads through some social links and thus these links can be inferred successfully. Moreover, our NDM model performs best on this task, which indicates that NDM can better capture the intrinsic relationship between users. However, the absolute value of social link prediction accuracy is still not high (less than ). One possible reason is that the overlap between a diffusion network and a social network is small compared with the entire network.
5.5 Benefits from Social Network
Conversely, we also hope that the diffusion prediction process could benefit from the observed social network structure. We apply a simple modification to our NDM model to take advantage of the social network, which we now describe in detail.
Firstly, we embed the topological social network structure into real-valued user features by DeepWalk [41], a widely used network representation learning algorithm. The dimension of the network embeddings learned by DeepWalk is set to half of the representation size of our model. Secondly, we use the learned network embeddings to initialize the first half of the dimensions of the user representations of our model and fix them during the training process without changing any other modules. In other words, each user representation is made up of a fixed network embedding learned by DeepWalk from the social network structure and a randomly initialized trainable embedding of the same dimension. We name the modified model, with the Social Network considered, NDM+SN for short. This is a simple but useful implementation, and we will explore a more sophisticated model that takes the social network into modeling directly in future work. Figs. 6 and 7 show the comparison between NDM and NDM+SN.
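The half-fixed, half-trainable user representation can be illustrated with a framework-free sketch. Everything here is hypothetical: the class name, the initialization range and the plain gradient step are assumptions chosen for brevity, and the pretrained vector stands in for a DeepWalk embedding.

```python
import random


class HalfFixedEmbedding:
    """User representation = [fixed network embedding | trainable embedding]."""

    def __init__(self, network_embedding, rng):
        # First half: pretrained on the social network, never updated.
        self.fixed = tuple(network_embedding)
        # Second half: same size, randomly initialized and trainable.
        self.trainable = [rng.uniform(-0.1, 0.1) for _ in self.fixed]

    def vector(self):
        """Full representation used by the rest of the model."""
        return list(self.fixed) + self.trainable

    def sgd_step(self, grad, lr):
        """Apply a gradient step to the trainable half only."""
        offset = len(self.fixed)
        for i in range(len(self.trainable)):
            self.trainable[i] -= lr * grad[offset + i]
```

Because the gradient entries for the first half are ignored, the prior knowledge from the social network stays intact while the second half adapts to the cascade data.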
Experimental results show that NDM+SN slightly improves the performance on the diffusion prediction task by incorporating social network structure as prior knowledge. The relative improvement of Micro-F1 is around . The results demonstrate that our neural model is very flexible and can be easily extended to take advantage of external features. Note that these results are also consistent with those in the previous subsection: the diffusion network and social network have overlapping parts, but the overlapping part is relatively small compared to the whole network.
5.6 Parameter Sensitivity
In this subsection, we take the Lastfm dataset as an illustrative example to present how hyperparameter settings affect the performance of our model. We use the best set of hyperparameter settings as our basis, i.e. number of heads , window size of convolutional network , dimension of user embeddings and flag of using initial user for prediction . Then we vary each hyperparameter while keeping the others fixed. Fig. 8 shows the performance on diffusion prediction under different hyperparameter settings.
We can see that the performance of NDM is stable when we vary the hyperparameters within a reasonable range. NDM does not encounter a serious overfitting problem when we double the dimension of embeddings to . This experiment demonstrates the robustness of our model.
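The one-at-a-time protocol used in this subsection can be sketched as a small configuration generator. The function name and the numeric values below are purely illustrative placeholders and are not the settings used in the experiments.

```python
def one_at_a_time_configs(base, candidates):
    """Yield (name, config) pairs that differ from `base` in exactly one entry."""
    for name, values in candidates.items():
        for value in values:
            if value == base[name]:
                continue  # the base setting itself is evaluated separately
            config = dict(base)
            config[name] = value
            yield name, config


# Placeholder values, chosen only to show the mechanics.
example_base = {"heads": 2, "window": 3}
example_candidates = {"heads": [1, 2, 4], "window": [3, 5]}
example_configs = list(one_at_a_time_configs(example_base, example_candidates))
```

Each generated configuration would then be trained and evaluated independently, so that any change in performance can be attributed to a single hyperparameter.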
5.7 Interpretability
Admittedly, interpretability is usually a weak point of neural network models. Compared with feature engineering methods, neural-based models encode a user into a real-valued vector space and there is no explicit meaning of each dimension of user embeddings. In our proposed model, each user embedding is projected into one subspace per head by the multi-head attention mechanism. Intuitively, the user embedding in each subspace represents a specific role of the user. But it is quite hard for us to link the embeddings to interpretable hand-crafted features. We will consider the alignment between user embeddings and interpretable features based on a joint model in future work.
Dataset  

Lastfm  32.3  60.0  49.2  49.1 
Memetracker  13.3  16.6  13.3  13.0 
Irvine  13.9  13.9  13.7  13.7 
Twitter  130.3  93.6  91.5  91.5
Fortunately, we still have some findings in the convolutional layer. Recall that the convolutional layer uses position-specific linear projection matrices, one for each position in the window, together with a projection matrix for the initial user. All four matrices are randomly initialized before training. In a learned model, if the scale of one of these matrices is much larger than that of the others, then the prediction vector is more likely to be dominated by the corresponding position. For example, if the scale of the matrix for the most recent position is much larger than that of the others, then we can infer that the most recently infected user contributes most to the prediction of the next infected user.
Following the notation in Eq. 7, we use the initial user for prediction on all datasets in this experiment and compute the squared Frobenius norm of the learned projection matrices, as shown in Table IV. We have the following observations:
(1) For all four datasets, the scales of the three position-specific matrices are comparable, and the scale of the matrix for the most recent position is always slightly larger than that of the other two. This observation indicates that the active embeddings of all three recently infected users contribute to the prediction of the next infected user. Also, the most recently infected user is the most important one among the three. This finding matches our intuition and supports Assumption 2 proposed in the method section.
(2) The scale of the projection matrix for the initial user is the largest on the Twitter dataset. This indicates that the initial user is very important in the diffusion process on Twitter. This is partly because the Twitter dataset contains the complete history of the spread of a URL and the initial user is actually the first one tweeting the URL, whereas in the other three datasets the initial user is only the first one within the time window of the crawled data. Note that we use the initial user for prediction only on the Twitter dataset in the diffusion prediction task, because we find that the performance is comparable or even worse on the other three datasets if the initial user is used.
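The scale comparison behind Table IV relies on the squared Frobenius norm, i.e. the sum of squared entries of a matrix. A minimal sketch follows; the function names and the labels in the toy dictionary are hypothetical, and the toy matrices are not learned parameters.

```python
def squared_frobenius_norm(matrix):
    """Sum of the squared entries of a matrix given as a list of rows."""
    return sum(value * value for row in matrix for value in row)


def dominant_projection(matrices):
    """Name of the projection matrix with the largest squared Frobenius norm."""
    return max(matrices, key=lambda name: squared_frobenius_norm(matrices[name]))


# Toy matrices, purely illustrative.
toy_matrices = {
    "initial": [[3.0, 0.0], [0.0, 3.0]],
    "most_recent": [[1.0, 1.0], [1.0, 1.0]],
}
```

Comparing these norms across matrices of the same shape gives a rough indication of which input position dominates the prediction vector.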
6 Conclusion
In this paper, we propose a Neural Diffusion Model (NDM) for microscopic cascade modeling. To go beyond the limitations of traditional cascade models based on strong assumptions and oversimplified formulas, we build our model based on two heuristic assumptions and employ deep learning techniques including convolutional neural network and attention mechanism to implement the assumptions. Experimental results on the diffusion prediction task demonstrate the effectiveness and robustness of our proposed model. In addition, NDM greatly outperforms baseline methods on diffusion prediction at an early stage, which shows the applicability and feasibility of NDM for real-world applications.
For future work, we will consider linking neural-based models with hand-crafted features and statistics to improve the interpretability of learned models. An intelligible model is always welcome and can help us better understand the motivations and behaviors of users in a diffusion process.
The incorporation of extra information for cascade modeling is also an intriguing direction. For example, the timestamp information and the description of information items can be used for more accurate cascade modeling.
Acknowledgments
This work was supported by the 973 Program (No. 2014CB340501), the Major Project of the National Social Science Foundation of China (13&ZD190) and the National Natural Science Foundation of China (No. 61772302). This work is also part of the NExT++ project, supported by the National Research Foundation, Prime Minister’s Office, Singapore under its IRC@Singapore Funding Initiative.
References
 [1] M. J. Salganik, P. S. Dodds, and D. J. Watts, “Experimental study of inequality and unpredictability in an artificial cultural market,” Science, 2006.
 [2] J. Cheng, L. Adamic, P. A. Dow, J. M. Kleinberg, and J. Leskovec, “Can cascades be predicted?” in Proceedings of WWW. ACM, 2014.
 [3] L. Yu, P. Cui, F. Wang, C. Song, and S. Yang, “From micro to macro: Uncovering and predicting information cascading process with behavioral dynamics,” in Data mining (ICDM). IEEE, 2015, pp. 559–568.
 [4] Q. Zhao, M. A. Erdogdu, H. Y. He, A. Rajaraman, and J. Leskovec, “Seismic: A self-exciting point process model for predicting tweet popularity,” in Proceedings of the 21th ACM SIGKDD. ACM, 2015, pp. 1513–1522.
 [5] P. Domingos and M. Richardson, “Mining the network value of customers,” in Proceedings of SIGKDD. ACM, 2001.
 [6] J. Leskovec, A. Singh, and J. Kleinberg, “Patterns of influence in a recommendation network,” in Pacific-Asia Conference on Knowledge Discovery and Data Mining. Springer, 2006, pp. 380–389.
 [7] J. Leskovec, L. A. Adamic, and B. A. Huberman, “The dynamics of viral marketing,” ACM Transactions on the Web (TWEB), vol. 1, no. 1, p. 5, 2007.
 [8] D. J. Watts and P. S. Dodds, “Influentials, networks, and public opinion formation,” Journal of consumer research, vol. 34, no. 4, pp. 441–458, 2007.
 [9] S. Aral and D. Walker, “Identifying influential and susceptible members of social networks,” Science, 2012.
 [10] H. W. Hethcote, “The mathematics of infectious diseases,” SIAM review, vol. 42, no. 4, pp. 599–653, 2000.
 [11] J. Wallinga and P. Teunis, “Different epidemic curves for severe acute respiratory syndrome reveal similar impacts of control measures,” American Journal of epidemiology, 2004.
 [12] D. Kempe, J. Kleinberg, and É. Tardos, “Maximizing the spread of influence through a social network,” in Proceedings of SIGKDD. ACM, 2003, pp. 137–146.
 [13] T. Lappas, E. Terzi, D. Gunopulos, and H. Mannila, “Finding effectors in social networks,” in Proceedings of SIGKDD. ACM, 2010, pp. 1059–1068.
 [14] P. A. Dow, L. A. Adamic, and A. Friggeri, “The anatomy of large facebook cascades.” ICWSM, 2013.
 [15] D. Gruhl, R. Guha, D. Liben-Nowell, and A. Tomkins, “Information diffusion through blogspace,” in Proceedings of WWW. ACM, 2004.
 [16] D. Liben-Nowell and J. Kleinberg, “Tracing information flow on a global scale using internet chain-letter data,” Proceedings of the National Academy of Sciences, vol. 105, no. 12, pp. 4633–4638, 2008.
 [17] J. Leskovec, L. Backstrom, and J. Kleinberg, “Meme-tracking and the dynamics of the news cycle,” in Proceedings of SIGKDD. ACM, 2009, pp. 497–506.
 [18] D. M. Romero, B. Meeder, and J. Kleinberg, “Differences in the mechanics of information diffusion across topics: idioms, political hashtags, and complex contagion on twitter,” in Proceedings of WWW. ACM, 2011, pp. 695–704.
 [19] S. Myers and J. Leskovec, “On the convexity of latent social network inference,” in Advances in neural information processing systems, 2010, pp. 1741–1749.
 [20] M. Gomez Rodriguez, J. Leskovec, and A. Krause, “Inferring networks of diffusion and influence,” in Proceedings of SIGKDD. ACM, 2010.
 [21] M. G. Rodriguez, J. Leskovec, D. Balduzzi, and B. Schölkopf, “Uncovering the structure and temporal dynamics of information propagation,” Network Science, vol. 2, no. 1, pp. 26–65, 2014.
 [22] M. Gomez Rodriguez, J. Leskovec, and B. Schölkopf, “Structure and dynamics of information pathways in online media,” in Proceedings of WSDM. ACM, 2013.
 [23] S. Bourigault, S. Lamprier, and P. Gallinari, “Representation learning for information diffusion through social networks: an embedded cascade model,” in Proceedings of WSDM. ACM, 2016.
 [24] S. Bourigault, C. Lagnier, S. Lamprier, L. Denoyer, and P. Gallinari, “Learning social network embeddings for predicting information diffusion,” in Proceedings of WSDM. ACM, 2014.
 [25] S. Gao, H. Pang, P. Gallinari, J. Guo, and N. Kato, “A novel embedding method for information diffusion prediction in social network big data,” IEEE Transactions on Industrial Informatics, 2017.
 [26] C. Li, J. Ma, X. Guo, and Q. Mei, “Deepcas: An end-to-end predictor of information cascades,” in Proceedings of WWW. International World Wide Web Conferences Steering Committee, 2017, pp. 577–586.
 [27] W. Hu, K. K. Singh, F. Xiao, J. Han, C.-N. Chuah, and Y. J. Lee, “Who will share my image?: Predicting the content diffusion path in online social networks,” in Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining. ACM, 2018, pp. 252–260.
 [28] J. Wang, V. W. Zheng, Z. Liu, and K. C.-C. Chang, “Topological recurrent neural network for diffusion prediction,” in ICDM. IEEE, 2017, pp. 475–484.
 [29] Z. T. Kefato, N. Sheikh, and A. Montresor, “Di: Diffusion network inference through representation learning,” 2017.
 [30] P. Cui, S. Jin, L. Yu, F. Wang, W. Zhu, and S. Yang, “Cascading outbreak prediction in networks: a data-driven approach,” in Proceedings of SIGKDD. ACM, 2013.
 [31] O. Tsur and A. Rappoport, “What’s in a hashtag?: content based prediction of the spread of ideas in microblogging communities,” in Proceedings of WSDM. ACM, 2012, pp. 643–652.
 [32] L. Weng, F. Menczer, and Y.-Y. Ahn, “Predicting successful memes using network and community structure,” in ICWSM, 2014.
 [33] H. Pinto, J. M. Almeida, and M. A. Gonçalves, “Using early view patterns to predict the popularity of youtube videos,” in Proceedings of WSDM. ACM, 2013, pp. 365–374.
 [34] S. Gao, J. Ma, and Z. Chen, “Modeling and predicting retweeting dynamics on microblogging platforms,” in Proceedings of WSDM. ACM, 2015.
 [35] Q. Cao, H. Shen, K. Cen, W. Ouyang, and X. Cheng, “Deephawkes: Bridging the gap between prediction and understanding of information cascades,” in Proceedings of CIKM. ACM, 2017.
 [36] J. Goldenberg, B. Libai, and E. Muller, “Talk of the network: A complex systems look at the underlying process of word-of-mouth,” Marketing letters, 2001.
 [37] K. Saito, R. Nakano, and M. Kimura, “Prediction of information diffusion probabilities for independent cascade model,” in Knowledge-based intelligent information and engineering systems. Springer, 2008, pp. 67–75.
 [38] K. Saito, M. Kimura, K. Ohara, and H. Motoda, “Learning continuous-time information diffusion model for social behavioral data analysis,” in Asian Conference on Machine Learning. Springer, 2009, pp. 322–337.
 [39] S. Wang, X. Hu, P. S. Yu, and Z. Li, “Mmrate: inferring multi-aspect diffusion networks with multi-pattern cascades,” in Proceedings of SIGKDD. ACM, 2014, pp. 1246–1255.
 [40] L. Tang and H. Liu, “Relational learning via latent social dimensions,” in Proceedings of SIGKDD. ACM, 2009, pp. 817–826.
 [41] B. Perozzi, R. Al-Rfou, and S. Skiena, “Deepwalk: Online learning of social representations,” in Proceedings of SIGKDD. ACM, 2014, pp. 701–710.
 [42] J. Tang, M. Qu, M. Wang, M. Zhang, J. Yan, and Q. Mei, “Line: Large-scale information network embedding,” in Proceedings of WWW. International World Wide Web Conferences Steering Committee, 2015, pp. 1067–1077.
 [43] T. N. Kipf and M. Welling, “Semi-supervised classification with graph convolutional networks,” arXiv preprint arXiv:1609.02907, 2016.
 [44] C. Yang, M. Sun, W. X. Zhao, Z. Liu, and E. Y. Chang, “A neural network approach to jointly modeling social networks and mobile trajectories,” TOIS, vol. 35, no. 4, p. 36, 2017.
 [45] Ò. Celma Herrada, “Music recommendation and discovery in the long tail,” 2009.
 [46] T. Opsahl and P. Panzarasa, “Clustering in weighted networks,” Social networks, vol. 31, no. 2, pp. 155–163, 2009.
 [47] N. O. Hodas and K. Lerman, “The simple rules of social contagion,” Scientific reports, vol. 4, p. 4343, 2014.
 [48] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” arXiv preprint arXiv:1409.0473, 2014.
 [49] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems, 2017, pp. 5998–6008.
 [50] M. G. Rodriguez, D. Balduzzi, and B. Schölkopf, “Uncovering the temporal dynamics of diffusion networks,” arXiv preprint arXiv:1105.0697, 2011.
 [51] Y. LeCun et al., “LeNet-5, convolutional neural networks,” URL: http://yann.lecun.com/exdb/lenet, 2015.
 [52] A. Van den Oord, S. Dieleman, and B. Schrauwen, “Deep content-based music recommendation,” in Advances in neural information processing systems, 2013, pp. 2643–2651.
 [53] R. Collobert and J. Weston, “A unified architecture for natural language processing: Deep neural networks with multitask learning,” in Proceedings of ICML. ACM, 2008.
 [54] D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
 [55] J. L. Ba, J. R. Kiros, and G. E. Hinton, “Layer normalization,” arXiv preprint arXiv:1607.06450, 2016.
 [56] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of CVPR, 2016, pp. 770–778.
 [57] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,” The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
 [58] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.
Cheng Yang is a 4th year PhD student of the Department of Computer Science and Technology, Tsinghua University. He got his B.E. degree from Tsinghua University in 2014. His research interests include natural language processing and network representation learning. He has published several top-level papers in international journals and conferences including ACM TOIS, IJCAI and AAAI.
Maosong Sun is a professor of the Department of Computer Science and Technology, Tsinghua University. He got his BEng degree in 1986 and MEng degree in 1988 from Department of Computer Science and Technology, Tsinghua University, and got his Ph.D. degree in 2004 from Department of Chinese, Translation, and Linguistics, City University of Hong Kong. His research interests include natural language processing, Chinese computing, Web intelligence, and computational social sciences. He has published over 150 papers in academic journals and international conferences in the above fields. He serves as a vice president of the Chinese Information Processing Society, the council member of China Computer Federation, the director of Massive Online Education Research Center of Tsinghua University, and the Editor-in-Chief of the Journal of Chinese Information Processing.
Haoran Liu is a 4th year undergraduate student of the Department of Electric Engineering, Tsinghua University. His research interests include network representation learning and machine learning. 
Shiyi Han is a 1st year master student in Computer Science department at Brown University. He got his B.E. degree from Beihang University in 2018. His research interests include natural language processing and machine learning. 
Zhiyuan Liu is an associate professor of the Department of Computer Science and Technology, Tsinghua University. He got his BEng degree in 2006 and his Ph.D. in 2011 from the Department of Computer Science and Technology, Tsinghua University. His research interests are natural language processing and social computation. He has published over 40 papers in international journals and conferences including ACM Transactions, IJCAI, AAAI, ACL and EMNLP. 
Huanbo Luan is the deputy director of NExT++ Research Center at both Tsinghua University and National University of Singapore. He received his B.S. degree in computer science from Shandong University in 2003 and Ph.D degree in computer science from Institute of Computing Technology, Chinese Academy of Sciences in 2008. His research interests include natural language processing, multimedia information retrieval, social media and big data analysis. 