Influential Node Detection in Implicit Social Networks using Multi-task Gaussian Copula Models
Influential node detection is a central research topic in social network analysis. Many existing methods rely on the assumption that the network structure is completely known a priori. However, in many applications, the network structure is unavailable to explain the underlying information diffusion phenomenon. To address the challenge of information diffusion analysis with incomplete knowledge of network structure, we develop a multi-task low-rank linear influence model. By exploiting the relationships between contagions, our approach can simultaneously predict the volume (i.e., time series prediction) for each contagion (or topic) and automatically identify the most influential nodes for each contagion. The proposed model is validated using synthetic data and an ISIS Twitter dataset. In addition to improving the volume prediction performance significantly, we show that the proposed approach can reliably infer the most influential users for specific contagions.
Editor: Oren Anava, Marco Cuturi, Azadeh Khaleghi, Vitaly Kuznetsov, Alexander Rakhlin
Information emerges dynamically and diffuses quickly via agent interactions in complex networks (e.g., social networks) (López-Pintado, 2008). Consequently, understanding and predicting information diffusion mechanisms is challenging. There is rapidly growing interest in exploiting knowledge of the information dynamics to better characterize the factors influencing the spread of diseases, planned terrorist attacks, effective social marketing campaigns, etc. (Guille and Hacid, 2012). The broad applicability of this problem in social network analysis has led to focused research on the following questions: (I) Which contagions are the most popular and can diffuse the most? (II) Which members of the network are influential and play important roles in the diffusion process? (III) What is the range over which the contagions can diffuse (Guille et al., 2013)? While attempting to answer these questions, one is confronted with two crucial challenges. First, a descriptive diffusion model, which can mimic the behavior observed in real-world data, is required. Second, efficient learning algorithms are required for inferring the influence structure based on the assumed diffusion model.
A variety of information diffusion prediction frameworks have been developed in the literature (Yang and Leskovec, 2010; Wang et al., 2013; Guille et al., 2013; Du et al., 2013; Zhang et al., 2016). A typical assumption in many of these approaches is that a connected network graph and knowledge of the corresponding structure are available a priori. However, in practice, the structure of the network can be implicit or difficult to model; for example, modeling the structure underlying the spread of an infectious disease is almost impossible. As a result, diffusion prediction models that do not require knowledge of the network structure have gained interest. For example, Yang and Leskovec (2010) proposed the linear influence model (LIM), which can effectively predict the information volume by assuming that each of the contagions spreads with the same influence in an implicit network. Subsequently, Wang et al. (2013) extended LIM by exploiting the sparse structure in the influence function to identify the influential nodes. Although the relationships between multiple contagions can be used for more accurate modeling, most existing approaches ignore that information.
In this paper, we address the above issues by augmenting linear influence models with complex task dependency information. More specifically, we consider the dependency of different contagions in the network, and characterize their relationships using copula theory. Furthermore, by imposing a low-rank regularizer, we are able to characterize the clustering structure of the contagions and the nodes in the network. Through this novel formulation, we attempt to both improve the accuracy of the prediction system and better regularize the influence structure learning problem. Finally, we develop an efficient algorithm based on proximal mappings to solve this optimization problem. Experiments with synthetic data reveal that the proposed approach fares significantly better than a state-of-the-art multi-task variant of LIM in terms of both volume prediction and influence structure estimation performance. In addition, we demonstrate the superiority of the proposed method in predicting the time-varying volume of tweets using the ISIS Twitter dataset (available from Kaggle at https://www.kaggle.com/kzaman/how-isis-uses-twitter).
In this section, we present the formulation of the linear influence model (LIM) (Yang and Leskovec, 2010) and discuss its limitations. Consider a set of $N$ nodes that participate in an information diffusion process of $K$ different contagions over time. Node $u \in \{1,\dots,N\}$ can be infected by contagion $k \in \{1,\dots,K\}$ at time $t \in \{0,1,\dots,T\}$. The volume $V_k(t)$ is defined as the total number of nodes that get infected by contagion $k$ at time $t$. Let the indicator function $M_{u,k}(t) = 1$ represent the event that node $u$ got infected by contagion $k$ at time $t$, and $M_{u,k}(t) = 0$ otherwise. LIM models the volume $V_k(t)$ as a sum of influences of the nodes $u$ that got infected before time $t$:
$$V_k(t+1) = \sum_{u=1}^{N} \sum_{l=0}^{L-1} M_{u,k}(t-l)\, I_u(l+1), \qquad (1)$$
where each node $u$ has a particular non-negative influence function $I_u(l)$. One can simply think of $I_u(l)$ as the number of follow-up infections $l$ time units after $u$ got infected. The value of $L$ is set to indicate that the influence of a node drops to $0$ after $L$ time units. Thus, the influence of node $u$ is denoted by the vector $I_u = (I_u(1),\dots,I_u(L))^T \in \mathbb{R}^{L \times 1}$. Next, using the notation $V_k = (V_k(1),\dots,V_k(T))^T \in \mathbb{R}^{T \times 1}$ and $M_k \in \mathbb{R}^{T \times LN}$, the inference procedure of LIM can be formulated as follows
$$\min_{I} \ \sum_{k=1}^{K} \| V_k - M_k I \|_2^2 + \Theta(I), \qquad (2)$$
where $I \in \mathbb{R}^{LN \times 1}$ is obtained via concatenation of $I_1,\dots,I_N$, $\|\cdot\|_2$ denotes the Euclidean norm, and $\Theta(I)$ is an indicator function that is zero when $I \geq 0$ (elementwise) and $+\infty$ otherwise. Though LIM has been effective in predicting the future volume for each contagion, it assumes that each node has the same influence across all the contagions. Consequently, to achieve contagion-sensitive node selection in an implicit network, LIM was extended to the multitask sparse linear influential model (MSLIM) in (Wang et al., 2013).
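As a rough illustration of the LIM formulation above, the following sketch builds the lagged indicator matrix for a single contagion and fits a non-negative influence vector by projected gradient descent. It assumes the standard LIM volume model of Yang and Leskovec (2010), in which the volume at a given step sums the influence values of nodes infected up to a fixed number of steps earlier. The function names, the array layout, the solver, and the toy data are illustrative assumptions, not the implementation used in the cited work.

```python
import numpy as np

def build_design_matrix(infected, L):
    """Stack lagged infection indicators into a (T, N*L) matrix.

    infected[t, u] = 1 if node u got infected at time t (hypothetical layout).
    Column u*L + l holds the indicator series of node u delayed by l steps,
    so row t of (M @ I) equals sum_u sum_l infected[t-l, u] * I[u*L + l].
    """
    T, N = infected.shape
    M = np.zeros((T, N * L))
    for l in range(L):
        M[l:, l::L] = infected[:T - l, :]
    return M

def fit_lim(M, V, n_iter=5000):
    """Minimize 0.5 * ||V - M I||^2 subject to I >= 0 by projected gradient."""
    step = 1.0 / np.linalg.norm(M, 2) ** 2  # 1 / Lipschitz constant of the gradient
    I = np.zeros(M.shape[1])
    for _ in range(n_iter):
        grad = M.T @ (M @ I - V)
        I = np.maximum(I - step * grad, 0.0)  # projection onto I >= 0
    return I

# Toy data: 5 nodes, influence lasting L = 3 steps, 200 time steps.
rng = np.random.default_rng(0)
infected = (rng.random((200, 5)) < 0.3).astype(float)
M = build_design_matrix(infected, L=3)
I_true = rng.random(15)
V = M @ I_true          # noiseless volumes for the sketch
I_hat = fit_lim(M, V)
```

With several contagions, the same fit would apply after stacking the per-contagion matrices and volume vectors row-wise, since LIM shares one influence vector across contagions.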
The influence function is defined by extending $I_u$ in LIM into the contagion-sensitive $I_{u,k}$, which is an $L$-length vector representing the influence of node $u$ for contagion $k$. For each contagion $k$, let $I_k \in \mathbb{R}^{LN \times 1}$ be the vector obtained by concatenating $I_{1,k},\dots,I_{N,k}$. For each node $u$, the influence matrix for the node is defined as $\mathbf{I}_u = [I_{u,1},\dots,I_{u,K}] \in \mathbb{R}^{L \times K}$. Using these notations, the inference procedure to estimate $I_{u,k}$ was formulated as follows
$$\min_{I_{u,k} \geq 0} \ \frac{1}{2} \sum_{k=1}^{K} \| V_k - M_k I_k \|_2^2 + \lambda \sum_{u=1}^{N} \| \mathbf{I}_u \|_F + \gamma \sum_{u=1}^{N} \sum_{k=1}^{K} \| I_{u,k} \|_2, \qquad (3)$$
where $\|\cdot\|_F$ denotes the Frobenius norm. The penalty term $\|\mathbf{I}_u\|_F$ was used to encourage the entire matrix $\mathbf{I}_u$ to be zero altogether, which means that node $u$ is non-influential for all the contagions. If the estimated $\|\mathbf{I}_u\|_F > 0$ (i.e., the matrix $\mathbf{I}_u$ is non-zero), a fine-grained selection is performed by the penalty $\sum_{k} \|I_{u,k}\|_2$, which is essentially a group-Lasso penalty and can encourage the sparsity of the vectors $\{I_{u,k}\}$. For a specific contagion $k$, one can identify the most influential nodes by finding the optimal solution of (3). However, the penalty terms used in MSLIM encourage certain nodes to have no influence over any of the contagions, which may not be true in practice. Furthermore, in most real-world applications, there exist complex dependencies among the contagions. In order to alleviate these shortcomings, we propose a novel probabilistic multi-task learning framework and develop efficient optimization strategies.
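The group-Lasso behaviour described above can be illustrated with block soft-thresholding, the proximal operator of the Euclidean norm: a whole influence vector is set exactly to zero when its norm falls below the threshold, and is shrunk otherwise. The sketch below is illustrative only. The threshold `gamma` and the example vectors are hypothetical, and this is not the MSLIM solver of Wang et al. (2013).

```python
import numpy as np

def block_soft_threshold(v, gamma):
    """Proximal operator of gamma * ||v||_2.

    Shrinks the whole vector toward zero and returns exactly zero when
    ||v||_2 <= gamma, which is how a group-Lasso penalty removes a group.
    """
    norm = np.linalg.norm(v)
    if norm <= gamma:
        return np.zeros_like(v)
    return (1.0 - gamma / norm) * v

# A weak influence vector (norm 0.5) is removed entirely.
weak = block_soft_threshold(np.array([0.3, 0.4]), gamma=1.0)
# A strong influence vector (norm 5) is kept and scaled by 1 - 1/5 = 0.8.
strong = block_soft_threshold(np.array([3.0, 4.0]), gamma=1.0)
```

Applying the same operator to a node's full influence matrix, with the Frobenius norm in place of the Euclidean norm, gives the node-level selection effect that the text attributes to the first penalty term.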
3 Proposed Approach
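Probabilistic Multi-Contagion Modeling of Diffusion: We assume a linear regression model for each task: $V_k = M_k I_k + n_k$, where $V_k$ and $M_k$ are defined as before, $I_k$ is the influence vector for contagion $k$, and $n_k$ is an i.i.d. zero-mean Gaussian noise vector with the covariance matrix $\Sigma_k$. The distribution for $V_k$ given $M_k$, $I_k$, and $\Sigma_k$ can be expressed as
$$V_k \mid M_k, I_k, \Sigma_k \sim \mathcal{N}(M_k I_k, \Sigma_k).$$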
Assuming that the influence for a single contagion is also Gaussian distributed, we can express the marginal distributions as , where is the mean vector and can be expressed as , and is the covariance matrix of . For a node and contagion , we assume that the variables in the influence have the same mean, i.e., , where is a scalar and is a vector of all ones with dimension . Let represent the mean matrix with entries , and it is connected as , where and is the identity matrix with dimension and is the Kronecker product operator.
3.1 Dependence Structure Modeling Using Copulas
Consider a general case where the contagions are correlated. We construct a new influence matrix . In our formulation, ’s are assumed to be correlated and the joint distribution of is not a simple product of all the marginal distributions of as is adopted by most multi-task learning formulations. Here, we propose to use a multi-task copula that is obtained by tailoring the copula model for the multi-task learning problem.
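(Sklar’s Theorem). Consider an $m$-dimensional distribution function $F$ with marginal distribution functions $F_1,\dots,F_m$. Then there exists a copula $C$ such that for all $x_1,\dots,x_m$ in $[-\infty,\infty]$, $F(x_1,\dots,x_m) = C(F_1(x_1),\dots,F_m(x_m))$. If $F_i$ is continuous for $1 \leq i \leq m$, then $C$ is unique; otherwise it is determined uniquely on $\mathrm{ran}\,F_1 \times \dots \times \mathrm{ran}\,F_m$, where $\mathrm{ran}\,F_i$ is the range of $F_i$. Conversely, given a copula $C$ and univariate CDFs $F_1,\dots,F_m$, $F$ is a valid multivariate CDF with marginals $F_1,\dots,F_m$.
As a direct consequence of Sklar’s Theorem, for continuous distributions, the joint probability density function (PDF) is obtained by
$$f(x_1,\dots,x_m) = \Big( \prod_{i=1}^{m} f_i(x_i) \Big)\, c\big(F_1(x_1),\dots,F_m(x_m)\big),$$
where $f_i(\cdot)$ is the marginal PDF and $c$ is termed the copula density, given by
$$c(\mathbf{u}) = \frac{\partial^m C(\mathbf{u})}{\partial u_1 \cdots \partial u_m},$$
where $u_i = F_i(x_i)$. We extend the copula theory to multi-task learning and express the joint distribution of the influence matrix as follows: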
where is the CDF of the influence for contagion. The copula density function takes all marginal CDFs as its arguments, and maintains the output correlations in a parametric form.
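To make the decomposition above concrete, the sketch below evaluates a joint density as the product of marginal PDFs and a copula density, using the Gaussian copula as one parametric choice. The closed form for the Gaussian copula density is standard; the correlation matrix, the marginals, and the function names are hypothetical values chosen for illustration and are not taken from the paper.

```python
import numpy as np
from statistics import NormalDist

def gaussian_copula_density(u, R):
    """Density of the Gaussian copula with correlation matrix R at u in (0, 1)^m.

    c(u) = |R|^(-1/2) * exp(-0.5 * z^T (R^-1 - I) z), with z_i = Phi^-1(u_i).
    """
    z = np.array([NormalDist().inv_cdf(ui) for ui in u])
    quad = z @ (np.linalg.inv(R) - np.eye(len(u))) @ z
    return float(np.exp(-0.5 * quad) / np.sqrt(np.linalg.det(R)))

def joint_density(x, marginals, R):
    """Joint PDF as product of marginal PDFs times the copula density."""
    u = [m.cdf(xi) for m, xi in zip(marginals, x)]
    pdfs = [m.pdf(xi) for m, xi in zip(marginals, x)]
    return float(np.prod(pdfs) * gaussian_copula_density(u, R))

# Hypothetical correlation between the influences for two contagions.
R = np.array([[1.0, 0.6], [0.6, 1.0]])
marginals = [NormalDist(0.0, 1.0), NormalDist(2.0, 0.5)]

# With an identity correlation matrix the copula density is 1 everywhere,
# so the joint density reduces to the plain product of the marginals.
independent = gaussian_copula_density([0.2, 0.9], np.eye(2))
# At the medians, z = 0 and the density equals |R|^(-1/2).
at_median = gaussian_copula_density([0.5, 0.5], R)
```

The identity-correlation case corresponds to the independence assumption that the text contrasts with: the copula term is what carries the dependence between tasks.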
Gaussian copula: There are a finite number of well-defined copula families that can characterize several dependence structures. Although one could investigate the choice of an appropriate copula, we consider the Gaussian copula for its favorable analytical properties. A Gaussian copula can be constructed from the multivariate Gaussian CDF, and the resulting prior on the influence matrix is given by a multivariate Gaussian distribution as
where is the row covariance matrix modeling the correlation between the influence of different nodes, is the column covariance matrix modeling the correlation between the influence for different contagions, and is the mean matrix of . The two covariances can be computed as and respectively. We assume that individual nodes are spreading the contagions and influencing others independently, and thus the row covariance matrix is diagonal and can be expressed as where are scalars. The posterior distribution for , which is proportional to the product of the prior in Eq. 4 and the likelihood function in Eq. 8, is given as
where , , is the corresponding covariance matrix of . We assume and also an identical value of . We employ maximum a posteriori (MAP) and maximum likelihood estimation (MLE), and obtain , , and by
However, if we assume to be non-sparse, the solution to will not be defined (when ) or will overfit (when is of the same order as ) (Rai et al., 2012). In fact, some contagions in the network can be uncorrelated, which makes the corresponding entry values in zero. Hence, we add a penalty to promote sparsity of matrix to obtain
3.2 Modeling Structure of Influence Matrix
In order to better characterize the influence matrix, we propose to impose a low rank structure on the influence matrix . The nodes or the contagions in the influence network are known to form communities (or clustering structures), which may be captured using the low-rank property of the influence matrix. Note that, the sparse structure in the influence matrix implies that most individuals only influence a small fraction of contagions in the network while there can be a few nodes with wide-spread influence. We incorporate this into our formulation by using a sparsity promoting regularizer over .
where denotes the nuclear norm, and , , , and are the regularization parameters. With the estimated , one can predict the total volume of the contagion at by .
We adopt an alternating optimization approach to solve the problem in Eq. 10.
Optimization w.r.t. : Given and , the mean matrix can be obtained by solving the following problem
The estimate can be analytically obtained as .
Optimization w.r.t. : Given and , the contagion inverse covariance matrix can be estimated by solving the following optimization problem
The above is an instance of the standard inverse covariance estimation problem with sample covariance , which can be solved using standard tools. In particular, we use the graphical Lasso procedure in (Friedman et al., 2008)
Optimization w.r.t. : The corresponding optimization problem becomes
We rewrite the problem as
where . This formulation involves a sum of a convex differentiable loss and convex non-differentiable regularizers, which renders the problem non-trivial. A string of algorithms has been developed for the case where the optimal solution is easy to compute when each regularizer is considered in isolation. This corresponds to the case where the proximal operator, defined for a convex regularizer $R$ at a point $Z$ by $\operatorname{prox}_R(Z) = \arg\min_X \frac{1}{2}\|X - Z\|_F^2 + R(X)$, is easy to compute for each regularizer taken separately. See (Combettes and Pesquet, 2011) for a broad overview of proximal methods. The proximal operator for the nuclear norm is given by the shrinkage operation as follows (Beck and Teboulle, 2009). If $Z = U \operatorname{diag}(\sigma_1,\dots,\sigma_n) V^T$ is the singular value decomposition of $Z$, then $\operatorname{prox}_{\tau\|\cdot\|_*}(Z) = U \operatorname{diag}\big((\sigma_i - \tau)_+\big)_i V^T$. The proximal operator of the indicator function is simply the projection onto the non-negative orthant, which we denote by $P_+$. Next, we mention a matching serial algorithm introduced in (Bertsekas, 2011). We present here a version where updates are performed according to a cyclic order (Richard et al., 2012). Note that one can also randomly select the order of the updates. We use Algorithm 1 to solve the optimization problem in Eq. 12.
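The proximal operators and the cyclic update scheme described above can be sketched as follows. This is a minimal illustration in the spirit of the incremental proximal descent of Richard et al. (2012), not a reproduction of the paper's Algorithm 1. The toy loss, the step size, the regularization weights `tau` and `gamma`, and all function names are assumptions made for the example.

```python
import numpy as np

def prox_nuclear(Z, tau):
    """Singular value shrinkage: proximal operator of tau * nuclear norm."""
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def prox_l1(Z, tau):
    """Entrywise soft-thresholding: proximal operator of tau * l1 norm."""
    return np.sign(Z) * np.maximum(np.abs(Z) - tau, 0.0)

def project_nonneg(Z):
    """Projection onto the non-negative orthant."""
    return np.maximum(Z, 0.0)

def incremental_proximal_descent(grad, S0, step, tau, gamma, n_iter=200):
    """Cyclic scheme: gradient step on the smooth loss, then each prox in turn."""
    S = S0.copy()
    for _ in range(n_iter):
        S = S - step * grad(S)
        S = prox_nuclear(S, step * tau)
        S = prox_l1(S, step * gamma)
        S = project_nonneg(S)
    return S

# Toy problem: loss(S) = 0.5 * ||S - A||_F^2 with a rank-one, non-negative A.
A = np.outer([1.0, 2.0, 3.0], [1.0, 1.0, 1.0])
S_hat = incremental_proximal_descent(lambda S: S - A, np.zeros_like(A),
                                     step=0.5, tau=0.1, gamma=0.01)
```

Because the projection is applied last in each cycle, the returned iterate is always feasible. A randomized order of the three proximal steps would be a small change to the loop body.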
We compare the performance of the proposed approach to MSLIM by applying it to both synthetic and real datasets. Since the volume of a contagion over time can be viewed as a time series, we set up this problem as a time series prediction task and evaluate the performance using the prediction mean-squared error (MSE). Furthermore, for the synthetic data set, where we have access to the true influence matrix , we also evaluate the performance of the influence matrix prediction task using the metric . We determined the regularization parameters for the proposed model using cross validation. In particular, we split the first of the time instances as the training set and the rest for validation. Following (Wang et al., 2013), we combine the training and validation sets to re-train the model with the best selected regularization parameters and estimate the influence matrix.
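The evaluation protocol above (chronological split, selection of regularization parameters on the validation part, re-training on the combined data) can be sketched as below. The ridge regression model is only a stand-in so that the example runs end to end; it is not the proposed model. The split fraction of 0.8 and the candidate grid are likewise hypothetical.

```python
import numpy as np

def chronological_split(n_steps, train_fraction):
    """Index arrays for the first part (training) and the rest (validation)."""
    cut = int(n_steps * train_fraction)
    return np.arange(cut), np.arange(cut, n_steps)

def mse(v_true, v_pred):
    """Mean-squared error between observed and predicted volumes."""
    v_true = np.asarray(v_true, dtype=float)
    v_pred = np.asarray(v_pred, dtype=float)
    return float(np.mean((v_true - v_pred) ** 2))

def ridge_fit(M, V, lam):
    """Stand-in model (NOT the proposed one): ridge regression in closed form."""
    return np.linalg.solve(M.T @ M + lam * np.eye(M.shape[1]), M.T @ V)

# Toy volume series generated from a random linear model.
rng = np.random.default_rng(0)
M = rng.integers(0, 2, size=(50, 6)).astype(float)
V = M @ rng.random(6) + 0.1 * rng.standard_normal(50)

# 0.8 is a hypothetical training fraction used only for this sketch.
train, val = chronological_split(len(V), train_fraction=0.8)
scores = {lam: mse(V[val], M[val] @ ridge_fit(M[train], V[train], lam))
          for lam in (0.01, 0.1, 1.0, 10.0)}
best_lam = min(scores, key=scores.get)

# Re-train on training + validation data with the selected parameter.
w_final = ridge_fit(M, V, best_lam)
```

Splitting by time rather than at random keeps the validation data strictly after the training data, which matches the forecasting nature of the volume prediction task.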
5.1 Synthetic Data
We created a synthetic dataset with the number of nodes fixed at and the number of contagions at . In addition, we assumed that and . A rank (low-rank) influence matrix was generated randomly with uniformly distributed entries. The matrix was generated with uniformly distributed random integers . Following our model assumption, the volume for each was calculated as follows where is a multivariate normal distribution with covariance matrix . In Table 1, we present the results obtained using the proposed approach and its comparison to MSLIM. As can be observed, the proposed approach attains a lower error than MSLIM on both tasks: the volume prediction MSE drops from 0.834 to 0.007, and the influence matrix estimation error from 0.7681 to 0.62.
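Table 1: Comparison of MSLIM and the proposed approach on the synthetic data.
|Metric||MSLIM||Proposed|
|Volume Prediction MSE||0.834||0.007|
|Influence Matrix Estimation Error||0.7681||0.62|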
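The synthetic data generation described in this subsection can be sketched as follows. All sizes, the rank, and the noise level are hypothetical placeholders rather than the settings used in the experiment. Forming the low-rank matrix as a product of two uniform factors and using isotropic noise are assumptions of this sketch as well.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: nodes, contagions, influence length, time steps, rank.
N, K, L, T, rank = 20, 6, 5, 100, 2

# Low-rank, non-negative influence matrix: product of two uniform factors.
# Column k is the stacked influence vector for contagion k.
I_true = rng.random((N * L, rank)) @ rng.random((rank, K))

# One lagged-indicator matrix per contagion, filled with random integers in {0, 1}.
M = rng.integers(0, 2, size=(K, T, N * L)).astype(float)

# Volumes follow the linear model V_k = M_k I_k + noise (isotropic Gaussian here).
sigma = 0.1
V = np.stack([M[k] @ I_true[:, k] + sigma * rng.standard_normal(T)
              for k in range(K)])
```

Because the ground-truth matrix `I_true` is known, an estimate returned by any fitting procedure can be compared against it directly, which is what the influence matrix estimation error in Table 1 measures.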
5.2 ISIS Twitter Data
In this section, we demonstrate the application of the proposed approach to a real-world analysis task. We begin by describing the Twitter dataset used for analysis and the procedure adopted to extract the set of contagions. Following this, we discuss the problem setup and present comparisons to MSLIM on predicting the time-varying tweet volume. Finally, we present a qualitative analysis of the inferred influence structure for different contagions.
The ISIS dataset from Kaggle consists of tweets from users posted between January 2015 and May 2016. In addition to the actual tweets, meta-information such as the user name and the timestamp for each tweet is included. We performed standard pre-processing by removing stop words and other uninformative tokens, e.g., URLs and symbols. After pre-processing, we converted each tweet into a bag-of-words representation and extracted the term frequency-inverse document frequency (tf-idf) feature.
Topic Modeling: When applying our approach, the first step is to define semantically meaningful contagions. A simple way of defining topics is to directly use words as topics (e.g., ISIS). However, a single word may not be rich enough to represent a broad topic (e.g., social network sites). Hence, we propose to perform topic modeling on the tweets based on the tf-idf features. In our experiment, we obtained the topics using Non-negative Matrix Factorization (NMF), which is a popular scheme for topic discovery, with the number of topics set at . Table 2 lists the top words for each of the topics learned using NMF.
Table 2: Top words for each of the topics learned using NMF.
|Topic||Top words|
|Topic 1||isis ramiallolah iraq attack libya warreporter1 saa aamaq usa abu|
|Topic 2||killed soldiers today airstrikes injured wounded civilians militants iraqi attack|
|Topic 3||syria russia ramiallolah turkey ypg breakingnews usa group saa terror|
|Topic 4||state islamic fighters fighting group saudi new http wilaya control|
|Topic 5||aleppo nid gazaui rebels north today northern syrian ypg turkish|
|Topic 6||assad regime myra forces rebels fsa pro islam syrian jaysh|
|Topic 7||al qaeda nusra abu sham ahrar islam jabhat http warreporter1|
|Topic 8||army iraq near ramiallolah iraqi lujah turkey ramadi west sinai|
|Topic 9||allah people muslims abu accept muslim make know don islam|
|Topic 10||breaking islamicstate forces amaqagency city fighters iraqi near area syrian|
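The topic extraction step described above (tf-idf features followed by NMF) can be sketched with a small self-contained example. The tf-idf weighting variant, the multiplicative-update NMF solver, and the four made-up toy tweets are assumptions chosen to keep the example short. They are not the exact pre-processing or solver used for Table 2.

```python
import numpy as np

def tfidf(docs):
    """Build a simple tf-idf matrix (documents x vocabulary) from tokenized text."""
    vocab = sorted({w for d in docs for w in d.split()})
    index = {w: j for j, w in enumerate(vocab)}
    tf = np.zeros((len(docs), len(vocab)))
    for i, d in enumerate(docs):
        for w in d.split():
            tf[i, index[w]] += 1.0
    df = (tf > 0).sum(axis=0)
    idf = np.log(len(docs) / df) + 1.0  # one common idf variant
    return tf * idf, vocab

def nmf(X, n_topics, n_iter=200, seed=0, eps=1e-9):
    """Multiplicative updates (Lee and Seung) for X ~ W @ H with W, H >= 0."""
    rng = np.random.default_rng(seed)
    W = rng.random((X.shape[0], n_topics))
    H = rng.random((n_topics, X.shape[1]))
    for _ in range(n_iter):
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Made-up toy tweets, already lower-cased and stripped of stop words.
docs = ["army iraq ramadi", "iraq army near ramadi",
        "aleppo rebels syrian", "syrian rebels north aleppo"]
X, vocab = tfidf(docs)
W, H = nmf(X, n_topics=2)

# Top words per topic (rows of H) and dominant topic per tweet (rows of W).
top_words = [[vocab[j] for j in np.argsort(H[t])[::-1][:3]] for t in range(2)]
doc_topic = W.argmax(axis=1)
```

The dominant topic of each tweet is what turns a tweet into an infection event for a contagion in the diffusion model.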
Volume Time Series Prediction: In our experiment, we set one day as the discrete time step for aggregating the tweet volume. The parameter denotes the number of time steps it takes for the influence of a user to decay to zero. We set the parameter equal to since we observed that beyond , there is hardly any improvement in performance. The MSE on the predicted volume is computed over the entire period of observation. The comparison of the prediction MSE is presented in Table 3. It can be seen that the proposed approach significantly outperforms MSLIM in predicting the time-varying volume.
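The daily aggregation of tweet volume mentioned above can be sketched as a simple counting step. The record layout (a timestamp string paired with a topic id), the timestamp format, and the sample records are assumptions about the raw data made only for this illustration.

```python
from collections import Counter
from datetime import datetime

def daily_volume(tweets):
    """Count tweets per (topic, day); each tweet is a (timestamp, topic id) pair."""
    counts = Counter()
    for stamp, topic in tweets:
        day = datetime.strptime(stamp, "%Y-%m-%d %H:%M").date()
        counts[(topic, day)] += 1
    return counts

# Made-up records; the timestamp format is an assumption about the raw data.
tweets = [("2015-01-06 21:07", 0), ("2015-01-06 23:15", 0),
          ("2015-01-07 02:40", 0), ("2015-01-07 09:00", 3)]
volume = daily_volume(tweets)
```

Each `(topic, day)` count is one entry of the volume time series that the model is asked to predict.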
Influential Node Detection: For a contagion , we identify the most influential nodes with respect to this contagion as nodes having high values. First, in Figure 2, we plot the correlation among topics learned by NMF. More specifically, we plot the pair-wise correlation structure learned by our approach. It can be seen that a strong positive correlation structure exists, which enabled the improved prediction in Table 3. Following this, we use the predicted influence matrix to select a set of highly influential nodes from the dataset. A simple approach to selecting influential users is to pick the ones with a large number of tweets. However, we argue that the influence predicted by an information diffusion model can be vastly different. Consequently, we consider a user to be influential if she has a high influence score for at least one of the topics, or if she is influential across multiple topics. For example, in Figure 1(a), we plot the average influence scores of the users (averaged over all the topics) against the total number of tweets. Similarly, in Figure 1(b), we plot the influence scores of the users (maximum over all the topics) against the total number of tweets. The first striking observation is that the users with high influence scores are not necessarily the ones with the largest number of tweets. Instead, their impact on the information diffusion relies heavily on the complex dynamics of the implicit network.
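Table 3: Volume prediction MSE of MSLIM and the proposed approach on the ISIS Twitter data.
|Metric||MSLIM||Proposed|
|Volume Prediction MSE||2.7||0.329|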
Finally, in Figure 2 we plot the percentage of tweets regarding each of the topics for the top influential nodes. Influential nodes are obtained as a union of nodes identified based on both average and maximum influence scores. More specifically, we select the union of users with average influence score greater than and maximum influence score greater than . In addition to displaying the distribution across topics, for each influential user, we show the total number of tweets posted by that user. It can be seen that the total number of tweets varies widely across these users and is therefore not a good indicator of their influence.
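The selection rule described above (the union of users picked by average score and users picked by maximum score) can be sketched as follows. The score matrix layout, the threshold values, and the three example users are hypothetical. How a per-topic influence score is derived from the estimated influence vectors is also left as an assumption here.

```python
import numpy as np

def select_influential(scores, avg_threshold, max_threshold):
    """Union of users whose average or maximum influence score exceeds a threshold.

    scores[u, k] is the influence score of user u for topic k (hypothetical layout).
    """
    by_avg = scores.mean(axis=1) > avg_threshold
    by_max = scores.max(axis=1) > max_threshold
    return np.flatnonzero(by_avg | by_max)

scores = np.array([[0.1, 0.1, 0.1],   # low on every topic
                   [0.5, 0.6, 0.4],   # moderately influential on all topics
                   [0.0, 0.0, 1.2]])  # influential on a single topic only
selected = select_influential(scores, avg_threshold=0.45, max_threshold=1.0)
```

In this toy case the second user is selected through the average criterion and the third through the maximum criterion, which is why taking the union matters.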
In this paper, we considered the problem of influential node detection and volume time series prediction. We proposed a descriptive diffusion model that takes dependencies among the topics into account. We also proposed an efficient algorithm based on alternating methods to perform inference and learning on the model. The proposed technique was shown to outperform MSLIM, a state-of-the-art multi-task variant of LIM, on both a synthetic and a real (ISIS) dataset. We showed that the proposed approach can efficiently select the most influential users for specific contagions. We also presented several interesting patterns of the selected influential users for the ISIS dataset.
- Beck and Teboulle (2009) Amir Beck and Marc Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM journal on imaging sciences, 2(1):183–202, 2009.
- Bertsekas (2011) Dimitri P Bertsekas. Incremental gradient, subgradient, and proximal methods for convex optimization: A survey. Optimization for Machine Learning, 2010:1–38, 2011.
- Combettes and Pesquet (2011) Patrick L Combettes and Jean-Christophe Pesquet. Proximal splitting methods in signal processing. In Fixed-point algorithms for inverse problems in science and engineering, pages 185–212. Springer, 2011.
- Du et al. (2013) Nan Du, Le Song, Manuel Gomez-Rodriguez, and Hongyuan Zha. Scalable influence estimation in continuous-time diffusion networks. In Advances in neural information processing systems, pages 3147–3155, 2013.
- Friedman et al. (2008) Jerome Friedman, Trevor Hastie, and Robert Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432–441, 2008.
- Guille and Hacid (2012) Adrien Guille and Hakim Hacid. A predictive model for the temporal dynamics of information diffusion in online social networks. In Proceedings of the 21st international conference on World Wide Web, pages 1145–1152. ACM, 2012.
- Guille et al. (2013) Adrien Guille, Hakim Hacid, Cecile Favre, and Djamel A Zighed. Information diffusion in online social networks: A survey. ACM SIGMOD Record, 42(2):17–28, 2013.
- López-Pintado (2008) Dunia López-Pintado. Diffusion in complex social networks. Games and Economic Behavior, 62(2):573–590, 2008.
- Rai et al. (2012) Piyush Rai, Abhishek Kumar, and Hal Daume. Simultaneously leveraging output and task structures for multiple-output regression. In Advances in Neural Information Processing Systems (NIPS), pages 3185–3193, 2012.
- Richard et al. (2012) Emile Richard, Pierre-andre Savalle, and Nicolas Vayatis. Estimation of simultaneously sparse and low rank matrices. In Proceedings of the 29th International Conference on Machine Learning (ICML), pages 1351–1358, 2012.
- Wang et al. (2013) Yingze Wang, Guang Xiang, and Shi-Kuo Chang. Sparse multi-task learning for detecting influential nodes in an implicit diffusion network. In AAAI, 2013.
- Yang and Leskovec (2010) Jaewon Yang and Jure Leskovec. Modeling information diffusion in implicit networks. In 2010 IEEE International Conference on Data Mining, pages 599–608. IEEE, 2010.
- Zhang et al. (2016) Peng Zhang, Jing He, Guodong Long, Guangyan Huang, and Chengqi Zhang. Towards anomalous diffusion sources detection in a large network. ACM Transactions on Internet Technology (TOIT), 16(1):2, 2016.