Dynamic Network Model from Partial Observations
Can evolving networks be inferred and modeled without directly observing their nodes and edges? In many applications, the edges of a dynamic network might not be observed, but one can observe the dynamics of stochastic cascading processes (e.g., information diffusion, virus propagation) occurring over the unobserved network. While there have been efforts to infer networks based on such data, providing a generative probabilistic model that is able to identify the underlying time-varying network remains an open question. Here we consider the problem of inferring generative dynamic network models based on network cascade diffusion data. We propose a novel framework for providing a non-parametric dynamic network model, based on a mixture of coupled hierarchical Dirichlet processes, from data capturing cascade node infection times. Our approach allows us to infer the evolving community structure in networks and to obtain an explicit predictive distribution over the edges of the underlying network, including those that were not involved in the transmission of any cascade or are likely to appear in the future. We show the effectiveness of our approach using extensive experiments on synthetic as well as real-world networks.
Networks of interconnected entities are widely used to model pairwise relations between objects in many important problems in sociology, finance, computer science, and operations research [1,2,3]. Often, these networks are dynamic, with nodes or edges appearing or disappearing and the underlying network structure evolving over time. As a result, there is a growing interest in developing dynamic network models that allow for the study of evolving networks.
Non-parametric models are especially useful when there is no prior knowledge or assumption about the shape or size of the network, as they can automatically address the model selection problem. Non-parametric Bayesian approaches mostly rely on the assumption of vertex exchangeability, in which the distribution of a graph is invariant to the order of its vertices [4,5]. Vertex-exchangeable models, such as the stochastic block model and its variants, explain the data by means of an underlying latent clustering structure. However, such models yield dense graphs and are less appropriate for predicting unseen interactions. Recently, an alternative notion of edge exchangeability was introduced for graphs, in which the distribution of a graph is invariant to the order of its edges [7,8]. Edge-exchangeable models can exhibit the sparsity and small-world behavior of real-world networks. Such models allow both the latent dimensionality of the model and the number of nodes to grow over time, and are suitable for predicting future interactions.
Existing models, however, aim to model a fully observed network [4,5,7,8], whereas in many real-world problems the underlying network structure is not known. What is often known are partial observations of a stochastic cascading process that is spreading over the network. A cascade is created by a contagion (e.g., a social media post, a virus) that starts at some node of the network and then spreads like an epidemic from node to node over the edges of the underlying (unobserved) network. The observations are often in the form of the times when different nodes get infected by different contagions. A fundamental problem, therefore, is to infer the underlying network structure from these partial observations. In recent years, there has been a body of research on inferring diffusion networks from node infection times. However, these efforts mostly rely on a fixed cascade transmission model, which describes how nodes spread contagions, to infer the set of most likely edges [2,9,10,11]. More recently, there have been attempts to predict the transmission probabilities from infection times, either by learning node representations or by learning diffusion representations using the underlying network structure [12,13,14]. However, it remains an open problem to provide a generative probabilistic model for the underlying network from partial observations.
Here we propose a novel online dynamic network inference framework, Dyference, for providing non-parametric edge-exchangeable network models from partial observations. We build upon the non-parametric network model of Williamson [7], which assumes that the network clusters into groups and then places a mixture of Dirichlet processes over the outgoing and incoming edges in each cluster while coupling the network using a shared discrete base measure. Given a set of cascades spreading over the network, we process observations in time intervals. For each time interval, we first find a probability distribution over the cascade diffusion trees that may have been involved in each cascade. We then sample a set of edges from this distribution and provide the samples to a Gibbs sampler to update the model variables. In the next iteration, we use the updated edge probabilities provided by the model to update the probability distributions over edges supported by each cascade. We continue this iterative process until the model no longer changes considerably. Extensive experiments on synthetic and real-world networks show that Dyference is able to track changes in the structure of dynamic networks and provides accurate online estimates of the time-varying edge probabilities for different network topologies. We also apply Dyference to diffusion prediction and to predicting the most influential nodes in the Twitter and MemeTracker datasets.
2 Related Work
There is a body of work on inferring diffusion networks from partial observations. NetInf and MultiTree formulate the problem as submodular optimization. NetRate and ConNIe further infer the transmission rates using convex optimization. InfoPath considers inferring varying transmission rates in an online manner using stochastic convex optimization. The above methods assume that diffusion rates are derived from a predefined parametric probability distribution. In contrast, we do not make any assumption on the transmission model. EmbeddedIC embeds nodes in a latent space based on the Independent Cascade model, and infers diffusion probabilities based on the relative node positions in the latent space. DeepCas and TopoLSTM use the network structure to learn diffusion representations and predict diffusion probabilities. Our work differs in nature from the existing methods in that we aim at providing a generative probabilistic model for the underlying dynamic network from diffusion data.
There has also been a growing interest in developing probabilistic network models that can capture network evolution from full observations. Bayesian generative models such as the stochastic block model, the mixed membership stochastic block model, the infinite relational model, the latent space model, the latent feature relational model, the infinite latent attribute model, and the random function model are among the vertex-exchangeable examples. A limitation of the vertex-exchangeable models is that they generate dense or empty networks with probability one [24,25]. This is in contrast with the sparse nature of many real-world networks. Recently, edge-exchangeable models have been proposed and shown to exhibit sparsity [7,8,26]. However, these models assume that networks are fully observed. In contrast, our work considers the setting where the network is unobserved and what we observe are the node infection times of a stochastic cascading process spreading over the network.
In this section, we first formally define the problem of dynamic network inference from partial observations. We then review the non-parametric edge-exchangeable network model of Williamson [7] that we build upon in the rest of this paper. Finally, we give a brief overview of Bayesian inference for inferring latent model variables.
3.1 Dynamic Network Inference Problem
Consider a hidden directed dynamic network where nodes and edges may appear or disappear over time. At each time step $t$, the network $G^t = (V^t, E^t)$ consists of a set of vertices $V^t$ and a set of edges $E^t$. A set $C$ of cascades spreads over the edges of the network from infected to non-infected nodes. For each cascade $c \in C$, we observe a vector $\mathbf{t}^c$ recording the times $t^c_u$ when each node $u$ got infected by the cascade $c$. If node $u$ is not infected by the cascade $c$, we set $t^c_u = \infty$. For each cascade, we only observe the time $t^c_u$ when node $u$ got infected, but not which node and which edge infected node $u$. Our goal is to infer a model that captures the latent structure of the dynamic network over which the cascades propagated, using these partial observations. Such a model, in particular, can provide us with the probabilities of all the potential edges between pairs of nodes.
3.2 Non-parametric Edge-exchangeable Network Model
We adopt the Bayesian non-parametric model of Williamson [7], which combines structure elucidation with predictive performance. Here, the network is modeled as an exchangeable sequence of directed edges, and can grow over time. More specifically, each community in the network is modeled by a mixture of Dirichlet network distributions (MDND).
The model can be described as:
The edges of the network are modeled by a Dirichlet distribution . Here, is the measurable space, denotes an indicator function centered on , and is the corresponding probability of an edge to exist at , with . The concentration parameter controls the number of edges in the network, with larger values resulting in more edges. The size and number of communities are modeled by a stick-breaking distribution with concentration parameter . For every community , two Dirichlet distributions model the outlinks and inlinks in community . To ensure that outlinks and inlinks are defined on the same set of locations , the distributions are coupled using the shared, discrete base measure .
To generate an edge , we first select a cluster according to . We then select a pair of nodes and according to the cluster-specific distributions . The concentration parameter controls the overlap between clusters, with smaller values resulting in smaller overlaps. Finally, is the integer-valued weight of edge .
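The generative process described above can be illustrated with a small simulation. The following is only a sketch under stated assumptions: it replaces the infinite model with a finite truncation (`max_nodes` locations, `max_clusters` communities), and the parameter names `gamma`, `alpha`, and `tau` are labels chosen here for the three concentration parameters (base measure, community sizes, and cluster overlap); they are not guaranteed to match the notation of Eq. 1.

```python
import numpy as np

def sample_edges(num_edges, gamma=5.0, alpha=1.0, tau=5.0,
                 max_nodes=30, max_clusters=10, seed=0):
    """Draw weighted directed edges from a finite truncation of the coupled model."""
    rng = np.random.default_rng(seed)

    def stick_breaking(concentration, size):
        # Truncated stick-breaking weights, renormalised to sum to one.
        fractions = rng.beta(1.0, concentration, size=size)
        remaining = np.concatenate(([1.0], np.cumprod(1.0 - fractions[:-1])))
        weights = fractions * remaining
        return weights / weights.sum()

    h = stick_breaking(gamma, max_nodes)        # shared discrete base measure over nodes
    beta = stick_breaking(alpha, max_clusters)  # community sizes

    # Cluster-specific outlink and inlink distributions, coupled through h.
    # On a finite support, DP(tau, h) reduces to Dirichlet(tau * h).
    out_dist = rng.dirichlet(tau * h, size=max_clusters)
    in_dist = rng.dirichlet(tau * h, size=max_clusters)

    weights = {}
    for _ in range(num_edges):
        k = rng.choice(max_clusters, p=beta)          # pick a cluster
        u = int(rng.choice(max_nodes, p=out_dist[k]))  # pick the source node
        v = int(rng.choice(max_nodes, p=in_dist[k]))   # pick the destination node
        weights[(u, v)] = weights.get((u, v), 0) + 1   # integer-valued edge weight
    return weights

edge_weights = sample_edges(200)
```

Because both endpoint distributions of every cluster are drawn around the same base measure `h`, outlinks and inlinks share one set of node locations, which is the coupling the text describes.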
3.3 Bayesian Inference
Having specified the model in terms of the joint distribution in Eq. 1, we can infer the latent model variables for a fully observed network using Bayesian inference. In the full observation setting, where we can observe all the edges in the network, the posterior distribution of the latent variables conditioned on a set of observed edges can be updated using Bayes' rule:
Here, is the infinite-dimensional parameter vector of the model specified in Eq. 1. The denominator in the above equation is difficult to handle as it involves summation over all possible parameter values. Consequently, we need to resort to approximate inference. In Section 4, we show how we extract our set of observations from diffusion data and construct a collapsed Gibbs sampler to update the posterior distributions of latent variables.
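In generic notation, and purely as an assumed illustration (the symbols $\Theta$ for the parameter vector and $Z$ for the set of observed edges are placeholders chosen here, not necessarily the notation of Eq. 1), the posterior update and its intractable normalizer take the familiar form:

```latex
p(\Theta \mid Z)
  \;=\;
  \frac{p(Z \mid \Theta)\, p(\Theta)}
       {\int p(Z \mid \Theta')\, p(\Theta')\, \mathrm{d}\Theta'}
```

The integral (or sum) in the denominator ranges over every possible parameter value, which is why a sampling-based approximation is used instead of evaluating it directly.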
4 Dyference: Dynamic Network Inference from Partial Observations
In this section, we describe our algorithm, Dyference, for inferring the latent structure of the underlying dynamic network from diffusion data. Dyference works based on the following iterative idea: in each iteration we (1) find a probability distribution over all the edges that could be involved in each cascade, and (2) sample a set of edges from the probability distribution associated with each cascade, and provide the sampled edges as observations to a Gibbs sampler to update the posterior distribution of the latent variables of our non-parametric network model. We start by explaining our method on a static directed network, over which we observe a set of cascades. In Section 4.3, we then show how we can generalize our method to dynamic networks.
4.1 Extracting Observations from Diffusion Data
The set of edges that could have been involved in transmission of a cascade is the set of all edges for which is infected before , i.e., . Similarly, is the set of all infected nodes in cascade . To find the probability distribution over all the edges in , we first assume that every infected node in cascade gets infected through one of its neighbors, and therefore each cascade propagates as a directed tree. For a cascade , each possible way in which the cascade could spread over the underlying network creates a tree. To calculate the probability of a cascade to spread as a tree , we use the following Gibbs measure .
where is the temperature parameter. The normalizing constant is the partition function that ensures that the distribution is normalized, and is the weight of edge . The most probable tree for cascade is a MAP configuration for the above distribution, and the distribution will concentrate on the MAP configurations as .
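The candidate edge set defined at the start of this subsection is simple to compute from infection times alone. The sketch below assumes a cascade is stored as a dictionary from node to infection time, with `math.inf` marking nodes that were never infected; this representation and the function name are illustrative choices, not part of the original method description.

```python
import math

def candidate_edges(infection_times):
    """Return all ordered pairs (u, v) where u was infected strictly before v."""
    # Only infected nodes can take part in the transmission of the cascade.
    infected = {u: t for u, t in infection_times.items() if math.isfinite(t)}
    return {
        (u, v)
        for u, t_u in infected.items()
        for v, t_v in infected.items()
        if t_u < t_v
    }

cascade = {"a": 0.0, "b": 1.5, "c": 2.0, "d": math.inf}
edges = candidate_edges(cascade)
```

Every directed tree that could explain the cascade is a spanning tree of this candidate graph rooted at the earliest infected node, which is what makes the spanning-tree machinery that follows applicable.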
To calculate the probability distribution over the edges in , we use the result that the probability distribution over subsets of edges associated with all the spanning trees in a graph is a Determinantal Point Process (DPP), where the probability of every subset can be calculated as:
Here, is the restriction of the DPP kernel to the entries indexed by elements of . For constructing the kernel matrix , we take the incidence matrix , in which indicates that edge is an outlink/inlink of node , and from which we remove an arbitrary vertex of the graph. We then construct its Laplacian and compute and .
Finally, the marginal probabilities of an edge in can be calculated as:
where is the vector with coordinates equal to zero, except the -th coordinate which is one. All marginal probabilities can be calculated in time , where is the desired relative precision and and .
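To make the kernel construction concrete, the following sketch computes spanning-tree edge marginals for a small weighted graph. It rests on several assumptions that should be read as such: the graph is treated as undirected and connected, a tree's probability is taken to be proportional to the product of its edge weights, and a dense linear solve stands in for the fast approximate solver mentioned in the text. The function name and argument layout are hypothetical.

```python
import numpy as np

def spanning_tree_edge_marginals(num_nodes, edges, weights):
    """Probability that each edge belongs to a random weighted spanning tree."""
    w = np.asarray(weights, dtype=float)

    # Signed incidence matrix (nodes x edges); the orientation is arbitrary.
    B = np.zeros((num_nodes, len(edges)))
    for e, (u, v) in enumerate(edges):
        B[u, e], B[v, e] = 1.0, -1.0

    B = B[:-1, :]              # drop one vertex so the Laplacian is invertible
    L = (B * w) @ B.T          # reduced weighted Laplacian

    # The DPP kernel is K = W^{1/2} B^T L^{-1} B W^{1/2};
    # its diagonal entries are the edge marginals.
    X = np.linalg.solve(L, B)  # L^{-1} B without forming the inverse
    return w * np.einsum("ie,ie->e", B, X)

triangle = [(0, 1), (1, 2), (0, 2)]
marginals = spanning_tree_edge_marginals(3, triangle, [1.0, 1.0, 1.0])
```

A useful sanity check is that the marginals always sum to the number of nodes minus one, since every spanning tree contains exactly that many edges.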
To construct our multiset of observations (in which each edge can appear multiple times), for each we sample a set of edges from the probability distributions of edges in .
Initially, without any prior knowledge about the structure of the underlying network, we initialize for all , and otherwise. However, in the subsequent iterations when we get the updated posterior probabilities from our model, we use .
The pseudocode for extracting observations from diffusion data is shown in Algorithm 1.
4.2 Updating Latent Model Variables
To update the posterior distribution of the latent model variables conditioned on the extracted observations, we construct a collapsed Gibbs sampler by sweeping through each variable to sample from its conditional distribution with the remaining variables fixed to their current values.
Sampling cluster assignments .
Following , we model the posterior probability for an edge to belong to cluster as a function of the importance of the cluster in the network, and the importance of as a source and as a destination in cluster , as well as the importance of in the network. To this end, we measure the importance of a cluster by the total number of its edges, i.e., . Similarly, the importance of as a source, and the importance of as a destination in cluster is measured by the number of outlinks of associated with cluster , i.e., , as well as inlinks of associated with cluster , i.e., . Finally, the importance of node in the network is determined by the probability mass of its outlinks and inlinks , i.e., . The distribution over the cluster assignment of an edge , given the end nodes , the cluster assignments for all other edges, and is given by:
where is used to exclude the variables associated with the current edge being observed. As discussed in Section 3.2, , , and control the number of clusters, cluster overlaps, and the number of nodes in the network. Moreover, are the numbers of nodes and edges in the network.
Sampling edge probabilities .
Due to the edge-exchangeability, we can treat as the last variable being sampled. The conditional posterior for given the rest of the variables can be calculated as:
where is the probability mass for all the edges that may appear in the network in the future, and is the number of clusters. We observe that an edge may appear between existing nodes in the network, or because one or two nodes have appeared in the network. Note that the predictive distribution for a new link to appear in the network can be calculated similarly using Eq. 8.
Sampling outlink and inlink probabilities .
The probability mass on the outlinks and inlinks of node associated with cluster are modeled by variables and . The posterior distribution of (similarly ), can be calculated using:
where $s(n, m)$ are unsigned Stirling numbers of the first kind, i.e., $s(0,0) = s(1,1) = 1$, $s(n,0) = 0$ for $n > 0$, and $s(n,m) = 0$ for $m > n$. Other entries can be computed as $s(n+1, m) = s(n, m-1) + n\, s(n, m)$. However, for large counts, it is often more efficient to sample by simulating the table assignments of the Chinese restaurant according to Eq. 8.
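The Stirling-number recurrence mentioned above is standard and can be tabulated directly. The following is a minimal sketch; the function name is illustrative, and for large arguments an implementation would normally work in log space because the values grow factorially.

```python
def unsigned_stirling_first_kind(n_max):
    """Table s[n][m] of unsigned Stirling numbers of the first kind, 0 <= m <= n <= n_max."""
    s = [[0] * (n_max + 1) for _ in range(n_max + 1)]
    s[0][0] = 1  # base case; entries with m > n or m == 0 < n stay zero
    for n in range(n_max):
        for m in range(1, n + 2):
            # s(n+1, m) = s(n, m-1) + n * s(n, m)
            s[n + 1][m] = s[n][m - 1] + n * s[n][m]
    return s

table = unsigned_stirling_first_kind(5)
```

Each row of the table sums to the factorial of its index, because `s[n][m]` counts the permutations of `n` items with exactly `m` cycles; this gives a quick correctness check.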
Sampling node probabilities .
Finally, the probability of each node is the sum of the probability masses on its edges and is modeled by a Dirichlet distribution, i.e.,
The pseudocode for inferring the latent network variables from diffusion data is given in Algorithm 2.
4.3 Online Dynamic Network Inference
In order to capture the dynamics of the underlying network and keep the model updated over time, we consider time intervals of length . For the -th interval, we only consider the infection times for all , and update the model conditioned on the observations in the current time interval. There is a trade-off between the speed of tracking changes in the underlying network structure, and the accuracy of our algorithm. For smaller intervals, we incorporate new observations more rapidly, and therefore we are able to track changes faster. On the other hand, for larger intervals we find the probability distribution over a larger set of edges that could have been involved in transmission of a cascade. Obtaining a larger sample from this probability distribution can provide us with more information about the network structure.
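The interval-based processing can be sketched as a simple filter over infection times. The exact interval definition did not survive in the text above, so the half-open window `[(index - 1) * width, index * width)` used here is an assumption, as are the function name and the dictionary-based cascade representation.

```python
def infections_in_window(cascades, index, width):
    """Keep, per cascade, only infection times inside the index-th window (1-based)."""
    start, end = (index - 1) * width, index * width
    windowed = []
    for times in cascades:
        kept = {u: t for u, t in times.items() if start <= t < end}
        if kept:  # skip cascades with no activity in this window
            windowed.append(kept)
    return windowed

cascades = [{"a": 0.2, "b": 1.4}, {"c": 2.5}]
second_day = infections_in_window(cascades, index=2, width=1.0)
```

A smaller `width` feeds new observations to the model sooner, while a larger one leaves more infected nodes, and therefore more candidate edges, in each window, which is the trade-off discussed above.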
Note that we do not infer a new model for the network based on infection times in each time interval. Instead, we use new observations to update the latent variables from the previous time interval. Updating the model with observations in the current interval results in a higher probability for the observed edges, and a lower probability for the edges that have not been observed recently. Therefore, we do not need to consider an aging factor to take into account the older cascades. The pseudocode of our dynamic inference method is given in Algorithm 3.
In this section, we address the following questions: (1) What is the predictive performance of Dyference in static and dynamic networks, and how does it compare to the existing network inference algorithms? (2) How does the predictive performance of Dyference change with the number of cascades? (3) How does the running time of Dyference compare to the baselines? and (4) How does Dyference perform on the tasks of predicting diffusion and influential nodes?
Baselines. We compare the performance of Dyference to NetInf, NetRate, TopoLSTM, DeepCas, EmbeddedIC, and InfoPath. InfoPath is the only baseline able to infer dynamic networks, hence on dynamic networks we can only compare the performance of Dyference with InfoPath.
Evaluation Metrics. For performance comparison, we use Precision, Recall, F1 score, and MAP@k. Precision is the fraction of edges in the inferred network that are present in the true network, Recall is the fraction of edges of the true network that are present in the inferred network, and F1 score is 2 × (precision × recall)/(precision + recall). MAP@k is the classical mean average precision measure.
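The edge-level metrics defined above translate directly into set operations. This is a minimal sketch; the function name is illustrative, and returning zero when a denominator is empty is a convention assumed here rather than something stated in the text.

```python
def precision_recall_f1(inferred_edges, true_edges):
    """Edge-level precision, recall, and F1 between two collections of (u, v) pairs."""
    inferred, true = set(inferred_edges), set(true_edges)
    hits = len(inferred & true)  # edges present in both networks
    precision = hits / len(inferred) if inferred else 0.0
    recall = hits / len(true) if true else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1

p, r, f1 = precision_recall_f1(
    inferred_edges=[(1, 2), (2, 3), (3, 4)],
    true_edges=[(1, 2), (2, 3), (4, 5), (5, 6)],
)
```

In this toy case two of the three inferred edges are correct and two of the four true edges are recovered, so precision is 2/3 and recall is 1/2.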
In all the experiments we use a sample size of for all the cascades . We further consider a window of length day in our dynamic network inference experiments.
Experiments on synthetic data. We generated synthetic networks consisting of 1024 nodes and about 2500 edges using the Kronecker graph model: a core-periphery network (CP) (Kronecker parameters [0.9,0.5;0.5,0.3]) and a hierarchical community network (HC) (parameters [0.9,0.1;0.1,0.9]), as well as the Forest Fire model with forward and backward burning probabilities 0.2 and 0.17. For dynamic networks, we assign a pattern to each edge uniformly at random from a set of five edge evolution patterns: Slab and Hump (to model outlinks of nodes that temporarily become popular), Square and Chainsaw (to model inlinks of nodes that update periodically), and Constant (to model long-term interactions). Transmission rates are generated for each edge according to its chosen evolution pattern for 100 time steps. We then generate 500 cascades per time step (1 day) on the network, each with a random initiator.
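For readers who want to reproduce this kind of data, cascade generation under an exponential transmission model can be sketched as an earliest-arrival computation. This is an assumed, simplified generator rather than the authors' code: rates are held fixed within one cascade, the adjacency format (`node -> [(neighbor, rate), ...]`) and the `horizon` cut-off are hypothetical choices, and the other delay models mentioned in this section (Rayleigh, power-law) are not covered.

```python
import heapq
import math
import random

def simulate_cascade(out_edges, source, horizon, rng):
    """Simulate one cascade with exponential transmission delays on each edge."""
    times = {source: 0.0}       # earliest known infection time per node
    queue = [(0.0, source)]
    finalized = set()
    while queue:
        t, u = heapq.heappop(queue)
        if u in finalized:
            continue            # stale queue entry; u was already processed earlier
        finalized.add(u)
        for v, rate in out_edges.get(u, []):
            if rate <= 0:
                continue        # edge currently inactive
            t_v = t + rng.expovariate(rate)  # sampled transmission delay
            if t_v <= horizon and t_v < times.get(v, math.inf):
                times[v] = t_v
                heapq.heappush(queue, (t_v, v))
    return times

graph = {0: [(1, 1.0), (2, 0.5)], 1: [(2, 1.0)], 2: []}
infection_times = simulate_cascade(graph, source=0, horizon=1e9, rng=random.Random(0))
```

Only the returned infection times would be handed to the inference algorithm; which edge actually transmitted the contagion is discarded, matching the partial-observation setting of Section 3.1.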
Figures (a) and (b) compare the precision, recall, and F1 score of Dyference to those of InfoPath for online dynamic network inference on the CP-Kronecker network with an exponential edge transmission model, and on the HC-Kronecker network with a Rayleigh edge transmission model. It can be seen that Dyference outperforms InfoPath in terms of F1 score as well as precision and recall on different network topologies and under different transmission models. Figures (c), (d), and (e) compare the F1 score of Dyference to that of InfoPath and NetRate for static network inference with a varying number of cascades, over the CP-Kronecker network with Rayleigh and exponential edge transmission models and over the Forest Fire network with a power-law edge transmission model. We observe that Dyference consistently outperforms the baselines in terms of accuracy and is robust to the number of cascades.
Experiments on real data. We applied Dyference to two real-world datasets: (1) Twitter contains the diffusion of URLs on Twitter during 2010 and the follower graph of users. The network consists of 6,126 nodes and 12,045 edges with 5,106 cascades of length 17 on average. (2) Memes contains the diffusion of memes from March 2011 to February 2012 over online news websites. The real diffusion network is constructed from the temporal dynamics of hyperlinks created between news sites. The network consists of 5,000 nodes and 313,669 edges with 54,847 cascades of length 14 on average.
Figures (g), (h), (i), and (j) compare the F1 score of Dyference to that of InfoPath for online dynamic network inference on the time-varying hyperlink network with four different topics from March 2011 to July 2011. As we observe, Dyference outperforms InfoPath in terms of prediction accuracy in all the networks. Figure (f) compares the running time of Dyference to that of InfoPath. We can see that Dyference has a running time comparable to that of InfoPath, while consistently outperforming it in terms of prediction accuracy.
Diffusion Prediction. Table 2 compares MAP@k for Dyference vs. TopoLSTM, DeepCas, and EmbeddedIC. We use the infection times in the first 80% of the total time interval for training, and the remaining 20% for testing. It can be seen that Dyference performs well on the task of diffusion prediction.
Influence Prediction. Table 2 shows the set of influential websites found based on the dynamic Memes network predicted by Dyference vs. InfoPath. The dynamic Memes network for LinkedIn is predicted up to 30-06-2011, and the influential websites are found using the method of [35]. We observe that, using the network predicted by Dyference, we could predict the influential nodes with good accuracy.
We considered the problem of developing generative dynamic network models from partial observations, i.e., diffusion data. We proposed a novel framework, Dyference, for providing a non-parametric edge-exchangeable network model based on a mixture of coupled hierarchical Dirichlet processes. Dyference provides online time-varying estimates of probabilities for all the potential edges in the underlying network, and tracks the evolution of the underlying community structure over time. We showed the effectiveness of our approach using extensive experiments on synthetic as well as real-world networks.
[1] A Namaki, AH Shirazi, R Raei, and GR Jafari. Network analysis of a financial market based on genuine correlation and threshold method. Physica A: Statistical Mechanics and its Applications, 390(21):3835–3841, 2011.
[2] Seth Myers and Jure Leskovec. On the convexity of latent social network inference. In Advances in Neural Information Processing Systems, pages 1741–1749, 2010.
[3] Amr Ahmed and Eric P Xing. Recovering time-varying networks of dependencies in social and biological studies. Proceedings of the National Academy of Sciences, 106(29):11878–11883, 2009.
[4] Paul W Holland, Kathryn Blackmond Laskey, and Samuel Leinhardt. Stochastic blockmodels: First steps. Social Networks, 5(2):109–137, 1983.
[5] Tom AB Snijders and Krzysztof Nowicki. Estimation and prediction for stochastic blockmodels for graphs with latent block structure. Journal of Classification, 14(1):75–100, 1997.
[6] James Lloyd, Peter Orbanz, Zoubin Ghahramani, and Daniel M Roy. Random function priors for exchangeable arrays with applications to graphs and relational data. In Advances in Neural Information Processing Systems, pages 998–1006, 2012.
[7] S Williamson. Nonparametric network models for link prediction. Journal of Machine Learning Research, 17(202):1–21, 2016.
[8] Diana Cai, Trevor Campbell, and Tamara Broderick. Edge-exchangeable graphs and sparsity. In Advances in Neural Information Processing Systems, pages 4249–4257, 2016.
[9] Manuel Gomez-Rodriguez, David Balduzzi, and Bernhard Schölkopf. Uncovering the temporal dynamics of diffusion networks. In Proc. of the 28th Int. Conf. on Machine Learning (ICML'11). Citeseer, 2011.
[10] Manuel Gomez Rodriguez, Jure Leskovec, and Andreas Krause. Inferring networks of diffusion and influence. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1019–1028. ACM, 2010.
[11] Manuel Gomez Rodriguez and Bernhard Schölkopf. Submodular inference of diffusion networks from multiple trees. arXiv preprint arXiv:1205.1671, 2012.
[12] Simon Bourigault, Sylvain Lamprier, and Patrick Gallinari. Representation learning for information diffusion through social networks: an embedded cascade model. In Proceedings of the Ninth ACM International Conference on Web Search and Data Mining, pages 573–582. ACM, 2016.
[13] Jia Wang, Vincent W Zheng, Zemin Liu, and Kevin Chen-Chuan Chang. Topological recurrent neural network for diffusion prediction. arXiv preprint arXiv:1711.10162, 2017.
[14] Cheng Li, Jiaqi Ma, Xiaoxiao Guo, and Qiaozhu Mei. DeepCas: An end-to-end predictor of information cascades. In Proceedings of the 26th International Conference on World Wide Web, pages 577–586. International World Wide Web Conferences Steering Committee, 2017.
[15] Manuel Gomez-Rodriguez, Jure Leskovec, and Andreas Krause. Inferring networks of diffusion and influence. ACM Transactions on Knowledge Discovery from Data (TKDD), 5(4):21, 2012.
[16] Manuel Gomez-Rodriguez and Bernhard Schölkopf. Submodular inference of diffusion networks from multiple trees. In Proceedings of the 29th International Conference on Machine Learning, pages 1587–1594. Omnipress, 2012.
[17] Manuel Gomez Rodriguez, David Balduzzi, and Bernhard Schölkopf. Uncovering the temporal dynamics of diffusion networks. arXiv preprint arXiv:1105.0697, 2011.
[18] Manuel Gomez Rodriguez, Jure Leskovec, and Bernhard Schölkopf. Structure and dynamics of information pathways in online media. In Proceedings of the Sixth ACM International Conference on Web Search and Data Mining, pages 23–32. ACM, 2013.
[19] Edoardo M Airoldi, David M Blei, Stephen E Fienberg, and Eric P Xing. Mixed membership stochastic blockmodels. Journal of Machine Learning Research, 9(Sep):1981–2014, 2008.
[20] Charles Kemp, Joshua B Tenenbaum, Thomas L Griffiths, Takeshi Yamada, and Naonori Ueda. Learning systems of concepts with an infinite relational model. In AAAI, volume 3, page 5, 2006.
[21] Peter D Hoff, Adrian E Raftery, and Mark S Handcock. Latent space approaches to social network analysis. Journal of the American Statistical Association, 97(460):1090–1098, 2002.
[22] Kurt Miller, Michael I Jordan, and Thomas L Griffiths. Nonparametric latent feature models for link prediction. In Advances in Neural Information Processing Systems, pages 1276–1284, 2009.
[23] K Palla, F Caron, and YW Teh. A Bayesian nonparametric model for sparse dynamic networks. arXiv preprint, 2016.
[24] David J Aldous. Representations for partially exchangeable arrays of random variables. Journal of Multivariate Analysis, 11(4):581–598, 1981.
[25] Douglas N Hoover. Relations on probability spaces and arrays of random variables. Preprint, Institute for Advanced Study, Princeton, NJ, 2, 1979.
[26] Harry Crane and Walter Dempsey. Edge exchangeable models for network data. arXiv preprint arXiv:1603.04571, 2016.
[27] Josip Djolonga and Andreas Krause. Learning implicit generative models using differentiable graph tests. arXiv preprint arXiv:1709.01006, 2017.
[28] Russell Lyons. Determinantal probability measures. Publications Mathématiques de l'Institut des Hautes Études Scientifiques, 98(1):167–212, 2003.
[29] Daniel A Spielman and Shang-Hua Teng. Nearly linear time algorithms for preconditioning and solving symmetric, diagonally dominant linear systems. SIAM Journal on Matrix Analysis and Applications, 35(3):835–885, 2014.
[30] Emily B Fox, Erik B Sudderth, Michael I Jordan, and Alan S Willsky. The sticky HDP-HMM: Bayesian nonparametric hidden Markov models with persistent states.
[31] Jure Leskovec and Christos Faloutsos. Scalable modeling of real graphs using Kronecker multiplication. In Proceedings of the 24th International Conference on Machine Learning, ICML '07, pages 497–504, New York, NY, USA, 2007. ACM.
[32] Jure Leskovec, Jon Kleinberg, and Christos Faloutsos. Graphs over time: densification laws, shrinking diameters and possible explanations. In KDD '05: Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, pages 177–187, New York, NY, USA, 2005. ACM Press.
[33] Nathan Oken Hodas and Kristina Lerman. The simple rules of social contagion. CoRR, abs/1308.5015, 2013.
[34] Jure Leskovec, Lars Backstrom, and Jon Kleinberg. Meme-tracking and the dynamics of the news cycle. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '09, pages 497–506, New York, NY, USA, 2009. ACM.
[35] David Kempe, Jon Kleinberg, and Éva Tardos. Maximizing the spread of influence through a social network. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 137–146. ACM, 2003.