Rapid Bayesian Inference of Global Network Statistics using Random Walks
We propose a novel Bayesian methodology which uses random walks for rapid inference of statistical properties of undirected networks with weighted or unweighted edges. Our formalism yields high-accuracy estimates of the probability distribution of any network node-based property, and of the network size, after only a small fraction of network nodes has been explored. The Bayesian nature of our approach provides rigorous estimates of all parameter uncertainties. We demonstrate our framework on several standard examples, including random, scale-free, and small-world networks, and apply it to the large-scale network formed by the links between Wikipedia pages.
Over the past few years, our lives have become increasingly dependent on large-scale networks, often available through our computers and smartphones. In addition to the original computer-based networks such as the World Wide Web and the Internet, many online social networks have emerged, notably Twitter and Facebook. Our professional and personal activities are influenced daily by knowledge-sharing online services such as Wikipedia and YouTube. More generally, complex networks describe a broad spectrum of systems in nature, science, technology, and society [1]. Many of these networks are large and constantly changing, making investigation of their statistical properties a challenging task. In particular, estimating the network size becomes non-trivial if the network is too large to resort to brute-force methods such as visiting every node. Consequently, predicting various network statistics, typically from random samples of limited size, has attracted considerable attention in the literature [2–9].
Here we develop a Bayesian approach to network sampling by random walks (RWs) [4, 7]. Unlike previous results, our framework can be used to build full posterior probability distributions for any network node-based quantity of interest. Our framework reproduces several previously known global network statistics estimators within a single formalism, and automatically removes statistical biases caused by RW sampling [4, 5]. Surprisingly, accurate estimates of various network properties, including its size, are obtained after examining only a small fraction of all network nodes.
Consider a RW on a network of $N$ nodes with weighted edges $\{w_{ij}\}$, where $w_{ij}$ is the rate of transition from node $i$ to node $j$. At each step the walker will transition to a neighboring node $j$ with a probability $P_{ij} = w_{ij}/\sum_k w_{ik}$, where the sum is over all nearest neighbors of node $i$. We subdivide all network nodes into sets $S_x$ based on the value of some property $x$, such as the number of links connected to the current node, known as the node degree [1]; there are $N_x$ nodes in each set and $M$ distinct sets. We assume that the property in question is discrete; continuous properties can be discretized by binning. We focus on undirected networks with symmetric rates, $w_{ij} = w_{ji}$. In this case, the stationary probability for the RW to occupy node $i$, $\pi_i$, can be determined using the steady-state master equation [10, 11]:
Equation (1) is satisfied if $\pi_i \propto w_i$, where $w_i = \sum_j w_{ij}$ is the total outward rate. For unweighted graphs, the node’s stationary probability is simply proportional to its degree [12]. With normalization, the stationary probabilities become $\pi_i = w_i / \sum_j w_j$.
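The stationarity claim can be checked numerically on a toy network. The sketch below is illustrative only: the 4-node weight matrix and the function names are hypothetical, and the check simply confirms that probabilities proportional to each node's total outward rate are left unchanged by one step of the walk when the weights are symmetric.

```python
def transition_matrix(w):
    """Row-normalize a symmetric weight matrix into RW transition probabilities."""
    return [[wij / sum(row) for wij in row] for row in w]

def stationary_from_rates(w):
    """Candidate stationary probabilities: total outward rate of each node, normalized."""
    rates = [sum(row) for row in w]
    total = sum(rates)
    return [r / total for r in rates]

def step_distribution(pi, P):
    """Apply one RW step to a distribution: (pi P)_j = sum_i pi_i * P_ij."""
    n = len(pi)
    return [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]

# Hypothetical 4-node undirected network: symmetric weights, one loop on node 0.
w = [
    [0.5, 1.0, 2.0, 0.0],
    [1.0, 0.0, 0.5, 0.0],
    [2.0, 0.5, 0.0, 3.0],
    [0.0, 0.0, 3.0, 0.0],
]
P = transition_matrix(w)
pi = stationary_from_rates(w)
pi_next = step_distribution(pi, P)

# Largest change in any node's probability after one step (zero up to rounding).
max_residual = max(abs(a - b) for a, b in zip(pi, pi_next))
```

The residual vanishes because the weights are symmetric; with asymmetric rates the same candidate distribution would generally not be stationary.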
If the walker starts from a node with property $x$, the average number of steps between subsequent visits to any node within the set $S_x$, also known as the mean return time (MRT), is given by [13]:
In the case of undirected networks,
where $p_x = N_x/N$ is the fraction of nodes with property $x$, $\bar{w}_x = N_x^{-1} \sum_{i \in S_x} w_i$, and $\bar{w} = N^{-1} \sum_i w_i$.
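The step from the general MRT expression to its undirected-network form can be written out explicitly. The notation here ($S_x$ for the set of nodes with property $x$, $N_x$ for its size, $w_i$ for the total outward rate of node $i$, $\pi_i$ for its stationary probability) is assumed for illustration and may differ from the original typeset equations. The sketch uses only the standard result that the mean return time to a set is the inverse of its stationary probability, together with $\pi_i = w_i/\sum_j w_j$.

```latex
\begin{align}
\langle T_x \rangle
  &= \frac{1}{\sum_{i \in S_x} \pi_i}
   = \frac{\sum_{j} w_j}{\sum_{i \in S_x} w_i}
   = \frac{N \bar{w}}{N_x \bar{w}_x}
   = \frac{\bar{w}}{p_x \bar{w}_x}, \\
p_x &= \frac{N_x}{N}, \qquad
\bar{w}_x = \frac{1}{N_x} \sum_{i \in S_x} w_i, \qquad
\bar{w} = \frac{1}{N} \sum_{j=1}^{N} w_j .
\end{align}
```

As a worked example, on an unweighted network in which 10% of the nodes carry property $x$ and those nodes have twice the average degree, this gives $\langle T_x \rangle = 1/(0.1 \times 2) = 5$ steps.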
The probability of making $t$ steps between subsequent visits to $S_x$, $P(t)$, is asymptotically exponential in arbitrary finite networks [14]:
where $\lambda$ is the hitting rate of the nodes within $S_x$. We find empirically that an exponential ansatz for $P(t)$ is sufficiently accurate for our purposes (Figs. 1(a)–3(a)), although our approach is not limited to it. The likelihood that during a single RW with $T$ steps the walker has visited the nodes in $S_x$ at intervals $t_1, \dots, t_K$, and has not returned to $S_x$ for the remaining $T - \sum_{k=1}^{K} t_k$ steps, is given by
This likelihood function is maximized by . Assuming a uniform prior for in the range, the posterior probability density for becomes
where is a normalization constant, and the approximation is valid in the limit. In this limit, Eq. (6) becomes a gamma distribution , which rapidly approaches a Gaussian limit as increases, with the mean and the standard deviation .
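The posterior described above can be summarized in a few lines of code. This is a minimal sketch under stated assumptions: the likelihood is taken to be proportional to `lam**K * exp(-lam * T)` for `K` returns in `T` steps, and the flat prior's upper cutoff is assumed to lie far in the tail, so the posterior is a gamma density with shape `K + 1` and rate `T`. The function names and the example counts are hypothetical.

```python
import math

def hitting_rate_posterior(num_returns, num_steps):
    """Point summaries of a gamma(shape=K+1, rate=T) posterior for the hitting rate."""
    K, T = num_returns, num_steps
    return {
        "mle": K / T,                  # maximizer of lam**K * exp(-lam * T)
        "mean": (K + 1) / T,           # gamma mean = shape / rate
        "std": math.sqrt(K + 1) / T,   # gamma std = sqrt(shape) / rate
    }

def log_posterior(lam, num_returns, num_steps):
    """Normalized log-density of the gamma(shape=K+1, rate=T) posterior at lam > 0."""
    K, T = num_returns, num_steps
    return (K + 1) * math.log(T) - math.lgamma(K + 1) + K * math.log(lam) - lam * T

# Hypothetical record: 400 returns to the set during a 20000-step walk.
summary = hitting_rate_posterior(num_returns=400, num_steps=20000)
```

For large `K` the mean and the maximizer differ only by a relative `1/K`, and the relative width `std / mean` is `1 / sqrt(K + 1)`, which is why the Gaussian approximation becomes accurate quickly.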
The Gaussian limit of Eq. (6), in combination with Eq. (3), yields a maximum likelihood estimate (MLE) and a standard error for the probability $p_x$ of the property $x$:
If the property of the node is its total outward rate , Eq. (7) yields
where is the number of visits to nodes whose total outward rates lie within the bin , and is the total number of bins.
For an arbitrary node property , each set can be additionally subdivided by the binned value of , such that
where Eq. (8) was employed to compute . Here, is the number of visits to nodes with property and the total outward rates within the bin . Thus, the knowledge of , , and is sufficient to reconstruct the MLE of the distribution of any property , estimate the error in this reconstruction (Eq. (7)), and compute moments of arbitrary order. Note that the division by the outward rates in Eq. (9) naturally corrects for the bias known to be introduced by RW sampling [4–6]. For unweighted networks (), reduces to , the network degree distribution [1].
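The bias correction by division by the outward rates can be illustrated with a small re-weighting sketch. It is not necessarily the exact form of Eq. (9): it assumes that each RW visit is recorded as a `(x, w)` pair (property value and total outward rate of the visited node) and weights every visit by `1/w`, which is the standard way to undo a walker's preference for high-rate nodes. The function names and the toy visit record are hypothetical.

```python
from collections import defaultdict

def estimate_property_distribution(visits):
    """Estimate the fraction of nodes with each property value from RW visits.

    visits: list of (x, w) pairs, one per step, with w > 0 the total outward
    rate of the visited node (its degree on an unweighted network).
    """
    weight_by_x = defaultdict(float)
    for x, w in visits:
        weight_by_x[x] += 1.0 / w      # down-weight frequently visited nodes
    norm = sum(weight_by_x.values())
    return {x: s / norm for x, s in weight_by_x.items()}

def estimate_mean_rate(visits):
    """Harmonic-mean estimate of the average outward rate: T / sum_t (1 / w_t)."""
    return len(visits) / sum(1.0 / w for _, w in visits)

# Idealized record: 100 degree-1 nodes ("leaf") and 100 degree-3 nodes ("hub"),
# visited exactly in proportion to degree, as a stationary RW would on average.
visits = [("leaf", 1)] * 100 + [("hub", 3)] * 300
p_hat = estimate_property_distribution(visits)
w_bar_hat = estimate_mean_rate(visits)
```

In this toy record the raw visit fractions are 0.25 and 0.75, whereas the re-weighted estimate recovers the true node fractions of one half each and the true mean degree of 2.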
The MLE of the average outward rate is given by
Let us suppose now that the network nodes are divided into two sets: randomly chosen nodes, which we will refer to as pseudotargets, and all the rest. The pseudotarget nodes are drawn prior to exploring the network, so that the average pseudotarget outward rate, , is known. Equations (3) and (5) can now be used to construct the posterior probability for the network size (assuming a uniform prior in the range, where denotes an upper limit on ):
where is the number of visits to pseudotargets. Note that using non-uniform priors in Eqs. (6) and (12) will not significantly affect the results, as long as and are sufficiently large. Similar to Eq. (6), we find that this posterior probability quickly becomes Gaussian as increases, with
Using Eq. (10), we obtain
Note that the error in can be reduced either through increasing or assigning highly-connected nodes (network hubs) to be pseudotargets. In the limit, Eq. (13) recovers the network size estimator found in Ref. [7].
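The logic of the size estimate can be sketched to leading order. The symbols below are assumptions made for illustration, not the original notation: $n$ pseudotargets with mean outward rate $\bar{w}_{\mathrm{pt}}$, $K$ visits to pseudotargets during $T$ steps, and network-wide mean outward rate $\bar{w}$. The sketch treats $\bar{w}$ as known, ignores its own sampling error, and drops corrections of relative order $1/K$, so it should be read as one form consistent with the text rather than as Eqs. (12)–(14) themselves.

```latex
\begin{align}
\lambda &= \frac{n\,\bar{w}_{\mathrm{pt}}}{N\,\bar{w}}, \qquad
P(N \mid K, T) \propto \lambda^{K} e^{-\lambda T}
  \propto N^{-K} \exp\!\left(-\frac{n\,\bar{w}_{\mathrm{pt}}\,T}{N\,\bar{w}}\right), \\
\hat{N} &= \frac{n\,\bar{w}_{\mathrm{pt}}\,T}{K\,\bar{w}}, \qquad
\frac{\sigma_N}{\hat{N}} \approx \frac{1}{\sqrt{K}} \quad (K \gg 1).
\end{align}
```

Since the expected $K$ grows in proportion to both $T$ and $n\,\bar{w}_{\mathrm{pt}}$, a longer walk or better-connected pseudotargets both shrink the relative error.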
We have implemented the above network statistics acquisition framework as follows: for each network, pseudotargets are randomly drawn, and their is computed. Commencing the RW from one of these pseudotargets, we record , , , and for a desired set of node properties . At each step in the RW, Eqs. (7)–(14) can then be used to find various network statistics.
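The procedure can be sketched end to end on a synthetic graph. Everything named here is an assumption for illustration: the ring-plus-random-chords graph is a stand-in rather than one of the networks studied in the text, the function names and parameter values are hypothetical, and the code returns leading-order point estimates (size, its approximate standard error, and mean degree) rather than full posteriors.

```python
import random

def small_world_graph(n, n_chords, rng):
    """Ring of n nodes plus n_chords random chords; connected by construction."""
    adj = [set() for _ in range(n)]
    for i in range(n):
        j = (i + 1) % n
        adj[i].add(j)
        adj[j].add(i)
    for _ in range(n_chords):
        a, b = rng.randrange(n), rng.randrange(n)
        if a != b:
            adj[a].add(b)
            adj[b].add(a)
    return [sorted(nbrs) for nbrs in adj]

def estimate_size(adj, n_targets, n_steps, rng):
    """Estimate the node count from a single RW using randomly drawn pseudotargets."""
    # len(adj) is used only to draw the pseudotargets, not in the estimate itself.
    targets = rng.sample(range(len(adj)), n_targets)
    target_set = set(targets)
    mean_target_degree = sum(len(adj[i]) for i in targets) / n_targets
    node = targets[0]          # start the walk on a pseudotarget
    hits = 0                   # number of visits to pseudotargets
    inv_degree_sum = 0.0       # running sum of 1/degree over visited nodes
    for _ in range(n_steps):
        node = rng.choice(adj[node])
        inv_degree_sum += 1.0 / len(adj[node])
        if node in target_set:
            hits += 1
    mean_degree_hat = n_steps / inv_degree_sum   # harmonic-mean degree estimate
    size_hat = n_targets * mean_target_degree * n_steps / (hits * mean_degree_hat)
    return size_hat, size_hat / hits ** 0.5, mean_degree_hat

rng = random.Random(0)
adj = small_world_graph(n=50000, n_chords=100000, rng=rng)
size_hat, size_err, mean_degree_hat = estimate_size(
    adj, n_targets=2500, n_steps=10000, rng=rng
)
true_mean_degree = sum(len(nbrs) for nbrs in adj) / len(adj)
```

With these settings the walk takes 10000 steps on a 50000-node graph, so it can visit at most 20% of the nodes, yet several hundred pseudotarget visits are expected, which is what controls the relative error of `size_hat`.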
We have used this algorithm to study three unweighted, undirected networks: an Erdős-Rényi (ER) random graph [15], a scale-free (SF) random graph [1], and a small-world (SW) network [16]. Each network has nodes. The ER network was constructed by randomly assigning edges between nodes, the SF network by the preferential attachment method [1] with edges attached to new nodes, and the SW network as described in Ref. [17], with the shortcut probability .
For each network, pseudotargets were randomly drawn and the network was subsequently explored with a random walk for steps, visiting at most 10% of all nodes. Besides network sizes and degree distributions, we have tracked posterior probabilities of the average degree of nearest-neighbor nodes,
the clustering coefficient [2],
where $e_i$ is the total number of links shared by the nearest neighbors of node $i$, and a measure of the degree inhomogeneity [5]
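Two of these node-based properties are easy to compute from a node's local neighborhood, which is all a random walker has access to. The sketch below uses the common textbook definitions, which are assumed here and may differ in detail from the original displayed equations: the mean degree of a node's nearest neighbors, and the local clustering coefficient as the fraction of neighbor pairs that are themselves linked. The degree-inhomogeneity measure is not implemented. The function names and the 5-node graph are hypothetical.

```python
def mean_neighbor_degree(adj, i):
    """Average degree of the nearest neighbors of node i."""
    return sum(len(adj[j]) for j in adj[i]) / len(adj[i])

def local_clustering(adj, i):
    """Fraction of pairs of neighbors of node i that are directly linked."""
    nbrs = list(adj[i])
    k = len(nbrs)
    if k < 2:
        return 0.0   # convention: undefined pairs count as zero clustering
    links = sum(
        1
        for a in range(k)
        for b in range(a + 1, k)
        if nbrs[b] in adj[nbrs[a]]
    )
    return 2.0 * links / (k * (k - 1))

# Hypothetical undirected graph: a triangle 0-1-2 with a tail 2-3-4.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4}, 4: {3}}
knn_2 = mean_neighbor_degree(adj, 2)   # neighbors 0, 1, 3 all have degree 2
c_2 = local_clustering(adj, 2)         # one linked pair out of three
c_0 = local_clustering(adj, 0)         # the single neighbor pair is linked
```

Because both quantities depend only on the visited node and its neighbors, they can be recorded at each step of the walk and fed into the same re-weighted estimators as any other node property.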
A summary of the ER system statistics is provided in Fig. 1. Fig. 1(a) shows that the exponential ansatz for the RT distribution, Eq. (4), is accurate for this system. Fig. 1(b) demonstrates the convergence of the average degree distribution and the network size to the exact values during 5 representative runs. The predicted degree distribution, $p_k$, known to be Poisson [15], is shown in Fig. 1(c). Finally, in Fig. 1(d) we demonstrate evolution of the posterior distribution for the network size as more data is collected. Additional statistics for the ER, SF and SW systems are summarized in Table 1. Although the network topologies of these three systems are quite different, all network statistics we have considered are recovered accurately.
Next, we have constructed a generalized ER network with nodes and weighted edges. After placing all the edges as in the unweighted ER network, a loop was added to each node with probability . All loops and edges were then assigned a symmetric weight drawn from an exponential distribution with unit mean. For this system, we have collected statistics on each node’s total outward rate, , loop weight, (note that for nodes without loops), outward rate averaged over all nearest neighbors of node , , and average nearest-neighbor loop weight, .
We have explored the statistics of these quantities using a RW with steps and randomly drawn pseudotargets (final row of Table 1, Fig. 2). Note that the RT distribution for this system deviates from purely exponential since many returns occur after a single step due to loops (Fig. 2(a)). Nonetheless, all the network statistics we have considered are predicted accurately (Fig. 2(b)–(d)), except for the tail of the Fig. 2(d) distribution since those rare events were not observed. Thus our methodology is equally applicable to studies of weighted networks with loops.
Finally, we have examined the network formed by hyperlinks between English articles on Wikipedia. Links connecting an article to itself were disregarded, multiple links between articles were counted as one, and automatic redirects were disallowed, resulting in an unweighted, undirected, loopless network consisting of all English articles, redirect pages, and disambiguation pages [18]. To assign pseudotargets, the first pages were drawn from Wikipedia’s static HTML dumps. A single randomly chosen link was then taken from each of these pages and the node it pointed to was designated as a pseudotarget, resulting in . This procedure increases the likelihood that the pseudotargets are hubs with a large number of links, facilitating collection of the network statistics since grows more rapidly [3, 7, 12].
We have focused on several statistics that facilitate comparison with known properties of Wikipedia: the size of each page in bytes, , and two variables representing whether a page is a redirect or a disambiguation page, respectively. The quantities , , , and then give the fraction of redirect pages, disambiguation pages, both redirect and disambiguation pages, and the average storage space in bytes of English articles (Wikipedia excludes redirect pages from its estimates of the number of articles [18]). The RW was run for steps, with the resulting predictions shown in Table 2 and Fig. 3.
We find that Wikipedia contains million pages, each of which is connected to other pages on average. The majority of Wikipedia pages, , are redirect pages, and are disambiguation pages. We estimate the total number of English articles (including disambiguation pages) to be million, and the total number of redirect pages to be million, within the confidence intervals of the values reported by Wikipedia: and million, respectively [19]. We find the total size of English articles in Wikipedia to be gigabytes (GB), in reasonable agreement with the Wikipedia statement that text alone accounts for GB of the storage space of English articles [20].
Fig. 3(a) demonstrates that the assumption of the exponential RT distribution is reasonable for Wikipedia, with some enrichment for short RTs due to the choice of network hubs as pseudotargets. Fig. 3(b) shows how the estimate of the total number of Wikipedia pages evolves as increases. As in many other Internet-based networks [21], the degree distribution of Wikipedia pages is scale-free (Fig. 3(c)). In contrast, the distribution of page sizes is not scale-free, and the size of an average Wikipedia page is only kB (Fig. 3(d), Table 2).
In conclusion, we have presented a general Bayesian approach to collecting various network statistics, including the size of the network, using RWs that visit only a small fraction of all network nodes. Our approach works for both weighted and unweighted undirected networks, and remains accurate in the presence of loops. Our main assumption, that of the exponentiality of the RT distribution, appears to hold in all the cases we have examined explicitly, and can be relaxed if necessary. Our future work will focus on extending this methodology to directed and time-dependent networks.
- (1) R. Albert and A. L. Barabási. Statistical mechanics of complex networks. Rev Mod Phys, 74:47–97, 2002.
- (2) M. E. J. Newman. Mixing patterns in networks. Phys Rev E, 67:026126, 2003.
- (3) S. H. Lee, P.-J. Kim, and H. Jeong. Statistical properties of sampled networks. Phys Rev E, 73:016102, 2006.
- (4) S. Yoon, S. Lee, S.-H. Yook, and Y. Kim. Statistical properties of sampled networks by random walks. Phys Rev E, 75:046114, 2007.
- (5) E. Estrada. Quantifying network heterogeneity. Phys Rev E, 82:066102, 2010.
- (6) M. Gjoka, M. Kurant, C. T. Butts, and A. Markopoulou. Walking in Facebook: A case study of unbiased sampling of OSNs. In Proc 29th Conf Inform Comm, INFOCOM’10, pages 2498–2506, Piscataway, NJ, USA, 2010. IEEE Press.
- (7) C. Cooper, T. Radzik, and Y. Siantos. Estimating network parameters using random walks. In 2012 Fourth International Conference on Computational Aspects of Social Networks (CASoN), pages 33–40, Nov 2012.
- (8) C. A. Bliss, C. M. Danforth, and P. S. Dodds. Estimation of global network statistics from incomplete data. PLoS One, 9:e108471, 2014.
- (9) Y. Zhang, E. D. Kolaczyk, and B. D. Spencer. Estimating network degree distributions under sampling: An inverse problem, with applications to monitoring social media networks. Ann Appl Stat, 9:166–199, 2015.
- (10) N. G. van Kampen. Stochastic Processes in Physics and Chemistry. Elsevier, Amsterdam, 2007.
- (11) P. L. Krapivsky, S. Redner, and E. Ben-Naim. A Kinetic View of Statistical Physics. Cambridge University Press, 2010.
- (12) J. D. Noh and H. Rieger. Random walks on complex networks. Phys Rev Lett, 92:118701, 2004.
- (13) S. Condamin, O. Bénichou, and M. Moreau. Random walks and Brownian motion: a method of computation for first-passage times and related quantities in confined geometries. Phys Rev E, 75:021111, 2007.
- (14) E. M. Bollt and D. ben-Avraham. What is special about diffusion on scale-free nets? New J Phys, 7:26–47, 2005.
- (15) P. Erdős and A. Rényi. On the evolution of random graphs. Bull Inst Internat Stat, 38:343–347, 1961.
- (16) D. J. Watts and S. H. Strogatz. Collective dynamics of small-world networks. Nature, 393:440–442, 1998.
- (17) M. E. J. Newman, C. Moore, and D. Watts. Mean-field solution of the small-world network model. Phys Rev Lett, 84:3201–3204, 2000.
- (18) https://en.wikipedia.org/wiki/Wikipedia:What_is_an_article?
- (19) https://stats.wikimedia.org/EN/TablesWikipediaEN.htm.
- (20) https://en.wikipedia.org/wiki/Wikipedia:Size_in_volumes.
- (21) M. Faloutsos, P. Faloutsos, and C. Faloutsos. On power-law relationships of the internet topology. SIGCOMM Comput Commun Rev, 29:251–262, 1999.