Estimating the Size of a Large Network and its Communities from a Random Sample
Most real-world networks are too large to be measured or studied directly, and there is substantial interest in estimating global network properties from smaller sub-samples. One of the most important global properties is the number of vertices/nodes in the network. Estimating the number of vertices in a large network is a major challenge in computer science, epidemiology, demography, and intelligence analysis. In this paper we consider a population random graph $G = (V, E)$ from the stochastic block model (SBM) with $K$ communities/blocks. A sample is obtained by randomly choosing a subset $W \subseteq V$ and letting $G(W)$ be the induced subgraph in $G$ of the vertices in $W$. In addition to $G(W)$, we observe the total degree of each sampled vertex and its block membership. Given this partial information, we propose an efficient PopULation Size Estimation algorithm, called Pulse, that accurately estimates the size of the whole population as well as the size of each community. To support our theoretical analysis, we perform an exhaustive set of experiments to study the effects of sample size, $K$, and SBM model parameters on the accuracy of the estimates. The experimental results also demonstrate that Pulse significantly outperforms a widely-used method called the network scale-up estimator in a wide variety of scenarios. We conclude with extensions and directions for future work.
Many real-world networks cannot be studied directly because they are obscured in some way, are too large, or are too difficult to measure. There is therefore a great deal of interest in estimating properties of large networks via sub-samples. One of the most important properties of a large network is the number of vertices it contains. Unfortunately, census-like enumeration of all the vertices in a network is often impossible, so researchers must try to learn about the size of real-world networks by sampling smaller components. In addition to the size of the total network, there is great interest in estimating the size of different communities or sub-groups from a sample of a network. Many real-world networks exhibit community structure, where nodes in the same community have denser connections than those in different communities [7, 16, 15]. In the following examples, we describe network size estimation problems in which only a small subgraph of a larger network is observed.
Social networks. The social and economic value of an online social network (e.g. Facebook, Instagram, Twitter) is closely related to the number of users the service has. When a social networking service does not reveal the true number of users, economists, marketers, shareholders, or other groups may wish to estimate the number of people who use the service based on a sub-sample.
World Wide Web. Pages on the World-Wide Web can be classified into several categories (e.g. academic, commercial, media, government, etc.). Pages in the same category tend to have more connections. Computer scientists have developed crawling methods for obtaining a sub-network of web pages, along with their hyperlinks to other unknown pages. Using the crawled sub-network and hyperlinks, they can estimate the number of pages of a certain category [14, 13, 19, 10, 17].
Size of the Internet. The number of computers on the Internet (the size of the Internet) is of great interest to computer scientists. However, it is impractical to access and enumerate all computers on the Internet; only a small sample of computers, and the connections among them, can be observed.
Counting terrorists. Intelligence agencies often target a small number of suspicious or radicalized individuals to learn about their communication network. But agencies typically do not know the number of people in the network. The number of elements in such a covert network might indicate the size of a terrorist force, and would be of great interest.
Epidemiology. Many of the groups at greatest risk for HIV infection (e.g. sex workers, injection drug users, men who have sex with men) are also difficult to survey using conventional methods. Since members of these groups cannot be enumerated directly, researchers often trace social links to reveal a network among known subjects. Public health and epidemiological interventions to mitigate the spread of HIV rely on knowledge of the number of HIV-positive people in the population [9, 8, 20, 21, 22, 23, 6].
Counting disaster victims. After a disaster, it can be challenging to estimate the number of people affected. When logistical challenges prevent all victims from being enumerated, a random sample of individuals may be possible to obtain [2, 3].
In this paper, we propose a novel method called Pulse for estimating the number of vertices and the size of individual communities from a random sub-sample of the network. We model the network as an undirected simple graph $G = (V, E)$, and we treat $G$ as a realization from the stochastic blockmodel (SBM), a widely-studied extension of the Erdős-Rényi random graph model [18] that accommodates community structures in the network by mapping each vertex into one of $K$ disjoint types or communities. We construct a sample of the network by choosing a sub-sample of vertices $W \subseteq V$ uniformly at random without replacement, and forming the induced subgraph $G(W)$ of $W$ in $G$. We assume that the block membership and total degree of each vertex $v \in W$ are observed. We propose a Bayesian estimation algorithm Pulse for $N = |V|$, the number of vertices in the network, along with the number of vertices $N_i$ in each block. We first prove important regularity results for the posterior distribution of $N$. Then we describe the conditions under which relevant moments of the posterior distribution exist. We evaluate the performance of Pulse in comparison with the popular “network scale-up” method (NSUM) [9, 8, 20, 21, 22, 23, 6, 11]. We show that while NSUM is asymptotically unbiased, it suffers from serious finite-sample bias and large variance. We show that Pulse has superior performance, in terms of relative error and variance, over NSUM in a wide variety of model and observation scenarios. Proofs are given in the appendix.
2 Problem Formulation
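The stochastic blockmodel (SBM) is a random graph model that generalizes the Erdős-Rényi random graph [18]. Let $G = (V, E)$ be a realization from an SBM, where $N = |V|$ is the total number of vertices, the vertices are divided into $K$ types indexed $1, \ldots, K$, specified by the map $t : V \to \{1, \ldots, K\}$, and a type-$i$ vertex and a type-$j$ vertex are connected independently with probability $p_{ij} \in [0, 1]$. Let $N_i$ be the number of type-$i$ vertices in $G$, with $N = \sum_{i=1}^{K} N_i$. The degree of a vertex $v$ is $d(v)$. An edge is said to be of type-$(i, j)$ if it connects a type-$i$ vertex and a type-$j$ vertex. A random induced subgraph is obtained by sampling a subset $W \subseteq V$ with $|W| = n$ uniformly at random without replacement, and forming the induced subgraph, denoted by $G(W)$. Let $V_i$ be the number of type-$i$ vertices in the sample and $E_{ij}$ be the number of type-$(i, j)$ edges in the sample. For a vertex $v$ in the sample, a pendant edge connects vertex $v$ to a vertex outside the sample. Let $y(v) = d(v) - \sum_{w \in W} \mathbb{1}\{\{v, w\} \in E\}$ be the number of pendant edges incident to $v$. Let $y_i(v)$ be the number of type-$(t(v), i)$ pendant edges of vertex $v$; i.e., $y_i(v) = \sum_{w \in V \setminus W} \mathbb{1}\{t(w) = i, \{v, w\} \in E\}$. We have $\sum_{i=1}^{K} y_i(v) = y(v)$. Let $\tilde{N}_i = N_i - V_i$ be the number of type-$i$ nodes outside the sample. For the ease of presentation we define $\mathbf{N} = (N_i : 1 \le i \le K)$, $\mathbf{p} = (p_{ij} : 1 \le i, j \le K)$, and $\mathbf{y} = (y_i(v) : v \in W, 1 \le i \le K)$. We observe only $G(W)$ and the total degree $d(v)$ of each vertex $v$ in the sample. We assume that we know the type $t(v)$ of each vertex in the sample. The observed data $D$ consists of $G(W)$, $(d(v) : v \in W)$, and $(t(v) : v \in W)$; i.e., $D = (G(W), (d(v) : v \in W), (t(v) : v \in W))$. A table of notation is provided in the appendix.
Given the observed data $D$, estimate the size of the vertex set $N = |V|$ and the size of each community $N_i$.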
Fig. 1 illustrates the vertex set size estimation problem. White nodes are of type- and gray nodes are of type-. All nodes outside are unobserved. We observe the types and the total degree of each vertex in the sample. Thus we know the number of pendant edges that connect each vertex in the sample to other, unsampled vertices. However, the destinations of these pendant edges are unknown to us.
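The sampling scheme described above can be made concrete with a small simulation. The following is a minimal sketch, not the authors' code: the function names `sample_sbm` and `observe`, the block sizes, and the edge probabilities are all illustrative assumptions. It draws a graph from an SBM, samples vertices uniformly without replacement, and records what the text says is observed: type counts in the sample, within-sample edge counts by type pair, total degrees, and pendant-edge counts.

```python
import random
from itertools import combinations

def sample_sbm(sizes, p, rng):
    """Draw an undirected simple graph from an SBM.

    sizes[i] is the number of type-i vertices and p[i][j] is the
    probability of an edge between a type-i and a type-j vertex.
    """
    types = [i for i, n_i in enumerate(sizes) for _ in range(n_i)]
    n = len(types)
    adj = [set() for _ in range(n)]
    for u, v in combinations(range(n), 2):
        if rng.random() < p[types[u]][types[v]]:
            adj[u].add(v)
            adj[v].add(u)
    return types, adj

def observe(types, adj, sample_size, rng):
    """Sample vertices uniformly without replacement; return observed data."""
    k = max(types) + 1
    sample = rng.sample(range(len(types)), sample_size)
    in_sample = set(sample)
    v_counts = [0] * k                      # sampled vertices per type
    e_counts = [[0] * k for _ in range(k)]  # sampled edges per type pair (i <= j)
    degree, pendant = {}, {}
    for v in sample:
        v_counts[types[v]] += 1
        degree[v] = len(adj[v])
        inside = adj[v] & in_sample
        # Pendant edges leave the sample; their endpoints stay unobserved.
        pendant[v] = degree[v] - len(inside)
        for w in inside:
            if v < w:  # count each within-sample edge once
                i, j = sorted((types[v], types[w]))
                e_counts[i][j] += 1
    return sample, v_counts, e_counts, degree, pendant

if __name__ == "__main__":
    rng = random.Random(0)
    types, adj = sample_sbm([60, 40], [[0.3, 0.05], [0.05, 0.4]], rng)
    sample, v_counts, e_counts, degree, pendant = observe(types, adj, 30, rng)
    print(v_counts, e_counts)
```

Only the pendant counts, not the types of their endpoints, are returned, mirroring the statement that the destinations of pendant edges are unknown.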
3 Network Scale-Up Estimator
We briefly outline a simple and intuitive estimator for $N$ that will serve as a comparison to Pulse. The network scale-up method (NSUM) is a simple estimator for the vertex set size of Erdős-Rényi random graphs. It has been used in real-world applications to estimate the size of hidden or hard-to-reach populations such as drug users [9], HIV-infected individuals [8, 20, 21, 22, 23], men who have sex with men (MSM) [6], and homeless people [11]. Consider a random graph on $N$ vertices that follows the Erdős-Rényi distribution with connection probability $p$. The expected sum of total degrees in a random sample $W$ of vertices is $\mathbb{E}[\sum_{v \in W} d(v)] = |W|(N - 1)p$, where $d(v)$ is the total degree of vertex $v$. The expected number of edges in the sample is $\mathbb{E}[E_S] = \binom{|W|}{2} p$, where $E_S$ is the number of edges within the sample. A simple estimator of the connection probability $p$ is $\hat{p} = E_S / \binom{|W|}{2}$. Plugging $\hat{p}$ into the first moment equation and solving for $N$ yields $\hat{N} = 1 + (|W| - 1) \sum_{v \in W} d(v) / (2 E_S)$, often simplified to $\hat{N}_{NS} = |W| \sum_{v \in W} d(v) / (2 E_S)$ [9, 8, 20, 21, 22, 23, 6, 11].
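The moment-matching argument above translates directly into a few lines of code. This is a minimal sketch under the Erdős-Rényi assumption; the function name `nsum_estimate` and its `simplified` flag are illustrative choices, not part of the original method description. It takes the total degrees of the sampled vertices and the number of edges inside the sample, estimates the connection probability from the within-sample edges, and solves the degree-sum moment equation for the population size.

```python
def nsum_estimate(degrees, edges_in_sample, simplified=False):
    """Network scale-up estimate of the number of vertices.

    degrees: total degree of each sampled vertex.
    edges_in_sample: number of edges with both endpoints in the sample.
    """
    n_s = len(degrees)
    if edges_in_sample <= 0:
        raise ValueError("the estimator is undefined without edges in the sample")
    total_degree = sum(degrees)
    if simplified:
        # Common simplification: |W| * sum(d) / (2 * E_S).
        return n_s * total_degree / (2 * edges_in_sample)
    # p_hat = E_S / C(|W|, 2); solve |W| (N - 1) p_hat = sum(d) for N.
    return 1 + (n_s - 1) * total_degree / (2 * edges_in_sample)

if __name__ == "__main__":
    # Complete graph on 10 vertices, 4 of them sampled:
    # every degree is 9 and the sample contains C(4, 2) = 6 edges.
    print(nsum_estimate([9, 9, 9, 9], 6))
    print(nsum_estimate([9, 9, 9, 9], 6, simplified=True))
```

In the complete-graph example the unsimplified form recovers the true size exactly, while the simplified form is slightly larger, which illustrates why the two differ noticeably only when the sample is small.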
Theorem 1 (Proof in Appendix). Suppose follows a stochastic blockmodel with edge probability for . For any sufficiently large sample size, the NSUM is positively biased and has an asymptotic lower bound , as becomes large, where for two sequences and , means that there exists a sequence such that ; i.e., for all and . However, as sample size goes to infinity, the NSUM becomes asymptotically unbiased.
4 Main Results
NSUM uses only aggregate information about the sum of the total degrees of vertices in the sample and the number of edges in the sample. We propose a novel algorithm, Pulse, that uses individual degree, vertex type, and the network structure information. Experiments (Section 5) show that it outperforms NSUM in terms of both bias and variance.
Given , the conditional likelihood of the edges in the sample is given by
and the conditional likelihood of the pendant edges is given by
where the sum is taken over all ’s () such that and . Thus the total conditional likelihood is
If we condition on and , the likelihood of the edges within the sample is the same as since it does not rely on , while the likelihood of the pendant edges given and is
Therefore the total likelihood conditioned on and is given by The conditional likelihood is indeed a function of . We may view this as the likelihood of given the data and the probabilities ; i.e., . Similarly, the likelihood conditioned on and is a function of and . It can be viewed as the joint likelihood of and given the data and the probabilities ; i.e., , and , where the sum is taken over all ’s, and , such that and , , . To have a full Bayesian approach, we assume that the joint prior distribution for and is . Hence, the population size estimation problem is equivalent to the following optimization problem for :
Then we estimate the total population size as .
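For readers who want the shape of the two likelihood factors spelled out, the following is a sketch derived only from the model assumptions (independent Bernoulli edges), written in assumed notation: $V_i$ and $E_{ij}$ are the numbers of sampled type-$i$ vertices and sampled type-$(i,j)$ edges, $\tilde{N}_i$ is the number of unsampled type-$i$ vertices, $y_i(v)$ is the number of pendant edges from sampled vertex $v$ to unsampled type-$i$ vertices, and $M_{ij}$ is an auxiliary symbol introduced here for the number of vertex pairs of each type inside the sample. It is not a transcription of the displayed equations of the paper.

```latex
% Number of vertex pairs of type (i, j) inside the sample (auxiliary symbol).
\[
M_{ij} =
\begin{cases}
\binom{V_i}{2}, & i = j, \\
V_i V_j, & i \neq j.
\end{cases}
\]
% Edges inside the sample: independent Bernoulli trials, free of N.
\[
\Pr(\text{edges in sample} \mid \mathbf{p})
  = \prod_{1 \le i \le j \le K} p_{ij}^{E_{ij}} \, (1 - p_{ij})^{M_{ij} - E_{ij}} .
\]
% Pendant edges, given their (unobserved) split by destination type:
% y_i(v) is Binomial(\tilde{N}_i, p_{i, t(v)}) independently over v and i.
\[
\Pr(\{y_i(v)\} \mid \tilde{\mathbf{N}}, \mathbf{p})
  = \prod_{v \in W} \prod_{i=1}^{K}
    \binom{\tilde{N}_i}{y_i(v)} \,
    p_{i, t(v)}^{\, y_i(v)} \, (1 - p_{i, t(v)})^{\tilde{N}_i - y_i(v)} .
\]
```

Because only the totals $y(v) = \sum_i y_i(v)$ are observed, the likelihood of the data alone requires summing the last expression over all splits consistent with those totals, which is the costly sum the text refers to.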
We briefly study the regularity of the posterior distribution of . In order to learn about , we must observe enough vertices from each block type, and enough edges connecting members of each block, so that the first and second moments of the posterior distribution exist. Intuitively, in order for the first two moments to exist, either we must observe many edges connecting vertices of each block type, or we must have sufficiently strong prior beliefs about .
Theorem 2 (Proof in Appendix). Assume that and follows the Beta distribution independently for . Let If is bounded and , then the -th moment of exists.
In particular, if , the variance of exists. Theorem 2 gives the minimum possible number of edges in the sample to make the posterior sampling meaningful. If the prior distribution of is , then we need at least three edges incident on type- edges for all types to guarantee the existence of the posterior variance.
4.1 Erdős-Rényi Model
In order to better understand how Pulse estimates the size of a general stochastic block-model we study the Erdős-Rényi case where , and all vertices are connected independently with probability . Let denote the total population size, be the sample with size and . For each vertex in the sample, let denote the number of pendant edges of vertex , and is the number of edges within the sample. Then In the Erdős-Rényi case, and thus . Therefore, the total likelihood of conditioned on is given by
We assume that has a beta prior and that has a prior . Let
where . The posterior probability is proportional to . The algorithm is presented in Algorithm 1.
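To illustrate how a posterior over the population size can be computed in the Erdős-Rényi case, here is a minimal sketch. It is not a reproduction of Algorithm 1: the function name `er_posterior`, the flat prior on a finite grid, and the upper limit `n_max` are assumptions made for illustration. Each pendant count is treated as a binomial draw over the unsampled vertices, the edge probability is integrated out analytically against a Beta prior, and the resulting weights are normalized over the grid.

```python
import math

def log_binom(n, k):
    """Log of the binomial coefficient C(n, k)."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)

def log_beta(a, b):
    """Log of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def er_posterior(pendant, edges_in_sample, n_max, alpha=1.0, beta=1.0):
    """Posterior over N on a grid, with p ~ Beta(alpha, beta) integrated out.

    Assumption: flat prior on N over {|W| + max(pendant), ..., n_max}.
    """
    n_s = len(pendant)
    total_y = sum(pendant)
    pairs = n_s * (n_s - 1) // 2
    grid = list(range(n_s + max(pendant), n_max + 1))
    log_w = []
    for n in grid:
        outside = n - n_s  # number of unsampled vertices
        lw = sum(log_binom(outside, y) for y in pendant)
        successes = edges_in_sample + total_y
        failures = (pairs - edges_in_sample) + (n_s * outside - total_y)
        lw += log_beta(successes + alpha, failures + beta)
        log_w.append(lw)
    top = max(log_w)  # subtract the maximum for numerical stability
    weights = [math.exp(x - top) for x in log_w]
    z = sum(weights)
    return grid, [w / z for w in weights]

if __name__ == "__main__":
    # Synthetic data: 60 sampled vertices, 354 edges among them,
    # and 28 pendant edges per sampled vertex.
    grid, post = er_posterior([28] * 60, 354, n_max=1000)
    print(sum(n * q for n, q in zip(grid, post)))
```

The lower end of the grid enforces that no sampled vertex has more pendant edges than there are unsampled vertices. A different prior on the population size would simply add its log-density to each weight.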
4.2 General Stochastic Blockmodel
In the Erdős-Rényi case, . However, in the general stochastic blockmodel case, in addition to the unknown variables to be estimated, we do not know (, ) either. The expression involves costly summation over all possibilities of integer composition of (). However, the joint posterior distribution for and , which is proportional to , does not involve summing over integer partitions; thus we may sample from the joint posterior distribution for and , and obtain the marginal distribution for . Our proposed algorithm Pulse realizes this idea. Let . We know that the joint posterior distribution for and , denoted by , is proportional to . In addition, the conditional distributions and are also proportional to , where , and . The proposed algorithm Pulse is a Gibbs sampling process that samples from the joint posterior distribution (i.e., ), which is specified in Algorithm 2.
For every and , because the number of type- pendant edges of vertex must not exceed the total number of type- vertices outside the sample. Therefore, we have must hold for every . These observations put constraints on the choice of proposal distributions and , and ; i.e., the support of must be contained in and the support of must be contained in
Let be the window size for , taking values in . Let Let the proposal distribution be defined as below:
The proposed value is always greater than or equal to . This proposal distribution is uniform within the window , and thus the proposal ratio is . The proposal for and its proposal ratio are presented in the appendix.
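A uniform window proposal with a lower bound can be sketched as follows. This is a generic Metropolis-Hastings building block written under stated assumptions, not the exact proposal of Algorithm 2: the function name `propose_uniform_window`, the symmetric window, and the way the window is truncated at the bound are all illustrative. The function returns the proposed integer together with the log of the proposal ratio (reverse density over forward density), which is zero away from the bound and non-zero where truncation makes the two windows differ in size.

```python
import math
import random

def propose_uniform_window(current, lower, window, rng):
    """Propose an integer uniformly from a window around `current`.

    The window is [max(lower, current - window), current + window], so the
    proposal never falls below `lower`. Returns (proposed, log_ratio) where
    log_ratio = log q(current | proposed) - log q(proposed | current).
    """
    lo = max(lower, current - window)
    hi = current + window
    proposed = rng.randint(lo, hi)  # inclusive on both ends

    # Size of the window the reverse move would be drawn from.
    lo_back = max(lower, proposed - window)
    hi_back = proposed + window

    forward = 1.0 / (hi - lo + 1)
    backward = 1.0 / (hi_back - lo_back + 1)
    return proposed, math.log(backward) - math.log(forward)

if __name__ == "__main__":
    rng = random.Random(0)
    print(propose_uniform_window(100, lower=10, window=5, rng=rng))
    print(propose_uniform_window(10, lower=10, window=5, rng=rng))
```

In a sampler, the returned log ratio would be added to the log posterior difference before comparing against a uniform draw in the acceptance step.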
5 Experiments
5.1 Erdős-Rényi Model
Effect of Parameter . We first evaluate the performance of Pulse in the Erdős-Rényi case. We fix the size of the network at and the sample size and vary the parameter . For each , we sample graphs from . For each selected graph, we compute NSUM and run Pulse times (as it is a randomized algorithm) to compute its performance. We record the relative errors by the Tukey boxplots shown in Fig. 1(a). The posterior mean proposed by Pulse is an accurate estimate of the size. For the parameter varying from to , most of the relative errors are bounded between and . We also observe that the NSUM tends to overestimate the size as it shows a positive bias. This confirms experimentally the result of Theorem 1. For both methods, the interquartile ranges (IQRs, hereinafter) correlate negatively with . This shows that the variance of both estimators shrinks when the graph becomes denser. The relative errors of Pulse tend to concentrate around with larger which means that the performance of Pulse improves with larger . In contrast, a larger does not improve the bias of the NSUM.
Effect of Network Size . We fix the parameter and the sample size and vary the network size from to . For each , we randomly pick graphs from . For each selected graph, we compute NSUM and run Pulse times. We illustrate the results via Tukey boxplots in Fig. 1(b). Again, the estimates given by Pulse are very accurate. Most of the relative errors reside in and almost all reside in . We also observe that smaller network sizes can be estimated more accurately as Pulse will have a smaller variance. For example, when the network size is , almost all of the relative errors are bounded in the range while for , the relative errors are in . This agrees with our intuition that the performance of estimation improves with a larger sampling fraction. In contrast, NSUM heavily overestimates the network size as the size increases. In addition, its variance also correlates positively with network size.
Effect of Sample Size . We study the effect of the sample size on the estimation error. Thus, we fix the size and the parameter , and we vary the sample size from to . For each , we randomly select graphs from . For every selected graph, we compute the NSUM estimate, run Pulse times, and record the relative errors. The results are presented in Fig. 1(c). We observe that for both methods the IQR shrinks as the sample size increases; thus a larger sample size reduces the variance of both estimators. Pulse does not exhibit appreciable bias when the sample size varies from to . Again, NSUM overestimates the size; however, its bias decreases as the sample size becomes large. This reconfirms Theorem 1.
5.2 General Stochastic Blockmodel
Effect of Sample Size and Type Partition. Here, we study the effect of the sample size and the type partition. We set the network size to and we assume that there are two types of vertices in this network: type and type with and nodes, respectively. The ratio quantifies the type partition. We vary from to and the sample size from to . For each combination of and the sample size , we generate graphs with and . For each graph, we compute the NSUM and obtain the average relative error. Similarly, for each graph, we run Pulse times in order to compute the average relative error for the graphs and estimates for each graph. The results are shown as heat maps in Fig. 1(d). Note that the color bar on the right side of Fig. 1(d) is on logarithmic scale. In general, the estimates given by Pulse are very accurate and exhibit significant superiority over the NSUM estimates. The largest relative errors of Pulse in absolute value, which are approximately , appear in the upper-left and lower-left corner on the heat map. The performance of the NSUM (see the lower subfigure in Fig. 1(d)) is robust to the type partition and equivalently the ratio . As we enlarge the sample size, its relative error decreases.
The left subfigure in Fig. 1(d) shows the performance of Pulse. When the sample size is small, the relative error decreases as increases from to ; when rises from to , the relative error becomes large. Given the fixed ratio , as expected, the relative error declines when we have a larger sample. This agrees with our observation in the Erdős-Rényi case. However, when the sample size is large, Pulse exhibits better performance when the type partition is more homogeneous. There is a local minimum relative error in absolute value shown at the center of the subfigure. Pulse performs best when there is a balance between the number of edges in the sampled induced subgraph and the number of pendant edges emanating outward. Larger sampled subgraphs allow more precision in knowledge about , but more pendant edges allow for better estimation of , and hence each . Thus when the sample is approximately half of the total vertex set size, the balanced combination of the number of edges within the sample and those emanating outward leads to better performance.
Effect of Intra- and Inter-Community Edge Probability. Suppose that there are two types of nodes in the network. The mean degree is given by
We want to keep the mean degree constant and vary the random graph gradually so that we observe 3 phases: high intra-community and low inter-community edge probability (more cohesive), Erdős-Rényi , and low intra-community and high inter-community edge probability (more incohesive). We introduce a cohesion parameter . In the two-block model, we have , where is a constant. Let’s call the deviation from this situation and let
The mean degree stays constant for different . In addition, , and must reside in . This requirement can be met if we set the absolute value of small enough. By changing from positive to negative we go from cohesive behavior to incohesive behavior. Clearly, for , the graph becomes an Erdős-Rényi graph with .
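The requirement that the mean degree stay constant while a single parameter moves the graph from cohesive to incohesive can be checked numerically. The sketch below uses one hypothetical parametrization chosen to satisfy that requirement; it is not necessarily the one defined in the paper, and the names `mean_degree`, `two_block_probs`, `p_base` and `eps` are assumptions. The inter-community probability is lowered by `eps` and the two intra-community probabilities are raised by amounts that exactly compensate in the expected degree sum.

```python
import math

def mean_degree(n1, n2, p11, p22, p12):
    """Expected mean degree of a two-block SBM with block sizes n1, n2."""
    expected_degree_sum = (n1 * (n1 - 1) * p11
                           + n2 * (n2 - 1) * p22
                           + 2 * n1 * n2 * p12)
    return expected_degree_sum / (n1 + n2)

def two_block_probs(n1, n2, p_base, eps):
    """Hypothetical cohesion parametrization with constant mean degree.

    eps > 0 is more cohesive, eps < 0 is more incohesive, and eps = 0
    gives an Erdős-Rényi graph with edge probability p_base.
    """
    p12 = p_base - eps
    p11 = p_base + eps * n2 / (n1 - 1)
    p22 = p_base + eps * n1 / (n2 - 1)
    for prob in (p11, p22, p12):
        if not 0.0 <= prob <= 1.0:
            raise ValueError("eps is too large in absolute value")
    return p11, p22, p12

if __name__ == "__main__":
    for eps in (-0.05, 0.0, 0.05):
        p11, p22, p12 = two_block_probs(60, 40, 0.3, eps)
        print(eps, round(mean_degree(60, 40, p11, p22, p12), 6))
```

The range check mirrors the remark that all three probabilities must remain valid, which holds whenever the absolute value of the cohesion parameter is small enough.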
We set the network size to , to , and to . We fix and let vary from to . When , the intra-community edge probabilities are and and the inter-community edge probability is . When , the intra-community edge probabilities are and and the inter-community edge probability is . For each , we generate graphs and for each graph, we run Pulse times. Given each value of , relative errors are shown in box plots. We present the results in Fig. 1(e) as we vary . From Fig. 1(e), we observe that despite deviation from the Erdős-Rényi graph, both methods are robust. However, the figure indicates that Pulse is unbiased (as median is around zero) while NSUM overestimates the size on average. This again confirms Theorem 1.
An important feature of Pulse is that it can also estimate the number of nodes of each type while NSUM cannot. The results for type- and type- with different are shown in Fig. 1(f). We observe that the medians of all boxes agree with the line; thus the separate estimates for or are unbiased. Note that when the edge probabilities are more homogeneous (i.e., when the graph becomes more similar to the Erdős-Rényi model) the IQRs, as well as the interval between the two ends of the whiskers, become larger. This shows that when we try to fit an Erdős-Rényi model (a single-type stochastic blockmodel) into a two-type model, the variance becomes larger.
5.2.1 Effect of Number of Types and Sample Size
Finally, we study the impact of the number of types and the sample size on the relative error. To generate graphs with different numbers of types, we use a Chinese restaurant process (CRP) [1]. We set the total number of vertices to , first pick 100 vertices and use the Chinese restaurant process to assign them to different types. Suppose that CRP gives types; we then distribute the remaining 100 vertices evenly among the types. The edge probability () is sampled from and () is sampled from , all independently. We set the sampling fraction to , and , and use NSUM and Pulse to estimate the network size. Relative estimation errors are illustrated in Fig. 1(g). We observe that with the same sampling fraction and the same number of types , Pulse has a smaller relative error than that of the NSUM. Similarly, the interquartile range of Pulse is also smaller than that of the NSUM. Hence, Pulse provides a higher accuracy with a smaller variance. For both methods the relative error decreases (in absolute value) as the sampling fraction increases. Accordingly, the IQRs also shrink for larger sampling fractions. With the sampling fraction fixed, the IQRs become larger when we increase the number of types in the graph. The variance of both methods increases for increasing values of . The median of NSUM is always above on average, which indicates that it overestimates the network size.
In this paper, we have developed a method for using a random sub-sample to estimate the size of a graph whose distribution is given by an SBM. We analyzed the bias of the widely-used network scale-up estimator theoretically and showed that for sufficiently large sample sizes, it overestimates the vertex set size in expectation (although it is asymptotically unbiased). Regularity results establish the conditions under which the posterior distribution of the population size is well-defined. Extensive experimental results show that Pulse outperforms the network scale-up estimator in terms of the relative error and estimation variance.
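The type-assignment step described above can be sketched with a standard Chinese restaurant process. This is an illustrative sketch only: the function names and the concentration parameter value are assumptions, since the text does not state which concentration was used. The first helper seats vertices one at a time, joining an existing type with probability proportional to its current size or opening a new type with probability proportional to the concentration; the second spreads the remaining vertices as evenly as possible over the resulting types.

```python
import random

def chinese_restaurant_process(n, concentration, rng):
    """Assign n items to types via a Chinese restaurant process.

    Item i (0-indexed) joins existing type t with probability
    counts[t] / (i + concentration) and opens a new type with
    probability concentration / (i + concentration).
    """
    counts, labels = [], []
    for i in range(n):
        r = rng.random() * (i + concentration)
        acc = 0.0
        chosen = len(counts)  # default: open a new type
        for t, c in enumerate(counts):
            acc += c
            if r < acc:
                chosen = t
                break
        if chosen == len(counts):
            counts.append(0)
        counts[chosen] += 1
        labels.append(chosen)
    return labels, counts

def spread_evenly(total, k):
    """Split `total` items over k types so that sizes differ by at most one."""
    base, extra = divmod(total, k)
    return [base + (1 if i < extra else 0) for i in range(k)]

if __name__ == "__main__":
    rng = random.Random(1)
    # The concentration value 2.0 is an assumption, not taken from the text.
    labels, counts = chinese_restaurant_process(100, 2.0, rng)
    extra = spread_evenly(100, len(counts))
    block_sizes = [c + e for c, e in zip(counts, extra)]
    print(len(counts), block_sizes)
```

Larger concentration values tend to produce more types, which is one way to vary the number of types across generated graphs.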
References
[1] D. J. Aldous. Exchangeability and related topics. Springer, 1985.
[2] H. Bernard, E. Johnsen, P. Killworth, and S. Robinson. How many people died in the Mexico City earthquake. Estimating the Number of People in an Average Network and in an Unknown Event Population. The Small World, ed. M. Kochen (forthcoming). Newark, 1988.
[3] H. R. Bernard, P. D. Killworth, E. C. Johnsen, G. A. Shelley, and C. McCarty. Estimating the ripple effect of a disaster. Connections, 24(2):18–22, 2001.
[4] M. S. Bernstein, E. Bakshy, M. Burke, and B. Karrer. Quantifying the invisible audience in social networks. In Proc. SIGCHI, pages 21–30. ACM, 2013.
[5] F. W. Crawford. Hidden network reconstruction from information diffusion. In Fusion. IEEE, 2015.
[6] S. Ezoe, T. Morooka, T. Noda, M. L. Sabin, and S. Koike. Population size estimation of men who have sex with men through the network scale-up method in Japan. PLoS One, 7(1):e31184, 2012.
[7] M. Girvan and M. E. Newman. Community structure in social and biological networks. 2002.
[8] W. Guo, S. Bao, W. Lin, G. Wu, W. Zhang, W. Hladik, A. Abdul-Quader, M. Bulterys, S. Fuller, and L. Wang. Estimating the size of HIV key affected populations in Chongqing, China, using the network scale-up method. PLoS One, 8(8):e71796, 2013.
[9] C. Kadushin, P. D. Killworth, H. R. Bernard, and A. A. Beveridge. Scale-up methods as applied to estimates of heroin use. Journal of Drug Issues, 2006.
[10] L. Katzir, E. Liberty, and O. Somekh. Estimating sizes of social networks via biased sampling. In Proc. WWW, pages 597–606. ACM, 2011.
[11] P. D. Killworth, C. McCarty, H. R. Bernard, G. A. Shelley, and E. C. Johnsen. Estimation of seroprevalence, rape, and homelessness in the United States using a social network approach. Eval. Rev., 22(2):289–308, 1998.
[12] A. S. Maiya and T. Y. Berger-Wolf. Benefits of bias: Towards better characterization of network sampling. In Proc. SIGKDD, pages 105–113. ACM, 2011.
[13] L. Massoulié, E. Le Merrer, A.-M. Kermarrec, and A. Ganesh. Peer counting and sampling in overlay networks: random walk methods. In Proc. PODC, pages 123–132. ACM, 2006.
[14] B. H. Murray and A. Moore. Sizing the internet. White paper, Cyveillance, page 3, 2000.
[15] M. E. Newman. Modularity and community structure in networks. PNAS, 2006.
[16] M. E. Newman and M. Girvan. Finding and evaluating community structure in networks. Physical Review E, 69(2):026113, 2004.
[17] M. Papagelis, G. Das, and N. Koudas. Sampling online social networks. Trans. KDE, 25(3):662–676, 2013.
[18] A. Rényi and P. Erdős. On random graphs. Publicationes Mathematicae, 6(290-297):5, 1959.
[19] B. Ribeiro and D. Towsley. Estimating and sampling graphs with multidimensional random walks. In Proc. IMC, pages 390–403. ACM, 2010.
[20] M. J. Salganik, D. Fazito, N. Bertoni, A. H. Abdo, M. B. Mello, and F. I. Bastos. Assessing network scale-up estimates for groups most at risk of HIV/AIDS: evidence from a multiple-method study of heavy drug users in Curitiba, Brazil. American Journal of Epidemiology, 174(10):1190–1196, 2011.
[21] G. A. Shelley, H. R. Bernard, P. Killworth, E. Johnsen, and C. McCarty. Who knows your HIV status? What HIV+ patients and their network members know about each other. Social Networks, 17(3):189–217, 1995.
[22] G. A. Shelley, P. D. Killworth, H. R. Bernard, C. McCarty, E. C. Johnsen, and R. E. Rice. Who knows your HIV status II?: Information propagation within social networks of seropositive people. Human Organization, 65(4):430–444, 2006.
[23] M. Shokoohi, M. R. Baneshi, and A.-a. Haghdoost. Size estimation of groups at high risk of HIV/AIDS using network scale up in Kerman, Iran. Int’l J. Prev. Medi., 3(7):471, 2012.
[24] J. Wendel. Note on the gamma function. Am. Math. Mon., pages 563–564, 1948.
[25] S. Xing and B.-P. Paris. Measuring the size of the internet via importance sampling. J. Sel. Areas Commun, 21(6):922–933, 2003.
Table of Notation
Notation in this paper is summarized in Table 1.
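| Symbol | Meaning | Symbol | Meaning |
|---|---|---|---|
| $G = (V, E)$ | Underlying graph structure | $E_{ij}$ | # of type-$(i, j)$ edges in the sample |
| $N$ | True population size | $E_S$ | # of edges in the sample |
| $K$ | # of types | $y_i(v)$ | # of type-$(t(v), i)$ pendant edges of $v$ |
| $t(v)$ | Type of vertex $v$ | $y(v)$ | # of pendant edges of vertex $v$ |
| $d(v)$ | Degree of vertex $v$ in $G$ | $\tilde{N}_i$ | # of type-$i$ vertices outside sample |
| $N_i$ | Total # of type-$i$ vertices | | |
| $W$ | Vertex set of the sample | | |
| $G(W)$ | Subgraph of $G$ induced by $W$ | | |
| $V_i$ | # of type-$i$ vertices in the sample | | |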
Proof of Theorem 1
Define is the probability that two different nodes and have an edge between them when they are sampled uniformly at random from the vertex set. Thus the probability is given by
Let’s compute the expectation of the network scale-up estimator. We have
By Jensen’s inequality, we know that
Plugging this in, we have
Note that when we have
When the sample size is sufficiently large, the inequality will hold because the terms and decrease exponentially. In this case,
Therefore for a sufficiently large sample size, the network scale-up estimator is biased and always overestimates the vertex set size. Furthermore, in addition to showing that it always overestimates the vertex set, we can derive an asymptotic lower bound for the bias via a more careful analysis.
Let us recall the definitions of asymptotic equality and inequality for completeness.
Let and be two sequences of real numbers. We say that and are asymptotically equal if ; in this case, we denote it by
Let and be two sequences of real numbers. We say that is asymptotically greater than or equal to if there exists a sequence such that for all and ; in this case, we denote it by
Recall that we just showed that
Then we have
Since decreases to exponentially in , we have . Thus we know that
Therefore, we deduce that
Now we would like to show its asymptotic unbiasedness. We have
To show the asymptotic unbiasedness, we have to derive an upper bound for the conditional expectation . Let be a constant to be determined later. We divide it into two cases where is concentrated around its mean and the anti-concentration happens:
Given , is always less than or equal to . Thus the second term in (2) can be bounded as below: