“What Do Your Friends Think?”: Efficient Polling Methods for Networks Using Friendship Paradox


Buddhika Nettasinghe and Vikram Krishnamurthy
Authors are with the School of Electrical and Computer Engineering, Cornell University and Cornell Tech.
E-mail: {dwn26, vikramk}@cornell.edu.
Abstract

This paper deals with randomized polling of a social network. In the case of forecasting the outcome of an election between two candidates A and B, classical intent polling asks randomly sampled individuals: who will you vote for? Expectation polling asks: who do you think will win? In this paper, we propose a novel neighborhood expectation polling (NEP) strategy that asks randomly sampled individuals: what is your estimate of the fraction of votes for A? In NEP, sampled individuals naturally look at their neighbors (defined by the underlying social network graph) when answering this question. Hence, the mean squared error (MSE) of an NEP method depends on how the set of samples is selected from the network. To this end, we propose NEP algorithms for the following two cases: (i) the social network graph is not known, but random walks (sequential exploration) can be performed on the graph; (ii) the social network graph is not known, but uniform samples of nodes can be obtained. For cases (i) and (ii), two algorithms based on a graph theoretic consequence called the friendship paradox are proposed. Theoretical results on the dependence of the MSE of the algorithms on the properties of the network are established. Numerical results on real and synthetic data sets are provided to illustrate the performance of the algorithms.

Keywords: opinion polling, forecasting, expectation polling, friendship paradox, voting, stochastic ordering, degree distribution, graph sampling, social networks, social sampling

1 Introduction

This paper deals with randomized polling of a social network with a possibly unknown structure. In the case of forecasting the outcome of an election between two candidates A and B, classical intent polling asks randomly sampled individuals: who will you vote for? Expectation polling asks: who do you think will win? In this paper, we propose a novel neighborhood expectation polling strategy that asks randomly sampled individuals: what is your estimate of the fraction of votes for A? Next, we formally define the problem, explain the solution approach and the related work that motivates it.

Consider a social network represented by an undirected graph $G = (V, E)$ where each node $v \in V$ has a label $f(v) \in \{0, 1\}$. A pollster can query a total of $b$ individuals from this social network; $b$ is called the sampling budget.

Problem Definition.

Estimate

$$\bar{f} = \frac{|\{v \in V : f(v) = 1\}|}{|V|}, \qquad (1)$$

which is the fraction of nodes with label 1, with a sampling budget $b$ for the following cases:

  • Case 1 - the graph $G$ is not known, but it can be explored sequentially using a random walk

  • Case 2 - the graph $G$ is not known, but uniform samples from $V$ can be obtained

We propose a class of polling methods that we call neighborhood expectation polling (NEP) to address the above problem. [Footnote 1: Applications of this problem are abundant, e.g. forecasting the outcome of an upcoming election [1, 2], estimating the fraction of individuals infected with a disease [3], and estimating the number of individuals interested in buying a certain product (market research). More specific real world examples for cases 1 and 2 are discussed in Sec. 3.1 and Sec. 3.2 respectively.] In NEP, a set of individuals from the social network is selected and asked,

“What is your estimate of the fraction of people with label 1?”.

When trying to estimate an unknown quantity about the world, any individual naturally looks at his/her neighbors. Therefore, each sampled individual $v$ would provide the fraction of their neighbors $N(v)$ with label 1. In other words, the response of the individual $v$ to the NEP query would be

$$q(v) = \frac{|\{u \in N(v) : f(u) = 1\}|}{|N(v)|}, \qquad (2)$$

where $f(u) \in \{0, 1\}$ is the label of node $u$. Then, the average of all the responses is used as the NEP estimate of the fraction of nodes with label 1.
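The NEP response and the resulting estimate can be illustrated with a minimal sketch. The adjacency-dict representation, the function names and the toy star graph below are assumptions made for illustration only; they are not taken from the paper.

```python
from fractions import Fraction

def nep_response(adj, labels, v):
    """Fraction of v's neighbors with label 1 (the NEP response of v)."""
    neighbors = adj[v]
    return Fraction(sum(labels[u] for u in neighbors), len(neighbors))

def nep_estimate(adj, labels, sample):
    """Average of the NEP responses of the sampled nodes."""
    return sum(nep_response(adj, labels, v) for v in sample) / len(sample)

def intent_estimate(labels, sample):
    """Average of the sampled nodes' own labels (intent polling)."""
    return Fraction(sum(labels[v] for v in sample), len(sample))

# Hypothetical toy star graph: hub 0 has label 1, leaves 1..4 have label 0.
adj = {0: [1, 2, 3, 4], 1: [0], 2: [0], 3: [0], 4: [0]}
labels = {0: 1, 1: 0, 2: 0, 3: 0, 4: 0}

true_fraction = Fraction(sum(labels.values()), len(labels))
sample = [1, 2]  # two sampled leaves
print(true_fraction, nep_estimate(adj, labels, sample), intent_estimate(labels, sample))
```

In this toy graph every leaf sees only the label-1 hub, so leaves report 1 although the true fraction is 1/5. This is the kind of degree-label correlation that makes uniform sampling for NEP biased.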

1.1 Context

Why call it NEP? NEP takes its name from the fact that the response $q(v)$ of each sampled individual $v$ is the expected label value among his/her neighbors, i.e. $q(v) = \mathbb{E}\{f(U)\}$ where $U$ is a uniformly chosen random neighbor of the sampled individual $v$.

(a) Network in which the labels are highly correlated with the degrees of nodes
(b) Network in which nodes with the same label are clustered (depicting homophily)
(c) Network that is a large regular graph with labels assigned uniformly at random
Fig. 1: Consider the case of uniformly sampling nodes and obtaining the responses of the sampled nodes about the fraction of red (i.e. label 1) nodes in the network. In the graph of Fig. 1(a), most nodes have a red node as their only neighbor even though most of the nodes in the network are blue. Hence, uniformly sampling nodes for NEP in this case would result in a highly biased estimate. In the graph of Fig. 1(b), approximately half of the nodes have only a red neighbor and the rest of the nodes have only a blue neighbor. Hence, uniformly sampling nodes for NEP in this case would result in an estimate with a large variance. In the graph of Fig. 1(c), the average of the NEP responses of the nodes is approximately equal to the fraction of nodes with red labels. Further, the NEP response does not vary much among nodes. Hence, uniformly sampling nodes for NEP in this case would result in an accurate estimate. Similar examples can also be found in [4]. The figure highlights the importance of exploiting the network structure and the node labels when sampling nodes for NEP.

Why (not) use NEP? NEP is substantially different from classical intent polling where each sampled individual is asked “What is your label?”. In intent polling, the response of each sampled individual $v$ is his/her label $f(v)$. In contrast, in NEP, the response $q(v)$ of each sampled individual $v$ is a function of his/her neighborhood (defined by the underlying graph $G$) as well as the labels of his/her neighbors. Therefore, depending on the graph $G$, the labelling function $f$ and the method of obtaining the samples, NEP might produce either

  1. an estimate with a larger MSE compared to intent polling (e.g. the networks in Fig. 1(a) and Fig. 1(b) show when uniform sampling of individuals for NEP might not work), or

  2. an estimate with a smaller MSE compared to intent polling (e.g. the network in Fig. 1(c) shows when uniform sampling of individuals for NEP might work).

These two possible outcomes highlight the importance of using the available information about the graph $G$ and the function $f$ when selecting the set of individuals in NEP. This leads us to the main results of this paper.

Remark 1.

If the graph $G$ is fully known, a greedy (deterministic) optimization method (similar to the one in [5]) can be used to approximately solve the NP-hard problem of finding the set of individuals whose collective neighborhood is largest, with an approximation guarantee. However, the largest collective neighborhood does not ensure that the set of individuals would provide an accurate NEP estimate of the fraction defined in (1). For example, if the sampling budget is $b = 1$, the node with the largest neighborhood in the graph of Fig. 1(b) is the red node with degree seven, whose NEP response (the fraction of its red neighbors) differs from the true fraction of red nodes in the network. Hence, our focus is on randomized sampling methods for NEP that do not require the graph to be known.

1.2 Main Results and Organization

The main results of this paper are NEP algorithms for the two cases described in the problem definition, together with their analysis. The algorithms utilize properties related to the structure of the network to select the $b$ samples allowed by the sampling budget. The analysis provides simple and intuitive conditions under which the proposed algorithms provide a better estimate compared to intent polling. These results can be summarized as follows.

  • For cases 1 and 2, estimators are obtained by combining NEP with recent statistical results related to a phenomenon called the friendship paradox [6]. Analytical results characterizing the dependence of the MSE of the estimates on the properties of the graph and the labels of individuals are obtained. These results help identify conditions on the graph and the labels for which NEP produces a better estimate compared to intent polling.

  • Numerical results on synthetic data are provided, illustrating the better performance of the algorithms compared to classical methods.


Organization: Sec. 2 presents a review of the friendship paradox. Sec. 3 presents the NEP algorithms based on the friendship paradox for case 1 and case 2, followed by their theoretical analysis in Sec. 4. Sec. 5 evaluates the proposed algorithms on synthetic data sets to illustrate and compare their performances. Finally, Sec. 6 provides a discussion about the two algorithms, their theoretical and experimental evaluations and how they relate to each other.

1.3 Related work

As described above in Sec. 1.1, in classical intent polling, a set $S$ of $b$ nodes is obtained by uniform sampling with replacement and then the average of their labels

$$I_b = \frac{1}{b} \sum_{v \in S} f(v), \qquad (3)$$

where $f(v) \in \{0, 1\}$ is the label of node $v$, is used as the estimate (called the intent polling estimate henceforth) of the fraction of nodes with label 1. [Footnote 2: This method is called intent polling because, in the case of predicting the outcome of an election, it is equivalent to asking the voting intention of sampled individuals, i.e. asking “Who are you going to vote for in the upcoming election?” [7].] The main limitation of intent polling is that the sample size needed to achieve an $\epsilon$-additive error is $O(1/\epsilon^2)$ [4]. Our work is motivated by two recently proposed methods, namely “expectation polling” [7] and “social sampling” [4], that attempt to overcome this limitation of intent polling.

Firstly, in expectation polling [7], each sampled individual is asked to provide an estimate of the label held by the majority of the individuals in the network (e.g. asking “Who do you think will win the election?”). Then, each sampled individual will look at his/her neighbors and provide the value held by the majority of them. This method is more efficient (in terms of sample size) compared to the intent polling method since each sample now provides the putative response of a neighborhood. [Footnote 3: Intent polling and expectation polling have been considered intensively in the literature, mostly in the context of forecasting elections, and it is generally accepted that expectation polling is more efficient compared to intent polling [8, 9, 10, 11, 12].] [Footnote 4: [13, 14] discuss how expectation polling can give rise to misinformation propagation in social learning and propose Bayesian filtering methods to eliminate the misinformation propagation.] Secondly, in social sampling [4], the response of each sampled individual is a function of the labels, degrees and the sampling probabilities of his/her neighbors. [4] provides several unbiased estimators for the fraction of nodes with label 1 using this method and establishes bounds for their variances. The main limitation of the social sampling method (compared to NEP) is that it requires the sampled individuals to know a significant amount of information about their neighbors (apart from just their labels), the graph and the sampling process. Therefore, a practical implementation of social sampling might not be feasible in settings with limited information about a very large graph. Hence, NEP can be thought of as a method which asks a question that seeks a finer resolution compared to expectation polling and yet is simpler and more intuitive compared to social sampling.

The key idea utilized in the estimators proposed for cases 1 and 2 is the friendship paradox (reviewed in detail in Sec. 2), which is a form of network bias observed in undirected graphs. The friendship paradox has recently gained attention in several applications related to networks under the broad theme “how can network biases be used effectively for estimation problems?”. For example, [15] shows how the friendship paradox can be utilized for accurate estimation of a heavy tailed degree distribution, [16, 17] show how the friendship paradox can be used for quickly detecting a disease outbreak, and [18, 19, 20, 21] study how the friendship paradox can be used for influence maximization. Our results for cases 1 and 2 also fall under this theme. Apart from these, [22, 23, 24, 25, 26, 27] explore further effects and generalizations of the friendship paradox.

2 What is Friendship Paradox?

“Friendship paradox” is a graph theoretic consequence first discovered in the paper [6] by Scott L. Feld in 1991. The friendship paradox states, “on average, the number of friends of a random friend is always greater than the number of friends of a random individual”. Formally:

Theorem 1.

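(Friendship Paradox [6]) Let $G = (V, E)$ be an undirected graph, $X$ be a node chosen uniformly from $V$ and $Y$ be a uniformly chosen end node of a uniformly chosen edge $e \in E$. Then,

$$\mathbb{E}\{d(Y)\} \geq \mathbb{E}\{d(X)\}, \qquad (4)$$

where $d(v)$ denotes the degree of node $v$.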

The intuitive reasoning behind Theorem 1 is as follows. Individuals with large number of friends appear as the friends of a large number of individuals. Hence, they can contribute to an increase in the average number of friends of friends. On the other hand, individuals with smaller number of friends appear as friends of a smaller number of individuals. Hence, they cannot cause a significant change in the average number of friends of friends.

Remark 2.

A random friend (or a random neighbor), denoted by the random variable $Y$, refers to a uniformly at random chosen end node of a uniformly at random chosen edge (a pair of friends, or a friendship). This is different (with respect to the induced distribution on nodes) from choosing a node uniformly at random from the set of nodes and then choosing one of his/her friends uniformly at random from among the neighbors.

The friendship paradox, in its original version given in Theorem 1, is a comparison between the degrees of a random individual $X$ and a random friend $Y$. However, a more intuitive comparison would be between the degree of a random individual $X$ and the degree of a random friend $Z$ of a random individual (as explained in Remark 2). Recently, [28] obtained the following result comparing $d(X)$ and $d(Z)$.

Theorem 2.

[28] Let $G = (V, E)$ be an undirected graph, $X$ be a node chosen uniformly from $V$ and $Z$ be a uniformly chosen neighbor of a uniformly chosen node from $V$. Then,

$$d(Z) \geq_{fosd} d(X), \qquad (5)$$

where $d(v)$ denotes the degree of node $v$ and $\geq_{fosd}$ denotes first order stochastic dominance. [Footnote 5: A random variable $A$ (with cumulative distribution function $F_A$) first order stochastically dominates a discrete random variable $B$ (with cumulative distribution function $F_B$), denoted $A \geq_{fosd} B$, if $F_A(n) \leq F_B(n)$ for all $n$.]

An immediate consequence of Theorem 2 is

$$\mathbb{E}\{d(Z)\} \geq \mathbb{E}\{d(X)\}, \qquad (6)$$

which says that a random neighbor of a random individual has more friends than a random individual, on average (from the fact that first order stochastic dominance implies a larger mean).
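The two versions of the friendship paradox can be checked exactly on a small graph. The sketch below is illustrative only: the adjacency-dict representation, the function name `degree_means` and the toy graph are assumptions, and the graph is assumed to have no isolated nodes.

```python
from fractions import Fraction

def degree_means(adj):
    """Exact mean degree of a random node, a random friend, and a random
    friend of a random node, for an undirected graph given as an adjacency dict."""
    n = len(adj)
    deg = {v: len(nb) for v, nb in adj.items()}
    total_deg = sum(deg.values())  # equals twice the number of edges
    # Random node: uniform over the nodes.
    mean_node = Fraction(total_deg, n)
    # Random friend: uniform end of a uniform edge, so node v has probability d(v) / (2|E|).
    mean_friend = Fraction(sum(d * d for d in deg.values()), total_deg)
    # Random friend of a random node: uniform node, then a uniform neighbor of it.
    mean_friend_of_node = sum(
        Fraction(sum(deg[u] for u in adj[v]), deg[v]) for v in adj
    ) / n
    return mean_node, mean_friend, mean_friend_of_node

# Hypothetical toy graph: a path 0-1-2-3 with an extra leaf 4 attached to node 1.
adj = {0: [1], 1: [0, 2, 4], 2: [1, 3], 3: [2], 4: [1]}
mean_x, mean_y, mean_z = degree_means(adj)
print(mean_x, mean_y, mean_z)
```

On this toy graph both friend-based means exceed the mean degree of a uniformly chosen node, in line with (4) and (6).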

Next, we present the NEP algorithms that are based on Theorem 1 and Theorem 2.

3 NEP Algorithms Based on Friendship Paradox

In this section, we consider randomized methods for selecting individuals for NEP based on the concept of friendship paradox explained in Sec. 2.

3.1 Case 1 - Sampling friends using random walks

In this section, we consider the case where the graph is not known initially, but sequential exploration of the graph is possible using multiple random walks (case 1 of problem definition) over the nodes of the graph.


A motivating example for case 1 is a massive online social network where the fraction of user profiles with a certain characteristic needs to be estimated (e.g. profiles with more than ten posts about a product). Web-crawling (using random walks) approaches are widely used to obtain samples from such massive online social networks without requiring the global knowledge of the full network graph [29, 30, 31, 32, 33].

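Input: number of samples $b$.
Output: Estimate $T_1$ of the fraction of nodes with label 1.
  1. Initialize $b$ random walks on the social network, starting from (possibly different) initial nodes.

  2. Run each random walk for $N$ steps and then collect the sample $S = \{v_1, \dots, v_b\}$ where $v_i$ is the node collected from the $i$th random walk.

  3. Query each $v_i \in S$ to obtain the NEP response $q(v_i)$ (the fraction of the neighbors of $v_i$ with label 1) and compute the estimate

    $$T_1 = \frac{1}{b} \sum_{i=1}^{b} q(v_i)$$

    of the fraction of nodes with label 1.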

Algorithm 1 NEP with Random Walk Based Sampling

We propose Algorithm 1 for estimating the fraction of nodes with label 1 in case 1. The intuition behind Algorithm 1 stems from the fact that the stationary distribution of a random walk on an undirected graph visits each node with probability proportional to its degree, i.e. it is uniform over the set of neighbors (random friends) [34]. Therefore, Algorithm 1 obtains a set of $b$ neighbors independently (for a sufficiently large random walk length $N$) from the graph in step 2. Then, the response of each sampled individual to the NEP query is used to compute the estimate in step 3. According to the friendship paradox (Theorem 1), using uniformly sampled neighbors is equivalent to using more nodes due to the fact that random neighbors have more neighbors than random nodes on average. Hence, it is intuitive that this method should have a smaller MSE compared to NEP with uniformly sampled nodes and to the intent polling method. In Sec. 4, we verify this claim theoretically and explore the conditions on the labelling function and the properties of the graph for the estimator to be more accurate compared to the intent polling method.
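The steps of Algorithm 1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the graph is held in memory as an adjacency dict (in the crawling setting each step would instead be a query to the network), all walks start from one common node, and the names `random_walk_sample` and `nep_random_walk` are hypothetical.

```python
import random

def nep_response(adj, labels, v):
    """NEP response of v: fraction of v's neighbors with label 1."""
    return sum(labels[u] for u in adj[v]) / len(adj[v])

def random_walk_sample(adj, start, num_steps, rng):
    """Return the node reached after num_steps steps of a simple random walk."""
    node = start
    for _ in range(num_steps):
        node = rng.choice(adj[node])  # move to a uniformly chosen neighbor
    return node

def nep_random_walk(adj, labels, budget, num_steps, start, rng):
    """Average the NEP responses of the end points of `budget` independent walks."""
    samples = [random_walk_sample(adj, start, num_steps, rng) for _ in range(budget)]
    return sum(nep_response(adj, labels, v) for v in samples) / budget

# Hypothetical toy graph: a triangle 0-1-2 with a pendant node 3 attached to node 2.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
labels = {0: 1, 1: 0, 2: 1, 3: 0}
rng = random.Random(0)  # fixed seed so the run is reproducible
estimate = nep_random_walk(adj, labels, budget=10, num_steps=25, start=0, rng=rng)
```

The toy graph is deliberately not bipartite: on a bipartite graph a simple random walk is periodic, so the end point of a fixed-length walk does not approach the degree-proportional distribution.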

3.2 Case 2 - Sampling a Random Friend of a Random Individual

Here we assume that the graph is not known and it is not possible to crawl the graph (using random walks). It is further assumed that a set of uniform samples from the set of nodes can be obtained and each sampled individual has the ability to answer the question “What is your (random) friend’s estimate of the fraction of individuals with label 1?”.


A motivating example for case 2 is the situation where random individuals are requested to answer survey questions for an incentive. In most such cases, the pollster does not have any information about the structural connectivity of the queried individuals and, will only be able to obtain their answer for a question.

For this case, we propose Algorithm 2 to obtain an estimate of the fraction of individuals with label .

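Input: $b$ uniform samples $S = \{v_1, \dots, v_b\}$ from the set of nodes.
Output: Estimate $T_2$ of the fraction of the individuals with label 1.
  1. Ask each $v_i \in S$ to provide $q(u_i)$ (the fraction of the neighbors of $u_i$ with label 1) for some randomly chosen neighbor $u_i$ of $v_i$.

  2. Compute the estimate

    $$T_2 = \frac{1}{b} \sum_{i=1}^{b} q(u_i)$$

    of the fraction of the individuals with label 1.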

Algorithm 2 NEP using Friends of Uniformly Sampled Nodes

In Algorithm 2, each uniformly sampled individual is asked the question “What is your (random) friend’s estimate of the fraction of individuals with label 1?”. Then, each sampled node would provide the NEP response of one of its neighbors, chosen at random. The theoretical reasoning behind this method comes from Theorem 2 in Sec. 2, which states that a random friend of a randomly chosen individual has more friends than a randomly chosen individual on average. [Footnote 6: It should be noted that this does not follow from the original version of the friendship paradox (Theorem 1) since the random friend is not a uniformly chosen neighbor from the set of all neighbors. Instead, the response now comes from a random neighbor conditioned to be a friend of the sampled node.] Therefore, this method should result in a smaller MSE compared to NEP with uniformly sampled nodes and to the intent polling method.
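A minimal sketch of Algorithm 2 follows. It is an illustration only and rests on assumptions: the graph is available as an adjacency dict purely to simulate the answers of the polled individuals, samples are drawn with replacement, and the function name `nep_friend_of_random_node` is hypothetical.

```python
import random

def nep_response(adj, labels, v):
    """NEP response of v: fraction of v's neighbors with label 1."""
    return sum(labels[u] for u in adj[v]) / len(adj[v])

def nep_friend_of_random_node(adj, labels, budget, rng):
    """Average the NEP responses of random friends of uniformly sampled nodes."""
    nodes = list(adj)
    total = 0.0
    for _ in range(budget):
        x = rng.choice(nodes)    # uniformly sampled individual
        z = rng.choice(adj[x])   # one of x's friends, chosen uniformly
        total += nep_response(adj, labels, z)
    return total / budget

# Hypothetical toy star graph: hub 0 has label 1, leaves have label 0.
star = {0: [1, 2, 3, 4], 1: [0], 2: [0], 3: [0], 4: [0]}
star_labels = {0: 1, 1: 0, 2: 0, 3: 0, 4: 0}
estimate = nep_friend_of_random_node(star, star_labels, budget=20, rng=random.Random(0))
```

In the star graph a sampled leaf always forwards the question to the hub, so most responses come from the highest-degree node. This is the effect that Theorem 2 describes.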

4 Analysis of the Estimates Obtained via Algorithm 1 and Algorithm 2

Algorithm 1 and Algorithm 2 presented in Sec. 3 query random friends and random friends of random nodes (denoted by $Y$ and $Z$ in Theorem 1 and Theorem 2) respectively, exploiting the friendship paradox.

In this context, the aim of this section is threefold:

  1. Theorem 3 motivates using friendship paradox based NEP algorithms (as opposed to NEP with uniformly sampled nodes)

  2. Theorem 4 relates the bias and variance of the estimate obtained using Algorithm 1 to the properties of the network. Then, Corollary 5 gives sufficient conditions for this estimate to be unbiased with a smaller mean squared error (MSE) compared to the intent polling method, where the MSE of an estimate $T$ of a parameter $\theta$ is defined as

    $$\mathrm{MSE}\{T\} = \mathbb{E}\{(T - \theta)^2\} \qquad (7)$$
    $$= \mathrm{Bias}\{T\}^2 + \mathrm{Var}\{T\}, \quad \text{with } \mathrm{Bias}\{T\} = \mathbb{E}\{T\} - \theta. \qquad (8)$$
  3. Theorem 6 motivates the use of friendship paradox based sampling methods when the sampling budget is small

Theorem 3.

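If the label $f(v)$ of each node $v$ is independently and identically distributed, then

$$\mathrm{MSE}\{T_1\} \leq \mathrm{MSE}\{T_u\}, \qquad (9)$$
$$\mathrm{MSE}\{T_2\} \leq \mathrm{MSE}\{T_u\}, \qquad (10)$$

where $\mathrm{MSE}$ denotes the mean squared error defined in (8), $T_u$ is the NEP estimate with uniformly sampled nodes and $T_1$, $T_2$ are the estimates obtained using Algorithm 1 and Algorithm 2 respectively.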

Proof.

By definition,

Therefore, when the labels are iid,

Therefore, the estimates are unbiased when the labels are iid. Next, consider the variances of the estimates. Since all $b$ samples are drawn independently,

By applying the law of total variance, we get,

(since the labels are iid)

Following similar steps, we obtain,

where $\sigma^2$ denotes the variance of the distribution of the labels. Then, the result follows by noting that

(11)
(12)

where $\geq_{fosd}$ denotes the first order stochastic dominance defined in Footnote 5 in Sec. 2. Eq. (11) and (12) are immediate consequences of Theorem 1 and Theorem 2 (note that the degree $d(\cdot)$ is strictly positive for connected graphs).
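The variance comparison that the proof outlines can be sketched for a single sample as follows. This is a reconstruction under stated assumptions, not the authors' exact derivation: the graph is fixed, the labels $f(v)$ are iid with mean $\mu$ and variance $\sigma^2$ and are independent of the sampling, $q(v)$ is the fraction of label-1 neighbors of $v$, and $W$ stands for any of the sampled nodes $X$ (uniform node), $Y$ (random friend) or $Z$ (random friend of a random node).

```latex
\begin{align*}
\mathbb{E}\{q(v)\} &= \mu, \qquad \operatorname{Var}\{q(v)\} = \frac{\sigma^2}{d(v)}
  && \text{(mean of $d(v)$ iid labels)} \\
\operatorname{Var}\{q(W)\}
  &= \mathbb{E}\{\operatorname{Var}\{q(W) \mid W\}\} + \operatorname{Var}\{\mathbb{E}\{q(W) \mid W\}\}
   = \sigma^2\, \mathbb{E}\Big\{\frac{1}{d(W)}\Big\}
  && \text{(law of total variance)} \\
\mathbb{E}\Big\{\frac{1}{d(Y)}\Big\}
  &= \sum_{v \in V} \frac{d(v)}{2|E|} \cdot \frac{1}{d(v)} = \frac{|V|}{2|E|}
   = \frac{1}{\mathbb{E}\{d(X)\}} \le \mathbb{E}\Big\{\frac{1}{d(X)}\Big\}
  && \text{(Jensen's inequality)} \\
\mathbb{E}\Big\{\frac{1}{d(Z)}\Big\} &\le \mathbb{E}\Big\{\frac{1}{d(X)}\Big\}
  && \text{(since $d(Z) \ge_{fosd} d(X)$ and $1/d$ is decreasing)}
\end{align*}
```

Under these assumptions a single response from a random friend, or from a random friend of a random node, has variance no larger than a single response from a uniformly sampled node, because higher-degree respondents average over more labels.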

Theorem 3 shows that friendship paradox based sampling always has a smaller mean squared error when the node labels are independently and identically distributed (iid). This motivates the use of friendship paradox based NEP methods (Algorithm 1 and Algorithm 2) instead of uniform sampling based NEP. In the subsequent results, we show that the superiority of friendship paradox based NEP algorithms over the widely used intent polling method holds for conditions less stringent than the iid assumption.

Next, we formally quantify the bias and the variance of the estimator obtained via Algorithm 1 as the random walk length goes to infinity and then, compare it with the widely used intent polling method.

Theorem 4.

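Let $X$ be a random node and $(Y, Y')$ be a random link (with end nodes $Y$ and $Y'$) sampled from a connected graph. Then, as the random walk length $N$ tends to infinity, the bias and the variance of the estimate $T_1$ obtained via Algorithm 1 with sampling budget $b$ are given by

$$\mathrm{Bias}\{T_1\} = \mathbb{E}\{f(Y)\} - \mathbb{E}\{f(X)\} \qquad (13)$$
$$= \frac{\mathrm{Cov}\{f(X), d(X)\}}{\mathbb{E}\{d(X)\}}, \qquad (14)$$
$$\mathrm{Var}\{T_1\} = \frac{1}{b}\, \mathrm{Cov}\{f(Y), q(Y')\}, \qquad (15)$$

where $f(\cdot)$, $d(\cdot)$ and $q(\cdot)$ denote the label, the degree and the NEP response of a node.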
Proof.

If $G$ is a connected finite graph, then the stationary distribution of a random walk on $G$ samples each node $v$ with a probability proportional to the degree $d(v)$ of $v$. Equivalently, the stationary distribution of a random walk on a finite connected graph samples friends (denoted by $Y$) uniformly.

(16)
(17)
(18)

This proves the expression (13) for the bias of the estimate obtained via Algorithm 1.

To obtain the expression for the variance of the estimate, consider the variance of the opinion of a random friend. First note that the two end nodes of a random link are identically distributed (but not independent) since a friend of a random friend is also a random friend. Then, by applying the law of total variance,

(19)

Therefore,

which implies (since the two end nodes of a random link have the same marginal distribution),

Therefore,

(20)
(21)

which proves the expression (15).

Theorem 4 provides insights into the properties of the networks for which the NEP based Algorithm 1 provides a better estimate compared to the intent polling method. Eq. (13) of Theorem 4 shows that the bias of the estimate is the difference between the expected label value at a random friend, $\mathbb{E}\{f(Y)\}$, and the expected label value at a random individual, $\mathbb{E}\{f(X)\}$. Further, (14) shows that it is proportional to the covariance between the degree $d(X)$ and the label $f(X)$ of a randomly chosen node $X$. An immediate consequence of this result is the following corollary, which gives a sufficient condition for the estimate to be unbiased and also have a smaller variance (and therefore, a smaller MSE) compared to intent polling.
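The link between the bias and the degree-label covariance can be made explicit with a short worked derivation. The notation is assumed for this sketch: $X$ is a uniformly chosen node, $Y$ is a random friend (node $v$ is drawn with probability $d(v)/2|E|$), $N(v)$ is the set of neighbors of $v$, and $q(v)$ is the fraction of neighbors of $v$ with label 1.

```latex
\begin{align*}
\mathbb{E}\{q(Y)\}
  &= \sum_{v \in V} \frac{d(v)}{2|E|} \cdot \frac{1}{d(v)} \sum_{u \in N(v)} f(u)
   = \sum_{u \in V} \frac{d(u)}{2|E|}\, f(u) = \mathbb{E}\{f(Y)\}, \\
\mathbb{E}\{f(Y)\}
  &= \frac{\tfrac{1}{|V|} \sum_{v \in V} f(v)\, d(v)}{\tfrac{1}{|V|} \sum_{v \in V} d(v)}
   = \frac{\mathbb{E}\{f(X)\, d(X)\}}{\mathbb{E}\{d(X)\}}, \\
\mathbb{E}\{f(Y)\} - \mathbb{E}\{f(X)\}
  &= \frac{\mathbb{E}\{f(X)\, d(X)\} - \mathbb{E}\{f(X)\}\, \mathbb{E}\{d(X)\}}{\mathbb{E}\{d(X)\}}
   = \frac{\operatorname{Cov}\{f(X), d(X)\}}{\mathbb{E}\{d(X)\}} .
\end{align*}
```

The first line uses the fact that each node $u$ appears in the neighborhood of exactly $d(u)$ nodes. The last line shows that the asymptotic bias vanishes exactly when the label and the degree of a random node are uncorrelated.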

Corollary 5.

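If the label $f(X)$ and the degree $d(X)$ of a random node $X$ are uncorrelated and the graph is connected, the following statements hold as the random walk length $N$ tends to infinity:

  1. The estimate $T_1$ obtained via Algorithm 1 is unbiased for the fraction $\bar{f}$ of nodes with label 1, i.e.

    $$\mathbb{E}\{T_1\} = \bar{f}. \qquad (22)$$
  2. The estimate $T_1$ obtained via Algorithm 1 is more efficient compared to the intent polling estimate $I_b$ in (3), i.e.

    $$\mathrm{MSE}\{T_1\} \leq \mathrm{MSE}\{I_b\}, \qquad (23)$$

    where $\mathrm{MSE}$ denotes the mean squared error defined in (8).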

Proof.

If the label $f(X)$ and the degree $d(X)$ are uncorrelated, then their covariance is zero and the first statement follows.

Next, from (19) and (20) note that,

(24)

Also, from (3), note that,

(25)

Consider, .

Therefore, from (16), implies, . Therefore, from (24) and (25), it follows that,

(26)

Since both the estimate obtained via Algorithm 1 and the intent polling estimate are unbiased (under the hypothesis), the result follows from (26).

Theorem 4 also shows that the variance of the estimate is the covariance of the opinion of a random friend and the response of her random friend .

The following result gives sufficient conditions for the estimate obtained via Algorithm 1 to be a more efficient (in an MSE sense) estimator compared to the intent polling method (even in the presence of bias) when the sampling budget is $b = 1$.

Theorem 6.

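Assume that the graph is connected and the sampling budget is $b = 1$. Then, as the random walk length $N$ tends to infinity, the estimate $T_1$ obtained via Algorithm 1 has a smaller MSE compared to the intent polling estimate defined in (3), if

$$\mathbb{E}\{d(X) \mid f(X) = 1\} \geq \mathbb{E}\{d(X) \mid f(X) = 0\} \quad \text{and} \quad \bar{f} \geq \tfrac{1}{2} \qquad (27)$$

or

$$\mathbb{E}\{d(X) \mid f(X) = 1\} \leq \mathbb{E}\{d(X) \mid f(X) = 0\} \quad \text{and} \quad \bar{f} \leq \tfrac{1}{2}, \qquad (28)$$

where $X$ is a uniformly chosen node with degree $d(X)$ and label $f(X)$, and $\bar{f}$ is the fraction of nodes with label 1.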
Proof.

Therefore, if and only if,

Since, , a sufficient condition for is,

Then, the result follows by noting that the sign of is the same as the sign of .

Theorem 6 shows that if the expected degree of an individual with opinion 1 is larger (smaller) compared to the expected degree of an individual with opinion 0 and the expected opinion in the network is above (below) half, then the MSE of the estimate obtained via Algorithm 1 is smaller than that of the intent polling estimate in (3) when the pollster can query only one individual. This helps the pollster to incorporate prior knowledge about the network and the labels in order to decide whether it is suitable to use the NEP based Algorithm 1 (over the intent polling method).


Summary: We showed that the MSE of the friendship paradox based NEP methods is smaller compared to NEP with randomly sampled individuals in Theorem 3 which is the motivation for using friendship paradox based sampling. Then, the expressions for bias and the variance of the estimate obtained via Algorithm 1 were derived in Theorem 4 and, these expressions show that estimate is unbiased with a variance smaller than intent polling estimate when the degree and the label of nodes are uncorrelated (Corollary 5). Finally, a result showing the conditions for the Algorithm 1 to outperform (in terms of MSE) intent polling method was given in Theorem 6 which motivates the use of friendship paradox based NEP method for small sampling budgets.

5 Experiments and Numerical Examples

In this section, we illustrate Algorithms 1 and 2 on synthetic networks. The aim is to evaluate the dependence of the accuracy (MSE) of the estimate of the fraction of nodes with label 1 on the following properties of the network:

  1. Degree distribution $P(k)$, which is the probability that a randomly chosen node has $k$ neighbors.

  2. Neighbor degree correlation (assortativity) coefficient

    $$r_{kk} = \frac{1}{\sigma_q^2} \sum_{k, k'} k k' \big( e(k, k') - q(k)\, q(k') \big), \qquad (29)$$

    where $q(k)$ is the probability that a randomly chosen neighbor has $k$ neighbors (neighbor degree distribution), $\sigma_q$ is the standard deviation of the degree of a random neighbor, and $e(k, k')$ is the probability that the nodes at the two ends of a randomly chosen edge have degrees $k$ and $k'$ (joint degree distribution of neighbors).

  3. Degree-label correlation coefficient

    $$\rho_{kf} = \frac{1}{\sigma_k \sigma_f} \sum_{k} \sum_{f \in \{0, 1\}} k f \big( P(f, k) - P(f)\, P(k) \big), \qquad (30)$$

    where $\sigma_k$, $\sigma_f$ are the standard deviations of the degree distribution $P(k)$ and the label distribution $P(f)$ respectively, and $P(f, k)$ is the joint distribution of the labels and degrees of nodes.

A detailed discussion about these metrics and their effects can be found in [22].
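Both correlation coefficients can be computed directly from a graph as Pearson correlations. The sketch below is a minimal illustration under assumptions: the graph is an adjacency dict, the helper names are hypothetical, and the coefficients are undefined (division by zero) when the degrees or the labels have zero variance, e.g. on regular graphs.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    sx = sqrt(sum((x - mx) ** 2 for x in xs) / n)
    sy = sqrt(sum((y - my) ** 2 for y in ys) / n)
    return cov / (sx * sy)

def assortativity(adj):
    """Correlation between the degrees at the two ends of a random edge."""
    deg = {v: len(nb) for v, nb in adj.items()}
    # Each undirected edge is counted once in each direction.
    ends = [(deg[u], deg[v]) for u in adj for v in adj[u]]
    return pearson([a for a, _ in ends], [b for _, b in ends])

def degree_label_correlation(adj, labels):
    """Correlation between the degree and the label of a random node."""
    nodes = list(adj)
    return pearson([len(adj[v]) for v in nodes], [labels[v] for v in nodes])

# Hypothetical toy star graph: hub 0 has label 1, leaves have label 0.
star = {0: [1, 2, 3, 4], 1: [0], 2: [0], 3: [0], 4: [0]}
star_labels = {0: 1, 1: 0, 2: 0, 3: 0, 4: 0}
r = assortativity(star)
rho = degree_label_correlation(star, star_labels)
```

The star graph is an extreme case: every edge joins the high-degree hub to a degree-one leaf, so it is maximally disassortative, and the only label-1 node is the hub, so the degree-label correlation is maximal.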

5.1 Experimental Setup and Results

Generative Models of the Graphs: We use the following two generative models of graphs that result in two different types of degree distributions: power-law degree distribution and exponential degree distribution. In all experiments in this section, we consider graphs containing nodes.

  • Configuration Model [35]: Generate $d(v)$ half-edges for each node $v$, where $P(d(v) = k) = c\,k^{-\alpha}$ (where $c$ is a normalizing constant), and then connect each half-edge to another randomly selected half-edge, avoiding self loops. This model results in a power-law degree distribution, i.e. $P(k) \propto k^{-\alpha}$. [Footnote 7: The power-law degree distribution is generally accepted as a key feature of many real world networks such as the World Wide Web, the Internet and social networks [33, 36, 37, 38] with a power-law exponent [39]. Further, it has been shown that the friendship paradox and some of its effects are amplified in the presence of such power-law degree distributions [22, 15].] We focus on two values of the power-law exponent $\alpha$.

  • Erdős-Rényi (G(n,p)) model [37]: Any two (distinct) nodes are connected by an edge with probability $p$. This model results in a Binomial degree distribution which can be approximated by a Poisson distribution for large $n$. We choose $p$ to ensure that the graph has no isolated nodes with high probability.


Newman’s edge-rewiring procedure for modifying neighbor degree correlation: We utilize the edge-rewiring procedure proposed in [40] in order to change the assortativity coefficient defined in (29) of the graphs generated using the above models to a desired value while preserving the degree distribution. In the edge-rewiring procedure, two random links are chosen at each iteration and they are replaced with new edges if it increases (respectively, decreases) the value of the assortativity coefficient . The process is repeated until the desired value of the assortativity coefficient is achieved (or until it no longer changes).
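The rewiring step can be sketched as follows. This is an illustration under assumptions, not the procedure of [40] verbatim: it runs for a fixed number of iterations instead of stopping at a target value, it rejects swaps that would create self loops or duplicate edges, and the names `degrees` and `rewire_assortativity` are hypothetical. It relies on the fact that, for a fixed degree sequence, the assortativity coefficient increases with the sum over edges of the product of the end-node degrees, so a swap is accepted based on the change in that sum.

```python
import random

def degrees(edges):
    """Degree of every node appearing in an edge list."""
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    return deg

def rewire_assortativity(edges, increase, num_iters, rng):
    """Degree-preserving rewiring that pushes assortativity up or down."""
    edges = [tuple(e) for e in edges]          # work on a copy
    deg = degrees(edges)
    present = {frozenset(e) for e in edges}
    for _ in range(num_iters):
        i, j = rng.sample(range(len(edges)), 2)
        (a, b), (c, d) = edges[i], edges[j]
        # Proposed replacement edges (a, d) and (c, b).
        if a == d or c == b:
            continue                            # would create a self loop
        if frozenset((a, d)) in present or frozenset((c, b)) in present:
            continue                            # would create a duplicate edge
        delta = deg[a] * deg[d] + deg[c] * deg[b] - deg[a] * deg[b] - deg[c] * deg[d]
        if delta != 0 and (delta > 0) == increase:
            present -= {frozenset((a, b)), frozenset((c, d))}
            present |= {frozenset((a, d)), frozenset((c, b))}
            edges[i], edges[j] = (a, d), (c, b)
    return edges

# Hypothetical toy edge list.
edges0 = [(0, 1), (0, 2), (0, 3), (1, 2), (3, 4), (4, 5)]
edges1 = rewire_assortativity(edges0, increase=True, num_iters=200, rng=random.Random(0))
```

Because each accepted swap only exchanges end points between two edges, every node keeps its degree, which is why the degree distribution is preserved.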


Label swapping procedure for modifying degree-label correlation: Given a graph, we first assign labels to each node with a fixed probability. Then, in order to modify the degree-label correlation coefficient defined in (30) to a desired value, we utilize the label swapping procedure followed in [22]: a node $v_0$ with label 0 and a node $v_1$ with label 1 are selected randomly and their labels are swapped if $d(v_0) > d(v_1)$ (respectively, $d(v_0) < d(v_1)$) to increase (respectively, decrease) the degree-label correlation coefficient. This is repeated until the coefficient reaches the desired value (or until it no longer changes). We consider several values of the degree-label correlation coefficient in our experiments to study the effect of negative and positive degree-label correlations.
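The label swapping procedure can be sketched as follows. This is a minimal illustration under assumptions: degrees are passed in as a dict, both labels are present, the degrees are not all equal, a maximum iteration count replaces the "until it no longer changes" stopping rule, and the function names are hypothetical.

```python
import random
from math import sqrt

def degree_label_corr(deg, labels):
    """Pearson correlation between node degree and node label."""
    nodes = list(deg)
    n = len(nodes)
    md = sum(deg[v] for v in nodes) / n
    ml = sum(labels[v] for v in nodes) / n
    cov = sum((deg[v] - md) * (labels[v] - ml) for v in nodes) / n
    sd = sqrt(sum((deg[v] - md) ** 2 for v in nodes) / n)
    sl = sqrt(sum((labels[v] - ml) ** 2 for v in nodes) / n)
    return cov / (sd * sl)

def swap_labels(deg, labels, target, rng, max_iters=10_000):
    """Swap labels of random (label-0, label-1) pairs to move the correlation toward target."""
    labels = dict(labels)  # leave the caller's labels untouched
    increase = degree_label_corr(deg, labels) < target
    for _ in range(max_iters):
        rho = degree_label_corr(deg, labels)
        if (increase and rho >= target) or (not increase and rho <= target):
            break
        v0 = rng.choice([v for v in labels if labels[v] == 0])
        v1 = rng.choice([v for v in labels if labels[v] == 1])
        # Moving label 1 to the higher-degree node raises the correlation, and vice versa.
        if (increase and deg[v0] > deg[v1]) or (not increase and deg[v0] < deg[v1]):
            labels[v0], labels[v1] = 1, 0
    return labels

# Hypothetical degrees and labels: label 1 initially sits on the low-degree nodes.
deg = {0: 5, 1: 4, 2: 3, 3: 2, 4: 1, 5: 1}
labels = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
new_labels = swap_labels(deg, labels, target=0.5, rng=random.Random(0))
```

Swapping leaves the number of label-1 nodes unchanged, so the true fraction being estimated is the same before and after the procedure; only its relation to the degrees changes.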


Algorithms 1 and 2 were evaluated on the networks obtained using the experimental procedure described above. The resulting mean squared errors for the configuration model (power-law degree distribution) are shown in Fig. 2 and Fig. 3 for the two power-law coefficient values, respectively. Similarly, results obtained for Erdős-Rényi graphs (Poisson degree distribution) are shown in Fig. 4. In the case of Erdős-Rényi graphs, we only consider a single value of the assortativity coefficient since it cannot be changed significantly due to the homogeneity in the degree distribution.

Fig. 2 (panels (a)-(i)): MSE of the estimates obtained using Algorithm 1, Algorithm 2 and the intent polling method versus the sampling budget, for a power-law graph with a fixed power-law exponent and different values of the assortativity coefficient and the degree-label correlation coefficient. This figure shows that, for power-law networks, the proposed friendship paradox based NEP methods have a smaller mean squared error compared to the classical intent polling method under general conditions.
Fig. 3 (panels (a)-(i)): MSE of the estimates obtained using Algorithm 1, Algorithm 2 and the intent polling method versus the sampling budget, for a power-law graph with a fixed power-law exponent (different from that of Fig. 2) and different values of the assortativity coefficient and the degree-label correlation coefficient. This figure shows that, for power-law networks, the proposed friendship paradox based NEP methods have a smaller mean squared error compared to the classical intent polling method under general conditions.
Fig. 4 (panels (a)-(c)): MSE of the estimates obtained using Algorithm 1, Algorithm 2 and the intent polling method versus the sampling budget, for an Erdős-Rényi graph with average degree 50, a fixed assortativity coefficient and different values of the degree-label correlation coefficient. This figure shows that, for ER graphs, the proposed friendship paradox based NEP method as well as the greedy deterministic sample selection method result in better performance compared to the intent polling method.

6 Discussion of the Results

In this section, we discuss the main findings of the experiments and how they relate to the theoretical results. Further, we provide insights into how these findings can be useful to identify the best possible algorithm (out of Algorithm 1 and Algorithm 2 proposed in this paper and the alternative intent polling method) depending on the situation.

6.1 Power-law Graphs

Intent Polling vs. Friendship Paradox Based Polling: The friendship paradox based Algorithm 1 and Algorithm 2 perform better than the intent polling method for all sample sizes when the degree-label correlation is zero (which agrees with Theorem 5). Further, even when the degree-label correlation is non-zero, Algorithm 2 outperforms the intent polling method for all considered sample sizes, while Algorithm 1 outperforms the intent polling method in all considered cases for small sample sizes.
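The two kinds of polls compared above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the graph is an adjacency dictionary in which every node has at least one neighbor, each respondent reports the exact fraction of their neighbors with label 1, and the function names (`intent_poll`, `nep_response`, `nep_random_friend_poll`) are hypothetical. The friend-based sampler mimics the "random friend of a random node" scheme attributed to Algorithm 2 in the text.

```python
import random


def intent_poll(labels, budget, rng):
    """Intent polling: average the labels of uniformly sampled nodes."""
    nodes = list(labels)
    sample = [rng.choice(nodes) for _ in range(budget)]
    return sum(labels[v] for v in sample) / budget


def nep_response(adj, labels, v):
    """NEP answer of node v: fraction of v's neighbors with label 1.

    Assumes v has at least one neighbor (illustrative assumption).
    """
    return sum(labels[u] for u in adj[v]) / len(adj[v])


def nep_random_friend_poll(adj, labels, budget, rng):
    """Sample a uniform node, then a uniform friend of it,
    and average the NEP answers of the sampled friends."""
    nodes = list(adj)
    total = 0.0
    for _ in range(budget):
        x = rng.choice(nodes)   # uniformly sampled node
        z = rng.choice(adj[x])  # uniformly sampled friend of that node
        total += nep_response(adj, labels, z)
    return total / budget


# Usage on a small star graph: node 0 is the hub and carries label 1.
star = {0: [1, 2, 3, 4], 1: [0], 2: [0], 3: [0], 4: [0]}
star_labels = {0: 1, 1: 0, 2: 0, 3: 0, 4: 0}
rng = random.Random(0)
print("intent poll estimate:", intent_poll(star_labels, 20, rng))
print("NEP (random friend) estimate:", nep_random_friend_poll(star, star_labels, 20, rng))
```

Repeating each poll many times and averaging the squared error against the true fraction of label-1 nodes gives an empirical MSE that can be compared across the two methods for a given sampling budget.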


Effect of the Heavy-Tails: By comparing Fig. 2 with Fig. 3, it can be seen that, when the tail of the degree distribution is heavier (smaller power-law coefficient), Algorithms 1 and 2 perform better than the intent polling method for small sampling budgets. The effect of the heavy tails is more visible for Algorithm 2, which performs better than the intent polling method in all cases for all sample sizes. This shows that, when the sampling budget is small and the network has a heavy tail, the friendship paradox based algorithms can offer a significant advantage over the classical intent polling method.


Effect of the Assortativity of the Network: Many different joint degree distributions can give rise to the same neighbor degree distribution (which is the marginal distribution of the joint degree distribution). This marginal distribution does not capture the joint variation of the degrees of a random pair of neighbors. In Algorithm 1 (which samples neighbors uniformly), the degree distribution of the samples is the neighbor degree distribution. Hence, its performance is not affected by the assortativity coefficient, which captures the joint variation (in terms of the joint degree distribution) of the degrees of a random pair of neighbors. This is apparent in Fig. 3, where each column (corresponding to different values of the assortativity coefficient) has approximately the same MSE for Algorithm 1. However, the MSE of Algorithm 2 (which samples random friends of random nodes) increases with the assortativity coefficient, because the degree of a random friend of a random node is a function of the joint degree distribution. To make this point clear, Fig. 5 illustrates the effect of the neighbor degree correlation on the degree distribution of a random friend of a random node (and the invariance of the neighbor degree distribution to it). Hence, if it is known a priori that the network is disassortative, Algorithm 2 is a more suitable choice for polling (compared to Algorithm 1).
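The difference between the three sampling distributions discussed above can be computed exactly for a small graph. The following is a minimal sketch, assuming an adjacency-dictionary graph with no isolated nodes; the function names are hypothetical and are not taken from the paper. It returns the degree distribution of a uniformly chosen node, of a random end of a uniformly chosen link (a random neighbor), and of a uniformly chosen friend of a uniformly chosen node.

```python
from collections import defaultdict
from fractions import Fraction


def degree_dist_random_node(adj):
    """Degree distribution of a uniformly chosen node."""
    n = len(adj)
    dist = defaultdict(Fraction)
    for v in adj:
        dist[len(adj[v])] += Fraction(1, n)
    return dict(dist)


def degree_dist_random_neighbor(adj):
    """Degree distribution of a random end of a uniformly chosen link.

    Each node is picked with probability proportional to its degree.
    """
    total = sum(len(nbrs) for nbrs in adj.values())
    dist = defaultdict(Fraction)
    for v in adj:
        dist[len(adj[v])] += Fraction(len(adj[v]), total)
    return dict(dist)


def degree_dist_random_friend(adj):
    """Degree distribution of a uniform friend of a uniform node.

    Assumes every node has at least one neighbor.
    """
    n = len(adj)
    dist = defaultdict(Fraction)
    for x in adj:
        for z in adj[x]:
            dist[len(adj[z])] += Fraction(1, n * len(adj[x]))
    return dict(dist)


def mean(dist):
    """Mean of a distribution given as {value: probability}."""
    return sum(k * p for k, p in dist.items())


# Usage: a star with three leaves is a strongly disassortative graph.
star = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
for name, fn in [("random node", degree_dist_random_node),
                 ("random neighbor", degree_dist_random_neighbor),
                 ("random friend of random node", degree_dist_random_friend)]:
    print(name, "mean degree:", mean(fn(star)))
```

On this star graph the mean degrees are 3/2, 2 and 5/2 respectively, so the random friend of a random node reaches the hub more often than a random neighbor does, which is in line with the observation that Algorithm 2 benefits from disassortative structure.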


When to use friendship paradox based NEP? Both theoretical results (Theorem 6) and numerical results (Fig. 3, Fig. 4) show that friendship paradox based NEP methods outperform the classical intent polling method by a large margin when the sampling budget is small compared to the size of the network (which is the case in many applications related to polling). Further, the absence of degree-label correlation guarantees the better performance of friendship paradox based NEP methods for any sample size (Corollary 5), and the presence of disassortativity improves the performance of Algorithm 2. These results and observations give the pollster the ability to decide which algorithm to deploy using the available information about the network and the sampling budget.

6.2 Erdős-Rényi Graphs

From Fig. 4, it can be seen that Algorithms 1 and 2 both yield a smaller MSE than the intent polling method for Erdős-Rényi graphs. Further, Algorithm 1 and Algorithm 2 have equal MSE in the case of Erdős-Rényi graphs. This is because the degree distribution of a random neighbor and the degree distribution of a random friend of a random node are equal when the neighbor degree correlation is zero.
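The equality of the two distributions can be checked with a short calculation. The notation below is an assumption introduced for illustration, since the paper's own symbols are not available in this section: $P(k)$ is the degree distribution, $e(k,k')$ is the joint degree distribution of the two ends of a uniformly chosen link, and $q(k)=\sum_{k'} e(k,k')$ is the neighbor degree distribution. The calculation uses the slightly stronger assumption that the joint distribution factorizes, which is the sense in which the neighbor degrees are taken to be uncorrelated here.

```latex
\begin{align*}
q(k) &= \sum_{k'} e(k,k') = \frac{k\,P(k)}{\sum_{k''} k''\,P(k'')}
  && \text{(degree of a random neighbor)} \\
\Pr\{\text{friend of a random node has degree } k\}
  &= \sum_{k'} P(k')\,\frac{e(k,k')}{q(k')}
  && \text{(condition on the node's degree } k'\text{)} \\
  &= \sum_{k'} P(k')\,\frac{q(k)\,q(k')}{q(k')}
  && \text{(assume } e(k,k') = q(k)\,q(k')\text{)} \\
  &= q(k)\sum_{k'} P(k') = q(k).
\end{align*}
```

Under this assumption both sampling schemes draw respondents from the same degree distribution, which is consistent with the two algorithms having equal MSE on Erdős-Rényi graphs.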

Fig. 5 (panels: (a) disassortative network, (b), (c) assortative network): The cumulative distribution functions (CDFs) of the degrees of a random node, a random neighbor, and a random friend of a random node, for three graphs with the same degree distribution (a power-law distribution) but different neighbor-degree correlation coefficients, generated using Newman's edge rewiring procedure. This illustrates how the degree distribution of a random friend of a random node differs between the disassortative network (Fig. 5(a)) and the assortative network (Fig. 5(c)). Further, this figure also shows how the degree distributions of a random node and a random neighbor remain invariant to changes in the joint degree distribution that preserve the degree distribution.

7 Conclusion

We considered the problem of estimating the fraction of nodes in a graph that have a particular attribute (represented by a binary label of 1 or 0) and proposed a novel class of polling methods called Neighborhood Expectation Polling (NEP). In NEP, each sampled individual responds with information about the fraction of his/her neighbors (defined by the underlying social network graph) that have label 1. Two cases were considered, corresponding to different assumptions about the pollster's knowledge of the underlying graph: 1) the pollster has no knowledge of the social graph but has the ability to perform random walks on the graph; 2) uniformly sampled nodes from the unknown social graph are available. Two algorithms (one for case 1 and one for case 2) were proposed, exploiting a type of network bias called the friendship paradox. Theoretical results on sufficient conditions on the network for the estimates to have a smaller mean squared error than the classical polling methods were derived. Further, extensive numerical results on synthetic networks were provided to illustrate the performance of the proposed methods under different network metrics. These results show that the proposed friendship paradox based NEP methods are capable of obtaining an estimate with a smaller mean squared error using a smaller number of samples.

Acknowledgments

The authors thank Jon Kleinberg of the Department of Computer Science, Cornell University, for helpful suggestions.

References

  • [1] A. Tumasjan, T. O. Sprenger, P. G. Sandner, and I. M. Welpe, “Predicting elections with twitter: What 140 characters reveal about political sentiment.” ICWSM, vol. 10, no. 1, pp. 178–185, 2010.
  • [2] N. Silver, The signal and the noise: why so many predictions fail–but some don’t.   Penguin, 2012.
  • [3] K. J. Gile, “Improved inference for respondent-driven sampling data with application to hiv prevalence estimation,” Journal of the American Statistical Association, vol. 106, no. 493, pp. 135–146, 2011.
  • [4] A. Dasgupta, R. Kumar, and D. Sivakumar, “Social sampling,” in Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining.   ACM, 2012, pp. 235–243.
  • [5] D. Kempe, J. Kleinberg, and É. Tardos, “Maximizing the spread of influence through a social network,” in Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining.   ACM, 2003, pp. 137–146.
  • [6] S. L. Feld, “Why your friends have more friends than you do,” American Journal of Sociology, vol. 96, no. 6, pp. 1464–1477, 1991.
  • [7] D. M. Rothschild and J. Wolfers, “Forecasting elections: Voter intentions versus expectations,” 2011.
  • [8] A. Graefe, “Accuracy gains of adding vote expectation surveys to a combined forecast of us presidential election outcomes,” Research & Politics, vol. 2, no. 1, p. 2053168015570416, 2015.
  • [9] ——, “Accuracy of vote expectation surveys in forecasting elections,” Public Opinion Quarterly, vol. 78, no. S1, pp. 204–232, 2014.
  • [10] A. E. Murr, ““wisdom of crowds”? a decentralised election forecasting model that uses citizens’ local expectations,” Electoral Studies, vol. 30, no. 4, pp. 771–783, 2011.
  • [11] ——, “The wisdom of crowds: Applying condorcet’s jury theorem to forecasting us presidential elections,” International Journal of Forecasting, vol. 31, no. 3, pp. 916–929, 2015.
  • [12] C. F. Manski, “Measuring expectations,” Econometrica, vol. 72, no. 5, pp. 1329–1376, 2004.
  • [13] V. Krishnamurthy and W. Hoiles, “Online reputation and polling systems: Data incest, social learning, and revealed preferences,” IEEE Transactions on Computational Social Systems, vol. 1, no. 3, pp. 164–179, 2014.
  • [14] V. Krishnamurthy, Partially Observed Markov Decision Processes.   Cambridge University Press, 2016.
  • [15] Y.-H. Eom and H.-H. Jo, “Tail-scope: Using friends to estimate heavy tails of degree distributions in large-scale complex networks,” Scientific reports, vol. 5, 2015.
  • [16] N. A. Christakis and J. H. Fowler, “Social network sensors for early detection of contagious outbreaks,” PloS one, vol. 5, no. 9, p. e12948, 2010.
  • [17] M. Garcia-Herranz, E. Moro, M. Cebrian, N. A. Christakis, and J. H. Fowler, “Using friends as sensors to detect global-scale contagious outbreaks,” PloS one, vol. 9, no. 4, p. e92413, 2014.
  • [18] L. Seeman and Y. Singer, “Adaptive seeding in social networks,” in Foundations of Computer Science (FOCS), 2013 IEEE 54th Annual Symposium on.   IEEE, 2013, pp. 459–468.
  • [19] S. Lattanzi and Y. Singer, “The power of random neighbors in social networks,” in Proceedings of the Eighth ACM International Conference on Web Search and Data Mining.   ACM, 2015, pp. 77–86.
  • [20] D. A. Kim, A. R. Hwong, D. Stafford, D. A. Hughes, A. J. O’Malley, J. H. Fowler, and N. A. Christakis, “Social network targeting to maximise population behaviour change: a cluster randomised controlled trial,” The Lancet, vol. 386, no. 9989, pp. 145–153, 2015.
  • [21] T. Horel and Y. Singer, “Scalable methods for adaptively seeding a social network,” in Proceedings of the 24th International Conference on World Wide Web.   International World Wide Web Conferences Steering Committee, 2015, pp. 441–451.
  • [22] K. Lerman, X. Yan, and X.-Z. Wu, “The “majority illusion” in social networks,” PloS one, vol. 11, no. 2, p. e0147617, 2016.
  • [23] M. O. Jackson, “The friendship paradox and systematic biases in perceptions and social norms,” Available at SSRN: https://ssrn.com/abstract=2780003 or http://dx.doi.org/10.2139/ssrn.2780003, 2016.
  • [24] Y.-H. Eom and H.-H. Jo, “Generalized friendship paradox in complex networks: The case of scientific collaboration,” Scientific Reports, vol. 4, Apr. 2014.
  • [25] F. Kooti, N. O. Hodas, and K. Lerman, “Network weirdness: Exploring the origins of network paradoxes.” in ICWSM, 2014.
  • [26] N. O. Hodas, F. Kooti, and K. Lerman, “Friendship paradox redux: Your friends are more interesting than you,” arXiv preprint arXiv:1304.3480, 2013.
  • [27] X.-Z. Wu, A. G. Percus, and K. Lerman, “Neighbor-neighbor correlations explain measurement bias in networks,” Scientific Reports, vol. 7, no. 1, p. 5576, 2017.
  • [28] Y. Cao and S. M. Ross, “The friendship paradox.” Mathematical Scientist, vol. 41, no. 1, 2016.
  • [29] J. Leskovec and C. Faloutsos, “Sampling from large graphs,” in Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining.   ACM, 2006, pp. 631–636.
  • [30] M. Gjoka, M. Kurant, C. T. Butts, and A. Markopoulou, “Walking in facebook: A case study of unbiased sampling of OSNs,” in Infocom, 2010 Proceedings IEEE.   IEEE, 2010, pp. 1–9.
  • [31] B. Ribeiro and D. Towsley, “Estimating and sampling graphs with multidimensional random walks,” in Proceedings of the 10th ACM SIGCOMM conference on Internet measurement.   ACM, 2010, pp. 390–403.
  • [32] M. Gjoka, M. Kurant, C. T. Butts, and A. Markopoulou, “Practical recommendations on crawling online social networks,” IEEE Journal on Selected Areas in Communications, vol. 29, no. 9, pp. 1872–1892, 2011.
  • [33] A. Mislove, M. Marcon, K. P. Gummadi, P. Druschel, and B. Bhattacharjee, “Measurement and analysis of online social networks,” in Proceedings of the 7th ACM SIGCOMM conference on Internet measurement.   ACM, 2007, pp. 29–42.
  • [34] D. Aldous and J. Fill, “Reversible Markov chains and random walks on graphs,” 2002.
  • [35] M. Molloy and B. Reed, “A critical point for random graphs with a given degree sequence,” Random structures & algorithms, vol. 6, no. 2-3, pp. 161–180, 1995.
  • [36] L. A. Adamic, R. M. Lukose, A. R. Puniyani, and B. A. Huberman, “Search in power-law networks,” Physical review E, vol. 64, no. 4, p. 046135, 2001.
  • [37] M. E. Newman, D. J. Watts, and S. H. Strogatz, “Random graph models of social networks,” Proceedings of the National Academy of Sciences, vol. 99, no. suppl 1, pp. 2566–2572, 2002.
  • [38] R. Albert and A.-L. Barabási, “Statistical mechanics of complex networks,” Reviews of modern physics, vol. 74, no. 1, p. 47, 2002.
  • [39] M. Boguná, R. Pastor-Satorras, and A. Vespignani, “Epidemic spreading in complex networks with degree correlations,” arXiv preprint cond-mat/0301149, 2003.
  • [40] M. E. Newman, “Assortative mixing in networks,” Physical review letters, vol. 89, no. 20, p. 208701, 2002.