Faster Random Walks by Rewiring Online Social Networks On-The-Fly
Abstract
Many online social networks feature restrictive web interfaces that only allow querying a user's local neighborhood. To enable analytics over such an online social network through its restrictive web interface, many recent efforts use existing Markov Chain Monte Carlo methods such as random walks to sample the social network and support analytics based on the samples. The problem with such an approach, however, is the large number of queries often required (i.e., a long "mixing time") for a random walk to reach a desired (stationary) sampling distribution.
In this paper, we consider a novel problem of enabling a faster random walk over online social networks by "rewiring" the social network on-the-fly. Specifically, we develop the Modified TOpology (MTO) Sampler which, using only information exposed by the restrictive web interface, constructs a "virtual" overlay topology of the social network while performing a random walk, and ensures that the random walk follows the modified overlay topology rather than the original one. We show that MTO-Sampler not only provably enhances the efficiency of sampling, but also achieves significant savings on query cost over real-world online social networks such as Google Plus, Epinions, etc.
I. Introduction
I-A. Aggregate Estimation over Online Social Networks
An online social network allows its users to publish content and form connections with other users. To retrieve information from a social network, one generally needs to issue an individual-user query through the social network's web interface by specifying a user of interest; the web interface returns the contents published by the user as well as a list of other users connected with the user (we currently focus on the undirected relationship between users).
An online social network not only provides a platform for users to share information with their acquaintances, but also enables a third party to perform a wide variety of analytical applications over the social network, e.g., the analysis of rumor/news propagation, the mining of sentiment/opinion on certain subjects, and social-media-based market research. While some third parties, e.g., advertisers, may be able to negotiate contracts with the network owners to get access to the full underlying database, many third parties lack the resources to do so. To enable these third-party analytical applications, one must be able to accurately estimate big-picture aggregates (e.g., the average age of users, the COUNT of user posts that contain a given word) over an online social network by issuing a small number of individual-user queries through the social network's web interface. We address this problem of third-party aggregate estimation in this paper.
I-B. Existing Sampling-Based Solutions and Their Problems
An important challenge facing third-party aggregate estimation is the lack of cooperation from online social network providers. In particular, the information returned by each individual-user query is extremely limited, containing only information about the neighborhood of one user. Furthermore, almost all large-scale online social networks enforce limits on the number of web requests one can issue (e.g., 600 open graph queries per 600 seconds for Facebook, per https://developers.facebook.com/docs/bestpractices/, and 350 requests per hour for Twitter, per https://dev.twitter.com/docs/ratelimiting). As a result, it is practically impossible to crawl/download most or all data from an online social network before generating aggregate estimations. There is also no available way for a third party to obtain the entire topology of the graph underlying the social network.
To address this challenge, a number of sampling techniques have been proposed for performing analytics over an online social network without the prerequisite of crawling [15, 11, 12, 10]. The objective of sampling is to randomly select elements (e.g., nodes/users or edges/relationships) from the online social network according to a predetermined probability distribution, and then to generate aggregate estimations based on the retrieved samples. Since only individual local neighborhoods (i.e., a user and the set of its neighbors), rather than the entire graph topology, can be retrieved from the social network's web interface, to the best of our knowledge, all existing sampling techniques without prior knowledge of all nodes/edges are built upon the idea of performing random walks over the graph, which only require knowledge of the local neighborhoods visited by the random walks.
In the literature, there are two popular random walk schemes: the simple random walk and the Metropolis-Hastings random walk. A simple random walk (SRW) [17] starts from an arbitrary user, repeatedly hops from one user to another chosen uniformly at random from the former user's neighborhood, and stops after a number of steps to retrieve the last user as a sample. When the simple random walk is sufficiently long, the probability for each user to be sampled tends to reach a stationary (probability) distribution proportional to each user's degree (i.e., the number of users connected with the user). Thus, based on the retrieved samples and knowledge of such a stationary distribution, one can generate unbiased estimations of AVG aggregates (with or without selection conditions) over all users in the social network. If the total number of users in the social network is available (which is the case for many real-world social networks whose providers publish the total number of users for advertising purposes), then COUNT and SUM aggregates can be answered without bias as well.
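As an illustrative sketch (the adjacency mapping `neighbors` and the helper name `simple_random_walk` are our own choices, not from the paper), a simple random walk over an individual-user query interface may look as follows; the degree-proportional stationary distribution shows up empirically on a toy graph:

```python
import random

def simple_random_walk(neighbors, start, steps, seed=0):
    """Hop to a uniformly chosen neighbor at each step; the final node is
    the sample.  In a live system, reading neighbors[u] corresponds to one
    individual-user query through the web interface."""
    rng = random.Random(seed)
    u = start
    for _ in range(steps):
        u = rng.choice(sorted(neighbors[u]))  # uniform over N(u)
    return u

# Toy non-bipartite graph: a triangle a-b-c plus a pendant node d on c.
# Degrees: a=2, b=2, c=3, d=1, so pi is proportional to (2, 2, 3, 1).
g = {"a": {"b", "c"}, "b": {"a", "c"},
     "c": {"a", "b", "d"}, "d": {"c"}}
counts = {}
for s in range(2000):
    x = simple_random_walk(g, "a", 7, seed=s)
    counts[x] = counts.get(x, 0) + 1
```

In line with the degree-proportional stationary distribution, the highest-degree node `c` ends up sampled most often and the pendant node `d` least often.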
The Metropolis-Hastings random walk (MHRW) is a random walk achieving an arbitrary target distribution (typically the uniform distribution), constructed via the well-known MH algorithm. As an extension of MHRW, given knowledge of all the user ids of a graph, [11] suggests conducting a random jump (RJ), which with a fixed probability at each step jumps to a random vertex in the graph while otherwise carrying on the MHRW (this may require the global topology or the whole user-id space to generate a random vertex, and is thus not viable for all online social networks). Although MHRW can yield asymptotically uniform samples, which require no additional processing for subsequent analysis, it is slower than SRW for almost all practical measures of convergence, such as degree-distribution distance, KS distance, and mean degree error. According to [10] and [14], SRW is 1.58 times faster than MHRW. Thus we set the baseline as SRW, while we also include MHRW in the experimental section.
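As a sketch (with illustrative names, not from the paper), the MH correction that turns an SRW proposal into a uniform-target walk accepts a hop from u to v with probability min(1, d(u)/d(v)), staying put on rejection:

```python
import random

def mh_random_walk(neighbors, start, steps, seed=0):
    """Metropolis-Hastings walk targeting the uniform distribution:
    propose a uniform neighbor v, accept with min(1, d(u)/d(v));
    on rejection the walk stays at u (a self-loop)."""
    rng = random.Random(seed)
    u = start
    for _ in range(steps):
        v = rng.choice(sorted(neighbors[u]))
        if rng.random() < len(neighbors[u]) / len(neighbors[v]):
            u = v
    return u

# Triangle plus a pendant node: degrees 2, 2, 3, 1, yet the MH
# correction makes the stationary distribution uniform over all four.
g = {"a": {"b", "c"}, "b": {"a", "c"},
     "c": {"a", "b", "d"}, "d": {"c"}}
counts = {n: 0 for n in g}
for s in range(4000):
    counts[mh_random_walk(g, "a", 15, seed=s)] += 1
```

After enough steps, all four nodes are sampled at roughly equal frequency despite their unequal degrees.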
A critical problem of existing sampling techniques, however, is the large number of individual-user queries (i.e., web requests) they require for retrieving each sample. Consider the above-described simple random walk as an example. In order to reach the stationary distribution (and thereby an accurate aggregate estimation), one may have to issue a large number of queries as a "burn-in" period of the random walk. Traditional studies in graph theory found that the length of such a burn-in period is determined by the graph conductance, an intrinsic property of the graph topology (formally defined in Section II). In particular, the smaller the conductance is, the longer the burn-in period will be (i.e., the more individual-user queries will be required by sampling).
Unfortunately, a recent study [18] on real-world social networks such as Facebook, Livejournal, etc. found the conductance of their graphs to be substantially lower than expected. As a result, a random walk on these social networks often requires a large number of individual-user queries; e.g., a single random walk of length approximately 500 to 1500 is required over Livejournal, a real-world social network of one million nodes, to achieve an acceptable variation distance [18]. One can see that, in order to retrieve enough samples to reach an accurate aggregate estimation, the existing sampling techniques may require a very large number of individual-user queries.
I-C. Outline of Technical Results
In this paper, we consider a novel problem: how to significantly increase the conductance of a social network graph by modifying the graph topology on-the-fly (during the third-party random walk process). In the following, we first explain what we mean by on-the-fly topology modification, and then describe the rationale behind our main ideas for topology modification.
First, by topology modification we do not actually modify the original topology of the social network graph; indeed, no third party other than the social network provider has the ability to do so. What we modify is the topology of an overlay graph on which we perform the random walks. Fig 1 depicts an example: if we can decide that ignoring a particular edge in the random walk process can make the burn-in period shorter (i.e., increase the conductance), then we are essentially performing random walks over an overlay graph from which this edge is removed. By doing so, we can achieve the same estimation accuracy with a lower query cost. One can see that, with traditional random walk techniques, the overlay graph is exactly the same as the original social network graph. Our objective here is to manipulate edges in the overlay graph so as to maximize the graph conductance.
It is important to note that the technical challenge here is not how edge manipulations can boost graph conductance: a simple method to reach the theoretical maximum on conductance is to repeatedly insert edges into the graph until it becomes a complete graph. This, however, requires knowledge of all nodes in the social network, which a third party does not have. The key challenge is how to perform edge manipulations based only on knowledge of the local neighborhoods that a random walk has passed by, and yet increase the conductance of the entire graph in a significant manner. In the following, we provide an intuitive explanation of our main ideas for topology modification.
To understand the main ideas, we first introduce the concepts of crosscutting and noncrosscutting edges intuitively with an example in Fig 1 (we formally define these concepts in Section II). Generally speaking, if we consider a social network graph consisting of multiple densely connected components (e.g., the two components in Fig 1), then the edges connecting them are likely to be crosscutting edges, while edges inside each densely connected component are likely noncrosscutting ones. A key intuition here is that the more crosscutting edges and/or the fewer noncrosscutting edges a graph has, the higher its conductance is. For example, the original graph in Fig 1 has a low conductance (i.e., a long burn-in period) because a random walk is likely to get "stuck" in one of the two dense components, which are difficult to escape given that there is only one crosscutting edge. On the other hand, with far fewer noncrosscutting edges and a few additional crosscutting edges, the modified graph has a much higher conductance, as it is now much easier for a random walk to move from one component to the other.
With the concepts of crosscutting and noncrosscutting edges, we develop the Modified TOpology Sampler (MTO-Sampler), a topology manipulation technique which first determines whether a given edge in the graph is a crosscutting edge based solely upon knowledge of the local neighborhood topology, and then removes the edge if it is noncrosscutting. (Note that, as we prove in Section III-A, it is impossible to assert deterministically that an edge is crosscutting. Nonetheless, it is possible to assert deterministically that an edge is noncrosscutting. Thus, our algorithm has two possible outputs: noncrosscutting or uncertain. We show in the paper that it outputs noncrosscutting for a large number of (noncrosscutting) edges in real-world social networks.) MTO-Sampler may also "move" an edge by changing a node connected to the edge if it is determined that, by doing so, the new edge is more likely to be a crosscutting edge. We show in the paper that MTO-Sampler is capable of significantly improving the efficiency of random walks: for the example in Fig 1, MTO-Sampler reduces the mixing time (i.e., the query cost of a random walk) by 97%. We also demonstrate through experimental results the significant efficiency improvements achieved by MTO-Sampler on real-world social networks such as Epinions, Google Plus, etc.
The main contributions of our approach include:

(Problem Novelty) We consider a novel problem of modifying the graph topology on-the-fly (during the random walk process) for the efficient third-party sampling of online social networks.

(Solution Novelty) We develop MTO-Sampler, which determines whether an edge is (non)crosscutting based solely upon the local neighborhood knowledge retrieved by the random walk, and then manipulates the graph topology to significantly improve sampling efficiency.

Our contributions also include extensive theoretical analysis (on various social network models) and experimental evaluation on synthetic and real-world social networks, as well as online over Google Plus, which demonstrate the superiority of our MTO-Sampler over traditional sampling techniques.
II. Preliminaries
II-A. Model of Online Social Networks
In this paper, we consider an online social network with an interface that allows input queries of the form

q(u): SELECT * FROM D WHERE USERID = u,

and responds with the information about user u (e.g., user name, self-description, user-published contents) as well as the list of all other users connected with u (e.g., u's friends in the network). This is a model followed by many online social networks, e.g., Google Plus, Facebook, etc., with the interface provided as either an end-user-friendly web page or a developer-specific API call.
Consider the social-network topology as an undirected graph G = (V, E), where each node in V corresponds to a user in the social network (without introducing ambiguity, we use "node" and "social network user" interchangeably in this paper), and each edge in E represents the connection between two users. One can see that the answer to query q(u) is a set of nodes N(u) ⊆ V such that, for each v ∈ N(u), there is an edge (u, v) ∈ E. We henceforth refer to N(u) as the neighborhood of u. We use d(u) to denote the degree of u, i.e., d(u) = |N(u)|.
Running Example: We shall use, throughout this paper, the 22-node, 111-edge barbell graph shown (as the original graph G) in Fig 1 as a running example.
II-B. Performance Measures for Sampling
In the following, we discuss two key objectives for sampling: (1) minimizing bias, so that the retrieved samples can be used to accurately estimate aggregate query answers, and (2) reducing the number of queries required for sampling, given the stringent limits often put in place by real-world social networks on the number of queries one can issue per day.
Bias: In general, sampling bias is the "distance" between the target (i.e., ideal) distribution of samples and the actual sampling distribution, i.e., the probability for each tuple to be retrieved as a sample. We further discuss a concrete bias measure in the next subsection and an experimental measure in Section V-A3.
Query Cost: We measure query cost as the number of unique queries one has to issue for the sampling process, as any duplicate query can be answered from a local cache without consuming the query limit enforced by the social network provider.
II-C. Random Walk
A random walk is a Markov Chain Monte Carlo (MCMC) method which takes successive random steps on the above-described graph G according to a transition matrix P, where P(u, v) represents the probability for the random walk to transit from node u to node v. The premise here is that, after performing a random walk for a sufficient number of steps, the probability distribution for the walk to land on each node in V converges to a stationary distribution π, which then becomes the sampling distribution (i.e., if we take the end node as a sample). There are many different types of random walks, corresponding to different designs of P and different stationary distributions. In this paper, we consider the simple random walk, which has a stationary distribution of π(u) = d(u)/(2|E|) for all u ∈ V.
Definition 1
(Simple Random Walk). Given a current node u, a simple random walk chooses uniformly at random a neighboring node v ∈ N(u) and transits to v in the next step, i.e.,

(1) P(u, v) = 1/d(u) if v ∈ N(u), and P(u, v) = 0 otherwise.
One can see that each step of a simple random walk requires exactly one query (i.e., to identify the neighborhood of u and select the next stop v). Thus, the performance of sampling, i.e., the tradeoff between bias and query cost, is determined by how fast the random walk converges to the stationary distribution. Formally, we measure the convergence speed as the mixing time, defined as follows.
Definition 2
(Mixing Time). Given t > 0, after t steps of a simple random walk, the relative pointwise distance between the current sampling distribution and the stationary distribution is

(2) Δ(t) = max_{u,v ∈ V} |P^t(u, v) − π(v)| / π(v),

where P^t(u, v) is the element of P^t with indices u and v. The mixing time τ(ε) of the random walk is the minimum value of t such that Δ(t) ≤ ε, where ε is a predetermined threshold on the relative pointwise distance.
One can see from the definition that the relative pointwise distance measures the bias of the random walk after t steps. Mixing time, on the other hand, captures the query cost required to reduce the bias below a predetermined threshold ε. In the following subsection, we describe a key characteristic of the graph that determines the mixing time: the conductance of the graph.
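For a tiny graph, the relative pointwise distance Δ(t) in (2) can be computed exactly by powering the SRW transition matrix. The following is a sketch with our own naming, feasible only for toy graphs:

```python
def srw_matrix(neighbors, order):
    """SRW transition matrix: P(u, v) = 1/d(u) if v is in N(u), else 0."""
    return [[1.0 / len(neighbors[u]) if v in neighbors[u] else 0.0
             for v in order] for u in order]

def mat_mul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def rel_pointwise_distance(neighbors, t):
    """Delta(t) = max_{u,v} |P^t(u, v) - pi(v)| / pi(v), with
    pi(v) = d(v) / 2|E| for the simple random walk."""
    order = sorted(neighbors)
    two_e = sum(len(neighbors[u]) for u in order)
    pi = [len(neighbors[u]) / two_e for u in order]
    P = srw_matrix(neighbors, order)
    Pt = P
    for _ in range(t - 1):
        Pt = mat_mul(Pt, P)
    n = len(order)
    return max(abs(Pt[i][j] - pi[j]) / pi[j]
               for i in range(n) for j in range(n))

# Non-bipartite example: a triangle with a pendant node.
g = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
```

Here `rel_pointwise_distance(g, 2)` is still large while `rel_pointwise_distance(g, 30)` is close to zero, illustrating convergence toward the stationary distribution.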
II-D. Conductance: An Efficiency Indicator
Intuitively, the conductance Φ, which indicates how fast the simple random walk converges to its stationary distribution, measures how "well-knit" a graph is. Specifically, the conductance is determined by a cut of the graph, i.e., a partition of V into two disjoint subsets S and V∖S, which minimizes the ratio between the probability for the random walk to move from one partition to the other and the probability for the random walk to stay in the same partition. Formally, we have the following definition.
Definition 3
(Conductance). The conductance of a graph G is

Φ = min_{S ⊂ V, 0 < vol(S) ≤ |E|} |E(S, V∖S)| / vol(S),

where E(S, V∖S) denotes the set of edges between S and V∖S, and vol(S) = Σ_{u∈S} d(u). (Rigorously, the conductance is determined by both the graph topology and the transition matrix of the random walk; here we tailor the definition to the simple random walk considered in this paper.)
The relationship between the graph conductance and the mixing time of a simple random walk is illustrated by the following inequality [3]:

(3) Δ(t) ≤ (1 − Φ²/2)^t / min_{u∈V} π(u).

One can see that the graph conductance ranges between 0 and 1, and the larger Φ is, the smaller the mixing time will be (for a fixed threshold ε). Also note from (3) the logarithmic relationship between Φ and the mixing time; a small change of Φ may lead to a significant change of the mixing time. In particular, requiring Δ(t) ≤ ε in (3) yields

(4) (1 − Φ²/2)^t ≤ ε · π_min,
(5) t · ln(1 − Φ²/2) ≤ ln(ε · π_min),
(6) τ(ε) ≤ ln(1/(ε · π_min)) / ln(1/(1 − Φ²/2)).

Here π_min = min_{u∈V} π(u). For example, since ln(1/(1 − Φ²/2)) ≈ Φ²/2 for small Φ, increasing the conductance from 0.010 to 0.012 reduces this upper bound on the mixing time to roughly (0.010/0.012)² ≈ 69% of its original value.
Running Example: The conductance of the barbell graph in the running example is Φ = 0.018. The corresponding (and unique) S and V∖S are shown in Fig 1. Correspondingly, the mixing time to reach a given relative pointwise distance ε is bounded from above by 14212.3. We shall show throughout the paper how our on-the-fly topology modification techniques significantly increase the conductance and reduce the mixing time for this running example.
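For intuition, the conductance of a small graph can be computed by brute force directly from Definition 3. This is a sketch with our own naming, exponential in |V| and therefore only for toy graphs:

```python
from itertools import combinations

def conductance(neighbors):
    """Minimum over cuts S (with vol(S) <= |E|) of |E(S, V\\S)| / vol(S)."""
    nodes = sorted(neighbors)
    vol_total = sum(len(neighbors[u]) for u in nodes)  # equals 2|E|
    best = float("inf")
    for r in range(1, len(nodes)):
        for subset in combinations(nodes, r):
            S = set(subset)
            vol = sum(len(neighbors[u]) for u in S)
            if vol > vol_total / 2:
                continue
            cut = sum(1 for u in S for v in neighbors[u] if v not in S)
            best = min(best, cut / vol)
    return best

# Miniature barbell: two triangles joined by a single bridge edge.
g = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
     3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
```

Here `conductance(g)` is 1/7: the minimizing cut separates the two triangles and crosses only the bridge edge, whose endpoints' triangles contribute a volume of 7 per side.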
II-E. Key for Conductance: Crosscutting Edges
A key observation from Definition 3 is that the graph conductance critically depends on the number of edges which "crosscut" S and V∖S, i.e., |E(S, V∖S)|. The more such crosscutting edges there are, the higher the graph conductance is likely to be. On the other hand, since a noncrosscutting edge is only counted in the denominator, the more noncrosscutting edges there are in the graph, the lower the conductance is likely to be. Formally, we define crosscutting edges as follows.
Definition 4
(Crosscutting edges). For a given graph G = (V, E), an edge (u, v) ∈ E is a crosscutting edge if and only if there exists S ⊂ V such that u ∈ S and v ∈ V∖S, where the cut (S, V∖S) takes the minimum value of the ratio in Definition 3 among all possible S.
We note that in large graphs such as online social networks, it is reasonable to assume that the number of crosscutting edges is relatively small compared to the total number of edges in S or V∖S.
One can see that our objective of onthefly topology modification is then to increase the number of crosscutting edges and decrease the number of noncrosscutting edges as much as possible. We describe our main ideas for doing so in the next section.
Running Example: For the barbell graph, adding any edge between the two halves of the graph produces a new crosscutting edge and increases the graph conductance; the mixing-time bound is then reduced to 3758.1/14212.3 = 0.264 of its original value, a significant reduction of roughly 74%.
III. Main Ideas of On-The-Fly Topology Modification
III-A. Technical Challenges: Negative Results
One can see from Section II-E that the key to increasing the conductance of a social network (and thereby reducing the query cost of sampling) through topology modification is to determine whether an edge is a crosscutting edge or not. Unfortunately, the deterministic identification of a crosscutting edge is a hard problem (in the worst case) even if the entire graph topology is given as prior knowledge, as shown in the following theorem.
Theorem 1
The problem of determining whether an edge is crosscutting or not is NP-hard.
Consider the case of equal transition probability for each edge. The problem of finding all crosscutting edges is equivalent to finding the optimum cut of the graph according to the Cheeger constant, a problem proven to be NP-hard [7].
Given the worst-case hardness result, we now consider the best-case scenario: is there any graph topology (which is not the worst-case input, of course) for which it is possible to efficiently identify crosscutting edges? It is easy to see that, if the entire graph topology is given, then there certainly exist such graphs, with the original graph in Fig 1 being an example, for which the crosscutting edge(s) can be straightforwardly identified. Nonetheless, our interest lies in making such identifications based solely upon local neighborhood knowledge, because of the aforementioned restrictions of online social-network interfaces. The following theorem, unfortunately, shows that it is impossible to deterministically confirm the crosscutting nature of an edge unless the entire graph topology has been crawled.
Theorem 2
Given the local neighborhood topology of the vertices accessed by a third-party sampler in a graph G = (V, E), for any given edge (u, v) ∈ E, there must exist a graph G′ such that: (1) (u, v) is not a crosscutting edge for G′, and (2) G and G′ are indistinguishable from the view of the sampler, i.e., every vertex accessed by the sampler has exactly the same local neighborhood in G and G′.
The construction of G′ can be stated as follows. First, insert extra vertices and extra edges into the graph such that the inserted vertices form a densely connected component, with no edge between any inserted vertex and the original graph. Then, in the second step, identify in V a vertex w which has not been accessed by the sampler, and insert into the graph an edge between w and the new component. One can see that the only crosscutting edge in the output graph G′ is this newly inserted edge, i.e., (u, v) cannot be a crosscutting edge for G′. An intuitive illustration of the proof is shown in Fig 2.
It is important to note, however, that the theorem still leaves two possible ways to increase the conductance of a social network based only on local neighborhood knowledge: (1) While the theorem indicates that it is impossible to deterministically confirm the crosscutting nature of an edge, it may still be possible to deterministically disprove an edge from being crosscutting; i.e., we may prove that an edge is definitely noncrosscutting based on just local neighborhood knowledge, and therefore remove it to increase the conductance deterministically. (2) It is still possible to conditionally or probabilistically evaluate the likelihood of an edge being crosscutting; e.g., we may determine that an edge absent from the original graph is more likely to be a crosscutting edge (if added) than an existing edge, and thereby replace the existing edge with the new one to increase the conductance in a probabilistic fashion. We consider the removal and replacement strategies, respectively, in the next two subsections.
III-B. Deterministic Identification of Noncrosscutting Edges
To illustrate the main idea of our deterministic identification of noncrosscutting edges (for removal), we start with an example in Fig 3 to show why we can determine, based solely upon the local neighborhoods of u and v as shown in the graph, that the edge between them (henceforth denoted by e(u, v)) is not a crosscutting edge. The intuition is fairly simple: when u and v share a large number of common neighbors (e.g., 5 in Fig 3) but have relatively few other edges (e.g., 1 each in Fig 3), it is highly unlikely for the partition to cut through e(u, v) rather than the other edges of u and v, if it cuts through any edges associated with u and v at all.
The rigorous (dis)proof is by contradiction. Suppose e(u, v) is a crosscutting edge between two partitions of the graph, S and V∖S. One can see that since u and v belong to different partitions, there must be at least 6 crosscutting edges in the subgraph (Fig 3 (a) depicts an example). We show in the following discussion that this is actually impossible, because one can always construct another partition S′ and V∖S′ (by "dragging" u and v into the same partition) and reduce the number of crosscutting edges to at most 5. This contradicts the definition of S and V∖S as a configuration which minimizes the number of crosscutting edges. Thus, e(u, v) cannot be a crosscutting edge.
To understand how the construction of S′ and V∖S′ works, consider Fig 3 (b) as an example. For the partition illustrated in Fig 3 (a), we can "drag" one endpoint into the other's partition to form the new configuration, such that the number of crosscutting edges associated with u and v is now at most 5, as shown in Fig 3 (b). Note that the other edges not shown in the subgraph (whether crosscutting or not) are not affected by the reconfiguration, because all vertices associated with the dragged node are already known in its local neighborhood (shown in Fig 3).
More generally, for the other possible settings of S and V∖S (such as Fig 3 (c)), one can construct the reconfiguration in analogy with the following general principle: first, find the "more popular" partition (i.e., either S or V∖S) among the 5 common neighbors of u and v. Then, drag one of u and v to ensure that both of them are in this more popular partition under the new configuration. One can see that, since at most 2 common neighbors of u and v are in the less popular partition, the number of crosscutting edges under the new configuration is at most 2 × 2 + 1 = 5, where 2 × 2 bounds the number of crosscutting edges associated with the 2 common neighbors in the less popular partition (at most 2 for each), and 1 is the number of crosscutting edges associated with the other (non-common) neighbor of the node being dragged.
The following theorem depicts the general case in which we can remove an edge on-the-fly to increase the graph conductance. Recall that N(u) and d(u) represent the set of neighbors and the degree of a node u, respectively.
Theorem 3
[Edge Removal Criteria]: Given u ∈ V and v ∈ N(u), if N(u) ∩ N(v) ≠ ∅ and

(7) d(u) + d(v) ≤ 2 · |N(u) ∩ N(v)| + 4,

then e(u, v) is not a crosscutting edge.
Let n = |N(u) ∩ N(v)|. Without loss of generality, assume u ∈ S and v ∈ V∖S; then there must be at least n crosscutting edges in the n disjoint paths of length 2 between u and v. Denote by c_u and c_v the numbers of crosscutting edges, among these paths and e(u, v), that are connected with u and v, respectively; then c_u + c_v ≥ n + 2. One can see that if we "drag" u from S to V∖S, all the edges connected with u are flipped: the old crosscutting edges incident to u become noncrosscutting, and vice versa. By the assumption of inequality (7), (d(u) + d(v))/2 ≤ n + 2, so either c_u > d(u)/2 or c_v > d(v)/2 holds. Without loss of generality, assume the inequality holds for vertex u; then changing u from S to V∖S strictly decreases the number of crosscutting edges. Since we have assumed that the number of edges in S or V∖S is much greater than the number of crosscutting edges, Φ must decrease with the number of crosscutting edges, which contradicts e(u, v) being a crosscutting edge.
Due to space limitations, please refer to the technical report [23] for the proofs of all remaining theorems in the paper. Intuitively, Theorem 3 tells us that if two nodes have enough common neighbors, then we can deterministically conclude that the edge between them is noncrosscutting. Moreover, (7) is tight; i.e., if it does not hold, then we can always construct a counterexample where e(u, v) is crosscutting, as shown in the following corollary.
Corollary 1
For all u ∈ V and v ∈ N(u) which satisfy

(8) d(u) + d(v) > 2 · |N(u) ∩ N(v)| + 4,

there always exists a graph in which e(u, v) is crosscutting.
Running Example: With our on-the-fly edge removals, any random walk essentially follows an overlay topology G′ which can be constructed by applying Theorem 3 to every edge in the original graph G. For the barbell running example, the solid lines in Fig 1 depict G′. The conductance is now 0.053. Compared with the original conductance of 0.018, the corresponding upper bound on mixing time is reduced to 1638.3/14212.3 = 0.115 of the original value, a reduction of 88.5%.
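As a sketch, the removal test of Theorem 3 only needs the two neighbor lists returned by querying u and v. The helper name and the exact algebraic form of inequality (7) used below, d(u) + d(v) ≤ 2|N(u) ∩ N(v)| + 4, are our reconstruction (chosen to match the Fig 3 example of five common neighbors and one extra edge per endpoint), not verbatim from the paper:

```python
def provably_noncrosscutting(neighbors, u, v):
    """Local test for edge removal: True means e(u, v) is provably
    noncrosscutting under the reconstructed criterion; False means
    'uncertain', never 'provably crosscutting'."""
    common = neighbors[u] & neighbors[v]
    if not common:
        return False
    return len(neighbors[u]) + len(neighbors[v]) <= 2 * len(common) + 4

# Fig-3-like situation: u and v share 5 common neighbors and have
# one extra edge each, so d(u) = d(v) = 7 and 7 + 7 <= 2*5 + 4 holds.
g = {"u": {"v", "c1", "c2", "c3", "c4", "c5", "x"},
     "v": {"u", "c1", "c2", "c3", "c4", "c5", "y"},
     "c1": {"u", "v"}, "c2": {"u", "v"}, "c3": {"u", "v"},
     "c4": {"u", "v"}, "c5": {"u", "v"},
     "x": {"u"}, "y": {"v"}}
```

By contrast, a bridge edge between two nodes with no common neighbors always comes back as uncertain, which is the desired behavior for a potential crosscutting edge.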
III-C. Conditional Identification of Crosscutting Edges
We now describe our second idea: conditionally identifying crosscutting edges. We start with an example in Fig 4 to show why we can replace an existing edge with a new one such that (1) the new edge is more likely to be crosscutting, and (2) the replacement is guaranteed not to decrease the conductance.
Specifically, consider the replacement of e(v, w) by e(u, w) given the neighborhoods of u and v, where u, w ∈ N(v). A key observation here is that e(u, v) and e(v, w) cannot both be crosscutting edges. The reason is that otherwise we could always "drag" v into the same partition as u and w to reduce the number of crosscutting edges by at least 1. Given this key observation, one can see that the replacement of e(v, w) by e(u, w) has only two possible outcomes:

if e(v, w) is a crosscutting edge, then e(u, w) must also be a crosscutting edge because, due to the observation, e(u, v) cannot be a crosscutting edge (so u and v lie in the same partition). Thus, the replacement leads to no change in the graph conductance.

if e(v, w) is not a crosscutting edge, then replacing it with e(u, w) will either keep the conductance the same, or increase the conductance if e(u, w) is crosscutting.
As such, the replacement operation never reduces the conductance, and may increase it when e(u, w) is crosscutting. More generally, we have the following theorem.
Theorem 4
Given v ∈ V with d(v) ≤ 3, and u, w ∈ N(v) with w ∉ N(u), replacing edge e(v, w) with e(u, w) will not decrease the conductance, while it also has positive probability of increasing the conductance.
Next, we prove that d(v) ≤ 3 is actually the only case in which the replacement is guaranteed not to reduce the conductance, as shown by the following corollary.
Corollary 2
For v ∈ V and u, w ∈ N(v), if d(v) > 3, then there always exists a graph such that replacing e(v, w) with e(u, w) will decrease the conductance.
Running Example: With Theorem 4, an example of the replacement operations one can perform over the barbell running example in Fig 1 is to replace e(v, w) with e(u, w) for a node v that (after edge removals) has a degree of 3. Compared with the original conductance of 0.018 and the post-removal conductance of 0.053, the conductance is now further increased. The corresponding upper bound on mixing time is reduced to 416.6/1638.3 = 0.25 of the post-removal bound, a further reduction of 75%, and to 416.6/14212.3 = 0.029 of the original bound, an overall reduction of 97.1%.
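A sketch of the replacement test (the function name is ours, and the degree condition d(v) ≤ 3 follows our reconstruction of Theorem 4, not the paper's exact statement):

```python
def replacement_candidate(neighbors, v):
    """For a low-degree node v (d(v) <= 3), propose replacing an existing
    edge e(v, w) with a new edge e(u, w), where u and w are neighbors of v
    and e(u, w) is not already present in the graph."""
    if len(neighbors[v]) > 3:
        return None                         # Theorem-4 condition fails
    for u in sorted(neighbors[v]):
        for w in sorted(neighbors[v]):
            if u != w and w not in neighbors[u]:
                return (v, w), (u, w)       # remove e(v, w), insert e(u, w)
    return None

# Path a - b - c: b has degree 2, and a, c are not adjacent.
g = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
```

Here `replacement_candidate(g, "b")` proposes removing e(b, c) and inserting e(a, c), which (if the cut separates a from c) is more likely to be crosscutting than the replaced edge.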
III-D. Extension
If we know more about a user's neighbors, especially the common neighbors of the user and the random walk's next candidate, we can deterministically identify more noncrosscutting edges. When the random walk reaches nodes we have accessed before, we can use their degree information without issuing extra web requests, since we can retrieve the data from our local cache.
Fig 5 (a) shows an example in which, with the extra degree knowledge of a common neighbor w of u and v, e(u, v) must be a noncrosscutting edge. If we assume e(u, v) is a crosscutting edge, then there must be 3 crosscutting edges between u and v. However, there exists another configuration, shown in Fig 5 (b), which has only 2 crosscutting edges; this contradicts the assumption that e(u, v) is a crosscutting edge. Notice that if we do not know the degree of w, we cannot deterministically identify e(u, v) as noncrosscutting, since Theorem 3 does not apply here.
Theorem 5
Given , , if and
(9) 
we can assert that is not a cross-cutting edge. Here we denote .
Intuitively, an edge between two nodes that share many common neighbors is more likely to be non-cross-cutting. Such edges are also easy to find in online social networks: if a friend knows almost every other friend of a person, then the corresponding edge may be considered non-cross-cutting according to Theorems 3 and 5.
IV Algorithm MTO-Sampler
IV-A Algorithm Implementation
Algorithm description. To explain how the on-the-fly modification works, we demonstrate an example in Fig 6. Fig 6(a) is an overlay graph that has been modified according to the foregoing theorems, in which edges A, C, and D are removed and edge B is replaced. Fig 6(b) shows one possible trace of how our MTO-Sampler modifies the simple random walk. For instance, when the random walk reaches a node that satisfies the condition for replacement, it may replace an edge as described in Theorem 4. The shaded area contains all the nodes that the random walk visits.
Algorithm 1 depicts the detailed procedure of the MTO sampler; the stopping rule (which indicates that the random walk should stop and output samples) can be any convergence monitor used in Markov Chain Monte Carlo.
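Algorithm 1 itself is not reproduced in this section, but its control flow can be sketched as a walk over a virtual overlay. In the sketch below, `neighbors` and `is_removed` are hypothetical placeholders of ours: the former stands in for the individual-user query of the web interface, the latter for the degree-based edge tests of Theorems 3 through 5.

```python
import random

def mto_walk(start, neighbors, is_removed, steps, rng=random):
    """Sketch of a random walk over a virtual overlay topology.
    Edges flagged by `is_removed(u, v)` are skipped, so the walk
    follows the modified topology rather than the original one."""
    v, visited = start, [start]
    for _ in range(steps):
        cand = [w for w in neighbors(v) if not is_removed(v, w)]
        if not cand:            # never strand the walk: fall back
            cand = neighbors(v)
        v = rng.choice(cand)
        visited.append(v)
    return visited
```

The fallback branch is a safety choice of this sketch, not the paper's rule: a node whose entire neighborhood were removed would otherwise trap the walk.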
Aggregate estimation and probability revision. After collecting samples, we use importance sampling to estimate the aggregate directly from the samples drawn from the random walk's stationary distribution .
The key challenge for MTO-Sampler when using importance sampling is to estimate the stationary distribution of the MTO-Sampler random walk . Since MTO-Sampler modifies the topology, may not equal the stationary distribution . Here we have
(10) 
is unknown in the overlay graph , but we can draw a simple random sample from 's neighbors in to obtain an unbiased estimate of .
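A standard ratio form of such an importance-sampling estimator can be sketched as follows; the `weight` callable is our stand-in for whatever quantity is proportional to the sampling probability of a node (the degree for a simple random walk, the estimated overlay degree for MTO-Sampler).

```python
def estimate_average(samples, f, weight):
    """Ratio-form importance-sampling estimator of the population
    mean of f: each walk sample is reweighted by the inverse of a
    quantity proportional to its sampling probability."""
    num = sum(f(v) / weight(v) for v in samples)
    den = sum(1.0 / weight(v) for v in samples)
    return num / den
```

Because only the ratio of weights matters, the unknown normalizing constant of the stationary distribution cancels, which is why degree-proportional weights suffice.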
IV-B Theoretical Model Analysis
To theoretically analyze the performance of MTO-Sampler, we introduce a well-known graph generation model: the latent space model.
Latent space model. The latent space graph model [21] connects two nodes with a probability determined by their distance in the latent space.
(11) 
where is the distance between two nodes and , controls the level of sociability of a node in this graph, and is the sharpness of the function.
The following theorem shows that if two nodes' distance is smaller than a threshold , then the edge between them is likely to be non-cross-cutting. Therefore, after finding the expected number of removable edges, we can calculate the increase in conductance.
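A generator for such graphs can be sketched as follows. The logistic link probability `1 / (1 + exp(alpha * (d_ij - r)))` is one common parameterization of the model; the paper's exact form of Eq. (11) may differ, so treat the formula (and the parameter names `r`, `alpha`) as assumptions of this sketch.

```python
import math
import random

def latent_space_graph(points, r, alpha, rng=random):
    """Sample a latent-space graph: nodes i and j are linked with
    probability 1 / (1 + exp(alpha * (d_ij - r))); larger alpha
    sharpens the cutoff at distance r."""
    edges = []
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            x = alpha * (math.dist(points[i], points[j]) - r)
            p = 1.0 / (1.0 + math.exp(min(x, 700.0)))  # clamp avoids overflow
            if rng.random() < p:
                edges.append((i, j))
    return edges
```

With a large `alpha`, pairs well inside radius `r` are almost surely linked and distant pairs almost surely are not, matching the intuition that short edges are likely non-cross-cutting.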
Theorem 6
Given a latent space graph model , assume ; then the expected number of edges we can remove is
(12) 
where V(r) is the volume of a hypersphere of radius in the -dimensional latent space. The proof can be found in [23].
Simple simulations over 20,000 points yield the empirical distribution of pairwise distances. More specifically, if we let , and , , then
(13) 
We compare the experimental results against this theoretical bound for the latent space model in Section V-B.
V Experiments
V-A Experimental Setup
V-A1 Hardware and Platform
We conducted all experiments on a computer with an Intel Core i3 2.27GHz CPU, 4GB RAM, and a 64-bit Ubuntu operating system. All algorithms were implemented in Python 2.7. Our local, synthetic, and online datasets are stored in the in-memory Redis database and in MongoDB.
V-A2 Datasets
We tested three types of datasets in the experiments: local real-world social networks, the Google Plus online social network, and synthetic social networks, which we describe in turn below.
Local Datasets: Local social networks are real-world social networks for which the entire topology is downloaded and stored locally on our server. For these datasets, we simulated the individual-user-query-only web interface strictly according to the definition in Section I, and ran our algorithms over the simulated interface. The rationale for using local datasets is that we have the ground truth (e.g., real aggregate query answers over the entire network) against which to evaluate the performance of our algorithms.
Table I lists the local social networks we tested (collected from [1]). All three datasets are previously captured topological snapshots of Epinions and Slashdot, two real-world online social networks. Since we focus on sampling undirected graphs in this paper, for a real-world directed graph (e.g., Epinions) we first convert it to an undirected one by keeping only edges that appear in both directions in the original directed graph. Note that by following this conversion strategy, we guarantee that a random walk over the undirected graph can also be performed over the original directed graph, with an additional step of verifying the inverse edge (resp. ) before committing to an edge (resp. ) in the random walk. The number of edges and the 90% effective diameter reported in Table I are values after conversion.
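The conversion step above, keeping only mutual edges, is simple enough to sketch directly:

```python
def mutual_undirected(directed_edges):
    """Keep only edges present in both directions; the resulting
    undirected edge set guarantees that any walk step (u, v) can be
    replayed on the original directed graph after verifying the
    inverse edge (v, u)."""
    s = set(directed_edges)
    return {tuple(sorted(e)) for e in s if (e[1], e[0]) in s and e[0] != e[1]}

print(mutual_undirected([(1, 2), (2, 1), (2, 3)]))  # → {(1, 2)}
```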
Google Plus Online Social Graph: We also tested a second type of dataset: remote online social networks for which we have no access to the ground truth. In particular, we chose the Google Plus network (https://plus.google.com/) because its API is the most generous among those we tested in terms of the number of accesses allowed per IP address per day (the source code of its Python wrapper can be found at https://github.com/pct/pythongoogleplusapi; after April 20, 2012, this social graph API will be fully retired). Using the simple random walk and the MTO-Sampler random walk, we accessed 240,276 users on Google Plus. We observed that the interface provided by the Google Social Graph API strictly adheres to our model of an individual-user-query-only web interface, in that each API request returns the local neighborhood of one user. We also collected users' self-description data.
Synthetic Social Networks: For the real-world social networks described above, we cannot vary graph parameters such as size and connectivity to observe the corresponding change in our algorithms' performance. To do so, we also tested synthetic social networks generated according to theoretical models; in particular, we tested the latent space model.
We note that, since the effectiveness of these theoretical models is still under research and debate, we tested these synthetic social networks solely to observe how performance may change for social networks with different characteristics. The superiority of our algorithm over the simple random walk, on the other hand, is demonstrated by our experiments on the two types of real-world social networks.
V-A3 Algorithms Implementation and Evaluation
Algorithms: We tested four algorithms, the simple random walk (i.e., the baseline), the Metropolis-Hastings Random Walk (MHRW), Random Jump (RJ), and our MTO-Sampler, and compared their performance over all of the above-described datasets.
Input Parameters: Both the simple random walk and our MTO-Sampler are parameterless algorithms with one exception: both need a convergence indicator to determine when the random walk has reached (or become sufficiently close to) the stationary distribution, so that a sample can be retrieved from it. In the experiments, we used the Geweke indicator [9], one of the most widely used methods in the literature, which we briefly explain as follows.
Given a sequence of nodes retrieved by a random walk, the Geweke method determines whether the random walk reaches the stationary distribution after a burn-in of steps by first constructing two "windows" of nodes: Window A is formed by the first 10% of nodes retrieved by the random walk after the burn-in period, and Window B by the last 50%. One can see that, if the random walk indeed converges to the stationary distribution after burn-in, then the two windows should be statistically indistinguishable. This is exactly how the Geweke indicator tests convergence. In particular, consider any attribute that can be retrieved for each node in the network (a commonly used one is the degree, which applies to every graph). Let
(14) 
where and are the means of for all nodes in Windows and , respectively, and and are the corresponding variances. One can see that when the random walk converges to the stationary distribution. Thus, the Geweke indicator confirms convergence once falls below a threshold. In the experiments, we set the threshold to by default, while also performing tests with the threshold ranging from to .
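The window comparison above can be sketched as follows. Note one simplification of ours: this sketch uses plain sample variances, whereas the full Geweke diagnostic uses spectral estimates of the variance of each window mean.

```python
from statistics import mean, variance

def geweke_z(chain, first=0.1, last=0.5):
    """Geweke z-score on a post-burn-in chain of node attributes:
    z = (mean_A - mean_B) / sqrt(var_A + var_B), where window A is
    the first 10% and window B the last 50% of the chain. A z-score
    near 0 is consistent with convergence."""
    n = len(chain)
    a = chain[: int(first * n)]
    b = chain[int((1 - last) * n):]
    return (mean(a) - mean(b)) / ((variance(a) + variance(b)) ** 0.5)
```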
Performance Measures for Sampling: As mentioned in Section II-B, a sampling technique for online social networks should be measured by query cost and bias, i.e., the distance between the (ideal) stationary distribution (e.g., for a simple random walk) and the actual probability distribution for each node to be sampled. To measure query cost, we simply used the number of unique queries issued by the sampler. Bias, on the other hand, is more difficult to measure, as discussed below.
For a small graph, we measured bias by running the sampler for an extremely long time (long enough that each node is sampled multiple times). We then estimated the sampling distribution by counting the number of times each node is retrieved, and compared this distribution with the ideal one to derive the bias. In particular, we measured bias as the KL-divergence between the two distributions, specifically , where and are the ideal distribution and the (measured) sampling distribution, respectively.
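This bias measure can be sketched directly; the direction D(ideal || measured) is our assumption, since the text only states that the KL-divergence between the two distributions is used.

```python
import math

def kl_divergence(p, q):
    """KL-divergence D(p || q) = sum_v p(v) * log(p(v) / q(v)) over
    nodes with p(v) > 0; here p is the ideal stationary distribution
    and q the empirical sampling distribution, both as dicts mapping
    node -> probability."""
    return sum(pv * math.log(pv / q[v]) for v, pv in p.items() if pv > 0)
```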
For a larger graph, a prohibitively large number of queries may be needed to sample each node multiple times. To measure bias in this case, we use the collected samples to estimate aggregate query answers over all nodes in the graph, and then compare the estimates with the ground truth. One can see that a sampler with smaller bias tends to produce estimates with lower relative error. Specifically, for the local social networks, we used the average degree as the aggregate query (as only topological information is available for these networks). For the Google Social Graph experiment, we tested various aggregate queries, including the average degree and the average length of user self-descriptions.
Finally, to verify the theoretical results derived in the paper, we also tested a theoretical measure: the mixing time of the graph. In particular, we continuously ran our MTO-Sampler until it hit each node at least once, so we could actually obtain the topology of the overlay graph (e.g., as in Fig 1). Then, we computed the mixing time of the overlay graph from the Second-Largest Eigenvalue Modulus (SLEM) of its adjacency matrix (the theoretical mixing time of a simple random walk can be defined as , where is the SLEM of the transition matrix ; see [6]). We caution that, while we used it to verify our theoretical result that MTO-Sampler never decreases the conductance of a graph, this theoretically computed measure does not replace the above-described bias vs. query cost tests, because it is often sensitive to a small number of "badly-connected" nodes (which may not cause significant bias for practical purposes).
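The SLEM-based mixing-time proxy can be sketched as below. Two simplifications of ours: we use a lazy walk (half the probability mass stays put) to sidestep periodicity, and the constant in the bound is only order-of-magnitude; the footnoted definition in the paper may differ.

```python
import numpy as np

def slem_mixing_bound(adj, eps=0.25):
    """Mixing-time proxy from the second-largest eigenvalue modulus
    (SLEM) mu of a lazy random-walk transition matrix:
    T(eps) ~ log(1/eps) / (1 - mu)."""
    deg = adj.sum(axis=1)
    P = adj / deg[:, None]              # simple random walk transition matrix
    P = 0.5 * (P + np.eye(len(adj)))    # lazy walk avoids periodicity
    mags = np.sort(np.abs(np.linalg.eigvals(P)))
    mu = mags[-2]                       # second-largest modulus
    return mu, np.log(1.0 / eps) / (1.0 - mu)
```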
V-B Performance Comparison Between Simple Random Walk and MTO-Sampler
We started by comparing the performance of Simple Random Walk (SRW) and MTO-Sampler over real-world social networks using all three performance measures described above: KL-divergence, relative error vs. query cost, and theoretical mixing time.
Local Datasets: We first tested the relative error vs. query cost tradeoff of SRW, MTO, MHRW, and RJ for estimating aggregate query answers. Since only topological information is available for the local datasets, we used the average degree as the aggregate query. Fig 7 depicts the performance comparison for the three real-world social networks. Here each point represents the average of 20 runs of each algorithm, and the query cost (i.e., the y-axis) represents the maximum query cost for a random walk to produce an estimate with relative error above a given value (i.e., the x-axis). For Random Jump, we set the jumping probability to 0.5. One can see that, for all three datasets, our MTO-Sampler achieves a significant reduction in query cost compared with the SRW, MHRW, and Random Jump samplers.
We also tested the KL-divergence measured by performing an extremely long execution of SRW and MTO in Fig 10, with each producing 20,000 samples, to estimate the sampling probability of each node. The Geweke threshold was set to 0.1 for this test. One can see that our MTO-Sampler not only requires fewer queries per sample (i.e., converges to the stationary distribution faster), but also produces less bias than the SRW sampler.
To further test the bias of samples generated by our MTO-Sampler, we also ran the test while varying the Geweke threshold from 0.1 to 0.8 on the Slashdot B dataset. Fig 10 depicts the change of measured bias for SRW and MTO, respectively. One can see from the figure that our MTO-Sampler achieves smaller bias than SRW in all tested cases. In addition, a smaller threshold leads to smaller bias and larger query cost, as implied by the definition of the Geweke convergence monitor.
Google Plus online social network: For Google Plus, we do not have the ground truth, as the entire social network is too large to be crawled (about 85.2 million users in Feb 2012, as estimated by Paul Allen's model, http://goo.gl/nZCzN). Thus, we performed the tests in two steps. First, we continuously ran each sampler until its Geweke convergence monitor indicated that it had reached its stationary distribution. We then used the final estimate as the presumptive ground truth, which we refer to as the converged value. In the second step, we used the converged value to compute the relative error vs. query cost tradeoff as previously described.
Fig 11(a) shows the estimated average degree when running SRW and the MTO-Sampler random walk on Google Plus. It clearly shows that MTO-Sampler has smaller variance and converges faster than the simple random walk. Figs 11(b) and 11(c) illustrate the comparison between SRW and MTO on relative error vs. query cost for multiple attributes. We note that the self-description length is the number of characters in a user's self-description. One can see that our MTO-Sampler significantly outperforms SRW.
Synthetic Social Networks: Finally, we conducted further analysis of our MTO-Sampler, in particular the individual effects of edge removals (RM) and edge replacements (RP), using the synthetic latent space model described in Section V-A2. Fig 10 depicts the results when the number of nodes in the graph varies from 50 to 100 (with the latent space model, we distributed these nodes over an area of , and set ). We derived the theoretical mixing time from the second-largest eigenvalue modulus of the transition matrix. Note that Fig 10 also includes the theoretical bound derived in Section IV-B. One can see from the figure that our final MTO-Sampler achieves better efficiency than the individual application of either edge removal or edge replacement. In addition, the theoretical model yields a conservative estimate that is outperformed by the actual efficiency of MTO-Sampler, consistent with our results in Section IV-B.
VI Related Work
Sampling from online social networks. Several papers [15, 2, 13] have considered sampling from general large graphs, while [12, 18, 10] focus on sampling from online social networks.
With global topology, [15] discussed sampling techniques such as random node, random edge, and random subgraph sampling in large graphs. [11] introduced Albatross sampling, which combines random jump and MHRW. [10] also demonstrated a truly uniform sampling method over user IDs as "ground truth".
Without global topology, [10, 15] compared sampling techniques such as Simple Random Walk, Metropolis-Hastings Random Walk, and traditional Breadth-First Search (BFS) and Depth-First Search (DFS). [10, 4] also considered running many parallel random walks at the same time; MTO-Sampler can be applied to each parallel random walk straightforwardly, since it is a parameter-free and online algorithm.
Moreover, to the best of our knowledge, random walks remain the most practical way to sample from large graphs without global topology.
Shortening the mixing time of random walks. [18] found that the mixing time of typical online social networks is much larger than anticipated, which validates our motivation to shorten the mixing time of random walks. [5] derived the fastest mixing random walk on a graph by convex optimization over the second-largest eigenvalue of the transition matrix, but it needs the whole topology of the graph, and its high time complexity makes it inapplicable to large graphs.
VII Conclusions
In this paper we have initiated a study of enabling faster random walks over an online social network (with a restrictive web interface) by "rewiring" the social network on-the-fly. We showed that the key to speeding up a random walk is to increase the conductance of the graph topology followed by the walk. To this end, we developed MTO-Sampler, which provably increases the graph conductance by constructing an overlay topology on-the-fly through edge removals and replacements. We provided theoretical analysis and extensive experimental studies over real-world social networks to illustrate the superiority of MTO-Sampler in achieving smaller sampling bias at lower query cost.
References
 [1] Stanford large network dataset collection http://snap.stanford.edu/data/.
 [2] E. M. Airoldi. Sampling algorithms for pure network topologies. SIGKDD Explorations, 7:13–22, 2005.
 [3] N. Alon. Eigenvalues and expanders. Combinatorica, 6:83–96, 1986. 10.1007/BF02579166.
 [4] N. Alon, C. Avin, M. Koucky, G. Kozma, Z. Lotker, and M. R. Tuttle. Many random walks are faster than one. In SPAA, 2008.
 [5] S. Boyd, P. Diaconis, and L. Xiao. Fastest mixing markov chain on a graph. SIAM REVIEW, 46:667–689, 2003.
 [6] S. Boyd, A. Ghosh, and B. Prabhakar. Mixing times for random walks on geometric random graphs. SIAM ANALCO, 2005.
 [7] F. Chung. Random walks and local cuts in graphs. Linear Algebra and its Applications, 423(1):22–32, May 2007.
 [8] F. Chung and L. Lu. The small world phenomenon in hybrid power law graphs. In Complex Networks, (Eds. E. BenNaim et. al.), SpringerVerlag, pages 91–106. Springer, 2004.
 [9] J. Geweke. Evaluating the accuracy of samplingbased approaches to the calculation of posterior moments. In IN BAYESIAN STATISTICS, pages 169–193. University Press, 1992.
 [10] M. Gjoka, M. Kurant, C. T. Butts, and A. Markopoulou. Walking in facebook: A case study of unbiased sampling of osns. In INFOCOM, 2010.
 [11] L. Jin, Y. Chen, P. Hui, C. Ding, T. Wang, A. V. Vasilakos, B. Deng, and X. Li. Albatross sampling: robust and effective hybrid vertex sampling for social graphs. In MobiArch, 2011.
 [12] L. Katzir, E. Liberty, and O. Somekh. Estimating sizes of social networks via biased sampling. In WWW, 2011.
 [13] M. Kurant, M. Gjoka, C. T. Butts, and A. Markopoulou. Walking on a graph with a magnifying glass: stratified sampling via weighted random walks. In SIGMETRICS, 2011.
 [14] C.H. Lee, X. Xu, and D. Y. Eun. Beyond random walk and metropolishastings samplers: Why you should not backtrack for unbiased graph sampling. In Sigmetrics, 2012.
 [15] J. Leskovec and C. Faloutsos. Sampling from large graphs. In SIGKDD, 2006.
 [16] J. Leskovec, K. Lang, A. Dasgupta, and M. Mahoney. Community structure in large networks: Natural cluster sizes and the absence of large welldefined clusters. Internet Mathematics, 6(1):29–123, 2009.
 [17] L. Lovász. Random walks on graphs: A survey. Combinatorics, Paul Erdos is Eighty, 2(1):1–46, 1993.
 [18] A. Mohaisen, A. Yun, and Y. Kim. Measuring the mixing time of social graphs. In SIGCOMM, 2010.
 [19] M. Richardson, R. Agrawal, and P. Domingos. Trust management for the semantic web. In ISWC, 2003.
 [20] A. Sala, L. Cao, C. Wilson, R. Zablit, H. Zheng, and B. Zhao. Measurementcalibrated graph models for social network experiments. In WWW, 2010.
 [21] P. Sarkar, D. Chakrabarti, and A. W. Moore. Theoretical justification of popular link prediction heuristics. In COLT, 2010.
 [22] P. Sarkar and A. W. Moore. Dynamic social network analysis using latent space models. SIGKDD Explor. Newsl., 7:31–40, December 2005.
 [23] Z. Zhou, N. Zhang, Z. Gong, and G. Das. http://www.seas.gwu.edu/~nzhang10/rewiring.pdf.
Appendix A Appendix
Corollary 1. For all which satisfy
(15) 
there always exists a graph in which is cross-cutting.
Let . We only need to construct, for each case satisfying (15), a counterexample in which is a cross-cutting edge. Assume we have a graph as in Fig 12, which shows its whole view. Let the number of common neighbors of nodes and be . Assuming , from (15) we get:
(16) 
Here denotes the outer edges of that are not linked to node or their common neighbors. We can carefully construct a graph as in Fig 12: each neighbor of nodes and has only degree-1 neighbors. We then need to prove that, after assigning the degree of each node, is a cross-cutting edge. If we simply let:
(17) 
and then we divide these nodes into two sets and .
Suppose is even. To achieve the minimum in the definition of conductance, it suffices to decide whether node lies in or in .
(18) 
If , we can easily assert that is a cross-cutting edge. If , we can let when to minimize . So is a cross-cutting edge in this case as well.
Now suppose is odd. Similarly, we have
(19) 
Since , we conclude in the same way that is a cross-cutting edge.
Theorem 4. Given , , if , , then replacing edge with will not decrease the conductance, and has positive probability of increasing it. {proof} First, whether or not is a cross-cutting edge, replacing it with preserves at least the same conductance. If is not a cross-cutting edge, then obviously the conductance does not decrease, because neither nor changes. If is a cross-cutting edge, we only need to prove that is also a cross-cutting edge. Assume is not a cross-cutting edge; then must be a cross-cutting edge. But has degree only 3, so placing , , and on the same side clearly achieves lower conductance, which contradicts the definition of conductance.
Moreover, if is a cross-cutting edge and we replace it with , then with positive probability becomes an additional cross-cutting edge in this local view of , , and , which results in higher conductance.
Corollary 2. For , if , then there always exists a graph , , such that replacing with will decrease the conductance or have no effect. {proof} If , then we cannot cut it to disconnect the graph. If , we need to check several possible situations. If none of the edges linked to are cross-cutting, then replacing has no effect on the conductance. If either or is a cross-cutting edge, then replacing one of them with will not generate another cross-cutting edge, because now , and it must belong to one side of the separation, or .
So we only need to consider the situation when . See Fig 13. There exist cases in which both and are cross-cutting edges. Replacing with would then decrease the number of cross-cutting edges from 2 to 1 locally, which may lead to a dramatic decrease in the conductance of the graph.
The uniqueness of is that no case exists in which both and are cross-cutting edges.
Theorem 5. Given , , if and
(20) 
we can assert that is not a cross-cutting edge. Here we denote . {proof} Note that if we do not know any degree information about the common neighbors of and , then , and Theorem 5 reduces exactly to Theorem 3.
We prove this theorem by contradiction: if we assume is a cross-cutting edge, then we can find another configuration of and such that is not a cross-cutting edge yet yields a lower conductance. Again, let ; by assumption the number of common neighbors of and is n, so there must be cross-cutting edges in this local view of the graph; see Fig 14.
Given a node , we can obtain its degree from historical information without paying any query cost. Obviously, if , it makes no sense to consider rearranging it, because dragging from to would likely increase the number of cross-cutting edges without knowledge of the edge information outside this local view of the graph. Therefore, we only need to consider , the set of all common neighbors of degree 2 and 3.
Denote by the number of cross-cutting edges linked to within , and by the number of cross-cutting edges linked to outside ; define and similarly. We then have . According to the condition of the theorem, one of the following inequalities must hold:
Without loss of generality, assume the first one holds; we then show that rearranging the set achieves a lower conductance, leading to the contradiction.
Suppose we drag the whole set from to ; then we need to "rearrange" all the edges linked to the set: the cross-cutting edges linked to the set from outside will be "flipped", i.e., turned from cross-cutting edges into non-cross-cutting edges and vice versa; the cross-cutting edges linked to the set from inside will be eliminated, since otherwise there would be two cross-cutting edges linked to a node in , which is impossible because , .
Let , then
(21) 
We know that the minimum number of cross-cutting edges we can manipulate is at least . Thus, by a one-line calculation from (20),
(22) 
Therefore, moving the set from to always results in a lower conductance.
Theorem 6. Given a latent space graph model G(V,E), assume ; then the expected number of edges we can remove is
(23) 
Moreover, if we assume the dimension , and that nodes are uniformly distributed in a rectangle , then for the graph (after removing edges from G) we have:
(24) 
where and are independent uniform random variables supported on and .
According to [21], we have
(25) 
where is the volume of a -dimensional hypersphere of radius r. Therefore, if is small enough, then we can confirm that edge can be removed. Conservatively, from Theorem 3 we can reasonably assert that if , then the edge can be safely removed. So when
(26) 
the edge can be removed. We have now transformed the probability of removing an edge into the probability that two nodes' distance is within a threshold. Since , (23) holds.
Given further assumptions on the dimension and the distribution of nodes, the probability that two nodes' Euclidean distance is smaller than the threshold is:
(27) 
Also, since , the change of conductance can be calculated as
(28)  
(29)  
(30) 