Pruning based Distance Sketches with Provable Guarantees on Random Graphs
Measuring the distances between vertices on graphs is one of the most fundamental components in network analysis. Since finding shortest paths requires traversing the graph, it is challenging to obtain distance information on large graphs very quickly. In this work, we present a preprocessing algorithm that is able to create landmark based distance sketches efficiently, with strong theoretical guarantees. When evaluated on a diverse set of social and information networks, our algorithm significantly improves over existing approaches by reducing the number of landmarks stored, preprocessing time, or stretch of the estimated distances.
On Erdős-Rényi graphs and random power law graphs with degree distribution exponent , our algorithm outputs an exact distance data structure with space between and depending on the value of , where is the number of vertices. We complement the algorithm with tight lower bounds for Erdős-Rényi graphs and the case when is close to two.
1. Introduction
Computing shortest path distances on large graphs is a fundamental problem in computer science and has been the subject of much study (Thorup and Zwick, 2005; Goldberg and Harrelson, 2005; Sommer, 2014). In many applications, it is important to compute the shortest path distance between two given nodes, i.e. to answer shortest path queries, in real time. Graph distances measure the closeness or similarity of vertices and are often used as one of the most basic metrics in network analysis (He et al., 2007; Potamias et al., 2009; Vieira et al., 2007; Yahia et al., 2008). In this paper, we will focus on efficient and practical implementations of shortest path queries in classes of graphs that are relevant to web search, social networks, and collaboration networks. For such graphs, one commonly used technique is that of landmark-based labelings: every node is assigned a set of landmarks, and the distance between two nodes is computed only via their common landmarks. If the set of landmarks can be easily computed, and is small, then we obtain both efficient pre-processing and small query time.
Landmark based labelings (and their more general counterpart, Distance Labelings), have been studied extensively (Sommer, 2014; Bast et al., 2016). In particular, a sequence of results culminating in the work of Thorup and Zwick (Thorup and Zwick, 2005) showed that labeling schemes can provide a multiplicative 3-approximation to the shortest path distance between any two nodes, while having an overhead of storage per node on average in the graph (we use the standard notation that a graph has nodes and edges). In the worst case, there is no distance labeling scheme that always uses a sub-quadratic amount of space and provides exact shortest paths. Even for graphs with maximum degree , it is known that any distance labeling scheme requires total space (Gavoille et al., 2001). In sharp contrast to these theoretical results, there is ample empirical evidence that very efficient distance labeling schemes exist in real world graphs that can achieve much better approximations. For example, Akiba et al. (Akiba et al., 2013) and Delling et al. (Delling et al., 2014a) show that current algorithms can find landmark based labelings that use only a few hundred landmarks per vertex to obtain exact distances, in a wide collection of social, Web, and computer networks with millions of vertices. In this paper, we make substantial progress towards closing the gap between theoretical and observed performance. We show that natural landmark based labeling schemes can give exact shortest path distances with a small number of landmarks for a popular model of (unweighted and undirected) web and social graphs, namely the heavy-tailed random graph model. We also formally show how further reduction in the number of landmarks can be obtained if we are willing to tolerate an additive error of one or two hops, in contrast to the multiplicative 3-approximation for general graphs.
Finally, we present practical versions of our algorithms that result in substantial performance improvements on many real-world graphs.
In addition to being simple to implement, landmark based shortest path algorithms also offer a qualitative benefit, in that they can directly be used as the basis of a social search algorithm. In social search (Bahmani and Goel, 2012), we assume there is a collection of keywords associated with every node, and we need to answer queries of the following form: given node and keyword , find the node that is closest to among all nodes that have the keyword associated with them. This requires an index size that is times the size of the total social search corpus and a query time of , where is the number of landmarks per node in the underlying landmark based algorithm; the approximation guarantee for the social search problem is the same as that of the landmark based algorithm. Thus, our results lead to both provable and practical improvements to the social search problem.
Existing models for social and information networks build on random graphs with some specified degree distribution (Durrett, 2007; Chung and Lu, 2006; Van Der Hofstad, 2009), and there is considerable evidence that real-world graphs have power-law degree distributions (Clauset et al., 2009; Eikmeier and Gleich, 2017). We will use the Chung-Lu model (Chung and Lu, 2002), which assumes that the degree sequence of our graph is given, and then draws a “uniform” sample from graphs that have the same or very similar degree sequences. In particular, we will study the following question: Given a random graph from the Chung-Lu model with a power law degree distribution of exponent , how much storage does a landmark-based scheme require overall, in order to answer distance queries with no distortion?
In the rest of the paper, we use the term “random power law graph” to refer to a graph that is sampled from the Chung-Lu model, where the weight (equivalently, the expected degree) of each vertex is independently drawn from a power law distribution with exponent . We are interested in the regime when — this covers most of the empirical power law degree distributions that people have observed on social and information networks (Clauset et al., 2009). Admittedly, real-world graphs have additional structure in addition to having power-law degree distributions (Leskovec et al., 2008), and hence, we have also validated the effectiveness of our algorithm on real graphs.
1.1. Our Results
Our first result corresponds to the “easy regime”, where the degree distribution has finite variance (). We show that a simple procedure for generating landmarks guarantees exact shortest paths, while only requiring each node to store landmarks. The same conclusion also applies to Erdős-Rényi graphs when , or when .
We then study the case where . This is the most emblematic regime for power-law graphs, since the degree distribution has infinite variance but finite expectation. We present an algorithm that generates at most landmarks per node when ; and landmarks per node when . We obtain additional improvements if we allow an additive error of 1 or 2. See Figure 1 for an illustration of our results.
While the dependence on is complex, it is worth noting that in the entire range that we study (), the number of landmarks per node is at most for exact shortest paths. This is in stark contrast to known impossibility results for general graphs, where no distance labeling with a multiplicative stretch less than 3 can use sub-linear space per node (Gavoille et al., 2001). The query time of our algorithms is proportional to the number of landmarks per node, so we also get speed improvements.
Our algorithm is based on the pruned labeling algorithm of Akiba et al. (Akiba et al., 2013), but differs in important ways. The pruned labeling algorithm initially posits that every node is a landmark for every other node, and then uses the BFS tree from each node to iteratively prune away unnecessary landmarks. In our approach, we apply a similar BFS with pruning procedure on a small subset of (i.e. high degree vertices), but switch to lightweight local ball growing procedures up to radius for all other vertices. As we show, the original pruned labeling algorithm requires storing landmarks on sparse Erdös-Rényi graphs. By growing local balls of size , our algorithm recovers exact distances with at most landmarks instead, for Erdös-Rényi graphs and random power law graphs with . Hence, our algorithm combines exploiting the existence of high-degree “central landmarks” with finding landmarks that are “locally important”. Furthermore for , by setting up the number of global landmarks and the radius suitably, we provably recover the upper bounds described in Figure 1. While the algorithmic idea is quite simple, the analysis is intricate.
We complement our algorithmic results with tight lower bounds for the regime when : the total length of any distance labeling scheme that answers distance queries exactly in this regime is almost surely . We also show that when , any distance labeling scheme will generate labels of total size almost surely. In particular, our algorithm achieves the optimal bound when is close to 2.
The parameter choice suggested by our theoretical analysis can be quite expensive to implement (as can earlier landmark based algorithms). We apply a simple but principled parameter tuning procedure to our algorithm that substantially improves the preprocessing time and generates a smaller set of landmarks at essentially no loss of accuracy. We conduct experiments on several real world graphs, both directed and undirected. First, compared to the pruned labeling algorithm, we find that our algorithm reduces the number of landmarks stored by 1.5-2.5x; the preprocessing time is reduced significantly as well. Next, we compare our algorithm to the distance oracle of Thorup and Zwick (Thorup and Zwick, 2005), which is believed to be theoretically optimal for worst-case graphs, as well as the distance sketch of Das Sarma et al. (Das Sarma et al., 2010), which has been found to be both efficient and useful in prior work (Bahmani and Goel, 2012). For each graph, our algorithm substantially outperforms these two benchmarks. Details are in Section 5. It is important to note that the three algorithms we compare to also work much better on these real-world graphs than their theoretical guarantees suggest, and we spend considerable effort tuning their parameters as well. Hence, the performance improvement given by our algorithm is particularly noteworthy.
It is worth mentioning that our technical tools only rely on bounding the growth rate of the breadth-first search. Hence we expect that our results can be extended to the related configuration model (Durrett, 2007) as well. One limitation of our work is that the analysis does not apply directly to preferential attachment graphs, which correspond to another family of well known power law graphs. But we believe that similar results can be obtained there by adapting our analysis to that setting as well. This is left for future work.
Organization: The rest of the paper is organized as follows. Section 2 introduces the basics of random graphs, and reviews the pruned labeling algorithm and related work. Section 3 introduces our approach. Section 4 analyzes our algorithm on random power law graphs. Then we present experiments in Section 5. We show the lower bounds in Section 6. We conclude in Section 7. The Appendix contains missing proofs from the main body.
2. Preliminaries and Related Work
Let be a directed graph with vertices and edges. For a vertex , denote by the outdegree of and the indegree of . Let denote the set of its out neighbors. Let denote the distance of and in , or for simplicity. When is undirected, then the outdegrees and indegrees are equal. Hence we simply denote by the degree of every vertex . For an integer and , denote by the set of vertices at distance from . Denote by the set of vertices at distance at most from .
We use notation to indicate that there exists an absolute constant such that . The notations and hide poly-logarithmic factors.
2.2. Landmark based Labelings
In a landmark based labeling (Delling et al., 2014b), each vertex is assigned a set of forward landmarks and backward landmarks . Each landmark set is a hash table, whose keys are vertices and values are distances. For example, if , then the value associated with would be , which is the “forward” distance from to . Given the landmark sets and , we estimate the distances as follows:
If no common vertex is found between and , then is not reachable from . In the worst case, computing set intersection takes time.
Denote the output of equation (1) by . Clearly, we have when is reachable from . The additive stretch of is given by , and the multiplicative stretch is given by . When there are no errors for any pairs of vertices, such landmark sets are called 2-hop covers (Cohen, 1997).
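As an illustration of the query step described above, the following minimal Python sketch estimates a distance as the smallest sum of the two stored distances over all landmarks common to both hash tables. It assumes that the stripped definitions of stretch are the usual ones (estimate minus true distance, and estimate divided by true distance). The function names and the toy landmark sets are hypothetical and chosen only for this example.

```python
import math


def estimate_distance(forward_x, backward_y):
    """Return the minimum of d(x, w) + d(w, y) over common landmarks w.

    Both arguments are hash tables mapping landmark -> stored distance.
    Returns math.inf when the two landmark sets share no vertex.
    """
    # Scan the smaller table and probe the larger one.
    small, large = sorted((forward_x, backward_y), key=len)
    best = math.inf
    for w, d_small in small.items():
        d_large = large.get(w)
        if d_large is not None:
            best = min(best, d_small + d_large)
    return best


def additive_stretch(estimate, true_distance):
    """Assumed definition: estimate minus true distance."""
    return estimate - true_distance


def multiplicative_stretch(estimate, true_distance):
    """Assumed definition: estimate divided by true distance."""
    return estimate / true_distance


# Hypothetical landmark sets for a query from x to y.
forward_x = {"x": 0, "a": 1, "b": 2}
backward_y = {"y": 0, "a": 3, "b": 1}
estimate = estimate_distance(forward_x, backward_y)
```

Scanning the smaller table keeps the cost of a query proportional to the smaller of the two landmark set sizes, at the price of one hash lookup per scanned landmark.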
There is a more general family of data structures known as labeling schemes (Gavoille et al., 2001), which associate a vector with every vertex. When answering a query for a pair of vertices , only the two labels and are required without accessing the graph. The total length of is given by . It is clear from equation (1) that landmark sketches fall in the query model of labeling schemes.
2.3. The Pruned Labeling Algorithm
We review the pruned labeling algorithm (Akiba et al., 2013) for readers who are unfamiliar with it. The algorithm starts with an ordering of the vertices, . First for , a breadth first search (BFS) is performed over the entire graph. During the BFS, is added to the landmark set of every vertex. (For directed graphs, there will be a forward BFS which looks at ’s outgoing edges and its descendants, as well as a backward BFS which looks at ’s incoming edges and its predecessors.) Next for , in addition to running BFS, a pruning step is performed before adding as a landmark. For example, suppose that a path of length is found from to . If lies on the shortest path from to , then by checking their landmark sets, we can find the common landmark to certify that . In this case, is not added to ’s landmark set, and the neighbors of are pruned away. The above procedure is repeated on , , etc.
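The procedure above can be sketched in a few lines of Python. This is a minimal version for undirected, unweighted graphs only (a single landmark set per vertex rather than separate forward and backward sets), and it is not the bit-parallel implementation of Akiba et al. The adjacency-list format, the function names and the small example graph are assumptions made for illustration.

```python
from collections import deque
import math


def query(labels, x, y):
    """Distance estimate through the best common landmark, or inf if none."""
    common = labels[x].keys() & labels[y].keys()
    return min((labels[x][w] + labels[y][w] for w in common), default=math.inf)


def bfs_distances(adj, source):
    """Plain BFS, used here only to check the labels."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def pruned_labeling(adj, order):
    """Run one pruned BFS per vertex, following the given vertex ordering."""
    labels = {v: {} for v in adj}
    for root in order:
        dist = {root: 0}
        queue = deque([root])
        while queue:
            u = queue.popleft()
            # Pruning: earlier landmarks already certify a distance that is
            # no larger, so root is not stored at u and u is not expanded.
            if query(labels, root, u) <= dist[u]:
                continue
            labels[u][root] = dist[u]
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
    return labels


# Hypothetical example: a 6-cycle with one extra vertex joined to 0 and 3.
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0), (0, 6), (3, 6)]
adj = {v: [] for v in range(7)}
for a, b in edges:
    adj[a].append(b)
    adj[b].append(a)
order = sorted(adj, key=lambda v: (-len(adj[v]), v))  # decreasing degree
labels = pruned_labeling(adj, order)
```

On this example the first root stores itself at every vertex, while later roots are pruned wherever an earlier landmark already lies on a shortest path.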
2.4. Random Graphs
We review the basics of Erdös-Rényi random graphs. Let be an undirected graph where every edge appears with probability . It is well known that when , has only one connected component with high probability. Moreover, the neighborhood growth rate (i.e. ) is highly concentrated around its expectation, which is . Formally, the following facts are well-known.
Fact 1 (Bollobás (Bollobás, 1998)).
Let be an undirected graph where every edge is sampled with probability . Let . Then the following are true with high probability:
The diameter of is at most ;
For any and , ;
For any and , we have .
The Chung-Lu model: Let denote a weight value for every vertex . Given the weight vector , the Chung-Lu model generalizes Erdös-Rényi graphs such that each edge is chosen independently with probability
where denote the volume of .
Random power law graphs: Let denote the probability density function of a power law distribution with exponent , i.e. , where (Clauset et al., 2009). The expectation of exists when . The second moment is finite when . When , the expectation is finite, but the empirical second moment grows polynomially in the number of samples with high probability. If , then even the expectation becomes unbounded as grows.
In a random power law graph, the weight of each vertex is drawn independently from the power law distribution. Given the weight vector , a random graph is sampled according to the Chung-Lu model. If the average degree , then it is known that almost surely the graph has a unique giant component (Chung and Lu, 2006).
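To make the sampling process concrete, here is a minimal Python sketch that draws vertex weights from a continuous power law and then samples a graph from the Chung-Lu model. Several choices are assumptions on our part: the edge probability is taken to be the standard Chung-Lu form (product of the two weights divided by the volume) capped at one, self-loops are not sampled, weights are drawn by inverse transform sampling with a hypothetical minimum weight `x_min`, and all function names are illustrative. The quadratic loop over vertex pairs is only meant for small graphs.

```python
import random


def sample_power_law_weights(n, beta, x_min=1.0, rng=random):
    """Draw n weights with density proportional to x**(-beta) for x >= x_min.

    Inverse transform sampling; requires beta > 1.
    """
    return [x_min * (1.0 - rng.random()) ** (-1.0 / (beta - 1.0))
            for _ in range(n)]


def sample_chung_lu(weights, rng=random):
    """Sample an undirected graph in which edge (u, v) appears independently
    with probability min(1, w_u * w_v / vol), where vol is the sum of weights.
    """
    n = len(weights)
    vol = sum(weights)
    adj = {v: set() for v in range(n)}
    for u in range(n):
        for v in range(u + 1, n):  # skip self-loops (an assumption here)
            p = min(1.0, weights[u] * weights[v] / vol)
            if rng.random() < p:
                adj[u].add(v)
                adj[v].add(u)
    return adj


rng = random.Random(0)
weights = sample_power_law_weights(200, beta=2.5, rng=rng)
graph = sample_chung_lu(weights, rng=rng)
```

With all weights equal, the same routine reduces to an Erdős-Rényi graph in which every edge has the same probability.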
2.5. Related Work
Landmark based labelings: There is a rich history of study on how to preprocess a graph to answer shortest path queries (Abraham and Gavoille, 2011; Alstrup et al., 2015; Althöfer et al., 1993; Cohen, 1997; Thorup and Zwick, 2005; Borassi et al., 2017a). It is beyond our scope to give a comprehensive review of the literature and we refer the reader to the survey (Sommer, 2014) for references.
In general, it is NP-hard to compute the optimal landmark based labeling (or 2-hop cover). Based on an LP relaxation, a factor approximation can be obtained via a greedy algorithm (Cohen et al., 2003). See also the references (Angelidakis et al., 2017; Babenko et al., 2015; Delling et al., 2014b; Goldberg et al., 2013; Borassi et al., 2017b) for a line of followup work. The current state of the art is achieved based on the pruned labeling algorithm (Akiba et al., 2013; Delling et al., 2014a). Apart from the basic version that we have already presented, bit-parallel optimizations have been used to speed up preprocessing (Akiba et al., 2013). Variants which can be executed when the graph does not fit in memory have also been studied (Jiang et al., 2014). It is conceivable that such techniques can be added on top of the algorithms that we study as well. For the purpose of this work, we will focus on the basic version of the pruned labeling algorithm. Compared to classical approaches such as distance oracles, the novelty of the pruned labeling algorithm is using the idea of pruning to reduce redundancy in the BFS tree.
Network models: Earlier work on random graphs focuses on modeling the small world phenomenon (Chung and Lu, 2006), and shows that the average distance grows logarithmically in the number of vertices. Recent work has enriched random graph models with more realistic features, e.g. community structures (Kolda et al., 2014) and shrinking diameters in temporal graphs (Leskovec et al., 2010).
Other existing mathematical models on special families of graphs related to distance queries include road networks (Abraham et al., 2010), planar graphs and graphs with doubling dimension. However, none of them can capture the expansion properties that have been observed on sub-networks of real-world social networks (Leskovec et al., 2008).
Previous work of Chen et al. (Chen et al., 2009) presented a 3-approximate labeling scheme requiring storage per vertex, on random power law graphs with . Our (+2)-stretch result improves upon this scheme in the amount of storage needed per vertex for , with a much better stretch guarantee. Another related line of work considers compact routing schemes on random graphs. Enachescu et al. (Enachescu et al., 2008) presented a 2-approximate compact routing scheme using space on Erdös-Rényi random graphs, and Gavoille et al. (Gavoille et al., 2015) obtained a 5-approximate compact routing scheme on random power law graphs.
3. Our Approach
In order to motivate the idea behind our approach, we begin with an analysis of the pruned labeling algorithm on Erdös-Rényi random graphs. While the structures of real world graphs are far from Erdös-Rényi graphs, the intuition obtained from the analysis will be useful. Below we describe a simple proposition which states that for sparse Erdös-Rényi graphs, the pruned labeling algorithm outputs landmarks.
Proposition 1 ().
Let be an undirected Erdös-Rényi graph where every edge appears with probability . For any ordering of the vertices , with high probability over the randomness of , the total number of landmarks produced by Algorithm 1 is at least .
We first introduce a few notations. Let denote the growth rate of . Let . Consider a vertex where . Denote by . Consider any , if none of the shortest paths from to intersect with , then is called a bad pair. Note that must be added to ’s landmark set by Algorithm 1, because during the BFS from , all estimates through will be strictly larger than . Hence, to lower bound the total landmark sizes, it suffices to count the number of bad pairs. In the following, we show that in expectation for every where , there are at least vertices such that are bad. It follows that Algorithm 1 requires at least in expectation.
Let . Consider , the set of vertices at distance equal to from . We count the number of bad vertices in as follows. For each , consider the intersection and their subtree down to .
Starting from any , the subtree of would result in good vertices in , whose distance from can be correctly estimated (c.f. line 15-16 in Algorithm 1). In expectation, the size of the intersection is , because the probability that any two vertices have distance on is equal to , and there are at most vertices in . Next, each results in vertices in its -th level neighborhood. Combined together, the total number of good vertices which are covered by is at most . By summing over all , we obtain that the total number of good vertices in is at most .
On the other hand, the size of is . Hence the total number of bad vertices is at least . To show that the proposition holds with high probability, it suffices to apply concentration results on neighborhood growth in the arguments above. We omit the details.
The interesting point from the above analysis is that landmarks are added throughout the first vertices. The reason is that there are no high degree vertices in Erdös-Rényi graphs, hence the landmarks we have added in the beginning do not cover the shortest paths for many vertex pairs later. Secondly, a large number of distant vertices are added in the landmark sets, which do not lie on the shortest paths of many pairs of vertices.
Motivated by the observation, we introduce our approach as follows. We start with an ordering of the vertices. For the top vertices in the ordering, we perform the same BFS procedure with pruning. For the rest of the vertices, we simply grow a local ball up to a desired radius. Concretely, only the vertices from the local ball will be used as landmarks. Algorithm 2 describes our approach in full. (Here we have omitted the details of the local backward BFS procedure, which can be derived similarly to the local forward BFS procedure.) As a remark, when the input graph is undirected, it suffices to run one of the forward or backward BFS procedures, and for each vertex, its forward and backward landmark sets can be combined into a single landmark set.
Recall that the backward and forward BFS procedures add a pruning step before enqueuing a vertex (c.f. Algorithm 1). For each with , the parameter controls the depth of the local ball we grow from . Furthermore, at the bottom layer, we only add vertices whose outdegree is higher than that of any of their predecessors to ’s landmark set. The intuition is that vertices with higher outdegrees are more likely to be used as landmarks.
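The two phases described above can be sketched as follows for an undirected graph. This is a simplified reading of the description in the text, not a transcription of Algorithm 2, and it makes several assumptions: a single landmark set per vertex, a common radius for all local balls, degree ties at the bottom layer broken by vertex id, and no pruning check inside the local phase (the whole ball is stored). All names are hypothetical.

```python
from collections import deque
import math


def query(labels, x, y):
    """Distance estimate through the best common landmark, or inf if none."""
    common = labels[x].keys() & labels[y].keys()
    return min((labels[x][w] + labels[y][w] for w in common), default=math.inf)


def bfs_distances(adj, source):
    """Plain BFS, used here only to check the labels."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def hybrid_labeling(adj, order, num_global, radius):
    labels = {v: {} for v in adj}
    rank = lambda v: (len(adj[v]), v)  # degree, ties broken by id (assumption)

    # Phase 1: pruned BFS from each of the first num_global vertices.
    for root in order[:num_global]:
        dist = {root: 0}
        queue = deque([root])
        while queue:
            u = queue.popleft()
            if query(labels, root, u) <= dist[u]:
                continue  # already certified by an earlier global landmark
            labels[u][root] = dist[u]
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)

    # Phase 2: every remaining vertex x stores its local ball of the given
    # radius, plus higher-ranked neighbors of the vertices on the boundary.
    for x in order[num_global:]:
        dist = {x: 0}
        queue = deque([x])
        while queue:
            u = queue.popleft()
            labels[x][u] = dist[u]
            for w in adj[u]:
                if w in dist:
                    continue
                if dist[u] < radius:
                    dist[w] = dist[u] + 1
                    queue.append(w)
                elif rank(w) > rank(u):
                    labels[x][w] = radius + 1  # w lies just outside the ball
    return labels


# Hypothetical example: a 6-cycle with one extra vertex joined to 0 and 3.
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0), (0, 6), (3, 6)]
adj = {v: [] for v in range(7)}
for a, b in edges:
    adj[a].append(b)
    adj[b].append(a)
order = sorted(adj, key=lambda v: (-len(adj[v]), v))
labels = hybrid_labeling(adj, order, num_global=1, radius=1)
```

In this simplified form, two non-global vertices at distance at most twice the radius plus one always share a landmark on a shortest path: either their balls meet at a midpoint, or the lower-ranked endpoint of the edge joining the two boundaries stores the other endpoint. Pairs farther apart rely on the global landmarks and may be overestimated.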
We begin by analyzing Algorithm 2 for Erdös-Rényi graphs, as a comparison to Proposition 1. The following proposition shows that without using global landmarks, local balls of suitable radius suffice to cover all the desired distances. The proof is by observing that for each vertex, if we add the closest vertices to the landmark set of every vertex, then the landmark sets of every pair of vertices will intersect with high probability, i.e. we have obtained a 2-hop cover.
Proposition 2 ().
Let be an undirected random graph where each edge is sampled with probability . By setting and for all , we have that Algorithm 2 outputs a 2-hop cover with at most landmarks with high probability.
Denote by the landmark set obtained by Algorithm 2, for every . We will show that with high probability:
For all , . This implies that is a 2-hop cover.
The size of is less than , for all .
Claim a) follows because the diameter of is at most with high probability by Fact 1. Note that contains , the set of vertices with distance at most . If , and already intersect. Otherwise, since the diameter is at most , these two neighborhoods must be connected by an edge . Suppose between ’s two endpoints, the one with a lower degree is on ’s side, then the local BFS from must add the other endpoint to , and vice versa. Therefore, must intersect with .
Claim b) is because is a subset of . By Fact 1, the size of is at most . Hence, the proof is complete. ∎
4. Random Power Law Graphs
In this section we analyze our algorithm on random power law graphs. We begin with the simple case of , which generalizes the result on Erdös-Rényi graphs. Because the technical intuition is the same as in Proposition 2, we describe the result below and omit the proof.
Proposition 1 ().
Let be a random power law graph with average degree and power law exponent . For each , let be the smallest integer such that the number of edges between and is at least , where .
By setting and , Algorithm 2 outputs a 2-hop cover with high probability. Moreover, each vertex uses at most landmarks.
Remark: The high level intuition behind our algorithmic result is that as long as the breadth-first search process of the graph grows neither too fast nor too slow, but rather at a proper rate, then an efficient distance labeling scheme can be obtained. Proposition 1 can be easily extended to configuration models with bounded degree variance. It would be interesting to see if our results extend to preferential attachment graphs and Kronecker graphs.
The case of : Next we describe the more interesting case with power law exponent . Here the graph contains a large number of high degree vertices. By utilizing the high degree vertices, we show how to obtain exact distance landmark schemes, (+1)-stretch schemes and (+2)-stretch schemes. The number of landmarks used varies depending on the value of . We now state our main result as follows.
Theorem 2 ().
Let be a random power law graph with average degree and exponent . Let
Let be the number of vertices whose degree is at least in . Let be any ordering of vertices by their degrees in a non-increasing order. For each vertex , let be the smallest integer such that the number of edges between and is at least , where .
With ordering , parameters and , Algorithm 2 outputs a 2-hop cover with high probability. Moreover, the maximum number of landmarks used by any vertex is at most
The above theorem says that in Algorithm 2, first we use vertices whose degrees are at least as global landmarks. Then for the other vertices , we grow a local ball of radius , whose size is (right) above . The two steps together lead to a 2-hop cover. We now build up the intuition for the proof.
Building up a -stretch scheme: First, it is not hard to show that contains a heavy vertex whose degree is , by analyzing the power law distribution. Note that , hence we have added all such high degree vertices as global landmarks. This part, together with the local balls, already gives us a -stretch landmark scheme.
To see why, consider two vertices . If their local balls (of size ) already intersect, then we can already compute their distances correctly from their landmark sets. Otherwise, since the bottom layers of and already have weight/degree , they are at most two hops apart, by connecting to the heavy vertex with degree . Recall that the heavy vertex is added to the landmark sets of every vertex. Hence, the estimated distance is off by at most one. As a remark, to get the -stretch landmark scheme, the number of landmarks needed per vertex is on the order of . This is because we only need to use vertices whose degrees are at least as global landmarks (there are only of them), as opposed to in Theorem 2.
Fixing the -stretch: To obtain exact distances, for each vertex on the boundary of radius , we add all of its neighbors with a higher degree to the landmark set (c.f. line 15-17 in Algorithm 2). Whenever there is an edge connecting the two boundaries, the side with a lower degree will add the other endpoint as a landmark, which resolves the (+1)-stretch issue. For the size of landmark sets, it turns out that fixing the -stretch for the case significantly increases the number of landmarks needed. Specifically, the costs are landmarks per node.
Intuition for the -stretch scheme: As an additional remark, one can also obtain a -stretch landmark sketch by setting in Algorithm 2 in a way such that every vertex stores the closest vertices in its landmark set. This modification leads to a -stretch scheme, because for two vertices , once the bottom layers of have size at least , they are at most three hops away from each other. The reason is that with high probability, the bottom layer will connect to a vertex with weight in the next layer (which will all be connected), as it is not hard to verify that the volume of all vertices with weight is . By a similar proof to Theorem 2, the maximum number of landmarks used per vertex is at most .
We refer the reader to Appendix A for details of the full proof. The technical components involve carefully controlling the growth of the neighborhood sets by using concentration inequalities.
5. Experiments
In this section, we substantiate our results with experiments on a diverse collection of network datasets. A summary of the findings is as follows. We first compare Algorithm 2 with the pruned labeling algorithm (Akiba et al., 2013). Recall that our approach differs from the pruned labeling algorithm by only performing a thorough BFS for a small set of vertices, while running a lightweight local ball growing procedure for most vertices. We found that this simple modification leads to a 1.5-2.5x reduction in the number of landmarks stored. The preprocessing time is reduced by 2-15x as well. While our algorithm does not always output the exact distance like the pruned labeling algorithm, we observe that the stretch is at most , relative to the average distance.
Next we compare our approach to two approximate distance sketches with strong theoretical guarantees, the Das Sarma et al. sketch (Bahmani and Goel, 2012; Das Sarma et al., 2010) and a variant of Thorup-Zwick’s 3-approximate distance oracle (Thorup and Zwick, 2005), which uses high degree vertices as global landmarks (Chen et al., 2009). We observe that our approach incurs lower stretch and requires less space compared to the Das Sarma et al. sketch. The accuracy of the Thorup-Zwick sketch is comparable to ours, but we require far fewer landmarks.
5.1. Experimental Setup
To ensure the robustness of our results, we measure performances on a diverse collection of directed and undirected graphs, with the datasets coming from different domains, as described by Table 1. Stanford, Google and BerkStan are all Web graphs in which edges are directed. DBLP (collaboration network) and Youtube (friendship network) are both undirected graphs where there is one connected component for the whole graph. Twitter is a directed social network graph with about vertices inside the largest strongly connected component. All the datasets are downloaded from the Stanford Large Network Dataset Collection (Leskovec and Krevl, 2014).
| graph | # nodes | # edges | category | type |
Implementation details: We implemented all four algorithms in Scala, based on a publicly available graph library (https://github.com/teapot-co/tempest). The experiments are conducted on a 2.30GHz 64-core Intel(R) Xeon(R) CPU E5-2698 v3 processor, 40MB cache, 756 GB of RAM. Each experiment is run on a single core and loads the graph into memory before beginning any timings. The RAM used by the experiment is largely dominated by the storage needed for the landmark sets.
| Relative Average Stretch | 0.37% | 0.20% | 0.51% | 0.29% | 0.33% | 1.1% |
| Maximum Relative Stretch | 21/10 | 10/7 | 8/5 | 4/3 | 4/3 | 7/5 |
| Average Additive Stretch | 0.046 | 0.030 | 0.060 | 0.014 | 0.018 | 0.075 |
| Maximum Additive Stretch | 11 | 3 | 3 | 1 | 2 | 2 |
Parameters: In the comparison between the pruned labeling algorithm and our approach, we order the vertices in decreasing order by the indegree plus outdegree of each vertex. (There are more sophisticated techniques such as ordering vertices using their betweenness centrality scores (Delling et al., 2014a). It is conceivable that our algorithm can be combined with such techniques.) Recall that there are two input parameters used in our approach, the number of global landmarks and the radii of local balls . To tune , we start with 100, then keep doubling to be 200, 400, etc. The radii are set to be for all graphs. (It follows from our theoretical analysis that the radii should be less than half of the average distance. As a rule of thumb, setting the radius to 2 works based on our experiments.)
Benchmarks: For the Thorup-Zwick sketch, in the first step, vertices are sampled uniformly at random as global landmarks. In the second step, every other vertex grows a local ball as its landmark set until it hits any of the sampled vertices. All vertices within the ball are used as landmarks. This method uses landmarks and achieves -stretch in the worst case. In follow-up work, Chen et al. (2009) show a variant which uses high degree vertices as global landmarks and observe better performance. We implement Chen et al.’s variant in our experiment, and use the vertices with the highest indegree plus outdegree as global landmarks. In the experiment, we start with equal to . Then we report results for multiplied by .
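The two steps of this benchmark can be sketched as follows. This is a minimal illustration rather than the Scala implementation used in the experiments. It assumes an undirected, unweighted graph given as an adjacency dictionary, so degree stands in for indegree plus outdegree. It also assumes that a local ball contains the vertices strictly closer than the nearest global landmark, and that every vertex additionally stores its distance to each global landmark. All function names and the parameter `k` are hypothetical.

```python
from collections import deque

def bfs_distances(adj, source):
    """Plain BFS: hop distance from source to every reachable vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def top_degree_landmarks(adj, k):
    """Pick the k highest-degree vertices (ties broken by vertex id)."""
    return set(sorted(adj, key=lambda v: (-len(adj[v]), v))[:k])

def local_ball(adj, source, global_landmarks):
    """Grow a ball level by level; stop at the first level that
    contains a global landmark (that level is not added to the ball)."""
    ball = {source: 0}
    if source in global_landmarks:
        return ball
    seen = {source}
    frontier = [source]
    depth = 0
    while frontier:
        depth += 1
        next_level = []
        for u in frontier:
            for w in adj[u]:
                if w not in seen:
                    seen.add(w)
                    next_level.append(w)
        if any(w in global_landmarks for w in next_level):
            break
        for w in next_level:
            ball[w] = depth
        frontier = next_level
    return ball

def build_sketch(adj, k):
    """Landmark set of each vertex: its local ball plus all global landmarks."""
    global_landmarks = top_degree_landmarks(adj, k)
    from_global = {g: bfs_distances(adj, g) for g in global_landmarks}
    sketch = {}
    for v in adj:
        labels = local_ball(adj, v, global_landmarks)
        for g, dist in from_global.items():
            if v in dist:
                labels[g] = dist[v]
        sketch[v] = labels
    return sketch
```

On a directed graph one would presumably keep separate forward and backward landmark sets; that extension is omitted here.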
For the Das Sarma et al. sketch, first, sets of different sizes are sampled uniformly from the set of vertices , for , where the size of is . Then a breadth first search is performed from , so that every vertex finds its closest vertex inside in graph distance. This closest vertex is then used as a landmark for . The number of landmarks used in Das Sarma et al.’s sketch is , and the worst case multiplicative stretch is . If a more accurate estimate is desired, one can repeat the same procedure multiple times and take the union of the landmark sets. We begin with 5 repetitions, then keep doubling to 10, 20, etc.
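A rough sketch of this sampling procedure is given below. It assumes an undirected, unweighted graph as an adjacency dictionary, and it assumes that the sampled sets have sizes 1, 2, 4, and so on up to the number of vertices; the exact sizes are not restated in this section, so that schedule is an assumption of the example. The function names and the `repetitions` parameter are hypothetical.

```python
import random
from collections import deque

def closest_seed(adj, seeds):
    """Multi-source BFS: map each reachable vertex to
    (its closest seed, the distance to that seed)."""
    nearest = {s: (s, 0) for s in seeds}
    queue = deque(seeds)
    while queue:
        u = queue.popleft()
        seed, d = nearest[u]
        for w in adj[u]:
            if w not in nearest:
                nearest[w] = (seed, d + 1)
                queue.append(w)
    return nearest

def das_sarma_sketch(adj, repetitions=1, rng=None):
    """For each sampled set, every vertex records its closest sampled
    vertex as a landmark. Repetitions are merged by taking the union."""
    rng = rng or random.Random(0)
    vertices = sorted(adj)
    sketch = {v: {} for v in vertices}
    for _ in range(repetitions):
        size = 1  # assumed schedule: set sizes double each round
        while size <= len(vertices):
            seeds = rng.sample(vertices, size)
            for v, (seed, d) in closest_seed(adj, seeds).items():
                sketch[v][seed] = d  # d is the exact distance to seed
            size *= 2
    return sketch
```

Because a single BFS from the whole sampled set serves all vertices at once, each round costs one graph traversal regardless of the set size.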
Our approach differs from the above two methods by using the idea of pruning while running BFS. This dramatically enhances performance in practice, as we shall see in our experiments.
Metrics: We measure the stretch of the estimated distances, and compute aggregated statistics over a large number of queries. For a query, if the destination is reachable from the source but the algorithm reports no common landmark between the landmark sets of the two vertices, then we count this mistake as a “False disconnect error.” On the other hand, if the destination is not reachable from the source, then it is not hard to see that our algorithm always correctly reports that it is not reachable. In the experiments, we compute the true distances using Dijkstra’s algorithm.
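The aggregation over queries might look like the following sketch. It assumes integer hop distances, represents "not reachable" and "no common landmark" as `None`, and takes the relative average stretch to be the mean of estimate/truth minus one; the precise definition used for the reported numbers is not restated here, so that formula is an assumption. The function and key names are hypothetical.

```python
from fractions import Fraction

def stretch_statistics(true_dist, est_dist):
    """true_dist / est_dist: dicts keyed by (source, destination).
    A value of None means unreachable (truth) or no common landmark (estimate)."""
    additive, relative = [], []
    false_disconnects = 0
    for pair, d in true_dist.items():
        if d is None or d == 0:
            continue  # skip unreachable and trivial queries
        est = est_dist.get(pair)
        if est is None:
            false_disconnects += 1  # reachable, but no estimate returned
            continue
        additive.append(est - d)
        relative.append(Fraction(est, d))
    if not additive:
        return {"false_disconnects": false_disconnects}
    n = len(additive)
    return {
        "avg_additive": sum(additive) / n,
        "max_additive": max(additive),
        "avg_relative_excess": float(sum(relative) / n - 1),
        "max_relative": max(relative),
        "false_disconnects": false_disconnects,
    }
```

Keeping the maximum relative stretch as an exact fraction makes it easy to report values such as 4/3 without rounding.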
To measure space usage, we report the number of landmarks per node used by each algorithm as a proxy. Since the landmark sets are stored in Int to Float hash maps, each landmark takes eight bytes (a four-byte Int key plus a four-byte Float value), so the actual space usage at runtime is eight bytes times the number of landmarks stored, with a constant factor overhead.
For the query time, recall that for each pair of vertices, we estimate their distance by looking at the intersection of their landmark sets and computing the minimum interconnecting distance (cf. equation 1). To find the minimum, we iterate through the smaller landmark set. Hence the running time is the size of the smaller landmark set multiplied by the time for a hash map lookup, which is a small fixed value at runtime. A special case is when or , where only one hash map lookup is needed. We will report the number of hash map lookups as a proxy for the query time. (Footnote: It is conceivable that more sophisticated techniques may be devised to speed up set intersection. We leave this question for future work.)
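The query procedure described above can be sketched as follows, assuming each landmark set is a dictionary from landmark to distance and the graph is undirected, so a single landmark set per vertex suffices. The function name and the returned lookup counter are illustrative and not part of the original implementation.

```python
def estimate_distance(labels_u, labels_v):
    """Return (estimated distance, number of hash map lookups).
    The estimate is None when the two landmark sets are disjoint."""
    # Iterate over the smaller set and probe the larger one.
    if len(labels_u) <= len(labels_v):
        small, large = labels_u, labels_v
    else:
        small, large = labels_v, labels_u
    best = None
    lookups = 0
    for landmark, d_small in small.items():
        lookups += 1
        d_large = large.get(landmark)
        if d_large is not None:
            total = d_small + d_large
            if best is None or total < best:
                best = total
    return best, lookups
```

For example, `estimate_distance({"a": 1, "b": 2}, {"b": 1, "c": 4, "a": 3})` probes the three-entry map twice and returns `(3, 2)`, routing through landmark `b`.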
5.2. Comparisons to Exact Methods
We report the results comparing our approach to the pruned labeling algorithm. The pruned labeling algorithm is exact. To measure the accuracy of our approach, we randomly sample 2000 pairs of source and destination vertices. The number of global landmarks is set to 400 for the Stanford dataset, 1600 for the DBLP dataset, and 800 for the rest of the datasets.
Figure 2 shows the preprocessing time, the number of landmarks and the average query time of both algorithms. We see that our approach reduces the number of landmarks used by 1.5-2.5x, except on the Twitter dataset. (Footnote: By setting the radii to 1, we incur relative additive stretch by using 173 landmarks per node, which improves over the pruned labeling algorithm by 1.5x.) Our approach performs favorably in terms of preprocessing time and query time as well.
The accuracy of our computed estimates is shown in Table 2. We have also measured the median additive stretch, which turns out to be zero in all the experiments. To get a more concrete sense of the accuracy measures, consider the Google dataset as an example. Since the average additive stretch is 0.06 and there are 2000 pairs of vertices, the total additive stretch summed over all 2000 pairs is at most 120. In particular, there can be at most 120 queries with non-zero additive stretch, and for all the other queries our approach returns the exact answer. Meanwhile, among all the datasets, we observed only one “False disconnect error” in total. It appeared in the Stanford Web graph experiment, where the true distance is 80.
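The counting step in the Google example can be written out explicitly. Here $s_i$ denotes the additive stretch of the $i$-th query and $\bar{s}$ its average; both symbols are introduced only for this illustration. The second inequality assumes that every $s_i$ is a non-negative integer, which holds when distances are hop counts and the estimate never undershoots the true distance, and the reported average of 0.060 is taken at face value.

```latex
\[
\sum_{i=1}^{2000} s_i \;=\; 2000 \cdot \bar{s} \;=\; 2000 \times 0.060 \;=\; 120,
\qquad
\bigl|\{\, i : s_i \ge 1 \,\}\bigr| \;\le\; \sum_{i=1}^{2000} s_i \;\le\; 120 .
\]
```

So at least 1880 of the 2000 sampled queries are answered exactly under these assumptions.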
5.3. Comparisons to Approximate Methods
Next we compare our approach to Das Sarma et al.’s sketch (or DS et al. sketch for short) and the variant of Thorup and Zwick’s sketch (or TZ sketch for short). Similar to the previous experiment, we sample 2000 pairs of source and destination vertices uniformly at random to measure the accuracy.
We start by setting the number of global landmarks to in Thorup-Zwick sketch. To allow for a fair comparison, we tune our approach so that the relative average stretch is comparable or lower. Specifically, the Stanford, BerkStan and Twitter datasets use , the Google and DBLP datasets use and the Youtube dataset uses .
Figure 3 shows the number of landmarks needed by each algorithm as well as the preprocessing time consumed. Overall, our approach uses far fewer landmarks than the other two algorithms. In terms of preprocessing time, our approach is comparable or faster on all datasets except the DBLP network. We suspect that this may be because the degree distribution of the DBLP network is flatter than the others. Hence performing the pruning procedure on a small subset of high degree vertices is less effective in such a scenario.
We next report the relative average stretch for all three methods. As can be seen in Table 4, our approach is comparable to or slightly better than Thorup and Zwick’s sketch, and much more accurate than Das Sarma et al.’s sketch. Note that the latter performed significantly worse than the other two approaches. We suspect that this may be because the sketch does not utilize the high degree vertices efficiently. Lastly, our approach performs favorably in the query time comparison as well. The query time of Das Sarma et al.’s sketch is not reported because of its worse accuracy.
Effect of parameter choice: Note that in the above experiment, for Thorup and Zwick’s sketch, we have set the number of global landmarks to be . In the next experiment, we vary the value of to multiplied by .
First, we report a detailed comparison on the Google dataset in Figure 5. Note that when , Thorup and Zwick’s sketch requires over 2000 landmarks per node, which is significantly larger than for the other values. Hence, we dropped that data point from the plot. For our approach, we double the number of global landmarks from 100 up to 1600. Overall, we can see that our approach requires fewer landmarks across different stretch levels.
Next, we report brief results on the Youtube dataset in Table 4 since the results are similar. The conclusions obtained from other datasets are qualitatively similar, and hence omitted.
5.4. More Experimental Observations
By varying the number of global landmarks used in Algorithm 1, it is possible to obtain a smooth tradeoff between stretch and the number of landmarks used. As an example, we present the tradeoff curve for the Stanford Web dataset in Figure 5. Here we vary the number of global landmarks used from 100 to 1000. As one would expect, the relative average stretch decreases while the number of landmarks stored increases.
6. Fundamental Limits of Landmark Sketches
This section complements our algorithm with lower bounds. We begin by showing a matching lower bound for Erdős-Rényi graphs, saying that any 2-hop cover needs to store at least landmarks. The results imply that the parameter dependence on of our algorithm is tight for Erdős-Rényi graphs and random power law graphs with power law exponent . It is worth mentioning that the results not only apply to landmark sketches, but also hold for the general family of labeling schemes. Recall that a labeling scheme associates a labeling vector with each vertex. To answer a query for a pair of vertices , only the labeling vectors of are accessed. We first state the lower bound for Erdős-Rényi graphs.
Theorem 1.
Let be an Erdős-Rényi graph where every edge is sampled with probability . With high probability over the randomness of , any labeling that can recover all-pairs distances exactly has total length at least .
In particular, any 2-hop cover needs to store at least many landmarks with high probability.
For a quick overview, we divide into sets of size each. We will show that the total labeling length for each set of vertices has to be at least . By union bound over all the sets, we obtain the desired conclusion. We now go into the proof details.
Denote by . Let , where is a fixed constant (e.g. suffices). Divide into groups of size . Clearly, there are disjoint groups – let be one of them. Denote by a fixed constant which will be defined later. We argue that
Hence by Markov’s inequality, with high probability except for groups, all the other groups will have label length at least , because . Hence we obtain the desired conclusion. For the rest of the proof, we focus on proving equation (2) for the group .
Let be an arbitrary ordering of . We grow the neighborhood of each vertex in one by one, up to level . Denote by , where and . For any , if , then we define to be the set of vertices in whose distance is at most from . Define to be the set of vertices in whose distance is equal to from . On the other hand if , then and are both empty. More formally,
We then define . Denote by the induced subgraph of on the remaining vertices . We show that with high probability, a constant fraction of vertices satisfy that .
Lemma 2 (Martingale inequality).
In the setting of Theorem 1, with high probability, at least vertices satisfy that .
For any , consider
We claim that with high probability. It suffices to consider the case and . It is not hard to verify that by our setting of . Hence the size of is at least . Note that the subgraph is still an Erdős-Rényi random graph, and the number of vertices is at least . By Fact 1c), the size of is at least
Thus by Azuma-Hoeffding inequality, with high probability. We will show below that the contributions to from and is less than . Hence by taking union bound, we obtain the desired conclusion.
First, we show that the number of such that is at most with high probability. Note that implies that there exists some vertex with such that . On the other hand, by Fact 1,
Hence, it is not hard to verify that the expected number of vertex pairs in whose distance is at most , is , by the setting of . By Markov’s inequality, with probability only vertex pairs have distance at most in . Hence there exists at most ’s such that .
Secondly for all , the set of vertices is a subset of , the set of vertices within distance to on . By Fact 1c), the size of is at most . Hence we have for all with high probability. This proves the Lemma. ∎
Now we are ready to finish the proof. Given the labels of , we can recover all pairwise distances in . Let denote the distance function restricted to . Consider the following:
pairs such that . We know by Fact 1 that , for any . Hence the expected number of pairs with distance at most in , is at most . By Markov’s inequality, the probability that a random graph induces any such distance function is .
The number of pairs such that is at most in . Let denote
By Lemma 2 and our assumption for case b), the size of is at least For any , and are clearly disjoint. Note that the event whether there exists an edge between and is independent, conditional on revealing the subgraph for all up to distance . Hence
(because ) (because )
where in the last line. Denote by . Note that the number of labelings of length (or number of bits) less than is at most . For each labeling, the probability that it correctly gives all pairs distances is at most by our argument above. Therefore by union bound, the probability that the total labeling length of is at most is at most for large enough .
To recap, by combining case a) and b), we have shown that equation (2) is true. Hence the proof is complete. ∎
Extensions to : It is worth mentioning that the lower bound on Erdős-Rényi graphs can be extended to random power law graphs with . The proof structure is similar because the degree distribution has finite variance, hence the number of high degree vertices is small. The differences are technical modifications that deal with the neighborhood growth of random graphs with constant average degree. We state the result below and leave the proof to Appendix B.
Theorem 3.
Let be a random power law graph with average degree and exponent . With high probability over the randomness of , any labeling that can recover all-pairs distances exactly has total length at least .
In particular, any 2-hop cover needs to store at least many landmarks with high probability.
Lower bounds for close to 2: Next we show that the parameter dependence of our algorithm is tight when is close to 2. Specifically, any 2-hop cover needs to store at least many landmarks when . Hence it is not possible to improve over our algorithm when is close to 2. Furthermore, the lower bound holds for the general family of labeling schemes as well.
Theorem 4.
Let be a random power law graph with average degree and exponent for . With high probability over the randomness of , any labeling that can recover all-pairs distances exactly has total length at least .
In particular, any 2-hop cover needs to store at least many landmarks with high probability.
Let be the set of vertices whose degrees are on the order of . Let be a set of vertices, where each vertex has weight between and . Such a set is guaranteed to exist because there are of them.
We first reveal all edges of other than the ones between . We show that at this stage, most vertices in are more than 3 hops away from each other. If for some pair in whose distance is larger than three, and both and connect to exactly one (but different) vertex in , then knowing whether will reveal whether their neighbors in are connected by an edge.
Based on the observation, we show that the total labeling length of is at least . This is because the random bit between a vertex pair in has entropy . Since there are pairs of vertices in , the entropy of the labelings of must be (hence, its size must also be at least ). Similar to Theorem 1, this argument is applied to disjoint sets of “”, summing up to an overall lower bound of .
7. Conclusions and Future Work
In this work, we presented a pruning based landmark labeling algorithm. We evaluated the algorithm on a diverse collection of networks, where it demonstrates improved performance compared to the baseline approaches. We also analyzed the algorithm on random power law graphs and Erdős-Rényi graphs, and showed upper and lower bounds on the number of landmarks used for both graph families.
There are several possible directions for future work. One direction is to close the gap in our upper and lower bounds for random power law graphs. We believe that any improved understanding can potentially lead to better algorithms for real world power law graphs as well. Another direction is to evaluate our approach on transportation networks, which correspond to another important domain in practice.
Acknowledgements. The authors would like to thank Fan Chung Graham, Tim Roughgarden, Amin Saberi and D. Sivakumar for useful feedback and suggestions at various stages of this work. Also, thanks to the anonymous referees for their constructive reviews. Hongyang Zhang is supported by NSF grant 1447697.
- Abraham et al. (2010) Ittai Abraham, Amos Fiat, Andrew V Goldberg, and Renato F Werneck. 2010. Highway dimension, shortest paths, and provably efficient algorithms. In Proceedings of the twenty-first annual ACM-SIAM symposium on Discrete Algorithms. Society for Industrial and Applied Mathematics, 782–793.
- Abraham and Gavoille (2011) Ittai Abraham and Cyril Gavoille. 2011. On approximate distance labels and routing schemes with affine stretch. In International Symposium on Distributed Computing. Springer, 404–415.
- Akiba et al. (2013) Takuya Akiba, Yoichi Iwata, and Yuichi Yoshida. 2013. Fast exact shortest-path distance queries on large networks by pruned landmark labeling. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data. ACM, 349–360.
- Alstrup et al. (2015) Stephen Alstrup, Søren Dahlgaard, Mathias Bæk Tejs Knudsen, and Ely Porat. 2015. Sublinear distance labeling. arXiv preprint arXiv:1507.02618 (2015).
- Althöfer et al. (1993) Ingo Althöfer, Gautam Das, David Dobkin, Deborah Joseph, and José Soares. 1993. On sparse spanners of weighted graphs. Discrete & Computational Geometry 9, 1 (1993), 81–100.
- Angelidakis et al. (2017) Haris Angelidakis, Yury Makarychev, and Vsevolod Oparin. 2017. Algorithmic and hardness results for the hub labeling problem. In Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms. Society for Industrial and Applied Mathematics, 1442–1461.
- Babenko et al. (2015) Maxim Babenko, Andrew V Goldberg, Haim Kaplan, Ruslan Savchenko, and Mathias Weller. 2015. On the complexity of hub labeling. In International Symposium on Mathematical Foundations of Computer Science. Springer, 62–74.
- Bahmani and Goel (2012) Bahman Bahmani and Ashish Goel. 2012. Partitioned multi-indexing: bringing order to social search. In Proceedings of the 21st international conference on World Wide Web. ACM, 399–408.
- Bast et al. (2016) Hannah Bast, Daniel Delling, Andrew Goldberg, Matthias Müller-Hannemann, Thomas Pajor, Peter Sanders, Dorothea Wagner, and Renato F Werneck. 2016. Route planning in transportation networks. In Algorithm engineering. Springer, 19–80.
- Bollobás (1998) Béla Bollobás. 1998. Random graphs. In Modern Graph Theory. Springer, 215–252.
- Borassi et al. (2017a) Michele Borassi, Pierluigi Crescenzi, and Luca Trevisan. 2017a. An Axiomatic and an Average-Case Analysis of Algorithms and Heuristics for Metric Properties of Graphs. In Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms. SIAM, 920–939.
- Borassi et al. (2017b) Michele Borassi, Pierluigi Crescenzi, and Luca Trevisan. 2017b. An axiomatic and an average-case analysis of algorithms and heuristics for metric properties of graphs. In Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms. SIAM, 920–939.
- Chen et al. (2009) Wei Chen, Christian Sommer, Shang-Hua Teng, and Yajun Wang. 2009. Compact routing in power-law graphs. In International Symposium on Distributed Computing. Springer, 379–391.
- Chung and Lu (2002) Fan Chung and Linyuan Lu. 2002. The average distances in random graphs with given expected degrees. Proceedings of the National Academy of Sciences 99, 25 (2002), 15879–15882.
- Chung and Lu (2006) Fan RK Chung and Linyuan Lu. 2006. Complex graphs and networks. Vol. 107. American Mathematical Society, Providence.
- Clauset et al. (2009) Aaron Clauset, Cosma Rohilla Shalizi, and Mark EJ Newman. 2009. Power-law distributions in empirical data. SIAM review 51, 4 (2009), 661–703.
- Cohen (1997) Edith Cohen. 1997. Size-estimation framework with applications to transitive closure and reachability. J. Comput. System Sci. 55, 3 (1997), 441–453.
- Cohen et al. (2003) Edith Cohen, Eran Halperin, Haim Kaplan, and Uri Zwick. 2003. Reachability and distance queries via 2-hop labels. SIAM J. Comput. 32, 5 (2003), 1338–1355.
- Das Sarma et al. (2010) Atish Das Sarma, Sreenivas Gollapudi, Marc Najork, and Rina Panigrahy. 2010. A sketch-based distance oracle for web-scale graphs. In Proceedings of the third ACM international conference on Web search and data mining. ACM, 401–410.
- Delling et al. (2014a) Daniel Delling, Andrew V Goldberg, Thomas Pajor, and Renato F Werneck. 2014a. Robust distance queries on massive networks. In European Symposium on Algorithms. Springer, 321–333.
- Delling et al. (2014b) Daniel Delling, Andrew V Goldberg, Ruslan Savchenko, and Renato F Werneck. 2014b. Hub labels: Theory and practice. In International Symposium on Experimental Algorithms. Springer, 259–270.
- Durrett (2007) Richard Durrett. 2007. Random graph dynamics. Vol. 200. Citeseer.
- Eikmeier and Gleich (2017) Nicole Eikmeier and David F Gleich. 2017. Revisiting Power-law Distributions in Spectra of Real World Networks. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 817–826.
- Enachescu et al. (2008) Mihaela Enachescu, Mei Wang, and Ashish Goel. 2008. Reducing maximum stretch in compact routing. In INFOCOM 2008. The 27th Conference on Computer Communications. IEEE.
- Gavoille et al. (2015) Cyril Gavoille, Christian Glacet, Nicolas Hanusse, and David Ilcinkas. 2015. Brief Announcement: Routing the Internet with very few entries. In Proceedings of the 2015 ACM Symposium on Principles of Distributed Computing. ACM, 33–35.
- Gavoille et al. (2001) Cyril Gavoille, David Peleg, Stéphane Pérennes, and Ran Raz. 2001. Distance labeling in graphs. In Proceedings of the twelfth annual ACM-SIAM symposium on Discrete algorithms. Society for Industrial and Applied Mathematics, 210–219.
- Goldberg and Harrelson (2005) Andrew V Goldberg and Chris Harrelson. 2005. Computing the shortest path: A search meets graph theory. In Proceedings of the sixteenth annual ACM-SIAM symposium on Discrete algorithms. Society for Industrial and Applied Mathematics, 156–165.
- Goldberg et al. (2013) Andrew V Goldberg, Ilya Razenshteyn, and Ruslan Savchenko. 2013. Separating hierarchical and general hub labelings. In International Symposium on Mathematical Foundations of Computer Science. Springer, 469–479.
- He et al. (2007) Hao He, Haixun Wang, Jun Yang, and Philip S Yu. 2007. BLINKS: ranked keyword searches on graphs. In Proceedings of the 2007 ACM SIGMOD international conference on Management of data. ACM, 305–316.
- Jiang et al. (2014) Minhao Jiang, Ada Wai-Chee Fu, Raymond Chi-Wing Wong, and Yanyan Xu. 2014. Hop doubling label indexing for point-to-point distance querying on scale-free networks. Proceedings of the VLDB Endowment 7, 12 (2014), 1203–1214.
- Kolda et al. (2014) Tamara G Kolda, Ali Pinar, Todd Plantenga, and Comandur Seshadhri. 2014. A scalable generative graph model with community structure. SIAM Journal on Scientific Computing 36, 5 (2014), C424–C452.
- Leskovec et al. (2010) Jure Leskovec, Deepayan Chakrabarti, Jon Kleinberg, Christos Faloutsos, and Zoubin Ghahramani. 2010. Kronecker graphs: An approach to modeling networks. Journal of Machine Learning Research 11, Feb (2010), 985–1042.
- Leskovec and Krevl (2014) Jure Leskovec and Andrej Krevl. 2014. SNAP Datasets: Stanford Large Network Dataset Collection. http://snap.stanford.edu/data.
- Leskovec et al. (2008) Jure Leskovec, Kevin J Lang, Anirban Dasgupta, and Michael W Mahoney. 2008. Statistical properties of community structure in large social and information networks. In Proceedings of the 17th international conference on World Wide Web. ACM, 695–704.
- Potamias et al. (2009) Michalis Potamias, Francesco Bonchi, Carlos Castillo, and Aristides Gionis. 2009. Fast shortest path distance estimation in large networks. In Proceedings of the 18th ACM conference on Information and knowledge management. ACM, 867–876.
- Sommer (2014) Christian Sommer. 2014. Shortest-path queries in static networks. ACM Computing Surveys (CSUR) 46, 4 (2014), 45.
- Thorup and Zwick (2005) Mikkel Thorup and Uri Zwick. 2005. Approximate distance oracles. Journal of the ACM (JACM) 52, 1 (2005), 1–24.
- Van Der Hofstad (2009) Remco Van Der Hofstad. 2009. Random graphs and complex networks. Available on http://www.win.tue.nl/rhofstad/NotesRGCN.pdf (2009), 11.
- Vieira et al. (2007) Monique V Vieira, Bruno M Fonseca, Rodrigo Damazio, Paulo B Golgher, Davi de Castro Reis, and Berthier Ribeiro-Neto. 2007. Efficient search ranking in social networks. In Proceedings of the sixteenth ACM conference on Conference on information and knowledge management. ACM, 563–572.
- Yahia et al. (2008) Sihem Amer Yahia, Michael Benedikt, Laks VS Lakshmanan, and Julia Stoyanovich. 2008. Efficient network aware search in collaborative tagging sites. Proceedings of the VLDB Endowment 1, 1 (2008), 710–721.
Appendix A Proof of Theorem 2: Upper Bounds for
In this section, we present the proof of Theorem 2, which analyzes the performance of Algorithm 2 on random power law graphs. We show that Algorithm 2 outputs a 2-hop cover in Proposition 1. Then we bound the landmark set sizes in Proposition 5.
We introduce a few notations first. For a set of vertices , let denote the sum of their degrees. Denote by if there is an edge between . For two disjoint sets and , denote by if there exists an edge between and , and if there does not exist any edge between and . We use to denote the maximum weight vertex. For any integer , recall that denotes the set of vertices whose distance from is equal to . And denotes the set of vertices whose distance from is at most . Let denote the number of edges between and . Let denote the second moment of any .
Throughout the section, we assume that satisfies all the properties in Proposition 4 without loss of generality.
Proposition 1.
Recall that is the radius of the local ball from . Denote by for all . Let denote the set of graphs that satisfies
We argue that Algorithm 2 finds a 2-hop cover for any , and