IS-LABEL: an Independent-Set based Labeling Scheme for Point-to-Point Distance Querying on Large Graphs
We study the problem of computing shortest path or distance between two query vertices in a graph, which has numerous important applications. Quite a number of indexes have been proposed to answer such distance queries. However, all of these indexes can only process graphs of size barely up to 1 million vertices, which is rather small in view of many of the fast-growing real-world graphs today such as social networks and Web graphs. We propose an efficient index, which is a novel labeling scheme based on the independent set of a graph. We show that our method can handle graphs three orders of magnitude larger than those handled by existing indexes.
Computing the shortest path or distance between two vertices is a basic operation in processing graph data. The operation is important not only as a key building block in many algorithms but also for its numerous direct applications. In addition to applications in transportation, VLSI design, urban planning, operations research, robotics, etc., the proliferation of network data in recent years has introduced a broad range of new applications, for example, social network analysis, page similarity measurement in Web graphs, entity relationship ranking in semantic Web ontology, routing in telecommunication networks, and context-aware search in social networking sites, to name but a few.
In many of these new applications, however, the size of the underlying graph is often in the scale of millions to billions of vertices and edges. Such large graphs are becoming more and more common; some of the well-known ones include Web graphs, various social networks (e.g., Twitter, Facebook, LinkedIn), RDF graphs, mobile phone networks, SMS networks, etc. Computing shortest path or distance in these large graphs with conventional algorithms such as Dijkstra’s algorithm or simple BFS may result in an unacceptably long running time.
For computing shortest path or distance between two points in a road network, many efficient indexes have been proposed [1, 2, 3, 8, 13, 14, 26, 27, 28]. However, these works exploit unique properties of road networks and hence are not applicable to other graphs/networks that are not similar to road networks. In recent years, a number of indexes have been proposed to process distance queries in general sparse graphs [10, 12, 13, 17, 30, 32, 33]. However, as we will discuss in detail in Section 3, these indexes can only handle relatively small graphs due to high index construction cost and large index storage space. As a reference, the largest real graphs tested in these works have only 581K vertices with average degree 2.45, and 694K vertices with average degree 0.45, while most of the other real graphs tested are significantly smaller.
We propose a new index for computing shortest path or distance between two query vertices, and our method can handle graphs with hundreds of millions of vertices and edges. Our index, named IS-LABEL, is designed based on a novel application of the independent set of a graph, which allows us to organize the graph into layers that form a hierarchical structure. The hierarchy can be used to guide the shortest path computation and hence leads to the design of effective vertex labels (i.e., the index) for distance computation.
We highlight the main contributions of our paper as follows.
We design an effective labeling scheme such that the label size remains small even if no optimization (mostly NP-hard) is applied as in the existing labeling schemes.
Our index naturally lends itself to the design of simple and efficient algorithms for both index construction and query processing.
We develop I/O-efficient algorithms to construct the vertex labels in large graphs that may not fit in main memory.
We verify both the efficiency and scalability of our method for processing distance queries in large real-world graphs.
Organization. Section 2 defines the problem and basic notations. Section 3 discusses the limitations of existing works. Sections 4 and 5 present the details of index design, and Section 6 describes the algorithms. Section 7 reports the experimental results. Section 8 discusses various issues such as handling path queries, directed graphs, and update maintenance. Section 9 concludes the paper.
We focus our discussion on weighted, undirected simple graphs. Let $G = (V_G, E_G, \omega_G)$ be such a graph, where $V_G$ is the set of vertices, $E_G$ is the set of edges, and $\omega_G$ is a function that assigns to each edge a positive integer as its weight. We denote the weight of an edge $(u, v)$ by $\omega_G(u, v)$. The size of $G$ is defined as $|G| = |V_G| + |E_G|$.
We define the set of adjacent vertices (or neighbors) of a vertex $v$ in $G$ as $adj_G(v) = \{u : (u, v) \in E_G\}$, and the degree of $v$ in $G$ as $deg_G(v) = |adj_G(v)|$.
We assume that a graph is stored in its adjacency list representation (whether in memory or on disk), where each vertex is assigned a unique vertex ID and vertices are ordered in ascending order of their vertex IDs.
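Given a path $p$ in $G$, the length of $p$ is defined as $\sum_{(x, y) \in p} \omega_G(x, y)$, i.e., the sum of the weights of the edges on $p$. Given two vertices $u, v \in V_G$, the shortest path from $u$ to $v$, denoted by $SP_G(u, v)$, is a path in $G$ that has the minimum length among all paths from $u$ to $v$ in $G$. We define the distance from $u$ to $v$ in $G$ as $dist_G(u, v)$, the length of $SP_G(u, v)$. We define $dist_G(v, v) = 0$ for any $v \in V_G$.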
Problem definition: we study the following problem: given a graph $G$, construct a disk-based index for processing point-to-point (P2P) shortest path or distance queries, i.e., given any pair of vertices $(s, t) \in V_G \times V_G$, find $SP_G(s, t)$ or $dist_G(s, t)$.
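As a point of reference for these definitions, the sketch below stores a small weighted undirected graph as adjacency lists and answers one P2P distance query with Dijkstra's algorithm, the conventional baseline that becomes too slow on very large graphs. The helper names `add_edge` and `dijkstra_distance` and the toy graph are illustrative assumptions, not part of the paper.

```python
import heapq

INF = float("inf")


def add_edge(adj, u, v, weight):
    """Insert an undirected edge (u, v) with a positive integer weight."""
    adj.setdefault(u, {})[v] = weight
    adj.setdefault(v, {})[u] = weight


def dijkstra_distance(adj, s, t):
    """Return the length of a shortest s-t path, or INF if none exists."""
    dist = {s: 0}
    heap = [(0, s)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == t:
            return d
        if d > dist[u]:
            continue  # stale queue entry
        for v, w in adj.get(u, {}).items():
            if d + w < dist.get(v, INF):
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return INF


graph = {}
add_edge(graph, "a", "b", 1)
add_edge(graph, "b", "c", 1)
add_edge(graph, "a", "c", 3)
graph.setdefault("d", {})  # isolated vertex, unreachable from the others

print(dijkstra_distance(graph, "a", "c"))
print(dijkstra_distance(graph, "a", "d"))
```

Every such query explores the graph from scratch, which is the cost an index is meant to avoid.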
We focus on sparse graphs, since most large and many fast growing real-world networks are sparse. We will focus our discussion on processing P2P distance queries. Computing the actual path will be a fairly simple extension with some extra bookkeeping, which will be discussed in Section 8, where we will also show that our index can be extended to handle directed graphs.
Table 1 gives the frequently-used notations in the paper.
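Table 1: Frequently-used notations
| Notation | Description |
|---|---|
| $G = (V_G, E_G, \omega_G)$ | A weighted, undirected simple graph |
| $|G|$ | The size of $G$ |
| $\omega_G(u, v)$ | The weight of an edge $(u, v)$ in $G$ |
| $adj_G(v)$ | The set of adjacent vertices of $v$ in $G$ |
| $SP_G(u, v)$ | A shortest path from $u$ to $v$ in $G$ |
| $dist_G(u, v)$ | The distance from $u$ to $v$ in $G$ |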
3 Limitations of Existing Work
We highlight the challenges of computing P2P distance by discussing existing approaches and their limitations.
3.1 Indexing Approaches
Cohen et al.  proposed the 2-hop labeling that computes for each vertex two sets, and , where for each vertex and , there is a path from to and from to . The distances and are pre-computed. Given a distance query, and , the index ensures that can be answered as . However, computing the 2-hop labeling, including the heuristic algorithms [12, 30], is very costly for large graphs. Moreover, the size of the 2-hop labels is too big to be practical for large graphs.
Xiao et al.  exploit symmetric structures in an unweighted undirected graph to compress BFS trees to answer distance queries. However, the overall size of all the compressed BFS trees is prohibitively large even for medium sized graphs.
Wei  proposed an index based on a tree decomposition of an undirected graph , where each node in the tree stores a set of vertices in . The distance between each pair of vertices stored in each tree node is pre-computed, so that queries can be answered by considering the minimum distance between vertices stored in a simple path in the tree. However, the pair-wise distance computation for vertices stored in the tree nodes, especially in the root node, is expensive and requires huge storage space. As a result, the method cannot scale to handle large graphs.
Recently Chang et al.  also applied tree decomposition to compute multi-hop labels that trade query efficiency of 2-hop labels  for indexing cost. Similar to , tree decomposition is an expensive operation and the graphs that can be handled by their method are still relatively small.
Jin et al.  proposed to use a spanning tree as a highway structure in a directed graph, so that distance from to is computed as the length of the shortest path from to some vertex , then from via the highway (i.e., a path in the spanning tree) to some vertex , and finally from to . Every vertex is given a label so that a set of entry points in the highway (e.g., ) and a set of exit points (e.g., ) can be obtained. However, the labeling is too costly, in terms of both time and space, for the method to be practical for even medium sized graphs (e.g., one step in the process requires all pairs shortest paths to be computed and input to another step).
The problem of P2P distance querying has been well studied for road networks. Abraham et al.  recently proposed a hub-based labeling algorithm, which is the fastest known algorithm in the road network setting. This method incorporates heuristic steps in distance labeling by making use of the concepts of contraction hierarchies  and shortest path covers . There are other fast algorithms such as , , and , that are also based on the concept of a hierarchy of highways to reduce the search space for computing shortest paths. However, it has been shown in  and  that the effectiveness of these methods relies on properties such as low VC dimensions and low highway dimensions, which are typical in road networks but may not hold for other types of graphs. Another approach is based on a concise representation of all pairs shortest paths [26, 28]. However, this approach heavily depends on the spatial coherence of vertices and their inter-connectivity. Therefore, while P2P distance querying has been quite successfully resolved for road networks, these methods are in general not applicable to graphs from other sources.
Cheng et al.  proposed an index for computing the distance from a source vertex to all other vertices, which can be used to compute P2P distance, but much computation will be wasted in computing the distances from the source to many irrelevant vertices.
3.2 Other Approaches
When the input graph is too large to fit in main memory, external memory algorithms can be used to reduce the high disk I/O cost. Existing external memory algorithms are mainly for computing single-source shortest paths [18, 22, 23, 20, 21] or BFS [5, 6, 9, 19, 24], which are wasteful for computing P2P distance. In addition, external memory algorithms are very expensive in practice.
There are also a number of approximation methods [7, 15, 25, 29, 31] proposed to compute P2P distance. Although these methods have a lower complexity than the exact methods in general, they are still quite costly for processing large graphs, in terms of both preprocessing time and storage space. We focus on exact distance querying but remark that approximation can be applied on top of our method (e.g., on the graph defined in Section 5).
4 Querying Distance by Vertex Hierarchy
In this section, we present our main indexing scheme, which consists of the following components:
A layered structure of vertex hierarchy constructed from the input graph.
A vertex labeling scheme developed from the vertex hierarchy.
Query processing using the set of vertex labels.
4.1 Construction of Vertex Hierarchy
The main idea of our index is to assign hierarchy to vertices in an input graph so that we can use the vertex hierarchy to compute the vertex labels, which are then used for querying distance.
To create hierarchies for vertices in , we construct a layered hierarchical structure from . To formally define the hierarchical structure, we first need to define the following two important properties that are crucial in the design of our index:
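Vertex independence: given a graph $G$, and a set of vertices $I$, we say that $I$ maintains the vertex independence property with respect to $G$ if $I \subseteq V_G$ and $\forall u, v \in I$, $(u, v) \notin E_G$, i.e., $I$ is an independent set of $G$.
Distance preservation: given two graphs $G_1$ and $G_2$, we say that $G_2$ maintains the distance preservation property with respect to $G_1$ if $\forall u, v \in V_{G_2}$, $dist_{G_2}(u, v) = dist_{G_1}(u, v)$.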
While distance preservation is essential for processing distance queries, vertex independence is critical for efficient index construction as we will see later when we introduce the index.
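The two properties can be checked by brute force on a toy instance, as sketched below. The function names, the dictionary-of-dictionaries graph representation, and the three-vertex example are assumptions made for illustration; the exhaustive distance comparison is only feasible for tiny graphs and is not how the index is built.

```python
import heapq
from itertools import combinations

INF = float("inf")


def is_independent_set(adj, vertices):
    """True if no two of the given vertices share an edge in adj."""
    return all(v not in adj[u] for u, v in combinations(vertices, 2))


def distances_from(adj, source):
    """Single-source shortest distances by Dijkstra's algorithm."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale queue entry
        for v, w in adj[u].items():
            if d + w < dist.get(v, INF):
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return dist


def preserves_distances(g2, g1):
    """True if every pair of vertices of g2 is as far apart in g2 as in g1."""
    for u in g2:
        d2, d1 = distances_from(g2, u), distances_from(g1, u)
        if any(d2.get(v, INF) != d1.get(v, INF) for v in g2):
            return False
    return True


# Path a - v - b with unit weights; v is the vertex to be removed.
g1 = {"a": {"v": 1}, "v": {"a": 1, "b": 1}, "b": {"v": 1}}
kept = {"a": {"b": 2}, "b": {"a": 2}}  # path through v replaced by an edge
dropped = {"a": {}, "b": {}}           # v removed and nothing added

print(is_independent_set(g1, {"a", "b"}), is_independent_set(g1, {"a", "v"}))
print(preserves_distances(kept, g1), preserves_distances(dropped, g1))
```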
We now formally define the layered hierarchical structure, followed by an illustrating example.
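[Vertex Hierarchy] Given a graph $G$, a vertex hierarchy structure of $G$ is defined by a pair $(\mathbb{L}, \mathbb{G})$, where $\mathbb{L} = \{L_1, \dots, L_h\}$ is a set of vertex sets and $\mathbb{G} = \{G_1, \dots, G_h\}$ is a set of graphs such that:
$V_G = L_1 \cup \dots \cup L_h$, and $L_i \cap L_j = \emptyset$ for $1 \le i < j \le h$;
For $1 \le i \le h$, each $L_i$ maintains the vertex independence property with respect to $G_i$, i.e., $L_i$ is an independent set of $G_i$;
$G_1 = G$, and for $2 \le i \le h$, let $G_i = (V_{G_i}, E_{G_i}, \omega_{G_i})$, then $V_{G_i} = V_{G_{i-1}} - L_{i-1}$, whereas $E_{G_i}$ and $\omega_{G_i}$ satisfy the condition that $G_i$ maintains the distance preservation property with respect to $G_{i-1}$.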
Intuitively, $\mathbb{L}$ is a partition of the vertex set $V_G$ and represents a vertex hierarchy, where $L_i$ is at a lower hierarchical level than $L_j$ for $i < j$. Meanwhile, each $G_i$ preserves the distance information in the original graph $G$, as shown by the following lemma.
For all $u, v \in V_{G_i}$, where $1 \le i \le h$, $dist_{G_i}(u, v) = dist_G(u, v)$.
Since for any , for . Thus, we have since each maintains the distance preservation property with respect to for .
We use the following example to illustrate the concept of vertex hierarchy.
Figure 1 shows a given graph and the vertex hierarchy of . We assume that each edge in has unit weight except for , which has a weight of 3. It is obvious that the set forms an independent set in , similarly in and in . It is easy to see that preserves all distances in , we shall explain the addition of edge later. In order to preserve the distance in , an edge of weight 2 is added to . consists of a single edge of weight 3. , consists of a single vertex , .
The distance preservation property can be maintained in with respect to as follows. First, we require the subgraph of induced by the vertex set to be in (i.e. iff for ). Then, we create a set of additional edges, called augmenting edges, to be included into as follows. For any vertex (thus according to Definition 4.1), if , and , then an augmenting edge is created in with . If already exists in , then . An edge in with updated weight is also called an augmenting edge. For example, in Figure 1, in , can be preserved by creating an augmenting edge with . Edge is also added according to our process above. Note that , which can be preserved in without adding , but we leave there to avoid costly distance querying needed to exclude .
The following lemma shows the correctness of constructing from as discussed above.
Constructing from , where , by adding augmenting edges to the induced subgraph of by , maintains the distance preservation property with respect to .
According to Definition 4.1, is the only set of vertices that are in but missing in . For any two vertices and in , suppose that the shortest path (in ) from to , does not pass through any vertex in , then the distance between and in is trivially preserved in . Next suppose passes through some vertex . Let . Then, we must have the augmenting edge created in with , or if already exists in . Therefore, the distance (in ) between any two vertices is preserved in .
In addition to the distance preservation property that is required for answering distance queries, the proof also gives a hint on why we require each to be an independent set of . Since there is no edge in between any two vertices in , to create an augmenting edge in we only need to do a self-join on the neighbors of the vertex . Thus, the search space is limited to 2 hops from each vertex. On the contrary, if an edge can exist between two vertices in , then to preserve the distance the search space is at least 3 hops from each vertex, which is significantly larger than the 2-hop search space in practice. This is crucial for processing a large graph that cannot fit in main memory as we may need to scan the graph many times to perform the join, as we will see in Section 6.
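The construction described in this subsection can be sketched as follows: keep the subgraph induced by the remaining vertices, then self-join the neighbors of each removed vertex to create or update augmenting edges. The function name `next_level_graph`, the dictionary-based adjacency lists, and the example graph are assumptions for illustration; the paper's own implementation is the I/O-efficient one discussed in Section 6.

```python
from itertools import combinations


def next_level_graph(adj, independent):
    """Build the next-level graph from adj and an independent set of adj."""
    # Induced subgraph on the vertices that are not removed.
    nxt = {
        u: {v: w for v, w in nbrs.items() if v not in independent}
        for u, nbrs in adj.items()
        if u not in independent
    }
    # Self-join on the neighbors of each removed vertex (2-hop search space).
    # Independence guarantees that every neighbor survives in nxt.
    for v in independent:
        for (u, w_uv), (x, w_vx) in combinations(adj[v].items(), 2):
            through_v = w_uv + w_vx
            if through_v < nxt[u].get(x, float("inf")):
                nxt[u][x] = through_v  # new or updated augmenting edge
                nxt[x][u] = through_v
    return nxt


g1 = {
    "a": {"v": 1, "b": 3},
    "b": {"v": 1, "a": 3, "c": 1},
    "c": {"b": 1},
    "v": {"a": 1, "b": 1},
}
g2 = next_level_graph(g1, {"v", "c"})
print(g2)
```

Here removing `v` lowers the weight of the existing edge between `a` and `b` from 3 to 2, while removing `c` adds nothing because `c` has a single neighbor. As in the text, an augmenting edge is added whenever two neighbors of a removed vertex are joined, without testing whether the distance could be preserved without it.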
4.2 Vertex Labeling
With the vertex hierarchy , we now describe a labeling scheme that can facilitate fast computation of P2P distance. We first define the following concepts necessary for the labeling.
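Level number: each vertex $v \in V_G$ is assigned a level number, denoted by $\ell(v)$, which is defined as $\ell(v) = i$ iff $v \in L_i$.
Ancestor: a vertex $u$ is an ancestor of a vertex $v$ if there exists a sequence $S = \langle v = w_1, w_2, \dots, w_p = u \rangle$, such that $\ell(w_1) < \ell(w_2) < \dots < \ell(w_p)$, and for $1 \le i < p$, the edge $(w_i, w_{i+1}) \in E_{G_j}$ where $j = \ell(w_i)$. Note that $v$ is an ancestor of itself. If $u$ is an ancestor of $v$, then $v$ is a descendant of $u$.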
In our example in Figure 1, the level numbers of are 1, that of are 2, that of is 3. The ancestors of will be , , , , since and are in , is in , and , are in . Note that is not an ancestor of since in the path , while . The ancestor-descendant relationships are shown in Figure 2(a).
We now define vertex label as follows.
[Vertex Label] The label of a vertex $v \in V_G$, denoted by $label(v)$, is defined as $label(v) = \{(u, dist_G(v, u)) : u \text{ is an ancestor of } v\}$.
To compute for all , we need to compute the distance from to each of ’s ancestors. This is an expensive process which cannot be scaled to process large graphs. To address this problem, we define a relaxed vertex label that requires only an upper-bound, , of and show that suffices for answering distance queries.
[Relaxed Vertex Label] The relaxed label of a vertex , denoted by , is a set of “” pairs computed by the following procedure: For each , we first include in and mark . Then, we add more entries to recursively as follows. Take a marked vertex that has the smallest level number , and unmark . Let . For each , where and , add the entry to , and mark . If the entry is already in , update . Repeat the above recursive process until no more vertex is marked.
As for , contains entries for all ancestors of . In Section 6, we will show that the new definition facilitates the design of an I/O-efficient algorithm for handling large graphs. Here, we further illustrate the concept using an example, and then prove that can indeed be used instead of to correctly answer P2P distance queries in the following subsection.
For our example in Figure 1, the ancestor relationships are shown in Figure 2(a), where all edges have unit weights unless indicated otherwise. The labeling starts with , for vertices , next vertices are labeled, followed by , , and . Consider the labeling for vertex , first, is included, since , is added to and is marked. is unmarked by checking its neighbors and in , and we include both into , and are marked. is at level 3 and is unmarked next. = , we add to . Then is unmarked, its only neighbor in is already in , is not updated. is marked. Finally is unmarked, since has no neighbor in , no further processing is required. The labels for all vertices are shown in Figure 2(b). Note that in , while , hence . In general the distance value in a label entry can be greater than the true distance.
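The mark-and-unmark procedure of the relaxed label can be sketched with a priority queue keyed by level number. This is a minimal in-memory reading of the definition, not the I/O-efficient algorithm of Section 6; the function name `relaxed_label`, the `level` and `graphs` inputs, and the 4-cycle hierarchy (which is not the graph of Figure 1) are assumptions for illustration.

```python
import heapq


def relaxed_label(v, level, graphs):
    """Return the relaxed label of v as {ancestor: upper bound on distance}.

    level[x] is the level number of x; graphs[i] is the adjacency map of G_i.
    """
    label = {v: 0}
    marked = [(level[v], v)]  # min-heap keyed by level number
    while marked:
        _, u = heapq.heappop(marked)  # marked vertex with the smallest level
        # Neighbors of u in the graph of its own level all lie at higher levels.
        for w, weight in graphs[level[u]][u].items():
            candidate = label[u] + weight
            if w not in label:
                heapq.heappush(marked, (level[w], w))  # mark w once
                label[w] = candidate
            elif candidate < label[w]:
                label[w] = candidate  # keep the smaller upper bound
    return label


# Hypothetical hierarchy for the unit-weight 4-cycle a-b-c-d-a:
# L1 = {a, c}, L2 = {b}, L3 = {d}; G2 has the augmenting edge (b, d) of weight 2.
level = {"a": 1, "c": 1, "b": 2, "d": 3}
graphs = {
    1: {"a": {"b": 1, "d": 1}, "b": {"a": 1, "c": 1},
        "c": {"b": 1, "d": 1}, "d": {"a": 1, "c": 1}},
    2: {"b": {"d": 2}, "d": {"b": 2}},
    3: {"d": {}},
}
labels = {v: relaxed_label(v, level, graphs) for v in level}
print(labels)
```

Because a vertex is only updated from vertices at strictly lower levels, its entry is final by the time it is unmarked, which is why processing marked vertices in ascending level order suffices.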
4.3 P2P Distance Querying
We now discuss how we use the vertex labels to answer P2P distance queries. We first define the following label operations used in query processing.
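Vertex extraction: $V[label(v)] = \{u : (u, dist_G(v, u)) \in label(v)\}$.
Label intersection: $label(u) \cap label(v) = V[label(u)] \cap V[label(v)]$.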
The above two operations apply in the same way to .
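Given a P2P distance query with two input vertices, $s$ and $t$, let $X = LABEL(s) \cap LABEL(t)$; the query answer is given as follows.
$$dist_G(s, t) = \begin{cases} \min_{w \in X} \{ d(s, w) + d(w, t) \} & \text{if } X \neq \emptyset \\ \infty & \text{if } X = \emptyset \end{cases} \qquad (1)$$
In Equation 1, we retrieve $d(s, w)$ and $d(w, t)$ for each $w \in X$ from $LABEL(s)$ and $LABEL(t)$, respectively. We give an example of answering P2P distance queries using the vertex labels as follows.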
Consider the example in Figure 1, the labeling is shown in Figure 2. Suppose we are interested in . We look up and . . Among these vertices, has the smallest sum of . Hence we return 3 as . Note that although the distance recorded in is 4, which is greater than , the correct distance is returned. If we want to find , . Hence is given by .
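Query evaluation by label intersection amounts to a minimum over the common label vertices, as the sketch below shows. The labels are hypothetical: they correspond to a unit-weight 4-cycle a-b-c-d-a with levels {a, c}, {b}, {d}, plus a vertex `z` in a separate component, and are not the labels of Figure 2. The function name `query_distance` is likewise an assumption.

```python
INF = float("inf")


def query_distance(label_s, label_t):
    """Evaluate Equation 1 on two labels given as {vertex: distance} maps."""
    common = label_s.keys() & label_t.keys()  # label intersection X
    return min((label_s[w] + label_t[w] for w in common), default=INF)


labels = {
    "a": {"a": 0, "b": 1, "d": 1},
    "b": {"b": 0, "d": 2},
    "c": {"c": 0, "b": 1, "d": 1},
    "d": {"d": 0},
    "z": {"z": 0},  # vertex in a different connected component
}

print(query_distance(labels["a"], labels["c"]))
print(query_distance(labels["a"], labels["b"]))
print(query_distance(labels["a"], labels["z"]))
```

An empty intersection yields infinity, matching the case in which the two query vertices are not connected.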
Query processing using the vertex labels is simple; however, it is not straightforward to see how the answer obtained is correct for every query. In the remainder of this section, we prove the correctness of the query answer obtained using the vertex labels.
We first define the concept of max-level vertex, denoted by , of a shortest path, which is useful in our proofs. Given a shortest path from to in , , is the max-level vertex of if is a vertex on and for . The following lemma shows that is unique in any shortest path.
Given two vertices and , if exists, then there exists a unique max-level vertex, , of .
First, since exists, must exist on . Now suppose to the contrary that is not unique, i.e., there exists at least one other vertex on such that , which also means that both and are in and . Since is an independent set of , there is no edge between and in . Since and are on the same path , they must be connected in and the path connecting them must pass through some neighbor of or in , where is also on . Thus, cannot be in (otherwise the vertex independence property is violated) and hence , which contradicts that is the max-level vertex of .
Next we prove that can be used to correctly answer P2P distance queries. Then, we show how possesses the essential information of for the processing of distance queries.
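Given a P2P distance query with two input vertices, $s$ and $t$, let $X = label(s) \cap label(t)$, then $dist_G(s, t) = \min_{w \in X} \{ dist_G(s, w) + dist_G(w, t) \}$ if $X \neq \emptyset$, or $dist_G(s, t) = \infty$ if $X = \emptyset$.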
We first show that if exists, then . Consider a sequence of vertices, , extracted from , such that , , and for , any vertex between and on has , and same for any vertex between and . Note that since is the next vertex after with , we have , and by the vertex independence property.
Since and are connected, they must exist together in . Since there exists no other vertex between and on such that , and are not connected by any such in . Thus, by Lemma 4.1, the edge must exist in for to preserve the distance between and , which means that for , is an ancestor of and hence . Note that if . Similarly, we have , for . Thus, and hence .
The other case is that does not exist, i.e., and are not connected, and we want to show that . Suppose on the contrary that there exists . Then, it means that there is a path from to and from to , implying that and are connected, which is a contradiction. Thus, and is correctly computed.
Theorem 4.3 reveals two pieces of information that are essential for answering distance queries: the ancestor set and the distance to the ancestors maintained in . We first show that also encodes the same ancestor set of .
For each $v \in V_G$, $V[LABEL(v)] = V[label(v)]$.
First, we show that if , i.e., is an ancestor of , then . According to the definition of ancestor, there exists a sequence , such that , and for , . This definition implies that if is currently in , will also be added to according to Definition 4.2. Since must be in , it follows that is also in .
Next, we show that if , then . First, we have , is also in . Then, according to Definition 4.2, a vertex is added to only if for some currently in , and , and since is an ancestor of , it implies that is an ancestor of and hence .
Next, we show that also possesses the essential distance information for correct computation of P2P distance.
Given a P2P distance query, and , let . If exists, then , and .
The proof of Theorem 4.3 defines a sequence, , extracted from . In particular, the proof shows that the edge exists in and , for . Thus, according to Definition 4.2, we add the entry to . Since each preserves the distance between and , and , it follows that . Similarly, we have .
Finally, the following theorem states the correctness of query processing using .
Given a P2P distance query, $s$ and $t$, $dist_G(s, t)$ evaluated by Equation 1 is correct.
5 A k-Level Vertex Hierarchy
In Definition 4.1, we do not limit the height of the vertex hierarchy, i.e., the number of levels in the hierarchy. This definition ensures that an independent set $L_i$ can always be obtained for each $G_i$, for $1 \le i \le h$. However, there are two problems associated with the height of the vertex hierarchy. First, as the number of levels increases, the label size of the vertices at the lower levels (i.e., vertices with a smaller level number) also increases. Since vertex labels require storage space and are directly related to query processing, there is a need to limit the vertex label size. Second, as we will discuss in Section 6, the complexity of constructing the vertex hierarchy is linear in $h$. Thus, reducing $h$ can also improve the efficiency of index construction.
In this section, we propose to limit the height $h$ by a $k$-level vertex hierarchy, where $k$ is normally much smaller than $h$, and discuss how the above-mentioned problems are resolved.
5.1 Limiting the Height of Vertex Hierarchy
The main idea is to terminate the construction of the vertex hierarchy early, at a level where a certain condition is met. We first define the $k$-level vertex hierarchy.
[k-level Vertex Hierarchy] Given a graph , a vertex hierarchy structure of , and an integer , where and is the number of levels in , a k-level vertex hierarchy structure of is defined by a pair , where and are defined as follows:
consists of the first levels of , i.e., and ;
is the same as the in .
The -level vertex hierarchy simply takes the first , for , and the first , for . We set the value of as follows: let be the first level such that , where () is a threshold for the effect of ; then, .
If , then is simply and is an empty graph. In practice, a value of that attains a reasonable indexing cost and storage usage will often give .
For the -level vertex hierarchy, we assign the level number for each vertex , where , while for each vertex , we assign . In this way, we can compute (or ) for each vertex in the same way as discussed in Section 4.2. Note that for each vertex since has the highest level number among all vertices in .
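A possible shape of the early-terminated construction is sketched below. Several pieces are assumptions rather than the paper's method: the stopping rule (stop once the next graph's size exceeds a fraction `sigma` of the current graph's size), the ascending-degree greedy choice of the independent set, and all function names. The sketch peels off independent sets, keeps the residual graph as the top level, and assigns level numbers as described above.

```python
from itertools import combinations


def size(adj):
    """Number of vertices plus number of undirected edges."""
    return len(adj) + sum(len(nbrs) for nbrs in adj.values()) // 2


def greedy_independent_set(adj):
    """Greedy pick in ascending degree order (an assumed heuristic)."""
    chosen, blocked = set(), set()
    for v in sorted(adj, key=lambda x: (len(adj[x]), x)):
        if v not in blocked:
            chosen.add(v)
            blocked.update(adj[v])  # neighbors can no longer be chosen
    return chosen


def next_level_graph(adj, independent):
    """Induced subgraph on the remaining vertices plus augmenting edges."""
    nxt = {u: {v: w for v, w in nbrs.items() if v not in independent}
           for u, nbrs in adj.items() if u not in independent}
    for v in independent:
        for (u, w1), (x, w2) in combinations(adj[v].items(), 2):
            if w1 + w2 < nxt[u].get(x, float("inf")):
                nxt[u][x] = nxt[x][u] = w1 + w2
    return nxt


def build_k_level(adj, sigma):
    """Return (list of removed independent sets, residual top-level graph)."""
    levels, g = [], adj
    while g:
        independent = greedy_independent_set(g)
        nxt = next_level_graph(g, independent)
        levels.append(independent)
        shrink = size(nxt) / size(g)
        g = nxt
        if shrink > sigma:  # assumed rule: removal no longer pays off
            break
    return levels, g


def level_numbers(levels, residual):
    """Level i for the i-th removed set, level k for all residual vertices."""
    k = len(levels) + 1
    level = {v: i for i, vs in enumerate(levels, start=1) for v in vs}
    level.update({v: k for v in residual})
    return level


star = {"c": {"l1": 1, "l2": 1, "l3": 1, "l4": 1},
        "l1": {"c": 1}, "l2": {"c": 1}, "l3": {"c": 1}, "l4": {"c": 1}}
levels, residual = build_k_level(star, sigma=0.05)
print(levels, residual, level_numbers(levels, residual))
```

With the small threshold the construction stops after one removal and the center of the star forms the residual graph at level 2; with a threshold close to 1 it continues until the residual graph is empty.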
Let us consider our running example in Figure 1, if we set , there is only one level in , the graph is the highest level graph and is not further decomposed. The -level vertex hierarchy is shown in Figure 3. The maximum level of vertices is 2, since all vertices in are assigned . The labels for the vertices in are shown in the following table.
5.2 P2P Distance Querying by k-Level Vertex Hierarchy
According to Section 5.1, and computed from the -level vertex hierarchy may be different from those computed from the original vertex hierarchy. However, we show later in this section that these labels are highly useful for they capture all the information that is essential from for a continued distance search in . Given a P2P distance query, and , we process the query according to whether and are in . We have the following two possible types of queries.
Type 1: and , and either or . Type 1 queries are evaluated by Equation 1.
Type 2: queries that are not Type 1. Type 2 queries are evaluated by a label-based bi-Dijkstra search procedure.
5.2.1 Label-based bi-Dijkstra Search
We describe a bidirectional Dijkstra’s algorithm that utilizes vertex labels for effective pruning. The algorithm consists of two main stages: (1) initialization of distance queues and pruning condition, and (2) bidirectional Dijkstra search.
As shown in Algorithm 1, we first initialize a forward and a reverse min-priority queue, FQ and RQ, which are to be used for running Dijkstra’s single-source shortest path algorithm from and , respectively. For any vertex , if , we add to FQ with as the key. For all other vertices in but not in , we add the record to FQ. Similarly, we initialize RQ.
The vertex labels can also be used for pruning the search space. If there exists a path between and that passes through some vertex , then Lines 5-6 initialize as the minimum length of such a path. Note that .
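The two ideas above, seeding the queue from label entries and pruning with a label-derived bound, can be illustrated with a simplified search. The sketch below is not Algorithm 1: it runs a forward-only Dijkstra search in the top-level graph instead of a bidirectional one, inserts vertices into the queue lazily rather than with an initial infinite key, and uses hypothetical names (`label_seeded_distance`, `top_graph`) and a hypothetical example in which the path s-a-b-t, with an extra vertex u attached to a, has had s, t, and u removed at level 1.

```python
import heapq

INF = float("inf")


def label_seeded_distance(label_s, label_t, top_graph):
    """Distance via labels plus a pruned search in the top-level graph."""
    # Pruning bound: best path through a vertex common to both labels.
    common = label_s.keys() & label_t.keys()
    best = min((label_s[w] + label_t[w] for w in common), default=INF)
    # Forward queue seeded with the label entries that lie in the top graph.
    dist = {w: d for w, d in label_s.items() if w in top_graph}
    heap = [(d, w) for w, d in dist.items()]
    heapq.heapify(heap)
    while heap:
        d, u = heapq.heappop(heap)
        if d >= best:
            break  # no remaining queue entry can improve the bound
        if d > dist[u]:
            continue  # stale queue entry
        if u in label_t:
            best = min(best, d + label_t[u])  # finish the path via t's label
        for x, w in top_graph[u].items():
            if d + w < dist.get(x, INF):
                dist[x] = d + w
                heapq.heappush(heap, (d + w, x))
    return best


top_graph = {"a": {"b": 1}, "b": {"a": 1}}
labels = {
    "s": {"s": 0, "a": 1},
    "t": {"t": 0, "b": 1},
    "u": {"u": 0, "a": 1},
    "a": {"a": 0},
    "b": {"b": 0},
}
print(label_seeded_distance(labels["s"], labels["t"], top_graph))
print(label_seeded_distance(labels["s"], labels["u"], top_graph))
```

In the second query the bound obtained from the common label vertex `a` is already optimal, so the search stops before expanding `b`.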