Relationship Queries on Large Graphs Using Pregel
Abstract
Large-scale graph-structured data arising from social networks, databases, knowledge bases, web graphs, etc. is now available for analysis and mining. Graph mining often involves “relationship queries”, which seek a ranked list of interesting interconnections among a given set of entities, corresponding to nodes in the graph. While relationship queries have been studied for many years, under various terminologies, e.g., keyword search, Steiner tree in a graph, etc., the solutions proposed in the literature so far have not focused on scaling relationship queries to large graphs having billions of nodes and edges, such as are now publicly available in the form of ‘linked open data’. In this paper, we present an algorithm for distributed keyword search (DKS) on large graphs, based on the graph-parallel computing paradigm Pregel. We also present an analytical proof that our algorithm produces an optimally ranked list of answers if run to completion. Even if terminated early, our algorithm produces approximate answers along with bounds. We describe an optimized implementation of our DKS algorithm along with a time-complexity analysis. Finally, we report and analyze experiments using an implementation of DKS on Giraph, the graph-parallel computing framework based on Pregel, and demonstrate that we can efficiently process relationship queries on large-scale subsets of linked open data.
1 Introduction and Motivation
Many applications produce or deal with large graphs. Such graphs could be entity-relationship graphs extracted from textual sources, relational databases modelled as graphs using foreign-key relationships among tuples, biological networks, social networks, or call data records (graphs capturing data about people calling each other). Large graphs encountered in practice are both node-labeled (containing entity descriptions) as well as edge-labeled (indicating the semantics of the relationship between nodes). Based on such labels, numeric weights can be assigned to the edges of the graph, if they are not already available, which indicate the strength of the relationship between the corresponding nodes. Since graphs generated by web-scale social applications are often massive, efficiently querying and analyzing them is a non-trivial exercise.
While graphs can be queried in a variety of ways, we are interested in a specific class of queries called relationship queries [1]. Here, a set of entity names is given as query keywords, and the objective is to find a node (root-node) in the graph such that the nodes that represent the given entities are connected to the root-node by shortest paths.
Relationship queries are particularly useful while mining for complex graph patterns, such as the detection of collusive frauds, where we want to discover relationships between entities which should not normally exist. Consider an agency that has leads from multiple terrorist activities, such as phone numbers of people involved in acts of terrorism. It often wants to discover whether there is any relationship between those leads, and identify a node (root-node) which connects them all. For example, suppose there are three leads: i) cellphones operating from a specific region, ii) specific digits in phone numbers, and iii) cellphones of people with specific names, as shown in Figure 1. This gives us three groups of nodes of the call data record graph, and relationship queries can be used to find interesting interconnections among three nodes (one node from every group). In our example, a single node connects these leads, and the answer-tree is shown in Figure 1. We aim at finding the top-K answer-trees, such that the edges in each answer-tree have small weight.
Another use-case that can be addressed using relationship queries is detecting insider trading, i.e., trading of stocks of a company by taking a cue from insider (non-public) information about the company. Insider trading is illegal in most countries. Relationship queries on a graph, constructed from a database of key office bearers of companies and their relationships with other key people, can be used to discover instances of insider trading. In a similar manner, investigating agencies might want to discover relationships between politicians and the industrialists they favor. A form of such relationship queries is also used for identifying K-effectors in a social media network [25].
The problem of relationship queries is very similar to that of keyword search on graph data [7]. Prior work on keyword search on graphs has focused on standalone algorithms [7], where the graph is small, or has advocated preprocessing of graphs [10] and the use of indexes [19] to overcome the memory bottleneck. Most of these approaches do not even attempt to find the optimal answer. Further, these solutions cannot scale, so a distributed algorithm is indicated. Similar to the related work, we also present the answer of a relationship query in the form of a tree, called an answer-tree.
Key Contributions: In this paper, i) we present an algorithm called distributed keyword search (DKS) for the problem of relationship queries, which is based on distributed parallel processing techniques. Finding an answer of such queries in the form of a tree such that it contains all query entities and has the smallest answer-tree weight is considered hard. In DKS, we search for the root-node of the answer-tree following a parallel breadth-first search (BFS) approach, and find the optimal answer most of the time. For some queries, completion of BFS may take a long time, and it may keep searching for the optimal answer-tree forever. We therefore do not propose traversal of the entire graph, and stop running subsequent iterations of BFS after DKS's exit criterion is satisfied. ii) We also present an analytical proof of the optimality of the answers discovered by DKS, i.e., when the exit criterion is satisfied and further iterations of BFS are stopped, the optimal answer is not missed. Further, sometimes BFS may become extremely slow before the exit criterion is satisfied. iii) In such situations too, we stop further iterations of BFS, and estimate the degree of optimality of the answers discovered so far. iv) We also present an optimization of the basic approach for finding a locally optimal answer-tree at every node visited during BFS traversal of the graph, as part of DKS. v) We analyze the time-complexity of DKS and demonstrate that the time taken by DKS is linear in the size of the graph, but exponential in the number of keywords, which is an acceptable norm for relationship queries. vi) Arguments made during the description of the algorithm and the analytical proof are finally corroborated by presenting empirical benchmarks on the largest dataset attempted so far by baseline approaches for this problem.
Organization of the Paper: We begin with a formal description of the problem in the next section, and present a brief description of the related work in Section 3. We then describe the DKS algorithm in Section 4, and in Section 5 we first present an optimization of the basic DKS algorithm and then analyze its time-complexity. The analytical proof of the optimality of the answers discovered by DKS is included in Section 6. Next, in Section 7, we share performance benchmarks of DKS on the largest graph data reported in related research publications. Finally, we conclude in Section 8.
2 Problem Description & Definitions
Our definition of the problem of relationship queries is motivated by the prior work [7, 10, 19, 26]. We assume that the input data is represented as a graph $G = (V, E)$. Here, $V$ is the set of nodes and $E$ is the set of directed edges between pairs of nodes. Every node has associated text, such as the name of an entity, and every edge $e \in E$ has an associated label as well as a positive numeric weight $w(e) > 0$. The label on an edge provides semantic information about the relationship between its end nodes, and the reciprocal of the numeric edge weight represents the strength of this relationship. Therefore, for better intuition, the edge weights are referred to as edge-lengths.
The objective is to execute a relationship query, comprising a set of keywords $Q = \{k_1, \ldots, k_m\}$, on the graph $G$. Here, keywords can be names of entities, and nodes that contain any keyword of the query are called keyword-nodes. Here, $V_{k_i} \subseteq V$ is the set of keyword-nodes containing keyword $k_i$. The answer of such queries is presented as a tree, which is a subgraph of $G$ such that the nodes of the answer-tree collectively contain all keywords of the query. The answer is represented as a tree because the root-node of such a tree is the common connection between all keywords (entities) of the query, and finding such a node is the key objective of relationship queries, as explained earlier.
Definition 2.1
In the context of relationship queries on a graph $G$, a minimal answer-tree is defined as a tree such that the nodes of the tree collectively contain all keywords of the query, and removing any node/edge from the answer-tree leaves a structure that is either no longer connected or no longer contains all keywords of the query.
Here, the weight of an answer-tree $T$ is calculated as the sum of the lengths of all the edges of the tree, i.e., $w(T) = \sum_{e \in T} w(e)$. Since the reciprocal of an edge-length indicates the strength of the relationship between the pair of nodes, the answer-weight should be as small as possible. The problem of relationship queries is defined below.
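As a small illustration of this definition, the following sketch (with a hypothetical edge representation of our choosing) computes the answer-weight of a tree from its edge list:

```python
# Compute the answer-weight of an answer-tree: the sum of the
# edge-lengths of all its edges (edges given as (u, v, length) triples).
def answer_weight(tree_edges):
    return sum(length for _, _, length in tree_edges)

# A tiny answer-tree rooted at "root", connecting two keyword-nodes.
w = answer_weight([("root", "a", 1.5), ("root", "b", 2.0), ("b", "k2", 0.5)])
```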
Definition 2.2
Given a set of keywords $Q$, in the context of a graph $G$, find the best $K$ minimal answer-trees, in increasing order of their weights.
It can be observed that the above problem is equivalent to the Group Steiner Tree (GST) problem, as also shown in [12, 23]. The GST problem is defined as: given a set of groups of nodes of a graph $G$, find a minimum-weight tree that contains at least one node from every group of nodes [6]. In the case of relationship queries, the keyword-nodes of every keyword are equivalent to a group of nodes of the GST problem, and the minimal answer-tree is equivalent to the group Steiner tree.
3 Related Work
Most of the prior work [31] on this problem proposes standalone solutions for generating heuristic answers. Many of these algorithms [7, 10, 21] do not even measure the degree of approximation of the answers they produce, since the problem is hard. In contrast, DKS either generates an optimal answer or predicts the degree of approximation. This problem has many different interesting aspects, such as ranking of the discovered answers [2, 4, 7, 17] and the different possible structures of the answers themselves [22]. Many different methods have been proposed as solutions to this problem, such as graph traversal [7], SQL queries [20, 2], and clustering and index-guided methods [19, 26]. DKS follows a graph-traversal-based method.
The Steiner Tree problem on graphs was surveyed by Bezensek et al. in [6]. According to this and other such surveys, most researchers have been trying to find heuristic solutions to this problem, such as the Shortest Path Heuristic, Average Distance Heuristic, and Distance Network Heuristic. Most of these have an approximation ratio $\rho$; here, $\rho$ is the ratio of the approximate answer weight detected by an algorithm to the optimal answer weight. By and large, the best solution was presented by Robins et al. [29] with a 1.55-approximation guarantee. To the best of our understanding, there has been no effort at restricting the search space of this problem, which is one of the primary contributions of our work. Kimelfeld et al. in [23] highlighted that, according to [14, 16], the Steiner Tree problem is solvable with a bounded number of keywords, and they also present a heuristic approach. We corroborate the same finding using the time-complexity analysis of our algorithm in Section 5.2.
The Steiner Tree problem, or Group Steiner Tree problem, has been attempted in multiple domains, such as the routing of network packets in computer networks, multiple applications in social networks [24, 25], and the identification of functional modules in protein networks [13]. Most of these algorithms are either heuristic approaches or apply a domain-specific constraint to solve the problem.
Many heuristic solutions using distributed and parallel computing for this computationally intractable problem have been presented, and were surveyed in [6]. The solutions that use the parallel processing paradigm are based on a shared memory across all the processors, such as the hybrid genetic algorithm based approach proposed by Presti et al. [28]. Bauer et al. [5, 6] presented distributed algorithms based on KSPH (Kruskal's shortest path heuristic) [6]. Here, keyword-nodes with the lowest index are called terminal leaders. The leaders of a subtree are made responsible for the coordination of the subtree. The closest subtrees are merged using discovery and connection steps. Later, Singh et al. [30] improved the inefficiencies of the discovery step and presented a solution which performed better than its original variant. An important limitation of such approaches is that they cannot work on large graphs and are not based on a modern parallel processing paradigm. Recently, [18] presented a solution to the keyword search problem using the MapReduce [11] paradigm; we argue that Pregel is a better choice of distributed processing paradigm for this problem, since it does not need to reload the entire graph in every iteration/superstep. We [1] present a solution to this problem using the parallel processing paradigm Pregel [27], which either discovers an optimal solution or a heuristic answer along with its approximation ratio.
4 Background & DKS Overview
Our distributed keyword search algorithm makes use of the Pregel [27] model for distributed processing. Therefore, we first provide a brief overview of Pregel.
Pregel Overview: Pregel jobs run on a compute-cluster, and every computer can be configured to have more than one worker agent; the workers run in parallel and perform most of the work. When Pregel starts to process a job on an input graph, it first distributes every node of the graph to a specific worker, chosen using a hash-function. In the Pregel framework, the input graph is processed iteratively, and these iterations are called supersteps. In every superstep, a user-defined compute() function gets called for every node of the graph on its worker, independently. Here, one common compute() function is defined for all nodes of the graph, and for all supersteps of a Pregel job. Through this compute() function, nodes send and receive messages to/from each other. When the role of a node in a Pregel job is deemed to have been completed, we call a library function voteToHalt() from the compute() function, which indicates to the Pregel framework that the current vertex will remain dormant from the subsequent superstep onwards. Such dormant nodes are referred to as inactive nodes, and the remaining nodes are referred to as active nodes. If an active node sends a message to an inactive node, the latter becomes active again. The processing comes to a halt when all nodes become inactive. Further, the Pregel framework has a provision for another agent called an Aggregator, which is a user-defined function that receives messages from all nodes of a superstep and aggregates them. The aggregated value of the messages sent in a superstep $s$ is made available to all nodes in the next superstep $s+1$.
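The superstep loop described above can be sketched in a few lines. The class and function names here are illustrative (Giraph's actual Java API differs), and the example compute() propagates shortest path-lengths from a source node, in the spirit of how DKS propagates path-lengths from keyword-nodes; a node that receives no message (or cannot improve) is implicitly inactive.

```python
class Pregel:
    """Single-process sketch of the Pregel loop (illustrative names;
    a real framework such as Giraph runs compute() on many workers)."""
    def __init__(self, graph):
        self.graph = graph              # node -> {neighbor: edge_length}
        self.outbox = {}

    def send(self, dst, msg):
        self.outbox.setdefault(dst, []).append(msg)

    def run(self, compute, initial_msgs, max_supersteps=50):
        inbox, state = dict(initial_msgs), {}
        for _ in range(max_supersteps):
            if not inbox:               # every node inactive: job halts
                break
            self.outbox = {}
            for node, msgs in inbox.items():   # conceptually in parallel
                compute(self, node, msgs, state)
            inbox = self.outbox         # messages wake nodes next superstep
        return state

# Example compute(): propagate the shortest path-length from a source
# node. A node that cannot improve its value sends nothing, i.e., halts.
def compute(ctx, node, msgs, state):
    best = min(msgs)
    if best < state.get(node, float("inf")):
        state[node] = best
        for nbr, length in ctx.graph[node].items():
            ctx.send(nbr, best + length)

graph = {"a": {"b": 1.0}, "b": {"a": 1.0, "c": 2.0}, "c": {"b": 2.0}}
dist = Pregel(graph).run(compute, {"a": [0.0]})
```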
To describe the DKS algorithm, we take the help of a few terms that are defined here. A subset of the query keywords $Q$ is called a keyword-set $ks$. The set of all keyword-sets is the power set of $Q$ minus the empty set, i.e., $KS = 2^{Q} \setminus \{\emptyset\}$; here, $|KS| = 2^{m} - 1$ for $m$ query keywords. If we drop one or more keyword-nodes and related edges from a minimal answer-tree, the tree that is left is called a partial answer. We also define the path-length of a keyword-set $ks$, at a node $v$, as the weight of a partial answer rooted at $v$ and containing the keyword-set $ks$.
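The keyword-sets are easy to enumerate; the sketch below (illustrative code) lists all non-empty subsets of the query keywords, of which there are $2^m - 1$:

```python
from itertools import combinations

# Enumerate all keyword-sets: the non-empty subsets of the query
# keywords; for m keywords there are 2**m - 1 of them.
def keyword_sets(keywords):
    return [frozenset(c)
            for j in range(1, len(keywords) + 1)
            for c in combinations(sorted(keywords), j)]

ks = keyword_sets({"k1", "k2", "k3"})
```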
4.1 DKS Algorithm
We first preprocess a graph and prepare it for running DKS. Here, we calculate the edge-lengths, if not already present, and also create an inverted index [32] of the text associated with the nodes of the graph. For all directed edges of the graph, we also include the reverse edges with the same edge-weight, so that suitable answer-trees can be discovered irrespective of the direction of the relationship between nodes. The high-level flow of the DKS algorithm is shown in Figure 2(c): we first search for the query keywords in the inverted index, and identify the keyword-nodes that become the starting points of parallel BFS traversal (Find Answer-trees). During BFS traversal of the graph, at every node we evaluate whether it has a path to keyword-nodes of all query keywords. BFS traversal on a contrived example is explained in the next paragraph. All such answer-trees, found at various nodes of the graph, are aggregated to find the global top-K answer-trees.
We explain the BFS traversal of DKS through a contrived example shown in Figure 2(a). Here, in the first superstep, we send messages from the keyword-nodes to their neighboring nodes. Each message contains the path to the keyword-node from the neighboring node, and the corresponding path-length. All the other nodes of the graph remain dormant. The neighboring nodes of the keyword-nodes receive the message(s) in the next superstep. Such nodes send a message to their unexplored neighboring nodes; the message contains the paths to keyword-nodes known at the sending node. This process continues through subsequent supersteps. The state of a sample graph after supersteps 1, 2, and 3 is shown in Figure 2(a). When a node receives a message for the first time, it is declared a frontier node. Finally, in a later superstep, the control reaches a node (star-marked) that has a path to at least one keyword-node of every keyword of the query, i.e., it is the root-node of an answer-tree. A more detailed description of the DKS algorithm is given below, with the help of the flowchart shown in Figure 2(b).
Step 1 (Receive Messages): In a superstep, nodes that receive message(s) become active and all the other nodes remain dormant. The set of paths (each in the form of a tree), received as incoming messages, forms a tree with the current node at the root. This tree is referred to as the local-tree of a node; a sample local-tree of a node is shown in Figure 3.
In the local-tree of a node, there can be more than one subtree which contains all keywords of the query, e.g., in Figure 3. We drop those branches of a local-tree that are not part of the top-K partial answers of any keyword-set in that local-tree, and the remaining tree is called the filtered local-tree. Such branches are shown by dotted lines in Figure 3. If the top-K partial answers of all keyword-sets are retained, we do not miss the global top-K answers, as shown in the analytical proof in Section 6.2.4. At every node we maintain two data-structures, which contain the top-K path-lengths of all keyword-sets and the sets of node-ids contained in the corresponding trees, respectively. Calculation of these sets is one of the most compute-intensive tasks of the DKS algorithm, since it iterates over the power set of the set of input keywords; an optimized approach for the calculation of these sets is therefore given in Section 5.
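A possible way to maintain the top-K (smallest) path-lengths per keyword-set at a node is a bounded max-heap per keyword-set, sketched below (an illustrative structure of our choosing; the paper's data-structures additionally track the node-ids of each partial answer):

```python
import heapq

# Maintain, for each keyword-set, the K smallest path-lengths seen so
# far, using a size-K max-heap (negated values) per keyword-set.
def update_top_k(table, keyword_set, path_length, K):
    """table: keyword_set -> list of up to K smallest path-lengths."""
    lengths = table.setdefault(keyword_set, [])
    if len(lengths) < K:
        heapq.heappush(lengths, -path_length)     # max-heap via negation
    elif path_length < -lengths[0]:
        heapq.heapreplace(lengths, -path_length)  # evict current largest

table = {}
for pl in [5.0, 2.0, 7.0, 1.0]:
    update_top_k(table, frozenset({"k1", "k2"}), pl, K=2)
best = sorted(-x for x in table[frozenset({"k1", "k2"})])
```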
Step 2 (Evaluate): Based on the local-tree of a node, it is possible to determine whether the node has a path to keyword-nodes of all the keywords or not. If yes, it declares itself the root-node of an answer. We extract the local top-K answer-trees from the filtered local-tree of a node, following an approach described in Section 5. Next, we describe how to identify the global top-K answer-trees from many such local top-K answer-trees.
Step 3 (Sending Aggregator Messages): After extraction of the local top-K answers from the local-tree, we calculate the path-lengths of all keyword-sets in every answer. The extracted answers and the corresponding path-lengths are sent to an answer aggregator. We also extract the smallest path-length of every keyword-set and send it to another aggregator. Details of these aggregators are given in Step 5.
Traversing the entire graph would lead to too many messages being exchanged, and DKS might become extremely slow and never finish. Therefore, we need to stop the BFS traversal as soon as possible. We stop the BFS traversal when we are certain that further traversal of the graph will not lead to a better answer-tree than those found so far; this condition is referred to as the exit criterion. The aggregated values of the above metrics are used for the evaluation of the exit criterion in Step 6.
Step 4 (Send BFS / Deep Messages): Active nodes of a superstep send the filtered local-tree, along with the sets of top-K path-lengths and node-ids of the sending node, as a message to their neighbors for BFS traversal, in order to locate a node which has paths to all keywords of the query. However, using BFS traversal we can only discover the trees that are balanced at the root-node. For example, the answer-tree shown in Figure 4(a) is not balanced at its root-node: when following BFS traversal, the root-node receives messages from its nearer keyword-nodes, but remains unaware of the path to the farther keyword-node, primarily because in a distributed setting every node is processed independently, on potentially a different worker. Therefore, we send messages to both sides of a pair of nodes containing a path to each other, so that each side learns the path known at the other. As a result, the nodes on both sides get identified as root-nodes of an answer-tree. Such messages are called deep messages. The deep messages need to be propagated recursively to cover cases such as the one shown in Figure 4(b). Note that even after the exit criterion is satisfied, we receive and propagate the deep messages that were sent before the exit criterion was satisfied.
Step 5 (Aggregation): We use two aggregators, and their aggregated values are used for the evaluation of the exit criterion described in Step 6. The answer aggregator: i) removes duplicate answers, ii) identifies the global top-K answers, and iii) calculates the largest path-lengths of all keyword-sets among the global top-K answers, by aggregating the path-length sets from its messages. Further, at every active node of a superstep, we calculate the smallest path-lengths of all keyword-sets and send them to the second aggregator. In this aggregator, we determine the smallest of these path-lengths across all nodes, and prepare a set containing the smallest path-length of every keyword-set found in the superstep.
Step 6 (Check for Exit): If we can say that subsequently discovered answer-trees will have weights more than the best found so far, we can stop traversing further. For this, we estimate the smallest path-length of every keyword-set in the next superstep as its smallest path-length in the current superstep plus $e_{\min}$. Here, $e_{\min}$ is the smallest edge weight in the graph, and therefore the estimate never exceeds the actual smallest path-length of the next superstep. If all these estimated path-lengths are larger than the corresponding largest path-lengths among the global top-K answers found so far, we can say that all subsequent answer-trees will be worse than those found so far. This condition is referred to as the exit criterion for BFS traversal.
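The exit check can be sketched as follows (hypothetical variable names): stop the BFS when, for every keyword-set, the estimated smallest path-length reachable in the next superstep already exceeds the largest path-length of that keyword-set among the global top-K answers found so far.

```python
# Exit criterion sketch: min_path_len holds the smallest path-length of
# each keyword-set in the current superstep; topk_max_path_len holds the
# largest path-length of each keyword-set among the global top-K answers.
def should_exit(min_path_len, topk_max_path_len, min_edge_weight):
    return all(
        min_path_len[ks] + min_edge_weight > topk_max_path_len[ks]
        for ks in topk_max_path_len
    )

ks1, ks2 = frozenset({"k1"}), frozenset({"k2"})
stop = should_exit({ks1: 4.0, ks2: 6.0}, {ks1: 4.5, ks2: 6.2}, 1.0)
```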
5 DKS Optimization and Analysis
Identification of the local top-K path-lengths for all keyword-sets (i.e., the sets of top-K path-lengths and corresponding node-ids maintained at every node) is one of the most compute-intensive tasks in the DKS algorithm. Therefore, we first present an optimized approach for this calculation, followed by an analysis of the computational and communication cost.
5.1 Optimization for Local Tree Filtering
To understand the problem of calculating these sets, let us consider the brute-force method first. It involves steps such as: a) traverse the local-tree and store keyword-wise paths from the root-node to the keyword-nodes; b) for every keyword-set, generate the various combinations of these paths; c) for each such combination, find the path-length of the corresponding partial answer, which is not equal to the sum of the path-lengths of the single keywords contained in the keyword-set, because some of the edges between two paths may be common. We will therefore need to traverse the local-tree in order to find the overlapping edges and then calculate the path-length. d) Finally, find the local top-K path-lengths (for every keyword-set) from the various path-lengths of a keyword-set. All this will require traversing the local-tree an exponential number of times (in the number of keywords $m$) at every active node, making it a compute-intensive task.
As a first step towards optimizing this process, we maintain the two data-structures described in Section 4 (top-K path-lengths and node-ids) at every node. Further, we assume that in the local-tree of a node there are, on average, $r$ keyword-nodes for every keyword. Therefore, if a keyword-set contains $j$ keywords, we will have to evaluate $r^{j}$ different trees. There will be $\binom{m}{j}$ keyword-sets that contain $j$ keywords. For the evaluation of the top-K path-lengths of all such keyword-sets, we will have to evaluate $\binom{m}{j} r^{j}$ trees. Therefore, the total number of trees that we will need to evaluate for the top-K path-lengths of all keyword-sets, in order to fill these data-structures, is given in (1):
$$\sum_{j=1}^{m} \binom{m}{j}\, r^{j} \;=\; (1+r)^{m} - 1 \qquad (1)$$
Further, since $r$ can be high, it will become hard to evaluate these data-structures. However, starting from the keyword-nodes, if every node maintains these data-structures and also passes them to its neighboring nodes in the message payload, then there will be two benefits: i) we will not need to traverse the local-tree to find the answer-trees for the various keyword combinations, i.e., steps (a) to (c) of the brute-force approach are not required anymore; ii) the maximum value of $r$ will be $Kd$, because each message can contain at most $K$ keyword-nodes for every keyword, and $d$ is the number of messages a node receives. Therefore, the total number of path-lengths to be evaluated will be $(1+Kd)^{m} - 1$, using Eq. (1).
If we process each message separately, merging one incoming message at a time into the data-structures accumulated so far, the total number of trees evaluated for a pair of messages will be $(1+2K)^{m} - 1$, and therefore for processing all $d$ incoming messages at a node we will have to evaluate $(d-1)\,((1+2K)^{m} - 1)$ trees, which is less than $(1+Kd)^{m} - 1$. Effectively, the time-complexity of preparing the data-structures at a node will be $O(d\,(2K)^{m})$. A further key benefit of these data-structures is that we can purge the nodes that are not present in the node-id sets to obtain the filtered local-tree.
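As a sanity check on the counting argument behind Eq. (1), the snippet below compares the sum over keyword-set sizes with its closed form $(1+r)^m - 1$, which follows from the binomial theorem:

```python
from math import comb

# Total trees evaluated: for each keyword-set size j, there are
# C(m, j) keyword-sets and r**j candidate trees per keyword-set.
def trees_by_sum(m, r):
    return sum(comb(m, j) * r**j for j in range(1, m + 1))

# Closed form of the same sum via the binomial theorem.
def trees_closed_form(m, r):
    return (1 + r)**m - 1
```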
5.2 Time Complexity
Computationally, there are two compute-intensive parts of DKS. The first is the calculation of the top-K path-lengths and node-id sets at active nodes; secondly, for many miscellaneous tasks we need to traverse the local-tree of a node, e.g., purging of extra branches, extraction of the top-K answer-trees, etc. The worst-case time-complexity of the first task was analyzed in Section 5.1. Assuming that early exit is not effective and we need to perform this task at every node of the graph, the time-complexity of this task for the entire graph will be $O(n\,d\,(2K)^{m})$. Here, $n$ is the number of nodes in the graph, and the average degree $d$ of a node is assumed to be equivalent to the average number of messages at every node. For the second task, we need to estimate the average size of the filtered local-tree of a node. For this, assuming every node has a small number of child nodes $c$, and the height of this tree is $h$, there will be $O(c^{h})$ nodes in the local-tree. Such a tree needs to be maintained and traversed at every node of the graph; therefore, the total time-complexity of the DKS algorithm can be taken as $O(n\,(d\,(2K)^{m} + c^{h}))$. It is observed that most of the time $c$ and $h$ are very small integers, and often in keyword search the number of keywords in a typical query is small, so the problem becomes tractable. Further, it is evident that the worst-case time-complexity of the DKS algorithm is linear in the number of nodes and edges in the graph, while exponential in the number of query keywords. Therefore, if the number of keywords is high, or we are interested in too many answers (a high value of K), the DKS algorithm will not perform efficiently.
5.3 Dist. processing & Communication Cost
If we were to execute the DKS algorithm in standalone mode, without any fundamental modification, we would run a loop for every superstep. At every frontier node we would combine the filtered local-trees of its neighboring nodes to get its local-tree, instead of getting them as messages as done in the distributed implementation. We would then filter this tree to get the filtered local-tree. This involves the same process of calculating the top-K path-lengths as performed in the distributed version. Therefore, the primary difference between standalone mode and distributed mode is the communication overhead.
To estimate the communication cost, we assume that, on average, the number of messages sent by a node is directly proportional to the degree of the node, i.e., linear in the average degree of the nodes of the graph. Therefore, the total number of messages passed will be directly proportional to the total number of edges in the graph. We estimated the average size of the local-tree of a node to be $O(c^{h})$. Therefore, we take the total communication cost of the DKS algorithm to be $O(|E|\,c^{h})$. Here, $c$ and $h$ are not high, since we are not interested in finding answer-trees with large height.
5.4 Practical Issues
It was observed that the system hangs if the total number of messages to be received in a single superstep is more than a million (especially after the first two supersteps). We stopped subsequent supersteps when this limit was reached, and estimated the smallest possible answer weight which can get discovered by further exploration of the graph. The ratio of this estimated smallest possible answer weight (described below) and the best answer-weight found by our algorithm is reported as the SPA-ratio.
Estimation of the smallest possible answer weight: At the end of a superstep, the set of smallest path-lengths of all keyword-sets is known, and we can estimate the smallest path-lengths of all keyword-sets in the next superstep. From the smallest path-lengths of all keyword-sets, we want to estimate the smallest possible answer weight. To construct an answer from the keyword-sets, we choose a subset of keyword-sets in such a manner that the chosen keyword-sets collectively contain all query keywords. We use dynamic programming to exhaustively search the entire search space and find the smallest possible answer weight that can get discovered by further BFS exploration of the graph. We report this as the smallest possible answer-weight. The smallest possible answer-weight helps us estimate the degree of approximation of the detected answer-trees, referred to as the SPA-ratio, as described above.
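The dynamic program described above can be sketched as a subset-cover DP over bitmasks of covered keywords (an illustrative implementation; the function and variable names are ours): it finds the cheapest collection of keyword-sets whose union covers all query keywords, given the smallest known path-length of each keyword-set.

```python
def smallest_possible_answer_weight(keywords, min_path_len):
    """min_path_len: frozenset of keywords -> smallest known path-length.
    Returns the minimum total path-length of keyword-sets whose union
    covers all query keywords (a lower bound on further answer weights)."""
    idx = {k: i for i, k in enumerate(sorted(keywords))}
    full = (1 << len(keywords)) - 1
    best = {0: 0.0}                  # bitmask of covered keywords -> cost
    for ks, length in min_path_len.items():
        mask = 0
        for k in ks:
            mask |= 1 << idx[k]
        # Relax every known state with this keyword-set (used at most once).
        for covered, cost in list(best.items()):
            new_mask = covered | mask
            if cost + length < best.get(new_mask, float("inf")):
                best[new_mask] = cost + length
    return best.get(full, float("inf"))

lengths = {frozenset({"k1"}): 2.0, frozenset({"k2"}): 3.0,
           frozenset({"k1", "k2"}): 4.0, frozenset({"k3"}): 5.0,
           frozenset({"k2", "k3"}): 6.0}
lower_bound = smallest_possible_answer_weight({"k1", "k2", "k3"}, lengths)
```

Here the cheapest cover is {k1} together with {k2, k3}, beating both {k1, k2} + {k3} and the three singletons.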
6 Analytical Proof
In this section we state a theorem about the optimality of the answers discovered by our algorithm, and subsequently present an analytical proof.
Theorem 1
The breadth-first-search traversal on a graph having positive edge-lengths, executed to find the top-K minimum Steiner trees, can be stopped after a superstep $s$, without missing the optimal answer, when Eq. (2) is satisfied.
$$\forall\, ks \in KS:\quad \hat{P}^{\,s+1}_{\min}(ks) \;>\; P^{\,s}_{\max}(ks) \qquad (2)$$
Here, $P^{\,s}_{\max}(ks)$ denotes the largest path-length of keyword-set $ks$ among the global top-K answers found at the end of superstep $s$; and $\hat{P}^{\,s+1}_{\min}(ks)$ is the estimated shortest path-length of keyword-set $ks$ for superstep $s+1$, such that $P^{\,s}_{\min}(ks) \le \hat{P}^{\,s+1}_{\min}(ks) \le P^{\,s+1}_{\min}(ks)$, i.e., the estimated shortest path-length of a keyword-set for a superstep should not exceed the corresponding actual shortest path-length of that superstep.
6.1 Overview and Intuition
When searching for the answer-trees, following the algorithm described in Section 4, after we discover the first K answer-trees at the aggregator, we are not sure whether these are globally optimal. We can stop the BFS exploration only when we are sure that further exploration will not lead to any better answer-tree. For this, we need to estimate the smallest possible weight of an answer-tree that can get discovered by further iterations of BFS. If this estimated answer weight is more than the largest answer weight found so far, then we need not perform the BFS exploration any further. This evaluation should happen between every two consecutive supersteps.
It is computationally hard to estimate the smallest answer-weight of a subsequent superstep, even if we can estimate the smallest path-lengths of all keyword-sets for that superstep. This is because the keyword-sets are the elements of the power set of the keywords, and many different combinations of these keyword-sets can make an answer-tree. Also note that the union of all keyword-sets is the set of all keywords. Therefore, we establish the exit criterion based on Fagin's algorithm [15]; a brief summary of Fagin's algorithm is given in Appendix A. To make Fagin's algorithm applicable in our setting, we represent the answer weight as an aggregate function of the path-lengths of its keyword-sets, which increases monotonically with increases in the path-lengths. Further, a sorted list of the arguments of this function, i.e., of the keyword-sets, should be present, which is actually not available.
We observed that the shortest path-lengths of all keyword-sets increase monotonically across consecutive supersteps of breadth-first-search, and can work as a proxy for the sorted list. Therefore, we start with Lemma 6.1, where we prove this formally. As a result, Fagin's algorithm becomes applicable, and therefore in Lemma 6.2 we state that an answer-tree can be found only at a specific set of nodes, referred to as candidate-nodes. Here, a node is considered a candidate node if at least one of its estimated shortest path-lengths in the next superstep is smaller than the largest path-length of the corresponding keyword-set among the top-K answer-trees found so far.
We further restrict the search of candidate nodes with the help of Lemma 6.3, where we state that it is sufficient to evaluate only those candidate nodes that are at the frontier of the breadth-first-search traversal, which results in better efficiency. Therefore, we keep performing the BFS exploration until no more candidate nodes are left, i.e., Eq. 2 is satisfied. Further, in Lemma 6.4 we state that by purging those branches of a local-tree that are not in the local top-K, we don't miss any globally optimal answer-tree. As a result, the aforementioned theorem is finally proved: the problem of keyword search can be solved using BFS exploration, and we don't need to traverse the entire graph to find the top-K answer-trees.
6.2 Proof Details
6.2.1 Shortest Path-Length Increases
Lemma 6.1
The shortest path-length of a keyword-set among all frontier-nodes of a superstep will strictly increase in a subsequent superstep, i.e., $\ell_{s+1}(\kappa) > \ell_s(\kappa)$.

Here, $\ell_s(\kappa)$ is the shortest path-length of a keyword-set $\kappa$ at the frontier-nodes of superstep $s$. As explained below via Figure 5, during BFS traversal at any node, two different types of paths to a keyword-set can get discovered, based on the two types of messages received: BFS messages and deep messages. We prove the above Lemma for both of these cases.
Proof Case (i): For the first type of paths to keyword-sets, i.e., those discovered through BFS messages, it is straightforward to see that the shortest path-length of any keyword-set can only increase in a subsequent superstep: each BFS hop adds a positive edge weight, so the path-length of a keyword-set discovered at superstep $s+1$ will definitely be more than that at superstep $s$.
Proof Case (ii): For paths received through deep messages, we want to prove the following: if a node has shortest path-length $\ell_s(\kappa)$ for a keyword-set $\kappa$ in superstep $s$, and we get to know of another path to the same keyword-set through deep-traversal in a subsequent superstep $s+d$ ($d \geq 1$), then the length $\ell'$ of this new path will be more than $\ell_s(\kappa)$, i.e., $\ell' > \ell_s(\kappa)$.
We can state that $\ell' \geq \ell_s(\kappa) + d \cdot w_{min}$. Here, $\ell'$ is the length of the new path, $\ell_s(\kappa)$ is the shortest path-length of keyword-set $\kappa$ in superstep $s$, $w_{min}$ is the smallest edge weight in the graph, and $d$ is the difference between the superstep numbers in which the two paths were discovered. This can be asserted because in superstep $s$, a part of this new path would already have been discovered, and the length of that part of the path would be more than or equal to the shortest path-length $\ell_s(\kappa)$. Since both $d$ and $w_{min}$ are positive, the shortest path-length of any keyword-set increases by at least $w_{min}$ in every superstep.
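The lower bound used in this case can be illustrated with a small helper (a sketch; the names are hypothetical):

```python
def min_possible_length(shortest_at_s, d, w_min):
    """Lemma 6.1, case (ii) (sketch): lower bound on the length of any path
    to the same keyword-set discovered d supersteps after the shortest one,
    where w_min is the smallest (positive) edge weight in the graph."""
    assert d >= 1 and w_min > 0
    return shortest_at_s + d * w_min

# e.g., shortest length 3.0 in superstep s, smallest edge weight 0.5:
# a path found two supersteps later must be at least 3.0 + 2 * 0.5 = 4.0 long.
```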
Note: Using a similar argument, we can also state that the shortest path-length of a keyword-set, among all active nodes of a superstep, can occur only at the frontier-nodes of that superstep.
6.2.2 Identify Candidate Nodes
An overview of Fagin's algorithm [15] is given in Appendix A; it forms the basis for the next Lemma. To apply Fagin's algorithm in this setting, we need sorted lists of the input arguments of the aggregate function, and the aggregate function must be monotonic w.r.t. its input arguments. Since the shortest path-length of every keyword-set increases in every subsequent superstep, we can imagine that for every keyword-set a sorted list exists, comprising its shortest path-lengths across supersteps, as explained in Figure 6. We also define the weight of an answer-tree as a function of the path-lengths of all keyword-sets at the root-node of the answer-tree, as given in Eq. 3. As a result, Fagin's algorithm becomes applicable; we can identify candidate-nodes, and subsequently establish the exit criterion.
(3) $W(T) = \sum_{\kappa \in \mathcal{C}} \ell(\kappa, r)$
Where $r$ is the root-node of the answer-tree $T$, $\ell(\kappa, r)$ is the path-length of keyword-set $\kappa$ at $r$, and $\mathcal{C}$ is the set of constituent keyword-sets of $T$.
Here, a constituent keyword-set is a keyword-set comprising all keywords of a subtree rooted at a child-node of the root-node of the local-tree. The function given in Eq. 3 can be proved to be monotonic, because it is a conditional summation of strictly positive input arguments. Next, we define a set of candidate-nodes based on Fagin's algorithm; it is the set of nodes that can be part of the global top-K answer-trees after $s$ supersteps.
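Under the reading above, the weight function of Eq. 3 is a summation over the constituent keyword-sets, which makes its monotonicity immediate: increasing any path-length can only increase the total. A sketch (the map representation is an assumption):

```python
def answer_weight(pathlens, constituent_sets):
    """Eq. 3 (sketch): weight of an answer-tree rooted at a node is the sum
    of the path-lengths of its constituent keyword-sets.  Monotonic, since
    each argument is strictly positive and enters the sum additively."""
    return sum(pathlens[ks] for ks in constituent_sets)
```

For example, increasing the path-length of any single constituent keyword-set strictly increases the resulting tree weight, which is exactly the monotonicity Fagin's algorithm requires.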
Lemma 6.2
After any K answers are found at the aggregator, the set of candidate-nodes comprises the nodes for which at least one of the estimated shortest path-lengths is smaller than the corresponding largest path-length among the top-K answers found so far.
For example, in Figure 6, cells marked with a circle are the largest path-lengths among the top-K answers found so far, and path-lengths at candidate-nodes are marked in light-green. Here, it is important to note that after finding K answer-trees in a superstep, we can still find a better answer in a subsequent superstep, since we explore nodes in BFS order and not in increasing order of their constituent path-lengths.
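The membership test of Lemma 6.2 can be sketched as follows, assuming each node carries a map from keyword-sets to its estimated shortest path-lengths (the names and representation are hypothetical):

```python
def is_candidate(node_est, topk_largest):
    """Lemma 6.2 (sketch): a node is a candidate if at least one of its
    estimated shortest path-lengths is smaller than the corresponding
    largest path-length among the top-K answers found so far."""
    return any(ks in node_est and node_est[ks] < topk_largest[ks]
               for ks in topk_largest)
```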
6.2.3 BFS Stopping Criterion
According to Fagin's algorithm, the remaining attributes of the candidate objects should be accessed in random order to identify the global top-K answer-trees. However, our setting differs from that described in Appendix A in that there can be more than one value of the same attribute (the path-length of a keyword-set) at any node. We need to consider all such path-lengths that are smaller than the corresponding largest path-lengths among the top-K answers found so far. Therefore, instead of performing random access of the remaining path-lengths, we continue to perform BFS exploration until the shortest path-lengths of all keyword-sets at frontier-nodes are larger than or equal to those path-lengths. The nodes in the candidate set get revised in every subsequent superstep; to this effect we present the next Lemma.
Lemma 6.3
Random access of the remaining attributes of candidate-nodes is equivalent to random access of the remaining attributes at frontier-nodes of subsequent supersteps, or at traversed candidate nodes that receive a message.
Here, candidate nodes that were traversed at least one superstep before the current one are referred to as traversed candidate nodes. We prove this Lemma with respect to two types of nodes: (i) frontier nodes of the previous superstep, and (ii) traversed candidate-nodes.
Frontier nodes of previous superstep: At least one of the neighboring nodes of a candidate-node must also be part of an answer-tree for the candidate-node to be part of that answer-tree. Since the frontier-nodes of the current superstep are neighboring nodes of the frontier-nodes of the previous superstep, Lemma 6.3 holds for frontier nodes of the previous superstep.

Traversed candidate nodes: A traversed candidate node can become the root-node of an answer-tree with the help of two types of paths: first, when the new path passes through a frontier-node, and second, when the new path does not pass through any frontier-node. For example, as shown in Figure 7, consider a traversed candidate node in a later superstep: one new path to it passes through a frontier-node, while another does not pass through any frontier-node. For the first type of path no further proof is needed. The second type of path is taken care of by deep messages at traversed candidate nodes. Therefore, Lemma 6.3 is proved, and we continue BFS exploration of the graph until none of the frontier-nodes is a candidate-node, and stop the DKS algorithm when no more deep messages are left to be passed around.
6.2.4 Global vs. Local Top-K Steiner Trees
In this section, we analyze the effect of filtering the unwanted branches of the local-tree on the process of finding the minimum Steiner tree. This analysis also provides a basis for the calculation of the relevant sets at every node, and for not rejecting some branches of the local tree even when they are not part of any of the local top-K answers.
Lemma 6.4
By purging the extra branches of a local-tree, i.e., branches that are not part of the local top-K trees of any keyword-set, we don't miss any of the top-K answer-trees.
Here, it is important to note that if all branches of a local-tree that are not in the local top-K answer-trees are purged, we can miss an answer-tree. This can be observed from the example shown in Figure 8: for a particular keyword-set, a branch may not be in the top-3 answer-trees at one node, yet if not purged there, it can be part of a global top-K answer rooted at another vertex. Therefore, branches of a local-tree that are part of the top-K partial answer-trees of any keyword-set should not be purged. Further, it is trivial to prove the remaining claim of this Lemma: by purging all the remaining branches of a local-tree, we don't miss any of the top-K answers.
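The purging rule of Lemma 6.4 keeps a branch if it survives in the local top-K of at least one keyword-set. A sketch under an assumed representation, where each branch maps the keyword-sets it covers to path-lengths (names hypothetical):

```python
def branches_to_keep(branches, K):
    """Lemma 6.4 (sketch): keep a branch if it is among the K shortest for
    at least one keyword-set; all other branches may be purged without
    losing any global top-K answer-tree."""
    keep = set()
    # all keyword-sets covered by any branch at this node
    keyword_sets = {ks for pls in branches.values() for ks in pls}
    for ks in keyword_sets:
        # rank the branches covering this keyword-set by path-length
        ranked = sorted((b for b in branches if ks in branches[b]),
                        key=lambda b: branches[b][ks])
        keep.update(ranked[:K])
    return keep
```

Note that a branch ranked low for one keyword-set is still retained if it is among the best K for another keyword-set, which is exactly the caveat illustrated by Figure 8.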
In summary, we have proven that the answer weight can be represented as a monotonic function of the path-lengths of all keyword-sets of a keyword-query, and that searching the graph for answer-trees in BFS order is equivalent to sorted access of path-lengths w.r.t. the shortest path-lengths at frontier-nodes in consecutive supersteps. Therefore, Fagin's algorithm becomes applicable in this setting and, as stated by Fagin, we can stop the exploration without missing the optimal answer-tree. We have also presented proofs for Lemmas 6.1, 6.2, 6.3, and 6.4, and therefore Theorem 1 is proved.
7 Experiments and Analysis
7.1 Data, Infrastructure and Implementation
Datasets and Infrastructure: We performed experiments on two datasets of Linked-Open Data [8]: a) secrdfabout, RDF data about U.S. securities and corporate ownership (460,451 nodes and 500,384 edges); and b) blukbnb, RDF data of the British National Bibliography (16.1 million nodes and 46.6 million edges). Blukbnb is not only the largest dataset on which keyword-search has been attempted in the research literature, but is also larger than what systems such as BANKS [7, 10] can handle. Following the strategy proposed by Coffman et al. [9], we generated 100 queries for the blukbnb dataset based on frequently occurring keywords. These queries were generated such that the first 20 queries contained two keywords each, the next 20 contained three keywords each, and so on; i.e., the number of keywords per query varied from 2 to 6 across the 100 queries. These 100 queries were used for all the experiments reported in this paper. Further, keywords were chosen such that the total number of keyword-nodes per query varied from small to large, as shown in Figure 9; it can be observed that the number of keyword-nodes increases exponentially across the queries. All experiments reported in this paper were conducted on a compute cluster of four machines, each having 4 Intel Xeon E7520@1.87GHz CPUs with 4 cores and 32 GB RAM, configured to have 35 workers.
Implementation: We implemented DKS using the open-source Pregel package Apache Giraph 1.0 (http://giraph.apache.org/) [3], which was configured with 34 workers and a master worker. The edge weights were modeled following a strategy similar to that proposed in [7]: an edge weight is smaller if the in-degree of its target node is smaller, based on the intuition that if one node has, say, 10 incoming edges and another has 100, then the neighbors of the first node are closer to it than the neighbors of the second node are to theirs. In the DKS implementation, the edge weight is drawn from a step-function w.r.t. the degree of the edge's target node: if the degree of the target node is smaller than a prior threshold, the edge is assigned a finite weight from the step-function, and an infinite weight otherwise. The threshold was chosen from the degree distribution of the graph. Also, it was observed that the system hangs if the number of messages in a single superstep is more than approximately 1 million. In such situations, we stopped subsequent supersteps and estimated the smallest possible answer weight that could be discovered by further exploration of the graph, following the method presented in Section 5.4.
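The step-function edge weighting described above might be sketched as follows; the `steps` table, threshold value, and names are assumptions, since the paper does not give the exact function:

```python
import math

def edge_weight(target_indegree, threshold, steps):
    """Step-function edge weight w.r.t. the in-degree of the target node
    (sketch).  `steps` is a hypothetical list of (max_degree, weight) pairs
    in increasing order of max_degree; edges into nodes at or above the
    degree threshold get infinite weight and are effectively pruned."""
    if target_indegree >= threshold:
        return math.inf
    for max_deg, w in steps:
        if target_indegree <= max_deg:
            return w
    return math.inf

# e.g., steps chosen from the degree distribution of the graph:
weights = [(10, 1.0), (100, 2.0)]
```

This realizes the intuition above: edges pointing into low-in-degree nodes are cheap, while edges into very high-in-degree hubs are never traversed.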
7.2 Benchmarks
Benchmarks presented in this paper were conducted on the blukbnb dataset, while those on secrdfabout were presented in a previous paper [1]. Here, we evaluate: i) the efficiency of our approach, by observing the time taken to run the queries on the blukbnb dataset and comparing it with the time taken by a vanilla parallel BFS implementation, since vanilla parallel BFS can be run on a graph efficiently. Other approaches such as BANKS [7] are not compared, primarily because those algorithms cannot handle such large volumes of data, and our algorithm may not perform well on small datasets. We also present a break-up of the time taken by the various components of DKS, described in the next paragraph; ii) the degree of approximation for situations when we had to exit before the exit criterion was satisfied; iii) the effectiveness of early exit, by observing the percentage of nodes explored for every query, for various values of K in top-K; iv) the communication cost, by measuring the ratio of the total number of messages exchanged to the edge-count of the graph; and v) the effectiveness of distributed processing, by observing the time taken by a select set of queries while varying the number of worker nodes (compute nodes) in the Apache Giraph installation. In all figures of this section, queries are organized in increasing order of keyword-count and keyword-node count.
Time Taken: The vanilla parallel BFS was observed to take approximately 2 min 10 sec. The 90th percentile of the queries takes 85 sec (approx. 1.4 min) for K=1, 116 sec (approx. 1.9 min) for K=2, 221 sec (approx. 3.7 min) for K=5, and 609 sec (approx. 10.2 min) for K=10 to run the DKS algorithm. Here, the time taken for instantiation of worker-node jobs in Apache Giraph, first-time loading of the graph, and serialization of the final results, i.e., the time taken by the system, has been discounted from the reported running time of the DKS algorithm. This normalized running time of the DKS algorithm for all the queries is shown in Figure 10. It is important to note that the running time of DKS depends not only on the number of keywords, but also on the number of keyword-nodes of the query. However, it can be observed that while the number of keyword-nodes increases exponentially across the queries, the time taken does not increase in the same order.
Table 1: Break-up of time taken by DKS components for different values of K.

K  | Send BFS Msgs | Receive Msgs | Send Deep Msgs | Send Agg Msg | Evaluate
1  | 38%           | 44%          | 6%             | 11%          | 1%
2  | 37%           | 38%          | 17%            | 8%           | 1%
5  | 35%           | 37%          | 22%            | 5%           | 1%
10 | 31%           | 42%          | 21%            | 4%           | 1%
We also ran various queries in performance-collection mode only, to measure the time taken by fine-grained steps of the DKS algorithm and understand which part of the algorithm takes most of the time. We divide the DKS algorithm into five components: (i) Send BFS Message: iterating over the outgoing edges of a node, serializing the local-tree of the node, and sending the message; (ii) Receive Message: Step 1 of DKS, described in Section 4, which includes the calculation of the required sets and the filtering of the local-tree; (iii) Send Deep Message: iterating over the local-tree and sending suitable deep-messages; (iv) Send Aggregator Message: extracting the top-K answers from the local-tree of a node and sending them to the aggregator; and (v) Evaluate: the evaluation performed at the end of each superstep. The results of this analysis are presented in Table 1. We can observe that most of the time is taken by the receive-message step, which is expected based on the analysis presented in Section 5. Sending the messages is also a time-consuming task, because it involves serialization of the local-tree as well as the communication cost. Also, with increase in the value of K, the time taken for sending deep-messages increases, primarily because more deep messages were passed, as shown in Figure 11.
SPA-Ratio: For situations when our infrastructure was not sufficient to handle the load, we stopped the DKS algorithm after estimating the smallest possible answer weight that could be discovered by further traversal of the graph. We report the SPA-ratio of the queries, which is the ratio of the weight of the best detected answer-tree to the smallest possible answer weight that can be detected by further exploration of the graph. For situations where the optimal answer was detected, the SPA-ratio is marked as zero. The SPA-ratio is not the approximation ratio of the algorithm, because deep messages are also stopped; still, it can be taken as a measure of the degree of optimality of the detected answer. The SPA-ratio of all the 100 queries is shown in Figure 12. For comparison, the best reported approximation ratio for a heuristic solution is 1.55 [29]; however, that algorithm has quadratic time-complexity in the number of nodes of the graph, while our approach is linear in the number of nodes of the graph.
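The SPA-ratio as defined here can be computed with a simple sketch (the zero-on-optimal convention follows the text; the names are hypothetical):

```python
def spa_ratio(best_detected_weight, est_min_remaining_weight):
    """SPA-ratio on early exit (sketch): the ratio of the best detected
    answer weight to the smallest possible answer weight discoverable by
    further exploration; 0 when the detected answer is provably optimal."""
    if best_detected_weight <= est_min_remaining_weight:
        return 0.0  # no better answer can exist: detected answer is optimal
    return best_detected_weight / est_min_remaining_weight
```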
Effectiveness of Early Exit & Communication Cost: The percentage of nodes explored did not change significantly across different values of K. Therefore, we present the average percentage of nodes explored for every query in Figure 13; the percentage of nodes explored was observed to be small for the vast majority of queries, indicating that our approach is quite effective in reducing the search space of the problem. In Figure 14, we show the total number of messages exchanged as a percentage of the total number of edges in the graph, for different values of K. The message-count as a fraction of the edge-count grows with K, indicating that the number of messages that need to be exchanged increases with increase in the value of K. Further, this experiment corroborates the assumptions made in Sections 5.2 and 5.3 regarding the total number of nodes and messages, used in the estimation of the time-complexity of our algorithm.
Benefits of Distributed Processing: Finally, we demonstrate the benefits of distributed processing by running the same set of queries on different numbers of worker nodes of the compute cluster. The results of this experiment are shown in Figure 15, for a pair of queries with keyword counts 2 and 3, respectively. It can be observed that with 3 times the number of worker nodes, the time taken is reduced by more than half. However, increasing the number of workers further yields little additional gain.
8 Conclusion & Future Work
We have described a novel parallel algorithm for relationship queries on large graphs (equivalent to the group Steiner tree problem). Our distributed keyword search (DKS) algorithm is defined in the graph-parallel (Pregel-like) computing paradigm. While DKS searches for the root-vertex of an answer-tree following a BFS strategy, the algorithm ensures that only a fraction of the graph needs to be explored for most queries. We include an analytical proof of optimality, and show that even with early exit from BFS we do not miss an optimal answer-tree. We also describe an optimized implementation of the basic algorithm and analyze its time-complexity. Finally, we have demonstrated via experimental results on the graph-parallel framework Giraph that DKS works efficiently on large real-world graphs derived from linked-open-data.
9 Acknowledgments
We thank Prof. Amitabha Bagchi, IIT Delhi for his reviews and suggestions.
References
 [1] P. Agarwal, M. Ramanath, and G. Shroff. Distributed algorithm for relationship queries on large graphs. In Proceedings of the workshop on LargeScale and Distributed System for Information Retrieval in CIKM, LSDSIR ’15.
 [2] S. Agrawal, S. Chaudhuri, and G. Das. DBXplorer: Enabling keyword search over relational databases. In Proceedings of the ACM SIGMOD international conference on Management of Data, SIGMOD ’02.
 [3] C. Avery. Giraph: Largescale graph processing infrastructure on hadoop. Proceedings of the Hadoop Summit. Santa Clara’11.
 [4] A. Balmin, V. Hristidis, and Y. Papakonstantinou. Objectrank: Authoritybased keyword search in databases. In Proceedings of the international conference on Very Large Data Bases, VLDB ’04.
 [5] F. Bauer and A. Varma. Distributed algorithms for multicast path setup in data networks. IEEE/ACM Transactions on Networking ’96.
 [6] M. Bezensek and B. Robic. A survey of parallel and distributed algorithms for the Steiner tree problem. International Journal of Parallel Programming ’13.
 [7] G. Bhalotia, A. Hulgeri, C. Nakhe, S. Chakrabarti, and S. Sudarshan. Keyword searching and browsing in databases using BANKS. In Proceedings of the International Conference on Data Engineering, ICDE ’02.
 [8] C. Bizer, T. Heath, and T. BernersLee. Linked datathe story so far. In International Journal on Semantic Web and Information Systems, IJSWIS’09.
 [9] J. Coffman and A. C. Weaver. A framework for evaluating database keyword search strategies. In Proceedings of the ACM international Conference on Information and Knowledge Management, CIKM ’10.
 [10] B. Dalvi, M. Kshirsagar, and S. Sudarshan. Keyword search on external memory data graphs. In Proceedings of the international conference on Very Large Databases, VLDB ’08.
 [11] J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In Proceedings of the Symposium on Operating Systems Design and Implementation, OSDI ’04.
 [12] B. Ding, J. Xu Yu, S. Wang, L. Qin, X. Zhang, and X. Lin. Finding topk mincost connected trees in databases. In Proceedings of IEEE International Conference on Data Engineering, ICDE’07.
 [13] M. T. Dittrich, G. W. Klau, A. Rosenwald, T. Dandekar, and T. Müller. Identifying functional modules in protein–protein interaction networks: an integrated exact approach. Bioinformatics, 2008.
 [14] S. E. Dreyfus and R. A. Wagner. The Steiner problem in graphs. Networks, 1(3):195–207, 1971.
 [15] R. Fagin. Combining fuzzy information: an overview. SIGMOD Rec., 31(2):109–118, June 2002.
 [16] J. Feldman and M. Ruhl. The directed steiner network problem is tractable for a constant number of terminals. In Proceedings of Annual Symposium on Foundations of Computer Science, 1999.
 [17] L. Guo, F. Shao, C. Botev, and J. Shanmugasundaram. XRank: Ranked keyword search over XML documents. In Proceedings of the ACM SIGMOD international conference on Management of Data, SIGMOD ’03.
 [18] Y. Hao, H. Cao, Y. Qi, C. Hu, S. Brahma, and J. Han. Efficient keyword search on graphs using mapreduce. In Proceedings of IEEE international conference on Big Data, BigData ’15.
 [19] H. He, H. Wang, J. Yang, and P. Yu. Blinks: ranked keyword searches on graphs. In Proceedings of ACM SIGMOD international conference on Management of data, SIGMOD ’07.
 [20] V. Hristidis and Y. Papakonstantinou. Discover: keyword search in relational databases. In Proceedings of the international conference on Very Large Data Bases, VLDB ’02.
 [21] V. Kacholia, S. Pandit, S. Chakrabarti, S. Sudarshan, R. Desai, and H. Karambelkar. Bidirectional expansion for keyword search on graph databases. In Proceedings of the international conference on Very Large Databases, VLDB ’05.
 [22] M. Kargar and A. An. Keyword search in graphs: Finding rcliques. In Proceedings of international conference on Very Large Databases, VLDB ’11.
 [23] B. Kimelfeld and Y. Sagiv. Finding and approximating topk answers in keyword proximity search. In Proceedings of ACM Symposium on Principles of Database Systems, PODS ’06.
 [24] T. Lappas, K. Liu, and E. Terzi. Finding a team of experts in social networks. In Proceedings of ACM SIGKDD international conference on Knowledge Discovery and Data mining, KDD’09.
 [25] T. Lappas, E. Terzi, D. Gunopulos, and H. Mannila. Finding effectors in social networks. In Proceedings of ACM SIGKDD international conference on Knowledge Discovery and Data Mining, KDD ’10.
 [26] G. Li, B. Ooi, J. Feng, J. Wang, and L. Zhou. Ease: an effective 3in1 keyword search method for unstructured, semistructured and structured data. In Proceedings of ACM SIGMOD international conference on Management of Data, SIGMOD ’08.
 [27] G. Malewicz, M. H. Austern, A. J. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski. Pregel: a system for largescale graph processing. In Proceedings of ACM SIGMOD international conference on Management of Data, SIGMOD’10.
 [28] G. L. Presti, G. L. Re, P. Storniolo, and A. Urso. A grid enabled parallel hybrid genetic algorithm for spn. In M. Bubak, G. Albada, P. Sloot, and J. Dongarra, editors, Computational Science  ICCS ’04, Lecture Notes in Computer Science.
 [29] G. Robins and A. Zelikovsky. Improved Steiner tree approximation in graphs. In Proceedings of the Annual ACM-SIAM Symposium on Discrete Algorithms, SODA ’00.
 [30] G. Singh and K. Vellanki. A distributed protocol for constructing multicast trees. In Proceedings of International Conference On Principles Of Distributed Systems, OPODIS ’98.
 [31] J. X. Yu, L. Qin, and L. Chang. Keyword search in relational databases: A survey. IEEE Data Eng. Bull, 2010.
 [32] J. Zobel and A. Moffat. Inverted files for text search engines. ACM Comput. Surv., 2006.
Appendix A Fagin’s Algorithm
Brief description of Fagin’s algorithm: Fagin’s algorithm [15] finds the top-K objects under sorted access to the attributes of the objects. Assume a set of $N$ objects, where every object has $m$ attributes. These attributes can be accessed from $m$ individually sorted lists, one per attribute. An aggregate function $t$ is used to calculate the weight of an object from its attribute values. For example, students in a course compete for the top-K positions by performing well in $m$ subjects. The teacher of each subject prepares a list of the students’ marks in descending order and sends it to the course coordinator, who wants to identify the top-K best-performing students from these individually sorted lists of marks. If the aggregate function is monotonic with respect to these attributes, and the lists are accessed in parallel, then, according to Fagin, it is sufficient to access the lists sequentially until all $m$ attributes of at least K objects have been seen. In the process, some additional objects will have been seen only partially, i.e., in fewer than $m$ lists. The remaining attributes of these partially seen objects should then be accessed randomly. As a result, the attribute values of all the seen objects become known. Fagin showed that the top-K objects according to their weights must lie within this set of seen objects, which are therefore referred to as candidate top-K objects. Here, a function $t$ is called monotonic if $t(x_1, \ldots, x_m) \leq t(y_1, \ldots, y_m)$ whenever $x_i \leq y_i$ for all $i$.
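The procedure described above can be given as a runnable sketch (the list-of-pairs representation is an assumption; every object is assumed to appear in every list, and lists are sorted best-first, i.e., smallest-first when smaller aggregate weights are better, matching path-lengths):

```python
def fagin_topk(lists, agg, K):
    """Fagin's algorithm (sketch).  `lists` holds m lists of
    (object_id, attribute_value) pairs, each sorted best-first; `agg` is a
    monotonic aggregate function.  Returns the top-K objects by weight,
    where smaller aggregate weight is better."""
    m = len(lists)
    seen = {}                      # object -> set of list indices seen in
    depth = 0
    # sorted access: scan all lists in parallel until at least K objects
    # have been seen in every list
    while sum(1 for s in seen.values() if len(s) == m) < K:
        for i, lst in enumerate(lists):
            obj, _ = lst[depth]
            seen.setdefault(obj, set()).add(i)
        depth += 1
    # random access: fetch the remaining attributes of partially seen objects
    values = {obj: [dict(lst)[obj] for lst in lists] for obj in seen}
    # the true top-K lie among the seen (candidate) objects
    return sorted(seen, key=lambda o: agg(values[o]))[:K]
```

In the paper's setting, the per-superstep shortest path-lengths of the keyword-sets play the role of the sorted lists, and Eq. 3 plays the role of the monotonic aggregate function.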