Scalable Continual Top- Keyword Search
in Relational Databases
Keyword search in relational databases has been widely studied in recent years because it does not require users neither to master a certain structured query language nor to know the complex underlying database schemas. Most of existing methods focus on answering snapshot keyword queries in static databases. In practice, however, databases are updated frequently, and users may have long-term interests on specific topics. To deal with such a situation, it is necessary to build effective and efficient facility in a database system to support continual keyword queries.
In this paper, we propose an efficient method for answering continual top- keyword queries over relational databases. The proposed method is built on an existing scheme of keyword search on relational data streams, but incorporates the ranking mechanisms into the query processing methods and makes two improvements to support efficient top- keyword search in relational databases. Compared to the existing methods, our method is more efficient both in computing the top- results in a static database and in maintaining the top- results when the database continually being updated. Experimental results validate the effectiveness and efficiency of the proposed method.
elational databases, keyword search, continual queries, results maintenance.
With the proliferation of text data available in relational databases, simple ways to exploring such information effectively are of increasing importance. Keyword search in relational databases, with which a user specifies his/her information need by a set of keywords, is a popular information retrieval method because the user needs to know neither a complex query language nor the underlying database schemas. It has attracted substantial research effort in recent years, and a number of methods have been developed [1, 2, 3, 4, 5, 6, 7, 8, 9, 10].
Consider a sample publication database shown in Fig. 1. Fig. 1 (a) shows the three relations Papers, Authors, and Writes. In the following, we use the initial of each relation name (, , and ) as its shorthand. There are two foreign key references: and . Fig. 1 (b) illustrates the tuple connections based on the foreign key references. For the keyword query “James P2P” consisting of two keywords “James” and “P2P”, there are six tuples in the database that contain at least one of the two keywords (underlined in Fig. 1 (a)). They can be regraded as the results of the query. However, they can be joined with other tuples according to the foreign key references to form more meaningful results, several of which are shown in Fig. 1 (c). The arrows represent the foreign key references between the corresponding pairs of tuples. Finding such results which are formed by the tuples containing the keywords is the task of keyword search in relational databases. As described later, results are often ranked by relevance scores evaluated by a certain ranking strategy.
Most of the existing keyword search methods assume that the databases are static and focus on answering snapshot keyword queries. In practice, however, a database is often updated frequently, and the result of a snapshot query becomes invalid once the related data in the database is updated. For the database in Fig. 1, if publication data comes continually, new publication records are inserted to the three tables. Such new records may be more relevant to “James” and “P2P”. Hence, after getting the initial top- results, the user may demand the top- results to reflect the latest database updates. Such demands are common in real applications. Suppose a user want to do a top- keyword search in a Micro-blogging database, which is being updated continually: not only the weblogs and comments are continually being inserted or deleted by bloggers, but also the follow relationship between bloggers are being updated continually. Thus, a continual evaluation facility for keyword queries is essential in such databases.
For continual keyword query evaluation, when the database is updated, two situations must be considered:
Database updates may change the existing top- results: some top- results may be replaced by new ones that are related to the new tuples, and some top- results may be invalid due to deletions.
Database updates may change the relevance scores of existing results because the underlying statistics (e.g., word frequencies) are changed.
In this paper, we describe a system which can efficiently report the top- results of every monitoring query while the database is being updated continually. The outline of the system is as follows:
When a continual query is issued, it is evaluated in a pipelined way to find the set of results whose upper bounds of relevance scores are higher than a threshold by calculating the upper bound of the future relevance score for every query result.
When the database is updated, we first update the relevance scores of the computed results, then find the new results whose upper bounds of relevance scores are larger than and delete the results containing the deleted tuples.
The pipelined evaluation of the keyword query is resumed if the number of computed results whose relevance scores are larger than falls below , or is reversed if the above number is much larger than .
At any time, the computed results whose relevance scores are the largest and are larger than are reported as the top- results.
In this section, we introduce some important concepts for top- keyword querying evaluation in relational databases.
2.1 Relational Database Model
We consider a relational database schema as a directed graph , called a schema graph, where represents the set of relation schemas and represents the foreign key references between pairs of relation schemas. Given two relation schemas, and , there exists an edge in the schema graph, from to , denoted , if the primary key of is referenced by the foreign key defined on . For example, the schema graph of the publication database in Fig. 1 is . A relation on relation schema is an instance of (a set of tuples) conforming to the schema, denoted . A tuple can be inserted into a relation. Below, we use to denote if the context is obvious.
2.2 Joint-Tuple-Trees (JTTs)
The results of keyword queries in relational databases are a set of connected trees of tuples, each of which is called a joint-tuple-tree (JTT for short). A JTT represents how the matched tuples, which contain the specified keywords in their text attributes, are interconnected through foreign key references. Two adjacent tuples of a JTT, and , are interconnected if they can be joined based on a foreign key reference defined on relational schema and in (either or ). The foreign key references between tuples in a JTT can be denoted using arrows or notation . For example, the second JTT in Fig. 1(c) can be denoted as or . To be a valid result of a keyword query , each leaf of a JTT is required to contain at least one keyword of . In Fig. 1(c), tuples , , and are matched tuples to the keyword query as they contain the keywords. Hence, the four JTTs are valid results to the query. In contrast, is not a valid result because tuple does not contain any required keywords. The number of tuples in a JTT is called the size of , denoted by .
2.3 Candidate Networks (CNs)
Given a keyword query , the query tuple set of relation is defined as contains some keywords of . For example, the two query tuple sets in Example 1 are and , respectively. The free tuple set of a relation with respect to is defined as the set of tuples that do not contain any keywords of . In Example 1, , . If a relation does not contain text attributes (e.g., relation in Fig. 1), is used to denote for any keyword query. We use to denote a tuple set, which may be either or .
Each JTT belongs to the result of a relational algebra expression, which is called a candidate network (CN) [4, 9, 11]. A CN is obtained by replacing each tuple in a JTT with the corresponding tuple set that it belongs to. Hence, a CN corresponds to a join expression on tuple sets that produces JTTs as results, where each join clause corresponds to an edge in the schema graph , where represents a equi-join between relations. For example, the CNs that correspond to two JTTs and in Example 1 are and , respectively. In the following, we also denote as . As the leaf nodes of JTTs must be matched tuples, the leaf nodes of CNs must be query tuple sets. Due to the existence of relationships (for example, an article may be written by multiple authors), a CN may have multiple occurrences of the same tuple set. The size of CN , denoted as , is defined as the number of tuple sets that it contains. Obviously, the size of a CN is the same as that of the JTTs it produces. Fig. 2 shows the CNs corresponding to the four JTTs shown in Fig. 1 (c). A CN can be easily transformed into an equivalent SQL statement and executed by an RDBMS.111 For example, we can transform CN as: SELECT * FROM W w, P p, A a WHERE w.pid = p.pid AND w.aid = a.aid AND p.pid in (, , ) and a.aid in (, , ).
When a continual keyword query is specified, the non-empty query tuple set for each relation in the target database are firstly computed using full-text indices. Then all the non-empty query tuple sets and the database schema are used to generate the set of valid CNs, whose basic idea is to expand each partial CN by adding a or at each step ( is adjacent to one relation of the partial CN in ), beginning from the set of non-empty query tuple sets. The set of CNs shall be sound/complete and duplicate-free. There are always a constraint, (the maximum size of CNs) to avoid generating complicated but less meaningful CNs. In the implementation, we adopt the state-of-the-art CN generation algorithm proposed in .
In Example 1, there are two non-empty query tuple sets and . Using them and the database schema graph, if , the generated CNs are: , , , , , and .
2.4 Scoring Method
The problem of continual top- keyword search we study in this paper is to continually report top- JTTs based on a certain scoring function that will be described below. We adopt the scoring method employed in , which is an ordinary ranking strategy in the information retrieval area. The following function is used to score JTT for query , which is based on the TF-IDF weighting scheme:
where is a tuple (a node) contained in . is the tuple score of with regard to defined as follows:
where is the term frequency of keyword in tuple , is the number of tuples in relation (the relation corresponds to tuple ) that contain . is interpreted as the document frequency of . represents the size of tuple , i.e., the number of letters in , and is interpreted as the document length of . is the total number of tuples in , is the average tuple size (average document length) in , and () is a constant which usually be set to 0.2.
Table 1 shows the tuple scores of the six matched tuples in Example 1. We suppose all the matched tuples are shown in Fig. 1, and the numbers of tuples of the two relations are 150 and 180, respectively. Therefore, the top-3 results are (), () and ().
The score function in Eq. (1) has the property of tuple monotonicity, defined as follows. For any two JTTs and generated from the same CN , if for any , , then we have . As shown in the following discussion, this property is relied by the existing top- query evaluation algorithms.
3 Related Work
3.1 Keyword Search in Relational Databases
Given -keyword query , the task of keyword search in a relational database is to find structural information constructed from tuples in the database . There are two approaches. The schema-based approaches [1, 2, 4, 7, 9, 14, 15] in this area utilize the database schema to generate SQL queries which are evaluated to find the structures for a keyword query. They process a keyword query in two steps. They first utilize the database schema to generate a set of relation join templates (i.e., the CNs), which can be interpreted as select-project-join views. Then, these join templates are evaluated by sending the corresponding SQL statements to the DBMS for finding the query results.  proved how to generate a complete set of CNs when the has a user-given value and discussed several query processing strategies when considers the common sub-expressions among the CNs. [1, 2, 14, 15] all focused on finding all JTTs, whose sizes are , which contain all keywords, and there is no ranking involved. In  and , several algorithms are proposed to get top- JTTs. We will introduce them in detail in Section 3.2.
The graph-based methods [3, 8, 5, 6, 10, 16] model and materialize the entire database as a directed graph where the nodes are relational tuples and the directed edges are foreign key references between tuples. Fig. 1(b) shows such a database graph of the example database. Then for each keyword query, they find a set of structures (either Steiner trees , distinct rooted trees , -radius Steiner graphs , or multi-center subgraphs ) from the database graph, which contain all the query keywords and are connected by the paths in database graph. Such results are found by graph traversals that start from the nodes that contain the keywords. For the details, please refer the survey papers [13, 17]. The materialized data graph should be updated for any database changes; hence this model is not appropriate to the databases that change frequently . Therefore, this paper adopts the schema-based framework and can be regarded as an extension for dealing with continual keyword search.
3.2 Top- Keyword Search in Relational Databases
DISCOVER2  proposed the Global-Pipelined (GP) algorithm to get the top- results which are ranked by the IR-style ranking strategy shown in Section 2.4. The aim of the algorithm is to find a proper order of generating JTTs in order to stop early before all the JTTs are generated. It employs the priority preemptive, round robin protocol  to find results from each query tuple set prefix in a pipelined way, thus each CN can avoid being fully evaluated.
For a keyword query , given a CN , let the set of query tuple sets of be . Tuples in each are sorted in non-increasing order of their scores computed by Eq. 2. Let be the -th tuple in . In each , we use to denote the current tuple such that the tuples before the position of the tuple are all processed, and we use to move to the next position. (where is a tuple, and ) denotes the parameterized query which checks whether the tuples can form a valid JTT. For each tuple , we use to denote the upper bound score for all the JTTs of that contain the tuple , defined as follows:
According to the tuple monotonicity property of Eq. (1) and the sorting order of tuples, among the unprocessed tuples of , has the maximum value.
Algorithm GP initially mark all tuples in () of each CN as un-processed except for the top-most ones. Then in each while iteration (one round), the un-processed tuple which maximizes the value is selected for processing. Suppose tuple maximizes , processing is done by joining it with the processed tuples in the other query tuple sets of to find valid JTTs: all the combinations as are tested, where is a processed tuple of (, ). If the -th relevance score of the found results is larger than values of all the un-processed tuples in all the CNs, it can stop and output the found results with the largest relevance scores because no results with higher scores can be found in the further evaluation.
One drawback of the GP algorithm is that when a new tuple is processed, it tries all the combinations of processed tuples to test whether each combination can be joined with . This operation is costly due to extremely large number of combinations when the number of processed tuples becomes large . SPARK  proposes the Skyline-Sweeping algorithm to reduce the number of combinations test. SPARK uses a priority queue to keep the set of seen but not tested combinations ordered by the priority defined as the score of the hypothetical JTT corresponding to each combination. In each round, the combination in with the maximum priority is tested, then all its adjacent combinations are inserted into but only the combinations that have the high priorities are tested. SPARK still can not avid testing a huge number of combinations which cannot produce results, though the number of combinations test is highly reduced compared to DISCOVER2.
3.3 Keyword Search in Relational Data Streams
The most related projects to our paper are S-KWS  and KDynamic [20, 15], which try to find new results or expired results for a given keyword query over an open-ended, high-speed large relational data stream . They adopt the schema-based framework since the database is not static. This paper deals with a different problem from S-KWS and KDynamic, though all need to respond to continual queries in a dynamic environment. S-KWS and KDynamic focus on finding all query results. On the contrary, our methods maintain the top- results, which is less sensitive to the updates of the underlying databases because not every new or expired results change the top- results.
S-KWS maps each CN to a left-deep operator tree, where leaf operators (nodes) are tuple sets, and interior operators are joins. Then the operator trees of all the CNs are compacted into an operator mesh by collapsing their common subtrees. Joins in the operator mesh are evaluated in a bottom-to-top manner. A join operator has two inputs and is associated with an output buffer which saves its results (partial JTTs). The output buffer of a join operator becomes input to many other join operators that share the join operator. A new result that is newly outputted by a join operator will be a new arrival input to those joins sharing it. The operator mesh has two main shortcomings : (1) only the left part of the operator trees can be shared; and (2) a large number of intermediate tuples, which are computed by many join operators in the mesh with high processing cost, will not be eventually output in the end.
For overcoming the above shortcomings of S-KWS, KDynamic formalizes each CN as a rooted tree, whose root is defined to be the node such that the maximum path from to all leaf nodes of the CN is minimized; and then compresses all the rooted trees into a -Lattice by collapsing the common subtrees. Fig. 3(a) shows the lattice of two hypothetical CNs. Each node in the Lattice is also associated with an output buffer, which contains the tuples in that can join at least one tuple in the output buffer of its each child node. Thus, each tuple in the output buffer of each top-most node , i.e., the root of a CN, can form JTTs with tuples in the output buffers of its descendants. The new JTTs involving a new tuple are found in a two-phase approach. In the filter phase, as illustrated in Fig. 3(b), when a new tuple is inserted into node , KDynamic uses selections and semi-joins to check if (1) can join at least a tuple in the output buffer of each child node of ; and (2) can join at least a tuple in the output buffers of the ancestors of . The new tuples that can not pass the checks are pruned; otherwise, in the join phase (shown in Fig. 3(c)), a joining process is initiated from each tuple in the output buffer of each root node that can join , in a top-down manner, to find the JTTs involving .
In this paper, we incorporate the ranking mechanisms and the pipelined evaluation into the query processing method of KDynamic to support efficient top- keyword search in relational databases.
4 Continual Top- Keyword Search in Relational Databases
Database updates bring two orthogonal effects on the current top- results:
They change the values of , , and in Eq. (2) and hence change the relevance scores of existing results.
New JTTs may be generated due to insertions. Existing top- results may be expired due to deletions.
Although the second effect is more drastic, the first effect is not negligible for long-term database modifications. Thus, we can not neglect all the JTTs that are not the current top- results because some of them have the potential of becoming the top- results in the future. This paper solves this problem by bounding the future relevance score of each result. We use to denote the upper bound of relevance score for each result. Then, the results whose values are not larger than relevance score of the top--th results can be safely ignored.
The second challenge is shortage of top- results because they can be expired due to deletions. Since the value is rather small compared to the huge number of all the valid JTTs, the possibility of deleting a top- result is rather small. In addition, new top- results can also be formed by new tuples. Thus, if the insertion rate is not much smaller than the deletion rate, the possibility of occurring of top- results shortage would be small. However, this possibility would be high if the deletion rate is much larger, which can result in frequent top- results refilling operations. It worth noting that the top- results shortage can also be caused by the relevance score changing of results. Our solution to this problem is to compute the top- () results instead of the necessary . is a margin value. Then, we can stand up to times of deletion of top results when maintaining the top- results. The setting of is important. If is too small, it may has a high possibility to refill. If is too large, the efficiency of handling database modifications is decreased. Instead of analyzing the update behavior of the underlying database to estimate an appropriate value, we enlarge on each time of top- results shortage until it reaches a value such that the occurring frequency of top- results shortage falls below a threshold.
On the contrary, after maintaining the top- results for a long time, the number of computed top results maybe larger than , especially when the insertion rate is high. In such cases, the top- results maintaining efficiency is decreased because we need to update the relevance scores for more results and join the new tuples with more tuples than necessary. As shown in the experimental results, such extra cost is not negligible for long-term database modifications. Therefore, we need to reverse the pipelined query evaluation if there are too many computed top results.
In brief, when a continual keyword query is registered, we first generate the set of CNs and compact them into a lattice . Then, the initial top- results is found by processing tuples in in a pipelined way until the values of the un-seen JTTs are not larger than relevance score of the top--th result (which is denoted by ). When maintaining the top- results, we only find the new results that are with . The pipelined evaluation of is resumed if the number of found results with falls below , or is reversed if the above number is larger than . The method of computing for results is introduced in Section 4.2. Section 4.3 and Section 4.4 describe our method of computing the initial top- results and maintaining the top- results, respectively. Then, two techniques which can highly improve the query processing efficiency are presented in Section 4.5 and Section 4.6.
4.2 Computing Upper Bound of Relevance Scores
Let us recall the function for computing tuple scores given in Eq. (2):
We assume that the future values of each and both have an upper bound and , respectively. Then, we can derive the upper bound of the future tuple score for each tuple as:
Hence, the upper bound of the future relevance score of a JTT is:
Note that the function in Eq. (5) also has the tuple monotonicity property on .
On query registration, each is computed as , and each is computed as , where and both are set as small values (). When maintaining the top- results, we continually monitor the change of statistics to determine whether all the and values below their upper bounds. At each time that any or value exceeds its upper bound, the or is enlarged until the frequencies of exceeding the upper bounds fall below a small number.
Table 2 shows the values of the six matched tuples in Example 1 by setting and . Hence, , and .
4.3 Finding Initial Top- Results
Fig. 4 shows the -lattice of the seven CNs in Example 2. We use to denote a node in . Particularly, denotes a lattice node of query tuple set, and denotes the query tuple set of . The dual edges between two nodes, for instance, and , indicate that is a dual child of . A node in can belongs to multiple CNs. We use to denote the set of CNs that node belongs to. For example, . Tuples in each query tuple set are sorted in non-increasing order of . We use to denote the current tuple such that the tuples before the position of the tuple are all processed, and we use to move to the next position. Initially, for each node in , is set as the top tuple in . In Fig. 4, of the four nodes are denoted by arrows. For a node that is of a free tuple set , we regard all the tuples of as its processed tuples for all the times. We use to indicate the output buffer of , which contains its processed tuples that can join at least one tuple in the output buffer of each child node of . Tuples in are also referred as the outputted tuples of .
In order to find the top- results in a pipelined way, we need to bound the values of the un-found results. For each tuple of , the maximal values of JTTs that can form is defined as follows:
where indicates the maximal for all the JTTs of that contain tuple , and is obtained by replacing in Eq. (3) with . If a child of has empty output buffer, processing any tuple at can not produce JTTs; hence in such cases, which can choke the processing tuples at until all its child nodes have non-empty output buffers. According to Eq. (6) and the tuples sorting order, among the un-processed tuples of , has the maximum value. We use to denote . In Fig. 4, values of the four nodes are shown next to the arrows. For example, .
Algorithm 1 outlines our pipelined algorithm of evaluating the lattice to find the initial top- results, which is similar to the GP algorithm. Lines 1-1 are the initialization step to sort tuples in each query tuple set and to initialize each . Then in each while iteration (lines 1-1), the un-processed tuple in all the nodes that maximizes is selected to be processed. Processing the selected tuples is done by calling the procedure . Algorithm 1 stops when is not larger than the relevance score of the top--th found results. The procedure is provided in KDynamic, which updates the output buffers for (line 1) and all its ancestors (lines 1-1), and finds all the JTTs containing tuple by calling the procedure (line 1). We will explain procedure using examples later. The recursive procedure is provided in KDynamic too, which constructs JTTs using the outputted tuples of ’s descendants that can join . The stack , which records where the join sequence comes from, is used to reduce the join cost.
In the first round, tuple is processed by calling . Since is the root node of , is called and JTT is found. Then, for the two father nodes of , and , is not updated because , is updated to because can join and . And then, for the two father nodes of , and , is not updated since has no processed tuples, is set as because there is only one tuple in that can join and . Since is the root node (of ), is called but no results are found because the only one found JTT is not a valid result. After processing tuple , and . In the second round, tuple is processed, which finds results and . Then, , , , , and . In the third-fifth rounds, tuples , and are processed, which insert into and no results found. In the sixth round, tuple is processed, which finds results and . Then, Algorithm 1 stops because the relevance score of the third result in the queue (suppose ) is larger than all the values. Fig. 5 shows the snapshot of after finding the top-3 results. Thus, after the evaluation.
After the execution of Algorithm 1, values of all the un-found results are not larger than . Results in the queue can be categorized into three kinds. The first kind are the results that are with , which are the initial top- results. The second kind are with and , which are called the potential top- results because they have the potential to become the top- results. The third kind are with . As shown in the experiment, the results of the last kind may have a large number. However, we can not discard them because some of them may become the first two kinds when maintaining the top- results.
4.4 Maintaining Top- Results
Algorithm 2 shows our algorithm of maintaining top- results. A database update operator is denoted by , which represents a tuple of relation is inserted (if is a insertion) or deleted (if is a deletion). Note that the database updates is modeled as deletions followed by insertions. For a new arrival , Algorithm 2 first checks whether the and values of relation exceed their upper bounds. If some (s) or exceeds their upper bounds, we enlarge222The methods of enlarging , and are introduced in detail in the experiments. the corresponding (s) or (line 2), and then update the and values for all the tuples in and all the results in the queue using the enlarged (s) or (line 2); otherwise, we update the relevance scores for the results in that are with (line 2). Then, we insert into to find the new results if is an insertion (lines 2-2), or delete the expired JTTs and from if is a deletion (lines 2-2). Lines 2-2 are explained in detail latter. And then, the of some nodes may be large than , which can be caused by three reasons: (1) the upper bound scores of tuples of relation are increased; (2) the of some nodes are increased from 0 after inserting the new tuple into ; and (3) new CNs are added into . Therefore, in lines 2-2, we process tuples using procedure until all the values are not larger than .