Quasi-SLCA based Keyword Query Processing over Probabilistic XML Data


Jianxin Li, Chengfei Liu, Rui Zhou and Jeffrey Xu Yu

Jianxin Li, Chengfei Liu and Rui Zhou are with the Faculty of Information & Technology, Swinburne University of Technology, Australia. {jianxinli, cliu, rzhou}@swin.edu.au
Jeffrey Xu Yu is with the Department of Systems Engineering & Engineering Management, The Chinese University of Hong Kong, China. yu@se.cuhk.edu.hk
Abstract

The probabilistic threshold query is one of the most common queries in uncertain databases: a result satisfies such a query only if its probability meets the threshold requirement. In this paper, we investigate probabilistic threshold keyword queries (PrTKQ) over XML data, a problem that has not been studied before. We first introduce the notion of quasi-SLCA and use it to represent the results of a PrTKQ under possible world semantics. We then design a probabilistic inverted (PI) index that can be used to quickly return qualified answers and filter out unqualified ones based on our proposed lower/upper bounds. After that, we propose two efficient and comparable algorithms: a Baseline Algorithm and a PI index-based Algorithm. To further accelerate the algorithms, we also utilize a probability density function. An empirical study using real and synthetic data sets verifies the effectiveness and efficiency of our approaches.

Keywords: Probabilistic XML, Threshold Keyword Query, Probabilistic Index.

1 Introduction

Uncertainty is widespread in many web applications, such as information extraction, information integration, web data mining, etc. In uncertain databases, probabilistic threshold queries have been studied extensively, where all results satisfying the query with probabilities equal to or larger than a given threshold value are returned [1, 2, 3, 4, 5]. However, all of these works are based on the uncertain relational data model. Because the flexibility of the XML data model allows a natural representation of uncertain data, uncertain XML data management has become an important issue and much work has been done recently. For example, many probabilistic XML data models were designed and analyzed [6, 7, 8, 9, 10]. Based on different data models, query evaluation [7, 10, 11, 12, 13], algebraic manipulation [8] and updates [6, 10] were studied. However, most of these works concentrated on structured query processing, e.g., twig queries. In this paper, we propose and address a new, interesting and challenging problem: Probabilistic Threshold Keyword Queries (PrTKQ) over uncertain XML databases based on quasi-SLCA semantics, which, to the best of our knowledge, has not been studied before.

In general, an XML document can be viewed as a rooted tree, where each node represents an element or its contents. XIRQL [14] supports keyword search in XML based on structured queries. However, users may not have knowledge of the structure of the XML data or the query language. As such, supporting pure keyword search in XML has attracted extensive research. LCA-based approaches first identify the LCA nodes, each of which contains every keyword under its subtree at least once [15, 16, 17, 18, 19, 20, 21]. Since LCA nodes are sometimes not specific enough for users' queries, Xu and Papakonstantinou [20] proposed the concept of SLCA (smallest lowest common ancestor), where a node $v$ is regarded as an SLCA if (a) the subtree rooted at $v$, denoted as $T_v$, contains all the keywords, and (b) there does not exist a descendant node of $v$ whose subtree contains all the keywords. In other words, if a node is an SLCA, then its ancestors are definitely excluded from being SLCAs. The SLCA semantics for modeling keyword search results on deterministic XML trees are also applied in [22, 16, 19].

Based on the SLCA semantics, [23] discussed top-$k$ keyword search over a probabilistic XML document. Given a keyword query and a probabilistic XML document (PrXML), [23] returned the top $k$ most relevant SLCA results (PrSLCAs) based on their probabilities. Different from the SLCA semantics over deterministic XML documents, a node $v$ being a PrSLCA can only exclude its ancestors from being PrSLCAs by a probability. This probability can be calculated by aggregating the probabilities of the deterministic documents (called possible worlds) implied by the PrXML in which $v$ is an SLCA.

However, it is not suitable to directly utilize the PrSLCA semantics for evaluating PrTKQs because the PrSLCA semantics are too strong. In some applications, users tend to be confident about the results to be searched, so relatively high probability threshold values may be given. Consequently, it is very likely that no qualified PrSLCA results will be returned. To solve this problem, we propose and utilize a so-called quasi-SLCA semantics to define the results of a PrTKQ by relaxing the semantics of PrSLCA with regard to a given threshold value. That is, besides the probability of $v$ being a PrSLCA in PrXML, the probability of a node $v$ being a quasi-SLCA in PrXML may also count the probabilities of $v$'s descendants being PrSLCAs if those probabilities fall below the specified threshold value. In other words, a node being a quasi-SLCA will exclude its ancestors from being quasi-SLCAs by a probability only when this probability is no less than the given threshold; otherwise, this probability will be counted as contributing to its ancestors. This differs from the PrSLCA semantics, which always excludes the probability contributions of child nodes.

Fig. 1: A probabilistic XML data tree
Example 1

Consider an aircraft-monitored battlefield application, where the useful information is captured as aerial photographs. By analysing the photographs, we can extract the possible objects (e.g., road, factory, airport, etc.) and attach some text description to them with probabilities, which can be stored in PrXML format. Figure 1 is a snapshot of aircraft-monitored battlefield XML data. By issuing a keyword query, a military department can find the potential areas containing hazardous buildings above a probability threshold.

Based on the PrSLCA semantics, any of six nodes in Figure 1, with probabilities 0.3, 0.14, 0.168, 0.24, 0.32 and 0.088 respectively, can become a PrSLCA result. The detailed procedure for calculating the probabilities of results will be shown later. Users generally specify a threshold value as the confidence score of their issued query; e.g., a threshold of 0.40 represents that the users prefer to see answers with probabilities of at least 0.40. Under this condition, no results can be returned to the users.

However, from Figure 1, we can see that if the probabilities of two of the low-probability nodes could contribute to their parent nodes, the parents would become quasi-SLCA results. Unfortunately, the PrSLCA semantics exclude them from being results. This motivates us to relax the PrSLCA semantics into the quasi-SLCA semantics. According to the quasi-SLCA semantics, the probabilities of the two parent nodes being quasi-SLCA results are 0.44 and 0.56 with the contributions of their respective child nodes. As such, both are deemed interesting places to be returned.

Given a PrTKQ, our problem is to quickly compute all the quasi-SLCA nodes whose probabilities meet the threshold requirement. Users issuing PrTKQs generally expect to see the complete quasi-SLCA answer set as early as possible and do not need to know the accurate probability of each answer, which motivates us to design a Probabilistic Inverted (PI) index and an efficient PI-based algorithm for quickly identifying quasi-SLCA result candidates.

We summarize the contributions of this paper as follows:

  • Based on our proposed quasi-SLCA result definition, we study probabilistic threshold keyword query over uncertain XML data, which satisfies the possible world semantics. To the best of our knowledge, this problem has not been studied before.

  • We design a probabilistic inverted (PI) index that can quickly compute the lower and upper bounds for a threshold keyword query, by which many unqualified nodes can be pruned and qualified nodes can be returned as early as possible. To maintain the effectiveness of pruning, a probability density function is employed under a Gaussian distribution assumption.

  • We propose two algorithms, a comparable baseline algorithm and a PI-based Algorithm, to efficiently find all the quasi-SLCA results meeting the threshold requirement.

  • Experimental evaluation has demonstrated the efficiency and effectiveness of the proposed approaches.

The rest of this paper is organized as follows. In Section 2, we introduce the probabilistic XML model and the problem definition of probabilistic threshold keyword queries. Section 3 gives an overview of how quasi-SLCA results are found efficiently, using an example. Section 4 presents the data structure of the PI index, discusses its basic building operations and pruning techniques, and outlines the index building procedure. In Section 5, we propose a comparable baseline algorithm and a PI-based algorithm to find the qualified quasi-SLCA results. We report the experimental results in Section 6. Section 7 discusses related work and Section 8 concludes the paper.

2 Probabilistic Data Model and Problem Definition

Probabilistic Data Model: A PrXML document defines a probability distribution over a space of deterministic XML documents. Each deterministic document belonging to this space is called a possible world. A PrXML document, represented as a labelled tree, has ordinary and distributional nodes. Ordinary nodes are regular XML nodes and may appear in deterministic documents, while distributional nodes are only used for defining the probabilistic process that generates deterministic documents and do not occur in those documents.

In this paper, we adopt a popular probabilistic XML model, PrXML [12, 23], which was first discussed in [7]. In this model, a PrXML document is considered as a labelled tree where distributional nodes have two types, IND and MUX. An IND node has children that are independent of each other, while the children of a MUX node are mutually-exclusive, that is, at most one child can exist in a random instance document (called a possible world). A real number from (0,1] is attached on each edge in the XML tree, indicating the conditional probability that the child node will appear under the parent node given the existence of the parent node. An example of a PrXML document is given in Fig. 1. Unweighted edges have 1 as the default conditional probability.
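As a concrete illustration of this model, the following minimal sketch (in Java; all type and field names are illustrative, not the paper's implementation) represents ordinary, IND and MUX nodes with a conditional probability attached to every edge:

```java
import java.util.ArrayList;
import java.util.List;

enum NodeType { ORDINARY, IND, MUX }

class PrXmlNode {
    final String label;          // element name or text content
    final NodeType type;         // ORDINARY, IND or MUX
    final double edgeProb;       // Pr(node | parent); 1.0 for unweighted edges
    final List<PrXmlNode> children = new ArrayList<>();

    PrXmlNode(String label, NodeType type, double edgeProb) {
        this.label = label;
        this.type = type;
        this.edgeProb = edgeProb;
    }

    PrXmlNode addChild(PrXmlNode c) { children.add(c); return c; }

    // Existence probability of a node in a random possible world: the product
    // of the conditional probabilities along the path from the root.
    static double existenceProb(List<PrXmlNode> pathFromRoot) {
        double p = 1.0;
        for (PrXmlNode n : pathFromRoot) p *= n.edgeProb;
        return p;
    }
}
```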

The Semantics of PrSLCA in PrXML: According to the possible world semantics, the global probability of a node $v$ being a PrSLCA with regard to a given query $q$ is defined as follows:

$$Pr^{G}_{slca}(v, q) = \sum_{W \in \mathcal{W},\ v \in slca(W, q)} Pr(W) \qquad (1)$$

where $\mathcal{W}$ denotes the set of possible worlds implied by the PrXML document, $v \in slca(W, q)$ indicates that $v$ is an SLCA in the possible world $W$ for the query $q$, and $Pr(W)$ is the existence probability of the possible world $W$. The superscript $G$ indicates that $Pr^{G}_{slca}(v, q)$ is the global probability of node $v$ being an SLCA w.r.t. $q$ over all possible worlds.

Fig. 2: A small PrXML and its possible worlds
Example 2

Consider the small PrXML document in Figure 2.a and all of its generated possible worlds in Figures 2.b-2.i, where a solid line represents the existence of an edge and a dashed line its absence. Given a possible world, we can compute its global probability based on the existence/absence probabilities of its edges.

Given a keyword query $q$, we can compute the global probability of a node being a PrSLCA w.r.t. $q$ by summing, over the possible worlds in which the node is an SLCA, the probabilities of those worlds; the same computation applies to every other candidate node.

The Semantics of quasi-SLCA in PrXML:

Definition 1

Quasi-SLCA: Given a keyword query $q$ and a threshold value $\tau$, a node $v$ is called a quasi-SLCA if and only if (1) $v$ or its descendants are SLCAs in a set $\mathcal{W}_v$ of possible worlds; (2) the aggregated probability of $v$ and its descendants being SLCAs in $\mathcal{W}_v$ is no less than $\tau$; and (3) no descendant node of $v$ satisfies both of the above conditions in any set of possible worlds that overlaps with $\mathcal{W}_v$.

In other words, if a descendant node $u$ of $v$ is itself a quasi-SLCA, then the probability of $u$ has to be excluded from the probability of $v$ being a quasi-SLCA. It means that the set of possible worlds in which $u$ appears as an SLCA does not overlap with the set of possible worlds in which $v$ or its other descendants appear as SLCAs.

Given a query $q$, we can compute $Pr^{L}_{slca}(v, q)$ in a bottom-up manner, where $Pr^{L}_{slca}(v, q)$ stands for the local probability of $v$ being an SLCA in the probabilistic subtree rooted at $v$. For example, the tree in Figure 2(a) is a subtree of Figure 1, and its possible worlds can be used to compute the local PrSLCA probabilities of its nodes. From $Pr^{L}_{slca}(v, q)$, we can easily get the global probability by $Pr^{G}_{slca}(v, q) = Pr^{G}(v) \times Pr^{L}_{slca}(v, q)$, where $Pr^{G}(v)$ indicates the existence probability of $v$ in the possible worlds. It can be computed by multiplying the conditional probabilities along the path from the root to $v$.

Now, we define the quasi-SLCA probability based on the PrSLCA probability and the parent-child relationship. For an IND node $v$, we have:

$$Pr^{L}_{quasi}(v, q) = Pr^{L}_{slca}(v, q) + 1 - \prod_{c \in C_v}\big(1 - Pr(c \mid v) \cdot Pr^{L}_{quasi}(c, q)\big) \qquad (2)$$

where $C_v$ is the set of child nodes of $v$ that are SLCA nodes but not quasi-SLCA nodes (i.e., their probabilities fall below the threshold), and $Pr(c \mid v)$ is the conditional probability of $c$ given $v$.

For a MUX node $v$, we have:

$$Pr^{L}_{quasi}(v, q) = Pr^{L}_{slca}(v, q) + \sum_{c \in C_v} Pr(c \mid v) \cdot Pr^{L}_{quasi}(c, q) \qquad (3)$$

Note that IND and MUX nodes are normally not allowed to be SLCA result nodes because they are only distributional nodes. As such, for an IND or MUX node $v$ as above, we may use its parent node (with $v$ as a sole child) to represent the SLCA result node.
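To make the combination rule concrete, here is a minimal sketch (in Java, with illustrative names; it assumes the reconstructed forms of Equations 2 and 3 above) of one bottom-up combination step, promoting only the children whose quasi-SLCA probabilities fall below the threshold:

```java
// One bottom-up combination step for a node with SLCA probability
// ownSlcaProb: unqualified children (prob < tau) contribute to the parent.
static double quasiSlcaProb(double ownSlcaProb, boolean isMux,
                            double[] childQuasiProbs, double[] edgeProbs,
                            double tau) {
    double contribution;
    if (isMux) {
        // MUX children are mutually exclusive, so their promoted
        // probabilities simply add up (Equation 3)
        double sum = 0.0;
        for (int i = 0; i < childQuasiProbs.length; i++)
            if (childQuasiProbs[i] < tau)          // unqualified child only
                sum += edgeProbs[i] * childQuasiProbs[i];
        contribution = sum;
    } else {
        // IND children are independent, so combine their promoted
        // probabilities as the complement of a product (Equation 2)
        double noneOccurs = 1.0;
        for (int i = 0; i < childQuasiProbs.length; i++)
            if (childQuasiProbs[i] < tau)
                noneOccurs *= 1.0 - edgeProbs[i] * childQuasiProbs[i];
        contribution = 1.0 - noneOccurs;
    }
    return ownSlcaProb + contribution;
}
```

With the numbers of Example 3 below (own SLCA probability 0.14, one unqualified child with probability 0.30 and edge probability 1.0), the sketch returns 0.14 + 0.30 = 0.44.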

Example 3

Let us consider Example 2 again. First assume the specified threshold value is $\tau = 0.40$. Then the global probability of the parent node being a quasi-SLCA result can be calculated as $0.14 + 0.30 = 0.44$, because its child is an SLCA node but not a quasi-SLCA node w.r.t. the given threshold, so the child's SLCA probability contributes to its parent node. If the threshold is decreased to 0.30, then the child will be taken as a qualified quasi-SLCA result and will not contribute to its parent. In this case, the parent cannot become a quasi-SLCA result because its own probability $0.14 < 0.30$. If the threshold is further decreased to 0.14, both nodes are qualified quasi-SLCA results.

Definition 2

Probabilistic Threshold Keyword Query (PrTKQ): Given a keyword query $q$ and a threshold $\tau$, the result of $q$ over a probabilistic XML document is the set of quasi-SLCA nodes with probabilities equal to or larger than $\tau$, i.e., $\{v \mid Pr^{G}_{quasi}(v, q) \ge \tau\}$.

In this work, we are interested in how to efficiently compute the quasi-SLCA answer set for a PrTKQ over probabilistic XML data.

3 Overview of this Work

A naive method to answer a PrTKQ is to enumerate all possible worlds and apply the query to each possible world. Then, we can compute the overall probability of each quasi-SLCA result and return the results meeting the probability threshold. However, the naive method is inefficient due to the huge number of possible worlds implied by a probabilistic XML document. Another method is to extend the work in [23] to compute the probabilities of quasi-SLCA candidates. Although it is much more efficient than the naive method, it still needs to scan the keyword node lists and calculate the keyword distributions for all relevant nodes. This motivates our development of efficient algorithms that not only avoid generating possible worlds, but also prune more unqualified nodes.

To accelerate query evaluation, in this paper we propose a prune-based probabilistic threshold keyword query algorithm, which determines the qualified results and filters the unqualified candidates using off-line computed probability information. To do this, we need to first calculate the probability of each possible query term within a node, which is stored in an off-line computed probabilistic index. Within a node, any two of its contained terms may appear in an IND or MUX relationship. To precisely differentiate IND and MUX, we use different parts to represent the probabilities of possible query terms appearing in a MUX relationship, while the terms within each part hold IND relationships. In other words, the different parts of terms in a node are mutually exclusive (MUX); e.g., one of the nodes in Figure 3 consists of three parts.

Given a keyword query and a threshold value, we first load the corresponding off-line computed probabilistic index w.r.t. the query keywords and then calculate on-the-fly, in a bottom-up strategy, the range of probabilities of a node being a result of the keyword query. This range is represented by two boundary values: a lower bound and an upper bound. By comparing the lower/upper bounds of candidates with the threshold, the qualified results can be efficiently identified.
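As a small illustration of this decision rule, the following sketch (in Java, with illustrative names) classifies a candidate node from its bounds alone:

```java
// Three-way decision driving the PI-based evaluation: a node is output,
// pruned, or deferred for exact computation, purely from its bounds.
enum Decision { QUALIFIED, PRUNED, NEEDS_COMPUTATION }

static Decision classify(double lb, double ub, double tau) {
    if (lb >= tau) return Decision.QUALIFIED;  // safe to output
    if (ub < tau)  return Decision.PRUNED;     // safe to discard
    return Decision.NEEDS_COMPUTATION;         // tau lies between the bounds
}
```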

The following two examples briefly demonstrate how we calculate the lower/upper bounds based on a given keyword query and the off-line computed probabilistic index, and how we apply the on-line computed lower/upper bounds to prune unqualified candidates and identify qualified ones.

Fig. 3: PI index and Lower/Upper Bound for a query over the given PrXML

Figure 3 shows the lower/upper bounds of each node in Figure 1, where the probability of each individual term is calculated offline while the lower/upper bounds are computed on-the-fly from the given query keywords. Let us first introduce the related concepts briefly. The probability of a term in a node represents the total local probability of the term appearing in all possible worlds generated from the probabilistic subtree rooted at the node, e.g., 0.65 for one query term and 0.916 for the other. The lower bound represents the minimal total local probability of all the given query keywords appearing together in the possible worlds of the probabilistic subtree, e.g., LB = 0.65 * 0.916 = 0.595. The upper bound represents the maximal total local probability of the given query keywords appearing in those possible worlds, because the keywords may be independent or co-occur, e.g., UB = min{0.65, 0.916} = 0.65 no matter whether they are independent. By multiplying the path probability, a local probability can be transformed into a global probability. For the nodes containing MUX semantics, we group the probabilities of their terms into different parts, any two of which are mutually exclusive, as shown in Figure 3. The details of computing the lower/upper bounds under the IND and MUX semantics are given in the following section.

Example 4

Consider a PrTKQ with $\tau = 0.40$ again. Three of the nodes can be pruned directly without calculation because their upper bounds are all lower than 0.40. We need to check the remaining nodes. For the first of them, after computation, the probability of being a quasi-SLCA result is 0.44, which is larger than the specified threshold 0.40, so it is taken as a result. After that, this result can be used to update the lower and upper bounds of its ancestor, from (LB=0.595, UB=0.65) to (LB=0.155, UB=0.21). As a consequence, the ancestor is filtered because its updated upper bound falls below 0.40. Similarly, another node can be computed and selected as a result because its probability is 0.56. Since two quasi-SLCA results have been found below it, the bounds of the remaining ancestor can be updated from (LB=0.890, UB=0.950) to (LB=0.136, UB=0.196). As such, it can be pruned because its upper bound is lower than 0.40. From this example, we can see that many answers can be pruned or returned without knowing their accurate probabilities, and the effectiveness of pruning grows with the users' search confidence.

An acute reader may notice that we still have to compute the probability of one node being a quasi-SLCA, because whether it is a qualified result cannot be determined from its lower/upper bound values alone. To exactly calculate this probability, we have to access its child/descendant nodes, including nodes that were recognized as pruned before we started to process it. If an internal node depends on a large number of pruned nodes, the effectiveness of pruning is degraded to some extent. To address this problem, we will introduce a probability density function (PDF) that can be used to approximately compute the probability of a node; the result can then be used to further update the lower and upper bounds of its ancestor nodes. The details are provided and discussed with the algorithms later.

4 Probabilistic Inverted Index

In this section, we describe our Probabilistic Inverted (PI) index structure for efficiently evaluating PrTKQ queries over probabilistic XML data. In keyword search on certain XML data, inverted indexes are popular structures, e.g., [16, 20]. The basic technique is to maintain a list of lists, where each element in the outer list corresponds to a domain element (i.e., a keyword). Each inner list stores the ids of the XML nodes in which the given keyword occurs and, for each node, the frequency or weight of the keyword. In this work, we introduce a probabilistic version of this structure, in which we store for each keyword a list of node ids. Along with each node id, we store the probability that the subtree rooted at the node contains the given keyword. The probability values in the inner lists are used to compute lower and upper bounds on-the-fly during PrTKQ evaluation.

Fig. 4: A probabilistic Inverted Index

Figure 4 shows an example of a probabilistic inverted index for the data in Figure 1. At the base of the structure is a list of keywords storing pointers to lists, one per term in the XML data. This is an inverted array storing, for each term, a pointer to a list of triples. In the list corresponding to a term, the triple ($v$, Pr(path), {…}) records the node $v$ (the symbol $v$ is used to represent a node's name or its id without confusion in the following sections), the conditional probability from the root to $v$, and a probability set that may contain a single probability value or multiple probability values. A single probability value indicates that all the keyword instances in the subtree can be considered independent in probability, e.g., a confidence set {0.65}, while multiple probability values indicate that keyword instances belonging to different sets occur mutually exclusively, e.g., a confidence set {0.8, 0.86, 0.82} representing the different possibilities of a term occurring in a node.
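A minimal sketch of one inner-list entry, assuming the triple layout described above (field names are illustrative):

```java
import java.util.List;

// One PI-index entry for a (keyword, node) pair: the node id, the
// root-to-node path probability, and one probability per mutually-exclusive
// part (a single-element list when all keyword instances are independent).
class PiEntry {
    final String nodeId;          // Dewey id of the node v
    final double pathProb;        // Pr(path): conditional prob. from root to v
    final List<Double> partProbs; // one value per mutually-exclusive part

    PiEntry(String nodeId, double pathProb, List<Double> partProbs) {
        this.nodeId = nodeId;
        this.pathProb = pathProb;
        this.partProbs = partProbs;
    }
}

// The outer list maps each keyword to an inner list of entries sorted in
// document order, e.g.: Map<String, List<PiEntry>> piIndex;
```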

4.1 Basic Operations of Building PI Index

To build the PI index, we traverse the given XML data tree once in a bottom-up manner. During the traversal, we apply the following operations, used either alone or in combination. One binary operation promotes the probability held by a child node $Y$ to its parent node $X$; another promotes the probabilities of two sibling nodes $X$ and $Y$ to their common parent. The $n$-ary case is processed by calling the corresponding binary operations one by one.

Assume node $v_1$ contains the keywords $\{k_1, k_2, \ldots, k_i, \ldots, k_m\}$ with conditional probability $Pr(v_1 \mid v)$, and node $v_2$ contains a (possibly overlapping) set of keywords with conditional probability $Pr(v_2 \mid v)$.

Operator1 ($v_1$ and $v_2$ to parent): If $v_1$ and $v_2$ are independent sibling nodes, we can directly promote their probabilities to their parent $v$; for each keyword $k_i$, we have:

$$Pr(k_i, v) = 1 - \big(1 - Pr(k_i, v_1) \cdot Pr(v_1 \mid v)\big)\big(1 - Pr(k_i, v_2) \cdot Pr(v_2 \mid v)\big) \qquad (4)$$

Operator2 ($v_1$ to parent): If $v_1$ is an independent child of $v$, we can directly promote the probability of $v_1$ to $v$; for each keyword $k_i$, we have:

$$Pr(k_i, v) = 1 - \big(1 - Pr(k_i, v)\big)\big(1 - Pr(k_i, v_1) \cdot Pr(v_1 \mid v)\big) \qquad (5)$$
Example 5

Let us show the procedure of computing the term probabilities in Figure 2 using Operator1 and Operator2. First, we compute and promote the probabilities of the keywords to their parent by Operator1, i.e., $1 - (1 - 0.5 \times 1.0)(1 - 0.3 \times 1.0) = 0.65$ for the first keyword, and 0.3 for the second. Then we apply Operator2 to promote the probabilities further upwards, i.e., $1 - (1 - 0.3)(1 - 0.4) = 0.58$ for the second keyword, while the first keyword's probability does not change because the promoted node only contains the second keyword here. The conditional probability from the root to the node is 1.0. Therefore, the corresponding entries are inserted into the PI index.
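A sketch of these two operators under the independence assumption, matching the reconstructed Equations 4 and 5 (each child contribution is scaled by its edge probability and combined as the complement of a product of non-occurrence probabilities):

```java
// Operator1: two independent sibling nodes promoted to their common parent.
static double operator1(double pK1, double edge1, double pK2, double edge2) {
    return 1.0 - (1.0 - pK1 * edge1) * (1.0 - pK2 * edge2);
}

// Operator2: an independent child promoted into its parent's existing
// probability for the same keyword.
static double operator2(double pKparent, double pKchild, double edgeChild) {
    return 1.0 - (1.0 - pKparent) * (1.0 - pKchild * edgeChild);
}
```

On Example 5's numbers, operator1(0.5, 1.0, 0.3, 1.0) = 0.65 and operator2(0.3, 0.4, 1.0) = 0.58.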

Operator3 (mutually-exclusive siblings): If $v_1$ and $v_2$ are two mutually-exclusive sibling nodes and $v$ is their parent, then we generate two parts in $v$, one from $v_1$ and one from $v_2$, each promoted as in Operator2.

Operator4 (mutually-exclusive child): If $v_1$ is a mutually-exclusive child node of $v$, then we can get the aggregated probability by $Pr(k_i, v) = Pr(k_i, v) + Pr(k_i, v_1) \cdot Pr(v_1 \mid v)$, since the two occurrences cannot co-exist.

In the above four basic operators, we assume the terms appear independently within $v_1$ and $v_2$. When $v_1$ and $v_2$ themselves contain mutually-exclusive parts, we need to deal with each part using the four basic operators.

Consider two independent sibling nodes $v_1$ and $v_2$ where only $v_2$ contains a set of mutually-exclusive parts, with conditional probability $Pr(v_2 \mid v)$. In this case, we apply the Operator1 computation once for each part of $v_2$. The computed results are maintained as different parts in their parent $v$.

Example 6

Consider an independent node and a sibling node consisting of three mutually-exclusive parts in Figure 1. We first promote the children into the sibling node, which then consists of three parts (with the part probabilities shown in Figure 3). Because the two siblings are independent, Operator1 can be called to compute the probability with regard to their parent; to do this, we apply Operator1 once for each part. The results are shown in Figure 3. After that, we insert the corresponding entries into the PI index.

If both $v_1$ and $v_2$ contain a set of mutually-exclusive parts, then we can do pairwise aggregations across the two sets of parts. Building the PI index needs to scan the given probabilistic XML data only once. Assume that the probabilistic XML data has been encoded using probabilistic Dewey codes. The basic idea is to progressively process the document nodes sorted by Dewey code in ascending order, i.e., the data can be loaded and processed in a streaming fashion. When a leaf node arrives, we compute the probability of each term in it. After that, the terms with their probabilities are written into the PI index. Next, we promote the terms and their probabilities to the parent node based on the operation types above. After the node stream is scanned completely, the building algorithm terminates. We do not provide the detailed building algorithm in this paper.
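The following hedged sketch illustrates this bottom-up aggregation, reusing the PrXmlNode and PiEntry sketches above. For brevity it replaces the streaming, Dewey-ordered pass with a plain recursion, handles only the independent case (Operator1/Operator2), and uses the node label in place of a Dewey id:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Aggregates per-term probabilities bottom-up and writes one PI entry per
// (term, node) pair. MUX parts would add a per-part list as in PiEntry.
static Map<String, Double> buildEntries(PrXmlNode v, double pathProb,
                                        Map<String, List<PiEntry>> index) {
    Map<String, Double> probs = new HashMap<>();
    if (v.children.isEmpty()) {
        probs.put(v.label, 1.0);  // assume a leaf contributes its own term
    }
    for (PrXmlNode c : v.children) {
        Map<String, Double> childProbs =
            buildEntries(c, pathProb * c.edgeProb, index);
        for (Map.Entry<String, Double> e : childProbs.entrySet()) {
            double old = probs.getOrDefault(e.getKey(), 0.0);
            // Operator2-style promotion of an independent child
            probs.put(e.getKey(),
                      1.0 - (1.0 - old) * (1.0 - e.getValue() * c.edgeProb));
        }
    }
    for (Map.Entry<String, Double> e : probs.entrySet())
        index.computeIfAbsent(e.getKey(), k -> new ArrayList<>())
             .add(new PiEntry(v.label, pathProb, List.of(e.getValue())));
    return probs;
}
```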

4.2 Pruning Techniques using PI Index

In this subsection, we first show how to prune unqualified nodes using the proposed lower/upper bounds. We then explain how to compute the lower/upper bounds, and how to update them based on intermediate results during query evaluation.

By default, the node lists in the PI index are sorted in document order. $Pr(k, v)$ represents the overall probability of a keyword $k$ in a node $v$. Obviously, the overall probability of a keyword appearing in a node is larger than or equal to that of the keyword appearing in any of its descendant nodes. The overall probability value for each keyword in a node is computed and stored in the PI index offline.

Consider a node $v$ and a PrTKQ $q$ containing a set of keywords $\{k_1, \ldots, k_m\}$. If all the terms in $v$ are independent, then we have:

$$LB(q, v) = \prod_{k_i \in q} Pr(k_i, v) \qquad (6)$$

$$UB(q, v) = \min_{k_i \in q} Pr(k_i, v) \qquad (7)$$

Most of the time, $v$ consists of a set of mutually-exclusive parts. In this case, the lower bound of $v$ is generated from a part $p_i$ that gives the highest lower bound value, while the upper bound of $v$ is generated from a part $p_j$ that gives the highest upper bound value, where $p_i$ may or may not be equal to $p_j$:

$$LB(q, v) = \max_{p \in v} \prod_{k_i \in q} Pr(k_i, v, p) \qquad (8)$$

$$UB(q, v) = \max_{p \in v} \min_{k_i \in q} Pr(k_i, v, p) \qquad (9)$$

where a part $p$ is considered only if it contains all the keywords of $q$. If no part satisfies this criterion, $UB(q, v)$ and $LB(q, v)$ are set to zero.

Example 7

Let us consider one node in Figure 3 as an example. Its first and second parts can generate lower and upper bounds: Part 1 gives LB = 0.32 and UB = 0.4, while Part 2 gives LB = 0.206 and UB = 0.24. Because Part 1 produces a higher upper bound than Part 2, the lower and upper bounds of the node come from Part 1, which guarantees that the node can be a quasi-SLCA candidate with the highest possible probability. Since Part 3 does not contain all the keywords (one keyword is missing), it cannot generate lower or upper bounds.
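A sketch of the corresponding bound computation, assuming the reconstructed Equations 6-9 (a node's PI entry is modelled here as one keyword-to-probability map per mutually-exclusive part):

```java
import java.util.List;
import java.util.Map;

// Within a part, LB is the product (Eq. 6) and UB the minimum (Eq. 7) of the
// per-keyword probabilities; across mutually-exclusive parts the highest
// value of each bound is kept (Eqs. 8 and 9). A part missing a query keyword
// generates no bounds.
static double[] computeBounds(List<Map<String, Double>> parts,
                              List<String> queryKeywords) {
    double lb = 0.0, ub = 0.0;
    for (Map<String, Double> part : parts) {
        if (!part.keySet().containsAll(queryKeywords)) continue;
        double prod = 1.0, min = 1.0;
        for (String k : queryKeywords) {
            double p = part.get(k);
            prod *= p;
            min = Math.min(min, p);
        }
        lb = Math.max(lb, prod);
        ub = Math.max(ub, min);
    }
    return new double[]{lb, ub};  // {0, 0} if no part qualifies
}
```

On Example 7's node, the sketch returns LB = 0.32 and UB = 0.4, both contributed by Part 1.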

Property 1

[Upper Bound Usage] A node $v$ can be filtered if the overall probability of any keyword $k_i \in q$ is lower than the given threshold value $\tau$, i.e., $\exists k_i \in q:\ Pr(k_i, v) < \tau$.

Proof. Since $UB(q, v) = \min_{k_i \in q} Pr(k_i, v)$ is the upper bound probability of $v$ becoming a qualified quasi-SLCA node, if a node $v$ satisfies $Pr(k_i, v) < \tau$ for some $k_i \in q$, then $UB(q, v)$ must also be lower than $\tau$. As such, $v$ can be filtered.

Property 2

[Lower Bound Usage] A node $v$ can be returned as a required result if $LB(q, v) \ge \tau$ and $UB(q, c) < \tau$, where $c$ is any child or descendant node of $v$.

Proof. $UB(q, c) < \tau$ for every descendant $c$ means that all the keyword nodes in the subtree rooted at $v$ will contribute their probabilities to node $v$. In other words, no descendant node of $v$ can be a quasi-SLCA, so nothing is deducted from the lower bound probability. Therefore, if $LB(q, v) \ge \tau$, then $Pr^{G}_{quasi}(v, q) \ge LB(q, v) \ge \tau$. As such, $v$ can be returned as a quasi-SLCA result.

Example 8

Let us continue Example 7. The node can be directly returned as a qualified answer for the given threshold ($\tau = 0.4$). This is because its relevant descendant nodes have been filtered, their upper bounds being less than the threshold ($\tau = 0.4$).

To update the lower/upper bound values during query evaluation, one option is to treat the different types of nodes differently, by which the updated lower/upper bounds may obtain better precision. The disadvantage of this option is that it easily hurts the efficiency of the bound update: given a current node with multiple quasi-SLCA nodes among its descendants, it requires knowing the detailed relationships (IND or MUX) among those quasi-SLCA nodes. To avoid this disadvantage, we do not distinguish the types of the distributional nodes under which the multiple quasi-SLCA nodes appear. Instead, we unify them into a uniform formula based on the following two properties.

Property 3

No matter whether a node $v$ is an IND, ordinary or MUX node, we can update its upper bound value as follows:

$$UB(q, v) = UB(q, v) - 1 + \prod_{u \in D_v}\big(1 - Pr^{G}_{quasi}(u, q)\big) \qquad (10)$$

where $D_v$ denotes the descendants of $v$ that have already been returned as qualified quasi-SLCA answers, and $UB(q, v) \ge 0$ should hold (otherwise it is set to zero).

Proof. According to the definition of the upper bound, $UB(q, v)$ represents the maximal probability of $v$ being a quasi-SLCA node, which comes from the overall probability of a specific keyword. Therefore, updating the upper bound can alternatively be considered as deducting the share of that probability already consumed by the descendant nodes that became qualified quasi-SLCA answers. If the qualified descendants in $D_v$ are independent, their aggregated probability is $1 - \prod_{u \in D_v}(1 - Pr^{G}_{quasi}(u, q))$, which gives the update in Equation 10.

Does the above update equation hold for MUX nodes? To answer this question, we utilize the properties in [23], from which the aggregated probability of mutually-exclusive descendants is $\sum_{u \in D_v} Pr^{G}_{quasi}(u, q)$. Since every $Pr^{G}_{quasi}(u, q)$ is a positive value, we can derive

$$\sum_{u \in D_v} Pr^{G}_{quasi}(u, q) \ \ge\ 1 - \prod_{u \in D_v}\big(1 - Pr^{G}_{quasi}(u, q)\big),$$

so the deduction in Equation 10 never overstates the probability consumed under MUX semantics. Therefore, Equation 10 holds for IND, ordinary and MUX nodes.

Property 4

No matter whether a node $v$ is an IND, ordinary or MUX node, we can update its lower bound value as follows:

$$LB(q, v) = LB(q, v) - \sum_{u \in D_v} Pr^{G}_{quasi}(u, q) \qquad (11)$$

where $LB(q, v) \ge 0$ should hold (otherwise it is set to zero).

Proof. For the lower bound update, we need to deduct the confirmed probability, which is $1 - \prod_{u \in D_v}(1 - Pr^{G}_{quasi}(u, q))$ for IND nodes or $\sum_{u \in D_v} Pr^{G}_{quasi}(u, q)$ for MUX nodes, from the original lower bound $LB(q, v)$. By the inequality in the proof above, the sum is never smaller than the complement-of-product expression, so it is safe to deduct the sum in both cases: the updated value remains a valid lower bound.

Example 9

Consider that one node has been computed and its probability is 0.44. Given the threshold $\tau = 0.4$, it is returned as a quasi-SLCA result. Consequently, we can update the lower/upper bound values of its ancestor, i.e., $UB = 0.65 - 1 + (1 - 0.44) = 0.21$ and $LB = 0.595 - 0.44 = 0.155$. Since $UB = 0.21 < \tau$, the ancestor can be filtered out effectively without computation.
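A small sketch of these two updates (the method names are illustrative):

```java
// Once descendants of a node are confirmed as quasi-SLCA answers with
// probabilities answerProbs, the deductions below are safe for IND, MUX and
// ordinary nodes alike (Properties 3 and 4).
static double updatedUpperBound(double ub, double[] answerProbs) {
    double noneOccurs = 1.0;
    for (double p : answerProbs) noneOccurs *= (1.0 - p);
    return Math.max(0.0, ub - 1.0 + noneOccurs);   // Equation (10)
}

static double updatedLowerBound(double lb, double[] answerProbs) {
    double sum = 0.0;
    for (double p : answerProbs) sum += p;
    return Math.max(0.0, lb - sum);                // Equation (11)
}
```

With Example 9's numbers, updatedUpperBound(0.65, {0.44}) = 0.21 and updatedLowerBound(0.595, {0.44}) = 0.155.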

Property 3 is used to filter unqualified nodes by reducing the upper bound value, while Property 4 is used to quickly identify qualified results by comparing the reduced lower bound value (the probability of the remaining quasi-SLCAs) with the threshold value.

Sometimes we need to calculate the probability distributions of the keywords in a node when the given threshold falls within the range $[LB(q, v), UB(q, v)]$. The basic computational procedure is similar to the PrStack algorithm in [23]. Different from the PrStack algorithm, we introduce a probability density function (PDF) to approximately calculate the probability of a node when the node depends on a large number of pruned descendant nodes. To decide when to invoke the PDF while avoiding a significant loss of precision, we select and exactly compute the descendant nodes that may contribute large probabilities to the node. For the remaining descendant nodes, we may invoke the PDF, reducing the time cost while still guaranteeing precision to some extent. The detailed procedure is introduced in the next section.

5 Prune-Based Probabilistic Threshold Keyword Query Algorithm

A key challenge of answering a PrTKQ is to identify the qualified result candidates and filter the unqualified ones as soon as possible. In this work, we address this challenge with the help of our proposed probabilistic inverted (PI) index. Two efficient algorithms are proposed, a comparable Baseline Algorithm and a PI-based Algorithm.

5.1 Baseline Algorithm

In keyword search on certain XML data, it is popular to use a keyword inverted index to retrieve the relevant keyword nodes, from which the keyword search results are generated by different algorithms, e.g., [20, 16, 24, 25]. For probabilistic XML data, [23] proposed the PrStack algorithm to compute top-$k$ SLCA nodes. In this section, we propose an effective Baseline Algorithm that follows a similar idea to the PrStack algorithm. To answer a PrTKQ, we need to scan all the keyword inverted lists once. The keyword-matched nodes are read one by one in document order. After a node is processed, we check whether its probability reaches the given threshold value $\tau$. If so, the node is output as a quasi-SLCA node and its remaining keyword distributions (i.e., those containing only partial query keywords) are promoted to its parent node. Otherwise, we promote its complete keyword distributions (i.e., those containing all keywords as well as partial keywords) to its parent node. After that, the node at the top of the stack is popped. The above steps are repeated until all nodes are processed, at which point the algorithm terminates. The detailed procedure is shown in Algorithm 1.

input: a query $q$ with threshold $\tau$, keyword inverted (KI) index
output: a set $R$ of quasi-SLCA nodes

1:  load keyword node lists $L = \{l_1, \ldots, l_m\}$ from the KI index;
2:  get the smallest Dewey code $v$ from $L$;
3:  initiate a stack $S$ using $v$;
4:  while $L \neq \emptyset$ do
5:     get the next smallest Dewey code $v'$ from $L$;
6:     while $v'$ is not a descendant of top($S$) do
7:         $v$ = pop($S$);
8:         if $v$ contains all keywords and $Pr^{G}_{quasi}(v, q) \ge \tau$ then
9:            output $v$ into $R$;
10:        promote the rest keyword distributions of $v$ to its parent using CombineProb($v$, parent($v$));
11:    push($S$, $v'$);
12: while $S \neq \emptyset$ do
13:    a node $v$ = pop($S$);
14:    if $v$ contains all keywords and $Pr^{G}_{quasi}(v, q) \ge \tau$ then
15:        output $v$ into $R$;
16:    promote the rest keyword distributions of $v$ to its parent using CombineProb($v$, parent($v$));
17: return $R$;
Algorithm 1 Baseline Algorithm

Because the Baseline Algorithm only needs to scan the keyword node lists once, it is a fast and simple algorithm. However, its core computation, the keyword distribution computation, consumes a lot of time. This motivates us to propose the PI-based Algorithm, which can quickly identify qualified or unqualified candidates using the offline-computed PI index and only computes keyword distributions for a few candidates. The Baseline Algorithm is thus used as a comparison base to show the pruning performance of the PI-based Algorithm described below.

5.2 PI-based Algorithm

To efficiently answer a PrTKQ, the basic idea of the PI-based Algorithm is to read the nodes from the keyword node lists one by one in a bottom-up strategy. For each node, we quickly compute its lower and upper bounds by accessing the PI index, which is far faster than computing the keyword distributions of the node directly. After comparing its lower/upper bounds with the given threshold value, we can decide whether the node should be output as a qualified answer, skipped as an unqualified result, or cached as a potential result candidate. For example, if the current node's lower bound is larger than or equal to the threshold value, the node can be output directly without further computation, because all its descendants have already been checked under the bottom-up strategy. If its upper bound is lower than the threshold value, the node is filtered out. Otherwise, it is temporarily cached for further checking. Based on these cases, different operations are applied; only the nodes identified as potential result candidates need to be computed. Compared with the Baseline Algorithm, the PI-based algorithm is significantly faster because the Baseline Algorithm has to compute the keyword distributions for all nodes. The detailed procedure is shown in Algorithm 2.

input: a query $q$ with threshold $\tau$, keyword inverted (KI) index, PI index
output: a set $R$ of quasi-SLCA nodes

1:  load keyword node lists $L = \{l_1, \ldots, l_m\}$ from the KI index;
2:  load probability node lists $PL$ from the PI index;
3:  get the smallest Dewey code $v$ from $L$;
4:  initiate a stack $S$ using $v$ and an empty stack $S_{pruned}$;
5:  while $L \neq \emptyset$ do
6:     get the next smallest Dewey code $v'$ from $L$;
7:     while $v'$ is not a descendant of top($S$) do
8:         $v$ = pop($S$);
9:         UB($q$,$v$) and LB($q$,$v$) $\leftarrow$ ComputeBound($v$, $PL$);
10:        if LB($q$,$v$) $\ge \tau$ then
11:           output $v$ into $R$;
12:           UpdateBound($S$, LB($q$,$v$), UB($q$,$v$));
13:           pop($S_{pruned}$);
14:        else if UB($q$,$v$) $\ge \tau >$ LB($q$,$v$) then
15:           Prob($q$,$v$) $\leftarrow$ ComputeProbDist($v$, $S_{pruned}$);
16:           if Prob($q$,$v$) $\ge \tau$ then
17:              output $v$ into $R$;
18:              UpdateBound($S$, Prob($q$,$v$));
19:              pop($S_{pruned}$);
20:        else
21:           push($S_{pruned}$, $v$);
22:    push($S$, $v'$);
23: while $S \neq \emptyset$ do
24:    a node $v$ = pop($S$);
25:    UB($q$,$v$) and LB($q$,$v$) $\leftarrow$ ComputeBound($v$, $PL$);
26:    process the node $v$ using the same code as in Line 10 - Line 21;
27: return $R$;
Algorithm 2 PI-based Algorithm

5.2.1 Detailed Procedure of PI-based Algorithm

In Algorithm 2, Line 1 - Line 4 show the initialization of the PI-based Algorithm. We first load the keyword node lists $L$ from the KI index and the probability node lists from the PI index. We then take the smallest node from $L$ to initiate a stack $S$ with its Dewey code. Another stack $S_{pruned}$ is used to maintain the temporarily filtered nodes. After that, the PI-based Algorithm is ready to start.

Next, we check each node in $L$ in document order. Different from the Baseline Algorithm, we only compute the keyword distribution probabilities for the few nodes that cannot be resolved using the lower and upper bounds. Let $v'$ be the next smallest node to be processed. We compare it with the node at the top of stack $S$. If $v'$ is a descendant of that node, then $v'$ is pushed into $S$ and we fetch the next smallest node from $L$. Otherwise, we pop a node $v$ from $S$ and check whether it is a qualified quasi-SLCA answer. The Baseline Algorithm would compute the keyword distributions of $v$ and combine its remaining distributions with the distribution of its parent through promotion operations. The PI-based Algorithm instead quickly computes the upper bound UB($q$,$v$) and lower bound LB($q$,$v$) using the PI index, which classifies nodes as qualified (output), unqualified (filter) or uncertain (to be further checked). By doing this, only a few nodes need full computation. Since bound computation is far faster than keyword distribution computation, a lot of running time is saved. Line 10 - Line 21 show the details. If the lower bound LB($q$,$v$) is larger than or equal to the given threshold value $\tau$, then $v$ is output as a qualified quasi-SLCA answer without computation. At this moment, LB($q$,$v$) is taken as the temporary probability of $v$ being a quasi-SLCA result; computing its exact probability is delayed until it is actually needed. Subsequently, the temporary probability LB($q$,$v$) and the probabilities of descendant quasi-SLCA results are used to update the lower/upper bounds of the ancestors of $v$ in stack $S$ based on Equation 11 and Equation 10, respectively. If LB($q$,$v$) is lower than $\tau$ while UB($q$,$v$) is larger than or equal to $\tau$, then we compute the keyword distributions of $v$ using the cached descendant nodes in $S_{pruned}$. Based on the computed probability of $v$, it is either output as a qualified answer or filtered as an unqualified candidate. If UB($q$,$v$) is lower than $\tau$, then $v$ is pushed into $S_{pruned}$ for the possible computation of its ancestors.

There are two main functions in the PI-based Algorithm. The first is ComputeProbDist($v$, $S_{pruned}$), which computes the probability of the full keyword distribution of $v$ using the descendant nodes cached in $S_{pruned}$. The second is UpdateBound($S$, LB($q$,$v$) or Prob($q$,$v$)), which updates the bounds of the nodes still to be processed.

5.2.2 Function ComputeProbDist()

The function ComputeProbDist($v$, $S_{pruned}$) can be implemented in two ways: Exact Computation or Approximate Computation.

Exact Computation actually calculates the probability of $v$ being a quasi-SLCA node by scanning all the nodes in the stack $S_{pruned}$ that maintains the descendants of $v$. The processing strategy is similar to the Baseline Algorithm in Section 5.1. In other words, it visits the nodes in $S_{pruned}$ one by one, computes the local keyword distribution of each node, and then promotes the intermediate results to its parent. After all nodes in $S_{pruned}$ are processed, the probability of $v$ is obtained as the aggregation of all the probabilities from its descendant nodes.

Approximate Computation approximately calculates the probability of $v$ being a quasi-SLCA node based on a partial set of the nodes in the stack $S_{pruned}$. The approximation can be made under different distribution types, e.g., uniform distributions, piecewise polynomials, Poisson distributions, etc. In this work, we consider the normal (Gaussian) distribution in more detail.

As we know, the Gaussian distribution is the most prominent probability distribution in statistics. However, its PDF cannot be applied to PrTKQ over probabilistic XML data directly due to two main challenges. The first challenge is to simulate the continuous distribution using discrete distributions based on the real conditions, in order to reduce the approximation error as much as possible; the second is to embody multiple keyword variables in the PDF.

Generally, the probability density function of a Gaussian distribution with mean $\mu$ and variance $\sigma^2$ is:

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x - \mu)^2}{2\sigma^2}} \qquad (12)$$

Addressing Challenge 1: The density function has a bell shape centered at the mean value $\mu$ with variance $\sigma^2$. By definition, the Gaussian distribution is often used to describe, at least approximately, measurements that tend to cluster around the mean. Therefore, we take the mean $\mu$ to be the partially computed probability of $v$ being a quasi-SLCA node, which guarantees that the real probability value will not differ greatly from the probability base already calculated from the promising descendant nodes. The value of the variance can be chosen from the range [1 - #computed descendant nodes / #total descendant nodes, 1] based on the visited/unvisited descendant nodes in $S_{pruned}$. This is because the more descendant nodes are actually computed, the higher the percentage of values drawn within one standard deviation of the mean. In the extreme, if all descendant nodes are computed exactly, 100% of the values can be drawn within one standard deviation. Therefore, we select and compute the few descendant nodes of $v$ that can contribute relatively high probabilities to $v$ becoming a quasi-SLCA node; in this work, we use a heuristic that picks the descendant nodes with the highest single-keyword probabilities. We then take the partially computed probability as the base (mean) of the Gaussian probability density function.

Let $v$ be a node to be evaluated and UB($q$,$v$) be its current upper bound value. (Note that UB($q$,$v$) has already been updated if $v$ has descendant nodes that are qualified answers, i.e., the probability contributions of the qualified answers have been deducted.) We estimate the probability of $v$ by the Gaussian mass lying between the threshold and the upper bound:

$$Pr(q, v) \approx \int_{\tau}^{UB(q, v)} f(x)\, dx \qquad (13)$$

After substituting Equation 12 into Equation 13, we get:

$$Pr(q, v) \approx \int_{\tau}^{UB(q, v)} \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x - \mu)^2}{2\sigma^2}}\, dx \qquad (14)$$

where $\mu$ is the partially computed probability and $\sigma$ is set as 1 - #computed descendant nodes / #total descendant nodes.
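A hedged numeric sketch of Equation 14 (the paper calls Matlab for this step; here an Abramowitz-Stegun approximation of the error function is used instead, since Java's standard library provides none):

```java
// Gaussian mass of N(mu, sigma^2) between the threshold tau and the current
// upper bound ub, evaluated via the error function.
static double gaussianMass(double mu, double sigma, double tau, double ub) {
    double a = (tau - mu) / (sigma * Math.sqrt(2.0));
    double b = (ub  - mu) / (sigma * Math.sqrt(2.0));
    return 0.5 * (erf(b) - erf(a));
}

// Abramowitz-Stegun 7.1.26 approximation of erf, accurate to about 1e-7.
static double erf(double x) {
    double t = 1.0 / (1.0 + 0.3275911 * Math.abs(x));
    double y = 1.0 - (((((1.061405429 * t - 1.453152027) * t + 1.421413741)
                 * t - 0.284496736) * t + 0.254829592) * t) * Math.exp(-x * x);
    return x >= 0 ? y : -y;
}
```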

Addressing Challenge 2: To embody all the keyword variables in the PDF, we introduce the joint/conditional Gaussian distribution based on the work in [26]. Assume a PrTKQ $q$ contains two keywords $k_1$ and $k_2$ with associated variables $x_1$ and $x_2$. We have the conditional PDF as follows:

$$f(x_1, x_2) = f(x_1 \mid x_2)\, f(x_2) \qquad (15)$$

Since $f(x_1 \mid x_2) = \frac{f(x_1, x_2)}{f(x_2)}$, after substituting Equation 12 into Equation 15, we get:

$$f(x_1, x_2) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}} \exp\left(-\frac{1}{2(1-\rho^2)}\left[\frac{(x_1-\mu_1)^2}{\sigma_1^2} - \frac{2\rho(x_1-\mu_1)(x_2-\mu_2)}{\sigma_1\sigma_2} + \frac{(x_2-\mu_2)^2}{\sigma_2^2}\right]\right) \qquad (16)$$

where $\rho$ is the correlation between $x_1$ and $x_2$.

If we assume that $x_1$ and $x_2$ are independent keyword variables, i.e., $\rho = 0$, and further assume $\mu_1 = \mu_2 = \mu$ and $\sigma_1 = \sigma_2 = \sigma$, then we have:

$$f(x_1, x_2) = f(x_1)\, f(x_2) = \frac{1}{2\pi\sigma^2}\, e^{-\frac{(x_1 - \mu)^2 + (x_2 - \mu)^2}{2\sigma^2}} \qquad (17)$$

Therefore, Equation 17 can easily be extended to multiple keyword variables that are assumed independent, and we can compute the probability of $v$ w.r.t. a PrTKQ $q = \{k_1, \ldots, k_m\}$ as:

$$Pr(q, v) \approx \int_{\tau}^{UB(q, v)} \cdots \int_{\tau}^{UB(q, v)} \prod_{i=1}^{m} f(x_i)\ dx_1 \cdots dx_m \qquad (18)$$

where $\mu$ is the partially computed probability and $\sigma$ is set as 1 - #computed descendant nodes / #total descendant nodes.

In the experiments, we call Matlab from Java to calculate Equation 18. The estimated results are used to compare the exact computation with the approximate computation; the results verify the usability of the Gaussian distribution for estimating the probability.

5.2.3 Function UpdateBound()

For each ancestor node of $v$ in stack $S$, we need to update the upper and lower bounds using the function UpdateBound() based on Equation 10 and Equation 11, respectively. To guarantee the completeness of the answer set, the parameters of the function differ by case. For example, if LB($q$,$v$) is larger than or equal to the threshold value $\tau$, as in Algorithm 2: Line 12, then the probability value LB($q$,$v$) is used to update the upper bounds of the ancestors while UB($q$,$v$) is used to update their lower bounds; if LB($q$,$v$) is smaller than $\tau$ and UB($q$,$v$) is larger than or equal to $\tau$, as in Algorithm 2: Line 18, the actual or approximate probability computed by ComputeProbDist($v$, $S_{pruned}$) is used to update both the upper and lower bounds of the ancestors.

Here, we use two hashmaps to implement the function UpdateBound(). The first hashmap caches the Dewey code of a node as its key and the node's lower/upper bounds (computed from the PI index) as its value. The second hashmap records the probabilities of the node's descendants that have already been identified as qualified quasi-SLCA answers. When a node arrives, we can quickly obtain its updated lower/upper bounds from the two hashmaps.
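A minimal sketch of this bookkeeping (names are illustrative; it assumes cacheBounds is called for a node before its bounds are queried):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class BoundTracker {
    // dewey -> {LB, UB}, as computed offline from the PI index
    private final Map<String, double[]> bounds = new HashMap<>();
    // dewey -> probabilities of descendants confirmed as quasi-SLCA answers
    private final Map<String, List<Double>> answered = new HashMap<>();

    void cacheBounds(String dewey, double lb, double ub) {
        bounds.put(dewey, new double[]{lb, ub});
    }

    void recordAnswer(String ancestorDewey, double prob) {
        answered.computeIfAbsent(ancestorDewey, k -> new ArrayList<>())
                .add(prob);
    }

    // Bounds already adjusted by Equations (10) and (11).
    double[] currentBounds(String dewey) {
        double[] b = bounds.get(dewey);
        double sum = 0.0, noneOccurs = 1.0;
        for (double p : answered.getOrDefault(dewey, List.of())) {
            sum += p;
            noneOccurs *= (1.0 - p);
        }
        return new double[]{Math.max(0.0, b[0] - sum),
                            Math.max(0.0, b[1] - 1.0 + noneOccurs)};
    }
}
```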

6 Experimental Studies

We conduct extensive experiments to test the performance of our algorithms: the Baseline Algorithm (BA); the PI-based Exact-computation Algorithm (PIEA), which implements the function ComputeProbDist() by exactly computing the probability distributions of the keyword-matched nodes; and the PI-based Approximate-computation Algorithm (PIAA), which makes approximate computations based on the Gaussian distribution of keywords while still exactly computing the probability distributions of the keyword-matched nodes with the higher probabilities. All these algorithms were implemented in Java and run on a 3.0GHz Intel Pentium 4 machine with 2GB RAM running Windows 7.

6.1 Dataset and Queries

We use two real datasets, DBLP [27] and Mondial [28], and a synthetic XML benchmark dataset, XMark [29], for testing the proposed algorithms. For XMark, we also generate four datasets of different sizes. The three types of datasets are selected for their features: DBLP is a relatively shallow dataset of large size; Mondial is a deep and complex, but small dataset; XMark is a balanced dataset with varied depth, complex structure and varied size. Therefore, they are chosen as test datasets.

For each XML dataset used, we generate the corresponding probabilistic XML tree using the same method as in [12]. We visit the nodes of the original XML tree in pre-order. We first set the random ratio of IND:MUX:ordinary nodes to 3:3:4. For each visited node $v$, we randomly generate some distributional nodes of type IND or MUX as children of $v$. Then, for the original children of $v$, we choose some of them as children of the newly generated distributional nodes and assign random probability distributions to these children, with the restriction that their sum for a MUX node is no greater than 1. The generated datasets are described in Table I. We select terms and construct a set of keyword queries to be tested for each dataset. Due to limited space, we only show six of these queries per dataset. In each set of queries, the terms of the first two queries match small sets of keyword nodes, the terms of the middle two queries match medium-sized sets of keyword nodes, and the terms of the last two queries require computation over a large number of keyword-matched nodes.

Table I: The generated probabilistic XML datasets

ID    Name   Size  #IND  #MUX  #Ordinary
Doc1  XMark  10M   26k   26k   145k
Doc2  XMark  20M   54k   52k   200k
Doc3  XMark  40M   98k   100k  606k
Doc4  XMark  80M