Quasi-SLCA-based Keyword Query Processing over Probabilistic XML Data
Abstract
The probabilistic threshold query is one of the most common queries in uncertain databases, where a result satisfying the query must also have a probability that meets the threshold requirement. In this paper, we investigate probabilistic threshold keyword queries (PrTKQ) over XML data, which have not been studied before. We first introduce the notion of quasi-SLCA and use it to represent the results of a PrTKQ under possible world semantics. Then we design a probabilistic inverted (PI) index that can be used to quickly return the qualified answers and filter out the unqualified ones based on our proposed lower/upper bounds. After that, we propose two efficient and comparable algorithms: a Baseline Algorithm and a PI-index-based Algorithm. To accelerate the algorithms, we also utilize a probability density function. An empirical study using real and synthetic data sets has verified the effectiveness and efficiency of our approaches.
Keywords: Probabilistic XML, Threshold Keyword Query, Probabilistic Index.
1 Introduction
Uncertainty is widespread in many web applications, such as information extraction, information integration, web data mining, etc. In uncertain databases, probabilistic threshold queries have been studied extensively, where all results satisfying the queries with probabilities equal to or larger than the given threshold values are returned [1, 2, 3, 4, 5]. However, all of these works were based on uncertain relational data models. Because the flexibility of the XML data model allows a natural representation of uncertain data, uncertain XML data management has become an important issue and much work has been done recently. For example, many probabilistic XML data models were designed and analyzed [6, 7, 8, 9, 10]. Based on different data models, query evaluation [7, 10, 11, 12, 13], algebraic manipulation [8] and updates [6, 10] were studied. However, most of these works concentrated on structured query processing, e.g., twig queries. In this paper, we propose and address a new, interesting and challenging problem: the Probabilistic Threshold Keyword Query (PrTKQ) over uncertain XML databases based on quasi-SLCA semantics, which, to the best of our knowledge, has not been studied before.
In general, an XML document can be viewed as a rooted tree, where each node represents an element or contents. XIRQL [14] supports keyword search in XML based on structured queries. However, users may not have knowledge of the structure of the XML data or the query language. As such, supporting pure keyword search in XML has attracted extensive research. The LCA-based approaches first identify the LCA nodes, each of which contains every keyword under its subtree at least once [15, 16, 17, 18, 19, 20, 21]. Since the LCA nodes sometimes are not very specific to users' queries, Xu and Papakonstantinou [20] proposed the concept of SLCA (smallest lowest common ancestor), where a node v is regarded as an SLCA if (a) the subtree rooted at v contains all the keywords, and (b) there does not exist a descendant node of v whose subtree contains all the keywords. In other words, if a node is an SLCA, then its ancestors are definitely excluded from being SLCAs. The SLCA semantics for modeling keyword search results on a deterministic XML tree are also applied in [22, 16, 19].
Based on the SLCA semantics, [23] discussed top-k keyword search over a probabilistic XML document. Given a keyword query and a probabilistic XML document (PrXML), [23] returned the top-k most relevant SLCA results (PrSLCAs) based on their probabilities. Different from the SLCA semantics over deterministic XML documents, a node v being a PrSLCA can only exclude its ancestors from being PrSLCAs with a probability. This probability can be calculated by aggregating the probabilities of the deterministic documents (called possible worlds) implied by the PrXML in which v is an SLCA.
However, it is not suitable to directly utilize the PrSLCA semantics for evaluating PrTKQs because the PrSLCA semantics are too strong. In some applications, users tend to be confident about the results to be searched, so relatively high probability threshold values may be given. Consequently, it is very likely that no qualified PrSLCA results will be returned. To solve this problem, we propose and utilize so-called quasi-SLCA semantics to define the results of a PrTKQ by relaxing the semantics of PrSLCA with regard to a given threshold value, i.e., besides the probability of a node v being a PrSLCA in the PrXML, the probability of v being a quasi-SLCA may also count the probabilities of v's descendants being PrSLCAs if their probabilities are below the specified threshold value. In other words, a node being a quasi-SLCA excludes its ancestors from being quasi-SLCAs by a probability only when this probability is no less than the given threshold; otherwise, this probability will contribute to its ancestors. This is different from the PrSLCA semantics, which always excludes the probability contribution from child nodes.
Example 1
Consider an aircraft-monitored battlefield application, where the useful information is captured as aerial photographs. By analysing the photographs, we can extract the possible objects (e.g., road, factory, airport, etc.) and attach some text description to them with probabilities, which can be stored in the PrXML format. Figure 1 is a snapshot of aircraft-monitored battlefield XML data. By issuing a keyword query, a military department could find the potential areas containing hazard buildings above a probability threshold.
Based on the semantics of PrSLCA, any of the candidate nodes, with probabilities 0.3, 0.14, 0.168, 0.24, 0.32 and 0.088, can become a PrSLCA result. The detailed procedure for calculating the probabilities of results will be shown later. As we know, users generally specify a threshold value as the confidence score of their issued query, e.g., a threshold of 0.40 represents that the users prefer to see answers whose probabilities are no less than 0.40. In this condition, no results can be returned to the users.
However, from Figure 1, we can see that if the probabilities of some of these nodes could contribute to their parent nodes, the parent nodes would become quasi-SLCA results. Unfortunately, the PrSLCA semantics exclude them from being results. This motivates us to relax the PrSLCA semantics to the quasi-SLCA semantics. According to the quasi-SLCA semantics, the probabilities of the two parent nodes being quasi-SLCA results are 0.44 and 0.56 with the contributions of their child nodes, respectively. As such, they are deemed interesting places to be returned.
Given a PrTKQ, our problem is to quickly compute all the quasi-SLCA nodes whose probabilities meet the threshold requirement. Users issuing PrTKQs generally expect to see the complete quasi-SLCA answer set as early as possible and do not need to know the accurate probability of each answer, which motivates us to design a Probabilistic Inverted (PI) index and a PI-based efficient algorithm for quickly identifying quasi-SLCA result candidates.
We summarize the contributions of this paper as follows:

- Based on our proposed quasi-SLCA result definition, we study probabilistic threshold keyword queries over uncertain XML data, which satisfy the possible world semantics. To the best of our knowledge, this problem has not been studied before.

- We design a probabilistic inverted (PI) index that can quickly compute the lower bound and upper bound for a threshold keyword query, by which many unqualified nodes can be pruned and qualified nodes can be returned as early as possible. To preserve the effectiveness of pruning, a probability density function is employed under the assumption of a Gaussian distribution.

- We propose two algorithms, a comparable Baseline Algorithm and a PI-based Algorithm, to efficiently find all the quasi-SLCA results meeting the threshold requirement.

- Experimental evaluation has demonstrated the efficiency and effectiveness of the proposed approaches.
The rest of this paper is organized as follows. In Section 2, we introduce the probabilistic XML model and the problem definition of probabilistic threshold keyword query. Section 3 shows the procedure of efficiently finding quasiSLCA results using an example. Section 4 first presents the data structure of PI index, discusses the basic building operations and pruning techniques of PI index, and provides the building algorithm of PI index. In Section 5, we propose a comparable baseline algorithm and a PIbased algorithm to find the qualified quasiSLCA results. We report the experimental results in Section 6. Section 7 discusses related works and Section 8 concludes the paper.
2 Probabilistic Data Model and Problem Definition
Probabilistic Data Model: A PrXML document defines a probability distribution over a space of deterministic XML documents. Each deterministic document belonging to this space is called a possible world. A PrXML document represented as a labelled tree has ordinary and distributional nodes. Ordinary nodes are regular XML nodes and they may appear in deterministic documents, while distributional nodes are only used for defining the probabilistic process of generating deterministic documents and they do not occur in those documents.
In this paper, we adopt a popular probabilistic XML model, PrXML [12, 23], which was first discussed in [7]. In this model, a PrXML document is considered as a labelled tree where distributional nodes have two types, IND and MUX. An IND node has children that are independent of each other, while the children of a MUX node are mutually exclusive, that is, at most one child can exist in a random instance document (called a possible world). A real number from (0,1] is attached to each edge in the XML tree, indicating the conditional probability that the child node will appear under the parent node given the existence of the parent node. An example of a PrXML document is given in Fig. 1. Unweighted edges have 1 as the default conditional probability.
The Semantics of PrSLCA in PrXML: According to the semantics of possible worlds, the global probability of a node v being a PrSLCA with regard to a given query q in the possible worlds is defined as follows:

Pr^G_slca(v, q) = Σ_{w ∈ W(D) ∧ slca(v, w, q)} Pr(w)    (1)

where W(D) denotes the possible worlds implied by the PrXML document D; slca(v, w, q) indicates that v is an SLCA in the possible world w for the query q; and Pr(w) is the existence probability of the possible world w. The superscript G means that Pr^G_slca(v, q) is the global probability of the node v being an SLCA w.r.t. q over all possible worlds.
Example 2
Consider a small PrXML document in Figure 2.a and all generated possible worlds in Figures 2.b-2.i, where a solid line represents the existence of an edge and a dashed line represents its absence. Given a possible world, we can compute its global probability based on the existence/absence of the edges, i.e., by multiplying the conditional probabilities of the existing edges and the complements of the conditional probabilities of the absent edges.
Given a keyword query q, we can compute the global probability of a node being a PrSLCA w.r.t. q by summing the probabilities of the possible worlds in which the node is an SLCA, according to Equation (1).
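As a concrete illustration of Equation (1), the following sketch enumerates the possible worlds of a tiny hypothetical PrXML tree (the tree shape, node names and probabilities are ours, not those of Figure 2) and sums the probabilities of the worlds in which each node is an SLCA:

```python
edges = {            # child -> (parent, conditional probability)
    "a":  ("r", 0.8),
    "x1": ("a", 0.5),   # leaf containing keyword "k1"
    "x2": ("a", 0.6),   # leaf containing keyword "k2"
}
keywords = {"x1": {"k1"}, "x2": {"k2"}}
query = {"k1", "k2"}

def enum_worlds(node):
    """Return [(nodes present in the subtree rooted at node, probability)]."""
    results = [({node}, 1.0)]
    for child, (parent, p) in edges.items():
        if parent != node:
            continue
        expanded = []
        for present, prob in results:
            for sub, sp in enum_worlds(child):          # child edge kept
                expanded.append((present | sub, prob * p * sp))
            expanded.append((present, prob * (1 - p)))  # child edge dropped
        results = expanded
    return results

def subtree_kws(n, nodes):
    """Keywords covered by the subtree of n within one possible world."""
    kws = set(keywords.get(n, set()))
    for c, (parent, _) in edges.items():
        if parent == n and c in nodes:
            kws |= subtree_kws(c, nodes)
    return kws

def is_descendant(d, a):
    while d in edges:
        d = edges[d][0]
        if d == a:
            return True
    return False

def slcas(nodes):
    covering = {n for n in nodes if query <= subtree_kws(n, nodes)}
    return {n for n in covering
            if not any(m != n and is_descendant(m, n) for m in covering)}

pr_slca = {}
for nodes, prob in enum_worlds("r"):
    for s in slcas(nodes):
        pr_slca[s] = pr_slca.get(s, 0.0) + prob
# Only "a" covers both keywords, and only when both leaves exist:
# Pr^G_slca("a", q) = 0.8 * 0.5 * 0.6 = 0.24
```

The enumeration is exponential in the number of probabilistic edges; it only serves to make the possible-world semantics tangible, and is exactly what the naive method of Section 3 has to avoid.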
The Semantics of quasiSLCA in PrXML:
Definition 1
Quasi-SLCA: Given a keyword query q and a threshold value δ, a node v is called a quasi-SLCA if and only if (1) v or its descendants are SLCAs in a set of possible worlds W; (2) the aggregated probability of v and its descendants being SLCAs in W is no less than δ; (3) no descendant node of v satisfies both of the above conditions in any set of possible worlds that overlaps with W.
In other words, if a descendant node v_d of v is a quasi-SLCA, then the probability of v_d has to be excluded from the probability of v being a quasi-SLCA. This means that the set of possible worlds in which v_d appears as an SLCA does not overlap with the set of possible worlds in which v or its other descendants appear as SLCAs.
Given a query q, we can compute Pr^L_slca(v, q) in a bottom-up manner, where Pr^L_slca(v, q) stands for the local probability of v being an SLCA in the probabilistic subtree rooted at v. For example, Figure 2(a) is a subtree of Figure 1 and can be used to compute the PrSLCA probabilities of its nodes. From the local probability, we can easily get the global probability by Pr^G_slca(v, q) = Pr_path(v) * Pr^L_slca(v, q), where Pr_path(v) indicates the existence probability of v in the possible worlds. It can be computed by multiplying the conditional probabilities along the path from the root to v.
Now, we define quasi-SLCA based on PrSLCA and the parent-child relationship. For an IND node v, we have:

Pr^L_quasi(v, q) = Pr^L_slca(v, q) + Σ_{v_c ∈ C(v)} Pr(v_c) * Pr^L_slca(v_c, q)    (2)

where each child node v_c of v in the sum is an SLCA node, but not a quasi-SLCA node, and Pr(v_c) is the conditional probability of v_c given v.
For a MUX node v, whose children are mutually exclusive and thus cannot co-occur, we have:

Pr^L_quasi(v, q) = Σ_{v_c ∈ C(v)} Pr(v_c) * Pr^L_slca(v_c, q)    (3)
Note that IND or MUX nodes are normally not allowed to be SLCA result nodes because they are only distributional nodes. As such, for an IND or MUX node v above, we may use its parent node (with v as a sole child) to represent the SLCA result node.
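The relaxation can be sketched as a bottom-up pass: a descendant's SLCA probability is promoted to its parent unless the descendant already qualifies on its own. The sketch below assumes each probability has already been weighted by its conditional path probability, and simply adds independent promoted mass as in Equation (2); the tree and numbers come from Example 3:

```python
def quasi_probs(parent, pr_slca, delta):
    """parent: child -> parent; pr_slca: node -> weighted PrSLCA probability.
    Returns node -> quasi-SLCA probability (children processed first)."""
    children = {}
    for c, p in parent.items():
        children.setdefault(p, []).append(c)
    quasi = {}

    def visit(n):
        q = pr_slca.get(n, 0.0)
        for c in children.get(n, []):
            visit(c)
            if quasi[c] < delta:    # child unqualified: promote its mass
                q += quasi[c]
        quasi[n] = q

    roots = [n for n in set(parent.values()) if n not in parent]
    for r in roots:
        visit(r)
    return quasi

# Example 3 numbers: the child has PrSLCA 0.30, the parent has 0.14.
tree = {"child": "parent"}
q40 = quasi_probs(tree, {"parent": 0.14, "child": 0.30}, 0.40)
q30 = quasi_probs(tree, {"parent": 0.14, "child": 0.30}, 0.30)
# With delta = 0.40 the child is promoted, so the parent reaches 0.44;
# with delta = 0.30 the child qualifies alone and the parent stays at 0.14.
```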
Example 3
Let's consider Example 2 again. First assume the specified threshold value is 0.40. Then the global probability of the parent node being a quasi-SLCA result can be calculated as 0.14 + 0.30 = 0.44, because its child is an SLCA node but not a quasi-SLCA node w.r.t. the given threshold, so the child's SLCA probability contributes to the parent node. If the threshold is decreased to 0.30, then the child will be taken as a qualified quasi-SLCA result and will not contribute to its parent. In this case, the parent cannot become a quasi-SLCA result because its own probability 0.14 < 0.30. If the threshold is further decreased to 0.14, both the parent and the child are qualified quasi-SLCA results.
Definition 2
Probabilistic Threshold Keyword Query (PrTKQ): Given a keyword query q and a threshold δ, the result of q over a probabilistic XML document is the set of quasi-SLCA nodes whose probabilities are equal to or larger than δ, i.e., Pr_quasi(v, q) ≥ δ for each result node v.
In this work, we are interested in how to efficiently compute the quasi-SLCA answer set for a PrTKQ over probabilistic XML data.
3 Overview of this Work
A naive method to answer a PrTKQ is to enumerate all possible worlds and apply the query to each possible world. Then, we can compute the overall probability of each quasi-SLCA result and return the results meeting the probability threshold. However, the naive method is inefficient due to the huge number of possible worlds implied by a probabilistic XML document. Another method is to extend the work in [23] to compute the probabilities of quasi-SLCA candidates. Although it is much more efficient than the naive method, it needs to scan the keyword node lists and calculate the keyword distributions for all relevant nodes. This motivates our development of efficient algorithms which not only avoid generating possible worlds, but also prune more unqualified nodes.
To accelerate query evaluation, in this paper we propose a prune-based probabilistic threshold keyword query algorithm, which determines the qualified results and filters the unqualified candidates using offline-computed probability information. To do this, we first calculate the probability of each possible query term within a node, which is stored in an offline-computed probabilistic index. Within a node, any two of its contained terms may appear in IND or MUX ways. To precisely differentiate IND and MUX, we use different parts to represent the probabilities of the terms appearing in a MUX way, while the terms within each part hold IND relationships. In other words, the different parts of terms in a node are mutually exclusive (MUX); e.g., the entry of one node in Figure 3 consists of three parts.
Given a keyword query and a threshold value, we first load the corresponding offline-computed probabilistic index w.r.t. the keyword query and then calculate on-the-fly the range of probabilities of a node being a result of the keyword query, using the precomputed probabilistic index in a bottom-up strategy. Here, the range of probabilities can be represented by two boundary values: a lower bound and an upper bound. By comparing the lower/upper bounds of candidates, the qualified results can be efficiently identified.
The following two examples briefly demonstrate how we calculate the lower/upper bounds based on a given keyword query and the offline-computed probabilistic index, and how we apply the online-computed lower/upper bounds to prune the unqualified candidates and determine the qualified ones.
Figure 3 shows the lower/upper bounds of each node in Figure 1, where the probability of each individual term is calculated offline while the lower/upper bounds are computed on-the-fly based on the given query keywords. Let's first introduce the related concepts briefly. The probability of a term in a node represents the total local probability of the term appearing in all possible worlds generated from the probabilistic subtree rooted at the node; e.g., the two query terms in one node have probabilities 0.65 and 0.916. The lower bound value represents the minimal total local probability of the given query keywords appearing together in all the possible worlds w.r.t. the probabilistic subtree, e.g., LB = 0.65 * 0.916 = 0.595. The upper bound value represents the maximal total local probability of the given query keywords appearing in all the possible worlds w.r.t. the probabilistic subtree, because the keywords may be independent or co-occur, e.g., UB = min{0.65, 0.916} = 0.65 no matter whether they are independent. By multiplying the path probability, the local probability can be transformed into the global probability. For the nodes containing MUX semantics, we group the probabilities of their terms into different parts, any two of which are mutually exclusive, as shown in Figure 3. The details of computing the lower/upper bounds for the IND and MUX semantics are given in the following section.
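For independent query terms, these bounds reduce to a product and a minimum; a minimal sketch using the example numbers (0.65 and 0.916):

```python
def bounds(term_probs):
    """Lower/upper bounds for a node whose query terms are independent:
    LB = product of term probabilities (all terms must co-occur),
    UB = smallest term probability (the query is never more likely
    than its rarest term)."""
    lb = 1.0
    for p in term_probs:
        lb *= p
    return lb, min(term_probs)

lb, ub = bounds([0.65, 0.916])   # lb = 0.5954 (~0.595), ub = 0.65
```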
Example 4
Consider a PrTKQ with δ = 0.40 again. Three of the nodes can be pruned directly without calculation because their upper bounds are all lower than 0.40. We need to check the remaining nodes. For the first of them, after computation, the probability of being a quasi-SLCA result is 0.44, which is larger than the specified threshold value 0.40, so it is taken as a result. After that, this result can be used to update the lower and upper bounds of its ancestor: (LB=0.595, UB=0.65) becomes (LB=0.155, UB=0.21). As a consequence, the ancestor is filtered because its upper bound is below the threshold. Similarly, another node can be computed and selected as a result because its probability is 0.56. Since these two nodes have become quasi-SLCA results, the bounds of their common ancestor can be updated from (LB=0.890, UB=0.950) to (LB=0.136, UB=0.196). As such, it can be pruned because its upper bound is lower than 0.40. From this example, we can see that many answers can be pruned or returned without the need to know their accurate probabilities, and the effectiveness of pruning increases greatly with the users' search confidence.
An acute reader may notice that we still have to compute the exact probability of some nodes being quasi-SLCAs, because whether such a node is a qualified result cannot be determined from its lower/upper bound values alone. To exactly calculate the probability of such a node, we have to access its child/descendant nodes, even those that were recognized as pruned nodes before we started to process it. If an internal node depends on a large number of pruned nodes, the effectiveness of pruning will be degraded to some extent. To address this challenging problem, we will introduce a Probability Density Function (PDF) that can be used to approximately compute the probability of a node, the result of which can be further used to update the lower and upper bounds of its ancestor nodes. The details are provided and discussed with the algorithms later.
4 Probabilistic Inverted Index
In this section, we describe our Probabilistic Inverted (PI) index structure for efficiently evaluating PrTKQs over probabilistic XML data. In keyword search on certain XML data, inverted indexes are popular structures, e.g., [16, 20]. The basic technique is to maintain a list of lists, where each element in the outer list corresponds to a domain element (i.e., a keyword). Each inner list stores the ids of the XML nodes in which the given keyword occurs and, for each node, the frequency or weight of the keyword. In this work, we introduce a probabilistic version of this structure, in which we store for each keyword a list of node ids. Along with each node id, we store the probability that the subtree rooted at the node contains the given keyword. The probability values in the inner lists can be used to compute lower and upper bounds on-the-fly during PrTKQ evaluation.
Figure 4 shows an example of a probabilistic inverted index for the data in Figure 1. At the base of the structure is a list of keywords storing pointers to lists, one for each term in the XML data. This is an inverted array storing, for each term, a pointer to a list of triples. In the list corresponding to a term, the triple (v, Pr(path), {...}) records the node v (the symbol v is used to represent a node's name or a node's id without confusion in the following sections), where v is the id of the node, Pr(path) is the conditional probability from the root to v, and the last component is a probability set that may contain a single probability value or multiple probability values. A single probability value represents that all the keyword instances in the subtree can be considered independent in probability, e.g., a confidence of {0.65}, while multiple probability values mean that the keyword instances belonging to different sets occur mutually exclusively, e.g., a confidence set {0.8, 0.86, 0.82} represents the different possibilities of the term occurring in the node.
4.1 Basic Operations of Building PI Index
To build the PI index, we need to traverse the given XML data tree once in a bottom-up manner. During the traversal, we apply the following operations, which may be used alone or in combination: a binary parent-child operation that promotes the probability of a child node to its parent node, and a binary sibling operation that promotes the probabilities of two sibling nodes to their parent node. The n-ary case can be processed by calling the corresponding binary cases one by one.
Assume a node u contains the keywords {k_1, k_2, ..., k_n} and its conditional probability is Pr(u); and a node v contains a set of keywords and its conditional probability is Pr(v).
Operator1 (sibling promotion): If u and v are independent sibling nodes, we can directly promote their probabilities to their parent p; for each keyword k they share, we have:

Pr(k, p) = 1 - (1 - Pr(u) * Pr(k, u)) * (1 - Pr(v) * Pr(k, v))    (4)

Operator2 (child promotion): If v is an independent child of p, we can directly promote the probability of v to p; for each keyword k, we have:

Pr(k, p) = 1 - (1 - Pr(k, p)) * (1 - Pr(v) * Pr(k, v))    (5)
Example 5
Let's show the procedure of computing the probabilities in Figure 2 using Operator1 and Operator2. Firstly, we compute and promote the probabilities of the keywords of two sibling leaves to their parent by Operator1, i.e., 1 - (1 - 0.5*1.0)*(1 - 0.3*1.0) = 0.65 for the shared keyword, and 0.3 for the other keyword. Then, we promote the second probability to the next ancestor using Operator2, i.e., 1 - (1 - 0.3)*(1 - 0.4) = 0.58, while the first probability does not change because the ancestor contains no further instance of that keyword. The conditional probability from the root to this node is 1.0. Therefore, the corresponding entries will be inserted into the PI index.
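The two promotions can be written down directly; the helper names are ours, and the calls reproduce the arithmetic of Example 5 (edge probabilities 1.0, keyword probabilities 0.5/0.3 and then 0.3/0.4):

```python
def op1(p_u, cp_u, p_v, cp_v):
    """Operator1: promote a keyword shared by two independent siblings
    u, v (keyword probabilities p_u, p_v; edge probabilities cp_u, cp_v)
    to their parent: the keyword is present unless absent in both branches."""
    return 1 - (1 - cp_u * p_u) * (1 - cp_v * p_v)

def op2(p_parent, p_child, cp_child):
    """Operator2: merge an independent child's keyword probability into
    the probability already accumulated at the parent."""
    return 1 - (1 - p_parent) * (1 - cp_child * p_child)

a = op1(0.5, 1.0, 0.3, 1.0)   # 1 - 0.5*0.7 = 0.65
b = op2(0.3, 0.4, 1.0)        # 1 - 0.7*0.6 = 0.58
```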
Operator3 (mutually-exclusive siblings): If u and v are two mutually-exclusive sibling nodes and p is their parent, then we generate two parts in p, one from u and one from v, respectively.

Operator4 (mutually-exclusive child): If v is a mutually-exclusive child node of p, then we can get the aggregated probability for each keyword k by adding the weighted probability of v to that of p, i.e., Pr(k, p) = Pr(k, p) + Pr(v) * Pr(k, v), since mutually-exclusive events cannot co-occur.
In the above four basic operators, we assume the terms appear independently within u and v. When the nodes u and v contain mutually-exclusive parts, we need to deal with each part using the four basic operators.

Consider two independent sibling nodes u and v where only v contains a set of mutually-exclusive parts. In this case, we can apply the sibling operation between u and each part of v. The computed results are maintained in different parts of their parent.
Example 6
Consider an independent node and a sibling node consisting of three mutually-exclusive parts in Figure 1. We first promote the three parts to the parent, which then consists of three parts, as shown in Figure 3. Because the two nodes are independent sibling nodes, the sibling operation can be called to compute the probability with regard to their parent. To do this, we apply the independent node to each part using Operator1. The results are shown in Figure 3. After that, we can insert the resulting entries into the PI index.
If both u and v contain a set of mutually-exclusive parts, then we can do pairwise aggregations across the two sets of parts. Building the PI index needs to scan the given probabilistic XML data only once. Assume that the probabilistic XML document has been encoded using probabilistic Dewey codes. The basic idea of building the PI index is to progressively process the document nodes sorted by Dewey codes in ascending order, i.e., the data can be loaded and processed in a streaming strategy. When a leaf node arrives, we compute the probability of each term in the leaf node. After that, the terms with their probabilities are written into the PI index. Next, we promote the terms and their probabilities to the parent of the leaf node based on the operation types in Section 4.1. After the node stream is scanned completely, the building algorithm of the PI index terminates. We do not provide the detailed building algorithm in this paper.
4.2 Pruning Techniques using PI Index
In this subsection, we first show how to prune the unqualified nodes using the proposed lower/upper bounds. We then explain how to compute the lower/upper bounds, and how to update them based on intermediate results during query evaluation.
By default, the node lists in the PI index are sorted in document order. Pr(k, v) represents the overall probability of a keyword k in a node v. It is obvious that the overall probability of a keyword appearing in a node is larger than or equal to that of the keyword appearing in any of its descendant nodes. The overall probability value for each keyword in a node can be computed and stored in the PI index offline.
Consider a node v and a PrTKQ q containing a set of keywords {k_1, ..., k_m}. If all the terms in v are independent, then we have:

LB(v, q) = Π_{k_i ∈ q} Pr(k_i, v)    (6)

UB(v, q) = min_{k_i ∈ q} Pr(k_i, v)    (7)
Often, v consists of a set of parts that are mutually exclusive. In this case, the lower bound of v is generated from the part that gives the highest lower bound value, while the upper bound of v is generated from the part that gives the highest upper bound value, which may or may not be the same part:

LB(v, q) = max_{p_j} Π_{k_i ∈ q} Pr(k_i, p_j)    (8)

UB(v, q) = max_{p_j} min_{k_i ∈ q} Pr(k_i, p_j)    (9)

where each candidate part p_j must contain all the keywords in q. If no such part exists, UB(v, q) and LB(v, q) will be set to zero.
Example 7
Let's consider the node with three parts in Figure 3 as an example. The first and second parts can generate lower and upper bounds: Part 1 gives LB = 0.32, UB = 0.4; Part 2 gives LB = 0.206, UB = 0.24. Because Part 1 produces a higher upper bound than Part 2, the lower and upper bounds of the node come from Part 1, which guarantees that the node can be a quasi-SLCA candidate with the highest possible probability. Since Part 3 does not contain all the keywords, it cannot generate lower or upper bounds.
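A sketch of the part-wise bounds of Equations (8) and (9); the per-part term probabilities and the keyword names below are hypothetical values chosen to reproduce the Part 1/Part 2 numbers of Example 7:

```python
from math import prod

def part_bounds(parts, query):
    """parts: list of {term: probability} dicts, one per MUX part.
    Only parts containing every query term can bound the whole query."""
    full = [p for p in parts if all(k in p for k in query)]
    if not full:
        return 0.0, 0.0          # some keyword missing from every part
    lb = max(prod(p[k] for k in query) for p in full)
    ub = max(min(p[k] for k in query) for p in full)
    return lb, ub

parts = [
    {"hazard": 0.8,  "building": 0.4},    # Part 1: LB 0.32,   UB 0.4
    {"hazard": 0.24, "building": 0.86},   # Part 2: LB ~0.206, UB 0.24
    {"hazard": 0.3},                      # Part 3: misses "building"
]
lb, ub = part_bounds(parts, {"hazard", "building"})   # (0.32, 0.4)
```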
Property 1
[Upper Bound Usage] A node v can be filtered if the overall probability of any keyword k_i ∈ q in v is lower than the given threshold value δ, i.e., ∃ k_i ∈ q, Pr(k_i, v) < δ.

Since UB(v, q) = min_{k_i ∈ q} Pr(k_i, v) is the upper bound of the probability of v becoming a qualified quasi-SLCA node, if a node v holds the inequation Pr(k_i, v) < δ for some k_i ∈ q, then Pr_quasi(v, q) ≤ UB(v, q) < δ. As such, v can be filtered.
Property 2
[Lower Bound Usage] A node v can be returned as a required result if we have LB(v, q) ≥ δ and UB(v_d, q) < δ, where v_d is any child or descendant node of v.

UB(v_d, q) < δ means that all the keyword nodes in the subtree rooted at v will contribute their probabilities to the node v. In other words, no descendant node of v can be a quasi-SLCA, so the lower bound probability will not be deducted. Therefore, if we have LB(v, q) ≥ δ, then Pr_quasi(v, q) ≥ LB(v, q) ≥ δ. As such, v can be returned as a quasi-SLCA result.
Example 8
Let's continue Example 7. One node can be directly returned as a qualified answer for the given threshold (δ = 0.4). This is because its lower bound reaches the threshold, while its descendant candidates are filtered due to their upper bounds being less than the threshold (δ = 0.4).
To update the lower/upper bound values during query evaluation, one way is to treat the different types of nodes differently, by which the updated lower/upper bounds may obtain better precision. The disadvantage of this approach is that it easily harms the efficiency of bound updates. This is because, given a current node having multiple quasi-SLCA nodes as its descendants, it is required to know the detailed relationships (IND or MUX) among those quasi-SLCA nodes. To avoid this disadvantage, we do not distinguish the different types of distributional nodes under which the multiple quasi-SLCA nodes appear. In other words, we unify them into a uniform formula based on the following two properties.
Property 3
No matter whether a node v is an IND, MUX or ordinary node, we can update its upper bound value as follows:

UB(v, q) = UB(v, q) - 1 + Π_{i=1..n} (1 - Pr_quasi(v_i, q))    (10)

where v_1, ..., v_n are the qualified quasi-SLCA descendant nodes of v that have already been returned as answers.
According to the definition of the upper bound, UB(v, q) represents the maximal probability of v being a quasi-SLCA node, which comes from the overall probability of a specific keyword. Therefore, the problem of updating the upper bound can alternatively be considered as determining what portion of the probability of that keyword has been consumed by the descendant nodes that became qualified quasi-SLCA nodes. If we know there are n qualified descendant nodes v_1, ..., v_n of v returned as answers, then we can compute their aggregated probability by Pr_agg = 1 - Π_{i=1..n} (1 - Pr_quasi(v_i, q)). Therefore, the upper bound can be updated as UB(v, q) - Pr_agg.
Does the above update equation hold for MUX nodes? To answer this question, we utilize the properties in [23], from which the aggregated probability of mutually-exclusive nodes can be computed by Σ_{i=1..n} Pr_quasi(v_i, q).

Since Π_{i=1..n} (1 - Pr_quasi(v_i, q)) can be expressed as 1 - Σ_{i=1..n} Pr_quasi(v_i, q) + ε where ε is a positive value, i.e., ε ≥ 0, we can derive that 1 - Π_{i=1..n} (1 - Pr_quasi(v_i, q)) ≤ Σ_{i=1..n} Pr_quasi(v_i, q).

Therefore, UB(v, q) = UB(v, q) - 1 + Π_{i=1..n} (1 - Pr_quasi(v_i, q)) holds as a safe upper bound update for IND, ordinary and MUX nodes.
Property 4
No matter whether a node v is an IND, MUX or ordinary node, we can update its lower bound value as follows:

LB(v, q) = LB(v, q) - 1 + Π_{i=1..n} (1 - Pr_quasi(v_i, q))    (11)

where v_1, ..., v_n are the qualified quasi-SLCA descendant nodes of v that have already been returned as answers.
For the lower bound update, we need to deduct the confirmed probability, 1 - Π_{i=1..n} (1 - Pr_quasi(v_i, q)) for IND nodes or Σ_{i=1..n} Pr_quasi(v_i, q) for MUX nodes, from the original lower bound LB(v, q). According to the above proof, we have Σ_{i=1..n} Pr_quasi(v_i, q) ≥ 1 - Π_{i=1..n} (1 - Pr_quasi(v_i, q)). Consequently, we have the inequation LB(v, q) - Σ_{i=1..n} Pr_quasi(v_i, q) ≤ LB(v, q) - 1 + Π_{i=1..n} (1 - Pr_quasi(v_i, q)). Therefore, it is safe to use the right side to update the lower bound values.
Example 9
Consider that the probability of a node being a quasi-SLCA has been computed as 0.44. Given the threshold (δ = 0.4), the node is returned as a quasi-SLCA result. Consequently, we can update the lower/upper bound values of its ancestor, i.e., UB = 0.65 - 1 + (1 - 0.44) = 0.21 and LB = 0.595 - 0.44 = 0.155. Since UB = 0.21 < δ, the ancestor can be filtered out effectively without computation.
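The uniform update of Properties 3 and 4 can be sketched as one helper; with the Example 9 numbers it reproduces the (0.155, 0.21) bounds, and with the two results of Example 4 it reproduces (0.136, 0.196):

```python
def update_bounds(lb, ub, confirmed):
    """Deduct the aggregated probability of already-returned quasi-SLCA
    descendants (confirmed = list of their probabilities) from both bounds:
    agg = 1 - prod(1 - p_i), so each bound becomes bound - 1 + prod(1 - p_i)."""
    rem = 1.0
    for p in confirmed:
        rem *= (1 - p)
    agg = 1 - rem
    return lb - agg, ub - agg

lb1, ub1 = update_bounds(0.595, 0.650, [0.44])        # (0.155, 0.21)
lb2, ub2 = update_bounds(0.890, 0.950, [0.44, 0.56])  # (~0.136, ~0.196)
```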
Property 3 is used to filter unqualified nodes by reducing the upper bound value, while Property 4 is used to quickly find the qualified results by comparing the reduced lower bound value (the probability of the remaining quasi-SLCAs) with the threshold value.
Sometimes, we need to calculate the probability distributions of keywords in a node if the given threshold falls in the range [LB, UB]. The basic computational procedure is similar to the PrStack algorithm in [23]. Different from the PrStack algorithm, we introduce a probability density function (PDF) to approximately calculate the probability of a node if the node depends on a large number of pruned descendant nodes. To decide when to invoke the PDF while avoiding a significant loss of precision, we select and compute the descendant nodes that may contribute large probabilities to the node. For the remaining descendant nodes, we may choose to invoke the PDF, by which we can reduce the time cost while still guaranteeing precision to some extent. The detailed procedure is introduced in the next section.
5 PruneBased Probabilistic Threshold Keyword Query Algorithm
A key challenge of answering a PrTKQ is to identify the qualified result candidates and filter the unqualified ones as soon as possible. In this work, we address this challenge with the help of our proposed probabilistic inverted (PI) index. Two efficient algorithms are proposed: a comparable Baseline Algorithm and a PI-based Algorithm.
5.1 Baseline Algorithm
In keyword search on certain XML data, it is popular to use a keyword inverted index to retrieve the relevant keyword nodes, from which the keyword search results are generated by different algorithms, e.g., [20, 16, 24, 25]. For probabilistic XML data, [23] proposed the PrStack Algorithm to compute top-k SLCA nodes. In this section, we propose an effective Baseline Algorithm that is similar in spirit to the PrStack Algorithm. To answer a PrTKQ, we need to scan all the keyword inverted lists once. Firstly, the keyword-matched nodes are read one by one in document order. After a node is processed, we check whether its probability reaches the given threshold value δ. If so, the node is output as a quasi-SLCA node and its remaining keyword distributions (i.e., those containing partial query keywords) are continuously promoted to its parent node. Otherwise, we promote its complete keyword distributions (i.e., those containing either all or partial query keywords) to its parent node. After that, the node at the top of the stack is popped. The above procedure is repeated until all nodes are processed, at which point the algorithm terminates. The detailed procedure is shown in Algorithm 1.
Because Baseline Algorithm only needs to scan the keyword node lists once, it is fast and simple. However, its core computation, the keyword distribution computation, consumes a lot of time. This motivates us to propose the PI-based Algorithm, which can quickly identify the qualified and unqualified candidates using the offline-computed PI index and only computes keyword distributions for a few candidates. Here, Baseline Algorithm serves as a basis of comparison to show the pruning performance of the PI-based Algorithm described below.
5.2 PI-based Algorithm
To efficiently answer a PrTKQ, the basic idea of the PI-based Algorithm is to read the nodes from the keyword node lists one by one in a bottom-up fashion. For each node, we quickly compute its lower and upper bounds by accessing the PI index, which is far faster than computing the node's keyword distributions directly. After comparing the lower/upper bounds with the given threshold value, we can decide whether the node should be output as a qualified answer, skipped as an unqualified result, or cached as a potential result candidate. For example, if the current node's lower bound is larger than or equal to the threshold value, the node can be output directly without further computation; this is safe because all its descendants have already been checked under the bottom-up strategy. If its upper bound is lower than the threshold value, the node can be filtered out. Otherwise, it is temporarily cached for further checking. Based on these cases, different operations are applied, and only the nodes identified as potential result candidates need to be computed. Compared with Baseline Algorithm, which has to compute the keyword distributions of all nodes, the PI-based Algorithm is accelerated significantly. The detailed procedure is shown in Algorithm 2.
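The three-way decision above can be captured in a few lines. The helper below is hypothetical (not from the paper's Algorithm 2); lb and ub stand for the PI-index lower and upper bounds of the current node.

```java
// Per-node pruning decision of the PI-based Algorithm: output, filter, or
// cache for later exact/approximate computation.
enum Decision { OUTPUT, FILTER, CACHE }

class Pruner {
    static Decision decide(double lb, double ub, double threshold) {
        if (lb >= threshold) return Decision.OUTPUT;  // qualified without further computation
        if (ub < threshold)  return Decision.FILTER;  // can never reach the threshold
        return Decision.CACHE;                        // uncertain: compute its distribution later
    }
}
```

Only nodes falling into the CACHE case incur the expensive keyword distribution computation.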
5.2.1 Detailed Procedure of PI-based Algorithm
In Algorithm 2, Lines 1-4 show the initialization of the PI-based Algorithm. We first load the keyword node lists from the KI index and the probability node lists from the PI index. We then take the smallest node from the keyword node lists to initialize a stack using its Dewey code. Another stack is used to maintain the temporarily filtered nodes. After that, the PI-based Algorithm is ready to start.
Next, we check each node from the keyword node lists in document order. Different from Baseline Algorithm, we compute the keyword distribution probabilities only for the few nodes that cannot be resolved using the lower and upper bounds in the PI index. Let v be the next smallest node to be processed. We compare it with the node u at the top of the stack. If v is a descendant of u, then v is pushed onto the stack and we fetch the next smallest node from the lists. Otherwise, we pop u from the stack and check whether it is a qualified quasi-SLCA answer. At this point Baseline Algorithm would compute the keyword distributions of u and combine its remaining distributions with the distribution of its parent via promotion operations. In contrast, the PI-based Algorithm quickly computes the upper bound UB(u) and lower bound LB(u) using the PI index, which are used to classify nodes as qualified (output), unqualified (filter), or uncertain (to be further checked). By doing this, only a few nodes need to be computed. Since bound computation is far faster than keyword distribution computation, a large amount of run time is saved in the PI-based Algorithm. Lines 10-21 show the detailed procedure. If the lower bound LB(u) is larger than or equal to the given threshold value, then u can be output as a qualified quasi-SLCA answer without computation. At this moment, LB(u) is taken as the temporary probability of u being a quasi-SLCA result, because the calculation of its exact probability is delayed until that value is actually needed. Subsequently, the temporary probability LB(u) and the probabilities of descendant quasi-SLCA results are used to update the lower/upper bounds of the ancestors of u in the stack, based on Equation 11 and Equation 10, respectively. If the lower bound LB(u) is lower than the threshold while the upper bound UB(u) is larger than or equal to it, then we need to compute the keyword distributions of u using the descendant nodes cached in the second stack.
Based on the computed probability of u, it can be decided whether u is output as a qualified answer or filtered as an unqualified candidate. If the upper bound UB(u) is lower than the threshold, then u is pushed onto the second stack for the possible computation of its ancestors.
There are two main functions in the PI-based Algorithm. The first is ComputeProbDist(), which computes the probability of the full keyword distribution of a node using its descendant nodes cached in the stack. The second is UpdateBound(), which updates the bounds of the nodes still to be processed using LB(u) or the computed probability of u.
5.2.2 Function ComputeProbDist()
The function ComputeProbDist() can be implemented in two ways: Exact Computation or Approximate Computation.
Exact Computation calculates the exact probability of a node v being a quasi-SLCA node by scanning all the nodes in the stack that maintains the descendant nodes of v. The processing strategy is similar to that of Baseline Algorithm in Section 5.1. In other words, it visits the nodes in the stack one by one, computes the local keyword distribution of each node, and then promotes the intermediate results to its parent. After all nodes in the stack are processed, the probability of v is obtained, since it aggregates all the probabilities from the descendant nodes.
Approximate Computation approximately calculates the probability of v being a quasi-SLCA node based on a partial set of the nodes in the stack that maintains the descendant nodes of v. The approximation can be made for different distribution types, e.g., uniform distributions, piecewise polynomials, Poisson distributions, etc. In this work, we consider the normal (Gaussian) distribution in more detail.
The Gaussian distribution is considered the most prominent probability distribution in statistics. However, its PDF cannot be applied directly to PrTKQ over probabilistic XML data, due to two main challenges. The first challenge is to simulate the continuous distribution using discrete distributions based on the real conditions, so as to reduce the approximation error as much as possible; the second is to embody the multiple keyword variables in the PDF.
Generally, the probability density function of a Gaussian distribution with mean \mu and variance \sigma^2 is:

f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) \qquad (12)
Addressing Challenge 1: The density function has a bell shape centered at the mean \mu with variance \sigma^2. By its definition, the Gaussian distribution is often used to describe, at least approximately, measurements that tend to cluster around the mean. Therefore, we take the mean \mu to be the partially computed probability of v being a quasi-SLCA node, which guarantees that the real probability value will not differ significantly from the probability base already calculated from the promising descendant nodes. The value of \sigma is chosen from the range [1 - \#computed descendant nodes / \#total descendant nodes, 1], depending on the visited/unvisited descendant nodes in the stack. This is because the more descendant nodes are actually computed, the higher the percentage of values drawn within one standard deviation of the mean. In the extreme, if all descendant nodes are computed, 100% of the values fall within one standard deviation. Therefore, we select and compute a few descendant nodes of v that can contribute relatively high probabilities to making v a quasi-SLCA node. In this work, we use a heuristic method that selects the descendant nodes of v with the higher single-keyword probabilities. We then take the partially computed probability as the mean of the probability density function of the Gaussian distribution.
Consider a node v to be evaluated, and let UB(v) be its current upper bound value (note that UB(v) has already been updated if v has descendant nodes that are qualified answers, i.e., the probability contributions of those answers have been subtracted). We have,
\mu = p_{partial}(v) \qquad (13)

\sigma = 1 - \frac{\#\text{computed descendant nodes}}{\#\text{total descendant nodes}} \qquad (14)

where p_{partial}(v) is the partially computed probability of v, and \sigma is set to 1 minus the fraction of descendant nodes actually computed.
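The approximation can be sketched as follows, assuming Eqs. (12)-(14) above: the mean is the partially computed probability, the standard deviation shrinks as more descendants are computed, and the probability mass over an interval is obtained by simple numerical integration. The paper evaluates the final estimate (Eq. (18)) in Matlab; the trapezoid rule here is our stand-in, and jointMass additionally assumes independent, identically distributed keyword variables over a box-shaped region, which is our assumption about the form of Eq. (18).

```java
class GaussianApprox {
    // Eq. (12): Gaussian PDF with mean mu and standard deviation sigma.
    static double pdf(double x, double mu, double sigma) {
        double z = (x - mu) / sigma;
        return Math.exp(-0.5 * z * z) / (sigma * Math.sqrt(2 * Math.PI));
    }

    // Eq. (14): sigma shrinks as more descendant nodes are actually computed.
    static double sigma(int computedDescendants, int totalDescendants) {
        return 1.0 - (double) computedDescendants / totalDescendants;
    }

    // Probability mass of N(mu, sigma^2) over [lo, hi], via the trapezoid rule.
    static double mass(double mu, double sigma, double lo, double hi, int steps) {
        double h = (hi - lo) / steps;
        double sum = 0.5 * (pdf(lo, mu, sigma) + pdf(hi, mu, sigma));
        for (int i = 1; i < steps; i++) sum += pdf(lo + i * h, mu, sigma);
        return sum * h;
    }

    // Under independence with identical mu and sigma (cf. Eq. (17)), the joint
    // mass over a box [lo, hi]^m factorizes into a product of univariate masses.
    static double jointMass(double mu, double sigma, double lo, double hi, int m, int steps) {
        return Math.pow(mass(mu, sigma, lo, hi, steps), m);
    }
}
```

For instance, the mass of a Gaussian within one standard deviation of its mean is about 0.683, regardless of mu and sigma.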
Addressing Challenge 2: To embody all the keyword variables in the PDF, we introduce the joint/conditional Gaussian distribution based on the work in [26]. Assume a PrTKQ q contains two keywords k_1 and k_2, with corresponding variables x_1 and x_2. We have the conditional PDF as follows.
f(x_1 \mid x_2) = \frac{f(x_1, x_2)}{f(x_2)} \qquad (15)

f(x_1, x_2) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}} \exp\left(-\frac{1}{2(1-\rho^2)}\left[\frac{(x_1-\mu_1)^2}{\sigma_1^2} - \frac{2\rho(x_1-\mu_1)(x_2-\mu_2)}{\sigma_1\sigma_2} + \frac{(x_2-\mu_2)^2}{\sigma_2^2}\right]\right) \qquad (16)
If we assume that k_1 and k_2 are independent keyword variables, i.e., \rho = 0, and additionally assume \mu_1 = \mu_2 = \mu and \sigma_1 = \sigma_2 = \sigma, then we have

f(x_1, x_2) = f(x_1)\,f(x_2) = \frac{1}{2\pi\sigma^2} \exp\left(-\frac{(x_1-\mu)^2 + (x_2-\mu)^2}{2\sigma^2}\right) \qquad (17)
Therefore, Equation 17 can easily be extended to multiple keyword variables that are assumed independent. We can then compute the probability of a node v w.r.t. a PrTKQ q with m independent keyword variables x_1, \ldots, x_m:

\Pr(v, q) \approx \int \cdots \int \prod_{i=1}^{m} f(x_i)\, dx_1 \cdots dx_m \qquad (18)
where \mu in each f(x_i) is the partially computed probability and \sigma is set to 1 minus the fraction of descendant nodes actually computed, as in Equations 13 and 14.
In the experiments, we call Matlab from Java to calculate Equation 18. The estimated results are used to compare the exact computation with the approximate computation; the results verify the usability of the Gaussian distribution for estimating the probability.
5.2.3 Function UpdateBound()
For each ancestor node of v, we need to update its upper and lower bounds using Function UpdateBound(), based on Equation 10 and Equation 11, respectively. To guarantee the completeness of the answer set, the parameters of the function may differ depending on the case. For example, if LB(v) is larger than or equal to the threshold value, as shown in Algorithm 2: Line 12, then the probability value LB(v) is used to update the upper bounds of the ancestors while the probability value UB(v) is used to update their lower bounds; if LB(v) is smaller than the threshold and UB(v) is larger than or equal to it, as shown in Algorithm 2: Line 18, the exact or approximate probability value computed by Function ComputeProbDist() is used to update both the upper and lower bounds of the ancestors.
Here, we use two hashmaps to implement Function UpdateBound(). For each node, one hashmap caches the Dewey code of the node as the key and its lower/upper bounds, computed from the PI index, as the value. The other hashmap records the probability of the node's descendants that have already been identified as qualified quasi-SLCA answers. When a node arrives, we can quickly obtain its updated lower/upper bounds from the two hashmaps.
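The two-hashmap bookkeeping can be sketched as below. This is a deliberately simplified stand-in for the exact update rules of Equations 10 and 11 (which are not reproduced here): we assume an ancestor's bounds are reduced by the probability already claimed by its qualified quasi-SLCA descendants.

```java
import java.util.*;

// Two-hashmap bookkeeping behind UpdateBound(): one map caches each node's
// PI-index lower/upper bounds keyed by Dewey code; the other accumulates the
// probability already claimed by qualified quasi-SLCA descendants.
class BoundBook {
    Map<String, double[]> bounds = new HashMap<>();   // dewey -> {lb, ub} from the PI index
    Map<String, Double> claimed = new HashMap<>();    // dewey -> prob consumed by qualified descendants

    void put(String dewey, double lb, double ub) {
        bounds.put(dewey, new double[]{lb, ub});
    }

    // Record that a quasi-SLCA descendant under `ancestor` was output with probability p.
    void claim(String ancestor, double p) {
        claimed.merge(ancestor, p, Double::sum);
    }

    // Current (updated) bounds: both bounds shrink by the probability already
    // consumed below this node (simplifying assumption, not Eqs. 10-11 verbatim).
    double[] current(String dewey) {
        double[] b = bounds.get(dewey);
        double c = claimed.getOrDefault(dewey, 0.0);
        return new double[]{Math.max(0, b[0] - c), Math.max(0, b[1] - c)};
    }
}
```

Both lookups are O(1), which is why an arriving node's updated bounds can be fetched quickly.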
6 Experimental Studies
We conduct extensive experiments to test the performance of our algorithms: Baseline Algorithm (BA); PI-based Exact-computation Algorithm (PIEA), which implements Function ComputeProbDist() by exactly computing the probability distributions of the keyword-matched nodes; and PI-based Approximate-computation Algorithm (PIAA), which approximates the computation based on the Gaussian distribution of keywords while still exactly computing the probability distributions of the keyword-matched nodes with the higher probabilities. All the algorithms were implemented in Java and run on a 3.0GHz Intel Pentium 4 machine with 2GB RAM running Windows 7.
6.1 Dataset and Queries
We use two real datasets, DBLP [27] and Mondial [28], and a synthetic XML benchmark dataset, XMark [29], for testing the proposed algorithms. For XMark, we generate four datasets of different sizes. The three types of datasets are selected for their features: DBLP is a relatively shallow dataset of large size; Mondial is a deep and complex but small dataset; XMark is a balanced dataset with varied depth, complex structure, and varied size.
For each XML dataset used, we generate the corresponding probabilistic XML tree using the same method as in [12]. We visit the nodes of the original XML tree in preorder. We first set the random ratio of IND:MUX:ordinary nodes to 3:3:4. For each visited node v, we randomly generate some distributional nodes of type "IND" or "MUX" as children of v. Then, among the original children of v, we choose some as children of the newly generated distributional nodes and assign random probability distributions to these children, with the restriction that the probabilities under a MUX node sum to no more than 1. The generated datasets are described in Table I. We select terms and construct a set of keyword queries to be tested for each dataset. Due to the limited space, we show only six of these queries for each dataset. In each query set, the terms of the first two queries match a small number of keyword nodes; the terms of the middle two queries match a medium number of keyword nodes; and the terms of the last two queries require computation over a larger number of keyword-matched nodes.
ID    name   size  #IND  #MUX  #Ordinary
Doc1  XMark  10M   26k   26k   145k
Doc2  XMark  20M   54k   52k   200k
Doc3  XMark  40M   98k   100k  606k
Doc4  XMark  80M
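The probability-assignment step of the dataset generation can be sketched as follows. childProbs is a hypothetical helper, not from [12]: it draws random probabilities for the children of one distributional node and rescales them when the MUX constraint (sum at most 1) would be violated.

```java
import java.util.*;

// Assign random probability distributions to the children of a newly
// generated distributional node: for a MUX node the children's probabilities
// must sum to at most 1; for an IND node each child's probability is free.
class DistGen {
    static double[] childProbs(boolean mux, int n, Random rnd) {
        double[] p = new double[n];
        double sum = 0;
        for (int i = 0; i < n; i++) { p[i] = rnd.nextDouble(); sum += p[i]; }
        if (mux && sum > 1) {
            // Rescale so the MUX probabilities sum to a random total below 1.
            double scale = rnd.nextDouble() / sum;
            for (int i = 0; i < n; i++) p[i] *= scale;
        }
        return p;
    }
}
```

By construction, the MUX constraint holds for any random seed, while IND children keep their independently drawn probabilities.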