Adding Logical Operators to Tree Pattern Queries on Graph-Structured Data

Qiang Zeng        Xiaorui Jiang        Hai Zhuge
Key Lab of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences
Graduate University of Chinese Academy of Sciences
{zengqiang, xiaoruijiang}@kg.ict.ac.cn       zhuge@ict.ac.cn
Abstract

As data are increasingly modeled as graphs to express complex relationships, the tree pattern query on graph-structured data has become an important type of query in real-world applications. Most practical query languages, such as XQuery and SPARQL, support logical expressions using logical-AND/OR/NOT operators to define structural constraints of tree patterns. In this paper, (1) we propose generalized tree pattern queries (GTPQs) over graph-structured data, which fully support propositional logic of structural constraints. (2) We make a thorough study of fundamental problems including satisfiability, containment and minimization, and analyze the computational complexity and the decision procedures of these problems. (3) We propose a compact graph representation of intermediate results and a pruning approach to reduce the size of intermediate results and the number of join operations, two factors that often impair the efficiency of traditional algorithms for evaluating tree pattern queries. (4) We present an efficient algorithm for evaluating GTPQs using 3-hop as the underlying reachability index. (5) Experiments on both real-life and synthetic data sets demonstrate the effectiveness and efficiency of our algorithm, which is from several times to orders of magnitude faster than state-of-the-art algorithms in terms of evaluation time, even for traditional tree pattern queries with only conjunctive operations.

Graphs are among the most ubiquitous data models in many areas, such as social networks, the Semantic Web and biological networks. As the most common tool for data transmission, XML documents are better modeled as graphs than as trees, since ID/IDREF references allow them to represent flexible data structures. Semantic Web data are also modeled as graphs, e.g., in RDF/RDFS. On graph data, tree pattern queries (TPQs) are an important class of queries of practical interest. In query languages such as XQuery and SPARQL, many queries can be regarded as TPQs over graphs. As most of these languages support logical operations including conjunction ($\wedge$), disjunction ($\vee$) and negation ($\neg$) in query conditions, it is necessary to study TPQs over graphs with multiple logical predicates, as illustrated in the following example.

Example 1.

A DBLP XML document separately stores inproceeding records for papers and proceeding records for volumes, linked by crossref elements indicating where a paper is published [?]. The underlying data structure is clearly a graph. Consider the following three queries which ask for information of publications for which a certain tree pattern of data holds.

  1. Retrieve the information about Alice’s conference papers that are published from 2000 to 2010 and co-authored with Bob.

  2. Retrieve the information about the conference papers of either Alice or Bob published from 2000 to 2010.

  3. Retrieve the information about Alice’s conference papers that are not co-authored with Bob and published from 2000 to 2010.

Figure 1: The tree representation of the three queries in Example 1. Document elements matching the starred query nodes are required to be returned, and the single-/double-lined edges denote the parent-child/ancestor-descendant relationships between elements.

They can all be expressed in XQuery and are essentially TPQs on graph-structured data (see the Appendix), but the second and third queries cannot be expressed as traditional TPQs, which contain only conjunctive predicates. Indeed, the three queries share the same tree representation, depicted in Fig 1, but different structural predicates must be imposed on the inproceedings element. For example, in the first query, each embedding of the pattern must satisfy all paths specified in the query; in the second query, the two path conditions, one for each author, are not required to be satisfied simultaneously. A predicate that specifies those edge constraints and incorporates disjunction and negation needs to be attached to each query node in order to express the second and third queries. In general, (1) it is common in practice that logical expressions must be imposed on query nodes to specify complex conditions not only in attribute predicates (e.g., the range condition on the publication year) but also in structural constraints (e.g., the disjunction in the second query and the negation in the third); (2) some of the nodes in the query pattern serve only as filters for pruning unwanted results, which means that the results of a TPQ should consist of matches for only a portion of the query nodes. ∎

Although TPQs have been widely studied for many years, few of the proposed processing algorithms can be used to efficiently evaluate such queries over general graphs. They can neither support disjunction and negation on structural constraints nor be optimized for the situation where output nodes take only a portion of query nodes (see Related work for details).

Contributions & Roadmap.

This work makes the first effort to deal with TPQs over general graph-structured data with Boolean logic support. The contributions are summarized as follows.
(1) We introduce a new class of tree pattern queries over graph-structured data, called generalized tree pattern queries (GTPQs) (Section 2). In a GTPQ, a node is associated not only with an attribute predicate, which specifies property conditions, but also with a structural predicate, a propositional formula using conjunction, disjunction and negation that specifies structural conditions with respect to the node’s descendants. The query allows a portion of the query nodes to be output nodes. We also show that our query formalization has advantages over those in the literature on queries against tree-structured data.
(2) We investigate fundamental problems for GTPQs, including satisfiability, containment, equivalence and minimization. We show that the satisfiability of a special GTPQ with only conjunction and disjunction is decidable in linear time, but that satisfiability and the other three problems become computationally intractable when negation is incorporated. We propose an exact algorithm to minimize GTPQs, which is expected to be sufficiently efficient since query sizes are typically small in practice.
(3) We propose a graph representation of intermediate results and a pruning approach to address notable problems in evaluating query patterns over graphs, and we develop an algorithm for GTPQs with ancestor-descendant edges together with its extension to parent-child edges. The algorithm filters out most nodes that cannot contribute to the final results, avoids generating redundant intermediate results, and represents the matches compactly.
(4) We implement our algorithm and conduct an experimental study using synthetic and real-life data. We find that our evaluation algorithm performs significantly better than state-of-the-art algorithms even for conjunctive TPQs. It also scales better and is robust across different queries on different graphs. The experiments also demonstrate the effectiveness of the graph representation of results and the efficiency of the pruning method.

Related work.

There is a large body of research on TPQs over tree-structured data (see [?] for a survey). However, these studies rely heavily on the relatively simple structure of trees and employ node encoding schemes (including the interval [?], Dewey [?] and sequence [?] encodings) that cannot determine structural relationships in graphs. Techniques critical to their efficiency, such as stack encoding and node skipping, can only be applied to tree-structured data. Some sparse graph data, such as many XML documents with ID/IDREFs, can be modeled as disjoint trees connected by edges. For such graphs, one can apply the existing algorithms for tree-structured data by first decomposing a query into several TPQs over different trees and then merging the results of the individual queries to form the final results. This approach is inefficient, however, because of large redundant intermediate results and costly merging.

Some studies extended traditional TPQs by incorporating additional functions and restrictions. Chen et al. [?] added optional nodes to patterns and investigated efficient evaluation plans on native XML database systems. Their generalized tree pattern still targets tree-structured data, whereas this work studies TPQs over graph-structured data with logical predicates. Jiang et al. [?] proposed new holistic algorithms based on the concept of OR-blocks to process AND/OR-twigs, i.e., TPQs with OR-predicates. At the end of Section 2, we show that (1) our query size is never larger than the number of element nodes of an AND/OR-twig expressing a semantically identical query; (2) constructing OR-blocks involves converting a propositional formula to conjunctive normal form, which takes exponential time in the worst case; (3) the proposed algorithms only support tree-structured data as input. [?] studied path queries with negation, while [?] and [?] added negation to TPQs. They cannot be applied to GTPQs either, since they are based on the classical holistic twig join algorithm [?], which only works on tree-structured data.

There has been work on pattern queries for graph-structured data. TwigStackD [?] generalized the holistic algorithms, but it takes considerable time and space without a pre-filtering process [?]. HGJoin [?] can evaluate general graph pattern queries using OPT-tree-cover [?] as the underlying reachability index. It decomposes a pattern into a set of complete bipartite graphs and generates matches for them in an order determined by a plan. The time cost of plan generation is always exponential, since it has to produce a state graph with an exponential number of nodes regardless of whether an optimal or a suboptimal plan is sought. Cheng et al. [?] proposed R-join/R-semijoin processing for the graph pattern matching problem. It relies on a cluster-based R-join index whose size is typically prohibitively large, as the index stores matches for every pair of labels derived from 2-hop indexing [?]. Unlike the plan generation of HGJoin, it adopts left-join to reduce the cost, but its worst-case time complexity is still exponential. Since both HGJoin and R-join/R-semijoin use structural joins similar to the earlier work on tree-structured data, they typically produce large intermediate results and need to perform many expensive join operations. Moreover, none of these three algorithms directly supports queries with negative or disjunctive predicates. A straightforward way to apply them to GTPQ processing is to decompose the query into multiple conjunctive TPQs and perform difference and merge operations on the results of the decomposed queries. However, the number of resulting conjunctive TPQs may be exponential, and large intermediate results may need to be generated and merged.

A number of studies investigated various graph pattern matching problems [?, ?, ?]. [?] proposed a graph query language, GraphQL, and studied graph-specific optimization techniques for graph pattern matching that combines subgraph isomorphism and predicate evaluation. While the language is able to express queries with ancestor-descendant edges and disjunctive predicates, the work focused on processing “non-recursive” and conjunctive graph pattern queries, where all edges of a query pattern correspond to the parent-child edges of GTPQs, specifying adjacency between the desired matching nodes. [?] defined matching in terms of bounded simulation to reduce its computational complexity. [?] studied distance pattern matching, in which query edges are mapped to paths of bounded length. The queries of [?] and [?] do not support negative or disjunctive predicates on edges and have quite different semantics from ours.

Most existing algorithms find all instances of a pattern, each containing matches of all query nodes. In real-world applications, however, the answer to a query often requires matches of only some of the query nodes. Indeed, many query nodes serve only as filters that impose structural constraints on output nodes. Our framework avoids generating such redundant matches at run time.

Satisfiability, containment, equivalence and minimization are fundamental problems for any query language. The minimization of TPQs over tree-structured data has been investigated in several papers. Amer-Yahia et al. [?] proposed algorithms for minimization with and without integrity constraints. Ramanan [?] studied this problem for TPQs defined by graph simulation. Chen et al. [?] used a richer class of integrity constraints for the minimization of TPQs with a unique output node. However, we are not aware of previous work on minimization, or on the other three problems, for TPQs with logical predicates over either tree-structured or graph-structured data.

Data graphs.

A data graph is a directed graph $G = (V, E, f)$, where (1) $V$ is a finite set of nodes; (2) $E \subseteq V \times V$ is a finite set of edges, in which each pair $(u, v)$ denotes an edge from $u$ to $v$; (3) $f$ is a function on $V$ defining attribute values associated with nodes. For each node $v \in V$, $f(v)$ is a tuple $(A_1 = a_1, \ldots, A_n = a_n)$, where the expression $A_i = a_i$ represents that $v$ has an attribute denoted by $A_i$ whose value is a constant $a_i$. For example, in a data graph of a DBLP document, the node properties in $f(v)$ may include tags, string values, typed values, and attributes specified in the elements.

Abusing the notions for trees and traditional tree pattern queries, we refer to a node $v$ as a child of a node $u$ (or $u$ as a parent of $v$) and say they have a parent-child (PC) relationship if there is an edge $(u, v)$ in $E$, and we refer to $v$ as a descendant of $u$ (or $u$ as an ancestor of $v$) and say they have an ancestor-descendant (AD) relationship if there is a nonempty path from $u$ to $v$ in $G$.
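
To make these definitions concrete, the following is a minimal Python sketch of a data graph with attribute tuples, together with tests for the PC and AD relationships. The class name `DataGraph`, its methods and the toy DBLP-like nodes are illustrative assumptions and do not come from the paper; the descendant test uses a plain traversal rather than a reachability index.

```python
from collections import defaultdict

class DataGraph:
    """Directed graph whose nodes carry attribute dictionaries (hypothetical helper)."""

    def __init__(self):
        self.attrs = {}                  # node id -> {attribute name: value}
        self.succ = defaultdict(set)     # node id -> set of children

    def add_node(self, v, **attrs):
        self.attrs[v] = dict(attrs)

    def add_edge(self, u, v):
        self.succ[u].add(v)

    def is_child(self, u, v):
        """True if the edge (u, v) exists, i.e. u and v are in a PC relationship."""
        return v in self.succ[u]

    def is_descendant(self, u, v):
        """True if there is a nonempty path from u to v (AD relationship)."""
        stack, seen = list(self.succ[u]), set()
        while stack:
            w = stack.pop()
            if w == v:
                return True
            if w not in seen:
                seen.add(w)
                stack.extend(self.succ[w])
        return False

# Toy fragment: a paper linked to its proceedings volume through a crossref.
g = DataGraph()
g.add_node("p1", tag="inproceedings")
g.add_node("a1", tag="author", value="Alice")
g.add_node("c1", tag="crossref")
g.add_node("v1", tag="proceedings", year=2005)
g.add_edge("p1", "a1")
g.add_edge("p1", "c1")
g.add_edge("c1", "v1")   # the crossref link is what turns the document into a graph

print(g.is_child("p1", "c1"), g.is_descendant("p1", "v1"))
```

A reachability index would replace the traversal in `is_descendant` when the data graph is large; the traversal is used here only to keep the sketch self-contained.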

Generalized tree pattern queries.

A generalized tree pattern query (GTPQ) is a tuple $Q = (V_b, V_p, V_o, E_q, f_a, f_e, f_s)$, where:
(1) $V_b$ and $V_p$ are both finite sets of nodes, called backbone nodes and predicate nodes, respectively. The complete set of query nodes is denoted as $V_q$, i.e., $V_q = V_b \cup V_p$.
(2) $V_o \subseteq V_b$. The nodes in $V_o$ are called output nodes.
(3) $E_q \subseteq V_q \times V_q$ is a finite set of edges. Here, $(V_q, E_q)$ is restricted to be a directed tree.
(4) $f_a$ is a function defined on $V_q$ such that for each node $u \in V_q$, $f_a(u)$ is an attribute predicate that is a conjunction of atomic formulas of the form “$A$ op $a$”, in which $A$ is an attribute name, $a$ is a constant and op is a comparison operator in $\{<, \leq, =, \neq, >, \geq\}$.
(5) $f_e$ is a function on $E_q$ that specifies the type of each edge. Each edge represents either a PC relationship or an AD relationship.
(6) $f_s$ is a function defined on internal nodes. For each internal node $u$ with children $u_1, \ldots, u_k$ being predicate nodes, $f_s(u)$, called a structural predicate, is a propositional formula in variables $p_{u_1}, \ldots, p_{u_k}$, each corresponding to a tree edge directed to a predicate child of $u$. In particular, if $u$ has no predicate children, $f_s(u) = 1$. Each node $u$ is associated with a distinct propositional variable denoted by $p_u$.

We call a GTPQ a union-conjunctive GTPQ if the structural predicates on all query nodes are negation-free, and call it a conjunctive GTPQ if the structural predicates on all the query nodes only have conjunction connectives.

Before giving the semantics of GTPQs, we add variables for non-root backbone nodes to extend the structural predicate. For an internal node $u$ with backbone children, denoted by $u_1, \ldots, u_k$, the extended structural predicate is $f_{ext}(u) = f_s(u) \wedge p_{u_1} \wedge \cdots \wedge p_{u_k}$.

Example 2.

In Example 1, is a conjunctive GTPQ, in which (1) , , ; (2) the attribute predicate for a query node is a conjunction of comparisons among tags and typed values (e.g., “author” value “Bob”); (3) , and . The only difference between and is that in , . In , . As an example of extended structural predicates, for , . ∎

Semantics.

Consider a data graph and a GTPQ . We say that a data node in downwardly matches a query node in , denoted by , if the following conditions are satisfied:
(1) satisfies the attribute predicate of , denoted by . That is, for each formula “ op ” in , there is an element () in such that op . is called a candidate matching node of . denotes the set of candidate matching nodes of , ie, .
(2) If is an internal node, the data node determines a truth assignment to the variables of such that , where denotes the truth-value of under the assignment. For each variable , the truth-value is assigned as follows: for each PC (resp AD) child of , if there exists a child (resp descendant) of such that ; otherwise, .
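
The two conditions above can be sketched as a recursive check. This is a minimal illustration, not the paper's evaluation algorithm: the dictionary-based query encoding, the function names and the toy data are assumptions, descendants are found by plain traversal, and each `pred` callable is assumed to already encode the extended structural predicate of its node.

```python
def descendants(succ, v):
    """All nodes reachable from v by a nonempty path."""
    seen, stack = set(), list(succ.get(v, ()))
    while stack:
        w = stack.pop()
        if w not in seen:
            seen.add(w)
            stack.extend(succ.get(w, ()))
    return seen

def downward_match(succ, attrs, query, v, u):
    """Check whether data node v downwardly matches query node u."""
    node = query[u]
    if not node["attr"](attrs[v]):            # condition (1): attribute predicate
        return False
    if not node["children"]:                  # leaf query node: nothing else to check
        return True
    assignment = {}
    for child, edge in node["children"]:      # condition (2): derive the truth assignment
        pool = succ.get(v, ()) if edge == "PC" else descendants(succ, v)
        assignment[child] = any(
            downward_match(succ, attrs, query, w, child) for w in pool
        )
    return node["pred"](assignment)           # evaluate the structural predicate

# Toy data: paper p1 has authors Alice and Bob, paper p2 has only Alice.
succ = {"p1": ["a1", "b1"], "p2": ["a2"]}
attrs = {
    "p1": {"tag": "inproceedings"},
    "p2": {"tag": "inproceedings"},
    "a1": {"tag": "author", "value": "Alice"},
    "b1": {"tag": "author", "value": "Bob"},
    "a2": {"tag": "author", "value": "Alice"},
}

def is_author(name):
    return lambda a: a.get("tag") == "author" and a.get("value") == name

# A query in the spirit of the third query of Example 1: Alice AND NOT Bob.
query = {
    "paper": {"attr": lambda a: a.get("tag") == "inproceedings",
              "children": [("alice", "PC"), ("bob", "PC")],
              "pred": lambda p: p["alice"] and not p["bob"]},
    "alice": {"attr": is_author("Alice"), "children": [], "pred": None},
    "bob": {"attr": is_author("Bob"), "children": [], "pred": None},
}

print([v for v in ("p1", "p2") if downward_match(succ, attrs, query, v, "paper")])
```

The recursion follows the query tree, so it terminates even when the data graph contains cycles; the repeated traversals make it far too slow for large graphs, which is the cost that a reachability index and pruning are meant to avoid.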

Let . A -ary tuple () of nodes in is said to be a match of on , if the following conditions hold: (1) for each , ; (2) for each edge , if is a PC child of , is a child of ; otherwise, is a descendant of .

The answer to is a set of results in the form of tuples, where each tuple consists of the images of output nodes in a match of . For each match, there is at least an assignment for all variables that makes the extended structural predicates of all internal backbone nodes and some of internal predicate nodes evaluate to true, which we call a certificate of the match. For a match and an assignment as a certificate of the match, an instance of on is a tuple consisting of such nodes that each of them matches a distinct query node whose corresponding propositional variable is true under the assignment. In particular, an instance of conjunctive GTPQ is exactly a match of the query.

Figure 2: Example of a data graph and a GTPQ: (a) data graph; (b) GTPQ on the data graph. We use a rectangle to represent a predicate node and a circle to represent a backbone node.
Figure 3: Comparison between a B-twig query and a GTPQ: (a) B-twig query; (b) GTPQ.
Example 3.

For simplicity of presentation, a lower-case letter in all figures throughout this paper denotes a data node and a capital letter denotes a query node such that if and .

Consider the data graph and the query shown in Fig 2. . Accordingly, . The answer . One of the query matches leading to is , where elements are sorted in the ascending order of the subscripts of corresponding query nodes. An instance of this match is , where ‘’ means is a match of . Indeed, , because (1) , and (2) since and . Also, , because cannot reach a node matching and hence , thereby . ∎

For simplicity of semantics, we require a query to explicitly specify backbone nodes and predicate nodes, and we restrict output nodes to backbone nodes. The distinction between the two types of nodes is that, unlike the variables associated with predicate nodes, the propositional variables associated with backbone nodes may not be operands of negation or disjunction, which guarantees that each backbone node has an image in a match of the query. Permitting negation and disjunction on arbitrary query nodes leads to issues that are computationally undesirable. If each query result is still required to have an image for each output node, the expressive power does not change, but determining whether a query is valid then amounts to checking whether the variables associated with output nodes are true in all certificates of matches, which is a co-NP-complete problem. Otherwise, the output structures are no longer fixed: they must either be specifically defined in the query or, by default, consist of exponentially many combinations of output nodes. Our evaluation algorithm can be straightforwardly extended to process queries with multiple output structures (see the Appendix).

We now compare GTPQ with the works in [?] and [?]. [?] deals with AND/OR-twigs against tree-structured data. [?] further extends [?] to handle B-twigs, which additionally introduce the logical-NOT operation into the query. Both represent a query by defining special types of nodes for operators, namely logical-AND nodes, logical-OR nodes and logical-NOT nodes. For each occurrence of a variable in a structural predicate of a GTPQ, the corresponding AND/OR-twig or B-twig needs a distinct subtree to express the structural constraints on descendants specified by that variable, since in AND/OR-twigs and B-twigs the query nodes connected to different operator nodes are considered distinct. AND/OR-twigs and B-twigs may hence be much larger than a GTPQ expressing the same complex tree pattern. In Fig 3, the B-twig query has to use two paths to represent the constraints that can be imposed by a single path in the semantically equivalent GTPQ. Moreover, before evaluating the query, [?] and [?] have to construct OR-blocks to normalize the twig. The normalization process is essentially a CNF conversion of propositional formulas. Since a CNF conversion can lead to an exponential blowup of the formula, the time cost of a conversion is exponential in the size of the original query, and the resulting query size also becomes exponential in the worst case. Therefore, our query representation is more powerful and compact than the tree representation of [?] and [?].
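
The blowup caused by CNF conversion can be observed on a small family of formulas. The sketch below is only an illustration and is not taken from [?]: it converts the formula $(x_1 \wedge y_1) \vee \cdots \vee (x_n \wedge y_n)$ to CNF by distributing disjunction over conjunction, and the function names are hypothetical.

```python
from itertools import product

def dnf_to_cnf(terms):
    """Distribute OR over AND: each clause picks one literal from every term."""
    return {frozenset(choice) for choice in product(*terms)}

def sample_dnf(n):
    """(x1 AND y1) OR ... OR (xn AND yn), each term given as a tuple of literals."""
    return [(f"x{i}", f"y{i}") for i in range(1, n + 1)]

for n in (2, 4, 8):
    # The clause count doubles with every additional term.
    print(n, len(dnf_to_cnf(sample_dnf(n))))
```

A formula with $n$ two-literal terms yields $2^n$ clauses under this conversion, whereas a structural predicate in a GTPQ keeps the formula in its original form.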

In this section, we study the problems of satisfiability, containment, equivalence, and minimization of GTPQs, which are important for query analysis and optimization.

A GTPQ is satisfiable if there is a data graph on which the answer to is nonempty. We first introduce some definitions before showing how to determine the satisfiability and establishing the property of the problem.

We say is an independently constraint node if (1) the formula is satisfiable, in which is the parent of , is the formula produced by assigning to the variable , and is the exclusive-or logical operator; (2) all ancestors of are independently constraint nodes. Intuitively, the variables of independently constraint nodes can independently affect the resulting truth-value of the structural predicates of their parents and ancestors. Backbone nodes are clearly independently constraint nodes, if their structural predicates are satisfiable.
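
Condition (1) asks whether flipping one variable can change the truth-value of the parent's predicate under some assignment to the remaining variables. A brute-force sketch of that test is given below; the function name and the encoding of formulas as Python callables are assumptions, condition (2) on ancestors is not covered, and a SAT solver would replace the enumeration for larger formulas.

```python
from itertools import product

def can_affect(formula, variables, x):
    """True if formula[x=0] XOR formula[x=1] is satisfiable (brute force)."""
    others = [v for v in variables if v != x]
    for values in product([False, True], repeat=len(others)):
        env = dict(zip(others, values))
        # The XOR is true exactly when the two restricted formulas disagree.
        if formula({**env, x: False}) != formula({**env, x: True}):
            return True
    return False

# p1 AND (p2 OR NOT p2): p2 never influences the result, while p1 does.
f = lambda a: a["p1"] and (a["p2"] or not a["p2"])
print(can_affect(f, ["p1", "p2"], "p1"), can_affect(f, ["p1", "p2"], "p2"))
```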

A transitive structural predicate for a node is constructed from in a bottom-up sweep as follows. (1) For each leaf node and each non-independently constraint node , the transitive structural predicate is the same as the extended structural predicate, ie . (2) For an internal node such that the transitive structural predicates of all children have been defined, is produced by substituting for each variable of independently constraint node in .

For two non-root nodes in , we say that is similar to , denoted by , if the following conditions hold. (1) For each formula “ op ” in , there is a formula “ op ” in such that (a) if , , (b) if , , (c) if , . We use to denote that and satisfy this condition. (2) For each PC (resp AD) child of such that is an independently constraint node, there is a PC child (resp a descendant) of such that . (3) The formula is a tautology, where is a formula transformed from by replacing with for each pair () such that (a) is a descendant of , (b) is a descendant of and (c) . We say that is subsumed by , denoted by , if (1) , and (2) the parent of is the lowest common ancestor of and , and (a) if is a PC child of , is also a PC child of ; (b) otherwise is a descendant of .

We finally define complete structural predicates to characterize the whole structural constraints of a GTPQ. For a node , the complete structural predicate is created from the corresponding transitive structural predicate by performing the following operations: (1) for each descendant of , if its attribute predicate is unsatisfiable, , where is the old formula before this transformation and is the newly generated formula; (2) for every two nodes and in two distinct subtrees of such that , , where and have the same meaning as above in (1).

Theorem 1 shows that the satisfiability of a GTPQ is equivalent to the satisfiability of the complete structural predicate of the root, provided that the attribute predicate of the root is satisfiable. If the query is a conjunctive or union-conjunctive GTPQ, satisfiability can be decided in linear time. When negation is added to the query, satisfiability becomes NP-complete.

Theorem 1.

A GTPQ is satisfiable if and only if for the root node of , and are both satisfiable. ∎

Theorem 2.

  1. The satisfiability of a union-conjunctive GTPQ can be determined in linear time.

  2. The satisfiability of a GTPQ is NP-complete.∎
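
One way to see why negation-free predicates are easy to handle: a propositional formula built only from conjunction and disjunction is monotone, so it is satisfiable exactly when it evaluates to true with every available variable set to true. The sketch below illustrates this observation only and is not the paper's decision procedure; the nested-tuple encoding and the `blocked` set (variables forced to false, for instance for nodes whose attribute predicate is unsatisfiable) are assumptions.

```python
def eval_monotone(formula, blocked=frozenset()):
    """Evaluate a negation-free formula with every non-blocked variable set to true.

    formula is a variable name (str) or a tuple ("and" | "or", sub1, sub2, ...).
    Each subformula is visited once, so the check is linear in the formula size.
    """
    if isinstance(formula, str):
        return formula not in blocked
    op, *subs = formula
    results = [eval_monotone(s, blocked) for s in subs]
    return all(results) if op == "and" else any(results)

f = ("and", "p1", ("or", "p2", "p3"))
print(eval_monotone(f), eval_monotone(f, {"p2", "p3"}))
```

Once negation is allowed, monotonicity is lost and a single all-true evaluation no longer decides satisfiability, which is consistent with the NP-completeness stated in the theorem.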

Figure 4: Examples for four fundamental problems of GTPQs, panels (a), (b) and (c).
Example 4.

Consider the query in Fig 2(b). All query nodes are independently constraint nodes. Replacing with in , we have . Since there are no two nodes and such that , . Due to the satisfiability of , we see that the query is satisfiable. Indeed, we can get a nonempty answer by posing on in Fig 2 as shown in Example 3.

Let us turn to and depicted in Fig 4. The following table presents structural predicates of internal nodes for and .

For both queries, and are two non-independently constraint nodes. In , we have , because (1) , (2) , (3) , which is a tautology, (4) is an AD child of which is an ancestor of . In contrast, for , , since now is a PC child of but is not. Suppose attribute predicates of all nodes are satisfiable. Then for , , which is satisfiable; but for , , which is unsatisfiable. Therefore, we know that is satisfiable and not. ∎

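For two GTPQs $Q_1$ and $Q_2$, $Q_1$ is contained in $Q_2$, denoted by $Q_1 \sqsubseteq Q_2$, if for any data graph $G$, $Q_1(G) \subseteq Q_2(G)$. $Q_1$ and $Q_2$ are equivalent, denoted by $Q_1 \equiv Q_2$, if $Q_1 \sqsubseteq Q_2$ and $Q_2 \sqsubseteq Q_1$.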

Homomorphism.

Given two GTPQs with query nodes and with query nodes , a homomorphism from to is a mapping from to such that (1) the two sets of output nodes of and are bijective; (2) for any non-independently constraint node , ; (3) for any independently constraint node in , (a) for any PC (resp, AD) child node of such that is also an independently constraint node, is a PC child (resp, a descendant) of , and (b) ; (4) the formula is a tautology, where is the root node of and is a formula transformed from by replacing with for each independently constraint node .

Theorem 3 yields a decision procedure for containment and equivalence between two GTPQs. Theorem 4 states the intractability of the two problems of containment and equivalence.

Theorem 3.

For two GTPQs and , iff there exists a homomorphism from to . ∎

Theorem 4.

The containment checking for GTPQs is co-NP-hard. ∎

Example 5.

Recall the queries in Fig 4. We now assume and others the same as in Example 4. Let be a conjunctive GTPQ, and denote in to distinguish nodes in different queries. We have that , and . Indeed, there is a homomorphism from to , where . There is also from to , in which . We can also derive and . ∎

Algorithm 1: minGTPQ
Input: a GTPQ.
Output: a minimum equivalent GTPQ of the input query.
 1  construct an equivalent query from the input query by removing subtrees rooted at a node whose attribute predicate is unsatisfiable and assigning the variables of the removed nodes to 0 in the respective structural predicates
 2  check each structural predicate to determine for each node whether it is an independently constraint node, and remove all non-independently constraint nodes, assigning their variables to 0 in the respective structural predicates
 3  compute the complete structural predicate for each node in bottom-up order
 4  for each node in bottom-up order do
 5      if its complete structural predicate is unsatisfiable then
 6          remove the whole subtree rooted at the node from the query
 7  end
 8  for each node do
 9      if the formula is a tautology then
10          for each such that do
11              for each output node in the subtree rooted at do
12                  if there exists such that and the subtree query pattern rooted at and that rooted at are isomorphic then
13                      remove from the set of output nodes and add into it
14              end
15              remove nodes in the subtree rooted at from that are not ancestors of any output nodes and corresponding edges they connect
16          end
17      else if the formula is a tautology then
18          for each pair do
19              remove the whole subtree rooted at from
20          end
21  end
    return the resulting query

Since the efficiency of processing a query depends on the size of it, it is necessary to identify and eliminate redundant nodes. For a GTPQ with query nodes , we define its size as .

Minimization.

Given a GTPQ , the minimization problem is to find another GTPQ such that (1) , (2) , and (3) there exists no other such with .

From Theorem 3, we have that for a GTPQ , there is a minimal equivalent GTPQ of whose query nodes are a subset of query nodes of . We say two GTPQs and are isomorphic, if there is a homomorphism between them that is a one-to-one mapping. The following proposition shows that the minimal equivalent query of a GTPQ is unique up to isomorphism.

Proposition 5.

Let GTPQs and be minimal and equivalent. Then and are isomorphic.∎

Algorithm 1 shows how to minimize a GTPQ. We give an example to illustrate it.

Example 6.

In Fig 4, the query is a minimum equivalent query of with structural predicates given in Example 5. (1) Since we suppose all attribute predicates are satisfiable, there are no nodes to be removed in this step, and (line 1). (2) All nodes except and are independently constraint nodes, hence we remove and and assign 0 to in , thereby having that (line 2). In this step, all propositional formulas of structural predicates are simplified to equivalent formulas with the minimum number of variables. (3) There are no nodes whose complete structural predicates are unsatisfiable, and so none is removed (lines 4–7). (4) The formula is a tautology and , so and its child are removed, and we have , thereby generating the query (lines 8–19). This step removes subtrees that are semantically subsumed by others. ∎

The correctness can be proved based on Theorem 3. Since the algorithm involves solving SAT problems, its worst-case time complexity is exponential in the query size. In fact, Theorem 6 shows that the minimization problem is NP-hard, so a polynomial-time algorithm is unlikely to exist. Nevertheless, because there are many high-performance SAT algorithms and query sizes are typically small in practice, it is still worth minimizing a GTPQ given the resulting gains in evaluation efficiency.

Theorem 6.

The minimization problem for GTPQs is NP-hard.∎

Recall that two major problems that impair the efficiency of algorithms for processing TPQs over graphs are large intermediate results and expensive join operations on them. In the following, we propose two new techniques to address them.

Graph representation of intermediate results.

To reduce the cost of storing intermediate results and avoid merge-join operations, we represent intermediate results as a graph rather than as sets of tuples. Each match for a path or a substructure of the query pattern can be embedded into the tree pattern and hence can naturally be represented as a tree. By grouping all candidate matches by the corresponding matched query nodes and adding an edge between a pair of data nodes whenever there is an edge between the corresponding pair of query nodes in the query pattern, we can represent the intermediate and final results as graphs. In such a graph representation, each data node appears at most once, in contrast to the tuple representation, in which a data node may appear in multiple tuples. Also, the AD or PC relationship between two nodes is represented by exactly one edge, while in the tuple form the two nodes may appear together in more than one tuple, repeatedly and explicitly representing their relationship. Since the size of the intermediate matches may be huge, even exponential in both the query size and the data size in the worst case, the graph representation is much more compact, with at most quadratic space cost. Moreover, to enumerate all resulting matches of a pattern query, we only need to perform a single graph traversal on a presumably small graph instead of multiple merge-join operations over large intermediate results.
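
The difference between the two representations can be illustrated with a small sketch. The encoding below (candidate lists per query node plus, per query edge, a map from a data node to its linked data nodes) and all names in it are assumptions made for illustration; the paper's own data structure may differ.

```python
from itertools import product

def enumerate_matches(query_children, root, cands, edges):
    """Yield matches as dicts {query node: data node} by traversing the result graph.

    query_children: query node -> list of child query nodes (a tree)
    cands:          query node -> list of matching data nodes
    edges:          (query parent, query child) -> {data node: [linked data nodes]}
    """
    def expand(q, v):
        partials = [[{q: v}]]
        for c in query_children.get(q, []):
            # Collect the sub-matches reachable through the result-graph edges.
            sub = [m for w in edges[(q, c)].get(v, []) for m in expand(c, w)]
            partials.append(sub)
        for combo in product(*partials):
            merged = {}
            for part in combo:
                merged.update(part)
            yield merged

    for v in cands[root]:
        yield from expand(root, v)

# Query A with two children B and C; one A node linked to three B and three C nodes.
cands = {"A": ["a1"], "B": ["b1", "b2", "b3"], "C": ["c1", "c2", "c3"]}
edges = {("A", "B"): {"a1": ["b1", "b2", "b3"]},
         ("A", "C"): {"a1": ["c1", "c2", "c3"]}}
matches = list(enumerate_matches({"A": ["B", "C"]}, "A", cands, edges))

graph_size = (sum(len(c) for c in cands.values())
              + sum(len(t) for e in edges.values() for t in e.values()))
tuple_size = sum(len(m) for m in matches)
print(len(matches), graph_size, tuple_size)
```

In this toy case the result graph stores 7 nodes and 6 edges, while the 9 matches written out as tuples occupy 27 node slots; the gap widens as the branching of the query and the number of candidates grow.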

It is worth noting that this way of representing intermediate results can also be applied to algorithms for other graph pattern queries to speed up their evaluation. It is particularly well suited to TPQs because matches can be enumerated directly from the graph. For general graph pattern queries, however, additional matching operations, including joins, may be unavoidable because it is difficult to determine locally which nodes should be traversed to form a match. Since these additional operations in essence evaluate a pattern match on a smaller graph, the technique can still be expected to speed up the overall processing.

Reachability index enhanced effective pruning.

Since the number of data nodes to be processed significantly affects the efficiency of pattern query evaluation, it is desirable to prune effectively so as to reduce the number of candidate matching nodes. In the literature, [?] and [?] have developed two pruning approaches for reachability query pattern matching. TwigStackD [?] proposed a pre-filtering approach that selects nodes guaranteed to be in final matches. Since it has to perform two graph traversals on the data graph, it is likely infeasible for large-scale real-world graphs. The work [?] on pattern queries over labeled graphs proposed another pruning process, namely R-semijoin, using a special index called the cluster-based R-join index. It can filter out nodes that cannot possibly contribute to partial matches for an AD edge between two labeled query nodes. However, (1) the selected nodes may still be redundant, since they only satisfy the reachability condition imposed by one edge and global structural satisfaction is not checked. (2) It is costly to construct and store the R-join index for a large data graph, since the index essentially precomputes and stores all matches for pairwise labels and its size is quadratic in the graph size. (3) It cannot be used to prune for queries that have expressive attribute predicates rather than a fixed set of labels associated with nodes: since the predicates of query nodes are often neither fixed nor predictable, the index cannot be precomputed.

We explore the potential of existing reachability indexes for effective pruning. It is interesting to note that most reachability indexing schemes follow a common paradigm. They first use a relatively simple reachability index, which often assigns two or three labels to each node, to cover the reachability of a substructure called a cover, such as the tree-cover in [?, ?], the path-tree in [?], and the chain-cover in [?, ?]. To cover the remaining reachability information, each node keeps one or two lists that store all or just a portion of its ancestors and descendants. When answering whether one node can reach another, the algorithms typically use the nodes stored in the lists as intermediates to determine reachability.

When it comes to answering a number of reachability queries between two sets of nodes, the following two observations are helpful: (1) the lists of different nodes often share a number of nodes, and (2) the nodes in different lists have rich reachability information. If we merge the lists of a set of nodes by eliminating duplicates and nodes whose reachability information can be derived from others, the merged list “subsumes” all the reachability information in the original lists of the node set, yet its size will not be much larger, and is possibly even much smaller, than the list of any individual node. Using the merged list, reachability patterns are likely to be evaluated more efficiently.

For example, considering a reachability pattern , we want to filter data nodes in that cannot reach any nodes in . Instead of performing pairwise reachability queries to check for each node whether it can reach a node , (1) we merge all index lists of to a single list of the minimum size that preserves all the reachability information saved in the original lists; and (2) for each , use the list of and the merged list rather than individual lists for to holistically determine whether reaches some node in . Intuitively, we can regard the set as a single dummy node which is reachable from all nodes that are ancestors of nodes in .

In this paper, we use 3-hop [?] as the underlying reachability index scheme, as 3-hop has both a very compact index size and reasonable query processing time. Since different labeling schemes suit different graph structures, our framework can also flexibly adopt other labeling schemes to process different types of graphs efficiently.

We restrict our attention to in-memory processing and do not address the issues relating to disk-based access methods and physical representation of graph data.

Algorithm outline.

Our GTPQ evaluation algorithm (referred to as GTEA) is outlined as follows. First, it prunes candidate matching nodes that do not satisfy the downward structural constraints (i.e., that do not satisfy the subtree pattern query rooted at the corresponding query node). Second, it performs a second round of pruning on a carefully selected subtree pattern, called the prime subtree, to remove nodes that do not satisfy the upward structural constraints (i.e., that are not reachable from any candidate node of the root). Third, the prime subtree is further shrunk if possible, and GTEA generates the matches of the shrunk prime subtree while representing the intermediate results as a graph, from which the final results can be efficiently obtained. We begin by focusing on evaluating GTPQs with AD edges only and show how to extend the algorithm to process PC edges in Section Remark..

We use a two-round pruning process to filter out unqualified data nodes. The first round selects, for each query node, the data nodes that satisfy the downward structural constraints of the query pattern. In the second round, we obtain a minimum subtree that contains all output nodes having more than one candidate matching node, and select the necessary edges from this subtree to find nodes satisfying the upward structural constraints.
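The two rounds can be illustrated with the simplified Python sketch below. It makes several assumptions that go beyond the text: the pattern is purely conjunctive with AD edges only, the second round runs over the whole query tree rather than the prime subtree, and reachability is answered by a hypothetical pairwise oracle `reaches(u, v)` instead of an index. The function name `two_round_prune` is likewise illustrative.

```python
def two_round_prune(query_children, root, candidates, reaches):
    """Return pruned candidate sets {query node: set of data nodes}.

    query_children: {query node: [child query nodes]} describing a tree.
    candidates:     {query node: iterable of candidate data nodes}.
    reaches:        hypothetical oracle, reaches(u, v) -> bool.
    """
    cand = {q: set(vs) for q, vs in candidates.items()}

    def downward(q):
        # Bottom-up: keep v only if it reaches a surviving candidate of every child.
        for c in query_children.get(q, []):
            downward(c)
            cand[q] = {v for v in cand[q]
                       if any(reaches(v, w) for w in cand[c])}

    def upward(q):
        # Top-down: keep w only if some surviving candidate of its parent reaches it.
        for c in query_children.get(q, []):
            cand[c] = {w for w in cand[c]
                       if any(reaches(v, w) for v in cand[q])}
            upward(c)

    downward(root)
    upward(root)
    return cand


# Toy data: the transitive closure is given explicitly as a set of pairs.
closure = {("a1", "b1"), ("b1", "c1"), ("a1", "c1"), ("a2", "b2")}
pruned = two_round_prune(
    {"a": ["b"], "b": ["c"]}, "a",
    {"a": ["a1", "a2"], "b": ["b1", "b2"], "c": ["c1", "c2"]},
    lambda u, v: (u, v) in closure,
)
print(pruned)
```

The pairwise oracle keeps the sketch short; the merged-list technique discussed earlier is aimed precisely at avoiding such pairwise checks.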

Figure: Chain decomposition and 3-hop index

3-hop is a recent graph reachability indexing scheme well-known for its compact index size and reasonable query time. It follows the indexing paradigm described above. It uses the chain-cover which consists of a set of disjoint chains covering all nodes in the graph. Each node in the graph is assigned a chain ID and its sequence number on its chain. For two nodes and on the same chain (i.e., ), , if . In particular, if , we say is smaller than . Obviously, reachability on the chain-cover can be answered using chain IDs and sequence numbers. To encode the remaining reachability information outside chain-cover, 3-hop records a successor list (resp predecessor list ) of “entry” (resp “exit”) nodes to (resp from) other chains for each node . The entry (resp exit) node to (resp from) a chain is the smallest (resp largest) one on that chain that reaches (resp reaches ). See [?] for details of 3-hop index construction. For answering the reachability between two nodes and on different chains, 3-hop takes the following steps. (1) Collect the smallest nodes on any other chain that can reach through exit nodes of chain . That is, we get a set of nodes , where is the entry node of on chain . We call the complete successor list of . (2) Collect the largest nodes on any chain that can reach through entry nodes of chain . In this step, we get a set of nodes , , where is the exit node of on chain . We call the complete predecessor list of . (3) If there is a pair such that , then we can conclude that can reach .
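The three query steps can be sketched in Python as follows. The encoding is an assumption made for illustration, not the 3-hop data layout: a node is a `(chain_id, seq)` tuple, the index lists are plain dictionaries, every name is hypothetical, and each node is treated as a member of its own complete lists so that a direct hop between two chains is also detected. The sketch scans a chain linearly to collect the lists.

```python
def complete_successors(u, succ_list, chain_len):
    """Smallest reachable sequence number per chain, collected from u and
    from all larger nodes on u's chain."""
    chain, seq = u
    best = {chain: seq}  # assumption: u itself counts as reachable from u
    for s in range(seq, chain_len[chain]):
        for x_chain, x_seq in succ_list.get((chain, s), []):
            if x_seq < best.get(x_chain, float("inf")):
                best[x_chain] = x_seq
    return best


def complete_predecessors(v, pred_list):
    """Largest sequence number per chain that reaches v, collected from v
    and from all smaller nodes on v's chain."""
    chain, seq = v
    best = {chain: seq}  # assumption: v itself counts as reaching v
    for s in range(seq, -1, -1):
        for y_chain, y_seq in pred_list.get((chain, s), []):
            if y_seq > best.get(y_chain, -1):
                best[y_chain] = y_seq
    return best


def reaches(u, v, succ_list, pred_list, chain_len):
    if u[0] == v[0]:
        return u[1] <= v[1]  # same chain: compare sequence numbers
    xs = complete_successors(u, succ_list, chain_len)
    ys = complete_predecessors(v, pred_list)
    # Some chain must hold an entry node that is not larger than an exit node.
    return any(c in ys and xs[c] <= ys[c] for c in xs)


# Toy graph: chain 0 has three nodes, chain 1 has two, with one cross edge
# from node (0, 1) to node (1, 1).
chain_len = {0: 3, 1: 2}
succ_list = {(0, 1): [(1, 1)]}
pred_list = {(1, 1): [(0, 1)]}
print(reaches((0, 0), (1, 1), succ_list, pred_list, chain_len))
print(reaches((0, 2), (1, 1), succ_list, pred_list, chain_len))
```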

Example 7.

The figure above gives a chain decomposition of of Fig (a) and the corresponding 3-hop index. Chain IDs and sequence numbers are omitted. As an example, , and . Because , and is reachable from . To answer whether can reach , we collect the entry nodes in into . Then we look up the exit nodes in and get . Since there is a pair such that , and , we say can reach . ∎

Note that to obtain the complete successor (resp predecessor) lists, the original 3-hop needs to visit all larger (resp smaller) nodes. We can assign a forward (resp backward) tracing pointer to each node which points to the smallest larger (resp largest smaller) node whose (resp ) list is nonempty so that nodes with empty lists can be skipped. We define two operations next() and prev() on each node , which return the nodes that the forward and the backward tracing pointers point to, respectively. For example, since is the largest smaller node that has a non-empty w.r.t. , prev.

A basic operation of the pruning process is merging the complete predecessor/successor lists for a given set of data nodes (denoted by ). For the 3-hop case, it picks the largest (resp smallest) nodes on each chain from the complete predecessor (resp successor) list and we call the resultant list predecessor contour (resp successor contour ). A node is said to reach (resp be reachable from) if reaches (resp is reachable from) at least one node in . We have the following proposition.

Proposition 7.

A data node reaches iff there is a pair such that , while reaches iff there exists a pair such that . ∎
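Proposition 7 can be illustrated with a short Python sketch. The encoding is an assumption made for illustration: each complete list is a dictionary from chain ID to a sequence number and is assumed to also contain the node's own chain position, so that one comparison covers the same-chain case. The sketch reads the proposition as saying that a node reaches the set when some entry of its complete successor list is no larger than the contour entry on the same chain. The function names are hypothetical.

```python
def predecessor_contour(complete_pred_lists):
    """Merge complete predecessor lists, keeping the largest node per chain."""
    contour = {}
    for plist in complete_pred_lists:
        for chain, seq in plist.items():
            if seq > contour.get(chain, -1):
                contour[chain] = seq
    return contour


def reaches_set(complete_succ, contour):
    """True if some successor entry is not larger than the contour entry
    on the same chain."""
    return any(chain in contour and seq <= contour[chain]
               for chain, seq in complete_succ.items())


# Two target nodes, given by their complete predecessor lists.
targets = [{0: 3, 1: 2}, {0: 5, 2: 1}]
contour = predecessor_contour(targets)

v = {0: 4, 3: 0}   # complete successor list of a node that reaches the set
w = {0: 6, 1: 3}   # complete successor list of a node that does not
print(reaches_set(v, contour), reaches_set(w, contour))
```

A single comparison against the contour gives the same answer as testing the node against every target list in turn, since a complete predecessor list of one node is itself a contour of a one-element set.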

Input: A set of nodes .
Output: The predecessor contour of .
1 for each node  do
2       if  then  repeat
3             for each index node  do
4                   if  then
5                        
6                  
7            
8      until  or if  then 
return
Procedure 2 MergePredLists

Procedure 2 sketches the process of calculating the predecessor contour , where records the largest node on chain whose predecessor list has been looked up. For each node , MergePredLists processes and those smaller nodes whose predecessor lists have not been looked up as follows. For each node to be processed and each exit node in , it compares with the nodes in on the same chain of , and updates if is larger (lines 4–9). To retrieve nodes from efficiently, can be implemented as a map that uses chain IDs as keys and the sequence numbers as values.
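A possible Python rendering of this procedure is sketched below. It is an interpretation of the description above rather than a transcription of Procedure 2: nodes are assumed to be `(chain_id, seq)` tuples, `pred_list` and `prev` are hypothetical dictionaries standing in for the index and the backward tracing pointers, and only exit nodes found in predecessor lists are recorded, with same-chain reachability assumed to be checked separately through sequence numbers. Both the contour and the bookkeeping of looked-up nodes are maps from chain ID to sequence number.

```python
def merge_pred_lists(nodes, pred_list, prev):
    """Return the predecessor contour as {chain ID: largest sequence number}.

    nodes:     iterable of (chain_id, seq) tuples.
    pred_list: {node: [exit nodes]}, missing keys mean an empty list.
    prev:      {node: largest smaller node with a nonempty list, or None}.
    """
    contour = {}  # chain ID -> largest exit node collected so far
    visited = {}  # chain ID -> largest node whose lists were already looked up
    for chain, seq in nodes:
        done = visited.get(chain, -1)
        if seq <= done:
            continue  # this node and all smaller ones were covered earlier
        node = (chain, seq)
        # Walk down the chain until reaching a part that was already covered.
        while node is not None and node[1] > done:
            for x_chain, x_seq in pred_list.get(node, []):
                if x_seq > contour.get(x_chain, -1):
                    contour[x_chain] = x_seq
            node = prev.get(node)
        visited[chain] = seq
    return contour


# Toy index on chain 0: nodes (0, 1) and (0, 3) have predecessor lists.
pred_list = {(0, 1): [(1, 4)], (0, 3): [(1, 2), (2, 0)]}
prev = {(0, 3): (0, 1), (0, 2): (0, 1), (0, 1): None, (0, 0): None}
print(merge_pred_lists([(0, 3), (0, 2)], pred_list, prev))
print(merge_pred_lists([(0, 2), (0, 3)], pred_list, prev))
```

Whichever input node is read first, the list of node `(0, 1)` is looked up only once, and the resulting contour is the same.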

Example 8.

We show how to compute the predecessor contour of for the query of Fig [?]. Example 3 has given that . The procedure collects the complete predecessor lists for each of one by one, but no predecessor list is visited repeatedly. For example, assume that is read before . When collecting , although prev() points to , MergePredLists need not look up , because the list has already been looked up when collecting . The predecessor contour of is . It can be easily verified that the size of this predecessor contour is half of the total size of the four individual complete lists of and . Note that the size of a predecessor contour is bounded by the number of chains. This example actually gives the worst case but still has a high compression rate (50%). ∎

Input: 3-hop index , a GTPQ .
Output: Updated candidate matching nodes satisfying downward structural constraints.
1 for each node  do  for each leaf node in  do  for each  in bottom-up order do
2       for each   do  for each  that is not empty do
3             for each child of  do  for each node  do
4                   for each child