Semantic Acyclicity Under Constraints

Pablo Barceló
Center for Semantic Web Research &
DCC, University of Chile
Georg Gottlob
Dept. of Computer Science
University of Oxford
Andreas Pieris
Inst. of Information Systems
TU Wien
pbarcelo@dcc.uchile.cl georg.gottlob@cs.ox.ac.uk pieris@dbai.tuwien.ac.at
Abstract

A conjunctive query (CQ) is semantically acyclic if it is equivalent to an acyclic one. Semantic acyclicity has been studied in the constraint-free case, and deciding whether a query enjoys this property is NP-complete. However, in case the database is subject to constraints such as tuple-generating dependencies (tgds) that can express, e.g., inclusion dependencies, or equality-generating dependencies (egds) that capture, e.g., functional dependencies, a CQ may turn out to be semantically acyclic under the constraints while not semantically acyclic in general. This opens avenues to new query optimization techniques. In this paper we initiate and develop the theory of semantic acyclicity under constraints. More precisely, we study the following natural problem: Given a CQ and a set of constraints, is the query semantically acyclic under the constraints, or, in other words, is the query equivalent to an acyclic one over all those databases that satisfy the set of constraints?

We show that, contrary to what one might expect, decidability of CQ containment is a necessary but not sufficient condition for the decidability of semantic acyclicity. In particular, we show that semantic acyclicity is undecidable in the presence of full tgds (i.e., Datalog rules). In view of this fact, we focus on the main classes of tgds for which CQ containment is decidable and that do not capture the class of full tgds, namely guarded, non-recursive and sticky tgds. For these classes we show that semantic acyclicity is decidable, and its complexity coincides with the complexity of CQ containment. In the case of egds, we show that if we focus on keys over unary and binary predicates, then semantic acyclicity is decidable (NP-complete). We finally consider the problem of evaluating a semantically acyclic query over a database that satisfies a set of constraints. For guarded tgds and functional dependencies the evaluation problem is tractable.

1 Introduction

Query optimization is a fundamental database task that amounts to transforming a query into one that is arguably more efficient to evaluate. The database theory community has developed several principled methods for optimization of conjunctive queries (CQs), many of which are based on static-analysis tasks such as containment [1]. In a nutshell, such methods compute a minimal equivalent version of a CQ, where minimality refers to the number of atoms. As argued by Abiteboul, Hull, and Vianu [1], this provides a theoretical notion of “true optimality” for the reformulation of a CQ, as opposed to practical considerations based on heuristics. For each CQ the minimal equivalent CQ is its core [21]. Although the static analysis tasks that support CQ minimization are NP-complete [12], this is not a major problem for real-life applications, as the input (the CQ) is small.

It is known, on the other hand, that semantic information about the data, in the form of integrity constraints, helps query optimization by reducing the space of possible reformulations. In the previous analysis, however, constraints play no role, as CQ equivalence is defined over all databases. Adding constraints yields a refined notion of CQ equivalence, which holds only over those databases that satisfy a given set of constraints. But finding a minimal equivalent CQ in this context is notoriously more difficult than before. This is because basic static analysis tasks such as containment become undecidable when considered in full generality. This motivated a long research program for finding larger “islands of decidability” for this containment problem, based on syntactic restrictions on constraints [2, 8, 10, 11, 22, 23].

An important shortcoming of the previous approach, however, is that there is no theoretical guarantee that the minimized version of a CQ is in fact easier to evaluate (recall that, in general, CQ evaluation is NP-complete [12]). We know, on the other hand, quite a bit about classes of CQs that can be evaluated efficiently. It is thus natural to ask whether constraints can be used to reformulate a CQ as one in such a tractable class, and if so, what is the cost of computing such a reformulation. Following Abiteboul et al., this would provide us with a theoretical guarantee of “true efficiency” for those reformulations. We focus on one of the oldest and most studied tractability conditions for CQs; namely, acyclicity. It is known that acyclic CQs can be evaluated in linear time [27].

More formally, let us write $q \equiv_\Sigma q'$ whenever CQs $q$ and $q'$ are equivalent over all databases that satisfy $\Sigma$. In this work we study the following problem:

PROBLEM : Semantic Acyclicity
INPUT : A CQ $q$ and a finite set $\Sigma$ of constraints.
QUESTION : Is there an acyclic CQ $q'$ s.t. $q \equiv_\Sigma q'$?

We study this problem for the two most important classes of database constraints; namely:

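  1. Tuple-generating dependencies (tgds), i.e., expressions of the form $\forall \bar{x} \forall \bar{y}\,(\phi(\bar{x},\bar{y}) \rightarrow \exists \bar{z}\, \psi(\bar{x},\bar{z}))$, where $\phi$ and $\psi$ are conjunctions of atoms. Tgds subsume the important class of referential integrity constraints (or inclusion dependencies).

  2. Equality-generating dependencies (egds), i.e., expressions of the form $\forall \bar{x}\,(\phi(\bar{x}) \rightarrow y = z)$, where $\phi$ is a conjunction of atoms and $y, z$ are variables in $\bar{x}$. Egds subsume keys and functional dependencies (FDs).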

A useful aspect of tgds and egds is that containment under them can be studied in terms of the chase procedure [25].

Coming back to semantic acyclicity, the main problem we study is, of course, decidability. Since basic reasoning with tgds and egds is, in general, undecidable, we cannot expect semantic acyclicity to be decidable for arbitrary such constraints. Thus, we concentrate on the following question:

Decidability: For which classes of tgds and egds is the problem of semantic acyclicity decidable? In such cases, what is the computational cost of the problem?

Since semantic acyclicity is defined in terms of CQ equivalence under constraints, and the latter has received a lot of attention, it is relevant also to study the following question:

Relationship to CQ equivalence: What is the relationship between CQ equivalence and semantic acyclicity under constraints? Is the latter decidable for each class of tgds and egds for which the former is decidable?

Notice that if this was the case, one could transfer the mature theory of CQ equivalence under tgds and egds to tackle the problem of semantic acyclicity.

Finally, we want to understand to what extent semantic acyclicity helps CQ evaluation. Although an acyclic reformulation of a CQ can be evaluated efficiently, computing such a reformulation might be expensive. Thus, it is relevant to study the following question:

Evaluation: What is the computational cost of evaluating semantically acyclic CQs under constraints?

Semantic acyclicity in the absence of constraints.

The semantic acyclicity problem in the absence of dependencies (i.e., checking whether a CQ is equivalent to an acyclic one over the set of all databases) is by now well-understood. Regarding decidability, it is easy to prove that a CQ $q$ is semantically acyclic iff its core $q'$ is acyclic. (Recall that $q'$ is the minimal CQ equivalent to $q$.) It follows that checking semantic acyclicity in the absence of constraints is NP-complete (see, e.g., [6]). Regarding evaluation, semantically acyclic CQs can be evaluated efficiently [13, 14, 19].

The relevance of constraints.

In the absence of constraints a CQ $q$ is equivalent to an acyclic one iff its core is acyclic. Thus, the only reason why $q$ is not acyclic in the first place is that it has not been minimized. This tells us that in this context semantic acyclicity is not really different from usual minimization. The presence of constraints, on the other hand, yields a more interesting notion of semantic acyclicity. This is because constraints can be applied to CQs to produce acyclic reformulations of them.

Example 1

This simple example illustrates the role of tgds when reformulating CQs as acyclic ones. Consider a database that stores information about customers, records, and musical styles. The relation $\mathit{Interest}$ contains pairs $(c,s)$ such that customer $c$ has declared interest in style $s$. The relation $\mathit{Class}$ contains pairs $(r,s)$ such that record $r$ is of style $s$. Finally, the relation $\mathit{Owns}$ contains a pair $(c,r)$ when customer $c$ owns record $r$.

Consider now a CQ $q(x,y)$ defined as follows:

$$\exists z\,\big(\mathit{Interest}(x,z) \wedge \mathit{Class}(y,z) \wedge \mathit{Owns}(x,y)\big).$$

This query asks for pairs $(c,r)$ such that customer $c$ owns record $r$ and has expressed interest in at least one of the styles with which $r$ is associated. This CQ is a core but it is not acyclic. Thus, from our previous observations it is not equivalent to an acyclic CQ (in the absence of constraints).

Assume now that we are told that this database contains compulsive music collectors only. In particular, each customer owns every record that is classified with a style in which he/she has expressed interest. This means that the database satisfies the tgd:

$$\tau \;=\; \mathit{Interest}(x,z),\, \mathit{Class}(y,z) \rightarrow \mathit{Owns}(x,y).$$

With this information at hand, we can easily reformulate $q(x,y)$ as the following acyclic CQ $q'(x,y)$:

$$\exists z\,\big(\mathit{Interest}(x,z) \wedge \mathit{Class}(y,z)\big).$$

Notice that $q$ and $q'$ are in fact equivalent over every database that satisfies $\tau$.
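The effect of the tgd in this example can be checked on a toy database. The sketch below is only an illustration: the relation names `Interest`, `Class` and `Owns`, the sample tuples, and the helper names are assumptions made for the sketch. It evaluates the cyclic query and its acyclic reformulation on one database that satisfies the tgd and on one that does not.

```python
def q_cyclic(db):
    """q(x, y) :- Interest(x, z), Class(y, z), Owns(x, y)."""
    return {(c, r)
            for (c, s) in db["Interest"]
            for (r, s2) in db["Class"]
            if s == s2 and (c, r) in db["Owns"]}


def q_acyclic(db):
    """q'(x, y) :- Interest(x, z), Class(y, z)."""
    return {(c, r)
            for (c, s) in db["Interest"]
            for (r, s2) in db["Class"]
            if s == s2}


def satisfies_tgd(db):
    """Check the tgd Interest(x, z), Class(y, z) -> Owns(x, y)."""
    return q_acyclic(db) <= db["Owns"]


# Hypothetical data: a database of compulsive collectors ...
collectors = {
    "Interest": {("ann", "jazz"), ("bob", "rock")},
    "Class": {("r1", "jazz"), ("r2", "rock"), ("r3", "jazz")},
    "Owns": {("ann", "r1"), ("ann", "r3"), ("bob", "r2"), ("bob", "r1")},
}
# ... and one where the tgd is violated.
casual = dict(collectors, Owns={("ann", "r1")})

print(satisfies_tgd(collectors), q_cyclic(collectors) == q_acyclic(collectors))
print(satisfies_tgd(casual), q_cyclic(casual) == q_acyclic(casual))
```

On the database that satisfies the tgd the two queries return the same answers; on the other one they differ, which is why the reformulation is only valid under the constraint.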

Contributions.

We observe that semantic acyclicity under constraints is not only more powerful, but also theoretically more challenging than in the absence of them. We start by studying decidability. In the process we also clarify the relationship between CQ equivalence and semantic acyclicity.

Results for tgds: Having a decidable CQ containment problem is a necessary condition for semantic acyclicity to be decidable under tgds (modulo some mild technical assumptions elaborated in the paper). Surprisingly enough, it is not a sufficient condition. This means that, contrary to what one might expect, there are natural classes of tgds for which CQ containment but not semantic acyclicity is decidable. In particular, this is the case for the well-known class of full tgds (i.e., tgds without existentially quantified variables in the head). In conclusion, we cannot directly export techniques from CQ containment to deal with semantic acyclicity.

In view of the previous results, we concentrate on classes of tgds that (a) have a decidable CQ containment problem, and (b) do not contain the class of full tgds. These restrictions are satisfied by several expressive languages considered in the literature. Such languages can be classified into three main families depending on the techniques used for studying their containment problem: (i) guarded tgds [8], which contain inclusion and linear dependencies, (ii) non-recursive sets of tgds [16], and (iii) sticky sets of tgds [10]. Instead of studying such languages one by one, we identify two semantic criteria that yield decidability for the semantic acyclicity problem, and then show that each of the languages satisfies one of these criteria.

  • The first criterion is acyclicity-preserving chase. This is satisfied by those tgds for which the application of the chase over an acyclic instance preserves acyclicity. Guarded tgds enjoy this property. We establish that semantic acyclicity under guarded tgds is decidable and has the same complexity as its associated CQ containment problem: 2ExpTime-complete, and NP-complete for a fixed schema.

  • The second criterion is rewritability by unions of CQs (UCQs). Intuitively, a class $\mathbb{C}$ of sets of tgds has this property if the CQ containment problem under a set in $\mathbb{C}$ can always be reduced to a UCQ containment problem without constraints. Non-recursive and sticky sets of tgds enjoy this property. In the first case the complexity matches that of its associated CQ containment problem: NExpTime-complete, and NP-complete if the schema is fixed. In the second case, we get a NExpTime upper bound and an ExpTime lower bound. For a fixed schema the problem is NP-complete.

The NP bounds (under a fixed schema) can be seen as positive results: By spending exponential time in the size of the (small) query, we can not only minimize it using known techniques but also find an acyclic reformulation if one exists.

Results for egds: After showing that the techniques developed for tgds cannot be applied for showing the decidability of semantic acyclicity under egds, we focus on the class of keys over unary and binary predicates and we establish a positive result, namely semantic acyclicity is NP-complete. We prove this by showing that in such context keys have acyclicity-preserving chase. Interestingly, this positive result can be extended to unary functional dependencies (over unconstrained signatures); this result has been established independently by Figueira [17]. We leave open whether the problem of semantic acyclicity under arbitrary egds, or even keys over arbitrary schemas, is decidable.

Evaluation: For tgds for which semantic acyclicity is decidable (guarded, non-recursive, sticky), we can use the following algorithm to evaluate a semantically acyclic CQ $q$ over a database $D$ that satisfies the constraints $\Sigma$:

  1. Convert $q$ into an equivalent acyclic CQ $q'$ under $\Sigma$.

  2. Evaluate $q'$ on $D$.

  3. Return $q'(D)$.

The running time is $O(|D| \cdot f(|q|,|\Sigma|))$, where $f$ is a double-exponential function (since $q'$ can be computed in double-exponential time for each one of the classes mentioned above and acyclic CQs can be evaluated in linear time). This constitutes a fixed-parameter tractable algorithm for evaluating $q$ on $D$. No such algorithm is believed to exist for CQ evaluation [26]; thus, semantically acyclic CQs under these constraints behave better than the general case in terms of evaluation.

But in the absence of constraints one can do better: Evaluating semantically acyclic CQs in such context is in polynomial time. It is natural to ask if this also holds in the presence of constraints. This is the case for guarded tgds and (arbitrary) FDs. For the other classes of constraints the problem remains to be investigated.

Further results: The results mentioned above continue to hold for a more “liberal” notion based on UCQs, i.e., checking whether a UCQ is equivalent to an acyclic union of CQs under the decidable classes of constraints identified above. Moreover, in case a CQ $q$ is not equivalent to an acyclic CQ under a set of constraints $\Sigma$, our proof techniques yield an approximation of $q$ under $\Sigma$ [4], that is, an acyclic CQ $q'$ that is maximally contained in $q$ under $\Sigma$. Computing and evaluating such an approximation yields “quick” answers to $q$ when exact evaluation is infeasible.

Finite vs. infinite databases.

The results mentioned above interpret the notion of CQ equivalence (and, thus, semantic acyclicity) over the set of both finite and infinite databases. The reason is the wide application of the chase we make in our proofs, which characterizes CQ equivalence under arbitrary databases only. This does not present a serious problem though, as all the particular classes of tgds for which we prove decidability in the paper (i.e., guarded, non-recursive, sticky) are finitely controllable [3, 18]. This means that CQ equivalence under arbitrary databases and under finite databases coincide. In conclusion, the results we obtain for such classes can be directly exported to the finite case.

Organization.

Preliminaries are in Section 2. In Section 3 we consider semantic acyclicity under tgds. Acyclicity-preserving chase is studied in Section 4, and UCQ-rewritability in Section 5. Semantic acyclicity under egds is investigated in Section 6. Evaluation of semantically acyclic CQs is in Section 7. Finally, we present further advancements in Section 8 and conclusions in Section 9.

2 Preliminaries

Databases and conjunctive queries.

Let $\mathbf{C}$, $\mathbf{N}$ and $\mathbf{V}$ be disjoint countably infinite sets of constants, (labeled) nulls and (regular) variables (used in queries and dependencies), respectively, and $\sigma$ a relational schema. An atom over $\sigma$ is an expression of the form $R(\bar{v})$, where $R$ is a relation symbol in $\sigma$ of arity $n > 0$ and $\bar{v}$ is an $n$-tuple over $\mathbf{C} \cup \mathbf{N} \cup \mathbf{V}$. An instance over $\sigma$ is a (possibly infinite) set of atoms over $\sigma$ that contain constants and nulls, while a database over $\sigma$ is simply a finite instance over $\sigma$.

One of the central notions in our work is acyclicity. An instance $I$ is acyclic if it admits a join tree; i.e., if there exists a tree $T$ and a mapping $\lambda$ that associates with each node $t$ of $T$ an atom $\lambda(t)$ of $I$, such that the following holds:

  1. For each atom $\underline{a}$ in $I$ there is a node $t$ in $T$ such that $\lambda(t) = \underline{a}$; and

  2. For each null $x$ occurring in $I$ it is the case that the set $\{t \mid x \text{ occurs in } \lambda(t)\}$ is connected in $T$.
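The existence of a join tree can be tested without building the tree explicitly. The sketch below uses the classical GYO reduction, a standard technique that is not described in this paper: it repeatedly removes terms that occur in a single atom and atoms whose terms are all covered by another atom, and the instance is acyclic iff at most one (empty) atom is left. The representation of atoms as `(relation, terms)` pairs and the function name are assumptions of the sketch, and every term is treated as a null.

```python
def is_acyclic(atoms):
    """GYO reduction: True iff the atoms admit a join tree.

    atoms: iterable of (relation_name, tuple_of_terms); every term is
    treated as a null (an assumption of this sketch).
    """
    edges = [set(terms) for _, terms in atoms]
    changed = True
    while changed:
        changed = False
        # Rule 1: drop terms that occur in exactly one hyperedge.
        for e in edges:
            lonely = {t for t in e if sum(t in f for f in edges) == 1}
            if lonely:
                e -= lonely
                changed = True
        # Rule 2: drop one hyperedge that is contained in another one.
        for i, e in enumerate(edges):
            if any(i != j and e <= f for j, f in enumerate(edges)):
                del edges[i]
                changed = True
                break
    return len(edges) <= 1


triangle = [("R", ("x", "y")), ("S", ("y", "z")), ("T", ("z", "x"))]
path = [("R", ("x", "y")), ("S", ("y", "z"))]
print(is_acyclic(triangle), is_acyclic(path))
```

A triangle of binary atoms is the smallest cyclic case; adding a ternary atom over the same three terms makes it acyclic again, since that atom can serve as the root of a join tree.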

A conjunctive query (CQ) over $\sigma$ is a formula of the form:

$$q(\bar{x}) \;:=\; \exists \bar{y}\,\big(R_1(\bar{v}_1) \wedge \dots \wedge R_m(\bar{v}_m)\big), \qquad (1)$$

where each $R_i(\bar{v}_i)$ ($1 \le i \le m$) is an atom without nulls over $\sigma$, each variable mentioned in the $\bar{v}_i$'s appears either in $\bar{x}$ or $\bar{y}$, and $\bar{x}$ are the free variables of $q$. If $\bar{x}$ is empty, then $q$ is a Boolean CQ. As usual, the evaluation of CQs is defined in terms of homomorphisms. Let $I$ be an instance and $q(\bar{x})$ a CQ of the form (1). A homomorphism from $q$ to $I$ is a mapping $h$, which is the identity on $\mathbf{C}$, from the variables and constants in $q$ to the set of constants and nulls $\mathbf{C} \cup \mathbf{N}$ such that $R_i(h(\bar{v}_i)) \in I$ for each $1 \le i \le m$. (As usual, we write $h(v_1,\dots,v_n)$ for $(h(v_1),\dots,h(v_n))$.) The evaluation of $q(\bar{x})$ over $I$, denoted $q(I)$, is the set of all tuples $h(\bar{x})$ over $\mathbf{C}$ such that $h$ is a homomorphism from $q$ to $I$.

It is well-known that CQ evaluation, i.e., the problem of determining if a particular tuple belongs to the evaluation of a CQ $q$ over a database $D$, is NP-complete [12]. On the other hand, CQ evaluation becomes tractable by restricting the syntactic shape of CQs. One of the oldest and most common such restrictions is acyclicity. Formally, a CQ $q$ is acyclic if the instance consisting of the atoms of $q$ (after replacing each variable in $q$ with a fresh null) is acyclic. It is known from the seminal work of Yannakakis [27] that the problem of evaluating an acyclic CQ $q$ over a database $D$ can be solved in linear time $O(|q| \cdot |D|)$.
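The homomorphism-based semantics of CQs can be made concrete with a naive backtracking evaluator. This is a sketch of the definition only, not of the linear-time algorithm for acyclic queries, and it may take exponential time. The conventions are assumptions of the sketch: variables are strings starting with `?`, all other terms are constants, and facts are `(relation, tuple)` pairs.

```python
def is_var(t):
    return isinstance(t, str) and t.startswith("?")


def matches(atoms, instance, h):
    """Yield every extension of h that maps all atoms into the instance."""
    if not atoms:
        yield h
        return
    (rel, terms), rest = atoms[0], atoms[1:]
    for fact_rel, fact_terms in instance:
        if fact_rel != rel or len(fact_terms) != len(terms):
            continue
        g = dict(h)  # private copy for this branch of the search
        if all((g.setdefault(t, v) == v) if is_var(t) else t == v
               for t, v in zip(terms, fact_terms)):
            yield from matches(rest, instance, g)


def evaluate(free_vars, atoms, instance):
    """q(I): the images of the free variables under all homomorphisms."""
    return {tuple(h[x] for x in free_vars)
            for h in matches(list(atoms), instance, {})}


edges = {("E", ("a", "b")), ("E", ("b", "c"))}
two_step = [("E", ("?x", "?y")), ("E", ("?y", "?z"))]
print(evaluate(("?x", "?z"), two_step, edges))
```

For a Boolean CQ the tuple of free variables is empty, so the result is either the empty set (false) or the set containing only the empty tuple (true).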

Tgds and the chase procedure.

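A tuple-generating dependency (tgd) over $\sigma$ is an expression of the form:

$$\forall \bar{x} \forall \bar{y}\,\big(\phi(\bar{x},\bar{y}) \rightarrow \exists \bar{z}\, \psi(\bar{x},\bar{z})\big), \qquad (2)$$

where both $\phi$ and $\psi$ are conjunctions of atoms without nulls over $\sigma$. For simplicity, we write this tgd as $\phi(\bar{x},\bar{y}) \rightarrow \exists \bar{z}\, \psi(\bar{x},\bar{z})$, and use comma instead of $\wedge$ for conjoining atoms. Further, we assume that each variable in $\bar{x}$ is mentioned in some atom of $\psi$. We call $\phi$ and $\psi$ the body and head of the tgd, respectively. The tgd in (2) is logically equivalent to the expression $\forall \bar{x}\,(q_\phi(\bar{x}) \rightarrow q_\psi(\bar{x}))$, where $q_\phi(\bar{x})$ and $q_\psi(\bar{x})$ are the CQs $\exists \bar{y}\, \phi(\bar{x},\bar{y})$ and $\exists \bar{z}\, \psi(\bar{x},\bar{z})$, respectively. Thus, an instance $I$ over $\sigma$ satisfies this tgd if and only if $q_\phi(I) \subseteq q_\psi(I)$. We say that an instance $I$ satisfies a set $\Sigma$ of tgds, denoted $I \models \Sigma$, if $I$ satisfies every tgd in $\Sigma$.

The chase is a useful tool when reasoning with tgds [8, 16, 22, 25]. We start by defining a single chase step. Let $I$ be an instance over schema $\sigma$ and $\tau = \phi(\bar{x},\bar{y}) \rightarrow \exists \bar{z}\, \psi(\bar{x},\bar{z})$ a tgd over $\sigma$. We say that $\tau$ is applicable w.r.t. $I$ if there exists a tuple $(\bar{a},\bar{b})$ of elements in $I$ such that $\phi(\bar{a},\bar{b})$ holds in $I$. In this case, the result of applying $\tau$ over $I$ with $(\bar{a},\bar{b})$ is the instance $J$ that extends $I$ with every atom in $\psi(\bar{a},\bar{z}')$, where $\bar{z}'$ is the tuple obtained by simultaneously replacing each variable $z \in \bar{z}$ with a fresh distinct null not occurring in $I$. For such a single chase step we write $I \xrightarrow{\tau,(\bar{a},\bar{b})} J$.

Let us assume now that $I$ is an instance and $\Sigma$ a finite set of tgds. A chase sequence for $I$ under $\Sigma$ is a sequence:

$$I_0 \xrightarrow{\tau_0,\bar{c}_0} I_1 \xrightarrow{\tau_1,\bar{c}_1} I_2 \;\dots$$

of chase steps such that: (1) $I_0 = I$; (2) For each $i \ge 0$, $\tau_i$ is a tgd in $\Sigma$; and (3) $\bigcup_{i \ge 0} I_i \models \Sigma$. We call $\bigcup_{i \ge 0} I_i$ the result of this chase sequence, which always exists. Although the result of a chase sequence is not necessarily unique (up to isomorphism), each such result is equally useful for our purposes since it can be homomorphically embedded into every other result. Thus, from now on, we denote by $\mathit{chase}(I,\Sigma)$ the result of an arbitrary chase sequence for $I$ under $\Sigma$. Further, for a CQ $q = \exists \bar{y}\,(R_1(\bar{v}_1) \wedge \dots \wedge R_m(\bar{v}_m))$, we denote by $\mathit{chase}(q,\Sigma)$ the result of a chase sequence for the database $\{R_1(c(\bar{v}_1)),\dots,R_m(c(\bar{v}_m))\}$ under $\Sigma$ obtained after replacing each variable $x$ in $q$ with a fresh constant $c(x)$.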
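The chase step for tgds can be sketched as a small fixpoint loop. The code below is an illustration under stated assumptions, not a faithful implementation of any particular chase variant from the literature: tgds are `(body, head)` pairs of atom lists, variables start with `?`, fresh nulls are named `_n0`, `_n1`, ..., each trigger (tgd plus body match) is fired once, and the loop stops after `max_rounds` rounds because the chase need not terminate in general.

```python
from itertools import count


def is_var(t):
    return isinstance(t, str) and t.startswith("?")


def matches(atoms, instance, h):
    """Yield every extension of h that maps all atoms into the instance."""
    if not atoms:
        yield h
        return
    (rel, terms), rest = atoms[0], atoms[1:]
    for fact_rel, fact_terms in instance:
        if fact_rel != rel or len(fact_terms) != len(terms):
            continue
        g = dict(h)
        if all((g.setdefault(t, v) == v) if is_var(t) else t == v
               for t, v in zip(terms, fact_terms)):
            yield from matches(rest, instance, g)


def chase(instance, tgds, max_rounds=10):
    """Fire every trigger once per round; cut off after max_rounds rounds."""
    inst = set(instance)
    fresh = count()
    fired = set()
    for _ in range(max_rounds):
        new = set()
        for i, (body, head) in enumerate(tgds):
            for h in matches(body, inst, {}):
                trigger = (i, tuple(sorted(h.items())))
                if trigger in fired:
                    continue
                fired.add(trigger)
                g = dict(h)
                for rel, terms in head:
                    for t in terms:
                        if is_var(t) and t not in g:
                            g[t] = f"_n{next(fresh)}"  # fresh labeled null
                    new.add((rel, tuple(g.get(t, t) for t in terms)))
        if new <= inst:  # nothing new was derived: fixpoint reached
            return inst
        inst |= new
    return inst  # possibly only a finite prefix of an infinite chase


tgd = ([("Interest", ("?x", "?z")), ("Class", ("?y", "?z"))],
       [("Owns", ("?x", "?y"))])
db = {("Interest", ("ann", "jazz")), ("Class", ("r1", "jazz"))}
print(sorted(chase(db, [tgd])))
```

For full tgds (no existential variables) the loop always reaches a fixpoint; with existential variables the cutoff matters, and the returned instance should then be read as a prefix of the chase rather than its result.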

Egds and the chase procedure.

An equality-generating dependency (egd) over is an expression of the form:

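$$\forall \bar{x}\,\big(\phi(\bar{x}) \rightarrow x_i = x_j\big),$$

where $\phi$ is a conjunction of atoms without nulls over $\sigma$, and $x_i, x_j \in \bar{x}$. For clarity, we write this egd as $\phi(\bar{x}) \rightarrow x_i = x_j$, and use comma for conjoining atoms. We call $\phi$ the body of the egd. An instance $I$ over $\sigma$ satisfies this egd if, for every homomorphism $h$ such that $h(\phi(\bar{x})) \subseteq I$, it is the case that $h(x_i) = h(x_j)$. An instance $I$ satisfies a set $\Sigma$ of egds, denoted $I \models \Sigma$, if $I$ satisfies every egd in $\Sigma$.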

Recall that egds subsume functional dependencies, which in turn subsume keys. A functional dependency (FD) over $\sigma$ is an expression of the form $R : A \rightarrow B$, where $R$ is a relation symbol in $\sigma$ of arity $n > 0$, and $A, B$ are subsets of $\{1,\dots,n\}$, asserting that the values of the attributes of $B$ are determined by the values of the attributes of $A$. For example, $R : \{1\} \rightarrow \{3\}$, where $R$ is a ternary relation, is actually the egd $R(x,y,z), R(x,y',z') \rightarrow z = z'$. An FD as above is called a key if $A \cup B = \{1,\dots,n\}$.

As for tgds, the chase is a useful tool when reasoning with egds. Let us first define a single chase step. Consider an instance $I$ over schema $\sigma$ and an egd $\epsilon = \phi(\bar{x}) \rightarrow x_i = x_j$ over $\sigma$. We say that $\epsilon$ is applicable w.r.t. $I$ if there exists a homomorphism $h$ such that $h(\phi(\bar{x})) \subseteq I$ and $h(x_i) \neq h(x_j)$. In this case, the result of applying $\epsilon$ over $I$ with $h$ is as follows: If both $h(x_i), h(x_j)$ are constants, then the result is “failure”; otherwise, it is the instance $J$ obtained from $I$ by identifying $h(x_i)$ and $h(x_j)$ as follows: If one is a constant, then every occurrence of the null is replaced by the constant, and if both are nulls, then one is replaced everywhere by the other. As for tgds, we can define the notion of the chase sequence for an instance $I$ under a set $\Sigma$ of egds. Notice that such a sequence, assuming that it is not failing, is always finite. Moreover, it is unique (up to null renaming), and thus we refer to the chase for $I$ under $\Sigma$, denoted $\mathit{chase}(I,\Sigma)$. Further, for a CQ $q = \exists \bar{y}\,(R_1(\bar{v}_1) \wedge \dots \wedge R_m(\bar{v}_m))$, we denote by $\mathit{chase}(q,\Sigma)$ the result of a chase sequence for the database $\{R_1(c(\bar{v}_1)),\dots,R_m(c(\bar{v}_m))\}$ under $\Sigma$ obtained after replacing each variable $x$ in $q$ with a fresh constant $c(x)$; however, it is important to clarify that these are special constants, which are treated as nulls during the chase.
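The egd chase step described above can be sketched in the same style. The encoding is again an assumption of the sketch: an egd is a pair `(body, (x, y))`, variables start with `?`, nulls are strings starting with `_`, and every other term is a constant. The function returns `None` on failure.

```python
def is_var(t):
    return isinstance(t, str) and t.startswith("?")


def is_null(t):
    return isinstance(t, str) and t.startswith("_")


def matches(atoms, instance, h):
    """Yield every extension of h that maps all atoms into the instance."""
    if not atoms:
        yield h
        return
    (rel, terms), rest = atoms[0], atoms[1:]
    for fact_rel, fact_terms in instance:
        if fact_rel != rel or len(fact_terms) != len(terms):
            continue
        g = dict(h)
        if all((g.setdefault(t, v) == v) if is_var(t) else t == v
               for t, v in zip(terms, fact_terms)):
            yield from matches(rest, instance, g)


def chase_egds(instance, egds):
    """Apply egds until none is applicable; return None on failure."""
    inst = set(instance)
    while True:
        # Look for one violated egd: a body match with h(x) != h(y).
        step = next(((h[x], h[y])
                     for body, (x, y) in egds
                     for h in matches(body, inst, {})
                     if h[x] != h[y]), None)
        if step is None:
            return inst
        a, b = step
        if not is_null(a) and not is_null(b):
            return None  # two distinct constants: the chase fails
        old, new = (a, b) if is_null(a) else (b, a)
        # Identify the two terms by replacing the null everywhere.
        inst = {(r, tuple(new if t == old else t for t in ts))
                for r, ts in inst}


# Key on the first attribute of R: R(x, y), R(x, z) -> y = z.
key = ([("R", ("?x", "?y")), ("R", ("?x", "?z"))], ("?y", "?z"))
print(chase_egds({("R", ("a", "_n1")), ("R", ("a", "b"))}, [key]))
print(chase_egds({("R", ("a", "b")), ("R", ("a", "c"))}, [key]))
```

Each successful step removes one null from the instance, which is why this loop always terminates, in line with the finiteness remark above.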

Containment and equivalence.

Let $q$ and $q'$ be CQs and $\Sigma$ a finite set of tgds or egds. Then, $q$ is contained in $q'$ under $\Sigma$, denoted $q \subseteq_\Sigma q'$, if $q(I) \subseteq q'(I)$ for every instance $I$ such that $I \models \Sigma$. Further, $q$ is equivalent to $q'$ under $\Sigma$, denoted $q \equiv_\Sigma q'$, whenever $q \subseteq_\Sigma q'$ and $q' \subseteq_\Sigma q$ (or, equivalently, if $q(I) = q'(I)$ for every instance $I$ such that $I \models \Sigma$). The following well-known characterization of CQ containment in terms of the chase will be widely used in our proofs:

Lemma 1

Let $q(\bar{x})$ and $q'(\bar{x}')$ be CQs and $\Sigma$ be a finite set of tgds or egds. Then $q \subseteq_\Sigma q'$ if and only if $c(\bar{x})$ belongs to the evaluation of $q'$ over $\mathit{chase}(q,\Sigma)$.

A problem that is quite important for our work is CQ containment under constraints (tgds or egds), defined as follows: Given CQs $q, q'$ and a finite set $\Sigma$ of tgds or egds, is it the case that $q \subseteq_\Sigma q'$? Whenever $\Sigma$ is bound to belong to a particular class $\mathbb{C}$ of sets of tgds, we denote this problem as $\mathsf{Cont}(\mathbb{C})$. It is clear that the above lemma provides a decision procedure for the containment problem under egds. However, this is not the case for tgds.

Decidable containment of CQs under tgds.

It is not surprising that Lemma 1 does not provide a decision procedure for solving CQ containment under tgds since this problem is known to be undecidable [7]. This has led to a flurry of activity for identifying syntactic restrictions on sets of tgds that lead to decidable CQ containment (even in the case when the chase does not terminate). (In fact, these restrictions are designed to obtain decidable query answering under tgds. However, this problem is equivalent to query containment under tgds by Lemma 1.) Such restrictions are often classified into three main paradigms:

Guardedness: A tgd is guarded if its body contains an atom, called guard, that contains all the body-variables. Although the chase under guarded tgds does not necessarily terminate, query containment is decidable. This follows from the fact that the result of the chase has bounded treewidth. Let $\mathbb{G}$ be the class of sets of guarded tgds.

Proposition 2

[8] $\mathsf{Cont}(\mathbb{G})$ is 2ExpTime-complete. It becomes ExpTime-complete if the arity of the schema is fixed, and NP-complete if the schema is fixed.

A key subclass of guarded tgds is the class of linear tgds, that is, tgds whose body consists of a single atom [9], which in turn subsume the well-known class of inclusion dependencies (linear tgds without repeated variables in either the body or the head) [15]. Let $\mathbb{L}$ and $\mathbb{ID}$ be the classes of sets of linear tgds and inclusion dependencies, respectively. $\mathsf{Cont}(\mathbb{C})$, for $\mathbb{C} \in \{\mathbb{L}, \mathbb{ID}\}$, is PSpace-complete, and NP-complete if the arity of the schema is fixed [22].
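Guardedness and linearity are purely syntactic and easy to test. A minimal sketch follows, assuming tgds are encoded as `(body, head)` pairs of `(relation, terms)` atoms with variables written as strings starting with `?`; the function names are chosen for this sketch.

```python
def is_var(t):
    return isinstance(t, str) and t.startswith("?")


def body_vars(body):
    return {t for _, terms in body for t in terms if is_var(t)}


def is_guarded(tgd):
    """True iff some body atom (the guard) contains all body variables."""
    body, _ = tgd
    variables = body_vars(body)
    return any(variables <= set(terms) for _, terms in body)


def is_linear(tgd):
    """True iff the body consists of a single atom."""
    body, _ = tgd
    return len(body) == 1


# R(x, y), S(y) -> exists z. T(y, z): guarded by R(x, y), not linear.
guarded = ([("R", ("?x", "?y")), ("S", ("?y",))], [("T", ("?y", "?z"))])
# Interest(x, z), Class(y, z) -> Owns(x, y): no atom contains x, y and z.
collector = ([("Interest", ("?x", "?z")), ("Class", ("?y", "?z"))],
             [("Owns", ("?x", "?y"))])
print(is_guarded(guarded), is_linear(guarded), is_guarded(collector))
```

Every linear tgd is trivially guarded, since its single body atom is the guard.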

Non-recursiveness: A set of tgds is non-recursive if its predicate graph contains no directed cycles. (Non-recursive sets of tgds are also known as acyclic [16, 24], but we reserve this term for CQs.) Non-recursiveness ensures the termination of the chase, and thus decidability of CQ containment. Let $\mathbb{NR}$ be the class of non-recursive sets of tgds. Then:

Proposition 3

[24] $\mathsf{Cont}(\mathbb{NR})$ is complete for NExpTime, even if the arity of the schema is fixed. It becomes NP-complete if the schema is fixed.
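Non-recursiveness can be tested with a cycle check on the predicate graph. The text does not spell out this graph, so the sketch below assumes the usual definition: the nodes are the predicates, and there is an edge from $P$ to $Q$ whenever $P$ occurs in the body and $Q$ in the head of the same tgd. The tgd encoding as `(body, head)` pairs is likewise an assumption.

```python
def is_non_recursive(tgds):
    """True iff the predicate graph of the tgds has no directed cycle."""
    graph = {}
    for body, head in tgds:
        for p, _ in body:
            graph.setdefault(p, set()).update(q for q, _ in head)

    WHITE, GREY, BLACK = 0, 1, 2  # unvisited / on the DFS stack / done
    color = {}

    def has_cycle(p):
        color[p] = GREY
        for q in graph.get(p, ()):
            c = color.get(q, WHITE)
            if c == GREY or (c == WHITE and has_cycle(q)):
                return True
        color[p] = BLACK
        return False

    return not any(color.get(p, WHITE) == WHITE and has_cycle(p)
                   for p in list(graph))


t1 = ([("R", ("?x", "?y"))], [("S", ("?y", "?z"))])
t2 = ([("S", ("?x", "?y"))], [("T", ("?x",))])
t3 = ([("T", ("?x",))], [("R", ("?x", "?y"))])
print(is_non_recursive([t1, t2]), is_non_recursive([t1, t2, t3]))
```

A single tgd such as $R(x,y) \rightarrow \exists z\, R(y,z)$ already yields a self-loop and is therefore recursive.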

Figure 1: Stickiness and marking.

Stickiness: This condition ensures neither termination nor bounded treewidth of the chase. Instead, the decidability of query containment is obtained by exploiting query rewriting techniques. The goal of stickiness is to capture joins among variables that are not expressible via guarded tgds, but without forcing the chase to terminate. The key property underlying this condition can be described as follows: During the chase, terms that are associated (via a homomorphism) with variables that appear more than once in the body of a tgd (i.e., join variables) are always propagated (or “stick”) to the inferred atoms. This is illustrated in Figure 1(a); the first set of tgds is sticky, while the second is not. The formal definition is based on an inductive marking procedure that marks the variables that may violate the semantic property of the chase described above [10]. Roughly, during the base step of this procedure, a variable that appears in the body of a tgd $\tau$ but not in every head-atom of $\tau$ is marked. Then, the marking is inductively propagated from head to body as shown in Figure 1(b). Finally, a finite set of tgds $\Sigma$ is sticky if no tgd in $\Sigma$ contains two occurrences of a marked variable. Let $\mathbb{S}$ be the class of sticky sets of tgds. Then:

Proposition 4

[10] $\mathsf{Cont}(\mathbb{S})$ is ExpTime-complete. It becomes NP-complete if the arity of the schema is fixed.
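The marking procedure behind stickiness can be sketched as follows. The code follows the rough description given above and the usual reading of the definition in [10]: the propagation step marks a body variable of a tgd when that variable occurs in the head at a position where some marked variable occurs in a body. The encoding of tgds and the two sample sets are assumptions of this sketch, not a transcription of Figure 1.

```python
def is_var(t):
    return isinstance(t, str) and t.startswith("?")


def body_vars(body):
    return {t for _, terms in body for t in terms if is_var(t)}


def is_sticky(tgds):
    marked = set()  # pairs (index of tgd, marked body variable)
    # Base step: mark body variables missing from some head atom.
    for i, (body, head) in enumerate(tgds):
        for v in body_vars(body):
            if any(v not in terms for _, terms in head):
                marked.add((i, v))
    # Propagation: from marked body positions back through head atoms.
    changed = True
    while changed:
        changed = False
        positions = {(rel, k)
                     for i, (body, _) in enumerate(tgds)
                     for rel, terms in body
                     for k, t in enumerate(terms) if (i, t) in marked}
        for j, (body, head) in enumerate(tgds):
            variables = body_vars(body)
            for rel, terms in head:
                for k, t in enumerate(terms):
                    if ((rel, k) in positions and t in variables
                            and (j, t) not in marked):
                        marked.add((j, t))
                        changed = True
    # Sticky iff no marked variable occurs twice in the same body.
    for i, (body, _) in enumerate(tgds):
        occurrences = [t for _, terms in body for t in terms
                       if (i, t) in marked]
        if len(occurrences) != len(set(occurrences)):
            return False
    return True


join = ([("R", ("?x", "?y")), ("P", ("?y", "?z"))],
        [("T", ("?x", "?y", "?w"))])
keeps_y = ([("T", ("?x", "?y", "?z"))], [("S", ("?y", "?w"))])
drops_y = ([("T", ("?x", "?y", "?z"))], [("S", ("?x", "?w"))])
print(is_sticky([keeps_y, join]), is_sticky([drops_y, join]))
```

In the first set the join variable `?y` is propagated through both heads and is never marked; in the second set it is dropped by the tgd for `S`, the marking reaches the body of the join tgd, and the set is not sticky.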

Weak versions: Each one of the previous classes has an associated weak version, called weakly-guarded [8], weakly-acyclic [16], and weakly-sticky [10], respectively, that guarantees the decidability of query containment. The underlying idea of all these classes is the same: Relax the conditions in the definition of the class, so that only those positions that receive null values during the chase procedure are taken into consideration. A key property of all these classes is that they extend the class of full tgds, i.e., those without existentially quantified variables. This is not the case for the “unrelaxed” versions presented above.

3 Semantic Acyclicity with TGDs

One of the main tasks of our work is to study the problem of checking whether a CQ $q$ is equivalent to an acyclic CQ over those instances that satisfy a set $\Sigma$ of tgds. When this is the case we say that $q$ is semantically acyclic under $\Sigma$. The semantic acyclicity problem is defined below; $\mathbb{C}$ is a class of sets of tgds (e.g., guarded, non-recursive, sticky, etc.):

PROBLEM : $\mathsf{SemAc}(\mathbb{C})$
INPUT : A CQ $q$ and a finite set $\Sigma$ of tgds in $\mathbb{C}$.
QUESTION : Is there an acyclic CQ $q'$ s.t. $q \equiv_\Sigma q'$?

3.1 Infinite Instances vs. Finite Databases

It is important to clarify that $\mathsf{SemAc}(\mathbb{C})$ asks for the existence of an acyclic CQ $q'$ that is equivalent to $q$ under $\Sigma$ focussing on arbitrary (finite or infinite) instances. However, in practice we are concerned only with finite databases. Therefore, one may claim that the natural problem to investigate is $\mathsf{FinSemAc}(\mathbb{C})$, which accepts as input a CQ $q$ and a finite set $\Sigma \in \mathbb{C}$ of tgds, and asks whether an acyclic CQ $q'$ exists such that $q(D) = q'(D)$ for every finite database $D \models \Sigma$.

Interestingly, for all the classes of sets of tgds discussed in the previous section, $\mathsf{SemAc}(\mathbb{C})$ and $\mathsf{FinSemAc}(\mathbb{C})$ coincide due to the fact that they ensure the so-called finite controllability of CQ containment. This means that query containment under arbitrary instances and query containment under finite databases are equivalent problems. For non-recursive and weakly-acyclic sets of tgds this immediately follows from the fact that the chase terminates. For guarded-based classes of sets of tgds this has been shown in [3], while for sticky-based classes of sets of tgds it has been shown in [18]. Therefore, assuming that $\mathbb{C}$ is one of the above syntactic classes of sets of tgds, by giving a solution to $\mathsf{SemAc}(\mathbb{C})$ we immediately obtain a solution for $\mathsf{FinSemAc}(\mathbb{C})$.

The reason why we prefer to focus on $\mathsf{SemAc}(\mathbb{C})$, instead of $\mathsf{FinSemAc}(\mathbb{C})$, is given by Lemma 1: Query containment under arbitrary instances can be characterized in terms of the chase. This is not true for CQ containment under finite databases simply because the chase is, in general, infinite.

3.2 Semantic Acyclicity vs. Containment

There is a close relationship between semantic acyclicity and a restricted version of CQ containment under sets of tgds, as we explain next. But first we need to recall the notion of connectedness for queries and tgds. A CQ is connected if its Gaifman graph is connected. Recall that the nodes of the Gaifman graph of a CQ $q$ are the variables of $q$, and there is an edge between variables $x$ and $y$ iff they appear together in some atom of $q$. Analogously, a tgd is body-connected if its body is connected. Then:

Proposition 5

Let $\Sigma$ be a finite set of body-connected tgds and $q, q'$ two Boolean and connected CQs without common variables, such that $q$ is acyclic and $q'$ is not semantically acyclic under $\Sigma$. Then $q \subseteq_\Sigma q'$ iff $q \wedge q'$ is semantically acyclic under $\Sigma$.
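The connectedness condition used in Proposition 5 amounts to a reachability check on the Gaifman graph. A minimal sketch, assuming atoms are `(relation, terms)` pairs in which every term is a variable; the function name is chosen for this sketch.

```python
def is_connected(atoms):
    """True iff the Gaifman graph of the atoms is connected."""
    variables = {t for _, terms in atoms for t in terms}
    if not variables:
        return True
    # Two variables are adjacent iff they occur together in some atom.
    adjacent = {v: set() for v in variables}
    for _, terms in atoms:
        for u in terms:
            adjacent[u].update(terms)
    # Depth-first search from an arbitrary variable.
    seen, stack = set(), [next(iter(variables))]
    while stack:
        v = stack.pop()
        if v not in seen:
            seen.add(v)
            stack.extend(adjacent[v] - seen)
    return seen == variables


print(is_connected([("R", ("x", "y")), ("S", ("y", "z"))]))
print(is_connected([("R", ("x", "y")), ("S", ("u", "v"))]))
```

The same function applied to the body atoms of a tgd tests body-connectedness. Note that the conjunction of two CQs without common variables is never connected, even when both conjuncts are.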

As an immediate corollary of Proposition 5, we obtain an initial boundary for the decidability of $\mathsf{SemAc}$: We can only obtain a positive result for those classes of sets of tgds for which the restricted containment problem presented above is decidable. More formally, let us define $\mathsf{RestCont}(\mathbb{C})$ to be the problem of checking $q \subseteq_\Sigma q'$, given a set $\Sigma$ of body-connected tgds in $\mathbb{C}$ and two Boolean and connected CQs $q$ and $q'$, without common variables, such that $q$ is acyclic and $q'$ is not semantically acyclic under $\Sigma$. Then:

Corollary 6

$\mathsf{SemAc}(\mathbb{C})$ is undecidable for every class $\mathbb{C}$ of tgds such that $\mathsf{RestCont}(\mathbb{C})$ is undecidable.

As we shall discuss later, $\mathsf{RestCont}$ is not easier than general CQ containment under tgds, which means that the only classes of tgds for which we know the former problem to be decidable are those for which we know CQ containment to be decidable (e.g., those introduced in Section 2).

At this point, one might be tempted to think that some version of the converse of Proposition 5 also holds; that is, the semantic acyclicity problem for $\mathbb{C}$ is reducible to the containment problem for $\mathbb{C}$. This would imply the decidability of $\mathsf{SemAc}(\mathbb{C})$ for any class $\mathbb{C}$ of sets of tgds for which the CQ containment problem is decidable. Our next result shows that the picture is more complicated than this, as $\mathsf{SemAc}$ is undecidable even over the class $\mathbb{F}$ of sets of full tgds, which ensures the decidability of CQ containment:

Theorem 7

The problem $\mathsf{SemAc}(\mathbb{F})$, where $\mathbb{F}$ is the class of sets of full tgds, is undecidable.

Proof (sketch)

We provide a sketch since the complete construction is long. We reduce from the Post correspondence problem (PCP) over the alphabet $\{a,b\}$. The input to this problem consists of two equally long lists $w_1,\dots,w_n$ and $w'_1,\dots,w'_n$ of words over $\{a,b\}$, and we ask whether there is a solution, i.e., a nonempty sequence $i_1 \dots i_m$ of indices in $\{1,\dots,n\}$ such that $w_{i_1} \cdots w_{i_m} = w'_{i_1} \cdots w'_{i_m}$.
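To make the source problem of the reduction concrete, the following sketch searches for a PCP solution by brute force up to a length bound. Since PCP is undecidable, no bound works in general; the bound `max_len`, the function name and the sample instance are assumptions of this illustration and play no role in the proof.

```python
from itertools import product


def pcp_solution(ws, vs, max_len=6):
    """Return a shortest solution (1-based indices) of length <= max_len,
    or None if no such solution exists within the bound."""
    for n in range(1, max_len + 1):
        for idx in product(range(len(ws)), repeat=n):
            top = "".join(ws[i] for i in idx)
            bottom = "".join(vs[i] for i in idx)
            if top == bottom:
                return [i + 1 for i in idx]
    return None


# A small solvable instance over the alphabet {a, b}.
ws = ["a", "ab", "bba"]
vs = ["baa", "aa", "bb"]
print(pcp_solution(ws, vs))
```

A return value of `None` only says that no solution exists up to the bound, not that the instance is unsolvable.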

Let $w_1,\dots,w_n$ and $w'_1,\dots,w'_n$ be an instance of PCP. In the full proof we construct a Boolean CQ $q$ and a set $\Sigma$ of full tgds over the signature , where , , , and are binary predicates, and and are unary predicates, such that the PCP instance given by $w_1,\dots,w_n$ and $w'_1,\dots,w'_n$ has a solution iff there exists an acyclic CQ $q'$ such that $q \equiv_\Sigma q'$. In this sketch though, we concentrate on the case when the underlying graph of $q'$ is a directed path; i.e., we prove that the PCP instance has a solution iff there is a CQ $q'$ whose underlying graph is a directed path such that $q \equiv_\Sigma q'$. This does not imply the undecidability of the general case, but the proof of the latter is a generalization of the one we sketch below.

The restriction of the query to the symbols that are not is graphically depicted in Figure 2.

Figure 2: The query from the proof of Theorem 7.

There, denote the names of the respective variables. The interpretation of in consists of all pairs in .

Our set of full tgds defines the synchronization predicate over those acyclic CQs whose underlying graph is a path. Assume that encodes a word . We denote by , for , the prefix of of length . In such case, the predicate contains those pairs such that for some sequence of indices in we have that and . Thus, if is a solution for the PCP instance, then belongs to the interpretation of .

Formally, consists of the following rules:

  1. An initialization rule:

    That is, the first element after the special symbol (which denotes the beginning of a word over ) is synchronized with itself.

  2. For each , a synchronization rule:

    Here, , for , denotes , where the ’s are fresh variables. Roughly, if is synchronized and the element (resp., ) is reachable from (resp., ) by word (resp., ), then is also synchronized.

  3. For each , a finalization rule:

    where is the conjunction of atoms:

    This tgd enforces to contain a “copy” of whenever encodes a solution for the PCP instance.

We first show that if the PCP instance has a solution given by the nonempty sequence , with , then there exists an acyclic CQ whose underlying graph is a directed path such that . Let us assume that , where each . It is not hard to prove that , where is as follows:

Here, again, denote the names of the respective variables of . All nodes in the above path are different. The main reason why holds is that, since is a solution, there are elements and such that , and hold in . Thus, the finalization rule is fired. This creates a copy of in , which allows to be homomorphically mapped to .

Now we prove that if there exists an acyclic CQ such that and the underlying graph of is a directed path, then the PCP instance has a solution. Since , Lemma 1 tells us that are homomorphically equivalent. But then must contain at least one variable labeled and one variable labeled . The first variable cannot have incoming edges (otherwise, would not homomorphically map to ), while the second one cannot have outgoing edges (for the same reason). Thus, it is the first variable of that is labeled and the last one that is labeled . Further, all edges reaching in must be labeled (otherwise does not homomorphically map to ). Thus, this is the label of the last edge of that goes from variable to . Analogously, the edge that leaves in is labeled . Further, any other edge in is labeled , , or .

Notice now that must have an incoming edge labeled in from some node that has an outgoing edge with label (since homomorphically maps to ). By definition of , this could only have happened if the finalization rule is fired. In particular, is preceded by node , which in turn is preceded by , and there are elements and such that , and hold in . In fact, the unique path from (resp., ) to in is labeled (resp., ). This means that the atom was not one of the edges of , but must have been produced during the chase by firing the initialization or the synchronization rules, and so on. This process must finish in the second element of . (Recall that belongs to due to the first rule of ). We conclude that our PCP instance has a solution.

Theorem 7 rules out any class that captures the class of full tgds, e.g., weakly-guarded, weakly-acyclic and weakly-sticky sets of tgds. The question that arises is whether the non-weak versions of the above classes, namely guarded, non-recursive and sticky sets of tgds, ensure the decidability of , and what the complexity of the problem is. This is the subject of the next two sections.

4 Acyclicity-Preserving Chase

We propose a semantic criterion, the so-called acyclicity-preserving chase, that ensures the decidability of whenever the problem is decidable. This criterion guarantees that, starting from an acyclic instance, it is not possible to destroy its acyclicity during the construction of the chase. We then proceed to show that the class of guarded sets of tgds has acyclicity-preserving chase, which immediately implies the decidability of , and we pinpoint the exact complexity of the latter problem. Notice that non-recursiveness and stickiness do not enjoy this property, even in the restrictive setting where only unary and binary predicates can be used; more details are given in the next section. The formal definition of our semantic criterion follows:

Definition

(Acyclicity-preserving Chase) We say that a class of sets of tgds has acyclicity-preserving chase if, for every acyclic CQ , set , and chase sequence for under , the result of such a chase sequence is acyclic.  
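Checking whether a given query or chase result is acyclic can be made concrete with the standard GYO reduction, which is swapped in here as a well-known test for (alpha-)acyclicity of hypergraphs; it is not the join-tree definition used in the paper, although the two notions are assumed to coincide. In this sketch each atom is represented only by its set of terms, and the function name `is_acyclic` is hypothetical.

```python
def is_acyclic(hyperedges):
    """GYO reduction: repeatedly (1) drop vertices that occur in exactly one
    hyperedge and (2) drop hyperedges that are empty or contained in another
    hyperedge. The hypergraph is acyclic iff nothing remains."""
    edges = [set(e) for e in hyperedges]
    changed = True
    while changed:
        changed = False
        # Step 1: remove vertices occurring in exactly one hyperedge.
        for e in edges:
            for v in list(e):
                if sum(v in f for f in edges) == 1:
                    e.discard(v)
                    changed = True
        # Step 2: remove empty hyperedges and hyperedges covered by another
        # one (among equal hyperedges, only the last copy is kept).
        kept = []
        for i, e in enumerate(edges):
            absorbed = any(
                (e < f) or (e == f and i < j)
                for j, f in enumerate(edges) if j != i
            )
            if not e or absorbed:
                changed = True
            else:
                kept.append(e)
        edges = kept
    return not edges

# Atoms E(x,y), E(y,z) form a path; adding E(z,x) closes a triangle.
path = [{"x", "y"}, {"y", "z"}]
triangle = [{"x", "y"}, {"y", "z"}, {"z", "x"}]
```

Under this representation, a class has acyclicity-preserving chase when `is_acyclic` stays true on the result of every chase sequence that starts from an acyclic query.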

We can then prove the following small query property:

Proposition 8

Let be a finite set of tgds that belongs to a class that has acyclicity-preserving chase, and a CQ. If is semantically acyclic under , then there exists an acyclic CQ , where , such that .

Figure 3: The compact acyclic query.

The proof of the above result relies on the following technical lemma, established in [8] (using slightly different terminology), that will also be used later in our investigation:

Lemma 9

Let be a CQ, an acyclic instance, and a tuple of distinct constants occurring in such that holds in . There exists an acyclic CQ , where and , such that holds in .

For the sake of completeness, we would like to recall the idea of the construction underlying Lemma 9, which is illustrated in Figure 3. Assuming that are the atoms of , there exists a homomorphism that maps to the join tree of the acyclic instance (the shaded tree in Figure 3). Consider now the subtree of consisting of all the nodes in the image of the query and their ancestors. From we extract the smaller tree also depicted in Figure 3; is obtained as follows:

  1. consists of all the root and leaf nodes of , and all the inner nodes of with at least two children; and

  2. For every , iff is a descendant of in , and the only nodes of that occur on the unique shortest path from to in are and .

It is easy to verify that is a join tree, and has at most nodes. The acyclic conjunctive query is defined as the conjunction of all atoms occurring in .
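The two extraction steps above can be sketched in Python. This is a minimal illustration that follows the description literally, under the assumptions that the join tree is given as a child-to-parent map and that the image of the query is given as a set of marked nodes; the names `compact_tree`, `parent` and `marked` are hypothetical and do not come from [8].

```python
def compact_tree(parent, root, marked):
    """Return (nodes, edges) of the compact tree extracted from the subtree
    induced by the marked nodes and all of their ancestors."""
    # Subtree of marked nodes plus ancestors (the root is always included).
    sub = {root}
    for v in marked:
        while v not in sub:
            sub.add(v)
            v = parent[v]
    # Number of children of each node inside the subtree.
    kids = {v: 0 for v in sub}
    for v in sub:
        if v != root:
            kids[parent[v]] += 1
    # Step 1: keep the root, the leaves, and inner nodes with >= 2 children.
    keep = {v for v in sub if v == root or kids[v] == 0 or kids[v] >= 2}
    # Step 2: connect each kept node to its nearest kept proper ancestor.
    edges = set()
    for w in keep:
        if w == root:
            continue
        v = parent[w]
        while v not in keep:
            v = parent[v]
        edges.add((v, w))
    return keep, edges

# Toy join tree: r -> a -> b -> {c -> e, d -> f}, plus an unmarked branch r -> g.
parent = {"a": "r", "g": "r", "b": "a", "c": "b", "d": "b", "e": "c", "f": "d"}
keep, edges = compact_tree(parent, "r", {"e", "f"})
```

In the toy tree, the chains through `a`, `c` and `d` are contracted, so the extracted tree has four nodes instead of seven.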

Notice that a result similar to Lemma 9 is implicit in [4], where the problem of approximating conjunctive queries is investigated. However, from the results of [4], we can only conclude the existence of an acyclic CQ whose size is exponential in the arity of the underlying schema, while Lemma 9 establishes the existence of an acyclic query of linear size. This is decisive for our later complexity analysis. Having the above lemma in place, it is not difficult to establish Proposition 8.

Proof of Proposition 8.

Since, by hypothesis, is semantically acyclic under , there exists an acyclic CQ such that . By Lemma 1, belongs to the evaluation of over . Recall that belongs to a class that has acyclicity-preserving chase, which implies that is acyclic. Hence, by Lemma 9, there exists an acyclic CQ , where and , such that belongs to the evaluation of over . By Lemma 1, , and therefore . We conclude that , and the claim follows.

It is clear that Proposition 8 provides a decision procedure for whenever has acyclicity-preserving chase and is decidable. Given a CQ , and a finite set :

  1. Guess an acyclic CQ of size at most ; and

  2. Verify that and .

The next result follows:

Theorem 10

Consider a class of sets of tgds that has acyclicity-preserving chase. If the problem is decidable, then is also decidable.

4.1 Guardedness

We proceed to show that is decidable and has the same complexity as CQ containment under guarded tgds:

Theorem 11

is complete for 2ExpTime. It becomes ExpTime-complete if the arity of the schema is fixed, and NP-complete if the schema is fixed.

The rest of this section is devoted to establishing Theorem 11.

Decidability and Upper Bounds

We first show that:

Proposition 12

has acyclicity-preserving chase.

The above result, combined with Theorem 10, implies the decidability of . However, this does not say anything about the complexity of the problem. With the aim of pinpointing the exact complexity of , we proceed to analyze the complexity of the decision procedure underlying Theorem 10. Recall that, given a CQ , and a finite set , we guess an acyclic CQ such that , and verify that . It is clear that this algorithm runs in non-deterministic polynomial time with a call to a oracle, where is a complexity class powerful enough for solving . Thus, Proposition 2 implies that is in 2ExpTime, in ExpTime if the arity of the schema is fixed, and in NP if the schema is fixed. One may ask why for a fixed schema the obtained upper bound is NP and not . Observe that the oracle is called only once in order to solve , and since is already in NP when the schema is fixed, it is not really needed in this case.

Lower Bounds

Let us now show that the above upper bounds are optimal. By Proposition 5, can be reduced in constant time to . Thus, to obtain the desired lower bounds, it suffices to reduce in polynomial time to . Interestingly, the lower bounds given in Section 2 for hold even if we focus on Boolean CQs and the left-hand side query is acyclic. In fact, this is true, not only for guarded, but also for non-recursive and sticky sets of tgds. Let be the following problem: Given an acyclic Boolean CQ , a Boolean CQ , and a finite set of tgds, is it the case ?

From the above discussion, to establish the desired lower bounds for guarded sets of tgds (and also for the other classes of tgds considered in this work), it suffices to reduce in polynomial time to . To this end, we introduce the so-called connecting operator, which provides a generic reduction from to .

Connecting operator.

Consider an acyclic Boolean CQ , a Boolean CQ , and a finite set of tgds. We assume that both are of the form . The application of the connecting operator on returns the triple , where

  • is the query

    where is a new variable not in , each is a new predicate, and also is a new binary predicate;

  • is the query

    where are new variables not in ; and

  • Finally, , where for a tgd , is the tgd

    where are the conjunctions obtained from , respectively, by replacing each atom with , where is a new variable not occurring in .

This concludes the definition of the connecting operator. A class of sets of tgds is closed under connecting if, for every set , . It is easy to verify that remains acyclic and is connected, is connected and not semantically acyclic under , and is a set of body-connected tgds. It can also be shown that iff .

From the above discussion, it is clear that the connecting operator provides a generic polynomial time reduction from to , for every class of sets of tgds that is closed under connecting. Then:

Proposition 13

Let be a class of sets of tgds that is closed under connecting such that is hard for a complexity class that is closed under polynomial time reductions. Then, is also -hard.

Back to guardedness.

It is easy to verify that the class of guarded sets of tgds is closed under connecting. Thus, the lower bounds for stated in Theorem 11 follow from Propositions 2 and 13. Note that, although Proposition 2 refers to , the lower bounds hold for ; this is implicit in [8].

As said in Section 2, a key subclass of guarded sets of tgds is the class of linear tgds, i.e., tgds whose body consists of a single atom, which in turn subsume the well-known class of inclusion dependencies. By exploiting the non-deterministic procedure employed for , and the fact that both linear tgds and inclusion dependencies are closed under connecting, we can show that:

Theorem 14

, for , is complete for PSpace. It becomes NP-complete if the arity of the schema is fixed.

5 UCQ Rewritability

Even though the acyclicity-preserving chase criterion was very useful for solving , it is of little use for non-recursive and sticky sets of tgds. As we show in the next example, neither nor have acyclicity-preserving chase:

Example

Consider the acyclic CQ and the tgd

where is both non-recursive and sticky, but not guarded. In the predicate holds all the possible pairs that can be formed using the terms . Thus, in the Gaifman graph of we have an -clique, which means that is highly cyclic. Notice that our example illustrates that other favorable properties of the CQ, namely bounded (hyper)tree width, are also destroyed after chasing with non-recursive and sticky sets of tgds. (Footnote 4: Guarded sets of tgds over predicates of bounded arity preserve the bounded (hyper)tree width of the query.)
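The effect described in this example can be reproduced with a small Python sketch. It assumes, consistently with the description, a query consisting of unary atoms P(x1), ..., P(xn) and a single full tgd of the form P(x), P(y) -> R(x, y); the predicate names P and R, the exact shape of the tgd, and all function names are assumptions chosen to match the text, not formulas taken from it.

```python
from itertools import combinations

def chase_pairs(n):
    """Chase {P(x1), ..., P(xn)} with the assumed full tgd P(x), P(y) -> R(x, y)."""
    terms = [f"x{i}" for i in range(1, n + 1)]
    facts = {("P", (t,)) for t in terms}
    # One round suffices: the rule body mentions only P, and no new P-facts arise.
    facts |= {("R", (s, t)) for s in terms for t in terms}
    return terms, facts

def gaifman_edges(facts):
    """Undirected edges between distinct terms that co-occur in some fact."""
    edges = set()
    for _, args in facts:
        for s, t in combinations(sorted(set(args)), 2):
            edges.add((s, t))
    return edges

def is_clique(terms, edges):
    """True iff every pair of distinct terms is joined by an edge."""
    return all((s, t) in edges for s, t in combinations(sorted(terms), 2))

terms, facts = chase_pairs(4)
edges = gaifman_edges(facts)
```

Before the chase the Gaifman graph has no edges at all, while afterwards every pair of terms is connected, which is the n-clique mentioned in the example.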

In view of the fact that the methods devised in Section 4 cannot be used for non-recursive and sticky sets of tgds, new techniques must be developed. Interestingly, and share an important property, which turned out to be very useful for semantic acyclicity: UCQ rewritability. Recall that a union of conjunctive queries (UCQ) is an expression of the form , where each is a CQ over the same schema . The evaluation of over an instance , denoted , is defined as . The formal definition of UCQ rewritability follows:
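The definition of UCQ evaluation as the union of the evaluations of its disjuncts can be sketched as follows. This is a naive illustration, assuming that an instance is a set of facts `(predicate, tuple_of_terms)`, that a CQ is a pair `(free_vars, atoms)`, and that all arguments of query atoms are variables; the names `eval_cq` and `eval_ucq` are hypothetical.

```python
from itertools import product

def eval_cq(free_vars, atoms, instance):
    """Evaluate a CQ by enumerating all mappings from its variables to the
    terms of the instance and keeping those that are homomorphisms."""
    terms = sorted({t for _, args in instance for t in args})
    variables = sorted({v for _, args in atoms for v in args})
    answers = set()
    for values in product(terms, repeat=len(variables)):
        h = dict(zip(variables, values))
        if all((p, tuple(h[v] for v in args)) in instance for p, args in atoms):
            answers.add(tuple(h[v] for v in free_vars))
    return answers

def eval_ucq(disjuncts, instance):
    """The evaluation of a UCQ is the union of the evaluations of its CQs."""
    result = set()
    for free_vars, atoms in disjuncts:
        result |= eval_cq(free_vars, atoms, instance)
    return result

instance = {("E", ("a", "b")), ("E", ("b", "c")), ("P", ("c",))}
q1 = (("x",), [("E", ("x", "y")), ("E", ("y", "z"))])  # x starts a path of length 2
q2 = (("x",), [("P", ("x",))])                          # x is in P
answers = eval_ucq([q1, q2], instance)
```

For a Boolean query, `free_vars` is the empty tuple, and the result is `{()}` when the query holds and the empty set otherwise. The enumeration is exponential in the number of variables and is meant only to mirror the definition.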

Definition

(UCQ Rewritability) A class of sets of tgds is UCQ rewritable if, for every CQ , and , we can construct a UCQ such that: For every CQ , iff , where is the database obtained from after replacing each variable with .

In other words, UCQ rewritability suggests that query containment can be reduced to the problem of UCQ evaluation. It is important to say that this reduction depends only on the right-hand side CQ and the set of tgds, but not on the left-hand side query. This is crucial for establishing the desirable small query property whenever we focus on sets of tgds that belong to a UCQ rewritable class. At this point, let us clarify that the class of guarded sets of tgds is not UCQ rewritable, which justifies our choice of a different semantic property, that is, acyclicity-preserving chase, for its study.

Let us now show the desirable small query property. For each UCQ rewritable class of sets of tgds, there exists a computable function from the set of pairs consisting of a CQ and a set of tgds in to positive integers such that the following holds: For every CQ , set , and UCQ rewriting of and , the height of , that is, the maximal size of its disjuncts, is at most . The existence of the function follows by the definition of UCQ rewritability. Then, we show the following:

Proposition 15

Let be a UCQ rewritable class, a finite set of tgds, and a CQ. If is semantically acyclic under , then there exists an acyclic CQ , where , such that .

Proof.

Since is semantically acyclic under , there exists an acyclic CQ such that . As is UCQ rewritable, there exists a UCQ such that , which implies that there exists a CQ (one of the disjuncts of ) such that . Clearly, . But is acyclic, and thus Lemma 9 implies the existence of an acyclic CQ , where and , such that . The latter implies that . By hypothesis, , and hence . For the other direction, we first show that (otherwise, is not a UCQ rewriting). Since , we get that . We conclude that , and the claim follows.

It is clear that Proposition 15 provides a decision procedure for whenever is UCQ rewritable, and