Rewriting Ontological Queries into Small Nonrecursive Datalog Programs

(This paper extends a shorter version presented at DL 2011, the 24th International Workshop on Description Logics, mainly by expanding Section LABEL:sec:main. Further extended and/or improved versions will be posted as they arise to arXiv-CoRR at \arxiv)
We consider the setting of ontological database access, where an A-box is given in the form of a relational database and where a Boolean conjunctive query has to be evaluated against this database modulo a T-box formulated in DL-Lite or Linear Datalog. It is well known that the query and the T-box can be rewritten into an equivalent nonrecursive Datalog program that can be directly evaluated over the database. However, for Linear Datalog or for DL-Lite versions that allow for role inclusion, the rewriting methods described so far result in a nonrecursive Datalog program of size exponential in the joint size of the query and the T-box. This gives rise to the interesting question of whether such a rewriting necessarily needs to be of exponential size. In this paper we show that it is actually possible to translate the query and the T-box into a polynomially sized equivalent nonrecursive Datalog program.
Here, the first two inequalities make sure that only allowed relation and tgd numbers are used; the latter inequalities guarantee that, to yield a tuple by a chase rule, only tuples with smaller numbers can be used. (Footnote 4: As the latter constraints are independent of the concrete tgds, we decided to put them here. They could as well be tested in .) The rule uses one further predicate DNum that has not yet been defined. Its purpose is to contain all possible values, that is: . It is (easily) defined by further rules of . Note that this leaves the values for the unconstrained, hence they can carry either domain values or numerical values.
Next, we describe the part of that checks that constitutes an actual chase sequence and the rules of that specify the corresponding auxiliary relations.
The following kinds of conditions have to be checked to ensure that the tuples “guessed” by constitute a chase sequence.
(1) For every , the relation of a tuple has to match the head of its rule .
In the example, e.g., has to be as the head of is an -atom.
(2) Likewise, for each and the relation number of tuple has to be the relation number of the -th atom of .
In the example, e.g., must be , as and the first atom of is an -atom.
(3) If the head of contains an existentially quantified variable, the new null value is represented by the numerical value .
This is illustrated by in the example: the first position of the head of rule 2 has an existentially quantified variable and thus .
(4) If a variable occurs at two different positions in then the corresponding positions in the tuples used to produce carry the same value.
(5) If a variable in the body of also occurs in the head of then the values of the corresponding positions in the body tuple and in are equal.
occurs in position 3 of the second atom of the body of and in position 2 of its head. Therefore, and have to coincide (where the 4 is determined by ).
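The five kinds of conditions can also be read operationally, as a checker for a guessed sequence of tuples. The following Python sketch is only an illustration under stated assumptions and is not the Datalog program of the construction: the names `Atom`, `TGD`, `Step` and `is_chase_sequence`, the convention that rule number 0 marks a database fact, and the small example database and tgds are all hypothetical. As in the text, a fresh null created at step i is represented by the number i.

```python
from typing import NamedTuple


class Atom(NamedTuple):
    rel: str     # relation name
    args: tuple  # variable names


class TGD(NamedTuple):
    body: tuple  # tuple of Atom
    head: Atom


class Step(NamedTuple):
    rel: str     # relation of the produced tuple
    rule: int    # 0 = database fact (assumption), otherwise a key of `tgds`
    body: tuple  # 1-based numbers of the earlier steps used as body tuples
    vals: tuple  # domain values (strings) or numeric nulls (ints)


def is_chase_sequence(db, tgds, steps):
    """Check the five kinds of conditions on a guessed sequence of steps."""
    for i, step in enumerate(steps, start=1):
        if step.rule == 0:
            if (step.rel, step.vals) not in db:
                return False
            continue
        if step.rule not in tgds:
            return False
        tgd = tgds[step.rule]
        # Condition 1: the produced tuple matches the head relation.
        if step.rel != tgd.head.rel or len(step.vals) != len(tgd.head.args):
            return False
        if len(step.body) != len(tgd.body):
            return False
        binding = {}
        for atom, j in zip(tgd.body, step.body):
            if not 1 <= j < i:  # only tuples with smaller numbers may be used
                return False
            used = steps[j - 1]
            # Condition 2: the j-th body tuple matches the j-th body atom.
            if used.rel != atom.rel or len(used.vals) != len(atom.args):
                return False
            for var, val in zip(atom.args, used.vals):
                # Condition 4: repeated body variables carry the same value.
                if binding.setdefault(var, val) != val:
                    return False
        for var, val in zip(tgd.head.args, step.vals):
            # Condition 3: an existential variable gets the null i.
            # Condition 5: a body variable in the head keeps its value.
            if val != binding.get(var, i):
                return False
    return True


# Hypothetical example: R(x,y) -> exists z S(y,z) and R(x,y), S(y,z) -> T(x,z).
db = {("R", ("a", "b"))}
tgds = {
    1: TGD((Atom("R", ("x", "y")),), Atom("S", ("y", "z"))),
    2: TGD((Atom("R", ("x", "y")), Atom("S", ("y", "z"))), Atom("T", ("x", "z"))),
}
good = [
    Step("R", 0, (), ("a", "b")),
    Step("S", 1, (1,), ("b", 2)),
    Step("T", 2, (1, 2), ("a", 2)),
]
bad = good[:2] + [Step("T", 2, (1, 2), ("b", 2))]
```

In this sketch `good` passes all checks, while `bad` violates the last kind of condition because the first head position does not carry the value bound in the body.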
Note that all these conditions depend on the given tgds. Indeed, every tgd from contributes conditions of each of the five forms. For the sake of simplicity of presentation, we explain the effect of a tgd through the following example tgd that contains all relevant features that might arise in a tgd. The generalization to arbitrary tgds is straightforward but tedious to spell out in full detail. Let us thus assume that is the tgd (Footnote 5: This example tgd is not related to our running example, as that does not have a single tgd with all features.)
Condition (1) states that if a tuple is obtained by applying it should be a tuple from . In terms of variables, this means that for every it should hold: if then .
This is the first occasion where we need some way to express a
disjunction in (namely: ). We can
meet this challenge with the help of an additional predicate to be
specified in . More precisely, we let specify a 4-ary
predicate that is intended to contain all tuples
fulfilling the condition: if then .
IfThen can be specified by the following two rules.
Thus, condition (1) can be guaranteed with respect to tgd for all tuples by adding all atoms of the form to .
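To illustrate how an implication can be expressed without disjunction inside a single rule, the following Python sketch materializes a 4-ary IfThen relation over a small finite domain as the union of two rule extensions. The concrete shape of the two rules (one for the case that the first two arguments differ, one for the case that both pairs agree) is an assumption made for this sketch, and the function name `if_then_relation` is hypothetical.

```python
from itertools import product


def if_then_relation(domain):
    """All tuples (a, b, c, d) over `domain` with: if a = b then c = d."""
    dom = list(domain)
    # Assumed rule 1: the premise fails, so c and d are unconstrained.
    rule_premise_fails = {
        (a, b, c, d) for a, b, c, d in product(dom, repeat=4) if a != b
    }
    # Assumed rule 2: premise and conclusion both hold.
    rule_both_hold = {(a, a, c, c) for a, c in product(dom, repeat=2)}
    # A predicate defined by two rules is the union of their extensions.
    return rule_premise_fails | rule_both_hold
```

The disjunction is thus carried by having two rules for the same predicate, which keeps each individual rule purely conjunctive.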
Condition (2) is slightly more complicated. For our example tgd
it says that if a tuple is obtained using
then the first tuple used for the chase step should be an
-tuple. In terms of variables this can be stated as: if
and then (and likewise for the second atom of
). To express this IF-statement we use a 6-ary auxiliary
predicate expressing that if
and then . It can be specified in by the following three rules.
For every pair of numbers , has atoms and .
In a similar fashion
condition (3) yields one atom , for every ;
condition (4) yields one atom , for every , where IfThen3 is the 8-ary predicate for IfThen-statements with three conjuncts that can be defined analogously to IfThen2;
condition (5) yields one atom for every .
Altogether, has atoms that together guarantee that the variables of encode an actual chase sequence.
Finally, we explain how it can be checked that there is a homomorphism from to . We explain the issue through the little example query . To evaluate this query, makes use of two additional variables and , one for each atom of . The intention is that these variables bind to the numbers of the tuples that the atoms are mapped to. We have to ensure two kinds of conditions. First, the tuples need to have the right relation symbol, and second, they have to obey the value equalities induced by the variables of that occur more than once.
The first kind of conditions is checked by adding atoms and to , for every . The second kind of conditions can be checked by atoms , for every .
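The two kinds of conditions for the homomorphism test can be sketched in Python as a brute-force search over tuple numbers, one per query atom. This is only an illustration of the idea, not the Datalog encoding itself: the function name `query_maps_into`, the representation of a chase sequence as a list of `(relation, values)` pairs, and the example data are assumptions.

```python
from itertools import product


def query_maps_into(steps, query):
    """steps: list of (relation, values); query: list of (relation, variables).

    Returns True if every query atom can be mapped to some tuple of the
    sequence such that relation symbols match and repeated query variables
    receive equal values.
    """
    for choice in product(range(len(steps)), repeat=len(query)):
        binding = {}
        ok = True
        for (rel, variables), idx in zip(query, choice):
            step_rel, step_vals = steps[idx]
            # First kind of condition: the right relation symbol.
            if step_rel != rel or len(step_vals) != len(variables):
                ok = False
                break
            # Second kind of condition: equalities induced by shared variables.
            for var, val in zip(variables, step_vals):
                if binding.setdefault(var, val) != val:
                    ok = False
                    break
            if not ok:
                break
        if ok:
            return True
    return False


# Hypothetical chase sequence and two hypothetical two-atom queries.
steps = [("R", ("a", "b")), ("S", ("b", 2)), ("T", ("a", 2))]
q_yes = [("R", ("x", "y")), ("S", ("y", "z"))]
q_no = [("S", ("x", "y")), ("R", ("y", "z"))]
```

The variable `choice` plays the role of the additional variables that bind to tuple numbers, one per query atom.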
As we do not need any further auxiliary predicates, is empty (but we kept it for symmetry reasons).
This completes the description of . Note that is nonrecursive, and has polynomial size in the size of and . In order to finish the proof of part (a) of Theorem 1, we next explain how to reduce the arity of .
This final step of the construction is based on two ideas.
First, by using Boolean variables and some new ternary relations, we can replace the 6-ary relation IfThen2 (and likewise the 4-ary relation IfThen). More precisely, we replace every atom by a conjunction of the form
Here, are predicates that mimic Boolean gates, e.g., holds if is the Boolean Or of and , in particular all values have to be from . only holds if . The predicate holds if and or if and . The relations can easily be defined in .
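The replacement of a 6-ary IfThen2 atom by a conjunction of low-arity gate atoms can be checked on a small domain. The Python sketch below is a hedged illustration: the gate names `eq_gate`, `OR`, `NOT` and `TRUE`, the use of a separate negation gate, and the particular wiring are assumptions and need not coincide with the conjunction used in the construction. The Boolean variables are existentially quantified, which the sketch imitates by trying all 0/1 assignments.

```python
from itertools import product

BOOL = (0, 1)
# OR holds for (b, b1, b2) if b is the Boolean Or of b1 and b2.
OR = {(b1 | b2, b1, b2) for b1, b2 in product(BOOL, repeat=2)}
# NOT holds for (b, b1) if b is the negation of b1 (assumed extra gate).
NOT = {(1 - b, b) for b in BOOL}
# TRUE holds only for the value 1.
TRUE = {(1,)}


def eq_gate(domain):
    """Ternary relation: (x, y, 1) if x = y and (x, y, 0) if x != y."""
    return {(x, y, int(x == y)) for x, y in product(domain, repeat=2)}


def if_then2_via_gates(a, b, c, d, e, f, eq):
    """'If a = b and c = d then e = f', expressed as a conjunction of gates."""
    for b1, b2, b3, n1, n2, o1, o2 in product(BOOL, repeat=7):
        if ((a, b, b1) in eq and (c, d, b2) in eq and (e, f, b3) in eq
                and (n1, b1) in NOT and (n2, b2) in NOT
                and (o1, n1, n2) in OR and (o2, o1, b3) in OR
                and (o2,) in TRUE):
            return True
    return False
```

Every gate relation has arity at most three, so the 6-ary auxiliary predicate is no longer needed in this encoding.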
The second idea is that need not be materialized. We only materialize a relation of arity which is intended to represent all database tuples. More precisely, shall hold if represents a tuple from relation or if . Clearly, can be defined in .
is then replaced by a conjunction of atoms with the same semantics.
The conjunct tests whether
is in . Further atoms ensure that
if . Finally, it is ensured that, if the
values are restricted as by the right-hand side of rule LABEL:eq:1.
In order to prove part (b), we must get rid of the numeric domain (except for 0 and 1). This is actually very easy. We just replace each numeric value by a logarithmic number of bits (coded by our 0 and 1 domain elements), and extend the predicate arities accordingly. As a matter of fact, this requires an increase of arity by a factor of . It is well-known that a successor predicate and a vectorized predicate for such bit-vectors can be expressed by a polynomially-sized nonrecursive Datalog program, see . The rest is completely analogous to the above proof. This concludes the proof sketch for Theorem 1.
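The bit-vector coding can be illustrated by a small Python sketch. It encodes numbers as fixed-width tuples of 0/1 values and defines a successor test and a less-than test with one case per bit position, which mirrors the idea of having one nonrecursive rule per position. The function names `to_bits`, `succ_bits` and `less_bits` are hypothetical, the choice of less-than as the vectorized comparison is an assumption, and both vectors are assumed to have the same width.

```python
def to_bits(n, width):
    """Most-significant-bit-first encoding of n as a tuple of 0/1 values."""
    return tuple((n >> (width - 1 - k)) & 1 for k in range(width))


def succ_bits(u, v):
    """True if v encodes the successor of u (no overflow).

    Case k: u and v agree before position k, u has 0 and v has 1 at
    position k, and afterwards u is all ones while v is all zeros.
    """
    for k in range(len(u)):
        if (u[:k] == v[:k] and u[k] == 0 and v[k] == 1
                and all(bit == 1 for bit in u[k + 1:])
                and all(bit == 0 for bit in v[k + 1:])):
            return True
    return False


def less_bits(u, v):
    """True if u encodes a smaller number than v.

    Case k: u and v agree before position k, and u has 0 where v has 1.
    """
    return any(
        u[:k] == v[:k] and u[k] == 0 and v[k] == 1 for k in range(len(u))
    )
```

Since the number of cases equals the width of the bit-vectors, the number of rules needed in this style grows only logarithmically with the size of the simulated numeric domain.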
We would like to conclude this section with some remarks:
Note that the evaluation complexity of the Datalog program obtained for case (b) is not significantly higher than the evaluation complexity of the program constructed for case (a). For example, in the most relevant case of bounded arities, both programs can be evaluated in NPTIME combined complexity over a database . In fact, it is well-known that the combined complexity of a Datalog program of bounded arity is in NPTIME (see ). But it is easy to see that if we expand the signature of such a program (and of the underlying database) by a logarithmic number of Boolean-valued argument positions (attributes), nothing changes, because the possible values for such vectorized arguments are still of polynomial size. It is just a matter of coding. In a similar way, the data complexity in both cases (a) and (b) is the same (PTIME).
It is easy to generalize this result to the setting where is actually a union of conjunctive queries (UCQ).
The method easily generalizes to translate non-Boolean queries, i.e., queries with output, to polynomially-sized nonrecursive Datalog programs with output. We are here only interested in certain answers consisting of tuples of values from the original domain (see ). Assume that the head of is an atom where is the output relation symbol, and the are variables also occurring in the body of . We then obtain a nonrecursive Datalog translation by acting as in the above proof, except for the following modifications. Make the head of rule , and add for an atom to , where is an auxiliary predicate such that holds iff is in the active non-numeric domain of the database, that is, iff and effectively occurs in the database. It is easy to see that the auxiliary predicate itself can be defined via a nonrecursive Datalog program from . Clearly, by construction of (the so modified) program , the output of is then precisely the set of certain answers of the query .
The polynomially-sized nonrecursive Datalog program constructed in the proof of Theorem 1 can in turn be transformed in polynomial time into an equivalent first-order formula. In case of -numerical databases this follows immediately from the constant depth of (the predicate dependency graph of) . Moreover, in case of non-numerical domains with two distinguished constants, the simulation of a numerical domain via bit-vectors can be easily expressed by a polynomially sized first-order formula. In summary, Theorem 1 remains valid if we replace “nonrecursive Datalog program” by “first-order formula”. However, for practical purposes nonrecursive Datalog may be the better choice, because the auxiliary relations that need to be computed only once are already factored out explicitly.
4 Further Results Derived From the Main Theorem
We wish to mention some interesting consequences of Theorem 1 that follow easily from the above result after combining it with various other known results.
4.1 Linear TGDs
A linear tgd is one that has a single atom in its rule body. The class of linear tgds is a fundamental one in the Datalog family. This class contains the class of inclusion dependencies. It was already shown in that classes of inclusion dependencies of bounded (predicate) arity enjoy the Polynomial Witness Property (PWP). That proof carries over to linear tgds, and we thus can state:
Classes of linear tgds of bounded arity enjoy the PWP.
By Theorem 1, we then conclude:
Conjunctive queries under linear tgds of bounded arity are polynomially rewritable as nonrecursive Datalog programs in the same fashion as for Theorem 1. The same holds for conjunctive queries under sets of inclusion dependencies of bounded arity.
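Linearity is a purely syntactic property and is easy to test. The following Python sketch assumes a hypothetical representation of a tgd as a pair `(body_atoms, head_atom)`, where each atom is a pair `(relation, variables)`. It also assumes the usual reading of an inclusion dependency as a linear tgd without repeated variables in the body atom or in the head atom.

```python
def is_linear(tgd):
    """A tgd is linear if its body consists of a single atom."""
    body, _head = tgd
    return len(body) == 1


def is_inclusion_dependency(tgd):
    """Linear tgd whose body atom and head atom repeat no variable (assumed reading)."""
    body, head = tgd
    if len(body) != 1:
        return False
    body_args = body[0][1]
    head_args = head[1]
    return (len(set(body_args)) == len(body_args)
            and len(set(head_args)) == len(head_args))


# Hypothetical examples.
linear = ((("R", ("x", "y")),), ("S", ("y", "z")))
joined = ((("R", ("x", "y")), ("S", ("y", "z"))), ("T", ("x", "z")))
repeated = ((("R", ("x", "x")),), ("S", ("x", "z")))
```

Here `linear` is both linear and an inclusion dependency, `repeated` is linear but not an inclusion dependency under the assumed reading, and `joined` is neither.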
4.2 DL-Lite
A pioneering and highly significant contribution towards tractable ontological reasoning was the introduction of the DL-Lite family of description logics (DLs) by Calvanese et al. [9, 20]. DL-Lite was further studied and developed in .
A DL-Lite theory (or TBox) consists of a set of negative constraints such as key and disjointness constraints, and of a set of positive constraints that resemble tgds. As shown in , the negative constraints can be compiled into a polynomially sized first-order formula (actually a union of conjunctive queries) of the same arity as such that for each database and BCQ , iff and . In (the full version of) it was shown that for the main DL-Lite variants defined in , each can be immediately translated into an equivalent set of linear tgds of arity 2. By virtue of this, and the above we obtain the following theorem.
Let be a CQ and let be a DL-Lite theory expressed in one of the following DL-Lite variants: DL-Lite, DL-Lite, DL-Lite, DLR-Lite, DLR-Lite, or DLR-Lite. Then can be rewritten into a nonrecursive Datalog program such that for each database , iff . Regarding the arities of , the same bounds as in Theorem 1 hold.
4.3 Sticky and Sticky Join TGDs
Sticky tgds and sticky-join tgds are special classes of tgds that generalize linear tgds but allow for a limited form of join (including as special case the Cartesian product). They allow one to express natural ontological relationships not expressible in DLs such as OWL. We do not define these classes here, and refer the reader to . By results of , which will also be discussed in more detail in a future extended version of the present paper, both classes enjoy the Polynomial Witness Property. By Theorem 1, we thus obtain the following result:
Conjunctive queries under sticky tgds and sticky-join tgds over a fixed signature are rewritable into polynomially sized nonrecursive Datalog programs of arity bounded as in Theorem 1.
5 Related Work on Query Rewriting
Several techniques for query rewriting have been developed. An early algorithm, introduced in and implemented in the QuOnto system (footnote 6: http://www.dis.uniroma1.it/ quonto/), reformulates the given query into a union of CQs (UCQ) by means of a backward-chaining resolution procedure. The size of the computed rewriting increases exponentially w.r.t. the number of atoms in the given query. This is mainly due to the fact that unifications are derived in a “blind” way from every unifiable pair of atoms, even if the generated rule is superfluous. An alternative resolution-based rewriting technique was proposed by Pérez-Urbina et al. and implemented in the Requiem system (footnote 7: http://www.comlab.ox.ac.uk/projects/requiem/home.html). It produces a UCQ as a rewriting which is, in general, smaller (but still exponential in the number of atoms of the query) than the one computed by QuOnto. This is achieved by avoiding many useless unifications, and thus the generation of redundant rules due to such unifications. This algorithm also works for more expressive non-first-order rewritable DLs. In this case, the computed rewriting is a (recursive) Datalog query. Following a more general approach, Calì et al. proposed a backward-chaining rewriting algorithm for the first-order rewritable Datalog languages mentioned above. However, this algorithm is inspired by the original QuOnto algorithm, and inherits all its drawbacks. In , a rewriting technique for linear Datalog into unions of conjunctive queries is proposed. This algorithm is an improved version of the one already presented in , where further superfluous unifications are avoided, and where, in addition, redundant atoms in the body of a rule, that is, atoms that are logically implied (w.r.t. the ontological theory) by other atoms in the same rule, are eliminated. This elimination of body atoms avoids the construction of redundant rules during the rewriting process.
However, the size of the rewriting is still exponential in the number of query atoms.
Of more interest to the present work are rewritings into nonrecursive Datalog. In [15, 16] a polynomial-size rewriting into nonrecursive Datalog is given for the description logics DL-Lite and DL-Lite. For DL-Lite, a DL with counting, a polynomial rewriting involving aggregate functions is proposed. It is, moreover, shown in (the full version of) that for the description logic DL-Lite a polynomial-size pure first-order query rewriting is possible. Note that neither of these logics allows for role inclusion, while our approach covers description logics with role inclusion axioms. Other results in [15, 16] are about combined rewritings where both the query and the database have to be rewritten. A recent very interesting paper discussing polynomial size rewritings is . Among other results, provides complexity-theoretic arguments indicating that without the use of special constants (e.g., 0 and 1, or the numerical domain), a polynomial rewriting such as ours may not be possible. Rosati et al. recently proposed a very sophisticated rewriting technique into nonrecursive Datalog, implemented in the Presto system. This algorithm produces a nonrecursive Datalog program as a rewriting, instead of a UCQ. This allows the “hiding” of the exponential blow-up inside the rules instead of generating explicitly the disjunctive normal form. The size of the final rewriting is, however, exponential in the number of non-eliminable existential join variables of the given query; such variables are a subset of the join variables of the query, and are typically fewer than the atoms in the query. Thus, the size of the rewriting is exponential in the query size in the worst case. Relevant further optimizations of this method are given in .
Acknowledgment G. Gottlob’s work was funded by the EPSRC Grant EP/H051511/1 ExODA: Integrating Description Logics and Database Technologies for Expressive Ontology-Based Data Access. We thank the anonymous referees, as well as Roman Kontchakov, Carsten Lutz, and Michael Zakharyaschev for useful comments on an earlier version of this paper.
- [1] Alessandro Artale, Diego Calvanese, Roman Kontchakov, and Michael Zakharyaschev, The DL-Lite family and relations, J. Artif. Intell. Res. (JAIR) 36 (2009), 1–69.
- [2] Catriel Beeri and Moshe Y. Vardi, The implication problem for data dependencies, Proc. of ICALP, 1981, pp. 73–85.
- [3] A. Calì, G. Gottlob, and A. Pieris, Query rewriting under non-guarded rules, Proc. AMW, 2010.
- [4] Andrea Calì, Georg Gottlob, and Michael Kifer, Taming the infinite chase: Query answering under expressive relational constraints, Proc. of KR, 2008, pp. 70–80.
- [5] Andrea Calì, Georg Gottlob, and Thomas Lukasiewicz, A general Datalog-based framework for tractable query answering over ontologies, Proc. of PODS, 2009, pp. 77–86.
- [6] Andrea Calì, Georg Gottlob, and Andreas Pieris, Advanced processing for ontological queries, PVLDB 3 (2010), no. 1, 554–565.
- [7] Andrea Calì, Georg Gottlob, and Andreas Pieris, Query answering under non-guarded rules in Datalog+/-, Proc. of RR, 2010, pp. 175–190.
- [8] Andrea Calì, Georg Gottlob, and Andreas Pieris, Towards more expressive ontology languages: The query answering problem, Tech. report, University of Oxford, Department of Computer Science, 2011, Submitted for publication - available from the authors.
- [9] Diego Calvanese, Giuseppe De Giacomo, Domenico Lembo, Maurizio Lenzerini, and Riccardo Rosati, Tractable reasoning and efficient query answering in description logics: The DL-Lite family, J. Autom. Reasoning 39 (2007), no. 3, 385–429.
- [10] Evgeny Dantsin, Thomas Eiter, Georg Gottlob, and Andrei Voronkov, Complexity and expressive power of logic programming, ACM Comput. Surv. 33 (2001), no. 3, 374–425.
- [11] Alin Deutsch, Alan Nash, and Jeff B. Remmel, The chase revisited, Proc. of PODS, 2008, pp. 149–158.
- [12] Ronald Fagin, Phokion G. Kolaitis, Renée J. Miller, and Lucian Popa, Data exchange: Semantics and query answering, Theor. Comput. Sci. 336 (2005), no. 1, 89–124.
- [13] Georg Gottlob, Giorgio Orsi, and Andreas Pieris, Ontological queries: Rewriting and optimization, Proc. of ICDE, 2011.
- [14] David S. Johnson and Anthony C. Klug, Testing containment of conjunctive queries under functional and inclusion dependencies, J. Comput. Syst. Sci. 28 (1984), no. 1, 167–189.
- [15] Roman Kontchakov, Carsten Lutz, David Toman, Frank Wolter, and Michael Zakharyaschev, The combined approach to query answering in DL-Lite, KR (Fangzhen Lin, Ulrike Sattler, and Miroslaw Truszczynski, eds.), AAAI Press, 2010.
- [16] Roman Kontchakov, Carsten Lutz, David Toman, Frank Wolter, and Michael Zakharyaschev, The combined approach to ontology-based data access, IJCAI, 2011.
- [17] David Maier, Alberto O. Mendelzon, and Yehoshua Sagiv, Testing implications of data dependencies, ACM Trans. Database Syst. 4 (1979), no. 4, 455–469.
- [18] Giorgio Orsi and Andreas Pieris, Optimizing query answering under ontological constraints, PVLDB, 2011, to appear.
- [19] H. Pérez-Urbina, B. Motik, and I. Horrocks, Tractable query answering and rewriting under description logic constraints, Journal of Applied Logic 8 (2009), no. 2, 151–232.
- [20] Antonella Poggi, Domenico Lembo, Diego Calvanese, Giuseppe De Giacomo, Maurizio Lenzerini, and Riccardo Rosati, Linking data to ontologies, J. Data Semantics 10 (2008), 133–173.
- [21] R. Rosati and A. Almatelli, Improving query answering over DL-Lite ontologies, Proc. KR, 2010.
- [22] R. Kontchakov, S. Kikot, Carsten Lutz, and M. Zakharyaschev, On (In)Tractability of OBDA with OWL2QL, Proc. DL, 2011.