Relaxing and Restraining Queries for OBDA

Medina Andreşel    Yazmín Ibáñez-García    Magdalena Ortiz    Mantas Šimkus
{andresel,ibanez,ortiz}@kr.tuwien.ac.at simkus@dbai.tuwien.ac.at
Faculty of Informatics, TU Wien, Austria
Abstract

In ontology-based data access (OBDA), ontologies have been successfully employed for querying possibly unstructured and incomplete data. In this paper, we advocate using ontologies not only to formulate queries and compute their answers, but also to modify queries by relaxing or restraining them, so that they retrieve either more or fewer answers over a given dataset. Towards this goal, we first illustrate that some domain knowledge that could be naturally leveraged in OBDA can be expressed using complex role inclusions (CRI). Queries over ontologies with CRI are not first-order (FO) rewritable in general. We propose an extension of DL-Lite with CRI, and show that conjunctive queries over ontologies in this extension are FO rewritable. Our main contribution is a set of rules to relax and restrain conjunctive queries (CQs). Firstly, we define rules that use the ontology to produce CQs that are relaxations/restrictions over any dataset. Secondly, we introduce a set of data-driven rules that leverage patterns in the current dataset to obtain more fine-grained relaxations and restrictions.

Acknowledgments

This research was funded by FWF Projects P30360 and W1255-N23.

1 Introduction

Ontology-based data access (OBDA) is one of the most successful use cases of description logic (DL) ontologies. The core idea in OBDA is to use an ontology to provide a conceptual view of a collection of data sources, thus abstracting away from the specific way data is stored. The role of the ontology in this setting is to describe the domain of interest at a high level of abstraction. This allows users to formulate queries over the data sources using a familiar controlled vocabulary. Further, knowledge represented in the ontology can be leveraged to retrieve more complete answers. For example, consider the dataset in Figure 2, which includes information on some cultural events and their locations, and the ontology in Figure 1, which captures additional information, e.g., the knowledge that both concerts and exhibitions are cultural events. By posing the query

one can retrieve all cultural events, regardless of whether they are stored as concerts, exhibitions, or cultural events of unspecified type.

In the OBDA paradigm, an ontology can be linked to a collection of heterogeneous data sources by defining mappings from the data to the vocabulary used in the ontology [PLC08]. This makes it possible to integrate, e.g., data from relational databases and unstructured datasets. In this framework, an ontology acts as a mediator between the query and a set of heterogeneous data sources. Description logics of the so-called DL-Lite family have been particularly tailored for OBDA [CDL07]. One crucial property of DL-Lite is that queries mediated by such ontologies are first-order (FO) rewritable. In a nutshell, this means that evaluating a query over a dataset using knowledge from an ontology can be reduced to evaluating a rewritten query, which incorporates the relevant knowledge from the ontology, directly over the dataset. This amounts to query evaluation in relational databases. In our example, a rewriting of is

Figure 1: Event ontology .

Figure 2: Event dataset .
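The reduction of ontology-mediated query answering to plain query evaluation can be illustrated with a minimal Python sketch. The concept names `CulturalEvent`, `Concert` and `Exhibition` follow the informal description of the event ontology; the individuals `e1`, `e2`, `e3` and the helpers `rewrite_atomic` and `evaluate_union` are hypothetical and only serve the illustration. The sketch handles a single atomic query and concept-name inclusions only, not the general rewriting.

```python
# Hypothetical concept inclusions, written as (sub-concept, super-concept).
INCLUSIONS = {("Concert", "CulturalEvent"), ("Exhibition", "CulturalEvent")}

# Hypothetical dataset: concept name -> individuals explicitly stored under it.
DATASET = {
    "Concert": {"e1"},
    "Exhibition": {"e2"},
    "CulturalEvent": {"e3"},
}


def rewrite_atomic(concept, inclusions):
    """Collect `concept` and every concept name that is (transitively) included in it."""
    result = {concept}
    frontier = [concept]
    while frontier:
        current = frontier.pop()
        for sub, sup in inclusions:
            if sup == current and sub not in result:
                result.add(sub)
                frontier.append(sub)
    return result


def evaluate_union(concepts, dataset):
    """Evaluate the union of atomic queries directly over the stored data."""
    found = set()
    for concept in concepts:
        found |= dataset.get(concept, set())
    return found


rewriting = rewrite_atomic("CulturalEvent", INCLUSIONS)
answers = evaluate_union(rewriting, DATASET)
```

Evaluating only the original atom over `DATASET` would return the single stored cultural event, whereas the union obtained from the rewriting also covers the individuals stored as concerts or exhibitions.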

We investigate the use of ontologies not only as query mediators, but also for query reformulation: by modifying queries in order to relax them and retrieve more answers, or restrain them and retrieve fewer answers. These reformulations can be used to explore a given dataset, or to modify queries to fit the information needs of a user. For example, if answers to a query for concerts are too scarce or nonexistent, then relaxing the query to find all cultural events may yield more answers. Conversely, if a query for cultural events produces too many answers, it is possible to restrict this query to events of a specific type (for instance, concerts).

In our example, the query that specializes occurs as a disjunct in its rewriting. A key observation within our approach is that query rewriting rules for DL-Lite (such as the ones from  [CDL07]) yield query specializations, and that counterparts of these rules produce query generalizations. However, there are intuitive query answers and query reformulations that cannot be produced by these rewriting rules. For example, consider the following query retrieving concerts occurring in Vienna.

This query does not return any answers when evaluated over the data from w.r.t.  . However, may be considered an answer to this query, by following the intuition that if an event occurs in a venue located in a city, then it occurs in that city. Still, this knowledge cannot be expressed in DL-Lite.

To obtain reformulations of this kind, we propose to extend DL-Lite with so-called complex role inclusions (CRI). In our example we could add the following:

This axiom captures the intuition above, and would allow us to retrieve . Moreover, we could also use it to generate some interesting query reformulations. For instance, the query

could be specialized to

which specializes from all concerts known to occur in a city, to only those for which a more specific location within a city is known.
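The effect of such a complex role inclusion on query answers can be sketched in Python. The sketch assumes the informal reading given above (an event that occurs in a location which is itself located in a larger location also occurs in the larger one); the role names `occurs_in` and `located_in`, the individual `c1` and the function `saturate` are hypothetical names chosen for the illustration.

```python
def saturate(occurs_in, located_in):
    """Close occurs_in under: occurs_in(x, y) and located_in(y, z) imply occurs_in(x, z)."""
    derived = set(occurs_in)
    changed = True
    while changed:
        changed = False
        for (x, y) in list(derived):
            for (y2, z) in located_in:
                if y == y2 and (x, z) not in derived:
                    derived.add((x, z))
                    changed = True
    return derived


# Hypothetical data: one concert stored with its venue only.
concerts = {"c1"}
occurs_in = {("c1", "StateOpera")}
located_in = {("StateOpera", "Vienna"), ("Vienna", "Austria")}

closure = saturate(occurs_in, located_in)
concerts_in_vienna = {x for x in concerts if (x, "Vienna") in closure}
```

Without the inclusion, the query for concerts in Vienna has no answer over this data; with it, `c1` is retrieved. The loop is a naive fixpoint computation and is meant only to show the semantics, not an efficient procedure.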

Unfortunately, adding CRIs to DL-Lite increases the worst case data complexity of query answering, which means that queries are no longer FO-rewritable. We propose two extensions of DL-Lite with CRIs for which queries remain FO-rewritable. The first extension imposes some acyclicity conditions between the roles that occur on the right-hand-side of CRIs. This extension, however, would not be sufficient to capture our example above, where appears on both sides of the inclusion.

A more expressive extension of DL-Lite allowing recursive role inclusions can be defined based on the observation that chains of some roles have bounded length. In our example, we note that concepts occurring along chains of the role can be ordered in the sense that edges can only connect ‘smaller’ locations to ‘larger’ ones: from venues to cities, from cities to countries, etc. Based on this observation we propose yet another extension of DL-Lite allowing recursion along ordered bounded concept chains. We then propose reformulation rules for relaxing and specializing queries using ontologies in this extension. The resulting rules make it possible to generalize and specialize queries by “moving” not only along the subclass and subrole relations, but also along dimensions defined by the ordered concepts (in our example, along the different kinds of locations).

Using ontologies expressed in the proposed extension of DL-Lite, it is possible to reformulate queries along dimensions expressed at the intensional level. However, there are some intuitive reformulations that cannot be obtained on the basis of an ontology alone. Let us illustrate this in our example. Recall the query asking for concerts occurring in Vienna. It could be specialized, for instance, to concerts in some venue in Vienna, like the State Opera, or generalized to all concerts in Austria. This can only be done by taking into consideration the dataset at hand (that is, the extensional knowledge).

To capture this intuition, we propose rules considering instances of concepts and relations, as well as inclusions between concepts that are not necessarily implied by the TBox, but that can be guaranteed to hold in the current dataset. Applying the resulting rules to produces the following reformulations:

Note that these reformulations are not data independent, but instead, refer to .
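The flavour of such data-dependent reformulations can be sketched as follows. This is not the set of rules defined in the paper, only a rough illustration under simplifying assumptions: queries are lists of atoms, the dataset stores a hypothetical `located_in` relation, and the venue `Konzerthaus` as well as the functions `drill_down`, `roll_up` and `reformulate` are invented for the example.

```python
# Hypothetical location facts taken from the dataset.
LOCATED_IN = {
    ("StateOpera", "Vienna"),
    ("Konzerthaus", "Vienna"),
    ("Vienna", "Austria"),
}


def drill_down(location, located_in):
    """Locations that the data places directly inside `location`."""
    return {inner for (inner, outer) in located_in if outer == location}


def roll_up(location, located_in):
    """Locations that the data places directly around `location`."""
    return {outer for (inner, outer) in located_in if inner == location}


def reformulate(query, old, new_constants):
    """Produce one query per candidate constant, replacing `old` in every atom."""
    return [
        [(pred, tuple(new if t == old else t for t in args)) for (pred, args) in query]
        for new in sorted(new_constants)
    ]


query = [("Concert", ("x",)), ("occursIn", ("x", "Vienna"))]
specializations = reformulate(query, "Vienna", drill_down("Vienna", LOCATED_IN))
generalizations = reformulate(query, "Vienna", roll_up("Vienna", LOCATED_IN))
```

Here the candidate constants come from the data rather than from the ontology, which is why the resulting queries are meaningful only with respect to the dataset at hand.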

The proposed ontology and data-driven reformulations can aid users to explore heterogeneous, unstructured and incomplete datasets in the same spirit as online analytical processing (OLAP) supports the exploration of structured data [CCS93]. For that purpose, we illustrate how our extension of DL-Lite can describe dimensional knowledge, analogous to the multi-dimensional data model considered in OLAP. We also exemplify how our rules for relaxing and restraining queries can be applied in a way that closely resembles the so-called ‘rolling up’ and ‘drilling down’ along dimensions.

2 Preliminaries

We start by introducing the syntax and semantics of  [CDL06]. We assume an alphabet consisting of countably infinite sets ,, of concept, role, and individual names, respectively. expressions are constructed according to the following grammar:

where , . Concepts of the form are called basic concepts, and roles of the form are called inverse roles. A TBox (or ontology) is a finite set of axioms of the form

An ABox (or dataset) is a finite set of assertions of the forms , and , with , , and . A knowledge base (KB) is a pair .

The semantics is defined as usual in terms of interpretations. An interpretation consists of a non-empty domain and an interpretation function assigning to every concept name a set , and to every role name a binary relation . The interpretation of more complex concepts and roles is defined as follows:

Further, each individual name in is interpreted as an element , such that for every (i.e., we adopt the standard name assumption).

An interpretation satisfies an axiom of the form if , an axiom of the form if , an assertion if , and an assertion if . Finally, is a model of a KB , denoted , if satisfies every axiom in , and every assertion in . An ABox is consistent with a TBox if there exists a model of the KB .
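For a finite interpretation, these satisfaction conditions can be checked mechanically. The following Python sketch assumes the standard reading of basic concepts (a concept name, or an existential restriction over a role or an inverse role) and of positive and negative inclusions; the tuple encodings `('exists', role)` and `('inv', name)`, the axiom tags and all names in the example are assumptions made for the illustration.

```python
def role_ext(role, roles):
    """Extension of a role name, or of an inverse role encoded as ('inv', name)."""
    if isinstance(role, tuple) and role[0] == "inv":
        return {(b, a) for (a, b) in roles.get(role[1], set())}
    return roles.get(role, set())


def concept_ext(concept, concepts, roles):
    """Extension of a concept name, or of an existential encoded as ('exists', role)."""
    if isinstance(concept, tuple) and concept[0] == "exists":
        return {a for (a, _) in role_ext(concept[1], roles)}
    return concepts.get(concept, set())


def satisfies(axiom, concepts, roles):
    """Check one axiom, given as (kind, left, right), against a finite interpretation."""
    kind, left, right = axiom
    if kind == "concept_incl":
        return concept_ext(left, concepts, roles) <= concept_ext(right, concepts, roles)
    if kind == "concept_disj":
        return not (concept_ext(left, concepts, roles) & concept_ext(right, concepts, roles))
    if kind == "role_incl":
        return role_ext(left, roles) <= role_ext(right, roles)
    if kind == "role_disj":
        return not (role_ext(left, roles) & role_ext(right, roles))
    raise ValueError(f"unknown axiom kind: {kind}")


# Hypothetical finite interpretation.
concepts = {"Concert": {"c1"}, "CulturalEvent": {"c1"}, "Location": {"v1"}}
roles = {"occursIn": {("c1", "v1")}}

range_ok = satisfies(("concept_incl", ("exists", ("inv", "occursIn")), "Location"), concepts, roles)
domain_bad = satisfies(("concept_incl", ("exists", "occursIn"), "Location"), concepts, roles)
```

An interpretation is then a model of a TBox exactly when `satisfies` returns true for every axiom; checking the ABox assertions amounts to simple membership tests in the same dictionaries.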

Example 1 (Event KB).

The ontology in Figure 1 is formalized into the TBox :

The dataset in Figure 2 together with form a KB, which we denote as event KB .  

Normal form.

W.l.o.g., in the rest of this paper we will consider TBoxes in normal form. In particular, we assume that all axioms in a TBox have one of the following forms: (i) , (ii) , (iii) , (iv) , (v) , (vi) , and (vii) , where and . We note that by using (linearly many) fresh symbols, a general TBox can be transformed into a TBox in normal form so that the models are preserved up to the original signature.

Queries.

We consider the class of conjunctive queries and unions thereof. A term is either an individual name or a variable. A conjunctive query is a first order formula with free variables that takes the form , with a conjunction of atoms of the form

where , , and range over terms. The set of terms occurring in a query is denoted . The free variables of a query are called the answer variables. We use the notation to make explicit reference to the answer variables of . The arity of is defined as the length of , denoted . Queries of arity 0 are called Boolean. We sometimes omit the existential variables and use

to denote a query . Further, when operating on queries, it will be convenient to identify a CQ with the set of atoms occurring in . We also denote to be the set of all variables occurring in .

Let be an interpretation, a CQ and a tuple from of length , we call an answer to in and write if there is a map

such that

,

for each individual ,

for each atom in , and

for each atom in . The map is called a match for in . We denote to be the set of all answers to in .

Let be a KB. A tuple of individuals from with is a certain answer of over wrt. if for all models of ; denotes the set of certain answers of over wrt. . For queries and , we write if .
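The notion of a match can be made concrete for a single finite interpretation with a brute-force Python sketch. It enumerates all assignments of domain elements to the variables of a CQ and keeps those that map every atom into the interpretation. Note that it computes answers in one given interpretation, not certain answers, which quantify over all models. The encoding of atoms as `(predicate, arguments)` pairs and all names in the example are assumptions.

```python
from itertools import product


def answers(answer_vars, atoms, domain, concepts, roles, individuals):
    """Answers to a CQ in one finite interpretation; terms not in `individuals` are variables."""
    terms = {t for (_, args) in atoms for t in args} | set(answer_vars)
    variables = sorted(t for t in terms if t not in individuals)
    result = set()
    for values in product(sorted(domain), repeat=len(variables)):
        match = dict(zip(variables, values))
        # Individuals are interpreted as themselves (standard names).
        match.update({a: a for a in individuals})
        ok = True
        for pred, args in atoms:
            image = tuple(match[t] for t in args)
            if len(image) == 1:
                ok = image[0] in concepts.get(pred, set())
            else:
                ok = image in roles.get(pred, set())
            if not ok:
                break
        if ok:
            result.add(tuple(match[v] for v in answer_vars))
    return result


# Hypothetical interpretation and queries.
domain = {"c1", "c2", "v1"}
concepts = {"Concert": {"c1", "c2"}}
roles = {"occursIn": {("c1", "v1")}}

unary = answers(("x",), [("Concert", ("x",)), ("occursIn", ("x", "y"))], domain, concepts, roles, set())
boolean = answers((), [("occursIn", ("x", "v1"))], domain, concepts, roles, {"v1"})
```

For a Boolean query the result is either the empty set or the set containing the empty tuple, matching the convention that a Boolean query is either entailed or not.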

3 DL-Lite with Complex Role Inclusions

In this section we study the restrictions required to add CRIs to DL-Lite in order to preserve its nice computational properties. In particular, we are interested in ensuring FO-rewritability, as well as a polynomial rewriting in the case of CQs. For this goal, a first restriction is to assume a set of simple roles closed w.r.t. inverses (i.e. implies ); for each , and are non-simple roles. We then define the extension of with CRIs as follows:

Definition 1 (CRIs, ).

A complex role inclusion (CRI) is an expression of the form , with .

A TBox is a TBox that may also contain CRIs such that:

  • For every CRI , is simple and is non-simple.

  • If and , then .

An interpretation satisfies a CRI if for all , , imply .
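Over a finite interpretation, this condition says that the composition of the two role extensions on the left is contained in the extension of the role on the right. A minimal Python sketch, with hypothetical function and role names:

```python
def compose(r, s):
    """Relational composition: pairs (x, z) with (x, y) in r and (y, z) in s for some y."""
    return {(x, z) for (x, y) in r for (y2, z) in s if y == y2}


def satisfies_cri(r, s, t):
    """True if every pair connected by r followed by s is also connected by t."""
    return compose(r, s) <= t


# Hypothetical extensions of three roles.
occurs_in = {("c1", "StateOpera")}
located_in = {("StateOpera", "Vienna")}

violated = satisfies_cri(occurs_in, located_in, occurs_in)
repaired = satisfies_cri(occurs_in | {("c1", "Vienna")}, located_in, occurs_in | {("c1", "Vienna")})
```

In the first call the pair `("c1", "Vienna")` is required by the inclusion but missing; adding it yields an interpretation that satisfies the inclusion.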

CRIs are a powerful extension of DLs, but unfortunately, their addition has a major effect on the complexity of reasoning, and syntactic conditions such as regularity [Kaz10] are often needed to preserve decidability. In the case of DL-Lite, even a single fixed CRI destroys first-order rewritability, since it can easily be used to capture reachability along the edges of a given graph.

Lemma 1.

[ACKZ09] Instance checking in is NLogSpace-hard in data complexity, already for TBoxes consisting of the CRI only.

3.1 Non recursive .

To identify FO-rewritable fragments of it is natural to disallow by restricting cyclic dependencies between roles occurring in CRIs.

Definition 2 ( TBoxes).

For a TBox , the recursion graph of is the directed graph containing a node for each concept name , and a node for each role name occurring in and for each:

  • , there exists an edge from to ;

  • , there exists an edge from to ;

  • , there exists an edge from to ;

  • , there exists an edge from to ;

  • , there exists an edge from to ;

  • , there exists an edge from to and to .

A role name is recursive in if participates in a cycle in the recursion graph of , and is recursive in if is.

A TBox is a TBox where no CRIs are recursive.
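The recursion test can be sketched in Python under simplifying assumptions. Each axiom is abstracted to the predicate names on its left-hand side and the single predicate name on its right-hand side, and an edge is drawn from the right-hand symbol to every left-hand symbol; this direction is an assumption, and it does not affect which symbols lie on cycles. A CRI is treated here as recursive when the role on its right-hand side lies on a cycle. All function names and example axioms are hypothetical.

```python
def recursion_graph(axioms):
    """Build a graph with an edge from the right-hand symbol to each left-hand symbol."""
    graph = {}
    for lhs, rhs in axioms:
        graph.setdefault(rhs, set())
        for symbol in lhs:
            graph.setdefault(symbol, set())
            graph[rhs].add(symbol)
    return graph


def on_cycle(graph, start):
    """True if `start` can reach itself along at least one edge."""
    stack = list(graph[start])
    seen = set()
    while stack:
        node = stack.pop()
        if node == start:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(graph[node])
    return False


def recursive_cris(axioms):
    """Return the CRIs (axioms with two left-hand roles) whose right-hand role is on a cycle."""
    graph = recursion_graph(axioms)
    return [(lhs, rhs) for (lhs, rhs) in axioms if len(lhs) == 2 and on_cycle(graph, rhs)]


recursive_tbox = [(("occursIn", "locatedIn"), "occursIn"), (("Concert",), "CulturalEvent")]
acyclic_tbox = [(("takesPlaceAt", "locatedIn"), "occursIn"), (("Concert",), "CulturalEvent")]
```

Under these assumptions, `recursive_tbox` is rejected because `occursIn` depends on itself, while `acyclic_tbox` passes the test.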

Restricting CRIs to be non-recursive indeed guarantees FO-rewritability.

For a CQ , we denote by an arbitrary but fixed variable not occurring in ; we will use such a variable in the query rewriting rules throughout the rest of the paper. Additionally, we write if either or .

Definition 3.

Let be a TBox. Given a pair of CQs, we write whenever is obtained by applying an atom substitution or a variable substitution on , where and are as follows:

  1. if and , then ;

  2. if , and is a non-answer variable occurring only once in , then ;

  3. if and , then ;

  4. if and , then ;

  5. if and , then ;

  6. if and , then ;

  7. if , then .

We write if there is a finite sequence of CQs such that , and for all .

By applying exhaustively, we obtain an FO-rewriting of a given query .

Definition 4.

The rewriting of wrt. is .

For any CQ , is a finite query that can be effectively computed.
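Exhaustive rule application can be sketched in Python for a strongly simplified rule set. The sketch covers only two kinds of steps: replacing an atom by a more specific one using an inclusion between names of the same arity, and replacing a role atom by two role atoms joined through a fresh variable using a CRI. Rules for existential axioms, inverse roles and variable substitution are omitted, and termination relies on the inclusions being non-recursive. The role names `takesPlaceAt`, `locatedIn` and `occursIn`, the fresh-variable scheme `z0, z1, ...` and all function names are assumptions.

```python
def fresh_variable(query):
    """Return the first name z0, z1, ... that does not occur in the query."""
    used = {t for (_, args) in query for t in args}
    n = 0
    while f"z{n}" in used:
        n += 1
    return f"z{n}"


def one_step(query, inclusions, cris):
    """Yield every query obtained from `query` by a single atom replacement."""
    for atom in query:
        pred, args = atom
        rest = query - {atom}
        for sub, sup in inclusions:
            if sup == pred:
                yield rest | {(sub, args)}
        if len(args) == 2:
            for (r, s), t in cris:
                if t == pred:
                    z = fresh_variable(query)
                    yield rest | {(r, (args[0], z)), (s, (z, args[1]))}


def rewrite(query, inclusions, cris):
    """Collect all queries reachable by repeated replacement steps (non-recursive input assumed)."""
    seen = {query}
    frontier = [query]
    while frontier:
        current = frontier.pop()
        for nxt in one_step(current, inclusions, cris):
            nxt = frozenset(nxt)
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return seen


inclusions = {("Concert", "CulturalEvent")}
cris = {(("takesPlaceAt", "locatedIn"), "occursIn")}
query = frozenset({("CulturalEvent", ("x",)), ("occursIn", ("x", "Vienna"))})
rewriting = rewrite(query, inclusions, cris)
```

The resulting set contains the original query, the variant asking for concerts, the variant in which the location atom is split through the fresh variable `z0`, and the query combining both replacements. Their disjunction plays the role of the rewriting in this simplified setting.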

Lemma 2.

Let be a TBox and let be a CQ. Each is polynomially bounded in the size of and , and can be obtained in a polynomial number of steps.

Proof.

Due to the non-recursiveness of the dependency graph and the restriction on simple roles, we show that we can assign to queries a (suitably bounded) degree that roughly corresponds to the number of rewriting steps that can be further applied. We prove that for each such that , the degree does not increase, and after polynomially many steps we will reach such that the degree strictly decreases.

We first define as the acyclic version of the recursion graph of in which nodes are labeled with a bag of predicate symbols, , and each maximal cycle in the recursion graph of is collapsed into a single node with a bag containing all predicate symbols participating in the cycle. All other nodes are labeled with a bag consisting of a single predicate symbol. The edges in are obtained from the recursion graph, namely there is an edge between node and if there exists an edge in the recursion graph between some and .

The function assigns a level to each node in as follows:

  • if has no outgoing edges, then ;

  • otherwise, .
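The construction of the acyclic graph and of the level function can be sketched in Python. Symbols that can reach each other are grouped into one bag, edges between different bags are kept, and levels are computed bottom-up. The sketch assumes that the level of a node with outgoing edges is one more than the maximum level of its successors; this is a common choice and is stated here as an assumption. Function names and the example graph are hypothetical.

```python
def reachable(graph, start):
    """All nodes reachable from `start` along one or more edges."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for succ in graph[node]:
            if succ not in seen:
                seen.add(succ)
                stack.append(succ)
    return seen


def bags(graph):
    """Map every node to the bag of nodes that lie on a common cycle with it (or itself alone)."""
    reach = {n: reachable(graph, n) for n in graph}
    return {n: frozenset({n} | {m for m in reach[n] if n in reach[m]}) for n in graph}


def levels(graph):
    """Level 0 for bags without outgoing edges, otherwise 1 + maximum successor level (assumed)."""
    bag_of = bags(graph)
    dag = {bag: set() for bag in bag_of.values()}
    for node, succs in graph.items():
        for succ in succs:
            if bag_of[node] != bag_of[succ]:
                dag[bag_of[node]].add(bag_of[succ])
    memo = {}

    def level(bag):
        if bag not in memo:
            memo[bag] = 0 if not dag[bag] else 1 + max(level(b) for b in dag[bag])
        return memo[bag]

    return {bag: level(bag) for bag in dag}


# Hypothetical recursion graph: r and p form a cycle, t depends on r and s.
graph = {"t": {"r", "s"}, "r": {"p"}, "p": {"r"}, "s": set()}
result = levels(graph)
```

In the example, `r` and `p` share one bag at level 0, `s` forms its own bag at level 0, and `t` sits at level 1.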

For a given query , we define a function that, roughly, bounds the number of rewriting steps that may be iteratively applied to it. It is defined as follows:

We will show that the application of the rules decreases the degree, except for some cases where the degree stays the same, but can only do so for polynomially many rewriting steps (in the size of the largest bag of ). We show this bound before proving the main claim:

For each query of the form such that participates in a cycle, there are at most different queries of the form that can be obtained by the rewriting rules and such that , where is the size of the bag in containing (this bag is unique, since bags are induced by the largest cycles in ).

Let query be obtained by replacing with in , where differs from by at most one variable, . Since it must be that belong to the same bag in . If occurs in a cycle then, by the restriction of TBoxes, cannot be a non-simple role, hence is not obtained by applying S6. Then, applying the axioms that trigger this cycle in , it must be that is obtained again after at most rewriting steps (number of distinct pairs of symbols in the bag).

Now that we have a bound on the number of times that the degree can stay the same for rewritings of a specific form, we can prove the lemma. We will distinguish between the types of queries produced by the rules in Definition 3:

  1. for rules S1-5: , and there is an arc between node labeled with , and node labeled with , or they occur in the same bag in ;

  2. for rule S6: , where is an arbitrary but fixed variable not occurring in , and there exist arcs between the node labeled with and the nodes labeled with and ;

  3. for rule S7: , where replaces one variable by another in .

We now show that if , then either (i) , or (ii) if are as in , and thus can only preserve the same degree for at most rewriting steps, or if is obtained by applying a substitution on , which eventually leads to a query with a unique variable. This will imply that, after at most steps, the degree will be zero and no more steps will be applicable.

In what follows we show a proof by cases that matches cases 1–3 above. Firstly, for case above, if and do not occur in the same cycle, then since there is an arc between and , hence . The other subcase follows from , therefore the degree of the queries obtained by rules S1-5 decreases after at most rewriting steps.

Next, we show for case above that: for each pair of queries , such that , we have that . Since is obtained by applying , the claim follows immediately from the fact that cannot occur in a cycle and there must be an arc between node labeled and nodes labeled and in ( cannot belong to the same bag due to the restriction of role inclusions between simple and non-simple roles).

Lastly, for case above: for each pair of queries , we have that , however in this case , hence such a rule will eventually either reduce the size of and potentially make applicable cases or hence after at most applications.

Therefore, we can conclude that each query can be obtained after applying at most rewriting steps.

We now argue the other part of the lemma, namely that each query in the rewriting has polynomial size. In case at most one new variable is introduced but the size of the query remains the same, and in case the size of the query increases by one, however only one of the newly introduced atoms (the non-simple role atom) may further trigger application of rule S6, but only a polynomially bounded number of times, since the degree decreases. Therefore the size of each query in the rewriting is polynomially bounded. ∎

The next result is shown analogously to the corresponding result in [CDL07], extended to cover the new rule S6. The full proof can be found in the appendix.

Lemma 3.

Let be a TBox, a CQ. For every ABox consistent with :

Non-recursive CRIs preserve FO-rewritability of DL-Lite, but their addition is far from harmless. Indeed, unlike the extension with transitive roles, even non-recursive CRIs increase the complexity of testing KB consistency.

Theorem 1.

Consistency checking in is coNP-complete.

Proof.

Upper-bound: As for standard DL-Lite, inconsistency checking can be reduced to UCQ answering, using a CQ for testing whether each disjointness axiom is violated. By Lemmas 2 and 3, an NP procedure can guess one such , guess a in its rewriting, and evaluate over .

Lower-bound: We reduce the complement of 3SAT to KB satisfiability. Suppose we are given a conjunction of clauses of the form , where the are literals, i.e., propositional variables or their negation. Let be all the propositional variables occurring in .

In order to encode the possible truth assignments of each variable , we take two fresh roles and , intended to be disjoint. We construct a TBox containing, for every , the following axioms:

These axioms have a model that is a full binary tree, rooted at and whose edges are labeled with the role , and with different combinations of the roles and . Intuitively, each path represents a possible variable truth assignment. Further, contains axioms relating each variable assignment with the clauses it satisfies, using roles . More precisely, we have the following role inclusions for , and :

(1)

To encode the evaluation of all clauses, we have axioms propagating down the tree all clauses satisfied by some assignment. Note that we could do this easily using a CRI such as . However, this would need a recursive role . Since the depth of the assignment tree is bounded by , we can encode this (bounded) propagation using at most roles () for each clause , which will be declared as subroles of another role . For and , we have the following CRIs:

Thus, if is satisfied in a -branch of the assignment tree, its leaf will have an incoming edge. Now, in order to encode that there is at least one clause that is not satisfied, we need to forbid the existence of a leaf satisfying the concept . This cannot be straightforwardly written in , but we resort again to CRIs to propagate information:

(2)

Next, for we have the following:

(3)

By adding the axiom , we obtain the required restriction. In the appendix we prove that is unsatisfiable iff is satisfiable. ∎

Theorem 2.

CQs over KBs are FO-rewritable. The complexity of answering CQs over consistent KBs is in AC0 in data complexity, and NP-complete in combined complexity.

The FO-rewritability and the data complexity bound follow from Lemma 3, while NP-hardness in combined complexity is inherited from CQs over plain relational databases. NP membership follows from Lemma 2 and the fact that, after guessing a query in the rewriting together with a candidate match, it is possible to verify in polynomial time that the match holds over the ABox.

3.2 Recursion-safe

In addition to the increased complexity, has another relevant limitation: it cannot express CRIs like as we need in our motivating example. We introduce another extension of that allows for CRIs with some form of controlled recursion.

Definition 5 (Recursion safe ).

In a recursion safe TBox all CRIs satisfy:

  • If participates in some cycle in the recursion graph of , then the cycle has length at most one, and .

  • There is no axiom of the form with or , where denotes the reflexive and transitive closure of the simple inclusions in a TBox , that is, of the relation with .

The key idea behind recursion safety is that every recursive CRI is ‘guarded’ by a simple role that is not existentially implied. For query answering, we can assume that only ABox individuals are connected by these guarding roles, and thus CRIs only ‘fire’ close to the ABox (that is, each pair in the extension of a recursive role has at least one individual). In fact, we show below that every consistent recursion-safe KB has a model where both conditions hold.

Example 2.

is recursion safe, since is the only CRI, and is not implied by any existential axiom in .

3.2.1 Reasoning in recursion safe .

Standard reasoning problems like consistency checking and answering instance queries are tractable for recursion safe KBs. In fact, for a given KB, we can build a polynomial-sized interpretation that is a model whenever the KB is consistent, and that can be used for testing entailment of assertions and of disjointness axioms.

Definition 6.

Let be a recursion safe KB. We define an interpretation as follows. As domain we use the individuals in , fresh individuals that serve as -fillers for individual , and fresh individuals that serve as shared -fillers for the objects that are not individuals in . That is, , where

The interpretation function has for each , and assigns to each concept name and each role name in the minimal set of the form , such that the following conditions hold, for all , a basic concept, and :

  1. if then , and if then .

  2. If , then .

  3. if , then .

  4. if , then .

  5. if , then .

  6. if , then .

  7. if , and then .

For , we can show the following useful properties:

Proposition 1.

Let be a recursion safe TBox, where contains only positive inclusions, and contains only disjointness axioms. Then, for every ABox :

  1. If is consistent, then .

  2. is inconsistent iff for some .

  3. If is consistent and is an instance query, then .

Proof (sketch).

To prove , we assume that is consistent. Verifying that satisfies all but the disjointness axioms is easy from the definition of . Let be an arbitrary model of . For , let the set of basic concepts satisfied at in , and , the set of roles connecting and in . The following claim shows a key property of . The proof of the claim can be found in the appendix.

Claim 1.

For any given (i) there exists such that and (ii) for each such that we have that there exists such that .

Towards a contradiction, assume there is such that ; the case of role disjointness axioms is analogous. Then there is with , and by the claim above, for each model . Hence , and this concludes the proof of . Properties and can also be shown using Claim 1 above and the fact that is a model of the KB. ∎

This proposition allows us to establish the following results:

Theorem 3.

For recursion safe KBs, consistency checking and instance query answering are feasible in polynomial time in combined complexity.

The recursion safe fragment of is not FO-rewritable: indeed, the TBox in the proof of Lemma 1 is recursion safe. However, we can get rid of recursive CRIs and regain rewritability if we have guarantees that they will only be relevant on paths of bounded length. We formalize this rough intuition next.

Definition 7 (k-bounded ABox).

Let be a