24 January 2011
Abstract

A data integration system provides transparent access to different data sources by suitably combining their data, and providing the user with a unified view of them, called global schema. However, source data are generally not under the control of the data integration process, thus integrated data may violate global integrity constraints even in the presence of locally-consistent data sources. In this scenario, it may nevertheless be of interest to retrieve as much consistent information as possible. The process of answering user queries under global constraint violations is called consistent query answering (CQA). Several notions of CQA have been proposed, e.g., depending on whether integrated information is assumed to be sound, complete, exact or a variant of them. This paper provides a contribution in this setting: it unifies solutions coming from different perspectives under a common ASP-based core, and provides query-driven optimizations designed for isolating and eliminating inefficiencies of the general approach for computing consistent answers. Moreover, the paper introduces some new theoretical results enriching existing knowledge on decidability and complexity of the considered problems. The effectiveness of the approach is evidenced by experimental results.

To appear in Theory and Practice of Logic Programming (TPLP).

Submitted 23 November 2010

Consistent Query Answering via ASP from Different Perspectives: Theory and Practice

MARCO MANNA, FRANCESCO RICCA and GIORGIO TERRACINA
Department of Mathematics, University of Calabria, Italy


Keywords: Answer Set Programming, Data Integration, Consistent Query Answering

1 Introduction

The enormous amount of information dispersed over many data sources, often stored in different heterogeneous databases, has recently boosted the interest in data integration systems [Lenzerini (2002)]. Roughly speaking, a data integration system provides transparent access to different data sources by suitably combining their data, and providing the user with a unified view of them, called global schema. In many cases, the application domain imposes some consistency requirements on integrated data. For instance, it may be desirable to impose some integrity constraints (ICs), like primary/foreign keys, on the global relations. Since data sources are in general not under the control of the data integration process, data stored at the sources may violate global ICs when integrated. The standard approach to this problem basically consists of explicitly modifying the data in order to eliminate IC violations (data cleaning). However, the explicit repair of data is not always convenient or possible. Therefore, when answering a user query, the system should be able to “virtually repair” relevant data (along the lines of Arenas et al. 2003, Bertossi et al. 2005, Chomicki and Marcinkowski 2005), in order to provide consistent answers; this task is also called Consistent Query Answering (CQA).

The database community has spent considerable effort in this area, and relevant research results have been obtained to clarify semantics, decidability, and complexity of data integration under constraints and, specifically, of CQA. In particular, several notions of CQA have been proposed (see Bertossi et al. 2005 for a survey), e.g., depending on whether the information in the database is assumed to be sound, complete or exact. However, while efficient systems are already available for simple data integration scenarios, solutions that are both scalable and comprehensive have not been implemented yet for CQA, mainly because handling inconsistencies arising from constraint violations is inherently hard. Moreover, mixing different kinds of constraints (e.g., denial constraints and inclusion dependencies) on the same global database often makes the query answering process undecidable [Abiteboul et al. (1995), Calì et al. (2003a)].

This paper provides some contributions in this setting. Specifically, it starts from different state-of-the-art semantic perspectives [Arenas et al. (2003), Calì et al. (2003a), Chomicki and Marcinkowski (2005)] and revisits them in order to provide a uniform, common core based on Answer Set Programming (ASP) [Gelfond and Lifschitz (1988), Gelfond and Lifschitz (1991)]. Then, it provides query-driven optimizations, in the light of the experience we gained in the INFOMIX [Leone et al. (2005)] project, in order to overcome the limitations observed in real-world scenarios. The main contributions of this paper can be summarized as follows:

  • A theoretical analysis of the considered semantics which extends previous results.

  • The definition of a unified framework for CQA based on a purely declarative, logic-based approach which supports the most relevant semantic assumptions on source data. Specifically, the problem of consistent query answering is reduced to cautious reasoning on (disjunctive) ASP programs with aggregates [Faber et al. (2010)] automatically built from both the query and the involved constraints.

  • The definition of an optimization approach designed to (1) “localize” and limit the inefficient part of the computation of consistent answers to small fragments of the input, and (2) reduce the computational complexity of the repair process whenever possible.

  • The implementation of the entire framework in a fully fledged prototype system.

  • The capability of handling large amounts of data, typical of real-world data integration scenarios, using as internal query evaluator the DLV [Terracina et al. (2008)] system; indeed, DLV allows for mass-memory database evaluations and distributed data management features.

In order to assess the effectiveness of the proposed approach, we carried out experiments both on a real-world scenario and on synthetic data, comparing its behavior under different semantics and constraints.

The plan of the paper is as follows. Section 2 formally introduces the notion of CQA under different semantics and some new theoretical results on decidability and complexity for this problem. Section 3 first introduces a unified (general) solution to handle CQA via ASP, and then presents some optimizations. Section 4 describes the benchmark framework we adopted in the tests and discusses the results obtained. Finally, Section 5 compares related work and draws some conclusions.

2 Data Integration Framework

In this paper we exploit the data integration setting to point out motivations and challenges underlying CQA. However, as will be clarified in the following, the techniques and results provided in the paper also hold for a single database setting. We next formally describe the adopted data integration framework.

The following notation will be used throughout the paper. We always denote by a countably infinite domain of totally ordered values; by a tuple of values from ; by a variable; by a sequence of (not necessarily distinct) variables, and by its length. Let be two sequences of variables, we denote by the sequence obtained from by discarding a variable if it appears in . Whenever all the variables of sequence appear in another sequence , we simply write . Given a sequence and a set , we denote by the sequence obtained from by discarding a variable if its position is not in . (Similarly, given a tuple and a set , we denote by the tuple obtained from by discarding a value if its position is not in .) Moreover, we denote, by a conjunction of comparison atoms of the form , where , and by , the symmetric difference operator between two sets.

A relational database schema is a pair where and are the relation names and the integrity constraints (ICs) of , respectively. The arity of a given relation is denoted by . A database (instance) for is any set of facts [Abiteboul et al. (1995)] of the form:

In the following, we adopt the unique name assumption, and denotes the subset of containing all the values appearing in the facts of .

Let , the set contains ICs of the form:

  1. (denial constraints – DCs)

  2. (inclusion dependencies – INDs);

where , for each in [1..]. In particular, for INDs we require that all the variables within an () are distinct, , , and . Note that, if , then . In the case we are only interested in emphasizing the relation names involved in an IND, we simply write or . A database is said to be consistent w.r.t. if all ICs are satisfied. A conjunctive query over is a formula of the form

where for each in [1..], are the free variables of , and contains only and all the variables of (with no duplicates, and possibly in different order). A union of conjunctive queries is a formula of the form . In the following, for simplicity, the term query refers to a union of conjunctive queries, if not differently specified. Given a database for , and a query , the answer to is the set of -tuples of values .
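As a reading aid, the shapes of the constraints and queries just introduced can be written as follows. This is only a sketch in a common textbook notation: the symbols $p_i$, $\bar{x}_i$, $\bar{y}$ and $\sigma$ are names assumed here, and the exact side conditions are those stated in the surrounding text.

```latex
% Denial constraint (DC): some conjunction of atoms and comparisons must never hold.
\forall \bar{x}_1,\dots,\bar{x}_m\;
  \neg\bigl[\,p_1(\bar{x}_1)\wedge\dots\wedge p_m(\bar{x}_m)\wedge
  \sigma(\bar{x}_1,\dots,\bar{x}_m)\,\bigr]

% Inclusion dependency (IND): every p_1-fact needs a matching p_2-fact;
% the variables of \bar{x}_2 not occurring in \bar{x}_1 are existential.
\forall \bar{x}_1\;\exists (\bar{x}_2-\bar{x}_1)\;
  \bigl[\,p_1(\bar{x}_1)\rightarrow p_2(\bar{x}_2)\,\bigr]

% Conjunctive query with free variables \bar{y}, and a union of such queries.
Q_i(\bar{y}) \;=\; \exists \bar{x}\;
  \bigl[\,p_1(\bar{x}_1)\wedge\dots\wedge p_m(\bar{x}_m)\wedge
  \sigma(\bar{x}_1,\dots,\bar{x}_m)\,\bigr],
\qquad
Q(\bar{y}) \;=\; Q_1(\bar{y})\vee\dots\vee Q_n(\bar{y})
```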

2.1 The Data Integration Model

A data integration system is formalized [Lenzerini (2002)] as a triple where

  1. is the global schema. A global database for is any database for ;

  2. is the source schema. A source database for is any database consistent w.r.t. ;

  3. is the global-as-view (GAV) mapping, that associates each element in with a union of conjunctive queries over .

Let be a source database for . The retrieved global database is

for satisfying the mapping. Note that, when source data are combined in a unified schema with its own ICs, the retrieved global database might be inconsistent.

In the following, when it is clear from the context, we use simply the symbol to denote the retrieved global database . In fact, all results provided in the paper hold for any database complying with some schema but possibly inconsistent w.r.t. the constraints of .

Example 1

Consider a bank association that desires to unify the databases of two branches. The first (source) database models managers by using a relation and employees by a relation , where is a primary key for both tables. The second database stores the same data in a relation . Suppose that the data have to be integrated under a global schema with two relations and , where the global ICs are:

  • namely, is the key of ;

  • i.e., an IND imposing that each manager code must be an employee code as well.

The mapping is defined by the following Datalog rules (as usual, see Abiteboul et al. 1995):

  1.     

Assume that, stores tuples (‘e1’,‘john’), (‘e2’,‘mary’), (‘e3’,‘willy’), stores (‘e1’,‘john’), and stores (‘e1’,‘ann’,‘manager’), (‘e2’,‘mary’,‘manager’), (‘e3’, ‘rose’,‘emp’). It is easy to verify that, although the source databases are consistent w.r.t. local constraints, the global database, obtained by evaluating the mapping, violates the key constraint on as both john and ann have the same code e1, and both willy and rose have the same code e3 in table .    
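The construction of the retrieved global database in this example can be reproduced with a few lines of Python. This is a minimal sketch under stated assumptions: the relation names `emp1`, `mgr1` and `staff2` are hypothetical placeholders, and the mapping coded in `retrieve_global` is an assumed GAV mapping chosen so that it yields the violations described above; it is not taken from the paper.

```python
# Source relations of Example 1; the names emp1, mgr1 and staff2 are hypothetical.
emp1 = {("e1", "john"), ("e2", "mary"), ("e3", "willy")}
mgr1 = {("e1", "john")}
staff2 = {("e1", "ann", "manager"), ("e2", "mary", "manager"), ("e3", "rose", "emp")}


def retrieve_global(emp1, mgr1, staff2):
    """Evaluate an ASSUMED GAV mapping: each global relation is a union of source queries."""
    # Global managers: managers of the first branch plus 'manager' rows of the second.
    m = set(mgr1) | {(code, name) for (code, name, role) in staff2 if role == "manager"}
    # Global employees: employees of the first branch plus every row of the second.
    e = set(emp1) | {(code, name) for (code, name, _role) in staff2}
    return m, e


def key_violations(rel):
    """Return the codes that are associated with more than one name."""
    names_by_code = {}
    for code, name in rel:
        names_by_code.setdefault(code, set()).add(name)
    return {code: names for code, names in names_by_code.items() if len(names) > 1}


def ind_violations(m, e):
    """Return manager codes that do not occur as employee codes."""
    employee_codes = {code for code, _ in e}
    return {code for code, _ in m if code not in employee_codes}


m, e = retrieve_global(emp1, mgr1, staff2)
violations = key_violations(e)
missing = ind_violations(m, e)
```

Under this assumed mapping the key is violated for codes e1 and e3 in the employee relation, while the inclusion dependency is satisfied.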

2.2 Consistent Query Answering under different semantics

In case a database violates ICs, one can still be interested in querying the “consistent” information originating from . One possibility is to “repair” (by inserting or deleting tuples) in such a way that all the ICs are satisfied. But there are several ways to “repair” . As an example, in order to satisfy an IND of the form one might either remove violating tuples from or insert new tuples in . Moreover, the repairing strategy depends on the particular semantic assumption made on the data integration system. Semantic assumptions may range from (strict) soundness to (strict) completeness. Roughly speaking, completeness complies with the closed world assumption where missing facts are assumed to be false; on the contrary, soundness complies with the open world assumption where may be incomplete. We next define consistent query answering under some relevant semantics, namely loosely-exact, loosely-sound, CM-complete [Arenas et al. (2003), Calì et al. (2003a), Chomicki and Marcinkowski (2005)]. More formally, let denote a semantics, and a possibly inconsistent database for , a database is said to be a -repair for if it is consistent w.r.t. and one of the following conditions holds:

  1. , , and such that is consistent and ;

  2. and such that is consistent and ;

  3. , and such that is consistent and .

The CM-complete semantics allows a minimal number of deletions in each repair to avoid empty repairs, if possible, but does not allow insertions. The loosely-sound semantics allows insertions and a minimal amount of deletions. Finally, the loosely-exact semantics allows both insertions and deletions by minimization of the symmetric difference between and the repairs.
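The three conditions can be summarized symbolically as follows. This is a sketch only: $\mathcal{D}$ (the possibly inconsistent database), $\mathcal{B}$ (a consistent candidate repair), $\mathcal{B}'$ (a competing consistent database) and $\triangle$ (symmetric difference) are names assumed here for illustration.

```latex
\begin{align*}
\text{CM-complete:}\quad
  & \mathcal{B}\subseteq\mathcal{D}
    \ \text{and}\ \nexists\,\mathcal{B}'\subseteq\mathcal{D}
    \ \text{consistent such that}\ \mathcal{B}'\supset\mathcal{B}\\
\text{loosely-sound:}\quad
  & \nexists\,\mathcal{B}'
    \ \text{consistent such that}\
    \mathcal{B}'\cap\mathcal{D}\supset\mathcal{B}\cap\mathcal{D}\\
\text{loosely-exact:}\quad
  & \nexists\,\mathcal{B}'
    \ \text{consistent such that}\
    \mathcal{B}'\,\triangle\,\mathcal{D}\subset\mathcal{B}\,\triangle\,\mathcal{D}
\end{align*}
```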

Definition 1

Let be a database for a schema , and be a semantics. The consistent answer to a query w.r.t. is the set of tuples that belong to the answer to the query over every -repair for . Consistent Query Answering (CQA) is the problem of computing this set.

Observe that other semantics have been considered in the literature, like sound, complete, exact, loosely-complete, etc. [Calì et al. (2003a)]; however, some of them are trivial for CQA. As an example, in the exact semantics CQA makes sense only if the retrieved database is already consistent with the global constraints, whereas in the complete and loosely-complete semantics CQA will always return an empty answer. Note that the semantics considered in this paper cover a wide and significant range of ways to repair the retrieved database which are also relevant for CQA.

Example 2

By following Example 1, the retrieved global database admits exactly the following repairs under the CM-complete semantics: Query asking for the list of manager codes has then both e1 and e2 as consistent answers, whereas the query asking for the list of employees has only as consistent answer ( is the only tuple in each CM-complete repair).    
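The consistent answers of this example can be checked by brute force: enumerate the subsets of the retrieved global database, keep the maximal consistent ones (the CM-complete repairs), and intersect the query answers over them. The sketch below is illustrative only; the relation tags `"e"` and `"m"` and the content of `D` are assumptions consistent with the violations described in Example 1, and the exhaustive enumeration is exponential, so it is not the ASP-based method of the paper.

```python
from itertools import combinations

# Assumed retrieved global database: ("e", code, name) and ("m", code, name) facts.
D = frozenset({
    ("e", "e1", "john"), ("e", "e1", "ann"), ("e", "e2", "mary"),
    ("e", "e3", "willy"), ("e", "e3", "rose"),
    ("m", "e1", "john"), ("m", "e1", "ann"), ("m", "e2", "mary"),
})


def consistent(db):
    """Check the two global ICs: code is a key of e, and m[code] is included in e[code]."""
    e_codes = [fact[1] for fact in db if fact[0] == "e"]
    key_ok = len(e_codes) == len(set(e_codes))
    ind_ok = all(fact[1] in e_codes for fact in db if fact[0] == "m")
    return key_ok and ind_ok


def cm_complete_repairs(db):
    """Return the maximal (w.r.t. set inclusion) consistent subsets of db."""
    facts = sorted(db)
    candidates = [
        frozenset(subset)
        for size in range(len(facts) + 1)
        for subset in combinations(facts, size)
        if consistent(subset)
    ]
    return [b for b in candidates if not any(b < other for other in candidates)]


def consistent_answers(query, repairs):
    """Tuples returned by the query over every repair."""
    return set.intersection(*(query(b) for b in repairs))


def manager_codes(db):
    return {fact[1] for fact in db if fact[0] == "m"}


def employees(db):
    return {(fact[1], fact[2]) for fact in db if fact[0] == "e"}


repairs = cm_complete_repairs(D)
codes = consistent_answers(manager_codes, repairs)
emps = consistent_answers(employees, repairs)
```

With this data the sketch finds four repairs (one choice between john and ann for e1, one between willy and rose for e3), the manager codes e1 and e2 as consistent answers, and (e2, mary) as the only consistent employee tuple.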

2.3 Restricted Classes of Integrity Constraints

The problem of computing CQA, under general combinations of ICs, is undecidable [Abiteboul et al. (1995)]. However, restrictions can be imposed on ICs to retain decidability and identify tractable cases.

Definition 2

Let be a relation name of arity , and be a set of indices from . A key dependency (KD) for consists of a set of DCs, exactly one for each index , of the form where no variable occurs twice in each (), , the sequence exactly coincides with , and is distinct from for each . The set is called the primary-key of and is denoted by . We assume that at most one KD is specified for each relation [Calì et al. (2003a)]. Finally, for each relation name such that no DC is explicitly specified for, we say, without loss of generality, that .    
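To make the definition concrete, here is how a key dependency unfolds into denial constraints in two small cases. The relation names $e$ and $r$ and the variable names are illustrative assumptions; $e$ mirrors the employee relation of Example 1, whose first attribute is the key.

```latex
% Binary relation e with primary key {1}: one DC, for the single non-key index 2.
\forall c, n, n'\;\neg\bigl[\,e(c,n)\wedge e(c,n')\wedge n\neq n'\,\bigr]

% Ternary relation r with primary key {1}: one DC for each non-key index (2 and 3).
\forall k, a, b, a', b'\;\neg\bigl[\,r(k,a,b)\wedge r(k,a',b')\wedge a\neq a'\,\bigr]
\qquad
\forall k, a, b, a', b'\;\neg\bigl[\,r(k,a,b)\wedge r(k,a',b')\wedge b\neq b'\,\bigr]
```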

Definition 3

Given an inclusion dependency of the form , we denote by and the two sets of indices induced by the positions of the variables in and , respectively. More formally, is universally quantified in and is universally quantified in .    

For example, let denote the IND . We have that and .

Definition 4

An IND is said to be

Definition 5

An FSK of the form is said to be safe (SFSK) if . In particular, if is a safe FK we call it an SFK.    

For example, let denote the FSK where . Thus, if , is SFSK, whereas if , is not SFSK.

Table 1 summarizes known and new results about computability and complexity of CQA under relevant classes of ICs and the three semantic assumptions considered in this paper. In particular, given a query (without comparison atoms if --), we refer to the decision problem of establishing whether a tuple from belongs to or not. Note that Chomicki and Marcinkowski (2005) have proved computability and complexity of CQA for the CM-complete semantics in the case of conjunctive queries with comparison predicates. However, since in such a setting there is a finite number of repairs, each of finite size, their results straightforwardly hold for unions of conjunctive queries as well. New decidability and complexity results for CQA under KDs and SFSKs only, with are proved in Section 2.4.

DCs   INDs   loosely-sound    loosely-exact    CM-complete (cyclic / acyclic INDs)
no    any    in PTIME         in PTIME         in PTIME
KD    no     coNP-c           coNP-c           coNP-c
KD    NKC    coNP-c           $\Pi^p_2$-c      in $\Pi^p_2$ / in coNP
KD    SFSK   in $\Pi^p_2$     in $\Pi^p_2$     in $\Pi^p_2$ / in coNP
KD    any    undec.           undec.           in $\Pi^p_2$ / in coNP
any   any    undec.           undec.           $\Pi^p_2$-c / coNP-c

Sources: Calì et al. (2003, PODS); Chomicki and Marcinkowski (2005); Section 2.4; Abiteboul et al. (1995).
Table 1: Data Complexity of CQA (distinguishing between cyclic/acyclic INDs)

2.4 Loosely-exact and Loosely-sound semantics under KD and SFSK

In this section we provide new decidability and complexity results for CQA under both the loosely-exact and the loosely-sound semantics with KDs and SFSKs. In the rest of the section we always denote by:

  • , a schema containing KDs and SFSKs only;

  • , a possibly inconsistent database for ;

  • , a union of conjunctive queries without comparison atoms.

  • .

We first show that, in the aforementioned hypothesis, the size of each repair is finite.

Definition 6

Let be a -repair for and be a natural number. We inductively define the sets as follows:

  1. If , then .

  2. If , then is arbitrarily chosen in such a way that its facts are necessary and sufficient for satisfying all the INDs in that are violated in .

Observe that and that for each .    

Lemma

Let be a -repair for , then

  1. The key of each fact in only contains values from .

  2. is finite.

Proof

(1) Let be a natural number. Let be a fact in such that there is an index for which . Let be one of the facts in that forces the presence of in for satisfying some IND, say . (Note that, by Definition 6, there must be at least one such fact because would otherwise violate condition 2, since would be unnecessary.) Moreover, since is a safe FSK, there must exist an index such that . Thus, contains a value being not in inside its key as well as . Since has been chosen arbitrarily, value has to be part of a fact of , which is clearly a contradiction.

(2) Since the key of each fact in can only contain values from , and where , then .

We next characterize representative databases for -repairs.

Definition 7

Let be a -repair for . We denote by the (possibly infinite) set of databases defined in such a way that if and only if:

  • can be obtained from by replacing each value (if any) that is not in with a value from ; and

  • none of the values in occurs twice in .

Finally, we denote by the function (homomorphism) associating values in with values in , where , for each .    

Note that, since (by Lemma 2.4) the key of each fact in only contains values from , then holds.

For example, if with and , then all of the following databases are in : , and .

Lemma

If is a -repair for , then each also is.

Proof

Let . First of all, we prove that is consistent w.r.t. . In particular, since the key of each fact in only contains values from (by Lemma 2.4), then cannot violate any KD (by Definition 7). Moreover, since each IND has to be satisfied through values of a key (by definition of safe FSKs), and since the key of each fact in only contains values from (by Lemma 2.4), then cannot violate any IND (by Definition 7).

We now prove that is a repair, first for the loosely-sound semantics and then for the loosely-exact semantics.

[loosely-sound] If , then observe that , by definition of . Thus, if was consistent but not a loosely-sound repair there would exist a loosely-sound repair such that . Contradiction.

[loosely-exact] If , then assume that is a loosely-exact repair but (although consistent w.r.t. ) is not. By definition, there must be a loosely-exact repair such that . In particular, we distinguish three cases:

    1. and

    2. and

    3. and

Case 1: Since, by Definition 7, for each fact in there is a fact in with the same key, if we could add the facts in to without violating any KD, then such facts could also be added to without violating any KD. Moreover, if we could add to the facts in without violating any IND, then such facts could be also added to preserving consistency. This follows by the definition of safe FSKs (because each IND has to be satisfied through values of a key), by Lemma 2.4 (because the key of each fact in a loosely-exact repair only contains values from ) and, by Definition 7 (because for each fact in there is a fact in with the same key and with the same values from ). Consequently, we could add all the facts in to preserving consistency. But this is not possible since is a loosely-exact repair.

Case 2: Since in we have unnecessary facts (those in ) or equivalently the facts in do not violate any IND, then the corresponding facts in do not violate any IND by Lemma 2.4 and by Definition 7. Consequently, if each fact , such that there is a fact that is homomorphic to , was removed from , then we would obtain a database preserving consistency and with a smaller symmetric difference than . But this is not possible since is a loosely-exact repair.

Case 3: Analogous considerations can be made by combining case 1 and case 2.

We next define the finite database having among its subsets a number of -repairs sufficient for solving CQA.

Definition 8

Let be a value in . Consider the largest (possibly inconsistent) database, say , constructible on the domain such that iff the value does not appear in the key of . Let be a fixed set of values arbitrarily chosen from whose cardinality is equal to the number of occurrences of in . We denote by one possible database for obtained from by replacing each occurrence of with a value from in such a way that each value in occurs exactly once in . (.)    

For example, if and with and , then . Let us fix . Thus, has the following form: .

Proposition

The following hold:

Lemma

If is a -repair for , then there exists such that .

Proof

can be obtained from by replacing each fact with the unique fact such that for each either , if , or , if . Moreover, note that, since cannot contain two facts with the same key and since keys only have values from , then each fact in can replace at most one fact in . Finally, by Definition 7.

Lemma

Let be a -repair for , , be a query, and be a tuple of values from . If , then .

Proof

Let be one of the conjunctions in , if , then there is a substitution from the variables of to values in such that . But since, by Definition 7, each fact in is univocally associated with a unique fact in by preserving the values in , and since all the extra values in are distinct, then there must also be a substitution such that . In particular, let be a variable in , we can define in such a way that , where is the homomorphism from to (see Definition 7). Clearly, if for at least one in then too and, consequently,

The next theorem states the decidability of CQA under both the loosely-exact and the loosely-sound semantics with KDs and SFSKs only.

Theorem

Let be a -repair for , a query, and a tuple from . Let denote the set of all -repairs contained in . Then,

Proof

() We have to prove that, if , then for each , or equivalently if for some , then . This follows, by the definition of and from the fact that only contains -repairs.

() We have to prove that, if for each , then . Assume that for each but . This would entail that there is a repair such that . But, since for each (by Lemma 2.4), and since always contains a repair, say (by Lemma 2.4), then we have a contradiction since has to hold whereas we have assumed that for each .

Decidability and complexity results, under KDs and SFSKs only, follow from Theorem 2.4.

Corollary

Let be a global schema containing KDs and SFSKs only, be a possibly inconsistent database for , be a query, , and be a tuple of values from . The problem of establishing whether the tuple is a consistent answer to the query is in $\Pi^p_2$ in data complexity.

Proof

It suffices to prove that the complementary problem, i.e., establishing whether the tuple is not a consistent answer to the query, is in $\Sigma^p_2$. This can be done by (i) building , and (ii) guessing such that is a -repair and . Since, by Proposition 2.4, where , step (i) (enumerating the facts of ) can be done in polynomial time. Checking that can be done in PTIME. It remains to show that checking whether is a -repair can be done in coNP.

[loosely-exact] If , this task corresponds to checking that there is no consistent such that , where this last task is doable in PTIME.

[loosely-sound] If , this task corresponds to checking that there is no consistent such that , where this last task is doable in PTIME.

Then the thesis follows.

2.5 Equivalence of CQA under loosely-exact and CM-complete semantics

In this section we define some relevant cases in which CQA under loosely-exact and CM-complete semantics coincide.

Lemma

Given a database for a schema , if is a CM-complete repair for , then it is a loosely-exact repair for .

Proof

Suppose that is a CM-complete repair for (so, it is consistent w.r.t. ), but it is not a loosely-exact one. This means that its symmetric difference with can still be reduced. But, by definition of CM-complete semantics, does not contain anything else but tuples in , namely . So, the only way of “improving” it is to extend it with tuples from . But this is not possible because is already maximal due to the CM-complete semantics, namely the addition of any other tuple would violate at least one IC.

Corollary

Proof

This directly follows by Lemma 2.5 in light of Definition 1.

Theorem

There are cases where

Proof

By Chomicki and Marcinkowski (2005), who show that the two semantics are different, and by Corollary 2.5.

Proposition

Let be a database consistent w.r.t. a set of ICs .

  1. If are DCs only, then each is consistent w.r.t. , as well.

  2. If are INDs only, then is consistent w.r.t. for each consistent w.r.t. .

Proof

(1) Deletion of tuples cannot introduce new DC violations.

(2) Let be a fact in . Let be an IND of the form (). Clearly, cannot violate in any database because is in the right-hand side of . In particular, cannot violate in . Let be an IND of the form (possibly, ). Since does not violate in , then it cannot violate in .
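Both closure properties can be observed on a toy instance. The following sketch is only an illustration on hand-picked data (the relation tags `"e"` and `"m"` and the two constraints are assumptions borrowed from Example 1), not a proof: it checks that every subset of a DC-consistent database stays DC-consistent, and that the union of two IND-consistent databases stays IND-consistent.

```python
from itertools import combinations


def dc_consistent(db):
    """Denial constraint (key): no two e-facts share the same code."""
    codes = [fact[1] for fact in db if fact[0] == "e"]
    return len(codes) == len(set(codes))


def ind_consistent(db):
    """Inclusion dependency: every code occurring in m also occurs in e."""
    e_codes = {fact[1] for fact in db if fact[0] == "e"}
    return all(fact[1] in e_codes for fact in db if fact[0] == "m")


def subsets(db):
    """All subsets of db, as sets of facts."""
    facts = sorted(db)
    return [set(c) for size in range(len(facts) + 1) for c in combinations(facts, size)]


# Item 1: with DCs only, consistency is preserved by taking subsets.
b = {("e", "e1", "john"), ("e", "e2", "mary"), ("m", "e1", "john")}
all_subsets_ok = dc_consistent(b) and all(dc_consistent(s) for s in subsets(b))

# Item 2: with INDs only, consistency is preserved by taking unions.
b1 = {("e", "e1", "john"), ("m", "e1", "john")}
b2 = {("e", "e1", "ann"), ("e", "e2", "mary"), ("m", "e2", "mary")}
union_ok = ind_consistent(b1) and ind_consistent(b2) and ind_consistent(b1 | b2)

# The restriction to INDs only matters: the same union breaks the key on e.
union_breaks_key = not dc_consistent(b1 | b2)
```

The last check shows why the two items are stated for DCs only and INDs only: once both kinds of constraints are present, a union that preserves the IND may still violate a key.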

Theorem

Given a database for a schema , let be a loosely-exact repair for , and . There is a CM-complete repair for if at least one of the following restrictions holds:

  1. contains DCs only (no INDs);

  2. contains INDs only (no DCs);

  3. contains KDs and FKs only, and is consistent w.r.t. KDs;

  4. contains KDs and SFKs only;

Proof

Case I: By Proposition 2.5, since is consistent w.r.t. DCs, then is consistent as well. Now, if , then we would have a contradiction because would hold. Thus, and so, is already a CM-complete repair itself.

Case II: Since there is no DC, there exists only one CM-complete repair, say , obtained from after removing all the facts violating INDs. Now, if was not contained in , then, by Proposition 2.5, would still be consistent, that is a larger CM-complete repair. Contradiction. Finally .

Case III: Since is consistent w.r.t. DCs, we have only one CM-complete repair, say , obtained from after removing all the facts violating INDs. But, as in case II, if the set was nonempty, then we could add all these facts into without violating any IND. Anyway, one of these facts, say , could violate a DC due to a fact in . Now, note that is in only for fixing an IND violation. But in this case, as we are only considering FKs, there would be no reason to have in instead of . So, we could (safely) replace with