Specifying and Computing Causes for Query Answers in Databases via Database Repairs and Repair-Programs


Abstract

There is a recently established correspondence between database tuples as causes for query answers in databases and tuple-based repairs of inconsistent databases with respect to denial constraints. In this work, answer-set programs that specify database repairs are used as a basis for solving computational and reasoning problems around causality in databases, including causal responsibility. Furthermore, causes are introduced also at the attribute level by appealing to an attribute-based repair semantics that uses null values. Corresponding repair-programs are introduced, and used as a basis for computation and reasoning about attribute-level causes. The answer-set programs are extended in order to capture causality under integrity constraints.

Keywords:
Causality, databases, repairs, constraints, answer-set programming

1  Introduction

Causality appears at the foundations of many scientific disciplines. In data and knowledge management, causality may be related to some form of uncertainty about obtained results, for example, about why certain query answers are obtained or not, or why certain semantic conditions are not satisfied. These tasks become more prominent and difficult when dealing with large volumes of data. One would expect the database to provide explanations, which could be used to understand, explore and make sense of the data, or to reconsider queries and integrity constraints (ICs). Causes for data phenomena can be seen as explanations.

Building on work on actual causality, mainly developed in the context of artificial intelligence Halpern05 (); Chockler04 (), causality in databases was introduced in Meliou2010a (), where the notions of counterfactual intervention and structural model are applied. More specifically, Meliou2010a () introduces the notions of: (a) a database tuple as an actual cause for a query result,  (b) a contingency set for a cause, as a set of tuples that must accompany the cause for it to be such, and (c) the responsibility of a cause as a numerical measure of its strength.

Example 1

Consider the relational database below, with table representing official stores, and table , for stores receiving goods from other stores:

The tables include attribute names. For accounting purposes, a store could be its own supplier, as shown by the tuple .  The query asking if there are pairs of official stores in a receiving relationship is true in the database. The question is about the tuples that cause this query to be true, and how strong they are as causes. In this case, we would expect the tuples and to be causes.

Most of our research on causality in databases has been motivated by an attempt to understand causality in data and knowledge management from different perspectives, and profiting from the established connections between them. In tocs (), precise reductions between causality in databases, database repairs, and consistency-based diagnosis were established; and the relationships were investigated and exploited. In flairsExt (), causality in databases was related to view-based database updates and abductive diagnosis. These are all interesting and fruitful connections among several forms of non-monotonic reasoning; each of them reflecting some form of uncertainty about the information at hand. In the case of database repairs Bertossi2011 (), the uncertainty is due to the non-satisfaction of given ICs, and it is represented by a class of possible repairs of the inconsistent database.

Database repairs can be specified by means of answer-set programs (or disjunctive logic programs with stable model semantics) asp (); gelfond (); torsten (), the so-called repair-programs. Cf. monica (); Bertossi2011 () for details on repair-programs and additional references. In this work we exploit the reduction from database causality to database repairs established in tocs (), by taking advantage of repair-programs for specifying and computing causes, their contingency sets, and their responsibility degrees. We show that the resulting causality-programs have the necessary and sufficient expressive power to capture and compute not only causes, which can be done with less expressive programs Meliou2010a (), but also minimal contingency sets and responsibilities, which provably require higher expressive power. Causality-programs can also be used for reasoning about causes.

As a finer-granularity alternative to tuple-based causes, we introduce a particular form of attribute-based causes, namely null-based causes, capturing the intuition that an attribute value may be the cause for a query to become true in the database. This is done by profiting from an abstract reformulation of the above mentioned relationship between tuple-based causes and tuple-based repairs. More specifically, we appeal to null-based repairs, which are a particular kind of attribute-based repairs. According to null-based repairs, the inconsistencies of a database are solved by minimally replacing attribute values in tuples by (a properly formalized version of) NULL, the null-value used in SQL databases (with an SQL semantics). We also define the corresponding attribute-based notions of contingency set and responsibility. We introduce repair (answer-set) programs for null-based repairs, so that the newly defined causes can be computed and reasoned about.

Finally, we briefly show how causality-programs can be adapted to give an account of other forms of causality in databases that are connected to other possible repair-semantics for databases.

More specifically, we make the following contributions:

  1. We start from a characterization of actual causes for query answers in terms of minimal repairs that are based on tuple deletions from a database that does not satisfy certain denial constraints. Next, we propose an abstract notion of actual cause that depends on an abstract repair semantics.

  2. The abstract notion of cause is specialized by appealing to minimal repairs that are obtained through changes of attribute values by NULL, a null value that is treated as in SQL databases. In this way, we introduce a notion of actual cause at the attribute level (as opposed to tuple level, as is usually the case).

  3. We present answer-set programs (ASPs) for the specification and computation of causes and their responsibilities. They are extensions of repair ASPs, both at the tuple- and the attribute-level. In particular, we show how extensions of ASPs with sets, aggregations and weak program constraints can be used for the computation of maximum-responsibility actual causes.

  4. The ASPs can be easily modified to accommodate endogenous and exogenous tuples, as considered under actual causality Meliou2010a (). The former are somehow under our control, and can be subjected to further analysis. Only they can be actual causes. The latter are external tuples, beyond our control, and taken as given. They are not considered as possible causes or contingent companions of causes. The notion of database repair with these two classes of tuples was introduced in tocs (). Writing and using ASPs for this kind of repairs is a rather straightforward extension of the programs we consider in this work, where all tuples are considered to be endogenous.

  5. We show several ASPs as examples, and their execution with the DLV and DLV-Complex systems dlv (); calimeri08 (); calimeri09 ().

  6. We elaborate on the notion of actual cause under given integrity constraints, and show how they can be computed via ASPs.

  7. We introduce several topics for relevant discussions and further research.

The results obtained in this work are significant for several reasons. First of all, declarative specifications of causes in databases enable logical reasoning about causes and their responsibilities in combination with data. This can be done at the same level of the data, which, as facts, become elements of an answer-set program. Using an ASP system such as DLV makes this possible without having to export the data as facts outside the database. After having an ASP taking care of the base specification of causes, one can consider combining it with additional domain knowledge or semantic information, in the form of additional rules or constraints, or in combination with ontologies dlProgs (). One can also rather easily perform some forms of hypothetical reasoning about causes, by almost dynamically activating or disabling database tuples, or making some of them, or classes of them, exogenous, which means they cannot be considered as causes, and, on the repair side, cannot be considered for restoring consistency of the database.

Furthermore, the ASP paradigm is expressive enough to solve the intrinsically complex algorithmic tasks behind responsibility computation, in particular, of most responsible causes, without being more expressive than the problem requires. ASPs have exactly the required computational complexity for these tasks (cf. Section 8). In this same direction, ASP systems have become particularly efficient, and are being used for solving hard combinatorial problems. In our case, we could do the actual computations directly using the ASP engines, without having to export the data to some external environment where ad hoc algorithms are implemented and run. Finally, the ASPs we propose can be used to specify an abstract notion of cause (cf. Section 3.1), from which specific forms of causes, such as tuple- and attribute-based causes can be obtained; and the generic ASPs can be easily adapted for those specific forms of causes, and others.

This paper is structured as follows. Section 2 provides background material on relational databases, database causality, database repairs, and answer-set programming. Section 3 establishes correspondences between causes and repairs, and introduces an abstract notion of cause on the basis of an abstract repair-semantics. Section 4 introduces null-based repairs and causes. Section 5 presents repair-programs for tuple-based causality computation and reasoning. Section 6 presents answer-set programs for null-based repairs and null-based causes. Section 7 introduces actual causes in the presence of ICs, and illustrates the corresponding repair-programs that can be used for causality computation. Finally, Section 8, in more speculative terms, contains a discussion about research subjects that would naturally extend this work. Appendices A and B show ASPs written and run in DLV. This paper is a revised and extended version of foiks18 (). Its extended version foiks18Corr () contains additional examples with DLV.

2  Background

2.1 Relational databases

A relational schema contains a domain, , of constants and a set, , of predicates of finite arities. gives rise to a language of first-order (FO) predicate logic with built-in equality, . Variables are usually denoted by , and sequences thereof by ; and constants with , and sequences thereof by . An atom is of the form , with -ary and terms, i.e. constants, or variables. An atom is ground, a.k.a. a tuple, if it contains no variables. Tuples are denoted with . A database instance, , for schema is a finite set of ground atoms; and it serves as a (Herbrand) interpretation structure for language lloyd () (cf. also Section 2.4).

A conjunctive query (CQ) is an FO formula of the form  , with , and (distinct) free variables . If has free variables,  is an answer to from if , i.e. is true in when the variables in are componentwise replaced by the values in . denotes the set of answers to from . is a Boolean conjunctive query (BCQ) when is empty. When it is true in , by definition . Otherwise, if it is false, .  A view is a predicate defined by means of a query, whose contents can be computed, if desired, by computing all the answers to the defining query.

Example 2

(ex. 1 cont.)  The relational schema contains two predicates, . A database compatible with this schema is shown in Example 1.  We will usually present a relational database as the set of its true atoms, i.e. the tuples in the tables with their predicates. In this case,  

The query in Example 1, asking whether there are official stores such that one receives goods from another, is a BCQ that can be written as

The query is true in , denoted  .

In this work we consider integrity constraints (ICs), i.e. sentences of , that are: (a) denial constraints  (DCs), i.e. of the form

where , and ; and (b) functional dependencies  (FDs), i.e. of the form

.

Here, , and is an abbreviation for .

An inclusion dependency is an IC of the form

,

with , and .

Example 3

(ex. 2 cont.)  For schema , the following is a denial constraint:

and the following is a functional dependency, of the second attribute upon the first, on predicate :

.

The following is an inclusion dependency:

The constraint is not satisfied by , but and are.

A given schema may come with its set of ICs, and its instances are expected to satisfy them. If an instance does not satisfy them, we say it is inconsistent. In this work we concentrate mostly on DCs.  See ahv () for more details and background material on relational databases.

2.2 Causality in databases

A notion of cause as an explanation for a query result was introduced in   Meliou2010a (), as follows. For a relational instance , where and denote the mutually exclusive sets of endogenous and exogenous tuples, a tuple is called a counterfactual cause for a BCQ , if    and  . Now, is an actual cause for if there exists , called a contingency set for , such that is a counterfactual cause for in . This definition is based on Halpern05 ().

The notion of responsibility reflects the relative degree of causality of a tuple for a query result Meliou2010a () (based on  Chockler04 ()). The responsibility of an actual cause for , is , where is the size of a smallest contingency set for . If is not an actual cause, . Intuitively, tuples with higher responsibility provide stronger explanations.

The database is partitioned into endogenous and exogenous tuples. Exogenous tuples are accepted as given, which may happen because we trust them, or we have little control over them, or they are obtained from an external, trustworthy and indisputable data source, etc. Endogenous tuples are subject to experimentation and questioning, in particular, about their role in query answering or violation of ICs. The partition is application dependent, and we may not even have exogenous tuples.  Actually, in the following we will assume all the tuples in a database instance are endogenous.  (Cf. tocs () for the general case, and Section 8 for additional discussions.)

The notion of cause as defined above can be equally applied to answers to open CQs, say a cause for obtaining as an answer to a CQ , i.e. a cause for . Actually, it can be applied to monotonic queries in general, i.e. whose sets of answers may only grow when the database grows tocs (). For example, CQs, unions of CQs (UCQs) and Datalog queries are monotonic. Causality for these queries was investigated in tocs (); flairsExt ().  In this work we concentrate mostly on conjunctive queries, possibly with built-in comparisons, such as .

Example 4

(ex. 2 cont.)  We recall that the query

is true in , for which we want to identify causes.

Tuple is a counterfactual cause for : if is removed from , is no longer true. So, it is an actual cause with empty contingency set; and its responsibility is .   is an actual cause for with contingency set : if is removed from , is still true, but further removing the contingent tuple makes false. The responsibility of is . and are actual causes, with responsibility .
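The definitions of Section 2.2 can be checked by brute force on small instances. The following Python sketch is only illustrative: the instance `D`, the predicate names `R` and `S`, the constants, and the query function `q` are hypothetical stand-ins chosen for this sketch, all tuples are taken to be endogenous, and the responsibility is computed as 1/(1+|Γ|) for a smallest contingency set Γ, as in Meliou2010a ().

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical toy instance (an assumption of this sketch).
# Atoms are (predicate, arguments) pairs.
D = frozenset({
    ("R", ("a4", "a3")), ("R", ("a2", "a1")), ("R", ("a3", "a3")),
    ("S", ("a4",)), ("S", ("a2",)), ("S", ("a3",)),
})

def q(db):
    """Hypothetical BCQ: exists x, y (S(x) and R(x, y) and S(y))."""
    s = {args[0] for pred, args in db if pred == "S"}
    return any(pred == "R" and args[0] in s and args[1] in s
               for pred, args in db)

def smallest_contingency(db, query, tau):
    """Return a smallest contingency set for tau, or None if tau is no actual cause."""
    rest = sorted(db - {tau})
    for k in range(len(rest) + 1):          # try smaller sets first
        for gamma in combinations(rest, k):
            reduced = db - set(gamma)
            # tau must be a counterfactual cause in db minus gamma
            if query(reduced) and not query(reduced - {tau}):
                return set(gamma)
    return None

def responsibility(db, query, tau):
    """1 / (1 + size of a smallest contingency set), or 0 for non-causes."""
    gamma = smallest_contingency(db, query, tau)
    return Fraction(0) if gamma is None else Fraction(1, 1 + len(gamma))
```

On this toy instance, `S(a3)` is a counterfactual cause (empty contingency set), while `R(a4, a3)` needs the contingent tuple `R(a3, a3)`. The search is exponential in the size of the instance, so it only serves as a check of the definitions.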

2.3 Database repairs

We introduce the main ideas behind database repairs by means of an example. If only deletions and insertions of tuples are admissible updates, the ICs we consider in this work can be enforced only by deleting tuples from the database, not by inserting tuples (we consider repairs via updates of attribute-values in Section 4).

Example 5

Database , shown in tabular form as

is inconsistent with respect to (w.r.t.) the set of DCs , with

(1)
(2)

which require that pricey products cannot be supplied or resold.  That is,  ; and we have to consider possible repairs for .

A subset-repair, in short an S-repair, of w.r.t. is a -maximal subset of that is consistent w.r.t. , i.e. no proper superset is consistent. The following are the S-repairs:

A cardinality-repair, in short a C-repair, of w.r.t. is a maximum-cardinality S-repair of w.r.t. . is the only C-repair.

For an instance and a set of DCs, the sets of S-repairs and C-repairs are denoted with and , resp.

The definitions of S- and C-repairs can be generalized to sets of arbitrary ICs, for which both tuple deletions and insertions can be used as repair updates. This is the case, for example, for inclusion dependencies. In these cases, repairs do not have to be subinstances of the inconsistent instance at hand, . Accordingly, one considers the symmetric difference, , between and a potential repair . On this basis, an S-repair is an instance that is consistent with , and makes minimal under set inclusion. Furthermore, is a C-repair if it is an S-repair that also minimizes .  Cf. Bertossi2011 () for a survey of database repairs.
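For denial constraints, where only tuple deletions matter, S- and C-repairs of a small instance can be enumerated directly. The sketch below is a naive illustration under stated assumptions: the instance `D`, the predicates `P`, `Q`, `R`, and the two denial constraints encoded in `consistent` are hypothetical, and the enumeration is exponential.

```python
from itertools import combinations

# Hypothetical instance (an assumption of this sketch).
D = frozenset({("P", ("a",)), ("P", ("e",)),
               ("Q", ("a", "b")), ("R", ("a", "c"))})

def consistent(db):
    """Hypothetical DCs: no P(x) together with Q(x, y), nor with R(x, y)."""
    p = {args[0] for pred, args in db if pred == "P"}
    return not any(pred in ("Q", "R") and args[0] in p for pred, args in db)

def subset_repairs(db, is_consistent):
    """All subset-maximal consistent subinstances of db (S-repairs)."""
    db = frozenset(db)
    repairs = []
    # Larger candidates first: a consistent candidate is maximal exactly when
    # it is not strictly contained in a repair that was already recorded.
    for k in range(len(db), -1, -1):
        for kept in combinations(sorted(db), k):
            cand = frozenset(kept)
            if is_consistent(cand) and not any(cand < r for r in repairs):
                repairs.append(cand)
    return repairs

def cardinality_repairs(db, is_consistent):
    """S-repairs of maximum cardinality (C-repairs)."""
    reps = subset_repairs(db, is_consistent)
    best = max(len(r) for r in reps)
    return [r for r in reps if len(r) == best]
```

For this toy instance there are two S-repairs, one obtained by deleting `P(a)` and one by deleting both `Q(a, b)` and `R(a, c)`; only the former is a C-repair.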

2.4 Disjunctive answer-set programs

We consider disjunctive Datalog programs with stable model semantics eiterGottlob97 (), a particular class of answer-set programs (ASPs) asp (). They consist of a set of ground atoms, called the extensional database, and a finite number of rules of the form

(3)

with , and the are positive atoms. The arguments in these atoms are constants or variables. The variables in the appear all among those in the .

The constants in program form the (finite) Herbrand universe of the program. The ground version of program , , is obtained by instantiating the variables in with all possible combinations of values from . The Herbrand base, , of consists of all the possible ground atoms obtained by instantiating the predicates in on . A subset of is a (Herbrand) model of if it contains and satisfies , that is: For every ground rule of , if and , then . is a minimal model of if it is a model of , and no proper subset of is a model of . denotes the class of minimal models of .  This definition applies in particular to positive programs, i.e. that do not contain negated atoms in rule bodies (i.e. the antecedents of the implications).

Now, take , and transform into a new, positive program (i.e. without ), as follows: Delete every ground instantiation of a rule (3) for which . Next, transform each remaining ground instantiation of a rule (3) into . By definition, is a stable model of iff . A program may have none, one or several stable models; and each stable model is a minimal model (but not necessarily the other way around) gelfond ().
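The reduct-based definition of stable model can be tested on small ground programs. The following Python sketch assumes the usual rule shape for disjunctive programs (a disjunction of atoms in the head, and positive and negated atoms in the body); the triple encoding `(head, pos, neg)` and the example program are assumptions of this sketch, not notation from the paper. Facts are encoded as rules with an empty body.

```python
from itertools import combinations

def powerset(atoms):
    """All subsets of a set of atoms, as frozensets."""
    atoms = sorted(atoms)
    for k in range(len(atoms) + 1):
        for c in combinations(atoms, k):
            yield frozenset(c)

def is_model(m, positive_rules):
    """m satisfies each rule (head, pos): if pos is within m, head meets m."""
    return all((not pos <= m) or bool(head & m) for head, pos in positive_rules)

def reduct(rules, s):
    """Drop rules with a negated atom in s; strip negation from the others."""
    return [(head, pos) for head, pos, neg in rules if not (neg & s)]

def stable_models(rules):
    """All s such that s is a minimal model of the reduct of the program w.r.t. s."""
    atoms = set()
    for head, pos, neg in rules:
        atoms |= head | pos | neg
    result = []
    for s in powerset(atoms):
        red = reduct(rules, s)
        if is_model(s, red) and not any(
                is_model(m, red) for m in powerset(s) if m < s):
            result.append(s)
    return result

f = frozenset
# Hypothetical ground program:  p.   a v b <- p.   c <- a, not b.
PROGRAM = [
    (f({"p"}), f(), f()),
    (f({"a", "b"}), f({"p"}), f()),
    (f({"c"}), f({"a"}), f({"b"})),
]
```

Here `stable_models(PROGRAM)` yields the two sets `{p, a, c}` and `{p, b}`. The enumeration over all subsets of the Herbrand base is only feasible for tiny programs; ASP systems such as DLV use far more refined algorithms.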

Disjunctive answer-set programs have been used to specify database repairs barcelo (); monica (); Bertossi2011 (). We will use them in Section 5.

3  Causes and Database Repairs

In this section we concentrate first on tuple-based causes as introduced in Section 2.2, and establish a reduction to the tuple-based database repairs of Section 2.3. In Section 3.1, we provide an abstract definition of cause on the basis of an abstract repair-semantics.

Before proceeding in more technical terms, it is worth giving a general idea about what we do next. First, it is well-known that checking an integrity constraint (IC) on a database can be done by posing an associated query, or by defining a so-called violation view and checking its contents. The IC is satisfied as long as the query does not have an answer, i.e. it is false, or, equivalently, when the view has empty contents. In this way, repairs of the database w.r.t. the IC can be put in correspondence with causes for query answers (or view contents): If there is a violation of the IC by a tuple (or a combination thereof), then there is a cause (with a contingency set) for a query answer (that exists due to the IC violation); and vice versa.  In this direction, we establish next the right correspondence between tuples that are left outside a repair, with some companion tuples (for participating in a violation of the IC) and tuples that are causes for the query answer, with some contingent companion tuples.

Now we show in precise terms how causes (represented by database tuples) for queries can be obtained from database repairs tocs (). Consider the BCQ

that is (possibly unexpectedly) true in :  . Actual causes for , their contingency sets, and responsibilities can be obtained from database repairs. First, is logically equivalent to the denial constraint:

(4)

So, if is true in ,   is inconsistent w.r.t. , giving rise to repairs of w.r.t. .

Next, we build differences, containing a tuple , between and S- or C-repairs:

(5)
(6)

Proposition 1

tocs ()  For an instance , a BCQ , and its associated DC , it holds:

  1. is an actual cause for  iff  .

  2. For each S-repair with ,   is a subset-minimal contingency set for .

  3. If , then . Otherwise, , where and there is no with .

  4. is a most responsible actual cause for  iff  .

Example 6

(ex. 4 cont.)  With the same instance and query , we consider the DC

which is not satisfied by .

Here, and , with:

For tuple  ,   , with as a minimum-cardinality contingency set for , of size . So, is an actual cause, with responsibility .

Similarly, is an actual cause, with responsibility .  For tuple ,   .  So, is an actual cause, with responsibility , i.e. a most responsible cause.

Notice that is an actual cause whose minimum-cardinality contingency set is associated to an S-repair, , that is not a C-repair; whereas is a maximum-responsibility actual cause whose minimum-cardinality contingency set, the empty set, is associated to the C-repair .
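The route from repairs to causes in Proposition 1 can be sketched in Python as follows. The instance `D`, the predicates and the query `q` are hypothetical; the sketch assumes that the differences in (5) are the sets of tuples deleted by an S-repair, so that a tuple is an actual cause exactly when it belongs to some difference, and its responsibility is the inverse of the size of a smallest difference containing it.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical toy instance and BCQ (assumptions of this sketch).
D = frozenset({
    ("R", ("a4", "a3")), ("R", ("a2", "a1")), ("R", ("a3", "a3")),
    ("S", ("a4",)), ("S", ("a2",)), ("S", ("a3",)),
})

def q(db):
    """Hypothetical BCQ: exists x, y (S(x) and R(x, y) and S(y))."""
    s = {args[0] for pred, args in db if pred == "S"}
    return any(pred == "R" and args[0] in s and args[1] in s
               for pred, args in db)

def s_repairs(db, query):
    """S-repairs w.r.t. the DC that denies the query (naive enumeration)."""
    repairs = []
    for k in range(len(db), -1, -1):
        for kept in combinations(sorted(db), k):
            cand = frozenset(kept)
            if not query(cand) and not any(cand < r for r in repairs):
                repairs.append(cand)
    return repairs

def causes_with_responsibility(db, query):
    """Map each actual cause to its responsibility, read off the S-repairs."""
    diffs = [db - r for r in s_repairs(db, query)]
    result = {}
    for tau in db:
        sizes = [len(d) for d in diffs if tau in d]
        if sizes:  # tau is deleted by at least one S-repair
            result[tau] = Fraction(1, min(sizes))
    return result
```

On this toy instance there are three S-repairs; `S(a3)` is the only tuple deleted by the C-repair and gets responsibility 1, whereas `R(a4, a3)`, `R(a3, a3)` and `S(a4)` get responsibility 1/2.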

This connection between repairs and actual causes with their responsibilities can be extended to include actual causes for Unions of Boolean Conjunctive Queries (UBCQs) and repairs w.r.t. sets of DCs.

Example 7

(ex. 5 cont.)  With the same database , consider the query (a UBCQ) , with

It generates the set of DCs:  , with and as in (1) and (2), resp. Here, and, accordingly, is inconsistent w.r.t. .

The actual causes for in are: , , and , with the most responsible cause. The only S-repairs for are:

and is also the only C-repair.

It is also possible, the other way around, to characterize repairs in terms of causes and their contingency sets tocs (). Actually this latter connection can be used to obtain complexity results for causality problems from repair-related computational problems tocs (). Most computational problems related to repairs, especially C-repairs, which are related to most responsible causes, are provably hard. This is reflected in a high complexity for responsibility tocs ()  (cf. Section 8 for some more details).

3.1 Abstract causes from abstract repairs

We can extrapolate from the characterization of causes in terms of repairs that we have shown in this section, by starting from an abstract repair-semantics, , which identifies a class of intended repairs of instance w.r.t. the DC . By definition, contains instances for ’s schema that satisfy and depart from in an -dependent minimal way Bertossi2011 (). The most common repair semantics w.r.t. DCs is that of S-repairs, which are all subinstances of . However, the minimality criterion does not have to be based on set inclusion (as is the case for S-repairs). Even more, the repairs do not have to be subinstances of , even for DCs, as we will see in Section 4.

More concretely, given a possibly inconsistent instance , a general class of repair semantics can be characterized through an abstract partial-order relation, , on instances of ’s schema that is parameterized by .  If we want to emphasize this dependence on the priority relation  , we define the corresponding class of repairs of w.r.t. a set of ICs as:

(7)

This definition is general enough to capture different classes of repairs, and in relation to different kinds of ICs, e.g. those that delete old tuples and introduce new tuples to satisfy inclusion dependencies, and also repairs that change attribute values. In particular, it is easy to verify that the classes of S- and C-repairs for DCs of Section 2.3 are particular cases of this definition.

If we assume that the repairs provided by the abstract repair semantics are all sub-instances of , and we take inspiration from (5), we can introduce:

(8)

Definition 1

For an instance , a BCQ , and a class of repairs :

(a) is an actual S-cause for iff .

(b) For each with , is an S-contingency set for .

(c) The S-responsibility of an actual S-cause is as in Section 2.2, but considering only the cardinalities of S-contingency sets .

It should be clear that actual causes as defined in Section 3 are obtained from this definition by using S-repairs. Furthermore, it is also easy to see that each actual S-cause accompanied by one of its S-contingency sets falsifies query in .

This abstract definition can be instantiated with different repair-semantics, which leads to different notions of cause. It can also be modified in a natural way to define causes associated to repairs that may not be subinstances of the given instance. We will do this in the following subsection by appealing to attribute-based repairs that change attribute values in tuples by null, a null value that is assumed to be a special constant in the set of constants in the data domain associated to the database schema. This will allow us, in particular, to define causes at the attribute level (as opposed to tuple level) in a very natural manner.

A similar approach based on abstract repair semantics was taken in sum18 (); sum18corr () in order to introduce an abstract inconsistency measure of a database w.r.t. a set of ICs.  In Section 4, we instantiate the abstract semantics to define null-based causes from a particular but natural and practical notion of attribute-based repair.

4 Attribute-Based Causes

Causality in databases has been developed mostly in terms of tuples that are causes for query answers. However, this notion may be too coarse-grained, in that there could be certain attribute values in a tuple that have more impact than others on a query answer. Defining attribute-level causes could be done directly. Instead, in this section, we appeal to the abstract notion of cause as related to an also abstract notion of repair, as we did in Section 3.1. In order to do this, we appeal again to the connection between IC violation and the existence of answers to a query (cf. beginning of Section 3). However, instead of considering repairs of the database that are obtained by deletions of full tuples, we consider repairs that modify attribute values in tuples.

The most “neutral”, natural, and least arbitrary change on an attribute value one could attempt to eliminate a violation of an IC or a query answer (the idea behind actual causality) is the replacement of that value by a null value, in the spirit and the behavior of NULL in SQL databases. Accordingly, we first define repairs of databases in terms of changes of attribute values by a (unique and single) null value, and then we define causes at the attribute-value level, by identifying those attribute-values that are replaced by a null in a repair.

In order to identify attribute-values in tuples appearing in the original database, it is necessary to keep track of changes while having the capability to identify the original tuples. For this reason, we have to introduce unique, global and unchangeable identifiers in database tuples. Furthermore, since the only admissible changes are by a null value, it is good enough to identify the (attribute) position where such a change takes place. Only the value in the original tuple is relevant in the end. For this reason, in the following we represent values in tuples in the form , e.g. for the tuple refers to the value in the tuple (the identifier appears in the extra position ). Since the tuple ids are global, we could dispense with the predicate name if we wanted.

Database repairs that are based on changes of attribute values in tuples have been considered in Bertossi2011 (); tplp (); IS08 (), and implicitly in tkde () to hide sensitive information in a database via minimal virtual modifications of . In the rest of this section, we make explicit this latter approach and exploit it to define and investigate attribute-based causality (cf. also tocs ()). First we provide a motivating example.

Example 8

Consider the database instance

and the query

(9)

satisfies , i.e.  .

The three -tuples in are actual causes, but clearly the value for the first attribute of is what matters in them, because it enables the join, e.g. . This is only indirectly captured through the occurrence of different values accompanying in the second attribute of -tuples as causes for .  Now, consider the instance

where stands for the null value as used in SQL databases, which cannot be used to satisfy a join.  Now,  .  The same occurs with the instances ,  and   , among others that are obtained from only through changes of attribute values by null.

In the following we will assume the special constant may appear in database instances, and can be used to verify queries and constraints. We assume that all atoms with built-in comparisons, say , and , with a non-null constant, are all false for . In particular, since a join, say , can be written as , it can never be satisfied through null. This assumption is compatible with the use of NULL in SQL databases (cf. (tplp, sec. 4) for a detailed discussion, also (tkde, sec. 2)). However, it should be clear that these basic assumptions on “the logic” of null do not force us to bring SQL into our framework.
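The effect of these assumptions on joins can be illustrated with a small Python sketch. Everything in it is hypothetical: `None` stands in for the special constant null, `eq` encodes the assumption that a comparison involving null is false, and the relations `R`, `S` and the query are toy examples, not the ones of the paper.

```python
NULL = None  # stand-in for the special constant null (an assumption of this sketch)

def eq(a, b):
    """Equality under the stated assumption: false whenever null is involved."""
    return a is not NULL and b is not NULL and a == b

def join_holds(r_tuples, s_tuples):
    """Hypothetical BCQ: exists x, y (S(x) and R(x, y)).

    The join on x is written as R(x, y), S(x2), x = x2, and evaluated with eq,
    so it can never be satisfied through null.
    """
    return any(eq(x, x2) for (x, _y) in r_tuples for (x2,) in s_tuples)

# Hypothetical instance, and two variants with values replaced by null.
R = [("a3", "a1"), ("a3", "a4"), ("a3", "a5")]
S = [("a2",), ("a3",)]
R_NULLED = [(NULL, y) for (_x, y) in R]   # null in the first attribute of R
S_NULLED = [("a2",), (NULL,)]             # null in place of a3 in S
```

With these definitions the join holds on `(R, S)`, but fails both on `(R_NULLED, S)` and on `(R, S_NULLED)`, and also when null occurs on both sides, since null is not considered equal to itself.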

Consider an instance that may be inconsistent with respect to a set of DCs. The allowed repair updates are changes of attribute values by null, which is a natural choice, because this is a deterministic solution that appeals to the generic data value used in SQL databases to reflect the uncertainty and incompleteness in/of the database that inconsistency produces. As mentioned above, in order to keep track of changes, we introduce numerical arguments in tuples as global, unique tuple identifiers (tids), in position . So, becomes , with . With we denote the id of the tuple , i.e. .

If is updated to by replacement of (non-tid) attribute values by null, and the value of the -th attribute in , with , is changed into null, then the change is captured as the string , which identifies that the change was made in the tuple with id in the -th position (or attribute) of predicate .

More precisely, for a tuple ,   denotes the tuple that results from replacing the values in positions by null in . For example, for ,  .

These strings are collected in the set:

For example, if    is changed into   , then .

The use of null is particularly useful to restore consistency w.r.t. DCs, which involve combinations of (unwanted) joins.

Example 9

(ex. 8 cont.)  Still with the database instance

consider the DC corresponding to the negation of query in (9):  

Since , is inconsistent.  The updated instance

is consistent (among others obtained by updates with null), i.e.  .

Definition 2

A null-based repair of with respect to a set of DCs is a consistent instance , such that is minimal under set inclusion.  denotes the class of null-based repairs of with respect to .  A cardinality-null-based repair minimizes .

We can see that the null-based repairs are the minimal elements of the partial order between instances defined by:    iff  .
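Definition 2 can be explored by brute force on small instances. In the following Python sketch a set of changes is represented as a set of `(tid, position)` pairs, with positions of non-tid attributes counted from 1; this encoding, the instance `D`, and the denial constraint in `consistent` are assumptions made for illustration only.

```python
from itertools import combinations

NULL = None  # stand-in for null (an assumption of this sketch)

# Hypothetical instance with tuple ids: tid -> (predicate, attribute values).
D = {1: ("R", ("a3", "a1")), 2: ("R", ("a3", "a4")),
     3: ("S", ("a3",)), 4: ("S", ("a2",))}

def apply_changes(db, changes):
    """Replace by null the value at each (tid, position) in changes."""
    return {
        tid: (pred, tuple(NULL if (tid, i) in changes else v
                          for i, v in enumerate(vals, start=1)))
        for tid, (pred, vals) in db.items()
    }

def consistent(db):
    """Hypothetical DC: not exists x, y (S(x) and R(x, y)); null never joins."""
    s_vals = {vals[0] for pred, vals in db.values()
              if pred == "S" and vals[0] is not NULL}
    return not any(pred == "R" and vals[0] in s_vals
                   for pred, vals in db.values())

def null_based_repairs(db, is_consistent):
    """Change sets that restore consistency and are minimal under set inclusion."""
    positions = sorted((tid, i) for tid, (_p, vals) in db.items()
                       for i in range(1, len(vals) + 1))
    minimal = []
    # Smaller change sets first: a consistent change set is minimal exactly
    # when it does not strictly contain one that was already recorded.
    for k in range(len(positions) + 1):
        for ch in combinations(positions, k):
            ch = frozenset(ch)
            if is_consistent(apply_changes(db, ch)) and \
                    not any(m < ch for m in minimal):
                minimal.append(ch)
    return minimal
```

For the toy instance, the minimal change sets are `{(3, 1)}` and `{(1, 1), (2, 1)}`; the repaired instances are obtained with `apply_changes`. A cardinality-null-based repair would correspond to the first one, which changes a single value.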

Example 10

(ex. 6 cont.)  Consider instance with tuple ids now:

(Instance in tabular form, with tuple ids 1, 2, 3 in the first table and 4, 5, 6 in the second.)

Equivalently, . It is inconsistent w.r.t. the DC:

.

Using just for , and for , to keep the presentation more compact, the class of null-based repairs,  , consists of:

,

,

,

,

,

.

Then,