Principles of Progress Indicators for Database Repairing

Ester Livshits, Ihab F. Ilyas, Benny Kimelfeld, and Sudeepa Roy
Technion, Haifa, Israel
University of Waterloo, Waterloo, ON, Canada
Duke University, Durham, NC, USA
{esterliv, bennyk}@cs.technion.ac.il, ilyas@uwaterloo.ca, sudeepa@cs.duke.edu
Abstract

How should a cleaning system measure the amount of inconsistency in the database? Proper measures are important for quantifying the progress made in the cleaning process relative to the remaining effort and resources required. Similarly to effective progress bars in interactive systems, inconsistency should ideally diminish steadily and continuously while aiming to avoid jitters and jumps. Moreover, measures should be computationally tractable towards online applicability. Building on past research on inconsistency measures for knowledge bases, we embark on a systematic investigation of the rationality postulates of inconsistency measures in the context of database repairing systems. Such postulates should take into consideration the interplay between the database tuples, the integrity constraints, and the space of atomic repairing operations. We shed light on classic measures by examining them against a set of rationality postulates, and propose a measure that is both rational and tractable for the general class of denial constraints.

Principles of Progress Indicators for Database Repairing
(Extended Version)



1 Introduction

Inconsistency of databases may arise in a variety of situations, for a variety of reasons. Database records may be collected from imprecise sources (social encyclopedias/networks, sensors attached to appliances, cameras, etc.) via imprecise procedures (natural-language processing, signal processing, image analysis, etc.), or be integrated from different sources with conflicting information or formats. Common principled approaches to handling inconsistency consider databases that violate integrity constraints, but can nevertheless be repaired by means of operations that revise the database and resolve inconsistency [?]. Instantiations of these approaches differ in the supported types of integrity constraints and operations. The constraints may be Functional Dependencies (FDs) or the more general Equality-Generating Dependencies (EGDs) or, even more generally, Denial Constraints (DCs) and they may be referential (foreign-key) constraints or the more general inclusion dependencies [?]. A repair operation can be a deletion of a tuple, an insertion of a tuple, or an update of an attribute value. Operations may be associated with different costs, representing levels of trust in data items or extent of impact [?].

Various approaches and systems have been proposed for data cleaning and, in particular, data repairing (e.g., [???] to name a few). We explore the question of how to measure database inconsistency in a manner that reflects the progress of repairing. Such a measure is useful not just for implementing progress bars, but also for recommending next steps in interactive systems, and estimating the potential usefulness and cost of incorporating databases for downstream analytics [?]. Example measures include the number of violations in the database, the number of tuples involved in violations, and the number of operations needed to reach consistency. However, for a measure to effectively communicate progress indication in repairing, it should feature certain characteristics. For example, it should minimize jitters, jumps and sudden changes to have a good correlation with “the expected waiting time”—an important aspect in interacting with the users. It should also be computationally tractable so as not to compromise the efficiency of the repair and to allow for interactive systems. Luo et al. [?] enunciate the importance of these properties, referring to them as “acceptable pacing” and “minimal overhead,” respectively, in progress indicators for database queries.

As a guiding principle, we adopt the approach of rationality postulates of inconsistency measures over knowledge bases that have been investigated in depth by the Knowledge Representation (KR) and Logic communities [???????]. Yet, the studied measures and postulates fall short of capturing our desiderata, for various reasons. First, inconsistency is typically measured over a knowledge base of logical sentences (formulas without free variables). In databases, we reason about tuples (facts) and fixed integrity constraints, and inconsistency typically refers to the tuples rather than the constraints (which are treated as exogenous information). In particular, while a collection of sentences might form a contradiction, a set of tuples can be inconsistent only in the presence of integrity constraints. Hence, as recently acknowledged [?], it is of importance to seek inconsistency measures that are closer to database applications. Perhaps more fundamentally, in order to capture the repairing process and corresponding changes to the database, the measure should be aware of the underlying repairing operations (e.g., tuple insertion, deletion, revision).

For illustration, let us consider the case where all constraints are anti-monotonic (i.e., consistency cannot be violated by deleting tuples), and we allow only tuple deletions as repairing operations. One simple measure is the drastic measure $\mathcal{I}_{\mathsf{d}}$, which is $1$ if the database is inconsistent, and $0$ otherwise [?]. Of course, this measure hardly communicates progress, as it does not change until the very end. What about the measure that counts the number of problematic tuples, which are the tuples that participate in (minimal) witnesses of inconsistency [??]? This measure suffers from a disproportional reaction to repairing operations, since the removal of a single tuple (e.g., a misspelling of a country name) can cause a drastic reduction in inconsistency. As another example, take the measure that counts the number of maximal consistent subsets. This measure suffers from various problems: adding constraints can cause a reduction in inconsistency, it may fail to reflect change for any deletion of a tuple and, again, it may react disproportionally to a tuple deletion. Moreover, it is hard to compute (#P-complete) already for simple FDs [?].

Recently, the database theory community has proposed measures based on the concept of a minimal repair, that is, the minimal number of deletions needed to obtain consistency [?]. We refer to this measure as $\mathcal{I}_{\mathcal{R}}$. Our exploration shows that $\mathcal{I}_{\mathcal{R}}$ indeed satisfies the rationality criteria that we define later on, and so, we provide a formal justification to its semantics. Yet, it is again intractable (NP-hard) even for simple sets of FDs [?]. Interestingly, we are able to show that a linear relaxation of this measure, which we refer to as $\mathcal{I}^{\mathsf{lin}}_{\mathcal{R}}$, provides a combination of rationality and tractability.

We make a step towards formalizing the features and shortcomings of inconsistency measures such as the aforementioned ones. We consider four rationality postulates for database inconsistency in the context of a repairing system (i.e., a space of weighted repairing operations). Positivity: the measure is strictly positive if and only if the database is inconsistent. Monotonicity: inconsistency cannot be reduced by adding constraints. Progression: we can always find an operation that reduces inconsistency. Continuity: a single operation can have only a limited relative impact on inconsistency. We examine a collection of measures against these postulates, and show that $\mathcal{I}_{\mathcal{R}}$ stands out. Nevertheless, this measure is intractable. In particular, we show that computing $\mathcal{I}_{\mathcal{R}}$ is hard already for the case of a single EGD. Finally, we prove that for tuple deletions, $\mathcal{I}^{\mathsf{lin}}_{\mathcal{R}}$ satisfies all four postulates and is computable in polynomial time, even for the general case of arbitrary sets of denial constraints.

Our work is complementary to that of Grant and Hunter [?] who also studied repairing (or resolution) operators, but they focus on a different aspect of repairing—the trade-off between inconsistency reduction and information loss. An operation is beneficial if it causes a high reduction in inconsistency alongside a low loss of information. Instead, our focus here is on measuring progress of repairing, and it is an interesting future direction to understand how the two relate to each other and/or can be combined. Another complementary problem is that of associating individual facts with portions of the database inconsistency (e.g., the Shapley value of the fact) and using these portions to define preferences among repairs, as studied by Yun et al. [?].

In the remainder of the paper, we present preliminary concepts and terminology (Section 2), discuss inconsistency measures (Section 3), their rationality postulates (Section 4) and complexity aspects (Section 5), and make concluding remarks (Section 6).

Footnote 1: https://figshare.com/s/5dbb066b1fd79358fdbc

2 Preliminaries

We first give the basic terminology and concepts.

Relational Model

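A relation signature is a sequence $\alpha=(A_1,\dots,A_k)$ of distinct attributes $A_i$, where $k$ is the arity of $\alpha$. A relational schema (or just schema for short) $\mathbf{S}$ has a finite set of relation symbols $R$, each associated with a signature that we denote by $\mathit{sig}(R)$. If $\mathit{sig}(R)$ has arity $k$, then we say that $R$ is $k$-ary. A fact $f$ over $\mathbf{S}$ is an expression of the form $R(c_1,\dots,c_k)$, where $R$ is a $k$-ary relation symbol of $\mathbf{S}$, and $c_1,\dots,c_k$ are values. When $\mathbf{S}$ is irrelevant or clear from the context, we may call a fact over $\mathbf{S}$ simply a fact. If $f=R(c_1,\dots,c_k)$ is a fact and $\mathit{sig}(R)=(A_1,\dots,A_k)$, then we refer to the value $c_i$ as $f.A_i$.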

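A database $D$ over $\mathbf{S}$ is a mapping from a finite set $\mathit{ids}(D)$ of record identifiers to facts over $\mathbf{S}$. The set of all databases over a schema $\mathbf{S}$ is denoted by $\mathrm{DB}(\mathbf{S})$. We denote by $D[i]$ the fact that $D$ maps to the identifier $i$. A database $D$ is a subset of a database $D'$, denoted $D\subseteq D'$, if $\mathit{ids}(D)\subseteq\mathit{ids}(D')$ and $D[i]=D'[i]$ for all $i\in\mathit{ids}(D)$.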

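An integrity constraint is a first-order sentence over $\mathbf{S}$. A database $D$ satisfies a set $\Sigma$ of integrity constraints, denoted $D\models\Sigma$, if $D$ satisfies every constraint $\sigma\in\Sigma$. If $\Sigma$ and $\Sigma'$ are sets of integrity constraints, then we write $\Sigma\models\Sigma'$ to denote that $\Sigma$ entails $\Sigma'$; that is, every database that satisfies $\Sigma$ also satisfies $\Sigma'$. We also write $\Sigma\equiv\Sigma'$ to denote that $\Sigma$ and $\Sigma'$ are equivalent, that is, $\Sigma\models\Sigma'$ and $\Sigma'\models\Sigma$. By a constraint system we refer to a class $\mathbf{C}$ of integrity constraints (e.g., the class of all functional dependencies).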

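As a special case, a Functional Dependency (FD) $R:X\rightarrow Y$, where $R$ is a relation symbol and $X,Y\subseteq\mathit{sig}(R)$, states that every two facts that agree on (i.e., have equal values in) every attribute in $X$ should also agree on every attribute in $Y$. The more general Equality Generating Dependency (EGD) has the form $\forall\vec{x}\,[\varphi_1(\vec{x})\wedge\dots\wedge\varphi_k(\vec{x})\rightarrow(y_1=y_2)]$, where each $\varphi_j$ is an atomic formula over the schema and $y_1$ and $y_2$ are variables in $\vec{x}$. Finally, a Denial Constraint (DC) has the form $\forall\vec{x}\,\neg[\varphi_1(\vec{x})\wedge\dots\wedge\varphi_k(\vec{x})\wedge\psi(\vec{x})]$, where each $\varphi_j$ is an atomic formula and $\psi$ is a conjunction of atomic comparisons over $\vec{x}$.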
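The definitions above can be made concrete with a small sketch. It is only an illustration under stated assumptions: the schema `SIGNATURES`, the encoding of a fact as a `(relation, values)` pair, and the helper names `value`, `violates_fd`, and `satisfies` are all hypothetical and do not come from the paper.

```python
from itertools import combinations

# Hypothetical schema: one ternary relation symbol R with signature (A, B, C).
SIGNATURES = {"R": ("A", "B", "C")}

def value(fact, attribute):
    """Return f.A for a fact f = (relation, values)."""
    relation, values = fact
    return values[SIGNATURES[relation].index(attribute)]

def violates_fd(f, g, relation, lhs, rhs):
    """True if facts f and g jointly violate the FD relation: lhs -> rhs."""
    if f[0] != relation or g[0] != relation:
        return False
    agree_lhs = all(value(f, a) == value(g, a) for a in lhs)
    agree_rhs = all(value(f, a) == value(g, a) for a in rhs)
    return agree_lhs and not agree_rhs

def satisfies(db, fds):
    """True if the database (a mapping id -> fact) satisfies every FD."""
    for i, j in combinations(sorted(db), 2):
        for relation, lhs, rhs in fds:
            if violates_fd(db[i], db[j], relation, lhs, rhs):
                return False
    return True

# A database maps record identifiers to facts.
db = {1: ("R", ("a", 1, "x")), 2: ("R", ("a", 2, "x")), 3: ("R", ("b", 1, "y"))}
fds = [("R", ("A",), ("B",))]          # the FD R: A -> B
print(satisfies(db, fds))              # facts 1 and 2 agree on A but not on B
```

Because an FD is violated by pairs of facts, checking all pairs suffices here; a general DC would require checking larger combinations of facts.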

Repair Systems

Let $\mathbf{S}$ be a schema. A repairing operation (or just operation) $o$ transforms a database over $\mathbf{S}$ into another database over $\mathbf{S}$, that is, $o:\mathrm{DB}(\mathbf{S})\rightarrow\mathrm{DB}(\mathbf{S})$. An example is tuple deletion, denoted $\langle -i\rangle(\cdot)$, parameterized by a tuple identifier $i$ and applicable to a database $D$ if $i\in\mathit{ids}(D)$; the result is obtained from $D$ by deleting the tuple identifier $i$ (along with the corresponding fact $D[i]$). Another example is tuple insertion, denoted $\langle +f\rangle(\cdot)$, parameterized by a fact $f$; the result is obtained from $D$ by adding $f$ with a new tuple identifier. (For convenience, this is the minimal integer $i$ such that $i\notin\mathit{ids}(D)$.) A third example is attribute update, denoted $\langle i.A\leftarrow c\rangle(\cdot)$, parameterized by a tuple identifier $i$, an attribute $A$, and a value $c$, and applicable to $D$ if $i\in\mathit{ids}(D)$ and $A$ is an attribute of the fact $D[i]$; the result is obtained from $D$ by setting the value $D[i].A$ to $c$. By convention, if $o$ is not applicable to $D$, then it keeps $D$ intact, that is, $o(D)=D$.

A repair system (over a schema $\mathbf{S}$) is a collection of repairing operations with an associated cost of applying to a given database. For example, a smaller change of value might be less costly than a greater one [?], and some facts might be more costly than others to delete [??] or update [???]; changing a person’s zip code might be less costly than changing the person’s country, which, in turn, might be less costly than deleting the entire person’s record. Formally, a repair system $\mathcal{R}$ is a pair $(O,\kappa)$ where $O$ is a set of operations and $\kappa:O\times\mathrm{DB}(\mathbf{S})\rightarrow[0,\infty)$ is a cost function that assigns the cost $\kappa(o,D)$ to applying $o$ to $D$. We require that $\kappa(o,D)=0$ if and only if $D=o(D)$; that is, the cost is nonzero when, and only when, an actual change occurs.

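For a repair system $\mathcal{R}$, we denote by $\mathcal{R}^\star$ the repair system of all sequences of operations from $\mathcal{R}$, where the cost of a sequence is the sum of costs of the individual operations thereof. Formally, for $\mathcal{R}=(O,\kappa)$, the repair system $\mathcal{R}^\star$ is $(O^\star,\kappa^\star)$, where $O^\star$ consists of all compositions $o=o_m\circ\dots\circ o_1$, with $o_j\in O$ for all $j$, defined inductively by $o_m\circ\dots\circ o_1(D)=o_m(o_{m-1}\circ\dots\circ o_1(D))$ and $\kappa^\star(o_m\circ\dots\circ o_1,D)=\kappa^\star(o_{m-1}\circ\dots\circ o_1,D)+\kappa(o_m,o_{m-1}\circ\dots\circ o_1(D))$. Let $\mathbf{C}$ be a constraint system and $\mathcal{R}$ a repair system. We say that $\mathbf{C}$ is realizable by $\mathcal{R}$ if it is always possible to make a database satisfy constraints of $\mathbf{C}$ by repeatedly applying operations from $\mathcal{R}$. Formally, $\mathbf{C}$ is realizable by $\mathcal{R}$ if for every database $D$ and a finite set $\Sigma\subseteq\mathbf{C}$ there is a sequence $o$ in $\mathcal{R}^\star$ such that $o(D)\models\Sigma$. An example of $\mathbf{C}$ is the system of all FDs, denoted $\mathbf{C}_{\mathsf{FD}}$. An example of $\mathcal{R}$ is the subset system, denoted $\mathcal{R}_{\subseteq}$, where $O$ is the set of all tuple deletions (hence, the result is always a subset of the original database), and $\kappa$ is determined by a special cost attribute, $\kappa(\langle -i\rangle,D)=D[i].\mathsf{cost}$, if a cost attribute exists, and otherwise, $\kappa(\langle -i\rangle,D)=1$ (every tuple has unit cost for deletion). Observe that $\mathcal{R}_{\subseteq}$ realizes $\mathbf{C}_{\mathsf{FD}}$, since the latter consists of anti-monotonic constraints.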
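As a minimal sketch of the subset repair system described above, the following Python code models tuple deletion, its cost, and the accumulated cost of a sequence of operations. The names `delete`, `cost`, and `apply_sequence` are hypothetical, and the optional `weights` dictionary is an assumption that stands in for the special cost attribute.

```python
def delete(i):
    """Tuple-deletion operation parameterized by the identifier i."""
    def op(db):
        # By convention, an inapplicable operation leaves the database intact.
        return {j: f for j, f in db.items() if j != i}
    return op

def cost(op, db, weights=None):
    """Cost of applying op to db: zero exactly when nothing changes,
    otherwise the weight of the deleted tuple (unit cost by default)."""
    removed = set(db) - set(op(db))
    return sum((weights or {}).get(i, 1) for i in removed)

def apply_sequence(ops, db, weights=None):
    """Apply operations left to right; the cost of the sequence is the sum
    of the costs of its operations, each taken on the current database."""
    total = 0
    for op in ops:
        total += cost(op, db, weights)
        db = op(db)
    return db, total

db = {1: "f1", 2: "f2", 3: "f3"}
repaired, total = apply_sequence([delete(1), delete(1), delete(3)], db, {3: 5})
print(repaired, total)  # the second delete(1) changes nothing and costs 0
```

Charging each operation against the database it is actually applied to is what makes a repeated or inapplicable deletion free, in line with the requirement that the cost is nonzero only when an actual change occurs.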

3 Inconsistency Measures

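Let $\mathbf{S}$ be a schema and $\mathbf{C}$ a constraint system. An inconsistency measure is a function $\mathcal{I}$ that maps a finite set $\Sigma\subseteq\mathbf{C}$ of integrity constraints and a database $D$ to a number $\mathcal{I}(\Sigma,D)\in[0,\infty)$. Intuitively, a high $\mathcal{I}(\Sigma,D)$ implies that $D$ is far from satisfying $\Sigma$. We make two standard requirements:

  • $\mathcal{I}$ is zero on consistent databases; that is, $\mathcal{I}(\Sigma,D)=0$ whenever $D\models\Sigma$;

  • $\mathcal{I}$ is invariant under logical equivalence of constraints; that is, $\mathcal{I}(\Sigma,D)=\mathcal{I}(\Sigma',D)$ whenever $\Sigma\equiv\Sigma'$.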

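Next, we discuss several examples of inconsistency measures. Some of these measures (namely, $\mathcal{I}_{\mathsf{d}}$, $\mathcal{I}_{\mathsf{MI}}$, $\mathcal{I}_{\mathsf{P}}$ and $\mathcal{I}_{\mathsf{MC}}$) are adjusted from the survey of Thimm [?]. The simplest measure is the drastic inconsistency value, denoted $\mathcal{I}_{\mathsf{d}}$, which is the indicator function of inconsistency.

$$\mathcal{I}_{\mathsf{d}}(\Sigma,D)=\begin{cases}1 & \text{if } D\not\models\Sigma,\\ 0 & \text{otherwise.}\end{cases}$$

The next measure assumes an underlying repair system $\mathcal{R}$ and an underlying constraint system $\mathbf{C}$ such that $\mathbf{C}$ is realizable by $\mathcal{R}$. The measure $\mathcal{I}_{\mathcal{R}}$ is the minimal cost of a sequence of operations that repairs the database. It captures the intuition around various notions of repairs known as cardinality repairs and optimal repairs [???].

$$\mathcal{I}_{\mathcal{R}}(\Sigma,D)=\min\{\kappa^\star(o,D)\mid o\in O^\star \text{ and } o(D)\models\Sigma\}$$

Note that $\mathcal{I}_{\mathcal{R}}$ is the distance from satisfaction used in property testing [?] in the special case where the repair system consists of unit-cost insertions and deletions.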

Measures for anti-monotonic constraints.

The next measures apply to systems of anti-monotonic constraints. Recall that an integrity constraint $\sigma$ is anti-monotonic if for all databases $D$ and $D'$, if $D\subseteq D'$ and $D'\models\sigma$, then $D\models\sigma$. Examples of anti-monotonic constraints are the Denial Constraints (DCs) [?], the classic Functional Dependencies (FDs), conditional FDs [?], and Equality Generating Dependencies (EGDs) [?].

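For a set $\Sigma\subseteq\mathbf{C}$ of constraints and a database $D$, denote by $\mathrm{MI}_{\Sigma}(D)$ the set of all minimal inconsistent subsets of $D$; that is, the set of all $E\subseteq D$ such that $E\not\models\Sigma$ and, moreover, $E'\models\Sigma$ for all $E'\subsetneq E$. Again using our assumption that constraints are anti-monotonic, it holds that $D\models\Sigma$ if and only if $\mathrm{MI}_{\Sigma}(D)$ is empty. Drawing from known inconsistency measures [??], the measure $\mathcal{I}_{\mathsf{MI}}$, also known as MI Shapley Inconsistency, is the cardinality of this set: $\mathcal{I}_{\mathsf{MI}}(\Sigma,D)=|\mathrm{MI}_{\Sigma}(D)|$.

A fact $f$ that belongs to a minimal inconsistent subset (that is, $f\in\bigcup\mathrm{MI}_{\Sigma}(D)$) is called problematic, and the measure $\mathcal{I}_{\mathsf{P}}$ counts the problematic facts [?]: $\mathcal{I}_{\mathsf{P}}(\Sigma,D)=|\bigcup\mathrm{MI}_{\Sigma}(D)|$.

For a finite set $\Sigma$ of constraints and a database $D$, we denote by $\mathrm{MC}_{\Sigma}(D)$ the set of all maximal consistent subsets of $D$; that is, the set of all $E\subseteq D$ such that $E\models\Sigma$ and, moreover, $E'\not\models\Sigma$ whenever $E\subsetneq E'\subseteq D$. Observe that if $D\models\Sigma$, then $\mathrm{MC}_{\Sigma}(D)$ is simply the singleton $\{D\}$. Moreover, under the assumption that constraints are anti-monotonic, the set $\mathrm{MC}_{\Sigma}(D)$ is never empty (since, e.g., the empty set is consistent). The measure $\mathcal{I}_{\mathsf{MC}}$ is the cardinality of $\mathrm{MC}_{\Sigma}(D)$, minus one: $\mathcal{I}_{\mathsf{MC}}(\Sigma,D)=|\mathrm{MC}_{\Sigma}(D)|-1$.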
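The four set-based measures of this section (drastic, number of minimal inconsistent subsets, number of problematic facts, number of maximal consistent subsets minus one) can be sketched by brute force over subsets of tuple identifiers. This is an exponential-time illustration only, not an algorithm from the paper; the function names and the `consistent` predicate (a stand-in for satisfaction of the constraints by a subset) are assumptions.

```python
from itertools import combinations

def subsets(ids):
    """All subsets of ids, as frozensets, in order of increasing size."""
    ids = sorted(ids)
    for k in range(len(ids) + 1):
        for s in combinations(ids, k):
            yield frozenset(s)

def minimal_inconsistent(ids, consistent):
    """Inconsistent subsets all of whose proper subsets are consistent."""
    found = []
    for s in subsets(ids):  # smaller sets are visited first
        if not consistent(s) and not any(m < s for m in found):
            found.append(s)
    return found

def maximal_consistent(ids, consistent):
    """Consistent subsets with no consistent proper superset."""
    found = []
    for s in sorted(subsets(ids), key=len, reverse=True):  # larger sets first
        if consistent(s) and not any(s < m for m in found):
            found.append(s)
    return found

def i_d(ids, consistent):
    return 0 if consistent(frozenset(ids)) else 1

def i_mi(ids, consistent):
    return len(minimal_inconsistent(ids, consistent))

def i_p(ids, consistent):
    return len(set().union(*minimal_inconsistent(ids, consistent)))

def i_mc(ids, consistent):
    return len(maximal_consistent(ids, consistent)) - 1

# Hypothetical instance: tuple 1 conflicts with tuples 2 and 3; tuple 4 is clean.
conflicts = [frozenset({1, 2}), frozenset({1, 3})]
consistent = lambda s: not any(pair <= s for pair in conflicts)
ids = {1, 2, 3, 4}
print(i_d(ids, consistent), i_mi(ids, consistent),
      i_p(ids, consistent), i_mc(ids, consistent))
```

On this instance the minimal inconsistent subsets are the two conflicting pairs, the problematic facts are tuples 1, 2, and 3, and the maximal consistent subsets are {1, 4} and {2, 3, 4}.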

4 Rationality for Progress Indication

We now propose and discuss several properties (postulates) of general inconsistency measures that capture the rationale for usability for progress estimation in database repairing. We illustrate the satisfaction or violation of these postulates over the different measures that we presented in the previous section. The behavior of these measures with respect to the postulates is summarized in Table 1, which we discuss later on. The measures are all defined in Section 3, except for $\mathcal{I}^{\mathsf{lin}}_{\mathcal{R}}$ that we define later in Section 5.

A basic postulate is positivity, sometimes referred to as consistency [?].

Positivity: $\mathcal{I}(\Sigma,D)>0$ whenever $D\not\models\Sigma$.

For illustration, each of , , , and satisfies positivity, but not . For example, let consist of two facts, and , and consist of the constraint ; that is, is not in the database. Then since . Yet, in the case of FDs (i.e., ), every violation involves two facts, and so and positivity is satisfied.

The next postulate is monotonicity—inconsistency cannot decrease if the constraints get stricter.

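Monotonicity: $\mathcal{I}(\Sigma,D)\leq\mathcal{I}(\Sigma',D)$ whenever $\Sigma'\models\Sigma$.

For example, $\mathcal{I}_{\mathsf{d}}$ and $\mathcal{I}_{\mathcal{R}}$ satisfy monotonicity, since every repair w.r.t. $\Sigma'$ is also a repair w.r.t. $\Sigma$. The measures $\mathcal{I}_{\mathsf{MI}}$ and $\mathcal{I}_{\mathsf{P}}$ also satisfy monotonicity in the special case of FDs, since in this case $\mathcal{I}_{\mathsf{MI}}(\Sigma,D)$ is the number of fact pairs that jointly violate an FD, which can only increase when adding or strengthening FDs. Yet, they may violate monotonicity when going beyond FDs to the more general class of DCs.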

Proposition 1.

In the case of $\mathcal{I}_{\mathsf{MI}}$ and $\mathcal{I}_{\mathsf{P}}$, monotonicity can be violated already for the class of DCs.

Proof.

We begin with . Consider a schema with a single relation symbol, and for a natural number , let consist of a single DC stating that there are at most facts in the database. (The reader can easily verify that, indeed, can be expressed as a DC.) Then, whenever has facts. In particular, whenever and has facts, it holds that while .

We now consider . Let be a schema that contains two relation symbols and . Consider the following two EGDs (which are, of course, special cases of DCs):

Let and . Every set in is of size three, while the size of the sets in is two. Hence, in a database where (i.e., a database where is violated by if and only if is violated by ), it will hold that , while . ∎

The measure $\mathcal{I}_{\mathsf{MC}}$, on the other hand, can violate monotonicity even for functional dependencies.

Proposition 2.

In the case of $\mathcal{I}_{\mathsf{MC}}$, monotonicity can be violated already for the class of FDs.

Proof.

Let consist of these facts over :

Let and . Then and the following hold.

Hence, we have and , proving that monotonicity is violated. ∎

Positivity and monotonicity serve as sanity conditions that the measure at hand indeed quantifies the inconsistency in the database: it does not ignore inconsistency, and it does not reward strictness of constraints. The next two postulates aim to capture the rationale of using the inconsistency measure as progress indication for data repairing. Such a measure should not dictate the repairing operations to the data cleaner, but rather accompany the process with a number that suitably communicates progress in inconsistency elimination. As an example, the measure $\mathcal{I}_{\mathsf{d}}$ is useless in this sense, as it provides no indication of progress until the very end; in contrast, a useful progress bar progresses steadily and continuously. To this end, we propose two postulates that are aware of the underlying repair system as a model of operations. They are inspired by what Luo et al. [?] state informally as “continuously revised estimates” and “acceptable pacing.” Progression states that inconsistency can always diminish with a single operation, and continuity limits the ability of such an operation to have a relatively drastic effect.

More formally, progression states that, within the underlying repair system $\mathcal{R}=(O,\kappa)$, there is always a way to progress towards consistency, as we can find an operation $o$ of $O$ such that inconsistency reduces after applying $o$.

Progression: whenever $D\not\models\Sigma$, there is $o\in O$ such that $\mathcal{I}(\Sigma,o(D))<\mathcal{I}(\Sigma,D)$.

For illustration, let us restrict to the system $\mathcal{R}_{\subseteq}$ of subset repairs. Clearly, the measure $\mathcal{I}_{\mathsf{d}}$ violates progression. The measure $\mathcal{I}_{\mathcal{R}}$ satisfies progression, since we can always remove a fact from the minimum repair. The measure $\mathcal{I}_{\mathsf{MI}}$ satisfies progression, since we can always remove a fact $f$ that participates in one of the minimal inconsistent subsets and, by doing so, eliminate all the subsets that include $f$. When we remove a fact that appears in a minimal inconsistent subset, the measure $\mathcal{I}_{\mathsf{P}}$ decreases as well; hence, it satisfies progression. On the other hand, $\mathcal{I}_{\mathsf{MC}}$ may violate progression even for functional dependencies.

Proposition 3.

In the case of $\mathcal{I}_{\mathsf{MC}}$, progression can be violated already for the class of FDs and the system $\mathcal{R}_{\subseteq}$ of subset repairs.

Proof.

Consider again the database and the set from the proof of Proposition 2. As explained there, . The reader can easily verify that for every tuple deletion , it is still the case that . ∎
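To see the phenomenon on a concrete case, the following brute-force check uses a small hypothetical instance, which is not the database from the proof above: four facts over a relation with attributes (A, B, C) and the single FD A → B, whose conflict graph is a 4-cycle. The helper names are illustrative assumptions.

```python
from itertools import combinations

# Hypothetical instance: facts are (A, B, C) triples; the only constraint is A -> B.
db = {1: ("a", 1, "c1"), 2: ("a", 1, "c2"), 3: ("a", 2, "c1"), 4: ("a", 2, "c2")}

def consistent(ids):
    """No two retained facts agree on A while disagreeing on B."""
    return all(not (db[i][0] == db[j][0] and db[i][1] != db[j][1])
               for i, j in combinations(sorted(ids), 2))

def i_mc(ids):
    """Number of maximal consistent subsets, minus one (brute force)."""
    ids = sorted(ids)
    cons = [frozenset(s) for k in range(len(ids) + 1)
            for s in combinations(ids, k) if consistent(s)]
    maximal = [s for s in cons if not any(s < t for t in cons)]
    return len(maximal) - 1

def i_mi(ids):
    """Number of minimal inconsistent subsets; for FDs, the conflicting pairs."""
    return sum(1 for i, j in combinations(sorted(ids), 2)
               if not consistent({i, j}))

def progression_witnesses(measure, ids):
    """Identifiers whose deletion strictly reduces the measure."""
    ids = set(ids)
    base = measure(ids)
    return sorted(i for i in ids if measure(ids - {i}) < base)

print(i_mc(db), progression_witnesses(i_mc, db))  # no deletion helps
print(i_mi(db), progression_witnesses(i_mi, db))  # every deletion helps
```

Here the maximal consistent subsets are {1, 2} and {3, 4}; after deleting, say, tuple 1, they become {2} and {3, 4}, so the count stays at two and the measure does not move, whereas the count of conflicting pairs drops from four to two.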

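The last postulate we discuss is continuity which, as said above, limits the relative reduction of inconsistency in reaction to a single operation. This postulate is parameterized by a number $\delta\geq 1$ and it states that, for every two databases $D_1$ and $D_2$, and for each operation $o_1$ on $D_1$, we can find an operation $o_2$ on $D_2$ that is (almost) at least as impactful as $o_1$: it either eliminates all inconsistency or reduces inconsistency by at least $1/\delta$ of what $o_1$ does. More formally, we denote by $\Delta_{\mathcal{I},\Sigma}(o,D)$ the value $\mathcal{I}(\Sigma,D)-\mathcal{I}(\Sigma,o(D))$.

$\delta$-continuity: For all $\Sigma$, $D_1$, $D_2$ and $o_1\in O$, there exists $o_2\in O$ such that either $\mathcal{I}(\Sigma,o_2(D_2))=0$ or $\Delta_{\mathcal{I},\Sigma}(o_1,D_1)\leq\delta\cdot\Delta_{\mathcal{I},\Sigma}(o_2,D_2)$.

Note that the possibility of $\mathcal{I}(\Sigma,o_2(D_2))=0$ eliminates the case where a measure violates continuity just because of a situation where the database is only slightly inconsistent and a last step suffices to complete repairing.

This definition can be extended to the case where the inconsistency measure is aware of the cost of operations in the repair system $\mathcal{R}$. There, the change is relative to the cost of the operation. That is, we define the weighted version of $\delta$-continuity in the following way.

Weighted $\delta$-continuity: For all $\Sigma$, $D_1$, $D_2$ and $o_1\in O$, there exists $o_2\in O$ such that either $\mathcal{I}(\Sigma,o_2(D_2))=0$ or $\frac{\Delta_{\mathcal{I},\Sigma}(o_1,D_1)}{\kappa(o_1,D_1)}\leq\delta\cdot\frac{\Delta_{\mathcal{I},\Sigma}(o_2,D_2)}{\kappa(o_2,D_2)}$.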

We say that a measure $\mathcal{I}$ has bounded continuity, if there exists $\delta$ such that $\mathcal{I}$ satisfies $\delta$-continuity. Clearly, none of the measures discussed so far, except for $\mathcal{I}_{\mathcal{R}}$, satisfies (unweighted) bounded continuity.

Proposition 4.

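In the case of $\mathcal{I}_{\mathsf{d}}$, $\mathcal{I}_{\mathsf{MI}}$, $\mathcal{I}_{\mathsf{P}}$ and $\mathcal{I}_{\mathsf{MC}}$, bounded (unweighted) continuity can be violated already for the class of FDs and the system $\mathcal{R}_{\subseteq}$ of subset repairs.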

Proof.

Let and let be a database that contains the following facts over :

where for some and . The fact violates the FD with every fact , and for each , the facts and jointly violate the FD. All the facts in the database participate in a violation of the FD; hence, . In addition, it holds that .

Let the operation be the deletion of . Applying , we significantly reduce inconsistency w.r.t. these two measures, since none of the facts now participates in a violation; thus, and . However, every possible operation on the database only slightly reduces inconsistency (by two in the case of and by one in the case of ). Therefore, and , and the ratio between these two values depends on . Similarly, it holds that and , and again the ratio between these two values depends on .

As for $\mathcal{I}_{\mathsf{d}}$ and $\mathcal{I}_{\mathsf{MC}}$, we use Proposition 5 that we give later on. In the case of FDs, each of the two measures satisfies positivity but not progression (Proposition 3), and hence, they violate bounded continuity. ∎
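The disproportionate reaction used in this argument can be reproduced numerically. The sketch below uses two hypothetical families of databases under the single FD A → B, which are simpler than, and not identical to, the instance of the proof: a "star" in which one fact conflicts with n others, and a "matching" of m disjoint conflicting pairs. All names are illustrative assumptions.

```python
from itertools import combinations

def conflicts(db):
    """Pairs of ids violating the FD A -> B; facts are (A, B) pairs."""
    return [(i, j) for i, j in combinations(sorted(db), 2)
            if db[i][0] == db[j][0] and db[i][1] != db[j][1]]

def i_mi(db):
    """Number of minimal inconsistent subsets (conflicting pairs for FDs)."""
    return len(conflicts(db))

def i_p(db):
    """Number of problematic facts."""
    return len({i for pair in conflicts(db) for i in pair})

def delta(measure, db, i):
    """Reduction in inconsistency caused by deleting tuple i."""
    after = {j: f for j, f in db.items() if j != i}
    return measure(db) - measure(after)

def star(n):
    """Tuple 0 conflicts with each of the tuples 1..n."""
    db = {0: ("a", 0)}
    db.update({i: ("a", 1) for i in range(1, n + 1)})
    return db

def matching(m):
    """m disjoint pairs of conflicting tuples."""
    db = {}
    for j in range(m):
        db[2 * j] = (f"b{j}", 0)
        db[2 * j + 1] = (f"b{j}", 1)
    return db

n, m = 10, 3
d1, d2 = star(n), matching(m)
for measure in (i_mi, i_p):
    big = delta(measure, d1, 0)                      # delete the star's center
    best = max(delta(measure, d2, i) for i in d2)    # best single deletion in d2
    print(measure.__name__, big, best)
```

Deleting the center of the star reduces the count of conflicting pairs by n, while no single deletion in the matching reduces it by more than one or reaches consistency (for m ≥ 2), so the ratio between the two reductions grows with n and no fixed bound can hold.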

On the other hand, it is an easy observation that $\mathcal{I}_{\mathcal{R}}$ satisfies bounded continuity, and even bounded weighted continuity.

Table 1 summarizes which postulates are satisfied by the different inconsistency measures we discussed here, for the case of a system $\mathbf{C}$ of anti-monotonic constraints and the repair system $\mathcal{R}_{\subseteq}$. The last column refers to computational complexity and the last row refers to another measure, $\mathcal{I}^{\mathsf{lin}}_{\mathcal{R}}$, both discussed in Section 5.

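Measure                                                  Pos.  Mono.  Prog.  B. Cont.  PTime
$\mathcal{I}_{\mathsf{d}}$ (drastic)                     ✓     ✓      ✗      ✗         ✓
$\mathcal{I}_{\mathsf{MI}}$ (minimal inconsistent sets)  ✓     ✗      ✓      ✗         ✓
$\mathcal{I}_{\mathsf{P}}$ (problematic facts)           ✓     ✗      ✓      ✗         ✓
$\mathcal{I}_{\mathsf{MC}}$ (maximal consistent sets)    ✗     ✗      ✗      ✗         ✗
$\mathcal{I}_{\mathcal{R}}$ (minimal repair)             ✓     ✓      ✓      ✓         ✗
$\mathcal{I}^{\mathsf{lin}}_{\mathcal{R}}$ (relaxation)  ✓     ✓      ✓      ✓         ✓
Table 1: Satisfaction of rationality postulates for a system $\mathbf{C}$ of anti-monotonic constraints and the repairing system $\mathcal{R}_{\subseteq}$, and tractability for DCs (“PTime” column, assuming $\mathrm{P}\neq\mathrm{NP}$)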

Note that there are some dependencies among the postulates, as shown in the following easy proposition.

Proposition 5.

Suppose that the class $\mathbf{C}$ is realizable by the repair system $\mathcal{R}$, and let $\mathcal{I}$ be an inconsistency measure.

  • If $\mathcal{I}$ satisfies progression, then $\mathcal{I}$ satisfies positivity.

  • If $\mathcal{I}$ satisfies positivity and bounded continuity, then $\mathcal{I}$ satisfies progression.

The proof of Proposition 5 is in the Appendix.

5 Computational Complexity

We now discuss aspects of computational complexity, and begin with the complexity of measuring inconsistency according to the aforementioned example measures. We focus on the class of DCs and the special case of FDs. Moreover, we focus on data complexity, which means that the set $\Sigma$ of constraints is fixed, and only the database $D$ is given as input for the computation of $\mathcal{I}(\Sigma,D)$.

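The measure $\mathcal{I}_{\mathsf{d}}$ boils down to testing consistency, which is doable in polynomial time (under data complexity). The measures $\mathcal{I}_{\mathsf{MI}}$ and $\mathcal{I}_{\mathsf{P}}$ can be computed by enumerating all the subsets of $D$ of a bounded size, where this size is determined by $\Sigma$. Hence, $\mathcal{I}_{\mathsf{MI}}$ and $\mathcal{I}_{\mathsf{P}}$ can also be computed in polynomial time. Yet, the measures $\mathcal{I}_{\mathsf{MC}}$ and $\mathcal{I}_{\mathcal{R}}$ can be intractable to compute, already in the case of FDs, as we explain next.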

When $\Sigma$ is a set of FDs, $\mathcal{I}_{\mathsf{MC}}(\Sigma,D)$ is the number of maximal independent sets (minus one) of the conflict graph wherein the tuples of $D$ are the nodes, and there is an edge between every two tuples that violate an FD. Counting maximal independent sets is generally #P-complete, with several tractable classes of graphs such as the $P_4$-free graphs, that is, the graphs that do not contain any induced subgraph that is a path of length four. Under conventional complexity assumptions, the finite sets $\Sigma$ of FDs for which $\mathcal{I}_{\mathsf{MC}}(\Sigma,D)$ is computable in polynomial time are precisely the sets of FDs that entail a $P_4$-free conflict graph for every database $D$ [?].

For $\mathbf{C}=\mathbf{C}_{\mathsf{FD}}$ and $\mathcal{R}=\mathcal{R}_{\subseteq}$, the measure $\mathcal{I}_{\mathcal{R}}(\Sigma,D)$ is the size of the minimum vertex cover of the conflict graph. Again, this is a hard (NP-hard) computational problem on general graphs. In a recent work, it has been shown that there is an efficient procedure that takes as input a set $\Sigma$ of FDs and determines one of two outcomes: (a) $\mathcal{I}_{\mathcal{R}}(\Sigma,D)$ can be computed in polynomial time, or (b) $\mathcal{I}_{\mathcal{R}}(\Sigma,D)$ is NP-hard to compute (and even approximate beyond some constant) [?]. There, they have also studied the case where the repair system allows only to update cells (and not delete or insert tuples). In both repair systems it is the case that, if $\Sigma$ consists of a single FD per relation (which is a commonly studied case, e.g., key constraints [??]) then $\mathcal{I}_{\mathcal{R}}(\Sigma,D)$ can be computed in polynomial time. Unfortunately, this is no longer true (under conventional complexity assumptions) if we go beyond FDs to simple EGDs.
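The correspondence with vertex cover can be illustrated by an exhaustive search that is exponential in the number of tuples, in line with the hardness discussed above. This is a sketch under assumptions: the function name `min_repair_cost`, the representation of the conflict graph as a list of id pairs, and the optional `weight` dictionary (unit cost by default) are all hypothetical.

```python
from itertools import combinations

def min_repair_cost(ids, conflict_pairs, weight=None):
    """Minimum total weight of a set of tuples whose deletion removes every
    conflict, i.e., a minimum-weight vertex cover of the conflict graph."""
    weight = weight or {}
    ids = sorted(ids)
    best = None
    for k in range(len(ids) + 1):
        for removed in combinations(ids, k):
            r = set(removed)
            # r is a repair if every conflicting pair loses at least one tuple.
            if all(i in r or j in r for i, j in conflict_pairs):
                c = sum(weight.get(i, 1) for i in r)
                if best is None or c < best:
                    best = c
    return best

triangle = [(1, 2), (1, 3), (2, 3)]
star = [(0, 1), (0, 2), (0, 3)]
print(min_repair_cost({1, 2, 3}, triangle))           # two deletions are needed
print(min_repair_cost({0, 1, 2, 3}, star))            # deleting the center suffices
print(min_repair_cost({0, 1, 2, 3}, star, {0: 5}))    # an expensive center changes the choice
```

With weights, the cheapest repair need not be the smallest one: once the center of the star costs 5, deleting the three unit-cost leaves is preferable.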

Example 1.

Consider the following four EGDs.

Observe that is an FD whereas , and are not. The constraint states that there are no paths of length two except for two-node cycles, and states that there are no paths of length two except for single-node cycles. Computing w.r.t.  or can be done in polynomial time; however, the problem becomes NP-hard for and .∎

The following theorem gives a full classification of the complexity of computing $\mathcal{I}_{\mathcal{R}}$ for $\Sigma$ that consists of a single EGD with two binary atoms.

Theorem 1.

Let , and let be a set of constraints that contains a single EGD with two binary atoms. If is of the following form:

then computing is NP-hard. In any other case, can be computed in polynomial time.

The proof of hardness in Theorem 1 is by reduction from the problem of finding a maximum cut in a graph, which is known to be NP-hard. The full proof, as well as efficient algorithms for the tractable cases, are in the Appendix.

Note that the EGDs and from Example 1 satisfy the condition of Theorem 1; hence, computing w.r.t. these EGDs is indeed NP-hard. The EGDs and do not satisfy the condition of the theorem; thus, computing w.r.t. these EGDs can be done in polynomial time.

5.1 Rational and Tractable Measure

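We now propose a new inconsistency measure that applies to the special case where $\mathbf{C}$ is the class $\mathbf{C}_{\mathsf{DC}}$ of DCs (denial constraints) and $\mathcal{R}=\mathcal{R}_{\subseteq}$. Recall that a DC has the form $\forall\vec{x}\,\neg[\varphi(\vec{x})\wedge\psi(\vec{x})]$ where $\varphi(\vec{x})$ is a conjunction of atomic formulas over the schema, and $\psi(\vec{x})$ is a conjunction of comparisons over $\vec{x}$. Also recall that DCs generalize common classes of constraints such as FDs, conditional FDs, and EGDs.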

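$$\text{Minimize } \sum_{i\in\mathit{ids}(D)} x_i\cdot\kappa(\langle -i\rangle,D) \quad \text{subject to:}$$
$$\forall E\in\mathrm{MI}_{\Sigma}(D):\ \sum_{i\in\mathit{ids}(E)} x_i\geq 1 \qquad (1)$$
$$\forall i\in\mathit{ids}(D):\ x_i\in\{0,1\} \qquad (2)$$

Figure 1: ILP for $\mathcal{I}_{\mathcal{R}}(\Sigma,D)$ under $\mathbf{C}_{\mathsf{DC}}$ and $\mathcal{R}_{\subseteq}$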

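Let $D$ be a database and $\Sigma$ a finite set of DCs. For $\mathcal{R}=\mathcal{R}_{\subseteq}$, the measure $\mathcal{I}_{\mathcal{R}}(\Sigma,D)$ is the result of the Integer Linear Program (ILP) of Figure 1 wherein each $x_i$, for $i\in\mathit{ids}(D)$, determines whether to delete the $i$th tuple ($x_i=1$) or not ($x_i=0$). Denote by $\mathcal{I}^{\mathsf{lin}}_{\mathcal{R}}(\Sigma,D)$ the solution of the linear relaxation of this ILP, which is the Linear Program (LP) obtained from the ILP by replacing the last constraint (i.e., Equation (2)) with:

$$\forall i\in\mathit{ids}(D):\ 0\leq x_i\leq 1$$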

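It is easy to see that the relative rankings of the inconsistency measure values of two databases under $\mathcal{I}^{\mathsf{lin}}_{\mathcal{R}}$ and $\mathcal{I}_{\mathcal{R}}$ are consistent with each other if they have sufficient separation under the first one. More formally, for two databases $D_1,D_2$ we have that $\mathcal{I}^{\mathsf{lin}}_{\mathcal{R}}(\Sigma,D_1)\geq\mu\cdot\mathcal{I}^{\mathsf{lin}}_{\mathcal{R}}(\Sigma,D_2)$ implies that $\mathcal{I}_{\mathcal{R}}(\Sigma,D_1)\geq\mathcal{I}_{\mathcal{R}}(\Sigma,D_2)$, where $\mu$ is the integrality gap of the LP relaxation. The maximum number of tuples involved in a violation of a constraint in $\Sigma$ gives an upper bound on this integrality gap. In particular, for FDs (as well as for the EGDs in Example 1), this number is 2; hence, $\mathcal{I}^{\mathsf{lin}}_{\mathcal{R}}(\Sigma,D_1)\geq 2\cdot\mathcal{I}^{\mathsf{lin}}_{\mathcal{R}}(\Sigma,D_2)$ implies that $\mathcal{I}_{\mathcal{R}}(\Sigma,D_1)\geq\mathcal{I}_{\mathcal{R}}(\Sigma,D_2)$.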
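As a worked illustration of the gap between the integer program and its relaxation, consider a hypothetical database (not taken from the paper) with three facts $f_1=R(a,1)$, $f_2=R(a,2)$, $f_3=R(a,3)$, the single FD $R:A\rightarrow B$, and unit deletion costs. Every pair of facts is a minimal inconsistent subset, so the program has one covering constraint per pair.

```latex
\begin{align*}
\text{minimize } & x_1 + x_2 + x_3 \\
\text{subject to } & x_1 + x_2 \geq 1,\quad x_1 + x_3 \geq 1,\quad x_2 + x_3 \geq 1. \\[4pt]
\text{ILP } (x_i \in \{0,1\}):\ & \text{one deletion leaves a conflicting pair, so the optimum is } 2. \\
\text{LP } (0 \leq x_i \leq 1):\ & \text{summing the three constraints gives } 2(x_1+x_2+x_3) \geq 3, \\
 & \text{so the optimum is } \tfrac{3}{2}, \text{ attained at } x_1 = x_2 = x_3 = \tfrac{1}{2}.
\end{align*}
```

The ratio between the two optima is $2/(3/2)=4/3$, which is within the bound of 2 stated above for constraints whose violations involve at most two tuples.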

The following theorem shows that $\mathcal{I}^{\mathsf{lin}}_{\mathcal{R}}$ satisfies all four postulates and can be efficiently computed for the class $\mathbf{C}_{\mathsf{DC}}$ of denial constraints and the repair system $\mathcal{R}_{\subseteq}$.

Theorem 2.

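The following hold for $\mathbf{C}=\mathbf{C}_{\mathsf{DC}}$ and $\mathcal{R}=\mathcal{R}_{\subseteq}$.

  1. $\mathcal{I}^{\mathsf{lin}}_{\mathcal{R}}$ satisfies positivity, monotonicity, progression and constant weighted continuity.

  2. $\mathcal{I}^{\mathsf{lin}}_{\mathcal{R}}$ can be computed in polynomial time (in data complexity).

The proof of Theorem 2 is in the Appendix. It thus appears from Theorem 2 that, for tuple deletions and DCs, $\mathcal{I}^{\mathsf{lin}}_{\mathcal{R}}$ is a desirable inconsistency measure, as it satisfies the postulates we discussed in this paper and avoids the inherent hardness of $\mathcal{I}_{\mathcal{R}}$ (e.g., Theorem 1).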

6 Concluding Remarks

We presented a framework for measuring database inconsistency from the viewpoint of progress estimation in database repairing. In particular, we have discussed four rationality postulates, where two (progression and continuity) are defined in the context of the underlying repair system. We have also used the postulates to reason about various instances of inconsistency measures. In particular, the combination of the postulates and the computational complexity shed a positive light on the linear relaxation of minimal repairing. In future work, we plan to explore other rationality postulates as well as completeness criteria for sets of postulates to determine sufficiency for progress indication.

References

  • [Abiteboul et al., 2018] Serge Abiteboul, Marcelo Arenas, Pablo Barceló, Meghyn Bienvenu, Diego Calvanese, Claire David, Richard Hull, Eyke Hüllermeier, Benny Kimelfeld, Leonid Libkin, Wim Martens, Tova Milo, Filip Murlak, Frank Neven, Magdalena Ortiz, Thomas Schwentick, Julia Stoyanovich, Jianwen Su, Dan Suciu, Victor Vianu, and Ke Yi. Research directions for principles of data management (dagstuhl perspectives workshop 16151). Dagstuhl Manifestos, 7(1):1–29, 2018.
  • [Adler and Harwath, 2018] Isolde Adler and Frederik Harwath. Property testing for bounded degree databases. In STACS, volume 96 of LIPIcs, pages 6:1–6:14. Schloss Dagstuhl - Leibniz-Zentrum fuer Informatik, 2018.
  • [Afrati and Kolaitis, 2009] Foto N. Afrati and Phokion G. Kolaitis. Repair checking in inconsistent databases: algorithms and complexity. In ICDT, volume 361 of ACM International Conference Proceeding Series, pages 31–41. ACM, 2009.
  • [Arenas et al., 1999] Marcelo Arenas, Leopoldo E. Bertossi, and Jan Chomicki. Consistent query answers in inconsistent databases. In PODS, pages 68–79. ACM Press, 1999.
  • [Assadi et al., 2018] Ahmad Assadi, Tova Milo, and Slava Novgorodov. Cleaning data with constraints and experts. In WebDB, pages 1:1–1:6. ACM, 2018.
  • [Beeri and Vardi, 1981] Catriel Beeri and Moshe Y. Vardi. The implication problem for data dependencies. In ICALP, volume 115 of Lecture Notes in Computer Science, pages 73–85. Springer, 1981.
  • [Berlin and Motro, 2013] Jacob Berlin and Amihai Motro. Database schema matching using machine learning with feature selection. In Seminal Contributions to Information Systems Engineering, pages 315–329. Springer, 2013.
  • [Bertossi et al., 2005] Leopoldo E. Bertossi, Anthony Hunter, and Torsten Schaub. Introduction to inconsistency tolerance. In Inconsistency Tolerance [result from a Dagstuhl seminar], volume 3300 of Lecture Notes in Computer Science, pages 1–14. Springer, 2005.
  • [Bertossi et al., 2008] Leopoldo E. Bertossi, Loreto Bravo, Enrico Franconi, and Andrei Lopatenko. The complexity and approximation of fixing numerical attributes in databases under integrity constraints. Inf. Syst., 33(4-5):407–434, 2008.
  • [Bertossi, 2011] Leopoldo E. Bertossi. Database Repairing and Consistent Query Answering. Synthesis Lectures on Data Management. Morgan & Claypool Publishers, 2011.
  • [Bertossi, 2018] Leopoldo E. Bertossi. Measuring and computing database inconsistency via repairs. CoRR, abs/1804.08834, 2018.
  • [Bohannon et al., 2007] Philip Bohannon, Wenfei Fan, Floris Geerts, Xibei Jia, and Anastasios Kementsietsidis. Conditional functional dependencies for data cleaning. In ICDE, pages 746–755. IEEE, 2007.
  • [Brewka et al., 2011] Gerhard Brewka, Thomas Eiter, and Miroslaw Truszczynski. Answer set programming at a glance. Commun. ACM, 54(12):92–103, 2011.
  • [Burdick et al., 2016] Douglas Burdick, Ronald Fagin, Phokion G. Kolaitis, Lucian Popa, and Wang-Chiew Tan. A declarative framework for linking entities. ACM Trans. Database Syst., 41(3):17:1–17:38, 2016.
  • [Calautti et al., 2018] Marco Calautti, Leonid Libkin, and Andreas Pieris. An operational approach to consistent query answering. In PODS, pages 239–251. ACM, 2018.
  • [Casanova et al., 1984] Marco A. Casanova, Ronald Fagin, and Christos H. Papadimitriou. Inclusion dependencies and their interaction with functional dependencies. J. Comput. Syst. Sci., 28(1):29–59, 1984.
  • [Chu et al., 2013] Xu Chu, Ihab F. Ilyas, and Paolo Papotti. Discovering denial constraints. PVLDB, 6(13):1498–1509, 2013.
  • [Corneil et al., 1981] D. G. Corneil, H. Lerchs, and L. Stewart Burlingham. Complement reducible graphs. Discrete Applied Mathematics, 3(3):163–174, 1981.
  • [Ebaid et al., 2013] Amr Ebaid, Ahmed K. Elmagarmid, Ihab F. Ilyas, Mourad Ouzzani, Jorge-Arnulfo Quiané-Ruiz, Nan Tang, and Si Yin. NADEEF: A generalized data cleaning system. PVLDB, 6(12):1218–1221, 2013.
  • [Fagin et al., 2015] Ronald Fagin, Benny Kimelfeld, and Phokion G. Kolaitis. Dichotomies in the complexity of preferred repairs. In PODS, pages 3–15. ACM, 2015.
  • [Fan et al., 2011] Wenfei Fan, Jianzhong Li, Shuai Ma, Nan Tang, and Wenyuan Yu. Cerfix: A system for cleaning data with certain fixes. PVLDB, 4(12):1375–1378, 2011.
  • [Fuxman and Miller, 2007] Ariel Fuxman and Renée J. Miller. First-order query rewriting for inconsistent databases. J. Comput. Syst. Sci., 73(4):610–635, 2007.
  • [Gaasterland et al., 1992] Terry Gaasterland, Parke Godfrey, and Jack Minker. An overview of cooperative answering. J. Intell. Inf. Syst., 1(2):123–157, 1992.
  • [Gardezi et al., 2011] Jaffer Gardezi, Leopoldo E. Bertossi, and Iluju Kiringa. Matching dependencies with arbitrary attribute values: semantics, query answering and integrity constraints. In LID, pages 23–30, 2011.
  • [Geerts et al., 2013] Floris Geerts, Giansalvatore Mecca, Paolo Papotti, and Donatello Santoro. The LLUNATIC data-cleaning framework. PVLDB, 6(9):625–636, 2013.
  • [Geerts et al., 2014] Floris Geerts, Giansalvatore Mecca, Paolo Papotti, and Donatello Santoro. Mapping and cleaning. In ICDE, pages 232–243. IEEE, 2014.
  • [Goldreich et al., 1998] Oded Goldreich, Shafi Goldwasser, and Dana Ron. Property testing and its connection to learning and approximation. J. ACM, 45(4):653–750, 1998.
  • [Grant and Hunter, 2006] John Grant and Anthony Hunter. Measuring inconsistency in knowledgebases. J. Intell. Inf. Syst., 27(2):159–184, 2006.
  • [Grant and Hunter, 2017] John Grant and Anthony Hunter. Analysing inconsistent information using distance-based measures. Int. J. Approx. Reasoning, 89:3–26, 2017.
  • [Gribkoff et al., 2014] Eric Gribkoff, Guy Van den Broeck, and Dan Suciu. The most probable database problem. In BUDA, 2014.
  • [Hunter and Konieczny, 2008] Anthony Hunter and Sébastien Konieczny. Measuring inconsistency through minimal inconsistent sets. In KR, pages 358–366. AAAI Press, 2008.
  • [Hunter and Konieczny, 2010] Anthony Hunter and Sébastien Konieczny. On the measure of conflicts: Shapley inconsistency values. Artif. Intell., 174(14):1007–1026, 2010.
  • [Ilyas and Chu, 2015] Ihab F. Ilyas and Xu Chu. Trends in Cleaning Relational Data: Consistency and Deduplication. Foundations and Trends in Databases. Now Publishers, 2015.
  • [Khachiyan, 1979] Leonid G. Khachiyan. A polynomial algorithm in linear programming. In Doklady Akademii Nauk SSSR, volume 244, pages 1093–1096, 1979.
  • [Kimelfeld and Ré, 2018] Benny Kimelfeld and Christopher Ré. A relational framework for classifier engineering. SIGMOD Record, 47(1):6–13, 2018.
  • [Kimelfeld et al., 2017] Benny Kimelfeld, Ester Livshits, and Liat Peterfreund. Detecting ambiguity in prioritized database repairing. In ICDT, volume 68 of LIPIcs, pages 17:1–17:20. Schloss Dagstuhl - Leibniz-Zentrum fuer Informatik, 2017.
  • [Kimelfeld et al., 2018] Benny Kimelfeld, Phokion G. Kolaitis, and Julia Stoyanovich. Computational social choice meets databases. In IJCAI, pages 317–323. ijcai.org, 2018.
  • [Knight, 2002] Kevin Knight. Measuring inconsistency. J. Philosophical Logic, 31(1):77–98, 2002.
  • [Knight, 2003] Kevin M. Knight. Two information measures for inconsistent sets. Journal of Logic, Language and Information, 12(2):227–248, 2003.
  • [Kolahi and Lakshmanan, 2009] Solmaz Kolahi and Laks V. S. Lakshmanan. On approximating optimum repairs for functional dependency violations. In ICDT, volume 361, pages 53–62. ACM, 2009.
  • [Konieczny et al., 2003] Sébastien Konieczny, Jérôme Lang, and Pierre Marquis. Quantifying information and contradiction in propositional logic through test actions. In IJCAI, pages 106–111. Morgan Kaufmann, 2003.
  • [Koutris and Suciu, 2014] Paraschos Koutris and Dan Suciu. A dichotomy on the complexity of consistent query answering for atoms with simple keys. In ICDT, pages 165–176. OpenProceedings.org, 2014.
  • [Livshits and Kimelfeld, 2017] Ester Livshits and Benny Kimelfeld. Counting and enumerating (preferred) database repairs. In PODS, pages 289–301, 2017.
  • [Livshits et al., 2018] Ester Livshits, Benny Kimelfeld, and Sudeepa Roy. Computing optimal repairs for functional dependencies. In PODS, pages 225–237. ACM, 2018.
  • [Lopatenko and Bertossi, 2007] Andrei Lopatenko and Leopoldo E. Bertossi. Complexity of consistent query answering in databases under cardinality-based and incremental repair semantics. In ICDT, pages 179–193, 2007.
  • [Lozinskii, 1994] Eliezer L. Lozinskii. Resolving contradictions: A plausible semantics for inconsistent systems. J. Autom. Reasoning, 12(1):1–32, 1994.
  • [Maslowski and Wijsen, 2013] Dany Maslowski and Jef Wijsen. A dichotomy in the complexity of counting database repairs. J. Comput. Syst. Sci., 79(6):958–983, 2013.
  • [Perlich and Provost, 2006] Claudia Perlich and Foster J. Provost. Distribution-based aggregation for relational learning with identifier attributes. Machine Learning, 62(1-2):65–105, 2006.
  • [Raman and Hellerstein, 2001] Vijayshankar Raman and Joseph M. Hellerstein. Potter’s wheel: An interactive data cleaning system. In VLDB, pages 381–390. Morgan Kaufmann, 2001.
  • [Rekatsinas et al., 2017] Theodoros Rekatsinas, Xu Chu, Ihab F. Ilyas, and Christopher Ré. Holoclean: Holistic data repairs with probabilistic inference. PVLDB, 10(11):1190–1201, 2017.
  • [Richardson and Domingos, 2006] Matthew Richardson and Pedro M. Domingos. Markov logic networks. Machine Learning, 62(1-2):107–136, 2006.
  • [Sa et al., 2018] Christopher De Sa, Ihab F. Ilyas, Benny Kimelfeld, Christopher Ré, and Theodoros Rekatsinas. A formal framework for probabilistic unclean databases. CoRR, abs/1801.06750, 2018. To appear in ICDT 2019.
  • [Sebastian-Coleman, 2012] Laura Sebastian-Coleman. Measuring data quality for ongoing improvement: a data quality assessment framework. Newnes, 2012.
  • [Staworko et al., 2012] Slawek Staworko, Jan Chomicki, and Jerzy Marcinkowski. Prioritized repairing and consistent query answering in relational databases. Ann. Math. Artif. Intell., 64(2-3):209–246, 2012.
  • [Thimm, 2017] Matthias Thimm. On the compliance of rationality postulates for inconsistency measures: A more or less complete picture. KI, 31(1):31–39, 2017.
  • [Vardi, 1982] Moshe Y. Vardi. The complexity of relational query languages (extended abstract). In STOC, pages 137–146. ACM, 1982.
  • [Yun et al., 2018] Bruno Yun, Srdjan Vesic, Madalina Croitoru, and Pierre Bisquert. Inconsistency measures for repair semantics in OBDA. In IJCAI, pages 1977–1983. ijcai.org, 2018.

Appendix A Additional Proofs

A.1 Proof of Proposition 5

Proposition 5. Suppose that the class is realizable by the repair system , and let be an inconsistency measure.

  • If satisfies progression, then satisfies positivity.

  • If satisfies positivity and bounded continuity, then satisfies progression.

Proof.

The first part of the proposition holds since if satisfies progression, then for every inconsistent database there exists an operation such that is strictly lower than , which implies that .

For the second part, let us assume, by way of contradiction, that does not satisfy progression. Then, there exists a database and a set of constraints such that , and for every operation on it holds that . In addition, satisfies positivity; thus, it holds that . Since is realizable by , there exists a sequence of operations from , such that is consistent (hence, ). We conclude that there exists an operation such that , but there is no such operation on . This contradicts the assumption that satisfies bounded continuity, which concludes the proof. ∎
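The step from the repairing sequence to a strictly improving operation can be spelled out as a telescoping sum. The following is only a sketch in assumed notation, since the symbols are not reproduced in this text: \mathcal{I} stands for the measure, \Sigma for the set of constraints, D_0 = D for the database from the contradiction hypothesis, and D_j = o_j(D_{j-1}) for the database after the j-th operation of the sequence, with D_n consistent.

```latex
% Assumed notation (labels are illustrative, not taken from the source):
%   D_0 = D,  D_j = o_j(D_{j-1}),  D_n is consistent, so I(Sigma, D_n) = 0.
\[
  \sum_{j=1}^{n} \bigl( \mathcal{I}(\Sigma, D_{j-1}) - \mathcal{I}(\Sigma, D_{j}) \bigr)
  \;=\; \mathcal{I}(\Sigma, D_0) - \mathcal{I}(\Sigma, D_n)
  \;=\; \mathcal{I}(\Sigma, D) - 0 \;>\; 0 ,
\]
% where the last inequality is positivity applied to the inconsistent D.
% A positive sum has at least one positive term, so for some index j:
\[
  \mathcal{I}\bigl(\Sigma, o_j(D_{j-1})\bigr) \;<\; \mathcal{I}(\Sigma, D_{j-1}) .
\]
```

Under this reading, some operation strictly reduces the measure on a database along the sequence, while no operation reduces it on D. Assuming bounded continuity limits the reduction achievable on one database by a constant multiple of the reduction achievable on any other, a positive reduction on D_{j-1} cannot coexist with a reduction of at most zero on D.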

A.2 Proof of Theorem 1

Theorem 1. Let , and let be a set of constraints that contains a single EGD with two binary atoms. If is of the following form:

then computing is NP-hard. In any other case, can be computed in polynomial time.

We start by proving the negative side of the theorem. That is, we prove the following.

Lemma 1.

Let and let where is an EGD of the form . Then, computing is NP-hard.

Proof.

We build a reduction from the MaxCut problem to the problem of computing for . The MaxCut problem asks for a cut in a graph (i.e., a partition of the vertices into two disjoint subsets) that maximizes the number of edges crossing the cut. This problem is known to be NP-hard.
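To make the source problem of the reduction concrete, here is a minimal brute-force sketch of MaxCut in Python. It is illustrative only and not part of the proof: the function names `max_cut` and `has_cut_of_size`, and the encoding of vertices as integers 0..n-1, are assumptions made for this example. The exhaustive search is exponential in the number of vertices, which is consistent with the problem being NP-hard.

```python
from itertools import product


def max_cut(num_vertices, edges):
    """Exhaustively search all 2-partitions of vertices 0..num_vertices-1.

    Returns (best_size, side), where side[v] is 0 or 1 and best_size is the
    number of edges whose endpoints lie on different sides.
    """
    best_size, best_side = -1, None
    # Pin vertex 0 to side 0: swapping the two sides gives the same cut.
    for rest in product((0, 1), repeat=max(num_vertices - 1, 0)):
        side = (0,) + rest if num_vertices else ()
        size = sum(1 for u, v in edges if side[u] != side[v])
        if size > best_size:
            best_size, best_side = size, side
    return best_size, best_side


def has_cut_of_size(num_vertices, edges, k):
    """Decision version used in reductions: is there a cut with >= k edges?"""
    return max_cut(num_vertices, edges)[0] >= k


if __name__ == "__main__":
    triangle = [(0, 1), (1, 2), (0, 2)]
    print(max_cut(3, triangle))
```

The decision variant `has_cut_of_size` mirrors the "cut of size at least" phrasing that the proof uses when relating cuts to the cost of a repair.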

Given a graph , with vertices and edges, we construct an input to our problem (that is, a database ) as follows. For each vertex we add the following two facts to the database:

Moreover, for each edge , we add the following two facts to the database:

Note that for each vertex , the facts and jointly violate the EGD. Moreover, two facts of the form and jointly violate the EGD, and two facts of the form and jointly violate the EGD. Finally, two facts of the form and violate the EGD with each other. These are the only violations of the EGD in the database.

We set the cost to be when the operation is a deletion of a fact of the form , and we set to be when the operation is a deletion of a fact of the form or . We now prove that there is a cut of size at least , if and only if

First, assume that there exists a cut of size in the graph that partitions the vertices into two groups, and . In this case, we can remove the following facts from to obtain a consistent subset .

  • if ,

  • if ,

  • if either or have not been removed.

Each vertex belongs to either or ; hence, we remove exactly one of the facts and for each , and resolve the conflict between these two facts. The cost of removing these facts is . Next, for each edge such that both and belong to the same subset , we remove both and , since the first violates the EGD with if or with if , and the second violates the EGD with if or with if . The cost of removing these facts is .

Finally, for each edge that crosses the cut, we remove one of or from the database. If and , then we have already removed the facts and from the database; thus, the fact does not violate the EGD with any other fact, and we only have to remove the fact that violates the EGD with both and . Similarly, if and , we only remove the fact . The cost of removing these facts is . Hence, the total cost of removing all these facts is .

Clearly, the result is a consistent subset of . As explained above, we have resolved the conflict between and for each , and we have resolved the conflict between every pair of conflicting facts and every pair of conflicting facts. Finally, there are no conflicts among facts and in since either belongs to , in which case the fact appears in and has been removed from as a result, or it belongs to , in which case the fact appears in and has been removed from as a result. Thus, the minimal cost of obtaining a consistent subset of (i.e., a repair) is at most .

Next, we assume that , and we prove that there exists a cut of size at least in the graph. First, note that if there exists a consistent subset of that can be obtained with cost at most , such that both and have been deleted, then we can obtain another consistent subset of with a lower cost by removing only and removing all the facts of the form instead of removing . There are at most such facts (if appears in every edge) and the cost of removing them is at most , while the cost of removing is . Hence, from now on we assume that the subset contains exactly one fact from for each .

Now, we construct a cut in the graph from in the following way. For each , if the fact belongs to , then we put in , and if the fact belongs to , then we put in . As mentioned above, exactly one of these two cases holds for each . It remains to show that the size of the cut is at least . Since the cost of removing the facts of the form and is , and the cost of removing each fact of the form is one, at most facts of the form have been removed from to obtain . There are facts of the form in ; thus, at least of them belong to .

For each fact in </