A Framework for Assessing Achievability Of Data-Quality Constraints

A Framework for Assessing Achievability Of Data-Quality Constraints

Rada Chirkova Computer Science Department, North Carolina State University
North Carolina, USA
chirkova@csc.ncsu.edu, Jon_Doyle@ncsu.edu, jreutter@ing.puc.cl
Jon Doyle Computer Science Department, North Carolina State University
North Carolina, USA
chirkova@csc.ncsu.edu, Jon_Doyle@ncsu.edu, jreutter@ing.puc.cl
Juan L. Reutter Computer Science Department, North Carolina State University
North Carolina, USA
chirkova@csc.ncsu.edu, Jon_Doyle@ncsu.edu, jreutter@ing.puc.cl
Abstract

Assessing and improving the quality of data are fundamental challenges for data-intensive systems that have given rise to numerous applications targeting transformation and cleaning of data. However, while schema design, data cleaning, and data migration are nowadays reasonably well understood in isolation, not much attention has been given to the interplay between the tools addressing issues in these areas. We focus on the problem of determining whether the available data-processing procedures can be used together to bring about the desired quality characteristics of the given data. For an illustration, consider an organization that is introducing new data-analysis tasks. Depending on the tasks, it may be a priority for the organization to determine whether its data can be processed and transformed using the available data-processing tools to satisfy certain properties or quality assurances needed for the success of the task. Here, while the organization may control some of its tools, some other tools may be external or proprietary, with only basic information available on how they process data. The problem is then, how to decide which tools to apply, and in which order, to make the data ready for the new tasks?

Toward addressing this problem, we develop a new framework that abstracts data-processing tools as black-box procedures with only some of the properties exposed, such as the applicability requirements, the parts of the data that the procedure modifies, and the conditions that the data satisfy once the procedure has been applied. We show how common database tasks such as data cleaning and data migration are encapsulated into our framework and, as a proof of concept, we study basic properties of the framework for the case of procedures described by standard relational constraints. We show that, while reasoning in this framework may be computationally infeasible in general, there exist well-behaved special cases with potential practical applications.

Authors’ Instructions

1 Introduction

A common approach to ascertaining and improving the quality of data is to develop procedures and workflows for repairing or improving data sets with respect to quality constraints. The community has identified a wide range of data-management problems that are vital in this respect, leading to the creation of several lines of studies, which have normally been followed by the development of toolboxes of applications that practitioners can use to solve their problems. This has been the case, for example, for the Extract-Transform-Load (ETL) [13, 18] process in business applications, or for the development of automatic tools to reason about the completeness or cleanliness of the data [15].

As a result, organizations facing data-improvement problems now have access to a variety of data-management tools to choose from; the tools can be assembled to construct so-called workflows of data operations. However, in contrast with the considerable body of research on particular data operations, or even entire business workflows (see, e.g., [12, 8, 11, 3]), previous research appears to have not focused explicitly either on the assembly process itself or on providing guarantees that the desired data-quality constraints will be satisfied once the assembled workflow of procedures has been applied to the available data.

We investigate the problem of constructing workflows from already available procedures. That is, we consider a scenario in which an organization needs to meet a certain data-quality criterion or goal using available data-improvement procedures. In this case, the problem is to understand whether these procedures can be assembled into a data-improvement workflow in a way that would guarantee that the data set produced by the workflow will effectively meet the desired quality goal.

Motivating example: Suppose that data stored in a medical-data aggregator (such as, e.g., Premier [27]) are accessed to perform a health-outcomes analysis in population health management [19, 25, 32], focusing on repeat emergency-room visits in the Washington, DC area. The goal of the analysis is to see whether there is a relationship between such repeat visits and ages and zip codes of patients.

We assume that the aggregator imports information about emergency-room visits from a number of facilities, and stores the information using a relation Visits with attributes facility and facilityLoc for the ID and location of the medical facility, patInsur for the patient insurance number, and timestp for the date and time of the visit. We also assume that medical-record information imported from each facility is stored at the aggregator in a relation Patients, with attributes facility, patInsur, name, age, zipCode, and so on.

The analyst plans to isolate information about emergency-room visits for the Washington area in a relation LocVisits, which would have all the attributes of Visits except facilityLoc, as the values of the latter are understood to be fixed. Further, to obtain the age and zip code of patients, the analyst also needs to integrate the data in Visits with those of Patients.

To process the data, the analyst has access to some procedures that are part of the aggregator’s everyday business. For example, the aggregator periodically runs a StandardizePatientInfo procedure, which first performs entity resolution on insurance IDs in Patients, using both the values of all the patient-related attributes in that relation and a separate “master” relation InsurerInfo that stores authoritative patient information from insurance companies, and then merges the results into Visits. Further, the aggregator offers a procedure MigrateIntoLocVisits that will directly populate LocVisits with the relevant information about emergency rooms (but not the age and zip code of patients).

The analyst is now facing a number of choices, some of which we list here:

• Use the StandardizePatientInfo procedure on Patients, then manually import the correct(ed) information into LocVisits, and finally join this relation with Patients.

• Run MigrateIntoLocVisits to get the relevant patient information, and then join with Patients without running the procedure StandardizePatientInfo.

• Add age and zipCode attributes to LocVisits, get the information into LocVisits as in (ii), and then try to modify StandardizePatientInfo into operating directly on LocVisits.

Which of these options is the best for the planned analysis? Option (i) seems to be the cleanest, but if the analyst suspects that StandardizePatientInfo may produce some loss of data, then going with (ii) or (iii) might be a better option. Further, suppose the analyst also has access to a separate relation HealthcareInfo from a health NGO, with information about emergency-room visits gathered from other independent sources. Then the analyst could pose the following quality criterion on the assembled workflow: The result of the workflow should provide at least the information that can be obtained from the relation HealthcareInfo. How could one guarantee that such a criterion will be met?

Contributions: Our goal is to develop a general framework that can be used to determine whether the available data-processing tools can be put together into a workflow capable of producing data that meet the desired quality properties. To address this problem, we abstract data-processing tools as black-box procedures that expose only certain properties. The properties of interest include (i) preconditions, which indicate the state of the data required for the procedure to be applicable; (ii) the parts of the data that the procedure modifies; and (iii) postconditions, which the data satisfy once the procedure has been applied.

In this paper we introduce the basic building blocks and basic results for the proposed framework for assessing achievability of data-quality constraints. The contributions include formalizing the notion of (sequences of) data-transforming procedures, and characterizing instances that are outcomes of applying (sequences of) procedures over other instances. We also illustrate our design choices by discussing ways to encode important database tasks in the proposed framework, including data migration, data cleaning, and schema updates.

One of the advantages of our framework is its generality, as it can be used to encode multiple operations not only on relational data, but on semistructured or even unstructured text data. This very generality implies that to be able to reason about the properties of our framework, one needs to first instantiate some of its most abstract components. As a proof of concept, we provide an in-depth analysis of applications of (sequences of) procedures over relational data, where the procedures are stated using standard relational-data formalisms. We show that properties concerning outcomes of procedures are in general (not surprisingly) undecidable. At the same time, we achieve decidability and tractability for broad classes of realistic procedures that we illustrate with examples. While the formalism and results presented in this paper have practical implications on their own, we see them mainly as prerequisites that need to be understood before one can formalize the notion of assembling procedures in the context of and in response to a user task. We conclude this paper by showing how the proposed framework can be used to formally define the following problem: Given a set of procedures and data-quality criteria, is it possible to assemble a sequence of procedures such that the data outcome is assured to satisfy this criteria?

Related Work: Researchers have been working on eliciting and defining specific dimensions of quality of the data — [31] provides a widely acknowledged standard; please also see [17, 24]. At the general level, high-quality data can be regarded as being fit for their intended use [22, 30, 10] — that is, both context and use (i.e., tasks to be performed) need to be taken into account when evaluating and improving the quality of data. Recent efforts have put an emphasis on information-quality policies and strategies; please see [22] for a groundbreaking set of generic information-quality policies that structure decisions on information. An information-quality improvement cycle, consisting of the define-measure-analyze-improve steps for data quality, has been proposed in [29]. Work has also been done [23] in the direction of integrating process measures with information-quality measures. Our work is different from these lines of research in that in our framework we assume that task-oriented data-quality requirements are already given in the form of constraints that need to be satisfied on the data, and that procedures for improving data quality are also specified and available. Under these assumptions, our goal is to determine whether the procedures can be used to achieve satisfaction of the quality requirements on the data.

The work [15] introduces a unified framework covering formalizations and approaches for a range of problems in data extraction, cleaning, repair, and integration, and also supplies an excellent survey of related work in these areas. More recent work on data cleaning includes [7, 6, 21, 28, 26]. The research area of business processes [11] studies the environment in which data are generated and transformed, including processes, users of data, and goals of using the data. In this context, researchers have studied automating composition of services into business processes, see, e.g., [3, 4, 5], under the assumption that the assembly needs to follow a predefined workflow of executions of actions (services). In contrast, in our work, the predefined part is the specified constraints that the data should satisfy after the assembled workflow of available procedures has been applied to it. Another line of work [12, 8] is closer to reasoning about static properties of business process workflows. That work is different from ours in that it does not pursue the goal of constructing new workflows.

Outline of the paper: Section 2 contains basic definitions used in the paper. Section 3 introduces the proposed framework, and Section 4 discusses encoding tasks such as data exchange, data cleaning, and alter-table statements. The formal results are presented in Section 5. Section 6 concludes with a discussion of future challenges and opportunities.

2 Preliminaries

Schemas and instances: Assume a countably infinite set of attribute names , , and a countably infinite set (disjoint from ) of relation names , , . A relational schema is a partial function with finite domain, which associates a finite set of attributes to a finite set of relation symbols. If is defined, we say that is in . A schema extends a schema if for each relation such that is defined, we have that . That is, extends if assigns at least the same attributes to all relations in . We also assume a total order over all attribute names in order to be able to switch between the named and unnamed perspectives for instances and queries.

We define instances so that it is possible to switch between the named and unnamed perspectives. Assume a countably infinite set of domain values (disjoint from both and ). Following [1], an instance of schema assigns to each relation in , where , a set of named tuples, each of which is a function of the form , representing the tuples in . (We use to denote the element of a tuple corresponding to the attribute .) By using the order over attributes, we can alternatively view as an unnamed tuple, corresponding to the sequence , with . Thus, we can also view an instance as an assignment of sets of unnamed tuples (or just tuples) . In general, when we know all attribute names for a relation, we use the unnamed perspective, but when the set of attributes is not clear, we resort to the named perspective. For the sake of readability, we abuse notation and use to denote the schema of an instance .

For instances and over a schema , we write if for each relation symbol in we have that . Furthermore, if and are instances over respective schemas and , we denote by the instance over schema such that if is in both and , if is only in , and if is only in . Finally, an instance extends an instance if (1) extends , and (2) for each relation in with assigned attributes and for each tuple in , there is a tuple in such that for each . Intuitively, extends if the projection of over the schema of is contained in .

Conjunctive queries: Since our goal is for queries to be applicable to different schemas, we adopt a named perspective on queries. A named atom is an expression of the form , where is a relation name, each is an attribute name, and each is a variable. We say that the variables mentioned by such an atom are , and the attributes mentioned by it are . A conjunctive query (CQ) is an expression of the form , where and are tuples of variables and is a conjunction of named atoms that use the variables in and .

A named atom is compatible with schema if . A CQ is compatible with if all its named atoms are compatible. Given a named atom , an instance of a schema that is compatible with the atom, and an assignment of values to variables, we say that satisfy if there is a tuple matching values with on attributes in in the sense that for each . (Under the unnamed perspective we would require a tuple in such that its projection over attributes is precisely the tuple .) The usual semantics of conjunctive queries now follows, extending the notion of assignments in the usual way. Finally, given a conjunctive query that is compatible with , the evaluation of over is the set of all the tuples such that satisfy .

We also need to specify queries that extract all tuples stored in a given relation, regardless of the schema, as is done in SQL with the query SELECT * FROM R. To be able to do this, we also use what we call total queries, which, as we do not need to know the arity of , are simply constructs of the form , for a relation name . A total query of this form is compatible with a schema if is defined, and the evaluation of this query over an instance over a compatible schema is simply the set of tuples .

Data Constraints: Most of our data constraints can be captured by tuple-generating dependencies (tgds), which are expressions of the form , for conjunctive queries and , and by equality-generating dependencies (egds), which are expressions of the form , for a conjunctive query and variables in . As usual, for readability we sometimes omit the universal quantifiers of tgds and egds. An instance satisfies a set of tgds and egds, written , if (1) the schema of is compatible with each conjunctive query in each dependency in , and (2) every assignment such that can be extended into an assignment such that .

A tgd is full if it does not use existentially quantified variables on the right-hand side. A set of tgds is full if each tgd in is full. is acyclic if an acyclic graph is formed by representing each relation mentioned in a tgd in as a node and by adding an edge from node to if a tgd in mentions on the left-hand side and on the right-hand side.

Structure Constraints: Structure constraints are used to specify that schemas need to contain a certain relation or certain attributes. A structure constraint is a formula of the form or , where is a relation symbol, is a tuple of attributes, and is a symbol not in or intended to function as a wildcard. A schema satisfies a structure constraint , denoted by , if is defined, and each attribute in belongs to The schema satisfies the constraint if is defined. For an instance over a schema and a set of tgds, egds, and structure constraints, we write if satisfies each data constraint in and satisfies each structure constraint in .

3 Procedures

In this section we formalize the notion of procedures that transform data. We view procedures as black boxes, and assume no knowledge of or control over their inner workings. Our reasoning about procedures is based on two properties: an input condition, or precondition on the state of the data that must hold for a procedure to be applicable, and an output condition, or postcondition on the state of the data that must hold after the application.

Example 1

Consider again the medical example discussed in the introduction, with a schema having two relations: LocVisits, holding information about emergency-room visits in a geographical area, and EVisits, holding visit information for an individual emergency room in a particular location. Suppose we know that a procedure is available that migrates the data from EVisits to LocVisits. We do not know how the procedure works, but we do know that once it has been applied, all tuples in EVisits also appear in LocVisits. In other words, this procedure can be described by the following information:

Precondition: The schema has relations LocVisits and EVisits, each with attributes facility, patInsur and timestp (standing for facility ID, patient insurance ID, and timestamp).

Postcondition: Every tuple from EVisits is in LocVisits.

Scope and safety guarantees: To rule out procedures that, for example, delete all the tuples from the database, we must be assured that our procedure only modifies the relation LocVisits, and that it preserves all the tuples present in LocVisits before the application of the procedure. We shall soon see how to encode these guarantees into our framework.

Suppose that after a while, the requirements of one of the partner agencies of the organization impose an additional requirement: Relation LocVisits should also contain information about the age of the patients. Suppose the organization also has a relation Patients, where the patient age is recorded in attribute age, together with facility and patientId. To migrate the patient ages into LocVisits, one needs the following steps: First add the attribute age to LocVisists, and then update this table so that the patient ages are as recorded in Patients. We observe that all the procedures involved in this operation can be captured using the same framework of preconditions, postconditions, and scope/safety guarantees that we used to capture the data-migration procedure.

3.1 Formal Definition

We define procedures with respect to a class of constraints and a class of queries.

Definition 1

A procedure over and is a tuple , where

• Scope is a set of structure constraints that defines the scope (i.e., relations and attributes) in which the procedure acts;

• and are constraints in that describe the pre- and postconditions of the procedure, respectively; and

• is a set of queries in that serve as a safety guarantee for the procedure.

Example 2

Let us return to the procedure outlined in Example 1, where the intention was to define migration of data from relation EVisits into LocVisits. In our formalism, we describe this procedure as follows.

Scope: Since the procedure migrates tuples into LocVisits, the scope of the procedure is just this relation. This is described using the structure constraint .

: We use the structure constraints and LocVisits , to ensure that the database has the correct attributes.

: The postcondition comprises the tgd

 EVisits(facility:x,patInsur:y,timestp:z)→LocVisits(facility:x,patInsur:y,% timestp:z).

That is to say, after the procedure has been applied, the projection of EVisits over facility, patInsur and timestp is a subset of the respective projection of LocVisits.

: We can add safety guarantees in terms of queries that need to be preserved when the procedure is applied. In this case, since we do not want the procedure to delete anything that was stored in LocVisits before the migration, we add the safety constraint , whose intent is to to state that all answers to this query on LocVisits that are present in the database before the application of the procedure must be preserved. We formalize this intuition when giving the semantics below.

3.2 Semantics

Formalizing the semantics of procedures requires additional notation. Given a set of structure constraints and a schema , we denote by the conjunctive query that, intuitively, is meant to retrieve the projection of the entire database over all relations and attributes not mentioned in . Formally, includes a conjunct for each relation in but not mentioned in , where and are fresh variables. In addition, if some constraint in mentions a relation in , but no constraint in is of the form , then also includes a conjunct , where is the set of all the attributes in that are not mentioned in any constraint in , and are again fresh variables. For example, consider a schema with relations , , and , where has attributes and , has attributes , and , and has and . Further, consider the set with a single constraint . Then is the query . Note that is unique up to the renaming of variables and order of conjuncts.

A procedure is applicable on an instance over schema if (1) The query and each query in are compatible with both and , and (2) satisfy the preconditions . We can now proceed with the semantics of procedures.

Definition 2

Let be an instance over a schema . An instance over schema is a possible outcome of applying over the instance and schema if the following holds:

1. is applicable on .

2. .

3. The answers of the query do not change: .

4. The answers of each query in over are preserved: .

In the definition, we state the schemas of instances and explicitly, to reinforce the fact that schemas may change during the application of procedures. However, most of the time the schema can be understood from the instance, so we normally just say that an instance is a possible outcome of (even if the schemas of and are different). Let us also recall that we use to denote the schema of an instance .

Example 3 (Example 2 continued)

Recall the procedure defined in Example 2. Consider the instance over the schema with relations EVisits and LocVisits, each with attributes facility, patInsur, and timestp, as shown in Figure 1 (a). Note first that is indeed applicable on . When applying the procedure over , we know from Scope that the only relation whose content can change is LocVisits, while EVisits (or more precisely, the projection of EVisits over patInsur, facility and timestp) is the same across all possible outcomes. Furthermore, we know from that in all possible outcomes the projection of EVisits over attributes facility, patInsur, and timestp must be the same as the projection of LocVisits over the same attributes. Finally, from we know that the projection of LocVisits over these three attributes must be preserved.

Perhaps the most obvious possible outcome of applying over is that of the instance in Figure 1 (b), corresponding to the outcome where the tuple in EVisits that is not yet in LocVisits is migrated into this last relation. However, since we assume no control over the actions performed by the procedure , it may well be that it is also migrating data from a different relation that we are not aware of, producing an outcome whose relation EVisits remains the same as in and , but where LocVisits has additional tuples, as depicted in Figure 1 (c). Moreover, it may also be the case that the procedure alters the schema of LocVisits, adding an extra attribute age, importing the information from an unknown source, as shown in Figure 1 (d).

As we have seen in this example, in general the number of possible outcomes (and even the number of possible schemas) that result after a procedure is executed is infinite. For this reason, we are generally more interested in properties shared by all possible outcomes, which motivates the following definitions.

Definition 3

The outcome set of applying a procedure to is defined as the set.

 outcomeP(I)={I′∣I′ is a possible % outcome of applying P to I}.\lx@notefootnoteRecallthattheschemaofinstances$I′$isnotnecessarilythesameasthatof$I$.

4 Defining Common Database Tasks as Procedures

We now show additional examples of defining common database tasks as procedures within our framework. We show that data exchange, alter-table statements, and data cleaning can all be accommodated by the framework, and provide additional examples in Appendix A. It is worth noticing that in our first three examples we use only structure constraints, tgds, and egds as pre- and postconditions, and that our safe queries are all conjunctive queries. The last example calls for extending the language used to define procedures.

4.1 Data Exchange

We have already seen an example of specifying data-migration tasks as black-box procedures. However, a more detailed discussion will allow us to illustrate some of the basic properties of our framework. Following the notation introduced by Fagin et al. in [14], the most basic instance of the data-exchange problem considers a source schema , a target schema , and a set of dependencies that define how data from the source schema are to be mapped to the target schema. The dependencies in are usually tgds whose left-hand side is compatible with , and the right-hand side is compatible with . The data-exchange problem is as follows: Given a source instance , compute a target instance so that satisfies all the dependencies in . Instances with this property are called solutions for under .

In encapsulating this task as a black box within our framework, we assume that the target and source schemas are part of the same schema. (Alternatively, one can define procedures working over different databases.) Let be as above. We construct the procedure , where

• contains an atom for each relation on the right-hand side of a tgd in ;

• contains a structure constraint for each query of the form on the left-hand side of a tuple-generating dependency in ;

• is the set of all the tgds in ; and

• is the conjunction of all the atoms appearing in any tgd in .

By the semantics of procedures, it is not difficult to conclude that, for every pair of instances and over and , respectively, we have that is a solution for if and only if the instance over the schema is a possible outcome of applying over . We can make this statement much more general, as the set of all possible outcomes essentially corresponds to the set of solutions of the data-exchange setting.

Proposition 1

An instance over schema is a possible outcome of applying over if and only if is a solution for under .

4.2 Alter Table Statements

In our framework, procedures can be defined to work over more than one schema, as long as the schemas satisfy the necessary input and compatibility conditions. This is inspired by SQL, where statements such as INSERT INTO R (SELECT * FROM S) would be executable over any schema, as long as the relations and have the same types of attributes in the same order. Thus, it seems logical to allow procedures that alter the schema of the existing database. To do so, we use structure constraints, as shown in the following example.

Example 4

Recall from Example 1 that, due to a change in the requirements, we now need to add the attribute age to the schema of LocVisits. In general, we capture alter table statements by procedures without scope, used only to alter the schema of the outcomes, so that it would satisfy the structural postconditions of procedures. In this case, we model a procedure that adds age to the schema of LocVisits with the procedure , where and are empty (if there is no scope, then the database does not change modulo adding attributes, so we do not include any safety guarantees), is the stucture constraint , stating that the relation exists in the schema, and is the structure constraint , stating that LocVisits now has an age attribute. Note that the instance in Figure 1(d) with EVisits as in in Figure 1(b), is actually a possible outcome of applying over instance ; the part of the instance given by the schema of does not change, but we do add an extra attribute age to LocVisits, and we cannot really control the values of the newly added attribute.

We remark that the empty scope in guarantees that no relations or attributes are deleted when applying this procedure. This happens because must be compatible with the schema of all outcomes. However, nothing prevents us from adding extra attributes on top of age. This decision to use the open-world assumption on schemas reflects the understanding of procedures as black boxes, which we can execute but not control in other ways.

4.3 Data Cleaning

Data cleaning is a frequent and important task within database systems (see e.g., [15]). The most simple cleaning scenario one could envision is when we have a relation whose attribute values are deemed incorrect or incomplete, and it is desirable to provide the correct values. There are, in general, multiple ways to do this; here we consider just a few of them.

The first possibility is to assume that we have the correct values in another relation, and to use this other relation to provide the correct values for . Consider an example.

Example 5

Consider again the schema from Example 1. Recall that in Example 4 we added the attribute age to the schema of LocVisits. The problem is that we have no control over the newly added values of age. (If the procedure was a SQL alter-table statement, then the age column would be filled with nulls.) However, another relation, Patients, associates an age value with each pair of (facility, patInsur) values; all we need to do now is to copy the appropriate age value into each tuple in LocVisits. To this end, we specify the procedure , which copies the values of age from Patients into LocVisits, using the values of facility and patInsur as a reference.

: We use the constraint , so that the only piece of the database the procedure can alter is age in the relation LocVisits.

: The preconditions are the structure constraints and , plus the fact that the values of facility and patInsur need to determine the values of age in the Patients relation, specified with the dependency . Note that in this case we do not actually need the structure constraints in Patients, because they are implicit in the dependencies (they need to be compatible with the schema), but we keep them for clarity.

: The postcondition is the constraint Alternatively, if we know that all the pairs from Patients are in LocVisits (which can be required with a precondition), we can specify the same postcondition via .

: Same as before, no guarantees are needed.

As desired, in all the outcomes of the value of the age attribute in LocVisits is the same as in the corresponding tuple (if it exists) in Patients with the same facility and patInsur values. But then again, the procedure might modify the schema of some relations, or might even create auxiliary relations in the database in the process. What we gain is that this procedure will work regardless of the shape of relations LocVisits and Patients, as long as the schemas satisfy the compatibility and structure constraints.

In the above example we used a known auxiliary relation to clean the values of age in LocVisits. Alternatively, we could define a more general procedure that would, for instance, only remove nulls from LocVisits, without controlling which values end up replacing these nulls. In order to state this procedure, let us augment the language of tgds with an auxiliary predicate (for constant) with a single attribute val, which is to take the role of the NOT NULL constraint in SQL: It is true only for the non-null values in .

Example 6

Let us now define a procedure that simply replaces all null values of the attribute age in relation LocVisits with non-null values.

: The scope is again , just as in the previous example.

: In contrast with the procedure of the previous example, this procedure is light on preconditions: We only need relation LocVisits to be present and have the age attribute.

: The postcondition states that the attribute age of LocVisits no longer has null values. To express this, we use the auxiliary predicate , and define the constraint , which states that no value in the attribute age in LocVisits is null.

: Since we only want to eliminate null values, we also include the safety query , so that we preserve all the non-null values of age (with the correct facility and patInsur attached to these ages).

5 Basic Computational Tasks for Relational Procedures

In this section we study some formal properties of our procedure-centric framework, with the intent of showing how the proposed framework can be used as a toolbox for reasoning about sequences of database procedures. We focus on what we call relational procedures, where the sets of pre- and postconditions are given by tgds, egds, or structure constraints, and safety queries can be conjunctive or total queries. While there clearly are interesting classes of procedures that do not fit into this special case in the proposed framework, we remark that relational procedures are general enough to account for a wide range of relational operations on data, including the examples in the previous section.

5.1 Applicability

In the proposed framework we focus on transformations of data sets given by sequences of procedures. Because we treat procedures as black boxes, the only description we have of the results of these transformations is that they ought to satisfy the output constraints of the procedures. In this situation, how can one guarantee that all the procedures will be applicable? Suppose that, for instance, we wish to apply procedures and to an instance in sequential order: First , then . The problem is that, since output constraints do not fully determine the outcome of after applying , we cannot immediately guarantee that this outcome is an instance that satisfies the preconditions of .

Given that the set of possible outcomes is in general infinite, our focus is on guaranteeing that any possible outcome of applying over will satisfy the preconditions of . To formalize this intuition, we need to extend the notion of outcome to a set of instances. We define the outcome of applying a procedure to a set of instances as

 outcomeP(I)=⋃I∈IoutcomeP(I),

the union of the outcomes of all the instances in . Furthermore, for a sequence of procedures we define the outcome of applying to an instance as the set

 outcomeP1,…,Pn(I)=outcomePn(% outcomePn−1(⋯(outcomeP1(I))⋯)).

We can now define the first problem of interest:

Applicability: Input: A sequence of procedures, a schema ; Question: Is it true that, for any arbitrary instance over , procedure can be applied to each instance in the set ?

It is not difficult to show that the Applicability problem is intimately related to the problem of implication of dependencies, defined as follows: Given a set of dependencies and an additional dependency , is it true that all the instances that satisfy also satisfy — that is, does imply ? Indeed, consider a class of constraints for which the implication problem is known to be undecidable. Then one can easily show that the applicability problem is also undecidable for those procedures whose pre- and postconditions are in : Intuitively, if we let be a procedure with a set of postconditions, and a procedure with a dependency as a precondition, then it is not difficult to come up with proper scopes and safety queries so that satisfies for every instance over schema if and only if is true in all instances that satisfy . However, as the following proposition shows, the applicability problem is undecidable already for very simple procedures, and even when we consider the data-complexity view of the problem, that is when we fix the procedure and take a particular input instance.

Proposition 2

There are fixed procedures and that only use tgds for their constraints, and such that the following problem is undecidable. Given an instance over schema , is it true that all the instances in satisfy the preconditions of ?

The proof of Proposition 2 is by reduction from the embedding problem for finite semigroups, shown to be undecidable in [20].

There are several lines of work aiming to identify practical classes of constraints for which the implication problem is decidable, and all that work can be applied in our framework. However, we opt for a stronger restriction: Since all of our examples so far use only structure constraints as preconditions, for the remainder of the paper we focus on procedures whose preconditions comprise structure constraints. In this setting, we have the following result.

Proposition 3

Applicability is in polynomial time for sequences of relational procedures whose preconditions contain only structure constraints.

5.2 Representing the Outcome Set

We have seen that deciding properties about the outcome set of a sequence of procedures (or even of a single procedure) can be a complicated task. One of the reasons is that procedures do not completely define their outcomes: We do not really know what will be the outcome of applying a sequence of procedures to an instance , we just know it will be an instance from the collection . This collection may well be of infinite size, but can it still be represented finitely? The database-theory community has developed multiple formalisms for representing sets of database instances, from notions of tables with incomplete information [16] to knowledge bases (see, e.g., [9]). In this section we study the possibility of representing outcomes of (sequences of) procedures by means of incomplete tables, along the lines of [16]. We also discuss some negative results about representing outcomes of general procedures in systems such as knowledge bases, but leave a more detailed study in this respect for future work.

The first observation we make is that allowing arbitrary tgds in procedures introduces problems with management of sequences of procedures. Essentially, any means of representing the outcome of a sequence of procedures needs to be so powerful that even deciding whether it is nonempty is going to be undecidable.

Proposition 4

There is a fixed procedure that does not use preconditions and only use tgds in their postconditions, such that the following problem is undecidable: Given an instance , is the set nonempty?

The reason we view Proposition 4 as a negative result is because it rules out the possibility of using any “reasonable” representation system. Indeed, one would expect that deciding non-emptiness should be decidable in any reasonable way of representing infinite sets of instances. Proposition 4 is probably not surprising, since reasoning about tgds in general is known to be a hard problem. Perhaps more interestingly, in our case one can show that the above fact remains true even if one allows only acyclic tgds, which are arguably one of the most well-behaved classes of dependencies in the literature. The idea behind the proof is that one can simulate cyclic tgds via procedures with only acyclic tgds and no scope.

Example 7

Consider two procedures and , where , with , , and ; has empty scope, preconditions, and safety queries, and has postconditions . Let be an instance over the schema with relations and , both with attribute . By definition, the set of possible outcomes of over are all instances that extend and satisfy the dependency . However, the set corresponds to all instances that extend and satisfy both dependencies and (In other words, we can use to filter out all those instances where ). Intuitively, this happens because the outcome set of applying over any instance not satisfying is empty, and we define as the union of each set , for each instance .

By applying the idea of this example to the proof of Proposition 4, we show:

Proposition 5

Proposition 4 holds for procedures and that only use acyclic tgds.

Since acyclic tgds do not help, we may consider restrictions to full tgds. Still, even this is not enough for making the non-emptiness problem decidable, once one adds the possibility of having schema constraints in procedures.

Proposition 6

There exists a sequence of procedures such that the following problem is undecidable: Given an instance , is the set nonempty? Here, all the procedures have no preconditions, and have postconditions built using acyclic sets of full tgds and schema constraints (and nothing else).

Propositions 4 and 6 tell us that restricting the classes of dependencies allowed in procedures may not be enough to guarantee outcomes that can be represented by reasonable systems. Thus, we now adapt a different strategy: We restrict interplay between the postconditions of procedures, their scope, and their safety queries. Let us define two important classes of procedures that will be used thoroughout this section.

We say that procedure is safe scope if the following holds:

• is a set of tgds where no relation in the right-hand side of a tgd appears also in the left-hand side of a tgd;

• The set Scope contains exactly one constraint for each relation that appears on the right-hand side of a tgd in ; and

• The query corresponds to , that is it binds precisely all the relations in the scope of .

(For instance, procedure in Example 2 is essentially a procedure with safe scope, as it can easily be transformed into one by slightly altering the safety query.)

We also define a class of procedures that ensure that certain attributes or relations be present in the schema. Formally, we say that a procedure is an alter-schema procedure if the following holds:

• Both Scope and are empty; and

• is a set of structure constraints.

Let be the class of all the procedures that are either safe scope or alter-schema procedures. The class allows for practically-oriented interplay between migration and schema-alteration tasks and, as we will see in this section, is more manageable from the point of view of reasoning tasks, in terms of complexity. To begin with, deciding the non-emptiness of a sequence of procedures is essentially tractable for :

Theorem 5.1

The problem of deciding, given an instance and a sequence of procedures in , whether , is in exponential time, and is polynomial if the number of procedures is fixed.

The proof of Theorem 5.1 is based on the idea of chasing instances with the dependencies in the procedures, and of adding attributes to schemas as dictated by the alter-schema procedures. As usual, to enable the chase we need to introduce labeled nulls in instances (see, e.g., [16, 14]), and composing procedures calls for extending the techniques of [2] to enable chase instances that already have null values. Using the enhanced approach, one can show that the result of the chase is a good over-approximation of the outcome of a sequence of procedures. To state this result, we introduce conditional tables [16].

Let be an infinite set of null values that is disjoint from the set of domain values . A naive instance over schema assigns a finite relation to each relation symbol in of arity . Conditional instances extend naive instances by attaching conditions over the tuples. Formally, an element-condition is a positive boolean combination of formulas of the form and , where and . Then, a conditional instance over schema assigns to each -ary relation symbol in a pair , where and assigns an element-condition to each tuple . A conditional instance is positive if none of the element-conditions in its tuples uses inequalities (of the form ).

To define the semantics, let be the set of all nulls in any tuple in or in an element-condition used in . Given a substitution , let be the extension of to a substitution that is the identity on . We say that satisfies an element-condition , and write , if for every equality in it is the case that and for every inequality we have that . Furthermore, we define the set as and . Finally, for a conditional instance , is the instance that assigns to each relation in the schema.

The set of instances represented by , denoted by , is defined as there is a substitution such that extends . Note that the instances in this definition could have potentially bigger schemas than , or, in other words, we consider the set to contain instances over any schema extending the schema of .

The next result states that conditional instances are good over-approximations for the outcomes of sequences of procedures. More interestingly, these approximations preserve the minimal instances of outcomes. To put this formally, we say that an instance in a set of instances is minimal if there is no instance , and such that extends .

Proposition 7

Let be an instance and be a sequence of procedures in . Then either or one can construct, in exponential time (or polynomial if is fixed), a conditional instance such that

• ; and

• If is a minimal instance in , then is also minimal in .

We remark that this proposition can be extended to include procedures defined only with egds, at the cost of a much more technical presentation. While having an approximation with these properties is useful for reasoning tasks related to CQ answering, or in general checking any criterion that is closed under extensions of instances, there is still the question of whether one can find any reasonable class of properties whose entire outcomes can be represented by these tables. However, as the following example shows, this does not appears to be possible, unless one is restricted to sequences of procedures almost without interaction with each other (see an example in appendix A.2).

Example 8

Consider a procedure with safe scope, where , is empty, and . Consider now the conditional instance over the schema with relations and , both with attribute , given by and . One could be tempted to say that is itself a representation of the set , and indeed and share their only minimal instance (essentially, the instance given by ). However, the open-world assumption behind allows for instances that do not satisfy , whereas all outcomes in must satisfy . One can in fact generalize this argument to show that conditional instances are not enough to fully represent outcome sets.

Example 8 suggest that one could perhaps combine conditional instances with a knowledge base, to allow for a complete representation of the outcome set of sequences of safe procedures. However, this would require studying the interplay of these two different types of representation systems, a line of work which is interesting in its own right.

6 Future Work and Opportunities

In this paper, we introduced basic building blocks for a proposed framework for assessing achievability of data-quality constraints. We demonstrated that the framework is general enough to represent nontrivial database tasks, and exhibited realistic classes of procedures for which reasoning tasks can be tractable. Our next step is to address the problem of assessing achievability of constraints, which can be formalized as follows. Let be a boolean query, a set of procedures, and an instance over a schema . Then we say that can be readied for using if there is a sequence of procedures (possibly empty and possibly with repetitions) from such that is compatible with and true in each instance in the set . (If the latter conditions involving are true on , then we say that is ready for .) We are confident that this problem is decidable for sets of procedures in , and we plan on looking into more expressive fragments.

The proposed framework presents opportunities for several directions of further research. One line of work would involve understanding how to represent outcomes of sequences of procedures, or how to obtain good approximations of outcomes of more expressive classes of procedures. To solve this problem, we would need a better understanding of the interplay between conditional tables and knowledge bases, which would be interesting in its own right.

We also believe that our framework is general enough to allow reasoning on other data paradigms, or even across various different data paradigms. Our black-box abstraction could, for example, offer an effective way to reason about procedures involving unstructured text data, or even data transformations using machine-learning tools, as long as one can obtain some guarantees on the data outcomes of these tools.

References

• [1] S. Abiteboul, R. Hull, and V. Vianu. Foundations of databases: the logical level. Addison-Wesley Longman Publishing Co., Inc., 1995.
• [2] M. Arenas, J. Pérez, and J. Reutter. Data exchange beyond complete data. Journal of the ACM, 60(4):28, 2013.
• [3] D. Berardi, D. Calvanese, G. De Giacomo, R. Hull, M. Lenzerini, and M. Mecella. Modeling data & processes for service specifications in Colombo. In Proceedings of the Open Interop. Workshop on Enterprise Modelling and Ontologies for Interoperability, 2005.
• [4] D. Berardi, D. Calvanese, G. De Giacomo, R. Hull, and M. Mecella. Automatic composition of web services in Colombo. In Proceedings of the Thirteenth Italian Symposium on Advanced Database Systems (SEBD), pages 8–15, 2005.
• [5] D. Berardi, D. Calvanese, G. De Giacomo, M. Lenzerini, and M. Mecella. Automatic service composition based on behavioral descriptions. Int. J. Cooperative Inf. Syst., 14(4):333–376, 2005.
• [6] M. Bergman, T. Milo, S. Novgorodov, and W. Tan. QOCO: A query oriented data cleaning system with oracles. PVLDB, 8(12):1900–1911, 2015.
• [7] M. Bergman, T. Milo, S. Novgorodov, and W. C. Tan. Query-oriented data cleaning with oracles. In Proceedings of ACM SIGMOD, pages 1199–1214, 2015.
• [8] K. Bhattacharya, C. Gerede, R. Hull, R. Liu, and J. Su. Towards formal analysis of artifact-centric business process models. In International Conference on Business Process Management, pages 288–304. Springer, 2007.
• [9] M. Bienvenu and M. Ortiz. Ontology-mediated query answering with data-tractable description logics. In Reasoning Web International Summer School, pages 218–307. Springer, 2015.
• [10] I. Chengalur-Smith and H. Pazer. Decision complacency, consensus and consistency in the presence of data quality information. In Information Quality, pages 88–101, 1998.
• [11] D. Deutch and T. Milo. Business Processes: A Database Perspective. Synthesis Lectures on Data Management. Morgan & Claypool Publishers, 2012.
• [12] A. Deutsch, R. Hull, F. Patrizi, and V. Vianu. Automatic verification of data-centric business processes. In Proceedings of the 12th International Conference on Database Theory, pages 252–267. ACM, 2009.
• [13] B. Devlin. Data Warehouse: From Architecture to Implementation. Addison-Wesley Longman, 1996.
• [14] R. Fagin, P. G. Kolaitis, R. J. Miller, and L. Popa. Data exchange: semantics and query answering. Theoretical Computer Science, 336(1):89–124, 2005.
• [15] W. Fan and F. Geerts. Foundations of Data Quality Management. Synthesis Lectures on Data Management. Morgan & Claypool Publishers, 2012.
• [16] T. Imieliński and W. Lipski Jr. Incomplete information in relational databases. Journal of the ACM (JACM), 31(4):761–791, 1984.
• [17] B. Kahn, D. Strong, and R. Wang. Information quality benchmarks: Product and service performance. Comm. ACM, 45(4ve):184–192, 2002.
• [18] R. Kimball and J. Caserta. The Data Warehouse ETL Toolkit: Practical Techniques for Extracting, Cleaning, Conforming, and Delivering Data. Wiley, 2004.
• [19] D. Kindig and G. Stoddart. What is population health? Am. J. Public Health, 93(3):380–383, 2003.
• [20] P. G. Kolaitis, J. Panttaja, and W.-C. Tan. The complexity of data exchange. In Proceedings of the twenty-fifth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, pages 30–39, 2006.
• [21] S. Krishnan, J. Wang, M. J. Franklin, K. Goldberg, T. Kraska, T. Milo, and E. Wu. SampleClean: Fast and reliable analytics on dirty data. IEEE Data Eng. Bull., 38(3):59–75, 2015.
• [22] Y. Lee, L. Pipino, R. Wang, and J. Funk. Journey to Data Quality. MIT Press, 2009.
• [23] Y. Lee and D. Strong. Knowing-why about data processes and data quality. Journal of Management Information Systems, 20(3):13–39, 2004.
• [24] Y. Lee, D. Strong, B. Kahn, and R. Wang. AIMQ: a methodology for information quality assessment. Information & Management, 40:133–146, 2002.
• [25] A. McAlearney. Population Health Management: Strategies to Improve Outcomes. Health Administration Press, 2003.
• [26] W. Nutt, S. Paramonov, and O. Savkovic. Implementing query completeness reasoning. In ACM CIKM, pages 733–742, 2015.
• [27] Premier, Inc.: Alliance of healthcare providers on a mission to transform healthcare, 2016.
• [28] S. Razniewski, F. Korn, W. Nutt, and D. Srivastava. Identifying the extent of completeness of query answers over partially complete databases. In ACM SIGMOD, pages 561–576, 2015.
• [29] R. Wang. A product perspective on total data quality management. Comm. ACM, 41(2), 1998.
• [30] R. Wang, Y. Lee, L. Pipino, and D. Strong. Manage your information as product: The keystone to quality information. MIT Sloan Management Review, 39(4):95–105, 1998.
• [31] R. Wang and D. Strong. Beyond accuracy: What data quality means to data consumers. Journal of Management Information Systems, 12(4):5–34, 1996.
• [32] T. Young. Population Health: Concepts and Methods. Oxford University Press, 1998.

a.1 SQL data-modification statements

We show how to encode arbitrary SQL INSERT and DELETE statements as procedures. Due to dealing with arbitrary SQL, we relax the constraints and queries that we use.

INSERT statements: Consider a SQL statement of the form INSERT INTO S Q, where is a relational-algebra query.

Scope: Not surprisingly, the scope of the procedure is the relation S.

: The precondition for the procedure is that all the relation names and attributes mentioned in must be present in the database.

: The postcondition is stated using the constraint . (Note that the SQL statement only works when Q and S have the same arity.)

: Since we are inserting tuples, we need the query S to be preserved.

Alternatively, we can specify an INSERT statement of the form INSERT INTO S VALUES , with a tuple of values. In order to formalize this, we just need to change the postcondition of the procedure to .

DELETE statements: Consider a SQL statement of the form DELETE FROM S WHERE , in which is a boolean combination of conditions.

Scope: The scope is the relation S, as expected.

: The precondition for the procedure is that all the relations and attributes mentioned in must be present in the database.

: There are no postconditions in this query.

: Let be the query SELECT * FROM S WHERE C. Then the safety query is , which preserves only those tuples that are not to be deleted.

a.2 Representing sequences of procedures

As we mentioned, one possibility to obtain a full representation of sequences of procedures is to further restrict the scope of sequences of safe procedures. To be more precise, let us say that a sequence of procedures is a safe sequence if (1) each is either an alter-schema procedure or a safe-scope procedure that only uses tgds, and (2) for every , none of the atoms on the right-hand side of a tgd in is part of the scope of any , with . Intuitively, safe sequences of procedures restrict the possibility of sequencing data-migration tasks when the result of one migration is used as an input for the next one.

A conditional instance with scope is a pair , where is a conditional instance and Rel is a set of relation names. The set of instances represented by , denoted again by , now contains all the instances in where, for each relation that is not in Rel, the projection of over the attributes of in is the same as . (In other words, we allow extra tuples only in the relations whose symbols are in the set Rel.) It is now not difficult to show the following result.

Proposition 8

For each instance and each safe sequence of procedures one can construct a conditional instance with scope such that .

Appendix B Proofs and Intermediate Results

b.1 Proof of Proposition 2

The reduction is from the complement of the embedding problem for finite semigroups, shown to be undecidable in [20], and it is itself an adaptation of the proof of Theorem 7.2 in [2]. Note that, since we do not intend to add attributes nor relations in the procedures of this proof, we can drop the named definition of queries, treating CQs now as normal conjunctions of relational atoms.

The embedding problem for finite semigroups problem can be stated as follows. Consider a pair , where is a finite set and is a partial associative function. We say that A is embeddable in a finite semigroup is there exists such that and is a total associative function. The embedding problem for finite semigroups is to decide whether an arbitrary is embeddable in a finite semigroup.

Consider the schema . The idea of the proof is as follows. We use relation to encode binary functions, so that a tuple in intuitively corresponds to saying that , for a function . Using our procedure we shall mandate that the binary function encoded in is total and associative. We then encode into our input instance : the procedure will then try to embed into a semigroup whose function is total.

In order to construct the procedures, we first specify the following set of tgds. First we add to a set of dependencies ensuring that all elements in the relation are collected into :

 G(x,u,v) → D(x) (1) G(u,x,v) → D(x) (2) G(u,v,x) → D(x) (3)

The next set verifies that is total and associative:

 D(x)∧D(y) → ∃zG(x,y,z) (4) G(x,y,u)∧G(u,z,v)∧G(y,z,w) → G(x,w,v) (5)

Next we include dependencies that are intended to force relation to be an equivalence relation over all elements in the domain of .

 D(x) → E(x,x) (6) E(x,y) → E(y,x) (7) E(x,y)∧E(y,z) → E(x,z) (8)

The next set of dependencies we add ensure that represents a function that is consistent with the equivalence relation .

 G(x,y,z)∧E(x,x′)∧E(y,y′)∧E(z,z′) → G(x′,y′,z′) (9) G(x,y,z)∧G(x′,y′,z′)∧E(x,x′)∧E(y,y′) → E(z,z′) (10)

The final tgd in serves us to collect possible errors when trying to embed . The intuition for this tgd will be made clear once we outline the reduction, but the idea is to state that the relation now contains everything that is in , as long as a certain property holds on relations , and .

 E(x,y)∧C(u,x)∧C(v,y)∧N(u,v)∧R(w) → F(w) (11)

Let then consists of tgds (1)-(11). We construct fixed procedures and as follows.

procedure :

: The scope of consists of relations , , and , which corresponds to the constraints .

: There are no preconditions for this procedure.

: The postconditions are the tgds in .

: This query ensures that no information is deleted from all of , and (and thus that no attributes are added to them): .

procedure :

: The scope of is empty.

: The precondition for this constraint is .

: The are no postconditions.

: There is no safety query.

Note that does not really do anything, it is only there to check that is contained in . We can now state the reduction. On input , where , we construct an instance given by the following interpretations:

• contains the pair for each (that is, for each element of );

• contains the triple for each such that ;

• and are empty, while contains a single element not in ;

• contains the pair for each ; and

• contains the pair for each , and .

Let us now show is embeddable in a finite semigroup if and only if contains an instance such that does not satisfy the precondition of procedure .

() Assume that is embeddable in a finite semigroup, say the semigroup , where is total. Let be the instance such that is the identity over , and contains a pair